TL;DR
- Use xml-sitemaps.com to create a makeshift sitemap of the documentation’s homepage/entrypoint.
- IMPORTANT: At the end of the page you’ll find “view HTML sitemap”. Use that as the Entrypoint of the documentation. (Cursor doesn’t recognize XML sitemaps, which most websites publish officially and which would’ve made this a lot easier.)
- The prefix is the “repeating” part of the official documentation URL, NOT xml-sitemaps.com.
Full version:
Some documentation sites don’t come with an …/index.html, which would otherwise make a great Entrypoint for Cursor.
To make matters worse, some websites like docs.langflow.org do not use a traditional hierarchy in their URLs. (By hierarchy I mean links in the example.com/page/sub-page/.. format.)
For example, in the case of Langflow, the URL structure is example.com/Topic-SubTopic-SubSubTopic.., which seems to be an unfamiliar format for Cursor’s bot (at least at the time of writing this).
Because of this, my attempts at indexing Langflow’s documentation had Cursor forcing itself to look for links in the hierarchical format I mentioned earlier. So it ended up hallucinating links that don’t exist and indexing the 404 pages it got back for those hierarchical URLs. Yet it hallucinated answers with confidence. Classic AI.
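To make the difference between the two URL styles concrete, here is a minimal sketch that counts path segments and flags a site as "flat" when no page sits deeper than one segment. The function names `path_depth` and `looks_flat` and the example.com URLs are made up for illustration, and the one-segment threshold is just my heuristic. It says nothing about how Cursor's crawler actually decides what to follow.

```python
from urllib.parse import urlparse


def path_depth(url: str) -> int:
    """Count the non-empty path segments of a URL."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def looks_flat(urls: list[str]) -> bool:
    """Heuristic (assumption): a site is 'flat' if every page is at most one segment deep."""
    return all(path_depth(url) <= 1 for url in urls)


# Hypothetical URLs in the two formats described above.
hierarchical = [
    "https://example.com/page",
    "https://example.com/page/sub-page",
]
flat = [
    "https://example.com/Topic-SubTopic",
    "https://example.com/Topic-SubTopic-SubSubTopic",
]

print(looks_flat(hierarchical))  # False
print(looks_flat(flat))          # True
```

If a docs site comes back as flat, that is a hint that a crawler guessing at sub-directories will not find anything, and that handing it an explicit list of links is the safer route.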
A sitemap can fix this issue, but when there isn’t one, you gotta make your own. Most websites keep an XML sitemap for the search engines (example.com/sitemap.xml).
But Cursor doesn’t work with XML links. That’s where xml-sitemaps.com comes in. It not only crawls the given link to make a sitemap, it also gives an HTML version of the result, which can be used as an Entrypoint with Cursor.
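If the site already publishes a sitemap.xml, a do-it-yourself alternative is to turn that XML into a plain HTML page of links. The sketch below is not what xml-sitemaps.com does internally. It only reads the standard `<urlset>/<url>/<loc>` layout from the sitemaps.org protocol using Python's standard library. The function name `sitemap_xml_to_html` and the sample URLs are hypothetical, and I haven't verified that Cursor indexes a page generated this way. You would also have to host the result somewhere Cursor can reach.

```python
import html
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_xml_to_html(xml_text: str, title: str = "Sitemap") -> str:
    """Convert a <urlset> sitemap into a minimal HTML page with one link per URL."""
    root = ET.fromstring(xml_text)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    items = "\n".join(
        f'<li><a href="{html.escape(u, quote=True)}">{html.escape(u)}</a></li>'
        for u in urls
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        f"<body><ul>\n{items}\n</ul></body></html>"
    )


# Hypothetical sitemap content with flat-style URLs.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/Topic-SubTopic</loc></url>
  <url><loc>https://example.com/Topic-SubTopic-SubSubTopic</loc></url>
</urlset>"""

page = sitemap_xml_to_html(SAMPLE)
print(page)
```

This only handles a single `<urlset>` file. A sitemap index that points to several child sitemaps would need each child fetched and converted as well.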
Edit: the hierarchical URL structure is called a path-based or directory-style URL. The one with the hyphens can be called a flat/delimited URL, although GPT-4 said it’s not a recognized term.