How does docs crawling work?

We plan to add a visualization of what the model actually sees when using docs in a future release. We'll also add better docs management and auditing of which text is shown for each page.

For now, we use an HTML-to-markdown parser, then do n-gram deduplication across the crawled pages (to strip out boilerplate like navbars). Finally, we use some simple chunking heuristics to break each page down into multiple ~500-token chunks. A rough sketch of the dedup and chunking stages is below.
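
Here's a minimal sketch of how those two stages could work. The n-gram size, the duplicate-page threshold, and the chars-per-token estimate are all assumptions for illustration, not the actual implementation:

```ts
const NGRAM_SIZE = 5;          // assumed shingle length, in words
const DUP_PAGE_THRESHOLD = 3;  // assumed: an n-gram on 3+ pages is boilerplate
const CHARS_PER_TOKEN = 4;     // rough heuristic: ~4 characters per token
const CHUNK_TOKENS = 500;

// Sliding word-level n-grams over a piece of text.
function ngrams(text: string, n = NGRAM_SIZE): string[] {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i + n <= words.length; i++) {
    out.push(words.slice(i, i + n).join(" "));
  }
  return out;
}

// Count, for each n-gram, how many distinct pages it appears on.
function ngramPageCounts(pages: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const page of pages) {
    for (const g of new Set(ngrams(page))) {
      counts.set(g, (counts.get(g) ?? 0) + 1);
    }
  }
  return counts;
}

// Drop lines whose n-grams mostly recur across pages (navbars, footers, etc.).
function dedupPage(page: string, counts: Map<string, number>): string {
  return page
    .split("\n")
    .filter((line) => {
      const grams = ngrams(line);
      if (grams.length === 0) return true; // keep lines too short to shingle
      const dupes = grams.filter((g) => (counts.get(g) ?? 0) >= DUP_PAGE_THRESHOLD);
      return dupes.length / grams.length < 0.5; // keep if under half is boilerplate
    })
    .join("\n");
}

// Greedy chunking into ~500-token pieces, splitting on paragraph boundaries.
// A single paragraph longer than the budget becomes its own oversized chunk.
function chunkPage(markdown: string, maxTokens = CHUNK_TOKENS): string[] {
  const paragraphs = markdown.split(/\n{2,}/);
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? current + "\n\n" + p : p;
    if (candidate.length / CHARS_PER_TOKEN > maxTokens && current) {
      chunks.push(current);
      current = p;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```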

The library is node-html-markdown, if you'd like to see what the raw markdown looks like for a given webpage.
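
For example, you can convert a fetched page yourself with the library's `NodeHtmlMarkdown.translate` entry point (the URL here is just a placeholder):

```ts
import { NodeHtmlMarkdown } from "node-html-markdown";

// Fetch any docs page and convert its HTML to markdown.
const html = await (await fetch("https://example.com/docs")).text();
console.log(NodeHtmlMarkdown.translate(html));
```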
