Hey everyone!
I’m working with a very large JSON file structured according to schema.org standards. While the general structure of the objects is mostly consistent, there are some variations—certain keys appear in some objects but not in others.
I exposed a small portion of the JSON to ChatGPT, and it generated a parser that works well for the repeating parts. However, since the model's context window is smaller than the file, it couldn't account for the structural variations in the parts it was never shown.
My goal is to have Cursor access the entire JSON structure so it can generate a parser that works in 100% of cases, handling all valid variations within the schema.
Given the file size, what are the best practices for refining this approach? Some key questions I have:
- How can I split the JSON so the LLM can process it without losing context?
- What’s an efficient way to identify and unify all key variations across objects?
- Are there any recommended tools or approaches to validate and extract a complete schema from the data? (A rough sketch of what I mean is below this list.)
- Which LLM performs best for this type of task? Has anyone tested Claude, Gemini, GPT, DeepSeek, or others for parsing large and complex JSON structures?
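For reference, this is the kind of pre-processing I've been considering but haven't fully tested: stream through the file once, record which keys are optional, and infer a single JSON Schema to paste into the prompt instead of the raw data. It assumes the file is one top-level JSON array (named `data.json` here) and uses the third-party `ijson` and `genson` packages:

```python
import ijson                      # streaming parser, avoids loading the whole file
from genson import SchemaBuilder  # incrementally infers a JSON Schema from samples
from collections import Counter

builder = SchemaBuilder()
key_counts = Counter()  # how many objects contain each top-level key
total = 0

# Assumes the file is a single top-level array of schema.org objects.
with open("data.json", "rb") as f:
    # use_float=True so numbers come back as floats genson can handle
    for obj in ijson.items(f, "item", use_float=True):
        total += 1
        key_counts.update(obj.keys())
        builder.add_object(obj)  # merge this object's shape into the schema

# Keys that appear in only some objects are exactly the variations I need the parser to handle.
optional_keys = sorted(k for k, c in key_counts.items() if c < total)
print(f"{total} objects, {len(key_counts)} distinct top-level keys")
print("keys not present in every object:", optional_keys)

# One JSON Schema covering every variation seen in the data.
print(builder.to_schema())
```

The idea is that the inferred schema plus the list of optional keys should be small enough to fit in the context window, so the LLM sees every variation without ever reading the whole file. I'm not sure whether this is the right direction, hence the questions above.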
I’d really appreciate any insights or suggestions!