[Technical Question] AI Code Tracking API's Core Data Mechanism: Real-time Metadata or Backend Diff?

Hi Cursor Team,

We are very interested in the “AI Code Tracking API” feature available in the Enterprise version, as it addresses the critical problem of code attribution.

To better understand the data flow and implementation architecture of this function, we’d like to clarify a core question: How is your attribution data generated?

We’ve primarily envisioned two possible technical paths and would appreciate your confirmation or correction:

Possibility A: Backend Diff / Analysis Mechanism

  • Does this mechanism involve Cursor’s backend servers storing all (or key) historical response data from AI chats, edits, and fixits?

  • Then, after a developer runs git push (or on a periodic scan), does your backend service pull the latest commit content and perform large-scale diffing and matching (e.g., hash-based, AST-structural, or fuzzy/semantic) against this “AI response history database”?

  • Is it through this “post-commit” process that you identify which code in the repository originated from AI? (We sketch our mental model of this path below.)
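To make Possibility A concrete, here is a minimal TypeScript sketch of how we imagine such a post-commit matcher could work. Everything in it (the snippet store, the token-overlap similarity, the 0.85 threshold) is our own assumption for illustration, not anything Cursor has confirmed:

```typescript
import { createHash } from "node:crypto";

// Hypothetical record of an AI response block kept in a history database.
interface AiSnippet {
  id: string;
  text: string;
}

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Crude token-overlap similarity, standing in for whatever hash/AST/fuzzy
// matcher a real backend might use.
function similarity(a: string, b: string): number {
  const ta = new Set(a.split(/\W+/).filter(Boolean));
  const tb = new Set(b.split(/\W+/).filter(Boolean));
  const shared = [...ta].filter((t) => tb.has(t)).length;
  return shared / Math.max(ta.size, tb.size, 1);
}

// Post-commit attribution pass: exact hash match first, fuzzy fallback second.
function attribute(committedBlock: string, history: AiSnippet[]) {
  for (const snippet of history) {
    if (sha256(snippet.text) === sha256(committedBlock)) {
      return { match: "exact" as const, id: snippet.id };
    }
    if (similarity(snippet.text, committedBlock) >= 0.85) {
      return { match: "fuzzy" as const, id: snippet.id };
    }
  }
  return { match: "none" as const }; // the "statistical omission" case
}
```

Note how a one-character typo fix would break the exact hash match but usually survive the fuzzy fallback, which is exactly the robustness question we raise further down.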

Possibility B: Editor Real-time Metadata Mechanism

  • Alternatively, is this a client-side (editor) driven mechanism?

  • Does the Cursor IDE itself, at the moment AI-generated code is inserted into a file, attach a tracking tag (e.g., {source: "ai", model: "gpt-4", block_id: "uuid-..."}) to those specific lines or blocks, either in-memory or via a local lightweight DB?

  • Subsequently, when the developer executes git commit (or push), does the Cursor client intercept this action, scan the staged files for these tracking tags, and then upload this “Attribution Report”, keyed to the commit_hash, to your central API server? (A sketch of this flow follows.)
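Again purely as an illustration, here is a minimal sketch of the Possibility B flow as we imagine it; the tag shape, the in-memory store, and the report format are all hypothetical:

```typescript
// Per-range tracking tag, mirroring the {source, model, block_id} idea above.
interface TrackedRange {
  startLine: number;
  endLine: number;
  source: "ai" | "human";
  model?: string;
  blockId?: string;
}

interface AttributionReport {
  commitHash: string;
  file: string;
  ranges: TrackedRange[];
}

// In-memory tag store the editor would update as code is inserted or edited.
const tagStore = new Map<string, TrackedRange[]>();

// Called at the moment an AI-generated block is inserted into a file.
function onAiInsert(file: string, startLine: number, endLine: number, model: string): void {
  const ranges = tagStore.get(file) ?? [];
  ranges.push({ startLine, endLine, source: "ai", model, blockId: crypto.randomUUID() });
  tagStore.set(file, ranges);
}

// Called on commit: bundle the accumulated tags into per-file reports,
// keyed by the commit hash, ready to upload to a central API.
function buildReports(commitHash: string, stagedFiles: string[]): AttributionReport[] {
  return stagedFiles.map((file) => ({
    commitHash,
    file,
    ranges: tagStore.get(file) ?? [],
  }));
}
```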


Follow-up Questions (Based on the Mechanism):

We are asking whether it’s A or B because it directly impacts how the core challenge of “Edit Dilution” is handled:

  1. If using (A) Backend Diff Mechanism:

    • Does this imply that the mechanism relies primarily on matching pure, unmodified AI code blocks?

    • If an AI code block is slightly modified by a developer (e.g., fixing a typo, renaming a variable), would the matching algorithm (whether hash-based or fuzzy) be likely to fail, leading to a statistical omission?

  2. If using (B) Editor Metadata Mechanism:

    • When a developer modifies a line of code already tagged as source: "ai" in the IDE (even just one character), how does the attribution of that tag change?

    • Does it immediately flip to source: "human" (i.e., a “zero-tolerance” attribution model), or is attribution judged against a modification threshold (e.g., a diff percentage relative to the original AI-generated block)? Both policies are sketched just below.
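To make question 2 concrete, here is how we picture the two candidate policies, again in hypothetical TypeScript; the character-level Levenshtein distance and the 20% cutoff are values we invented for the example:

```typescript
// Plain dynamic-programming Levenshtein distance between two strings.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Zero-tolerance policy: any human edit at all flips the block to "human".
const zeroTolerance = (humanEdited: boolean): "ai" | "human" =>
  humanEdited ? "human" : "ai";

// Threshold policy: the block stays "ai" while the diff ratio against the
// original AI-generated text remains under the cutoff (20% assumed here).
function thresholdPolicy(
  originalAiBlock: string,
  currentBlock: string,
  cutoff = 0.2,
): "ai" | "human" {
  const ratio =
    levenshtein(originalAiBlock, currentBlock) / Math.max(originalAiBlock.length, 1);
  return ratio <= cutoff ? "ai" : "human";
}
```

With these toy rules, renaming one variable flips attribution under zero-tolerance but keeps it as AI under the threshold policy, since the diff ratio stays tiny.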

Our Request:

We are very curious which architectural path (A, B, or perhaps a C we haven’t thought of) Cursor has chosen, and how you are solving the “edit dilution” problem that arises from it.

Understanding this core mechanism is crucial for us to evaluate the data accuracy and robustness of this API in a real-world development workflow.

We look forward to your professional insights.

Hi @lary, and thank you for the detailed post. The data comes from client-side tracking, as the user’s action determines whether they accepted the code written by the AI. Subsequent edits by a human are also tracked in the same way.

Thank you for confirming it’s client-side tracking based on user action. That is very clear and helpful.

Just to be 100% certain we are on the same page: Does this mean the attribution logic relies entirely on this client-side data, and there is no separate backend process that performs diffs or comparisons against the repository content afterward?

Assuming that is correct, our main follow-up question is about the attribution logic for those “subsequent edits” that you mentioned are also tracked:

When the client tracks a human edit on an ‘AI-generated’ line, how is that classified?

  • (A) Does the system immediately re-classify that entire line as ‘Human’ (a “zero-tolerance” model)?

  • (B) Or does it maintain an ‘AI-modified’ or ‘AI-assisted’ status? (A minimal sketch of this distinction follows.)
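In code terms, the distinction we are asking about might look like this; the status names and the transition rule are purely illustrative, not anything from Cursor’s documentation:

```typescript
type Attribution = "ai" | "ai-modified" | "human";

// One possible transition rule: the first human edit demotes "ai" to
// "ai-modified" rather than flipping it straight to "human".
function reclassify(current: Attribution, humanEdited: boolean): Attribution {
  if (!humanEdited) return current;
  return current === "ai" ? "ai-modified" : current;
}
```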

We are trying to understand the ‘purity’ of the AI attribution data and how “edit dilution” is handled by the client-side tracker.

Thanks again for the insights.