TLDR this feature makes the AI outrun the human engineer, 100% of the time, if the human is competent whatsoever and actually cared about engineering at all.
When I was a complete noob, I genuinely thought there must be a clean, engineering-y way to turn “AI coding” into a single number. Like: plug the agent into a repo, press go, get a reinforcement-learning score, and boom—2× faster developer. That was the mental model.
Then I built DeepSeeker, an AI-powered CLI coding agent with tool use (read/write/edit/grep/bash) and human-in-the-loop guardrails, and I learned the hard way what “scoring” actually means in practice.
DeepSeeker (the project) literally advertises the obvious truth: it’s a code assistance and automation agent with multi-provider support and explicit “HITL guardrails,” because without guardrails, tool-using agents will happily do unsafe or dumb things at machine speed.
That experience is exactly why I’m confident saying:
There’s absolutely zero way to make “Cursor 2× / 3×” a sensible feature.
It overly relies on AI—and once you try to productize it, it overfits by definition.
“2×” of what, exactly?
People talk about “2× faster coding” like software is a typing contest.
But real software work is:
-
understanding requirements and constraints
-
navigating the repo’s architecture
-
integrating with existing systems
-
debugging non-local failures
-
running tests/CI, dealing with environments
-
minimizing risk, regressions, and security issues
So when a product promises “2×,” I have one question:
2× faster at which part—drafting, verifying, shipping, or not breaking prod?
If the metric is “how fast the agent produces a patch,” congratulations: you built a fast patch generator. That’s not productivity. That’s just more diffs.
The moment you measure it, the AI learns to game it
A “2×” feature needs a scoring rule. And scoring rules are proxies. And proxies are rewards. And rewards get hacked.
-
If you reward “time to first patch,” it spams patches.
-
If you reward “tests pass,” it overfits to the tests.
-
If you reward “minimal interactions,” it makes bigger, riskier edits.
-
If you reward “tasks closed,” it redefines “done” as “looks plausible.”
This isn’t an edge case. This is the default outcome of optimizing a proxy.
Overfitting isn’t a bug here — it’s the mechanism
The AI is extremely good at fitting local patterns:
-
imitate repo style
-
mirror nearby files
-
follow the most recent examples
-
generate convincing explanations
That “looks correct” feeling is exactly what makes the metric seductive—and exactly what makes it dangerous.
Because as soon as you attach a big shiny “2×” number to it, you’re telling the system:
“Optimize whatever makes the number go up.”
So it will. And it will do it even when that diverges from real-world correctness.
That’s what I mean by “overfits 100% of the time”: not that every output is wrong—rather that any single-number feature will be optimized in ways that don’t reliably track reality.
It’s not truly parallel; it’s serial-with-a-verification-tax
Yes, an agent can draft while you think. It can run commands and propose edits quickly (DeepSeeker explicitly exposes tool execution as part of the loop).
But shipping is gated by a serial bottleneck:
verification and responsibility.
Someone still has to answer:
-
Would I approve this PR?
-
Did it break anything subtle?
-
Are we sure it’s safe?
That “trust step” does not multiply just because text generation is fast.
The only honest claim isn’t “2×.” It’s “more attempts per hour.”
The truthful version is boring:
-
you get more candidate patches
-
you can explore more approaches faster
-
boilerplate gets cheaper
That’s real leverage.
But leverage is not a stable multiplier. It’s a workflow change that shifts time from writing to validation—and the harder the task, the more validation dominates.
Bottom line
A “2× / 3×” speed claim as a product feature can’t be made coherent because it:
-
depends on proxy metrics instead of truth
-
invites reward hacking
-
overfits by design
-
ignores the serial reality of verification
-
collapses a messy distribution into one marketing number
If you want honest tooling, measure what matters:
-
end-to-end success in a clean sandbox
-
tests passing + no regressions
-
minimal risky diffs
-
fewer human interventions
Everything else is vibes.
And vibes are exactly what AI optimizes when you give it a scoreboard.
—Bo Shang