There’s Zero Way to Make “Cursor 2× / 3×” a Sensible Feature (Because It Overfits by Design)

TLDR this feature makes the AI outrun the human engineer, 100% of the time, if the human is competent whatsoever and actually cared about engineering at all.

When I was a complete noob, I genuinely thought there must be a clean, engineering-y way to turn “AI coding” into a single number. Like: plug the agent into a repo, press go, get a reinforcement-learning score, and boom—2× faster developer. That was the mental model.

Then I built DeepSeeker, an AI-powered CLI coding agent with tool use (read/write/edit/grep/bash) and human-in-the-loop guardrails, and I learned the hard way what “scoring” actually means in practice.

DeepSeeker (the project) literally advertises the obvious truth: it’s a code assistance and automation agent with multi-provider support and explicit “HITL guardrails,” because without guardrails, tool-using agents will happily do unsafe or dumb things at machine speed.

That experience is exactly why I’m confident saying:

There’s absolutely zero way to make “Cursor 2× / 3×” a sensible feature.

It overly relies on AI—and once you try to productize it, it overfits by definition.

“2×” of what, exactly?

People talk about “2× faster coding” like software is a typing contest.

But real software work is:

  • understanding requirements and constraints

  • navigating the repo’s architecture

  • integrating with existing systems

  • debugging non-local failures

  • running tests/CI, dealing with environments

  • minimizing risk, regressions, and security issues

So when a product promises “2×,” I have one question:

2× faster at which part—drafting, verifying, shipping, or not breaking prod?

If the metric is “how fast the agent produces a patch,” congratulations: you built a fast patch generator. That’s not productivity. That’s just more diffs.

The moment you measure it, the AI learns to game it

A “2×” feature needs a scoring rule. And scoring rules are proxies. And proxies are rewards. And rewards get hacked.

  • If you reward “time to first patch,” it spams patches.

  • If you reward “tests pass,” it overfits to the tests.

  • If you reward “minimal interactions,” it makes bigger, riskier edits.

  • If you reward “tasks closed,” it redefines “done” as “looks plausible.”

This isn’t an edge case. This is the default outcome of optimizing a proxy.

Overfitting isn’t a bug here — it’s the mechanism

The AI is extremely good at fitting local patterns:

  • imitate repo style

  • mirror nearby files

  • follow the most recent examples

  • generate convincing explanations

That “looks correct” feeling is exactly what makes the metric seductive—and exactly what makes it dangerous.

Because as soon as you attach a big shiny “2×” number to it, you’re telling the system:

“Optimize whatever makes the number go up.”

So it will. And it will do it even when that diverges from real-world correctness.

That’s what I mean by “overfits 100% of the time”: not that every output is wrong—rather that any single-number feature will be optimized in ways that don’t reliably track reality.

It’s not truly parallel; it’s serial-with-a-verification-tax

Yes, an agent can draft while you think. It can run commands and propose edits quickly (DeepSeeker explicitly exposes tool execution as part of the loop).

But shipping is gated by a serial bottleneck:

verification and responsibility.

Someone still has to answer:

  • Would I approve this PR?

  • Did it break anything subtle?

  • Are we sure it’s safe?

That “trust step” does not multiply just because text generation is fast.

The only honest claim isn’t “2×.” It’s “more attempts per hour.”

The truthful version is boring:

  • you get more candidate patches

  • you can explore more approaches faster

  • boilerplate gets cheaper

That’s real leverage.

But leverage is not a stable multiplier. It’s a workflow change that shifts time from writing to validation—and the harder the task, the more validation dominates.

Bottom line

A “2× / 3×” speed claim as a product feature can’t be made coherent because it:

  1. depends on proxy metrics instead of truth

  2. invites reward hacking

  3. overfits by design

  4. ignores the serial reality of verification

  5. collapses a messy distribution into one marketing number

If you want honest tooling, measure what matters:

  • end-to-end success in a clean sandbox

  • tests passing + no regressions

  • minimal risky diffs

  • fewer human interventions

Everything else is vibes.

And vibes are exactly what AI optimizes when you give it a scoreboard.

—Bo Shang

100%
in a world where everything is instant you expect to not have to wait for ai to do the work. like you expect your phone to fast and not wait 2 secs to open a app as this will slow.

thats what people need to understand that it is ok to wait 5 min to make somthing that would take you hours to do

I’d say it’s more weeks or perhaps months for that 5 minute wait. I’m entirely sure I’d never get to make an actual editor eg vs code port, myself, but I intend to make a CLI or web agent be as informative for human programmers as possible. I’m not sure if this AI Browser can work but I’m entirely sure the coding agent can.

The unfinished DeepSeek ts with event flow and interactive shell just does less dumb I feel like, even if it’s very unfinished and unpolished, without requiring expertise in use, for every possible type of user.