The Core Observation
The user has noticed a substantial behavioral gap between Composer 2 and Claude Opus when handling the same natural-language task in Cursor. The gap is not about raw intelligence on paper; it is about how each model interprets ambiguous intent and commits to externalized work.
The Concrete Example
A single sentence: “Summarize last week’s GitLab activities and write me an email draft.”
- Composer 2’s typical behavior: it gets confused about “where GitLab is” (treating it as a local repo lookup rather than a hosted service requiring API access via PriHelper with a user token), fails to latch onto the right tool chain, and ultimately dumps the email body into the chat as plain text, leaving the user to copy-paste it manually.
- Opus’s typical behavior: it correctly infers that “GitLab activities” means calling the PriHelper GitLab integration with a user token and fetching real events, and that “email draft” means creating an actual draft inside the mailbox system via the appropriate MCP/skill, not generating markdown in chat.
The Deeper Point: Inferring “Definition of Done”
The user’s frustration is not about wording precision. It is about whether the model can read the unstated defaults from a casual sentence:
- The deliverable should land in the right system (mailbox draft, GitLab API result), not in stdout.
- “Done” means the user does not have to copy, paste, or shepherd the artifact further.
- The model should persist past friction (try the skill, try the MCP, retry on failure) rather than fall back to the cheapest valid completion (chat text).
Composer 2 too often takes the cheap path (generate text that looks like an email); Opus more often takes the costly-but-correct path (drive the toolchain to a real side effect).
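The persistence described above can be sketched as an ordered list of delivery routes, each retried before the next is tried, with chat text as the last resort only. This is a minimal illustration, not how either model is implemented: `Outcome`, `deliver`, `flaky_skill`, and `mcp_draft` are hypothetical names, and the two routes merely simulate a failing skill and an MCP call that times out once.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Outcome:
    """Result of trying to deliver an artifact."""
    delivered_via: str    # name of the route that succeeded, or "chat"
    is_side_effect: bool  # True only if the artifact landed in a real system
    attempts: int         # total route invocations made


# A route takes the email body and returns a draft id, or raises on failure.
Route = Callable[[str], str]


def deliver(body: str, routes: Sequence[tuple[str, Route]], retries: int = 2) -> Outcome:
    """Try each route in order, retrying each one, before falling back to chat."""
    attempts = 0
    for name, route in routes:
        for _ in range(retries):
            attempts += 1
            try:
                route(body)
                return Outcome(name, True, attempts)
            except Exception:
                continue  # friction: retry, then move on to the next route
    # Last resort only: the cheap completion the text warns against.
    return Outcome("chat", False, attempts)


# Hypothetical routes that simulate friction.
def flaky_skill(body: str) -> str:
    raise RuntimeError("skill unavailable")


_calls = {"n": 0}


def mcp_draft(body: str) -> str:
    _calls["n"] += 1
    if _calls["n"] == 1:
        raise TimeoutError("first call times out")
    return "draft-1"


result = deliver("Weekly summary...", [("skill", flaky_skill), ("mcp", mcp_draft)])
chat_fallback = deliver("Weekly summary...", [("skill", flaky_skill)])
```

Here `result` records a real side effect reached on the fourth attempt, while `chat_fallback` shows the only case in which printing text would be acceptable: every route has been exhausted.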
Why “Just Write Better Prompts” Is Not the Answer
The user pushes back on the implicit suggestion that they should memorize precise incantations:
- If the user must remember exact phrasings (“use PriHelper skill”, “create a draft via MCP, do not stdout”), then they have become a human router. What is the model for?
- Users cannot reproduce the same precise wording every time. Real usage involves paraphrase, casual phrasing, and varying context. A model that only works under one rigid template is fragile by design.
- Adding more skills and docs does not close the gap. Skills tell the model what to do once it knows which playbook applies. The weakness is mapping a fuzzy sentence to the right playbook, and persisting through tool calls instead of bailing to chat output. Composer 2 fails this mapping under paraphrase even when documentation is present; Opus generalizes more robustly.
The Non-Determinism Caveat
Even if the user types exactly the same characters every time, model output is not bitwise reproducible. Decoding is stochastic; routing, tool selection, verbosity, and the choice between “do the work” vs. “print a result” all vary across runs. So:
- Demanding reproducibility through prompt engineering alone is an illusion.
- Pinning behavior requires structural guardrails (always-on rules, tool-only deliverables, deterministic settings where they exist), not user-side discipline about phrasing.
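A “tool-only deliverable” guardrail of the kind listed above could be checked mechanically after each turn, independent of how the user phrased the request. The sketch below is an assumption-laden illustration: `Turn`, `ToolCall`, `REQUIRED_TOOL`, and the tool name `mailbox.create_draft` are hypothetical and do not correspond to any real Cursor, MCP, or PriHelper API.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str        # hypothetical tool identifier, e.g. "mailbox.create_draft"
    succeeded: bool  # whether the call produced its side effect


@dataclass
class Turn:
    task_kind: str   # e.g. "email_draft", assumed to be classified upstream
    chat_text: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)


# Always-on rule table: task kind -> tool that must have succeeded.
REQUIRED_TOOL = {"email_draft": "mailbox.create_draft"}


def violates_rule(turn: Turn) -> bool:
    """Return True when a governed task ended without its required side effect."""
    required = REQUIRED_TOOL.get(turn.task_kind)
    if required is None:
        return False  # no structural rule applies to this kind of task
    return not any(c.tool == required and c.succeeded for c in turn.tool_calls)


chat_only = Turn("email_draft", chat_text="Subject: Weekly summary ...")
real_draft = Turn("email_draft", tool_calls=[ToolCall("mailbox.create_draft", True)])
failed_draft = Turn("email_draft", tool_calls=[ToolCall("mailbox.create_draft", False)])
```

With this rule in place, `chat_only` and `failed_draft` are flagged and `real_draft` passes, so the check depends on what actually happened rather than on the prompt's wording.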
What the User Actually Wants
A model that, from an ordinary, possibly imprecise sentence, will:
- Infer the real deliverable and where it should live (mailbox, ticket, or file system, not chat).
- Discover and use the available tooling (skills, MCPs, integrations) without the user having to name it explicitly.
- Persist through failure rather than collapsing to “here is some text, you handle it.”
- Generalize across paraphrases, because users will never phrase the same task identically twice.
The answer to “Composer 2 keeps missing this” is therefore not “tell the user to write better prompts.” It is to route higher-stakes, multi-tool, externally-deliverable tasks to a stronger model, and to encode the non-negotiable defaults (e.g., “email tasks must produce a real draft, never stdout”) as always-applied rules rather than as magic phrases the user must memorize.
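The routing half of that remedy can be sketched as a small decision function. Everything here is an assumption for illustration: `TaskProfile`, `pick_model`, the threshold, and the string labels `"opus"` and `"composer-2"` are hypothetical, and how a task's profile is obtained is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    tools_needed: int           # distinct integrations the task touches
    external_deliverable: bool  # must the result land outside the chat?


def pick_model(task: TaskProfile) -> str:
    """Send multi-tool or externally-delivered work to the stronger model."""
    if task.external_deliverable or task.tools_needed > 1:
        return "opus"
    return "composer-2"


# The example task: GitLab fetch plus mailbox draft, delivered outside chat.
weekly_email = TaskProfile(tools_needed=2, external_deliverable=True)
# A purely in-chat task with no integrations.
explain_snippet = TaskProfile(tools_needed=0, external_deliverable=False)
```

Under these assumptions `pick_model(weekly_email)` selects the stronger model and `pick_model(explain_snippet)` stays on the cheaper one, so the decision is made structurally instead of by the user's choice of words.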
The gap between Composer 2 and Opus, in the user’s experience, is not vocabulary—it is intent inference depth and willingness to drive a tool chain to its real-world conclusion. No amount of additional documentation or precise wording from the user side can fully compensate for that, because (a) users won’t phrase things identically, and (b) the model won’t respond identically even if they did. The fix lives in model selection and structural defaults, not in turning the user into a prompt librarian.
