Summary
When the cloud-side ConnectRPC stream backing agent.send() is terminated by the
server with an end-stream error (observed: [unauthenticated] Error), the SDK:
- Does not throw from the `sdkRun.stream()` async iterator;
- Resolves `wait()` with a bare `{ id, status: "error", model, durationMs }` object (no `error` field, no `code`, no `message`);
- Lets the underlying `ConnectError` escape as a Node-level `process.on('unhandledRejection')` event, with a stack pointing at `@connectrpc/connect`’s `endStreamFromJson` / `async-iterable.js`.
Critically, the API key is valid and was never the problem — see “Scope of
failure” below. The failure is isolated to one specific long-lived Agent
handle obtained via Agent.resume(...). The result is that the caller has no
programmatic way to learn that the run failed and no way to learn that it was
the resumed handle’s internal state — not the credentials — that is stuck.
Environment
| Component | Version |
|---|---|
| `@cursor/sdk` | 1.0.13 |
| `@connectrpc/connect` (transitive) | 1.7.0 |
| Node.js | v25.9.0 |
| OS | macOS (darwin, arm64) |
Runtime topology: long-lived host process holds many Agent handles obtained
via Agent.resume(agent_id, { apiKey, ... }), one per user conversation.
Multiple agent.send().stream() + .wait() cycles are issued against the same
handle over hours/days. Each handle is auto-disposed after an idle TTL
(30 min) and re-resumed on demand.
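For readers unfamiliar with this topology, here is a minimal sketch of an idle-TTL handle cache. Everything in it is hypothetical: `IdleHandleCache`, `acquire`, `sweep`, and `evict` are illustration names, and `resume` / `dispose` are injected callbacks standing in for whatever the host actually calls (for example a wrapper around `Agent.resume`). Time is passed in explicitly so the behavior is easy to reason about.

```typescript
// Hypothetical idle-TTL cache. `resume` and `dispose` are injected callbacks,
// not @cursor/sdk APIs; H is whatever the resume callback returns.
interface CacheEntry<H> {
  handle: H;
  lastUsedMs: number;
}

class IdleHandleCache<H> {
  private readonly entries = new Map<string, CacheEntry<H>>();
  private readonly ttlMs: number;
  private readonly resume: (agentId: string) => H;
  private readonly dispose: (handle: H) => void;

  constructor(ttlMs: number, resume: (agentId: string) => H, dispose: (handle: H) => void) {
    this.ttlMs = ttlMs;
    this.resume = resume;
    this.dispose = dispose;
  }

  // Return the cached handle and refresh its idle timer, or resume a new one.
  acquire(agentId: string, nowMs: number): H {
    const entry = this.entries.get(agentId);
    if (entry) {
      entry.lastUsedMs = nowMs;
      return entry.handle;
    }
    const handle = this.resume(agentId);
    this.entries.set(agentId, { handle, lastUsedMs: nowMs });
    return handle;
  }

  // Dispose every handle that has been idle for at least ttlMs.
  sweep(nowMs: number): string[] {
    const evicted: string[] = [];
    for (const [agentId, entry] of this.entries) {
      if (nowMs - entry.lastUsedMs >= this.ttlMs) {
        this.dispose(entry.handle);
        this.entries.delete(agentId);
        evicted.push(agentId);
      }
    }
    return evicted;
  }

  // Drop one handle immediately, e.g. when it is suspected to be stale.
  evict(agentId: string): boolean {
    const entry = this.entries.get(agentId);
    if (!entry) return false;
    this.dispose(entry.handle);
    return this.entries.delete(agentId);
  }
}
```

In this shape, a handle that keeps receiving traffic is never swept, which is how a single in-memory instance can stay alive across many hours of use.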
Symptom — what we observed
One agent had been alive for ~43 hours, with ~17 successful Agent.resume
cycles and many successful send/stream/wait round trips. Then within a
4-minute window:
T+0 agent.send().stream() iterates normally, emits exactly one event with
{ status: "running" }, then exactly one event with { status: "error" }.
The async iterator completes WITHOUT throwing.
T+~0 await sdkRun.wait() resolves with
{ id: "run-…", status: "error", model: "composer-2", durationMs: … }
— no `error`, no `code`, no `message` field.
T+~0 process.on('unhandledRejection') fires with the actual cause:
ConnectError: [unauthenticated] Error
at errorFromJson (@connectrpc/connect/.../protocol-connect/error-json.js:53:19)
at endStreamFromJson (.../protocol-connect/end-stream.js:64:11)
at Object.parse (.../protocol-connect/end-stream.js:118:24)
at .../protocol/async-iterable.js:399:75
at Generator.next (<anonymous>)
at resume (.../protocol/async-iterable.js:28:44)
at fulfill (.../protocol/async-iterable.js:30:31)
at process.processTicksAndRejections (node:internal/process/task_queues:104:5)
The exact same fingerprint repeated for two consecutive runs on the same agent 4 minutes apart, then the agent stayed permanently broken until our host process was restarted.
Scope of failure — what we ruled out
This is the part that makes us confident the issue is in the SDK’s handling of a single long-lived resumed handle, not in our credentials or the Cursor cloud as a whole:
- API key is valid and was never rotated / revoked. The exact same key is used by every other agent in the same host process, and they all continued to work normally throughout the incident.
- Other concurrent agents using the same API key kept succeeding during the failure window. We have logs of successful send/stream/wait cycles on sibling agents at the same wall-clock seconds as the failing agent’s unhandled rejections.
- A fresh `Agent.create(...)` against the same workspace + same API key works immediately, with no token rotation, no config change, no network change. Once we restarted the host (which discarded the broken `Agent` handle), the next `Agent.create` succeeded on first try and the conversation continued normally on a brand-new agent.
- The same `Agent.resume(agent_id, …)` of the broken agent_id issued from the new host process also succeeded; i.e. the agent_id itself is resumable, and it’s the in-memory `Agent` instance that had become stuck.
- Not rate limit: `RateLimitError` would have been thrown synchronously and counted in metrics; nothing of the sort.
- Not network: no socket errors, no other Connect calls in the same process were failing.
This leaves only one plausible cause: some internal state in the SDK’s Agent instance (a credential cache, a Connect session, a transport, a streaming reader) went stale on this particular resumed handle after long uptime, and the SDK had no path to detect/refresh/surface that.
Reproduction sketch
We don’t have a deterministic local repro because we can’t synthesize a UNAUTHENTICATED end-stream from Cursor’s cloud at will. However, the failure mode should be reproducible by intercepting the SDK’s ConnectRPC transport and making the server stream emit an end-stream trailer with code: "unauthenticated" after the first data frame:
// fake server response
data: {"status":"running"}\n\n
// end-stream trailer with error
data: {"metadata":{},"error":{"code":"unauthenticated","message":"Error"}}\n\n
Expected: sdkRun.stream() rejects with a ConnectError AND/OR sdkRun.wait() resolves with a result whose error field carries the ConnectError details.
Actual: the iterator completes silently, wait() returns the bare result, and the ConnectError surfaces only via process.on('unhandledRejection').
Why we believe this is an SDK bug
The stack ends inside @connectrpc/connect’s async-iterable transform: async-iterable.js:399 → Generator.next → resume / fulfill. That code path runs as a background microtask attached to the underlying iterator. If the SDK consumes the iterator (e.g. for stream()) but does not await the iterator’s .return() / .throw() to completion, a rejection produced after the last value is delivered will escape into the process scheduler instead of being re-thrown into the consumer’s for await loop.
Concretely, we suspect one of the following patterns inside the SDK:
// Pattern A — fire-and-forget tail iteration
for await (const ev of iter) yield ev;
// underlying iter is then GC’d; if iter.next() was pre-queued and rejects,
// the rejection has no handler.
// Pattern B — duplicate consumers
const reader = iter[Symbol.asyncIterator]();
// reader.next() called from two paths; only one is awaited.
// Pattern C — bare promise propagation
iter.next().then(handle, reportError);
// reportError doesn’t translate into rejection of stream()/wait()'s promise.
Whatever the exact internal shape, the user-visible contract should be: any ConnectRPC error that terminates the underlying stream MUST surface either through stream() throwing, or through wait() resolving with an error field. Right now neither happens.
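To make the suspected mechanism concrete, here is a small self-contained sketch that reproduces the same fingerprint without any SDK or ConnectRPC code. All names (`fakeStream`, `leakyAdapter`, `safeAdapter`, `consume`, `runDemo`) are hypothetical, and the leaky adapter is only one way such an orphaned read could arise; we are not claiming it matches the SDK's actual implementation. The fake stream yields two events and then throws, standing in for an end-stream trailer error.

```typescript
// Hypothetical stand-in for the transport stream: two events, then an
// end-stream error. No real @connectrpc/connect or @cursor/sdk code is used.
type RunEvent = { status: "running" | "error" };

async function* fakeStream(): AsyncGenerator<RunEvent> {
  yield { status: "running" };
  yield { status: "error" };
  throw new Error("[unauthenticated] Error");
}

// Leaky adapter: pre-queues the next read, then stops at the first "error"
// event without ever awaiting the read that is already in flight.
async function* leakyAdapter(source: AsyncIterable<RunEvent>): AsyncGenerator<RunEvent> {
  const reader = source[Symbol.asyncIterator]();
  let pending = reader.next();
  while (true) {
    const result = await pending;
    if (result.done) return;
    pending = reader.next(); // in flight; rejects when the source throws
    yield result.value;
    if (result.value.status === "error") return; // `pending` is orphaned here
  }
}

// Safe adapter: every read is awaited, so the error reaches the consumer.
async function* safeAdapter(source: AsyncIterable<RunEvent>): AsyncGenerator<RunEvent> {
  for await (const ev of source) yield ev;
}

// Drain an iterator and report whether it completed or threw.
async function consume(iter: AsyncIterable<RunEvent>): Promise<string> {
  try {
    for await (const _ev of iter) {
      // drain
    }
    return "completed silently";
  } catch (e) {
    return `threw: ${e instanceof Error ? e.message : String(e)}`;
  }
}

async function runDemo() {
  const unhandled: string[] = [];
  const onUnhandled = (reason: unknown) => {
    unhandled.push(String(reason));
  };
  process.on("unhandledRejection", onUnhandled);
  const leaky = await consume(leakyAdapter(fakeStream()));
  // Unhandled rejections are reported after the microtask queue drains,
  // so wait one timer tick before counting them.
  await new Promise((resolve) => setTimeout(resolve, 10));
  const leakyUnhandled = unhandled.length;
  const safe = await consume(safeAdapter(fakeStream()));
  await new Promise((resolve) => setTimeout(resolve, 10));
  process.off("unhandledRejection", onUnhandled);
  return { leaky, leakyUnhandled, safe, totalUnhandled: unhandled.length };
}

const demoResult = runDemo();
demoResult.then((r) => console.log(r));
```

With the leaky adapter the consumer's loop ends normally and the error shows up only in the process-level handler; with the safe adapter the same error is thrown into the consumer's `for await` loop and nothing reaches the process-level handler.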
Expected behavior
- `sdkRun.stream()`’s async iterator throws the `ConnectError` (or a typed `AuthenticationError` subclass) when the end-stream signals an error.
- Or, equivalently, `sdkRun.wait()` resolves with a result that includes the underlying ConnectError, e.g.:

      {
        "id": "run-…",
        "status": "error",
        "model": "composer-2",
        "durationMs": 10262,
        "error": {
          "code": "unauthenticated",
          "message": "Error",
          "category": "auth"
        }
      }

- No `process.on('unhandledRejection')` event should fire for end-stream errors that originated from the user’s own `send()` / `stream()` / `wait()` call chain.
- Given that our “Scope of failure” evidence points at a stale internal `Agent` state rather than bad credentials, please also consider:
  - A way for the consumer to detect “this `Agent` handle is poisoned; dispose and re-resume” without restarting the host process (e.g. an `agent.isHealthy()` probe, or a typed `AgentInstanceStaleError` that instructs the caller to dispose).
  - An internal credential / Connect-session refresh on `Agent.resume(...)` so that long-lived handles can’t drift into a state where the API key is valid but the cached session is rejected.
- Ideally the SDK should map ConnectRPC codes to the existing `CursorSdkError` subclass surface: at minimum `unauthenticated` → `AuthenticationError`, and `unavailable` / `resource_exhausted` → a retriable network/poison classification.
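As an illustration of the mapping we have in mind, here is a minimal sketch that turns an end-stream trailer body (in the shape shown in the reproduction sketch) into the structured `error` field proposed above. The names `ErrorCategory`, `RunErrorInfo`, `classifyConnectCode`, and `errorFromTrailer` are ours and hypothetical; the category taxonomy is an assumption, and nothing here uses the SDK's real `CursorSdkError` classes.

```typescript
// Hypothetical coarse taxonomy; not the SDK's actual error classes.
type ErrorCategory = "auth" | "retriable" | "unknown";

interface RunErrorInfo {
  code: string;
  message: string;
  category: ErrorCategory;
}

// Connect protocol code strings mapped to a category. A Map is used so that
// arbitrary code strings never collide with Object.prototype keys.
const CATEGORY_BY_CODE = new Map<string, ErrorCategory>([
  ["unauthenticated", "auth"],
  ["unavailable", "retriable"],
  ["resource_exhausted", "retriable"],
]);

function classifyConnectCode(code: string): ErrorCategory {
  return CATEGORY_BY_CODE.get(code) ?? "unknown";
}

// Parse an end-stream trailer body and return structured error details,
// or undefined when the trailer carries no usable error object.
function errorFromTrailer(trailerJson: string): RunErrorInfo | undefined {
  const parsed: unknown = JSON.parse(trailerJson);
  if (typeof parsed !== "object" || parsed === null) return undefined;
  const error = (parsed as { error?: unknown }).error;
  if (typeof error !== "object" || error === null) return undefined;
  const { code, message } = error as { code?: unknown; message?: unknown };
  if (typeof code !== "string") return undefined;
  return {
    code,
    message: typeof message === "string" ? message : "",
    category: classifyConnectCode(code),
  };
}
```

A result built this way would let callers branch on `category` instead of string-matching log output.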
Impact
Without these guarantees:
- Callers cannot distinguish “transient auth glitch in one stale handle” from “API key revoked” from “agent state corruption” from “unknown failure”. We are forced to use heuristic behavior counters (≥N bare-ERROR failures within a sliding window → assume “agent poisoned, recreate”), which produces false positives for stale-handle auth and false negatives for genuinely unrecoverable agents.
- Unhandled rejections at process level can crash long-lived services that enable `--unhandled-rejections=strict`. We currently mitigate by installing a global handler that classifies the rejection and logs it, but we have no reliable way to attribute the rejection back to the specific `sdkRun.send()` call that produced it (we have to rely on `AsyncLocalStorage` context, which is fragile).
- Observability/SLO dashboards cannot report auth-failure rate accurately; every `unauthenticated` event currently shows up as “unknown”.
Suggested investigation areas
Without source access, here is what we would look at in the SDK:
- The function that adapts the ConnectRPC `unary` / `serverStreaming` response into the `sdkRun.stream()` async generator: make sure it `await`s the underlying iterator until either a `done: true` value is observed or a `throw` propagates out. Do not allow tail `iter.next()` promises to be pre-queued and orphaned.
- The `sdkRun.wait()` implementation: if it shares state with the streaming reader, ensure that a stream rejection observed after `wait()` started is surfaced as `wait()` rejecting (or as the resolved result’s `error` field).
- ConnectRPC code → `CursorSdkError` subclass mapping. Today the SDK exports `AuthenticationError`, `NetworkError`, `RateLimitError`, etc., but only for top-level / sync paths. Apply the same mapping to errors raised from the streaming end-stream parser.
- `Agent.resume(...)` credential / Connect-session lifecycle. Given that sibling agents on the same API key kept working and a fresh `Agent.create` immediately recovered, please verify that long-lived `Agent` instances periodically refresh whatever underlying session token / transport they hold, or expose a hook for the host to do so.
Downstream mitigation we applied
For reference (we run a Node bridge around @cursor/sdk that exposes a stable HTTP/SSE API to our app servers):
- Added `"unauthenticated"` (lower-cased) to our error-classifier’s `auth` substring set so the process-level `unhandledRejection` is at least tagged correctly in logs.
- Treat the bare `wait()` result `{id, status: "error", model, durationMs}` combined with `unhandledRejection`’s ConnectError as an auth failure; trigger a `dispose handle + Agent.resume` retry once before propagating to the caller.
- Restart the host process when more than N bare-error runs happen on the same agent within a 10-minute window.
These are all heuristic workarounds for what should be deterministic SDK behavior. We’d much rather just await sdkRun.wait() and get a typed error.
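For anyone who needs the same stopgap, the sliding-window part of the workaround can be sketched roughly as follows. This is a simplified illustration, not our production code: `WaitResult`, `isBareError`, and `BareErrorWindow` are hypothetical names, the result shape is only what we observed from `wait()`, and timestamps are passed in explicitly rather than read from a clock.

```typescript
// Shape of the result we observed from wait(); the `error` field is the one
// we are asking for and is absent today.
interface WaitResult {
  id: string;
  status: string;
  model?: string;
  durationMs?: number;
  error?: unknown;
}

// A "bare" error is a failed run that carries no error details at all.
function isBareError(result: WaitResult): boolean {
  return result.status === "error" && result.error === undefined;
}

// Per-agent sliding window of bare-error timestamps (hypothetical helper).
class BareErrorWindow {
  private readonly hits = new Map<string, number[]>();
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number, windowMs: number) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Record one bare error at nowMs; returns true once the agent has more than
  // `threshold` bare errors inside the window and should be treated as poisoned.
  record(agentId: string, nowMs: number): boolean {
    const previous = this.hits.get(agentId) ?? [];
    const recent = previous.filter((t) => nowMs - t < this.windowMs);
    recent.push(nowMs);
    this.hits.set(agentId, recent);
    return recent.length > this.threshold;
  }

  // Forget an agent's history, e.g. after its handle has been recreated.
  reset(agentId: string): void {
    this.hits.delete(agentId);
  }
}
```

A host might construct it as `new BareErrorWindow(3, 10 * 60 * 1000)`, where 3 is an arbitrary placeholder for N, and call `record` only when `isBareError(result)` is true.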
Happy to help
We can supply more detailed stack traces, the exact agent lifecycle (created → many resumed → first failure), the parallel sibling-agent traffic during the failure window (proving the API key was fine), and a minimal reproducer harness if it would help triage. Repro on our end is rare (≈1 in tens of thousands of runs) but high-impact because the affected Agent instance gets stuck until the host process is restarted, even though everything around it (key, cloud, sibling agents, fresh Agent.create) keeps working.
Thanks!