Agent.send().wait() returns bare status: "error" while ConnectRPC [unauthenticated] leaks as unhandledRejection

Summary

When the cloud-side ConnectRPC stream backing agent.send() is terminated by the
server with an end-stream error (observed: [unauthenticated] Error), the SDK:

  1. Does not throw from the sdkRun.stream() async iterator;
  2. Returns wait() with a bare { id, status: "error", model, durationMs } object
    (no error field, no code, no message);
  3. Lets the underlying ConnectError escape as a Node-level
    process.on('unhandledRejection'), with a stack pointing at
    @connectrpc/connect’s endStreamFromJson / async-iterable.js.

Critically, the API key is valid and was never the problem (see “Scope of
failure” below). The failure is isolated to one specific long-lived Agent
handle obtained via Agent.resume(...). As a result, the caller has no
programmatic way to learn why the run failed, and no way to learn that it was
the resumed handle’s internal state, not the credentials, that is stuck.

Environment

Component                          Version
@cursor/sdk                        1.0.13
@connectrpc/connect (transitive)   1.7.0
Node.js                            v25.9.0
OS                                 macOS (darwin, arm64)

Runtime topology: long-lived host process holds many Agent handles obtained
via Agent.resume(agent_id, { apiKey, ... }), one per user conversation.
Multiple agent.send().stream() + .wait() cycles are issued against the same
handle over hours/days. Each handle is auto-disposed after an idle TTL
(30 min) and re-resumed on demand.

Symptom — what we observed

One agent had been alive for ~43 hours, with ~17 successful Agent.resume
cycles and many successful send/stream/wait round trips. Then within a
4-minute window:

T+0    agent.send().stream() iterates normally, emits exactly one event with
       { status: "running" }, then exactly one event with { status: "error" }.
       The async iterator completes WITHOUT throwing.
T+~0   await sdkRun.wait() resolves with
       { id: "run-…", status: "error", model: "composer-2", durationMs: … }
       — no `error`, no `code`, no `message` field.
T+~0   process.on('unhandledRejection') fires with the actual cause:
        ConnectError: [unauthenticated] Error
            at errorFromJson  (@connectrpc/connect/.../protocol-connect/error-json.js:53:19)
            at endStreamFromJson (.../protocol-connect/end-stream.js:64:11)
            at Object.parse (.../protocol-connect/end-stream.js:118:24)
            at .../protocol/async-iterable.js:399:75
            at Generator.next (<anonymous>)
            at resume (.../protocol/async-iterable.js:28:44)
            at fulfill (.../protocol/async-iterable.js:30:31)
            at process.processTicksAndRejections (node:internal/process/task_queues:104:5)

The exact same fingerprint repeated for two consecutive runs on the same agent 4 minutes apart, then the agent stayed permanently broken until our host process was restarted.

Scope of failure — what we ruled out

This is the part that makes us confident the issue is in the SDK’s handling of a single long-lived resumed handle, not in our credentials or the Cursor cloud as a whole:

  • :white_check_mark: API key is valid and was never rotated / revoked. The exact same key is used by every other agent in the same host process, and they all continued to work normally throughout the incident.

  • :white_check_mark: Other concurrent agents using the same API key kept succeeding during the failure window. We have logs of successful send/stream/wait cycles on sibling agents at the same wall-clock seconds as the failing agent’s unhandled rejections.

  • :white_check_mark: A fresh Agent.create(...) against the same workspace + same API key works immediately, with no token rotation, no config change, no network change. Once we restarted the host (which discarded the broken Agent handle), the next Agent.create succeeded on first try and the conversation continued normally on a brand-new agent.

  • :white_check_mark: The same Agent.resume(agent_id, …) of the broken agent_id issued from the new host process also succeeded — i.e. the agent_id itself is resumable; it’s the in-memory Agent instance that had become stuck.

  • :cross_mark: Not rate limit — RateLimitError would have been thrown synchronously and counted in metrics; nothing of the sort.

  • :cross_mark: Not network — no socket errors, no other Connect calls in the same process were failing.

This leaves only one plausible cause: some internal state in the SDK’s Agent instance (a credential cache, a Connect session, a transport, a streaming reader) went stale on this particular resumed handle after long uptime, and the SDK had no path to detect/refresh/surface that.

Reproduction sketch

We don’t have a deterministic local repro because we can’t synthesize a UNAUTHENTICATED end-stream from Cursor’s cloud at will. However, the failure mode should be reproducible by intercepting the SDK’s ConnectRPC transport and making the server stream emit an end-stream trailer with code: "unauthenticated" after the first data frame:

// fake server response
data: {"status":"running"}\n\n
// end-stream trailer with error
data: {"metadata":{},"error":{"code":"unauthenticated","message":"Error"}}\n\n

Expected: sdkRun.stream() rejects with a ConnectError AND/OR sdkRun.wait() resolves with a result whose error field carries the ConnectError details.

Actual: the iterator completes silently, wait() returns the bare result, and the ConnectError surfaces only via process.unhandledRejection.
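For anyone building that interception harness: on the wire, the Connect streaming protocol frames each message in a 5-byte envelope (one flag byte plus a 4-byte big-endian length) rather than SSE-style data: lines, and the end-stream message sets flag bit 0x02. Below is a minimal sketch of building such a body for the JSON codec. The helper names are ours, and { status: "running" } is only a placeholder, since we don't know the SDK's real message schema.

```javascript
// Flag bit that marks the final (end-stream) envelope in the Connect protocol.
const END_STREAM_FLAG = 0b00000010;

// Wrap a JSON-serializable payload in a Connect envelope:
// 1 flag byte + 4-byte big-endian payload length + payload bytes.
function encodeEnvelope(flags, payload) {
  const body = new TextEncoder().encode(JSON.stringify(payload));
  const out = new Uint8Array(5 + body.length);
  out[0] = flags;
  new DataView(out.buffer).setUint32(1, body.length, false); // false = big-endian
  out.set(body, 5);
  return out;
}

// Hypothetical fake response body: one data frame (placeholder payload),
// then an end-stream frame carrying the error seen in the stack trace.
function buildFakeResponseBody() {
  const data = encodeEnvelope(0, { status: "running" });
  const end = encodeEnvelope(END_STREAM_FLAG, {
    metadata: {},
    error: { code: "unauthenticated", message: "Error" },
  });
  const body = new Uint8Array(data.length + end.length);
  body.set(data, 0);
  body.set(end, data.length);
  return body;
}
```

A fake transport would serve these bytes with a streaming Connect content type; how to plug a custom transport into the SDK is something we have not verified.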

Why we believe this is an SDK bug

The stack ends inside @connectrpc/connect’s async-iterable transform: async-iterable.js:399 → Generator.next → resume / fulfill. That code path runs as a background microtask attached to the underlying iterator. If the SDK consumes the iterator (e.g. for stream()) but does not await the iterator’s .return() / .throw() to completion, a rejection produced after the last value is delivered will escape into the process scheduler instead of being re-thrown into the consumer’s for await loop.

Concretely, we suspect one of the following patterns inside the SDK:

// Pattern A: fire-and-forget tail iteration
async function* relay(iter) {
  for await (const ev of iter) yield ev;
}
// The underlying iter is then GC'd; if an iter.next() was pre-queued and
// rejects, the rejection has no handler.

// Pattern B: duplicate consumers
const reader = iter[Symbol.asyncIterator]();
// reader.next() is called from two paths; only one is awaited.

// Pattern C: bare promise propagation
reader.next().then(handle, reportError);
// reportError doesn't translate into a rejection of stream()/wait()'s promise.

Whatever the exact internal shape, the user-visible contract should be: any ConnectRPC error that terminates the underlying stream MUST surface either through stream() throwing, or through wait() resolving with an error field. Right now neither happens.
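To make that contract concrete, here is a small self-contained sketch that contrasts a lossy relay with a strict one over a fake upstream. Everything in it is a stand-in: FakeConnectError is not the real ConnectError class, and lossyRelay is only our guess at the shape of the problem. Unlike the real failure, the lossy variant swallows the cause instead of leaking it, so the demo does not trip unhandledRejection; what the consumer sees matches our logs ("running", then "error", then normal completion).

```javascript
// Stand-in for the ConnectError seen in the stack trace (not the real class).
class FakeConnectError extends Error {
  constructor(code, message) {
    super(`[${code}] ${message}`);
    this.name = "ConnectError";
    this.code = code;
  }
}

// Fake upstream: one data event, then an end-stream error.
async function* fakeUpstream() {
  yield { status: "running" };
  throw new FakeConnectError("unauthenticated", "Error");
}

// Lossy relay (suspected shape): the rejection is converted into a bare
// status event and the cause is dropped, so the consumer never sees a throw.
async function* lossyRelay(upstream) {
  const it = upstream[Symbol.asyncIterator]();
  while (true) {
    const step = await it.next().then(
      (r) => r,
      () => ({ done: false, value: { status: "error" }, failed: true })
    );
    if (step.done) return;
    yield step.value;
    if (step.failed) return;
  }
}

// Strict relay: for await re-throws the upstream rejection into the consumer.
async function* strictRelay(upstream) {
  for await (const ev of upstream) yield ev;
}

// Drain a stream, recording the events and any error it throws.
async function drain(stream) {
  const events = [];
  try {
    for await (const ev of stream) events.push(ev);
    return { events, error: null };
  } catch (error) {
    return { events, error };
  }
}
```

Draining lossyRelay(fakeUpstream()) yields two events and no error; draining strictRelay(fakeUpstream()) yields one event and an error whose code is "unauthenticated", which is the behavior we are asking for.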

Expected behavior

  1. sdkRun.stream()'s async iterator throws the ConnectError (or a typed AuthenticationError subclass) when the end-stream signals an error.

  2. Or equivalently sdkRun.wait() resolves with a result that includes the underlying ConnectError, e.g.:

    {
      "id": "run-…",
      "status": "error",
      "model": "composer-2",
      "durationMs": 10262,
      "error": {
        "code": "unauthenticated",
        "message": "Error",
        "category": "auth"
      }
    }

  3. No process.on('unhandledRejection') event should fire for end-stream errors that originated from the user’s own send() / stream() / wait() call chain.

  4. Given that our “Scope of failure” evidence points at a stale internal Agent state rather than bad credentials, please also consider:

    • A way for the consumer to detect “this Agent handle is poisoned; dispose and re-resume” without restarting the host process (e.g. an agent.isHealthy() probe, or a typed AgentInstanceStaleError that instructs the caller to dispose).

    • An internal credential / Connect-session refresh on Agent.resume(...) so that long-lived handles can’t drift into a state where the API key is valid but the cached session is rejected.

  5. Ideally the SDK should map ConnectRPC codes to the existing CursorSdkError subclass surface: at minimum unauthenticated → AuthenticationError, and unavailable / resource_exhausted → a retriable network/poison classification.
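The mapping in item 5 could look roughly like the sketch below. The classes are local stand-ins that only borrow the names the SDK exports; their constructors and fields (code, retriable, cause) are our assumptions, not the SDK's actual API. The table is keyed by the wire-level code names from the end-stream JSON; the in-memory ConnectError carries a numeric code, so a real implementation would need to translate that first.

```javascript
// Local stand-ins for the SDK's error classes; the real constructors and
// fields in @cursor/sdk may differ.
class CursorSdkError extends Error {
  constructor(message, options = {}) {
    super(message);
    this.name = new.target.name;
    this.code = options.code;
    this.retriable = options.retriable ?? false;
    this.cause = options.cause;
  }
}
class AuthenticationError extends CursorSdkError {}
class NetworkError extends CursorSdkError {}

// Wire-level Connect code name -> error class and retry hint.
const CODE_MAP = {
  unauthenticated: { ctor: AuthenticationError, retriable: false },
  unavailable: { ctor: NetworkError, retriable: true },
  resource_exhausted: { ctor: NetworkError, retriable: true },
};

// Convert an end-stream error ({ code, message }) into a typed error,
// falling back to the base class for unmapped codes.
function toSdkError(err) {
  const code = typeof err?.code === "string" ? err.code : "unknown";
  const entry = Object.hasOwn(CODE_MAP, code)
    ? CODE_MAP[code]
    : { ctor: CursorSdkError, retriable: false };
  return new entry.ctor(err?.message ?? "Unknown error", {
    code,
    retriable: entry.retriable,
    cause: err,
  });
}
```

With something like this in place, a caller could branch on instanceof AuthenticationError or on retriable instead of parsing message strings.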

Impact

Without these guarantees:

  • Callers cannot distinguish “transient auth glitch in one stale handle” from “API key revoked” from “agent state corruption” from “unknown failure”. We are forced to use heuristic behavior counters (≥N bare-ERROR failures within a sliding window → assume “agent poisoned, recreate”), which produces false positives for stale-handle auth and false negatives for genuinely unrecoverable agents.

  • Unhandled rejections at process level can crash long-lived services that enable --unhandled-rejections=strict. We currently mitigate by installing a global handler that classifies the rejection and logs it, but we have no reliable way to attribute the rejection back to the specific agent.send() call that produced it (we have to rely on AsyncLocalStorage context, which is fragile).

  • Observability/SLO dashboards cannot report auth-failure rate accurately; every unauthenticated event currently shows up as “unknown”.
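For reference, the global handler mentioned above is roughly the following. This is a simplified sketch rather than our production code: the function names are illustrative, the substring list holds only the one entry relevant to this report, and attribution to a specific run is left out.

```javascript
// Substrings treated as auth failures. "unauthenticated" is the Connect code
// that appears in the rejection message "[unauthenticated] Error".
const AUTH_SUBSTRINGS = ["unauthenticated"];

// Classify a rejection reason by case-insensitive substring match.
function classifyRejection(reason) {
  const text = (
    reason instanceof Error ? `${reason.name}: ${reason.message}` : String(reason)
  ).toLowerCase();
  return AUTH_SUBSTRINGS.some((s) => text.includes(s)) ? "auth" : "unknown";
}

// Install once at startup; returns a function that removes the listener.
// In Node's default mode a registered listener keeps the rejection from being
// raised as an uncaught exception; under --unhandled-rejections=strict it
// does not.
function installRejectionLogger(log = console.error) {
  const handler = (reason) => {
    log(`[unhandledRejection] category=${classifyRejection(reason)}`, reason);
  };
  process.on("unhandledRejection", handler);
  return () => process.off("unhandledRejection", handler);
}
```

This tags the log line correctly but does nothing for the caller of wait(), which is why we consider it a stopgap.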

Suggested investigation areas

Without source access, here is what we would look at in the SDK:

  1. The function that adapts the ConnectRPC unary / serverStreaming response into the sdkRun.stream() async generator — make sure it awaits the underlying iterator until either a done: true value is observed or a throw propagates out. Do not allow tail iter.next() promises to be pre-queued and orphaned.

  2. The sdkRun.wait() implementation — if it shares state with the streaming reader, ensure that a stream rejection observed after wait() started is surfaced as wait() rejecting (or as the resolved result’s error field).

  3. ConnectRPC code → CursorSdkError subclass mapping. Today the SDK exports AuthenticationError, NetworkError, RateLimitError, etc., but only for top-level / sync paths. Apply the same mapping to errors raised from the streaming end-stream parser.

  4. Agent.resume(...) credential / Connect-session lifecycle. Given that sibling agents on the same API key kept working and a fresh Agent.create immediately recovered, please verify that long-lived Agent instances periodically refresh whatever underlying session token / transport they hold, or expose a hook for the host to do so.
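As an illustration of item 2, here is one way stream() and wait() could share state so that a stream failure reaches both. It is a sketch under our own assumptions, not a description of the SDK's internals: createRun and failingUpstream are hypothetical, and wait() here only settles once stream() has been consumed to the end.

```javascript
// Fake upstream for the demo: one event, then an end-stream style error.
async function* failingUpstream() {
  yield { status: "running" };
  const err = new Error("[unauthenticated] Error");
  err.code = "unauthenticated";
  throw err;
}

// Hypothetical run wrapper: stream() records the terminal state, and wait()
// reports it, including the failure cause.
function createRun(id, upstream) {
  let failure = null;
  let lastStatus = "running";
  let markFinished;
  const finished = new Promise((resolve) => {
    markFinished = resolve;
  });

  async function* stream() {
    try {
      for await (const ev of upstream) {
        if (ev.status) lastStatus = ev.status;
        yield ev;
      }
    } catch (err) {
      failure = err; // remember the cause for wait()
      lastStatus = "error";
      throw err; // and surface it to the for await consumer
    } finally {
      markFinished();
    }
  }

  async function wait() {
    await finished;
    const result = { id, status: lastStatus };
    if (failure) {
      result.error = { code: failure.code, message: failure.message };
    }
    return result;
  }

  return { stream, wait };
}
```

Consuming createRun("run-1", failingUpstream()).stream() throws the error, and a subsequent wait() resolves with status "error" plus an error field carrying the code and message.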

Downstream mitigation we applied

For reference (we run a Node bridge around @cursor/sdk that exposes a stable HTTP/SSE API to our app servers):

  • Added "unauthenticated" (lower-cased) to our error-classifier’s auth substring set so the process-level unhandledRejection is at least tagged correctly in logs.

  • Treat the bare wait() result {id, status: "error", model, durationMs} combined with an unhandledRejection carrying a ConnectError as an auth failure; dispose the handle and retry once via Agent.resume before propagating the error to the caller.

  • Restart the host process when more than N bare-error runs happen on the same agent within a 10-minute window.

These are all heuristic workarounds for what should be deterministic SDK behavior. We’d much rather just await sdkRun.wait() and get a typed error.
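The dispose-and-retry and sliding-window pieces of that mitigation can be sketched as follows. This is simplified from what we run and uses only the bare wait() result as the trigger. All names are illustrative: acquire, reResume, runOnce and onPoisoned are callbacks the host supplies (reResume is expected to dispose the old handle and call Agent.resume), so nothing here depends on SDK method names we haven't confirmed.

```javascript
// True for the bare result shape described in this report: status "error"
// with no error, code, or message field.
function isBareErrorResult(result) {
  return (
    result != null &&
    result.status === "error" &&
    result.error == null &&
    result.code == null &&
    result.message == null
  );
}

// Counts bare-error runs per agent inside a sliding time window.
class BareErrorWindow {
  constructor(windowMs, now = Date.now) {
    this.windowMs = windowMs;
    this.now = now;
    this.hits = new Map(); // agentId -> timestamps in ms
  }

  // Record one bare error and return the count still inside the window.
  record(agentId) {
    const t = this.now();
    const recent = (this.hits.get(agentId) ?? []).filter(
      (ts) => t - ts < this.windowMs
    );
    recent.push(t);
    this.hits.set(agentId, recent);
    return recent.length;
  }
}

// Run once; on a bare error, re-resume the handle and retry exactly once.
// onPoisoned fires when the window count exceeds maxBareErrors.
async function runWithOneReResume(agentId, deps) {
  const { acquire, reResume, runOnce, window, maxBareErrors, onPoisoned } = deps;
  const noteBareError = () => {
    if (window.record(agentId) > maxBareErrors) onPoisoned(agentId);
  };

  let result = await runOnce(await acquire(agentId));
  if (!isBareErrorResult(result)) return result;
  noteBareError();

  result = await runOnce(await reResume(agentId));
  if (isBareErrorResult(result)) noteBareError();
  return result;
}
```

In our setup onPoisoned schedules the host restart; with a typed error from wait() none of this bookkeeping would be needed.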

Happy to help

We can supply more detailed stack traces, the exact agent lifecycle (created → many resumed → first failure), the parallel sibling-agent traffic during the failure window (proving the API key was fine), and a minimal reproducer harness if it would help triage. Repro on our end is rare (≈1 in tens of thousands of runs) but high-impact because the affected Agent instance gets stuck until the host process is restarted, even though everything around it (key, cloud, sibling agents, fresh Agent.create) keeps working.

Thanks!

After one agent run returns an error, every subsequent send() run on that agent also returns an error.
Expectation: after a single failed run, the agent should still be able to function normally, right?

Hi there! Thanks for the thorough report.

Both issues are confirmed on our end - RunResult missing error details and the unhandledRejection issue. I’ve filed this with our SDK team.

Your workaround (detect bare-error + dispose/re-resume the handle) is the right approach until the fix ships.

To your follow-up in post #6: yes, once a handle enters that stale state, subsequent send() calls will keep failing. Disposing and re-creating the agent handle restores normal function.