Share your experience with Long-running Agents

Main announcement · Blog · Try it now


Now that long-running agents are available for Ultra, Teams, and Enterprise users, we’d love to hear how they’re working for you!

During the research preview, early testers used long-running agents to implement large features, refactor complex systems, fix challenging bugs, and create high-coverage tests—with agents commonly running for hours or even days. A few examples:

  • Building an all-new chat platform integrated with an existing open-source tool (36 hours)
  • Implementing a mobile app based on an existing web app (30 hours)
  • Refactoring an authentication and RBAC system (25 hours)

Some things we’re curious about:

  • What types of tasks are you giving to long-running agents? Big features, refactors, bug fixes, test generation, something else?
  • How does the planning phase feel? Is the agent proposing good plans, and does approving the plan upfront lead to better results?
  • How long are your agent runs, and how do the resulting PRs look? Are you merging them as-is, or are you doing significant follow-up work?
  • How does it compare to using regular agents for your workflows? Are you seeing tasks completed that previously weren’t possible?
  • Have you tried running multiple long-running agents in parallel? How did that go?

We’re actively improving the harness, and your feedback directly shapes what we work on next. What’s working well? What could be better?

For a deeper look at how the harness works and how we’ve been using long-running agents internally, check out our blog post.

3 Likes

I’m pumped for tomorrow at work – I know exactly what I’m going to do!

2 Likes

This is huge. Excited for productivity. Concerned for humanity.

My immediate concern is that this is also going to eat tokens like never before if the agent is let run this freely.

Regarding “30 hours to build a mobile app”: do you have more detailed examples of the actual starting prompts, planning shape, and final output to share, so we can understand what’s achievable? Otherwise it’s a new Ultra sub every other day in testing to find out!

2 Likes

Is it coming to the desktop app, or is it a web-app-only feature?

3 Likes

I’m excited about this feature, feels like it unlocks a whole new level of capability.

But honestly, I’m still not sure when to reach for it. The tasks in my current projects don’t feel “big enough” for a long-running agent, so I’m wondering: should I be intentionally looking for larger tasks, or is there a sweet spot I’m missing?

A tutorial or a clearer explanation of how the harness works under the hood would really help me identify the right use cases. The blog post is a good start, but something more hands-on would go a long way.

1 Like

“Implementing a mobile app based on an existing web app (runtime: 30 hours)”

How can this be done without multi-repository support? I would like to reference other repositories from the web application to do exactly that.

Any information about that? @Colin

1 Like

@what.gift we gave power users access so they can try this out. Looking forward to the feedback we receive from the community.

@Steve4 Cursor Desktop

@Z_W most current usages were refactors and building new features, but also full products, writing tests,… It’s a special harness that doesn’t just loop on a task but orchestrates properly. It works well with detailed plans/requirements, and also with subagents and skills required for specific processes.

@gillesheinesch why are multiple repositories necessary? Wouldn’t the GH MCP help there?

1 Like

Thanks for sharing the usage.

I tested on a small feature (link previews). 3 hours, 99 commits, feature doesn’t fully work, and I manually stopped it.

The agent built a skeleton in 3 commits, then spent 96 on URL parsing edge cases (CJK punctuation, zero-width chars…) — a self-reinforcing loop of write test → find edge case → fix → repeat… It feels like more time made it go deeper in the wrong direction.

Could be my prompt or task choice — still figuring out how to use this effectively :joy: . But thought the pattern was worth sharing.

2 Likes

What types of tasks are you giving to long-running agents? Big features, refactors, bug fixes, test generation, something else?

We wanted to move away from global CSS files to Tailwind for our frontend client. The current scale is large (339 module CSS files, ~29.5k CSS lines, ~396 TSX files referencing styles, and ~5,066 style references).
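For anyone wanting to size up a similar migration before handing it to an agent, here's a minimal audit sketch. It assumes the common `*.module.css` naming and the `import styles from '...'` / `styles.someClass` conventions, which may not match this poster's actual repo:

```python
# Hypothetical audit script to size a CSS-modules -> Tailwind migration.
# Assumes `*.module.css` files and the `styles.x` usage convention in TSX.
import re
from pathlib import Path

def audit_styles(root: str) -> dict:
    root_path = Path(root)

    # All CSS module files and their total line count.
    css_files = list(root_path.rglob("*.module.css"))
    css_lines = sum(
        len(f.read_text(encoding="utf-8").splitlines()) for f in css_files
    )

    # A TSX file "references styles" if it imports a CSS module.
    import_re = re.compile(r"from\s+['\"].*\.module\.css['\"]")
    referencing = [
        f for f in root_path.rglob("*.tsx")
        if import_re.search(f.read_text(encoding="utf-8"))
    ]

    # Count individual `styles.someClass` usages across those files.
    usage_re = re.compile(r"\bstyles\.\w+")
    refs = sum(
        len(usage_re.findall(f.read_text(encoding="utf-8")))
        for f in referencing
    )

    return {
        "css_files": len(css_files),
        "css_lines": css_lines,
        "tsx_files_referencing": len(referencing),
        "style_references": refs,
    }
```

Numbers like these make it easier to judge whether a task is "big enough" for a long-running agent, and give a checkpoint to verify the agent's own claims against.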

How does the planning phase feel? Is the agent proposing good plans, and does approving the plan upfront lead to better results?

The agent initially proposed a plan that made sense in the traditional software world: break the work into small steps, keep both the CSS and Tailwind around initially, and migrate over time. That makes sense if a human were doing it, but I wanted to one-shot it with the agent, so I recommended a different approach and it gracefully modified the plan.

How long are your agent runs, and how do the resulting PRs look? Are you merging them as-is, or are you doing significant follow-up work?

The run actually turned into a runaway agent. After 30 hours I pulled the plug because it was just improving its own tooling unnecessarily. This reminded me of experiments I’ve run myself with Ralph that went haywire and never terminated until I stopped them.

How does it compare to using regular agents for your workflows? Are you seeing tasks completed that previously weren’t possible?

Oh, a huge improvement in its ability to keep working.

1 Like

That’s not working in a cloud agent.

Has anyone tried a long-running agent as an SRE, just for observability and impact mitigation in incident management when an issue arises?

Like keeping the agent in standby mode for when zzzz hits the fan.

I ran multiple trials on this immediately after launch.

1st Trial: Full Grind mode | Codex 5.3 | Duration: 48 hours+

Context: Request to build an additional feature for my existing app, involving a new ML algorithm for detecting workout activities and matching them to a prescribed workout. I was pretty clear in my writeup and plan about the success criteria.

Results: It spent the first 7 hours doing incremental coding work that was decent and productive. I would give it a rating of 7 out of 10. It took a simplistic approach rather than exploring a more sophisticated and scalable one. However, the bigger issue was that it went on for another 30+ hours trying to refactor the codebase, which was never an ask and wasn’t in the plan I approved. The refactor was unnecessary and not in the right direction. I manually stopped it at the end, kept only the useful work from the first 7 hours, and enhanced it.

Personal Take: Being able to keep working is fantastic, but there is no guardrail (notification) on the side that made it easy to see what it was doing. It would be nice if it told me when it’s making a pivot outside of the plan, or an important one, but that information is just buried in the very long thread when I check in, as opposed to surfacing in a side panel or notification. FWIW, I did see in the commits that it had started to refactor, and I thought it might make some good improvements, so I just let it run (but it was not a good one).

2nd Trial: A simpler task with a 3-hour limit cap, just a small feature enhancement. Overall it hit the mark. I also tested the exact same ask with Claude Code on web/mobile. Quality was not very different, but Cursor did take much longer. It ran overnight, so that didn’t matter. In hindsight, the standard cloud agent mode would probably have done the job. I was hoping that with a 3-hour budget it might be a bit more comprehensive and think more deeply with more coverage.

Conclusion: I like the direction of the feature and think long-running on cloud agents is the way to go, but as it currently stands, I don’t yet trust that the longer grind leads to better results. I still need to check in. I would 100% use it and be willing to pay if time → quality were 1:1.

The grind is a never-ending story. Whenever it finishes the first stage of a plan, it starts a dead loop of checking that it finished the first stage. Nothing more. In my case, it ran this “burn tokens for nothing” loop around 70 times, writing to scratchpad.md: “Re-check 70 was a success, I finished doing the job, I am going to repeat the check once more.” Grind was surely shipped without proper checks. It burns lots of tokens and never finishes the work, stuck at phase 1 for eternity.

One more note: I had to push changes to git manually, as it kept saying “I pushed it, I double-checked it 100 times and am going to check once more.” To actually push the changes, I had to switch to the terminal and do it myself.

Overall, I am not recommending it. The product is just too raw.

Initial feedback

  • Constantly seeing errors, and retries failing too
  • Cannot connect to desktop
  • Would love to see plan mode on web too
  • Can’t see the cloud agent option on desktop, unfortunately

So this is the prompt I gave it on the first trial:

You are reviewing and extending a repo that already includes:

  • NestJS example integration that spawns a Python worker: automation_api/nestjs/src/[company]-gcode-worker.ts, example.controller.ts
  • Python automation worker pipeline: automation_api/python/[company]_automation/pipeline.py, cli.py
  • Materials library: src/materials.py
  • Axisymmetry + minimal milling-required detection (mostly holes): src/feature_detection.py
  • 3-axis Haas VF program generator (bbox/feature-based, not full CAM): src/vertical_mill.py
  • Live-tool milling toolpaths + posting support exists: src/milling_operations.py and src/gcode_generator.py

Constraints:

  • Do NOT modify anything in project root except potentially src/materials.py and src/vertical_mill.py.
  • Any new automation logic should go under automation_api/ (new folder/modules are fine).
  • Surface finish is accepted but ignored for now.
  • Multi-part job support must remain.

Tasks:

  1. Produce a gap analysis vs “full automated feature set”: what is missing for milling features beyond holes and for complex geometry.
  2. Propose an incremental implementation plan (MVP → v2) with clear module boundaries and data contracts.
  3. Provide a security/reliability checklist for the TS->Python worker execution.
  4. Provide a test plan (unit + integration) and acceptance criteria.
    Return: a structured writeup + recommended code changes (file-level) respecting constraints.

The plan it provided was significantly longer and more detailed than any subsequent plans it produced using other prompts. I do like the detail, but I find it is often inconsistent in the level of planning depth if I don’t structure the prompt in the format above. I will note that a lot of the time, I would have to go back and forth with the agent to improve the detail and direction of the plan.

After I approved the plan, it ran for 80+ hours nonstop on grind mode using Codex 5.3. It ended up adding ~183,000 lines of code, mostly containing a rigorous testing regime. Aside from the first trial, the PRs do require additional instructions for the agent or some more work on my part. I find that having GPT 5.2 xhigh craft a plan for me, and then feeding that initial plan to the long-running agent to make a final plan produces much better results than directly prompting the agent itself.

The current state of the long-running cloud agent could use some more work so that there is less need for user intervention, but I like the direction it is going.

Closing this thread since it has been a while since this was announced. Any new questions, or loose ends from this thread should be filed as new reports in the appropriate forum category!