Now that long-running agents are available for Ultra, Teams, and Enterprise users, we’d love to hear how they’re working for you!
During the research preview, early testers used long-running agents to implement large features, refactor complex systems, fix challenging bugs, and create high-coverage tests—with agents commonly running for hours or even days. A few examples:
Building an all-new chat platform integrated with an existing open-source tool (36 hours)
Implementing a mobile app based on an existing web app (30 hours)
Refactoring an authentication and RBAC system (25 hours)
Some things we’re curious about:
What types of tasks are you giving to long-running agents? Big features, refactors, bug fixes, test generation, something else?
How does the planning phase feel? Is the agent proposing good plans, and does approving the plan upfront lead to better results?
How long are your agent runs, and how do the resulting PRs look? Are you merging them as-is, or are you doing significant follow-up work?
How does it compare to using regular agents for your workflows? Are you seeing tasks completed that previously weren’t possible?
Have you tried running multiple long-running agents in parallel? How did that go?
We’re actively improving the harness, and your feedback directly shapes what we work on next. What’s working well? What could be better?
For a deeper look at how the harness works and how we’ve been using long-running agents internally, check out our blog post.
This is huge. Excited for productivity. Concerned for humanity.
My immediate concern is that letting the agent run this freely is also going to eat tokens like never before.
“30 hours to build a mobile app” - do you have more detailed examples to share of the actual starting prompts, the shape of the plans, and the final output, so we can understand what’s achievable? Otherwise it takes a new Ultra sub every other day of testing to find out!
I’m excited about this feature, feels like it unlocks a whole new level of capability.
But honestly, I’m still not sure when to reach for it. The tasks in my current projects don’t feel “big enough” for a long-running agent, so I’m wondering: should I be intentionally looking for larger tasks, or is there a sweet spot I’m missing?
A tutorial or a clearer explanation of how the harness works under the hood would really help me identify the right use cases. The blog post is a good start, but something more hands-on would go a long way.
@Z_W most current uses were refactors or building new features, but also full products, writing tests, … It’s a special harness that doesn’t just loop on a task but orchestrates properly. It works well with detailed plans/requirements, and also with subagents and skills where a specific process is required.
@gillesheinesch why are multiple repositories necessary? Wouldn’t the GH MCP help there?
I tested it on a small feature (link previews): 3 hours, 99 commits, the feature doesn’t fully work, and I manually stopped it.
The agent built a skeleton in 3 commits, then spent the other 96 on URL parsing edge cases (CJK punctuation, zero-width chars…): a self-reinforcing loop of write test → find edge case → fix → repeat. It feels like more time just made it go deeper in the wrong direction (the kind of thing it was chasing is sketched below).
Could be my prompt or task choice; I’m still figuring out how to use this effectively. But I thought the pattern was worth sharing.
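For anyone curious what that loop looked like, here’s a rough TypeScript sketch of the kind of edge-case normalization it kept iterating on. This is my own illustration; the function name, regexes, and character lists are hypothetical, not the agent’s actual code.

```ts
// Hypothetical illustration of the edge cases the agent kept chasing.
// Names and character lists are assumptions, not the agent's code.

// Zero-width characters that can hide inside pasted URLs.
const ZERO_WIDTH = /[\u200B\u200C\u200D\uFEFF]/g;

// CJK punctuation that often trails a URL pasted from chat or docs.
const TRAILING_CJK_PUNCT = /[\u3001\u3002\uFF0C\uFF01\uFF1F\uFF09\u300D\u300F]+$/;

function extractLinkForPreview(raw: string): URL | null {
  // Strip invisible characters, then any trailing CJK punctuation.
  const cleaned = raw.replace(ZERO_WIDTH, "").replace(TRAILING_CJK_PUNCT, "");
  try {
    const url = new URL(cleaned);
    // Only http(s) links get previews.
    return url.protocol === "http:" || url.protocol === "https:" ? url : null;
  } catch {
    return null;
  }
}

// Example: a link pasted from a Chinese chat message, trailing 。 included.
console.log(extractLinkForPreview("https://example.com/docs。")?.href);
// -> "https://example.com/docs"
```

Each of those cases is individually reasonable, which is probably why the loop felt productive to the agent even though the feature itself never came together.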
What types of tasks are you giving to long-running agents? Big features, refactors, bug fixes, test generation, something else?
We wanted to move our frontend client away from global CSS files to Tailwind. The current scale is large: 339 CSS module files, ~29.5k lines of CSS, ~396 TSX files referencing styles, and ~5,066 style references.
How does the planning phase feel? Is the agent proposing good plans, and does approving the plan upfront lead to better results?
The agent initially proposed a plan that made sense in the traditional software world: break the work into small steps, keep both the CSS and Tailwind around initially, and migrate over time. That makes sense if a human were doing it, but I wanted to one-shot it with the agent, so I recommended a different approach and it gracefully modified the plan.
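To give a sense of the shape of the work, here’s a hypothetical before/after for a single component; the component, class names, and styles are made up for illustration, not taken from our codebase.

```tsx
import type { ReactNode } from "react";

// Before: a CSS module class referenced from the component.
//
//   // Button.module.css
//   .primaryButton {
//     padding: 0.5rem 1rem;
//     border-radius: 0.375rem;
//     background-color: #2563eb;
//     color: #fff;
//     font-weight: 600;
//   }
//
//   import styles from "./Button.module.css";
//   <button className={styles.primaryButton}>Save</button>

// After: the same styles expressed as Tailwind utilities, module file deleted.
export function PrimaryButton({ children }: { children: ReactNode }) {
  return (
    <button className="px-4 py-2 rounded-md bg-blue-600 text-white font-semibold">
      {children}
    </button>
  );
}
```

Multiply that by ~396 TSX files and ~5,066 style references and you can see why the agent’s incremental plan was the “safe” default, and why one-shotting it is a very different ask.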
How long are your agent runs, and how do the resulting PRs look? Are you merging them as-is, or are you doing significant follow-up work?
The run actually became a runaway agent. After 30 hours I pulled the plug because it was just self-improving its own tooling unnecessarily. This reminded me of experiments I’ve run myself with Ralph that went haywire and never terminated until I stopped them.
How does it compare to using regular agents for your workflows? Are you seeing tasks completed that previously weren’t possible?
Oh, a huge improvement in its ability to keep working.