I haven’t found the parallel agents feature to be useful for code production - I think subagents would be much better for that.
(Long explanation for why I think parallel agents aren't great for code production)
This is mainly because using parallel agents multiplies the amount of reviewing I have to do (+1 whole review session per agent, per message!), whereas in my normal workflow I usually just update my prompt or provide better/clearer context if something goes wrong with the initial output. In my experience, frontier models like GPT 5(.1) (Codex) and Sonnet 4.5 do what I want in most cases when given good enough context. Poor output is usually the result of either my poor input or the model making a wrong assumption, which can almost always be solved by having it ask you clarifying questions in thinking mode before planning. To me, this means that poor output is the result of human error more often than not (my ambiguous/inadequate input produced poor output); parallel agents don’t help me with that.
Instead of reviewing multiple attempts at the same solution, I’d rather put my time toward improving the context in the single thread I have (which saves time as the chat grows, because the agent incrementally gains a better and better understanding of what I want). In other words, when reviewing multiple solutions, I spend my time trying to determine whether the output I’m presented with happened to be adequate by chance. When I actively improve the context by updating a prompt, on the other hand, I’m increasing my chances of getting adequate output for the rest of the conversation. I could see parallel agents being a good option if you have a detailed spec/plan ready to go ahead of time, though.
For me, the best use case for parallel agents is getting insights about some code or a system. The difference here is that I’m trying to learn/discover instead of produce; I don’t have a clear outcome in mind from the start, so having multiple perspectives at once is extremely helpful and actually saves me a lot of time. This is where the differences between models really get a chance to shine.
Here are a few prompts I used recently this way:
Comparison: thinking through the difference between the contents of two files to find a useful indicator. Even though all 4 models discovered the same indicator, each presented the information to me in a different way. Having it explained from 4 different angles really helped me grasp it quicker. (A rough sketch of the kind of attribute diff I was after follows the prompt below.)
The files `@1-before-check.html` and `@example-invisible.html` both contain the contents of a webpage displaying a ReCAPTCHA. The difference is that `@example-invisible.html` has an "invisible" recaptcha; This means its checkbox isn't shown. Your task is to carefully compare the recaptcha components in `@1-before-check.html` and `@example-invisible.html` to see if you can find some designator or indicator that determines whether a captcha is visible or "invisible". Maybe an element is missing, some CSS is different, an attribute determines it, or something else. Feel free to be creative in your analysis and exploration. Think through this until you can determine a clear indicator. If you think for a long time and aren't able to recognize any clear indicators, feel free to say so.
Report back your results in a clear, understandable format. Respond inline in this chat - don't create a new file for your report.
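For context, this is roughly the kind of attribute-level comparison I was hoping the models would reason through. The sketch below is my own illustration, not the answer any of them gave: it assumes BeautifulSoup is installed, and the selectors and pairing logic are just guesses at how the reCAPTCHA widget might be marked up in those saved pages.

```python
# Rough sketch: diff the attributes of whatever looks like the reCAPTCHA widget
# in the two saved pages. The selectors are assumptions, not the indicator the
# models actually found.
from pathlib import Path
from bs4 import BeautifulSoup

def recaptcha_attrs(path: str) -> list[dict]:
    soup = BeautifulSoup(Path(path).read_text(encoding="utf-8"), "html.parser")
    # Grab anything that plausibly belongs to the widget: container divs and iframes.
    nodes = soup.select('[class*="g-recaptcha"], [class*="grecaptcha"], iframe[src*="recaptcha"]')
    return [dict(node.attrs) for node in nodes]

visible = recaptcha_attrs("1-before-check.html")
invisible = recaptcha_attrs("example-invisible.html")

# Naively pair nodes up in document order and print any attributes that differ.
for a, b in zip(visible, invisible):
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print(f"{key}: {a.get(key)!r} -> {b.get(key)!r}")
```

In practice I didn’t write any of this myself - the value was in letting 4 models do the reasoning and getting 4 differently framed explanations of the same finding.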
Bug finding: GPT 5.1, GPT 5.1 Codex, and Sonnet 4.5 each went off on their own path when tracing through a feature. Having 3 different ‘sets of eyes’ looking for potential problems was great. I no longer needed to launch separate chats for this, and having 3 explanations helped me discover and think through potential issues much faster.
{{long knowledge primer about a large set of changes I just made to an existing feature}}
Now walk yourself through the implementation one more time. Look carefully for any potential logic errors, bugs, flow issues, or discrepancies where the implementation differs from the expectations I've laid out throughout this session.
I imagine you could do the same for debugging, brainstorming, learning, and “having the model ask you questions” (for the purpose of discovery) before planning. Weirdly, the effectiveness of these use cases seems to line up with the answer to the question “would you assign multiple humans to this task?”:
- “Does it make sense to have multiple developers implement the same thing at the same time?” Probably not.
- “Is it more effective for multiple developers to try to identify bugs in the new implementation?” If you have the manpower, yes - you’ll get way better coverage from having multiple perspectives. And with AI, we don’t have a “manpower shortage” issue.
I’m sure there’s at least one good use case for code generation itself here. I just haven’t found it yet, nor have I needed one with the way I work.