Free Cursor-GPT5 feedback

Thanks for letting us use GPT-5 for free for a week.
In that time, I used about 50 USD of API-equivalent cost on the GPT-5 normal, high, and fast versions. So I feel a bit obliged to also give some feedback, and if others do so too, maybe we can all benefit (users and Cursor).

As a somewhat unusual user who hops between interfaces and models regularly, I can share some subjective experience:

# Thinking
Agenticness
In the week before GPT-5 was launched, I intensively used Cline in combination with Horizon Beta (HB), which was GPT-5. And honestly, Cline-HB felt better. It felt less lazy and did many long runs in complex code environments without stopping.
The difference seems to lie in the additional prompts and plumbing around my/our prompts and requests, and in how you send them to the model. It seems that tool usage is restricted, and the context window is more restricted too. In somewhat longer chat sessions, Cursor-GPT5 started to do only small tasks, not going for longer sequences of tool usage. Only when I started a new chat and set up my context manually did I get the expected behaviour. In Cline-HB and Claude Code, by contrast, tasks feel like they are executed agentically even when the context is almost full: they keep a consistent agentic coding behaviour, with more fluent coding, code-running, and analysis iterations per call and per task. Options to control the amount of tool usage in a single call would give users more control, also regarding cost and an approximate effort budget for a task.

Cost
If Cursor charges me something similar to API cost, I currently prefer to use the (free) Cline with my own OpenAI API key. The current quality of the agent, tool usage, and plumbing does not justify any extra cost in my experience. You buy tokens in bulk and get some discount; passing a bit of that discount on to users would already help convince people to use Cursor. Say I pay 20 USD to Cursor, but I get the equivalent of 25 USD in API usage. A superior agent would be the most convincing argument, though, and could even justify additional cost. Maybe also add a cost incentive to share or not share our code; if we share it, and maybe provide some feedback on prompts and requests, we can all improve.

Context window
Also, I really like the context-size bar indicator in Cline, and the warning Claude Code gives when the context is almost full. Compared to them, I have to say that the Cursor-GPT5 combination is very good at starting new projects, but when things get larger, I am not so convinced yet. Even when the project should still fit completely in context, it feels like Cursor is doing some compression, RAG, or something else that leaves the model feeling lost in the code base, or only partially aware of it. Good context compression can be very useful, but I would like more control over when, how, and what gets compressed. Sometimes context compression is exactly what I want; other times I really need to find a needle in a haystack, and I want the full context. Good management of the big project picture and the small task picture, combined with context management, seems to be an area where you and others can still make a lot of difference.

Mixing model usage
Also, GPT-5 at normal cost is relatively expensive, as are the good models from Anthropic. In Cline, I can use GLM-4.5 or Qwen3 Coder for the routine stuff and switch to the heavier models only when I need them, though I have to select them manually. Maybe it would be good if you also added some of those models at a lower cost. It might also be nice to have different types of Auto (auto modes with different model ranges and different prices) instead of just one.

Occasional need for speed
I really liked the option of fast model responses at times, even a fast Auto mode, and I would be ready to pay extra for that occasionally. In Cline, I use my Cerebras API account for this. Offering fast options is something you could consider, of course also at a higher equivalent API price.

The big thinking/writing discrepancy
There is one other issue I have noticed in GPT-5, but also in Claude and other models. When debugging or trying to solve an issue, the models' thinking can be very different from what they say or present. The thinking is usually hidden, but I like to look at it. In the response, the models are very optimistic and positive: they are aware that the code is not working, yet they avoid addressing the issues, focus too much on what is good, and even modify tests so the bugs do not show, without making the quick fixes explicit. They seem very human, almost lying sometimes. I try to compensate for it with my prompts, with partial success. I am not sure whether this behaviour comes from the system prompt you or the model providers supply, or whether it is really an emergent phenomenon of the models. In any case, I would prefer honest models that are open about all issues and failing tests instead of cheating a bit, especially when vibe-coding larger, more complex projects. I would even prefer an overly critical agent that brings up and remembers all issues. In the end, a more critical agent will save work, and yes, reduce API tokens used (and profit), and may be more frustrating to work with, but it can actually help to achieve the goal.

# Response
Overall, to me, Cursor-GPT5 is excellent for starting new projects, but not great on larger tasks (compared to Claude Code and Cline-Horizon Beta). Please improve the system prompts and tool usage, and maybe add some context management options.
Free GPT-5 usage was great while it lasted, but please also give us other reasons to keep using that model with Cursor. A good Auto mode, better agents, forwarding API pricing benefits: there are many places where you can still make a difference against the competition.
Thanks for this free test, and for providing all the services and a nice coding interface to make this possible.


Feedback on free GPT-5

I’ve spent a lot of time testing the free GPT-5 and wanted to share my experience.

Autonomy
When starting a fresh chat, GPT-5 works well. But as the context grows, it becomes noticeably lazier and its “agent” qualities drop. Even with a .md checklist and clear instructions, the longer it runs, the slower and less productive it gets. Eventually, it produces only 1–3 lines per step, then repeats summaries of what it’s already done, often for dozens of iterations. This is fine for free use, but paying for it would be costly.

Manual control
With specific, targeted instructions, GPT-5 performs very well. It focuses on exactly what you ask for, without touching unrelated code. In this sense, it’s often more precise than Claude. Claude can produce more code — sometimes high-quality — but often outside the scope of the request, requiring closer supervision.

Cursor integration
Integration with Cursor’s tools is weak. GPT-5 handles basic tools fine, but struggles with custom rules. It can read them if prompted, yet often ignores them even in “rules always” mode. It also fails to use many other Cursor features effectively.

I haven’t tested as much as you, but the fall-off in quality as the context window grows is something I’ve noticed as well. The first prompt in a new chat got more done than the last 20 prompts in the old chat (at 50% context window in Max mode).

I agree about the excessive number of chat turns required. Sometimes you're mid-task in a larger refactoring, with let's say 50+ file changes still pending, and then you need 50 chat turns because it changes one line/file at a time and then provides a full summary.

It’s possible to debate a little with the model, and if you tell it to “minimize chat turns” (exact term), that seems to help in those situations.

Alternatively, I ask it to provide a summary of the tasks, what’s been done and what’s missing, then close the chat and open a new one with that summary. That does wonders.