Llama 3.3 70B is better than Claude and GPT-4o for tool use

I don’t like to be conspiratorial, but I feel like something should be said about this. Even the 8B model beats GPT-4o…

GPT-4o and Claude 3.5 are the only models that can do Tool Use, a.k.a. Agent functionality, in Cursor. Cursor’s business model is to sell 500 fast requests to either of these models for $20 a month.

Groq released their benchmarks here, showing Llama 3.3 70B Tool Use consistently beating Claude 3.5 and GPT-4o.

Is there an actual reason why the Cursor team hasn’t implemented Tool Use for Llama models?

I feel that there may be a conflict of interest. Perhaps they grew too fast and are now locked into their current growth model?

A model’s ability to run tools is likely not the best metric to choose by here, as most of the tools reuse the same model to generate or evaluate code or terminal commands.

While Llama may be the best at executing tools, it’s not the best at evaluating or writing code, so it would perform worse at the kind of work Cursor actually needs from a model.
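
To make that distinction concrete, here is a minimal sketch of a single tool-use turn against Groq’s OpenAI-compatible endpoint. The model ID and the `run_terminal_command` tool are illustrative assumptions, not Cursor’s actual internals: the point is that picking the tool and writing the command inside its arguments are both done by the same model.

```python
# A minimal sketch of one agent-style tool-use turn, assuming Groq's
# OpenAI-compatible endpoint. The model ID and tool definition below are
# illustrative assumptions, not Cursor's actual internals.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # Groq's OpenAI-compatible API
    api_key="YOUR_GROQ_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "run_terminal_command",  # hypothetical tool, for illustration
        "description": "Execute a shell command in the user's workspace.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumed Groq model ID
    messages=[{"role": "user", "content": "Run the test suite for this repo."}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    # Picking the right tool is what tool-use benchmarks score...
    print(call.function.name)
    # ...but the command itself is ordinary generation by the same model,
    # which those benchmarks don't measure.
    print(call.function.arguments)
```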

We are always evaluating models internally, and we would use Llama if a use case came along where it excelled!