Llama 3.3 70B is better than Claude and GPT-4o for tool use

I don’t like to be conspiratorial, but I feel like something should be said about this. Even the 8B model beats GPT-4o…

GPT-4o and Claude 3.5 are the only models that can do Tool Use, a.k.a. Agent functionality, in Cursor. Cursor’s business model is to sell 500 fast requests to either of these models for $20 a month.

Groq released their benchmarks here, showing Llama 3.3 70B Tool Use consistently beating Claude 3.5 and GPT-4o.

Is there an actual reason why the Cursor team hasn’t implemented Tool Use for Llama models?

I feel that there may be a conflict of interest. Perhaps they grew too fast and are now locked into their current growth model?

A model’s ability to run tools is likely not the best metric to choose by here, as most of the tools reuse the same model to generate or evaluate code or terminal commands.

While Llama may be the best at executing tools, it’s not the best at evaluating or writing code, so it would perform worse at the kind of work Cursor actually needs from a model.
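
To make that distinction concrete, here is a minimal sketch of a single tool-use turn against Groq’s OpenAI-compatible endpoint. The model ID and the `run_terminal_command` tool are illustrative assumptions, not Cursor’s actual internals: the point is that picking the tool and writing the command inside its arguments are both done by the same model.

```python
# A minimal sketch of one agent-style tool-use turn, assuming Groq's
# OpenAI-compatible endpoint. The model ID and tool definition below are
# illustrative assumptions, not Cursor's actual internals.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # Groq's OpenAI-compatible API
    api_key="YOUR_GROQ_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "run_terminal_command",  # hypothetical tool, for illustration
        "description": "Execute a shell command in the user's workspace.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumed Groq model ID
    messages=[{"role": "user", "content": "Run the test suite for this repo."}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    # Picking the right tool is what tool-use benchmarks score...
    print(call.function.name)
    # ...but the command itself is ordinary generation by the same model,
    # which those benchmarks don't measure.
    print(call.function.arguments)
```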

We are always evaluating models internally, and we would use Llama if a use case came along where it excelled!