Grok 3 vs Claude 3.5-3.7 vs O3-mini vs DeepSeek R1

normalnormie · February 24, 2025, 7:34pm

Currently there's a lot of talk about Grok 3 being the new beast in town, and about DeepSeek-R1's impressive reasoning. From the past few days of testing, I'm finding O3-mini to be the best: Grok 3 is too generic, DeepSeek-R1 too specific (adding or hallucinating stuff), and Claude also too specific but without hallucinations (academic). O3-mini just feels right. Sometimes it doesn't write files, but that's improving. Let's look at a real example from today's experiments. The prompt:

# CONTEXT: Developing a financial visualization module in Panda3D (primarily a 3D engine) that requires 2D rendering capabilities. Challenges include coordinate system conversion, real-time data updates, and maintaining visual clarity for market data. Our Python and comment guidelines need to be followed strictly.

# OBJECTIVE: Create a reusable 2D market chart component demonstrating:  
- Candlestick plot rendering  
- Moving average overlays  
- Volume histogram bars  
- Interactive zoom/pan controls  
- Dynamic data updates  

# STYLE: Follow Panda3D's official example patterns (clean OOP structure, explicit coordinate management). Incorporate financial charting conventions from libraries like mplfinance. Use PEP8-compliant Python with type hints.

# TONE: Technical precision with didactic clarity. Target intermediate developers needing to extend 3D engine capabilities for 2D financial visualization.

# AUDIENCE: Python developers with basic Panda3D experience seeking advanced 2D/3D integration techniques for financial applications.

# RESPONSE: Provide complete Panda3D script with:  
1. Separate ChartManager class handling coordinate transformations  
2. DataSeries class for OHLC data management  
3. Real-time update mechanism using Panda3D's task manager  
4. Annotation system for price levels  
5. Performance optimizations for large datasets

O3-mini got it in one shot:

Claude 3.5 got it after resolving a display issue (which still persists):

Claude 3.7 got it after one iteration, with some display issues (but the charts are indeed pretty):

Grok 3 showed this after 4 iterations:

DeepSeek-R1 didn't solve the issues in 5 iterations.

RESULTS:

- Create a reusable 2D market chart: O3-mini, Claude 3.5 [3.7 was partly 3D]
- Candlestick plot rendering: O3-mini, Claude 3.5 [3.7]
- Moving average overlays: O3-mini, Claude 3.7
- Volume histogram bars: O3-mini, Claude 3.7
- Interactive zoom/pan controls: O3-mini, Claude 3.5 [3.7]
- Dynamic data updates: O3-mini
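For reference, the core of two of the requested pieces can be sketched without Panda3D at all. This is a minimal, framework-agnostic sketch, not any model's output: a `DataSeries` holding OHLC bars with a running-sum simple moving average, and a `ChartManager` that maps (bar index, price) into a normalized [-1, 1] box, assumed here to mirror Panda3D's `render2d` range (X horizontal, Z vertical). `Candle`, `sma`, `price_range`, `to_screen`, `zoom`, and `pan` are hypothetical names, and no Panda3D API is called.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Candle:
    """One OHLC bar plus its traded volume."""
    open: float
    high: float
    low: float
    close: float
    volume: float


@dataclass
class DataSeries:
    """Stores OHLC bars and derives overlays such as a simple moving average."""
    candles: List[Candle] = field(default_factory=list)

    def append(self, candle: Candle) -> None:
        # Reject bars whose body falls outside the high/low wick.
        if not (candle.low <= min(candle.open, candle.close)
                and max(candle.open, candle.close) <= candle.high):
            raise ValueError("inconsistent OHLC values")
        self.candles.append(candle)

    def sma(self, window: int) -> List[Optional[float]]:
        """Simple moving average of closes; None until the window fills.

        A running sum keeps this O(n) instead of O(n * window).
        """
        result: List[Optional[float]] = []
        running = 0.0
        for i, bar in enumerate(self.candles):
            running += bar.close
            if i >= window:
                running -= self.candles[i - window].close
            result.append(running / window if i >= window - 1 else None)
        return result

    def price_range(self) -> Tuple[float, float]:
        """(lowest low, highest high) over the whole series; needs >= 1 bar."""
        return (min(b.low for b in self.candles),
                max(b.high for b in self.candles))


class ChartManager:
    """Maps (bar index, price) into a [-1, 1] x [-1, 1] box.

    The box is assumed to mirror Panda3D's render2d range, with X horizontal
    and Z vertical; no Panda3D call is made here.
    """

    def __init__(self, series: DataSeries, first: int = 0,
                 count: Optional[int] = None) -> None:
        self.series = series
        self.first = first  # leftmost visible bar (changed by pan)
        # Number of visible bars (changed by zoom); at least 1 to avoid /0.
        self.count = count if count is not None else max(1, len(series.candles))

    def to_screen(self, index: float, price: float) -> Tuple[float, float]:
        lo, hi = self.series.price_range()
        span = (hi - lo) or 1.0  # flat series: avoid division by zero
        # +0.5 centers each bar inside its horizontal slot.
        x = -1.0 + 2.0 * (index - self.first + 0.5) / self.count
        z = -1.0 + 2.0 * (price - lo) / span
        return x, z

    def zoom(self, factor: float) -> None:
        # factor > 1 zooms in (fewer bars visible); keep at least 2 bars.
        self.count = max(2, int(self.count / factor))

    def pan(self, bars: int) -> None:
        # Shift the window; clamped at the left edge only in this sketch.
        self.first = max(0, self.first + bars)
```

In a real Panda3D app, the (x, z) pairs would presumably feed geometry built with something like `LineSegs` or `CardMaker` under `render2d`, and a task registered with `taskMgr.add` could call `append()` for live updates; that wiring is left out here.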

Happy to hear others' experiences!

XML ruleset used to create the COSTAR prompt:

<costarprompt name="COSTAR Prompt Instruction Set" version="1.0">
  <metadata>
    <author>Prompt Engineering Specialist</author>
    <created>2025-01-28</created>
    <scope>Prompt Structuring for LLMs</scope>
    <application-boundary>
      <limit>Applies to the structure and content of prompts for LLMs</limit>
      <limit>Does not govern the LLM's response content</limit>
      <limit>Excludes non-prompt related text</limit>
    </application-boundary>
  </metadata>
  <ruleset>
    <rule name="CONTEXT">
      <description>Provides background information for task understanding</description>
      <guidelines>
        <guideline>Include relevant scenario details</guideline>
        <guideline>Specify domain/task parameters</guideline>
        <example>Company launching new product in beauty tech sector</example>
      </guidelines>
    </rule>
    <rule name="OBJECTIVE">
      <description>Defines the primary task goal</description>
      <guidelines>
        <guideline>Use action-oriented language</guideline>
        <guideline>Specify measurable outcomes</guideline>
        <example>Create persuasive Facebook post driving product link clicks</example>
      </guidelines>
    </rule>
    <rule name="STYLE">
      <description>Dictates writing style and voice</description>
      <guidelines>
        <guideline>Reference specific style models (e.g., corporate, influencer)</guideline>
        <guideline>Specify professional/expert alignment</guideline>
        <example>Mimic Dyson's product launch style</example>
      </guidelines>
    </rule>
    <rule name="TONE">
      <description>Sets emotional/attitudinal parameters</description>
      <guidelines>
        <guideline>Specify emotional resonance</guideline>
        <guideline>Include tone examples</guideline>
        <example>Persuasive yet respectful for mature audience</example>
      </guidelines>
    </rule>
    <rule name="AUDIENCE">
      <description>Identifies target recipients</description>
      <guidelines>
        <guideline>Specify demographic details</guideline>
        <guideline>Include psychographic factors</guideline>
        <example>Older generation valuing simplicity and reliability</example>
      </guidelines>
    </rule>
    <rule name="RESPONSE">
      <description>Defines output format requirements</description>
      <guidelines>
        <guideline>Specify structural format</guideline>
        <guideline>Include technical requirements</guideline>
        <example>JSON structure for API consumption</example>
      </guidelines>
    </rule>
  </ruleset>
  <example id="Example-COSTAR">
    <description>
      A practical application of the CO-STAR framework for drafting a Facebook post.
    </description>
    <scenario>
      <p>
        Suppose you work as a social media manager and need help drafting a Facebook post to advertise your company’s
        new product.
      </p>
    </scenario>
    <prompt>
      <line># CONTEXT: I want to advertise my company’s new product. My company’s name is Alpha and the product is called Beta—
        a new ultra-fast hairdryer.</line>
      <line># OBJECTIVE: Create a Facebook post aimed at getting people to click on the product link to purchase it.</line>
      <line># STYLE: Adopt the writing style of successful companies that advertise similar products, such as Dyson.</line>
      <line># TONE: Persuasive</line>
      <line># AUDIENCE: Target the older generation on Facebook, focusing on what they look for in quality hair products.</line>
      <line># RESPONSE: Compose a concise yet impactful Facebook post.</line>
    </prompt>
  </example>
</costarprompt>
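Because the ruleset is plain XML, it can also drive prompt assembly mechanically. Below is a minimal sketch using only the standard library's `xml.etree.ElementTree`, assuming the section order is simply the document order of the `<rule>` elements and that each section renders as a `# NAME: text` line, as in the Panda3D prompt above. The helper names `rule_names` and `build_costar_prompt` are hypothetical, and `RULESET` is a trimmed copy that keeps only the rule names.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Trimmed copy of the ruleset above: only the rule names matter here.
RULESET = """<costarprompt name="COSTAR Prompt Instruction Set" version="1.0">
  <ruleset>
    <rule name="CONTEXT"/><rule name="OBJECTIVE"/><rule name="STYLE"/>
    <rule name="TONE"/><rule name="AUDIENCE"/><rule name="RESPONSE"/>
  </ruleset>
</costarprompt>"""


def rule_names(ruleset_xml: str) -> List[str]:
    """Return the name attribute of every <rule>, in document order."""
    root = ET.fromstring(ruleset_xml)
    return [rule.get("name", "") for rule in root.iter("rule")]


def build_costar_prompt(ruleset_xml: str, fields: Dict[str, str]) -> str:
    """Render one '# NAME: text' line per rule, failing on missing sections."""
    names = rule_names(ruleset_xml)
    missing = [n for n in names if n not in fields]
    if missing:
        raise KeyError(f"missing sections: {missing}")
    return "\n".join(f"# {n}: {fields[n]}" for n in names)


if __name__ == "__main__":
    # Field texts taken from the Alpha/Beta hairdryer example in the ruleset.
    print(build_costar_prompt(RULESET, {
        "CONTEXT": "I want to advertise my company's new product. My company's "
                   "name is Alpha and the product is called Beta, a new "
                   "ultra-fast hairdryer.",
        "OBJECTIVE": "Create a Facebook post aimed at getting people to click "
                     "on the product link to purchase it.",
        "STYLE": "Adopt the writing style of successful companies that "
                 "advertise similar products, such as Dyson.",
        "TONE": "Persuasive",
        "AUDIENCE": "Target the older generation on Facebook, focusing on what "
                    "they look for in quality hair products.",
        "RESPONSE": "Compose a concise yet impactful Facebook post.",
    }))
```

Failing loudly on a missing section is a deliberate choice in this sketch: the ruleset treats all six parts as required, so a silently skipped TONE or AUDIENCE line would be easy to miss.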

AbleArcher · February 25, 2025, 4:08am

Interesting comparison.
If there's one language models should excel at, it's Python.
I mostly use Pine Script, which is quite similar.

Subjective experience:

Claude 3.5 = , but it can be really dumb at following Pine Script syntax instructions. E.g., I've told it 100x, in every possible way, not to use 4 spaces for line-wrapping (a syntax error in Pine).
DeepSeek-R1: limited testing so far, but very good, and its thinking helps when Sonnet gets stuck (no o-series model ever has).
OpenAI models: useless (except for planning, but I'd rather just use R1).
Grok 3: very limited testing. It's reluctant to output code without explicit prompting, though it'll give a very in-depth analysis/plan. One-shot Pine Script output is quite bad without a sample, with the same indenting issue as Claude. It has helped me with a problem I'd been stuck on for ages, so I'm looking forward to having it in Cursor.

I've yet to try Sonnet 3.7, but I expect it and Grok 3 to be my go-tos, though it will mostly come down to how well each is integrated with Cursor.
