Grok 3 vs Claude 3.5-3.7 vs O3-mini vs DeepSeek R1

There’s a lot of talk right now about Grok being the new beast in town and about DeepSeek-R1’s impressive reasoning. After testing them over the past few days, I’m finding O3-mini to be the best. Grok-3 is too generic, DeepSeek-R1 is too specific (adding or hallucinating stuff), and Claude is also too specific but without hallucinations (academic). O3-mini just feels right; sometimes it doesn’t write files, but the situation is improving. Let’s look at a real example from today’s experiments. The prompt:

# CONTEXT: Developing a financial visualization module in Panda3D (primarily 3D engine) that requires 2D rendering capabilities. Challenges include coordinate system conversion, real-time data updates, and maintaining visual clarity for market data, our python and comments guidelines need to be followed strictly.

# OBJECTIVE: Create a reusable 2D market chart component demonstrating:  
- Candlestick plot rendering  
- Moving average overlays  
- Volume histogram bars  
- Interactive zoom/pan controls  
- Dynamic data updates  

# STYLE: Follow Panda3D's official example patterns (clean OOP structure, explicit coordinate management). Incorporate financial charting conventions from libraries like mplfinance. Use PEP8-compliant Python with type hints.

# TONE: Technical precision with didactic clarity. Target intermediate developers needing to extend 3D engine capabilities for 2D financial visualization.

# AUDIENCE: Python developers with basic Panda3D experience seeking advanced 2D/3D integration techniques for financial applications.

# RESPONSE: Provide complete Panda3D script with:  
1. Separate ChartManager class handling coordinate transformations  
2. DataSeries class for OHLC data management  
3. Real-time update mechanism using Panda3D's task manager  
4. Annotation system for price levels  
5. Performance optimizations for large datasets
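(End of prompt.) To make item 1 of the RESPONSE section concrete, here is a minimal sketch of the kind of coordinate transformation a ChartManager would need. It is not taken from any model's output. The `Viewport` class and its methods are hypothetical names, and the target range (x in [-aspect, aspect], y in [-1, 1]) is an assumption about a normalized 2D overlay plane, so adjust it to whatever node you actually draw on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Viewport:
    """Visible data window (hypothetical helper, not a Panda3D class)."""

    x_min: float  # first visible bar index (or timestamp)
    x_max: float  # last visible bar index (or timestamp)
    y_min: float  # lowest visible price
    y_max: float  # highest visible price

    def to_screen(self, x: float, y: float, aspect: float = 1.0) -> tuple[float, float]:
        """Map a data point to x in [-aspect, aspect] and y in [-1, 1]."""
        u = (x - self.x_min) / (self.x_max - self.x_min)  # 0..1 across the window
        v = (y - self.y_min) / (self.y_max - self.y_min)  # 0..1 up the window
        return ((2.0 * u - 1.0) * aspect, 2.0 * v - 1.0)

    def zoom(self, factor: float) -> "Viewport":
        """Return a new window zoomed horizontally around its centre.

        A factor greater than 1 zooms in; a factor below 1 zooms out.
        """
        centre = (self.x_min + self.x_max) / 2.0
        half_width = (self.x_max - self.x_min) / (2.0 * factor)
        return Viewport(centre - half_width, centre + half_width, self.y_min, self.y_max)

    def pan(self, dx: float) -> "Viewport":
        """Return a new window shifted horizontally by dx data units."""
        return Viewport(self.x_min + dx, self.x_max + dx, self.y_min, self.y_max)


window = Viewport(x_min=0, x_max=100, y_min=50, y_max=150)
bottom_left = window.to_screen(0, 50)
centre_point = window.to_screen(50, 100)
zoomed = window.zoom(2.0)
```

Keeping the window immutable means zoom and pan each return a new `Viewport`, so a redraw only has to re-run `to_screen` over the visible bars.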

O3-mini got it in one-shot:

Claude-3.5 got it after resolving a display issue (that still persists):

Claude-3.7 got it after 1 iteration, with some display issues (though the visuals are indeed pretty):

Grok-3 after 4 iterations showed this:

DeepSeek-R1 didn’t solve the issues in 5 iterations.

RESULTS:

  • Create a reusable 2D market chart (O3-mini, Claude3.5[3.7 was in part 3D])
  • Candlestick plot rendering (O3-mini, Claude3.5[3.7])
  • Moving average overlays (O3-mini, Claude3.7)
  • Volume histogram bars (O3-mini, Claude3.7)
  • Interactive zoom/pan controls (O3-mini, Claude3.5[3.7])
  • Dynamic data updates (O3-mini)

Happy to hear others’ experiences!

XML ruleset used to create the COSTAR prompt:

<costarprompt name="COSTAR Prompt Instruction Set" version="1.0">
  <metadata>
    <author>Prompt Engineering Specialist</author>
    <created>2025-01-28</created>
    <scope>Prompt Structuring for LLMs</scope>
    <application-boundary>
      <limit>Applies to the structure and content of prompts for LLMs</limit>
      <limit>Does not govern the LLM's response content</limit>
      <limit>Excludes non-prompt related text</limit>
    </application-boundary>
  </metadata>
  <ruleset>
    <rule name="CONTEXT">
      <description>Provides background information for task understanding</description>
      <guidelines>
        <guideline>Include relevant scenario details</guideline>
        <guideline>Specify domain/task parameters</guideline>
        <example>Company launching new product in beauty tech sector</example>
      </guidelines>
    </rule>
    <rule name="OBJECTIVE">
      <description>Defines the primary task goal</description>
      <guidelines>
        <guideline>Use action-oriented language</guideline>
        <guideline>Specify measurable outcomes</guideline>
        <example>Create persuasive Facebook post driving product link clicks</example>
      </guidelines>
    </rule>
    <rule name="STYLE">
      <description>Dictates writing style and voice</description>
      <guidelines>
        <guideline>Reference specific style models (e.g., corporate, influencer)</guideline>
        <guideline>Specify professional/expert alignment</guideline>
        <example>Mimic Dyson's product launch style</example>
      </guidelines>
    </rule>
    <rule name="TONE">
      <description>Sets emotional/attitudinal parameters</description>
      <guidelines>
        <guideline>Specify emotional resonance</guideline>
        <guideline>Include tone examples</guideline>
        <example>Persuasive yet respectful for mature audience</example>
      </guidelines>
    </rule>
    <rule name="AUDIENCE">
      <description>Identifies target recipients</description>
      <guidelines>
        <guideline>Specify demographic details</guideline>
        <guideline>Include psychographic factors</guideline>
        <example>Older generation valuing simplicity and reliability</example>
      </guidelines>
    </rule>
    <rule name="RESPONSE">
      <description>Defines output format requirements</description>
      <guidelines>
        <guideline>Specify structural format</guideline>
        <guideline>Include technical requirements</guideline>
        <example>JSON structure for API consumption</example>
      </guidelines>
    </rule>
  </ruleset>
  <example id="Example-COSTAR">
    <description>
      A practical application of the CO-STAR framework for drafting a Facebook post.
    </description>
    <scenario>
      <p>
        Suppose you work as a social media manager and need help drafting a Facebook post to advertise your company’s
        new product.
      </p>
    </scenario>
    <prompt>
      <line># CONTEXT: I want to advertise my company’s new product. My company’s name is Alpha and the product is called Beta—
        a new ultra-fast hairdryer.</line>
      <line># OBJECTIVE: Create a Facebook post aimed at getting people to click on the product link to purchase it.</line>
      <line># STYLE: Adopt the writing style of successful companies that advertise similar products, such as Dyson.</line>
      <line># TONE: Persuasive</line>
      <line># AUDIENCE: Target the older generation on Facebook, focusing on what they look for in quality hair products.</line>
      <line># RESPONSE: Compose a concise yet impactful Facebook post.</line>
    </prompt>
  </example>
</costarprompt>
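If you'd rather assemble the prompt in code than by hand, the six rules above can be turned into a small builder. This is only a sketch: `build_costar_prompt` and `COSTAR_SECTIONS` are hypothetical names, and the `# NAME: text` layout is simply copied from the example prompt in the ruleset.

```python
from typing import Mapping

# Section order taken from the ruleset above.
COSTAR_SECTIONS = ("CONTEXT", "OBJECTIVE", "STYLE", "TONE", "AUDIENCE", "RESPONSE")


def build_costar_prompt(sections: Mapping[str, str]) -> str:
    """Render a mapping of section name -> text as '# NAME: text' blocks.

    Raises ValueError if any of the six sections is missing or blank.
    """
    missing = [name for name in COSTAR_SECTIONS if not sections.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing COSTAR sections: {', '.join(missing)}")
    return "\n\n".join(
        f"# {name}: {sections[name].strip()}" for name in COSTAR_SECTIONS
    )


# Usage with the hairdryer example from the ruleset (texts shortened).
example_prompt = build_costar_prompt(
    {
        "CONTEXT": "I want to advertise my company's new product, Beta, an ultra-fast hairdryer.",
        "OBJECTIVE": "Create a Facebook post that gets people to click the product link.",
        "STYLE": "Adopt the writing style of successful companies such as Dyson.",
        "TONE": "Persuasive",
        "AUDIENCE": "The older generation on Facebook.",
        "RESPONSE": "Compose a concise yet impactful Facebook post.",
    }
)
```

Failing loudly on a missing section is a deliberate choice here, since a prompt that silently drops TONE or AUDIENCE is hard to spot when comparing models.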

Interesting comparison.
If there’s any language models should excel at, it’s Python.
I mostly use Pine Script which is quite similar.

Subjective Experience:

  • Claude 3.5 = :goat:, but it can be really dumb following Pine Script syntax instructions. e.g. I’ve told it 100x in every possible way not to use 4 spaces for line-wrapping (syntax error in Pine)
  • DeepSeek-r1, limited testing so far, but very good, and the thinking helps when Sonnet gets stuck (no o-series model ever has.)
  • OAI models, useless (except for planning, but I’d rather just use r1.)
  • Grok 3, very limited testing. Reluctant to output code without explicit prompting. It’ll give a very in depth analysis/plans though. One-shot Pine Script output is quite bad without a sample. Same indenting issue as Claude. It’s helped me with a problem I’ve been stuck on for ages, so I’m looking forward to having it in Cursor.

I’ve yet to try Sonnet 3.7, but I expect it and Grok 3 will be my go-tos, though it will mostly come down to how well they’re integrated with Cursor.
