MCP Evaluation: Surprising Results in Cursor

Where does the bug appear (feature/product)?

Cursor IDE

Describe the Bug

I recently tested the Observe, Inc. MCP Server with various AI client tools (Anysphere Cursor, Anthropic Claude Desktop, Claude Code, Augment Code, Windsurf).
Results were striking:
:white_check_mark: Claude Desktop/Code, Windsurf, Augment → answered correctly and produced valid Observe links.
:cross_mark: Cursor (Claude + GPT models) → consistently failed to answer correctly and produced no valid links.
My evaluation table is attached.

This points to a Cursor-specific issue with MCP integration rather than the MCP server itself.
:backhand_index_pointing_right: Could anyone help me connect with the Cursor team to figure out what’s going on?

Steps to Reproduce

Request ID: d86a75eb-8905-424e-a431-7bb9b540d06a

Top errors in my application

Exported on 8/19/2025 at 18:11:07 PDT from Cursor (1.4.5)


User

What are the top 5 errors in my application? Use the Observe MCP server to answer the question.


Cursor

I’ll run a query in Observe to fetch the top 5 errors across your application. Then I’ll summarize the results here.

I’m going to generate a query card that lists the top 5 error messages and their counts (last 24h) across all services, and then report the results here.

  • I created an Observe query card that surfaces the top 5 error messages and their counts over the last 24 hours across all services. Open the card titled “Top 5 Error Messages Across All Services (24h)” in your Observe side panel to see the live results.

If you want me to scope this to a specific service or time window, tell me the service name and range (e.g., last 7d) and I’ll update the card.

Expected Behavior

https://claude.ai/share/acb89e9d-2f97-447b-b4e2-e5c27719c8ee

Screenshots / Screen Recordings

Operating System

macOS

Current Cursor Version (Menu → About Cursor → Copy)

Version: 1.4.5 (Universal)
VSCode Version: 1.99.3
Commit: af58d92614edb1f72bdd756615d131bf8dfa5290
Date: 2025-08-13T02:08:56.371Z (6 days ago)
Electron: 34.5.8
Chromium: 132.0.6834.210
Node.js: 20.19.1
V8: 13.2.152.41-electron.0
OS: Darwin arm64 23.3.0

Does this stop you from using Cursor

No - Cursor works, but with this issue

We’ll take a look!


Could you share some more details and request IDs? I'm not sure what we would be doing differently here.


Here is the request ID: d86a75eb-8905-424e-a431-7bb9b540d06a

@andrewh @msfeldstein If you need more context, I'm happy to jump on a Zoom call to show you what's going on.

That part is confusing to me as well.

# Top errors in my application
_Exported on 8/21/2025 at 13:52:29 PDT from Cursor (1.4.5)_

---

**User**

What are the top 5 errors in my application? Use the Observe MCP server to answer the question.

---

**Cursor**

I'll help you find the top 5 errors in your application using the Observe MCP server. Let me generate a query card to analyze your error data.

Based on the Observe query card I generated, here's what I found for analyzing the top 5 errors in your application:

## Query Summary
The analysis uses the `apm_service_error_count_with_status` metric to identify and rank errors in your application. 

## Query Details
- **Dataset**: `tracing_service_metrics` 
- **Metric**: `apm_service_error_count_with_status`
- **Grouping**: Errors are grouped by:
  - Status code
  - Service name  
  - Operation name
- **Output**: Top 5 errors by total error count

## Query Pipeline
```
filter (metric = "apm_service_error_count_with_status")
statsby total_errors:sum(value), group_by(status_code:tags."status.code", service_name:tags."service.name", operation:tags."span.name")
topk 5, max(total_errors)
```

This query will show you:
1. The 5 most frequent errors in your application
2. Which services are generating the most errors
3. Which specific operations are failing
4. The HTTP status codes associated with these errors
5. The total error counts for each combination

To see the actual results, you would need to run this query card in your Observe dashboard. The query card provides the structure to analyze your error patterns and identify the most critical issues affecting your application.

---

**User**

Why don't you run the opal query and show me the result?

---

**Cursor**

You're absolutely right! Let me execute the query to get the actual results showing your top 5 errors. Let me run the query card to retrieve the data.

I see the issue - the tool is generating query cards rather than executing them to return actual data. Let me try a different approach to get the actual error data for you.

Let me check your MCP configuration to better understand how the Observe server is set up, which might help me get the actual query results.

Now I understand the setup. Your Observe MCP server is properly configured, but it appears the tools generate query cards rather than execute them directly. Let me try to get more context about available error metrics in your system.

I apologize for the confusion. It appears that the Observe MCP server tools I have access to are designed to generate query cards (query definitions) rather than execute them and return actual data results. 

The query card I generated earlier contains the correct OPAL query structure:

```opal
filter metric = "apm_service_error_count_with_status"
statsby error_count:sum(value), 
       status_code:any_not_null(parse_json(string(tags))."status.code"), 
       service_name:any_not_null(parse_json(string(tags))."service.name"), 
       span_name:any_not_null(parse_json(string(tags))."span.name"), 
       status_message:any_not_null(parse_json(string(tags))."status.message"), 
       group_by(service_name_grouped:tags."service.name", 
               span_name_grouped:tags."span.name", 
               status_code_grouped:tags."status.code", 
               status_message_grouped:tags."status.message")
topk 5, max(error_count)
```

To see the actual results, you would need to:

1. **Copy this query** into your Observe dashboard
2. **Run it in a worksheet** or query interface
3. **Or** use the generated query card directly in Observe

The MCP tools appear to be designed for query generation and planning rather than direct execution. Would you like me to help you with anything else related to structuring queries for your error analysis, or would you prefer to run this query in your Observe interface to get the actual data?

In Claude Code with claude-sonnet-4, I got the following response.

> /model
  ⎿  Set model to sonnet (claude-sonnet-4-20250514)

> What are the top 5 errors in my application? Use the Observe MCP server to answer the question.

⏺ I'll use the Observe MCP server to find the top 5 errors in your application.

⏺ observe - generate-query-card (MCP)(prompt: "What are the top 5 errors in my application?")
  ⎿  Top 5 Errors in Application
  ⎿  # Time Range
     8/22/2025, 4:41:15 PM to 8/22/2025, 8:41:15 PM
  ⎿  # Inputs
     - tracing_service_metrics: 42496705
  ⎿ filter (metric = "apm_service_error_count_with_status")
    statsby total_error_count:sum(value), group_by(status_code:tags."status.code", service_name:tags."service.name", operation:tags."span.name",
     status_message:tags."status.message")
    … +1 line (ctrl+r to expand)

  ⎿ # Observe link to this query
            This should be included in the response and analysis returned to the user
            IMPORTANT: Never modify the url string- use it exactly as is <url>https://146268791759.observeinc.com/workspace/41249703/worksheet/f
    … +1 line (ctrl+r to expand)

  ⎿  # Query Results

     Time Range: 8/22/2025, 4:41:15 PM to 8/22/2025, 8:41:15 PM
     … +9 lines (ctrl+r to expand)

⏺ Based on the query results from your Observe platform, here are the top 5 errors in your application:

  1. Frontend POST Error (500) - 1,654 errors
    - Service: frontend
    - Status: 500 Internal Server Error
  2. Checkout Payment Failure (gRPC 13) - 1,654 errors
    - Service: checkout
    - Operation: CheckoutService/PlaceOrder
    - Message: Payment card charge failed with invalid token error
  3. Payment Service Error - 1,654 errors
    - Service: payment
    - Operation: PaymentService/Charge
    - Related to the checkout payment failures
  4. Frontend Checkout API Error (500) - 0 recent errors
    - Service: frontend
    - Endpoint: POST /api/checkout
  5. Frontend Redirect (308) - 0 recent errors
    - Service: frontend
    - Operation: GET request

  The main issue appears to be a payment processing failure causing cascading errors across your frontend and checkout services.

  https://146268791759.observeinc.com/workspace/41249703/worksheet/f3hj7nwi~?s=44328-318uwhud

Can you turn off privacy mode so I can see the conversations and tools?

Also, can you try using Sonnet, the same model as Claude Code, so we can rule out this being a model issue?

I think I need to contact our admin to turn off the privacy mode.

I tried to use Sonnet. Here is the request ID: a8999b7b-492b-47c3-ba3b-0642da78c09e

# Top errors in my application
_Exported on 8/24/2025 at 21:36:11 PDT from Cursor (1.4.5)_

---

**User**

What are the top 5 errors in my application? Use the Observe MCP server to answer the question.


---

**Cursor**

I'll help you find the top 5 errors in your application using the Observe MCP server. Let me generate a query card to analyze your error data.

Great! I've generated an Observe query card to find the top 5 errors in your application. Here's what the analysis shows:

**Query Title:** Top 5 Errors in Application

**Summary:** This query aggregates total error counts across all services and operations, then shows the top 5 errors by count with operation, service, and span type details.

**Query Pipeline:**
1. **Filter** for the `apm_service_error_count` metric
2. **Group by** operation (span name), service name, and span type  
3. **Sum** the error counts for each group
4. **Show top 5** errors by total count

**Data Source:** Uses the `service_metrics` dataset (ID: 42496705)

This query will give you a breakdown of:
- Which operations are generating the most errors
- Which services are most affected
- The total error count for each combination
- Span type information for additional context

The results will be ranked by total error count, so you'll see your most problematic areas first. This should help you prioritize which errors to address based on their frequency and impact on your application.


Called generate-query-card

Parameters:
{
  "prompt": "Show me the top 5 errors in my application"
}

Result:

{"title":"Top 5 Errors in Application","summary":"Aggregate total error counts across all services and operations, then show the top 5 errors by count with operation, service, and span type details.","inputs":[{"inputName":"service_metrics","datasetId":"42496705"}],"pipeline":"filter (metric = \"apm_service_error_count\")\nstatsby total_error_count:sum(value), group_by(operation:tags.\"span.name\", service:tags.\"service.name\", span_type:tags.\"span.type\")\ntopk 5, max(total_error_count)"}

Just noting that I haven't forgotten this; we have a couple of people working on MCP and are using top forum posts with bugs as reference!


@andrewh @msfeldstein Do you need anything else from me? I'd like to understand why this issue happens only with Cursor and to help resolve it for our mutual customers. Several Observe customers use Cursor.

Posting this for visibility and for anyone building MCP servers.

@msfeldstein and I had a short meeting to dig into this issue. It turns out that the Cursor agent prioritizes `structuredContent` when it's present, whereas other agents use only `content`. The Observe MCP server currently returns both `content` and `structuredContent`, but more information is included in `content` at the moment. That's why the Cursor agent produces different results from other AI agents: users may see missing or mismatched data in Cursor compared to Augment, Claude, Windsurf, and others.

To ensure a consistent experience across AI agents, we (Observe) will remove structuredContent from our MCP server responses going forward.
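For anyone else hitting this: here is a minimal sketch in Python of the response shape that caused the mismatch. The field names `content` and `structuredContent` come from the MCP tool-result schema; the text, title, and URL below are made-up placeholders, not the real Observe server output. The point is that when `structuredContent` carries only a subset of what's in `content`, an agent that prefers `structuredContent` (Cursor) silently drops the rest, including the Observe link.

```python
def build_tool_result(include_structured: bool) -> dict:
    """Sketch of an MCP tool result. The text content carries the full
    answer, including the result link; structuredContent (when present)
    carries only a subset of that information."""
    full_text = (
        "Top 5 errors: ...\n"
        "Observe link: https://example.observeinc.com/worksheet/abc"  # placeholder URL
    )
    result = {
        "content": [{"type": "text", "text": full_text}],
    }
    if include_structured:
        # Subset of the data, with no link. An agent that prioritizes
        # structuredContent never sees the URL above.
        result["structuredContent"] = {"title": "Top 5 Errors"}
    return result

# Before the fix: both fields present, so Cursor reads only the subset.
before = build_tool_result(include_structured=True)

# After the fix: structuredContent is omitted, so every agent falls back
# to the same text content, link included.
after = build_tool_result(include_structured=False)
```

If you do want to keep `structuredContent` (it is useful for agents that consume typed output), the safer pattern is to make it a complete superset mirror of `content`, never a summary of it.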

:rocket: Update on MCP Evaluation
After removing structuredContent (thanks to @msfeldstein for digging into this with me), Cursor + Claude 4 Sonnet now works correctly :white_check_mark:. This brings Cursor closer to parity with the other tools.
