Enhancing System Robustness and Reliability by Allowing Retry Configuration on MCP Client Tool Calls

Current Issues:

When the MCP client makes tool calls to the MCP server, it frequently hits temporary errors that cause those calls to fail. We have observed that once a tool call fails, the system may switch to a different MCP server or abandon MCP server capabilities altogether and rely solely on internal models. This is not what users expect of the Agent/LLM, and it significantly degrades the user experience.

Common reasons for tool call failures:

  • Temporary network disruptions
  • Temporary server overloads
  • Timeouts caused by processing delays or insufficient resources
  • Connection resets or other transient communication errors

Therefore, we would like retries for failed tool calls to be configurable via .rules, config.json, or a similar configuration file.
Currently, retry behavior depends solely on the MCP client (host-side) settings, which users cannot control directly.
By comparison, our engineers have noted that applications like Claude Desktop retry several times with backoff after an initial failure, which leads to a better user experience.

Example configuration:

{
  "mcpClient": {
    "networkRetry": {
      "enabled": true,
      "attempts": 3,
      "backoffFactor": 2
    }
  }
}

An alternative shape:

{
  "mcp": {
    "toolCalls": {
      "retryEnabled": true,
      "maxRetries": 3,
      "retryDelay": 1000,
      "exponentialBackoff": true
    }
  }
}
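
To make the request more concrete, here is a rough sketch of how a client might read the `mcp.toolCalls` block above. Everything beyond the four field names is an assumption on my part: the type name `ToolCallRetryConfig`, the helpers `resolveRetryConfig` and `retryDelayMs`, the defaults, and the reading of `retryDelay` as milliseconds are all hypothetical, not existing Cursor or MCP APIs.

```typescript
// Hypothetical settings type; field names follow the "mcp.toolCalls" example.
interface ToolCallRetryConfig {
  retryEnabled: boolean;
  maxRetries: number;
  retryDelay: number; // assumed to be the base delay in milliseconds
  exponentialBackoff: boolean;
}

// Assumed defaults: no retries unless the user opts in.
const DEFAULT_RETRY_CONFIG: ToolCallRetryConfig = {
  retryEnabled: false,
  maxRetries: 0,
  retryDelay: 1000,
  exponentialBackoff: false,
};

// Merge a parsed config object over the defaults and clamp numeric fields.
function resolveRetryConfig(raw: unknown): ToolCallRetryConfig {
  const root = raw as { mcp?: { toolCalls?: Partial<ToolCallRetryConfig> } } | null | undefined;
  const overrides: Partial<ToolCallRetryConfig> = root?.mcp?.toolCalls ?? {};
  const merged: ToolCallRetryConfig = { ...DEFAULT_RETRY_CONFIG, ...overrides };
  return {
    ...merged,
    maxRetries: Math.max(0, Math.floor(merged.maxRetries)),
    retryDelay: Math.max(0, merged.retryDelay),
  };
}

// Delay before the n-th retry (1-based): constant, or doubling each time
// when exponentialBackoff is set (1000, 2000, 4000, ... for retryDelay = 1000).
function retryDelayMs(cfg: ToolCallRetryConfig, retry: number): number {
  return cfg.exponentialBackoff
    ? cfg.retryDelay * Math.pow(2, retry - 1)
    : cfg.retryDelay;
}
```

With the example values, `retryDelayMs` would wait 1000 ms, 2000 ms, and 4000 ms before the first, second, and third retry.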

Implementation of Retry Mechanisms for MCP Server Tool Call Failures

To implement retries on the client side when tool calls to the MCP server fail, the following approaches can be considered:

Basic Retry Implementation Method

  1. Exception Handling:

    • Use try-catch blocks during tool calls to catch exceptions.
    • Identify transient errors such as network issues or timeouts.
  2. Selecting Retry Strategies:

    • Exponential Backoff: Gradually increase retry intervals (e.g., 1 second, 2 seconds, 4 seconds, 8 seconds…).
    • Adding Jitter: Introduce random delays to prevent simultaneous retries from multiple clients.
    • Maximum Retry Count: Set an upper limit to avoid infinite retry loops.
  3. Differentiation by Error Types:

    • Temporary errors (network issues, server overload): Retry
    • Permanent errors (authentication errors, nonexistent resources): Fail immediately
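
The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not how any existing MCP client does it: `callWithRetry`, `RetryOptions`, `backoffDelayMs`, and `isTransientError` are hypothetical names, and the set of error codes treated as transient (standard Node.js socket and DNS codes) would need to be adapted to the transport in use.

```typescript
interface RetryOptions {
  maxRetries: number;    // upper bound on retries after the first attempt
  baseDelayMs: number;   // delay before the first retry
  backoffFactor: number; // multiplier per retry (2 gives 1 s, 2 s, 4 s, ...)
  jitterMs: number;      // up to this much random extra delay per retry
  isTransient: (err: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number;                 // injectable for tests
}

// Node.js error codes that usually indicate a transient fault. Which errors
// count as transient for a given MCP transport is an assumption here.
const TRANSIENT_CODES = new Set<string>(["ECONNRESET", "ETIMEDOUT", "EAI_AGAIN"]);

// Step 3: classify errors. Anything without a known transient code is
// treated as permanent and fails immediately.
function isTransientError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null | undefined)?.code;
  return typeof code === "string" && TRANSIENT_CODES.has(code);
}

// Step 2: exponential backoff plus a random extra delay (jitter).
function backoffDelayMs(retry: number, opts: RetryOptions, random: () => number): number {
  const exponential = opts.baseDelayMs * Math.pow(opts.backoffFactor, retry - 1);
  return exponential + Math.floor(random() * opts.jitterMs);
}

// Step 1: wrap the call in try/catch and retry transient failures
// until the retry budget is used up.
async function callWithRetry<T>(call: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const random = opts.random ?? Math.random;
  for (let retriesUsed = 0; ; retriesUsed++) {
    try {
      return await call();
    } catch (err) {
      if (!opts.isTransient(err) || retriesUsed >= opts.maxRetries) throw err;
      await sleep(backoffDelayMs(retriesUsed + 1, opts, random));
    }
  }
}
```

A caller would pass its own tool-call function as `call`, for example `callWithRetry(() => invokeTool(args), opts)`, where `invokeTool` stands in for whatever the client actually uses. The jitter here is added on top of the exponential delay; other variants pick a random delay below the exponential ceiling instead.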

Additional Considerations

  • Timeout Settings: Assign appropriate timeout values to each request.
  • Circuit Breaker: Suspend requests temporarily after a series of consecutive failures.
  • Status Monitoring: Track retry frequency and success rates.
  • Error Logging: Maintain detailed logs to analyze failure causes.
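
For the timeout and circuit breaker items, a minimal sketch might look like the following. The names `withTimeout` and `CircuitBreaker`, the failure threshold, and the cooldown are all hypothetical choices for illustration; monitoring and logging are left out.

```typescript
// Reject if the wrapped promise does not settle within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying request.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`tool call timed out after ${ms} ms`)),
      ms
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

// Suspends requests after `threshold` consecutive failures, then allows a
// trial request once `cooldownMs` has passed ("half-open" state).
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, mainly for tests
  }

  // True when a request may be sent.
  canRequest(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  // A success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failure at or past the threshold (re)opens the breaker from now.
  recordFailure(): void {
    this.failures++;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

A client would check `canRequest()` before each tool call, wrap the call in `withTimeout`, and report the outcome through `recordSuccess()` or `recordFailure()`.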

Hi @deanrie, requests for improved MCP reliability have come up in other threads, and you indicated that the team is aware of this as a potential improvement. Has there been any progress in this area? Lost MCP connectivity and failed tool calls are a daily, ongoing annoyance.

Hey @rsb, the team is actively monitoring MCP reliability. We recently shipped a few fixes, like reconnecting MCP after VPN drops, and we’re still investigating a few related connection stability issues.

As for user-configurable retry policies specifically, like max retries and backoff settings via config, we don’t have a clear ETA yet.

A couple things to keep in mind for now:

  • If your MCP server disconnects after sleep and wake or after switching networks, reloading the window (Cmd/Ctrl + Shift + P then Developer: Reload Window) is the quickest way to recover.
  • For MCP servers on HTTP/SSE, reconnect behavior is improving in recent versions, so make sure you’re on the latest Cursor build.

I’ll update this thread if we have news about configurable retries. Your ongoing reports really help us raise the priority of this work.

Dean, thank you for taking the time to provide a detailed update. For me, the two situations that produce the most MCP connectivity failures are coming back to my machine with Cursor after a long suspend or locked screen (on Linux), and Cursor failing to proactively reconnect when the MCP server is stopped and restarted, especially after some time has elapsed between the stop and the start. MCPs are part of my daily workflow, so I appreciate you and the team taking the time to make them work better.