Enhancing System Robustness and Reliability by Allowing Retry Configuration on MCP Client Tool Calls

Current Issues:

When the MCP client makes tool calls to the MCP server, it frequently encounters transient errors that cause those calls to fail. We have observed that after a single failed tool call, the system may switch to a different MCP server or abandon MCP server capabilities altogether, falling back solely on its internal models. This behavior diverges from what users expect of the Agent/LLM and significantly degrades the user experience.

Common reasons for tool call failures:

  • Temporary network disruptions
  • Temporary server overloads
  • Timeouts caused by processing delays or insufficient resources
  • Connection resets or other transient communication errors

We would therefore like retries for failed tool calls to be configurable via .rules, config.json, or a similar mechanism.
Currently, the retry policy is determined entirely by the MCP client (host-side) implementation, which users cannot control directly; this is inconvenient.
For comparison, our engineers have observed that applications such as Claude Desktop implement effective retry strategies (retrying several times after an initial failure, with backoff), which noticeably improves the user experience.

Example configurations (two possible shapes):

{
  "mcpClient": {
    "networkRetry": {
      "enabled": true,
      "attempts": 3,
      "backoffFactor": 2
    }
  }
}

{
  "mcp": {
    "toolCalls": {
      "retryEnabled": true,
      "maxRetries": 3,
      "retryDelay": 1000,
      "exponentialBackoff": true
    }
  }
}
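To make the intended semantics concrete, here is a minimal sketch of how a client might derive the retry schedule from the first configuration shape above. The field names (`attempts`, `backoffFactor`) are illustrative proposals from this document, not existing MCP client settings:

```python
def retry_delays(attempts: int, backoff_factor: float, base_delay: float = 1.0) -> list[float]:
    """Delay (in seconds) before each retry: base_delay * backoff_factor ** n."""
    return [base_delay * backoff_factor ** n for n in range(attempts)]

# With "attempts": 3 and "backoffFactor": 2, the client would wait
# 1 s, 2 s, then 4 s between successive retries.
print(retry_delays(3, 2))  # [1.0, 2.0, 4.0]
```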

Implementation of Retry Mechanisms for MCP Server Tool Call Failures

To implement retries on the client side when tool calls to the MCP server fail, the following approaches can be considered:

Basic Retry Implementation Method

  1. Exception Handling:

    • Use try-catch blocks during tool calls to catch exceptions.
    • Identify transient errors such as network issues or timeouts.
  2. Selecting Retry Strategies:

    • Exponential Backoff: Gradually increase retry intervals (e.g., 1 second, 2 seconds, 4 seconds, 8 seconds…).
    • Adding Jitter: Introduce random delays to prevent simultaneous retries from multiple clients.
    • Maximum Retry Count: Set an upper limit to avoid infinite retry loops.
  3. Differentiation by Error Types:

    • Temporary errors (network issues, server overload): Retry
    • Permanent errors (authentication errors, resource nonexistence): Immediate failure
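The three steps above can be sketched in one small helper. This is not an existing MCP client API; `TransientError` and `PermanentError` are hypothetical stand-ins for whatever error types the client actually surfaces:

```python
import random
import time

# Hypothetical error classes; only TransientError is worth retrying.
class TransientError(Exception): ...
class PermanentError(Exception): ...

def call_with_retry(call, max_retries=3, base_delay=1.0, factor=2.0):
    """Retry `call` on transient errors with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()                      # 1. attempt the tool call
        except PermanentError:
            raise                              # 3. permanent error: fail immediately
        except TransientError:
            if attempt == max_retries:
                raise                          # retry budget exhausted
            delay = base_delay * factor ** attempt
            delay += random.uniform(0, delay)  # 2. jitter to de-synchronize clients
            time.sleep(delay)
```

The bounded loop enforces the maximum retry count, and the jitter term spreads out retries from multiple clients that failed at the same moment.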

Additional Considerations

  • Timeout Settings: Assign appropriate timeout values to each request.
  • Circuit Breaker: Suspend requests temporarily after a series of consecutive failures.
  • Status Monitoring: Track retry frequency and success rates.
  • Error Logging: Maintain detailed logs to analyze failure causes.
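As a sketch of the circuit-breaker idea above (the class and its parameters are illustrative, not part of any MCP implementation):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    reject calls for `cooldown` seconds before allowing another try."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # time the circuit tripped, if any

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: permit one probe call
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

The retry logic would consult `allow()` before each attempt and call `record()` with the outcome, so a server that is down hard stops receiving traffic until the cooldown elapses.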