Exa MCP Rate Limits and Context Drop: What Actually Fixes It
When Exa MCP returns a 429 mid-session in Claude, the failed tool call can go unnoticed: the model either hallucinates a result or loops into retries that burn tokens without progress. The rate limit is only part of the problem. The larger issue is what happens to conversation state when a tool call errors at the boundary between the MCP client and the model.
Why Failed MCP Calls Destabilize Sessions
Claude's context window treats each tool result as a message block. A failed call produces an error block where a result block was expected. Depending on how the MCP client surfaces the error, you get one of three failure modes:
- Silent null: The tool returns no content. Claude proceeds as if the search returned nothing, often hallucinating what it expected to find.
- Error string passed as content: The error message enters the context as if it were search results. Subsequent reasoning references the error string literally.
- Exception propagation: The MCP client throws, the session terminates, and any unsaved intermediate work is lost.
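One way to keep these three modes from leaking into the conversation is to normalize every tool response into an explicit outcome before it reaches the model. The sketch below is illustrative only: `ToolOutcome` and `classifyOutcome` are hypothetical names, and the checks are assumptions about how an MCP client might surface errors, not a description of any particular client.

```typescript
// Hypothetical result shape: every tool call resolves to one of these,
// so an error can never masquerade as search results.
type ToolOutcome<T> =
  | { status: "ok"; results: T[] }
  | { status: "empty" } // the search ran and found nothing
  | { status: "unavailable"; reason: string }; // the search did not complete

// Classify a raw client response. `raw` is whatever the client handed back;
// `error` is a caught exception, if any.
function classifyOutcome<T>(raw: unknown, error?: unknown): ToolOutcome<T> {
  if (error !== undefined) {
    // Exception propagation: capture it instead of terminating the session.
    const reason = error instanceof Error ? error.message : String(error);
    return { status: "unavailable", reason };
  }
  if (raw === null || raw === undefined) {
    // Silent null: report "unavailable", not "no results".
    return { status: "unavailable", reason: "tool returned no content" };
  }
  if (typeof raw === "string") {
    // A bare string where a result list was expected is most likely an error message.
    return { status: "unavailable", reason: raw };
  }
  if (Array.isArray(raw)) {
    return raw.length > 0 ? { status: "ok", results: raw as T[] } : { status: "empty" };
  }
  return { status: "unavailable", reason: "unrecognized response shape" };
}
```

The distinction between `empty` and `unavailable` is the point: the model can be told that a search found nothing, or that it never ran, and those call for different next steps.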
Exa's free tier gives 1,000 requests/month. At $7/1k for standard search and $12-15/1k for deep research, rate limits hit predictably if agents are not throttled.
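To see how quickly an unthrottled agent uses up that allowance, a back-of-the-envelope calculation helps. The sketch below uses only the figures quoted above. The helper names are hypothetical, and it assumes paid usage is billed only on calls beyond the free allowance, which is a simplification rather than a statement of Exa's billing rules.

```typescript
// Figures from the text: 1,000 free requests per month, $7 per 1,000 standard searches.
const FREE_REQUESTS_PER_MONTH = 1000;
const STANDARD_USD_PER_1K = 7;

// Days a monthly allowance lasts at a given daily call volume.
function daysUntilExhausted(
  requestsPerDay: number,
  allowance: number = FREE_REQUESTS_PER_MONTH,
): number {
  if (requestsPerDay <= 0) return Infinity;
  return allowance / requestsPerDay;
}

// Cost of requests beyond the free allowance (assumed billing model, see above).
function overageUsd(
  totalRequests: number,
  allowance: number = FREE_REQUESTS_PER_MONTH,
  usdPer1k: number = STANDARD_USD_PER_1K,
): number {
  const billable = Math.max(0, totalRequests - allowance);
  return (billable / 1000) * usdPer1k;
}

// Example: 10 sessions a day at 20 searches each is 200 calls per day.
const daysOfFreeTier = daysUntilExhausted(10 * 20); // 5 days
const monthlyOverage = overageUsd(10 * 20 * 30); // 6,000 calls, 5,000 billable, $35
```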
Fix 1: Client-Side Throttle Queue
Wrap your MCP tool calls in a queue with a configurable inter-request delay. For Exa's standard tier, 1 request per second stays under most rate limit thresholds:
import PQueue from 'p-queue';

// At most one call starts per 1,000 ms interval.
const queue = new PQueue({ intervalCap: 1, interval: 1000 });

// exaMcpClient is assumed to be an already-initialized MCP client.
async function searchWithThrottle(query: string) {
  return queue.add(() => exaMcpClient.search(query));
}

This does not eliminate the problem for burst-heavy agents that queue 20+ searches in a planning step. For those, you need tier-appropriate rate budgets or pre-flight budget checks.
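A pre-flight budget check can be sketched as a token bucket that the planner consults before it queues a burst. Everything here is hypothetical: the `SearchBudget` class, its parameters, and the numbers in the usage note are illustrations, not values published by Exa.

```typescript
// Hypothetical pre-flight budget: a token bucket consulted before queuing a burst.
class SearchBudget {
  private tokens: number;
  private lastRefill: number;
  private capacity: number;
  private refillPerSecond: number;
  private now: () => number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity; // max searches available at once
    this.refillPerSecond = refillPerSecond; // sustained rate
    this.now = now; // injectable clock, useful in tests
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Add tokens for the time elapsed since the last refill, capped at capacity.
  private refill(): void {
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefill = t;
  }

  // How many of `requested` searches could run right now.
  affordable(requested: number): number {
    this.refill();
    return Math.min(requested, Math.floor(this.tokens));
  }

  // Reserve `count` searches; returns false and reserves nothing if the budget is short.
  tryReserve(count: number): boolean {
    this.refill();
    if (count > this.tokens) return false;
    this.tokens -= count;
    return true;
  }
}
```

A planner that wants 20 searches can call `affordable(20)` first and trim its plan to the returned number, rather than queuing all 20 and discovering the limit partway through.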
Fix 2: Result Caching
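Cache search results keyed on the normalized query, with a TTL or a date component so stale results expire. Most research agents repeat near-identical queries across subtasks. A simple in-process LRU cache eliminates the redundant calls, though an exact-match cache only catches queries that normalize to the same string: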
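// lru-cache v10+ exposes a named export; older releases used a default export.
import { LRUCache } from 'lru-cache';

// SearchResult is assumed to be the result type returned by your search client.
const cache = new LRUCache<string, SearchResult>({
  max: 200,
  ttl: 1000 * 60 * 60 // 1 hour
});

async function cachedSearch(query: string) {
  const key = query.toLowerCase().trim();
  // A single get() avoids a miss if the entry expires between has() and get().
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const result = await searchWithThrottle(query);
  cache.set(key, result);
  return result;
}

For multi-session agents, persist the cache to Redis or a local SQLite file so warm results survive process restarts.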
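If you want the date to be part of the key rather than relying on the TTL alone, a small key builder is enough. This is a minimal sketch: `cacheKey` is a hypothetical helper, and bucketing by UTC day is an assumption about how fresh results need to be.

```typescript
// Build a cache key from the normalized query plus a UTC date bucket,
// so results cached on one day are not reused on the next.
function cacheKey(query: string, at: Date = new Date()): string {
  const normalized = query
    .toLowerCase()
    .trim()
    .replace(/\s+/g, " "); // collapse runs of whitespace
  const day = at.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return `${day}|${normalized}`;
}

// Queries that differ only in case or spacing map to the same key.
const exampleKey = cacheKey("  Exa   Rate Limits ", new Date("2024-05-01T23:59:00Z"));
// exampleKey === "2024-05-01|exa rate limits"
```

The same key format works unchanged for a persistent store, since it is a plain string.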
Fix 3: Switch to a More Lenient MCP
If your agent design requires high-frequency search bursts, Exa's per-request pricing and rate limits may not fit. Options:
- Scavio MCP at mcp.scavio.dev/mcp: credit-based at $0.005/credit with 250 free credits/month. The hosted MCP removes local process management and rate limit state lives server-side.
- Brave Search API: $5/mo free credit, ~1,000 queries. Simple JSON result format. The 50 QPS cap is generous for most agents.
- Tavily: 1,000 free credits/month, MCP-compatible. Purpose-built for LLM search with clean markdown output.
The migration path from Exa MCP to Scavio MCP is a small config change if you are using Claude Desktop's claude_desktop_config.json. Replace the existing search server entry with:
{
  "mcpServers": {
    "search": {
      "command": "npx",
      "args": ["-y", "@scavio/mcp"],
      "env": {
        "SCAVIO_API_KEY": "your_key"
      }
    }
  }
}

Or point directly at the hosted SSE endpoint if your client supports it.
Fix 4: Graceful Degradation in Tool Definitions
Define your search tool so the agent knows what to do when it fails. Include fallback instructions directly in the tool description text, which the model reads alongside the tool's schema:
If this tool returns an error, do not retry more than once.
Mark the search result as unavailable and continue with
already-retrieved context. Do not fabricate results.
This is prompt engineering, not infrastructure, but it prevents the hallucination failure mode when the rate limit error enters context.
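The same policy can also be enforced in code, so the model never sees a raw error in the first place. The wrapper below is a sketch under assumptions: `searchOrUnavailable` and `SearchOutcome` are hypothetical names, and the one-second default retry delay is arbitrary.

```typescript
type SearchOutcome<T> =
  | { available: true; value: T }
  | { available: false; note: string };

// Run `search`, retrying at most once after `retryDelayMs`. If the retry also
// fails, return an explicit "unavailable" marker instead of throwing.
async function searchOrUnavailable<T>(
  search: () => Promise<T>,
  retryDelayMs: number = 1000,
): Promise<SearchOutcome<T>> {
  const maxAttempts = 2; // the original call plus one retry
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { available: true, value: await search() };
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait before the single retry.
        await new Promise<void>((resolve) => setTimeout(resolve, retryDelayMs));
      }
    }
  }
  const reason = lastError instanceof Error ? lastError.message : String(lastError);
  return {
    available: false,
    note: `Search unavailable: ${reason}. Continue with already-retrieved context.`,
  };
}
```

Returning a typed marker rather than an error string keeps the "error string passed as content" failure mode out of the picture: the caller decides how to phrase the unavailability to the model.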
Monitoring for Rate Limit Events
Add structured logging around every MCP call:
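// log is assumed to be a structured logger defined elsewhere.
async function loggedSearch(query: string) {
  const start = Date.now();
  // Check the cache up front so the log line reports hits accurately.
  const cached = cache.has(query.toLowerCase().trim());
  try {
    const result = await cachedSearch(query);
    log.info({ query, latencyMs: Date.now() - start, cached });
    return result;
  } catch (err) {
    log.error({ query, err, latencyMs: Date.now() - start });
    throw err;
  }
}

Watch for latency spikes before the 429. Some providers throttle by slowing responses before hard-rejecting, which gives you a window to back off.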
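Spotting those spikes can be automated with a rolling baseline. The monitor below is a minimal sketch: `LatencyMonitor` is a hypothetical class, and the window size and 3x threshold are arbitrary starting points, not values tuned against any provider.

```typescript
// Hypothetical early-warning monitor: flags a call as a spike when its latency
// exceeds `factor` times the rolling average of recent non-spike calls.
class LatencyMonitor {
  private samples: number[] = [];
  private windowSize: number;
  private factor: number;

  constructor(windowSize: number = 20, factor: number = 3) {
    this.windowSize = windowSize;
    this.factor = factor;
  }

  // Record a latency and report whether it looks like pre-429 throttling.
  record(latencyMs: number): boolean {
    const baseline =
      this.samples.length > 0
        ? this.samples.reduce((sum, v) => sum + v, 0) / this.samples.length
        : undefined;
    const isSpike = baseline !== undefined && latencyMs > baseline * this.factor;
    // Keep spikes out of the baseline so sustained throttling stays visible.
    if (!isSpike) {
      this.samples.push(latencyMs);
      if (this.samples.length > this.windowSize) this.samples.shift();
    }
    return isSpike;
  }
}
```

When `record` returns true, the caller can pause the queue or widen the throttle interval before the provider starts rejecting requests outright.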
When Exa Wins Despite Rate Limits
Exa's neural search with autoprompt handles ambiguous, research-style queries better than keyword-based SERP APIs. If your agent does open-ended research rather than structured data retrieval, Exa's quality per result justifies the tighter rate envelope. Use caching aggressively and batch searches into planning phases rather than interleaving search and synthesis.
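Batching into a planning phase can be as simple as collecting the step's queries, removing duplicates, and running the whole batch before synthesis begins. The helpers below are a sketch with hypothetical names. `search` stands in for whatever throttled, cached search function you already have.

```typescript
// Collapse a planning step's queries into a unique, ordered batch.
function planSearchBatch(queries: string[]): string[] {
  const seen = new Set<string>();
  const batch: string[] = [];
  for (const q of queries) {
    const key = q.toLowerCase().trim().replace(/\s+/g, " ");
    if (key.length === 0 || seen.has(key)) continue; // skip blanks and repeats
    seen.add(key);
    batch.push(q.trim());
  }
  return batch;
}

// Run the whole batch before synthesis starts.
async function runBatch<T>(
  batch: string[],
  search: (q: string) => Promise<T>,
): Promise<Map<string, T>> {
  const results = new Map<string, T>();
  for (const q of batch) {
    // Sequential on purpose, so any upstream throttle still applies.
    results.set(q, await search(q));
  }
  return results;
}
```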
For structured product data, geographic queries, or platform-specific results (Reddit, YouTube, Amazon), a multi-platform SERP API is a better fit than neural search regardless of rate limits.