Per-Token Search Billing Makes Agent Costs Unpredictable

Per-token billing on search APIs like Perplexity Sonar and Exa contents makes agent costs unpredictable because the bill scales with how many tokens each search returns, and that number is not known until the response arrives. The agent chooses the queries, but the length of each result is set by the API and the pages it retrieves, so your query count alone does not tell you your bill. Credit-per-query pricing makes agent costs predictable because cost does not depend on result length.

How Per-Token Search Billing Works

Perplexity Sonar charges per token on both input and output. Output includes the web content the model synthesizes into its answer. If an agent asks a complex question, Perplexity pulls more web sources and generates a longer response — more tokens, higher cost. The agent does not control this; Perplexity's model decides how much content to include.

Exa's content retrieval similarly charges per token for the content returned. A request for webpage contents that returns 5,000 tokens costs five times as much as one returning 1,000 tokens. The agent cannot predict how long the retrieved content will be.
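As a rough sketch of the billing model described above, the cost of one per-token call is a function of token counts that are only known after the response arrives. The function name and the price used here are illustrative assumptions, not either provider's published rates.

```python
def per_token_cost(input_tokens: int, output_tokens: int,
                   input_price: float, output_price: float) -> float:
    """Cost in USD of one call billed per token (prices are USD per token)."""
    return input_tokens * input_price + output_tokens * output_price


# Hypothetical price: $5 per million tokens in both directions.
PRICE = 5 / 1_000_000

short_page = per_token_cost(0, 1_000, PRICE, PRICE)
long_page = per_token_cost(0, 5_000, PRICE, PRICE)

print(f"1,000 tokens returned: ${short_page:.4f}")
print(f"5,000 tokens returned: ${long_page:.4f}")
```

The same request shape produces two different charges, and the caller only learns which one applies once the content has been returned.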

The Variance Problem

Consider an agent running 1,000 searches in a month, split between simple queries ("current Bitcoin price") and complex research queries ("explain the implications of the EU AI Act for small software companies"). The simple queries return short responses; the complex ones return long syntheses.

With per-token billing:

Simple query: 200 tokens output = $0.001
Complex query: 2,000 tokens output = $0.01

If the ratio of complex to simple queries shifts — because users increasingly use the agent for research rather than lookups — your costs can double without your query count changing.

With credit-per-query billing:

Simple query: 1 credit = $0.005
Complex query: 1 credit = $0.005

Your cost is purely a function of query count, which you can measure and predict.
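To make the mix shift concrete, the sketch below reuses the per-query figures from the example above and varies only the share of complex queries. The function names and the chosen shares are assumptions made for illustration.

```python
# Per-query prices taken from the example above.
SIMPLE_TOKEN_COST = 0.001   # 200 output tokens
COMPLEX_TOKEN_COST = 0.01   # 2,000 output tokens
CREDIT_COST = 0.005         # flat, per query


def monthly_cost_per_token(num_queries: int, complex_share: float) -> float:
    """Monthly cost when each query is billed by its output length."""
    complex_queries = num_queries * complex_share
    simple_queries = num_queries - complex_queries
    return simple_queries * SIMPLE_TOKEN_COST + complex_queries * COMPLEX_TOKEN_COST


def monthly_cost_per_credit(num_queries: int) -> float:
    """Monthly cost when each query costs one credit."""
    return num_queries * CREDIT_COST


for share in (0.1, 0.3, 0.5):
    per_token = monthly_cost_per_token(1_000, share)
    per_credit = monthly_cost_per_credit(1_000)
    print(f"{share:.0%} complex: per-token ${per_token:.2f}, per-credit ${per_credit:.2f}")
```

With these figures, moving from 10% to 30% complex queries takes the per-token bill from about $1.90 to about $3.70 for the same 1,000 queries, while the per-credit bill stays at $5.00. At these illustrative prices the per-token bill is lower in absolute terms when most queries are simple; the point is that it moves with the mix and the flat bill does not.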

Practical Budget Planning

For budget planning with per-token APIs:

Python

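# You cannot predict this reliably; the values below are illustrative
num_queries = 1_000
avg_tokens_per_query = 1_100  # unknown in practice and varies by query type
token_price = 0.000005        # USD per token, implied by $0.001 per 200 tokens above
estimated_monthly_cost = num_queries * avg_tokens_per_query * token_price

# token_price also differs between input and output tokens
# result: wide confidence interval on actual cost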

For budget planning with credit-based APIs:

Python

# This is exact once the query count is known
num_queries = 1_000   # measurable from logs
credit_price = 0.005  # USD per query: $0.005 (Scavio), $0.008 (Tavily)
estimated_monthly_cost = num_queries * credit_price
# result: narrow confidence interval
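One way to see the difference in confidence intervals is to estimate both budgets from logged data. The sketch below is a minimal illustration: the token counts are made-up sample values, and the token price is the one implied by the $0.001 per 200-token example earlier in the article.

```python
import statistics

# Hypothetical output-token counts from a sample of logged queries.
logged_output_tokens = [180, 220, 200, 1900, 210, 2100, 190, 2000, 230, 1950]

TOKEN_PRICE = 0.000005   # USD per token (assumed, from the earlier example)
CREDIT_PRICE = 0.005     # USD per query
NUM_QUERIES = 1_000

mean_tokens = statistics.mean(logged_output_tokens)
stdev_tokens = statistics.stdev(logged_output_tokens)

# Per-token: the estimate inherits the spread of the token counts.
expected = NUM_QUERIES * mean_tokens * TOKEN_PRICE
low = NUM_QUERIES * min(logged_output_tokens) * TOKEN_PRICE
high = NUM_QUERIES * max(logged_output_tokens) * TOKEN_PRICE

# Per-credit: one number, with no token term to estimate.
flat = NUM_QUERIES * CREDIT_PRICE

print(f"per-token: expected ${expected:.2f}, range ${low:.2f} to ${high:.2f}")
print(f"token count spread: stdev is {stdev_tokens / mean_tokens:.0%} of the mean")
print(f"per-credit: ${flat:.2f}")
```

The per-token range here is bounded by the shortest and longest responses in the sample, which is a crude bound; a real forecast would use percentiles over a larger log.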

Where Per-Token Billing Is Justified

Per-token billing makes sense when result length is predictably short and consistent, or when the synthesized answer is itself what you need:

Factual lookup queries that always return brief answers
Structured data extraction where you control the output format
Use cases where you need the synthesis built into the response (Perplexity's prose answers)

Per-token billing becomes expensive relative to credit-based billing when:

Queries vary widely in complexity
The API generates long synthesized answers you then re-process with your own LLM anyway (paying twice)
Agents autonomously decide query complexity

The Double-Processing Problem

This is the hidden cost of per-token search APIs in agent pipelines. If your agent:

Calls Perplexity Sonar to get a synthesized answer (pay per token for synthesis)
Passes that synthesized answer to Claude for further reasoning (pay again for the input tokens)

You are paying for the same content twice: once as output tokens from Perplexity and again as input tokens to Claude. With a credit-based SERP API that returns raw results, you pay once for the search and once for your LLM reasoning, and the SERP call is cheaper.

Python

# Inefficient: per-token synthesis + LLM reasoning
perplexity_answer = perplexity.search(query)  # pay for synthesis
claude_analysis = claude.reason(perplexity_answer)  # pay again for input

# Efficient: raw results + LLM reasoning
search_results = serp_api.search(query)  # pay for search only
claude_analysis = claude.reason(search_results)  # pay once for LLM

The efficient version typically costs less and gives you more control over the reasoning step.
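The comparison can be put into rough numbers. The sketch below assumes a hypothetical LLM input price of $3 per million tokens and reuses the $0.01 synthesis cost and $0.005 credit cost from the earlier example; the function name and the token counts are illustrative, not measurements.

```python
def pipeline_cost(search_cost: float, llm_input_tokens: int,
                  llm_input_price: float) -> float:
    """Cost in USD of one search call plus feeding its result to your own LLM."""
    return search_cost + llm_input_tokens * llm_input_price


LLM_INPUT_PRICE = 3 / 1_000_000  # hypothetical: $3 per million input tokens
SYNTHESIS_COST = 0.01            # 2,000-token synthesized answer (earlier example)
CREDIT_COST = 0.005              # one flat-rate search credit

# Assume both pipelines hand the LLM a 2,000-token prompt.
synthesized = pipeline_cost(SYNTHESIS_COST, 2_000, LLM_INPUT_PRICE)
raw = pipeline_cost(CREDIT_COST, 2_000, LLM_INPUT_PRICE)

# Extra prompt tokens the raw results could add before the saving disappears.
break_even_extra_tokens = (SYNTHESIS_COST - CREDIT_COST) / LLM_INPUT_PRICE

print(f"synthesis + LLM: ${synthesized:.4f}")
print(f"raw results + LLM: ${raw:.4f}")
print(f"break-even extra prompt tokens: {break_even_extra_tokens:.0f}")
```

Under these assumptions the raw-results pipeline stays cheaper as long as the raw results add fewer than `break_even_extra_tokens` tokens to the prompt. Raw SERP payloads can be longer than a synthesized answer, so it is worth measuring both prompt sizes on your own queries.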
