Agent Discovery vs Extraction: Why Cost Split Matters
Discovery via a search API costs about $0.005 per query. Extraction via a scraper costs $0.05-0.50 per page. Scraping every URL that discovery returns is the biggest cost mistake in agent pipelines. The fix is a relevance filter between the two stages that passes only 20-30% of discovered URLs to extraction.
The Cost Asymmetry
Search API calls and scraping have fundamentally different cost structures:
- Search API: structured result, no HTML parsing, no proxy management, $0.003-0.008/query
- Scraper (Firecrawl, Jina, custom): full page content, HTML rendering, proxy handling, $0.05-0.50/page depending on anti-bot complexity
The 10-100x cost difference means that a single unnecessary scrape call costs as much as 10-100 search API calls. If your agent scrapes 50 pages when 10 contained the relevant information, you paid 5x the necessary scrape cost.
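The arithmetic above can be checked in a few lines. This is a minimal sketch using the per-call prices quoted in this section; the function names are illustrative and not part of any library.

```python
def cost_ratio(scrape_cost: float, search_cost: float) -> float:
    """How many search calls one scrape call is worth."""
    return scrape_cost / search_cost


def overspend_factor(pages_scraped: int, pages_needed: int) -> float:
    """Multiple of the necessary scrape spend that was actually paid."""
    if pages_needed <= 0:
        raise ValueError("pages_needed must be positive")
    return pages_scraped / pages_needed


# Prices quoted above: $0.005 per search, $0.05-0.50 per scraped page.
low = cost_ratio(0.05, 0.005)
high = cost_ratio(0.50, 0.005)
factor = overspend_factor(50, 10)
print(f"one scrape = {low:.0f}-{high:.0f} searches; overspend = {factor:.0f}x")
```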
Measuring Your Current Split
Before optimizing, measure:
class CostTracker:
    """Counts search and scrape calls and accumulates their cost in USD."""

    def __init__(self):
        self.search_calls = 0
        self.scrape_calls = 0
        self.search_cost = 0.0
        self.scrape_cost = 0.0

    def log_search(self, cost: float = 0.005):
        self.search_calls += 1
        self.search_cost += cost

    def log_scrape(self, cost: float = 0.10):
        self.scrape_calls += 1
        self.scrape_cost += cost

    @property
    def scrape_ratio(self) -> float:
        # Fraction of all logged calls that were scrapes.
        total = self.search_calls + self.scrape_calls
        return self.scrape_calls / total if total > 0 else 0.0

    @property
    def total_cost(self) -> float:
        return self.search_cost + self.scrape_cost

If your scrape ratio exceeds 0.5 (more than half your calls are scrapes), your pipeline likely has no relevance filter. Treat 0.5 as a rough signal rather than a fixed cutoff: the ratio also depends on how many results each query returns, so compare it before and after adding a filter.
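To see how the ratio moves with pipeline shape, the sketch below computes the expected scrape ratio from two inputs: results per query and the fraction of results that get scraped. The function name and parameters are illustrative assumptions, and it assumes every query returns the same number of results.

```python
def expected_scrape_ratio(results_per_query: int, pass_rate: float) -> float:
    """Expected scrapes / (searches + scrapes) for a single query.

    pass_rate is the fraction of discovered URLs that are scraped
    (1.0 means there is no filter at all).
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    scrapes = results_per_query * pass_rate
    return scrapes / (1 + scrapes)


for label, rate in [("no filter", 1.0), ("25% pass rate", 0.25)]:
    ratio = expected_scrape_ratio(10, rate)
    print(f"{label}: scrape ratio = {ratio:.2f}")
```

With 10 results per query, the unfiltered ratio is about 0.91 and a 25% pass rate gives about 0.71, so the drop between the two is more informative than any single threshold.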
Adding the Relevance Filter
Between discovery and extraction, add an LLM call that scores snippet relevance:
import re

def filter_for_extraction(task: str, search_results: list[dict]) -> list[str]:
    """Return the links of results whose snippets look worth scraping."""
    snippets = [
        f"{i}. [{r['title']}] {r['snippet']}"
        for i, r in enumerate(search_results, 1)
    ]
    snippet_block = "\n".join(snippets)
    prompt = f"""Task: {task}
Search results (title + snippet only):
{snippet_block}
Which result numbers are highly likely to contain primary source
information needed for the task? Return only numbers, comma-separated.
Be selective: only include results where the snippet strongly suggests
the page contains the specific information needed."""
    # `llm` is assumed to be a client whose complete() returns the reply as a string.
    response = llm.complete(prompt)
    # Pull out every integer so replies like "1, 3, 7." still parse,
    # then drop repeats while preserving order.
    selected_nums = dict.fromkeys(int(n) for n in re.findall(r"\d+", response))
    return [search_results[i - 1]['link'] for i in selected_nums
            if 1 <= i <= len(search_results)]

This LLM call costs ~$0.001 in Claude Haiku tokens for 10 snippets. It typically selects 2-3 URLs from 10 candidates, cutting scrape volume by 70-80%.
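The cost figure above is quoted for 10 snippets per call. If a task gathers many more results than that, one option is to filter in fixed-size batches so each prompt stays short. The sketch below is an illustration under assumptions: `batched_filter`, `select_fn`, and the batch size are hypothetical names, and `select_fn` stands in for any function with the same shape as `filter_for_extraction`.

```python
from typing import Callable


def batched_filter(
    task: str,
    search_results: list[dict],
    select_fn: Callable[[str, list[dict]], list[str]],
    batch_size: int = 10,
) -> list[str]:
    """Run select_fn on batches of results and merge the chosen links.

    Links are deduplicated in first-seen order, since the same URL
    can come back from more than one query.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    chosen: dict[str, None] = {}
    for start in range(0, len(search_results), batch_size):
        batch = search_results[start:start + batch_size]
        for link in select_fn(task, batch):
            chosen.setdefault(link, None)
    return list(chosen)


# Stand-in selector for demonstration only: keeps results whose snippet
# mentions the task text. A real pipeline would pass the LLM-backed filter.
def keyword_select(task: str, results: list[dict]) -> list[str]:
    return [r["link"] for r in results if task.lower() in r["snippet"].lower()]


results = [
    {
        "title": f"Result {i}",
        "snippet": "pricing table" if i % 4 == 0 else "misc",
        "link": f"https://example.com/{i}",
    }
    for i in range(25)
]
urls = batched_filter("pricing", results, keyword_select, batch_size=10)
print(urls)
```

The trade-off is more calls in exchange for shorter prompts, and the selector can no longer compare results that land in different batches.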
The Extraction Stage
Only scrape the filtered URLs:
def run_pipeline(task: str, queries: list[str], tracker: CostTracker) -> list[dict]:
    # `search_api` and `scraper` are assumed to be client objects defined elsewhere.
    all_results = []

    # Stage 1: Discovery
    for query in queries:
        results = search_api.call(query)
        tracker.log_search()
        all_results.extend(results)

    # Stage 2: Filter
    urls_to_scrape = filter_for_extraction(task, all_results)

    # Stage 3: Extraction (only filtered URLs)
    extracted = []
    for url in urls_to_scrape:
        content = scraper.fetch(url)
        tracker.log_scrape()
        extracted.append({"url": url, "content": content})
    return extracted

Real Cost Impact at Scale
For an agent running 100 research tasks per month, each with 10 queries and 10 results per query:
Without filter:
- 1,000 search calls: $5
- 10,000 scrape calls at $0.10: $1,000
- Total: $1,005
With filter (25% pass rate):
- 1,000 search calls: $5
- 2,500 scrape calls at $0.10: $250
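- 100 filter LLM calls (one per task, about 100 snippets each) at roughly $0.01, assuming cost scales with snippet count: $1.00
- Total: $256.00
The filter saves about $749/month on this workload, against roughly $1 spent on the filter calls themselves.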
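The scrape line dominates both totals, so it is worth seeing how it responds to the pass rate. The sketch below recomputes scrape spend for the workload above at a few pass rates; the function name and its parameters are illustrative assumptions, and the filter's own small cost is left out.

```python
def monthly_scrape_spend(
    tasks: int,
    queries_per_task: int,
    results_per_query: int,
    pass_rate: float,
    cost_per_page: float,
) -> float:
    """Scrape spend in USD when pass_rate of discovered URLs are scraped."""
    discovered = tasks * queries_per_task * results_per_query
    return discovered * pass_rate * cost_per_page


# Workload from the example above: 100 tasks x 10 queries x 10 results.
baseline = monthly_scrape_spend(100, 10, 10, 1.0, 0.10)
for pass_rate in (0.20, 0.25, 0.30):
    filtered = monthly_scrape_spend(100, 10, 10, pass_rate, 0.10)
    print(f"pass rate {pass_rate:.0%}: scrape spend ${filtered:,.2f}, "
          f"saved ${baseline - filtered:,.2f}")
```

At a 25% pass rate this reproduces the $250 scrape line above. At $0.10 per page the saving moves by about $50 for every 5 points of pass rate, and it scales linearly with the per-page price.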
When to Skip the Filter
For targeted extraction where you already know which URLs contain relevant content (a curated list, a specific domain pattern, known documentation pages), skip the filter — you are paying $0.001 for a decision you have already made. The filter earns its keep on broad research tasks where the relevant fraction of results is uncertain.
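One way to apply this in a mixed workload is to route URLs from known-good hosts straight to extraction and send only the rest through the filter. The following is a minimal sketch: `TRUSTED_HOSTS`, `split_by_trust`, and the example hosts are hypothetical and would be replaced by your own curated list.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: hosts whose pages are already known to be relevant.
TRUSTED_HOSTS = frozenset({"docs.example.com", "api.example.com"})


def split_by_trust(
    urls: list[str],
    trusted_hosts: frozenset[str] = TRUSTED_HOSTS,
) -> tuple[list[str], list[str]]:
    """Partition URLs into (scrape directly, send through the filter)."""
    direct: list[str] = []
    needs_filter: list[str] = []
    for url in urls:
        # hostname is None for strings that are not absolute URLs.
        host = (urlparse(url).hostname or "").lower()
        (direct if host in trusted_hosts else needs_filter).append(url)
    return direct, needs_filter


direct, needs_filter = split_by_trust([
    "https://docs.example.com/rate-limits",
    "https://blog.example.org/some-post",
])
print(direct, needs_filter)
```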