ScavioScavio
ProductPricingDocs
Sign InGet Started
  1. Home
  2. Tutorials
  3. How to Use Common Crawl with Live Search in a Hybrid Pipeline
Tutorial

How to Use Common Crawl with Live Search in a Hybrid Pipeline

Combine Common Crawl historical data with live search API results. Get massive scale from Common Crawl and freshness from live search.

Get Free API KeyAPI Docs

Common Crawl provides petabytes of free web archive data, but it is months old and requires heavy processing. Live search APIs return fresh results instantly but cost per query. This tutorial combines both into a hybrid pipeline: use Common Crawl for bulk historical analysis and Scavio for real-time freshness. The hybrid approach gives you the scale of Common Crawl with the currency of live search at minimal cost.

Prerequisites

  • Python 3.9+ installed
  • requests library installed
  • A Scavio API key from scavio.dev
  • Basic understanding of Common Crawl index

Walkthrough

Step 1: Query the Common Crawl Index API

Search Common Crawl's CDX index to find historical web pages matching your query. The index is free to query and returns URLs of archived pages.

Python
import requests

CC_INDEX_URL = 'https://index.commoncrawl.org/CC-MAIN-2026-04-index'

def search_common_crawl(domain: str, limit: int = 20) -> list:
    """Search Common Crawl index for pages from a domain."""
    resp = requests.get(CC_INDEX_URL,
        params={'url': f'{domain}/*', 'output': 'json', 'limit': limit},
        timeout=30)
    if resp.status_code != 200:
        print(f'Common Crawl returned {resp.status_code}')
        return []
    lines = resp.text.strip().split('\n')
    results = []
    for line in lines:
        try:
            import json
            record = json.loads(line)
            results.append({
                'url': record.get('url', ''),
                'timestamp': record.get('timestamp', ''),
                'status': record.get('status', ''),
                'mime': record.get('mime', ''),
                'length': record.get('length', 0),
            })
        except json.JSONDecodeError:
            continue
    return results

# Search for pages from a domain
pages = search_common_crawl('docs.python.org')
print(f'Found {len(pages)} archived pages from docs.python.org')
for p in pages[:5]:
    print(f'  [{p["timestamp"]}] {p["url"][:60]}')

Step 2: Build the hybrid search function

Combine Common Crawl data with live search results. Use Common Crawl for historical coverage and live search for fresh results.

Python
import os

SCAVIO_KEY = os.environ['SCAVIO_API_KEY']

def live_search(query: str, count: int = 5) -> list:
    resp = requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json'},
        json={'query': query, 'country_code': 'us', 'num_results': count})
    return [{'url': r['link'], 'title': r['title'],
             'snippet': r.get('snippet', ''), 'source': 'live_search'}
            for r in resp.json().get('organic_results', [])]

def hybrid_search(query: str, domain: str = None) -> dict:
    """Combine Common Crawl historical data with live results."""
    results = {'historical': [], 'live': [], 'cost': 0}
    # Step 1: Check Common Crawl for historical data (free)
    if domain:
        cc_results = search_common_crawl(domain, limit=10)
        results['historical'] = [{
            'url': r['url'], 'timestamp': r['timestamp'],
            'source': 'common_crawl'
        } for r in cc_results]
    # Step 2: Get fresh live results ($0.005)
    live = live_search(query, count=5)
    results['live'] = live
    results['cost'] = 0.005
    results['total'] = len(results['historical']) + len(results['live'])
    return results

result = hybrid_search('python 3.14 new features', domain='docs.python.org')
print(f'Historical (Common Crawl): {len(result["historical"])} pages')
print(f'Live (Scavio): {len(result["live"])} results')
print(f'Total: {result["total"]} | Cost: ${result["cost"]}')

Step 3: Analyze coverage gaps between historical and live data

Compare what Common Crawl has versus what live search returns. This reveals content that exists but is not in the archive, and archived content that may have been removed.

Python
def analyze_coverage(domain: str, queries: list) -> dict:
    # Get Common Crawl inventory
    cc_urls = set(r['url'] for r in search_common_crawl(domain, limit=50))
    # Get live search URLs for the domain
    live_urls = set()
    for query in queries:
        live = live_search(f'site:{domain} {query}', count=5)
        live_urls.update(r['url'] for r in live)
    # Analyze overlap
    both = cc_urls & live_urls
    cc_only = cc_urls - live_urls
    live_only = live_urls - cc_urls
    print(f'Domain: {domain}')
    print(f'Common Crawl pages: {len(cc_urls)}')
    print(f'Live search pages: {len(live_urls)}')
    print(f'In both: {len(both)}')
    print(f'CC only (may be removed): {len(cc_only)}')
    print(f'Live only (new content): {len(live_only)}')
    print(f'Cost: {len(queries)} live searches = ${len(queries) * 0.005:.3f}')
    return {'cc_only': cc_only, 'live_only': live_only, 'both': both}

coverage = analyze_coverage('docs.python.org', ['asyncio', 'pathlib', 'dataclasses'])

Python Example

Python
import requests, os, json

SCAVIO_KEY = os.environ['SCAVIO_API_KEY']
CC_INDEX = 'https://index.commoncrawl.org/CC-MAIN-2026-04-index'

def hybrid_search(query, domain=None):
    results = []
    # Common Crawl (free)
    if domain:
        cc = requests.get(CC_INDEX, params={'url': f'{domain}/*', 'output': 'json', 'limit': 10}, timeout=30)
        for line in cc.text.strip().split('\n'):
            try:
                r = json.loads(line)
                results.append({'url': r['url'], 'source': 'common_crawl'})
            except Exception:
                pass
    # Live search ($0.005)
    live = requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json'},
        json={'query': query, 'country_code': 'us', 'num_results': 5})
    for r in live.json().get('organic_results', []):
        results.append({'url': r['link'], 'title': r['title'], 'source': 'live'})
    print(f'{len(results)} total results (cost: $0.005)')
    return results

hybrid_search('python asyncio', 'docs.python.org')

JavaScript Example

JavaScript
const SCAVIO_KEY = process.env.SCAVIO_API_KEY;

async function hybridSearch(query, domain) {
  const results = [];
  if (domain) {
    const cc = await fetch(`https://index.commoncrawl.org/CC-MAIN-2026-04-index?url=${domain}/*&output=json&limit=10`);
    const lines = (await cc.text()).trim().split('\n');
    lines.forEach(line => { try { results.push({...JSON.parse(line), source: 'cc'}); } catch {} });
  }
  const live = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST',
    headers: { 'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, country_code: 'us', num_results: 5 })
  });
  for (const r of (await live.json()).organic_results || []) {
    results.push({ url: r.link, title: r.title, source: 'live' });
  }
  console.log(`${results.length} results (cost: $0.005)`);
  return results;
}

hybridSearch('python asyncio', 'docs.python.org');

Expected Output

JSON
Found 15 archived pages from docs.python.org
  [20260115] https://docs.python.org/3/library/asyncio.html
  [20260115] https://docs.python.org/3/whatsnew/3.14.html

Historical (Common Crawl): 15 pages
Live (Scavio): 5 results
Total: 20 | Cost: $0.005

Domain: docs.python.org
Common Crawl pages: 50
Live search pages: 12
In both: 8
CC only (may be removed): 42
Live only (new content): 4
Cost: 3 live searches = $0.015

Related Tutorials

  • How to Build a Multi-Engine Search Fallback Agent
  • How to Build a Search Trust Score Pipeline
  • How to Benchmark Search API Quality Per Dollar

Frequently Asked Questions

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

Python 3.9+ installed. requests library installed. A Scavio API key from scavio.dev. Basic understanding of Common Crawl index. A Scavio API key gives you 50 free credits on signup.

Yes. The free tier includes 50 credits on signup, which is more than enough to complete this tutorial and prototype a working solution.

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt to your framework of choice.

Related Resources

Comparison

Common Crawl vs Live Search API (Scavio, Tavily, Brave)

Read more
Best Of

Best Search APIs for Pipeline Integration in 2026

Read more
Use Case

Hybrid RAG with Live Search Augmentation

Read more
Glossary

Search API Provider Landscape (2026)

Read more
Best Of

Best Budget Search APIs for AI Agents Under $10/mo (2026)

Read more
Use Case

Open Source AI Web Access 2026

Read more

Start Building

Combine Common Crawl historical data with live search API results. Get massive scale from Common Crawl and freshness from live search.

Get Free API KeyRead the Docs
ScavioScavio

Real-time search API for AI agents. Search every platform, not just Google.

Product

  • Features
  • Pricing
  • Dashboard
  • Affiliates

Developers

  • Documentation
  • API Reference
  • Quickstart
  • MCP Integration
  • Python SDK

Alternatives

  • Tavily Alternative
  • SerpAPI Alternative
  • Firecrawl Alternative
  • Exa Alternative

Tools

  • JSON Formatter
  • cURL to Code
  • Token Counter
  • All Tools

© 2026 Scavio. All rights reserved.

Featured on TAAFT
Terms of ServicePrivacy Policy