How long does the Build a News Corpus with Search API tutorial take?

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

What do I need before starting?

You need Python 3.9+ with the requests library installed, a Scavio API key from scavio.dev (signup includes 50 free credits), and a list of news topics to collect.

Can I run this tutorial with the free tier?

Yes. The free tier includes 50 credits on signup, which is more than enough to complete this tutorial and prototype a working solution.

What frameworks does this work with?

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt it to your framework of choice.

Build a News Corpus with Search API (2026)

Training ML models for news classification, summarization, or sentiment analysis requires a large, well-structured corpus of news articles. Web scraping news sites is fragile and legally complex. This tutorial builds a news corpus collection pipeline using the Scavio API to search for news on specific topics, extract article metadata from SERP snippets, deduplicate by URL, and store the corpus in a structured format ready for ML preprocessing. Each topic search costs $0.005.
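As a rough budgeting aid, the per-search price above can be turned into a small estimate. This is only a sketch: it assumes one API call per topic per collection run, as in the walkthrough, and `estimate_cost` with its parameters is an illustrative helper, not part of the Scavio API.

```python
from decimal import Decimal

# Price per search as stated in this tutorial; adjust if your plan differs.
COST_PER_SEARCH = Decimal("0.005")

def estimate_cost(num_topics: int, runs: int = 1) -> Decimal:
    """Estimated spend in USD, assuming one search per topic per run."""
    if num_topics < 0 or runs < 0:
        raise ValueError("num_topics and runs must be non-negative")
    return COST_PER_SEARCH * num_topics * runs

# Hypothetical scenarios: a single run, and a daily run for 30 days.
print(f"5 topics, 1 run:   ${estimate_cost(5)}")
print(f"5 topics, 30 runs: ${estimate_cost(5, runs=30)}")
```

Decimal is used instead of float so that small per-call prices add up without rounding drift.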

Prerequisites

Python 3.9+ installed
requests library installed
A Scavio API key from scavio.dev
A list of news topics to collect

Walkthrough

Step 1: Define topics and search for news articles

Search for recent news on each topic. The code below appends "news 2026" to every query to steer results toward recent news coverage; it does not apply a separate date filter.

Python

import os, requests, json, time, hashlib
from datetime import datetime

SCAVIO_KEY = os.environ['SCAVIO_API_KEY']
H = {'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json'}
URL = 'https://api.scavio.dev/api/v1/search'

TOPICS = [
    'artificial intelligence regulation',
    'climate technology startups',
    'semiconductor supply chain',
    'electric vehicle market',
    'cybersecurity breaches 2026',
]

def search_news(topic: str, num: int = 10) -> list:
    resp = requests.post(URL, headers=H, timeout=30,
        json={'query': f'{topic} news 2026', 'country_code': 'us', 'num_results': num})
    resp.raise_for_status()  # fail loudly on auth, quota, or server errors
    results = resp.json().get('organic_results', [])
    articles = []
    for r in results:
        articles.append({
            'title': r.get('title', ''),
            'url': r.get('link', ''),
            'snippet': r.get('snippet', ''),
            # Host part of 'scheme://host/path'; empty if the link is not a full URL
            'source_domain': r.get('link', '').split('/')[2] if r.get('link', '').count('/') >= 2 else '',
            'topic': topic,
            'collected_at': datetime.now().isoformat(),
        })
    return articles

articles = search_news('artificial intelligence regulation')
print(f'Collected {len(articles)} articles on AI regulation')
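Two details of the search step can be made a little more robust: the hard-coded year in the query and the string-splitting used for the source domain. The sketch below is one possible approach using only the standard library; `build_news_query` and `source_domain` are hypothetical helper names, and stripping a leading "www." is an assumption about how you want sources grouped.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse

def build_news_query(topic: str, year: Optional[int] = None) -> str:
    """Append 'news <year>' to a topic, defaulting to the current UTC year."""
    if year is None:
        year = datetime.now(timezone.utc).year
    return f"{topic} news {year}"

def source_domain(url: str) -> str:
    """Return the lowercase host of a URL without a leading 'www.', or '' if absent."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

print(build_news_query("semiconductor supply chain", year=2026))
print(source_domain("https://www.example.com/business/story"))
print(repr(source_domain("not a url")))
```

In the walkthrough, these could replace the f-string query and the `split('/')` expression inside `search_news`.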

Step 2: Deduplicate and categorize the corpus

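Deduplicate articles by the MD5 hash of their URL, and attach basic metadata (the URL hash and the snippet's word count) to each record. The stats method reports per-topic counts and the most frequent source domains.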

Python

class NewsCorpus:
    def __init__(self):
        self.articles = []
        self.seen_urls = set()

    def add_articles(self, new_articles: list) -> int:
        added = 0
        for article in new_articles:
            url_hash = hashlib.md5(article['url'].encode()).hexdigest()
            if url_hash not in self.seen_urls:
                self.seen_urls.add(url_hash)
                article['url_hash'] = url_hash
                article['word_count'] = len(article['snippet'].split())
                self.articles.append(article)
                added += 1
        return added

    def stats(self) -> dict:
        topics = {}
        sources = {}
        for a in self.articles:
            topics[a['topic']] = topics.get(a['topic'], 0) + 1
            sources[a['source_domain']] = sources.get(a['source_domain'], 0) + 1
        return {
            'total_articles': len(self.articles),
            'unique_urls': len(self.seen_urls),
            'topics': topics,
            'top_sources': dict(sorted(sources.items(), key=lambda x: -x[1])[:10]),
        }

corpus = NewsCorpus()
for topic in TOPICS:
    articles = search_news(topic)
    added = corpus.add_articles(articles)
    print(f'{topic}: +{added} articles')
    time.sleep(0.3)

stats = corpus.stats()
print(f'\nCorpus: {stats["total_articles"]} articles across {len(stats["topics"])} topics')
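Hashing the raw URL treats links that differ only in tracking parameters, fragments, or a trailing slash as different articles. A minimal sketch of normalizing URLs before hashing follows. The list of tracking parameters is illustrative and non-exhaustive, `canonical_url` and `url_key` are hypothetical names, and stripping trailing slashes is an assumption that may not hold for every site.

```python
import hashlib
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Query parameters treated as tracking noise (illustrative, not exhaustive).
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}

def canonical_url(url: str) -> str:
    """Normalize a URL so trivially different forms compare equal."""
    parts = urlparse(url.strip())
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment, keep remaining params in order.
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(), path,
                       parts.params, urlencode(query), ""))

def url_key(url: str) -> str:
    """MD5 of the canonical URL, usable in place of the raw-URL hash."""
    return hashlib.md5(canonical_url(url).encode("utf-8")).hexdigest()

a = "https://Example.com/story/?utm_source=feed#top"
b = "https://example.com/story"
print(canonical_url(a))
print(url_key(a) == url_key(b))
```

To try this in `NewsCorpus.add_articles`, compute `url_hash` with `url_key(article['url'])` instead of hashing the raw string.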

Step 3: Export the corpus for ML training

Save the corpus as JSONL (one JSON object per line), a format widely supported by ML data-loading tools. Each record keeps its metadata so you can filter the corpus later.

Python

def export_corpus(corpus: NewsCorpus, output_file: str = 'news_corpus.jsonl'):
    with open(output_file, 'w', encoding='utf-8') as f:
        for article in corpus.articles:
            f.write(json.dumps(article) + '\n')
    stats = corpus.stats()
    print(f'Exported {stats["total_articles"]} articles to {output_file}')
    print(f'Topics: {", ".join(f"{k} ({v})" for k, v in stats["topics"].items())}')
    print(f'Top sources: {", ".join(list(stats["top_sources"].keys())[:5])}')
    print(f'Cost: ${len(TOPICS) * 0.005:.3f}')

export_corpus(corpus)
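To show how the exported metadata can be used for filtering, here is a small sketch that reads a JSONL file back and selects records by topic and snippet length. The field names `topic` and `word_count` match the records built in Step 2; `read_corpus`, `filter_articles`, and the sample rows are illustrative and not part of the tutorial's pipeline.

```python
import json
import os
import tempfile
from typing import Iterator

def read_corpus(path: str) -> Iterator[dict]:
    """Yield one article dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def filter_articles(articles, topic=None, min_words=0):
    """Keep articles matching a topic (if given) with at least min_words in the snippet."""
    return [
        a for a in articles
        if (topic is None or a.get("topic") == topic)
        and a.get("word_count", 0) >= min_words
    ]

# Made-up sample rows written to a temporary file for demonstration only.
sample = [
    {"title": "A", "topic": "electric vehicle market", "word_count": 24},
    {"title": "B", "topic": "electric vehicle market", "word_count": 6},
    {"title": "C", "topic": "semiconductor supply chain", "word_count": 30},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "news_corpus.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for row in sample:
            f.write(json.dumps(row) + "\n")
    selected = filter_articles(read_corpus(path),
                               topic="electric vehicle market", min_words=10)

print([a["title"] for a in selected])
```

Reading line by line keeps memory use flat, which matters once the corpus grows beyond a few thousand records.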

Python Example

Python

import os, requests, json, time, hashlib

SCAVIO_KEY = os.environ['SCAVIO_API_KEY']
H = {'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json'}

def collect_news_corpus(topics, num_per_topic=10):
    corpus = []
    seen = set()
    for topic in topics:
        resp = requests.post('https://api.scavio.dev/api/v1/search', headers=H, timeout=30,
            json={'query': f'{topic} news 2026', 'country_code': 'us', 'num_results': num_per_topic})
        resp.raise_for_status()  # stop on auth, quota, or server errors
        for r in resp.json().get('organic_results', []):
            link = r.get('link')
            if not link:
                continue  # skip results without a URL
            url_hash = hashlib.md5(link.encode()).hexdigest()
            if url_hash not in seen:
                seen.add(url_hash)
                corpus.append({'title': r.get('title', ''), 'url': link,
                    'snippet': r.get('snippet', ''), 'topic': topic})
        time.sleep(0.3)
    print(f'Corpus: {len(corpus)} articles, {len(topics)} topics')
    return corpus

corpus = collect_news_corpus(['AI regulation', 'climate tech', 'cybersecurity'])
with open('corpus.jsonl', 'w') as f:
    for a in corpus:
        f.write(json.dumps(a) + '\n')
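The example above pauses 0.3 seconds between requests but gives up on the first network error. One common way to make a collection run more resilient is to retry with exponential backoff. The sketch below is generic and uses only the standard library; `with_retry` is a hypothetical helper, and the attempt count and delays are assumptions rather than documented Scavio limits.

```python
import time

def with_retry(call, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Run call(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Demonstration with a stand-in function that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return {"organic_results": []}

delays = []  # record the delays instead of actually sleeping
result = with_retry(flaky, retry_on=(ConnectionError,), sleep=delays.append)
print(result, calls["n"], delays)
```

In the collector, the wrapped call would be the `requests.post(...)` request, with `retry_on=(requests.ConnectionError, requests.Timeout)` so that only transient network failures are retried.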

JavaScript Example

JavaScript

// Requires Node.js 18+ for the built-in fetch API.
const SCAVIO_KEY = process.env.SCAVIO_API_KEY;
const fs = require('fs');

async function collectCorpus(topics) {
  const corpus = [];
  const seen = new Set();
  for (const topic of topics) {
    const resp = await fetch('https://api.scavio.dev/api/v1/search', {
      method: 'POST',
      headers: { 'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: `${topic} news 2026`, country_code: 'us', num_results: 10 })
    });
    if (!resp.ok) throw new Error(`Search failed: HTTP ${resp.status}`);
    for (const r of (await resp.json()).organic_results || []) {
      if (!seen.has(r.link)) {
        seen.add(r.link);
        corpus.push({ title: r.title, url: r.link, snippet: r.snippet || '', topic });
      }
    }
  }
  console.log(`Corpus: ${corpus.length} articles`);
  // End every record with a newline so the file matches the Python output.
  fs.writeFileSync('corpus.jsonl', corpus.map(a => JSON.stringify(a) + '\n').join(''));
}

collectCorpus(['AI regulation', 'climate tech']);

Expected Output

Text

artificial intelligence regulation: +10 articles
climate technology startups: +10 articles
semiconductor supply chain: +9 articles
electric vehicle market: +10 articles
cybersecurity breaches 2026: +10 articles

Corpus: 49 articles across 5 topics
Exported 49 articles to news_corpus.jsonl
Topics: artificial intelligence regulation (10), climate technology startups (10), semiconductor supply chain (9), electric vehicle market (10), cybersecurity breaches 2026 (10)
Top sources: reuters.com, bloomberg.com, techcrunch.com, nytimes.com, wired.com
Cost: $0.025
