LLM Wiki Ingestion Workflow

Daily ingestion of new sources for a Karpathy-style LLM Wiki: Scavio search across the web, Reddit, and YouTube, followed by markdown extraction and embedding.

Overview

Once a day, for each active topic: pull new sources from Google, Reddit, and YouTube via Scavio, skip URLs already stored in Qdrant, extract the remaining pages as markdown, then chunk, embed, and upsert them into Qdrant.

Trigger

Daily cron at 6am for the active topic list

Schedule

Daily at 6am

Workflow Steps

1

Iterate active topics

Pull topic list from a Postgres table or YAML config.

2

Per topic: Scavio search across 3 surfaces

search, reddit_search, youtube_search calls in parallel.
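
Because the three searches are independent HTTP calls, a thread pool is one simple way to run them in parallel. The sketch below is an illustration, not Scavio's SDK: `fetch` is a hypothetical callable you supply (for example, a thin wrapper around an HTTP POST with your API key) that takes an endpoint path and a query and returns the decoded JSON. The endpoint paths and response keys are the ones used in the Python implementation on this page.

```python
from concurrent.futures import ThreadPoolExecutor

# Endpoint path -> key holding the result list, as used elsewhere on this page.
SURFACES = {
    "search": "organic_results",
    "reddit/search": "posts",
    "youtube/search": "videos",
}


def discover_parallel(topic, fetch):
    """Query all three surfaces concurrently and merge the results.

    `fetch(endpoint, query)` is a caller-supplied function (an assumption of
    this sketch) that returns the decoded JSON response as a dict.
    """
    def one(endpoint):
        try:
            body = fetch(endpoint, topic)
        except Exception:
            # A failure on one surface should not discard the other two.
            return []
        return body.get(SURFACES[endpoint], [])

    with ThreadPoolExecutor(max_workers=len(SURFACES)) as pool:
        # pool.map preserves input order, so results stay grouped by surface.
        batches = list(pool.map(one, SURFACES))
    return [item for batch in batches for item in batch]
```

Swallowing per-surface errors keeps a Reddit outage from blocking web and YouTube results; if you would rather fail the whole topic, let the exception propagate instead.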

3

Dedupe candidate URLs against Qdrant payload index

Skip URLs already ingested.
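
Deduping by URL only works if the same page always produces the same key, so it can help to normalize URLs before comparing them. The following is a minimal sketch under stated assumptions: the normalization rules (lowercase host, drop fragments, drop `utm_*` parameters, strip trailing slashes) are a common choice rather than something this workflow prescribes, and `is_known` is a hypothetical lookup you supply, such as a filtered query against your Qdrant collection.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # assumed list; extend it for your sources


def canonical_url(url):
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if not k.lower().startswith(TRACKING_PREFIXES)]
    path = parts.path.rstrip("/") or "/"
    # The empty last element drops the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       urlencode(query), ""))


def filter_new(candidates, is_known):
    """Return canonical URLs not yet ingested, without in-run duplicates.

    `is_known(url)` is a caller-supplied lookup (hypothetical here).
    """
    seen, fresh = set(), []
    for c in candidates:
        raw = c.get("link") or c.get("url")
        if not raw:
            continue
        url = canonical_url(raw)
        if url in seen or is_known(url):
            continue
        seen.add(url)
        fresh.append(url)
    return fresh
```

The in-run `seen` set matters because the three search surfaces can return the same link within a single run, before anything has been written to Qdrant. If you normalize at lookup time, store the same canonical form in the payload so the two sides match.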

4

Per new URL: Scavio /extract for markdown

Cleaner than raw HTML; saves embedding tokens.

5

Chunk + embed + upsert

Chunk to 500-token blocks, embed via your embedding model, upsert to Qdrant with URL as payload.
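
The chunking part of this step can be sketched as follows. One assumption to note: exact token counts depend on your embedding model's tokenizer, so this example approximates a token as a whitespace-separated word. It packs whole paragraphs into a chunk until the limit is reached and hard-splits any single paragraph that is longer than the limit. The function name and the paragraph-first strategy are illustrative choices, not part of the workflow definition.

```python
def chunk_markdown(md, max_tokens=500):
    """Pack paragraphs into chunks of at most `max_tokens` words.

    Words stand in for tokens here; real tokenizers usually produce more
    tokens than words, so lower `max_tokens` if your model has a hard limit.
    """
    chunks, current, used = [], [], 0
    for para in md.split("\n\n"):
        words = para.split()
        if not words:
            continue
        if len(words) > max_tokens:
            # Flush what we have, then hard-split the oversized paragraph.
            if current:
                chunks.append("\n\n".join(current))
                current, used = [], 0
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue
        if used + len(words) > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para.strip())
        used += len(words)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Keeping paragraphs intact where possible tends to give each embedding a self-contained unit of meaning; fixed-width windows with overlap are a reasonable alternative if your sources have very long paragraphs.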

6

Log new-doc count + per-topic cost

Record how many new documents each topic added and what it cost, and enforce a cost budget per topic.
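
One way to implement the guardrail is a small per-topic counter that refuses a call once the cap would be exceeded. This is a sketch with assumed numbers: the class names, the per-topic cap, and the default cost of one credit per call are all hypothetical, so check your actual Scavio credit costs before relying on it.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a topic would spend more credits than it is allowed."""


@dataclass
class TopicBudget:
    topic: str
    max_credits: int  # hypothetical per-topic daily cap
    spent: int = 0
    new_docs: int = 0

    def charge(self, credits=1):
        # Refuse the call before making it, so the cap is never overshot.
        if self.spent + credits > self.max_credits:
            raise BudgetExceeded(
                f"{self.topic}: cap of {self.max_credits} credits reached")
        self.spent += credits

    def summary(self):
        return (f"topic={self.topic} new_docs={self.new_docs} "
                f"credits={self.spent}/{self.max_credits}")
```

Call `charge()` before each search or extract request, increment `new_docs` after each stored document, and log `summary()` when the topic finishes or when `BudgetExceeded` is raised.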

Python Implementation

Python
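import os
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

API = 'https://api.scavio.dev/api/v1'
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
COLLECTION = 'llm_wiki'  # placeholder; use your own collection name
qdrant = QdrantClient(url=os.environ['QDRANT_URL'])

def post(endpoint, payload):
    r = requests.post(f'{API}/{endpoint}', headers=H, json=payload, timeout=30)
    r.raise_for_status()  # fail loudly instead of parsing an error body
    return r.json()

def discover(topic):
    # Step 2: gather candidates from web, Reddit, and YouTube search.
    results = []
    for endpoint in ['search', 'reddit/search', 'youtube/search']:
        r = post(endpoint, {'query': topic})
        results.extend(r.get('organic_results', []) + r.get('posts', []) + r.get('videos', []))
    return results

def already_ingested(url):
    # Step 3: assumes each point stores its source under a 'url' payload key.
    match = FieldCondition(key='url', match=MatchValue(value=url))
    hits, _ = qdrant.scroll(collection_name=COLLECTION, scroll_filter=Filter(must=[match]), limit=1)
    return bool(hits)

def store(url, topic, md):
    # Step 5: chunk, embed, and upsert. Left open because it depends on your embedding model.
    raise NotImplementedError

def ingest_topic(topic):
    for c in discover(topic):
        url = c.get('link') or c.get('url')
        if not url or already_ingested(url): continue
        md = post('extract', {'url': url, 'format': 'markdown'}).get('markdown', '')
        if md:  # skip pages that returned no content
            store(url, topic, md)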
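
The implementation above handles a single topic. A driver for step 1 might look like the sketch below, which runs every topic and keeps going when one of them fails. The function name `run_daily` and the logger name are hypothetical, and `ingest` stands for whatever per-topic function you use, such as `ingest_topic`.

```python
import logging

log = logging.getLogger("wiki_ingest")


def run_daily(topics, ingest):
    """Run `ingest(topic)` for every topic and report which ones failed.

    Returns a dict mapping each failed topic to its error message.
    """
    failed = {}
    for topic in topics:
        try:
            ingest(topic)
        except Exception as exc:
            # Keep going: one bad topic should not stop the whole run.
            log.exception("ingestion failed for topic %r", topic)
            failed[topic] = str(exc)
    return failed
```

Scheduled from cron, the 6am daily trigger corresponds to the expression `0 6 * * *`; the returned dict gives you something concrete to alert on when it is not empty.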

JavaScript Implementation

JavaScript
// Same flow in TS via Qdrant JS client + Scavio fetch calls.

Platforms Used

Google

Web search with knowledge graph, PAA, and AI overviews

Reddit

Community, posts & threaded comments from any subreddit

YouTube

Video search with transcripts and metadata

Frequently Asked Questions

The workflow runs once a day. For each active topic it pulls new sources from Google, Reddit, and YouTube via Scavio, extracts markdown, embeds it into Qdrant, and dedupes by URL.

It is triggered by a daily cron job at 6am that iterates the active topic list.

It uses three Scavio platforms: Google, Reddit, and YouTube. Each platform is called through the same Scavio API.

Scavio's free tier includes 50 credits on signup with no credit card required, which is enough to test and validate this workflow before scaling it.

© 2026 Scavio. All rights reserved.
