LLM Wiki Ingestion Workflow

Overview

Daily wiki ingestion: for each topic, pull new sources from Google, Reddit, and YouTube via Scavio, extract each page as markdown, embed it into Qdrant, and dedupe by URL.

Trigger

Daily cron at 6am for the active topic list

Schedule

Daily at 6am

Workflow Steps

Iterate active topics

Pull topic list from a Postgres table or YAML config.
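A minimal sketch of the topic lookup, assuming a hypothetical `topics` table with `name` and `active` columns. The standard-library `sqlite3` module stands in for Postgres here so the example runs anywhere; against Postgres the same query would go through your database driver instead.

```python
import sqlite3

def load_active_topics(conn):
    """Return the names of all active topics, in a stable order."""
    rows = conn.execute(
        "SELECT name FROM topics WHERE active ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Demo with an in-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (name TEXT PRIMARY KEY, active INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO topics VALUES (?, ?)",
    [("vector databases", 1), ("prompt caching", 1), ("retired topic", 0)],
)
topics = load_active_topics(conn)
print(topics)
```

Keeping an `active` flag rather than deleting rows lets a topic be paused without losing its history.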

Per topic: Scavio search across 3 surfaces

Run the search, reddit_search, and youtube_search calls in parallel. The Python implementation below issues them one after another for simplicity.
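One way to run the three searches concurrently is a small thread pool. The sketch below takes the HTTP call as a `fetch` parameter (a hypothetical callable, not part of Scavio) so it can run offline; the endpoint paths are the ones used in the Python implementation on this page. Treating a failed surface as an empty result is a design assumption, not documented behavior.

```python
from concurrent.futures import ThreadPoolExecutor

ENDPOINTS = ["search", "reddit/search", "youtube/search"]

def search_all(topic, fetch, endpoints=ENDPOINTS):
    """Call fetch(endpoint, topic) for every endpoint concurrently.

    Returns {endpoint: results}. A failed surface yields an empty list,
    so one outage does not sink the whole topic.
    """
    def safe_fetch(endpoint):
        try:
            return fetch(endpoint, topic)
        except Exception:
            return []

    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        # pool.map preserves input order, so zip pairs each endpoint with its result.
        return dict(zip(endpoints, pool.map(safe_fetch, endpoints)))

# Stand-in for the real HTTP call, so the sketch runs without network access.
def fake_fetch(endpoint, topic):
    if endpoint == "reddit/search":
        raise TimeoutError("simulated outage")
    return [{"url": f"https://example.com/{endpoint}/{topic}"}]

by_surface = search_all("qdrant", fake_fetch)
print(by_surface)
```

In a real run, `fetch` would wrap the `requests.post` call and return the parsed result list for that surface.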

Dedupe candidate URLs against Qdrant payload index

Skip URLs already ingested.
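The dedupe step can be sketched as a filter over the merged candidates. Here `is_ingested` is a hypothetical callable; in production it would be a Qdrant lookup on the `url` payload field, while a plain set stands in for it below. The sketch also drops repeats within the batch, since the same link can surface on more than one platform.

```python
def new_urls(candidates, is_ingested):
    """Yield each candidate URL once, skipping ones already stored."""
    seen = set()
    for item in candidates:
        # Result shapes differ by surface: some carry 'link', others 'url'.
        url = item.get("link") or item.get("url")
        if not url or url in seen:
            continue
        seen.add(url)
        if not is_ingested(url):
            yield url

stored = {"https://example.com/a"}  # stand-in for what Qdrant already holds
candidates = [
    {"link": "https://example.com/a"},   # already ingested
    {"url": "https://example.com/b"},    # new
    {"link": "https://example.com/b"},   # repeat within the batch
    {"title": "result without a URL"},   # skipped
]
fresh = list(new_urls(candidates, stored.__contains__))
print(fresh)
```

Checking the in-batch set before the store lookup avoids paying for the same lookup twice.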

Per new URL: Scavio /extract for markdown

Markdown is cleaner than raw HTML and drops markup that would otherwise waste embedding tokens.

Chunk + embed + upsert

Split the markdown into 500-token chunks, embed each chunk with your embedding model, and upsert to Qdrant with the source URL in the payload.
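A rough sketch of the chunk-and-embed step, under two loud assumptions: whitespace-separated words stand in for real model tokens, and `embed` is a hypothetical callable wrapping whatever embedding model you use. The payload field names (`topic`, `chunk`, `text`) are illustrative. Each record has the id, vector, and payload that a Qdrant point needs.

```python
import uuid

def chunk_tokens(text, size=500):
    """Split text into blocks of at most `size` whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def build_points(url, topic, markdown, embed, size=500):
    """Return one record per chunk with a deterministic ID, vector, and payload."""
    points = []
    for i, chunk in enumerate(chunk_tokens(markdown, size)):
        points.append({
            # uuid5 is deterministic, so re-ingesting a URL reuses the same IDs.
            "id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{url}#{i}")),
            "vector": embed(chunk),
            "payload": {"url": url, "topic": topic, "chunk": i, "text": chunk},
        })
    return points

# Toy embedding so the sketch runs without a model: [character count, word count].
def toy_embed(text):
    return [float(len(text)), float(len(text.split()))]

doc = " ".join(f"w{i}" for i in range(1200))
points = build_points("https://example.com/post", "qdrant", doc, toy_embed)
print(len(points), [p["payload"]["chunk"] for p in points])
```

Deterministic IDs make the upsert idempotent: running the job twice on the same URL overwrites points instead of duplicating them.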

Log new-doc count + per-topic cost

Log how many new documents each topic produced and what it cost, and enforce a per-topic cost budget as a guardrail.
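The guardrail could look something like the sketch below. The `TopicBudget` class and the one-credit-per-call pricing are assumptions for illustration; check your actual plan for real per-call costs.

```python
import logging
from collections import defaultdict

log = logging.getLogger("wiki_ingest")

class TopicBudget:
    """Track credits spent per topic and refuse calls past a cap."""

    def __init__(self, max_credits_per_topic):
        self.max_credits = max_credits_per_topic
        self.spent = defaultdict(int)
        self.new_docs = defaultdict(int)

    def charge(self, topic, credits=1):
        """Record a call; return False without charging if it would exceed the cap."""
        if self.spent[topic] + credits > self.max_credits:
            log.warning("budget hit for %r at %d credits", topic, self.spent[topic])
            return False
        self.spent[topic] += credits
        return True

    def record_doc(self, topic):
        self.new_docs[topic] += 1

    def summary(self, topic):
        return {"topic": topic, "new_docs": self.new_docs[topic], "credits": self.spent[topic]}

budget = TopicBudget(max_credits_per_topic=5)
budget.charge("qdrant", credits=3)   # assumed cost of the three search calls
for _ in range(4):                   # four candidate URLs waiting for extraction
    if not budget.charge("qdrant"):
        break                        # stop extracting once the cap is reached
    budget.record_doc("qdrant")
report = budget.summary("qdrant")
print(report)
```

Checking the budget before each extract call means a topic with an unusually large result set stops early instead of draining the whole day's credits.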

Python Implementation

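import os

import requests
from qdrant_client import QdrantClient, models

API = 'https://api.scavio.dev/api/v1'
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
COLLECTION = 'wiki'  # assumed collection name; change to match your setup
qdrant = QdrantClient(url=os.environ['QDRANT_URL'])

def discover(topic):
    # Query each surface and merge whichever result lists come back.
    results = []
    for endpoint in ['search', 'reddit/search', 'youtube/search']:
        resp = requests.post(f'{API}/{endpoint}', headers=H, json={'query': topic}, timeout=30)
        resp.raise_for_status()
        r = resp.json()
        results.extend(r.get('organic_results', []) + r.get('posts', []) + r.get('videos', []))
    return results

def already_ingested(url):
    # Assumed helper: true if any stored point carries this URL in its payload.
    match = models.FieldCondition(key='url', match=models.MatchValue(value=url))
    hits = qdrant.count(collection_name=COLLECTION, count_filter=models.Filter(must=[match]), exact=True)
    return hits.count > 0

def ingest_topic(topic):
    seen = set()  # the same URL can surface on more than one platform
    for c in discover(topic):
        url = c.get('link') or c.get('url')
        if not url or url in seen or already_ingested(url):
            continue
        seen.add(url)
        resp = requests.post(f'{API}/extract', headers=H,
                             json={'url': url, 'format': 'markdown'}, timeout=60)
        resp.raise_for_status()
        md = resp.json().get('markdown', '')
        store(url, topic, md)  # store() is yours to supply: chunk, embed, upsert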

JavaScript Implementation

// Same flow in TS via Qdrant JS client + Scavio fetch calls.

Platforms Used

Google

Web search with knowledge graph, People Also Ask (PAA), and AI overviews

Reddit

Community posts and threaded comments from any subreddit

YouTube

Video search with transcripts and metadata

Frequently Asked Questions

What does the LLM Wiki Ingestion Workflow do?

It runs a daily wiki ingestion: for each topic, it pulls new sources from Google, Reddit, and YouTube via Scavio, extracts each page as markdown, embeds it into Qdrant, and dedupes by URL.

How is this workflow triggered?

A daily cron job runs at 6am over the active topic list.

Which Scavio platforms does this workflow use?

Google, Reddit, and YouTube. All three are called through the same Scavio API.

Can I run this workflow on the free tier?

Yes. Scavio's free tier includes 50 credits on signup with no credit card required. That is enough to test and validate this workflow before scaling it.

