
How to Build a Google Dorks + LLM Extraction Pipeline

Combine Google Dorks search with LLM extraction to turn PDFs and government portals into typed JSON. Pattern from r/LangChain's DaaS build.


An r/LangChain post described an autonomous data-as-a-service (DaaS) architecture built on Google Dorks, Llama-3, and MCP. The pattern works for any structured-document discovery job. This tutorial walks through the same flow using Scavio.

Prerequisites

  • Python 3.10+
  • Scavio API key
  • Groq or Anthropic API key (the example code in Step 5 uses Anthropic)

Walkthrough

Step 1: Dork patterns for the target

Each dork combines a site: operator, a filetype: operator, and one or more keywords.

Python
DORKS = ['site:gov.br filetype:pdf 2026 contratos', 'site:europa.eu filetype:pdf AI act']
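The three-part pattern can be generated instead of typed by hand. The following is a minimal sketch; `build_dork` is a hypothetical helper introduced here for illustration and is not part of any Scavio package.

```python
def build_dork(site: str, filetype: str, keywords: str) -> str:
    """Compose a 'site: filetype: keywords' query string."""
    parts = [f"site:{site}", f"filetype:{filetype}", keywords.strip()]
    # Drop an empty keyword part so the query has no trailing space.
    return " ".join(p for p in parts if p)


# Rebuilds the same two dorks used in this step.
DORKS = [
    build_dork("gov.br", "pdf", "2026 contratos"),
    build_dork("europa.eu", "pdf", "AI act"),
]
```

Keeping the domain, file type, and keywords as separate arguments makes it easier to sweep one of them (for example, several domains with the same keywords) without editing strings.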

Step 2: Run the dork via Scavio search

Send each dork to the Scavio search endpoint. The response contains organic results that point to PDFs.

Python
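import os
import requests

API_KEY = os.environ['SCAVIO_API_KEY']

def dork(q):
    # POST the dork query to Scavio search and return the parsed JSON response.
    resp = requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': API_KEY}, json={'query': q}, timeout=30)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
    return resp.json()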
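Different dorks can surface the same document, so it may help to merge results before filtering. This sketch assumes the response shape used in Step 3 (an `organic_results` list whose items carry a `link`); `collect_results` is a hypothetical helper, and the search function is passed in so the logic can be tried without network access.

```python
from typing import Callable, Iterable


def collect_results(dorks: Iterable[str], search: Callable[[str], dict]) -> list[dict]:
    """Run each dork through `search` and merge results, dropping duplicate links."""
    seen: set[str] = set()
    merged: list[dict] = []
    for q in dorks:
        for r in search(q).get("organic_results", []):
            link = r.get("link")
            # Skip results without a link and links already collected.
            if link and link not in seen:
                seen.add(link)
                merged.append(r)
    return merged
```

With the function from this step, the call would look like `collect_results(DORKS, dork)`.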

Step 3: Filter for fresh PDFs

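Filter by date or screen with an LLM. The version below keeps results whose link ends in .pdf and whose snippet mentions the target year.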

Python
def fresh_pdfs(results, year='2026'):
    # Keep results that link to a PDF and mention the target year in the snippet.
    return [r for r in results.get('organic_results', [])
            if year in r.get('snippet', '') and r.get('link', '').lower().endswith('.pdf')]
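A plain suffix check on the full URL misses links that carry a query string or fragment after the file name. As a sketch of a stricter test, the hypothetical helper `is_pdf_url` below inspects only the path component, using the standard library's `urllib.parse`.

```python
from urllib.parse import urlparse


def is_pdf_url(url: str) -> bool:
    """Return True when the URL path (ignoring query and fragment) ends in .pdf."""
    return urlparse(url).path.lower().endswith(".pdf")
```

This is still a heuristic: some portals serve PDFs from URLs with no `.pdf` in the path at all, which is where the LLM screening option becomes useful.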

Step 4: Extract PDF to text via Scavio extract

The extract endpoint is PDF-aware and returns the document as markdown.

Python
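def pdf_to_text(url):
    # Ask Scavio extract for a markdown rendering of the PDF at `url`.
    resp = requests.post('https://api.scavio.dev/api/v1/extract',
        headers={'x-api-key': API_KEY},
        json={'url': url, 'format': 'markdown'}, timeout=60)
    resp.raise_for_status()
    return resp.json().get('markdown', '')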
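A daily run tends to rediscover PDFs it has already processed, and each extract call costs a credit. One way to avoid paying twice is a small on-disk cache keyed by URL. This is a sketch under assumptions: `cached_extract` and the `.pdf_cache` directory are hypothetical names, and the extractor is passed in as a function.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_extract(url: str, extract: Callable[[str], str],
                   cache_dir: str = ".pdf_cache") -> str:
    """Return extracted text for `url`, reusing a local copy when one exists."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it is safe to use as a file name.
    path = folder / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".md")
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = extract(url)
    if text:  # do not cache empty results, so failed extractions are retried
        path.write_text(text, encoding="utf-8")
    return text
```

With the function from this step, the call would look like `cached_extract(url, pdf_to_text)`.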

Step 5: Convert the messy text to typed JSON with an LLM

Use a strict-schema prompt and reject any response that does not parse as JSON.

Python
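import anthropic, json
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def typed(md):
    prompt = ('Extract opportunity details as a single JSON object with keys '
              'title, deadline, amount, agency. Respond with JSON only. '
              f'Source: {md[:6000]}')
    msg = client.messages.create(model='claude-sonnet-4-6', max_tokens=600,
        messages=[{'role': 'user', 'content': prompt}])
    try:
        return json.loads(msg.content[0].text)
    except json.JSONDecodeError:
        return None  # reject responses that are not valid JSON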

Python Example

Python
# Daily run: 5 search calls + (5 dorks × ~20 PDFs) extract calls = ~105 calls, ~$0.45 on Project tier.

JavaScript Example

JavaScript
// TS version uses the same endpoints.

Expected Output

JSON
Government bid PDFs converted to typed JSON daily. Cache layer keeps repeat queries at sub-50ms.


Frequently Asked Questions

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

Python 3.10+. Scavio API key. Groq or Anthropic API key. A Scavio API key gives you 50 free credits on signup.

Yes. The free tier includes 50 credits on signup, which is more than enough to complete this tutorial and prototype a working solution.

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt to your framework of choice.


© 2026 Scavio. All rights reserved.
