Playwright Fallback: The Search-First Pattern in 2026

An r/LangChain DaaS post described the pattern: route by target type, sending indexed targets (about 85%) to a search API and auth-gated targets (about 15%) to a browser. The claimed result is 80%+ less captcha exposure.

April 30, 2026
5 min read

An r/LangChain post (also cross-posted to r/crewai) described an autonomous DaaS architecture for LATAM gov sites where Playwright kept breaking. The fallback: Google Dorks + Llama-3 + MCP. The pattern generalizes beyond LATAM gov; it applies to any agent that scrapes public data at scale.

Why pure-Playwright pipelines fail

Three failure modes show up at scale:

  • Cloudflare/captcha walls. Browsers look human enough to pass first checks but fail under repeated load. Captcha appears, then IP blocks.
  • JS rendering cost. Every page consumes metered browser-time, and a page that fails on a captcha is billed again on each retry. At scale, those reruns push the bill past search-API per-call costs.
  • Maintenance debt. Selectors change, anti-bot scripts update, captchas evolve. Every break costs an engineer-day.

The search-first fallback architecture

Route by target type:

  • Indexed/public targets -> structured search API (Scavio, SerpAPI, Tavily). Cheap, reliable, no browser fight.
  • Auth-gated/JS-only targets -> real browser (Playwright/Stagehand/Browserbase). Expensive but necessary.

For most agents, the first bucket is 80-95% of work. The second is the remaining minority that genuinely needs a browser.
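The routing rule above can be sketched as a small function. This is a minimal illustration, not the architecture from the original post: the `Target` class, its flags, and the example URLs are all hypothetical, and in practice the flags would come from your own knowledge of each site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical description of one scrape target."""
    url: str
    requires_login: bool = False  # auth-gated portal
    js_only: bool = False         # SPA with no server-side rendering
    indexed: bool = True          # believed to be in a public search index


def route(target: Target) -> str:
    """Return 'browser' only when the search path cannot work."""
    if target.requires_login or target.js_only or not target.indexed:
        return "browser"
    return "search"


# Example URLs are placeholders.
targets = [
    Target("https://example.gov/reports/2026.pdf"),
    Target("https://example.gov/portal/account", requires_login=True),
    Target("https://example.gov/app", js_only=True),
]
plan = {t.url: route(t) for t in targets}
```

Defaulting to the search path keeps the browser as the exception, which matches the split described above.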

The dorks-first pattern

Google Dorks (site:, filetype:, intitle:, inurl:) are the bridge between "I want this gov document" and "the gov portal blocks Playwright". If Google has already indexed the document, a dorked query finds it and the search API returns its URL as typed JSON, so the portal never sees a headless browser.

Python
import os

import requests

API = 'https://api.scavio.dev/api/v1'
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

# Google dork templates; {domain} and {topic} are filled in per query.
DORK_TEMPLATES = [
    'site:{domain} filetype:pdf {topic}',
    'site:{domain} intitle:{topic}',
    'site:{domain} inurl:reports {topic}',
    'site:{domain} {topic} 2026',
]

def search_first(domain, topic):
    """Dorked SERP via Scavio. No browser, no captcha fight."""
    urls = []
    for tpl in DORK_TEMPLATES:
        q = tpl.format(domain=domain, topic=topic)
        r = requests.post(f'{API}/search', headers=H, json={'query': q}, timeout=30)
        r.raise_for_status()  # fail loudly on auth or quota errors
        # Keep the top five organic results per dork.
        urls.extend(o['link'] for o in r.json().get('organic_results', [])[:5])
    # De-duplicate while preserving the order results came back in.
    return list(dict.fromkeys(urls))

def extract(url):
    """Markdown ready for LLM extraction."""
    r = requests.post(f'{API}/extract', headers=H,
        json={'url': url, 'format': 'markdown'}, timeout=60)
    r.raise_for_status()
    return r.json().get('markdown', '')
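One way to wire the search path in front of a browser path is a thin wrapper that only reaches for the browser when search fails or returns nothing. The sketch below is an assumption about how that could look, not code from the original post. It takes both paths as callables, so `search_first` from above can be passed as `search`, and `browser` stands in for whatever Playwright routine you already have.

```python
from typing import Callable, Iterable, List, Tuple

Finder = Callable[[str, str], Iterable[str]]


def fetch_with_fallback(
    domain: str, topic: str, search: Finder, browser: Finder
) -> Tuple[str, List[str]]:
    """Try the search path first; use the browser only if it fails or is empty."""
    try:
        urls = list(search(domain, topic))
    except Exception:
        # Deliberately broad: any search-side failure should fall through
        # to the browser path rather than abort the job.
        urls = []
    if urls:
        return "search", urls
    return "browser", list(browser(domain, topic))
```

Returning the path name alongside the URLs makes it easy to log how often the browser is actually needed.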

The Llama-3 (or any LLM) extraction step

Once the document is in markdown, structured extraction is a standard LLM prompt:

Python
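# Literal braces are doubled so str.format() only substitutes {md}.
PROMPT = '''Extract from this gov document:
- title, date_published, summary (3 sentences), key entities (list)
Return JSON: {{"title": ..., "date_published": ..., "summary": ..., "entities": [...]}}

Document:
{md}'''

# `llm` is any client exposing complete(); `markdown` is the output of extract().
result = llm.complete(PROMPT.format(md=markdown))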
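LLM replies often wrap the JSON in extra text, so it helps to validate the output before trusting it downstream. The helper below is a minimal sketch under that assumption. The function name `parse_extraction` is hypothetical, and the required keys simply mirror the prompt above.

```python
import json
import re

REQUIRED_KEYS = {"title", "date_published", "summary", "entities"}


def parse_extraction(raw: str) -> dict:
    """Parse an LLM reply into a dict and check the expected keys."""
    # Take the outermost {...} span, which drops surrounding chatter or fences.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in LLM output")
    data = json.loads(match.group(0))  # raises ValueError on malformed JSON
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["entities"], list):
        raise ValueError("entities must be a list")
    return data
```

A reply that fails validation can be retried or queued for review instead of silently producing a bad record.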

The MCP wiring

For agent stacks, expose the search-first pattern as a named MCP tool. Scavio's hosted MCP at https://mcp.scavio.dev/mcp already exposes search, reddit_search, youtube_search, extract as named tools. Attach to Claude Code with one line:

Bash
claude mcp add --transport http scavio https://mcp.scavio.dev/mcp \
  --header "x-api-key: $SCAVIO_API_KEY"

Cost math vs pure Playwright

For a 1,000-page gov document extraction job:

  • Pure Playwright on Browserbase Developer: ~$2-5 in browser-time + 30-50% captcha failure rate (= rerun costs)
  • Search-first with Scavio: ~$4.30 in search/extract credits + 1-2% failure rate

Raw cost is comparable; variance is wildly different. The search-first pipeline runs predictably; the Playwright pipeline burns engineer time on failures.
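The variance claim can be made concrete with a back-of-the-envelope calculation. The sketch below takes the figures quoted above and assumes every failed page is retried until it succeeds, that attempts fail independently at the same rate, and that retries are billed like first attempts. Under those assumptions the expected spend is the base cost divided by (1 - failure rate).

```python
def expected_cost(base_cost: float, failure_rate: float) -> float:
    """Expected spend for the whole job when failed pages are retried.

    Assumes independent attempts with a constant failure rate, so each page
    needs 1 / (1 - failure_rate) attempts on average.
    """
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return base_cost / (1 - failure_rate)


# Figures for the 1,000-page job quoted above.
playwright_low = expected_cost(2.00, 0.30)
playwright_high = expected_cost(5.00, 0.50)
search_low = expected_cost(4.30, 0.01)
search_high = expected_cost(4.30, 0.02)

print(f"Playwright:   ${playwright_low:.2f} to ${playwright_high:.2f}")
# prints "Playwright:   $2.86 to $10.00"
print(f"Search-first: ${search_low:.2f} to ${search_high:.2f}")
# prints "Search-first: $4.34 to $4.39"
```

The search-first estimate moves by a few cents across its range, while the Playwright estimate spans more than a factor of three, and none of this counts engineer time spent on failures.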

When search-first does not work

Three cases keep Playwright as the right call:

  • Auth-gated portals (login required)
  • JS-only SPAs that don't server-side render
  • Documents not indexed by Google (intranet pages, very recent uploads, paths disallowed by robots.txt)

For everything else, the search-first fallback wins on operational stability, which is what production agents actually optimize for.
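For the third case, a site's robots.txt gives a rough hint about whether a path is likely to be crawled at all. The sketch below uses Python's standard `urllib.robotparser` on robots.txt text you have already fetched. The helper name `crawl_allowed`, the sample rules, and the URLs are hypothetical. A disallow rule does not prove a document is missing from the index, so treat the result as a routing hint only.

```python
from urllib.robotparser import RobotFileParser


def crawl_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Sample rules for illustration only.
ROBOTS = """\
User-agent: *
Disallow: /uploads/
"""

reports_ok = crawl_allowed(ROBOTS, "https://example.gov/reports/2026.pdf")
uploads_ok = crawl_allowed(ROBOTS, "https://example.gov/uploads/new.pdf")
```

Paths where the check returns False are reasonable candidates to send straight to the browser path.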

The honest summary

The r/LangChain post pattern is real and portable. Replace the "Playwright breaks weekly" story in your stack with "dorked search via Scavio for 85% of targets, Playwright for the auth-gated 15%". The agent stops breaking; the engineer-time bill drops proportionally.
