Stop Using BeautifulSoup for RAG Search: Cleaner Alternatives
Scraping HTML with BeautifulSoup to feed a RAG pipeline adds fragile parsing code between your LLM and the web. Search APIs return clean, LLM-ready text directly, collapsing the scrape-parse-chunk-embed sequence into a single API call.
The Typical RAG Web Search Pipeline
Most LangChain tutorials for web-augmented RAG follow this pattern:
1. Take a user query
2. Google the query (manually or via API)
3. Fetch HTML of top results
4. Parse HTML with BeautifulSoup, strip scripts and styles
5. Chunk into ~500 token segments
6. Embed chunks
7. Retrieve top-k chunks by cosine similarity
8. Pass to LLM
Steps 3-6 are where things break. BeautifulSoup fails on JavaScript-rendered pages because it parses only the HTML the server returns and never executes scripts. Chunking strategies require tuning. Embeddings add latency and cost. The LLM at the end does not need a vector-retrieved chunk; it needs the relevant text.
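To make the chunk-and-retrieve steps concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method any particular tutorial uses: whitespace-separated words stand in for tokens, and the vectors passed to top_k are assumed to come from whatever embedding model you use. The names chunk_text, cosine_similarity, and top_k are hypothetical.

```python
import math


def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words.

    Assumption: words approximate tokens; a real pipeline would use the
    tokenizer that matches its embedding model.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose vectors are most similar to query_vec."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]
```

Each of these functions carries something to tune (chunk size, k, the embedding model behind the vectors), which is the maintenance cost the search-API approach avoids.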
Simplified Pipeline with Search API
Search APIs with built-in markdown extraction skip steps 3-6 entirely:
```python
import os

import requests
from langchain_core.tools import tool

# Assumption: the API key is stored in an environment variable named SCAVIO_API_KEY.
API_KEY = os.environ["SCAVIO_API_KEY"]


@tool
def web_search(query: str) -> str:
    """Search the web and return relevant context for a question."""
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": API_KEY},
        json={"query": query, "num": 5},
        timeout=10,  # avoid hanging the agent on a slow response
    )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    results = resp.json().get("organic_results", [])
    context_parts = []
    for r in results:
        title = r.get("title", "")
        snippet = r.get("snippet", "")
        link = r.get("link", "")
        if snippet:
            # A markdown link plus the snippet keeps the source attached to the text
            context_parts.append(f"[{title}]({link})\n{snippet}")
    return "\n\n".join(context_parts)
```

This tool returns clean text snippets ready to pass directly to your LLM. For deeper content, combine with Jina Reader for full-page extraction on the top 1-2 results:
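```python
import os

# Assumption: the Jina key is stored in an environment variable named JINA_API_KEY.
JINA_KEY = os.environ["JINA_API_KEY"]

def fetch_full_content(url: str) -> str:
    resp = requests.get(
        f"https://r.jina.ai/{url}",
        headers={"Authorization": f"Bearer {JINA_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text  # Clean markdown, no HTML
```

Jina charges $0.002 per 1k tokens (about $0.01 per page). For most RAG use cases, the snippet from the search API is sufficient without needing the full page.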
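One way to combine the two sources is to keep snippets for every result and fetch the full page only for the first one or two. The sketch below assumes result dictionaries with the title, snippet, and link keys used by the search tool above. The name build_context and the injected fetch_page callable are hypothetical, chosen so the logic can be exercised without network access.

```python
from typing import Callable


def build_context(
    results: list[dict],
    fetch_page: Callable[[str], str],
    full_pages: int = 2,
    max_chars: int = 4000,
) -> str:
    """Build LLM context: full text for the top results, snippets for the rest."""
    parts = []
    for rank, r in enumerate(results):
        title = r.get("title", "")
        link = r.get("link", "")
        body = r.get("snippet", "")
        if rank < full_pages and link:
            try:
                # Truncate so one long page cannot crowd out the other results
                body = fetch_page(link)[:max_chars]
            except Exception:
                pass  # fall back to the snippet if the fetch fails
        if body:
            parts.append(f"[{title}]({link})\n{body}")
    return "\n\n".join(parts)
```

In practice you would pass fetch_full_content as fetch_page. Falling back to the snippet means a failed page fetch degrades the context rather than failing the whole query.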
Tavily's Built-In Approach
Tavily was purpose-built for this use case and includes content extraction in the search result:
```python
from tavily import TavilyClient

client = TavilyClient(api_key=TAVILY_KEY)  # TAVILY_KEY: your Tavily API key
result = client.search(
    query="current GPT-4o pricing",
    search_depth="advanced",  # deeper retrieval, billed at 2 credits
    include_raw_content=True,  # also return each page's extracted content
)
# result['results'][0]['content'] contains clean text
# result['results'][0]['raw_content'] holds the extracted page when include_raw_content=True
```

Tavily's advanced search mode fetches and extracts page content server-side. The tradeoff: it costs 2 credits instead of 1. At $0.008/credit, a deep search call costs $0.016.
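The credit arithmetic is easy to scale to a larger volume. This sketch uses only the figures quoted in this section (1 credit for a basic search, 2 for advanced, $0.008 per credit). The function name and the 10,000-call volume are illustrative assumptions, and actual pricing should be checked against Tavily's current plans.

```python
from decimal import Decimal

PRICE_PER_CREDIT = Decimal("0.008")  # figure quoted in this article
CREDITS = {"basic": 1, "advanced": 2}  # credits per search call, per the article


def search_cost(calls: int, depth: str = "basic") -> Decimal:
    """Return the dollar cost of `calls` searches at the given depth."""
    return PRICE_PER_CREDIT * CREDITS[depth] * calls


print(search_cost(1, "advanced"))       # 0.016
print(search_cost(10_000, "basic"))     # 80.000
print(search_cost(10_000, "advanced"))  # 160.000
```

Decimal is used instead of float so the per-credit price multiplies out exactly.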
When BeautifulSoup Still Makes Sense
If your RAG pipeline needs to extract data from specific, known URLs — not search results — BeautifulSoup or a dedicated scraper is still appropriate. Use cases:
- Extracting structured data from a specific site's known URL pattern
- Monitoring a specific page for changes
- Scraping behind-auth content where search APIs have no access
For these cases, pair BeautifulSoup with Playwright rather than raw requests, so the page's JavaScript runs before you parse it. The parsing code is still fragile, but you avoid the failure mode where the content never renders.
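For the page-monitoring use case, the comparison step needs no parsing library once you have the extracted text. A minimal sketch, assuming the page text already comes from your scraper. The function names are hypothetical, and hashing whitespace-normalized text is just one simple way to detect changes.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Hash page text after collapsing whitespace, so reflowed markup
    with identical wording produces the same fingerprint."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_text: str) -> bool:
    """Return True when the page text differs from the stored fingerprint."""
    return content_fingerprint(current_text) != previous_fingerprint
```

Storing only the fingerprint between runs keeps the monitor's state small. A real monitor would likely also strip timestamps or other volatile elements before hashing.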
LangChain Integration Comparison
LangChain supports both approaches, with built-in wrappers and loaders for the old one and the tool decorator for the new one:
```python
# Old approach
from langchain_community.utilities import SerpAPIWrapper
from langchain_community.document_loaders import WebBaseLoader

# New approach: custom tool with search API
from langchain_core.tools import tool

# Register your search tool
tools = [web_search]  # the function defined above
```

The custom tool approach gives you more control over what text reaches the LLM. The LangChain built-in SerpAPI wrapper returns raw SERP data that your chain must process further. A custom tool returns exactly the text format you want.
Latency and Cost Comparison
For a single RAG query:
| Approach | Steps | Approximate latency | Approximate cost |
|---|---|---|---|
| Scrape + BeautifulSoup | 5+ steps | 3-8s | $0.05-0.50/page |
| Search API snippets | 1 step | 0.5-2s | $0.005-0.008 |
| Search API + Jina full page | 2 steps | 1-3s | $0.015-0.02 |
| Tavily advanced | 1 step | 1-3s | $0.016 |
For latency-sensitive applications such as chat interfaces, the estimates above put the single-step approach at roughly 4-6x faster than scrape-parse pipelines (0.5-2s versus 3-8s).
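To compare the approaches at volume, the per-query cost ranges from the table can be multiplied out. The sketch below copies the table's figures as-is. The 1,000-query volume is an arbitrary assumption, and because the table prices the scrape row per page, it is treated here as one page per query.

```python
# (low, high) cost in dollars per query, copied from the table above
COST_PER_QUERY = {
    "Scrape + BeautifulSoup": (0.05, 0.50),  # table lists this per page
    "Search API snippets": (0.005, 0.008),
    "Search API + Jina full page": (0.015, 0.02),
    "Tavily advanced": (0.016, 0.016),
}


def projected_cost(queries: int) -> dict[str, tuple[float, float]]:
    """Scale each approach's per-query cost range to a query volume."""
    return {
        name: (round(low * queries, 2), round(high * queries, 2))
        for name, (low, high) in COST_PER_QUERY.items()
    }


for name, (low, high) in projected_cost(1_000).items():
    print(f"{name}: ${low:.2f} to ${high:.2f}")
```

These are the article's approximate figures scaled linearly, so treat the output as a rough comparison rather than a quote.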