The Olostep LangChain integration provides tools to build AI agents that can search, scrape, analyze, and structure data from any website, and it works in both LangChain and LangGraph applications.
Features
The integration provides access to all 5 Olostep API capabilities:
Scrapes: Extract content from any single URL in multiple formats (Markdown, HTML, JSON, text)
Batches: Process up to 10,000 URLs in parallel; batch jobs complete in 5-8 minutes
Answers: AI-powered web search with natural language queries and structured output
Maps: Extract all URLs from a website for site structure analysis
Crawls: Autonomously discover and scrape entire websites by following links
Installation
pip install langchain-olostep
Setup
Set your Olostep API key as an environment variable:
export OLOSTEP_API_KEY="your_olostep_api_key_here"
Get your API key from the Olostep Dashboard.
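If you prefer configuring the key in code (for example in a notebook), a minimal sketch that sets the same environment variable before any tool is used:

import os

# Set the key before invoking any Olostep tools
os.environ["OLOSTEP_API_KEY"] = "your_olostep_api_key_here"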
scrape_website
Extract content from a single URL. Supports multiple formats and JavaScript rendering.
Website URL to scrape (must include http:// or https://)
Output format: markdown, html, json, or text
Country code for location-specific content (e.g., “US”, “GB”, “CA”)
Wait time in milliseconds for JavaScript rendering (0-10000)
Optional parser ID for specialized extraction (e.g., “@olostep/amazon-product”)
Basic Scraping
from langchain_olostep import scrape_website
import asyncio

# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))
print(content)
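The same call accepts the other parameters listed above. A minimal sketch of scraping a JavaScript-heavy page (using wait_before_scraping) and of using the pre-built parser shown in the E-commerce example below; the URLs are placeholders:

from langchain_olostep import scrape_website
import asyncio

# Wait 3 seconds for dynamic content to render before scraping
js_content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown",
    "wait_before_scraping": 3000
}))

# Use a pre-built parser to get structured JSON for a product page
product = asyncio.run(scrape_website.ainvoke({
    "url": "https://www.amazon.com/dp/PRODUCT_ID",
    "parser": "@olostep/amazon-product",
    "format": "json"
}))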
scrape_batch
Process multiple URLs in parallel (up to 10,000 at once).
Output format for all URLs: markdown, html, json, or text
Country code for location-specific content
Wait time in milliseconds for JavaScript rendering
Optional parser ID for specialized extraction
from langchain_olostep import scrape_batch
import asyncio

# Scrape multiple URLs
result = asyncio.run(scrape_batch.ainvoke({
    "urls": [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ],
    "format": "markdown"
}))
print(result)
# Returns: {"batch_id": "batch_xxx", "status": "in_progress", ...}
answer_question
Search the web and get AI-powered answers with sources. Perfect for data enrichment and research.
Question or task to search for
Optional JSON schema dict/string describing desired output format
Simple Question
from langchain_olostep import answer_question
import asyncio

# Ask a simple question
result = asyncio.run(answer_question.ainvoke({
    "task": "What is the capital of France?"
}))
print(result)
# Returns: {"answer": {"result": "Paris"}, "sources": [...]}
extract_urls
Extract all URLs from a website for site structure analysis.
Website URL to extract URLs from
Optional search query to filter URLs
Limit the number of URLs returned
Glob patterns to include (e.g., [“/blog/**”])
Glob patterns to exclude (e.g., [“/admin/**”])
Extract All URLs
from langchain_olostep import extract_urls
import asyncio

# Get all URLs from a website
result = asyncio.run(extract_urls.ainvoke({
    "url": "https://example.com",
    "top_n": 100
}))
print(result)
# Returns: {"urls": [...], "total_urls": 100, ...}
crawl_website
Autonomously discover and scrape entire websites by following links.
Starting URL for the crawl
Maximum number of pages to crawl
Glob patterns to include (e.g., ["/**"] for all)
Glob patterns to exclude (e.g., ["/admin/**"])
Maximum depth to crawl from start_url
Crawl Website
from langchain_olostep import crawl_website
import asyncio

# Crawl entire documentation site
result = asyncio.run(crawl_website.ainvoke({
    "start_url": "https://docs.example.com",
    "max_pages": 100
}))
print(result)
# Returns: {"crawl_id": "crawl_xxx", "status": "in_progress", ...}
LangChain Agent Integration
Build intelligent agents that can search and scrape the web:
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import (
    scrape_website,
    answer_question,
    extract_urls
)

# Create agent with Olostep tools
tools = [scrape_website, answer_question, extract_urls]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
result = agent.run("""
Research the company at https://company.com:
1. Scrape their about page
2. Search for their latest funding round
3. Extract all their product pages
""")
print(result)
LangGraph Integration
Build complex multi-step workflows with LangGraph:
from langgraph.graph import StateGraph, END
from langchain_olostep import (
    scrape_website,
    scrape_batch,
    answer_question,
    extract_urls
)
from langchain_openai import ChatOpenAI
import json

def create_research_agent():
    workflow = StateGraph(dict)

    def discover_pages(state):
        # Extract all URLs from the target site
        result = extract_urls.invoke({
            "url": state["target_url"],
            "include_urls": ["/product/**"],
            "top_n": 50
        })
        state["urls"] = json.loads(result)["urls"]
        return state

    def scrape_pages(state):
        # Scrape discovered pages in batch
        result = scrape_batch.invoke({
            "urls": state["urls"],
            "format": "markdown"
        })
        state["batch_id"] = json.loads(result)["batch_id"]
        return state

    def answer_questions(state):
        # Use AI to answer questions about the data
        result = answer_question.invoke({
            "task": state["research_question"],
            "json_schema": state["desired_format"]
        })
        state["answer"] = json.loads(result)["answer"]
        return state

    workflow.add_node("discover", discover_pages)
    workflow.add_node("scrape", scrape_pages)
    workflow.add_node("analyze", answer_questions)

    workflow.set_entry_point("discover")
    workflow.add_edge("discover", "scrape")
    workflow.add_edge("scrape", "analyze")
    workflow.add_edge("analyze", END)

    return workflow.compile()

# Use the agent
agent = create_research_agent()
result = agent.invoke({
    "target_url": "https://store.com",
    "research_question": "What are the top 5 most expensive products?",
    "desired_format": {
        "products": [{"name": "", "price": "", "url": ""}]
    }
})
Advanced Use Cases
Data Enrichment
Enrich spreadsheet data with web information:
from langchain_olostep import answer_question

companies = ["Stripe", "Shopify", "Square"]

for company in companies:
    result = answer_question.invoke({
        "task": f"Find information about {company}",
        "json_schema": {
            "ceo": "",
            "headquarters": "",
            "employee_count": "",
            "latest_funding": ""
        }
    })
    print(f"{company}: {result}")
E-commerce Product Scraping
Scrape product data with specialized parsers:
from langchain_olostep import scrape_website

# Scrape Amazon product
result = scrape_website.invoke({
    "url": "https://www.amazon.com/dp/PRODUCT_ID",
    "parser": "@olostep/amazon-product",
    "format": "json"
})
# Returns structured product data: price, title, rating, etc.
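The result comes back as a string, so parse it before use. A minimal sketch, assuming the parser output exposes the title, price, and rating fields mentioned above (exact field names and nesting depend on the parser):

import json

# Parse the structured product data returned by the parser
product = json.loads(result)
print(product.get("title"), product.get("price"), product.get("rating"))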
SEO Audit
Analyze entire websites for SEO:
from langchain_olostep import extract_urls, scrape_batch
import json

# 1. Discover all pages
urls_result = extract_urls.invoke({
    "url": "https://yoursite.com",
    "top_n": 1000
})

# 2. Scrape all pages
urls = json.loads(urls_result)["urls"]
batch_result = scrape_batch.invoke({
    "urls": urls,
    "format": "html"
})
Documentation Scraping
Crawl and extract documentation:
from langchain_olostep import crawl_website

# Crawl entire docs site
result = crawl_website.invoke({
    "start_url": "https://docs.example.com",
    "max_pages": 500,
    "include_urls": ["/docs/**"],
    "exclude_urls": ["/api/**", "/v1/**"]
})
Specialized Parsers
Olostep provides pre-built parsers for popular websites:
@olostep/google-search - Google search results
@olostep/amazon-product - Amazon product pages
Use them with the parser parameter:
scrape_website.invoke({
    "url": "https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
    "parser": "@olostep/google-search"
})
Error Handling
from langchain_core.exceptions import LangChainException
from langchain_olostep import scrape_website

# Use await inside an async function, or wrap the call in asyncio.run()
try:
    result = await scrape_website.ainvoke({
        "url": "https://example.com"
    })
except LangChainException as e:
    print(f"Scraping failed: {e}")
Best Practices
Use Batch Processing for Multiple URLs
When scraping more than 3-5 URLs, use scrape_batch instead of multiple scrape_website calls. Batch processing is much faster and more cost-effective.
For JavaScript-heavy sites, use the wait_before_scraping parameter (2000-5000 ms is typical). This ensures dynamic content is fully loaded before the page is scraped.
For popular websites (Amazon, LinkedIn, Google), use our pre-built parsers to get structured data automatically.
When using extract_urls or crawl_website, use glob patterns to focus on relevant pages and avoid unnecessary processing.
Implement exponential backoff for rate limit errors. The API automatically handles most rate limiting internally.
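A minimal backoff sketch, assuming failures surface as LangChainException as in the Error Handling example above (adjust the exception type and delays for your workload):

import asyncio
from langchain_core.exceptions import LangChainException
from langchain_olostep import scrape_website

async def scrape_with_backoff(url, retries=4):
    # Retry with exponentially growing delays: 1s, 2s, 4s, then give up
    for attempt in range(retries):
        try:
            return await scrape_website.ainvoke({"url": url, "format": "markdown"})
        except LangChainException:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)

result = asyncio.run(scrape_with_backoff("https://example.com"))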
Support