The Olostep LangChain integration provides comprehensive tools to build AI agents that can search, scrape, analyze, and structure data from any website. Perfect for LangChain and LangGraph applications.

Features

The integration provides access to all 5 Olostep API capabilities:

Scrapes

Extract content from any single URL in multiple formats (Markdown, HTML, JSON, text)

Batches

Process up to 10,000 URLs in parallel. Batch jobs complete in 5-8 minutes

Answers

AI-powered web search with natural language queries and structured output

Maps

Extract all URLs from a website for site structure analysis

Crawls

Autonomously discover and scrape entire websites by following links

Installation

pip install langchain-olostep

Setup

Set your Olostep API key as an environment variable:
export OLOSTEP_API_KEY="your_olostep_api_key_here"
Get your API key from the Olostep Dashboard.
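
As a quick sanity check, you can verify the variable is set before running any tools. A minimal sketch; the variable name comes from the step above:
import os

# Fail fast if the key is missing; the Olostep tools read OLOSTEP_API_KEY
# from the environment
if not os.environ.get("OLOSTEP_API_KEY"):
    raise RuntimeError("Set OLOSTEP_API_KEY before using the Olostep tools")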

Available Tools

scrape_website

Extract content from a single URL. Supports multiple formats and JavaScript rendering.
• url (string, required): Website URL to scrape (must include http:// or https://)
• format (string, default: "markdown"): Output format: markdown, html, json, or text
• country (string): Country code for location-specific content (e.g., "US", "GB", "CA")
• wait_before_scraping (integer): Wait time in milliseconds for JavaScript rendering (0-10000)
• parser (string): Optional parser ID for specialized extraction (e.g., "@olostep/amazon-product")

from langchain_olostep import scrape_website
import asyncio

# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))

print(content)
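
The optional parameters can be combined. A sketch that waits for client-side rendering and requests US-localized content; the URL is a placeholder and the values follow the parameter list above:
from langchain_olostep import scrape_website
import asyncio

# Hypothetical JavaScript-heavy page: wait 3 seconds for rendering and
# request the page as seen from the US
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com/app",
    "format": "html",
    "country": "US",
    "wait_before_scraping": 3000
}))

print(content)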

scrape_batch

Process multiple URLs in parallel (up to 10,000 at once).
• urls (array, required): List of URLs to scrape
• format (string, default: "markdown"): Output format for all URLs: markdown, html, json, or text
• country (string): Country code for location-specific content
• wait_before_scraping (integer): Wait time in milliseconds for JavaScript rendering
• parser (string): Optional parser ID for specialized extraction

from langchain_olostep import scrape_batch
import asyncio

# Scrape multiple URLs
result = asyncio.run(scrape_batch.ainvoke({
    "urls": [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ],
    "format": "markdown"
}))

print(result)
# Returns: {"batch_id": "batch_xxx", "status": "in_progress", ...}

answer_question

Search the web and get AI-powered answers with sources. Perfect for data enrichment and research.
• task (string, required): Question or task to search for
• json_schema (object): Optional JSON schema dict/string describing the desired output format

from langchain_olostep import answer_question
import asyncio

# Ask a simple question
result = asyncio.run(answer_question.ainvoke({
    "task": "What is the capital of France?"
}))

print(result)
# Returns: {"answer": {"result": "Paris"}, "sources": [...]}
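
For structured output, pass a json_schema describing the shape you want back. A minimal sketch; the schema keys here are illustrative, not fixed names:
from langchain_olostep import answer_question
import asyncio

# The schema keys ("founders", "year_founded") are examples, not a fixed API
result = asyncio.run(answer_question.ainvoke({
    "task": "Who founded Stripe and in what year?",
    "json_schema": {"founders": [""], "year_founded": ""}
}))

print(result)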

extract_urls

Extract all URLs from a website for site structure analysis.
• url (string, required): Website URL to extract URLs from
• search_query (string): Optional search query to filter URLs
• top_n (integer): Limit the number of URLs returned
• include_urls (array): Glob patterns to include (e.g., ["/blog/**"])
• exclude_urls (array): Glob patterns to exclude (e.g., ["/admin/**"])

from langchain_olostep import extract_urls
import asyncio

# Get all URLs from a website
result = asyncio.run(extract_urls.ainvoke({
    "url": "https://example.com",
    "top_n": 100
}))

print(result)
# Returns: {"urls": [...], "total_urls": 100, ...}
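
Glob patterns narrow extraction to the relevant sections of a site. A sketch that keeps blog pages and drops admin pages; the patterns follow the examples in the parameter list above:
from langchain_olostep import extract_urls
import asyncio

# Keep only blog URLs and skip admin URLs
result = asyncio.run(extract_urls.ainvoke({
    "url": "https://example.com",
    "include_urls": ["/blog/**"],
    "exclude_urls": ["/admin/**"],
    "top_n": 100
}))

print(result)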

crawl_website

Autonomously discover and scrape entire websites by following links.
• start_url (string, required): Starting URL for the crawl
• max_pages (integer, default: 100): Maximum number of pages to crawl
• include_urls (array): Glob patterns to include (e.g., ["/**"] for all)
• exclude_urls (array): Glob patterns to exclude (e.g., ["/admin/**"])
• max_depth (integer): Maximum depth to crawl from start_url
• include_external (boolean, default: false): Include external URLs

from langchain_olostep import crawl_website
import asyncio

# Crawl entire documentation site
result = asyncio.run(crawl_website.ainvoke({
    "start_url": "https://docs.example.com",
    "max_pages": 100
}))

print(result)
# Returns: {"crawl_id": "crawl_xxx", "status": "in_progress", ...}
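
To bound a crawl more tightly, combine max_depth with include_external. A sketch with illustrative values:
from langchain_olostep import crawl_website
import asyncio

# Follow links at most 2 hops from the start URL and stay on the same site
result = asyncio.run(crawl_website.ainvoke({
    "start_url": "https://docs.example.com",
    "max_pages": 100,
    "max_depth": 2,
    "include_external": False
}))

print(result)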

LangChain Agent Integration

Build intelligent agents that can search and scrape the web:
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import (
    scrape_website,
    answer_question,
    extract_urls
)

# Create agent with Olostep tools
tools = [scrape_website, answer_question, extract_urls]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
result = agent.run("""
Research the company at https://company.com:
1. Scrape their about page
2. Search for their latest funding round
3. Extract all their product pages
""")

print(result)

LangGraph Integration

Build complex multi-step workflows with LangGraph:
from typing import List, TypedDict
from langgraph.graph import StateGraph, END
from langchain_olostep import (
    scrape_batch,
    answer_question,
    extract_urls
)
import json

# State schema: LangGraph needs annotated keys to know which channels to track
class ResearchState(TypedDict, total=False):
    target_url: str
    research_question: str
    desired_format: dict
    urls: List[str]
    batch_id: str
    answer: dict

def create_research_agent():
    workflow = StateGraph(ResearchState)
    
    def discover_pages(state):
        # Extract product URLs from the target site
        result = extract_urls.invoke({
            "url": state["target_url"],
            "include_urls": ["/product/**"],
            "top_n": 50
        })
        return {"urls": json.loads(result)["urls"]}
    
    def scrape_pages(state):
        # Scrape discovered pages in batch
        result = scrape_batch.invoke({
            "urls": state["urls"],
            "format": "markdown"
        })
        return {"batch_id": json.loads(result)["batch_id"]}
    
    def answer_questions(state):
        # Use AI to answer questions about the data
        result = answer_question.invoke({
            "task": state["research_question"],
            "json_schema": state["desired_format"]
        })
        return {"answer": json.loads(result)["answer"]}
    
    workflow.add_node("discover", discover_pages)
    workflow.add_node("scrape", scrape_pages)
    workflow.add_node("analyze", answer_questions)
    
    workflow.set_entry_point("discover")
    workflow.add_edge("discover", "scrape")
    workflow.add_edge("scrape", "analyze")
    workflow.add_edge("analyze", END)
    
    return workflow.compile()

# Use the agent
agent = create_research_agent()
result = agent.invoke({
    "target_url": "https://store.com",
    "research_question": "What are the top 5 most expensive products?",
    "desired_format": {
        "products": [{"name": "", "price": "", "url": ""}]
    }
})

Advanced Use Cases

Data Enrichment

Enrich spreadsheet data with web information:
from langchain_olostep import answer_question

companies = ["Stripe", "Shopify", "Square"]

for company in companies:
    result = answer_question.invoke({
        "task": f"Find information about {company}",
        "json_schema": {
            "ceo": "",
            "headquarters": "",
            "employee_count": "",
            "latest_funding": ""
        }
    })
    print(f"{company}: {result}")

E-commerce Product Scraping

Scrape product data with specialized parsers:
from langchain_olostep import scrape_website

# Scrape Amazon product
result = scrape_website.invoke({
    "url": "https://www.amazon.com/dp/PRODUCT_ID",
    "parser": "@olostep/amazon-product",
    "format": "json"
})
# Returns structured product data: price, title, rating, etc.
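
As with the other tools, the result comes back as a JSON string, so the structured fields can be read with json.loads. The keys below ("title", "price") are assumptions based on the comment above; inspect the parsed dict for the parser's actual schema:
import json

data = json.loads(result)
# "title" and "price" are assumed keys; the actual schema depends on the parser
print(data.get("title"), data.get("price"))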

SEO Audit

Analyze entire websites for SEO:
from langchain_olostep import extract_urls, scrape_batch
import json

# 1. Discover all pages
urls_result = extract_urls.invoke({
    "url": "https://yoursite.com",
    "top_n": 1000
})

# 2. Scrape all pages
urls = json.loads(urls_result)["urls"]
batch_result = scrape_batch.invoke({
    "urls": urls,
    "format": "html"
})

Documentation Scraping

Crawl and extract documentation:
from langchain_olostep import crawl_website

# Crawl entire docs site
result = crawl_website.invoke({
    "start_url": "https://docs.example.com",
    "max_pages": 500,
    "include_urls": ["/docs/**"],
    "exclude_urls": ["/api/**", "/v1/**"]
})

Specialized Parsers

Olostep provides pre-built parsers for popular websites:
  • @olostep/google-search - Google search results
  • @olostep/amazon-product - Amazon product pages
Use them with the parser parameter:
from langchain_olostep import scrape_website

result = scrape_website.invoke({
    "url": "https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
    "parser": "@olostep/google-search"
})

Error Handling

import asyncio
from langchain_core.exceptions import LangChainException
from langchain_olostep import scrape_website

try:
    result = asyncio.run(scrape_website.ainvoke({
        "url": "https://example.com"
    }))
except LangChainException as e:
    print(f"Scraping failed: {e}")

Best Practices

• When scraping more than 3-5 URLs, use scrape_batch instead of multiple scrape_website calls; batch processing is much faster and more cost-effective.
• For JavaScript-heavy sites, set the wait_before_scraping parameter (2000-5000 ms is typical) so dynamic content is fully loaded before extraction.
• For popular websites (Amazon, LinkedIn, Google), use our pre-built parsers to get structured data automatically.
• When using extract_urls or crawl_website, use glob patterns to focus on relevant pages and avoid unnecessary processing.
• Implement exponential backoff for rate limit errors, as in the sketch after this list; the API automatically handles most rate limiting internally.
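
A minimal backoff sketch, assuming failures surface as LangChainException as in the error-handling example above:
import asyncio
from langchain_core.exceptions import LangChainException
from langchain_olostep import scrape_website

async def scrape_with_backoff(url: str, max_retries: int = 4):
    # Retry with exponentially growing delays: 1s, 2s, 4s, then give up
    for attempt in range(max_retries):
        try:
            return await scrape_website.ainvoke({"url": url})
        except LangChainException:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)

result = asyncio.run(scrape_with_backoff("https://example.com"))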

Support