The Olostep LangChain integration provides comprehensive tools to build AI agents that can search, scrape, analyze, and structure data from any website. Perfect for LangChain and LangGraph applications.

Features

The integration provides access to all 5 Olostep API capabilities:

Scrapes

Extract content from any single URL in multiple formats (Markdown, HTML, JSON, text)

Batches

Process up to 10,000 URLs in parallel. Batch jobs complete in 5-8 minutes

Answers

AI-powered web search with natural language queries and structured output

Maps

Extract all URLs from a website for site structure analysis

Crawls

Autonomously discover and scrape entire websites by following links

Installation

pip install langchain-olostep

Setup

Set your Olostep API key as an environment variable:
export OLOSTEP_API_KEY="your_olostep_api_key_here"
Get your API key from the Olostep Dashboard.
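
As a quick sanity check, you can verify the variable is set before running any tools. A minimal sketch; the variable name comes from the step above:
import os

# Fail fast if the key is missing; the Olostep tools read OLOSTEP_API_KEY
# from the environment
if not os.environ.get("OLOSTEP_API_KEY"):
    raise RuntimeError("Set OLOSTEP_API_KEY before using the Olostep tools")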

Available Tools

scrape_website

Extract content from a single URL. Supports multiple formats and JavaScript rendering.
• url (string, required): Website URL to scrape (must include http:// or https://)
• format (string, default: "markdown"): Output format: markdown, html, json, or text
• country (string): Country code for location-specific content (e.g., "US", "GB", "CA")
• wait_before_scraping (integer): Wait time in milliseconds for JavaScript rendering (0-10000)
• parser (string): Optional parser ID for specialized extraction (e.g., "@olostep/amazon-product")

from langchain_olostep import scrape_website
import asyncio

# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))

print(content)
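
The optional parameters can be combined. A sketch that waits for client-side rendering and requests US-localized content; the URL is a placeholder and the values follow the parameter list above:
from langchain_olostep import scrape_website
import asyncio

# Hypothetical JavaScript-heavy page: wait 3 seconds for rendering and
# request the page as seen from the US
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com/app",
    "format": "html",
    "country": "US",
    "wait_before_scraping": 3000
}))

print(content)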

scrape_batch

Process multiple URLs in parallel (up to 10,000 at once).
• urls (array, required): List of URLs to scrape
• format (string, default: "markdown"): Output format for all URLs: markdown, html, json, or text
• country (string): Country code for location-specific content
• wait_before_scraping (integer): Wait time in milliseconds for JavaScript rendering
• parser (string): Optional parser ID for specialized extraction

from langchain_olostep import scrape_batch
import asyncio

# Scrape multiple URLs
result = asyncio.run(scrape_batch.ainvoke({
    "urls": [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ],
    "format": "markdown"
}))

print(result)
# Returns: {"batch_id": "batch_xxx", "status": "in_progress", ...}

answer_question

Search the web and get AI-powered answers with sources. Perfect for data enrichment and research.
• task (string, required): Question or task to search for
• json_schema (object): Optional JSON schema dict/string describing the desired output format

from langchain_olostep import answer_question
import asyncio

# Ask a simple question
result = asyncio.run(answer_question.ainvoke({
    "task": "What is the capital of France?"
}))

print(result)
# Returns: {"answer": {"result": "Paris"}, "sources": [...]}
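
For structured output, pass a json_schema describing the shape you want back. A minimal sketch; the schema keys here are illustrative, not fixed names:
from langchain_olostep import answer_question
import asyncio

# The schema keys ("founders", "year_founded") are examples, not a fixed API
result = asyncio.run(answer_question.ainvoke({
    "task": "Who founded Stripe and in what year?",
    "json_schema": {"founders": [""], "year_founded": ""}
}))

print(result)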

extract_urls

Extract all URLs from a website for site structure analysis.
• url (string, required): Website URL to extract URLs from
• search_query (string): Optional search query to filter URLs
• top_n (integer): Limit the number of URLs returned
• include_urls (array): Glob patterns to include (e.g., ["/blog/**"])
• exclude_urls (array): Glob patterns to exclude (e.g., ["/admin/**"])

from langchain_olostep import extract_urls
import asyncio

# Get all URLs from a website
result = asyncio.run(extract_urls.ainvoke({
    "url": "https://example.com",
    "top_n": 100
}))

print(result)
# Returns: {"urls": [...], "total_urls": 100, ...}
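
Glob patterns narrow extraction to the relevant sections of a site. A sketch that keeps blog pages and drops admin pages; the patterns follow the examples in the parameter list above:
from langchain_olostep import extract_urls
import asyncio

# Keep only blog URLs and skip admin URLs
result = asyncio.run(extract_urls.ainvoke({
    "url": "https://example.com",
    "include_urls": ["/blog/**"],
    "exclude_urls": ["/admin/**"],
    "top_n": 100
}))

print(result)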

crawl_website

Autonomously discover and scrape entire websites by following links.
• start_url (string, required): Starting URL for the crawl
• max_pages (integer, default: 100): Maximum number of pages to crawl
• include_urls (array): Glob patterns to include (e.g., ["/**"] for all)
• exclude_urls (array): Glob patterns to exclude (e.g., ["/admin/**"])
• max_depth (integer): Maximum depth to crawl from start_url
• include_external (boolean, default: false): Include external URLs

from langchain_olostep import crawl_website
import asyncio

# Crawl entire documentation site
result = asyncio.run(crawl_website.ainvoke({
    "start_url": "https://docs.example.com",
    "max_pages": 100
}))

print(result)
# Returns: {"crawl_id": "crawl_xxx", "status": "in_progress", ...}
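
To bound a crawl more tightly, combine max_depth with include_external. A sketch with illustrative values:
from langchain_olostep import crawl_website
import asyncio

# Follow links at most 2 hops from the start URL and stay on the same site
result = asyncio.run(crawl_website.ainvoke({
    "start_url": "https://docs.example.com",
    "max_pages": 100,
    "max_depth": 2,
    "include_external": False
}))

print(result)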

LangChain Agent Integration

Build intelligent agents that can search and scrape the web:
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import (
    scrape_website,
    answer_question,
    extract_urls
)

# Create agent with Olostep tools
tools = [scrape_website, answer_question, extract_urls]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
result = agent.run("""
Research the company at https://company.com:
1. Scrape their about page
2. Search for their latest funding round
3. Extract all their product pages
""")

print(result)

LangGraph Integration

Build complex multi-step workflows with LangGraph:
from typing import List, TypedDict
from langgraph.graph import StateGraph, END
from langchain_olostep import (
    scrape_batch,
    answer_question,
    extract_urls
)
import json

# State schema: LangGraph needs annotated keys to know which channels to track
class ResearchState(TypedDict, total=False):
    target_url: str
    research_question: str
    desired_format: dict
    urls: List[str]
    batch_id: str
    answer: dict

def create_research_agent():
    workflow = StateGraph(ResearchState)
    
    def discover_pages(state):
        # Extract product URLs from the target site
        result = extract_urls.invoke({
            "url": state["target_url"],
            "include_urls": ["/product/**"],
            "top_n": 50
        })
        return {"urls": json.loads(result)["urls"]}
    
    def scrape_pages(state):
        # Scrape discovered pages in batch
        result = scrape_batch.invoke({
            "urls": state["urls"],
            "format": "markdown"
        })
        return {"batch_id": json.loads(result)["batch_id"]}
    
    def answer_questions(state):
        # Use AI to answer questions about the data
        result = answer_question.invoke({
            "task": state["research_question"],
            "json_schema": state["desired_format"]
        })
        return {"answer": json.loads(result)["answer"]}
    
    workflow.add_node("discover", discover_pages)
    workflow.add_node("scrape", scrape_pages)
    workflow.add_node("analyze", answer_questions)
    
    workflow.set_entry_point("discover")
    workflow.add_edge("discover", "scrape")
    workflow.add_edge("scrape", "analyze")
    workflow.add_edge("analyze", END)
    
    return workflow.compile()

# Use the agent
agent = create_research_agent()
result = agent.invoke({
    "target_url": "https://store.com",
    "research_question": "What are the top 5 most expensive products?",
    "desired_format": {
        "products": [{"name": "", "price": "", "url": ""}]
    }
})

Advanced Use Cases

Data Enrichment

Enrich spreadsheet data with web information:
from langchain_olostep import answer_question

companies = ["Stripe", "Shopify", "Square"]

for company in companies:
    result = answer_question.invoke({
        "task": f"Find information about {company}",
        "json_schema": {
            "ceo": "",
            "headquarters": "",
            "employee_count": "",
            "latest_funding": ""
        }
    })
    print(f"{company}: {result}")

E-commerce Product Scraping

Scrape product data with specialized parsers:
from langchain_olostep import scrape_website

# Scrape Amazon product
result = scrape_website.invoke({
    "url": "https://www.amazon.com/dp/PRODUCT_ID",
    "parser": "@olostep/amazon-product",
    "format": "json"
})
# Returns structured product data: price, title, rating, etc.
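
As with the other tools, the result comes back as a JSON string, so the structured fields can be read with json.loads. The keys below ("title", "price") are assumptions based on the comment above; inspect the parsed dict for the parser's actual schema:
import json

data = json.loads(result)
# "title" and "price" are assumed keys; the actual schema depends on the parser
print(data.get("title"), data.get("price"))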

SEO Audit

Analyze entire websites for SEO:
from langchain_olostep import extract_urls, scrape_batch
import json

# 1. Discover all pages
urls_result = extract_urls.invoke({
    "url": "https://yoursite.com",
    "top_n": 1000
})

# 2. Scrape all pages
urls = json.loads(urls_result)["urls"]
batch_result = scrape_batch.invoke({
    "urls": urls,
    "format": "html"
})

Documentation Scraping

Crawl and extract documentation:
from langchain_olostep import crawl_website

# Crawl entire docs site
result = crawl_website.invoke({
    "start_url": "https://docs.example.com",
    "max_pages": 500,
    "include_urls": ["/docs/**"],
    "exclude_urls": ["/api/**", "/v1/**"]
})

Specialized Parsers

Olostep provides pre-built parsers for popular websites:
  • @olostep/google-search - Google search results
  • @olostep/amazon-product - Amazon product pages
Use them with the parser parameter:
from langchain_olostep import scrape_website

result = scrape_website.invoke({
    "url": "https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
    "parser": "@olostep/google-search"
})

Error Handling

import asyncio
from langchain_core.exceptions import LangChainException
from langchain_olostep import scrape_website

try:
    result = asyncio.run(scrape_website.ainvoke({
        "url": "https://example.com"
    }))
except LangChainException as e:
    print(f"Scraping failed: {e}")

Best Practices

• When scraping more than 3-5 URLs, use scrape_batch instead of multiple scrape_website calls; batch processing is much faster and more cost-effective.
• For JavaScript-heavy sites, set the wait_before_scraping parameter (2000-5000 ms is typical) so dynamic content is fully loaded before extraction.
• For popular websites (Amazon, LinkedIn, Google), use our pre-built parsers to get structured data automatically.
• When using extract_urls or crawl_website, use glob patterns to focus on relevant pages and avoid unnecessary processing.
• Implement exponential backoff for rate limit errors, as in the sketch after this list; the API automatically handles most rate limiting internally.
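
A minimal backoff sketch, assuming failures surface as LangChainException as in the error-handling example above:
import asyncio
from langchain_core.exceptions import LangChainException
from langchain_olostep import scrape_website

async def scrape_with_backoff(url: str, max_retries: int = 4):
    # Retry with exponentially growing delays: 1s, 2s, 4s, then give up
    for attempt in range(max_retries):
        try:
            return await scrape_website.ainvoke({"url": url})
        except LangChainException:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)

result = asyncio.run(scrape_with_backoff("https://example.com"))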

Support