Create Scrape - Olostep Docs

curl --request POST \
  --url https://api.olostep.com/v1/scrapes \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "url_to_scrape": "<string>",
  "wait_before_scraping": 123,
  "formats": [],
  "actions": [
    {
      "milliseconds": 1
    }
  ],
  "country": "<string>",
  "remove_images": false,
  "remove_class_names": [
    "<string>"
  ],
  "llm_extract": {
    "schema": {}
  },
  "links_on_page": {
    "query_to_order_links_by": "<string>",
    "include_links": [
      "<string>"
    ],
    "exclude_links": [
      "<string>"
    ]
  },
  "screen_size": {
    "screen_width": 123,
    "screen_height": 123
  },
  "screenshot": {
    "full_page": true
  },
  "metadata": {}
}
'

{
  "id": "<string>",
  "object": "<string>",
  "created": 123,
  "metadata": {},
  "url_to_scrape": "<string>",
  "result": {
    "html_content": "<string>",
    "markdown_content": "<string>",
    "text_content": "<string>",
    "json_content": "<string>",
    "screenshot_hosted_url": "<string>",
    "html_hosted_url": "<string>",
    "markdown_hosted_url": "<string>",
    "text_hosted_url": "<string>",
    "links_on_page": [
      "<string>"
    ],
    "page_metadata": {
      "status_code": 123,
      "title": "<string>"
    }
  },
  "credits_consumed": 123,
  "cost_usd": 123
}

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
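
As a minimal sketch of how the Bearer header fits into a request, the Python snippet below builds (but does not send) a POST to the endpoint from the curl example. The helper name build_scrape_request and the placeholder token YOUR_TOKEN are illustrative assumptions, not part of the Olostep API.

```python
import json
import urllib.request

# Endpoint taken from the curl example on this page.
API_URL = "https://api.olostep.com/v1/scrapes"


def build_scrape_request(token: str, payload: dict) -> urllib.request.Request:
    """Build, without sending, a POST request that uses Bearer authentication.

    Hypothetical helper for illustration only.
    """
    if not token:
        raise ValueError("an auth token is required")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# YOUR_TOKEN is a placeholder; substitute a real token before sending.
request = build_scrape_request("YOUR_TOKEN", {"url_to_scrape": "https://example.com"})
```

Passing the resulting object to urllib.request.urlopen would perform the actual call; that step is left out here so the sketch runs without network access.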

Body

application/json

url_to_scrape

string<uri>

required

The URL to start scraping from.

wait_before_scraping

integer

Time to wait, in milliseconds, before scraping starts.

formats

enum<string>[]

Formats in which you want the content.

Available options:

html,

markdown,

text,

json,

raw_pdf,

screenshot
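
A request body that uses these fields could be assembled as in the sketch below. The helper build_payload and the client-side check are assumptions added for illustration; the API performs its own validation, and only the field names and format values come from this page.

```python
# Format values listed in this reference.
ALLOWED_FORMATS = {"html", "markdown", "text", "json", "raw_pdf", "screenshot"}


def build_payload(url: str, formats: list, wait_ms: int = 0) -> dict:
    """Assemble a request body, rejecting format names not listed above.

    Hypothetical helper; the check is a convenience, not an API requirement.
    """
    unknown = [name for name in formats if name not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    payload = {"url_to_scrape": url, "formats": list(formats)}
    if wait_ms > 0:
        # wait_before_scraping is expressed in milliseconds.
        payload["wait_before_scraping"] = wait_ms
    return payload


payload = build_payload("https://example.com", ["markdown", "html"], wait_ms=2000)
```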

remove_css_selectors

enum<string>

Option to remove certain CSS selectors from the content. You can also pass a JSON-stringified array of the specific selectors you want to remove. When this option is set to default, the removed selectors are ['nav','footer','script','style','noscript','svg',[role=alert],[role=banner],[role=dialog],[role=alertdialog],[role=region][aria-label*=skip i],[aria-modal=true]]

Available options:

default,

none,

array
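
The three ways of setting remove_css_selectors can be sketched as follows. This assumes the custom list is sent as the field's value in the form of a JSON-encoded string, which is one reading of "JSON stringified array"; the helper name css_selector_option and the selector .cookie-banner are hypothetical.

```python
import json


def css_selector_option(selectors=None, keep_everything=False) -> str:
    """Return a value for remove_css_selectors.

    Assumption: a custom list is passed as a JSON-encoded string.
    """
    if keep_everything:
        return "none"      # remove nothing
    if selectors is None:
        return "default"   # use the built-in selector list
    return json.dumps(list(selectors))


custom = css_selector_option(["nav", ".cookie-banner"])
payload = {"url_to_scrape": "https://example.com", "remove_css_selectors": custom}
```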

actions

(Wait · object | Click · object | Fill Input · object | Scroll · object)[]

Actions to perform on the page before getting the content.

country

string

Residential country to load the request from.

Supported values are:

US (United States)
CA (Canada)
IT (Italy)
IN (India)
GB (United Kingdom)
JP (Japan)
MX (Mexico)
AU (Australia)
ID (Indonesia)
UA (UAE)
RU (Russia)
RANDOM

Some operations, like scraping Google Search and Google News, support all countries.
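
A client could check the country value before sending a request, as in this sketch. The set mirrors the codes listed above exactly as written; the helper normalize_country is an illustrative assumption, and the check is deliberately stricter than the API for operations such as Google Search that accept all countries.

```python
# Codes copied from the list in this reference.
SUPPORTED_COUNTRIES = {
    "US", "CA", "IT", "IN", "GB", "JP", "MX", "AU", "ID", "UA", "RU", "RANDOM",
}


def normalize_country(code: str) -> str:
    """Upper-case a country code and check it against the listed values.

    Hypothetical helper; skip this check for operations that support all countries.
    """
    value = code.strip().upper()
    if value not in SUPPORTED_COUNTRIES:
        raise ValueError(f"country {code!r} is not in the documented list")
    return value


payload = {"url_to_scrape": "https://example.com", "country": normalize_country("it")}
```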

transformer

enum<string>

Specify the HTML transformer to use, if any. Postlight's Mercury Parser library is used to remove ads and other unwanted content from the scraped content.

Available options:

postlight,

none

remove_images

boolean

default:false

Option to remove images from the scraped content. Defaults to false.

remove_class_names

string[]

List of class names to remove from the content.

parser

object

When defining json as a format, you can use this parameter to specify the parser to use. Parsers are useful to extract structured content from web pages. Olostep has a few parsers built in for most common web pages, and you can also create your own parsers.

llm_extract

object

links_on_page

object

With this option, you can get all the links present on the page you scrape. Links are always returned as absolute URLs.

screen_size

object

Configuration for screen size. Preset dimensions are available through screen_type: desktop (1920x1080), mobile (414x896), or default (768x1024).
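
The sketch below shows one way to build the screen_size object from either a preset or explicit dimensions. It assumes screen_type sits inside screen_size next to screen_width and screen_height, and that a preset and explicit dimensions are not combined; neither detail is confirmed on this page, and the helper name screen_size_option is hypothetical.

```python
# Preset dimensions as stated in this reference: (width, height) in pixels.
SCREEN_PRESETS = {
    "desktop": (1920, 1080),
    "mobile": (414, 896),
    "default": (768, 1024),
}


def screen_size_option(screen_type=None, width=None, height=None) -> dict:
    """Build a screen_size object from a preset name or explicit dimensions.

    Assumption: screen_type is a key of the screen_size object.
    """
    if screen_type is not None:
        if screen_type not in SCREEN_PRESETS:
            raise ValueError(f"unknown screen_type: {screen_type!r}")
        return {"screen_type": screen_type}
    if width is None or height is None:
        raise ValueError("provide a screen_type or both width and height")
    return {"screen_width": width, "screen_height": height}


mobile = screen_size_option(screen_type="mobile")
custom = screen_size_option(width=1280, height=720)
```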

screenshot

object

metadata

object

User-defined metadata. Not supported yet.

Response

Successful response with the scrape initiation details.

id

string

Scrape ID

object

string

The kind of object. "scrape" for this endpoint.

created

number

Creation time as a Unix epoch timestamp.

metadata

object

User-defined metadata.

url_to_scrape

string

The URL that was scraped.

result

object

credits_consumed

integer | null

Number of credits consumed by this request. Populated after execution completes. Credits are the source of truth for billing.

cost_usd

number | null

Estimated cost in USD for this request. Populated after execution completes. Calculated from credits consumed and your plan rate. The estimate is 99% accurate, but credits_consumed is the authoritative value.
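
Reading a response could look like the sketch below. The sample body uses made-up values shaped like the response example on this page, and the helper summarize_scrape is hypothetical. Both billing fields are typed as nullable, so the code reads them with .get rather than assuming a number.

```python
import json

# Made-up sample shaped like the response example on this page.
SAMPLE_BODY = json.dumps({
    "id": "scrape_123",
    "object": "scrape",
    "created": 1700000000,
    "url_to_scrape": "https://example.com",
    "result": {
        "markdown_content": "# Example Domain",
        "page_metadata": {"status_code": 200, "title": "Example Domain"},
    },
    "credits_consumed": 1,
    "cost_usd": None,
})


def summarize_scrape(body: str) -> dict:
    """Pick out the commonly used fields from a scrape response body."""
    data = json.loads(body)
    result = data.get("result") or {}
    page = result.get("page_metadata") or {}
    return {
        "id": data["id"],
        "status_code": page.get("status_code"),
        "markdown": result.get("markdown_content"),
        # credits_consumed is authoritative; cost_usd is an estimate and may be null.
        "credits": data.get("credits_consumed"),
        "cost_usd": data.get("cost_usd"),
    }


summary = summarize_scrape(SAMPLE_BODY)
```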