Databrew

Config-driven web extraction framework

Databrew is a Python framework for extracting structured data from websites using declarative TOML configuration files. It handles the operational complexity of web scraping (pagination, rate limiting, retries, incremental updates) so you can focus on defining what data to extract.

Features

  • Config-driven: Define extraction rules in TOML, no code required
  • HTML & JSON support: Extract from web pages or REST APIs
  • Smart pagination: Automatic link following with per-branch incremental stopping
  • Resume support: Automatically resume interrupted crawls
  • Browser rendering: Optional headless browser for JavaScript-heavy sites
  • Fast exports: DuckDB-powered exports (7-10x faster for large datasets)
  • Extensible: Custom parsers, fetchers, and middleware
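The "per-branch incremental stopping" idea above can be sketched in plain Python: follow one pagination branch and stop as soon as a page contributes nothing new relative to a previous crawl. The page graph and seen-set below are purely illustrative, not Databrew's internals.

```python
# Illustrative sketch of per-branch incremental stopping (not Databrew code):
# walk a chain of pages, stopping the branch at the first page whose items
# were all seen on an earlier crawl.

def crawl_branch(pages, start, seen):
    """pages: {url: (item_ids, next_url)}; seen: item ids from a prior run."""
    new_items = []
    url = start
    while url is not None:
        items, next_url = pages[url]
        fresh = [i for i in items if i not in seen]
        if not fresh:            # all items on this page are old -> stop here
            break
        new_items.extend(fresh)
        seen.update(fresh)
        url = next_url
    return new_items
```

With newest-first listings this means a re-crawl only touches the pages that actually contain new items, instead of re-walking the whole branch.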

Quick Example

Create a config file mysite.toml:

name = "mysite"
start_urls = ["https://example.com/products"]

[extract]
type = "html"

[extract.items]
selector = ".product"

[extract.items.fields]
title = "h2.title"
price = { selector = ".price", parser = "parse_price" }
url = { selector = "a", attribute = "href" }

[extract.links]
pagination = ["a.next-page"]
items = [".product a"]

[policy]
concurrency = 5
delay = 1.0
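The config references a custom `parse_price` parser for the `.price` field. Databrew's parser-registration API is not shown here, but a price parser typically just normalizes raw selector text into a number; a minimal, hypothetical version might look like:

```python
import re

def parse_price(text):
    """Hypothetical parser: extract the first decimal number from raw
    price text such as '$1,299.00' and return it as a float."""
    match = re.search(r"[\d.,]*\d", text)
    if match is None:
        return None                      # no numeric content in the field
    return float(match.group(0).replace(",", ""))
```

Keeping parsers as small pure functions like this makes them easy to test independently of any crawl.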

Run the extraction:

# Run the crawl
databrew run mysite.toml

# Export to JSONL
databrew export mysite.toml -o products.jsonl
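The export writes JSON Lines (one JSON object per line), so the result can be consumed with nothing but the standard library. The field names below follow the config above and are illustrative rather than a guaranteed output schema:

```python
import json

def read_jsonl(path):
    """Read a JSON Lines file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g. rows = read_jsonl("products.jsonl")
#      cheapest = min(rows, key=lambda r: r["price"])
```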

Installation

# With pip
pip install databrew

# With uv
uv add databrew

Optional Extras

# With browser support (pydoll)
pip install databrew[browser]

# With fast export support (DuckDB)
pip install databrew[analytics]

# All extras
pip install databrew[browser,analytics]

Next Steps