# Databrew

Config-driven web extraction framework
Databrew is a Python framework for extracting structured data from websites using declarative TOML configuration files. It handles the complexity of web scraping—pagination, rate limiting, retries, incremental updates—so you can focus on defining what data to extract.
## Features
- Config-driven: Define extraction rules in TOML, no code required
- HTML & JSON support: Extract from web pages or REST APIs
- Smart pagination: Automatic link following with per-branch incremental stopping
- Resume support: Automatically resume interrupted crawls
- Browser rendering: Optional headless browser for JavaScript-heavy sites
- Parquet storage: Rolling Parquet part files, queryable with DuckDB/Pandas/Polars
- Extensible: Custom parsers, fetchers, and middleware
## Component Architecture

Databrew is composed of focused internal packages:

- `databrew.core`
- `fetchkit`
- `extractkit`
- `itemstore`

Crawl state management lives in the `databrew.state` subpackage, and the top-level `databrew` package is the composition layer and CLI entrypoint.
## Quick Example

Create a config file `mysite.toml`:
name = "mysite"
start_urls = ["https://example.com/products"]
[extract]
type = "html"
[extract.items]
selector = ".product"
[extract.items.fields]
title = "h2.title"
price = { selector = ".price", parser = "parse_price" }
url = { selector = "a", attribute = "href" }
[extract.links]
pagination = ["a.next-page"]
items = [".product a"]
[policy]
concurrency = 5
delay = 1.0
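The `parser = "parse_price"` entry above names a custom parser. As a rough sketch of what such a function might look like, assuming databrew invokes parsers as plain callables on the text matched by the selector (the actual parser contract is framework-specific):

```python
import re

def parse_price(text: str) -> float | None:
    """Illustrative parser: turn a string like '$1,299.00' into a float.

    Assumption: databrew calls parsers as plain functions on the
    selected text; check the framework docs for the real contract.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))
```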
Run the extraction:
```bash
# Run the crawl
databrew run mysite.toml

# Data is stored in Parquet files at data/mysite/items/
# Query with DuckDB, Pandas, Polars, etc.
```
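Because the output is plain Parquet, any Parquet-aware tool can read it directly. For example, a quick look at the results with DuckDB (the path and column names follow the config above):

```python
import duckdb

# Glob over the rolling Parquet part files written by the crawl.
rows = duckdb.sql(
    "SELECT title, price FROM 'data/mysite/items/*.parquet' "
    "WHERE price IS NOT NULL ORDER BY price LIMIT 10"
).fetchall()

for title, price in rows:
    print(f"{price:>10.2f}  {title}")
```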
## Installation
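Assuming the package is published on PyPI under the name `databrew`:

```bash
pip install databrew
```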
### Optional Extras
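The headless-browser feature listed above suggests an optional extra. The extra name below is illustrative, not confirmed; check the project's packaging metadata for the actual names:

```bash
# Hypothetical extra name -- verify against the project's pyproject.toml
pip install "databrew[browser]"
```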
## Next Steps
- Quick Start - Build your first crawler
- Core Concepts - Understand how databrew works
- CLI Reference - All available commands
- Configuration Guide - Deep dive into TOML config
- Performance Tuning - Throughput tuning and benchmarking