# Databrew

Config-driven web extraction framework.
Databrew is a Python framework for extracting structured data from websites using declarative TOML configuration files. It handles the complexity of web scraping—pagination, rate limiting, retries, incremental updates—so you can focus on defining what data to extract.
## Features
- **Config-driven**: Define extraction rules in TOML, no code required
- **HTML & JSON support**: Extract from web pages or REST APIs
- **Smart pagination**: Automatic link following with per-branch incremental stopping
- **Resume support**: Automatically resume interrupted crawls
- **Browser rendering**: Optional headless browser for JavaScript-heavy sites
- **Fast exports**: DuckDB-powered export path, 7-10x faster on large datasets
- **Extensible**: Custom parsers, fetchers, and middleware
## Quick Example
Create a config file `mysite.toml`:
```toml
name = "mysite"
start_urls = ["https://example.com/products"]

[extract]
type = "html"

[extract.items]
selector = ".product"

[extract.items.fields]
title = "h2.title"
price = { selector = ".price", parser = "parse_price" }
url = { selector = "a", attribute = "href" }

[extract.links]
pagination = ["a.next-page"]
items = [".product a"]

[policy]
concurrency = 5
delay = 1.0
```
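The config references a `parse_price` parser. Databrew's parser API isn't covered on this page, so the sketch below is purely illustrative: it shows the kind of plain Python callable such a name might resolve to. The function body and signature are assumptions; see the Configuration Guide for the real registration mechanism.

```python
import re

def parse_price(raw: str) -> float | None:
    """Turn a raw price string like '$1,299.00' into a float.

    Hypothetical example of the kind of callable the
    `parser = "parse_price"` field setting could resolve to;
    the actual parser interface is defined by databrew.
    """
    match = re.search(r"[\d,]+(?:\.\d+)?", raw)
    if not match:
        return None
    return float(match.group().replace(",", ""))
```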
Run the extraction:
```bash
# Run the crawl
databrew run mysite.toml

# Export to JSONL
databrew export mysite.toml -o products.jsonl
```
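The export is JSON Lines: one JSON object per line, with the fields defined in the config (`title`, `price`, `url` in the example above). That means it can be consumed with nothing beyond the standard library:

```python
import json

# Each line of the export is one extracted item.
with open("products.jsonl", encoding="utf-8") as f:
    products = [json.loads(line) for line in f]

print(len(products), "products")
print(products[0]["title"], products[0]["price"])
```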
## Installation
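Install the core package from PyPI:

```bash
pip install databrew
```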
### Optional Extras
```bash
# With browser support (pydoll)
pip install databrew[browser]

# With fast export support (DuckDB)
pip install databrew[analytics]

# All extras
pip install databrew[browser,analytics]
```
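The analytics extra brings in DuckDB for the fast export path. Independently of databrew, DuckDB can also query the exported JSONL directly, which is handy for quick checks; the query below assumes only the `title` and `price` fields from the example config:

```python
import duckdb

# Query the exported JSONL directly; read_json_auto infers the schema.
top = duckdb.sql("""
    SELECT title, price
    FROM read_json_auto('products.jsonl')
    ORDER BY price DESC
    LIMIT 5
""").fetchall()

for title, price in top:
    print(f"{price:>10.2f}  {title}")
```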
## Next Steps
- Quick Start - Build your first crawler
- Core Concepts - Understand how databrew works
- CLI Reference - All available commands
- Configuration Guide - Deep dive into TOML config