# Databrew

Config-driven web extraction framework
Databrew is a Python framework for extracting structured data from websites using declarative TOML configuration files. It handles the complexity of web scraping—pagination, rate limiting, retries, incremental updates—so you can focus on defining what data to extract.
## Features
- Config-driven: Define extraction rules in TOML, no code required
- HTML & JSON support: Extract from web pages or REST APIs
- Smart pagination: Automatic link following with per-branch incremental stopping
- Resume support: Automatically resume interrupted crawls
- Browser rendering: Optional headless browser for JavaScript-heavy sites
- Parquet storage: Rolling Parquet part files, queryable with DuckDB/Pandas/Polars
- Extensible: Custom parsers, fetchers, and middleware
## Component Architecture

Databrew is composed of focused internal packages:

- `databrew.core`
- `fetchkit`
- `extractkit`
- `itemstore`

Crawl state management lives in the `databrew.state` subpackage, and the top-level `databrew` package is the composition layer and CLI entrypoint.
## Quick Example

Create a config file `mysite.toml`:
name = "mysite"
start_urls = ["https://example.com/products"]
[extract]
type = "html"
[extract.items]
selector = ".product"
[extract.items.fields]
title = "h2.title"
price = { selector = ".price", parser = "parse_price" }
url = { selector = "a", attribute = "href" }
[extract.links]
pagination = ["a.next-page"]
items = [".product a"]
[policy]
concurrency = 5
delay = 1.0
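The `parser = "parse_price"` entry above names a custom parser. As a rough sketch of what such a function might look like, assuming databrew invokes parsers as plain callables on the text matched by the selector (the actual parser contract is framework-specific):

```python
import re

def parse_price(text: str) -> float | None:
    """Illustrative parser: turn a string like '$1,299.00' into a float.

    Assumption: databrew calls parsers as plain functions on the
    selected text; check the framework docs for the real contract.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))
```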
Run the extraction:
```bash
# Run the crawl
databrew run mysite.toml

# Data is stored in Parquet files at data/mysite/items/
# Query with DuckDB, Pandas, Polars, etc.
```
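Because the output is plain Parquet, any Parquet-aware tool can read it directly. For example, a quick look at the results with DuckDB (the path and column names follow the config above):

```python
import duckdb

# Glob over the rolling Parquet part files written by the crawl.
rows = duckdb.sql(
    "SELECT title, price FROM 'data/mysite/items/*.parquet' "
    "WHERE price IS NOT NULL ORDER BY price LIMIT 10"
).fetchall()

for title, price in rows:
    print(f"{price:>10.2f}  {title}")
```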
## Installation
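Assuming the package is published on PyPI under the name `databrew`:

```bash
pip install databrew
```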
### Optional Extras
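The headless-browser feature listed above suggests an optional extra. The extra name below is illustrative, not confirmed; check the project's packaging metadata for the actual names:

```bash
# Hypothetical extra name -- verify against the project's pyproject.toml
pip install "databrew[browser]"
```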
## Next Steps
- Quick Start - Build your first crawler
- Core Concepts - Understand how databrew works
- CLI Reference - All available commands
- Configuration Guide - Deep dive into TOML config
- Performance Tuning - Throughput tuning and benchmarking