Databrew

Config-driven web extraction framework

Databrew is a Python framework for extracting structured data from websites using declarative TOML configuration files. It handles the complexity of web scraping—pagination, rate limiting, retries, incremental updates—so you can focus on defining what data to extract.

Features

  • Config-driven: Define extraction rules in TOML, no code required
  • HTML & JSON support: Extract from web pages or REST APIs
  • Smart pagination: Automatic link following with per-branch incremental stopping
  • Resume support: Automatically resume interrupted crawls
  • Browser rendering: Optional headless browser for JavaScript-heavy sites
  • Parquet storage: Rolling Parquet part files, queryable with DuckDB/Pandas/Polars
  • Extensible: Custom parsers, fetchers, and middleware

Component Architecture

Databrew is built internally from focused packages:

  • databrew.core
  • fetchkit
  • extractkit
  • itemstore

Crawl state management lives in the databrew.state subpackage. The top-level databrew package is the composition layer and CLI entrypoint.

Quick Example

Create a config file mysite.toml:

name = "mysite"
start_urls = ["https://example.com/products"]

[extract]
type = "html"

[extract.items]
selector = ".product"

[extract.items.fields]
title = "h2.title"
price = { selector = ".price", parser = "parse_price" }
url = { selector = "a", attribute = "href" }

[extract.links]
pagination = ["a.next-page"]
items = [".product a"]

[policy]
concurrency = 5
delay = 1.0
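The price field above names a custom parser, parse_price. A parser is just a function that turns raw extracted text into a typed value; a minimal sketch is below (the function body is an assumption, and how Databrew discovers and registers parsers is not shown here):

import re

def parse_price(raw: str) -> float | None:
    """Turn a scraped price string like '$1,299.00' into a float.

    Hypothetical implementation for the `parser = "parse_price"`
    entry in mysite.toml; the registration mechanism that makes
    Databrew find this function is an assumption.
    """
    match = re.search(r"[\d,]+(?:\.\d+)?", raw)
    if match is None:
        return None  # no numeric price found in the text
    return float(match.group().replace(",", ""))

print(parse_price("$1,299.00"))  # 1299.0
print(parse_price("sold out"))   # None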

Run the extraction:

# Run the crawl
databrew run mysite.toml

# Data is stored in Parquet files at data/mysite/items/
# Query with DuckDB, Pandas, Polars, etc.
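Since DuckDB can read a glob of Parquet part files directly, querying the output needs no loading step. The sketch below writes a small sample part file with the fields defined in mysite.toml (real crawl output is assumed, not available here) and queries it exactly as you would query data/mysite/items/*.parquet:

import os
import tempfile
import duckdb

# Stand-in for data/mysite/items/: one sample part file with the
# fields from mysite.toml (title, price, url).
items_dir = tempfile.mkdtemp()
part = os.path.join(items_dir, "part-000.parquet")
duckdb.sql(f"""
    COPY (
        SELECT * FROM (VALUES
            ('Widget', CAST(9.99 AS DOUBLE), 'https://example.com/products/widget'),
            ('Gadget', CAST(24.50 AS DOUBLE), 'https://example.com/products/gadget')
        ) AS t(title, price, url)
    ) TO '{part}' (FORMAT PARQUET)
""")

# Query all rolling part files with a glob, as you would for
# data/mysite/items/*.parquet after a real crawl.
rows = duckdb.sql(
    f"SELECT title, price FROM '{os.path.join(items_dir, '*.parquet')}' "
    "WHERE price < 10"
).fetchall()
print(rows)  # [('Widget', 9.99)]

The same files can be read with pandas (pd.read_parquet) or Polars (pl.scan_parquet) if you prefer a DataFrame workflow.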

Installation

# With pip
pip install databrew

# Or with uv
uv add databrew

Optional Extras

# With browser support (pydoll) for JavaScript-heavy sites
pip install databrew[browser]

Next Steps