Configuration Guide¶
Databrew uses TOML configuration files to define extraction rules. This guide covers all configuration options.
Basic Structure¶
A config file has these main sections:
# Required: Site identity
name = "mysite"
start_urls = ["https://example.com"]
# Optional: Inherit from base config
# extends = "base.toml"
# Required: Extraction rules
[extract]
type = "html" # or "json"
# Item extraction
[extract.items]
# ...
# Link discovery
[extract.links]
# ...
# Optional: Crawl behavior
[policy]
# ...
# Optional: Output location
[storage]
# ...
# Optional: HTTP/browser settings
[fetch]
# ...
Top-Level Settings¶
name¶
Site identifier used for logging and default output paths.
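For example:
name = "mysite"   # used in logs and in default output paths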
start_urls¶
URLs to begin crawling from. Can be a list or loaded from a file.
# List of URLs
start_urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]
# Load from file (one URL per line)
# start_urls = { file = "urls.txt" }
File format:
# Comments are ignored
https://example.com/category/1
https://example.com/category/2
# Blank lines are ignored
https://example.com/category/3
extends¶
Inherit settings from a base config file. The path is resolved relative to the config file that declares it.
See Config Composition for details.
parsers¶
Custom parser modules to load (Python files in the same directory as the config).
See Custom Parsers for details.
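A minimal sketch, assuming parsers accepts a list of Python filenames (the module name custom_parsers.py is illustrative):
parsers = ["custom_parsers.py"]   # loaded from the directory containing this config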
[extract] Section¶
type¶
Extraction type: html for web pages, json for APIs.
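For example:
[extract]
type = "html"   # web pages, extracted with CSS selectors
# type = "json" # API responses, extracted with dot-notation paths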
items_from¶
Which URL types to extract items from. Default is "item".
[extract]
items_from = "item" # Only from item/detail pages (default)
# items_from = "pagination" # Only from listing pages
# items_from = "all" # From all pages
base_url (HTML only)¶
Base URL for resolving relative links.
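For example:
[extract]
type = "html"
base_url = "https://example.com"
With this setting, a relative link such as /page/2 resolves to https://example.com/page/2.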
[extract.items] Section¶
selector (HTML)¶
CSS selector for item containers. Empty string means the whole page is one item.
[extract.items]
selector = ".product-card" # Multiple items per page
# selector = "" # Whole page is one item (detail pages)
path (JSON)¶
JSON path to items array. Empty string means the full response.
[extract.items]
path = "data.products" # Array at data.products
# path = "" # Full response is the item
id¶
Field to use as the unique identifier for deduplication and export filenames.
[extract.items]
id = "product_id" # Name of an extracted field
# For nested data (JSON or derived fields)
id = "details.Property ID"
fields¶
Fields to extract from each item.
[extract.items.fields]
# Simple selector (HTML) or path (JSON)
title = "h2.title"
# Full config
price = { selector = ".price", parser = "parse_price", required = true }
image = { selector = "img", attribute = "src" }
tags = { selector = ".tag", multiple = true }
See HTML Extraction and JSON Extraction for field options.
[extract.links] Section¶
pagination (HTML/JSON)¶
CSS selectors (HTML) or dot-notation paths (JSON) for pagination links. Pagination links are always followed.
[extract.links]
# HTML: CSS selectors
pagination = ["a.next-page", ".pagination a"]
# JSON: dot-notation paths
pagination = ["links.next", "meta.next_page_url"]
items (HTML)¶
CSS selectors for item detail page links.
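For example, to follow each card on a listing page through to its detail page:
[extract.links]
items = [".listing-card a"]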
Item URL Construction (JSON)¶
For JSON APIs, construct item URLs from IDs:
[extract.links]
items_path = "data.products" # Path to items array
items_id = "id" # ID field in each item
items_url = "https://api.example.com/products/{id}" # URL template
attribute (HTML)¶
Attribute containing the URL (default: href).
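A sketch for sites that put link targets in a custom attribute rather than href (the data-url attribute name is illustrative):
[extract.links]
attribute = "data-url"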
[extract.derived] Section¶
Extract fields from already-extracted nested data.
[extract.derived]
# Simple path
property_id = "details.Property ID"
# With parser
bedrooms = { path = "details.Bedrooms", parser = "parse_int" }
# Keep source field (default removes it)
status = { path = "info.status", remove_source = false }
See HTML Extraction for details.
[policy] Section¶
Controls crawl behavior, retries, and stopping conditions.
Retry Settings¶
[policy]
max_retries = 3 # Retry attempts per URL (default: 3)
retry_delay = 1.0 # Initial retry delay in seconds (default: 1.0)
backoff_factor = 2.0 # Multiply delay after each retry (default: 2.0)
max_retry_delay = 60.0 # Maximum retry delay (default: 60.0)
Concurrency and Pacing¶
[policy]
concurrency = 5 # Parallel requests (default: 5)
delay = 1.0 # Delay after each batch in seconds (default: 0.0)
jitter = 0.2 # Random delay 0-N seconds before each request (default: 0.1)
Stopping Conditions¶
[policy]
max_requests = 1000 # Stop after N requests (default: unlimited)
max_consecutive_failures = 10 # Stop after N consecutive failures (default: 10)
max_error_rate = 0.5 # Stop if error rate exceeds 50% (default: 0.5)
min_requests_for_error_rate = 20 # Min requests before error rate check (default: 20)
Incremental Crawling¶
[policy]
stop_on_empty = true # Stop branch when page yields no items/links (default: true)
stop_on_caught_up = false # Global stop on caught-up (default: false)
caught_up_threshold = 3 # Pages before global stop (default: 3)
[storage] Section¶
path¶
Output directory for state.db and exports. Relative to CWD (current working directory).
Note
The path is relative to where you run databrew, not the config file location.
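For example:
[storage]
path = "data/mysite"   # state.db and exports are written here, relative to the CWD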
[fetch] Section¶
type¶
Fetcher type: httpx (HTTP client) or pydoll (browser).
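For example:
[fetch]
type = "httpx"    # plain HTTP client
# type = "pydoll" # browser fetching; see [fetch.browser] below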
headers¶
HTTP headers to send with requests.
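For example (the Accept-Language entry is only an illustration of an extra header):
[fetch.headers]
User-Agent = "MyBot/1.0"
Accept-Language = "en-US"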
[fetch.browser] (pydoll only)¶
Browser-specific settings when using type = "pydoll".
[fetch]
type = "pydoll"
[fetch.browser]
headless = true # Run without GUI (default: true)
page_load_timeout = 30.0 # Timeout in seconds (default: 30.0)
wait_for_selector = ".content" # Wait for element before extraction
selector_timeout = 10.0 # Selector wait timeout (default: 10.0)
wait_for_network_idle = false # Wait for network to settle (default: false)
network_idle_time = 2.0 # Network idle wait time (default: 2.0)
wait_after_load = 0.0 # Additional delay after load (default: 0.0)
user_agent = "Mozilla/5.0 ..." # Custom user agent
viewport_width = 1920 # Browser viewport width (default: 1920)
viewport_height = 1080 # Browser viewport height (default: 1080)
See Browser Fetching for details.
Config Composition¶
Configs can inherit from a base config using extends:
# base.toml
[fetch.headers]
User-Agent = "MyBot/1.0"
[policy]
max_retries = 3
concurrency = 5
delay = 1.0
# mysite.toml
extends = "base.toml"
name = "mysite"
start_urls = ["https://example.com"]
[extract]
type = "html"
# ... site-specific rules
Merge Behavior¶
- Dicts: Merge recursively (child values override base)
- Lists: Replace entirely (no concatenation)
- Scalars: Replace entirely
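For example, when both configs set pagination selectors, the child's list wins outright while the header tables merge key by key (the selector values here are illustrative):
# base.toml
[fetch.headers]
User-Agent = "MyBot/1.0"
[extract.links]
pagination = ["a.next"]
# mysite.toml
extends = "base.toml"
[fetch.headers]
Accept = "text/html"
[extract.links]
pagination = [".pagination a"]
# Effective config: both headers are present (dict merge),
# but pagination is only [".pagination a"] (list replaced, not concatenated).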
Chained Inheritance¶
# base.toml → common.toml → mysite.toml
extends = "common.toml" # common.toml can also extend base.toml
Complete Example¶
name = "realestate"
start_urls = ["https://example.com/listings"]
[extract]
type = "html"
base_url = "https://example.com"
[extract.items]
selector = "" # Detail pages
id = "property_id"
[extract.items.fields]
title = ".listing-title"
price = { selector = ".price", parser = "parse_price", required = true }
address = ".address"
bedrooms = { selector = ".beds", parser = "parse_int" }
bathrooms = { selector = ".baths", parser = "parse_int" }
sqft = { selector = ".sqft", parser = "parse_int" }
description = ".description"
images = { selector = ".gallery img", attribute = "src", multiple = true }
details = { selector = ".details li", keys = "strong", values = "span" }
coordinates = { selector = "#map", attribute = "data-coords", parser = "parse_coordinates" }
[extract.links]
pagination = [".pagination a.next"]
items = [".listing-card a"]
[extract.derived]
property_id = "details.Property ID"
year_built = { path = "details.Year Built", parser = "parse_int" }
lot_size = "details.Lot Size"
[policy]
concurrency = 3
delay = 1.0
jitter = 0.2
max_retries = 3
max_requests = 5000
[storage]
path = "data/realestate"
[fetch.headers]
User-Agent = "RealEstateBot/1.0"