Quick Start

This guide walks you through creating a config file and running your first extraction.

Generate a Starter Config

Use the init command to generate a starter config:

databrew init mysite --url "https://example.com/products" --type html

This creates mysite.toml with a template you can customize.

Alternatively, run databrew init without arguments for interactive prompts:

databrew init
# Site name: mysite
# Start URL: https://example.com/products
# Created config: mysite.toml

Understanding the Config

A minimal config covers four things: where to start, what to extract, which links to follow, and how the crawler should behave:

# Site identifier and starting point
name = "mysite"
start_urls = ["https://example.com/products"]

# What to extract
[extract]
type = "html"  # or "json" for APIs

# Item extraction rules
[extract.items]
selector = ".product-card"  # CSS selector for item containers

[extract.items.fields]
title = "h2.title"  # Simple selector
price = { selector = ".price", parser = "parse_price" }
url = { selector = "a", attribute = "href" }

# Links to follow
[extract.links]
pagination = ["a.next-page"]  # Next page links
items = [".product-card a"]   # Detail page links

# Crawl behavior
[policy]
concurrency = 5
max_retries = 3
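
To see how these selectors line up with a page, here is a sketch of the kind of markup the config expects (the HTML is illustrative, not from a real site):

<div class="product-card">
  <h2 class="title">Example Widget</h2>
  <span class="price">$19.99</span>
  <a href="/products/example-widget">View</a>
</div>

Each .product-card element becomes one item: title takes the element's text, url takes the href attribute, and the parse_price parser would presumably turn "$19.99" into a number.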

Validate Your Config

Before running, validate the config:

databrew check mysite.toml
# Valid config: mysite.toml
#   Name: mysite
#   Type: html
#   Start URLs: 1

Add -v for more details:

databrew check mysite.toml -v

Run a Test Crawl

Start with a limited crawl to test your extraction rules:

databrew run mysite.toml -n 10

The -n 10 flag limits the crawl to 10 requests.
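
After a test run, sanity-check the extracted items by exporting them (see Export Data below) and inspecting the first few:

databrew export mysite.toml -o sample.jsonl
head -n 3 sample.jsonl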

Dry Run

Preview what the crawl would do without fetching anything:

databrew run mysite.toml --dry-run

Check Status

See the crawl progress:

databrew status mysite.toml
# Status for: mysite
#   Items stored: 42
#   URLs pending: 15
#   URLs completed: 53
#   URLs failed: 2

Export Data

Export extracted items to different formats:

# JSONL (one JSON object per line)
databrew export mysite.toml -o products.jsonl

# JSON array
databrew export mysite.toml -o products.json

# Parquet (requires the analytics extra)
databrew export mysite.toml -o products.parquet
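
Because JSONL is one object per line, standard command-line tools work well for spot checks. For example, with jq installed:

# Print the title of every exported item
jq -r '.title' products.jsonl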

Resume an Interrupted Crawl

Databrew tracks crawl progress automatically. If a crawl is interrupted, run the same command again:

databrew run mysite.toml
# Resuming: 15 URLs pending

To force a fresh start:

databrew run mysite.toml --fresh
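
A typical interrupted session looks like this (commands only; interrupt with Ctrl-C):

databrew run mysite.toml      # interrupted partway through
databrew status mysite.toml   # pending URLs are still recorded
databrew run mysite.toml      # resumes from the pending queue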

Example: Real Estate Site

Here's a complete example for a real estate listing site:

name = "realestate"
start_urls = ["https://example.com/listings?page=1"]

[extract]
type = "html"

[extract.items]
selector = ""  # Empty = whole page is one item (detail pages)
id = "property_id"  # Field for deduplication

[extract.items.fields]
title = ".listing-title"
price = { selector = ".price", parser = "parse_price", required = true }
address = ".address"
bedrooms = { selector = ".beds", parser = "parse_int" }
bathrooms = { selector = ".baths", parser = "parse_int" }
description = ".description"
images = { selector = ".gallery img", attribute = "src", multiple = true }

# Key-value extraction for property details
details = { selector = ".details li", keys = "strong", values = "span" }

[extract.links]
pagination = [".pagination a.next"]
items = [".listing-card a.view-details"]

# Derived fields from nested data
[extract.derived]
property_id = "details.Property ID"
lot_size = { path = "details.Lot Size", parser = "squish" }

[policy]
concurrency = 3
delay = 1.0
jitter = 0.2
max_retries = 3

[storage]
path = "data/realestate"

Run it:

databrew run realestate.toml
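
As with any new config, it is worth validating and doing a limited pass before the full crawl:

databrew check realestate.toml
databrew run realestate.toml -n 20
databrew status realestate.toml
databrew export realestate.toml -o listings.jsonl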

Next Steps