Core Concepts

This guide explains the key concepts and architecture of databrew.

Architecture Overview

Databrew has a clean separation of concerns:

Orchestrator (coordinates everything)
    ├── Store (SQLite URL queue + Parquet item storage)
    ├── Fetcher (HTTP with rate limiting)
    ├── Extractor (HTML/JSON parsing)
    └── Policy (retry rules, stopping conditions)

At the package level, these concerns map to component packages:

  • databrew.core: policy, stats, strict config models, and module-loading utilities
  • fetchkit: HTTP/browser fetchers, request pacer, fetcher registry
  • extractkit: HTML/JSON extractors and parser registry
  • itemstore: Parquet item storage and storage metadata contracts
  • databrew.state: URL queue and unified state store

The main databrew package composes these components and provides the CLI, config loading, orchestrator, middleware, and hooks.
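
Conceptually, the composition looks like this (a rough sketch: load_config(), the return shape of create_components(), and the Orchestrator constructor are illustrative, not databrew's exact API):

# Illustrative composition sketch; actual signatures may differ.
config = load_config("mysite.toml")
store, fetcher, extractor, policy = create_components(config)
orchestrator = Orchestrator(store, fetcher, extractor, policy)
await orchestrator.run()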

URL Types

Databrew distinguishes between two types of URLs:

Pagination URLs

Listing pages that contain links to other pages and items:

  • Search result pages
  • Category listings
  • Index pages

Pagination URLs are always followed to discover new content.

Item URLs

Detail pages that contain the actual data to extract:

  • Product pages
  • Article pages
  • Profile pages

Item URLs are checked against storage before fetching (deduplication).

Why This Matters

The separation enables smart incremental crawling:

  1. Pagination pages are re-crawled each run to find new items
  2. Item pages are fetched at most once: if the item already exists in storage, the URL is skipped (see the sketch after this list)
  3. When all items on a pagination page already exist, that branch stops
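
A minimal sketch of that fetch decision (has_item() is an illustrative name, not necessarily databrew's real API):

def should_fetch(url: str, url_type: str, store) -> bool:
    # Conceptual sketch; has_item() is an illustrative name.
    if url_type == "pagination":
        return True                    # listing pages are always re-crawled
    return not store.has_item(url)     # item pages are skipped once stored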

Data Flow

Config (TOML) → create_components() → Orchestrator.run()
Store.get_next_url() → Fetcher.fetch() → Extractor.extract()
        ↑                                      ↓
        └────────── Store.add_*_urls() ←───────┘
                    Store.save_item()

  1. Config Loading: TOML is parsed into typed configuration objects
  2. Component Creation: Store, fetcher, and extractor are initialized
  3. URL Queue: The orchestrator pulls URLs from the queue
  4. Fetching: The fetcher retrieves page content
  5. Extraction: The extractor parses items and discovers links
  6. Storage: Items are saved, new links are added to the queue

The Orchestrator

The orchestrator is the main crawl loop that coordinates everything:

async def run(self):
    while True:
        # Get batch of URLs
        tasks = self.store.get_pending_urls(limit=self.policy.concurrency)

        if not tasks:
            break  # Nothing left to process

        # Process concurrently
        results = await asyncio.gather(*[
            self._process_url(task) for task in tasks
        ])

        # Check stopping conditions
        if self.policy.should_stop(...):
            break

Key features:

  • Concurrent processing: Fetches multiple URLs in parallel
  • Automatic retries: Failed URLs are retried with exponential backoff (sketched below)
  • Incremental stopping: Each pagination branch stops independently
  • Lifecycle hooks: Shell commands at key points (start, failure, complete) for automated recovery
  • Progress tracking: Statistics are updated in real-time
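
For example, a typical exponential backoff schedule looks like this (illustrative; databrew's exact formula may differ):

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Illustrative schedule: 1s, 2s, 4s, ... capped at 60s.
    return min(base * 2 ** (attempt - 1), cap)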

Extractors

Extractors parse page content and return structured data.

HTML Extractor

Uses CSS selectors to extract data from HTML:

[extract]
type = "html"

[extract.items]
selector = ".product"

[extract.items.fields]
title = "h2"
price = { selector = ".price", parser = "parse_price" }
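
For example, given a page containing the markup below, this config would yield one item per .product element (assuming parse_price is a registered parser that strips the currency symbol):

html = """
<div class="product">
  <h2>Blue Widget</h2>
  <span class="price">$9.99</span>
</div>
"""
# With the config above, the extractor would yield roughly:
# {"title": "Blue Widget", "price": 9.99}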

JSON Extractor

Uses dot-notation paths to extract data from JSON:

[extract]
type = "json"

[extract.items]
path = "data.products"

[extract.items.fields]
title = "name"
price = { path = "pricing.amount", parser = "parse_float" }
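
For example, given the response below, this config would yield one item per element of data.products (assuming parse_float converts the string to a number):

payload = {
    "data": {
        "products": [
            {"name": "Blue Widget", "pricing": {"amount": "9.99"}},
        ]
    }
}
# With the config above, the extractor would yield roughly:
# {"title": "Blue Widget", "price": 9.99}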

Fetchers

Fetchers retrieve page content:

HTTP Fetcher (httpx)

The default fetcher uses httpx for fast HTTP requests:

[fetch]
type = "httpx"

[fetch.headers]
User-Agent = "MyBot/1.0"
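
Under the hood this is conventional httpx usage; a minimal sketch of the equivalent request (not databrew's actual fetcher code):

import httpx

async def fetch(url: str) -> str:
    # Minimal httpx GET sketch, not databrew's actual fetcher.
    async with httpx.AsyncClient(headers={"User-Agent": "MyBot/1.0"}) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.text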

Browser Fetcher (pydoll)

For JavaScript-heavy sites, use browser rendering:

[fetch]
type = "pydoll"

[fetch.browser]
headless = true
wait_for_selector = ".content-loaded"

Policy

The policy controls crawl behavior:

[policy]
max_retries = 3           # Retry failed requests
max_requests = 1000       # Stop after N requests
concurrency = 5           # Parallel requests
delay = 1.0               # Delay between batches
jitter = 0.2              # Random delay (anti-fingerprinting)
max_consecutive_failures = 50  # Stop on too many failures
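
Together, delay and jitter produce a randomized pause between batches; a sketch of how such pacing is commonly computed (databrew's exact formula may differ):

import random

def batch_pause(delay: float = 1.0, jitter: float = 0.2) -> float:
    # Sketch: base delay plus a random component (1.0-1.2s with these values).
    return delay + random.uniform(0, jitter)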

Storage

Databrew uses a dual-layer storage architecture:

data/mysite/
├── .state.db             # URL queue/retry state (ephemeral, gitignored)
├── .failures.db          # Durable failure tracking (local, gitignored)
├── _failed_urls.json     # Portable failure snapshot (committed/synced)
├── .index.db             # Storage dedupe/index catalog (ephemeral, gitignored)
└── items/
    ├── part_000001.parquet   # Rolling part files (compressed)
    ├── part_000002.parquet
    └── ...

  • .state.db: SQLite database for URL queue/retry state. This is local-only.
  • .failures.db: Durable failure tracking in a separate SQLite file. Survives .state.db deletion.
  • _failed_urls.json: Portable failure snapshot exported at run end. Safe for cross-machine sync (e.g. via git).
  • .index.db: SQLite storage catalog for dedupe and item metadata. This is local-only and auto-rebuilt from Parquet files on startup.
  • items/*.parquet: Rolling Parquet part files containing the actual extracted items. These are the source of truth and should be synced across machines.

Deduplication: items are deduplicated by their ID field, if one is configured.

The storage location is set in the config:

[storage]
path = "data/mysite"

The path is relative to the current working directory (where you run databrew), not to the config file's location.

See Working with Extracted Data for how to query and use the Parquet files.
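
For a quick look at the data, the part files can be read directly with pandas (assuming pyarrow is installed):

import pandas as pd

# Reads all part files in the directory into one DataFrame (requires pyarrow).
df = pd.read_parquet("data/mysite/items")
print(df.head())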

Incremental Crawling

Databrew supports smart incremental updates:

Per-Branch Stopping

Each pagination chain stops independently when it encounters a page where all items already exist:

Seed URL 1 → Page 1 → Page 2 → Page 3 (all items exist, STOP)
Seed URL 2 → Page 1 → Page 2 (new items found, continue...)

This is automatic for re-runs (when items already exist in storage).

Cross-Run Retry

Item URLs that fail are automatically retried on subsequent runs:

Run 1: Item fails → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails → failed_runs=2
Run 3: Retry again → fails → status='permanently_failed'

After 3 failed runs, the URL is marked permanently failed.
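
The rule reduces to a simple threshold; a sketch (the threshold of 3 matches the trace above; names are illustrative):

MAX_FAILED_RUNS = 3

def next_status(failed_runs: int) -> str:
    # Sketch of the rule described above; names are illustrative.
    return "permanently_failed" if failed_runs >= MAX_FAILED_RUNS else "pending"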

Failures are tracked durably in .failures.db and exported to _failed_urls.json at run end, so they survive .state.db deletion and can be synced across machines.

Config Composition

Configs can inherit from a base config:

# base.toml
[fetch.headers]
User-Agent = "MyBot/1.0"

[policy]
max_retries = 3
concurrency = 5

# mysite.toml
extends = "base.toml"
name = "mysite"
start_urls = ["https://example.com"]
# ... site-specific config

Merge behavior:

  • Dicts: merge recursively (child overrides base)
  • Lists: replace entirely (no concatenation)
  • Scalars: replace entirely
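
These rules amount to a small recursive merge; a sketch (not databrew's actual implementation):

def merge(base: dict, child: dict) -> dict:
    # Sketch of the merge rules above; not databrew's actual implementation.
    out = dict(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)   # dicts merge recursively
        else:
            out[key] = value                    # lists and scalars replace
    return out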

Next Steps