Core Concepts

This guide explains the key concepts and architecture of Databrew.

Architecture Overview

Databrew has a clean separation of concerns:

Orchestrator (coordinates everything)
    ├── Store (SQLite URL queue + item storage)
    ├── Fetcher (HTTP with rate limiting)
    ├── Extractor (HTML/JSON parsing)
    └── Policy (retry rules, stopping conditions)
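
One way to picture these boundaries is as a set of small interfaces. The sketch below is illustrative only: the method names echo the data-flow diagram and orchestrator loop shown later in this guide, but the exact signatures are assumptions, not databrew's actual API.

from typing import Any, Protocol

class Store(Protocol):
    def get_pending_urls(self, limit: int) -> list[Any]: ...
    def save_item(self, item: dict[str, Any]) -> None: ...

class Fetcher(Protocol):
    async def fetch(self, url: str) -> str: ...

class Extractor(Protocol):
    def extract(self, content: str) -> dict[str, Any]: ...

class Policy(Protocol):
    def should_stop(self, stats: dict[str, int]) -> bool: ...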

URL Types

Databrew distinguishes between two types of URLs:

Pagination URLs

Listing pages that contain links to other pages and items:

  • Search result pages
  • Category listings
  • Index pages

Pagination URLs are always followed to discover new content.

Item URLs

Detail pages that contain the actual data to extract:

  • Product pages
  • Article pages
  • Profile pages

Item URLs are checked against storage before fetching (deduplication).
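
In code terms, the check before an item fetch looks something like this. It is a minimal sketch: item_exists is a hypothetical helper standing in for whatever lookup the store actually performs.

async def process_item_url(url, store, fetcher, extractor):
    # Deduplication: if the item is already stored, the URL is never fetched.
    if store.item_exists(url):          # hypothetical lookup (by URL or configured ID field)
        return
    content = await fetcher.fetch(url)
    item = extractor.extract(content)
    store.save_item(item)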

Why This Matters

The separation enables smart incremental crawling:

  1. Pagination pages are re-crawled each run to find new items
  2. Item pages are fetched only once: if the item already exists in storage, the URL is skipped
  3. When all items on a pagination page already exist, that branch stops

Data Flow

Config (TOML) → create_components() → Orchestrator.run()
Store.get_next_url() → Fetcher.fetch() → Extractor.extract()
        ↑                                      ↓
        └────────── Store.add_*_urls() ←───────┘
                    Store.save_item()

  1. Config Loading: TOML is parsed into typed configuration objects
  2. Component Creation: Store, fetcher, and extractor are initialized
  3. URL Queue: The orchestrator pulls URLs from the queue
  4. Fetching: The fetcher retrieves page content
  5. Extraction: The extractor parses items and discovers links
  6. Storage: Items are saved, new links are added to the queue
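
Put together, steps 1 through 3 might be wired up roughly like this. The import path, the loader name, the return shape of create_components(), and the Orchestrator constructor are all assumptions for this sketch.

import asyncio
from databrew import Orchestrator, create_components, load_config   # import path assumed

def main() -> None:
    config = load_config("configs/mysite.toml")                      # step 1: TOML -> typed config
    store, fetcher, extractor, policy = create_components(config)    # step 2: build components
    orchestrator = Orchestrator(store, fetcher, extractor, policy)   # step 3+: run the crawl loop
    asyncio.run(orchestrator.run())

if __name__ == "__main__":
    main()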

The Orchestrator

The orchestrator is the main crawl loop that coordinates everything:

async def run(self):
    while True:
        # Get batch of URLs
        tasks = self.store.get_pending_urls(limit=self.policy.concurrency)

        if not tasks:
            break  # Nothing left to process

        # Process concurrently
        results = await asyncio.gather(*[
            self._process_url(task) for task in tasks
        ])

        # Check stopping conditions
        if self.policy.should_stop(...):
            break

Key features:

  • Concurrent processing: Fetches multiple URLs in parallel
  • Automatic retries: Failed URLs are retried with exponential backoff (see the sketch after this list)
  • Incremental stopping: Each pagination branch stops independently
  • Progress tracking: Statistics are updated in real-time
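
For the retries, "exponential backoff" means the wait grows with each attempt, roughly as in the sketch below. The formula and defaults are illustrative; databrew's actual base delay and cap are not documented in this guide.

import random

def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.2) -> float:
    # attempt 0 -> ~1s, attempt 1 -> ~2s, attempt 2 -> ~4s, ...
    delay = base * (2 ** attempt)
    return delay + random.uniform(0, jitter * delay)    # random spread to avoid lockstep retries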

Extractors

Extractors parse page content and return structured data.

HTML Extractor

Uses CSS selectors to extract data from HTML:

[extract]
type = "html"

[extract.items]
selector = ".product"

[extract.items.fields]
title = "h2"
price = { selector = ".price", parser = "parse_price" }
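
Conceptually, that configuration maps to something like the following. The sketch uses BeautifulSoup purely for illustration; databrew's internal HTML engine and the real parse_price helper are not shown here.

from bs4 import BeautifulSoup

def parse_price(text: str) -> float:
    # Hypothetical parser: "$1,299.99" -> 1299.99
    return float(text.strip().lstrip("$").replace(",", ""))

def extract_products(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for node in soup.select(".product"):                 # [extract.items] selector
        title = node.select_one("h2")                    # plain string field: selector only
        price = node.select_one(".price")                # field with selector + parser
        items.append({
            "title": title.get_text(strip=True) if title else None,
            "price": parse_price(price.get_text()) if price else None,
        })
    return items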

JSON Extractor

Uses dot-notation paths to extract data from JSON:

[extract]
type = "json"

[extract.items]
path = "data.products"

[extract.items.fields]
title = "name"
price = { path = "pricing.amount", parser = "parse_float" }
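
Dot-notation paths can be resolved with a small helper like the one below; this is an illustrative sketch, not databrew's implementation, and the sample payload is made up.

from typing import Any

def resolve_path(data: Any, path: str) -> Any:
    # "pricing.amount" -> data["pricing"]["amount"]; returns None on a missing key
    for key in path.split("."):
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data

payload = {"data": {"products": [{"name": "Widget", "pricing": {"amount": "19.99"}}]}}
products = resolve_path(payload, "data.products") or []
rows = [
    {"title": p.get("name"), "price": float(resolve_path(p, "pricing.amount"))}
    for p in products
]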

Fetchers

Fetchers retrieve page content:

HTTP Fetcher (httpx)

The default fetcher uses httpx for fast HTTP requests:

[fetch]
type = "httpx"

[fetch.headers]
User-Agent = "MyBot/1.0"
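
In plain httpx terms, a fetch with those headers looks roughly like this. The sleep stands in for the fetcher's rate limiting, which is simplified here.

import asyncio
import httpx

async def fetch(url: str, headers: dict[str, str], delay: float = 1.0) -> str:
    async with httpx.AsyncClient(headers=headers, follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()
    await asyncio.sleep(delay)          # crude rate limiting between requests
    return response.text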

Browser Fetcher (pydoll)

For JavaScript-heavy sites, use browser rendering:

[fetch]
type = "pydoll"

[fetch.browser]
headless = true
wait_for_selector = ".content-loaded"

Policy

The policy controls crawl behavior:

[policy]
max_retries = 3           # Retry failed requests
max_requests = 1000       # Stop after N requests
concurrency = 5           # Parallel requests
delay = 1.0               # Delay between batches
jitter = 0.2              # Random delay (anti-fingerprinting)
max_consecutive_failures = 10  # Stop on too many failures
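
Focusing on the stopping conditions, the policy can be read as a simple predicate over crawl statistics. This is a sketch; the real should_stop signature and the handling of the other settings are not shown in this guide.

from dataclasses import dataclass

@dataclass
class Policy:
    max_requests: int = 1000
    max_consecutive_failures: int = 10

    def should_stop(self, total_requests: int, consecutive_failures: int) -> bool:
        # Stop the whole crawl once either global limit from [policy] is hit.
        return (
            total_requests >= self.max_requests
            or consecutive_failures >= self.max_consecutive_failures
        )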

Storage

Databrew uses SQLite for storage:

  • URL Queue: Tracks pending, completed, and failed URLs
  • Item Store: Stores extracted items as JSON
  • Deduplication: Items are deduplicated by ID field (if configured)

Storage location is configured in the config:

[storage]
path = "data/mysite"  # Creates data/mysite/state.db

The path is relative to the current working directory (where you run databrew), not to the location of the config file.
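
A plausible shape for state.db is sketched below. The schema is illustrative only (databrew's real layout is internal); it just mirrors the queue states and dedup behavior described above.

import sqlite3
from pathlib import Path

Path("data/mysite").mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect("data/mysite/state.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS urls (
    url         TEXT PRIMARY KEY,
    kind        TEXT,                  -- 'pagination' or 'item'
    status      TEXT,                  -- pending / completed / failed / permanently_failed
    failed_runs INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS items (
    id   TEXT PRIMARY KEY,             -- dedup key when an ID field is configured
    data TEXT                          -- extracted item stored as JSON
);
""")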

Incremental Crawling

Databrew supports smart incremental updates:

Per-Branch Stopping

Each pagination chain stops independently when it encounters a page where all items already exist:

Seed URL 1 → Page 1 → Page 2 → Page 3 (all items exist, STOP)
Seed URL 2 → Page 1 → Page 2 (new items found, continue...)

This is automatic for re-runs (when items already exist in storage).
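
In code terms, the decision for each pagination page looks roughly like this; the attribute and helper names are assumptions standing in for the store's real API.

def handle_pagination_page(store, page):
    # 'page' holds what the extractor found on this listing: item links and next-page links.
    new_item_urls = [u for u in page.item_urls if not store.item_exists(u)]
    store.add_item_urls(new_item_urls)
    if new_item_urls:
        # Something new on this page: keep following the branch.
        store.add_pagination_urls(page.next_page_urls)
    # Otherwise every item already exists, and this branch stops here.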

Cross-Run Retry

Item URLs that fail are automatically retried on subsequent runs:

Run 1: Item fails → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails → failed_runs=2
Run 3: Retry again → fails → status='permanently_failed'

After 3 failed runs, the URL is marked permanently failed.
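
The reset step at the start of a run can be pictured as two updates over the illustrative schema sketched in the Storage section; the SQL itself is an assumption, but the status names and the three-run limit follow the run log above.

MAX_FAILED_RUNS = 3

def reset_failed_urls(conn):
    # Give failed item URLs another chance, up to MAX_FAILED_RUNS runs.
    conn.execute(
        "UPDATE urls SET status = 'pending' "
        "WHERE status = 'failed' AND failed_runs < ?",
        (MAX_FAILED_RUNS,),
    )
    conn.execute(
        "UPDATE urls SET status = 'permanently_failed' "
        "WHERE status = 'failed' AND failed_runs >= ?",
        (MAX_FAILED_RUNS,),
    )
    conn.commit()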

Config Composition

Configs can inherit from a base config:

# base.toml
[fetch.headers]
User-Agent = "MyBot/1.0"

[policy]
max_retries = 3
concurrency = 5

# mysite.toml
extends = "base.toml"
name = "mysite"
start_urls = ["https://example.com"]
# ... site-specific config

Merge behavior (see the sketch after this list):

  • Dicts: merge recursively (child overrides base)
  • Lists: replace entirely (no concatenation)
  • Scalars: replace entirely
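
The merge rules amount to a small recursive function; the sketch below reproduces the behavior described above but is not databrew's code.

def merge(base: dict, child: dict) -> dict:
    # Dicts merge recursively; lists and scalars in the child replace the base value.
    out = dict(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out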

Next Steps