Core Concepts

This guide explains the key concepts and architecture of databrew.

Architecture Overview

Databrew has a clean separation of concerns:

Orchestrator (coordinates everything)
    ├── Store (SQLite URL queue + Parquet item storage)
    ├── Fetcher (HTTP with rate limiting)
    ├── Extractor (HTML/JSON parsing)
    └── Policy (retry rules, stopping conditions)

At the package level, these concerns map to component packages:

  • databrew.core: policy, stats, strict config models, and module-loading utilities
  • fetchkit: HTTP/browser fetchers, request pacer, fetcher registry
  • extractkit: HTML/JSON extractors and parser registry
  • itemstore: Parquet item storage and storage metadata contracts
  • databrew.state: URL queue and unified state store

The main databrew package composes these components and provides the CLI, config loading, orchestrator, middleware, and hooks.
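
Conceptually, the composition looks like this (a rough sketch: load_config(), the return shape of create_components(), and the Orchestrator constructor are illustrative, not databrew's exact API):

# Illustrative composition sketch; actual signatures may differ.
config = load_config("mysite.toml")
store, fetcher, extractor, policy = create_components(config)
orchestrator = Orchestrator(store, fetcher, extractor, policy)
await orchestrator.run()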

URL Types

Databrew distinguishes between two types of URLs:

Pagination URLs

Listing pages that contain links to other pages and items:

  • Search result pages
  • Category listings
  • Index pages

Pagination URLs are always followed to discover new content.

Item URLs

Detail pages that contain the actual data to extract:

  • Product pages
  • Article pages
  • Profile pages

Item URLs are checked against storage before fetching (deduplication).

Why This Matters

The separation enables smart incremental crawling:

  1. Pagination pages are re-crawled each run to find new items
  2. Item pages are fetched at most once: if the item already exists in storage, the URL is skipped (see the sketch after this list)
  3. When all items on a pagination page already exist, that branch stops
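
A minimal sketch of that fetch decision (has_item() is an illustrative name, not necessarily databrew's real API):

def should_fetch(url: str, url_type: str, store) -> bool:
    # Conceptual sketch; has_item() is an illustrative name.
    if url_type == "pagination":
        return True                    # listing pages are always re-crawled
    return not store.has_item(url)     # item pages are skipped once stored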

Data Flow

Config (TOML) → create_components() → Orchestrator.run()
Store.get_next_url() → Fetcher.fetch() → Extractor.extract()
        ↑                                      ↓
        └────────── Store.add_*_urls() ←───────┘
                    Store.save_item()

  1. Config Loading: TOML is parsed into typed configuration objects
  2. Component Creation: Store, fetcher, and extractor are initialized
  3. URL Queue: The orchestrator pulls URLs from the queue
  4. Fetching: The fetcher retrieves page content
  5. Extraction: The extractor parses items and discovers links
  6. Storage: Items are saved, new links are added to the queue

The Orchestrator

The orchestrator is the main crawl loop that coordinates everything:

async def run(self):
    while True:
        # Get batch of URLs
        tasks = self.store.get_pending_urls(limit=self.policy.concurrency)

        if not tasks:
            break  # Nothing left to process

        # Process concurrently
        results = await asyncio.gather(*[
            self._process_url(task) for task in tasks
        ])

        # Check stopping conditions
        if self.policy.should_stop(...):
            break

Key features:

  • Concurrent processing: Fetches multiple URLs in parallel
  • Automatic retries: Failed URLs are retried with exponential backoff (sketched below)
  • Incremental stopping: Each pagination branch stops independently
  • Lifecycle hooks: Shell commands at key points (start, failure, complete) for automated recovery
  • Progress tracking: Statistics are updated in real-time
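
For example, a typical exponential backoff schedule looks like this (illustrative; databrew's exact formula may differ):

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Illustrative schedule: 1s, 2s, 4s, ... capped at 60s.
    return min(base * 2 ** (attempt - 1), cap)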

Extractors

Extractors parse page content and return structured data.

HTML Extractor

Uses CSS selectors to extract data from HTML:

[extract]
type = "html"

[extract.items]
selector = ".product"

[extract.items.fields]
title = "h2"
price = { selector = ".price", parser = "parse_price" }
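
For example, given a page containing the markup below, this config would yield one item per .product element (assuming parse_price is a registered parser that strips the currency symbol):

html = """
<div class="product">
  <h2>Blue Widget</h2>
  <span class="price">$9.99</span>
</div>
"""
# With the config above, the extractor would yield roughly:
# {"title": "Blue Widget", "price": 9.99}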

JSON Extractor

Uses dot-notation paths to extract data from JSON:

[extract]
type = "json"

[extract.items]
path = "data.products"

[extract.items.fields]
title = "name"
price = { path = "pricing.amount", parser = "parse_float" }
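
For example, given the response below, this config would yield one item per element of data.products (assuming parse_float converts the string to a number):

payload = {
    "data": {
        "products": [
            {"name": "Blue Widget", "pricing": {"amount": "9.99"}},
        ]
    }
}
# With the config above, the extractor would yield roughly:
# {"title": "Blue Widget", "price": 9.99}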

Fetchers

Fetchers retrieve page content:

HTTP Fetcher (httpx)

The default fetcher uses httpx for fast HTTP requests:

[fetch]
type = "httpx"

[fetch.headers]
User-Agent = "MyBot/1.0"
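
Under the hood this is conventional httpx usage; a minimal sketch of the equivalent request (not databrew's actual fetcher code):

import httpx

async def fetch(url: str) -> str:
    # Minimal httpx GET sketch, not databrew's actual fetcher.
    async with httpx.AsyncClient(headers={"User-Agent": "MyBot/1.0"}) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.text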

Browser Fetcher (pydoll)

For JavaScript-heavy sites, use browser rendering:

[fetch]
type = "pydoll"

[fetch.browser]
headless = true
wait_for_selector = ".content-loaded"

Policy

The policy controls crawl behavior:

[policy]
max_retries = 3           # Retry failed requests
max_requests = 1000       # Stop after N requests
concurrency = 5           # Parallel requests
delay = 1.0               # Delay between batches
jitter = 0.2              # Random delay (anti-fingerprinting)
max_consecutive_failures = 50  # Stop on too many failures
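
Together, delay and jitter produce a randomized pause between batches; a sketch of how such pacing is commonly computed (databrew's exact formula may differ):

import random

def batch_pause(delay: float = 1.0, jitter: float = 0.2) -> float:
    # Sketch: base delay plus a random component (1.0-1.2s with these values).
    return delay + random.uniform(0, jitter)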

Storage

Databrew uses a dual-layer storage architecture:

data/mysite/
├── .state.db             # URL queue/retry state (ephemeral, gitignored)
├── .failures.db          # Durable failure tracking (local, gitignored)
├── _failed_urls.json     # Portable failure snapshot (committed/synced)
├── .index.db             # Storage dedupe/index catalog (ephemeral, gitignored)
└── items/
    ├── part_000001.parquet   # Rolling part files (compressed)
    ├── part_000002.parquet
    └── ...

  • .state.db: SQLite database for URL queue/retry state. This is local-only.
  • .failures.db: Durable failure tracking in a separate SQLite file. Survives .state.db deletion.
  • _failed_urls.json: Portable failure snapshot exported at run end. Safe for cross-machine sync (e.g. via git).
  • .index.db: SQLite storage catalog for dedupe and item metadata. This is local-only and auto-rebuilt from Parquet files on startup.
  • items/*.parquet: Rolling Parquet part files containing the actual extracted items. These are the source of truth and should be synced across machines.

Deduplication: items are deduplicated by their ID field, if one is configured.

The storage location is set in the config:

[storage]
path = "data/mysite"

The path is relative to the current working directory (where you run databrew), not to the config file's location.

See Working with Extracted Data for how to query and use the Parquet files.
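
For a quick look at the data, the part files can be read directly with pandas (assuming pyarrow is installed):

import pandas as pd

# Reads all part files in the directory into one DataFrame (requires pyarrow).
df = pd.read_parquet("data/mysite/items")
print(df.head())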

Incremental Crawling

Databrew supports smart incremental updates:

Per-Branch Stopping

Each pagination chain stops independently when it encounters a page where all items already exist:

Seed URL 1 → Page 1 → Page 2 → Page 3 (all items exist, STOP)
Seed URL 2 → Page 1 → Page 2 (new items found, continue...)

This is automatic for re-runs (when items already exist in storage).

Cross-Run Retry

Item URLs that fail are automatically retried on subsequent runs:

Run 1: Item fails → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails → failed_runs=2
Run 3: Retry again → fails → status='permanently_failed'

After 3 failed runs, the URL is marked permanently failed.
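
The rule reduces to a simple threshold; a sketch (the threshold of 3 matches the trace above; names are illustrative):

MAX_FAILED_RUNS = 3

def next_status(failed_runs: int) -> str:
    # Sketch of the rule described above; names are illustrative.
    return "permanently_failed" if failed_runs >= MAX_FAILED_RUNS else "pending"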

Failures are tracked durably in .failures.db and exported to _failed_urls.json at run end, so they survive .state.db deletion and can be synced across machines.

Config Composition

Configs can inherit from a base config:

# base.toml
[fetch.headers]
User-Agent = "MyBot/1.0"

[policy]
max_retries = 3
concurrency = 5

# mysite.toml
extends = "base.toml"
name = "mysite"
start_urls = ["https://example.com"]
# ... site-specific config

Merge behavior:

  • Dicts: merge recursively (child overrides base)
  • Lists: replace entirely (no concatenation)
  • Scalars: replace entirely
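
These rules amount to a small recursive merge; a sketch (not databrew's actual implementation):

def merge(base: dict, child: dict) -> dict:
    # Sketch of the merge rules above; not databrew's actual implementation.
    out = dict(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)   # dicts merge recursively
        else:
            out[key] = value                    # lists and scalars replace
    return out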

Next Steps