# Core Concepts
This guide explains the key concepts and architecture of databrew.
## Architecture Overview
Databrew has a clean separation of concerns:
```text
Orchestrator (coordinates everything)
├── Store (SQLite URL queue + item storage)
├── Fetcher (HTTP with rate limiting)
├── Extractor (HTML/JSON parsing)
└── Policy (retry rules, stopping conditions)
```
## URL Types
Databrew distinguishes between two types of URLs:
### Pagination URLs
Listing pages that contain links to other pages and items:
- Search result pages
- Category listings
- Index pages
Pagination URLs are always followed to discover new content.
### Item URLs
Detail pages that contain the actual data to extract:
- Product pages
- Article pages
- Profile pages
Item URLs are checked against storage before fetching (deduplication).
### Why This Matters
The separation enables smart incremental crawling:
- Pagination pages are re-crawled each run to find new items
- Item pages are fetched only once; if the item already exists in storage, the URL is skipped
- When all items on a pagination page already exist, that branch stops
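To make the distinction concrete, a config might declare separate selectors for the two URL types. The sketch below is purely illustrative: the `[extract.pagination]` and `[extract.item_links]` table names are assumptions, not the documented schema (the confirmed `[extract.items]` form appears later in this guide).
```toml
# Hypothetical sketch -- table and key names below are assumptions,
# not the documented databrew schema.
[extract.pagination]
selector = "a.next-page"      # followed every run to discover pages

[extract.item_links]
selector = "a.product-link"   # checked against storage before fetching
```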
## Data Flow
```text
Config (TOML) → create_components() → Orchestrator.run()
                                              ↓
Store.get_next_url() → Fetcher.fetch() → Extractor.extract()
        ↑                                     ↓
        └────────── Store.add_*_urls() ←──────┘
                                              ↓
                                      Store.save_item()
```
- Config Loading: TOML is parsed into typed configuration objects
- Component Creation: Store, fetcher, and extractor are initialized
- URL Queue: The orchestrator pulls URLs from the queue
- Fetching: The fetcher retrieves page content
- Extraction: The extractor parses items and discovers links
- Storage: Items are saved, new links are added to the queue
## The Orchestrator
The orchestrator is the main crawl loop that coordinates everything:
```python
async def run(self):
    while True:
        # Get batch of URLs
        tasks = self.store.get_pending_urls(limit=self.policy.concurrency)
        if not tasks:
            break  # Nothing left to process

        # Process concurrently
        results = await asyncio.gather(*[
            self._process_url(task) for task in tasks
        ])

        # Check stopping conditions
        if self.policy.should_stop(...):
            break
```
Key features:
- Concurrent processing: Fetches multiple URLs in parallel
- Automatic retries: Failed URLs are retried with exponential backoff
- Incremental stopping: Each pagination branch stops independently
- Progress tracking: Statistics are updated in real time
## Extractors
Extractors parse page content and return structured data.
### HTML Extractor
Uses CSS selectors to extract data from HTML:
```toml
[extract]
type = "html"

[extract.items]
selector = ".product"

[extract.items.fields]
title = "h2"
price = { selector = ".price", parser = "parse_price" }
```
### JSON Extractor
Uses dot-notation paths to extract data from JSON:
```toml
[extract]
type = "json"

[extract.items]
path = "data.products"

[extract.items.fields]
title = "name"
price = { path = "pricing.amount", parser = "parse_float" }
```
## Fetchers
Fetchers retrieve page content:
### HTTP Fetcher (httpx)
The default fetcher uses httpx for fast HTTP requests.
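As an illustration only, selecting the HTTP fetcher might look roughly like the sketch below; the `[fetch]` table name and its keys are assumptions, not the documented schema.
```toml
# Illustrative sketch only -- the [fetch] table and its keys are
# assumptions, not the documented databrew schema.
[fetch]
type = "http"   # httpx-backed fetcher (the default)
```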
### Browser Fetcher (pydoll)
For JavaScript-heavy sites, use browser rendering.
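With the same caveat (the `[fetch]` table and keys are assumptions), switching to the browser fetcher might look roughly like this:
```toml
# Illustrative sketch only -- the [fetch] table and its keys are
# assumptions, not the documented databrew schema.
[fetch]
type = "browser"   # render JavaScript with pydoll before extracting
```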
## Policy
The policy controls crawl behavior:
```toml
[policy]
max_retries = 3                 # Retry failed requests
max_requests = 1000             # Stop after N requests
concurrency = 5                 # Parallel requests
delay = 1.0                     # Delay between batches
jitter = 0.2                    # Random delay (anti-fingerprinting)
max_consecutive_failures = 10   # Stop on too many failures
```
## Storage
Databrew uses SQLite for storage:
- URL Queue: Tracks pending, completed, and failed URLs
- Item Store: Stores extracted items as JSON
- Deduplication: Items are deduplicated by ID field (if configured)
The storage location is set in the config file.
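As an illustration only, such a setting might look roughly like the sketch below; the `[storage]` table name and `path` key are assumptions rather than the confirmed schema.
```toml
# Illustrative sketch only -- the table and key names are assumptions.
[storage]
path = "data/items.db"   # SQLite file, resolved relative to the CWD
```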
The path is relative to CWD (where you run databrew), not the config file location.
## Incremental Crawling
Databrew supports smart incremental updates:
### Per-Branch Stopping
Each pagination chain stops independently when it encounters a page where all items already exist:
```text
Seed URL 1 → Page 1 → Page 2 → Page 3 (all items exist, STOP)
Seed URL 2 → Page 1 → Page 2 (new items found, continue...)
```
This happens automatically on re-runs, once items from previous runs already exist in storage.
### Cross-Run Retry
Item URLs that fail are automatically retried on subsequent runs:
```text
Run 1: Item fails → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails → failed_runs=2
Run 3: Retry again → fails → status='permanently_failed'
```
After 3 failed runs, the URL is marked permanently failed.
## Config Composition
Configs can inherit from a base config:
```toml
# mysite.toml
extends = "base.toml"
name = "mysite"
start_urls = ["https://example.com"]
# ... site-specific config
```
Merge behavior:
- Dicts: merge recursively (child overrides base)
- Lists: replace entirely (no concatenation)
- Scalars: replace entirely
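For example, given a hypothetical `base.toml` like the one sketched below, the `mysite.toml` shown above inherits the base `[policy]` table, while its own `start_urls` list replaces the base list outright:
```toml
# base.toml (hypothetical)
start_urls = ["https://example.com/default"]

[policy]
concurrency = 5
delay = 1.0

# If mysite.toml also set concurrency = 2 under [policy], the merged result
# would be:
#   start_urls = ["https://example.com"]    (child list replaces the base list)
#   [policy] concurrency = 2, delay = 1.0   (dicts merge recursively; scalars override)
#   name = "mysite"
```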
## Next Steps
- CLI Reference - All available commands
- Configuration Guide - Complete config reference
- HTML Extraction - CSS selector patterns