Performance Tuning
This guide helps you make crawls faster while keeping results stable and respectful to target sites.
Quick Tuning Checklist

- Set `concurrency` to match site tolerance and your machine limits.
- Keep `delay` small but non-zero for fragile sites.
- Use `items_from = "item"` when listing pages are only for discovery.
- Configure `id` in `[extract.items]` for reliable deduplication.
- Start with `-n` limits and scale gradually.
- Tune `[storage].max_pending_items` to reduce tiny Parquet parts.
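As a minimal sketch, the checklist knobs might come together in a config like the one below. The key names follow the options mentioned above, but the values and the exact placement of top-level keys are illustrative assumptions, not recommended defaults:

```toml
# Illustrative sketch only -- values and key placement are assumptions.
concurrency = 5            # match site tolerance and machine limits
delay = 0.5                # small but non-zero for fragile sites
items_from = "item"        # listing pages used only for discovery

[extract.items]
id = "product_id"          # hypothetical field used for deduplication

[storage]
max_pending_items = 2000   # larger buffer -> fewer tiny Parquet parts
```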
Recommended Starting Points

Small/fragile sites

Typical production crawls

High-throughput internal/API crawls
Use high-throughput settings only when the upstream service allows it.
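As a rough sketch, the three profiles above differ mainly in how aggressively they set `concurrency` and `delay`. The values below are illustrative assumptions, not tested recommendations:

```toml
# Small/fragile sites (illustrative)
concurrency = 2
delay = 1.0

# Typical production crawls (illustrative)
# concurrency = 5
# delay = 0.3

# High-throughput internal/API crawls (illustrative; only if upstream allows)
# concurrency = 20
# delay = 0.0
```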
Where Time Usually Goes
- Network wait: server latency, throttling, timeouts.
- Extractor cost: heavy selectors, parser complexity, large responses.
- Storage cost: item serialization + Parquet writes.
- Queue churn: large bursts of discovered links.
Databrew batches queue inserts and writes Parquet as append-only part files, which improves both queue and storage throughput at scale. Items are first persisted in a SQLite pending WAL and then flushed to Parquet, so larger buffers do not lose unflushed data on restart.
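The pending-buffer idea can be sketched generically. This is not Databrew's implementation: the `PendingBuffer` class, the table layout, and the use of JSON Lines in place of Parquet are all assumptions for illustration.

```python
import json
import sqlite3
import uuid
from pathlib import Path


class PendingBuffer:
    """Generic sketch: stage items in a SQLite WAL, flush to append-only parts."""

    def __init__(self, root: Path, max_pending: int = 1000) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.max_pending = max_pending
        self.db = sqlite3.connect(root / "pending.db")
        self.db.execute("PRAGMA journal_mode=WAL")
        self.db.execute("CREATE TABLE IF NOT EXISTS pending (item TEXT)")

    def add(self, item: dict) -> None:
        # Every item is durable in SQLite before it reaches a part file,
        # so a restart cannot lose buffered items.
        self.db.execute("INSERT INTO pending VALUES (?)", (json.dumps(item),))
        self.db.commit()
        count = self.db.execute("SELECT COUNT(*) FROM pending").fetchone()[0]
        if count >= self.max_pending:
            self.flush()

    def flush(self) -> None:
        rows = [r[0] for r in self.db.execute("SELECT item FROM pending")]
        if not rows:
            return
        # Append-only: each flush creates a new part file, never rewrites one.
        part = self.root / f"part-{uuid.uuid4().hex}.jsonl"
        part.write_text("\n".join(rows) + "\n")
        self.db.execute("DELETE FROM pending")
        self.db.commit()
```

A larger `max_pending` means fewer, larger part files at the cost of a bigger staging table; the durability guarantee is unchanged either way.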
Parquet Compaction

Databrew writes append-only part files during crawling. To merge many small files into fewer larger ones, use the `itemstore.compact_storage` API while the crawler is stopped:
```python
from pathlib import Path

from itemstore import compact_storage

# Preview only
result = compact_storage(storage_path=Path("data/mysite"), dry_run=True)

# Compact all part files
result = compact_storage(storage_path=Path("data/mysite"))

# Compact with custom compression and a target size
result = compact_storage(
    storage_path=Path("data/mysite"),
    compression="zstd",
    target_max_file_mb=90,
)
```
Use `target_max_file_mb` to keep compacted files under your Git safety threshold.
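For intuition, the compaction step can be sketched generically. This is not the `compact_storage` implementation: JSON Lines stands in for Parquet, and the size-based packing and file naming are assumptions.

```python
from pathlib import Path


def compact_parts(root: Path, target_max_bytes: int = 1_000_000) -> list[Path]:
    """Generic sketch: merge small append-only parts into fewer larger files."""
    parts = sorted(root.glob("part-*.jsonl"))
    outputs: list[Path] = []
    buf: list[bytes] = []
    size = 0

    def flush() -> None:
        nonlocal buf, size
        if not buf:
            return
        out = root / f"compacted-{len(outputs):04d}.jsonl"
        out.write_bytes(b"".join(buf))
        outputs.append(out)
        buf, size = [], 0

    for p in parts:
        data = p.read_bytes()
        # Start a new output file once the size cap would be exceeded,
        # mirroring the role of a target_max_file_mb-style limit.
        if size and size + len(data) > target_max_bytes:
            flush()
        buf.append(data)
        size += len(data)
    flush()
    for p in parts:
        p.unlink()  # remove the merged originals
    return outputs
```

The same shape explains why compaction must run while the crawler is stopped: a concurrent writer could add new part files after the merge set is chosen.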
Practical Workflow

- Validate config on a small run: `databrew run mysite.toml -n 100 --dry-run`
- Run a controlled crawl: `databrew run mysite.toml -n 1000 -c 5`
- Check status: `databrew status mysite.toml`
- Increase `concurrency` gradually while watching failure rate and 429s.
Browser Fetching Notes

Browser mode (`fetch.type = "pydoll"`) is slower and heavier than `httpx`.

- Keep `concurrency` lower (often 2-4).
- Use `wait_for_selector` only when needed.
- Avoid a large `wait_after_load` unless site behavior requires it.
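A hedged sketch of a browser-mode config using the options above; the selector, the numeric values, and the placement of `concurrency` outside the `[fetch]` table are illustrative assumptions:

```toml
concurrency = 3        # keep lower in browser mode (often 2-4)

[fetch]
type = "pydoll"                  # browser mode: slower and heavier than httpx
wait_for_selector = ".results"   # hypothetical selector; set only when needed
# wait_after_load = 2.0          # avoid large values unless the site requires it
```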
Safety and Stability

Fast crawls are only useful when they are correct and repeatable.

- Keep retries enabled for transient failures.
- Monitor `urls_failed` and `error_rate`.
- Prefer incremental mode for recurring crawls.
- Sync `items/*.parquet`; treat `.state.db` and `.index.db` as local ephemeral state.
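One way to follow that sync guidance, assuming a Git-synced project and the file names above, is an ignore file along these lines:

```
# .gitignore sketch: sync Parquet items, ignore local ephemeral state
.state.db
.index.db
```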