Incremental Crawling¶
Databrew is designed for incremental crawling—efficiently updating your data by only fetching what's new.
Resume Support¶
Automatic Resume¶
Databrew automatically tracks progress. If a crawl is interrupted, simply run again:
# First run (interrupted at 500 items)
databrew run mysite.toml
# Crawl interrupted (Ctrl+C, network error, etc.)
# Resume automatically
databrew run mysite.toml
# Resuming: 234 URLs pending
No special flags needed—databrew detects pending URLs and continues.
Fresh Start¶
To force a fresh crawl (re-add start URLs):
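The exact command isn't preserved on this page; a sketch, assuming a flag along the lines of `--fresh` (the flag name is a guess, not confirmed by this page):

```bash
# Hypothetical flag name -- re-queue the start URLs from the config
databrew run mysite.toml --fresh
```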
This adds start_urls again but keeps existing data.
Per-Branch Incremental Stopping¶
When re-running a crawl with existing data, databrew stops pagination branches intelligently.
How It Works¶
Each pagination chain stops independently when it encounters a page where all item links already exist in storage:
Seed URL 1 (Category A)
→ Page 1: 5 new items, continue...
→ Page 2: 3 new items, continue...
→ Page 3: 0 new items (all exist), STOP this branch
Seed URL 2 (Category B)
→ Page 1: 10 new items, continue...
→ Page 2: 8 new items, continue...
→ ... continues independently
Multi-Seed Crawls¶
This is particularly useful for multi-seed crawls (e.g., 100+ category URLs):
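For illustration, the seed list of such a config might look like this (start_urls is referenced elsewhere on this page; the exact TOML schema is an assumption):

```toml
# One seed per category; each becomes an independent pagination branch
start_urls = [
  "https://example.com/category/books",
  "https://example.com/category/music",
  # ... 100+ more category URLs
]
```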
Each category is processed as its own branch. Categories with no new items stop quickly, while active categories continue to full depth.
Detection Logic¶
A pagination page triggers "caught up" when:
- The crawl is incremental (items already exist in storage)
- The page is a pagination type URL
- The page has item links (not empty)
- All item links already exist in storage
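The four conditions above can be sketched as a single predicate (function and parameter names are illustrative, not databrew's internal API):

```python
def caught_up(incremental: bool, url_type: str,
              item_links: list[str], stored: set[str]) -> bool:
    """Sketch of the per-branch caught-up check described above."""
    return (
        incremental                      # items already exist in storage
        and url_type == "pagination"     # only pagination pages can trigger it
        and len(item_links) > 0          # an empty page never counts as caught up
        and all(link in stored for link in item_links)  # nothing new on this page
    )
```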
Fresh vs. Incremental¶
- Fresh crawl: All pagination is followed (caught-up detection disabled)
- Incremental crawl: Per-branch stopping is active
The mode is determined automatically based on whether items exist in storage.
Cross-Run Retry¶
Item URLs that fail (after exhausting retries) are automatically retried on subsequent runs.
Why Only Item URLs?¶
- Pagination pages hold dynamic data—retrying old pagination pages doesn't make sense
- Item pages hold static data—the item content doesn't change
Retry Progression¶
Run 1: Item URL fails after 3 retries → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails again → failed_runs=2
Run 3: Reset to pending, retry → fails again → status='permanently_failed'
After 3 failed runs, the URL is marked permanently_failed and won't be retried.
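The progression can be sketched as a small state update (the threshold of 3 comes from the text above; names are illustrative, not databrew's actual code):

```python
PERMANENT_FAILURE_THRESHOLD = 3  # from the progression above

def record_failed_run(failed_runs: int) -> tuple[str, int]:
    """Advance a URL's cross-run failure state after another failed run."""
    failed_runs += 1
    if failed_runs >= PERMANENT_FAILURE_THRESHOLD:
        return "permanently_failed", failed_runs
    return "failed", failed_runs
```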
Durable Failure Tracking¶
Failed URLs are tracked in a dedicated .failures.db SQLite file, separate from the
ephemeral .state.db. This means failures survive .state.db deletion (e.g. after
compaction or manual reset).
At run end, failures are also exported to _failed_urls.json for cross-machine
portability. On startup, the JSON snapshot is merged back into .failures.db using
field-level merge rules (failed_runs takes the MAX, timestamps take MIN/MAX), so
neither a stale local DB nor a stale CI snapshot can overwrite fresher data.
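The merge rules can be sketched like this (field names are assumptions; timestamps compared as ISO-8601 strings, which order correctly under string comparison):

```python
def merge_failure_record(local: dict, snapshot: dict) -> dict:
    """Field-level merge of a local .failures.db row with a _failed_urls.json
    row: failed_runs takes the MAX, the first-failure timestamp the MIN, the
    last-failure timestamp the MAX. A sketch; field names are illustrative."""
    return {
        "failed_runs": max(local["failed_runs"], snapshot["failed_runs"]),
        "first_failed_at": min(local["first_failed_at"], snapshot["first_failed_at"]),
        "last_failed_at": max(local["last_failed_at"], snapshot["last_failed_at"]),
    }
```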
When a previously-failed URL succeeds, its failure record is automatically resolved.
Checking Failed URLs¶
databrew status mysite.toml
# URLs failed: 5 # Will retry next run
# URLs permanently failed: 2 # Exhausted all retries
Full Crawl Mode¶
To crawl all pagination regardless of existing data:
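The exact flag isn't preserved here; a sketch, assuming an option along the lines of `--full` (the flag name is a guess, not confirmed by this page):

```bash
# Hypothetical flag name -- follow every pagination page
databrew run mysite.toml --full
```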
This disables caught-up detection but still skips existing items.
Use this when:
- You want to verify no items were missed
- The site structure changed
- You need to re-crawl all pages for updated content
Incremental Data Access¶
Since items include an _extracted_at timestamp, you can easily access only recent data:
import duckdb
# Items since a specific date
df = duckdb.sql("""
SELECT * FROM 'data/mysite/items/*.parquet'
WHERE _extracted_at >= '2026-01-20'
""").df()
Or read all items directly:
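Mirroring the example above, without the date filter:

```python
import duckdb

# All items across every part file
df = duckdb.sql("SELECT * FROM 'data/mysite/items/*.parquet'").df()
```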
State Management¶
Storage Layout¶
data/mysite/
├── .state.db # URL queue/retry state (ephemeral, gitignored)
├── .failures.db # Durable failure tracking (local, gitignored)
├── _failed_urls.json # Portable failure snapshot (committed/synced)
├── .index.db # Storage dedupe/index catalog (ephemeral, gitignored)
└── items/
├── part_000001.parquet # Rolling part files (compressed)
├── part_000002.parquet
└── ...
- .state.db contains URL queue/retry state.
- .failures.db tracks failed URLs durably, surviving .state.db deletion.
- _failed_urls.json is a portable snapshot of failures, exported at run end for cross-machine sync.
- .index.db contains storage dedupe/index metadata and is auto-rebuilt from Parquet files on startup.
- items/*.parquet contain the actual extracted items. These are the source of truth.
Cross-Machine Sync¶
Sync Parquet files and the failure snapshot. The .index.db file is rebuilt automatically:
# Machine A
databrew run mysite.toml
git add data/mysite/items/ data/mysite/_failed_urls.json
git commit -m "Crawled items"
git push
# Machine B
git pull
databrew run mysite.toml # Rebuilds index, merges failure snapshot, continues crawling
Add this to your .gitignore:
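Based on the storage layout above, the ephemeral databases are the files to ignore (adjust the paths to match your data directory):

```gitignore
data/*/.state.db
data/*/.failures.db
data/*/.index.db
```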
Best Practices¶
1. Use Item IDs¶
Always configure an ID field for proper deduplication:
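The exact schema isn't shown on this page; a sketch, assuming an ID key in the item section of the TOML config (key names are illustrative, not databrew's documented schema):

```toml
# Illustrative key names -- consult the configuration reference for the real schema
[item]
id_field = "sku"  # dedupe items by a stable unique field
```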
2. Start Small¶
Test with limited requests before full crawl:
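The exact flag isn't preserved here; a sketch, assuming a request-cap option (the flag name is a guess):

```bash
# Hypothetical flag name -- stop after a small number of requests
databrew run mysite.toml --max-requests 50
```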
3. Monitor Progress¶
Check status regularly:
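Using the status command shown earlier on this page:

```bash
databrew status mysite.toml
```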
4. Sync Parquet Files¶
Back up your data by syncing the Parquet files:
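One option is the git workflow from the Cross-Machine Sync section above; a plain rsync to a backup host works too (`backup-host` is a placeholder):

```bash
rsync -av data/mysite/items/ backup-host:backups/mysite/items/
```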
5. Handle Failures¶
If too many URLs fail:
- Check the site for changes
- Review your config selectors
- Consider increasing retries or delays
- Use lifecycle hooks for automated recovery
The on_failure hook runs when the crawl hits max_consecutive_failures. If your recovery script (e.g., refreshing cookies, rotating proxies) exits 0, the crawl resets its failure counter and resumes automatically.
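For illustration, a hook configuration might look like this (key names are assumptions, not databrew's documented schema):

```toml
# Illustrative key names -- consult the configuration reference for the real schema
max_consecutive_failures = 10

[hooks]
on_failure = "./refresh_session.sh"  # exit 0 => reset failure counter and resume
```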