
Incremental Crawling

Databrew is designed for incremental crawling: keeping your data up to date by fetching only what's new.

Resume Support

Automatic Resume

Databrew automatically tracks progress. If a crawl is interrupted, simply run again:

# First run (interrupted at 500 items)
databrew run mysite.toml
# Crawl interrupted (Ctrl+C, network error, etc.)

# Resume automatically
databrew run mysite.toml
# Resuming: 234 URLs pending

No special flags needed—databrew detects pending URLs and continues.
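Conceptually, the resume check is just a query against the state database. Below is a minimal sketch, assuming a hypothetical `urls` table with a `status` column; the real `.state.db` schema is internal to databrew:

```python
import sqlite3

def pending_urls(conn: sqlite3.Connection) -> list[str]:
    """Return URLs that were queued but never completed."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE status = 'pending'"
    ).fetchall()
    return [r[0] for r in rows]

# Demo with an in-memory database standing in for .state.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO urls VALUES (?, ?)",
    [
        ("https://example.com/p/1", "done"),
        ("https://example.com/p/2", "pending"),
        ("https://example.com/p/3", "pending"),
    ],
)
print(f"Resuming: {len(pending_urls(conn))} URLs pending")
```

On a second run, anything still marked pending is simply picked up again, which is why no flag is needed.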

Fresh Start

To force a fresh crawl (re-add start URLs):

databrew run mysite.toml --fresh

This adds start_urls again but keeps existing data.

Per-Branch Incremental Stopping

When re-running a crawl with existing data, databrew stops each pagination branch as soon as it catches up with previously stored items.

How It Works

Each pagination chain stops independently when it encounters a page where all item links already exist in storage:

Seed URL 1 (Category A)
  → Page 1: 5 new items, continue...
  → Page 2: 3 new items, continue...
  → Page 3: 0 new items (all exist), STOP this branch

Seed URL 2 (Category B)
  → Page 1: 10 new items, continue...
  → Page 2: 8 new items, continue...
  → ... continues independently

Multi-Seed Crawls

This is particularly useful for multi-seed crawls (e.g., 100+ category URLs):

start_urls = { file = "categories.txt" }  # 100 category URLs

Each category is processed as its own branch. Categories with no new items stop quickly, while active categories continue to full depth.

Detection Logic

A pagination page triggers "caught up" (stopping its branch) when all of the following hold:

  1. The crawl is incremental (items already exist in storage)
  2. The page is a pagination type URL
  3. The page has item links (not empty)
  4. All item links already exist in storage
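The four conditions above can be sketched as a single predicate. This is an illustrative reimplementation, not databrew's actual code; the parameter names (`is_incremental`, `page_type`, and so on) are assumptions:

```python
def is_caught_up(
    is_incremental: bool,
    page_type: str,
    item_links: list[str],
    stored_urls: set[str],
) -> bool:
    """True when a pagination page shows only already-stored items."""
    return (
        is_incremental                      # 1. crawl has existing items
        and page_type == "pagination"       # 2. pagination-type URL
        and len(item_links) > 0             # 3. page has item links
        and all(url in stored_urls for url in item_links)  # 4. nothing new
    )

stored = {"item/1", "item/2", "item/3"}
assert is_caught_up(True, "pagination", ["item/1", "item/2"], stored)  # stop branch
assert not is_caught_up(True, "pagination", ["item/4"], stored)        # new item: continue
assert not is_caught_up(True, "pagination", [], stored)                # empty page: continue
assert not is_caught_up(False, "pagination", ["item/1"], stored)       # fresh crawl: continue
```

Note that an empty page does not count as caught up: a selector failure could otherwise silently truncate the crawl.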

Fresh vs. Incremental

  • Fresh crawl: All pagination is followed (caught-up detection disabled)
  • Incremental crawl: Per-branch stopping is active

The mode is determined automatically based on whether items exist in storage.

Cross-Run Retry

Item URLs that fail (after exhausting retries) are automatically retried on subsequent runs.

Why Only Item URLs?

  • Pagination pages hold dynamic listings; an old pagination URL may no longer reflect the site, so retrying it is pointless
  • Item pages hold static content; a failed item fetch is worth retrying because the item itself doesn't change

Retry Progression

Run 1: Item URL fails after 3 retries → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails again → failed_runs=2
Run 3: Reset to pending, retry → fails again → status='permanently_failed'

After 3 failed runs, the URL is marked permanently_failed and won't be retried.
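The progression can be sketched as a small state machine. The field names (`status`, `failed_runs`) mirror the log lines above, and the threshold of 3 failed runs is taken from the text:

```python
MAX_FAILED_RUNS = 3

def record_run_failure(record: dict) -> dict:
    """Escalate a URL's failure record after another failed run."""
    record["failed_runs"] += 1
    if record["failed_runs"] >= MAX_FAILED_RUNS:
        record["status"] = "permanently_failed"   # no further retries
    else:
        record["status"] = "failed"               # reset to pending next run
    return record

rec = {"url": "https://example.com/item/42", "status": "pending", "failed_runs": 0}
for run in (1, 2, 3):
    rec = record_run_failure(rec)
    print(f"Run {run}: status={rec['status']}, failed_runs={rec['failed_runs']}")
```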

Durable Failure Tracking

Failed URLs are tracked in a dedicated .failures.db SQLite file, separate from the ephemeral .state.db. This means failures survive .state.db deletion (e.g. after compaction or manual reset).

At run end, failures are also exported to _failed_urls.json for cross-machine portability. On startup, the JSON snapshot is merged back into .failures.db using field-level merge rules (failed_runs takes the MAX, timestamps take MIN/MAX), so neither a stale local DB nor a stale CI snapshot can overwrite fresher data.

When a previously-failed URL succeeds, its failure record is automatically resolved.

Checking Failed URLs

databrew status mysite.toml
# URLs failed: 5              # Will retry next run
# URLs permanently failed: 2  # Exhausted all retries

Full Crawl Mode

To crawl all pagination regardless of existing data:

databrew run mysite.toml --full-crawl

This disables caught-up detection but still skips existing items.

Use this when:

  • You want to verify no items were missed
  • The site structure changed
  • You need to re-crawl all pages for updated content

Incremental Data Access

Since items include an _extracted_at timestamp, you can easily access only recent data:

import duckdb

# Items since a specific date
df = duckdb.sql("""
    SELECT * FROM 'data/mysite/items/*.parquet'
    WHERE _extracted_at >= '2026-01-20'
""").df()

Or read all items directly:

duckdb -c "SELECT COUNT(*) FROM 'data/mysite/items/*.parquet'"

State Management

Storage Layout

data/mysite/
├── .state.db             # URL queue/retry state (ephemeral, gitignored)
├── .failures.db          # Durable failure tracking (local, gitignored)
├── _failed_urls.json     # Portable failure snapshot (committed/synced)
├── .index.db             # Storage dedupe/index catalog (ephemeral, gitignored)
└── items/
    ├── part_000001.parquet   # Rolling part files (compressed)
    ├── part_000002.parquet
    └── ...

  • .state.db contains URL queue/retry state.
  • .failures.db tracks failed URLs durably, surviving .state.db deletion.
  • _failed_urls.json is a portable snapshot of failures, exported at run end for cross-machine sync.
  • .index.db contains storage dedupe/index metadata and is auto-rebuilt from Parquet files on startup.
  • items/*.parquet contain the actual extracted items. These are the source of truth.

Cross-Machine Sync

Sync Parquet files and the failure snapshot. The .index.db file is rebuilt automatically:

# Machine A
databrew run mysite.toml
git add data/mysite/items/ data/mysite/_failed_urls.json
git commit -m "Crawled items"
git push

# Machine B
git pull
databrew run mysite.toml  # Rebuilds index, merges failure snapshot, continues crawling

Add this to your .gitignore:

data/*/.state.db
data/*/.failures.db
data/*/.index.db

Best Practices

1. Use Item IDs

Always configure an ID field for proper deduplication:

[extract.items]
id = "product_id"  # Or "details.Property ID" for nested data
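To see why the ID field matters, here is an illustrative sketch of deduplication keyed on a possibly nested ID path such as "details.Property ID" (databrew's own dedup lives in `.index.db`; this is not its implementation):

```python
def get_by_path(item: dict, path: str):
    """Resolve a dotted path like 'details.Property ID' in a nested dict."""
    value = item
    for key in path.split("."):
        value = value[key]
    return value

def dedupe(items: list[dict], id_path: str) -> list[dict]:
    """Keep the first occurrence of each ID, drop the rest."""
    seen: set = set()
    unique = []
    for item in items:
        item_id = get_by_path(item, id_path)
        if item_id not in seen:
            seen.add(item_id)
            unique.append(item)
    return unique

items = [
    {"details": {"Property ID": "A1"}, "price": 100},
    {"details": {"Property ID": "A1"}, "price": 120},  # same ID: dropped
    {"details": {"Property ID": "B2"}, "price": 200},
]
assert len(dedupe(items, "details.Property ID")) == 2
```

Without a stable ID, the same item re-fetched from a different URL would be stored twice.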

2. Start Small

Test with limited requests before full crawl:

databrew run mysite.toml -n 100

3. Monitor Progress

Check status regularly:

databrew status mysite.toml

4. Sync Parquet Files

Back up your data by syncing the Parquet files:

git add data/mysite/items/
git commit -m "Crawled items"

5. Handle Failures

If too many URLs fail:

  1. Check the site for changes
  2. Review your config selectors
  3. Consider increasing retries or delays
  4. Use lifecycle hooks for automated recovery

For example:

[policy]
max_retries = 5
delay = 2.0

[hooks]
on_failure = "python scripts/recover.py {name}"

The on_failure hook runs when the crawl hits max_consecutive_failures. If your recovery script (e.g., refreshing cookies, rotating proxies) exits 0, the crawl resets its failure counter and resumes automatically.