
Incremental Crawling

Databrew is designed for incremental crawling: keeping your data up to date by fetching only what's new.

Resume Support

Automatic Resume

Databrew automatically tracks progress. If a crawl is interrupted, simply run again:

# First run (interrupted at 500 items)
databrew run mysite.toml
# Crawl interrupted (Ctrl+C, network error, etc.)

# Resume automatically
databrew run mysite.toml
# Resuming: 234 URLs pending

No special flags needed—databrew detects pending URLs and continues.
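Conceptually, the resume check is just a query against the state database. Below is a minimal sketch, assuming a hypothetical `urls` table with a `status` column; the real `.state.db` schema is internal to databrew:

```python
import sqlite3

def pending_urls(conn: sqlite3.Connection) -> list[str]:
    """Return URLs that were queued but never completed."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE status = 'pending'"
    ).fetchall()
    return [r[0] for r in rows]

# Demo with an in-memory database standing in for .state.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO urls VALUES (?, ?)",
    [
        ("https://example.com/p/1", "done"),
        ("https://example.com/p/2", "pending"),
        ("https://example.com/p/3", "pending"),
    ],
)
print(f"Resuming: {len(pending_urls(conn))} URLs pending")
```

On a second run, anything still marked pending is simply picked up again, which is why no flag is needed.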

Fresh Start

To force a fresh crawl (re-add start URLs):

databrew run mysite.toml --fresh

This adds start_urls again but keeps existing data.

Per-Branch Incremental Stopping

When re-running a crawl with existing data, databrew stops each pagination branch as soon as it catches up with previously stored items.

How It Works

Each pagination chain stops independently when it encounters a page where all item links already exist in storage:

Seed URL 1 (Category A)
  → Page 1: 5 new items, continue...
  → Page 2: 3 new items, continue...
  → Page 3: 0 new items (all exist), STOP this branch

Seed URL 2 (Category B)
  → Page 1: 10 new items, continue...
  → Page 2: 8 new items, continue...
  → ... continues independently

Multi-Seed Crawls

This is particularly useful for multi-seed crawls (e.g., 100+ category URLs):

start_urls = { file = "categories.txt" }  # 100 category URLs

Each category is processed as its own branch. Categories with no new items stop quickly, while active categories continue to full depth.

Detection Logic

A pagination page triggers "caught up" (stopping its branch) when all of the following hold:

  1. The crawl is incremental (items already exist in storage)
  2. The page is a pagination type URL
  3. The page has item links (not empty)
  4. All item links already exist in storage
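The four conditions above can be sketched as a single predicate. This is an illustrative reimplementation, not databrew's actual code; the parameter names (`is_incremental`, `page_type`, and so on) are assumptions:

```python
def is_caught_up(
    is_incremental: bool,
    page_type: str,
    item_links: list[str],
    stored_urls: set[str],
) -> bool:
    """True when a pagination page shows only already-stored items."""
    return (
        is_incremental                      # 1. crawl has existing items
        and page_type == "pagination"       # 2. pagination-type URL
        and len(item_links) > 0             # 3. page has item links
        and all(url in stored_urls for url in item_links)  # 4. nothing new
    )

stored = {"item/1", "item/2", "item/3"}
assert is_caught_up(True, "pagination", ["item/1", "item/2"], stored)  # stop branch
assert not is_caught_up(True, "pagination", ["item/4"], stored)        # new item: continue
assert not is_caught_up(True, "pagination", [], stored)                # empty page: continue
assert not is_caught_up(False, "pagination", ["item/1"], stored)       # fresh crawl: continue
```

Note that an empty page does not count as caught up: a selector failure could otherwise silently truncate the crawl.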

Fresh vs. Incremental

  • Fresh crawl: All pagination is followed (caught-up detection disabled)
  • Incremental crawl: Per-branch stopping is active

The mode is determined automatically based on whether items exist in storage.

Cross-Run Retry

Item URLs that fail (after exhausting retries) are automatically retried on subsequent runs.

Why Only Item URLs?

  • Pagination pages hold dynamic listings; an old pagination URL may no longer reflect the site, so retrying it is pointless
  • Item pages hold static content; a failed item fetch is worth retrying because the item itself doesn't change

Retry Progression

Run 1: Item URL fails after 3 retries → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails again → failed_runs=2
Run 3: Reset to pending, retry → fails again → status='permanently_failed'

After 3 failed runs, the URL is marked permanently_failed and won't be retried.
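The progression can be sketched as a small state machine. The field names (`status`, `failed_runs`) mirror the log lines above, and the threshold of 3 failed runs is taken from the text:

```python
MAX_FAILED_RUNS = 3

def record_run_failure(record: dict) -> dict:
    """Escalate a URL's failure record after another failed run."""
    record["failed_runs"] += 1
    if record["failed_runs"] >= MAX_FAILED_RUNS:
        record["status"] = "permanently_failed"   # no further retries
    else:
        record["status"] = "failed"               # reset to pending next run
    return record

rec = {"url": "https://example.com/item/42", "status": "pending", "failed_runs": 0}
for run in (1, 2, 3):
    rec = record_run_failure(rec)
    print(f"Run {run}: status={rec['status']}, failed_runs={rec['failed_runs']}")
```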

Durable Failure Tracking

Failed URLs are tracked in a dedicated .failures.db SQLite file, separate from the ephemeral .state.db. This means failures survive .state.db deletion (e.g. after compaction or manual reset).

At run end, failures are also exported to _failed_urls.json for cross-machine portability. On startup, the JSON snapshot is merged back into .failures.db using field-level merge rules (failed_runs takes the MAX, timestamps take MIN/MAX), so neither a stale local DB nor a stale CI snapshot can overwrite fresher data.

When a previously-failed URL succeeds, its failure record is automatically resolved.

Checking Failed URLs

databrew status mysite.toml
# URLs failed: 5              # Will retry next run
# URLs permanently failed: 2  # Exhausted all retries

Full Crawl Mode

To crawl all pagination regardless of existing data:

databrew run mysite.toml --full-crawl

This disables caught-up detection but still skips existing items.

Use this when:

  • You want to verify no items were missed
  • The site structure changed
  • You need to re-crawl all pages for updated content

Incremental Data Access

Since items include an _extracted_at timestamp, you can easily access only recent data:

import duckdb

# Items since a specific date
df = duckdb.sql("""
    SELECT * FROM 'data/mysite/items/*.parquet'
    WHERE _extracted_at >= '2026-01-20'
""").df()

Or read all items directly:

duckdb -c "SELECT COUNT(*) FROM 'data/mysite/items/*.parquet'"

State Management

Storage Layout

data/mysite/
├── .state.db             # URL queue/retry state (ephemeral, gitignored)
├── .failures.db          # Durable failure tracking (local, gitignored)
├── _failed_urls.json     # Portable failure snapshot (committed/synced)
├── .index.db             # Storage dedupe/index catalog (ephemeral, gitignored)
└── items/
    ├── part_000001.parquet   # Rolling part files (compressed)
    ├── part_000002.parquet
    └── ...

  • .state.db contains URL queue/retry state.
  • .failures.db tracks failed URLs durably, surviving .state.db deletion.
  • _failed_urls.json is a portable snapshot of failures, exported at run end for cross-machine sync.
  • .index.db contains storage dedupe/index metadata and is auto-rebuilt from Parquet files on startup.
  • items/*.parquet contain the actual extracted items. These are the source of truth.

Cross-Machine Sync

Sync Parquet files and the failure snapshot. The .index.db file is rebuilt automatically:

# Machine A
databrew run mysite.toml
git add data/mysite/items/ data/mysite/_failed_urls.json
git commit -m "Crawled items"
git push

# Machine B
git pull
databrew run mysite.toml  # Rebuilds index, merges failure snapshot, continues crawling

Add this to your .gitignore:

data/*/.state.db
data/*/.failures.db
data/*/.index.db

Best Practices

1. Use Item IDs

Always configure an ID field for proper deduplication:

[extract.items]
id = "product_id"  # Or "details.Property ID" for nested data
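To see why the ID field matters, here is an illustrative sketch of deduplication keyed on a possibly nested ID path such as "details.Property ID" (databrew's own dedup lives in `.index.db`; this is not its implementation):

```python
def get_by_path(item: dict, path: str):
    """Resolve a dotted path like 'details.Property ID' in a nested dict."""
    value = item
    for key in path.split("."):
        value = value[key]
    return value

def dedupe(items: list[dict], id_path: str) -> list[dict]:
    """Keep the first occurrence of each ID, drop the rest."""
    seen: set = set()
    unique = []
    for item in items:
        item_id = get_by_path(item, id_path)
        if item_id not in seen:
            seen.add(item_id)
            unique.append(item)
    return unique

items = [
    {"details": {"Property ID": "A1"}, "price": 100},
    {"details": {"Property ID": "A1"}, "price": 120},  # same ID: dropped
    {"details": {"Property ID": "B2"}, "price": 200},
]
assert len(dedupe(items, "details.Property ID")) == 2
```

Without a stable ID, the same item re-fetched from a different URL would be stored twice.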

2. Start Small

Test with limited requests before full crawl:

databrew run mysite.toml -n 100

3. Monitor Progress

Check status regularly:

databrew status mysite.toml

4. Sync Parquet Files

Back up your data by syncing the Parquet files:

git add data/mysite/items/
git commit -m "Crawled items"

5. Handle Failures

If too many URLs fail:

  1. Check the site for changes
  2. Review your config selectors
  3. Consider increasing retries or delays
  4. Use lifecycle hooks for automated recovery

For example:

[policy]
max_retries = 5
delay = 2.0

[hooks]
on_failure = "python scripts/recover.py {name}"

The on_failure hook runs when the crawl hits max_consecutive_failures. If your recovery script (e.g., refreshing cookies, rotating proxies) exits 0, the crawl resets its failure counter and resumes automatically.