Incremental Crawling

Databrew is designed for incremental crawling—efficiently updating your data by only fetching what's new.

Resume Support

Automatic Resume

Databrew automatically tracks progress. If a crawl is interrupted, simply run again:

# First run (interrupted at 500 items)
databrew run mysite.toml
# Crawl interrupted (Ctrl+C, network error, etc.)

# Resume automatically
databrew run mysite.toml
# Resuming: 234 URLs pending

No special flags needed—databrew detects pending URLs and continues.
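
Conceptually, every discovered URL is tracked with a status, and a new run simply processes whatever is still pending. A minimal sketch of that idea (illustrative only, not databrew's internals):

# Sketch of status-based resume (illustrative only, not databrew's internals).
queue = {
    "https://example.com/category/1?page=1": "done",
    "https://example.com/item/41": "done",
    "https://example.com/item/42": "pending",   # run was interrupted here
    "https://example.com/item/43": "pending",
}

def resume(url_queue: dict[str, str]) -> None:
    pending = [url for url, status in url_queue.items() if status == "pending"]
    print(f"Resuming: {len(pending)} URLs pending")
    for url in pending:
        # fetch and extract the URL here ...
        url_queue[url] = "done"

resume(queue)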

Fresh Start

To force a fresh crawl (re-add start URLs):

databrew run mysite.toml --fresh

This adds start_urls again but keeps existing data.

Reset Everything

To start completely over:

# Reset URL queue only (keep items)
databrew reset mysite.toml

# Delete everything (queue + items)
databrew reset mysite.toml --all

Per-Branch Incremental Stopping

When re-running a crawl with existing data, databrew stops each pagination branch as soon as it catches up with items that are already in storage.

How It Works

Each pagination chain stops independently when it encounters a page where all item links already exist in storage:

Seed URL 1 (Category A)
  → Page 1: 5 new items, continue...
  → Page 2: 3 new items, continue...
  → Page 3: 0 new items (all exist), STOP this branch

Seed URL 2 (Category B)
  → Page 1: 10 new items, continue...
  → Page 2: 8 new items, continue...
  → ... continues independently
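
In outline, every seed URL drives its own pagination loop, and only that loop stops when it reaches a fully-seen page. A rough sketch of the idea (not databrew's actual code; fetch_page is a hypothetical helper returning a page's item links and next-page URL):

# Rough sketch of per-branch stopping (illustrative; fetch_page is hypothetical).
def crawl_branch(seed_url, stored_item_urls, fetch_page):
    """Follow one pagination chain until it catches up with stored items."""
    page_url = seed_url
    while page_url:
        item_links, next_page = fetch_page(page_url)
        new_links = [u for u in item_links if u not in stored_item_urls]
        if item_links and not new_links:
            break                          # all items known: stop this branch only
        stored_item_urls.update(new_links)
        # enqueue the new item URLs for extraction here ...
        page_url = next_page

# Each seed runs its own loop, so branches stop independently:
# for seed in start_urls:
#     crawl_branch(seed, stored_item_urls, fetch_page)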

Multi-Seed Crawls

This is particularly useful for multi-seed crawls (e.g., 100+ category URLs):

start_urls = { file = "categories.txt" }  # 100 category URLs

Each category is processed as its own branch. Categories with no new items stop quickly, while active categories continue to full depth.

Detection Logic

A pagination page triggers "caught up" when all of the following hold (see the sketch after this list):

  1. The crawl is incremental (items already exist in storage)
  2. The page is a pagination type URL
  3. The page has item links (not empty)
  4. All item links already exist in storage
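
A compact way to express those four conditions (a sketch, not databrew's implementation):

# Sketch of the "caught up" check above (not databrew's implementation).
def is_caught_up(incremental: bool, is_pagination_page: bool,
                 item_links: list[str], stored_item_urls: set[str]) -> bool:
    return (
        incremental                                      # 1. items already in storage
        and is_pagination_page                           # 2. pagination-type URL
        and len(item_links) > 0                          # 3. page has item links
        and all(url in stored_item_urls for url in item_links)   # 4. all already stored
    )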

Fresh vs. Incremental

  • Fresh crawl: All pagination is followed (caught-up detection disabled)
  • Incremental crawl: Per-branch stopping is active

The mode is determined automatically based on whether items exist in storage.

Cross-Run Retry

Item URLs that fail (after exhausting retries) are automatically retried on subsequent runs.

Why Only Item URLs?

  • Pagination pages hold dynamic data: their listings shift between runs, so retrying an old pagination URL isn't useful
  • Item pages hold static data: the item's content doesn't change, so a later retry can still capture it

Retry Progression

Run 1: Item URL fails after 3 retries → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails again → failed_runs=2
Run 3: Reset to pending, retry → fails again → status='permanently_failed'

After 3 failed runs, the URL is marked permanently_failed and won't be retried.
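
A schematic of that bookkeeping (illustrative only; the field names mirror the status output above, not necessarily databrew's internals):

# Illustrative sketch of cross-run retry bookkeeping (not databrew's internals).
MAX_FAILED_RUNS = 3   # mirrors the 3-run limit described above

def on_run_start(record: dict) -> None:
    """At the start of a run, previously failed item URLs get another chance."""
    if record["status"] == "failed":
        record["status"] = "pending"

def on_fetch_failed(record: dict) -> None:
    """Called after in-run retries are exhausted."""
    record["failed_runs"] = record.get("failed_runs", 0) + 1
    if record["failed_runs"] >= MAX_FAILED_RUNS:
        record["status"] = "permanently_failed"   # never retried again
    else:
        record["status"] = "failed"               # retried on the next run

record = {"url": "https://example.com/item/7", "status": "pending"}
for run in range(3):
    on_run_start(record)
    on_fetch_failed(record)
print(record["status"], record["failed_runs"])   # permanently_failed 3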

Checking Failed URLs

databrew status mysite.toml
# URLs failed: 5              # Will retry next run
# URLs permanently failed: 2  # Exhausted all retries

Full Crawl Mode

To crawl all pagination regardless of existing data:

databrew run mysite.toml --full-crawl

This disables caught-up detection but still skips existing items.
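
Conceptually, --full-crawl flips only one of two independent checks. A simplified sketch:

# Simplified sketch of the two independent checks (illustrative only).
def stop_pagination_branch(page_is_caught_up: bool, full_crawl: bool) -> bool:
    # --full-crawl disables caught-up detection for pagination pages ...
    return page_is_caught_up and not full_crawl

def should_fetch_item(item_url: str, stored_item_urls: set[str]) -> bool:
    # ... but items already in storage are still skipped.
    return item_url not in stored_item_urls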

Use this when:

  • You want to verify no items were missed
  • The site structure changed
  • You need to re-crawl all pages for updated content

Incremental Exports

Export only items extracted since a specific time:

# Export items since a date
databrew export mysite.toml -o new_items.jsonl --since "2026-01-20"

# Export items since a timestamp
databrew export mysite.toml -o new_items.jsonl --since "2026-01-20T14:30:00"

This is useful for:

  • Daily/weekly data syncs (see the sketch after this list)
  • Streaming new items to another system
  • Generating delta files
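
For example, a daily delta job can wrap the export command shown above. A minimal sketch (the output directory and schedule are placeholders; run it from cron or similar):

# Minimal sketch of a daily delta export wrapping the databrew CLI.
# The "deltas/" directory is a placeholder; schedule with cron or similar.
import subprocess
from datetime import date, timedelta
from pathlib import Path

def export_yesterdays_items(config: str = "mysite.toml") -> Path:
    since = (date.today() - timedelta(days=1)).isoformat()
    out_dir = Path("deltas")
    out_dir.mkdir(exist_ok=True)
    out_file = out_dir / f"items_{since}.jsonl"
    subprocess.run(
        ["databrew", "export", config, "-o", str(out_file), "--since", since],
        check=True,
    )
    return out_file

if __name__ == "__main__":
    print("Wrote", export_yesterdays_items())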

State Management

State File Location

State is stored in state.db in the output directory:

data/mysite/
├── state.db      # SQLite database with URL queue + items
└── mysite.jsonl  # Exported data

Backing Up State

The state file is a standard SQLite database. Back it up like any file:

cp data/mysite/state.db data/mysite/state.db.backup
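
A plain cp is fine while no crawl is running. If the crawler might be writing to state.db at the same time, SQLite's online backup API makes a consistent copy; a sketch using Python's standard library:

# Consistent copy of a (possibly in-use) SQLite state file using the
# standard library's online backup API.
import sqlite3

def backup_state(src: str = "data/mysite/state.db",
                 dst: str = "data/mysite/state.db.backup") -> None:
    source = sqlite3.connect(src)
    target = sqlite3.connect(dst)
    try:
        source.backup(target)
    finally:
        source.close()
        target.close()

backup_state()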

Importing Data

Repopulate state from exported data:

# After a reset or on a new machine
databrew import mysite.toml backup.jsonl

Best Practices

1. Use Item IDs

Always configure an ID field for proper deduplication:

[extract.items]
id = "product_id"  # Or "details.Property ID" for nested data
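
The ID is what lets an incremental run recognize items it has already stored. A tiny sketch of ID-based deduplication, including the nested "details.Property ID" form (illustrative, not databrew's implementation):

# Illustrative sketch of ID-based deduplication (not databrew's implementation).
def get_item_id(item: dict, id_field: str):
    """Resolve a possibly nested ID path such as "details.Property ID"."""
    value = item
    for key in id_field.split("."):
        value = value[key]
    return value

def is_duplicate(item: dict, id_field: str, seen_ids: set) -> bool:
    return get_item_id(item, id_field) in seen_ids

seen = {"A-100"}
item = {"details": {"Property ID": "A-100"}, "price": 250000}
print(is_duplicate(item, "details.Property ID", seen))   # True -> item is skipped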

2. Start Small

Test with a limited number of requests before running a full crawl:

databrew run mysite.toml -n 100

3. Monitor Progress

Check status regularly:

databrew status mysite.toml

4. Export Regularly

Export data periodically to avoid losing work:

databrew export mysite.toml -o data/mysite/backup_$(date +%Y%m%d).jsonl

5. Handle Failures

If too many URLs fail:

  1. Check the site for changes
  2. Review your config selectors
  3. Consider increasing retries or delays:

[policy]
max_retries = 5
delay = 2.0