CLI Reference¶
Databrew provides a command-line interface for running crawls and managing state.
Commands Overview¶
databrew [-h] {run,status,list,check,init} ...
Commands:
run Run extraction from config
status Show crawl status
list List available configs
check Validate config file
init Generate a starter config file
run¶
Run extraction from one or more config files.
Options¶
| Option | Description |
|---|---|
-v, --verbose |
Verbose output |
-n, --limit LIMIT |
Max requests to process |
-c, --concurrency N |
Concurrent requests (default: 5) |
--start-urls FILE |
File with additional start URLs |
--full-crawl |
Ignore caught-up detection |
--dry-run |
Validate without fetching |
--fresh |
Force fresh crawl |
--retry-failed |
Retry only failed URLs from previous runs |
--on-start CMD |
Shell command before crawl (exit non-zero to abort) |
--on-failure CMD |
Shell command on consecutive failure stop (exit 0 to resume) |
--on-complete CMD |
Shell command after crawl finishes |
Examples¶
# Basic run
databrew run mysite.toml
# Limit to 50 requests
databrew run mysite.toml -n 50
# With custom concurrency
databrew run mysite.toml -c 10
# Force fresh crawl
databrew run mysite.toml --fresh
# Retry only failed URLs
databrew run mysite.toml --retry-failed
# Dry run (preview)
databrew run mysite.toml --dry-run
# Multiple configs (batch mode)
databrew run examples/*.toml -n 100
# With lifecycle hooks
databrew run mysite.toml --on-failure "python scripts/recover.py {name}"
databrew run mysite.toml --on-start "echo start" --on-complete "echo done"
Hook CLI flags override values from the [hooks] config section. See Lifecycle Hooks for details.
Resume Behavior¶
By default, databrew automatically resumes interrupted crawls:
- If the URL queue has pending URLs → resume (continues from where it stopped)
- If the queue is empty → fresh start (adds start_urls)
Use --fresh to force a fresh crawl even if URLs are pending.
Use --retry-failed to process only URLs currently in failed state. This skips
start URL seeding and temporarily defers pending/in-progress URLs.
status¶
Show crawl status and statistics.
Example Output¶
Status for: realethio
Storage: data/realethio
Items stored: 1763
Storage files: 3
- part_000001.parquet
- part_000002.parquet
- part_000003.parquet
URLs pending: 0
URLs completed: 1996
URLs failed: 0
check¶
Validate a config file without running.
Examples¶
init¶
Generate a starter config file.
Options¶
| Option | Description |
|---|---|
--url URL |
Start URL for the site |
--type TYPE |
Extraction type: html or json |
-o, --output PATH |
Output file path |
Examples¶
# Generate HTML config
databrew init mysite --url "https://example.com" --type html
# Generate JSON API config
databrew init myapi --url "https://api.example.com/items" --type json
list¶
List available config files in a directory.
Exit Codes¶
| Code | Description |
|---|---|
| 0 | Success |
| 1 | Error (invalid config, crawl failure, etc.) |