Skip to content

CLI Reference

Databrew provides a command-line interface for running crawls, exporting data, and managing state.

Commands Overview

databrew [-h] {run,status,export,reset,list,import,check,init} ...

Commands:
  run       Run extraction from config
  status    Show crawl status
  export    Export data to JSON/JSONL
  reset     Reset crawl state
  list      List available configs
  import    Import data from JSON/JSONL
  check     Validate config file
  init      Generate a starter config file

run

Run extraction from one or more config files.

databrew run [OPTIONS] CONFIG [CONFIG ...]

Options

Option Description
-v, --verbose Verbose output
-n, --limit LIMIT Max requests to process
-c, --concurrency N Concurrent requests (default: 5)
--start-urls FILE File with additional start URLs
--full-crawl Ignore caught-up detection
--dry-run Validate without fetching
--fresh Force fresh crawl

Examples

# Basic run
databrew run mysite.toml

# Limit to 50 requests
databrew run mysite.toml -n 50

# With custom concurrency
databrew run mysite.toml -c 10

# Force fresh crawl
databrew run mysite.toml --fresh

# Dry run (preview)
databrew run mysite.toml --dry-run

# Multiple configs (batch mode)
databrew run examples/*.toml -n 100

Resume Behavior

By default, databrew automatically resumes interrupted crawls:

  • If the URL queue has pending URLs → resume (continues from where it stopped)
  • If the queue is empty → fresh start (adds start_urls)

Use --fresh to force a fresh crawl even if URLs are pending.

export

Export extracted data to various formats.

databrew export [OPTIONS] CONFIG

Options

Option Description
-o, --output PATH Output file/directory path
-f, --format FORMAT Output format: jsonl, json, individual, parquet
--no-meta Exclude _source_url and _extracted_at
--url-type TYPE Filter by URL type: item, pagination, all
--since TIMESTAMP Only export items after this timestamp

Format Detection

Format is auto-detected from the output file extension:

  • .jsonl → JSONL (one JSON object per line)
  • .json → JSON array
  • .parquet → Parquet (requires analytics extra)

Examples

# Export to JSONL
databrew export mysite.toml -o data.jsonl

# Export to Parquet (fast, compressed)
databrew export mysite.toml -o data.parquet

# Incremental export (items since date)
databrew export mysite.toml -o data.jsonl --since "2026-01-20"

status

Show crawl status and statistics.

databrew status CONFIG

Example Output

Status for: realethio
  Items stored: 1763
  URLs pending: 0
  URLs completed: 1996
  URLs failed: 0

check

Validate a config file without running.

databrew check [-v] CONFIG

Examples

# Basic validation
databrew check mysite.toml

# Verbose output
databrew check mysite.toml -v

init

Generate a starter config file.

databrew init [OPTIONS] [NAME]

Options

Option Description
--url URL Start URL for the site
--type TYPE Extraction type: html or json
-o, --output PATH Output file path

Examples

# Generate HTML config
databrew init mysite --url "https://example.com" --type html

# Generate JSON API config
databrew init myapi --url "https://api.example.com/items" --type json

list

List available config files in a directory.

databrew list [DIR]

reset

Reset crawl state (URL queue and optionally items).

databrew reset [--all] CONFIG
Option Description
--all Delete entire database including items

import

Import data from JSON/JSONL files to repopulate state.

databrew import [OPTIONS] CONFIG INPUT

Exit Codes

Code Description
0 Success
1 Error (invalid config, crawl failure, etc.)