Exporting Data

Databrew supports multiple export formats for extracted data.

Export Formats

JSONL (JSON Lines)

One JSON object per line. Best for streaming and large datasets.

databrew export mysite.toml -o data.jsonl

Output:

{"title": "Product 1", "price": 99.99, "_source_url": "https://..."}
{"title": "Product 2", "price": 149.99, "_source_url": "https://..."}

JSON Array

Standard JSON array. Best for small datasets or when you need a single valid JSON document.

databrew export mysite.toml -o data.json

Output:

[
  {"title": "Product 1", "price": 99.99},
  {"title": "Product 2", "price": 149.99}
]

Parquet

Columnar format for analytics. Requires the analytics extra.

pip install databrew[analytics]
databrew export mysite.toml -o data.parquet

Benefits:

  • Compressed (smaller files)
  • Fast column-based queries (see the sketch below)
  • Works with Pandas, DuckDB, Spark, etc.
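
To illustrate the column-based benefit, a query that touches one column reads only that column from disk. A minimal sketch, assuming the price field from the earlier examples:

import duckdb

# Only the price column is read; the columnar layout lets DuckDB
# skip every other column entirely
duckdb.sql("SELECT AVG(price) FROM 'data.parquet'").show()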

Individual Files

One JSON file per item. Useful for item-based workflows.

databrew export mysite.toml -f individual -o items/

Creates:

items/
├── item_12345.json
├── item_12346.json
└── ...

Filenames use the configured ID field, or fall back to hash-based names.
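
To load a directory of per-item files back into one collection, a minimal Python sketch (assuming the items/ layout shown above):

import json
from pathlib import Path

# Collect every per-item JSON file into a single list
items = [json.loads(p.read_text()) for p in sorted(Path("items").glob("*.json"))]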

Format Detection

The export format is auto-detected from the output path:

Extension     Format
.jsonl        JSONL
.json         JSON array
.parquet      Parquet
(directory)   Individual files

Override with -f:

databrew export mysite.toml -o data.txt -f jsonl
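
The detection logic is equivalent to the following sketch (an illustration of the table above, not databrew's actual implementation):

from pathlib import Path

FORMATS = {".jsonl": "jsonl", ".json": "json", ".parquet": "parquet"}

def guess_format(output: str) -> str:
    if Path(output).is_dir():  # a directory means individual files
        return "individual"
    suffix = Path(output).suffix
    if suffix in FORMATS:
        return FORMATS[suffix]
    raise ValueError(f"unknown extension {suffix!r}; pass -f explicitly")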

Export Options

Exclude Metadata

By default, items include _source_url and _extracted_at fields. Exclude them with --no-meta:

databrew export mysite.toml -o data.jsonl --no-meta
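
If an export already contains the metadata fields, they can also be stripped after the fact. A minimal Python sketch:

import json

with open("data.jsonl") as src, open("clean.jsonl", "w") as dst:
    for line in src:
        item = json.loads(line)
        # Drop the metadata fields databrew adds by default
        item.pop("_source_url", None)
        item.pop("_extracted_at", None)
        dst.write(json.dumps(item) + "\n")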

Filter by URL Type

By default, only items from item URLs are exported. Use --url-type to include or isolate pagination items:

# All items (pagination + item URLs)
databrew export mysite.toml -o data.jsonl --url-type all

# Only pagination items
databrew export mysite.toml -o data.jsonl --url-type pagination

Incremental Export

Export only items extracted after a specific time:

# Since a date
databrew export mysite.toml -o new.jsonl --since "2026-01-20"

# Since a timestamp
databrew export mysite.toml -o new.jsonl --since "2026-01-20T14:30:00"

Combine with --url-type for filtered incremental exports:

databrew export mysite.toml -o new.jsonl --since "2026-01-20" --url-type all
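
The same cutoff can be applied after the fact to an export that kept its metadata. A minimal sketch, assuming _extracted_at holds ISO-8601 timestamps (check your own data before relying on this):

import json
from datetime import datetime

cutoff = datetime.fromisoformat("2026-01-20T14:30:00")

with open("data.jsonl") as f:
    # Keep only items extracted after the cutoff
    new_items = [
        item for item in map(json.loads, f)
        if datetime.fromisoformat(item["_extracted_at"]) > cutoff
    ]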

Fast Exports with DuckDB

When the analytics extra is installed, exports use DuckDB for significantly faster performance:

pip install databrew[analytics]

# Roughly 6-8x faster on large datasets (see the comparison below)
databrew export mysite.toml -o data.jsonl

DuckDB reads databrew's SQLite state database directly, bypassing row-by-row JSON handling in Python. The speedup is most noticeable with large datasets (100k+ items).
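
You can use the same trick in your own scripts via DuckDB's sqlite extension. A minimal sketch; the database path and table name here are assumptions, not databrew's documented schema:

import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite")
con.execute("LOAD sqlite")
# Hypothetical path and table name; databrew's actual layout may differ
con.execute("ATTACH 'data/mysite/state.db' AS state (TYPE sqlite)")
con.sql("SELECT COUNT(*) FROM state.items").show()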

Performance Comparison

Items    Python Export    DuckDB Export    Speedup
10k      3s               0.5s             6x
100k     30s              4s               7.5x
680k     90s              13s              7x

DuckDB is used automatically when available; the export falls back to the pure-Python path when it is not installed.

Default Output Location

If -o is not specified:

Format       Default Path
JSONL        data/<name>/<name>.jsonl
JSON         data/<name>/<name>.json
Parquet      data/<name>/<name>.parquet
Individual   data/<name>/items/

Working with Exported Data

Python (Pandas)

import pandas as pd

# JSONL
df = pd.read_json("data.jsonl", lines=True)

# Parquet
df = pd.read_parquet("data.parquet")

# JSON array
df = pd.read_json("data.json")

Python (DuckDB)

import duckdb

# Query JSONL directly
duckdb.sql("SELECT * FROM 'data.jsonl' WHERE price > 100")

# Query Parquet
duckdb.sql("SELECT category, AVG(price) FROM 'data.parquet' GROUP BY category")

Command Line (jq)

# Filter JSONL
cat data.jsonl | jq 'select(.price > 100)'

# Extract field
cat data.jsonl | jq -r '.title'

# Count items
wc -l data.jsonl

Command Line (DuckDB CLI)

duckdb -c "SELECT COUNT(*) FROM 'data.jsonl'"
duckdb -c "COPY (SELECT * FROM 'data.jsonl') TO 'data.parquet'"

Importing Data

Re-import exported data to repopulate state:

# From JSONL
databrew import mysite.toml backup.jsonl

# From JSON array
databrew import mysite.toml backup.json

# From Parquet
databrew import mysite.toml backup.parquet

# From directory of individual files
databrew import mysite.toml items/

Custom Source URL Field

If your data has a different URL field:

databrew import mysite.toml data.jsonl --source-url "meta.url"
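
Here "meta.url" reads as a dotted path into nested objects, e.g. items shaped like {"meta": {"url": "https://..."}}.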

URL Prefix

If the URL field holds only an identifier, add a prefix to construct full URLs:

databrew import mysite.toml data.jsonl --source-url "id" --source-url-prefix "https://example.com/item/"
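
With these flags, an item whose id field is "12345" would be imported with the source URL https://example.com/item/12345.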

Export Workflow Example

# Run daily crawl
databrew run mysite.toml

# Export new items only
databrew export mysite.toml -o daily/$(date +%Y%m%d).jsonl --since "yesterday"

# Weekly full export
databrew export mysite.toml -o weekly/$(date +%Y%m%d).parquet