# Exporting Data
Databrew supports multiple export formats for extracted data.
## Export Formats

### JSONL (JSON Lines)
One JSON object per line. Best for streaming and large datasets.
Output:
{"title": "Product 1", "price": 99.99, "_source_url": "https://..."}
{"title": "Product 2", "price": 149.99, "_source_url": "https://..."}
### JSON Array
Standard JSON array. Best for small datasets or when you need valid JSON.
Output (the same items as above, wrapped in a JSON array):
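```json
[
  {"title": "Product 1", "price": 99.99, "_source_url": "https://..."},
  {"title": "Product 2", "price": 149.99, "_source_url": "https://..."}
]
```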
### Parquet
Columnar format for analytics. Requires the analytics extra.
Benefits:
- Compressed (smaller files)
- Fast column-based queries
- Works with Pandas, DuckDB, Spark, etc.
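
For example, installing the extra and exporting to a `.parquet` path (the command follows the same pattern as the other export examples on this page):

```bash
pip install databrew[analytics]
databrew export mysite.toml -o data.parquet
```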
### Individual Files
One JSON file per item. Useful for item-based workflows.
For example, with the default output directory and ID-based naming (illustrative filenames):
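```
data/mysite/items/
├── product-1.json
├── product-2.json
└── ...
```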
Filenames use the configured ID field, or fall back to hash-based names.
## Format Detection
Format is auto-detected from the output file extension:
| Extension | Format |
|---|---|
| `.jsonl` | JSONL |
| `.json` | JSON array |
| `.parquet` | Parquet |
| (directory) | Individual files |
Override with `-f`:
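```bash
# Force a format regardless of the extension
# (format names here are assumed to mirror the extensions)
databrew export mysite.toml -o output.dat -f jsonl
```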
## Export Options

### Exclude Metadata
By default, items include `_source_url` and `_extracted_at` fields. Exclude them with `--no-meta`:
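```bash
# Drop the _source_url and _extracted_at fields from exported items
databrew export mysite.toml -o data.jsonl --no-meta
```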
### Filter by URL Type
By default, only items from item URLs are exported. Include pagination items:
```bash
# All items (pagination + item URLs)
databrew export mysite.toml -o data.jsonl --url-type all

# Only pagination items
databrew export mysite.toml -o data.jsonl --url-type pagination
```
### Incremental Export
Export only items extracted after a specific time:
```bash
# Since a date
databrew export mysite.toml -o new.jsonl --since "2026-01-20"

# Since a timestamp
databrew export mysite.toml -o new.jsonl --since "2026-01-20T14:30:00"
```
Combine with `--url-type` for filtered incremental exports:
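```bash
# New items from all URL types since the given date
databrew export mysite.toml -o new.jsonl --since "2026-01-20" --url-type all
```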
## Fast Exports with DuckDB
When the analytics extra is installed, exports use DuckDB for significantly faster performance:
```bash
pip install databrew[analytics]

# 7-10x faster for large datasets
databrew export mysite.toml -o data.jsonl
```
DuckDB reads SQLite directly, bypassing Python's JSON parsing. The speedup is most noticeable with large datasets (100k+ items).
### Performance Comparison
| Items | Python Export | DuckDB Export | Speedup |
|---|---|---|---|
| 10k | 3s | 0.5s | 6x |
| 100k | 30s | 4s | 7.5x |
| 680k | 90s | 13s | 7x |
DuckDB is used automatically when available; if it is not installed, the export falls back to the Python implementation.
## Default Output Location

If `-o` is not specified:
| Format | Default Path |
|---|---|
| JSONL | data/<name>/<name>.jsonl |
| JSON | data/<name>/<name>.json |
| Parquet | data/<name>/<name>.parquet |
| Individual | data/<name>/items/ |
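
For example, assuming the config file `mysite.toml` corresponds to a dataset named `mysite`, a JSONL export without `-o` would land at `data/mysite/mysite.jsonl`:

```bash
# Writes to data/mysite/mysite.jsonl (assuming JSONL output and a dataset named "mysite")
databrew export mysite.toml
```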
## Working with Exported Data

### Python (Pandas)
```python
import pandas as pd

# JSONL
df = pd.read_json("data.jsonl", lines=True)

# Parquet
df = pd.read_parquet("data.parquet")

# JSON array
df = pd.read_json("data.json")
```
### Python (DuckDB)
```python
import duckdb

# Query JSONL directly
duckdb.sql("SELECT * FROM 'data.jsonl' WHERE price > 100").show()

# Query Parquet
duckdb.sql("SELECT category, AVG(price) FROM 'data.parquet' GROUP BY category").show()
```
### Command Line (jq)
```bash
# Filter JSONL
cat data.jsonl | jq 'select(.price > 100)'

# Extract field
cat data.jsonl | jq -r '.title'

# Count items
wc -l data.jsonl
```
### Command Line (DuckDB CLI)
```bash
duckdb -c "SELECT COUNT(*) FROM 'data.jsonl'"
duckdb -c "COPY (SELECT * FROM 'data.jsonl') TO 'data.parquet'"
```
## Importing Data
Re-import exported data to repopulate state:
```bash
# From JSONL
databrew import mysite.toml backup.jsonl

# From JSON array
databrew import mysite.toml backup.json

# From Parquet
databrew import mysite.toml backup.parquet

# From directory of individual files
databrew import mysite.toml items/
```
### Custom Source URL Field
If your data has a different URL field:
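```bash
# "url" is an illustrative field name; point --source-url at whichever field holds each item's URL
databrew import mysite.toml data.jsonl --source-url "url"
```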
### URL Prefix
If the field holds only an ID, add a prefix to construct full URLs:
```bash
databrew import mysite.toml data.jsonl --source-url "id" --source-url-prefix "https://example.com/item/"
```