Exporting Data

Databrew supports multiple export formats for extracted data.

Export Formats

JSONL (JSON Lines)

One JSON object per line. Best for streaming and large datasets.

databrew export mysite.toml -o data.jsonl

Output:

{"title": "Product 1", "price": 99.99, "_source_url": "https://..."}
{"title": "Product 2", "price": 149.99, "_source_url": "https://..."}

JSON Array

Standard JSON array. Best for small datasets or when you need a single valid JSON document.

databrew export mysite.toml -o data.json

Output:

[
  {"title": "Product 1", "price": 99.99},
  {"title": "Product 2", "price": 149.99}
]

Parquet

Columnar format for analytics. Requires the analytics extra.

pip install databrew[analytics]
databrew export mysite.toml -o data.parquet

Benefits:

  • Compressed (smaller files)
  • Fast column-based queries (see the sketch below)
  • Works with Pandas, DuckDB, Spark, etc.
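
To illustrate the column-based benefit, a query that touches one column reads only that column from disk. A minimal sketch, assuming the price field from the earlier examples:

import duckdb

# Only the price column is read; the columnar layout lets DuckDB
# skip every other column entirely
duckdb.sql("SELECT AVG(price) FROM 'data.parquet'").show()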

Individual Files

One JSON file per item. Useful for item-based workflows.

databrew export mysite.toml -f individual -o items/

Creates:

items/
├── item_12345.json
├── item_12346.json
└── ...

Filenames use the configured ID field, or fall back to hash-based names.
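
To load a directory of per-item files back into one collection, a minimal Python sketch (assuming the items/ layout shown above):

import json
from pathlib import Path

# Collect every per-item JSON file into a single list
items = [json.loads(p.read_text()) for p in sorted(Path("items").glob("*.json"))]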

Format Detection

The export format is auto-detected from the output path:

Extension     Format
.jsonl        JSONL
.json         JSON array
.parquet      Parquet
(directory)   Individual files

Override with -f:

databrew export mysite.toml -o data.txt -f jsonl
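
The detection logic is equivalent to the following sketch (an illustration of the table above, not databrew's actual implementation):

from pathlib import Path

FORMATS = {".jsonl": "jsonl", ".json": "json", ".parquet": "parquet"}

def guess_format(output: str) -> str:
    if Path(output).is_dir():  # a directory means individual files
        return "individual"
    suffix = Path(output).suffix
    if suffix in FORMATS:
        return FORMATS[suffix]
    raise ValueError(f"unknown extension {suffix!r}; pass -f explicitly")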

Export Options

Exclude Metadata

By default, items include _source_url and _extracted_at fields. Exclude them with --no-meta:

databrew export mysite.toml -o data.jsonl --no-meta
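
If an export already contains the metadata fields, they can also be stripped after the fact. A minimal Python sketch:

import json

with open("data.jsonl") as src, open("clean.jsonl", "w") as dst:
    for line in src:
        item = json.loads(line)
        # Drop the metadata fields databrew adds by default
        item.pop("_source_url", None)
        item.pop("_extracted_at", None)
        dst.write(json.dumps(item) + "\n")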

Filter by URL Type

By default, only items from item URLs are exported. Use --url-type to include or isolate pagination items:

# All items (pagination + item URLs)
databrew export mysite.toml -o data.jsonl --url-type all

# Only pagination items
databrew export mysite.toml -o data.jsonl --url-type pagination

Incremental Export

Export only items extracted after a specific time:

# Since a date
databrew export mysite.toml -o new.jsonl --since "2026-01-20"

# Since a timestamp
databrew export mysite.toml -o new.jsonl --since "2026-01-20T14:30:00"

Combine with --url-type for filtered incremental exports:

databrew export mysite.toml -o new.jsonl --since "2026-01-20" --url-type all
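
The same cutoff can be applied after the fact to an export that kept its metadata. A minimal sketch, assuming _extracted_at holds ISO-8601 timestamps (check your own data before relying on this):

import json
from datetime import datetime

cutoff = datetime.fromisoformat("2026-01-20T14:30:00")

with open("data.jsonl") as f:
    # Keep only items extracted after the cutoff
    new_items = [
        item for item in map(json.loads, f)
        if datetime.fromisoformat(item["_extracted_at"]) > cutoff
    ]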

Fast Exports with DuckDB

When the analytics extra is installed, exports use DuckDB for significantly faster performance:

pip install databrew[analytics]

# Roughly 6-8x faster on large datasets (see the comparison below)
databrew export mysite.toml -o data.jsonl

DuckDB reads databrew's SQLite state database directly, bypassing row-by-row JSON handling in Python. The speedup is most noticeable with large datasets (100k+ items).
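
You can use the same trick in your own scripts via DuckDB's sqlite extension. A minimal sketch; the database path and table name here are assumptions, not databrew's documented schema:

import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite")
con.execute("LOAD sqlite")
# Hypothetical path and table name; databrew's actual layout may differ
con.execute("ATTACH 'data/mysite/state.db' AS state (TYPE sqlite)")
con.sql("SELECT COUNT(*) FROM state.items").show()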

Performance Comparison

Items    Python Export    DuckDB Export    Speedup
10k      3s               0.5s             6x
100k     30s              4s               7.5x
680k     90s              13s              7x

DuckDB is used automatically when available; the export falls back to the pure-Python path when it is not installed.

Default Output Location

If -o is not specified:

Format       Default Path
JSONL        data/<name>/<name>.jsonl
JSON         data/<name>/<name>.json
Parquet      data/<name>/<name>.parquet
Individual   data/<name>/items/

Working with Exported Data

Python (Pandas)

import pandas as pd

# JSONL
df = pd.read_json("data.jsonl", lines=True)

# Parquet
df = pd.read_parquet("data.parquet")

# JSON array
df = pd.read_json("data.json")

Python (DuckDB)

import duckdb

# Query JSONL directly
duckdb.sql("SELECT * FROM 'data.jsonl' WHERE price > 100")

# Query Parquet
duckdb.sql("SELECT category, AVG(price) FROM 'data.parquet' GROUP BY category")

Command Line (jq)

# Filter JSONL
cat data.jsonl | jq 'select(.price > 100)'

# Extract field
cat data.jsonl | jq -r '.title'

# Count items
wc -l data.jsonl

Command Line (DuckDB CLI)

duckdb -c "SELECT COUNT(*) FROM 'data.jsonl'"
duckdb -c "COPY (SELECT * FROM 'data.jsonl') TO 'data.parquet'"

Importing Data

Re-import exported data to repopulate state:

# From JSONL
databrew import mysite.toml backup.jsonl

# From JSON array
databrew import mysite.toml backup.json

# From Parquet
databrew import mysite.toml backup.parquet

# From directory of individual files
databrew import mysite.toml items/

Custom Source URL Field

If your data has a different URL field:

databrew import mysite.toml data.jsonl --source-url "meta.url"
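
Here "meta.url" reads as a dotted path into nested objects, e.g. items shaped like {"meta": {"url": "https://..."}}.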

URL Prefix

If the URL field holds only an identifier, add a prefix to construct full URLs:

databrew import mysite.toml data.jsonl --source-url "id" --source-url-prefix "https://example.com/item/"
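
With these flags, an item whose id field is "12345" would be imported with the source URL https://example.com/item/12345.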

Export Workflow Example

# Run daily crawl
databrew run mysite.toml

# Export new items only
databrew export mysite.toml -o daily/$(date +%Y%m%d).jsonl --since "yesterday"

# Weekly full export
databrew export mysite.toml -o weekly/$(date +%Y%m%d).parquet