Lifecycle Hooks

Lifecycle hooks let you run shell commands at key points during a crawl. This enables automated recovery from transient failures (expired cookies, blocked IPs, rate limits) without manual intervention or wrapper scripts.

Overview

| Hook | When | Use cases |
|---|---|---|
| `on_start` | Before the crawl begins | Preflight checks, login scripts, cache warming |
| `on_failure` | On consecutive-failure stop | Refresh cookies, rotate proxies, restart a VPN |
| `on_complete` | After the crawl finishes | Notifications, data export, cleanup |

Quick Start

```toml
[hooks]
on_failure = "python scripts/refresh_cookies.py {name}"
```

When the crawl hits `max_consecutive_failures`, databrew runs your script instead of stopping. If the script exits 0, the crawl resets its failure counter and resumes.

Configuration

TOML Config

```toml
[hooks]
on_start = "echo Starting {name}"
on_failure = "python scripts/recover.py {name}"
on_complete = "python scripts/notify.py {name} {items}"
max_hook_retries = 3       # Max times on_failure can fire (default: 3)
hook_timeout = 300.0       # Timeout per hook in seconds (default: 300)
```

CLI Overrides

CLI flags override config values:

```shell
# Override on_failure from config
databrew run mysite.toml --on-failure "python scripts/recover.py {name}"

# Add hooks to a config that has none
databrew run mysite.toml --on-start "echo start" --on-complete "echo done"

# Override just one hook
databrew run mysite.toml --on-failure "echo recovery attempt"
```

Template Variables

Hook commands support these template variables:

| Variable | Description |
|---|---|
| `{name}` | Site name from config |
| `{failures}` | Current consecutive failure count |
| `{items}` | Total items extracted so far |
| `{requests}` | Total requests processed so far |

```toml
[hooks]
on_failure = "echo '{name} failed {failures} times after {requests} requests'"
on_complete = "echo '{name}: {items} items in {requests} requests'"
```
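Substitution is plain placeholder replacement. A minimal sketch of the idea (`render_command` is illustrative, not databrew's public API):

```python
def render_command(template: str, **vars) -> str:
    # Replace each {var} placeholder with its value; placeholders
    # with no matching variable are left untouched.
    out = template
    for key, value in vars.items():
        out = out.replace("{" + key + "}", str(value))
    return out

cmd = render_command("echo '{name}: {items} items'", name="mysite", items=42)
# cmd == "echo 'mysite: 42 items'"
```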

Hook Behavior

on_start

Runs before the crawl begins. If the command exits non-zero, the crawl aborts.

```toml
[hooks]
on_start = "python scripts/preflight.py {name}"
```

Use cases:

  • Validate that credentials are fresh
  • Check that the target site is reachable
  • Warm up caches or sessions

on_failure

Runs when the crawl hits `max_consecutive_failures`. This is the primary recovery hook.

```toml
[hooks]
on_failure = "python scripts/refresh_cookies.py {name}"
max_hook_retries = 5
```

Recovery flow:

```
Crawl running → 10 consecutive failures → on_failure fires
                                              │
                                    Exit 0 (success)?
                                    ├── Yes: reset failure counter, reload config, resume crawl
                                    └── No:  stop crawl (same as no hook)
```

After recovery, databrew:

  1. Resets the consecutive failure counter to 0
  2. Reloads config and recreates the fetcher — so your recovery script can update headers, cookies, or proxy settings in the config file and they take effect immediately
  3. Resumes the crawl loop

The hook fires up to max_hook_retries times per crawl. After that, the crawl stops normally.
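The decision logic above can be sketched in a few lines (illustrative only, not databrew's actual source):

```python
def should_resume(consecutive_failures: int, max_failures: int,
                  hook_fires: int, max_hook_retries: int,
                  run_failure_hook) -> bool:
    """Decide whether the crawl keeps going after a failed batch."""
    if consecutive_failures < max_failures:
        return True                    # threshold not reached yet
    if hook_fires >= max_hook_retries:
        return False                   # hook budget exhausted; stop
    # A hook exit code of 0 maps to True here: reset and resume.
    return run_failure_hook()
```

Note that a failing hook and an exhausted retry budget both end the crawl the same way; only a successful hook run changes the outcome.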

on_complete

Runs after the crawl finishes, regardless of how it stopped (success, failure, or abort). `CrawlResult.stopped_reason` tells you what happened.

```toml
[hooks]
on_complete = "python scripts/notify.py {name} {items} {requests}"
```

Use cases:

  • Send notifications (Slack, email)
  • Trigger data pipelines
  • Upload results to cloud storage
  • Log crawl metrics
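Because the template values arrive as positional arguments, a notification script only needs to read `argv`. A minimal sketch (it prints a summary instead of posting anywhere):

```python
# scripts/notify.py -- hypothetical on_complete script; databrew
# passes the rendered template values as positional arguments.
import sys

def summarize(argv: list[str]) -> str:
    # Fall back to placeholders so cleanup never crashes on bad args.
    name = argv[0] if len(argv) > 0 else "?"
    items = argv[1] if len(argv) > 1 else "0"
    requests = argv[2] if len(argv) > 2 else "0"
    return f"{name}: {items} items in {requests} requests"

if __name__ == "__main__":
    print(summarize(sys.argv[1:]))
```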

Examples

Cookie Refresh

When a site requires session cookies that expire during long crawls:

```toml
[hooks]
on_failure = "python scripts/refresh_cookies.py {name}"
max_hook_retries = 5
hook_timeout = 60.0
```

```python
# scripts/refresh_cookies.py
import sys

name = sys.argv[1] if len(sys.argv) > 1 else ""

# Your cookie refresh logic here
# (selenium login, API auth, etc.)
new_cookie = get_fresh_cookie()  # placeholder -- implement for your site

# Update the config file with the new cookie
# (databrew reloads config after recovery)
config_path = f"configs/{name}.toml"
# ... update headers in config ...
```

Proxy Rotation

```toml
[hooks]
on_failure = "python scripts/rotate_proxy.py {name}"
max_hook_retries = 10
```

Slack Notification

```toml
[hooks]
on_complete = "curl -X POST -d '{\"text\": \"{name}: {items} items extracted\"}' $SLACK_WEBHOOK"
```

Simple Logging

```shell
databrew run mysite.toml \
    --on-start "echo 'Starting crawl for {name}'" \
    --on-failure "echo 'Recovery needed for {name} after {failures} failures'" \
    --on-complete "echo 'Done: {name} extracted {items} items'"
```

Dry Run

Use --dry-run to verify hooks are configured correctly:

```shell
databrew run mysite.toml --dry-run --on-failure "echo recover"
```

Output includes a Hooks section showing the resolved commands:

```
Hooks:
  on_failure: echo recover
  max_hook_retries: 3
  hook_timeout: 300.0s
```

Programmatic Usage

Hooks are implemented as async callbacks on the Orchestrator. You can use them directly in Python:

```python
import asyncio
from databrew import Orchestrator, load_config, create_components, HookContext, run_hook

config = load_config("mysite.toml")
components = create_components(config)

async def on_failure():
    ctx = HookContext(name=config.name, failures=10)
    return await run_hook("python scripts/recover.py {name}", ctx)

async def on_complete(result):
    print(f"Done: {result.stats.items_extracted} items")

orchestrator = Orchestrator(
    store=components.store,
    fetcher=components.fetcher,
    extractor=components.extractor,
    policy=components.policy,
    on_failure=on_failure,
    on_complete=on_complete,
    max_hook_retries=3,
)

result = asyncio.run(orchestrator.run())
```

The orchestrator accepts these callbacks:

| Parameter | Signature | Description |
|---|---|---|
| `on_start` | `() -> bool` | Return `False` to abort |
| `on_failure` | `() -> bool` | Return `True` to resume |
| `on_complete` | `(CrawlResult) -> None` | Called after the crawl ends |
| `on_recover` | `() -> None` | Called after a successful `on_failure` (e.g., recreate the fetcher) |
| `max_hook_retries` | `int` | Max `on_failure` invocations per crawl |

Best Practices

1. Keep hooks fast

Hooks run between batches, blocking the crawl. Use `hook_timeout` to prevent hangs:

```toml
[hooks]
hook_timeout = 60.0  # Kill slow hooks after 60s
```

2. Make recovery scripts idempotent

The on_failure hook may fire multiple times. Ensure your recovery script handles being run repeatedly.

3. Test with dry-run first

```shell
databrew run mysite.toml --dry-run --on-failure "python scripts/recover.py {name}"
```

4. Use config reload for recovery

Since databrew reloads config after on_failure succeeds, your recovery script can modify the TOML file (e.g., update cookies in [fetch.headers]) and the changes take effect immediately.
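A recovery script can apply that pattern with a simple in-place rewrite. A sketch that assumes a single `Cookie = "..."` line in the config (a real script might prefer a proper TOML writer):

```python
import pathlib
import re

def set_cookie(config_path: str, new_cookie: str) -> None:
    # Rewrite the Cookie value in place; databrew picks up the
    # change when it reloads the config after a successful hook.
    p = pathlib.Path(config_path)
    text = p.read_text()
    text = re.sub(r'(?m)^(Cookie\s*=\s*)".*"$',
                  lambda m: m.group(1) + f'"{new_cookie}"', text)
    p.write_text(text)
```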

5. Start with conservative retries

```toml
[hooks]
max_hook_retries = 3  # Don't retry forever
```

If your recovery script fails consistently, the crawl should stop so you can investigate.