HTML Extraction

This guide covers extracting item fields and links from HTML pages using CSS selectors.

Basic Field Extraction

Simple Selector

The simplest field definition is a bare CSS selector string:

[extract.items.fields]
title = "h1.title"
description = ".description p"

This extracts the text content of the first matching element.

Full Field Config

For more control, use an inline table with a selector key plus any of the options listed under Field Options:

[extract.items.fields]
title = { selector = "h1.title" }
price = { selector = ".price", parser = "parse_price" }
image = { selector = "img.main", attribute = "src" }

Field Options

Option	Type	Default	Description
`selector`	string	required	CSS selector to find the element
`attribute`	string	`null`	Attribute to extract (null = text content)
`parser`	string	`null`	Parser function to transform the value
`required`	bool	`false`	Skip the item (with a warning) if the field is missing
`multiple`	bool	`false`	Extract all matches as a list
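
To make the attribute and multiple options concrete, here is a minimal Python sketch of how one field definition could be evaluated. It is not the tool's implementation: `extract_field` is a hypothetical helper, and because the standard library has no CSS selector engine it uses ElementTree XPath expressions on a well-formed snippet as a stand-in for CSS selectors.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup (assumed for illustration only).
HTML = """
<div>
  <h1 class="title">Blue Kettle</h1>
  <img class="main" src="/img/kettle.jpg"/>
  <ul class="tags"><li>kitchen</li><li>steel</li></ul>
</div>
"""


def extract_field(root, xpath, attribute=None, multiple=False):
    """Hypothetical stand-in for one field config (XPath instead of CSS)."""
    values = [
        # attribute=None means "use the element's text content".
        el.get(attribute) if attribute else "".join(el.itertext()).strip()
        for el in root.findall(xpath)
    ]
    if multiple:
        return values  # every match, as a list
    return values[0] if values else None  # first match only, or None


root = ET.fromstring(HTML.strip())
title = extract_field(root, ".//h1[@class='title']")
image = extract_field(root, ".//img[@class='main']", attribute="src")
tags = extract_field(root, ".//ul[@class='tags']/li", multiple=True)
missing = extract_field(root, ".//span[@class='price']")
```

In this sketch a selector that matches nothing yields None, which is the case the required option is meant to catch.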

Extracting Attributes

Use the attribute option to extract element attributes instead of text:

[extract.items.fields]
# Link URL
url = { selector = "a.product-link", attribute = "href" }

# Image source
image = { selector = "img.thumbnail", attribute = "src" }

# Data attributes
product_id = { selector = ".product", attribute = "data-id" }
coordinates = { selector = "#map", attribute = "data-coords" }

Multiple Values

Set multiple = true to extract all matching elements as a list:

[extract.items.fields]
# All image URLs
images = { selector = ".gallery img", attribute = "src", multiple = true }

# All tags
tags = { selector = ".tag-list a", multiple = true }

# All feature items
features = { selector = ".features li", multiple = true }

Required Fields

Mark fields as required to skip items where the field is missing:

[extract.items.fields]
title = { selector = "h1", required = true }
price = { selector = ".price", required = true }
description = ".description"  # Optional

If a required field is missing, the entire item is skipped with a warning.
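
The skip-with-warning behaviour described above can be pictured with a short Python sketch. The helper name `filter_required`, the logger name, and the rule that None, an empty string, or an empty list counts as "missing" are assumptions made for illustration, not the tool's actual code.

```python
import logging

logger = logging.getLogger("extract_sketch")  # hypothetical logger name


def filter_required(items, required):
    """Keep only items in which every required field has a value."""
    kept = []
    for item in items:
        # Assumption: None, "" and [] all count as "missing".
        missing = [name for name in required if item.get(name) in (None, "", [])]
        if missing:
            logger.warning("Skipping item, missing required fields: %s",
                           ", ".join(missing))
            continue
        kept.append(item)
    return kept


items = [
    {"title": "Kettle", "price": "$20", "description": None},
    {"title": "Toaster", "price": None, "description": "Two slots"},
]
kept = filter_required(items, required=["title", "price"])
```

The first item survives even though its optional description is empty; the second is dropped because price is required.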

Key-Value Pair Extraction

For structured data in key-value format (like property details), use keys and values:

[extract.items.fields]
# Extract from <dt>/<dd> pairs
details = { keys = "dt", values = "dd" }

# Extract from custom structure
# <li><strong>Bedrooms:</strong> <span>3</span></li>
specs = { selector = ".specs li", keys = "strong", values = "span" }

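With Container Selector

When each key-value pair sits inside its own container element, point selector at the container; keys and values are then matched inside each container:

# Each .detail-item element contains one key-value pair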
details = { selector = ".detail-item", keys = ".label", values = ".value" }

With Units

For values with separate unit elements:

# <div class="spec">
#   <span class="key">Area</span>
#   <span class="value">1500</span>
#   <span class="unit">sqft</span>
# </div>
specs = { selector = ".spec", keys = ".key", values = ".value", units = ".unit" }
# Result: {"Area": "1500 sqft"}
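
As a rough illustration of how keys, values, and units might be combined into a single dict, the Python sketch below pairs up already-extracted text lists. `pair_key_values` is a hypothetical helper, and stripping a trailing colon from keys is an assumption, not documented behaviour.

```python
def pair_key_values(keys, values, units=None):
    """Zip key, value and optional unit texts into one dict (sketch)."""
    units = units or [""] * len(keys)
    result = {}
    for key, value, unit in zip(keys, values, units):
        # Assumption: labels such as "Bedrooms:" lose their trailing colon.
        clean_key = key.strip().rstrip(":").strip()
        # Append the unit to the value, separated by a single space.
        result[clean_key] = f"{value.strip()} {unit.strip()}".strip()
    return result


with_units = pair_key_values(["Area"], ["1500"], ["sqft"])
without_units = pair_key_values(["Bedrooms:", "Bathrooms:"], ["3", "2"])
```

Here `with_units` reproduces the documented result for the Area example, and `without_units` shows the same pairing when no units selector is given.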

Using Parsers

Parsers transform extracted values:

[extract.items.fields]
# Parse price with currency
price = { selector = ".price", parser = "parse_price" }
# "ETB 1,500,000" → {"amount": 1500000.0, "currency": "ETB", "raw": "ETB 1,500,000"}

# Parse integer
bedrooms = { selector = ".beds", parser = "parse_int" }
# "3 Beds" → 3

# Parse float
rating = { selector = ".rating", parser = "parse_float" }
# "4.5 stars" → 4.5

# Collapse whitespace
description = { selector = ".desc", parser = "squish" }
# "  Multiple   spaces  " → "Multiple spaces"

# Parse coordinates
location = { selector = "#map", attribute = "data-coords", parser = "parse_coordinates" }
# '{"lat": 9.03, "lng": 38.74}' → "9.03,38.74"

See Built-in Parsers for all available parsers.
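
For readers who want a feel for what these transformations involve, the following Python functions reproduce the example inputs and outputs shown above. They are illustrative re-implementations written for this guide, not the built-in parsers themselves; the regular expressions and the three-letter currency-code rule are assumptions.

```python
import re


def squish(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def parse_int(text):
    """Return the first integer in the text, ignoring thousands separators."""
    match = re.search(r"-?\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


def parse_float(text):
    """Return the first decimal number in the text as a float."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def parse_price(text):
    """Split a price string into amount, currency code and the raw text."""
    raw = text.strip()
    currency = re.search(r"[A-Z]{3}", raw)  # assumption: ISO-style code
    return {
        "amount": parse_float(raw),
        "currency": currency.group() if currency else None,
        "raw": raw,
    }


price = parse_price("ETB 1,500,000")
bedrooms = parse_int("3 Beds")
rating = parse_float("4.5 stars")
description = squish("  Multiple   spaces  ")
```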

JSON-LD Extraction

Extract data from JSON-LD scripts using the ldjson: parser prefix:

[extract.items.fields]
# Extract specific field from JSON-LD
date_published = { selector = "script[type='application/ld+json']", parser = "ldjson:datePublished" }
date_modified = { selector = "script[type='application/ld+json']", parser = "ldjson:dateModified" }

# Extract entire JSON-LD as dict
structured_data = { selector = "script[type='application/ld+json']", parser = "parse_ldjson" }

The ldjson: prefix automatically handles @graph structures.
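
To show what handling @graph means in practice, here is a hedged Python sketch of a key lookup over a JSON-LD script body. `ldjson_lookup` is a hypothetical function; the real parser may search in a different order or support nested paths.

```python
import json

# Sample script body with nodes wrapped in "@graph" (illustrative data).
SCRIPT = """
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "WebSite", "name": "Example"},
    {"@type": "Article", "datePublished": "2024-01-15"}
  ]
}
"""


def ldjson_lookup(script_text, key):
    """Return the first value for key among the top-level or @graph nodes."""
    data = json.loads(script_text)
    nodes = data if isinstance(data, list) else [data]
    candidates = []
    for node in nodes:
        if not isinstance(node, dict):
            continue
        candidates.append(node)
        # Flatten "@graph" so its member nodes are searched too.
        graph = node.get("@graph")
        if isinstance(graph, list):
            candidates.extend(n for n in graph if isinstance(n, dict))
    for node in candidates:
        if key in node:
            return node[key]
    return None


published = ldjson_lookup(SCRIPT, "datePublished")
modified = ldjson_lookup(SCRIPT, "dateModified")
```

In this sketch a key that appears nowhere, such as dateModified in the sample, simply yields None.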

Item Containers

Multiple Items Per Page

When a page has multiple items (e.g., a listing page), set selector to the element that wraps each item:

[extract.items]
selector = ".product-card"  # Each card is one item

[extract.items.fields]
title = "h2"  # Relative to the container
price = ".price"
url = { selector = "a", attribute = "href" }

Fields are extracted relative to each container.
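
The idea of container-relative extraction can be sketched in Python as a loop over containers with field lookups scoped to each one. This uses ElementTree XPath on well-formed sample markup as a stand-in for CSS selectors; the markup and variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

PAGE = """
<div>
  <div class="product-card">
    <h2>Kettle</h2><span class="price">$20</span><a href="/p/1">View</a>
  </div>
  <div class="product-card">
    <h2>Toaster</h2><span class="price">$35</span><a href="/p/2">View</a>
  </div>
</div>
"""

root = ET.fromstring(PAGE.strip())
items = []
# One item per container; every lookup below starts at the card element,
# so "h2" only ever sees the heading inside that card.
for card in root.findall(".//div[@class='product-card']"):
    items.append({
        "title": card.findtext("h2"),
        "price": card.findtext("span[@class='price']"),
        "url": card.find("a").get("href"),
    })
```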

Whole Page as Item

For detail pages where the whole page is one item:

[extract.items]
selector = ""  # Empty string = whole page

[extract.items.fields]
title = "h1.page-title"
description = "#content .description"
# Selectors work on the entire document

Derived Fields

Derived fields extract values from already-extracted nested data (like key-value pairs):

[extract.items.fields]
# Extract key-value pairs as a dict
details = { selector = ".details li", keys = "strong", values = "span" }
# Result: {"Property ID": "12345", "Bedrooms": "3", "Bathrooms": "2"}

[extract.derived]
# Pull specific values out to top-level fields
property_id = "details.Property ID"
bedrooms = { path = "details.Bedrooms", parser = "parse_int" }
bathrooms = { path = "details.Bathrooms", parser = "parse_int" }

Derived Field Options

Option	Type	Default	Description
`path`	string	required	Dot-notation path to the value
`parser`	string	`null`	Parser to transform the value
`remove_source`	bool	`true`	Remove the key from source dict

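Shorthand Syntax

A bare string is shorthand for a table that contains only path. The two forms below are equivalent; the second is commented out because TOML does not allow the same key to be defined twice in one table:

[extract.derived]
# Shorthand (just the path)
property_id = "details.Property ID"

# Equivalent full form
# property_id = { path = "details.Property ID" }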

Keeping Source Fields

By default, derived keys are removed from the source dict. To keep them:

[extract.derived]
property_id = { path = "details.Property ID", remove_source = false }
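
The interaction between path, parser, and remove_source can be illustrated with a small Python sketch. `derive` is a hypothetical helper, and it assumes the path has exactly one level below the field name (split at the first dot), which is enough for keys such as "Property ID" that contain spaces.

```python
def derive(item, path, parser=None, remove_source=True):
    """Pull one value out of a nested dict field (illustrative sketch)."""
    # Assumption: "details.Property ID" -> field "details", key "Property ID".
    field, _, key = path.partition(".")
    source = item.get(field)
    if not isinstance(source, dict) or key not in source:
        return None
    # pop() removes the key from the source dict; indexing leaves it there.
    value = source.pop(key) if remove_source else source[key]
    return parser(value) if parser else value


item = {"details": {"Property ID": "12345", "Bedrooms": "3"}}
item["property_id"] = derive(item, "details.Property ID")
item["bedrooms"] = derive(item, "details.Bedrooms", parser=int,
                          remove_source=False)
```

After these two calls, "Property ID" is gone from details (the default), while "Bedrooms" is still present because remove_source was set to false.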

Link Extraction

Pagination Links

List the selectors that match links to further listing pages:

[extract.links]
pagination = [
    "a.next-page",
    ".pagination a[rel='next']",
    ".load-more-btn",
]

Item Links

List the selectors that match links to item detail pages:

[extract.links]
items = [
    ".product-card a.detail-link",
    ".listing h2 a",
]

Link Attribute

By default, link URLs are read from the href attribute. To read them from a different attribute:

[extract.links]
attribute = "data-url"  # Use data-url attribute instead
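
Extracted link values are often relative, so they need resolving before they can be fetched. The Python sketch below shows one plausible way to do that with `urllib.parse.urljoin`. Whether the tool resolves links against base_url in exactly this way, or de-duplicates them, is an assumption; `resolve_links` is a hypothetical helper.

```python
from urllib.parse import urljoin


def resolve_links(raw_values, base_url):
    """Turn raw attribute values into absolute URLs, preserving order."""
    resolved = []
    for value in raw_values:
        if not value:  # element lacked the attribute, or it was empty
            continue
        url = urljoin(base_url, value.strip())
        if url not in resolved:  # assumption: duplicates are dropped
            resolved.append(url)
    return resolved


links = resolve_links(
    ["/listings?page=2", "item/5", None, "https://example.com/item/5"],
    "https://example.com",
)
```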

Complete Example

name = "realestate"
start_urls = ["https://example.com/listings"]

[extract]
type = "html"
base_url = "https://example.com"

[extract.items]
selector = ""  # Detail pages
id = "property_id"

[extract.items.fields]
# Basic fields
title = ".listing-title"
price = { selector = ".price", parser = "parse_price", required = true }
address = ".address"
description = { selector = ".description", parser = "squish" }

# Numeric fields with parsers
bedrooms = { selector = ".beds span", parser = "parse_int" }
bathrooms = { selector = ".baths span", parser = "parse_int" }
sqft = { selector = ".sqft span", parser = "parse_int" }

# Attributes
url = { selector = "link[rel='canonical']", attribute = "href" }
images = { selector = ".gallery img", attribute = "src", multiple = true }

# Key-value extraction
details = { selector = ".property-details li", keys = "strong", values = "span" }
features = { selector = ".features li", multiple = true }

# JSON-LD
date_listed = { selector = "script[type='application/ld+json']", parser = "ldjson:datePosted" }

[extract.links]
pagination = [".pagination a.next"]
items = [".listing-card a.view-details"]

[extract.derived]
property_id = "details.Property ID"
year_built = { path = "details.Year Built", parser = "parse_int" }
lot_size = "details.Lot Size"

[storage]
path = "data/realestate"