Built-in Parsers

Parsers transform extracted values. They can be used with any field.

Text Parsers

strip

Strip leading and trailing whitespace.

>>> strip("  Hello World  ")
'Hello World'
>>> strip("\n\tText\n")
'Text'

Usage in config:

title = { selector = ".title", parser = "strip" }

squish

Collapse runs of whitespace (spaces, tabs, newlines) into single spaces and trim the ends.

>>> squish("Multiple   spaces")
'Multiple spaces'
>>> squish("Line\n\nbreaks")
'Line breaks'
>>> squish("  Mixed \t whitespace  ")
'Mixed whitespace'

Usage in config:

description = { selector = ".desc", parser = "squish" }

Numeric Parsers

parse_int

Extract an integer from text, stripping non-numeric characters. Returns None when the text contains no digits.

>>> parse_int("3 Bedrooms")
3
>>> parse_int("$1,500")
1500
>>> parse_int("Rating: 5")
5
>>> parse_int("")
None

Usage in config:

bedrooms = { selector = ".beds", parser = "parse_int" }
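
The parse_int examples above can be approximated with a few lines of Python. This is a minimal sketch, not the library's implementation: the name sketch_parse_int is hypothetical, and it assumes that "stripping non-numeric characters" means discarding everything except the digits 0-9.

```python
import re
from typing import Optional


def sketch_parse_int(text: str) -> Optional[int]:
    """Hypothetical stand-in for parse_int: keep only the digits, then convert."""
    digits = re.sub(r"[^0-9]", "", text)
    # No digits at all (for example an empty string) yields None.
    return int(digits) if digits else None


print(sketch_parse_int("3 Bedrooms"))
print(sketch_parse_int("$1,500"))
print(sketch_parse_int(""))
```

Under this assumption a minus sign is dropped and "3.5" becomes 35, so parse_float is the safer choice for text that may contain decimals.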

parse_float

Extract a float from text, handling currency symbols, thousands separators, and trailing units. Returns None when the text contains no number.

>>> parse_float("4.5 stars")
4.5
>>> parse_float("$1,234.56")
1234.56
>>> parse_float("99%")
99.0
>>> parse_float("")
None

Usage in config:

rating = { selector = ".rating", parser = "parse_float" }

parse_price

Parse a price into a structured object with the amount, the currency code, and the raw input text.

>>> parse_price("ETB 1,500,000")
{'amount': 1500000.0, 'currency': 'ETB', 'raw': 'ETB 1,500,000'}
>>> parse_price("$99.99")
{'amount': 99.99, 'currency': 'USD', 'raw': '$99.99'}
>>> parse_price("€199")
{'amount': 199.0, 'currency': 'EUR', 'raw': '€199'}

Recognized currencies: ETB (Ethiopian Birr), USD, EUR.

Usage in config:

price = { selector = ".price", parser = "parse_price" }
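
To make the shape of the parse_price result concrete, here is a minimal sketch that reproduces the three examples above. It is an illustration under stated assumptions, not the library's code: the function name sketch_parse_price and the CURRENCY_MARKERS table are hypothetical, and returning None when no number is found is a guess at the real behavior.

```python
import re
from typing import Optional

# Assumed marker table covering only the currencies listed in this section.
CURRENCY_MARKERS = {"ETB": "ETB", "$": "USD", "USD": "USD", "€": "EUR", "EUR": "EUR"}


def sketch_parse_price(text: str) -> Optional[dict]:
    """Hypothetical stand-in for parse_price."""
    # Pick the first known marker that appears anywhere in the text.
    currency = None
    for marker, code in CURRENCY_MARKERS.items():
        if marker in text:
            currency = code
            break
    # Drop thousands separators, then take the first number.
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        return None  # assumption: no number means no price
    return {"amount": float(match.group()), "currency": currency, "raw": text}


print(sketch_parse_price("ETB 1,500,000"))
print(sketch_parse_price("$99.99"))
```

The raw key keeps the original text so that downstream code can re-parse it if the detected currency or amount turns out to be wrong.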

Structured Data Parsers

parse_json

Parse a JSON string into a Python object. Invalid JSON returns None.

>>> parse_json('{"key": "value"}')
{'key': 'value'}
>>> parse_json('[1, 2, 3]')
[1, 2, 3]
>>> parse_json("invalid")
None

Usage in config:

metadata = { selector = ".data", parser = "parse_json" }

parse_ldjson

Extract the first object in @graph from JSON-LD text. If there is no @graph, the top-level object is returned.

>>> # With @graph structure
>>> ld_json = '{"@graph": [{"@type": "Product", "name": "Item"}]}'
>>> parse_ldjson(ld_json)
{'@type': 'Product', 'name': 'Item'}

>>> # Without @graph
>>> simple = '{"@type": "Product", "name": "Simple Item"}'
>>> parse_ldjson(simple)
{'@type': 'Product', 'name': 'Simple Item'}

Usage in config:

structured = { selector = "script[type='application/ld+json']", parser = "parse_ldjson" }
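
The @graph handling shown in the parse_ldjson examples can be sketched as follows. The function name sketch_parse_ldjson is hypothetical, and returning None for invalid JSON or an empty @graph is an assumption modeled on parse_json, not documented behavior.

```python
import json
from typing import Any, Optional


def sketch_parse_ldjson(text: str) -> Optional[Any]:
    """Hypothetical stand-in for parse_ldjson."""
    try:
        data = json.loads(text)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return None
    # With an @graph list, return its first entry; otherwise return the object itself.
    if isinstance(data, dict) and isinstance(data.get("@graph"), list):
        graph = data["@graph"]
        return graph[0] if graph else None
    return data


print(sketch_parse_ldjson('{"@graph": [{"@type": "Product", "name": "Item"}]}'))
print(sketch_parse_ldjson('{"@type": "Product", "name": "Simple Item"}'))
```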

parse_coordinates

Extract latitude and longitude from JSON or free text and return them as a "lat,long" string.

>>> parse_coordinates('{"latitude": 9.03, "longitude": 38.74}')
'9.03,38.74'
>>> parse_coordinates('lat: 9.03, lng: 38.74')
'9.03,38.74'
>>> parse_coordinates('"Latitude": "9.03", "Long": "38.74"')
'9.03,38.74'

Usage in config:

location = { selector = "#map", attribute = "data-coords", parser = "parse_coordinates" }
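
One way to cover all three input styles shown for parse_coordinates is a pair of case-insensitive regular expressions. The sketch below is an assumption about how such matching could work, not the library's implementation; the names sketch_parse_coordinates, LAT_RE, and LNG_RE are hypothetical, and so is the None result when either value is missing.

```python
import re
from typing import Optional

# Characters allowed between a key and its value: whitespace, quotes, ':' and '='.
SEP = r"""[\s"':=]*"""
NUMBER = r"(-?\d+(?:\.\d+)?)"

LAT_RE = re.compile(r"lat(?:itude)?" + SEP + NUMBER, re.IGNORECASE)
LNG_RE = re.compile(r"(?:lng|long(?:itude)?|lon)" + SEP + NUMBER, re.IGNORECASE)


def sketch_parse_coordinates(text: str) -> Optional[str]:
    """Hypothetical stand-in for parse_coordinates."""
    lat = LAT_RE.search(text)
    lng = LNG_RE.search(text)
    if lat is None or lng is None:
        return None  # assumption: both values are required
    # The numbers are returned exactly as written in the input.
    return f"{lat.group(1)},{lng.group(1)}"


print(sketch_parse_coordinates('{"latitude": 9.03, "longitude": 38.74}'))
print(sketch_parse_coordinates("lat: 9.03, lng: 38.74"))
```

Matching on key names rather than parsing JSON lets the same code handle fragments such as the third example above, which is not valid JSON on its own.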

JSON-LD Field Extraction

Use the ldjson: prefix to extract specific fields from JSON-LD:

[extract.items.fields]
date_published = { selector = "script[type='application/ld+json']", parser = "ldjson:datePublished" }
author_name = { selector = "script[type='application/ld+json']", parser = "ldjson:author.name" }

The path after ldjson: uses dot notation. A numeric segment indexes into an array:

  • ldjson:datePublished → extracts datePublished
  • ldjson:author.name → extracts author.name
  • ldjson:offers.0.price → extracts first offer's price
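
The dot-notation lookup described above can be sketched as a small path walker. The function name resolve_path and the sample document are hypothetical, and returning None for a missing key or an out-of-range index is an assumption.

```python
from typing import Any, Optional


def resolve_path(data: Any, path: str) -> Optional[Any]:
    """Hypothetical helper: walk a parsed JSON-LD object along a dotted path."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict):
            if key not in current:
                return None
            current = current[key]
        elif isinstance(current, list):
            # Numeric segments such as "0" index into arrays.
            if not key.isdigit() or int(key) >= len(current):
                return None
            current = current[int(key)]
        else:
            return None
    return current


# Made-up sample data for illustration only.
doc = {
    "datePublished": "2024-01-15",
    "author": {"name": "Jane"},
    "offers": [{"price": "19.99"}],
}
print(resolve_path(doc, "author.name"))
print(resolve_path(doc, "offers.0.price"))
```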

Available Built-in Parsers

Parser             Description
strip              Strip leading/trailing whitespace
squish             Collapse multiple whitespace to single spaces
parse_int          Extract integer from text
parse_float        Extract float from text
parse_price        Parse price with currency detection
parse_json         Parse JSON string to object
parse_ldjson       Extract from JSON-LD (handles @graph)
parse_coordinates  Extract lat/long from various formats

Using Parsers in Config

In Field Config

[extract.items.fields]
price = { selector = ".price", parser = "parse_price" }
count = { selector = ".count", parser = "parse_int" }

In Derived Fields

[extract.derived]
bedrooms = { path = "details.Bedrooms", parser = "parse_int" }
price_amount = { path = "price.amount", parser = "parse_float" }

Error Handling

When a parser fails:

  • The field is set to None
  • A warning is logged (visible with -v)
  • The item does not fail, unless the field is required
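
The failure policy above could look roughly like this wrapper. It is a sketch under assumptions: apply_parser, its parameters, and the logger name are hypothetical, and re-raising the exception is only one possible way to fail an item whose field is required.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("parsers_sketch")  # hypothetical logger name


def apply_parser(
    parser: Callable[[Any], Any],
    value: Any,
    field_name: str,
    required: bool = False,
) -> Optional[Any]:
    """Hypothetical wrapper: run a parser and fall back to None on failure."""
    try:
        return parser(value)
    except Exception as exc:
        # Warn instead of crashing, so one bad field does not lose the item.
        logger.warning("parser failed for field %r: %s", field_name, exc)
        if required:
            raise  # assumption: a required field fails the whole item
        return None


print(apply_parser(int, "42", "count"))
print(apply_parser(int, "abc", "count"))
```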

Custom Parsers

See Custom Parsers for creating your own parsers.