# How Ladon Works
Ladon crawls tree-structured web resources — sites where content is organized hierarchically: a forum has threads, a thread has posts; a shop has categories, a category has products; a news site has topics, a topic has articles. This guide explains the mental model behind Ladon's plugin system and how the three pipeline roles map to a real crawl.
## The Source → Expander → Sink pipeline
Every Ladon plugin decomposes its crawl into three roles:
| Role | Question it answers | Output |
|---|---|---|
| Source | Where do I start? | A list of top-level refs |
| Expander | What's inside this node? | A record + child refs |
| Sink | What's at this leaf? | A final record |
The runner drives the pipeline:
1. Call `Source.discover()` to get the top-level ref list.
2. For each ref, call the first `Expander.expand()` to get its record and children.
3. Repeat with subsequent Expanders for deeper levels.
4. Call `Sink.consume()` on every leaf ref.
5. Fire the `on_leaf` callback (your persistence hook) after each success.
You never write the loop. Ladon owns traversal, error counting, rate limiting, and the leaf limit. Your code owns the parsing.
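The traversal the runner owns can be sketched in miniature. This is a simplified, hypothetical rendering of the loop for a two-level crawl, with the roles passed in as plain callables; the real runner also does error counting, rate limiting, and leaf-limit enforcement, none of which appears here. The `Expansion` shape matches the one shown later in this guide; the function name `drive` is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Expansion:
    record: object                 # the node's own record
    child_refs: Sequence[object]   # refs to its children

def drive(discover: Callable[[], Sequence[object]],
          expand: Callable[[object], Expansion],
          consume: Callable[[object], object],
          on_leaf: Callable[[object, object], None]) -> int:
    """Simplified sketch of the runner's traversal order (two levels only)."""
    leaves = 0
    for ref in discover():                      # 1. Source yields top-level refs
        expansion = expand(ref)                 # 2. Expander yields record + children
        for child in expansion.child_refs:      # 3. every child here is a leaf
            record = consume(child)             # 4. Sink parses the leaf
            on_leaf(record, expansion.record)   # 5. persistence hook fires on success
            leaves += 1
    return leaves
```

The point of the sketch is the calling order, not the implementation: your classes supply `discover`, `expand`, and `consume`; the loop itself never appears in plugin code.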
## A concrete example: Hacker News
Hacker News has a two-level structure: a front page (list of stories) and individual story pages (comments). Here's how that maps:
```
Source            →   [Expander]             →   Sink
─────────────────────────────────────────────────────
HN front page         Story page                 Comment
(top story IDs)       (story + comment IDs)      (comment record)
```
**Source** fetches `https://hacker-news.firebaseio.com/v0/topstories.json` and returns a list of story IDs as refs.

**Expander** takes one story ID ref, fetches the story item, and returns an `Expansion`:

```python
@dataclass(frozen=True)
class Expansion:
    record: object                # the story record (title, score, author…)
    child_refs: Sequence[object]  # the comment ID refs
```

**Sink** takes one comment ID ref and returns the fully parsed comment record.
The runner calls the Expander once per story, then the Sink once per comment. You write three focused classes, each doing one thing.
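Under assumed shapes for `Ref` and `Expansion`, and with JSON fetching injected as a plain callable (a stand-in for Ladon's `HttpClient`; the real plugin API may differ), the three HN roles might be sketched like this:

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"

@dataclass(frozen=True)
class Ref:
    url: str
    raw: Mapping[str, object]

@dataclass(frozen=True)
class Expansion:
    record: object
    child_refs: Sequence[object]

class HnSource:
    def __init__(self, fetch_json: Callable[[str], object]) -> None:
        self.fetch_json = fetch_json  # injected fetcher (stand-in for HttpClient)

    def discover(self) -> Sequence[Ref]:
        # Top stories endpoint returns a plain list of item IDs.
        ids = self.fetch_json("https://hacker-news.firebaseio.com/v0/topstories.json")
        return [Ref(url=ITEM_URL.format(i), raw={"id": i}) for i in ids]

class HnExpander:
    def __init__(self, fetch_json) -> None:
        self.fetch_json = fetch_json

    def expand(self, ref: Ref) -> Expansion:
        story = self.fetch_json(ref.url)       # e.g. {"title": ..., "kids": [...]}
        kids = story.get("kids", [])
        return Expansion(
            record={"title": story["title"], "score": story.get("score")},
            child_refs=[Ref(url=ITEM_URL.format(k), raw={"id": k}) for k in kids],
        )

class HnSink:
    def __init__(self, fetch_json) -> None:
        self.fetch_json = fetch_json

    def consume(self, ref: Ref) -> dict:
        comment = self.fetch_json(ref.url)
        return {"by": comment.get("by"), "text": comment.get("text")}
```

Each class answers exactly one of the three questions from the table above; none of them knows about traversal, retries, or pacing.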
## Flat crawls (no tree)
Not every site has multiple levels. A flat crawl — a paginated list of items where each item is the leaf — uses a single Expander that returns the item URLs as `child_refs`, followed by a Sink that processes each one.
The number of Expanders is yours to choose; the framework handles any depth.
A Sink is always required — the runner always calls `Sink.consume()` on each leaf ref.
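A flat crawl's single Expander might look like the following sketch (the `Ref`/`Expansion` shapes and the injected `fetch_page` callable are assumptions for illustration): it walks the pagination itself and hands every item URL back as a child ref.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence, Tuple

@dataclass(frozen=True)
class Ref:
    url: str
    raw: Mapping[str, object]

@dataclass(frozen=True)
class Expansion:
    record: object
    child_refs: Sequence[object]

class ListingExpander:
    def __init__(self, fetch_page: Callable[[str], Tuple[Sequence[str], Optional[str]]]) -> None:
        # fetch_page(url) -> (item_urls, next_page_url or None); injected stub.
        self.fetch_page = fetch_page

    def expand(self, ref: Ref) -> Expansion:
        child_refs, url = [], ref.url
        while url is not None:                        # follow pagination to the end
            item_urls, url = self.fetch_page(url)
            child_refs += [Ref(url=u, raw={}) for u in item_urls]
        # The listing page itself has no record worth keeping here.
        return Expansion(record=None, child_refs=child_refs)
```

The Sink then processes each item URL exactly as in the tree case; depth one is just the degenerate tree.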
## Refs carry context
A ref is a lightweight, immutable pointer to a resource. It always has a URL, and it can carry a raw dict for pre-fetched data:

```python
@dataclass(frozen=True)
class Ref:
    url: str
    raw: Mapping[str, object]  # arbitrary pre-fetched context
```
The Expander populates `ref.raw` with whatever the Sink will need — parent metadata, pre-loaded JSON, currency information — so the Sink never has to re-fetch it. No parent context is threaded through the runner; context travels with the ref.
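A small sketch of the pattern (the `Ref` shape matches the snippet above; the URL and field names are invented for illustration): the Expander stashes parent context on each child ref, and the Sink reads it straight off the ref instead of re-fetching.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping

@dataclass(frozen=True)
class Ref:
    url: str
    raw: Mapping[str, object]

def make_comment_ref(comment_id: int, story: dict) -> Ref:
    # Expander side: carry the parent story's metadata along with the child.
    return Ref(
        url=f"https://example.test/comment/{comment_id}",  # hypothetical URL scheme
        raw=MappingProxyType({"story_title": story["title"], "story_id": story["id"]}),
    )

def consume(ref: Ref) -> dict:
    # Sink side: parent context is already on the ref; nothing to re-fetch.
    return {"story": ref.raw["story_title"], "url": ref.url}
```

Wrapping `raw` in `MappingProxyType` keeps the context read-only, matching the ref's immutability.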
## Errors are explicit, not swallowed
Ladon's error taxonomy is part of the protocol. Each exception carries a specific meaning that tells the runner exactly what to do:
| Exception | Meaning | Runner behaviour |
|---|---|---|
| `ExpansionNotReadyError` | Resource not ready yet | Abort the run; retry next schedule |
| `PartialExpansionError` | Child list incomplete | First expander: abort the run (re-raised to caller). Non-first expander: log and skip the branch. |
| `ChildListUnavailableError` | Child list unreachable | First expander: abort the run (re-raised to caller). Non-first expander: log and skip the branch. |
| `LeafUnavailableError` | Leaf fetch failed | Log, skip this leaf, continue |
Your plugin raises the right exception; the runner decides what to do with it.
No catch-all `except Exception` that masks real bugs.
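The convention can be sketched as follows. `LeafUnavailableError` is the exception name from the table above; the HTTP error type and the injected `fetch` callable are stand-ins, since the real client and its failure modes are not shown in this guide. The plugin translates low-level failures into the taxonomy instead of swallowing them:

```python
class LeafUnavailableError(Exception):
    """Leaf fetch failed; the runner logs and skips this leaf."""

class FakeHttpError(Exception):
    """Stand-in for whatever the HTTP client raises on failure."""

def consume(ref_url: str, fetch) -> dict:
    try:
        return fetch(ref_url)
    except FakeHttpError as exc:
        # Translate, don't swallow: the plugin names the failure,
        # the runner decides what it means (here: log + skip + continue).
        raise LeafUnavailableError(ref_url) from exc
```

Chaining with `raise ... from exc` preserves the original traceback for the structured logs while still giving the runner a typed signal.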
## Persistence is injected, not built in
The runner has no database dependency. You inject persistence as a callback:
```python
def my_persist(leaf_record: object, parent_record: object) -> None:
    db.insert(leaf_record)

result = run_crawl(
    top_ref=my_ref,
    plugin=my_plugin,
    client=client,
    config=RunConfig(),
    on_leaf=my_persist,
)
```
`on_leaf` fires after every successful Sink call. If the Sink raises, `on_leaf` is never called for that leaf. What you do in `on_leaf` — write to a database, append to a file, stream to S3, print to stdout — is entirely up to you.
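As one example of the "append to a file" option, here is a sketch of a JSON-lines `on_leaf` factory (the factory name and the choice of format are ours, not Ladon's): it closes over any writable text stream, so the runner only ever sees a plain callable.

```python
import json

def make_jsonl_persist(stream):
    """Return an on_leaf callback that appends each leaf as one JSON line."""
    def on_leaf(leaf_record: object, parent_record: object) -> None:
        stream.write(json.dumps({"leaf": leaf_record,
                                 "parent": parent_record}) + "\n")
    return on_leaf
```

Because persistence is just a callable, swapping the file for a database insert or an S3 upload changes the factory, not the crawl.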
## What Ladon provides for free
Because all plugins go through the same runner and `HttpClient`, every crawl gets these automatically:
- Rate limiting — per-domain request pacing
- Retries with backoff — configurable, per-domain
- Circuit breaking — stops hammering a failing host (ADR-007)
- robots.txt enforcement — opt-in, RFC 9309 compliant (ADR-008)
- Structured logging — runner and leaf events logged with plugin name, ref, counts
- Leaf limit — `RunConfig.leaf_limit` caps any crawl for safe testing
## Next steps
- Authoring Plugins — implement your first plugin
- ADR-004: SES Protocol Design — the full rationale behind this architecture
- API Reference: Runner — `run_crawl`, `RunConfig`, `RunResult`
- API Reference: Plugins — `CrawlPlugin`, `Source`, `Expander`, `Sink`, `Expansion`