Authoring Plugins
Ladon crawl plugins live in separate repos — the core library is intentionally unaware of any site-specific logic. This keeps the framework dependency-free and each adapter independently versioned.
Ethical crawling: enable robots.txt enforcement
When writing a plugin that targets public third-party websites, configure
your HttpClient with respect_robots_txt=True. Honouring robots.txt follows the
IETF Robots Exclusion Protocol (RFC 9309), is the established industry norm,
and is increasingly a legal expectation under EU data-protection law. See
Getting Started → Ethical note
for the full rationale.
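To make the effect of enforcement concrete, the sketch below uses Python's standard-library urllib.robotparser rather than Ladon itself; how HttpClient applies the rules internally is not shown here. The robots.txt body, the user-agent string, and the URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url: str, user_agent: str = "example-bot") -> bool:
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = is_allowed("https://example.com/catalogue/2026")
blocked = is_allowed("https://example.com/private/report")
```

A client that enforces robots.txt would refuse the second URL before any request is sent.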
The plugin protocol
A plugin must implement the CrawlPlugin protocol:
from ladon.plugins.protocol import CrawlPlugin, Expander, Sink, Source

class MyPlugin:
    name: str                  # short identifier used in logs
    source: Source             # top-level ref discovery
    expanders: list[Expander]  # ordered chain of URL/ref expanders
    sink: Sink                 # leaf processor
Ladon uses structural subtyping (PEP 544 Protocol). No inheritance is
required — your class just needs to provide the attributes above. Instance
attributes set in __init__ satisfy the protocol at runtime, so the common
pattern is:
from ladon.networking.client import HttpClient

class MyPlugin:
    def __init__(self, client: HttpClient) -> None:
        self.name = "my_plugin"
        self.source = MySource()
        self.expanders = [MyExpander()]
        self.sink = MySink()
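The structural-subtyping point can be checked in isolation. The sketch below defines a small stand-in protocol (NamedPlugin is hypothetical and is not Ladon's CrawlPlugin, whose exact definition is not shown here) and confirms that attributes assigned in __init__ are enough for a runtime isinstance check, assuming the protocol is decorated with runtime_checkable.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class NamedPlugin(Protocol):
    """Hypothetical stand-in protocol; not Ladon's CrawlPlugin."""

    name: str
    sink: object


class DemoPlugin:
    # No base class: the attributes are assigned on the instance.
    def __init__(self) -> None:
        self.name = "demo"
        self.sink = object()


class Incomplete:
    def __init__(self) -> None:
        self.name = "incomplete"  # the sink attribute is missing


conforms = isinstance(DemoPlugin(), NamedPlugin)
rejected = isinstance(Incomplete(), NamedPlugin)
```

The runtime check only looks for the presence of each attribute; it does not validate the annotated types, so a static type checker is still the better guard against a wrongly typed source or sink.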
CLI constructor requirement
When invoked via ladon run --plugin, the CLI constructs your plugin
as plugin_cls(client=client). Make sure your __init__ accepts
client as a keyword argument.
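A quick way to catch a constructor mismatch before running the CLI is to resolve the module:Class string yourself and test whether the class binds a client keyword. The helpers below (load_plugin_class and accepts_client_keyword) are hypothetical and only approximate what a loader might do; they are not Ladon's implementation.

```python
import importlib
import inspect


def load_plugin_class(spec: str) -> type:
    """Resolve a 'package.module:ClassName' string to a class object."""
    module_name, _, class_name = spec.partition(":")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module:ClassName', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def accepts_client_keyword(plugin_cls: type) -> bool:
    """Return True if plugin_cls(client=...) would bind successfully."""
    try:
        inspect.signature(plugin_cls).bind(client=object())
    except TypeError:
        return False
    return True


class GoodPlugin:
    def __init__(self, client: object) -> None:
        self.client = client


class BadPlugin:
    def __init__(self) -> None:  # no client parameter
        pass


# A standard-library class is used here only to show the lookup working.
loaded = load_plugin_class("collections:OrderedDict")
```

Running accepts_client_keyword in a unit test for each plugin class gives an early failure instead of a TypeError at crawl time.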
Expander
An Expander turns one ref into an Expansion — the current node's record
plus the child refs to process next (e.g. catalogue record + lot URLs):
from ladon.networking.client import HttpClient
from ladon.plugins.models import Expansion

class MyExpander:
    def expand(self, ref: object, client: HttpClient) -> Expansion:
        """Fetch ref; return its record and child refs.

        Raises:
            ExpansionNotReadyError: ref is not yet ready to be expanded.
            PartialExpansionError: child list is incomplete.
            ChildListUnavailableError: child list could not be retrieved.
        """
        ...
Exceptions an expander can raise, and how the runner responds:
| Exception | Meaning |
|---|---|
| ExpansionNotReadyError | Data not ready; abort the entire run and let the caller retry later |
| PartialExpansionError | Some children unavailable; runner logs and continues |
| ChildListUnavailableError | Child list fetch failed; runner logs and continues |
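The behaviour in the table can be sketched with stand-in classes. Everything below is illustrative: the exception classes and the Expansion dataclass are local substitutes for Ladon's, the field names record and child_refs are assumptions, and expand_all is a hypothetical helper rather than the real runner.

```python
from dataclasses import dataclass, field


class ExpansionNotReadyError(Exception):
    """Stand-in for Ladon's exception of the same name."""


class PartialExpansionError(Exception):
    """Stand-in for Ladon's exception of the same name."""


class ChildListUnavailableError(Exception):
    """Stand-in for Ladon's exception of the same name."""


@dataclass
class Expansion:
    record: dict
    child_refs: list = field(default_factory=list)  # field names assumed


class CatalogueExpander:
    """Toy expander that reads an in-memory 'site' instead of using HTTP."""

    def __init__(self, site: dict) -> None:
        self.site = site

    def expand(self, ref: object, client: object) -> Expansion:
        page = self.site.get(ref)
        if page is None:
            raise ChildListUnavailableError(f"no child list for {ref!r}")
        if not page["published"]:
            raise ExpansionNotReadyError(f"{ref!r} is not published yet")
        return Expansion(record={"ref": ref}, child_refs=list(page["lots"]))


def expand_all(expander, refs, client=None):
    """Expand each ref; skip recoverable failures, let not-ready propagate."""
    children, skipped = [], []
    for ref in refs:
        try:
            children.extend(expander.expand(ref, client).child_refs)
        except (PartialExpansionError, ChildListUnavailableError):
            skipped.append(ref)  # a real runner would log here and continue
        # ExpansionNotReadyError is deliberately not caught: it ends the run.
    return children, skipped


site = {"cat/1": {"published": True, "lots": ["lot/1", "lot/2"]}}
children, skipped = expand_all(CatalogueExpander(site), ["cat/1", "cat/missing"])
```

The missing catalogue is skipped while the published one still yields its lots; an unpublished catalogue would raise out of expand_all and stop the loop.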
Sink
A Sink processes each leaf ref (e.g. downloads a lot page):
from ladon.networking.client import HttpClient

class MySink:
    def consume(self, ref: object, client: HttpClient) -> object:
        """Fetch and process the leaf; return a record for on_leaf callback."""
        ...
LeafUnavailableError signals that the leaf is temporarily unavailable;
the runner records the failure and moves on.
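As a rough illustration of "records the failure and moves on", the sketch below counts successes and failures across a list of leaf refs. LeafUnavailableError here is a local stand-in, FlakySink is a toy, and consume_leaves is a hypothetical helper, not the runner's actual loop.

```python
class LeafUnavailableError(Exception):
    """Stand-in for Ladon's exception of the same name."""


class FlakySink:
    """Toy sink: refs ending in '!' are treated as temporarily unavailable."""

    def consume(self, ref: object, client: object) -> object:
        if str(ref).endswith("!"):
            raise LeafUnavailableError(ref)
        return {"ref": ref, "title": f"Lot {ref}"}


def consume_leaves(sink, refs, client=None):
    """Consume each leaf; count successes and failures without stopping."""
    records, fetched, failed = [], 0, 0
    for ref in refs:
        try:
            records.append(sink.consume(ref, client))
            fetched += 1
        except LeafUnavailableError:
            failed += 1  # one bad leaf does not end the crawl
    return records, fetched, failed


records, fetched, failed = consume_leaves(FlakySink(), ["a", "b!", "c"])
```

Any other exception type raised by a sink is left uncaught in this sketch, which is one reason to reserve LeafUnavailableError for genuinely temporary conditions.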
CrawlPlugin
Combine a source, expanders, and a sink into a plugin:
from ladon.networking.client import HttpClient

class AuctionPlugin:
    def __init__(self, client: HttpClient) -> None:
        self.name = "auction_example"
        self.source = CatalogueSource()
        self.expanders = [CategoryExpander(), AuctionExpander()]
        self.sink = LotSink()
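One way to picture an ordered expander chain is as a depth-first walk in which each expander's child refs feed the next expander and the last expander's children go to the sink. The sketch below is an assumption about that shape, not the runner's actual traversal; Expansion is a stand-in with assumed field names, the source is left out, and SuffixExpander, UpperSink, and walk are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Expansion:
    """Stand-in with assumed field names."""

    record: object
    child_refs: list = field(default_factory=list)


class SuffixExpander:
    """Toy expander: every ref fans out into two labelled children."""

    def __init__(self, label: str) -> None:
        self.label = label

    def expand(self, ref, client) -> Expansion:
        children = [f"{ref}/{self.label}{i}" for i in (1, 2)]
        return Expansion(record=f"{self.label}:{ref}", child_refs=children)


class UpperSink:
    """Toy sink: the 'record' is just the upper-cased ref."""

    def consume(self, ref, client):
        return str(ref).upper()


def walk(ref, expanders, sink, client=None, depth=0):
    """Depth-first walk: expander N feeds expander N+1; the last feeds the sink."""
    if depth == len(expanders):
        return [sink.consume(ref, client)]
    leaves = []
    for child in expanders[depth].expand(ref, client).child_refs:
        leaves.extend(walk(child, expanders, sink, client, depth + 1))
    return leaves


leaves = walk("root", [SuffixExpander("cat"), SuffixExpander("auc")], UpperSink())
```

With two expanders that each fan out twice, the walk reaches four leaves, which mirrors a category → auction → lot hierarchy.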
Running from code
from ladon.networking.client import HttpClient
from ladon.networking.config import HttpClientConfig
from ladon.runner import RunConfig, run_crawl

config = HttpClientConfig(retries=2, min_request_interval_seconds=1.0)
client = HttpClient(config)
try:
    plugin = AuctionPlugin(client=client)
    result = run_crawl(
        top_ref="https://example-auction.com/catalogue/2026",
        plugin=plugin,
        client=client,
        config=RunConfig(leaf_limit=100),
        # db is a placeholder for your own persistence layer
        on_leaf=lambda leaf_record, parent_record: db.save(leaf_record),
    )
    print(f"fetched {result.leaves_fetched}, failed {result.leaves_failed}")
finally:
    client.close()  # release connections even if the crawl raises
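A named callback is often easier to test than a lambda. The sketch below builds an on_leaf-style function that appends each leaf to a JSON Lines file; make_jsonl_writer is a hypothetical helper, and it assumes both records are JSON-serialisable, which depends on what your sink and expanders return.

```python
import json
import os
import tempfile
from typing import Callable


def make_jsonl_writer(path: str) -> Callable[[object, object], None]:
    """Return an on_leaf-style callback that appends one JSON line per leaf."""

    def on_leaf(leaf_record: object, parent_record: object) -> None:
        row = {"leaf": leaf_record, "parent": parent_record}
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(row) + "\n")

    return on_leaf


# Demo with made-up records written to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "leaves.jsonl")
callback = make_jsonl_writer(path)
callback({"lot": 1}, {"auction": "A"})
callback({"lot": 2}, {"auction": "A"})

with open(path, encoding="utf-8") as fh:
    rows = [json.loads(line) for line in fh]
```

The returned function takes the same two positional arguments as the lambda above, so it could be passed as on_leaf in its place.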
Running from the CLI
ladon run --plugin mypackage.adapters:AuctionPlugin \
--ref https://example-auction.com/catalogue/2026
The CLI uses default RunConfig settings (no leaf limit, no on_leaf
callback). For production use, write a Python script that calls run_crawl
directly.
Error taxonomy
All errors are in ladon.plugins.errors and ladon.networking.errors.
| Error | Layer | Meaning |
|---|---|---|
| ExpansionNotReadyError | Plugin | Run not yet possible; abort and retry later |
| PartialExpansionError | Plugin | Some branches unavailable; log and continue |
| ChildListUnavailableError | Plugin | Child list fetch failed |
| LeafUnavailableError | Plugin | Individual leaf unavailable |
| CircuitOpenError | Networking | Host circuit breaker is open |
| RobotsBlockedError | Networking | robots.txt disallows the URL |
| RequestTimeoutError | Networking | Request exceeded timeout |
| TransientNetworkError | Networking | Connection-level transport failure; all internal retries exhausted |
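For the "abort and retry later" case, a calling script might wrap the whole run in a small retry loop. The sketch below is one possible shape, not part of Ladon: ExpansionNotReadyError is a local stand-in, run_with_retry is hypothetical, and the attempt count and delays are arbitrary. In practice "later" may mean a scheduled re-run rather than an in-process wait.

```python
import time


class ExpansionNotReadyError(Exception):
    """Stand-in for Ladon's exception of the same name."""


def run_with_retry(run, attempts=3, base_delay=0.0, sleep=time.sleep):
    """Call run(); on ExpansionNotReadyError wait, then try again."""
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except ExpansionNotReadyError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


calls = []


def fake_run():
    """Pretend crawl that only succeeds on the third call."""
    calls.append(1)
    if len(calls) < 3:
        raise ExpansionNotReadyError("catalogue not published")
    return "done"


outcome = run_with_retry(fake_run)
```

Only the not-ready case is retried here; the other errors in the table are either handled inside the run or indicate conditions that a blind retry is unlikely to fix.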