ADR-006 — Persistence Layer
Context and Problem Statement
The runner's on_leaf callback is intentionally thin — the runner has no
database dependency, and persistence is the caller's responsibility (ADR-004).
This was the right design for Phases 1–3. For production use it is
insufficient for three reasons:
1. on_leaf gives no structural guidance. The caller receives two untyped
objects and must figure out upsert semantics, transaction scope, and which
table to write to on their own. Every adapter will reinvent the same
boilerplate.
2. There is no run-level audit trail. RunResult captures counts and
errors for a single invocation, but there is no durable record of what ran,
when, and with what outcome. Answering "when did we last successfully crawl
this source?" requires application-level instrumentation that the framework
provides no guidance for.
3. The framework must remain domain-agnostic. A persistence layer that prescribes SQL schemas, ORM models, or database drivers would couple the framework to specific technology choices. Different domains call for fundamentally different storage backends — relational, document, time-series, flat files — and Ladon cannot mandate any of them.
The question is: what is the minimum persistence surface that the framework can provide without becoming opinionated about storage technology?
Decision Drivers
- The framework defines contracts, not implementations. SQL, DDL, and connection management are adapter concerns.
- The runner must remain persistence-agnostic — no
repositoryparameter inrun_crawl(). - Run history must be queryable for incremental crawling and operator dashboards, without prescribing the backing store.
- Adapters must be able to implement persistence without inheriting from any Ladon class.
- A no-op implementation must exist for dry runs and tests, with a production-safe warning when accidentally used with live data.
Considered Options
-
Option A: Add
repositoryparameter torun_crawl()— the runner callsrepository.write_leaf()directly. Simple, but couples the runner to a persistence concept it does not own. Makes the runner untestable without a real or mock repository. -
Option B: Two protocols + orchestration outside the runner (chosen) —
RepositoryandRunAuditprotocols sit outsiderun_crawl(). The orchestration layer wrapsrun_crawl()and calls the protocols before and after. The runner stays persistence-agnostic. -
Option C: Event-driven persistence — the runner emits leaf events to a queue; a consumer writes to storage. Decouples crawling from persistence and enables replay. Correct long-term direction; premature for a single-worker deployment. Deferred to when concurrent workers justify the added infrastructure.
Decision Outcome
Option B: two @runtime_checkable protocols + orchestration outside the
runner.
The two protocols
@runtime_checkable
class Repository(Protocol):
def write_leaf(self, record: object, run_id: str) -> None: ...
@runtime_checkable
class RunAudit(Protocol):
def record_run(self, run: RunRecord) -> None: ...
def get_last_run(
self, plugin_name: str, status: str | None = "done",
) -> RunRecord | None: ...
Repository is the minimum persistence contract. Every adapter that
writes data implements this one method. It is intentionally narrow: the
framework does not prescribe upsert semantics, transaction scope, or table
names. Those are the adapter's domain.
RunAudit is an optional extension for adapters that need durable run
history — incremental crawling, operator dashboards, trend analysis. It is
independent of Repository: a class may implement one without the other.
The two-protocol split reflects a fundamental distinction: leaf persistence is always needed; run auditing is an optional capability. Forcing every adapter to implement audit methods would create dead no-op implementations everywhere.
Why record: object (untyped)
write_leaf(record: object, run_id: str) gives no type information to the
implementor. The correct fix is generic protocols:
T = TypeVar("T")
class Repository(Protocol[T]):
def write_leaf(self, record: T, run_id: str) -> None: ...
This is deferred: generic protocols interact poorly with NullRepository[Any]
and pyright inference in the absence of explicit type parameters. The untyped
object is a known limitation, documented here and in the adapter-authoring
guide. Adapter implementations must cast or use isinstance to access
domain-specific fields.
The RunRecord dataclass
from dataclasses import dataclass, field
from datetime import datetime
from typing import Literal
@dataclass
class RunRecord:
run_id: str
plugin_name: str
top_ref: str
started_at: datetime
status: Literal["running", "done", "failed", "not_ready", "partial"]
finished_at: datetime | None = None
leaves_fetched: int = 0
leaves_persisted: int = 0
leaves_failed: int = 0
branch_errors: int = 0
errors: tuple[str, ...] = field(default_factory=tuple)
RunRecord is mutable (non-frozen) so the orchestration layer can either
create two separate instances (start + finish) or mutate one in place.
branch_errors counts expander branch failures isolated by the bulkhead
(ADR-004). These appear in errors with the prefix
"expander branch '...':" and are counted separately for dashboards.
The record_run two-call contract
record_run is called twice per run:
- At run start:
status='running',finished_at=None, all counters at 0. - At run finish:
statusset to the final outcome, counters populated.
Implementations must upsert on run_id — a plain INSERT will raise a
primary key violation on the second call. This is documented prominently in
the method docstring with a SQL example.
NullRepository
A no-op implementation satisfying both protocols structurally. Used for dry runs, search-only crawls, and tests where persistence is not under test.
class NullRepository:
def __init__(self, *, silent: bool = False) -> None:
if not silent:
logger.warning("NullRepository instantiated: all leaf records "
"and run audit data will be silently discarded. ...")
def write_leaf(self, record: object, run_id: str) -> None: pass
def record_run(self, run: RunRecord) -> None: pass
def get_last_run(self, plugin_name: str,
status: str | None = "done") -> RunRecord | None:
return None
silent=False (default) emits a WARNING on construction. Accidentally
deploying NullRepository in production is immediately visible in logs.
Orchestration pattern
The runner receives no repository. The orchestration layer wraps it:
import uuid
from datetime import datetime, timezone
from ladon.persistence import RunAudit, RunRecord
run_id = str(uuid.uuid4())
run = RunRecord(run_id=run_id, plugin_name=plugin.name,
top_ref=str(top_ref),
started_at=datetime.now(tz=timezone.utc),
status="running")
if isinstance(repository, RunAudit):
repository.record_run(run)
result = run_crawl(
top_ref, plugin, client, config,
on_leaf=lambda rec, _: repository.write_leaf(rec, run_id),
)
run.status = "done" # or "failed", "partial", "not_ready"
run.finished_at = datetime.now(tz=timezone.utc)
run.leaves_fetched = result.leaves_fetched
run.leaves_persisted = result.leaves_persisted
run.leaves_failed = result.leaves_failed
run.branch_errors = sum(1 for e in result.errors
if e.startswith("expander branch"))
run.errors = result.errors
if isinstance(repository, RunAudit):
repository.record_run(run)
The isinstance(repository, RunAudit) check uses Python structural subtyping
at runtime — adapters implementing both protocols need not declare it
explicitly. Note: Python's runtime checkable protocol check verifies only
that the required method names are present, not their signatures. A class
with a mismatched record_run signature would pass isinstance at runtime
but fail under a static type-checker (pyright, mypy).
Why no DatabaseBackend abstraction
A DatabaseBackend protocol mapping SQL operations to implementations is
explicitly rejected. Each adapter owns its schema and its queries. The
framework has no business abstracting query execution — it only needs to know
that a Repository exists. Adapters that want SQL portability use SQLAlchemy
independently.
Structural subtyping — no inheritance required
Adapters implement the protocols without importing any Ladon base class:
# In a third-party adapter — only RunRecord needs to be imported
from ladon.persistence import RunRecord
class MyDuckDBRepository:
def write_leaf(self, record: object, run_id: str) -> None:
... # cast record, insert into DuckDB
def record_run(self, run: RunRecord) -> None:
... # upsert into ladon_runs table
def get_last_run(
self, plugin_name: str, status: str | None = "done"
) -> RunRecord | None:
... # query ladon_runs, return RunRecord or None
A minimal adapter — one that only writes leaves, no audit trail:
class MyMinimalRepository:
def write_leaf(self, record: object, run_id: str) -> None:
... # satisfies Repository; RunAudit not required
Design decision: get_last_run default status filter
get_last_run defaults to status="done". Callers using this for
incremental crawling almost always want the last successful run, not the
last attempted one. A failed run followed by a retry would otherwise
return the failed run, causing the crawl to re-run from the beginning.
status=None returns the most recent run regardless of outcome.
Implementations must order by finished_at descending (or started_at
when finished_at is None).
Consequences
Good:
- The runner remains persistence-agnostic and fully testable without a database.
- Each adapter owns its schema, storage engine, and migration tooling. New data domains plug in without any framework change.
- Run history is durable and queryable regardless of domain or backend.
NullRepositoryprovides a safe, visible no-op for development and testing.- Structural subtyping means third-party adapters have no AGPL import
obligation beyond
RunRecord(see ADR-010 and theladon-typesfuture option).
Trade-offs:
record: objectis untyped — adapter authors must cast manually and rely on plugin-specific documentation to know the concrete type. Generic protocols deferred.- The two-call
record_runcontract is a documented runtime requirement, not a type-system guarantee. An adapter using a plainINSERTwill fail on the second call with a primary key violation — the framework cannot detect this misconfiguration at construction time. - Keeping
on_leafalongside the Repository pattern means two persistence idioms coexist.on_leafis still the right default for simple use cases.
Related ADRs
- ADR-004 — SES pipeline and
on_leafdependency injection - ADR-005 — Asset storage protocol (
ladon.storage) — Accepted, Phase 3 - ADR-010 — Contributor License Agreement and
ladon-typesfuture option