0014: Pipeline Utilities — State Migration Hooks for Checkpoints¶
- Status: Accepted
- Author: Chris Colinsky
- Created: 2026-05-13
- Accepted: 2026-05-14
- Targets: spec/pipeline-utilities/spec.md (adds §10.12 State migrations; modifies §10.2, §10.10)
- Related: 0008 (checkpointing contract reserving schema_version), 0001
- Supersedes:
Summary¶
Activate the schema_version field that proposal 0008 reserved on CheckpointRecord, and add a
registration surface for state migrations: user-supplied transformations that run on
checkpoint load when a saved record's schema_version does not match the current state schema's
schema_version. A compiled graph MAY register an ordered set of migrations; the engine walks
them on load to project a stored record's state into the current shape. Two new canonical error
categories cover migration-related failures: checkpoint_state_migration_missing (no
chain of registered migrations connects the stored version to the current version) and
checkpoint_state_migration_failed (a registered migration function raised during chain
application).
Motivation¶
Proposal 0008 left this hook deliberately empty. §10.2 reserved schema_version as a string
"implementation-defined; lets backends evolve the record shape without breaking older saved
records," and §10.10 introduced checkpoint_record_invalid for "state shape mismatch, missing
required fields, incompatible schema_version." But §10 did not specify how a user evolves the
state schema between runs without invalidating their existing checkpoints, and the current
contract gives them only two options: stay on the old schema forever, or discard checkpoints on
every schema change and re-run from scratch.
For long-running LLM pipelines this bites in two places:
- Iteration cost. A pipeline with expensive intermediates (frame extraction, ASR cleanup, visual narration, embedding computation) re-runs hours of work whenever the state schema evolves. Adding one optional field to a state class invalidates every prior checkpoint.
- Production migration. When a deployed pipeline ships a new state shape, in-flight invocations from the previous deploy cannot resume against the new code; they raise checkpoint_record_invalid per §10.10. The operator's only options are to drain the previous deploy, accept the loss, or maintain a sidecar process to manually convert records.
The state-snapshot pattern §10 codifies makes the migration story tractable: the record carries
the state as a serialized representation, plus a version identifier. A function (serialized
state at version V) -> (serialized state at version V+1) is enough to bridge a schema bump.
Migration chains generalize this across multiple bumps.
The contract is small (one new section §10.12, a clarified §10.2 field, two new error categories) and the user-facing surface is one registration call per migration. Pipelines that never evolve their schema pay nothing.
Detailed design¶
Pipeline-utilities §10.2: schema_version becomes user-facing¶
Amend the §10.2 description of schema_version to make it a state-class-level identifier (set
by the user on their state definition), not an implementation-internal field. The current
text:
schema_version: string. Implementation-defined; lets backends evolve the record shape without breaking older saved records.
Replace with:
schema_version: string. Carries the version identifier of the user's state schema at the time the record was saved. The state definition MAY expose a stable, user-controlled schema_version identifier (the surface for declaring it is per-language ergonomic, e.g., a class attribute in Python, a constant in TypeScript). When declared, the framework reads schema_version from the state definition at save time and writes it onto the record. State classes that do not declare a schema_version are treated as carrying an implementation-defined sentinel value (typically the empty string), and are not migration-eligible until they declare one. Users intending to evolve their schema across deploys MUST declare an explicit schema_version so that migrations (per §10.12) can be registered against it.
The framework does not constrain the version identifier's syntax. Users MAY use semver, integer counters, date stamps, or content hashes — whatever makes sense for their evolution discipline. Two distinct identifiers are treated as distinct versions; identical identifiers are treated as the same version.
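As an illustration of the amended §10.2 text, the following is a minimal Python sketch of a state class declaring schema_version as a class attribute and of the framework stamping it onto a record at save time. The names TranscriptState, LegacyState, UNVERSIONED, read_schema_version, and build_record are hypothetical, and the record is simplified to a plain dict; none of them are mandated by this proposal.

```python
from dataclasses import asdict, dataclass
from typing import Any, ClassVar, Optional

# Hypothetical sentinel for state classes that declare no schema_version.
UNVERSIONED = ""


@dataclass
class TranscriptState:
    # User-controlled identifier; the spec places no constraint on its syntax.
    schema_version: ClassVar[str] = "v2"

    transcript: str = ""
    summary: Optional[str] = None  # optional field added in v2


@dataclass
class LegacyState:
    # No schema_version declared, so this class is not migration-eligible.
    transcript: str = ""


def read_schema_version(state: Any) -> str:
    """Return the declared version, or the sentinel when none is declared."""
    return getattr(type(state), "schema_version", UNVERSIONED)


def build_record(invocation_id: str, state: Any) -> dict[str, Any]:
    """Assemble a simplified record, stamping the version at save time."""
    return {
        "invocation_id": invocation_id,
        "schema_version": read_schema_version(state),
        "state": asdict(state),  # ClassVar attributes are not dataclass fields
    }
```

Because schema_version is a ClassVar, it stays out of the serialized state and travels only in the record's own schema_version field.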
Pipeline-utilities §10.12: State migrations (new subsection)¶
10.12.1 Migration registration¶
A compiled graph MAY register zero or more state migrations. Each migration is described by three pieces:
- from_version: the schema_version identifier the migration accepts as input.
- to_version: the schema_version identifier the migration produces as output.
- A migration function that, given a serialized state representation at from_version, returns a serialized state representation at to_version. The serialized form is whatever shape the active Checkpointer round-trips (per §10.1's "backends pick their own serialization"); the framework SHOULD pass the migration the most-deserialized form that is still independent of the current state class (e.g., a plain dict in Python, an unknown-shaped object in TypeScript) so the migration is not constrained by the user's current state-class definitions.
Migration support requires the active Checkpointer to be able to expose a structural
intermediate form of the loaded state (a plain dict, a JSON tree, or similar) that is
independent of the current state class definition. Backends using JSON, msgpack, or
similar schema-independent encodings naturally satisfy this; the SQLiteCheckpointer
reference implementation (per §10.11) does so by default. Backends using class-bound
serialization (Python pickle of state class instances) or live in-memory references to
typed state objects (the InMemoryCheckpointer reference implementation) cannot expose a
class-independent intermediate. When such a backend encounters a version mismatch on load
AND one or more migrations are registered, it MUST raise checkpoint_record_invalid per
§10.10 with the version mismatch in the error description; the migration registry has no
opportunity to bridge versions in that case. Implementations MUST document whether their
Checkpointer backend supports state migration.
The registration surface is per-language ergonomic. Python implementations are expected to
expose this on GraphBuilder (e.g., with_state_migration(...)); TypeScript implementations
may expose it on the builder or as a configuration object. The registration concept is what
this spec mandates: migrations are bound to the compiled graph and consulted during
checkpoint load.
A compiled graph's migration set is ordered by (from_version, to_version) pair. The
order of registration does not affect chain resolution; chains are resolved by version pair,
not by registration order. Two migrations with the same from_version and same to_version
MUST raise a configuration-time error (the chain is ambiguous). Two migrations with the
same from_version and different to_version define a branched migration graph; chain
resolution (§10.12.2) is responsible for picking a path.
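One way to picture the registration rules is a small registry keyed by version pair. This is a sketch only: MigrationRegistry, DuplicateMigrationError, and add_summary are hypothetical names, and a real implementation would more likely hang registration off its builder (for example the with_state_migration(...) surface mentioned above).

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# A migration takes and returns a class-independent form (a plain dict here).
MigrationFn = Callable[[dict[str, Any]], dict[str, Any]]


class DuplicateMigrationError(ValueError):
    """Hypothetical configuration-time error for an ambiguous registration."""


@dataclass
class MigrationRegistry:
    # Keyed by (from_version, to_version); registration order is irrelevant.
    _edges: dict[tuple[str, str], MigrationFn] = field(default_factory=dict)

    def register(self, from_version: str, to_version: str, fn: MigrationFn) -> None:
        key = (from_version, to_version)
        if key in self._edges:
            # Same from_version and same to_version: the chain would be ambiguous.
            raise DuplicateMigrationError(
                f"duplicate migration {from_version!r} -> {to_version!r}"
            )
        self._edges[key] = fn

    def edges(self) -> dict[tuple[str, str], MigrationFn]:
        """Return a copy of the registered edges for chain resolution."""
        return dict(self._edges)


def add_summary(state: dict[str, Any]) -> dict[str, Any]:
    """v1 -> v2: populate the new optional field with its default."""
    return {**state, "summary": None}


registry = MigrationRegistry()
registry.register("v1", "v2", add_summary)
```

Registering a second migration with the same from_version but a different to_version is accepted here, since that case only branches the migration graph and is left to chain resolution.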
10.12.2 Chain resolution¶
When Checkpointer.load(invocation_id) returns a record whose schema_version does not match
the current state schema's schema_version, the engine MUST attempt to resolve a migration
chain from the record's version to the current version using the graph's registered
migrations.
Chain resolution proceeds:
1. Build a directed graph over registered migrations: each migration is an edge from its from_version to its to_version.
2. Find any path from the record's schema_version to the current state schema's schema_version. Implementations MAY use any reasonable search (BFS for shortest path is recommended).
3. If at least one path exists, apply the migrations along the path in order: each migration's output becomes the next migration's input. The final serialized state is passed to the current state class's deserialization step (per §10.1 round-trip integrity).
4. If no path exists, raise checkpoint_state_migration_missing (per §10.10 below).
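The path search in the steps above can be sketched with a breadth-first search over the registered edges. The function name resolve_chain and the representation of the registry as a set of (from_version, to_version) tuples are assumptions made for illustration; the spec only recommends BFS and does not prescribe a signature.

```python
from collections import deque
from typing import Optional

Edge = tuple[str, str]


def resolve_chain(edges: set[Edge], stored: str, current: str) -> Optional[list[Edge]]:
    """Return the shortest list of edges from stored to current, or None."""
    # Step 1: adjacency list over the registered migrations.
    adjacency: dict[str, list[str]] = {}
    for src, dst in sorted(edges):
        adjacency.setdefault(src, []).append(dst)

    # Step 2: breadth-first search, remembering how each version was reached.
    came_from: dict[str, Optional[str]] = {stored: None}
    queue = deque([stored])
    while queue:
        version = queue.popleft()
        if version == current:
            break
        for nxt in adjacency.get(version, []):
            if nxt not in came_from:
                came_from[nxt] = version
                queue.append(nxt)

    if current not in came_from:
        return None  # Step 4: the caller raises checkpoint_state_migration_missing.

    # Walk back from current to stored, then reverse into application order.
    path: list[Edge] = []
    node = current
    while came_from[node] is not None:
        prev = came_from[node]
        path.append((prev, node))
        node = prev
    path.reverse()
    return path
```

In a branched graph, BFS prefers the path with the fewest migrations, so a direct v1 → v3 edge wins over v1 → v2 → v3 when both are registered.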
If a migration function itself raises during step 3 (chain application), the engine MUST
wrap the raised exception as checkpoint_state_migration_failed (per §10.10) and propagate
it to the caller. The migration's exception is preserved as the cause per the language's
idiom (__cause__ in Python). Subsequent migrations in the chain MUST NOT run; the engine
abandons the chain at the failing migration and the resume attempt fails as a whole.
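A minimal sketch of chain application with the wrapping behaviour described above, assuming the resolved chain is a list of (from_version, to_version, function) triples. The exception class name CheckpointStateMigrationFailed and the apply_chain helper are hypothetical stand-ins for whatever an implementation exposes.

```python
from typing import Any, Callable

MigrationFn = Callable[[dict[str, Any]], dict[str, Any]]


class CheckpointStateMigrationFailed(Exception):
    """Hypothetical exception type for checkpoint_state_migration_failed."""

    def __init__(self, from_version: str, to_version: str) -> None:
        super().__init__(f"migration {from_version!r} -> {to_version!r} raised")
        self.from_version = from_version
        self.to_version = to_version


def apply_chain(
    state: dict[str, Any],
    chain: list[tuple[str, str, MigrationFn]],
) -> dict[str, Any]:
    """Feed each migration's output into the next; stop at the first failure."""
    for from_version, to_version, migrate in chain:
        try:
            state = migrate(state)
        except Exception as exc:
            # `raise ... from exc` stores the user's exception in __cause__.
            # Raising here also guarantees later migrations never run.
            raise CheckpointStateMigrationFailed(from_version, to_version) from exc
    return state
```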
Migrations MUST be pure functions of their input (no I/O, no implicit state, deterministic
output for a given input). The framework does not enforce purity — users who violate the
contract risk non-deterministic resume, but the spec mirrors 0008 §10.5's idempotency
stance: the contract is documented, not policed. The engine MAY consult the migration registry multiple times
during a single resume — for example, when subgraph parent states (§10.2 parent_states)
also need migration. Implementations MUST apply the same chain resolution to each
parent-state entry; in the absence of per-parent version metadata, parent states MUST be
treated as carrying the same schema_version as the outer record. (A future proposal may
add per-parent versioning if subgraph state schemas evolve independently of the outer
schema; for now the outer record's schema_version is authoritative.)
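The parent-state rule can be illustrated as follows, assuming the record is available as a plain dict with state and parent_states keys and that run_chain is a callable applying the already-resolved chain to one serialized state. Both the dict layout and the migrate_record name are assumptions for this sketch.

```python
from typing import Any, Callable

StateDict = dict[str, Any]


def migrate_record(
    record: dict[str, Any],
    run_chain: Callable[[StateDict], StateDict],
) -> dict[str, Any]:
    """Apply one resolved chain to the state and to every parent_states entry."""
    return {
        **record,
        "state": run_chain(record["state"]),
        # No per-parent version metadata exists, so the outer record's
        # schema_version is authoritative and the same chain is reused.
        "parent_states": [
            run_chain(parent) for parent in record.get("parent_states", [])
        ],
    }
```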
10.12.3 No-op when versions match¶
When the loaded record's schema_version equals the current state schema's
schema_version, the engine MUST NOT consult the migration registry; the record is loaded
directly per §10.4. This is the common-case fast path and incurs no migration overhead.
10.12.4 Composition with checkpoint_record_invalid¶
Proposal 0008's checkpoint_record_invalid (§10.10) covers structural incompatibility a
migration cannot fix — e.g., the serialized record itself is corrupt, or the post-migration
state fails the current state class's deserialization. After a migration chain runs, if the
final deserialized state still raises checkpoint_record_invalid, that error propagates
unchanged. Migrations are an opportunity to avoid checkpoint_record_invalid on
schema-version mismatches; they are not a recovery mechanism for arbitrary record
corruption.
If no migrations are registered for a graph and a loaded record's schema_version does not
match the current schema, the engine MUST raise checkpoint_state_migration_missing (the
new category below), NOT checkpoint_record_invalid. Distinguishing the two categories
matters: the former is actionable ("register a migration"); the latter is not ("the record
is broken").
Pipeline-utilities §10.10: New error categories¶
Add to §10.10:
New canonical runtime category:
checkpoint_state_migration_missing: raised on invoke(resume_invocation=X) when the loaded record's schema_version does not match the current state schema's schema_version AND no chain of registered migrations connects the two. Non-transient. The error MUST carry at least the record's schema_version, the current schema's schema_version, and a description of the registered migration set (in a form appropriate to the host language) so the user can see what migrations would need to be added.
New canonical runtime category:
checkpoint_state_migration_failed: raised when a user-supplied migration function raises during chain application (per §10.12.2). Non-transient (a buggy migration is deterministic; retrying without changing the migration code will not succeed). The error MUST carry the failing migration's from_version and to_version, and the underlying exception as cause (per the language's idiom).
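To make the payload requirement for checkpoint_state_migration_missing concrete, here is one possible Python shape. The class name, the attribute names (record_version, current_version, registered), and the category and transient class attributes are all hypothetical; the spec requires only that the three pieces of information be carried in a form appropriate to the host language.

```python
from typing import Iterable


class CheckpointStateMigrationMissing(Exception):
    """Hypothetical exception type for checkpoint_state_migration_missing."""

    category = "checkpoint_state_migration_missing"
    transient = False  # retrying without registering a migration cannot succeed

    def __init__(
        self,
        record_version: str,
        current_version: str,
        registered: Iterable[tuple[str, str]],
    ) -> None:
        self.record_version = record_version
        self.current_version = current_version
        # Sorted copy of the registered (from_version, to_version) pairs.
        self.registered = sorted(registered)
        listing = ", ".join(f"{a} -> {b}" for a, b in self.registered) or "none"
        super().__init__(
            f"no migration chain from {record_version!r} to {current_version!r}; "
            f"registered migrations: {listing}"
        )
```

Listing the registered pairs in the message lets a user see at a glance which edge is absent, for both the empty-registry and the no-path cases.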
Replace the existing §10.10 checkpoint_record_invalid description with:
Canonical runtime category:
checkpoint_record_invalid: raised when Checkpointer.load(X) returns a record whose schema is incompatible with the current graph (state shape mismatch, missing required fields, OR a post-migration state that fails to deserialize against the current state class per §10.12.4). Non-transient.
The "incompatible schema_version" reason from the original §10.10 text is removed;
raw schema_version mismatches now route through checkpoint_state_migration_missing
per §10.12 (or through checkpoint_state_migration_failed if a migration is registered
but raises).
The amended checkpoint_record_invalid category covers structural failures and
post-migration deserialization failures. The three categories are mutually exclusive on
any given resume: the engine evaluates version compatibility first (routing through
checkpoint_state_migration_missing if no chain exists), then applies the chain (routing
through checkpoint_state_migration_failed if a migration raises), then attempts
deserialization (routing through checkpoint_record_invalid if the post-migration state
cannot deserialize).
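The evaluation order described above can be sketched as a single load routine. Everything here is illustrative: the three exception classes are stand-ins for the canonical categories, and load_state, resolve, and deserialize are hypothetical names for the engine's load path, its chain resolver, and the current state class's deserialization step.

```python
from typing import Any, Callable, Optional

StateDict = dict[str, Any]
MigrationFn = Callable[[StateDict], StateDict]
Chain = list[tuple[str, str, MigrationFn]]


class MigrationMissing(Exception):
    """Stands in for checkpoint_state_migration_missing."""


class MigrationFailed(Exception):
    """Stands in for checkpoint_state_migration_failed."""


class RecordInvalid(Exception):
    """Stands in for checkpoint_record_invalid."""


def load_state(
    record: dict[str, Any],
    current_version: str,
    resolve: Callable[[str, str], Optional[Chain]],
    deserialize: Callable[[StateDict], Any],
) -> Any:
    state = record["state"]
    stored_version = record["schema_version"]
    # Fast path (§10.12.3): matching versions never touch the registry.
    if stored_version != current_version:
        # 1. Version compatibility: is there a chain at all?
        chain = resolve(stored_version, current_version)
        if chain is None:
            raise MigrationMissing(f"{stored_version} -> {current_version}")
        # 2. Chain application: a raising migration aborts the resume.
        for from_version, to_version, migrate in chain:
            try:
                state = migrate(state)
            except Exception as exc:
                raise MigrationFailed(f"{from_version} -> {to_version}") from exc
    # 3. Deserialization against the current state class.
    try:
        return deserialize(state)
    except Exception as exc:
        raise RecordInvalid("state does not deserialize") from exc
```

Because each stage either returns or raises before the next begins, at most one of the three categories can surface on a given resume.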
Cross-spec touchpoints¶
This proposal does not modify graph-engine. The state-class declaration of
schema_version is a per-language ergonomic surface that does not require a graph-engine
spec change.
This proposal does not modify observability. Migration runs SHOULD be visible in the §6
observer stream so the OTel mapping (per observability §5) can surface them as spans
during resume, but the exact event shape is left to the implementation. A span like
openarmature.checkpoint.migrate with attributes for from_version, to_version, and
the chain length is the recommended shape. This is SHOULD rather than MUST because
migrations run at most once per resume and the observability overhead is negligible
either way; implementations choosing not to emit are accepting the loss of migration
visibility in their trace UI.
This proposal does not modify llm-provider.
Conformance test impact¶
Add fixtures under spec/pipeline-utilities/conformance/. Each fixture is a pair
(NNN-name.yaml + NNN-name.md) per the conformance README:
- 0NN-state-migration-additive-field.yaml: state class declares schema_version = "v2". A saved record exists at schema_version = "v1" carrying a state that lacks an optional field added in v2. One migration registered: v1 → v2 populates the new field with its default. Call invoke(resume_invocation=...); assert the migration runs once, the resumed invocation sees the populated default, and execution proceeds normally.
- 0NN+1-state-migration-chain.yaml: state class at schema_version = "v3". A saved record exists at v1. Two migrations registered: v1 → v2 and v2 → v3. Assert both run in order on resume and the resumed invocation sees the final v3 shape.
- 0NN+2-state-migration-missing.yaml: state class at v2, saved record at v1, no migrations registered. Assert resume raises checkpoint_state_migration_missing (NOT checkpoint_record_invalid); assert the error carries from_version=v1, to_version=v2, and an empty migration-set description.
- 0NN+3-state-migration-versions-match-no-op.yaml: record at v2, state class at v2. Assert resume does NOT consult the migration registry (no migration runs, no migration event fires) and the record loads via the §10.4 fast path.
- 0NN+4-state-migration-parent-states-migrated.yaml: saved record was taken at a subgraph-internal save point; parent_states is populated with one outer-graph state at v1. State class at v2 with one registered v1 → v2 migration. Assert the migration runs once for the outer record's state AND once for each entry in parent_states, and the resumed subgraph re-enters correctly with the migrated parent state.
- 0NN+5-state-migration-post-migration-deserialization-fails.yaml: record at v1, state class at v2, registered v1 → v2 migration produces output that does not match the v2 state class's deserialization contract (e.g., a required field is missing). Assert resume raises checkpoint_record_invalid (per §10.10's existing contract), NOT checkpoint_state_migration_missing. Verifies the §10.12.4 category-distinction rule.
- 0NN+6-state-migration-no-path-in-registry.yaml: state class at v2, saved record at v1, migrations registered but none form a chain from v1 to v2 (e.g., a v3 → v4 migration is registered, unrelated to the v1 → v2 path). Assert resume raises checkpoint_state_migration_missing (same category as the empty-registry case); assert the error carries from_version=v1, to_version=v2, and a migration-set description listing the registered (but unhelpful) migrations so the user can see what IS available. Complements 0NN+2 to verify the error category surfaces uniformly across both empty and no-path-found registry states.
- 0NN+7-state-migration-function-raises.yaml: state class at v2, saved record at v1, registered v1 → v2 migration function raises a KeyError mid-execution (simulating a buggy user-supplied migration). Assert resume raises checkpoint_state_migration_failed; assert the error exposes the underlying KeyError as cause, and carries from_version=v1 and to_version=v2. Verifies the §10.12.2 contract that a raising migration aborts the chain and propagates as the dedicated category (NOT checkpoint_record_invalid).
(Fixture numbering deferred until proposals 0009 and 0011 are Accepted with finalized fixture numbering; this proposal's accept PR will pick the next available slot.)
Alternatives considered¶
Force users to embed migration logic inside the state class's deserialization¶
Rejected. State classes are user-domain types; pushing migration logic into their
deserialization hooks (e.g., __init_subclass__ magic, custom model_validator in
Pydantic) couples the state schema to its full version history and bloats every state
class with code that runs only during resume. A separate registration surface keeps
migrations localized to the graph build site, where they belong alongside the rest of
the graph's configuration.
Auto-discover migrations by inspecting state-class fields¶
Rejected. "Auto-migration" (e.g., "v1 has fields A, B; v2 has fields A, B, C: auto-add C with its default") works for additive changes but fails on field renames, shape changes, and any semantic transformation. Mandating auto-discovery in the spec would under-serve any non-trivial migration. Not mandating it, while allowing implementations to ship one as a convenience layer over the registration surface, is the right level of constraint. The spec specifies the explicit registration; user libraries or per-implementation conveniences can layer auto-discovery on top.
Use semver for schema_version and resolve chains by version arithmetic¶
Rejected. Semver constrains the version identifier syntax; users with non-semver versioning (date stamps, content hashes, integer counters) would have to translate. Chain resolution via graph search over registered edges is identifier-agnostic and imposes no syntactic constraint. Users who want semver discipline can adopt it without the spec requiring it.
Bind migrations to the Checkpointer protocol instead of the compiled graph¶
Rejected. Migrations are state-schema concerns (they transform user state, not backend storage), and they vary across graphs even when the same Checkpointer is shared (two graphs sharing a SQLite store may have different state schemas evolving on different cadences). Binding to the compiled graph keeps the concern local to the graph build site; the Checkpointer remains a pure storage layer.
Persist a record-shape schema_version (backend-internal) separately from the state-schema schema_version (user-facing)¶
Considered. Proposal 0008's original phrasing of schema_version ("lets backends evolve
the record shape without breaking older saved records") suggested a backend-internal
field. This proposal repurposes the field to the user-facing meaning instead of adding a
second field. Rationale: the backend-internal record-shape evolution is already
addressable inside the backend's own deserialization step (a SQLite backend can stamp
its rows with a backend version and migrate them on read without exposing this to the
framework); the user-facing state-schema evolution is the case that needs a
spec-mandated registration surface. Sharing one field for the user-facing meaning is
cleaner. If a backend later needs a distinct record-shape version, it can be added
without affecting the user-facing migration surface.
Open questions¶
None at time of submission.