Retrieval Provider¶
Canonical behavioral specification for the OpenArmature retrieval-provider abstraction.
- Capability: retrieval-provider
- Introduced: spec version 0.54.0
- History:
- created by proposal 0059
- rerank protocol added by proposal 0060
This specification is language-agnostic. Each implementation (Python, TypeScript, …) maps its own idioms
onto the behavioral contract described here. Conformance is verified by the fixtures under conformance/.
Normative keywords (MUST, MUST NOT, SHOULD, MAY) are used per RFC 2119.
1. Purpose¶
The retrieval-provider capability is the home for retrieval-primitive provider operations that sit
alongside LLM completion. The first protocol surface this specification defines is embedding —
turning a list of input strings into a list of vectors via an EmbeddingProvider. A sibling
RerankProvider protocol (§5) re-scores a list of candidate documents against a query, returning
them sorted by relevance — the second retrieval-primitive surface on the same capability.
Retrieval-provider is a sibling capability to llm-provider (per proposal 0006),
not a subtype of it. Embedding and LLM completion both bind per-instance to a model identifier per
the llm-provider §5 contract, but embedding model identifiers (text-embedding-3-small,
voyage-3, embed-multilingual-v3.0, etc.) live in disjoint namespaces from completion model
identifiers (gpt-4o-mini, claude-3-5-sonnet, etc.). A single Provider abstraction bundling both
surfaces would either contradict the per-model-binding contract OR carve a different contract shape
for the same protocol — both bad. Separate protocols preserve per-model-binding while opening a
path to observable embedding calls.
Retrieval-provider is one of a planned family of <domain>-provider capabilities (llm-provider,
retrieval-provider, plus future siblings as downstream demand surfaces). Each domain capability
covers related-shape provider operations under a narrow protocol surface; new domains land as new
capabilities rather than as extensions to existing ones. This keeps per-capability protocol surface
narrow and per-domain evolution independent.
The substrate is intentionally narrow, matching llm-provider's posture:
- An
EmbeddingProvideris stateless. It does not cache vectors; the caller passes the full input list on every call. - An
EmbeddingProviderdoes not handle retry, rate limiting, fallback, or routing. Those are pipeline-utilities concerns and compose above the provider via middleware. - An
EmbeddingProvideris bound to a single model identifier. Switching models means constructing a new provider, not passing a different argument per call.
Transparency. Per charter §3.1 principle 8 ("Transparency over abstraction"), the embedding
abstraction surfaces a normalized shape — EmbeddingResponse, EmbeddingUsage — without hiding
what the underlying provider returned. The EmbeddingResponse.raw field carries the parsed provider
response verbatim alongside the normalized fields, and the error categories preserve the underlying
provider exception as cause.
2. Concepts¶
RetrievalProvider. The umbrella term covering both EmbeddingProvider and RerankProvider. Not
a concrete protocol itself; used as the capability-level descriptor when discussing cross-protocol
concerns (observability, error semantics, per-model binding).
EmbeddingProvider. An object that, given a sequence of input strings, returns a sequence of
vectors wrapped in an EmbeddingResponse. Bound to a specific embedding model identifier per
instance.
EmbeddingResponse. The result of an embed() call: the vectors, the model identifier, usage
information, and (when present) the provider-returned request identifier and the parsed raw
response.
EmbeddingUsage. A usage record carrying input_tokens only — embedding has no output tokens
(vectors aren't tokens).
Embedding runtime config. A RuntimeConfig-shaped record (parallel to llm-provider §6) carrying
embedding-specific caller-supplied request parameters: an optional dimensions field (for callers
controlling output vector size on providers that support it), an optional input_type field, and
the extras-pass-through bag for vendor-specific knobs. input_type ("query" / "document", an
extensible string) declares what the embedded text is for, so a provider bound to an asymmetric model
applies the model-appropriate query/document treatment per its §8 wire mapping; absent ⇒ symmetric
embedding (a symmetric model ignores it). Additional well-known values ("classification",
"clustering", …) MAY be recognized by mappings whose backend supports them; an unrecognized value is
provider_invalid_request (§7) at a mapping that declares a closed set, or passed through by a mapping
that accepts arbitrary types. Free-form per-model instruction overrides ride the extras-pass-through
bag (no second declared field).
RerankProvider. An object that, given a query string and a list of candidate documents, returns the documents sorted by query-relevance with provider-specific scores. Bound to a specific rerank model identifier per instance.
RerankResponse. The result of a rerank() call: the sorted scored documents, the model
identifier, usage information, and (when present) the provider-returned response identifier and the
parsed raw response.
RerankUsage. A usage record with optional search_units and optional input_tokens, reflecting
the messy provider landscape where rerank pricing surfaces vary widely.
ScoredDocument. A single result entry carrying the document's original index in the input list, the provider-assigned relevance score, and (optionally) an echo of the document text.
Rerank runtime config. A RuntimeConfig-shaped record (parallel to llm-provider §6) carrying
rerank-specific caller-supplied request parameters. Initially minimal: one declared field,
return_documents (boolean, default False), controlling whether the provider echoes document text
on each ScoredDocument in the response. The field name + default match the major rerank vendors'
wire-shape parameter (Cohere and Voyage AI expose return_documents defaulting False; Jina AI's wire default is True, so the §8.2 Jina mapping sends the OA value explicitly);
per-vendor wire-format mappings pin the source-side translation where a vendor diverges. Plus the
extras-pass-through bag for vendor-specific knobs.
3. EmbeddingProvider protocol¶
The EmbeddingProvider protocol exposes two async operations.
ready()¶
A no-argument async operation that verifies the provider can serve requests against the bound model. Implementations MAY surface a hosted-provider authentication check, a local-model load attempt, or a noop — whichever fits the backend. The operation:
- MUST be idempotent. Repeated
ready()calls MUST NOT change observable provider state. - MUST surface
provider_invalid_model(per §7) if the bound embedding model is not recognized by the backend. - MUST surface
provider_model_not_loaded(per §7) if the bound model is recognized but not currently usable (e.g., a local model that requires explicit loading). - MAY raise other §7 error categories (
provider_authentication,provider_unavailable) when the underlying readiness check encounters those conditions.
embed(input, *, config=None)¶
The embedding operation. Parameters:
| Parameter | Type | Description |
|---|---|---|
input |
list of strings | The input strings to embed. MUST be a list even for single-string callers (callers wrap as a one-element list). Matches complete()'s "always a list" message contract. |
config |
optional EmbeddingRuntimeConfig (keyword-only) |
Caller-supplied embedding runtime config. Keyword-only (or per-language idiomatic equivalent that prevents positional confusion with input). |
Returns an EmbeddingResponse (§4).
The embed() operation:
- MUST be stateless. Repeated calls with the same
inputandconfigMUST NOT change observable provider state. - MUST raise one of the §7 error categories on failure. The §7 enumeration is shared cross- capability with llm-provider §7; embedding-applicable subset documented in §7.
- MUST preserve input order in the response (the vector at index
iMUST be the embedding ofinput[i]). Implementations MUST NOT permute vector position relative to input position. - MUST NOT loop, retry, or fall back. Pipeline-utilities §6 middleware and per-call retry compose above the provider for those concerns.
Query vs. document (input_type). When the caller sets EmbeddingRuntimeConfig.input_type (§2),
the provider applies the model-appropriate query/document treatment per its §8 wire mapping.
input_type is a request-side parameter: it flows into EmbeddingEvent.request_params (graph-engine
§6) with the same absence-is-meaningful semantics as dimensions, and is surfaced on the embedding
span as the openarmature.embedding.input_type attribute (observability §5.5.8). Absent ⇒
the symmetric default.
Per-instance model binding¶
Per-instance model binding follows the llm-provider §5 contract exactly. Implementations bind one
embedding model identifier per provider instance via a constructor parameter (or per-language
idiomatic equivalent). The bound identifier is visible to the observability layer (per observability §5.5) as the
gen_ai.request.model attribute on the embedding span.
4. EmbeddingResponse and EmbeddingUsage shapes¶
EmbeddingResponse¶
| Field | Description |
|---|---|
vectors |
List of vectors (each a list of floats); one vector per input string in the order the inputs were supplied. The length of vectors MUST equal the length of input. |
model |
The model identifier the provider returned. MAY be a more specific identifier than the one the provider was bound against. |
usage |
An EmbeddingUsage record (defined below). |
response_id |
The provider-returned response identifier when present; null otherwise. Matches the OTel GenAI semconv gen_ai.response.id attribute (per observability §5.5.8) and the typed EmbeddingEvent.response_id field (per graph-engine §6). |
dimensions |
Int. The output vector dimensionality. MUST equal the length of each inner list in vectors. Derivable from vectors[0] but kept on the response for ergonomics and cross-vendor consistency. |
raw |
The parsed provider response, as a language-idiomatic representation of deserialized JSON (Python: dict[str, Any]; TypeScript: Record<string, unknown>). MUST be populated on every successful return. Parallel to llm-provider §6 Response.raw. |
EmbeddingUsage¶
| Field | Description |
|---|---|
input_tokens |
Int. Tokens billed for the embedding call. Always reported (no output_tokens — vectors aren't tokens). |
Cross-impl invariants¶
- Exactly one vector per input string (the length of
vectorsMUST match the length ofinput). - Vector position is keyed by input order; implementations MUST NOT permute.
- All vectors in a single response have the same dimensionality. Implementations MUST verify this
on the response and raise
provider_invalid_response(§7) if violated. - The
dimensionsfield on the response MUST equal the dimensionality of each inner vector — cross-check invariant for adapters. - Implementations MUST raise
provider_invalid_response(§7) when the response carries a mismatched count of vectors vs. input strings.
5. RerankProvider protocol¶
The RerankProvider protocol exposes two async operations, mirroring EmbeddingProvider (§3) with
rerank-specific shapes.
ready()¶
A no-argument async operation that verifies the provider can serve requests against the bound rerank
model. Same idempotency + error-surfacing contract as §3's ready():
- MUST be idempotent. Repeated
ready()calls MUST NOT change observable provider state. - MUST surface
provider_invalid_model(per §7) if the bound rerank model is not recognized by the backend. - MUST surface
provider_model_not_loaded(per §7) if the bound model is recognized but not currently usable. - MAY raise other §7 error categories (
provider_authentication,provider_unavailable) under the same conditions as §3'sready().
rerank(query, documents, *, top_k=None, config=None)¶
The rerank operation. Parameters:
| Parameter | Type | Description |
|---|---|---|
query |
string | The query string the documents are scored against. MUST be non-empty; an empty query raises provider_invalid_request (§7) at the pre-send validation layer. |
documents |
list of strings | The candidate documents to score against the query. MUST be a list (single-document callers wrap as a one-element list — matches the embedding protocol's "always a list" framing). MUST be non-empty; an empty document list raises provider_invalid_request (§7). |
top_k |
optional int (keyword-only) | The maximum number of results the caller wants returned. None means "all" (the provider MAY return up to len(documents) results). MUST be positive when supplied; zero or negative raises provider_invalid_request (§7). MAY exceed len(documents) — the provider returns at most len(documents) results regardless. |
config |
optional RerankRuntimeConfig (keyword-only) |
Caller-supplied rerank runtime config. Keyword-only (or per-language idiomatic equivalent that prevents positional confusion). |
Returns a RerankResponse (§6).
The rerank() operation:
- MUST be stateless. Repeated calls with the same
query/documents/top_k/configMUST NOT change observable provider state. - MUST raise one of the §7 error categories on failure.
- MUST return results sorted by
relevance_scoredescending (most relevant first). - MUST preserve each result's
indexfield as the position in the inputdocumentslist, so callers can map sorted results back to their original documents. - MUST NOT loop, retry, or fall back. Pipeline-utilities §6 middleware and per-call retry compose above the provider.
Per-instance model binding¶
Per-instance model binding follows the llm-provider §5 contract exactly, identical to §3's framing.
Implementations bind one rerank model identifier per provider instance via a constructor parameter
(or per-language idiomatic equivalent). The bound identifier is visible to the observability layer
(per observability §5.5) as the gen_ai.request.model attribute on the rerank span.
6. RerankResponse, RerankUsage, ScoredDocument shapes¶
RerankResponse¶
| Field | Description |
|---|---|
results |
List of ScoredDocument entries sorted by relevance_score descending (most relevant first). len(results) is at most min(top_k, len(documents)) when top_k is supplied; at most len(documents) otherwise. MAY be shorter than that bound if the provider returns fewer results (e.g., relevance-threshold filtering on the provider side). |
model |
The model identifier the provider returned. MAY be a more specific identifier than the one the provider was bound against. |
usage |
A RerankUsage record (defined below). |
response_id |
The provider-returned response identifier when present; null otherwise. Matches the OTel GenAI semconv gen_ai.response.id attribute (per observability §5.5.13) and the typed RerankEvent.response_id field (per graph-engine §6). |
raw |
The parsed provider response, as a language-idiomatic representation of deserialized JSON (Python: dict[str, Any]; TypeScript: Record<string, unknown>). MUST be populated on every successful return. Per charter §3.1 principle 8 ("Transparency over abstraction") — callers retain access to provider-specific fields the normalized shape doesn't surface. Parallel to llm-provider §6 Response.raw and §4 EmbeddingResponse.raw. |
ScoredDocument¶
| Field | Description |
|---|---|
index |
Int. The 0-based position of this document in the original input documents list. Load-bearing for caller-side lookup — callers MUST be able to map a result back to its input document via documents[result.index]. Implementations MUST preserve this verbatim from the provider response. |
relevance_score |
Float. The provider-assigned relevance score; higher = more relevant. Provider-specific scale — most providers normalize to [0.0, 1.0] but the spec does NOT pin a scale. Cross-provider score comparisons are NOT meaningful. |
document |
The echoed document text when the provider returns it; null otherwise. Implementations MUST surface the provider's echo verbatim when present; MUST NOT fabricate the echo from the input documents list when the provider omits it (the provider's echo and the caller's input are two different surfaces; conflating them would mask provider-side document transformations like deduplication or truncation). |
RerankUsage¶
| Field | Description |
|---|---|
search_units |
Int or null. The provider-reported count of "search units" billed for this call. Populated for providers that surface it (e.g., Cohere); null otherwise. |
input_tokens |
Int or null. The provider-reported count of input tokens (query + concatenated documents). Populated for providers that surface it (e.g., Voyage AI); null otherwise. |
Both fields default to null. Implementations MUST populate the field when the provider returns a
corresponding value and MUST NOT fabricate one when the provider omits it. A RerankUsage with both
fields null is valid and represents the "provider reports no billing surface" case.
Cross-impl invariants¶
resultsare sorted byrelevance_scoredescending. Implementations that receive an unsorted provider response MUST sort before returning (some provider SDKs pre-sort; some don't).- Each result's
indexMUST be a valid index into the inputdocumentslist (0 <= index < len(documents)). Implementations MUST raiseprovider_invalid_response(§7) when the provider returns an out-of-range index. - The same
indexMUST NOT appear twice inresults. Implementations MUST raiseprovider_invalid_response(§7) on duplicate-index responses. - When
top_kis supplied,len(results) <= top_k. Implementations MUST raiseprovider_invalid_response(§7) if the provider returns more results than requested. - When the provider returns
documentechoes for some results but not others, implementations MUST preserve the per-result variance (null where the provider omitted; populated where the provider echoed). MUST NOT auto-fill from the inputdocumentslist.
7. Error semantics¶
The retrieval-provider capability inherits the llm-provider §7 error-category enumeration. The same nine normative categories are available to both embedding and rerank calls. The retrieval-applicable subset (the §7 categories minus the LLM-completion-specific ones), shared by both protocols, is:
provider_authentication— credentials missing, invalid, or revoked.provider_unavailable— transport failure, provider-side outage, timeout.provider_invalid_model— bound model identifier not recognized by the provider.provider_model_not_loaded— model recognized but not currently usable.provider_rate_limit— provider-side rate limit signaled.provider_invalid_response— provider returned a malformed response: missing required fields, or a violation of the capability's cross-impl invariants (embedding §4 — mismatched vector count, inconsistent dimensions; rerank §6 — out-of-range or duplicateindex, more results thantop_k).provider_invalid_request— caller-supplied input failed pre-send validation (embedding: empty input list, invaliddimensions; rerank: emptyquery, emptydocumentslist,top_k <= 0).
The following llm-provider §7 categories do NOT apply to embedding or rerank:
provider_unsupported_content_block— both take strings, not content blocks.structured_output_invalid— neither has aresponse_schema.
The exception-flow contract from llm-provider §7 applies identically: the error category exception
MUST raise out of embed() / rerank() whether raised by the provider or by the implementation's
pre-send validation layer.
8. Wire-format mappings¶
Wire mappings are per-vendor / per-runtime realizations of the runtime-agnostic EmbeddingProvider /
RerankProvider contracts (§3 / §5) — the retrieval-provider analogue of llm-provider §8. Each mapping
pins the wire shapes, the construction parameters (e.g. base_url), and the per-mapping realization of
cross-vendor knobs (input_type). Mappings are normative: a conforming implementation of a given
mapping MUST produce the wire requests and consume the wire responses described here.
8.1 TEI (Text Embeddings Inference)¶
HuggingFace Text Embeddings Inference is a self-hosted serving runtime. Its gen_ai.system identifier
is "tei" (per observability §5.5.8 / §5.5.13 — identify the wire surface, not the model developer).
The /embed and /rerank wire shapes below were verified against the TEI OpenAPI; docs/compatibility.md
records the verified version.
Construction (two separate instances). TEI hosts one model per instance, and embedding models and
cross-encoder rerankers are different model families — so a TEI EmbeddingProvider and a TEI
RerankProvider are distinct provider instances against distinct TEI deployments, each binding its own
base_url (§3 / §5 per-instance binding):
- the TEI
EmbeddingProviderbindsbase_url(the embedding instance) + the bound model + aninput_type→prompt_namemap (e.g.{query: "query", document: "passage"}) realizing asymmetric embedding via TEI's native server-side prompts, with OPTIONAL client-sidequery_prefix/document_prefixstrings as the fallback for models without configured prompts; - the TEI
RerankProviderbindsbase_url(the reranker instance) + the bound model +chunk_size(the rerank client-batch chunk size, default32— see Mandatory rerank batch chunking).
The spec does NOT enumerate per-model prefixes (model-specific, a moving target) — they are operator-supplied at construction.
/embed. POST {base_url}/embed with {"inputs": [str]} (TEI accepts a string or array; the
mapping always sends the array form per §3's "always a list"); EmbeddingRuntimeConfig.dimensions maps
to TEI's dimensions field when set. The response is the vector array, in input order.
input_type realization: the mapping sends TEI's native prompt_name field, looked up from the
construction input_type → prompt_name map, so TEI applies the model's configured query/document
prompt server-side (the idiomatic path — TEI models carry named prompts in their config). For a
model without configured prompts, the mapping MAY instead prepend the construction-supplied
query_prefix / document_prefix client-side. Either way, input_type absent ⇒ no prompt and no
prefix (the symmetric default).
/rerank. POST {base_url}/rerank with {"query": str, "texts": [str], "truncate": false, "return_text": <bool>}.
TEI's texts: [str] maps directly onto documents: list[str] (§5; no per-document object wrapping);
return_documents (§2 rerank runtime config) → TEI's return_text (default false), surfacing the
echoed text on ScoredDocument.document. The response [{"index": int, "score": float, "text"?: str}]
maps onto results (§6): index → ScoredDocument.index, score → relevance_score, text →
document. Scores are normalized by default (raw_scores: false); the scale is model-specific (§6
pins none). TEI does not guarantee response sort order, so the mapping MUST sort per §6's "sort if the
provider didn't" invariant — subsumed by the chunk-and-stitch global re-sort below.
Mandatory rerank batch chunking. TEI enforces max-client-batch-size (server-configured, default
32). When len(documents) exceeds the instance's chunk_size, the mapping MUST split the documents
into consecutive ≤chunk_size chunks, issue one /rerank request per chunk (same query), and stitch
the results: re-base each chunk's index to its absolute position in the original documents list,
concatenate all (index, score) pairs, then apply §6's contract — sort by score descending and honor
top_k. This is valid because a cross-encoder scores each (query, document) pair independently of the
others in its batch. chunk_size is a construction parameter, default 32 (TEI's documented default;
an operator who lowered --max-client-batch-size sets it to match; an implementation MAY auto-detect
from TEI's /info). A mapping that does not chunk MUST NOT silently send an over-cap request; chunking
is required, not optional.
truncate: false (fail-loud). TEI's truncate defaults to false on both endpoints, so an
over-length input errors rather than being silently truncated (model context caps vary). The /rerank
mapping sends truncate: false explicitly (leaving TEI's truncation_direction default, Right) —
the surface where a silently truncated (query, document) pair would yield a wrong relevance score; the
resulting TEI error (HTTP 413 / 422) maps to provider_invalid_request (§7). The /embed mapping
relies on TEI's false default and does not add truncate to the request, keeping the body minimal
({inputs[, prompt_name][, dimensions]}) and byte-identical to the symmetric path when input_type is
absent.
Errors. TEI HTTP / transport failures map to the §7 categories per the shared enumeration:
connection / 5xx → provider_unavailable; unknown model → provider_invalid_model; over-length /
malformed request (413 / 422) → provider_invalid_request; malformed response →
provider_invalid_response.
8.2 Jina¶
Jina AI is a hosted retrieval API. Its gen_ai.system identifier is "jina" (per observability §5.5.8 / §5.5.13 —
identify the wire surface, not the model developer). The wire shapes below were verified against the
Jina OpenAPI; docs/compatibility.md records the verified version.
Construction. A Jina provider instance binds an API key (sent as Authorization: Bearer <key>)
+ the bound model identifier (§3 / §5 per-instance binding), with base_url defaulting to
https://api.jina.ai (origin only — the /v1 version stays in the route; override for a proxy /
private gateway). A Jina EmbeddingProvider (/v1/embeddings) and a Jina RerankProvider
(/v1/rerank) are distinct instances (one model each) sharing the hosted endpoint.
/v1/rerank. POST {base_url}/v1/rerank with
{"model": str, "query": str, "documents": [str], "top_n"?: int, "return_documents": <bool>, "truncation": false}.
documents ← documents (§5); top_n ← top_k (§5); return_documents ← the
RerankRuntimeConfig.return_documents value, sent explicitly — Jina's wire default is true, but
OA's default is False (§2), so the mapping sends the OA value rather than relying on Jina's default.
The response {model, usage: {total_tokens}, results: [{index, relevance_score, document?}]} maps onto
results (§6): index → ScoredDocument.index, relevance_score → relevance_score, document →
document; usage.total_tokens → RerankUsage.input_tokens (Jina meters rerank by tokens, not search
units). Results are returned ranked; the mapping applies §6's "sort if the provider didn't" invariant
regardless.
/v1/embeddings. POST {base_url}/v1/embeddings with
{"model": str, "input": [str], "task"?: str, "dimensions"?: int, "truncate": false}. input_type
realization: the mapping sets Jina's native task from input_type — "query" → "retrieval.query",
"document" → "retrieval.passage" — so Jina applies the model-appropriate query/passage representation
server-side; input_type absent ⇒ task omitted. The mapping recognizes a closed input_type set
(query / document); an unrecognized value is a pre-send provider_invalid_request (§7). Jina's other
task values (e.g. text-matching, classification, clustering — model-dependent) are reached via
the extras-pass-through bag, not input_type (widening input_type's normative value space is a
protocol-level change, deferred until a consumer needs it). EmbeddingRuntimeConfig.dimensions → Jina's
dimensions (Matryoshka) when set. The response {model, usage, data: [{index, embedding}]} maps to the
EmbeddingResponse vectors in input order.
truncation / truncate (fail-loud). Jina names the flag truncation on /v1/rerank and
truncate on /v1/embeddings (vendor inconsistency); the mapping sends the per-endpoint flag false
so an over-length input errors rather than being silently truncated (consistent with §8.1's TEI
fail-loud posture).
Errors. Jina HTTP failures map to the §7 categories per the shared enumeration: 401 →
provider_authentication; 429 (rate limit) → provider_rate_limit; 5xx → provider_unavailable;
unknown model (404) → provider_invalid_model; over-length / malformed request (422) →
provider_invalid_request; malformed response → provider_invalid_response.
8.3 OpenAI-compatible embeddings¶
The OpenAI /v1/embeddings wire is the de-facto-standard embedding API, exposed by OpenAI and the
OpenAI-compatible serving ecosystem (vLLM, LocalAI, Together, TEI's own OpenAI-compatible endpoint, …).
gen_ai.system is "openai" — identifying the wire surface, not the backing deployment (per
§5.5.8 / §5.5.13; a vLLM / LocalAI backend reached through this wire is still the OpenAI wire surface).
Wire shapes verified against the OpenAI OpenAPI; docs/compatibility.md records the verified version.
Embeddings-only — OpenAI exposes no rerank API, so this mapping has no RerankProvider counterpart.
Construction. An OpenAI-compatible EmbeddingProvider binds an API key (sent as
Authorization: Bearer <key>) + the bound model identifier (§3 / §5 per-instance binding), with
base_url defaulting to https://api.openai.com (origin only — the /v1 version stays in the route,
consistent with §8.1 / §8.2) and overridable for any OpenAI-compatible backend. It MAY additionally
bind the optional client-side query_prefix / document_prefix from §8.1 — off by default
(pure-symmetric OpenAI), set only for an asymmetric model served behind a compatible endpoint (see
input_type below).
/v1/embeddings. POST {base_url}/v1/embeddings with
{"model": str, "input": [str], "dimensions"?: int}. input is always the array form (§3's "always a
list"); EmbeddingRuntimeConfig.dimensions → wire dimensions (Matryoshka, on models that support it)
when set. The mapping does not send encoding_format by default (OpenAI's wire default is
"float"); "base64" rides the extras-pass-through bag. The response
{object: "list", data: [{object: "embedding", index, embedding}], model, usage: {prompt_tokens, total_tokens}}
maps to the EmbeddingResponse vectors in input order — the mapping consumes data + usage (the
object fields are OpenAI wire metadata); usage.prompt_tokens → EmbeddingUsage.input_tokens
(embedding has no output tokens, so total_tokens equals prompt_tokens).
input_type (symmetric base wire; client-side prefix for asymmetric). The OpenAI /v1/embeddings
wire has no query/document parameter, so on the base wire input_type is not realized — an absent
input_type is the correct symmetric default for OpenAI's symmetric models (e.g. text-embedding-3),
and the mapping does not error on it. For an asymmetric model served behind a compatible endpoint
(e.g. a BGE / E5 model on vLLM), the mapping applies the client-side prefix from §8.1: when
query_prefix / document_prefix are bound at construction, input_type selects which to prepend
before sending — the only way to express the distinction on a wire with no input_type field. A server
that extends the wire with its own input_type-style field instead takes it through the
extras-pass-through bag.
Errors. HTTP failures map to the §7 categories per the shared enumeration: 401 →
provider_authentication; 429 (rate limit) → provider_rate_limit; 5xx → provider_unavailable;
unknown model (404) → provider_invalid_model; malformed / oversized request (400) →
provider_invalid_request; malformed response → provider_invalid_response.
9. Determinism¶
Embedding model determinism guarantees vary by provider. This specification MUST NOT assume bit-identical vectors for equivalent inputs across calls — providers MAY return slightly different vectors for the same input (model-version updates, server-side non-determinism, etc.).
Embedding-aware caches keyed on input strings MAY apply per the provider's documented determinism guarantees but are NOT a spec contract. A future proposal MAY define a cache-attribute family analogous to proposal 0047's LLM prefix-cache attributes; out of scope for v1.
Rerank determinism. Rerank guarantees are similar to embedding's: providers MAY return slightly
different scores for the same (query, documents) pair across calls. This specification MUST NOT
assume bit-identical responses for equivalent inputs. Even when scores are identical bit-for-bit, two
documents with identical scores MAY appear in either order across calls (provider implementation
detail) unless the provider documents a tie-breaking rule; the spec MUST NOT assume one.
10. Cross-spec touchpoints¶
- graph-engine §6 — typed observer events
EmbeddingEvent/EmbeddingFailedEvent(embedding) andRerankEvent/RerankFailedEvent(rerank). See graph-engine §6 for the full event surface and dispatch contract. - observability §5.5 — OTel mapping for embedding spans (§5.5.8) and rerank spans (§5.5.13): the
core GenAI semconv subset (per the §5.5 GenAI de-facto-standard carve-out) plus OA-namespace
openarmature.embedding.*/openarmature.rerank.*attributes; span namesopenarmature.embedding.complete/openarmature.rerank.completediscriminate the operation. - observability §8 — Langfuse mapping using Langfuse's dedicated
Embedding(§8.4.5) andRetriever(§8.4.7) observation types. - observability §5.5.4 — observer-level privacy flag
disable_provider_payload(renamed fromdisable_llm_payloadby proposal 0059) gates payload from any provider call, including embedding payload (input_strings,request_extras, the Langfuseoutputvectors) and rerank payload (query,documents, the result document echoes). - llm-provider §7 — error-category enumeration (inherited).
- pipeline-utilities §6 (middleware) —
EmbeddingProviderandRerankProvidercalls are eligible for retry middleware identically tocomplete()calls.
11. Out of scope¶
Not covered by this specification; deferred to follow-on capabilities or proposals:
- Multi-modal embedding and rerank — image / audio documents. Text-only in v1.
- Further per-vendor and per-runtime wire-format mappings. Beyond §8.1 (TEI), follow-on proposals
add concrete vendor / runtime mappings — embedding (OpenAI, Cohere, Voyage, Jina) and rerank
(Cohere, Voyage, Jina hosted) — each pinning the per-vendor wire sourcing for fields the protocol
leaves position-agnostic (e.g., where
response_idis surfaced in that vendor's response shape). - Per-SDK implementation details — httpx batching strategies, provider-layer retry timing, SDK-specific error mapping. Provider-internal choices.
- Caller-supplied determinism / seeding. Embedding and rerank models rarely expose seeds; not v1.
- Cross-call observability correlation (e.g., "this rerank call used vectors from that embedding call"). Each call is independent at the protocol layer; any cross-call correlation lives in node-body code.
- Embedding / rerank result caching at the framework level. Caching is an application concern.
- Streaming embeddings and streaming rerank. Some providers stream results for very long inputs / large result sets; not v1.
- Score normalization across providers. Each rerank provider's relevance scale is surfaced as-returned; the spec does NOT define a normalization layer.
- Hybrid retrieval (embedding + rerank in one call). No provider exposes this as a single protocol-level operation; the two stages are always separate calls.
gen_ai.operation.nameadoption for rerank. Deferred per the stable-only upstream adoption policy; a follow-on proposal adds it when upstream reaches Stable with a rerank-applicable well-known value.