---
layout: default
title: "Ontology Analytics Service Technical Specification"
parent: "Tech Specs"
---
# Ontology Analytics Service Technical Specification
## Overview
This specification proposes extracting ontology analytics — the embedding,
vector storage, and similarity-based selection of ontology elements — out of
`kg-extract-ontology` and into a separate, reusable service. The goal is to
make ontology analytics available to processors other than the extractor,
preload the analytics state at ontology-load time rather than on-demand, and
simplify the current per-flow duplication of vector stores.
## Problem Statement
The current implementation in `kg-extract-ontology`
(`trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`) embeds
ontology analytics inside a single processor. This has four concrete
problems:
1. **Analytics are locked inside one processor.** Other services that
could benefit from ontology-similarity lookups (e.g. relationship
extraction, agent tool selection, query routing, future features that
don't exist yet) have no way to call into the analytics without
duplicating the embedder/vector store/selector machinery. The
extractor owns state that conceptually belongs to the platform.
2. **Vector stores are built lazily, per-flow, on first message.** Each
flow that uses `kg-extract-ontology` gets its own
`{embedder, vector_store, selector}` triple, keyed by `id(flow)`, and
the vector store is only populated the first time a message arrives
on that flow. This has several consequences:
- **Cold-start cost on the hot path.** The first document chunk
processed by a flow pays the full cost of embedding every element
of the ontology (hundreds to thousands of calls to the embeddings
service) before extraction can even begin. Subsequent chunks are
fast, but the first one can take seconds to minutes depending on
ontology size.
- **N copies of the same data.** If three flows all use the same
embedding model and the same ontology, three identical in-memory
vector stores are built and maintained. Memory scales with the
number of flows, not the number of ontologies.
- **Total loss on restart.** Vector stores are `InMemoryVectorStore`
instances — nothing is persisted. A restart forces every flow to
re-embed the entire ontology on its next message.
- **Ontology updates re-embed everything.** When the ontology config
changes, `flow_components` is cleared and the re-embedding happens
lazily across every flow, again on the hot path.
The per-flow split exists for a real reason: different flows may use
different embedding models with different vector dimensions, and the
current code detects the dimension by probing the flow's embeddings
service with a test string. But in practice most deployments use one
embedding model, and the per-flow split is paying a cost for
flexibility that isn't being used.
3. **Per-message selection is not batched.** For every document chunk
processed, the selector calls `embedder.embed_text(segment.text)`
once per segment (`ontology_selector.py:96`), each of which
round-trips through the pub/sub layer to the embeddings service as a
single-element batch. A chunk with 20 segments fires 20 sequential
embeddings requests when it could fire one. Ontology ingest is
already batched at 50 at a time; per-message lookups are not.
4. **There is no way to scope ontologies to a flow.** All loaded
ontologies are visible to all flows. A flow that only cares about,
say, the FIBO ontology still pays for embedding and similarity search
across every unrelated ontology in config. This is a usability and
performance problem that gets worse as the number of loaded
ontologies grows.
Together these problems mean the current implementation is tightly
coupled, duplicates work across flows, pays its worst costs on the
hot path, and can't be reused. Each of the four goals below addresses
one of these problems directly.
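The batching gap in point 3 can be sketched with a counting stub standing in for the embeddings service (`CountingEmbedder` and both selector functions below are hypothetical, for illustration only):

```python
class CountingEmbedder:
    """Hypothetical stand-in for the embeddings service; each call to
    embed() represents one pub/sub round-trip, whatever the batch size."""

    def __init__(self):
        self.calls = 0

    def embed(self, texts):
        self.calls += 1
        return [[float(len(t))] for t in texts]  # dummy 1-d vectors


def select_per_segment(embedder, segments):
    # Current behaviour: one embeddings request per segment.
    return [embedder.embed([s])[0] for s in segments]


def select_batched(embedder, segments):
    # Proposed behaviour: one embeddings request for the whole chunk.
    return embedder.embed(segments)


segments = [f"segment {i}" for i in range(20)]

per_segment = CountingEmbedder()
select_per_segment(per_segment, segments)   # 20 round-trips

batched = CountingEmbedder()
select_batched(batched, segments)           # 1 round-trip
```

The results are identical; only the number of round-trips changes.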
## Goals
- **Eager ontology processing.** When an ontology is loaded (via config
push), embed it and populate the vector store immediately, before any
document processing messages arrive. First-chunk latency should not
include ontology embedding cost.
- **Simplify away from per-flow analytics.** Move the vector store out
of per-flow scope. Revisit whether per-flow dimension detection is
still necessary, or whether a shared analytics store (per embedding
model, or globally) is sufficient for realistic deployments.
- **Extract ontology analytics into a standalone service.** Expose
embedding, similarity search, and ontology-subset selection as a
service callable from any processor via the normal pub/sub
request/response pattern. `kg-extract-ontology` becomes one consumer
among many.
- **(Stretch)** Allow flows to select which ontologies they use when
the flow is started, so a flow can restrict itself to a named subset
of the loaded ontologies rather than seeing all of them.
## Background
The current ontology analytics live in four files under
`trustgraph-flow/trustgraph/extract/kg/ontology/`:
- `ontology_loader.py` — parses ontology definitions from config.
- `ontology_embedder.py` — generates embeddings for ontology elements
and stores them in an `InMemoryVectorStore`. Batches at 50 elements
per call on ingest.
- `vector_store.py` — `InMemoryVectorStore`, a numpy-backed dense vector
store with similarity search.
- `ontology_selector.py` — given a query (a document segment), returns
the top-K most relevant ontology elements via similarity search.
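For orientation, a numpy-backed dense store with cosine search has roughly this shape (an illustrative sketch, not the actual `InMemoryVectorStore` code):

```python
import numpy as np

class DenseStoreSketch:
    """Illustrative numpy-backed dense vector store with cosine search."""

    def __init__(self, dim):
        self.vectors = np.empty((0, dim))
        self.items = []

    def add(self, items, vectors):
        self.items.extend(items)
        self.vectors = np.vstack(
            [self.vectors, np.asarray(vectors, dtype=float)])

    def search(self, query, top_k=5):
        q = np.asarray(query, dtype=float)
        # Cosine similarity against every stored vector in one shot.
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q)
        scores = self.vectors @ q / np.where(norms == 0.0, 1.0, norms)
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.items[i], float(scores[i])) for i in order]


store = DenseStoreSketch(dim=2)
store.add(["fibo:Bond", "fibo:Loan", "geo:City"],
          [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
matches = store.search([1.0, 0.0], top_k=2)
# Best match is the element whose vector points the same way as the query.
```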
These are wired together inside `extract.py`, where `Processor` holds
one `OntologyLoader` and one `TextProcessor` but maintains a
`flow_components` dict mapping `id(flow) →
{embedder, vector_store, selector}`. `initialize_flow_components` is
called on the first message for each flow; ontology config changes
trigger `self.flow_components.clear()` to force re-initialisation.
The analytics stack is conceptually independent from the extractor —
it's just a "given text, find relevant ontology elements" service —
but because it's embedded in the extractor, no other processor can use
it, and its lifecycle is coupled to message processing rather than to
ontology loading.
## Technical Design
### Architecture
A new processor, tentatively `ontology-analytics`, hosts the embedder,
vector store, and selector. It is a `FlowProcessor` — flow-aware so
future per-user state management can hook in at the flow boundary —
but the analytics state it holds is **global to the process**, not
per-flow. Shared state is the default; flow scope is just a handle
available when needed.
Crucially, the service does **not** call a separate embeddings service
for its own embedding work. It loads and runs an embedding model
directly, in-process, the same way `embeddings-fastembed` does today.
The rationale: the embedding model used for ontology analytics is a
deployment choice driven by the ontology's semantics, not by the
user's document embedding model. Coupling the two would be an
accidental constraint. Decoupling means:
- The analytics service has no dependency on a flow's
`embeddings-request` service. No routing, no external call, no
async dependency on another processor being up.
- Ontology ingest happens synchronously in the analytics process as
soon as the ontology config is loaded. No round-trips.
- Per-message selection (embedding a query segment and searching the
vector store) is also local. Fast path, no pub/sub hop for
embedding.
- Most deployments will use a single embedding model configured on
the analytics service. If a deployment needs two, run two
`ontology-analytics` processor instances with different ids and
have flows address the one they want.
The flow configuration optionally specifies which analytics service
id to use and which ontologies the flow cares about; otherwise the
flow sees all loaded ontologies through the default analytics
service.
Components inside the new processor:
1. **OntologyLoader** (moved from `kg-extract-ontology`)
Parses ontologies from config-push messages. Owns the parsed
ontology objects.
2. **In-process embedding model**
Loaded once at startup using the service's configured model name.
Same pattern as `embeddings-fastembed` — direct library call,
batched.
3. **Vector store (global)**
One `InMemoryVectorStore` per loaded ontology, not per flow.
Populated eagerly when the ontology is loaded (or reloaded).
Lives in process memory for the lifetime of the service.
4. **Selector**
Given a query text and a subset of ontology ids, returns the
top-K most relevant ontology elements across the union of those
stores. Batches embedding calls for multi-segment queries.
5. **Request handler**
Exposes a request/response service over the pub/sub layer for
other processors (initially `kg-extract-ontology`, later
others) to call.
Module: `trustgraph-flow/trustgraph/ontology_analytics/` (new)
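For component 4, the cross-ontology merge could look roughly like this (the `search` interface and stub store are hypothetical; only the merge logic is the point):

```python
import heapq

class StubStore:
    """Hypothetical per-ontology store returning pre-scored matches."""

    def __init__(self, scored):
        self.scored = scored  # list of (element_id, score)

    def search(self, query_vec, top_k):
        return sorted(self.scored, key=lambda p: -p[1])[:top_k]


def select_across_ontologies(stores, query_vec, ontology_ids=None, top_k=5):
    """Global top-K across the union of the selected per-ontology stores."""
    ids = ontology_ids if ontology_ids is not None else list(stores)
    candidates = []
    for oid in ids:
        for element_id, score in stores[oid].search(query_vec, top_k):
            candidates.append((score, oid, element_id))
    best = heapq.nlargest(top_k, candidates)
    return [(element_id, oid, score) for score, oid, element_id in best]


stores = {
    "fibo": StubStore([("fibo:Bond", 0.9), ("fibo:Loan", 0.4)]),
    "geonames": StubStore([("geo:City", 0.7)]),
}
top = select_across_ontologies(stores, query_vec=None, top_k=2)
# Highest-scoring elements regardless of which ontology they came from.
```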
### Data Models
#### Service request
```
OntologyAnalyticsRequest {
query_texts: list[str] # batch of query segments
ontology_ids: list[str] | None # optional subset; None = all
top_k: int # default per-service
similarity_threshold: float # default per-service
}
```
#### Service response
```
OntologyAnalyticsResponse {
results: list[list[OntologyMatch]] # one list per query_text
}
OntologyMatch {
element_id: str # e.g. "fibo:Bond"
element_type: str # class | property | individual
ontology_id: str # which ontology it came from
text: str # the text that was embedded
score: float # similarity score
}
```
The request is a batch by construction: the caller sends a list of
segments, the service embeds them in one batched embedding call,
searches the relevant stores, and returns a list of match-lists
aligned with the input. This directly fixes the per-segment
unbatched call in the current selector.
#### Flow configuration
Flow instance config gains an optional `ontology_analytics` block:
```
ontology_analytics:
service_id: ontology-analytics # default
ontologies: [fibo, geonames] # default: all loaded
```
If omitted, flows use the default analytics service id and see every
loaded ontology.
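Resolving the optional block to concrete values is then a small helper; a sketch (the function name and shapes are illustrative):

```python
DEFAULT_SERVICE_ID = "ontology-analytics"

def resolve_ontology_analytics(flow_config):
    """Return (service_id, ontology_ids) for a flow.

    An ontology_ids of None means "all loaded ontologies".
    """
    block = (flow_config or {}).get("ontology_analytics") or {}
    service_id = block.get("service_id", DEFAULT_SERVICE_ID)
    ontologies = block.get("ontologies")  # None when not restricted
    return service_id, ontologies
```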
### APIs
New services:
- `ontology-analytics` request/response service, schema as above.
New flow service client:
- `flow("ontology-analytics-request")` — standard flow-scoped client
wrapper for calling the analytics service.
Modified code paths:
- `kg-extract-ontology` no longer owns the embedder, vector store,
or selector. Its per-message logic calls
`flow("ontology-analytics-request")` with the segments extracted
from the chunk and gets back the same shape of data it used to
compute locally.
- The `on_ontology_config` handler in `kg-extract-ontology` goes
away; config-push of ontology types is handled by the new
service.
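On the caller side, the extractor's per-chunk lookup reduces to something like this (the `flow(...)` client is sketched here as a plain async callable; the real trustgraph wrapper may differ):

```python
import asyncio
from types import SimpleNamespace

async def relevant_ontology_elements(flow, segments, ontology_ids=None):
    """Ask the analytics service for matches for a chunk's segments.

    Assumes flow("ontology-analytics-request") returns an async callable
    taking a request dict and returning a response dict.
    """
    client = flow("ontology-analytics-request")
    response = await client({
        "query_texts": [s.text for s in segments],
        "ontology_ids": ontology_ids,  # None = all loaded
    })
    return response["results"]  # index-aligned with the segments


# Demo with a stub client standing in for the pub/sub wrapper.
async def _stub_client(request):
    return {"results": [[] for _ in request["query_texts"]]}

segments = [SimpleNamespace(text="Acme issued a bond"),
            SimpleNamespace(text="maturing in 2031")]
results = asyncio.run(
    relevant_ontology_elements(lambda name: _stub_client, segments))
# One (possibly empty) match list per segment.
```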
### Implementation Details
- **Eager ingest.** When `on_ontology_config` fires on the analytics
service, it re-embeds the changed ontologies in an asyncio task
(eagerly, without waiting for a document message) and atomically
swaps the new vector stores in. Granularity is per ontology:
unchanged ontologies are not re-embedded.
- **Ontology updates as cutover.** Once a new ontology version is
swapped in, all subsequent requests see the new state. In-flight
requests complete against whichever version they started reading.
No version pinning; callers who care take responsibility.
- **Shared vs flow state.** The processor keeps a flow dict for
future per-user additions, but the vector stores themselves live on
the processor instance (`self`), not on per-flow state. The flow is
only used to find the caller's `ontology_analytics` config block for
the request.
- **Batching.** Both ontology ingest and per-request query embedding
use the in-process embedder's batched interface. No per-element
round-trips anywhere on the hot path.
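The per-ontology cutover amounts to rebuilding only the changed stores and replacing the whole mapping in one reference assignment; a sketch (store construction is stubbed, names are illustrative):

```python
def apply_ontology_update(current_stores, changed, removed, build_store):
    """Build a replacement store mapping; swap it in with one assignment.

    current_stores: ontology_id -> vector store (the live mapping)
    changed:        ontology_id -> new ontology definition to re-embed
    removed:        set of ontology ids dropped from config
    build_store:    callable producing a populated store for a definition
    """
    new_stores = {oid: store for oid, store in current_stores.items()
                  if oid not in removed}
    for oid, definition in changed.items():
        new_stores[oid] = build_store(definition)  # re-embed only these
    # The caller does `self.stores = new_stores`: readers see either the
    # old mapping or the new one, never a half-updated state.
    return new_stores


geo_store = object()
live = {"fibo": object(), "geonames": geo_store}
updated = apply_ontology_update(
    live, changed={"fibo": "fibo-v2"}, removed=set(),
    build_store=lambda definition: ("store", definition))
```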
## Security Considerations
*(To be filled in.)*
## Performance Considerations
*(To be filled in — but note up front that batching per-message
selector calls is an easy pre-existing win that doesn't require the
new service, and should probably land separately first.)*
## Testing Strategy
*(To be filled in.)*
## Migration Plan
*(To be filled in — needs to cover how `kg-extract-ontology`
transitions from owning the analytics to calling a new service, and
what happens to in-flight flows during the rollover.)*
## Resolved Questions
- **Persistence:** in-memory only. Ontology data is small, restart
cost is acceptable, and adding a storage backend (Qdrant, local
file, etc.) isn't justified.
- **Per-embedding-model scoping:** the service has its own embedding
model, loaded in-process, independent of any flow's embeddings
service. Deployments needing multiple analytics embedding models
run multiple `ontology-analytics` instances.
- **Flow-scoped ontology selection:** configured in flow instance
config under an `ontology_analytics` block, same place as other
flow-scoped service selection (LLM, embedding model, etc.).
- **Ontology-update semantics:** cutover. Once a new ontology version
is swapped in, subsequent requests see the new state. No version
pinning; users take responsibility for dangerous changes.
- **Processor shape:** FlowProcessor. Keeps the flow boundary
available as a hook for future per-user state management, even
though the analytics state itself is global to the process.
- **Embedding model configuration:** processor argument (e.g.
`--embedding-model all-MiniLM-L6-v2`), not flow-class parameter.
Default is minilm. Deployments needing a different model restart
the service with a different argument; multi-model deployments
run multiple service instances.
- **`kg-extract-ontology` after the split:** still owns LLM
invocation, prompt construction, triple building, and emission.
The only change is that it calls
`flow("ontology-analytics-request")` to get the relevant ontology
subset for a chunk, instead of maintaining its own embedder,
vector store, and selector. The local `OntologyLoader`,
`OntologyEmbedder`, `OntologySelector`, `InMemoryVectorStore` and
the per-flow `flow_components` dict all go away.
## Future Work
- **Dynamic ontology learning.** A future flow may identify new
potential ontology elements at runtime and extend the ontology in
a learning/bootstrap fashion. That this is feasible at all is a
useful validation of the split: the analytics service becomes a
reusable component with broader applications than the initial
extractor use case. Two write-path shapes are viable:
- Config round-trip: the learning flow writes back to config and
  the normal config-push triggers a re-embed. One source of truth,
  but heavier per element.
- Direct add API on the service: cheaper per element, but it
  creates two sources of truth and loses state on restart.
The config round-trip feels like the right default, but the
decision is deferred until a concrete learning flow is designed.
## References
- Current implementation:
`trustgraph-flow/trustgraph/extract/kg/ontology/`
- Related existing spec: `ontorag.md`