mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 00:16:23 +02:00
Tech spec
This commit is contained in:
parent
ce3c8b421b
commit
0f4c91b99a
1 changed files with 354 additions and 0 deletions
354
docs/tech-specs/ontology-analytics-service.md
Normal file
354
docs/tech-specs/ontology-analytics-service.md
Normal file
|
|
@ -0,0 +1,354 @@
|
||||||
|
---
|
||||||
|
layout: default
|
||||||
|
title: "Ontology Analytics Service Technical Specification"
|
||||||
|
parent: "Tech Specs"
|
||||||
|
---
|
||||||
|
|
||||||
|
# Ontology Analytics Service Technical Specification
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
This specification proposes extracting ontology analytics — the embedding,
|
||||||
|
vector storage, and similarity-based selection of ontology elements — out of
|
||||||
|
`kg-extract-ontology` and into a separate, reusable service. The goal is to
|
||||||
|
make ontology analytics available to processors other than the extractor,
|
||||||
|
preload the analytics state at ontology-load time rather than on-demand, and
|
||||||
|
simplify the current per-flow duplication of vector stores.
|
||||||
|
|
||||||
|
## Problem Statement
|
||||||
|
|
||||||
|
The current implementation in `kg-extract-ontology`
|
||||||
|
(`trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`) embeds
|
||||||
|
ontology analytics inside a single processor. This has four concrete
|
||||||
|
problems:
|
||||||
|
|
||||||
|
1. **Analytics are locked inside one processor.** Other services that
|
||||||
|
could benefit from ontology-similarity lookups (e.g. relationship
|
||||||
|
extraction, agent tool selection, query routing, future features that
|
||||||
|
don't exist yet) have no way to call into the analytics without
|
||||||
|
duplicating the embedder/vector store/selector machinery. The
|
||||||
|
extractor owns state that conceptually belongs to the platform.
|
||||||
|
|
||||||
|
2. **Vector stores are built lazily, per-flow, on first message.** Each
|
||||||
|
flow that uses `kg-extract-ontology` gets its own
|
||||||
|
`{embedder, vector_store, selector}` triple, keyed by `id(flow)`, and
|
||||||
|
the vector store is only populated the first time a message arrives
|
||||||
|
on that flow. This has several consequences:
|
||||||
|
|
||||||
|
- **Cold-start cost on the hot path.** The first document chunk
|
||||||
|
processed by a flow pays the full cost of embedding every element
|
||||||
|
of the ontology (hundreds to thousands of calls to the embeddings
|
||||||
|
service) before extraction can even begin. Subsequent chunks are
|
||||||
|
fast, but the first one can take seconds to minutes depending on
|
||||||
|
ontology size.
|
||||||
|
- **N copies of the same data.** If three flows all use the same
|
||||||
|
embedding model and the same ontology, three identical in-memory
|
||||||
|
vector stores are built and maintained. Memory scales with the
|
||||||
|
number of flows, not the number of ontologies.
|
||||||
|
- **Total loss on restart.** Vector stores are `InMemoryVectorStore`
|
||||||
|
instances — nothing is persisted. A restart forces every flow to
|
||||||
|
re-embed the entire ontology on its next message.
|
||||||
|
- **Ontology updates re-embed everything.** When the ontology config
|
||||||
|
changes, `flow_components` is cleared and the re-embedding happens
|
||||||
|
lazily across every flow, again on the hot path.
|
||||||
|
|
||||||
|
The per-flow split exists for a real reason: different flows may use
|
||||||
|
different embedding models with different vector dimensions, and the
|
||||||
|
current code detects the dimension by probing the flow's embeddings
|
||||||
|
service with a test string. But in practice most deployments use one
|
||||||
|
embedding model, and the per-flow split is paying a cost for
|
||||||
|
flexibility that isn't being used.
|
||||||
|
|
||||||
|
3. **Per-message selection is not batched.** For every document chunk
|
||||||
|
processed, the selector calls `embedder.embed_text(segment.text)`
|
||||||
|
once per segment (`ontology_selector.py:96`), each of which
|
||||||
|
round-trips through the pub/sub layer to the embeddings service as a
|
||||||
|
single-element batch. A chunk with 20 segments fires 20 sequential
|
||||||
|
embeddings requests when it could fire one. Ontology ingest is
|
||||||
|
already batched at 50 at a time; per-message lookups are not.
|
||||||
|
|
||||||
|
4. **There is no way to scope ontologies to a flow.** All loaded
|
||||||
|
ontologies are visible to all flows. A flow that only cares about,
|
||||||
|
say, the FIBO ontology still pays for embedding and similarity search
|
||||||
|
across every unrelated ontology in config. This is a usability and
|
||||||
|
performance problem that gets worse as the number of loaded
|
||||||
|
ontologies grows.
|
||||||
|
|
||||||
|
Together these problems mean the current implementation is tightly
|
||||||
|
coupled, duplicates work across flows, pays its worst costs on the
|
||||||
|
hot path, and can't be reused. Each of the four goals below addresses
|
||||||
|
one of these problems directly.
|
||||||
|
|
||||||
|
## Goals
|
||||||
|
|
||||||
|
- **Eager ontology processing.** When an ontology is loaded (via config
|
||||||
|
push), embed it and populate the vector store immediately, before any
|
||||||
|
document processing messages arrive. First-chunk latency should not
|
||||||
|
include ontology embedding cost.
|
||||||
|
|
||||||
|
- **Simplify away from per-flow analytics.** Move the vector store out
|
||||||
|
of per-flow scope. Revisit whether per-flow dimension detection is
|
||||||
|
still necessary, or whether a shared analytics store (per embedding
|
||||||
|
model, or globally) is sufficient for realistic deployments.
|
||||||
|
|
||||||
|
- **Extract ontology analytics into a standalone service.** Expose
|
||||||
|
embedding, similarity search, and ontology-subset selection as a
|
||||||
|
service callable from any processor via the normal pub/sub
|
||||||
|
request/response pattern. `kg-extract-ontology` becomes one consumer
|
||||||
|
among many.
|
||||||
|
|
||||||
|
- **(Stretch)** Allow flows to select which ontologies they use when
|
||||||
|
the flow is started, so a flow can restrict itself to a named subset
|
||||||
|
of the loaded ontologies rather than seeing all of them.
|
||||||
|
|
||||||
|
## Background
|
||||||
|
|
||||||
|
The current ontology analytics live in four files under
|
||||||
|
`trustgraph-flow/trustgraph/extract/kg/ontology/`:
|
||||||
|
|
||||||
|
- `ontology_loader.py` — parses ontology definitions from config.
|
||||||
|
- `ontology_embedder.py` — generates embeddings for ontology elements
|
||||||
|
and stores them in an `InMemoryVectorStore`. Batches at 50 elements
|
||||||
|
per call on ingest.
|
||||||
|
- `vector_store.py` — `InMemoryVectorStore`, a numpy-backed dense vector
|
||||||
|
store with similarity search.
|
||||||
|
- `ontology_selector.py` — given a query (a document segment), returns
|
||||||
|
the top-K most relevant ontology elements via similarity search.
|
||||||
|
|
||||||
|
These are wired together inside `extract.py`, where `Processor` holds
|
||||||
|
one `OntologyLoader` and one `TextProcessor` but maintains a
|
||||||
|
`flow_components` dict mapping `id(flow) →
|
||||||
|
{embedder, vector_store, selector}`. `initialize_flow_components` is
|
||||||
|
called on the first message for each flow; ontology config changes
|
||||||
|
trigger `self.flow_components.clear()` to force re-initialisation.
|
||||||
|
|
||||||
|
The analytics stack is conceptually independent from the extractor —
|
||||||
|
it's just a "given text, find relevant ontology elements" service —
|
||||||
|
but because it's embedded in the extractor, no other processor can use
|
||||||
|
it, and its lifecycle is coupled to message processing rather than to
|
||||||
|
ontology loading.
|
||||||
|
|
||||||
|
## Technical Design
|
||||||
|
|
||||||
|
### Architecture
|
||||||
|
|
||||||
|
A new processor, tentatively `ontology-analytics`, hosts the embedder,
|
||||||
|
vector store, and selector. It is a `FlowProcessor` — flow-aware so
|
||||||
|
future per-user state management can hook in at the flow boundary —
|
||||||
|
but the analytics state it holds is **global to the process**, not
|
||||||
|
per-flow. Shared state is the default; flow scope is just a handle
|
||||||
|
available when needed.
|
||||||
|
|
||||||
|
Crucially, the service does **not** call a separate embeddings service
|
||||||
|
for its own embedding work. It loads and runs an embedding model
|
||||||
|
directly, in-process, the same way `embeddings-fastembed` does today.
|
||||||
|
The rationale: the embedding model used for ontology analytics is a
|
||||||
|
deployment choice driven by the ontology's semantics, not by the
|
||||||
|
user's document embedding model. Coupling the two would be an
|
||||||
|
accidental constraint. Decoupling means:
|
||||||
|
|
||||||
|
- The analytics service has no dependency on a flow's
|
||||||
|
`embeddings-request` service. No routing, no external call, no
|
||||||
|
async dependency on another processor being up.
|
||||||
|
- Ontology ingest happens synchronously in the analytics process as
|
||||||
|
soon as the ontology config is loaded. No round-trips.
|
||||||
|
- Per-message selection (embedding a query segment and searching the
|
||||||
|
vector store) is also local. Fast path, no pub/sub hop for
|
||||||
|
embedding.
|
||||||
|
- Most deployments will use a single embedding model configured on
|
||||||
|
the analytics service. If a deployment needs two, run two
|
||||||
|
`ontology-analytics` processor instances with different ids and
|
||||||
|
have flows address the one they want.
|
||||||
|
|
||||||
|
The flow configuration optionally specifies which analytics service
|
||||||
|
id to use and which ontologies the flow cares about; otherwise the
|
||||||
|
flow sees all loaded ontologies through the default analytics
|
||||||
|
service.
|
||||||
|
|
||||||
|
Components inside the new processor:
|
||||||
|
|
||||||
|
1. **OntologyLoader** (moved from `kg-extract-ontology`)
|
||||||
|
Parses ontologies from config-push messages. Owns the parsed
|
||||||
|
ontology objects.
|
||||||
|
|
||||||
|
2. **In-process embedding model**
|
||||||
|
Loaded once at startup using the service's configured model name.
|
||||||
|
Same pattern as `embeddings-fastembed` — direct library call,
|
||||||
|
batched.
|
||||||
|
|
||||||
|
3. **Vector store (global)**
|
||||||
|
One `InMemoryVectorStore` per loaded ontology, not per flow.
|
||||||
|
Populated eagerly when the ontology is loaded (or reloaded).
|
||||||
|
Lives in process memory for the lifetime of the service.
|
||||||
|
|
||||||
|
4. **Selector**
|
||||||
|
Given a query text and a subset of ontology ids, returns the
|
||||||
|
top-K most relevant ontology elements across the union of those
|
||||||
|
stores. Batches embedding calls for multi-segment queries.
|
||||||
|
|
||||||
|
5. **Request handler**
|
||||||
|
Exposes a request/response service over the pub/sub layer for
|
||||||
|
other processors (initially `kg-extract-ontology`, later
|
||||||
|
others) to call.
|
||||||
|
|
||||||
|
Module: `trustgraph-flow/trustgraph/ontology_analytics/` (new)
|
||||||
|
|
||||||
|
### Data Models
|
||||||
|
|
||||||
|
#### Service request
|
||||||
|
|
||||||
|
```
|
||||||
|
OntologyAnalyticsRequest {
|
||||||
|
query_texts: list[str] # batch of query segments
|
||||||
|
ontology_ids: list[str] | None # optional subset; None = all
|
||||||
|
top_k: int # default per-service
|
||||||
|
similarity_threshold: float # default per-service
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Service response
|
||||||
|
|
||||||
|
```
|
||||||
|
OntologyAnalyticsResponse {
|
||||||
|
results: list[list[OntologyMatch]] # one list per query_text
|
||||||
|
}
|
||||||
|
|
||||||
|
OntologyMatch {
|
||||||
|
element_id: str # e.g. "fibo:Bond"
|
||||||
|
element_type: str # class | property | individual
|
||||||
|
ontology_id: str # which ontology it came from
|
||||||
|
text: str # the text that was embedded
|
||||||
|
score: float # similarity score
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
The request is a batch by construction: the caller sends a list of
|
||||||
|
segments, the service embeds them in one batched embedding call,
|
||||||
|
searches the relevant stores, and returns a list of match-lists
|
||||||
|
aligned with the input. This directly fixes the per-segment
|
||||||
|
unbatched call in the current selector.
|
||||||
|
|
||||||
|
#### Flow configuration
|
||||||
|
|
||||||
|
Flow instance config gains an optional `ontology_analytics` block:
|
||||||
|
|
||||||
|
```
|
||||||
|
ontology_analytics:
|
||||||
|
service_id: ontology-analytics # default
|
||||||
|
ontologies: [fibo, geonames] # default: all loaded
|
||||||
|
```
|
||||||
|
|
||||||
|
If omitted, flows use the default analytics service id and see every
|
||||||
|
loaded ontology.
|
||||||
|
|
||||||
|
### APIs
|
||||||
|
|
||||||
|
New services:
|
||||||
|
- `ontology-analytics` request/response service, schema as above.
|
||||||
|
|
||||||
|
New flow service client:
|
||||||
|
- `flow("ontology-analytics-request")` — standard flow-scoped client
|
||||||
|
wrapper for calling the analytics service.
|
||||||
|
|
||||||
|
Modified code paths:
|
||||||
|
- `kg-extract-ontology` no longer owns the embedder, vector store,
|
||||||
|
or selector. Its per-message logic calls
|
||||||
|
`flow("ontology-analytics-request")` with the segments extracted
|
||||||
|
from the chunk and gets back the same shape of data it used to
|
||||||
|
compute locally.
|
||||||
|
- The `on_ontology_config` handler in `kg-extract-ontology` goes
|
||||||
|
away; config-push of ontology types is handled by the new
|
||||||
|
service.
|
||||||
|
|
||||||
|
### Implementation Details
|
||||||
|
|
||||||
|
- **Eager ingest.** When `on_ontology_config` fires on the analytics
|
||||||
|
service, the service synchronously (well, in an asyncio task)
|
||||||
|
re-embeds the changed ontologies and atomically swaps the new
|
||||||
|
vector stores in. Per-ontology granularity — unchanged ontologies
|
||||||
|
are not re-embedded.
|
||||||
|
- **Ontology updates as cutover.** Once a new ontology version is
|
||||||
|
swapped in, all subsequent requests see the new state. In-flight
|
||||||
|
requests complete against whichever version they started reading.
|
||||||
|
No version pinning; callers who care take responsibility.
|
||||||
|
- **Shared vs flow state.** The processor keeps a flow dict for
|
||||||
|
future per-user additions but the vector stores themselves are
|
||||||
|
module-level (on `self`, not on per-flow state). The flow is only
|
||||||
|
used to find the caller's `ontology_analytics` config block for
|
||||||
|
this request.
|
||||||
|
- **Batching.** Both ontology ingest and per-request query embedding
|
||||||
|
use the in-process embedder's batched interface. No per-element
|
||||||
|
round-trips anywhere on the hot path.
|
||||||
|
|
||||||
|
## Security Considerations
|
||||||
|
|
||||||
|
*(To be filled in.)*
|
||||||
|
|
||||||
|
## Performance Considerations
|
||||||
|
|
||||||
|
*(To be filled in — but note up front that batching per-message
|
||||||
|
selector calls is an easy pre-existing win that doesn't require the
|
||||||
|
new service, and should probably land separately first.)*
|
||||||
|
|
||||||
|
## Testing Strategy
|
||||||
|
|
||||||
|
*(To be filled in.)*
|
||||||
|
|
||||||
|
## Migration Plan
|
||||||
|
|
||||||
|
*(To be filled in — needs to cover how `kg-extract-ontology`
|
||||||
|
transitions from owning the analytics to calling a new service, and
|
||||||
|
what happens to in-flight flows during the rollover.)*
|
||||||
|
|
||||||
|
## Resolved Questions
|
||||||
|
|
||||||
|
- **Persistence:** in-memory only. Ontology data is small, restart
|
||||||
|
cost is acceptable, and adding a storage backend (Qdrant, local
|
||||||
|
file, etc.) isn't justified.
|
||||||
|
- **Per-embedding-model scoping:** the service has its own embedding
|
||||||
|
model, loaded in-process, independent of any flow's embeddings
|
||||||
|
service. Deployments needing multiple analytics embedding models
|
||||||
|
run multiple `ontology-analytics` instances.
|
||||||
|
- **Flow-scoped ontology selection:** configured in flow instance
|
||||||
|
config under an `ontology_analytics` block, same place as other
|
||||||
|
flow-scoped service selection (LLM, embedding model, etc.).
|
||||||
|
- **Ontology-update semantics:** cutover. Once a new ontology version
|
||||||
|
is swapped in, subsequent requests see the new state. No version
|
||||||
|
pinning; users take responsibility for dangerous changes.
|
||||||
|
- **Processor shape:** FlowProcessor. Keeps the flow boundary
|
||||||
|
available as a hook for future per-user state management, even
|
||||||
|
though the analytics state itself is global to the process.
|
||||||
|
- **Embedding model configuration:** processor argument (e.g.
|
||||||
|
`--embedding-model all-MiniLM-L6-v2`), not flow-class parameter.
|
||||||
|
Default is minilm. Deployments needing a different model restart
|
||||||
|
the service with a different argument; multi-model deployments
|
||||||
|
run multiple service instances.
|
||||||
|
- **`kg-extract-ontology` after the split:** still owns LLM
|
||||||
|
invocation, prompt construction, triple building, and emission.
|
||||||
|
The only change is that it calls
|
||||||
|
`flow("ontology-analytics-request")` to get the relevant ontology
|
||||||
|
subset for a chunk, instead of maintaining its own embedder,
|
||||||
|
vector store, and selector. The local `OntologyLoader`,
|
||||||
|
`OntologyEmbedder`, `OntologySelector`, `InMemoryVectorStore` and
|
||||||
|
the per-flow `flow_components` dict all go away.
|
||||||
|
|
||||||
|
## Future Work
|
||||||
|
|
||||||
|
- **Dynamic ontology learning.** A future flow may identify new
|
||||||
|
potential ontology elements at runtime and extend the ontology in
|
||||||
|
a learning/bootstrap fashion. That this is even feasible is a
|
||||||
|
useful validation of the split: the analytics service becomes a
|
||||||
|
reusable component with broader applications than the initial
|
||||||
|
extractor use case. Several write-path shapes are viable —
|
||||||
|
config-round-trip (learning flow writes back to config, normal
|
||||||
|
config-push triggers re-embed, one source of truth, heavier per
|
||||||
|
element) or a direct add API on the service (cheaper per element,
|
||||||
|
but creates two sources of truth and loses state on restart). The
|
||||||
|
config-round-trip feels like the right default but the decision
|
||||||
|
is deferred until a concrete learning flow is designed.
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- Current implementation:
|
||||||
|
`trustgraph-flow/trustgraph/extract/kg/ontology/`
|
||||||
|
- Related existing spec: `ontorag.md`
|
||||||
Loading…
Add table
Add a link
Reference in a new issue