feat: replace LLM edge scoring with cross-encoder reranker in GraphRAG (#1005)

Replace the three-prompt LLM scoring pipeline (kg-edge-scoring,
kg-edge-reasoning, kg-edge-selection) with a cross-encoder reranker
service backed by FlashRank. The new hop_and_filter() method performs
iterative graph traversal with semantic scoring at each hop, replacing
the previous follow_edges/get_subgraph approach.

- Add reranker service (trustgraph-base client/service, FlashRank processor)
- Add gateway dispatch for reranker via API and WebSocket
- Rewrite GraphRAG pipeline: hop_and_filter() with per-hop cross-encoder scoring
- Remove kg_prompt() and edge_score_limit from prompt client
- Update provenance: add tg:EdgeSelection type, tg:concept, tg:score predicates
- Update CLIs (tg-invoke-graph-rag, tg-show-explain-trace) for new metadata
- Add tg-invoke-reranker CLI tool
- Add tech spec and UX developer guidance
- Update all unit and integration tests
cybermaggedon 2026-06-30 14:36:37 +01:00 committed by GitHub
parent 1aa9549912
commit 01cc8dbc64
43 changed files with 1613 additions and 792 deletions

@@ -0,0 +1,523 @@
# GraphRAG Semantic Filter Improvement
## Problem Statement
The GraphRAG semantic filter performs poorly with certain LLMs.
Smaller models in particular produce poor-quality edge relevance
scores, and models trained or evaluated heavily on non-Roman-script
datasets are suspected to perform worse on the semantic ranking
operation.
The root cause is that the current implementation delegates edge
relevance scoring to the LLM via a prompt that asks the model to
assign a 1-10 relevance score to each knowledge-graph edge. This
task — ranking structured triples for relevance to a natural-language
query — is not well covered in standard LLM evaluation suites, so
model benchmark scores are not predictive of performance on this
operation. The result is that GraphRAG quality varies unpredictably
across model choices, undermining confidence in the pipeline.
Beyond model variability, the LLM scoring step has further problems:
- **Cost and latency.** The LLM call consumes tokens and adds
latency to every query, yet its output is unreliable. Even when
the model performs well, the cost is disproportionate for what is
fundamentally a ranking operation.
- **Subjective scoring scale.** The 1-10 relevance scale gives the
model no objective criteria for what constitutes a 5 versus a 7.
Different models interpret the scale differently, and even the same
model can produce inconsistent scores across runs.
- **Redundancy with the embedding pre-filter.** The pipeline already
contains a cosine-similarity stage that ranks edges by semantic
relevance using embeddings. The LLM scoring step is a second
filter applied on top of this, and it is not clear that it adds
enough value to justify the additional cost and risk of
degradation.
### Industry context
Semantic ranking is rigorously evaluated on dedicated benchmarks such
as MTEB (Massive Text Embedding Benchmark) and BEIR (Benchmarking
Information Retrieval), which test retrieval and reranking across
diverse domains. The current TrustGraph approach — prompting a
general-purpose LLM to score and rank documents (the "listwise"
approach) — is known to be poorly optimized for this task. It
suffers from positional bias, formatting failures, and
inconsistency at scale.
The industry standard for semantic ranking has moved to
cross-encoder models: lightweight, purpose-built models that take a
query-document pair as input and produce a single relevance score.
These models are fine-tuned on millions of relevance-labelled pairs
and dominate retrieval benchmarks. They are fast, deterministic,
and do not require an LLM inference call.
## Architecture
### Cross-encoder service
A new request/response service that exposes a generic semantic
ranking API. The service is not specific to GraphRAG — it is a
reusable building block for any component that needs to rank text
by relevance.
The service interface is pluggable. Alternative implementations
can be swapped in behind the same API.
**Packaging options considered:**
- *`sentence-transformers`.* Full-featured, widely used.
However, it pulls in PyTorch (~2 GB), making containers
very large. Tested at ~1.8 seconds for 2200 edges.
- *`optimum.onnxruntime`.* ONNX-based inference. Still
depends on PyTorch at import time despite using ONNX for
inference. Tested at ~4.2 seconds for 2200 edges.
- *`flashrank`.* Lightweight wrapper around ONNX Runtime
with a clean API (`Ranker`, `RerankRequest`). No PyTorch
dependency. Tested at ~4.4 seconds for 2200 edges.
- *Pure `onnxruntime` + `tokenizers`.* Leanest option
(~200 MB total). Requires manual tokenisation, padding,
and numpy array management — more boilerplate to maintain.
- *External API (e.g. Cohere Rerank).* No local model at
all. Adds network latency and an external dependency.
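**Decision:** `flashrank` for the initial implementation.
No PyTorch dependency and a clean API. In the tests above it was
comparable to `optimum.onnxruntime` (~4.4 versus ~4.2 seconds for
2200 edges), though slower than `sentence-transformers` (~1.8
seconds). The pluggable interface allows swapping to another
backend later.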
**Request:**
- `queries` — list of `{id, text}` objects. In the GraphRAG use
case these are the concepts extracted from the user's question.
- `documents` — list of `{id, text}` objects. In the GraphRAG
use case these are the candidate knowledge-graph edges
represented as text.
- `limit` — integer. Maximum number of results to return.
**Scoring:**
The service produces the cartesian product of all query-document
pairs and scores each pair through the cross-encoder model. For
each document, the maximum score across all queries is taken as the
document's relevance score. Documents are then ranked by this
score and the top `limit` results are returned.
**Response:**
A list of the top `limit` results, each containing:
- `document_id` — the ID of the matched document.
- `query_id` — the ID of the query (concept) that produced the
highest score for this document.
- `score` — the relevance score.
Including `query_id` in the response supports the explainability
interface: it records that an edge was selected because it is
related to a specific concept.
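
As a rough illustration of the scoring rule and response shape described above, the following sketch scores every query-document pair, keeps each document's best-scoring query, and returns the top `limit` results. The names `rank_documents` and `toy_score` are hypothetical and not part of TrustGraph or FlashRank; `toy_score` is a word-overlap stand-in for the cross-encoder model so that the example runs without any model download.

```python
from typing import Callable, Dict, List


def rank_documents(
    queries: List[Dict[str, str]],
    documents: List[Dict[str, str]],
    limit: int,
    score_pair: Callable[[str, str], float],
) -> List[Dict[str, object]]:
    """Score every query-document pair; keep each document's best query."""
    results = []
    for doc in documents:
        # Track the best-scoring query for this document.
        best_query_id, best_score = None, float("-inf")
        for query in queries:
            score = score_pair(query["text"], doc["text"])
            if score > best_score:
                best_query_id, best_score = query["id"], score
        if best_query_id is not None:
            results.append({
                "document_id": doc["id"],
                "query_id": best_query_id,
                "score": best_score,
            })
    # Rank by score, highest first, and truncate to the requested limit.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:limit]


def toy_score(query: str, document: str) -> float:
    """Assumption: word overlap stands in for the real cross-encoder score."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d) / len(q) if q else 0.0
```

Because the maximum is taken per document, an edge that matches only one concept strongly still ranks highly, and the winning `query_id` is available for the explainability record.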
### Integration
The cross-encoder service follows the standard TrustGraph service
integration pattern:
- **Base package (trustgraph-base).** Schema definitions for the
cross-encoder request/response messages. A client class that
other components (e.g. GraphRAG) can use to call the
cross-encoder service. Message translator registration so the
pub/sub layer can serialise/deserialise the messages.
- **Flow package (trustgraph-flow).** The cross-encoder service
implementation itself — loads the model, listens for requests,
scores pairs, returns results. Flow definition support so the
cross-encoder can be introduced into a processing flow via the
standard flow configuration. `flashrank` is added as a
dependency of `trustgraph-flow`. The service runs in its own
container.
- **API gateway.** A gateway endpoint that routes cross-encoder
requests from the HTTP API to the service over pub/sub and
returns the response.
- **CLI tool.** A command-line utility
(e.g. `tg-invoke-cross-encoder`) that calls the gateway
endpoint for manual testing and debugging.
### Current GraphRAG pipeline
The current pipeline follows these steps:
1. **Concept extraction.** An LLM prompt extracts key concepts
from the user's query.
2. **Graph exploration.** Seed entities are found via embedding
similarity. A subgraph is built by multi-hop traversal from
the seed entities (up to `max_path_length` hops, capped at
`max_subgraph_size` edges).
3. **Embedding pre-filter.** Each edge is embedded as
`"subject, predicate, object"` and scored by cosine similarity
against the concept embeddings. The top `edge_score_limit`
(default 30) edges are kept.
4. **LLM edge scoring.** The `kg-edge-scoring` prompt asks the
LLM to assign a 1-10 relevance score to each remaining edge.
The top `edge_limit` (default 25) edges are kept.
5. **LLM edge reasoning.** The `kg-edge-reasoning` prompt asks
the LLM to explain why each selected edge is relevant to the
query. Used for the explainability interface.
6. **Document tracing.** Selected edges are traced back to their
source documents in the librarian. Runs concurrently with
step 5.
7. **Synthesis.** The `kg-synthesis` prompt generates the final
answer from the selected edges and source document metadata.
### Potential improvements
#### Replace LLM edge scoring with cross-encoder (step 4)
The LLM edge scoring step is replaced by a call to the
cross-encoder service. The candidate edges are the documents and
`edge_limit` is the limit. This is a direct substitution: faster,
cheaper, deterministic, and more reliable across model choices.
The LLM `kg-edge-scoring` prompt is retired.
**Cross-encoder query input: concepts vs. raw query.** There are
two options for what to use as the cross-encoder queries:
- *Option A: Raw user query.* Pass the original question as a
single query string. Simpler, no dependency on concept
extraction. However, raw queries contain noise words and
conversational phrasing that do not match well against the
structured vocabulary of knowledge-graph edges. A single query
also means every edge competes against the full question — a
partial match on one aspect is diluted.
- *Option B: Extracted concepts.* Pass the concepts from step 1
as separate queries. The concepts are distilled, focused terms
that are closer to the language of the edges. With multiple
concepts as independent queries, the cross-encoder scores each
edge against each concept separately, giving better coverage —
an edge only needs to match one concept well to be selected.
The trade-off is a dependency on the LLM concept extraction
step, but this is already in the pipeline and is a lightweight,
reliable LLM call.
**Decision:** Option B — use extracted concepts. The concept
extraction is fast, and the resulting terms produce better
cross-encoder matches against structured triples.
#### Edge text representation
The current embedding pre-filter represents each edge as
`"subject, predicate, object"`. Two changes:
- **Drop commas.** Commas add tokenisation noise without semantic
value.
- **Drop the subject.** The subject identifies which entity the
edge belongs to, but it does not contribute to whether the
edge's content is relevant to the query. The predicate and
object carry the semantic meaning — what relationship exists
and what it connects to. Representing edges as `"{p} {o}"`
produces cleaner cross-encoder matches.
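
A minimal sketch of the two representations, for comparison. The function names are hypothetical, and the sketch assumes the subject, predicate, and object have already been resolved to human-readable labels.

```python
def edge_text_for_embedding(subject: str, predicate: str, obj: str) -> str:
    """Previous form used by the embedding pre-filter: "subject, predicate, object"."""
    return f"{subject}, {predicate}, {obj}"


def edge_text_for_cross_encoder(subject: str, predicate: str, obj: str) -> str:
    """Proposed form: "{p} {o}", with the commas and the subject dropped."""
    # The subject is accepted for symmetry with the old form but deliberately unused.
    return f"{predicate} {obj}"
```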
#### Remove the embedding pre-filter (step 3)
The embedding pre-filter was introduced to reduce the number of
edges before the expensive LLM scoring call. With the
cross-encoder replacing the LLM call, this cost equation changes.
**Arguments for removal:**
- The cross-encoder is fast enough to score the full subgraph
directly. In testing, 2200 edges scored in ~1.8 to ~4.4 seconds
depending on the backend; at the default `max_subgraph_size` of
150 edges, scoring takes a fraction of a second.
- The pre-filter is a weaker version of what the cross-encoder
does. Bi-encoder cosine similarity embeds the query and
document independently and compares vectors; the cross-encoder
processes both texts together through the full transformer,
giving it much better relevance judgement. Running a weaker
filter before a stronger one adds latency without improving
quality.
- Removing it eliminates an embedding service call (two batches:
concepts + edges) and the associated latency.
**Arguments for keeping it:**
- If the subgraph is very large (thousands of edges), the
cross-encoder's linear scaling could become a bottleneck.
The pre-filter would act as a safety valve.
- The embedding call is cheap compared to an LLM call, so the
overhead is modest.
**Decision:** Remove the pre-filter. The `max_subgraph_size`
parameter (default 150) already caps the number of edges entering
this stage, so the cross-encoder will not face an unbounded
workload. If very large subgraphs become a concern in future,
the pre-filter can be reintroduced or `max_subgraph_size` can be
tuned.
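
The "fraction of a second" estimate can be checked with simple arithmetic. The sketch below assumes scoring time scales linearly with the number of edges and ignores fixed per-call overhead such as model loading and messaging; the function name is hypothetical, and the timings are the 2200-edge measurements quoted in the packaging comparison.

```python
def estimated_scoring_seconds(
    edges: int,
    measured_seconds: float,
    measured_edges: int = 2200,
) -> float:
    """Linear extrapolation from a measured batch to a smaller one."""
    return measured_seconds * edges / measured_edges


# Measured timings for 2200 edges, taken from the packaging comparison.
for backend, seconds in [
    ("sentence-transformers", 1.8),
    ("optimum.onnxruntime", 4.2),
    ("flashrank", 4.4),
]:
    estimate = estimated_scoring_seconds(150, seconds)
    print(f"{backend}: ~{estimate:.2f} s for 150 edges")
```

Under these assumptions, 150 edges take roughly 0.12 to 0.30 seconds across the tested backends.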
#### Iterative graph traversal with cross-encoder filtering
The current pipeline performs graph exploration and edge filtering
as separate phases: first build the full subgraph (up to
`max_path_length` hops), then score and filter edges. An
alternative is to interleave traversal and filtering — at each
hop, use the cross-encoder to select relevant edges before
expanding further.
**Option A: Big-bang traversal then filter.** Traverse the full
subgraph up to `max_path_length` hops from the seed entities,
collecting all edges up to `max_subgraph_size`. Then
cross-encode the entire result to select the top edges.
- Simple to implement — the current traversal logic is largely
unchanged.
- Produces large, unfocused subgraphs. Irrelevant branches are
explored and scored even though they will be discarded.
- Poorly suited to multi-hop reasoning. For a query about
Voyager 1, the subgraph includes Voyager 2's edges because
they are within hop distance, and the filter must then
separate them.
**Option B: Iterative hop-and-filter.** At each hop:
1. Retrieve all edges one hop from the current frontier nodes.
2. Cross-encode these edges against the query concepts.
3. Select the top relevant edges.
4. The target nodes of the selected edges become the frontier
for the next hop.
5. Repeat up to `max_path_length` hops.
The final set of selected edges across all hops is the input to
synthesis.
- **Guided exploration.** Each hop focuses the search by
pruning irrelevant branches before expanding further. The
working set stays small and relevant at every step.
- **Multi-hop reasoning works naturally.** Following
"Voyager 1 → has-event → crossed the heliopause" succeeds
because each hop is individually relevant and leads to the
next.
- **Smaller total workload.** Fewer edges are scored overall
because irrelevant branches are never expanded.
- **Trade-off: greedy pruning.** An edge discarded at hop 1
cannot lead to relevant edges at hop 2. This is inherent in
any bounded traversal, and the cross-encoder is better
equipped to make this relevance judgement than a blind hop
limit.
- **Trade-off: sequential latency.** Hops cannot be
parallelised since each depends on the previous. However,
each cross-encoder call on a small edge set is very fast
(sub-second for typical working sets).
**Decision:** Option B — iterative hop-and-filter. The guided
traversal produces more focused subgraphs and supports multi-hop
reasoning, which is a significant quality improvement over the
current approach.
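
The hop-and-filter loop in Option B can be sketched as follows. This is an illustration under stated assumptions, not the actual `hop_and_filter()` implementation: the graph is an in-memory adjacency dict, `edges_per_hop` is a hypothetical per-hop selection limit, and `toy_score` is a word-overlap stand-in for a call to the cross-encoder service.

```python
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (subject, predicate, object)


def toy_score(concept: str, text: str) -> float:
    """Assumption: word overlap stands in for the cross-encoder score."""
    q, d = set(concept.lower().split()), set(text.lower().split())
    return len(q & d) / len(q) if q else 0.0


def hop_and_filter(
    graph: Dict[str, List[Edge]],
    seeds: List[str],
    concepts: List[str],
    score_pair: Callable[[str, str], float] = toy_score,
    max_path_length: int = 2,
    edges_per_hop: int = 1,  # hypothetical parameter name
) -> List[Tuple[Edge, str, float]]:
    """Return (edge, matched concept, score) for edges selected across hops."""
    if not concepts:
        return []
    selected: List[Tuple[Edge, str, float]] = []
    seen: Set[Edge] = set()
    frontier = list(seeds)
    for _ in range(max_path_length):
        # 1. Retrieve not-yet-seen edges one hop from the frontier.
        candidates = []
        for node in frontier:
            for edge in graph.get(node, []):
                if edge not in seen:
                    seen.add(edge)
                    candidates.append(edge)
        if not candidates:
            break
        # 2. Score each edge as "{p} {o}" and keep its best-matching concept.
        scored = []
        for edge in candidates:
            text = f"{edge[1]} {edge[2]}"
            concept, score = max(
                ((c, score_pair(c, text)) for c in concepts),
                key=lambda pair: pair[1],
            )
            scored.append((edge, concept, score))
        # 3. Select the top edges for this hop.
        scored.sort(key=lambda item: item[2], reverse=True)
        kept = scored[:edges_per_hop]
        selected.extend(kept)
        # 4. Target nodes of the selected edges form the next frontier.
        frontier = [edge[2] for edge, _, _ in kept]
    return selected
```

In a toy graph where Voyager 1 links to both a heliopause-crossing event and Voyager 2, only the event branch is expanded at the second hop, so Voyager 2's edges are never retrieved or scored.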
#### Replace LLM edge reasoning with cross-encoder metadata (step 5)
The current `kg-edge-reasoning` prompt asks the LLM to explain why
each edge is relevant. With the cross-encoder now making the
selection, this explanation would be a post-hoc fabrication — the
LLM was not involved in the decision.
- *Option A: Keep LLM reasoning.* Generates natural-language
explanations but they are not grounded in the actual selection
process. Adds an LLM call per query.
- *Option B: Record cross-encoder metadata.* The cross-encoder
already returns the matched concept and score for each selected
edge. Use this directly as the explanation.
**Decision:** Option B. The cross-encoder metadata is the true
reason the edge was selected. The `kg-edge-reasoning` prompt is
retired.
#### Explainability interface update
The explainability interface uses a `Focus` entity containing
`EdgeSelection` sub-entities. Each `EdgeSelection` currently
carries an `edge` (the quoted triple) and a `reasoning` field
(free-text LLM prose), stored as `tg:reasoning` in the
provenance graph.
With the cross-encoder replacing LLM reasoning, the
`EdgeSelection` type gains two new predicates and drops one:
- **Remove** `tg:reasoning` — no longer produced.
- **Add** `tg:concept` — the concept text that produced the
highest cross-encoder score for this edge.
- **Add** `tg:score` — the cross-encoder relevance score.
This evolves the existing edge selection sub-entity rather than
introducing a new kind of entity. These sub-entities currently
have no `rdf:type` declared, so a `tg:EdgeSelection` type should
be added to let consumers identify them in the provenance graph.
The `Focus` entity and its relationship to `Exploration` are
unchanged.
The `Focus` entity's token-usage metadata (`tg:inToken`,
`tg:outToken`, `tg:llmModel`) no longer applies since there is
no LLM call. These fields are dropped from the Focus entity.
### Proposed pipeline
1. **Concept extraction.** Unchanged — LLM extracts key concepts
from the user's query.
2. **Seed entity lookup.** Find seed entities via embedding
similarity against the extracted concepts.
3. **Iterative hop-and-filter.** For each hop up to
   `max_path_length`:
   a. Retrieve all edges one hop from the current frontier nodes.
   b. Represent each edge as `"{predicate} {object}"`.
   c. Score edges against the extracted concepts using the
      cross-encoder service.
   d. Select the top relevant edges. The target nodes of the
      selected edges become the frontier for the next hop.
4. **Document tracing.** Selected edges are traced back to source
documents.
5. **Synthesis.** The `kg-synthesis` prompt generates the final
answer from the selected edges and source document metadata.
### Implementation order
1. Cross-encoder service with full integration (base schema,
flow service, gateway endpoint, CLI tool).
2. GraphRAG pipeline changes (iterative hop-and-filter,
edge representation, remove pre-filter).
3. Explainability update (`tg:EdgeSelection` type, concept
and score predicates, retire `tg:reasoning`).
4. Retire `kg-edge-scoring` and `kg-edge-reasoning` prompts.
5. Update `tg-invoke-graph-rag` and `tg-show-explain-trace`
to display the new metadata. Use these as the main
end-to-end test.
6. Fix any failing unit tests, then add new tests as needed.
7. Write guidance for UX devs to update the UI for the new
explainability predicates.
## UX developer guidance
This section describes the changes to the explainability interface
that affect frontend rendering of GraphRAG Focus events.
### What changed
Edge selection in GraphRAG previously used LLM-based scoring and
reasoning. Each selected edge carried a `tg:reasoning` predicate
with free-text explanation from the LLM. This has been replaced
by a cross-encoder reranker that scores edges against query
concepts. The explainability data now carries structured metadata
instead of free text.
### Removed
- **`tg:reasoning`** is no longer emitted on edge selection
entities in GraphRAG Focus events. UX code that reads
`edge_sel.reasoning` will get an empty string. Remove any
rendering that displays a "Reasoning" or "Reason" field for
Focus edges.
- The **`kg-edge-scoring`**, **`kg-edge-reasoning`**, and
**`kg-edge-selection`** prompts are retired. Any UX that
references these prompt names should be cleaned up.
### Added
Each edge selection entity within a Focus event now has three
new properties:
| RDF predicate | API field | Type | Description |
|---|---|---|---|
| `rdf:type tg:EdgeSelection` | (type check) | — | Each edge selection entity is now explicitly typed |
| `tg:concept` | `edge_sel.concept` | `str` | The query concept that matched this edge |
| `tg:score` | `edge_sel.score` | `float` or `None` | Cross-encoder relevance score (0.0 to 1.0) |
The `tg:edge` predicate (RDF-star quoted triple) is unchanged.
### How to render
The recommended rendering for each selected edge in a Focus event:
```
Edge: (subject_label, predicate_label, object_label)
Concept: <concept> Score: <score formatted to 4 decimal places>
```
Scores near 1.0 indicate high relevance; scores near 0.0 indicate
low relevance. UX could use the score to drive visual indicators
such as colour intensity or a relevance bar.
Edges are not returned in score order — they arrive in traversal
order across hops. If the UX wants to display edges ranked by
relevance, sort by `edge_sel.score` descending.
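
A small sketch of the rendering and sorting described above. `RenderedEdge` is a hypothetical view model that assumes the edge labels have already been resolved, and printing `n/a` for a missing score is an assumption rather than specified behaviour.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RenderedEdge:
    # Hypothetical view model; labels are (subject, predicate, object).
    labels: Tuple[str, str, str]
    concept: str = ""
    score: Optional[float] = None


def format_edge(edge: RenderedEdge) -> str:
    """Render one selected edge using the recommended two-line layout."""
    s, p, o = edge.labels
    # Assumption: show "n/a" when no score is present.
    score_text = "n/a" if edge.score is None else f"{edge.score:.4f}"
    return f"Edge: ({s}, {p}, {o})\nConcept: {edge.concept} Score: {score_text}"


def sort_by_relevance(edges: List[RenderedEdge]) -> List[RenderedEdge]:
    """Sort by score descending; edges without a score go last."""
    return sorted(edges, key=lambda e: (e.score is None, -(e.score or 0.0)))
```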
### API classes (Python)
The `EdgeSelection` dataclass in `trustgraph.api.explainability`
has these fields:
```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class EdgeSelection:
    uri: str
    edge: Optional[Dict[str, str]]  # {"s": ..., "p": ..., "o": ...}
    reasoning: str = ""             # Legacy, always empty for new traces
    concept: str = ""               # Query concept that matched
    score: Optional[float] = None   # Cross-encoder relevance score
```
These are populated when calling
`ExplainabilityClient.fetch_focus_with_edges()` or when parsing
inline provenance triples from the streaming response.
### WebSocket response format
For inline explainability via the streaming WebSocket, Focus events
arrive as `message_type: "explain"` responses. The `explain_triples`
array contains the edge selection triples. The relevant predicates
in wire format are:
```json
{"s": {"t": "i", "i": "<edge_sel_uri>"},
"p": {"t": "i", "i": "https://trustgraph.ai/ns/concept"},
"o": {"t": "l", "v": "flyby event"}}
{"s": {"t": "i", "i": "<edge_sel_uri>"},
"p": {"t": "i", "i": "https://trustgraph.ai/ns/score"},
"o": {"t": "l", "v": "0.9962"}}
```
Note that `tg:score` is transmitted as a string literal and must
be parsed to a float on the client side.
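
A client-side sketch of that parsing step, assuming the `explain_triples` array has already been decoded into Python dicts in the wire format shown above. The function name and the returned dict shape are hypothetical; only the two predicate URIs are taken from the example.

```python
from typing import Any, Dict, List

TG_CONCEPT = "https://trustgraph.ai/ns/concept"
TG_SCORE = "https://trustgraph.ai/ns/score"


def parse_edge_selections(
    explain_triples: List[Dict[str, Dict[str, str]]],
) -> Dict[str, Dict[str, Any]]:
    """Group concept and score literals by edge selection URI."""
    selections: Dict[str, Dict[str, Any]] = {}
    for triple in explain_triples:
        subject = triple["s"].get("i")
        predicate = triple["p"].get("i")
        value = triple["o"].get("v")
        if subject is None or predicate not in (TG_CONCEPT, TG_SCORE):
            continue  # Not a concept/score triple; ignore it here.
        entry = selections.setdefault(subject, {"concept": "", "score": None})
        if predicate == TG_CONCEPT:
            entry["concept"] = value or ""
        else:
            # The score arrives as a string literal and must be converted.
            try:
                entry["score"] = float(value)
            except (TypeError, ValueError):
                entry["score"] = None
    return selections
```

Leaving `score` as `None` when the literal is missing or malformed mirrors the `Optional[float]` field on `EdgeSelection`.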
### Exploration event
The Exploration event's `edge_count` field now reports the number
of edges selected by the cross-encoder across all hops (previously
it reported the total number of edges retrieved before filtering).
The `entities` list continues to report the seed entities found
by vector search.