mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-07-05 11:22:11 +02:00
feat: filter and cap GraphRAG reranker input across full stack (#1021)
- Filter out RDF/RDFS/OWL schema predicates (rdfs:domain, owl:inverseOf, etc.) from hop traversal, keeping rdf:type for data signal
- Skip edges where reranker-visible components are unlabeled IRIs, since the cross-encoder cannot meaningfully score raw URIs
- Add max-reranker-input safety cap (default 350) to prevent overloading the reranker, applied after filtering for maximum useful candidates
- Expose max-reranker-input as per-request parameter through schema, translator, REST API, socket client, CLI, and OpenAPI spec
- Update tests
- Update tech spec
parent 76c4763b9b
commit 68e816e65c
10 changed files with 198 additions and 43 deletions
@@ -404,10 +404,33 @@ no LLM call. These fields are dropped from the Focus entity.

    a. Retrieve all edges one hop from the current frontier nodes.

-   b. Represent each edge using direction-aware text: from a
-      subject node use `"{predicate} {object}"`, from an object
-      node use `"{subject} {predicate}"`, from a predicate node
-      use `"{subject} {object}"`.
+   b. Filter and represent edges for scoring:
+
+      - **Schema predicate filter.** Edges with RDF/RDFS/OWL
+        schema predicates (`rdfs:domain`, `owl:inverseOf`, etc.)
+        are removed. `rdf:type` is kept as it carries useful
+        data signal.
+
+      - **IRI filter.** Edges where the reranker-visible text
+        components (after label resolution) are still raw IRIs
+        are removed — the cross-encoder cannot meaningfully score
+        unresolved URIs. Only the components that would appear
+        in the reranker text are checked, based on traversal
+        direction.
+
+      - **Direction-aware text.** Each surviving edge is
+        represented using direction-aware text: from a subject
+        node use `"{predicate} {object}"`, from an object node
+        use `"{subject} {predicate}"`, from a predicate node
+        use `"{subject} {object}"`.
+
+      - **Reranker input cap.** The candidate set is truncated
+        to `max_reranker_input` (default 350) edges. This is a
+        safety measure, not an accuracy optimisation — there is
+        no point in producing a perfectly ranked edge set if the
+        reranker crashes or times out because it was handed
+        thousands of candidates. The cap is applied after
+        filtering so that the most useful edges fill the budget.

    c. Score edges against the extracted concepts using the
       cross-encoder service.
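
The filtering in step b can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the code from this commit. The names `Edge`, `is_schema_predicate`, `looks_like_iri`, `edge_text` and `prepare_candidates` are hypothetical. The spec names only `rdfs:domain` and `owl:inverseOf` as examples, so treating every predicate in the RDF, RDFS and OWL namespaces (other than `rdf:type`) as a schema predicate is an assumption, as is the simple prefix test used to detect an unresolved IRI.

```python
from dataclasses import dataclass
from typing import Optional

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumption: "schema predicate" means anything in these namespaces
# except rdf:type. The real filter may use an explicit list instead.
SCHEMA_NAMESPACES = (
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://www.w3.org/2002/07/owl#",
)

# Which edge fields appear in the reranker text, keyed by the role the
# frontier node plays in the edge (the traversal direction).
VISIBLE_FIELDS = {
    "subject": ("p", "o"),    # "{predicate} {object}"
    "object": ("s", "p"),     # "{subject} {predicate}"
    "predicate": ("s", "o"),  # "{subject} {object}"
}


@dataclass(frozen=True)
class Edge:
    s: str
    p: str
    o: str


def is_schema_predicate(predicate: str) -> bool:
    """True for RDF/RDFS/OWL predicates, except rdf:type which is kept."""
    return predicate != RDF_TYPE and predicate.startswith(SCHEMA_NAMESPACES)


def looks_like_iri(text: str) -> bool:
    """Hypothetical heuristic for a component with no resolved label."""
    return text.startswith(("http://", "https://", "urn:"))


def edge_text(edge: Edge, role: str, labels: dict) -> Optional[str]:
    """Direction-aware text, or None if a visible part is still a raw IRI."""
    parts = []
    for field in VISIBLE_FIELDS[role]:
        value = getattr(edge, field)
        parts.append(labels.get(value, value))  # label resolution
    if any(looks_like_iri(part) for part in parts):
        return None
    return " ".join(parts)


def prepare_candidates(edges, labels, max_reranker_input=350):
    """Filter (edge, role) pairs, build reranker text, then apply the cap."""
    candidates = []
    for edge, role in edges:
        if is_schema_predicate(edge.p):
            continue
        text = edge_text(edge, role, labels)
        if text is None:
            continue
        candidates.append((edge, text))
    # The cap comes last so that filtered-out edges never use up the budget.
    return candidates[:max_reranker_input]
```

Only the components that would reach the reranker are checked in this sketch, so an edge with an unlabelled object still survives when it is reached from its object node, because the text is then built from subject and predicate alone.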