feat: filter and cap GraphRAG reranker input across full stack (#1021)

- Filter out RDF/RDFS/OWL schema predicates (rdfs:domain, owl:inverseOf,
  etc.) from hop traversal, keeping rdf:type for data signal
- Skip edges where reranker-visible components are unlabeled IRIs, since
  the cross-encoder cannot meaningfully score raw URIs
- Add max-reranker-input safety cap (default 350) to prevent overloading
  the reranker, applied after filtering for maximum useful candidates
- Expose max-reranker-input as per-request parameter through schema,
  translator, REST API, socket client, CLI, and OpenAPI spec
- Update tests
- Update tech spec
cybermaggedon 2026-07-03 15:51:04 +01:00 committed by GitHub
parent 76c4763b9b
commit 68e816e65c
GPG key ID: B5690EEEBB952194
10 changed files with 198 additions and 43 deletions

@@ -404,10 +404,33 @@ no LLM call. These fields are dropped from the Focus entity.
a. Retrieve all edges one hop from the current frontier nodes.
-b. Represent each edge using direction-aware text: from a
-   subject node use `"{predicate} {object}"`, from an object
-   node use `"{subject} {predicate}"`, from a predicate node
-   use `"{subject} {object}"`.
+b. Filter and represent edges for scoring:
+   - **Schema predicate filter.** Edges with RDF/RDFS/OWL
+     schema predicates (`rdfs:domain`, `owl:inverseOf`, etc.)
+     are removed. `rdf:type` is kept as it carries useful
+     data signal.
+   - **IRI filter.** Edges where the reranker-visible text
+     components (after label resolution) are still raw IRIs
+     are removed, because the cross-encoder cannot meaningfully
+     score unresolved URIs. Only the components that would appear
+     in the reranker text are checked, based on traversal
+     direction.
+   - **Direction-aware text.** Each surviving edge is
+     represented using direction-aware text: from a subject
+     node use `"{predicate} {object}"`, from an object node
+     use `"{subject} {predicate}"`, from a predicate node
+     use `"{subject} {object}"`.
+   - **Reranker input cap.** The candidate set is truncated
+     to `max_reranker_input` (default 350) edges. This is a
+     safety measure, not an accuracy optimisation: there is
+     no point in producing a perfectly ranked edge set if the
+     reranker crashes or times out because it was handed
+     thousands of candidates. The cap is applied after
+     filtering so that the most useful edges fill the budget.
c. Score edges against the extracted concepts using the
cross-encoder service.
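
The filtering in step b can be sketched in a few lines of Python. This is an illustration of the behaviour the spec describes, not the code from this commit: the `Edge` type, the helper names, the namespace-prefix test for schema predicates, and the `http`/`urn` test for raw IRIs are all assumptions made for the example. It also assumes the cap simply keeps the first `max_reranker_input` survivors in traversal order, which the spec does not state.

```python
from dataclasses import dataclass

# Assumption: schema predicates are recognised by namespace prefix.
# The actual filter list used by the commit is not shown in the source.
SCHEMA_NAMESPACES = (
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://www.w3.org/2002/07/owl#",
)
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass(frozen=True)
class Edge:
    """Hypothetical triple container used only for this sketch."""
    s: str
    p: str
    o: str


def is_schema_predicate(predicate: str) -> bool:
    # rdf:type is kept because it carries data signal.
    return predicate != RDF_TYPE and predicate.startswith(SCHEMA_NAMESPACES)


def looks_like_iri(text: str) -> bool:
    # Assumption: anything still starting with a URI scheme is unlabeled.
    return text.startswith(("http://", "https://", "urn:"))


def visible_components(edge: Edge, role: str, labels: dict) -> tuple:
    """Return the label-resolved parts the reranker would see.

    `role` is the position of the frontier node within the edge.
    """
    s, p, o = (labels.get(x, x) for x in (edge.s, edge.p, edge.o))
    if role == "subject":
        return (p, o)
    if role == "object":
        return (s, p)
    if role == "predicate":
        return (s, o)
    raise ValueError(f"unknown role: {role!r}")


def prepare_candidates(edges, labels, max_reranker_input=350):
    """Filter (edge, role) pairs, build reranker text, then apply the cap."""
    candidates = []
    for edge, role in edges:
        if is_schema_predicate(edge.p):
            continue  # schema predicate filter
        parts = visible_components(edge, role, labels)
        if any(looks_like_iri(part) for part in parts):
            continue  # IRI filter: only reranker-visible parts are checked
        candidates.append((edge, " ".join(parts)))
    # The cap is applied last, so only surviving edges consume the budget.
    return candidates[:max_reranker_input]
```

Because only the reranker-visible components are checked, an edge whose object has no label is dropped when traversed from its subject (the object would appear in the text) but kept when traversed from that object (the text is `"{subject} {predicate}"`). Note that under this sketch an `rdf:type` edge traversed from its subject survives only if `rdf:type` itself resolves to a label.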