feat: direction-aware reranker text in GraphRAG hop-and-filter (#1016)

The reranker document text now reflects the traversal direction, showing only the new information relative to the frontier entity:

- From S (subject is frontier): text = "{predicate} {object}"
- From O (object is frontier): text = "{subject} {predicate}"
- From P (predicate is frontier): text = "{subject} {object}"

This eliminates duplicate reranker texts when traversing inward from shared object nodes (e.g. 18 CPUs all producing identical "hasSubcategory Processors" text when the subject was dropped).

execute_batch_triple_queries now returns (triple, direction) tuples so hop_and_filter can select the appropriate text format.

Updates tech spec to document the direction-aware approach. Adds unit tests for direction tracking and reranker text construction.
2026-07-03 06:51:00 +02:00 · 2026-07-02 21:14:47 +01:00
commit db7fdbc652
parent 9cf7dcb578
4 changed files with 502 additions and 19 deletions
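
The direction-aware text described in the commit message can be sketched as follows. This is a minimal illustration, not the code from this commit: the `Triple` type, the `reranker_text` helper, the direction labels (`"from_s"`, `"from_o"`, `"from_p"`) and the sample product names are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical triple container: subject, predicate, object."""
    s: str
    p: str
    o: str


def reranker_text(triple: Triple, direction: str) -> str:
    """Return the part of the triple that is new relative to the frontier.

    The direction labels are assumptions made for this sketch.
    """
    if direction == "from_s":   # subject is the frontier entity
        return f"{triple.p} {triple.o}"
    if direction == "from_o":   # object is the frontier entity
        return f"{triple.s} {triple.p}"
    if direction == "from_p":   # predicate is the frontier entity
        return f"{triple.s} {triple.o}"
    raise ValueError(f"unknown direction: {direction!r}")


# Three made-up products that share one object node.
edges = [Triple(f"CPU-{i}", "hasSubcategory", "Processors") for i in range(3)]

# Always dropping the subject collapses every edge to the same string.
old_texts = {f"{t.p} {t.o}" for t in edges}

# Traversing inward from the shared object keeps each edge distinct.
new_texts = {reranker_text(t, "from_o") for t in edges}
```

With these inputs `old_texts` holds a single string while `new_texts` holds one string per product, which is the duplicate-text problem the commit message describes.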
--- a/docs/tech-specs/graph-rag-semantic-filter.md
+++ b/docs/tech-specs/graph-rag-semantic-filter.md
@@ -224,12 +224,27 @@ The current embedding pre-filter represents each edge as
 - **Drop commas.**  Commas add tokenisation noise without semantic
  value.

-- **Drop the subject.**  The subject identifies which entity the
-  edge belongs to, but it does not contribute to whether the
-  edge's content is relevant to the query.  The predicate and
-  object carry the semantic meaning — what relationship exists
-  and what it connects to.  Representing edges as `"{p} {o}"`
-  produces cleaner cross-encoder matches.
+- **Direction-aware text.**  The reranker text should highlight
+  the *new* information relative to the traversal direction.
+  The frontier entity is already known context — repeating it
+  adds noise and, when traversing from an object node, causes
+  many edges to produce identical reranker text (e.g. 18
+  products sharing the same `hasSubcategory Processors` triple
+  all collapse to the same string when the subject is dropped).
+
+  The text is constructed based on which position the frontier
+  entity occupied in the triple:
+
+  - **From subject** (s=entity): `"{predicate} {object}"` —
+    the subject is known, predicate and object are new.
+  - **From object** (o=entity): `"{subject} {predicate}"` —
+    the object is known, subject and predicate are new.
+  - **From predicate** (p=entity): `"{subject} {object}"` —
+    the predicate is known, subject and object are new.
+
+  This eliminates the duplicate-text problem that arises when
+  traversing inward from a shared object node, and gives the
+  cross-encoder a more informative signal at every hop.

 #### Remove the embedding pre-filter (step 3)

@@ -389,7 +404,10 @@ no LLM call.  These fields are dropped from the Focus entity.

   a. Retrieve all edges one hop from the current frontier nodes.

-   b. Represent each edge as `"{predicate} {object}"`.
+   b. Represent each edge using direction-aware text: from a
+      subject node use `"{predicate} {object}"`, from an object
+      node use `"{subject} {predicate}"`, from a predicate node
+      use `"{subject} {object}"`.

   c. Score edges against the extracted concepts using the
      cross-encoder service.
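
Steps a and b of the hop can be sketched with an in-memory edge list. This is only an illustration under stated assumptions: `one_hop` is a hypothetical stand-in for the real batch query (it is not `execute_batch_triple_queries`), `edge_text` is a hypothetical helper, and the `"s"`/`"o"`/`"p"` direction tags and sample data are made up.

```python
from typing import Iterable

Triple = tuple[str, str, str]  # (subject, predicate, object)


def one_hop(graph: Iterable[Triple], frontier: set[str]) -> list[tuple[Triple, str]]:
    """Return (triple, direction) pairs for every edge touching the frontier.

    The direction records which position the frontier entity occupied.
    """
    hits: list[tuple[Triple, str]] = []
    for s, p, o in graph:
        if s in frontier:
            hits.append(((s, p, o), "s"))
        if o in frontier:
            hits.append(((s, p, o), "o"))
        if p in frontier:
            hits.append(((s, p, o), "p"))
    return hits


def edge_text(triple: Triple, direction: str) -> str:
    """Build the text to score, omitting the already-known frontier entity."""
    s, p, o = triple
    return {"s": f"{p} {o}", "o": f"{s} {p}", "p": f"{s} {o}"}[direction]


graph = [
    ("CPU-A", "hasSubcategory", "Processors"),
    ("CPU-B", "hasSubcategory", "Processors"),
    ("CPU-A", "madeBy", "Acme"),
]

# Hop outward from a subject node, then inward from a shared object node.
from_subject = [edge_text(t, d) for t, d in one_hop(graph, {"CPU-A"})]
from_object = [edge_text(t, d) for t, d in one_hop(graph, {"Processors"})]
```

The resulting strings are what would be passed to the cross-encoder in step c; the scoring call itself is left out because its interface is not shown here.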