The id field in pipeline Metadata was being overwritten at each processing (#686)

The id field in pipeline Metadata was being overwritten at each processing stage (document → page → chunk), causing knowledge storage to create separate cores per chunk instead of grouping by document.

Add a root field that:
- Is set by librarian to the original document ID
- Is copied unchanged through PDF decoder, chunkers, and extractors
- Is used by knowledge storage for document_id grouping (with fallback to id)

Changes:
- Add root field to Metadata schema with empty string default
- Set root=document.id in librarian when initiating document processing
- Copy root through PDF decoder, recursive chunker, and all extractors
- Update knowledge storage to use root (or id as fallback) for grouping
- Add root handling to translators and gateway serialization
- Update test mock Metadata class to include root parameter
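The propagation and fallback described above can be sketched roughly as follows. This is an illustration only, not the code from this commit: the dataclass is a simplified stand-in for the real Metadata schema, and derive(), document_id() and the example ids are hypothetical names chosen for the sketch.

```python
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Metadata:
    # Simplified stand-in for the pipeline Metadata schema (assumption).
    id: str
    user: str
    collection: str
    root: str = ""  # empty-string default, as stated in the commit message


def derive(parent: Metadata, new_id: str) -> Metadata:
    """Hypothetical stage helper: assign a new id, copy root unchanged."""
    return replace(parent, id=new_id)


def document_id(meta: Metadata) -> str:
    """Grouping key: root when it is set, otherwise fall back to id."""
    return meta.root or meta.id


# The librarian step sets root to the original document id.
doc = Metadata(id="doc-1", user="alice", collection="default", root="doc-1")

# Later stages overwrite id but leave root alone (document -> page -> chunk).
page = derive(doc, "doc-1/p1")
chunks = [derive(page, f"doc-1/p1/c{i}") for i in range(3)]

# Grouping by document_id() puts every chunk under the same document.
groups = defaultdict(list)
for chunk in chunks:
    groups[document_id(chunk)].append(chunk.id)

# Metadata created without root still groups by its own id.
legacy = Metadata(id="old-chunk", user="alice", collection="default")
```

Because root defaults to the empty string, which is falsy in Python, the `meta.root or meta.id` expression covers messages produced before the field existed without a separate code path.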
2026-06-15 17:55:12 +02:00 · 2026-03-11 12:16:39 +00:00 · 286f762369
commit 286f762369
parent aa4f5c6c00
15 changed files with 48 additions and 4 deletions
--- a/tests/unit/test_knowledge_graph/conftest.py
+++ b/tests/unit/test_knowledge_graph/conftest.py
@@ -29,8 +29,9 @@ class Triple:
        self.o = o

 class Metadata:
-    def __init__(self, id, user, collection):
+    def __init__(self, id, user, collection, root=""):
        self.id = id
+        self.root = root
        self.user = user
        self.collection = collection
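
A small sketch of why the keyword default matters for the test mock: existing call sites that pass only three arguments should keep working, while new tests can opt in to root. The class body mirrors the post-change mock shown in the hunk above; the argument values are made up for illustration.

```python
class Metadata:
    # Mirrors the updated test mock from the hunk above.
    def __init__(self, id, user, collection, root=""):
        self.id = id
        self.root = root
        self.user = user
        self.collection = collection


# Pre-existing three-argument call: root falls back to the empty default.
old_style = Metadata("chunk-7", "test_user", "test_collection")

# New call that records the originating document (hypothetical values).
new_style = Metadata("chunk-7", "test_user", "test_collection", root="doc-1")
```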