The id field in pipeline Metadata was being overwritten at each processing (#686)

The id field in pipeline Metadata was being overwritten at each processing
stage (document → page → chunk), causing knowledge storage to create
separate cores per chunk instead of grouping by document.

Add a root field that:
- Is set by librarian to the original document ID
- Is copied unchanged through PDF decoder, chunkers, and extractors
- Is used by knowledge storage for document_id grouping (with fallback to id)
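
The grouping fallback described above can be sketched as follows. This is a minimal illustration, not the project's code: the `Metadata` dataclass is a stand-in for the real schema, `document_id_for` is a hypothetical helper, and the ID strings are invented.

```python
from dataclasses import dataclass


@dataclass
class Metadata:
    # Hypothetical stand-in for the pipeline schema, not the real class.
    id: str = ""
    root: str = ""  # original document ID; empty string when not set
    user: str = ""
    collection: str = ""


def document_id_for(metadata: Metadata) -> str:
    # Assumed fallback rule: prefer root, use id when root is empty.
    return metadata.root or metadata.id


# A chunk whose id was rewritten downstream still groups under its document.
chunk = Metadata(id="doc-1/p3/c7", root="doc-1")
# A message with no root set falls back to id.
legacy = Metadata(id="doc-2")
```

Because root defaults to an empty string, a plain truthiness check is enough to tell "not set" apart from a real document ID in this sketch.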

Changes:
- Add root field to Metadata schema with empty string default
- Set root=document.id in librarian when initiating document processing
- Copy root through PDF decoder, recursive chunker, and all extractors
- Update knowledge storage to use root (or id as fallback) for grouping
- Add root handling to translators and gateway serialization
- Update test mock Metadata class to include root parameter
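
As a rough sketch of what "copy root through" means for a processing stage: the stage assigns a new id to each derived item and carries root over unchanged. The `Metadata` dataclass, the `derive` helper, and the ID format below are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Metadata:
    # Hypothetical stand-in for the pipeline schema, not the real class.
    id: str = ""
    root: str = ""
    user: str = ""
    collection: str = ""


def derive(parent: Metadata, child_id: str) -> Metadata:
    # The derived item gets a new id; root, user and collection are copied.
    return replace(parent, id=child_id)


# document -> page -> chunk: id changes at each stage, root does not.
doc = Metadata(id="doc-1", root="doc-1", user="alice", collection="default")
page = derive(doc, "doc-1/p3")
chunk = derive(page, "doc-1/p3/c7")
```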
cybermaggedon 2026-03-11 12:16:39 +00:00 committed by GitHub
parent aa4f5c6c00
commit 286f762369
GPG key ID: B5690EEEBB952194
15 changed files with 48 additions and 4 deletions


@@ -334,6 +334,7 @@ class Processor(AsyncProcessor):
 triples_msg = Triples(
     metadata=Metadata(
         id=doc_uri,
+        root=document.id,
         user=processing.user,
         collection=processing.collection,
     ),
@@ -380,6 +381,7 @@ class Processor(AsyncProcessor):
 doc = TextDocument(
     metadata = Metadata(
         id = document.id,
+        root = document.id,
         user = processing.user,
         collection = processing.collection
     ),
@@ -390,6 +392,7 @@ class Processor(AsyncProcessor):
 doc = TextDocument(
     metadata = Metadata(
         id = document.id,
+        root = document.id,
         user = processing.user,
         collection = processing.collection
     ),
@@ -405,6 +408,7 @@ class Processor(AsyncProcessor):
 doc = Document(
     metadata = Metadata(
         id = document.id,
+        root = document.id,
         user = processing.user,
         collection = processing.collection
     ),
@@ -415,6 +419,7 @@ class Processor(AsyncProcessor):
 doc = Document(
     metadata = Metadata(
         id = document.id,
+        root = document.id,
         user = processing.user,
         collection = processing.collection
     ),