trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-11 15:55:12 +02:00

cybermaggedon 96fd1eab15 Use UUID-based URNs for page and chunk IDs (#703)		2026-03-21 21:17:03 +00:00

Page and chunk document IDs were deterministic ({doc_id}/p{num}, {doc_id}/p{num}/c{num}), causing "Document already exists" errors when reprocessing documents through different flows. Content may differ between runs due to different parameters or extractors, so deterministic IDs are incorrect.

Pages now use urn:page:{uuid}, chunks use urn:chunk:{uuid}. Parent-child relationships are tracked via librarian metadata and provenance triples.

Also brings Mistral OCR and Tesseract OCR decoders up to parity with the PDF decoder: librarian fetch/save support, per-page output with unique IDs, and provenance triple emission. Fixes Mistral OCR bug where only the first 5 pages were processed.
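The ID change described in the commit message can be illustrated with a minimal sketch. This is not the project's code: the function names and the `parents` dictionary are hypothetical, the use of `uuid.uuid4()` is an assumption (the commit only says the URNs are UUID-based), and the dictionary merely stands in for the librarian metadata and provenance triples that the commit says track parent-child relationships.

```python
import uuid


def old_page_id(doc_id: str, num: int) -> str:
    # Old deterministic scheme from the commit message: {doc_id}/p{num}.
    return f"{doc_id}/p{num}"


def new_page_id() -> str:
    # New scheme: urn:page:{uuid}. uuid4 is an assumption made for this sketch.
    return f"urn:page:{uuid.uuid4()}"


def new_chunk_id() -> str:
    # New scheme: urn:chunk:{uuid}.
    return f"urn:chunk:{uuid.uuid4()}"


# Hypothetical stand-in for librarian metadata: child ID -> parent ID.
# The ID no longer encodes the hierarchy, so it has to be recorded separately.
parents: dict[str, str] = {}


def add_page(doc_id: str) -> str:
    page_id = new_page_id()
    parents[page_id] = doc_id
    return page_id


def add_chunk(page_id: str) -> str:
    chunk_id = new_chunk_id()
    parents[chunk_id] = page_id
    return chunk_id


# Two processing runs over the same document:
# the old scheme yields the same ID twice, the new scheme does not.
run_a_old = old_page_id("doc-1", 1)
run_b_old = old_page_id("doc-1", 1)
run_a_new = add_page("doc-1")
run_b_new = add_page("doc-1")
chunk = add_chunk(run_a_new)
```

With the old scheme, `run_a_old` and `run_b_old` are identical, which is the collision behind the "Document already exists" error. With the new scheme each run gets its own page ID, and the link back to the document survives only through the separately stored parent mapping.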
trustgraph	Use UUID-based URNs for page and chunk IDs (#703)	2026-03-21 21:17:03 +00:00
pyproject.toml	Loki logging (#586)	2025-12-09 23:24:41 +00:00
README.md	Maint/fix build env (#84)	2024-09-30 19:47:09 +01:00