trustgraph/tests
cybermaggedon 5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
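
A minimal sketch of how the grouping strategies above might look. The
Element class and the group_by_* function names are hypothetical stand-ins
for illustration, not the decoder's actual API; only four of the listed
strategies are shown.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional


@dataclass
class Element:
    # Hypothetical stand-in for one parsed document element.
    category: str                      # e.g. "Title", "NarrativeText", "Table"
    text: str
    page_number: Optional[int] = None  # set only for page-based formats


def group_by_page(elements: List[Element]) -> List[List[Element]]:
    # Page-based formats: consecutive elements on the same page form a group.
    return [list(g) for _, g in groupby(elements, key=lambda e: e.page_number)]


def group_by_heading(elements: List[Element]) -> List[List[Element]]:
    # Start a new section at every Title element.
    sections: List[List[Element]] = []
    for e in elements:
        if e.category == "Title" or not sections:
            sections.append([])
        sections[-1].append(e)
    return sections


def group_by_count(elements: List[Element], n: int) -> List[List[Element]]:
    # Fixed number of elements per section.
    return [elements[i:i + n] for i in range(0, len(elements), n)]


def group_by_size(elements: List[Element], max_chars: int) -> List[List[Element]]:
    # Close a section once adding the next element would exceed max_chars.
    sections, current, size = [], [], 0
    for e in elements:
        if current and size + len(e.text) > max_chars:
            sections.append(current)
            current, size = [], 0
        current.append(e)
        size += len(e.text)
    if current:
        sections.append(current)
    return sections
```

The whole-document strategy is the degenerate case: a single section
holding every element.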

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable.  PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
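
As an illustration of the skip behaviour, a PDF-only decoder could guard
its handler roughly as follows. This is a sketch under assumptions: the
"mime_type" metadata key and the function names are hypothetical and are
not taken from the codebase.

```python
from typing import Callable, Optional

# MIME types a PDF-only decoder is assumed to accept.
PDF_MIME_TYPES = {"application/pdf"}


def should_decode(metadata: dict) -> bool:
    # "mime_type" is a hypothetical key; parameters such as "; charset=..."
    # are stripped before comparing.
    mime = str(metadata.get("mime_type", "")).split(";", 1)[0].strip().lower()
    return mime in PDF_MIME_TYPES


def handle(metadata: dict, decode: Callable[[], str]) -> Optional[str]:
    # Skip unsupported formats quietly instead of raising.
    if not should_decode(metadata):
        return None
    return decode()
```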

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
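
The simplified routing rule can be sketched as below. The function name is
hypothetical, and normalising the MIME type (lower-casing, dropping
parameters) is an assumption rather than confirmed librarian behaviour.

```python
def route_document(mime_type: str) -> str:
    # Drop parameters such as "; charset=utf-8" and normalise case (assumed).
    base = mime_type.split(";", 1)[0].strip().lower()
    # text/plain goes to text-load; every other format goes to document-load.
    return "text-load" if base == "text/plain" else "document-load"
```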

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
Name              Last commit                                                             Date
contract          Add unified explainability support and librarian storage for (#693)     2026-03-12 21:40:09 +00:00
integration       Enhance retrieval pipelines: 4-stage GraphRAG, DocRAG grounding (#697)  2026-03-16 12:12:13 +00:00
unit              Add universal document decoder with multi-format support (#705)         2026-03-23 12:56:35 +00:00
utils             Streaming rag responses (#568)                                          2025-11-26 19:47:39 +00:00
__init__.py       Test suite executed from CI pipeline (#433)                             2025-07-14 14:57:44 +01:00
conftest.py       Fix test async warnings (#601)                                          2026-01-06 22:09:34 +00:00
pytest.ini        Entity-centric graph (#633)                                             2026-02-16 13:26:43 +00:00
requirements.txt  Test suite executed from CI pipeline (#433)                             2025-07-14 14:57:44 +01:00