trustgraph/tests/unit
cybermaggedon 5c6fe90fe2
Add universal document decoder with multi-format support (#705)

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
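
The grouping strategies above could be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: the `Element` class, its `kind` values and both function names are hypothetical, and it assumes the decoder yields a flat, ordered list of typed elements. Only the heading and count strategies are shown.

```python
from dataclasses import dataclass


@dataclass
class Element:
    # Hypothetical stand-in for a decoded document element.
    kind: str  # e.g. "Title", "NarrativeText", "Table"
    text: str


def group_by_heading(elements):
    """Start a new section each time a "Title" element is seen."""
    sections, current = [], []
    for el in elements:
        if el.kind == "Title" and current:
            sections.append(current)
            current = []
        current.append(el)
    if current:
        sections.append(current)
    return sections


def group_by_count(elements, n):
    """Split the element list into consecutive sections of at most n elements."""
    return [elements[i:i + n] for i in range(0, len(elements), n)]
```

Heading-based grouping keeps a heading together with the content that follows it, while count-based grouping gives sections of predictable length regardless of document structure.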

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
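
A minimal sketch of the "check MIME type and skip" behaviour described above, under stated assumptions: the `mime_type` metadata key, the `decode_if_supported` helper and the logger name are hypothetical and are not taken from the TrustGraph code.

```python
import logging

logger = logging.getLogger("pdf-decoder-sketch")

# Assumption: a PDF-only decoder accepts exactly this MIME type.
PDF_MIME_TYPES = frozenset({"application/pdf"})


def decode_if_supported(metadata, decode):
    """Call decode() only for supported MIME types; otherwise skip quietly."""
    mime = metadata.get("mime_type", "")
    if mime not in PDF_MIME_TYPES:
        # Skip rather than raise, so another decoder can handle the document.
        logger.info("Skipping unsupported MIME type: %r", mime)
        return None
    return decode()
```

Returning `None` instead of raising is one way to make the skip "graceful": an unsupported document is not treated as a processing failure.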

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths; documents always use
document_id for content retrieval.
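
The simplified routing rule can be illustrated like this. It is a sketch, not the librarian's actual code: `route_for` is a hypothetical name, and stripping MIME parameters such as `; charset=utf-8` is an assumption the commit message does not mention.

```python
def route_for(mime_type):
    """Return the target flow for a document based on its MIME type."""
    # Assumption: ignore parameters and case when comparing MIME types.
    base = mime_type.split(";", 1)[0].strip().lower()
    # text/plain goes to text-load; everything else goes to document-load.
    return "text-load" if base == "text/plain" else "document-load"
```

With the whitelist removed, an unrecognised type simply falls through to document-load instead of being rejected at ingestion.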

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
test_agent Updated test suite for explainability & provenance (#696) 2026-03-13 14:27:42 +00:00
test_base Embeddings API scores (#671) 2026-03-09 10:53:44 +00:00
test_chunking Fix ontology RAG pipeline + add query concurrency (#691) 2026-03-12 11:34:42 +00:00
test_cli Fix/tests (#647) 2026-02-23 22:01:47 +00:00
test_clients Embeddings API scores (#671) 2026-03-09 10:53:44 +00:00
test_concurrency Updated test suite for explainability & provenance (#696) 2026-03-13 14:27:42 +00:00
test_config Structure data mvp (#452) 2025-08-07 20:47:20 +01:00
test_cores Remove redundant metadata (#685) 2026-03-11 10:51:39 +00:00
test_decoding Add universal document decoder with multi-format support (#705) 2026-03-23 12:56:35 +00:00
test_direct Updated test suite for explainability & provenance (#696) 2026-03-13 14:27:42 +00:00
test_embeddings Updated test suite for explainability & provenance (#696) 2026-03-13 14:27:42 +00:00
test_extract Updated test suite for explainability & provenance (#696) 2026-03-13 14:27:42 +00:00
test_gateway Remove redundant metadata (#685) 2026-03-11 10:51:39 +00:00
test_knowledge_graph Remove schema:subjectOf edges from KG extraction (#695) 2026-03-13 12:11:21 +00:00
test_librarian Add universal document decoder with multi-format support (#705) 2026-03-23 12:56:35 +00:00
test_provenance Use UUID-based URNs for page and chunk IDs (#703) 2026-03-21 21:17:03 +00:00
test_query Knowledge core processing updated for embeddings interface change (#681) 2026-03-10 13:28:16 +00:00
test_rdf Updated test suite for explainability & provenance (#696) 2026-03-13 14:27:42 +00:00
test_reliability Updated test suite for explainability & provenance (#696) 2026-03-13 14:27:42 +00:00
test_retrieval Enhance retrieval pipelines: 4-stage GraphRAG, DocRAG grounding (#697) 2026-03-16 12:12:13 +00:00
test_rev_gateway Fix tests (#593) 2025-12-19 08:53:21 +00:00
test_storage Remove redundant metadata (#685) 2026-03-11 10:51:39 +00:00
test_structured_data Updated test suite for explainability & provenance (#696) 2026-03-13 14:27:42 +00:00
test_text_completion Updated test suite for explainability & provenance (#696) 2026-03-13 14:27:42 +00:00
__init__.py Test suite executed from CI pipeline (#433) 2025-07-14 14:57:44 +01:00
test_prompt_manager.py Feature/prompts jsonl (#619) 2026-01-26 17:38:00 +00:00
test_prompt_manager_edge_cases.py Update to enable knowledge extraction using the agent framework (#439) 2025-07-21 14:31:57 +01:00
test_python_api_client.py Structured data 2 (#645) 2026-02-23 15:56:29 +00:00