mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 00:16:23 +02:00
Add universal document decoder with multi-format support using 'unstructured'.

New universal decoder service powered by the unstructured library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF, ODT, EPUB and more through a single service. Tables are preserved as HTML markup for better downstream extraction. Images are stored in the librarian but excluded from the text pipeline.

Configurable section grouping strategies (whole-document, heading, element-type, count, size) for non-page formats. Page-based formats (PDF, PPTX, XLSX) are automatically grouped by page.

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal) now share the "document-decoder" ident so they are interchangeable. PDF-only decoders fetch document metadata to check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any document format can be ingested. Simplified routing so text/plain goes to text-load and everything else goes to document-load. Removed dual inline/streaming data paths; documents always use document_id for content retrieval.

New provenance entity types (tg:Section, tg:Image) and metadata predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for richer explainability.

Universal decoder is in its own package (trustgraph-unstructured) and container image (trustgraph-unstructured).

| | | |
|---|---|---|
| test_agent | ||
| test_base | ||
| test_chunking | ||
| test_cli | ||
| test_clients | ||
| test_concurrency | ||
| test_config | ||
| test_cores | ||
| test_decoding | ||
| test_direct | ||
| test_embeddings | ||
| test_extract | ||
| test_gateway | ||
| test_knowledge_graph | ||
| test_librarian | ||
| test_provenance | ||
| test_query | ||
| test_rdf | ||
| test_reliability | ||
| test_retrieval | ||
| test_rev_gateway | ||
| test_storage | ||
| test_structured_data | ||
| test_text_completion | ||
| __init__.py | ||
| test_prompt_manager.py | ||
| test_prompt_manager_edge_cases.py | ||
| test_python_api_client.py | ||
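
The routing rule described in the commit message (text/plain goes to text-load, everything else goes to document-load) and the MIME check made by the PDF-only decoders can be sketched roughly as below. This is a minimal illustration, not the librarian's actual code: the function names `route_document` and `pdf_decoder_accepts` are hypothetical, and stripping MIME parameters such as `; charset=utf-8` before comparing is an assumption rather than documented behaviour. Only the flow names `text-load` and `document-load` come from the commit message.

```python
# Hypothetical sketch of the simplified routing; names are illustrative
# and do not reflect TrustGraph's real API.

TEXT_LOAD = "text-load"          # flow named in the commit message
DOCUMENT_LOAD = "document-load"  # flow named in the commit message


def _base_type(mime_type: str) -> str:
    """Return the bare media type, lower-cased, without parameters.

    Assumption: parameters like '; charset=utf-8' are ignored for routing.
    """
    return mime_type.split(";", 1)[0].strip().lower()


def route_document(mime_type: str) -> str:
    """Choose the ingestion flow: plain text vs. everything else."""
    if _base_type(mime_type) == "text/plain":
        return TEXT_LOAD
    return DOCUMENT_LOAD


def pdf_decoder_accepts(mime_type: str) -> bool:
    """A PDF-only decoder would skip any document that is not a PDF."""
    return _base_type(mime_type) == "application/pdf"


if __name__ == "__main__":
    for mt in ("text/plain", "application/pdf", "text/html"):
        print(mt, route_document(mt), pdf_decoder_accepts(mt))
```

With no whitelist in place, an unrecognised type simply falls through to `document-load`, where a decoder sharing the "document-decoder" ident either handles it or skips it.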