mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 00:16:23 +02:00
Add universal document decoder with multi-format support using 'unstructured'.

New universal decoder service powered by the unstructured library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF, ODT, EPUB and more through a single service. Tables are preserved as HTML markup for better downstream extraction. Images are stored in the librarian but excluded from the text pipeline. Configurable section grouping strategies (whole-document, heading, element-type, count, size) for non-page formats. Page-based formats (PDF, PPTX, XLSX) are automatically grouped by page.

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal) now share the "document-decoder" ident so they are interchangeable. PDF-only decoders fetch document metadata to check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any document format can be ingested. Simplified routing so text/plain goes to text-load and everything else goes to document-load. Removed dual inline/streaming data paths; documents always use document_id for content retrieval.

New provenance entity types (tg:Section, tg:Image) and metadata predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for richer explainability.

Universal decoder is in its own package (trustgraph-unstructured) and container image (trustgraph-unstructured).

| | | |
|---|---|---|
| .. | ||
| __TEMPLATE.md | ||
| agent-explainability.md | ||
| architecture-principles.md | ||
| cassandra-consolidation.md | ||
| cassandra-performance-refactor.md | ||
| collection-management.md | ||
| document-embeddings-chunk-id.md | ||
| embeddings-batch-processing.md | ||
| entity-centric-graph.md | ||
| explainability-cli.md | ||
| extraction-flows.md | ||
| extraction-provenance-subgraph.md | ||
| extraction-time-provenance.md | ||
| flow-class-definition.md | ||
| flow-configurable-parameters.md | ||
| graph-contexts.md | ||
| graphql-query.md | ||
| graphrag-performance-optimization.md | ||
| import-export-graceful-shutdown.md | ||
| jsonl-prompt-output.md | ||
| large-document-loading.md | ||
| logging-strategy.md | ||
| mcp-tool-arguments.md | ||
| mcp-tool-bearer-token.md | ||
| minio-to-s3-migration.md | ||
| more-config-cli.md | ||
| multi-tenant-support.md | ||
| neo4j-user-collection-isolation.md | ||
| ontology-extract-phase-2.md | ||
| ontology.md | ||
| ontorag.md | ||
| openapi-spec.md | ||
| pubsub.md | ||
| python-api-refactor.md | ||
| query-time-explainability.md | ||
| rag-streaming-support.md | ||
| schema-refactoring-proposal.md | ||
| streaming-llm-responses.md | ||
| structured-data-2.md | ||
| structured-data-descriptor.md | ||
| structured-data-schemas.md | ||
| structured-data.md | ||
| structured-diag-service.md | ||
| tool-group.md | ||
| tool-services.md | ||
| universal-decoder.md | ||
| vector-store-lifecycle.md | ||
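
The simplified librarian routing described in the commit message above (text/plain goes to text-load, everything else goes to document-load) can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `route_document` is hypothetical, and stripping MIME parameters such as `charset` before comparing is an assumption the commit message does not state.

```python
def route_document(mime_type: str) -> str:
    """Pick a loader name for a document based on its MIME type.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Assumption: parameters such as "; charset=utf-8" are ignored and the
    # comparison is case-insensitive. The commit message does not specify this.
    base_type = mime_type.split(";", 1)[0].strip().lower()

    # Per the commit message: plain text goes to text-load,
    # every other format goes to document-load.
    if base_type == "text/plain":
        return "text-load"
    return "document-load"


if __name__ == "__main__":
    for mime in ("text/plain", "application/pdf", "text/html"):
        print(mime, "->", route_document(mime))
```

Because there is no whitelist in this sketch, an unknown MIME type still routes to document-load, which matches the stated goal that any document format can be ingested.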