Add universal document decoder with multi-format support (#705)

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-22 05:08:06 +02:00

Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more in one place. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Section grouping strategies (whole-document, heading,
element-type, count, size) are configurable for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
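The heading, count and size strategies named above could be sketched roughly as follows. This is an illustration only, not the code in this commit: the `Element` class, the function names, and the use of a "Title" category to mark headings are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Element:
    """Hypothetical stand-in for one parsed document element."""
    category: str  # assumed labels, e.g. "Title", "NarrativeText", "Table"
    text: str


def group_by_heading(elements: Iterable[Element]) -> List[List[Element]]:
    """Start a new section each time a heading ("Title") element appears."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    for el in elements:
        if el.category == "Title" and current:
            sections.append(current)
            current = []
        current.append(el)
    if current:
        sections.append(current)
    return sections


def group_by_count(elements: Iterable[Element], count: int) -> List[List[Element]]:
    """Put a fixed number of elements in each section."""
    if count < 1:
        raise ValueError("count must be at least 1")
    items = list(elements)
    return [items[i:i + count] for i in range(0, len(items), count)]


def group_by_size(elements: Iterable[Element], max_chars: int) -> List[List[Element]]:
    """Close a section before it would exceed max_chars of text.

    A single element longer than max_chars still gets its own section.
    """
    sections: List[List[Element]] = []
    current: List[Element] = []
    size = 0
    for el in elements:
        if current and size + len(el.text) > max_chars:
            sections.append(current)
            current, size = [], 0
        current.append(el)
        size += len(el.text)
    if current:
        sections.append(current)
    return sections
```

Whole-document grouping is the degenerate case of a single section holding every element, and element-type grouping would split on a change of `category` in the same way the heading strategy splits on "Title".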

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable.  PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
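A minimal sketch of the MIME-type guard described above, assuming the document metadata is available as a dictionary. The `kind` and `id` field names, the `should_decode` and `handle` helpers, and the `decode` callback are hypothetical and are not taken from the commit.

```python
from typing import Callable, Mapping, Optional

# Assumption: a PDF-only decoder accepts just this media type.
PDF_MIME_TYPES = frozenset({"application/pdf"})


def should_decode(mime_type: str, supported=PDF_MIME_TYPES) -> bool:
    """Return True if the media type, ignoring parameters and case, is supported."""
    base = mime_type.split(";", 1)[0].strip().lower()
    return base in supported


def handle(metadata: Mapping[str, str],
           decode: Callable[[str], str]) -> Optional[str]:
    """Decode a document only when its MIME type is supported, else skip it.

    `metadata["kind"]` (MIME type) and `metadata["id"]` are assumed field names.
    """
    if not should_decode(metadata.get("kind", "")):
        return None  # skip quietly instead of raising
    return decode(metadata["id"])
```

Returning `None` rather than raising lets several decoders listen under the same ident while each one ignores the formats it cannot handle.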

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths; documents always use
document_id for content retrieval.
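The simplified routing rule could look roughly like this. The `route` function is hypothetical; only the `text-load` and `document-load` destination names come from the message above, and treating `text/plain; charset=...` as plain text is an assumption.

```python
def route(mime_type: str) -> str:
    """Pick the destination for a document from its MIME type.

    text/plain goes to "text-load"; every other type goes to "document-load".
    """
    # Assumption: media type parameters and letter case are ignored.
    base = mime_type.split(";", 1)[0].strip().lower()
    return "text-load" if base == "text/plain" else "document-load"
```

With no whitelist in front of it, an unknown or empty MIME type simply falls through to `document-load`, where a decoder can decide whether it supports the format.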

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.
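As an illustration of how these predicates might be populated, the sketch below derives them from a list of element categories. The triple layout, the compact `tg:` and `rdf:` prefixes as plain strings, the one-triple-per-type encoding of tg:elementTypes, and the "Table" and "Image" category labels are all assumptions; the commit may encode these values differently.

```python
from collections import Counter
from typing import List, Sequence, Tuple, Union

Triple = Tuple[str, str, Union[str, int]]


def section_triples(section_uri: str, categories: Sequence[str]) -> List[Triple]:
    """Build provenance triples for one section from its element categories."""
    counts = Counter(categories)
    triples: List[Triple] = [(section_uri, "rdf:type", "tg:Section")]
    # Assumption: one tg:elementTypes triple per distinct element type.
    for element_type in sorted(counts):
        triples.append((section_uri, "tg:elementTypes", element_type))
    # Counter returns 0 for categories that never occurred.
    triples.append((section_uri, "tg:tableCount", counts["Table"]))
    triples.append((section_uri, "tg:imageCount", counts["Image"]))
    return triples
```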

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).

Author: cybermaggedon
Date: 2026-03-23 12:56:35 +00:00
Committed by: GitHub
Parent: 4609424afe
Commit: 5c6fe90fe2
GPG key ID: B5690EEEBB952194 (no known key found for this signature in database)

25 changed files with 2247 additions and 79 deletions

									
trustgraph-unstructured/trustgraph/decoding/universal/__main__.py (normal file, 6 lines)

@@ -0,0 +1,6 @@
#!/usr/bin/env python3

from . processor import run

if __name__ == '__main__':
    run()
