Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.
New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.
Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
|
|
|
[build-system]
|
|
|
|
|
requires = ["setuptools>=61.0", "wheel"]
|
|
|
|
|
build-backend = "setuptools.build_meta"
|
|
|
|
|
|
|
|
|
|
[project]
|
|
|
|
|
name = "trustgraph-unstructured"
|
|
|
|
|
dynamic = ["version"]
|
|
|
|
|
authors = [{name = "trustgraph.ai", email = "security@trustgraph.ai"}]
|
|
|
|
|
description = "TrustGraph provides a means to run a pipeline of flexible AI processing components in a flexible means to achieve a processing pipeline."
|
|
|
|
|
readme = "README.md"
|
|
|
|
|
requires-python = ">=3.8"
|
|
|
|
|
dependencies = [
|
2026-04-10 14:42:19 +01:00
|
|
|
"trustgraph-base>=2.3,<2.4",
|
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.
New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.
Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
|
|
|
"pulsar-client",
|
|
|
|
|
"prometheus-client",
|
|
|
|
|
"python-magic",
|
2026-03-29 20:22:06 +01:00
|
|
|
"unstructured[csv,docx,epub,md,odt,pdf,pptx,rst,rtf,tsv,xlsx]",
|
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.
New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.
Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
|
|
|
]
|
|
|
|
|
classifiers = [
|
|
|
|
|
"Programming Language :: Python :: 3",
|
|
|
|
|
"Operating System :: OS Independent",
|
|
|
|
|
]
|
|
|
|
|
|
|
|
|
|
[project.urls]
|
|
|
|
|
Homepage = "https://github.com/trustgraph-ai/trustgraph"
|
|
|
|
|
|
|
|
|
|
[project.scripts]
|
|
|
|
|
universal-decoder = "trustgraph.decoding.universal:run"
|
|
|
|
|
|
|
|
|
|
[tool.setuptools.packages.find]
|
|
|
|
|
include = ["trustgraph*"]
|
|
|
|
|
|
|
|
|
|
[tool.setuptools.dynamic]
|
|
|
|
|
version = {attr = "trustgraph.unstructured_version.__version__"}
|