Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.
New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
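The grouping strategies above can be illustrated with a minimal sketch. It is not the trustgraph-unstructured implementation: `Element`, `group_by_heading` and `group_by_count` are hypothetical names, and the element categories ("Title", "NarrativeText", "Table") are assumed for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Element:
    # Hypothetical stand-in for one parsed document element
    category: str   # e.g. "Title", "NarrativeText", "Table"
    text: str


def group_by_heading(elements):
    """Start a new section at every Title element (heading strategy)."""
    sections, current = [], []
    for el in elements:
        if el.category == "Title" and current:
            sections.append(current)
            current = []
        current.append(el)
    if current:
        sections.append(current)
    return sections


def group_by_count(elements, count):
    """Split into sections of at most `count` elements (count strategy)."""
    if count < 1:
        raise ValueError("count must be >= 1")
    return [elements[i:i + count] for i in range(0, len(elements), count)]


elements = [
    Element("Title", "Intro"),
    Element("NarrativeText", "Hello."),
    Element("Title", "Details"),
    Element("Table", "<table>...</table>"),
    Element("NarrativeText", "Bye."),
]
by_heading = group_by_heading(elements)
by_count = group_by_count(elements, 2)
```

Here the heading strategy yields two sections (one per title) and the count strategy yields three, the last holding the single leftover element.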
All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
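As a rough sketch of the MIME check described above, a PDF-only decoder might guard its handler like this. The metadata field name `kind`, the helper names and the set of accepted types are assumptions made for illustration, not the actual decoder code.

```python
# Assumed set of MIME types a PDF-only decoder accepts
PDF_MIME_TYPES = {"application/pdf"}


def should_decode(metadata: dict) -> bool:
    """Return True only when the document's MIME type is supported."""
    # "kind" is a hypothetical metadata field holding the MIME type
    mime = str(metadata.get("kind", "")).split(";", 1)[0].strip().lower()
    return mime in PDF_MIME_TYPES


def handle(metadata: dict, decode):
    """Decode supported documents; skip anything else without raising."""
    if not should_decode(metadata):
        return None  # graceful skip for unsupported formats
    return decode(metadata)
```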
Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
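The simplified routing rule can be sketched in a few lines. The function name `route` is hypothetical, and stripping MIME parameters such as `; charset=utf-8` before comparing is an assumption rather than confirmed librarian behaviour.

```python
def route(mime_type: str) -> str:
    """Pick the load flow for a document based on its MIME type."""
    # Assumption: ignore parameters like "; charset=utf-8" when comparing
    base = mime_type.split(";", 1)[0].strip().lower()
    return "text-load" if base == "text/plain" else "document-load"
```

With no whitelist in place, any type other than text/plain falls through to document-load.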
New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.
Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00

# ----------------------------------------------------------------------------
# Base container with system dependencies
# ----------------------------------------------------------------------------

FROM docker.io/fedora:42 AS base

ENV PIP_BREAK_SYSTEM_PACKAGES=1

RUN dnf install -y python3.13 libxcb mesa-libGL && \
    alternatives --install /usr/bin/python python /usr/bin/python3.13 1 && \
    python -m ensurepip --upgrade && \
    pip3 install --no-cache-dir build wheel aiohttp && \
    pip3 install --no-cache-dir pulsar-client==3.7.0 && \
    dnf clean all

# ----------------------------------------------------------------------------
# Build a container which contains the built Python packages. The build
# creates a bunch of left-over cruft, a separate phase means this is only
# needed to support package build
# ----------------------------------------------------------------------------

FROM base AS build

COPY trustgraph-base/ /root/build/trustgraph-base/
COPY trustgraph-unstructured/ /root/build/trustgraph-unstructured/

WORKDIR /root/build/

RUN pip3 wheel -w /root/wheels/ --no-deps ./trustgraph-base/
RUN pip3 wheel -w /root/wheels/ --no-deps ./trustgraph-unstructured/

RUN ls /root/wheels

# ----------------------------------------------------------------------------
# Finally, the target container. Start with base and add the package.
# ----------------------------------------------------------------------------

FROM base

# Pre-install CPU-only PyTorch so that unstructured[pdf]'s torch
# dependency is satisfied without pulling in CUDA (~190MB vs ~2GB+)
RUN pip3 install --no-cache-dir torch==2.11.0+cpu \
    --index-url https://download.pytorch.org/whl/cpu

COPY --from=build /root/wheels /root/wheels

RUN \
    pip3 install --no-cache-dir /root/wheels/trustgraph_base-* && \
    pip3 install --no-cache-dir /root/wheels/trustgraph_unstructured-* && \
    rm -rf /root/wheels

WORKDIR /