# trustgraph/containers/Containerfile.unstructured


# ----------------------------------------------------------------------------
# Base container with system dependencies
# ----------------------------------------------------------------------------

FROM docker.io/fedora:42 AS base

ENV PIP_BREAK_SYSTEM_PACKAGES=1

RUN dnf install -y python3.13 libxcb mesa-libGL && \
  alternatives --install /usr/bin/python python /usr/bin/python3.13 1 && \
  python -m ensurepip --upgrade && \
  pip3 install --no-cache-dir build wheel aiohttp && \
  pip3 install --no-cache-dir pulsar-client==3.7.0 && \
  dnf clean all

# ----------------------------------------------------------------------------
# Build stage: a container which builds the Python packages as wheels.  The
# build leaves behind a lot of cruft, so it runs in a separate stage that is
# only needed to support the package build and is not part of the target.
# ----------------------------------------------------------------------------

FROM base AS build

COPY trustgraph-base/ /root/build/trustgraph-base/
COPY trustgraph-unstructured/ /root/build/trustgraph-unstructured/

WORKDIR /root/build/

RUN pip3 wheel -w /root/wheels/ --no-deps ./trustgraph-base/
RUN pip3 wheel -w /root/wheels/ --no-deps ./trustgraph-unstructured/

RUN ls /root/wheels

# ----------------------------------------------------------------------------
# Finally, the target container.  Start with base and add the package.
# ----------------------------------------------------------------------------

FROM base

# Pre-install CPU-only PyTorch so that unstructured[pdf]'s torch
# dependency is satisfied without pulling in CUDA (~190MB vs ~2GB+)
RUN pip3 install --no-cache-dir torch==2.11.0+cpu \
    --index-url https://download.pytorch.org/whl/cpu

COPY --from=build /root/wheels /root/wheels

RUN \
    pip3 install --no-cache-dir /root/wheels/trustgraph_base-* && \
    pip3 install --no-cache-dir /root/wheels/trustgraph_unstructured-* && \
    rm -rf /root/wheels
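
# Optional sanity check, not part of the original build: a minimal sketch
# that fails the build if installing the wheels above replaced the
# pre-installed CPU-only torch with a CUDA build.  It assumes torch is still
# importable at this point; torch.version.cuda is None for CPU-only builds.
RUN python -c "import torch; assert torch.version.cuda is None, 'expected CPU-only torch'"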

WORKDIR /
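
# Example build invocation (a sketch; the use of podman, the image tag and
# the path to this file relative to the build context are assumptions).
# The COPY instructions in the build stage expect trustgraph-base/ and
# trustgraph-unstructured/ at the top of the build context:
#
#   podman build -f containers/Containerfile.unstructured \
#       -t trustgraph-unstructured .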

# ----------------------------------------------------------------------------
# Change history (from the repository blame view)
# ----------------------------------------------------------------------------
#
# 2026-03-23 12:56:35 +00:00
# Add universal document decoder with multi-format support (#705)
#
# New universal decoder service powered by the 'unstructured' library,
# handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF, ODT, EPUB and more
# through a single service.
#
# - Tables are preserved as HTML markup for better downstream extraction.
# - Images are stored in the librarian but excluded from the text pipeline.
# - Configurable section grouping strategies (whole-document, heading,
#   element-type, count, size) for non-page formats.  Page-based formats
#   (PDF, PPTX, XLSX) are automatically grouped by page.
# - All four decoders (PDF, Mistral OCR, Tesseract OCR, universal) now share
#   the "document-decoder" ident so they are interchangeable.
# - PDF-only decoders fetch document metadata to check MIME type and
#   gracefully skip unsupported formats.
# - Librarian changes: removed MIME type whitelist validation so any document
#   format can be ingested.  Simplified routing so text/plain goes to
#   text-load and everything else goes to document-load.  Removed dual
#   inline/streaming data paths; documents always use document_id for
#   content retrieval.
# - New provenance entity types (tg:Section, tg:Image) and metadata
#   predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for richer
#   explainability.
# - Universal decoder is in its own package (trustgraph-unstructured) and
#   container image (trustgraph-unstructured).
#
# 2026-03-29 20:27:25 +01:00
# release/v2.2 -> master (#733)
#
# Last change to the "RUN dnf install" line in the base stage and to the
# CPU-only PyTorch pre-install in the target stage.