Pull requests back-ported to release/v2.2: 801, 802, 804, 805
Restructure container builds for multi-platform support, enabling
ARM-based deployments (e.g. Apple Silicon via Docker Desktop).
Makefile:
- Replace per-container named targets with pattern rules
(container-%, manifest-%, platform-%-{amd64,arm64},
combine-manifest-%)
- Add parallel CI targets: platform builds push per-arch images,
combine-manifest creates and pushes the multi-arch manifest list
- Remove legacy cruft targets (update-dcs, update-templates)
CI (release.yaml):
- Split single deploy job into build-platform-image (16 parallel
jobs: 8 containers x 2 platforms) and combine-manifests (8 jobs,
metadata only)
- Use native ARM runners (ubuntu-24.04-arm)
Containerfile.hf:
- Downgrade to Python 3.12 (PyTorch lacks arm64 wheels for 3.13)
- Use standard PyTorch package instead of +cpu variant (no arm64 wheels
on the cpu index)
Add universal document decoder with multi-format support
using 'unstructured'.
New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.
Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
* Some basic structure for workflows
* Add PyPI publication for 0.12
* Bump version
* Test bundle generation
* Install jsonnet
* Use release action to automate release creation