trustgraph/.github/workflows/release.yaml
cybermaggedon 5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
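
The grouping strategies named above can be illustrated with a minimal sketch. This is not the decoder's actual code: `Element`, `group_by_heading` and `group_by_count` are hypothetical names, and the "Title" category is assumed to mark headings.

```python
from dataclasses import dataclass


@dataclass
class Element:
    # Hypothetical stand-in for one parsed document element
    category: str  # assumed labels, e.g. "Title", "NarrativeText", "Table"
    text: str


def group_by_heading(elements):
    """Start a new section at every heading element."""
    sections, current = [], []
    for el in elements:
        # A heading closes the section in progress, if there is one
        if el.category == "Title" and current:
            sections.append(current)
            current = []
        current.append(el)
    if current:
        sections.append(current)
    return sections


def group_by_count(elements, n):
    """Put at most n consecutive elements in each section."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [elements[i:i + n] for i in range(0, len(elements), n)]


doc = [
    Element("Title", "Intro"),
    Element("NarrativeText", "Some text."),
    Element("Title", "Methods"),
    Element("Table", "<table>...</table>"),
    Element("NarrativeText", "More text."),
]
```

With this sample, heading grouping yields two sections (one per heading) and count grouping with `n=2` yields three.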

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable.  PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
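
The MIME type check described above might look roughly like the following sketch. The `mime_type` metadata key and both function names are assumptions made for illustration, not the librarian's real schema or the decoders' API.

```python
PDF_MIME_TYPE = "application/pdf"


def is_supported(metadata: dict) -> bool:
    """Return True only when the metadata reports a PDF."""
    # "mime_type" is a hypothetical field name
    return metadata.get("mime_type") == PDF_MIME_TYPE


def decode_if_supported(metadata: dict, decode):
    """Run `decode` for PDFs; skip everything else without raising."""
    if not is_supported(metadata):
        return None  # graceful skip of an unsupported format
    return decode(metadata)
```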

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
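
As a sketch of the simplified routing rule, assuming the MIME type arrives as a bare string with no parameters (`route` is a hypothetical name):

```python
def route(mime_type: str) -> str:
    """Pick the load flow for a document from its MIME type."""
    if mime_type == "text/plain":
        return "text-load"
    # Everything else goes to the document decoders
    return "document-load"
```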

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00

name: Build

on:
  workflow_dispatch:
  push:
    tags:
      - v*

permissions:
  contents: read

jobs:

  python-packages:
    name: Release Python packages
    runs-on: ubuntu-24.04
    permissions:
      contents: write
      id-token: write
    environment:
      name: release
    steps:

      - name: Checkout
        uses: actions/checkout@v4

      - name: Get version
        id: version
        run: echo VERSION=$(git describe --exact-match --tags | sed 's/^v//') >> $GITHUB_OUTPUT

      - name: Install dependencies
        run: pip install build wheel

      - name: Build packages
        run: make packages VERSION=${{ steps.version.outputs.VERSION }}

      - name: Publish release distributions to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1

  deploy-container-image:
    name: Release container images
    runs-on: ubuntu-24.04
    permissions:
      contents: write
      id-token: write
    environment:
      name: release
    strategy:
      matrix:
        container:
          - trustgraph-base
          - trustgraph-flow
          - trustgraph-bedrock
          - trustgraph-vertexai
          - trustgraph-hf
          - trustgraph-ocr
          - trustgraph-unstructured
          - trustgraph-mcp
    steps:

      - name: Checkout
        uses: actions/checkout@v4

      - name: Docker Hub token
        run: echo ${{ secrets.DOCKER_SECRET }} > docker-token.txt

      - name: Authenticate with Docker hub
        run: make docker-hub-login

      - name: Get version
        id: version
        run: echo VERSION=$(git describe --exact-match --tags | sed 's/^v//') >> $GITHUB_OUTPUT

      - name: Put version into package manifests
        run: make update-package-versions VERSION=${{ steps.version.outputs.VERSION }}

      - name: Build container - ${{ matrix.container }}
        run: make container-${{ matrix.container }} VERSION=${{ steps.version.outputs.VERSION }}

      - name: Push container - ${{ matrix.container }}
        run: make push-${{ matrix.container }} VERSION=${{ steps.version.outputs.VERSION }}
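
The "Get version" steps turn the release tag into a version by stripping one leading `v` with `sed 's/^v//'`. A small Python equivalent of that transformation, offered only as an illustration (`version_from_tag` is a hypothetical helper, not part of the repository):

```python
import re


def version_from_tag(tag: str) -> str:
    """Strip a single leading 'v', mirroring sed 's/^v//'."""
    # ^ anchors at the start of the string, so at most one 'v' is removed
    return re.sub(r"^v", "", tag)
```

A tag without the prefix passes through unchanged, which matches the behaviour of the `sed` expression.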