mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 08:26:21 +02:00
Add universal document decoder with multi-format support using 'unstructured'.

New universal decoder service powered by the unstructured library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF, ODT, EPUB and more through a single service. Tables are preserved as HTML markup for better downstream extraction. Images are stored in the librarian but excluded from the text pipeline. Configurable section grouping strategies (whole-document, heading, element-type, count, size) for non-page formats. Page-based formats (PDF, PPTX, XLSX) are automatically grouped by page.

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal) now share the "document-decoder" ident so they are interchangeable. PDF-only decoders fetch document metadata to check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any document format can be ingested. Simplified routing so text/plain goes to text-load and everything else goes to document-load. Removed dual inline/streaming data paths; documents always use document_id for content retrieval.

New provenance entity types (tg:Section, tg:Image) and metadata predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for richer explainability.

Universal decoder is in its own package (trustgraph-unstructured) and container image (trustgraph-unstructured).
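The simplified librarian routing described above can be illustrated with a minimal sketch. This is not the project's code: the function name `choose_loader` is hypothetical, and stripping MIME parameters such as `; charset=utf-8` before comparing is an assumption, not something the commit message states. Only the two destination names, `text-load` and `document-load`, come from the message.

```python
def choose_loader(mime_type: str) -> str:
    """Pick a load path for a document based on its MIME type.

    Hypothetical helper: illustrates the rule "text/plain goes to
    text-load, everything else goes to document-load".
    """
    # Assumption: ignore case and any parameters after ";".
    base = mime_type.split(";", 1)[0].strip().lower()
    return "text-load" if base == "text/plain" else "document-load"


if __name__ == "__main__":
    for mime in ("text/plain", "application/pdf", "text/html"):
        print(mime, "->", choose_loader(mime))
```

With no whitelist in front of it, any MIME type other than `text/plain` falls through to the document path, which is what allows the decoders to decide for themselves whether they support a format.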
87 lines
2 KiB
YAML
name: Build

on:
  workflow_dispatch:
  push:
    tags:
      - v*

permissions:
  contents: read

jobs:

  python-packages:

    name: Release Python packages
    runs-on: ubuntu-24.04
    permissions:
      contents: write
      id-token: write
    environment:
      name: release

    steps:

      - name: Checkout
        uses: actions/checkout@v4

      - name: Get version
        id: version
        run: echo VERSION=$(git describe --exact-match --tags | sed 's/^v//') >> $GITHUB_OUTPUT

      - name: Install dependencies
        run: pip install build wheel

      - name: Build packages
        run: make packages VERSION=${{ steps.version.outputs.VERSION }}

      - name: Publish release distributions to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1

  deploy-container-image:

    name: Release container images
    runs-on: ubuntu-24.04
    permissions:
      contents: write
      id-token: write
    environment:
      name: release
    strategy:
      matrix:
        container:
          - trustgraph-base
          - trustgraph-flow
          - trustgraph-bedrock
          - trustgraph-vertexai
          - trustgraph-hf
          - trustgraph-ocr
          - trustgraph-unstructured
          - trustgraph-mcp

    steps:

      - name: Checkout
        uses: actions/checkout@v4

      - name: Docker Hub token
        run: echo ${{ secrets.DOCKER_SECRET }} > docker-token.txt

      - name: Authenticate with Docker hub
        run: make docker-hub-login

      - name: Get version
        id: version
        run: echo VERSION=$(git describe --exact-match --tags | sed 's/^v//') >> $GITHUB_OUTPUT

      - name: Put version into package manifests
        run: make update-package-versions VERSION=${{ steps.version.outputs.VERSION }}

      - name: Build container - ${{ matrix.container }}
        run: make container-${{ matrix.container }} VERSION=${{ steps.version.outputs.VERSION }}

      - name: Push container - ${{ matrix.container }}
        run: make push-${{ matrix.container }} VERSION=${{ steps.version.outputs.VERSION }}
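
Both jobs derive the release version from the Git tag by piping `git describe --exact-match --tags` through `sed 's/^v//'`. The sketch below mirrors only the `sed` part in Python, as an illustration of how a tag such as `v1.2.3` becomes the version `1.2.3`. The function name `version_from_tag` and the sample tags are hypothetical and are not part of the workflow.

```python
import re


def version_from_tag(tag: str) -> str:
    """Strip one leading "v" from a tag, like sed 's/^v//'."""
    # The pattern is anchored at the start, so at most one "v" is
    # removed and tags without the prefix pass through unchanged.
    return re.sub(r"^v", "", tag)


if __name__ == "__main__":
    for tag in ("v1.2.3", "1.2.3"):
        print(tag, "->", version_from_tag(tag))
```

The resulting string is what the later steps receive as `VERSION=${{ steps.version.outputs.VERSION }}`.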