trustgraph/.github/workflows/pull-request.yaml
cybermaggedon 5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
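
As a rough illustration of two of the grouping strategies named above ("heading" and "count"), here is a minimal sketch. The `Element` class, the function names and the `"Title"` element kind are hypothetical stand-ins chosen for this example, not the decoder's actual types or API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Element:
    """Hypothetical stand-in for one parsed document element."""
    kind: str   # e.g. "Title", "NarrativeText", "Table" (assumed names)
    text: str


def group_by_count(elements: Iterable[Element], size: int) -> List[List[Element]]:
    """'count' strategy: start a new section every `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    items = list(elements)
    return [items[i:i + size] for i in range(0, len(items), size)]


def group_by_heading(elements: Iterable[Element]) -> List[List[Element]]:
    """'heading' strategy: start a new section at each Title element."""
    sections: List[List[Element]] = []
    for el in elements:
        # Open a new section on a heading, or for leading non-heading content.
        if el.kind == "Title" or not sections:
            sections.append([])
        sections[-1].append(el)
    return sections


doc = [
    Element("Title", "Intro"),
    Element("NarrativeText", "First paragraph."),
    Element("Title", "Method"),
    Element("NarrativeText", "Second paragraph."),
    Element("Table", "<table>...</table>"),
]

heading_sections = group_by_heading(doc)
count_sections = group_by_count(doc, 2)
```

With this sample input, the heading strategy yields two sections (one per title) and the count strategy yields three chunks of at most two elements each.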

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable.  PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
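
The simplified routing rule described above could be sketched as follows. The function name `route_document` is hypothetical, and stripping MIME parameters such as `; charset=utf-8` is an assumption made for this example rather than documented librarian behaviour.

```python
def route_document(mime_type: str) -> str:
    """Return the load path name for a document's MIME type.

    text/plain goes to "text-load"; everything else goes to "document-load".
    """
    # Assumption: ignore parameters (e.g. "; charset=utf-8") and letter case.
    base = mime_type.split(";", 1)[0].strip().lower()
    return "text-load" if base == "text/plain" else "document-load"


routes = {
    mime: route_document(mime)
    for mime in ("text/plain", "application/pdf", "text/html")
}
```

Because there is no whitelist, an unrecognised MIME type simply falls through to `document-load` instead of being rejected.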

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00


name: Test pull request

on:
  pull_request:

permissions:
  contents: read

jobs:

  test:

    name: Run tests
    runs-on: ubuntu-latest

    container:
      image: python:3.13

    steps:

      - name: Checkout
        uses: actions/checkout@v3

      - name: Setup packages
        run: make update-package-versions VERSION=2.2.999

      - name: Setup environment
        run: python3 -m venv env

      # Note: each `run` step starts a fresh shell, so this activation
      # does not carry over to the steps that follow.
      - name: Invoke environment
        run: . env/bin/activate

      - name: Install trustgraph-base
        run: (cd trustgraph-base; pip install .)

      - name: Install trustgraph-cli
        run: (cd trustgraph-cli; pip install .)

      - name: Install trustgraph-flow
        run: (cd trustgraph-flow; pip install .)

      - name: Install trustgraph-unstructured
        run: (cd trustgraph-unstructured; pip install .)

      - name: Install trustgraph-vertexai
        run: (cd trustgraph-vertexai; pip install .)

      - name: Install trustgraph-bedrock
        run: (cd trustgraph-bedrock; pip install .)

      - name: Install test dependencies
        run: pip install pytest pytest-cov pytest-asyncio pytest-mock

      - name: Unit tests
        run: pytest tests/unit

      - name: Integration tests (cut out the long-running tests)
        run: pytest tests/integration -m 'not slow'

      - name: Contract tests
        run: pytest tests/contract