15 commits

Author SHA1 Message Date
cybermaggedon
03c7a7c5a8
Add multi-arch (amd64/arm64) container builds and parallel CI
Pull requests back-ported to release/v2.2: 801, 802, 804, 805

Restructure container builds for multi-platform support, enabling
ARM-based deployments (e.g. Apple Silicon via Docker Desktop).

Makefile:
- Replace per-container named targets with pattern rules
  (container-%, manifest-%, platform-%-{amd64,arm64},
  combine-manifest-%)
- Add parallel CI targets: platform builds push per-arch images,
  combine-manifest creates and pushes the multi-arch manifest list
- Remove legacy cruft targets (update-dcs, update-templates)

CI (release.yaml):
- Split single deploy job into build-platform-image (16 parallel
  jobs: 8 containers x 2 platforms) and combine-manifests (8 jobs,
  metadata only)
- Use native ARM runners (ubuntu-24.04-arm)
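
As a rough illustration of the job counts above, the build matrix could be enumerated as in the sketch below. The container names are hypothetical placeholders (the message only says there are 8 containers); the platform names and the target-name patterns are taken from the Makefile notes above.

```python
from itertools import product

# Hypothetical placeholder names: the commit message only states the count (8).
CONTAINERS = [f"svc{i}" for i in range(1, 9)]
PLATFORMS = ["amd64", "arm64"]


def platform_jobs(containers, platforms):
    """One build-and-push job per (container, platform) pair,
    named after the platform-%-{amd64,arm64} pattern."""
    return [f"platform-{c}-{p}" for c, p in product(containers, platforms)]


def manifest_jobs(containers):
    """One metadata-only job per container, named after combine-manifest-%."""
    return [f"combine-manifest-{c}" for c in containers]


build = platform_jobs(CONTAINERS, PLATFORMS)
combine = manifest_jobs(CONTAINERS)
print(len(build), len(combine))  # 16 8
```

Each combine-manifest job depends only on the two per-arch images for its container, which is presumably why the builds can fan out in parallel before the manifests are assembled.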

Containerfile.hf:
- Downgrade to Python 3.12 (PyTorch lacks arm64 wheels for 3.13)
- Use standard PyTorch package instead of +cpu variant (no arm64 wheels
  on the cpu index)
2026-04-14 12:22:12 +01:00
cybermaggedon
413f917676
Add missing pdf extra to unstructured dependency (#728)
* Fix PDF processing deps so that PDF processing works
2026-03-29 20:22:45 +01:00
cybermaggedon
5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
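
To make the grouping strategies more concrete, here is a minimal sketch of three of them (whole-document, heading, count). It is not the decoder's actual code: the `Element` class, the function names and the `"Title"` heading marker are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a parsed document element."""
    kind: str  # e.g. "Title", "NarrativeText", "Table"
    text: str


def group_whole_document(elements):
    """Put every element into a single section."""
    elements = list(elements)
    return [elements] if elements else []


def group_by_heading(elements, heading_kind="Title"):
    """Start a new section each time a heading element is seen."""
    sections, current = [], []
    for el in elements:
        if el.kind == heading_kind and current:
            sections.append(current)
            current = []
        current.append(el)
    if current:
        sections.append(current)
    return sections


def group_by_count(elements, count):
    """Split into sections of at most `count` elements each."""
    if count < 1:
        raise ValueError("count must be at least 1")
    elements = list(elements)
    return [elements[i:i + count] for i in range(0, len(elements), count)]


doc = [
    Element("Title", "Intro"),
    Element("NarrativeText", "a"),
    Element("NarrativeText", "b"),
    Element("Title", "Methods"),
    Element("Table", "<table>...</table>"),
]
print([len(s) for s in group_by_heading(doc)])   # [3, 2]
print([len(s) for s in group_by_count(doc, 2)])  # [2, 2, 1]
```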

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable.  PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
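
The simplified routing rule can be sketched as below. The function name is hypothetical, and stripping MIME parameters such as `; charset=utf-8` before comparing is an assumption added here, not something the commit message states.

```python
def route_document(mime_type: str) -> str:
    """Return the loader name for a document of the given MIME type.

    Hypothetical helper: text/plain goes to text-load, everything
    else goes to document-load.
    """
    # Assumption: ignore parameters and case when comparing the type.
    base = mime_type.split(";", 1)[0].strip().lower()
    return "text-load" if base == "text/plain" else "document-load"


print(route_document("text/plain"))       # text-load
print(route_document("application/pdf"))  # document-load
```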

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
cybermaggedon
08063a5ee9
Remove unused deps (#640)
* Removed the Google GenAI hard-coded install
2026-02-20 10:13:44 +00:00
cybermaggedon
05b9063fea
Feature/python3.13 (#553)
* Python to 3.13

* cassandra-driver -> scylla-driver
  (cassandra-driver doesn't work with Python 3.13)
2025-10-11 12:19:26 +01:00
cybermaggedon
98022d6af4
Migrate from setup.py to pyproject.toml (#440)
* Converted setup.py to pyproject.toml

* Modern packaging infrastructure, as recommended by the Python docs
2025-07-23 21:22:08 +01:00
cybermaggedon
f907ea7db8
PoC MCP server (#419)
* Very initial MCP server PoC for TrustGraph

* Put service on port 8000

* Add MCP container and packages to buildout
2025-07-02 18:19:23 +01:00
cybermaggedon
81d73445bd
Add missing dependencies to the PDF OCR container (#411)
2025-06-16 14:15:16 +01:00
cybermaggedon
448819ed47
Updates to Google AI: (#394)
- Changed GoogleAIStudio LLM code to match latest documentation
- Very minor tweak to vertexai LLM code - just matching what's in SDK docs
  no actual change to implementation.
- Tweaked VertexAI container build to speed up in dev
- Comments in LLM code to mention which docs it was built from.  Google
  SDKs are confusing ATM.
2025-05-24 12:09:43 +01:00
cybermaggedon
b380c2054d
Change containers to Python 3.12 (#386)
2025-05-17 23:15:29 +01:00
cybermaggedon
e04d3631fd
Update container deps (#385)
- Fedora 42 container
- Pulsar client 3.7.0
- Latest AI libs
2025-05-17 20:54:56 +01:00
cybermaggedon
322725be04
Fix container build (#325)
2025-03-20 09:38:54 +00:00
cybermaggedon
c759d55734
Added a module that performs OCR on PDFs, pdf-ocr, in a separate package (#324)
(the package has a lot of dependencies). Uses Tesseract.
2025-03-20 09:29:40 +00:00
JackColquitt
d676804107
Added Mistral jsonnet templates
2025-03-14 18:07:51 -07:00
cybermaggedon
edcdc4d59d
Feature/separate containers (#287)
* Separate containerfiles

* Add push to Makefile

* Update image names in the templates
2025-01-28 19:36:05 +00:00