13 commits

Author SHA1 Message Date
cybermaggedon
5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
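
To illustrate the grouping strategies named above, here is a minimal sketch of three of them (page, heading, count). It is not the decoder's actual code: the Element class, its field names, the "Title" kind and the function names are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for one parsed document element."""
    kind: str                   # assumed labels, e.g. "Title", "NarrativeText", "Table"
    text: str
    page: Optional[int] = None  # set only for page-based formats


def group_by_page(elements: List[Element]) -> List[List[Element]]:
    """Group consecutive elements that share the same page number."""
    return [list(run) for _, run in groupby(elements, key=lambda e: e.page)]


def group_by_heading(elements: List[Element]) -> List[List[Element]]:
    """Start a new section at every heading (assumed kind: "Title")."""
    sections: List[List[Element]] = []
    for el in elements:
        if el.kind == "Title" or not sections:
            sections.append([])
        sections[-1].append(el)
    return sections


def group_by_count(elements: List[Element], n: int) -> List[List[Element]]:
    """Split into sections of at most n elements each."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [elements[i:i + n] for i in range(0, len(elements), n)]
```

A page-based format would use something like group_by_page unconditionally, while the other strategies would be selected by configuration for formats without pages.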

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable.  PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
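
The simplified routing rule can be sketched as below. This is an illustration only: the function name route_for is hypothetical, and stripping MIME parameters such as "; charset=utf-8" before comparing is an assumption, not something the commit states.

```python
def route_for(mime_type: str) -> str:
    """Return the load target for a document, given its MIME type.

    Hypothetical helper: text/plain goes to text-load, everything
    else goes to document-load.
    """
    # Assumption: ignore parameters and case when comparing the type.
    base = mime_type.split(";", 1)[0].strip().lower()
    return "text-load" if base == "text/plain" else "document-load"
```

With no whitelist in place, an unrecognised type simply falls through to document-load rather than being rejected.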

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
cybermaggedon
08063a5ee9
Remove unused deps (#640)
* Removed the Google GenAI hard-coded install
2026-02-20 10:13:44 +00:00
cybermaggedon
05b9063fea
Feature/python3.13 (#553)
* Python to 3.13

* cassandra-driver -> scylla-driver
  (cassandra-driver doesn't work with Python 3.13)
2025-10-11 12:19:26 +01:00
cybermaggedon
98022d6af4
Migrate from setup.py to pyproject.toml (#440)
* Converted setup.py to pyproject.toml

* Modern package infrastructure as recommended by py docs
2025-07-23 21:22:08 +01:00
cybermaggedon
f907ea7db8
PoC MCP server (#419)
* Very initial MCP server PoC for TrustGraph

* Put service on port 8000

* Add MCP container and packages to buildout
2025-07-02 18:19:23 +01:00
cybermaggedon
81d73445bd
Add missing dependencies to the PDF OCR container (#411)
2025-06-16 14:15:16 +01:00
cybermaggedon
448819ed47
Updates to Google AI: (#394)
- Changed GoogleAIStudio LLM code to match latest documentation
- Very minor tweak to vertexai LLM code - just matching what's in SDK docs
  no actual change to implementation.
- Tweaked VertexAI container build to speed up in dev
- Comments in LLM code to mention which docs it was built from.  Google
  SDKs are confusing ATM.
2025-05-24 12:09:43 +01:00
cybermaggedon
b380c2054d
Change containers to Python 3.12 (#386)
2025-05-17 23:15:29 +01:00
cybermaggedon
e04d3631fd
Update container deps (#385)
- Fedora 42 container
- Pulsar client 3.7.0
- Latest AI libs
2025-05-17 20:54:56 +01:00
cybermaggedon
322725be04
Fix container build (#325)
2025-03-20 09:38:54 +00:00
cybermaggedon
c759d55734
Added module which does OCR for PDF, pdf-ocr in a separate package (#324)
(has a lot of dependencies).  Uses Tesseract.
2025-03-20 09:29:40 +00:00
JackColquitt
d676804107
Added Mistral jsonnet templates
2025-03-14 18:07:51 -07:00
cybermaggedon
edcdc4d59d
Feature/separate containers (#287)
* Separate containerfiles

* Add push to Makefile

* Update image names in the templates
2025-01-28 19:36:05 +00:00