Add universal document decoder with multi-format support using 'unstructured'

New universal decoder service, powered by the unstructured library,
handles DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF, ODT, EPUB and
more in one place. Tables are preserved as HTML markup for better
downstream extraction. Images are stored in the librarian but
excluded from the text pipeline. Section grouping strategies
(whole-document, heading, element-type, count, size) are
configurable for non-page formats. Page-based formats (PDF, PPTX,
XLSX) are automatically grouped by page.
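
The grouping strategies named above can be sketched roughly as follows.
This is an illustration only, not the decoder's actual code: the
Element class, its field names and the function names are all
hypothetical stand-ins for whatever the parser really produces.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional


@dataclass
class Element:
    # Hypothetical stand-in for one parsed document element.
    kind: str                   # e.g. "Title", "NarrativeText", "Table"
    text: str
    page: Optional[int] = None  # only set for page-based formats


def _close(sections, current):
    # Append the pending section if it holds anything.
    if current:
        sections.append(current)
    return sections


def group_by_heading(elements: List[Element]) -> List[List[Element]]:
    """Start a new section at every Title element (assumed heading marker)."""
    sections, current = [], []
    for el in elements:
        if el.kind == "Title" and current:
            sections.append(current)
            current = []
        current.append(el)
    return _close(sections, current)


def group_by_count(elements: List[Element], n: int) -> List[List[Element]]:
    """Sections of at most n elements each."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [elements[i:i + n] for i in range(0, len(elements), n)]


def group_by_size(elements: List[Element], max_chars: int) -> List[List[Element]]:
    """Close a section before adding an element would exceed max_chars."""
    sections, current, size = [], [], 0
    for el in elements:
        if current and size + len(el.text) > max_chars:
            sections.append(current)
            current, size = [], 0
        current.append(el)
        size += len(el.text)
    return _close(sections, current)


def group_by_page(elements: List[Element]) -> List[List[Element]]:
    """Group consecutive elements that share a page number."""
    return [list(g) for _, g in groupby(elements, key=lambda el: el.page)]
```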

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
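
A minimal sketch of the MIME check described above, assuming the
fetched metadata is a dict with hypothetical "mime_type" and
"document_id" keys. Skipping documents with no MIME type, and
stripping parameters such as "; charset=...", are also assumptions
made for this example rather than confirmed behaviour.

```python
import logging

logger = logging.getLogger(__name__)

# Formats a PDF-only decoder is assumed to accept.
SUPPORTED = frozenset({"application/pdf"})


def normalise_mime(value: str) -> str:
    # Drop any parameters and normalise case: "Application/PDF; x=y" -> "application/pdf"
    return value.split(";", 1)[0].strip().lower()


def should_decode(metadata: dict, supported=SUPPORTED) -> bool:
    mime = metadata.get("mime_type")  # hypothetical field name
    if not mime:
        return False
    return normalise_mime(mime) in supported


def handle(metadata: dict, decode):
    """Decode the document if supported, otherwise skip it quietly."""
    if not should_decode(metadata):
        logger.info("Skipping unsupported document: %s", metadata.get("mime_type"))
        return None
    return decode(metadata["document_id"])
```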

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths; documents always use
document_id for content retrieval.
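
The simplified routing rule can be pictured as below. The function
name is hypothetical, and ignoring MIME parameters (so that
"text/plain; charset=utf-8" still counts as plain text) is an
assumption of this sketch, not something the change is known to do.

```python
def route(mime_type: str) -> str:
    """Pick the load queue for a document from its MIME type."""
    # Assumption: parameters after ";" do not affect routing.
    base = mime_type.split(";", 1)[0].strip().lower()
    if base == "text/plain":
        return "text-load"
    # No whitelist: every other format goes to the document decoders.
    return "document-load"
```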

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).

* Changed schema from Value to Term; this is a major breaking change
* Propagated the Value -> Term schema change through all processing
* Updated Cassandra for g, p, s, o index patterns (7 indexes)
* Reviewed and updated all tests
* Neo4j, Memgraph and FalkorDB remain broken; will revisit once things have settled down
- Keeps processing in different flows separate so that data can go to different stores / collections etc.
- Potentially supports different processing flows
- Tidies the processing API with common base classes (e.g. for LLMs) and automatic configuration of 'clients' to use the right queue names in a flow
* - Fixed error reporting in config
- Updated tg-init-pulsar to be able to load initial config to config-svc
- Tweaked API naming and added more config calls
* Tools to dump out prompts and agent tools
* - Locked 0.11 packages to 0.11 deps
- Added 'trustgraph' uber-package which installs the rest
- Added dependency to set package versions before building packages
* Bump version
* Some basic structure for workflows
* Add PyPI publication for 0.12
* Bump version
* Test bundle generation
* Install jsonnet
* Use release action to automate release creation
* Renaming what will become the core package
* Tweaking to get package build working
* Fix metering merge
* Rename to core directory
* Bump version. Use namespace searching for packaging trustgraph-core
* Change references to trustgraph-core
* Forming embeddings-hf package
* Reference modules in core package.
* Build both packages to one container, bump version
* Update YAMLs
* Separate Prom metrics, different processors as different jobs
* Create producers before consumers, may streamline startup.
* Bump version
* Add Pulsar init command, will replace pulsar-admin invocations.
* Integrate tg-init-pulsar with YAMLs
* Update YAMLs
* Add a Prom metric to consumers & consumer/producers to track the running
  state.
* New script, gets processor state using prometheus
* Bump version, add tg-processor-state to package
* Update templates