Commit graph

19 commits

Author SHA1 Message Date
cybermaggedon
5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable.  PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
cybermaggedon
4609424afe
Prepare 2.2 release branch (#704) 2026-03-22 15:23:23 +00:00
cybermaggedon
88fe8468bc
Update CI for 2.1 release (#653) 2026-02-28 11:10:11 +00:00
cybermaggedon
23cc4dfdd1
Fix: version needed updating in pipelines (#623) 2026-01-27 15:42:01 +00:00
cybermaggedon
e4f0013841
Open 1.9 branch (#620) 2026-01-26 17:36:25 +00:00
Cyber MacGeddon
1865b3f3c8 Start 1.8 release branch 2025-12-17 21:32:13 +00:00
Cyber MacGeddon
98aaa4f67e Configure for 1.7 release branch 2025-12-03 09:46:55 +00:00
cybermaggedon
97d8b84d7f
Open 1.6 release branch (#564) 2025-11-24 10:05:29 +00:00
cybermaggedon
3580e7a7ae
Remove some 'unnecessary' parameters from OpenAI invocation (#561)
* Remove some 'unnecessary' parameters from OpenAI invocation.  The OpenAI
API is getting complicated with the API and SDK changing on OpenAI's end,
but this not getting mapped through to other services which are 'compatible'
with OpenAI.

* Update OpenAI test for this change

* Trying running tests with Python 3.13
2025-11-20 17:56:31 +00:00
cybermaggedon
ad35656811
Prepare 1.5 release branch (#550) 2025-10-11 11:44:00 +01:00
cybermaggedon
0b59f0c828
Maint/open 1.4 release branch (#508)
* Change pyproject files for 1.4

* Fix tests to track 1.4
2025-09-10 22:11:03 +01:00
cybermaggedon
672e358b2f
Feature/graphql table query (#486)
* Tech spec

* Object query service for Cassandra

* Gateway support for objects-query

* GraphQL query utility

* Filters, ordering
2025-09-03 23:39:11 +01:00
cybermaggedon
210d600f78
Bump pull-request.yaml test version (#478) 2025-08-28 13:54:12 +01:00
cybermaggedon
98022d6af4
Migrate from setup.py to pyproject.toml (#440)
* Converted setup.py to pyproject.toml

* Modern package infrastructure as recommended by py docs
2025-07-23 21:22:08 +01:00
cybermaggedon
d83e4e3d59
Update to enable knowledge extraction using the agent framework (#439)
* Implement KG extraction agent (kg-extract-agent)

* Using ReAct framework (agent-manager-react)
 
* ReAct manager had an issue when emitting JSON, which conflicts which ReAct manager's own JSON messages, so refactored ReAct manager to use traditional ReAct messages, non-JSON structure.
 
* Minor refactor to take the prompt template client out of prompt-template so it can be more readily used by other modules. kg-extract-agent uses this framework.
2025-07-21 14:31:57 +01:00
cybermaggedon
f37decea2b
Increase storage test coverage (#435)
* Fixing storage and adding tests

* PR pipeline only runs quick tests
2025-07-15 09:33:35 +01:00
cybermaggedon
4daa54abaf
Extending test coverage (#434)
* Contract tests

* Testing embeedings

* Agent unit tests

* Knowledge pipeline tests

* Turn on contract tests
2025-07-14 17:54:04 +01:00
cybermaggedon
2f7fddd206
Test suite executed from CI pipeline (#433)
* Test strategy & test cases

* Unit tests

* Integration tests
2025-07-14 14:57:44 +01:00
cybermaggedon
dda29bb663
Workflows (#105)
* Some basic structure for workflows
* Add PyPI publication for 0.12
* Bump version
* Test bundle generation
* Install jsonnet
* Use release action to automate release creation
2024-10-04 17:28:07 +01:00