Commit graph

24 commits

Author SHA1 Message Date
cybermaggedon
25995d03f4
Fix stray log messages caused by librarian messages (#706)
Warnings were generated when librarian responses meant for other
services (chunker, embeddings, etc.) arrived on the shared
response queue. The decoder's subscription picks them up, can't
match them to a pending request, and logs a warning.

Removed the warnings, as they serve no purpose.
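
As a rough illustration of the behaviour described above, here is a minimal sketch that assumes responses are matched to pending requests by an ID. The names `PendingRequests`, `register` and `on_response` are hypothetical and are not taken from the project's code.

```python
from typing import Any, Callable, Dict


class PendingRequests:
    """Hypothetical table of requests that are still awaiting a response."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Any], None]] = {}

    def register(self, request_id: str, handler: Callable[[Any], None]) -> None:
        # Remember which callback should receive the response for this request.
        self._handlers[request_id] = handler

    def on_response(self, request_id: str, payload: Any) -> bool:
        # Look up (and forget) the handler for this response, if any.
        handler = self._handlers.pop(request_id, None)
        if handler is None:
            # The response was meant for another service on the shared
            # queue: ignore it quietly instead of logging a warning.
            return False
        handler(payload)
        return True
```

Returning a boolean rather than logging keeps the unmatched case observable to callers without producing log noise.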
2026-03-23 13:16:39 +00:00
cybermaggedon
5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
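
Two of the grouping strategies named above (count and heading) could look roughly like the sketch below. It is only an illustration: the `(element type, text)` tuple shape, the function names and the `"Title"` heading type are assumptions, not the service's actual data model or configuration.

```python
from typing import List, Tuple

# Hypothetical element shape: (element type, text).
Element = Tuple[str, str]


def group_by_count(elements: List[Element], n: int) -> List[List[Element]]:
    """Start a new section after every n elements."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return [elements[i:i + n] for i in range(0, len(elements), n)]


def group_by_heading(
    elements: List[Element], heading_type: str = "Title"
) -> List[List[Element]]:
    """Start a new section at each heading element."""
    sections: List[List[Element]] = []
    for element in elements:
        # Open a new section on a heading, or for leading non-heading content.
        if element[0] == heading_type or not sections:
            sections.append([])
        sections[-1].append(element)
    return sections
```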

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
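
The simplified routing rule can be sketched as follows. The function name `route_for` is hypothetical, and stripping MIME parameters such as `; charset=utf-8` is an assumption of this sketch, not something the commit message states.

```python
def route_for(mime_type: str) -> str:
    """Return the load path for a document, given its MIME type."""
    # Assumption: ignore parameters and case when comparing MIME types.
    base = mime_type.split(";", 1)[0].strip().lower()
    # text/plain goes to text-load; everything else goes to document-load.
    return "text-load" if base == "text/plain" else "document-load"
```

With no whitelist, an unknown type falls through to document-load, where a decoder can decide whether it supports the format.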

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
cybermaggedon
4609424afe
Prepare 2.2 release branch (#704) 2026-03-22 15:23:23 +00:00
cybermaggedon
96fd1eab15
Use UUID-based URNs for page and chunk IDs (#703)
Page and chunk document IDs were deterministic ({doc_id}/p{num},
{doc_id}/p{num}/c{num}), causing "Document already exists" errors
when reprocessing documents through different flows. Content may
differ between runs due to different parameters or extractors, so
deterministic IDs are incorrect.

Pages now use urn:page:{uuid}, chunks use
urn:chunk:{uuid}. Parent-child relationships are tracked via
librarian metadata and provenance triples.
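
A minimal sketch of the ID scheme, assuming random (version 4) UUIDs. The helper names and the in-memory `parent_of` mapping are hypothetical stand-ins for the librarian metadata, not the project's API.

```python
import uuid
from typing import Dict

# Hypothetical stand-in for librarian metadata: child ID -> parent ID.
parent_of: Dict[str, str] = {}


def new_page_id() -> str:
    # A fresh ID on every call, so reprocessing never collides.
    return f"urn:page:{uuid.uuid4()}"


def new_chunk_id() -> str:
    return f"urn:chunk:{uuid.uuid4()}"


def add_page(doc_id: str) -> str:
    """Create a page ID and record the document as its parent."""
    page_id = new_page_id()
    parent_of[page_id] = doc_id
    return page_id


def add_chunk(page_id: str) -> str:
    """Create a chunk ID and record the page as its parent."""
    chunk_id = new_chunk_id()
    parent_of[chunk_id] = page_id
    return chunk_id
```

Because the IDs no longer encode the document ID or page number, the parent link has to be stored explicitly, which is why the relationships move into metadata and provenance triples.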

Also brings Mistral OCR and Tesseract OCR decoders up to parity
with the PDF decoder: librarian fetch/save support, per-page
output with unique IDs, and provenance triple emission. Fixes
Mistral OCR bug where only the first 5 pages were processed.
2026-03-21 21:17:03 +00:00
cybermaggedon
88fe8468bc
Update CI for 2.1 release (#653) 2026-02-28 11:10:11 +00:00
cybermaggedon
cf0daedefa
Changed schema for Value -> Term, majorly breaking change (#622)
* Changed schema for Value -> Term, majorly breaking change

* Propagated the Value -> Term schema change into all processing

* Updated Cassandra for g, p, s, o index patterns (7 indexes)

* Reviewed and updated all tests

* Neo4j, Memgraph and FalkorDB remain broken; will revisit once things have settled down
2026-01-27 13:48:08 +00:00
cybermaggedon
e4f0013841
Open 1.9 branch (#620) 2026-01-26 17:36:25 +00:00
Cyber MacGeddon
1865b3f3c8 Start 1.8 release branch 2025-12-17 21:32:13 +00:00
Cyber MacGeddon
98aaa4f67e Configure for 1.7 release branch 2025-12-03 09:46:55 +00:00
cybermaggedon
97d8b84d7f
Open 1.6 release branch (#564) 2025-11-24 10:05:29 +00:00
cybermaggedon
ad35656811
Prepare 1.5 release branch (#550) 2025-10-11 11:44:00 +01:00
cybermaggedon
0b59f0c828
Maint/open 1.4 release branch (#508)
* Change pyproject files for 1.4

* Fix tests to track 1.4
2025-09-10 22:11:03 +01:00
cybermaggedon
5139c6ad5d
Bump pyproject.toml constraints (#477) 2025-08-28 13:45:58 +01:00
cybermaggedon
dd70aade11
Implement logging strategy (#444)
* Add logging strategy and convert all print() calls to logging invocations
2025-07-30 23:18:38 +01:00
cybermaggedon
98022d6af4
Migrate from setup.py to pyproject.toml (#440)
* Converted setup.py to pyproject.toml

* Modern package infrastructure as recommended by the Python docs
2025-07-23 21:22:08 +01:00
Cyber MacGeddon
1fe4ed5226 Update Python deps to 1.2 2025-07-17 19:26:19 +01:00
Cyber MacGeddon
f0b2752abf Bump setup.py versions for 1.1 2025-07-02 16:40:13 +01:00
cybermaggedon
6dc7b4cbfc
Merge pull request #382 from trustgraph-ai/fix/import-queues-not-working
Fix/import queues not working
2025-05-17 13:02:58 +01:00
Cyber MacGeddon
848d93922b Port Tesseract OCR code to new API 2025-05-12 16:27:04 +01:00
Cyber MacGeddon
6dadf30c66 Bump package versions 2025-05-08 22:06:58 +01:00
cybermaggedon
099018e103
Update package versions (#352) 2025-04-25 19:45:02 +01:00
cybermaggedon
a9197d11ee
Feature/configure flows (#345)
- Keeps processing in different flows separate so that data can go to different stores / collections etc.
- Potentially supports different processing flows
- Tidies the processing API with common base-classes for e.g. LLMs, and automatic configuration of 'clients' to use the right queue names in a flow
2025-04-22 20:21:38 +01:00
Cyber MacGeddon
b1cefbe1f7 Update setup.py files to prep 0.22 branch 2025-03-31 22:14:38 +01:00
cybermaggedon
c759d55734
Added a module which does OCR for PDF, pdf-ocr, in a separate package (#324)
(it has a lot of dependencies). Uses Tesseract.
2025-03-20 09:29:40 +00:00