trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-10 15:25:14 +02:00

Author	SHA1	Message	Date
cybermaggedon	25995d03f4	Fix stray log messages caused by librarian messages (#706) Warnings were generated when librarian responses meant for other services (chunker, embeddings, etc.) arrived on the shared response queue. The decoder's subscription picks them up, can't match them to a pending request, and logs a warning. Removed the warnings, as they served no purpose.	2026-03-23 13:16:39 +00:00
cybermaggedon	5c6fe90fe2	Add universal document decoder with multi-format support (#705) Add universal document decoder with multi-format support using 'unstructured'. New universal decoder service powered by the unstructured library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF, ODT, EPUB and more through a single service. Tables are preserved as HTML markup for better downstream extraction. Images are stored in the librarian but excluded from the text pipeline. Configurable section grouping strategies (whole-document, heading, element-type, count, size) for non-page formats. Page-based formats (PDF, PPTX, XLSX) are automatically grouped by page. All four decoders (PDF, Mistral OCR, Tesseract OCR, universal) now share the "document-decoder" ident so they are interchangeable. PDF-only decoders fetch document metadata to check MIME type and gracefully skip unsupported formats. Librarian changes: removed MIME type whitelist validation so any document format can be ingested. Simplified routing so text/plain goes to text-load and everything else goes to document-load. Removed dual inline/streaming data paths; documents always use document_id for content retrieval. New provenance entity types (tg:Section, tg:Image) and metadata predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for richer explainability. Universal decoder is in its own package (trustgraph-unstructured) and container image (trustgraph-unstructured).	2026-03-23 12:56:35 +00:00
cybermaggedon	4609424afe	Prepare 2.2 release branch (#704)	2026-03-22 15:23:23 +00:00
cybermaggedon	96fd1eab15	Use UUID-based URNs for page and chunk IDs (#703) Page and chunk document IDs were deterministic ({doc_id}/p{num}, {doc_id}/p{num}/c{num}), causing "Document already exists" errors when reprocessing documents through different flows. Content may differ between runs due to different parameters or extractors, so deterministic IDs are incorrect. Pages now use urn:page:{uuid}, chunks use urn:chunk:{uuid}. Parent-child relationships are tracked via librarian metadata and provenance triples. Also brings Mistral OCR and Tesseract OCR decoders up to parity with the PDF decoder: librarian fetch/save support, per-page output with unique IDs, and provenance triple emission. Fixes Mistral OCR bug where only the first 5 pages were processed.	2026-03-21 21:17:03 +00:00
cybermaggedon	88fe8468bc	Update CI for 2.1 release (#653)	2026-02-28 11:10:11 +00:00
cybermaggedon	cf0daedefa	Changed schema for Value -> Term, majorly breaking change (#622) * Changed schema for Value -> Term, majorly breaking change * Following the schema change, Value -> Term into all processing * Updated Cassandra for g, p, s, o index patterns (7 indexes) * Reviewed and updated all tests * Neo4j, Memgraph and FalkorDB remain broken, will look at once settled down	2026-01-27 13:48:08 +00:00
cybermaggedon	e4f0013841	Open 1.9 branch (#620)	2026-01-26 17:36:25 +00:00
Cyber MacGeddon	1865b3f3c8	Start 1.8 release branch	2025-12-17 21:32:13 +00:00
Cyber MacGeddon	98aaa4f67e	Configure for 1.7 release branch	2025-12-03 09:46:55 +00:00
cybermaggedon	97d8b84d7f	Open 1.6 release branch (#564)	2025-11-24 10:05:29 +00:00
cybermaggedon	ad35656811	Prepare 1.5 release branch (#550)	2025-10-11 11:44:00 +01:00
cybermaggedon	0b59f0c828	Maint/open 1.4 release branch (#508) * Change pyproject files for 1.4 * Fix tests to track 1.4	2025-09-10 22:11:03 +01:00
cybermaggedon	5139c6ad5d	Bump pyproject.toml constraints (#477)	2025-08-28 13:45:58 +01:00
cybermaggedon	dd70aade11	Implement logging strategy (#444) * Logging strategy and convert all prints() to logging invocations	2025-07-30 23:18:38 +01:00
cybermaggedon	98022d6af4	Migrate from setup.py to pyproject.toml (#440) * Converted setup.py to pyproject.toml * Modern package infrastructure as recommended by py docs	2025-07-23 21:22:08 +01:00
Cyber MacGeddon	1fe4ed5226	Update Python deps to 1.2	2025-07-17 19:26:19 +01:00
Cyber MacGeddon	f0b2752abf	Bump setup.py versions for 1.1	2025-07-02 16:40:13 +01:00
cybermaggedon	6dc7b4cbfc	Merge pull request #382 from trustgraph-ai/fix/import-queues-not-working Fix/import queues not working	2025-05-17 13:02:58 +01:00
Cyber MacGeddon	848d93922b	Port Tesseract OCR code to new API	2025-05-12 16:27:04 +01:00
Cyber MacGeddon	6dadf30c66	Bump package versions	2025-05-08 22:06:58 +01:00
cybermaggedon	099018e103	Update package versions (#352)	2025-04-25 19:45:02 +01:00
cybermaggedon	a9197d11ee	Feature/configure flows (#345) - Keeps processing in different flows separate so that data can go to different stores / collections etc. - Potentially supports different processing flows - Tidies the processing API with common base-classes for e.g. LLMs, and automatic configuration of 'clients' to use the right queue names in a flow	2025-04-22 20:21:38 +01:00
Cyber MacGeddon	b1cefbe1f7	Update setup.py files to prep 0.22 branch	2025-03-31 22:14:38 +01:00
cybermaggedon	c759d55734	Added module which does OCR for PDF, pdf-ocr in a separate package (#324) (has a lot of dependencies). Uses Tesseract.	2025-03-20 09:29:40 +00:00

24 commits
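
Illustration for commit 96fd1eab15 (#703): a minimal sketch of why the deterministic page IDs collided on reprocessing and how the urn:page:{uuid} / urn:chunk:{uuid} forms avoid that. Only the ID formats come from the commit message. The function names and the parent_of dictionary are hypothetical stand-ins for the librarian metadata and provenance triples, not TrustGraph's actual API.

```python
import uuid


def legacy_page_id(doc_id: str, page_num: int) -> str:
    """Old deterministic scheme quoted in the commit message: {doc_id}/p{num}."""
    return f"{doc_id}/p{page_num}"


def new_page_id() -> str:
    """New scheme: a fresh random UUID per page, wrapped as urn:page:{uuid}."""
    return f"urn:page:{uuid.uuid4()}"


def new_chunk_id() -> str:
    """New scheme: a fresh random UUID per chunk, wrapped as urn:chunk:{uuid}."""
    return f"urn:chunk:{uuid.uuid4()}"


# Hypothetical stand-in for librarian metadata: maps child ID -> parent ID.
# With opaque IDs the hierarchy can no longer be read off the ID string,
# so it has to be recorded explicitly.
parent_of: dict[str, str] = {}


def process_page(doc_id: str, chunks_per_page: int) -> str:
    """Register one page and its chunks; return the new page ID."""
    page_id = new_page_id()
    parent_of[page_id] = doc_id
    for _ in range(chunks_per_page):
        parent_of[new_chunk_id()] = page_id
    return page_id


if __name__ == "__main__":
    # Reprocessing the same page under the old scheme yields the same ID,
    # which is what triggered "Document already exists".
    print(legacy_page_id("doc-1", 3) == legacy_page_id("doc-1", 3))

    # Under the new scheme each run gets its own page ID.
    first_run = process_page("doc-1", chunks_per_page=2)
    second_run = process_page("doc-1", chunks_per_page=2)
    print(first_run != second_run)
```

The trade-off in this sketch is that the IDs carry no structure of their own: finding a chunk's page or a page's document means looking the relationship up, which matches the commit's note that parent-child relationships are tracked via librarian metadata and provenance triples.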