Commit graph

27 commits

Author SHA1 Message Date
cybermaggedon
5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
cybermaggedon
f6bccd7438
Parallel container builds (#515) 2025-09-11 12:32:04 +01:00
cybermaggedon
3e0651222b
Install missing build deps (#442) 2025-07-23 21:28:19 +01:00
cybermaggedon
e19e0f00fe
Install missing build deps (#441) 2025-07-23 21:25:48 +01:00
cybermaggedon
ad5a1bbff4
Remove release bundle step from release (#339) 2025-04-04 15:59:10 +01:00
cybermaggedon
ceff3f0e34
Remove 2nd push (#258) 2025-01-06 21:52:25 +00:00
cybermaggedon
dc2b599fda
Fix/release broken (#257)
* Break release into 3 jobs

* Replace Github action with podman command
2025-01-06 21:45:42 +00:00
JackColquitt
5946c47d3d Updated docker login and checkout versions 2025-01-06 12:43:01 -08:00
JackColquitt
6a27de22db Updated docker push action 2025-01-06 12:33:57 -08:00
Cyber MacGeddon
cff90cada1 Prepare for 0.19 2024-12-30 10:44:33 +00:00
Cyber MacGeddon
62d25effd5 Fix pipeline 2024-12-20 10:16:25 +00:00
Cyber MacGeddon
d6cdce8391 Open 0.18 branch 2024-12-10 22:13:10 +00:00
cybermaggedon
7df7843dad
Main/remove parquet (#195)
* Remove Parquet code, and package build
2024-12-06 08:51:10 +00:00
Cyber MacGeddon
c844d805e5 Setup for release 0.17 branch 2024-11-29 17:03:31 +00:00
Cyber MacGeddon
b536d78b57 Prepare for 0.16: Change Python dep restrictions and Gitlab merge criteria 2024-11-20 19:55:05 +00:00
Cyber MacGeddon
5140f8834d CI/CD for 0.15 2024-11-10 11:40:28 +00:00
Cyber MacGeddon
f83643f670 Update CI for 0.14 2024-10-25 15:12:53 +01:00
Cyber MacGeddon
b8818e28d0 Fire actions on 0.13 tag 2024-10-15 20:43:32 +01:00
cybermaggedon
ec444d12b7
Bug in release YAML (#122) 2024-10-15 20:35:48 +01:00
Cyber MacGeddon
6ecc7536d7 Fix pipeline 2024-10-04 20:24:17 +01:00
Cyber MacGeddon
e271bb7317 Fix pipeline 2024-10-04 20:21:44 +01:00
Cyber MacGeddon
169ccc1997 Permissions? 2024-10-04 20:12:41 +01:00
Cyber MacGeddon
0ae427c8e5 Permissions? 2024-10-04 18:24:33 +01:00
Cyber MacGeddon
fb2fb4135a Maybe fix permissions 2024-10-04 18:20:22 +01:00
Cyber MacGeddon
61af3a5fbd Stop branch trigger 2024-10-04 18:16:43 +01:00
cybermaggedon
514b0e8ac6
Feature/workflows 2 (#106)
* Container push to Docker hub
2024-10-04 18:15:30 +01:00
cybermaggedon
dda29bb663
Workflows (#105)
* Some basic structure for workflows
* Add PyPI publication for 0.12
* Bump version
* Test bundle generation
* Install jsonnet
* Use release action to automate release creation
2024-10-04 17:28:07 +01:00