fix: large document handling and Cassandra query pagination

- Paginate heavy Cassandra reads (triples, graph/document embeddings) using synchronous session.execute() in run_in_executor with fetch_size paging, preventing materialization hang on large result sets
- Fix document stream endpoint to use workspace-scoped librarian queues
- Add decoder error handling for PDF/OCR/unstructured processors
- Add WebSocket mux guards for missing auth fields
- Add null check in librarian document streaming
- Rewrite get_document_content CLI to stream via librarian
- Add Poppler dependency to unstructured container
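
The pagination approach in the first item can be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: `iter_pages` is a hypothetical helper name, and it assumes that `session.execute(statement)` returns a result set exposing `current_rows`, `has_more_pages` and `fetch_next_page()`, as the DataStax Python driver's `ResultSet` does.

```python
import asyncio
from typing import Any, AsyncIterator, List


async def iter_pages(session: Any, statement: Any) -> AsyncIterator[List[Any]]:
    """Yield one page of rows at a time without blocking the event loop.

    Hypothetical helper. Assumes `statement` already carries a fetch_size,
    and that the result set exposes current_rows, has_more_pages and
    fetch_next_page().
    """
    loop = asyncio.get_running_loop()
    # The synchronous execute() call runs on the default thread pool.
    result = await loop.run_in_executor(None, session.execute, statement)
    while True:
        # Copy only the current page, never the whole result set.
        yield list(result.current_rows)
        if not result.has_more_pages:
            break
        # Fetching the next page also blocks, so keep it off the loop.
        await loop.run_in_executor(None, result.fetch_next_page)
```

With the real driver, `statement` would typically be built as `SimpleStatement(query, fetch_size=N)`, so each round trip returns at most `N` rows and the caller handles one page before the next is requested.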
2026-06-13 08:45:13 +02:00 · 2026-06-01 22:37:04 +01:00 · 2026-06-01 22:37:04 +01:00 · c3ce07d6f0
commit c3ce07d6f0
parent 7e1fb76bc9
11 changed files with 166 additions and 74 deletions
--- a/trustgraph-flow/trustgraph/decoding/pdf/pdf_decoder.py
+++ b/trustgraph-flow/trustgraph/decoding/pdf/pdf_decoder.py
@@ -129,7 +129,15 @@ class Processor(FlowProcessor):
                )
                PyPDFLoader = _cls
            loader = PyPDFLoader(temp_path)
-            pages = loader.load()
+            try:
+                pages = loader.load()
+            except Exception as e:
+                source_doc_id = v.document_id or v.metadata.id
+                logger.error(
+                    f"Failed to decode PDF {source_doc_id}: "
+                    f"{type(e).__name__}: {e}"
+                )
+                return

            # Get the source document ID
            source_doc_id = v.document_id or v.metadata.id