fix: large document handling and Cassandra query pagination

- Paginate heavy Cassandra reads (triples, graph/document embeddings)
  by running synchronous session.execute() in run_in_executor with
  fetch_size paging, so large result sets no longer hang while being
  materialized
- Fix document stream endpoint to use workspace-scoped librarian queues
- Add decoder error handling for PDF/OCR/unstructured processors
- Add WebSocket mux guards for missing auth fields
- Add null check in librarian document streaming
- Rewrite get_document_content CLI to stream via librarian
- Add Poppler dependency to unstructured container
Cyber MacGeddon 2026-06-01 22:37:04 +01:00
parent 7e1fb76bc9
commit c3ce07d6f0
11 changed files with 166 additions and 74 deletions
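
The paging approach named in the first bullet of the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `read_paged`, `_drain`, and `on_row` are hypothetical names, and the sketch assumes a DataStax-driver-style `session` whose `execute()` returns an iterable that fetches further pages on demand when the statement carries a `fetch_size` (for example `SimpleStatement(cql, fetch_size=...)`).

```python
import asyncio
from typing import Any, Callable


def _drain(session: Any, statement: Any, params: tuple,
           on_row: Callable[[Any], None]) -> int:
    """Blocking helper, meant to run on a worker thread.

    Assumption: iterating the object returned by session.execute()
    pulls one page of rows at a time, so the full result set is never
    held in memory at once.
    """
    count = 0
    for row in session.execute(statement, params):
        on_row(row)  # called on the worker thread, not the event loop
        count += 1
    return count


async def read_paged(session: Any, statement: Any, params: tuple = (),
                     on_row: Callable[[Any], None] = lambda row: None) -> int:
    """Run the blocking, paged read without stalling the event loop."""
    loop = asyncio.get_running_loop()
    # None selects the loop's default thread pool executor.
    return await loop.run_in_executor(
        None, _drain, session, statement, params, on_row
    )
```

Because `on_row` runs on the executor thread, anything it touches that belongs to the event loop would need to be handed over in a thread-safe way; the sketch leaves that to the caller.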


@@ -129,7 +129,15 @@ class Processor(FlowProcessor):
 )
 PyPDFLoader = _cls
 loader = PyPDFLoader(temp_path)
-pages = loader.load()
+try:
+    pages = loader.load()
+except Exception as e:
+    source_doc_id = v.document_id or v.metadata.id
+    logger.error(
+        f"Failed to decode PDF {source_doc_id}: "
+        f"{type(e).__name__}: {e}"
+    )
+    return

 # Get the source document ID
 source_doc_id = v.document_id or v.metadata.id