mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-06-22 13:18:06 +02:00
Compare commits
No commits in common. "master" and "v2.5.4" have entirely different histories.
63 changed files with 1638 additions and 3408 deletions
282
README.md
|
|
@ -3,7 +3,7 @@
|
|||
|
||||
<img src="TG-fullname-logo.svg" width=100% />
|
||||
|
||||
[](https://pypi.org/project/trustgraph/)  
|
||||
[](https://pypi.org/project/trustgraph/) [](LICENSE) 
|
||||
[](https://discord.gg/sQMwkRz5GX) [](https://deepwiki.com/trustgraph-ai/trustgraph)
|
||||
|
||||
|
|
@ -11,89 +11,44 @@
|
|||
|
||||
<a href="https://trendshift.io/repositories/17291" target="_blank"><img src="https://trendshift.io/api/badge/repositories/17291" alt="trustgraph-ai%2Ftrustgraph | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
|
||||
|
||||
# Write context once. Run agents anywhere.
|
||||
# The agent runtime platform
|
||||
|
||||
</div>
|
||||
|
||||
Stop rebuilding context from scratch. TrustGraph treats context as a holon — a modular, independent whole that naturally snaps into a larger domain-wide intelligence layer. By deploying context as holonic context graphs, TrustGraph powers multi-tenant agent workflows, dramatically reduces token consumption, and aligns with semantic web standards (RDF, OWL, SKOS, SHACL). Version your context, share it across teams, and scale with full provenance.
|
||||
TrustGraph is an agent runtime platform built around context graphs — structured, queryable representations of your domain knowledge that ground every agent query in verified, explainable facts in private deployments with sovereign control. The platform is the full stack for agentic systems: context graphs, memory, retrieval, orchestration, and inference for precision-critical agent workloads.
|
||||
|
||||
## What TrustGraph Does
|
||||
|
||||
TrustGraph is a complete holonic context harness for all LLMs. It provides the full infrastructure layer underneath your agents: knowledge ingestion, structured storage, graph-grounded retrieval, agent orchestration, and a full LLM inferencing stack.
|
||||
|
||||
TrustGraph relies on absolutely no 3rd party services aside from optional API integrations to cloud-hosted LLMs. Whether you are using Anthropic's or OpenAI's API, or self-hosting Qwen3.7 via vLLM, TrustGraph handles it all with pre-built API connectors and a full LLM inferencing stack to enrich the models with a sovereign, private holonic system that grounds your agents in reality.
|
||||
|
||||
## The Problem: Why Agents Break
|
||||
|
||||
When you build an AI agent today, you spend most of your time fighting context:
|
||||
|
||||
- **RAG retrieves fragments, not meaning**. Chunks of text have no structure. Relationships between facts are invisible. Your agent guesses at the connections.
|
||||
|
||||
- **Context is disposable**. What the agent learned in one session is gone in the next. There is no persistent, structured knowledge layer underneath.
|
||||
|
||||
- **Answers aren't traceable**. You can't explain why the agent said what it said, which means you can't trust it in production.
|
||||
|
||||
- **Knowledge can't be reused**. You rebuild the same context pipelines for every new project, every new agent, every new environment.
|
||||
|
||||
These aren't retrieval problems. They are structural problems. Context needs to be organized, versioned, and composable — exactly the way software infrastructure is.
|
||||
|
||||
## The Solution: A Holonic Context System
|
||||
The philosopher Arthur Koestler coined the word [holon](https://en.wikipedia.org/wiki/Holon_(philosophy)) to describe something that is simultaneously a whole in itself and a part of something larger. A fact is whole. It is also part of a domain. A domain is whole. It is also part of an organization's knowledge.
|
||||
|
||||
AI agents break down because this holonic structure is never built. Context gets shoved into flat text windows, scattered across vector stores, or hardwired into one-off prompts. Facts lose their relationships.
|
||||
|
||||
TrustGraph solves this by organizing your domain into holonic context graphs. Entities, relationships, and evidence are treated as first-class objects. Every agent query is grounded against these holons—marrying symbolic graph structures with vector embeddings. Every answer carries provenance. Every fact is traceable.
|
||||
|
||||
## Context Cores: Knowledge as a First-Class Citizen
|
||||
|
||||
A Context Core is the deployable unit of knowledge in TrustGraph. It packages everything an agent needs to reason reliably over a domain into a single, portable artifact.
|
||||
|
||||
### What's inside a Context Core
|
||||
- **Ontology** — your domain schema and entity mappings
|
||||
- **Holon** — entities, relationships, and supporting evidence
|
||||
- **Embeddings** — vector indexes for fast semantic entry-point lookup
|
||||
- **Provenance** — where every fact came from, when, and how it was derived
|
||||
- **Retrieval policies** — traversal rules, freshness controls, authority ranking
|
||||
|
||||
Context Cores decouple what agents know from how agents are deployed. Build once. Run in Docker locally, Kubernetes in production, or on any cloud. Pin a version. Roll back. Promote across environments. This is context engineering — and it works because knowledge is finally treated like the infrastructure it is.
|
||||
|
||||
## Explainability: Trust Your Agents in Production
|
||||
LLMs are black boxes, and traditional RAG makes it worse. When an agent pulls flat text chunks from a vector store, you have no idea how it connected those fragments to form an answer. You cannot ship agents to production if you can't explain why they said what they said.
|
||||
|
||||
### How TrustGraph makes agents explainable:
|
||||
|
||||
- **Traceable Reasoning Paths**: Instead of guessing at connections between text chunks, TrustGraph traverses explicit relationship paths in the holonic context graph. You can inspect exactly which entities, relationships, and sub-graphs were pulled into the LLM's context window to generate a given response.
|
||||
- **Fact-Level Provenance**: Every node and edge in the graph carries strict provenance. When an agent makes a claim, you can trace it back to the exact source document, the time it was ingested, and the extraction method used to derive it.
|
||||
- **No Black-Box Guesses**: By grounding the LLM in a structured, symbolic graph, you eliminate the hallucinations that occur when models are forced to infer relationships from unstructured text. If a fact isn't in the graph, the agent doesn't use it.
|
||||
|
||||
TrustGraph doesn't just give you answers - it gives you the receipt. Every fact is traceable, every connection is visible, and every output is verifiable.
|
||||
|
||||
## Workspaces, Collections, and Flows
|
||||
|
||||
TrustGraph has a [three-level system](https://docs.trustgraph.ai/overview/workspaces) for organizing and isolating knowledge.
|
||||
|
||||
A `Workspace` is the outermost boundary — a fully isolated tenancy scope where all data, users, configuration, and pipelines live independently from every other workspace. Isolation is structural: enforced at the pub/sub queue, storage, and API gateway layers, not by trusting a field in a message body.
|
||||
|
||||
Within a workspace, a `Collection` groups related holons, graph structures, embeddings, and documents together — think of it as a dedicated shelf in a library, scoped to a specific domain, project, or customer.
|
||||
|
||||
A `Flow` is a running data processing pipeline that defines how raw data moves through ingestion, extraction, structuring, and storage — the assembly line that turns documents into queryable knowledge. Together, the three layers let you run multiple isolated tenants on a single deployment, separate knowledge by domain within each tenant, and process that knowledge through fully configurable pipelines — all without restarting the system or rebuilding your infrastructure.
|
||||
|
||||
## The Full Stack
|
||||
TrustGraph is not a wrapper around a graph database. It is the complete backend for production agentic systems.
|
||||
|
||||
- **Holonic context graph engine**: automated entity and relationship extraction, ontology-driven graph construction, graph-grounded retrieval for explainable outputs
|
||||
- **Multi-model database**: tabular/relational, key-value, document, graph, vectors, images, video, and audio — all managed in Cassandra and S3-compatible Garage
|
||||
- **Out-of-the-box RAG pipelines**: DocumentRAG, GraphRAG, and OntologyRAG ready to deploy
|
||||
- **Fully agentic orchestration**: single or multi-agent, ReAct, Plan-then-Execute, Supervisor patterns, and MCP integration
|
||||
- **3D Knowledge Explorer**: interactive graph visualization with BFS neighborhood extraction and edge pulse animation
|
||||
- **Automated data ingest**: quick ingest with semantic similarity or ontology-structured precision retrieval
|
||||
- **Run anywhere**: Docker/Podman locally, Kubernetes in the cloud
|
||||
|
||||
All major LLMs — Anthropic, Cohere, Gemini, Mistral, OpenAI, and more via API.
|
||||
|
||||
vLLM, Ollama, TGI, LM Studio, and Llamafiles for fully local inferencing.
|
||||
|
||||
Verified cloud deployments for Alibaba Cloud, AWS, Azure, GCP, OVHcloud, and Scaleway.
|
||||
The platform:
|
||||
- [x] Multi-model and multimodal database system
|
||||
- [x] Tabular/relational, key-value
|
||||
- [x] Document, graph, and vectors
|
||||
- [x] Images, video, and audio
|
||||
- [x] Context Graph engine
|
||||
- [x] Automated entity and relationship extraction
|
||||
- [x] Ontology-driven graph construction
|
||||
- [x] Graph-grounded retrieval for explainable outputs
|
||||
- [x] Automated data ingest and loading
|
||||
- [x] Quick ingest with semantic similarity retrieval
|
||||
- [x] Ontology structuring for precision retrieval
|
||||
- [x] Out-of-the-box RAG pipelines
|
||||
- [x] DocumentRAG
|
||||
- [x] GraphRAG
|
||||
- [x] OntologyRAG
|
||||
- [x] 3D GraphViz for exploring context
|
||||
- [x] Fully Agentic System
|
||||
- [x] Single or Multi Agent
|
||||
- [x] ReAct, Plan-then-Execute, and Supervisor patterns
|
||||
- [x] MCP integration
|
||||
- [x] Run anywhere
|
||||
- [x] Deploy locally with Docker
|
||||
- [x] Deploy in cloud with Kubernetes
|
||||
- [x] Support for all major LLMs
|
||||
- [x] API support for Anthropic, Cohere, Gemini, Mistral, OpenAI, and others
|
||||
- [x] Model inferencing with vLLM, Ollama, TGI, LM Studio, and Llamafiles
|
||||
- [x] Developer friendly
|
||||
- [x] REST API [Docs](https://docs.trustgraph.ai/reference/apis/rest.html)
|
||||
- [x] Websocket API [Docs](https://docs.trustgraph.ai/reference/apis/websocket.html)
|
||||
- [x] Python API [Docs](https://docs.trustgraph.ai/reference/apis/python)
|
||||
- [x] CLI [Docs](https://docs.trustgraph.ai/reference/cli/)
|
||||
|
||||
## No API Keys Required
|
||||
|
||||
|
|
@ -107,12 +62,12 @@ Everything else is included.
|
|||
- [x] Managed Multi-model storage in [Cassandra](https://cassandra.apache.org/_/index.html)
|
||||
- [x] Managed Vector embedding storage in [Qdrant](https://github.com/qdrant/qdrant)
|
||||
- [x] Managed File and Object storage in [Garage](https://github.com/deuxfleurs-org/garage) (S3 compatible)
|
||||
- [x] Managed High-speed Pub/Sub messaging fabric with [Pulsar](https://github.com/apache/pulsar) or [RabbitMQ](https://www.rabbitmq.com/)
|
||||
- [x] Managed High-speed Pub/Sub messaging fabric with [Pulsar](https://github.com/apache/pulsar)
|
||||
- [x] Complete LLM inferencing stack for open LLMs with [vLLM](https://github.com/vllm-project/vllm), [TGI](https://github.com/huggingface/text-generation-inference), [Ollama](https://github.com/ollama/ollama), [LM Studio](https://github.com/lmstudio-ai), and [Llamafiles](https://github.com/mozilla-ai/llamafile)
|
||||
|
||||
## Quickstart
|
||||
|
||||
No need to clone the repo unless you are building from source. TrustGraph deploys as a set of Docker containers. Configure it on the command line in one step:
|
||||
There's no need to clone this repo, unless you want to build from source. TrustGraph is a fully containerized app that deploys as a set of Docker containers. To configure TrustGraph on the command line:
|
||||
|
||||
```
|
||||
npx @trustgraph/config
|
||||
|
|
@ -123,39 +78,44 @@ The config process will generate an app config that can be run locally with Dock
|
|||
- Deployment instructions as `INSTALLATION.md`
|
||||
|
||||
<p align="center">
|
||||
<video src="https://github.com/user-attachments/assets/33434c3c-f586-4610-8bb2-d7b7b586a672"
|
||||
<video src="https://github.com/user-attachments/assets/2978a6aa-4c9c-4d7c-ad02-8f3d01a1c602"
|
||||
width="80%" controls></video>
|
||||
</p>
|
||||
|
||||
For browser-based configuration, try the [Configuration Terminal](https://config-ui.demo.trustgraph.ai/).
|
||||
|
||||
## Watch What is a Holonic Context Graph?
|
||||
## Watch What is a Context Graph?
|
||||
|
||||
[](https://www.youtube.com/watch?v=gZjlt5WcWB4)
|
||||
|
||||
## Watch Holonic Context Graphs in Action
|
||||
## Watch Context Graphs in Action
|
||||
|
||||
[](https://www.youtube.com/watch?v=sWc7mkhITIo)
|
||||
|
||||
## Getting Started with TrustGraph
|
||||
|
||||
- [**Getting Started Guides**](https://docs.trustgraph.ai/getting-started)
|
||||
- [**Using the Workbench**](#workbench)
|
||||
- [**Developer APIs and CLI**](https://docs.trustgraph.ai/reference)
|
||||
- [**Deployment Guides**](https://docs.trustgraph.ai/deployment)
|
||||
|
||||
## TrustGraph UI
|
||||
## Workbench
|
||||
|
||||
<img width="1389" height="961" alt="Image" src="https://github.com/user-attachments/assets/35c9250d-0f01-40cb-9294-1ee8fd9a1b56" />
|
||||
The **Workbench** provides tools for all major features of TrustGraph. The **Workbench** is on port `8888` by default.
|
||||
|
||||
The UI provides tools for all major features of TrustGraph. The UI deploys on port `8888` by default.
|
||||
|
||||
- **Agent Console** — Query your agents directly with streaming responses and live explainability event tracking, so you can watch reasoning unfold in real time
|
||||
- **GraphRAG View** — Interactive graph RAG queries with a visual explainability DAG and inline provenance display, making it easy to see exactly where answers came from
|
||||
- **Context Explorer** — An interactive 3D context graph explorer with dynamic graph loading, BFS neighborhood extraction, edge pulse animation, and multiple navigation views
|
||||
- **Document Ingestion** — A complete upload and submission workflow with page and chunk inspection and document structure browsing
|
||||
- **Ontology Workbench** — A full ontology editor with class and property trees, OWL/XML and Turtle import/export with round-trip fidelity, circular dependency detection, and safe-delete confirmation dialogs
|
||||
- **Schema Workbench** — Interactive schema management with list, create, edit, and delete operations including field and index management
|
||||
- **Prompt Editor** — A dedicated prompt editing workflow
|
||||
- **Vector Search**: Search the installed knowledge bases
|
||||
- **Agentic, GraphRAG and LLM Chat**: Chat interface for agents, GraphRAG queries, or direct to LLMs
|
||||
- **Relationships**: Analyze deep relationships in the installed knowledge bases
|
||||
- **Graph Visualizer**: 3D GraphViz of the installed knowledge bases
|
||||
- **Library**: Staging area for installing knowledge bases
|
||||
- **Flow Classes**: Workflow preset configurations
|
||||
- **Flows**: Create custom workflows and adjust LLM parameters during runtime
|
||||
- **Knowledge Cores**: Manage reusable knowledge bases
|
||||
- **Prompts**: Manage and adjust prompts during runtime
|
||||
- **Schemas**: Define custom schemas for structured data knowledge bases
|
||||
- **Ontologies**: Define custom ontologies for unstructured data knowledge bases
|
||||
- **Agent Tools**: Define tools with collections, knowledge cores, MCP connections, and tool groups
|
||||
- **MCP Tools**: Connect to MCP servers
|
||||
|
||||
## TypeScript Library for UIs
|
||||
|
||||
|
|
@ -165,6 +125,134 @@ There are 3 libraries for quick UI integration of TrustGraph services.
|
|||
- [@trustgraph/react-state](https://www.npmjs.com/package/@trustgraph/react-state)
|
||||
- [@trustgraph/react-provider](https://www.npmjs.com/package/@trustgraph/react-provider)
|
||||
|
||||
## Context Cores
|
||||
|
||||
Context Cores are how TrustGraph treats context like code. A Context Core is a **portable, versioned bundle of context** that you can ship between projects and environments, pin in production, and reuse across agents. It packages the “stuff agents need to know” (structured knowledge + embeddings + evidence + policies) into a single artifact, so you can treat context like code: build it, test it, version it, promote it, and roll it back. TrustGraph is built to support this kind of end-to-end context engineering and orchestration workflow.
|
||||
|
||||
### What’s inside a Context Core
|
||||
A Context Core typically includes:
|
||||
- Ontology (your domain schema) and mappings
|
||||
- Context Graph (entities, relationships, supporting evidence)
|
||||
- Embeddings / vector indexes for fast semantic entry-point lookup
|
||||
- Source manifests + provenance (where facts came from, when, and how they were derived)
|
||||
- Retrieval policies (traversal rules, freshness, authority ranking)
|
||||
|
||||
## Tech Stack
|
||||
TrustGraph provides component flexibility to optimize agent workflows.
|
||||
|
||||
<details>
|
||||
<summary>LLM APIs</summary>
|
||||
<br>
|
||||
|
||||
- Anthropic<br>
|
||||
- AWS Bedrock<br>
|
||||
- AzureAI<br>
|
||||
- AzureOpenAI<br>
|
||||
- Cohere<br>
|
||||
- Google AI Studio<br>
|
||||
- Google VertexAI<br>
|
||||
- Mistral<br>
|
||||
- OpenAI<br>
|
||||
|
||||
</details>
|
||||
<details>
|
||||
<summary>LLM Orchestration</summary>
|
||||
<br>
|
||||
|
||||
- LM Studio<br>
|
||||
- Llamafiles<br>
|
||||
- Ollama<br>
|
||||
- TGI<br>
|
||||
- vLLM<br>
|
||||
|
||||
</details>
|
||||
<details>
|
||||
<summary>Multi-model storage</summary>
|
||||
<br>
|
||||
|
||||
- Apache Cassandra<br>
|
||||
|
||||
</details>
|
||||
<details>
|
||||
<summary>VectorDB</summary>
|
||||
<br>
|
||||
|
||||
- Qdrant<br>
|
||||
|
||||
</details>
|
||||
<details>
|
||||
<summary>File and Object Storage</summary>
|
||||
<br>
|
||||
|
||||
- Garage<br>
|
||||
|
||||
</details>
|
||||
<details>
|
||||
<summary>Observability</summary>
|
||||
<br>
|
||||
|
||||
- Prometheus<br>
|
||||
- Grafana<br>
|
||||
- Loki<br>
|
||||
|
||||
</details>
|
||||
<details>
|
||||
<summary>Data Streaming</summary>
|
||||
<br>
|
||||
|
||||
- Apache Pulsar<br>
|
||||
- RabbitMQ<br>
|
||||
- Apache Kafka<br>
|
||||
|
||||
</details>
|
||||
<details>
|
||||
<summary>Clouds</summary>
|
||||
<br>
|
||||
|
||||
- AWS<br>
|
||||
- Azure<br>
|
||||
- Google Cloud<br>
|
||||
- OVHcloud<br>
|
||||
- Scaleway<br>
|
||||
|
||||
</details>
|
||||
|
||||
## Observability & Telemetry
|
||||
|
||||
Once the platform is running, access the Grafana dashboard at:
|
||||
|
||||
```
http://localhost:3000
```
|
||||
|
||||
Default credentials are:
|
||||
|
||||
```
user: admin
password: admin
```
|
||||
|
||||
The default Grafana dashboard tracks the following:
|
||||
|
||||
<details>
|
||||
<summary>Telemetry</summary>
|
||||
<br>
|
||||
|
||||
- LLM Latency<br>
|
||||
- Error Rate<br>
|
||||
- Service Request Rates<br>
|
||||
- Queue Backlogs<br>
|
||||
- Chunking Histogram<br>
|
||||
- Error Source by Service<br>
|
||||
- Rate Limit Events<br>
|
||||
- CPU usage by Service<br>
|
||||
- Memory usage by Service<br>
|
||||
- Models Deployed<br>
|
||||
- Token Throughput (Tokens/second)<br>
|
||||
- Cost Throughput (Cost/second)<br>
|
||||
|
||||
</details>
|
||||
|
||||
## Contributing
|
||||
|
||||
[Developer's Guide](https://docs.trustgraph.ai/guides/building/introduction.html)
|
||||
|
|
@ -173,7 +261,7 @@ There are 3 libraries for quick UI integration of TrustGraph services.
|
|||
|
||||
**TrustGraph** is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
|
||||
|
||||
Copyright 2024-2026 TrustGraph
|
||||
Copyright 2024-2025 TrustGraph
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
|
|
|
|||
|
|
@ -7,7 +7,7 @@ FROM docker.io/fedora:42 AS base
|
|||
|
||||
ENV PIP_BREAK_SYSTEM_PACKAGES=1
|
||||
|
||||
RUN dnf install -y python3.13 libxcb mesa-libGL poppler-utils && \
|
||||
RUN dnf install -y python3.13 libxcb mesa-libGL && \
|
||||
alternatives --install /usr/bin/python python /usr/bin/python3.13 1 && \
|
||||
python -m ensurepip --upgrade && \
|
||||
pip3 install --no-cache-dir --upgrade 'pip>=26.0' 'setuptools>=78.1.1' && \
|
||||
|
|
|
|||
|
|
@ -1,535 +0,0 @@
|
|||
---
|
||||
layout: default
|
||||
title: "Knowledge Core Completeness"
|
||||
parent: "Tech Specs"
|
||||
---
|
||||
|
||||
# Knowledge Core Completeness
|
||||
|
||||
## Overview
|
||||
|
||||
Knowledge cores are portable snapshots of extracted knowledge: triples, graph
|
||||
embeddings, and document embeddings stored in Cassandra's `knowledge` keyspace.
|
||||
They can be downloaded as files, transferred between TrustGraph instances, and
|
||||
loaded back into vector and graph stores.
|
||||
|
||||
Recent additions to TrustGraph — explainability/provenance and named graphs —
|
||||
were not carried through to the knowledge core system. This means that
|
||||
exporting and re-importing a core loses provenance links, graph assignments,
|
||||
and source material, breaking the explainability chain.
|
||||
|
||||
This specification addresses three gaps:
|
||||
|
||||
1. **Named graphs not stored** — The `g` (graph name) field on triples is
|
||||
silently dropped when writing to the core store and comes back as `None`
|
||||
on read.
|
||||
2. **Provenance triples not captured** — Provenance triples (PROV-O) are
|
||||
generated during extraction and flow to graph stores, but never enter
|
||||
the knowledge core store. It is unclear whether they arrive at the store
|
||||
in the correct form.
|
||||
3. **Source material not included** — Documents, text pages, and chunks in
|
||||
the librarian's bucket store are not part of the core. After loading a
|
||||
core on a different instance, provenance links to source material point
|
||||
at nothing.
|
||||
|
||||
## Goals
|
||||
|
||||
- **Self-contained cores**: A downloaded knowledge core file contains
|
||||
everything needed to reconstruct the full knowledge graph including
|
||||
provenance and source attribution on a fresh instance.
|
||||
- **Named graph preservation**: Round-tripping a core preserves graph
|
||||
assignments on all triples.
|
||||
- **Backward compatibility**: Existing core files (without graph names or
|
||||
source material) can still be uploaded and loaded. New fields are optional
|
||||
on import.
|
||||
- **No change to core identity**: A core is still identified by its document
|
||||
ID. The additional data is associated with the same core ID.
|
||||
- **Minimal file format changes**: Extend the existing msgpack record format
|
||||
with new record types rather than restructuring existing ones.
|
||||
|
||||
## Background
|
||||
|
||||
### Current Lifecycle
|
||||
|
||||
```
Extraction pipeline
    │
    ├─ triples ──────────────────► knowledge core store (Cassandra)
    ├─ graph embeddings ─────────► knowledge core store (Cassandra)
    ├─ document embeddings ──────► knowledge core store (Cassandra)
    ├─ provenance triples ───────► graph store (only)
    └─ source documents ─────────► librarian bucket store (only)

Download: Cassandra ──► knowledge manager ──► API gateway ──► client file
Upload:   client file ──► API gateway ──► knowledge manager ──► Cassandra
Load:     Cassandra ──► knowledge manager ──► Pulsar topics ──► graph/vector stores
```
|
||||
|
||||
### Current Core File Format (msgpack)
|
||||
|
||||
A core file is a sequence of concatenated msgpack records. Each record is a
|
||||
2-element tuple: `(type_tag, payload)`.
|
||||
|
||||
| Type tag | Payload | Description |
|----------|---------|-------------|
| `"t"` | `{"m": {id, root, collection}, "t": [triple_dicts]}` | Triple batch |
| `"ge"` | `{"m": {id, root, collection}, "e": [{entity, vector}]}` | Graph embedding batch |
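
The record layout above can be illustrated with a minimal sketch that walks already-decoded `(type_tag, payload)` records and counts what each batch carries. It assumes the file has been decoded elsewhere (for example with a streaming msgpack unpacker). The function name `summarize_records` and all sample values are hypothetical, not part of TrustGraph.

```python
from collections import Counter


def summarize_records(records):
    """Count items per record type in an already-decoded core file.

    Each record is a 2-element sequence: (type_tag, payload).
    """
    counts = Counter()
    for type_tag, payload in records:
        if type_tag == "t":
            # Triple batch: the triple dicts live under the "t" key.
            counts["triples"] += len(payload["t"])
        elif type_tag == "ge":
            # Graph embedding batch: entity/vector pairs live under "e".
            counts["graph_embeddings"] += len(payload["e"])
        else:
            # Unknown tags are counted rather than rejected, so a reader
            # can tolerate record types it does not know about.
            counts["unknown_records"] += 1
    return counts


# Hypothetical sample data; real field values will differ.
meta = {"id": "doc-1", "root": "", "collection": "default"}
sample_records = [
    ("t", {"m": meta, "t": [
        {"s": "s1", "p": "p1", "o": "o1"},
        {"s": "s2", "p": "p2", "o": "o2"},
    ]}),
    ("ge", {"m": meta, "e": [{"entity": "s1", "vector": [0.1, 0.2]}]}),
    ("x", {}),
]

summary = summarize_records(sample_records)
print(dict(summary))
```

Counting unknown tags instead of failing is one way a reader could stay compatible with files that carry record types it does not recognize. Whether TrustGraph's own tools behave this way is not stated here.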
|
||||
|
||||
### What's Missing
|
||||
|
||||
#### Named Graphs
|
||||
|
||||
The `Triple` dataclass has a `g: str | None` field (graph name IRI), used to
|
||||
separate provenance graphs (`urn:graph:source`, `urn:graph:retrieval`) from
|
||||
the default graph. However:
|
||||
|
||||
- **Cassandra schema** (`knowledge.triples` table): stores a 6-tuple per
|
||||
triple `(s_val, s_is_uri, p_val, p_is_uri, o_val, o_is_uri)` — no graph
|
||||
field.
|
||||
- **`add_triples()`** (`tables/knowledge.py:231`): destructures only `s`,
|
||||
`p`, `o` — `g` is discarded.
|
||||
- **`get_triples()`** (`tables/knowledge.py:396`): reconstructs `Triple`
|
||||
with `g` defaulting to `None`.
|
||||
- **Core file format**: triple dicts do not include a graph field.
|
||||
|
||||
#### Provenance Triples
|
||||
|
||||
Provenance triples are generated in the extraction pipeline
|
||||
(`trustgraph-base/trustgraph/provenance/triples.py`) and published to graph
|
||||
store topics. They use named graphs (`urn:graph:source`,
|
||||
`urn:graph:retrieval`) and PROV-O vocabulary.
|
||||
|
||||
The knowledge core store processor (`storage/knowledge/store.py`) listens on
|
||||
`triples-input` and `graph-embeddings-input`. Whether provenance triples
|
||||
arrive on the same `triples-input` topic or a separate one needs
|
||||
verification. Even if they do arrive, the graph name would be lost (per
|
||||
above).
|
||||
|
||||
#### Source Material
|
||||
|
||||
The librarian stores the full document hierarchy in a separate system:
|
||||
|
||||
- **Blob store** (S3/MinIO): original documents, text pages, chunks —
|
||||
keyed by object UUID under `doc/{object_id}`.
|
||||
- **Cassandra `library` keyspace**: document metadata including `id`,
|
||||
`kind` (MIME type), `title`, `parent_id`, `document_type`
|
||||
(`source`/`extracted`), `object_id` (blob reference).
|
||||
|
||||
Provenance triples link extracted facts back to chunk/page/document IDs.
|
||||
Those IDs resolve through the librarian. When a core is loaded on a
|
||||
different instance, the librarian has no matching documents, so the entire
|
||||
provenance chain is broken.
|
||||
|
||||
### Key Source Files
|
||||
|
||||
| Component | File | Purpose |
|-----------|------|---------|
| Core Cassandra schema | `trustgraph-flow/trustgraph/tables/knowledge.py` | Table definitions, read/write |
| Core manager | `trustgraph-flow/trustgraph/cores/knowledge.py` | API operations, load-to-store |
| Core store processor | `trustgraph-flow/trustgraph/storage/knowledge/store.py` | Extraction → Cassandra |
| CLI download | `trustgraph-cli/trustgraph/cli/get_kg_core.py` | Core → msgpack file |
| CLI upload | `trustgraph-cli/trustgraph/cli/put_kg_core.py` | Msgpack file → core |
| CLI load | `trustgraph-cli/trustgraph/cli/load_kg_core.py` | Core → graph/vector stores |
| API client | `trustgraph-base/trustgraph/api/knowledge.py` | Client-side knowledge API |
| Triple schema | `trustgraph-base/trustgraph/schema/core/primitives.py` | Triple dataclass with `g` field |
| Provenance generation | `trustgraph-base/trustgraph/provenance/triples.py` | PROV-O triple creation |
| Librarian | `trustgraph-flow/trustgraph/librarian/librarian.py` | Document storage service |
| Library tables | `trustgraph-flow/trustgraph/tables/library.py` | Document metadata in Cassandra |
| Blob store | `trustgraph-flow/trustgraph/librarian/blob_store.py` | S3/MinIO object storage |
|
||||
|
||||
## Technical Design
|
||||
|
||||
### Change 1: Named Graph Field in Core Storage
|
||||
|
||||
#### Cassandra Schema
|
||||
|
||||
Extend the `triples` tuple from 6 to 7 elements, adding the graph name:
|
||||
|
||||
```
triples list<tuple<
    text, boolean,  -- s_val, s_is_uri
    text, boolean,  -- p_val, p_is_uri
    text, boolean,  -- o_val, o_is_uri
    text            -- graph name (empty string = default graph)
>>
```
|
||||
|
||||
**Migration**: Apply the schema change with `ALTER TABLE`, or create a new
table version. In either case, existing rows with 6-element tuples must be
handled gracefully on read: if a tuple has only 6 elements, treat the graph
as the default graph.
|
||||
|
||||
#### Write Path (`add_triples`)
|
||||
|
||||
Change `tables/knowledge.py:add_triples()` to include `triple.g`:
|
||||
|
||||
```python
# Each stored tuple gains a 7th element: the graph name.
# An empty string stands for the default graph.
triples = [
    (
        *term_to_tuple(v.s), *term_to_tuple(v.p), *term_to_tuple(v.o),
        v.g or ""
    )
    for v in m.triples
]
```
|
||||
|
||||
#### Read Path (`get_triples`)
|
||||
|
||||
Change `tables/knowledge.py:get_triples()` to restore the graph name:
|
||||
|
||||
```python
# Rebuild the Triple; an absent or empty 7th element maps back to g=None.
Triple(
    s = tuple_to_term(elt[0], elt[1]),
    p = tuple_to_term(elt[2], elt[3]),
    o = tuple_to_term(elt[4], elt[5]),
    g = elt[6] if len(elt) > 6 and elt[6] else None,
)
```
|
||||
|
||||
The `len(elt) > 6` guard provides backward compatibility with existing
|
||||
6-element rows.
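
To make the guard concrete, here is a self-contained sketch of the read path applied to old and new row shapes. The `Term` and `Triple` classes and the `tuple_to_term` helper are simplified stand-ins written for this example. They are assumptions about shape only, not the definitions in `primitives.py` or `tables/knowledge.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Term:
    """Hypothetical stand-in for a triple term (value plus URI flag)."""
    value: str
    is_uri: bool


@dataclass
class Triple:
    """Simplified stand-in for the Triple dataclass with a graph field."""
    s: Term
    p: Term
    o: Term
    g: Optional[str] = None


def tuple_to_term(value, is_uri):
    """Hypothetical helper: rebuild a Term from its two stored columns."""
    return Term(value, is_uri)


def row_to_triple(elt):
    """Rebuild a Triple from a stored 6- or 7-element tuple."""
    return Triple(
        s=tuple_to_term(elt[0], elt[1]),
        p=tuple_to_term(elt[2], elt[3]),
        o=tuple_to_term(elt[4], elt[5]),
        # Old rows have no 7th element; "" marks the default graph.
        g=elt[6] if len(elt) > 6 and elt[6] else None,
    )


old_row = ("urn:ex:a", True, "urn:ex:p", True, "value", False)
named_row = old_row + ("urn:graph:source",)
default_row = old_row + ("",)

print(row_to_triple(old_row).g)      # 6-element row
print(row_to_triple(named_row).g)    # 7-element row with a graph name
print(row_to_triple(default_row).g)  # 7-element row, default graph
```

Both the legacy 6-element row and the 7-element row with an empty string yield `g=None`, so callers cannot tell them apart and do not need to.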
|
||||
|
||||
#### Core File Format
|
||||
|
||||
Extend triple dicts in the `"t"` record to include the graph name:
|
||||
|
||||
```python
# In get_kg_core.py write_triple — each triple dict gains "g" key
{"s": ..., "p": ..., "o": ..., "g": "urn:graph:source"}
```
|
||||
|
||||
On read (`put_kg_core.py`), treat missing `"g"` key as default graph for
|
||||
backward compatibility with old core files.
|
||||
|
||||
### Change 2: Provenance Triples in Cores
|
||||
|
||||
#### Investigation Required
|
||||
|
||||
Before implementation, verify:
|
||||
|
||||
1. Whether provenance triples arrive on the `triples-input` topic that the
|
||||
knowledge core store processor already listens on.
|
||||
2. If not, which topic they use, and whether the store processor should
|
||||
subscribe to it.
|
||||
|
||||
#### If provenance triples already arrive at the store
|
||||
|
||||
The only change needed is Change 1 (named graphs) — the provenance triples
|
||||
are already being stored, just without their graph name. Once graph names
|
||||
are preserved, provenance triples will round-trip correctly.
|
||||
|
||||
#### If provenance triples do NOT arrive at the store
|
||||
|
||||
Two options:
|
||||
|
||||
**Option A — Route provenance to the existing store topic**: Configure the
|
||||
flow so provenance triples are published to the same `triples-input` topic.
|
||||
This is the simpler approach and keeps the store processor unchanged.
|
||||
|
||||
**Option B — Add a subscription**: Add a new `ConsumerSpec` in the store
|
||||
processor for the provenance topic. This keeps provenance routing
|
||||
independent but adds complexity.
|
||||
|
||||
Recommendation: Option A, unless there is a reason provenance triples are
|
||||
intentionally kept off the core store topic.
|
||||
|
||||
### Change 3: Source Material in Cores
|
||||
|
||||
This is the largest change. The goal is that when a core is loaded on a
|
||||
fresh instance, provenance links to source material resolve.
|
||||
|
||||
#### Architecture
|
||||
|
||||
Source material is **not stored in the knowledge core tables**. It lives in
|
||||
the librarian (Cassandra `library` keyspace + S3/MinIO blob store) and is
|
||||
fetched on demand via the librarian's existing service API.
|
||||
|
||||
The knowledge manager acts as a **client of the librarian service** — it
|
||||
calls the librarian's request/response API over pub/sub to retrieve document
|
||||
metadata and content. It does not access the library's Cassandra tables or
|
||||
blob store directly.
|
||||
|
||||
#### Transport
|
||||
|
||||
The librarian's pub/sub API already handles chunking of large documents.
|
||||
This chunking is designed to be websocket-friendly, so library content
|
||||
flowing through the API gateway to external clients does not require
|
||||
re-chunking. The API gateway remains a transport layer.
|
||||
|
||||
```
Download:
  Knowledge manager ──pub/sub──► Librarian (fetch metadata + content)
  Knowledge manager ──pub/sub──► API gateway ──websocket──► Client

Upload:
  Client ──websocket──► API gateway ──pub/sub──► Knowledge manager
  Knowledge manager ──pub/sub──► Librarian (store metadata + content)
```
|
||||
|
||||
#### What to Include
|
||||
|
||||
The provenance chain links facts → chunks → pages → documents. For the
|
||||
chain to resolve, the core must include:
|
||||
|
||||
1. **Document metadata** — the library record for each document in the
|
||||
hierarchy (id, kind, title, parent_id, document_type, etc.)
|
||||
2. **Document content** — the blob data for each document (original file,
|
||||
extracted text pages, text chunks)
|
||||
|
||||
Including the full hierarchy is necessary because:
|
||||
- A user viewing provenance needs to traverse fact → chunk → page → document
|
||||
- The chunk text is needed to show what text a fact was extracted from
|
||||
- The page text provides broader context
|
||||
- The original document is needed for full source attribution
|
||||
|
||||
#### Size Implications
|
||||
|
||||
Source material will significantly increase core file sizes. A rough model:
|
||||
|
||||
| Component | Typical size per document |
|-----------|---------------------------|
| Triples + embeddings (current) | 1-10 MB |
| Chunk text (all chunks) | ~same as original document |
| Page text (all pages) | ~same as original document |
| Original document (PDF, etc.) | Varies widely (KB to hundreds of MB) |
|
||||
|
||||
For a 10 MB PDF, the core could grow from ~5 MB to ~25 MB (original +
|
||||
derived text + existing data). For large document sets, cores could become
|
||||
very large.
|
||||
|
||||
**Decision needed**: Whether to include original documents or just derived
|
||||
text (pages + chunks). Including only derived text still allows provenance
|
||||
display but loses the ability to serve the original file.
|
||||
|
||||
#### New Core File Record Types
|
||||
|
||||
Add new msgpack record types for library content:
|
||||
|
||||
| Type tag | Payload | Description |
|----------|---------|-------------|
| `"lm"` | `{"id", "kind", "title", "parent_id", "document_type", "comments", "tags", "metadata"}` | Library document metadata |
| `"lb"` | `{"id", "data"}` | Library document blob content (chunked by pub/sub layer) |
|
||||
|
||||
These are emitted after the existing `"t"` and `"ge"` records during
|
||||
download and processed during upload.
|
||||
|
||||
#### Download Path
|
||||
|
||||
Extend `KnowledgeManager.get_kg_core()` to:
|
||||
|
||||
1. Stream triples and graph embeddings from the core store (existing
   behavior).
2. Use the librarian service API to retrieve documents associated with
   this core ID:
   a. Fetch the root document metadata and content.
   b. Use `list-children` to discover child documents (pages, chunks).
   c. Recursively fetch metadata and content for each child.
3. Stream each document as `"lm"` (metadata) and `"lb"` (content) records.
|
||||
|
||||
The knowledge manager gains the librarian service as a pub/sub dependency.
|
||||
Large document content is chunked by the librarian's existing pub/sub
|
||||
transport — the knowledge manager receives and forwards these chunks without
|
||||
buffering the full blob in memory.
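
The traversal in the download path could look roughly like the sketch below. It is not the `KnowledgeManager` implementation: the in-memory `LIBRARY` dict and the `list_children` and `library_records` helpers are hypothetical stand-ins for the librarian's pub/sub API, and the sketch is synchronous and unchunked for brevity.

```python
from typing import Iterator

# Hypothetical in-memory stand-in for the librarian service: id -> record.
LIBRARY = {
    "root-doc": {"parent_id": None, "content": b"pdf bytes"},
    "page-1": {"parent_id": "root-doc", "content": b"page text"},
    "chunk-1": {"parent_id": "page-1", "content": b"chunk text"},
}


def list_children(doc_id: str) -> list[str]:
    """Stand-in for the librarian's `list-children` operation."""
    return [i for i, d in LIBRARY.items() if d["parent_id"] == doc_id]


def library_records(root_id: str) -> Iterator[tuple[str, dict]]:
    """Yield ("lm", ...) then ("lb", ...) for the root and every descendant.

    Each parent is yielded before its children, which is the order the
    upload side needs.
    """
    stack = [root_id]
    while stack:
        doc_id = stack.pop()
        doc = LIBRARY[doc_id]
        yield "lm", {"id": doc_id, "parent_id": doc["parent_id"]}
        yield "lb", {"id": doc_id, "data": doc["content"]}
        # Push children reversed so they are popped in listing order.
        stack.extend(reversed(list_children(doc_id)))


records = list(library_records("root-doc"))
```

An explicit stack rather than recursion avoids depth limits on deep hierarchies, and a generator lets the caller forward each record as soon as it is produced.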
|
||||
|
||||
#### Upload Path
|
||||
|
||||
Extend `KnowledgeManager.put_kg_core()` to handle the new record types:
|
||||
|
||||
1. For `"lm"` records: call the librarian service API to create/update
|
||||
the document metadata.
|
||||
2. For `"lb"` records: call the librarian service API to store the
|
||||
document content.
|
||||
|
||||
Parent-child relationships are preserved because `parent_id` is stored in
|
||||
the metadata. Documents should be processed in hierarchy order (parent
|
||||
before child) to satisfy any ordering constraints.
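
If a core file cannot be assumed to list documents in hierarchy order, the uploader could reorder metadata records before sending them. The sketch below assumes each `"lm"` payload is a dict with `id` and `parent_id`; `parents_first` is a hypothetical helper, not part of the existing code.

```python
def parents_first(records: list[dict]) -> list[dict]:
    """Order metadata records so each parent precedes its children.

    Records whose parent is absent from the batch are treated as roots.
    """
    by_id = {r["id"]: r for r in records}
    ordered: list[dict] = []
    seen: set[str] = set()

    def visit(rec: dict) -> None:
        if rec["id"] in seen:
            return
        seen.add(rec["id"])  # mark before recursing so a cycle terminates
        parent = by_id.get(rec.get("parent_id"))
        if parent is not None:
            visit(parent)
        ordered.append(rec)

    for rec in records:
        visit(rec)
    return ordered


shuffled = [
    {"id": "chunk-1", "parent_id": "page-1"},
    {"id": "page-1", "parent_id": "root"},
    {"id": "root", "parent_id": None},
]
ordered_ids = [r["id"] for r in parents_first(shuffled)]
```

This requires holding the metadata records in memory, which should be acceptable because only metadata is reordered and blob content is not.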
|
||||
|
||||
#### Load Path
|
||||
|
||||
The load path (`_load_kg_core`) publishes triples and embeddings to Pulsar
|
||||
topics for ingestion into graph/vector stores. Source material does not need
|
||||
to flow through the load path — it is already in the librarian after the
|
||||
upload step and can be accessed directly by services that need it.
|
||||
|
||||
No changes to the load path for source material.
|
||||
|
||||
#### CLI Changes
|
||||
|
||||
**`tg-get-kg-core`**: Add handling for `"lm"` and `"lb"` record types in
|
||||
the file writer.
|
||||
|
||||
**`tg-put-kg-core`**: Add handling for `"lm"` and `"lb"` record types in
|
||||
the file reader. Send library records to the knowledge manager alongside
|
||||
triple/embedding records.
|
||||
|
||||
#### Associating Documents with Cores
|
||||
|
||||
The core ID is `metadata.root`, which is the root document ID from the
|
||||
librarian. This provides a natural join: the core's root document and all
|
||||
its children (pages, chunks) are the source material for that core.
|
||||
|
||||
The librarian's `list-children` API provides the child documents. A
|
||||
recursive traversal from the root document collects the full hierarchy.
|
||||
|
||||
### API Changes
|
||||
|
||||
#### KnowledgeResponse Schema
|
||||
|
||||
Add optional fields to `KnowledgeResponse` for library data:
|
||||
|
||||
```python
@dataclass
class KnowledgeResponse:
    error: Error | None = None
    ids: list | None = None
    eos: bool = False
    triples: Triples | None = None
    graph_embeddings: GraphEmbeddings | None = None
    document_embeddings: DocumentEmbeddings | None = None
    library_metadata: LibraryMetadata | None = None  # new
    library_blob: LibraryBlob | None = None  # new
```
|
||||
|
||||
#### New Schema Types
|
||||
|
||||
```python
@dataclass
class LibraryMetadata:
    id: str
    kind: str | None = None
    title: str | None = None
    parent_id: str | None = None
    document_type: str | None = None
    comments: str | None = None
    tags: list[str] | None = None
    metadata: list[Triple] | None = None

@dataclass
class LibraryBlob:
    id: str
    data: bytes
```
|
||||
|
||||
#### Socket API
|
||||
|
||||
The existing streaming protocol for `get-kg-core` / `put-kg-core` carries
|
||||
these new fields naturally — responses already stream multiple record types.
|
||||
|
||||
### Dependencies Between Changes
|
||||
|
||||
```
Change 1 (named graphs) ◄── Change 2 depends on this
    │
    └── Change 2 (provenance triples)
            │
            └── Change 3 (source material) is independent
```
|
||||
|
||||
Change 1 is a prerequisite for Change 2 (provenance triples use named
|
||||
graphs). Change 3 is independent and can be implemented in parallel.
|
||||
|
||||
## Security Considerations
|
||||
|
||||
- **Workspace isolation**: Core download/upload must respect workspace
|
||||
boundaries. Source material from the librarian must only be included if
|
||||
it belongs to the same workspace as the core. This is already enforced
|
||||
by the existing workspace-scoped queries.
|
||||
- **Large blob transfer**: Streaming large documents through the API
|
||||
is handled by the librarian's existing pub/sub chunking, which is
|
||||
designed to be websocket-friendly. No additional chunking layer is
|
||||
needed.
|
||||
- **Cross-instance trust**: When uploading a core from an external source,
|
||||
the library content should be treated as untrusted input. Document
|
||||
metadata and blob content should be validated before insertion.
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
- **Core file size**: Including source material will significantly increase
|
||||
core file sizes. Consider adding a flag to download/upload commands to
|
||||
optionally exclude source material for use cases where only the knowledge
|
||||
graph is needed.
|
||||
- **Streaming**: All paths already use streaming (paged Cassandra queries,
|
||||
msgpack record-at-a-time). Library content should follow the same pattern.
|
||||
- **Cassandra schema migration**: Changing the tuple width in the `triples`
|
||||
table requires careful handling. Cassandra frozen tuples cannot be altered
|
||||
in place — a migration strategy is needed (see Migration Plan).
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
- **Unit tests**: Triple round-trip with graph name (write → read →
|
||||
verify `g` field preserved). Backward compatibility with 6-element tuples.
|
||||
- **Integration tests**: Full lifecycle — extract with provenance → download
|
||||
core → upload to fresh instance → load → verify provenance chain resolves.
|
||||
- **File format tests**: Read old-format core files (no graph name, no
|
||||
library records) and verify they load without error.
|
||||
- **Library inclusion tests**: Download core with source material → upload →
|
||||
verify documents accessible through librarian.
|
||||
|
||||
## Migration Plan
|
||||
|
||||
### Cassandra Schema
|
||||
|
||||
The `triples` table stores tuples in a `list<tuple<...>>` column. Cassandra
|
||||
does not support altering the type of an existing column. Options:
|
||||
|
||||
**Option A — New table**: Create a `triples_v2` table with the 7-element
|
||||
tuple. Migrate data from `triples` to `triples_v2`. The read path checks
|
||||
both tables during a transition period, then the old table is dropped.
|
||||
|
||||
**Option B — Dual read**: Keep the existing table. The read path handles
|
||||
both 6-element and 7-element tuples by checking length. New writes use
|
||||
7-element tuples. This works if Cassandra accepts variable-length tuples in
|
||||
a list — **needs verification**.
|
||||
|
||||
**Option C — Separate graph column**: Instead of extending the tuple, add a
|
||||
parallel `graphs list<text>` column where `graphs[i]` corresponds to
|
||||
`triples[i]`. This avoids tuple migration entirely but requires keeping the
|
||||
two lists in sync.
|
||||
|
||||
Recommendation: Verify Option B first (simplest). Fall back to Option A if
|
||||
Cassandra rejects mixed tuple lengths.
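
If Option B turns out to be viable, the length check on the read path might look like the sketch below. It assumes the graph name is appended as the seventh element and that an empty string marks the default graph; both are assumptions, and `with_graph` and `DEFAULT_GRAPH` are hypothetical names.

```python
# Hypothetical marker for rows written before graph names were stored.
DEFAULT_GRAPH = ""


def with_graph(row: tuple) -> tuple:
    """Normalise a stored triple row to the 7-element form.

    Legacy rows have 6 elements; the default graph is appended as the
    7th element (an assumed position).
    """
    if len(row) == 7:
        return row
    if len(row) == 6:
        return row + (DEFAULT_GRAPH,)
    raise ValueError(f"unexpected tuple width: {len(row)}")
```

Raising on any other width makes unexpected data fail loudly instead of being silently misread.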
|
||||
|
||||
### Core File Format
|
||||
|
||||
Backward compatible by design:
|
||||
- Old files lack `"g"` in triple dicts and have no `"lm"`/`"lb"` records →
|
||||
handled by defaults.
|
||||
- New files read by old code → the existing `read_message` raises on unknown
  record types, so old readers need a small fix to skip unknown types
  gracefully.
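
A tolerant reader could be sketched as below. The sketch works on already-decoded `(tag, payload)` pairs because the real `read_message` signature is not shown here; `KNOWN_TAGS` and `known_records` are hypothetical names.

```python
import logging

logger = logging.getLogger(__name__)

# Record tags named in this spec; older readers would know only "t" and "ge".
KNOWN_TAGS = {"t", "ge", "lm", "lb"}


def known_records(records):
    """Yield (tag, payload) pairs, skipping tags this reader does not know."""
    for tag, payload in records:
        if tag not in KNOWN_TAGS:
            # Warn rather than raise, so newer files still load.
            logger.warning("Skipping unknown core record type %r", tag)
            continue
        yield tag, payload


stream = [("t", {}), ("zz", {}), ("lm", {"id": "doc-1"})]
kept = [tag for tag, _ in known_records(stream)]
```

Logging the skipped tag keeps forward compatibility without hiding that some data was ignored.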
|
||||
|
||||
## Open Questions
|
||||
|
||||
1. **Provenance topic routing**: Do provenance triples currently arrive at
|
||||
the `triples-input` topic consumed by the knowledge core store? If not,
|
||||
what topic are they on?
|
||||
|
||||
2. **Include original documents?**: Should cores include the original
|
||||
uploaded document (e.g. PDF), or only derived text (pages + chunks)?
|
||||
Including originals makes cores fully self-contained but potentially
|
||||
very large. Excluding them preserves provenance text display but loses
|
||||
the ability to serve the original file.
|
||||
|
||||
3. **Optional source material**: Should there be a flag on download/upload
|
||||
to include or exclude source material? This would let users choose
|
||||
between compact cores (knowledge only) and complete cores (knowledge +
|
||||
sources).
|
||||
|
||||
4. **Cassandra tuple migration**: Can Cassandra handle mixed-length tuples
|
||||
in a `list<tuple<...>>` column, or is a table migration required?
|
||||
|
||||
5. **Document embedding cores**: DE cores are managed alongside KG cores.
|
||||
Do they need the same treatment (source material inclusion)? The
|
||||
document embeddings reference chunk IDs — the same provenance chain
|
||||
applies.
|
||||
|
||||
6. **Core versioning**: Should the core file include a version marker so
|
||||
readers can distinguish old-format from new-format files without
|
||||
trial-and-error parsing?
|
||||
|
||||
## References
|
||||
|
||||
- Extraction-time provenance: `docs/tech-specs/extraction-time-provenance.md`
|
||||
- Query-time explainability: `docs/tech-specs/query-time-explainability.md`
|
||||
- Agent explainability: `docs/tech-specs/agent-explainability.md`
|
||||
- Data ownership model: `docs/tech-specs/data-ownership-model.md`
|
||||
|
|
@ -410,56 +410,3 @@ class TestEdgeCases:
|
|||
assert hosts == ['mixed-host']
|
||||
assert username is None # Stays None
|
||||
assert password == 'mixed-pass'
|
||||
|
||||
|
||||
class TestReplicationFactorParamPath:
|
||||
|
||||
def test_explicit_kwarg(self):
|
||||
with patch.dict(os.environ, {}, clear=True):
|
||||
_, _, _, _, rf = resolve_cassandra_config(
|
||||
replication_factor=3,
|
||||
)
|
||||
assert rf == 3
|
||||
|
||||
def test_kwarg_overrides_env(self):
|
||||
with patch.dict(os.environ, {'CASSANDRA_REPLICATION_FACTOR': '5'}, clear=True):
|
||||
_, _, _, _, rf = resolve_cassandra_config(
|
||||
replication_factor=3,
|
||||
)
|
||||
assert rf == 3
|
||||
|
||||
def test_env_fallback_when_kwarg_none(self):
|
||||
with patch.dict(os.environ, {'CASSANDRA_REPLICATION_FACTOR': '5'}, clear=True):
|
||||
_, _, _, _, rf = resolve_cassandra_config(
|
||||
replication_factor=None,
|
||||
)
|
||||
assert rf == 5
|
||||
|
||||
def test_default_when_no_kwarg_no_env(self):
|
||||
with patch.dict(os.environ, {}, clear=True):
|
||||
_, _, _, _, rf = resolve_cassandra_config()
|
||||
assert rf == 1
|
||||
|
||||
def test_params_dict_path(self):
|
||||
with patch.dict(os.environ, {}, clear=True):
|
||||
params = {'cassandra_replication_factor': 3}
|
||||
_, _, _, _, rf = resolve_cassandra_config(
|
||||
replication_factor=params.get('cassandra_replication_factor'),
|
||||
)
|
||||
assert rf == 3
|
||||
|
||||
def test_params_dict_overrides_env(self):
|
||||
with patch.dict(os.environ, {'CASSANDRA_REPLICATION_FACTOR': '5'}, clear=True):
|
||||
params = {'cassandra_replication_factor': 3}
|
||||
_, _, _, _, rf = resolve_cassandra_config(
|
||||
replication_factor=params.get('cassandra_replication_factor'),
|
||||
)
|
||||
assert rf == 3
|
||||
|
||||
def test_params_dict_missing_falls_to_env(self):
|
||||
with patch.dict(os.environ, {'CASSANDRA_REPLICATION_FACTOR': '5'}, clear=True):
|
||||
params = {}
|
||||
_, _, _, _, rf = resolve_cassandra_config(
|
||||
replication_factor=params.get('cassandra_replication_factor'),
|
||||
)
|
||||
assert rf == 5
|
||||
|
|
@ -1,136 +0,0 @@
|
|||
|
||||
import os
|
||||
import pytest
|
||||
from unittest.mock import patch
|
||||
|
||||
from trustgraph.base.qdrant_config import (
|
||||
get_qdrant_defaults,
|
||||
resolve_qdrant_config,
|
||||
)
|
||||
|
||||
|
||||
class TestGetQdrantDefaults:
|
||||
|
||||
def test_defaults_with_no_env_vars(self):
|
||||
with patch.dict(os.environ, {}, clear=True):
|
||||
defaults = get_qdrant_defaults()
|
||||
assert defaults['url'] == 'http://localhost:6333'
|
||||
assert defaults['api_key'] is None
|
||||
assert defaults['replication_factor'] == 1
|
||||
assert defaults['shard_number'] == 1
|
||||
|
||||
def test_defaults_from_env(self):
|
||||
env = {
|
||||
'QDRANT_URL': 'http://qdrant:6333',
|
||||
'QDRANT_API_KEY': 'secret',
|
||||
'QDRANT_REPLICATION_FACTOR': '3',
|
||||
'QDRANT_SHARD_NUMBER': '5',
|
||||
}
|
||||
with patch.dict(os.environ, env, clear=True):
|
||||
defaults = get_qdrant_defaults()
|
||||
assert defaults['url'] == 'http://qdrant:6333'
|
||||
assert defaults['api_key'] == 'secret'
|
||||
assert defaults['replication_factor'] == 3
|
||||
assert defaults['shard_number'] == 5
|
||||
|
||||
|
||||
class TestResolveQdrantConfig:
|
||||
|
||||
def test_defaults(self):
|
||||
with patch.dict(os.environ, {}, clear=True):
|
||||
url, api_key, rf, sn = resolve_qdrant_config()
|
||||
assert url == 'http://localhost:6333'
|
||||
assert api_key is None
|
||||
assert rf == 1
|
||||
assert sn == 1
|
||||
|
||||
def test_explicit_kwargs(self):
|
||||
with patch.dict(os.environ, {}, clear=True):
|
||||
url, api_key, rf, sn = resolve_qdrant_config(
|
||||
url='http://custom:6333',
|
||||
api_key='key',
|
||||
replication_factor=3,
|
||||
shard_number=5,
|
||||
)
|
||||
assert url == 'http://custom:6333'
|
||||
assert api_key == 'key'
|
||||
assert rf == 3
|
||||
assert sn == 5
|
||||
|
||||
def test_kwargs_override_env(self):
|
||||
env = {
|
||||
'QDRANT_URL': 'http://env:6333',
|
||||
'QDRANT_REPLICATION_FACTOR': '10',
|
||||
'QDRANT_SHARD_NUMBER': '10',
|
||||
}
|
||||
with patch.dict(os.environ, env, clear=True):
|
||||
url, _, rf, sn = resolve_qdrant_config(
|
||||
url='http://explicit:6333',
|
||||
replication_factor=3,
|
||||
shard_number=5,
|
||||
)
|
||||
assert url == 'http://explicit:6333'
|
||||
assert rf == 3
|
||||
assert sn == 5
|
||||
|
||||
def test_env_fallback_when_kwargs_none(self):
|
||||
env = {
|
||||
'QDRANT_URL': 'http://env:6333',
|
||||
'QDRANT_REPLICATION_FACTOR': '3',
|
||||
'QDRANT_SHARD_NUMBER': '5',
|
||||
}
|
||||
with patch.dict(os.environ, env, clear=True):
|
||||
url, _, rf, sn = resolve_qdrant_config()
|
||||
assert url == 'http://env:6333'
|
||||
assert rf == 3
|
||||
assert sn == 5
|
||||
|
||||
def test_params_dict_path(self):
|
||||
with patch.dict(os.environ, {}, clear=True):
|
||||
params = {
|
||||
'store_uri': 'http://params:6333',
|
||||
'api_key': 'pkey',
|
||||
'qdrant_replication_factor': 3,
|
||||
'qdrant_shard_number': 5,
|
||||
}
|
||||
url, api_key, rf, sn = resolve_qdrant_config(
|
||||
url=params.get('store_uri'),
|
||||
api_key=params.get('api_key'),
|
||||
replication_factor=params.get('qdrant_replication_factor'),
|
||||
shard_number=params.get('qdrant_shard_number'),
|
||||
)
|
||||
assert url == 'http://params:6333'
|
||||
assert api_key == 'pkey'
|
||||
assert rf == 3
|
||||
assert sn == 5
|
||||
|
||||
def test_params_dict_overrides_env(self):
|
||||
env = {
|
||||
'QDRANT_REPLICATION_FACTOR': '10',
|
||||
'QDRANT_SHARD_NUMBER': '10',
|
||||
}
|
||||
with patch.dict(os.environ, env, clear=True):
|
||||
params = {
|
||||
'qdrant_replication_factor': 3,
|
||||
'qdrant_shard_number': 5,
|
||||
}
|
||||
_, _, rf, sn = resolve_qdrant_config(
|
||||
replication_factor=params.get('qdrant_replication_factor'),
|
||||
shard_number=params.get('qdrant_shard_number'),
|
||||
)
|
||||
assert rf == 3
|
||||
assert sn == 5
|
||||
|
||||
def test_params_dict_missing_falls_to_env(self):
|
||||
env = {
|
||||
'QDRANT_REPLICATION_FACTOR': '3',
|
||||
'QDRANT_SHARD_NUMBER': '5',
|
||||
}
|
||||
with patch.dict(os.environ, env, clear=True):
|
||||
params = {}
|
||||
_, _, rf, sn = resolve_qdrant_config(
|
||||
replication_factor=params.get('qdrant_replication_factor'),
|
||||
shard_number=params.get('qdrant_shard_number'),
|
||||
)
|
||||
assert rf == 3
|
||||
assert sn == 5
|
||||
|
|
@ -11,12 +11,7 @@ from unittest.mock import AsyncMock, Mock, patch, MagicMock
|
|||
from unittest.mock import call
|
||||
|
||||
from trustgraph.cores.knowledge import KnowledgeManager
|
||||
from trustgraph.schema import (
|
||||
KnowledgeResponse, Triples, GraphEmbeddings, Metadata, Triple, Term,
|
||||
EntityEmbeddings, IRI, LITERAL,
|
||||
LibraryMetadata, LibraryBlob,
|
||||
LibrarianResponse, DocumentMetadata,
|
||||
)
|
||||
from trustgraph.schema import KnowledgeResponse, Triples, GraphEmbeddings, Metadata, Triple, Term, EntityEmbeddings, IRI, LITERAL
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
|
@ -386,244 +381,3 @@ class TestKnowledgeManagerOtherMethods:
|
|||
mock_respond.assert_called_once()
|
||||
response = mock_respond.call_args[0][0]
|
||||
assert response.error is None
|
||||
|
||||
|
||||
class TestKnowledgeManagerLibraryDownload:
|
||||
"""Test get_kg_core streaming of library documents."""
|
||||
|
||||
@pytest.fixture
|
||||
def manager_with_librarian(self, mock_flow_config):
|
||||
with patch('trustgraph.cores.knowledge.KnowledgeTableStore'):
|
||||
mock_librarian = AsyncMock()
|
||||
manager = KnowledgeManager(
|
||||
cassandra_host=["localhost"],
|
||||
cassandra_username="test_user",
|
||||
cassandra_password="test_pass",
|
||||
keyspace="test_keyspace",
|
||||
flow_config=mock_flow_config,
|
||||
librarian=mock_librarian,
|
||||
)
|
||||
manager.table_store = AsyncMock()
|
||||
return manager
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_kg_core_streams_library_docs(self, manager_with_librarian):
|
||||
mock_request = Mock()
|
||||
mock_request.id = "root-doc"
|
||||
mock_respond = AsyncMock()
|
||||
|
||||
manager_with_librarian.table_store.get_triples = AsyncMock()
|
||||
manager_with_librarian.table_store.get_graph_embeddings = AsyncMock()
|
||||
|
||||
root_meta = DocumentMetadata(
|
||||
id="root-doc", kind="application/pdf", title="Test PDF",
|
||||
document_type="source",
|
||||
)
|
||||
child_meta = DocumentMetadata(
|
||||
id="chunk-1", kind="text/plain", title="Chunk 1",
|
||||
parent_id="root-doc", document_type="chunk",
|
||||
)
|
||||
|
||||
manager_with_librarian.librarian.fetch_document_metadata.return_value = root_meta
|
||||
manager_with_librarian.librarian.request.return_value = LibrarianResponse(
|
||||
document_metadatas=[child_meta],
|
||||
)
|
||||
manager_with_librarian.librarian.fetch_document_content.side_effect = [
|
||||
b"cm9vdCBjb250ZW50",
|
||||
b"Y2h1bmsgY29udGVudA==",
|
||||
]
|
||||
|
||||
await manager_with_librarian.get_kg_core(
|
||||
mock_request, mock_respond, "test-user"
|
||||
)
|
||||
|
||||
responses = [c[0][0] for c in mock_respond.call_args_list]
|
||||
|
||||
lm_responses = [r for r in responses if r.library_metadata is not None]
|
||||
lb_responses = [r for r in responses if r.library_blob is not None]
|
||||
eos_responses = [r for r in responses if r.eos is True]
|
||||
|
||||
assert len(lm_responses) == 2
|
||||
assert lm_responses[0].library_metadata.id == "root-doc"
|
||||
assert lm_responses[0].library_metadata.document_type == "source"
|
||||
assert lm_responses[1].library_metadata.id == "chunk-1"
|
||||
assert lm_responses[1].library_metadata.parent_id == "root-doc"
|
||||
|
||||
assert len(lb_responses) == 2
|
||||
assert lb_responses[0].library_blob.id == "root-doc"
|
||||
assert lb_responses[0].library_blob.data == b"cm9vdCBjb250ZW50"
|
||||
assert lb_responses[1].library_blob.id == "chunk-1"
|
||||
|
||||
assert len(eos_responses) == 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_kg_core_no_librarian_skips_library(self, mock_flow_config):
|
||||
with patch('trustgraph.cores.knowledge.KnowledgeTableStore'):
|
||||
manager = KnowledgeManager(
|
||||
cassandra_host=["localhost"],
|
||||
cassandra_username="u", cassandra_password="p",
|
||||
keyspace="ks", flow_config=mock_flow_config,
|
||||
)
|
||||
manager.table_store = AsyncMock()
|
||||
manager.table_store.get_triples = AsyncMock()
|
||||
manager.table_store.get_graph_embeddings = AsyncMock()
|
||||
|
||||
mock_request = Mock()
|
||||
mock_request.id = "doc-1"
|
||||
mock_respond = AsyncMock()
|
||||
|
||||
await manager.get_kg_core(mock_request, mock_respond, "w")
|
||||
|
||||
responses = [c[0][0] for c in mock_respond.call_args_list]
|
||||
assert all(r.library_metadata is None for r in responses)
|
||||
assert all(r.library_blob is None for r in responses)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_kg_core_librarian_metadata_failure_is_graceful(
|
||||
self, manager_with_librarian,
|
||||
):
|
||||
mock_request = Mock()
|
||||
mock_request.id = "missing-doc"
|
||||
mock_respond = AsyncMock()
|
||||
|
||||
manager_with_librarian.table_store.get_triples = AsyncMock()
|
||||
manager_with_librarian.table_store.get_graph_embeddings = AsyncMock()
|
||||
manager_with_librarian.librarian.fetch_document_metadata.side_effect = (
|
||||
RuntimeError("not found")
|
||||
)
|
||||
|
||||
await manager_with_librarian.get_kg_core(
|
||||
mock_request, mock_respond, "test-user"
|
||||
)
|
||||
|
||||
responses = [c[0][0] for c in mock_respond.call_args_list]
|
||||
assert all(r.library_metadata is None for r in responses)
|
||||
assert any(r.eos for r in responses)
|
||||
|
||||
|
||||
class TestKnowledgeManagerLibraryUpload:
|
||||
"""Test put_kg_core handling of library metadata and blob records."""
|
||||
|
||||
@pytest.fixture
|
||||
def manager_with_librarian(self, mock_flow_config):
|
||||
with patch('trustgraph.cores.knowledge.KnowledgeTableStore'):
|
||||
mock_librarian = AsyncMock()
|
||||
manager = KnowledgeManager(
|
||||
cassandra_host=["localhost"],
|
||||
cassandra_username="u", cassandra_password="p",
|
||||
keyspace="ks", flow_config=mock_flow_config,
|
||||
librarian=mock_librarian,
|
||||
)
|
||||
manager.table_store = AsyncMock()
|
||||
return manager
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_put_metadata_then_blob_calls_librarian(
|
||||
self, manager_with_librarian,
|
||||
):
|
||||
mock_respond = AsyncMock()
|
||||
manager_with_librarian.librarian.request.return_value = LibrarianResponse()
|
||||
|
||||
# First call: metadata
|
||||
req_meta = Mock()
|
||||
req_meta.triples = None
|
||||
req_meta.graph_embeddings = None
|
||||
req_meta.library_metadata = LibraryMetadata(
|
||||
id="doc-1", kind="application/pdf", title="Test",
|
||||
document_type="source",
|
||||
)
|
||||
req_meta.library_blob = None
|
||||
await manager_with_librarian.put_kg_core(req_meta, mock_respond, "ws")
|
||||
|
||||
# Metadata is buffered, librarian not called yet
|
||||
manager_with_librarian.librarian.request.assert_not_called()
|
||||
|
||||
# Second call: blob
|
||||
req_blob = Mock()
|
||||
req_blob.triples = None
|
||||
req_blob.graph_embeddings = None
|
||||
req_blob.library_metadata = None
|
||||
req_blob.library_blob = LibraryBlob(
|
||||
id="doc-1", data=b"dGVzdA==",
|
||||
)
|
||||
await manager_with_librarian.put_kg_core(req_blob, mock_respond, "ws")
|
||||
|
||||
# Now librarian should have been called with add-document
|
||||
manager_with_librarian.librarian.request.assert_called_once()
|
||||
call_args = manager_with_librarian.librarian.request.call_args[0][0]
|
||||
assert call_args.operation == "add-document"
|
||||
assert call_args.document_metadata.id == "doc-1"
|
||||
assert call_args.document_metadata.kind == "application/pdf"
|
||||
assert call_args.content == b"dGVzdA=="
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_put_child_document_uses_add_child_operation(
|
||||
self, manager_with_librarian,
|
||||
):
|
||||
mock_respond = AsyncMock()
|
||||
manager_with_librarian.librarian.request.return_value = LibrarianResponse()
|
||||
|
||||
req_meta = Mock()
|
||||
req_meta.triples = None
|
||||
req_meta.graph_embeddings = None
|
||||
req_meta.library_metadata = LibraryMetadata(
|
||||
id="chunk-1", kind="text/plain", title="Chunk",
|
||||
parent_id="doc-1", document_type="chunk",
|
||||
)
|
||||
req_meta.library_blob = None
|
||||
await manager_with_librarian.put_kg_core(req_meta, mock_respond, "ws")
|
||||
|
||||
req_blob = Mock()
|
||||
req_blob.triples = None
|
||||
req_blob.graph_embeddings = None
|
||||
req_blob.library_metadata = None
|
||||
req_blob.library_blob = LibraryBlob(id="chunk-1", data=b"Y2h1bms=")
|
||||
await manager_with_librarian.put_kg_core(req_blob, mock_respond, "ws")
|
||||
|
||||
call_args = manager_with_librarian.librarian.request.call_args[0][0]
|
||||
assert call_args.operation == "add-child-document"
|
||||
assert call_args.document_metadata.parent_id == "doc-1"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_put_blob_without_metadata_logs_warning(
|
||||
self, manager_with_librarian,
|
||||
):
|
||||
mock_respond = AsyncMock()
|
||||
|
||||
req_blob = Mock()
|
||||
req_blob.triples = None
|
||||
req_blob.graph_embeddings = None
|
||||
req_blob.library_metadata = None
|
||||
req_blob.library_blob = LibraryBlob(id="orphan", data=b"data")
|
||||
await manager_with_librarian.put_kg_core(req_blob, mock_respond, "ws")
|
||||
|
||||
# Librarian should not be called for orphan blob
|
||||
manager_with_librarian.librarian.request.assert_not_called()
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_put_existing_document_is_graceful(
|
||||
self, manager_with_librarian,
|
||||
):
|
||||
mock_respond = AsyncMock()
|
||||
manager_with_librarian.librarian.request.side_effect = RuntimeError(
|
||||
"Document already exists"
|
||||
)
|
||||
|
||||
req_meta = Mock()
|
||||
req_meta.triples = None
|
||||
req_meta.graph_embeddings = None
|
||||
req_meta.library_metadata = LibraryMetadata(
|
||||
id="doc-1", kind="application/pdf", title="Test",
|
||||
document_type="source",
|
||||
)
|
||||
req_meta.library_blob = None
|
||||
await manager_with_librarian.put_kg_core(req_meta, mock_respond, "ws")
|
||||
|
||||
req_blob = Mock()
|
||||
req_blob.triples = None
|
||||
req_blob.graph_embeddings = None
|
||||
req_blob.library_metadata = None
|
||||
req_blob.library_blob = LibraryBlob(id="doc-1", data=b"data")
|
||||
await manager_with_librarian.put_kg_core(req_blob, mock_respond, "ws")
|
||||
|
||||
# Should not raise — "already exists" is handled gracefully
|
||||
|
|
@ -49,7 +49,7 @@ class TestPdfDecoderProcessor(IsolatedAsyncioTestCase):
|
|||
async def test_on_message_success(self, mock_pdf_loader_class, mock_producer, mock_consumer):
|
||||
"""Test successful PDF processing"""
|
||||
# Mock PDF content
|
||||
pdf_content = b"%PDF-1.7\nfake pdf content"
|
||||
pdf_content = b"fake pdf content"
|
||||
pdf_base64 = base64.b64encode(pdf_content).decode('utf-8')
|
||||
|
||||
# Mock PyPDFLoader
|
||||
|
|
@ -88,55 +88,13 @@ class TestPdfDecoderProcessor(IsolatedAsyncioTestCase):
|
|||
# Verify triples were sent for each page (provenance)
|
||||
assert mock_triples_flow.send.call_count == 2
|
||||
|
||||
@patch('trustgraph.base.librarian_client.Consumer')
|
||||
@patch('trustgraph.base.librarian_client.Producer')
|
||||
@patch('trustgraph.decoding.pdf.pdf_decoder.PyPDFLoader')
|
||||
@patch('trustgraph.base.async_processor.AsyncProcessor', MockAsyncProcessor)
|
||||
async def test_on_message_rejects_librarian_content_that_is_not_pdf(self, mock_pdf_loader_class, mock_producer, mock_consumer):
|
||||
"""Test rejecting non-PDF content before invoking the PDF loader"""
|
||||
html_content = b"<html><body>Not found</body></html>"
|
||||
html_base64 = base64.b64encode(html_content)
|
||||
|
||||
mock_metadata = Metadata(id="test-doc")
|
||||
mock_document = Document(metadata=mock_metadata, document_id="doc-123")
|
||||
mock_msg = MagicMock()
|
||||
mock_msg.value.return_value = mock_document
|
||||
|
||||
mock_output_flow = AsyncMock()
|
||||
mock_triples_flow = AsyncMock()
|
||||
mock_flow = MagicMock(side_effect=lambda name: {
|
||||
"output": mock_output_flow,
|
||||
"triples": mock_triples_flow,
|
||||
}.get(name))
|
||||
mock_flow.librarian.fetch_document_metadata = AsyncMock(
|
||||
return_value=MagicMock(kind="application/pdf")
|
||||
)
|
||||
mock_flow.librarian.fetch_document_content = AsyncMock(
|
||||
return_value=html_base64
|
||||
)
|
||||
mock_flow.librarian.save_child_document = AsyncMock()
|
||||
|
||||
config = {
|
||||
'id': 'test-pdf-decoder',
|
||||
'taskgroup': AsyncMock()
|
||||
}
|
||||
|
||||
processor = Processor(**config)
|
||||
|
||||
await processor.on_message(mock_msg, None, mock_flow)
|
||||
|
||||
mock_pdf_loader_class.assert_not_called()
|
||||
mock_output_flow.send.assert_not_called()
|
||||
mock_triples_flow.send.assert_not_called()
|
||||
mock_flow.librarian.save_child_document.assert_not_called()
|
||||
|
||||
@patch('trustgraph.base.librarian_client.Consumer')
|
||||
@patch('trustgraph.base.librarian_client.Producer')
|
||||
@patch('trustgraph.decoding.pdf.pdf_decoder.PyPDFLoader')
|
||||
@patch('trustgraph.base.async_processor.AsyncProcessor', MockAsyncProcessor)
|
||||
async def test_on_message_empty_pdf(self, mock_pdf_loader_class, mock_producer, mock_consumer):
|
||||
"""Test handling of empty PDF"""
|
||||
pdf_content = b"%PDF-1.7\nfake pdf content"
|
||||
pdf_content = b"fake pdf content"
|
||||
pdf_base64 = base64.b64encode(pdf_content).decode('utf-8')
|
||||
|
||||
mock_loader = MagicMock()
|
||||
|
|
@ -168,7 +126,7 @@ class TestPdfDecoderProcessor(IsolatedAsyncioTestCase):
|
|||
@patch('trustgraph.base.async_processor.AsyncProcessor', MockAsyncProcessor)
|
||||
async def test_on_message_unicode_content(self, mock_pdf_loader_class, mock_producer, mock_consumer):
|
||||
"""Test handling of unicode content in PDF"""
|
||||
pdf_content = b"%PDF-1.7\nfake pdf content"
|
||||
pdf_content = b"fake pdf content"
|
||||
pdf_base64 = base64.b64encode(pdf_content).decode('utf-8')
|
||||
|
||||
mock_loader = MagicMock()
|
||||
|
|
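The removed test `test_on_message_rejects_librarian_content_that_is_not_pdf` expects the decoder to refuse HTML returned by the librarian before `PyPDFLoader` is invoked, and the remaining fixtures differ only in whether they begin with `%PDF-1.7`. The decoder's actual check is not part of this diff. A minimal sketch of such a guard, assuming a plain header test on the base64-decoded bytes (`looks_like_pdf` and `PDF_MAGIC` are hypothetical names), could look like this:

```python
import base64

# PDF files begin with this header marker, e.g. b"%PDF-1.7".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content_b64: bytes) -> bool:
    """Return True if base64-encoded content decodes to bytes with a PDF header.

    Hypothetical helper for illustration only; not taken from pdf_decoder.py.
    """
    raw = base64.b64decode(content_b64)
    return raw.startswith(PDF_MAGIC)


# The same payloads the tests in this diff use as fixtures.
html = base64.b64encode(b"<html><body>Not found</body></html>")
pdf = base64.b64encode(b"%PDF-1.7\nfake pdf content")

print(looks_like_pdf(html), looks_like_pdf(pdf))
```

A check of this kind would also reject the `b"fake pdf content"` fixture, which may be why the two fixture variants appear side by side in the hunks above.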
|
|||
|
|
@ -18,7 +18,7 @@ from trustgraph.embeddings.hf.hf import Processor
|
|||
class TestHuggingFaceDynamicModelLoading(IsolatedAsyncioTestCase):
|
||||
"""Test HuggingFace dynamic model loading and caching"""
|
||||
|
||||
@patch('langchain_huggingface.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.embeddings.hf.hf.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.base.async_processor.AsyncProcessor.__init__')
|
||||
@patch('trustgraph.base.embeddings_service.EmbeddingsService.__init__')
|
||||
async def test_default_model_loaded_on_init(self, mock_embeddings_init, mock_async_init, mock_hf_class):
|
||||
|
|
@ -39,7 +39,7 @@ class TestHuggingFaceDynamicModelLoading(IsolatedAsyncioTestCase):
|
|||
assert processor.cached_model_name == "test-model"
|
||||
assert processor.embeddings is not None
|
||||
|
||||
@patch('langchain_huggingface.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.embeddings.hf.hf.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.base.async_processor.AsyncProcessor.__init__')
|
||||
@patch('trustgraph.base.embeddings_service.EmbeddingsService.__init__')
|
||||
async def test_model_caching_avoids_reload(self, mock_embeddings_init, mock_async_init, mock_hf_class):
|
||||
|
|
@ -63,7 +63,7 @@ class TestHuggingFaceDynamicModelLoading(IsolatedAsyncioTestCase):
|
|||
mock_hf_class.assert_not_called()
|
||||
assert processor.cached_model_name == "test-model"
|
||||
|
||||
@patch('langchain_huggingface.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.embeddings.hf.hf.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.base.async_processor.AsyncProcessor.__init__')
|
||||
@patch('trustgraph.base.embeddings_service.EmbeddingsService.__init__')
|
||||
async def test_model_reload_on_name_change(self, mock_embeddings_init, mock_async_init, mock_hf_class):
|
||||
|
|
@ -84,7 +84,7 @@ class TestHuggingFaceDynamicModelLoading(IsolatedAsyncioTestCase):
|
|||
mock_hf_class.assert_called_once_with(model_name="different-model")
|
||||
assert processor.cached_model_name == "different-model"
|
||||
|
||||
@patch('langchain_huggingface.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.embeddings.hf.hf.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.base.async_processor.AsyncProcessor.__init__')
|
||||
@patch('trustgraph.base.embeddings_service.EmbeddingsService.__init__')
|
||||
async def test_on_embeddings_uses_default_model(self, mock_embeddings_init, mock_async_init, mock_hf_class):
|
||||
|
|
@ -107,7 +107,7 @@ class TestHuggingFaceDynamicModelLoading(IsolatedAsyncioTestCase):
|
|||
assert processor.cached_model_name == "test-model" # Still using default
|
||||
assert result == [[0.1, 0.2, 0.3, 0.4, 0.5]]
|
||||
|
||||
@patch('langchain_huggingface.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.embeddings.hf.hf.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.base.async_processor.AsyncProcessor.__init__')
|
||||
@patch('trustgraph.base.embeddings_service.EmbeddingsService.__init__')
|
||||
async def test_on_embeddings_uses_specified_model(self, mock_embeddings_init, mock_async_init, mock_hf_class):
|
||||
|
|
@ -130,7 +130,7 @@ class TestHuggingFaceDynamicModelLoading(IsolatedAsyncioTestCase):
|
|||
assert processor.cached_model_name == "custom-model"
|
||||
mock_hf_instance.embed_documents.assert_called_once_with(["test text"])
|
||||
|
||||
@patch('langchain_huggingface.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.embeddings.hf.hf.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.base.async_processor.AsyncProcessor.__init__')
|
||||
@patch('trustgraph.base.embeddings_service.EmbeddingsService.__init__')
|
||||
async def test_multiple_model_switches(self, mock_embeddings_init, mock_async_init, mock_hf_class):
|
||||
|
|
@ -164,7 +164,7 @@ class TestHuggingFaceDynamicModelLoading(IsolatedAsyncioTestCase):
|
|||
assert call_count_after_b == initial_call_count + 2 # Reload for model-b
|
||||
assert call_count_after_a_again == initial_call_count + 3 # Reload back to model-a
|
||||
|
||||
@patch('langchain_huggingface.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.embeddings.hf.hf.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.base.async_processor.AsyncProcessor.__init__')
|
||||
@patch('trustgraph.base.embeddings_service.EmbeddingsService.__init__')
|
||||
async def test_none_model_uses_default(self, mock_embeddings_init, mock_async_init, mock_hf_class):
|
||||
|
|
@ -187,7 +187,7 @@ class TestHuggingFaceDynamicModelLoading(IsolatedAsyncioTestCase):
|
|||
assert mock_hf_class.call_count == initial_count
|
||||
assert processor.cached_model_name == "test-model"
|
||||
|
||||
@patch('langchain_huggingface.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.embeddings.hf.hf.HuggingFaceEmbeddings')
|
||||
@patch('trustgraph.base.async_processor.AsyncProcessor.__init__')
|
||||
@patch('trustgraph.base.embeddings_service.EmbeddingsService.__init__')
|
||||
async def test_initialization_without_model_uses_default(self, mock_embeddings_init, mock_async_init, mock_hf_class):
|
||||
|
|
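The test names in this file (`test_model_caching_avoids_reload`, `test_model_reload_on_name_change`, `test_none_model_uses_default`) describe a cache keyed on the model name. The processor itself is not shown here, so the following is only a sketch of that behaviour under stated assumptions: `CachedEmbedder` and the `loader` callback are hypothetical stand-ins, and only the attribute names `cached_model_name` and `embeddings` come from the assertions above.

```python
from typing import Callable, List, Optional

# A "model" here is just a callable that turns texts into vectors.
Model = Callable[[List[str]], List[List[float]]]


class CachedEmbedder:
    """Reload the model only when the requested name changes (illustrative)."""

    def __init__(self, default_model: str, loader: Callable[[str], Model]):
        self.default_model = default_model
        self.loader = loader
        self.cached_model_name: Optional[str] = None
        self.embeddings: Optional[Model] = None
        self._load(default_model)  # default model is loaded on init

    def _load(self, name: str) -> None:
        # Skip the (expensive) load when the cached model already matches.
        if name != self.cached_model_name:
            self.embeddings = self.loader(name)
            self.cached_model_name = name

    def embed(self, texts: List[str], model: Optional[str] = None) -> List[List[float]]:
        # A missing model name falls back to the default.
        self._load(model or self.default_model)
        return self.embeddings(texts)


loads: List[str] = []


def fake_loader(name: str) -> Model:
    """Record each load and return a toy model (one number per text)."""
    loads.append(name)
    return lambda texts: [[float(len(t))] for t in texts]


embedder = CachedEmbedder("test-model", fake_loader)
embedder.embed(["test text"])                      # cached, no reload
embedder.embed(["test text"], model=None)          # None uses the default
embedder.embed(["test text"], model="model-b")     # reload
embedder.embed(["test text"], model="test-model")  # reload back
print(loads)
```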
|
|||
|
|
@ -333,8 +333,8 @@ class TestUnifiedTableQueries:
|
|||
"""Test queries against the unified rows table"""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@patch('trustgraph.query.rows.cassandra.service.async_execute_paged', new_callable=AsyncMock)
|
||||
async def test_query_with_index_match(self, mock_async_execute_paged):
|
||||
@patch('trustgraph.query.rows.cassandra.service.async_execute', new_callable=AsyncMock)
|
||||
async def test_query_with_index_match(self, mock_async_execute):
|
||||
"""Test query execution with matching index"""
|
||||
processor = MagicMock()
|
||||
processor.session = MagicMock()
|
||||
|
|
@ -344,10 +344,10 @@ class TestUnifiedTableQueries:
|
|||
processor.find_matching_index = Processor.find_matching_index.__get__(processor, Processor)
|
||||
processor.query_cassandra = Processor.query_cassandra.__get__(processor, Processor)
|
||||
|
||||
# Mock async_execute_paged to return test data (list of pages)
|
||||
# Mock async_execute to return test data
|
||||
mock_row = MagicMock()
|
||||
mock_row.data = {"id": "123", "name": "Test Product", "category": "electronics"}
|
||||
mock_async_execute_paged.return_value = [[mock_row]]
|
||||
mock_async_execute.return_value = [mock_row]
|
||||
|
||||
schema = RowSchema(
|
||||
name="products",
|
||||
|
|
@ -370,10 +370,10 @@ class TestUnifiedTableQueries:
|
|||
|
||||
# Verify Cassandra was connected and queried
|
||||
processor.connect_cassandra.assert_called_once()
|
||||
mock_async_execute_paged.assert_called_once()
|
||||
mock_async_execute.assert_called_once()
|
||||
|
||||
# Verify query structure - should query unified rows table
|
||||
call_args = mock_async_execute_paged.call_args
|
||||
call_args = mock_async_execute.call_args
|
||||
query = call_args[0][1]
|
||||
params = call_args[0][2]
|
||||
|
||||
|
|
@ -394,8 +394,8 @@ class TestUnifiedTableQueries:
|
|||
assert results[0]["category"] == "electronics"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@patch('trustgraph.query.rows.cassandra.service.async_scan', new_callable=AsyncMock)
|
||||
async def test_query_without_index_match(self, mock_async_scan):
|
||||
@patch('trustgraph.query.rows.cassandra.service.async_execute', new_callable=AsyncMock)
|
||||
async def test_query_without_index_match(self, mock_async_execute):
|
||||
"""Test query execution without matching index (scan mode)"""
|
||||
processor = MagicMock()
|
||||
processor.session = MagicMock()
|
||||
|
|
@ -406,10 +406,12 @@ class TestUnifiedTableQueries:
|
|||
processor._matches_filters = Processor._matches_filters.__get__(processor, Processor)
|
||||
processor.query_cassandra = Processor.query_cassandra.__get__(processor, Processor)
|
||||
|
||||
# Mock async_scan to return filtered test data
|
||||
# Mock async_execute to return test data
|
||||
mock_row1 = MagicMock()
|
||||
mock_row1.data = {"id": "1", "name": "Product A", "price": "100"}
|
||||
mock_async_scan.return_value = [mock_row1]
|
||||
mock_row2 = MagicMock()
|
||||
mock_row2.data = {"id": "2", "name": "Product B", "price": "200"}
|
||||
mock_async_execute.return_value = [mock_row1, mock_row2]
|
||||
|
||||
schema = RowSchema(
|
||||
name="products",
|
||||
|
|
@ -430,16 +432,13 @@ class TestUnifiedTableQueries:
|
|||
limit=10
|
||||
)
|
||||
|
||||
# Verify async_scan was called
|
||||
mock_async_scan.assert_called_once()
|
||||
|
||||
# Verify query structure
|
||||
call_args = mock_async_scan.call_args
|
||||
# Query should use ALLOW FILTERING for scan
|
||||
call_args = mock_async_execute.call_args
|
||||
query = call_args[0][1]
|
||||
|
||||
assert "ALLOW FILTERING" in query
|
||||
|
||||
# Should return filtered results
|
||||
# Should post-filter results
|
||||
assert len(results) == 1
|
||||
assert results[0]["name"] == "Product A"
|
||||
|
||||
|
|
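The mocks in this file differ in shape: `async_execute_paged` is stubbed with a list of pages (`[[mock_row]]`), while `async_execute` is stubbed with a flat list of rows (`[mock_row]`), and the scan-mode test expects two fetched rows to be post-filtered down to one. The service code is not shown in this diff, so the sketch below only illustrates those two ideas with plain dictionaries; `flatten_pages` and `post_filter` are hypothetical names and equality matching is an assumption about how filters apply.

```python
from itertools import chain
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def flatten_pages(pages: Iterable[Iterable[Row]]) -> List[Row]:
    """Turn a paged result (list of pages) into one flat list of rows."""
    return list(chain.from_iterable(pages))


def post_filter(rows: Iterable[Row], filters: Dict[str, Any], limit: int) -> List[Row]:
    """Keep rows whose fields equal every filter value, up to `limit` rows."""
    kept: List[Row] = []
    for row in rows:
        if all(row.get(field) == value for field, value in filters.items()):
            kept.append(row)
            if len(kept) >= limit:
                break
    return kept


pages = [
    [{"id": "1", "name": "Product A", "price": "100"}],
    [{"id": "2", "name": "Product B", "price": "200"}],
]
rows = flatten_pages(pages)
results = post_filter(rows, {"price": "100"}, limit=10)
print(results)
```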
|
|||
|
|
@@ -259,8 +259,6 @@ class TestGraphEmbeddingsNullProtection:
 proc.collection_exists = MagicMock(return_value=True)
 proc._cache_lock = asyncio.Lock()
 proc._known_collections = set()
-proc.replication_factor = 1
-proc.shard_number = 1
 
 msg = MagicMock()
 msg.metadata.collection = "graphs"

|
|
|
|||
|
|
@ -35,9 +35,9 @@ def _make_store():
|
|||
class TestGetGraphEmbeddings:
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@patch('trustgraph.tables.knowledge.async_execute_paged', new_callable=AsyncMock)
|
||||
@patch('trustgraph.tables.knowledge.async_execute', new_callable=AsyncMock)
|
||||
async def test_row_converts_to_entity_embeddings_with_singular_vector(
|
||||
self, mock_async_execute_paged
|
||||
self, mock_async_execute
|
||||
):
|
||||
"""
|
||||
Cassandra rows return entities as a list of [entity_tuple, vector]
|
||||
|
|
@ -57,7 +57,7 @@ class TestGetGraphEmbeddings:
|
|||
store = _make_store()
|
||||
store.cassandra = Mock()
|
||||
store.get_graph_embeddings_stmt = Mock()
|
||||
mock_async_execute_paged.return_value = [[fake_row]]
|
||||
mock_async_execute.return_value = [fake_row]
|
||||
|
||||
received = []
|
||||
|
||||
|
|
@ -66,7 +66,7 @@ class TestGetGraphEmbeddings:
|
|||
|
||||
await store.get_graph_embeddings("alice", "doc-1", receiver)
|
||||
|
||||
mock_async_execute_paged.assert_called_once_with(
|
||||
mock_async_execute.assert_called_once_with(
|
||||
store.cassandra,
|
||||
store.get_graph_embeddings_stmt,
|
||||
("alice", "doc-1"),
|
||||
|
|
@ -96,8 +96,8 @@ class TestGetGraphEmbeddings:
|
|||
assert ge.entities[2].entity.value == "a literal entity"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@patch('trustgraph.tables.knowledge.async_execute_paged', new_callable=AsyncMock)
|
||||
async def test_empty_entities_blob_yields_empty_list(self, mock_async_execute_paged):
|
||||
@patch('trustgraph.tables.knowledge.async_execute', new_callable=AsyncMock)
|
||||
async def test_empty_entities_blob_yields_empty_list(self, mock_async_execute):
|
||||
"""row[3] being None / empty must produce a GraphEmbeddings with
|
||||
no entities, not raise."""
|
||||
fake_row = (None, None, None, None)
|
||||
|
|
@ -105,7 +105,7 @@ class TestGetGraphEmbeddings:
|
|||
store = _make_store()
|
||||
store.cassandra = Mock()
|
||||
store.get_graph_embeddings_stmt = Mock()
|
||||
mock_async_execute_paged.return_value = [[fake_row]]
|
||||
mock_async_execute.return_value = [fake_row]
|
||||
|
||||
received = []
|
||||
|
||||
|
|
@ -118,8 +118,8 @@ class TestGetGraphEmbeddings:
|
|||
assert received[0].entities == []
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@patch('trustgraph.tables.knowledge.async_execute_paged', new_callable=AsyncMock)
|
||||
async def test_multiple_rows_each_emit_one_message(self, mock_async_execute_paged):
|
||||
@patch('trustgraph.tables.knowledge.async_execute', new_callable=AsyncMock)
|
||||
async def test_multiple_rows_each_emit_one_message(self, mock_async_execute):
|
||||
fake_rows = [
|
||||
(None, None, None, [
|
||||
(("http://example.org/a", True), [1.0]),
|
||||
|
|
@ -132,7 +132,7 @@ class TestGetGraphEmbeddings:
|
|||
store = _make_store()
|
||||
store.cassandra = Mock()
|
||||
store.get_graph_embeddings_stmt = Mock()
|
||||
mock_async_execute_paged.return_value = [fake_rows]
|
||||
mock_async_execute.return_value = fake_rows
|
||||
|
||||
received = []
|
||||
|
||||
|
|
@ -153,9 +153,9 @@ class TestGetTriples:
|
|||
the same Metadata construction. Cover it for parity."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@patch('trustgraph.tables.knowledge.async_execute_paged', new_callable=AsyncMock)
|
||||
async def test_row_converts_to_triples(self, mock_async_execute_paged):
|
||||
# row[3] is a list of (s_val, s_uri, p_val, p_uri, o_val, o_uri, graph)
|
||||
@patch('trustgraph.tables.knowledge.async_execute', new_callable=AsyncMock)
|
||||
async def test_row_converts_to_triples(self, mock_async_execute):
|
||||
# row[3] is a list of (s_val, s_uri, p_val, p_uri, o_val, o_uri)
|
||||
fake_row = (
|
||||
None, None, None,
|
||||
[
|
||||
|
|
@ -163,7 +163,6 @@ class TestGetTriples:
|
|||
"http://example.org/alice", True,
|
||||
"http://example.org/knows", True,
|
||||
"http://example.org/bob", True,
|
||||
"urn:graph:source",
|
||||
),
|
||||
],
|
||||
)
|
||||
|
|
@ -171,7 +170,7 @@ class TestGetTriples:
|
|||
store = _make_store()
|
||||
store.cassandra = Mock()
|
||||
store.get_triples_stmt = Mock()
|
||||
mock_async_execute_paged.return_value = [[fake_row]]
|
||||
mock_async_execute.return_value = [fake_row]
|
||||
|
||||
received = []
|
||||
|
||||
|
|
@ -192,33 +191,3 @@ class TestGetTriples:
|
|||
assert t.s.iri == "http://example.org/alice"
|
||||
assert t.p.iri == "http://example.org/knows"
|
||||
assert t.o.iri == "http://example.org/bob"
|
||||
assert t.g == "urn:graph:source"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@patch('trustgraph.tables.knowledge.async_execute_paged', new_callable=AsyncMock)
|
||||
async def test_empty_graph_name_becomes_none(self, mock_async_execute_paged):
|
||||
fake_row = (
|
||||
None, None, None,
|
||||
[
|
||||
(
|
||||
"http://example.org/alice", True,
|
||||
"http://example.org/knows", True,
|
||||
"http://example.org/bob", True,
|
||||
"",
|
||||
),
|
||||
],
|
||||
)
|
||||
|
||||
store = _make_store()
|
||||
store.cassandra = Mock()
|
||||
store.get_triples_stmt = Mock()
|
||||
mock_async_execute_paged.return_value = [[fake_row]]
|
||||
|
||||
received = []
|
||||
|
||||
async def receiver(msg):
|
||||
received.append(msg)
|
||||
|
||||
await store.get_triples("w", "d", receiver)
|
||||
|
||||
assert received[0].triples[0].g is None
|
||||
|
|
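The two comments on `row[3]` above differ by one field: one side stores six values per triple, the other adds a trailing graph name, and the removed `test_empty_graph_name_becomes_none` expects an empty string in that slot to surface as `None`. As a sketch of that normalisation only (the real `get_triples` code is not in this diff, and `SimpleTriple` and `row_item_to_triple` are hypothetical names):

```python
from typing import NamedTuple, Optional, Tuple


class SimpleTriple(NamedTuple):
    """Illustrative stand-in for the schema's triple type."""
    s: Tuple[str, bool]  # (value, is_uri)
    p: Tuple[str, bool]
    o: Tuple[str, bool]
    g: Optional[str]     # named graph, or None when absent


def row_item_to_triple(item: tuple) -> SimpleTriple:
    """Convert a 7-field stored tuple, mapping an empty graph name to None."""
    s_val, s_uri, p_val, p_uri, o_val, o_uri, graph = item
    return SimpleTriple((s_val, s_uri), (p_val, p_uri), (o_val, o_uri), graph or None)


with_graph = row_item_to_triple((
    "http://example.org/alice", True,
    "http://example.org/knows", True,
    "http://example.org/bob", True,
    "urn:graph:source",
))
without_graph = row_item_to_triple((
    "http://example.org/alice", True,
    "http://example.org/knows", True,
    "http://example.org/bob", True,
    "",
))
print(with_graph.g, without_graph.g)
```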
|
|||
|
|
@@ -1,6 +1,5 @@
 """
-Round-trip unit tests for KnowledgeRequestTranslator and
-KnowledgeResponseTranslator.
+Round-trip unit tests for KnowledgeRequestTranslator.
 
 Regression coverage: a previous version of the decode side constructed
 EntityEmbeddings(vectors=...) — the schema field is `vector` (singular),
|
|
@ -16,13 +15,9 @@ Triples breaks the test.
|
|||
|
||||
import pytest
|
||||
|
||||
from trustgraph.messaging.translators.knowledge import (
|
||||
KnowledgeRequestTranslator,
|
||||
KnowledgeResponseTranslator,
|
||||
)
|
||||
from trustgraph.messaging.translators.knowledge import KnowledgeRequestTranslator
|
||||
from trustgraph.schema import (
|
||||
KnowledgeRequest,
|
||||
KnowledgeResponse,
|
||||
GraphEmbeddings,
|
||||
EntityEmbeddings,
|
||||
Triples,
|
||||
|
|
@ -30,8 +25,6 @@ from trustgraph.schema import (
|
|||
Metadata,
|
||||
Term,
|
||||
IRI,
|
||||
LibraryMetadata,
|
||||
LibraryBlob,
|
||||
)
|
||||
|
||||
|
||||
|
|
@ -152,161 +145,3 @@ class TestKnowledgeRequestTranslatorTriples:
|
|||
assert t.s.iri == "http://example.org/alice"
|
||||
assert t.p.iri == "http://example.org/knows"
|
||||
assert t.o.iri == "http://example.org/bob"
|
||||
|
||||
|
||||
class TestKnowledgeRequestTranslatorLibrary:
|
||||
|
||||
def test_roundtrip_preserves_library_metadata(self, translator):
|
||||
request = KnowledgeRequest(
|
||||
operation="put-kg-core",
|
||||
id="doc-1",
|
||||
library_metadata=LibraryMetadata(
|
||||
id="doc-1",
|
||||
kind="application/pdf",
|
||||
title="Test Document",
|
||||
parent_id="",
|
||||
document_type="source",
|
||||
comments="test comments",
|
||||
tags=["tag1", "tag2"],
|
||||
),
|
||||
)
|
||||
|
||||
encoded = translator.encode(request)
|
||||
assert "library-metadata" in encoded
|
||||
lm = encoded["library-metadata"]
|
||||
assert lm["id"] == "doc-1"
|
||||
assert lm["kind"] == "application/pdf"
|
||||
assert lm["title"] == "Test Document"
|
||||
assert lm["parent-id"] == ""
|
||||
assert lm["document-type"] == "source"
|
||||
assert lm["comments"] == "test comments"
|
||||
assert lm["tags"] == ["tag1", "tag2"]
|
||||
|
||||
decoded = translator.decode(encoded)
|
||||
assert decoded.library_metadata is not None
|
||||
assert decoded.library_metadata.id == "doc-1"
|
||||
assert decoded.library_metadata.kind == "application/pdf"
|
||||
assert decoded.library_metadata.title == "Test Document"
|
||||
assert decoded.library_metadata.parent_id == ""
|
||||
assert decoded.library_metadata.document_type == "source"
|
||||
assert decoded.library_metadata.comments == "test comments"
|
||||
assert decoded.library_metadata.tags == ["tag1", "tag2"]
|
||||
|
||||
def test_roundtrip_preserves_child_document_metadata(self, translator):
|
||||
request = KnowledgeRequest(
|
||||
operation="put-kg-core",
|
||||
id="doc-1",
|
||||
library_metadata=LibraryMetadata(
|
||||
id="chunk-1",
|
||||
kind="text/plain",
|
||||
title="Chunk 1",
|
||||
parent_id="doc-1",
|
||||
document_type="chunk",
|
||||
),
|
||||
)
|
||||
|
||||
encoded = translator.encode(request)
|
||||
decoded = translator.decode(encoded)
|
||||
|
||||
assert decoded.library_metadata.parent_id == "doc-1"
|
||||
assert decoded.library_metadata.document_type == "chunk"
|
||||
|
||||
def test_roundtrip_preserves_library_blob(self, translator):
|
||||
request = KnowledgeRequest(
|
||||
operation="put-kg-core",
|
||||
id="doc-1",
|
||||
library_blob=LibraryBlob(
|
||||
id="doc-1",
|
||||
data=b"SGVsbG8gV29ybGQ=",
|
||||
),
|
||||
)
|
||||
|
||||
encoded = translator.encode(request)
|
||||
assert "library-blob" in encoded
|
||||
assert encoded["library-blob"]["id"] == "doc-1"
|
||||
assert encoded["library-blob"]["data"] == "SGVsbG8gV29ybGQ="
|
||||
|
||||
decoded = translator.decode(encoded)
|
||||
assert decoded.library_blob is not None
|
||||
assert decoded.library_blob.id == "doc-1"
|
||||
assert decoded.library_blob.data == "SGVsbG8gV29ybGQ="
|
||||
|
||||
def test_absent_library_fields_decode_as_none(self, translator):
|
||||
decoded = translator.decode({
|
||||
"operation": "get-kg-core",
|
||||
"id": "doc-1",
|
||||
})
|
||||
assert decoded.library_metadata is None
|
||||
assert decoded.library_blob is None
|
||||
|
||||
|
||||
class TestKnowledgeResponseTranslatorLibrary:
|
||||
|
||||
@pytest.fixture
|
||||
def response_translator(self):
|
||||
return KnowledgeResponseTranslator()
|
||||
|
||||
def test_encode_library_metadata(self, response_translator):
|
||||
response = KnowledgeResponse(
|
||||
ids=None,
|
||||
library_metadata=LibraryMetadata(
|
||||
id="doc-1",
|
||||
kind="application/pdf",
|
||||
title="Test",
|
||||
parent_id="",
|
||||
document_type="source",
|
||||
comments="",
|
||||
tags=[],
|
||||
),
|
||||
)
|
||||
encoded = response_translator.encode(response)
|
||||
assert "library-metadata" in encoded
|
||||
assert encoded["library-metadata"]["id"] == "doc-1"
|
||||
assert encoded["library-metadata"]["kind"] == "application/pdf"
|
||||
assert encoded["library-metadata"]["document-type"] == "source"
|
||||
|
||||
def test_encode_library_blob_bytes_to_string(self, response_translator):
|
||||
response = KnowledgeResponse(
|
||||
ids=None,
|
||||
library_blob=LibraryBlob(
|
||||
id="doc-1",
|
||||
data=b"dGVzdCBkYXRh",
|
||||
),
|
||||
)
|
||||
encoded = response_translator.encode(response)
|
||||
assert "library-blob" in encoded
|
||||
assert encoded["library-blob"]["id"] == "doc-1"
|
||||
assert encoded["library-blob"]["data"] == "dGVzdCBkYXRh"
|
||||
assert isinstance(encoded["library-blob"]["data"], str)
|
||||
|
||||
def test_encode_library_blob_string_passthrough(self, response_translator):
|
||||
response = KnowledgeResponse(
|
||||
ids=None,
|
||||
library_blob=LibraryBlob(
|
||||
id="doc-1",
|
||||
data="already-a-string",
|
||||
),
|
||||
)
|
||||
encoded = response_translator.encode(response)
|
||||
assert encoded["library-blob"]["data"] == "already-a-string"
|
||||
|
||||
def test_library_metadata_is_not_final(self, response_translator):
|
||||
response = KnowledgeResponse(
|
||||
ids=None,
|
||||
library_metadata=LibraryMetadata(id="doc-1"),
|
||||
)
|
||||
_, is_final = response_translator.encode_with_completion(response)
|
||||
assert is_final is False
|
||||
|
||||
def test_library_blob_is_not_final(self, response_translator):
|
||||
response = KnowledgeResponse(
|
||||
ids=None,
|
||||
library_blob=LibraryBlob(id="doc-1", data=b"data"),
|
||||
)
|
||||
_, is_final = response_translator.encode_with_completion(response)
|
||||
assert is_final is False
|
||||
|
||||
def test_eos_is_final(self, response_translator):
|
||||
response = KnowledgeResponse(eos=True)
|
||||
_, is_final = response_translator.encode_with_completion(response)
|
||||
assert is_final is True
|
||||
|
|
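The removed `library-blob` tests pin down one encoding detail: blob data given as `bytes` (`b"dGVzdCBkYXRh"`) is emitted as the string `"dGVzdCBkYXRh"`, while data that is already a string passes through unchanged. The translator source is not shown in this diff; the sketch below reproduces only that behaviour, with `blob_data_to_text` as a hypothetical helper name and UTF-8 decoding as an assumption (the fixtures are ASCII base64 text).

```python
from typing import Union


def blob_data_to_text(data: Union[bytes, str]) -> str:
    """Return blob data as a JSON-safe string.

    bytes are decoded (assumed UTF-8); strings are returned as-is.
    """
    if isinstance(data, bytes):
        return data.decode("utf-8")
    return data


encoded_from_bytes = blob_data_to_text(b"dGVzdCBkYXRh")
encoded_from_str = blob_data_to_text("already-a-string")
print(encoded_from_bytes, encoded_from_str)
```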
|
|||
|
|
@@ -337,7 +337,7 @@ class Api:
 from . bulk_client import BulkClient
 # Extract base URL (remove api/v1/ suffix)
 base_url = self.url.rsplit("api/v1/", 1)[0].rstrip("/")
-self._bulk_client = BulkClient(base_url, self.timeout, self.token, workspace=self.workspace)
+self._bulk_client = BulkClient(base_url, self.timeout, self.token)
 return self._bulk_client
 
 def metrics(self):
|
|
@@ -462,7 +462,7 @@ class Api:
 from . async_bulk_client import AsyncBulkClient
 # Extract base URL (remove api/v1/ suffix)
 base_url = self.url.rsplit("api/v1/", 1)[0].rstrip("/")
-self._async_bulk_client = AsyncBulkClient(base_url, self.timeout, self.token, workspace=self.workspace)
+self._async_bulk_client = AsyncBulkClient(base_url, self.timeout, self.token)
 return self._async_bulk_client
 
 def async_metrics(self):
|
|
|
|||
|
|
@@ -9,11 +9,10 @@ from . types import Triple
 class AsyncBulkClient:
 """Asynchronous bulk operations client"""
 
-def __init__(self, url: str, timeout: int, token: Optional[str], workspace: str = "default") -> None:
+def __init__(self, url: str, timeout: int, token: Optional[str]) -> None:
 self.url: str = self._convert_to_ws_url(url)
 self.timeout: int = timeout
 self.token: Optional[str] = token
-self.workspace: str = workspace
 
 def _convert_to_ws_url(self, url: str) -> str:
 """Convert HTTP URL to WebSocket URL"""
|
|
@ -26,21 +25,11 @@ class AsyncBulkClient:
|
|||
else:
|
||||
return f"ws://{url}"
|
||||
|
||||
def _build_ws_url(self, path: str) -> str:
|
||||
"""Build a WebSocket URL with token and workspace query params."""
|
||||
ws_url = f"{self.url}{path}"
|
||||
params = []
|
||||
if self.token:
|
||||
params.append(f"token={self.token}")
|
||||
if self.workspace:
|
||||
params.append(f"workspace={self.workspace}")
|
||||
if params:
|
||||
ws_url = f"{ws_url}?{'&'.join(params)}"
|
||||
return ws_url
|
||||
|
||||
async def import_triples(self, flow: str, triples: AsyncIterator[Triple], **kwargs: Any) -> None:
|
||||
"""Bulk import triples via WebSocket"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/import/triples")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/import/triples"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for triple in triples:
|
||||
|
|
@ -53,7 +42,9 @@ class AsyncBulkClient:
|
|||
|
||||
async def export_triples(self, flow: str, **kwargs: Any) -> AsyncIterator[Triple]:
|
||||
"""Bulk export triples via WebSocket"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/export/triples")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/export/triples"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for raw_message in websocket:
|
||||
|
|
@ -66,7 +57,9 @@ class AsyncBulkClient:
|
|||
|
||||
async def import_graph_embeddings(self, flow: str, embeddings: AsyncIterator[Dict[str, Any]], **kwargs: Any) -> None:
|
||||
"""Bulk import graph embeddings via WebSocket"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/import/graph-embeddings")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/import/graph-embeddings"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for embedding in embeddings:
|
||||
|
|
@ -74,7 +67,9 @@ class AsyncBulkClient:
|
|||
|
||||
async def export_graph_embeddings(self, flow: str, **kwargs: Any) -> AsyncIterator[Dict[str, Any]]:
|
||||
"""Bulk export graph embeddings via WebSocket"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/export/graph-embeddings")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/export/graph-embeddings"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for raw_message in websocket:
|
||||
|
|
@ -82,7 +77,9 @@ class AsyncBulkClient:
|
|||
|
||||
async def import_document_embeddings(self, flow: str, embeddings: AsyncIterator[Dict[str, Any]], **kwargs: Any) -> None:
|
||||
"""Bulk import document embeddings via WebSocket"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/import/document-embeddings")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/import/document-embeddings"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for embedding in embeddings:
|
||||
|
|
@ -90,7 +87,9 @@ class AsyncBulkClient:
|
|||
|
||||
async def export_document_embeddings(self, flow: str, **kwargs: Any) -> AsyncIterator[Dict[str, Any]]:
|
||||
"""Bulk export document embeddings via WebSocket"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/export/document-embeddings")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/export/document-embeddings"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for raw_message in websocket:
|
||||
|
|
@ -98,7 +97,9 @@ class AsyncBulkClient:
|
|||
|
||||
async def import_entity_contexts(self, flow: str, contexts: AsyncIterator[Dict[str, Any]], **kwargs: Any) -> None:
|
||||
"""Bulk import entity contexts via WebSocket"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/import/entity-contexts")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/import/entity-contexts"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for context in contexts:
|
||||
|
|
@ -106,7 +107,9 @@ class AsyncBulkClient:
|
|||
|
||||
async def export_entity_contexts(self, flow: str, **kwargs: Any) -> AsyncIterator[Dict[str, Any]]:
|
||||
"""Bulk export entity contexts via WebSocket"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/export/entity-contexts")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/export/entity-contexts"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for raw_message in websocket:
|
||||
|
|
@ -114,7 +117,9 @@ class AsyncBulkClient:
|
|||
|
||||
async def import_rows(self, flow: str, rows: AsyncIterator[Dict[str, Any]], **kwargs: Any) -> None:
|
||||
"""Bulk import rows via WebSocket"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/import/rows")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/import/rows"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for row in rows:
|
||||
|
|
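Both sides of this file build the WebSocket URL by string concatenation: one through a `_build_ws_url` helper that appends `token` and `workspace`, the other by appending `?token=...` inline in every method. Neither form escapes the values. As a variation rather than the code in this diff, a sketch using `urllib.parse.urlencode` shows how the same helper could escape its parameters; `build_ws_url` is a hypothetical free function and the host and token are made-up example values.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def build_ws_url(base: str, path: str, token: Optional[str],
                 workspace: Optional[str] = "default") -> str:
    """Join base and path, then append token/workspace as an escaped query string."""
    params: Dict[str, str] = {}
    if token:
        params["token"] = token
    if workspace:
        params["workspace"] = workspace

    url = f"{base}{path}"
    if params:
        # urlencode escapes reserved characters such as spaces, '&' and '='.
        url = f"{url}?{urlencode(params)}"
    return url


url = build_ws_url(
    "ws://localhost:8088",                  # example host, not from the diff
    "/api/v1/flow/default/import/triples",
    token="s3cret token",                   # example token containing a space
)
print(url)
```

With plain concatenation, a token containing `&` or `=` would be parsed by the server as extra query parameters, which is the case the escaping is meant to avoid.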
|
|||
|
|
@@ -30,7 +30,6 @@ class AsyncSocketClient:
 self.timeout = timeout
 self.token = token
 self.workspace = workspace
-self._workspace_explicit = workspace != "default"
 self._request_counter = 0
 self._socket = None
 self._connect_cm = None
|
|
@ -93,7 +92,6 @@ class AsyncSocketClient:
|
|||
)
|
||||
|
||||
if resp.get("type") == "auth-ok":
|
||||
if not self._workspace_explicit:
|
||||
self.workspace = resp.get("workspace", self.workspace)
|
||||
elif resp.get("type") == "auth-failed":
|
||||
await self._socket.close()
|
||||
|
|
|
|||
|
|
@@ -34,7 +34,7 @@ class BulkClient:
 Note: For true async support, use AsyncBulkClient instead.
 """
 
-def __init__(self, url: str, timeout: int, token: Optional[str], workspace: str = "default") -> None:
+def __init__(self, url: str, timeout: int, token: Optional[str]) -> None:
 """
 Initialize synchronous bulk client.
 
|
|
@ -42,12 +42,10 @@ class BulkClient:
|
|||
url: Base URL for TrustGraph API (HTTP/HTTPS will be converted to WS/WSS)
|
||||
timeout: WebSocket timeout in seconds
|
||||
token: Optional bearer token for authentication
|
||||
workspace: Workspace for data isolation
|
||||
"""
|
||||
self.url: str = self._convert_to_ws_url(url)
|
||||
self.timeout: int = timeout
|
||||
self.token: Optional[str] = token
|
||||
self.workspace: str = workspace
|
||||
|
||||
def _convert_to_ws_url(self, url: str) -> str:
|
||||
"""Convert HTTP URL to WebSocket URL"""
|
||||
|
|
@ -60,18 +58,6 @@ class BulkClient:
|
|||
else:
|
||||
return f"ws://{url}"
|
||||
|
||||
def _build_ws_url(self, path: str) -> str:
|
||||
"""Build a WebSocket URL with token and workspace query params."""
|
||||
ws_url = f"{self.url}{path}"
|
||||
params = []
|
||||
if self.token:
|
||||
params.append(f"token={self.token}")
|
||||
if self.workspace:
|
||||
params.append(f"workspace={self.workspace}")
|
||||
if params:
|
||||
ws_url = f"{ws_url}?{'&'.join(params)}"
|
||||
return ws_url
|
||||
|
||||
def _run_async(self, coro: Coroutine[Any, Any, Any]) -> Any:
|
||||
"""Run async coroutine synchronously"""
|
||||
try:
|
||||
|
|
@ -130,7 +116,9 @@ class BulkClient:
|
|||
metadata: Optional[Dict[str, Any]], batch_size: int
|
||||
) -> None:
|
||||
"""Async implementation of triple import"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/import/triples")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/import/triples"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
if metadata is None:
|
||||
metadata = {"id": "", "metadata": [], "collection": "default"}
|
||||
|
|
@ -206,7 +194,9 @@ class BulkClient:
|
|||
|
||||
async def _export_triples_async(self, flow: str) -> Iterator[Triple]:
|
||||
"""Async implementation of triple export"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/export/triples")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/export/triples"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for raw_message in websocket:
|
||||
|
|
@ -248,7 +238,9 @@ class BulkClient:
|
|||
|
||||
async def _import_graph_embeddings_async(self, flow: str, embeddings: Iterator[Dict[str, Any]]) -> None:
|
||||
"""Async implementation of graph embeddings import"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/import/graph-embeddings")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/import/graph-embeddings"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
for embedding in embeddings:
|
||||
|
|
@ -304,7 +296,9 @@ class BulkClient:
|
|||
|
||||
async def _export_graph_embeddings_async(self, flow: str) -> Iterator[Dict[str, Any]]:
|
||||
"""Async implementation of graph embeddings export"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/export/graph-embeddings")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/export/graph-embeddings"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for raw_message in websocket:
|
||||
|
|
@ -342,7 +336,9 @@ class BulkClient:
|
|||
|
||||
async def _import_document_embeddings_async(self, flow: str, embeddings: Iterator[Dict[str, Any]]) -> None:
|
||||
"""Async implementation of document embeddings import"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/import/document-embeddings")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/import/document-embeddings"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
for embedding in embeddings:
|
||||
|
|
@ -398,7 +394,9 @@ class BulkClient:
|
|||
|
||||
async def _export_document_embeddings_async(self, flow: str) -> Iterator[Dict[str, Any]]:
|
||||
"""Async implementation of document embeddings export"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/export/document-embeddings")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/export/document-embeddings"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for raw_message in websocket:
|
||||
|
|
@ -448,7 +446,9 @@ class BulkClient:
|
|||
metadata: Optional[Dict[str, Any]], batch_size: int
|
||||
) -> None:
|
||||
"""Async implementation of entity contexts import"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/import/entity-contexts")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/import/entity-contexts"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
if metadata is None:
|
||||
metadata = {"id": "", "metadata": [], "collection": "default"}
|
||||
|
|
@ -522,7 +522,9 @@ class BulkClient:
|
|||
|
||||
async def _export_entity_contexts_async(self, flow: str) -> Iterator[Dict[str, Any]]:
|
||||
"""Async implementation of entity contexts export"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/export/entity-contexts")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/export/entity-contexts"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
async for raw_message in websocket:
|
||||
|
|
@ -560,7 +562,9 @@ class BulkClient:
|
|||
|
||||
async def _import_rows_async(self, flow: str, rows: Iterator[Dict[str, Any]]) -> None:
|
||||
"""Async implementation of rows import"""
|
||||
ws_url = self._build_ws_url(f"/api/v1/flow/{flow}/import/rows")
|
||||
ws_url = f"{self.url}/api/v1/flow/{flow}/import/rows"
|
||||
if self.token:
|
||||
ws_url = f"{ws_url}?token={self.token}"
|
||||
|
||||
async with websockets.connect(ws_url, ping_interval=20, ping_timeout=self.timeout) as websocket:
|
||||
for row in rows:
|
||||
|
|
|
|||
|
|
@ -167,7 +167,6 @@ class SocketClient:
|
|||
)
|
||||
|
||||
if resp.get("type") == "auth-ok":
|
||||
if self.workspace == "default":
|
||||
self.workspace = resp.get("workspace", self.workspace)
|
||||
elif resp.get("type") == "auth-failed":
|
||||
await self._socket.close()
|
||||
|
|
@ -502,7 +501,6 @@ class SocketClient:
|
|||
|
||||
def put_kg_core(
|
||||
self, id: str, triples=None, graph_embeddings=None,
|
||||
library_metadata=None, library_blob=None,
|
||||
) -> Dict[str, Any]:
|
||||
request = {
|
||||
"operation": "put-kg-core",
|
||||
|
|
@ -513,10 +511,6 @@ class SocketClient:
|
|||
request["triples"] = triples
|
||||
if graph_embeddings is not None:
|
||||
request["graph-embeddings"] = graph_embeddings
|
||||
if library_metadata is not None:
|
||||
request["library-metadata"] = library_metadata
|
||||
if library_blob is not None:
|
||||
request["library-blob"] = library_blob
|
||||
return self._send_request_sync("knowledge", None, request)
|
||||
|
||||
def get_de_core(self, id: str) -> Iterator[Dict[str, Any]]:
|
||||
|
|
|
|||
|
|
@ -103,19 +103,35 @@ def resolve_cassandra_config(
|
|||
host: Optional[str] = None,
|
||||
username: Optional[str] = None,
|
||||
password: Optional[str] = None,
|
||||
default_keyspace: Optional[str] = None,
|
||||
replication_factor: Optional[int] = None,
|
||||
default_keyspace: Optional[str] = None
|
||||
) -> Tuple[List[str], Optional[str], Optional[str], Optional[str], int]:
|
||||
"""
|
||||
Resolve Cassandra configuration from various sources.
|
||||
|
||||
Can accept either argparse args object or explicit parameters.
|
||||
Converts host string to list format for Cassandra driver.
|
||||
|
||||
Args:
|
||||
args: Optional argparse namespace with cassandra_host, cassandra_username, cassandra_password, cassandra_keyspace, cassandra_replication_factor
|
||||
host: Optional explicit host parameter (overrides args)
|
||||
username: Optional explicit username parameter (overrides args)
|
||||
password: Optional explicit password parameter (overrides args)
|
||||
default_keyspace: Optional default keyspace if not specified elsewhere
|
||||
|
||||
Returns:
|
||||
tuple: (hosts_list, username, password, keyspace, replication_factor)
|
||||
"""
|
||||
# If args provided, extract values
|
||||
keyspace = None
|
||||
replication_factor = 1
|
||||
if args is not None:
|
||||
host = host or getattr(args, 'cassandra_host', None)
|
||||
username = username or getattr(args, 'cassandra_username', None)
|
||||
password = password or getattr(args, 'cassandra_password', None)
|
||||
keyspace = getattr(args, 'cassandra_keyspace', None)
|
||||
replication_factor = replication_factor or getattr(
|
||||
args, 'cassandra_replication_factor', None
|
||||
)
|
||||
replication_factor = getattr(args, 'cassandra_replication_factor', 1)
|
||||
|
||||
# Apply defaults if still None
|
||||
defaults = get_cassandra_defaults()
|
||||
host = host or defaults['host']
|
||||
username = username or defaults['username']
|
||||
|
|
|
|||
|
|
@ -11,7 +11,6 @@ Supports dual output to console and Loki for centralized log aggregation.
|
|||
import contextvars
|
||||
import logging
|
||||
import logging.handlers
|
||||
import uuid
|
||||
from argparse import ArgumentParser
|
||||
from queue import Queue
|
||||
from typing import Any
|
||||
|
|
@ -133,12 +132,14 @@ def setup_logging(args: dict[str, Any]) -> None:
|
|||
try:
|
||||
from logging_loki import LokiHandler
|
||||
|
||||
instance_id = str(uuid.uuid4())[:8]
|
||||
|
||||
# Create Loki handler with optional authentication. The
|
||||
# processor label is NOT baked in here — it's stamped onto
|
||||
# each record by _ProcessorIdFilter reading the task-local
|
||||
# contextvar, and logging_loki's emitter reads record.tags
|
||||
# to build per-record Loki labels.
|
||||
loki_handler_kwargs = {
|
||||
'url': loki_url,
|
||||
'version': "1",
|
||||
'tags': {'instance': instance_id},
|
||||
}
|
||||
|
||||
if loki_username and loki_password:
|
||||
|
|
|
|||
|
|
@ -1,87 +0,0 @@
|
|||
|
||||
import os
|
||||
import argparse
|
||||
from typing import Optional, Any, Tuple
|
||||
|
||||
|
||||
def get_qdrant_defaults() -> dict:
|
||||
return {
|
||||
'url': os.getenv('QDRANT_URL', 'http://localhost:6333'),
|
||||
'api_key': os.getenv('QDRANT_API_KEY'),
|
||||
'replication_factor': int(os.getenv('QDRANT_REPLICATION_FACTOR', '1')),
|
||||
'shard_number': int(os.getenv('QDRANT_SHARD_NUMBER', '1')),
|
||||
}
|
||||
|
||||
|
||||
def add_qdrant_args(parser: argparse.ArgumentParser) -> None:
|
||||
defaults = get_qdrant_defaults()
|
||||
|
||||
url_help = f"Qdrant URL (default: {defaults['url']})"
|
||||
if 'QDRANT_URL' in os.environ:
|
||||
url_help += " [from QDRANT_URL]"
|
||||
|
||||
api_key_help = "Qdrant API key"
|
||||
if defaults['api_key']:
|
||||
api_key_help += " (default: <set>)"
|
||||
if 'QDRANT_API_KEY' in os.environ:
|
||||
api_key_help += " [from QDRANT_API_KEY]"
|
||||
|
||||
replication_help = f"Qdrant collection replication factor (default: {defaults['replication_factor']})"
|
||||
if 'QDRANT_REPLICATION_FACTOR' in os.environ:
|
||||
replication_help += " [from QDRANT_REPLICATION_FACTOR]"
|
||||
|
||||
shard_help = f"Qdrant collection shard number (default: {defaults['shard_number']})"
|
||||
if 'QDRANT_SHARD_NUMBER' in os.environ:
|
||||
shard_help += " [from QDRANT_SHARD_NUMBER]"
|
||||
|
||||
parser.add_argument(
|
||||
'--store-uri',
|
||||
default=defaults['url'],
|
||||
help=url_help,
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--api-key',
|
||||
default=defaults['api_key'],
|
||||
help=api_key_help,
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--qdrant-replication-factor',
|
||||
type=int,
|
||||
default=defaults['replication_factor'],
|
||||
help=replication_help,
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--qdrant-shard-number',
|
||||
type=int,
|
||||
default=defaults['shard_number'],
|
||||
help=shard_help,
|
||||
)
|
||||
|
||||
|
||||
def resolve_qdrant_config(
|
||||
args: Optional[Any] = None,
|
||||
url: Optional[str] = None,
|
||||
api_key: Optional[str] = None,
|
||||
replication_factor: Optional[int] = None,
|
||||
shard_number: Optional[int] = None,
|
||||
) -> Tuple[str, Optional[str], int, int]:
|
||||
if args is not None:
|
||||
url = url or getattr(args, 'store_uri', None)
|
||||
api_key = api_key or getattr(args, 'api_key', None)
|
||||
replication_factor = replication_factor or getattr(
|
||||
args, 'qdrant_replication_factor', None
|
||||
)
|
||||
shard_number = shard_number or getattr(
|
||||
args, 'qdrant_shard_number', None
|
||||
)
|
||||
|
||||
defaults = get_qdrant_defaults()
|
||||
url = url or defaults['url']
|
||||
api_key = api_key or defaults['api_key']
|
||||
replication_factor = replication_factor or defaults['replication_factor']
|
||||
shard_number = shard_number or defaults['shard_number']
|
||||
|
||||
return url, api_key, replication_factor, shard_number
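`resolve_qdrant_config` applies a three-level precedence: explicit parameter, then the argparse namespace, then the environment-backed default. It chains these with `or`, which also skips falsy values such as `0`. The sketch below shows the same precedence with explicit `None` checks for a single setting; `first_set` and `resolve_shard_number` are hypothetical names used only for illustration:

```python
import os
from types import SimpleNamespace
from typing import Any, Optional


def first_set(*candidates: Any) -> Any:
    """Return the first candidate that is not None (hypothetical helper)."""
    for value in candidates:
        if value is not None:
            return value
    return None


def resolve_shard_number(args: Optional[Any] = None,
                         shard_number: Optional[int] = None) -> int:
    # Precedence: explicit parameter, then args namespace, then env default.
    from_args = None
    if args is not None:
        from_args = getattr(args, "qdrant_shard_number", None)
    default = int(os.getenv("QDRANT_SHARD_NUMBER", "1"))
    return first_set(shard_number, from_args, default)
```

For example, `resolve_shard_number(SimpleNamespace(qdrant_shard_number=4), shard_number=2)` picks the explicit `2` over the namespace value.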
|
||||
|
|
@ -2,8 +2,7 @@ from typing import Dict, Any, Tuple, Optional
|
|||
from ...schema import (
|
||||
KnowledgeRequest, KnowledgeResponse, Triples, GraphEmbeddings,
|
||||
DocumentEmbeddings, ChunkEmbeddings,
|
||||
Metadata, EntityEmbeddings,
|
||||
LibraryMetadata, LibraryBlob,
|
||||
Metadata, EntityEmbeddings
|
||||
)
|
||||
from .base import MessageTranslator
|
||||
from .primitives import ValueTranslator, SubgraphTranslator
|
||||
|
|
@ -62,27 +61,6 @@ class KnowledgeRequestTranslator(MessageTranslator):
|
|||
]
|
||||
)
|
||||
|
||||
library_metadata = None
|
||||
if "library-metadata" in data:
|
||||
lm = data["library-metadata"]
|
||||
library_metadata = LibraryMetadata(
|
||||
id=lm.get("id", ""),
|
||||
kind=lm.get("kind", ""),
|
||||
title=lm.get("title", ""),
|
||||
parent_id=lm.get("parent-id", ""),
|
||||
document_type=lm.get("document-type", ""),
|
||||
comments=lm.get("comments", ""),
|
||||
tags=lm.get("tags", []),
|
||||
)
|
||||
|
||||
library_blob = None
|
||||
if "library-blob" in data:
|
||||
lb = data["library-blob"]
|
||||
library_blob = LibraryBlob(
|
||||
id=lb.get("id", ""),
|
||||
data=lb.get("data", b""),
|
||||
)
|
||||
|
||||
return KnowledgeRequest(
|
||||
operation=data.get("operation"),
|
||||
id=data.get("id"),
|
||||
|
|
@ -91,8 +69,6 @@ class KnowledgeRequestTranslator(MessageTranslator):
|
|||
triples=triples,
|
||||
graph_embeddings=graph_embeddings,
|
||||
document_embeddings=document_embeddings,
|
||||
library_metadata=library_metadata,
|
||||
library_blob=library_blob,
|
||||
)
|
||||
|
||||
def encode(self, obj: KnowledgeRequest) -> Dict[str, Any]:
|
||||
|
|
@ -149,26 +125,6 @@ class KnowledgeRequestTranslator(MessageTranslator):
|
|||
],
|
||||
}
|
||||
|
||||
if obj.library_metadata:
|
||||
result["library-metadata"] = {
|
||||
"id": obj.library_metadata.id,
|
||||
"kind": obj.library_metadata.kind,
|
||||
"title": obj.library_metadata.title,
|
||||
"parent-id": obj.library_metadata.parent_id,
|
||||
"document-type": obj.library_metadata.document_type,
|
||||
"comments": obj.library_metadata.comments,
|
||||
"tags": obj.library_metadata.tags,
|
||||
}
|
||||
|
||||
if obj.library_blob:
|
||||
data = obj.library_blob.data
|
||||
if isinstance(data, bytes):
|
||||
data = data.decode("utf-8")
|
||||
result["library-blob"] = {
|
||||
"id": obj.library_blob.id,
|
||||
"data": data,
|
||||
}
|
||||
|
||||
return result
|
||||
|
||||
|
||||
|
|
@ -238,32 +194,6 @@ class KnowledgeResponseTranslator(MessageTranslator):
|
|||
}
|
||||
}
|
||||
|
||||
# Streaming library metadata response
|
||||
if obj.library_metadata:
|
||||
return {
|
||||
"library-metadata": {
|
||||
"id": obj.library_metadata.id,
|
||||
"kind": obj.library_metadata.kind,
|
||||
"title": obj.library_metadata.title,
|
||||
"parent-id": obj.library_metadata.parent_id,
|
||||
"document-type": obj.library_metadata.document_type,
|
||||
"comments": obj.library_metadata.comments,
|
||||
"tags": obj.library_metadata.tags,
|
||||
}
|
||||
}
|
||||
|
||||
# Streaming library blob response
|
||||
if obj.library_blob:
|
||||
data = obj.library_blob.data
|
||||
if isinstance(data, bytes):
|
||||
data = data.decode("utf-8")
|
||||
return {
|
||||
"library-blob": {
|
||||
"id": obj.library_blob.id,
|
||||
"data": data,
|
||||
}
|
||||
}
|
||||
|
||||
# End of stream marker
|
||||
if obj.eos is True:
|
||||
return {"eos": True}
|
||||
|
|
@ -279,9 +209,7 @@ class KnowledgeResponseTranslator(MessageTranslator):
|
|||
is_final = (
|
||||
obj.ids is not None or # List response
|
||||
obj.eos is True or # End of stream
|
||||
(not obj.triples and not obj.graph_embeddings
|
||||
and not obj.document_embeddings
|
||||
and not obj.library_metadata and not obj.library_blob) # Empty response
|
||||
(not obj.triples and not obj.graph_embeddings and not obj.document_embeddings) # Empty response
|
||||
)
|
||||
|
||||
return response, is_final
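The `is_final` expression treats a response as the last one in a stream when it carries a list of ids, an end-of-stream marker, or no payload at all. A minimal sketch of that rule for the three-payload variant, using a hypothetical `Response` dataclass as a stand-in for `KnowledgeResponse`:

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Response:
    """Hypothetical stand-in for KnowledgeResponse; only the fields used here."""
    ids: Optional[List[str]] = None
    eos: bool = False
    triples: Any = None
    graph_embeddings: Any = None
    document_embeddings: Any = None


def is_final(resp: Response) -> bool:
    # A response with any payload is an intermediate streaming message.
    has_payload = bool(
        resp.triples or resp.graph_embeddings or resp.document_embeddings
    )
    return resp.ids is not None or resp.eos is True or not has_payload
```

Note that an empty `ids` list still counts as final, because the check is `is not None` and not truthiness.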
|
||||
|
|
@ -21,21 +21,6 @@ from .embeddings import GraphEmbeddings, DocumentEmbeddings
|
|||
# <- ()
|
||||
# <- (error)
|
||||
|
||||
@dataclass
|
||||
class LibraryMetadata:
|
||||
id: str = ""
|
||||
kind: str = ""
|
||||
title: str = ""
|
||||
parent_id: str = ""
|
||||
document_type: str = ""
|
||||
comments: str = ""
|
||||
tags: list[str] = field(default_factory=list)
|
||||
|
||||
@dataclass
|
||||
class LibraryBlob:
|
||||
id: str = ""
|
||||
data: bytes = b""
|
||||
|
||||
@dataclass
|
||||
class KnowledgeRequest:
|
||||
# get-kg-core, delete-kg-core, list-kg-cores, put-kg-core
|
||||
|
|
@ -59,10 +44,6 @@ class KnowledgeRequest:
|
|||
# put-de-core
|
||||
document_embeddings: DocumentEmbeddings | None = None
|
||||
|
||||
# put-kg-core (source material)
|
||||
library_metadata: LibraryMetadata | None = None
|
||||
library_blob: LibraryBlob | None = None
|
||||
|
||||
@dataclass
|
||||
class KnowledgeResponse:
|
||||
error: Error | None = None
|
||||
|
|
@ -71,8 +52,6 @@ class KnowledgeResponse:
|
|||
triples: Triples | None = None
|
||||
graph_embeddings: GraphEmbeddings | None = None
|
||||
document_embeddings: DocumentEmbeddings | None = None
|
||||
library_metadata: LibraryMetadata | None = None
|
||||
library_blob: LibraryBlob | None = None
|
||||
|
||||
knowledge_request_queue = queue('knowledge', cls='request')
|
||||
knowledge_response_queue = queue('knowledge', cls='response')
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ Gets document content from the library by document ID.
|
|||
import argparse
|
||||
import os
|
||||
import sys
|
||||
import requests
|
||||
from trustgraph.api import Api
|
||||
|
||||
default_url = os.getenv("TRUSTGRAPH_URL", 'http://localhost:8088/')
|
||||
default_token = os.getenv("TRUSTGRAPH_TOKEN", None)
|
||||
|
|
@ -13,29 +13,15 @@ default_workspace = os.getenv("TRUSTGRAPH_WORKSPACE", "default")
|
|||
|
||||
def get_content(url, document_id, output_file, token=None, workspace="default"):
|
||||
|
||||
stream_url = url.rstrip("/") + "/api/v1/document-stream"
|
||||
api = Api(url, token=token, workspace=workspace).library()
|
||||
|
||||
params = {
|
||||
"document-id": document_id,
|
||||
"workspace": workspace,
|
||||
}
|
||||
|
||||
headers = {}
|
||||
if token:
|
||||
headers["Authorization"] = f"Bearer {token}"
|
||||
|
||||
resp = requests.get(stream_url, params=params, headers=headers, stream=True)
|
||||
resp.raise_for_status()
|
||||
content = api.get_document_content(id=document_id)
|
||||
|
||||
if output_file:
|
||||
total = 0
|
||||
with open(output_file, 'wb') as f:
|
||||
for chunk in resp.iter_content(chunk_size=65536):
|
||||
f.write(chunk)
|
||||
total += len(chunk)
|
||||
print(f"Written {total} bytes to {output_file}")
|
||||
f.write(content)
|
||||
print(f"Written {len(content)} bytes to {output_file}")
|
||||
else:
|
||||
content = resp.content
|
||||
try:
|
||||
text = content.decode('utf-8')
|
||||
print(text)
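One side of this hunk writes the document to disk chunk by chunk and counts the bytes, while the other fetches the whole content and writes it in one call. The chunked pattern can be sketched independently of HTTP as below; `copy_stream` is a hypothetical helper that works on any binary file-like object, and the 65536-byte chunk size mirrors the value in the diff:

```python
import io
from typing import BinaryIO


def copy_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 65536) -> int:
    """Copy src to dst in fixed-size chunks and return the byte count."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            # An empty read signals end of stream.
            break
        dst.write(chunk)
        total += len(chunk)
    return total
```

Memory use stays bounded by `chunk_size` regardless of document size, which is the main reason to prefer streaming for large library documents.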
|
||||
|
|
|
|||
|
|
@ -47,31 +47,6 @@ def write_ge(f, data):
|
|||
)
|
||||
f.write(msgpack.packb(msg, use_bin_type=True))
|
||||
|
||||
def write_library_metadata(f, data):
|
||||
msg = (
|
||||
"lm",
|
||||
{
|
||||
"i": data["id"],
|
||||
"k": data.get("kind", ""),
|
||||
"t": data.get("title", ""),
|
||||
"p": data.get("parent-id", ""),
|
||||
"d": data.get("document-type", ""),
|
||||
"c": data.get("comments", ""),
|
||||
"g": data.get("tags", []),
|
||||
}
|
||||
)
|
||||
f.write(msgpack.packb(msg, use_bin_type=True))
|
||||
|
||||
def write_library_blob(f, data):
|
||||
msg = (
|
||||
"lb",
|
||||
{
|
||||
"i": data["id"],
|
||||
"d": data.get("data", b""),
|
||||
}
|
||||
)
|
||||
f.write(msgpack.packb(msg, use_bin_type=True))
|
||||
|
||||
def fetch(url, workspace, id, output, token=None):
|
||||
|
||||
api = Api(url=url, token=token, workspace=workspace)
|
||||
|
|
@ -80,8 +55,6 @@ def fetch(url, workspace, id, output, token=None):
|
|||
try:
|
||||
ge = 0
|
||||
t = 0
|
||||
lm = 0
|
||||
lb = 0
|
||||
|
||||
with open(output, "wb") as f:
|
||||
|
||||
|
|
@ -95,15 +68,7 @@ def fetch(url, workspace, id, output, token=None):
|
|||
ge += 1
|
||||
write_ge(f, response["graph-embeddings"])
|
||||
|
||||
if "library-metadata" in response:
|
||||
lm += 1
|
||||
write_library_metadata(f, response["library-metadata"])
|
||||
|
||||
if "library-blob" in response:
|
||||
lb += 1
|
||||
write_library_blob(f, response["library-blob"])
|
||||
|
||||
print(f"Got: {t} triple, {ge} GE, {lm} library metadata, {lb} library blob messages.")
|
||||
print(f"Got: {t} triple, {ge} GE messages.")
|
||||
|
||||
finally:
|
||||
socket.close()
|
||||
|
|
|
|||
|
|
@ -78,7 +78,7 @@ def load_structured_data(
|
|||
logger.info("Step 1: Analyzing data to discover best matching schema...")
|
||||
|
||||
# Step 1: Auto-discover schema (reuse discover_schema logic)
|
||||
discovered_schema = _auto_discover_schema(api_url, input_file, sample_chars, flow, logger, token=token, workspace=workspace)
|
||||
discovered_schema = _auto_discover_schema(api_url, input_file, sample_chars, flow, logger, workspace=workspace)
|
||||
if not discovered_schema:
|
||||
logger.error("Failed to discover suitable schema automatically")
|
||||
print("❌ Could not automatically determine the best schema for your data.")
|
||||
|
|
@ -90,7 +90,7 @@ def load_structured_data(
|
|||
|
||||
# Step 2: Auto-generate descriptor
|
||||
logger.info("Step 2: Generating descriptor configuration...")
|
||||
auto_descriptor = _auto_generate_descriptor(api_url, input_file, discovered_schema, sample_chars, flow, logger, token=token, workspace=workspace)
|
||||
auto_descriptor = _auto_generate_descriptor(api_url, input_file, discovered_schema, sample_chars, flow, logger, workspace=workspace)
|
||||
if not auto_descriptor:
|
||||
logger.error("Failed to generate descriptor automatically")
|
||||
print("❌ Could not automatically generate descriptor configuration.")
|
||||
|
|
@ -172,7 +172,7 @@ def load_structured_data(
|
|||
logger.info(f"Sample chars: {sample_chars} characters")
|
||||
|
||||
# Use the helper function to discover schema (get raw response for display)
|
||||
response = _auto_discover_schema(api_url, input_file, sample_chars, flow, logger, return_raw_response=True, token=token, workspace=workspace)
|
||||
response = _auto_discover_schema(api_url, input_file, sample_chars, flow, logger, return_raw_response=True, workspace=workspace)
|
||||
|
||||
if response:
|
||||
# Debug: print response type and content
|
||||
|
|
@ -203,7 +203,7 @@ def load_structured_data(
|
|||
# If no schema specified, discover it first
|
||||
if not schema_name:
|
||||
logger.info("No schema specified, auto-discovering...")
|
||||
schema_name = _auto_discover_schema(api_url, input_file, sample_chars, flow, logger, token=token, workspace=workspace)
|
||||
schema_name = _auto_discover_schema(api_url, input_file, sample_chars, flow, logger, workspace=workspace)
|
||||
if not schema_name:
|
||||
print("Error: Could not determine schema automatically.")
|
||||
print("Please specify a schema using --schema-name or run --discover-schema first.")
|
||||
|
|
@ -213,7 +213,7 @@ def load_structured_data(
|
|||
logger.info(f"Target schema: {schema_name}")
|
||||
|
||||
# Generate descriptor using helper function
|
||||
descriptor = _auto_generate_descriptor(api_url, input_file, schema_name, sample_chars, flow, logger, token=token, workspace=workspace)
|
||||
descriptor = _auto_generate_descriptor(api_url, input_file, schema_name, sample_chars, flow, logger, workspace=workspace)
|
||||
|
||||
if descriptor:
|
||||
# Output the generated descriptor
|
||||
|
|
@ -293,7 +293,7 @@ def load_structured_data(
|
|||
|
||||
# Send to TrustGraph
|
||||
print(f"🚀 Importing {len(output_records)} records to TrustGraph...")
|
||||
imported_count = _send_to_trustgraph(output_records, api_url, flow, batch_size, token=token, workspace=workspace)
|
||||
imported_count = _send_to_trustgraph(output_records, api_url, flow, batch_size, token=token)
|
||||
|
||||
# Get summary info from descriptor
|
||||
format_info = descriptor.get('format', {})
|
||||
|
|
@ -603,7 +603,7 @@ def _send_to_trustgraph(rows, api_url, flow, batch_size=1000, token=None, worksp
|
|||
|
||||
|
||||
# Helper functions for auto mode
|
||||
def _auto_discover_schema(api_url, input_file, sample_chars, flow, logger, return_raw_response=False, token=None, workspace="default"):
|
||||
def _auto_discover_schema(api_url, input_file, sample_chars, flow, logger, return_raw_response=False, workspace="default"):
|
||||
"""Auto-discover the best matching schema for the input data
|
||||
|
||||
Args:
|
||||
|
|
@ -626,7 +626,7 @@ def _auto_discover_schema(api_url, input_file, sample_chars, flow, logger, retur
|
|||
# Import API modules
|
||||
from trustgraph.api import Api
|
||||
from trustgraph.api.types import ConfigKey
|
||||
api = Api(api_url, token=token, workspace=workspace)
|
||||
api = Api(api_url, workspace=workspace)
|
||||
config_api = api.config()
|
||||
|
||||
# Get available schemas
|
||||
|
|
@ -707,7 +707,7 @@ def _auto_discover_schema(api_url, input_file, sample_chars, flow, logger, retur
|
|||
return None
|
||||
|
||||
|
||||
def _auto_generate_descriptor(api_url, input_file, schema_name, sample_chars, flow, logger, token=None, workspace="default"):
|
||||
def _auto_generate_descriptor(api_url, input_file, schema_name, sample_chars, flow, logger, workspace="default"):
|
||||
"""Auto-generate descriptor configuration for the discovered schema"""
|
||||
try:
|
||||
# Read sample data
|
||||
|
|
@ -717,7 +717,7 @@ def _auto_generate_descriptor(api_url, input_file, schema_name, sample_chars, fl
|
|||
# Import API modules
|
||||
from trustgraph.api import Api
|
||||
from trustgraph.api.types import ConfigKey
|
||||
api = Api(api_url, token=token, workspace=workspace)
|
||||
api = Api(api_url, workspace=workspace)
|
||||
config_api = api.config()
|
||||
|
||||
# Get schema definition
|
||||
|
|
|
|||
|
|
@ -40,23 +40,6 @@ def read_message(unpacked, id):
|
|||
},
|
||||
"triples": msg["t"],
|
||||
}
|
||||
elif unpacked[0] == "lm":
|
||||
msg = unpacked[1]
|
||||
return "lm", {
|
||||
"id": msg["i"],
|
||||
"kind": msg.get("k", ""),
|
||||
"title": msg.get("t", ""),
|
||||
"parent-id": msg.get("p", ""),
|
||||
"document-type": msg.get("d", ""),
|
||||
"comments": msg.get("c", ""),
|
||||
"tags": msg.get("g", []),
|
||||
}
|
||||
elif unpacked[0] == "lb":
|
||||
msg = unpacked[1]
|
||||
return "lb", {
|
||||
"id": msg["i"],
|
||||
"data": msg.get("d", b""),
|
||||
}
|
||||
else:
|
||||
raise RuntimeError("Unpacked unexpected message type", unpacked[0])
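The `"lm"` branch of `read_message` reverses the short-key encoding that `write_library_metadata` applies when a core is saved. A table-driven sketch of that expansion, assuming the record has already been unpacked into a `(tag, dict)` pair; `LM_KEYS` and `expand_library_metadata` are hypothetical names:

```python
from typing import Any, Dict, Tuple

# Short wire keys mapped to long field names, as used for "lm" records.
LM_KEYS = {
    "k": "kind",
    "t": "title",
    "p": "parent-id",
    "d": "document-type",
    "c": "comments",
}


def expand_library_metadata(unpacked: Tuple[str, Dict[str, Any]]) -> Dict[str, Any]:
    tag, msg = unpacked
    if tag != "lm":
        raise RuntimeError("Unpacked unexpected message type", tag)
    # "i" is required; the remaining string fields default to "".
    out: Dict[str, Any] = {"id": msg["i"]}
    for short, long_name in LM_KEYS.items():
        out[long_name] = msg.get(short, "")
    out["tags"] = msg.get("g", [])
    return out
```

Keeping the key mapping in one table makes it harder for the writer and reader sides to drift apart.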
|
||||
|
||||
|
|
@ -68,8 +51,6 @@ def put(url, workspace, id, input, token=None):
|
|||
try:
|
||||
ge = 0
|
||||
t = 0
|
||||
lm = 0
|
||||
lb = 0
|
||||
|
||||
with open(input, "rb") as f:
|
||||
|
||||
|
|
@ -92,18 +73,10 @@ def put(url, workspace, id, input, token=None):
|
|||
t += 1
|
||||
socket.put_kg_core(id, triples=msg)
|
||||
|
||||
elif kind == "lm":
|
||||
lm += 1
|
||||
socket.put_kg_core(id, library_metadata=msg)
|
||||
|
||||
elif kind == "lb":
|
||||
lb += 1
|
||||
socket.put_kg_core(id, library_blob=msg)
|
||||
|
||||
else:
|
||||
raise RuntimeError("Unexpected message kind", kind)
|
||||
|
||||
print(f"Put: {t} triple, {ge} GE, {lm} library metadata, {lb} library blob messages.")
|
||||
print(f"Put: {t} triple, {ge} GE messages.")
|
||||
|
||||
finally:
|
||||
socket.close()
|
||||
|
|
|
|||
|
|
@ -119,8 +119,7 @@ def main():
|
|||
raise RuntimeError("Can't use --system with other args")
|
||||
|
||||
set_system(
|
||||
url=args.api_url, system=args.system, token=args.token,
|
||||
workspace=args.workspace,
|
||||
url=args.api_url, system=args.system, token=args.token
|
||||
)
|
||||
|
||||
else:
|
||||
|
|
|
|||
|
|
@ -105,7 +105,7 @@ async def fetch_data(client, workspace):
|
|||
return blueprint_names, blueprints, param_type_defs
|
||||
|
||||
async def _show_flow_blueprints_async(url, token=None, workspace="default"):
|
||||
async with AsyncSocketClient(url, timeout=60, token=token, workspace=workspace) as client:
|
||||
async with AsyncSocketClient(url, timeout=60, token=token) as client:
|
||||
return await fetch_data(client, workspace)
|
||||
|
||||
def show_flow_blueprints(url, token=None, workspace="default"):
|
||||
|
|
|
|||
|
|
@ -213,7 +213,7 @@ async def fetch_show_flows(client, workspace):
|
|||
|
||||
async def _show_flows_async(url, token=None, workspace="default"):
|
||||
|
||||
async with AsyncSocketClient(url, timeout=60, token=token, workspace=workspace) as client:
|
||||
async with AsyncSocketClient(url, timeout=60, token=token) as client:
|
||||
return await fetch_show_flows(client, workspace)
|
||||
|
||||
def show_flows(url, token=None, workspace="default"):
|
||||
|
|
|
|||
|
|
@ -15,7 +15,6 @@ import json
|
|||
|
||||
default_url = os.getenv("TRUSTGRAPH_URL", 'http://localhost:8088/')
|
||||
default_token = os.getenv("TRUSTGRAPH_TOKEN", None)
|
||||
default_workspace = os.getenv("TRUSTGRAPH_WORKSPACE", "default")
|
||||
|
||||
def format_enum_values(enum_list):
|
||||
"""
|
||||
|
|
@ -126,11 +125,11 @@ async def fetch_single_param_type(client, param_type_name):
|
|||
return json.loads(values[0].get("value", "{}"))
|
||||
return None
|
||||
|
||||
def show_parameter_types(url, token=None, workspace="default"):
|
||||
def show_parameter_types(url, token=None):
|
||||
"""Show all parameter type definitions."""
|
||||
|
||||
async def _fetch():
|
||||
async with AsyncSocketClient(url, timeout=60, token=token, workspace=workspace) as client:
|
||||
async with AsyncSocketClient(url, timeout=60, token=token) as client:
|
||||
return await fetch_all_param_types(client)
|
||||
|
||||
param_type_names, param_type_defs = asyncio.run(_fetch())
|
||||
|
|
@ -154,11 +153,11 @@ def show_parameter_types(url, token=None, workspace="default"):
|
|||
))
|
||||
print()
|
||||
|
||||
def show_specific_parameter_type(url, param_type_name, token=None, workspace="default"):
|
||||
def show_specific_parameter_type(url, param_type_name, token=None):
|
||||
"""Show a specific parameter type definition."""
|
||||
|
||||
async def _fetch():
|
||||
async with AsyncSocketClient(url, timeout=60, token=token, workspace=workspace) as client:
|
||||
async with AsyncSocketClient(url, timeout=60, token=token) as client:
|
||||
return await fetch_single_param_type(client, param_type_name)
|
||||
|
||||
param_type_def = asyncio.run(_fetch())
|
||||
|
|
@ -194,12 +193,6 @@ def main():
|
|||
help='Authentication token (default: $TRUSTGRAPH_TOKEN)',
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-w', '--workspace',
|
||||
default=default_workspace,
|
||||
help=f'Workspace (default: {default_workspace})',
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-t', '--type',
|
||||
help='Show only the specified parameter type',
|
||||
|
|
@ -209,9 +202,9 @@ def main():
|
|||
|
||||
try:
|
||||
if args.type:
|
||||
show_specific_parameter_type(args.api_url, args.type, args.token, workspace=args.workspace)
|
||||
show_specific_parameter_type(args.api_url, args.type, args.token)
|
||||
else:
|
||||
show_parameter_types(args.api_url, args.token, workspace=args.workspace)
|
||||
show_parameter_types(args.api_url, args.token)
|
||||
|
||||
except Exception as e:
|
||||
print("Exception:", e, flush=True)
|
||||
|
|
|
|||
|
|
@ -83,8 +83,7 @@ class Processor(AsyncProcessor):
|
|||
host=cassandra_host,
|
||||
username=cassandra_username,
|
||||
password=cassandra_password,
|
||||
default_keyspace="config",
|
||||
replication_factor=params.get("cassandra_replication_factor"),
|
||||
default_keyspace="config"
|
||||
)
|
||||
|
||||
# Store resolved configuration
|
||||
|
|
|
|||
|
|
@ -1,7 +1,6 @@
|
|||
|
||||
from .. schema import KnowledgeResponse, Error, Triples, GraphEmbeddings
|
||||
from .. schema import DocumentEmbeddings, LibraryMetadata, LibraryBlob
|
||||
from .. schema import LibrarianRequest, DocumentMetadata
|
||||
from .. schema import DocumentEmbeddings
|
||||
from .. knowledge import hash
|
||||
from .. exceptions import RequestError
|
||||
from .. tables.knowledge import KnowledgeTableStore
|
||||
|
|
@ -19,7 +18,7 @@ class KnowledgeManager:
|
|||
|
||||
def __init__(
|
||||
self, cassandra_host, cassandra_username, cassandra_password,
|
||||
keyspace, flow_config, librarian=None, replication_factor=1,
|
||||
keyspace, flow_config, replication_factor=1,
|
||||
):
|
||||
|
||||
self.table_store = KnowledgeTableStore(
|
||||
|
|
@@ -27,9 +26,6 @@ class KnowledgeManager:
|
|||
replication_factor
|
||||
)
|
||||
|
||||
self.librarian = librarian
|
||||
self._pending_library_metadata = {}
|
||||
|
||||
self.loader_queue = asyncio.Queue(maxsize=20)
|
||||
self.background_task = None
|
||||
self.flow_config = flow_config
|
||||
|
|
@@ -90,9 +86,6 @@ class KnowledgeManager:
|
|||
publish_ge,
|
||||
)
|
||||
|
||||
if self.librarian:
|
||||
await self._stream_library_docs(request.id, respond)
|
||||
|
||||
logger.debug("Knowledge core retrieval complete")
|
||||
|
||||
await respond(
|
||||
|
|
@@ -129,12 +122,6 @@ class KnowledgeManager:
|
|||
workspace, request.graph_embeddings
|
||||
)
|
||||
|
||||
if request.library_metadata and self.librarian:
|
||||
await self._put_library_metadata(request.library_metadata, workspace)
|
||||
|
||||
if request.library_blob and self.librarian:
|
||||
await self._put_library_blob(request.library_blob, workspace)
|
||||
|
||||
await respond(
|
||||
KnowledgeResponse(
|
||||
error = None,
|
||||
|
|
@@ -263,112 +250,6 @@ class KnowledgeManager:
|
|||
|
||||
await self.loader_queue.put((request, respond, workspace))
|
||||
|
||||
async def _stream_library_docs(self, document_id, respond):
|
||||
|
||||
try:
|
||||
root_meta = await self.librarian.fetch_document_metadata(
|
||||
document_id
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not fetch library metadata for {document_id}: {e}")
|
||||
return
|
||||
|
||||
if root_meta is None:
|
||||
return
|
||||
|
||||
await self._stream_one_doc(root_meta, respond)
|
||||
|
||||
try:
|
||||
resp = await self.librarian.request(
|
||||
LibrarianRequest(
|
||||
operation="list-children",
|
||||
document_id=document_id,
|
||||
)
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not list children for {document_id}: {e}")
|
||||
return
|
||||
|
||||
for child_meta in resp.document_metadatas:
|
||||
await self._stream_one_doc(child_meta, respond)
|
||||
|
||||
async def _stream_one_doc(self, doc_meta, respond):
|
||||
|
||||
lm = LibraryMetadata(
|
||||
id=doc_meta.id,
|
||||
kind=doc_meta.kind,
|
||||
title=doc_meta.title,
|
||||
parent_id=doc_meta.parent_id,
|
||||
document_type=doc_meta.document_type,
|
||||
comments=doc_meta.comments,
|
||||
tags=doc_meta.tags or [],
|
||||
)
|
||||
|
||||
await respond(
|
||||
KnowledgeResponse(library_metadata=lm)
|
||||
)
|
||||
|
||||
try:
|
||||
content = await self.librarian.fetch_document_content(
|
||||
doc_meta.id
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not fetch content for {doc_meta.id}: {e}")
|
||||
return
|
||||
|
||||
await respond(
|
||||
KnowledgeResponse(
|
||||
library_blob=LibraryBlob(
|
||||
id=doc_meta.id,
|
||||
data=content,
|
||||
)
|
||||
)
|
||||
)
|
||||
|
||||
async def _put_library_metadata(self, lm, workspace):
|
||||
self._pending_library_metadata[lm.id] = lm
|
||||
|
||||
async def _put_library_blob(self, lb, workspace):
|
||||
|
||||
lm = self._pending_library_metadata.pop(lb.id, None)
|
||||
if lm is None:
|
||||
logger.warning(
|
||||
f"Received library blob for {lb.id} with no preceding metadata"
|
||||
)
|
||||
return
|
||||
|
||||
doc_meta = DocumentMetadata(
|
||||
id=lm.id,
|
||||
kind=lm.kind,
|
||||
title=lm.title,
|
||||
parent_id=lm.parent_id,
|
||||
document_type=lm.document_type,
|
||||
comments=lm.comments,
|
||||
tags=lm.tags or [],
|
||||
)
|
||||
|
||||
if lm.parent_id:
|
||||
operation = "add-child-document"
|
||||
else:
|
||||
operation = "add-document"
|
||||
|
||||
try:
|
||||
await self.librarian.request(
|
||||
LibrarianRequest(
|
||||
operation=operation,
|
||||
document_id=lm.id,
|
||||
document_metadata=doc_meta,
|
||||
content=lb.data,
|
||||
)
|
||||
)
|
||||
except RuntimeError as e:
|
||||
if "already exists" in str(e):
|
||||
logger.debug(f"Library document {lm.id} already exists, skipping")
|
||||
else:
|
||||
logger.warning(f"Could not save library document {lm.id}: {e}")
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not save library document {lm.id}: {e}")
|
||||
|
||||
async def core_loader(self):
|
||||
|
||||
logger.info("Knowledge background processor running...")
|
||||
|
|
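The `_put_library_metadata` / `_put_library_blob` pair in the hunk above buffers each metadata record by id until the blob with the same id arrives, and drops a blob that has no preceding metadata. A minimal sketch of that pairing pattern follows, with no librarian or schema classes involved. `PendingPairs` and its method names are hypothetical and only illustrate the idea; they are not part of the repository.

```python
class PendingPairs:
    """Hold metadata records until the blob with the same id arrives.

    Hypothetical helper, used here only to illustrate the buffering pattern.
    """

    def __init__(self):
        self._pending = {}

    def put_metadata(self, doc_id, metadata):
        # Remember the metadata; nothing is emitted until the blob shows up.
        self._pending[doc_id] = metadata

    def put_blob(self, doc_id, data):
        # pop() ensures each metadata record is consumed at most once.
        metadata = self._pending.pop(doc_id, None)
        if metadata is None:
            # Blob arrived with no preceding metadata: nothing to pair with.
            return None
        return (metadata, data)


pairs = PendingPairs()
pairs.put_metadata("doc-1", {"title": "Example"})
matched = pairs.put_blob("doc-1", b"payload")
orphan = pairs.put_blob("doc-2", b"payload")
```

One consequence of this design is that metadata whose blob never arrives stays in the buffer indefinitely, so a long-running process may want some form of expiry.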
|
|||
|
|
@@ -12,7 +12,6 @@ import logging
|
|||
from .. base import WorkspaceProcessor, Consumer, Producer, Publisher, Subscriber
|
||||
from .. base import ConsumerMetrics, ProducerMetrics
|
||||
from .. base.cassandra_config import add_cassandra_args, resolve_cassandra_config
|
||||
from .. base import LibrarianClient
|
||||
|
||||
from .. schema import KnowledgeRequest, KnowledgeResponse, Error
|
||||
from .. schema import knowledge_request_queue, knowledge_response_queue
|
||||
|
|
@@ -61,8 +60,7 @@ class Processor(WorkspaceProcessor):
|
|||
host=cassandra_host,
|
||||
username=cassandra_username,
|
||||
password=cassandra_password,
|
||||
default_keyspace="knowledge",
|
||||
replication_factor=params.get("cassandra_replication_factor"),
|
||||
default_keyspace="knowledge"
|
||||
)
|
||||
|
||||
self.cassandra_host = hosts
|
||||
|
|
@@ -79,17 +77,12 @@ class Processor(WorkspaceProcessor):
|
|||
}
|
||||
)
|
||||
|
||||
self.librarian_client = LibrarianClient(
|
||||
id=id, backend=self.pubsub, taskgroup=self.taskgroup,
|
||||
)
|
||||
|
||||
self.knowledge = KnowledgeManager(
|
||||
cassandra_host = self.cassandra_host,
|
||||
cassandra_username = self.cassandra_username,
|
||||
cassandra_password = self.cassandra_password,
|
||||
keyspace = keyspace,
|
||||
flow_config = self,
|
||||
librarian = self.librarian_client,
|
||||
replication_factor = replication_factor,
|
||||
)
|
||||
|
||||
|
|
@@ -163,7 +156,6 @@ class Processor(WorkspaceProcessor):
|
|||
async def start(self):
|
||||
|
||||
await super(Processor, self).start()
|
||||
await self.librarian_client.start()
|
||||
|
||||
async def on_knowledge_config(self, workspace, config, version):
|
||||
|
||||
|
|
|
|||
|
|
@@ -219,14 +219,7 @@ class Processor(FlowProcessor):
|
|||
source_doc_id = v.document_id or v.metadata.id
|
||||
|
||||
# Run OCR, get per-page markdown
|
||||
try:
|
||||
pages = self.ocr(blob)
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
f"Failed to decode PDF {source_doc_id}: "
|
||||
f"{type(e).__name__}: {e}"
|
||||
)
|
||||
return
|
||||
|
||||
for markdown, page_num in pages:
|
||||
|
||||
|
|
|
|||
|
|
@@ -32,10 +32,6 @@ logger = logging.getLogger(__name__)
|
|||
default_ident = "document-decoder"
|
||||
|
||||
|
||||
def _looks_like_pdf(content):
|
||||
return content.lstrip().startswith(b"%PDF-")
|
||||
|
||||
|
||||
class Processor(FlowProcessor):
|
||||
|
||||
def __init__(self, **params):
|
||||
|
|
@@ -98,10 +94,14 @@ class Processor(FlowProcessor):
|
|||
)
|
||||
return
|
||||
|
||||
with tempfile.NamedTemporaryFile(delete_on_close=False, suffix='.pdf') as fp:
|
||||
temp_path = fp.name
|
||||
|
||||
# Check if we should fetch from librarian or use inline data
|
||||
if v.document_id:
|
||||
# Fetch from librarian via Pulsar
|
||||
logger.info(f"Fetching document {v.document_id} from librarian...")
|
||||
fp.close()
|
||||
|
||||
content = await flow.librarian.fetch_document_content(
|
||||
document_id=v.document_id,
|
||||
|
|
@@ -113,21 +113,13 @@ class Processor(FlowProcessor):
|
|||
content = content.encode('utf-8')
|
||||
decoded_content = base64.b64decode(content)
|
||||
|
||||
with open(temp_path, 'wb') as f:
|
||||
f.write(decoded_content)
|
||||
|
||||
logger.info(f"Fetched {len(decoded_content)} bytes from librarian")
|
||||
else:
|
||||
# Use inline data (backward compatibility)
|
||||
decoded_content = base64.b64decode(v.data)
|
||||
|
||||
if not _looks_like_pdf(decoded_content):
|
||||
logger.error(
|
||||
f"Document {v.metadata.id} is not valid PDF content. "
|
||||
f"Ignoring document."
|
||||
)
|
||||
return
|
||||
|
||||
with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as fp:
|
||||
temp_path = fp.name
|
||||
fp.write(decoded_content)
|
||||
fp.write(base64.b64decode(v.data))
|
||||
fp.close()
|
||||
|
||||
global PyPDFLoader
|
||||
|
|
@@ -137,15 +129,7 @@ class Processor(FlowProcessor):
|
|||
)
|
||||
PyPDFLoader = _cls
|
||||
loader = PyPDFLoader(temp_path)
|
||||
try:
|
||||
pages = loader.load()
|
||||
except Exception as e:
|
||||
source_doc_id = v.document_id or v.metadata.id
|
||||
logger.error(
|
||||
f"Failed to decode PDF {source_doc_id}: "
|
||||
f"{type(e).__name__}: {e}"
|
||||
)
|
||||
return
|
||||
|
||||
# Get the source document ID
|
||||
source_doc_id = v.document_id or v.metadata.id
|
||||
|
|
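The `_looks_like_pdf` helper in this file checks the decoded bytes for the `%PDF-` header before anything is written to a temporary file. A small standalone sketch of that check is shown below, using only the standard library. `decode_if_pdf` is a hypothetical wrapper added for illustration; the payloads are made-up test values.

```python
import base64


def looks_like_pdf(content: bytes) -> bool:
    # Tolerate leading ASCII whitespace before the "%PDF-" header.
    return content.lstrip().startswith(b"%PDF-")


def decode_if_pdf(encoded):
    """Base64-decode a payload and return it only if it looks like a PDF.

    Hypothetical wrapper: returns None instead of logging and returning early.
    """
    decoded = base64.b64decode(encoded)
    return decoded if looks_like_pdf(decoded) else None


pdf_payload = base64.b64encode(b"%PDF-1.7\n...")
text_payload = base64.b64encode(b"plain text")
```

The header test is only a cheap pre-filter. A payload can start with `%PDF-` and still fail to parse, which is why a guard around the actual loader call remains useful.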
|
|||
|
|
@@ -6,7 +6,7 @@ import logging
|
|||
from cassandra.cluster import Cluster
|
||||
from cassandra.auth import PlainTextAuthProvider
|
||||
from cassandra.query import BatchStatement, SimpleStatement
|
||||
import ssl
|
||||
from ssl import SSLContext, PROTOCOL_TLSv1_2
|
||||
|
||||
from ..tables.cassandra_async import async_execute
|
||||
|
||||
|
|
@@ -41,15 +41,13 @@ class KnowledgeGraph:
|
|||
|
||||
def __init__(
|
||||
self, hosts=None,
|
||||
keyspace="trustgraph", username=None, password=None,
|
||||
replication_factor=1,
|
||||
keyspace="trustgraph", username=None, password=None
|
||||
):
|
||||
|
||||
if hosts is None:
|
||||
hosts = ["localhost"]
|
||||
|
||||
self.keyspace = keyspace
|
||||
self.replication_factor = replication_factor
|
||||
self.username = username
|
||||
|
||||
# 7-table schema for quads with full query pattern support
|
||||
|
|
@@ -70,7 +68,7 @@ class KnowledgeGraph:
|
|||
self.collection_metadata_table = "collection_metadata"
|
||||
|
||||
if username and password:
|
||||
ssl_context = ssl.create_default_context()
|
||||
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
|
||||
auth_provider = PlainTextAuthProvider(username=username, password=password)
|
||||
self.cluster = Cluster(hosts, auth_provider=auth_provider, ssl_context=ssl_context)
|
||||
else:
|
||||
|
|
@@ -94,7 +92,7 @@ class KnowledgeGraph:
|
|||
create keyspace if not exists {self.keyspace}
|
||||
with replication = {{
|
||||
'class' : 'SimpleStrategy',
|
||||
'replication_factor' : {self.replication_factor}
|
||||
'replication_factor' : 1
|
||||
}};
|
||||
""")
|
||||
|
||||
|
|
@@ -541,15 +539,13 @@ class EntityCentricKnowledgeGraph:
|
|||
|
||||
def __init__(
|
||||
self, hosts=None,
|
||||
keyspace="trustgraph", username=None, password=None,
|
||||
replication_factor=1,
|
||||
keyspace="trustgraph", username=None, password=None
|
||||
):
|
||||
|
||||
if hosts is None:
|
||||
hosts = ["localhost"]
|
||||
|
||||
self.keyspace = keyspace
|
||||
self.replication_factor = replication_factor
|
||||
self.username = username
|
||||
|
||||
# 2-table entity-centric schema
|
||||
|
|
@@ -560,7 +556,7 @@ class EntityCentricKnowledgeGraph:
|
|||
self.collection_metadata_table = "collection_metadata"
|
||||
|
||||
if username and password:
|
||||
ssl_context = ssl.create_default_context()
|
||||
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
|
||||
auth_provider = PlainTextAuthProvider(username=username, password=password)
|
||||
self.cluster = Cluster(hosts, auth_provider=auth_provider, ssl_context=ssl_context)
|
||||
else:
|
||||
|
|
@@ -584,7 +580,7 @@ class EntityCentricKnowledgeGraph:
|
|||
create keyspace if not exists {self.keyspace}
|
||||
with replication = {{
|
||||
'class' : 'SimpleStrategy',
|
||||
'replication_factor' : {self.replication_factor}
|
||||
'replication_factor' : 1
|
||||
}};
|
||||
""")
|
||||
|
||||
|
|
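The two `ssl_context` variants in this file, `ssl.create_default_context()` and `SSLContext(PROTOCOL_TLSv1_2)`, produce contexts with different verification defaults in CPython's `ssl` module. The sketch below prints nothing and only inspects the two contexts; `describe` is a hypothetical helper, and the deprecation warning that recent Python versions emit for `PROTOCOL_TLSv1_2` is suppressed on purpose.

```python
import ssl
import warnings


def describe(ctx):
    # Collect the two settings that decide whether the peer is verified.
    return {
        "verify_mode": ctx.verify_mode,
        "check_hostname": ctx.check_hostname,
    }


# Default context: certificate verification and hostname checking enabled.
default_settings = describe(ssl.create_default_context())

# Bare protocol-specific context: no verification unless configured by hand.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    bare_settings = describe(ssl.SSLContext(ssl.PROTOCOL_TLSv1_2))
```

Whether strict verification is appropriate depends on how the Cassandra cluster's certificates are issued; a cluster using self-signed certificates would need its CA loaded into the default context.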
|
|||
|
|
@@ -73,39 +73,6 @@ class CoreExport:
|
|||
enc = msgpack.packb(msg)
|
||||
await response.write(enc)
|
||||
|
||||
if "library-metadata" in resp:
|
||||
|
||||
data = resp["library-metadata"]
|
||||
msg = (
|
||||
"lm",
|
||||
{
|
||||
"i": data["id"],
|
||||
"k": data.get("kind", ""),
|
||||
"t": data.get("title", ""),
|
||||
"p": data.get("parent-id", ""),
|
||||
"d": data.get("document-type", ""),
|
||||
"c": data.get("comments", ""),
|
||||
"g": data.get("tags", []),
|
||||
}
|
||||
)
|
||||
|
||||
enc = msgpack.packb(msg)
|
||||
await response.write(enc)
|
||||
|
||||
if "library-blob" in resp:
|
||||
|
||||
data = resp["library-blob"]
|
||||
msg = (
|
||||
"lb",
|
||||
{
|
||||
"i": data["id"],
|
||||
"d": data.get("data", b""),
|
||||
}
|
||||
)
|
||||
|
||||
enc = msgpack.packb(msg, use_bin_type=True)
|
||||
await response.write(enc)
|
||||
|
||||
await kr.process(
|
||||
{
|
||||
"operation": "get-kg-core",
|
||||
|
|
|
|||
|
|
@@ -79,39 +79,6 @@ class CoreImport:
|
|||
|
||||
await kr.process(msg)
|
||||
|
||||
elif unpacked[0] == "lm":
|
||||
msg = unpacked[1]
|
||||
msg = {
|
||||
"operation": "put-kg-core",
|
||||
"workspace": workspace,
|
||||
"id": id,
|
||||
"library-metadata": {
|
||||
"id": msg["i"],
|
||||
"kind": msg.get("k", ""),
|
||||
"title": msg.get("t", ""),
|
||||
"parent-id": msg.get("p", ""),
|
||||
"document-type": msg.get("d", ""),
|
||||
"comments": msg.get("c", ""),
|
||||
"tags": msg.get("g", []),
|
||||
}
|
||||
}
|
||||
|
||||
await kr.process(msg)
|
||||
|
||||
elif unpacked[0] == "lb":
|
||||
msg = unpacked[1]
|
||||
msg = {
|
||||
"operation": "put-kg-core",
|
||||
"workspace": workspace,
|
||||
"id": id,
|
||||
"library-blob": {
|
||||
"id": msg["i"],
|
||||
"data": msg.get("d", b""),
|
||||
}
|
||||
}
|
||||
|
||||
await kr.process(msg)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Core import exception: {e}", exc_info=True)
|
||||
await error(str(e))
|
||||
|
|
|
|||
|
|
@@ -3,7 +3,6 @@ import asyncio
|
|||
import uuid
|
||||
import logging
|
||||
from . librarian import LibrarianRequestor
|
||||
from ... schema import librarian_request_queue, librarian_response_queue
|
||||
|
||||
# Module logger
|
||||
logger = logging.getLogger(__name__)
|
||||
|
|
@@ -24,13 +23,10 @@ class DocumentStreamExport:
|
|||
|
||||
response = await ok()
|
||||
|
||||
uid = str(uuid.uuid4())
|
||||
lr = LibrarianRequestor(
|
||||
backend=self.backend,
|
||||
consumer="api-gateway-doc-stream-" + uid,
|
||||
subscriber="api-gateway-doc-stream-" + uid,
|
||||
request_queue=f"{librarian_request_queue}:{workspace}",
|
||||
response_queue=f"{librarian_response_queue}:{workspace}",
|
||||
consumer="api-gateway-doc-stream-" + str(uuid.uuid4()),
|
||||
subscriber="api-gateway-doc-stream-" + str(uuid.uuid4()),
|
||||
)
|
||||
|
||||
try:
|
||||
|
|
|
|||
|
|
@@ -4,8 +4,6 @@ import queue
|
|||
import uuid
|
||||
import logging
|
||||
|
||||
from ..capabilities import PUBLIC, AUTHENTICATED
|
||||
|
||||
# Module logger
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
|
@@ -158,18 +156,15 @@ class Mux:
|
|||
})
|
||||
return
|
||||
|
||||
# Resolve workspace (default-fill from the caller's
|
||||
# bound workspace). Workspace resolution applies to all
|
||||
# operations regardless of capability level.
|
||||
# Resolve workspace first (default-fill from the caller's
|
||||
# bound workspace), then ask the regime to authorise the
|
||||
# service-level capability against the matched
|
||||
# operation's resource shape.
|
||||
try:
|
||||
await enforce_workspace(data, self.identity, self.auth)
|
||||
if isinstance(inner, dict):
|
||||
await enforce_workspace(inner, self.identity, self.auth)
|
||||
|
||||
# Authorisation: capability sentinels short-circuit
|
||||
# the regime call; capability strings go through
|
||||
# authorise().
|
||||
if op.capability not in (PUBLIC, AUTHENTICATED):
|
||||
if data.get("flow"):
|
||||
resource = {
|
||||
"workspace": data.get("workspace", ""),
|
||||
|
|
@@ -178,9 +173,8 @@ class Mux:
|
|||
parameters = {}
|
||||
else:
|
||||
# Build a minimal RequestContext so the matched
|
||||
# operation's own extractors decide resource
|
||||
# and parameters — same path the HTTP
|
||||
# endpoints take.
|
||||
# operation's own extractors decide resource and
|
||||
# parameters — same path the HTTP endpoints take.
|
||||
from ..registry import RequestContext
|
||||
ctx = RequestContext(
|
||||
body=inner if isinstance(inner, dict) else {},
|
||||
|
|
@@ -294,8 +288,6 @@ class Mux:
|
|||
await self.maybe_tidy_workers(workers)
|
||||
|
||||
async def responder(resp, fin):
|
||||
if self.ws is None:
|
||||
return
|
||||
await self.ws.send_json({
|
||||
"id": id,
|
||||
"response": resp,
|
||||
|
|
@@ -329,8 +321,6 @@ class Mux:
|
|||
)
|
||||
|
||||
except Exception as e:
|
||||
if self.ws is None:
|
||||
return
|
||||
await self.ws.send_json({
|
||||
"id": id,
|
||||
"error": {"message": str(e), "type": "error"},
|
||||
|
|
|
|||
|
|
@@ -117,10 +117,8 @@ class SocketEndpoint:
|
|||
|
||||
running = Running()
|
||||
|
||||
params = dict(request.query)
|
||||
params.update(request.match_info)
|
||||
dispatcher = await self.dispatcher(
|
||||
ws, running, params
|
||||
ws, running, request.match_info
|
||||
)
|
||||
|
||||
worker_task = tg.create_task(
|
||||
|
|
|
|||
|
|
@@ -101,7 +101,6 @@ class Processor(AsyncProcessor):
|
|||
username=cassandra_username,
|
||||
password=cassandra_password,
|
||||
default_keyspace="iam",
|
||||
replication_factor=params.get("cassandra_replication_factor"),
|
||||
)
|
||||
|
||||
self.cassandra_host = hosts
|
||||
|
|
|
|||
|
|
@@ -162,9 +162,6 @@ class Librarian:
|
|||
request.document_id
|
||||
)
|
||||
|
||||
if object_id is None:
|
||||
raise RequestError(f"Document not found: {request.document_id}")
|
||||
|
||||
content = await self.blob_store.get(
|
||||
object_id
|
||||
)
|
||||
|
|
|
|||
|
|
@@ -8,7 +8,6 @@ import asyncio
|
|||
import base64
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
from datetime import datetime
|
||||
|
||||
from .. base import WorkspaceProcessor, Consumer, Producer, Publisher, Subscriber
|
||||
|
|
@@ -55,16 +54,6 @@ default_object_store_access_key = "object-user"
|
|||
default_object_store_secret_key = "object-password"
|
||||
default_object_store_use_ssl = False
|
||||
default_object_store_region = None
|
||||
|
||||
# Environment variables consulted as a fallback when the
|
||||
# corresponding params field is not set in the processor-group YAML
|
||||
# or via CLI. Intended for K8s Secret / env-var injection so
|
||||
# credentials never have to live in the YAML (and thus in git).
|
||||
ENV_OBJECT_STORE_ENDPOINT = "OBJECT_STORE_ENDPOINT"
|
||||
ENV_OBJECT_STORE_ACCESS_KEY = "OBJECT_STORE_ACCESS_KEY"
|
||||
ENV_OBJECT_STORE_SECRET_KEY = "OBJECT_STORE_SECRET_KEY"
|
||||
ENV_OBJECT_STORE_USE_SSL = "OBJECT_STORE_USE_SSL"
|
||||
ENV_OBJECT_STORE_REGION = "OBJECT_STORE_REGION"
|
||||
default_cassandra_host = "cassandra"
|
||||
default_min_chunk_size = 1 # No minimum by default (for Garage)
|
||||
|
||||
|
|
@@ -100,36 +89,22 @@ class Processor(WorkspaceProcessor):
|
|||
"config_response_queue", default_config_response_queue
|
||||
)
|
||||
|
||||
# Resolve object-store config. Precedence: explicit params
|
||||
# (CLI / processor-group YAML) → environment variable →
|
||||
# hardcoded default. The env-var path lets K8s Secrets feed
|
||||
# credentials without them appearing in the YAML.
|
||||
object_store_endpoint = (
|
||||
params.get("object_store_endpoint")
|
||||
or os.environ.get(ENV_OBJECT_STORE_ENDPOINT)
|
||||
or default_object_store_endpoint
|
||||
object_store_endpoint = params.get("object_store_endpoint", default_object_store_endpoint)
|
||||
object_store_access_key = params.get(
|
||||
"object_store_access_key",
|
||||
default_object_store_access_key
|
||||
)
|
||||
object_store_access_key = (
|
||||
params.get("object_store_access_key")
|
||||
or os.environ.get(ENV_OBJECT_STORE_ACCESS_KEY)
|
||||
or default_object_store_access_key
|
||||
object_store_secret_key = params.get(
|
||||
"object_store_secret_key",
|
||||
default_object_store_secret_key
|
||||
)
|
||||
object_store_secret_key = (
|
||||
params.get("object_store_secret_key")
|
||||
or os.environ.get(ENV_OBJECT_STORE_SECRET_KEY)
|
||||
or default_object_store_secret_key
|
||||
object_store_use_ssl = params.get(
|
||||
"object_store_use_ssl",
|
||||
default_object_store_use_ssl
|
||||
)
|
||||
object_store_use_ssl = params.get("object_store_use_ssl")
|
||||
if object_store_use_ssl is None:
|
||||
env_ssl = os.environ.get(ENV_OBJECT_STORE_USE_SSL)
|
||||
if env_ssl is not None:
|
||||
object_store_use_ssl = env_ssl.lower() in ("true", "1", "yes")
|
||||
else:
|
||||
object_store_use_ssl = default_object_store_use_ssl
|
||||
object_store_region = (
|
||||
params.get("object_store_region")
|
||||
or os.environ.get(ENV_OBJECT_STORE_REGION)
|
||||
or default_object_store_region
|
||||
object_store_region = params.get(
|
||||
"object_store_region",
|
||||
default_object_store_region
|
||||
)
|
||||
|
||||
min_chunk_size = params.get(
|
||||
|
|
@@ -146,8 +121,7 @@ class Processor(WorkspaceProcessor):
|
|||
host=cassandra_host,
|
||||
username=cassandra_username,
|
||||
password=cassandra_password,
|
||||
default_keyspace="librarian",
|
||||
replication_factor=params.get("cassandra_replication_factor"),
|
||||
default_keyspace="librarian"
|
||||
)
|
||||
|
||||
# Store resolved configuration
|
||||
|
|
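The object-store comments in this file describe a three-step precedence: explicit params, then an environment variable, then a hardcoded default. The sketch below shows one way to express that chain in isolation. `resolve_str` and `resolve_bool` are hypothetical helper names, the endpoint values are invented, and the environment is passed in as a plain dict so the example does not depend on the real process environment.

```python
import os


def resolve_str(params, key, env_var, default, environ=os.environ):
    # `or` chain: a missing or empty value falls through to the next source.
    return params.get(key) or environ.get(env_var) or default


def resolve_bool(params, key, env_var, default, environ=os.environ):
    # Booleans need an explicit None check, otherwise an explicit False
    # in params would fall through to the environment.
    value = params.get(key)
    if value is not None:
        return value
    env_value = environ.get(env_var)
    if env_value is not None:
        return env_value.lower() in ("true", "1", "yes")
    return default


fake_env = {"OBJECT_STORE_ENDPOINT": "store.internal:9000",
            "OBJECT_STORE_USE_SSL": "Yes"}

from_params = resolve_str({"object_store_endpoint": "cli-host:9000"},
                          "object_store_endpoint", "OBJECT_STORE_ENDPOINT",
                          "default-host:9000", environ=fake_env)
from_env = resolve_str({}, "object_store_endpoint", "OBJECT_STORE_ENDPOINT",
                       "default-host:9000", environ=fake_env)
from_default = resolve_str({}, "object_store_region", "OBJECT_STORE_REGION",
                           None, environ=fake_env)
ssl_from_env = resolve_bool({}, "object_store_use_ssl",
                            "OBJECT_STORE_USE_SSL", False, environ=fake_env)
ssl_explicit_false = resolve_bool({"object_store_use_ssl": False},
                                  "object_store_use_ssl",
                                  "OBJECT_STORE_USE_SSL", True,
                                  environ=fake_env)
```

Passing `environ` explicitly is a testing convenience only; in a deployed service the default `os.environ` would be used, which is what lets injected secrets supply credentials without a YAML entry.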
|
|||
|
|
@@ -12,33 +12,31 @@ from qdrant_client import QdrantClient
|
|||
from .... schema import DocumentEmbeddingsResponse, ChunkMatch
|
||||
from .... schema import Error
|
||||
from .... base import DocumentEmbeddingsQueryService
|
||||
from .... base.qdrant_config import add_qdrant_args, resolve_qdrant_config
|
||||
|
||||
# Module logger
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
default_ident = "doc-embeddings-query"
|
||||
|
||||
default_store_uri = 'http://localhost:6333'
|
||||
|
||||
class Processor(DocumentEmbeddingsQueryService):
|
||||
|
||||
def __init__(self, **params):
|
||||
|
||||
store_uri = params.get("store_uri")
|
||||
api_key = params.get("api_key")
|
||||
store_uri = params.get("store_uri", default_store_uri)
|
||||
|
||||
url, api_key, _, _ = resolve_qdrant_config(
|
||||
url=store_uri,
|
||||
api_key=api_key,
|
||||
)
|
||||
#optional api key
|
||||
api_key = params.get("api_key", None)
|
||||
|
||||
super(Processor, self).__init__(
|
||||
**params | {
|
||||
"store_uri": url,
|
||||
"store_uri": store_uri,
|
||||
"api_key": api_key,
|
||||
}
|
||||
)
|
||||
|
||||
self.qdrant = QdrantClient(url=url, api_key=api_key)
|
||||
self.qdrant = QdrantClient(url=store_uri, api_key=api_key)
|
||||
|
||||
async def query_document_embeddings(self, workspace, msg):
|
||||
|
||||
|
|
@@ -87,7 +85,18 @@ class Processor(DocumentEmbeddingsQueryService):
|
|||
def add_args(parser):
|
||||
|
||||
DocumentEmbeddingsQueryService.add_args(parser)
|
||||
add_qdrant_args(parser)
|
||||
|
||||
parser.add_argument(
|
||||
'-t', '--store-uri',
|
||||
default=default_store_uri,
|
||||
help=f'Qdrant store URI (default: {default_store_uri})'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-k', '--api-key',
|
||||
default=None,
|
||||
help=f'API key for qdrant (default: None)'
|
||||
)
|
||||
|
||||
def run():
|
||||
|
||||
|
|
|
|||
|
|
@@ -12,32 +12,31 @@ from qdrant_client import QdrantClient
|
|||
from .... schema import GraphEmbeddingsResponse, EntityMatch
|
||||
from .... schema import Error, Term, IRI, LITERAL
|
||||
from .... base import GraphEmbeddingsQueryService
|
||||
from .... base.qdrant_config import add_qdrant_args, resolve_qdrant_config
|
||||
|
||||
# Module logger
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
default_ident = "graph-embeddings-query"
|
||||
|
||||
default_store_uri = 'http://localhost:6333'
|
||||
|
||||
class Processor(GraphEmbeddingsQueryService):
|
||||
|
||||
def __init__(self, **params):
|
||||
|
||||
store_uri = params.get("store_uri")
|
||||
api_key = params.get("api_key")
|
||||
store_uri = params.get("store_uri", default_store_uri)
|
||||
|
||||
url, api_key, _, _ = resolve_qdrant_config(
|
||||
url=store_uri, api_key=api_key,
|
||||
)
|
||||
#optional api key
|
||||
api_key = params.get("api_key", None)
|
||||
|
||||
super(Processor, self).__init__(
|
||||
**params | {
|
||||
"store_uri": url,
|
||||
"store_uri": store_uri,
|
||||
"api_key": api_key,
|
||||
}
|
||||
)
|
||||
|
||||
self.qdrant = QdrantClient(url=url, api_key=api_key)
|
||||
self.qdrant = QdrantClient(url=store_uri, api_key=api_key)
|
||||
|
||||
def create_value(self, ent):
|
||||
if ent.startswith("http://") or ent.startswith("https://"):
|
||||
|
|
@@ -105,7 +104,18 @@ class Processor(GraphEmbeddingsQueryService):
|
|||
def add_args(parser):
|
||||
|
||||
GraphEmbeddingsQueryService.add_args(parser)
|
||||
add_qdrant_args(parser)
|
||||
|
||||
parser.add_argument(
|
||||
'-t', '--store-uri',
|
||||
default=default_store_uri,
|
||||
help=f'Qdrant store URI (default: {default_store_uri})'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-k', '--api-key',
|
||||
default=None,
|
||||
help=f'API key for qdrant (default: None)'
|
||||
)
|
||||
|
||||
def run():
|
||||
|
||||
|
|
|
|||
|
|
@@ -116,7 +116,7 @@ class CassandraTripleStore(Store if RDFLIB_AVAILABLE else object):
|
|||
# Create keyspace
|
||||
self.session.execute(f"""
|
||||
CREATE KEYSPACE IF NOT EXISTS {self.keyspace}
|
||||
WITH replication = {{'class': 'SimpleStrategy', 'replication_factor': {self.cassandra_config.get('replication_factor', 1)}}}
|
||||
WITH replication = {{'class': 'SimpleStrategy', 'replication_factor': 1}}
|
||||
""")
|
||||
|
||||
# Create triples table optimized for SPARQL queries
|
||||
|
|
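Several hunks in this comparison differ only in whether the `SimpleStrategy` replication factor in `CREATE KEYSPACE` is fixed at 1 or taken from configuration. A minimal sketch of building that statement with a configurable factor is given below. `keyspace_cql` is a hypothetical helper, and the identifier check is an added assumption, since keyspace names cannot be bound as query parameters and end up interpolated into the statement text.

```python
import re

# Assumed-safe shape for a keyspace name; an illustrative restriction only.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def keyspace_cql(keyspace, replication_factor=1):
    """Build a CREATE KEYSPACE statement for SimpleStrategy replication."""
    if not _IDENTIFIER.match(keyspace):
        raise ValueError(f"unsafe keyspace name: {keyspace!r}")
    replication_factor = int(replication_factor)
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # Doubled braces produce literal { and } inside the f-string.
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}}"
    )


statement = keyspace_cql("knowledge", 3)
```

Because the statement uses `IF NOT EXISTS`, changing the factor later has no effect on a keyspace that already exists; that would take a separate `ALTER KEYSPACE`.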
|
|||
|
|
@@ -19,12 +19,12 @@ from .... schema import (
|
|||
RowIndexMatch, Error
|
||||
)
|
||||
from .... base import FlowProcessor, ConsumerSpec, ProducerSpec
|
||||
from .... base.qdrant_config import add_qdrant_args, resolve_qdrant_config
|
||||
|
||||
# Module logger
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
default_ident = "row-embeddings-query"
|
||||
default_store_uri = 'http://localhost:6333'
|
||||
default_concurrency = 10
|
||||
|
||||
|
||||
|
|
@@ -35,17 +35,13 @@ class Processor(FlowProcessor):
|
|||
id = params.get("id", default_ident)
|
||||
concurrency = params.get("concurrency", default_concurrency)
|
||||
|
||||
store_uri = params.get("store_uri")
|
||||
api_key = params.get("api_key")
|
||||
|
||||
url, api_key, _, _ = resolve_qdrant_config(
|
||||
url=store_uri, api_key=api_key,
|
||||
)
|
||||
store_uri = params.get("store_uri", default_store_uri)
|
||||
api_key = params.get("api_key", None)
|
||||
|
||||
super(Processor, self).__init__(
|
||||
**params | {
|
||||
"id": id,
|
||||
"store_uri": url,
|
||||
"store_uri": store_uri,
|
||||
"api_key": api_key,
|
||||
}
|
||||
)
|
||||
|
|
@@ -66,7 +62,7 @@ class Processor(FlowProcessor):
|
|||
)
|
||||
)
|
||||
|
||||
self.qdrant = QdrantClient(url=url, api_key=api_key)
|
||||
self.qdrant = QdrantClient(url=store_uri, api_key=api_key)
|
||||
|
||||
def sanitize_name(self, name: str) -> str:
|
||||
"""Sanitize names for Qdrant collection naming"""
|
||||
|
|
@@ -196,9 +192,21 @@ class Processor(FlowProcessor):
|
|||
|
||||
@staticmethod
|
||||
def add_args(parser):
|
||||
"""Add command-line arguments"""
|
||||
|
||||
FlowProcessor.add_args(parser)
|
||||
add_qdrant_args(parser)
|
||||
|
||||
parser.add_argument(
|
||||
'-t', '--store-uri',
|
||||
default=default_store_uri,
|
||||
help=f'Qdrant store URI (default: {default_store_uri})'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-k', '--api-key',
|
||||
default=None,
|
||||
help='API key for Qdrant (default: None)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-c', '--concurrency',
|
||||
|
|
|
|||
|
|
@@ -24,7 +24,7 @@ from .... schema import RowsQueryRequest, RowsQueryResponse, GraphQLError
|
|||
from .... schema import Error, RowSchema, Field as SchemaField
|
||||
from .... base import FlowProcessor, ConsumerSpec, ProducerSpec
|
||||
from .... base.cassandra_config import add_cassandra_args, resolve_cassandra_config
|
||||
from .... tables.cassandra_async import async_execute, async_execute_paged, async_scan
|
||||
from .... tables.cassandra_async import async_execute
|
||||
|
||||
from ... graphql import GraphQLSchemaBuilder, SortDirection
|
||||
|
||||
|
|
@@ -180,7 +180,7 @@ class Processor(FlowProcessor):
|
|||
description=field_def.get("description", ""),
|
||||
required=field_def.get("required", False),
|
||||
enum_values=field_def.get("enum", []),
|
||||
indexed=field_def.get("indexed", False),
|
||||
indexed=field_def.get("indexed", False)
|
||||
)
|
||||
fields.append(field)
|
||||
|
||||
|
|
@@ -232,8 +232,6 @@ class Processor(FlowProcessor):
|
|||
for index_name in index_names:
|
||||
if index_name in filters:
|
||||
value = filters[index_name]
|
||||
if value == "" or value is None:
|
||||
continue
|
||||
# Single field index -> single element list
|
||||
index_value = [str(value)]
|
||||
return (index_name, index_value)
|
||||
|
|
@@ -284,11 +282,9 @@ class Processor(FlowProcessor):
|
|||
query += f" LIMIT {limit}"
|
||||
|
||||
try:
|
||||
pages = await async_execute_paged(
|
||||
self.session, query, params
|
||||
)
|
||||
for page in pages:
|
||||
for row in page:
|
||||
rows = await async_execute(self.session, query, params)
|
||||
for row in rows:
|
||||
# Convert data map to dict with proper field names
|
||||
row_dict = dict(row.data) if row.data else {}
|
||||
results.append(row_dict)
|
||||
except Exception as e:
|
||||
|
|
@@ -312,6 +308,8 @@ class Processor(FlowProcessor):
|
|||
# Query using the first index (arbitrary choice for scan)
|
||||
primary_index = index_names[0]
|
||||
|
||||
# We need to scan all values for this index
|
||||
# This requires ALLOW FILTERING or a different approach
|
||||
query = f"""
|
||||
SELECT data, source FROM {safe_keyspace}.rows
|
||||
WHERE collection = %s
|
||||
|
|
@@ -322,19 +320,18 @@ class Processor(FlowProcessor):
|
|||
params = [collection, schema_name, primary_index]
|
||||
|
||||
try:
|
||||
def row_filter(row):
|
||||
row_dict = dict(row.data) if row.data else {}
|
||||
return self._matches_filters(row_dict, filters, row_schema)
|
||||
rows = await async_execute(self.session, query, params)
|
||||
|
||||
matched_rows = await async_scan(
|
||||
self.session, query, params,
|
||||
row_filter=row_filter,
|
||||
limit=limit,
|
||||
)
|
||||
for row in matched_rows:
|
||||
for row in rows:
|
||||
row_dict = dict(row.data) if row.data else {}
|
||||
|
||||
# Apply post-filters
|
||||
if self._matches_filters(row_dict, filters, row_schema):
|
||||
results.append(row_dict)
|
||||
|
||||
if limit and len(results) >= limit:
|
||||
break
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to scan rows: {e}", exc_info=True)
|
||||
raise
|
||||
|
|
@@ -366,7 +363,7 @@ class Processor(FlowProcessor):
|
|||
# Parse filter key for operator
|
||||
if '_' in filter_key:
|
||||
parts = filter_key.rsplit('_', 1)
|
||||
if parts[1] in ['gt', 'gte', 'lt', 'lte', 'contains', 'in', 'not', 'startsWith', 'endsWith', 'not_in']:
|
||||
if parts[1] in ['gt', 'gte', 'lt', 'lte', 'contains', 'in']:
|
||||
field_name = parts[0]
|
||||
operator = parts[1]
|
||||
else:
|
||||
|
|
@@ -403,18 +400,6 @@ class Processor(FlowProcessor):
|
|||
elif operator == 'in':
|
||||
if str(row_value) not in [str(v) for v in filter_value]:
|
||||
return False
|
||||
elif operator == 'not':
|
||||
if str(row_value) == str(filter_value):
|
||||
return False
|
||||
elif operator == 'startsWith':
|
||||
if not str(row_value).startswith(str(filter_value)):
|
||||
return False
|
||||
elif operator == 'endsWith':
|
||||
if not str(row_value).endswith(str(filter_value)):
|
||||
return False
|
||||
elif operator == 'not_in':
|
||||
if str(row_value) in [str(v) for v in filter_value]:
|
||||
return False
|
||||
except (ValueError, TypeError):
|
||||
return False
|
||||
|
||||
|
|
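The filter code in this file derives an operator from a key suffix such as `_gte` or `_startsWith`, splitting on the last underscore. With `rsplit('_', 1)`, a key such as `status_not_in` splits into `status_not` and `in`, so an operator that itself contains an underscore needs different handling. The sketch below tries the longest suffix first. `split_filter_key` is a hypothetical function, and the `"eq"` fallback name is an assumption about how plain equality might be labelled.

```python
# Operator suffixes taken from the list that appears in the diff.
OPERATORS = ("gt", "gte", "lt", "lte", "contains", "in",
             "not", "startsWith", "endsWith", "not_in")


def split_filter_key(filter_key):
    """Split 'field_operator' into (field, operator).

    Longest suffix wins, so 'status_not_in' resolves to ('status', 'not_in')
    rather than ('status_not', 'in'). Keys without a known suffix are treated
    as equality filters on the whole key.
    """
    for op in sorted(OPERATORS, key=len, reverse=True):
        suffix = "_" + op
        # Require a non-empty field name in front of the suffix.
        if filter_key.endswith(suffix) and len(filter_key) > len(suffix):
            return filter_key[:-len(suffix)], op
    return filter_key, "eq"


examples = {key: split_filter_key(key)
            for key in ("age_gte", "status_not_in", "first_name",
                        "name_startsWith")}
```

Suffix matching stays ambiguous for fields whose own name ends in an operator word, for example a field called `check_in`; resolving that would need the schema's field names, which this sketch does not consult.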
|
|||
|
|
@@ -14,36 +14,29 @@ from qdrant_client.models import Distance, VectorParams
|
|||
from .... base import DocumentEmbeddingsStoreService, CollectionConfigHandler
|
||||
from .... base import AsyncProcessor, Consumer, Producer
|
||||
from .... base import ConsumerMetrics, ProducerMetrics
|
||||
from .... base.qdrant_config import add_qdrant_args, resolve_qdrant_config
|
||||
|
||||
# Module logger
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
default_ident = "doc-embeddings-write"
|
||||
|
||||
default_store_uri = 'http://localhost:6333'
|
||||
|
||||
class Processor(CollectionConfigHandler, DocumentEmbeddingsStoreService):
|
||||
|
||||
def __init__(self, **params):
|
||||
|
||||
store_uri = params.get("store_uri")
|
||||
api_key = params.get("api_key")
|
||||
|
||||
url, api_key, replication_factor, shard_number = resolve_qdrant_config(
|
||||
url=store_uri, api_key=api_key,
|
||||
replication_factor=params.get("qdrant_replication_factor"),
|
||||
shard_number=params.get("qdrant_shard_number"),
|
||||
)
|
||||
store_uri = params.get("store_uri", default_store_uri)
|
||||
api_key = params.get("api_key", None)
|
||||
|
||||
super(Processor, self).__init__(
|
||||
**params | {
|
||||
"store_uri": url,
|
||||
"store_uri": store_uri,
|
||||
"api_key": api_key,
|
||||
}
|
||||
)
|
||||
|
||||
self.qdrant = QdrantClient(url=url, api_key=api_key)
|
||||
self.replication_factor = replication_factor
|
||||
self.shard_number = shard_number
|
||||
self.qdrant = QdrantClient(url=store_uri, api_key=api_key)
|
||||
self._cache_lock = asyncio.Lock()
|
||||
self._known_collections: set[str] = set()
|
||||
|
||||
|
|
@ -68,8 +61,6 @@ class Processor(CollectionConfigHandler, DocumentEmbeddingsStoreService):
|
|||
vectors_config=VectorParams(
|
||||
size=dim, distance=Distance.COSINE
|
||||
),
|
||||
replication_factor=self.replication_factor,
|
||||
shard_number=self.shard_number,
|
||||
)
|
||||
self._known_collections.add(collection_name)
|
||||
|
||||
|
|
@ -118,7 +109,18 @@ class Processor(CollectionConfigHandler, DocumentEmbeddingsStoreService):
|
|||
def add_args(parser):
|
||||
|
||||
DocumentEmbeddingsStoreService.add_args(parser)
|
||||
add_qdrant_args(parser)
|
||||
|
||||
parser.add_argument(
|
||||
'-t', '--store-uri',
|
||||
default=default_store_uri,
|
||||
help=f'Qdrant URI (default: {default_store_uri})'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-k', '--api-key',
|
||||
default=None,
|
||||
help=f'Qdrant API key (default: None)'
|
||||
)
|
||||
|
||||
async def create_collection(self, workspace: str, collection: str, metadata: dict):
|
||||
"""
|
||||
|
|
|
|||
|
|
@ -14,7 +14,6 @@ from qdrant_client.models import Distance, VectorParams
|
|||
from .... base import GraphEmbeddingsStoreService, CollectionConfigHandler
|
||||
from .... base import AsyncProcessor, Consumer, Producer
|
||||
from .... base import ConsumerMetrics, ProducerMetrics
|
||||
from .... base.qdrant_config import add_qdrant_args, resolve_qdrant_config
|
||||
from .... schema import IRI, LITERAL
|
||||
|
||||
# Module logger
|
||||
|
|
@ -30,34 +29,29 @@ def get_term_value(term):
|
|||
elif term.type == LITERAL:
|
||||
return term.value
|
||||
else:
|
||||
# For blank nodes or other types, use id or value
|
||||
return term.id or term.value
|
||||
|
||||
|
||||
default_ident = "graph-embeddings-write"
|
||||
|
||||
default_store_uri = 'http://localhost:6333'
|
||||
|
||||
class Processor(CollectionConfigHandler, GraphEmbeddingsStoreService):
|
||||
|
||||
def __init__(self, **params):
|
||||
|
||||
store_uri = params.get("store_uri")
|
||||
api_key = params.get("api_key")
|
||||
|
||||
url, api_key, replication_factor, shard_number = resolve_qdrant_config(
|
||||
url=store_uri, api_key=api_key,
|
||||
replication_factor=params.get("qdrant_replication_factor"),
|
||||
shard_number=params.get("qdrant_shard_number"),
|
||||
)
|
||||
store_uri = params.get("store_uri", default_store_uri)
|
||||
api_key = params.get("api_key", None)
|
||||
|
||||
super(Processor, self).__init__(
|
||||
**params | {
|
||||
"store_uri": url,
|
||||
"store_uri": store_uri,
|
||||
"api_key": api_key,
|
||||
}
|
||||
)
|
||||
|
||||
self.qdrant = QdrantClient(url=url, api_key=api_key)
|
||||
self.replication_factor = replication_factor
|
||||
self.shard_number = shard_number
|
||||
self.qdrant = QdrantClient(url=store_uri, api_key=api_key)
|
||||
self._cache_lock = asyncio.Lock()
|
||||
self._known_collections: set[str] = set()
|
||||
|
||||
|
|
@ -82,8 +76,6 @@ class Processor(CollectionConfigHandler, GraphEmbeddingsStoreService):
|
|||
vectors_config=VectorParams(
|
||||
size=dim, distance=Distance.COSINE
|
||||
),
|
||||
replication_factor=self.replication_factor,
|
||||
shard_number=self.shard_number,
|
||||
)
|
||||
self._known_collections.add(collection_name)
|
||||
|
||||
|
|
@ -136,7 +128,18 @@ class Processor(CollectionConfigHandler, GraphEmbeddingsStoreService):
|
|||
def add_args(parser):
|
||||
|
||||
GraphEmbeddingsStoreService.add_args(parser)
|
||||
add_qdrant_args(parser)
|
||||
|
||||
parser.add_argument(
|
||||
'-t', '--store-uri',
|
||||
default=default_store_uri,
|
||||
help=f'Qdrant store URI (default: {default_store_uri})'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-k', '--api-key',
|
||||
default=None,
|
||||
help=f'Qdrant API key'
|
||||
)
|
||||
|
||||
async def create_collection(self, workspace: str, collection: str, metadata: dict):
|
||||
"""
|
||||
|
|
|
|||
|
|
@ -27,8 +27,7 @@ class Processor(FlowProcessor):
|
|||
host=params.get("cassandra_host"),
|
||||
username=params.get("cassandra_username"),
|
||||
password=params.get("cassandra_password"),
|
||||
default_keyspace='knowledge',
|
||||
replication_factor=params.get("cassandra_replication_factor"),
|
||||
default_keyspace='knowledge'
|
||||
)
|
||||
|
||||
super(Processor, self).__init__(
|
||||
|
|
|
|||
|
|
@ -27,12 +27,12 @@ from qdrant_client.models import PointStruct, Distance, VectorParams
|
|||
from .... schema import RowEmbeddings
|
||||
from .... base import FlowProcessor, ConsumerSpec
|
||||
from .... base import CollectionConfigHandler
|
||||
from .... base.qdrant_config import add_qdrant_args, resolve_qdrant_config
|
||||
|
||||
# Module logger
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
default_ident = "row-embeddings-write"
|
||||
default_store_uri = 'http://localhost:6333'
|
||||
|
||||
|
||||
class Processor(CollectionConfigHandler, FlowProcessor):
|
||||
|
|
@ -41,19 +41,13 @@ class Processor(CollectionConfigHandler, FlowProcessor):
|
|||
|
||||
id = params.get("id", default_ident)
|
||||
|
||||
store_uri = params.get("store_uri")
|
||||
api_key = params.get("api_key")
|
||||
|
||||
url, api_key, replication_factor, shard_number = resolve_qdrant_config(
|
||||
url=store_uri, api_key=api_key,
|
||||
replication_factor=params.get("qdrant_replication_factor"),
|
||||
shard_number=params.get("qdrant_shard_number"),
|
||||
)
|
||||
store_uri = params.get("store_uri", default_store_uri)
|
||||
api_key = params.get("api_key", None)
|
||||
|
||||
super(Processor, self).__init__(
|
||||
**params | {
|
||||
"id": id,
|
||||
"store_uri": url,
|
||||
"store_uri": store_uri,
|
||||
"api_key": api_key,
|
||||
}
|
||||
)
|
||||
|
|
@ -69,9 +63,7 @@ class Processor(CollectionConfigHandler, FlowProcessor):
|
|||
# Register config handler for collection management
|
||||
self.register_config_handler(self.on_collection_config, types=["collection"])
|
||||
|
||||
self.qdrant = QdrantClient(url=url, api_key=api_key)
|
||||
self.replication_factor = replication_factor
|
||||
self.shard_number = shard_number
|
||||
self.qdrant = QdrantClient(url=store_uri, api_key=api_key)
|
||||
self._cache_lock = asyncio.Lock()
|
||||
self._known_collections: set[str] = set()
|
||||
|
||||
|
|
@ -111,8 +103,6 @@ class Processor(CollectionConfigHandler, FlowProcessor):
|
|||
size=dimension,
|
||||
distance=Distance.COSINE
|
||||
),
|
||||
replication_factor=self.replication_factor,
|
||||
shard_number=self.shard_number,
|
||||
)
|
||||
self._known_collections.add(collection_name)
|
||||
|
||||
|
|
@ -259,9 +249,21 @@ class Processor(CollectionConfigHandler, FlowProcessor):
|
|||
|
||||
@staticmethod
|
||||
def add_args(parser):
|
||||
"""Add command-line arguments"""
|
||||
|
||||
FlowProcessor.add_args(parser)
|
||||
add_qdrant_args(parser)
|
||||
|
||||
parser.add_argument(
|
||||
'-t', '--store-uri',
|
||||
default=default_store_uri,
|
||||
help=f'Qdrant URI (default: {default_store_uri})'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-k', '--api-key',
|
||||
default=None,
|
||||
help='Qdrant API key (default: None)'
|
||||
)
|
||||
|
||||
|
||||
def run():
|
||||
|
|
|
|||
|
|
@ -47,18 +47,16 @@ class Processor(CollectionConfigHandler, FlowProcessor):
|
|||
cassandra_password = params.get("cassandra_password")
|
||||
|
||||
# Resolve configuration with environment variable fallback
|
||||
hosts, username, password, keyspace, replication_factor = resolve_cassandra_config(
|
||||
hosts, username, password, keyspace, _ = resolve_cassandra_config(
|
||||
host=cassandra_host,
|
||||
username=cassandra_username,
|
||||
password=cassandra_password,
|
||||
replication_factor=params.get("cassandra_replication_factor"),
|
||||
password=cassandra_password
|
||||
)
|
||||
|
||||
# Store resolved configuration with proper names
|
||||
self.cassandra_host = hosts # Store as list
|
||||
self.cassandra_username = username
|
||||
self.cassandra_password = password
|
||||
self.replication_factor = replication_factor
|
||||
|
||||
# Config key for schemas
|
||||
self.config_key = params.get("config_type", "schema")
|
||||
|
|
@ -172,7 +170,7 @@ class Processor(CollectionConfigHandler, FlowProcessor):
|
|||
description=field_def.get("description", ""),
|
||||
required=field_def.get("required", False),
|
||||
enum_values=field_def.get("enum", []),
|
||||
indexed=field_def.get("indexed", False),
|
||||
indexed=field_def.get("indexed", False)
|
||||
)
|
||||
fields.append(field)
|
||||
|
||||
|
|
@ -234,7 +232,7 @@ class Processor(CollectionConfigHandler, FlowProcessor):
|
|||
CREATE KEYSPACE IF NOT EXISTS {safe_keyspace}
|
||||
WITH REPLICATION = {{
|
||||
'class': 'SimpleStrategy',
|
||||
'replication_factor': {self.replication_factor}
|
||||
'replication_factor': 1
|
||||
}}
|
||||
"""
|
||||
|
||||
|
|
|
|||
|
|
@ -27,8 +27,6 @@ Notes:
|
|||
|
||||
import asyncio
|
||||
|
||||
from cassandra.query import SimpleStatement
|
||||
|
||||
|
||||
async def async_execute(session, query, parameters=None):
|
||||
"""Execute a CQL statement asynchronously.
|
||||
|
|
@ -78,83 +76,3 @@ def _set_result_if_pending(fut, result):
|
|||
def _set_exception_if_pending(fut, exc):
|
||||
if not fut.done():
|
||||
fut.set_exception(exc)
|
||||
|
||||
|
||||
async def async_execute_paged(session, query, parameters=None, fetch_size=5000):
|
||||
"""Execute a CQL query with page-by-page iteration.
|
||||
|
||||
Uses synchronous session.execute() inside run_in_executor so that
|
||||
the driver's ResultSet paging works correctly without materialising
|
||||
the entire result set in memory.
|
||||
|
||||
Returns all pages as a list of lists.
|
||||
"""
|
||||
loop = asyncio.get_running_loop()
|
||||
|
||||
if isinstance(query, str):
|
||||
stmt = SimpleStatement(query, fetch_size=fetch_size)
|
||||
else:
|
||||
stmt = query
|
||||
stmt.fetch_size = fetch_size
|
||||
|
||||
def _fetch_all_pages():
|
||||
pages = []
|
||||
result_set = session.execute(stmt, parameters)
|
||||
while True:
|
||||
pages.append(list(result_set.current_rows))
|
||||
if result_set.has_more_pages:
|
||||
result_set.fetch_next_page()
|
||||
else:
|
||||
break
|
||||
return pages
|
||||
|
||||
return await loop.run_in_executor(
|
||||
None, _fetch_all_pages
|
||||
)
|
||||
|
||||
|
||||
async def async_scan(
|
||||
session, query, parameters=None, row_filter=None,
|
||||
limit=None, fetch_size=5000,
|
||||
):
|
||||
"""Scan a CQL query page-by-page, applying a filter and limit.
|
||||
|
||||
Only matching rows accumulate in memory. Each page is discarded
|
||||
after processing, so peak memory is bounded by fetch_size plus
|
||||
the number of matching rows (capped by limit).
|
||||
|
||||
Args:
|
||||
session: cassandra.cluster.Session
|
||||
query: CQL statement string
|
||||
parameters: bind params
|
||||
row_filter: callable(row) -> bool, or None to accept all
|
||||
limit: max results to return, or None for unlimited
|
||||
fetch_size: rows per Cassandra page fetch
|
||||
|
||||
Returns:
|
||||
List of matching rows.
|
||||
"""
|
||||
loop = asyncio.get_running_loop()
|
||||
|
||||
if isinstance(query, str):
|
||||
stmt = SimpleStatement(query, fetch_size=fetch_size)
|
||||
else:
|
||||
stmt = query
|
||||
stmt.fetch_size = fetch_size
|
||||
|
||||
def _scan():
|
||||
results = []
|
||||
result_set = session.execute(stmt, parameters)
|
||||
while True:
|
||||
for row in result_set.current_rows:
|
||||
if row_filter is None or row_filter(row):
|
||||
results.append(row)
|
||||
if limit and len(results) >= limit:
|
||||
return results
|
||||
if result_set.has_more_pages:
|
||||
result_set.fetch_next_page()
|
||||
else:
|
||||
break
|
||||
return results
|
||||
|
||||
return await loop.run_in_executor(None, _scan)
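Review note: the page-by-page scan described in the `async_scan` docstring can be exercised without a Cassandra cluster. The sketch below is an illustration under stated assumptions: `FakeResultSet` is a hypothetical stand-in that mimics only the `current_rows`, `has_more_pages`, and `fetch_next_page()` members used above, and `scan_pages` mirrors the inner `_scan` loop rather than reproducing the project's function.

```python
class FakeResultSet:
    """Hypothetical stand-in for the driver's ResultSet (illustration only)."""

    def __init__(self, pages):
        self._pages = list(pages)
        self._index = 0

    @property
    def current_rows(self):
        return self._pages[self._index] if self._pages else []

    @property
    def has_more_pages(self):
        return self._index < len(self._pages) - 1

    def fetch_next_page(self):
        self._index += 1


def scan_pages(result_set, row_filter=None, limit=None):
    """Collect matching rows one page at a time, stopping at the limit."""
    results = []
    while True:
        for row in result_set.current_rows:
            if row_filter is None or row_filter(row):
                results.append(row)
                if limit and len(results) >= limit:
                    return results
        if not result_set.has_more_pages:
            return results
        result_set.fetch_next_page()


pages = [[1, 2, 3], [4, 5, 6], [7, 8]]
evens = scan_pages(FakeResultSet(pages), row_filter=lambda r: r % 2 == 0, limit=3)
```

Because the loop returns as soon as the limit is reached, the third page is never fetched in this example. Only the matching rows are kept, which is the memory bound the docstring describes.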
|
||||
|
|
|
|||
|
|
@ -4,7 +4,7 @@ from .. schema import Metadata, GraphEmbeddings
|
|||
|
||||
from cassandra.cluster import Cluster
|
||||
from cassandra.auth import PlainTextAuthProvider
|
||||
import ssl
|
||||
from ssl import SSLContext, PROTOCOL_TLSv1_2
|
||||
|
||||
import uuid
|
||||
import time
|
||||
|
|
@ -33,7 +33,7 @@ class ConfigTableStore:
|
|||
cassandra_host = [h.strip() for h in cassandra_host.split(',')]
|
||||
|
||||
if cassandra_username and cassandra_password:
|
||||
ssl_context = ssl.create_default_context()
|
||||
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
|
||||
auth_provider = PlainTextAuthProvider(
|
||||
username=cassandra_username, password=cassandra_password
|
||||
)
|
||||
|
|
|
|||
|
|
@ -15,7 +15,7 @@ import logging
|
|||
|
||||
from cassandra.cluster import Cluster
|
||||
from cassandra.auth import PlainTextAuthProvider
|
||||
import ssl
|
||||
from ssl import SSLContext, PROTOCOL_TLSv1_2
|
||||
|
||||
from . cassandra_async import async_execute
|
||||
|
||||
|
|
@ -39,7 +39,7 @@ class IamTableStore:
|
|||
cassandra_host = [h.strip() for h in cassandra_host.split(",")]
|
||||
|
||||
if cassandra_username and cassandra_password:
|
||||
ssl_context = ssl.create_default_context()
|
||||
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
|
||||
auth_provider = PlainTextAuthProvider(
|
||||
username=cassandra_username, password=cassandra_password,
|
||||
)
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ from .. schema import DocumentEmbeddings, ChunkEmbeddings
|
|||
|
||||
from cassandra.cluster import Cluster
|
||||
|
||||
from . cassandra_async import async_execute, async_execute_paged
|
||||
from . cassandra_async import async_execute
|
||||
|
||||
|
||||
def term_to_tuple(term):
|
||||
|
|
@ -23,7 +23,7 @@ def tuple_to_term(value, is_uri):
|
|||
else:
|
||||
return Term(type=LITERAL, value=value)
|
||||
from cassandra.auth import PlainTextAuthProvider
|
||||
import ssl
|
||||
from ssl import SSLContext, PROTOCOL_TLSv1_2
|
||||
|
||||
import uuid
|
||||
import time
|
||||
|
|
@ -50,7 +50,7 @@ class KnowledgeTableStore:
|
|||
cassandra_host = [h.strip() for h in cassandra_host.split(',')]
|
||||
|
||||
if cassandra_username and cassandra_password:
|
||||
ssl_context = ssl.create_default_context()
|
||||
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
|
||||
auth_provider = PlainTextAuthProvider(
|
||||
username=cassandra_username, password=cassandra_password
|
||||
)
|
||||
|
|
@ -98,8 +98,7 @@ class KnowledgeTableStore:
|
|||
text, boolean, text, boolean, text, boolean
|
||||
>>,
|
||||
triples list<tuple<
|
||||
text, boolean, text, boolean, text, boolean,
|
||||
text
|
||||
text, boolean, text, boolean, text, boolean
|
||||
>>,
|
||||
PRIMARY KEY ((workspace, document_id), id)
|
||||
);
|
||||
|
|
@ -235,8 +234,7 @@ class KnowledgeTableStore:
|
|||
|
||||
triples = [
|
||||
(
|
||||
*term_to_tuple(v.s), *term_to_tuple(v.p), *term_to_tuple(v.o),
|
||||
v.g or ""
|
||||
*term_to_tuple(v.s), *term_to_tuple(v.p), *term_to_tuple(v.o)
|
||||
)
|
||||
for v in m.triples
|
||||
]
|
||||
|
|
@ -400,7 +398,7 @@ class KnowledgeTableStore:
|
|||
logger.debug("Get triples...")
|
||||
|
||||
try:
|
||||
pages = await async_execute_paged(
|
||||
rows = await async_execute(
|
||||
self.cassandra,
|
||||
self.get_triples_stmt,
|
||||
(workspace, document_id),
|
||||
|
|
@ -409,8 +407,7 @@ class KnowledgeTableStore:
|
|||
logger.error("Exception occurred", exc_info=True)
|
||||
raise
|
||||
|
||||
for page in pages:
|
||||
for row in page:
|
||||
for row in rows:
|
||||
|
||||
if row[3]:
|
||||
triples = [
|
||||
|
|
@ -418,7 +415,6 @@ class KnowledgeTableStore:
|
|||
s = tuple_to_term(elt[0], elt[1]),
|
||||
p = tuple_to_term(elt[2], elt[3]),
|
||||
o = tuple_to_term(elt[4], elt[5]),
|
||||
g = elt[6] if elt[6] else None,
|
||||
)
|
||||
for elt in row[3]
|
||||
]
|
||||
|
|
@ -429,7 +425,7 @@ class KnowledgeTableStore:
|
|||
Triples(
|
||||
metadata = Metadata(
|
||||
id = document_id,
|
||||
collection = "default",
|
||||
collection = "default", # FIXME: What to put here?
|
||||
),
|
||||
triples = triples
|
||||
)
|
||||
|
|
@ -442,7 +438,7 @@ class KnowledgeTableStore:
|
|||
logger.debug("Get GE...")
|
||||
|
||||
try:
|
||||
pages = await async_execute_paged(
|
||||
rows = await async_execute(
|
||||
self.cassandra,
|
||||
self.get_graph_embeddings_stmt,
|
||||
(workspace, document_id),
|
||||
|
|
@ -451,8 +447,7 @@ class KnowledgeTableStore:
|
|||
logger.error("Exception occurred", exc_info=True)
|
||||
raise
|
||||
|
||||
for page in pages:
|
||||
for row in page:
|
||||
for row in rows:
|
||||
|
||||
if row[3]:
|
||||
entities = [
|
||||
|
|
@ -469,7 +464,7 @@ class KnowledgeTableStore:
|
|||
GraphEmbeddings(
|
||||
metadata = Metadata(
|
||||
id = document_id,
|
||||
collection = "default",
|
||||
collection = "default", # FIXME: What to put here?
|
||||
),
|
||||
entities = entities
|
||||
)
|
||||
|
|
@ -482,7 +477,7 @@ class KnowledgeTableStore:
|
|||
logger.debug("Get DE...")
|
||||
|
||||
try:
|
||||
pages = await async_execute_paged(
|
||||
rows = await async_execute(
|
||||
self.cassandra,
|
||||
self.get_document_embeddings_stmt,
|
||||
(workspace, document_id),
|
||||
|
|
@ -491,8 +486,7 @@ class KnowledgeTableStore:
|
|||
logger.error("Exception occurred", exc_info=True)
|
||||
raise
|
||||
|
||||
for page in pages:
|
||||
for row in page:
|
||||
for row in rows:
|
||||
|
||||
if row[3]:
|
||||
chunks = [
|
||||
|
|
|
|||
|
|
@ -24,7 +24,7 @@ from .. exceptions import RequestError
|
|||
from cassandra.cluster import Cluster
|
||||
from cassandra.auth import PlainTextAuthProvider
|
||||
from cassandra.query import BatchStatement
|
||||
import ssl
|
||||
from ssl import SSLContext, PROTOCOL_TLSv1_2
|
||||
|
||||
import uuid
|
||||
import time
|
||||
|
|
@ -53,7 +53,7 @@ class LibraryTableStore:
|
|||
cassandra_host = [h.strip() for h in cassandra_host.split(',')]
|
||||
|
||||
if cassandra_username and cassandra_password:
|
||||
ssl_context = ssl.create_default_context()
|
||||
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
|
||||
auth_provider = PlainTextAuthProvider(
|
||||
username=cassandra_username, password=cassandra_password
|
||||
)
|
||||
|
|
|
|||
File diff suppressed because it is too large
|
|
@ -1,110 +1,49 @@
|
|||
|
||||
from dataclasses import dataclass
|
||||
from websockets.asyncio.client import connect
|
||||
from urllib.parse import urlencode, urlparse, urlunparse, parse_qs
|
||||
import asyncio
|
||||
import logging
|
||||
import json
|
||||
import uuid
|
||||
import hashlib
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def _token_key(token):
|
||||
"""Derive a dict key from a token without storing the raw secret."""
|
||||
return hashlib.sha256(token.encode()).hexdigest()[:16]
|
||||
|
||||
import time
|
||||
|
||||
class WebSocketManager:
|
||||
"""Manages an authenticated WebSocket connection to the TrustGraph
|
||||
gateway on behalf of a single caller.
|
||||
|
||||
Each caller token gets its own WebSocketManager so that gateway-side
|
||||
identity, workspace, and capability scoping are preserved end-to-end.
|
||||
"""
|
||||
|
||||
def __init__(self, url, token):
|
||||
def __init__(self, url, token=None):
|
||||
self.url = url
|
||||
# ── Security boundary: token storage ──
|
||||
# This is the MCP caller's Bearer token, forwarded verbatim to
|
||||
# the gateway. It MUST NOT be logged, persisted, or shared
|
||||
# across callers. It is held only for the lifetime of this
|
||||
# connection so that re-auth (e.g. after a reconnect) is
|
||||
# possible.
|
||||
self.token = token
|
||||
self.socket = None
|
||||
self.identity = None
|
||||
self.last_used = None
|
||||
|
||||
# FIXME: authentication is broken. The /api/v1/socket endpoint uses
|
||||
# in-band auth (first-frame protocol via the Mux dispatcher), not
|
||||
# query-parameter tokens. This query-string token is silently ignored.
|
||||
# Fix: after connect(), send an auth frame with the bearer token as
|
||||
# the first message, matching the gateway's in-band auth protocol.
|
||||
def _build_url(self):
|
||||
if not self.token:
|
||||
return self.url
|
||||
parsed = urlparse(self.url)
|
||||
params = parse_qs(parsed.query)
|
||||
params["token"] = [self.token]
|
||||
new_query = urlencode(params, doseq=True)
|
||||
return urlunparse(parsed._replace(query=new_query))
|
||||
|
||||
async def start(self):
|
||||
"""Connect and authenticate via the gateway's in-band auth
|
||||
protocol. Raises on auth failure."""
|
||||
|
||||
# ── Security boundary: MCP server → gateway ──
|
||||
# The WebSocket connects to the gateway and authenticates using
|
||||
# the caller's Bearer token via the in-band first-frame auth
|
||||
# protocol. The token belongs to the MCP client — we forward
|
||||
# it as-is and never interpret its contents.
|
||||
self.socket = await connect(self.url)
|
||||
self.socket = await connect(self._build_url())
|
||||
self.pending_requests = {}
|
||||
self.running = True
|
||||
|
||||
await self._authenticate()
|
||||
|
||||
self.reader_task = asyncio.create_task(self.reader())
|
||||
|
||||
async def _authenticate(self):
|
||||
"""Send in-band auth frame and wait for auth-ok / auth-failed.
|
||||
|
||||
The gateway expects ``{"type": "auth", "token": "..."}`` as the
|
||||
first frame on a new WebSocket. Any service frame sent before
|
||||
auth-ok is rejected.
|
||||
"""
|
||||
await self.socket.send(json.dumps({
|
||||
"type": "auth",
|
||||
"token": self.token,
|
||||
}))
|
||||
|
||||
response_text = await asyncio.wait_for(self.socket.recv(), 10)
|
||||
response = json.loads(response_text)
|
||||
|
||||
if response.get("type") == "auth-ok":
|
||||
logger.info(
|
||||
"WebSocket authenticated, default workspace: %s",
|
||||
response.get("workspace"),
|
||||
)
|
||||
return
|
||||
|
||||
# Auth failed — close immediately, do not leave an
|
||||
# unauthenticated socket open.
|
||||
await self.socket.close()
|
||||
self.socket = None
|
||||
|
||||
if response.get("type") == "auth-failed":
|
||||
raise RuntimeError(
|
||||
"Gateway rejected the authentication token"
|
||||
)
|
||||
|
||||
raise RuntimeError(
|
||||
f"Unexpected auth response type: {response.get('type')}"
|
||||
)
|
||||
|
||||
async def whoami(self):
|
||||
"""Verify the token by calling the gateway's whoami endpoint.
|
||||
Returns the identity dict and caches it on ``self.identity``.
|
||||
"""
|
||||
gen = self.request("iam", {"operation": "whoami"}, flow_id=None)
|
||||
async for response in gen:
|
||||
self.identity = response
|
||||
return response
|
||||
|
||||
async def stop(self):
|
||||
self.running = False
|
||||
if hasattr(self, "reader_task"):
|
||||
await self.reader_task
|
||||
|
||||
async def reader(self):
|
||||
"""Background task: read WebSocket frames and route them to the
|
||||
correct pending-request queue by ``id``."""
|
||||
"""
|
||||
Background task to read websocket responses and route to correct
|
||||
request
|
||||
"""
|
||||
|
||||
while self.running:
|
||||
try:
|
||||
|
|
@ -120,21 +59,23 @@ class WebSocketManager:
|
|||
|
||||
request_id = response.get("id")
|
||||
if request_id and request_id in self.pending_requests:
|
||||
# Put the response in the queue
|
||||
queue = self.pending_requests[request_id]
|
||||
await queue.put(response)
|
||||
else:
|
||||
logger.warning(
|
||||
"Response for unknown request ID: %s", request_id
|
||||
logging.warning(
|
||||
f"Response for unknown request ID: {request_id}"
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
|
||||
logger.error("Error in websocket reader: %s", e)
|
||||
logging.error(f"Error in websocket reader: {e}")
|
||||
|
||||
# Put error in all pending queues
|
||||
for queue in self.pending_requests.values():
|
||||
try:
|
||||
await queue.put({"error": str(e)})
|
||||
except Exception:
|
||||
except:
|
||||
pass
|
||||
|
||||
self.pending_requests.clear()
|
||||
|
|
@ -145,29 +86,25 @@ class WebSocketManager:
|
|||
|
||||
async def request(
|
||||
self, service, request_data, flow_id="default",
|
||||
workspace=None,
|
||||
):
|
||||
"""Send a request via WebSocket and yield responses.
|
||||
|
||||
Args:
|
||||
service: Gateway service name (e.g. "graph-rag", "config").
|
||||
request_data: Inner request payload.
|
||||
flow_id: Optional flow identifier. ``None`` omits the field
|
||||
(workspace-level services don't use flows).
|
||||
workspace: Optional workspace override. When ``None`` the
|
||||
gateway uses the caller's default workspace.
|
||||
"""
|
||||
Send a request via websocket and handle single or streaming responses
|
||||
"""
|
||||
|
||||
import time
|
||||
self.last_used = time.monotonic()
|
||||
|
||||
# Generate unique request ID
|
||||
request_id = f"{uuid.uuid4()}"
|
||||
|
||||
# Determine if this service streams responses
|
||||
streaming_services = {"agent"}
|
||||
is_streaming = service in streaming_services
|
||||
|
||||
# Create a queue for all responses (streaming and single)
|
||||
response_queue = asyncio.Queue()
|
||||
self.pending_requests[request_id] = response_queue
|
||||
|
||||
try:
|
||||
|
||||
# Build request message
|
||||
message = {
|
||||
"id": request_id,
|
||||
"service": service,
|
||||
|
|
@ -177,16 +114,7 @@ class WebSocketManager:
|
|||
if flow_id is not None:
|
||||
message["flow"] = flow_id
|
||||
|
||||
# ── Security boundary: workspace scoping ──
|
||||
# When the caller supplies a workspace, we set it on the
|
||||
# message envelope. The gateway's enforce_workspace()
|
||||
# validates that the authenticated identity is permitted
|
||||
# to access the target workspace — we MUST NOT skip or
|
||||
# override that check. When workspace is None, the
|
||||
# gateway default-fills from the identity's bound workspace.
|
||||
if workspace is not None:
|
||||
message["workspace"] = workspace
|
||||
|
||||
# Send request
|
||||
await self.socket.send(json.dumps(message))
|
||||
|
||||
while self.running:
|
||||
|
|
@ -199,17 +127,19 @@ class WebSocketManager:
|
|||
continue
|
||||
|
||||
if "error" in response:
|
||||
if isinstance(response["error"], dict):
|
||||
raise RuntimeError(
|
||||
response["error"].get("message", str(response["error"]))
|
||||
)
|
||||
if "message" in response["error"]:
|
||||
raise RuntimeError(response["error"]["text"])
|
||||
else:
|
||||
raise RuntimeError(str(response["error"]))
|
||||
|
||||
yield response["response"]
|
||||
|
||||
if response.get("complete"):
|
||||
if "complete" in response:
|
||||
if response["complete"]:
|
||||
break
|
||||
|
||||
finally:
|
||||
except Exception as e:
|
||||
# Clean up on error
|
||||
self.pending_requests.pop(request_id, None)
|
||||
raise e
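Review note: the in-band, first-frame authentication described in the `_authenticate` docstring above can be sketched against an in-memory socket. This is a minimal illustration, not the gateway's implementation: `FakeSocket`, `authenticate`, and `demo` are hypothetical names, and only the frame shapes (`{"type": "auth", "token": ...}`, `auth-ok`, `auth-failed`) are taken from the diff.

```python
import asyncio
import json


class FakeSocket:
    """Hypothetical in-memory socket that accepts one known token."""

    def __init__(self, valid_token):
        self.valid_token = valid_token
        self.closed = False
        self._replies = asyncio.Queue()

    async def send(self, text):
        frame = json.loads(text)
        ok = frame.get("type") == "auth" and frame.get("token") == self.valid_token
        reply = {"type": "auth-ok", "workspace": "default"} if ok else {"type": "auth-failed"}
        await self._replies.put(json.dumps(reply))

    async def recv(self):
        return await self._replies.get()

    async def close(self):
        self.closed = True


async def authenticate(socket, token, timeout=10):
    """Send the auth frame first and wait for the verdict."""
    await socket.send(json.dumps({"type": "auth", "token": token}))
    response = json.loads(await asyncio.wait_for(socket.recv(), timeout))
    if response.get("type") == "auth-ok":
        return response.get("workspace")
    # Never leave an unauthenticated socket open.
    await socket.close()
    raise RuntimeError(f"Authentication failed: {response.get('type')}")


async def demo():
    good = FakeSocket("secret")
    workspace = await authenticate(good, "secret")
    bad = FakeSocket("secret")
    try:
        await authenticate(bad, "wrong")
        rejected = False
    except RuntimeError:
        rejected = True
    return workspace, rejected, bad.closed


result = asyncio.run(demo())
```

The handshake runs before any reader task starts, so the first frame received is the auth verdict and cannot be confused with a service response.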
|
||||
|
||||
|
|
|
|||
|
|
@ -107,14 +107,7 @@ class Processor(FlowProcessor):
|
|||
# Get the source document ID
|
||||
source_doc_id = v.document_id or v.metadata.id
|
||||
|
||||
try:
|
||||
pages = convert_from_bytes(blob)
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
f"Failed to decode PDF {source_doc_id}: "
|
||||
f"{type(e).__name__}: {e}"
|
||||
)
|
||||
return
|
||||
|
||||
for ix, page in enumerate(pages):
|
||||
|
||||
|
|
|
|||
|
|
@ -418,14 +418,7 @@ class Processor(FlowProcessor):
|
|||
doc_uri_str = document_uri(source_doc_id)
|
||||
|
||||
# Extract elements using unstructured
|
||||
try:
|
||||
elements = self.extract_elements(blob, mime_type)
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
f"Failed to extract elements from {source_doc_id}: "
|
||||
f"{type(e).__name__}: {e}"
|
||||
)
|
||||
return
|
||||
|
||||
if not elements:
|
||||
logger.warning("No elements extracted from document")
|
||||
|
|
|
|||