Feature/graphql table query (#486)

* Tech spec * Object query service for Cassandra * Gateway support for objects-query * GraphQL query utility * Filters, ordering
2026-06-14 17:25:14 +02:00 · 2025-09-03 23:39:11 +01:00 · 2025-09-03 23:39:11 +01:00 · 672e358b2f
commit 672e358b2f
parent 38826c7de1
20 changed files with 3133 additions and 3 deletions
--- a/docs/tech-specs/graphql-query.md
+++ b/docs/tech-specs/graphql-query.md
@ -0,0 +1,383 @@
+# GraphQL Query Technical Specification
+
+## Overview
+
+This specification describes the implementation of a GraphQL query interface for TrustGraph's structured data storage in Apache Cassandra. Building upon the structured data capabilities outlined in the structured-data.md specification, this document details how GraphQL queries will be executed against Cassandra tables containing extracted and ingested structured objects.
+
+The GraphQL query service will provide a flexible, type-safe interface for querying structured data stored in Cassandra. It will dynamically adapt to schema changes, support complex queries including relationships between objects, and integrate seamlessly with TrustGraph's existing message-based architecture.
+
+## Goals
+
+- **Dynamic Schema Support**: Automatically adapt to schema changes in configuration without service restarts
+- **GraphQL Standards Compliance**: Provide a standard GraphQL interface compatible with existing GraphQL tooling and clients
+- **Efficient Cassandra Queries**: Translate GraphQL queries into efficient Cassandra CQL queries respecting partition keys and indexes
+- **Relationship Resolution**: Support GraphQL field resolvers for relationships between different object types
+- **Type Safety**: Ensure type-safe query execution and response generation based on schema definitions
+- **Scalable Performance**: Handle concurrent queries efficiently with proper connection pooling and query optimization
+- **Request/Response Integration**: Maintain compatibility with TrustGraph's Pulsar-based request/response pattern
+- **Error Handling**: Provide comprehensive error reporting for schema mismatches, query errors, and data validation issues
+
+## Background
+
+The structured data storage implementation (trustgraph-flow/trustgraph/storage/objects/cassandra/) writes objects to Cassandra tables based on schema definitions stored in TrustGraph's configuration system. These tables use a composite partition key structure with collection and schema-defined primary keys, enabling efficient queries within collections.
+
+Current limitations that this specification addresses:
+- No query interface for the structured data stored in Cassandra
+- Inability to leverage GraphQL's powerful query capabilities for structured data
+- Missing support for relationship traversal between related objects
+- Lack of a standardized query language for structured data access
+
+The GraphQL query service will bridge these gaps by:
+- Providing a standard GraphQL interface for querying Cassandra tables
+- Dynamically generating GraphQL schemas from TrustGraph configuration
+- Efficiently translating GraphQL queries to Cassandra CQL
+- Supporting relationship resolution through field resolvers
+
+## Technical Design
+
+### Architecture
+
+The GraphQL query service will be implemented as a new TrustGraph flow processor following established patterns:
+
+**Module Location**: `trustgraph-flow/trustgraph/query/objects/cassandra/`
+
+**Key Components**:
+
+1. **GraphQL Query Service Processor**
+   - Extends base FlowProcessor class
+   - Implements request/response pattern similar to existing query services
+   - Monitors configuration for schema updates
+   - Maintains GraphQL schema synchronized with configuration
+
+2. **Dynamic Schema Generator**
+   - Converts TrustGraph RowSchema definitions to GraphQL types
+   - Creates GraphQL object types with proper field definitions
+   - Generates root Query type with collection-based resolvers
+   - Updates GraphQL schema when configuration changes
+
+3. **Query Executor**
+   - Parses incoming GraphQL queries using Strawberry library
+   - Validates queries against current schema
+   - Executes queries and returns structured responses
+   - Handles errors gracefully with detailed error messages
+
+4. **Cassandra Query Translator**
+   - Converts GraphQL selections to CQL queries
+   - Optimizes queries based on available indexes and partition keys
+   - Handles filtering, pagination, and sorting
+   - Manages connection pooling and session lifecycle
+
+5. **Relationship Resolver**
+   - Implements field resolvers for object relationships
+   - Performs efficient batch loading to avoid N+1 queries
+   - Caches resolved relationships within request context
+   - Supports both forward and reverse relationship traversal
+
+### Configuration Schema Monitoring
+
+The service will register a configuration handler to receive schema updates:
+
+```python
+self.register_config_handler(self.on_schema_config)
+```
+
+When schemas change:
+1. Parse new schema definitions from configuration
+2. Regenerate GraphQL types and resolvers
+3. Update the executable schema
+4. Clear any schema-dependent caches
+
+### GraphQL Schema Generation
+
+For each RowSchema in configuration, generate:
+
+1. **GraphQL Object Type**:
+   - Map field types (string → String, integer → Int, float → Float, boolean → Boolean)
+   - Mark required fields as non-nullable in GraphQL
+   - Add field descriptions from schema
+
+2. **Root Query Fields**:
+   - Collection query (e.g., `customers`, `transactions`)
+   - Filtering arguments based on indexed fields
+   - Pagination support (limit, offset)
+   - Sorting options for sortable fields
+
+3. **Relationship Fields**:
+   - Identify foreign key relationships from schema
+   - Create field resolvers for related objects
+   - Support both single object and list relationships
+
+### Query Execution Flow
+
+1. **Request Reception**:
+   - Receive ObjectsQueryRequest from Pulsar
+   - Extract GraphQL query string and variables
+   - Identify user and collection context
+
+2. **Query Validation**:
+   - Parse GraphQL query using Strawberry
+   - Validate against current schema
+   - Check field selections and argument types
+
+3. **CQL Generation**:
+   - Analyze GraphQL selections
+   - Build CQL query with proper WHERE clauses
+   - Include collection in partition key
+   - Apply filters based on GraphQL arguments
+
+4. **Query Execution**:
+   - Execute CQL query against Cassandra
+   - Map results to GraphQL response structure
+   - Resolve any relationship fields
+   - Format response according to GraphQL spec
+
+5. **Response Delivery**:
+   - Create ObjectsQueryResponse with results
+   - Include any execution errors
+   - Send response via Pulsar with correlation ID
+
+### Data Models
+
+> **Note**: An existing StructuredQueryRequest/Response schema exists in `trustgraph-base/trustgraph/schema/services/structured_query.py`. However, it lacks critical fields (user, collection) and uses suboptimal types. The schemas below represent the recommended evolution, which should either replace the existing schemas or be created as new ObjectsQueryRequest/Response types.
+
+#### Request Schema (ObjectsQueryRequest)
+
+```python
+from pulsar.schema import Record, String, Map, Array
+
+class ObjectsQueryRequest(Record):
+    user = String()              # Cassandra keyspace (follows pattern from TriplesQueryRequest)
+    collection = String()        # Data collection identifier (required for partition key)
+    query = String()             # GraphQL query string
+    variables = Map(String())    # GraphQL variables (consider enhancing to support all JSON types)
+    operation_name = String()    # Operation to execute for multi-operation documents
+```
+
+**Rationale for changes from existing StructuredQueryRequest:**
+- Added `user` and `collection` fields to match other query services pattern
+- These fields are essential for identifying the Cassandra keyspace and collection
+- Variables remain as Map(String()) for now but should ideally support all JSON types
+
+#### Response Schema (ObjectsQueryResponse)
+
+```python
+from pulsar.schema import Record, String, Array
+from ..core.primitives import Error
+
+class GraphQLError(Record):
+    message = String()
+    path = Array(String())       # Path to the field that caused the error
+    extensions = Map(String())   # Additional error metadata
+
+class ObjectsQueryResponse(Record):
+    error = Error()              # System-level error (connection, timeout, etc.)
+    data = String()              # JSON-encoded GraphQL response data
+    errors = Array(GraphQLError) # GraphQL field-level errors
+    extensions = Map(String())   # Query metadata (execution time, etc.)
+```
+
+**Rationale for changes from existing StructuredQueryResponse:**
+- Distinguishes between system errors (`error`) and GraphQL errors (`errors`)
+- Uses structured GraphQLError objects instead of string array
+- Adds `extensions` field for GraphQL spec compliance
+- Keeps data as JSON string for compatibility, though native types would be preferable
+
+### Cassandra Query Optimization
+
+The service will optimize Cassandra queries by:
+
+1. **Respecting Partition Keys**:
+   - Always include collection in queries
+   - Use schema-defined primary keys efficiently
+   - Avoid full table scans
+
+2. **Leveraging Indexes**:
+   - Use secondary indexes for filtering
+   - Combine multiple filters when possible
+   - Warn when queries may be inefficient
+
+3. **Batch Loading**:
+   - Collect relationship queries
+   - Execute in batches to reduce round trips
+   - Cache results within request context
+
+4. **Connection Management**:
+   - Maintain persistent Cassandra sessions
+   - Use connection pooling
+   - Handle reconnection on failures
+
+### Example GraphQL Queries
+
+#### Simple Collection Query
+```graphql
+{
+  customers(status: "active") {
+    customer_id
+    name
+    email
+    registration_date
+  }
+}
+```
+
+#### Query with Relationships
+```graphql
+{
+  orders(order_date_gt: "2024-01-01") {
+    order_id
+    total_amount
+    customer {
+      name
+      email
+    }
+    items {
+      product_name
+      quantity
+      price
+    }
+  }
+}
+```
+
+#### Paginated Query
+```graphql
+{
+  products(limit: 20, offset: 40) {
+    product_id
+    name
+    price
+    category
+  }
+}
+```
+
+### Implementation Dependencies
+
+- **Strawberry GraphQL**: For GraphQL schema definition and query execution
+- **Cassandra Driver**: For database connectivity (already used in storage module)
+- **TrustGraph Base**: For FlowProcessor and schema definitions
+- **Configuration System**: For schema monitoring and updates
+
+### Command-Line Interface
+
+The service will provide a CLI command: `kg-query-objects-graphql-cassandra`
+
+Arguments:
+- `--cassandra-host`: Cassandra cluster contact point
+- `--cassandra-username`: Authentication username
+- `--cassandra-password`: Authentication password
+- `--config-type`: Configuration type for schemas (default: "schema")
+- Standard FlowProcessor arguments (Pulsar configuration, etc.)
+
+## API Integration
+
+### Pulsar Topics
+
+**Input Topic**: `objects-graphql-query-request`
+- Schema: ObjectsQueryRequest
+- Receives GraphQL queries from gateway services
+
+**Output Topic**: `objects-graphql-query-response`
+- Schema: ObjectsQueryResponse
+- Returns query results and errors
+
+### Gateway Integration
+
+The gateway and reverse-gateway will need endpoints to:
+1. Accept GraphQL queries from clients
+2. Forward to the query service via Pulsar
+3. Return responses to clients
+4. Support GraphQL introspection queries
+
+### Agent Tool Integration
+
+A new agent tool class will enable:
+- Natural language to GraphQL query generation
+- Direct GraphQL query execution
+- Result interpretation and formatting
+- Integration with agent decision flows
+
+## Security Considerations
+
+- **Query Depth Limiting**: Prevent deeply nested queries that could cause performance issues
+- **Query Complexity Analysis**: Limit query complexity to prevent resource exhaustion
+- **Field-Level Permissions**: Future support for field-level access control based on user roles
+- **Input Sanitization**: Validate and sanitize all query inputs to prevent injection attacks
+- **Rate Limiting**: Implement query rate limiting per user/collection
+
+## Performance Considerations
+
+- **Query Planning**: Analyze queries before execution to optimize CQL generation
+- **Result Caching**: Consider caching frequently accessed data at the field resolver level
+- **Connection Pooling**: Maintain efficient connection pools to Cassandra
+- **Batch Operations**: Combine multiple queries when possible to reduce latency
+- **Monitoring**: Track query performance metrics for optimization
+
+## Testing Strategy
+
+### Unit Tests
+- Schema generation from RowSchema definitions
+- GraphQL query parsing and validation
+- CQL query generation logic
+- Field resolver implementations
+
+### Contract Tests
+- Pulsar message contract compliance
+- GraphQL schema validity
+- Response format verification
+- Error structure validation
+
+### Integration Tests
+- End-to-end query execution against test Cassandra instance
+- Schema update handling
+- Relationship resolution
+- Pagination and filtering
+- Error scenarios
+
+### Performance Tests
+- Query throughput under load
+- Response time for various query complexities
+- Memory usage with large result sets
+- Connection pool efficiency
+
+## Migration Plan
+
+No migration required as this is a new capability. The service will:
+1. Read existing schemas from configuration
+2. Connect to existing Cassandra tables created by the storage module
+3. Start accepting queries immediately upon deployment
+
+## Timeline
+
+- Week 1-2: Core service implementation and schema generation
+- Week 3: Query execution and CQL translation
+- Week 4: Relationship resolution and optimization
+- Week 5: Testing and performance tuning
+- Week 6: Gateway integration and documentation
+
+## Open Questions
+
+1. **Schema Evolution**: How should the service handle queries during schema transitions?
+   - Option: Queue queries during schema updates
+   - Option: Support multiple schema versions simultaneously
+
+2. **Caching Strategy**: Should query results be cached?
+   - Consider: Time-based expiration
+   - Consider: Event-based invalidation
+
+3. **Federation Support**: Should the service support GraphQL federation for combining with other data sources?
+   - Would enable unified queries across structured and graph data
+
+4. **Subscription Support**: Should the service support GraphQL subscriptions for real-time updates?
+   - Would require WebSocket support in gateway
+
+5. **Custom Scalars**: Should custom scalar types be supported for domain-specific data types?
+   - Examples: DateTime, UUID, JSON fields
+
+## References
+
+- Structured Data Technical Specification: `docs/tech-specs/structured-data.md`
+- Strawberry GraphQL Documentation: https://strawberry.rocks/
+- GraphQL Specification: https://spec.graphql.org/
+- Apache Cassandra CQL Reference: https://cassandra.apache.org/doc/stable/cassandra/cql/
+- TrustGraph Flow Processor Documentation: Internal documentation