# GraphQL Query Technical Specification

## Overview

This specification describes the implementation of a GraphQL query interface for TrustGraph's structured data storage in Apache Cassandra. Building upon the structured data capabilities outlined in the structured-data.md specification, this document details how GraphQL queries will be executed against Cassandra tables containing extracted and ingested structured objects.

The GraphQL query service will provide a flexible, type-safe interface for querying structured data stored in Cassandra. It will dynamically adapt to schema changes, support complex queries including relationships between objects, and integrate seamlessly with TrustGraph's existing message-based architecture.

## Goals

- **Dynamic Schema Support**: Automatically adapt to schema changes in configuration without service restarts
- **GraphQL Standards Compliance**: Provide a standard GraphQL interface compatible with existing GraphQL tooling and clients
- **Efficient Cassandra Queries**: Translate GraphQL queries into efficient Cassandra CQL queries respecting partition keys and indexes
- **Relationship Resolution**: Support GraphQL field resolvers for relationships between different object types
- **Type Safety**: Ensure type-safe query execution and response generation based on schema definitions
- **Scalable Performance**: Handle concurrent queries efficiently with proper connection pooling and query optimization
- **Request/Response Integration**: Maintain compatibility with TrustGraph's Pulsar-based request/response pattern
- **Error Handling**: Provide comprehensive error reporting for schema mismatches, query errors, and data validation issues

## Background

The structured data storage implementation (trustgraph-flow/trustgraph/storage/objects/cassandra/) writes objects to Cassandra tables based on schema definitions stored in TrustGraph's configuration system.
These tables use a composite partition key structure with collection and schema-defined primary keys, enabling efficient queries within collections.

Current limitations that this specification addresses:

- No query interface for the structured data stored in Cassandra
- Inability to leverage GraphQL's query capabilities for structured data
- Missing support for relationship traversal between related objects
- Lack of a standardized query language for structured data access

The GraphQL query service will bridge these gaps by:

- Providing a standard GraphQL interface for querying Cassandra tables
- Dynamically generating GraphQL schemas from TrustGraph configuration
- Efficiently translating GraphQL queries to Cassandra CQL
- Supporting relationship resolution through field resolvers

## Technical Design

### Architecture

The GraphQL query service will be implemented as a new TrustGraph flow processor following established patterns:

**Module Location**: `trustgraph-flow/trustgraph/query/objects/cassandra/`

**Key Components**:

1. **GraphQL Query Service Processor**
   - Extends base FlowProcessor class
   - Implements request/response pattern similar to existing query services
   - Monitors configuration for schema updates
   - Maintains GraphQL schema synchronized with configuration

2. **Dynamic Schema Generator**
   - Converts TrustGraph RowSchema definitions to GraphQL types
   - Creates GraphQL object types with proper field definitions
   - Generates root Query type with collection-based resolvers
   - Updates GraphQL schema when configuration changes

3. **Query Executor**
   - Parses incoming GraphQL queries using the Strawberry library
   - Validates queries against current schema
   - Executes queries and returns structured responses
   - Handles errors gracefully with detailed error messages

4. **Cassandra Query Translator**
   - Converts GraphQL selections to CQL queries
   - Optimizes queries based on available indexes and partition keys
   - Handles filtering, pagination, and sorting
   - Manages connection pooling and session lifecycle

5. **Relationship Resolver**
   - Implements field resolvers for object relationships
   - Performs efficient batch loading to avoid N+1 queries
   - Caches resolved relationships within request context
   - Supports both forward and reverse relationship traversal

### Configuration Schema Monitoring

The service will register a configuration handler to receive schema updates:

```python
self.register_config_handler(self.on_schema_config)
```

When schemas change:

1. Parse new schema definitions from configuration
2. Regenerate GraphQL types and resolvers
3. Update the executable schema
4. Clear any schema-dependent caches

### GraphQL Schema Generation

For each RowSchema in configuration, generate:

1. **GraphQL Object Type**:
   - Map field types (string → String, integer → Int, float → Float, boolean → Boolean)
   - Mark required fields as non-nullable in GraphQL
   - Add field descriptions from schema

2. **Root Query Fields**:
   - Collection query (e.g., `customers`, `transactions`)
   - Filtering arguments based on indexed fields
   - Pagination support (limit, offset)
   - Sorting options for sortable fields

3. **Relationship Fields**:
   - Identify foreign key relationships from schema
   - Create field resolvers for related objects
   - Support both single object and list relationships

### Query Execution Flow

1. **Request Reception**:
   - Receive ObjectsQueryRequest from Pulsar
   - Extract GraphQL query string and variables
   - Identify user and collection context

2. **Query Validation**:
   - Parse GraphQL query using Strawberry
   - Validate against current schema
   - Check field selections and argument types

3. **CQL Generation**:
   - Analyze GraphQL selections
   - Build CQL query with proper WHERE clauses
   - Include collection in partition key
   - Apply filters based on GraphQL arguments

4. **Query Execution**:
   - Execute CQL query against Cassandra
   - Map results to GraphQL response structure
   - Resolve any relationship fields
   - Format response according to GraphQL spec

5. **Response Delivery**:
   - Create ObjectsQueryResponse with results
   - Include any execution errors
   - Send response via Pulsar with correlation ID

### Data Models

> **Note**: A StructuredQueryRequest/Response schema already exists in `trustgraph-base/trustgraph/schema/services/structured_query.py`. However, it lacks critical fields (user, collection) and uses suboptimal types. The schemas below represent the recommended evolution, which should either replace the existing schemas or be created as new ObjectsQueryRequest/Response types.

#### Request Schema (ObjectsQueryRequest)

```python
from pulsar.schema import Record, String, Map

class ObjectsQueryRequest(Record):
    user = String()              # Cassandra keyspace (follows pattern from TriplesQueryRequest)
    collection = String()        # Data collection identifier (required for partition key)
    query = String()             # GraphQL query string
    variables = Map(String())    # GraphQL variables (consider enhancing to support all JSON types)
    operation_name = String()    # Operation to execute for multi-operation documents
```

**Rationale for changes from existing StructuredQueryRequest:**

- Added `user` and `collection` fields to match the pattern used by other query services
- These fields are essential for identifying the Cassandra keyspace and collection
- Variables remain as Map(String()) for now but should ideally support all JSON types

#### Response Schema (ObjectsQueryResponse)

```python
from pulsar.schema import Record, String, Array, Map
from ..core.primitives import Error

class GraphQLError(Record):
    message = String()
    path = Array(String())       # Path to the field that caused the error
    extensions = Map(String())   # Additional error metadata

class ObjectsQueryResponse(Record):
    error = Error()                   # System-level error (connection, timeout, etc.)
    data = String()                   # JSON-encoded GraphQL response data
    errors = Array(GraphQLError())    # GraphQL field-level errors
    extensions = Map(String())        # Query metadata (execution time, etc.)
```

**Rationale for changes from existing StructuredQueryResponse:**

- Distinguishes between system errors (`error`) and GraphQL errors (`errors`)
- Uses structured GraphQLError objects instead of a string array
- Adds `extensions` field for GraphQL spec compliance
- Keeps data as a JSON string for compatibility, though native types would be preferable

### Cassandra Query Optimization

The service will optimize Cassandra queries by:

1. **Respecting Partition Keys**:
   - Always include collection in queries
   - Use schema-defined primary keys efficiently
   - Avoid full table scans

2. **Leveraging Indexes**:
   - Use secondary indexes for filtering
   - Combine multiple filters when possible
   - Warn when queries may be inefficient

3. **Batch Loading**:
   - Collect relationship queries
   - Execute in batches to reduce round trips
   - Cache results within request context

4. **Connection Management**:
   - Maintain persistent Cassandra sessions
   - Use connection pooling
   - Handle reconnection on failures

### Example GraphQL Queries

#### Simple Collection Query

```graphql
{
  customers(status: "active") {
    customer_id
    name
    email
    registration_date
  }
}
```

#### Query with Relationships

```graphql
{
  orders(order_date_gt: "2024-01-01") {
    order_id
    total_amount
    customer {
      name
      email
    }
    items {
      product_name
      quantity
      price
    }
  }
}
```

#### Paginated Query

```graphql
{
  products(limit: 20, offset: 40) {
    product_id
    name
    price
    category
  }
}
```

### Implementation Dependencies

- **Strawberry GraphQL**: For GraphQL schema definition and query execution
- **Cassandra Driver**: For database connectivity (already used in storage module)
- **TrustGraph Base**: For FlowProcessor and schema definitions
- **Configuration System**: For schema monitoring and updates

### Command-Line Interface

The service will provide a CLI command: `kg-query-objects-graphql-cassandra`

Arguments:

- `--cassandra-host`: Cassandra cluster contact point
- `--cassandra-username`: Authentication username
- `--cassandra-password`: Authentication password
- `--config-type`: Configuration type for schemas (default: "schema")
- Standard FlowProcessor arguments (Pulsar configuration, etc.)

## API Integration

### Pulsar Topics

**Input Topic**: `objects-graphql-query-request`

- Schema: ObjectsQueryRequest
- Receives GraphQL queries from gateway services

**Output Topic**: `objects-graphql-query-response`

- Schema: ObjectsQueryResponse
- Returns query results and errors

### Gateway Integration

The gateway and reverse-gateway will need endpoints to:

1. Accept GraphQL queries from clients
2. Forward to the query service via Pulsar
3. Return responses to clients
4. Support GraphQL introspection queries

### Agent Tool Integration

A new agent tool class will enable:

- Natural language to GraphQL query generation
- Direct GraphQL query execution
- Result interpretation and formatting
- Integration with agent decision flows

## Security Considerations

- **Query Depth Limiting**: Prevent deeply nested queries that could cause performance issues
- **Query Complexity Analysis**: Limit query complexity to prevent resource exhaustion
- **Field-Level Permissions**: Future support for field-level access control based on user roles
- **Input Sanitization**: Validate and sanitize all query inputs to prevent injection attacks
- **Rate Limiting**: Implement query rate limiting per user/collection

## Performance Considerations

- **Query Planning**: Analyze queries before execution to optimize CQL generation
- **Result Caching**: Consider caching frequently accessed data at the field resolver level
- **Connection Pooling**: Maintain efficient connection pools to Cassandra
- **Batch Operations**: Combine multiple queries when possible to reduce latency
- **Monitoring**: Track query performance metrics for optimization

## Testing Strategy

### Unit Tests

- Schema generation from RowSchema definitions
- GraphQL query parsing and validation
- CQL query generation logic
- Field resolver implementations

### Contract Tests

- Pulsar message contract compliance
- GraphQL schema validity
- Response format verification
- Error structure validation

### Integration Tests

- End-to-end query execution against a test Cassandra instance
- Schema update handling
- Relationship resolution
- Pagination and filtering
- Error scenarios

### Performance Tests

- Query throughput under load
- Response time for various query complexities
- Memory usage with large result sets
- Connection pool efficiency

## Migration Plan

No migration is required, as this is a new capability. The service will:

1. Read existing schemas from configuration
2. Connect to existing Cassandra tables created by the storage module
3. Start accepting queries immediately upon deployment

## Timeline

- Weeks 1-2: Core service implementation and schema generation
- Week 3: Query execution and CQL translation
- Week 4: Relationship resolution and optimization
- Week 5: Testing and performance tuning
- Week 6: Gateway integration and documentation

## Open Questions

1. **Schema Evolution**: How should the service handle queries during schema transitions?
   - Option: Queue queries during schema updates
   - Option: Support multiple schema versions simultaneously

2. **Caching Strategy**: Should query results be cached?
   - Consider: Time-based expiration
   - Consider: Event-based invalidation

3. **Federation Support**: Should the service support GraphQL federation for combining with other data sources?
   - Would enable unified queries across structured and graph data

4. **Subscription Support**: Should the service support GraphQL subscriptions for real-time updates?
   - Would require WebSocket support in gateway

5. **Custom Scalars**: Should custom scalar types be supported for domain-specific data types?
   - Examples: DateTime, UUID, JSON fields

## References

- Structured Data Technical Specification: `docs/tech-specs/structured-data.md`
- Strawberry GraphQL Documentation: https://strawberry.rocks/
- GraphQL Specification: https://spec.graphql.org/
- Apache Cassandra CQL Reference: https://cassandra.apache.org/doc/stable/cassandra/cql/
- TrustGraph Flow Processor Documentation: Internal documentation
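
As a rough illustration of the CQL Generation step in the Query Execution Flow, together with the Input Sanitization point under Security Considerations, the sketch below builds a collection-scoped `SELECT` from equality filters. The helper names `build_cql` and `_ident`, the `collection` column name, and the use of the request's `user` value as the keyspace are assumptions made for illustration, not the service's actual implementation. The `%s` placeholders follow the Cassandra Python driver's positional-parameter style for simple statements. Offset is left out because CQL has no `OFFSET` clause, and the sketch does not check whether filter columns are indexed.

```python
import re

# Identifiers are interpolated into the CQL text, so anything that is not a
# plain identifier is rejected outright. Values are never interpolated.
_IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def _ident(name):
    """Return name unchanged if it is a safe CQL identifier, else raise."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_cql(keyspace, table, collection, fields, filters=None, limit=None):
    """Build (query, params) for a collection-scoped SELECT (hypothetical helper).

    The collection predicate is always present, and every filter value is
    passed as a bound parameter rather than spliced into the query text.
    """
    columns = ", ".join(_ident(f) for f in fields)
    clauses = ["collection = %s"]  # assumed partition key column name
    params = [collection]
    for column, value in (filters or {}).items():
        clauses.append(f"{_ident(column)} = %s")
        params.append(value)
    where = " AND ".join(clauses)
    query = f"SELECT {columns} FROM {_ident(keyspace)}.{_ident(table)} WHERE {where}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"  # int() guards against non-numeric input
    return query, tuple(params)
```

With the Cassandra driver, the returned pair could be passed to `session.execute(query, params)`. Keeping identifiers on a strict allowlist pattern while binding all values is one way to keep GraphQL arguments out of the query text.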