GraphQL Query Technical Specification
Overview
This specification describes the implementation of a GraphQL query interface for TrustGraph's structured data storage in Apache Cassandra. Building upon the structured data capabilities outlined in the structured-data.md specification, this document details how GraphQL queries will be executed against Cassandra tables containing extracted and ingested structured objects.
The GraphQL query service will provide a flexible, type-safe interface for querying structured data stored in Cassandra. It will dynamically adapt to schema changes, support complex queries including relationships between objects, and integrate seamlessly with TrustGraph's existing message-based architecture.
Goals
- Dynamic Schema Support: Automatically adapt to schema changes in configuration without service restarts
- GraphQL Standards Compliance: Provide a standard GraphQL interface compatible with existing GraphQL tooling and clients
- Efficient Cassandra Queries: Translate GraphQL queries into efficient Cassandra CQL queries respecting partition keys and indexes
- Relationship Resolution: Support GraphQL field resolvers for relationships between different object types
- Type Safety: Ensure type-safe query execution and response generation based on schema definitions
- Scalable Performance: Handle concurrent queries efficiently with proper connection pooling and query optimization
- Request/Response Integration: Maintain compatibility with TrustGraph's Pulsar-based request/response pattern
- Error Handling: Provide comprehensive error reporting for schema mismatches, query errors, and data validation issues
Background
The structured data storage implementation (trustgraph-flow/trustgraph/storage/objects/cassandra/) writes objects to Cassandra tables based on schema definitions stored in TrustGraph's configuration system. These tables use a composite partition key structure with collection and schema-defined primary keys, enabling efficient queries within collections.
Current limitations that this specification addresses:
- No query interface for the structured data stored in Cassandra
- Inability to leverage GraphQL's powerful query capabilities for structured data
- Missing support for relationship traversal between related objects
- Lack of a standardized query language for structured data access
The GraphQL query service will bridge these gaps by:
- Providing a standard GraphQL interface for querying Cassandra tables
- Dynamically generating GraphQL schemas from TrustGraph configuration
- Efficiently translating GraphQL queries to Cassandra CQL
- Supporting relationship resolution through field resolvers
Technical Design
Architecture
The GraphQL query service will be implemented as a new TrustGraph flow processor following established patterns:
Module Location: trustgraph-flow/trustgraph/query/objects/cassandra/
Key Components:
- GraphQL Query Service Processor
  - Extends base FlowProcessor class
  - Implements request/response pattern similar to existing query services
  - Monitors configuration for schema updates
  - Maintains GraphQL schema synchronized with configuration
- Dynamic Schema Generator
  - Converts TrustGraph RowSchema definitions to GraphQL types
  - Creates GraphQL object types with proper field definitions
  - Generates root Query type with collection-based resolvers
  - Updates GraphQL schema when configuration changes
- Query Executor
  - Parses incoming GraphQL queries using Strawberry library
  - Validates queries against current schema
  - Executes queries and returns structured responses
  - Handles errors gracefully with detailed error messages
- Cassandra Query Translator
  - Converts GraphQL selections to CQL queries
  - Optimizes queries based on available indexes and partition keys
  - Handles filtering, pagination, and sorting
  - Manages connection pooling and session lifecycle
- Relationship Resolver
  - Implements field resolvers for object relationships
  - Performs efficient batch loading to avoid N+1 queries
  - Caches resolved relationships within request context
  - Supports both forward and reverse relationship traversal
Configuration Schema Monitoring
The service will register a configuration handler to receive schema updates:
self.register_config_handler(self.on_schema_config)
When schemas change:
- Parse new schema definitions from configuration
- Regenerate GraphQL types and resolvers
- Update the executable schema
- Clear any schema-dependent caches
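The update steps above might be sketched as follows. This is a minimal sketch, not the service's actual handler: the shape of the configuration payload (a dict keyed by configuration type, holding JSON-encoded schema definitions), the `SchemaCache` class, the synchronous handler signature, and the `build_schema` callback are all assumptions made for illustration.

```python
import json
from typing import Any, Callable, Dict


class SchemaCache:
    """Hypothetical holder for the current executable schema and its caches."""

    def __init__(self, build_schema: Callable[[Dict[str, dict]], Any],
                 config_type: str = "schema"):
        self.build_schema = build_schema  # assumed: turns definitions into a schema
        self.config_type = config_type    # configuration section holding schemas
        self.definitions: Dict[str, dict] = {}
        self.schema: Any = None
        self.caches: Dict[str, Any] = {}

    def on_schema_config(self, config: Dict[str, Dict[str, str]], version: int) -> None:
        # Step 1: parse the new definitions (assumed to be JSON strings).
        raw = config.get(self.config_type, {})
        definitions = {name: json.loads(body) for name, body in raw.items()}

        # Steps 2-3: build the new schema fully before swapping it in, so a
        # failure while building leaves the previous schema in place.
        new_schema = self.build_schema(definitions)
        self.definitions, self.schema = definitions, new_schema

        # Step 4: drop anything derived from the old schema.
        self.caches.clear()
```

Building the replacement schema before assigning it is one way to keep queries running against a consistent schema while an update is processed.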
GraphQL Schema Generation
For each RowSchema in configuration, generate:
- GraphQL Object Type:
  - Map field types (string → String, integer → Int, float → Float, boolean → Boolean)
  - Mark required fields as non-nullable in GraphQL
  - Add field descriptions from schema
- Root Query Fields:
  - Collection query (e.g., customers, transactions)
  - Filtering arguments based on indexed fields
  - Pagination support (limit, offset)
  - Sorting options for sortable fields
- Relationship Fields:
  - Identify foreign key relationships from schema
  - Create field resolvers for related objects
  - Support both single object and list relationships
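To make the mapping concrete, the sketch below renders a schema as GraphQL SDL text using only the standard library. It does not use Strawberry and is not the generator's actual design: `FieldDef`, the `required` and `indexed` attribute names, and the type-naming convention are hypothetical stand-ins for whatever RowSchema provides.

```python
from dataclasses import dataclass
from typing import List

# Scalar mapping taken from the list above.
SCALARS = {"string": "String", "integer": "Int", "float": "Float", "boolean": "Boolean"}


@dataclass
class FieldDef:
    """Hypothetical, simplified view of one RowSchema field."""
    name: str
    type: str
    required: bool = False
    indexed: bool = False


def type_name(schema_name: str) -> str:
    # "order_items" -> "OrderItems"; this naming convention is an assumption.
    return "".join(part.capitalize() for part in schema_name.split("_"))


def object_type_sdl(schema_name: str, fields: List[FieldDef]) -> str:
    lines = [f"type {type_name(schema_name)} {{"]
    for f in fields:
        # Required fields become non-nullable ("!").
        lines.append(f"  {f.name}: {SCALARS[f.type]}{'!' if f.required else ''}")
    lines.append("}")
    return "\n".join(lines)


def query_field_sdl(schema_name: str, fields: List[FieldDef]) -> str:
    # Only indexed fields become filter arguments; pagination is always offered.
    args = [f"{f.name}: {SCALARS[f.type]}" for f in fields if f.indexed]
    args += ["limit: Int", "offset: Int"]
    return f"  {schema_name}({', '.join(args)}): [{type_name(schema_name)}!]!"
```

For a `customers` schema with a required `customer_id` and an indexed `status`, this yields a `Customers` type and a `customers(status: String, limit: Int, offset: Int)` root field.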
Query Execution Flow
- Request Reception:
  - Receive ObjectsQueryRequest from Pulsar
  - Extract GraphQL query string and variables
  - Identify user and collection context
- Query Validation:
  - Parse GraphQL query using Strawberry
  - Validate against current schema
  - Check field selections and argument types
- CQL Generation:
  - Analyze GraphQL selections
  - Build CQL query with proper WHERE clauses
  - Include collection in partition key
  - Apply filters based on GraphQL arguments
- Query Execution:
  - Execute CQL query against Cassandra
  - Map results to GraphQL response structure
  - Resolve any relationship fields
  - Format response according to GraphQL spec
- Response Delivery:
  - Create ObjectsQueryResponse with results
  - Include any execution errors
  - Send response via Pulsar with correlation ID
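The CQL generation step could look roughly like the sketch below. It assumes the partition key column is literally named `collection`, that the keyspace is the `user` value, and that filters are simple equality tests; `build_select` and `quote_ident` are hypothetical helpers. Values are passed as parameters using the `%s` placeholder style of the Cassandra Python driver's simple statements, never spliced into the statement text.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def quote_ident(name: str) -> str:
    # Double-quote identifiers and escape embedded quotes, so schema-derived
    # names are never interpolated raw into the statement.
    return '"' + name.replace('"', '""') + '"'


def build_select(keyspace: str, table: str, columns: Sequence[str],
                 collection: str, filters: Dict[str, object],
                 limit: Optional[int] = None) -> Tuple[str, List[object]]:
    cols = ", ".join(quote_ident(c) for c in columns)

    # The collection always leads the WHERE clause (assumed partition key).
    where = ["collection = %s"]
    params: List[object] = [collection]
    for column, value in filters.items():
        where.append(f"{quote_ident(column)} = %s")
        params.append(value)

    cql = (f"SELECT {cols} FROM {quote_ident(keyspace)}.{quote_ident(table)} "
           f"WHERE {' AND '.join(where)}")
    if limit is not None:
        cql += " LIMIT %s"
        params.append(int(limit))
    return cql, params
```

The returned pair would then be handed to the driver, for example `session.execute(cql, params)`.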
Data Models
Note: A StructuredQueryRequest/Response schema already exists in trustgraph-base/trustgraph/schema/services/structured_query.py. However, it lacks critical fields (user, collection) and uses suboptimal types. The schemas below represent the recommended evolution, which should either replace the existing schemas or be created as new ObjectsQueryRequest/Response types.
Request Schema (ObjectsQueryRequest)
from pulsar.schema import Record, String, Map, Array

class ObjectsQueryRequest(Record):
    user = String()            # Cassandra keyspace (follows pattern from TriplesQueryRequest)
    collection = String()      # Data collection identifier (required for partition key)
    query = String()           # GraphQL query string
    variables = Map(String())  # GraphQL variables (consider enhancing to support all JSON types)
    operation_name = String()  # Operation to execute for multi-operation documents
Rationale for changes from existing StructuredQueryRequest:
- Added user and collection fields to match other query services pattern
- These fields are essential for identifying the Cassandra keyspace and collection
- Variables remain as Map(String()) for now but should ideally support all JSON types
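Until variables can carry native JSON types, one possible workaround is to JSON-encode each value into the string map and decode it on the service side. This convention is an assumption for illustration, not part of the schema above, and both sender and receiver would have to agree on it.

```python
import json
from typing import Any, Dict


def encode_variables(variables: Dict[str, Any]) -> Dict[str, str]:
    """Client side: turn arbitrary JSON values into a Map(String())-compatible dict."""
    return {name: json.dumps(value) for name, value in variables.items()}


def decode_variables(encoded: Dict[str, str]) -> Dict[str, Any]:
    """Service side: restore the original JSON values before query execution."""
    return {name: json.loads(text) for name, text in encoded.items()}
```

With this scheme an integer such as `20` travels as the string `"20"`, and a string such as `active` travels with its JSON quotes, so the two remain distinguishable after decoding.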
Response Schema (ObjectsQueryResponse)
from pulsar.schema import Record, String, Array, Map
from ..core.primitives import Error

class GraphQLError(Record):
    message = String()
    path = Array(String())       # Path to the field that caused the error
    extensions = Map(String())   # Additional error metadata

class ObjectsQueryResponse(Record):
    error = Error()               # System-level error (connection, timeout, etc.)
    data = String()               # JSON-encoded GraphQL response data
    errors = Array(GraphQLError)  # GraphQL field-level errors
    extensions = Map(String())    # Query metadata (execution time, etc.)
Rationale for changes from existing StructuredQueryResponse:
- Distinguishes between system errors (error) and GraphQL errors (errors)
- Uses structured GraphQLError objects instead of string array
- Adds extensions field for GraphQL spec compliance
- Keeps data as JSON string for compatibility, though native types would be preferable
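As an illustration of how a GraphQL execution result might be flattened into these field types, the sketch below works on plain dicts rather than the Pulsar Record classes. The helper name `to_response_fields` and the choice to stringify path segments and extension values are assumptions; they follow from `path` being an array of strings and `extensions` a string map.

```python
import json
from typing import Any, Dict, List, Optional


def to_response_fields(data: Optional[dict], errors: List[dict]) -> Dict[str, Any]:
    """Convert a GraphQL result into values that fit ObjectsQueryResponse."""
    return {
        # data is carried as a JSON string; None becomes the JSON literal null.
        "data": json.dumps(data),
        "errors": [
            {
                "message": str(e.get("message", "")),
                # GraphQL paths mix field names and list indexes; the schema
                # only allows strings, so indexes are stringified.
                "path": [str(p) for p in (e.get("path") or [])],
                "extensions": {k: str(v) for k, v in (e.get("extensions") or {}).items()},
            }
            for e in errors
        ],
    }
```

A consumer would apply `json.loads` to `data` and, where it matters, convert numeric path segments back to integers.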
Cassandra Query Optimization
The service will optimize Cassandra queries by:
- Respecting Partition Keys:
  - Always include collection in queries
  - Use schema-defined primary keys efficiently
  - Avoid full table scans
- Leveraging Indexes:
  - Use secondary indexes for filtering
  - Combine multiple filters when possible
  - Warn when queries may be inefficient
- Batch Loading:
  - Collect relationship queries
  - Execute in batches to reduce round trips
  - Cache results within request context
- Connection Management:
  - Maintain persistent Cassandra sessions
  - Use connection pooling
  - Handle reconnection on failures
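The batch loading idea can be sketched independently of Cassandra. In the hypothetical `BatchLoader` below, resolvers first register the keys they need, and a single call to a caller-supplied `fetch_many` function (assumed to run one query for all keys and return a dict of key to row) serves them all. Results are cached for the lifetime of the loader, which would be one request.

```python
from typing import Callable, Dict, Hashable, List, Optional


class BatchLoader:
    """Collects keys, fetches them in one batch, and caches per request."""

    def __init__(self, fetch_many: Callable[[List[Hashable]], Dict[Hashable, dict]]):
        self.fetch_many = fetch_many
        self.cache: Dict[Hashable, Optional[dict]] = {}
        self.pending: List[Hashable] = []

    def want(self, key: Hashable) -> None:
        # Register interest without querying; duplicates are ignored.
        if key not in self.cache and key not in self.pending:
            self.pending.append(key)

    def flush(self) -> None:
        # One round trip for everything registered so far.
        if self.pending:
            found = self.fetch_many(list(self.pending))
            for key in self.pending:
                self.cache[key] = found.get(key)  # None marks "not found"
            self.pending.clear()

    def get(self, key: Hashable) -> Optional[dict]:
        self.want(key)
        self.flush()
        return self.cache[key]
```

For a list of orders, the resolver would call `want(order.customer_id)` for every order and only then `get(...)`, turning N customer lookups into one.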
Example GraphQL Queries
Simple Collection Query
{
  customers(status: "active") {
    customer_id
    name
    email
    registration_date
  }
}
Query with Relationships
{
  orders(order_date_gt: "2024-01-01") {
    order_id
    total_amount
    customer {
      name
      email
    }
    items {
      product_name
      quantity
      price
    }
  }
}
Paginated Query
{
  products(limit: 20, offset: 40) {
    product_id
    name
    price
    category
  }
}
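How `offset` is served is a design decision for the translator. CQL's SELECT offers LIMIT, and if the offset cannot be pushed down to Cassandra (an assumption in this sketch), one simple approach is to request `limit + offset` rows and discard the leading ones in the service. The helper names below are hypothetical.

```python
from itertools import islice
from typing import Iterable, List


def fetch_limit(limit: int, offset: int) -> int:
    """Number of rows to request so that the wanted page is covered."""
    return limit + offset


def page(rows: Iterable[dict], limit: int, offset: int) -> List[dict]:
    """Skip `offset` rows, then keep at most `limit` rows."""
    return list(islice(rows, offset, offset + limit))
```

For the query above this requests 60 rows and returns rows 40 through 59. The cost grows with the offset, so deep pagination may call for a different mechanism, such as the driver's paging state.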
Implementation Dependencies
- Strawberry GraphQL: For GraphQL schema definition and query execution
- Cassandra Driver: For database connectivity (already used in storage module)
- TrustGraph Base: For FlowProcessor and schema definitions
- Configuration System: For schema monitoring and updates
Command-Line Interface
The service will provide a CLI command: kg-query-objects-graphql-cassandra
Arguments:
- --cassandra-host: Cassandra cluster contact point
- --cassandra-username: Authentication username
- --cassandra-password: Authentication password
- --config-type: Configuration type for schemas (default: "schema")
- Standard FlowProcessor arguments (Pulsar configuration, etc.)
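The service-specific arguments could be declared with `argparse` along these lines. Only the `--config-type` default comes from the list above; the `localhost` default for the host and the absence of defaults for the credentials are assumptions, and the standard FlowProcessor arguments are omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kg-query-objects-graphql-cassandra")
    parser.add_argument("--cassandra-host", default="localhost",  # assumed default
                        help="Cassandra cluster contact point")
    parser.add_argument("--cassandra-username", default=None,
                        help="Authentication username")
    parser.add_argument("--cassandra-password", default=None,
                        help="Authentication password")
    parser.add_argument("--config-type", default="schema",
                        help="Configuration type for schemas")
    return parser
```

Calling `build_parser().parse_args(["--cassandra-host", "cass1"])` yields a namespace with `cassandra_host` set to `cass1` and `config_type` left at `schema`.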
API Integration
Pulsar Topics
Input Topic: objects-graphql-query-request
- Schema: ObjectsQueryRequest
- Receives GraphQL queries from gateway services
Output Topic: objects-graphql-query-response
- Schema: ObjectsQueryResponse
- Returns query results and errors
Gateway Integration
The gateway and reverse-gateway will need endpoints to:
- Accept GraphQL queries from clients
- Forward to the query service via Pulsar
- Return responses to clients
- Support GraphQL introspection queries
Agent Tool Integration
A new agent tool class will enable:
- Natural language to GraphQL query generation
- Direct GraphQL query execution
- Result interpretation and formatting
- Integration with agent decision flows
Security Considerations
- Query Depth Limiting: Prevent deeply nested queries that could cause performance issues
- Query Complexity Analysis: Limit query complexity to prevent resource exhaustion
- Field-Level Permissions: Future support for field-level access control based on user roles
- Input Sanitization: Validate and sanitize all query inputs to prevent injection attacks
- Rate Limiting: Implement query rate limiting per user/collection
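Query depth limiting can be approximated before the query ever reaches the GraphQL parser. The sketch below is a crude pre-check using only the standard library, not a replacement for validation against the parsed document: it counts nested selection sets by tracking braces, skips string literals and comments, and ignores braces inside parentheses (argument objects). It does not follow fragment spreads, so a real implementation would need to work on the parsed query. The function names and the default limit of 10 are assumptions.

```python
def selection_depth(query: str) -> int:
    """Return the maximum nesting depth of selection sets in a query string."""
    depth = max_depth = parens = 0
    in_string = False
    i = 0
    while i < len(query):
        ch = query[i]
        if in_string:
            if ch == "\\":
                i += 1              # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "#":
            # A comment runs to the end of the line.
            while i < len(query) and query[i] != "\n":
                i += 1
        elif ch == "(":
            parens += 1
        elif ch == ")":
            parens -= 1
        elif ch == "{" and parens == 0:
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}" and parens == 0:
            depth -= 1
        i += 1
    return max_depth


def check_depth(query: str, max_depth: int = 10) -> None:
    """Raise ValueError if the query nests deeper than max_depth (assumed limit)."""
    found = selection_depth(query)
    if found > max_depth:
        raise ValueError(f"query depth {found} exceeds limit {max_depth}")
```

A query that selects `orders`, then `customer`, then `name` has depth 3 under this measure.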
Performance Considerations
- Query Planning: Analyze queries before execution to optimize CQL generation
- Result Caching: Consider caching frequently accessed data at the field resolver level
- Connection Pooling: Maintain efficient connection pools to Cassandra
- Batch Operations: Combine multiple queries when possible to reduce latency
- Monitoring: Track query performance metrics for optimization
Testing Strategy
Unit Tests
- Schema generation from RowSchema definitions
- GraphQL query parsing and validation
- CQL query generation logic
- Field resolver implementations
Contract Tests
- Pulsar message contract compliance
- GraphQL schema validity
- Response format verification
- Error structure validation
Integration Tests
- End-to-end query execution against test Cassandra instance
- Schema update handling
- Relationship resolution
- Pagination and filtering
- Error scenarios
Performance Tests
- Query throughput under load
- Response time for various query complexities
- Memory usage with large result sets
- Connection pool efficiency
Migration Plan
No migration required as this is a new capability. The service will:
- Read existing schemas from configuration
- Connect to existing Cassandra tables created by the storage module
- Start accepting queries immediately upon deployment
Timeline
- Week 1-2: Core service implementation and schema generation
- Week 3: Query execution and CQL translation
- Week 4: Relationship resolution and optimization
- Week 5: Testing and performance tuning
- Week 6: Gateway integration and documentation
Open Questions
- Schema Evolution: How should the service handle queries during schema transitions?
  - Option: Queue queries during schema updates
  - Option: Support multiple schema versions simultaneously
- Caching Strategy: Should query results be cached?
  - Consider: Time-based expiration
  - Consider: Event-based invalidation
- Federation Support: Should the service support GraphQL federation for combining with other data sources?
  - Would enable unified queries across structured and graph data
- Subscription Support: Should the service support GraphQL subscriptions for real-time updates?
  - Would require WebSocket support in gateway
- Custom Scalars: Should custom scalar types be supported for domain-specific data types?
  - Examples: DateTime, UUID, JSON fields
References
- Structured Data Technical Specification: docs/tech-specs/structured-data.md
- Strawberry GraphQL Documentation: https://strawberry.rocks/
- GraphQL Specification: https://spec.graphql.org/
- Apache Cassandra CQL Reference: https://cassandra.apache.org/doc/stable/cassandra/cql/
- TrustGraph Flow Processor Documentation: Internal documentation