mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 08:26:21 +02:00
Feature/graphql table query (#486)
* Tech spec * Object query service for Cassandra * Gateway support for objects-query * GraphQL query utility * Filters, ordering
This commit is contained in:
parent
38826c7de1
commit
672e358b2f
20 changed files with 3133 additions and 3 deletions
383
docs/tech-specs/graphql-query.md
Normal file
@ -0,0 +1,383 @@
# GraphQL Query Technical Specification

## Overview

This specification describes the implementation of a GraphQL query interface for TrustGraph's structured data storage in Apache Cassandra. Building upon the structured data capabilities outlined in the structured-data.md specification, this document details how GraphQL queries will be executed against Cassandra tables containing extracted and ingested structured objects.

The GraphQL query service will provide a flexible, type-safe interface for querying structured data stored in Cassandra. It will dynamically adapt to schema changes, support complex queries including relationships between objects, and integrate seamlessly with TrustGraph's existing message-based architecture.

## Goals

- **Dynamic Schema Support**: Automatically adapt to schema changes in configuration without service restarts
- **GraphQL Standards Compliance**: Provide a standard GraphQL interface compatible with existing GraphQL tooling and clients
- **Efficient Cassandra Queries**: Translate GraphQL queries into efficient Cassandra CQL queries respecting partition keys and indexes
- **Relationship Resolution**: Support GraphQL field resolvers for relationships between different object types
- **Type Safety**: Ensure type-safe query execution and response generation based on schema definitions
- **Scalable Performance**: Handle concurrent queries efficiently with proper connection pooling and query optimization
- **Request/Response Integration**: Maintain compatibility with TrustGraph's Pulsar-based request/response pattern
- **Error Handling**: Provide comprehensive error reporting for schema mismatches, query errors, and data validation issues

## Background

The structured data storage implementation (trustgraph-flow/trustgraph/storage/objects/cassandra/) writes objects to Cassandra tables based on schema definitions stored in TrustGraph's configuration system. These tables use a composite partition key structure with collection and schema-defined primary keys, enabling efficient queries within collections.

Current limitations that this specification addresses:
- No query interface for the structured data stored in Cassandra
- Inability to leverage GraphQL's powerful query capabilities for structured data
- Missing support for relationship traversal between related objects
- Lack of a standardized query language for structured data access

The GraphQL query service will bridge these gaps by:
- Providing a standard GraphQL interface for querying Cassandra tables
- Dynamically generating GraphQL schemas from TrustGraph configuration
- Efficiently translating GraphQL queries to Cassandra CQL
- Supporting relationship resolution through field resolvers

## Technical Design

### Architecture

The GraphQL query service will be implemented as a new TrustGraph flow processor following established patterns:

**Module Location**: `trustgraph-flow/trustgraph/query/objects/cassandra/`

**Key Components**:

1. **GraphQL Query Service Processor**
   - Extends base FlowProcessor class
   - Implements request/response pattern similar to existing query services
   - Monitors configuration for schema updates
   - Maintains GraphQL schema synchronized with configuration

2. **Dynamic Schema Generator**
   - Converts TrustGraph RowSchema definitions to GraphQL types
   - Creates GraphQL object types with proper field definitions
   - Generates root Query type with collection-based resolvers
   - Updates GraphQL schema when configuration changes

3. **Query Executor**
   - Parses incoming GraphQL queries using Strawberry library
   - Validates queries against current schema
   - Executes queries and returns structured responses
   - Handles errors gracefully with detailed error messages

4. **Cassandra Query Translator**
   - Converts GraphQL selections to CQL queries
   - Optimizes queries based on available indexes and partition keys
   - Handles filtering, pagination, and sorting
   - Manages connection pooling and session lifecycle

5. **Relationship Resolver**
   - Implements field resolvers for object relationships
   - Performs efficient batch loading to avoid N+1 queries
   - Caches resolved relationships within request context
   - Supports both forward and reverse relationship traversal

### Configuration Schema Monitoring

The service will register a configuration handler to receive schema updates:

```python
self.register_config_handler(self.on_schema_config)
```

When schemas change:
1. Parse new schema definitions from configuration
2. Regenerate GraphQL types and resolvers
3. Update the executable schema
4. Clear any schema-dependent caches
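
The four steps above might look roughly like the following handler. This is a minimal sketch under stated assumptions, not the service's actual code: the shape of the configuration payload (a dict keyed by config type whose values are JSON-encoded schema definitions), the `SchemaRegistry` class, the `version` argument, and the `build_schema` callback are all hypothetical names introduced for illustration.

```python
import json
from typing import Any, Callable, Dict


class SchemaRegistry:
    """Holds parsed schemas and the schema object derived from them (hypothetical)."""

    def __init__(self, build_schema: Callable[[Dict[str, Any]], Any],
                 config_type: str = "schema"):
        self.build_schema = build_schema   # assumed: regenerates types and resolvers
        self.config_type = config_type
        self.schemas: Dict[str, Any] = {}
        self.graphql_schema: Any = None
        self.cache: Dict[str, Any] = {}    # stands in for any schema-dependent cache

    def on_schema_config(self, config: Dict[str, Dict[str, str]], version: int) -> None:
        # 1. Parse new schema definitions from configuration.
        raw = config.get(self.config_type, {})
        parsed = {name: json.loads(text) for name, text in raw.items()}

        # 2 and 3. Regenerate types and resolvers, then swap in the new schema.
        # Building first and assigning last means a failure in build_schema
        # leaves the previous schema in place.
        new_schema = self.build_schema(parsed)
        self.schemas = parsed
        self.graphql_schema = new_schema

        # 4. Clear schema-dependent caches.
        self.cache.clear()
```

Building the replacement schema before assigning it is one way to keep queries served from a consistent schema while an update is in progress.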

### GraphQL Schema Generation

For each RowSchema in configuration, generate:

1. **GraphQL Object Type**:
   - Map field types (string → String, integer → Int, float → Float, boolean → Boolean)
   - Mark required fields as non-nullable in GraphQL
   - Add field descriptions from schema

2. **Root Query Fields**:
   - Collection query (e.g., `customers`, `transactions`)
   - Filtering arguments based on indexed fields
   - Pagination support (limit, offset)
   - Sorting options for sortable fields

3. **Relationship Fields**:
   - Identify foreign key relationships from schema
   - Create field resolvers for related objects
   - Support both single object and list relationships
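
To make the object-type step concrete, the sketch below renders a schema's fields as GraphQL SDL text using only the standard library. It is illustrative only: the field dictionary keys (`name`, `type`, `required`, `description`) are an assumed shape for a RowSchema field, and a real implementation would more likely build Strawberry types directly rather than SDL strings.

```python
import json
from typing import Dict, List

# Type mapping taken from the list above.
GRAPHQL_TYPES: Dict[str, str] = {
    "string": "String",
    "integer": "Int",
    "float": "Float",
    "boolean": "Boolean",
}


def field_to_sdl(field: dict) -> str:
    """Render one field as an SDL line; required fields become non-nullable."""
    gql_type = GRAPHQL_TYPES[field["type"]]
    if field.get("required"):
        gql_type += "!"
    line = f"  {field['name']}: {gql_type}"
    if field.get("description"):
        # SDL descriptions are string literals placed before the field;
        # json.dumps is used here to get a quoted, escaped string.
        description = json.dumps(field["description"])
        line = f"  {description}\n{line}"
    return line


def object_type_sdl(type_name: str, fields: List[dict]) -> str:
    """Render a whole object type from a list of field definitions."""
    body = "\n".join(field_to_sdl(f) for f in fields)
    return f"type {type_name} {{\n{body}\n}}"
```

For example, `object_type_sdl("Customer", [{"name": "customer_id", "type": "string", "required": True}])` yields a `Customer` type whose `customer_id` field is declared as `String!`.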

### Query Execution Flow

1. **Request Reception**:
   - Receive ObjectsQueryRequest from Pulsar
   - Extract GraphQL query string and variables
   - Identify user and collection context

2. **Query Validation**:
   - Parse GraphQL query using Strawberry
   - Validate against current schema
   - Check field selections and argument types

3. **CQL Generation**:
   - Analyze GraphQL selections
   - Build CQL query with proper WHERE clauses
   - Include collection in partition key
   - Apply filters based on GraphQL arguments

4. **Query Execution**:
   - Execute CQL query against Cassandra
   - Map results to GraphQL response structure
   - Resolve any relationship fields
   - Format response according to GraphQL spec

5. **Response Delivery**:
   - Create ObjectsQueryResponse with results
   - Include any execution errors
   - Send response via Pulsar with correlation ID
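
The CQL generation step (step 3) can be sketched as a small query builder. This is a simplified illustration, not the planned translator: it assumes the keyspace is the `user`, the table is named after the schema, the partition column is literally called `collection`, and all filters are equality tests. Values are passed as bound parameters using the `%s` placeholder style that the DataStax Python driver accepts for non-prepared statements; identifiers cannot be bound, so they are checked against a strict pattern instead.

```python
import re
from typing import Any, Dict, List, Optional, Tuple

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _ident(name: str) -> str:
    """Allow only plain identifiers, since names cannot be bound as parameters."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def build_select(keyspace: str, table: str, fields: List[str], collection: str,
                 filters: Optional[Dict[str, Any]] = None,
                 limit: Optional[int] = None) -> Tuple[str, List[Any]]:
    """Return (cql, params) for a collection-scoped SELECT."""
    columns = ", ".join(_ident(f) for f in fields)

    # The collection is always constrained first (assumed partition column).
    clauses = ["collection = %s"]
    params: List[Any] = [collection]

    for column, value in (filters or {}).items():
        clauses.append(f"{_ident(column)} = %s")
        params.append(value)

    where = " AND ".join(clauses)
    cql = f"SELECT {columns} FROM {_ident(keyspace)}.{_ident(table)} WHERE {where}"

    if limit is not None:
        cql += " LIMIT %s"
        params.append(int(limit))
    return cql, params
```

Selecting only the columns named in the GraphQL selection set, rather than `SELECT *`, keeps the result rows aligned with what the response needs.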

### Data Models

> **Note**: A StructuredQueryRequest/Response schema already exists in `trustgraph-base/trustgraph/schema/services/structured_query.py`. However, it lacks critical fields (user, collection) and uses suboptimal types (for example, errors as a plain string array). The schemas below represent the recommended evolution, which should either replace the existing schemas or be created as new ObjectsQueryRequest/Response types.

#### Request Schema (ObjectsQueryRequest)

```python
from pulsar.schema import Record, String, Map

class ObjectsQueryRequest(Record):
    user = String()              # Cassandra keyspace (follows pattern from TriplesQueryRequest)
    collection = String()        # Data collection identifier (required for partition key)
    query = String()             # GraphQL query string
    variables = Map(String())    # GraphQL variables (consider enhancing to support all JSON types)
    operation_name = String()    # Operation to execute for multi-operation documents
```

**Rationale for changes from existing StructuredQueryRequest:**
- Added `user` and `collection` fields to match the pattern used by other query services
- These fields are essential for identifying the Cassandra keyspace and collection
- Variables remain as Map(String()) for now but should ideally support all JSON types
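
Because `Map(String())` can only carry string values, non-string GraphQL variables need some encoding convention until the type is widened. One possible workaround, shown here purely as an assumption and not something this specification prescribes, is to JSON-encode each variable value on the sending side and decode it in the service. The helper names are hypothetical.

```python
import json
from typing import Any, Dict


def encode_variables(variables: Dict[str, Any]) -> Dict[str, str]:
    """Client side: turn arbitrary JSON-compatible values into strings."""
    return {name: json.dumps(value) for name, value in variables.items()}


def decode_variables(encoded: Dict[str, str]) -> Dict[str, Any]:
    """Service side: recover the original values before executing the query."""
    return {name: json.loads(text) for name, text in encoded.items()}
```

With this convention, an integer `limit` of 20 travels as the string `"20"` and a string `status` of `active` travels as `"\"active\""`, so both sides must agree that every value is JSON-encoded.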

#### Response Schema (ObjectsQueryResponse)

```python
from pulsar.schema import Record, String, Array, Map
from ..core.primitives import Error

class GraphQLError(Record):
    message = String()
    path = Array(String())       # Path to the field that caused the error
    extensions = Map(String())   # Additional error metadata

class ObjectsQueryResponse(Record):
    error = Error()                    # System-level error (connection, timeout, etc.)
    data = String()                    # JSON-encoded GraphQL response data
    errors = Array(GraphQLError)       # GraphQL field-level errors
    extensions = Map(String())         # Query metadata (execution time, etc.)
```

**Rationale for changes from existing StructuredQueryResponse:**
- Distinguishes between system errors (`error`) and GraphQL errors (`errors`)
- Uses structured GraphQLError objects instead of string array
- Adds `extensions` field for GraphQL spec compliance
- Keeps data as JSON string for compatibility, though native types would be preferable
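
The split between `error` and `errors` can be illustrated with plain dataclasses standing in for the Pulsar records, so the example runs without Pulsar installed. Everything here is a hypothetical sketch: `FieldError`, `QueryResponse`, and the two constructor functions are illustrative names, and the system error is reduced to a string rather than TrustGraph's `Error` record.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FieldError:
    """Stand-in for GraphQLError."""
    message: str
    path: List[str] = field(default_factory=list)
    extensions: Dict[str, str] = field(default_factory=dict)


@dataclass
class QueryResponse:
    """Stand-in for ObjectsQueryResponse."""
    error: Optional[str] = None
    data: str = ""
    errors: List[FieldError] = field(default_factory=list)
    extensions: Dict[str, str] = field(default_factory=dict)


def from_execution(data: Any, errors: List[dict]) -> QueryResponse:
    """The query ran: return data plus any field-level errors (partial success)."""
    return QueryResponse(
        data=json.dumps(data),
        errors=[
            # GraphQL paths may contain list indices (ints); the schema's
            # Array(String()) means they have to be stringified.
            FieldError(message=e["message"],
                       path=[str(p) for p in e.get("path", [])])
            for e in errors
        ],
    )


def from_system_failure(exc: Exception) -> QueryResponse:
    """The query could not run at all: only the system-level error is set."""
    return QueryResponse(error=f"{type(exc).__name__}: {exc}")
```

A response built by `from_execution` can carry both `data` and `errors` at once, which matches how GraphQL reports a partially successful query.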

### Cassandra Query Optimization

The service will optimize Cassandra queries by:

1. **Respecting Partition Keys**:
   - Always include collection in queries
   - Use schema-defined primary keys efficiently
   - Avoid full table scans

2. **Leveraging Indexes**:
   - Use secondary indexes for filtering
   - Combine multiple filters when possible
   - Warn when queries may be inefficient

3. **Batch Loading**:
   - Collect relationship queries
   - Execute in batches to reduce round trips
   - Cache results within request context

4. **Connection Management**:
   - Maintain persistent Cassandra sessions
   - Use connection pooling
   - Handle reconnection on failures
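
The batch loading idea (item 3) is essentially a request-scoped loader: gather the keys that relationship resolvers need, fetch the missing ones in a single call, and remember the results for the rest of the request. The sketch below shows the pattern in isolation. `BatchLoader` and its `fetch_many` callback are hypothetical; in the service, `fetch_many` would wrap whatever Cassandra lookup retrieves several related rows at once.

```python
from typing import Any, Callable, Dict, Hashable, Iterable, List


class BatchLoader:
    """Request-scoped loader that deduplicates and caches key lookups."""

    def __init__(self, fetch_many: Callable[[List[Hashable]], Dict[Hashable, Any]]):
        self.fetch_many = fetch_many          # assumed: one round trip per call
        self.cache: Dict[Hashable, Any] = {}  # lives only as long as the request

    def load_many(self, keys: Iterable[Hashable]) -> List[Any]:
        keys = list(keys)
        # dict.fromkeys removes duplicates while preserving order.
        missing = [k for k in dict.fromkeys(keys) if k not in self.cache]
        if missing:
            fetched = self.fetch_many(missing)
            for k in missing:
                # Keys with no matching row are cached as None so they are
                # not fetched again within the same request.
                self.cache[k] = fetched.get(k)
        return [self.cache[k] for k in keys]
```

Resolving the `customer` field for a page of orders then costs one lookup for the distinct customer IDs instead of one lookup per order.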

### Example GraphQL Queries

#### Simple Collection Query
```graphql
{
  customers(status: "active") {
    customer_id
    name
    email
    registration_date
  }
}
```

#### Query with Relationships
```graphql
{
  orders(order_date_gt: "2024-01-01") {
    order_id
    total_amount
    customer {
      name
      email
    }
    items {
      product_name
      quantity
      price
    }
  }
}
```

#### Paginated Query
```graphql
{
  products(limit: 20, offset: 40) {
    product_id
    name
    price
    category
  }
}
```
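
The `limit` and `offset` arguments in the paginated query need some care on the Cassandra side, because CQL's `SELECT` offers a `LIMIT` clause but, as far as the CQL reference documents, no `OFFSET`. One simple way to honour `offset`, sketched here as an assumption rather than the chosen design, is to read `offset + limit` rows and discard the first `offset` in the service. The `paginate` helper is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def paginate(rows: Iterable[T], limit: int, offset: int = 0) -> Iterator[T]:
    """Skip `offset` rows, then yield at most `limit` rows."""
    if limit < 0 or offset < 0:
        raise ValueError("limit and offset must be non-negative")
    # islice consumes the skipped rows lazily, so `rows` can be a driver
    # result iterator; the skipped rows are still read from the database.
    return islice(rows, offset, offset + limit)
```

The cost grows with `offset`, so deep pages are expensive with this approach. Cursor-style pagination based on the driver's paging state may be worth considering if large offsets are expected.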

### Implementation Dependencies

- **Strawberry GraphQL**: For GraphQL schema definition and query execution
- **Cassandra Driver**: For database connectivity (already used in storage module)
- **TrustGraph Base**: For FlowProcessor and schema definitions
- **Configuration System**: For schema monitoring and updates

### Command-Line Interface

The service will provide a CLI command: `kg-query-objects-graphql-cassandra`

Arguments:
- `--cassandra-host`: Cassandra cluster contact point
- `--cassandra-username`: Authentication username
- `--cassandra-password`: Authentication password
- `--config-type`: Configuration type for schemas (default: "schema")
- Standard FlowProcessor arguments (Pulsar configuration, etc.)
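
The service-specific arguments listed above could be declared with `argparse` along these lines. This is only a sketch: the `build_parser` function is hypothetical, the `localhost` default for `--cassandra-host` is an assumption (only the `--config-type` default is stated in this document), and the standard FlowProcessor arguments are omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kg-query-objects-graphql-cassandra")
    parser.add_argument(
        "--cassandra-host", default="localhost",  # default is an assumption
        help="Cassandra cluster contact point",
    )
    parser.add_argument(
        "--cassandra-username", default=None,
        help="Authentication username",
    )
    parser.add_argument(
        "--cassandra-password", default=None,
        help="Authentication password",
    )
    parser.add_argument(
        "--config-type", default="schema",
        help="Configuration type for schemas",
    )
    return parser
```

Calling `build_parser().parse_args(["--cassandra-host", "cassandra"])` would return a namespace with `cassandra_host` set to `cassandra` and `config_type` left at `schema`.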

## API Integration

### Pulsar Topics

**Input Topic**: `objects-graphql-query-request`
- Schema: ObjectsQueryRequest
- Receives GraphQL queries from gateway services

**Output Topic**: `objects-graphql-query-response`
- Schema: ObjectsQueryResponse
- Returns query results and errors

### Gateway Integration

The gateway and reverse-gateway will need endpoints to:
1. Accept GraphQL queries from clients
2. Forward to the query service via Pulsar
3. Return responses to clients
4. Support GraphQL introspection queries

### Agent Tool Integration

A new agent tool class will enable:
- Natural language to GraphQL query generation
- Direct GraphQL query execution
- Result interpretation and formatting
- Integration with agent decision flows

## Security Considerations

- **Query Depth Limiting**: Prevent deeply nested queries that could cause performance issues
- **Query Complexity Analysis**: Limit query complexity to prevent resource exhaustion
- **Field-Level Permissions**: Future support for field-level access control based on user roles
- **Input Sanitization**: Validate and sanitize all query inputs to prevent injection attacks
- **Rate Limiting**: Implement query rate limiting per user/collection
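
As a rough illustration of query depth limiting, the sketch below estimates nesting depth by counting braces in the query text while skipping ordinary string literals. It is deliberately naive and the function names are hypothetical: it does not expand fragments, does not handle block strings, and also counts braces of input object arguments. A production check would walk the parsed query document instead.

```python
def query_depth(query: str) -> int:
    """Return the maximum brace nesting depth of a GraphQL query string."""
    depth = 0
    max_depth = 0
    in_string = False
    escaped = False
    for ch in query:
        if in_string:
            # Inside a string literal: ignore braces, honour backslash escapes.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth


def enforce_depth(query: str, limit: int) -> None:
    """Reject queries nested more deeply than `limit`."""
    depth = query_depth(query)
    if depth > limit:
        raise ValueError(f"query depth {depth} exceeds limit {limit}")
```

By this measure, a query that selects `orders`, then `customer` inside it, has depth 3: the outer selection set, the `orders` selection set, and the `customer` selection set.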

## Performance Considerations

- **Query Planning**: Analyze queries before execution to optimize CQL generation
- **Result Caching**: Consider caching frequently accessed data at the field resolver level
- **Connection Pooling**: Maintain efficient connection pools to Cassandra
- **Batch Operations**: Combine multiple queries when possible to reduce latency
- **Monitoring**: Track query performance metrics for optimization

## Testing Strategy

### Unit Tests
- Schema generation from RowSchema definitions
- GraphQL query parsing and validation
- CQL query generation logic
- Field resolver implementations

### Contract Tests
- Pulsar message contract compliance
- GraphQL schema validity
- Response format verification
- Error structure validation

### Integration Tests
- End-to-end query execution against test Cassandra instance
- Schema update handling
- Relationship resolution
- Pagination and filtering
- Error scenarios

### Performance Tests
- Query throughput under load
- Response time for various query complexities
- Memory usage with large result sets
- Connection pool efficiency

## Migration Plan

No migration required as this is a new capability. The service will:
1. Read existing schemas from configuration
2. Connect to existing Cassandra tables created by the storage module
3. Start accepting queries immediately upon deployment

## Timeline

- Week 1-2: Core service implementation and schema generation
- Week 3: Query execution and CQL translation
- Week 4: Relationship resolution and optimization
- Week 5: Testing and performance tuning
- Week 6: Gateway integration and documentation

## Open Questions

1. **Schema Evolution**: How should the service handle queries during schema transitions?
   - Option: Queue queries during schema updates
   - Option: Support multiple schema versions simultaneously

2. **Caching Strategy**: Should query results be cached?
   - Consider: Time-based expiration
   - Consider: Event-based invalidation

3. **Federation Support**: Should the service support GraphQL federation for combining with other data sources?
   - Would enable unified queries across structured and graph data

4. **Subscription Support**: Should the service support GraphQL subscriptions for real-time updates?
   - Would require WebSocket support in gateway

5. **Custom Scalars**: Should custom scalar types be supported for domain-specific data types?
   - Examples: DateTime, UUID, JSON fields

## References

- Structured Data Technical Specification: `docs/tech-specs/structured-data.md`
- Strawberry GraphQL Documentation: https://strawberry.rocks/
- GraphQL Specification: https://spec.graphql.org/
- Apache Cassandra CQL Reference: https://cassandra.apache.org/doc/stable/cassandra/cql/
- TrustGraph Flow Processor Documentation: Internal documentation