mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-04-25 08:26:21 +02:00


2025-09-20 16:00:37 +01:00


GraphQL Query Technical Specification

Overview

This specification describes the implementation of a GraphQL query interface for TrustGraph's structured data storage in Apache Cassandra. Building upon the structured data capabilities outlined in the structured-data.md specification, this document details how GraphQL queries will be executed against Cassandra tables containing extracted and ingested structured objects.

The GraphQL query service will provide a flexible, type-safe interface for querying structured data stored in Cassandra. It will dynamically adapt to schema changes, support complex queries including relationships between objects, and integrate seamlessly with TrustGraph's existing message-based architecture.

Goals

Dynamic Schema Support: Automatically adapt to schema changes in configuration without service restarts
GraphQL Standards Compliance: Provide a standard GraphQL interface compatible with existing GraphQL tooling and clients
Efficient Cassandra Queries: Translate GraphQL queries into efficient Cassandra CQL queries respecting partition keys and indexes
Relationship Resolution: Support GraphQL field resolvers for relationships between different object types
Type Safety: Ensure type-safe query execution and response generation based on schema definitions
Scalable Performance: Handle concurrent queries efficiently with proper connection pooling and query optimization
Request/Response Integration: Maintain compatibility with TrustGraph's Pulsar-based request/response pattern
Error Handling: Provide comprehensive error reporting for schema mismatches, query errors, and data validation issues

Background

The structured data storage implementation (trustgraph-flow/trustgraph/storage/objects/cassandra/) writes objects to Cassandra tables based on schema definitions stored in TrustGraph's configuration system. These tables use a composite partition key structure with collection and schema-defined primary keys, enabling efficient queries within collections.

Current limitations that this specification addresses:

No query interface for the structured data stored in Cassandra
No way to apply GraphQL's query capabilities (field selection, nested selections, typed arguments) to structured data
Missing support for relationship traversal between related objects
Lack of a standardized query language for structured data access

The GraphQL query service will bridge these gaps by:

Providing a standard GraphQL interface for querying Cassandra tables
Dynamically generating GraphQL schemas from TrustGraph configuration
Efficiently translating GraphQL queries to Cassandra CQL
Supporting relationship resolution through field resolvers

Technical Design

Architecture

The GraphQL query service will be implemented as a new TrustGraph flow processor following established patterns:

Module Location: trustgraph-flow/trustgraph/query/objects/cassandra/

Key Components:

GraphQL Query Service Processor
- Extends base FlowProcessor class
- Implements request/response pattern similar to existing query services
- Monitors configuration for schema updates
- Maintains GraphQL schema synchronized with configuration
Dynamic Schema Generator
- Converts TrustGraph RowSchema definitions to GraphQL types
- Creates GraphQL object types with proper field definitions
- Generates root Query type with collection-based resolvers
- Updates GraphQL schema when configuration changes
Query Executor
- Parses incoming GraphQL queries using the Strawberry library
- Validates queries against current schema
- Executes queries and returns structured responses
- Handles errors gracefully with detailed error messages
Cassandra Query Translator
- Converts GraphQL selections to CQL queries
- Optimizes queries based on available indexes and partition keys
- Handles filtering, pagination, and sorting
- Manages connection pooling and session lifecycle
Relationship Resolver
- Implements field resolvers for object relationships
- Performs efficient batch loading to avoid N+1 queries
- Caches resolved relationships within request context
- Supports both forward and reverse relationship traversal

Configuration Schema Monitoring

The service will register a configuration handler to receive schema updates:

self.register_config_handler(self.on_schema_config)

When schemas change:

Parse new schema definitions from configuration
Regenerate GraphQL types and resolvers
Update the executable schema
Clear any schema-dependent caches
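
The update steps above can be sketched as follows. This is a minimal illustration, not the service's actual handler: the `SchemaRegistry` class, the `(config, version)` handler signature, and the assumption that `config["schema"]` maps schema names to JSON strings are all hypothetical.

```python
import json


class SchemaRegistry:
    """Holds the current schemas and rebuilds derived state on config changes."""

    def __init__(self, config_type="schema"):
        self.config_type = config_type
        self.schemas = {}    # schema name -> parsed definition
        self.version = None
        self.cache = {}      # stand-in for any schema-dependent cache

    def on_schema_config(self, config, version):
        # Assumed layout: config[config_type] maps schema names to JSON strings.
        raw = config.get(self.config_type, {})
        # Parse everything first; a malformed definition raises here,
        # before any existing state has been replaced.
        parsed = {name: json.loads(text) for name, text in raw.items()}
        self.schemas = parsed
        self.version = version
        # Regenerating GraphQL types would happen here; then drop stale caches.
        self.cache.clear()


registry = SchemaRegistry()
registry.cache["customers:plan"] = "stale"
registry.on_schema_config(
    {"schema": {"customers": '{"fields": [{"name": "customer_id", "type": "string"}]}'}},
    version=2,
)
print(sorted(registry.schemas), registry.version, registry.cache)
# prints: ['customers'] 2 {}
```

Parsing before swapping means a bad update leaves the previous schema serving queries, which is one possible answer to the schema-transition question raised under Open Questions.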

GraphQL Schema Generation

For each RowSchema in configuration, generate:

GraphQL Object Type:
- Map field types (string → String, integer → Int, float → Float, boolean → Boolean)
- Mark required fields as non-nullable in GraphQL
- Add field descriptions from schema
Root Query Fields:
- Collection query (e.g., customers, transactions)
- Filtering arguments based on indexed fields
- Pagination support (limit, offset)
- Sorting options for sortable fields
Relationship Fields:
- Identify foreign key relationships from schema
- Create field resolvers for related objects
- Support both single object and list relationships
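
As an illustration of the type mapping, the sketch below renders one schema as GraphQL SDL text. It assumes a dictionary layout (`name`, `fields`, `type`, `required`, `description`) that may not match the real RowSchema structure, and it emits plain SDL rather than Strawberry types so that it runs without extra dependencies. The helper names are hypothetical.

```python
import json

# Scalar mapping taken from the list above.
TYPE_MAP = {
    "string": "String",
    "integer": "Int",
    "float": "Float",
    "boolean": "Boolean",
}


def to_type_name(schema_name):
    # "customer_orders" -> "CustomerOrders"
    return "".join(part.capitalize() for part in schema_name.split("_"))


def generate_object_type(schema):
    """Render one schema (assumed dict layout) as a GraphQL object type."""
    lines = [f"type {to_type_name(schema['name'])} {{"]
    for field in schema["fields"]:
        gql_type = TYPE_MAP[field["type"]]
        if field.get("required"):
            gql_type += "!"  # required fields become non-nullable
        if field.get("description"):
            # json.dumps yields a quoted, escaped string usable as an SDL description
            lines.append("  " + json.dumps(field["description"]))
        lines.append(f"  {field['name']}: {gql_type}")
    lines.append("}")
    return "\n".join(lines)


schema = {
    "name": "customers",
    "fields": [
        {"name": "customer_id", "type": "string", "required": True,
         "description": "Unique customer identifier"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": "integer"},
    ],
}
print(generate_object_type(schema))
```

An unknown field type raises `KeyError` here; a real generator would more likely report it as a schema mismatch.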

Query Execution Flow

Request Reception:
- Receive ObjectsQueryRequest from Pulsar
- Extract GraphQL query string and variables
- Identify user and collection context
Query Validation:
- Parse GraphQL query using Strawberry
- Validate against current schema
- Check field selections and argument types
CQL Generation:
- Analyze GraphQL selections
- Build CQL query with proper WHERE clauses
- Include collection in partition key
- Apply filters based on GraphQL arguments
Query Execution:
- Execute CQL query against Cassandra
- Map results to GraphQL response structure
- Resolve any relationship fields
- Format response according to GraphQL spec
Response Delivery:
- Create ObjectsQueryResponse with results
- Include any execution errors
- Send response via Pulsar with correlation ID
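
The CQL generation step could look roughly like the sketch below. It assumes the table is named after the schema, that only equality filters are needed, and that values are bound through the `%s` placeholders the Python Cassandra driver accepts for non-prepared statements. `build_cql` and its parameters are hypothetical names, not part of the existing code base.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_cql(keyspace, table, fields, collection, filters, allowed_fields, limit=None):
    """Build a parameterised SELECT that always constrains the collection."""
    # Identifiers cannot be bound as parameters, so check them strictly.
    for name in [keyspace, table, *fields, *filters]:
        if not IDENTIFIER.fullmatch(name):
            raise ValueError(f"invalid identifier: {name!r}")
    unknown = (set(fields) | set(filters)) - set(allowed_fields)
    if unknown:
        raise ValueError(f"fields not in schema: {sorted(unknown)}")

    where = ["collection = %s"]
    params = [collection]
    for name, value in filters.items():
        where.append(f"{name} = %s")
        params.append(value)

    cql = f"SELECT {', '.join(fields)} FROM {keyspace}.{table} WHERE {' AND '.join(where)}"
    if limit is not None:
        cql += f" LIMIT {int(limit)}"  # int() keeps the inlined value numeric
    return cql, params


cql, params = build_cql(
    keyspace="alice",
    table="customers",
    fields=["customer_id", "name"],
    collection="default",
    filters={"status": "active"},
    allowed_fields={"customer_id", "name", "status"},
    limit=20,
)
print(cql)
print(params)
```

Values travel as bound parameters while identifiers are checked against the schema, which also addresses the input sanitization concern for the parts of the statement that cannot be parameterised.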

Data Models

Note: A StructuredQueryRequest/Response schema already exists in trustgraph-base/trustgraph/schema/services/structured_query.py. However, it lacks critical fields (user, collection) and uses suboptimal types. The schemas below represent the recommended evolution, which should either replace the existing schemas or be created as new ObjectsQueryRequest/Response types.

Request Schema (ObjectsQueryRequest)

from pulsar.schema import Record, String, Map, Array

class ObjectsQueryRequest(Record):
    user = String()              # Cassandra keyspace (follows pattern from TriplesQueryRequest)
    collection = String()        # Data collection identifier (required for partition key)
    query = String()             # GraphQL query string
    variables = Map(String())    # GraphQL variables (consider enhancing to support all JSON types)
    operation_name = String()    # Operation to execute for multi-operation documents

Rationale for changes from existing StructuredQueryRequest:

Added user and collection fields to match the pattern used by other query services
These fields are essential for identifying the Cassandra keyspace and collection
Variables remain as Map(String()) for now but should ideally support all JSON types
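
One way to carry arbitrary JSON values through a string-valued map is to JSON-encode each value individually. The sketch below shows that convention; it is an assumption for illustration, not something the specification mandates, and the helper names are hypothetical.

```python
import json


def encode_variables(variables):
    """Encode each GraphQL variable as a JSON string so it fits Map(String())."""
    return {name: json.dumps(value) for name, value in variables.items()}


def decode_variables(encoded):
    """Reverse of encode_variables, applied on the service side."""
    return {name: json.loads(text) for name, text in encoded.items()}


variables = {"limit": 20, "status": "active", "tags": ["a", "b"], "after": None}
encoded = encode_variables(variables)
print(encoded["limit"], encoded["status"])  # prints: 20 "active"
print(decode_variables(encoded) == variables)  # prints: True
```

Both sides must agree on the convention: a client that sends the bare string `active` instead of `"active"` would fail to decode.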

Response Schema (ObjectsQueryResponse)

from pulsar.schema import Record, String, Map, Array
from ..core.primitives import Error

class GraphQLError(Record):
    message = String()
    path = Array(String())       # Path to the field that caused the error
    extensions = Map(String())   # Additional error metadata

class ObjectsQueryResponse(Record):
    error = Error()              # System-level error (connection, timeout, etc.)
    data = String()              # JSON-encoded GraphQL response data
    errors = Array(GraphQLError()) # GraphQL field-level errors
    extensions = Map(String())   # Query metadata (execution time, etc.)

Rationale for changes from existing StructuredQueryResponse:

Distinguishes between system errors (error) and GraphQL errors (errors)
Uses structured GraphQLError objects instead of an array of strings
Adds extensions field for GraphQL spec compliance
Keeps data as JSON string for compatibility, though native types would be preferable
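
To show how a GraphQL execution result might be flattened into this response shape, the sketch below builds a plain dictionary with the same field names. It stands in for the Pulsar record so that it runs without the Pulsar client. The `build_response` helper and the `execution_time_ms` key are hypothetical.

```python
import json


def build_response(data, errors=(), elapsed_ms=None):
    """Shape a GraphQL result like ObjectsQueryResponse, using plain dicts."""
    return {
        "error": None,  # system-level error; unset on a normal execution
        "data": json.dumps(data) if data is not None else None,
        "errors": [
            {
                "message": err["message"],
                # GraphQL paths mix field names and list indices;
                # Array(String()) requires strings, so indices are stringified.
                "path": [str(p) for p in err.get("path", [])],
                "extensions": {k: str(v) for k, v in err.get("extensions", {}).items()},
            }
            for err in errors
        ],
        "extensions": {} if elapsed_ms is None else {"execution_time_ms": str(elapsed_ms)},
    }


response = build_response(
    data={"customers": [{"name": "Ada"}, None]},
    errors=[{"message": "email is not readable", "path": ["customers", 1, "email"]}],
    elapsed_ms=12,
)
print(response["data"])
print(response["errors"][0]["path"])  # prints: ['customers', '1', 'email']
```

Stringifying list indices loses the distinction between an index and a field named `1`, one of the trade-offs of keeping the schema string-only.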

Cassandra Query Optimization

The service will optimize Cassandra queries by:

Respecting Partition Keys:
- Always include collection in queries
- Use schema-defined primary keys efficiently
- Avoid full table scans
Leveraging Indexes:
- Use secondary indexes for filtering
- Combine multiple filters when possible
- Warn when queries may be inefficient
Batch Loading:
- Collect relationship queries
- Execute in batches to reduce round trips
- Cache results within request context
Connection Management:
- Maintain persistent Cassandra sessions
- Use connection pooling
- Handle reconnection on failures
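
The batch loading idea can be illustrated without a database. In the sketch below, `fetch_customers_by_ids` stands in for a single Cassandra lookup of many keys; it and the other names are hypothetical. The foreign keys of all parent rows are collected first, so related objects are fetched once per request instead of once per row.

```python
def attach_customers(orders, fetch_customers_by_ids):
    """Resolve the customer relationship for all orders with one batched lookup."""
    # Deduplicate foreign keys so each customer is requested once.
    ids = sorted({order["customer_id"] for order in orders})
    customers = {c["customer_id"]: c for c in fetch_customers_by_ids(ids)}
    # Missing customers resolve to None rather than raising.
    return [{**order, "customer": customers.get(order["customer_id"])} for order in orders]


# In-memory stand-in for the database, recording how often it is called.
CUSTOMERS = {
    "c1": {"customer_id": "c1", "name": "Ada"},
    "c2": {"customer_id": "c2", "name": "Grace"},
}
calls = []


def fake_fetch(ids):
    calls.append(list(ids))
    return [CUSTOMERS[i] for i in ids if i in CUSTOMERS]


orders = [
    {"order_id": "o1", "customer_id": "c1"},
    {"order_id": "o2", "customer_id": "c2"},
    {"order_id": "o3", "customer_id": "c1"},
]
resolved = attach_customers(orders, fake_fetch)
print(len(calls), [o["customer"]["name"] for o in resolved])
# prints: 1 ['Ada', 'Grace', 'Ada']
```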

Example GraphQL Queries

Simple Collection Query

{
  customers(status: "active") {
    customer_id
    name
    email
    registration_date
  }
}

Query with Relationships

{
  orders(order_date_gt: "2024-01-01") {
    order_id
    total_amount
    customer {
      name
      email
    }
    items {
      product_name
      quantity
      price
    }
  }
}

Paginated Query

{
  products(limit: 20, offset: 40) {
    product_id
    name
    price
    category
  }
}
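
Assuming the target Cassandra version provides LIMIT but no OFFSET clause in CQL, one simple way to honour `offset` is to request `limit + offset` rows and skip the first `offset` on the client. The sketch below shows that approach with hypothetical helper names. It is easy to reason about, but its cost grows with the offset, so token-based or paging-state pagination may be preferable for deep pages.

```python
from itertools import islice


def cql_row_limit(limit, offset=0):
    """Number of rows to request from Cassandra to serve one page."""
    return limit + offset


def paginate(rows, limit, offset=0):
    """Skip `offset` rows client-side, then return up to `limit` rows."""
    if limit < 0 or offset < 0:
        raise ValueError("limit and offset must be non-negative")
    return list(islice(rows, offset, offset + limit))


# Stand-in for a result set; a driver result would be iterated the same way.
rows = ({"product_id": i} for i in range(100))
page = paginate(rows, limit=20, offset=40)
print(cql_row_limit(20, 40), page[0]["product_id"], page[-1]["product_id"], len(page))
# prints: 60 40 59 20
```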

Implementation Dependencies

Strawberry GraphQL: For GraphQL schema definition and query execution
Cassandra Driver: For database connectivity (already used in storage module)
TrustGraph Base: For FlowProcessor and schema definitions
Configuration System: For schema monitoring and updates

Command-Line Interface

The service will provide a CLI command: kg-query-objects-graphql-cassandra

Arguments:

--cassandra-host: Cassandra cluster contact point
--cassandra-username: Authentication username
--cassandra-password: Authentication password
--config-type: Configuration type for schemas (default: "schema")
Standard FlowProcessor arguments (Pulsar configuration, etc.)
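
A minimal sketch of the service-specific arguments using `argparse` follows. In the real service these would presumably be registered through FlowProcessor's own argument handling, which is not shown here. The host name in the example is a placeholder, and no defaults are assumed beyond the documented one for `--config-type`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="kg-query-objects-graphql-cassandra")
    parser.add_argument("--cassandra-host", help="Cassandra cluster contact point")
    parser.add_argument("--cassandra-username", help="Authentication username")
    parser.add_argument("--cassandra-password", help="Authentication password")
    parser.add_argument(
        "--config-type",
        default="schema",
        help='Configuration type for schemas (default: "schema")',
    )
    return parser


args = build_parser().parse_args(["--cassandra-host", "cassandra.example.internal"])
print(args.cassandra_host, args.config_type, args.cassandra_username)
# prints: cassandra.example.internal schema None
```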

API Integration

Pulsar Topics

Input Topic: objects-graphql-query-request

Schema: ObjectsQueryRequest
Receives GraphQL queries from gateway services

Output Topic: objects-graphql-query-response

Schema: ObjectsQueryResponse
Returns query results and errors

Gateway Integration

The gateway and reverse-gateway will need endpoints to:

Accept GraphQL queries from clients
Forward to the query service via Pulsar
Return responses to clients
Support GraphQL introspection queries

Agent Tool Integration

A new agent tool class will enable:

Natural language to GraphQL query generation
Direct GraphQL query execution
Result interpretation and formatting
Integration with agent decision flows

Security Considerations

Query Depth Limiting: Prevent deeply nested queries that could cause performance issues
Query Complexity Analysis: Limit query complexity to prevent resource exhaustion
Field-Level Permissions: Future support for field-level access control based on user roles
Input Sanitization: Validate and sanitize all query inputs to prevent injection attacks
Rate Limiting: Implement query rate limiting per user/collection
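
Query depth limiting can be approximated before parsing by counting brace nesting outside string literals, as sketched below with hypothetical function names. This is a rough pre-check under simplifying assumptions: input-object literals in arguments also count as nesting, and comments and fragments are not analysed. A production check would more likely walk the parsed query document.

```python
def selection_depth(query):
    """Return the maximum brace nesting depth, ignoring braces inside strings."""
    depth = max_depth = 0
    in_string = False
    escaped = False
    for ch in query:
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth


def check_depth(query, max_allowed):
    """Raise ValueError when the query nests deeper than max_allowed."""
    depth = selection_depth(query)
    if depth > max_allowed:
        raise ValueError(f"query depth {depth} exceeds limit {max_allowed}")
    return depth


query = '{ orders(order_date_gt: "2024-01-01") { order_id customer { name email } } }'
print(selection_depth(query))  # prints: 3
```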

Performance Considerations

Query Planning: Analyze queries before execution to optimize CQL generation
Result Caching: Consider caching frequently accessed data at the field resolver level
Connection Pooling: Maintain efficient connection pools to Cassandra
Batch Operations: Combine multiple queries when possible to reduce latency
Monitoring: Track query performance metrics for optimization

Testing Strategy

Unit Tests

Schema generation from RowSchema definitions
GraphQL query parsing and validation
CQL query generation logic
Field resolver implementations

Contract Tests

Pulsar message contract compliance
GraphQL schema validity
Response format verification
Error structure validation

Integration Tests

End-to-end query execution against a test Cassandra instance
Schema update handling
Relationship resolution
Pagination and filtering
Error scenarios

Performance Tests

Query throughput under load
Response time for various query complexities
Memory usage with large result sets
Connection pool efficiency

Migration Plan

No migration is required because this is a new capability. The service will:

Read existing schemas from configuration
Connect to existing Cassandra tables created by the storage module
Start accepting queries immediately upon deployment

Timeline

Weeks 1-2: Core service implementation and schema generation
Week 3: Query execution and CQL translation
Week 4: Relationship resolution and optimization
Week 5: Testing and performance tuning
Week 6: Gateway integration and documentation

Open Questions

Schema Evolution: How should the service handle queries during schema transitions?
- Option: Queue queries during schema updates
- Option: Support multiple schema versions simultaneously
Caching Strategy: Should query results be cached?
- Consider: Time-based expiration
- Consider: Event-based invalidation
Federation Support: Should the service support GraphQL federation for combining with other data sources?
- Would enable unified queries across structured and graph data
Subscription Support: Should the service support GraphQL subscriptions for real-time updates?
- Would require WebSocket support in gateway
Custom Scalars: Should custom scalar types be supported for domain-specific data types?
- Examples: DateTime, UUID, JSON fields

References

Structured Data Technical Specification: docs/tech-specs/structured-data.md
Strawberry GraphQL Documentation: https://strawberry.rocks/
GraphQL Specification: https://spec.graphql.org/
Apache Cassandra CQL Reference: https://cassandra.apache.org/doc/stable/cassandra/cql/
TrustGraph Flow Processor Documentation: Internal documentation
