trustgraph/docs/tech-specs/graphql-query.md
cybermaggedon 672e358b2f
2025-09-03 23:39:11 +01:00

GraphQL Query Technical Specification

Overview

This specification describes the implementation of a GraphQL query interface for TrustGraph's structured data storage in Apache Cassandra. Building upon the structured data capabilities outlined in the structured-data.md specification, this document details how GraphQL queries will be executed against Cassandra tables containing extracted and ingested structured objects.

The GraphQL query service will provide a flexible, type-safe interface for querying structured data stored in Cassandra. It will adapt dynamically to schema changes, support complex queries including relationships between objects, and integrate with TrustGraph's existing message-based architecture.

Goals

  • Dynamic Schema Support: Automatically adapt to schema changes in configuration without service restarts
  • GraphQL Standards Compliance: Provide a standard GraphQL interface compatible with existing GraphQL tooling and clients
  • Efficient Cassandra Queries: Translate GraphQL queries into efficient Cassandra CQL queries respecting partition keys and indexes
  • Relationship Resolution: Support GraphQL field resolvers for relationships between different object types
  • Type Safety: Ensure type-safe query execution and response generation based on schema definitions
  • Scalable Performance: Handle concurrent queries efficiently with proper connection pooling and query optimization
  • Request/Response Integration: Maintain compatibility with TrustGraph's Pulsar-based request/response pattern
  • Error Handling: Provide comprehensive error reporting for schema mismatches, query errors, and data validation issues

Background

The structured data storage implementation (trustgraph-flow/trustgraph/storage/objects/cassandra/) writes objects to Cassandra tables based on schema definitions stored in TrustGraph's configuration system. These tables use a composite partition key structure with collection and schema-defined primary keys, enabling efficient queries within collections.

Current limitations that this specification addresses:

  • No query interface for the structured data stored in Cassandra
  • Inability to leverage GraphQL's powerful query capabilities for structured data
  • Missing support for relationship traversal between related objects
  • Lack of a standardized query language for structured data access

The GraphQL query service will bridge these gaps by:

  • Providing a standard GraphQL interface for querying Cassandra tables
  • Dynamically generating GraphQL schemas from TrustGraph configuration
  • Efficiently translating GraphQL queries to Cassandra CQL
  • Supporting relationship resolution through field resolvers

Technical Design

Architecture

The GraphQL query service will be implemented as a new TrustGraph flow processor following established patterns:

Module Location: trustgraph-flow/trustgraph/query/objects/cassandra/

Key Components:

  1. GraphQL Query Service Processor

    • Extends base FlowProcessor class
    • Implements request/response pattern similar to existing query services
    • Monitors configuration for schema updates
    • Maintains GraphQL schema synchronized with configuration
  2. Dynamic Schema Generator

    • Converts TrustGraph RowSchema definitions to GraphQL types
    • Creates GraphQL object types with proper field definitions
    • Generates root Query type with collection-based resolvers
    • Updates GraphQL schema when configuration changes
  3. Query Executor

    • Parses incoming GraphQL queries using Strawberry library
    • Validates queries against current schema
    • Executes queries and returns structured responses
    • Handles errors gracefully with detailed error messages
  4. Cassandra Query Translator

    • Converts GraphQL selections to CQL queries
    • Optimizes queries based on available indexes and partition keys
    • Handles filtering, pagination, and sorting
    • Manages connection pooling and session lifecycle
  5. Relationship Resolver

    • Implements field resolvers for object relationships
    • Performs efficient batch loading to avoid N+1 queries
    • Caches resolved relationships within request context
    • Supports both forward and reverse relationship traversal

Configuration Schema Monitoring

The service will register a configuration handler to receive schema updates:

self.register_config_handler(self.on_schema_config)

When schemas change:

  1. Parse new schema definitions from configuration
  2. Regenerate GraphQL types and resolvers
  3. Update the executable schema
  4. Clear any schema-dependent caches
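The four steps above could be sketched as follows. This is a minimal illustration, not TrustGraph's implementation: the `SchemaRegistry` class, the handler signature `(config, version)`, and the config layout (a dict keyed by config type, mapping schema names to JSON strings) are all assumptions.

```python
import json


class SchemaRegistry:
    """Holds parsed schemas plus a cache that depends on them (hypothetical)."""

    def __init__(self, config_type="schema"):
        self.config_type = config_type
        self.schemas = {}        # schema name -> parsed definition
        self.graphql_types = {}  # schema name -> generated type (placeholder)
        self.cache = {}          # schema-dependent cache
        self.version = None

    def on_schema_config(self, config, version):
        # 1. Parse new schema definitions from configuration
        raw = config.get(self.config_type, {})
        schemas = {name: json.loads(text) for name, text in raw.items()}

        # 2. Regenerate types (placeholder: record the field names only)
        types = {
            name: [f["name"] for f in s.get("fields", [])]
            for name, s in schemas.items()
        }

        # 3. Swap in the new schema in a single assignment
        self.schemas, self.graphql_types, self.version = schemas, types, version

        # 4. Clear any schema-dependent caches
        self.cache.clear()
```

Building the new schema completely before swapping it in means a parse failure leaves the previous schema in place.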

GraphQL Schema Generation

For each RowSchema in configuration, generate:

  1. GraphQL Object Type:

    • Map field types (string → String, integer → Int, float → Float, boolean → Boolean)
    • Mark required fields as non-nullable in GraphQL
    • Add field descriptions from schema
  2. Root Query Fields:

    • Collection query (e.g., customers, transactions)
    • Filtering arguments based on indexed fields
    • Pagination support (limit, offset)
    • Sorting options for sortable fields
  3. Relationship Fields:

    • Identify foreign key relationships from schema
    • Create field resolvers for related objects
    • Support both single object and list relationships
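As a rough illustration of the object type generation described above, the sketch below renders one schema as GraphQL SDL text using only the standard library. The dict layout of the schema (`name`, `fields`, `type`, `required`, `description`) is assumed for the example, and the real service is expected to build Strawberry types rather than SDL strings.

```python
# Type mapping taken from the list above; anything else falls back to String.
GRAPHQL_TYPES = {
    "string": "String",
    "integer": "Int",
    "float": "Float",
    "boolean": "Boolean",
}


def to_sdl(schema):
    """Render one schema definition (hypothetical dict layout) as GraphQL SDL."""
    lines = [f"type {schema['name']} {{"]
    for field in schema["fields"]:
        gql = GRAPHQL_TYPES.get(field["type"], "String")
        if field.get("required"):
            gql += "!"  # required fields become non-nullable
        if field.get("description"):
            lines.append(f'  "{field["description"]}"')
        lines.append(f"  {field['name']}: {gql}")
    lines.append("}")
    return "\n".join(lines)
```

Descriptions containing double quotes would need escaping before this output could be parsed as valid SDL.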

Query Execution Flow

  1. Request Reception:

    • Receive ObjectsQueryRequest from Pulsar
    • Extract GraphQL query string and variables
    • Identify user and collection context
  2. Query Validation:

    • Parse GraphQL query using Strawberry
    • Validate against current schema
    • Check field selections and argument types
  3. CQL Generation:

    • Analyze GraphQL selections
    • Build CQL query with proper WHERE clauses
    • Include collection in partition key
    • Apply filters based on GraphQL arguments
  4. Query Execution:

    • Execute CQL query against Cassandra
    • Map results to GraphQL response structure
    • Resolve any relationship fields
    • Format response according to GraphQL spec
  5. Response Delivery:

    • Create ObjectsQueryResponse with results
    • Include any execution errors
    • Send response via Pulsar with correlation ID
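The CQL generation step might look something like the sketch below. The function name, the `keyspace.table` naming, and the equality-only filters are assumptions. The `%s` placeholders follow the style the Python Cassandra driver accepts for non-prepared statements.

```python
def build_select(keyspace, table, columns, collection, filters=None, limit=None):
    """Return (cql, params) for a parameterised SELECT (illustrative only).

    Identifiers are assumed to come from the validated schema, never from
    the client; values are always passed as bind parameters.
    """
    where = ["collection = %s"]  # collection is always part of the key
    params = [collection]
    for column, value in (filters or {}).items():
        where.append(f"{column} = %s")
        params.append(value)

    cql = (
        f"SELECT {', '.join(columns)} FROM {keyspace}.{table} "
        f"WHERE {' AND '.join(where)}"
    )
    if limit is not None:
        cql += " LIMIT %s"
        params.append(limit)
    return cql, params
```

Keeping values out of the statement text addresses the injection concern for values, while column and table names still need to be checked against the schema.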

Data Models

Note: A StructuredQueryRequest/Response schema already exists in trustgraph-base/trustgraph/schema/services/structured_query.py. However, it lacks critical fields (user, collection) and uses suboptimal types. The schemas below represent the recommended evolution, which should either replace the existing schemas or be created as new ObjectsQueryRequest/Response types.

Request Schema (ObjectsQueryRequest)

from pulsar.schema import Record, String, Map, Array

class ObjectsQueryRequest(Record):
    user = String()              # Cassandra keyspace (follows pattern from TriplesQueryRequest)
    collection = String()        # Data collection identifier (required for partition key)
    query = String()             # GraphQL query string
    variables = Map(String())    # GraphQL variables (consider enhancing to support all JSON types)
    operation_name = String()    # Operation to execute for multi-operation documents

Rationale for changes from existing StructuredQueryRequest:

  • Added user and collection fields to match other query services pattern
  • These fields are essential for identifying the Cassandra keyspace and collection
  • Variables remain as Map(String()) for now but should ideally support all JSON types
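Until the variables field supports all JSON types, one possible workaround is to JSON-encode each value on the client and decode it in the service. The helper names below are hypothetical and the convention is only a suggestion, not something this specification mandates.

```python
import json


def encode_variables(variables):
    """JSON-encode each value so it fits in a Map(String()) field."""
    return {name: json.dumps(value) for name, value in variables.items()}


def decode_variables(encoded):
    """Reverse of encode_variables, run by the query service."""
    return {name: json.loads(text) for name, text in encoded.items()}
```

With this convention a string variable travels as `"\"active\""` rather than `"active"`, so both sides must agree on it.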

Response Schema (ObjectsQueryResponse)

from pulsar.schema import Record, String, Map, Array
from ..core.primitives import Error

class GraphQLError(Record):
    message = String()
    path = Array(String())       # Path to the field that caused the error
    extensions = Map(String())   # Additional error metadata

class ObjectsQueryResponse(Record):
    error = Error()              # System-level error (connection, timeout, etc.)
    data = String()              # JSON-encoded GraphQL response data
    errors = Array(GraphQLError) # GraphQL field-level errors
    extensions = Map(String())   # Query metadata (execution time, etc.)

Rationale for changes from existing StructuredQueryResponse:

  • Distinguishes between system errors (error) and GraphQL errors (errors)
  • Uses structured GraphQLError objects instead of string array
  • Adds extensions field for GraphQL spec compliance
  • Keeps data as JSON string for compatibility, though native types would be preferable
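To make the split between system errors and GraphQL errors concrete, the sketch below assembles a plain dict shaped like ObjectsQueryResponse. It does not use Pulsar. The `type` and `message` keys on the system error are assumed to mirror the Error primitive, and `build_response` is a hypothetical helper.

```python
import json


def build_response(data=None, graphql_errors=None, system_error=None):
    """Assemble a dict shaped like ObjectsQueryResponse (plain dicts, no Pulsar)."""
    if system_error is not None:
        # System-level failure: no GraphQL result is available at all
        return {
            "error": {"type": "system", "message": str(system_error)},
            "data": None,
            "errors": [],
            "extensions": {},
        }

    errors = [
        {
            "message": e["message"],
            # path entries may be field names or list indexes; stringify both
            "path": [str(p) for p in e.get("path", [])],
            "extensions": {k: str(v) for k, v in e.get("extensions", {}).items()},
        }
        for e in (graphql_errors or [])
    ]
    return {"error": None, "data": json.dumps(data), "errors": errors, "extensions": {}}
```

A response can carry both `data` and `errors` when only some fields fail, which matches the partial-result behaviour of the GraphQL response format.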

Cassandra Query Optimization

The service will optimize Cassandra queries by:

  1. Respecting Partition Keys:

    • Always include collection in queries
    • Use schema-defined primary keys efficiently
    • Avoid full table scans
  2. Leveraging Indexes:

    • Use secondary indexes for filtering
    • Combine multiple filters when possible
    • Warn when queries may be inefficient
  3. Batch Loading:

    • Collect relationship queries
    • Execute in batches to reduce round trips
    • Cache results within request context
  4. Connection Management:

    • Maintain persistent Cassandra sessions
    • Use connection pooling
    • Handle reconnection on failures
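The batch loading idea can be sketched without any database at all. In the example below, `fetch_many` is a hypothetical callable that takes a list of keys and returns a `{key: row}` dict, and `cache` is a plain dict that lives for the duration of one request.

```python
def batch_load(keys, fetch_many, cache):
    """Resolve many keys with one fetch, reusing a per-request cache."""
    # De-duplicate while keeping first-seen order, and skip cached keys
    missing = [k for k in dict.fromkeys(keys) if k not in cache]
    if missing:
        found = fetch_many(missing)
        for k in missing:
            cache[k] = found.get(k)  # keys with no row are cached as None
    return [cache[k] for k in keys]
```

Resolving the `customer` field for 100 orders this way issues one fetch for the distinct customer IDs instead of 100 separate lookups.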

Example GraphQL Queries

Simple Collection Query

{
  customers(status: "active") {
    customer_id
    name
    email
    registration_date
  }
}

Query with Relationships

{
  orders(order_date_gt: "2024-01-01") {
    order_id
    total_amount
    customer {
      name
      email
    }
    items {
      product_name
      quantity
      price
    }
  }
}
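The `order_date_gt` argument above suggests a suffix convention for comparison filters. A sketch of how such argument names might be split into a column and a CQL operator follows. Only `_gt` appears in this document, so the other suffixes and the `parse_filter` helper are assumptions.

```python
# Only `_gt` appears in the example above; the other suffixes are assumed.
SUFFIX_OPERATORS = {"_gte": ">=", "_lte": "<=", "_gt": ">", "_lt": "<"}


def parse_filter(argument):
    """Split an argument name such as `order_date_gt` into (column, operator)."""
    for suffix, operator in SUFFIX_OPERATORS.items():
        if argument.endswith(suffix) and len(argument) > len(suffix):
            return argument[: -len(suffix)], operator
    return argument, "="  # no suffix means equality
```

A column whose real name ends in one of these suffixes would be misread, so a production version would resolve argument names against the schema instead.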

Paginated Query

{
  products(limit: 20, offset: 40) {
    product_id
    name
    price
    category
  }
}
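CQL has a LIMIT clause but no OFFSET clause, so `offset` has to be handled by the service. One simple approach, sketched below with a hypothetical `paginate` helper, is to request `offset + limit` rows and discard the first `offset`.

```python
from itertools import islice


def paginate(rows, limit, offset=0):
    """Skip `offset` rows and return at most `limit` rows from an iterable."""
    if limit < 0 or offset < 0:
        raise ValueError("limit and offset must be non-negative")
    return list(islice(rows, offset, offset + limit))
```

The cost of this approach grows with the offset, since skipped rows are still read from Cassandra.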

Implementation Dependencies

  • Strawberry GraphQL: For GraphQL schema definition and query execution
  • Cassandra Driver: For database connectivity (already used in storage module)
  • TrustGraph Base: For FlowProcessor and schema definitions
  • Configuration System: For schema monitoring and updates

Command-Line Interface

The service will provide a CLI command: kg-query-objects-graphql-cassandra

Arguments:

  • --cassandra-host: Cassandra cluster contact point
  • --cassandra-username: Authentication username
  • --cassandra-password: Authentication password
  • --config-type: Configuration type for schemas (default: "schema")
  • Standard FlowProcessor arguments (Pulsar configuration, etc.)
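The service-specific arguments could be declared with `argparse` roughly as follows. The `localhost` default for the host and the `None` defaults for credentials are assumptions, and the standard FlowProcessor arguments are omitted.

```python
import argparse


def build_parser():
    """Build a parser for the service-specific arguments listed above."""
    parser = argparse.ArgumentParser(prog="kg-query-objects-graphql-cassandra")
    parser.add_argument("--cassandra-host", default="localhost",
                        help="Cassandra cluster contact point (default assumed)")
    parser.add_argument("--cassandra-username", default=None,
                        help="Authentication username")
    parser.add_argument("--cassandra-password", default=None,
                        help="Authentication password")
    parser.add_argument("--config-type", default="schema",
                        help="Configuration type for schemas")
    return parser
```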

API Integration

Pulsar Topics

Input Topic: objects-graphql-query-request

  • Schema: ObjectsQueryRequest
  • Receives GraphQL queries from gateway services

Output Topic: objects-graphql-query-response

  • Schema: ObjectsQueryResponse
  • Returns query results and errors

Gateway Integration

The gateway and reverse-gateway will need endpoints to:

  1. Accept GraphQL queries from clients
  2. Forward to the query service via Pulsar
  3. Return responses to clients
  4. Support GraphQL introspection queries

Agent Tool Integration

A new agent tool class will enable:

  • Natural language to GraphQL query generation
  • Direct GraphQL query execution
  • Result interpretation and formatting
  • Integration with agent decision flows

Security Considerations

  • Query Depth Limiting: Prevent deeply nested queries that could cause performance issues
  • Query Complexity Analysis: Limit query complexity to prevent resource exhaustion
  • Field-Level Permissions: Future support for field-level access control based on user roles
  • Input Sanitization: Validate and sanitize all query inputs to prevent injection attacks
  • Rate Limiting: Implement query rate limiting per user/collection
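As an illustration of query depth limiting, the sketch below estimates nesting depth with a character scan. It is deliberately rough: a production check would walk the parsed query instead, and this version also counts braces of input object literals in arguments. The `query_depth` name is hypothetical.

```python
def query_depth(query):
    """Return the maximum brace nesting depth of a GraphQL query string.

    Braces inside double-quoted strings are ignored. Block strings,
    comments, and fragments are not handled.
    """
    depth = max_depth = 0
    in_string = False
    escaped = False
    for ch in query:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth
```

The service could reject a request when `query_depth(query)` exceeds a configured maximum, before any CQL is generated.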

Performance Considerations

  • Query Planning: Analyze queries before execution to optimize CQL generation
  • Result Caching: Consider caching frequently accessed data at the field resolver level
  • Connection Pooling: Maintain efficient connection pools to Cassandra
  • Batch Operations: Combine multiple queries when possible to reduce latency
  • Monitoring: Track query performance metrics for optimization

Testing Strategy

Unit Tests

  • Schema generation from RowSchema definitions
  • GraphQL query parsing and validation
  • CQL query generation logic
  • Field resolver implementations

Contract Tests

  • Pulsar message contract compliance
  • GraphQL schema validity
  • Response format verification
  • Error structure validation

Integration Tests

  • End-to-end query execution against test Cassandra instance
  • Schema update handling
  • Relationship resolution
  • Pagination and filtering
  • Error scenarios

Performance Tests

  • Query throughput under load
  • Response time for various query complexities
  • Memory usage with large result sets
  • Connection pool efficiency

Migration Plan

No migration is required, as this is a new capability. The service will:

  1. Read existing schemas from configuration
  2. Connect to existing Cassandra tables created by the storage module
  3. Start accepting queries immediately upon deployment

Timeline

  • Week 1-2: Core service implementation and schema generation
  • Week 3: Query execution and CQL translation
  • Week 4: Relationship resolution and optimization
  • Week 5: Testing and performance tuning
  • Week 6: Gateway integration and documentation

Open Questions

  1. Schema Evolution: How should the service handle queries during schema transitions?

    • Option: Queue queries during schema updates
    • Option: Support multiple schema versions simultaneously
  2. Caching Strategy: Should query results be cached?

    • Consider: Time-based expiration
    • Consider: Event-based invalidation
  3. Federation Support: Should the service support GraphQL federation for combining with other data sources?

    • Would enable unified queries across structured and graph data
  4. Subscription Support: Should the service support GraphQL subscriptions for real-time updates?

    • Would require WebSocket support in gateway
  5. Custom Scalars: Should custom scalar types be supported for domain-specific data types?

    • Examples: DateTime, UUID, JSON fields
