trustgraph/docs/tech-specs/sparql-query.md
cybermaggedon d9dc4cbab5
SPARQL query service (#754)
SPARQL 1.1 query service wrapping pub/sub triples interface

Add a backend-agnostic SPARQL query service that parses SPARQL
queries using rdflib, decomposes them into triple pattern lookups
via the existing TriplesClient pub/sub interface, and performs
in-memory joins, filters, and projections.

Includes:
- SPARQL parser, algebra evaluator, expression evaluator, solution
  sequence operations (BGP, JOIN, OPTIONAL, UNION, FILTER, BIND,
  VALUES, GROUP BY, ORDER BY, LIMIT/OFFSET, DISTINCT, aggregates)
- FlowProcessor service with TriplesClientSpec
- Gateway dispatcher, request/response translators, API spec
- Python SDK method (FlowInstance.sparql_query)
- CLI command (tg-invoke-sparql-query)
- Tech spec (docs/tech-specs/sparql-query.md)

New unit tests for SPARQL query
2026-04-02 17:21:39 +01:00


SPARQL Query Service Technical Specification

Overview

A pub/sub-hosted SPARQL query service that accepts SPARQL queries, decomposes them into triple pattern lookups via the existing triples query pub/sub interface, performs in-memory joins/filters/projections, and returns SPARQL result bindings.

This makes the triple store queryable using a standard graph query language without coupling to any specific backend (Neo4j, Cassandra, FalkorDB, etc.).

Goals

  • SPARQL 1.1 support: SELECT, ASK, CONSTRUCT, DESCRIBE queries
  • Backend-agnostic: query via the pub/sub triples interface, not direct database access
  • Standard service pattern: FlowProcessor with ConsumerSpec/ProducerSpec, using TriplesClientSpec to call the triples query service
  • Correct SPARQL semantics: proper BGP evaluation, joins, OPTIONAL, UNION, FILTER, BIND, aggregation, solution modifiers (ORDER BY, LIMIT, OFFSET, DISTINCT)

Background

The triples query service provides a single-pattern lookup: given optional (s, p, o) values, return matching triples. This is the equivalent of one triple pattern in a SPARQL Basic Graph Pattern.

To evaluate a full SPARQL query, we need to:

  1. Parse the SPARQL string into an algebra tree
  2. Walk the algebra tree, issuing triple pattern lookups for each BGP pattern
  3. Join results across patterns (nested-loop or hash join)
  4. Apply filters, optionals, unions, and aggregations in-memory
  5. Project and return the requested variables

rdflib (already a dependency) provides a SPARQL 1.1 parser and algebra compiler. We use rdflib to parse queries into algebra trees, then evaluate the algebra ourselves using the triples query client as the data source.

Technical Design

Architecture

                       pub/sub
  [Client] ──request──> [SPARQL Query Service] ──triples-request──> [Triples Query Service]
  [Client] <─response── [SPARQL Query Service] <─triples-response── [Triples Query Service]

The service is a FlowProcessor that:

  • Consumes SPARQL query requests
  • Uses TriplesClientSpec to issue triple pattern lookups
  • Evaluates the SPARQL algebra in-memory
  • Produces result responses

Components

  1. SPARQL Query Service (FlowProcessor)

    • ConsumerSpec for incoming SPARQL requests
    • ProducerSpec for outgoing results
    • TriplesClientSpec for calling the triples query service
    • Delegates parsing and evaluation to the components below

    Module: trustgraph-flow/trustgraph/query/sparql/service.py

  2. SPARQL Parser (rdflib wrapper)

    • Uses rdflib.plugins.sparql.prepareQuery / parseQuery and rdflib.plugins.sparql.algebra.translateQuery to produce an algebra tree
    • Extracts PREFIX declarations, query type (SELECT/ASK/CONSTRUCT/DESCRIBE), and the algebra root

    Module: trustgraph-flow/trustgraph/query/sparql/parser.py

  3. Algebra Evaluator

    • Recursive evaluator over the rdflib algebra tree
    • Each algebra node type maps to an evaluation function
    • BGP nodes issue triple pattern queries via TriplesClient
    • Join/Filter/Optional/Union etc. operate on in-memory solution sequences

    Module: trustgraph-flow/trustgraph/query/sparql/algebra.py

  4. Solution Sequence

    • A solution is a dict mapping variable names to Term values
    • Solution sequences are lists of solutions
    • Join: hash join on shared variables
    • LeftJoin (OPTIONAL): hash join preserving unmatched left rows
    • Union: concatenation
    • Filter: evaluate SPARQL expressions against each solution
    • Projection/Distinct/Order/Slice: standard post-processing

    Module: trustgraph-flow/trustgraph/query/sparql/solutions.py
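
The join operations above can be sketched in a few lines. This is illustrative only -- the function names (hash_join, left_join) and the plain-dict solution representation are assumptions, not the real module:

```python
from collections import defaultdict

# A solution is a dict mapping variable names to term values.

def vars_of(seq):
    """All variable names appearing anywhere in a solution sequence."""
    vs = set()
    for sol in seq:
        vs |= sol.keys()
    return vs

def compatible(a, b):
    """Two solutions are compatible if they agree on shared variables."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())

def hash_join(left, right):
    """Hash join two solution sequences on their shared variables."""
    shared = tuple(sorted(vars_of(left) & vars_of(right)))
    if not shared:
        # No shared variables: plain cross product.
        return [{**l, **r} for l in left for r in right]
    index = defaultdict(list)
    for r in right:
        index[tuple(r.get(v) for v in shared)].append(r)
    out = []
    for l in left:
        for r in index.get(tuple(l.get(v) for v in shared), []):
            out.append({**l, **r})
    return out

def left_join(left, right):
    """OPTIONAL semantics: keep left rows with no compatible right row."""
    out = []
    for l in left:
        matches = [r for r in right if compatible(l, r)]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append(dict(l))
    return out
```

Union is then just `left + right`, and projection/DISTINCT/ORDER/slice are ordinary list post-processing over the same representation.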

Data Models

Request

@dataclass
class SparqlQueryRequest:
    user: str = ""
    collection: str = ""
    query: str = ""           # SPARQL query string
    limit: int = 10000        # Safety limit on results

Response

@dataclass
class SparqlQueryResponse:
    error: Error | None = None
    query_type: str = ""      # "select", "ask", "construct", "describe"

    # For SELECT queries
    variables: list[str] = field(default_factory=list)
    bindings: list[SparqlBinding] = field(default_factory=list)

    # For ASK queries
    ask_result: bool = False

    # For CONSTRUCT/DESCRIBE queries
    triples: list[Triple] = field(default_factory=list)

@dataclass
class SparqlBinding:
    values: list[Term | None] = field(default_factory=list)

BGP Evaluation Strategy

For each triple pattern in a BGP:

  • Extract bound terms (concrete IRIs/literals) and variables
  • Call TriplesClient.query_stream(s, p, o) with bound terms, None for variables
  • Map returned triples back to variable bindings

For multi-pattern BGPs, join solutions incrementally:

  • Order patterns by selectivity (patterns with more bound terms first)
  • For each subsequent pattern, substitute bound variables from the current solution sequence before querying
  • This avoids full cross-products and reduces the number of triples queries
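
The strategy above can be sketched as follows. The client interface is stood in for by a plain `query(s, p, o)` callable (None for unbound positions, returning matching triples); that call shape and the `"?name"` variable encoding are assumptions for illustration:

```python
def is_var(t):
    """Variables are written as "?name" strings in this sketch."""
    return isinstance(t, str) and t.startswith("?")

def selectivity(pattern):
    # Patterns with more bound (non-variable) terms are queried first.
    return sum(1 for t in pattern if not is_var(t))

def eval_bgp(patterns, query):
    """Evaluate a BGP: query(s, p, o) takes None for unbound positions
    and returns matching (s, p, o) triples."""
    solutions = [{}]
    for pattern in sorted(patterns, key=selectivity, reverse=True):
        next_solutions = []
        for sol in solutions:
            # Substitute bindings from earlier patterns into this one,
            # so the triples query is as narrow as possible.
            probe = tuple(
                sol.get(t[1:]) if is_var(t) else t for t in pattern
            )
            for triple in query(*probe):
                new, ok = dict(sol), True
                for term, value in zip(pattern, triple):
                    if is_var(term):
                        name = term[1:]
                        if new.get(name, value) != value:
                            ok = False
                            break
                        new[name] = value
                if ok:
                    next_solutions.append(new)
        solutions = next_solutions
    return solutions
```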

Streaming and Early Termination

The triples query service supports streaming responses (batched delivery via TriplesClient.query_stream). The SPARQL evaluator should use streaming from the start, not as an optimisation. This is important because:

  • Early termination: when the SPARQL query has a LIMIT, or when only one solution is needed (ASK queries), we can stop consuming triples as soon as we have enough results. Without streaming, a wildcard pattern like ?s ?p ?o would fetch the entire graph before we could apply the limit.
  • Memory efficiency: results are processed batch-by-batch rather than materialising the full result set in memory before joining.

The batch callback in query_stream returns a boolean to signal completion. The evaluator should signal completion (return True) as soon as sufficient solutions have been produced, allowing the underlying pub/sub consumer to stop pulling batches.
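
The completion contract can be sketched with a small callback factory. The query_stream signature shown in the usage comment is an assumption for illustration; the point is the boolean return from the batch callback:

```python
def collect_with_limit(limit):
    """Build a batch callback that signals completion once `limit`
    triples have been collected."""
    collected = []

    def on_batch(triples):
        collected.extend(triples)
        # Returning True tells the pub/sub consumer to stop pulling batches.
        return len(collected) >= limit

    return collected, on_batch

# Hypothetical usage against the streaming client:
#   collected, cb = collect_with_limit(10)
#   await client.query_stream(s=None, p=None, o=None, on_batch=cb)
#   ... bind the (at most slightly more than 10) collected triples ...
```

Note that the last batch may overshoot the limit slightly; the evaluator truncates after binding, which is still far cheaper than fetching the whole graph.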

Parallel BGP Execution (Phase 2 Optimisation)

Within a BGP, patterns that share variables benefit from sequential evaluation with bound-variable substitution (query results from earlier patterns narrow later queries). However, patterns with no shared variables are independent and could be issued concurrently via asyncio.gather.

A practical approach for a future optimisation pass:

  • Analyse BGP patterns and identify connected components (groups of patterns linked by shared variables)
  • Execute independent components in parallel
  • Within each component, evaluate patterns sequentially with substitution

This is not needed for correctness -- the sequential approach works for all cases -- but could significantly reduce latency for queries with independent pattern groups. Flagged as a phase 2 optimisation.
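
The connected-component analysis could be sketched like this (illustrative names; variables encoded as "?name" strings as elsewhere in this spec):

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def connected_components(patterns):
    """Partition BGP patterns into groups linked by shared variables.
    Groups with disjoint variable sets are independent and could be
    evaluated concurrently."""
    groups = []  # list of (variable set, pattern list) pairs
    for pattern in patterns:
        pvars = {t for t in pattern if is_var(t)}
        merged_vars, merged_patterns = set(pvars), [pattern]
        remaining = []
        for gvars, gpatterns in groups:
            if gvars & pvars:
                # This pattern links previously separate groups.
                merged_vars |= gvars
                merged_patterns = gpatterns + merged_patterns
            else:
                remaining.append((gvars, gpatterns))
        groups = remaining + [(merged_vars, merged_patterns)]
    return [pats for _, pats in groups]
```

Each returned group would then be evaluated sequentially with substitution, with the groups themselves dispatched via asyncio.gather.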

FILTER Expression Evaluation

rdflib's algebra represents FILTER expressions as expression trees. We evaluate these against each solution row, supporting:

  • Comparison operators (=, !=, <, >, <=, >=)
  • Logical operators (&&, ||, !)
  • SPARQL built-in functions (isIRI, isLiteral, isBlank, str, lang, datatype, bound, regex, etc.)
  • Arithmetic operators (+, -, *, /)
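
A minimal sketch of expression evaluation against a solution row. The tuple-based expression encoding here is purely illustrative -- the real evaluator walks rdflib's expression nodes -- but the recursive shape is the same:

```python
def eval_expr(expr, solution):
    """Evaluate an expression tree against one solution (dict of
    variable name -> term value). Non-tuples are literal constants."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    if op == "var":
        return solution[args[0]]
    if op == "&&":
        return eval_expr(args[0], solution) and eval_expr(args[1], solution)
    if op == "||":
        return eval_expr(args[0], solution) or eval_expr(args[1], solution)
    if op == "!":
        return not eval_expr(args[0], solution)
    if op in ("=", "!=", "<", ">", "<=", ">="):
        a, b = (eval_expr(x, solution) for x in args)
        return {
            "=": a == b, "!=": a != b, "<": a < b,
            ">": a > b, "<=": a <= b, ">=": a >= b,
        }[op]
    if op == "bound":
        return args[0] in solution
    raise ValueError(f"unsupported operator: {op}")

def apply_filter(expr, solutions):
    """FILTER: keep only solutions for which the expression is truthy."""
    return [s for s in solutions if eval_expr(expr, s)]
```

The full implementation additionally needs SPARQL's error semantics (a type error in a FILTER expression removes the row rather than raising), which this sketch omits.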

Implementation Order

  1. Schema and service skeleton -- define SparqlQueryRequest/Response dataclasses, create the FlowProcessor subclass with ConsumerSpec, ProducerSpec, and TriplesClientSpec wired up. Verify it starts and connects.

  2. SPARQL parsing -- wrap rdflib's parser to produce algebra trees from SPARQL strings. Handle parse errors gracefully. Unit test with a range of query shapes.

  3. BGP evaluation -- implement single-pattern and multi-pattern BGP evaluation using TriplesClient. This is the core building block. Test with simple SELECT WHERE { ?s ?p ?o } queries.

  4. Joins and solution sequences -- implement hash join, left join (for OPTIONAL), and union. Test with multi-pattern queries.

  5. FILTER evaluation -- implement the expression evaluator for FILTER clauses. Start with comparisons and logical operators, then add built-in functions incrementally.

  6. Solution modifiers -- DISTINCT, ORDER BY, LIMIT, OFFSET, projection.

  7. ASK / CONSTRUCT / DESCRIBE -- extend beyond SELECT. ASK is trivial (non-empty result = true). CONSTRUCT builds triples from a template. DESCRIBE fetches all triples for matched resources.

  8. Aggregation -- GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT, SAMPLE.

  9. BIND, VALUES, subqueries -- remaining SPARQL 1.1 features.

  10. API gateway integration -- add SparqlQueryRequestor dispatcher, request/response translators, and API endpoint so that the SPARQL service is accessible via the HTTP gateway.

  11. SDK support -- add sparql_query() method to FlowInstance in the Python API SDK, following the same pattern as triples_query().

  12. CLI command -- add a tg-invoke-sparql-query CLI command that takes a SPARQL query string (or reads from a file/stdin), submits it via the SDK, and prints results in a readable format (table for SELECT, true/false for ASK, Turtle for CONSTRUCT/DESCRIBE).

Performance Considerations

In-memory join over pub/sub round-trips will be slower than native SPARQL on a graph database. Key mitigations:

  • Streaming with early termination: use query_stream so that limit-bound queries don't fetch entire result sets. A SELECT ... LIMIT 1 against a wildcard pattern fetches one batch, not the whole graph.
  • Bound-variable substitution: when evaluating BGP patterns sequentially, substitute known bindings into subsequent patterns to issue narrow queries rather than broad ones followed by in-memory filtering.
  • Parallel independent patterns (phase 2): patterns with no shared variables can be issued concurrently.
  • Query complexity limits: may need a cap on the number of triple pattern queries issued per SPARQL query to prevent runaway evaluation.

Named Graph Mapping

SPARQL's GRAPH ?g { ... } and GRAPH <uri> { ... } clauses map to the triples query service's graph filter parameter:

  • GRAPH <uri> { ?s ?p ?o } -- pass g=uri to the triples query
  • Patterns outside any GRAPH clause -- pass g="" (default graph only)
  • GRAPH ?g { ?s ?p ?o } -- pass g="*" (all graphs), then bind ?g from the returned triple's graph field

The triples query interface has no native equivalent of SPARQL's variable graph binding, but g="*" (all graphs) combined with binding ?g from each returned triple's graph field achieves the same effect.
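
The mapping above reduces to a small function. The representation (a concrete IRI string, a "?var" string, or None for patterns outside any GRAPH clause) is an assumption for illustration:

```python
def graph_param(graph_term):
    """Map a SPARQL GRAPH context to the triples query `g` parameter."""
    if graph_term is None:
        return ""          # default graph only
    if graph_term.startswith("?"):
        return "*"         # all graphs; bind the variable from results
    return graph_term      # concrete graph IRI
```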

Open Questions

  • SPARQL 1.2: it is unclear how far rdflib's parser will support 1.2 features (property paths are already in 1.1; 1.2 adds lateral joins, ADJUST, etc.). Start with 1.1 and extend as rdflib support matures.