trustgraph/docs/tech-specs/sparql-query.md

---
layout: default
title: "SPARQL Query Service Technical Specification"
parent: "Tech Specs"
---

# SPARQL Query Service Technical Specification

## Overview

A pub/sub-hosted SPARQL query service that accepts SPARQL queries, decomposes
them into triple pattern lookups via the existing triples query pub/sub
interface, performs in-memory joins/filters/projections, and returns SPARQL
result bindings.

This makes the triple store queryable using a standard graph query language
without coupling to any specific backend (Neo4j, Cassandra, FalkorDB, etc.).

## Goals

- **SPARQL 1.1 support**: SELECT, ASK, CONSTRUCT, DESCRIBE queries
- **Backend-agnostic**: query via the pub/sub triples interface, not direct
  database access
- **Standard service pattern**: FlowProcessor with ConsumerSpec/ProducerSpec,
  using TriplesClientSpec to call the triples query service
- **Correct SPARQL semantics**: proper BGP evaluation, joins, OPTIONAL, UNION,
  FILTER, BIND, aggregation, solution modifiers (ORDER BY, LIMIT, OFFSET,
  DISTINCT)

## Background

The triples query service provides a single-pattern lookup: given optional
(s, p, o) values, return matching triples. This is the equivalent of one
triple pattern in a SPARQL Basic Graph Pattern.

To evaluate a full SPARQL query, we need to:
1. Parse the SPARQL string into an algebra tree
2. Walk the algebra tree, issuing triple pattern lookups for each BGP pattern
3. Join results across patterns (nested-loop or hash join)
4. Apply filters, optionals, unions, and aggregations in-memory
5. Project and return the requested variables

rdflib (already a dependency) provides a SPARQL 1.1 parser and algebra
compiler. We use rdflib to parse queries into algebra trees, then evaluate
the algebra ourselves using the triples query client as the data source.

## Technical Design

### Architecture

```
                       pub/sub
  [Client] ──request──> [SPARQL Query Service] ──triples-request──> [Triples Query Service]
  [Client] <─response── [SPARQL Query Service] <─triples-response── [Triples Query Service]
```

The service is a FlowProcessor that:
- Consumes SPARQL query requests
- Uses TriplesClientSpec to issue triple pattern lookups
- Evaluates the SPARQL algebra in-memory
- Produces result responses

### Components

1. **SPARQL Query Service (FlowProcessor)**
   - ConsumerSpec for incoming SPARQL requests
   - ProducerSpec for outgoing results
   - TriplesClientSpec for calling the triples query service
   - Delegates parsing and evaluation to the components below

   Module: `trustgraph-flow/trustgraph/query/sparql/service.py`

2. **SPARQL Parser (rdflib wrapper)**
   - Uses `rdflib.plugins.sparql.prepareQuery` / `parseQuery` and
     `rdflib.plugins.sparql.algebra.translateQuery` to produce an algebra tree
   - Extracts PREFIX declarations, query type (SELECT/ASK/CONSTRUCT/DESCRIBE),
     and the algebra root

   Module: `trustgraph-flow/trustgraph/query/sparql/parser.py`

3. **Algebra Evaluator**
   - Recursive evaluator over the rdflib algebra tree
   - Each algebra node type maps to an evaluation function
   - BGP nodes issue triple pattern queries via TriplesClient
   - Join/Filter/Optional/Union etc. operate on in-memory solution sequences

   Module: `trustgraph-flow/trustgraph/query/sparql/algebra.py`

4. **Solution Sequence**
   - A solution is a dict mapping variable names to Term values
   - Solution sequences are lists of solutions
   - Join: hash join on shared variables
   - LeftJoin (OPTIONAL): hash join preserving unmatched left rows
   - Union: concatenation
   - Filter: evaluate SPARQL expressions against each solution
   - Projection/Distinct/Order/Slice: standard post-processing

   Module: `trustgraph-flow/trustgraph/query/sparql/solutions.py`

### Data Models

#### Request

```python
@dataclass
class SparqlQueryRequest:
    user: str = ""
    collection: str = ""
    query: str = ""           # SPARQL query string
    limit: int = 10000        # Safety limit on results
```

#### Response

```python
@dataclass
class SparqlQueryResponse:
    error: Error | None = None
    query_type: str = ""      # "select", "ask", "construct", "describe"

    # For SELECT queries
    variables: list[str] = field(default_factory=list)
    bindings: list[SparqlBinding] = field(default_factory=list)

    # For ASK queries
    ask_result: bool = False

    # For CONSTRUCT/DESCRIBE queries
    triples: list[Triple] = field(default_factory=list)

@dataclass
class SparqlBinding:
    values: list[Term | None] = field(default_factory=list)
```

### BGP Evaluation Strategy

For each triple pattern in a BGP:
- Extract bound terms (concrete IRIs/literals) and variables
- Call `TriplesClient.query_stream(s, p, o)` with bound terms, None for
  variables
- Map returned triples back to variable bindings

For multi-pattern BGPs, join solutions incrementally:
- Order patterns by selectivity (patterns with more bound terms first)
- For each subsequent pattern, substitute bound variables from the current
  solution sequence before querying
- This avoids full cross-products and reduces the number of triples queries

### Streaming and Early Termination

The triples query service supports streaming responses (batched delivery via
`TriplesClient.query_stream`). The SPARQL evaluator should use streaming
from the start, not as an optimisation. This is important because:

- **Early termination**: when the SPARQL query has a LIMIT, or when only one
  solution is needed (ASK queries), we can stop consuming triples as soon as
  we have enough results. Without streaming, a wildcard pattern like
  `?s ?p ?o` would fetch the entire graph before we could apply the limit.
- **Memory efficiency**: results are processed batch-by-batch rather than
  materialising the full result set in memory before joining.

The batch callback in `query_stream` returns a boolean to signal completion.
The evaluator should signal completion (return True) as soon as sufficient
solutions have been produced, allowing the underlying pub/sub consumer to
stop pulling batches.

### Parallel BGP Execution (Phase 2 Optimisation)

Within a BGP, patterns that share variables benefit from sequential
evaluation with bound-variable substitution (query results from earlier
patterns narrow later queries). However, patterns with no shared variables
are independent and could be issued concurrently via `asyncio.gather`.

A practical approach for a future optimisation pass:
- Analyse BGP patterns and identify connected components (groups of
  patterns linked by shared variables)
- Execute independent components in parallel
- Within each component, evaluate patterns sequentially with substitution

This is not needed for correctness -- the sequential approach works for all
cases -- but could significantly reduce latency for queries with independent
pattern groups. Flagged as a phase 2 optimisation.

### FILTER Expression Evaluation

rdflib's algebra represents FILTER expressions as expression trees. We
evaluate these against each solution row, supporting:
- Comparison operators (=, !=, <, >, <=, >=)
- Logical operators (&&, ||, !)
- SPARQL built-in functions (isIRI, isLiteral, isBlank, str, lang,
  datatype, bound, regex, etc.)
- Arithmetic operators (+, -, *, /)

## Implementation Order

1. **Schema and service skeleton** -- define SparqlQueryRequest/Response
   dataclasses, create the FlowProcessor subclass with ConsumerSpec,
   ProducerSpec, and TriplesClientSpec wired up. Verify it starts and
   connects.

2. **SPARQL parsing** -- wrap rdflib's parser to produce algebra trees from
   SPARQL strings. Handle parse errors gracefully. Unit test with a range of
   query shapes.

3. **BGP evaluation** -- implement single-pattern and multi-pattern BGP
   evaluation using TriplesClient. This is the core building block. Test
   with simple SELECT WHERE { ?s ?p ?o } queries.

4. **Joins and solution sequences** -- implement hash join, left join (for
   OPTIONAL), and union. Test with multi-pattern queries.

5. **FILTER evaluation** -- implement the expression evaluator for FILTER
   clauses. Start with comparisons and logical operators, then add built-in
   functions incrementally.

6. **Solution modifiers** -- DISTINCT, ORDER BY, LIMIT, OFFSET, projection.

7. **ASK / CONSTRUCT / DESCRIBE** -- extend beyond SELECT. ASK is trivial
   (non-empty result = true). CONSTRUCT builds triples from a template.
   DESCRIBE fetches all triples for matched resources.

8. **Aggregation** -- GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX,
   GROUP_CONCAT, SAMPLE.

9. **BIND, VALUES, subqueries** -- remaining SPARQL 1.1 features.

10. **API gateway integration** -- add SparqlQueryRequestor dispatcher,
    request/response translators, and API endpoint so that the SPARQL
    service is accessible via the HTTP gateway.

11. **SDK support** -- add `sparql_query()` method to FlowInstance in the
    Python API SDK, following the same pattern as `triples_query()`.

12. **CLI command** -- add a `tg-sparql-query` CLI command that takes a
    SPARQL query string (or reads from a file/stdin), submits it via the
    SDK, and prints results in a readable format (table for SELECT,
    true/false for ASK, Turtle for CONSTRUCT/DESCRIBE).

## Performance Considerations

In-memory join over pub/sub round-trips will be slower than native SPARQL on
a graph database. Key mitigations:

- **Streaming with early termination**: use `query_stream` so that
  limit-bound queries don't fetch entire result sets. A `SELECT ... LIMIT 1`
  against a wildcard pattern fetches one batch, not the whole graph.
- **Bound-variable substitution**: when evaluating BGP patterns sequentially,
  substitute known bindings into subsequent patterns to issue narrow queries
  rather than broad ones followed by in-memory filtering.
- **Parallel independent patterns** (phase 2): patterns with no shared
  variables can be issued concurrently.
- **Query complexity limits**: may need a cap on the number of triple pattern
  queries issued per SPARQL query to prevent runaway evaluation.

### Named Graph Mapping

SPARQL's `GRAPH ?g { ... }` and `GRAPH <uri> { ... }` clauses map to the
triples query service's graph filter parameter:

- `GRAPH <uri> { ?s ?p ?o }` — pass `g=uri` to the triples query
- Patterns outside any GRAPH clause — pass `g=""` (default graph only)
- `GRAPH ?g { ?s ?p ?o }` — pass `g="*"` (all graphs), then bind `?g` from
  the returned triple's graph field

The triples query interface does not support a wildcard graph natively in
the SPARQL sense, but `g="*"` (all graphs) combined with client-side
filtering on the returned graph values achieves the same effect.

## Open Questions

- **SPARQL 1.2**: rdflib's parser support for 1.2 features (property paths
  are already in 1.1; 1.2 adds lateral joins, ADJUST, etc.). Start with
  1.1 and extend as rdflib support matures.