SPARQL 1.1 query service wrapping pub/sub triples interface
Add a backend-agnostic SPARQL query service that parses SPARQL queries using rdflib, decomposes them into triple pattern lookups via the existing TriplesClient pub/sub interface, and performs in-memory joins, filters, and projections.
Includes:
- SPARQL parser, algebra evaluator, expression evaluator, solution sequence operations (BGP, JOIN, OPTIONAL, UNION, FILTER, BIND, VALUES, GROUP BY, ORDER BY, LIMIT/OFFSET, DISTINCT, aggregates)
- FlowProcessor service with TriplesClientSpec
- Gateway dispatcher, request/response translators, API spec
- Python SDK method (FlowInstance.sparql_query)
- CLI command (tg-invoke-sparql-query)
- Tech spec (docs/tech-specs/sparql-query.md)
New unit tests for SPARQL query
SPARQL Query Service Technical Specification
Overview
A pub/sub-hosted SPARQL query service that accepts SPARQL queries, decomposes them into triple pattern lookups via the existing triples query pub/sub interface, performs in-memory joins/filters/projections, and returns SPARQL result bindings.
This makes the triple store queryable using a standard graph query language without coupling to any specific backend (Neo4j, Cassandra, FalkorDB, etc.).
Goals
- SPARQL 1.1 support: SELECT, ASK, CONSTRUCT, DESCRIBE queries
- Backend-agnostic: query via the pub/sub triples interface, not direct database access
- Standard service pattern: FlowProcessor with ConsumerSpec/ProducerSpec, using TriplesClientSpec to call the triples query service
- Correct SPARQL semantics: proper BGP evaluation, joins, OPTIONAL, UNION, FILTER, BIND, aggregation, solution modifiers (ORDER BY, LIMIT, OFFSET, DISTINCT)
Background
The triples query service provides a single-pattern lookup: given optional (s, p, o) values, return matching triples. This is the equivalent of one triple pattern in a SPARQL Basic Graph Pattern.
To evaluate a full SPARQL query, we need to:
- Parse the SPARQL string into an algebra tree
- Walk the algebra tree, issuing triple pattern lookups for each BGP pattern
- Join results across patterns (nested-loop or hash join)
- Apply filters, optionals, unions, and aggregations in-memory
- Project and return the requested variables
rdflib (already a dependency) provides a SPARQL 1.1 parser and algebra compiler. We use rdflib to parse queries into algebra trees, then evaluate the algebra ourselves using the triples query client as the data source.
Technical Design
Architecture
pub/sub
[Client] ──request──> [SPARQL Query Service] ──triples-request──> [Triples Query Service]
[Client] <─response── [SPARQL Query Service] <─triples-response── [Triples Query Service]
The service is a FlowProcessor that:
- Consumes SPARQL query requests
- Uses TriplesClientSpec to issue triple pattern lookups
- Evaluates the SPARQL algebra in-memory
- Produces result responses
Components
- SPARQL Query Service (FlowProcessor)
  - ConsumerSpec for incoming SPARQL requests
  - ProducerSpec for outgoing results
  - TriplesClientSpec for calling the triples query service
  - Delegates parsing and evaluation to the components below
  Module: trustgraph-flow/trustgraph/query/sparql/service.py
- SPARQL Parser (rdflib wrapper)
  - Uses rdflib.plugins.sparql.prepareQuery / parseQuery and rdflib.plugins.sparql.algebra.translateQuery to produce an algebra tree
  - Extracts PREFIX declarations, query type (SELECT/ASK/CONSTRUCT/DESCRIBE), and the algebra root
  Module: trustgraph-flow/trustgraph/query/sparql/parser.py
- Algebra Evaluator
  - Recursive evaluator over the rdflib algebra tree
  - Each algebra node type maps to an evaluation function
  - BGP nodes issue triple pattern queries via TriplesClient
  - Join/Filter/Optional/Union etc. operate on in-memory solution sequences
  Module: trustgraph-flow/trustgraph/query/sparql/algebra.py
- Solution Sequence
  - A solution is a dict mapping variable names to Term values
  - Solution sequences are lists of solutions
  - Join: hash join on shared variables
  - LeftJoin (OPTIONAL): hash join preserving unmatched left rows
  - Union: concatenation
  - Filter: evaluate SPARQL expressions against each solution
  - Projection/Distinct/Order/Slice: standard post-processing
  Module: trustgraph-flow/trustgraph/query/sparql/solutions.py
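The solution-sequence operations listed above can be sketched in a few lines. This is a minimal illustration, not the contents of solutions.py: the function names (`compatible`, `hash_join`, `union`) are hypothetical, plain strings stand in for Term values, and the LeftJoin variant ignores any filter condition attached to the OPTIONAL.

```python
from collections import defaultdict


def compatible(left, right):
    """Two solutions are compatible if they agree on every variable they share."""
    return all(left[v] == right[v] for v in left.keys() & right.keys())


def hash_join(left, right, keep_unmatched_left=False):
    """Join two solution sequences (lists of dicts).

    With keep_unmatched_left=True this behaves like LeftJoin (OPTIONAL)
    without a filter condition: left rows with no partner are kept as-is.
    """
    left, right = list(left), list(right)
    if not left or not right:
        return [dict(l) for l in left] if keep_unmatched_left else []

    # Hash only on variables bound in every row of both inputs. Variables
    # bound in just some rows (e.g. after an earlier OPTIONAL) are handled
    # by the compatibility check below.
    key_vars = sorted(set.intersection(*(set(sol) for sol in left + right)))

    index = defaultdict(list)
    for r in right:
        index[tuple(r[v] for v in key_vars)].append(r)

    out = []
    for l in left:
        candidates = index.get(tuple(l[v] for v in key_vars), [])
        matches = [r for r in candidates if compatible(l, r)]
        if matches:
            out.extend({**l, **r} for r in matches)
        elif keep_unmatched_left:
            out.append(dict(l))
    return out


def union(left, right):
    """UNION is plain concatenation of the two sequences."""
    return list(left) + list(right)
```

Hashing on the always-bound shared variables keeps the join linear in the common case, while the compatibility check preserves correctness when some rows leave a shared variable unbound.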
Data Models
Request
@dataclass
class SparqlQueryRequest:
    user: str = ""
    collection: str = ""
    query: str = ""     # SPARQL query string
    limit: int = 10000  # Safety limit on results
Response
# Defined first so that SparqlQueryResponse can reference it
@dataclass
class SparqlBinding:
    values: list[Term | None] = field(default_factory=list)

@dataclass
class SparqlQueryResponse:
    error: Error | None = None
    query_type: str = ""  # "select", "ask", "construct", "describe"
    # For SELECT queries
    variables: list[str] = field(default_factory=list)
    bindings: list[SparqlBinding] = field(default_factory=list)
    # For ASK queries
    ask_result: bool = False
    # For CONSTRUCT/DESCRIBE queries
    triples: list[Triple] = field(default_factory=list)
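As an illustration of how SELECT results might be packed into this shape, the sketch below assumes that each binding's `values` list is positional, with one entry per name in `variables` and `None` for an unbound variable. The spec does not state this explicitly, so treat it as an assumption. `Binding` and `pack_select_result` are hypothetical stand-ins, and plain strings replace Term values.

```python
from dataclasses import dataclass, field


@dataclass
class Binding:
    """Stand-in for SparqlBinding (assumed positional layout)."""
    values: list = field(default_factory=list)


def pack_select_result(variables, solutions):
    """Turn dict-shaped solutions into rows aligned with `variables`.

    Unbound variables (for example from an unmatched OPTIONAL) become None.
    """
    return [Binding([sol.get(v) for v in variables]) for sol in solutions]
```

Under that assumption a client can rebuild each row as `dict(zip(variables, binding.values))`.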
BGP Evaluation Strategy
For each triple pattern in a BGP:
- Extract bound terms (concrete IRIs/literals) and variables
- Call TriplesClient.query_stream(s, p, o) with bound terms, None for variables
- Map returned triples back to variable bindings
For multi-pattern BGPs, join solutions incrementally:
- Order patterns by selectivity (patterns with more bound terms first)
- For each subsequent pattern, substitute bound variables from the current solution sequence before querying
- This avoids full cross-products and reduces the number of triples queries
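The strategy above can be sketched against an in-memory list of triples. Everything here is illustrative: `lookup` is a hypothetical stand-in for the triples query (it is not the TriplesClient API), variables are written as strings starting with `?`, and the selectivity ordering is a simple static sort by number of bound terms.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def lookup(store, s=None, p=None, o=None):
    """Stand-in for a single-pattern triples query; None is a wildcard."""
    return [t for t in store
            if all(q is None or q == v for q, v in zip((s, p, o), t))]


def bind(pattern, triple, solution):
    """Extend solution with the bindings implied by triple, or None on conflict."""
    extended = dict(solution)
    for term, value in zip(pattern, triple):
        if is_var(term) and extended.setdefault(term, value) != value:
            return None
    return extended


def eval_bgp(store, patterns):
    """Evaluate a BGP by sequential lookups with bound-variable substitution."""
    # Patterns with more concrete terms first; sorted() is stable, so ties
    # keep their original order.
    ordered = sorted(patterns, key=lambda p: sum(not is_var(t) for t in p),
                     reverse=True)
    solutions = [{}]
    for pattern in ordered:
        next_solutions = []
        for sol in solutions:
            # Substitute already-bound variables; unbound ones become None.
            query = [sol.get(t) if is_var(t) else t for t in pattern]
            for triple in lookup(store, *query):
                extended = bind(pattern, triple, sol)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions
```

The sketch issues one lookup per (pattern, current solution) pair. A fuller implementation would probably deduplicate identical substituted queries and choose the next pattern dynamically based on which variables are already bound.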
Streaming and Early Termination
The triples query service supports streaming responses (batched delivery via
TriplesClient.query_stream). The SPARQL evaluator should use streaming
from the start, not as an optimisation. This is important because:
- Early termination: when the SPARQL query has a LIMIT, or when only one solution is needed (ASK queries), we can stop consuming triples as soon as we have enough results. Without streaming, a wildcard pattern like ?s ?p ?o would fetch the entire graph before we could apply the limit.
- Memory efficiency: results are processed batch-by-batch rather than materialising the full result set in memory before joining.
The batch callback in query_stream returns a boolean to signal completion.
The evaluator should signal completion (return True) as soon as sufficient
solutions have been produced, allowing the underlying pub/sub consumer to
stop pulling batches.
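A minimal sketch of the callback contract described here. `fake_query_stream` is a hypothetical synchronous stand-in that mimics only the "return True to stop" behaviour; the real query_stream signature is not given in this spec. `LimitCollector` is likewise an illustrative name.

```python
def fake_query_stream(triples, on_batch, batch_size=2):
    """Deliver triples in batches until the callback returns True.

    Returns the number of batches delivered, to make early termination visible.
    """
    delivered = 0
    for i in range(0, len(triples), batch_size):
        delivered += 1
        if on_batch(triples[i:i + batch_size]):
            break
    return delivered


class LimitCollector:
    """Batch callback that stops the stream once `limit` solutions exist."""

    def __init__(self, limit):
        self.limit = limit
        self.solutions = []

    def __call__(self, batch):
        for s, p, o in batch:
            if len(self.solutions) >= self.limit:
                break
            self.solutions.append({"s": s, "p": p, "o": o})
        # True signals completion: no further batches are needed.
        return len(self.solutions) >= self.limit
```

An ASK query corresponds to `LimitCollector(1)`: the first batch containing any match ends the stream.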
Parallel BGP Execution (Phase 2 Optimisation)
Within a BGP, patterns that share variables benefit from sequential
evaluation with bound-variable substitution (query results from earlier
patterns narrow later queries). However, patterns with no shared variables
are independent and could be issued concurrently via asyncio.gather.
A practical approach for a future optimisation pass:
- Analyse BGP patterns and identify connected components (groups of patterns linked by shared variables)
- Execute independent components in parallel
- Within each component, evaluate patterns sequentially with substitution
This is not needed for correctness -- the sequential approach works for all cases -- but could significantly reduce latency for queries with independent pattern groups. Flagged as a phase 2 optimisation.
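One way the component analysis could look, as a sketch only: patterns are grouped with a small union-find keyed on shared variables, and the groups are then evaluated concurrently with asyncio.gather. The helper names and the `eval_component` coroutine parameter are hypothetical.

```python
import asyncio


def variables_of(pattern):
    return {t for t in pattern if isinstance(t, str) and t.startswith("?")}


def connected_components(patterns):
    """Group patterns that are linked, directly or transitively, by shared variables."""
    parent = list(range(len(patterns)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # variable -> index of the first pattern using it
    for i, pattern in enumerate(patterns):
        for var in variables_of(pattern):
            if var in first_seen:
                parent[find(i)] = find(first_seen[var])
            else:
                first_seen[var] = i

    groups = {}
    for i, pattern in enumerate(patterns):
        groups.setdefault(find(i), []).append(pattern)
    return list(groups.values())


async def eval_components(components, eval_component):
    """Run one coroutine per independent component concurrently."""
    return await asyncio.gather(*(eval_component(c) for c in components))
```

Because the components share no variables, their results combine as a cross-product, so the gain is in latency rather than in result size.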
FILTER Expression Evaluation
rdflib's algebra represents FILTER expressions as expression trees. We evaluate these against each solution row, supporting:
- Comparison operators (=, !=, <, >, <=, >=)
- Logical operators (&&, ||, !)
- SPARQL built-in functions (isIRI, isLiteral, isBlank, str, lang, datatype, bound, regex, etc.)
- Arithmetic operators (+, -, *, /)
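To make the evaluation model concrete, here is a small evaluator over a tuple-based expression tree. The tuple encoding is an assumption made for illustration and is not rdflib's expression representation. The sketch also simplifies SPARQL's effective boolean value and its three-valued error handling for `&&` and `||`: any error simply removes the row.

```python
import operator
import re

COMPARE = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
           ">": operator.gt, "<=": operator.le, ">=": operator.ge}
ARITH = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "/": operator.truediv}


def evaluate(expr, solution):
    """Evaluate an expression tree such as (">=", ("var", "age"), 18)."""
    if not isinstance(expr, tuple):
        return expr  # constant
    op, *args = expr
    if op == "var":
        return solution[args[0]]  # KeyError when the variable is unbound
    if op == "bound":
        return args[0] in solution
    if op == "&&":
        return all(evaluate(a, solution) for a in args)
    if op == "||":
        return any(evaluate(a, solution) for a in args)
    if op == "!":
        return not evaluate(args[0], solution)
    if op == "regex":
        text, pattern = (evaluate(a, solution) for a in args)
        return re.search(pattern, text) is not None
    left, right = (evaluate(a, solution) for a in args)
    if op in COMPARE:
        return COMPARE[op](left, right)
    return ARITH[op](left, right)


def apply_filter(expr, solutions):
    """Keep solutions for which expr is true; an evaluation error drops the row."""
    kept = []
    for sol in solutions:
        try:
            if evaluate(expr, sol):
                kept.append(sol)
        except (KeyError, TypeError, ZeroDivisionError):
            pass
    return kept
```

Dropping a row on error matches the SPARQL rule that a FILTER whose expression raises an error eliminates that solution.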
Implementation Order
- Schema and service skeleton -- define SparqlQueryRequest/Response dataclasses, create the FlowProcessor subclass with ConsumerSpec, ProducerSpec, and TriplesClientSpec wired up. Verify it starts and connects.
- SPARQL parsing -- wrap rdflib's parser to produce algebra trees from SPARQL strings. Handle parse errors gracefully. Unit test with a range of query shapes.
- BGP evaluation -- implement single-pattern and multi-pattern BGP evaluation using TriplesClient. This is the core building block. Test with simple SELECT WHERE { ?s ?p ?o } queries.
- Joins and solution sequences -- implement hash join, left join (for OPTIONAL), and union. Test with multi-pattern queries.
- FILTER evaluation -- implement the expression evaluator for FILTER clauses. Start with comparisons and logical operators, then add built-in functions incrementally.
- Solution modifiers -- DISTINCT, ORDER BY, LIMIT, OFFSET, projection.
- ASK / CONSTRUCT / DESCRIBE -- extend beyond SELECT. ASK is trivial (non-empty result = true). CONSTRUCT builds triples from a template. DESCRIBE fetches all triples for matched resources.
- Aggregation -- GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT, SAMPLE.
- BIND, VALUES, subqueries -- remaining SPARQL 1.1 features.
- API gateway integration -- add SparqlQueryRequestor dispatcher, request/response translators, and API endpoint so that the SPARQL service is accessible via the HTTP gateway.
- SDK support -- add sparql_query() method to FlowInstance in the Python API SDK, following the same pattern as triples_query().
- CLI command -- add a tg-sparql-query CLI command that takes a SPARQL query string (or reads from a file/stdin), submits it via the SDK, and prints results in a readable format (table for SELECT, true/false for ASK, Turtle for CONSTRUCT/DESCRIBE).
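For the ASK / CONSTRUCT step in the list above, the result forms reduce to small functions over an already-computed solution sequence. The sketch below uses hypothetical names and string terms, with variables written as `?name`. It follows the SPARQL rules that CONSTRUCT skips template triples containing an unbound variable and returns a set of triples without duplicates.

```python
def ask(solutions):
    """ASK is true as soon as at least one solution exists."""
    return any(True for _ in solutions)


def construct(template, solutions):
    """Instantiate each template triple once per solution.

    Triples with an unbound variable are skipped; duplicates are removed
    while preserving first-seen order.
    """
    seen = set()
    triples = []
    for sol in solutions:
        for pattern in template:
            triple = tuple(sol.get(t) if t.startswith("?") else t
                           for t in pattern)
            if None in triple or triple in seen:
                continue
            seen.add(triple)
            triples.append(triple)
    return triples
```

Since `ask` only needs the first solution, it pairs naturally with a lazily produced solution sequence.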
Performance Considerations
In-memory join over pub/sub round-trips will be slower than native SPARQL on a graph database. Key mitigations:
- Streaming with early termination: use query_stream so that limit-bound queries don't fetch entire result sets. A SELECT ... LIMIT 1 against a wildcard pattern fetches one batch, not the whole graph.
- Bound-variable substitution: when evaluating BGP patterns sequentially, substitute known bindings into subsequent patterns to issue narrow queries rather than broad ones followed by in-memory filtering.
- Parallel independent patterns (phase 2): patterns with no shared variables can be issued concurrently.
- Query complexity limits: may need a cap on the number of triple pattern queries issued per SPARQL query to prevent runaway evaluation.
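The complexity cap mentioned in the last bullet could be as simple as a per-query counter that the evaluator charges before every triples lookup. This is a sketch of one possible shape, with hypothetical names, not a description of an existing mechanism.

```python
class QueryBudgetExceeded(Exception):
    """Raised when a SPARQL query issues too many triple pattern lookups."""


class QueryBudget:
    def __init__(self, max_lookups):
        self.max_lookups = max_lookups
        self.used = 0

    def charge(self):
        """Call once before each triples query; raises when the cap is hit."""
        if self.used >= self.max_lookups:
            raise QueryBudgetExceeded(
                f"more than {self.max_lookups} triple pattern lookups")
        self.used += 1
```

The service could catch the exception at the top level and report it through the `error` field of the response.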
Named Graph Mapping
SPARQL's GRAPH ?g { ... } and GRAPH <uri> { ... } clauses map to the
triples query service's graph filter parameter:
- GRAPH <uri> { ?s ?p ?o }: pass g=uri to the triples query
- Patterns outside any GRAPH clause: pass g="" (default graph only)
- GRAPH ?g { ?s ?p ?o }: pass g="*" (all graphs), then bind ?g from the returned triple's graph field
The triples query interface does not support a wildcard graph natively in
the SPARQL sense, but g="*" (all graphs) combined with client-side
filtering on the returned graph values achieves the same effect.
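The mapping can be written as two small helpers. The names are hypothetical, and the sketch assumes that default-graph triples come back with an empty graph field. Under that assumption they are filtered out for GRAPH ?g, because in SPARQL a GRAPH variable ranges over named graphs only.

```python
def graph_param(graph_term):
    """Map the enclosing GRAPH context to the triples query's g parameter.

    graph_term is None outside any GRAPH clause, a "?var" string for
    GRAPH ?g, or an IRI string for GRAPH <uri>.
    """
    if graph_term is None:
        return ""   # default graph only
    if graph_term.startswith("?"):
        return "*"  # all graphs, filtered client-side
    return graph_term


def bind_graph(graph_term, solution, triple_graph):
    """Bind ?g from a returned triple's graph field; None means 'drop this row'."""
    if graph_term is None or not graph_term.startswith("?"):
        return solution
    if triple_graph == "":
        return None  # default-graph triple: not matched by GRAPH ?g
    if solution.get(graph_term, triple_graph) != triple_graph:
        return None  # ?g already bound to a different graph
    return {**solution, graph_term: triple_graph}
```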
Open Questions
- SPARQL 1.2: how far does rdflib's parser support 1.2 features? Property paths are already in 1.1; 1.2 adds lateral joins, ADJUST, etc. Start with 1.1 and extend as rdflib support matures.