Changed schema for Value -> Term, majorly breaking change (#622)

* Changed schema for Value -> Term, majorly breaking change

* Following the schema change, Value -> Term into all processing

* Updated Cassandra for g, p, s, o index patterns (7 indexes)

* Reviewed and updated all tests

* Neo4j, Memgraph and FalkorDB remain broken, will look at once settled down
This commit is contained in:
cybermaggedon 2026-01-27 13:48:08 +00:00 committed by GitHub
parent e061f2c633
commit cf0daedefa
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
86 changed files with 2458 additions and 1764 deletions

View file

@ -228,10 +228,15 @@ Following SPARQL conventions for backward compatibility:
- **`g` omitted / None**: Query the default graph only
- **`g` = specific IRI**: Query that named graph only
- **`g` = wildcard / `*`**: Query across all graphs
- **`g` = wildcard / `*`**: Query across all graphs (equivalent to SPARQL
`GRAPH ?g { ... }`)
This keeps simple queries simple and makes named graph queries opt-in.
Cross-graph queries (g=wildcard) are fully supported. The Cassandra schema
includes dedicated tables (SPOG, POSG, OSPG) where g is a clustering column
rather than a partition key, enabling efficient queries across all graphs.
#### Temporal Queries
**Find all facts discovered after a given date:**
@ -388,12 +393,78 @@ will proceed in phases:
Cassandra requires multiple tables to support different query access patterns
(each table efficiently queries by its partition key + clustering columns).
**Challenge: Quads**
##### Query Patterns
For triples, typical indexes are SPO, POS, OSP (partition by first, cluster by
rest). For quads, the graph dimension adds: SPOG, POSG, OSPG, GSPO, etc.
With quads (g, s, p, o), each position can be specified or wildcard, giving
16 possible query patterns:
**Challenge: Quoted Triples**
| # | g | s | p | o | Description |
|---|---|---|---|---|-------------|
| 1 | ? | ? | ? | ? | All quads |
| 2 | ? | ? | ? | o | By object |
| 3 | ? | ? | p | ? | By predicate |
| 4 | ? | ? | p | o | By predicate + object |
| 5 | ? | s | ? | ? | By subject |
| 6 | ? | s | ? | o | By subject + object |
| 7 | ? | s | p | ? | By subject + predicate |
| 8 | ? | s | p | o | Full triple (which graphs?) |
| 9 | g | ? | ? | ? | By graph |
| 10 | g | ? | ? | o | By graph + object |
| 11 | g | ? | p | ? | By graph + predicate |
| 12 | g | ? | p | o | By graph + predicate + object |
| 13 | g | s | ? | ? | By graph + subject |
| 14 | g | s | ? | o | By graph + subject + object |
| 15 | g | s | p | ? | By graph + subject + predicate |
| 16 | g | s | p | o | Exact quad |
##### Table Design
Cassandra constraint: You can only efficiently query by partition key, then
filter on clustering columns left-to-right. For g-wildcard queries, g must be
a clustering column. For g-specified queries, g in the partition key is more
efficient.
**Two table families needed:**
**Family A: g-wildcard queries** (g in clustering columns)
| Table | Partition | Clustering | Supports patterns |
|-------|-----------|------------|-------------------|
| SPOG | (user, collection, s) | p, o, g | 5, 7, 8 |
| POSG | (user, collection, p) | o, s, g | 3, 4 |
| OSPG | (user, collection, o) | s, p, g | 2, 6 |
**Family B: g-specified queries** (g in partition key)
| Table | Partition | Clustering | Supports patterns |
|-------|-----------|------------|-------------------|
| GSPO | (user, collection, g, s) | p, o | 9, 13, 15, 16 |
| GPOS | (user, collection, g, p) | o, s | 11, 12 |
| GOSP | (user, collection, g, o) | s, p | 10, 14 |
**Collection table** (for iteration and bulk deletion)
| Table | Partition | Clustering | Purpose |
|-------|-----------|------------|---------|
| COLL | (user, collection) | g, s, p, o | Enumerate all quads in collection |
##### Write and Delete Paths
**Write path**: Insert into all 7 tables.
**Delete collection path**:
1. Iterate COLL table for `(user, collection)`
2. For each quad, delete from all 6 query tables
3. Delete from COLL table (or range delete)
**Delete single quad path**: Delete from all 7 tables directly.
##### Storage Cost
Each quad is stored 7 times. This is the cost of flexible querying combined
with efficient collection deletion.
##### Quoted Triples in Storage
Subject or object can be a triple itself. Options:
@ -425,29 +496,9 @@ Metadata table:
- Pro: Clean separation, can index triple IDs
- Con: Requires computing/managing triple identity, two-phase lookups
**Option C: Hybrid**
- Store quads normally with serialized quoted triple strings for simple cases
- Maintain a separate triple ID lookup for advanced queries
- Pro: Flexibility
- Con: Complexity
**Recommendation**: TBD after prototyping. Option A is simplest for initial
implementation; Option B may be needed for advanced query patterns.
#### Indexing Strategy
Indexes must support the defined query patterns:
| Query Type | Access Pattern | Index Needed |
|------------|----------------|--------------|
| Facts by date | P=discoveredOn, O>date | POG (predicate, object, graph) |
| Facts by source | P=supportedBy, O=source | POG |
| Facts by asserter | P=assertedBy, O=person | POG |
| Metadata for a fact | S=quotedTriple | SPO/SPOG |
| All facts in graph | G=graphIRI | GSPO |
For temporal range queries (dates), Cassandra clustering column ordering
enables efficient scans when date is a clustering column.
**Recommendation**: Start with Option A (serialized strings) for simplicity.
Option B may be needed if advanced query patterns over quoted triple
components are required.
2. **Phase 2+: Other Backends**
- Neo4j and other stores implemented in subsequent stages