Fix Cassandra schema and graph filter semantics (#680)

Schema fix (dtype/lang clustering key):
- Add dtype and lang to PRIMARY KEY in quads_by_entity table
- Add otype, dtype, lang to PRIMARY KEY in quads_by_collection table
- Fixes deduplication bug where literals with same value but different
  datatype or language tag were collapsed (e.g., "thing" vs "thing"@en)
- Update delete_collection to pass new clustering columns
- Update tech spec to reflect new schema

Graph filter semantics (simplified, no wildcard constant):
- g=None means all graphs (no filter)
- g="" means default graph only
- g="uri" means specific named graph
- Remove GRAPH_WILDCARD usage from EntityCentricKnowledgeGraph
- Fix service.py streaming and non-streaming paths
- Fix CLI to preserve empty string for -g '' argument
This commit is contained in:
cybermaggedon 2026-03-10 12:52:51 +00:00 committed by GitHub
parent c951562189
commit 84941ce645
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
5 changed files with 102 additions and 65 deletions

View file

@ -42,7 +42,7 @@ CREATE TABLE quads_by_entity (
d text, -- Dataset/graph of the quad
dtype text, -- XSD datatype (when otype = 'L'), e.g. 'xsd:string'
lang text, -- Language tag (when otype = 'L'), e.g. 'en', 'fr'
PRIMARY KEY ((collection, entity), role, p, otype, s, o, d)
PRIMARY KEY ((collection, entity), role, p, otype, s, o, d, dtype, lang)
);
```
@ -54,6 +54,7 @@ CREATE TABLE quads_by_entity (
2. **p** — next most common filter, "give me all `knows` relationships"
3. **otype** — enables filtering by URI-valued vs literal-valued relationships
4. **s, o, d** — remaining columns for uniqueness
5. **dtype, lang** — distinguish literals with same value but different type metadata (e.g., `"thing"` vs `"thing"@en` vs `"thing"^^xsd:string`)
### Table 2: quads_by_collection
@ -69,11 +70,11 @@ CREATE TABLE quads_by_collection (
otype text, -- 'U' (URI), 'L' (literal), 'T' (triple/reification)
dtype text, -- XSD datatype (when otype = 'L')
lang text, -- Language tag (when otype = 'L')
PRIMARY KEY (collection, d, s, p, o)
PRIMARY KEY (collection, d, s, p, o, otype, dtype, lang)
);
```
Clustered by dataset first, enabling deletion at either collection or dataset granularity.
Clustered by dataset first, enabling deletion at either collection or dataset granularity. The `otype`, `dtype`, and `lang` columns are included in the clustering key to distinguish literals with the same value but different type metadata — in RDF, `"thing"`, `"thing"@en`, and `"thing"^^xsd:string` are semantically distinct values.
## Write Path