Fix Cassandra schema and graph filter semantics (#680)

Schema fix (dtype/lang clustering key):
- Add dtype and lang to PRIMARY KEY in quads_by_entity table
- Add otype, dtype, lang to PRIMARY KEY in quads_by_collection table
- Fixes deduplication bug where literals with same value but different
  datatype or language tag were collapsed (e.g., "thing" vs "thing"@en)
- Update delete_collection to pass new clustering columns
- Update tech spec to reflect new schema

Graph filter semantics (simplified, no wildcard constant):
- g=None means all graphs (no filter)
- g="" means default graph only
- g="uri" means specific named graph
- Remove GRAPH_WILDCARD usage from EntityCentricKnowledgeGraph
- Fix service.py streaming and non-streaming paths
- Fix CLI to preserve empty string for -g '' argument
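
The three graph filter cases above can be sketched roughly as follows. This is an illustration only, not the code from this commit: `Quad`, `filter_by_graph`, and the `--graph` long option are hypothetical names, and the real filtering presumably happens in the Cassandra queries rather than in memory. The point is that `None` and `""` must be told apart with an identity check, since a plain truthiness test treats both as "no filter".

```python
import argparse
from typing import Iterable, NamedTuple, Optional


class Quad(NamedTuple):
    s: str
    p: str
    o: str
    g: str  # "" stands for the default graph (assumption for this sketch)


def filter_by_graph(quads: Iterable[Quad], g: Optional[str] = None) -> list:
    """Hypothetical helper mirroring the semantics in the commit message."""
    if g is None:
        # g=None: all graphs, no filter applied
        return list(quads)
    # g="" selects the default graph only; any other string selects that named graph
    return [q for q in quads if q.g == g]


quads = [
    Quad("ex:a", "ex:knows", "ex:b", ""),
    Quad("ex:a", "ex:knows", "ex:c", "http://example.org/g1"),
]

all_graphs = filter_by_graph(quads)
default_only = filter_by_graph(quads, "")
named_only = filter_by_graph(quads, "http://example.org/g1")

# CLI side: default=None keeps "flag omitted" distinct from "-g ''".
# A check such as `if args.graph:` would wrongly fold "" into None.
parser = argparse.ArgumentParser()
parser.add_argument("-g", "--graph", default=None)
omitted = parser.parse_args([]).graph
empty = parser.parse_args(["-g", ""]).graph
```

With this shape, callers pass `args.graph` through unchanged and test `is None` wherever they need to know whether a filter was requested.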
2026-06-23 21:58:06 +02:00 · 2026-03-10 12:52:51 +00:00 · 2026-03-10 12:52:51 +00:00 · 84941ce645
commit 84941ce645
parent c951562189
5 changed files with 102 additions and 65 deletions
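
To see why the wider primary key matters, the sketch below models a Cassandra table as a Python dict keyed by its primary key columns, so that a write with an existing key overwrites the earlier row. It is a simplified model under stated assumptions, not the project's storage code: `upsert` is a hypothetical helper, and empty strings stand in for an absent datatype or language tag.

```python
def upsert(table: dict, key_columns: tuple, row: dict) -> None:
    """Model of a Cassandra write: rows with an equal primary key overwrite."""
    key = tuple(row[c] for c in key_columns)
    table[key] = row


# Primary key of quads_by_collection before and after this commit
OLD_KEY = ("collection", "d", "s", "p", "o")
NEW_KEY = OLD_KEY + ("otype", "dtype", "lang")

base = {"collection": "c1", "d": "", "s": "ex:a", "p": "ex:label",
        "o": "thing", "otype": "L"}
rows = [
    {**base, "dtype": "", "lang": ""},    # "thing"
    {**base, "dtype": "", "lang": "en"},  # "thing"@en
]

old_table: dict = {}
new_table: dict = {}
for row in rows:
    upsert(old_table, OLD_KEY, row)
    upsert(new_table, NEW_KEY, row)

# Under the old key the second literal replaces the first;
# under the new key both survive as separate rows.
old_count = len(old_table)
new_count = len(new_table)
```

Under the old key the table ends up holding only the `@en` literal, which is the collapse described in the commit message.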
--- a/docs/tech-specs/entity-centric-graph.md
+++ b/docs/tech-specs/entity-centric-graph.md
@@ -42,7 +42,7 @@ CREATE TABLE quads_by_entity (
    d          text,       -- Dataset/graph of the quad
    dtype      text,       -- XSD datatype (when otype = 'L'), e.g. 'xsd:string'
    lang       text,       -- Language tag (when otype = 'L'), e.g. 'en', 'fr'
-    PRIMARY KEY ((collection, entity), role, p, otype, s, o, d)
+    PRIMARY KEY ((collection, entity), role, p, otype, s, o, d, dtype, lang)
 );
 ```

@@ -54,6 +54,7 @@ CREATE TABLE quads_by_entity (
 2. **p** — next most common filter, "give me all `knows` relationships"
 3. **otype** — enables filtering by URI-valued vs literal-valued relationships
 4. **s, o, d** — remaining columns for uniqueness
+5. **dtype, lang** — distinguish literals with same value but different type metadata (e.g., `"thing"` vs `"thing"@en` vs `"thing"^^xsd:string`)

 ### Table 2: quads_by_collection

@@ -69,11 +70,11 @@ CREATE TABLE quads_by_collection (
    otype      text,       -- 'U' (URI), 'L' (literal), 'T' (triple/reification)
    dtype      text,       -- XSD datatype (when otype = 'L')
    lang       text,       -- Language tag (when otype = 'L')
-    PRIMARY KEY (collection, d, s, p, o)
+    PRIMARY KEY (collection, d, s, p, o, otype, dtype, lang)
 );
 ```

-Clustered by dataset first, enabling deletion at either collection or dataset granularity.
+Clustered by dataset first, enabling deletion at either collection or dataset granularity. The `otype`, `dtype`, and `lang` columns are included in the clustering key to distinguish literals with the same value but different type metadata — in RDF, `"thing"`, `"thing"@en`, and `"thing"^^xsd:string` are semantically distinct values.

 ## Write Path