Ontology Extractor Unit Tests
Comprehensive unit tests for the OntoRAG ontology extraction system.
Test Coverage
1. test_ontology_selector.py - Auto-Include Properties Feature
Tests the critical dependency resolution that automatically includes all properties related to selected classes.
Key Tests:
- `test_auto_include_properties_for_recipe_class` - Verifies Recipe class auto-includes `ingredients`, `method`, `produces`, `serves`
- `test_auto_include_properties_for_ingredient_class` - Verifies Ingredient class auto-includes `food` property
- `test_auto_include_properties_for_range_class` - Tests properties are included when class appears in range
- `test_auto_include_adds_domain_and_range_classes` - Ensures related classes are added too
- `test_multiple_classes_get_all_related_properties` - Tests combining multiple class selections
- `test_no_duplicate_properties_added` - Ensures properties aren't duplicated
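The dependency resolution these tests exercise can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the selector's actual code: the `auto_include_properties` helper, the `domain`/`range` dict keys, and the toy property assignments are all invented for the example.

```python
# Toy ontology fragment; the domain/range assignments are illustrative assumptions.
PROPERTIES = {
    "ingredients": {"domain": "Recipe", "range": "Ingredient"},
    "method": {"domain": "Recipe", "range": "Method"},
    "produces": {"domain": "Recipe", "range": "Food"},
    "food": {"domain": "Ingredient", "range": "Food"},
}


def auto_include_properties(selected_classes, properties):
    """Hypothetical helper: expand a class selection with related properties.

    A property is included when a selected class appears as its domain or
    its range. The classes at both ends of each included property are then
    added to the returned class set.
    """
    selected = set(selected_classes)
    # Keying by property name means a property can never be added twice.
    included = {
        name: prop
        for name, prop in properties.items()
        if prop["domain"] in selected or prop["range"] in selected
    }
    classes = set(selected)
    for prop in included.values():
        classes.update((prop["domain"], prop["range"]))
    return classes, included


classes, props = auto_include_properties({"Recipe"}, PROPERTIES)
```

Selecting only `Recipe` here pulls in `ingredients`, `method`, and `produces`, along with the `Ingredient`, `Method`, and `Food` classes they connect to, which is the behaviour that keeps the extracted graph connected.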
2. test_uri_expansion.py - URI Expansion
Tests that URIs are properly expanded using ontology definitions instead of constructed fallback URIs.
Key Tests:
- `test_expand_class_uri_from_ontology` - Class names expand to ontology URIs
- `test_expand_object_property_uri_from_ontology` - Object properties use ontology URIs
- `test_expand_datatype_property_uri_from_ontology` - Datatype properties use ontology URIs
- `test_expand_rdf_prefix` - Standard RDF prefixes expand correctly
- `test_expand_rdfs_prefix`, `test_expand_owl_prefix`, `test_expand_xsd_prefix` - Other standard prefixes
- `test_fallback_uri_for_instance` - Entity instances get constructed URIs
- `test_already_full_uri_unchanged` - Full URIs pass through
- `test_dict_access_not_object_attribute` - Critical test verifying dict access works (not object attributes)
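The expansion order implied by these tests might look something like the sketch below. The `expand_uri` function, the ontology dict layout, the `example.org` URIs, and the fallback format are assumptions made for illustration; only the four standard namespace URIs are the real W3C ones.

```python
STANDARD_PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

# Hypothetical ontology subset, held as plain dicts.
ONTOLOGY = {
    "classes": {"Recipe": {"uri": "http://example.org/food#Recipe"}},
    "object_properties": {"ingredients": {"uri": "http://example.org/food#ingredients"}},
    "datatype_properties": {"serves": {"uri": "http://example.org/food#serves"}},
}


def expand_uri(value, ontology, fallback_base="https://example.org/entity/"):
    """Hypothetical expansion: full URI, ontology lookup, prefix, then fallback."""
    # 1. Full URIs pass through unchanged.
    if value.startswith(("http://", "https://")):
        return value
    # 2. Prefer the URI declared in the ontology (dict access, not attributes).
    for section in ("classes", "object_properties", "datatype_properties"):
        entry = ontology.get(section, {}).get(value)
        if entry and entry.get("uri"):
            return entry["uri"]
    # 3. Expand standard prefixes such as rdf: or xsd:.
    prefix, sep, local = value.partition(":")
    if sep and prefix in STANDARD_PREFIXES:
        return STANDARD_PREFIXES[prefix] + local
    # 4. Anything else is treated as an entity instance with a constructed URI.
    return fallback_base + value.replace(" ", "-").lower()
```

Checking the ontology before building a fallback is the point of the design: a constructed URI for `Recipe` would not match the URI the ontology itself publishes for that class.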
3. test_ontology_triples.py - Ontology Triple Generation
Tests that ontology elements (classes and properties) are properly converted to RDF triples with labels, comments, domains, and ranges.
Key Tests:
- `test_generates_class_type_triples` - Classes get `rdf:type owl:Class` triples
- `test_generates_class_labels` - Classes get `rdfs:label` triples
- `test_generates_class_comments` - Classes get `rdfs:comment` triples
- `test_generates_object_property_type_triples` - Object properties get proper type triples
- `test_generates_object_property_labels` - Object properties get labels
- `test_generates_object_property_domain` - Object properties get `rdfs:domain` triples
- `test_generates_object_property_range` - Object properties get `rdfs:range` triples
- `test_generates_datatype_property_type_triples` - Datatype properties get proper type triples
- `test_generates_datatype_property_range` - Datatype properties get XSD type ranges
- `test_uses_dict_field_names_not_rdf_names` - Critical test verifying dict field names work
- `test_total_triple_count_is_reasonable` - Validates expected number of triples
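As a rough picture of what these tests check, the sketch below turns dict-based definitions into plain `(subject, predicate, object)` tuples. The helper names, the dict keys (`label`, `comment`, `domain`, `range`), and the use of tuples rather than the project's own triple objects are assumptions for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
RDFS_DOMAIN = "http://www.w3.org/2000/01/rdf-schema#domain"
RDFS_RANGE = "http://www.w3.org/2000/01/rdf-schema#range"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
OWL_OBJECT_PROPERTY = "http://www.w3.org/2002/07/owl#ObjectProperty"


def class_triples(uri, definition):
    """Hypothetical helper: type triple plus optional label and comment."""
    triples = [(uri, RDF_TYPE, OWL_CLASS)]
    if definition.get("label"):
        triples.append((uri, RDFS_LABEL, definition["label"]))
    if definition.get("comment"):
        triples.append((uri, RDFS_COMMENT, definition["comment"]))
    return triples


def object_property_triples(uri, definition):
    """Hypothetical helper: type, label, domain and range for a property."""
    triples = [(uri, RDF_TYPE, OWL_OBJECT_PROPERTY)]
    # Map dict field names (not RDF names) onto the RDF predicates.
    for field, predicate in (
        ("label", RDFS_LABEL),
        ("domain", RDFS_DOMAIN),
        ("range", RDFS_RANGE),
    ):
        if definition.get(field):
            triples.append((uri, predicate, definition[field]))
    return triples
```

A class with a label and a comment yields three triples, and a fully described object property yields four, which is the kind of count a "total triple count" test can assert on.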
4. test_text_processing.py - Text Processing and Segmentation
Tests that text is properly split into sentences for ontology matching, including NLTK tokenization and TextSegment creation.
Key Tests:
- `test_segment_single_sentence` - Single sentence produces one segment
- `test_segment_multiple_sentences` - Multiple sentences split correctly
- `test_segment_positions` - Segment start/end positions are correct
- `test_segment_complex_punctuation` - Handles abbreviations (Dr., U.S.A., Mr.)
- `test_segment_question_and_exclamation` - Different sentence terminators
- `test_segment_preserves_original_text` - Segments can reconstruct original
- `test_text_segment_non_overlapping` - Segments don't overlap
- `test_nltk_punkt_availability` - NLTK tokenizer is available
- `test_unicode_text` - Handles unicode characters
- `test_quoted_text` - Handles quoted text correctly
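The position and non-overlap invariants checked above can be illustrated with a small stand-in splitter. Note the substitution: this sketch uses a plain regular expression instead of NLTK's punkt tokenizer, so it will split incorrectly on abbreviations such as "Dr."; the `Segment` dataclass and `segment` function are hypothetical names, not the project's `TextSegment` API.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical segment record: sentence text plus character offsets."""
    text: str
    start: int
    end: int


# A run of non-terminator characters followed by terminators, or a trailing
# run with no terminator. This is a naive stand-in for NLTK punkt.
_SENTENCE = re.compile(r"[^.!?]+[.!?]+|[^.!?]+$")


def segment(text):
    """Split text into sentences, recording where each one sits in the input."""
    segments = []
    for match in _SENTENCE.finditer(text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Shift the start past any leading whitespace that was stripped.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        segments.append(Segment(stripped, start, start + len(stripped)))
    return segments


sample = "Is it ready? Yes! Serve it warm."
parts = segment(sample)
```

Because each segment stores offsets into the original string, slicing `sample[s.start:s.end]` gives back exactly `s.text`, and consecutive segments never overlap.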
5. test_prompt_and_extraction.py - LLM Prompt Construction and Triple Extraction
Tests that the system correctly constructs prompts with ontology constraints and extracts/validates triples from LLM responses.
Key Tests:
- `test_build_extraction_variables_includes_text` - Prompt includes input text
- `test_build_extraction_variables_includes_classes` - Prompt includes ontology classes
- `test_build_extraction_variables_includes_properties` - Prompt includes properties
- `test_validates_rdf_type_triple_with_valid_class` - Validates `rdf:type` against ontology
- `test_rejects_rdf_type_triple_with_invalid_class` - Rejects invalid classes
- `test_validates_object_property_triple` - Validates object properties
- `test_rejects_unknown_property` - Rejects properties not in ontology
- `test_parse_simple_triple_dict` - Parses triple from dict format
- `test_filters_invalid_triples` - Filters out invalid triples
- `test_expands_uris_in_parsed_triples` - Expands URIs using ontology
- `test_creates_proper_triple_objects` - Creates Triple objects with Value subjects/predicates/objects
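The validation rules these tests describe can be sketched as a simple filter over dict-format triples. The function names, the `subject`/`predicate`/`object` keys, and the ontology layout are assumptions chosen for the example, not the extractor's real interface.

```python
# Hypothetical ontology subset used for validation.
ONTOLOGY_SUBSET = {
    "classes": {"Recipe", "Ingredient"},
    "object_properties": {"ingredients"},
    "datatype_properties": {"serves"},
}


def is_valid_triple(triple, ontology):
    """Hypothetical check: accept only triples the ontology can account for."""
    predicate = triple["predicate"]
    if predicate == "rdf:type":
        # Type assertions must name a class that exists in the ontology.
        return triple["object"] in ontology["classes"]
    # Any other predicate must be a declared property.
    return (
        predicate in ontology["object_properties"]
        or predicate in ontology["datatype_properties"]
    )


def filter_triples(triples, ontology):
    """Drop every triple that fails validation, keeping the original order."""
    return [t for t in triples if is_valid_triple(t, ontology)]


llm_output = [
    {"subject": "cornish-pasty", "predicate": "rdf:type", "object": "Recipe"},
    {"subject": "cornish-pasty", "predicate": "rdf:type", "object": "Spaceship"},
    {"subject": "cornish-pasty", "predicate": "ingredients", "object": "beef"},
    {"subject": "cornish-pasty", "predicate": "invented_by", "object": "nobody"},
]
kept = filter_triples(llm_output, ONTOLOGY_SUBSET)
```

In this toy run the `Spaceship` type assertion and the undeclared `invented_by` property are discarded, leaving two triples.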
6. test_embedding_and_similarity.py - Ontology Embedding and Similarity Matching
Tests that ontology elements are properly embedded and matched against input text using vector similarity.
Key Tests:
- `test_create_text_from_class_with_id` - Text representation includes class ID
- `test_create_text_from_class_with_labels` - Includes labels in text
- `test_create_text_from_class_with_comment` - Includes comments in text
- `test_create_text_from_property_with_domain_range` - Includes domain/range in property text
- `test_normalizes_id_with_underscores` - Normalizes IDs (underscores to spaces)
- `test_includes_subclass_info_for_classes` - Includes subclass relationships
- `test_vector_store_api_structure` - Vector store has expected API
- `test_selector_handles_text_segments` - Selector processes text segments
- `test_merge_subsets_combines_elements` - Merging combines ontology elements
- `test_ontology_element_metadata_structure` - Metadata structure is correct
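To make the text-representation tests concrete, here is a minimal sketch of how a class definition might be flattened into a string for embedding, together with a plain cosine similarity. The `class_text` and `cosine_similarity` names, the dict keys, and the exact string layout are assumptions; the real embedder may format its text differently.

```python
import math


def class_text(class_id, definition):
    """Hypothetical text form of a class: ID, labels, comment, subclass info."""
    # Normalize the ID so "baked_good" embeds like the phrase "baked good".
    parts = [class_id.replace("_", " ")]
    parts.extend(definition.get("labels", []))
    if definition.get("comment"):
        parts.append(definition["comment"])
    if definition.get("subclass_of"):
        parts.append("subclass of " + definition["subclass_of"].replace("_", " "))
    return " ".join(parts)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


text = class_text(
    "baked_good",
    {"labels": ["Baked good"], "comment": "Food cooked in an oven.", "subclass_of": "Food"},
)
```

The resulting string is what would be passed to an embedding service; segments of input text are embedded the same way and compared against it by similarity.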
Running the Tests
Run all ontology extractor tests:
cd /home/mark/work/trustgraph.ai/trustgraph
pytest tests/unit/test_extract/test_ontology/ -v
Run specific test file:
pytest tests/unit/test_extract/test_ontology/test_ontology_selector.py -v
pytest tests/unit/test_extract/test_ontology/test_uri_expansion.py -v
pytest tests/unit/test_extract/test_ontology/test_ontology_triples.py -v
pytest tests/unit/test_extract/test_ontology/test_text_processing.py -v
pytest tests/unit/test_extract/test_ontology/test_prompt_and_extraction.py -v
pytest tests/unit/test_extract/test_ontology/test_embedding_and_similarity.py -v
Run specific test:
pytest tests/unit/test_extract/test_ontology/test_ontology_selector.py::TestOntologySelector::test_auto_include_properties_for_recipe_class -v
Run with coverage:
pytest tests/unit/test_extract/test_ontology/ --cov=trustgraph.extract.kg.ontology --cov-report=html
Test Fixtures
- `sample_ontology` - Complete Food Ontology with Recipe, Ingredient, Food, Method classes
- `ontology_loader_with_sample` - Mock OntologyLoader with the sample ontology
- `ontology_embedder` - Mock embedder for testing
- `mock_embedding_service` - Mock service for generating deterministic embeddings
- `vector_store` - InMemoryVectorStore for testing
- `extractor` - Processor instance for URI expansion tests
- `ontology_subset_with_uris` - OntologySubset with proper URIs defined
- `sample_ontology_subset` - OntologySubset for testing triple generation
- `text_processor` - TextProcessor instance for text segmentation tests
- `sample_ontology_class` - Sample OntologyClass for testing
- `sample_ontology_property` - Sample OntologyProperty for testing
Implementation Notes
These tests verify the fixes made to address:
- Disconnected graph problem - Auto-include properties feature ensures all relevant relationships are available
- Wrong URIs problem - URI expansion using ontology definitions instead of constructed fallbacks
- Dict vs object attribute problem - URI expansion works with dicts (from `cls.__dict__`) not object attributes
- Ontology visibility in KG - Ontology elements themselves appear in the knowledge graph with proper metadata
- Text segmentation - Proper sentence splitting for ontology matching using NLTK
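The dict versus object attribute problem noted above is easy to reproduce in isolation. In the sketch below, `OntologyClass` is a hypothetical stand-in (not the project's class) and the two lookup helpers are invented for contrast: once a definition has been converted with `__dict__`, only key access works.

```python
class OntologyClass:
    """Hypothetical stand-in for an ontology class definition."""

    def __init__(self, uri, label):
        self.uri = uri
        self.label = label


# Downstream code receives the plain dict form, not the object itself.
definition = OntologyClass("http://example.org/food#Recipe", "Recipe").__dict__


def lookup_uri(definition):
    """Read the URI with dict access, which works on the __dict__ form."""
    return definition.get("uri")


def lookup_uri_by_attribute(definition):
    """Attribute access raises AttributeError on a plain dict; shown for contrast."""
    return definition.uri
```

Calling `lookup_uri(definition)` returns the declared URI, while `lookup_uri_by_attribute(definition)` raises `AttributeError`. Code that swallows that error and falls back to a constructed URI is one plausible way the "wrong URIs" symptom could arise.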