trustgraph/tests/unit/test_extract/test_ontology
cybermaggedon 71517e6417
release/v2.4 -> master (#932)
* CLI auth migration, document embeddings core lifecycle (#913)

Migrate get_kg_core and put_kg_core CLI tools to use Api/SocketClient
with first-frame auth (fixes broken raw websocket path). Fix wire
format field names (root/vector). Remove ~600 lines of dead raw
websocket code from invoke_graph_rag.py.

Add document embeddings core lifecycle to the knowledge service:
list/get/put/delete/load operations across schema, translator,
Cassandra table store, knowledge manager, gateway registry, REST API,
socket client, and CLI (tg-get-de-core, tg-put-de-core).

Fix delete_kg_core to also clean up document embeddings rows.

* Remove spurious workspace parameter from SPARQL algebra evaluator (#915)

Fix threading of workspace parameter:
- The SPARQL algebra evaluator was threading a workspace parameter
  through every function and passing it to TriplesClient.query(),
  which doesn't accept it. Workspace isolation is handled by pub/sub
  topic routing — the TriplesClient is already scoped to a
  workspace-specific flow, same as GraphRAG. Passing workspace
  explicitly was both incorrect and unnecessary.

Update tests:
- tests/unit/test_query/test_sparql_algebra.py (new) — Tests
  _query_pattern, _eval_bgp, and evaluate() with various algebra
  nodes. Key tests assert workspace is never in tc.query() kwargs,
  plus correctness tests for BGP, JOIN, UNION, SLICE, DISTINCT, and
  edge cases.
- tests/unit/test_retrieval/test_graph_rag.py — Added
  test_triples_query_never_passes_workspace (checks query()) and
  test_follow_edges_never_passes_workspace (checks query_stream()).
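
The "never passes workspace" assertions described above can be sketched with unittest.mock. This is a minimal illustration, not the repository's test code: `query_pattern` is a hypothetical stand-in for the evaluator helper, and the keyword names `s`, `p`, `o` and `limit` are assumptions about the client call.

```python
import asyncio
from unittest.mock import AsyncMock


async def query_pattern(tc, s=None, p=None, o=None, limit=100):
    """Hypothetical stand-in for an evaluator helper.

    It forwards the triple pattern to the client and deliberately does
    not pass a workspace: the client is assumed to be scoped to a
    workspace-specific flow already.
    """
    return await tc.query(s=s, p=p, o=o, limit=limit)


def run_check():
    tc = AsyncMock()              # mocked triples client
    tc.query.return_value = []    # no matches are needed for this check
    asyncio.run(query_pattern(tc, s="http://example.org/a"))

    # Inspect the keyword arguments of the single query() call.
    tc.query.assert_awaited_once()
    kwargs = tc.query.await_args.kwargs
    assert "workspace" not in kwargs
    return kwargs


call_kwargs = run_check()
```

Checking the recorded kwargs, rather than the return value, is what catches a parameter that is threaded through but silently ignored.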

* Make all Cassandra and Qdrant I/O async-safe with proper concurrency controls (#916)

Cassandra triples services were using synchronous EntityCentricKnowledgeGraph
methods from async contexts, and connection state was managed with
threading.local which is wrong for asyncio coroutines sharing a single
thread. Qdrant services had no async wrapping at all, blocking the event
loop on every network call. Rows services had unprotected shared state
mutations across concurrent coroutines.

- Add async methods to EntityCentricKnowledgeGraph (async_insert,
  async_get_s/p/o/sp/po/os/spo/all, async_collection_exists,
  async_create_collection, async_delete_collection) using the existing
  cassandra_async.async_execute bridge
- Rewrite triples write + query services: replace threading.local with
  asyncio.Lock + dict cache for per-workspace connections, use async
  ECKG methods for all data operations, keep asyncio.to_thread only for
  one-time blocking ECKG construction
- Wrap all Qdrant calls in asyncio.to_thread across all 6 services
  (doc/graph/row embeddings write + query), add asyncio.Lock + set cache
  for collection existence checks
- Add asyncio.Lock to rows write + query services to protect shared
  state (schemas, sessions, config caches) from concurrent mutation
- Update all affected tests to match new async patterns
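
The asyncio.Lock + dict cache pattern described above can be sketched as follows. This is an illustration under assumptions, not the service code: `ConnectionCache` and `blocking_factory` are hypothetical names, and the factory stands in for the one-time blocking construction that is moved to a worker thread.

```python
import asyncio


class ConnectionCache:
    """Per-workspace connection cache guarded by an asyncio.Lock."""

    def __init__(self, factory):
        self._factory = factory      # blocking constructor (hypothetical)
        self._lock = asyncio.Lock()
        self._cache = {}

    async def get(self, workspace):
        async with self._lock:
            if workspace not in self._cache:
                # Run the one-time blocking construction off the event loop.
                self._cache[workspace] = await asyncio.to_thread(
                    self._factory, workspace
                )
            return self._cache[workspace]


async def demo():
    created = []

    def blocking_factory(workspace):
        created.append(workspace)
        return {"workspace": workspace}

    cache = ConnectionCache(blocking_factory)
    # Ten concurrent coroutines ask for the same workspace.
    conns = await asyncio.gather(*(cache.get("ws1") for _ in range(10)))
    return created, conns


created, conns = asyncio.run(demo())
```

Unlike threading.local, the dict is shared by all coroutines on the event loop thread, and the lock ensures only one of them constructs the connection. Holding the lock across the await serialises construction for all workspaces, which is a simplification in this sketch.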

* Fix query returning only the first page of results (#921)

The root cause: async_execute only materialises the first result
page (by design — it says so in its docstring). The streaming query
set fetch_size=20 and expected to iterate all results, but only got
the first 20 rows back.

The fix uses
  asyncio.to_thread(lambda: list(tg.session.execute(...)))
which lets the sync driver iterate all pages in a worker thread,
exactly as the pre-async code did.
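
The difference between materialising one page and iterating all pages can be illustrated with a fake paged result set. `FakePagedResult` and `FakeSession` are hypothetical stand-ins, not the Cassandra driver's classes; only the asyncio.to_thread(lambda: list(...)) shape is taken from the description above.

```python
import asyncio


class FakePagedResult:
    """Stand-in for a driver result set that fetches pages lazily."""

    def __init__(self, rows, fetch_size):
        self._rows = rows
        self._fetch_size = fetch_size

    def first_page(self):
        # What a "first page only" bridge would materialise.
        return self._rows[: self._fetch_size]

    def __iter__(self):
        # Synchronous iteration walks every page in turn.
        for start in range(0, len(self._rows), self._fetch_size):
            yield from self._rows[start : start + self._fetch_size]


class FakeSession:
    def execute(self, rows, fetch_size):
        return FakePagedResult(rows, fetch_size)


async def fetch_all(session, rows, fetch_size):
    # Drain all pages in a worker thread so the event loop is not blocked.
    return await asyncio.to_thread(
        lambda: list(session.execute(rows, fetch_size))
    )


session = FakeSession()
data = list(range(50))
first_only = session.execute(data, 20).first_page()
everything = asyncio.run(fetch_all(session, data, 20))
```

With 50 rows and a fetch size of 20, the first-page path yields 20 rows while the threaded list() call yields all 50.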

* Optional test warning suppression (#923)

* Fix test collection module errors & silence upstream Pytest warnings (#823)

* chore: add virtual environment and .env directories to gitignore

* test: filter upstream DeprecationWarning and UserWarning messages

* fix(namespace): remove empty __init__.py files to fix PEP 420 implicit namespace routing for trustgraph sub-packages

* Revert __init__.py deletions

* Add .ini changes but commented out, will be useful at times

---------

Co-authored-by: Salil M <d2kyt@protonmail.com>

* fix(openai): fail fast on unrecoverable RateLimitError codes (#901) (#904) (#925)

Co-authored-by: Sahil Yadav <sahilyadav.sy2004@gmail.com>

* Ensure retry exception is properly raised (#926)

* fix: library API get/update document round-trip bugs (#893) (#928)

Fix 5 cascading bugs in the Library API wrapper that prevented
the get_documents → update_document round-trip from working:

- Tolerate missing title field in document metadata (use .get())
- Use attribute access on Triple objects instead of subscript
- Serialize datetime to int seconds for JSON compatibility
- Handle empty server response on successful update
- Send both id and document-id keys in update request

Added library API tests

* Fix ontology selector defaults, add bypass mode, enforce domain/range (#929)

- Align similarity_threshold default to 0.3 everywhere (class signature
  had stale 0.7). Fix matching contradiction in tech-spec.
- Add bypass_selector_below parameter (default 5) to skip vector
  similarity selection when ontology element count is small enough.
- Enforce domain/range constraints in TripleConverter for object
  properties and datatype properties, with subclass hierarchy support.
  Properties with no declared domain/range pass through unchanged.
- Add unit tests for domain/range validation, subclass acceptance,
  polymorphic pass-through, and selector bypass.

Fixes #908, #920
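
A minimal sketch of the domain check and the bypass rule, under stated assumptions: the function names, the single-parent `parents` map, and the strict "less than" comparison for bypass_selector_below are all illustrative choices, not the TripleConverter or selector implementation.

```python
def is_subclass_of(cls, ancestor, parents):
    """True if cls equals ancestor or reaches it via the parents map."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls == ancestor:
            return True
        seen.add(cls)
        cls = parents.get(cls)   # assumes at most one parent per class
    return False


def domain_ok(subject_type, prop, domains, parents):
    """Accept when no domain is declared, or the subject type fits it."""
    declared = domains.get(prop)
    if declared is None:
        return True              # no constraint: pass through unchanged
    return is_subclass_of(subject_type, declared, parents)


def should_bypass_selector(element_count, bypass_selector_below=5):
    """Skip similarity selection for small ontologies (assumed strict <)."""
    return element_count < bypass_selector_below


# Hypothetical ontology fragment: Dessert is a subclass of Recipe.
parents = {"Dessert": "Recipe"}
domains = {"serves": "Recipe"}
```

A range check would follow the same shape, comparing the object's type against the declared range.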

* Close producers on flow stop to prevent stale non-persistent topics (#930)

Flow.stop() only stopped consumers, leaving response producers
connected to non-persistent Pulsar topics. After flow restart, the
orphaned producers held stale broker routing state, causing response
messages to never reach new consumers — manifesting as 120s timeouts
on document-embeddings and similar RPC paths.

Fix: Flow.stop() now explicitly stops all producers. Producer.stop()
closes the underlying Pulsar producer connection rather than just
setting a flag.

Fixes #906
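
The shape of the fix can be sketched with stand-in classes. Everything here is hypothetical and synchronous for brevity: `FakeBrokerProducer` replaces the real Pulsar producer, and the `Producer` and `Flow` classes below only mirror the behaviour described above, not the actual TrustGraph code.

```python
class FakeBrokerProducer:
    """Stand-in for a broker client producer; records close() calls."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class FakeConsumer:
    def __init__(self):
        self.running = True

    def stop(self):
        self.running = False


class Producer:
    def __init__(self, broker_producer):
        self._producer = broker_producer
        self.running = True

    def stop(self):
        # Clear the flag AND release the broker connection.
        self.running = False
        if self._producer is not None:
            self._producer.close()
            self._producer = None


class Flow:
    def __init__(self, consumers, producers):
        self.consumers = consumers
        self.producers = producers

    def stop(self):
        for consumer in self.consumers:
            consumer.stop()
        # Previously omitted: producers must be stopped as well.
        for producer in self.producers:
            producer.stop()


broker = FakeBrokerProducer()
flow = Flow([FakeConsumer()], [Producer(broker)])
flow.stop()
```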

* fix(gateway): propagate --timeout flag to per-service dispatchers (#931)

The api-gateway accepts a --timeout flag (default 600s) but the value
was not propagated into DispatcherManager, which hard-coded
timeout=120 for every per-service dispatcher (graph-rag, document-rag,
text-completion, embeddings, librarian, etc.).

This meant any synchronous request taking more than 120 seconds would
always return a Timeout error at the 120s mark, regardless of the
--timeout value set on the gateway.

Changes:
- Add timeout parameter to DispatcherManager.__init__ (default: 120
  for backward compatibility)
- Store self.timeout in DispatcherManager
- Replace both hardcoded timeout=120 with self.timeout in
  invoke_global_service and invoke_flow_service
- Pass self.timeout from Api to DispatcherManager in service.py
- Document the timeout parameter in the docstring

Fixes #894
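
The propagation described above can be sketched like this. The class and method names follow the commit text, but the bodies are simplified placeholders: the real methods dispatch requests, whereas these only report which timeout would be used.

```python
class DispatcherManager:
    def __init__(self, timeout=120):
        # Default kept at 120 for backward compatibility.
        self.timeout = timeout

    def invoke_global_service(self, kind):
        # Placeholder body: the real method dispatches a request.
        return {"service": kind, "timeout": self.timeout}

    def invoke_flow_service(self, flow, kind):
        return {"flow": flow, "service": kind, "timeout": self.timeout}


class Api:
    def __init__(self, timeout=600):
        self.timeout = timeout
        # Pass the gateway-level value down instead of relying on 120.
        self.dispatcher_manager = DispatcherManager(timeout=self.timeout)


api = Api(timeout=300)
result = api.dispatcher_manager.invoke_flow_service("default", "graph-rag")
```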

---------

Co-authored-by: Salil M <d2kyt@protonmail.com>
Co-authored-by: Sahil Yadav <sahilyadav.sy2004@gmail.com>
Co-authored-by: Mister Lobster <jlaportebot@gmail.com>
2026-05-18 09:46:58 +01:00
__init__.py Ontology extraction tests (#560) 2025-11-13 20:02:12 +00:00
README.md Ontology extraction tests (#560) 2025-11-13 20:02:12 +00:00
test_embedding_and_similarity.py Ontology extraction tests (#560) 2025-11-13 20:02:12 +00:00
test_entity_contexts.py Merge 2.0 to master (#651) 2026-02-28 11:03:14 +00:00
test_extract_with_simplified_format.py release/v2.4 -> master (#844) 2026-04-22 15:19:57 +01:00
test_ontology_loading.py Ontology extraction tests (#560) 2025-11-13 20:02:12 +00:00
test_ontology_selector.py Ontology extraction tests (#560) 2025-11-13 20:02:12 +00:00
test_ontology_triples.py Merge 2.0 to master (#651) 2026-02-28 11:03:14 +00:00
test_prompt_and_extraction.py test(ontology): harden domain/range validation + add missing tests (#848) 2026-04-28 16:33:49 +01:00
test_text_processing.py Ontology extraction tests (#560) 2025-11-13 20:02:12 +00:00
test_triple_converter_validation.py release/v2.4 -> master (#932) 2026-05-18 09:46:58 +01:00
test_uri_expansion.py Ontology extraction tests (#560) 2025-11-13 20:02:12 +00:00

Ontology Extractor Unit Tests

Comprehensive unit tests for the OntoRAG ontology extraction system.

Test Coverage

1. test_ontology_selector.py - Auto-Include Properties Feature

Tests the critical dependency resolution that automatically includes all properties related to selected classes.

Key Tests:

  • test_auto_include_properties_for_recipe_class - Verifies Recipe class auto-includes ingredients, method, produces, serves
  • test_auto_include_properties_for_ingredient_class - Verifies Ingredient class auto-includes food property
  • test_auto_include_properties_for_range_class - Tests properties are included when class appears in range
  • test_auto_include_adds_domain_and_range_classes - Ensures related classes are added too
  • test_multiple_classes_get_all_related_properties - Tests combining multiple class selections
  • test_no_duplicate_properties_added - Ensures properties aren't duplicated
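
The auto-include behaviour these tests exercise can be sketched as below. This is an illustration only: the `auto_include` function, the dict layout, and the ranges assigned to each property are assumptions, not the selector's actual data model.

```python
# Hypothetical property table: name -> domain and range class.
PROPERTIES = {
    "ingredients": {"domain": "Recipe", "range": "Ingredient"},
    "method": {"domain": "Recipe", "range": "Method"},
    "food": {"domain": "Ingredient", "range": "Food"},
}


def auto_include(selected_classes, properties):
    """Add every property whose domain or range is a selected class,
    plus the classes at the other end of those properties."""
    selected = set(selected_classes)
    classes = set(selected)
    included = {}
    for name, prop in properties.items():
        if prop["domain"] in selected or prop["range"] in selected:
            included[name] = prop      # dict keys prevent duplicates
            classes.add(prop["domain"])
            classes.add(prop["range"])
    return classes, included


recipe_classes, recipe_props = auto_include({"Recipe"}, PROPERTIES)
food_classes, food_props = auto_include({"Food"}, PROPERTIES)
```

Selecting only Recipe pulls in its properties and the classes they point to, which is what keeps the extracted graph connected.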

2. test_uri_expansion.py - URI Expansion

Tests that URIs are properly expanded using ontology definitions instead of constructed fallback URIs.

Key Tests:

  • test_expand_class_uri_from_ontology - Class names expand to ontology URIs
  • test_expand_object_property_uri_from_ontology - Object properties use ontology URIs
  • test_expand_datatype_property_uri_from_ontology - Datatype properties use ontology URIs
  • test_expand_rdf_prefix - Standard RDF prefixes expand correctly
  • test_expand_rdfs_prefix, test_expand_owl_prefix, test_expand_xsd_prefix - Other standard prefixes
  • test_fallback_uri_for_instance - Entity instances get constructed URIs
  • test_already_full_uri_unchanged - Full URIs pass through
  • test_dict_access_not_object_attribute - Critical test verifying dict access works (not object attributes)
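
The expansion order these tests imply can be sketched as: full URIs pass through, ontology definitions win next, standard prefixes follow, and anything else gets a constructed fallback. The `expand_uri` function, the subset layout, the "uri" key, and the fallback base URL are assumptions made for illustration. The four namespace URIs are the standard W3C ones.

```python
from urllib.parse import quote

PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def expand_uri(value, subset, fallback_base="https://example.org/entity/"):
    # 1. Already a full URI: leave unchanged.
    if value.startswith(("http://", "https://")):
        return value
    # 2. Defined in the ontology: use its URI (dict access, not attributes).
    for section in ("classes", "object_properties", "datatype_properties"):
        definition = subset.get(section, {}).get(value)
        if definition and "uri" in definition:
            return definition["uri"]
    # 3. Standard prefix such as rdf: or xsd:.
    prefix, sep, local = value.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    # 4. Entity instance: construct a fallback URI.
    return fallback_base + quote(value.strip().lower().replace(" ", "-"))


subset = {
    "classes": {"Recipe": {"uri": "http://example.org/food#Recipe"}},
    "object_properties": {
        "ingredients": {"uri": "http://example.org/food#ingredients"}
    },
}
```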

3. test_ontology_triples.py - Ontology Triple Generation

Tests that ontology elements (classes and properties) are properly converted to RDF triples with labels, comments, domains, and ranges.

Key Tests:

  • test_generates_class_type_triples - Classes get rdf:type owl:Class triples
  • test_generates_class_labels - Classes get rdfs:label triples
  • test_generates_class_comments - Classes get rdfs:comment triples
  • test_generates_object_property_type_triples - Object properties get proper type triples
  • test_generates_object_property_labels - Object properties get labels
  • test_generates_object_property_domain - Object properties get rdfs:domain triples
  • test_generates_object_property_range - Object properties get rdfs:range triples
  • test_generates_datatype_property_type_triples - Datatype properties get proper type triples
  • test_generates_datatype_property_range - Datatype properties get XSD type ranges
  • test_uses_dict_field_names_not_rdf_names - Critical test verifying dict field names work
  • test_total_triple_count_is_reasonable - Validates expected number of triples
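
A minimal sketch of turning ontology elements into triples, assuming each element is a plain dict with "uri", "label", "comment", "domain" and "range" keys. Those field names and the `ontology_triples` function are illustrative assumptions. The sketch uses plain tuples rather than the project's Triple objects.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Dict field name -> RDF predicate it is emitted as.
FIELD_PREDICATES = (
    ("label", RDFS + "label"),
    ("comment", RDFS + "comment"),
    ("domain", RDFS + "domain"),
    ("range", RDFS + "range"),
)


def ontology_triples(subset):
    triples = []

    def describe(element, type_uri):
        uri = element["uri"]
        triples.append((uri, RDF_TYPE, type_uri))
        for field, predicate in FIELD_PREDICATES:
            if element.get(field):      # optional fields are skipped
                triples.append((uri, predicate, element[field]))

    for cls in subset.get("classes", {}).values():
        describe(cls, OWL + "Class")
    for prop in subset.get("object_properties", {}).values():
        describe(prop, OWL + "ObjectProperty")
    for prop in subset.get("datatype_properties", {}).values():
        describe(prop, OWL + "DatatypeProperty")
    return triples


FOOD = "http://example.org/food#"
subset = {
    "classes": {
        "Recipe": {"uri": FOOD + "Recipe", "label": "Recipe",
                   "comment": "A recipe"},
        "Ingredient": {"uri": FOOD + "Ingredient", "label": "Ingredient"},
    },
    "object_properties": {
        "ingredients": {"uri": FOOD + "ingredients", "label": "ingredients",
                        "domain": FOOD + "Recipe",
                        "range": FOOD + "Ingredient"},
    },
    "datatype_properties": {
        "serves": {"uri": FOOD + "serves", "label": "serves",
                   "domain": FOOD + "Recipe", "range": XSD + "string"},
    },
}
triples = ontology_triples(subset)
```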

4. test_text_processing.py - Text Processing and Segmentation

Tests that text is properly split into sentences for ontology matching, including NLTK tokenization and TextSegment creation.

Key Tests:

  • test_segment_single_sentence - Single sentence produces one segment
  • test_segment_multiple_sentences - Multiple sentences split correctly
  • test_segment_positions - Segment start/end positions are correct
  • test_segment_complex_punctuation - Handles abbreviations (Dr., U.S.A., Mr.)
  • test_segment_question_and_exclamation - Different sentence terminators
  • test_segment_preserves_original_text - Segments can reconstruct original
  • test_text_segment_non_overlapping - Segments don't overlap
  • test_nltk_punkt_availability - NLTK tokenizer is available
  • test_unicode_text - Handles unicode characters
  • test_quoted_text - Handles quoted text correctly
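
The position and non-overlap properties checked above can be illustrated with a small segmenter. Note the substitution: this sketch uses a naive regular expression rather than NLTK's punkt tokenizer, so it does not handle abbreviations such as "Dr." correctly. The `TextSegment` fields shown are assumed, not the project's definition.

```python
import re
from dataclasses import dataclass


@dataclass
class TextSegment:
    text: str
    start: int   # offset of the first character in the source text
    end: int     # offset one past the last character


# A run of non-terminators followed by terminators, or a trailing
# fragment with no terminator at all.
_SENTENCE = re.compile(r"[^.!?]+[.!?]+|[^.!?]+$")


def segment(text):
    segments = []
    for match in _SENTENCE.finditer(text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Skip leading whitespace so offsets point at the sentence itself.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        segments.append(TextSegment(stripped, start, start + len(stripped)))
    return segments


source = "Mix the flour. Add eggs! Is it done?"
segments = segment(source)
```

Because each segment stores offsets into the source, slicing the source with `start` and `end` returns the segment text exactly.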

5. test_prompt_and_extraction.py - LLM Prompt Construction and Triple Extraction

Tests that the system correctly constructs prompts with ontology constraints and extracts/validates triples from LLM responses.

Key Tests:

  • test_build_extraction_variables_includes_text - Prompt includes input text
  • test_build_extraction_variables_includes_classes - Prompt includes ontology classes
  • test_build_extraction_variables_includes_properties - Prompt includes properties
  • test_validates_rdf_type_triple_with_valid_class - Validates rdf:type against ontology
  • test_rejects_rdf_type_triple_with_invalid_class - Rejects invalid classes
  • test_validates_object_property_triple - Validates object properties
  • test_rejects_unknown_property - Rejects properties not in ontology
  • test_parse_simple_triple_dict - Parses triple from dict format
  • test_filters_invalid_triples - Filters out invalid triples
  • test_expands_uris_in_parsed_triples - Expands URIs using ontology
  • test_creates_proper_triple_objects - Creates Triple objects with Value subjects/predicates/objects
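
The validate-and-filter step can be sketched as below, assuming the LLM response has already been parsed into dicts with "subject", "predicate" and "object" keys. That dict shape, the function names, and the subset layout are assumptions for illustration only.

```python
def is_valid(triple, subset):
    """Accept rdf:type triples naming a known class, and triples whose
    predicate is a property defined in the ontology subset."""
    predicate = triple.get("predicate")
    obj = triple.get("object")
    if not triple.get("subject") or not predicate or obj is None:
        return False
    if predicate == "rdf:type":
        return obj in subset["classes"]
    return (
        predicate in subset["object_properties"]
        or predicate in subset["datatype_properties"]
    )


def filter_triples(raw_triples, subset):
    return [t for t in raw_triples if is_valid(t, subset)]


subset = {
    "classes": {"Recipe": {}, "Ingredient": {}},
    "object_properties": {"ingredients": {}},
    "datatype_properties": {"serves": {}},
}
raw = [
    {"subject": "pie", "predicate": "rdf:type", "object": "Recipe"},
    {"subject": "pie", "predicate": "rdf:type", "object": "Spaceship"},
    {"subject": "pie", "predicate": "ingredients", "object": "apple"},
    {"subject": "pie", "predicate": "invented_by", "object": "someone"},
]
kept = filter_triples(raw, subset)
```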

6. test_embedding_and_similarity.py - Ontology Embedding and Similarity Matching

Tests that ontology elements are properly embedded and matched against input text using vector similarity.

Key Tests:

  • test_create_text_from_class_with_id - Text representation includes class ID
  • test_create_text_from_class_with_labels - Includes labels in text
  • test_create_text_from_class_with_comment - Includes comments in text
  • test_create_text_from_property_with_domain_range - Includes domain/range in property text
  • test_normalizes_id_with_underscores - Normalizes IDs (underscores to spaces)
  • test_includes_subclass_info_for_classes - Includes subclass relationships
  • test_vector_store_api_structure - Vector store has expected API
  • test_selector_handles_text_segments - Selector processes text segments
  • test_merge_subsets_combines_elements - Merging combines ontology elements
  • test_ontology_element_metadata_structure - Metadata structure is correct
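
A sketch of how a class might be turned into text for embedding and then compared by cosine similarity. The field names ("labels", "comment", "subclass_of") and both functions are assumptions. The real embedder's text format may differ.

```python
import math


def text_for_class(class_id, definition):
    """Build the text that would be embedded for an ontology class."""
    parts = [class_id.replace("_", " ")]          # normalise the ID
    parts.extend(definition.get("labels", []))
    if definition.get("comment"):
        parts.append(definition["comment"])
    if definition.get("subclass_of"):
        parts.append("subclass of " + definition["subclass_of"].replace("_", " "))
    return " ".join(parts)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


text = text_for_class(
    "Main_Course",
    {"labels": ["Main course"], "comment": "The primary dish",
     "subclass_of": "Recipe"},
)
```

An element would be selected for a text segment when the similarity between their embeddings exceeds a configured threshold.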

Running the Tests

Run all ontology extractor tests:

cd /path/to/trustgraph
pytest tests/unit/test_extract/test_ontology/ -v

Run a specific test file:

pytest tests/unit/test_extract/test_ontology/test_ontology_selector.py -v
pytest tests/unit/test_extract/test_ontology/test_uri_expansion.py -v
pytest tests/unit/test_extract/test_ontology/test_ontology_triples.py -v
pytest tests/unit/test_extract/test_ontology/test_text_processing.py -v
pytest tests/unit/test_extract/test_ontology/test_prompt_and_extraction.py -v
pytest tests/unit/test_extract/test_ontology/test_embedding_and_similarity.py -v

Run a specific test:

pytest tests/unit/test_extract/test_ontology/test_ontology_selector.py::TestOntologySelector::test_auto_include_properties_for_recipe_class -v

Run with coverage:

pytest tests/unit/test_extract/test_ontology/ --cov=trustgraph.extract.kg.ontology --cov-report=html

Test Fixtures

  • sample_ontology - Complete Food Ontology with Recipe, Ingredient, Food, Method classes
  • ontology_loader_with_sample - Mock OntologyLoader with the sample ontology
  • ontology_embedder - Mock embedder for testing
  • mock_embedding_service - Mock service for generating deterministic embeddings
  • vector_store - InMemoryVectorStore for testing
  • extractor - Processor instance for URI expansion tests
  • ontology_subset_with_uris - OntologySubset with proper URIs defined
  • sample_ontology_subset - OntologySubset for testing triple generation
  • text_processor - TextProcessor instance for text segmentation tests
  • sample_ontology_class - Sample OntologyClass for testing
  • sample_ontology_property - Sample OntologyProperty for testing
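
One way a deterministic mock embedding could work is to derive the vector from a hash of the text, so the same input always yields the same vector without calling a model. This is a hypothetical sketch, not the fixture's actual implementation. `deterministic_embedding` and its `dim` parameter are illustrative names.

```python
import hashlib


def deterministic_embedding(text, dim=8):
    """Map text to a fixed vector using SHA-256 (dim must be <= 32)."""
    if not 0 < dim <= 32:
        raise ValueError("dim must be between 1 and 32")
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Scale each byte into the range [0, 1].
    return [byte / 255.0 for byte in digest[:dim]]


vec_a = deterministic_embedding("Recipe")
vec_b = deterministic_embedding("Recipe")
vec_c = deterministic_embedding("Ingredient")
```

Hash-derived vectors carry no semantic meaning, so they suit tests of plumbing and structure rather than tests of ranking quality.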

Implementation Notes

These tests verify the fixes made to address:

  1. Disconnected graph problem - Auto-include properties feature ensures all relevant relationships are available
  2. Wrong URIs problem - URI expansion using ontology definitions instead of constructed fallbacks
  3. Dict vs object attribute problem - URI expansion works with dicts (from cls.__dict__), not object attributes
  4. Ontology visibility in KG - Ontology elements themselves appear in the knowledge graph with proper metadata
  5. Text segmentation - Proper sentence splitting for ontology matching using NLTK