This specification describes the integration of TrustGraph with structured data flows, enabling the system to work with data that can be represented as rows in tables or objects in object stores. The integration supports four primary use cases:
1.**Unstructured to Structured Extraction**: Read unstructured data sources, identify and extract object structures, and store them in a tabular format
2.**Structured Data Ingestion**: Load data that is already in structured formats directly into the structured store alongside extracted data
3.**Natural Language Querying**: Convert natural language questions into structured queries to extract matching data from the store
4.**Direct Structured Querying**: Execute structured queries directly against the data store for precise data retrieval
## Goals
- **Unified Data Access**: Provide a single interface for accessing both structured and unstructured data within TrustGraph
- **Seamless Integration**: Enable smooth interoperability between TrustGraph's graph-based knowledge representation and traditional structured data formats
- **Flexible Extraction**: Support automatic extraction of structured data from various unstructured sources (documents, text, etc.)
- **Query Versatility**: Allow users to query data using both natural language and structured query languages
- **Data Consistency**: Maintain data integrity and consistency across different data representations
- **Performance Optimization**: Ensure efficient storage and retrieval of structured data at scale
- **Schema Flexibility**: Support both schema-on-write and schema-on-read approaches to accommodate diverse data sources
- **Backwards Compatibility**: Preserve existing TrustGraph functionality while adding structured data capabilities
## Background
TrustGraph currently excels at processing unstructured data and building knowledge graphs from diverse sources. However, many enterprise use cases involve data that is inherently structured - customer records, transaction logs, inventory databases, and other tabular datasets. These structured datasets often need to be analyzed alongside unstructured content to provide comprehensive insights.
Current limitations include:
- No native support for ingesting pre-structured data formats (CSV, JSON arrays, database exports)
- Inability to preserve the inherent structure when extracting tabular data from documents
- Lack of efficient querying mechanisms for structured data patterns
- Missing bridge between SQL-like queries and TrustGraph's graph queries
This specification addresses these gaps by introducing a structured data layer that complements TrustGraph's existing capabilities. By supporting structured data natively, TrustGraph can:
- Serve as a unified platform for both structured and unstructured data analysis
- Enable hybrid queries that span both graph relationships and tabular data
- Provide familiar interfaces for users accustomed to working with structured data
- Unlock new use cases in data integration and business intelligence
## Technical Design
### Architecture
The structured data integration requires the following technical components:
1.**NLP-to-Structured-Query Service**
- Converts natural language questions into structured queries
- Supports multiple query language targets (initially SQL-like syntax)
- Integrates with existing TrustGraph NLP capabilities
- Extended configuration system to store structured data schemas
- Support for defining table structures, field types, and relationships
- Schema versioning and migration capabilities
3.**Object Extraction Module** ✅ **[COMPLETE]**
- Enhanced knowledge extractor flow integration
- Identifies and extracts structured objects from unstructured sources
- Maintains provenance and confidence scores
- Registers a config handler (example: trustgraph-flow/trustgraph/prompt/template/service.py) to receive config data and decode schema information
- Receives objects and decodes them to ExtractedObject objects for delivery on the Pulsar queue
- NOTE: There's existing code at `trustgraph-flow/trustgraph/extract/object/row/`. This was a previous attempt and will need to be majorly refactored as it doesn't conform to current APIs. Use it if it's useful, start from scratch if not.
- Requires a command-line interface: `kg-extract-objects`