mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-30 10:56:23 +02:00
Merge branch 'release/v1.2'
This commit is contained in:
commit
0bff629f87
28 changed files with 3881 additions and 111 deletions
127
docs/tech-specs/__TEMPLATE.md
Normal file
|
|
@ -0,0 +1,127 @@
|
|||
# Command-Line Loading Knowledge Technical Specification
|
||||
|
||||
## Overview
|
||||
|
||||
This specification describes the command-line interfaces for loading knowledge into TrustGraph, enabling users to ingest data from various sources through command-line tools. The integration supports four primary use cases:
|
||||
|
||||
1. **[Use Case 1]**: [Description]
|
||||
2. **[Use Case 2]**: [Description]
|
||||
3. **[Use Case 3]**: [Description]
|
||||
4. **[Use Case 4]**: [Description]
|
||||
|
||||
## Goals
|
||||
|
||||
- **[Goal 1]**: [Description]
|
||||
- **[Goal 2]**: [Description]
|
||||
- **[Goal 3]**: [Description]
|
||||
- **[Goal 4]**: [Description]
|
||||
- **[Goal 5]**: [Description]
|
||||
- **[Goal 6]**: [Description]
|
||||
- **[Goal 7]**: [Description]
|
||||
- **[Goal 8]**: [Description]
|
||||
|
||||
## Background
|
||||
|
||||
[Describe the current state and limitations that this specification addresses]
|
||||
|
||||
Current limitations include:
|
||||
- [Limitation 1]
|
||||
- [Limitation 2]
|
||||
- [Limitation 3]
|
||||
- [Limitation 4]
|
||||
|
||||
This specification addresses these gaps by [description]. By [capability], TrustGraph can:
|
||||
- [Benefit 1]
|
||||
- [Benefit 2]
|
||||
- [Benefit 3]
|
||||
- [Benefit 4]
|
||||
|
||||
## Technical Design
|
||||
|
||||
### Architecture
|
||||
|
||||
The command-line knowledge loading requires the following technical components:
|
||||
|
||||
1. **[Component 1]**
|
||||
- [Description of component functionality]
|
||||
- [Key features]
|
||||
- [Integration points]
|
||||
|
||||
Module: [module-path]
|
||||
|
||||
2. **[Component 2]**
|
||||
- [Description of component functionality]
|
||||
- [Key features]
|
||||
- [Integration points]
|
||||
|
||||
Module: [module-path]
|
||||
|
||||
3. **[Component 3]**
|
||||
- [Description of component functionality]
|
||||
- [Key features]
|
||||
- [Integration points]
|
||||
|
||||
Module: [module-path]
|
||||
|
||||
### Data Models
|
||||
|
||||
#### [Data Model 1]
|
||||
|
||||
[Description of data model and structure]
|
||||
|
||||
Example:
|
||||
```
|
||||
[Example data structure]
|
||||
```
|
||||
|
||||
This approach allows:
|
||||
- [Benefit 1]
|
||||
- [Benefit 2]
|
||||
- [Benefit 3]
|
||||
- [Benefit 4]
|
||||
|
||||
### APIs
|
||||
|
||||
New APIs:
|
||||
- [API description 1]
|
||||
- [API description 2]
|
||||
- [API description 3]
|
||||
|
||||
Modified APIs:
|
||||
- [Modified API 1] - [Description of changes]
|
||||
- [Modified API 2] - [Description of changes]
|
||||
|
||||
### Implementation Details
|
||||
|
||||
[Implementation approach and conventions]
|
||||
|
||||
[Additional implementation notes]
|
||||
|
||||
## Security Considerations
|
||||
|
||||
[Security considerations specific to this implementation]
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
[Performance considerations and potential bottlenecks]
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
[Testing approach and strategy]
|
||||
|
||||
## Migration Plan
|
||||
|
||||
[Migration strategy if applicable]
|
||||
|
||||
## Timeline
|
||||
|
||||
[Timeline information if specified]
|
||||
|
||||
## Open Questions
|
||||
|
||||
- [Open question 1]
|
||||
- [Open question 2]
|
||||
|
||||
## References
|
||||
|
||||
[References if applicable]
|
||||
106
docs/tech-specs/architecture-principles.md
Normal file
|
|
@ -0,0 +1,106 @@
|
|||
# Knowledge Graph Architecture Foundations
|
||||
|
||||
## Foundation 1: Subject-Predicate-Object (SPO) Graph Model
|
||||
**Decision**: Adopt SPO/RDF as the core knowledge representation model
|
||||
|
||||
**Rationale**:
|
||||
- Provides maximum flexibility and interoperability with existing graph technologies
|
||||
- Enables seamless translation to other graph query languages (e.g., SPO → Cypher, but not vice versa)
|
||||
- Creates a foundation that "unlocks a lot" of downstream capabilities
|
||||
- Supports both node-to-node relationships (SPO) and node-to-literal relationships (RDF)
|
||||
|
||||
**Implementation**:
|
||||
- Core data structure: `node → edge → {node | literal}`
|
||||
- Maintain compatibility with RDF standards while supporting extended SPO operations
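
The `node → edge → {node | literal}` structure above could be sketched as follows. This is a minimal illustration, not TrustGraph's actual types: the class names `Node`, `Literal`, and `Triple` and the `ex:` identifiers are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Node:
    """A graph node identified by a URI (hypothetical type)."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A plain value such as a string or number (hypothetical type)."""
    value: str


@dataclass(frozen=True)
class Triple:
    """One edge: subject node -> predicate -> object (node or literal)."""
    subject: Node
    predicate: Node
    obj: Union[Node, Literal]


# Node-to-node relationship
works_at = Triple(Node("ex:alice"), Node("ex:worksAt"), Node("ex:acme"))

# Node-to-literal relationship
label = Triple(Node("ex:alice"), Node("rdfs:label"), Literal("Alice"))
```

Keeping the object as a tagged union makes the node/literal distinction explicit at the type level, which is what allows a later translation step to treat the two cases differently.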
|
||||
|
||||
## Foundation 2: LLM-Native Knowledge Graph Integration
|
||||
**Decision**: Optimize knowledge graph structure and operations for LLM interaction
|
||||
|
||||
**Rationale**:
|
||||
- Primary use case involves LLMs interfacing with knowledge graphs
|
||||
- Graph technology choices must prioritize LLM compatibility over other considerations
|
||||
- Enables natural language processing workflows that leverage structured knowledge
|
||||
|
||||
**Implementation**:
|
||||
- Design graph schemas that LLMs can effectively reason about
|
||||
- Optimize for common LLM interaction patterns
|
||||
|
||||
## Foundation 3: Embedding-Based Graph Navigation
|
||||
**Decision**: Implement direct mapping from natural language queries to graph nodes via embeddings
|
||||
|
||||
**Rationale**:
|
||||
- Enables the simplest possible path from NLP query to graph navigation
|
||||
- Avoids complex intermediate query generation steps
|
||||
- Provides efficient semantic search capabilities within the graph structure
|
||||
|
||||
**Implementation**:
|
||||
- `NLP Query → Graph Embeddings → Graph Nodes`
|
||||
- Maintain embedding representations for all graph entities
|
||||
- Support direct semantic similarity matching for query resolution
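
The `NLP Query → Graph Embeddings → Graph Nodes` path can be illustrated with a small nearest-neighbour lookup. The sketch below assumes embeddings are plain lists of floats and uses cosine similarity; the function names, the toy vectors, and the `ex:` node identifiers are all hypothetical, and a real deployment would likely use a vector store rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_nodes(query_vec, node_vecs, k=2):
    """Return the k node identifiers whose embeddings best match the query."""
    ranked = sorted(
        node_vecs,
        key=lambda node: cosine(query_vec, node_vecs[node]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings keyed by graph node identifier (illustrative values only)
node_vecs = {
    "ex:paris": [0.9, 0.1, 0.0],
    "ex:london": [0.8, 0.3, 0.1],
    "ex:banana": [0.0, 0.1, 0.9],
}

query_vec = [1.0, 0.1, 0.0]  # stands in for the embedded user query
top = nearest_nodes(query_vec, node_vecs, k=2)
```

The returned node identifiers are the entry points for graph traversal, so no intermediate query language needs to be generated.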
|
||||
|
||||
## Foundation 4: Distributed Entity Resolution with Deterministic Identifiers
|
||||
**Decision**: Support parallel knowledge extraction with deterministic entity identification (80% rule)
|
||||
|
||||
**Rationale**:
|
||||
- **Ideal**: Single-process extraction with complete state visibility enables perfect entity resolution
|
||||
- **Reality**: Scalability requirements demand parallel processing capabilities
|
||||
- **Compromise**: Design for deterministic entity identification across distributed processes
|
||||
|
||||
**Implementation**:
|
||||
- Develop mechanisms for generating consistent, unique identifiers across different knowledge extractors
|
||||
- Same entity mentioned in different processes must resolve to the same identifier
|
||||
- Acknowledge that ~20% of edge cases may require alternative processing models
|
||||
- Design fallback mechanisms for complex entity resolution scenarios
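
One possible way to obtain consistent identifiers across independent extractors is to derive them from a normalised form of the entity name rather than from any shared state. The sketch below is an assumption-laden illustration: the namespace URL, the normalisation rules, and the 16-character truncation are hypothetical choices, not something this document specifies.

```python
import hashlib
import unicodedata


def entity_id(name: str, entity_type: str,
              namespace: str = "https://example.org/e/") -> str:
    """Derive an identifier from the entity's type and normalised name.

    Two processes that see the same surface form (modulo case and
    whitespace) produce the same identifier without coordinating.
    """
    normalised = unicodedata.normalize("NFKC", name).casefold()
    normalised = " ".join(normalised.split())  # collapse runs of whitespace
    key = f"{entity_type}\x00{normalised}".encode("utf-8")
    return namespace + hashlib.sha256(key).hexdigest()[:16]
```

This only resolves mentions that normalise to the same string; aliases and abbreviations ("IBM" vs "International Business Machines") fall into the edge cases that need a fallback mechanism.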
|
||||
|
||||
## Foundation 5: Event-Driven Architecture with Publish-Subscribe
|
||||
**Decision**: Implement pub-sub messaging system for system coordination
|
||||
|
||||
**Rationale**:
|
||||
- Enables loose coupling between knowledge extraction, storage, and query components
|
||||
- Supports real-time updates and notifications across the system
|
||||
- Facilitates scalable, distributed processing workflows
|
||||
|
||||
**Implementation**:
|
||||
- Message-driven coordination between system components
|
||||
- Event streams for knowledge updates, extraction completion, and query results
|
||||
|
||||
## Foundation 6: Reentrant Agent Communication
|
||||
**Decision**: Support reentrant pub-sub operations for agent-based processing
|
||||
|
||||
**Rationale**:
|
||||
- Enables sophisticated agent workflows where agents can trigger and respond to each other
|
||||
- Supports complex, multi-step knowledge processing pipelines
|
||||
- Allows for recursive and iterative processing patterns
|
||||
|
||||
**Implementation**:
|
||||
- Pub-sub system must handle reentrant calls safely
|
||||
- Agent coordination mechanisms that prevent infinite loops
|
||||
- Support for agent workflow orchestration
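
A hop counter carried on each message is one simple way to let handlers publish while they are being invoked without risking an endless ping-pong. The in-process `Bus` below is a hypothetical stand-in for a real pub-sub system, written only to show the idea; the class, its methods, and the `max_hops` parameter are assumptions.

```python
from collections import defaultdict, deque


class Bus:
    """Toy in-process pub-sub bus with a hop limit (illustrative only)."""

    def __init__(self, max_hops=3):
        self.max_hops = max_hops
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.dropped = 0

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, payload, hops=0):
        # Reentrant-safe: publishing only enqueues, it never calls handlers
        if hops > self.max_hops:
            self.dropped += 1
            return
        self.queue.append((topic, payload, hops))

    def run(self):
        """Deliver queued messages until the queue is empty."""
        delivered = 0
        while self.queue:
            topic, payload, hops = self.queue.popleft()
            for handler in self.handlers[topic]:
                handler(self, payload, hops)
                delivered += 1
        return delivered


def ping(bus, payload, hops):
    bus.publish("pong", payload, hops + 1)


def pong(bus, payload, hops):
    bus.publish("ping", payload, hops + 1)


bus = Bus(max_hops=3)
bus.subscribe("ping", ping)
bus.subscribe("pong", pong)
bus.publish("ping", "hello")
delivered = bus.run()
```

Because `publish` only enqueues, a handler can publish from inside its own invocation without recursion, and the hop limit bounds the chain that two mutually triggering agents can produce.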
|
||||
|
||||
## Foundation 7: Columnar Data Store Integration
|
||||
**Decision**: Ensure query compatibility with columnar storage systems
|
||||
|
||||
**Rationale**:
|
||||
- Enables efficient analytical queries over large knowledge datasets
|
||||
- Supports business intelligence and reporting use cases
|
||||
- Bridges graph-based knowledge representation with traditional analytical workflows
|
||||
|
||||
**Implementation**:
|
||||
- Query translation layer: Graph queries → Columnar queries
|
||||
- Hybrid storage strategy supporting both graph operations and analytical workloads
|
||||
- Maintain query performance across both paradigms
|
||||
|
||||
---
|
||||
|
||||
## Architecture Principles Summary
|
||||
|
||||
1. **Flexibility First**: SPO/RDF model provides maximum adaptability
|
||||
2. **LLM Optimization**: All design decisions consider LLM interaction requirements
|
||||
3. **Semantic Efficiency**: Direct embedding-to-node mapping for optimal query performance
|
||||
4. **Pragmatic Scalability**: Balance perfect accuracy with practical distributed processing
|
||||
5. **Event-Driven Coordination**: Pub-sub enables loose coupling and scalability
|
||||
6. **Agent-Friendly**: Support complex, multi-agent processing workflows
|
||||
7. **Analytical Compatibility**: Bridge graph and columnar paradigms for comprehensive querying
|
||||
|
||||
These foundations establish a knowledge graph architecture that balances theoretical rigor with practical scalability requirements, optimized for LLM integration and distributed processing.
|
||||
|
||||
169
docs/tech-specs/logging-strategy.md
Normal file
|
|
@ -0,0 +1,169 @@
|
|||
# TrustGraph Logging Strategy
|
||||
|
||||
## Overview
|
||||
|
||||
TrustGraph uses Python's built-in `logging` module for all logging operations. This provides a standardized, flexible approach to logging across all components of the system.
|
||||
|
||||
## Default Configuration
|
||||
|
||||
### Logging Level
|
||||
- **Default Level**: `INFO`
|
||||
- **Debug Mode**: `DEBUG` (enabled via command-line argument)
|
||||
- **Production**: `WARNING` or `ERROR` as appropriate
|
||||
|
||||
### Output Destination
|
||||
All logs should be written to **standard output (stdout)** to ensure compatibility with containerized environments and log aggregation systems.
|
||||
|
||||
## Implementation Guidelines
|
||||
|
||||
### 1. Logger Initialization
|
||||
|
||||
Each module should create its own logger using the module's `__name__`:
|
||||
|
||||
```python
import logging

logger = logging.getLogger(__name__)
```
|
||||
|
||||
### 2. Centralized Configuration
|
||||
|
||||
The logging configuration should be centralized in `async_processor.py` (or a dedicated logging configuration module) since it's inherited by much of the codebase:
|
||||
|
||||
```python
import argparse
import logging
import sys

def setup_logging(log_level='INFO'):
    """Configure logging for the entire application"""
    logging.basicConfig(
        level=getattr(logging, log_level.upper()),
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        # StreamHandler() defaults to stderr; pass stdout explicitly
        handlers=[logging.StreamHandler(sys.stdout)]
    )

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--log-level',
        default='INFO',
        choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'],
        help='Set the logging level (default: INFO)'
    )
    return parser.parse_args()

# In main execution
if __name__ == '__main__':
    args = parse_args()
    setup_logging(args.log_level)
```
|
||||
|
||||
### 3. Logging Best Practices
|
||||
|
||||
#### Log Levels Usage
|
||||
- **DEBUG**: Detailed information for diagnosing problems (variable values, function entry/exit)
|
||||
- **INFO**: General informational messages (service started, configuration loaded, processing milestones)
|
||||
- **WARNING**: Warning messages for potentially harmful situations (deprecated features, recoverable errors)
|
||||
- **ERROR**: Error messages for serious problems (failed operations, exceptions)
|
||||
- **CRITICAL**: Critical messages for system failures requiring immediate attention
|
||||
|
||||
#### Message Format
|
||||
```python
# Good - includes context
logger.info(f"Processing document: {doc_id}, size: {doc_size} bytes")
logger.error(f"Failed to connect to database: {error}", exc_info=True)

# Avoid - lacks context
logger.info("Processing document")
logger.error("Connection failed")
```
|
||||
|
||||
#### Performance Considerations
|
||||
```python
# Use %-style arguments so the message string is only built if the record is emitted.
# Note: the arguments themselves (here, expensive_function()) are still evaluated eagerly.
logger.debug("Expensive operation result: %s", expensive_function())

# Check log level for very expensive debug operations
if logger.isEnabledFor(logging.DEBUG):
    debug_data = compute_expensive_debug_info()
    logger.debug(f"Debug data: {debug_data}")
```
|
||||
|
||||
### 4. Structured Logging
|
||||
|
||||
For complex data, use structured logging by passing fields through `extra`. These fields are attached to the log record, but they only appear in the output if the configured formatter references them:

```python
logger.info("Request processed", extra={
    'request_id': request_id,
    'duration_ms': duration,
    'status_code': status_code,
    'user_id': user_id
})
```
|
||||
|
||||
### 5. Exception Logging
|
||||
|
||||
Always include stack traces for exceptions:
|
||||
|
||||
```python
try:
    process_data()
except Exception as e:
    logger.error(f"Failed to process data: {e}", exc_info=True)
    raise
```
|
||||
|
||||
### 6. Async Logging Considerations
|
||||
|
||||
The `logging` module is already thread-safe. In async code, include the current task name so that interleaved log lines can be attributed to the task that emitted them:

```python
import asyncio
import logging

async def async_operation():
    logger = logging.getLogger(__name__)
    logger.info(f"Starting async operation in task: {asyncio.current_task().get_name()}")
```
|
||||
|
||||
## Environment Variables
|
||||
|
||||
Support environment-based configuration as a fallback:
|
||||
|
||||
```python
import os

log_level = os.environ.get('TRUSTGRAPH_LOG_LEVEL', 'INFO')
```
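
Treating the environment variable as a fallback implies a precedence order: command-line value first, then `TRUSTGRAPH_LOG_LEVEL`, then `INFO`. A minimal sketch of that resolution is shown here; the helper name `resolve_log_level` is hypothetical, and it assumes the `--log-level` argument is declared with `default=None` so that an absent flag can be distinguished from an explicit one.

```python
import os

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def resolve_log_level(cli_value=None, environ=None):
    """Pick the log level: CLI flag, then environment, then INFO."""
    environ = os.environ if environ is None else environ
    candidate = cli_value or environ.get("TRUSTGRAPH_LOG_LEVEL") or "INFO"
    candidate = candidate.upper()
    # Fall back to INFO rather than failing on an unrecognised value
    return candidate if candidate in VALID_LEVELS else "INFO"
```

Passing the environment as a parameter keeps the function easy to test without modifying `os.environ`.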
|
||||
|
||||
## Testing
|
||||
|
||||
During tests, consider using a different logging configuration:
|
||||
|
||||
```python
# In test setup
logging.getLogger().setLevel(logging.WARNING)  # Reduce noise during tests
```
|
||||
|
||||
## Monitoring Integration
|
||||
|
||||
Ensure log format is compatible with monitoring tools:
|
||||
- Include timestamps in ISO format
|
||||
- Use consistent field names
|
||||
- Include correlation IDs where applicable
|
||||
- Structure logs for easy parsing (JSON format for production)
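
As an illustration of these points, a custom formatter can emit one JSON object per record with an ISO 8601 timestamp and any fields supplied via `extra`. This is a sketch using only the standard library; the class name `JsonFormatter` and the output field names are assumptions, not an agreed TrustGraph format.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Attribute names present on every LogRecord; anything else came from `extra`
_STANDARD = set(
    logging.LogRecord("", 0, "", 0, "", (), None).__dict__
) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy caller-supplied `extra` fields (e.g. a correlation ID)
        entry.update(
            {k: v for k, v in record.__dict__.items() if k not in _STANDARD}
        )
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, default=str)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
```

Passing `handlers=[handler]` to `logging.basicConfig` would apply this format to every logger in the process.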
|
||||
|
||||
## Security Considerations
|
||||
|
||||
- Never log sensitive information (passwords, API keys, personal data)
|
||||
- Sanitize user input before logging
|
||||
- Use placeholders for sensitive fields: `user_id=****1234`
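
A `logging.Filter` is one place where such masking could be enforced centrally rather than at every call site. The sketch below is illustrative only: the class name `RedactingFilter` and the list of key names in the pattern are assumptions, and pattern-based redaction is a safety net, not a substitute for keeping secrets out of log calls.

```python
import logging
import re

# Hypothetical set of key names to mask; extend to suit the deployment
_SECRET = re.compile(r"(password|api_key|token)=(\S+)", re.IGNORECASE)


class RedactingFilter(logging.Filter):
    """Mask `key=value` secrets in the final log message."""

    def filter(self, record):
        message = record.getMessage()  # apply %-style arguments first
        record.msg = _SECRET.sub(lambda m: f"{m.group(1)}=****", message)
        record.args = ()  # message is already fully formatted
        return True  # never drop the record, only rewrite it
```

Attaching it with `handler.addFilter(RedactingFilter())` applies the masking to every record that handler emits.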
|
||||
|
||||
## Migration Path
|
||||
|
||||
For existing code using print statements:
|
||||
1. Replace `print()` with appropriate logger calls
|
||||
2. Choose appropriate log levels based on message importance
|
||||
3. Add context to make logs more useful
|
||||
4. Test logging output at different levels
|
||||
256
docs/tech-specs/mcp-tool-arguments.md
Normal file
|
|
@ -0,0 +1,256 @@
|
|||
# MCP Tool Arguments Specification
|
||||
|
||||
## Overview
|
||||
**Feature Name**: MCP Tool Arguments Support
|
||||
**Author**: Claude Code Assistant
|
||||
**Date**: 2025-08-21
|
||||
**Status**: Finalised
|
||||
|
||||
### Executive Summary
|
||||
|
||||
Enable ReACT agents to invoke MCP (Model Context Protocol) tools with properly defined arguments by adding argument specification support to MCP tool configurations, similar to how prompt template tools currently work.
|
||||
|
||||
### Problem Statement
|
||||
|
||||
Currently, MCP tools in the ReACT agent framework cannot specify their expected arguments. The `McpToolImpl.get_arguments()` method returns an empty list, forcing LLMs to guess the correct parameter structure based only on tool names and descriptions. This leads to:
|
||||
- Unreliable tool invocations due to parameter guessing
|
||||
- Poor user experience when tools fail due to incorrect arguments
|
||||
- No validation of tool parameters before execution
|
||||
- Missing parameter documentation in agent prompts
|
||||
|
||||
### Goals
|
||||
|
||||
- [ ] Allow MCP tool configurations to specify expected arguments (name, type, description)
|
||||
- [ ] Update agent manager to expose MCP tool arguments to LLMs via prompts
|
||||
- [ ] Maintain backward compatibility with existing MCP tool configurations
|
||||
- [ ] Support argument validation similar to prompt template tools
|
||||
|
||||
### Non-Goals
|
||||
- Dynamic argument discovery from MCP servers (future enhancement)
|
||||
- Argument type validation beyond basic structure
|
||||
- Complex argument schemas (nested objects, arrays)
|
||||
|
||||
## Background and Context
|
||||
|
||||
### Current State
|
||||
MCP tools are configured in the ReACT agent system with minimal metadata:
|
||||
```json
{
  "type": "mcp-tool",
  "name": "get_bank_balance",
  "description": "Get bank account balance",
  "mcp-tool": "get_bank_balance"
}
```
|
||||
|
||||
The `McpToolImpl.get_arguments()` method returns `[]`, so LLMs receive no argument guidance in their prompts.
|
||||
|
||||
### Limitations
|
||||
|
||||
1. **No argument specification**: MCP tools cannot define expected parameters

2. **LLM parameter guessing**: Agents must infer parameters from tool names/descriptions

3. **Missing prompt information**: Agent prompts show no argument details for MCP tools

4. **No validation**: Invalid parameters are only caught at MCP tool execution time
|
||||
|
||||
### Related Components
|
||||
- **trustgraph-flow/agent/react/service.py**: Tool configuration loading and AgentManager creation
|
||||
- **trustgraph-flow/agent/react/tools.py**: McpToolImpl implementation
|
||||
- **trustgraph-flow/agent/react/agent_manager.py**: Prompt generation with tool arguments
|
||||
- **trustgraph-cli**: CLI tools for MCP tool management
|
||||
- **Workbench**: External UI for agent tool configuration
|
||||
|
||||
## Requirements
|
||||
|
||||
### Functional Requirements
|
||||
|
||||
1. **MCP Tool Configuration Arguments**: MCP tool configurations MUST support an optional `arguments` array with name, type, and description fields
|
||||
2. **Argument Exposure**: `McpToolImpl.get_arguments()` MUST return configured arguments instead of empty list
|
||||
3. **Prompt Integration**: Agent prompts MUST include MCP tool argument details when arguments are specified
|
||||
4. **Backward Compatibility**: Existing MCP tool configurations without arguments MUST continue to work
|
||||
5. **CLI Support**: Existing `tg-invoke-mcp-tool` CLI supports arguments (already implemented)
|
||||
|
||||
### Non-Functional Requirements
|
||||
1. **Backward Compatibility**: Zero breaking changes for existing MCP tool configurations
|
||||
2. **Performance**: No significant performance impact on agent prompt generation
|
||||
3. **Consistency**: Argument handling MUST match prompt template tool patterns
|
||||
|
||||
### User Stories
|
||||
|
||||
1. As an **agent developer**, I want to specify MCP tool arguments in configuration so that LLMs can invoke tools with correct parameters
|
||||
2. As a **workbench user**, I want to configure MCP tool arguments in the UI so that agents use tools properly
|
||||
3. As an **LLM in a ReACT agent**, I want to see tool argument specifications in prompts so that I can provide correct parameters
|
||||
|
||||
## Design
|
||||
|
||||
### High-Level Architecture
|
||||
Extend MCP tool configuration to match the prompt template pattern by:
|
||||
1. Adding optional `arguments` array to MCP tool configurations
|
||||
2. Modifying `McpToolImpl` to accept and return configured arguments
|
||||
3. Updating tool configuration loading to handle MCP tool arguments
|
||||
4. Ensuring agent prompts include MCP tool argument information
|
||||
|
||||
### Configuration Schema
|
||||
```json
{
  "type": "mcp-tool",
  "name": "get_bank_balance",
  "description": "Get bank account balance",
  "mcp-tool": "get_bank_balance",
  "arguments": [
    {
      "name": "account_id",
      "type": "string",
      "description": "Bank account identifier"
    },
    {
      "name": "date",
      "type": "string",
      "description": "Date for balance query (optional, format: YYYY-MM-DD)"
    }
  ]
}
```
|
||||
|
||||
### Data Flow
|
||||
1. **Configuration Loading**: MCP tool config with arguments is loaded by `on_tools_config()`
|
||||
2. **Tool Creation**: Arguments are parsed and passed to `McpToolImpl` via constructor
|
||||
3. **Prompt Generation**: `agent_manager.py` calls `tool.arguments` to include in LLM prompts
|
||||
4. **Tool Invocation**: LLM provides parameters which are passed to MCP service unchanged
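
The first three steps of this flow can be sketched in a few lines. The code below is not the TrustGraph implementation: `Argument`, `McpToolSketch`, `tool_from_config`, and `describe` are hypothetical stand-ins whose fields and signatures are assumed from the configuration schema above, and the prompt layout is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Argument:
    """Stand-in for the framework's argument type (fields assumed)."""
    name: str
    type: str
    description: str


class McpToolSketch:
    """Stand-in for McpToolImpl: stores and exposes configured arguments."""

    def __init__(self, mcp_tool_id, arguments=None):
        self.mcp_tool_id = mcp_tool_id
        self.arguments = arguments or []

    def get_arguments(self):
        return self.arguments


def tool_from_config(config):
    """Build a tool from one configuration entry; `arguments` is optional."""
    arguments = [
        Argument(name=a["name"], type=a["type"], description=a["description"])
        for a in config.get("arguments", [])
    ]
    return McpToolSketch(config["mcp-tool"], arguments)


def describe(tool_name, tool):
    """Render the tool's arguments as text for inclusion in a prompt."""
    lines = [f"{tool_name}:"]
    for arg in tool.get_arguments():
        lines.append(f"  - {arg.name} ({arg.type}): {arg.description}")
    return "\n".join(lines)
```

Using `config.get("arguments", [])` is what keeps legacy configurations working: a tool without an `arguments` key simply exposes an empty list, as it does today.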
|
||||
|
||||
### API Changes
|
||||
No external API changes - this is purely internal configuration and argument handling.
|
||||
|
||||
### Component Details
|
||||
|
||||
#### Component 1: service.py (Tool Configuration Loading)
|
||||
- **Purpose**: Parse MCP tool configurations and create tool instances
|
||||
- **Changes Required**: Add argument parsing for MCP tools (similar to prompt tools)
|
||||
- **New Functionality**: Extract `arguments` array from MCP tool config and create `Argument` objects
|
||||
|
||||
#### Component 2: tools.py (McpToolImpl)
|
||||
- **Purpose**: MCP tool implementation wrapper
|
||||
- **Changes Required**: Accept arguments in constructor and return them from `get_arguments()`
|
||||
- **New Functionality**: Store and expose configured arguments instead of returning empty list
|
||||
|
||||
#### Component 3: Workbench (External Repository)
|
||||
- **Purpose**: UI for configuring agent tools
|
||||
- **Changes Required**: Add argument specification UI for MCP tools
|
||||
- **New Functionality**: Allow users to add/edit/remove arguments for MCP tools
|
||||
|
||||
#### Component 4: CLI Tools
|
||||
- **Purpose**: Command-line tool management
|
||||
- **Changes Required**: Support argument specification in MCP tool creation/update commands
|
||||
- **New Functionality**: Accept arguments parameter in tool configuration commands
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Core Agent Framework Changes
|
||||
- [ ] Update `McpToolImpl` constructor to accept `arguments` parameter
|
||||
- [ ] Change `McpToolImpl.get_arguments()` to return stored arguments
|
||||
- [ ] Modify `service.py` MCP tool configuration parsing to handle arguments
|
||||
- [ ] Add unit tests for MCP tool argument handling
|
||||
- [ ] Verify agent prompts include MCP tool arguments
|
||||
|
||||
### Phase 2: External Tool Support
|
||||
- [ ] Update CLI tools to support MCP tool argument specification
|
||||
- [ ] Document argument configuration format for users
|
||||
- [ ] Update Workbench UI to support MCP tool argument configuration
|
||||
- [ ] Add examples and documentation
|
||||
|
||||
### Code Changes Summary
|
||||
| File | Change Type | Description |
|------|-------------|-------------|
| `tools.py` | Modified | Update McpToolImpl to accept and store arguments |
| `service.py` | Modified | Parse arguments from MCP tool config (lines 108-113) |
| `test_react_processor.py` | Modified | Add tests for MCP tool arguments |
| CLI tools | Modified | Support argument specification in commands |
| Workbench | Modified | Add UI for MCP tool argument configuration |
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
- **MCP Tool Argument Parsing**: Test `service.py` correctly parses arguments from MCP tool configurations
|
||||
- **McpToolImpl Arguments**: Test `get_arguments()` returns configured arguments instead of empty list
|
||||
- **Backward Compatibility**: Test MCP tools without arguments continue to work (return empty list)
|
||||
- **Agent Prompt Generation**: Test agent prompts include MCP tool argument details
|
||||
|
||||
### Integration Tests
|
||||
- **End-to-End Tool Invocation**: Test agent with MCP tool arguments can successfully invoke tools
|
||||
- **Configuration Loading**: Test complete config load cycle with MCP tool arguments
|
||||
- **Cross-Component**: Test arguments flow correctly from config → tool creation → prompt generation
|
||||
|
||||
### Manual Testing
|
||||
- **Agent Behavior**: Manually verify LLM receives and uses argument information in ReACT cycles
|
||||
- **CLI Integration**: Test tg-invoke-mcp-tool works with new argument-configured MCP tools
|
||||
- **Workbench Integration**: Test UI supports MCP tool argument configuration
|
||||
|
||||
## Migration and Rollout
|
||||
|
||||
### Migration Strategy
|
||||
No migration required - this is purely additive functionality:
|
||||
- Existing MCP tool configurations without `arguments` continue to work unchanged
|
||||
- `McpToolImpl.get_arguments()` returns empty list for legacy tools
|
||||
- New configurations can optionally include `arguments` array
|
||||
|
||||
### Rollout Plan
|
||||
1. **Phase 1**: Deploy core agent framework changes to development/staging
|
||||
2. **Phase 2**: Deploy CLI tool updates and documentation
|
||||
3. **Phase 3**: Deploy Workbench UI updates for argument configuration
|
||||
4. **Phase 4**: Production rollout with monitoring
|
||||
|
||||
### Rollback Plan
|
||||
- Core changes are backward compatible - no rollback needed for functionality
|
||||
- If issues arise, disable argument parsing by reverting MCP tool config loading logic
|
||||
- Workbench and CLI changes are independent and can be rolled back separately
|
||||
|
||||
## Security Considerations
|
||||
- **No new attack surface**: Arguments are parsed from existing configuration sources with no new inputs
|
||||
- **Parameter validation**: Arguments are passed through to MCP tools unchanged - validation remains at MCP tool level
|
||||
- **Configuration integrity**: Argument specifications are part of tool configuration - same security model applies
|
||||
|
||||
## Performance Impact
|
||||
- **Minimal overhead**: Argument parsing happens only during configuration loading, not per-request
|
||||
- **Prompt size increase**: Agent prompts will include MCP tool argument details, slightly increasing token usage
|
||||
- **Memory usage**: Negligible increase for storing argument specifications in tool objects
|
||||
|
||||
## Documentation
|
||||
|
||||
### User Documentation
|
||||
- [ ] Update MCP tool configuration guide with argument examples
|
||||
- [ ] Add argument specification to CLI tool help text
|
||||
- [ ] Create examples of common MCP tool argument patterns
|
||||
|
||||
### Developer Documentation
|
||||
- [ ] Update McpToolImpl class documentation
|
||||
- [ ] Add inline comments for argument parsing logic
|
||||
- [ ] Document argument flow in system architecture
|
||||
|
||||
## Open Questions
|
||||
1. **Argument validation**: Should we validate argument types/formats beyond basic structure checking?
|
||||
2. **Dynamic discovery**: Future enhancement to query MCP servers for tool schemas automatically?
|
||||
|
||||
## Alternatives Considered
|
||||
1. **Dynamic MCP schema discovery**: Query MCP servers for tool argument schemas at runtime - rejected due to complexity and reliability concerns
|
||||
2. **Separate argument registry**: Store MCP tool arguments in separate configuration section - rejected for consistency with prompt template approach
|
||||
3. **Type validation**: Full JSON schema validation for arguments - deferred as future enhancement to keep initial implementation simple
|
||||
|
||||
## References
|
||||
- [MCP Protocol Specification](https://github.com/modelcontextprotocol/spec)
|
||||
- [Prompt Template Tool Implementation](./trustgraph-flow/trustgraph/agent/react/service.py#L114-129)
|
||||
- [Current MCP Tool Implementation](./trustgraph-flow/trustgraph/agent/react/tools.py#L58-86)
|
||||
|
||||
## Appendix
|
||||
[Any additional information, diagrams, or examples]
|
||||
279
docs/tech-specs/more-config-cli.md
Normal file
|
|
@ -0,0 +1,279 @@
|
|||
# More Configuration CLI Technical Specification
|
||||
|
||||
## Overview
|
||||
|
||||
This specification describes enhanced command-line configuration capabilities for TrustGraph, enabling users to manage individual configuration items through granular CLI commands. The integration supports four primary use cases:
|
||||
|
||||
1. **List Configuration Items**: Display configuration keys of a specific type
|
||||
2. **Get Configuration Item**: Retrieve specific configuration values
|
||||
3. **Put Configuration Item**: Set or update individual configuration items
|
||||
4. **Delete Configuration Item**: Remove specific configuration items
|
||||
|
||||
## Goals
|
||||
|
||||
- **Granular Control**: Enable management of individual configuration items rather than bulk operations
|
||||
- **Type-Based Listing**: Allow users to explore configuration items by type
|
||||
- **Single Item Operations**: Provide commands for get/put/delete of individual config items
|
||||
- **API Integration**: Leverage existing Config API for all operations
|
||||
- **Consistent CLI Pattern**: Follow established TrustGraph CLI conventions and patterns
|
||||
- **Error Handling**: Provide clear error messages for invalid operations
|
||||
- **JSON Output**: Support structured output for programmatic use
|
||||
- **Documentation**: Include comprehensive help and usage examples
|
||||
|
||||
## Background
|
||||
|
||||
TrustGraph currently provides configuration management through the Config API and a single CLI command `tg-show-config` that displays the entire configuration. While this works for viewing configuration, it lacks granular management capabilities.
|
||||
|
||||
Current limitations include:
|
||||
- No way to list configuration items by type from CLI
|
||||
- No CLI command to retrieve specific configuration values
|
||||
- No CLI command to set individual configuration items
|
||||
- No CLI command to delete specific configuration items
|
||||
|
||||
This specification addresses these gaps by adding four new CLI commands that provide granular configuration management. By exposing individual Config API operations through CLI commands, TrustGraph can:
|
||||
- Enable scripted configuration management
|
||||
- Allow exploration of configuration structure by type
|
||||
- Support targeted configuration updates
|
||||
- Provide fine-grained configuration control
|
||||
|
||||
## Technical Design
|
||||
|
||||
### Architecture
|
||||
|
||||
The enhanced CLI configuration requires the following technical components:
|
||||
|
||||
1. **tg-list-config-items**
|
||||
- Lists configuration keys for a specified type
|
||||
- Calls Config.list(type) API method
|
||||
- Outputs list of configuration keys
|
||||
|
||||
Module: `trustgraph.cli.list_config_items`
|
||||
|
||||
2. **tg-get-config-item**
|
||||
- Retrieves specific configuration item(s)
|
||||
- Calls Config.get(keys) API method
|
||||
- Outputs configuration values in JSON format
|
||||
|
||||
Module: `trustgraph.cli.get_config_item`
|
||||
|
||||
3. **tg-put-config-item**
|
||||
- Sets or updates a configuration item
|
||||
- Calls Config.put(values) API method
|
||||
- Accepts type, key, and value parameters
|
||||
|
||||
Module: `trustgraph.cli.put_config_item`
|
||||
|
||||
4. **tg-delete-config-item**
|
||||
- Removes a configuration item
|
||||
- Calls Config.delete(keys) API method
|
||||
- Accepts type and key parameters
|
||||
|
||||
Module: `trustgraph.cli.delete_config_item`
|
||||
|
||||
### Data Models
|
||||
|
||||
#### ConfigKey and ConfigValue
|
||||
|
||||
The commands utilize existing data structures from `trustgraph.api.types`:
|
||||
|
||||
```python
import dataclasses

@dataclasses.dataclass
class ConfigKey:
    type : str
    key : str

@dataclasses.dataclass
class ConfigValue:
    type : str
    key : str
    value : str
```
|
||||
|
||||
This approach allows:
|
||||
- Consistent data handling across CLI and API
|
||||
- Type-safe configuration operations
|
||||
- Structured input/output formats
|
||||
- Integration with existing Config API
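
As a rough illustration of how a command might turn parsed arguments into these structures, the sketch below redefines the two dataclasses locally so that it runs without TrustGraph installed. The helper names `key_from_args` and `value_from_args` are hypothetical and are not part of `trustgraph.api.types`.

```python
import dataclasses
from types import SimpleNamespace

# Local copies of the structures shown above, so the sketch is self-contained.
@dataclasses.dataclass
class ConfigKey:
    type: str
    key: str

@dataclasses.dataclass
class ConfigValue:
    type: str
    key: str
    value: str

def key_from_args(args):
    """Hypothetical helper: build a ConfigKey from parsed CLI arguments."""
    return ConfigKey(type=args.type, key=args.key)

def value_from_args(args, value):
    """Hypothetical helper: build a ConfigValue from arguments plus a value."""
    return ConfigValue(type=args.type, key=args.key, value=value)

# Stand-in for an argparse.Namespace produced by --type / --key.
args = SimpleNamespace(type="prompt", key="template-1")
key = key_from_args(args)
item = value_from_args(args, "Custom prompt: {input}")
```

Keeping the type and key together in one object means the get, put, and delete commands can share the same argument handling.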
|
||||
|
||||
### CLI Command Specifications
|
||||
|
||||
#### tg-list-config-items
|
||||
```bash
|
||||
tg-list-config-items --type <config-type> [--format text|json] [--api-url <url>]
|
||||
```
|
||||
- **Purpose**: List all configuration keys for a given type
|
||||
- **API Call**: `Config.list(type)`
|
||||
- **Output**:
|
||||
- `text` (default): Configuration keys separated by newlines
|
||||
- `json`: JSON array of configuration keys
|
||||
|
||||
#### tg-get-config-item
|
||||
```bash
|
||||
tg-get-config-item --type <type> --key <key> [--format text|json] [--api-url <url>]
|
||||
```
|
||||
- **Purpose**: Retrieve specific configuration item
|
||||
- **API Call**: `Config.get([ConfigKey(type, key)])`
|
||||
- **Output**:
|
||||
- `text` (default): Raw string value
|
||||
- `json`: JSON-encoded string value
|
||||
|
||||
#### tg-put-config-item
|
||||
```bash
|
||||
tg-put-config-item --type <type> --key <key> --value <value> [--api-url <url>]
|
||||
tg-put-config-item --type <type> --key <key> --stdin [--api-url <url>]
|
||||
```
|
||||
- **Purpose**: Set or update configuration item
|
||||
- **API Call**: `Config.put([ConfigValue(type, key, value)])`
|
||||
- **Input Options**:
|
||||
- `--value`: String value provided directly on command line
|
||||
- `--stdin`: Read value from standard input
|
||||
- **Output**: Success confirmation
|
||||
|
||||
#### tg-delete-config-item
|
||||
```bash
|
||||
tg-delete-config-item --type <type> --key <key> [--api-url <url>]
|
||||
```
|
||||
- **Purpose**: Delete configuration item
|
||||
- **API Call**: `Config.delete([ConfigKey(type, key)])`
|
||||
- **Output**: Success confirmation
|
||||
|
||||
### Implementation Details
|
||||
|
||||
All commands follow the established TrustGraph CLI pattern:
|
||||
- Use `argparse` for command-line argument parsing
|
||||
- Import and use `trustgraph.api.Api` for backend communication
|
||||
- Follow the same error handling patterns as existing CLI commands
|
||||
- Support the standard `--api-url` parameter for API endpoint configuration
|
||||
- Provide descriptive help text and usage examples
|
||||
|
||||
#### Output Format Handling
|
||||
|
||||
**Text Format (Default)**:
|
||||
- `tg-list-config-items`: One key per line, plain text
|
||||
- `tg-get-config-item`: Raw string value, no quotes or encoding
|
||||
|
||||
**JSON Format**:
|
||||
- `tg-list-config-items`: Array of strings `["key1", "key2", "key3"]`
|
||||
- `tg-get-config-item`: JSON-encoded string value `"actual string value"`
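
The two output formats described above can be sketched with the standard `json` module. This is only an illustration under the assumption that keys arrive as a list of strings and values as plain strings; the function names `format_keys` and `format_value` are hypothetical.

```python
import json

def format_keys(keys, fmt="text"):
    """Render configuration keys: one per line (text) or a JSON array."""
    if fmt == "json":
        return json.dumps(list(keys))
    return "\n".join(keys)

def format_value(value, fmt="text"):
    """Render a configuration value: raw string (text) or JSON-encoded."""
    if fmt == "json":
        return json.dumps(value)
    return value

print(format_keys(["template-1", "template-2"]))
print(format_keys(["template-1", "template-2"], "json"))
print(format_value("You are a helpful assistant.", "json"))
```

JSON-encoding the value makes embedded quotes and newlines unambiguous for downstream tools, whereas the text format passes the stored string through untouched.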
|
||||
|
||||
#### Input Handling
|
||||
|
||||
**tg-put-config-item** supports two mutually exclusive input methods:
|
||||
- `--value <string>`: Direct command-line string value
|
||||
- `--stdin`: Read entire input from standard input as the configuration value
|
||||
- stdin contents are read as raw text (preserving newlines, whitespace, etc.)
|
||||
- Supports piping from files, commands, or interactive input
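
One way to express the mutually exclusive `--value` / `--stdin` inputs is an `argparse` mutually exclusive group. The sketch below is an assumption about how the command could be structured, not the actual implementation; `build_parser` and `resolve_value` are hypothetical names, and no default is assumed for `--api-url`.

```python
import argparse
import sys

def build_parser():
    parser = argparse.ArgumentParser(
        prog="tg-put-config-item",
        description="Set or update a configuration item",
    )
    parser.add_argument("--type", required=True, help="Configuration type")
    parser.add_argument("--key", required=True, help="Configuration key")
    parser.add_argument("--api-url", help="API endpoint URL")

    # Exactly one of --value / --stdin must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--value", help="Value given on the command line")
    source.add_argument("--stdin", action="store_true",
                        help="Read the value from standard input")
    return parser

def resolve_value(args, stdin=None):
    """Return the value to store, reading stdin verbatim when requested."""
    if args.stdin:
        stream = sys.stdin if stdin is None else stdin
        return stream.read()  # raw text: newlines and whitespace preserved
    return args.value
```

Passing the input stream as a parameter keeps the function easy to unit test, since a test can supply an in-memory stream instead of the real standard input.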
|
||||
|
||||
## Security Considerations
|
||||
|
||||
- **Input Validation**: All command-line parameters must be validated before API calls
|
||||
- **API Authentication**: Commands inherit existing API authentication mechanisms
|
||||
- **Configuration Access**: Commands respect existing configuration access controls
|
||||
- **Error Information**: Error messages should not leak sensitive configuration details
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
- **Single Item Operations**: Commands are designed for individual items, avoiding bulk operation overhead
|
||||
- **API Efficiency**: Direct API calls minimize processing layers
|
||||
- **Network Latency**: Each command makes one API call, minimizing network round trips
|
||||
- **Memory Usage**: Minimal memory footprint for single-item operations
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
- **Unit Tests**: Test each CLI command module independently
|
||||
- **Integration Tests**: Test CLI commands against live Config API
|
||||
- **Error Handling Tests**: Verify proper error handling for invalid inputs
|
||||
- **API Compatibility**: Ensure commands work with existing Config API versions
|
||||
|
||||
## Migration Plan
|
||||
|
||||
No migration required - these are new CLI commands that complement existing functionality:
|
||||
- Existing `tg-show-config` command remains unchanged
|
||||
- New commands can be added incrementally
|
||||
- No breaking changes to existing configuration workflows
|
||||
|
||||
## Packaging and Distribution
|
||||
|
||||
These commands will be added to the existing `trustgraph-cli` package:
|
||||
|
||||
**Package Location**: `trustgraph-cli/`
|
||||
**Module Files**:
|
||||
- `trustgraph-cli/trustgraph/cli/list_config_items.py`
|
||||
- `trustgraph-cli/trustgraph/cli/get_config_item.py`
|
||||
- `trustgraph-cli/trustgraph/cli/put_config_item.py`
|
||||
- `trustgraph-cli/trustgraph/cli/delete_config_item.py`
|
||||
|
||||
**Entry Points**: Added to `trustgraph-cli/pyproject.toml` in `[project.scripts]` section:
|
||||
```toml
|
||||
tg-list-config-items = "trustgraph.cli.list_config_items:main"
|
||||
tg-get-config-item = "trustgraph.cli.get_config_item:main"
|
||||
tg-put-config-item = "trustgraph.cli.put_config_item:main"
|
||||
tg-delete-config-item = "trustgraph.cli.delete_config_item:main"
|
||||
```
|
||||
|
||||
## Implementation Tasks
|
||||
|
||||
1. **Create CLI Modules**: Implement the four CLI command modules in `trustgraph-cli/trustgraph/cli/`
|
||||
2. **Update pyproject.toml**: Add new command entry points to `trustgraph-cli/pyproject.toml`
|
||||
3. **Documentation**: Create CLI documentation for each command in `docs/cli/`
|
||||
4. **Testing**: Implement comprehensive test coverage
|
||||
5. **Integration**: Ensure commands work with existing TrustGraph infrastructure
|
||||
6. **Package Build**: Verify commands are properly installed with `pip install trustgraph-cli`
|
||||
|
||||
## Usage Examples
|
||||
|
||||
#### List configuration items
|
||||
```bash
|
||||
# List prompt keys (text format)
|
||||
tg-list-config-items --type prompt
|
||||
template-1
|
||||
template-2
|
||||
system-prompt
|
||||
|
||||
# List prompt keys (JSON format)
|
||||
tg-list-config-items --type prompt --format json
|
||||
["template-1", "template-2", "system-prompt"]
|
||||
```
|
||||
|
||||
#### Get configuration item
|
||||
```bash
|
||||
# Get prompt value (text format)
|
||||
tg-get-config-item --type prompt --key template-1
|
||||
You are a helpful assistant. Please respond to: {query}
|
||||
|
||||
# Get prompt value (JSON format)
|
||||
tg-get-config-item --type prompt --key template-1 --format json
|
||||
"You are a helpful assistant. Please respond to: {query}"
|
||||
```
|
||||
|
||||
#### Set configuration item
|
||||
```bash
|
||||
# Set from command line
|
||||
tg-put-config-item --type prompt --key new-template --value "Custom prompt: {input}"
|
||||
|
||||
# Set from file via pipe
|
||||
cat ./prompt-template.txt | tg-put-config-item --type prompt --key complex-template --stdin
|
||||
|
||||
# Set from file via redirect
|
||||
tg-put-config-item --type prompt --key complex-template --stdin < ./prompt-template.txt
|
||||
|
||||
# Set from command output
|
||||
echo "Generated template: {query}" | tg-put-config-item --type prompt --key auto-template --stdin
|
||||
```
|
||||
|
||||
#### Delete configuration item
|
||||
```bash
|
||||
tg-delete-config-item --type prompt --key old-template
|
||||
```
|
||||
|
||||
## Open Questions
|
||||
|
||||
- Should commands support batch operations (multiple keys) in addition to single items?
|
||||
- What output format should be used for success confirmations?
|
||||
- How should configuration types be documented/discovered by users?
|
||||
|
||||
## References
|
||||
|
||||
- Existing Config API: `trustgraph/api/config.py`
|
||||
- CLI patterns: `trustgraph-cli/trustgraph/cli/show_config.py`
|
||||
- Data types: `trustgraph/api/types.py`
|
||||
91
docs/tech-specs/schema-refactoring-proposal.md
Normal file
|
|
@ -0,0 +1,91 @@
|
|||
# Schema Directory Refactoring Proposal
|
||||
|
||||
## Current Issues
|
||||
|
||||
1. **Flat structure** - All schemas in one directory makes it hard to understand relationships
|
||||
2. **Mixed concerns** - Core types, domain objects, and API contracts all mixed together
|
||||
3. **Unclear naming** - Files like "object.py", "types.py", "topic.py" don't clearly indicate their purpose
|
||||
4. **No clear layering** - Can't easily see what depends on what
|
||||
|
||||
## Proposed Structure
|
||||
|
||||
```
trustgraph-base/trustgraph/schema/
├── __init__.py
├── core/              # Core primitive types used everywhere
│   ├── __init__.py
│   ├── primitives.py  # Error, Value, Triple, Field, RowSchema
│   ├── metadata.py    # Metadata record
│   └── topic.py       # Topic utilities
│
├── knowledge/         # Knowledge domain models and extraction
│   ├── __init__.py
│   ├── graph.py       # EntityContext, EntityEmbeddings, Triples
│   ├── document.py    # Document, TextDocument, Chunk
│   ├── knowledge.py   # Knowledge extraction types
│   ├── embeddings.py  # All embedding-related types (moved from multiple files)
│   └── nlp.py         # Definition, Topic, Relationship, Fact types
│
└── services/          # Service request/response contracts
    ├── __init__.py
    ├── llm.py         # TextCompletion, Embeddings, Tool requests/responses
    ├── retrieval.py   # GraphRAG, DocumentRAG queries/responses
    ├── query.py       # GraphEmbeddingsRequest/Response, DocumentEmbeddingsRequest/Response
    ├── agent.py       # Agent requests/responses
    ├── flow.py        # Flow requests/responses
    ├── prompt.py      # Prompt service requests/responses
    ├── config.py      # Configuration service
    ├── library.py     # Librarian service
    └── lookup.py      # Lookup service
```
|
||||
|
||||
## Key Changes
|
||||
|
||||
1. **Hierarchical organization** - Clear separation between core types, knowledge models, and service contracts
|
||||
2. **Better naming**:
|
||||
- `types.py` → `core/primitives.py` (clearer purpose)
|
||||
- `object.py` → Split between appropriate files based on actual content
|
||||
- `documents.py` → `knowledge/document.py` (singular, consistent)
|
||||
- `models.py` → `services/llm.py` (clearer what kind of models)
|
||||
- `prompt.py` → Split: service parts to `services/prompt.py`, data types to `knowledge/nlp.py`
|
||||
|
||||
3. **Logical grouping**:
|
||||
- All embedding types consolidated in `knowledge/embeddings.py`
|
||||
- All LLM-related service contracts in `services/llm.py`
|
||||
- Clear separation of request/response pairs in services directory
|
||||
- Knowledge extraction types grouped with other knowledge domain models
|
||||
|
||||
4. **Dependency clarity**:
|
||||
- Core types have no dependencies
|
||||
- Knowledge models depend only on core
|
||||
- Service contracts can depend on both core and knowledge models
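
The dependency rules above could be checked mechanically. The following is a minimal sketch that assumes the import relationships have already been collected as `(importing_module, imported_module)` pairs; `ALLOWED`, `layer_of`, and `violations` are hypothetical names, and how the pairs are gathered is left open.

```python
# Which layers each layer may import from, per the proposal above.
ALLOWED = {
    "core": set(),
    "knowledge": {"core"},
    "services": {"core", "knowledge"},
}

def layer_of(module):
    """Return the layer of a 'trustgraph.schema.<layer>...' module, else None."""
    parts = module.split(".")
    if len(parts) > 2 and parts[:2] == ["trustgraph", "schema"]:
        if parts[2] in ALLOWED:
            return parts[2]
    return None

def violations(imports):
    """Return the (source, target) pairs that break the layering rules."""
    bad = []
    for src, dst in imports:
        src_layer, dst_layer = layer_of(src), layer_of(dst)
        if src_layer is None or dst_layer is None or src_layer == dst_layer:
            continue  # outside the schema package, or within one layer
        if dst_layer not in ALLOWED[src_layer]:
            bad.append((src, dst))
    return bad
```

A check like this could run in a unit test so that a new schema file cannot quietly introduce an upward dependency.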
|
||||
|
||||
## Migration Benefits
|
||||
|
||||
1. **Easier navigation** - Developers can quickly find what they need
|
||||
2. **Better modularity** - Clear boundaries between different concerns
|
||||
3. **Simpler imports** - More intuitive import paths
|
||||
4. **Future-proof** - Easy to add new knowledge types or services without cluttering
|
||||
|
||||
## Example Import Changes
|
||||
|
||||
```python
|
||||
# Before
|
||||
from trustgraph.schema import Error, Triple, GraphEmbeddings, TextCompletionRequest
|
||||
|
||||
# After
|
||||
from trustgraph.schema.core import Error, Triple
|
||||
from trustgraph.schema.knowledge import GraphEmbeddings
|
||||
from trustgraph.schema.services import TextCompletionRequest
|
||||
```
|
||||
|
||||
## Implementation Notes
|
||||
|
||||
1. Keep backward compatibility by maintaining imports in root `__init__.py`
|
||||
2. Move files gradually, updating imports as needed
|
||||
3. Consider adding a `legacy.py` that imports everything for transition period
|
||||
4. Update documentation to reflect new structure
|
||||
|
139
docs/tech-specs/structured-data-schemas.md
Normal file
|
|
@ -0,0 +1,139 @@
|
|||
# Structured Data Pulsar Schema Changes
|
||||
|
||||
## Overview
|
||||
|
||||
Based on the STRUCTURED_DATA.md specification, this document proposes the necessary Pulsar schema additions and modifications to support structured data capabilities in TrustGraph.
|
||||
|
||||
## Required Schema Changes
|
||||
|
||||
### 1. Core Schema Enhancements
|
||||
|
||||
#### Enhanced Field Definition
|
||||
The existing `Field` class in `core/primitives.py` needs additional properties:
|
||||
|
||||
```python
from pulsar.schema import Record, String, Integer, Boolean, Array

class Field(Record):
    name = String()
    type = String()  # int, string, long, bool, float, double, timestamp
    size = Integer()
    primary = Boolean()
    description = String()
    # NEW FIELDS:
    required = Boolean()  # Whether field is required
    enum_values = Array(String())  # For enum type fields
    indexed = Boolean()  # Whether field should be indexed
```
|
||||
|
||||
### 2. New Knowledge Schemas
|
||||
|
||||
#### 2.1 Structured Data Submission
|
||||
New file: `knowledge/structured.py`
|
||||
|
||||
```python
from pulsar.schema import Record, String, Bytes, Map
from ..core.metadata import Metadata

class StructuredDataSubmission(Record):
    metadata = Metadata()
    format = String()  # "json", "csv", "xml"
    schema_name = String()  # Reference to schema in config
    data = Bytes()  # Raw data to ingest
    options = Map(String())  # Format-specific options
```
|
||||
|
||||
### 3. New Service Schemas
|
||||
|
||||
#### 3.1 NLP to Structured Query Service
|
||||
New file: `services/nlp_query.py`
|
||||
|
||||
```python
from pulsar.schema import Record, String, Array, Map, Integer, Double
from ..core.primitives import Error

class NLPToStructuredQueryRequest(Record):
    natural_language_query = String()
    max_results = Integer()
    context_hints = Map(String())  # Optional context for query generation

class NLPToStructuredQueryResponse(Record):
    error = Error()
    graphql_query = String()  # Generated GraphQL query
    variables = Map(String())  # GraphQL variables if any
    detected_schemas = Array(String())  # Which schemas the query targets
    confidence = Double()
```
|
||||
|
||||
#### 3.2 Structured Query Service
|
||||
New file: `services/structured_query.py`
|
||||
|
||||
```python
from pulsar.schema import Record, String, Map, Array
from ..core.primitives import Error

class StructuredQueryRequest(Record):
    query = String()  # GraphQL query
    variables = Map(String())  # GraphQL variables
    operation_name = String()  # Optional operation name for multi-operation documents

class StructuredQueryResponse(Record):
    error = Error()
    data = String()  # JSON-encoded GraphQL response data
    errors = Array(String())  # GraphQL errors if any
```
|
||||
|
||||
#### 2.2 Object Extraction Output
|
||||
New file: `knowledge/object.py`
|
||||
|
||||
```python
from pulsar.schema import Record, String, Map, Double
from ..core.metadata import Metadata

class ExtractedObject(Record):
    metadata = Metadata()
    schema_name = String()  # Which schema this object belongs to
    values = Map(String())  # Field name -> value
    confidence = Double()
    source_span = String()  # Text span where object was found
```
|
||||
|
||||
### 4. Enhanced Knowledge Schemas
|
||||
|
||||
#### 4.1 Object Embeddings Enhancement
|
||||
Update `knowledge/embeddings.py` to better support structured object embeddings:
|
||||
|
||||
```python
from pulsar.schema import Record, String, Array, Map, Double
from ..core.metadata import Metadata

class StructuredObjectEmbedding(Record):
    metadata = Metadata()
    vectors = Array(Array(Double()))
    schema_name = String()
    object_id = String()  # Primary key value
    field_embeddings = Map(Array(Double()))  # Per-field embeddings
```
|
||||
|
||||
## Integration Points
|
||||
|
||||
### Flow Integration
|
||||
|
||||
The schemas will be used by new flow modules:
|
||||
- `trustgraph-flow/trustgraph/decoding/structured` - Uses StructuredDataSubmission
|
||||
- `trustgraph-flow/trustgraph/query/nlp_query/cassandra` - Uses NLP query schemas
|
||||
- `trustgraph-flow/trustgraph/query/objects/cassandra` - Uses structured query schemas
|
||||
- `trustgraph-flow/trustgraph/extract/object/row/` - Consumes Chunk, produces ExtractedObject
|
||||
- `trustgraph-flow/trustgraph/storage/objects/cassandra` - Uses Rows schema
|
||||
- `trustgraph-flow/trustgraph/embeddings/object_embeddings/qdrant` - Uses object embedding schemas
|
||||
|
||||
## Implementation Notes
|
||||
|
||||
1. **Schema Versioning**: Consider adding a `version` field to RowSchema for future migration support
|
||||
2. **Type System**: The `Field.type` should support all Cassandra native types
|
||||
3. **Batch Operations**: Most services should support both single and batch operations
|
||||
4. **Error Handling**: Consistent error reporting across all new services
|
||||
5. **Backwards Compatibility**: Existing schemas remain unchanged except for minor Field enhancements
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. Implement schema files in the new structure
|
||||
2. Update existing services to recognize new schema types
|
||||
3. Implement flow modules that use these schemas
|
||||
4. Add gateway/rev-gateway endpoints for new services
|
||||
5. Create unit tests for schema validation
|
||||
253
docs/tech-specs/structured-data.md
Normal file
|
|
@ -0,0 +1,253 @@
|
|||
# Structured Data Technical Specification
|
||||
|
||||
## Overview
|
||||
|
||||
This specification describes the integration of TrustGraph with structured data flows, enabling the system to work with data that can be represented as rows in tables or objects in object stores. The integration supports four primary use cases:
|
||||
|
||||
1. **Unstructured to Structured Extraction**: Read unstructured data sources, identify and extract object structures, and store them in a tabular format
|
||||
2. **Structured Data Ingestion**: Load data that is already in structured formats directly into the structured store alongside extracted data
|
||||
3. **Natural Language Querying**: Convert natural language questions into structured queries to extract matching data from the store
|
||||
4. **Direct Structured Querying**: Execute structured queries directly against the data store for precise data retrieval
|
||||
|
||||
## Goals
|
||||
|
||||
- **Unified Data Access**: Provide a single interface for accessing both structured and unstructured data within TrustGraph
|
||||
- **Seamless Integration**: Enable smooth interoperability between TrustGraph's graph-based knowledge representation and traditional structured data formats
|
||||
- **Flexible Extraction**: Support automatic extraction of structured data from various unstructured sources (documents, text, etc.)
|
||||
- **Query Versatility**: Allow users to query data using both natural language and structured query languages
|
||||
- **Data Consistency**: Maintain data integrity and consistency across different data representations
|
||||
- **Performance Optimization**: Ensure efficient storage and retrieval of structured data at scale
|
||||
- **Schema Flexibility**: Support both schema-on-write and schema-on-read approaches to accommodate diverse data sources
|
||||
- **Backwards Compatibility**: Preserve existing TrustGraph functionality while adding structured data capabilities
|
||||
|
||||
## Background
|
||||
|
||||
TrustGraph currently excels at processing unstructured data and building knowledge graphs from diverse sources. However, many enterprise use cases involve data that is inherently structured - customer records, transaction logs, inventory databases, and other tabular datasets. These structured datasets often need to be analyzed alongside unstructured content to provide comprehensive insights.
|
||||
|
||||
Current limitations include:
|
||||
- No native support for ingesting pre-structured data formats (CSV, JSON arrays, database exports)
|
||||
- Inability to preserve the inherent structure when extracting tabular data from documents
|
||||
- Lack of efficient querying mechanisms for structured data patterns
|
||||
- Missing bridge between SQL-like queries and TrustGraph's graph queries
|
||||
|
||||
This specification addresses these gaps by introducing a structured data layer that complements TrustGraph's existing capabilities. By supporting structured data natively, TrustGraph can:
|
||||
- Serve as a unified platform for both structured and unstructured data analysis
|
||||
- Enable hybrid queries that span both graph relationships and tabular data
|
||||
- Provide familiar interfaces for users accustomed to working with structured data
|
||||
- Unlock new use cases in data integration and business intelligence
|
||||
|
||||
## Technical Design
|
||||
|
||||
### Architecture
|
||||
|
||||
The structured data integration requires the following technical components:
|
||||
|
||||
1. **NLP-to-Structured-Query Service**
|
||||
- Converts natural language questions into structured queries
|
||||
- Supports multiple query language targets (initially SQL-like syntax)
|
||||
- Integrates with existing TrustGraph NLP capabilities
|
||||
|
||||
Module: trustgraph-flow/trustgraph/query/nlp_query/cassandra
|
||||
|
||||
2. **Configuration Schema Support** ✅ **[COMPLETE]**
|
||||
- Extended configuration system to store structured data schemas
|
||||
- Support for defining table structures, field types, and relationships
|
||||
- Schema versioning and migration capabilities
|
||||
|
||||
3. **Object Extraction Module** ✅ **[COMPLETE]**
|
||||
- Enhanced knowledge extractor flow integration
|
||||
- Identifies and extracts structured objects from unstructured sources
|
||||
- Maintains provenance and confidence scores
|
||||
- Registers a config handler (example: trustgraph-flow/trustgraph/prompt/template/service.py) to receive config data and decode schema information
|
||||
- Receives objects and decodes them to ExtractedObject objects for delivery on the Pulsar queue
|
||||
- NOTE: Existing code at `trustgraph-flow/trustgraph/extract/object/row/` is a previous attempt that does not conform to current APIs and would need major refactoring. Reuse it where it helps; otherwise start from scratch.
|
||||
- Requires a command-line interface: `kg-extract-objects`
|
||||
|
||||
Module: trustgraph-flow/trustgraph/extract/kg/objects/
|
||||
|
||||
4. **Structured Store Writer Module** ✅ **[COMPLETE]**
|
||||
- Receives objects in ExtractedObject format from Pulsar queues
|
||||
- Initial implementation targeting Apache Cassandra as the structured data store
|
||||
- Handles dynamic table creation based on schemas encountered
|
||||
- Manages schema-to-Cassandra table mapping and data transformation
|
||||
- Provides batch and streaming write operations for performance optimization
|
||||
- No Pulsar outputs - this is a terminal service in the data flow
|
||||
|
||||
**Schema Handling**:
|
||||
- Monitors incoming ExtractedObject messages for schema references
|
||||
- When a new schema is encountered for the first time, automatically creates the corresponding Cassandra table
|
||||
- Maintains a cache of known schemas to avoid redundant table creation attempts
|
||||
- Should consider whether to receive schema definitions directly or rely on schema names in ExtractedObject messages
|
||||
|
||||
**Cassandra Table Mapping**:
|
||||
- Keyspace is named after the `user` field from ExtractedObject's Metadata
- Table is named after the `schema_name` field from ExtractedObject
- Collection from Metadata becomes part of the partition key to ensure:
  - Natural data distribution across Cassandra nodes
  - Efficient queries within a specific collection
  - Logical isolation between different data imports/sources
- Primary key structure: `PRIMARY KEY ((collection, <schema_primary_key_fields>), <clustering_keys>)`
  - Collection is always the first component of the partition key
  - Schema-defined primary key fields follow as part of the composite partition key
  - This requires queries to specify the collection, ensuring predictable performance
- Field definitions map to Cassandra columns with type conversions:
  - `string` → `text`
  - `integer` → `int` or `bigint` based on size hint
  - `float` → `float` or `double` based on precision needs
  - `boolean` → `boolean`
  - `timestamp` → `timestamp`
  - `enum` → `text` with application-level validation
- Indexed fields create Cassandra secondary indexes (excluding fields already in the primary key)
|
||||
- Required fields are enforced at the application level (Cassandra doesn't support NOT NULL)
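
The table mapping above can be sketched as a function that builds CQL text from a schema definition. This is an illustration only, with several simplifying assumptions: `integer` always maps to `int` and `float` to `double` (the size and precision hints are ignored), clustering keys are omitted, and identifiers are used unquoted without sanitisation. The names `TYPE_MAP`, `create_table_cql`, and `create_index_cql` are hypothetical.

```python
# Simplified type mapping; the spec allows int/bigint and float/double
# depending on hints that this sketch ignores.
TYPE_MAP = {
    "string": "text",
    "integer": "int",
    "float": "double",
    "boolean": "boolean",
    "timestamp": "timestamp",
    "enum": "text",
}

def create_table_cql(user, schema):
    """Build a CREATE TABLE statement: keyspace = user, table = schema name."""
    columns = ["collection text"]
    partition_key = ["collection"]  # collection is always first
    for field in schema["fields"]:
        columns.append(f"{field['name']} {TYPE_MAP[field['type']]}")
        if field.get("primary_key"):
            partition_key.append(field["name"])
    return (
        f"CREATE TABLE IF NOT EXISTS {user}.{schema['name']} "
        f"({', '.join(columns)}, "
        f"PRIMARY KEY (({', '.join(partition_key)})))"
    )

def create_index_cql(user, schema):
    """Build CREATE INDEX statements, skipping primary key fields."""
    primary = {f["name"] for f in schema["fields"] if f.get("primary_key")}
    return [
        f"CREATE INDEX IF NOT EXISTS ON {user}.{schema['name']} ({name})"
        for name in schema.get("indexes", [])
        if name not in primary
    ]
```

Using `IF NOT EXISTS` keeps table and index creation safe to repeat, which fits a writer that may see the same schema again after a restart.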
|
||||
|
||||
**Object Storage**:
|
||||
- Extracts values from ExtractedObject.values map
|
||||
- Performs type conversion and validation before insertion
|
||||
- Handles missing optional fields gracefully
|
||||
- Maintains metadata about object provenance (source document, confidence scores)
|
||||
- Supports idempotent writes to handle message replay scenarios
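
The conversion and validation step might look like the sketch below, which takes the string-valued map from an ExtractedObject and a schema definition in the JSON shape used by the configuration system. The details are assumptions made for illustration: timestamps are taken to be ISO 8601 strings, booleans accept a few common spellings, an empty string counts as missing, and `convert_values` is a hypothetical name.

```python
from datetime import datetime

def convert_values(schema, values):
    """Convert a field-name -> string map into typed values for insertion."""
    row = {}
    for field in schema["fields"]:
        name = field["name"]
        raw = values.get(name)
        if raw is None or raw == "":
            # Required fields are enforced here, since Cassandra cannot do it.
            if field.get("required") or field.get("primary_key"):
                raise ValueError(f"missing required field: {name}")
            continue  # optional field: simply leave the column unset
        ftype = field["type"]
        if ftype == "integer":
            row[name] = int(raw)
        elif ftype == "float":
            row[name] = float(raw)
        elif ftype == "boolean":
            row[name] = raw.strip().lower() in ("true", "1", "yes")
        elif ftype == "timestamp":
            row[name] = datetime.fromisoformat(raw)
        else:
            row[name] = raw
        if "enum" in field and row[name] not in field["enum"]:
            raise ValueError(f"invalid value for {name}: {raw!r}")
    return row
```

Because the primary key values come from the object itself, writing the same object twice targets the same row, which is one way the idempotency requirement could be met.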
|
||||
|
||||
**Implementation Notes**:
|
||||
- Existing code at `trustgraph-flow/trustgraph/storage/objects/cassandra/` is outdated and doesn't comply with current APIs
|
||||
- Should reference `trustgraph-flow/trustgraph/storage/triples/cassandra` as an example of a working storage processor
|
||||
- Needs evaluation of existing code for any reusable components before deciding to refactor or rewrite
|
||||
|
||||
Module: trustgraph-flow/trustgraph/storage/objects/cassandra
|
||||
|
||||
5. **Structured Query Service**
|
||||
- Accepts structured queries in defined formats
|
||||
- Executes queries against the structured store
|
||||
- Returns objects matching query criteria
|
||||
- Supports pagination and result filtering
|
||||
|
||||
Module: trustgraph-flow/trustgraph/query/objects/cassandra
|
||||
|
||||
6. **Agent Tool Integration**
|
||||
- New tool class for agent frameworks
|
||||
- Enables agents to query structured data stores
|
||||
- Provides natural language and structured query interfaces
|
||||
- Integrates with existing agent decision-making processes
|
||||
|
||||
7. **Structured Data Ingestion Service**
|
||||
- Accepts structured data in multiple formats (JSON, CSV, XML)
|
||||
- Parses and validates incoming data against defined schemas
|
||||
- Converts data into normalized object streams
|
||||
- Emits objects to appropriate message queues for processing
|
||||
- Supports bulk uploads and streaming ingestion
|
||||
|
||||
Module: trustgraph-flow/trustgraph/decoding/structured
|
||||
|
||||
8. **Object Embedding Service**
|
||||
- Generates vector embeddings for structured objects
|
||||
- Enables semantic search across structured data
|
||||
- Supports hybrid search combining structured queries with semantic similarity
|
||||
- Integrates with existing vector stores
|
||||
|
||||
Module: trustgraph-flow/trustgraph/embeddings/object_embeddings/qdrant
|
||||
|
||||
### Data Models
|
||||
|
||||
#### Schema Storage Mechanism
|
||||
|
||||
Schemas are stored in TrustGraph's configuration system using the following structure:
|
||||
|
||||
- **Type**: `schema` (fixed value for all structured data schemas)
|
||||
- **Key**: The unique name/identifier of the schema (e.g., `customer_records`, `transaction_log`)
|
||||
- **Value**: JSON schema definition containing the structure
|
||||
|
||||
Example configuration entry:
|
||||
```
Type: schema
Key: customer_records
Value: {
  "name": "customer_records",
  "description": "Customer information table",
  "fields": [
    {
      "name": "customer_id",
      "type": "string",
      "primary_key": true
    },
    {
      "name": "name",
      "type": "string",
      "required": true
    },
    {
      "name": "email",
      "type": "string",
      "required": true
    },
    {
      "name": "registration_date",
      "type": "timestamp"
    },
    {
      "name": "status",
      "type": "string",
      "enum": ["active", "inactive", "suspended"]
    }
  ],
  "indexes": ["email", "registration_date"]
}
```
|
||||
|
||||
This approach allows:
|
||||
- Dynamic schema definition without code changes
|
||||
- Easy schema updates and versioning
|
||||
- Consistent integration with existing TrustGraph configuration management
|
||||
- Support for multiple schemas within a single deployment
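
A consumer of these configuration entries would need to decode the JSON value into something it can work with. The sketch below shows one possible decoding using dataclasses; `FieldDef`, `SchemaDef`, and `parse_schema` are hypothetical names, and the check that every index refers to a declared field is an assumption rather than a stated requirement.

```python
import json
from dataclasses import dataclass, field

@dataclass
class FieldDef:
    name: str
    type: str
    primary_key: bool = False
    required: bool = False
    enum: list = field(default_factory=list)

@dataclass
class SchemaDef:
    name: str
    description: str
    fields: list
    indexes: list

def parse_schema(value):
    """Decode the JSON value of a `schema` configuration entry."""
    doc = json.loads(value)
    fields = [
        FieldDef(
            name=item["name"],
            type=item["type"],
            primary_key=item.get("primary_key", False),
            required=item.get("required", False),
            enum=item.get("enum", []),
        )
        for item in doc["fields"]
    ]
    known = {f.name for f in fields}
    unknown = [name for name in doc.get("indexes", []) if name not in known]
    if unknown:
        raise ValueError(f"indexes refer to unknown fields: {unknown}")
    return SchemaDef(
        name=doc["name"],
        description=doc.get("description", ""),
        fields=fields,
        indexes=doc.get("indexes", []),
    )
```

Validating at decode time means a malformed schema is rejected when the configuration is loaded, before any table creation or extraction depends on it.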
|
||||
|
||||
### APIs
|
||||
|
||||
New APIs:
|
||||
- Pulsar schemas for above types
|
||||
- Pulsar interfaces in new flows
|
||||
- Need a means to specify schema types in flows so that flows know which schema types to load
|
||||
- APIs added to gateway and rev-gateway
|
||||
|
||||
Modified APIs:
|
||||
- Knowledge extraction endpoints - Add structured object output option
|
||||
- Agent endpoints - Add structured data tool support
|
||||
|
||||
### Implementation Details
|
||||
|
||||
These are new processing modules that follow existing conventions. Everything lives in the trustgraph-flow packages, except for schema items, which live in trustgraph-base.

Some UI work is needed in the Workbench to demo or pilot this capability.
|
||||
|
||||
## Security Considerations
|
||||
|
||||
No extra considerations.
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
There are open questions about how to structure Cassandra queries and indexes so that query performance does not degrade.
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
Use the existing test strategy: build unit, contract, and integration tests.
|
||||
|
||||
## Migration Plan
|
||||
|
||||
None.
|
||||
|
||||
## Timeline
|
||||
|
||||
Not specified.
|
||||
|
||||
## Open Questions
|
||||
|
||||
- Can this be made to work with other store types? The aim is to define interfaces so that modules written for one store can be applied to other stores.
|
||||
|
||||
## References
|
||||
|
||||
n/a.
|
||||
|
||||