mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-04-25 16:36:21 +02:00

2025-09-20 16:00:37 +01:00

Structured Data Diagnostic Service Technical Specification

Overview

This specification describes a new invokable service for diagnosing and analyzing structured data within TrustGraph. The service extracts functionality from the existing tg-load-structured-data command-line tool and exposes it as a request/response service, enabling programmatic access to data type detection and descriptor generation capabilities.

The service supports three primary operations:

Data Type Detection: Analyze a data sample to determine its format (CSV, JSON, or XML)
Descriptor Generation: Generate a TrustGraph structured data descriptor for a given data sample and type
Combined Diagnosis: Perform both type detection and descriptor generation in sequence

Goals

Modularize Data Analysis: Extract data diagnosis logic from the CLI into reusable service components
Enable Programmatic Access: Provide API-based access to data analysis capabilities
Support Multiple Data Formats: Handle CSV, JSON, and XML data formats consistently
Generate Accurate Descriptors: Produce structured data descriptors that accurately map source data to TrustGraph schemas
Maintain Backward Compatibility: Ensure existing CLI functionality continues to work
Enable Service Composition: Allow other services to leverage data diagnosis capabilities
Improve Testability: Separate business logic from the CLI so it can be tested in isolation
Support Streaming Analysis: Enable analysis of data samples without loading entire files

Background

Currently, the tg-load-structured-data command provides comprehensive functionality for analyzing structured data and generating descriptors. However, this functionality is tightly coupled to the CLI, which limits its reuse.

Current limitations include:

Data diagnosis logic embedded in CLI code
No programmatic access to type detection and descriptor generation
Difficult to integrate diagnosis capabilities into other services
Limited ability to compose data analysis workflows

This specification addresses these gaps by creating a dedicated service for structured data diagnosis. By exposing these capabilities as a service, TrustGraph can:

Enable other services to analyze data programmatically
Support more complex data processing pipelines
Facilitate integration with external systems
Improve maintainability through separation of concerns

Technical Design

Architecture

The structured data diagnostic service requires the following technical components:

Diagnostic Service Processor
- Handles incoming diagnosis requests
- Orchestrates type detection and descriptor generation
- Returns structured responses with diagnosis results
Module: trustgraph-flow/trustgraph/diagnosis/structured_data/service.py

Data Type Detector
- Uses algorithmic detection to identify data format (CSV, JSON, XML)
- Analyzes data structure, delimiters, and syntax patterns
- Returns detected format and confidence scores
Module: trustgraph-flow/trustgraph/diagnosis/structured_data/type_detector.py

Descriptor Generator
- Uses prompt service to generate descriptors
- Invokes format-specific prompts (diagnose-csv, diagnose-json, diagnose-xml)
- Maps data fields to TrustGraph schema fields through prompt responses
Module: trustgraph-flow/trustgraph/diagnosis/structured_data/descriptor_generator.py
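
The algorithmic detection described for the Data Type Detector could look roughly like the sketch below. It is an illustration only, not the contents of type_detector.py: the function name detect_type, the candidate delimiters, and the confidence values are all assumptions chosen for the example.

```python
import csv
import json
import xml.etree.ElementTree as ET
from typing import Tuple


def detect_type(sample: str) -> Tuple[str, float]:
    """Return (format, confidence) for a text sample; heuristic only."""
    text = sample.strip()
    if not text:
        return ("unknown", 0.0)

    # JSON: a sample that parses completely is a strong signal.
    if text[0] in "{[":
        try:
            json.loads(text)
            return ("json", 1.0)
        except json.JSONDecodeError:
            # A truncated sample may not parse; keep a weaker guess.
            return ("json", 0.5)

    # XML: same idea, using the standard-library parser.
    if text[0] == "<":
        try:
            ET.fromstring(text)
            return ("xml", 1.0)
        except ET.ParseError:
            return ("xml", 0.5)

    # CSV: look for a delimiter that yields one consistent column count.
    lines = text.splitlines()
    for delimiter in (",", "\t", ";", "|"):
        rows = list(csv.reader(lines, delimiter=delimiter))
        widths = {len(row) for row in rows if row}
        if len(widths) == 1 and widths.pop() > 1:
            # Assumed scores: more rows give more evidence.
            return ("csv", 0.9 if len(rows) > 1 else 0.6)

    return ("unknown", 0.0)
```

Checking JSON and XML before CSV is a deliberate ordering in this sketch, since almost any text can be read as single-column CSV.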

Data Models

StructuredDataDiagnosisRequest

Request message for structured data diagnosis operations:

from typing import Any, Dict, Optional

class StructuredDataDiagnosisRequest:
    operation: str  # "detect-type", "generate-descriptor", or "diagnose"
    sample: str     # Data sample to analyze (text content)
    type: Optional[str]  # Data type (csv, json, xml) - required for generate-descriptor
    schema_name: Optional[str]  # Target schema name for descriptor generation
    options: Dict[str, Any]  # Additional options (e.g., delimiter for CSV)
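
The field comments above imply a few validation rules, most notably that type is required for generate-descriptor. A minimal sketch of those checks follows. DiagnosisRequest is a hypothetical dataclass stand-in for the real message class, and validate_request is an assumed helper name, not part of the specified interface.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

OPERATIONS = {"detect-type", "generate-descriptor", "diagnose"}
DATA_TYPES = {"csv", "json", "xml"}


@dataclass
class DiagnosisRequest:
    # Field names mirror StructuredDataDiagnosisRequest; defaults are assumed.
    operation: str
    sample: str
    type: Optional[str] = None
    schema_name: Optional[str] = None
    options: Dict[str, Any] = field(default_factory=dict)


def validate_request(req: DiagnosisRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is usable."""
    problems = []
    if req.operation not in OPERATIONS:
        problems.append(f"unknown operation: {req.operation!r}")
    if not req.sample.strip():
        problems.append("sample is empty")
    if req.operation == "generate-descriptor" and req.type is None:
        problems.append("type is required for generate-descriptor")
    if req.type is not None and req.type not in DATA_TYPES:
        problems.append(f"unsupported type: {req.type!r}")
    return problems
```

Returning every problem at once, rather than raising on the first, lets the service put a complete message in the response's error field.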

StructuredDataDiagnosisResponse

Response message containing diagnosis results:

from typing import Any, Dict, Optional

class StructuredDataDiagnosisResponse:
    operation: str  # The operation that was performed
    detected_type: Optional[str]  # Detected data type (for detect-type/diagnose)
    confidence: Optional[float]  # Confidence score for type detection
    descriptor: Optional[Dict]  # Generated descriptor (for generate-descriptor/diagnose)
    error: Optional[str]  # Error message if operation failed
    metadata: Dict[str, Any]  # Additional metadata (e.g., field count, sample records)

Descriptor Structure

The generated descriptor follows the existing structured data descriptor format:

{
  "format": {
    "type": "csv",
    "encoding": "utf-8",
    "options": {
      "delimiter": ",",
      "has_header": true
    }
  },
  "mappings": [
    {
      "source_field": "customer_id",
      "target_field": "id",
      "transforms": [
        {"type": "trim"}
      ]
    }
  ],
  "output": {
    "schema_name": "customer",
    "options": {
      "batch_size": 1000,
      "confidence": 0.9
    }
  }
}
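
To make the mappings section concrete, the sketch below applies the example mapping to a single source record. It is not the loader's actual implementation: apply_mappings and the TRANSFORMS table are hypothetical names, and only the trim transform from the example is handled.

```python
from typing import Any, Callable, Dict

# Only the "mappings" section of the example descriptor is needed here.
descriptor = {
    "mappings": [
        {
            "source_field": "customer_id",
            "target_field": "id",
            "transforms": [{"type": "trim"}],
        }
    ]
}

# Transform name -> function. "trim" is the only transform in the example.
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "trim": lambda value: value.strip(),
}


def apply_mappings(record: Dict[str, str], descriptor: Dict[str, Any]) -> Dict[str, str]:
    """Map one source record to target fields, applying transforms in order."""
    result = {}
    for mapping in descriptor["mappings"]:
        value = record[mapping["source_field"]]
        for transform in mapping.get("transforms", []):
            value = TRANSFORMS[transform["type"]](value)
        result[mapping["target_field"]] = value
    return result


row = apply_mappings({"customer_id": "  C-42 "}, descriptor)
```

Here row holds the value of customer_id under the target field id, with surrounding whitespace removed.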

Service Interface

The service will expose the following operations through the request/response pattern:

Type Detection Operation
- Input: Data sample
- Processing: Analyze data structure using algorithmic detection
- Output: Detected type with confidence score

Descriptor Generation Operation
- Input: Data sample, type, target schema name
- Processing:
  - Call prompt service with format-specific prompt ID (diagnose-csv, diagnose-json, or diagnose-xml)
  - Pass data sample and available schemas to prompt
  - Receive generated descriptor from prompt response
- Output: Structured data descriptor

Combined Diagnosis Operation
- Input: Data sample, optional schema name
- Processing:
  - Use algorithmic detection to identify format first
  - Select appropriate format-specific prompt based on detected type
  - Call prompt service to generate descriptor
- Output: Both detected type and descriptor
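
The three operations above can be sketched as a single dispatch function. This is a simplified illustration, not the FlowProcessor-based service itself: handle, DetectFn, and DescriptorFn are assumed names, requests and responses are plain dicts, and the detector and generator are passed in as callables so the routing can be exercised without a prompt service.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Assumed callable shapes for the two collaborators.
DetectFn = Callable[[str], Tuple[str, float]]
DescriptorFn = Callable[[str, str, Optional[str]], Dict[str, Any]]


def handle(request: Dict[str, Any], detect: DetectFn, generate: DescriptorFn) -> Dict[str, Any]:
    """Route one request to the matching operation and build a response dict."""
    operation = request.get("operation")
    sample = request.get("sample", "")
    response: Dict[str, Any] = {"operation": operation, "error": None}
    try:
        if operation == "detect-type":
            response["detected_type"], response["confidence"] = detect(sample)
        elif operation == "generate-descriptor":
            response["descriptor"] = generate(
                sample, request["type"], request.get("schema_name")
            )
        elif operation == "diagnose":
            # Detect first, then feed the detected type to the generator.
            detected, confidence = detect(sample)
            response["detected_type"] = detected
            response["confidence"] = confidence
            response["descriptor"] = generate(
                sample, detected, request.get("schema_name")
            )
        else:
            response["error"] = f"unknown operation: {operation!r}"
    except Exception as exc:
        # Report failures in the response instead of raising to the caller.
        response["error"] = str(exc)
    return response
```

Injecting the two callables keeps the business logic separate from transport and prompt concerns, which is in line with the testability goal stated earlier.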

Implementation Details

The service will follow TrustGraph service conventions:

Service Registration
- Register as the structured-diag service type
- Use standard request/response topics
- Extend the FlowProcessor base class
- Register PromptClientSpec for prompt service interaction

Configuration Management
- Access schema configurations via config service
- Cache schemas for performance
- Handle configuration updates dynamically

Prompt Integration
- Use existing prompt service infrastructure
- Call prompt service with format-specific prompt IDs:
  - diagnose-csv: For CSV data analysis
  - diagnose-json: For JSON data analysis
  - diagnose-xml: For XML data analysis
- Prompts are configured in the prompt configuration, not hard-coded in the service
- Pass schemas and data samples as prompt variables
- Parse prompt responses to extract descriptors

Error Handling
- Validate input data samples
- Provide descriptive error messages
- Handle malformed data gracefully
- Handle prompt service failures

Data Sampling
- Process configurable sample sizes
- Handle incomplete records appropriately
- Maintain sampling consistency
API Integration

The service will integrate with existing TrustGraph APIs:

Modified Components:

tg-load-structured-data CLI - Refactored to use the new service for diagnosis operations
Flow API - Extended to support structured data diagnosis requests

New Service Endpoints:

/api/v1/flow/{flow}/diagnose/structured-data - WebSocket endpoint for diagnosis requests
/api/v1/diagnose/structured-data - REST endpoint for synchronous diagnosis

Message Flow

Client → Gateway → Structured Diag Service → Config Service (for schemas)
                                           ↓
                                    Type Detector (algorithmic)
                                           ↓
                                    Prompt Service (diagnose-csv/json/xml)
                                           ↓
                                 Descriptor Generator (parses prompt response)
                                           ↓
Client ← Gateway ← Structured Diag Service (response)

Security Considerations

Input validation to prevent injection attacks
Size limits on data samples to prevent DoS
Sanitization of generated descriptors
Access control through existing TrustGraph authentication
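
Two of the points above, size limits and descriptor sanitization, can be illustrated with a short sketch. The limit value, the allowed-key set (taken from the descriptor example earlier in this document), and both function names are assumptions; the actual maximum sample size is still an open question in this specification.

```python
from typing import Any, Dict

MAX_SAMPLE_BYTES = 1_000_000  # Hypothetical limit; the spec leaves this open.
ALLOWED_KEYS = {"format", "mappings", "output"}


def check_sample_size(sample: str, limit: int = MAX_SAMPLE_BYTES) -> None:
    """Reject oversized samples before any parsing or prompt call happens."""
    size = len(sample.encode("utf-8"))
    if size > limit:
        raise ValueError(f"sample is {size} bytes; limit is {limit}")


def sanitize_descriptor(descriptor: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the top-level sections shown in the descriptor example."""
    return {key: value for key, value in descriptor.items() if key in ALLOWED_KEYS}
```

Measuring the encoded byte length rather than the character count keeps the limit meaningful for multi-byte text.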

Performance Considerations

Cache schema definitions to reduce config service calls
Limit sample sizes to maintain responsive performance
Use streaming processing for large data samples
Implement timeout mechanisms for long-running analyses
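
Limiting sample sizes without loading an entire file can be done by reading lazily from an iterator of lines. The sketch below assumes line-oriented input such as CSV; take_sample and its default limits are hypothetical, and a JSON or XML document spanning many lines would need a different cut-off rule.

```python
import io
from itertools import islice
from typing import Iterable


def take_sample(lines: Iterable[str], max_lines: int = 100, max_chars: int = 65536) -> str:
    """Read at most max_lines lines, stopping early once max_chars is reached."""
    kept = []
    total = 0
    for line in islice(lines, max_lines):
        if total + len(line) > max_chars:
            break  # Drop the whole line rather than keep a partial record.
        kept.append(line)
        total += len(line)
    return "".join(kept)


# Any iterable of lines works, including an open file object.
stream = io.StringIO("id,name\n1,Ann\n2,Bob\n3,Cy\n")
sample = take_sample(stream, max_lines=2)
```

Because islice pulls lines on demand, only the sampled prefix is ever read from the underlying stream.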

Testing Strategy

Unit Tests
- Type detection for various data formats
- Descriptor generation accuracy
- Error handling scenarios

Integration Tests
- Service request/response flow
- Schema retrieval and caching
- CLI integration

Performance Tests
- Large sample processing
- Concurrent request handling
- Memory usage under load

Migration Plan

Phase 1: Implement service with core functionality
Phase 2: Refactor CLI to use service (maintain backward compatibility)
Phase 3: Add REST API endpoints
Phase 4: Deprecate embedded CLI logic (with notice period)

Timeline

Week 1-2: Implement core service and type detection
Week 3-4: Add descriptor generation and integration
Week 5: Testing and documentation
Week 6: CLI refactoring and migration

Open Questions

Should the service support additional data formats (e.g., Parquet, Avro)?
What should be the maximum sample size for analysis?
Should diagnosis results be cached for repeated requests?
How should the service handle multi-schema scenarios?
Should the prompt IDs be configurable parameters for the service?

References

Structured Data Descriptor Specification
Structured Data Loading Documentation
tg-load-structured-data implementation: trustgraph-cli/trustgraph/cli/load_structured_data.py
