The Structured Data Descriptor is a JSON-based configuration language that describes how to parse, transform, and import structured data into TrustGraph. It provides a declarative approach to data ingestion, supporting multiple input formats and complex transformation pipelines without requiring custom code.
## Core Concepts
### 1. Format Definition
Describes the input file type and parsing options. Determines which parser to use and how to interpret the source data.
### 2. Field Mappings
Maps source paths to target fields with transformations. Defines how data flows from input sources to output schema fields.
### 3. Transform Pipeline
Chain of data transformations that can be applied to field values, including:
- Data cleaning (trim, normalize)
- Format conversion (date parsing, type casting)
- Calculations (arithmetic, string manipulation)
- Lookups (reference tables, substitutions)
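As a rough illustration of how such a chain might behave, the sketch below applies a list of transform steps in order, passing each result to the next step. The transform names (`trim`, `upper`, `to_int`, `parse_date`, `lookup`) and the step layout (`{"type": ...}` plus parameters) are assumptions made for this example, not the descriptor's confirmed vocabulary.

```python
from datetime import datetime

# Hypothetical transform registry: names and parameters are illustrative
# assumptions, not the descriptor's actual transform vocabulary.
TRANSFORMS = {
    "trim": lambda value, **_: value.strip(),
    "upper": lambda value, **_: value.upper(),
    "to_int": lambda value, **_: int(value),
    "parse_date": lambda value, fmt="%Y-%m-%d", **_: (
        datetime.strptime(value, fmt).date().isoformat()
    ),
    "lookup": lambda value, table=None, default=None, **_: (
        (table or {}).get(value, default)
    ),
}


def apply_transforms(value, steps):
    """Apply each transform step in order, feeding the result to the next."""
    for step in steps:
        params = {k: v for k, v in step.items() if k != "type"}
        value = TRANSFORMS[step["type"]](value, **params)
    return value


# Cleaning, then case conversion, then a lookup substitution.
steps = [
    {"type": "trim"},
    {"type": "upper"},
    {"type": "lookup", "table": {"GB": "United Kingdom"}, "default": "Unknown"},
]
print(apply_transforms("  gb ", steps))
```

Order matters here: the lookup only succeeds because the value has already been trimmed and upper-cased by the earlier steps.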
### 4. Validation Rules
Data quality checks applied to ensure data integrity:
- Type validation
- Range checks
- Pattern matching (regex)
- Required field validation
- Custom validation logic
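The checks listed above could be evaluated along the following lines. This is a minimal sketch: the rule keys (`type`, `expected`, `min`, `max`, `regex`, `check`) and the rule names are hypothetical, chosen only to show one check of each kind.

```python
import re

# Hypothetical mapping from rule type names to Python types (an assumption).
PYTHON_TYPES = {"string": str, "integer": int, "float": float}


def validate(value, rules):
    """Return a list of error messages; an empty list means the value passed."""
    errors = []
    missing = value is None or value == ""
    for rule in rules:
        kind = rule["type"]
        if kind == "required":
            if missing:
                errors.append("value is required")
        elif missing:
            continue  # the remaining checks only apply to present values
        elif kind == "type":
            if not isinstance(value, PYTHON_TYPES[rule["expected"]]):
                errors.append(f"expected {rule['expected']}")
        elif kind == "range":
            low = rule.get("min", float("-inf"))
            high = rule.get("max", float("inf"))
            if not low <= value <= high:
                errors.append(f"{value} is outside [{low}, {high}]")
        elif kind == "pattern":
            if not re.fullmatch(rule["regex"], str(value)):
                errors.append(f"{value!r} does not match {rule['regex']}")
        elif kind == "custom":
            # Custom logic is modelled as a callable returning True or False.
            if not rule["check"](value):
                errors.append(rule.get("message", "custom check failed"))
    return errors


age_rules = [
    {"type": "required"},
    {"type": "type", "expected": "integer"},
    {"type": "range", "min": 0, "max": 120},
]
email_rules = [{"type": "pattern", "regex": r"[^@\s]+@[^@\s]+\.[^@\s]+"}]

print(validate(34, age_rules))
print(validate(150, age_rules))
print(validate("not-an-email", email_rules))
```

Collecting every error rather than stopping at the first one makes it easier to report all data quality problems for a record in a single pass.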
### 5. Global Settings
Configuration that applies across the entire import process:
- Lookup tables for data enrichment
- Global variables and constants
- Output format specifications
- Error handling policies
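To make the global settings more concrete, the following sketch loads a descriptor fragment and reads a shared lookup table from it. Only `globals`, `output`, `format: "trustgraph-objects"`, and `schema_name` are named elsewhere in this document; the nested keys (`variables`, `lookup_tables`, `error_handling`, `on_error`) and their values are assumptions for illustration.

```python
import json

# Descriptor fragment with hypothetical nested key names (assumptions).
DESCRIPTOR_FRAGMENT = """
{
  "globals": {
    "variables": {"default_country": "GB"},
    "lookup_tables": {
      "country_codes": {"GB": "United Kingdom", "FR": "France"}
    }
  },
  "output": {
    "format": "trustgraph-objects",
    "schema_name": "customer",
    "error_handling": {"on_error": "skip_record"}
  }
}
"""

descriptor = json.loads(DESCRIPTOR_FRAGMENT)
globals_cfg = descriptor["globals"]


def lookup(table_name, key, default=None):
    """Resolve a key against a named global lookup table."""
    return globals_cfg["lookup_tables"].get(table_name, {}).get(key, default)


print(lookup("country_codes", "FR"))
print(lookup("country_codes", "XX", default="Unknown"))
print(descriptor["output"]["schema_name"])
```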
## Implementation Strategy
The importer implementation follows this pipeline:
1. **Parse Configuration** - Load and validate the JSON descriptor
2. **Initialize Parser** - Load appropriate parser (CSV, XML, JSON, etc.) based on `format.type`
3. **Apply Preprocessing** - Execute global filters and transformations
4. **Process Records** - For each input record:
   - Extract data using source paths (JSONPath, XPath, column names)
   - Apply field-level transforms in sequence
   - Validate results against defined rules
   - Apply default values for missing data
5. **Apply Postprocessing** - Execute deduplication, aggregation, etc.
6. **Generate Output** - Produce data in specified target format
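The pipeline above can be sketched end to end for the CSV case. This is an illustrative outline under stated assumptions, not the actual importer: the mapping keys (`source`, `target`, `transforms`, `required`, `default`), the `deduplicate` step, and the shape of the returned dictionary are all hypothetical, and the preprocessing step is omitted for brevity.

```python
import csv
import io
import json

# Descriptor with hypothetical mapping and postprocessing keys (assumptions).
DESCRIPTOR_JSON = """
{
  "version": "1.0",
  "format": {"type": "csv", "options": {"delimiter": ","}},
  "mappings": [
    {"source": "name", "target": "full_name",
     "transforms": ["trim"], "required": true},
    {"source": "country", "target": "country",
     "transforms": ["trim", "upper"], "default": "UNKNOWN"}
  ],
  "postprocessing": [{"type": "deduplicate", "key": "full_name"}],
  "output": {"format": "trustgraph-objects", "schema_name": "person"}
}
"""

TRANSFORMS = {"trim": str.strip, "upper": str.upper}


def parse_records(text, fmt):
    """Step 2: choose a parser based on format.type (only CSV is sketched)."""
    if fmt["type"] != "csv":
        raise ValueError(f"unsupported format type: {fmt['type']}")
    delimiter = fmt.get("options", {}).get("delimiter", ",")
    return csv.DictReader(io.StringIO(text), delimiter=delimiter)


def map_record(record, mappings):
    """Step 4: extract, transform, validate, then apply defaults."""
    out = {}
    for mapping in mappings:
        value = record.get(mapping["source"])
        if value is not None:
            for name in mapping.get("transforms", []):
                value = TRANSFORMS[name](value)
        if value in (None, ""):
            if mapping.get("required"):
                return None  # reject records missing a required field
            value = mapping.get("default")
        out[mapping["target"]] = value
    return out


def run_import(descriptor_text, data_text):
    descriptor = json.loads(descriptor_text)  # step 1
    records = parse_records(data_text, descriptor["format"])
    mapped = (map_record(rec, descriptor["mappings"]) for rec in records)
    rows = [row for row in mapped if row is not None]
    for step in descriptor.get("postprocessing", []):  # step 5
        if step["type"] == "deduplicate":
            seen, unique = set(), []
            for row in rows:
                if row[step["key"]] not in seen:
                    seen.add(row[step["key"]])
                    unique.append(row)
            rows = unique
    # Step 6: placeholder output shape, not the real TrustGraph object format.
    return {"schema_name": descriptor["output"]["schema_name"], "objects": rows}


SAMPLE = "name,country\n Ada Lovelace ,gb\nAda Lovelace,GB\nAlan Turing,\n,us\n"
result = run_import(DESCRIPTOR_JSON, SAMPLE)
print(result)
```

In this sample the duplicate "Ada Lovelace" row is dropped by deduplication, the empty country for "Alan Turing" falls back to the default, and the row with no name is rejected because the field is marked required.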
## Path Expression Support
Source paths are written in the path expression language that matches the input format:
- **CSV**: Column names or indices (`"column_name"` or `"[2]"`)
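A CSV path resolver might distinguish the two forms as follows. This is a sketch only: the helper name `resolve_csv_path` is hypothetical, and treating `"[2]"` as a zero-based index is an assumption, since the indexing base is not stated here.

```python
import re

# Matches index paths such as "[2]"; anything else is treated as a column name.
INDEX_PATH = re.compile(r"\[(\d+)\]")


def resolve_csv_path(path, header, row):
    """Return the cell selected by a column name or an index path."""
    match = INDEX_PATH.fullmatch(path)
    if match:
        # Assumption: indices are zero-based.
        return row[int(match.group(1))]
    return row[header.index(path)]


header = ["id", "name", "email"]
row = ["7", "Ada", "ada@example.org"]

print(resolve_csv_path("name", header, row))
print(resolve_csv_path("[2]", header, row))
```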
The following prompt can be used to have an LLM analyze sample data and generate a descriptor configuration:
```
I need you to analyze the provided data sample and create a Structured Data Descriptor configuration in JSON format.
The descriptor should follow this specification:
- version: "1.0"
- metadata: Configuration name, description, author, and creation date
- format: Input format type and parsing options
- globals: Variables, lookup tables, and constants
- preprocessing: Filters and transformations applied before mapping
- mappings: Field-by-field mapping from source to target with transformations and validations
- postprocessing: Operations like deduplication or aggregation
- output: Target format and error handling configuration
ANALYZE THE DATA:
1. Identify the format (CSV, JSON, XML, etc.)
2. Detect delimiters, encodings, and structure
3. Find data types for each field
4. Identify patterns and constraints
5. Look for fields that need cleaning or transformation
6. Find relationships between fields
7. Identify lookup opportunities (codes that map to values)
8. Detect required vs optional fields
CREATE THE DESCRIPTOR:
For each field in the sample data:
- Map it to an appropriate target field name
- Add necessary transformations (trim, case conversion, type casting)
- Include appropriate validations (required, patterns, ranges)
- Set defaults for missing values
Include preprocessing if needed:
- Filters to exclude invalid records
- Sorting requirements
Include postprocessing if beneficial:
- Deduplication on key fields
- Aggregation for summary data
Configure output for TrustGraph:
- format: "trustgraph-objects"
- schema_name: Based on the data entity type
- Appropriate error handling
DATA SAMPLE:
[Insert data sample here]
ADDITIONAL CONTEXT (optional):
- Target schema name: [if known]
- Business rules: [any specific requirements]
- Data quality issues to address: [known problems]
Generate a complete, valid Structured Data Descriptor configuration that will properly import this data into TrustGraph. Include comments explaining key decisions.
```
### Example Usage Prompt
```
I need you to analyze the provided data sample and create a Structured Data Descriptor configuration in JSON format.