Feat: TrustGraph i18n & Documentation Translation Updates (#781)

Native CLI i18n: The TrustGraph CLI has built-in translation support that dynamically loads language strings. You can test and use different languages by passing the --lang flag (e.g., --lang es for Spanish, --lang ru for Russian) or by setting your environment's LANG variable.

Automated Docs Translations: This PR introduces machine-translated Markdown documentation in several target languages, including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew, Arabic, Simplified Chinese, and Russian.
2026-06-23 05:38:07 +02:00 · 2026-04-14 07:07:58 -04:00 · 8954fa3ad7
commit 8954fa3ad7
parent f976f1b6fe
560 changed files with 236300 additions and 99 deletions
--- a/docs/tech-specs/structured-data-descriptor.es.md
+++ b/docs/tech-specs/structured-data-descriptor.es.md
@@ -0,0 +1,567 @@
+---
+layout: default
+title: "Structured Data Descriptor Specification"
+parent: "Spanish (Beta)"
+---
+
+# Structured Data Descriptor Specification
+
+> **Beta Translation:** This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.
+
+## Overview
+
+The Structured Data Descriptor is a JSON-based configuration language that describes how to parse, transform, and import structured data into TrustGraph. It provides a declarative approach to data ingestion, supporting multiple input formats and complex transformation pipelines without requiring custom code.
+
+## Core Concepts
+
+### 1. Format Definition
+Describes the input file type and parsing options. It determines which parser to use and how to interpret the source data.
+
+### 2. Field Mappings
+Maps source paths to target fields with transformations. It defines how data flows from input sources to output schema fields.
+
+### 3. Transform Pipeline
+A chain of data transformations that can be applied to field values, including:
+
+- Data cleaning (trimming, normalization)
+- Format conversion (date parsing, type casting)
+- Calculations (arithmetic, string manipulation)
+- Lookups (reference tables, substitutions)
+
+### 4. Validation Rules
+Data quality checks applied to ensure data integrity:
+
+- Type validation
+- Range checks
+- Pattern matching (regular expressions)
+- Required-field validation
+- Custom validation logic
+
+### 5. Global Settings
+Configuration that applies to the entire import process:
+
+- Lookup tables for data enrichment
+- Global variables and constants
+- Output format specifications
+- Error handling policies
+
+## Implementation Strategy
+
+The importer implementation follows this pipeline:
+
+1. **Parse Configuration** - Load and validate the JSON descriptor
+2. **Initialize Parser** - Load the appropriate parser (CSV, XML, JSON, etc.) based on `format.type`
+3. **Apply Preprocessing** - Execute global filters and transformations
+4. **Process Records** - For each input record:
+   - Extract data using source paths (JSONPath, XPath, column names)
+   - Apply field-level transforms in sequence
+   - Validate results against the defined rules
+   - Apply default values for missing data
+5. **Apply Postprocessing** - Execute deduplication, aggregation, etc.
+6. **Generate Output** - Produce data in the specified target format
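The six steps above can be sketched in Python for the CSV case. This is a minimal sketch under stated assumptions, not TrustGraph's importer: `run_import` and the `TRANSFORMS` table are hypothetical names, only `trim`, `lower` and `to_int` are implemented, and preprocessing, validation and postprocessing (steps 3 and 5) are left out.

```python
import csv
import io
import json

# Hypothetical transform table; only a few of the spec's transform types are covered.
TRANSFORMS = {
    "trim": lambda value, spec: value.strip(),
    "lower": lambda value, spec: value.lower(),
    "to_int": lambda value, spec: int(value),
}

def run_import(descriptor_json, raw_text):
    """Parse the descriptor, pick a parser from format.type, and map each record."""
    descriptor = json.loads(descriptor_json)                 # step 1: parse configuration
    fmt = descriptor["format"]
    if fmt["type"] != "csv":                                 # step 2: choose the parser
        raise ValueError(f"unsupported format: {fmt['type']}")
    delimiter = fmt.get("options", {}).get("delimiter", ",")
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    output = []
    for record in reader:                                    # step 4: process records
        row = {}
        for mapping in descriptor["mappings"]:
            value = record.get(mapping["source"])
            for spec in mapping.get("transforms", []):       # transforms run in order
                value = TRANSFORMS[spec["type"]](value, spec)
            row[mapping["target_field"]] = value
        output.append(row)
    return output                                            # step 6: generate output

descriptor = json.dumps({
    "format": {"type": "csv", "options": {"delimiter": ","}},
    "mappings": [
        {"target_field": "email", "source": "email_address",
         "transforms": [{"type": "trim"}, {"type": "lower"}]},
        {"target_field": "age", "source": "age",
         "transforms": [{"type": "to_int"}]},
    ],
})
rows = run_import(descriptor, "email_address,age\n JANE@EXAMPLE.COM ,28\n")
print(rows)
```

Keeping the transforms in a lookup table keyed by `type` mirrors the declarative style of the descriptor: supporting a new transform means adding one entry rather than changing the loop.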
+
+## Path Expression Support
+
+Each input format uses the path expression language suited to it:
+
+- **CSV**: Column names or indices (`"column_name"` or `"[2]"`)
+- **JSON**: JSONPath syntax (`"$.user.profile.email"`)
+- **XML**: XPath expressions (`"//product[@id='123']/price"`)
+- **Fixed-width**: Field names from the field definitions
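As a rough illustration of the path styles listed above, the following Python sketch resolves each one with the standard library only. The function names are hypothetical, the CSV index is assumed to be zero-based, the JSON resolver handles only the dotted-key subset of JSONPath, and `xml.etree.ElementTree` supports only a subset of XPath.

```python
import re
import xml.etree.ElementTree as ET

def resolve_csv(row, header, source):
    """Resolve a CSV source: "[2]" is treated as a zero-based index, anything else as a column name."""
    match = re.fullmatch(r"\[(\d+)\]", source)
    if match:
        return row[int(match.group(1))]
    return row[header.index(source)]

def resolve_json(document, source):
    """Resolve only dotted-key JSONPath expressions such as "$.user.profile.email"."""
    if not source.startswith("$"):
        raise ValueError(f"not a JSONPath expression: {source}")
    value = document
    for key in source.lstrip("$").strip(".").split("."):
        if key:
            value = value[key]
    return value

def resolve_xml(root, source):
    """Return the text of the first element matching the expression, or None."""
    # ElementTree rejects absolute paths on an element, so "//x" becomes ".//x".
    path = "." + source if source.startswith("//") else source
    node = root.find(path)
    return None if node is None else node.text

header = ["id", "name", "email"]
row = ["1", "Ann", "ann@example.com"]
doc = {"user": {"profile": {"email": "ann@example.com"}}}
root = ET.fromstring('<catalog><product id="123"><price>9.99</price></product></catalog>')

print(resolve_csv(row, header, "name"))                  # column name
print(resolve_csv(row, header, "[2]"))                   # column index
print(resolve_json(doc, "$.user.profile.email"))         # JSONPath subset
print(resolve_xml(root, "//product[@id='123']/price"))   # XPath subset
```

A full implementation would likely delegate to dedicated JSONPath and XPath libraries; the point here is only that the `source` string is interpreted according to `format.type`.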
+
+## Benefits
+
+- **Single Codebase** - One importer handles multiple input formats
+- **User-Friendly** - Non-technical users can create configurations
+- **Reusable** - Configurations can be shared and versioned
+- **Flexible** - Complex transformations without custom coding
+- **Robust** - Built-in validation and comprehensive error handling
+- **Maintainable** - The declarative approach reduces implementation complexity
+
+## Language Specification
+
+The Structured Data Descriptor uses a JSON configuration format with the following top-level structure:
+
+```json
+{
+  "version": "1.0",
+  "metadata": {
+    "name": "Configuration Name",
+    "description": "Description of what this config does",
+    "author": "Author Name",
+    "created": "2024-01-01T00:00:00Z"
+  },
+  "format": { ... },
+  "globals": { ... },
+  "preprocessing": [ ... ],
+  "mappings": [ ... ],
+  "postprocessing": [ ... ],
+  "output": { ... }
+}
+```
+
+### Format Definition
+
+Describes the input data format and parsing options:
+
+```json
+{
+  "format": {
+    "type": "csv|json|xml|fixed-width|excel|parquet",
+    "encoding": "utf-8",
+    "options": {
+      // Format-specific options
+    }
+  }
+}
+```
+
+#### CSV Format Options
+```json
+{
+  "format": {
+    "type": "csv",
+    "options": {
+      "delimiter": ",",
+      "quote_char": "\"",
+      "escape_char": "\\",
+      "skip_rows": 1,
+      "has_header": true,
+      "null_values": ["", "NULL", "null", "N/A"]
+    }
+  }
+}
+```
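One way the CSV options above could map onto Python's standard `csv` module is sketched below. `read_csv` is a hypothetical helper, not part of TrustGraph. `skip_rows` is deliberately left out because the text does not define how it interacts with `has_header`, and `has_header` is assumed to mean that the first row holds the column names.

```python
import csv
import io

def read_csv(text, options):
    """Yield one dict per data row, honouring a subset of format.options."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=options.get("delimiter", ","),
        quotechar=options.get("quote_char", '"'),
        escapechar=options.get("escape_char"),
    )
    null_values = set(options.get("null_values", []))
    # Assumption: has_header means the first row holds the column names.
    header = next(reader, None) if options.get("has_header", True) else None
    for row in reader:
        # Cells listed in null_values become None so later steps can apply defaults.
        cells = [None if cell in null_values else cell for cell in row]
        # Without a header, fall back to index-style keys such as "[0]".
        keys = header if header is not None else [f"[{i}]" for i in range(len(cells))]
        yield dict(zip(keys, cells))

options = {
    "delimiter": ";",
    "quote_char": '"',
    "escape_char": "\\",
    "has_header": True,
    "null_values": ["", "NULL", "null", "N/A"],
}
records = list(read_csv('name;country\n"Smith; John";US\nJane;N/A\n', options))
print(records)
```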
+
+#### JSON Format Options
+```json
+{
+  "format": {
+    "type": "json",
+    "options": {
+      "root_path": "$.data",
+      "array_mode": "records|single",
+      "flatten": false
+    }
+  }
+}
+```
+
+#### XML Format Options
+```json
+{
+  "format": {
+    "type": "xml",
+    "options": {
+      "root_element": "//records/record",
+      "namespaces": {
+        "ns": "http://example.com/namespace"
+      }
+    }
+  }
+}
+```
+
+### Global Settings
+
+Defines lookup tables, variables, and global configuration:
+
+```json
+{
+  "globals": {
+    "variables": {
+      "current_date": "2024-01-01",
+      "batch_id": "BATCH_001",
+      "default_confidence": 0.8
+    },
+    "lookup_tables": {
+      "country_codes": {
+        "US": "United States",
+        "UK": "United Kingdom",
+        "CA": "Canada"
+      },
+      "status_mapping": {
+        "1": "active",
+        "0": "inactive"
+      }
+    },
+    "constants": {
+      "source_system": "legacy_crm",
+      "import_type": "full"
+    }
+  }
+}
+```
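Other parts of a descriptor refer to these globals with `${name}` placeholders, for example `"${default_confidence}"`. The sketch below shows one possible way to resolve such references using `string.Template`, whose `${identifier}` syntax happens to match. The `substitute` helper is hypothetical, and letting `variables` take precedence over `constants` is an assumption, since the text does not say how name clashes are resolved.

```python
import re
from string import Template

WHOLE_REFERENCE = re.compile(r"\$\{(\w+)\}")

def substitute(value, globals_section):
    """Replace ${name} references in value with entries from the globals section."""
    # Assumption: variables override constants when both define the same name.
    scope = {**globals_section.get("constants", {}),
             **globals_section.get("variables", {})}
    if not isinstance(value, str):
        return value
    match = WHOLE_REFERENCE.fullmatch(value)
    if match:
        # A value that is exactly one reference keeps the original type (e.g. float).
        return scope[match.group(1)]
    # Otherwise interpolate into the surrounding text; unknown names raise KeyError.
    return Template(value).substitute(scope)

globals_section = {
    "variables": {"batch_id": "BATCH_001", "default_confidence": 0.8},
    "constants": {"source_system": "legacy_crm"},
}
print(substitute("${default_confidence}", globals_section))
print(substitute("batch ${batch_id} from ${source_system}", globals_section))
```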
+
+### Field Mappings
+
+Defines how source data maps to target fields with transformations:
+
+```json
+{
+  "mappings": [
+    {
+      "target_field": "person_name",
+      "source": "$.name",
+      "transforms": [
+        {"type": "trim"},
+        {"type": "title_case"}
+      ],
+      "validation": [
+        {"type": "required"},
+        {"type": "min_length", "value": 2},
+        {"type": "max_length", "value": 100},
+        {"type": "pattern", "value": "^[A-Za-z\\s]+$"}
+      ]
+    },
+    {
+      "target_field": "age",
+      "source": "$.age",
+      "transforms": [
+        {"type": "to_int"},
+        {"type": "default", "value": 0}
+      ],
+      "validation": [
+        {"type": "range", "min": 0, "max": 150}
+      ]
+    },
+    {
+      "target_field": "country",
+      "source": "$.country_code",
+      "transforms": [
+        {"type": "lookup", "table": "country_codes"},
+        {"type": "default", "value": "Unknown"}
+      ]
+    }
+  ]
+}
+```
+
+### Transform Types
+
+Available transform functions:
+
+#### String Transforms
+```json
+{"type": "trim"},
+{"type": "upper"},
+{"type": "lower"},
+{"type": "title_case"},
+{"type": "replace", "pattern": "old", "replacement": "new"},
+{"type": "regex_replace", "pattern": "\\d+", "replacement": "XXX"},
+{"type": "substring", "start": 0, "end": 10},
+{"type": "pad_left", "length": 10, "char": "0"}
+```
+
+#### Type Conversions
+```json
+{"type": "to_string"},
+{"type": "to_int"},
+{"type": "to_float"},
+{"type": "to_bool"},
+{"type": "to_date", "format": "YYYY-MM-DD"},
+{"type": "parse_json"}
+```
+
+#### Data Operations
+```json
+{"type": "default", "value": "default_value"},
+{"type": "lookup", "table": "table_name"},
+{"type": "concat", "values": ["field1", " - ", "field2"]},
+{"type": "calculate", "expression": "${field1} + ${field2}"},
+{"type": "conditional", "condition": "${age} > 18", "true_value": "adult", "false_value": "minor"}
+```
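A transform chain of the kinds listed above could be applied as in the following sketch. `apply_transforms` is a hypothetical function covering only some of the transform types. Three behaviours are assumptions rather than anything the text specifies: a failed `lookup` yields `None`, `to_int` on blank input yields `None`, and `default` fires on `None` or an empty string. These choices make the `lookup` then `default` and `to_int` then `default` chains from the mapping examples behave sensibly.

```python
import re

def apply_transforms(value, transforms, lookup_tables=None):
    """Apply each transform to value in order and return the final result."""
    lookup_tables = lookup_tables or {}
    for spec in transforms:
        kind = spec["type"]
        if kind == "default":
            # Assumption: a default replaces missing (None) or empty values only.
            if value is None or value == "":
                value = spec["value"]
        elif value is None:
            continue  # nothing to transform until a default supplies a value
        elif kind == "trim":
            value = value.strip()
        elif kind == "upper":
            value = value.upper()
        elif kind == "lower":
            value = value.lower()
        elif kind == "title_case":
            value = value.title()
        elif kind == "replace":
            value = value.replace(spec["pattern"], spec["replacement"])
        elif kind == "regex_replace":
            value = re.sub(spec["pattern"], spec["replacement"], value)
        elif kind == "substring":
            value = value[spec["start"]:spec["end"]]
        elif kind == "pad_left":
            value = value.rjust(spec["length"], spec["char"])
        elif kind == "to_int":
            # Assumption: blank input becomes None instead of raising.
            value = int(value) if str(value).strip() else None
        elif kind == "lookup":
            # Assumption: an unknown key yields None so a later default can apply.
            value = lookup_tables[spec["table"]].get(value)
        else:
            raise ValueError(f"unsupported transform: {kind}")
    return value

tables = {"country_codes": {"US": "United States", "CA": "Canada"}}
country_chain = [{"type": "lookup", "table": "country_codes"},
                 {"type": "default", "value": "Unknown"}]
print(apply_transforms("  jane DOE ", [{"type": "trim"}, {"type": "title_case"}]))
print(apply_transforms("XX", country_chain, tables))
```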
+
+### Validation Rules
+
+Data quality checks with configurable error handling:
+
+#### Basic Validations
+```json
+{"type": "required"},
+{"type": "not_null"},
+{"type": "min_length", "value": 5},
+{"type": "max_length", "value": 100},
+{"type": "range", "min": 0, "max": 1000},
+{"type": "pattern", "value": "^[A-Z]{2,3}$"},
+{"type": "in_list", "values": ["active", "inactive", "pending"]}
+```
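The basic validations above could be checked as in this sketch. `validate` is a hypothetical function that returns the types of the rules that failed. It assumes that `required` rejects both `None` and the empty string, that `not_null` rejects only `None`, that `range` bounds are inclusive, and that the other rules pass on a missing value so that only `required` or `not_null` report it.

```python
import re

def validate(value, rules):
    """Return the list of rule types that value fails; an empty list means valid."""
    failures = []
    for rule in rules:
        kind = rule["type"]
        if kind == "required":
            ok = value is not None and value != ""
        elif kind == "not_null":
            ok = value is not None
        elif value is None:
            ok = True  # assumption: other rules skip missing values
        elif kind == "min_length":
            ok = len(value) >= rule["value"]
        elif kind == "max_length":
            ok = len(value) <= rule["value"]
        elif kind == "range":
            ok = rule["min"] <= value <= rule["max"]  # inclusive bounds assumed
        elif kind == "pattern":
            ok = re.fullmatch(rule["value"], value) is not None
        elif kind == "in_list":
            ok = value in rule["values"]
        else:
            raise ValueError(f"unsupported validation: {kind}")
        if not ok:
            failures.append(kind)
    return failures

code_rule = [{"type": "pattern", "value": "^[A-Z]{2,3}$"}]
print(validate("US", code_rule))
print(validate("usa", code_rule))
print(validate("", [{"type": "required"}, {"type": "min_length", "value": 5}]))
```

Returning every failure instead of stopping at the first one lets the caller decide how to report or count errors.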
+
+#### Custom Validations
+```json
+{
+  "type": "custom",
+  "expression": "${age} >= 18 && ${country} == 'US'",
+  "message": "Must be 18+ and in US"
+},
+{
+  "type": "cross_field",
+  "fields": ["start_date", "end_date"],
+  "expression": "${start_date} < ${end_date}",
+  "message": "Start date must be before end date"
+}
+```
+
+### Preprocessing and Postprocessing
+
+Global operations applied before and after field mapping:
+
+```json
+{
+  "preprocessing": [
+    {
+      "type": "filter",
+      "condition": "${status} != 'deleted'"
+    },
+    {
+      "type": "sort",
+      "field": "created_date",
+      "order": "asc"
+    }
+  ],
+  "postprocessing": [
+    {
+      "type": "deduplicate",
+      "key_fields": ["email", "phone"]
+    },
+    {
+      "type": "aggregate",
+      "group_by": ["country"],
+      "functions": {
+        "total_count": {"type": "count"},
+        "avg_age": {"type": "avg", "field": "age"}
+      }
+    }
+  ]
+}
+```
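The two postprocessing steps in the example above can be sketched as plain Python functions. Both names are hypothetical. The sketch assumes that `deduplicate` keeps the first record for each combination of `key_fields`, and it implements only the `count` and `avg` aggregate functions. The `filter` step is not shown because it needs an evaluator for `${...}` conditions, which the text does not define.

```python
from collections import defaultdict

def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key_fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def aggregate(records, group_by, functions):
    """Group records by the group_by fields and compute count/avg per group."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[field] for field in group_by)].append(record)
    results = []
    for key, members in groups.items():
        row = dict(zip(group_by, key))
        for name, spec in functions.items():
            if spec["type"] == "count":
                row[name] = len(members)
            elif spec["type"] == "avg":
                row[name] = sum(m[spec["field"]] for m in members) / len(members)
            else:
                raise ValueError(f"unsupported aggregate: {spec['type']}")
        results.append(row)
    return results

records = [
    {"email": "a@x.com", "phone": "1", "country": "US", "age": 30},
    {"email": "a@x.com", "phone": "1", "country": "US", "age": 30},
    {"email": "b@x.com", "phone": "2", "country": "US", "age": 40},
    {"email": "c@x.com", "phone": "3", "country": "CA", "age": 20},
]
unique = deduplicate(records, ["email", "phone"])
summary = aggregate(unique, ["country"], {
    "total_count": {"type": "count"},
    "avg_age": {"type": "avg", "field": "age"},
})
print(summary)
```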
+
+### Output Configuration
+
+Defines how the processed data is output:
+
+```json
+{
+  "output": {
+    "format": "trustgraph-objects",
+    "schema_name": "person",
+    "options": {
+      "batch_size": 1000,
+      "confidence": 0.9,
+      "source_span_field": "raw_text",
+      "metadata": {
+        "source": "crm_import",
+        "version": "1.0"
+      }
+    },
+    "error_handling": {
+      "on_validation_error": "skip|fail|log",
+      "on_transform_error": "skip|fail|default",
+      "max_errors": 100,
+      "error_output": "errors.json"
+    }
+  }
+}
+```
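The `error_handling` block above names the policies but not their exact behaviour, so the following sketch makes its assumptions explicit: `skip` drops the invalid record, `log` records the error but still keeps the record, `fail` aborts on the first error, and exceeding `max_errors` aborts the import. `import_records` and `ImportAborted` are hypothetical names.

```python
class ImportAborted(Exception):
    """Raised when the policy is "fail" or max_errors is exceeded."""

def import_records(records, is_valid, error_handling):
    """Split records into accepted and rejected according to the error policy."""
    policy = error_handling.get("on_validation_error", "skip")  # default is assumed
    max_errors = error_handling.get("max_errors")
    accepted, errors = [], []
    for record in records:
        if is_valid(record):
            accepted.append(record)
            continue
        if policy == "fail":
            raise ImportAborted(f"invalid record: {record}")
        errors.append(record)
        if max_errors is not None and len(errors) > max_errors:
            raise ImportAborted(f"more than {max_errors} invalid records")
        if policy == "log":
            # Assumption: "log" notes the error but still imports the record.
            accepted.append(record)
    return accepted, errors

records = [{"age": 30}, {"age": -1}, {"age": 200}]

def age_in_range(record):
    return 0 <= record["age"] <= 150

accepted, errors = import_records(
    records, age_in_range, {"on_validation_error": "skip", "max_errors": 100})
print(accepted, errors)
```

In a real importer the `errors` list would presumably be written to the file named by `error_output`.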
+
+## Complete Example
+
+```json
+{
+  "version": "1.0",
+  "metadata": {
+    "name": "Customer Import from CRM CSV",
+    "description": "Imports customer data from legacy CRM system",
+    "author": "Data Team",
+    "created": "2024-01-01T00:00:00Z"
+  },
+  "format": {
+    "type": "csv",
+    "encoding": "utf-8",
+    "options": {
+      "delimiter": ",",
+      "has_header": true,
+      "skip_rows": 1
+    }
+  },
+  "globals": {
+    "variables": {
+      "import_date": "2024-01-01",
+      "default_confidence": 0.85
+    },
+    "lookup_tables": {
+      "country_codes": {
+        "US": "United States",
+        "CA": "Canada",
+        "UK": "United Kingdom"
+      }
+    }
+  },
+  "preprocessing": [
+    {
+      "type": "filter",
+      "condition": "${status} == 'active'"
+    }
+  ],
+  "mappings": [
+    {
+      "target_field": "full_name",
+      "source": "customer_name",
+      "transforms": [
+        {"type": "trim"},
+        {"type": "title_case"}
+      ],
+      "validation": [
+        {"type": "required"},
+        {"type": "min_length", "value": 2}
+      ]
+    },
+    {
+      "target_field": "email",
+      "source": "email_address",
+      "transforms": [
+        {"type": "trim"},
+        {"type": "lower"}
+      ],
+      "validation": [
+        {"type": "pattern", "value": "^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$"}
+      ]
+    },
+    {
+      "target_field": "age",
+      "source": "age",
+      "transforms": [
+        {"type": "to_int"},
+        {"type": "default", "value": 0}
+      ],
+      "validation": [
+        {"type": "range", "min": 0, "max": 120}
+      ]
+    },
+    {
+      "target_field": "country",
+      "source": "country_code",
+      "transforms": [
+        {"type": "lookup", "table": "country_codes"},
+        {"type": "default", "value": "Unknown"}
+      ]
+    }
+  ],
+  "output": {
+    "format": "trustgraph-objects",
+    "schema_name": "customer",
+    "options": {
+      "confidence": "${default_confidence}",
+      "batch_size": 500
+    },
+    "error_handling": {
+      "on_validation_error": "log",
+      "max_errors": 50
+    }
+  }
+}
+```
+
+## LLM Prompt for Descriptor Generation
+
+The following prompt can be used to have an LLM analyze sample data and generate a descriptor configuration:
+
+```
+I need you to analyze the provided data sample and create a Structured Data Descriptor configuration in JSON format.
+
+The descriptor should follow this specification:
+- version: "1.0"
+- metadata: Configuration name, description, author, and creation date
+- format: Input format type and parsing options
+- globals: Variables, lookup tables, and constants
+- preprocessing: Filters and transformations applied before mapping
+- mappings: Field-by-field mapping from source to target with transformations and validations
+- postprocessing: Operations like deduplication or aggregation
+- output: Target format and error handling configuration
+
+ANALYZE THE DATA:
+1. Identify the format (CSV, JSON, XML, etc.)
+2. Detect delimiters, encodings, and structure
+3. Find data types for each field
+4. Identify patterns and constraints
+5. Look for fields that need cleaning or transformation
+6. Find relationships between fields
+7. Identify lookup opportunities (codes that map to values)
+8. Detect required vs optional fields
+
+CREATE THE DESCRIPTOR:
+For each field in the sample data:
+- Map it to an appropriate target field name
+- Add necessary transformations (trim, case conversion, type casting)
+- Include appropriate validations (required, patterns, ranges)
+- Set defaults for missing values
+
+Include preprocessing if needed:
+- Filters to exclude invalid records
+- Sorting requirements
+
+Include postprocessing if beneficial:
+- Deduplication on key fields
+- Aggregation for summary data
+
+Configure output for TrustGraph:
+- format: "trustgraph-objects"
+- schema_name: Based on the data entity type
+- Appropriate error handling
+
+DATA SAMPLE:
+[Insert data sample here]
+
+ADDITIONAL CONTEXT (optional):
+- Target schema name: [if known]
+- Business rules: [any specific requirements]
+- Data quality issues to address: [known problems]
+
+Generate a complete, valid Structured Data Descriptor configuration that will properly import this data into TrustGraph. Include comments explaining key decisions.
+```
+
+### Example Usage Prompt
+
+````
+I need you to analyze the provided data sample and create a Structured Data Descriptor configuration in JSON format.
+
+[Standard instructions from above...]
+
+DATA SAMPLE:
+```csv
+CustomerID,Name,Email,Age,Country,Status,Join Date,Total Purchases
+1001,"Smith, John",john.smith@email.com,35,US,1,2023-01-15,5420.50
+1002,"doe, jane",JANE.DOE@GMAIL.COM,28,CA,1,2023-03-22,3200.00
+1003,"Bob Johnson",bob@,62,UK,0,2022-11-01,0
+1004,"Alice Chen","alice.chen@company.org",41,US,1,2023-06-10,8900.25
+1005,,invalid-email,25,XX,1,2024-01-01,100
+```
+
+ADDITIONAL CONTEXT:
+- Target schema name: customer
+- Business rules: Email should be valid and lowercase, names should be title case
+- Data quality issues: Some emails are invalid, some names are missing, country codes need mapping
+````
+
+### Prompt for Analyzing Existing Data Without a Sample
+
+```
+I need you to help me create a Structured Data Descriptor configuration for importing [data type] data.
+
+The source data has these characteristics:
+- Format: [CSV/JSON/XML/etc]
+- Fields: [list the fields]
+- Data quality issues: [describe any known issues]
+- Volume: [approximate number of records]
+
+Requirements:
+- [List any specific transformation needs]
+- [List any validation requirements]
+- [List any business rules]
+
+Please generate a Structured Data Descriptor configuration that will:
+1. Parse the input format correctly
+2. Clean and standardize the data
+3. Validate according to the requirements
+4. Handle errors gracefully
+5. Output in TrustGraph ExtractedObject format
+
+Focus on making the configuration robust and reusable.
+```