---
layout: default
title: "Structured Data Descriptor"
parent: "Swahili (Beta)"
---
# Structured Data Descriptor
> **Beta Translation:** This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.
## Overview
The Structured Data Descriptor is a JSON-based configuration language that describes how to parse, transform, and import structured data into TrustGraph. It provides a declarative approach to data import, supporting multiple input formats and complex transformation pipelines without requiring custom code.
## Core Concepts
### 1. Format Definition
Defines the input file type and its parsing options. It determines which parser to use and how to interpret the source data.
### 2. Field Mappings
Maps source paths to target fields, together with transformations. Defines how data flows from the input sources to the fields of the output schema.
### 3. Transform Pipeline
A chain of data transformations that can be applied to field values, including:
- Data cleaning (trim, normalize)
- Format conversion (date parsing, type casting)
- Calculations (arithmetic, string manipulation)
- Lookups (reference tables, substitutions)
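As a rough illustration of how such a chain might be executed, the sketch below applies a list of transform specs in order, feeding each result into the next step. The `TRANSFORMS` registry and the `apply_transforms` helper are hypothetical names used for illustration only, not part of TrustGraph; only a handful of the transform types from this specification are covered.

```python
from typing import Any, Callable

# Hypothetical registry: transform type name -> function(value, spec) -> new value.
TRANSFORMS: dict[str, Callable[[Any, dict], Any]] = {
    "trim": lambda v, spec: v.strip() if isinstance(v, str) else v,
    "upper": lambda v, spec: v.upper() if isinstance(v, str) else v,
    "lower": lambda v, spec: v.lower() if isinstance(v, str) else v,
    "to_int": lambda v, spec: int(v),
    # Assumption: "default" replaces only None and the empty string.
    "default": lambda v, spec: spec["value"] if v in (None, "") else v,
}


def apply_transforms(value: Any, transforms: list[dict]) -> Any:
    """Apply each transform in order, passing the result to the next one."""
    for spec in transforms:
        value = TRANSFORMS[spec["type"]](value, spec)
    return value


cleaned = apply_transforms("  Jane.Doe@Example.COM ", [{"type": "trim"}, {"type": "lower"}])
print(cleaned)
```

Because order matters, placing `trim` before `to_int` or `lower` keeps later steps from seeing stray whitespace.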
### 4. Validation Rules
Data quality checks applied to ensure data integrity:
- Type validation
- Range checks
- Pattern matching (regex)
- Required field validation
- Custom validation logic
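The following is a minimal sketch of how a few of these checks could be evaluated against a single field value. The `validate` function is a hypothetical helper, and the behaviour for unknown rule types and non-numeric `range` inputs is an assumption rather than something this specification defines.

```python
import re
from typing import Any


def validate(value: Any, rules: list[dict]) -> list[str]:
    """Return a list of failure messages; an empty list means the value passed."""
    errors = []
    for rule in rules:
        kind = rule["type"]
        if kind == "required":
            if value is None or value == "":
                errors.append("value is required")
        elif kind == "min_length":
            if value is not None and len(str(value)) < rule["value"]:
                errors.append(f"shorter than {rule['value']} characters")
        elif kind == "range":
            # Assumes the value has already been converted to a number.
            if value is not None and not (rule["min"] <= value <= rule["max"]):
                errors.append(f"outside range {rule['min']}..{rule['max']}")
        elif kind == "pattern":
            if value is not None and re.fullmatch(rule["value"], str(value)) is None:
                errors.append(f"does not match {rule['value']}")
        else:
            # Assumption: unknown rule types are reported rather than ignored.
            errors.append(f"unknown rule type: {kind}")
    return errors


print(validate(200, [{"type": "range", "min": 0, "max": 150}]))
```

Returning messages instead of raising lets the caller decide, per the error handling policy, whether to skip, log, or fail.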
### 5. Global Settings
Configuration that applies across the entire import process:
- Lookup tables for data enrichment
- Global variables and constants
- Output format specifications
- Error handling policies
## Implementation Strategy
The importer implementation follows this pipeline:
1. **Parse Configuration** - Load and validate the JSON descriptor
2. **Initialize Parser** - Load the appropriate parser (CSV, XML, JSON, etc.) based on `format.type`
3. **Apply Preprocessing** - Execute global filters and transformations
4. **Process Records** - For each input record:
   - Extract data using source paths (JSONPath, XPath, column names)
   - Apply field-level transforms in sequence
   - Validate results against the defined rules
   - Apply default values for missing data
5. **Apply Postprocessing** - Execute deduplication, aggregation, etc.
6. **Generate Output** - Produce data in the specified target format
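The skeleton below sketches steps 1, 2, 4, and 6 of the pipeline above using only the Python standard library. It is an illustration under stated assumptions, not the TrustGraph importer: `PARSERS`, `parse_csv`, `parse_json`, and `run_import` are hypothetical names, preprocessing, transforms, validation, and postprocessing are omitted, and sources are treated as plain record keys.

```python
import csv
import io
import json
from typing import Callable, Iterator


def parse_csv(text: str, options: dict) -> Iterator[dict]:
    """Yield one dict per row, keyed by header name (assumes a header row)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=options.get("delimiter", ","))
    yield from reader


def parse_json(text: str, options: dict) -> Iterator[dict]:
    """Yield records from a top-level JSON array (root_path is ignored here)."""
    yield from json.loads(text)


# Hypothetical dispatch table keyed by `format.type`.
PARSERS: dict[str, Callable[[str, dict], Iterator[dict]]] = {
    "csv": parse_csv,
    "json": parse_json,
}


def run_import(descriptor_json: str, raw_input: str) -> list[dict]:
    descriptor = json.loads(descriptor_json)              # 1. parse configuration
    fmt = descriptor["format"]
    parser = PARSERS[fmt["type"]]                          # 2. initialize parser
    output = []
    for record in parser(raw_input, fmt.get("options", {})):  # 4. process records
        row = {m["target_field"]: record.get(m["source"]) for m in descriptor["mappings"]}
        output.append(row)
    return output                                          # 6. generate output


descriptor = json.dumps({
    "format": {"type": "csv", "options": {"delimiter": ","}},
    "mappings": [{"target_field": "full_name", "source": "customer_name"}],
})
print(run_import(descriptor, "customer_name,age\nAda,36\n"))
```

Keeping parser selection in a lookup table means a new input format only needs a new entry, which matches the single-importer goal of the descriptor.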
## Path Expression Support
Each input format uses the path expression language that suits it:
- **CSV**: Column names or indices (`"column_name"` or `"[2]"`)
- **JSON**: JSONPath syntax (`"$.user.profile.email"`)
- **XML**: XPath expressions (`"//product[@id='123']/price"`)
- **Fixed-width**: Field names from the field definitions
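To make the CSV and JSON cases concrete, here is a small sketch of two hypothetical resolvers. It assumes that `"[n]"` is a zero-based column index, which this specification does not state, and it handles only the simple dotted subset of JSONPath (such as `$.user.profile.email`), not filters, wildcards, or array indices.

```python
import re
from typing import Any, Optional

_INDEX = re.compile(r"\[(\d+)\]")


def resolve_csv_source(source: str, header: list[str], row: list[str]) -> Optional[str]:
    """Resolve a CSV source expression ("name" or "[n]") against one row."""
    match = _INDEX.fullmatch(source)
    if match:
        index = int(match.group(1))  # assumption: zero-based index
    elif source in header:
        index = header.index(source)
    else:
        return None
    return row[index] if index < len(row) else None


def resolve_json_source(source: str, record: dict) -> Any:
    """Resolve a dotted JSONPath such as "$.user.profile.email"."""
    if not source.startswith("$."):
        return None
    current: Any = record
    for key in source[2:].split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


header = ["id", "name", "email"]
row = ["1001", "Ada", "ada@example.com"]
print(resolve_csv_source("[2]", header, row), resolve_csv_source("name", header, row))
```

A full implementation would more likely delegate JSONPath and XPath evaluation to a dedicated library than hand-roll them as above.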
## Benefits
- **Single Codebase** - One importer handles multiple input formats
- **User-Friendly** - Non-technical users can create configurations
- **Reusable** - Configurations can be shared and versioned
- **Flexible** - Complex transformations without custom code
- **Robust** - Built-in validation and comprehensive error handling
- **Maintainable** - The declarative approach reduces implementation complexity
## Language Specification
The Structured Data Descriptor uses a JSON configuration format with the following top-level structure:
```json
{
  "version": "1.0",
  "metadata": {
    "name": "Configuration Name",
    "description": "Description of what this config does",
    "author": "Author Name",
    "created": "2024-01-01T00:00:00Z"
  },
  "format": { ... },
  "globals": { ... },
  "preprocessing": [ ... ],
  "mappings": [ ... ],
  "postprocessing": [ ... ],
  "output": { ... }
}
```
### Format Definition
Describes the format of the input data and its parsing options:
```json
{
  "format": {
    "type": "csv|json|xml|fixed-width|excel|parquet",
    "encoding": "utf-8",
    "options": {
      // Format-specific options
    }
  }
}
```
#### CSV Format Options
```json
{
  "format": {
    "type": "csv",
    "options": {
      "delimiter": ",",
      "quote_char": "\"",
      "escape_char": "\\",
      "skip_rows": 1,
      "has_header": true,
      "null_values": ["", "NULL", "null", "N/A"]
    }
  }
}
```
#### JSON Format Options
```json
{
  "format": {
    "type": "json",
    "options": {
      "root_path": "$.data",
      "array_mode": "records|single",
      "flatten": false
    }
  }
}
```
#### XML Format Options
```json
{
  "format": {
    "type": "xml",
    "options": {
      "root_element": "//records/record",
      "namespaces": {
        "ns": "http://example.com/namespace"
      }
    }
  }
}
```
### Global Settings
Define lookup tables, variables, and global configuration:
```json
{
  "globals": {
    "variables": {
      "current_date": "2024-01-01",
      "batch_id": "BATCH_001",
      "default_confidence": 0.8
    },
    "lookup_tables": {
      "country_codes": {
        "US": "United States",
        "UK": "United Kingdom",
        "CA": "Canada"
      },
      "status_mapping": {
        "1": "active",
        "0": "inactive"
      }
    },
    "constants": {
      "source_system": "legacy_crm",
      "import_type": "full"
    }
  }
}
```
### Field Mappings
Define how source data maps to target fields, together with transformations:
```json
{
  "mappings": [
    {
      "target_field": "person_name",
      "source": "$.name",
      "transforms": [
        {"type": "trim"},
        {"type": "title_case"},
        {"type": "required"}
      ],
      "validation": [
        {"type": "min_length", "value": 2},
        {"type": "max_length", "value": 100},
        {"type": "pattern", "value": "^[A-Za-z\\s]+$"}
      ]
    },
    {
      "target_field": "age",
      "source": "$.age",
      "transforms": [
        {"type": "to_int"},
        {"type": "default", "value": 0}
      ],
      "validation": [
        {"type": "range", "min": 0, "max": 150}
      ]
    },
    {
      "target_field": "country",
      "source": "$.country_code",
      "transforms": [
        {"type": "lookup", "table": "country_codes"},
        {"type": "default", "value": "Unknown"}
      ]
    }
  ]
}
```
### Transform Types
Available transformation functions:
#### String Transforms
```json
{"type": "trim"},
{"type": "upper"},
{"type": "lower"},
{"type": "title_case"},
{"type": "replace", "pattern": "old", "replacement": "new"},
{"type": "regex_replace", "pattern": "\\d+", "replacement": "XXX"},
{"type": "substring", "start": 0, "end": 10},
{"type": "pad_left", "length": 10, "char": "0"}
```
#### Type Conversions
```json
{"type": "to_string"},
{"type": "to_int"},
{"type": "to_float"},
{"type": "to_bool"},
{"type": "to_date", "format": "YYYY-MM-DD"},
{"type": "parse_json"}
```
#### Data Operations
```json
{"type": "default", "value": "default_value"},
{"type": "lookup", "table": "table_name"},
{"type": "concat", "values": ["field1", " - ", "field2"]},
{"type": "calculate", "expression": "${field1} + ${field2}"},
{"type": "conditional", "condition": "${age} > 18", "true_value": "adult", "false_value": "minor"}
```
### Validation Rules
Data quality checks with configurable error handling:
#### Basic Validations
```json
{"type": "required"},
{"type": "not_null"},
{"type": "min_length", "value": 5},
{"type": "max_length", "value": 100},
{"type": "range", "min": 0, "max": 1000},
{"type": "pattern", "value": "^[A-Z]{2,3}$"},
{"type": "in_list", "values": ["active", "inactive", "pending"]}
```
#### Custom Validations
```json
{
  "type": "custom",
  "expression": "${age} >= 18 && ${country} == 'US'",
  "message": "Must be 18+ and in US"
},
{
  "type": "cross_field",
  "fields": ["start_date", "end_date"],
  "expression": "${start_date} < ${end_date}",
  "message": "Start date must be before end date"
}
```
### Preprocessing and Postprocessing
Global operations applied before and after field mapping:
```json
{
  "preprocessing": [
    {
      "type": "filter",
      "condition": "${status} != 'deleted'"
    },
    {
      "type": "sort",
      "field": "created_date",
      "order": "asc"
    }
  ],
  "postprocessing": [
    {
      "type": "deduplicate",
      "key_fields": ["email", "phone"]
    },
    {
      "type": "aggregate",
      "group_by": ["country"],
      "functions": {
        "total_count": {"type": "count"},
        "avg_age": {"type": "avg", "field": "age"}
      }
    }
  ]
}
```
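The two postprocessing steps in the example above could be sketched roughly as follows. This is an illustration only: `deduplicate` and `aggregate` are hypothetical helpers, and it is an assumption that deduplication keeps the first record seen and treats `key_fields` as a combined key (a record is a duplicate only when all key fields match).

```python
from collections import defaultdict


def deduplicate(records: list[dict], key_fields: list[str]) -> list[dict]:
    """Keep the first record seen for each combination of key field values."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def aggregate(records: list[dict], group_by: list[str], functions: dict) -> list[dict]:
    """Group records and compute the "count" and "avg" functions per group."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record.get(f) for f in group_by)].append(record)
    results = []
    for key, members in groups.items():
        row = dict(zip(group_by, key))
        for name, spec in functions.items():
            if spec["type"] == "count":
                row[name] = len(members)
            elif spec["type"] == "avg":
                values = [m[spec["field"]] for m in members]
                row[name] = sum(values) / len(values)
        results.append(row)
    return results


records = [
    {"email": "a@x.com", "phone": "1", "country": "US", "age": 30},
    {"email": "a@x.com", "phone": "1", "country": "US", "age": 30},
    {"email": "b@x.com", "phone": "2", "country": "US", "age": 40},
]
unique = deduplicate(records, ["email", "phone"])
print(aggregate(unique, ["country"], {"total_count": {"type": "count"},
                                      "avg_age": {"type": "avg", "field": "age"}}))
```

Running deduplication before aggregation, as in the descriptor's ordering, keeps duplicate rows from inflating counts and averages.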
### Output Configuration
Define how the processed data should be output:
```json
{
  "output": {
    "format": "trustgraph-objects",
    "schema_name": "person",
    "options": {
      "batch_size": 1000,
      "confidence": 0.9,
      "source_span_field": "raw_text",
      "metadata": {
        "source": "crm_import",
        "version": "1.0"
      }
    },
    "error_handling": {
      "on_validation_error": "skip|fail|log",
      "on_transform_error": "skip|fail|default",
      "max_errors": 100,
      "error_output": "errors.json"
    }
  }
}
```
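The specification does not spell out the exact semantics of the `error_handling` values, so the sketch below is one possible reading rather than the actual behaviour: `skip` drops an invalid record silently, `log` drops it and records the failure, `fail` aborts on the first invalid record, and the import aborts once the number of invalid records exceeds `max_errors`. The names `ImportAborted` and `apply_error_policy` are hypothetical.

```python
from typing import Callable


class ImportAborted(Exception):
    """Raised when the import stops because of the error policy."""


def apply_error_policy(
    records: list[dict],
    check: Callable[[dict], list[str]],
    on_validation_error: str = "skip",
    max_errors: int = 100,
) -> tuple[list[dict], list[dict]]:
    """Return (accepted records, error log) under the assumed policy semantics."""
    accepted: list[dict] = []
    error_log: list[dict] = []
    error_count = 0
    for record in records:
        problems = check(record)
        if not problems:
            accepted.append(record)
            continue
        if on_validation_error == "fail":
            raise ImportAborted(f"validation failed: {problems}")
        error_count += 1
        if on_validation_error == "log":
            error_log.append({"record": record, "errors": problems})
        if error_count > max_errors:
            raise ImportAborted(f"more than {max_errors} errors")
    return accepted, error_log


def check_email(record: dict) -> list[str]:
    return [] if record.get("email") else ["email is required"]


rows = [{"email": "a@x.com"}, {"email": ""}, {"email": "b@x.com"}]
accepted, error_log = apply_error_policy(rows, check_email, "log", max_errors=50)
print(len(accepted), len(error_log))
```

In a real importer the error log would presumably be written to the file named by `error_output`; that step is left out here.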
## Complete Example
```json
{
  "version": "1.0",
  "metadata": {
    "name": "Customer Import from CRM CSV",
    "description": "Imports customer data from legacy CRM system",
    "author": "Data Team",
    "created": "2024-01-01T00:00:00Z"
  },
  "format": {
    "type": "csv",
    "encoding": "utf-8",
    "options": {
      "delimiter": ",",
      "has_header": true,
      "skip_rows": 1
    }
  },
  "globals": {
    "variables": {
      "import_date": "2024-01-01",
      "default_confidence": 0.85
    },
    "lookup_tables": {
      "country_codes": {
        "US": "United States",
        "CA": "Canada",
        "UK": "United Kingdom"
      }
    }
  },
  "preprocessing": [
    {
      "type": "filter",
      "condition": "${status} == 'active'"
    }
  ],
  "mappings": [
    {
      "target_field": "full_name",
      "source": "customer_name",
      "transforms": [
        {"type": "trim"},
        {"type": "title_case"}
      ],
      "validation": [
        {"type": "required"},
        {"type": "min_length", "value": 2}
      ]
    },
    {
      "target_field": "email",
      "source": "email_address",
      "transforms": [
        {"type": "trim"},
        {"type": "lower"}
      ],
      "validation": [
        {"type": "pattern", "value": "^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$"}
      ]
    },
    {
      "target_field": "age",
      "source": "age",
      "transforms": [
        {"type": "to_int"},
        {"type": "default", "value": 0}
      ],
      "validation": [
        {"type": "range", "min": 0, "max": 120}
      ]
    },
    {
      "target_field": "country",
      "source": "country_code",
      "transforms": [
        {"type": "lookup", "table": "country_codes"},
        {"type": "default", "value": "Unknown"}
      ]
    }
  ],
  "output": {
    "format": "trustgraph-objects",
    "schema_name": "customer",
    "options": {
      "confidence": "${default_confidence}",
      "batch_size": 500
    },
    "error_handling": {
      "on_validation_error": "log",
      "max_errors": 50
    }
  }
}
```
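The example above sets `output.options.confidence` to `"${default_confidence}"`, which refers to an entry in `globals.variables`. One way such placeholders might be resolved is sketched below. The `substitute` helper is hypothetical, and two behaviours are assumptions: a string consisting of exactly one placeholder takes the variable's native type, and unknown names are left untouched.

```python
import re
from typing import Any

_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def substitute(value: Any, variables: dict[str, Any]) -> Any:
    """Replace ${name} placeholders in `value` using `variables`."""
    if not isinstance(value, str):
        return value
    whole = _PLACEHOLDER.fullmatch(value)
    if whole and whole.group(1) in variables:
        # The whole string is one placeholder: keep the native type (e.g. float).
        return variables[whole.group(1)]
    # Otherwise interpolate as text, leaving unknown placeholders as written.
    return _PLACEHOLDER.sub(
        lambda m: str(variables.get(m.group(1), m.group(0))), value
    )


variables = {"import_date": "2024-01-01", "default_confidence": 0.85}
print(substitute("${default_confidence}", variables))
print(substitute("imported on ${import_date}", variables))
```

Preserving the native type matters here because `confidence` is a number elsewhere in this specification, not a string.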
## LLM Prompt for Descriptor Generation
The following prompt can be used to have an LLM analyze sample data and generate a descriptor configuration:
```
I need you to analyze the provided data sample and create a Structured Data Descriptor configuration in JSON format.
The descriptor should follow this specification:
- version: "1.0"
- metadata: Configuration name, description, author, and creation date
- format: Input format type and parsing options
- globals: Variables, lookup tables, and constants
- preprocessing: Filters and transformations applied before mapping
- mappings: Field-by-field mapping from source to target with transformations and validations
- postprocessing: Operations like deduplication or aggregation
- output: Target format and error handling configuration
ANALYZE THE DATA:
1. Identify the format (CSV, JSON, XML, etc.)
2. Detect delimiters, encodings, and structure
3. Find data types for each field
4. Identify patterns and constraints
5. Look for fields that need cleaning or transformation
6. Find relationships between fields
7. Identify lookup opportunities (codes that map to values)
8. Detect required vs optional fields
CREATE THE DESCRIPTOR:
For each field in the sample data:
- Map it to an appropriate target field name
- Add necessary transformations (trim, case conversion, type casting)
- Include appropriate validations (required, patterns, ranges)
- Set defaults for missing values
Include preprocessing if needed:
- Filters to exclude invalid records
- Sorting requirements
Include postprocessing if beneficial:
- Deduplication on key fields
- Aggregation for summary data
Configure output for TrustGraph:
- format: "trustgraph-objects"
- schema_name: Based on the data entity type
- Appropriate error handling
DATA SAMPLE:
[Insert data sample here]
ADDITIONAL CONTEXT (optional):
- Target schema name: [if known]
- Business rules: [any specific requirements]
- Data quality issues to address: [known problems]
Generate a complete, valid Structured Data Descriptor configuration that will properly import this data into TrustGraph. Include comments explaining key decisions.
```
### Example Usage Prompt
````
I need you to analyze the provided data sample and create a Structured Data Descriptor configuration in JSON format.
[Standard instructions from above...]
DATA SAMPLE:
```csv
Customer ID,Name,Email,Age,Country,Status,Join Date,Total Purchases
1001,"Smith, John",john.smith@email.com,35,US,1,2023-01-15,5420.50
1002,"doe, jane",JANE.DOE@GMAIL.COM,28,CA,1,2023-03-22,3200.00
1003,"Bob Johnson",bob@,62,UK,0,2022-11-01,0
1004,"Alice Chen","alice.chen@company.org",41,US,1,2023-06-10,8900.25
1005,,invalid-email,25,XX,1,2024-01-01,100
```
ADDITIONAL CONTEXT:
- Target schema name: customer
- Business rules: Email should be valid and lowercase, names should be title case
- Data quality issues: Some emails are invalid, some names are missing, country codes need mapping
````
### Prompt for Analyzing Existing Data Without a Sample
```
I need you to help me create a Structured Data Descriptor configuration for importing [data type] data.
The source data has these characteristics:
- Format: [CSV/JSON/XML/etc]
- Fields: [list the fields]
- Data quality issues: [describe any known issues]
- Volume: [approximate number of records]
Requirements:
- [List any specific transformation needs]
- [List any validation requirements]
- [List any business rules]
Please generate a Structured Data Descriptor configuration that will:
1. Parse the input format correctly
2. Clean and standardize the data
3. Validate according to the requirements
4. Handle errors gracefully
5. Output in TrustGraph ExtractedObject format
Focus on making the configuration robust and reusable.
```