trustgraph/docs/cli/tg-load-doc-embeds.md

568 lines
13 KiB
Markdown
Raw Normal View History

# tg-load-doc-embeds
Loads document embeddings from MessagePack format into TrustGraph processing pipelines.
## Synopsis
```bash
tg-load-doc-embeds -i INPUT_FILE [options]
```
## Description
The `tg-load-doc-embeds` command loads document embeddings from MessagePack files into a running TrustGraph system. This is typically used to restore previously saved document embeddings or to load embeddings generated by external systems.
The command reads document embedding data in MessagePack format and streams it to TrustGraph's document embeddings import API via WebSocket connections.
## Options
### Required Arguments
- `-i, --input-file FILE`: Input MessagePack file containing document embeddings
### Optional Arguments
- `-u, --url URL`: TrustGraph API URL (default: `$TRUSTGRAPH_API` or `http://localhost:8088/`)
- `-f, --flow-id ID`: Flow instance ID to use (default: `default`)
- `--format FORMAT`: Input format - `msgpack` or `json` (default: `msgpack`)
- `--user USER`: Override user ID from input data
- `--collection COLLECTION`: Override collection ID from input data
## Examples
### Basic Loading
```bash
tg-load-doc-embeds -i document-embeddings.msgpack
```
### Load with Custom Flow
```bash
tg-load-doc-embeds \
-i embeddings.msgpack \
-f "document-processing-flow"
```
### Override User and Collection
```bash
tg-load-doc-embeds \
-i embeddings.msgpack \
--user "research-team" \
--collection "research-docs"
```
### Load from JSON Format
```bash
tg-load-doc-embeds \
-i embeddings.json \
--format json
```
### Production Loading
```bash
tg-load-doc-embeds \
-i production-embeddings.msgpack \
-u https://trustgraph-api.company.com/ \
-f "production-flow" \
--user "system" \
--collection "production-docs"
```
## Input Data Format
### MessagePack Structure
Document embeddings are stored as MessagePack records with this structure:
```json
["de", {
"m": {
"i": "document-id",
"m": [{"metadata": "objects"}],
"u": "user-id",
"c": "collection-id"
},
"c": [{
"c": "text chunk content",
"v": [0.1, 0.2, 0.3, ...]
}]
}]
```
### Components
- **Document Metadata** (`m`):
- `i`: Document ID
- `m`: Document metadata objects
- `u`: User ID
- `c`: Collection ID
- **Chunks** (`c`): Array of text chunks with embeddings:
- `c`: Text content of the chunk
- `v`: Vector embedding array
## Use Cases
### Backup Restoration
```bash
# Restore document embeddings from backup
restore_embeddings() {
local backup_file="$1"
local target_collection="$2"
echo "Restoring document embeddings from: $backup_file"
if [ ! -f "$backup_file" ]; then
echo "Backup file not found: $backup_file"
return 1
fi
# Verify backup file
if tg-dump-msgpack -i "$backup_file" --summary | grep -q "Vector dimension:"; then
echo "✓ Backup file contains embeddings"
else
echo "✗ Backup file does not contain valid embeddings"
return 1
fi
# Load embeddings
tg-load-doc-embeds \
-i "$backup_file" \
--collection "$target_collection"
echo "Embedding restoration complete"
}
# Restore from backup
restore_embeddings "backup-20231215.msgpack" "restored-docs"
```
### Data Migration
```bash
# Migrate embeddings between environments
migrate_embeddings() {
local source_file="$1"
local target_env="$2"
local target_user="$3"
echo "Migrating embeddings to: $target_env"
# Load to target environment
tg-load-doc-embeds \
-i "$source_file" \
-u "https://$target_env/api/" \
--user "$target_user" \
--collection "migrated-docs"
echo "Migration complete"
}
# Migrate to production
migrate_embeddings "dev-embeddings.msgpack" "prod.company.com" "migration-user"
```
### Batch Processing
```bash
# Load multiple embedding files
batch_load_embeddings() {
local input_dir="$1"
local collection="$2"
echo "Batch loading embeddings from: $input_dir"
for file in "$input_dir"/*.msgpack; do
if [ -f "$file" ]; then
echo "Loading: $(basename "$file")"
tg-load-doc-embeds \
-i "$file" \
--collection "$collection"
if [ $? -eq 0 ]; then
echo "✓ Loaded: $(basename "$file")"
else
echo "✗ Failed: $(basename "$file")"
fi
fi
done
echo "Batch loading complete"
}
# Load all embeddings
batch_load_embeddings "embeddings/" "batch-processed"
```
### Incremental Loading
```bash
# Load new embeddings incrementally
incremental_load() {
local embeddings_dir="$1"
local processed_log="processed_embeddings.log"
# Create log if it doesn't exist
touch "$processed_log"
for file in "$embeddings_dir"/*.msgpack; do
if [ -f "$file" ]; then
# Check if already processed
if grep -q "$(basename "$file")" "$processed_log"; then
echo "Skipping already processed: $(basename "$file")"
continue
fi
echo "Processing new file: $(basename "$file")"
if tg-load-doc-embeds -i "$file"; then
echo "$(date): $(basename "$file")" >> "$processed_log"
echo "✓ Processed: $(basename "$file")"
else
echo "✗ Failed: $(basename "$file")"
fi
fi
done
}
# Run incremental loading
incremental_load "embeddings/"
```
## Advanced Usage
### Parallel Loading
```bash
# Load multiple files in parallel
parallel_load_embeddings() {
local files=("$@")
local max_parallel=3
local current_jobs=0
for file in "${files[@]}"; do
# Wait if max parallel jobs reached
while [ $current_jobs -ge $max_parallel ]; do
wait -n # Wait for any job to complete
current_jobs=$((current_jobs - 1))
done
# Start loading in background
(
echo "Loading: $file"
tg-load-doc-embeds -i "$file"
echo "Completed: $file"
) &
current_jobs=$((current_jobs + 1))
done
# Wait for all remaining jobs
wait
echo "All parallel loading completed"
}
# Load files in parallel
embedding_files=(embeddings1.msgpack embeddings2.msgpack embeddings3.msgpack)
parallel_load_embeddings "${embedding_files[@]}"
```
### Validation and Loading
```bash
# Validate before loading
validate_and_load() {
local file="$1"
local collection="$2"
echo "Validating embedding file: $file"
# Check file exists and is readable
if [ ! -r "$file" ]; then
echo "Error: Cannot read file $file"
return 1
fi
# Validate MessagePack structure
if ! tg-dump-msgpack -i "$file" --summary > /dev/null 2>&1; then
echo "Error: Invalid MessagePack format"
return 1
fi
# Check for document embeddings
if ! tg-dump-msgpack -i "$file" | grep -q '^\["de"'; then
echo "Error: No document embeddings found"
return 1
fi
# Get embedding statistics
summary=$(tg-dump-msgpack -i "$file" --summary)
vector_dim=$(echo "$summary" | grep "Vector dimension:" | awk '{print $3}')
if [ -n "$vector_dim" ]; then
echo "✓ Found embeddings with dimension: $vector_dim"
else
echo "Warning: Could not determine vector dimension"
fi
# Load embeddings
echo "Loading validated embeddings..."
tg-load-doc-embeds -i "$file" --collection "$collection"
echo "Loading complete"
}
# Validate and load
validate_and_load "embeddings.msgpack" "validated-docs"
```
### Progress Monitoring
```bash
# Monitor loading progress
monitor_loading() {
local file="$1"
local log_file="loading_progress.log"
# Start loading in background
tg-load-doc-embeds -i "$file" > "$log_file" 2>&1 &
local load_pid=$!
echo "Monitoring loading progress (PID: $load_pid)..."
# Monitor progress
while kill -0 $load_pid 2>/dev/null; do
if [ -f "$log_file" ]; then
# Extract progress from log
embeddings_count=$(grep -o "Document embeddings:.*[0-9]" "$log_file" | tail -1 | awk '{print $3}')
if [ -n "$embeddings_count" ]; then
echo "Progress: $embeddings_count embeddings loaded"
fi
fi
sleep 5
done
# Check final status
wait $load_pid
if [ $? -eq 0 ]; then
echo "✓ Loading completed successfully"
else
echo "✗ Loading failed"
cat "$log_file"
fi
rm "$log_file"
}
# Monitor loading
monitor_loading "large-embeddings.msgpack"
```
### Data Transformation
```bash
# Transform embeddings during loading
transform_and_load() {
local input_file="$1"
local output_file="transformed-$(basename "$input_file")"
local new_user="$2"
local new_collection="$3"
echo "Transforming embeddings: user=$new_user, collection=$new_collection"
# This would require a transformation script
# For now, we'll show the concept
# Load with override parameters
tg-load-doc-embeds \
-i "$input_file" \
--user "$new_user" \
--collection "$new_collection"
echo "Transformation and loading complete"
}
# Transform during loading
transform_and_load "original.msgpack" "new-user" "new-collection"
```
## Performance Optimization
### Memory Management
```bash
# Monitor memory usage during loading
monitor_memory_usage() {
local file="$1"
echo "Starting memory-monitored loading..."
# Start loading in background
tg-load-doc-embeds -i "$file" &
local load_pid=$!
# Monitor memory usage
while kill -0 $load_pid 2>/dev/null; do
memory_usage=$(ps -p $load_pid -o rss= 2>/dev/null | awk '{print $1/1024}')
if [ -n "$memory_usage" ]; then
echo "Memory usage: ${memory_usage}MB"
fi
sleep 10
done
wait $load_pid
echo "Loading completed"
}
```
### Chunked Loading
```bash
# Load large files in chunks
chunked_load() {
local large_file="$1"
local chunk_size=1000 # Records per chunk
echo "Loading large file in chunks: $large_file"
# Split the MessagePack file (this would need special tooling)
# For demonstration, assuming we have pre-split files
for chunk in "${large_file%.msgpack}"_chunk_*.msgpack; do
if [ -f "$chunk" ]; then
echo "Loading chunk: $(basename "$chunk")"
tg-load-doc-embeds -i "$chunk"
# Add delay between chunks to reduce system load
sleep 2
fi
done
echo "Chunked loading complete"
}
```
## Error Handling
### File Not Found
```bash
Exception: [Errno 2] No such file or directory
```
**Solution**: Verify file path and ensure the MessagePack file exists.
### Invalid Format
```bash
Exception: Unpack failed
```
**Solution**: Verify the file is a valid MessagePack file with document embeddings.
### WebSocket Connection Issues
```bash
Exception: Connection failed
```
**Solution**: Check API URL and ensure TrustGraph is running with WebSocket support.
### Memory Errors
```bash
MemoryError: Unable to allocate memory
```
**Solution**: Process large files in smaller chunks or increase available memory.
### Flow Not Found
```bash
Exception: Flow not found
```
**Solution**: Verify the flow ID exists with `tg-show-flows`.
## Integration with Other Commands
### Complete Workflow
```bash
# Complete document processing workflow
process_documents_workflow() {
local pdf_dir="$1"
local embeddings_file="embeddings.msgpack"
echo "Starting complete document workflow..."
# 1. Load PDFs
for pdf in "$pdf_dir"/*.pdf; do
tg-load-pdf "$pdf"
done
# 2. Wait for processing
sleep 30
# 3. Save embeddings
tg-save-doc-embeds -o "$embeddings_file"
# 4. Process embeddings (example: load to different collection)
tg-load-doc-embeds -i "$embeddings_file" --collection "processed-docs"
echo "Complete workflow finished"
}
```
### Backup and Restore
```bash
# Complete backup and restore cycle
backup_restore_cycle() {
local backup_file="embeddings-backup.msgpack"
echo "Creating embeddings backup..."
tg-save-doc-embeds -o "$backup_file"
echo "Simulating data loss..."
# (In real scenario, this might be system failure)
echo "Restoring from backup..."
tg-load-doc-embeds -i "$backup_file" --collection "restored"
echo "Backup/restore cycle complete"
}
```
## Environment Variables
- `TRUSTGRAPH_API`: Default API URL
## Related Commands
- [`tg-save-doc-embeds`](tg-save-doc-embeds.md) - Save document embeddings to MessagePack
- [`tg-dump-msgpack`](tg-dump-msgpack.md) - Analyze MessagePack files
- [`tg-load-pdf`](tg-load-pdf.md) - Load PDF documents for processing
- [`tg-show-flows`](tg-show-flows.md) - List available flows
## API Integration
This command uses TrustGraph's WebSocket API for document embeddings import, specifically the `/api/v1/flow/{flow-id}/import/document-embeddings` endpoint.
## Best Practices
1. **Validation**: Always validate MessagePack files before loading
2. **Backups**: Keep backups of original embedding files
3. **Monitoring**: Monitor memory usage and loading progress
4. **Chunking**: Process large files in manageable chunks
5. **Error Handling**: Implement robust error handling and retry logic
6. **Documentation**: Document the source and format of embedding files
7. **Testing**: Test loading procedures in non-production environments
## Troubleshooting
### Loading Stalls
```bash
# Check WebSocket connection
netstat -an | grep :8088
# Check system resources
free -h
df -h
```
### Incomplete Loading
```bash
# Compare input vs loaded data
input_count=$(tg-dump-msgpack -i input.msgpack | grep '^\["de"' | wc -l)
echo "Input embeddings: $input_count"
# Check loaded data (would need query command)
# loaded_count=$(tg-query-embeddings --count)
# echo "Loaded embeddings: $loaded_count"
```
### Performance Issues
```bash
# Monitor network usage
iftop
# Check TrustGraph service logs
docker logs trustgraph-service
```