# tg-save-doc-embeds

Saves document embeddings from TrustGraph processing streams to MessagePack format files.

## Synopsis

```bash
tg-save-doc-embeds -o OUTPUT_FILE [options]
```

## Description

The `tg-save-doc-embeds` command connects to TrustGraph's document embeddings export stream and saves the embeddings to a file in MessagePack format. This is useful for creating backups of document embeddings, exporting data for analysis, or preparing data for migration between systems.

The command should typically be started before document processing begins to capture all embeddings as they are generated.
## Options

### Required Arguments

- `-o, --output-file FILE`: Output file for saved embeddings

### Optional Arguments

- `-u, --url URL`: TrustGraph API URL (default: `$TRUSTGRAPH_API` or `http://localhost:8088/`)
- `-f, --flow-id ID`: Flow instance ID to monitor (default: `default`)
- `--format FORMAT`: Output format, `msgpack` or `json` (default: `msgpack`)
- `--user USER`: Filter by user ID (default: no filter)
- `--collection COLLECTION`: Filter by collection ID (default: no filter)
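The URL default can be mirrored in scripts with ordinary shell parameter expansion. A minimal sketch; the helper name `resolve_api_url` is ours, not part of the CLI:

```shell
# Resolve the API URL the way the default above is documented:
# use $TRUSTGRAPH_API when set, otherwise the local endpoint.
resolve_api_url() {
    printf '%s' "${TRUSTGRAPH_API:-http://localhost:8088/}"
}
```

Usage: `tg-save-doc-embeds -u "$(resolve_api_url)" -o embeddings.msgpack`.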
## Examples

### Basic Document Embeddings Export

```bash
tg-save-doc-embeds -o document-embeddings.msgpack
```

### Export from Specific Flow

```bash
tg-save-doc-embeds \
  -o research-embeddings.msgpack \
  -f "research-processing-flow"
```

### Filter by User and Collection

```bash
tg-save-doc-embeds \
  -o filtered-embeddings.msgpack \
  --user "research-team" \
  --collection "research-docs"
```

### Export to JSON Format

```bash
tg-save-doc-embeds \
  -o embeddings.json \
  --format json
```

### Production Backup

```bash
tg-save-doc-embeds \
  -o "backup-$(date +%Y%m%d-%H%M%S).msgpack" \
  -u https://production-api.company.com/ \
  -f "production-flow"
```
## Output Format

### MessagePack Structure

Document embeddings are saved as MessagePack records:

```json
["de", {
  "m": {
    "i": "document-id",
    "m": [{"metadata": "objects"}],
    "u": "user-id",
    "c": "collection-id"
  },
  "c": [{
    "c": "text chunk content",
    "v": [0.1, 0.2, 0.3, ...]
  }]
}]
```

### Components

- **Record Type**: `"de"` indicates document embeddings
- **Metadata** (`m`): Document information and context
- **Chunks** (`c`): Text chunks with their vector embeddings
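For a quick sanity check of an export, the record type can be counted directly. This sketch assumes a `--format json` export writes one `["de", ...]` record per line, matching the structure above; that line-per-record layout is our assumption, not documented behavior:

```shell
# Count document-embedding records in a JSON-format export.
# Assumes one ["de", ...] record per line (our assumption).
count_de_records() {
    # grep -c prints 0 itself on no match; || true only ignores
    # the non-zero exit status in that case
    grep -c '^\["de"' "$1" || true
}
```

Usage: `count_de_records embeddings.json`.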
## Use Cases

### Backup Creation

```bash
# Create regular backups of document embeddings
create_embeddings_backup() {
    local backup_dir="embeddings-backups"
    local timestamp=$(date +%Y%m%d_%H%M%S)
    local backup_file="$backup_dir/embeddings-$timestamp.msgpack"

    mkdir -p "$backup_dir"

    echo "Creating embeddings backup: $backup_file"

    # Start backup process
    tg-save-doc-embeds -o "$backup_file" &
    save_pid=$!

    echo "Backup process started (PID: $save_pid)"
    echo "To stop: kill $save_pid"
    echo "Backup file: $backup_file"

    # Optionally wait for a specific duration
    # sleep 3600  # Run for 1 hour
    # kill $save_pid
}

# Create backup
create_embeddings_backup
```
### Data Migration Preparation

```bash
# Prepare embeddings for migration
prepare_migration_data() {
    local source_env="$1"
    local collection="$2"
    local migration_file="migration-$(date +%Y%m%d).msgpack"

    echo "Preparing migration data from: $source_env"
    echo "Collection: $collection"

    # Export embeddings from source
    tg-save-doc-embeds \
        -o "$migration_file" \
        -u "http://$source_env:8088/" \
        --collection "$collection" &

    export_pid=$!

    # Let it run for specified time to capture data
    echo "Capturing embeddings for migration..."
    echo "Process PID: $export_pid"

    # In practice, you'd run this for the duration needed
    # sleep 1800  # 30 minutes
    # kill $export_pid

    echo "Migration data will be saved to: $migration_file"
}

# Prepare migration from dev to production
prepare_migration_data "dev-server" "processed-docs"
```
### Continuous Export

```bash
# Continuous embeddings export with rotation
continuous_export() {
    local output_dir="continuous-exports"
    local rotation_hours=24
    local file_prefix="embeddings"

    mkdir -p "$output_dir"

    while true; do
        timestamp=$(date +%Y%m%d_%H%M%S)
        output_file="$output_dir/${file_prefix}-${timestamp}.msgpack"

        echo "Starting export to: $output_file"

        # Start export for specified duration
        timeout ${rotation_hours}h tg-save-doc-embeds -o "$output_file"

        # Compress completed file
        gzip "$output_file"

        echo "Export completed and compressed: ${output_file}.gz"

        # Optional: clean up old files
        find "$output_dir" -name "*.msgpack.gz" -mtime +30 -delete

        # Brief pause before next rotation
        sleep 60
    done
}

# Start continuous export (run in background)
continuous_export &
```
### Analysis and Research

```bash
# Export embeddings for research analysis
export_for_research() {
    local research_topic="$1"
    local output_file="research-${research_topic}-$(date +%Y%m%d).msgpack"

    echo "Exporting embeddings for research: $research_topic"

    # Start export with filtering
    tg-save-doc-embeds \
        -o "$output_file" \
        --collection "$research_topic" &

    export_pid=$!

    echo "Research export started (PID: $export_pid)"
    echo "Output: $output_file"

    # Create analysis script
    cat > "analyze-${research_topic}.sh" << EOF
#!/bin/bash
# Analysis script for $research_topic embeddings

echo "Analyzing $research_topic embeddings..."

# Basic statistics
echo "=== Basic Statistics ==="
tg-dump-msgpack -i "$output_file" --summary

# Detailed analysis
echo "=== Detailed Analysis ==="
tg-dump-msgpack -i "$output_file" | head -10

echo "Analysis complete for $research_topic"
EOF

    chmod +x "analyze-${research_topic}.sh"
    echo "Analysis script created: analyze-${research_topic}.sh"
}

# Export for different research topics
export_for_research "cybersecurity"
export_for_research "climate-change"
```
## Advanced Usage

### Selective Export

```bash
# Export embeddings with multiple filters
selective_export() {
    local users=("user1" "user2" "user3")
    local collections=("docs1" "docs2")

    for user in "${users[@]}"; do
        for collection in "${collections[@]}"; do
            output_file="embeddings-${user}-${collection}.msgpack"

            echo "Exporting for user: $user, collection: $collection"

            tg-save-doc-embeds \
                -o "$output_file" \
                --user "$user" \
                --collection "$collection" &

            # Store PID for later management
            echo $! > "${output_file}.pid"
        done
    done

    echo "All selective exports started"
}
```
### Monitoring and Statistics

```bash
# Monitor export progress with statistics
monitor_export() {
    local output_file="$1"
    local pid_file="${output_file}.pid"

    if [ ! -f "$pid_file" ]; then
        echo "PID file not found: $pid_file"
        return 1
    fi

    local export_pid=$(cat "$pid_file")

    echo "Monitoring export (PID: $export_pid)..."
    echo "Output file: $output_file"

    while kill -0 "$export_pid" 2>/dev/null; do
        if [ -f "$output_file" ]; then
            file_size=$(stat -c%s "$output_file" 2>/dev/null || echo "0")
            human_size=$(numfmt --to=iec-i --suffix=B "$file_size")

            # Try to count embeddings; grep -c prints 0 itself on no
            # match, so only the non-zero exit status needs ignoring
            embedding_count=$(tg-dump-msgpack -i "$output_file" 2>/dev/null | grep -c '^\["de"' || true)

            echo "File size: $human_size, Embeddings: $embedding_count"
        else
            echo "Output file not yet created..."
        fi

        sleep 30
    done

    echo "Export process completed"
    rm "$pid_file"
}

# Start export and monitor
tg-save-doc-embeds -o "monitored-export.msgpack" &
echo $! > "monitored-export.msgpack.pid"
monitor_export "monitored-export.msgpack"
```
### Export Validation

```bash
# Validate exported embeddings
validate_export() {
    local export_file="$1"

    echo "Validating export file: $export_file"

    # Check file exists and has content
    if [ ! -s "$export_file" ]; then
        echo "✗ Export file is empty or missing"
        return 1
    fi

    # Check MessagePack format
    if tg-dump-msgpack -i "$export_file" --summary > /dev/null 2>&1; then
        echo "✓ Valid MessagePack format"
    else
        echo "✗ Invalid MessagePack format"
        return 1
    fi

    # Check for document embeddings (grep -c prints 0 itself on no match)
    embedding_count=$(tg-dump-msgpack -i "$export_file" | grep -c '^\["de"' || true)

    if [ "$embedding_count" -gt 0 ]; then
        echo "✓ Contains $embedding_count document embeddings"
    else
        echo "✗ No document embeddings found"
        return 1
    fi

    # Get vector dimension information
    summary=$(tg-dump-msgpack -i "$export_file" --summary)
    if echo "$summary" | grep -q "Vector dimension:"; then
        dimension=$(echo "$summary" | grep "Vector dimension:" | awk '{print $3}')
        echo "✓ Vector dimension: $dimension"
    else
        echo "⚠ Could not determine vector dimension"
    fi

    echo "Validation completed successfully"
}
```
### Export Scheduling

```bash
# Scheduled export with cron-like functionality
schedule_export() {
    local schedule="$1"   # e.g., "daily", "hourly", "weekly"
    local output_prefix="$2"

    case "$schedule" in
        "hourly")
            interval=3600
            ;;
        "daily")
            interval=86400
            ;;
        "weekly")
            interval=604800
            ;;
        *)
            echo "Invalid schedule: $schedule"
            return 1
            ;;
    esac

    echo "Starting $schedule exports with prefix: $output_prefix"

    while true; do
        timestamp=$(date +%Y%m%d_%H%M%S)
        output_file="${output_prefix}-${timestamp}.msgpack"

        echo "Starting scheduled export: $output_file"

        # Run export for the scheduled interval
        timeout ${interval}s tg-save-doc-embeds -o "$output_file"

        # Validate and compress
        if validate_export "$output_file"; then
            gzip "$output_file"
            echo "✓ Export completed and compressed: ${output_file}.gz"
        else
            echo "✗ Export validation failed: $output_file"
            mv "$output_file" "${output_file}.failed"
        fi

        # Brief pause before next cycle
        sleep 60
    done
}

# Start daily scheduled exports
schedule_export "daily" "daily-embeddings" &
```
## Performance Considerations

### Memory Management

```bash
# Monitor memory usage during export
monitor_memory_export() {
    local output_file="$1"

    # Start export
    tg-save-doc-embeds -o "$output_file" &
    export_pid=$!

    echo "Monitoring memory usage for export (PID: $export_pid)..."

    while kill -0 "$export_pid" 2>/dev/null; do
        memory_usage=$(ps -p "$export_pid" -o rss= 2>/dev/null | awk '{print $1/1024}')

        if [ -n "$memory_usage" ]; then
            echo "Memory usage: ${memory_usage}MB"
        fi

        sleep 10
    done

    echo "Export completed"
}
```
### Network Optimization

```bash
# Optimize for network conditions
network_optimized_export() {
    local output_file="$1"
    local api_url="$2"

    echo "Starting network-optimized export..."

    # Prefer MessagePack output; it is more compact than JSON
    tg-save-doc-embeds \
        -o "$output_file" \
        -u "$api_url" \
        --format msgpack &

    export_pid=$!

    # Monitor network usage
    echo "Monitoring export (PID: $export_pid)..."

    while kill -0 "$export_pid" 2>/dev/null; do
        # Monitor network connections
        connections=$(netstat -an | grep ":8088" | wc -l)
        echo "Active connections: $connections"
        sleep 30
    done
}
```
## Error Handling

### Connection Issues

```
Exception: WebSocket connection failed
```

**Solution**: Check the API URL and ensure the TrustGraph WebSocket service is running.

### Disk Space Issues

```
Exception: No space left on device
```

**Solution**: Free up disk space or use a different output location.

### Permission Errors

```
Exception: Permission denied
```

**Solution**: Check write permissions for the output file location.

### Memory Issues

```
MemoryError: Unable to allocate memory
```

**Solution**: Monitor memory usage and consider using smaller export windows.
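When the WebSocket connection fails intermittently at startup, a small retry loop can help. A hedged sketch: `retry_run` is a generic helper of ours, and the retry count and delay are arbitrary choices:

```shell
# Retry a command up to a maximum number of attempts, pausing between tries.
retry_run() {
    cmd="$1"
    max="${2:-3}"
    attempts=0
    until "$cmd"; do
        attempts=$((attempts + 1))
        [ "$attempts" -ge "$max" ] && return 1
        sleep 1
    done
}

# Example wrapper (not executed here; needs a running TrustGraph):
# save_embeds() { tg-save-doc-embeds -o backup.msgpack; }
# retry_run save_embeds 3
```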
## Integration with Other Commands

### Complete Backup Workflow

```bash
# Complete backup and restore workflow
backup_restore_workflow() {
    local backup_file="embeddings-backup.msgpack"

    echo "=== Backup Phase ==="

    # Create backup
    tg-save-doc-embeds -o "$backup_file" &
    backup_pid=$!

    # Let it run for a while
    sleep 300  # 5 minutes
    kill $backup_pid

    echo "Backup created: $backup_file"

    # Validate backup
    validate_export "$backup_file"

    echo "=== Restore Phase ==="

    # Restore from backup (to different collection)
    tg-load-doc-embeds -i "$backup_file" --collection "restored"

    echo "Backup and restore workflow completed"
}
```
### Analysis Pipeline

```bash
# Export and analyze embeddings
export_analyze_pipeline() {
    local topic="$1"
    local export_file="analysis-${topic}.msgpack"

    echo "Starting export and analysis pipeline for: $topic"

    # Export embeddings
    tg-save-doc-embeds \
        -o "$export_file" \
        --collection "$topic" &

    export_pid=$!

    # Run for analysis duration
    sleep 600  # 10 minutes
    kill $export_pid

    # Analyze exported data
    echo "Analyzing exported embeddings..."
    tg-dump-msgpack -i "$export_file" --summary

    # Count embeddings by user
    echo "Embeddings by user:"
    tg-dump-msgpack -i "$export_file" | \
        jq -r '.[1].m.u' | \
        sort | uniq -c

    echo "Analysis pipeline completed"
}
```
## Environment Variables

- `TRUSTGRAPH_API`: Default API URL

## Related Commands

- [`tg-load-doc-embeds`](tg-load-doc-embeds.md) - Load document embeddings from files
- [`tg-dump-msgpack`](tg-dump-msgpack.md) - Analyze MessagePack files
- [`tg-show-flows`](tg-show-flows.md) - List available flows for monitoring
## API Integration

This command uses TrustGraph's WebSocket API for document embeddings export, specifically the `/api/v1/flow/{flow-id}/export/document-embeddings` endpoint.
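The endpoint path can be combined with the API URL to form the WebSocket URL. A sketch of the likely construction; the exact URL handling inside the CLI is an assumption here:

```shell
# Build the export WebSocket URL from an API base URL and a flow ID.
# http:// becomes ws:// (and https:// becomes wss://).
export_endpoint() {
    base="${1%/}"   # drop a trailing slash
    ws_base=$(printf '%s' "$base" | sed 's|^http|ws|')
    printf '%s/api/v1/flow/%s/export/document-embeddings' "$ws_base" "$2"
}
```

For example, `export_endpoint "http://localhost:8088/" "default"` yields `ws://localhost:8088/api/v1/flow/default/export/document-embeddings`.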
## Best Practices

1. **Start Early**: Begin export before processing starts to capture all data
2. **Monitoring**: Monitor export progress and file sizes
3. **Validation**: Always validate exported files
4. **Compression**: Use compression for long-term storage
5. **Rotation**: Implement file rotation for continuous exports
6. **Backup**: Keep multiple backup copies in different locations
7. **Documentation**: Document export schedules and procedures
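For rotation and scheduled runs, a cron job can drive timestamped invocations instead of a long-running loop. A sketch; the `/backups` path, the 02:00 schedule, and the wrapper script name are all illustrative:

```shell
# Build a dated backup path for a nightly export job.
nightly_backup_path() {
    printf '/backups/embeds-%s.msgpack' "$(date +%Y%m%d)"
}

# Illustrative crontab entry calling a wrapper script at 02:00
# (cron treats % specially, which is one reason a wrapper helps):
# 0 2 * * * /usr/local/bin/nightly-embeds-backup.sh
```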
## Troubleshooting

### No Data Captured

```bash
# Check if processing is generating embeddings
tg-show-flows | grep processing

# Verify WebSocket connection
netstat -an | grep :8088
```

### Large File Issues

```bash
# Monitor file growth
watch -n 5 'ls -lh *.msgpack'

# Check available disk space
df -h
```

### Process Management

```bash
# List running export processes
ps aux | grep tg-save-doc-embeds

# Kill stuck processes
pkill -f tg-save-doc-embeds
```