Mirror of https://github.com/trustgraph-ai/trustgraph.git (synced 2026-04-25 08:26:21 +02:00)
# tg-save-doc-embeds

Saves document embeddings from TrustGraph processing streams to MessagePack format files.

## Synopsis

```bash
tg-save-doc-embeds -o OUTPUT_FILE [options]
```

## Description

The `tg-save-doc-embeds` command connects to TrustGraph's document embeddings export stream and saves the embeddings to a file in MessagePack format. This is useful for creating backups of document embeddings, exporting data for analysis, or preparing data for migration between systems.

The command should typically be started before document processing begins so that it captures all embeddings as they are generated.

## Options

### Required Arguments

- `-o, --output-file FILE`: Output file for saved embeddings

### Optional Arguments

- `-u, --url URL`: TrustGraph API URL (default: `$TRUSTGRAPH_API` or `http://localhost:8088/`)
- `-f, --flow-id ID`: Flow instance ID to monitor (default: `default`)
- `--format FORMAT`: Output format, `msgpack` or `json` (default: `msgpack`)
- `--user USER`: Filter by user ID (default: no filter)
- `--collection COLLECTION`: Filter by collection ID (default: no filter)
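The `-u` default can be reproduced in wrapper scripts with ordinary shell parameter expansion; a minimal sketch (the `api_url` variable name is illustrative):

```bash
# Fall back to $TRUSTGRAPH_API, then to the documented default URL.
api_url="${TRUSTGRAPH_API:-http://localhost:8088/}"
echo "Using API URL: $api_url"
```

Passing `-u` explicitly still overrides the default.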
## Examples

### Basic Document Embeddings Export

```bash
tg-save-doc-embeds -o document-embeddings.msgpack
```

### Export from Specific Flow

```bash
tg-save-doc-embeds \
    -o research-embeddings.msgpack \
    -f "research-processing-flow"
```

### Filter by User and Collection

```bash
tg-save-doc-embeds \
    -o filtered-embeddings.msgpack \
    --user "research-team" \
    --collection "research-docs"
```

### Export to JSON Format

```bash
tg-save-doc-embeds \
    -o embeddings.json \
    --format json
```

### Production Backup

```bash
tg-save-doc-embeds \
    -o "backup-$(date +%Y%m%d-%H%M%S).msgpack" \
    -u https://production-api.company.com/ \
    -f "production-flow"
```
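The `backup-$(date ...)` pattern can be wrapped in a small helper so every script names backups consistently (a sketch; the `backup_name` helper is made up here):

```bash
# Produce a timestamped backup filename, e.g. backup-20250101-120000.msgpack
backup_name() {
    printf 'backup-%s.msgpack' "$(date +%Y%m%d-%H%M%S)"
}

name=$(backup_name)
echo "$name"
```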
## Output Format

### MessagePack Structure

Document embeddings are saved as MessagePack records with the following structure (shown here as JSON):

```json
["de", {
    "m": {
        "i": "document-id",
        "m": [{"metadata": "objects"}],
        "u": "user-id",
        "c": "collection-id"
    },
    "c": [{
        "c": "text chunk content",
        "v": [0.1, 0.2, 0.3, ...]
    }]
}]
```

### Components

- **Record Type**: `"de"` indicates document embeddings
- **Metadata** (`m`): Document information and context
- **Chunks** (`c`): Text chunks with their vector embeddings
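Because each record starts with its type tag, record counts can be pulled from a JSON dump of the file (as `tg-dump-msgpack` output is piped to `grep` later on this page) with plain shell tools. A sketch over a hand-written sample; the `"t"` record is a hypothetical non-embedding record type:

```bash
# Sample dump output: two "de" records plus one record of another type.
sample='["de", {"m": {"i": "doc-1", "u": "alice", "c": "docs"}, "c": []}]
["de", {"m": {"i": "doc-2", "u": "bob", "c": "docs"}, "c": []}]
["t", {"x": "some other record type"}]'

# Count only document-embedding records.
de_count=$(printf '%s\n' "$sample" | grep -c '^\["de"')
echo "document embeddings: $de_count"
```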
## Use Cases

### Backup Creation

```bash
# Create regular backups of document embeddings
create_embeddings_backup() {
    local backup_dir="embeddings-backups"
    local timestamp=$(date +%Y%m%d_%H%M%S)
    local backup_file="$backup_dir/embeddings-$timestamp.msgpack"

    mkdir -p "$backup_dir"

    echo "Creating embeddings backup: $backup_file"

    # Start backup process in the background
    tg-save-doc-embeds -o "$backup_file" &
    save_pid=$!

    echo "Backup process started (PID: $save_pid)"
    echo "To stop: kill $save_pid"
    echo "Backup file: $backup_file"

    # Optionally wait for a specific duration
    # sleep 3600   # Run for 1 hour
    # kill $save_pid
}

# Create backup
create_embeddings_backup
```
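Instead of backgrounding the command and killing it later, the capture window can be bounded with coreutils `timeout` (used elsewhere on this page): it sends SIGTERM when the duration elapses and exits with status 124. A sketch with a stand-in command:

```bash
# Stand-in for: timeout 3600 tg-save-doc-embeds -o "$backup_file"
status=0
timeout 1 sleep 10 || status=$?
echo "exit status: $status"
```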
### Data Migration Preparation

```bash
# Prepare embeddings for migration
prepare_migration_data() {
    local source_env="$1"
    local collection="$2"
    local migration_file="migration-$(date +%Y%m%d).msgpack"

    echo "Preparing migration data from: $source_env"
    echo "Collection: $collection"

    # Export embeddings from source
    tg-save-doc-embeds \
        -o "$migration_file" \
        -u "http://$source_env:8088/" \
        --collection "$collection" &

    export_pid=$!

    # Let it run long enough to capture the data
    echo "Capturing embeddings for migration..."
    echo "Process PID: $export_pid"

    # In practice, run this for the duration needed:
    # sleep 1800   # 30 minutes
    # kill $export_pid

    echo "Migration data will be saved to: $migration_file"
}

# Prepare migration from dev to production
prepare_migration_data "dev-server" "processed-docs"
```

### Continuous Export

```bash
# Continuous embeddings export with rotation
continuous_export() {
    local output_dir="continuous-exports"
    local rotation_hours=24
    local file_prefix="embeddings"

    mkdir -p "$output_dir"

    while true; do
        timestamp=$(date +%Y%m%d_%H%M%S)
        output_file="$output_dir/${file_prefix}-${timestamp}.msgpack"

        echo "Starting export to: $output_file"

        # Run the export for one rotation period
        timeout ${rotation_hours}h tg-save-doc-embeds -o "$output_file"

        # Compress the completed file
        gzip "$output_file"

        echo "Export completed and compressed: ${output_file}.gz"

        # Optional: clean up files older than 30 days
        find "$output_dir" -name "*.msgpack.gz" -mtime +30 -delete

        # Brief pause before the next rotation
        sleep 60
    done
}

# Start continuous export (run in background)
continuous_export &
```

### Analysis and Research

```bash
# Export embeddings for research analysis
export_for_research() {
    local research_topic="$1"
    local output_file="research-${research_topic}-$(date +%Y%m%d).msgpack"

    echo "Exporting embeddings for research: $research_topic"

    # Start export with filtering
    tg-save-doc-embeds \
        -o "$output_file" \
        --collection "$research_topic" &

    export_pid=$!

    echo "Research export started (PID: $export_pid)"
    echo "Output: $output_file"

    # Create analysis script
    cat > "analyze-${research_topic}.sh" << EOF
#!/bin/bash
# Analysis script for $research_topic embeddings

echo "Analyzing $research_topic embeddings..."

# Basic statistics
echo "=== Basic Statistics ==="
tg-dump-msgpack -i "$output_file" --summary

# Detailed analysis
echo "=== Detailed Analysis ==="
tg-dump-msgpack -i "$output_file" | head -10

echo "Analysis complete for $research_topic"
EOF

    chmod +x "analyze-${research_topic}.sh"
    echo "Analysis script created: analyze-${research_topic}.sh"
}

# Export for different research topics
export_for_research "cybersecurity"
export_for_research "climate-change"
```
## Advanced Usage

### Selective Export

```bash
# Export embeddings with multiple filters
selective_export() {
    local users=("user1" "user2" "user3")
    local collections=("docs1" "docs2")

    for user in "${users[@]}"; do
        for collection in "${collections[@]}"; do
            output_file="embeddings-${user}-${collection}.msgpack"

            echo "Exporting for user: $user, collection: $collection"

            tg-save-doc-embeds \
                -o "$output_file" \
                --user "$user" \
                --collection "$collection" &

            # Store PID for later management
            echo $! > "${output_file}.pid"
        done
    done

    echo "All selective exports started"
}
```
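The `.pid` files written above need a matching teardown; a sketch of a stop helper (the glob assumes the naming scheme used in `selective_export`):

```bash
# Stop every export whose PID was recorded by selective_export.
stop_exports() {
    for pid_file in *.msgpack.pid; do
        [ -f "$pid_file" ] || continue
        pid=$(cat "$pid_file")
        # The process may already have exited; ignore kill errors.
        kill "$pid" 2>/dev/null || true
        rm -f "$pid_file"
    done
}
```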
### Monitoring and Statistics

```bash
# Monitor export progress with statistics
monitor_export() {
    local output_file="$1"
    local pid_file="${output_file}.pid"

    if [ ! -f "$pid_file" ]; then
        echo "PID file not found: $pid_file"
        return 1
    fi

    local export_pid=$(cat "$pid_file")

    echo "Monitoring export (PID: $export_pid)..."
    echo "Output file: $output_file"

    while kill -0 "$export_pid" 2>/dev/null; do
        if [ -f "$output_file" ]; then
            file_size=$(stat -c%s "$output_file" 2>/dev/null || echo "0")
            human_size=$(numfmt --to=iec-i --suffix=B "$file_size")

            # Count embeddings; grep -c prints 0 itself when nothing matches
            embedding_count=$(tg-dump-msgpack -i "$output_file" 2>/dev/null | grep -c '^\["de"')

            echo "File size: $human_size, Embeddings: $embedding_count"
        else
            echo "Output file not yet created..."
        fi

        sleep 30
    done

    echo "Export process completed"
    rm "$pid_file"
}

# Start export and monitor
tg-save-doc-embeds -o "monitored-export.msgpack" &
echo $! > "monitored-export.msgpack.pid"
monitor_export "monitored-export.msgpack"
```
### Export Validation

```bash
# Validate exported embeddings
validate_export() {
    local export_file="$1"

    echo "Validating export file: $export_file"

    # Check the file exists and has content
    if [ ! -s "$export_file" ]; then
        echo "✗ Export file is empty or missing"
        return 1
    fi

    # Check MessagePack format
    if tg-dump-msgpack -i "$export_file" --summary > /dev/null 2>&1; then
        echo "✓ Valid MessagePack format"
    else
        echo "✗ Invalid MessagePack format"
        return 1
    fi

    # Check for document embeddings; grep -c prints 0 when nothing matches
    embedding_count=$(tg-dump-msgpack -i "$export_file" | grep -c '^\["de"')

    if [ "$embedding_count" -gt 0 ]; then
        echo "✓ Contains $embedding_count document embeddings"
    else
        echo "✗ No document embeddings found"
        return 1
    fi

    # Get vector dimension information
    summary=$(tg-dump-msgpack -i "$export_file" --summary)
    if echo "$summary" | grep -q "Vector dimension:"; then
        dimension=$(echo "$summary" | grep "Vector dimension:" | awk '{print $3}')
        echo "✓ Vector dimension: $dimension"
    else
        echo "⚠ Could not determine vector dimension"
    fi

    echo "Validation completed successfully"
}
```
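A subtlety when counting records in scripts: `grep -c` already prints `0` when nothing matches, but it also exits non-zero, so appending `|| echo 0` inside a command substitution produces the count twice. A sketch of a helper that keeps `set -e` scripts happy without double-printing:

```bash
# Count lines matching a pattern on stdin; neutralise grep's non-zero
# exit status on zero matches without emitting a second "0".
count_lines() {
    grep -c "$1" || true
}

hits=$(printf '["de", 1]\n["de", 2]\nother\n' | count_lines '^\["de"')
misses=$(printf 'other\n' | count_lines '^\["de"')
echo "hits=$hits misses=$misses"
```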
### Export Scheduling

```bash
# Scheduled export with cron-like functionality
schedule_export() {
    local schedule="$1"   # e.g. "hourly", "daily", "weekly"
    local output_prefix="$2"

    case "$schedule" in
        "hourly")
            interval=3600
            ;;
        "daily")
            interval=86400
            ;;
        "weekly")
            interval=604800
            ;;
        *)
            echo "Invalid schedule: $schedule"
            return 1
            ;;
    esac

    echo "Starting $schedule exports with prefix: $output_prefix"

    while true; do
        timestamp=$(date +%Y%m%d_%H%M%S)
        output_file="${output_prefix}-${timestamp}.msgpack"

        echo "Starting scheduled export: $output_file"

        # Run the export for the scheduled interval
        timeout ${interval}s tg-save-doc-embeds -o "$output_file"

        # Validate and compress
        if validate_export "$output_file"; then
            gzip "$output_file"
            echo "✓ Export completed and compressed: ${output_file}.gz"
        else
            echo "✗ Export validation failed: $output_file"
            mv "$output_file" "${output_file}.failed"
        fi

        # Brief pause before the next cycle
        sleep 60
    done
}

# Start daily scheduled exports
schedule_export "daily" "daily-embeddings" &
```
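The interval lookup in `schedule_export` can be factored into a reusable helper so other scripts share the same mapping (a sketch; the `schedule_seconds` name is made up):

```bash
# Map a schedule name to its interval in seconds.
schedule_seconds() {
    case "$1" in
        hourly)  echo 3600 ;;
        daily)   echo 86400 ;;
        weekly)  echo 604800 ;;
        *)       echo "Invalid schedule: $1" >&2; return 1 ;;
    esac
}

echo "daily = $(schedule_seconds daily)s"
```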
## Performance Considerations

### Memory Management

```bash
# Monitor memory usage during export
monitor_memory_export() {
    local output_file="$1"

    # Start export
    tg-save-doc-embeds -o "$output_file" &
    export_pid=$!

    echo "Monitoring memory usage for export (PID: $export_pid)..."

    while kill -0 "$export_pid" 2>/dev/null; do
        memory_usage=$(ps -p "$export_pid" -o rss= 2>/dev/null | awk '{print $1/1024}')

        if [ -n "$memory_usage" ]; then
            echo "Memory usage: ${memory_usage}MB"
        fi

        sleep 10
    done

    echo "Export completed"
}
```
### Network Optimization

```bash
# Optimize for network conditions
network_optimized_export() {
    local output_file="$1"
    local api_url="$2"

    echo "Starting network-optimized export..."

    # MessagePack is more compact than JSON, so it transfers less data
    tg-save-doc-embeds \
        -o "$output_file" \
        -u "$api_url" \
        --format msgpack &

    export_pid=$!

    # Monitor network usage
    echo "Monitoring export (PID: $export_pid)..."

    while kill -0 "$export_pid" 2>/dev/null; do
        # Count open connections to the API port
        connections=$(netstat -an | grep ":8088" | wc -l)
        echo "Active connections: $connections"
        sleep 30
    done
}
```
## Error Handling

### Connection Issues

```
Exception: WebSocket connection failed
```

**Solution**: Check the API URL and ensure the TrustGraph WebSocket service is running.

### Disk Space Issues

```
Exception: No space left on device
```

**Solution**: Free up disk space or use a different output location.

### Permission Errors

```
Exception: Permission denied
```

**Solution**: Check write permissions for the output file location.

### Memory Issues

```
MemoryError: Unable to allocate memory
```

**Solution**: Monitor memory usage and consider using smaller export windows.
## Integration with Other Commands

### Complete Backup Workflow

```bash
# Complete backup and restore workflow
backup_restore_workflow() {
    local backup_file="embeddings-backup.msgpack"

    echo "=== Backup Phase ==="

    # Create backup
    tg-save-doc-embeds -o "$backup_file" &
    backup_pid=$!

    # Let it run for a while
    sleep 300   # 5 minutes
    kill $backup_pid

    echo "Backup created: $backup_file"

    # Validate backup
    validate_export "$backup_file"

    echo "=== Restore Phase ==="

    # Restore from backup (to a different collection)
    tg-load-doc-embeds -i "$backup_file" --collection "restored"

    echo "Backup and restore workflow completed"
}
```
### Analysis Pipeline

```bash
# Export and analyze embeddings
export_analyze_pipeline() {
    local topic="$1"
    local export_file="analysis-${topic}.msgpack"

    echo "Starting export and analysis pipeline for: $topic"

    # Export embeddings
    tg-save-doc-embeds \
        -o "$export_file" \
        --collection "$topic" &

    export_pid=$!

    # Run for the analysis duration
    sleep 600   # 10 minutes
    kill $export_pid

    # Analyze exported data
    echo "Analyzing exported embeddings..."
    tg-dump-msgpack -i "$export_file" --summary

    # Count embeddings by user
    echo "Embeddings by user:"
    tg-dump-msgpack -i "$export_file" | \
        jq -r '.[1].m.u' | \
        sort | uniq -c

    echo "Analysis pipeline completed"
}
```
## Environment Variables

- `TRUSTGRAPH_API`: Default API URL
## Related Commands

- [`tg-load-doc-embeds`](tg-load-doc-embeds.md) - Load document embeddings from files
- [`tg-dump-msgpack`](tg-dump-msgpack.md) - Analyze MessagePack files
- [`tg-show-flows`](tg-show-flows.md) - List available flows for monitoring

## API Integration

This command uses TrustGraph's WebSocket API for document embeddings export, specifically the `/api/v1/flow/{flow-id}/export/document-embeddings` endpoint.

## Best Practices

1. **Start Early**: Begin the export before processing starts so all data is captured
2. **Monitoring**: Monitor export progress and file sizes
3. **Validation**: Always validate exported files
4. **Compression**: Use compression for long-term storage
5. **Rotation**: Implement file rotation for continuous exports
6. **Backup**: Keep multiple backup copies in different locations
7. **Documentation**: Document export schedules and procedures
## Troubleshooting

### No Data Captured

```bash
# Check whether processing is generating embeddings
tg-show-flows | grep processing

# Verify the WebSocket connection
netstat -an | grep :8088
```

### Large File Issues

```bash
# Monitor file growth
watch -n 5 'ls -lh *.msgpack'

# Check available disk space
df -h
```

### Process Management

```bash
# List running export processes
ps aux | grep tg-save-doc-embeds

# Kill stuck processes
pkill -f tg-save-doc-embeds
```
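`pkill -f` is a blunt instrument; when your own script started the export, the kill-then-wait pattern stops it cleanly and reaps the child so no zombie lingers. A sketch with a stand-in command:

```bash
# Stand-in for: tg-save-doc-embeds -o out.msgpack &
sleep 30 &
job_pid=$!

# Stop the job and reap it; wait returns once the process is gone.
kill "$job_pid"
wait "$job_pid" 2>/dev/null || true

if kill -0 "$job_pid" 2>/dev/null; then
    echo "still running"
else
    echo "stopped"
fi
```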