mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-26 00:46:22 +02:00
Update docs for API/CLI changes in 1.0 (#421)
* Update some API basics for the 0.23/1.0 API change
This commit is contained in:
parent
f907ea7db8
commit
44bdd29f51
69 changed files with 19981 additions and 407 deletions
563
docs/cli/tg-start-library-processing.md
Normal file
@@ -0,0 +1,563 @@
# tg-start-library-processing

Submits a library document for processing through TrustGraph workflows.

## Synopsis

```bash
tg-start-library-processing -d DOCUMENT_ID --id PROCESSING_ID [options]
```

## Description

The `tg-start-library-processing` command initiates processing of a document stored in TrustGraph's document library. This triggers workflows that can extract text, generate embeddings, create knowledge graphs, and enable document search and analysis.

Each processing job is identified by a caller-supplied processing ID (`--id`), which must be unique and is used to track and manage the job.

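
Because the processing ID is supplied by the caller, scripts need some way to produce IDs that do not collide. The following is a minimal sketch of one approach. `next_processing_id` and the `PROCESSING_ID` variable are hypothetical names, not part of TrustGraph, and the ID layout (prefix, UTC timestamp, shell PID, sequence number) is an assumption chosen only to make collisions unlikely on a single host.

```bash
# Hypothetical helper, not part of the TrustGraph CLI.
# Stores a fresh ID in PROCESSING_ID rather than printing it, so the sequence
# counter is updated in the calling shell (a $(...) subshell would lose it).
_proc_id_seq=0

next_processing_id() {
    _proc_id_seq=$((_proc_id_seq + 1))
    PROCESSING_ID=$(printf '%s_%s_%s_%04d' \
        "${1:-proc}" "$(date -u +%Y%m%dT%H%M%SZ)" "$$" "$_proc_id_seq")
}

next_processing_id "batch"
echo "$PROCESSING_ID"   # e.g. batch_<UTC timestamp>_<PID>_0001
```

The value can then be passed as `--id "$PROCESSING_ID"`. IDs generated this way are distinct within one shell session and across shells on the same host; they are not guaranteed to be unique across machines.
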
## Options

### Required Arguments

- `-d, --document-id ID`: Document ID from the library to process
- `--id, --processing-id ID`: Unique identifier for this processing job

### Optional Arguments

- `-u, --api-url URL`: TrustGraph API URL (default: `$TRUSTGRAPH_URL` or `http://localhost:8088/`)
- `-U, --user USER`: User ID for processing context (default: `trustgraph`)
- `-i, --flow-id ID`: Flow instance to use for processing (default: `default`)
- `--collection COLLECTION`: Collection to assign processed data to (default: `default`)
- `--tags TAGS`: Comma-separated tags for the processing job

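
The default for `-u, --api-url` can be mirrored in wrapper scripts so that they target the same endpoint as the CLI. The sketch below is illustrative only: `resolve_api_url` is a hypothetical helper, and treating an empty `TRUSTGRAPH_URL` the same as an unset one is an assumption of this sketch rather than documented CLI behaviour.

```bash
# Hypothetical helper, not part of the TrustGraph CLI.
# Precedence: explicit argument, then $TRUSTGRAPH_URL, then the documented default.
resolve_api_url() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"
    else
        # Assumption: an empty TRUSTGRAPH_URL is treated as unset
        printf '%s\n' "${TRUSTGRAPH_URL:-http://localhost:8088/}"
    fi
}

resolve_api_url   # prints $TRUSTGRAPH_URL if set, else http://localhost:8088/
```

A script could then pass the result explicitly, for example `-u "$(resolve_api_url)"`, which makes the endpoint visible in logs.
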
## Examples

### Basic Document Processing

```bash
tg-start-library-processing -d "doc_123456789" --id "proc_001"
```

### Processing with Custom Collection

```bash
tg-start-library-processing \
  -d "research_paper_456" \
  --id "research_proc_001" \
  --collection "research-papers" \
  --tags "nlp,research,2023"
```

### Processing with Specific Flow

```bash
tg-start-library-processing \
  -d "technical_manual" \
  --id "manual_proc_001" \
  -i "document-analysis-flow" \
  -U "technical-team" \
  --collection "technical-docs"
```

### Processing Multiple Documents

```bash
# Process several documents in sequence
documents=("doc_001" "doc_002" "doc_003")
for i in "${!documents[@]}"; do
  doc_id="${documents[$i]}"
  proc_id="batch_proc_$(printf '%03d' "$((i + 1))")"

  echo "Processing document: $doc_id"
  tg-start-library-processing \
    -d "$doc_id" \
    --id "$proc_id" \
    --collection "batch-processing" \
    --tags "batch,automated"
done
```

## Processing Workflow

### Document Processing Steps

1. **Document Retrieval**: Fetch document from library
2. **Content Extraction**: Extract text and metadata
3. **Text Processing**: Clean and normalize content
4. **Embedding Generation**: Create vector embeddings
5. **Knowledge Extraction**: Generate triples and entities
6. **Index Creation**: Make content searchable

### Processing Types

Different document types may trigger different processing workflows:

- **PDF Documents**: Text extraction, OCR if needed
- **Text Files**: Direct text processing
- **Images**: OCR and image analysis
- **Structured Data**: Schema extraction and mapping

## Use Cases

### Batch Document Processing

```bash
# Submit every document in the library for processing
process_all_documents() {
    local collection="$1"
    local batch_id="batch_$(date +%Y%m%d_%H%M%S)"

    echo "Starting batch processing for collection: $collection"

    # Get all document IDs (relies on the table layout printed by tg-show-library-documents)
    tg-show-library-documents | \
        grep "| id" | \
        awk '{print $3}' | \
        while read -r doc_id; do
            local proc_id="${batch_id}_${doc_id}"

            echo "Processing document: $doc_id"
            tg-start-library-processing \
                -d "$doc_id" \
                --id "$proc_id" \
                --collection "$collection" \
                --tags "batch,automated,$(date +%Y%m%d)"

            # Add delay to avoid overwhelming the system
            sleep 2
        done
}

# Process all documents
process_all_documents "processed-docs"
```

### Department-Specific Processing

```bash
# Process documents by department
process_by_department() {
    local dept="$1"
    local flow="$2"

    echo "Processing documents for department: $dept"

    # List documents belonging to the department's user ID
    tg-show-library-documents -U "$dept" | \
        grep "| id" | \
        awk '{print $3}' | \
        while read -r doc_id; do
            local proc_id="${dept}_proc_$(date +%s)_${doc_id}"

            echo "Processing $dept document: $doc_id"
            tg-start-library-processing \
                -d "$doc_id" \
                --id "$proc_id" \
                -i "$flow" \
                -U "$dept" \
                --collection "${dept}-processed" \
                --tags "$dept,departmental"
        done
}

# Process documents for different departments
process_by_department "research" "research-flow"
process_by_department "finance" "document-flow"
process_by_department "legal" "compliance-flow"
```

### Priority Processing

```bash
# Process high-priority documents first
priority_processing() {
    local priority_tags=("urgent" "high-priority" "critical")
    local tag

    for tag in "${priority_tags[@]}"; do
        echo "Processing $tag documents..."

        tg-show-library-documents | \
            grep -B5 -A5 "$tag" | \
            grep "| id" | \
            awk '{print $3}' | \
            while read -r doc_id; do
                local proc_id="priority_$(date +%s)_${doc_id}"

                echo "Processing priority document: $doc_id"
                tg-start-library-processing \
                    -d "$doc_id" \
                    --id "$proc_id" \
                    --collection "priority-processed" \
                    --tags "priority,$tag"
            done
    done
}

priority_processing
```

### Conditional Processing

```bash
# Process documents based on criteria
conditional_processing() {
    local criteria="$1"
    local flow="$2"

    echo "Processing documents matching criteria: $criteria"

    tg-show-library-documents | \
        grep -B10 -A10 "$criteria" | \
        grep "| id" | \
        awk '{print $3}' | \
        while read -r doc_id; do
            # Check if already processed.
            # Heuristic: the document counts as processed if a RAG query response mentions its ID.
            if tg-invoke-document-rag -q "test" 2>/dev/null | grep -q "$doc_id"; then
                echo "Document $doc_id already processed, skipping"
                continue
            fi

            local proc_id="conditional_$(date +%s)_${doc_id}"

            echo "Processing document: $doc_id"
            tg-start-library-processing \
                -d "$doc_id" \
                --id "$proc_id" \
                -i "$flow" \
                --collection "conditional-processed" \
                --tags "conditional,$criteria"
        done
}

# Process technical documents
conditional_processing "technical" "technical-flow"
```

## Advanced Usage

### Processing with Validation

```bash
# Process with pre- and post-validation
validated_processing() {
    local doc_id="$1"
    local proc_id="$2"
    local collection="$3"

    echo "Starting validated processing for: $doc_id"

    # Pre-processing validation
    if ! tg-show-library-documents | grep -q "$doc_id"; then
        echo "ERROR: Document $doc_id not found"
        return 1
    fi

    # Check if processing ID is unique
    if tg-show-flows | grep -q "$proc_id"; then
        echo "ERROR: Processing ID $proc_id already in use"
        return 1
    fi

    # Start processing; stop here if the submission itself fails
    echo "Starting processing..."
    if ! tg-start-library-processing \
        -d "$doc_id" \
        --id "$proc_id" \
        --collection "$collection" \
        --tags "validated,$(date +%Y%m%d)"; then
        echo "ERROR: Failed to submit $doc_id for processing"
        return 1
    fi

    # Monitor processing
    echo "Monitoring processing progress..."
    local timeout=300  # 5 minutes
    local elapsed=0
    local interval=10

    while [ "$elapsed" -lt "$timeout" ]; do
        if tg-invoke-document-rag -q "test" -C "$collection" 2>/dev/null | grep -q "$doc_id"; then
            echo "✓ Processing completed successfully"
            return 0
        fi

        echo "Processing in progress... (${elapsed}s elapsed)"
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done

    echo "⚠ Processing timeout reached"
    return 1
}

# Usage
validated_processing "doc_123" "validated_proc_001" "validated-docs"
```

### Parallel Processing with Limits

```bash
# Submit multiple documents in parallel with a concurrency limit
# (wait -n requires bash 4.3 or later)
parallel_processing() {
    local doc_list=("$@")
    local max_concurrent=5
    local current_jobs=0
    local doc_id

    echo "Processing ${#doc_list[@]} documents with max $max_concurrent concurrent jobs"

    for doc_id in "${doc_list[@]}"; do
        # Wait if max concurrent jobs reached
        while [ "$current_jobs" -ge "$max_concurrent" ]; do
            wait -n  # Wait for any job to complete
            current_jobs=$((current_jobs - 1))
        done

        # Start processing in background
        (
            proc_id="parallel_$(date +%s)_${doc_id}"
            echo "Starting processing: $doc_id"

            tg-start-library-processing \
                -d "$doc_id" \
                --id "$proc_id" \
                --collection "parallel-processed" \
                --tags "parallel,batch"

            # The command returns once the job is submitted, not when processing finishes
            echo "Submitted for processing: $doc_id"
        ) &

        current_jobs=$((current_jobs + 1))
    done

    # Wait for all remaining jobs
    wait
    echo "All processing jobs submitted"
}

# Get document list and process in parallel
mapfile -t doc_list < <(tg-show-library-documents | grep "| id" | awk '{print $3}')
parallel_processing "${doc_list[@]}"
```

### Processing with Retry Logic

```bash
# Process with automatic retry on failure
processing_with_retry() {
    local doc_id="$1"
    local proc_id="$2"
    local max_retries=3
    local retry_delay=30
    local attempt

    for ((attempt = 1; attempt <= max_retries; attempt++)); do
        echo "Processing attempt $attempt/$max_retries for document: $doc_id"

        # Each attempt gets its own processing ID so that retries do not conflict
        if tg-start-library-processing \
            -d "$doc_id" \
            --id "${proc_id}_attempt_${attempt}" \
            --collection "retry-processed" \
            --tags "retry,attempt_$attempt"; then

            # Wait and check if processing succeeded
            sleep "$retry_delay"

            if tg-invoke-document-rag -q "test" 2>/dev/null | grep -q "$doc_id"; then
                echo "✓ Processing succeeded on attempt $attempt"
                return 0
            else
                echo "Processing started but content not yet accessible"
            fi
        else
            echo "✗ Processing failed on attempt $attempt"
        fi

        if [ "$attempt" -lt "$max_retries" ]; then
            echo "Retrying in ${retry_delay}s..."
            sleep "$retry_delay"
        fi
    done

    echo "✗ Processing failed after $max_retries attempts"
    return 1
}

# Usage
processing_with_retry "doc_123" "retry_proc_001"
```

### Configuration-Driven Processing

```bash
# Process documents based on configuration file
config_driven_processing() {
    local config_file="$1"
    local doc_id flow_id collection tags proc_id

    if [ ! -f "$config_file" ]; then
        echo "Configuration file not found: $config_file"
        return 1
    fi

    echo "Processing documents based on configuration: $config_file"

    # Example configuration format:
    # doc_id,flow_id,collection,tags
    # doc_123,research-flow,research-docs,nlp research

    while IFS=',' read -r doc_id flow_id collection tags; do
        # Skip header line
        if [ "$doc_id" = "doc_id" ]; then
            continue
        fi

        proc_id="config_$(date +%s)_${doc_id}"

        echo "Processing: $doc_id -> $collection (flow: $flow_id)"

        # Tags are space-separated in the file because commas delimit its fields;
        # convert them to the comma-separated form that --tags expects
        tg-start-library-processing \
            -d "$doc_id" \
            --id "$proc_id" \
            -i "$flow_id" \
            --collection "$collection" \
            --tags "${tags// /,}"

    done < "$config_file"
}

# Create example configuration
cat > processing_config.csv << 'EOF'
doc_id,flow_id,collection,tags
doc_123,research-flow,research-docs,nlp research
doc_456,finance-flow,finance-docs,financial quarterly
doc_789,general-flow,general-docs,general processing
EOF

# Process based on configuration
config_driven_processing "processing_config.csv"
```

## Error Handling

### Document Not Found

```text
Exception: Document not found
```

**Solution**: Verify document exists with `tg-show-library-documents`.

### Processing ID Conflict

```text
Exception: Processing ID already exists
```

**Solution**: Use a unique processing ID or check existing jobs with `tg-show-flows`.

### Flow Not Found

```text
Exception: Flow instance not found
```

**Solution**: Verify flow exists with `tg-show-flows` or `tg-show-flow-classes`.

### Insufficient Resources

```text
Exception: Processing queue full
```

**Solution**: Wait for current jobs to complete or scale processing resources.

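
In scripts, the messages above can be mapped to their suggested remedies. The sketch below is a rough illustration under two assumptions: that the CLI exits with a non-zero status on failure, and that its output contains the message text exactly as shown above. `suggest_fix` is a hypothetical helper, not part of TrustGraph.

```bash
# Hypothetical helper, not part of the TrustGraph CLI.
# Maps an error message (as listed in this section) to the suggested remedy.
suggest_fix() {
    case "$1" in
        *"Document not found"*)
            echo "Verify the document exists with tg-show-library-documents" ;;
        *"Processing ID already exists"*)
            echo "Choose a processing ID that is not already in use" ;;
        *"Flow instance not found"*)
            echo "Verify the flow exists with tg-show-flows" ;;
        *"Processing queue full"*)
            echo "Wait for current jobs to complete, then retry" ;;
        *)
            echo "Unrecognized error; inspect the full output" ;;
    esac
}

# Example usage (commented out; needs a running TrustGraph deployment):
# if ! output=$(tg-start-library-processing -d "doc_123" --id "proc_001" 2>&1); then
#     suggest_fix "$output" >&2
# fi
```

Matching on substrings keeps the helper tolerant of extra text around the message, such as a traceback or a timestamp.
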
## Monitoring and Management

### Processing Status

```bash
# Monitor processing progress
monitor_processing() {
    local proc_id="$1"
    local timeout="${2:-300}"  # 5 minutes default
    local elapsed=0
    local interval=10

    echo "Monitoring processing: $proc_id"

    while [ "$elapsed" -lt "$timeout" ]; do
        # Check if processing is active
        if tg-show-flows | grep -q "$proc_id"; then
            echo "Processing active... (${elapsed}s elapsed)"
        else
            echo "Processing completed or stopped"
            break
        fi

        sleep "$interval"
        elapsed=$((elapsed + interval))
    done

    if [ "$elapsed" -ge "$timeout" ]; then
        echo "Monitoring timeout reached"
    fi
}

# Monitor specific processing job
monitor_processing "proc_001" 600
```

### Batch Monitoring

```bash
# Monitor multiple processing jobs
monitor_batch() {
    local proc_pattern="$1"
    local active_jobs

    echo "Monitoring batch processing: $proc_pattern"

    while true; do
        # grep -c prints 0 itself when nothing matches (and exits non-zero),
        # so only the exit status needs to be neutralized
        active_jobs=$(tg-show-flows | grep -c "$proc_pattern" || true)

        if [ "$active_jobs" -eq 0 ]; then
            echo "All batch processing jobs completed"
            break
        fi

        echo "Active jobs: $active_jobs"
        sleep 30
    done
}

# Monitor batch processing
monitor_batch "batch_proc_"
```

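
Both monitoring functions follow the same pattern: run a check repeatedly until it succeeds or a timeout expires. That pattern can be factored into a small generic helper, sketched below. `poll_until` is a hypothetical name, not a TrustGraph command, and the elapsed time it tracks counts only the sleep intervals, not the time spent running the check itself.

```bash
# Hypothetical helper, not part of the TrustGraph CLI.
# Usage: poll_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, or 1 once TIMEOUT_SECONDS have elapsed.
poll_until() {
    local timeout="$1" interval="$2" elapsed=0
    shift 2

    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Example with a stand-in check; replace `true` with a real completion test
poll_until 30 5 true && echo "condition met"
```

The check is passed as a command and its arguments, so any function or program that signals completion through its exit status can be plugged in without changing the loop.
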
## Environment Variables

- `TRUSTGRAPH_URL`: Default API URL

## Related Commands

- [`tg-show-library-documents`](tg-show-library-documents.md) - List available documents
- [`tg-stop-library-processing`](tg-stop-library-processing.md) - Stop processing jobs
- [`tg-show-flows`](tg-show-flows.md) - Monitor processing flows
- [`tg-invoke-document-rag`](tg-invoke-document-rag.md) - Query processed documents

## API Integration

This command uses the [Library API](../apis/api-librarian.md) to initiate document processing workflows.

## Best Practices

1. **Unique IDs**: Always use unique processing IDs to avoid conflicts
2. **Resource Management**: Monitor system resources during batch processing
3. **Error Handling**: Implement retry logic for robust processing
4. **Monitoring**: Track processing progress and completion
5. **Collection Organization**: Use meaningful collection names
6. **Tagging**: Apply consistent tagging for better organization
7. **Documentation**: Document processing procedures and configurations

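
One way to keep tagging consistent is to normalize tag lists before passing them to `--tags`. The following sketch is only an illustration: `normalize_tags` is a hypothetical helper, and its rules (lowercase, trimmed, no empty entries, no duplicates) are an assumed house convention rather than anything TrustGraph requires.

```bash
# Hypothetical helper, not part of the TrustGraph CLI.
# Lowercases a comma-separated tag list, trims whitespace around each tag,
# drops empty entries and duplicates, and keeps first-seen order.
normalize_tags() {
    printf '%s\n' "$1" |
        tr '[:upper:]' '[:lower:]' |
        tr ',' '\n' |
        awk '
            { gsub(/^[ \t]+|[ \t]+$/, "") }       # trim surrounding whitespace
            $0 != "" && !seen[$0]++ {             # skip empties and duplicates
                out = (out == "" ? $0 : out "," $0)
            }
            END { print out }
        '
}

normalize_tags " NLP, research ,nlp,,2023 "   # prints: nlp,research,2023
```

The result can be used directly, for example `--tags "$(normalize_tags "$raw_tags")"`, where `raw_tags` is whatever tag string the script received.
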
## Troubleshooting

### Processing Not Starting

```bash
# Check document exists
tg-show-library-documents | grep "document-id"

# Check flow is available
tg-show-flows | grep "flow-id"

# Check system resources
free -h
df -h
```

### Slow Processing

```bash
# Check processing queue
tg-show-flows | grep -c processing

# Monitor system load (both tools are interactive; run one of them)
top
htop
```

### Processing Failures

```bash
# Check processing logs
# (Log location depends on TrustGraph configuration)

# Retry with different flow
tg-start-library-processing -d "doc-id" --id "retry-proc" -i "alternative-flow"
```