Update docs for API/CLI changes in 1.0 (#421)

* Update some API basics for the 0.23/1.0 API change
2026-06-12 08:15:14 +02:00 · 2025-07-03 14:58:32 +01:00 · 2025-07-03 14:58:32 +01:00 · 44bdd29f51
commit 44bdd29f51
parent f907ea7db8
69 changed files with 19981 additions and 407 deletions
--- a/docs/cli/tg-start-library-processing.md
+++ b/docs/cli/tg-start-library-processing.md
@ -0,0 +1,563 @@
+# tg-start-library-processing
+
+Submits a library document for processing through TrustGraph workflows.
+
+## Synopsis
+
+```bash
+tg-start-library-processing -d DOCUMENT_ID --id PROCESSING_ID [options]
+```
+
+## Description
+
+The `tg-start-library-processing` command initiates processing of a document stored in TrustGraph's document library. This triggers workflows that can extract text, generate embeddings, create knowledge graphs, and enable document search and analysis.
+
+Each processing job is assigned a unique processing ID for tracking and management purposes.
+
+## Options
+
+### Required Arguments
+
+- `-d, --document-id ID`: Document ID from the library to process
+- `--id, --processing-id ID`: Unique identifier for this processing job
+
+### Optional Arguments
+
+- `-u, --api-url URL`: TrustGraph API URL (default: `$TRUSTGRAPH_URL` or `http://localhost:8088/`)
+- `-U, --user USER`: User ID for processing context (default: `trustgraph`)
+- `-i, --flow-id ID`: Flow instance to use for processing (default: `default`)
+- `--collection COLLECTION`: Collection to assign processed data (default: `default`)
+- `--tags TAGS`: Comma-separated tags for the processing job
+
+## Examples
+
+### Basic Document Processing
+```bash
+tg-start-library-processing -d "doc_123456789" --id "proc_001"
+```
+
+### Processing with Custom Collection
+```bash
+tg-start-library-processing \
+  -d "research_paper_456" \
+  --id "research_proc_001" \
+  --collection "research-papers" \
+  --tags "nlp,research,2023"
+```
+
+### Processing with Specific Flow
+```bash
+tg-start-library-processing \
+  -d "technical_manual" \
+  --id "manual_proc_001" \
+  -i "document-analysis-flow" \
+  -U "technical-team" \
+  --collection "technical-docs"
+```
+
+### Processing Multiple Documents
+```bash
+# Process several documents in sequence
+documents=("doc_001" "doc_002" "doc_003")
+for i in "${!documents[@]}"; do
+  doc_id="${documents[$i]}"
+  proc_id="batch_proc_$(printf %03d $((i+1)))"
+  
+  echo "Processing document: $doc_id"
+  tg-start-library-processing \
+    -d "$doc_id" \
+    --id "$proc_id" \
+    --collection "batch-processing" \
+    --tags "batch,automated"
+done
+```
+
+## Processing Workflow
+
+### Document Processing Steps
+1. **Document Retrieval**: Fetch document from library
+2. **Content Extraction**: Extract text and metadata
+3. **Text Processing**: Clean and normalize content
+4. **Embedding Generation**: Create vector embeddings
+5. **Knowledge Extraction**: Generate triples and entities
+6. **Index Creation**: Make content searchable
+
+### Processing Types
+Different document types may trigger different processing workflows:
+- **PDF Documents**: Text extraction, OCR if needed
+- **Text Files**: Direct text processing
+- **Images**: OCR and image analysis
+- **Structured Data**: Schema extraction and mapping
+
+## Use Cases
+
+### Batch Document Processing
+```bash
+# Process all unprocessed documents
+process_all_documents() {
+  local collection="$1"
+  local batch_id="batch_$(date +%Y%m%d_%H%M%S)"
+  
+  echo "Starting batch processing for collection: $collection"
+  
+  # Get all document IDs
+  tg-show-library-documents | \
+    grep "| id" | \
+    awk '{print $3}' | \
+    while read -r doc_id; do
+      proc_id="${batch_id}_${doc_id}"
+      
+      echo "Processing document: $doc_id"
+      tg-start-library-processing \
+        -d "$doc_id" \
+        --id "$proc_id" \
+        --collection "$collection" \
+        --tags "batch,automated,$(date +%Y%m%d)"
+      
+      # Add delay to avoid overwhelming the system
+      sleep 2
+    done
+}
+
+# Process all documents
+process_all_documents "processed-docs"
+```
+
+### Department-Specific Processing
+```bash
+# Process documents by department
+process_by_department() {
+  local dept="$1"
+  local flow="$2"
+  
+  echo "Processing documents for department: $dept"
+  
+  # Find documents with department tag
+  tg-show-library-documents -U "$dept" | \
+    grep "| id" | \
+    awk '{print $3}' | \
+    while read -r doc_id; do
+      proc_id="${dept}_proc_$(date +%s)_${doc_id}"
+      
+      echo "Processing $dept document: $doc_id"
+      tg-start-library-processing \
+        -d "$doc_id" \
+        --id "$proc_id" \
+        -i "$flow" \
+        -U "$dept" \
+        --collection "${dept}-processed" \
+        --tags "$dept,departmental"
+    done
+}
+
+# Process documents for different departments
+process_by_department "research" "research-flow"
+process_by_department "finance" "document-flow"
+process_by_department "legal" "compliance-flow"
+```
+
+### Priority Processing
+```bash
+# Process high-priority documents first
+priority_processing() {
+  local priority_tags=("urgent" "high-priority" "critical")
+  
+  for tag in "${priority_tags[@]}"; do
+    echo "Processing $tag documents..."
+    
+    tg-show-library-documents | \
+      grep -B5 -A5 "$tag" | \
+      grep "| id" | \
+      awk '{print $3}' | \
+      while read -r doc_id; do
+        proc_id="priority_$(date +%s)_${doc_id}"
+        
+        echo "Processing priority document: $doc_id"
+        tg-start-library-processing \
+          -d "$doc_id" \
+          --id "$proc_id" \
+          --collection "priority-processed" \
+          --tags "priority,$tag"
+      done
+  done
+}
+
+priority_processing
+```
+
+### Conditional Processing
+```bash
+# Process documents based on criteria
+conditional_processing() {
+  local criteria="$1"
+  local flow="$2"
+  
+  echo "Processing documents matching criteria: $criteria"
+  
+  tg-show-library-documents | \
+    grep -B10 -A10 "$criteria" | \
+    grep "| id" | \
+    awk '{print $3}' | \
+    while read -r doc_id; do
+      # Check if already processed
+      if tg-invoke-document-rag -q "test" 2>/dev/null | grep -q "$doc_id"; then
+        echo "Document $doc_id already processed, skipping"
+        continue
+      fi
+      
+      proc_id="conditional_$(date +%s)_${doc_id}"
+      
+      echo "Processing document: $doc_id"
+      tg-start-library-processing \
+        -d "$doc_id" \
+        --id "$proc_id" \
+        -i "$flow" \
+        --collection "conditional-processed" \
+        --tags "conditional,$criteria"
+    done
+}
+
+# Process technical documents
+conditional_processing "technical" "technical-flow"
+```
+
+## Advanced Usage
+
+### Processing with Validation
+```bash
+# Process with pre and post validation
+validated_processing() {
+  local doc_id="$1"
+  local proc_id="$2"
+  local collection="$3"
+  
+  echo "Starting validated processing for: $doc_id"
+  
+  # Pre-processing validation
+  if ! tg-show-library-documents | grep -q "$doc_id"; then
+    echo "ERROR: Document $doc_id not found"
+    return 1
+  fi
+  
+  # Check if processing ID is unique
+  if tg-show-flows | grep -q "$proc_id"; then
+    echo "ERROR: Processing ID $proc_id already in use"
+    return 1
+  fi
+  
+  # Start processing
+  echo "Starting processing..."
+  tg-start-library-processing \
+    -d "$doc_id" \
+    --id "$proc_id" \
+    --collection "$collection" \
+    --tags "validated,$(date +%Y%m%d)"
+  
+  # Monitor processing
+  echo "Monitoring processing progress..."
+  timeout=300  # 5 minutes
+  elapsed=0
+  interval=10
+  
+  while [ $elapsed -lt $timeout ]; do
+    if tg-invoke-document-rag -q "test" -C "$collection" 2>/dev/null | grep -q "$doc_id"; then
+      echo "✓ Processing completed successfully"
+      return 0
+    fi
+    
+    echo "Processing in progress... (${elapsed}s elapsed)"
+    sleep $interval
+    elapsed=$((elapsed + interval))
+  done
+  
+  echo "⚠ Processing timeout reached"
+  return 1
+}
+
+# Usage
+validated_processing "doc_123" "validated_proc_001" "validated-docs"
+```
+
+### Parallel Processing with Limits
+```bash
+# Process multiple documents in parallel with concurrency limits
+parallel_processing() {
+  local doc_list=("$@")
+  local max_concurrent=5
+  local current_jobs=0
+  
+  echo "Processing ${#doc_list[@]} documents with max $max_concurrent concurrent jobs"
+  
+  for doc_id in "${doc_list[@]}"; do
+    # Wait if max concurrent jobs reached
+    while [ $current_jobs -ge $max_concurrent ]; do
+      wait -n  # Wait for any job to complete
+      current_jobs=$((current_jobs - 1))
+    done
+    
+    # Start processing in background
+    (
+      proc_id="parallel_$(date +%s)_${doc_id}"
+      echo "Starting processing: $doc_id"
+      
+      tg-start-library-processing \
+        -d "$doc_id" \
+        --id "$proc_id" \
+        --collection "parallel-processed" \
+        --tags "parallel,batch"
+      
+      echo "Completed processing: $doc_id"
+    ) &
+    
+    current_jobs=$((current_jobs + 1))
+  done
+  
+  # Wait for all remaining jobs
+  wait
+  echo "All processing jobs completed"
+}
+
+# Get document list and process in parallel
+doc_list=($(tg-show-library-documents | grep "| id" | awk '{print $3}'))
+parallel_processing "${doc_list[@]}"
+```
+
+### Processing with Retry Logic
+```bash
+# Process with automatic retry on failure
+processing_with_retry() {
+  local doc_id="$1"
+  local proc_id="$2"
+  local max_retries=3
+  local retry_delay=30
+  
+  for attempt in $(seq 1 $max_retries); do
+    echo "Processing attempt $attempt/$max_retries for document: $doc_id"
+    
+    if tg-start-library-processing \
+        -d "$doc_id" \
+        --id "${proc_id}_attempt_${attempt}" \
+        --collection "retry-processed" \
+        --tags "retry,attempt_$attempt"; then
+      
+      # Wait and check if processing succeeded
+      sleep $retry_delay
+      
+      if tg-invoke-document-rag -q "test" 2>/dev/null | grep -q "$doc_id"; then
+        echo "✓ Processing succeeded on attempt $attempt"
+        return 0
+      else
+        echo "Processing started but content not yet accessible"
+      fi
+    else
+      echo "✗ Processing failed on attempt $attempt"
+    fi
+    
+    if [ $attempt -lt $max_retries ]; then
+      echo "Retrying in ${retry_delay}s..."
+      sleep $retry_delay
+    fi
+  done
+  
+  echo "✗ Processing failed after $max_retries attempts"
+  return 1
+}
+
+# Usage
+processing_with_retry "doc_123" "retry_proc_001"
+```
+
+### Configuration-Driven Processing
+```bash
+# Process documents based on configuration file
+config_driven_processing() {
+  local config_file="$1"
+  
+  if [ ! -f "$config_file" ]; then
+    echo "Configuration file not found: $config_file"
+    return 1
+  fi
+  
+  echo "Processing documents based on configuration: $config_file"
+  
+  # Example configuration format:
+  # doc_id,flow_id,collection,tags
+  # doc_123,research-flow,research-docs,nlp research
+  
+  while IFS=',' read -r doc_id flow_id collection tags; do
+    # Skip header line
+    if [ "$doc_id" = "doc_id" ]; then
+      continue
+    fi
+    
+    proc_id="config_$(date +%s)_${doc_id}"
+    
+    echo "Processing: $doc_id -> $collection (flow: $flow_id)"
+    
+    tg-start-library-processing \
+      -d "$doc_id" \
+      --id "$proc_id" \
+      -i "$flow_id" \
+      --collection "$collection" \
+      --tags "$tags"
+    
+  done < "$config_file"
+}
+
+# Create example configuration
+cat > processing_config.csv << EOF
+doc_id,flow_id,collection,tags
+doc_123,research-flow,research-docs,nlp research
+doc_456,finance-flow,finance-docs,financial quarterly
+doc_789,general-flow,general-docs,general processing
+EOF
+
+# Process based on configuration
+config_driven_processing "processing_config.csv"
+```
+
+## Error Handling
+
+### Document Not Found
+```bash
+Exception: Document not found
+```
+**Solution**: Verify document exists with `tg-show-library-documents`.
+
+### Processing ID Conflict
+```bash
+Exception: Processing ID already exists
+```
+**Solution**: Use a unique processing ID or check existing jobs with `tg-show-flows`.
+
+### Flow Not Found
+```bash
+Exception: Flow instance not found
+```
+**Solution**: Verify flow exists with `tg-show-flows` or `tg-show-flow-classes`.
+
+### Insufficient Resources
+```bash
+Exception: Processing queue full
+```
+**Solution**: Wait for current jobs to complete or scale processing resources.
+
+## Monitoring and Management
+
+### Processing Status
+```bash
+# Monitor processing progress
+monitor_processing() {
+  local proc_id="$1"
+  local timeout="${2:-300}"  # 5 minutes default
+  
+  echo "Monitoring processing: $proc_id"
+  
+  elapsed=0
+  interval=10
+  
+  while [ $elapsed -lt $timeout ]; do
+    # Check if processing is active
+    if tg-show-flows | grep -q "$proc_id"; then
+      echo "Processing active... (${elapsed}s elapsed)"
+    else
+      echo "Processing completed or stopped"
+      break
+    fi
+    
+    sleep $interval
+    elapsed=$((elapsed + interval))
+  done
+  
+  if [ $elapsed -ge $timeout ]; then
+    echo "Monitoring timeout reached"
+  fi
+}
+
+# Monitor specific processing job
+monitor_processing "proc_001" 600
+```
+
+### Batch Monitoring
+```bash
+# Monitor multiple processing jobs
+monitor_batch() {
+  local proc_pattern="$1"
+  
+  echo "Monitoring batch processing: $proc_pattern"
+  
+  while true; do
+    active_jobs=$(tg-show-flows | grep -c "$proc_pattern" || echo "0")
+    
+    if [ "$active_jobs" -eq 0 ]; then
+      echo "All batch processing jobs completed"
+      break
+    fi
+    
+    echo "Active jobs: $active_jobs"
+    sleep 30
+  done
+}
+
+# Monitor batch processing
+monitor_batch "batch_proc_"
+```
+
+## Environment Variables
+
+- `TRUSTGRAPH_URL`: Default API URL
+
+## Related Commands
+
+- [`tg-show-library-documents`](tg-show-library-documents.md) - List available documents
+- [`tg-stop-library-processing`](tg-stop-library-processing.md) - Stop processing jobs
+- [`tg-show-flows`](tg-show-flows.md) - Monitor processing flows
+- [`tg-invoke-document-rag`](tg-invoke-document-rag.md) - Query processed documents
+
+## API Integration
+
+This command uses the [Library API](../apis/api-librarian.md) to initiate document processing workflows.
+
+## Best Practices
+
+1. **Unique IDs**: Always use unique processing IDs to avoid conflicts
+2. **Resource Management**: Monitor system resources during batch processing
+3. **Error Handling**: Implement retry logic for robust processing
+4. **Monitoring**: Track processing progress and completion
+5. **Collection Organization**: Use meaningful collection names
+6. **Tagging**: Apply consistent tagging for better organization
+7. **Documentation**: Document processing procedures and configurations
+
+## Troubleshooting
+
+### Processing Not Starting
+```bash
+# Check document exists
+tg-show-library-documents | grep "document-id"
+
+# Check flow is available
+tg-show-flows | grep "flow-id"
+
+# Check system resources
+free -h
+df -h
+```
+
+### Slow Processing
+```bash
+# Check processing queue
+tg-show-flows | grep processing | wc -l
+
+# Monitor system load
+top
+htop
+```
+
+### Processing Failures
+```bash
+# Check processing logs
+# (Log location depends on TrustGraph configuration)
+
+# Retry with different flow
+tg-start-library-processing -d "doc-id" --id "retry-proc" -i "alternative-flow"
+```