trustgraph/docs/cli/tg-load-sample-documents.md

# tg-load-sample-documents

Loads predefined sample documents into TrustGraph library for testing and demonstration purposes.

## Synopsis

```bash
tg-load-sample-documents [options]
```

## Description

The `tg-load-sample-documents` command loads a curated set of sample documents into TrustGraph's document library. These documents include academic papers, government reports, and reference materials that demonstrate TrustGraph's capabilities and provide data for testing and evaluation.

The command downloads documents from public sources and adds them to the library with comprehensive metadata including RDF triples for semantic relationships.

## Options

### Optional Arguments

- `-u, --url URL`: TrustGraph API URL (default: `$TRUSTGRAPH_URL` or `http://localhost:8088/`)
- `-U, --user USER`: User ID for document ownership (default: `trustgraph`)

## Examples

### Basic Loading
```bash
tg-load-sample-documents
```

### Load with Custom User
```bash
tg-load-sample-documents -U "demo-user"
```

### Load to Custom Environment
```bash
tg-load-sample-documents -u http://demo.trustgraph.ai:8088/
```

## Sample Documents

The command loads the following sample documents:

### 1. NASA Challenger Report
- **Title**: Report of the Presidential Commission on the Space Shuttle Challenger Accident, Volume 1
- **Topics**: Safety engineering, space shuttle, NASA
- **Format**: PDF
- **Source**: NASA Technical Reports Server
- **Use Case**: Demonstrates technical document processing and safety analysis

### 2. Old Icelandic Dictionary
- **Title**: A Concise Dictionary of Old Icelandic
- **Topics**: Language, linguistics, Old Norse, grammar
- **Format**: PDF
- **Publication**: 1910, Clarendon Press
- **Use Case**: Historical document processing and linguistic analysis

### 3. US Intelligence Threat Assessment
- **Title**: Annual Threat Assessment of the U.S. Intelligence Community - March 2025
- **Topics**: National security, cyberthreats, geopolitics
- **Format**: PDF
- **Source**: Director of National Intelligence
- **Use Case**: Current affairs analysis and security research

### 4. Intelligence and State Policy
- **Title**: The Role of Intelligence and State Policies in International Security
- **Topics**: Intelligence, international security, state policy
- **Format**: PDF (sample)
- **Publication**: Cambridge Scholars Publishing, 2021
- **Use Case**: Academic research and policy analysis

### 5. Globalization and Intelligence
- **Title**: Beyond the Vigilant State: Globalisation and Intelligence
- **Topics**: Intelligence, globalization, security studies
- **Format**: PDF
- **Author**: Richard J. Aldrich
- **Use Case**: Academic paper analysis and research

## Use Cases

### Demo Environment Setup
```bash
# Set up demonstration environment
setup_demo_environment() {
  echo "Setting up TrustGraph demo environment..."

  # Initialize system
  tg-init-trustgraph

  # Load sample documents
  echo "Loading sample documents..."
  tg-load-sample-documents -U "demo"

  # Wait for processing
  echo "Waiting for document processing..."
  sleep 60

  # Start document processing
  echo "Starting document processing..."
  tg-show-library-documents -U "demo" | \
    grep "| id" | \
    awk '{print $3}' | \
    while read doc_id; do
      proc_id="demo_proc_$(date +%s)_${doc_id}"
      tg-start-library-processing -d "$doc_id" --id "$proc_id" -U "demo"
    done

  echo "Demo environment ready!"
  echo "Try: tg-invoke-document-rag -q 'What caused the Challenger accident?' -U demo"
}
```

### Testing Data Pipeline
```bash
# Test complete document processing pipeline
test_document_pipeline() {
  echo "Testing document processing pipeline..."

  # Load sample documents
  tg-load-sample-documents -U "test"

  # List loaded documents
  echo "Loaded documents:"
  tg-show-library-documents -U "test"

  # Start processing for each document
  tg-show-library-documents -U "test" | \
    grep "| id" | \
    awk '{print $3}' | \
    while read doc_id; do
      echo "Processing document: $doc_id"
      proc_id="test_$(date +%s)_${doc_id}"
      tg-start-library-processing -d "$doc_id" --id "$proc_id" -U "test"
    done

  # Wait for processing
  echo "Processing documents... (this may take several minutes)"
  sleep 300

  # Test document queries
  echo "Testing document queries..."

  test_queries=(
    "What is the Challenger accident?"
    "What is Old Icelandic?"
    "What are the main cybersecurity threats?"
    "What is intelligence policy?"
  )

  for query in "${test_queries[@]}"; do
    echo "Query: $query"
    tg-invoke-document-rag -q "$query" -U "test" | head -5
    echo "---"
  done

  echo "Pipeline test complete!"
}
```

### Educational Environment
```bash
# Set up educational/training environment
setup_educational_environment() {
  local class_name="$1"

  echo "Setting up educational environment for: $class_name"

  # Create user for the class
  class_user=$(echo "$class_name" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')

  # Load sample documents for the class
  tg-load-sample-documents -U "$class_user"

  # Process documents
  echo "Processing documents for educational use..."
  tg-show-library-documents -U "$class_user" | \
    grep "| id" | \
    awk '{print $3}' | \
    while read doc_id; do
      proc_id="edu_$(date +%s)_${doc_id}"
      tg-start-library-processing \
        -d "$doc_id" \
        --id "$proc_id" \
        -U "$class_user" \
        --collection "education"
    done

  echo "Educational environment ready for: $class_name"
  echo "User: $class_user"
  echo "Collection: education"
}

# Set up for different classes
setup_educational_environment "AI Research Methods"
setup_educational_environment "Security Studies"
```

### Benchmarking and Performance Testing
```bash
# Benchmark document processing performance
benchmark_processing() {
  echo "Starting document processing benchmark..."

  # Load sample documents
  start_time=$(date +%s)
  tg-load-sample-documents -U "benchmark"
  load_time=$(date +%s)

  echo "Document loading time: $((load_time - start_time))s"

  # Count documents
  doc_count=$(tg-show-library-documents -U "benchmark" | grep -c "| id")
  echo "Documents loaded: $doc_count"

  # Start processing
  processing_ids=()
  tg-show-library-documents -U "benchmark" | \
    grep "| id" | \
    awk '{print $3}' | \
    while read doc_id; do
      proc_id="bench_$(date +%s)_${doc_id}"
      processing_ids+=("$proc_id")
      tg-start-library-processing -d "$doc_id" --id "$proc_id" -U "benchmark"
    done

  processing_start=$(date +%s)

  # Monitor processing completion
  echo "Monitoring processing completion..."
  while true; do
    active_processing=$(tg-show-flows | grep -c "bench_" || echo "0")

    if [ "$active_processing" -eq 0 ]; then
      break
    fi

    echo "Active processing jobs: $active_processing"
    sleep 30
  done

  processing_end=$(date +%s)

  echo "Processing completion time: $((processing_end - processing_start))s"
  echo "Total benchmark time: $((processing_end - start_time))s"

  # Test query performance
  echo "Testing query performance..."
  query_start=$(date +%s)

  for i in {1..10}; do
    tg-invoke-document-rag \
      -q "What are the main topics in these documents?" \
      -U "benchmark" > /dev/null
  done

  query_end=$(date +%s)
  echo "Average query time: $(echo "scale=2; ($query_end - $query_start) / 10" | bc)s"
}
```

## Advanced Usage

### Selective Document Loading
```bash
# Load only specific types of documents
load_by_category() {
  local category="$1"

  case "$category" in
    "government")
      echo "Loading government documents..."
      # This would require modifying the script to load selectively
      # For now, we load all and filter by tags later
      tg-load-sample-documents -U "gov-docs"
      ;;
    "academic")
      echo "Loading academic documents..."
      tg-load-sample-documents -U "academic-docs"
      ;;
    "historical")
      echo "Loading historical documents..."
      tg-load-sample-documents -U "historical-docs"
      ;;
    *)
      echo "Loading all sample documents..."
      tg-load-sample-documents
      ;;
  esac
}

# Load by category
load_by_category "government"
load_by_category "academic"
```

### Multi-Environment Loading
```bash
# Load sample documents to multiple environments
multi_environment_setup() {
  local environments=("dev" "staging" "demo")

  for env in "${environments[@]}"; do
    echo "Setting up $env environment..."

    tg-load-sample-documents \
      -u "http://$env.trustgraph.company.com:8088/" \
      -U "sample-data"

    echo "✓ $env environment loaded"
  done

  echo "All environments loaded with sample documents"
}
```

### Custom Document Sets
```bash
# Create custom document loading scripts based on the sample
create_custom_loader() {
  local domain="$1"

  cat > "load-${domain}-documents.py" << 'EOF'
#!/usr/bin/env python3
"""
Custom document loader for specific domain
Based on tg-load-sample-documents
"""

import argparse
import os
from trustgraph.api import Api

# Define your own document set here
documents = [
    {
        "id": "https://example.com/doc/custom-1",
        "title": "Custom Document 1",
        "url": "https://example.com/docs/custom1.pdf",
        # Add your document definitions...
    }
]

# Rest of the implementation similar to tg-load-sample-documents
EOF

  echo "Custom loader created: load-${domain}-documents.py"
}

# Create custom loaders for different domains
create_custom_loader "medical"
create_custom_loader "legal"
create_custom_loader "technical"
```

## Document Analysis

### Content Analysis
```bash
# Analyze loaded sample documents
analyze_sample_documents() {
  echo "Analyzing sample documents..."

  # Get document statistics
  total_docs=$(tg-show-library-documents | grep -c "| id")
  echo "Total documents: $total_docs"

  # Analyze by type
  echo "Document types:"
  tg-show-library-documents | \
    grep "| kind" | \
    awk '{print $3}' | \
    sort | uniq -c

  # Analyze tags
  echo "Popular tags:"
  tg-show-library-documents | \
    grep "| tags" | \
    sed 's/.*| tags.*| \(.*\) |.*/\1/' | \
    tr ',' '\n' | \
    sed 's/^ *//;s/ *$//' | \
    sort | uniq -c | sort -nr | head -10

  # Document sizes (would need additional API)
  echo "Document analysis complete"
}
```

### Query Testing
```bash
# Test sample documents with various queries
test_sample_queries() {
  echo "Testing sample document queries..."

  # Define test queries for different domains
  queries=(
    "What caused the Challenger space shuttle accident?"
    "What is Old Norse language?"
    "What are current cybersecurity threats?"
    "How does globalization affect intelligence services?"
    "What are the main security challenges in international relations?"
  )

  for query in "${queries[@]}"; do
    echo "Testing query: $query"
    echo "===================="

    result=$(tg-invoke-document-rag -q "$query" 2>/dev/null)

    if [ $? -eq 0 ]; then
      echo "$result" | head -3
      echo "✓ Query successful"
    else
      echo "✗ Query failed"
    fi

    echo ""
  done
}
```

## Error Handling

### Network Issues
```bash
Exception: Connection failed during download
```
**Solution**: Check internet connectivity and retry. Documents are cached locally after first download.

### Insufficient Storage
```bash
Exception: No space left on device
```
**Solution**: Free up disk space. Sample documents total approximately 50-100MB.

### API Connection Issues
```bash
Exception: Connection refused
```
**Solution**: Verify TrustGraph API is running and accessible.

### Processing Failures
```bash
Exception: Document processing failed
```
**Solution**: Check TrustGraph service logs and ensure all components are running.

## Monitoring and Validation

### Loading Progress
```bash
# Monitor sample document loading
monitor_sample_loading() {
  echo "Starting sample document loading with monitoring..."

  # Start loading in background
  tg-load-sample-documents &
  load_pid=$!

  # Monitor progress
  while kill -0 $load_pid 2>/dev/null; do
    doc_count=$(tg-show-library-documents 2>/dev/null | grep -c "| id" || echo "0")
    echo "Documents loaded so far: $doc_count"
    sleep 10
  done

  wait $load_pid

  if [ $? -eq 0 ]; then
    final_count=$(tg-show-library-documents | grep -c "| id")
    echo "✓ Loading completed successfully"
    echo "Total documents loaded: $final_count"
  else
    echo "✗ Loading failed"
  fi
}
```

### Validation
```bash
# Validate sample document loading
validate_sample_loading() {
  echo "Validating sample document loading..."

  # Expected document count (based on current sample set)
  expected_docs=5

  # Check actual count
  actual_docs=$(tg-show-library-documents | grep -c "| id")

  if [ "$actual_docs" -eq "$expected_docs" ]; then
    echo "✓ Document count correct: $actual_docs"
  else
    echo "⚠ Document count mismatch: expected $expected_docs, got $actual_docs"
  fi

  # Check for expected documents
  expected_titles=(
    "Challenger"
    "Icelandic"
    "Intelligence"
    "Threat Assessment"
    "Vigilant State"
  )

  for title in "${expected_titles[@]}"; do
    if tg-show-library-documents | grep -q "$title"; then
      echo "✓ Found document containing: $title"
    else
      echo "✗ Missing document containing: $title"
    fi
  done

  echo "Validation complete"
}
```

## Environment Variables

- `TRUSTGRAPH_URL`: Default API URL

## Related Commands

- [`tg-show-library-documents`](tg-show-library-documents.md) - List loaded documents
- [`tg-start-library-processing`](tg-start-library-processing.md) - Process loaded documents
- [`tg-invoke-document-rag`](tg-invoke-document-rag.md) - Query processed documents
- [`tg-load-pdf`](tg-load-pdf.md) - Load individual PDF documents

## API Integration

This command uses the [Library API](../apis/api-librarian.md) to add sample documents to TrustGraph's document repository.

## Best Practices

1. **Demo Preparation**: Use for setting up demonstration environments
2. **Testing**: Ideal for testing document processing pipelines
3. **Education**: Excellent for training and educational purposes
4. **Development**: Use in development environments for consistent test data
5. **Benchmarking**: Suitable for performance testing and optimization
6. **Documentation**: Great for documenting TrustGraph capabilities

## Troubleshooting

### Download Failures
```bash
# Check document URLs are accessible
curl -I "https://ntrs.nasa.gov/api/citations/19860015255/downloads/19860015255.pdf"

# Check local cache
ls -la doc-cache/
```

### Processing Issues
```bash
# Check document processing status
tg-show-library-processing

# Verify documents are in library
tg-show-library-documents | grep -E "(Challenger|Icelandic|Intelligence)"
```

### Performance Problems
```bash
# Monitor system resources during loading
top
df -h
```