trustgraph/docs/cli/tg-load-pdf.md

480 lines
12 KiB
Markdown
Raw Normal View History

# tg-load-pdf
Loads PDF documents into TrustGraph for processing and analysis.
## Synopsis
```bash
tg-load-pdf [options] file1.pdf [file2.pdf ...]
```
## Description
The `tg-load-pdf` command loads PDF documents into TrustGraph by directing them to the PDF decoder service. The command extracts content, generates document metadata, and makes the documents available for processing by other TrustGraph services.
Each PDF is assigned a unique identifier based on its content hash, and comprehensive metadata can be attached including copyright information, publication details, and keywords.
**Note**: Consider using `tg-add-library-document` followed by `tg-start-library-processing` for more comprehensive document management.
## Options
### Required Arguments
- `files`: One or more PDF files to load
### Optional Arguments
- `-u, --url URL`: TrustGraph API URL (default: `$TRUSTGRAPH_URL` or `http://localhost:8088/`)
- `-f, --flow-id ID`: Flow instance ID to use (default: `default`)
- `-U, --user USER`: User ID for document ownership (default: `trustgraph`)
- `-C, --collection COLLECTION`: Collection to assign document (default: `default`)
### Document Metadata
- `--name NAME`: Document name/title
- `--description DESCRIPTION`: Document description
- `--identifier ID`: Custom document identifier
- `--document-url URL`: Source URL for the document
- `--keyword KEYWORD`: Document keywords (can be specified multiple times)
### Copyright Information
- `--copyright-notice NOTICE`: Copyright notice text
- `--copyright-holder HOLDER`: Copyright holder name
- `--copyright-year YEAR`: Copyright year
- `--license LICENSE`: Copyright license
### Publication Details
- `--publication-organization ORG`: Publishing organization
- `--publication-description DESC`: Publication description
- `--publication-date DATE`: Publication date
## Examples
### Basic PDF Loading
```bash
tg-load-pdf document.pdf
```
### Multiple Files
```bash
tg-load-pdf report1.pdf report2.pdf manual.pdf
```
### With Basic Metadata
```bash
tg-load-pdf \
--name "Technical Manual" \
--description "System administration guide" \
--keyword "technical" --keyword "manual" \
technical-manual.pdf
```
### Complete Metadata
```bash
tg-load-pdf \
--name "Annual Report 2023" \
--description "Company annual financial report" \
--copyright-holder "Acme Corporation" \
--copyright-year "2023" \
--license "All Rights Reserved" \
--publication-organization "Acme Corporation" \
--publication-date "2023-12-31" \
--keyword "financial" --keyword "annual" --keyword "report" \
annual-report-2023.pdf
```
### Custom Flow and Collection
```bash
tg-load-pdf \
-f "document-processing-flow" \
-U "finance-team" \
-C "financial-documents" \
--name "Budget Analysis" \
budget-2024.pdf
```
## Document Processing
### Content Extraction
The PDF loader:
1. Calculates SHA256 hash for unique document ID
2. Extracts text content from PDF
3. Preserves document structure and formatting metadata
4. Generates searchable text index
### Metadata Generation
Document metadata includes:
- **Document ID**: SHA256 hash-based unique identifier
- **Content Hash**: For duplicate detection
- **File Information**: Size, format, creation date
- **Custom Metadata**: User-provided attributes
### Integration with Processing Pipeline
```bash
# Load PDF and start processing
tg-load-pdf research-paper.pdf --name "AI Research Paper"
# Check processing status
tg-show-flows | grep "document-processing"
# Query loaded content
tg-invoke-document-rag -q "What is the main conclusion?" -C "default"
```
## Error Handling
### File Not Found
```bash
Exception: [Errno 2] No such file or directory: 'missing.pdf'
```
**Solution**: Verify file path and ensure PDF exists.
### Invalid PDF Format
```bash
Exception: PDF parsing failed: Invalid PDF structure
```
**Solution**: Verify PDF is not corrupted and is a valid PDF file.
### Permission Errors
```bash
Exception: [Errno 13] Permission denied: 'protected.pdf'
```
**Solution**: Check file permissions and ensure read access.
### Flow Not Found
```bash
Exception: Flow instance 'invalid-flow' not found
```
**Solution**: Verify flow ID exists with `tg-show-flows`.
### API Connection Issues
```bash
Exception: Connection refused
```
**Solution**: Check API URL and ensure TrustGraph is running.
## Advanced Usage
### Batch Processing
```bash
# Process all PDFs in directory
for pdf in *.pdf; do
echo "Loading $pdf..."
tg-load-pdf \
--name "$(basename "$pdf" .pdf)" \
--collection "research-papers" \
"$pdf"
done
```
### Organized Loading
```bash
# Load with structured metadata
categories=("technical" "financial" "legal")
for category in "${categories[@]}"; do
for pdf in "$category"/*.pdf; do
if [ -f "$pdf" ]; then
tg-load-pdf \
--collection "$category-documents" \
--keyword "$category" \
--name "$(basename "$pdf" .pdf)" \
"$pdf"
fi
done
done
```
### CSV-Driven Loading
```bash
# Load PDFs with metadata from CSV
# Format: filename,title,description,keywords
while IFS=',' read -r filename title description keywords; do
if [ -f "$filename" ]; then
echo "Loading $filename..."
# Convert comma-separated keywords to multiple --keyword args
keyword_args=""
IFS='|' read -ra KEYWORDS <<< "$keywords"
for kw in "${KEYWORDS[@]}"; do
keyword_args="$keyword_args --keyword \"$kw\""
done
eval "tg-load-pdf \
--name \"$title\" \
--description \"$description\" \
$keyword_args \
\"$filename\""
fi
done < documents.csv
```
### Publication Processing
```bash
# Load academic papers with publication details
load_academic_paper() {
local file="$1"
local title="$2"
local authors="$3"
local journal="$4"
local year="$5"
tg-load-pdf \
--name "$title" \
--description "Academic paper: $title" \
--copyright-holder "$authors" \
--copyright-year "$year" \
--publication-organization "$journal" \
--publication-date "$year-01-01" \
--keyword "academic" --keyword "research" \
"$file"
}
# Usage
load_academic_paper "ai-paper.pdf" "AI in Healthcare" "Smith et al." "AI Journal" "2023"
```
## Monitoring and Validation
### Load Status Checking
```bash
# Check document loading progress
check_load_status() {
local file="$1"
local expected_name="$2"
echo "Checking load status for: $file"
# Check if document appears in library
if tg-show-library-documents | grep -q "$expected_name"; then
echo "✓ Document loaded successfully"
else
echo "✗ Document not found in library"
return 1
fi
}
# Monitor batch loading
for pdf in *.pdf; do
name=$(basename "$pdf" .pdf)
check_load_status "$pdf" "$name"
done
```
### Content Verification
```bash
# Verify PDF content is accessible
verify_pdf_content() {
local pdf_name="$1"
local test_query="$2"
echo "Verifying content for: $pdf_name"
# Try to query the document
result=$(tg-invoke-document-rag -q "$test_query" -C "default" 2>/dev/null)
if [ $? -eq 0 ] && [ -n "$result" ]; then
echo "✓ Content accessible via RAG"
else
echo "✗ Content not accessible"
return 1
fi
}
# Verify loaded documents
verify_pdf_content "Technical Manual" "What is the installation process?"
```
## Performance Optimization
### Parallel Loading
```bash
# Load multiple PDFs in parallel
pdf_files=(document1.pdf document2.pdf document3.pdf)
for pdf in "${pdf_files[@]}"; do
(
echo "Loading $pdf in background..."
tg-load-pdf \
--name "$(basename "$pdf" .pdf)" \
--collection "batch-$(date +%Y%m%d)" \
"$pdf"
) &
done
wait
echo "All PDFs loaded"
```
### Size-Based Processing
```bash
# Process files based on size
for pdf in *.pdf; do
size=$(stat -c%s "$pdf")
if [ $size -lt 10485760 ]; then # < 10MB
echo "Processing small file: $pdf"
tg-load-pdf --collection "small-docs" "$pdf"
else
echo "Processing large file: $pdf"
tg-load-pdf --collection "large-docs" "$pdf"
fi
done
```
## Document Organization
### Collection Management
```bash
# Organize by document type
organize_by_type() {
local pdf="$1"
local filename=$(basename "$pdf" .pdf)
case "$filename" in
*manual*|*guide*) collection="manuals" ;;
*report*|*analysis*) collection="reports" ;;
*spec*|*specification*) collection="specifications" ;;
*legal*|*contract*) collection="legal" ;;
*) collection="general" ;;
esac
tg-load-pdf \
--collection "$collection" \
--name "$filename" \
"$pdf"
}
# Process all PDFs
for pdf in *.pdf; do
organize_by_type "$pdf"
done
```
### Metadata Standardization
```bash
# Apply consistent metadata standards
standardize_metadata() {
local pdf="$1"
local dept="$2"
local year="$3"
local name=$(basename "$pdf" .pdf)
local collection="$dept-$(date +%Y)"
tg-load-pdf \
--name "$name" \
--description "$dept document from $year" \
--copyright-holder "Company Name" \
--copyright-year "$year" \
--collection "$collection" \
--keyword "$dept" --keyword "$year" \
"$pdf"
}
# Usage
standardize_metadata "finance-report.pdf" "finance" "2023"
```
## Integration with Other Services
### Library Integration
```bash
# Alternative approach using library services
load_via_library() {
local pdf="$1"
local name="$2"
# Add to library first
tg-add-library-document \
--name "$name" \
--file "$pdf" \
--collection "documents"
# Start processing
tg-start-library-processing \
--collection "documents"
}
```
### Workflow Integration
```bash
# Complete document workflow
process_document_workflow() {
local pdf="$1"
local name="$2"
echo "Starting document workflow for: $name"
# 1. Load PDF
tg-load-pdf --name "$name" "$pdf"
# 2. Wait for processing
sleep 5
# 3. Verify availability
if tg-show-library-documents | grep -q "$name"; then
echo "Document available in library"
# 4. Test RAG functionality
tg-invoke-document-rag -q "What is this document about?"
# 5. Extract key information
tg-invoke-prompt extract-key-points \
text="Document: $name" \
format="bullet_points"
else
echo "Document processing failed"
fi
}
```
## Environment Variables
- `TRUSTGRAPH_URL`: Default API URL
## Related Commands
- [`tg-add-library-document`](tg-add-library-document.md) - Add documents to library
- [`tg-start-library-processing`](tg-start-library-processing.md) - Process library documents
- [`tg-show-library-documents`](tg-show-library-documents.md) - List library documents
- [`tg-invoke-document-rag`](tg-invoke-document-rag.md) - Query document content
- [`tg-show-flows`](tg-show-flows.md) - Monitor processing flows
## API Integration
This command uses the document loading API to process PDF files and make them available for text extraction, search, and analysis.
## Best Practices
1. **Metadata Completeness**: Provide comprehensive metadata for better organization
2. **Collection Organization**: Use logical collections for document categorization
3. **Error Handling**: Implement robust error handling for batch operations
4. **Performance**: Consider file sizes and processing capacity
5. **Monitoring**: Verify successful loading and processing
6. **Security**: Ensure sensitive documents are properly protected
7. **Backup**: Maintain backups of source PDFs
## Troubleshooting
### PDF Processing Issues
```bash
# Check PDF validity
file document.pdf
pdfinfo document.pdf
# Try alternative PDF processors
qpdf --check document.pdf
```
### Memory Issues
```bash
# For large PDFs, monitor memory usage
free -h
# Consider processing large files separately
```
### Content Extraction Problems
```bash
# Verify PDF contains extractable text
pdftotext document.pdf test-output.txt
cat test-output.txt | head -20
```