mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 08:26:21 +02:00
480 lines
No EOL
12 KiB
Markdown
480 lines
No EOL
12 KiB
Markdown
# tg-load-pdf
|
|
|
|
Loads PDF documents into TrustGraph for processing and analysis.
|
|
|
|
## Synopsis
|
|
|
|
```bash
|
|
tg-load-pdf [options] file1.pdf [file2.pdf ...]
|
|
```
|
|
|
|
## Description
|
|
|
|
The `tg-load-pdf` command loads PDF documents into TrustGraph by directing them to the PDF decoder service. The command extracts content, generates document metadata, and makes the documents available for processing by other TrustGraph services.
|
|
|
|
Each PDF is assigned a unique identifier based on its content hash, and comprehensive metadata can be attached including copyright information, publication details, and keywords.
|
|
|
|
**Note**: Consider using `tg-add-library-document` followed by `tg-start-library-processing` for more comprehensive document management.
|
|
|
|
## Options
|
|
|
|
### Required Arguments
|
|
|
|
- `files`: One or more PDF files to load
|
|
|
|
### Optional Arguments
|
|
|
|
- `-u, --url URL`: TrustGraph API URL (default: `$TRUSTGRAPH_URL` or `http://localhost:8088/`)
|
|
- `-f, --flow-id ID`: Flow instance ID to use (default: `default`)
|
|
- `-U, --user USER`: User ID for document ownership (default: `trustgraph`)
|
|
- `-C, --collection COLLECTION`: Collection to assign document (default: `default`)
|
|
|
|
### Document Metadata
|
|
|
|
- `--name NAME`: Document name/title
|
|
- `--description DESCRIPTION`: Document description
|
|
- `--identifier ID`: Custom document identifier
|
|
- `--document-url URL`: Source URL for the document
|
|
- `--keyword KEYWORD`: Document keywords (can be specified multiple times)
|
|
|
|
### Copyright Information
|
|
|
|
- `--copyright-notice NOTICE`: Copyright notice text
|
|
- `--copyright-holder HOLDER`: Copyright holder name
|
|
- `--copyright-year YEAR`: Copyright year
|
|
- `--license LICENSE`: Copyright license
|
|
|
|
### Publication Details
|
|
|
|
- `--publication-organization ORG`: Publishing organization
|
|
- `--publication-description DESC`: Publication description
|
|
- `--publication-date DATE`: Publication date
|
|
|
|
## Examples
|
|
|
|
### Basic PDF Loading
|
|
```bash
|
|
tg-load-pdf document.pdf
|
|
```
|
|
|
|
### Multiple Files
|
|
```bash
|
|
tg-load-pdf report1.pdf report2.pdf manual.pdf
|
|
```
|
|
|
|
### With Basic Metadata
|
|
```bash
|
|
tg-load-pdf \
|
|
--name "Technical Manual" \
|
|
--description "System administration guide" \
|
|
--keyword "technical" --keyword "manual" \
|
|
technical-manual.pdf
|
|
```
|
|
|
|
### Complete Metadata
|
|
```bash
|
|
tg-load-pdf \
|
|
--name "Annual Report 2023" \
|
|
--description "Company annual financial report" \
|
|
--copyright-holder "Acme Corporation" \
|
|
--copyright-year "2023" \
|
|
--license "All Rights Reserved" \
|
|
--publication-organization "Acme Corporation" \
|
|
--publication-date "2023-12-31" \
|
|
--keyword "financial" --keyword "annual" --keyword "report" \
|
|
annual-report-2023.pdf
|
|
```
|
|
|
|
### Custom Flow and Collection
|
|
```bash
|
|
tg-load-pdf \
|
|
-f "document-processing-flow" \
|
|
-U "finance-team" \
|
|
-C "financial-documents" \
|
|
--name "Budget Analysis" \
|
|
budget-2024.pdf
|
|
```
|
|
|
|
## Document Processing
|
|
|
|
### Content Extraction
|
|
The PDF loader:
|
|
1. Calculates SHA256 hash for unique document ID
|
|
2. Extracts text content from PDF
|
|
3. Preserves document structure and formatting metadata
|
|
4. Generates searchable text index
|
|
|
|
### Metadata Generation
|
|
Document metadata includes:
|
|
- **Document ID**: SHA256 hash-based unique identifier
|
|
- **Content Hash**: For duplicate detection
|
|
- **File Information**: Size, format, creation date
|
|
- **Custom Metadata**: User-provided attributes
|
|
|
|
### Integration with Processing Pipeline
|
|
```bash
|
|
# Load PDF and start processing
|
|
tg-load-pdf research-paper.pdf --name "AI Research Paper"
|
|
|
|
# Check processing status
|
|
tg-show-flows | grep "document-processing"
|
|
|
|
# Query loaded content
|
|
tg-invoke-document-rag -q "What is the main conclusion?" -C "default"
|
|
```
|
|
|
|
## Error Handling
|
|
|
|
### File Not Found
|
|
```bash
|
|
Exception: [Errno 2] No such file or directory: 'missing.pdf'
|
|
```
|
|
**Solution**: Verify file path and ensure PDF exists.
|
|
|
|
### Invalid PDF Format
|
|
```bash
|
|
Exception: PDF parsing failed: Invalid PDF structure
|
|
```
|
|
**Solution**: Verify PDF is not corrupted and is a valid PDF file.
|
|
|
|
### Permission Errors
|
|
```bash
|
|
Exception: [Errno 13] Permission denied: 'protected.pdf'
|
|
```
|
|
**Solution**: Check file permissions and ensure read access.
|
|
|
|
### Flow Not Found
|
|
```bash
|
|
Exception: Flow instance 'invalid-flow' not found
|
|
```
|
|
**Solution**: Verify flow ID exists with `tg-show-flows`.
|
|
|
|
### API Connection Issues
|
|
```bash
|
|
Exception: Connection refused
|
|
```
|
|
**Solution**: Check API URL and ensure TrustGraph is running.
|
|
|
|
## Advanced Usage
|
|
|
|
### Batch Processing
|
|
```bash
|
|
# Process all PDFs in directory
|
|
for pdf in *.pdf; do
|
|
echo "Loading $pdf..."
|
|
tg-load-pdf \
|
|
--name "$(basename "$pdf" .pdf)" \
|
|
--collection "research-papers" \
|
|
"$pdf"
|
|
done
|
|
```
|
|
|
|
### Organized Loading
|
|
```bash
|
|
# Load with structured metadata
|
|
categories=("technical" "financial" "legal")
|
|
for category in "${categories[@]}"; do
|
|
for pdf in "$category"/*.pdf; do
|
|
if [ -f "$pdf" ]; then
|
|
tg-load-pdf \
|
|
--collection "$category-documents" \
|
|
--keyword "$category" \
|
|
--name "$(basename "$pdf" .pdf)" \
|
|
"$pdf"
|
|
fi
|
|
done
|
|
done
|
|
```
|
|
|
|
### CSV-Driven Loading
|
|
```bash
|
|
# Load PDFs with metadata from CSV
|
|
# Format: filename,title,description,keywords
|
|
while IFS=',' read -r filename title description keywords; do
|
|
if [ -f "$filename" ]; then
|
|
echo "Loading $filename..."
|
|
|
|
# Convert comma-separated keywords to multiple --keyword args
|
|
keyword_args=""
|
|
IFS='|' read -ra KEYWORDS <<< "$keywords"
|
|
for kw in "${KEYWORDS[@]}"; do
|
|
keyword_args="$keyword_args --keyword \"$kw\""
|
|
done
|
|
|
|
eval "tg-load-pdf \
|
|
--name \"$title\" \
|
|
--description \"$description\" \
|
|
$keyword_args \
|
|
\"$filename\""
|
|
fi
|
|
done < documents.csv
|
|
```
|
|
|
|
### Publication Processing
|
|
```bash
|
|
# Load academic papers with publication details
|
|
load_academic_paper() {
|
|
local file="$1"
|
|
local title="$2"
|
|
local authors="$3"
|
|
local journal="$4"
|
|
local year="$5"
|
|
|
|
tg-load-pdf \
|
|
--name "$title" \
|
|
--description "Academic paper: $title" \
|
|
--copyright-holder "$authors" \
|
|
--copyright-year "$year" \
|
|
--publication-organization "$journal" \
|
|
--publication-date "$year-01-01" \
|
|
--keyword "academic" --keyword "research" \
|
|
"$file"
|
|
}
|
|
|
|
# Usage
|
|
load_academic_paper "ai-paper.pdf" "AI in Healthcare" "Smith et al." "AI Journal" "2023"
|
|
```
|
|
|
|
## Monitoring and Validation
|
|
|
|
### Load Status Checking
|
|
```bash
|
|
# Check document loading progress
|
|
check_load_status() {
|
|
local file="$1"
|
|
local expected_name="$2"
|
|
|
|
echo "Checking load status for: $file"
|
|
|
|
# Check if document appears in library
|
|
if tg-show-library-documents | grep -q "$expected_name"; then
|
|
echo "✓ Document loaded successfully"
|
|
else
|
|
echo "✗ Document not found in library"
|
|
return 1
|
|
fi
|
|
}
|
|
|
|
# Monitor batch loading
|
|
for pdf in *.pdf; do
|
|
name=$(basename "$pdf" .pdf)
|
|
check_load_status "$pdf" "$name"
|
|
done
|
|
```
|
|
|
|
### Content Verification
|
|
```bash
|
|
# Verify PDF content is accessible
|
|
verify_pdf_content() {
|
|
local pdf_name="$1"
|
|
local test_query="$2"
|
|
|
|
echo "Verifying content for: $pdf_name"
|
|
|
|
# Try to query the document
|
|
result=$(tg-invoke-document-rag -q "$test_query" -C "default" 2>/dev/null)
|
|
|
|
if [ $? -eq 0 ] && [ -n "$result" ]; then
|
|
echo "✓ Content accessible via RAG"
|
|
else
|
|
echo "✗ Content not accessible"
|
|
return 1
|
|
fi
|
|
}
|
|
|
|
# Verify loaded documents
|
|
verify_pdf_content "Technical Manual" "What is the installation process?"
|
|
```
|
|
|
|
## Performance Optimization
|
|
|
|
### Parallel Loading
|
|
```bash
|
|
# Load multiple PDFs in parallel
|
|
pdf_files=(document1.pdf document2.pdf document3.pdf)
|
|
for pdf in "${pdf_files[@]}"; do
|
|
(
|
|
echo "Loading $pdf in background..."
|
|
tg-load-pdf \
|
|
--name "$(basename "$pdf" .pdf)" \
|
|
--collection "batch-$(date +%Y%m%d)" \
|
|
"$pdf"
|
|
) &
|
|
done
|
|
wait
|
|
echo "All PDFs loaded"
|
|
```
|
|
|
|
### Size-Based Processing
|
|
```bash
|
|
# Process files based on size
|
|
for pdf in *.pdf; do
|
|
size=$(stat -c%s "$pdf")
|
|
if [ $size -lt 10485760 ]; then # < 10MB
|
|
echo "Processing small file: $pdf"
|
|
tg-load-pdf --collection "small-docs" "$pdf"
|
|
else
|
|
echo "Processing large file: $pdf"
|
|
tg-load-pdf --collection "large-docs" "$pdf"
|
|
fi
|
|
done
|
|
```
|
|
|
|
## Document Organization
|
|
|
|
### Collection Management
|
|
```bash
|
|
# Organize by document type
|
|
organize_by_type() {
|
|
local pdf="$1"
|
|
local filename=$(basename "$pdf" .pdf)
|
|
|
|
case "$filename" in
|
|
*manual*|*guide*) collection="manuals" ;;
|
|
*report*|*analysis*) collection="reports" ;;
|
|
*spec*|*specification*) collection="specifications" ;;
|
|
*legal*|*contract*) collection="legal" ;;
|
|
*) collection="general" ;;
|
|
esac
|
|
|
|
tg-load-pdf \
|
|
--collection "$collection" \
|
|
--name "$filename" \
|
|
"$pdf"
|
|
}
|
|
|
|
# Process all PDFs
|
|
for pdf in *.pdf; do
|
|
organize_by_type "$pdf"
|
|
done
|
|
```
|
|
|
|
### Metadata Standardization
|
|
```bash
|
|
# Apply consistent metadata standards
|
|
standardize_metadata() {
|
|
local pdf="$1"
|
|
local dept="$2"
|
|
local year="$3"
|
|
|
|
local name=$(basename "$pdf" .pdf)
|
|
local collection="$dept-$(date +%Y)"
|
|
|
|
tg-load-pdf \
|
|
--name "$name" \
|
|
--description "$dept document from $year" \
|
|
--copyright-holder "Company Name" \
|
|
--copyright-year "$year" \
|
|
--collection "$collection" \
|
|
--keyword "$dept" --keyword "$year" \
|
|
"$pdf"
|
|
}
|
|
|
|
# Usage
|
|
standardize_metadata "finance-report.pdf" "finance" "2023"
|
|
```
|
|
|
|
## Integration with Other Services
|
|
|
|
### Library Integration
|
|
```bash
|
|
# Alternative approach using library services
|
|
load_via_library() {
|
|
local pdf="$1"
|
|
local name="$2"
|
|
|
|
# Add to library first
|
|
tg-add-library-document \
|
|
--name "$name" \
|
|
--file "$pdf" \
|
|
--collection "documents"
|
|
|
|
# Start processing
|
|
tg-start-library-processing \
|
|
--collection "documents"
|
|
}
|
|
```
|
|
|
|
### Workflow Integration
|
|
```bash
|
|
# Complete document workflow
|
|
process_document_workflow() {
|
|
local pdf="$1"
|
|
local name="$2"
|
|
|
|
echo "Starting document workflow for: $name"
|
|
|
|
# 1. Load PDF
|
|
tg-load-pdf --name "$name" "$pdf"
|
|
|
|
# 2. Wait for processing
|
|
sleep 5
|
|
|
|
# 3. Verify availability
|
|
if tg-show-library-documents | grep -q "$name"; then
|
|
echo "Document available in library"
|
|
|
|
# 4. Test RAG functionality
|
|
tg-invoke-document-rag -q "What is this document about?"
|
|
|
|
# 5. Extract key information
|
|
tg-invoke-prompt extract-key-points \
|
|
text="Document: $name" \
|
|
format="bullet_points"
|
|
else
|
|
echo "Document processing failed"
|
|
fi
|
|
}
|
|
```
|
|
|
|
## Environment Variables
|
|
|
|
- `TRUSTGRAPH_URL`: Default API URL
|
|
|
|
## Related Commands
|
|
|
|
- [`tg-add-library-document`](tg-add-library-document.md) - Add documents to library
|
|
- [`tg-start-library-processing`](tg-start-library-processing.md) - Process library documents
|
|
- [`tg-show-library-documents`](tg-show-library-documents.md) - List library documents
|
|
- [`tg-invoke-document-rag`](tg-invoke-document-rag.md) - Query document content
|
|
- [`tg-show-flows`](tg-show-flows.md) - Monitor processing flows
|
|
|
|
## API Integration
|
|
|
|
This command uses the document loading API to process PDF files and make them available for text extraction, search, and analysis.
|
|
|
|
## Best Practices
|
|
|
|
1. **Metadata Completeness**: Provide comprehensive metadata for better organization
|
|
2. **Collection Organization**: Use logical collections for document categorization
|
|
3. **Error Handling**: Implement robust error handling for batch operations
|
|
4. **Performance**: Consider file sizes and processing capacity
|
|
5. **Monitoring**: Verify successful loading and processing
|
|
6. **Security**: Ensure sensitive documents are properly protected
|
|
7. **Backup**: Maintain backups of source PDFs
|
|
|
|
## Troubleshooting
|
|
|
|
### PDF Processing Issues
|
|
```bash
|
|
# Check PDF validity
|
|
file document.pdf
|
|
pdfinfo document.pdf
|
|
|
|
# Try alternative PDF processors
|
|
qpdf --check document.pdf
|
|
```
|
|
|
|
### Memory Issues
|
|
```bash
|
|
# For large PDFs, monitor memory usage
|
|
free -h
|
|
# Consider processing large files separately
|
|
```
|
|
|
|
### Content Extraction Problems
|
|
```bash
|
|
# Verify PDF contains extractable text
|
|
pdftotext document.pdf test-output.txt
|
|
cat test-output.txt | head -20
|
|
``` |