trustgraph/docs/apis/api-document-embeddings.md
cybermaggedon cc224e97f6
Update docs for API/CLI changes in 1.0 (#420)
* Update some API basics for the 0.23/1.0 API change
2025-07-03 14:58:29 +01:00


# TrustGraph Document Embeddings API
This API provides import, export, and query capabilities for document embeddings. It handles
document chunks with their vector embeddings and metadata, supporting both real-time WebSocket
operations and request/response patterns.
## Schema Overview
### DocumentEmbeddings Structure
- `metadata`: Document metadata (ID, user, collection, RDF triples)
- `chunks`: Array of document chunks with embeddings
### ChunkEmbeddings Structure
- `chunk`: Text chunk as bytes
- `vectors`: Array of vector embeddings (Array of Array of Double)
### DocumentEmbeddingsRequest Structure
- `vectors`: Query vector embeddings
- `limit`: Maximum number of results
- `user`: User identifier
- `collection`: Collection identifier
### DocumentEmbeddingsResponse Structure
- `error`: Error information if operation fails
- `documents`: Array of matching documents as bytes
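
As a rough illustration, the four structures above can be mirrored as Python dataclasses. This is a sketch only: the class names follow the headings, while the `Value` and `Triple` helper types and the field defaults are assumptions made for the example, not part of the TrustGraph codebase.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Value:
    v: str   # the value itself
    e: bool  # True when the value is an entity, False for a literal


@dataclass
class Triple:
    s: Value  # subject
    p: Value  # predicate
    o: Value  # object


@dataclass
class Metadata:
    id: str
    user: str
    collection: str
    metadata: list[Triple] = field(default_factory=list)


@dataclass
class ChunkEmbeddings:
    chunk: str                  # plain text on the WebSocket API
    vectors: list[list[float]]  # one or more embeddings per chunk


@dataclass
class DocumentEmbeddings:
    metadata: Metadata
    chunks: list[ChunkEmbeddings]


@dataclass
class DocumentEmbeddingsRequest:
    vectors: list[list[float]]
    limit: int
    user: str
    collection: str


@dataclass
class DocumentEmbeddingsResponse:
    documents: list[str] = field(default_factory=list)
    error: Optional[dict] = None


doc = DocumentEmbeddings(
    metadata=Metadata(id="doc-123", user="alice", collection="research"),
    chunks=[ChunkEmbeddings(chunk="first chunk", vectors=[[0.1, 0.2]])],
)
payload = asdict(doc)  # nested dict with the same shape as the JSON messages
```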
## Import/Export Operations
### Import - WebSocket Endpoint
**Endpoint:** `/api/v1/flow/{flow}/import/document-embeddings`
**Method:** WebSocket connection
**Request Format:**
```json
{
    "metadata": {
        "id": "doc-123",
        "user": "alice",
        "collection": "research",
        "metadata": [
            {
                "s": {"v": "doc-123", "e": true},
                "p": {"v": "dc:title", "e": true},
                "o": {"v": "Research Paper", "e": false}
            }
        ]
    },
    "chunks": [
        {
            "chunk": "This is the first chunk of the document...",
            "vectors": [
                [0.1, 0.2, 0.3, 0.4],
                [0.5, 0.6, 0.7, 0.8]
            ]
        },
        {
            "chunk": "This is the second chunk...",
            "vectors": [
                [0.9, 0.8, 0.7, 0.6],
                [0.5, 0.4, 0.3, 0.2]
            ]
        }
    ]
}
```
**Response:** Import operations are fire-and-forget with no response payload.
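
A message in the request format above could be assembled along these lines. The helper names `import_url` and `build_import_message` are hypothetical, and the base URL is simply the one used in the usage examples; only the standard library is used, so sending the message over a WebSocket is left to whichever client you prefer.

```python
import json


def import_url(base_url: str, flow: str) -> str:
    """Return the import endpoint URL for a given flow id."""
    return f"{base_url.rstrip('/')}/api/v1/flow/{flow}/import/document-embeddings"


def build_import_message(doc_id, user, collection, chunks, triples=None) -> str:
    """Serialise one document as an import message.

    `chunks` is a list of (text, vectors) pairs; `triples` is an optional
    list of metadata triples already in {"s": ..., "p": ..., "o": ...} form.
    """
    message = {
        "metadata": {
            "id": doc_id,
            "user": user,
            "collection": collection,
            "metadata": triples or [],
        },
        "chunks": [
            {"chunk": text, "vectors": vectors} for text, vectors in chunks
        ],
    }
    return json.dumps(message)


url = import_url("ws://api-gateway:8080", "my-flow")
message = build_import_message(
    "doc-123", "alice", "research",
    [("This is the first chunk of the document...", [[0.1, 0.2, 0.3, 0.4]])],
)
```

Because the import is fire-and-forget, any validation has to happen on the client side before the message is sent.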
### Export - WebSocket Endpoint
**Endpoint:** `/api/v1/flow/{flow}/export/document-embeddings`
**Method:** WebSocket connection
The export endpoint streams document embeddings in real time. Each message has the following structure:
```json
{
    "metadata": {
        "id": "doc-123",
        "user": "alice",
        "collection": "research",
        "metadata": [
            {
                "s": {"v": "doc-123", "e": true},
                "p": {"v": "dc:title", "e": true},
                "o": {"v": "Research Paper", "e": false}
            }
        ]
    },
    "chunks": [
        {
            "chunk": "Decoded text content of chunk",
            "vectors": [[0.1, 0.2, 0.3, 0.4]]
        }
    ]
}
```
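
On the consuming side, each exported message can be parsed and reduced to the fields of interest. The sketch below assumes the message arrives as a JSON string shaped like the example above; `summarise_export_message` is a hypothetical helper name.

```python
import json


def summarise_export_message(raw: str) -> dict:
    """Extract the document id, chunk count and vector dimensions."""
    msg = json.loads(raw)
    chunks = msg.get("chunks", [])
    # Collect the distinct vector lengths seen across all chunks
    dims = sorted({len(vec) for c in chunks for vec in c.get("vectors", [])})
    return {
        "id": msg["metadata"]["id"],
        "chunk_count": len(chunks),
        "vector_dimensions": dims,
    }


raw_message = json.dumps({
    "metadata": {"id": "doc-123", "user": "alice", "collection": "research",
                 "metadata": []},
    "chunks": [{"chunk": "Decoded text content of chunk",
                "vectors": [[0.1, 0.2, 0.3, 0.4]]}],
})
summary = summarise_export_message(raw_message)
```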
## Query Operations
### Query Document Embeddings
**Purpose:** Find documents similar to provided vector embeddings
**Request:**
```json
{
    "vectors": [
        [0.1, 0.2, 0.3, 0.4, 0.5],
        [0.6, 0.7, 0.8, 0.9, 1.0]
    ],
    "limit": 10,
    "user": "alice",
    "collection": "research"
}
```
**Response:**
```json
{
    "documents": [
        "base64-encoded-document-1",
        "base64-encoded-document-2"
    ]
}
```
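
Since the `documents` entries are base64 strings, a client has to decode them before use. The following sketch assumes each entry is base64-encoded UTF-8 text, which matches the text-encoding notes in this document but is not otherwise confirmed; `decode_documents` is a hypothetical helper.

```python
import base64


def decode_documents(response: dict) -> list[str]:
    """Decode base64 `documents` entries, assuming UTF-8 text content."""
    return [
        base64.b64decode(doc).decode("utf-8")
        for doc in response.get("documents", [])
    ]


# Build a sample response locally rather than calling the service
sample_response = {
    "documents": [
        base64.b64encode("First chunk".encode("utf-8")).decode("ascii"),
        base64.b64encode("Second chunk".encode("utf-8")).decode("ascii"),
    ]
}
texts = decode_documents(sample_response)
```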
## WebSocket Usage Examples
### Importing Document Embeddings
```javascript
// Connect to import endpoint
const ws = new WebSocket('ws://api-gateway:8080/api/v1/flow/my-flow/import/document-embeddings');

// Send document embeddings once the connection is open;
// calling send() while still connecting throws an InvalidStateError
ws.onopen = () => {
    ws.send(JSON.stringify({
        metadata: {
            id: "doc-123",
            user: "alice",
            collection: "research"
        },
        chunks: [
            {
                chunk: "Document content chunk 1",
                vectors: [[0.1, 0.2, 0.3]]
            }
        ]
    }));
};
```
### Exporting Document Embeddings
```javascript
// Connect to export endpoint
const ws = new WebSocket('ws://api-gateway:8080/api/v1/flow/my-flow/export/document-embeddings');

// Listen for exported data
ws.onmessage = (event) => {
    const documentEmbeddings = JSON.parse(event.data);
    console.log('Received document:', documentEmbeddings.metadata.id);
    console.log('Chunks:', documentEmbeddings.chunks.length);
};
```
## Data Format Details
### Metadata Format
Each metadata triple contains:
- `s`: Subject (object with `v` for value and `e` for is_entity boolean)
- `p`: Predicate (object with `v` for value and `e` for is_entity boolean)
- `o`: Object (object with `v` for value and `e` for is_entity boolean)
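
A small helper can keep the `v`/`e` bookkeeping in one place. In this sketch the function names are hypothetical, and treating the subject and predicate as entities by default is an assumption based on the examples in this document.

```python
def value(v: str, is_entity: bool) -> dict:
    """Wrap a value with its is_entity flag."""
    return {"v": v, "e": is_entity}


def triple(subject: str, predicate: str, obj: str,
           obj_is_entity: bool = False) -> dict:
    """Build a metadata triple; the object is a literal unless flagged."""
    return {
        "s": value(subject, True),
        "p": value(predicate, True),
        "o": value(obj, obj_is_entity),
    }


title = triple("doc-123", "dc:title", "Research Paper")
```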
### Vector Format
- Vectors are arrays of floating-point numbers
- Each chunk can have multiple vectors (different embedding models)
- Vectors should be consistently dimensioned within a collection
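
One way to enforce consistent dimensions is to check them on the client before importing. The check below is a sketch with a hypothetical function name; it treats any mix of vector lengths as an error, which is stricter than the service itself may be.

```python
import math


def common_dimension(chunks: list[dict]) -> int:
    """Return the single vector dimension shared by all chunks.

    Raises ValueError if vectors differ in length or contain values
    that are not finite numbers.
    """
    dims = set()
    for index, chunk in enumerate(chunks):
        for vector in chunk["vectors"]:
            if not all(isinstance(x, (int, float)) and math.isfinite(x)
                       for x in vector):
                raise ValueError(f"chunk {index}: non-finite or non-numeric value")
            dims.add(len(vector))
    if len(dims) != 1:
        raise ValueError(f"expected one vector dimension, found {sorted(dims)}")
    return dims.pop()


chunks = [
    {"chunk": "first", "vectors": [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]},
    {"chunk": "second", "vectors": [[0.9, 0.8, 0.7, 0.6]]},
]
dimension = common_dimension(chunks)
```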
### Text Encoding
- Chunk text is handled internally as UTF-8 encoded bytes
- The WebSocket API accepts and returns plain text strings
- Base64 encoding is used for binary data in query responses
## Python SDK
```python
import asyncio

from trustgraph.clients.document_embeddings_client import DocumentEmbeddingsClient


async def main():
    # Create client
    client = DocumentEmbeddingsClient()

    # Query similar documents
    request = {
        "vectors": [[0.1, 0.2, 0.3, 0.4]],
        "limit": 5,
        "user": "alice",
        "collection": "research"
    }

    # `await` is only valid inside a coroutine
    response = await client.query(request)
    return response.documents


documents = asyncio.run(main())
```
## Integration with TrustGraph
### Storage Integration
- Document embeddings are stored in vector databases
- Metadata is cross-referenced with the knowledge graph
- Supports multi-tenant isolation by user and collection
### Processing Pipeline
1. **Document Ingestion**: Text documents loaded via text-load API
2. **Chunking**: Documents split into manageable chunks
3. **Embedding Generation**: Vector embeddings created for each chunk
4. **Storage**: Embeddings stored via import API
5. **Retrieval**: Similar documents found via query API
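
Steps 2 to 4 of the pipeline can be sketched as follows. Everything here is illustrative: the character-window chunking strategy, the `chunk_text` and `prepare_document` names, and the `fake_embed` stand-in are assumptions, not the way TrustGraph itself chunks or embeds documents.

```python
from typing import Callable

# An embedder maps chunk text to one or more vectors
Embedder = Callable[[str], list[list[float]]]


def chunk_text(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop early enough that the last window is not pure overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]


def prepare_document(doc_id: str, user: str, collection: str, text: str,
                     embed: Embedder, size: int = 200, overlap: int = 20) -> dict:
    """Chunk a document, embed each chunk, and build an import message."""
    return {
        "metadata": {"id": doc_id, "user": user, "collection": collection,
                     "metadata": []},
        "chunks": [{"chunk": c, "vectors": embed(c)}
                   for c in chunk_text(text, size, overlap)],
    }


# Stand-in embedder for illustration only; a real pipeline would call
# an embedding model here.
def fake_embed(text: str) -> list[list[float]]:
    return [[float(len(text)), 0.0]]


message = prepare_document("doc-123", "alice", "research",
                           "abcdefghij", fake_embed, size=4, overlap=1)
```

Passing the embedder in as a callable keeps the chunking and message-building logic independent of any particular embedding model.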
### Use Cases
- **Semantic Search**: Find documents similar to query embeddings
- **RAG Systems**: Retrieve relevant document chunks for question answering
- **Document Clustering**: Group similar documents using embeddings
- **Content Recommendations**: Suggest related documents to users
- **Knowledge Discovery**: Find connections between document collections
## Error Handling
Common error scenarios:
- Invalid vector dimensions
- Missing required metadata fields
- User/collection access restrictions
- WebSocket connection failures
- Malformed JSON data
Errors are returned in the response `error` field:
```json
{
    "error": {
        "type": "ValidationError",
        "message": "Invalid vector dimensions"
    }
}
```
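
A client can turn a populated `error` field into an exception so that failures are not silently ignored. This is a minimal sketch: the exception class and the `documents_or_raise` helper are hypothetical names, and the response is assumed to be an already-parsed JSON object.

```python
class DocumentEmbeddingsError(Exception):
    """Raised when a response carries a populated `error` field."""

    def __init__(self, error_type: str, message: str):
        super().__init__(f"{error_type}: {message}")
        self.error_type = error_type
        self.message = message


def documents_or_raise(response: dict) -> list:
    """Return `documents`, or raise if the response reports an error."""
    error = response.get("error")
    if error:
        raise DocumentEmbeddingsError(
            error.get("type", "unknown"), error.get("message", "")
        )
    return response.get("documents", [])


failed = {"error": {"type": "ValidationError",
                    "message": "Invalid vector dimensions"}}
try:
    documents_or_raise(failed)
except DocumentEmbeddingsError as exc:
    reported = str(exc)
```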
## Performance Considerations
- **Batch Processing**: Import multiple documents in a single WebSocket session rather than opening a new connection per document
- **Vector Dimensions**: Consistent embedding dimensions improve performance
- **Collection Sizing**: Limit collections to reasonable sizes for query performance
- **Real-time vs Batch**: Choose appropriate method based on use case requirements
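
The batch-processing point can be sketched as a loop that reuses one connection for many documents. Here `send` stands for the send method of an already-open WebSocket client of your choice; `send_batch` is a hypothetical helper, and a plain list is used in place of a real connection so the example runs offline.

```python
import json
from typing import Callable, Iterable


def send_batch(send: Callable[[str], None], documents: Iterable[dict]) -> int:
    """Send each document as one message over the same session.

    Returns the number of messages sent.
    """
    count = 0
    for doc in documents:
        send(json.dumps(doc))
        count += 1
    return count


docs = [
    {"metadata": {"id": f"doc-{n}", "user": "alice", "collection": "research"},
     "chunks": [{"chunk": f"chunk {n}", "vectors": [[0.1, 0.2, 0.3]]}]}
    for n in range(3)
]

outbox: list[str] = []  # stands in for an open WebSocket connection
sent = send_batch(outbox.append, docs)
```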