trustgraph/docs/apis/api-document-embeddings.md

# TrustGraph Document Embeddings API

This API provides import, export, and query capabilities for document embeddings. It handles
document chunks with their vector embeddings and metadata, supporting both real-time WebSocket
operations and request/response patterns.

## Schema Overview

### DocumentEmbeddings Structure
- `metadata`: Document metadata (ID, user, collection, RDF triples)
- `chunks`: Array of document chunks with embeddings

### ChunkEmbeddings Structure
- `chunk`: Text chunk as bytes
- `vectors`: Array of vector embeddings (Array of Array of Double)

### DocumentEmbeddingsRequest Structure
- `vectors`: Query vector embeddings
- `limit`: Maximum number of results
- `user`: User identifier
- `collection`: Collection identifier

### DocumentEmbeddingsResponse Structure
- `error`: Error information if operation fails
- `documents`: Array of matching documents as bytes

## Import/Export Operations

### Import - WebSocket Endpoint

**Endpoint:** `/api/v1/flow/{flow}/import/document-embeddings`

**Method:** WebSocket connection

**Request Format:**
```json
{
    "metadata": {
        "id": "doc-123",
        "user": "alice",
        "collection": "research",
        "metadata": [
            {
                "s": {"v": "doc-123", "e": true},
                "p": {"v": "dc:title", "e": true},
                "o": {"v": "Research Paper", "e": false}
            }
        ]
    },
    "chunks": [
        {
            "chunk": "This is the first chunk of the document...",
            "vectors": [
                [0.1, 0.2, 0.3, 0.4],
                [0.5, 0.6, 0.7, 0.8]
            ]
        },
        {
            "chunk": "This is the second chunk...",
            "vectors": [
                [0.9, 0.8, 0.7, 0.6],
                [0.5, 0.4, 0.3, 0.2]
            ]
        }
    ]
}
```

**Response:** Import operations are fire-and-forget with no response payload.

### Export - WebSocket Endpoint

**Endpoint:** `/api/v1/flow/{flow}/export/document-embeddings`

**Method:** WebSocket connection

The export endpoint streams document embeddings data in real-time. Each message contains:

```json
{
    "metadata": {
        "id": "doc-123",
        "user": "alice",
        "collection": "research",
        "metadata": [
            {
                "s": {"v": "doc-123", "e": true},
                "p": {"v": "dc:title", "e": true},
                "o": {"v": "Research Paper", "e": false}
            }
        ]
    },
    "chunks": [
        {
            "chunk": "Decoded text content of chunk",
            "vectors": [[0.1, 0.2, 0.3, 0.4]]
        }
    ]
}
```

## Query Operations

### Query Document Embeddings

**Purpose:** Find documents similar to provided vector embeddings

**Request:**
```json
{
    "vectors": [
        [0.1, 0.2, 0.3, 0.4, 0.5],
        [0.6, 0.7, 0.8, 0.9, 1.0]
    ],
    "limit": 10,
    "user": "alice",
    "collection": "research"
}
```

**Response:**
```json
{
    "documents": [
        "base64-encoded-document-1",
        "base64-encoded-document-2"
    ]
}
```

## WebSocket Usage Examples

### Importing Document Embeddings

```javascript
// Connect to import endpoint
const ws = new WebSocket('ws://api-gateway:8080/api/v1/flow/my-flow/import/document-embeddings');

// Send document embeddings
ws.send(JSON.stringify({
    metadata: {
        id: "doc-123",
        user: "alice",
        collection: "research"
    },
    chunks: [
        {
            chunk: "Document content chunk 1",
            vectors: [[0.1, 0.2, 0.3]]
        }
    ]
}));
```

### Exporting Document Embeddings

```javascript
// Connect to export endpoint
const ws = new WebSocket('ws://api-gateway:8080/api/v1/flow/my-flow/export/document-embeddings');

// Listen for exported data
ws.onmessage = (event) => {
    const documentEmbeddings = JSON.parse(event.data);
    console.log('Received document:', documentEmbeddings.metadata.id);
    console.log('Chunks:', documentEmbeddings.chunks.length);
};
```

## Data Format Details

### Metadata Format
Each metadata triple contains:
- `s`: Subject (object with `v` for value and `e` for is_entity boolean)
- `p`: Predicate (object with `v` for value and `e` for is_entity boolean)
- `o`: Object (object with `v` for value and `e` for is_entity boolean)

### Vector Format
- Vectors are arrays of floating-point numbers
- Each chunk can have multiple vectors (different embedding models)
- Vectors should be consistently dimensioned within a collection

### Text Encoding
- Chunk text is handled as UTF-8 encoded bytes internally
- WebSocket API accepts/returns plain text strings
- Base64 encoding used for binary data in query responses

## Python SDK

```python
from trustgraph.clients.document_embeddings_client import DocumentEmbeddingsClient

# Create client
client = DocumentEmbeddingsClient()

# Query similar documents
request = {
    "vectors": [[0.1, 0.2, 0.3, 0.4]],
    "limit": 5,
    "user": "alice",
    "collection": "research"
}

response = await client.query(request)
documents = response.documents
```

## Integration with TrustGraph

### Storage Integration
- Document embeddings are stored in vector databases
- Metadata is cross-referenced with knowledge graph
- Supports multi-tenant isolation by user and collection

### Processing Pipeline
1. **Document Ingestion**: Text documents loaded via text-load API
2. **Chunking**: Documents split into manageable chunks
3. **Embedding Generation**: Vector embeddings created for each chunk
4. **Storage**: Embeddings stored via import API
5. **Retrieval**: Similar documents found via query API

### Use Cases
- **Semantic Search**: Find documents similar to query embeddings
- **RAG Systems**: Retrieve relevant document chunks for question answering
- **Document Clustering**: Group similar documents using embeddings
- **Content Recommendations**: Suggest related documents to users
- **Knowledge Discovery**: Find connections between document collections

## Error Handling

Common error scenarios:
- Invalid vector dimensions
- Missing required metadata fields
- User/collection access restrictions
- WebSocket connection failures
- Malformed JSON data

Errors are returned in the response `error` field:
```json
{
    "error": {
        "type": "ValidationError",
        "message": "Invalid vector dimensions"
    }
}
```

## Performance Considerations

- **Batch Processing**: Import multiple documents in single WebSocket session
- **Vector Dimensions**: Consistent embedding dimensions improve performance
- **Collection Sizing**: Limit collections to reasonable sizes for query performance
- **Real-time vs Batch**: Choose appropriate method based on use case requirements