mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 08:26:21 +02:00
Update docs for API/CLI changes in 1.0 (#421)
* Update some API basics for the 0.23/1.0 API change
parent
f907ea7db8
commit
44bdd29f51
69 changed files with 19981 additions and 407 deletions
docs/apis/api-document-embeddings.md (new file, 252 lines)
@@ -0,0 +1,252 @@
# TrustGraph Document Embeddings API

This API provides import, export, and query capabilities for document embeddings. It handles
document chunks with their vector embeddings and metadata, supporting both real-time WebSocket
operations and request/response patterns.

## Schema Overview

### DocumentEmbeddings Structure
- `metadata`: Document metadata (ID, user, collection, RDF triples)
- `chunks`: Array of document chunks with embeddings

### ChunkEmbeddings Structure
- `chunk`: Text chunk as bytes
- `vectors`: Array of vector embeddings (Array of Array of Double)

### DocumentEmbeddingsRequest Structure
- `vectors`: Query vector embeddings
- `limit`: Maximum number of results
- `user`: User identifier
- `collection`: Collection identifier

### DocumentEmbeddingsResponse Structure
- `error`: Error information if operation fails
- `documents`: Array of matching documents as bytes
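
To make the four structures above easier to compare, here is a minimal sketch that mirrors them as Python dataclasses. These classes are illustrative only and are not TrustGraph's own schema classes; the field names follow the lists above, while the Python types (`bytes`, `dict`, `Optional`) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChunkEmbeddings:
    chunk: bytes                  # text chunk as bytes
    vectors: List[List[float]]    # one or more embedding vectors


@dataclass
class DocumentEmbeddings:
    metadata: dict                # id, user, collection, RDF triples
    chunks: List[ChunkEmbeddings] = field(default_factory=list)


@dataclass
class DocumentEmbeddingsRequest:
    vectors: List[List[float]]    # query vector embeddings
    limit: int                    # maximum number of results
    user: str
    collection: str


@dataclass
class DocumentEmbeddingsResponse:
    error: Optional[dict] = None  # set only when the operation fails
    documents: List[bytes] = field(default_factory=list)


# Example instance using the values from this document's samples.
doc = DocumentEmbeddings(
    metadata={"id": "doc-123", "user": "alice", "collection": "research"},
    chunks=[ChunkEmbeddings(chunk=b"first chunk", vectors=[[0.1, 0.2, 0.3, 0.4]])],
)
```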

## Import/Export Operations

### Import - WebSocket Endpoint

**Endpoint:** `/api/v1/flow/{flow}/import/document-embeddings`

**Method:** WebSocket connection

**Request Format:**
```json
{
  "metadata": {
    "id": "doc-123",
    "user": "alice",
    "collection": "research",
    "metadata": [
      {
        "s": {"v": "doc-123", "e": true},
        "p": {"v": "dc:title", "e": true},
        "o": {"v": "Research Paper", "e": false}
      }
    ]
  },
  "chunks": [
    {
      "chunk": "This is the first chunk of the document...",
      "vectors": [
        [0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8]
      ]
    },
    {
      "chunk": "This is the second chunk...",
      "vectors": [
        [0.9, 0.8, 0.7, 0.6],
        [0.5, 0.4, 0.3, 0.2]
      ]
    }
  ]
}
```

**Response:** Import operations are fire-and-forget with no response payload.
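
Because the import endpoint sends nothing back, it can help to assemble and inspect the message before it goes out. The sketch below builds a message in the request format above using only the standard library. The helper names `triple` and `build_import_message` are hypothetical and not part of TrustGraph, and treating subject and predicate as entities (`"e": true`) simply follows the example.

```python
import json


def triple(s, p, o, o_is_entity=False):
    """Build one metadata triple in the s/p/o form used by the request format."""
    return {
        "s": {"v": s, "e": True},
        "p": {"v": p, "e": True},
        "o": {"v": o, "e": o_is_entity},
    }


def build_import_message(doc_id, user, collection, chunks, triples=()):
    """Serialise (text, vectors) pairs into one import message as JSON text."""
    return json.dumps({
        "metadata": {
            "id": doc_id,
            "user": user,
            "collection": collection,
            "metadata": list(triples),
        },
        "chunks": [
            {"chunk": text, "vectors": vectors} for text, vectors in chunks
        ],
    })


message = build_import_message(
    "doc-123", "alice", "research",
    chunks=[("This is the first chunk of the document...", [[0.1, 0.2, 0.3, 0.4]])],
    triples=[triple("doc-123", "dc:title", "Research Paper")],
)
```

The resulting `message` string is what a WebSocket client would send as a single text frame.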

### Export - WebSocket Endpoint

**Endpoint:** `/api/v1/flow/{flow}/export/document-embeddings`

**Method:** WebSocket connection

The export endpoint streams document embeddings data in real time. Each message contains:

```json
{
  "metadata": {
    "id": "doc-123",
    "user": "alice",
    "collection": "research",
    "metadata": [
      {
        "s": {"v": "doc-123", "e": true},
        "p": {"v": "dc:title", "e": true},
        "o": {"v": "Research Paper", "e": false}
      }
    ]
  },
  "chunks": [
    {
      "chunk": "Decoded text content of chunk",
      "vectors": [[0.1, 0.2, 0.3, 0.4]]
    }
  ]
}
```
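
A consumer of the export stream usually wants a quick summary of each message before storing it. The following is a small sketch, assuming each WebSocket frame carries one JSON message in the shape shown above; `summarise_export_message` is a hypothetical helper, not a TrustGraph function.

```python
import json


def summarise_export_message(raw):
    """Return (document id, chunk count, vector lengths seen) for one export message."""
    message = json.loads(raw)
    chunks = message.get("chunks", [])
    lengths = sorted({len(vec) for chunk in chunks for vec in chunk.get("vectors", [])})
    return message["metadata"]["id"], len(chunks), lengths


# A message shaped like the example above, as it would arrive on the socket.
raw = json.dumps({
    "metadata": {"id": "doc-123", "user": "alice", "collection": "research", "metadata": []},
    "chunks": [{"chunk": "Decoded text content of chunk", "vectors": [[0.1, 0.2, 0.3, 0.4]]}],
})

doc_id, chunk_count, lengths = summarise_export_message(raw)
print(doc_id, chunk_count, lengths)
```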

## Query Operations

### Query Document Embeddings

**Purpose:** Find documents similar to provided vector embeddings

**Request:**
```json
{
  "vectors": [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.6, 0.7, 0.8, 0.9, 1.0]
  ],
  "limit": 10,
  "user": "alice",
  "collection": "research"
}
```

**Response:**
```json
{
  "documents": [
    "base64-encoded-document-1",
    "base64-encoded-document-2"
  ]
}
```
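
The `documents` entries in the response are base64 strings, so a client has to decode them before use. Here is a minimal sketch of both sides of the exchange: building the request body and decoding the response. The helper names are hypothetical, and decoding the bytes as UTF-8 text is an assumption that holds only when the stored documents are text chunks.

```python
import base64


def build_query(vectors, user, collection, limit=10):
    """Assemble a query request body with the four fields shown above."""
    return {"vectors": vectors, "limit": limit, "user": user, "collection": collection}


def decode_documents(response):
    """Decode base64 document strings to text (assumes UTF-8 content)."""
    return [
        base64.b64decode(doc).decode("utf-8")
        for doc in response.get("documents", [])
    ]


# Stand-in for a real response: one document, base64-encoded for transport.
sample_response = {
    "documents": [base64.b64encode("first chunk".encode("utf-8")).decode("ascii")]
}

request = build_query([[0.1, 0.2, 0.3, 0.4, 0.5]], "alice", "research")
texts = decode_documents(sample_response)
```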

## WebSocket Usage Examples

### Importing Document Embeddings

```javascript
// Connect to import endpoint
const ws = new WebSocket('ws://api-gateway:8080/api/v1/flow/my-flow/import/document-embeddings');

// Send document embeddings once the connection is open;
// calling send() while the socket is still connecting throws an error
ws.onopen = () => {
  ws.send(JSON.stringify({
    metadata: {
      id: "doc-123",
      user: "alice",
      collection: "research"
    },
    chunks: [
      {
        chunk: "Document content chunk 1",
        vectors: [[0.1, 0.2, 0.3]]
      }
    ]
  }));
};
```

### Exporting Document Embeddings

```javascript
// Connect to export endpoint
const ws = new WebSocket('ws://api-gateway:8080/api/v1/flow/my-flow/export/document-embeddings');

// Listen for exported data
ws.onmessage = (event) => {
  const documentEmbeddings = JSON.parse(event.data);
  console.log('Received document:', documentEmbeddings.metadata.id);
  console.log('Chunks:', documentEmbeddings.chunks.length);
};
```

## Data Format Details

### Metadata Format
Each metadata triple contains:
- `s`: Subject (object with `v` for value and `e` for is_entity boolean)
- `p`: Predicate (object with `v` for value and `e` for is_entity boolean)
- `o`: Object (object with `v` for value and `e` for is_entity boolean)

### Vector Format
- Vectors are arrays of floating-point numbers
- Each chunk can have multiple vectors (different embedding models)
- Vectors should be consistently dimensioned within a collection
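
The consistency rule above can be checked on the client before import. The sketch below takes one possible reading of it: since a chunk may carry one vector per embedding model, the vector at a given position should have the same length in every chunk. That interpretation, and the helper names, are assumptions rather than documented TrustGraph behaviour.

```python
def dimensions_by_position(chunks):
    """Map each vector position to the set of vector lengths seen there."""
    seen = {}
    for chunk in chunks:
        for position, vec in enumerate(chunk["vectors"]):
            seen.setdefault(position, set()).add(len(vec))
    return seen


def is_consistent(chunks):
    """True when every vector position uses a single length across all chunks."""
    return all(len(lengths) == 1 for lengths in dimensions_by_position(chunks).values())


# Two vectors per chunk; each position keeps its own length across chunks.
good = [
    {"chunk": "a", "vectors": [[0.1, 0.2], [0.1, 0.2, 0.3]]},
    {"chunk": "b", "vectors": [[0.3, 0.4], [0.4, 0.5, 0.6]]},
]
# The first vector changes length between chunks.
bad = [
    {"chunk": "a", "vectors": [[0.1, 0.2]]},
    {"chunk": "b", "vectors": [[0.1, 0.2, 0.3]]},
]
```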

### Text Encoding
- Chunk text is handled internally as UTF-8 encoded bytes
- The WebSocket API accepts and returns plain text strings
- Base64 encoding is used for binary data in query responses

## Python SDK

```python
import asyncio

from trustgraph.clients.document_embeddings_client import DocumentEmbeddingsClient


async def main():
    # Create client
    client = DocumentEmbeddingsClient()

    # Query similar documents
    request = {
        "vectors": [[0.1, 0.2, 0.3, 0.4]],
        "limit": 5,
        "user": "alice",
        "collection": "research"
    }

    # `await` is only valid inside a coroutine, hence the wrapper function
    response = await client.query(request)
    return response.documents


documents = asyncio.run(main())
```

## Integration with TrustGraph

### Storage Integration
- Document embeddings are stored in vector databases
- Metadata is cross-referenced with knowledge graph
- Supports multi-tenant isolation by user and collection

### Processing Pipeline
1. **Document Ingestion**: Text documents loaded via text-load API
2. **Chunking**: Documents split into manageable chunks
3. **Embedding Generation**: Vector embeddings created for each chunk
4. **Storage**: Embeddings stored via import API
5. **Retrieval**: Similar documents found via query API
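
To give the chunking step in the pipeline above a concrete shape, here is a minimal sketch of a fixed-size character chunker with overlap. It is a stand-in for illustration only: this document does not describe how TrustGraph actually splits documents, and the function name, `size`, and `overlap` parameters are all hypothetical.

```python
def chunk_text(text, size=200, overlap=20):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end of the text.
        if start + size >= len(text):
            break
        start += step
    return chunks


chunks = chunk_text("abcdefghij", size=4, overlap=1)
print(chunks)
```

Each resulting chunk would then be embedded and sent as one entry in the `chunks` array of an import message.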

### Use Cases
- **Semantic Search**: Find documents similar to query embeddings
- **RAG Systems**: Retrieve relevant document chunks for question answering
- **Document Clustering**: Group similar documents using embeddings
- **Content Recommendations**: Suggest related documents to users
- **Knowledge Discovery**: Find connections between document collections

## Error Handling

Common error scenarios:
- Invalid vector dimensions
- Missing required metadata fields
- User/collection access restrictions
- WebSocket connection failures
- Malformed JSON data

Errors are returned in the response `error` field:
```json
{
  "error": {
    "type": "ValidationError",
    "message": "Invalid vector dimensions"
  }
}
```
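
Client code can turn the `error` field into an exception so that failures are not silently treated as empty results. A minimal sketch follows, assuming the response has already been parsed into a dict with the shape shown above; `DocumentEmbeddingsError` and `raise_for_error` are hypothetical names, not part of the TrustGraph SDK.

```python
class DocumentEmbeddingsError(Exception):
    """Raised when a response carries a populated `error` field."""

    def __init__(self, error_type, message):
        super().__init__(f"{error_type}: {message}")
        self.error_type = error_type
        self.message = message


def raise_for_error(response):
    """Return the response unchanged, or raise if its `error` field is set."""
    error = response.get("error")
    if error:
        raise DocumentEmbeddingsError(
            error.get("type", "unknown"), error.get("message", "")
        )
    return response


try:
    raise_for_error(
        {"error": {"type": "ValidationError", "message": "Invalid vector dimensions"}}
    )
except DocumentEmbeddingsError as exc:
    caught = exc
```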

## Performance Considerations

- **Batch Processing**: Import multiple documents in a single WebSocket session
- **Vector Dimensions**: Consistent embedding dimensions improve performance
- **Collection Sizing**: Limit collections to reasonable sizes for query performance
- **Real-time vs Batch**: Choose the appropriate method based on use case requirements