# TrustGraph Librarian API
This API provides document library management for TrustGraph. It handles document storage,
metadata management, and processing orchestration using hybrid storage (S3-compatible object
storage for content, Cassandra for metadata) with multi-user support.
## Request/response
### Request
The request contains the following fields:
- `operation`: The operation to perform (see operations below)
- `document_id`: Document identifier (for document operations)
- `document_metadata`: Document metadata object (for add/update operations)
  - `id`: Document identifier (required)
  - `time`: Unix timestamp in seconds as a float (required for add operations)
  - `kind`: MIME type of document (required, e.g., "text/plain", "application/pdf")
  - `title`: Document title (optional)
  - `comments`: Document comments (optional)
  - `user`: Document owner (required)
  - `tags`: Array of tags (optional)
  - `metadata`: Array of RDF triples (optional) - each triple has:
    - `s`: Subject with `v` (value) and `e` (is_uri boolean)
    - `p`: Predicate with `v` (value) and `e` (is_uri boolean)
    - `o`: Object with `v` (value) and `e` (is_uri boolean)
- `content`: Document content as base64-encoded bytes (for add operations)
- `processing_id`: Processing job identifier (for processing operations)
- `processing_metadata`: Processing metadata object (for add-processing)
- `user`: User identifier (required for most operations)
- `collection`: Collection filter (optional for list operations)
- `criteria`: Query criteria array (for filtering operations)
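As a minimal sketch of how a client might assemble an `add-document` request from the fields above: the helper `build_add_document_request` and its parameters are hypothetical and not part of TrustGraph; only the field names are taken from this document.

```python
import base64
import json
import time


def build_add_document_request(doc_id, user, kind, content, title=None, tags=None):
    """Assemble an add-document request dict (hypothetical helper)."""
    metadata = {
        "id": doc_id,
        "time": time.time(),  # Unix timestamp in seconds, as a float
        "kind": kind,         # MIME type, e.g. "application/pdf"
        "user": user,
    }
    # Optional fields are only included when supplied.
    if title is not None:
        metadata["title"] = title
    if tags:
        metadata["tags"] = list(tags)
    return {
        "operation": "add-document",
        "document_metadata": metadata,
        # Content travels as base64 text, not raw bytes.
        "content": base64.b64encode(content).decode("ascii"),
    }


request = build_add_document_request(
    "doc-123", "alice", "text/plain", b"Hello, library!", title="Greeting"
)
print(json.dumps(request, indent=2))
```

Encoding the content to an ASCII string keeps the request JSON-serializable.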
### Response
The response contains the following fields:
- `error`: Error information if operation fails
- `document_metadata`: Single document metadata (for get operations)
- `content`: Document content as base64-encoded bytes (for get-content)
- `document_metadatas`: Array of document metadata (for list operations)
- `processing_metadatas`: Array of processing metadata (for list-processing)
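The response fields above could be handled on the client side roughly as follows. This is a sketch under assumptions: `LibrarianError` and `unpack_response` are hypothetical names, and because the layout of the `error` object is not described here, it is passed through untouched.

```python
import base64


class LibrarianError(Exception):
    """Raised when a response carries an `error` field (hypothetical class)."""


def unpack_response(response):
    """Return the payload fields of a response dict, raising on error."""
    if response.get("error"):
        # The structure of `error` is not specified, so hand it over as-is.
        raise LibrarianError(response["error"])
    result = {}
    if "document_metadata" in response:
        result["document"] = response["document_metadata"]
    if "document_metadatas" in response:
        result["documents"] = response["document_metadatas"]
    if "processing_metadatas" in response:
        result["processing"] = response["processing_metadatas"]
    if "content" in response:
        # Content arrives base64-encoded; decode it back to bytes.
        result["content"] = base64.b64decode(response["content"])
    return result


print(unpack_response({"content": "aGVsbG8="}))  # {'content': b'hello'}
```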
## Document Operations
### ADD-DOCUMENT - Add Document to Library
Request:
```json
{
    "operation": "add-document",
    "document_metadata": {
        "id": "doc-123",
        "time": 1640995200.0,
        "kind": "application/pdf",
        "title": "Research Paper",
        "comments": "Important research findings",
        "user": "alice",
        "tags": ["research", "ai", "machine-learning"],
        "metadata": [
            {
                "s": {
                    "v": "http://example.com/doc-123",
                    "e": true
                },
                "p": {
                    "v": "http://purl.org/dc/elements/1.1/creator",
                    "e": true
                },
                "o": {
                    "v": "Dr. Smith",
                    "e": false
                }
            }
        ]
    },
    "content": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCg=="
}
```
Response:
```json
{}
```
### GET-DOCUMENT-METADATA - Get Document Metadata
Request:
```json
{
    "operation": "get-document-metadata",
    "document_id": "doc-123",
    "user": "alice"
}
```
Response:
```json
{
    "document_metadata": {
        "id": "doc-123",
        "time": 1640995200.0,
        "kind": "application/pdf",
        "title": "Research Paper",
        "comments": "Important research findings",
        "user": "alice",
        "tags": ["research", "ai", "machine-learning"],
        "metadata": [
            {
                "s": {
                    "v": "http://example.com/doc-123",
                    "e": true
                },
                "p": {
                    "v": "http://purl.org/dc/elements/1.1/creator",
                    "e": true
                },
                "o": {
                    "v": "Dr. Smith",
                    "e": false
                }
            }
        ]
    }
}
```
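The `metadata` triples shown above are verbose to write by hand. A small sketch of helpers for building and flattening them follows; `term`, `triple`, and `triples_to_tuples` are hypothetical names, and the assumption that subjects and predicates are always URIs is an illustration drawn from the example, not a stated rule.

```python
def term(value, is_uri):
    """One triple component: `v` holds the value, `e` flags it as a URI."""
    return {"v": value, "e": is_uri}


def triple(subject, predicate, obj, object_is_uri=False):
    """Build a triple; subject and predicate are assumed to be URIs."""
    return {
        "s": term(subject, True),
        "p": term(predicate, True),
        "o": term(obj, object_is_uri),
    }


def triples_to_tuples(metadata):
    """Flatten a `metadata` array into (subject, predicate, object) tuples."""
    return [(t["s"]["v"], t["p"]["v"], t["o"]["v"]) for t in metadata]


creator = triple(
    "http://example.com/doc-123",
    "http://purl.org/dc/elements/1.1/creator",
    "Dr. Smith",  # a literal, so `e` is false for the object
)
print(triples_to_tuples([creator]))
```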
### GET-DOCUMENT-CONTENT - Get Document Content
Request:
```json
{
    "operation": "get-document-content",
    "document_id": "doc-123",
    "user": "alice"
}
```
Response:
```json
{
    "content": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCg=="
}
```
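To illustrate what the `content` field carries, the sample payload above can be decoded with the standard library. The sanity check against the PDF magic bytes is an illustrative addition, not something the API is stated to perform.

```python
import base64

response = {
    "content": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCg=="
}

# validate=True rejects characters outside the base64 alphabet.
data = base64.b64decode(response["content"], validate=True)

# The sample is the opening bytes of a PDF file, not a complete document.
print(data[:8])  # b'%PDF-1.4'
is_pdf = data.startswith(b"%PDF-")
```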
### LIST-DOCUMENTS - List User's Documents
Request:
```json
{
    "operation": "list-documents",
    "user": "alice",
    "collection": "research"
}
```
Response:
```json
{
    "document_metadatas": [
        {
            "id": "doc-123",
            "time": 1640995200.0,
            "kind": "application/pdf",
            "title": "Research Paper",
            "comments": "Important research findings",
            "user": "alice",
            "tags": ["research", "ai"]
        },
        {
            "id": "doc-124",
            "time": 1640995300.0,
            "kind": "text/plain",
            "title": "Meeting Notes",
            "comments": "Team meeting discussion",
            "user": "alice",
            "tags": ["meeting", "notes"]
        }
    ]
}
```
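A client will often want to post-process a `list-documents` result. The sketch below filters by tag on the client side and converts the `time` field to a UTC timestamp; `summarize` is a hypothetical helper, and client-side filtering is an illustration rather than a description of how the service filters.

```python
from datetime import datetime, timezone


def summarize(documents, tag=None):
    """Return (id, title, ISO timestamp) rows, optionally filtered by tag."""
    rows = []
    for doc in documents:
        if tag is not None and tag not in doc.get("tags", []):
            continue
        # `time` is a Unix timestamp in seconds; interpret it as UTC.
        created = datetime.fromtimestamp(doc["time"], tz=timezone.utc)
        rows.append((doc["id"], doc.get("title", ""), created.isoformat()))
    return rows


documents = [
    {"id": "doc-123", "time": 1640995200.0, "title": "Research Paper",
     "tags": ["research", "ai"]},
    {"id": "doc-124", "time": 1640995300.0, "title": "Meeting Notes",
     "tags": ["meeting", "notes"]},
]
print(summarize(documents, tag="ai"))
# [('doc-123', 'Research Paper', '2022-01-01T00:00:00+00:00')]
```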
### UPDATE-DOCUMENT - Update Document Metadata
Request:
```json
{
    "operation": "update-document",
    "document_metadata": {
        "id": "doc-123",
        "time": 1640995500.0,
        "title": "Updated Research Paper",
        "comments": "Updated findings and conclusions",
        "user": "alice",
        "tags": ["research", "ai", "machine-learning", "updated"],
        "metadata": []
    }
}
```
Response:
```json
{}
```
### REMOVE-DOCUMENT - Remove Document
Request:
```json
{
    "operation": "remove-document",
    "document_id": "doc-123",
    "user": "alice"
}
```
Response:
```json
{}
```
## Processing Operations
### ADD-PROCESSING - Start Document Processing
Request:
```json
{
    "operation": "add-processing",
    "processing_metadata": {
        "id": "proc-456",
        "document_id": "doc-123",
        "time": 1640995400.0,
        "flow": "pdf-extraction",
        "user": "alice",
        "collection": "research",
        "tags": ["extraction", "nlp"]
    }
}
```
Response:
```json
{}
```
### LIST-PROCESSING - List Processing Jobs
Request:
```json
{
    "operation": "list-processing",
    "user": "alice",
    "collection": "research"
}
```
Response:
```json
{
    "processing_metadatas": [
        {
            "id": "proc-456",
            "document_id": "doc-123",
            "time": 1640995400.0,
            "flow": "pdf-extraction",
            "user": "alice",
            "collection": "research",
            "tags": ["extraction", "nlp"]
        }
    ]
}
```
### REMOVE-PROCESSING - Stop Processing Job
Request:
```json
{
    "operation": "remove-processing",
    "processing_id": "proc-456",
    "user": "alice"
}
```
Response:
```json
{}
```
## REST service
The REST service is available at `/api/v1/librarian` and accepts the above request formats.
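A minimal sketch of calling the REST endpoint with only the Python standard library follows. Several things here are assumptions rather than facts from this document: the base URL `http://localhost:8088`, the use of `POST` with a JSON body, and the helper names `make_librarian_request` and `call_librarian`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8088"  # assumption: adjust to your gateway address


def make_librarian_request(payload, base_url=BASE_URL):
    """Build (but do not send) an HTTP request carrying a librarian payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/v1/librarian",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: requests are posted as JSON
    )


def call_librarian(payload, base_url=BASE_URL, timeout=30):
    """Send the request and return the decoded JSON response."""
    req = make_librarian_request(payload, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires a running service):
# call_librarian({"operation": "list-documents", "user": "alice"})
```

Separating request construction from sending makes the payload easy to inspect without a live service.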
## Websocket
Requests have a `request` object containing the operation fields.
Responses have a `response` object containing the response fields.
Request:
```json
{
    "id": "unique-request-id",
    "service": "librarian",
    "request": {
        "operation": "list-documents",
        "user": "alice"
    }
}
```
Response:
```json
{
    "id": "unique-request-id",
    "response": {
        "document_metadatas": [...]
    },
    "complete": true
}
```
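The envelope above can be built and matched with a few lines of Python. This sketch covers only the message framing and leaves the socket itself out; `wrap_request` and `unwrap_response` are hypothetical helpers, and using a UUID as the request `id` is an assumption (any unique string should serve).

```python
import json
import uuid


def wrap_request(request, service="librarian"):
    """Wrap operation fields in a websocket envelope with a fresh id."""
    return {"id": str(uuid.uuid4()), "service": service, "request": request}


def unwrap_response(message, expected_id):
    """Return (response, complete) if the message answers expected_id.

    Messages carrying a different id belong to other in-flight requests,
    so they yield None instead of raising.
    """
    envelope = json.loads(message)
    if envelope.get("id") != expected_id:
        return None
    return envelope.get("response"), bool(envelope.get("complete"))


outgoing = wrap_request({"operation": "list-documents", "user": "alice"})
text_frame = json.dumps(outgoing)  # what would be sent over the socket
```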
## Pulsar
The Pulsar schema for the Librarian API is defined in Python code here:
https://github.com/trustgraph-ai/trustgraph/blob/master/trustgraph-base/trustgraph/schema/library.py
- Default request queue: `non-persistent://tg/request/librarian`
- Default response queue: `non-persistent://tg/response/librarian`
- Request schema: `trustgraph.schema.LibrarianRequest`
- Response schema: `trustgraph.schema.LibrarianResponse`
## Python SDK
The Python SDK provides convenient access to the Librarian API:
```python
import asyncio

from trustgraph.api.library import LibrarianClient


async def main():
    client = LibrarianClient()

    # Add a document: read the file as raw bytes
    with open("document.pdf", "rb") as f:
        content = f.read()

    await client.add_document(
        doc_id="doc-123",
        title="Research Paper",
        content=content,
        user="alice",
        tags=["research", "ai"]
    )

    # Get document metadata
    metadata = await client.get_document_metadata("doc-123", "alice")

    # List documents
    documents = await client.list_documents("alice", collection="research")

    # Start processing
    await client.add_processing(
        processing_id="proc-456",
        document_id="doc-123",
        flow="pdf-extraction",
        user="alice"
    )


# The client methods are coroutines, so run them inside an event loop
asyncio.run(main())
```
## Features
- **Hybrid Storage**: S3-compatible object storage for content, Cassandra for metadata
- **Multi-user Support**: User-based document ownership and access control
- **Rich Metadata**: RDF-style metadata triples and tagging system
- **Processing Integration**: Automatic triggering of document processing workflows
- **Content Types**: Support for multiple document formats (PDF, text, etc.)
- **Collection Management**: Optional document grouping by collection
- **Metadata Search**: Query documents by metadata criteria
- **Flexible Storage Backend**: Works with any S3-compatible storage (MinIO, Ceph RADOS Gateway, AWS S3, Cloudflare R2, etc.)
## Use Cases
- **Document Management**: Store and organize documents with rich metadata
- **Knowledge Extraction**: Process documents to extract structured knowledge
- **Research Libraries**: Manage collections of research papers and documents
- **Content Processing**: Orchestrate document processing workflows
- **Multi-tenant Systems**: Support multiple users with isolated document libraries