trustgraph/docs/apis/api-librarian.md

# TrustGraph Librarian API

This API provides document library management for TrustGraph. It handles document storage,
metadata management, and processing orchestration using hybrid storage (S3-compatible object
storage for content, Cassandra for metadata) with multi-user support.

## Request/response

### Request

The request contains the following fields:
- `operation`: The operation to perform (see operations below)
- `document_id`: Document identifier (for document operations)
- `document_metadata`: Document metadata object (for add/update operations)
  - `id`: Document identifier (required)
  - `time`: Unix timestamp in seconds as a float (required for add operations)
  - `kind`: MIME type of document (required, e.g., "text/plain", "application/pdf")
  - `title`: Document title (optional)
  - `comments`: Document comments (optional)
  - `user`: Document owner (required)
  - `tags`: Array of tags (optional)
  - `metadata`: Array of RDF triples (optional) - each triple has:
    - `s`: Subject with `v` (value) and `e` (is_uri boolean)
    - `p`: Predicate with `v` (value) and `e` (is_uri boolean)
    - `o`: Object with `v` (value) and `e` (is_uri boolean)
- `content`: Document content as base64-encoded bytes (for add operations)
- `processing_id`: Processing job identifier (for processing operations)
- `processing_metadata`: Processing metadata object (for add-processing)
- `user`: User identifier (required for most operations)
- `collection`: Collection filter (optional for list operations)
- `criteria`: Query criteria array (for filtering operations)

### Response

The response contains the following fields:
- `error`: Error information if operation fails
- `document_metadata`: Single document metadata (for get operations)
- `content`: Document content as base64-encoded bytes (for get-content)
- `document_metadatas`: Array of document metadata (for list operations)
- `processing_metadatas`: Array of processing metadata (for list-processing)

## Document Operations

### ADD-DOCUMENT - Add Document to Library

Request:
```json
{
    "operation": "add-document",
    "document_metadata": {
        "id": "doc-123",
        "time": 1640995200.0,
        "kind": "application/pdf",
        "title": "Research Paper",
        "comments": "Important research findings",
        "user": "alice",
        "tags": ["research", "ai", "machine-learning"],
        "metadata": [
            {
                "s": {
                    "v": "http://example.com/doc-123",
                    "e": true
                },
                "p": {
                    "v": "http://purl.org/dc/elements/1.1/creator",
                    "e": true
                },
                "o": {
                    "v": "Dr. Smith",
                    "e": false
                }
            }
        ]
    },
    "content": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCg=="
}
```

Response:
```json
{}
```

### GET-DOCUMENT-METADATA - Get Document Metadata

Request:
```json
{
    "operation": "get-document-metadata",
    "document_id": "doc-123",
    "user": "alice"
}
```

Response:
```json
{
    "document_metadata": {
        "id": "doc-123",
        "time": 1640995200.0,
        "kind": "application/pdf",
        "title": "Research Paper",
        "comments": "Important research findings",
        "user": "alice",
        "tags": ["research", "ai", "machine-learning"],
        "metadata": [
            {
                "s": {
                    "v": "http://example.com/doc-123",
                    "e": true
                },
                "p": {
                    "v": "http://purl.org/dc/elements/1.1/creator",
                    "e": true
                },
                "o": {
                    "v": "Dr. Smith",
                    "e": false
                }
            }
        ]
    }
}
```

### GET-DOCUMENT-CONTENT - Get Document Content

Request:
```json
{
    "operation": "get-document-content",
    "document_id": "doc-123",
    "user": "alice"
}
```

Response:
```json
{
    "content": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCg=="
}
```

### LIST-DOCUMENTS - List User's Documents

Request:
```json
{
    "operation": "list-documents",
    "user": "alice",
    "collection": "research"
}
```

Response:
```json
{
    "document_metadatas": [
        {
            "id": "doc-123",
            "time": 1640995200.0,
            "kind": "application/pdf",
            "title": "Research Paper",
            "comments": "Important research findings",
            "user": "alice",
            "tags": ["research", "ai"]
        },
        {
            "id": "doc-124",
            "time": 1640995300.0,
            "kind": "text/plain",
            "title": "Meeting Notes",
            "comments": "Team meeting discussion",
            "user": "alice",
            "tags": ["meeting", "notes"]
        }
    ]
}
```

### UPDATE-DOCUMENT - Update Document Metadata

Request:
```json
{
    "operation": "update-document",
    "document_metadata": {
        "id": "doc-123",
        "time": 1640995500.0,
        "title": "Updated Research Paper",
        "comments": "Updated findings and conclusions",
        "user": "alice",
        "tags": ["research", "ai", "machine-learning", "updated"],
        "metadata": []
    }
}
```

Response:
```json
{}
```

### REMOVE-DOCUMENT - Remove Document

Request:
```json
{
    "operation": "remove-document",
    "document_id": "doc-123",
    "user": "alice"
}
```

Response:
```json
{}
```

## Processing Operations

### ADD-PROCESSING - Start Document Processing

Request:
```json
{
    "operation": "add-processing",
    "processing_metadata": {
        "id": "proc-456",
        "document_id": "doc-123",
        "time": 1640995400.0,
        "flow": "pdf-extraction",
        "user": "alice",
        "collection": "research",
        "tags": ["extraction", "nlp"]
    }
}
```

Response:
```json
{}
```

### LIST-PROCESSING - List Processing Jobs

Request:
```json
{
    "operation": "list-processing",
    "user": "alice",
    "collection": "research"
}
```

Response:
```json
{
    "processing_metadatas": [
        {
            "id": "proc-456",
            "document_id": "doc-123",
            "time": 1640995400.0,
            "flow": "pdf-extraction",
            "user": "alice",
            "collection": "research",
            "tags": ["extraction", "nlp"]
        }
    ]
}
```

### REMOVE-PROCESSING - Stop Processing Job

Request:
```json
{
    "operation": "remove-processing",
    "processing_id": "proc-456",
    "user": "alice"
}
```

Response:
```json
{}
```

## REST service

The REST service is available at `/api/v1/librarian` and accepts the above request formats.

## Websocket

Requests have a `request` object containing the operation fields.
Responses have a `response` object containing the response fields.

Request:
```json
{
    "id": "unique-request-id",
    "service": "librarian",
    "request": {
        "operation": "list-documents",
        "user": "alice"
    }
}
```

Response:
```json
{
    "id": "unique-request-id",
    "response": {
        "document_metadatas": [...]
    },
    "complete": true
}
```

## Pulsar

The Pulsar schema for the Librarian API is defined in Python code here:

https://github.com/trustgraph-ai/trustgraph/blob/master/trustgraph-base/trustgraph/schema/library.py

Default request queue:
`non-persistent://tg/request/librarian`

Default response queue:
`non-persistent://tg/response/librarian`

Request schema:
`trustgraph.schema.LibrarianRequest`

Response schema:
`trustgraph.schema.LibrarianResponse`

## Python SDK

The Python SDK provides convenient access to the Librarian API:

```python
from trustgraph.api.library import LibrarianClient

client = LibrarianClient()

# Add a document
with open("document.pdf", "rb") as f:
    content = f.read()
    
await client.add_document(
    doc_id="doc-123",
    title="Research Paper",
    content=content,
    user="alice",
    tags=["research", "ai"]
)

# Get document metadata
metadata = await client.get_document_metadata("doc-123", "alice")

# List documents
documents = await client.list_documents("alice", collection="research")

# Start processing
await client.add_processing(
    processing_id="proc-456",
    document_id="doc-123",
    flow="pdf-extraction",
    user="alice"
)
```

## Features

- **Hybrid Storage**: S3-compatible object storage (MinIO, Ceph RGW, AWS S3, etc.) for content, Cassandra for metadata
- **Multi-user Support**: User-based document ownership and access control
- **Rich Metadata**: RDF-style metadata triples and tagging system
- **Processing Integration**: Automatic triggering of document processing workflows
- **Content Types**: Support for multiple document formats (PDF, text, etc.)
- **Collection Management**: Optional document grouping by collection
- **Metadata Search**: Query documents by metadata criteria
- **Flexible Storage Backend**: Works with any S3-compatible storage (MinIO, Ceph RADOS Gateway, AWS S3, Cloudflare R2, etc.)

## Use Cases

- **Document Management**: Store and organize documents with rich metadata
- **Knowledge Extraction**: Process documents to extract structured knowledge
- **Research Libraries**: Manage collections of research papers and documents
- **Content Processing**: Orchestrate document processing workflows
- **Multi-tenant Systems**: Support multiple users with isolated document libraries
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`# TrustGraph Librarian API`

Change MinIO integration options in librarian to be more generic - to support a Garage integration (#594) * Tweak object store parameters to be more generic for other S3-type store integration * Update librarian to have region & SSL params * Update MinIO migration tech spec 2025-12-27 18:01:51 +00:00			`This API provides document library management for TrustGraph. It handles document storage,`
			`metadata management, and processing orchestration using hybrid storage (S3-compatible object`
			`storage for content, Cassandra for metadata) with multi-user support.`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00
			`## Request/response`

			`### Request`

			`The request contains the following fields:`
			- `operation`: The operation to perform (see operations below)
			- `document_id`: Document identifier (for document operations)
			- `document_metadata`: Document metadata object (for add/update operations)
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			- `id`: Document identifier (required)
			- `time`: Unix timestamp in seconds as a float (required for add operations)
			- `kind`: MIME type of document (required, e.g., "text/plain", "application/pdf")
			- `title`: Document title (optional)
			- `comments`: Document comments (optional)
			- `user`: Document owner (required)
			- `tags`: Array of tags (optional)
			- `metadata`: Array of RDF triples (optional) - each triple has:
			- `s`: Subject with `v` (value) and `e` (is_uri boolean)
			- `p`: Predicate with `v` (value) and `e` (is_uri boolean)
			- `o`: Object with `v` (value) and `e` (is_uri boolean)
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			- `content`: Document content as base64-encoded bytes (for add operations)
			- `processing_id`: Processing job identifier (for processing operations)
			- `processing_metadata`: Processing metadata object (for add-processing)
			- `user`: User identifier (required for most operations)
			- `collection`: Collection filter (optional for list operations)
			- `criteria`: Query criteria array (for filtering operations)

			`### Response`

			`The response contains the following fields:`
			- `error`: Error information if operation fails
			- `document_metadata`: Single document metadata (for get operations)
			- `content`: Document content as base64-encoded bytes (for get-content)
			- `document_metadatas`: Array of document metadata (for list operations)
			- `processing_metadatas`: Array of processing metadata (for list-processing)

			`## Document Operations`

			`### ADD-DOCUMENT - Add Document to Library`

			`Request:`
			```json
			`{`
			`"operation": "add-document",`
			`"document_metadata": {`
			`"id": "doc-123",`
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			`"time": 1640995200.0,`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`"kind": "application/pdf",`
			`"title": "Research Paper",`
			`"comments": "Important research findings",`
			`"user": "alice",`
			`"tags": ["research", "ai", "machine-learning"],`
			`"metadata": [`
			`{`
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			`"s": {`
			`"v": "http://example.com/doc-123",`
			`"e": true`
			`},`
			`"p": {`
			`"v": "http://purl.org/dc/elements/1.1/creator",`
			`"e": true`
			`},`
			`"o": {`
			`"v": "Dr. Smith",`
			`"e": false`
			`}`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`}`
			`]`
			`},`
			`"content": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCg=="`
			`}`
			```

			`Response:`
			```json
			`{}`
			```

			`### GET-DOCUMENT-METADATA - Get Document Metadata`

			`Request:`
			```json
			`{`
			`"operation": "get-document-metadata",`
			`"document_id": "doc-123",`
			`"user": "alice"`
			`}`
			```

			`Response:`
			```json
			`{`
			`"document_metadata": {`
			`"id": "doc-123",`
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			`"time": 1640995200.0,`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`"kind": "application/pdf",`
			`"title": "Research Paper",`
			`"comments": "Important research findings",`
			`"user": "alice",`
			`"tags": ["research", "ai", "machine-learning"],`
			`"metadata": [`
			`{`
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			`"s": {`
			`"v": "http://example.com/doc-123",`
			`"e": true`
			`},`
			`"p": {`
			`"v": "http://purl.org/dc/elements/1.1/creator",`
			`"e": true`
			`},`
			`"o": {`
			`"v": "Dr. Smith",`
			`"e": false`
			`}`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`}`
			`]`
			`}`
			`}`
			```

			`### GET-DOCUMENT-CONTENT - Get Document Content`

			`Request:`
			```json
			`{`
			`"operation": "get-document-content",`
			`"document_id": "doc-123",`
			`"user": "alice"`
			`}`
			```

			`Response:`
			```json
			`{`
			`"content": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCg=="`
			`}`
			```

			`### LIST-DOCUMENTS - List User's Documents`

			`Request:`
			```json
			`{`
			`"operation": "list-documents",`
			`"user": "alice",`
			`"collection": "research"`
			`}`
			```

			`Response:`
			```json
			`{`
			`"document_metadatas": [`
			`{`
			`"id": "doc-123",`
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			`"time": 1640995200.0,`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`"kind": "application/pdf",`
			`"title": "Research Paper",`
			`"comments": "Important research findings",`
			`"user": "alice",`
			`"tags": ["research", "ai"]`
			`},`
			`{`
			`"id": "doc-124",`
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			`"time": 1640995300.0,`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`"kind": "text/plain",`
			`"title": "Meeting Notes",`
			`"comments": "Team meeting discussion",`
			`"user": "alice",`
			`"tags": ["meeting", "notes"]`
			`}`
			`]`
			`}`
			```

			`### UPDATE-DOCUMENT - Update Document Metadata`

			`Request:`
			```json
			`{`
			`"operation": "update-document",`
			`"document_metadata": {`
			`"id": "doc-123",`
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			`"time": 1640995500.0,`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`"title": "Updated Research Paper",`
			`"comments": "Updated findings and conclusions",`
			`"user": "alice",`
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			`"tags": ["research", "ai", "machine-learning", "updated"],`
			`"metadata": []`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`}`
			`}`
			```

			`Response:`
			```json
			`{}`
			```

			`### REMOVE-DOCUMENT - Remove Document`

			`Request:`
			```json
			`{`
			`"operation": "remove-document",`
			`"document_id": "doc-123",`
			`"user": "alice"`
			`}`
			```

			`Response:`
			```json
			`{}`
			```

			`## Processing Operations`

			`### ADD-PROCESSING - Start Document Processing`

			`Request:`
			```json
			`{`
			`"operation": "add-processing",`
			`"processing_metadata": {`
			`"id": "proc-456",`
			`"document_id": "doc-123",`
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			`"time": 1640995400.0,`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`"flow": "pdf-extraction",`
			`"user": "alice",`
			`"collection": "research",`
			`"tags": ["extraction", "nlp"]`
			`}`
			`}`
			```

			`Response:`
			```json
			`{}`
			```

			`### LIST-PROCESSING - List Processing Jobs`

			`Request:`
			```json
			`{`
			`"operation": "list-processing",`
			`"user": "alice",`
			`"collection": "research"`
			`}`
			```

			`Response:`
			```json
			`{`
			`"processing_metadatas": [`
			`{`
			`"id": "proc-456",`
			`"document_id": "doc-123",`
Structure data mvp (#452) * Structured data tech spec * Architecture principles * New schemas * Updated schemas and specs * Object extractor * Add .coveragerc * New tests * Cassandra object storage * Trying to object extraction working, issues exist 2025-08-07 20:47:20 +01:00			`"time": 1640995400.0,`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`"flow": "pdf-extraction",`
			`"user": "alice",`
			`"collection": "research",`
			`"tags": ["extraction", "nlp"]`
			`}`
			`]`
			`}`
			```

			`### REMOVE-PROCESSING - Stop Processing Job`

			`Request:`
			```json
			`{`
			`"operation": "remove-processing",`
			`"processing_id": "proc-456",`
			`"user": "alice"`
			`}`
			```

			`Response:`
			```json
			`{}`
			```

			`## REST service`

			The REST service is available at `/api/v1/librarian` and accepts the above request formats.

			`## Websocket`

			Requests have a `request` object containing the operation fields.
			Responses have a `response` object containing the response fields.

			`Request:`
			```json
			`{`
			`"id": "unique-request-id",`
			`"service": "librarian",`
			`"request": {`
			`"operation": "list-documents",`
			`"user": "alice"`
			`}`
			`}`
			```

			`Response:`
			```json
			`{`
			`"id": "unique-request-id",`
			`"response": {`
			`"document_metadatas": [...]`
			`},`
			`"complete": true`
			`}`
			```

			`## Pulsar`

			`The Pulsar schema for the Librarian API is defined in Python code here:`

			`https://github.com/trustgraph-ai/trustgraph/blob/master/trustgraph-base/trustgraph/schema/library.py`

			`Default request queue:`
			`non-persistent://tg/request/librarian`

			`Default response queue:`
			`non-persistent://tg/response/librarian`

			`Request schema:`
			`trustgraph.schema.LibrarianRequest`

			`Response schema:`
			`trustgraph.schema.LibrarianResponse`

			`## Python SDK`

			`The Python SDK provides convenient access to the Librarian API:`

			```python
			`from trustgraph.api.library import LibrarianClient`

			`client = LibrarianClient()`

			`# Add a document`
			`with open("document.pdf", "rb") as f:`
			`content = f.read()`

			`await client.add_document(`
			`doc_id="doc-123",`
			`title="Research Paper",`
			`content=content,`
			`user="alice",`
			`tags=["research", "ai"]`
			`)`

			`# Get document metadata`
			`metadata = await client.get_document_metadata("doc-123", "alice")`

			`# List documents`
			`documents = await client.list_documents("alice", collection="research")`

			`# Start processing`
			`await client.add_processing(`
			`processing_id="proc-456",`
			`document_id="doc-123",`
			`flow="pdf-extraction",`
			`user="alice"`
			`)`
			```

			`## Features`

Change MinIO integration options in librarian to be more generic - to support a Garage integration (#594) * Tweak object store parameters to be more generic for other S3-type store integration * Update librarian to have region & SSL params * Update MinIO migration tech spec 2025-12-27 18:01:51 +00:00			`- Hybrid Storage: S3-compatible object storage (MinIO, Ceph RGW, AWS S3, etc.) for content, Cassandra for metadata`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00			`- Multi-user Support: User-based document ownership and access control`
			`- Rich Metadata: RDF-style metadata triples and tagging system`
			`- Processing Integration: Automatic triggering of document processing workflows`
			`- Content Types: Support for multiple document formats (PDF, text, etc.)`
			`- Collection Management: Optional document grouping by collection`
			`- Metadata Search: Query documents by metadata criteria`
Change MinIO integration options in librarian to be more generic - to support a Garage integration (#594) * Tweak object store parameters to be more generic for other S3-type store integration * Update librarian to have region & SSL params * Update MinIO migration tech spec 2025-12-27 18:01:51 +00:00			`- Flexible Storage Backend: Works with any S3-compatible storage (MinIO, Ceph RADOS Gateway, AWS S3, Cloudflare R2, etc.)`
Update docs for API/CLI changes in 1.0 (#421) * Update some API basics for the 0.23/1.0 API change 2025-07-03 14:58:32 +01:00
			`## Use Cases`

			`- Document Management: Store and organize documents with rich metadata`
			`- Knowledge Extraction: Process documents to extract structured knowledge`
			`- Research Libraries: Manage collections of research papers and documents`
			`- Content Processing: Orchestrate document processing workflows`
			`- Multi-tenant Systems: Support multiple users with isolated document libraries`