TrustGraph Librarian API

The Librarian API manages the TrustGraph document library: it stores documents, maintains their metadata, and orchestrates document processing. Storage is hybrid, with document content held in S3-compatible object storage and metadata held in Cassandra. Documents are owned by users, so a single library can serve multiple users.

Request/response

Request

The request contains the following fields:

  • operation: The operation to perform (see operations below)
  • document_id: Document identifier (for document operations)
  • document_metadata: Document metadata object (for add/update operations)
    • id: Document identifier (required)
    • time: Unix timestamp in seconds as a float (required for add operations)
    • kind: MIME type of document (required, e.g., "text/plain", "application/pdf")
    • title: Document title (optional)
    • comments: Document comments (optional)
    • user: Document owner (required)
    • tags: Array of tags (optional)
    • metadata: Array of RDF triples (optional) - each triple has:
      • s: Subject with v (value) and e (is_uri boolean)
      • p: Predicate with v (value) and e (is_uri boolean)
      • o: Object with v (value) and e (is_uri boolean)
  • content: Document content as base64-encoded bytes (for add operations)
  • processing_id: Processing job identifier (for processing operations)
  • processing_metadata: Processing metadata object (for add-processing)
  • user: User identifier (required for most operations)
  • collection: Collection filter (optional for list operations)
  • criteria: Query criteria array (for filtering operations)
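The fields above can be assembled programmatically. The following is a minimal sketch, not part of the TrustGraph SDK: `build_add_document_request` is a hypothetical helper name, and the sketch only assumes the field names and the base64 content encoding described in the list above.

```python
import base64
import time


def build_add_document_request(doc_id, user, kind, raw_bytes, title=None, tags=None):
    """Assemble an add-document request body from raw document bytes."""
    metadata = {
        "id": doc_id,
        "time": time.time(),  # Unix timestamp in seconds, as a float
        "kind": kind,         # MIME type, e.g. "application/pdf"
        "user": user,
    }
    # Optional fields are only included when supplied
    if title is not None:
        metadata["title"] = title
    if tags:
        metadata["tags"] = list(tags)
    return {
        "operation": "add-document",
        "document_metadata": metadata,
        # Content travels as base64 text, not raw bytes
        "content": base64.b64encode(raw_bytes).decode("ascii"),
    }


request = build_add_document_request(
    "doc-123", "alice", "text/plain", b"hello librarian", title="Greeting"
)
```

The resulting dictionary can be serialized with `json.dumps` and sent over any of the transports described later in this document.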

Response

The response contains the following fields:

  • error: Error information, present if the operation fails
  • document_metadata: Single document metadata (for get operations)
  • content: Document content as base64-encoded bytes (for get-content)
  • document_metadatas: Array of document metadata (for list operations)
  • processing_metadatas: Array of processing metadata (for list-processing)
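A caller will typically check the error field before reading anything else. The sketch below shows one way to do that. The internal structure of the error object is not specified in this document, so the code treats it as opaque; `LibrarianError` and `check_response` are hypothetical names introduced for illustration.

```python
class LibrarianError(Exception):
    """Hypothetical exception type for failed librarian operations."""


def check_response(response):
    """Return the response unchanged, or raise if it carries an error."""
    error = response.get("error")
    if error:
        # The error structure is not specified here, so pass it through as-is
        raise LibrarianError(error)
    return response


ok = check_response({"document_metadatas": []})
```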

Document Operations

ADD-DOCUMENT - Add Document to Library

Request:

{
    "operation": "add-document",
    "document_metadata": {
        "id": "doc-123",
        "time": 1640995200.0,
        "kind": "application/pdf",
        "title": "Research Paper",
        "comments": "Important research findings",
        "user": "alice",
        "tags": ["research", "ai", "machine-learning"],
        "metadata": [
            {
                "s": {
                    "v": "http://example.com/doc-123",
                    "e": true
                },
                "p": {
                    "v": "http://purl.org/dc/elements/1.1/creator",
                    "e": true
                },
                "o": {
                    "v": "Dr. Smith",
                    "e": false
                }
            }
        ]
    },
    "content": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCg=="
}

Response:

{}
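The nested metadata triples in the request above are verbose to write by hand. A small helper can build them; this is a sketch with hypothetical function names (`term`, `triple`), and it assumes, as a simplification, that subjects and predicates are always URIs while the object may be either a URI or a literal.

```python
def term(value, is_uri):
    """One triple component: v holds the value, e flags whether it is a URI."""
    return {"v": value, "e": is_uri}


def triple(subject, predicate, obj, obj_is_uri=False):
    """Build a triple; subject and predicate are assumed to be URIs."""
    return {
        "s": term(subject, True),
        "p": term(predicate, True),
        "o": term(obj, obj_is_uri),
    }


creator = triple(
    "http://example.com/doc-123",
    "http://purl.org/dc/elements/1.1/creator",
    "Dr. Smith",
)
```

Here `creator` has the same shape as the single triple in the example request, so it can be placed directly in the `metadata` array.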

GET-DOCUMENT-METADATA - Get Document Metadata

Request:

{
    "operation": "get-document-metadata",
    "document_id": "doc-123",
    "user": "alice"
}

Response:

{
    "document_metadata": {
        "id": "doc-123",
        "time": 1640995200.0,
        "kind": "application/pdf",
        "title": "Research Paper",
        "comments": "Important research findings",
        "user": "alice",
        "tags": ["research", "ai", "machine-learning"],
        "metadata": [
            {
                "s": {
                    "v": "http://example.com/doc-123",
                    "e": true
                },
                "p": {
                    "v": "http://purl.org/dc/elements/1.1/creator",
                    "e": true
                },
                "o": {
                    "v": "Dr. Smith",
                    "e": false
                }
            }
        ]
    }
}

GET-DOCUMENT-CONTENT - Get Document Content

Request:

{
    "operation": "get-document-content",
    "document_id": "doc-123",
    "user": "alice"
}

Response:

{
    "content": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCg=="
}
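Because the content field is base64 text, the caller has to decode it to recover the original bytes. A minimal sketch using only the standard library follows; `decode_content` is a hypothetical helper name, and the sample value is the one from the response above.

```python
import base64


def decode_content(response):
    """Recover the raw document bytes from a get-document-content response."""
    return base64.b64decode(response["content"])


sample = {
    "content": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCg=="
}
data = decode_content(sample)

# PDF files begin with the marker "%PDF-", which gives a cheap sanity check
is_pdf = data.startswith(b"%PDF-")
```

The decoded bytes can then be written to disk in binary mode or handed to a PDF parser.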

LIST-DOCUMENTS - List User's Documents

Request:

{
    "operation": "list-documents",
    "user": "alice",
    "collection": "research"
}

Response:

{
    "document_metadatas": [
        {
            "id": "doc-123",
            "time": 1640995200.0,
            "kind": "application/pdf",
            "title": "Research Paper",
            "comments": "Important research findings",
            "user": "alice",
            "tags": ["research", "ai"]
        },
        {
            "id": "doc-124",
            "time": 1640995300.0,
            "kind": "text/plain",
            "title": "Meeting Notes",
            "comments": "Team meeting discussion",
            "user": "alice",
            "tags": ["meeting", "notes"]
        }
    ]
}
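Once the list has been received, entries can be filtered and their timestamps converted on the client side. The sketch below uses hypothetical helper names (`with_tag`, `added_at`) and operates on records shaped like the response above; it does not use any server-side filtering.

```python
from datetime import datetime, timezone


def with_tag(documents, tag):
    """Keep only the documents that carry the given tag."""
    return [d for d in documents if tag in d.get("tags", [])]


def added_at(document):
    """Convert the float Unix timestamp into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(document["time"], tz=timezone.utc)


documents = [
    {"id": "doc-123", "time": 1640995200.0, "tags": ["research", "ai"]},
    {"id": "doc-124", "time": 1640995300.0, "tags": ["meeting", "notes"]},
]
research = with_tag(documents, "research")
first_added = added_at(documents[0])
```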

UPDATE-DOCUMENT - Update Document Metadata

Request:

{
    "operation": "update-document",
    "document_metadata": {
        "id": "doc-123",
        "time": 1640995500.0,
        "title": "Updated Research Paper",
        "comments": "Updated findings and conclusions",
        "user": "alice",
        "tags": ["research", "ai", "machine-learning", "updated"],
        "metadata": []
    }
}

Response:

{}

REMOVE-DOCUMENT - Remove Document

Request:

{
    "operation": "remove-document",
    "document_id": "doc-123",
    "user": "alice"
}

Response:

{}

Processing Operations

ADD-PROCESSING - Start Document Processing

Request:

{
    "operation": "add-processing",
    "processing_metadata": {
        "id": "proc-456",
        "document_id": "doc-123",
        "time": 1640995400.0,
        "flow": "pdf-extraction",
        "user": "alice",
        "collection": "research",
        "tags": ["extraction", "nlp"]
    }
}

Response:

{}
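A processing request is usually derived from a document that is already in the library. The following sketch builds one from a document metadata record. The helper name and the generated identifier format are assumptions made for illustration, and treating `collection` and `tags` as optional is also an assumption, since this document does not state which processing fields are required.

```python
import time
import uuid


def build_add_processing_request(document, flow, collection=None, tags=None):
    """Build an add-processing request for an existing document record."""
    processing = {
        # Random identifier; the "proc-" prefix only mirrors the example above
        "id": "proc-" + uuid.uuid4().hex[:8],
        "document_id": document["id"],
        "time": time.time(),
        "flow": flow,
        "user": document["user"],
    }
    if collection is not None:
        processing["collection"] = collection
    if tags:
        processing["tags"] = list(tags)
    return {"operation": "add-processing", "processing_metadata": processing}


request = build_add_processing_request(
    {"id": "doc-123", "user": "alice"}, flow="pdf-extraction", collection="research"
)
```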

LIST-PROCESSING - List Processing Jobs

Request:

{
    "operation": "list-processing",
    "user": "alice",
    "collection": "research"
}

Response:

{
    "processing_metadatas": [
        {
            "id": "proc-456",
            "document_id": "doc-123",
            "time": 1640995400.0,
            "flow": "pdf-extraction",
            "user": "alice",
            "collection": "research",
            "tags": ["extraction", "nlp"]
        }
    ]
}

REMOVE-PROCESSING - Stop Processing Job

Request:

{
    "operation": "remove-processing",
    "processing_id": "proc-456",
    "user": "alice"
}

Response:

{}

REST service

The REST service is available at /api/v1/librarian and accepts the above request formats.
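As an illustration, a request can be sent with the Python standard library alone. This is a sketch under stated assumptions: the base URL `http://localhost:8088` is a placeholder for wherever the API gateway is reachable in your deployment, and the use of an HTTP POST with a JSON body is assumed rather than stated by this document. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8088"  # assumption: adjust to your deployment


def make_librarian_request(payload, base_url=BASE_URL):
    """Prepare (but do not send) an HTTP request carrying a librarian operation."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/v1/librarian",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: operations are submitted as JSON via POST
    )


def call_librarian(payload, base_url=BASE_URL, timeout=30):
    """Send the operation and return the decoded JSON response."""
    req = make_librarian_request(payload, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = make_librarian_request({"operation": "list-documents", "user": "alice"})
```

Calling `call_librarian({"operation": "list-documents", "user": "alice"})` against a running service would then return the response fields described earlier.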

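WebSocket

Each WebSocket request carries a unique id, the service name, and a request object containing the operation fields. Each response echoes the request id and carries a response object containing the response fields, together with a complete flag.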

Request:

{
    "id": "unique-request-id",
    "service": "librarian",
    "request": {
        "operation": "list-documents",
        "user": "alice"
    }
}

Response:

{
    "id": "unique-request-id",
    "response": {
        "document_metadatas": [...]
    },
    "complete": true
}
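The envelope handling can be kept separate from the socket code. The sketch below covers only message construction and matching, using hypothetical helper names. It assumes that a response is paired with its request by the `id` field and that `complete` marks the final message for that id, which is what the example above suggests.

```python
import json
import uuid


def wrap_request(operation_fields, service="librarian"):
    """Wrap operation fields in the WebSocket envelope with a fresh id."""
    return {
        "id": str(uuid.uuid4()),
        "service": service,
        "request": operation_fields,
    }


def unwrap_response(message, expected_id):
    """Return (response_fields, complete) if the message answers expected_id, else None."""
    if message.get("id") != expected_id:
        return None  # belongs to a different in-flight request
    return message.get("response", {}), bool(message.get("complete", False))


envelope = wrap_request({"operation": "list-documents", "user": "alice"})
wire_text = json.dumps(envelope)  # the text frame that would be sent

# Simulated reply, shaped like the response example above
reply = {"id": envelope["id"], "response": {"document_metadatas": []}, "complete": True}
result = unwrap_response(reply, envelope["id"])
```

Matching on the id allows several requests to share one connection without their responses being confused.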

Pulsar

The Pulsar schema for the Librarian API is defined in Python code here:

https://github.com/trustgraph-ai/trustgraph/blob/master/trustgraph-base/trustgraph/schema/library.py

Default request queue: non-persistent://tg/request/librarian

Default response queue: non-persistent://tg/response/librarian

Request schema: trustgraph.schema.LibrarianRequest

Response schema: trustgraph.schema.LibrarianResponse

Python SDK

The Python SDK provides convenient access to the Librarian API:

import asyncio

from trustgraph.api.library import LibrarianClient


async def main():
    client = LibrarianClient()

    # Add a document: read the file as raw bytes
    with open("document.pdf", "rb") as f:
        content = f.read()

    await client.add_document(
        doc_id="doc-123",
        title="Research Paper",
        content=content,
        user="alice",
        tags=["research", "ai"]
    )

    # Get document metadata
    metadata = await client.get_document_metadata("doc-123", "alice")

    # List documents
    documents = await client.list_documents("alice", collection="research")

    # Start processing
    await client.add_processing(
        processing_id="proc-456",
        document_id="doc-123",
        flow="pdf-extraction",
        user="alice"
    )


# The calls above are awaited, so they must run inside an event loop
asyncio.run(main())

Features

  • Hybrid Storage: S3-compatible object storage for content, Cassandra for metadata
  • Multi-user Support: User-based document ownership and access control
  • Rich Metadata: RDF-style metadata triples and tagging system
  • Processing Integration: Automatic triggering of document processing workflows
  • Content Types: Support for multiple document formats (PDF, text, etc.)
  • Collection Management: Optional document grouping by collection
  • Metadata Search: Query documents by metadata criteria
  • Flexible Storage Backend: Works with any S3-compatible storage (MinIO, Ceph RADOS Gateway, AWS S3, Cloudflare R2, etc.)

Use Cases

  • Document Management: Store and organize documents with rich metadata
  • Knowledge Extraction: Process documents to extract structured knowledge
  • Research Libraries: Manage collections of research papers and documents
  • Content Processing: Orchestrate document processing workflows
  • Multi-tenant Systems: Support multiple users with isolated document libraries