trustgraph/docs/tech-specs/minio-to-s3-migration.md
Alex Jenkins 8954fa3ad7 Feat: TrustGraph i18n & Documentation Translation Updates (#781)
Native CLI i18n: The TrustGraph CLI has built-in translation support
that dynamically loads language strings. You can test and use
different languages by simply passing the --lang flag (e.g., --lang
es for Spanish, --lang ru for Russian) or by configuring your
environment's LANG variable.

Automated Docs Translations: This PR introduces autonomously
translated Markdown documentation into several target languages,
including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew,
Arabic, Simplified Chinese, and Russian.
2026-04-14 12:08:32 +01:00

8.4 KiB

layout title parent
default Tech Spec: S3-Compatible Storage Backend Support Tech Specs

Tech Spec: S3-Compatible Storage Backend Support

Overview

The Librarian service uses S3-compatible object storage for document blob storage. This spec documents the implementation that enables support for any S3-compatible backend including MinIO, Ceph RADOS Gateway (RGW), AWS S3, Cloudflare R2, DigitalOcean Spaces, and others.

Architecture

Storage Components

  • Blob Storage: S3-compatible object storage via minio Python client library
  • Metadata Storage: Cassandra (stores object_id mapping and document metadata)
  • Affected Component: Librarian service only
  • Storage Pattern: Hybrid storage with metadata in Cassandra, content in S3-compatible storage

Implementation

  • Library: minio Python client (supports any S3-compatible API)
  • Location: trustgraph-flow/trustgraph/librarian/blob_store.py
  • Operations:
    • add() - Store blob with UUID object_id
    • get() - Retrieve blob by object_id
    • remove() - Delete blob by object_id
    • ensure_bucket() - Create bucket if not exists
  • Bucket: library
  • Object Path: doc/{object_id}
  • Supported MIME Types: text/plain, application/pdf

Key Files

  1. trustgraph-flow/trustgraph/librarian/blob_store.py - BlobStore implementation
  2. trustgraph-flow/trustgraph/librarian/librarian.py - BlobStore initialization
  3. trustgraph-flow/trustgraph/librarian/service.py - Service configuration
  4. trustgraph-flow/pyproject.toml - Dependencies (minio package)
  5. docs/apis/api-librarian.md - API documentation

Supported Storage Backends

The implementation works with any S3-compatible object storage system:

Tested/Supported

  • Ceph RADOS Gateway (RGW) - Distributed storage system with S3 API (default configuration)
  • MinIO - Lightweight self-hosted object storage
  • Garage - Lightweight geo-distributed S3-compatible storage

Should Work (S3-Compatible)

  • AWS S3 - Amazon's cloud object storage
  • Cloudflare R2 - Cloudflare's S3-compatible storage
  • DigitalOcean Spaces - DigitalOcean's object storage
  • Wasabi - S3-compatible cloud storage
  • Backblaze B2 - S3-compatible backup storage
  • Any other service implementing the S3 REST API

Configuration

CLI Arguments

librarian \
  --object-store-endpoint <hostname:port> \
  --object-store-access-key <access_key> \
  --object-store-secret-key <secret_key> \
  [--object-store-use-ssl] \
  [--object-store-region <region>]

Note: Do not include http:// or https:// in the endpoint. Use --object-store-use-ssl to enable HTTPS.

Environment Variables (Alternative)

OBJECT_STORE_ENDPOINT=<hostname:port>
OBJECT_STORE_ACCESS_KEY=<access_key>
OBJECT_STORE_SECRET_KEY=<secret_key>
OBJECT_STORE_USE_SSL=true|false  # Optional, default: false
OBJECT_STORE_REGION=<region>     # Optional

Examples

Ceph RADOS Gateway (default):

--object-store-endpoint ceph-rgw:7480 \
--object-store-access-key object-user \
--object-store-secret-key object-password

MinIO:

--object-store-endpoint minio:9000 \
--object-store-access-key minioadmin \
--object-store-secret-key minioadmin

Garage (S3-compatible):

--object-store-endpoint garage:3900 \
--object-store-access-key GK000000000000000000000001 \
--object-store-secret-key b171f00be9be4c32c734f4c05fe64c527a8ab5eb823b376cfa8c2531f70fc427

AWS S3 with SSL:

--object-store-endpoint s3.amazonaws.com \
--object-store-access-key AKIAIOSFODNN7EXAMPLE \
--object-store-secret-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
--object-store-use-ssl \
--object-store-region us-east-1

Authentication

All S3-compatible backends require AWS Signature Version 4 (or v2) authentication:

  • Access Key - Public identifier (like username)
  • Secret Key - Private signing key (like password)

The MinIO Python client handles all signature calculation automatically.

Creating Credentials

For MinIO:

# Use default credentials or create user via MinIO Console
minioadmin / minioadmin

For Ceph RGW:

radosgw-admin user create --uid="trustgraph" --display-name="TrustGraph Service"
# Returns access_key and secret_key

For AWS S3:

  • Create IAM user with S3 permissions
  • Generate access key in AWS Console

Library Selection: MinIO Python Client

Rationale:

  • Lightweight (~500KB vs boto3's ~50MB)
  • S3-compatible - works with any S3 API endpoint
  • Simpler API than boto3 for basic operations
  • Already in use, no migration needed
  • Battle-tested with MinIO and other S3 systems

BlobStore Implementation

Location: trustgraph-flow/trustgraph/librarian/blob_store.py

from minio import Minio
import io
import logging

logger = logging.getLogger(__name__)

class BlobStore:
    """
    S3-compatible blob storage for document content.
    Supports MinIO, Ceph RGW, AWS S3, and other S3-compatible backends.
    """

    def __init__(self, endpoint, access_key, secret_key, bucket_name,
                 use_ssl=False, region=None):
        """
        Initialize S3-compatible blob storage.

        Args:
            endpoint: S3 endpoint (e.g., "minio:9000", "ceph-rgw:7480")
            access_key: S3 access key
            secret_key: S3 secret key
            bucket_name: Bucket name for storage
            use_ssl: Use HTTPS instead of HTTP (default: False)
            region: S3 region (optional, e.g., "us-east-1")
        """
        self.client = Minio(
            endpoint=endpoint,
            access_key=access_key,
            secret_key=secret_key,
            secure=use_ssl,
            region=region,
        )

        self.bucket_name = bucket_name

        protocol = "https" if use_ssl else "http"
        logger.info(f"Connected to S3-compatible storage at {protocol}://{endpoint}")

        self.ensure_bucket()

    def ensure_bucket(self):
        """Create bucket if it doesn't exist"""
        found = self.client.bucket_exists(bucket_name=self.bucket_name)
        if not found:
            self.client.make_bucket(bucket_name=self.bucket_name)
            logger.info(f"Created bucket {self.bucket_name}")
        else:
            logger.debug(f"Bucket {self.bucket_name} already exists")

    async def add(self, object_id, blob, kind):
        """Store blob in S3-compatible storage"""
        self.client.put_object(
            bucket_name=self.bucket_name,
            object_name=f"doc/{object_id}",
            length=len(blob),
            data=io.BytesIO(blob),
            content_type=kind,
        )
        logger.debug("Add blob complete")

    async def remove(self, object_id):
        """Delete blob from S3-compatible storage"""
        self.client.remove_object(
            bucket_name=self.bucket_name,
            object_name=f"doc/{object_id}",
        )
        logger.debug("Remove blob complete")

    async def get(self, object_id):
        """Retrieve blob from S3-compatible storage"""
        resp = self.client.get_object(
            bucket_name=self.bucket_name,
            object_name=f"doc/{object_id}",
        )
        return resp.read()

Key Benefits

  1. No Vendor Lock-in - Works with any S3-compatible storage
  2. Lightweight - MinIO client is only ~500KB
  3. Simple Configuration - Just endpoint + credentials
  4. No Data Migration - Drop-in replacement between backends
  5. Battle-Tested - MinIO client works with all major S3 implementations

Implementation Status

All code has been updated to use generic S3 parameter names:

  • blob_store.py - Updated to accept endpoint, access_key, secret_key
  • librarian.py - Updated parameter names
  • service.py - Updated CLI arguments and configuration
  • Documentation updated

Future Enhancements

  1. SSL/TLS Support - Add --s3-use-ssl flag for HTTPS
  2. Retry Logic - Implement exponential backoff for transient failures
  3. Presigned URLs - Generate temporary upload/download URLs
  4. Multi-region Support - Replicate blobs across regions
  5. CDN Integration - Serve blobs via CDN
  6. Storage Classes - Use S3 storage classes for cost optimization
  7. Lifecycle Policies - Automatic archival/deletion
  8. Versioning - Store multiple versions of blobs

References