trustgraph/docs/tech-specs/embeddings-batch-processing.md
cybermaggedon 0a2ce47a88, Batch embeddings (#668), 2026-03-08 18:36:54 +00:00

Embeddings Batch Processing Technical Specification

Overview

This specification describes optimizations to the embeddings service to support batch processing of multiple texts in a single request. The current implementation processes one text at a time, forgoing the significant performance benefits that embedding models provide when processing batches. Four problems motivate this work:

  1. Single-Text Processing Inefficiency: Current implementation wraps single texts in a list, underutilizing FastEmbed's batch capabilities
  2. Request-Per-Text Overhead: Each text requires a separate Pulsar message round-trip
  3. Model Inference Inefficiency: Embedding models have fixed per-batch overhead; small batches waste GPU/CPU resources
  4. Serial Processing in Callers: Key services loop over items and call embeddings one at a time

Goals

  • Batch API Support: Enable processing multiple texts in a single request
  • Backward Compatibility: Maintain support for single-text requests
  • Significant Throughput Improvement: Target 5-10x higher throughput for bulk operations
  • Reduced Latency per Text: Lower amortized latency when embedding multiple texts
  • Memory Efficiency: Process batches without excessive memory consumption
  • Provider Agnostic: Support batching across FastEmbed, Ollama, and other providers
  • Caller Migration: Update all embedding callers to use batch API where beneficial

Background

Current Implementation - Embeddings Service

The embeddings implementation in trustgraph-flow/trustgraph/embeddings/fastembed/processor.py exhibits a significant performance inefficiency:

# fastembed/processor.py line 56
async def on_embeddings(self, text, model=None):
    use_model = model or self.default_model
    self._load_model(use_model)

    vecs = self.embeddings.embed([text])  # Single text wrapped in list

    return [v.tolist() for v in vecs]

Problems:

  1. Batch Size 1: FastEmbed's embed() method is optimized for batch processing, but we always call it with [text] - a batch of size 1

  2. Per-Request Overhead: Each embedding request incurs:

    • Pulsar message serialization/deserialization
    • Network round-trip latency
    • Model inference startup overhead
    • Python async scheduling overhead
  3. Schema Limitation: The EmbeddingsRequest schema only supports a single text:

    @dataclass
    class EmbeddingsRequest:
        text: str = ""  # Single text only
    

Current Callers - Serial Processing

1. API Gateway

File: trustgraph-flow/trustgraph/gateway/dispatch/embeddings.py

The gateway accepts single-text embedding requests via HTTP/WebSocket and forwards them to the embeddings service. Currently no batch endpoint exists.

class EmbeddingsRequestor(ServiceRequestor):
    # Handles single EmbeddingsRequest -> EmbeddingsResponse
    request_schema=EmbeddingsRequest,  # Single text only
    response_schema=EmbeddingsResponse,

Impact: External clients (web apps, scripts) must make N HTTP requests to embed N texts.

2. Document Embeddings Service

File: trustgraph-flow/trustgraph/embeddings/document_embeddings/embeddings.py

Processes document chunks one at a time:

async def on_message(self, msg, consumer, flow):
    v = msg.value()

    # Single chunk per request
    resp = await flow("embeddings-request").request(
        EmbeddingsRequest(text=v.chunk)
    )
    vectors = resp.vectors

Impact: Each document chunk requires a separate embedding call. A document with 100 chunks = 100 embedding requests.

3. Graph Embeddings Service

File: trustgraph-flow/trustgraph/embeddings/graph_embeddings/embeddings.py

Loops over entities and embeds each one serially:

async def on_message(self, msg, consumer, flow):
    for entity in v.entities:
        # Serial embedding - one entity at a time
        vectors = await flow("embeddings-request").embed(
            text=entity.context
        )
        entities.append(EntityEmbeddings(
            entity=entity.entity,
            vectors=vectors,
            chunk_id=entity.chunk_id,
        ))

Impact: A message with 50 entities = 50 serial embedding requests. This is a major bottleneck during knowledge graph construction.

4. Row Embeddings Service

File: trustgraph-flow/trustgraph/embeddings/row_embeddings/embeddings.py

Loops over unique texts and embeds each one serially:

async def on_message(self, msg, consumer, flow):
    for text, (index_name, index_value) in texts_to_embed.items():
        # Serial embedding - one text at a time
        vectors = await flow("embeddings-request").embed(text=text)

        embeddings_list.append(RowIndexEmbedding(
            index_name=index_name,
            index_value=index_value,
            text=text,
            vectors=vectors
        ))

Impact: Processing a table with 100 unique indexed values = 100 serial embedding requests.

5. EmbeddingsClient (Base Client)

File: trustgraph-base/trustgraph/base/embeddings_client.py

The client used by all flow processors only supports single-text embedding:

class EmbeddingsClient(RequestResponse):
    async def embed(self, text, timeout=30):
        resp = await self.request(
            EmbeddingsRequest(text=text),  # Single text
            timeout=timeout
        )
        return resp.vectors

Impact: All callers using this client are limited to single-text operations.

6. Command-Line Tools

File: trustgraph-cli/trustgraph/cli/invoke_embeddings.py

CLI tool accepts single text argument:

def query(url, flow_id, text, token=None):
    result = flow.embeddings(text=text)  # Single text
    vectors = result.get("vectors", [])

Impact: Users cannot batch-embed from command line. Processing a file of texts requires N invocations.

7. Python SDK

The Python SDK provides two client classes for interacting with TrustGraph services. Both only support single-text embedding.

File: trustgraph-base/trustgraph/api/flow.py

class FlowInstance:
    def embeddings(self, text):
        """Get embeddings for a single text"""
        input = {"text": text}
        return self.request("service/embeddings", input)["vectors"]

File: trustgraph-base/trustgraph/api/socket_client.py

class SocketFlowInstance:
    def embeddings(self, text: str, **kwargs: Any) -> Dict[str, Any]:
        """Get embeddings for a single text via WebSocket"""
        request = {"text": text}
        return self.client._send_request_sync(
            "embeddings", self.flow_id, request, False
        )

Impact: Python developers using the SDK must loop over texts and make N separate API calls. No batch embedding support exists for SDK users.

Performance Impact

For typical document ingestion (1000 text chunks):

  • Current: 1000 separate requests, 1000 model inference calls
  • Batched (batch_size=32): 32 requests, 32 model inference calls (96.8% reduction)

For graph embedding (message with 50 entities):

  • Current: 50 serial await calls, ~5-10 seconds
  • Batched: 1-2 batch calls, ~0.5-1 second (5-10x improvement)

FastEmbed and similar libraries achieve near-linear throughput scaling with batch size up to hardware limits (typically 32-128 texts per batch).
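
The request-count arithmetic above can be checked with a short helper (illustrative only; batch_savings is not part of the codebase):

```python
import math

def batch_savings(n_texts: int, batch_size: int) -> tuple[int, float]:
    """Number of batched requests needed for n_texts, and the percentage
    reduction in request count versus one request per text."""
    n_requests = math.ceil(n_texts / batch_size)
    reduction = 100 * (n_texts - n_requests) / n_texts
    return n_requests, reduction

# 1000 chunks at batch_size=32 -> 32 requests, a 96.8% reduction
requests, reduction = batch_savings(1000, 32)
```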

Technical Design

Architecture

The embeddings batch processing optimization requires changes to the following components:

1. Schema Enhancement

  • Extend EmbeddingsRequest to support multiple texts
  • Extend EmbeddingsResponse to return multiple vector sets
  • Maintain backward compatibility with single-text requests

Module: trustgraph-base/trustgraph/schema/services/llm.py

2. Base Service Enhancement

  • Update EmbeddingsService to handle batch requests
  • Add batch size configuration
  • Implement batch-aware request handling

Module: trustgraph-base/trustgraph/base/embeddings_service.py

3. Provider Processor Updates

  • Update FastEmbed processor to pass full batch to embed()
  • Update Ollama processor to handle batches (if supported)
  • Add fallback sequential processing for providers without batch support

Modules:

  • trustgraph-flow/trustgraph/embeddings/fastembed/processor.py
  • trustgraph-flow/trustgraph/embeddings/ollama/processor.py

4. Client Enhancement

  • Add batch embedding method to EmbeddingsClient
  • Support both single and batch APIs
  • Add automatic batching for large inputs

Module: trustgraph-base/trustgraph/base/embeddings_client.py

5. Caller Updates - Flow Processors

  • Update graph_embeddings to batch entity contexts
  • Update row_embeddings to batch index texts
  • Update document_embeddings if message batching is feasible

Modules:

  • trustgraph-flow/trustgraph/embeddings/graph_embeddings/embeddings.py
  • trustgraph-flow/trustgraph/embeddings/row_embeddings/embeddings.py
  • trustgraph-flow/trustgraph/embeddings/document_embeddings/embeddings.py

6. API Gateway Enhancement

  • Add batch embedding endpoint
  • Support array of texts in request body

Module: trustgraph-flow/trustgraph/gateway/dispatch/embeddings.py

7. CLI Tool Enhancement

  • Add support for multiple texts or file input
  • Add batch size parameter

Module: trustgraph-cli/trustgraph/cli/invoke_embeddings.py

8. Python SDK Enhancement

  • Add embeddings_batch() method to FlowInstance
  • Add embeddings_batch() method to SocketFlowInstance
  • Support both single and batch APIs for SDK users

Modules:

  • trustgraph-base/trustgraph/api/flow.py
  • trustgraph-base/trustgraph/api/socket_client.py

Data Models

EmbeddingsRequest

@dataclass
class EmbeddingsRequest:
    texts: list[str] = field(default_factory=list)

Usage:

  • Single text: EmbeddingsRequest(texts=["hello world"])
  • Batch: EmbeddingsRequest(texts=["text1", "text2", "text3"])

EmbeddingsResponse

@dataclass
class EmbeddingsResponse:
    error: Error | None = None
    vectors: list[list[list[float]]] = field(default_factory=list)

Response structure:

  • vectors[i] contains the vector set for texts[i]
  • Each vector set is list[list[float]] (models may return multiple vectors per text)
  • Example: 3 texts → vectors has 3 entries, each containing that text's embeddings
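
The alignment contract can be sketched with plain Python dataclasses standing in for the Pulsar schema classes (the real definitions live in trustgraph-base/trustgraph/schema/services/llm.py; the vector values here are invented):

```python
from dataclasses import dataclass, field

@dataclass
class EmbeddingsRequest:
    texts: list[str] = field(default_factory=list)

@dataclass
class EmbeddingsResponse:
    error: object = None  # Error | None in the real schema
    vectors: list[list[list[float]]] = field(default_factory=list)

req = EmbeddingsRequest(texts=["alpha", "beta"])
resp = EmbeddingsResponse(vectors=[[[0.1, 0.2]], [[0.3, 0.4]]])

# vectors[i] is the vector set for texts[i]
for text, vector_set in zip(req.texts, resp.vectors):
    assert isinstance(vector_set[0][0], float)
```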

APIs

EmbeddingsClient

class EmbeddingsClient(RequestResponse):
    async def embed(
        self,
        texts: list[str],
        timeout: float = 300,
    ) -> list[list[list[float]]]:
        """
        Embed one or more texts in a single request.

        Args:
            texts: List of texts to embed
            timeout: Timeout for the operation

        Returns:
            List of vector sets, one per input text
        """
        resp = await self.request(
            EmbeddingsRequest(texts=texts),
            timeout=timeout
        )
        if resp.error:
            raise RuntimeError(resp.error.message)
        return resp.vectors
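
The "automatic batching for large inputs" called for in the design could be a wrapper that splits oversized inputs into chunks and concatenates results in order. A hedged sketch (embed_in_batches is hypothetical; embed_fn stands in for EmbeddingsClient.embed):

```python
from typing import Awaitable, Callable

Vectors = list[list[list[float]]]

async def embed_in_batches(
    embed_fn: Callable[[list[str]], Awaitable[Vectors]],
    texts: list[str],
    batch_size: int = 32,
) -> Vectors:
    """Split texts into batch_size chunks, embed each chunk,
    and return the vector sets in the original input order."""
    results: Vectors = []
    for i in range(0, len(texts), batch_size):
        results.extend(await embed_fn(texts[i:i + batch_size]))
    return results
```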

API Gateway Embeddings Endpoint

Updated endpoint supporting single or batch embedding:

POST /api/v1/embeddings
Content-Type: application/json

{
    "texts": ["text1", "text2", "text3"],
    "flow_id": "default"
}

Response:
{
    "vectors": [
        [[0.1, 0.2, ...]],
        [[0.3, 0.4, ...]],
        [[0.5, 0.6, ...]]
    ]
}
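
Client-side, the request body and response unpacking look like this (sketch only; the sample response values are invented and no HTTP call is made here):

```python
import json

payload = {
    "texts": ["text1", "text2", "text3"],
    "flow_id": "default",
}
body = json.dumps(payload)  # POST this body to /api/v1/embeddings

# A response of the shape documented above:
sample = json.loads('{"vectors": [[[0.1, 0.2]], [[0.3, 0.4]], [[0.5, 0.6]]]}')

# vectors[i] corresponds to payload["texts"][i]
pairs = list(zip(payload["texts"], sample["vectors"]))
```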

Implementation Details

Phase 1: Schema Changes

EmbeddingsRequest:

@dataclass
class EmbeddingsRequest:
    texts: list[str] = field(default_factory=list)

EmbeddingsResponse:

@dataclass
class EmbeddingsResponse:
    error: Error | None = None
    vectors: list[list[list[float]]] = field(default_factory=list)

Updated EmbeddingsService.on_request:

async def on_request(self, msg, consumer, flow):
    request = msg.value()
    id = msg.properties()["id"]
    model = flow("model")

    vectors = await self.on_embeddings(request.texts, model=model)
    response = EmbeddingsResponse(error=None, vectors=vectors)

    await flow("response").send(response, properties={"id": id})

Phase 2: FastEmbed Processor Update

Current (Inefficient):

async def on_embeddings(self, text, model=None):
    use_model = model or self.default_model
    self._load_model(use_model)
    vecs = self.embeddings.embed([text])  # Batch of 1
    return [v.tolist() for v in vecs]

Updated:

async def on_embeddings(self, texts: list[str], model=None):
    """Embed texts - processes all texts in single model call"""
    if not texts:
        return []

    use_model = model or self.default_model
    self._load_model(use_model)

    # FastEmbed handles the full batch efficiently
    all_vecs = list(self.embeddings.embed(texts))

    # Return list of vector sets, one per input text
    return [[v.tolist()] for v in all_vecs]

Phase 3: Graph Embeddings Service Update

Current (Serial):

async def on_message(self, msg, consumer, flow):
    entities = []
    for entity in v.entities:
        vectors = await flow("embeddings-request").embed(text=entity.context)
        entities.append(EntityEmbeddings(...))

Updated (Batch):

async def on_message(self, msg, consumer, flow):
    # Collect all contexts
    contexts = [entity.context for entity in v.entities]

    # Single batch embedding call
    all_vectors = await flow("embeddings-request").embed(texts=contexts)

    # Pair results with entities
    entities = [
        EntityEmbeddings(
            entity=entity.entity,
            vectors=vectors[0],  # First vector from the set
            chunk_id=entity.chunk_id,
        )
        for entity, vectors in zip(v.entities, all_vectors)
    ]

Phase 4: Row Embeddings Service Update

Current (Serial):

for text, (index_name, index_value) in texts_to_embed.items():
    vectors = await flow("embeddings-request").embed(text=text)
    embeddings_list.append(RowIndexEmbedding(...))

Updated (Batch):

# Collect texts and metadata
texts = list(texts_to_embed.keys())
metadata = list(texts_to_embed.values())

# Single batch embedding call
all_vectors = await flow("embeddings-request").embed(texts=texts)

# Pair results
embeddings_list = [
    RowIndexEmbedding(
        index_name=meta[0],
        index_value=meta[1],
        text=text,
        vectors=vectors[0]  # First vector from the set
    )
    for text, meta, vectors in zip(texts, metadata, all_vectors)
]

Phase 5: CLI Tool Enhancement

Updated CLI:

def main():
    parser = argparse.ArgumentParser(...)

    parser.add_argument(
        'text',
        nargs='*',  # Zero or more texts
        help='Text(s) to convert to embedding vectors',
    )

    parser.add_argument(
        '-f', '--file',
        help='File containing texts (one per line)',
    )

    parser.add_argument(
        '--batch-size',
        type=int,
        default=32,
        help='Batch size for processing (default: 32)',
    )

Usage:

# Single text (existing)
tg-invoke-embeddings "hello world"

# Multiple texts
tg-invoke-embeddings "text one" "text two" "text three"

# From file
tg-invoke-embeddings -f texts.txt --batch-size 64

Phase 6: Python SDK Enhancement

FlowInstance (HTTP client):

class FlowInstance:
    def embeddings(self, texts: list[str]) -> list[list[list[float]]]:
        """
        Get embeddings for one or more texts.

        Args:
            texts: List of texts to embed

        Returns:
            List of vector sets, one per input text
        """
        input = {"texts": texts}
        return self.request("service/embeddings", input)["vectors"]

SocketFlowInstance (WebSocket client):

class SocketFlowInstance:
    def embeddings(self, texts: list[str], **kwargs: Any) -> list[list[list[float]]]:
        """
        Get embeddings for one or more texts via WebSocket.

        Args:
            texts: List of texts to embed

        Returns:
            List of vector sets, one per input text
        """
        request = {"texts": texts}
        response = self.client._send_request_sync(
            "embeddings", self.flow_id, request, False
        )
        return response["vectors"]

SDK Usage Examples:

# Single text
vectors = flow.embeddings(["hello world"])
print(f"Dimensions: {len(vectors[0][0])}")

# Batch embedding
texts = ["text one", "text two", "text three"]
all_vectors = flow.embeddings(texts)

# Process results
for text, vecs in zip(texts, all_vectors):
    print(f"{text}: {len(vecs[0])} dimensions")

Security Considerations

  • Request Size Limits: Enforce maximum batch size to prevent resource exhaustion
  • Timeout Handling: Scale timeouts appropriately for batch size
  • Memory Limits: Monitor memory usage for large batches
  • Input Validation: Validate all texts in batch before processing
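
The batch-size and input-validation checks could be enforced at the service boundary before any inference runs; a sketch assuming a 128-text cap (validate_batch is illustrative):

```python
MAX_BATCH_SIZE = 128

def validate_batch(texts: list[str]) -> None:
    """Reject oversized, empty, or malformed batches before inference."""
    if not texts:
        raise ValueError("empty batch")
    if len(texts) > MAX_BATCH_SIZE:
        raise ValueError(f"batch size {len(texts)} exceeds {MAX_BATCH_SIZE}")
    for i, text in enumerate(texts):
        if not isinstance(text, str) or not text:
            raise ValueError(f"invalid text at index {i}")
```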

Performance Considerations

Expected Improvements

Throughput:

  • Single-text: ~10-50 texts/second (depending on model)
  • Batch (size 32): ~200-500 texts/second (5-10x improvement)

Latency per Text:

  • Single-text: 50-200ms per text
  • Batch (size 32): 5-20ms per text (amortized)

Service-Specific Improvements:

Service                            Current     Batched    Improvement
Graph Embeddings (50 entities)     5-10s       0.5-1s     5-10x
Row Embeddings (100 texts)         10-20s      1-2s       5-10x
Document Ingestion (1000 chunks)   100-200s    10-30s     5-10x

Configuration Parameters

# Recommended defaults
DEFAULT_BATCH_SIZE = 32
MAX_BATCH_SIZE = 128
BATCH_TIMEOUT_MULTIPLIER = 2.0
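
One way these defaults could combine is to scale the base timeout with the number of model calls a request implies; the formula below is an assumption, not prescribed by this spec:

```python
import math

DEFAULT_BATCH_SIZE = 32
BATCH_TIMEOUT_MULTIPLIER = 2.0

def scaled_timeout(base_timeout: float, n_texts: int,
                   batch_size: int = DEFAULT_BATCH_SIZE) -> float:
    """Grow the timeout with the number of batches implied by n_texts."""
    n_batches = max(1, math.ceil(n_texts / batch_size))
    return base_timeout * BATCH_TIMEOUT_MULTIPLIER * n_batches
```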

Testing Strategy

Unit Testing

  • Single text embedding (backward compatibility)
  • Empty batch handling
  • Maximum batch size enforcement
  • Error handling for partial batch failures
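
The backward-compatibility and empty-batch cases could be covered with tests along these lines (a sketch; fake_embed stands in for a provider's on_embeddings):

```python
def fake_embed(texts: list[str]) -> list[list[list[float]]]:
    """Stand-in embedder: one vector per text, encoding its length."""
    return [[[float(len(t))]] for t in texts]

def test_single_text_is_a_batch_of_one():
    assert fake_embed(["hello"]) == [[[5.0]]]

def test_empty_batch_returns_empty_result():
    assert fake_embed([]) == []

def test_results_preserve_input_order():
    assert fake_embed(["a", "bbb"]) == [[[1.0]], [[3.0]]]
```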

Integration Testing

  • End-to-end batch embedding through Pulsar
  • Graph embeddings service batch processing
  • Row embeddings service batch processing
  • API gateway batch endpoint

Performance Testing

  • Benchmark single vs batch throughput
  • Memory usage under various batch sizes
  • Latency distribution analysis

Migration Plan

This is a breaking change release. All phases are implemented together.

Phase 1: Schema Changes

  • Replace text: str with texts: list[str] in EmbeddingsRequest
  • Change vectors type to list[list[list[float]]] in EmbeddingsResponse

Phase 2: Processor Updates

  • Update on_embeddings signature in FastEmbed and Ollama processors
  • Process full batch in single model call

Phase 3: Client Updates

  • Update EmbeddingsClient.embed() to accept texts: list[str]

Phase 4: Caller Updates

  • Update graph_embeddings to batch entity contexts
  • Update row_embeddings to batch index texts
  • Update document_embeddings to use new schema
  • Update CLI tool

Phase 5: API Gateway

  • Update embeddings endpoint for new schema

Phase 6: Python SDK

  • Update FlowInstance.embeddings() signature
  • Update SocketFlowInstance.embeddings() signature

Open Questions

  • Streaming Large Batches: Should we support streaming results for very large batches (>100 texts)?
  • Provider-Specific Limits: How should we handle providers with different maximum batch sizes?
  • Partial Failure Handling: If one text in a batch fails, should we fail the entire batch or return partial results?
  • Document Embeddings Batching: Should we batch across multiple Chunk messages or keep per-message processing?

References