Mirror of https://github.com/MODSetter/SurfSense.git, synced 2026-05-02 04:12:47 +02:00

Merge pull request #837 from CREDO23/test-document-creation

[Refactor] Core document creation / indexing pipeline with unit and e2e tests

Commit 2e99f1e853

38 changed files with 5535 additions and 3342 deletions
112  .cursor/skills/tdd/SKILL.md  Normal file

@ -0,0 +1,112 @@
---
name: tdd
description: Test-driven development with red-green-refactor loop. Use when user wants to build features or fix bugs using TDD, mentions "red-green-refactor", wants integration tests, or asks for test-first development.
---
# Test-Driven Development

## Philosophy

**Core principle**: Tests should verify behavior through public interfaces, not implementation details. Code can change entirely; tests shouldn't.

**Good tests** are integration-style: they exercise real code paths through public APIs. They describe _what_ the system does, not _how_ it does it. A good test reads like a specification - "user can checkout with valid cart" tells you exactly what capability exists. These tests survive refactors because they don't care about internal structure.

**Bad tests** are coupled to implementation. They mock internal collaborators, test private methods, or verify through external means (like querying a database directly instead of using the interface). The warning sign: your test breaks when you refactor, but behavior hasn't changed. If you rename an internal function and tests fail, those tests were testing implementation, not behavior.

See [tests.md](tests.md) for examples and [mocking.md](mocking.md) for mocking guidelines.
## Anti-Pattern: Horizontal Slices

**DO NOT write all tests first, then all implementation.** This is "horizontal slicing" - treating RED as "write all tests" and GREEN as "write all code."

This produces **crap tests**:

- Tests written in bulk test _imagined_ behavior, not _actual_ behavior
- You end up testing the _shape_ of things (data structures, function signatures) rather than user-facing behavior
- Tests become insensitive to real changes - they pass when behavior breaks, fail when behavior is fine
- You outrun your headlights, committing to test structure before understanding the implementation

**Correct approach**: Vertical slices via tracer bullets. One test → one implementation → repeat. Each test responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it.
```
WRONG (horizontal):
  RED:   test1, test2, test3, test4, test5
  GREEN: impl1, impl2, impl3, impl4, impl5

RIGHT (vertical):
  RED→GREEN: test1→impl1
  RED→GREEN: test2→impl2
  RED→GREEN: test3→impl3
  ...
```
## Workflow

### 1. Planning

Before writing any code:

- [ ] Confirm with user what interface changes are needed
- [ ] Confirm with user which behaviors to test (prioritize)
- [ ] Identify opportunities for [deep modules](deep-modules.md) (small interface, deep implementation)
- [ ] Design interfaces for [testability](interface-design.md)
- [ ] List the behaviors to test (not implementation steps)
- [ ] Get user approval on the plan

Ask: "What should the public interface look like? Which behaviors are most important to test?"

**You can't test everything.** Confirm with the user exactly which behaviors matter most. Focus testing effort on critical paths and complex logic, not every possible edge case.
### 2. Tracer Bullet

Write ONE test that confirms ONE thing about the system:

```
RED:   Write test for first behavior → test fails
GREEN: Write minimal code to pass → test passes
```

This is your tracer bullet - it proves the path works end-to-end.
### 3. Incremental Loop

For each remaining behavior:

```
RED: Write next test → fails
GREEN: Minimal code to pass → passes
```

Rules:

- One test at a time
- Only enough code to pass current test
- Don't anticipate future tests
- Keep tests focused on observable behavior
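One full RED→GREEN turn of this loop might look like the sketch below. The `slugify` helper is hypothetical (not part of this PR), and in real TDD the test file and implementation would live separately:

```python
# GREEN: minimal implementation, written only after the test below had failed
def slugify(title: str) -> str:
    return title.strip().lower().replace(" ", "-")


# RED: this test was written first and failed with a NameError until slugify existed
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"


test_slugify_lowercases_and_hyphenates()
```

The next cycle would add exactly one more test (say, collapsing repeated spaces) and only then extend `slugify` to pass it.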
### 4. Refactor

After all tests pass, look for [refactor candidates](refactoring.md):

- [ ] Extract duplication
- [ ] Deepen modules (move complexity behind simple interfaces)
- [ ] Apply SOLID principles where natural
- [ ] Consider what new code reveals about existing code
- [ ] Run tests after each refactor step

**Never refactor while RED.** Get to GREEN first.
## Checklist Per Cycle

```
[ ] Test describes behavior, not implementation
[ ] Test uses public interface only
[ ] Test would survive internal refactor
[ ] Code is minimal for this test
[ ] No speculative features added
```
33  .cursor/skills/tdd/deep-modules.md  Normal file

@ -0,0 +1,33 @@
# Deep Modules

From "A Philosophy of Software Design":

**Deep module** = small interface + lots of implementation

```
┌─────────────────────┐
│   Small Interface   │  ← Few methods, simple params
├─────────────────────┤
│                     │
│                     │
│ Deep Implementation │  ← Complex logic hidden
│                     │
│                     │
└─────────────────────┘
```

**Shallow module** = large interface + little implementation (avoid)

```
┌─────────────────────────────────┐
│         Large Interface         │  ← Many methods, complex params
├─────────────────────────────────┤
│       Thin Implementation       │  ← Just passes through
└─────────────────────────────────┘
```

When designing interfaces, ask:

- Can I reduce the number of methods?
- Can I simplify the parameters?
- Can I hide more complexity inside?
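A minimal Python sketch of the contrast. `DocumentStore` and `ShallowStore` are hypothetical names, assuming an in-memory backend:

```python
import hashlib


class DocumentStore:
    """Deep: one small method; id derivation, encoding, and storage are hidden."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def save(self, text: str) -> str:
        doc_id = hashlib.sha256(text.encode("utf-8")).hexdigest()  # hidden detail
        self._blobs[doc_id] = text.encode("utf-8")                 # hidden detail
        return doc_id


class ShallowStore:
    """Shallow: callers must orchestrate every step themselves (avoid)."""

    def derive_id(self, text: str) -> str: ...
    def encode(self, text: str) -> bytes: ...
    def write_bytes(self, doc_id: str, payload: bytes) -> None: ...
```

A caller of `DocumentStore` needs one line; a caller of `ShallowStore` re-implements the pipeline at every call site.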
33  .cursor/skills/tdd/interface-design.md  Normal file

@ -0,0 +1,33 @@
# Interface Design for Testability

Good interfaces make testing natural:

1. **Accept dependencies, don't create them**

   ```python
   # Testable: the gateway is passed in
   def process_order(order, payment_gateway):
       payment_gateway.charge(order.total)

   # Hard to test: the gateway is created internally
   def process_order(order):
       gateway = StripeGateway()
       gateway.charge(order.total)
   ```

2. **Return results, don't produce side effects**

   ```python
   # Testable: returns the result
   def calculate_discount(cart) -> float:
       discount = 0.1 * cart.total
       return discount

   # Hard to test: hides the result in a mutation
   def apply_discount(cart) -> None:
       cart.total -= 0.1 * cart.total
   ```

3. **Small surface area**
   * Fewer methods = fewer tests needed
   * Fewer params = simpler test setup
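Point 1 is what makes test doubles trivial: with injection, the test simply passes a fake in. A sketch with hypothetical `Order` and `FakeGateway` types:

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    total: float


@dataclass
class FakeGateway:
    charges: list = field(default_factory=list)

    def charge(self, amount: float) -> None:
        self.charges.append(amount)


def process_order(order, payment_gateway):
    payment_gateway.charge(order.total)


# No patching needed: the dependency arrives through the front door.
gateway = FakeGateway()
process_order(Order(total=42.0), gateway)
assert gateway.charges == [42.0]
```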
69  .cursor/skills/tdd/mocking.md  Normal file

@ -0,0 +1,69 @@
# When to Mock

Mock at **system boundaries** only:

* External APIs (payment, email, etc.)
* Databases (sometimes - prefer test DB)
* Time/randomness
* File system (sometimes)

Don't mock:

* Your own classes/modules
* Internal collaborators
* Anything you control

## Designing for Mockability

At system boundaries, design interfaces that are easy to mock:

**1. Use dependency injection**

Pass external dependencies in rather than creating them internally:

```python
import os

# Easy to mock
def process_payment(order, payment_client):
    return payment_client.charge(order.total)

# Hard to mock
def process_payment(order):
    client = StripeClient(os.getenv("STRIPE_KEY"))
    return client.charge(order.total)
```

**2. Prefer SDK-style interfaces over generic fetchers**

Create specific functions for each external operation instead of one generic function with conditional logic:

```python
import requests

# GOOD: Each function is independently mockable
class UserAPI:
    def get_user(self, user_id):
        return requests.get(f"/users/{user_id}")

    def get_orders(self, user_id):
        return requests.get(f"/users/{user_id}/orders")

    def create_order(self, data):
        return requests.post("/orders", json=data)

# BAD: Mocking requires conditional logic inside the mock
class GenericAPI:
    def fetch(self, endpoint, method="GET", data=None):
        return requests.request(method, endpoint, json=data)
```

The SDK approach means:

* Each mock returns one specific shape
* No conditional logic in test setup
* Easier to see which endpoints a test exercises
* Type safety per endpoint
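With an SDK-style boundary, stubbing stays a one-liner per endpoint. A sketch using `unittest.mock` from the standard library; `display_name` is a hypothetical function under test:

```python
from unittest.mock import MagicMock


def display_name(user_api, user_id):
    """Code under test: depends only on the SDK-style boundary."""
    return user_api.get_user(user_id)["name"]


# Each endpoint is stubbed independently; the mock holds no branching.
api = MagicMock()
api.get_user.return_value = {"id": 7, "name": "Alice"}

assert display_name(api, 7) == "Alice"
```

Stubbing a `GenericAPI.fetch` for the same test would instead need a `side_effect` that inspects the endpoint string, exactly the conditional logic the guideline warns against.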
10  .cursor/skills/tdd/refactoring.md  Normal file

@ -0,0 +1,10 @@
# Refactor Candidates

After each TDD cycle, look for:

- **Duplication** → Extract function/class
- **Long methods** → Break into private helpers (keep tests on public interface)
- **Shallow modules** → Combine or deepen
- **Feature envy** → Move logic to where data lives
- **Primitive obsession** → Introduce value objects
- **Existing code** the new code reveals as problematic
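For instance, the primitive-obsession candidate: a bare `str` passed everywhere can become a small value object that validates once at the edge. `UniqueId` here is a hypothetical illustration, not code from this PR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniqueId:
    """Value object replacing a bare str: validated once, immutable afterwards."""

    value: str

    def __post_init__(self) -> None:
        if not self.value.strip():
            raise ValueError("UniqueId must not be empty or whitespace")
```

Every function that previously had to re-check the string can now accept a `UniqueId` and trust it.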
60  .cursor/skills/tdd/tests.md  Normal file

@ -0,0 +1,60 @@
# Good and Bad Tests

## Good Tests

**Integration-style**: Test through real interfaces, not mocks of internal parts.

```python
# GOOD: Tests observable behavior
def test_user_can_checkout_with_valid_cart():
    cart = create_cart()
    cart.add(product)
    result = checkout(cart, payment_method)
    assert result.status == "confirmed"
```

Characteristics:

* Tests behavior users/callers care about
* Uses public API only
* Survives internal refactors
* Describes WHAT, not HOW
* One logical assertion per test

## Bad Tests

**Implementation-detail tests**: Coupled to internal structure.

```python
# BAD: Tests implementation details
def test_checkout_calls_payment_service_process():
    mock_payment = MagicMock()
    checkout(cart, mock_payment)
    mock_payment.process.assert_called_with(cart.total)
```

Red flags:

* Mocking internal collaborators
* Testing private methods
* Asserting on call counts/order
* Test breaks when refactoring without behavior change
* Test name describes HOW not WHAT
* Verifying through external means instead of interface

```python
# BAD: Bypasses interface to verify
def test_create_user_saves_to_database():
    create_user({"name": "Alice"})
    row = db.query("SELECT * FROM users WHERE name = ?", ["Alice"])
    assert row is not None

# GOOD: Verifies through interface
def test_create_user_makes_user_retrievable():
    user = create_user({"name": "Alice"})
    retrieved = get_user(user.id)
    assert retrieved.name == "Alice"
```
0  surfsense_backend/app/indexing_pipeline/__init__.py  Normal file

@ -0,0 +1,46 @@
```python
from sqlalchemy.ext.asyncio import AsyncSession

from app.db import DocumentStatus, DocumentType
from app.indexing_pipeline.connector_document import ConnectorDocument
from app.indexing_pipeline.indexing_pipeline_service import IndexingPipelineService


async def index_uploaded_file(
    markdown_content: str,
    filename: str,
    etl_service: str,
    search_space_id: int,
    user_id: str,
    session: AsyncSession,
    llm,
) -> None:
    connector_doc = ConnectorDocument(
        title=filename,
        source_markdown=markdown_content,
        unique_id=filename,
        document_type=DocumentType.FILE,
        search_space_id=search_space_id,
        created_by_id=user_id,
        connector_id=None,
        should_summarize=True,
        should_use_code_chunker=False,
        fallback_summary=markdown_content[:4000],
        metadata={
            "FILE_NAME": filename,
            "ETL_SERVICE": etl_service,
        },
    )

    service = IndexingPipelineService(session)
    documents = await service.prepare_for_indexing([connector_doc])

    if not documents:
        raise RuntimeError("prepare_for_indexing returned no documents")

    indexed = await service.index(documents[0], connector_doc, llm)

    if not DocumentStatus.is_state(indexed.status, DocumentStatus.READY):
        raise RuntimeError(indexed.status.get("reason", "Indexing failed"))

    indexed.content_needs_reindexing = False
    await session.commit()
```
@ -0,0 +1,25 @@
```python
from pydantic import BaseModel, Field, field_validator

from app.db import DocumentType


class ConnectorDocument(BaseModel):
    """Canonical data transfer object produced by connector adapters and consumed by the indexing pipeline."""

    title: str
    source_markdown: str
    unique_id: str
    document_type: DocumentType
    search_space_id: int = Field(gt=0)
    should_summarize: bool = True
    should_use_code_chunker: bool = False
    fallback_summary: str | None = None
    metadata: dict = {}
    connector_id: int | None = None
    created_by_id: str

    @field_validator("title", "source_markdown", "unique_id", "created_by_id")
    @classmethod
    def not_empty(cls, v: str, info) -> str:
        if not v.strip():
            raise ValueError(f"{info.field_name} must not be empty or whitespace")
        return v
```
@ -0,0 +1,7 @@
```python
from app.config import config


def chunk_text(text: str, use_code_chunker: bool = False) -> list[str]:
    """Chunk a text string using the configured chunker and return the chunk texts."""
    chunker = config.code_chunker_instance if use_code_chunker else config.chunker_instance
    return [c.text for c in chunker.chunk(text)]
```
@ -0,0 +1,6 @@
```python
from app.config import config


def embed_text(text: str) -> list[float]:
    """Embed a single text string using the configured embedding model."""
    return config.embedding_model_instance.embed(text)
```
15  surfsense_backend/app/indexing_pipeline/document_hashing.py  Normal file

@ -0,0 +1,15 @@
```python
import hashlib

from app.indexing_pipeline.connector_document import ConnectorDocument


def compute_unique_identifier_hash(doc: ConnectorDocument) -> str:
    """Return a stable SHA-256 hash identifying a document by its source identity."""
    combined = f"{doc.document_type.value}:{doc.unique_id}:{doc.search_space_id}"
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


def compute_content_hash(doc: ConnectorDocument) -> str:
    """Return a SHA-256 hash of the document's content scoped to its search space."""
    combined = f"{doc.search_space_id}:{doc.source_markdown}"
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()
```
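A standalone sketch of the two-hash scheme above (no app imports; the function names here are illustrative): the identity hash stays stable across edits, while the content hash tracks them.

```python
import hashlib


def identity_hash(document_type: str, unique_id: str, search_space_id: int) -> str:
    combined = f"{document_type}:{unique_id}:{search_space_id}"
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


def content_hash(search_space_id: int, source_markdown: str) -> str:
    combined = f"{search_space_id}:{source_markdown}"
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


# Same source identity twice: identity hash is stable.
assert identity_hash("FILE", "report.md", 1) == identity_hash("FILE", "report.md", 1)
# Changed content: content hash moves, which is how the pipeline
# detects "same document, changed content" and re-indexes.
assert content_hash(1, "v1") != content_hash(1, "v2")
```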
@ -0,0 +1,39 @@
```python
from datetime import UTC, datetime

from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import object_session
from sqlalchemy.orm.attributes import set_committed_value

from app.db import Document, DocumentStatus


async def rollback_and_persist_failure(
    session: AsyncSession, document: Document, message: str
) -> None:
    """Roll back the current transaction and best-effort persist a failed status.

    Called exclusively from except blocks — must never raise, or the new exception
    would chain with the original and mask it entirely.
    """
    try:
        await session.rollback()
    except Exception:
        return  # Session is completely dead; nothing further we can do.
    try:
        await session.refresh(document)
        document.updated_at = datetime.now(UTC)
        document.status = DocumentStatus.failed(message)
        await session.commit()
    except Exception:
        pass  # Best-effort; document will be retried on the next sync.


def attach_chunks_to_document(document: Document, chunks: list) -> None:
    """Assign chunks to a document without triggering SQLAlchemy async lazy loading."""
    set_committed_value(document, "chunks", chunks)
    session = object_session(document)
    if session is not None:
        if document.id is not None:
            for chunk in chunks:
                chunk.document_id = document.id
        session.add_all(chunks)
```
@ -0,0 +1,28 @@
```python
from app.prompts import SUMMARY_PROMPT_TEMPLATE
from app.utils.document_converters import optimize_content_for_context_window


async def summarize_document(source_markdown: str, llm, metadata: dict | None = None) -> str:
    """Generate a text summary of a document using an LLM, prefixed with metadata when provided."""
    model_name = getattr(llm, "model", "gpt-3.5-turbo")
    optimized_content = optimize_content_for_context_window(
        source_markdown, metadata, model_name
    )

    summary_chain = SUMMARY_PROMPT_TEMPLATE | llm
    content_with_metadata = (
        f"<DOCUMENT><DOCUMENT_METADATA>\n\n{metadata}\n\n</DOCUMENT_METADATA>"
        f"\n\n<DOCUMENT_CONTENT>\n\n{optimized_content}\n\n</DOCUMENT_CONTENT></DOCUMENT>"
    )
    summary_result = await summary_chain.ainvoke({"document": content_with_metadata})
    summary_content = summary_result.content

    if metadata:
        metadata_parts = ["# DOCUMENT METADATA"]
        for key, value in metadata.items():
            if value:
                metadata_parts.append(f"**{key.replace('_', ' ').title()}:** {value}")
        metadata_section = "\n".join(metadata_parts)
        return f"{metadata_section}\n\n# DOCUMENT SUMMARY\n\n{summary_content}"

    return summary_content
```
121  surfsense_backend/app/indexing_pipeline/exceptions.py  Normal file

@ -0,0 +1,121 @@
```python
from litellm.exceptions import (
    APIConnectionError,
    APIResponseValidationError,
    AuthenticationError,
    BadGatewayError,
    BadRequestError,
    InternalServerError,
    NotFoundError,
    PermissionDeniedError,
    RateLimitError,
    ServiceUnavailableError,
    Timeout,
    UnprocessableEntityError,
)
from sqlalchemy.exc import IntegrityError

# Tuples for use directly in except clauses.
RETRYABLE_LLM_ERRORS = (
    RateLimitError,
    Timeout,
    ServiceUnavailableError,
    BadGatewayError,
    InternalServerError,
    APIConnectionError,
)

PERMANENT_LLM_ERRORS = (
    AuthenticationError,
    PermissionDeniedError,
    NotFoundError,
    BadRequestError,
    UnprocessableEntityError,
    APIResponseValidationError,
)

# (LiteLLMEmbeddings, CohereEmbeddings, GeminiEmbeddings all normalize to RuntimeError).
EMBEDDING_ERRORS = (
    RuntimeError,  # local device failure or API backend normalization
    OSError,  # model files missing or corrupted (local backends)
    MemoryError,  # document too large for available RAM
)


class PipelineMessages:
    RATE_LIMIT = "LLM rate limit exceeded. Will retry on next sync."
    LLM_TIMEOUT = "LLM request timed out. Will retry on next sync."
    LLM_UNAVAILABLE = "LLM service temporarily unavailable. Will retry on next sync."
    LLM_BAD_GATEWAY = "LLM gateway error. Will retry on next sync."
    LLM_SERVER_ERROR = "LLM internal server error. Will retry on next sync."
    LLM_CONNECTION = "Could not reach the LLM service. Check network connectivity."

    LLM_AUTH = "LLM authentication failed. Check your API key."
    LLM_PERMISSION = "LLM request denied. Check your account permissions."
    LLM_NOT_FOUND = "LLM model not found. Check your model configuration."
    LLM_BAD_REQUEST = "LLM rejected the request. Document content may be invalid."
    LLM_UNPROCESSABLE = "Document exceeds the LLM context window even after optimization."
    LLM_RESPONSE = "LLM returned an invalid response."

    EMBEDDING_FAILED = "Embedding failed. Check your embedding model configuration or service."
    EMBEDDING_MODEL = "Embedding model files are missing or corrupted."
    EMBEDDING_MEMORY = "Not enough memory to embed this document."

    CHUNKING_OVERFLOW = "Document structure is too deeply nested to chunk."


def safe_exception_message(exc: Exception) -> str:
    try:
        return str(exc)
    except Exception:
        return "Something went wrong during indexing. Error details could not be retrieved."


def llm_retryable_message(exc: Exception) -> str:
    try:
        if isinstance(exc, RateLimitError):
            return PipelineMessages.RATE_LIMIT
        if isinstance(exc, Timeout):
            return PipelineMessages.LLM_TIMEOUT
        if isinstance(exc, ServiceUnavailableError):
            return PipelineMessages.LLM_UNAVAILABLE
        if isinstance(exc, BadGatewayError):
            return PipelineMessages.LLM_BAD_GATEWAY
        if isinstance(exc, InternalServerError):
            return PipelineMessages.LLM_SERVER_ERROR
        if isinstance(exc, APIConnectionError):
            return PipelineMessages.LLM_CONNECTION
        return safe_exception_message(exc)
    except Exception:
        return "Something went wrong when calling the LLM."


def llm_permanent_message(exc: Exception) -> str:
    try:
        if isinstance(exc, AuthenticationError):
            return PipelineMessages.LLM_AUTH
        if isinstance(exc, PermissionDeniedError):
            return PipelineMessages.LLM_PERMISSION
        if isinstance(exc, NotFoundError):
            return PipelineMessages.LLM_NOT_FOUND
        if isinstance(exc, BadRequestError):
            return PipelineMessages.LLM_BAD_REQUEST
        if isinstance(exc, UnprocessableEntityError):
            return PipelineMessages.LLM_UNPROCESSABLE
        if isinstance(exc, APIResponseValidationError):
            return PipelineMessages.LLM_RESPONSE
        return safe_exception_message(exc)
    except Exception:
        return "Something went wrong when calling the LLM."


def embedding_message(exc: Exception) -> str:
    try:
        if isinstance(exc, RuntimeError):
            return PipelineMessages.EMBEDDING_FAILED
        if isinstance(exc, OSError):
            return PipelineMessages.EMBEDDING_MODEL
        if isinstance(exc, MemoryError):
            return PipelineMessages.EMBEDDING_MEMORY
        return safe_exception_message(exc)
    except Exception:
        return "Something went wrong when generating the embedding."
```
@ -0,0 +1,237 @@
```python
import contextlib
from datetime import UTC, datetime

from sqlalchemy import delete, select
from sqlalchemy.ext.asyncio import AsyncSession

from app.db import Chunk, Document, DocumentStatus
from app.indexing_pipeline.connector_document import ConnectorDocument
from app.indexing_pipeline.document_chunker import chunk_text
from app.indexing_pipeline.document_embedder import embed_text
from app.indexing_pipeline.document_hashing import (
    compute_content_hash,
    compute_unique_identifier_hash,
)
from app.indexing_pipeline.document_persistence import (
    attach_chunks_to_document,
    rollback_and_persist_failure,
)
from app.indexing_pipeline.document_summarizer import summarize_document
from app.indexing_pipeline.exceptions import (
    EMBEDDING_ERRORS,
    PERMANENT_LLM_ERRORS,
    RETRYABLE_LLM_ERRORS,
    IntegrityError,
    PipelineMessages,
    embedding_message,
    llm_permanent_message,
    llm_retryable_message,
    safe_exception_message,
)
from app.indexing_pipeline.pipeline_logger import (
    PipelineLogContext,
    log_batch_aborted,
    log_chunking_overflow,
    log_doc_skipped_unknown,
    log_document_queued,
    log_document_requeued,
    log_document_updated,
    log_embedding_error,
    log_index_started,
    log_index_success,
    log_permanent_llm_error,
    log_race_condition,
    log_retryable_llm_error,
    log_unexpected_error,
)


class IndexingPipelineService:
    """Single pipeline for indexing connector documents. All connectors use this service."""

    def __init__(self, session: AsyncSession) -> None:
        self.session = session

    async def prepare_for_indexing(
        self, connector_docs: list[ConnectorDocument]
    ) -> list[Document]:
        """
        Persist new documents and detect changes, returning only those that need indexing.
        """
        documents = []
        seen_hashes: set[str] = set()
        batch_ctx = PipelineLogContext(
            connector_id=connector_docs[0].connector_id if connector_docs else 0,
            search_space_id=connector_docs[0].search_space_id if connector_docs else 0,
            unique_id="batch",
        )

        for connector_doc in connector_docs:
            ctx = PipelineLogContext(
                connector_id=connector_doc.connector_id,
                search_space_id=connector_doc.search_space_id,
                unique_id=connector_doc.unique_id,
            )
            try:
                unique_identifier_hash = compute_unique_identifier_hash(connector_doc)
                content_hash = compute_content_hash(connector_doc)

                if unique_identifier_hash in seen_hashes:
                    continue
                seen_hashes.add(unique_identifier_hash)

                result = await self.session.execute(
                    select(Document).filter(
                        Document.unique_identifier_hash == unique_identifier_hash
                    )
                )
                existing = result.scalars().first()

                if existing is not None:
                    if existing.content_hash == content_hash:
                        if existing.title != connector_doc.title:
                            existing.title = connector_doc.title
                            existing.updated_at = datetime.now(UTC)
                        if not DocumentStatus.is_state(
                            existing.status, DocumentStatus.READY
                        ):
                            existing.status = DocumentStatus.pending()
                            existing.updated_at = datetime.now(UTC)
                            documents.append(existing)
                            log_document_requeued(ctx)
                        continue

                    existing.title = connector_doc.title
                    existing.content_hash = content_hash
                    existing.source_markdown = connector_doc.source_markdown
                    existing.document_metadata = connector_doc.metadata
                    existing.updated_at = datetime.now(UTC)
                    existing.status = DocumentStatus.pending()
                    documents.append(existing)
                    log_document_updated(ctx)
                    continue

                duplicate = await self.session.execute(
                    select(Document).filter(Document.content_hash == content_hash)
                )
                if duplicate.scalars().first() is not None:
                    continue

                document = Document(
                    title=connector_doc.title,
                    document_type=connector_doc.document_type,
                    content="Pending...",
                    content_hash=content_hash,
                    unique_identifier_hash=unique_identifier_hash,
                    source_markdown=connector_doc.source_markdown,
                    document_metadata=connector_doc.metadata,
                    search_space_id=connector_doc.search_space_id,
                    connector_id=connector_doc.connector_id,
                    created_by_id=connector_doc.created_by_id,
                    updated_at=datetime.now(UTC),
                    status=DocumentStatus.pending(),
                )
                self.session.add(document)
                documents.append(document)
                log_document_queued(ctx)

            except Exception as e:
                log_doc_skipped_unknown(ctx, e)

        try:
            await self.session.commit()
            return documents
        except IntegrityError:
            # A concurrent worker committed a document with the same content_hash
            # or unique_identifier_hash between our check and our INSERT.
            # The document already exists — roll back and let the next sync run handle it.
            log_race_condition(batch_ctx)
            await self.session.rollback()
            return []
        except Exception as e:
            log_batch_aborted(batch_ctx, e)
            await self.session.rollback()
            return []

    async def index(
        self, document: Document, connector_doc: ConnectorDocument, llm
    ) -> Document:
        """
        Run summarization, embedding, and chunking for a document and persist the results.
        """
        ctx = PipelineLogContext(
            connector_id=connector_doc.connector_id,
            search_space_id=connector_doc.search_space_id,
            unique_id=connector_doc.unique_id,
            doc_id=document.id,
        )
        try:
            log_index_started(ctx)
            document.status = DocumentStatus.processing()
            await self.session.commit()

            if connector_doc.should_summarize and llm is not None:
                content = await summarize_document(
                    connector_doc.source_markdown, llm, connector_doc.metadata
                )
            elif connector_doc.should_summarize and connector_doc.fallback_summary:
                content = connector_doc.fallback_summary
            else:
                content = connector_doc.source_markdown

            embedding = embed_text(content)

            await self.session.execute(
                delete(Chunk).where(Chunk.document_id == document.id)
            )
```
|
||||||
|
)
|
||||||
|
|
||||||
|
chunks = [
|
||||||
|
Chunk(content=text, embedding=embed_text(text))
|
||||||
|
for text in chunk_text(
|
||||||
|
connector_doc.source_markdown,
|
||||||
|
use_code_chunker=connector_doc.should_use_code_chunker,
|
||||||
|
)
|
||||||
|
]
|
||||||
|
|
||||||
|
document.content = content
|
||||||
|
document.embedding = embedding
|
||||||
|
attach_chunks_to_document(document, chunks)
|
||||||
|
document.updated_at = datetime.now(UTC)
|
||||||
|
document.status = DocumentStatus.ready()
|
||||||
|
await self.session.commit()
|
||||||
|
log_index_success(ctx, chunk_count=len(chunks))
|
||||||
|
|
||||||
|
except RETRYABLE_LLM_ERRORS as e:
|
||||||
|
log_retryable_llm_error(ctx, e)
|
||||||
|
await rollback_and_persist_failure(
|
||||||
|
self.session, document, llm_retryable_message(e)
|
||||||
|
)
|
||||||
|
|
||||||
|
except PERMANENT_LLM_ERRORS as e:
|
||||||
|
log_permanent_llm_error(ctx, e)
|
||||||
|
await rollback_and_persist_failure(
|
||||||
|
self.session, document, llm_permanent_message(e)
|
||||||
|
)
|
||||||
|
|
||||||
|
except RecursionError as e:
|
||||||
|
log_chunking_overflow(ctx, e)
|
||||||
|
await rollback_and_persist_failure(
|
||||||
|
self.session, document, PipelineMessages.CHUNKING_OVERFLOW
|
||||||
|
)
|
||||||
|
|
||||||
|
except EMBEDDING_ERRORS as e:
|
||||||
|
log_embedding_error(ctx, e)
|
||||||
|
await rollback_and_persist_failure(
|
||||||
|
self.session, document, embedding_message(e)
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
log_unexpected_error(ctx, e)
|
||||||
|
await rollback_and_persist_failure(
|
||||||
|
self.session, document, safe_exception_message(e)
|
||||||
|
)
|
||||||
|
|
||||||
|
with contextlib.suppress(Exception):
|
||||||
|
await self.session.refresh(document)
|
||||||
|
|
||||||
|
return document
|
||||||
118 surfsense_backend/app/indexing_pipeline/pipeline_logger.py Normal file
@@ -0,0 +1,118 @@
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class PipelineLogContext:
    connector_id: int | None
    search_space_id: int
    unique_id: str  # always available from ConnectorDocument
    doc_id: int | None = None  # set once the DB row exists (index phase only)


class LogMessages:
    # prepare_for_indexing
    DOCUMENT_QUEUED = "New document queued for indexing."
    DOCUMENT_UPDATED = "Document content changed, re-queued for indexing."
    DOCUMENT_REQUEUED = "Stuck document re-queued for indexing."
    DOC_SKIPPED_UNKNOWN = "Unexpected error — document skipped."
    BATCH_ABORTED = "Fatal DB error — aborting prepare batch."
    RACE_CONDITION = "Concurrent worker beat us to the commit — rolling back batch."

    # index
    INDEX_STARTED = "Document indexing started."
    INDEX_SUCCESS = "Document indexed successfully."
    LLM_RETRYABLE = "Retryable LLM error — document marked failed, will retry on next sync."
    LLM_PERMANENT = "Permanent LLM error — document marked failed."
    EMBEDDING_FAILED = "Embedding error — document marked failed."
    CHUNKING_OVERFLOW = "Chunking overflow — document marked failed."
    UNEXPECTED = "Unexpected error — document marked failed."


def _format_context(ctx: PipelineLogContext) -> str:
    parts = [
        f"connector_id={ctx.connector_id}",
        f"search_space_id={ctx.search_space_id}",
        f"unique_id={ctx.unique_id}",
    ]
    if ctx.doc_id is not None:
        parts.append(f"doc_id={ctx.doc_id}")
    return " ".join(parts)


def _build_message(msg: str, ctx: PipelineLogContext, **extra) -> str:
    try:
        parts = [msg, _format_context(ctx)]
        for key, val in extra.items():
            parts.append(f"{key}={val}")
        return " ".join(parts)
    except Exception:
        return msg


def _safe_log(level_fn, msg: str, ctx: PipelineLogContext, exc_info=None, **extra) -> None:
    # Logging must never raise — a broken log call inside an except block would
    # chain with the original exception and mask it entirely.
    try:
        message = _build_message(msg, ctx, **extra)
        level_fn(message, exc_info=exc_info)
    except Exception:
        pass


# ── prepare_for_indexing ──────────────────────────────────────────────────────


def log_document_queued(ctx: PipelineLogContext) -> None:
    _safe_log(logger.info, LogMessages.DOCUMENT_QUEUED, ctx)


def log_document_updated(ctx: PipelineLogContext) -> None:
    _safe_log(logger.info, LogMessages.DOCUMENT_UPDATED, ctx)


def log_document_requeued(ctx: PipelineLogContext) -> None:
    _safe_log(logger.info, LogMessages.DOCUMENT_REQUEUED, ctx)


def log_doc_skipped_unknown(ctx: PipelineLogContext, exc: Exception) -> None:
    _safe_log(logger.warning, LogMessages.DOC_SKIPPED_UNKNOWN, ctx, exc_info=exc, error=exc)


def log_race_condition(ctx: PipelineLogContext) -> None:
    _safe_log(logger.warning, LogMessages.RACE_CONDITION, ctx)


def log_batch_aborted(ctx: PipelineLogContext, exc: Exception) -> None:
    _safe_log(logger.error, LogMessages.BATCH_ABORTED, ctx, exc_info=exc, error=exc)


# ── index ─────────────────────────────────────────────────────────────────────


def log_index_started(ctx: PipelineLogContext) -> None:
    _safe_log(logger.info, LogMessages.INDEX_STARTED, ctx)


def log_index_success(ctx: PipelineLogContext, chunk_count: int) -> None:
    _safe_log(logger.info, LogMessages.INDEX_SUCCESS, ctx, chunk_count=chunk_count)


def log_retryable_llm_error(ctx: PipelineLogContext, exc: Exception) -> None:
    _safe_log(logger.warning, LogMessages.LLM_RETRYABLE, ctx, exc_info=exc, error=exc)


def log_permanent_llm_error(ctx: PipelineLogContext, exc: Exception) -> None:
    _safe_log(logger.error, LogMessages.LLM_PERMANENT, ctx, exc_info=exc, error=exc)


def log_embedding_error(ctx: PipelineLogContext, exc: Exception) -> None:
    _safe_log(logger.error, LogMessages.EMBEDDING_FAILED, ctx, exc_info=exc, error=exc)


def log_chunking_overflow(ctx: PipelineLogContext, exc: Exception) -> None:
    _safe_log(logger.error, LogMessages.CHUNKING_OVERFLOW, ctx, exc_info=exc, error=exc)


def log_unexpected_error(ctx: PipelineLogContext, exc: Exception) -> None:
    _safe_log(logger.error, LogMessages.UNEXPECTED, ctx, exc_info=exc, error=exc)
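The _safe_log contract, that logging must never raise, can be demonstrated in isolation: even a context object whose `__str__` throws is swallowed. A minimal sketch of the same defensive shape, with hypothetical `safe_log` and `Exploding` names:

```python
import logging

logger = logging.getLogger("pipeline_sketch")


def safe_log(level_fn, msg: str, **extra) -> None:
    # Same shape as _safe_log: the message is built INSIDE the try, so a broken
    # __str__ in `extra` cannot propagate out of an except block and mask the
    # original exception being handled there.
    try:
        parts = [msg] + [f"{k}={v}" for k, v in extra.items()]
        level_fn(" ".join(parts))
    except Exception:
        pass


class Exploding:
    def __str__(self):
        raise RuntimeError("repr failed")


# Does not raise even though formatting the context blows up.
safe_log(logger.warning, "indexing failed", doc=Exploding())
print("survived")  # survived
```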
@@ -18,6 +18,7 @@ from sqlalchemy.ext.asyncio import AsyncSession

 from app.config import config as app_config
 from app.db import Document, DocumentStatus, DocumentType, Log, Notification
+from app.indexing_pipeline.adapters.file_upload_adapter import index_uploaded_file
 from app.services.llm_service import get_user_long_context_llm
 from app.services.notification_service import NotificationService
 from app.services.task_logging_service import TaskLoggingService

@@ -33,7 +34,6 @@ from .base import (
     check_document_by_unique_identifier,
     check_duplicate_document,
     get_current_timestamp,
-    safe_set_chunks,
 )
 from .markdown_processor import add_received_markdown_file_document

@@ -1863,7 +1863,7 @@ async def process_file_in_background_with_document(
         )
         return None

-    # ===== STEP 3: Generate embeddings and chunks =====
+    # ===== STEP 3+4: Index via pipeline =====
     if notification:
         await NotificationService.document_processing.notify_processing_progress(
             session, notification, stage="chunking"

@@ -1871,57 +1871,22 @@ async def process_file_in_background_with_document(
         )

     user_llm = await get_user_long_context_llm(session, user_id, search_space_id)

-    if user_llm:
-        document_metadata = {
-            "file_name": filename,
-            "etl_service": etl_service,
-            "document_type": "File Document",
-        }
-        summary_content, summary_embedding = await generate_document_summary(
-            markdown_content, user_llm, document_metadata
-        )
-    else:
-        # Fallback: use truncated content as summary
-        summary_content = markdown_content[:4000]
-        from app.config import config
-
-        summary_embedding = config.embedding_model_instance.embed(summary_content)
-
-    chunks = await create_document_chunks(markdown_content)
-
-    # ===== STEP 4: Update document to READY =====
-    from sqlalchemy.orm.attributes import flag_modified
-
-    document.title = filename
-    document.content = summary_content
-    document.content_hash = content_hash
-    document.embedding = summary_embedding
-    document.document_metadata = {
-        "FILE_NAME": filename,
-        "ETL_SERVICE": etl_service or "UNKNOWN",
-        **(document.document_metadata or {}),
-    }
-    flag_modified(document, "document_metadata")
-
-    # Use safe_set_chunks to avoid async issues
-    safe_set_chunks(document, chunks)
-
-    document.source_markdown = markdown_content
-    document.content_needs_reindexing = False
-    document.updated_at = get_current_timestamp()
-    document.status = DocumentStatus.ready()  # Shows checkmark in UI
-
-    await session.commit()
-    await session.refresh(document)
+    await index_uploaded_file(
+        markdown_content=markdown_content,
+        filename=filename,
+        etl_service=etl_service,
+        search_space_id=search_space_id,
+        user_id=user_id,
+        session=session,
+        llm=user_llm,
+    )

     await task_logger.log_task_success(
         log_entry,
         f"Successfully processed file: (unknown)",
         {
             "document_id": document.id,
-            "content_hash": content_hash,
             "file_type": etl_service,
-            "chunks_count": len(chunks),
         },
     )
@@ -72,6 +72,22 @@ dependencies = [

 [dependency-groups]
 dev = [
     "ruff>=0.12.5",
+    "pytest>=8.0",
+    "pytest-asyncio>=0.25",
+    "pytest-mock>=3.14",
+]
+
+[tool.pytest.ini_options]
+asyncio_mode = "auto"
+asyncio_default_fixture_loop_scope = "session"
+asyncio_default_test_loop_scope = "session"
+testpaths = ["tests"]
+markers = [
+    "unit: pure logic tests, no DB or external services",
+    "integration: tests that require a real PostgreSQL database",
+]
+filterwarnings = [
+    "ignore::UserWarning:chonkie",
 ]

 [tool.ruff]
0 surfsense_backend/tests/__init__.py Normal file
36 surfsense_backend/tests/conftest.py Normal file
@@ -0,0 +1,36 @@
import pytest

from app.db import DocumentType
from app.indexing_pipeline.connector_document import ConnectorDocument


@pytest.fixture
def sample_user_id() -> str:
    return "00000000-0000-0000-0000-000000000001"


@pytest.fixture
def sample_search_space_id() -> int:
    return 1


@pytest.fixture
def sample_connector_id() -> int:
    return 42


@pytest.fixture
def make_connector_document():
    def _make(**overrides):
        defaults = {
            "title": "Test Document",
            "source_markdown": "## Heading\n\nSome content.",
            "unique_id": "test-id-001",
            "document_type": DocumentType.CLICKUP_CONNECTOR,
            "search_space_id": 1,
            "connector_id": 1,
            "created_by_id": "00000000-0000-0000-0000-000000000001",
        }
        defaults.update(overrides)
        return ConnectorDocument(**defaults)

    return _make
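The make_connector_document fixture is a factory-with-overrides: defaults are merged with per-test keyword arguments so each test states only the fields it cares about. A standalone sketch of the pattern with a hypothetical `Doc` dataclass:

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    unique_id: str
    search_space_id: int


def make_doc_factory():
    """Return a factory that merges per-test overrides into sane defaults,
    the same shape as the make_connector_document fixture."""

    def _make(**overrides):
        defaults = {
            "title": "Test Document",
            "unique_id": "test-id-001",
            "search_space_id": 1,
        }
        defaults.update(overrides)
        return Doc(**defaults)

    return _make


make = make_doc_factory()
print(make().title)                             # Test Document
print(make(title="Custom").title)               # Custom
print(make(search_space_id=7).search_space_id)  # 7
```

Wrapping the factory in a pytest fixture (as the conftest does) additionally lets the defaults close over other fixtures, which is how the integration-scoped override below injects real DB row IDs.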
0 surfsense_backend/tests/integration/__init__.py Normal file
164 surfsense_backend/tests/integration/conftest.py Normal file
@@ -0,0 +1,164 @@
import os
import uuid
from unittest.mock import AsyncMock, MagicMock

import pytest
import pytest_asyncio
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlalchemy.pool import NullPool

from app.db import (
    Base,
    DocumentType,
    SearchSourceConnector,
    SearchSourceConnectorType,
    SearchSpace,
    User,
)
from app.indexing_pipeline.connector_document import ConnectorDocument

_EMBEDDING_DIM = 1024  # must match the Vector() dimension used in DB column creation

_DEFAULT_TEST_DB = "postgresql+asyncpg://postgres:postgres@localhost:5432/surfsense_test"
TEST_DATABASE_URL = os.environ.get("TEST_DATABASE_URL", _DEFAULT_TEST_DB)


@pytest_asyncio.fixture(scope="session")
async def async_engine():
    engine = create_async_engine(
        TEST_DATABASE_URL,
        poolclass=NullPool,
        echo=False,
        # Required for asyncpg + savepoints: disables prepared statement cache
        # to prevent "another operation is in progress" errors during savepoint rollbacks.
        connect_args={"prepared_statement_cache_size": 0},
    )

    async with engine.begin() as conn:
        await conn.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
        await conn.run_sync(Base.metadata.create_all)

    yield engine

    # drop_all fails on circular FKs (new_chat_threads ↔ public_chat_snapshots).
    # DROP SCHEMA CASCADE handles this without needing topological sort.
    async with engine.begin() as conn:
        await conn.execute(text("DROP SCHEMA public CASCADE"))
        await conn.execute(text("CREATE SCHEMA public"))

    await engine.dispose()


@pytest_asyncio.fixture
async def db_session(async_engine) -> AsyncSession:
    # Bind the session to a connection that holds an outer transaction.
    # join_transaction_mode="create_savepoint" makes session.commit() release
    # a SAVEPOINT instead of committing the outer transaction, so the final
    # transaction.rollback() undoes everything — including commits made by the
    # service under test — leaving the DB clean for the next test.
    async with async_engine.connect() as conn:
        transaction = await conn.begin()
        async with AsyncSession(
            bind=conn,
            expire_on_commit=False,
            join_transaction_mode="create_savepoint",
        ) as session:
            yield session
        await transaction.rollback()


@pytest_asyncio.fixture
async def db_user(db_session: AsyncSession) -> User:
    user = User(
        id=uuid.uuid4(),
        email="test@surfsense.net",
        hashed_password="hashed",
        is_active=True,
        is_superuser=False,
        is_verified=True,
    )
    db_session.add(user)
    await db_session.flush()
    return user


@pytest_asyncio.fixture
async def db_search_space(db_session: AsyncSession, db_user: User) -> SearchSpace:
    space = SearchSpace(
        name="Test Space",
        user_id=db_user.id,
    )
    db_session.add(space)
    await db_session.flush()
    return space


@pytest_asyncio.fixture
async def db_connector(
    db_session: AsyncSession, db_user: User, db_search_space: SearchSpace
) -> SearchSourceConnector:
    connector = SearchSourceConnector(
        name="Test Connector",
        connector_type=SearchSourceConnectorType.CLICKUP_CONNECTOR,
        config={},
        search_space_id=db_search_space.id,
        user_id=db_user.id,
    )
    db_session.add(connector)
    await db_session.flush()
    return connector


@pytest.fixture
def patched_summarize(monkeypatch) -> AsyncMock:
    mock = AsyncMock(return_value="Mocked summary.")
    monkeypatch.setattr(
        "app.indexing_pipeline.indexing_pipeline_service.summarize_document",
        mock,
    )
    return mock


@pytest.fixture
def patched_summarize_raises(monkeypatch) -> AsyncMock:
    mock = AsyncMock(side_effect=RuntimeError("LLM unavailable"))
    monkeypatch.setattr(
        "app.indexing_pipeline.indexing_pipeline_service.summarize_document",
        mock,
    )
    return mock


@pytest.fixture
def patched_embed_text(monkeypatch) -> MagicMock:
    mock = MagicMock(return_value=[0.1] * _EMBEDDING_DIM)
    monkeypatch.setattr(
        "app.indexing_pipeline.indexing_pipeline_service.embed_text",
        mock,
    )
    return mock


@pytest.fixture
def patched_chunk_text(monkeypatch) -> MagicMock:
    mock = MagicMock(return_value=["Test chunk content."])
    monkeypatch.setattr(
        "app.indexing_pipeline.indexing_pipeline_service.chunk_text",
        mock,
    )
    return mock


@pytest.fixture
def make_connector_document(db_connector, db_user):
    """Integration-scoped override: uses real DB connector and user IDs."""

    def _make(**overrides):
        defaults = {
            "title": "Test Document",
            "source_markdown": "## Heading\n\nSome content.",
            "unique_id": "test-id-001",
            "document_type": DocumentType.CLICKUP_CONNECTOR,
            "search_space_id": db_connector.search_space_id,
            "connector_id": db_connector.id,
            "created_by_id": str(db_user.id),
        }
        defaults.update(overrides)
        return ConnectorDocument(**defaults)

    return _make
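The db_session fixture relies on savepoint semantics: each session.commit() releases a SAVEPOINT inside one long-lived outer transaction, and the teardown rollback discards everything at once. The same mechanism can be seen in miniature with stdlib sqlite3 and raw SQL savepoints (a sketch of the pattern, not the SQLAlchemy fixture itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manual transaction control
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")

conn.execute("BEGIN")                  # outer transaction (the fixture's `transaction`)
conn.execute("SAVEPOINT test_commit")  # what each session.commit() becomes
conn.execute("INSERT INTO documents (title) VALUES ('queued doc')")
conn.execute("RELEASE SAVEPOINT test_commit")  # the "commit" the code under test sees

# While the test runs, its own writes are visible:
count_during = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]

conn.execute("ROLLBACK")  # fixture teardown: undoes everything, released savepoints included

count_after = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(count_during, count_after)  # 1 0
```

This is why the integration tests below can exercise a service that commits internally and still leave the database clean for the next test.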
@@ -0,0 +1,91 @@
import pytest
from sqlalchemy import select

from app.db import Chunk, Document, DocumentStatus
from app.indexing_pipeline.adapters.file_upload_adapter import index_uploaded_file

pytestmark = pytest.mark.integration


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_sets_status_ready(db_session, db_search_space, db_user, mocker):
    """Document status is READY after successful indexing."""
    await index_uploaded_file(
        markdown_content="## Hello\n\nSome content.",
        filename="test.pdf",
        etl_service="UNSTRUCTURED",
        search_space_id=db_search_space.id,
        user_id=str(db_user.id),
        session=db_session,
        llm=mocker.Mock(),
    )

    result = await db_session.execute(
        select(Document).filter(Document.search_space_id == db_search_space.id)
    )
    document = result.scalars().first()

    assert DocumentStatus.is_state(document.status, DocumentStatus.READY)


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_content_is_summary(db_session, db_search_space, db_user, mocker):
    """Document content is set to the LLM-generated summary."""
    await index_uploaded_file(
        markdown_content="## Hello\n\nSome content.",
        filename="test.pdf",
        etl_service="UNSTRUCTURED",
        search_space_id=db_search_space.id,
        user_id=str(db_user.id),
        session=db_session,
        llm=mocker.Mock(),
    )

    result = await db_session.execute(
        select(Document).filter(Document.search_space_id == db_search_space.id)
    )
    document = result.scalars().first()

    assert document.content == "Mocked summary."


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_chunks_written_to_db(db_session, db_search_space, db_user, mocker):
    """Chunks derived from the source markdown are persisted in the DB."""
    await index_uploaded_file(
        markdown_content="## Hello\n\nSome content.",
        filename="test.pdf",
        etl_service="UNSTRUCTURED",
        search_space_id=db_search_space.id,
        user_id=str(db_user.id),
        session=db_session,
        llm=mocker.Mock(),
    )

    result = await db_session.execute(
        select(Document).filter(Document.search_space_id == db_search_space.id)
    )
    document = result.scalars().first()

    chunks_result = await db_session.execute(
        select(Chunk).filter(Chunk.document_id == document.id)
    )
    chunks = chunks_result.scalars().all()

    assert len(chunks) == 1
    assert chunks[0].content == "Test chunk content."


@pytest.mark.usefixtures("patched_summarize_raises", "patched_embed_text", "patched_chunk_text")
async def test_raises_on_indexing_failure(db_session, db_search_space, db_user, mocker):
    """RuntimeError is raised when the indexing step fails so the caller can fire a failure notification."""
    with pytest.raises(RuntimeError):
        await index_uploaded_file(
            markdown_content="## Hello\n\nSome content.",
            filename="test.pdf",
            etl_service="UNSTRUCTURED",
            search_space_id=db_search_space.id,
            user_id=str(db_user.id),
            session=db_session,
            llm=mocker.Mock(),
        )
@@ -0,0 +1,266 @@
import pytest
from sqlalchemy import select

from app.db import Chunk, Document, DocumentStatus
from app.indexing_pipeline.indexing_pipeline_service import IndexingPipelineService

pytestmark = pytest.mark.integration


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_sets_status_ready(
    db_session, db_search_space, make_connector_document, mocker,
):
    """Document status is READY after successful indexing."""
    connector_doc = make_connector_document(search_space_id=db_search_space.id)
    service = IndexingPipelineService(session=db_session)

    prepared = await service.prepare_for_indexing([connector_doc])
    document = prepared[0]
    document_id = document.id

    await service.index(document, connector_doc, llm=mocker.Mock())

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    reloaded = result.scalars().first()

    assert DocumentStatus.is_state(reloaded.status, DocumentStatus.READY)


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_content_is_summary_when_should_summarize_true(
    db_session, db_search_space, make_connector_document, mocker,
):
    """Document content is set to the LLM-generated summary when should_summarize=True."""
    connector_doc = make_connector_document(search_space_id=db_search_space.id)
    service = IndexingPipelineService(session=db_session)

    prepared = await service.prepare_for_indexing([connector_doc])
    document = prepared[0]
    document_id = document.id

    await service.index(document, connector_doc, llm=mocker.Mock())

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    reloaded = result.scalars().first()

    assert reloaded.content == "Mocked summary."


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_content_is_source_markdown_when_should_summarize_false(
    db_session, db_search_space, make_connector_document,
):
    """Document content is set to source_markdown verbatim when should_summarize=False."""
    connector_doc = make_connector_document(
        search_space_id=db_search_space.id,
        should_summarize=False,
        source_markdown="## Raw content",
    )
    service = IndexingPipelineService(session=db_session)

    prepared = await service.prepare_for_indexing([connector_doc])
    document = prepared[0]
    document_id = document.id

    await service.index(document, connector_doc, llm=None)

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    reloaded = result.scalars().first()

    assert reloaded.content == "## Raw content"


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_chunks_written_to_db(
    db_session, db_search_space, make_connector_document, mocker,
):
    """Chunks derived from source_markdown are persisted in the DB."""
    connector_doc = make_connector_document(search_space_id=db_search_space.id)
    service = IndexingPipelineService(session=db_session)

    prepared = await service.prepare_for_indexing([connector_doc])
    document = prepared[0]
    document_id = document.id

    await service.index(document, connector_doc, llm=mocker.Mock())

    result = await db_session.execute(
        select(Chunk).filter(Chunk.document_id == document_id)
    )
    chunks = result.scalars().all()

    assert len(chunks) == 1
    assert chunks[0].content == "Test chunk content."


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_embedding_written_to_db(
    db_session, db_search_space, make_connector_document, mocker,
):
    """Document embedding vector is persisted in the DB after indexing."""
    connector_doc = make_connector_document(search_space_id=db_search_space.id)
    service = IndexingPipelineService(session=db_session)

    prepared = await service.prepare_for_indexing([connector_doc])
    document = prepared[0]
    document_id = document.id

    await service.index(document, connector_doc, llm=mocker.Mock())

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    reloaded = result.scalars().first()

    assert reloaded.embedding is not None
    assert len(reloaded.embedding) == 1024


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_updated_at_advances_after_indexing(
    db_session, db_search_space, make_connector_document, mocker,
):
    """updated_at timestamp is later after indexing than it was at prepare time."""
    connector_doc = make_connector_document(search_space_id=db_search_space.id)
    service = IndexingPipelineService(session=db_session)

    prepared = await service.prepare_for_indexing([connector_doc])
    document = prepared[0]
    document_id = document.id

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    updated_at_pending = result.scalars().first().updated_at

    await service.index(document, connector_doc, llm=mocker.Mock())

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    updated_at_ready = result.scalars().first().updated_at

    assert updated_at_ready > updated_at_pending


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_no_llm_falls_back_to_source_markdown(
    db_session, db_search_space, make_connector_document,
):
    """When llm=None and no fallback_summary, content falls back to source_markdown."""
    connector_doc = make_connector_document(
        search_space_id=db_search_space.id,
        should_summarize=True,
        source_markdown="## Fallback content",
    )
    service = IndexingPipelineService(session=db_session)

    prepared = await service.prepare_for_indexing([connector_doc])
    document = prepared[0]
    document_id = document.id

    await service.index(document, connector_doc, llm=None)
|
||||||
|
|
||||||
|
result = await db_session.execute(select(Document).filter(Document.id == document_id))
|
||||||
|
reloaded = result.scalars().first()
|
||||||
|
|
||||||
|
assert DocumentStatus.is_state(reloaded.status, DocumentStatus.READY)
|
||||||
|
assert reloaded.content == "## Fallback content"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
|
||||||
|
async def test_fallback_summary_used_when_llm_unavailable(
|
||||||
|
db_session, db_search_space, make_connector_document,
|
||||||
|
):
|
||||||
|
"""fallback_summary is used as content when llm=None and should_summarize=True."""
|
||||||
|
connector_doc = make_connector_document(
|
||||||
|
search_space_id=db_search_space.id,
|
||||||
|
should_summarize=True,
|
||||||
|
source_markdown="## Full raw content",
|
||||||
|
fallback_summary="Short pre-built summary.",
|
||||||
|
)
|
||||||
|
service = IndexingPipelineService(session=db_session)
|
||||||
|
|
||||||
|
prepared = await service.prepare_for_indexing([connector_doc])
|
||||||
|
document_id = prepared[0].id
|
||||||
|
|
||||||
|
await service.index(prepared[0], connector_doc, llm=None)
|
||||||
|
|
||||||
|
result = await db_session.execute(select(Document).filter(Document.id == document_id))
|
||||||
|
reloaded = result.scalars().first()
|
||||||
|
|
||||||
|
assert DocumentStatus.is_state(reloaded.status, DocumentStatus.READY)
|
||||||
|
assert reloaded.content == "Short pre-built summary."
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
|
||||||
|
async def test_reindex_replaces_old_chunks(
|
||||||
|
db_session, db_search_space, make_connector_document, mocker,
|
||||||
|
):
|
||||||
|
"""Re-indexing a document replaces its old chunks rather than appending."""
|
||||||
|
connector_doc = make_connector_document(
|
||||||
|
search_space_id=db_search_space.id,
|
||||||
|
source_markdown="## v1",
|
||||||
|
)
|
||||||
|
service = IndexingPipelineService(session=db_session)
|
||||||
|
|
||||||
|
prepared = await service.prepare_for_indexing([connector_doc])
|
||||||
|
document = prepared[0]
|
||||||
|
document_id = document.id
|
||||||
|
|
||||||
|
await service.index(document, connector_doc, llm=mocker.Mock())
|
||||||
|
|
||||||
|
updated_doc = make_connector_document(
|
||||||
|
search_space_id=db_search_space.id,
|
||||||
|
source_markdown="## v2",
|
||||||
|
)
|
||||||
|
re_prepared = await service.prepare_for_indexing([updated_doc])
|
||||||
|
await service.index(re_prepared[0], updated_doc, llm=mocker.Mock())
|
||||||
|
|
||||||
|
result = await db_session.execute(
|
||||||
|
select(Chunk).filter(Chunk.document_id == document_id)
|
||||||
|
)
|
||||||
|
chunks = result.scalars().all()
|
||||||
|
|
||||||
|
assert len(chunks) == 1
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.usefixtures("patched_summarize_raises", "patched_embed_text", "patched_chunk_text")
|
||||||
|
async def test_llm_error_sets_status_failed(
|
||||||
|
db_session, db_search_space, make_connector_document, mocker,
|
||||||
|
):
|
||||||
|
"""Document status is FAILED when the LLM raises during indexing."""
|
||||||
|
connector_doc = make_connector_document(search_space_id=db_search_space.id)
|
||||||
|
service = IndexingPipelineService(session=db_session)
|
||||||
|
|
||||||
|
prepared = await service.prepare_for_indexing([connector_doc])
|
||||||
|
document = prepared[0]
|
||||||
|
document_id = document.id
|
||||||
|
|
||||||
|
await service.index(document, connector_doc, llm=mocker.Mock())
|
||||||
|
|
||||||
|
result = await db_session.execute(select(Document).filter(Document.id == document_id))
|
||||||
|
reloaded = result.scalars().first()
|
||||||
|
|
||||||
|
assert DocumentStatus.is_state(reloaded.status, DocumentStatus.FAILED)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.usefixtures("patched_summarize_raises", "patched_embed_text", "patched_chunk_text")
|
||||||
|
async def test_llm_error_leaves_no_partial_data(
|
||||||
|
db_session, db_search_space, make_connector_document, mocker,
|
||||||
|
):
|
||||||
|
"""A failed indexing attempt leaves no partial embedding or chunks in the DB."""
|
||||||
|
connector_doc = make_connector_document(search_space_id=db_search_space.id)
|
||||||
|
service = IndexingPipelineService(session=db_session)
|
||||||
|
|
||||||
|
prepared = await service.prepare_for_indexing([connector_doc])
|
||||||
|
document = prepared[0]
|
||||||
|
document_id = document.id
|
||||||
|
|
||||||
|
await service.index(document, connector_doc, llm=mocker.Mock())
|
||||||
|
|
||||||
|
result = await db_session.execute(select(Document).filter(Document.id == document_id))
|
||||||
|
reloaded = result.scalars().first()
|
||||||
|
|
||||||
|
assert reloaded.embedding is None
|
||||||
|
assert reloaded.content == "Pending..."
|
||||||
|
|
||||||
|
chunks_result = await db_session.execute(
|
||||||
|
select(Chunk).filter(Chunk.document_id == document_id)
|
||||||
|
)
|
||||||
|
assert chunks_result.scalars().all() == []
|
||||||
|
|
@@ -0,0 +1,377 @@
import pytest
from sqlalchemy import select

from app.db import Document, DocumentStatus
from app.indexing_pipeline.document_hashing import compute_content_hash as real_compute_content_hash
from app.indexing_pipeline.indexing_pipeline_service import IndexingPipelineService

pytestmark = pytest.mark.integration


async def test_new_document_is_persisted_with_pending_status(
    db_session, db_search_space, make_connector_document
):
    """A new document is created in the DB with PENDING status and correct markdown."""
    doc = make_connector_document(search_space_id=db_search_space.id)
    service = IndexingPipelineService(session=db_session)

    results = await service.prepare_for_indexing([doc])

    assert len(results) == 1
    document_id = results[0].id

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    reloaded = result.scalars().first()

    assert reloaded is not None
    assert DocumentStatus.is_state(reloaded.status, DocumentStatus.PENDING)
    assert reloaded.source_markdown == doc.source_markdown


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_unchanged_ready_document_is_skipped(
    db_session, db_search_space, make_connector_document, mocker,
):
    """A READY document with unchanged content is not returned for re-indexing."""
    doc = make_connector_document(search_space_id=db_search_space.id)
    service = IndexingPipelineService(session=db_session)

    # Index fully so the document reaches ready state
    prepared = await service.prepare_for_indexing([doc])
    await service.index(prepared[0], doc, llm=mocker.Mock())

    # Same content on the next run: a ready document must be skipped
    results = await service.prepare_for_indexing([doc])

    assert results == []


@pytest.mark.usefixtures("patched_summarize", "patched_embed_text", "patched_chunk_text")
async def test_title_only_change_updates_title_in_db(
    db_session, db_search_space, make_connector_document, mocker,
):
    """A title-only change updates the DB title without re-queuing the document."""
    original = make_connector_document(search_space_id=db_search_space.id, title="Original Title")
    service = IndexingPipelineService(session=db_session)

    prepared = await service.prepare_for_indexing([original])
    document_id = prepared[0].id
    await service.index(prepared[0], original, llm=mocker.Mock())

    renamed = make_connector_document(search_space_id=db_search_space.id, title="Updated Title")
    results = await service.prepare_for_indexing([renamed])

    assert results == []

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    reloaded = result.scalars().first()

    assert reloaded.title == "Updated Title"


async def test_changed_content_is_returned_for_reprocessing(
    db_session, db_search_space, make_connector_document
):
    """A document with changed content is returned for re-indexing with updated markdown."""
    original = make_connector_document(search_space_id=db_search_space.id, source_markdown="## v1")
    service = IndexingPipelineService(session=db_session)

    first = await service.prepare_for_indexing([original])
    original_id = first[0].id

    updated = make_connector_document(search_space_id=db_search_space.id, source_markdown="## v2")
    results = await service.prepare_for_indexing([updated])

    assert len(results) == 1
    assert results[0].id == original_id

    result = await db_session.execute(select(Document).filter(Document.id == original_id))
    reloaded = result.scalars().first()

    assert reloaded.source_markdown == "## v2"
    assert DocumentStatus.is_state(reloaded.status, DocumentStatus.PENDING)


async def test_all_documents_in_batch_are_persisted(
    db_session, db_search_space, make_connector_document
):
    """All documents in a batch are persisted and returned."""
    docs = [
        make_connector_document(search_space_id=db_search_space.id, unique_id="id-1", title="Doc 1", source_markdown="## Content 1"),
        make_connector_document(search_space_id=db_search_space.id, unique_id="id-2", title="Doc 2", source_markdown="## Content 2"),
        make_connector_document(search_space_id=db_search_space.id, unique_id="id-3", title="Doc 3", source_markdown="## Content 3"),
    ]
    service = IndexingPipelineService(session=db_session)

    results = await service.prepare_for_indexing(docs)

    assert len(results) == 3

    result = await db_session.execute(select(Document).filter(Document.search_space_id == db_search_space.id))
    rows = result.scalars().all()

    assert len(rows) == 3


async def test_duplicate_in_batch_is_persisted_once(
    db_session, db_search_space, make_connector_document
):
    """The same document passed twice in a batch is only persisted once."""
    doc = make_connector_document(search_space_id=db_search_space.id)
    service = IndexingPipelineService(session=db_session)

    results = await service.prepare_for_indexing([doc, doc])

    assert len(results) == 1

    result = await db_session.execute(select(Document).filter(Document.search_space_id == db_search_space.id))
    rows = result.scalars().all()

    assert len(rows) == 1


async def test_created_by_id_is_persisted(
    db_session, db_user, db_search_space, make_connector_document
):
    """created_by_id from the connector document is persisted on the DB row."""
    doc = make_connector_document(
        search_space_id=db_search_space.id,
        created_by_id=str(db_user.id),
    )
    service = IndexingPipelineService(session=db_session)

    results = await service.prepare_for_indexing([doc])
    document_id = results[0].id

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    reloaded = result.scalars().first()

    assert str(reloaded.created_by_id) == str(db_user.id)


async def test_metadata_is_updated_when_content_changes(
    db_session, db_search_space, make_connector_document
):
    """document_metadata is overwritten with the latest metadata when content changes."""
    original = make_connector_document(
        search_space_id=db_search_space.id,
        source_markdown="## v1",
        metadata={"status": "in_progress"},
    )
    service = IndexingPipelineService(session=db_session)

    first = await service.prepare_for_indexing([original])
    document_id = first[0].id

    updated = make_connector_document(
        search_space_id=db_search_space.id,
        source_markdown="## v2",
        metadata={"status": "done"},
    )
    await service.prepare_for_indexing([updated])

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    reloaded = result.scalars().first()

    assert reloaded.document_metadata == {"status": "done"}


async def test_updated_at_advances_when_title_only_changes(
    db_session, db_search_space, make_connector_document
):
    """updated_at advances even when only the title changes."""
    original = make_connector_document(search_space_id=db_search_space.id, title="Old Title")
    service = IndexingPipelineService(session=db_session)

    first = await service.prepare_for_indexing([original])
    document_id = first[0].id

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    updated_at_v1 = result.scalars().first().updated_at

    renamed = make_connector_document(search_space_id=db_search_space.id, title="New Title")
    await service.prepare_for_indexing([renamed])

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    updated_at_v2 = result.scalars().first().updated_at

    assert updated_at_v2 > updated_at_v1


async def test_updated_at_advances_when_content_changes(
    db_session, db_search_space, make_connector_document
):
    """updated_at advances when document content changes."""
    original = make_connector_document(search_space_id=db_search_space.id, source_markdown="## v1")
    service = IndexingPipelineService(session=db_session)

    first = await service.prepare_for_indexing([original])
    document_id = first[0].id

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    updated_at_v1 = result.scalars().first().updated_at

    updated = make_connector_document(search_space_id=db_search_space.id, source_markdown="## v2")
    await service.prepare_for_indexing([updated])

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    updated_at_v2 = result.scalars().first().updated_at

    assert updated_at_v2 > updated_at_v1


async def test_same_content_from_different_source_skipped_in_single_batch(
    db_session, db_search_space, make_connector_document
):
    """Two documents with identical content in the same batch result in only one being persisted."""
    first = make_connector_document(
        search_space_id=db_search_space.id,
        unique_id="source-a",
        source_markdown="## Shared content",
    )
    second = make_connector_document(
        search_space_id=db_search_space.id,
        unique_id="source-b",
        source_markdown="## Shared content",
    )
    service = IndexingPipelineService(session=db_session)

    results = await service.prepare_for_indexing([first, second])

    assert len(results) == 1

    result = await db_session.execute(
        select(Document).filter(Document.search_space_id == db_search_space.id)
    )
    assert len(result.scalars().all()) == 1


async def test_same_content_from_different_source_is_skipped(
    db_session, db_search_space, make_connector_document
):
    """A document with content identical to an already-indexed document is skipped."""
    first = make_connector_document(
        search_space_id=db_search_space.id,
        unique_id="source-a",
        source_markdown="## Shared content",
    )
    second = make_connector_document(
        search_space_id=db_search_space.id,
        unique_id="source-b",
        source_markdown="## Shared content",
    )
    service = IndexingPipelineService(session=db_session)

    await service.prepare_for_indexing([first])
    results = await service.prepare_for_indexing([second])

    assert results == []

    result = await db_session.execute(
        select(Document).filter(Document.search_space_id == db_search_space.id)
    )
    assert len(result.scalars().all()) == 1


@pytest.mark.usefixtures("patched_summarize_raises", "patched_embed_text", "patched_chunk_text")
async def test_failed_document_with_unchanged_content_is_requeued(
    db_session, db_search_space, make_connector_document, mocker,
):
    """A FAILED document with unchanged content is re-queued as PENDING on the next run."""
    doc = make_connector_document(search_space_id=db_search_space.id)
    service = IndexingPipelineService(session=db_session)

    # First run: document is created and indexing crashes, so status = failed
    prepared = await service.prepare_for_indexing([doc])
    document_id = prepared[0].id
    await service.index(prepared[0], doc, llm=mocker.Mock())

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    assert DocumentStatus.is_state(result.scalars().first().status, DocumentStatus.FAILED)

    # Next run: same content, pipeline must re-queue the failed document
    results = await service.prepare_for_indexing([doc])

    assert len(results) == 1
    assert results[0].id == document_id

    result = await db_session.execute(select(Document).filter(Document.id == document_id))
    assert DocumentStatus.is_state(result.scalars().first().status, DocumentStatus.PENDING)


async def test_title_and_content_change_updates_both_and_returns_document(
    db_session, db_search_space, make_connector_document
):
    """When both title and content change, both are updated and the document is returned for re-indexing."""
    original = make_connector_document(
        search_space_id=db_search_space.id,
        title="Original Title",
        source_markdown="## v1",
    )
    service = IndexingPipelineService(session=db_session)

    first = await service.prepare_for_indexing([original])
    original_id = first[0].id

    updated = make_connector_document(
        search_space_id=db_search_space.id,
        title="Updated Title",
        source_markdown="## v2",
    )
    results = await service.prepare_for_indexing([updated])

    assert len(results) == 1
    assert results[0].id == original_id

    result = await db_session.execute(select(Document).filter(Document.id == original_id))
    reloaded = result.scalars().first()

    assert reloaded.title == "Updated Title"
    assert reloaded.source_markdown == "## v2"


async def test_one_bad_document_in_batch_does_not_prevent_others_from_being_persisted(
    db_session, db_search_space, make_connector_document, monkeypatch,
):
    """
    A per-document error during prepare_for_indexing must be isolated.
    The two valid documents around the failing one must still be persisted.
    """
    docs = [
        make_connector_document(
            search_space_id=db_search_space.id,
            unique_id="good-1",
            source_markdown="## Good doc 1",
        ),
        make_connector_document(
            search_space_id=db_search_space.id,
            unique_id="will-fail",
            source_markdown="## Bad doc",
        ),
        make_connector_document(
            search_space_id=db_search_space.id,
            unique_id="good-2",
            source_markdown="## Good doc 2",
        ),
    ]

    def compute_content_hash_with_error(doc):
        if doc.unique_id == "will-fail":
            raise RuntimeError("Simulated per-document failure")
        return real_compute_content_hash(doc)

    monkeypatch.setattr(
        "app.indexing_pipeline.indexing_pipeline_service.compute_content_hash",
        compute_content_hash_with_error,
    )

    service = IndexingPipelineService(session=db_session)
    results = await service.prepare_for_indexing(docs)

    assert len(results) == 2

    result = await db_session.execute(
        select(Document).filter(Document.search_space_id == db_search_space.id)
    )
    assert len(result.scalars().all()) == 2
0
surfsense_backend/tests/unit/__init__.py
Normal file
0
surfsense_backend/tests/unit/adapters/__init__.py
Normal file
33
surfsense_backend/tests/unit/indexing_pipeline/conftest.py
Normal file
@@ -0,0 +1,33 @@
import pytest
from unittest.mock import AsyncMock, MagicMock


@pytest.fixture
def patched_summarizer_chain(monkeypatch):
    chain = MagicMock()
    chain.ainvoke = AsyncMock(return_value=MagicMock(content="The summary."))

    template = MagicMock()
    template.__or__ = MagicMock(return_value=chain)

    monkeypatch.setattr(
        "app.indexing_pipeline.document_summarizer.SUMMARY_PROMPT_TEMPLATE",
        template,
    )
    return chain


@pytest.fixture
def patched_chunker_instance(monkeypatch):
    mock = MagicMock()
    mock.chunk.return_value = [MagicMock(text="prose chunk")]
    monkeypatch.setattr("app.indexing_pipeline.document_chunker.config.chunker_instance", mock)
    return mock


@pytest.fixture
def patched_code_chunker_instance(monkeypatch):
    mock = MagicMock()
    mock.chunk.return_value = [MagicMock(text="code chunk")]
    monkeypatch.setattr("app.indexing_pipeline.document_chunker.config.code_chunker_instance", mock)
    return mock
@@ -0,0 +1,112 @@
import pytest
from pydantic import ValidationError

from app.db import DocumentType
from app.indexing_pipeline.connector_document import ConnectorDocument


def test_valid_document_created_with_required_fields():
    """All optional fields default correctly when only required fields are supplied."""
    doc = ConnectorDocument(
        title="Task",
        source_markdown="## Task\n\nSome content.",
        unique_id="task-1",
        document_type=DocumentType.CLICKUP_CONNECTOR,
        search_space_id=1,
        connector_id=42,
        created_by_id="00000000-0000-0000-0000-000000000001",
    )
    assert doc.should_summarize is True
    assert doc.should_use_code_chunker is False
    assert doc.metadata == {}
    assert doc.connector_id == 42
    assert doc.created_by_id == "00000000-0000-0000-0000-000000000001"


def test_omitting_created_by_id_raises():
    """Omitting created_by_id raises a validation error."""
    with pytest.raises(ValidationError):
        ConnectorDocument(
            title="Task",
            source_markdown="## Content",
            unique_id="task-1",
            document_type=DocumentType.CLICKUP_CONNECTOR,
            search_space_id=1,
            connector_id=42,
        )


def test_empty_source_markdown_raises():
    """Empty source_markdown raises a validation error."""
    with pytest.raises(ValidationError):
        ConnectorDocument(
            title="Task",
            source_markdown="",
            unique_id="task-1",
            document_type=DocumentType.CLICKUP_CONNECTOR,
            search_space_id=1,
        )


def test_whitespace_only_source_markdown_raises():
    """Whitespace-only source_markdown raises a validation error."""
    with pytest.raises(ValidationError):
        ConnectorDocument(
            title="Task",
            source_markdown=" \n\t ",
            unique_id="task-1",
            document_type=DocumentType.CLICKUP_CONNECTOR,
            search_space_id=1,
        )


def test_empty_title_raises():
    """Empty title raises a validation error."""
    with pytest.raises(ValidationError):
        ConnectorDocument(
            title="",
            source_markdown="## Content",
            unique_id="task-1",
            document_type=DocumentType.CLICKUP_CONNECTOR,
            search_space_id=1,
        )


def test_empty_created_by_id_raises():
    """Empty created_by_id raises a validation error."""
    with pytest.raises(ValidationError):
        ConnectorDocument(
            title="Task",
            source_markdown="## Content",
            unique_id="task-1",
            document_type=DocumentType.CLICKUP_CONNECTOR,
            search_space_id=1,
            connector_id=42,
            created_by_id="",
        )


def test_zero_search_space_id_raises():
    """search_space_id of zero raises a validation error."""
    with pytest.raises(ValidationError):
        ConnectorDocument(
            title="Task",
            source_markdown="## Content",
            unique_id="task-1",
            document_type=DocumentType.CLICKUP_CONNECTOR,
            search_space_id=0,
            connector_id=42,
            created_by_id="00000000-0000-0000-0000-000000000001",
        )


def test_empty_unique_id_raises():
    """Empty unique_id raises a validation error."""
    with pytest.raises(ValidationError):
        ConnectorDocument(
            title="Task",
            source_markdown="## Content",
            unique_id="",
            document_type=DocumentType.CLICKUP_CONNECTOR,
            search_space_id=1,
        )
@@ -0,0 +1,21 @@
import pytest

from app.indexing_pipeline.document_chunker import chunk_text

pytestmark = pytest.mark.unit


@pytest.mark.usefixtures("patched_chunker_instance", "patched_code_chunker_instance")
def test_uses_code_chunker_when_flag_is_true():
    """Code chunker is selected when use_code_chunker=True."""
    result = chunk_text("def foo(): pass", use_code_chunker=True)

    assert result == ["code chunk"]


@pytest.mark.usefixtures("patched_chunker_instance", "patched_code_chunker_instance")
def test_uses_default_chunker_when_flag_is_false():
    """Default prose chunker is selected when use_code_chunker=False."""
    result = chunk_text("Some prose text.", use_code_chunker=False)

    assert result == ["prose chunk"]
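The two tests above pin down `chunk_text`'s contract: it dispatches between a prose chunker and a code chunker based on a flag and returns the chunk texts. A minimal sketch of that dispatch, with illustrative stand-in chunkers in place of the project's real `config` objects (the `_StubChunker` class and `SimpleNamespace` config are assumptions, not SurfSense's actual implementation):

```python
from types import SimpleNamespace

# Stand-in chunkers mimicking the shape the fixtures mock:
# .chunk(text) returns objects exposing a .text attribute.
class _StubChunker:
    def __init__(self, label: str):
        self.label = label

    def chunk(self, text: str):
        return [SimpleNamespace(text=f"{self.label} chunk")]

# Hypothetical config; the real module reads config.chunker_instance
# and config.code_chunker_instance from application config.
config = SimpleNamespace(
    chunker_instance=_StubChunker("prose"),
    code_chunker_instance=_StubChunker("code"),
)

def chunk_text(text: str, use_code_chunker: bool = False) -> list[str]:
    # Select the code-aware chunker only when the caller asks for it.
    chunker = config.code_chunker_instance if use_code_chunker else config.chunker_instance
    return [chunk.text for chunk in chunker.chunk(text)]
```

With these stubs, `chunk_text("def foo(): pass", use_code_chunker=True)` yields `["code chunk"]`, matching what the patched fixtures make the tests expect.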
@ -0,0 +1,48 @@
import pytest

from app.db import DocumentType
from app.indexing_pipeline.document_hashing import compute_content_hash, compute_unique_identifier_hash

pytestmark = pytest.mark.unit


def test_different_unique_id_produces_different_hash(make_connector_document):
    """Two documents with different unique_ids produce different identifier hashes."""
    doc_a = make_connector_document(unique_id="id-001")
    doc_b = make_connector_document(unique_id="id-002")
    assert compute_unique_identifier_hash(doc_a) != compute_unique_identifier_hash(doc_b)


def test_different_search_space_produces_different_identifier_hash(make_connector_document):
    """Same document in different search spaces produces different identifier hashes."""
    doc_a = make_connector_document(search_space_id=1)
    doc_b = make_connector_document(search_space_id=2)
    assert compute_unique_identifier_hash(doc_a) != compute_unique_identifier_hash(doc_b)


def test_different_document_type_produces_different_identifier_hash(make_connector_document):
    """Same unique_id with different document types produces different identifier hashes."""
    doc_a = make_connector_document(document_type=DocumentType.CLICKUP_CONNECTOR)
    doc_b = make_connector_document(document_type=DocumentType.NOTION_CONNECTOR)
    assert compute_unique_identifier_hash(doc_a) != compute_unique_identifier_hash(doc_b)


def test_same_content_same_space_produces_same_content_hash(make_connector_document):
    """Identical content in the same search space always produces the same content hash."""
    doc_a = make_connector_document(source_markdown="Hello world", search_space_id=1)
    doc_b = make_connector_document(source_markdown="Hello world", search_space_id=1)
    assert compute_content_hash(doc_a) == compute_content_hash(doc_b)


def test_same_content_different_space_produces_different_content_hash(make_connector_document):
    """Identical content in different search spaces produces different content hashes."""
    doc_a = make_connector_document(source_markdown="Hello world", search_space_id=1)
    doc_b = make_connector_document(source_markdown="Hello world", search_space_id=2)
    assert compute_content_hash(doc_a) != compute_content_hash(doc_b)


def test_different_content_produces_different_content_hash(make_connector_document):
    """Different source markdown produces different content hashes."""
    doc_a = make_connector_document(source_markdown="Original content")
    doc_b = make_connector_document(source_markdown="Updated content")
    assert compute_content_hash(doc_a) != compute_content_hash(doc_b)
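The `make_connector_document` factory fixture and the two hash functions under test are defined elsewhere in the PR. A self-contained sketch consistent with the assertions above — the field names are inferred from the keyword arguments used in the tests, and the exact hashing scheme (SHA-256 over a composite key) is an assumption, not the PR's implementation — might look like:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ConnectorDocument:
    # Fields inferred from the keyword arguments used in the tests above.
    unique_id: str = "id-001"
    search_space_id: int = 1
    document_type: str = "CLICKUP_CONNECTOR"
    source_markdown: str = "Hello world"


def make_connector_document(**overrides) -> ConnectorDocument:
    """Factory with sensible defaults; tests override only the field under test."""
    return ConnectorDocument(**overrides)


# Hypothetical scheme consistent with the assertions: the identifier hash keys
# on (document_type, unique_id, search_space_id); the content hash keys on
# (search_space_id, source_markdown).
def compute_unique_identifier_hash(doc: ConnectorDocument) -> str:
    key = f"{doc.document_type}:{doc.unique_id}:{doc.search_space_id}"
    return hashlib.sha256(key.encode()).hexdigest()


def compute_content_hash(doc: ConnectorDocument) -> str:
    key = f"{doc.search_space_id}:{doc.source_markdown}"
    return hashlib.sha256(key.encode()).hexdigest()
```

Scoping both hashes to the search space is what makes the "same content, different space" cases above produce distinct hashes, so duplicate detection never crosses space boundaries.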
@@ -0,0 +1,42 @@
import pytest
from unittest.mock import MagicMock

from app.indexing_pipeline.document_summarizer import summarize_document

pytestmark = pytest.mark.unit


@pytest.mark.usefixtures("patched_summarizer_chain")
async def test_without_metadata_returns_raw_summary():
    """Summarizer returns the LLM output directly when no metadata is provided."""
    result = await summarize_document("# Content", llm=MagicMock(model="gpt-4"))

    assert result == "The summary."


@pytest.mark.usefixtures("patched_summarizer_chain")
async def test_with_metadata_includes_metadata_values_in_output():
    """Non-empty metadata values are prepended to the summary output."""
    result = await summarize_document(
        "# Content",
        llm=MagicMock(model="gpt-4"),
        metadata={"author": "Alice", "source": "Notion"},
    )

    assert "Alice" in result
    assert "Notion" in result


@pytest.mark.usefixtures("patched_summarizer_chain")
async def test_with_metadata_omits_empty_fields_from_output():
    """Empty metadata fields are omitted from the summary output."""
    result = await summarize_document(
        "# Content",
        llm=MagicMock(model="gpt-4"),
        metadata={"author": "Alice", "description": ""},
    )

    assert "Alice" in result
    assert "description" not in result.lower()
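The `patched_summarizer_chain` fixture these async tests rely on is also defined elsewhere in the PR. A hedged sketch of the pattern — the patch target (`build_summary_chain`) and chain API are hypothetical names invented for illustration, not confirmed by this diff — could be:

```python
# Hypothetical conftest.py sketch: patch the summarization chain so no real
# LLM call happens. The patch target and chain interface are assumptions.
import pytest
from unittest.mock import AsyncMock, MagicMock


@pytest.fixture
def patched_summarizer_chain(monkeypatch):
    chain = MagicMock()
    # summarize_document is async, so the chain invocation must be awaitable;
    # the canned text matches the "The summary." assertion above.
    chain.ainvoke = AsyncMock(return_value=MagicMock(content="The summary."))
    monkeypatch.setattr(
        "app.indexing_pipeline.document_summarizer.build_summary_chain",  # assumed
        lambda llm: chain,
    )
    return chain
```

With the LLM stubbed out, the tests above pin down only `summarize_document`'s own behavior: passing raw output through, and prepending non-empty metadata while dropping empty fields.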
6655 surfsense_backend/uv.lock generated
File diff suppressed because it is too large