SurfSense/surfsense_backend/app/schemas/documents.py

from datetime import datetime
from typing import TypeVar
from uuid import UUID

from pydantic import BaseModel, ConfigDict

from app.db import DocumentType

from .chunks import ChunkRead

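# Module-level type variable. PaginatedResponse[T] below declares its own
# PEP 695 type parameter, so that class does not depend on this name.
T = TypeVar("T")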


class ExtensionDocumentMetadata(BaseModel):
    BrowsingSessionId: str
    VisitedWebPageURL: str
    VisitedWebPageTitle: str
    VisitedWebPageDateWithTimeInISOString: str
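    # sic: the "Refferer" spelling is part of the field name; renaming it
    # would change the key this schema accepts.
    VisitedWebPageReffererURL: str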
    VisitedWebPageVisitDurationInMilliseconds: str


class ExtensionDocumentContent(BaseModel):
    metadata: ExtensionDocumentMetadata
    pageContent: str  # noqa: N815


class DocumentBase(BaseModel):
    document_type: DocumentType
    # Accepts structured extension payloads, a list of strings, or one string.
    content: list[ExtensionDocumentContent] | list[str] | str
    search_space_id: int


class DocumentsCreate(DocumentBase):
    pass


class DocumentUpdate(DocumentBase):
    pass


class DocumentStatusSchema(BaseModel):
    """Document processing status."""

    state: str  # "ready", "processing", "failed"
    reason: str | None = None


class DocumentRead(BaseModel):
    id: int
    title: str
    document_type: DocumentType
    document_metadata: dict
    content: str  # Always a plain string, matching what the frontend expects
    content_hash: str
    unique_identifier_hash: str | None
    created_at: datetime
    updated_at: datetime | None
    search_space_id: int
    created_by_id: UUID | None = None  # User who created/uploaded this document
    created_by_name: str | None = None
    created_by_email: str | None = None
    status: DocumentStatusSchema | None = (
        None  # Processing status (ready, processing, failed)
    )

    model_config = ConfigDict(from_attributes=True)


class DocumentWithChunksRead(DocumentRead):
    chunks: list[ChunkRead] = []

    model_config = ConfigDict(from_attributes=True)


class PaginatedResponse[T](BaseModel):
    items: list[T]
    total: int
    page: int
    page_size: int
    has_more: bool


class DocumentTitleRead(BaseModel):
    """Lightweight document response for mention picker - only essential fields."""

    id: int
    title: str
    document_type: DocumentType

    model_config = ConfigDict(from_attributes=True)


class DocumentTitleSearchResponse(BaseModel):
    """Response for document title search - optimized for typeahead."""

    items: list[DocumentTitleRead]
    has_more: bool


class DocumentStatusItemRead(BaseModel):
    """Lightweight document status payload for batch status polling."""

    id: int
    title: str
    document_type: DocumentType
    status: DocumentStatusSchema

    model_config = ConfigDict(from_attributes=True)


class DocumentStatusBatchResponse(BaseModel):
    """Batch status response for a set of document IDs."""

    items: list[DocumentStatusItemRead]
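
One way the `PaginatedResponse` fields could be filled in is sketched below. This is only an illustration under stated assumptions, not the project's actual endpoint logic: it uses a plain dataclass stand-in (`Page`) instead of the Pydantic model so it runs without dependencies, the `paginate` helper is hypothetical, and `page` is assumed to be zero-indexed.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass
class Page(Generic[T]):
    """Dataclass stand-in with the same fields as PaginatedResponse."""

    items: list[T]
    total: int
    page: int
    page_size: int
    has_more: bool


def paginate(rows: list[T], page: int, page_size: int) -> Page[T]:
    """Hypothetical helper: slice `rows` for a zero-indexed page (assumption)."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    start = page * page_size
    items = rows[start : start + page_size]
    # More results exist if the rows consumed so far fall short of the total.
    has_more = start + len(items) < len(rows)
    return Page(
        items=items,
        total=len(rows),
        page=page,
        page_size=page_size,
        has_more=has_more,
    )


first = paginate(["a", "b", "c", "d", "e"], page=0, page_size=2)
last = paginate(["a", "b", "c", "d", "e"], page=2, page_size=2)
```

In a real endpoint the slice and the total would more likely come from a database query with a limit and offset; the in-memory list here only keeps the sketch self-contained.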