mirror of https://github.com/MODSetter/SurfSense.git synced 2026-04-25 08:46:22 +02:00

Anish Sarkar 41eb68663a feat: Add end-to-end tests for document upload pipeline and shared test utilities

- Introduced new test files for end-to-end testing of document uploads, including support for .txt, .md, and .pdf formats.
- Created shared fixtures and helper functions for authentication, document management, and cleanup.
- Added sample documents for testing purposes.
- Established a conftest.py file to provide reusable fixtures across test modules.

2026-02-25 16:39:45 +05:30


SurfSense Test Document

Overview

This is a sample markdown document used for end-to-end testing of the manual document upload pipeline. It includes various markdown formatting elements.

Key Features

Document upload and processing
Automatic chunking of content
Embedding generation for semantic search
Real-time status tracking via ElectricSQL

Technical Architecture

Backend Stack

The SurfSense backend is built with:

FastAPI for the REST API
PostgreSQL with pgvector for vector storage
Celery with Redis for background task processing
Docling/Unstructured for document parsing (ETL)

Processing Pipeline

Documents go through a multi-stage pipeline:

Stage	Description
Upload	File received via API endpoint
Parsing	Content extracted using ETL service
Chunking	Text split into semantic chunks
Embedding	Vector representations generated
Storage	Chunks stored with embeddings in pgvector
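The Chunking stage can be illustrated with a small sketch. The fixture does not say how SurfSense splits text, so the function below is a hypothetical stand-in: it uses fixed-size word windows with overlap rather than true semantic chunking, and the name `chunk_text` and its parameters are assumptions made for illustration.

```python
def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of at most `max_words` words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears in full in at least one chunk.
    Hypothetical helper: not the SurfSense chunker.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

A real pipeline would more likely split on sentence or section boundaries, but the overlap idea carries over: each chunk is later embedded and stored on its own, so some shared context between neighbours helps retrieval.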

Code Example

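async def process_document(file_path: str) -> Document:
    # Parsing: extract the raw text from the uploaded file.
    content = extract_content(file_path)
    # Chunking: split the text into chunks.
    chunks = create_chunks(content)
    # Embedding: generate vector representations for the chunks.
    embeddings = generate_embeddings(chunks)
    # Storage: persist the chunks together with their embeddings.
    return store_document(chunks, embeddings)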
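The snippet above calls helpers that the fixture never defines. As a rough sketch of how it could be made runnable, the block below supplies hypothetical in-memory stand-ins for `Document`, `extract_content`, `create_chunks`, `generate_embeddings`, and `store_document`. None of these bodies come from SurfSense; in particular, the "embeddings" are hash-derived placeholder vectors, not output from a real embedding model.

```python
import asyncio
import hashlib
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical result type: chunks paired with their vectors."""
    chunks: list[str]
    embeddings: list[list[float]]


def extract_content(file_path: str) -> str:
    # Stand-in for the ETL step: read the file as UTF-8 text.
    with open(file_path, encoding="utf-8") as handle:
        return handle.read()


def create_chunks(content: str) -> list[str]:
    # Stand-in chunker: one chunk per non-empty paragraph.
    return [part.strip() for part in content.split("\n\n") if part.strip()]


def generate_embeddings(chunks: list[str]) -> list[list[float]]:
    # Placeholder vectors derived from a SHA-256 digest (4 dimensions).
    # These carry no semantic meaning; they only mimic the data shape.
    vectors = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).digest()
        vectors.append([byte / 255 for byte in digest[:4]])
    return vectors


def store_document(chunks: list[str], embeddings: list[list[float]]) -> Document:
    # Stand-in for the storage step: keep everything in memory.
    return Document(chunks=chunks, embeddings=embeddings)


async def process_document(file_path: str) -> Document:
    content = extract_content(file_path)
    chunks = create_chunks(content)
    embeddings = generate_embeddings(chunks)
    return store_document(chunks, embeddings)
```

With these definitions in place, `asyncio.run(process_document("sample.md"))` returns a `Document` whose `chunks` and `embeddings` lists have the same length; the file name here is only an example.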

Conclusion

This document serves as a test fixture to validate the complete document processing pipeline from upload through to chunk creation and embedding storage.
