SurfSense/surfsense_backend/tests/fixtures/sample.md

# SurfSense Test Document

## Overview

This is a **sample markdown document** used for end-to-end testing of the manual
document upload pipeline. It includes headings, bold text, bulleted and numbered
lists, a table, and fenced code.

## Key Features

- Document upload and processing
- Automatic chunking of content
- Embedding generation for semantic search
- Real-time status tracking via ElectricSQL
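
To illustrate what "embedding generation for semantic search" enables, here is a minimal sketch that ranks chunks by cosine similarity to a query vector. The names `cosine_similarity` and `rank_chunks` are hypothetical, and the in-memory dictionary is only a stand-in for a real vector store; nothing here is taken from the SurfSense codebase.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]]) -> list[str]:
    """Return chunk ids ordered from most to least similar to the query."""
    return sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )


# Toy two-dimensional vectors; real embeddings have far more dimensions.
chunk_vecs = {"intro": [1.0, 0.0], "pipeline": [0.6, 0.8], "other": [0.0, 1.0]}
ranking = rank_chunks([1.0, 0.1], chunk_vecs)
```

The query vector points mostly along the first axis, so `ranking` lists `"intro"` first and `"other"` last.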

## Technical Architecture

### Backend Stack

The SurfSense backend is built with:

1. **FastAPI** for the REST API
2. **PostgreSQL** with pgvector for vector storage
3. **Celery** with Redis for background task processing
4. **Docling/Unstructured** for document parsing (ETL)

### Processing Pipeline

Documents go through a multi-stage pipeline:

| Stage | Description |
|-------|-------------|
| Upload | File received via API endpoint |
| Parsing | Content extracted using ETL service |
| Chunking | Text split into semantic chunks |
| Embedding | Vector representations generated |
| Storage | Chunks stored with embeddings in pgvector |
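
The stages in the table can be modeled as an ordered enumeration, which is one way a status tracker might decide what comes next. This is a sketch under assumptions: the `Stage` enum, the `PIPELINE` list, and `next_stage` are hypothetical names, not the backend's actual types.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    UPLOAD = "upload"
    PARSING = "parsing"
    CHUNKING = "chunking"
    EMBEDDING = "embedding"
    STORAGE = "storage"


# Enum members iterate in definition order, which doubles as pipeline order.
PIPELINE = list(Stage)


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last stage."""
    idx = PIPELINE.index(current)
    return PIPELINE[idx + 1] if idx + 1 < len(PIPELINE) else None
```

For example, `next_stage(Stage.UPLOAD)` is `Stage.PARSING`, and `next_stage(Stage.STORAGE)` is `None`.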

## Code Example

```python
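async def process_document(file_path: str) -> Document:
    # Illustrative only: Document and the helpers below are not defined in this snippet.
    content = extract_content(file_path)        # Parsing: extract content from the file
    chunks = create_chunks(content)             # Chunking: split text into chunks
    embeddings = generate_embeddings(chunks)    # Embedding: generate vector representations
    return store_document(chunks, embeddings)   # Storage: persist chunks with embeddings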
```
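
The function above relies on helpers it does not define. A self-contained toy version might look like the following. Every helper body here is an assumption made for illustration: parsing is plain file reading, chunking splits on blank lines, and the "embedding" is a deterministic hash-derived vector with no semantic meaning.

```python
import asyncio
import hashlib
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical container; the real model is not shown in this document.
    chunks: list[str]
    embeddings: list[list[float]]


def extract_content(file_path: str) -> str:
    # Stand-in for the ETL step: read the file as UTF-8 text.
    with open(file_path, encoding="utf-8") as f:
        return f.read()


def create_chunks(content: str) -> list[str]:
    # Stand-in chunker: one chunk per non-empty, blank-line-separated block.
    return [block.strip() for block in content.split("\n\n") if block.strip()]


def generate_embeddings(chunks: list[str]) -> list[list[float]]:
    # Stand-in embedding: first 4 bytes of a SHA-256 digest scaled to [0, 1].
    # Deterministic, but NOT a semantic embedding.
    return [
        [byte / 255 for byte in hashlib.sha256(chunk.encode("utf-8")).digest()[:4]]
        for chunk in chunks
    ]


def store_document(chunks: list[str], embeddings: list[list[float]]) -> Document:
    # Stand-in for storage: keep everything in memory.
    return Document(chunks=chunks, embeddings=embeddings)


async def process_document(file_path: str) -> Document:
    content = extract_content(file_path)
    chunks = create_chunks(content)
    embeddings = generate_embeddings(chunks)
    return store_document(chunks, embeddings)
```

Because `process_document` is a coroutine, a script would call it with `asyncio.run(process_document("sample.md"))`, where the path is whatever file you want to process.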

## Conclusion

This document serves as a test fixture to validate the complete document processing
pipeline from upload through to chunk creation and embedding storage.

feat: Add end-to-end tests for document upload pipeline and shared test utilities

- Introduced new test files for end-to-end testing of document uploads, including support for .txt, .md, and .pdf formats.
- Created shared fixtures and helper functions for authentication, document management, and cleanup.
- Added sample documents for testing purposes.
- Established a conftest.py file to provide reusable fixtures across test modules.

2026-02-25 16:39:45 +05:30