SurfSense/surfsense_backend/tests/fixtures/sample.md

52 lines
1.4 KiB
Markdown
Raw Normal View History

# SurfSense Test Document
## Overview
This is a **sample markdown document** used for end-to-end testing of the manual
document upload pipeline. It includes various markdown formatting elements.
## Key Features
- Document upload and processing
- Automatic chunking of content
- Embedding generation for semantic search
- Real-time status tracking via ElectricSQL
## Technical Architecture
### Backend Stack
The SurfSense backend is built with:
1. **FastAPI** for the REST API
2. **PostgreSQL** with pgvector for vector storage
3. **Celery** with Redis for background task processing
4. **Docling/Unstructured** for document parsing (ETL)
### Processing Pipeline
Documents go through a multi-stage pipeline:
| Stage | Description |
|-------|-------------|
| Upload | File received via API endpoint |
| Parsing | Content extracted using ETL service |
| Chunking | Text split into semantic chunks |
| Embedding | Vector representations generated |
| Storage | Chunks stored with embeddings in pgvector |
## Code Example
```python
async def process_document(file_path: str) -> Document:
content = extract_content(file_path)
chunks = create_chunks(content)
embeddings = generate_embeddings(chunks)
return store_document(chunks, embeddings)
```
## Conclusion
This document serves as a test fixture to validate the complete document processing
pipeline from upload through to chunk creation and embedding storage.