# SurfSense Test Document

## Overview

This is a **sample Markdown document** used for end-to-end testing of the manual
document upload pipeline. It includes headings, emphasis, bulleted and numbered
lists, a table, and a fenced code block.

## Key Features

- Document upload and processing
- Automatic chunking of content
- Embedding generation for semantic search
- Real-time status tracking via ElectricSQL

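
To make the "embedding generation for semantic search" feature concrete, here is a minimal sketch of how chunks could be ranked against a query once both are vectors. The function names, the tiny two-dimensional vectors, and the in-memory dictionary are assumptions for illustration only. This document does not specify how SurfSense performs the comparison.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: List[float], chunk_vecs: Dict[str, List[float]], top_k: int = 3
) -> List[str]:
    """Return the ids of the top_k chunks most similar to the query (hypothetical helper)."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:top_k]]
```

Calling `rank_chunks([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}, top_k=2)` ranks chunk `a` first (same direction as the query) and chunk `c` second.
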
## Technical Architecture

### Backend Stack

The SurfSense backend is built with:

1. **FastAPI** for the REST API
2. **PostgreSQL** with pgvector for vector storage
3. **Celery** with Redis for background task processing
4. **Docling/Unstructured** for document parsing (ETL)

### Processing Pipeline

Documents go through a multi-stage pipeline:

| Stage | Description |
|-------|-------------|
| Upload | File received via API endpoint |
| Parsing | Content extracted using ETL service |
| Chunking | Text split into semantic chunks |
| Embedding | Vector representations generated |
| Storage | Chunks stored with embeddings in pgvector |

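
The stage order in the table can be written down as an enumeration, which is one way a status tracker might report progress. This is a minimal sketch: the `Stage` enum, the `PIPELINE` list, and the `next_stage` helper are hypothetical names chosen for illustration, not SurfSense's actual code.

```python
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    """Pipeline stages, declared in the order given in the table."""

    UPLOAD = "upload"
    PARSING = "parsing"
    CHUNKING = "chunking"
    EMBEDDING = "embedding"
    STORAGE = "storage"


# Iterating an Enum yields members in definition order.
PIPELINE: List[Stage] = list(Stage)


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last stage."""
    idx = PIPELINE.index(current)
    return PIPELINE[idx + 1] if idx + 1 < len(PIPELINE) else None
```

With this layout, `next_stage(Stage.UPLOAD)` is `Stage.PARSING`, and `next_stage(Stage.STORAGE)` is `None`, which signals that processing is complete.
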
## Code Example

```python
async def process_document(file_path: str) -> Document:
    # Each helper corresponds to one stage of the processing pipeline.
    content = extract_content(file_path)
    chunks = create_chunks(content)
    embeddings = generate_embeddings(chunks)
    return store_document(chunks, embeddings)
```

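
The snippet above relies on helpers and a `Document` type that this document does not define. To run it end to end, one could supply stand-ins such as the following. Every helper body here is an assumption made for illustration: chunks are split on blank lines rather than semantically, the "embedding" is a toy vector derived from a SHA-256 hash rather than a model output, and storage is an in-memory dataclass instead of pgvector. The `demo` function is likewise hypothetical.

```python
import asyncio
import hashlib
import os
import tempfile
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    """Stand-in result type: chunks paired with their vectors."""

    chunks: List[str]
    embeddings: List[List[float]]


def extract_content(file_path: str) -> str:
    # Stand-in for the ETL step: read the file as UTF-8 text.
    with open(file_path, encoding="utf-8") as fh:
        return fh.read()


def create_chunks(content: str) -> List[str]:
    # Stand-in chunker: one chunk per blank-line-separated paragraph.
    return [part.strip() for part in content.split("\n\n") if part.strip()]


def generate_embeddings(chunks: List[str]) -> List[List[float]]:
    # Toy "embedding": four hash bytes scaled to [0, 1]. Not a real model.
    vectors = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).digest()
        vectors.append([byte / 255 for byte in digest[:4]])
    return vectors


def store_document(chunks: List[str], embeddings: List[List[float]]) -> Document:
    # Stand-in for persistence: keep everything in memory.
    return Document(chunks=chunks, embeddings=embeddings)


async def process_document(file_path: str) -> Document:
    content = extract_content(file_path)
    chunks = create_chunks(content)
    embeddings = generate_embeddings(chunks)
    return store_document(chunks, embeddings)


def demo() -> Document:
    """Write a two-paragraph file, process it, and return the result."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".md", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("# Title\n\nFirst paragraph.\n")
        path = tmp.name
    try:
        return asyncio.run(process_document(path))
    finally:
        os.remove(path)
```

Calling `demo()` returns a `Document` with two chunks and two four-dimensional vectors. Swapping any stand-in for a real parser, chunker, or embedding model leaves `process_document` itself unchanged.
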
## Conclusion

This document serves as a test fixture to validate the complete document processing
pipeline from upload through to chunk creation and embedding storage.