mirror of https://github.com/MODSetter/SurfSense.git, synced 2026-04-26 09:16:22 +02:00
feat: Add end-to-end tests for document upload pipeline and shared test utilities
- Introduced new test files for end-to-end testing of document uploads, including support for .txt, .md, and .pdf formats.
- Created shared fixtures and helper functions for authentication, document management, and cleanup.
- Added sample documents for testing purposes.
- Established a conftest.py file to provide reusable fixtures across test modules.
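The shared helpers for document management and cleanup described above could follow the common pytest pattern of tracking created resources and tearing them down after the test. The sketch below is a minimal, standalone illustration of that pattern; the `DocumentClient` stub and all method names are hypothetical and do not reflect the actual SurfSense test utilities in this commit:

```python
from dataclasses import dataclass, field


@dataclass
class DocumentClient:
    """Stub standing in for an authenticated API client (hypothetical)."""

    docs: dict = field(default_factory=dict)
    _next_id: int = 1

    def upload(self, filename: str, content: bytes) -> int:
        # Pretend to upload and return the new document's id.
        doc_id = self._next_id
        self._next_id += 1
        self.docs[doc_id] = (filename, content)
        return doc_id

    def delete(self, doc_id: int) -> None:
        self.docs.pop(doc_id, None)


class DocumentManager:
    """Tracks every uploaded document so teardown can remove them all."""

    def __init__(self, client: DocumentClient):
        self.client = client
        self.created: list[int] = []

    def upload(self, filename: str, content: bytes) -> int:
        doc_id = self.client.upload(filename, content)
        self.created.append(doc_id)  # remember for cleanup
        return doc_id

    def cleanup(self) -> None:
        # Delete everything this test created, then forget the ids.
        for doc_id in self.created:
            self.client.delete(doc_id)
        self.created.clear()
```

In a conftest.py this would typically be exposed as a yield fixture (`manager = DocumentManager(client); yield manager; manager.cleanup()`) so every test module gets automatic cleanup without repeating teardown code.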
parent b7447b26f9
commit 41eb68663a

10 changed files with 802 additions and 0 deletions
surfsense_backend/tests/fixtures/empty.pdf (vendored, new file, 0 lines)
surfsense_backend/tests/fixtures/sample.md (vendored, new file, 51 lines)
@@ -0,0 +1,51 @@
# SurfSense Test Document

## Overview

This is a **sample markdown document** used for end-to-end testing of the manual
document upload pipeline. It includes various markdown formatting elements.

## Key Features

- Document upload and processing
- Automatic chunking of content
- Embedding generation for semantic search
- Real-time status tracking via ElectricSQL

## Technical Architecture

### Backend Stack

The SurfSense backend is built with:

1. **FastAPI** for the REST API
2. **PostgreSQL** with pgvector for vector storage
3. **Celery** with Redis for background task processing
4. **Docling/Unstructured** for document parsing (ETL)

### Processing Pipeline

Documents go through a multi-stage pipeline:

| Stage | Description |
|-------|-------------|
| Upload | File received via API endpoint |
| Parsing | Content extracted using ETL service |
| Chunking | Text split into semantic chunks |
| Embedding | Vector representations generated |
| Storage | Chunks stored with embeddings in pgvector |

## Code Example

```python
async def process_document(file_path: str) -> Document:
    content = extract_content(file_path)
    chunks = create_chunks(content)
    embeddings = generate_embeddings(chunks)
    return store_document(chunks, embeddings)
```

## Conclusion

This document serves as a test fixture to validate the complete document processing
pipeline from upload through to chunk creation and embedding storage.
surfsense_backend/tests/fixtures/sample.pdf (vendored, new binary file)
Binary file not shown.
surfsense_backend/tests/fixtures/sample.txt (vendored, new file, 34 lines)
@@ -0,0 +1,34 @@
SurfSense Document Upload Test

This is a sample text document used for end-to-end testing of the manual document
upload pipeline in SurfSense. The document contains multiple paragraphs to ensure
that the chunking system has enough content to work with.

Artificial Intelligence and Machine Learning

Artificial intelligence (AI) is a broad field of computer science concerned with
building smart machines capable of performing tasks that typically require human
intelligence. Machine learning is a subset of AI that enables systems to learn and
improve from experience without being explicitly programmed.

Natural Language Processing

Natural language processing (NLP) is a subfield of linguistics, computer science,
and artificial intelligence concerned with the interactions between computers and
human language. Key applications include machine translation, sentiment analysis,
text summarization, and question answering systems.

Vector Databases and Semantic Search

Vector databases store data as high-dimensional vectors, enabling efficient
similarity search operations. When combined with embedding models, they power
semantic search systems that understand the meaning behind queries rather than
relying on exact keyword matches. This technology is fundamental to modern
retrieval-augmented generation (RAG) systems.

Document Processing Pipelines

Modern document processing pipelines involve several stages: extraction, transformation,
chunking, embedding generation, and storage. Each stage plays a critical role in
converting raw documents into searchable, structured knowledge that can be retrieved
and used by AI systems for accurate information retrieval and generation.