feat: added Celery and removed background_tasks for MQs

- removed pre-commit hooks
- updated Docker setup
- updated GitHub Docker actions
- updated docs
DESKTOP-RTLN3BA\$punk 2025-10-20 00:30:00 -07:00
parent 031dc055da
commit c80bbfa867
27 changed files with 1664 additions and 1038 deletions

@@ -17,7 +17,7 @@ Before you begin, ensure you have:
- [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/install/) installed on your machine
- [Git](https://git-scm.com/downloads) (to clone the repository)
- Completed all the [prerequisite setup steps](/docs) including:
- PGVector setup
- Auth setup
- **File Processing ETL Service** (choose one):
- Unstructured.io API key (Supports 34+ formats)
- LlamaCloud API key (enhanced parsing, supports 50+ formats)
@@ -56,7 +56,7 @@ Before you begin, ensure you have:
Edit all `.env` files and fill in the required values:
### Docker-Specific Environment Variables
### Docker-Specific Environment Variables (Optional)
| ENV VARIABLE | DESCRIPTION | DEFAULT VALUE |
|----------------------------|-----------------------------------------------------------------------------|---------------------|
@@ -64,6 +64,8 @@ Before you begin, ensure you have:
| BACKEND_PORT | Port for the backend API service | 8000 |
| POSTGRES_PORT | Port for the PostgreSQL database | 5432 |
| PGADMIN_PORT | Port for pgAdmin web interface | 5050 |
| REDIS_PORT | Port for Redis (used by Celery) | 6379 |
| FLOWER_PORT | Port for Flower (Celery monitoring tool) | 5555 |
| POSTGRES_USER | PostgreSQL username | postgres |
| POSTGRES_PASSWORD | PostgreSQL password | postgres |
| POSTGRES_DB | PostgreSQL database name | surfsense |
@@ -81,7 +83,7 @@ Before you begin, ensure you have:
| AUTH_TYPE | Authentication method: `GOOGLE` for OAuth with Google, `LOCAL` for email/password authentication |
| GOOGLE_OAUTH_CLIENT_ID | (Optional) Client ID from Google Cloud Console (required if AUTH_TYPE=GOOGLE) |
| GOOGLE_OAUTH_CLIENT_SECRET | (Optional) Client secret from Google Cloud Console (required if AUTH_TYPE=GOOGLE) |
| EMBEDDING_MODEL | Name of the embedding model (e.g., `mixedbread-ai/mxbai-embed-large-v1`) |
| EMBEDDING_MODEL | Name of the embedding model (e.g., `sentence-transformers/all-MiniLM-L6-v2`, `openai://text-embedding-ada-002`) |
| RERANKERS_MODEL_NAME | Name of the reranker model (e.g., `ms-marco-MiniLM-L-12-v2`) |
| RERANKERS_MODEL_TYPE | Type of reranker model (e.g., `flashrank`) |
| TTS_SERVICE | Text-to-Speech API provider for Podcasts (e.g., `local/kokoro`, `openai/tts-1`). See [supported providers](https://docs.litellm.ai/docs/text_to_speech#supported-providers) |
@@ -94,6 +96,8 @@ Before you begin, ensure you have:
| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats), `LLAMACLOUD` (supports 50+ formats including legacy document types), or `DOCLING` (local processing, supports PDF, Office docs, images, HTML, CSV) |
| UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing (required if ETL_SERVICE=UNSTRUCTURED) |
| LLAMA_CLOUD_API_KEY | API key for LlamaCloud service for document parsing (required if ETL_SERVICE=LLAMACLOUD) |
| CELERY_BROKER_URL | Redis connection URL for Celery broker (e.g., `redis://localhost:6379/0`) |
| CELERY_RESULT_BACKEND | Redis connection URL for Celery result backend (e.g., `redis://localhost:6379/0`) |
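For a single-node setup, the broker and result backend can point at the same Redis database. A sketch of the corresponding `.env` lines (the values are examples; adjust host, port, and database index to your deployment):

```bash
# Example Celery settings for a local Redis on the default port
CELERY_BROKER_URL="redis://localhost:6379/0"
CELERY_RESULT_BACKEND="redis://localhost:6379/0"
```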
**Optional Backend LangSmith Observability:**

@@ -4,50 +4,8 @@ description: Required setups before setting up SurfSense
full: true
---
## PGVector Installation Guide
SurfSense requires the pgvector extension for PostgreSQL:
### Linux and Mac
Compile and install the extension (supports Postgres 13+)
```sh
cd /tmp
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo
```
See the [installation notes](https://github.com/pgvector/pgvector/tree/master#installation-notes---linux-and-mac) if you run into issues
### Windows
Ensure [C++ support in Visual Studio](https://learn.microsoft.com/en-us/cpp/build/building-on-the-command-line?view=msvc-170#download-and-install-the-tools) is installed, and run:
```cmd
call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
```
Note: The exact path will vary depending on your Visual Studio version and edition
Then use `nmake` to build:
```cmd
set "PGROOT=C:\Program Files\PostgreSQL\16"
cd %TEMP%
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
nmake /F Makefile.win
nmake /F Makefile.win install
```
See the [installation notes](https://github.com/pgvector/pgvector/tree/master#installation-notes---windows) if you run into issues
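After building on either platform, the extension still has to be enabled inside each database that uses it. A minimal sketch, assuming `psql` is on your PATH and the defaults from the Docker setup (`postgres` user, `surfsense` database); substitute your own names:

```sh
# Enable pgvector in the target database (run once per database)
psql -U postgres -d surfsense -c "CREATE EXTENSION IF NOT EXISTS vector;"
# Verify the installed extension version
psql -U postgres -d surfsense -c "SELECT extversion FROM pg_extension WHERE extname = 'vector';"
```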
---
## Google OAuth Setup (Optional)
## Auth Setup
SurfSense supports both Google OAuth and local email/password authentication. Google OAuth is optional; if you prefer local authentication, you can skip this section.

@@ -10,14 +10,28 @@ This guide provides step-by-step instructions for setting up SurfSense without D
## Prerequisites
Before beginning the manual installation, ensure you have completed all the [prerequisite setup steps](/docs), including:
Before beginning the manual installation, ensure you have the following installed and configured:
- PGVector setup
### Required Software
- **Python 3.12+** - Backend runtime environment
- **Node.js 20+** - Frontend runtime environment
- **PostgreSQL 14+** - Database server
- **PGVector** - PostgreSQL extension for vector similarity search
- **Redis** - Message broker for Celery task queue
- **Git** - Version control (to clone the repository)
### Required Services & API Keys
Complete all the [setup steps](/docs), including:
- **Authentication Setup** (choose one):
- Google OAuth credentials (for `AUTH_TYPE=GOOGLE`)
- Local authentication setup (for `AUTH_TYPE=LOCAL`)
- **File Processing ETL Service** (choose one):
- Unstructured.io API key (Supports 34+ formats)
- LlamaCloud API key (enhanced parsing, supports 50+ formats)
- Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
- Other required API keys
- Unstructured.io API key (Supports 34+ formats)
- LlamaCloud API key (enhanced parsing, supports 50+ formats)
- Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
- **Other API keys** as needed for your use case
## Backend Setup
@@ -58,7 +72,7 @@ Edit the `.env` file and set the following variables:
| AUTH_TYPE | Authentication method: `GOOGLE` for OAuth with Google, `LOCAL` for email/password authentication |
| GOOGLE_OAUTH_CLIENT_ID | (Optional) Client ID from Google Cloud Console (required if AUTH_TYPE=GOOGLE) |
| GOOGLE_OAUTH_CLIENT_SECRET | (Optional) Client secret from Google Cloud Console (required if AUTH_TYPE=GOOGLE) |
| EMBEDDING_MODEL | Name of the embedding model (e.g., `mixedbread-ai/mxbai-embed-large-v1`) |
| EMBEDDING_MODEL | Name of the embedding model (e.g., `sentence-transformers/all-MiniLM-L6-v2`, `openai://text-embedding-ada-002`) |
| RERANKERS_MODEL_NAME | Name of the reranker model (e.g., `ms-marco-MiniLM-L-12-v2`) |
| RERANKERS_MODEL_TYPE | Type of reranker model (e.g., `flashrank`) |
| TTS_SERVICE | Text-to-Speech API provider for Podcasts (e.g., `local/kokoro`, `openai/tts-1`). See [supported providers](https://docs.litellm.ai/docs/text_to_speech#supported-providers) |
@@ -70,9 +84,11 @@ Edit the `.env` file and set the following variables:
| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats), `LLAMACLOUD` (supports 50+ formats including legacy document types), or `DOCLING` (local processing, supports PDF, Office docs, images, HTML, CSV) |
| UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing (required if ETL_SERVICE=UNSTRUCTURED) |
| LLAMA_CLOUD_API_KEY | API key for LlamaCloud service for document parsing (required if ETL_SERVICE=LLAMACLOUD) |
| CELERY_BROKER_URL | Redis connection URL for Celery broker (e.g., `redis://localhost:6379/0`) |
| CELERY_RESULT_BACKEND | Redis connection URL for Celery result backend (e.g., `redis://localhost:6379/0`) |
**Optional Backend LangSmith Observability:**
**(Optional) Backend LangSmith Observability:**
| ENV VARIABLE | DESCRIPTION |
|--------------|-------------|
| LANGSMITH_TRACING | Enable LangSmith tracing (e.g., `true`) |
@@ -80,7 +96,7 @@ Edit the `.env` file and set the following variables:
| LANGSMITH_API_KEY | Your LangSmith API key |
| LANGSMITH_PROJECT | LangSmith project name (e.g., `surfsense`) |
**Uvicorn Server Configuration**
**(Optional) Uvicorn Server Configuration**
| ENV VARIABLE | DESCRIPTION | DEFAULT VALUE |
|------------------------------|---------------------------------------------|---------------|
| UVICORN_HOST | Host address to bind the server | 0.0.0.0 |
@@ -149,7 +165,91 @@ uv sync
uv sync
```
### 3. Run the Backend
### 3. Start Redis Server
Redis is required for the Celery task queue. Start the Redis server:
**Linux:**
```bash
# Start Redis server
sudo systemctl start redis
# Or if using Redis installed via package manager
redis-server
```
**macOS:**
```bash
# If installed via Homebrew
brew services start redis
# Or run directly
redis-server
```
**Windows:**
```powershell
# Option 1: If using Redis on Windows (via WSL or Windows port)
redis-server
# Option 2: If installed as a Windows service
net start Redis
```
**Alternative for Windows - Run Redis in Docker:**
If you have Docker Desktop installed, you can run Redis in a container:
```powershell
# Pull and run Redis container
docker run -d --name redis -p 6379:6379 redis:latest
# To stop Redis
docker stop redis
# To start Redis again
docker start redis
# To remove Redis container
docker rm -f redis
```
Verify Redis is running by connecting to it:
```bash
redis-cli ping
# Should return: PONG
```
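When Redis is not on the default host or port, the target that `redis-cli` should probe can be derived from the broker URL itself, which is handy in launch scripts. A sketch using plain shell parameter expansion (assumes the URL shape `redis://host:port/db`; `REDIS_HOST` and `REDIS_PORT` are illustrative names, not variables the backend reads):

```bash
# Split the Celery broker URL into host and port (assumes redis://host:port/db)
CELERY_BROKER_URL="redis://localhost:6379/0"
hostport="${CELERY_BROKER_URL#redis://}"   # strip scheme   -> localhost:6379/0
hostport="${hostport%%/*}"                 # strip db index -> localhost:6379
REDIS_HOST="${hostport%%:*}"               # -> localhost
REDIS_PORT="${hostport##*:}"               # -> 6379
echo "$REDIS_HOST $REDIS_PORT"             # prints: localhost 6379
```

The derived values can then be fed to `redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" ping`, which should return `PONG` as above.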
### 4. Start Celery Worker
In a new terminal window, start the Celery worker to handle background tasks:
**Linux/macOS/Windows:**
```bash
# Make sure you're in the surfsense_backend directory
cd surfsense_backend
# Start Celery worker
uv run celery -A celery_worker.celery_app worker --loglevel=info --concurrency=1 --pool=solo
```
**Optional: Start Flower for monitoring Celery tasks:**
In another terminal window:
```bash
# Start Flower (Celery monitoring tool)
uv run celery -A celery_worker.celery_app flower --port=5555
```
Access Flower at [http://localhost:5555](http://localhost:5555) to monitor your Celery tasks.
### 5. Run the Backend
Start the backend server:
@@ -303,9 +403,11 @@ To verify your installation:
## Troubleshooting
- **Database Connection Issues**: Verify your PostgreSQL server is running and pgvector is properly installed
- **Redis Connection Issues**: Ensure Redis server is running (`redis-cli ping` should return `PONG`). Check that `CELERY_BROKER_URL` and `CELERY_RESULT_BACKEND` are correctly set in your `.env` file
- **Celery Worker Issues**: Make sure the Celery worker is running in a separate terminal. Check worker logs for any errors
- **Authentication Problems**: Check your Google OAuth configuration and ensure redirect URIs are set correctly
- **LLM Errors**: Confirm your LLM API keys are valid and the selected models are accessible
- **File Upload Failures**: Validate your Unstructured.io API key
- **File Upload Failures**: Validate your ETL service API key (Unstructured.io or LlamaCloud) or ensure Docling is properly configured
- **Windows-specific**: If you encounter path issues, ensure you're using the correct path separator (`\` instead of `/`)
- **macOS-specific**: If you encounter permission issues, you may need to use `sudo` for some installation commands