feat: added celery and removed background_tasks for MQ's

- removed pre commit hooks
- updated docker setup
- updated github docker actions
- updated docs
DESKTOP-RTLN3BA\$punk 2025-10-20 00:30:00 -07:00
parent 031dc055da
commit c80bbfa867
27 changed files with 1664 additions and 1038 deletions

View file

@@ -1,3 +1,9 @@
# Docker Specific Env's Only - Can skip if needed
# Celery Config
REDIS_PORT=6379
FLOWER_PORT=5555
# Frontend Configuration
FRONTEND_PORT=3000
NEXT_PUBLIC_API_URL=http://backend:8000

View file

@@ -1,16 +1,13 @@
<!--- Provide a general summary of your changes in the Title above -->
<!--- Summarize your pull request in a few sentences -->
## Description
<!--- Describe your changes in detail -->
<!--- Clearly describe what has changed in this pull request -->
## Motivation and Context
<!--- Why is this change required? What problem does it solve? -->
<!--- If this PR relates to an open issue, please link to the issue here: FIX #123 -->
FIX #
## Changes Overview
<!-- List the primary changes/improvements made in this PR -->
-
## Screenshots
<!-- If applicable, add screenshots or images to demonstrate the changes visually -->
@@ -19,27 +16,26 @@ FIX #
<!-- Document any API changes if applicable -->
- [ ] This PR includes API changes
## Types of changes
<!--- What types of changes does your code introduce? Put an `x` in all the boxes that apply: -->
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Performance improvement (non-breaking change which enhances performance)
- [ ] Documentation update
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Change Type
<!--- Indicate what kind(s) of changes this PR includes: -->
- [ ] Bug fix
- [ ] New feature
- [ ] Performance improvement
- [ ] Refactoring
- [ ] Documentation
- [ ] Dependency/Build system
- [ ] Breaking change
- [ ] Other (specify):
## Testing
<!-- Describe the tests that have been run to verify your changes -->
- [ ] I have tested these changes locally
- [ ] I have added/updated unit tests
- [ ] I have added/updated integration tests
## Testing Performed
<!--- Briefly describe how you have tested these changes and what verification was performed -->
- [ ] Tested locally
- [ ] Manual/QA verification
## Checklist:
<!--- Go over all the following points, and put an `x` in all the boxes that apply. -->
<!--- If you're unsure about any of these, don't hesitate to ask. We're here to help! -->
- [ ] My code follows the code style of this project
- [ ] My change requires documentation updates
- [ ] I have updated the documentation accordingly
- [ ] My change requires dependency updates
- [ ] I have updated the dependencies accordingly
- [ ] My code builds clean without any errors or warnings
- [ ] All new and existing tests passed
## Checklist
<!--- Please confirm the following by marking with an 'x' as appropriate -->
- [ ] Follows project coding standards and conventions
- [ ] Documentation updated as needed
- [ ] Dependencies updated as needed
- [ ] No lint/build errors or new warnings
- [ ] All relevant tests are passing

View file

@@ -4,40 +4,40 @@ on:
workflow_dispatch:
jobs:
build_and_push_backend:
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- name: Checkout repository
uses: actions/checkout@v4
# build_and_push_backend:
# runs-on: ubuntu-latest
# permissions:
# contents: read
# packages: write
# steps:
# - name: Checkout repository
# uses: actions/checkout@v4
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
# - name: Set up QEMU
# uses: docker/setup-qemu-action@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
# - name: Set up Docker Buildx
# uses: docker/setup-buildx-action@v3
- name: Log in to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
# - name: Log in to GitHub Container Registry
# uses: docker/login-action@v3
# with:
# registry: ghcr.io
# username: ${{ github.actor }}
# password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push backend image
uses: docker/build-push-action@v5
with:
context: ./surfsense_backend
file: ./surfsense_backend/Dockerfile
push: true
tags: ghcr.io/${{ github.repository_owner }}/surfsense_backend:${{ github.sha }}
platforms: linux/amd64,linux/arm64
labels: |
org.opencontainers.image.source=${{ github.repositoryUrl }}
org.opencontainers.image.created=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.created'] }}
org.opencontainers.image.revision=${{ github.sha }}
# - name: Build and push backend image
# uses: docker/build-push-action@v5
# with:
# context: ./surfsense_backend
# file: ./surfsense_backend/Dockerfile
# push: true
# tags: ghcr.io/${{ github.repository_owner }}/surfsense_backend:${{ github.sha }}
# platforms: linux/amd64,linux/arm64
# labels: |
# org.opencontainers.image.source=${{ github.repositoryUrl }}
# org.opencontainers.image.created=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.created'] }}
# org.opencontainers.image.revision=${{ github.sha }}
build_and_push_frontend:
runs-on: ubuntu-latest

View file

@@ -124,52 +124,52 @@ jobs:
git ls-remote --tags origin | grep "refs/tags/${{ steps.tag_version.outputs.next_version }}" || (echo "Tag push verification failed!" && exit 1)
echo "Tag successfully pushed."
build_and_push_backend_image:
runs-on: ubuntu-latest
needs: tag_release # Depends on the tag being created successfully
permissions:
packages: write # Need permission to write to GHCR
contents: read # Need permission to read repo contents (checkout)
# build_and_push_backend_image:
# runs-on: ubuntu-latest
# needs: tag_release # Depends on the tag being created successfully
# permissions:
# packages: write # Need permission to write to GHCR
# contents: read # Need permission to read repo contents (checkout)
steps:
- name: Checkout code
uses: actions/checkout@v4
# steps:
# - name: Checkout code
# uses: actions/checkout@v4
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.repository_owner }}
password: ${{ secrets.GITHUB_TOKEN }}
# - name: Login to GitHub Container Registry
# uses: docker/login-action@v3
# with:
# registry: ghcr.io
# username: ${{ github.repository_owner }}
# password: ${{ secrets.GITHUB_TOKEN }}
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
# - name: Set up QEMU
# uses: docker/setup-qemu-action@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
# - name: Set up Docker Buildx
# uses: docker/setup-buildx-action@v3
- name: Extract metadata (tags, labels) for Docker build
id: meta
uses: docker/metadata-action@v5
with:
images: ghcr.io/${{ github.repository_owner }}/surfsense_backend
tags: |
# Use the tag generated in the previous job
type=raw,value=${{ needs.tag_release.outputs.new_tag }}
# Optionally add 'latest' tag if building from the default branch
type=raw,value=latest,enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) || github.event.inputs.branch == github.event.repository.default_branch }}
# - name: Extract metadata (tags, labels) for Docker build
# id: meta
# uses: docker/metadata-action@v5
# with:
# images: ghcr.io/${{ github.repository_owner }}/surfsense_backend
# tags: |
# # Use the tag generated in the previous job
# type=raw,value=${{ needs.tag_release.outputs.new_tag }}
# # Optionally add 'latest' tag if building from the default branch
# type=raw,value=latest,enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) || github.event.inputs.branch == github.event.repository.default_branch }}
- name: Build and push surfsense backend
uses: docker/build-push-action@v5
with:
context: ./surfsense_backend
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
platforms: linux/amd64,linux/arm64
# Optional: Add build cache for faster builds
cache-from: type=gha
cache-to: type=gha,mode=max
# - name: Build and push surfsense backend
# uses: docker/build-push-action@v5
# with:
# context: ./surfsense_backend
# push: true
# tags: ${{ steps.meta.outputs.tags }}
# labels: ${{ steps.meta.outputs.labels }}
# platforms: linux/amd64,linux/arm64
# # Optional: Add build cache for faster builds
# cache-from: type=gha
# cache-to: type=gha,mode=max
build_and_push_ui_image:
runs-on: ubuntu-latest

View file

@@ -1,59 +0,0 @@
name: pre-commit
on:
push:
pull_request:
branches: [main, dev]
jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0 # Required for detecting diffs
- name: Fetch main branch
run: |
# Ensure we have the main branch reference for comparison
git fetch origin main:main 2>/dev/null || git fetch origin main 2>/dev/null || true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Cache pre-commit environments
uses: actions/cache@v4
with:
path: ~/.cache/pre-commit
key: pre-commit-${{ hashFiles('.pre-commit-config.yaml') }}
restore-keys: |
pre-commit-
- name: Install pre-commit
run: |
pip install pre-commit
- name: Install hook environments (cache)
run: |
pre-commit install-hooks
- name: Run pre-commit on changed files
run: |
# Use pre-commit's native diff detection with fallback strategies
if git show-ref --verify --quiet refs/heads/main; then
# Main branch exists locally, use pre-commit's native diff mode
echo "Running pre-commit with native diff detection against main branch"
pre-commit run --from-ref main --to-ref HEAD
elif git show-ref --verify --quiet refs/remotes/origin/main; then
# Origin/main exists, use it as reference
echo "Running pre-commit with native diff detection against origin/main"
pre-commit run --from-ref origin/main --to-ref HEAD
else
# Fallback: run on all files (for first commits or when main is unavailable)
echo "Main branch reference not found, running pre-commit on all files"
echo "⚠️ This may take longer and show more issues than normal"
pre-commit run --all-files
fi

View file

@@ -1,95 +0,0 @@
# Pre-commit configuration for SurfSense
# See https://pre-commit.com for more information
repos:
# General file quality hooks
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v5.0.0
hooks:
- id: check-yaml
args: [--multi, --unsafe]
- id: check-json
exclude: '(tsconfig\.json|\.vscode/.*\.json)$'
- id: check-toml
- id: check-merge-conflict
- id: check-added-large-files
args: [--maxkb=10240] # 10MB limit
- id: debug-statements
- id: check-case-conflict
# Security - detect secrets across all file types
- repo: https://github.com/Yelp/detect-secrets
rev: v1.5.0
hooks:
- id: detect-secrets
args: ['--baseline', '.secrets.baseline']
exclude: |
(?x)^(
.*\.env\.example|
.*\.env\.template|
.*/tests/.*|
.*test.*\.py|
test_.*\.py|
.github/workflows/.*\.yml|
.github/workflows/.*\.yaml|
.*pnpm-lock\.yaml|
.*alembic\.ini|
.*alembic/versions/.*\.py|
.*\.mdx$
)$
# Python Backend Hooks (surfsense_backend) - Using Ruff for linting and formatting
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.12.5
hooks:
- id: ruff
name: ruff-check
files: ^surfsense_backend/
exclude: ^surfsense_backend/(test_.*\.py|.*test.*\.py)
args: [--fix]
- id: ruff-format
name: ruff-format
files: ^surfsense_backend/
exclude: ^surfsense_backend/(test_.*\.py|.*test.*\.py)
- repo: https://github.com/PyCQA/bandit
rev: 1.8.6
hooks:
- id: bandit
files: ^surfsense_backend/
args: ['-f', 'json', '--severity-level', 'high', '--confidence-level', 'high']
exclude: ^surfsense_backend/(tests/|test_.*\.py|.*test.*\.py|alembic/)
# Biome hooks for TypeScript/JavaScript projects
- repo: local
hooks:
# Biome check for surfsense_web
- id: biome-check-web
name: biome-check-web
entry: bash -c 'cd surfsense_web && npx @biomejs/biome check --diagnostic-level=error .'
language: system
files: ^surfsense_web/
pass_filenames: false
always_run: true
stages: [pre-commit]
# Biome check for surfsense_browser_extension
# - id: biome-check-extension
# name: biome-check-extension
# entry: bash -c 'cd surfsense_browser_extension && npx @biomejs/biome check --diagnostic-level=error .'
# language: system
# files: ^surfsense_browser_extension/
# pass_filenames: false
# always_run: true
# stages: [pre-commit]
# Commit message linting
- repo: https://github.com/commitizen-tools/commitizen
rev: v4.8.3
hooks:
- id: commitizen
stages: [commit-msg]
# Global configuration
default_stages: [pre-commit]
fail_fast: false

View file

@@ -1,124 +0,0 @@
# SurfSense Deployment Guide
This guide explains the different deployment options available for SurfSense using Docker Compose.
## Deployment Options
SurfSense uses a flexible Docker Compose configuration that allows you to easily switch between deployment modes without manually editing files. Our approach uses Docker's built-in override functionality with two configuration files:
1. **docker-compose.yml**: Contains essential core services (database and pgAdmin)
2. **docker-compose.override.yml**: Contains application services (frontend and backend)
This structure provides several advantages:
- No need to comment/uncomment services manually
- Clear separation between core infrastructure and application services
- Easy switching between development and production environments
## Deployment Modes
### Full Stack Mode (Development)
This mode runs everything: frontend, backend, database, and pgAdmin. It's ideal for development environments where you need the complete application stack.
```bash
# Both files are automatically used (docker-compose.yml + docker-compose.override.yml)
docker compose up -d
```
### Core Services Mode (Production)
This mode runs only the database and pgAdmin services. It's suitable for production environments where you might want to deploy the frontend and backend separately or need to run database migrations.
```bash
# Explicitly use only the main file
docker compose -f docker-compose.yml up -d
```
## Custom Deployment Options
### Running Specific Services
You can specify which services to start by naming them:
```bash
# Start only database
docker compose up -d db
# Start database and pgAdmin
docker compose up -d db pgadmin
# Start only backend (requires db to be running)
docker compose up -d backend
```
### Using Custom Override Files
You can create and use custom override files for different environments:
```bash
# Create a staging configuration
docker compose -f docker-compose.yml -f docker-compose.staging.yml up -d
```
## Environment Variables
The deployment can be customized using environment variables:
```bash
# Change default ports
FRONTEND_PORT=4000 BACKEND_PORT=9000 docker compose up -d
# Or use a .env file
# Create or modify .env file with your desired values
docker compose up -d
```
## Common Deployment Workflows
### Initial Setup
```bash
# Clone the repository
git clone https://github.com/MODSetter/SurfSense.git
cd SurfSense
# Copy example env files
cp .env.example .env
cp surfsense_backend/.env.example surfsense_backend/.env
cp surfsense_web/.env.example surfsense_web/.env
# Edit the .env files with your configuration
# Start full stack for development
docker compose up -d
```
### Database-Only Mode (for migrations or maintenance)
```bash
# Start just the database
docker compose -f docker-compose.yml up -d db
# Run migrations or maintenance tasks
docker compose exec db psql -U postgres -d surfsense
```
### Scaling in Production
For production deployments, you might want to:
1. Run core services with Docker Compose
2. Deploy frontend/backend with specialized services like Vercel, Netlify, or dedicated application servers
This separation allows for better scaling and resource utilization in production environments.
## Troubleshooting
If you encounter issues with the deployment:
- Check container logs: `docker compose logs -f [service_name]`
- Ensure all required environment variables are set
- Verify network connectivity between containers
- Check that required ports are available and not blocked by firewalls
For more detailed setup instructions, refer to [DOCKER_SETUP.md](DOCKER_SETUP.md).

View file

@@ -1,192 +0,0 @@
# Docker Setup for SurfSense
This document explains how to run the SurfSense project using Docker Compose.
## Prerequisites
- Docker and Docker Compose installed on your machine
- Git (to clone the repository)
## Environment Variables Configuration
SurfSense Docker setup supports configuration through environment variables. You can set these variables in two ways:
1. Create a `.env` file in the project root directory (copy from `.env.example`)
2. Set environment variables directly in your shell before running Docker Compose
The following environment variables are available:
```
# Frontend Configuration
FRONTEND_PORT=3000
NEXT_PUBLIC_API_URL=http://backend:8000
# Backend Configuration
BACKEND_PORT=8000
# Database Configuration
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=surfsense
POSTGRES_PORT=5432
# pgAdmin Configuration
PGADMIN_PORT=5050
PGADMIN_DEFAULT_EMAIL=admin@surfsense.com
PGADMIN_DEFAULT_PASSWORD=surfsense
```
## Deployment Options
SurfSense uses a flexible Docker Compose setup that allows you to choose between different deployment modes:
### Option 1: Full-Stack Deployment (Development Mode)
Includes frontend, backend, database, and pgAdmin. This is the default when running `docker compose up`.
### Option 2: Core Services Only (Production Mode)
Includes only database and pgAdmin, suitable for production environments where you might deploy frontend/backend separately.
Our setup uses two files:
- `docker-compose.yml`: Contains core services (database and pgAdmin)
- `docker-compose.override.yml`: Contains application services (frontend and backend)
## Setup
1. Make sure you have all the necessary environment variables set up:
- Run `cp surfsense_backend/.env.example surfsense_backend/.env` to create .env file, and fill in the required values
- Run `cp surfsense_web/.env.example surfsense_web/.env` to create .env file, fill in the required values
- Optionally: Copy `.env.example` to `.env` in the project root to customize Docker settings
2. Deploy based on your needs:
**Full Stack (Development Mode)**:
```bash
# Both files are automatically used
docker compose up --build
```
**Core Services Only (Production Mode)**:
```bash
# Explicitly use only the main file
docker compose -f docker-compose.yml up --build
```
3. To run in detached mode (in the background):
```bash
# Full stack
docker compose up -d
# Core services only
docker compose -f docker-compose.yml up -d
```
4. Access the applications:
- Frontend: http://localhost:3000 (when using full stack)
- Backend API: http://localhost:8000 (when using full stack)
- API Documentation: http://localhost:8000/docs (when using full stack)
- pgAdmin: http://localhost:5050
## Customizing the Deployment
If you need to make temporary changes to either full stack or core services deployment, you can:
1. **Temporarily disable override file**:
```bash
docker compose -f docker-compose.yml up -d
```
2. **Use a custom override file**:
```bash
docker compose -f docker-compose.yml -f custom-override.yml up -d
```
3. **Temporarily modify which services start**:
```bash
docker compose up -d db pgadmin
```
## Useful Commands
- Stop the containers:
```bash
docker compose down
```
- View logs:
```bash
# All services
docker compose logs -f
# Specific service
docker compose logs -f backend
docker compose logs -f frontend
docker compose logs -f db
docker compose logs -f pgadmin
```
- Restart a specific service:
```bash
docker compose restart backend
```
- Execute commands in a running container:
```bash
# Backend
docker compose exec backend python -m pytest
# Frontend
docker compose exec frontend pnpm lint
```
## Database
The PostgreSQL database with pgvector extensions is available at:
- Host: localhost
- Port: 5432 (configurable via POSTGRES_PORT)
- Username: postgres (configurable via POSTGRES_USER)
- Password: postgres (configurable via POSTGRES_PASSWORD)
- Database: surfsense (configurable via POSTGRES_DB)
You can connect to it using any PostgreSQL client or the included pgAdmin.
## pgAdmin
pgAdmin is a web-based administration tool for PostgreSQL. It is included in the Docker setup for easier database management.
- URL: http://localhost:5050 (configurable via PGADMIN_PORT)
- Default Email: admin@surfsense.com (configurable via PGADMIN_DEFAULT_EMAIL)
- Default Password: surfsense (configurable via PGADMIN_DEFAULT_PASSWORD)
### Connecting to the Database in pgAdmin
1. Log in to pgAdmin using the credentials above
2. Right-click on "Servers" in the left sidebar and select "Create" > "Server"
3. In the "General" tab, give your connection a name (e.g., "SurfSense DB")
4. In the "Connection" tab, enter the following:
- Host: db
- Port: 5432
- Maintenance database: surfsense
- Username: postgres
- Password: postgres
5. Click "Save" to establish the connection
## Troubleshooting
- If you encounter permission errors, you may need to run the docker commands with `sudo`.
- If ports are already in use, modify the port mappings in the `.env` file or directly in the `docker-compose.yml` file.
- For backend dependency issues, you may need to modify the `Dockerfile` in the backend directory.
- If you encounter frontend dependency errors, adjust the frontend's `Dockerfile` accordingly.
- If pgAdmin doesn't connect to the database, ensure you're using `db` as the hostname, not `localhost`, as that's the Docker network name.
- If you need only specific services, you can explicitly name them: `docker compose up db pgadmin`
## Understanding Docker Compose File Structure
The project uses Docker's default override mechanism:
1. **docker-compose.yml**: Contains essential services (database and pgAdmin)
2. **docker-compose.override.yml**: Contains development services (frontend and backend)
When you run `docker compose up` without additional flags, Docker automatically merges both files.
When you run `docker compose -f docker-compose.yml up`, only the specified file is used.
This approach lets you maintain a cleaner codebase without manually commenting/uncommenting services in your configuration files.

View file

@@ -1,237 +0,0 @@
# Pre-commit Hooks for SurfSense Contributors
Welcome to SurfSense! As an open-source project, we use pre-commit hooks to maintain code quality, security, and consistency across our multi-component codebase. This guide will help you set up and work with our pre-commit configuration.
## 🚀 What is Pre-commit?
Pre-commit is a framework for managing multi-language pre-commit hooks. It runs automatically before each commit to catch issues early, ensuring high code quality and consistency across the project.
## 📁 Project Structure
SurfSense consists of three main components:
- **`surfsense_backend/`** - Python backend API
- **`surfsense_web/`** - Next.js web application
- **`surfsense_browser_extension/`** - TypeScript browser extension
## 🛠 Installation
### Prerequisites
- Python 3.8 or higher
- Node.js 18+ and pnpm (for frontend components)
- Git
### Install Pre-commit
```bash
# Install pre-commit globally
pip install pre-commit
# Or using your preferred package manager
# pipx install pre-commit # Recommended for isolation
```
### Setup Pre-commit Hooks
1. **Clone the repository**:
```bash
git clone https://github.com/masabinhok/SurfSense.git
cd SurfSense
```
2. **Install the pre-commit hooks**:
```bash
pre-commit install
```
3. **Install commit message hooks** (optional, for conventional commits):
```bash
pre-commit install --hook-type commit-msg
```
## 🔧 Configuration Files Added
When you install pre-commit, the following files are part of the setup:
- **`.pre-commit-config.yaml`** - Main pre-commit configuration
- **`.secrets.baseline`** - Baseline file for secret detection (prevents false positives)
- **`.github/workflows/pre-commit.yml`** - CI workflow that runs pre-commit on PRs
## 🎯 What Gets Checked
### All Files
- ✅ Trailing whitespace removal
- ✅ YAML, JSON, and TOML validation
- ✅ Large file detection (>10MB)
- ✅ Merge conflict markers
- 🔒 **Secret detection** using detect-secrets
### Python Backend (`surfsense_backend/`)
- 🐍 **Black** - Code formatting
- 📦 **isort** - Import sorting
- ⚡ **Ruff** - Fast linting and formatting
- 🔍 **MyPy** - Static type checking
- 🛡️ **Bandit** - Security vulnerability scanning
### Frontend (`surfsense_web/` & `surfsense_browser_extension/`)
- 💅 **Prettier** - Code formatting
- 🔍 **ESLint** - Linting (Next.js config)
- 📝 **TypeScript** - Compilation checks
### Commit Messages
- 📝 **Commitizen** - Conventional commit format validation
## 🚀 Usage
### Normal Workflow
Pre-commit will run automatically when you commit:
```bash
git add .
git commit -m "feat: add new feature"
# Pre-commit hooks will run automatically
```
### Manual Execution
Run on staged files only:
```bash
pre-commit run
```
Run on specific files:
```bash
pre-commit run --files path/to/file.py path/to/file.ts
```
Run all hooks on all files:
```bash
pre-commit run --all-files
```
⚠️ **Warning**: Running `--all-files` may generate numerous errors as this codebase has existing linting and type issues that are being gradually resolved.
### Advanced Commands
Update all hooks to latest versions:
```bash
pre-commit autoupdate
```
Run only specific hooks:
```bash
pre-commit run black # Run only black
pre-commit run --all-files prettier # Run prettier on all files
```
Clean pre-commit cache:
```bash
pre-commit clean
```
## 🆘 Bypassing Pre-commit (When Necessary)
Sometimes you might need to bypass pre-commit hooks (use sparingly!):
### Skip all hooks for one commit:
```bash
git commit -m "fix: urgent hotfix" --no-verify
```
### Skip specific hooks:
```bash
SKIP=mypy,black git commit -m "feat: work in progress"
```
Available hook IDs to skip:
- `trailing-whitespace`, `check-yaml`, `check-json`
- `detect-secrets`
- `black`, `isort`, `ruff`, `ruff-format`, `mypy`, `bandit`
- `prettier`, `eslint`
- `typescript-check-web`, `typescript-check-extension`
- `commitizen`
## 🐛 Common Issues & Solutions
### Secret Detection False Positives
If detect-secrets flags legitimate content as secrets:
1. **Review the detection** - Ensure it's not actually a secret
2. **Update baseline**:
```bash
detect-secrets scan --baseline .secrets.baseline --update
git add .secrets.baseline
```
### TypeScript/Node.js Issues
Ensure dependencies are installed:
```bash
cd surfsense_web && pnpm install
cd surfsense_browser_extension && pnpm install
```
### Python Environment Issues
For Python hooks, ensure you're in the correct environment:
```bash
cd surfsense_backend
# If using uv
uv sync
# Or traditional pip
pip install -r requirements.txt
```
### Hook Installation Issues
If hooks aren't running:
```bash
pre-commit uninstall
pre-commit install --install-hooks
```
## 📊 Performance Tips
- **Incremental runs**: Pre-commit only runs on changed files by default
- **Parallel execution**: Many hooks run in parallel for speed
- **Caching**: Pre-commit caches environments to speed up subsequent runs
## 🔄 CI Integration
Pre-commit also runs in our GitHub Actions CI pipeline on every PR to `main`. The CI:
- Runs only on changed files for efficiency
- Provides the same feedback as local pre-commit
- Prevents merging code that doesn't pass quality checks
## 📋 Best Practices
1. **Install pre-commit early** in your development setup
2. **Fix issues incrementally** rather than bypassing hooks
3. **Update your branch regularly** to avoid conflicts with formatting changes
4. **Run `--all-files` periodically** on feature branches (in small chunks)
5. **Keep the `.secrets.baseline` updated** when legitimate secrets-like strings are added
## 💡 Contributing to Pre-commit Config
To modify the pre-commit configuration:
1. Edit `.pre-commit-config.yaml`
2. Test your changes:
```bash
pre-commit run --all-files # Test with caution!
```
3. Update the baseline if needed:
```bash
detect-secrets scan --baseline .secrets.baseline --update
```
4. Submit a PR with your changes
## 🆘 Getting Help
- **Pre-commit docs**: https://pre-commit.com/
- **Project issues**: Open an issue on GitHub
- **Hook-specific help**: Check individual tool documentation (Black, Ruff, ESLint, etc.)
---
Thank you for contributing to SurfSense! 🏄‍♀️ Quality code makes everyone's surfing experience smoother.

View file

@@ -156,7 +156,7 @@ SurfSense provides two installation methods:
Both installation guides include detailed OS-specific instructions for Windows, macOS, and Linux.
Before installation, make sure to complete the [prerequisite setup steps](https://www.surfsense.net/docs/) including:
- PGVector setup
- Auth setup
- **File Processing ETL Service** (choose one):
- Unstructured.io API key (supports 34+ formats)
- LlamaIndex API key (enhanced parsing, supports 50+ formats)

View file

@@ -1,34 +0,0 @@
version: '3.8'
services:
frontend:
image: ghcr.io/modsetter/surfsense_ui:latest
ports:
- "${FRONTEND_PORT:-3000}:3000"
volumes:
- ./surfsense_web:/app
- /app/node_modules
env_file:
- ./surfsense_web/.env
depends_on:
- backend
environment:
- NEXT_PUBLIC_API_URL=${NEXT_PUBLIC_API_URL:-http://backend:8000}
backend:
image: ghcr.io/modsetter/surfsense_backend:latest
ports:
- "${BACKEND_PORT:-8000}:8000"
volumes:
- ./surfsense_backend:/app
depends_on:
- db
env_file:
- ./surfsense_backend/.env
environment:
- DATABASE_URL=postgresql+asyncpg://${POSTGRES_USER:-postgres}:${POSTGRES_PASSWORD:-postgres}@db:5432/${POSTGRES_DB:-surfsense}
- PYTHONPATH=/app
- UVICORN_LOOP=asyncio
- UNSTRUCTURED_HAS_PATCHED_LOOP=1
- LANGCHAIN_TRACING_V2=false
- LANGSMITH_TRACING=false

View file

@@ -1,4 +1,4 @@
version: '3.8'
version: "3.8"
services:
db:
@@ -24,6 +24,89 @@ services:
depends_on:
- db
redis:
image: redis:7-alpine
ports:
- "${REDIS_PORT:-6379}:6379"
volumes:
- redis_data:/data
command: redis-server --appendonly yes
backend:
build: ./surfsense_backend
# image: ghcr.io/modsetter/surfsense_backend:latest
ports:
- "${BACKEND_PORT:-8000}:8000"
volumes:
- ./surfsense_backend:/app
- shared_temp:/tmp
env_file:
- ./surfsense_backend/.env
environment:
- DATABASE_URL=postgresql+asyncpg://${POSTGRES_USER:-postgres}:${POSTGRES_PASSWORD:-postgres}@db:5432/${POSTGRES_DB:-surfsense}
# Redis listens on 6379 inside the Compose network; REDIS_PORT only changes the host-side mapping
- CELERY_BROKER_URL=redis://redis:6379/0
- CELERY_RESULT_BACKEND=redis://redis:6379/0
- PYTHONPATH=/app
- UVICORN_LOOP=asyncio
- UNSTRUCTURED_HAS_PATCHED_LOOP=1
- LANGCHAIN_TRACING_V2=false
- LANGSMITH_TRACING=false
depends_on:
- db
- redis
celery_worker:
build: ./surfsense_backend
# image: ghcr.io/modsetter/surfsense_backend:latest
command: celery -A app.celery_app worker --loglevel=info --concurrency=1 --pool=solo
volumes:
- ./surfsense_backend:/app
- shared_temp:/tmp
env_file:
- ./surfsense_backend/.env
environment:
- DATABASE_URL=postgresql+asyncpg://${POSTGRES_USER:-postgres}:${POSTGRES_PASSWORD:-postgres}@db:5432/${POSTGRES_DB:-surfsense}
# Redis listens on 6379 inside the Compose network; REDIS_PORT only changes the host-side mapping
- CELERY_BROKER_URL=redis://redis:6379/0
- CELERY_RESULT_BACKEND=redis://redis:6379/0
- PYTHONPATH=/app
depends_on:
- db
- redis
- backend
# flower:
# build: ./surfsense_backend
# # image: ghcr.io/modsetter/surfsense_backend:latest
# command: celery -A app.celery_app flower --port=5555
# ports:
# - "${FLOWER_PORT:-5555}:5555"
# env_file:
# - ./surfsense_backend/.env
# environment:
# - CELERY_BROKER_URL=redis://redis:6379/0
# - CELERY_RESULT_BACKEND=redis://redis:6379/0
# - PYTHONPATH=/app
# depends_on:
# - redis
# - celery_worker
frontend:
# build: ./surfsense_web
image: ghcr.io/modsetter/surfsense_ui:latest
ports:
- "${FRONTEND_PORT:-3000}:3000"
volumes:
- ./surfsense_web:/app
- /app/node_modules
env_file:
- ./surfsense_web/.env
environment:
- NEXT_PUBLIC_API_URL=${NEXT_PUBLIC_API_URL:-http://backend:8000}
depends_on:
- backend
volumes:
postgres_data:
pgadmin_data:
pgadmin_data:
redis_data:
shared_temp:
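
Review note: the compose file above leans heavily on `${NAME:-default}` interpolation. The sketch below is a rough illustration of how those placeholders resolve, not Docker Compose's actual implementation. The helper name `interpolate` is hypothetical, and only the `${NAME}` and `${NAME:-default}` forms are handled.

```python
import re

# Matches ${NAME} and ${NAME:-default}. Other Compose forms such as
# ${NAME-default} or ${NAME:?error} are deliberately left out of this sketch.
_PLACEHOLDER = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)


def interpolate(template: str, env: dict[str, str]) -> str:
    """Expand placeholders in `template` using values from `env` (hypothetical helper)."""

    def substitute(match: re.Match) -> str:
        value = env.get(match.group("name"), "")
        # ':-' falls back to the default when the variable is unset or empty.
        if value == "":
            return match.group("default") or ""
        return value

    return _PLACEHOLDER.sub(substitute, template)


# Strings copied from the compose file in this commit.
DATABASE_URL = (
    "postgresql+asyncpg://${POSTGRES_USER:-postgres}:${POSTGRES_PASSWORD:-postgres}"
    "@db:5432/${POSTGRES_DB:-surfsense}"
)
REDIS_PORTS = "${REDIS_PORT:-6379}:6379"

print(interpolate(DATABASE_URL, {}))
print(interpolate(REDIS_PORTS, {"REDIS_PORT": "6380"}))
```

In a `ports` entry the left side is the host port and the right side is the container port, so setting `REDIS_PORT` moves only the host-side mapping. Services on the Compose network still reach Redis at `redis:6379`.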

View file

@@ -1,5 +1,9 @@
DATABASE_URL=postgresql+asyncpg://postgres:postgres@localhost:5432/surfsense
# Celery Config
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/0
SECRET_KEY=SECRET
NEXT_FRONTEND_URL=http://localhost:3000
@ -17,7 +21,7 @@ AIRTABLE_CLIENT_SECRET=your_airtable_client_secret
AIRTABLE_REDIRECT_URI=http://localhost:8000/api/v1/auth/airtable/connector/callback
# Embedding Model
EMBEDDING_MODEL=mixedbread-ai/mxbai-embed-large-v1
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
RERANKERS_MODEL_NAME=ms-marco-MiniLM-L-12-v2
RERANKERS_MODEL_TYPE=flashrank


@ -0,0 +1,59 @@
"""Celery application configuration and setup."""
import os
from celery import Celery
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Get Celery configuration from environment
CELERY_BROKER_URL = os.getenv("CELERY_BROKER_URL", "redis://localhost:6379/0")
CELERY_RESULT_BACKEND = os.getenv("CELERY_RESULT_BACKEND", "redis://localhost:6379/0")
# Create Celery app
celery_app = Celery(
"surfsense",
broker=CELERY_BROKER_URL,
backend=CELERY_RESULT_BACKEND,
include=[
"app.tasks.celery_tasks.document_tasks",
"app.tasks.celery_tasks.podcast_tasks",
"app.tasks.celery_tasks.connector_tasks",
],
)
# Celery configuration
celery_app.conf.update(
# Task settings
task_serializer="json",
accept_content=["json"],
result_serializer="json",
timezone="UTC",
enable_utc=True,
# Task execution settings
task_track_started=True,
task_time_limit=3600, # 1 hour hard limit
task_soft_time_limit=3000, # 50 minutes soft limit
# Result backend settings
result_expires=86400, # Results expire after 24 hours
result_extended=True,
# Worker settings
worker_prefetch_multiplier=1,
worker_max_tasks_per_child=1000,
# Retry settings
task_acks_late=True,
task_reject_on_worker_lost=True,
# Broker settings
broker_connection_retry_on_startup=True,
)
# Optional: Configure Celery Beat for periodic tasks
celery_app.conf.beat_schedule = {
# Example: add periodic tasks here if needed.
# Uncommenting the entry below also requires `from celery.schedules import crontab`.
# "periodic-task-name": {
# "task": "app.tasks.celery_tasks.some_task",
# "schedule": crontab(minute=0, hour=0), # Run daily at midnight
# },
}


@ -1,7 +1,7 @@
# Force asyncio to use standard event loop before unstructured imports
import asyncio
from fastapi import APIRouter, BackgroundTasks, Depends, Form, HTTPException, UploadFile
from fastapi import APIRouter, Depends, Form, HTTPException, UploadFile
from litellm import atranscription
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.future import select
@ -56,35 +56,41 @@ async def create_documents(
request: DocumentsCreate,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
fastapi_background_tasks: BackgroundTasks = BackgroundTasks(),
):
try:
# Check if the user owns the search space
await check_ownership(session, SearchSpace, request.search_space_id, user)
if request.document_type == DocumentType.EXTENSION:
from app.tasks.celery_tasks.document_tasks import (
process_extension_document_task,
)
for individual_document in request.content:
fastapi_background_tasks.add_task(
process_extension_document_with_new_session,
individual_document,
request.search_space_id,
str(user.id),
# Convert document to dict for Celery serialization
document_dict = {
"metadata": {
"VisitedWebPageTitle": individual_document.metadata.VisitedWebPageTitle,
"VisitedWebPageURL": individual_document.metadata.VisitedWebPageURL,
},
"content": individual_document.content,
}
process_extension_document_task.delay(
document_dict, request.search_space_id, str(user.id)
)
elif request.document_type == DocumentType.CRAWLED_URL:
from app.tasks.celery_tasks.document_tasks import process_crawled_url_task
for url in request.content:
fastapi_background_tasks.add_task(
process_crawled_url_with_new_session,
url,
request.search_space_id,
str(user.id),
process_crawled_url_task.delay(
url, request.search_space_id, str(user.id)
)
elif request.document_type == DocumentType.YOUTUBE_VIDEO:
from app.tasks.celery_tasks.document_tasks import process_youtube_video_task
for url in request.content:
fastapi_background_tasks.add_task(
process_youtube_video_with_new_session,
url,
request.search_space_id,
str(user.id),
process_youtube_video_task.delay(
url, request.search_space_id, str(user.id)
)
else:
raise HTTPException(status_code=400, detail="Invalid document type")
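
Review note: the Celery app in this commit sets `task_serializer="json"`, which is why the extension document is flattened into a plain dict and `user.id` is stringified before `.delay()` is called. The sketch below shows the same flattening with stdlib types only; `Metadata`, `ExtensionDocument`, and `to_task_payload` are hypothetical stand-ins for the request models, not names from the codebase.

```python
import json
import uuid
from dataclasses import dataclass


@dataclass
class Metadata:  # hypothetical stand-in for the request metadata model
    VisitedWebPageTitle: str
    VisitedWebPageURL: str


@dataclass
class ExtensionDocument:  # hypothetical stand-in for the request document
    metadata: Metadata
    content: str


def to_task_payload(doc, search_space_id, user_id):
    """Flatten the task arguments into JSON-safe values."""
    document_dict = {
        "metadata": {
            "VisitedWebPageTitle": doc.metadata.VisitedWebPageTitle,
            "VisitedWebPageURL": doc.metadata.VisitedWebPageURL,
        },
        "content": doc.content,
    }
    # The stdlib json module rejects a raw UUID, so convert it to str first.
    return [document_dict, search_space_id, str(user_id)]


doc = ExtensionDocument(Metadata("Example", "https://example.com"), "page text")
args = to_task_payload(doc, 1, uuid.uuid4())
```

A round trip through `json.dumps` and `json.loads` is a cheap way to check that a payload will survive the hop through the broker.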
@ -106,7 +112,6 @@ async def create_documents_file_upload(
search_space_id: int = Form(...),
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
fastapi_background_tasks: BackgroundTasks = BackgroundTasks(),
):
try:
await check_ownership(session, SearchSpace, search_space_id, user)
@ -131,12 +136,12 @@ async def create_documents_file_upload(
with open(temp_path, "wb") as f:
f.write(content)
fastapi_background_tasks.add_task(
process_file_in_background_with_new_session,
temp_path,
file.filename,
search_space_id,
str(user.id),
from app.tasks.celery_tasks.document_tasks import (
process_file_upload_task,
)
process_file_upload_task.delay(
temp_path, file.filename, search_space_id, str(user.id)
)
except Exception as e:
raise HTTPException(


@ -1,7 +1,7 @@
import os
from pathlib import Path
from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException
from fastapi import APIRouter, Depends, HTTPException
from fastapi.responses import StreamingResponse
from sqlalchemy.exc import IntegrityError, SQLAlchemyError
from sqlalchemy.ext.asyncio import AsyncSession
@ -176,7 +176,6 @@ async def generate_podcast(
request: PodcastGenerateRequest,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
fastapi_background_tasks: BackgroundTasks = BackgroundTasks(),
):
try:
# Check if the user owns the search space
@ -205,14 +204,14 @@ async def generate_podcast(
detail="One or more chat IDs do not belong to this user or search space",
)
# Only add a single task with the first chat ID
from app.tasks.celery_tasks.podcast_tasks import (
generate_chat_podcast_task,
)
# Add Celery tasks for each chat ID
for chat_id in valid_chat_ids:
fastapi_background_tasks.add_task(
generate_chat_podcast_with_new_session,
chat_id,
request.search_space_id,
request.podcast_title,
user.id,
generate_chat_podcast_task.delay(
chat_id, request.search_space_id, request.podcast_title, user.id
)
return {


@ -14,7 +14,7 @@ import logging
from datetime import datetime, timedelta
from typing import Any
from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException, Query
from fastapi import APIRouter, Depends, HTTPException, Query
from pydantic import BaseModel, Field, ValidationError
from sqlalchemy.exc import IntegrityError
from sqlalchemy.ext.asyncio import AsyncSession
@ -351,7 +351,6 @@ async def index_connector_content(
),
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
background_tasks: BackgroundTasks = None,
):
"""
Index content from a connector to a search space.
@ -409,107 +408,83 @@ async def index_connector_content(
indexing_to = end_date if end_date else today_str
if connector.connector_type == SearchSourceConnectorType.SLACK_CONNECTOR:
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import (
index_slack_messages_task,
)
logger.info(
f"Triggering Slack indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_slack_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_slack_messages_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Slack indexing started in the background."
elif connector.connector_type == SearchSourceConnectorType.NOTION_CONNECTOR:
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import index_notion_pages_task
logger.info(
f"Triggering Notion indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_notion_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_notion_pages_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Notion indexing started in the background."
elif connector.connector_type == SearchSourceConnectorType.GITHUB_CONNECTOR:
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import index_github_repos_task
logger.info(
f"Triggering GitHub indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_github_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_github_repos_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "GitHub indexing started in the background."
elif connector.connector_type == SearchSourceConnectorType.LINEAR_CONNECTOR:
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import index_linear_issues_task
logger.info(
f"Triggering Linear indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_linear_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_linear_issues_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Linear indexing started in the background."
elif connector.connector_type == SearchSourceConnectorType.JIRA_CONNECTOR:
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import index_jira_issues_task
logger.info(
f"Triggering Jira indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_jira_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_jira_issues_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Jira indexing started in the background."
elif connector.connector_type == SearchSourceConnectorType.CONFLUENCE_CONNECTOR:
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import (
index_confluence_pages_task,
)
logger.info(
f"Triggering Confluence indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_confluence_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_confluence_pages_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Confluence indexing started in the background."
elif connector.connector_type == SearchSourceConnectorType.CLICKUP_CONNECTOR:
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import index_clickup_tasks_task
logger.info(
f"Triggering ClickUp indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_clickup_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_clickup_tasks_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "ClickUp indexing started in the background."
@ -517,77 +492,65 @@ async def index_connector_content(
connector.connector_type
== SearchSourceConnectorType.GOOGLE_CALENDAR_CONNECTOR
):
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import (
index_google_calendar_events_task,
)
logger.info(
f"Triggering Google Calendar indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_google_calendar_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_google_calendar_events_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Google Calendar indexing started in the background."
elif connector.connector_type == SearchSourceConnectorType.AIRTABLE_CONNECTOR:
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import (
index_airtable_records_task,
)
logger.info(
f"Triggering Airtable indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_airtable_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_airtable_records_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Airtable indexing started in the background."
elif (
connector.connector_type == SearchSourceConnectorType.GOOGLE_GMAIL_CONNECTOR
):
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import (
index_google_gmail_messages_task,
)
logger.info(
f"Triggering Google Gmail indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_google_gmail_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_google_gmail_messages_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Google Gmail indexing started in the background."
elif connector.connector_type == SearchSourceConnectorType.DISCORD_CONNECTOR:
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import (
index_discord_messages_task,
)
logger.info(
f"Triggering Discord indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_discord_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_discord_messages_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Discord indexing started in the background."
elif connector.connector_type == SearchSourceConnectorType.LUMA_CONNECTOR:
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import index_luma_events_task
logger.info(
f"Triggering Luma indexing for connector {connector_id} into search space {search_space_id} from {indexing_from} to {indexing_to}"
)
background_tasks.add_task(
run_luma_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_luma_events_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Luma indexing started in the background."
@ -595,17 +558,15 @@ async def index_connector_content(
connector.connector_type
== SearchSourceConnectorType.ELASTICSEARCH_CONNECTOR
):
# Run indexing in background
from app.tasks.celery_tasks.connector_tasks import (
index_elasticsearch_documents_task,
)
logger.info(
f"Triggering Elasticsearch indexing for connector {connector_id} into search space {search_space_id}"
)
background_tasks.add_task(
run_elasticsearch_indexing_with_new_session,
connector_id,
search_space_id,
str(user.id),
indexing_from,
indexing_to,
index_elasticsearch_documents_task.delay(
connector_id, search_space_id, str(user.id), indexing_from, indexing_to
)
response_message = "Elasticsearch indexing started in the background."
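
Review note: every branch in this function sends the same five arguments, and only the task and the label in the response differ. One possible follow-up is a lookup table. This is a sketch with stand-ins, not the code in this commit: `ConnectorType`, `FakeTask`, `TASKS`, and `trigger_indexing` are hypothetical, and real code would map `SearchSourceConnectorType` members to the tasks in `connector_tasks`.

```python
from enum import Enum


class ConnectorType(Enum):  # stand-in for SearchSourceConnectorType
    SLACK_CONNECTOR = "SLACK_CONNECTOR"
    NOTION_CONNECTOR = "NOTION_CONNECTOR"


class FakeTask:
    """Stand-in for a Celery task; records calls instead of queueing them."""

    def __init__(self):
        self.calls = []

    def delay(self, *args):
        self.calls.append(args)


slack_task, notion_task = FakeTask(), FakeTask()

# connector type -> (task, human-readable label)
TASKS = {
    ConnectorType.SLACK_CONNECTOR: (slack_task, "Slack"),
    ConnectorType.NOTION_CONNECTOR: (notion_task, "Notion"),
}


def trigger_indexing(connector_type, connector_id, search_space_id, user_id,
                     indexing_from, indexing_to):
    """Queue the indexing task for a connector type and return the message."""
    try:
        task, label = TASKS[connector_type]
    except KeyError:
        raise ValueError(f"Unsupported connector type: {connector_type}") from None
    task.delay(connector_id, search_space_id, user_id, indexing_from, indexing_to)
    return f"{label} indexing started in the background."
```

With a table like this, adding a connector would mean adding one entry rather than another `elif` branch.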


@ -0,0 +1 @@
"""Celery tasks package."""


@ -0,0 +1,589 @@
"""Celery tasks for connector indexing."""
import logging
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine
from sqlalchemy.pool import NullPool
from app.celery_app import celery_app
from app.config import config
logger = logging.getLogger(__name__)
def get_celery_session_maker():
"""
Create a new async session maker for Celery tasks.
This is necessary because Celery tasks run in a new event loop,
and the default session maker is bound to the main app's event loop.
"""
engine = create_async_engine(
config.DATABASE_URL,
poolclass=NullPool, # Don't use connection pooling for Celery tasks
echo=False,
)
return async_sessionmaker(engine, expire_on_commit=False)
@celery_app.task(name="index_slack_messages", bind=True)
def index_slack_messages_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Slack messages."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_slack_messages(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_slack_messages(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Slack messages with new session."""
from app.routes.search_source_connectors_routes import (
run_slack_indexing,
)
async with get_celery_session_maker()() as session:
await run_slack_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_notion_pages", bind=True)
def index_notion_pages_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Notion pages."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_notion_pages(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_notion_pages(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Notion pages with new session."""
from app.routes.search_source_connectors_routes import (
run_notion_indexing,
)
async with get_celery_session_maker()() as session:
await run_notion_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_github_repos", bind=True)
def index_github_repos_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index GitHub repositories."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_github_repos(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_github_repos(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index GitHub repositories with new session."""
from app.routes.search_source_connectors_routes import (
run_github_indexing,
)
async with get_celery_session_maker()() as session:
await run_github_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_linear_issues", bind=True)
def index_linear_issues_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Linear issues."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_linear_issues(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_linear_issues(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Linear issues with new session."""
from app.routes.search_source_connectors_routes import (
run_linear_indexing,
)
async with get_celery_session_maker()() as session:
await run_linear_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_jira_issues", bind=True)
def index_jira_issues_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Jira issues."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_jira_issues(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_jira_issues(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Jira issues with new session."""
from app.routes.search_source_connectors_routes import (
run_jira_indexing,
)
async with get_celery_session_maker()() as session:
await run_jira_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_confluence_pages", bind=True)
def index_confluence_pages_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Confluence pages."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_confluence_pages(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_confluence_pages(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Confluence pages with new session."""
from app.routes.search_source_connectors_routes import (
run_confluence_indexing,
)
async with get_celery_session_maker()() as session:
await run_confluence_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_clickup_tasks", bind=True)
def index_clickup_tasks_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index ClickUp tasks."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_clickup_tasks(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_clickup_tasks(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index ClickUp tasks with new session."""
from app.routes.search_source_connectors_routes import (
run_clickup_indexing,
)
async with get_celery_session_maker()() as session:
await run_clickup_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_google_calendar_events", bind=True)
def index_google_calendar_events_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Google Calendar events."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_google_calendar_events(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_google_calendar_events(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Google Calendar events with new session."""
from app.routes.search_source_connectors_routes import (
run_google_calendar_indexing,
)
async with get_celery_session_maker()() as session:
await run_google_calendar_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_airtable_records", bind=True)
def index_airtable_records_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Airtable records."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_airtable_records(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_airtable_records(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Airtable records with new session."""
from app.routes.search_source_connectors_routes import (
run_airtable_indexing,
)
async with get_celery_session_maker()() as session:
await run_airtable_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_google_gmail_messages", bind=True)
def index_google_gmail_messages_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Google Gmail messages."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_google_gmail_messages(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_google_gmail_messages(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Google Gmail messages with new session."""
from app.routes.search_source_connectors_routes import (
run_google_gmail_indexing,
)
# TODO: derive max_messages and days_back from start_date and end_date.
# Until then, start_date and end_date are ignored and fixed defaults are used.
max_messages = 100
days_back = 30
async with get_celery_session_maker()() as session:
await run_google_gmail_indexing(
session, connector_id, search_space_id, user_id, max_messages, days_back
)
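
Review note: the Gmail task accepts `start_date` and `end_date` but passes fixed values for `max_messages` and `days_back`. If `days_back` is later derived from `start_date`, one way to do it is sketched below. It assumes the dates arrive as `YYYY-MM-DD` strings, which this diff does not confirm, and `days_back_from` is a hypothetical helper.

```python
from datetime import date, datetime


def days_back_from(start_date, today=None, default=30):
    """Return the whole days between start_date and today, at least 1.

    Falls back to `default` when start_date is missing or unparsable.
    """
    today = today or date.today()
    try:
        # Assumption: dates are formatted as YYYY-MM-DD.
        start = datetime.strptime(start_date, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return default
    # Clamp so a same-day or future start date still looks back one day.
    return max((today - start).days, 1)
```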
@celery_app.task(name="index_discord_messages", bind=True)
def index_discord_messages_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Discord messages."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_discord_messages(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_discord_messages(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Discord messages with new session."""
from app.routes.search_source_connectors_routes import (
run_discord_indexing,
)
async with get_celery_session_maker()() as session:
await run_discord_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_luma_events", bind=True)
def index_luma_events_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Luma events."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_luma_events(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_luma_events(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Luma events with new session."""
from app.routes.search_source_connectors_routes import (
run_luma_indexing,
)
async with get_celery_session_maker()() as session:
await run_luma_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)
@celery_app.task(name="index_elasticsearch_documents", bind=True)
def index_elasticsearch_documents_task(
self,
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Celery task to index Elasticsearch documents."""
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_elasticsearch_documents(
connector_id, search_space_id, user_id, start_date, end_date
)
)
finally:
loop.close()
async def _index_elasticsearch_documents(
connector_id: int,
search_space_id: int,
user_id: str,
start_date: str,
end_date: str,
):
"""Index Elasticsearch documents with new session."""
from app.routes.search_source_connectors_routes import (
run_elasticsearch_indexing,
)
async with get_celery_session_maker()() as session:
await run_elasticsearch_indexing(
session, connector_id, search_space_id, user_id, start_date, end_date
)


@ -0,0 +1,318 @@
"""Celery tasks for document processing."""
import logging
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine
from sqlalchemy.pool import NullPool
from app.celery_app import celery_app
from app.config import config
from app.services.task_logging_service import TaskLoggingService
from app.tasks.document_processors import (
add_crawled_url_document,
add_extension_received_document,
add_youtube_video_document,
)
logger = logging.getLogger(__name__)
def get_celery_session_maker():
"""
Create a new async session maker for Celery tasks.
This is necessary because Celery tasks run in a new event loop,
and the default session maker is bound to the main app's event loop.
"""
engine = create_async_engine(
config.DATABASE_URL,
poolclass=NullPool, # Don't use connection pooling for Celery tasks
echo=False,
)
return async_sessionmaker(engine, expire_on_commit=False)
@celery_app.task(name="process_extension_document", bind=True)
def process_extension_document_task(
self, individual_document_dict, search_space_id: int, user_id: str
):
"""
Celery task to process extension document.
Args:
individual_document_dict: Document data as dictionary
search_space_id: ID of the search space
user_id: ID of the user
"""
import asyncio
# Create a new event loop for this task
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_process_extension_document(
individual_document_dict, search_space_id, user_id
)
)
finally:
loop.close()
async def _process_extension_document(
individual_document_dict, search_space_id: int, user_id: str
):
"""Process extension document with new session."""
from pydantic import BaseModel
# Reconstruct the document object from the dict passed through Celery.
# TODO: replace these local stand-in models with the proper shared schema.
class DocumentMetadata(BaseModel):
VisitedWebPageTitle: str
VisitedWebPageURL: str
class IndividualDocument(BaseModel):
metadata: DocumentMetadata
content: str
individual_document = IndividualDocument(**individual_document_dict)
async with get_celery_session_maker()() as session:
task_logger = TaskLoggingService(session, search_space_id)
log_entry = await task_logger.log_task_start(
task_name="process_extension_document",
source="document_processor",
message=f"Starting processing of extension document from {individual_document.metadata.VisitedWebPageTitle}",
metadata={
"document_type": "EXTENSION",
"url": individual_document.metadata.VisitedWebPageURL,
"title": individual_document.metadata.VisitedWebPageTitle,
"user_id": user_id,
},
)
try:
result = await add_extension_received_document(
session, individual_document, search_space_id, user_id
)
if result:
await task_logger.log_task_success(
log_entry,
f"Successfully processed extension document: {individual_document.metadata.VisitedWebPageTitle}",
{"document_id": result.id, "content_hash": result.content_hash},
)
else:
await task_logger.log_task_success(
log_entry,
f"Extension document already exists (duplicate): {individual_document.metadata.VisitedWebPageTitle}",
{"duplicate_detected": True},
)
except Exception as e:
await task_logger.log_task_failure(
log_entry,
f"Failed to process extension document: {individual_document.metadata.VisitedWebPageTitle}",
str(e),
{"error_type": type(e).__name__},
)
logger.error(f"Error processing extension document: {e!s}")
raise


@celery_app.task(name="process_crawled_url", bind=True)
def process_crawled_url_task(self, url: str, search_space_id: int, user_id: str):
    """
    Celery task to process crawled URL.

    Args:
        url: URL to crawl and process
        search_space_id: ID of the search space
        user_id: ID of the user
    """
    import asyncio

    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    try:
        loop.run_until_complete(_process_crawled_url(url, search_space_id, user_id))
    finally:
        loop.close()


async def _process_crawled_url(url: str, search_space_id: int, user_id: str):
    """Process crawled URL with new session."""
    async with get_celery_session_maker()() as session:
        task_logger = TaskLoggingService(session, search_space_id)

        log_entry = await task_logger.log_task_start(
            task_name="process_crawled_url",
            source="document_processor",
            message=f"Starting URL crawling and processing for: {url}",
            metadata={"document_type": "CRAWLED_URL", "url": url, "user_id": user_id},
        )

        try:
            result = await add_crawled_url_document(
                session, url, search_space_id, user_id
            )

            if result:
                await task_logger.log_task_success(
                    log_entry,
                    f"Successfully crawled and processed URL: {url}",
                    {
                        "document_id": result.id,
                        "title": result.title,
                        "content_hash": result.content_hash,
                    },
                )
            else:
                await task_logger.log_task_success(
                    log_entry,
                    f"URL document already exists (duplicate): {url}",
                    {"duplicate_detected": True},
                )
        except Exception as e:
            await task_logger.log_task_failure(
                log_entry,
                f"Failed to crawl URL: {url}",
                str(e),
                {"error_type": type(e).__name__},
            )
            logger.error(f"Error processing crawled URL: {e!s}")
            raise


@celery_app.task(name="process_youtube_video", bind=True)
def process_youtube_video_task(self, url: str, search_space_id: int, user_id: str):
    """
    Celery task to process YouTube video.

    Args:
        url: YouTube video URL
        search_space_id: ID of the search space
        user_id: ID of the user
    """
    import asyncio

    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    try:
        loop.run_until_complete(_process_youtube_video(url, search_space_id, user_id))
    finally:
        loop.close()


async def _process_youtube_video(url: str, search_space_id: int, user_id: str):
    """Process YouTube video with new session."""
    async with get_celery_session_maker()() as session:
        task_logger = TaskLoggingService(session, search_space_id)

        log_entry = await task_logger.log_task_start(
            task_name="process_youtube_video",
            source="document_processor",
            message=f"Starting YouTube video processing for: {url}",
            metadata={"document_type": "YOUTUBE_VIDEO", "url": url, "user_id": user_id},
        )

        try:
            result = await add_youtube_video_document(
                session, url, search_space_id, user_id
            )

            if result:
                await task_logger.log_task_success(
                    log_entry,
                    f"Successfully processed YouTube video: {result.title}",
                    {
                        "document_id": result.id,
                        "video_id": result.document_metadata.get("video_id"),
                        "content_hash": result.content_hash,
                    },
                )
            else:
                await task_logger.log_task_success(
                    log_entry,
                    f"YouTube video document already exists (duplicate): {url}",
                    {"duplicate_detected": True},
                )
        except Exception as e:
            await task_logger.log_task_failure(
                log_entry,
                f"Failed to process YouTube video: {url}",
                str(e),
                {"error_type": type(e).__name__},
            )
            logger.error(f"Error processing YouTube video: {e!s}")
            raise


@celery_app.task(name="process_file_upload", bind=True)
def process_file_upload_task(
    self, file_path: str, filename: str, search_space_id: int, user_id: str
):
    """
    Celery task to process uploaded file.

    Args:
        file_path: Path to the uploaded file
        filename: Original filename
        search_space_id: ID of the search space
        user_id: ID of the user
    """
    import asyncio

    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    try:
        loop.run_until_complete(
            _process_file_upload(file_path, filename, search_space_id, user_id)
        )
    finally:
        loop.close()


async def _process_file_upload(
    file_path: str, filename: str, search_space_id: int, user_id: str
):
    """Process file upload with new session."""
    from app.routes.documents_routes import process_file_in_background

    async with get_celery_session_maker()() as session:
        task_logger = TaskLoggingService(session, search_space_id)

        log_entry = await task_logger.log_task_start(
            task_name="process_file_upload",
            source="document_processor",
            message=f"Starting file processing for: {filename}",
            metadata={
                "document_type": "FILE",
                "filename": filename,
                "file_path": file_path,
                "user_id": user_id,
            },
        )

        try:
            await process_file_in_background(
                file_path,
                filename,
                search_space_id,
                user_id,
                session,
                task_logger,
                log_entry,
            )
        except Exception as e:
            await task_logger.log_task_failure(
                log_entry,
                f"Failed to process file: {filename}",
                str(e),
                {"error_type": type(e).__name__},
            )
            logger.error(f"Error processing file: {e!s}")
            raise
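
The async helpers in this file all follow the same shape: log a start entry, run the work, log either success or "duplicate" depending on whether a result came back, and on any exception log a failure and re-raise so Celery still sees the task as failed. A minimal sketch of that shape, using only the standard library, might look like the following. `InMemoryTaskLogger` and `run_logged` are hypothetical names for illustration; they are not part of this commit, and the real `TaskLoggingService` takes more arguments than shown here.

```python
import asyncio


class InMemoryTaskLogger:
    """Hypothetical stand-in for TaskLoggingService; keeps events in a list."""

    def __init__(self):
        self.events = []

    async def log_task_start(self, task_name, message):
        self.events.append(("start", task_name, message))
        return len(self.events) - 1  # acts as the log entry handle

    async def log_task_success(self, log_entry, message, metadata):
        self.events.append(("success", message, metadata))

    async def log_task_failure(self, log_entry, message, error, metadata):
        self.events.append(("failure", message, error, metadata))


async def run_logged(task_logger, task_name, label, work):
    """Run the coroutine function `work` and record start/success/failure."""
    log_entry = await task_logger.log_task_start(task_name, f"Starting: {label}")
    try:
        result = await work()
        if result:
            await task_logger.log_task_success(
                log_entry, f"Processed: {label}", {"result": result}
            )
        else:
            # A falsy result is treated as "already exists", as in the tasks above.
            await task_logger.log_task_success(
                log_entry,
                f"Already exists (duplicate): {label}",
                {"duplicate_detected": True},
            )
        return result
    except Exception as e:
        await task_logger.log_task_failure(
            log_entry, f"Failed: {label}", str(e), {"error_type": type(e).__name__}
        )
        raise  # re-raise so the caller (Celery, in the real code) sees the failure
```

Re-raising after logging matters here: swallowing the exception would leave the Celery task marked as successful even though processing failed.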

@ -0,0 +1,66 @@
"""Celery tasks for podcast generation."""

import logging

from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine
from sqlalchemy.pool import NullPool

from app.celery_app import celery_app
from app.config import config
from app.tasks.podcast_tasks import generate_chat_podcast

logger = logging.getLogger(__name__)


def get_celery_session_maker():
    """
    Create a new async session maker for Celery tasks.
    This is necessary because Celery tasks run in a new event loop,
    and the default session maker is bound to the main app's event loop.
    """
    engine = create_async_engine(
        config.DATABASE_URL,
        poolclass=NullPool,  # Don't use connection pooling for Celery tasks
        echo=False,
    )
    return async_sessionmaker(engine, expire_on_commit=False)


@celery_app.task(name="generate_chat_podcast", bind=True)
def generate_chat_podcast_task(
    self, chat_id: int, search_space_id: int, podcast_title: str, user_id: int
):
    """
    Celery task to generate podcast from chat.

    Args:
        chat_id: ID of the chat to generate podcast from
        search_space_id: ID of the search space
        podcast_title: Title for the podcast
        user_id: ID of the user
    """
    import asyncio

    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    try:
        loop.run_until_complete(
            _generate_chat_podcast(chat_id, search_space_id, podcast_title, user_id)
        )
    finally:
        loop.close()


async def _generate_chat_podcast(
    chat_id: int, search_space_id: int, podcast_title: str, user_id: int
):
    """Generate chat podcast with new session."""
    async with get_celery_session_maker()() as session:
        try:
            await generate_chat_podcast(
                session, chat_id, search_space_id, podcast_title, user_id
            )
        except Exception as e:
            logger.error(f"Error generating podcast from chat: {e!s}")
            raise

@ -0,0 +1,13 @@
"""Celery worker startup script."""

import os
import sys

# Add the app directory to the Python path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "app"))

from app.celery_app import celery_app

if __name__ == "__main__":
    # Start the Celery worker
    celery_app.start()

@ -45,6 +45,9 @@ dependencies = [
"langchain-litellm>=0.2.3",
"elasticsearch>=9.1.1",
"faster-whisper>=1.1.0",
"celery[redis]>=5.5.3",
"flower>=2.0.1",
"redis>=5.2.1",
]
[dependency-groups]

@ -147,6 +147,18 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/dd/e2/88e425adac5ad887a087c38d04fe2030010572a3e0e627f8a6e8c33eeda8/alembic-1.16.2-py3-none-any.whl", hash = "sha256:5f42e9bd0afdbd1d5e3ad856c01754530367debdebf21ed6894e34af52b3bb03", size = 242717 },
]
[[package]]
name = "amqp"
version = "5.3.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "vine" },
]
sdist = { url = "https://files.pythonhosted.org/packages/79/fc/ec94a357dfc6683d8c86f8b4cfa5416a4c36b28052ec8260c77aca96a443/amqp-5.3.1.tar.gz", hash = "sha256:cddc00c725449522023bad949f70fff7b48f0b1ade74d170a6f10ab044739432", size = 129013 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/26/99/fc813cd978842c26c82534010ea849eee9ab3a13ea2b74e95cb9c99e747b/amqp-5.3.1-py3-none-any.whl", hash = "sha256:43b3319e1b4e7d1251833a93d672b4af1e40f3d632d479b98661a95f117880a2", size = 50944 },
]
[[package]]
name = "annotated-types"
version = "0.7.0"
@ -415,6 +427,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/50/cd/30110dc0ffcf3b131156077b90e9f60ed75711223f306da4db08eff8403b/beautifulsoup4-4.13.4-py3-none-any.whl", hash = "sha256:9bbbb14bfde9d79f38b8cd5f8c7c85f4b8f2523190ebed90e950a8dea4cb1c4b", size = 187285 },
]
[[package]]
name = "billiard"
version = "4.2.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/b9/6a/1405343016bce8354b29d90aad6b0bf6485b5e60404516e4b9a3a9646cf0/billiard-4.2.2.tar.gz", hash = "sha256:e815017a062b714958463e07ba15981d802dc53d41c5b69d28c5a7c238f8ecf3", size = 155592 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/a6/80/ef8dff49aae0e4430f81842f7403e14e0ca59db7bbaf7af41245b67c6b25/billiard-4.2.2-py3-none-any.whl", hash = "sha256:4bc05dcf0d1cc6addef470723aac2a6232f3c7ed7475b0b580473a9145829457", size = 86896 },
]
[[package]]
name = "blis"
version = "1.3.0"
@ -476,6 +497,30 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/9e/96/d32b941a501ab566a16358d68b6eb4e4acc373fab3c3c4d7d9e649f7b4bb/catalogue-2.0.10-py3-none-any.whl", hash = "sha256:58c2de0020aa90f4a2da7dfad161bf7b3b054c86a5f09fcedc0b2b740c109a9f", size = 17325 },
]
[[package]]
name = "celery"
version = "5.5.3"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "billiard" },
{ name = "click" },
{ name = "click-didyoumean" },
{ name = "click-plugins" },
{ name = "click-repl" },
{ name = "kombu" },
{ name = "python-dateutil" },
{ name = "vine" },
]
sdist = { url = "https://files.pythonhosted.org/packages/bb/7d/6c289f407d219ba36d8b384b42489ebdd0c84ce9c413875a8aae0c85f35b/celery-5.5.3.tar.gz", hash = "sha256:6c972ae7968c2b5281227f01c3a3f984037d21c5129d07bf3550cc2afc6b10a5", size = 1667144 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/c9/af/0dcccc7fdcdf170f9a1585e5e96b6fb0ba1749ef6be8c89a6202284759bd/celery-5.5.3-py3-none-any.whl", hash = "sha256:0b5761a07057acee94694464ca482416b959568904c9dfa41ce8413a7d65d525", size = 438775 },
]
[package.optional-dependencies]
redis = [
{ name = "kombu", extra = ["redis"] },
]
[[package]]
name = "certifi"
version = "2025.6.15"
@ -657,6 +702,43 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/85/32/10bb5764d90a8eee674e9dc6f4db6a0ab47c8c4d0d83c27f7c39ac415a4d/click-8.2.1-py3-none-any.whl", hash = "sha256:61a3265b914e850b85317d0b3109c7f8cd35a670f963866005d6ef1d5175a12b", size = 102215 },
]
[[package]]
name = "click-didyoumean"
version = "0.3.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "click" },
]
sdist = { url = "https://files.pythonhosted.org/packages/30/ce/217289b77c590ea1e7c24242d9ddd6e249e52c795ff10fac2c50062c48cb/click_didyoumean-0.3.1.tar.gz", hash = "sha256:4f82fdff0dbe64ef8ab2279bd6aa3f6a99c3b28c05aa09cbfc07c9d7fbb5a463", size = 3089 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/1b/5b/974430b5ffdb7a4f1941d13d83c64a0395114503cc357c6b9ae4ce5047ed/click_didyoumean-0.3.1-py3-none-any.whl", hash = "sha256:5c4bb6007cfea5f2fd6583a2fb6701a22a41eb98957e63d0fac41c10e7c3117c", size = 3631 },
]
[[package]]
name = "click-plugins"
version = "1.1.1.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "click" },
]
sdist = { url = "https://files.pythonhosted.org/packages/c3/a4/34847b59150da33690a36da3681d6bbc2ec14ee9a846bc30a6746e5984e4/click_plugins-1.1.1.2.tar.gz", hash = "sha256:d7af3984a99d243c131aa1a828331e7630f4a88a9741fd05c927b204bcf92261", size = 8343 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/3d/9a/2abecb28ae875e39c8cad711eb1186d8d14eab564705325e77e4e6ab9ae5/click_plugins-1.1.1.2-py2.py3-none-any.whl", hash = "sha256:008d65743833ffc1f5417bf0e78e8d2c23aab04d9745ba817bd3e71b0feb6aa6", size = 11051 },
]
[[package]]
name = "click-repl"
version = "0.3.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "click" },
{ name = "prompt-toolkit" },
]
sdist = { url = "https://files.pythonhosted.org/packages/cb/a2/57f4ac79838cfae6912f997b4d1a64a858fb0c86d7fcaae6f7b58d267fca/click-repl-0.3.0.tar.gz", hash = "sha256:17849c23dba3d667247dc4defe1757fff98694e90fe37474f3feebb69ced26a9", size = 10449 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/52/40/9d857001228658f0d59e97ebd4c346fe73e138c6de1bce61dc568a57c7f8/click_repl-0.3.0-py3-none-any.whl", hash = "sha256:fb7e06deb8da8de86180a33a9da97ac316751c094c6899382da7feeeeb51b812", size = 10289 },
]
[[package]]
name = "cloudpathlib"
version = "0.21.1"
@ -1424,6 +1506,22 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/b8/25/155f9f080d5e4bc0082edfda032ea2bc2b8fab3f4d25d46c1e9dd22a1a89/flatbuffers-25.2.10-py2.py3-none-any.whl", hash = "sha256:ebba5f4d5ea615af3f7fd70fc310636fbb2bbd1f566ac0a23d98dd412de50051", size = 30953 },
]
[[package]]
name = "flower"
version = "2.0.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "celery" },
{ name = "humanize" },
{ name = "prometheus-client" },
{ name = "pytz" },
{ name = "tornado" },
]
sdist = { url = "https://files.pythonhosted.org/packages/09/a1/357f1b5d8946deafdcfdd604f51baae9de10aafa2908d0b7322597155f92/flower-2.0.1.tar.gz", hash = "sha256:5ab717b979530770c16afb48b50d2a98d23c3e9fe39851dcf6bc4d01845a02a0", size = 3220408 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/a6/ff/ee2f67c0ff146ec98b5df1df637b2bc2d17beeb05df9f427a67bd7a7d79c/flower-2.0.1-py2.py3-none-any.whl", hash = "sha256:9db2c621eeefbc844c8dd88be64aef61e84e2deb29b271e02ab2b5b9f01068e2", size = 383553 },
]
[[package]]
name = "fonttools"
version = "4.58.4"
@ -1921,6 +2019,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/f0/0f/310fb31e39e2d734ccaa2c0fb981ee41f7bd5056ce9bc29b2248bd569169/humanfriendly-10.0-py2.py3-none-any.whl", hash = "sha256:1697e1a8a8f550fd43c2865cd84542fc175a61dcb779b6fee18cf6b6ccba1477", size = 86794 },
]
[[package]]
name = "humanize"
version = "4.14.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/b6/43/50033d25ad96a7f3845f40999b4778f753c3901a11808a584fed7c00d9f5/humanize-4.14.0.tar.gz", hash = "sha256:2fa092705ea640d605c435b1ca82b2866a1b601cdf96f076d70b79a855eba90d", size = 82939 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/c3/5b/9512c5fb6c8218332b530f13500c6ff5f3ce3342f35e0dd7be9ac3856fd3/humanize-4.14.0-py3-none-any.whl", hash = "sha256:d57701248d040ad456092820e6fde56c930f17749956ac47f4f655c0c547bfff", size = 132092 },
]
[[package]]
name = "hyperframe"
version = "6.1.0"
@ -2259,6 +2366,26 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/ea/cc/75f41633c75224ba820a4533163bc8b070b6bf25416014074c63284c2d4e/kokoro-0.9.4-py3-none-any.whl", hash = "sha256:a129dc6364a286bd6a92c396e9862459d3d3e45f2c15596ed5a94dcee5789efd", size = 32592 },
]
[[package]]
name = "kombu"
version = "5.5.4"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "amqp" },
{ name = "packaging" },
{ name = "tzdata" },
{ name = "vine" },
]
sdist = { url = "https://files.pythonhosted.org/packages/0f/d3/5ff936d8319ac86b9c409f1501b07c426e6ad41966fedace9ef1b966e23f/kombu-5.5.4.tar.gz", hash = "sha256:886600168275ebeada93b888e831352fe578168342f0d1d5833d88ba0d847363", size = 461992 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/ef/70/a07dcf4f62598c8ad579df241af55ced65bed76e42e45d3c368a6d82dbc1/kombu-5.5.4-py3-none-any.whl", hash = "sha256:a12ed0557c238897d8e518f1d1fdf84bd1516c5e305af2dacd85c2015115feb8", size = 210034 },
]
[package.optional-dependencies]
redis = [
{ name = "redis" },
]
[[package]]
name = "kubernetes"
version = "33.1.0"
@ -4029,6 +4156,27 @@ version = "1.6"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/2a/68/d8412d1e0d70edf9791cbac5426dc859f4649afc22f2abbeb0d947cf70fd/progress-1.6.tar.gz", hash = "sha256:c9c86e98b5c03fa1fe11e3b67c1feda4788b8d0fe7336c2ff7d5644ccfba34cd", size = 7842 }
[[package]]
name = "prometheus-client"
version = "0.23.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/23/53/3edb5d68ecf6b38fcbcc1ad28391117d2a322d9a1a3eff04bfdb184d8c3b/prometheus_client-0.23.1.tar.gz", hash = "sha256:6ae8f9081eaaaf153a2e959d2e6c4f4fb57b12ef76c8c7980202f1e57b48b2ce", size = 80481 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/b8/db/14bafcb4af2139e046d03fd00dea7873e48eafe18b7d2797e73d6681f210/prometheus_client-0.23.1-py3-none-any.whl", hash = "sha256:dd1913e6e76b59cfe44e7a4b83e01afc9873c1bdfd2ed8739f1e76aeca115f99", size = 61145 },
]
[[package]]
name = "prompt-toolkit"
version = "3.0.52"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "wcwidth" },
]
sdist = { url = "https://files.pythonhosted.org/packages/a1/96/06e01a7b38dce6fe1db213e061a4602dd6032a8a97ef6c1a862537732421/prompt_toolkit-3.0.52.tar.gz", hash = "sha256:28cde192929c8e7321de85de1ddbe736f1375148b02f2e17edd840042b1be855", size = 434198 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/84/03/0d3ce49e2505ae70cf43bc5bb3033955d2fc9f932163e84dc0779cc47f48/prompt_toolkit-3.0.52-py3-none-any.whl", hash = "sha256:9aac639a3bbd33284347de5ad8d68ecc044b91a762dc39b7c21095fcd6a19955", size = 391431 },
]
[[package]]
name = "propcache"
version = "0.3.2"
@ -4727,6 +4875,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/e1/67/921ec3024056483db83953ae8e48079ad62b92db7880013ca77632921dd0/readme_renderer-44.0-py3-none-any.whl", hash = "sha256:2fbca89b81a08526aadf1357a8c2ae889ec05fb03f5da67f9769c9a592166151", size = 13310 },
]
[[package]]
name = "redis"
version = "5.2.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/47/da/d283a37303a995cd36f8b92db85135153dc4f7a8e4441aa827721b442cfb/redis-5.2.1.tar.gz", hash = "sha256:16f2e22dff21d5125e8481515e386711a34cbec50f0e44413dd7d9c060a54e0f", size = 4608355 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/3c/5f/fa26b9b2672cbe30e07d9a5bdf39cf16e3b80b42916757c5f92bca88e4ba/redis-5.2.1-py3-none-any.whl", hash = "sha256:ee7e1056b9aea0f04c6c2ed59452947f34c4940ee025f5dd83e6a6418b6989e4", size = 261502 },
]
[[package]]
name = "referencing"
version = "0.36.2"
@ -5426,6 +5583,7 @@ source = { virtual = "." }
dependencies = [
{ name = "alembic" },
{ name = "asyncpg" },
{ name = "celery", extra = ["redis"] },
{ name = "chonkie", extra = ["all"] },
{ name = "discord-py" },
{ name = "docling" },
@ -5435,6 +5593,7 @@ dependencies = [
{ name = "fastapi-users", extra = ["oauth", "sqlalchemy"] },
{ name = "faster-whisper" },
{ name = "firecrawl-py" },
{ name = "flower" },
{ name = "github3-py" },
{ name = "google-api-python-client" },
{ name = "google-auth-oauthlib" },
@ -5452,6 +5611,7 @@ dependencies = [
{ name = "pgvector" },
{ name = "playwright" },
{ name = "python-ffmpeg" },
{ name = "redis" },
{ name = "rerankers", extra = ["flashrank"] },
{ name = "sentence-transformers" },
{ name = "slack-sdk" },
@ -5475,6 +5635,7 @@ dev = [
requires-dist = [
{ name = "alembic", specifier = ">=1.13.0" },
{ name = "asyncpg", specifier = ">=0.30.0" },
{ name = "celery", extras = ["redis"], specifier = ">=5.5.3" },
{ name = "chonkie", extras = ["all"], specifier = ">=1.0.6" },
{ name = "discord-py", specifier = ">=2.5.2" },
{ name = "docling", specifier = ">=2.15.0" },
@ -5484,6 +5645,7 @@ requires-dist = [
{ name = "fastapi-users", extras = ["oauth", "sqlalchemy"], specifier = ">=14.0.1" },
{ name = "faster-whisper", specifier = ">=1.1.0" },
{ name = "firecrawl-py", specifier = ">=1.12.0" },
{ name = "flower", specifier = ">=2.0.1" },
{ name = "github3-py", specifier = "==4.0.1" },
{ name = "google-api-python-client", specifier = ">=2.156.0" },
{ name = "google-auth-oauthlib", specifier = ">=1.2.1" },
@ -5501,6 +5663,7 @@ requires-dist = [
{ name = "pgvector", specifier = ">=0.3.6" },
{ name = "playwright", specifier = ">=1.50.0" },
{ name = "python-ffmpeg", specifier = ">=2.0.12" },
{ name = "redis", specifier = ">=5.2.1" },
{ name = "rerankers", extras = ["flashrank"], specifier = ">=0.7.1" },
{ name = "sentence-transformers", specifier = ">=3.4.1" },
{ name = "slack-sdk", specifier = ">=3.34.0" },
@ -5751,6 +5914,25 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/ab/c0/131628e6d42682b0502c63fd7f647b8b5ca4bd94088f6c85ca7225db8ac4/torchvision-0.22.1-cp313-cp313t-win_amd64.whl", hash = "sha256:7414eeacfb941fa21acddcd725f1617da5630ec822e498660a4b864d7d998075", size = 1629892 },
]
[[package]]
name = "tornado"
version = "6.5.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/09/ce/1eb500eae19f4648281bb2186927bb062d2438c2e5093d1360391afd2f90/tornado-6.5.2.tar.gz", hash = "sha256:ab53c8f9a0fa351e2c0741284e06c7a45da86afb544133201c5cc8578eb076a0", size = 510821 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/f6/48/6a7529df2c9cc12efd2e8f5dd219516184d703b34c06786809670df5b3bd/tornado-6.5.2-cp39-abi3-macosx_10_9_universal2.whl", hash = "sha256:2436822940d37cde62771cff8774f4f00b3c8024fe482e16ca8387b8a2724db6", size = 442563 },
{ url = "https://files.pythonhosted.org/packages/f2/b5/9b575a0ed3e50b00c40b08cbce82eb618229091d09f6d14bce80fc01cb0b/tornado-6.5.2-cp39-abi3-macosx_10_9_x86_64.whl", hash = "sha256:583a52c7aa94ee046854ba81d9ebb6c81ec0fd30386d96f7640c96dad45a03ef", size = 440729 },
{ url = "https://files.pythonhosted.org/packages/1b/4e/619174f52b120efcf23633c817fd3fed867c30bff785e2cd5a53a70e483c/tornado-6.5.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b0fe179f28d597deab2842b86ed4060deec7388f1fd9c1b4a41adf8af058907e", size = 444295 },
{ url = "https://files.pythonhosted.org/packages/95/fa/87b41709552bbd393c85dd18e4e3499dcd8983f66e7972926db8d96aa065/tornado-6.5.2-cp39-abi3-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:b186e85d1e3536d69583d2298423744740986018e393d0321df7340e71898882", size = 443644 },
{ url = "https://files.pythonhosted.org/packages/f9/41/fb15f06e33d7430ca89420283a8762a4e6b8025b800ea51796ab5e6d9559/tornado-6.5.2-cp39-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e792706668c87709709c18b353da1f7662317b563ff69f00bab83595940c7108", size = 443878 },
{ url = "https://files.pythonhosted.org/packages/11/92/fe6d57da897776ad2e01e279170ea8ae726755b045fe5ac73b75357a5a3f/tornado-6.5.2-cp39-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:06ceb1300fd70cb20e43b1ad8aaee0266e69e7ced38fa910ad2e03285009ce7c", size = 444549 },
{ url = "https://files.pythonhosted.org/packages/9b/02/c8f4f6c9204526daf3d760f4aa555a7a33ad0e60843eac025ccfd6ff4a93/tornado-6.5.2-cp39-abi3-musllinux_1_2_i686.whl", hash = "sha256:74db443e0f5251be86cbf37929f84d8c20c27a355dd452a5cfa2aada0d001ec4", size = 443973 },
{ url = "https://files.pythonhosted.org/packages/ae/2d/f5f5707b655ce2317190183868cd0f6822a1121b4baeae509ceb9590d0bd/tornado-6.5.2-cp39-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:b5e735ab2889d7ed33b32a459cac490eda71a1ba6857b0118de476ab6c366c04", size = 443954 },
{ url = "https://files.pythonhosted.org/packages/e8/59/593bd0f40f7355806bf6573b47b8c22f8e1374c9b6fd03114bd6b7a3dcfd/tornado-6.5.2-cp39-abi3-win32.whl", hash = "sha256:c6f29e94d9b37a95013bb669616352ddb82e3bfe8326fccee50583caebc8a5f0", size = 445023 },
{ url = "https://files.pythonhosted.org/packages/c7/2a/f609b420c2f564a748a2d80ebfb2ee02a73ca80223af712fca591386cafb/tornado-6.5.2-cp39-abi3-win_amd64.whl", hash = "sha256:e56a5af51cc30dd2cae649429af65ca2f6571da29504a07995175df14c18f35f", size = 445427 },
{ url = "https://files.pythonhosted.org/packages/5e/4f/e1f65e8f8c76d73658b33d33b81eed4322fb5085350e4328d5c956f0c8f9/tornado-6.5.2-cp39-abi3-win_arm64.whl", hash = "sha256:d6c33dc3672e3a1f3618eb63b7ef4683a7688e7b9e6e8f0d9aa5726360a004af", size = 444456 },
]
[[package]]
name = "tqdm"
version = "4.67.1"
@ -6180,6 +6362,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/fa/6e/3e955517e22cbdd565f2f8b2e73d52528b14b8bcfdb04f62466b071de847/validators-0.35.0-py3-none-any.whl", hash = "sha256:e8c947097eae7892cb3d26868d637f79f47b4a0554bc6b80065dfe5aac3705dd", size = 44712 },
]
[[package]]
name = "vine"
version = "5.1.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/bd/e4/d07b5f29d283596b9727dd5275ccbceb63c44a1a82aa9e4bfd20426762ac/vine-5.1.0.tar.gz", hash = "sha256:8b62e981d35c41049211cf62a0a1242d8c1ee9bd15bb196ce38aefd6799e61e0", size = 48980 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/03/ff/7c0c86c43b3cbb927e0ccc0255cb4057ceba4799cd44ae95174ce8e8b5b2/vine-5.1.0-py3-none-any.whl", hash = "sha256:40fdf3c48b2cfe1c38a49e9ae2da6fda88e4794c810050a728bd7413811fb1dc", size = 9636 },
]
[[package]]
name = "wasabi"
version = "1.1.3"
@ -6259,6 +6450,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/32/fa/a4f5c2046385492b2273213ef815bf71a0d4c1943b784fb904e184e30201/watchfiles-1.1.0-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:af06c863f152005c7592df1d6a7009c836a247c9d8adb78fef8575a5a98699db", size = 623315 },
]
[[package]]
name = "wcwidth"
version = "0.2.14"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/24/30/6b0809f4510673dc723187aeaf24c7f5459922d01e2f794277a3dfb90345/wcwidth-0.2.14.tar.gz", hash = "sha256:4d478375d31bc5395a3c55c40ccdf3354688364cd61c4f6adacaa9215d0b3605", size = 102293 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/af/b5/123f13c975e9f27ab9c0770f514345bd406d0e8d3b7a0723af9d43f710af/wcwidth-0.2.14-py2.py3-none-any.whl", hash = "sha256:a7bb560c8aee30f9957e5f9895805edd20602f2d7f720186dfd906e82b4982e1", size = 37286 },
]
[[package]]
name = "weasel"
version = "0.4.1"

@ -17,7 +17,7 @@ Before you begin, ensure you have:
- [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/install/) installed on your machine
- [Git](https://git-scm.com/downloads) (to clone the repository)
- Completed all the [prerequisite setup steps](/docs) including:
- PGVector setup
- Auth setup
- **File Processing ETL Service** (choose one):
- Unstructured.io API key (Supports 34+ formats)
- LlamaIndex API key (enhanced parsing, supports 50+ formats)
@ -56,7 +56,7 @@ Before you begin, ensure you have:
Edit all `.env` files and fill in the required values:
### Docker-Specific Environment Variables
### Docker-Specific Environment Variables (Optional)
| ENV VARIABLE | DESCRIPTION | DEFAULT VALUE |
|----------------------------|-----------------------------------------------------------------------------|---------------------|
@ -64,6 +64,8 @@ Before you begin, ensure you have:
| BACKEND_PORT | Port for the backend API service | 8000 |
| POSTGRES_PORT | Port for the PostgreSQL database | 5432 |
| PGADMIN_PORT | Port for pgAdmin web interface | 5050 |
| REDIS_PORT | Port for Redis (used by Celery) | 6379 |
| FLOWER_PORT | Port for Flower (Celery monitoring tool) | 5555 |
| POSTGRES_USER | PostgreSQL username | postgres |
| POSTGRES_PASSWORD | PostgreSQL password | postgres |
| POSTGRES_DB | PostgreSQL database name | surfsense |
@ -81,7 +83,7 @@ Before you begin, ensure you have:
| AUTH_TYPE | Authentication method: `GOOGLE` for OAuth with Google, `LOCAL` for email/password authentication |
| GOOGLE_OAUTH_CLIENT_ID | (Optional) Client ID from Google Cloud Console (required if AUTH_TYPE=GOOGLE) |
| GOOGLE_OAUTH_CLIENT_SECRET | (Optional) Client secret from Google Cloud Console (required if AUTH_TYPE=GOOGLE) |
| EMBEDDING_MODEL | Name of the embedding model (e.g., `mixedbread-ai/mxbai-embed-large-v1`) |
| EMBEDDING_MODEL | Name of the embedding model (e.g., `sentence-transformers/all-MiniLM-L6-v2`, `openai://text-embedding-ada-002`) |
| RERANKERS_MODEL_NAME | Name of the reranker model (e.g., `ms-marco-MiniLM-L-12-v2`) |
| RERANKERS_MODEL_TYPE | Type of reranker model (e.g., `flashrank`) |
| TTS_SERVICE | Text-to-Speech API provider for Podcasts (e.g., `local/kokoro`, `openai/tts-1`). See [supported providers](https://docs.litellm.ai/docs/text_to_speech#supported-providers) |
@ -94,6 +96,8 @@ Before you begin, ensure you have:
| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats), `LLAMACLOUD` (supports 50+ formats including legacy document types), or `DOCLING` (local processing, supports PDF, Office docs, images, HTML, CSV) |
| UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing (required if ETL_SERVICE=UNSTRUCTURED) |
| LLAMA_CLOUD_API_KEY | API key for LlamaCloud service for document parsing (required if ETL_SERVICE=LLAMACLOUD) |
| CELERY_BROKER_URL | Redis connection URL for Celery broker (e.g., `redis://localhost:6379/0`) |
| CELERY_RESULT_BACKEND | Redis connection URL for Celery result backend (e.g., `redis://localhost:6379/0`) |
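
To make the structure of these two URLs concrete, the sketch below reads `CELERY_BROKER_URL` from the environment (falling back to the example value from the table) and splits it into host, port and database index using only the standard library. `parse_redis_url` is a hypothetical helper for illustration; it is not part of SurfSense, and the way the backend actually consumes these variables is not shown in this diff.

```python
import os
from urllib.parse import urlsplit


def parse_redis_url(url: str) -> tuple[str, int, int]:
    """Split a redis:// URL into (host, port, db). Hypothetical helper."""
    parts = urlsplit(url)
    if parts.scheme not in ("redis", "rediss"):
        raise ValueError(f"Not a Redis URL: {url!r}")
    host = parts.hostname or "localhost"
    port = parts.port or 6379  # 6379 is the default Redis port
    db_text = parts.path.lstrip("/")
    db = int(db_text) if db_text else 0  # the trailing "/0" selects database 0
    return host, port, db


if __name__ == "__main__":
    broker_url = os.environ.get("CELERY_BROKER_URL", "redis://localhost:6379/0")
    print(parse_redis_url(broker_url))
```

When the backend runs inside Docker Compose, the host portion is usually the Redis service name rather than `localhost`; check the compose file for the actual name.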
**Optional Backend LangSmith Observability:**

@ -4,50 +4,8 @@ description: Required setup's before setting up SurfSense
full: true
---
## PGVector installation Guide
SurfSense requires the pgvector extension for PostgreSQL:
### Linux and Mac
Compile and install the extension (supports Postgres 13+)
```sh
cd /tmp
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo
```
See the [installation notes](https://github.com/pgvector/pgvector/tree/master#installation-notes---linux-and-mac) if you run into issues
### Windows
Ensure [C++ support in Visual Studio](https://learn.microsoft.com/en-us/cpp/build/building-on-the-command-line?view=msvc-170#download-and-install-the-tools) is installed, and run:
```cmd
call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
```
Note: The exact path will vary depending on your Visual Studio version and edition
Then use `nmake` to build:
```cmd
set "PGROOT=C:\Program Files\PostgreSQL\16"
cd %TEMP%
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
nmake /F Makefile.win
nmake /F Makefile.win install
```
See the [installation notes](https://github.com/pgvector/pgvector/tree/master#installation-notes---windows) if you run into issues
---
## Google OAuth Setup (Optional)
## Auth Setup
SurfSense supports both Google OAuth and local email/password authentication. Google OAuth is optional - if you prefer local authentication, you can skip this section.

@ -10,14 +10,28 @@ This guide provides step-by-step instructions for setting up SurfSense without D
## Prerequisites
Before beginning the manual installation, ensure you have completed all the [prerequisite setup steps](/docs), including:
Before beginning the manual installation, ensure you have the following installed and configured:
- PGVector setup
### Required Software
- **Python 3.12+** - Backend runtime environment
- **Node.js 20+** - Frontend runtime environment
- **PostgreSQL 14+** - Database server
- **PGVector** - PostgreSQL extension for vector similarity search
- **Redis** - Message broker for Celery task queue
- **Git** - Version control (to clone the repository)
### Required Services & API Keys
Complete all the [setup steps](/docs), including:
- **Authentication Setup** (choose one):
- Google OAuth credentials (for `AUTH_TYPE=GOOGLE`)
- Local authentication setup (for `AUTH_TYPE=LOCAL`)
- **File Processing ETL Service** (choose one):
- Unstructured.io API key (Supports 34+ formats)
- LlamaIndex API key (enhanced parsing, supports 50+ formats)
- Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
- Other required API keys
- Unstructured.io API key (Supports 34+ formats)
- LlamaCloud API key (enhanced parsing, supports 50+ formats)
- Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
- **Other API keys** as needed for your use case
## Backend Setup
@ -58,7 +72,7 @@ Edit the `.env` file and set the following variables:
| AUTH_TYPE | Authentication method: `GOOGLE` for OAuth with Google, `LOCAL` for email/password authentication |
| GOOGLE_OAUTH_CLIENT_ID | (Optional) Client ID from Google Cloud Console (required if AUTH_TYPE=GOOGLE) |
| GOOGLE_OAUTH_CLIENT_SECRET | (Optional) Client secret from Google Cloud Console (required if AUTH_TYPE=GOOGLE) |
| EMBEDDING_MODEL | Name of the embedding model (e.g., `mixedbread-ai/mxbai-embed-large-v1`) |
| EMBEDDING_MODEL | Name of the embedding model (e.g., `sentence-transformers/all-MiniLM-L6-v2`, `openai://text-embedding-ada-002`) |
| RERANKERS_MODEL_NAME | Name of the reranker model (e.g., `ms-marco-MiniLM-L-12-v2`) |
| RERANKERS_MODEL_TYPE | Type of reranker model (e.g., `flashrank`) |
| TTS_SERVICE | Text-to-Speech API provider for Podcasts (e.g., `local/kokoro`, `openai/tts-1`). See [supported providers](https://docs.litellm.ai/docs/text_to_speech#supported-providers) |
@ -70,9 +84,11 @@ Edit the `.env` file and set the following variables:
| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats), `LLAMACLOUD` (supports 50+ formats including legacy document types), or `DOCLING` (local processing, supports PDF, Office docs, images, HTML, CSV) |
| UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing (required if ETL_SERVICE=UNSTRUCTURED) |
| LLAMA_CLOUD_API_KEY | API key for LlamaCloud service for document parsing (required if ETL_SERVICE=LLAMACLOUD) |
| CELERY_BROKER_URL | Redis connection URL for Celery broker (e.g., `redis://localhost:6379/0`) |
| CELERY_RESULT_BACKEND | Redis connection URL for Celery result backend (e.g., `redis://localhost:6379/0`) |
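
Both Celery variables take a Redis URL of the form `redis://host:port/db`. As a minimal sketch for sanity-checking a value before starting the worker, the Python standard library can split such a URL into its parts. The helper name `parse_redis_url` is hypothetical and is not part of SurfSense or Celery; the fallback values (`localhost`, `6379`, database `0`) are assumptions for illustration.

```python
from urllib.parse import urlparse


def parse_redis_url(url: str) -> dict:
    """Split a Redis URL into host, port, and database index.

    Hypothetical helper for illustration; not part of SurfSense or Celery.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("redis", "rediss"):
        raise ValueError(f"expected a redis:// or rediss:// URL, got {url!r}")
    # The path component holds the database index, e.g. "/0".
    db_text = parsed.path.lstrip("/")
    return {
        "host": parsed.hostname or "localhost",
        "port": parsed.port or 6379,
        "db": int(db_text) if db_text else 0,
    }


if __name__ == "__main__":
    print(parse_redis_url("redis://localhost:6379/0"))
```

If the function raises `ValueError`, the value in `.env` is probably not a Redis URL at all (for example, an `http://` address pasted by mistake).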
**Optional Backend LangSmith Observability:**
**(Optional) Backend LangSmith Observability:**
| ENV VARIABLE | DESCRIPTION |
|--------------|-------------|
| LANGSMITH_TRACING | Enable LangSmith tracing (e.g., `true`) |
@ -80,7 +96,7 @@ Edit the `.env` file and set the following variables:
| LANGSMITH_API_KEY | Your LangSmith API key |
| LANGSMITH_PROJECT | LangSmith project name (e.g., `surfsense`) |
**Uvicorn Server Configuration**
**(Optional) Uvicorn Server Configuration**
| ENV VARIABLE | DESCRIPTION | DEFAULT VALUE |
|------------------------------|---------------------------------------------|---------------|
| UVICORN_HOST | Host address to bind the server | 0.0.0.0 |
@ -149,7 +165,91 @@ uv sync
uv sync
```
### 3. Run the Backend
### 3. Start Redis Server
Redis is required for the Celery task queue. Start the Redis server:
**Linux:**
```bash
# Start Redis as a system service
sudo systemctl start redis
# Or run Redis directly in the foreground
redis-server
```
**macOS:**
```bash
# If installed via Homebrew
brew services start redis
# Or run directly
redis-server
```
**Windows:**
```powershell
# Option 1: If using Redis on Windows (via WSL or Windows port)
redis-server
# Option 2: If installed as a Windows service
net start Redis
```
**Alternative for Windows - Run Redis in Docker:**
If you have Docker Desktop installed, you can run Redis in a container:
```powershell
# Pull and run Redis container
docker run -d --name redis -p 6379:6379 redis:latest
# To stop Redis
docker stop redis
# To start Redis again
docker start redis
# To remove Redis container
docker rm -f redis
```
Verify Redis is running by connecting to it:
```bash
redis-cli ping
# Should return: PONG
```
### 4. Start Celery Worker
In a new terminal window, start the Celery worker to handle background tasks:
**Linux/macOS/Windows:**
```bash
# Make sure you're in the surfsense_backend directory
cd surfsense_backend
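# Start Celery worker
# --pool=solo runs tasks in the worker's main process (no child processes), which also works on Windows
uv run celery -A celery_worker.celery_app worker --loglevel=info --concurrency=1 --pool=solo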
```
**Optional: Start Flower for monitoring Celery tasks:**
In another terminal window:
```bash
# Start Flower (Celery monitoring tool)
uv run celery -A celery_worker.celery_app flower --port=5555
```
Access Flower at [http://localhost:5555](http://localhost:5555) to monitor your Celery tasks.
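
To confirm from a script that Flower is serving its dashboard, a minimal sketch using only the Python standard library might look like the following. It assumes Flower runs on the default URL shown above with no authentication configured; the function name `flower_is_up` is hypothetical.

```python
import urllib.request


def flower_is_up(url: str = "http://localhost:5555", timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET to the Flower URL succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError both derive from OSError; this also covers timeouts.
        return False


if __name__ == "__main__":
    print("Flower reachable:", flower_is_up())
```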
### 5. Run the Backend
Start the backend server:
@ -303,9 +403,11 @@ To verify your installation:
## Troubleshooting
- **Database Connection Issues**: Verify your PostgreSQL server is running and pgvector is properly installed
- **Redis Connection Issues**: Ensure the Redis server is running (`redis-cli ping` should return `PONG`). Check that `CELERY_BROKER_URL` and `CELERY_RESULT_BACKEND` are correctly set in your `.env` file
- **Celery Worker Issues**: Make sure the Celery worker is running in a separate terminal. Check worker logs for any errors
- **Authentication Problems**: Check your Google OAuth configuration and ensure redirect URIs are set correctly
- **LLM Errors**: Confirm your LLM API keys are valid and the selected models are accessible
- **File Upload Failures**: Validate your Unstructured.io API key
- **File Upload Failures**: Validate your ETL service API key (Unstructured.io or LlamaCloud) or ensure Docling is properly configured
- **Windows-specific**: If you encounter path issues, ensure you're using the correct path separator (`\` instead of `/`)
- **macOS-specific**: If you encounter permission issues, you may need to use `sudo` for some installation commands