Merge branch 'dev' into fix/replace-transition-all-with-specific-transitions

Soham Bhattacharjee 2026-04-08 05:38:30 +05:30 committed by GitHub
commit e404b05b11
GPG key ID: B5690EEEBB952194
295 changed files with 25773 additions and 10799 deletions

@@ -5,6 +5,20 @@ on:
 tags:
 - 'v*'
 - 'beta-v*'
+workflow_dispatch:
+inputs:
+version:
+description: 'Version number (e.g. 0.0.15) — used for dry-run testing without a tag'
+required: true
+default: '0.0.0-test'
+publish:
+description: 'Publish to GitHub Releases'
+required: true
+type: choice
+options:
+- never
+- always
+default: 'never'
 permissions:
 contents: write
@@ -25,24 +39,28 @@ jobs:
 steps:
 - name: Checkout
-uses: actions/checkout@v4
+uses: actions/checkout@v5
-- name: Extract version from tag
+- name: Extract version
 id: version
 shell: bash
 run: |
-TAG=${GITHUB_REF#refs/tags/}
-VERSION=${TAG#beta-}
-VERSION=${VERSION#v}
+if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
+VERSION="${{ inputs.version }}"
+else
+TAG=${GITHUB_REF#refs/tags/}
+VERSION=${TAG#beta-}
+VERSION=${VERSION#v}
+fi
 echo "VERSION=$VERSION" >> "$GITHUB_OUTPUT"
 - name: Setup pnpm
-uses: pnpm/action-setup@v4
+uses: pnpm/action-setup@v5
 - name: Setup Node.js
-uses: actions/setup-node@v4
+uses: actions/setup-node@v5
 with:
-node-version: 20
+node-version: 22
 cache: 'pnpm'
 cache-dependency-path: |
 surfsense_web/pnpm-lock.yaml
@@ -60,6 +78,7 @@ jobs:
 NEXT_PUBLIC_ZERO_CACHE_URL: ${{ vars.NEXT_PUBLIC_ZERO_CACHE_URL }}
 NEXT_PUBLIC_DEPLOYMENT_MODE: ${{ vars.NEXT_PUBLIC_DEPLOYMENT_MODE }}
 NEXT_PUBLIC_FASTAPI_BACKEND_AUTH_TYPE: ${{ vars.NEXT_PUBLIC_FASTAPI_BACKEND_AUTH_TYPE }}
+NEXT_PUBLIC_POSTHOG_KEY: ${{ secrets.NEXT_PUBLIC_POSTHOG_KEY }}
 - name: Install desktop dependencies
 run: pnpm install
@@ -70,9 +89,12 @@ jobs:
 working-directory: surfsense_desktop
 env:
 HOSTED_FRONTEND_URL: ${{ vars.HOSTED_FRONTEND_URL }}
+POSTHOG_KEY: ${{ secrets.POSTHOG_KEY }}
+POSTHOG_HOST: ${{ vars.POSTHOG_HOST }}
 - name: Package & Publish
-run: pnpm exec electron-builder ${{ matrix.platform }} --config electron-builder.yml --publish always -c.extraMetadata.version=${{ steps.version.outputs.VERSION }}
+shell: bash
+run: pnpm exec electron-builder ${{ matrix.platform }} --config electron-builder.yml --publish ${{ inputs.publish || 'always' }} -c.extraMetadata.version=${{ steps.version.outputs.VERSION }}
 working-directory: surfsense_desktop
 env:
 GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
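
The version and publish-mode logic in the workflow hunks above can be tried locally with a small bash sketch. This is only an illustration under assumptions: `resolve_version` and `resolve_publish_mode` are hypothetical helper names that do not exist in the repository, and the event name, ref, and inputs are passed as plain arguments instead of coming from the `github` and `inputs` contexts.

```bash
#!/usr/bin/env bash
# Local sketch of the "Extract version" step and the --publish fallback.
# Both function names are hypothetical; the workflow inlines this logic.

# resolve_version EVENT_NAME GIT_REF INPUT_VERSION
# Manual runs use the supplied version; tag pushes derive it from the ref.
resolve_version() {
  local event_name="$1" git_ref="$2" input_version="$3"
  local tag version
  if [ "$event_name" = "workflow_dispatch" ]; then
    version="$input_version"
  else
    tag="${git_ref#refs/tags/}"   # refs/tags/beta-v0.0.15 -> beta-v0.0.15
    version="${tag#beta-}"        # beta-v0.0.15 -> v0.0.15
    version="${version#v}"        # v0.0.15 -> 0.0.15
  fi
  printf '%s\n' "$version"
}

# resolve_publish_mode INPUT_PUBLISH
# Mirrors `inputs.publish || 'always'`: an empty input (tag push) means "always".
resolve_publish_mode() {
  printf '%s\n' "${1:-always}"
}

resolve_version push refs/tags/beta-v0.0.15 ""              # prints 0.0.15
resolve_version push refs/tags/v1.2.3 ""                    # prints 1.2.3
resolve_version workflow_dispatch refs/heads/dev 0.0.0-test # prints 0.0.0-test
resolve_publish_mode ""                                     # prints always
resolve_publish_mode never                                  # prints never
```

With these defaults a tag push keeps publishing as before, while a manually dispatched run builds with `--publish never` unless `always` is chosen explicitly.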

@@ -21,9 +21,29 @@
</div>
# SurfSense
Conecta cualquier LLM a tus fuentes de conocimiento internas y chatea con él en tiempo real junto a tu equipo. Alternativa de código abierto a NotebookLM, Perplexity y Glean.
SurfSense es un agente de investigación de IA altamente personalizable, conectado a fuentes externas como motores de búsqueda (SearxNG, Tavily, LinkUp), Google Drive, OneDrive, Dropbox, Slack, Microsoft Teams, Linear, Jira, ClickUp, Confluence, BookStack, Gmail, Notion, YouTube, GitHub, Discord, Airtable, Google Calendar, Luma, Circleback, Elasticsearch, Obsidian y más por venir.
NotebookLM es una de las mejores y más útiles plataformas de IA que existen, pero una vez que comienzas a usarla regularmente también sientes sus limitaciones dejando algo que desear.
1. Hay límites en la cantidad de fuentes que puedes agregar en un notebook.
2. Hay límites en la cantidad de notebooks que puedes tener.
3. No puedes tener fuentes que excedan 500,000 palabras y más de 200MB.
4. Estás bloqueado con los servicios de Google (LLMs, modelos de uso, etc.) sin opción de configurarlos.
5. Fuentes de datos externas e integraciones de servicios limitadas.
6. El agente de NotebookLM está específicamente optimizado solo para estudiar e investigar, pero puedes hacer mucho más con los datos de origen.
7. Falta de soporte multijugador.
...y más.
**SurfSense está específicamente hecho para resolver estos problemas.** SurfSense te permite:
- **Controla Tu Flujo de Datos** - Mantén tus datos privados y seguros.
- **Sin Límites de Datos** - Agrega una cantidad ilimitada de fuentes y notebooks.
- **Sin Dependencia de Proveedores** - Configura cualquier modelo LLM, de imagen, TTS y STT.
- **25+ Fuentes de Datos Externas** - Agrega tus fuentes desde Google Drive, OneDrive, Dropbox, Notion y muchos otros servicios externos.
- **Soporte Multijugador en Tiempo Real** - Trabaja fácilmente con los miembros de tu equipo en un notebook compartido.
- **Aplicación de Escritorio** - Obtén asistencia de IA en cualquier aplicación con Quick Assist, General Assist y Extreme Assist.
...y más por venir.
@@ -34,7 +54,7 @@ https://github.com/user-attachments/assets/cc0c84d3-1f2f-4f7a-b519-2ecce22310b1
## Ejemplo de Agente de Video
https://github.com/user-attachments/assets/cc977e6d-8292-4ffe-abb8-3b0560ef5562
https://github.com/user-attachments/assets/012a7ffa-6f76-4f06-9dda-7632b470057a
@@ -111,6 +131,18 @@ El script de instalación configura [Watchtower](https://github.com/nicholas-fed
Para Docker Compose, instalación manual y otras opciones de despliegue, consulta la [documentación](https://www.surfsense.com/docs/).
### Aplicación de Escritorio
SurfSense también ofrece una aplicación de escritorio que lleva la asistencia de IA a cada aplicación en tu computadora. Descárgala desde la [última versión](https://github.com/MODSetter/SurfSense/releases/latest).
La aplicación de escritorio incluye tres potentes funciones:
- **General Assist** — Lanza SurfSense al instante desde cualquier aplicación con un atajo global.
- **Quick Assist** — Selecciona texto en cualquier lugar, luego pide a la IA que lo explique, reescriba o actúe sobre él.
- **Extreme Assist** — Obtén sugerencias de escritura en línea impulsadas por tu base de conocimiento mientras escribes en cualquier aplicación.
Las tres funciones operan contra tu espacio de búsqueda elegido, por lo que tus respuestas siempre están basadas en tus propios datos.
### Cómo Colaborar en Tiempo Real (Beta)
1. Ve a la página de Gestión de Miembros y crea una invitación.
@@ -133,24 +165,30 @@ Para Docker Compose, instalación manual y otras opciones de despliegue, consult
<p align="center"><img src="https://github.com/user-attachments/assets/3b04477d-8f42-4baa-be95-867c1eaeba87" alt="Comentarios en Tiempo Real" /></p>
## Funcionalidades Principales
## SurfSense vs Google NotebookLM
| Funcionalidad | Descripción |
|----------------|-------------|
| Alternativa OSS | Reemplazo directo de NotebookLM, Perplexity y Glean con colaboración en equipo en tiempo real |
| 50+ Formatos de Archivo | Sube documentos, imágenes, videos vía LlamaCloud, Unstructured o Docling (local) |
| Búsqueda Híbrida | Semántica + Texto completo con Índices Jerárquicos y Reciprocal Rank Fusion |
| Respuestas con Citas | Chatea con tu base de conocimiento y obtén respuestas citadas al estilo Perplexity |
| Arquitectura de Agentes Profundos | Impulsado por [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/overview) con planificación, subagentes y acceso al sistema de archivos |
| Soporte Universal de LLM | 100+ LLMs, 6000+ modelos de embeddings, todos los principales rerankers vía OpenAI spec y LiteLLM |
| Privacidad Primero | Soporte completo de LLM local (vLLM, Ollama) tus datos son tuyos |
| Colaboración en Equipo | RBAC con roles de Propietario / Admin / Editor / Visor, chat en tiempo real e hilos de comentarios |
| Generación de Videos | Genera videos con narración y visuales |
| Generación de Presentaciones | Crea presentaciones editables basadas en diapositivas |
| Generación de Podcasts | Podcast de 3 min en menos de 20 segundos; múltiples proveedores TTS (OpenAI, Azure, Kokoro) |
| Extensión de Navegador | Extensión multi-navegador para guardar cualquier página web, incluyendo páginas protegidas por autenticación |
| 27+ Conectores | Motores de búsqueda, Google Drive, OneDrive, Dropbox, Slack, Teams, Jira, Notion, GitHub, Discord y [más](#fuentes-externas) |
| Auto-Hospedable | Código abierto, Docker en un solo comando o Docker Compose completo para producción |
| Característica | Google NotebookLM | SurfSense |
|---------|-------------------|-----------|
| **Fuentes por Notebook** | 50 (Gratis) a 600 (Ultra, $249.99/mes) | Ilimitadas |
| **Número de Notebooks** | 100 (Gratis) a 500 (planes de pago) | Ilimitados |
| **Límite de Tamaño de Fuente** | 500,000 palabras / 200MB por fuente | Sin límite |
| **Precios** | Nivel gratuito disponible; Pro $19.99/mes, Ultra $249.99/mes | Gratuito y de código abierto, auto-hospedable en tu propia infra |
| **Soporte de LLM** | Solo Google Gemini | 100+ LLMs vía OpenAI spec y LiteLLM |
| **Modelos de Embeddings** | Solo Google | 6,000+ modelos de embeddings, todos los principales rerankers |
| **LLMs Locales / Privados** | No disponible | Soporte completo (vLLM, Ollama) - tus datos son tuyos |
| **Auto-Hospedable** | No | Sí - Docker en un solo comando o Docker Compose completo |
| **Código Abierto** | No | Sí |
| **Conectores Externos** | Google Drive, YouTube, sitios web | 27+ conectores - Motores de búsqueda, Google Drive, OneDrive, Dropbox, Slack, Teams, Jira, Notion, GitHub, Discord y [más](#fuentes-externas) |
| **Soporte de Formatos de Archivo** | PDFs, Docs, Slides, Sheets, CSV, Word, EPUB, imágenes, URLs web, YouTube | 50+ formatos - documentos, imágenes, videos vía LlamaCloud, Unstructured o Docling (local) |
| **Búsqueda** | Búsqueda semántica | Búsqueda Híbrida - Semántica + Texto completo con Índices Jerárquicos y Reciprocal Rank Fusion |
| **Respuestas con Citas** | Sí | Sí - Respuestas citadas al estilo Perplexity |
| **Arquitectura de Agentes** | No | Sí - impulsado por [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/overview) con planificación, subagentes y acceso al sistema de archivos |
| **Multijugador en Tiempo Real** | Notebooks compartidos con roles de Visor/Editor (sin chat en tiempo real) | RBAC con roles de Propietario / Admin / Editor / Visor, chat en tiempo real e hilos de comentarios |
| **Generación de Videos** | Resúmenes en video cinemáticos vía Veo 3 (solo Ultra) | Disponible (NotebookLM es mejor aquí, mejorando activamente) |
| **Generación de Presentaciones** | Diapositivas más atractivas pero no editables | Crea presentaciones editables basadas en diapositivas |
| **Generación de Podcasts** | Resúmenes de audio con hosts e idiomas personalizables | Disponible con múltiples proveedores TTS (NotebookLM es mejor aquí, mejorando activamente) |
| **Aplicación de Escritorio** | No | Aplicación nativa con General Assist, Quick Assist y Extreme Assist — asistencia de IA en cualquier aplicación |
| **Extensión de Navegador** | No | Extensión multi-navegador para guardar cualquier página web, incluyendo páginas protegidas por autenticación |
<details>
<summary><b>Lista completa de Fuentes Externas</b></summary>

@@ -21,9 +21,29 @@
</div>
# SurfSense
किसी भी LLM को अपने आंतरिक ज्ञान स्रोतों से जोड़ें और अपनी टीम के साथ रीयल-टाइम में चैट करें। NotebookLM, Perplexity और Glean का ओपन सोर्स विकल्प।
SurfSense एक अत्यधिक अनुकूलन योग्य AI शोध एजेंट है, जो बाहरी स्रोतों से जुड़ा है जैसे सर्च इंजन (SearxNG, Tavily, LinkUp), Google Drive, OneDrive, Dropbox, Slack, Microsoft Teams, Linear, Jira, ClickUp, Confluence, BookStack, Gmail, Notion, YouTube, GitHub, Discord, Airtable, Google Calendar, Luma, Circleback, Elasticsearch, Obsidian और भी बहुत कुछ आने वाला है।
NotebookLM वहाँ उपलब्ध सबसे अच्छे और सबसे उपयोगी AI प्लेटफ़ॉर्म में से एक है, लेकिन जब आप इसे नियमित रूप से उपयोग करना शुरू करते हैं तो आप इसकी सीमाओं को भी महसूस करते हैं जो कुछ और की चाह छोड़ती हैं।
1. एक notebook में जोड़े जा सकने वाले स्रोतों की मात्रा पर सीमाएं हैं।
2. आपके पास कितने notebooks हो सकते हैं इस पर सीमाएं हैं।
3. आपके पास ऐसे स्रोत नहीं हो सकते जो 500,000 शब्दों और 200MB से अधिक हों।
4. आप Google सेवाओं (LLMs, उपयोग मॉडल, आदि) में बंद हैं और उन्हें कॉन्फ़िगर करने का कोई विकल्प नहीं है।
5. सीमित बाहरी डेटा स्रोत और सेवा एकीकरण।
6. NotebookLM एजेंट विशेष रूप से केवल अध्ययन और शोध के लिए अनुकूलित है, लेकिन आप स्रोत डेटा के साथ और भी बहुत कुछ कर सकते हैं।
7. मल्टीप्लेयर सपोर्ट की कमी।
...और भी बहुत कुछ।
**SurfSense विशेष रूप से इन समस्याओं को हल करने के लिए बनाया गया है।** SurfSense आपको सक्षम बनाता है:
- **अपने डेटा प्रवाह को नियंत्रित करें** - अपने डेटा को निजी और सुरक्षित रखें।
- **कोई डेटा सीमा नहीं** - असीमित मात्रा में स्रोत और notebooks जोड़ें।
- **कोई विक्रेता लॉक-इन नहीं** - किसी भी LLM, इमेज, TTS और STT मॉडल को कॉन्फ़िगर करें।
- **25+ बाहरी डेटा स्रोत** - Google Drive, OneDrive, Dropbox, Notion और कई अन्य बाहरी सेवाओं से अपने स्रोत जोड़ें।
- **रीयल-टाइम मल्टीप्लेयर सपोर्ट** - एक साझा notebook में अपनी टीम के सदस्यों के साथ आसानी से काम करें।
- **डेस्कटॉप ऐप** - Quick Assist, General Assist और Extreme Assist के साथ किसी भी एप्लिकेशन में AI सहायता प्राप्त करें।
...और भी बहुत कुछ आने वाला है।
@@ -34,7 +54,7 @@ https://github.com/user-attachments/assets/cc0c84d3-1f2f-4f7a-b519-2ecce22310b1
## वीडियो एजेंट नमूना
https://github.com/user-attachments/assets/cc977e6d-8292-4ffe-abb8-3b0560ef5562
https://github.com/user-attachments/assets/012a7ffa-6f76-4f06-9dda-7632b470057a
@@ -111,6 +131,18 @@ irm https://raw.githubusercontent.com/MODSetter/SurfSense/main/docker/scripts/in
Docker Compose, मैनुअल इंस्टॉलेशन और अन्य डिप्लॉयमेंट विकल्पों के लिए, [डॉक्स](https://www.surfsense.com/docs/) देखें।
### डेस्कटॉप ऐप
SurfSense एक डेस्कटॉप ऐप भी प्रदान करता है जो आपके कंप्यूटर पर हर एप्लिकेशन में AI सहायता लाता है। इसे [नवीनतम रिलीज़](https://github.com/MODSetter/SurfSense/releases/latest) से डाउनलोड करें।
डेस्कटॉप ऐप में तीन शक्तिशाली सुविधाएं शामिल हैं:
- **General Assist** — एक ग्लोबल शॉर्टकट से किसी भी एप्लिकेशन से तुरंत SurfSense लॉन्च करें।
- **Quick Assist** — कहीं भी टेक्स्ट चुनें, फिर AI से समझाने, फिर से लिखने या उस पर कार्रवाई करने को कहें।
- **Extreme Assist** — किसी भी ऐप में टाइप करते समय अपनी नॉलेज बेस से संचालित इनलाइन लेखन सुझाव प्राप्त करें।
तीनों सुविधाएं आपके चुने हुए सर्च स्पेस पर काम करती हैं, ताकि आपके उत्तर हमेशा आपके अपने डेटा पर आधारित हों।
### रीयल-टाइम सहयोग कैसे करें (बीटा)
1. सदस्य प्रबंधन पेज पर जाएं और एक आमंत्रण बनाएं।
@@ -133,24 +165,30 @@ Docker Compose, मैनुअल इंस्टॉलेशन और अन
<p align="center"><img src="https://github.com/user-attachments/assets/3b04477d-8f42-4baa-be95-867c1eaeba87" alt="रीयल-टाइम कमेंट्स" /></p>
## प्रमुख विशेषताएं
## SurfSense vs Google NotebookLM
| विशेषता | विवरण |
|----------|--------|
| OSS विकल्प | रीयल-टाइम टीम सहयोग के साथ NotebookLM, Perplexity और Glean का सीधा प्रतिस्थापन |
| 50+ फ़ाइल फ़ॉर्मेट | LlamaCloud, Unstructured या Docling (लोकल) के माध्यम से दस्तावेज़, चित्र, वीडियो अपलोड करें |
| हाइब्रिड सर्च | हायरार्किकल इंडाइसेस और Reciprocal Rank Fusion के साथ सिमैंटिक + फुल टेक्स्ट सर्च |
| उद्धृत उत्तर | अपने ज्ञान आधार के साथ चैट करें और Perplexity शैली के उद्धृत उत्तर पाएं |
| डीप एजेंट आर्किटेक्चर | [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/overview) द्वारा संचालित, योजना, सब-एजेंट और फ़ाइल सिस्टम एक्सेस |
| यूनिवर्सल LLM सपोर्ट | 100+ LLMs, 6000+ एम्बेडिंग मॉडल, सभी प्रमुख रीरैंकर्स OpenAI spec और LiteLLM के माध्यम से |
| प्राइवेसी फर्स्ट | पूर्ण लोकल LLM सपोर्ट (vLLM, Ollama) आपका डेटा आपका रहता है |
| टीम सहयोग | मालिक / एडमिन / संपादक / दर्शक भूमिकाओं के साथ RBAC, रीयल-टाइम चैट और कमेंट थ्रेड |
| वीडियो जनरेशन | नैरेशन और विज़ुअल के साथ वीडियो बनाएं |
| प्रेजेंटेशन जनरेशन | संपादन योग्य, स्लाइड आधारित प्रेजेंटेशन बनाएं |
| पॉडकास्ट जनरेशन | 20 सेकंड से कम में 3 मिनट का पॉडकास्ट; कई TTS प्रदाता (OpenAI, Azure, Kokoro) |
| ब्राउज़र एक्सटेंशन | किसी भी वेबपेज को सहेजने के लिए क्रॉस-ब्राउज़र एक्सटेंशन, प्रमाणीकरण सुरक्षित पेज सहित |
| 27+ कनेक्टर्स | सर्च इंजन, Google Drive, OneDrive, Dropbox, Slack, Teams, Jira, Notion, GitHub, Discord और [अधिक](#बाहरी-स्रोत) |
| सेल्फ-होस्ट करने योग्य | ओपन सोर्स, Docker एक कमांड या प्रोडक्शन के लिए पूर्ण Docker Compose |
| विशेषता | Google NotebookLM | SurfSense |
|---------|-------------------|-----------|
| **प्रति Notebook स्रोत** | 50 (मुफ़्त) से 600 (Ultra, $249.99/माह) | असीमित |
| **Notebooks की संख्या** | 100 (मुफ़्त) से 500 (सशुल्क योजनाएं) | असीमित |
| **स्रोत आकार सीमा** | 500,000 शब्द / 200MB प्रति स्रोत | कोई सीमा नहीं |
| **मूल्य निर्धारण** | मुफ़्त स्तर उपलब्ध; Pro $19.99/माह, Ultra $249.99/माह | मुफ़्त और ओपन सोर्स, अपनी इंफ्रा पर सेल्फ-होस्ट करें |
| **LLM सपोर्ट** | केवल Google Gemini | 100+ LLMs OpenAI spec और LiteLLM के माध्यम से |
| **एम्बेडिंग मॉडल** | केवल Google | 6,000+ एम्बेडिंग मॉडल, सभी प्रमुख रीरैंकर्स |
| **लोकल / प्राइवेट LLMs** | उपलब्ध नहीं | पूर्ण सपोर्ट (vLLM, Ollama) - आपका डेटा आपका रहता है |
| **सेल्फ-होस्ट करने योग्य** | नहीं | हाँ - Docker एक कमांड या पूर्ण Docker Compose |
| **ओपन सोर्स** | नहीं | हाँ |
| **बाहरी कनेक्टर्स** | Google Drive, YouTube, वेबसाइटें | 27+ कनेक्टर्स - सर्च इंजन, Google Drive, OneDrive, Dropbox, Slack, Teams, Jira, Notion, GitHub, Discord और [अधिक](#बाहरी-स्रोत) |
| **फ़ाइल फ़ॉर्मेट सपोर्ट** | PDFs, Docs, Slides, Sheets, CSV, Word, EPUB, इमेज, वेब URLs, YouTube | 50+ फ़ॉर्मेट - दस्तावेज़, इमेज, वीडियो LlamaCloud, Unstructured या Docling (लोकल) के माध्यम से |
| **सर्च** | सिमैंटिक सर्च | हाइब्रिड सर्च - हायरार्किकल इंडाइसेस और Reciprocal Rank Fusion के साथ सिमैंटिक + फुल टेक्स्ट |
| **उद्धृत उत्तर** | हाँ | हाँ - Perplexity शैली के उद्धृत उत्तर |
| **एजेंट आर्किटेक्चर** | नहीं | हाँ - [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/overview) द्वारा संचालित, योजना, सब-एजेंट और फ़ाइल सिस्टम एक्सेस |
| **रीयल-टाइम मल्टीप्लेयर** | दर्शक/संपादक भूमिकाओं के साथ साझा notebooks (कोई रीयल-टाइम चैट नहीं) | मालिक / एडमिन / संपादक / दर्शक भूमिकाओं के साथ RBAC, रीयल-टाइम चैट और कमेंट थ्रेड |
| **वीडियो जनरेशन** | Veo 3 के माध्यम से सिनेमैटिक वीडियो ओवरव्यू (केवल Ultra) | उपलब्ध (NotebookLM यहाँ बेहतर है, सक्रिय रूप से सुधार हो रहा है) |
| **प्रेजेंटेशन जनरेशन** | बेहतर दिखने वाली स्लाइड्स लेकिन संपादन योग्य नहीं | संपादन योग्य, स्लाइड आधारित प्रेजेंटेशन बनाएं |
| **पॉडकास्ट जनरेशन** | कस्टमाइज़ेबल होस्ट और भाषाओं के साथ ऑडियो ओवरव्यू | कई TTS प्रदाताओं के साथ उपलब्ध (NotebookLM यहाँ बेहतर है, सक्रिय रूप से सुधार हो रहा है) |
| **डेस्कटॉप ऐप** | नहीं | General Assist, Quick Assist और Extreme Assist के साथ नेटिव ऐप — किसी भी एप्लिकेशन में AI सहायता |
| **ब्राउज़र एक्सटेंशन** | नहीं | किसी भी वेबपेज को सहेजने के लिए क्रॉस-ब्राउज़र एक्सटेंशन, प्रमाणीकरण सुरक्षित पेज सहित |
<details>
<summary><b>बाहरी स्रोतों की पूरी सूची</b></summary>

@@ -21,9 +21,29 @@
</div>
# SurfSense
Connect any LLM to your internal knowledge sources and chat with it in real time alongside your team. OSS alternative to NotebookLM, Perplexity, and Glean.
SurfSense is a highly customizable AI research agent, connected to external sources such as Search Engines (SearxNG, Tavily, LinkUp), Google Drive, OneDrive, Dropbox, Slack, Microsoft Teams, Linear, Jira, ClickUp, Confluence, BookStack, Gmail, Notion, YouTube, GitHub, Discord, Airtable, Google Calendar, Luma, Circleback, Elasticsearch, Obsidian and more to come.
NotebookLM is one of the best and most useful AI platforms out there, but once you start using it regularly you also run into its limitations:
1. There are limits on the number of sources you can add to a notebook.
2. There are limits on the number of notebooks you can have.
3. You cannot add sources that exceed 500,000 words or 200MB.
4. You are locked in to Google services (LLMs, usage models, etc.) with no option to configure them.
5. Limited external data sources and service integrations.
6. NotebookLM Agent is specifically optimised for just studying and researching, but you can do so much more with the source data.
7. Lack of multiplayer support.
...and more.
**SurfSense is specifically made to solve these problems.** SurfSense empowers you to:
- **Control Your Data Flow** - Keep your data private and secure.
- **No Data Limits** - Add an unlimited amount of sources and notebooks.
- **No Vendor Lock-in** - Configure any LLM, image, TTS, and STT models you want to use.
- **25+ External Data Sources** - Add your sources from Google Drive, OneDrive, Dropbox, Notion, and many other external services.
- **Real-Time Multiplayer Support** - Work easily with your team members in a shared notebook.
- **Desktop App** - Get AI assistance in any application with Quick Assist, General Assist, and Extreme Assist.
...and more to come.
@@ -112,6 +132,18 @@ The install script sets up [Watchtower](https://github.com/nicholas-fedor/watcht
For Docker Compose, manual installation, and other deployment options, see the [docs](https://www.surfsense.com/docs/).
### Desktop App
SurfSense also ships a desktop app that brings AI assistance to every application on your computer. Download it from the [latest release](https://github.com/MODSetter/SurfSense/releases/latest).
The desktop app includes three powerful features:
- **General Assist** — Launch SurfSense instantly from any application with a global shortcut.
- **Quick Assist** — Select text anywhere, then ask AI to explain, rewrite, or act on it.
- **Extreme Assist** — Get inline writing suggestions powered by your knowledge base as you type in any app.
All three features operate against your chosen search space, so your answers are always grounded in your own data.
### How to Collaborate in Real Time (Beta)
1. Go to the Manage Members page and create an invite.
@@ -134,24 +166,30 @@ For Docker Compose, manual installation, and other deployment options, see the [
<p align="center"><img src="https://github.com/user-attachments/assets/3b04477d-8f42-4baa-be95-867c1eaeba87" alt="Realtime Comments" /></p>
## Key Features
## SurfSense vs Google NotebookLM
| Feature | Description |
|---------|-------------|
| OSS Alternative | Drop in replacement for NotebookLM, Perplexity, and Glean with real time team collaboration |
| 50+ File Formats | Upload documents, images, videos via LlamaCloud, Unstructured, or Docling (local) |
| Hybrid Search | Semantic + Full Text Search with Hierarchical Indices and Reciprocal Rank Fusion |
| Cited Answers | Chat with your knowledge base and get Perplexity style cited responses |
| Deep Agent Architecture | Powered by [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/overview) planning, subagents, and file system access |
| Universal LLM Support | 100+ LLMs, 6000+ embedding models, all major rerankers via OpenAI spec & LiteLLM |
| Privacy First | Full local LLM support (vLLM, Ollama) your data stays yours |
| Team Collaboration | RBAC with Owner / Admin / Editor / Viewer roles, real time chat & comment threads |
| Video Generation | Generate videos with narration and visuals |
| Presentation Generation | Create editable, slide based presentations |
| Podcast Generation | 3 min podcast in under 20 seconds; multiple TTS providers (OpenAI, Azure, Kokoro) |
| Browser Extension | Cross browser extension to save any webpage, including auth protected pages |
| 27+ Connectors | Search Engines, Google Drive, OneDrive, Dropbox, Slack, Teams, Jira, Notion, GitHub, Discord & [more](#external-sources) |
| Self Hostable | Open source, Docker one liner or full Docker Compose for production |
| Feature | Google NotebookLM | SurfSense |
|---------|-------------------|-----------|
| **Sources per Notebook** | 50 (Free) to 600 (Ultra, $249.99/mo) | Unlimited |
| **Number of Notebooks** | 100 (Free) to 500 (paid tiers) | Unlimited |
| **Source Size Limit** | 500,000 words / 200MB per source | No limit |
| **Pricing** | Free tier available; Pro $19.99/mo, Ultra $249.99/mo | Free and open source, self-host on your own infra |
| **LLM Support** | Google Gemini only | 100+ LLMs via OpenAI spec & LiteLLM |
| **Embedding Models** | Google only | 6,000+ embedding models, all major rerankers |
| **Local / Private LLMs** | Not available | Full support (vLLM, Ollama) - your data stays yours |
| **Self Hostable** | No | Yes - Docker one-liner or full Docker Compose |
| **Open Source** | No | Yes |
| **External Connectors** | Google Drive, YouTube, websites | 27+ connectors - Search Engines, Google Drive, OneDrive, Dropbox, Slack, Teams, Jira, Notion, GitHub, Discord & [more](#external-sources) |
| **File Format Support** | PDFs, Docs, Slides, Sheets, CSV, Word, EPUB, images, web URLs, YouTube | 50+ formats - documents, images, videos via LlamaCloud, Unstructured, or Docling (local) |
| **Search** | Semantic search | Hybrid Search - Semantic + Full Text with Hierarchical Indices & Reciprocal Rank Fusion |
| **Cited Answers** | Yes | Yes - Perplexity-style cited responses |
| **Agentic Architecture** | No | Yes - powered by [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/overview) with planning, subagents, and file system access |
| **Real-Time Multiplayer** | Shared notebooks with Viewer/Editor roles (no real-time chat) | RBAC with Owner / Admin / Editor / Viewer roles, real-time chat & comment threads |
| **Video Generation** | Cinematic Video Overviews via Veo 3 (Ultra only) | Available (NotebookLM is better here, actively improving) |
| **Presentation Generation** | Better looking slides but not editable | Create editable, slide-based presentations |
| **Podcast Generation** | Audio Overviews with customizable hosts and languages | Available with multiple TTS providers (NotebookLM is better here, actively improving) |
| **Desktop App** | No | Native app with General Assist, Quick Assist, and Extreme Assist — AI help in any application |
| **Browser Extension** | No | Cross-browser extension to save any webpage, including auth-protected pages |
<details>
<summary><b>Full list of External Sources</b></summary>

@@ -21,9 +21,29 @@
</div>
# SurfSense
Conecte qualquer LLM às suas fontes de conhecimento internas e converse com ele em tempo real junto com sua equipe. Alternativa de código aberto ao NotebookLM, Perplexity e Glean.
SurfSense é um agente de pesquisa de IA altamente personalizável, conectado a fontes externas como mecanismos de busca (SearxNG, Tavily, LinkUp), Google Drive, OneDrive, Dropbox, Slack, Microsoft Teams, Linear, Jira, ClickUp, Confluence, BookStack, Gmail, Notion, YouTube, GitHub, Discord, Airtable, Google Calendar, Luma, Circleback, Elasticsearch, Obsidian e mais por vir.
O NotebookLM é uma das melhores e mais úteis plataformas de IA disponíveis, mas quando você começa a usá-lo regularmente também sente suas limitações deixando algo a desejar.
1. Há limites na quantidade de fontes que você pode adicionar em um notebook.
2. Há limites no número de notebooks que você pode ter.
3. Você não pode ter fontes que excedam 500.000 palavras e mais de 200MB.
4. Você fica preso aos serviços do Google (LLMs, modelos de uso, etc.) sem opção de configurá-los.
5. Fontes de dados externas e integrações de serviços limitadas.
6. O agente do NotebookLM é especificamente otimizado apenas para estudar e pesquisar, mas você pode fazer muito mais com os dados de origem.
7. Falta de suporte multiplayer.
...e mais.
**O SurfSense foi feito especificamente para resolver esses problemas.** O SurfSense permite que você:
- **Controle Seu Fluxo de Dados** - Mantenha seus dados privados e seguros.
- **Sem Limites de Dados** - Adicione uma quantidade ilimitada de fontes e notebooks.
- **Sem Dependência de Fornecedor** - Configure qualquer modelo LLM, de imagem, TTS e STT.
- **25+ Fontes de Dados Externas** - Adicione suas fontes do Google Drive, OneDrive, Dropbox, Notion e muitos outros serviços externos.
- **Suporte Multiplayer em Tempo Real** - Trabalhe facilmente com os membros da sua equipe em um notebook compartilhado.
- **Aplicativo Desktop** - Obtenha assistência de IA em qualquer aplicativo com Quick Assist, General Assist e Extreme Assist.
...e mais por vir.
@@ -34,7 +54,7 @@ https://github.com/user-attachments/assets/cc0c84d3-1f2f-4f7a-b519-2ecce22310b1
## Exemplo de Agente de Vídeo
https://github.com/user-attachments/assets/cc977e6d-8292-4ffe-abb8-3b0560ef5562
https://github.com/user-attachments/assets/012a7ffa-6f76-4f06-9dda-7632b470057a
@@ -111,6 +131,18 @@ O script de instalação configura o [Watchtower](https://github.com/nicholas-fe
Para Docker Compose, instalação manual e outras opções de implantação, consulte a [documentação](https://www.surfsense.com/docs/).
### Aplicativo Desktop
O SurfSense também oferece um aplicativo desktop que traz assistência de IA para cada aplicativo no seu computador. Baixe-o na [última versão](https://github.com/MODSetter/SurfSense/releases/latest).
O aplicativo desktop inclui três recursos poderosos:
- **General Assist** — Abra o SurfSense instantaneamente de qualquer aplicativo com um atalho global.
- **Quick Assist** — Selecione texto em qualquer lugar, depois peça à IA para explicar, reescrever ou agir sobre ele.
- **Extreme Assist** — Receba sugestões de escrita em linha alimentadas pela sua base de conhecimento enquanto digita em qualquer aplicativo.
Os três recursos operam no espaço de busca escolhido, para que suas respostas sejam sempre baseadas nos seus próprios dados.
### Como Colaborar em Tempo Real (Beta)
1. Acesse a página de Gerenciar Membros e crie um convite.
@@ -133,24 +165,30 @@ Para Docker Compose, instalação manual e outras opções de implantação, con
<p align="center"><img src="https://github.com/user-attachments/assets/3b04477d-8f42-4baa-be95-867c1eaeba87" alt="Comentários em Tempo Real" /></p>
## Funcionalidades Principais
## SurfSense vs Google NotebookLM
| Funcionalidade | Descrição |
|----------------|-----------|
| Alternativa OSS | Substituto direto do NotebookLM, Perplexity e Glean com colaboração em equipe em tempo real |
| 50+ Formatos de Arquivo | Faça upload de documentos, imagens, vídeos via LlamaCloud, Unstructured ou Docling (local) |
| Busca Híbrida | Semântica + Texto completo com Índices Hierárquicos e Reciprocal Rank Fusion |
| Respostas com Citações | Converse com sua base de conhecimento e obtenha respostas citadas no estilo Perplexity |
| Arquitetura de Agentes Profundos | Alimentado por [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/overview) com planejamento, subagentes e acesso ao sistema de arquivos |
| Suporte Universal de LLM | 100+ LLMs, 6000+ modelos de embeddings, todos os principais rerankers via OpenAI spec e LiteLLM |
| Privacidade em Primeiro Lugar | Suporte completo a LLM local (vLLM, Ollama) seus dados ficam com você |
| Colaboração em Equipe | RBAC com papéis de Proprietário / Admin / Editor / Visualizador, chat em tempo real e threads de comentários |
| Geração de Vídeos | Gera vídeos com narração e visuais |
| Geração de Apresentações | Cria apresentações editáveis baseadas em slides |
| Geração de Podcasts | Podcast de 3 min em menos de 20 segundos; múltiplos provedores TTS (OpenAI, Azure, Kokoro) |
| Extensão de Navegador | Extensão multi-navegador para salvar qualquer página web, incluindo páginas protegidas por autenticação |
| 27+ Conectores | Mecanismos de busca, Google Drive, OneDrive, Dropbox, Slack, Teams, Jira, Notion, GitHub, Discord e [mais](#fontes-externas) |
| Auto-Hospedável | Código aberto, Docker em um único comando ou Docker Compose completo para produção |
| Feature | Google NotebookLM | SurfSense |
|---------|-------------------|-----------|
| **Sources per Notebook** | 50 (Free) to 600 (Ultra, $249.99/mo) | Unlimited |
| **Number of Notebooks** | 100 (Free) to 500 (paid plans) | Unlimited |
| **Source Size Limit** | 500,000 words / 200MB per source | No limit |
| **Pricing** | Free tier available; Pro $19.99/mo, Ultra $249.99/mo | Free and open source, self-hostable on your own infra |
| **LLM Support** | Google Gemini only | 100+ LLMs via the OpenAI spec and LiteLLM |
| **Embedding Models** | Google only | 6,000+ embedding models, all major rerankers |
| **Local / Private LLMs** | Not available | Full support (vLLM, Ollama) - your data stays with you |
| **Self-Hostable** | No | Yes - one-command Docker or full Docker Compose |
| **Open Source** | No | Yes |
| **External Connectors** | Google Drive, YouTube, websites | 27+ connectors - Search engines, Google Drive, OneDrive, Dropbox, Slack, Teams, Jira, Notion, GitHub, Discord, and [more](#fontes-externas) |
| **File Format Support** | PDFs, Docs, Slides, Sheets, CSV, Word, EPUB, images, web URLs, YouTube | 50+ formats - documents, images, videos via LlamaCloud, Unstructured, or Docling (local) |
| **Search** | Semantic search | Hybrid Search - Semantic + Full-text with Hierarchical Indices and Reciprocal Rank Fusion |
| **Cited Answers** | Yes | Yes - Perplexity-style cited answers |
| **Agent Architecture** | No | Yes - powered by [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/overview) with planning, subagents, and filesystem access |
| **Real-Time Multiplayer** | Shared notebooks with Viewer/Editor roles (no real-time chat) | RBAC with Owner / Admin / Editor / Viewer roles, real-time chat, and comment threads |
| **Video Generation** | Cinematic overviews via Veo 3 (Ultra only) | Available (NotebookLM is better here, actively improving) |
| **Presentation Generation** | Better-looking slides, but not editable | Creates editable, slide-based presentations |
| **Podcast Generation** | Audio overviews with customizable hosts and languages | Available with multiple TTS providers (NotebookLM is better here, actively improving) |
| **Desktop App** | No | Native app with General Assist, Quick Assist, and Extreme Assist: AI assistance in any application |
| **Browser Extension** | No | Cross-browser extension to save any web page, including pages behind authentication |
<details>
<summary><b>Full list of External Sources</b></summary>

View file

@ -21,9 +21,29 @@
</div>
# SurfSense
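Connect any LLM to your internal knowledge sources and chat with your teammates in real time. An open-source alternative to NotebookLM, Perplexity, and Glean.
SurfSense is a highly customizable AI research assistant that connects to external data sources such as search engines (SearxNG, Tavily, LinkUp), Google Drive, OneDrive, Dropbox, Slack, Microsoft Teams, Linear, Jira, ClickUp, Confluence, BookStack, Gmail, Notion, YouTube, GitHub, Discord, Airtable, Google Calendar, Luma, Circleback, Elasticsearch, Obsidian, and more, with further sources planned.
NotebookLM is one of the best and most useful AI platforms available, but once you start using it regularly you also run into its limitations and feel that something is missing.
1. There is a limit on the number of sources you can add to a notebook.
2. There is a limit on the number of notebooks you can have.
3. A source cannot exceed 500,000 words and 200MB.
4. You are locked into Google services (LLMs, usage models, etc.) with no option to configure them.
5. Integrations with external data sources and services are limited.
6. The NotebookLM agent is optimized specifically for studying and research, but you can do much more with your source data.
7. There is no multiplayer collaboration support.
...and more.
**SurfSense was built to solve exactly these problems.** SurfSense gives you:
- **Control over your data flow** - Keep your data private and secure.
- **No data limits** - Add an unlimited number of sources and notebooks.
- **No vendor lock-in** - Configure any LLM, image, TTS, and STT model.
- **25+ external data sources** - Add your sources from Google Drive, OneDrive, Dropbox, Notion, and many other external services.
- **Real-time multiplayer support** - Collaborate easily with your teammates in a shared notebook.
- **Desktop app** - Get AI assistance in any application with Quick Assist, General Assist, and Extreme Assist.
...more features coming soon.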
@ -34,7 +54,7 @@ https://github.com/user-attachments/assets/cc0c84d3-1f2f-4f7a-b519-2ecce22310b1
## Video Agent Sample
https://github.com/user-attachments/assets/cc977e6d-8292-4ffe-abb8-3b0560ef5562
https://github.com/user-attachments/assets/012a7ffa-6f76-4f06-9dda-7632b470057a
@ -111,6 +131,18 @@ irm https://raw.githubusercontent.com/MODSetter/SurfSense/main/docker/scripts/in
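For Docker Compose, manual installation, and other deployment options, see the [docs](https://www.surfsense.com/docs/).
### Desktop App
SurfSense also ships a desktop app that brings AI assistance to every application on your computer. Download it from the [latest release](https://github.com/MODSetter/SurfSense/releases/latest).
The desktop app includes three features:
- **General Assist**: launch SurfSense instantly from any application with a global shortcut.
- **Quick Assist**: select text anywhere, then have the AI explain it, rewrite it, or act on it.
- **Extreme Assist**: get inline writing suggestions grounded in your knowledge base as you type in any app.
All three features run against the search space you choose, so answers are always grounded in your own data.
### How to Collaborate in Real Time (Beta)
1. Go to the member management page and create an invite.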
@ -133,24 +165,30 @@ irm https://raw.githubusercontent.com/MODSetter/SurfSense/main/docker/scripts/in
<p align="center"><img src="https://github.com/user-attachments/assets/3b04477d-8f42-4baa-be95-867c1eaeba87" alt="实时评论" /></p>
## Key Features
## SurfSense vs Google NotebookLM
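| Feature | Description |
|------|------|
| Open-Source Alternative | A replacement for NotebookLM, Perplexity, and Glean with real-time team collaboration |
| 50+ File Formats | Upload documents, images, and videos via LlamaCloud, Unstructured, or Docling (local) |
| Hybrid Search | Semantic + full-text search with hierarchical indices and Reciprocal Rank Fusion |
| Cited Answers | Chat with your knowledge base and get Perplexity-style cited answers |
| Deep Agent Architecture | Built on [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/overview) with planning, subagents, and filesystem access |
| Universal LLM Support | 100+ LLMs, 6000+ embedding models, all major rerankers via the OpenAI spec and LiteLLM |
| Privacy First | Full local LLM support (vLLM, Ollama); your data stays under your control |
| Team Collaboration | RBAC roles (Owner / Admin / Editor / Viewer), real-time chat, and comment threads |
| Video Generation | Generates videos with narration and visuals |
| Presentation Generation | Creates editable, slide-based presentations |
| Podcast Generation | 3-minute podcast in under 20 seconds; multiple TTS providers (OpenAI, Azure, Kokoro) |
| Browser Extension | Cross-browser extension to save any web page, including pages behind authentication |
| 27+ Connectors | Search engines, Google Drive, OneDrive, Dropbox, Slack, Teams, Jira, Notion, GitHub, Discord, and [more](#外部数据源) |
| Self-Hostable | Open source, one-command Docker or full Docker Compose for production |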
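| Feature | Google NotebookLM | SurfSense |
|---------|-------------------|-----------|
| **Sources per Notebook** | 50 (Free) to 600 (Ultra, $249.99/mo) | Unlimited |
| **Number of Notebooks** | 100 (Free) to 500 (paid plans) | Unlimited |
| **Source Size Limit** | 500,000 words / 200MB per source | No limit |
| **Pricing** | Free tier available; Pro $19.99/mo, Ultra $249.99/mo | Free and open source, self-hosted on your own infrastructure |
| **LLM Support** | Google Gemini only | 100+ LLMs via the OpenAI spec and LiteLLM |
| **Embedding Models** | Google only | 6,000+ embedding models, all major rerankers |
| **Local / Private LLMs** | Not available | Full support (vLLM, Ollama) - your data stays under your control |
| **Self-Hostable** | No | Yes - one-command Docker or full Docker Compose |
| **Open Source** | No | Yes |
| **External Connectors** | Google Drive, YouTube, websites | 27+ connectors - Search engines, Google Drive, OneDrive, Dropbox, Slack, Teams, Jira, Notion, GitHub, Discord, and [more](#外部数据源) |
| **File Format Support** | PDF, Docs, Slides, Sheets, CSV, Word, EPUB, images, web URLs, YouTube | 50+ formats - documents, images, videos via LlamaCloud, Unstructured, or Docling (local) |
| **Search** | Semantic search | Hybrid search - semantic + full-text search with hierarchical indices and Reciprocal Rank Fusion |
| **Cited Answers** | Yes | Yes - Perplexity-style cited answers |
| **Agent Architecture** | No | Yes - built on [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/overview) with planning, subagents, and filesystem access |
| **Real-Time Multiplayer** | Shared notebooks with Viewer/Editor roles (no real-time chat) | RBAC roles (Owner / Admin / Editor / Viewer), real-time chat, and comment threads |
| **Video Generation** | Cinematic video overviews via Veo 3 (Ultra only) | Available (NotebookLM is better here, actively improving) |
| **Presentation Generation** | Better-looking slides, but not editable | Creates editable, slide-based presentations |
| **Podcast Generation** | Audio overviews with customizable hosts and languages | Available with multiple TTS providers (NotebookLM is better here, actively improving) |
| **Desktop App** | No | Native app with General Assist, Quick Assist, and Extreme Assist: AI assistance in any application |
| **Browser Extension** | No | Cross-browser extension to save any web page, including pages behind authentication |
<details>
<summary><b>Full list of External Sources</b></summary>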

View file

@ -282,6 +282,9 @@ STT_SERVICE=local/base
# LlamaCloud (if ETL_SERVICE=LLAMACLOUD)
# LLAMA_CLOUD_API_KEY=
# Optional: Azure Document Intelligence accelerator (used with LLAMACLOUD)
# AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
# AZURE_DI_KEY=
# ------------------------------------------------------------------------------
# Observability (optional)

View file

@ -24,7 +24,7 @@ SurfSense 现已支持以下国产 LLM
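1. Log in to the SurfSense Dashboard
2. Go to **Settings** → **API Keys** (or **LLM Configurations**)
3. Click **Add LLM Model**
3. Click **Add Model**
4. Select your Chinese LLM provider from the **Provider** dropdown
5. Fill in the required fields (see the per-provider configuration details below)
6. Click **Save**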

6
package-lock.json generated Normal file
View file

@ -0,0 +1,6 @@
{
"name": "SurfSense",
"lockfileVersion": 3,
"requires": true,
"packages": {}
}

5
package.json Normal file
View file

@ -0,0 +1,5 @@
{
"name": "surfsense",
"private": true,
"packageManager": "pnpm@10.24.0"
}

View file

@ -193,6 +193,9 @@ FIRECRAWL_API_KEY=fcr-01J0000000000000000000000
ETL_SERVICE=UNSTRUCTURED or LLAMACLOUD or DOCLING
UNSTRUCTURED_API_KEY=Tpu3P0U8iy
LLAMA_CLOUD_API_KEY=llx-nnn
# Optional: Azure Document Intelligence accelerator (used when ETL_SERVICE=LLAMACLOUD)
# AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
# AZURE_DI_KEY=your-key
# OPTIONAL: Add these for LangSmith Observability
LANGSMITH_TRACING=true

View file

@ -42,9 +42,7 @@ def upgrade() -> None:
if not exists:
table_list = ", ".join(TABLES)
conn.execute(
sa.text(
f"CREATE PUBLICATION {PUBLICATION_NAME} FOR TABLE {table_list}"
)
sa.text(f"CREATE PUBLICATION {PUBLICATION_NAME} FOR TABLE {table_list}")
)

View file

@ -0,0 +1,123 @@
"""optimize zero_publication with column lists

Recreates the zero_publication using column lists for the documents
table so that large text columns (content, source_markdown,
blocknote_document, etc.) are excluded from WAL replication.
This prevents RangeError: Invalid string length in zero-cache's
change-streamer when documents have very large content.

Also resets REPLICA IDENTITY to DEFAULT on tables that had it set
to FULL for the old Electric SQL setup (migration 66/75/76).
With DEFAULT (primary-key) identity, column-list publications
only need to include the PK, not every column.

IMPORTANT: before AND after running this migration:
  1. Stop zero-cache (it holds replication locks that will deadlock DDL)
  2. Run: alembic upgrade head
  3. Delete / reset the zero-cache data volume
  4. Restart zero-cache (it will do a fresh initial sync)

Revision ID: 117
Revises: 116
"""
from collections.abc import Sequence
import sqlalchemy as sa
from alembic import op
revision: str = "117"
down_revision: str | None = "116"
branch_labels: str | Sequence[str] | None = None
depends_on: str | Sequence[str] | None = None
PUBLICATION_NAME = "zero_publication"
TABLES_WITH_FULL_IDENTITY = [
    "documents",
    "notifications",
    "search_source_connectors",
    "new_chat_messages",
    "chat_comments",
    "chat_session_state",
]

DOCUMENT_COLS = [
    "id",
    "title",
    "document_type",
    "search_space_id",
    "folder_id",
    "created_by_id",
    "status",
    "created_at",
    "updated_at",
]
PUBLICATION_DDL_FULL = f"""\
CREATE PUBLICATION {PUBLICATION_NAME} FOR TABLE
notifications, documents, folders,
search_source_connectors, new_chat_messages,
chat_comments, chat_session_state
"""
def _terminate_blocked_pids(conn, table: str) -> None:
    """Kill backends whose locks on *table* would block our AccessExclusiveLock."""
    conn.execute(
        sa.text(
            "SELECT pg_terminate_backend(l.pid) "
            "FROM pg_locks l "
            "JOIN pg_class c ON c.oid = l.relation "
            "WHERE c.relname = :tbl "
            " AND l.pid != pg_backend_pid()"
        ),
        {"tbl": table},
    )
def upgrade() -> None:
    conn = op.get_bind()
    conn.execute(sa.text("SET lock_timeout = '10s'"))

    # Lock tables in a fixed (sorted) order after clearing competing backends.
    for tbl in sorted(TABLES_WITH_FULL_IDENTITY):
        _terminate_blocked_pids(conn, tbl)
        conn.execute(sa.text(f'LOCK TABLE "{tbl}" IN ACCESS EXCLUSIVE MODE'))

    for tbl in TABLES_WITH_FULL_IDENTITY:
        conn.execute(sa.text(f'ALTER TABLE "{tbl}" REPLICA IDENTITY DEFAULT'))

    conn.execute(sa.text(f"DROP PUBLICATION IF EXISTS {PUBLICATION_NAME}"))

    # Include zero-cache's bookkeeping column only if it exists on this database.
    has_zero_ver = conn.execute(
        sa.text(
            "SELECT 1 FROM information_schema.columns "
            "WHERE table_name = 'documents' AND column_name = '_0_version'"
        )
    ).fetchone()
    cols = DOCUMENT_COLS + (['"_0_version"'] if has_zero_ver else [])
    col_list = ", ".join(cols)

    conn.execute(
        sa.text(
            f"CREATE PUBLICATION {PUBLICATION_NAME} FOR TABLE "
            f"notifications, "
            f"documents ({col_list}), "
            f"folders, "
            f"search_source_connectors, "
            f"new_chat_messages, "
            f"chat_comments, "
            f"chat_session_state"
        )
    )
def downgrade() -> None:
    conn = op.get_bind()
    conn.execute(sa.text(f"DROP PUBLICATION IF EXISTS {PUBLICATION_NAME}"))
    conn.execute(sa.text(PUBLICATION_DDL_FULL))
    for tbl in TABLES_WITH_FULL_IDENTITY:
        conn.execute(sa.text(f'ALTER TABLE "{tbl}" REPLICA IDENTITY FULL'))
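
The column-list publication created in the migration above can be illustrated with a small standalone helper. This is only a sketch: `build_publication_ddl` and its parameters are hypothetical names introduced here, not part of the commit, and it only builds the SQL string (PostgreSQL 15 and later accept a parenthesised column list after a table name in `CREATE PUBLICATION`).

```python
from __future__ import annotations


def build_publication_ddl(
    publication: str,
    tables: list[str],
    column_lists: dict[str, list[str]] | None = None,
) -> str:
    """Build a CREATE PUBLICATION statement (hypothetical helper).

    Tables listed in ``column_lists`` are published with only those columns;
    every other table is published in full.
    """
    column_lists = column_lists or {}
    parts = []
    for table in tables:
        cols = column_lists.get(table)
        if cols:
            parts.append(f"{table} ({', '.join(cols)})")
        else:
            parts.append(table)
    return f"CREATE PUBLICATION {publication} FOR TABLE {', '.join(parts)}"


ddl = build_publication_ddl(
    "zero_publication",
    ["notifications", "documents", "folders"],
    {"documents": ["id", "title", "status"]},
)
print(ddl)
```

Keeping the large text columns out of the column list is what keeps them out of the replication stream; the primary key has to stay in the list so that updates and deletes can still be replicated under `REPLICA IDENTITY DEFAULT`.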

View file

@ -0,0 +1,149 @@
"""Add LOCAL_FOLDER_FILE document type, folder metadata, and document_versions table
Revision ID: 118
Revises: 117
"""
from collections.abc import Sequence
import sqlalchemy as sa
from alembic import op
revision: str = "118"
down_revision: str | None = "117"
branch_labels: str | Sequence[str] | None = None
depends_on: str | Sequence[str] | None = None
PUBLICATION_NAME = "zero_publication"
def upgrade() -> None:
    conn = op.get_bind()

    # Add LOCAL_FOLDER_FILE to documenttype enum
    op.execute(
        """
        DO $$
        BEGIN
            IF NOT EXISTS (
                SELECT 1 FROM pg_type t
                JOIN pg_enum e ON t.oid = e.enumtypid
                WHERE t.typname = 'documenttype' AND e.enumlabel = 'LOCAL_FOLDER_FILE'
            ) THEN
                ALTER TYPE documenttype ADD VALUE 'LOCAL_FOLDER_FILE';
            END IF;
        END
        $$;
        """
    )

    # Add JSONB metadata column to folders table
    col_exists = conn.execute(
        sa.text(
            "SELECT 1 FROM information_schema.columns "
            "WHERE table_name = 'folders' AND column_name = 'metadata'"
        )
    ).fetchone()
    if not col_exists:
        op.add_column(
            "folders",
            sa.Column("metadata", sa.dialects.postgresql.JSONB, nullable=True),
        )

    # Create document_versions table
    table_exists = conn.execute(
        sa.text(
            "SELECT 1 FROM information_schema.tables WHERE table_name = 'document_versions'"
        )
    ).fetchone()
    if not table_exists:
        op.create_table(
            "document_versions",
            sa.Column("id", sa.Integer(), nullable=False, autoincrement=True),
            sa.Column("document_id", sa.Integer(), nullable=False),
            sa.Column("version_number", sa.Integer(), nullable=False),
            sa.Column("source_markdown", sa.Text(), nullable=True),
            sa.Column("content_hash", sa.String(), nullable=False),
            sa.Column("title", sa.String(), nullable=True),
            sa.Column(
                "created_at",
                sa.TIMESTAMP(timezone=True),
                server_default=sa.text("now()"),
                nullable=False,
            ),
            sa.ForeignKeyConstraint(
                ["document_id"],
                ["documents.id"],
                ondelete="CASCADE",
            ),
            sa.PrimaryKeyConstraint("id"),
            sa.UniqueConstraint(
                "document_id",
                "version_number",
                name="uq_document_version",
            ),
        )

    op.execute(
        "CREATE INDEX IF NOT EXISTS ix_document_versions_document_id "
        "ON document_versions (document_id)"
    )
    op.execute(
        "CREATE INDEX IF NOT EXISTS ix_document_versions_created_at "
        "ON document_versions (created_at)"
    )

    # Add document_versions to Zero publication
    pub_exists = conn.execute(
        sa.text("SELECT 1 FROM pg_publication WHERE pubname = :name"),
        {"name": PUBLICATION_NAME},
    ).fetchone()
    if pub_exists:
        already_in_pub = conn.execute(
            sa.text(
                "SELECT 1 FROM pg_publication_tables "
                "WHERE pubname = :name AND tablename = 'document_versions'"
            ),
            {"name": PUBLICATION_NAME},
        ).fetchone()
        if not already_in_pub:
            op.execute(
                f"ALTER PUBLICATION {PUBLICATION_NAME} ADD TABLE document_versions"
            )
def downgrade() -> None:
    conn = op.get_bind()

    # Remove from publication
    pub_exists = conn.execute(
        sa.text("SELECT 1 FROM pg_publication WHERE pubname = :name"),
        {"name": PUBLICATION_NAME},
    ).fetchone()
    if pub_exists:
        already_in_pub = conn.execute(
            sa.text(
                "SELECT 1 FROM pg_publication_tables "
                "WHERE pubname = :name AND tablename = 'document_versions'"
            ),
            {"name": PUBLICATION_NAME},
        ).fetchone()
        if already_in_pub:
            op.execute(
                f"ALTER PUBLICATION {PUBLICATION_NAME} DROP TABLE document_versions"
            )

    op.execute("DROP INDEX IF EXISTS ix_document_versions_created_at")
    op.execute("DROP INDEX IF EXISTS ix_document_versions_document_id")
    op.execute("DROP TABLE IF EXISTS document_versions")

    # Drop metadata column from folders
    col_exists = conn.execute(
        sa.text(
            "SELECT 1 FROM information_schema.columns "
            "WHERE table_name = 'folders' AND column_name = 'metadata'"
        )
    ).fetchone()
    if col_exists:
        op.drop_column("folders", "metadata")
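
The `document_versions` table above pairs a per-document `version_number` with a `content_hash` and enforces uniqueness on `(document_id, version_number)`. How the application assigns version numbers is not shown in this part of the diff, so the following is a purely illustrative in-memory sketch under stated assumptions: `VersionStore` and `add_version` are hypothetical names, SHA-256 is an assumed hash, and skipping unchanged content is an assumed policy.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class VersionStore:
    """In-memory stand-in for the document_versions table (hypothetical)."""

    # document_id -> list of (version_number, content_hash), oldest first
    versions: dict[int, list[tuple[int, str]]] = field(default_factory=dict)

    def add_version(self, document_id: int, source_markdown: str) -> int | None:
        """Record a new version; return its number, or None if nothing changed."""
        content_hash = hashlib.sha256(source_markdown.encode("utf-8")).hexdigest()
        history = self.versions.setdefault(document_id, [])
        if history and history[-1][1] == content_hash:
            return None  # assumed policy: identical content creates no version
        version_number = history[-1][0] + 1 if history else 1
        history.append((version_number, content_hash))
        return version_number


store = VersionStore()
first = store.add_version(1, "# Title")
second = store.add_version(1, "# Title")      # same content as before
third = store.add_version(1, "# Title v2")
print(first, second, third)
```

In a real database the unique constraint would act as the final guard against two writers picking the same next number; the sketch sidesteps concurrency entirely.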

View file

@ -0,0 +1,39 @@
"""119_add_vision_llm_id_to_search_spaces
Revision ID: 119
Revises: 118
Adds vision_llm_id column to search_spaces for vision/screenshot analysis
LLM role assignment. Defaults to 0 (Auto mode), same convention as
agent_llm_id and document_summary_llm_id.
"""
from __future__ import annotations
from collections.abc import Sequence
import sqlalchemy as sa
from alembic import op
revision: str = "119"
down_revision: str | None = "118"
branch_labels: str | Sequence[str] | None = None
depends_on: str | Sequence[str] | None = None
def upgrade() -> None:
    conn = op.get_bind()
    existing_columns = [
        col["name"] for col in sa.inspect(conn).get_columns("searchspaces")
    ]
    if "vision_llm_id" not in existing_columns:
        op.add_column(
            "searchspaces",
            sa.Column("vision_llm_id", sa.Integer(), nullable=True, server_default="0"),
        )


def downgrade() -> None:
    op.drop_column("searchspaces", "vision_llm_id")

View file

@ -0,0 +1,199 @@
"""Add vision LLM configs table and rename preference column
Revision ID: 120
Revises: 119
Changes:
1. Create visionprovider enum type
2. Create vision_llm_configs table
3. Rename vision_llm_id -> vision_llm_config_id on searchspaces
4. Add vision config permissions to existing system roles
"""
from __future__ import annotations
from collections.abc import Sequence
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import ENUM as PG_ENUM, UUID
from alembic import op
revision: str = "120"
down_revision: str | None = "119"
branch_labels: str | Sequence[str] | None = None
depends_on: str | Sequence[str] | None = None
VISION_PROVIDER_VALUES = (
    "OPENAI",
    "ANTHROPIC",
    "GOOGLE",
    "AZURE_OPENAI",
    "VERTEX_AI",
    "BEDROCK",
    "XAI",
    "OPENROUTER",
    "OLLAMA",
    "GROQ",
    "TOGETHER_AI",
    "FIREWORKS_AI",
    "DEEPSEEK",
    "MISTRAL",
    "CUSTOM",
)
def upgrade() -> None:
    connection = op.get_bind()

    # 1. Create visionprovider enum
    connection.execute(
        sa.text(
            """
            DO $$
            BEGIN
                IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'visionprovider') THEN
                    CREATE TYPE visionprovider AS ENUM (
                        'OPENAI', 'ANTHROPIC', 'GOOGLE', 'AZURE_OPENAI', 'VERTEX_AI',
                        'BEDROCK', 'XAI', 'OPENROUTER', 'OLLAMA', 'GROQ',
                        'TOGETHER_AI', 'FIREWORKS_AI', 'DEEPSEEK', 'MISTRAL', 'CUSTOM'
                    );
                END IF;
            END
            $$;
            """
        )
    )

    # 2. Create vision_llm_configs table
    result = connection.execute(
        sa.text(
            "SELECT EXISTS (SELECT FROM information_schema.tables WHERE table_name = 'vision_llm_configs')"
        )
    )
    if not result.scalar():
        op.create_table(
            "vision_llm_configs",
            sa.Column("id", sa.Integer(), autoincrement=True, nullable=False),
            sa.Column("name", sa.String(100), nullable=False),
            sa.Column("description", sa.String(500), nullable=True),
            sa.Column(
                "provider",
                PG_ENUM(
                    *VISION_PROVIDER_VALUES, name="visionprovider", create_type=False
                ),
                nullable=False,
            ),
            sa.Column("custom_provider", sa.String(100), nullable=True),
            sa.Column("model_name", sa.String(100), nullable=False),
            sa.Column("api_key", sa.String(), nullable=False),
            sa.Column("api_base", sa.String(500), nullable=True),
            sa.Column("api_version", sa.String(50), nullable=True),
            sa.Column("litellm_params", sa.JSON(), nullable=True),
            sa.Column("search_space_id", sa.Integer(), nullable=False),
            sa.Column("user_id", UUID(as_uuid=True), nullable=False),
            sa.Column(
                "created_at",
                sa.TIMESTAMP(timezone=True),
                server_default=sa.text("now()"),
                nullable=False,
            ),
            sa.PrimaryKeyConstraint("id"),
            sa.ForeignKeyConstraint(
                ["search_space_id"], ["searchspaces.id"], ondelete="CASCADE"
            ),
            sa.ForeignKeyConstraint(["user_id"], ["user.id"], ondelete="CASCADE"),
        )

    op.execute(
        "CREATE INDEX IF NOT EXISTS ix_vision_llm_configs_name "
        "ON vision_llm_configs (name)"
    )
    op.execute(
        "CREATE INDEX IF NOT EXISTS ix_vision_llm_configs_search_space_id "
        "ON vision_llm_configs (search_space_id)"
    )

    # 3. Rename vision_llm_id -> vision_llm_config_id on searchspaces
    existing_columns = [
        col["name"] for col in sa.inspect(connection).get_columns("searchspaces")
    ]
    if (
        "vision_llm_id" in existing_columns
        and "vision_llm_config_id" not in existing_columns
    ):
        op.alter_column(
            "searchspaces", "vision_llm_id", new_column_name="vision_llm_config_id"
        )
    elif "vision_llm_config_id" not in existing_columns:
        op.add_column(
            "searchspaces",
            sa.Column(
                "vision_llm_config_id", sa.Integer(), nullable=True, server_default="0"
            ),
        )

    # 4. Add vision config permissions to existing system roles
    connection.execute(
        sa.text(
            """
            UPDATE search_space_roles
            SET permissions = array_cat(
                permissions,
                ARRAY['vision_configs:create', 'vision_configs:read']
            )
            WHERE is_system_role = true
              AND name = 'Editor'
              AND NOT ('vision_configs:create' = ANY(permissions))
            """
        )
    )
    connection.execute(
        sa.text(
            """
            UPDATE search_space_roles
            SET permissions = array_cat(
                permissions,
                ARRAY['vision_configs:read']
            )
            WHERE is_system_role = true
              AND name = 'Viewer'
              AND NOT ('vision_configs:read' = ANY(permissions))
            """
        )
    )
def downgrade() -> None:
    connection = op.get_bind()

    # Remove permissions
    connection.execute(
        sa.text(
            """
            UPDATE search_space_roles
            SET permissions = array_remove(
                array_remove(
                    array_remove(permissions, 'vision_configs:create'),
                    'vision_configs:read'
                ),
                'vision_configs:delete'
            )
            WHERE is_system_role = true
            """
        )
    )

    # Rename column back
    existing_columns = [
        col["name"] for col in sa.inspect(connection).get_columns("searchspaces")
    ]
    if "vision_llm_config_id" in existing_columns:
        op.alter_column(
            "searchspaces", "vision_llm_config_id", new_column_name="vision_llm_id"
        )

    # Drop table and enum
    op.execute("DROP INDEX IF EXISTS ix_vision_llm_configs_search_space_id")
    op.execute("DROP INDEX IF EXISTS ix_vision_llm_configs_name")
    op.execute("DROP TABLE IF EXISTS vision_llm_configs")
    op.execute("DROP TYPE IF EXISTS visionprovider")
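
Step 4 of the migration above appends permissions with `array_cat`, guarded by a `NOT (... = ANY(permissions))` check on one permission. The same idea can be sketched in plain Python; `add_missing_permissions` is a hypothetical helper written for illustration, and unlike the SQL guard it checks every wanted permission individually, so a role that already holds part of the set does not end up with duplicates.

```python
def add_missing_permissions(current: list[str], wanted: list[str]) -> list[str]:
    """Return ``current`` plus any ``wanted`` permissions it lacks, in order."""
    seen = set(current)
    result = list(current)
    for permission in wanted:
        if permission not in seen:
            seen.add(permission)
            result.append(permission)
    return result


editor = ["documents:read", "vision_configs:read"]
updated = add_missing_permissions(
    editor, ["vision_configs:create", "vision_configs:read"]
)
print(updated)

# Applying the same change twice leaves the list unchanged (idempotent).
again = add_missing_permissions(
    updated, ["vision_configs:create", "vision_configs:read"]
)
```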

View file

@ -17,10 +17,10 @@ depends_on: str | Sequence[str] | None = None
def upgrade() -> None:
"""
Add the new_llm_configs table that combines LLM model settings with prompt configuration.
Add the new_llm_configs table that combines model settings with prompt configuration.
This table includes:
- LLM model configuration (provider, model_name, api_key, etc.)
- Model configuration (provider, model_name, api_key, etc.)
- Configurable system instructions
- Citation toggle
"""
@ -41,7 +41,7 @@ def upgrade() -> None:
name VARCHAR(100) NOT NULL,
description VARCHAR(500),
-- LLM Model Configuration (same as llm_configs, excluding language)
-- Model Configuration (same as llm_configs, excluding language)
provider litellmprovider NOT NULL,
custom_provider VARCHAR(100),
model_name VARCHAR(100) NOT NULL,

View file

@ -0,0 +1,11 @@
"""Agent-based vision autocomplete with scoped filesystem exploration."""
from app.agents.autocomplete.autocomplete_agent import (
    create_autocomplete_agent,
    stream_autocomplete_agent,
)

__all__ = [
    "create_autocomplete_agent",
    "stream_autocomplete_agent",
]

View file

@ -0,0 +1,495 @@
"""Vision autocomplete agent with scoped filesystem exploration.
Converts the stateless single-shot vision autocomplete into an agent that
seeds a virtual filesystem from KB search results and lets the vision LLM
explore documents via ``ls``, ``read_file``, ``glob``, ``grep``, etc.
before generating the final completion.
Performance: KB search and agent graph compilation run in parallel, so
the only sequential latency is KB search (or agent compile, whichever is
slower) plus the agent's LLM turns. There is no separate "query extraction"
LLM call; the window title is used directly as the KB search query.
"""
from __future__ import annotations
import asyncio
import json
import logging
import re
import uuid
from collections.abc import AsyncGenerator
from typing import Any
from deepagents.graph import BASE_AGENT_PROMPT
from deepagents.middleware.patch_tool_calls import PatchToolCallsMiddleware
from langchain.agents import create_agent
from langchain_anthropic.middleware import AnthropicPromptCachingMiddleware
from langchain_core.language_models import BaseChatModel
from langchain_core.messages import AIMessage, ToolMessage
from app.agents.new_chat.middleware.filesystem import SurfSenseFilesystemMiddleware
from app.agents.new_chat.middleware.knowledge_search import (
    build_scoped_filesystem,
    search_knowledge_base,
)
from app.services.new_streaming_service import VercelStreamingService
logger = logging.getLogger(__name__)
KB_TOP_K = 10
# ---------------------------------------------------------------------------
# System prompt
# ---------------------------------------------------------------------------
AUTOCOMPLETE_SYSTEM_PROMPT = """You are a smart writing assistant that analyzes the user's screen to draft or complete text.
You will receive a screenshot of the user's screen. Your PRIMARY source of truth is the screenshot itself — the visual context determines what to write.
Your job:
1. Analyze the ENTIRE screenshot to understand what the user is working on (email thread, chat conversation, document, code editor, form, etc.).
2. Identify the text area where the user will type.
3. Generate the text the user most likely wants to write based on the visual context.
You also have access to the user's knowledge base documents via filesystem tools. However:
- ONLY consult the knowledge base if the screenshot clearly involves a topic where your KB documents are DIRECTLY relevant (e.g., the user is writing about a specific project/topic that matches a document title).
- Do NOT explore documents just because they exist. Most autocomplete requests can be answered purely from the screenshot.
- If you do read a document, only incorporate information that is 100% relevant to what the user is typing RIGHT NOW. Do not add extra details, background, or tangential information from the KB.
- Keep your output SHORT: autocomplete should feel like a natural continuation, not an essay.
Key behavior:
- If the text area is EMPTY, draft a concise response or message based on what you see on screen (e.g., reply to an email, respond to a chat message, continue a document).
- If the text area already has text, continue it naturally, typically with just a sentence or two.
Rules:
- Be CONCISE. Prefer a single paragraph or a few sentences. Autocomplete is a quick assist, not a full draft.
- Match the tone and formality of the surrounding context.
- If the screen shows code, write code. If it shows a casual chat, be casual. If it shows a formal email, be formal.
- Do NOT describe the screenshot or explain your reasoning.
- Do NOT cite or reference documents explicitly; just let the knowledge inform your writing naturally.
- If you cannot determine what to write, output an empty JSON array: []
## Output Format
You MUST provide exactly 3 different suggestion options. Each should be a distinct, plausible completion; vary the tone, detail level, or angle.
Return your suggestions as a JSON array of exactly 3 strings. Output ONLY the JSON array, nothing else: no markdown fences, no explanation, no commentary.
Example format:
["First suggestion text here.", "Second suggestion — a different take.", "Third option with another approach."]
## Filesystem Tools: `ls`, `read_file`, `write_file`, `edit_file`, `glob`, `grep`
All file paths must start with a `/`.
- ls: list files and directories at a given path.
- read_file: read a file from the filesystem.
- write_file: create a temporary file in the session (not persisted).
- edit_file: edit a file in the session (not persisted for /documents/ files).
- glob: find files matching a pattern (e.g., "**/*.xml").
- grep: search for text within files.
## When to Use Filesystem Tools
BEFORE reaching for any tool, ask yourself: "Can I write a good completion purely from the screenshot?" If yes, just write it; do NOT explore the KB.
Only use tools when:
- The user is clearly writing about a specific topic that likely has detailed information in their KB.
- You need a specific fact, name, number, or reference that the screenshot doesn't provide.
When you do use tools, be surgical:
- Check the `ls` output first. If no document title looks relevant, stop; do not read files just to see what's there.
- If a title looks relevant, read only the `<chunk_index>` (first ~20 lines) and jump to matched chunks. Do not read entire documents.
- Extract only the specific information you need and move on to generating the completion.
## Reading Documents Efficiently
Documents are formatted as XML. Each document contains:
- `<document_metadata>`: title, type, URL, etc.
- `<chunk_index>`: a table of every chunk with its **line range** and a
  `matched="true"` flag for chunks that matched the search query.
- `<document_content>`: the actual chunks in original document order.
**Workflow**: read the first ~20 lines to see the `<chunk_index>`, identify
chunks marked `matched="true"`, then use `read_file(path, offset=<start_line>,
limit=<lines>)` to jump directly to those sections."""
APP_CONTEXT_BLOCK = """
The user is currently working in "{app_name}" (window: "{window_title}"). Use this to understand the type of application and adapt your tone and format accordingly."""
def _build_autocomplete_system_prompt(app_name: str, window_title: str) -> str:
    prompt = AUTOCOMPLETE_SYSTEM_PROMPT
    if app_name:
        prompt += APP_CONTEXT_BLOCK.format(app_name=app_name, window_title=window_title)
    return prompt
# ---------------------------------------------------------------------------
# Pre-compute KB filesystem (runs in parallel with agent compilation)
# ---------------------------------------------------------------------------
class _KBResult:
    """Container for pre-computed KB filesystem results."""

    __slots__ = ("files", "ls_ai_msg", "ls_tool_msg")

    def __init__(
        self,
        files: dict[str, Any] | None = None,
        ls_ai_msg: AIMessage | None = None,
        ls_tool_msg: ToolMessage | None = None,
    ) -> None:
        self.files = files
        self.ls_ai_msg = ls_ai_msg
        self.ls_tool_msg = ls_tool_msg

    @property
    def has_documents(self) -> bool:
        return bool(self.files)
async def precompute_kb_filesystem(
    search_space_id: int,
    query: str,
    top_k: int = KB_TOP_K,
) -> _KBResult:
    """Search the KB and build the scoped filesystem outside the agent.

    This is designed to be called via ``asyncio.gather`` alongside agent
    graph compilation so the two run concurrently.
    """
    if not query:
        return _KBResult()

    try:
        search_results = await search_knowledge_base(
            query=query,
            search_space_id=search_space_id,
            top_k=top_k,
        )
        if not search_results:
            return _KBResult()

        new_files, _ = await build_scoped_filesystem(
            documents=search_results,
            search_space_id=search_space_id,
        )
        if not new_files:
            return _KBResult()

        doc_paths = [
            p
            for p, v in new_files.items()
            if p.startswith("/documents/") and v is not None
        ]

        # Synthesize an `ls` tool call and its result so the agent starts
        # with the document listing already in its message history.
        tool_call_id = f"auto_ls_{uuid.uuid4().hex[:12]}"
        ai_msg = AIMessage(
            content="",
            tool_calls=[
                {"name": "ls", "args": {"path": "/documents"}, "id": tool_call_id}
            ],
        )
        tool_msg = ToolMessage(
            content=str(doc_paths) if doc_paths else "No documents found.",
            tool_call_id=tool_call_id,
        )
        return _KBResult(files=new_files, ls_ai_msg=ai_msg, ls_tool_msg=tool_msg)
    except Exception:
        logger.warning(
            "KB pre-computation failed, proceeding without KB", exc_info=True
        )
        return _KBResult()
# ---------------------------------------------------------------------------
# Filesystem middleware — no save_document, no persistence
# ---------------------------------------------------------------------------
class AutocompleteFilesystemMiddleware(SurfSenseFilesystemMiddleware):
    """Filesystem middleware for autocomplete: read-only exploration only.

    Strips ``save_document`` (permanent KB persistence) and passes
    ``search_space_id=None`` so ``write_file`` / ``edit_file`` stay ephemeral.
    """

    def __init__(self) -> None:
        super().__init__(search_space_id=None, created_by_id=None)
        self.tools = [t for t in self.tools if t.name != "save_document"]
# ---------------------------------------------------------------------------
# Agent factory
# ---------------------------------------------------------------------------
async def _compile_agent(
    llm: BaseChatModel,
    app_name: str,
    window_title: str,
) -> Any:
    """Compile the agent graph (CPU-bound, runs in a thread)."""
    system_prompt = _build_autocomplete_system_prompt(app_name, window_title)
    final_system_prompt = system_prompt + "\n\n" + BASE_AGENT_PROMPT

    middleware = [
        AutocompleteFilesystemMiddleware(),
        PatchToolCallsMiddleware(),
        AnthropicPromptCachingMiddleware(unsupported_model_behavior="ignore"),
    ]

    agent = await asyncio.to_thread(
        create_agent,
        llm,
        system_prompt=final_system_prompt,
        tools=[],
        middleware=middleware,
    )
    return agent.with_config({"recursion_limit": 200})
async def create_autocomplete_agent(
    llm: BaseChatModel,
    *,
    search_space_id: int,
    kb_query: str,
    app_name: str = "",
    window_title: str = "",
) -> tuple[Any, _KBResult]:
    """Create the autocomplete agent and pre-compute KB in parallel.

    Returns ``(agent, kb_result)`` so the caller can inject the pre-computed
    filesystem into the agent's initial state without any middleware delay.
    """
    agent, kb = await asyncio.gather(
        _compile_agent(llm, app_name, window_title),
        precompute_kb_filesystem(search_space_id, kb_query),
    )
    return agent, kb
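
The concurrency pattern used by `create_autocomplete_agent` (a blocking build step pushed to a thread with `asyncio.to_thread`, awaited together with an async I/O call via `asyncio.gather`) can be shown in isolation. This is a minimal sketch: `compile_graph` and `search_kb` are stand-ins invented here, with sleeps in place of the real work.

```python
import asyncio
import time


def compile_graph() -> str:
    """Stand-in for a blocking, synchronous build step (hypothetical)."""
    time.sleep(0.2)
    return "agent"


async def search_kb() -> str:
    """Stand-in for an async knowledge-base search (hypothetical)."""
    await asyncio.sleep(0.2)
    return "kb"


async def main() -> tuple:
    start = time.perf_counter()
    # gather() returns results in argument order, whichever finishes first.
    agent, kb = await asyncio.gather(
        asyncio.to_thread(compile_graph),  # runs in a worker thread
        search_kb(),                       # runs on the event loop
    )
    return agent, kb, time.perf_counter() - start


agent, kb, elapsed = asyncio.run(main())
print(agent, kb, f"{elapsed:.2f}s")
```

Because the two tasks overlap, the elapsed time should be close to the slower of the two rather than their sum, which is the latency argument made in the module docstring.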
# ---------------------------------------------------------------------------
# JSON suggestion parsing (with fallback)
# ---------------------------------------------------------------------------
def _parse_suggestions(raw: str) -> list[str]:
    """Extract a list of suggestion strings from the agent's output.

    Tries, in order:
    1. Direct ``json.loads``
    2. Extract content between ```json ... ``` fences
    3. Find the first ``[`` ... ``]`` span

    Falls back to wrapping the raw text as a single suggestion.
    """
    text = raw.strip()
    if not text:
        return []

    for candidate in _json_candidates(text):
        try:
            parsed = json.loads(candidate)
            if isinstance(parsed, list) and all(isinstance(s, str) for s in parsed):
                return [s for s in parsed if s.strip()]
        except (json.JSONDecodeError, ValueError):
            continue

    return [text]
def _json_candidates(text: str) -> list[str]:
"""Return candidate JSON strings from raw text, in the order they should be tried."""
candidates = [text]
fence = re.search(r"```(?:json)?\s*\n?(.*?)```", text, re.DOTALL)
if fence:
candidates.append(fence.group(1).strip())
bracket = re.search(r"\[.*]", text, re.DOTALL)
if bracket:
candidates.append(bracket.group(0))
return candidates
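The fallback order described in the `_parse_suggestions` docstring can be exercised in isolation. The following is a self-contained sketch that mirrors the two helpers above under hypothetical public names (`json_candidates`, `parse_suggestions`). The fence marker is built programmatically only to keep this example free of literal fence markers, which is not something the original code does.

```python
import json
import re

# Built programmatically only to avoid a literal fence marker in this example.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*\n?(.*?)" + FENCE, re.DOTALL)
BRACKET_RE = re.compile(r"\[.*]", re.DOTALL)


def json_candidates(text: str) -> list[str]:
    """Return candidate JSON strings: whole text, fenced body, bracket span."""
    candidates = [text]
    fence = FENCE_RE.search(text)
    if fence:
        candidates.append(fence.group(1).strip())
    bracket = BRACKET_RE.search(text)
    if bracket:
        candidates.append(bracket.group(0))
    return candidates


def parse_suggestions(raw: str) -> list[str]:
    """Return the first candidate that parses as a list of strings."""
    text = raw.strip()
    if not text:
        return []
    for candidate in json_candidates(text):
        try:
            parsed = json.loads(candidate)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            continue
        if isinstance(parsed, list) and all(isinstance(s, str) for s in parsed):
            # Drop empty or whitespace-only suggestions.
            return [s for s in parsed if s.strip()]
    # Nothing parsed: treat the raw text as a single suggestion.
    return [text]
```

Valid JSON that is not a list of strings (for example an object or a list of numbers) falls through to the final fallback and is returned as one raw suggestion.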
# ---------------------------------------------------------------------------
# Streaming helper
# ---------------------------------------------------------------------------
async def stream_autocomplete_agent(
agent: Any,
input_data: dict[str, Any],
streaming_service: VercelStreamingService,
*,
emit_message_start: bool = True,
) -> AsyncGenerator[str, None]:
"""Stream agent events as Vercel SSE, with thinking steps for tool calls.
When ``emit_message_start`` is False the caller has already sent the
``message_start`` event (e.g. to show preparation steps before the agent
runs).
"""
thread_id = uuid.uuid4().hex
config = {"configurable": {"thread_id": thread_id}}
text_buffer: list[str] = []
active_tool_depth = 0
thinking_step_counter = 0
tool_step_ids: dict[str, str] = {}
step_titles: dict[str, str] = {}
completed_step_ids: set[str] = set()
last_active_step_id: str | None = None
def next_thinking_step_id() -> str:
nonlocal thinking_step_counter
thinking_step_counter += 1
return f"autocomplete-step-{thinking_step_counter}"
def complete_current_step() -> str | None:
nonlocal last_active_step_id
if last_active_step_id and last_active_step_id not in completed_step_ids:
completed_step_ids.add(last_active_step_id)
title = step_titles.get(last_active_step_id, "Done")
event = streaming_service.format_thinking_step(
step_id=last_active_step_id,
title=title,
status="complete",
)
last_active_step_id = None
return event
return None
if emit_message_start:
yield streaming_service.format_message_start()
gen_step_id = next_thinking_step_id()
last_active_step_id = gen_step_id
step_titles[gen_step_id] = "Generating suggestions"
yield streaming_service.format_thinking_step(
step_id=gen_step_id,
title="Generating suggestions",
status="in_progress",
)
try:
async for event in agent.astream_events(
input_data, config=config, version="v2"
):
event_type = event.get("event", "")
if event_type == "on_chat_model_stream":
if active_tool_depth > 0:
continue
if "surfsense:internal" in event.get("tags", []):
continue
chunk = event.get("data", {}).get("chunk")
if chunk and hasattr(chunk, "content"):
content = chunk.content
if content and isinstance(content, str):
text_buffer.append(content)
elif event_type == "on_chat_model_end":
if active_tool_depth > 0:
continue
if "surfsense:internal" in event.get("tags", []):
continue
output = event.get("data", {}).get("output")
if output and hasattr(output, "content"):
if getattr(output, "tool_calls", None):
continue
content = output.content
if content and isinstance(content, str) and not text_buffer:
text_buffer.append(content)
elif event_type == "on_tool_start":
active_tool_depth += 1
tool_name = event.get("name", "unknown_tool")
run_id = event.get("run_id", "")
tool_input = event.get("data", {}).get("input", {})
step_event = complete_current_step()
if step_event:
yield step_event
tool_step_id = next_thinking_step_id()
tool_step_ids[run_id] = tool_step_id
last_active_step_id = tool_step_id
title, items = _describe_tool_call(tool_name, tool_input)
step_titles[tool_step_id] = title
yield streaming_service.format_thinking_step(
step_id=tool_step_id,
title=title,
status="in_progress",
items=items,
)
elif event_type == "on_tool_end":
active_tool_depth = max(0, active_tool_depth - 1)
run_id = event.get("run_id", "")
step_id = tool_step_ids.pop(run_id, None)
if step_id and step_id not in completed_step_ids:
completed_step_ids.add(step_id)
title = step_titles.get(step_id, "Done")
yield streaming_service.format_thinking_step(
step_id=step_id,
title=title,
status="complete",
)
if last_active_step_id == step_id:
last_active_step_id = None
step_event = complete_current_step()
if step_event:
yield step_event
raw_text = "".join(text_buffer)
suggestions = _parse_suggestions(raw_text)
yield streaming_service.format_data("suggestions", {"options": suggestions})
yield streaming_service.format_finish()
yield streaming_service.format_done()
except Exception as e:
logger.error(f"Autocomplete agent streaming error: {e}", exc_info=True)
yield streaming_service.format_error("Autocomplete failed. Please try again.")
yield streaming_service.format_done()
def _describe_tool_call(tool_name: str, tool_input: Any) -> tuple[str, list[str]]:
"""Return a human-readable (title, items) for a tool call thinking step."""
inp = tool_input if isinstance(tool_input, dict) else {}
if tool_name == "ls":
path = inp.get("path", "/")
return "Listing files", [path]
if tool_name == "read_file":
fp = inp.get("file_path", "")
display = fp if len(fp) <= 80 else "…" + fp[-77:]
return "Reading file", [display]
if tool_name == "write_file":
fp = inp.get("file_path", "")
display = fp if len(fp) <= 80 else "…" + fp[-77:]
return "Writing file", [display]
if tool_name == "edit_file":
fp = inp.get("file_path", "")
display = fp if len(fp) <= 80 else "…" + fp[-77:]
return "Editing file", [display]
if tool_name == "glob":
pat = inp.get("pattern", "")
base = inp.get("path", "/")
return "Searching files", [f"{pat} in {base}"]
if tool_name == "grep":
pat = inp.get("pattern", "")
path = inp.get("path", "")
display_pat = pat[:60] + ("…" if len(pat) > 60 else "")
return "Searching content", [
f'"{display_pat}"' + (f" in {path}" if path else "")
]
return f"Using {tool_name}", []
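The `read_file`, `write_file`, and `edit_file` branches above shorten long paths in the same way, so the rule could be factored into one helper. This is a sketch only: `shorten_path` is a hypothetical name, and the leading marker character is an assumption about the intended display, since the branches prepend a marker to the last 77 characters.

```python
def shorten_path(path: str, limit: int = 80, tail: int = 77, marker: str = "…") -> str:
    """Return paths up to `limit` characters unchanged; otherwise keep the tail.

    The tail is kept rather than the head because the file name at the end of
    a path is usually the most informative part for a reader.
    """
    if len(path) <= limit:
        return path
    return marker + path[-tail:]
```

With a helper like this, each branch would reduce to a call such as `shorten_path(inp.get("file_path", ""))`.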


@ -159,6 +159,7 @@ async def create_surfsense_deep_agent(
additional_tools: Sequence[BaseTool] | None = None,
firecrawl_api_key: str | None = None,
thread_visibility: ChatVisibility | None = None,
mentioned_document_ids: list[int] | None = None,
):
"""
Create a SurfSense deep agent with configurable tools and prompts.
@ -451,6 +452,7 @@ async def create_surfsense_deep_agent(
search_space_id=search_space_id,
available_connectors=available_connectors,
available_document_types=available_document_types,
mentioned_document_ids=mentioned_document_ids,
),
SurfSenseFilesystemMiddleware(
search_space_id=search_space_id,


@ -66,6 +66,16 @@ the `<chunk_index>`, identify chunks marked `matched="true"`, then use
those sections instead of reading the entire file sequentially.
Use `<chunk id='...'>` values as citation IDs in your answers.
## User-Mentioned Documents
When the `ls` output tags a file with `[MENTIONED BY USER read deeply]`,
the user **explicitly selected** that document. These files are your highest-
priority sources:
1. **Always read them thoroughly**: scan the full `<chunk_index>`, then read
all major sections, not just matched chunks.
2. **Prefer their content** over other search results when answering.
3. **Cite from them first** whenever applicable.
"""
# =============================================================================


@ -28,7 +28,13 @@ from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
from app.agents.new_chat.utils import parse_date_or_datetime, resolve_date_range
from app.db import NATIVE_TO_LEGACY_DOCTYPE, Document, Folder, shielded_async_session
from app.db import (
NATIVE_TO_LEGACY_DOCTYPE,
Chunk,
Document,
Folder,
shielded_async_session,
)
from app.retriever.chunks_hybrid_search import ChucksHybridSearchRetriever
from app.utils.document_converters import embed_texts
from app.utils.perf import get_perf_logger
@ -430,21 +436,36 @@ async def _get_folder_paths(
def _build_synthetic_ls(
existing_files: dict[str, Any] | None,
new_files: dict[str, Any],
*,
mentioned_paths: set[str] | None = None,
) -> tuple[AIMessage, ToolMessage]:
"""Build a synthetic ls("/documents") tool-call + result for the LLM context.
Paths are listed with *new* (rank-ordered) files first, then existing files
that were already in state from prior turns.
Mentioned files are listed first. A separate header tells the LLM which
files the user explicitly selected; the path list itself stays clean so
paths can be passed directly to ``read_file`` without stripping tags.
"""
_mentioned = mentioned_paths or set()
merged: dict[str, Any] = {**(existing_files or {}), **new_files}
doc_paths = [
p for p, v in merged.items() if p.startswith("/documents/") and v is not None
]
new_set = set(new_files)
new_paths = [p for p in doc_paths if p in new_set]
mentioned_list = [p for p in doc_paths if p in _mentioned]
new_non_mentioned = [p for p in doc_paths if p in new_set and p not in _mentioned]
old_paths = [p for p in doc_paths if p not in new_set]
ordered = new_paths + old_paths
ordered = mentioned_list + new_non_mentioned + old_paths
parts: list[str] = []
if mentioned_list:
parts.append(
"USER-MENTIONED documents (read these thoroughly before answering):"
)
for p in mentioned_list:
parts.append(f" {p}")
parts.append("")
parts.append(str(ordered) if ordered else "No documents found.")
tool_call_id = f"auto_ls_{uuid.uuid4().hex[:12]}"
ai_msg = AIMessage(
@ -452,7 +473,7 @@ def _build_synthetic_ls(
tool_calls=[{"name": "ls", "args": {"path": "/documents"}, "id": tool_call_id}],
)
tool_msg = ToolMessage(
content=str(ordered) if ordered else "No documents found.",
content="\n".join(parts),
tool_call_id=tool_call_id,
)
return ai_msg, tool_msg
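The ordering rule in `_build_synthetic_ls` (mentioned files, then other new files, then files from earlier turns) can be checked with plain dictionaries. The sketch below reproduces only the ordering step. `order_paths` is a hypothetical name, and the message construction is left out.

```python
from typing import Any, Optional


def order_paths(
    existing_files: Optional[dict[str, Any]],
    new_files: dict[str, Any],
    mentioned_paths: Optional[set[str]] = None,
) -> list[str]:
    """Order document paths: user-mentioned, then other new, then older ones."""
    mentioned = mentioned_paths or set()
    # New entries override existing ones with the same path.
    merged = {**(existing_files or {}), **new_files}
    doc_paths = [
        p for p, v in merged.items() if p.startswith("/documents/") and v is not None
    ]
    new_set = set(new_files)
    mentioned_list = [p for p in doc_paths if p in mentioned]
    new_non_mentioned = [p for p in doc_paths if p in new_set and p not in mentioned]
    old_paths = [p for p in doc_paths if p not in new_set]
    return mentioned_list + new_non_mentioned + old_paths
```

This assumes, as the middleware does, that every mentioned path is also present in `new_files`. A mentioned path that existed only in `existing_files` would appear twice in the result.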
@ -524,12 +545,92 @@ async def search_knowledge_base(
return results[:top_k]
async def fetch_mentioned_documents(
*,
document_ids: list[int],
search_space_id: int,
) -> list[dict[str, Any]]:
"""Fetch explicitly mentioned documents with *all* their chunks.
Returns the same dict structure as ``search_knowledge_base`` so results
can be merged directly into ``build_scoped_filesystem``. Unlike search
results, every chunk is included (no top-K limiting) and none are marked
as ``matched`` since the entire document is relevant by virtue of the
user's explicit mention.
"""
if not document_ids:
return []
async with shielded_async_session() as session:
doc_result = await session.execute(
select(Document).where(
Document.id.in_(document_ids),
Document.search_space_id == search_space_id,
)
)
docs = {doc.id: doc for doc in doc_result.scalars().all()}
if not docs:
return []
chunk_result = await session.execute(
select(Chunk.id, Chunk.content, Chunk.document_id)
.where(Chunk.document_id.in_(list(docs.keys())))
.order_by(Chunk.document_id, Chunk.id)
)
chunks_by_doc: dict[int, list[dict[str, Any]]] = {doc_id: [] for doc_id in docs}
for row in chunk_result.all():
if row.document_id in chunks_by_doc:
chunks_by_doc[row.document_id].append(
{"chunk_id": row.id, "content": row.content}
)
results: list[dict[str, Any]] = []
for doc_id in document_ids:
doc = docs.get(doc_id)
if doc is None:
continue
metadata = doc.document_metadata or {}
results.append(
{
"document_id": doc.id,
"content": "",
"score": 1.0,
"chunks": chunks_by_doc.get(doc.id, []),
"matched_chunk_ids": [],
"document": {
"id": doc.id,
"title": doc.title,
"document_type": (
doc.document_type.value
if getattr(doc, "document_type", None)
else None
),
"metadata": metadata,
},
"source": (
doc.document_type.value
if getattr(doc, "document_type", None)
else None
),
"_user_mentioned": True,
}
)
return results
async def build_scoped_filesystem(
*,
documents: Sequence[dict[str, Any]],
search_space_id: int,
) -> dict[str, dict[str, str]]:
"""Build a StateBackend-compatible files dict from search results."""
) -> tuple[dict[str, dict[str, str]], dict[int, str]]:
"""Build a StateBackend-compatible files dict from search results.
Returns ``(files, doc_id_to_path)`` so callers can reliably map a
document id back to its filesystem path without guessing by title.
Paths are collision-proof: when two documents resolve to the same
path the doc-id is appended to disambiguate.
"""
async with shielded_async_session() as session:
folder_paths = await _get_folder_paths(session, search_space_id)
doc_ids = [
@ -551,6 +652,7 @@ async def build_scoped_filesystem(
}
files: dict[str, dict[str, str]] = {}
doc_id_to_path: dict[int, str] = {}
for document in documents:
doc_meta = document.get("document") or {}
title = str(doc_meta.get("title") or "untitled")
@ -559,6 +661,9 @@ async def build_scoped_filesystem(
base_folder = folder_paths.get(folder_id, "/documents")
file_name = _safe_filename(title)
path = f"{base_folder}/{file_name}"
if path in files:
stem = file_name.removesuffix(".xml")
path = f"{base_folder}/{stem} ({doc_id}).xml"
matched_ids = set(document.get("matched_chunk_ids") or [])
xml_content = _build_document_xml(document, matched_chunk_ids=matched_ids)
files[path] = {
@ -567,7 +672,9 @@ async def build_scoped_filesystem(
"created_at": "",
"modified_at": "",
}
return files
if isinstance(doc_id, int):
doc_id_to_path[doc_id] = path
return files, doc_id_to_path
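The collision handling in `build_scoped_filesystem` can be seen in a reduced form. In this sketch `assign_paths` is a hypothetical helper that takes `(doc_id, base_folder, file_name)` tuples, and it assumes file names already end in `.xml`, as the `removesuffix(".xml")` call above suggests.

```python
def assign_paths(docs: list[tuple[int, str, str]]) -> dict[int, str]:
    """Map each document id to a unique path, appending the id on collision."""
    used: set[str] = set()
    doc_id_to_path: dict[int, str] = {}
    for doc_id, base_folder, file_name in docs:
        path = f"{base_folder}/{file_name}"
        if path in used:
            # Same folder and title as an earlier document: disambiguate by id.
            stem = file_name.removesuffix(".xml")
            path = f"{base_folder}/{stem} ({doc_id}).xml"
        used.add(path)
        doc_id_to_path[doc_id] = path
    return doc_id_to_path
```

Returning the id-to-path mapping alongside the files is what lets the caller find mentioned documents by id instead of reconstructing paths from titles.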
class KnowledgeBaseSearchMiddleware(AgentMiddleware): # type: ignore[type-arg]
@ -583,12 +690,14 @@ class KnowledgeBaseSearchMiddleware(AgentMiddleware): # type: ignore[type-arg]
available_connectors: list[str] | None = None,
available_document_types: list[str] | None = None,
top_k: int = 10,
mentioned_document_ids: list[int] | None = None,
) -> None:
self.llm = llm
self.search_space_id = search_space_id
self.available_connectors = available_connectors
self.available_document_types = available_document_types
self.top_k = top_k
self.mentioned_document_ids = mentioned_document_ids or []
async def _plan_search_inputs(
self,
@ -680,6 +789,18 @@ class KnowledgeBaseSearchMiddleware(AgentMiddleware): # type: ignore[type-arg]
user_text=user_text,
)
# --- 1. Fetch mentioned documents (user-selected, all chunks) ---
mentioned_results: list[dict[str, Any]] = []
if self.mentioned_document_ids:
mentioned_results = await fetch_mentioned_documents(
document_ids=self.mentioned_document_ids,
search_space_id=self.search_space_id,
)
# Clear after first turn so they are not re-fetched on subsequent
# messages within the same agent instance.
self.mentioned_document_ids = []
# --- 2. Run KB hybrid search ---
search_results = await search_knowledge_base(
query=planned_query,
search_space_id=self.search_space_id,
@ -689,19 +810,50 @@ class KnowledgeBaseSearchMiddleware(AgentMiddleware): # type: ignore[type-arg]
start_date=start_date,
end_date=end_date,
)
new_files = await build_scoped_filesystem(
documents=search_results,
# --- 3. Merge: mentioned first, then search (dedup by doc id) ---
seen_doc_ids: set[int] = set()
merged: list[dict[str, Any]] = []
for doc in mentioned_results:
doc_id = (doc.get("document") or {}).get("id")
if doc_id is not None:
seen_doc_ids.add(doc_id)
merged.append(doc)
for doc in search_results:
doc_id = (doc.get("document") or {}).get("id")
if doc_id is not None and doc_id in seen_doc_ids:
continue
merged.append(doc)
# --- 4. Build scoped filesystem ---
new_files, doc_id_to_path = await build_scoped_filesystem(
documents=merged,
search_space_id=self.search_space_id,
)
ai_msg, tool_msg = _build_synthetic_ls(existing_files, new_files)
# Identify which paths belong to user-mentioned documents using
# the authoritative doc_id -> path mapping (no title guessing).
mentioned_doc_ids = {
(d.get("document") or {}).get("id") for d in mentioned_results
}
mentioned_paths = {
doc_id_to_path[did] for did in mentioned_doc_ids if did in doc_id_to_path
}
ai_msg, tool_msg = _build_synthetic_ls(
existing_files,
new_files,
mentioned_paths=mentioned_paths,
)
if t0 is not None:
_perf_log.info(
"[kb_fs_middleware] completed in %.3fs query=%r optimized=%r new_files=%d total=%d",
"[kb_fs_middleware] completed in %.3fs query=%r optimized=%r "
"mentioned=%d new_files=%d total=%d",
asyncio.get_event_loop().time() - t0,
user_text[:80],
planned_query[:120],
len(mentioned_results),
len(new_files),
len(new_files) + len(existing_files or {}),
)
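Step 3 of the middleware (mentioned documents first, then search hits, deduplicated by document id) is small enough to test on its own. The following sketch uses a hypothetical `merge_results` function with the same dictionary shape as the results above.

```python
from typing import Any


def merge_results(
    mentioned: list[dict[str, Any]],
    search: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Keep every mentioned document, then add search hits not already seen."""
    seen_doc_ids: set[int] = set()
    merged: list[dict[str, Any]] = []
    for doc in mentioned:
        doc_id = (doc.get("document") or {}).get("id")
        if doc_id is not None:
            seen_doc_ids.add(doc_id)
        merged.append(doc)
    for doc in search:
        doc_id = (doc.get("document") or {}).get("id")
        if doc_id is not None and doc_id in seen_doc_ids:
            continue  # The full, user-mentioned version wins over the search hit.
        merged.append(doc)
    return merged
```

Search hits without a document id are kept, since they cannot be matched against a mentioned document.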


@ -25,7 +25,12 @@ from app.agents.new_chat.checkpointer import (
close_checkpointer,
setup_checkpointer_tables,
)
from app.config import config, initialize_image_gen_router, initialize_llm_router
from app.config import (
config,
initialize_image_gen_router,
initialize_llm_router,
initialize_vision_llm_router,
)
from app.db import User, create_db_and_tables, get_async_session
from app.routes import router as crud_router
from app.routes.auth_routes import router as auth_router
@ -223,6 +228,7 @@ async def lifespan(app: FastAPI):
await setup_checkpointer_tables()
initialize_llm_router()
initialize_image_gen_router()
initialize_vision_llm_router()
try:
await asyncio.wait_for(seed_surfsense_docs(), timeout=120)
except TimeoutError:


@ -18,10 +18,15 @@ def init_worker(**kwargs):
This ensures the Auto mode (LiteLLM Router) is available for background tasks
like document summarization and image generation.
"""
from app.config import initialize_image_gen_router, initialize_llm_router
from app.config import (
initialize_image_gen_router,
initialize_llm_router,
initialize_vision_llm_router,
)
initialize_llm_router()
initialize_image_gen_router()
initialize_vision_llm_router()
# Get Celery configuration from environment


@ -102,6 +102,44 @@ def load_global_image_gen_configs():
return []
def load_global_vision_llm_configs():
global_config_file = BASE_DIR / "app" / "config" / "global_llm_config.yaml"
if not global_config_file.exists():
return []
try:
with open(global_config_file, encoding="utf-8") as f:
data = yaml.safe_load(f)
return data.get("global_vision_llm_configs", [])
except Exception as e:
print(f"Warning: Failed to load global vision LLM configs: {e}")
return []
def load_vision_llm_router_settings():
default_settings = {
"routing_strategy": "usage-based-routing",
"num_retries": 3,
"allowed_fails": 3,
"cooldown_time": 60,
}
global_config_file = BASE_DIR / "app" / "config" / "global_llm_config.yaml"
if not global_config_file.exists():
return default_settings
try:
with open(global_config_file, encoding="utf-8") as f:
data = yaml.safe_load(f)
settings = data.get("vision_llm_router_settings", {})
return {**default_settings, **settings}
except Exception as e:
print(f"Warning: Failed to load vision LLM router settings: {e}")
return default_settings
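The override behaviour of `load_vision_llm_router_settings` comes down to a dictionary merge in which YAML values win over defaults. A minimal sketch without the file handling follows. `merge_router_settings` is a hypothetical name, and the `or {}` guard for a missing or null section is an addition in this sketch (the loader above relies on its `except` branch for that case).

```python
from typing import Any, Optional

DEFAULT_ROUTER_SETTINGS: dict[str, Any] = {
    "routing_strategy": "usage-based-routing",
    "num_retries": 3,
    "allowed_fails": 3,
    "cooldown_time": 60,
}


def merge_router_settings(overrides: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Return the defaults with any user-supplied keys layered on top."""
    # Later mappings win in a dict display, so overrides replace defaults.
    return {**DEFAULT_ROUTER_SETTINGS, **(overrides or {})}
```

Keys that are absent from the YAML section keep their default values, so a config file only needs to list the settings it changes.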
def load_image_gen_router_settings():
"""
Load router settings for image generation Auto mode from YAML file.
@ -182,6 +220,29 @@ def initialize_image_gen_router():
print(f"Warning: Failed to initialize Image Generation Router: {e}")
def initialize_vision_llm_router():
vision_configs = load_global_vision_llm_configs()
router_settings = load_vision_llm_router_settings()
if not vision_configs:
print(
"Info: No global vision LLM configs found, "
"Vision LLM Auto mode will not be available"
)
return
try:
from app.services.vision_llm_router_service import VisionLLMRouterService
VisionLLMRouterService.initialize(vision_configs, router_settings)
print(
f"Info: Vision LLM Router initialized with {len(vision_configs)} models "
f"(strategy: {router_settings.get('routing_strategy', 'usage-based-routing')})"
)
except Exception as e:
print(f"Warning: Failed to initialize Vision LLM Router: {e}")
class Config:
# Check if ffmpeg is installed
if not is_ffmpeg_installed():
@ -335,6 +396,12 @@ class Config:
# Router settings for Image Generation Auto mode
IMAGE_GEN_ROUTER_SETTINGS = load_image_gen_router_settings()
# Global Vision LLM Configurations (optional)
GLOBAL_VISION_LLM_CONFIGS = load_global_vision_llm_configs()
# Router settings for Vision LLM Auto mode
VISION_LLM_ROUTER_SETTINGS = load_vision_llm_router_settings()
# Chonkie Configuration | Edit this to your needs
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")
# Azure OpenAI credentials from environment variables
@ -394,8 +461,10 @@ class Config:
UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
elif ETL_SERVICE == "LLAMACLOUD":
# LlamaCloud API Key
LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
# Optional: Azure Document Intelligence accelerator for supported file types
AZURE_DI_ENDPOINT = os.getenv("AZURE_DI_ENDPOINT")
AZURE_DI_KEY = os.getenv("AZURE_DI_KEY")
# Residential Proxy Configuration (anonymous-proxies.net)
# Used for web crawling and YouTube transcript fetching to avoid IP bans.


@ -17,7 +17,7 @@
# - Configure router_settings below to customize the load balancing behavior
#
# Structure matches NewLLMConfig:
# - LLM model configuration (provider, model_name, api_key, etc.)
# - Model configuration (provider, model_name, api_key, etc.)
# - Prompt configuration (system_instructions, citations_enabled)
# Router Settings for Auto Mode
@ -263,6 +263,82 @@ global_image_generation_configs:
# rpm: 30
# litellm_params: {}
# =============================================================================
# Vision LLM Configuration
# =============================================================================
# These configurations power the vision autocomplete feature (screenshot analysis).
# Only vision-capable models should be used here (e.g. GPT-4o, Gemini Pro, Claude 3).
# Supported providers: OpenAI, Anthropic, Google, Azure OpenAI, Vertex AI, Bedrock,
# xAI, OpenRouter, Ollama, Groq, Together AI, Fireworks AI, DeepSeek, Mistral, Custom
#
# Auto mode (ID 0) uses LiteLLM Router for load balancing across all vision configs.
# Router Settings for Vision LLM Auto Mode
vision_llm_router_settings:
routing_strategy: "usage-based-routing"
num_retries: 3
allowed_fails: 3
cooldown_time: 60
global_vision_llm_configs:
# Example: OpenAI GPT-4o (recommended for vision)
- id: -1
name: "Global GPT-4o Vision"
description: "OpenAI's GPT-4o with strong vision capabilities"
provider: "OPENAI"
model_name: "gpt-4o"
api_key: "sk-your-openai-api-key-here"
api_base: ""
rpm: 500
tpm: 100000
litellm_params:
temperature: 0.3
max_tokens: 1000
# Example: Google Gemini 2.0 Flash
- id: -2
name: "Global Gemini 2.0 Flash"
description: "Google's fast vision model with large context"
provider: "GOOGLE"
model_name: "gemini-2.0-flash"
api_key: "your-google-ai-api-key-here"
api_base: ""
rpm: 1000
tpm: 200000
litellm_params:
temperature: 0.3
max_tokens: 1000
# Example: Anthropic Claude 3.5 Sonnet
- id: -3
name: "Global Claude 3.5 Sonnet Vision"
description: "Anthropic's Claude 3.5 Sonnet with vision support"
provider: "ANTHROPIC"
model_name: "claude-3-5-sonnet-20241022"
api_key: "sk-ant-your-anthropic-api-key-here"
api_base: ""
rpm: 1000
tpm: 100000
litellm_params:
temperature: 0.3
max_tokens: 1000
# Example: Azure OpenAI GPT-4o
# - id: -4
# name: "Global Azure GPT-4o Vision"
# description: "Azure-hosted GPT-4o for vision analysis"
# provider: "AZURE_OPENAI"
# model_name: "azure/gpt-4o-deployment"
# api_key: "your-azure-api-key-here"
# api_base: "https://your-resource.openai.azure.com"
# api_version: "2024-02-15-preview"
# rpm: 500
# tpm: 100000
# litellm_params:
# temperature: 0.3
# max_tokens: 1000
# base_model: "gpt-4o"
# Notes:
# - ID 0 is reserved for "Auto" mode - uses LiteLLM Router for load balancing
# - Use negative IDs to distinguish global configs from user configs (NewLLMConfig in DB)
@ -283,3 +359,9 @@ global_image_generation_configs:
# - The router uses litellm.aimage_generation() for async image generation
# - Only RPM (requests per minute) is relevant for image generation rate limiting.
# TPM (tokens per minute) does not apply since image APIs are billed/rate-limited per request, not per token.
#
# VISION LLM NOTES:
# - Vision configs use the same ID scheme (negative for global, positive for user DB)
# - Only use vision-capable models (GPT-4o, Gemini, Claude 3, etc.)
# - Lower temperature (0.3) is recommended for accurate screenshot analysis
# - Lower max_tokens (1000) is sufficient since autocomplete produces short suggestions


@ -0,0 +1,23 @@
[
{"value": "gpt-4o", "label": "GPT-4o", "provider": "OPENAI", "context_window": "128K"},
{"value": "gpt-4o-mini", "label": "GPT-4o Mini", "provider": "OPENAI", "context_window": "128K"},
{"value": "gpt-4-turbo", "label": "GPT-4 Turbo", "provider": "OPENAI", "context_window": "128K"},
{"value": "claude-sonnet-4-20250514", "label": "Claude Sonnet 4", "provider": "ANTHROPIC", "context_window": "200K"},
{"value": "claude-3-7-sonnet-20250219", "label": "Claude 3.7 Sonnet", "provider": "ANTHROPIC", "context_window": "200K"},
{"value": "claude-3-5-sonnet-20241022", "label": "Claude 3.5 Sonnet", "provider": "ANTHROPIC", "context_window": "200K"},
{"value": "claude-3-opus-20240229", "label": "Claude 3 Opus", "provider": "ANTHROPIC", "context_window": "200K"},
{"value": "claude-3-haiku-20240307", "label": "Claude 3 Haiku", "provider": "ANTHROPIC", "context_window": "200K"},
{"value": "gemini-2.5-flash", "label": "Gemini 2.5 Flash", "provider": "GOOGLE", "context_window": "1M"},
{"value": "gemini-2.5-pro", "label": "Gemini 2.5 Pro", "provider": "GOOGLE", "context_window": "1M"},
{"value": "gemini-2.0-flash", "label": "Gemini 2.0 Flash", "provider": "GOOGLE", "context_window": "1M"},
{"value": "gemini-1.5-pro", "label": "Gemini 1.5 Pro", "provider": "GOOGLE", "context_window": "1M"},
{"value": "gemini-1.5-flash", "label": "Gemini 1.5 Flash", "provider": "GOOGLE", "context_window": "1M"},
{"value": "pixtral-large-latest", "label": "Pixtral Large", "provider": "MISTRAL", "context_window": "128K"},
{"value": "pixtral-12b-2409", "label": "Pixtral 12B", "provider": "MISTRAL", "context_window": "128K"},
{"value": "grok-2-vision-1212", "label": "Grok 2 Vision", "provider": "XAI", "context_window": "32K"},
{"value": "llava", "label": "LLaVA", "provider": "OLLAMA"},
{"value": "bakllava", "label": "BakLLaVA", "provider": "OLLAMA"},
{"value": "llava-llama3", "label": "LLaVA Llama 3", "provider": "OLLAMA"},
{"value": "llama-4-scout-17b-16e-instruct", "label": "Llama 4 Scout 17B", "provider": "GROQ", "context_window": "128K"},
{"value": "meta-llama/Llama-4-Scout-17B-16E-Instruct", "label": "Llama 4 Scout 17B", "provider": "TOGETHER_AI", "context_window": "128K"}
]


@ -225,6 +225,55 @@ class DropboxClient:
return all_items, None
async def get_latest_cursor(self, path: str = "") -> tuple[str | None, str | None]:
"""Get a cursor representing the current state of a folder.
Uses /2/files/list_folder/get_latest_cursor so we can later call
get_changes to receive only incremental updates.
"""
resp = await self._request(
"/2/files/list_folder/get_latest_cursor",
{"path": path, "recursive": False, "include_non_downloadable_files": True},
)
if resp.status_code != 200:
return None, f"Failed to get cursor: {resp.status_code} - {resp.text}"
return resp.json().get("cursor"), None
async def get_changes(
self, cursor: str
) -> tuple[list[dict[str, Any]], str | None, str | None]:
"""Fetch incremental changes since the given cursor.
Calls /2/files/list_folder/continue and handles pagination.
Returns (entries, new_cursor, error).
"""
all_entries: list[dict[str, Any]] = []
resp = await self._request("/2/files/list_folder/continue", {"cursor": cursor})
if resp.status_code == 401:
return [], None, "Dropbox authentication expired (401)"
if resp.status_code != 200:
return [], None, f"Failed to get changes: {resp.status_code} - {resp.text}"
data = resp.json()
all_entries.extend(data.get("entries", []))
while data.get("has_more"):
cursor = data["cursor"]
resp = await self._request(
"/2/files/list_folder/continue", {"cursor": cursor}
)
if resp.status_code != 200:
return (
all_entries,
data.get("cursor"),
f"Pagination failed: {resp.status_code}",
)
data = resp.json()
all_entries.extend(data.get("entries", []))
return all_entries, data.get("cursor"), None
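The cursor loop in `get_changes` follows a common pagination shape: fetch a page, collect its entries, and continue from the returned cursor while `has_more` is set. The sketch below shows that shape against an in-memory fake. `collect_changes`, `fetch_page`, and `PAGES` are hypothetical and do not call the Dropbox API, and error handling is omitted.

```python
from typing import Any, Callable, Optional

Page = dict[str, Any]


def collect_changes(
    fetch_page: Callable[[str], Page],
    cursor: str,
) -> tuple[list[Any], Optional[str]]:
    """Follow cursors until has_more is falsy; return (entries, latest cursor)."""
    page = fetch_page(cursor)
    entries = list(page.get("entries", []))
    while page.get("has_more"):
        # Each response carries the cursor to use for the next request.
        page = fetch_page(page["cursor"])
        entries.extend(page.get("entries", []))
    return entries, page.get("cursor")


# In-memory fake: maps a request cursor to the page it would return.
PAGES: dict[str, Page] = {
    "c0": {"entries": ["a", "b"], "cursor": "c1", "has_more": True},
    "c1": {"entries": ["c"], "cursor": "c2", "has_more": False},
}
```

Calling `collect_changes(PAGES.__getitem__, "c0")` walks both pages. The cursor returned at the end is the one a caller would store for the next incremental sync.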
async def get_metadata(self, path: str) -> tuple[dict[str, Any] | None, str | None]:
resp = await self._request("/2/files/get_metadata", {"path": path})
if resp.status_code != 200:


@ -53,7 +53,8 @@ async def download_and_extract_content(
file_name = file.get("name", "Unknown")
file_id = file.get("id", "")
if should_skip_file(file):
skip, _unsup_ext = should_skip_file(file)
if skip:
return None, {}, "Skipping non-indexable item"
logger.info(f"Downloading file for content extraction: {file_name}")
@ -87,9 +88,13 @@ async def download_and_extract_content(
if error:
return None, metadata, error
from app.connectors.onedrive.content_extractor import _parse_file_to_markdown
from app.etl_pipeline.etl_document import EtlRequest
from app.etl_pipeline.etl_pipeline_service import EtlPipelineService
markdown = await _parse_file_to_markdown(temp_file_path, file_name)
result = await EtlPipelineService().extract(
EtlRequest(file_path=temp_file_path, filename=file_name)
)
markdown = result.markdown_content
return markdown, metadata, None
except Exception as e:


@ -1,8 +1,8 @@
"""File type handlers for Dropbox."""
PAPER_EXTENSION = ".paper"
from app.etl_pipeline.file_classifier import should_skip_for_service
SKIP_EXTENSIONS: frozenset[str] = frozenset()
PAPER_EXTENSION = ".paper"
MIME_TO_EXTENSION: dict[str, str] = {
"application/pdf": ".pdf",
@ -42,17 +42,25 @@ def is_paper_file(item: dict) -> bool:
return ext == PAPER_EXTENSION
def should_skip_file(item: dict) -> bool:
def should_skip_file(item: dict) -> tuple[bool, str | None]:
"""Skip folders and truly non-indexable files.
Paper docs are non-downloadable but exportable, so they are NOT skipped.
Returns (should_skip, unsupported_extension_or_None).
"""
if is_folder(item):
return True
return True, None
if is_paper_file(item):
return False
return False, None
if not item.get("is_downloadable", True):
return True
return True, None
from pathlib import PurePosixPath
from app.config import config as app_config
name = item.get("name", "")
ext = get_extension_from_name(name).lower()
return ext in SKIP_EXTENSIONS
if should_skip_for_service(name, app_config.ETL_SERVICE):
ext = PurePosixPath(name).suffix.lower()
return True, ext
return False, None


@ -64,8 +64,10 @@ async def get_files_in_folder(
)
continue
files.extend(sub_files)
elif not should_skip_file(item):
files.append(item)
else:
skip, _unsup_ext = should_skip_file(item)
if not skip:
files.append(item)
return files, None


@ -1,12 +1,9 @@
"""Content extraction for Google Drive files."""
import asyncio
import contextlib
import logging
import os
import tempfile
import threading
import time
from pathlib import Path
from typing import Any
@ -20,6 +17,7 @@ from .file_types import (
get_export_mime_type,
get_extension_from_mime,
is_google_workspace_file,
should_skip_by_extension,
should_skip_file,
)
@ -45,6 +43,11 @@ async def download_and_extract_content(
if should_skip_file(mime_type):
return None, {}, f"Skipping {mime_type}"
if not is_google_workspace_file(mime_type):
ext_skip, _unsup_ext = should_skip_by_extension(file_name)
if ext_skip:
return None, {}, f"Skipping unsupported extension: {file_name}"
logger.info(f"Downloading file for content extraction: {file_name} ({mime_type})")
drive_metadata: dict[str, Any] = {
@ -97,7 +100,10 @@ async def download_and_extract_content(
if error:
return None, drive_metadata, error
markdown = await _parse_file_to_markdown(temp_file_path, file_name)
etl_filename = (
file_name + extension if is_google_workspace_file(mime_type) else file_name
)
markdown = await _parse_file_to_markdown(temp_file_path, etl_filename)
return markdown, drive_metadata, None
except Exception as e:
@ -110,99 +116,14 @@ async def download_and_extract_content(
async def _parse_file_to_markdown(file_path: str, filename: str) -> str:
"""Parse a local file to markdown using the configured ETL service."""
lower = filename.lower()
"""Parse a local file to markdown using the unified ETL pipeline."""
from app.etl_pipeline.etl_document import EtlRequest
from app.etl_pipeline.etl_pipeline_service import EtlPipelineService
if lower.endswith((".md", ".markdown", ".txt")):
with open(file_path, encoding="utf-8") as f:
return f.read()
if lower.endswith((".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm")):
from litellm import atranscription
from app.config import config as app_config
stt_service_type = (
"local"
if app_config.STT_SERVICE and app_config.STT_SERVICE.startswith("local/")
else "external"
)
if stt_service_type == "local":
from app.services.stt_service import stt_service
t0 = time.monotonic()
logger.info(
f"[local-stt] START file={filename} thread={threading.current_thread().name}"
)
result = await asyncio.to_thread(stt_service.transcribe_file, file_path)
logger.info(
f"[local-stt] END file={filename} elapsed={time.monotonic() - t0:.2f}s"
)
text = result.get("text", "")
else:
with open(file_path, "rb") as audio_file:
kwargs: dict[str, Any] = {
"model": app_config.STT_SERVICE,
"file": audio_file,
"api_key": app_config.STT_SERVICE_API_KEY,
}
if app_config.STT_SERVICE_API_BASE:
kwargs["api_base"] = app_config.STT_SERVICE_API_BASE
resp = await atranscription(**kwargs)
text = resp.get("text", "")
if not text:
raise ValueError("Transcription returned empty text")
return f"# Transcription of {filename}\n\n{text}"
# Document files -- use configured ETL service
from app.config import config as app_config
if app_config.ETL_SERVICE == "UNSTRUCTURED":
from langchain_unstructured import UnstructuredLoader
from app.utils.document_converters import convert_document_to_markdown
loader = UnstructuredLoader(
file_path,
mode="elements",
post_processors=[],
languages=["eng"],
include_orig_elements=False,
include_metadata=False,
strategy="auto",
)
docs = await loader.aload()
return await convert_document_to_markdown(docs)
if app_config.ETL_SERVICE == "LLAMACLOUD":
from app.tasks.document_processors.file_processors import (
parse_with_llamacloud_retry,
)
result = await parse_with_llamacloud_retry(
file_path=file_path, estimated_pages=50
)
markdown_documents = await result.aget_markdown_documents(split_by_page=False)
if not markdown_documents:
raise RuntimeError(f"LlamaCloud returned no documents for {filename}")
return markdown_documents[0].text
if app_config.ETL_SERVICE == "DOCLING":
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
t0 = time.monotonic()
logger.info(
f"[docling] START file={filename} thread={threading.current_thread().name}"
)
result = await asyncio.to_thread(converter.convert, file_path)
logger.info(
f"[docling] END file={filename} elapsed={time.monotonic() - t0:.2f}s"
)
return result.document.export_to_markdown()
raise RuntimeError(f"Unknown ETL_SERVICE: {app_config.ETL_SERVICE}")
result = await EtlPipelineService().extract(
EtlRequest(file_path=file_path, filename=filename)
)
return result.markdown_content
async def download_and_process_file(
@ -236,10 +157,14 @@ async def download_and_process_file(
file_name = file.get("name", "Unknown")
mime_type = file.get("mimeType", "")
# Skip folders and shortcuts
if should_skip_file(mime_type):
return None, f"Skipping {mime_type}", None
if not is_google_workspace_file(mime_type):
ext_skip, _unsup_ext = should_skip_by_extension(file_name)
if ext_skip:
return None, f"Skipping unsupported extension: {file_name}", None
logger.info(f"Downloading file: {file_name} ({mime_type})")
temp_file_path = None
@ -310,10 +235,13 @@ async def download_and_process_file(
"."
)[-1]
etl_filename = (
file_name + extension if is_google_workspace_file(mime_type) else file_name
)
logger.info(f"Processing {file_name} with Surfsense's file processor")
await process_file_in_background(
file_path=temp_file_path,
filename=file_name,
filename=etl_filename,
search_space_id=search_space_id,
user_id=user_id,
session=session,


@ -1,5 +1,7 @@
"""File type handlers for Google Drive."""
from app.etl_pipeline.file_classifier import should_skip_for_service
GOOGLE_DOC = "application/vnd.google-apps.document"
GOOGLE_SHEET = "application/vnd.google-apps.spreadsheet"
GOOGLE_SLIDE = "application/vnd.google-apps.presentation"
@ -46,6 +48,21 @@ def should_skip_file(mime_type: str) -> bool:
return mime_type in [GOOGLE_FOLDER, GOOGLE_SHORTCUT]
def should_skip_by_extension(filename: str) -> tuple[bool, str | None]:
"""Check if the file extension is not parseable by the configured ETL service.
Returns (should_skip, unsupported_extension_or_None).
"""
from pathlib import PurePosixPath
from app.config import config as app_config
if should_skip_for_service(filename, app_config.ETL_SERVICE):
ext = PurePosixPath(filename).suffix.lower()
return True, ext
return False, None
def get_export_mime_type(mime_type: str) -> str | None:
"""Get export MIME type for Google Workspace files."""
return EXPORT_FORMATS.get(mime_type)


@ -1,16 +1,9 @@
"""Content extraction for OneDrive files.
"""Content extraction for OneDrive files."""
Reuses the same ETL parsing logic as Google Drive since file parsing is
extension-based, not provider-specific.
"""
import asyncio
import contextlib
import logging
import os
import tempfile
import threading
import time
from pathlib import Path
from typing import Any
@ -31,7 +24,8 @@ async def download_and_extract_content(
item_id = file.get("id")
file_name = file.get("name", "Unknown")
if should_skip_file(file):
skip, _unsup_ext = should_skip_file(file)
if skip:
return None, {}, "Skipping non-indexable item"
file_info = file.get("file", {})
@ -84,98 +78,11 @@ async def download_and_extract_content(
async def _parse_file_to_markdown(file_path: str, filename: str) -> str:
"""Parse a local file to markdown using the configured ETL service.
"""Parse a local file to markdown using the unified ETL pipeline."""
from app.etl_pipeline.etl_document import EtlRequest
from app.etl_pipeline.etl_pipeline_service import EtlPipelineService
Same logic as Google Drive -- file parsing is extension-based.
"""
lower = filename.lower()
if lower.endswith((".md", ".markdown", ".txt")):
with open(file_path, encoding="utf-8") as f:
return f.read()
if lower.endswith((".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm")):
from litellm import atranscription
from app.config import config as app_config
stt_service_type = (
"local"
if app_config.STT_SERVICE and app_config.STT_SERVICE.startswith("local/")
else "external"
)
if stt_service_type == "local":
from app.services.stt_service import stt_service
t0 = time.monotonic()
logger.info(
f"[local-stt] START file={filename} thread={threading.current_thread().name}"
)
result = await asyncio.to_thread(stt_service.transcribe_file, file_path)
logger.info(
f"[local-stt] END file={filename} elapsed={time.monotonic() - t0:.2f}s"
)
text = result.get("text", "")
else:
with open(file_path, "rb") as audio_file:
kwargs: dict[str, Any] = {
"model": app_config.STT_SERVICE,
"file": audio_file,
"api_key": app_config.STT_SERVICE_API_KEY,
}
if app_config.STT_SERVICE_API_BASE:
kwargs["api_base"] = app_config.STT_SERVICE_API_BASE
resp = await atranscription(**kwargs)
text = resp.get("text", "")
if not text:
raise ValueError("Transcription returned empty text")
return f"# Transcription of {filename}\n\n{text}"
from app.config import config as app_config
if app_config.ETL_SERVICE == "UNSTRUCTURED":
from langchain_unstructured import UnstructuredLoader
from app.utils.document_converters import convert_document_to_markdown
loader = UnstructuredLoader(
file_path,
mode="elements",
post_processors=[],
languages=["eng"],
include_orig_elements=False,
include_metadata=False,
strategy="auto",
)
docs = await loader.aload()
return await convert_document_to_markdown(docs)
if app_config.ETL_SERVICE == "LLAMACLOUD":
from app.tasks.document_processors.file_processors import (
parse_with_llamacloud_retry,
)
result = await parse_with_llamacloud_retry(
file_path=file_path, estimated_pages=50
)
markdown_documents = await result.aget_markdown_documents(split_by_page=False)
if not markdown_documents:
raise RuntimeError(f"LlamaCloud returned no documents for {filename}")
return markdown_documents[0].text
if app_config.ETL_SERVICE == "DOCLING":
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
t0 = time.monotonic()
logger.info(
f"[docling] START file={filename} thread={threading.current_thread().name}"
)
result = await asyncio.to_thread(converter.convert, file_path)
logger.info(
f"[docling] END file={filename} elapsed={time.monotonic() - t0:.2f}s"
)
return result.document.export_to_markdown()
raise RuntimeError(f"Unknown ETL_SERVICE: {app_config.ETL_SERVICE}")
result = await EtlPipelineService().extract(
EtlRequest(file_path=file_path, filename=filename)
)
return result.markdown_content


@ -1,5 +1,7 @@
"""File type handlers for Microsoft OneDrive."""
from app.etl_pipeline.file_classifier import should_skip_for_service
ONEDRIVE_FOLDER_FACET = "folder"
ONENOTE_MIME = "application/msonenote"
@ -38,13 +40,28 @@ def is_folder(item: dict) -> bool:
return ONEDRIVE_FOLDER_FACET in item
def should_skip_file(item: dict) -> bool:
"""Skip folders, OneNote files, remote items (shared links), and packages."""
def should_skip_file(item: dict) -> tuple[bool, str | None]:
"""Skip folders, OneNote files, remote items, packages, and unsupported extensions.
Returns (should_skip, unsupported_extension_or_None).
The second element is only set when the skip is due to an unsupported extension.
"""
if is_folder(item):
return True
return True, None
if "remoteItem" in item:
return True
return True, None
if "package" in item:
return True
return True, None
mime = item.get("file", {}).get("mimeType", "")
return mime in SKIP_MIME_TYPES
if mime in SKIP_MIME_TYPES:
return True, None
from pathlib import PurePosixPath
from app.config import config as app_config
name = item.get("name", "")
if should_skip_for_service(name, app_config.ETL_SERVICE):
ext = PurePosixPath(name).suffix.lower()
return True, ext
return False, None
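Review note: `should_skip_file` now returns a tuple, so any caller that still tests the return value directly would treat every item as skipped, because a non-empty tuple is always truthy. The sketch below illustrates the pitfall with a simplified stand-in; `should_skip_item` and its `supported` parameter are hypothetical and are not the project's function.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def should_skip_item(item: dict, supported: frozenset[str]) -> tuple[bool, str | None]:
    """Hypothetical stand-in that mirrors the (skip, unsupported_ext) return shape."""
    if "folder" in item:
        return True, None
    ext = PurePosixPath(item.get("name", "")).suffix.lower()
    if ext not in supported:
        return True, ext
    return False, None


supported = frozenset({".pdf", ".txt"})
item = {"name": "notes.txt"}

# Wrong: a two-element tuple is truthy even when it is (False, None).
legacy_check = bool(should_skip_item(item, supported))

# Right: unpack first, then branch on the boolean.
skip, unsupported_ext = should_skip_item(item, supported)
```

This is presumably why the call sites in this commit were changed to unpack `skip, _unsup_ext` before branching.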


@ -71,8 +71,10 @@ async def get_files_in_folder(
)
continue
files.extend(sub_files)
elif not should_skip_file(item):
files.append(item)
else:
skip, _unsup_ext = should_skip_file(item)
if not skip:
files.append(item)
return files, None


@ -64,6 +64,7 @@ class DocumentType(StrEnum):
COMPOSIO_GOOGLE_DRIVE_CONNECTOR = "COMPOSIO_GOOGLE_DRIVE_CONNECTOR"
COMPOSIO_GMAIL_CONNECTOR = "COMPOSIO_GMAIL_CONNECTOR"
COMPOSIO_GOOGLE_CALENDAR_CONNECTOR = "COMPOSIO_GOOGLE_CALENDAR_CONNECTOR"
LOCAL_FOLDER_FILE = "LOCAL_FOLDER_FILE"
# Native Google document types → their legacy Composio equivalents.
@ -259,6 +260,24 @@ class ImageGenProvider(StrEnum):
NSCALE = "NSCALE"
class VisionProvider(StrEnum):
OPENAI = "OPENAI"
ANTHROPIC = "ANTHROPIC"
GOOGLE = "GOOGLE"
AZURE_OPENAI = "AZURE_OPENAI"
VERTEX_AI = "VERTEX_AI"
BEDROCK = "BEDROCK"
XAI = "XAI"
OPENROUTER = "OPENROUTER"
OLLAMA = "OLLAMA"
GROQ = "GROQ"
TOGETHER_AI = "TOGETHER_AI"
FIREWORKS_AI = "FIREWORKS_AI"
DEEPSEEK = "DEEPSEEK"
MISTRAL = "MISTRAL"
CUSTOM = "CUSTOM"
class LogLevel(StrEnum):
DEBUG = "DEBUG"
INFO = "INFO"
@ -376,6 +395,11 @@ class Permission(StrEnum):
IMAGE_GENERATIONS_READ = "image_generations:read"
IMAGE_GENERATIONS_DELETE = "image_generations:delete"
# Vision LLM Configs
VISION_CONFIGS_CREATE = "vision_configs:create"
VISION_CONFIGS_READ = "vision_configs:read"
VISION_CONFIGS_DELETE = "vision_configs:delete"
# Connectors
CONNECTORS_CREATE = "connectors:create"
CONNECTORS_READ = "connectors:read"
@ -444,6 +468,9 @@ DEFAULT_ROLE_PERMISSIONS = {
# Image Generations (create and read, no delete)
Permission.IMAGE_GENERATIONS_CREATE.value,
Permission.IMAGE_GENERATIONS_READ.value,
# Vision Configs (create and read, no delete)
Permission.VISION_CONFIGS_CREATE.value,
Permission.VISION_CONFIGS_READ.value,
# Connectors (no delete)
Permission.CONNECTORS_CREATE.value,
Permission.CONNECTORS_READ.value,
@ -477,6 +504,8 @@ DEFAULT_ROLE_PERMISSIONS = {
Permission.VIDEO_PRESENTATIONS_READ.value,
# Image Generations (read only)
Permission.IMAGE_GENERATIONS_READ.value,
# Vision Configs (read only)
Permission.VISION_CONFIGS_READ.value,
# Connectors (read only)
Permission.CONNECTORS_READ.value,
# Logs (read only)
@ -955,6 +984,7 @@ class Folder(BaseModel, TimestampMixin):
onupdate=lambda: datetime.now(UTC),
index=True,
)
folder_metadata = Column("metadata", JSONB, nullable=True)
parent = relationship("Folder", remote_side="Folder.id", backref="children")
search_space = relationship("SearchSpace", back_populates="folders")
@ -1039,6 +1069,26 @@ class Document(BaseModel, TimestampMixin):
)
class DocumentVersion(BaseModel, TimestampMixin):
__tablename__ = "document_versions"
__table_args__ = (
UniqueConstraint("document_id", "version_number", name="uq_document_version"),
)
document_id = Column(
Integer,
ForeignKey("documents.id", ondelete="CASCADE"),
nullable=False,
index=True,
)
version_number = Column(Integer, nullable=False)
source_markdown = Column(Text, nullable=True)
content_hash = Column(String, nullable=False)
title = Column(String, nullable=True)
document = relationship("Document", backref="versions")
class Chunk(BaseModel, TimestampMixin):
__tablename__ = "chunks"
@ -1241,6 +1291,33 @@ class ImageGenerationConfig(BaseModel, TimestampMixin):
user = relationship("User", back_populates="image_generation_configs")
class VisionLLMConfig(BaseModel, TimestampMixin):
__tablename__ = "vision_llm_configs"
name = Column(String(100), nullable=False, index=True)
description = Column(String(500), nullable=True)
provider = Column(SQLAlchemyEnum(VisionProvider), nullable=False)
custom_provider = Column(String(100), nullable=True)
model_name = Column(String(100), nullable=False)
api_key = Column(String, nullable=False)
api_base = Column(String(500), nullable=True)
api_version = Column(String(50), nullable=True)
litellm_params = Column(JSON, nullable=True, default=dict)
search_space_id = Column(
Integer, ForeignKey("searchspaces.id", ondelete="CASCADE"), nullable=False
)
search_space = relationship("SearchSpace", back_populates="vision_llm_configs")
user_id = Column(
UUID(as_uuid=True), ForeignKey("user.id", ondelete="CASCADE"), nullable=False
)
user = relationship("User", back_populates="vision_llm_configs")
class ImageGeneration(BaseModel, TimestampMixin):
"""
Stores image generation requests and results using litellm.aimage_generation().
@ -1329,6 +1406,9 @@ class SearchSpace(BaseModel, TimestampMixin):
image_generation_config_id = Column(
Integer, nullable=True, default=0
) # For image generation, defaults to Auto mode
vision_llm_config_id = Column(
Integer, nullable=True, default=0
) # For vision/screenshot analysis, defaults to Auto mode
user_id = Column(
UUID(as_uuid=True), ForeignKey("user.id", ondelete="CASCADE"), nullable=False
@ -1407,6 +1487,12 @@ class SearchSpace(BaseModel, TimestampMixin):
order_by="ImageGenerationConfig.id",
cascade="all, delete-orphan",
)
vision_llm_configs = relationship(
"VisionLLMConfig",
back_populates="search_space",
order_by="VisionLLMConfig.id",
cascade="all, delete-orphan",
)
# RBAC relationships
roles = relationship(
@ -1936,6 +2022,12 @@ if config.AUTH_TYPE == "GOOGLE":
passive_deletes=True,
)
vision_llm_configs = relationship(
"VisionLLMConfig",
back_populates="user",
passive_deletes=True,
)
# User memories for personalized AI responses
memories = relationship(
"UserMemory",
@ -2050,6 +2142,12 @@ else:
passive_deletes=True,
)
vision_llm_configs = relationship(
"VisionLLMConfig",
back_populates="user",
passive_deletes=True,
)
# User memories for personalized AI responses
memories = relationship(
"UserMemory",


@ -0,0 +1,39 @@
import ssl
import httpx
LLAMACLOUD_MAX_RETRIES = 5
LLAMACLOUD_BASE_DELAY = 10
LLAMACLOUD_MAX_DELAY = 120
LLAMACLOUD_RETRYABLE_EXCEPTIONS = (
ssl.SSLError,
httpx.ConnectError,
httpx.ConnectTimeout,
httpx.ReadError,
httpx.ReadTimeout,
httpx.WriteError,
httpx.WriteTimeout,
httpx.RemoteProtocolError,
httpx.LocalProtocolError,
ConnectionError,
ConnectionResetError,
TimeoutError,
OSError,
)
UPLOAD_BYTES_PER_SECOND_SLOW = 100 * 1024
MIN_UPLOAD_TIMEOUT = 120
MAX_UPLOAD_TIMEOUT = 1800
BASE_JOB_TIMEOUT = 600
PER_PAGE_JOB_TIMEOUT = 60
def calculate_upload_timeout(file_size_bytes: int) -> float:
estimated_time = (file_size_bytes / UPLOAD_BYTES_PER_SECOND_SLOW) * 1.5
return max(MIN_UPLOAD_TIMEOUT, min(estimated_time, MAX_UPLOAD_TIMEOUT))
def calculate_job_timeout(estimated_pages: int, file_size_bytes: int) -> float:
page_based_timeout = BASE_JOB_TIMEOUT + (estimated_pages * PER_PAGE_JOB_TIMEOUT)
size_based_timeout = BASE_JOB_TIMEOUT + (file_size_bytes / (10 * 1024 * 1024)) * 60
return max(page_based_timeout, size_based_timeout)
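Review note: a worked example of the two timeout helpers may make the clamping easier to check. The constants and formulas are copied from the hunk above so the snippet runs on its own; the sample file sizes and page count are made up for illustration.

```python
UPLOAD_BYTES_PER_SECOND_SLOW = 100 * 1024
MIN_UPLOAD_TIMEOUT = 120
MAX_UPLOAD_TIMEOUT = 1800
BASE_JOB_TIMEOUT = 600
PER_PAGE_JOB_TIMEOUT = 60

MB = 1024 * 1024


def calculate_upload_timeout(file_size_bytes: int) -> float:
    # Assume a slow 100 KiB/s link, add a 50% margin, then clamp to [120, 1800].
    estimated_time = (file_size_bytes / UPLOAD_BYTES_PER_SECOND_SLOW) * 1.5
    return max(MIN_UPLOAD_TIMEOUT, min(estimated_time, MAX_UPLOAD_TIMEOUT))


def calculate_job_timeout(estimated_pages: int, file_size_bytes: int) -> float:
    # Take the larger of the page-based and the size-based estimate.
    page_based = BASE_JOB_TIMEOUT + estimated_pages * PER_PAGE_JOB_TIMEOUT
    size_based = BASE_JOB_TIMEOUT + (file_size_bytes / (10 * MB)) * 60
    return max(page_based, size_based)


small_upload = calculate_upload_timeout(1 * MB)     # below the floor
medium_upload = calculate_upload_timeout(50 * MB)   # 512 s * 1.5
large_upload = calculate_upload_timeout(200 * MB)   # above the ceiling
job = calculate_job_timeout(100, 50 * MB)           # page estimate dominates
```

With these inputs the upload timeout is clamped to 120 s for the 1 MB file, lands at 768 s for 50 MB, and is capped at 1800 s for 200 MB. Note that `estimated_pages` defaults to 0 in `EtlRequest`, in which case only the size-based term can lift the job timeout above the 600 s base.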


@ -0,0 +1,21 @@
from pydantic import BaseModel, field_validator
class EtlRequest(BaseModel):
file_path: str
filename: str
estimated_pages: int = 0
@field_validator("filename")
@classmethod
def filename_must_not_be_empty(cls, v: str) -> str:
if not v.strip():
raise ValueError("filename must not be empty")
return v
class EtlResult(BaseModel):
markdown_content: str
etl_service: str
actual_pages: int = 0
content_type: str


@ -0,0 +1,125 @@
import logging
from app.config import config as app_config
from app.etl_pipeline.etl_document import EtlRequest, EtlResult
from app.etl_pipeline.exceptions import (
EtlServiceUnavailableError,
EtlUnsupportedFileError,
)
from app.etl_pipeline.file_classifier import FileCategory, classify_file
from app.etl_pipeline.parsers.audio import transcribe_audio
from app.etl_pipeline.parsers.direct_convert import convert_file_directly
from app.etl_pipeline.parsers.plaintext import read_plaintext
class EtlPipelineService:
"""Single pipeline for extracting markdown from files. All callers use this."""
async def extract(self, request: EtlRequest) -> EtlResult:
category = classify_file(request.filename)
if category == FileCategory.UNSUPPORTED:
raise EtlUnsupportedFileError(
f"File type not supported for parsing: {request.filename}"
)
if category == FileCategory.PLAINTEXT:
content = read_plaintext(request.file_path)
return EtlResult(
markdown_content=content,
etl_service="PLAINTEXT",
content_type="plaintext",
)
if category == FileCategory.DIRECT_CONVERT:
content = convert_file_directly(request.file_path, request.filename)
return EtlResult(
markdown_content=content,
etl_service="DIRECT_CONVERT",
content_type="direct_convert",
)
if category == FileCategory.AUDIO:
content = await transcribe_audio(request.file_path, request.filename)
return EtlResult(
markdown_content=content,
etl_service="AUDIO",
content_type="audio",
)
return await self._extract_document(request)
async def _extract_document(self, request: EtlRequest) -> EtlResult:
from pathlib import PurePosixPath
from app.utils.file_extensions import get_document_extensions_for_service
etl_service = app_config.ETL_SERVICE
if not etl_service:
raise EtlServiceUnavailableError(
"No ETL_SERVICE configured. "
"Set ETL_SERVICE to UNSTRUCTURED, LLAMACLOUD, or DOCLING in your .env"
)
ext = PurePosixPath(request.filename).suffix.lower()
supported = get_document_extensions_for_service(etl_service)
if ext not in supported:
raise EtlUnsupportedFileError(
f"File type {ext} is not supported by {etl_service}"
)
if etl_service == "DOCLING":
from app.etl_pipeline.parsers.docling import parse_with_docling
content = await parse_with_docling(request.file_path, request.filename)
elif etl_service == "UNSTRUCTURED":
from app.etl_pipeline.parsers.unstructured import parse_with_unstructured
content = await parse_with_unstructured(request.file_path)
elif etl_service == "LLAMACLOUD":
content = await self._extract_with_llamacloud(request)
else:
raise EtlServiceUnavailableError(f"Unknown ETL_SERVICE: {etl_service}")
return EtlResult(
markdown_content=content,
etl_service=etl_service,
content_type="document",
)
async def _extract_with_llamacloud(self, request: EtlRequest) -> str:
"""Try Azure Document Intelligence first (when configured) then LlamaCloud.
Azure DI is an internal accelerator: cheaper and faster for its supported
file types. If it is not configured, or the file extension is not in
Azure DI's supported set, LlamaCloud is used directly. If Azure DI
fails for any reason, LlamaCloud is used as a fallback.
"""
from pathlib import PurePosixPath
from app.utils.file_extensions import AZURE_DI_DOCUMENT_EXTENSIONS
ext = PurePosixPath(request.filename).suffix.lower()
azure_configured = bool(
getattr(app_config, "AZURE_DI_ENDPOINT", None)
and getattr(app_config, "AZURE_DI_KEY", None)
)
if azure_configured and ext in AZURE_DI_DOCUMENT_EXTENSIONS:
try:
from app.etl_pipeline.parsers.azure_doc_intelligence import (
parse_with_azure_doc_intelligence,
)
return await parse_with_azure_doc_intelligence(request.file_path)
except Exception:
logging.warning(
"Azure Document Intelligence failed for %s, "
"falling back to LlamaCloud",
request.filename,
exc_info=True,
)
from app.etl_pipeline.parsers.llamacloud import parse_with_llamacloud
return await parse_with_llamacloud(request.file_path, request.estimated_pages)
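Review note: the "try the accelerator, fall back on any failure" flow in `_extract_with_llamacloud` can be sketched in isolation as follows. Both parser functions here are hypothetical stand-ins (they are not the Azure DI or LlamaCloud parsers), and the `extract` signature is invented for the example.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def primary_parser(path: str) -> str:
    """Hypothetical accelerator that always fails, to exercise the fallback."""
    raise RuntimeError("primary parser unavailable")


async def fallback_parser(path: str) -> str:
    """Hypothetical fallback parser."""
    return f"# parsed {path}"


async def extract(path: str, ext: str, primary_enabled: bool, primary_exts: frozenset) -> str:
    # Use the primary parser only when it is configured and supports the extension.
    if primary_enabled and ext in primary_exts:
        try:
            return await primary_parser(path)
        except Exception:
            # Any failure is logged with its traceback, then control falls through.
            logger.warning("primary parser failed for %s, falling back", path, exc_info=True)
    return await fallback_parser(path)


result = asyncio.run(extract("report.pdf", ".pdf", True, frozenset({".pdf"})))
```

One consequence of the broad `except Exception` worth confirming: the Azure parser re-raises `ClientAuthenticationError` and 4xx responses without retrying, but the caller then catches them too, so a misconfigured key would appear only as a warning in the logs while LlamaCloud handles the file.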


@ -0,0 +1,10 @@
class EtlParseError(Exception):
"""Raised when an ETL parser fails to produce content."""
class EtlServiceUnavailableError(Exception):
"""Raised when ETL_SERVICE is not configured or is not a recognised value."""
class EtlUnsupportedFileError(Exception):
"""Raised when a file type is unsupported by the pipeline or by the configured ETL service."""


@ -0,0 +1,137 @@
from enum import Enum
from pathlib import PurePosixPath
from app.utils.file_extensions import (
DOCUMENT_EXTENSIONS,
get_document_extensions_for_service,
)
PLAINTEXT_EXTENSIONS = frozenset(
{
".md",
".markdown",
".txt",
".text",
".json",
".jsonl",
".yaml",
".yml",
".toml",
".ini",
".cfg",
".conf",
".xml",
".css",
".scss",
".less",
".sass",
".py",
".pyw",
".pyi",
".pyx",
".js",
".jsx",
".ts",
".tsx",
".mjs",
".cjs",
".java",
".kt",
".kts",
".scala",
".groovy",
".c",
".h",
".cpp",
".cxx",
".cc",
".hpp",
".hxx",
".cs",
".fs",
".fsx",
".go",
".rs",
".rb",
".php",
".pl",
".pm",
".lua",
".swift",
".m",
".mm",
".r",
".jl",
".sh",
".bash",
".zsh",
".fish",
".bat",
".cmd",
".ps1",
".sql",
".graphql",
".gql",
".env",
".gitignore",
".dockerignore",
".editorconfig",
".makefile",
".cmake",
".log",
".rst",
".tex",
".bib",
".org",
".adoc",
".asciidoc",
".vue",
".svelte",
".astro",
".tf",
".hcl",
".proto",
}
)
AUDIO_EXTENSIONS = frozenset(
{".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
)
DIRECT_CONVERT_EXTENSIONS = frozenset({".csv", ".tsv", ".html", ".htm", ".xhtml"})
class FileCategory(Enum):
PLAINTEXT = "plaintext"
AUDIO = "audio"
DIRECT_CONVERT = "direct_convert"
UNSUPPORTED = "unsupported"
DOCUMENT = "document"
def classify_file(filename: str) -> FileCategory:
suffix = PurePosixPath(filename).suffix.lower()
if suffix in PLAINTEXT_EXTENSIONS:
return FileCategory.PLAINTEXT
if suffix in AUDIO_EXTENSIONS:
return FileCategory.AUDIO
if suffix in DIRECT_CONVERT_EXTENSIONS:
return FileCategory.DIRECT_CONVERT
if suffix in DOCUMENT_EXTENSIONS:
return FileCategory.DOCUMENT
return FileCategory.UNSUPPORTED
def should_skip_for_service(filename: str, etl_service: str | None) -> bool:
"""Return True if *filename* cannot be processed by *etl_service*.
Plaintext, audio, and direct-convert files are parser-agnostic and never
skipped. Document files are checked against the per-parser extension set.
"""
category = classify_file(filename)
if category == FileCategory.UNSUPPORTED:
return True
if category == FileCategory.DOCUMENT:
suffix = PurePosixPath(filename).suffix.lower()
return suffix not in get_document_extensions_for_service(etl_service)
return False
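Review note: `classify_file` keys on `PurePosixPath(filename).suffix`, and `pathlib` does not treat a leading dot as a suffix. The sketch below checks what that means for a few entries in `PLAINTEXT_EXTENSIONS`; the set is a small illustrative subset and `suffix_of` / `is_plaintext` are hypothetical helper names.

```python
from pathlib import PurePosixPath

# Small illustrative subset of the extension set in the diff.
PLAINTEXT = frozenset({".md", ".env", ".gitignore", ".makefile"})


def suffix_of(filename: str) -> str:
    """Lower-cased final suffix, computed the same way as in classify_file."""
    return PurePosixPath(filename).suffix.lower()


def is_plaintext(filename: str) -> bool:
    return suffix_of(filename) in PLAINTEXT


readme_suffix = suffix_of("README.MD")       # case is normalised
dotfile_suffix = suffix_of(".gitignore")     # a leading dot is not a suffix
makefile_suffix = suffix_of("Makefile")      # no dot at all
tarball_suffix = suffix_of("archive.tar.gz") # only the last suffix is used
```

Under this behaviour a bare `.gitignore`, `.env`, or `Makefile` has an empty suffix and would be classified as `UNSUPPORTED`, while names such as `prod.env` match. Whether that is intended is not clear from the diff; matching on the full file name for dotfiles would be one way to cover them.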


@ -0,0 +1,34 @@
from litellm import atranscription
from app.config import config as app_config
async def transcribe_audio(file_path: str, filename: str) -> str:
stt_service_type = (
"local"
if app_config.STT_SERVICE and app_config.STT_SERVICE.startswith("local/")
else "external"
)
if stt_service_type == "local":
from app.services.stt_service import stt_service
result = stt_service.transcribe_file(file_path)
text = result.get("text", "")
if not text:
raise ValueError("Transcription returned empty text")
else:
with open(file_path, "rb") as audio_file:
kwargs: dict = {
"model": app_config.STT_SERVICE,
"file": audio_file,
"api_key": app_config.STT_SERVICE_API_KEY,
}
if app_config.STT_SERVICE_API_BASE:
kwargs["api_base"] = app_config.STT_SERVICE_API_BASE
response = await atranscription(**kwargs)
text = response.get("text", "")
if not text:
raise ValueError("Transcription returned empty text")
return f"# Transcription of {filename}\n\n{text}"
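Review note: the removed connector code ran the local STT call through `asyncio.to_thread`, while the new `transcribe_audio` calls `stt_service.transcribe_file` directly inside the coroutine. If that call is blocking, the event loop would stall for the duration of the transcription. A minimal sketch of the off-loading pattern, assuming a synchronous transcriber (`blocking_transcribe` is a hypothetical stand-in, not the project's service):

```python
import asyncio
import time


def blocking_transcribe(path: str) -> dict:
    """Hypothetical stand-in for a synchronous local STT call."""
    time.sleep(0.05)
    return {"text": f"transcript of {path}"}


async def transcribe(path: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    result = await asyncio.to_thread(blocking_transcribe, path)
    text = result.get("text", "")
    if not text:
        raise ValueError("Transcription returned empty text")
    return text


async def main() -> tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the loop gets control while transcription runs.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    text = await transcribe("clip.wav")
    beat.cancel()
    return text, ticks


text, ticks = asyncio.run(main())
```

The heartbeat task keeps ticking while the worker thread sleeps, which is the property the earlier `asyncio.to_thread` call provided.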


@ -0,0 +1,93 @@
import asyncio
import logging
import os
import random
from app.config import config as app_config
MAX_RETRIES = 5
BASE_DELAY = 10
MAX_DELAY = 120
async def parse_with_azure_doc_intelligence(file_path: str) -> str:
from azure.ai.documentintelligence.aio import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import DocumentContentFormat
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import (
ClientAuthenticationError,
HttpResponseError,
ServiceRequestError,
ServiceResponseError,
)
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
retryable_exceptions = (ServiceRequestError, ServiceResponseError)
last_exception = None
attempt_errors: list[str] = []
for attempt in range(1, MAX_RETRIES + 1):
try:
client = DocumentIntelligenceClient(
endpoint=app_config.AZURE_DI_ENDPOINT,
credential=AzureKeyCredential(app_config.AZURE_DI_KEY),
)
async with client:
with open(file_path, "rb") as f:
poller = await client.begin_analyze_document(
"prebuilt-read",
body=f,
output_content_format=DocumentContentFormat.MARKDOWN,
)
result = await poller.result()
if attempt > 1:
logging.info(
f"Azure Document Intelligence succeeded on attempt {attempt} "
f"after {len(attempt_errors)} failures"
)
if not result.content:
return ""
return result.content
except ClientAuthenticationError:
raise
except HttpResponseError as e:
if e.status_code and 400 <= e.status_code < 500:
raise
last_exception = e
error_type = type(e).__name__
error_msg = str(e)[:200]
attempt_errors.append(f"Attempt {attempt}: {error_type} - {error_msg}")
except retryable_exceptions as e:
last_exception = e
error_type = type(e).__name__
error_msg = str(e)[:200]
attempt_errors.append(f"Attempt {attempt}: {error_type} - {error_msg}")
if attempt < MAX_RETRIES:
base_delay = min(BASE_DELAY * (2 ** (attempt - 1)), MAX_DELAY)
jitter = base_delay * 0.25 * (2 * random.random() - 1)
delay = base_delay + jitter
logging.warning(
f"Azure Document Intelligence failed "
f"(attempt {attempt}/{MAX_RETRIES}): "
f"{attempt_errors[-1]}. File: {file_size_mb:.1f}MB. "
f"Retrying in {delay:.0f}s..."
)
await asyncio.sleep(delay)
else:
logging.error(
f"Azure Document Intelligence failed after {MAX_RETRIES} "
f"attempts. File size: {file_size_mb:.1f}MB. "
f"Errors: {'; '.join(attempt_errors)}"
)
raise last_exception or RuntimeError(
f"Azure Document Intelligence parsing failed after {MAX_RETRIES} retries. "
f"File size: {file_size_mb:.1f}MB"
)
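Review note: the retry schedule used here (and again in the LlamaCloud parser) is exponential backoff with ±25% jitter. The sketch below reproduces just the delay arithmetic with the constants from this file; `backoff_delay` is a hypothetical helper name and the seeded `random.Random` is only there to make the example repeatable.

```python
import random

MAX_RETRIES = 5
BASE_DELAY = 10
MAX_DELAY = 120


def backoff_delay(attempt: int, rng: random.Random) -> float:
    """Delay in seconds before the retry that follows a failed 1-based attempt."""
    base = min(BASE_DELAY * (2 ** (attempt - 1)), MAX_DELAY)
    jitter = base * 0.25 * (2 * rng.random() - 1)  # uniform in [-25%, +25%)
    return base + jitter


rng = random.Random(0)
# No sleep follows the final attempt, so only attempts 1..MAX_RETRIES-1 wait.
attempts = range(1, MAX_RETRIES)
nominal = [min(BASE_DELAY * (2 ** (a - 1)), MAX_DELAY) for a in attempts]
delays = [backoff_delay(a, rng) for a in attempts]
```

The nominal waits are 10, 20, 40 and 80 seconds, about 150 seconds in total before jitter, so with five attempts the 120-second cap is never reached.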


@ -0,0 +1,3 @@
from app.tasks.document_processors._direct_converters import convert_file_directly
__all__ = ["convert_file_directly"]


@ -0,0 +1,26 @@
import warnings
from logging import ERROR, getLogger
async def parse_with_docling(file_path: str, filename: str) -> str:
from app.services.docling_service import create_docling_service
docling_service = create_docling_service()
pdfminer_logger = getLogger("pdfminer")
original_level = pdfminer_logger.level
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=UserWarning, module="pdfminer")
warnings.filterwarnings(
"ignore", message=".*Cannot set gray non-stroke color.*"
)
warnings.filterwarnings("ignore", message=".*invalid float value.*")
pdfminer_logger.setLevel(ERROR)
try:
result = await docling_service.process_document(file_path, filename)
finally:
pdfminer_logger.setLevel(original_level)
return result["content"]


@ -0,0 +1,123 @@
import asyncio
import logging
import os
import random
import httpx
from app.config import config as app_config
from app.etl_pipeline.constants import (
LLAMACLOUD_BASE_DELAY,
LLAMACLOUD_MAX_DELAY,
LLAMACLOUD_MAX_RETRIES,
LLAMACLOUD_RETRYABLE_EXCEPTIONS,
PER_PAGE_JOB_TIMEOUT,
calculate_job_timeout,
calculate_upload_timeout,
)
async def parse_with_llamacloud(file_path: str, estimated_pages: int) -> str:
from llama_cloud_services import LlamaParse
from llama_cloud_services.parse.utils import ResultType
file_size_bytes = os.path.getsize(file_path)
file_size_mb = file_size_bytes / (1024 * 1024)
upload_timeout = calculate_upload_timeout(file_size_bytes)
job_timeout = calculate_job_timeout(estimated_pages, file_size_bytes)
custom_timeout = httpx.Timeout(
connect=120.0,
read=upload_timeout,
write=upload_timeout,
pool=120.0,
)
logging.info(
f"LlamaCloud upload configured: file_size={file_size_mb:.1f}MB, "
f"pages={estimated_pages}, upload_timeout={upload_timeout:.0f}s, "
f"job_timeout={job_timeout:.0f}s"
)
last_exception = None
attempt_errors: list[str] = []
for attempt in range(1, LLAMACLOUD_MAX_RETRIES + 1):
try:
async with httpx.AsyncClient(timeout=custom_timeout) as custom_client:
parser = LlamaParse(
api_key=app_config.LLAMA_CLOUD_API_KEY,
num_workers=1,
verbose=True,
language="en",
result_type=ResultType.MD,
max_timeout=int(max(2000, job_timeout + upload_timeout)),
job_timeout_in_seconds=job_timeout,
job_timeout_extra_time_per_page_in_seconds=PER_PAGE_JOB_TIMEOUT,
custom_client=custom_client,
)
result = await parser.aparse(file_path)
if attempt > 1:
logging.info(
f"LlamaCloud upload succeeded on attempt {attempt} after "
f"{len(attempt_errors)} failures"
)
if hasattr(result, "get_markdown_documents"):
markdown_docs = result.get_markdown_documents(split_by_page=False)
if markdown_docs and hasattr(markdown_docs[0], "text"):
return markdown_docs[0].text
if hasattr(result, "pages") and result.pages:
return "\n\n".join(
p.md for p in result.pages if hasattr(p, "md") and p.md
)
return str(result)
if isinstance(result, list):
if result and hasattr(result[0], "text"):
return result[0].text
return "\n\n".join(
doc.page_content if hasattr(doc, "page_content") else str(doc)
for doc in result
)
return str(result)
except LLAMACLOUD_RETRYABLE_EXCEPTIONS as e:
last_exception = e
error_type = type(e).__name__
error_msg = str(e)[:200]
attempt_errors.append(f"Attempt {attempt}: {error_type} - {error_msg}")
if attempt < LLAMACLOUD_MAX_RETRIES:
base_delay = min(
LLAMACLOUD_BASE_DELAY * (2 ** (attempt - 1)),
LLAMACLOUD_MAX_DELAY,
)
jitter = base_delay * 0.25 * (2 * random.random() - 1)
delay = base_delay + jitter
logging.warning(
f"LlamaCloud upload failed "
f"(attempt {attempt}/{LLAMACLOUD_MAX_RETRIES}): "
f"{error_type}. File: {file_size_mb:.1f}MB. "
f"Retrying in {delay:.0f}s..."
)
await asyncio.sleep(delay)
else:
logging.error(
f"LlamaCloud upload failed after {LLAMACLOUD_MAX_RETRIES} "
f"attempts. File size: {file_size_mb:.1f}MB, "
f"Pages: {estimated_pages}. "
f"Errors: {'; '.join(attempt_errors)}"
)
except Exception:
raise
raise last_exception or RuntimeError(
f"LlamaCloud parsing failed after {LLAMACLOUD_MAX_RETRIES} retries. "
f"File size: {file_size_mb:.1f}MB"
)


@ -0,0 +1,8 @@
def read_plaintext(file_path: str) -> str:
with open(file_path, encoding="utf-8", errors="replace") as f:
content = f.read()
if "\x00" in content:
raise ValueError(
f"File contains null bytes — likely a binary file opened as text: {file_path}"
)
return content
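Review note: `read_plaintext` combines lenient decoding (`errors="replace"`) with a null-byte check. The sketch below shows how the two interact on sample bytes; `read_text_lenient` repeats the same logic under a different name so the snippet runs alone, and `write_temp` is a hypothetical helper.

```python
import os
import tempfile


def read_text_lenient(file_path: str) -> str:
    """Same shape as read_plaintext in the diff, repeated for a standalone run."""
    with open(file_path, encoding="utf-8", errors="replace") as f:
        content = f.read()
    if "\x00" in content:
        raise ValueError(f"File contains null bytes: {file_path}")
    return content


def write_temp(data: bytes) -> str:
    """Write bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


bad_utf8 = write_temp(b"caf\xe9 menu")   # Latin-1 byte, invalid as UTF-8
binary = write_temp(b"\x89PNG\x00\x00")  # contains NUL bytes

# The invalid byte becomes U+FFFD instead of raising UnicodeDecodeError.
decoded = read_text_lenient(bad_utf8)

try:
    read_text_lenient(binary)
    rejected = False
except ValueError:
    rejected = True

os.remove(bad_utf8)
os.remove(binary)
```

Files in legacy encodings are therefore accepted with replacement characters, while binary content is rejected only when it happens to contain a NUL byte.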


@ -0,0 +1,14 @@
async def parse_with_unstructured(file_path: str) -> str:
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
file_path,
mode="elements",
post_processors=[],
languages=["eng"],
include_orig_elements=False,
include_metadata=False,
strategy="auto",
)
docs = await loader.aload()
return "\n\n".join(doc.page_content for doc in docs if doc.page_content)


@ -59,7 +59,7 @@ class PipelineMessages:
LLM_AUTH = "LLM authentication failed. Check your API key."
LLM_PERMISSION = "LLM request denied. Check your account permissions."
LLM_NOT_FOUND = "LLM model not found. Check your model configuration."
LLM_NOT_FOUND = "Model not found. Check your model configuration."
LLM_BAD_REQUEST = "LLM rejected the request. Document content may be invalid."
LLM_UNPROCESSABLE = (
"Document exceeds the LLM context window even after optimization."
@ -67,7 +67,7 @@ class PipelineMessages:
LLM_RESPONSE = "LLM returned an invalid response."
LLM_AUTH = "LLM authentication failed. Check your API key."
LLM_PERMISSION = "LLM request denied. Check your account permissions."
LLM_NOT_FOUND = "LLM model not found. Check your model configuration."
LLM_NOT_FOUND = "Model not found. Check your model configuration."
LLM_BAD_REQUEST = "LLM rejected the request. Document content may be invalid."
LLM_UNPROCESSABLE = (
"Document exceeds the LLM context window even after optimization."

@ -273,17 +273,18 @@ class IndexingPipelineService:
continue
dup_check = await self.session.execute(
select(Document.id).filter(
select(Document.id, Document.title).filter(
Document.content_hash == content_hash,
Document.id != existing.id,
)
)
if dup_check.scalars().first() is not None:
dup_row = dup_check.first()
if dup_row is not None:
if not DocumentStatus.is_state(
existing.status, DocumentStatus.READY
):
existing.status = DocumentStatus.failed(
"Duplicate content — already indexed by another document"
f"Duplicate content: matches '{dup_row.title}'"
)
continue

@ -3,6 +3,7 @@ from fastapi import APIRouter
from .airtable_add_connector_route import (
router as airtable_add_connector_router,
)
from .autocomplete_routes import router as autocomplete_router
from .chat_comments_routes import router as chat_comments_router
from .circleback_webhook_route import router as circleback_webhook_router
from .clickup_add_connector_route import router as clickup_add_connector_router
@ -48,6 +49,7 @@ from .stripe_routes import router as stripe_router
from .surfsense_docs_routes import router as surfsense_docs_router
from .teams_add_connector_route import router as teams_add_connector_router
from .video_presentations_routes import router as video_presentations_router
from .vision_llm_routes import router as vision_llm_router
from .youtube_routes import router as youtube_router
router = APIRouter()
@ -67,6 +69,7 @@ router.include_router(
) # Video presentation status and streaming
router.include_router(reports_router) # Report CRUD and multi-format export
router.include_router(image_generation_router) # Image generation via litellm
router.include_router(vision_llm_router) # Vision LLM configs for screenshot analysis
router.include_router(search_source_connectors_router)
router.include_router(google_calendar_add_connector_router)
router.include_router(google_gmail_add_connector_router)
@ -84,7 +87,7 @@ router.include_router(confluence_add_connector_router)
router.include_router(clickup_add_connector_router)
router.include_router(dropbox_add_connector_router)
router.include_router(new_llm_config_router) # LLM configs with prompt configuration
router.include_router(model_list_router) # Dynamic LLM model catalogue from OpenRouter
router.include_router(model_list_router) # Dynamic model catalogue from OpenRouter
router.include_router(logs_router)
router.include_router(circleback_webhook_router) # Circleback meeting webhooks
router.include_router(surfsense_docs_router) # Surfsense documentation for citations
@ -95,3 +98,4 @@ router.include_router(incentive_tasks_router) # Incentive tasks for earning fre
router.include_router(stripe_router) # Stripe checkout for additional page packs
router.include_router(youtube_router) # YouTube playlist resolution
router.include_router(prompts_router)
router.include_router(autocomplete_router) # Lightweight autocomplete with KB context

@ -1,7 +1,5 @@
import base64
import hashlib
import logging
import secrets
from datetime import UTC, datetime, timedelta
from uuid import UUID
@ -26,7 +24,11 @@ from app.utils.connector_naming import (
check_duplicate_connector,
generate_unique_connector_name,
)
from app.utils.oauth_security import OAuthStateManager, TokenEncryption
from app.utils.oauth_security import (
OAuthStateManager,
TokenEncryption,
generate_pkce_pair,
)
logger = logging.getLogger(__name__)
@ -75,28 +77,6 @@ def make_basic_auth_header(client_id: str, client_secret: str) -> str:
return f"Basic {b64}"
def generate_pkce_pair() -> tuple[str, str]:
"""
Generate PKCE code verifier and code challenge.
Returns:
Tuple of (code_verifier, code_challenge)
"""
# Generate code verifier (43-128 characters)
code_verifier = (
base64.urlsafe_b64encode(secrets.token_bytes(32)).decode("utf-8").rstrip("=")
)
# Generate code challenge (SHA256 hash of verifier, base64url encoded)
code_challenge = (
base64.urlsafe_b64encode(hashlib.sha256(code_verifier.encode("utf-8")).digest())
.decode("utf-8")
.rstrip("=")
)
return code_verifier, code_challenge
@router.get("/auth/airtable/connector/add")
async def connect_airtable(space_id: int, user: User = Depends(current_active_user)):
"""

@ -0,0 +1,45 @@
from fastapi import APIRouter, Depends
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from sqlalchemy.ext.asyncio import AsyncSession
from app.db import User, get_async_session
from app.services.new_streaming_service import VercelStreamingService
from app.services.vision_autocomplete_service import stream_vision_autocomplete
from app.users import current_active_user
from app.utils.rbac import check_search_space_access
router = APIRouter(prefix="/autocomplete", tags=["autocomplete"])
MAX_SCREENSHOT_SIZE = 20 * 1024 * 1024 # 20 MB base64 ceiling
class VisionAutocompleteRequest(BaseModel):
screenshot: str = Field(..., max_length=MAX_SCREENSHOT_SIZE)
search_space_id: int
app_name: str = ""
window_title: str = ""
@router.post("/vision/stream")
async def vision_autocomplete_stream(
body: VisionAutocompleteRequest,
user: User = Depends(current_active_user),
session: AsyncSession = Depends(get_async_session),
):
await check_search_space_access(session, user, body.search_space_id)
return StreamingResponse(
stream_vision_autocomplete(
body.screenshot,
body.search_space_id,
session,
app_name=body.app_name,
window_title=body.window_title,
),
media_type="text/event-stream",
headers={
**VercelStreamingService.get_response_headers(),
"X-Accel-Buffering": "no",
},
)

@ -1,7 +1,8 @@
# Force asyncio to use standard event loop before unstructured imports
import asyncio
from fastapi import APIRouter, Depends, Form, HTTPException, UploadFile
from fastapi import APIRouter, Depends, Form, HTTPException, Query, UploadFile
from pydantic import BaseModel as PydanticBaseModel
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.future import select
from sqlalchemy.orm import selectinload
@ -10,6 +11,8 @@ from app.db import (
Chunk,
Document,
DocumentType,
DocumentVersion,
Folder,
Permission,
SearchSpace,
SearchSpaceMembership,
@ -17,6 +20,7 @@ from app.db import (
get_async_session,
)
from app.schemas import (
ChunkRead,
DocumentRead,
DocumentsCreate,
DocumentStatusBatchResponse,
@ -26,6 +30,7 @@ from app.schemas import (
DocumentTitleSearchResponse,
DocumentUpdate,
DocumentWithChunksRead,
FolderRead,
PaginatedResponse,
)
from app.services.task_dispatcher import TaskDispatcher, get_task_dispatcher
@ -45,9 +50,7 @@ os.environ["UNSTRUCTURED_HAS_PATCHED_LOOP"] = "1"
router = APIRouter()
MAX_FILES_PER_UPLOAD = 10
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024 # 50 MB per file
MAX_TOTAL_SIZE_BYTES = 200 * 1024 * 1024 # 200 MB total
MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024 # 500 MB per file
@router.post("/documents")
@ -156,13 +159,6 @@ async def create_documents_file_upload(
if not files:
raise HTTPException(status_code=400, detail="No files provided")
if len(files) > MAX_FILES_PER_UPLOAD:
raise HTTPException(
status_code=413,
detail=f"Too many files. Maximum {MAX_FILES_PER_UPLOAD} files per upload.",
)
total_size = 0
for file in files:
file_size = file.size or 0
if file_size > MAX_FILE_SIZE_BYTES:
@ -171,14 +167,6 @@ async def create_documents_file_upload(
detail=f"File '{file.filename}' ({file_size / (1024 * 1024):.1f} MB) "
f"exceeds the {MAX_FILE_SIZE_BYTES // (1024 * 1024)} MB per-file limit.",
)
total_size += file_size
if total_size > MAX_TOTAL_SIZE_BYTES:
raise HTTPException(
status_code=413,
detail=f"Total upload size ({total_size / (1024 * 1024):.1f} MB) "
f"exceeds the {MAX_TOTAL_SIZE_BYTES // (1024 * 1024)} MB limit.",
)
# ===== Read all files concurrently to avoid blocking the event loop =====
async def _read_and_save(file: UploadFile) -> tuple[str, str, int]:
@ -206,16 +194,6 @@ async def create_documents_file_upload(
saved_files = await asyncio.gather(*(_read_and_save(f) for f in files))
actual_total_size = sum(size for _, _, size in saved_files)
if actual_total_size > MAX_TOTAL_SIZE_BYTES:
for temp_path, _, _ in saved_files:
os.unlink(temp_path)
raise HTTPException(
status_code=413,
detail=f"Total upload size ({actual_total_size / (1024 * 1024):.1f} MB) "
f"exceeds the {MAX_TOTAL_SIZE_BYTES // (1024 * 1024)} MB limit.",
)
# ===== PHASE 1: Create pending documents for all files =====
created_documents: list[Document] = []
files_to_process: list[tuple[Document, str, str]] = []
@ -451,13 +429,15 @@ async def read_documents(
reason=doc.status.get("reason"),
)
raw_content = doc.content or ""
api_documents.append(
DocumentRead(
id=doc.id,
title=doc.title,
document_type=doc.document_type,
document_metadata=doc.document_metadata,
content=doc.content,
content="",
content_preview=raw_content[:300],
content_hash=doc.content_hash,
unique_identifier_hash=doc.unique_identifier_hash,
created_at=doc.created_at,
@ -609,13 +589,15 @@ async def search_documents(
reason=doc.status.get("reason"),
)
raw_content = doc.content or ""
api_documents.append(
DocumentRead(
id=doc.id,
title=doc.title,
document_type=doc.document_type,
document_metadata=doc.document_metadata,
content=doc.content,
content="",
content_preview=raw_content[:300],
content_hash=doc.content_hash,
unique_identifier_hash=doc.unique_identifier_hash,
created_at=doc.created_at,
@ -884,16 +866,19 @@ async def get_document_type_counts(
@router.get("/documents/by-chunk/{chunk_id}", response_model=DocumentWithChunksRead)
async def get_document_by_chunk_id(
chunk_id: int,
chunk_window: int = Query(
5, ge=0, description="Number of chunks before/after the cited chunk to include"
),
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""
Retrieves a document based on a chunk ID, including all its chunks ordered by creation time.
Requires DOCUMENTS_READ permission for the search space.
The document's embedding and chunk embeddings are excluded from the response.
Retrieves a document based on a chunk ID, including a window of chunks around the cited one.
Uses SQL-level pagination to avoid loading all chunks into memory.
"""
try:
# First, get the chunk and verify it exists
from sqlalchemy import and_, func, or_
chunk_result = await session.execute(select(Chunk).filter(Chunk.id == chunk_id))
chunk = chunk_result.scalars().first()
@ -902,11 +887,8 @@ async def get_document_by_chunk_id(
status_code=404, detail=f"Chunk with id {chunk_id} not found"
)
# Get the associated document
document_result = await session.execute(
select(Document)
.options(selectinload(Document.chunks))
.filter(Document.id == chunk.document_id)
select(Document).filter(Document.id == chunk.document_id)
)
document = document_result.scalars().first()
@ -916,7 +898,6 @@ async def get_document_by_chunk_id(
detail="Document not found",
)
# Check permission for the search space
await check_permission(
session,
user,
@ -925,10 +906,38 @@ async def get_document_by_chunk_id(
"You don't have permission to read documents in this search space",
)
# Sort chunks by creation time
sorted_chunks = sorted(document.chunks, key=lambda x: x.created_at)
total_result = await session.execute(
select(func.count())
.select_from(Chunk)
.filter(Chunk.document_id == document.id)
)
total_chunks = total_result.scalar() or 0
cited_idx_result = await session.execute(
select(func.count())
.select_from(Chunk)
.filter(
Chunk.document_id == document.id,
or_(
Chunk.created_at < chunk.created_at,
and_(Chunk.created_at == chunk.created_at, Chunk.id < chunk.id),
),
)
)
cited_idx = cited_idx_result.scalar() or 0
start = max(0, cited_idx - chunk_window)
end = min(total_chunks, cited_idx + chunk_window + 1)
windowed_result = await session.execute(
select(Chunk)
.filter(Chunk.document_id == document.id)
.order_by(Chunk.created_at, Chunk.id)
.offset(start)
.limit(end - start)
)
windowed_chunks = windowed_result.scalars().all()
# Return the document with its chunks
return DocumentWithChunksRead(
id=document.id,
title=document.title,
@ -940,7 +949,9 @@ async def get_document_by_chunk_id(
created_at=document.created_at,
updated_at=document.updated_at,
search_space_id=document.search_space_id,
chunks=sorted_chunks,
chunks=windowed_chunks,
total_chunks=total_chunks,
chunk_start_index=start,
)
except HTTPException:
raise
@ -950,6 +961,108 @@ async def get_document_by_chunk_id(
) from e
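
Review note: the window arithmetic in `get_document_by_chunk_id` can be checked in isolation. The following is a plain-Python sketch under the assumption that chunks are ordered by `(created_at, id)`, as the `order_by` in the query suggests. `cited_index` and `window_bounds` are hypothetical names that mirror the two SQL steps (the count of earlier chunks, then offset and limit) and do not exist in the codebase.

```python
from datetime import datetime


def cited_index(chunks: list[tuple[datetime, int]], cited: tuple[datetime, int]) -> int:
    """Count chunks that sort strictly before the cited one by (created_at, id)."""
    # Tuple comparison matches the SQL predicate:
    # created_at < c OR (created_at == c AND id < cited_id)
    return sum(1 for key in chunks if key < cited)


def window_bounds(cited_idx: int, total_chunks: int, chunk_window: int) -> tuple[int, int]:
    """Return (offset, limit) for the slice of chunks around cited_idx."""
    start = max(0, cited_idx - chunk_window)
    end = min(total_chunks, cited_idx + chunk_window + 1)
    return start, end - start
```

With ten chunks sharing one timestamp and `chunk_window=2`, citing the fourth chunk gives index 3 and the slice `offset=1, limit=5`. At either end of the document the window is clamped rather than shifted, so fewer than `2 * chunk_window + 1` chunks come back.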
@router.get("/documents/watched-folders", response_model=list[FolderRead])
async def get_watched_folders(
search_space_id: int,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""Return root folders that are marked as watched (metadata->>'watched' = 'true')."""
await check_permission(
session,
user,
search_space_id,
Permission.DOCUMENTS_READ.value,
"You don't have permission to read documents in this search space",
)
folders = (
(
await session.execute(
select(Folder).where(
Folder.search_space_id == search_space_id,
Folder.parent_id.is_(None),
Folder.folder_metadata.isnot(None),
Folder.folder_metadata["watched"].astext == "true",
)
)
)
.scalars()
.all()
)
return folders
@router.get(
"/documents/{document_id}/chunks",
response_model=PaginatedResponse[ChunkRead],
)
async def get_document_chunks_paginated(
document_id: int,
page: int = Query(0, ge=0),
page_size: int = Query(20, ge=1, le=100),
start_offset: int | None = Query(
None, ge=0, description="Direct offset; overrides page * page_size"
),
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""
Paginated chunk loading for a document.
Supports both page-based and offset-based access.
"""
try:
from sqlalchemy import func
doc_result = await session.execute(
select(Document).filter(Document.id == document_id)
)
document = doc_result.scalars().first()
if not document:
raise HTTPException(status_code=404, detail="Document not found")
await check_permission(
session,
user,
document.search_space_id,
Permission.DOCUMENTS_READ.value,
"You don't have permission to read documents in this search space",
)
total_result = await session.execute(
select(func.count())
.select_from(Chunk)
.filter(Chunk.document_id == document_id)
)
total = total_result.scalar() or 0
offset = start_offset if start_offset is not None else page * page_size
chunks_result = await session.execute(
select(Chunk)
.filter(Chunk.document_id == document_id)
.order_by(Chunk.created_at, Chunk.id)
.offset(offset)
.limit(page_size)
)
chunks = chunks_result.scalars().all()
return PaginatedResponse(
items=chunks,
total=total,
page=offset // page_size if page_size else page,
page_size=page_size,
has_more=(offset + len(chunks)) < total,
)
except HTTPException:
raise
except Exception as e:
raise HTTPException(
status_code=500, detail=f"Failed to fetch chunks: {e!s}"
) from e
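
Review note: a small sketch of the paging fields returned by `get_document_chunks_paginated`, assuming the database returns every row available at the requested offset. `page_info` is a hypothetical helper written for illustration; it reproduces the `offset`, `page` and `has_more` arithmetic without touching SQLAlchemy.

```python
from typing import Optional


def page_info(total: int, page: int, page_size: int, start_offset: Optional[int] = None) -> dict:
    # A direct offset, when given, overrides page * page_size.
    offset = start_offset if start_offset is not None else page * page_size
    # Rows the query would return: at most page_size, never negative.
    returned = max(0, min(page_size, total - offset))
    return {
        "offset": offset,
        "page": offset // page_size,  # floor, so unaligned offsets round down
        "returned": returned,
        "has_more": offset + returned < total,
    }
```

One thing this makes visible: when `start_offset` is not a multiple of `page_size`, the reported `page` is the floor of `offset / page_size`, so a client that mixes both access styles should treat `page` as approximate and rely on `has_more` and its own offset.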
@router.get("/documents/{document_id}", response_model=DocumentRead)
async def read_document(
document_id: int,
@ -980,13 +1093,14 @@ async def read_document(
"You don't have permission to read documents in this search space",
)
# Convert database object to API-friendly format
raw_content = document.content or ""
return DocumentRead(
id=document.id,
title=document.title,
document_type=document.document_type,
document_metadata=document.document_metadata,
content=document.content,
content=raw_content,
content_preview=raw_content[:300],
content_hash=document.content_hash,
unique_identifier_hash=document.unique_identifier_hash,
created_at=document.created_at,
@ -1135,3 +1249,297 @@ async def delete_document(
raise HTTPException(
status_code=500, detail=f"Failed to delete document: {e!s}"
) from e
# ====================================================================
# Version History Endpoints
# ====================================================================
@router.get("/documents/{document_id}/versions")
async def list_document_versions(
document_id: int,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""List all versions for a document, ordered by version_number descending."""
document = (
await session.execute(select(Document).where(Document.id == document_id))
).scalar_one_or_none()
if not document:
raise HTTPException(status_code=404, detail="Document not found")
await check_permission(
session, user, document.search_space_id, Permission.DOCUMENTS_READ.value
)
versions = (
(
await session.execute(
select(DocumentVersion)
.where(DocumentVersion.document_id == document_id)
.order_by(DocumentVersion.version_number.desc())
)
)
.scalars()
.all()
)
return [
{
"version_number": v.version_number,
"title": v.title,
"content_hash": v.content_hash,
"created_at": v.created_at.isoformat() if v.created_at else None,
}
for v in versions
]
@router.get("/documents/{document_id}/versions/{version_number}")
async def get_document_version(
document_id: int,
version_number: int,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""Get full version content including source_markdown."""
document = (
await session.execute(select(Document).where(Document.id == document_id))
).scalar_one_or_none()
if not document:
raise HTTPException(status_code=404, detail="Document not found")
await check_permission(
session, user, document.search_space_id, Permission.DOCUMENTS_READ.value
)
version = (
await session.execute(
select(DocumentVersion).where(
DocumentVersion.document_id == document_id,
DocumentVersion.version_number == version_number,
)
)
).scalar_one_or_none()
if not version:
raise HTTPException(status_code=404, detail="Version not found")
return {
"version_number": version.version_number,
"title": version.title,
"content_hash": version.content_hash,
"source_markdown": version.source_markdown,
"created_at": version.created_at.isoformat() if version.created_at else None,
}
@router.post("/documents/{document_id}/versions/{version_number}/restore")
async def restore_document_version(
document_id: int,
version_number: int,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""Restore a previous version: snapshot current state, then overwrite document content."""
document = (
await session.execute(select(Document).where(Document.id == document_id))
).scalar_one_or_none()
if not document:
raise HTTPException(status_code=404, detail="Document not found")
await check_permission(
session, user, document.search_space_id, Permission.DOCUMENTS_UPDATE.value
)
version = (
await session.execute(
select(DocumentVersion).where(
DocumentVersion.document_id == document_id,
DocumentVersion.version_number == version_number,
)
)
).scalar_one_or_none()
if not version:
raise HTTPException(status_code=404, detail="Version not found")
# Snapshot current state before restoring
from app.utils.document_versioning import create_version_snapshot
await create_version_snapshot(session, document)
# Restore the version's content onto the document
document.source_markdown = version.source_markdown
document.title = version.title or document.title
document.content_needs_reindexing = True
await session.commit()
from app.tasks.celery_tasks.document_reindex_tasks import reindex_document_task
reindex_document_task.delay(document_id, str(user.id))
return {
"message": f"Restored version {version_number}",
"document_id": document_id,
"restored_version": version_number,
}
# ===== Local folder indexing endpoints =====
class FolderIndexRequest(PydanticBaseModel):
folder_path: str
folder_name: str
search_space_id: int
exclude_patterns: list[str] | None = None
file_extensions: list[str] | None = None
root_folder_id: int | None = None
enable_summary: bool = False
class FolderIndexFilesRequest(PydanticBaseModel):
folder_path: str
folder_name: str
search_space_id: int
target_file_paths: list[str]
root_folder_id: int | None = None
enable_summary: bool = False
@router.post("/documents/folder-index")
async def folder_index(
request: FolderIndexRequest,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""Full-scan index of a local folder. Creates the root Folder row synchronously
and dispatches the heavy indexing work to a Celery task.
Returns the root_folder_id so the desktop can persist it.
"""
from app.config import config as app_config
if not app_config.is_self_hosted():
raise HTTPException(
status_code=400,
detail="Local folder indexing is only available in self-hosted mode",
)
await check_permission(
session,
user,
request.search_space_id,
Permission.DOCUMENTS_CREATE.value,
"You don't have permission to create documents in this search space",
)
watched_metadata = {
"watched": True,
"folder_path": request.folder_path,
"exclude_patterns": request.exclude_patterns,
"file_extensions": request.file_extensions,
}
root_folder_id = request.root_folder_id
if root_folder_id:
existing = (
await session.execute(select(Folder).where(Folder.id == root_folder_id))
).scalar_one_or_none()
if not existing:
root_folder_id = None
else:
existing.folder_metadata = watched_metadata
await session.commit()
if not root_folder_id:
root_folder = Folder(
name=request.folder_name,
search_space_id=request.search_space_id,
created_by_id=str(user.id),
position="a0",
folder_metadata=watched_metadata,
)
session.add(root_folder)
await session.flush()
root_folder_id = root_folder.id
await session.commit()
from app.tasks.celery_tasks.document_tasks import index_local_folder_task
index_local_folder_task.delay(
search_space_id=request.search_space_id,
user_id=str(user.id),
folder_path=request.folder_path,
folder_name=request.folder_name,
exclude_patterns=request.exclude_patterns,
file_extensions=request.file_extensions,
root_folder_id=root_folder_id,
enable_summary=request.enable_summary,
)
return {
"message": "Folder indexing started",
"status": "processing",
"root_folder_id": root_folder_id,
}
@router.post("/documents/folder-index-files")
async def folder_index_files(
request: FolderIndexFilesRequest,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""Index multiple files within a watched folder (batched chokidar trigger).
Validates that all target_file_paths are under folder_path.
Dispatches a single Celery task that processes them in parallel.
"""
from app.config import config as app_config
if not app_config.is_self_hosted():
raise HTTPException(
status_code=400,
detail="Local folder indexing is only available in self-hosted mode",
)
if not request.target_file_paths:
raise HTTPException(
status_code=400, detail="target_file_paths must not be empty"
)
await check_permission(
session,
user,
request.search_space_id,
Permission.DOCUMENTS_CREATE.value,
"You don't have permission to create documents in this search space",
)
from pathlib import Path
for fp in request.target_file_paths:
try:
Path(fp).relative_to(request.folder_path)
except ValueError as err:
raise HTTPException(
status_code=400,
detail=f"target_file_path {fp} must be inside folder_path",
) from err
from app.tasks.celery_tasks.document_tasks import index_local_folder_task
index_local_folder_task.delay(
search_space_id=request.search_space_id,
user_id=str(user.id),
folder_path=request.folder_path,
folder_name=request.folder_name,
target_file_paths=request.target_file_paths,
root_folder_id=request.root_folder_id,
enable_summary=request.enable_summary,
)
return {
"message": f"Batch indexing started for {len(request.target_file_paths)} file(s)",
"status": "processing",
"file_count": len(request.target_file_paths),
}
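
Review note: `Path.relative_to` is a purely lexical check, so a `target_file_paths` entry containing `..` segments can pass the containment test in `folder_index_files` while pointing outside `folder_path`. The sketch below contrasts the lexical check with a normalized one. It uses `PurePosixPath` and `posixpath.normpath` so that it behaves the same on any platform; both function names are hypothetical. On a real filesystem, `Path.resolve()` on both sides would additionally follow symlinks, and whether that is appropriate here depends on how the Celery task reads the files.

```python
import posixpath
from pathlib import PurePosixPath


def is_inside_lexical(target: str, folder: str) -> bool:
    # Mirrors the check in the diff: compares path components as written.
    try:
        PurePosixPath(target).relative_to(folder)
        return True
    except ValueError:
        return False


def is_inside_normalized(target: str, folder: str) -> bool:
    # Collapse "." and ".." segments first, then compare components.
    norm_target = PurePosixPath(posixpath.normpath(target))
    norm_folder = PurePosixPath(posixpath.normpath(folder))
    try:
        norm_target.relative_to(norm_folder)
        return True
    except ValueError:
        return False
```

For `folder="/data/notes"` and `target="/data/notes/../../etc/passwd"`, the lexical check accepts the path and the normalized check rejects it. Ordinary paths such as `/data/notes/a/b.md` pass both.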

@ -311,9 +311,11 @@ async def dropbox_callback(
)
existing_cursor = db_connector.config.get("cursor")
existing_folder_cursors = db_connector.config.get("folder_cursors")
db_connector.config = {
**connector_config,
"cursor": existing_cursor,
"folder_cursors": existing_folder_cursors,
"auth_expired": False,
}
flag_modified(db_connector, "config")

@ -15,11 +15,10 @@ import pypandoc
import typst
from fastapi import APIRouter, Depends, HTTPException, Query
from fastapi.responses import StreamingResponse
from sqlalchemy import select
from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import selectinload
from app.db import Document, DocumentType, Permission, User, get_async_session
from app.db import Chunk, Document, DocumentType, Permission, User, get_async_session
from app.routes.reports_routes import (
_FILE_EXTENSIONS,
_MEDIA_TYPES,
@ -44,6 +43,9 @@ router = APIRouter()
async def get_editor_content(
search_space_id: int,
document_id: int,
max_length: int | None = Query(
None, description="Truncate source_markdown to this many characters"
),
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
@ -65,9 +67,7 @@ async def get_editor_content(
)
result = await session.execute(
select(Document)
.options(selectinload(Document.chunks))
.filter(
select(Document).filter(
Document.id == document_id,
Document.search_space_id == search_space_id,
)
@ -77,80 +77,152 @@ async def get_editor_content(
if not document:
raise HTTPException(status_code=404, detail="Document not found")
# Priority 1: Return source_markdown if it exists (check `is not None` to allow empty strings)
if document.source_markdown is not None:
count_result = await session.execute(
select(func.count()).select_from(Chunk).filter(Chunk.document_id == document_id)
)
chunk_count = count_result.scalar() or 0
def _build_response(md: str) -> dict:
size_bytes = len(md.encode("utf-8"))
truncated = False
output_md = md
if max_length is not None and len(md) > max_length:
output_md = md[:max_length]
truncated = True
return {
"document_id": document.id,
"title": document.title,
"document_type": document.document_type.value,
"source_markdown": document.source_markdown,
"source_markdown": output_md,
"content_size_bytes": size_bytes,
"chunk_count": chunk_count,
"truncated": truncated,
"updated_at": document.updated_at.isoformat()
if document.updated_at
else None,
}
# Priority 2: Lazy-migrate from blocknote_document (pure Python, no external deps)
if document.source_markdown is not None:
return _build_response(document.source_markdown)
if document.blocknote_document:
from app.utils.blocknote_to_markdown import blocknote_to_markdown
markdown = blocknote_to_markdown(document.blocknote_document)
if markdown:
# Persist the migration so we don't repeat it
document.source_markdown = markdown
await session.commit()
return {
"document_id": document.id,
"title": document.title,
"document_type": document.document_type.value,
"source_markdown": markdown,
"updated_at": document.updated_at.isoformat()
if document.updated_at
else None,
}
return _build_response(markdown)
# Priority 3: For NOTE type with no content, return empty markdown
if document.document_type == DocumentType.NOTE:
empty_markdown = ""
document.source_markdown = empty_markdown
await session.commit()
return {
"document_id": document.id,
"title": document.title,
"document_type": document.document_type.value,
"source_markdown": empty_markdown,
"updated_at": document.updated_at.isoformat()
if document.updated_at
else None,
}
return _build_response(empty_markdown)
# Priority 4: Reconstruct from chunks
chunks = sorted(document.chunks, key=lambda c: c.id)
chunk_contents_result = await session.execute(
select(Chunk.content)
.filter(Chunk.document_id == document_id)
.order_by(Chunk.id)
)
chunk_contents = chunk_contents_result.scalars().all()
if not chunks:
if not chunk_contents:
doc_status = document.status or {}
state = (
doc_status.get("state", "ready")
if isinstance(doc_status, dict)
else "ready"
)
if state in ("pending", "processing"):
raise HTTPException(
status_code=409,
detail="This document is still being processed. Please wait a moment and try again.",
)
raise HTTPException(
status_code=400,
detail="This document has no content and cannot be edited. Please re-upload to enable editing.",
detail="This document has no viewable content yet. It may still be syncing. Try again in a few seconds, or re-upload if the issue persists.",
)
markdown_content = "\n\n".join(chunk.content for chunk in chunks)
markdown_content = "\n\n".join(chunk_contents)
if not markdown_content.strip():
raise HTTPException(
status_code=400,
detail="This document has empty content and cannot be edited.",
detail="This document appears to be empty. Try re-uploading or editing it to add content.",
)
# Persist the lazy migration
document.source_markdown = markdown_content
await session.commit()
return {
"document_id": document.id,
"title": document.title,
"document_type": document.document_type.value,
"source_markdown": markdown_content,
"updated_at": document.updated_at.isoformat() if document.updated_at else None,
}
return _build_response(markdown_content)
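
Review note: `max_length` is documented as a character count, while `content_size_bytes` is a UTF-8 byte count, and the two diverge for non-ASCII text. A minimal sketch, where `build_preview` is a hypothetical standalone function that mirrors the relevant part of `_build_response`: it decides truncation on characters and reports bytes separately.

```python
from typing import Optional


def build_preview(md: str, max_length: Optional[int]) -> dict:
    size_bytes = len(md.encode("utf-8"))  # full size in bytes, reported as-is
    # Truncation is decided on characters, matching the slice below.
    truncated = max_length is not None and len(md) > max_length
    output_md = md[:max_length] if truncated else md
    return {
        "source_markdown": output_md,
        "content_size_bytes": size_bytes,
        "truncated": truncated,
    }
```

For ten `é` characters (20 bytes) and `max_length=15`, nothing is cut and `truncated` stays `False`. A byte-based comparison would report `truncated=True` even though the character slice returns the whole string.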
@router.get(
"/search-spaces/{search_space_id}/documents/{document_id}/download-markdown"
)
async def download_document_markdown(
search_space_id: int,
document_id: int,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""
Download the full document content as a .md file.
Reconstructs markdown from source_markdown or chunks.
"""
await check_permission(
session,
user,
search_space_id,
Permission.DOCUMENTS_READ.value,
"You don't have permission to read documents in this search space",
)
result = await session.execute(
select(Document).filter(
Document.id == document_id,
Document.search_space_id == search_space_id,
)
)
document = result.scalars().first()
if not document:
raise HTTPException(status_code=404, detail="Document not found")
markdown: str | None = document.source_markdown
if markdown is None and document.blocknote_document:
from app.utils.blocknote_to_markdown import blocknote_to_markdown
markdown = blocknote_to_markdown(document.blocknote_document)
if markdown is None:
chunk_contents_result = await session.execute(
select(Chunk.content)
.filter(Chunk.document_id == document_id)
.order_by(Chunk.id)
)
chunk_contents = chunk_contents_result.scalars().all()
if chunk_contents:
markdown = "\n\n".join(chunk_contents)
if not markdown or not markdown.strip():
raise HTTPException(
status_code=400, detail="Document has no content to download"
)
safe_title = (
"".join(
c if c.isalnum() or c in " -_" else "_"
for c in (document.title or "document")
).strip()[:80]
or "document"
)
return StreamingResponse(
io.BytesIO(markdown.encode("utf-8")),
media_type="text/markdown; charset=utf-8",
headers={"Content-Disposition": f'attachment; filename="{safe_title}.md"'},
)
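
Review note: the filename sanitization in `download_document_markdown` is easy to verify on its own. The sketch below lifts the expression into a hypothetical `safe_filename` function (not present in the commit) with the same rules: keep alphanumerics, spaces, hyphens and underscores, replace everything else with `_`, strip, cap at 80 characters, and fall back to `"document"`.

```python
from typing import Optional


def safe_filename(title: Optional[str], max_len: int = 80) -> str:
    cleaned = "".join(
        c if c.isalnum() or c in " -_" else "_"
        for c in (title or "document")
    ).strip()[:max_len]
    # A title made only of whitespace strips down to "", so fall back again.
    return cleaned or "document"
```

Note that `str.isalnum` accepts non-ASCII letters, so a title such as `Résumé` passes through unchanged. Whether the `Content-Disposition` header then needs an RFC 6266 `filename*` parameter depends on the server stack and the clients involved, which this diff does not show.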
@router.post("/search-spaces/{search_space_id}/documents/{document_id}/save")
@ -258,9 +330,7 @@ async def export_document(
)
result = await session.execute(
select(Document)
.options(selectinload(Document.chunks))
.filter(
select(Document).filter(
Document.id == document_id,
Document.search_space_id == search_space_id,
)
@ -269,16 +339,20 @@ async def export_document(
if not document:
raise HTTPException(status_code=404, detail="Document not found")
# Resolve markdown content (same priority as editor-content endpoint)
markdown_content: str | None = document.source_markdown
if markdown_content is None and document.blocknote_document:
from app.utils.blocknote_to_markdown import blocknote_to_markdown
markdown_content = blocknote_to_markdown(document.blocknote_document)
if markdown_content is None:
chunks = sorted(document.chunks, key=lambda c: c.id)
if chunks:
markdown_content = "\n\n".join(chunk.content for chunk in chunks)
chunk_contents_result = await session.execute(
select(Chunk.content)
.filter(Chunk.document_id == document_id)
.order_by(Chunk.id)
)
chunk_contents = chunk_contents_result.scalars().all()
if chunk_contents:
markdown_content = "\n\n".join(chunk_contents)
if not markdown_content or not markdown_content.strip():
raise HTTPException(status_code=400, detail="Document has no content to export")

@ -192,6 +192,33 @@ async def get_folder_breadcrumb(
) from e
@router.patch("/folders/{folder_id}/watched")
async def stop_watching_folder(
folder_id: int,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""Clear the watched flag from a folder's metadata."""
folder = await session.get(Folder, folder_id)
if not folder:
raise HTTPException(status_code=404, detail="Folder not found")
await check_permission(
session,
user,
folder.search_space_id,
Permission.DOCUMENTS_UPDATE.value,
"You don't have permission to update folders in this search space",
)
if folder.folder_metadata and isinstance(folder.folder_metadata, dict):
updated = {**folder.folder_metadata, "watched": False}
folder.folder_metadata = updated
await session.commit()
return {"message": "Folder watch status updated"}
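The new endpoint builds a fresh dict instead of mutating `folder_metadata` in place. A minimal sketch of that step follows; `clear_watched` is a hypothetical helper name. One plausible motivation, which is an assumption about this codebase, is that reassigning a new object makes the change visible to an ORM that does not track in-place mutations of JSON columns.

```python
def clear_watched(metadata):
    """Return metadata with watched=False, leaving the input object untouched."""
    if metadata and isinstance(metadata, dict):
        # Copy-and-override rather than mutate in place.
        return {**metadata, "watched": False}
    # None, empty, or non-dict metadata is returned as-is, as in the endpoint.
    return metadata


original = {"watched": True, "path": "/notes"}
updated = clear_watched(original)
print(updated, original)
```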
@router.put("/folders/{folder_id}", response_model=FolderRead)
async def update_folder(
folder_id: int,
@ -340,7 +367,7 @@ async def delete_folder(
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
"""Delete a folder and cascade-delete subfolders. Documents are async-deleted via Celery."""
"""Mark documents for deletion and dispatch Celery to delete docs first, then folders."""
try:
folder = await session.get(Folder, folder_id)
if not folder:
@ -372,30 +399,29 @@ async def delete_folder(
)
await session.commit()
await session.execute(Folder.__table__.delete().where(Folder.id == folder_id))
await session.commit()
try:
from app.tasks.celery_tasks.document_tasks import (
delete_folder_documents_task,
)
if document_ids:
try:
from app.tasks.celery_tasks.document_tasks import (
delete_folder_documents_task,
)
delete_folder_documents_task.delay(document_ids)
except Exception as err:
delete_folder_documents_task.delay(
document_ids, folder_subtree_ids=list(subtree_ids)
)
except Exception as err:
if document_ids:
await session.execute(
Document.__table__.update()
.where(Document.id.in_(document_ids))
.values(status={"state": "ready"})
)
await session.commit()
raise HTTPException(
status_code=503,
detail="Folder deleted but document cleanup could not be queued. Documents have been restored.",
) from err
raise HTTPException(
status_code=503,
detail="Could not queue folder deletion. Documents have been restored.",
) from err
return {
"message": "Folder deleted successfully",
"message": "Folder deletion started",
"documents_queued_for_deletion": len(document_ids),
}
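The reworked `delete_folder` flow marks documents first, then queues the background task, and restores the documents to `ready` if queuing fails. The sketch below models that control flow with plain Python objects. The `"deleting"` state name, the `QueueUnavailable` exception and the `enqueue` callable are assumptions for illustration; only the `ready` restore state and the response shape come from the diff.

```python
class QueueUnavailable(Exception):
    """Stand-in for the HTTP 503 raised by the endpoint."""


def start_folder_deletion(statuses, document_ids, enqueue):
    # Step 1: mark documents so they are hidden while deletion is pending.
    for doc_id in document_ids:
        statuses[doc_id] = "deleting"  # assumed state name
    try:
        # Step 2: hand the work to the background queue.
        enqueue(document_ids)
    except Exception as err:
        # Step 3: queuing failed, so undo the marking before reporting the error.
        for doc_id in document_ids:
            statuses[doc_id] = "ready"
        raise QueueUnavailable("Could not queue folder deletion") from err
    return {
        "message": "Folder deletion started",
        "documents_queued_for_deletion": len(document_ids),
    }
```

With this ordering the folder row itself is no longer deleted in the request handler, so a queuing failure leaves both the folder and its documents intact.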


@ -28,7 +28,11 @@ from app.utils.connector_naming import (
check_duplicate_connector,
generate_unique_connector_name,
)
from app.utils.oauth_security import OAuthStateManager, TokenEncryption
from app.utils.oauth_security import (
OAuthStateManager,
TokenEncryption,
generate_code_verifier,
)
logger = logging.getLogger(__name__)
@ -96,9 +100,14 @@ async def connect_calendar(space_id: int, user: User = Depends(current_active_us
flow = get_google_flow()
# Generate secure state parameter with HMAC signature
code_verifier = generate_code_verifier()
flow.code_verifier = code_verifier
# Generate secure state parameter with HMAC signature (includes PKCE code_verifier)
state_manager = get_state_manager()
state_encoded = state_manager.generate_secure_state(space_id, user.id)
state_encoded = state_manager.generate_secure_state(
space_id, user.id, code_verifier=code_verifier
)
auth_url, _ = flow.authorization_url(
access_type="offline",
@ -146,8 +155,11 @@ async def reauth_calendar(
flow = get_google_flow()
code_verifier = generate_code_verifier()
flow.code_verifier = code_verifier
state_manager = get_state_manager()
extra: dict = {"connector_id": connector_id}
extra: dict = {"connector_id": connector_id, "code_verifier": code_verifier}
if return_url and return_url.startswith("/"):
extra["return_url"] = return_url
state_encoded = state_manager.generate_secure_state(space_id, user.id, **extra)
@ -225,6 +237,7 @@ async def calendar_callback(
user_id = UUID(data["user_id"])
space_id = data["space_id"]
code_verifier = data.get("code_verifier")
# Validate redirect URI (security: ensure it matches configured value)
if not config.GOOGLE_CALENDAR_REDIRECT_URI:
@ -233,6 +246,7 @@ async def calendar_callback(
)
flow = get_google_flow()
flow.code_verifier = code_verifier
flow.fetch_token(code=code)
creds = flow.credentials
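For context on the PKCE values used here: RFC 7636 defines the code verifier as a 43 to 128 character random string and the S256 challenge as the unpadded base64url encoding of its SHA-256 digest. The sketch below shows both steps with the standard library. The function names are illustrative, and the body of `generate_code_verifier` in `app.utils.oauth_security` is not shown in this diff, so this is not a claim about its implementation.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes: int = 64) -> str:
    # 64 random bytes encode to 86 URL-safe characters, inside the
    # 43 to 128 character range that RFC 7636 allows.
    return secrets.token_urlsafe(n_bytes)


def s256_challenge(verifier: str) -> str:
    # code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
print(len(verifier), s256_challenge(verifier))
```

The callback must present the same verifier that produced the challenge in the authorization request, which is why the diff carries it through the state and reassigns it to `flow.code_verifier` before `fetch_token`.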


@ -41,7 +41,11 @@ from app.utils.connector_naming import (
check_duplicate_connector,
generate_unique_connector_name,
)
from app.utils.oauth_security import OAuthStateManager, TokenEncryption
from app.utils.oauth_security import (
OAuthStateManager,
TokenEncryption,
generate_code_verifier,
)
# Relax token scope validation for Google OAuth
os.environ["OAUTHLIB_RELAX_TOKEN_SCOPE"] = "1"
@ -127,14 +131,19 @@ async def connect_drive(space_id: int, user: User = Depends(current_active_user)
flow = get_google_flow()
# Generate secure state parameter with HMAC signature
code_verifier = generate_code_verifier()
flow.code_verifier = code_verifier
# Generate secure state parameter with HMAC signature (includes PKCE code_verifier)
state_manager = get_state_manager()
state_encoded = state_manager.generate_secure_state(space_id, user.id)
state_encoded = state_manager.generate_secure_state(
space_id, user.id, code_verifier=code_verifier
)
# Generate authorization URL
auth_url, _ = flow.authorization_url(
access_type="offline", # Get refresh token
prompt="consent", # Force consent screen to get refresh token
access_type="offline",
prompt="consent",
include_granted_scopes="true",
state=state_encoded,
)
@ -193,8 +202,11 @@ async def reauth_drive(
flow = get_google_flow()
code_verifier = generate_code_verifier()
flow.code_verifier = code_verifier
state_manager = get_state_manager()
extra: dict = {"connector_id": connector_id}
extra: dict = {"connector_id": connector_id, "code_verifier": code_verifier}
if return_url and return_url.startswith("/"):
extra["return_url"] = return_url
state_encoded = state_manager.generate_secure_state(space_id, user.id, **extra)
@ -285,6 +297,7 @@ async def drive_callback(
space_id = data["space_id"]
reauth_connector_id = data.get("connector_id")
reauth_return_url = data.get("return_url")
code_verifier = data.get("code_verifier")
logger.info(
f"Processing Google Drive callback for user {user_id}, space {space_id}"
@ -296,8 +309,9 @@ async def drive_callback(
status_code=500, detail="GOOGLE_DRIVE_REDIRECT_URI not configured"
)
# Exchange authorization code for tokens
# Exchange authorization code for tokens (restore PKCE code_verifier from state)
flow = get_google_flow()
flow.code_verifier = code_verifier
flow.fetch_token(code=code)
creds = flow.credentials
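The connect and callback handlers rely on the state parameter surviving the round trip with extra fields such as `code_verifier` intact. A minimal sketch of an HMAC-signed state follows. `sign_state` and `verify_state` are hypothetical names; the real `OAuthStateManager` is not part of this diff and may encode, sign or encrypt its payload differently.

```python
import base64
import hashlib
import hmac
import json


def sign_state(secret: bytes, payload: dict) -> str:
    """Serialise the payload and append an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).decode("ascii")
    sig = hmac.new(secret, body.encode("ascii"), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"


def verify_state(secret: bytes, state: str) -> dict:
    """Check the signature in constant time, then decode the payload."""
    body, _, sig = state.rpartition(".")
    expected = hmac.new(secret, body.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("state signature mismatch")
    return json.loads(base64.urlsafe_b64decode(body.encode("ascii")))


state = sign_state(b"demo-secret", {"space_id": 7, "code_verifier": "abc"})
print(verify_state(b"demo-secret", state)["code_verifier"])
```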


@ -28,7 +28,11 @@ from app.utils.connector_naming import (
check_duplicate_connector,
generate_unique_connector_name,
)
from app.utils.oauth_security import OAuthStateManager, TokenEncryption
from app.utils.oauth_security import (
OAuthStateManager,
TokenEncryption,
generate_code_verifier,
)
logger = logging.getLogger(__name__)
@ -109,9 +113,14 @@ async def connect_gmail(space_id: int, user: User = Depends(current_active_user)
flow = get_google_flow()
# Generate secure state parameter with HMAC signature
code_verifier = generate_code_verifier()
flow.code_verifier = code_verifier
# Generate secure state parameter with HMAC signature (includes PKCE code_verifier)
state_manager = get_state_manager()
state_encoded = state_manager.generate_secure_state(space_id, user.id)
state_encoded = state_manager.generate_secure_state(
space_id, user.id, code_verifier=code_verifier
)
auth_url, _ = flow.authorization_url(
access_type="offline",
@ -164,8 +173,11 @@ async def reauth_gmail(
flow = get_google_flow()
code_verifier = generate_code_verifier()
flow.code_verifier = code_verifier
state_manager = get_state_manager()
extra: dict = {"connector_id": connector_id}
extra: dict = {"connector_id": connector_id, "code_verifier": code_verifier}
if return_url and return_url.startswith("/"):
extra["return_url"] = return_url
state_encoded = state_manager.generate_secure_state(space_id, user.id, **extra)
@ -256,6 +268,7 @@ async def gmail_callback(
user_id = UUID(data["user_id"])
space_id = data["space_id"]
code_verifier = data.get("code_verifier")
# Validate redirect URI (security: ensure it matches configured value)
if not config.GOOGLE_GMAIL_REDIRECT_URI:
@ -264,6 +277,7 @@ async def gmail_callback(
)
flow = get_google_flow()
flow.code_verifier = code_verifier
flow.fetch_token(code=code)
creds = flow.credentials


@ -1,5 +1,5 @@
"""
API route for fetching the available LLM models catalogue.
API route for fetching the available models catalogue.
Serves a dynamically-updated list sourced from the OpenRouter public API,
with a local JSON fallback when the API is unreachable.
@ -30,7 +30,7 @@ async def list_available_models(
user: User = Depends(current_active_user),
):
"""
Return all available LLM models grouped by provider.
Return all available models grouped by provider.
The list is sourced from the OpenRouter public API and cached for 1 hour.
If the API is unreachable, a local fallback file is used instead.


@ -1,7 +1,7 @@
"""
API routes for NewLLMConfig CRUD operations.
NewLLMConfig combines LLM model settings with prompt configuration:
NewLLMConfig combines model settings with prompt configuration:
- LLM provider, model, API key, etc.
- Configurable system instructions
- Citation toggle


@ -55,23 +55,12 @@ from app.schemas import (
)
from app.services.composio_service import ComposioService, get_composio_service
from app.services.notification_service import NotificationService
from app.tasks.connector_indexers import (
index_airtable_records,
index_clickup_tasks,
index_confluence_pages,
index_crawled_urls,
index_discord_messages,
index_elasticsearch_documents,
index_github_repos,
index_google_calendar_events,
index_google_gmail_messages,
index_jira_issues,
index_linear_issues,
index_luma_events,
index_notion_pages,
index_slack_messages,
)
from app.users import current_active_user
# NOTE: connector indexer functions are imported lazily inside each
# ``run_*_indexing`` helper to break a circular import cycle:
# connector_indexers.__init__ → airtable_indexer → airtable_history
# → app.routes.__init__ → this file → connector_indexers (not ready yet)
from app.utils.connector_naming import ensure_unique_connector_name
from app.utils.indexing_locks import (
acquire_connector_indexing_lock,
@ -1378,6 +1367,8 @@ async def run_slack_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_slack_messages
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -1824,6 +1815,8 @@ async def run_notion_indexing_with_new_session(
Create a new session and run the Notion indexing task.
This prevents session leaks by creating a dedicated session for the background task.
"""
from app.tasks.connector_indexers import index_notion_pages
async with async_session_maker() as session:
await _run_indexing_with_notifications(
session=session,
@ -1858,6 +1851,8 @@ async def run_notion_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_notion_pages
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -1910,6 +1905,8 @@ async def run_github_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_github_repos
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -1961,6 +1958,8 @@ async def run_linear_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_linear_issues
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -2011,6 +2010,8 @@ async def run_discord_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_discord_messages
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -2113,6 +2114,8 @@ async def run_jira_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_jira_issues
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -2166,6 +2169,8 @@ async def run_confluence_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_confluence_pages
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -2217,6 +2222,8 @@ async def run_clickup_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_clickup_tasks
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -2268,6 +2275,8 @@ async def run_airtable_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_airtable_records
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -2321,6 +2330,8 @@ async def run_google_calendar_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_google_calendar_events
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -2370,6 +2381,7 @@ async def run_google_gmail_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_google_gmail_messages
# Create a wrapper function that calls index_google_gmail_messages with max_messages
async def gmail_indexing_wrapper(
@ -2465,6 +2477,8 @@ async def run_google_drive_indexing(
stage="fetching",
)
total_unsupported = 0
# Index each folder with indexing options
for folder in items.folders:
try:
@ -2472,6 +2486,7 @@ async def run_google_drive_indexing(
indexed_count,
skipped_count,
error_message,
unsupported_count,
) = await index_google_drive_files(
session,
connector_id,
@ -2485,6 +2500,7 @@ async def run_google_drive_indexing(
include_subfolders=indexing_options.include_subfolders,
)
total_skipped += skipped_count
total_unsupported += unsupported_count
if error_message:
errors.append(f"Folder '{folder.name}': {error_message}")
else:
@ -2560,6 +2576,7 @@ async def run_google_drive_indexing(
indexed_count=total_indexed,
error_message=error_message,
skipped_count=total_skipped,
unsupported_count=total_unsupported,
)
except Exception as e:
@ -2630,7 +2647,12 @@ async def run_onedrive_indexing(
stage="fetching",
)
total_indexed, total_skipped, error_message = await index_onedrive_files(
(
total_indexed,
total_skipped,
error_message,
total_unsupported,
) = await index_onedrive_files(
session,
connector_id,
search_space_id,
@ -2671,6 +2693,7 @@ async def run_onedrive_indexing(
indexed_count=total_indexed,
error_message=error_message,
skipped_count=total_skipped,
unsupported_count=total_unsupported,
)
except Exception as e:
@ -2738,7 +2761,12 @@ async def run_dropbox_indexing(
stage="fetching",
)
total_indexed, total_skipped, error_message = await index_dropbox_files(
(
total_indexed,
total_skipped,
error_message,
total_unsupported,
) = await index_dropbox_files(
session,
connector_id,
search_space_id,
@ -2779,6 +2807,7 @@ async def run_dropbox_indexing(
indexed_count=total_indexed,
error_message=error_message,
skipped_count=total_skipped,
unsupported_count=total_unsupported,
)
except Exception as e:
@ -2836,6 +2865,8 @@ async def run_luma_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_luma_events
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -2888,6 +2919,8 @@ async def run_elasticsearch_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_elasticsearch_documents
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,
@ -2938,6 +2971,8 @@ async def run_web_page_indexing(
start_date: Start date for indexing
end_date: End date for indexing
"""
from app.tasks.connector_indexers import index_crawled_urls
await _run_indexing_with_notifications(
session=session,
connector_id=connector_id,


@ -14,6 +14,7 @@ from app.db import (
SearchSpaceMembership,
SearchSpaceRole,
User,
VisionLLMConfig,
get_async_session,
get_default_roles_config,
)
@ -483,6 +484,63 @@ async def _get_image_gen_config_by_id(
return None
async def _get_vision_llm_config_by_id(
session: AsyncSession, config_id: int | None
) -> dict | None:
if config_id is None:
return None
if config_id == 0:
return {
"id": 0,
"name": "Auto (Fastest)",
"description": "Automatically routes requests across available vision LLM providers",
"provider": "AUTO",
"model_name": "auto",
"is_global": True,
"is_auto_mode": True,
}
if config_id < 0:
for cfg in config.GLOBAL_VISION_LLM_CONFIGS:
if cfg.get("id") == config_id:
return {
"id": cfg.get("id"),
"name": cfg.get("name"),
"description": cfg.get("description"),
"provider": cfg.get("provider"),
"custom_provider": cfg.get("custom_provider"),
"model_name": cfg.get("model_name"),
"api_base": cfg.get("api_base") or None,
"api_version": cfg.get("api_version") or None,
"litellm_params": cfg.get("litellm_params", {}),
"is_global": True,
}
return None
result = await session.execute(
select(VisionLLMConfig).filter(VisionLLMConfig.id == config_id)
)
db_config = result.scalars().first()
if db_config:
return {
"id": db_config.id,
"name": db_config.name,
"description": db_config.description,
"provider": db_config.provider.value if db_config.provider else None,
"custom_provider": db_config.custom_provider,
"model_name": db_config.model_name,
"api_base": db_config.api_base,
"api_version": db_config.api_version,
"litellm_params": db_config.litellm_params or {},
"created_at": db_config.created_at.isoformat()
if db_config.created_at
else None,
"search_space_id": db_config.search_space_id,
}
return None
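`_get_vision_llm_config_by_id` encodes the config source in the sign of the id: `None` means unset, `0` is the auto-routing mode, negative ids refer to global configs, and positive ids are database rows. The sketch below restates that dispatch as a small classifier. `resolve_source`, its return labels and the sample list are illustrative and not part of the commit.

```python
from typing import Optional

# Stand-in for config.GLOBAL_VISION_LLM_CONFIGS; the entry is made up.
GLOBAL_VISION_LLM_CONFIGS = [
    {"id": -1, "name": "Hosted vision model", "model_name": "example-vision"},
]


def resolve_source(config_id: Optional[int]) -> str:
    """Classify a vision LLM config id by where its settings would come from."""
    if config_id is None:
        return "unset"
    if config_id == 0:
        return "auto"
    if config_id < 0:
        known = any(cfg.get("id") == config_id for cfg in GLOBAL_VISION_LLM_CONFIGS)
        # An unknown negative id resolves to nothing, as in the endpoint.
        return "global" if known else "unknown-global"
    return "database"


for candidate in (None, 0, -1, -9, 12):
    print(candidate, resolve_source(candidate))
```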
@router.get(
"/search-spaces/{search_space_id}/llm-preferences",
response_model=LLMPreferencesRead,
@ -522,14 +580,19 @@ async def get_llm_preferences(
image_generation_config = await _get_image_gen_config_by_id(
session, search_space.image_generation_config_id
)
vision_llm_config = await _get_vision_llm_config_by_id(
session, search_space.vision_llm_config_id
)
return LLMPreferencesRead(
agent_llm_id=search_space.agent_llm_id,
document_summary_llm_id=search_space.document_summary_llm_id,
image_generation_config_id=search_space.image_generation_config_id,
vision_llm_config_id=search_space.vision_llm_config_id,
agent_llm=agent_llm,
document_summary_llm=document_summary_llm,
image_generation_config=image_generation_config,
vision_llm_config=vision_llm_config,
)
except HTTPException:
@ -589,14 +652,19 @@ async def update_llm_preferences(
image_generation_config = await _get_image_gen_config_by_id(
session, search_space.image_generation_config_id
)
vision_llm_config = await _get_vision_llm_config_by_id(
session, search_space.vision_llm_config_id
)
return LLMPreferencesRead(
agent_llm_id=search_space.agent_llm_id,
document_summary_llm_id=search_space.document_summary_llm_id,
image_generation_config_id=search_space.image_generation_config_id,
vision_llm_config_id=search_space.vision_llm_config_id,
agent_llm=agent_llm,
document_summary_llm=document_summary_llm,
image_generation_config=image_generation_config,
vision_llm_config=vision_llm_config,
)
except HTTPException:


@ -0,0 +1,291 @@
import logging
from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
from app.config import config
from app.db import (
Permission,
User,
VisionLLMConfig,
get_async_session,
)
from app.schemas import (
GlobalVisionLLMConfigRead,
VisionLLMConfigCreate,
VisionLLMConfigRead,
VisionLLMConfigUpdate,
)
from app.services.vision_model_list_service import get_vision_model_list
from app.users import current_active_user
from app.utils.rbac import check_permission
router = APIRouter()
logger = logging.getLogger(__name__)
# =============================================================================
# Vision Model Catalogue (from OpenRouter, filtered for image-input models)
# =============================================================================
class VisionModelListItem(BaseModel):
value: str
label: str
provider: str
context_window: str | None = None
@router.get("/vision-models", response_model=list[VisionModelListItem])
async def list_vision_models(
user: User = Depends(current_active_user),
):
"""Return vision-capable models sourced from OpenRouter (filtered by image input)."""
try:
return await get_vision_model_list()
except Exception as e:
logger.exception("Failed to fetch vision model list")
raise HTTPException(
status_code=500, detail=f"Failed to fetch vision model list: {e!s}"
) from e
# =============================================================================
# Global Vision LLM Configs (from YAML)
# =============================================================================
@router.get(
"/global-vision-llm-configs",
response_model=list[GlobalVisionLLMConfigRead],
)
async def get_global_vision_llm_configs(
user: User = Depends(current_active_user),
):
try:
global_configs = config.GLOBAL_VISION_LLM_CONFIGS
safe_configs = []
if global_configs and len(global_configs) > 0:
safe_configs.append(
{
"id": 0,
"name": "Auto (Fastest)",
"description": "Automatically routes across available vision LLM providers.",
"provider": "AUTO",
"custom_provider": None,
"model_name": "auto",
"api_base": None,
"api_version": None,
"litellm_params": {},
"is_global": True,
"is_auto_mode": True,
}
)
for cfg in global_configs:
safe_configs.append(
{
"id": cfg.get("id"),
"name": cfg.get("name"),
"description": cfg.get("description"),
"provider": cfg.get("provider"),
"custom_provider": cfg.get("custom_provider"),
"model_name": cfg.get("model_name"),
"api_base": cfg.get("api_base") or None,
"api_version": cfg.get("api_version") or None,
"litellm_params": cfg.get("litellm_params", {}),
"is_global": True,
}
)
return safe_configs
except Exception as e:
logger.exception("Failed to fetch global vision LLM configs")
raise HTTPException(
status_code=500, detail=f"Failed to fetch configs: {e!s}"
) from e
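The global-config endpoint copies a fixed set of keys into each response item, so any other key in the source entry, such as an API key, never reaches the client. A minimal sketch of that projection follows; `SAFE_FIELDS` and `to_safe_config` are hypothetical names, and the sample entry is invented.

```python
SAFE_FIELDS = (
    "id",
    "name",
    "description",
    "provider",
    "custom_provider",
    "model_name",
    "api_base",
    "api_version",
    "litellm_params",
)


def to_safe_config(cfg: dict) -> dict:
    """Project a global config onto the allow-listed fields only."""
    safe = {key: cfg.get(key) for key in SAFE_FIELDS}
    # Empty strings are normalised to None, as in the endpoint.
    safe["api_base"] = safe["api_base"] or None
    safe["api_version"] = safe["api_version"] or None
    safe["litellm_params"] = cfg.get("litellm_params", {})
    safe["is_global"] = True
    return safe


raw = {"id": -1, "name": "Hosted", "api_key": "secret", "api_base": ""}
print(to_safe_config(raw))
```

An allow-list of this kind fails closed: a key added to the source entries later stays private until it is added to the list explicitly.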
# =============================================================================
# VisionLLMConfig CRUD
# =============================================================================
@router.post("/vision-llm-configs", response_model=VisionLLMConfigRead)
async def create_vision_llm_config(
config_data: VisionLLMConfigCreate,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
try:
await check_permission(
session,
user,
config_data.search_space_id,
Permission.VISION_CONFIGS_CREATE.value,
"You don't have permission to create vision LLM configs in this search space",
)
db_config = VisionLLMConfig(**config_data.model_dump(), user_id=user.id)
session.add(db_config)
await session.commit()
await session.refresh(db_config)
return db_config
except HTTPException:
raise
except Exception as e:
await session.rollback()
logger.exception("Failed to create VisionLLMConfig")
raise HTTPException(
status_code=500, detail=f"Failed to create config: {e!s}"
) from e
@router.get("/vision-llm-configs", response_model=list[VisionLLMConfigRead])
async def list_vision_llm_configs(
search_space_id: int,
skip: int = 0,
limit: int = 100,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
try:
await check_permission(
session,
user,
search_space_id,
Permission.VISION_CONFIGS_READ.value,
"You don't have permission to view vision LLM configs in this search space",
)
result = await session.execute(
select(VisionLLMConfig)
.filter(VisionLLMConfig.search_space_id == search_space_id)
.order_by(VisionLLMConfig.created_at.desc())
.offset(skip)
.limit(limit)
)
return result.scalars().all()
except HTTPException:
raise
except Exception as e:
logger.exception("Failed to list VisionLLMConfigs")
raise HTTPException(
status_code=500, detail=f"Failed to fetch configs: {e!s}"
) from e
@router.get("/vision-llm-configs/{config_id}", response_model=VisionLLMConfigRead)
async def get_vision_llm_config(
config_id: int,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
try:
result = await session.execute(
select(VisionLLMConfig).filter(VisionLLMConfig.id == config_id)
)
db_config = result.scalars().first()
if not db_config:
raise HTTPException(status_code=404, detail="Config not found")
await check_permission(
session,
user,
db_config.search_space_id,
Permission.VISION_CONFIGS_READ.value,
"You don't have permission to view vision LLM configs in this search space",
)
return db_config
except HTTPException:
raise
except Exception as e:
logger.exception("Failed to get VisionLLMConfig")
raise HTTPException(
status_code=500, detail=f"Failed to fetch config: {e!s}"
) from e
@router.put("/vision-llm-configs/{config_id}", response_model=VisionLLMConfigRead)
async def update_vision_llm_config(
config_id: int,
update_data: VisionLLMConfigUpdate,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
try:
result = await session.execute(
select(VisionLLMConfig).filter(VisionLLMConfig.id == config_id)
)
db_config = result.scalars().first()
if not db_config:
raise HTTPException(status_code=404, detail="Config not found")
await check_permission(
session,
user,
db_config.search_space_id,
Permission.VISION_CONFIGS_CREATE.value,
"You don't have permission to update vision LLM configs in this search space",
)
for key, value in update_data.model_dump(exclude_unset=True).items():
setattr(db_config, key, value)
await session.commit()
await session.refresh(db_config)
return db_config
except HTTPException:
raise
except Exception as e:
await session.rollback()
logger.exception("Failed to update VisionLLMConfig")
raise HTTPException(
status_code=500, detail=f"Failed to update config: {e!s}"
) from e
@router.delete("/vision-llm-configs/{config_id}", response_model=dict)
async def delete_vision_llm_config(
config_id: int,
session: AsyncSession = Depends(get_async_session),
user: User = Depends(current_active_user),
):
try:
result = await session.execute(
select(VisionLLMConfig).filter(VisionLLMConfig.id == config_id)
)
db_config = result.scalars().first()
if not db_config:
raise HTTPException(status_code=404, detail="Config not found")
await check_permission(
session,
user,
db_config.search_space_id,
Permission.VISION_CONFIGS_DELETE.value,
"You don't have permission to delete vision LLM configs in this search space",
)
await session.delete(db_config)
await session.commit()
return {
"message": "Vision LLM config deleted successfully",
"id": config_id,
}
except HTTPException:
raise
except Exception as e:
await session.rollback()
logger.exception("Failed to delete VisionLLMConfig")
raise HTTPException(
status_code=500, detail=f"Failed to delete config: {e!s}"
) from e


@ -125,6 +125,13 @@ from .video_presentations import (
VideoPresentationRead,
VideoPresentationUpdate,
)
from .vision_llm import (
GlobalVisionLLMConfigRead,
VisionLLMConfigCreate,
VisionLLMConfigPublic,
VisionLLMConfigRead,
VisionLLMConfigUpdate,
)
__all__ = [
# Folder schemas
@ -163,6 +170,8 @@ __all__ = [
"FolderUpdate",
"GlobalImageGenConfigRead",
"GlobalNewLLMConfigRead",
# Vision LLM Config schemas
"GlobalVisionLLMConfigRead",
"GoogleDriveIndexRequest",
"GoogleDriveIndexingOptions",
# Base schemas
@ -264,4 +273,8 @@ __all__ = [
"VideoPresentationCreate",
"VideoPresentationRead",
"VideoPresentationUpdate",
"VisionLLMConfigCreate",
"VisionLLMConfigPublic",
"VisionLLMConfigRead",
"VisionLLMConfigUpdate",
]


@ -53,25 +53,26 @@ class DocumentRead(BaseModel):
title: str
document_type: DocumentType
document_metadata: dict
content: str # Changed to string to match frontend
content: str = ""
content_preview: str = ""
content_hash: str
unique_identifier_hash: str | None
created_at: datetime
updated_at: datetime | None
search_space_id: int
folder_id: int | None = None
created_by_id: UUID | None = None # User who created/uploaded this document
created_by_id: UUID | None = None
created_by_name: str | None = None
created_by_email: str | None = None
status: DocumentStatusSchema | None = (
None # Processing status (ready, processing, failed)
)
status: DocumentStatusSchema | None = None
model_config = ConfigDict(from_attributes=True)
class DocumentWithChunksRead(DocumentRead):
chunks: list[ChunkRead] = []
total_chunks: int = 0
chunk_start_index: int = 0
model_config = ConfigDict(from_attributes=True)


@ -1,6 +1,7 @@
"""Pydantic schemas for folder CRUD, move, and reorder operations."""
from datetime import datetime
from typing import Any
from uuid import UUID
from pydantic import BaseModel, ConfigDict, Field
@ -34,6 +35,9 @@ class FolderRead(BaseModel):
created_by_id: UUID | None
created_at: datetime
updated_at: datetime
metadata: dict[str, Any] | None = Field(
default=None, validation_alias="folder_metadata"
)
model_config = ConfigDict(from_attributes=True)


@ -1,7 +1,7 @@
"""
Pydantic schemas for the NewLLMConfig API.
NewLLMConfig combines LLM model settings with prompt configuration:
NewLLMConfig combines model settings with prompt configuration:
- LLM provider, model, API key, etc.
- Configurable system instructions
- Citation toggle
@ -26,7 +26,7 @@ class NewLLMConfigBase(BaseModel):
None, max_length=500, description="Optional description"
)
# LLM Model Configuration
# Model Configuration
provider: LiteLLMProvider = Field(..., description="LiteLLM provider type")
custom_provider: str | None = Field(
None, max_length=100, description="Custom provider name when provider is CUSTOM"
@ -71,7 +71,7 @@ class NewLLMConfigUpdate(BaseModel):
name: str | None = Field(None, max_length=100)
description: str | None = Field(None, max_length=500)
# LLM Model Configuration
# Model Configuration
provider: LiteLLMProvider | None = None
custom_provider: str | None = Field(None, max_length=100)
model_name: str | None = Field(None, max_length=100)
@ -106,7 +106,7 @@ class NewLLMConfigPublic(BaseModel):
name: str
description: str | None = None
# LLM Model Configuration (no api_key)
# Model Configuration (no api_key)
provider: LiteLLMProvider
custom_provider: str | None = None
model_name: str
@ -149,7 +149,7 @@ class GlobalNewLLMConfigRead(BaseModel):
name: str
description: str | None = None
# LLM Model Configuration (no api_key)
# Model Configuration (no api_key)
provider: str # String because YAML doesn't enforce enum, "AUTO" for Auto mode
custom_provider: str | None = None
model_name: str
@ -182,6 +182,10 @@ class LLMPreferencesRead(BaseModel):
image_generation_config_id: int | None = Field(
None, description="ID of the image generation config to use"
)
vision_llm_config_id: int | None = Field(
None,
description="ID of the vision LLM config to use for vision/screenshot analysis",
)
agent_llm: dict[str, Any] | None = Field(
None, description="Full config for agent LLM"
)
@ -191,6 +195,9 @@ class LLMPreferencesRead(BaseModel):
image_generation_config: dict[str, Any] | None = Field(
None, description="Full config for image generation"
)
vision_llm_config: dict[str, Any] | None = Field(
None, description="Full config for vision LLM"
)
model_config = ConfigDict(from_attributes=True)
@ -207,3 +214,7 @@ class LLMPreferencesUpdate(BaseModel):
image_generation_config_id: int | None = Field(
None, description="ID of the image generation config to use"
)
vision_llm_config_id: int | None = Field(
None,
description="ID of the vision LLM config to use for vision/screenshot analysis",
)


@ -0,0 +1,75 @@
import uuid
from datetime import datetime
from typing import Any
from pydantic import BaseModel, ConfigDict, Field
from app.db import VisionProvider
class VisionLLMConfigBase(BaseModel):
name: str = Field(..., max_length=100)
description: str | None = Field(None, max_length=500)
provider: VisionProvider = Field(...)
custom_provider: str | None = Field(None, max_length=100)
model_name: str = Field(..., max_length=100)
api_key: str = Field(...)
api_base: str | None = Field(None, max_length=500)
api_version: str | None = Field(None, max_length=50)
litellm_params: dict[str, Any] | None = Field(default=None)
class VisionLLMConfigCreate(VisionLLMConfigBase):
search_space_id: int = Field(...)
class VisionLLMConfigUpdate(BaseModel):
name: str | None = Field(None, max_length=100)
description: str | None = Field(None, max_length=500)
provider: VisionProvider | None = None
custom_provider: str | None = Field(None, max_length=100)
model_name: str | None = Field(None, max_length=100)
api_key: str | None = None
api_base: str | None = Field(None, max_length=500)
api_version: str | None = Field(None, max_length=50)
litellm_params: dict[str, Any] | None = None
class VisionLLMConfigRead(VisionLLMConfigBase):
id: int
created_at: datetime
search_space_id: int
user_id: uuid.UUID
model_config = ConfigDict(from_attributes=True)
class VisionLLMConfigPublic(BaseModel):
id: int
name: str
description: str | None = None
provider: VisionProvider
custom_provider: str | None = None
model_name: str
api_base: str | None = None
api_version: str | None = None
litellm_params: dict[str, Any] | None = None
created_at: datetime
search_space_id: int
user_id: uuid.UUID
model_config = ConfigDict(from_attributes=True)
class GlobalVisionLLMConfigRead(BaseModel):
id: int = Field(...)
name: str
description: str | None = None
provider: str
custom_provider: str | None = None
model_name: str
api_base: str | None = None
api_version: str | None = None
litellm_params: dict[str, Any] | None = None
is_global: bool = True
is_auto_mode: bool = False
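The difference between VisionLLMConfigRead and VisionLLMConfigPublic in the schemas above is that the public variant carries no api_key. A minimal sketch of that redaction with plain dicts follows; `to_public_config`, `SECRET_FIELDS`, and the sample values are hypothetical illustrations and are not part of this commit.

```python
from typing import Any

# Assumption for this sketch: api_key is the only field the public view hides.
SECRET_FIELDS = {"api_key"}


def to_public_config(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a vision config dict without secret fields."""
    return {key: value for key, value in config.items() if key not in SECRET_FIELDS}


# Made-up sample record shaped like the schema fields in the diff.
full = {
    "id": 7,
    "name": "Screenshot model",
    "provider": "OPENAI",
    "model_name": "gpt-4o",
    "api_key": "sk-example",
}
public = to_public_config(full)
```

The original dict is left untouched, so the full record can still be used server-side while only the redacted copy is returned to clients.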


@ -111,9 +111,8 @@ class DoclingService:
pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
)
# Initialize DocumentConverter
self.converter = DocumentConverter(
format_options={InputFormat.PDF: pdf_format_option}
format_options={InputFormat.PDF: pdf_format_option},
)
acceleration_type = "GPU (WSL2)" if self.use_gpu else "CPU"


@ -405,6 +405,121 @@ async def get_document_summary_llm(
)
async def get_vision_llm(
session: AsyncSession, search_space_id: int
) -> ChatLiteLLM | ChatLiteLLMRouter | None:
"""Get the search space's vision LLM instance for screenshot analysis.
Resolves from the dedicated VisionLLMConfig system:
- Auto mode (ID 0): VisionLLMRouterService
- Global (negative ID): YAML configs
- DB (positive ID): VisionLLMConfig table
"""
from app.db import VisionLLMConfig
from app.services.vision_llm_router_service import (
VISION_PROVIDER_MAP,
VisionLLMRouterService,
get_global_vision_llm_config,
is_vision_auto_mode,
)
try:
result = await session.execute(
select(SearchSpace).where(SearchSpace.id == search_space_id)
)
search_space = result.scalars().first()
if not search_space:
logger.error(f"Search space {search_space_id} not found")
return None
config_id = search_space.vision_llm_config_id
if config_id is None:
logger.error(f"No vision LLM configured for search space {search_space_id}")
return None
if is_vision_auto_mode(config_id):
if not VisionLLMRouterService.is_initialized():
logger.error(
"Vision Auto mode requested but Vision LLM Router not initialized"
)
return None
try:
return ChatLiteLLMRouter(
router=VisionLLMRouterService.get_router(),
streaming=True,
)
except Exception as e:
logger.error(f"Failed to create vision ChatLiteLLMRouter: {e}")
return None
if config_id < 0:
global_cfg = get_global_vision_llm_config(config_id)
if not global_cfg:
logger.error(f"Global vision LLM config {config_id} not found")
return None
if global_cfg.get("custom_provider"):
model_string = (
f"{global_cfg['custom_provider']}/{global_cfg['model_name']}"
)
else:
prefix = VISION_PROVIDER_MAP.get(
global_cfg["provider"].upper(),
global_cfg["provider"].lower(),
)
model_string = f"{prefix}/{global_cfg['model_name']}"
litellm_kwargs = {
"model": model_string,
"api_key": global_cfg["api_key"],
}
if global_cfg.get("api_base"):
litellm_kwargs["api_base"] = global_cfg["api_base"]
if global_cfg.get("litellm_params"):
litellm_kwargs.update(global_cfg["litellm_params"])
return ChatLiteLLM(**litellm_kwargs)
result = await session.execute(
select(VisionLLMConfig).where(
VisionLLMConfig.id == config_id,
VisionLLMConfig.search_space_id == search_space_id,
)
)
vision_cfg = result.scalars().first()
if not vision_cfg:
logger.error(
f"Vision LLM config {config_id} not found in search space {search_space_id}"
)
return None
if vision_cfg.custom_provider:
model_string = f"{vision_cfg.custom_provider}/{vision_cfg.model_name}"
else:
prefix = VISION_PROVIDER_MAP.get(
vision_cfg.provider.value.upper(),
vision_cfg.provider.value.lower(),
)
model_string = f"{prefix}/{vision_cfg.model_name}"
litellm_kwargs = {
"model": model_string,
"api_key": vision_cfg.api_key,
}
if vision_cfg.api_base:
litellm_kwargs["api_base"] = vision_cfg.api_base
if vision_cfg.litellm_params:
litellm_kwargs.update(vision_cfg.litellm_params)
return ChatLiteLLM(**litellm_kwargs)
except Exception as e:
logger.error(
f"Error getting vision LLM for search space {search_space_id}: {e!s}"
)
return None
# Backward-compatible alias (LLM preferences are now per-search-space, not per-user)
async def get_user_long_context_llm(
session: AsyncSession,
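The get_vision_llm docstring above describes a three-way ID convention: 0 selects Auto mode, negative IDs point at global YAML configs, and positive IDs point at rows in the VisionLLMConfig table. A small sketch of that dispatch is shown here; `ConfigSource` and `classify_config_id` are hypothetical names used only for illustration.

```python
from __future__ import annotations

from enum import Enum

VISION_AUTO_MODE_ID = 0  # same value the router service in this commit uses


class ConfigSource(Enum):
    AUTO = "auto"          # handled by the vision router
    GLOBAL = "global"      # YAML-defined config, negative ID
    DATABASE = "database"  # per-search-space row, positive ID


def classify_config_id(config_id: int | None) -> ConfigSource | None:
    """Map a vision_llm_config_id to the place it should be resolved from."""
    if config_id is None:
        return None  # nothing configured for the search space
    if config_id == VISION_AUTO_MODE_ID:
        return ConfigSource.AUTO
    if config_id < 0:
        return ConfigSource.GLOBAL
    return ConfigSource.DATABASE
```

Checking for Auto mode before the sign test mirrors the order in get_vision_llm, where the zero case returns early.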


@ -1,5 +1,5 @@
"""
Service for fetching and caching the available LLM model list.
Service for fetching and caching the available model list.
Uses the OpenRouter public API as the primary source, with a local
fallback JSON file when the API is unreachable.


@ -421,6 +421,7 @@ class ConnectorIndexingNotificationHandler(BaseNotificationHandler):
error_message: str | None = None,
is_warning: bool = False,
skipped_count: int | None = None,
unsupported_count: int | None = None,
) -> Notification:
"""
Update notification when connector indexing completes.
@ -428,10 +429,11 @@ class ConnectorIndexingNotificationHandler(BaseNotificationHandler):
Args:
session: Database session
notification: Notification to update
indexed_count: Total number of items indexed
indexed_count: Total number of files indexed
error_message: Error message if indexing failed, or warning message (optional)
is_warning: If True, treat error_message as a warning (success case) rather than an error
skipped_count: Number of items skipped (e.g., duplicates) - optional
skipped_count: Number of files skipped (e.g., unchanged) - optional
unsupported_count: Number of files skipped because the ETL parser doesn't support them
Returns:
Updated notification
@ -440,52 +442,45 @@ class ConnectorIndexingNotificationHandler(BaseNotificationHandler):
"connector_name", "Connector"
)
# Build the skipped text if there are skipped items
skipped_text = ""
if skipped_count and skipped_count > 0:
skipped_item_text = "item" if skipped_count == 1 else "items"
skipped_text = (
f" ({skipped_count} {skipped_item_text} skipped - already indexed)"
)
unsupported_text = ""
if unsupported_count and unsupported_count > 0:
file_word = "file was" if unsupported_count == 1 else "files were"
unsupported_text = f" {unsupported_count} {file_word} not supported."
# If there's an error message but items were indexed, treat it as a warning (partial success)
# If is_warning is True, treat it as success even with 0 items (e.g., duplicates found)
# Otherwise, treat it as a failure
if error_message:
if indexed_count > 0:
# Partial success with warnings (e.g., duplicate content from other connectors)
title = f"Ready: {connector_name}"
item_text = "item" if indexed_count == 1 else "items"
message = f"Now searchable! {indexed_count} {item_text} synced{skipped_text}. Note: {error_message}"
file_text = "file" if indexed_count == 1 else "files"
message = f"Now searchable! {indexed_count} {file_text} synced.{unsupported_text} Note: {error_message}"
status = "completed"
elif is_warning:
# Warning case (e.g., duplicates found) - treat as success
title = f"Ready: {connector_name}"
message = f"Sync completed{skipped_text}. {error_message}"
message = f"Sync complete.{unsupported_text} {error_message}"
status = "completed"
else:
# Complete failure
title = f"Failed: {connector_name}"
message = f"Sync failed: {error_message}"
if unsupported_text:
message += unsupported_text
status = "failed"
else:
title = f"Ready: {connector_name}"
if indexed_count == 0:
if skipped_count and skipped_count > 0:
skipped_item_text = "item" if skipped_count == 1 else "items"
message = f"Already up to date! {skipped_count} {skipped_item_text} skipped (already indexed)."
if unsupported_count and unsupported_count > 0:
message = f"Sync complete.{unsupported_text}"
else:
message = "Already up to date! No new items to sync."
message = "Already up to date!"
else:
item_text = "item" if indexed_count == 1 else "items"
message = (
f"Now searchable! {indexed_count} {item_text} synced{skipped_text}."
)
file_text = "file" if indexed_count == 1 else "files"
message = f"Now searchable! {indexed_count} {file_text} synced."
if unsupported_text:
message += unsupported_text
status = "completed"
metadata_updates = {
"indexed_count": indexed_count,
"skipped_count": skipped_count or 0,
"unsupported_count": unsupported_count or 0,
"sync_stage": "completed"
if (not error_message or is_warning or indexed_count > 0)
else "failed",
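The success-path wording in the hunk above can be exercised in isolation. The sketch below reproduces only the branch without an error_message; the function names `unsupported_suffix` and `success_message` are hypothetical, and the real handler also sets title, status, and metadata.

```python
from __future__ import annotations


def unsupported_suffix(unsupported_count: int | None) -> str:
    """Build the trailing ' N file(s) ... not supported.' fragment, or ''."""
    if not unsupported_count or unsupported_count <= 0:
        return ""
    file_word = "file was" if unsupported_count == 1 else "files were"
    return f" {unsupported_count} {file_word} not supported."


def success_message(indexed_count: int, unsupported_count: int | None = None) -> str:
    """Message for a completed sync with no error_message."""
    suffix = unsupported_suffix(unsupported_count)
    if indexed_count == 0:
        # Nothing new indexed: mention unsupported files if any were seen.
        return f"Sync complete.{suffix}" if suffix else "Already up to date!"
    file_text = "file" if indexed_count == 1 else "files"
    return f"Now searchable! {indexed_count} {file_text} synced.{suffix}"
```

Note that skipped_count no longer appears in the message text after this change; it is still recorded in the notification metadata.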


@ -3,7 +3,7 @@ Service for managing user page limits for ETL services.
"""
import os
from pathlib import Path
from pathlib import Path, PurePosixPath
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
@ -223,10 +223,155 @@ class PageLimitService:
# Estimate ~2000 characters per page
return max(1, content_length // 2000)
@staticmethod
def estimate_pages_from_metadata(
file_name_or_ext: str, file_size: int | str | None = None
) -> int:
"""Size-based page estimation from file name/extension and byte size.
Pure function: no file I/O, no database access. Used by cloud
connectors (which only have API metadata) and as the internal
fallback for :meth:`estimate_pages_before_processing`.
``file_name_or_ext`` can be a full filename (``"report.pdf"``) or
a bare extension (``".pdf"``). ``file_size`` may be an int, a
stringified int from a cloud API, or *None*.
"""
if file_size is not None:
try:
file_size = int(file_size)
except (ValueError, TypeError):
file_size = 0
else:
file_size = 0
if file_size <= 0:
return 1
ext = PurePosixPath(file_name_or_ext).suffix.lower() if file_name_or_ext else ""
if not ext and file_name_or_ext.startswith("."):
ext = file_name_or_ext.lower()
file_ext = ext
if file_ext == ".pdf":
return max(1, file_size // (100 * 1024))
if file_ext in {
".doc",
".docx",
".docm",
".dot",
".dotm",
".odt",
".ott",
".sxw",
".stw",
".uot",
".rtf",
".pages",
".wpd",
".wps",
".abw",
".zabw",
".cwk",
".hwp",
".lwp",
".mcw",
".mw",
".sdw",
".vor",
}:
return max(1, file_size // (50 * 1024))
if file_ext in {
".ppt",
".pptx",
".pptm",
".pot",
".potx",
".odp",
".otp",
".sxi",
".sti",
".uop",
".key",
".sda",
".sdd",
".sdp",
}:
return max(1, file_size // (200 * 1024))
if file_ext in {
".xls",
".xlsx",
".xlsm",
".xlsb",
".xlw",
".xlr",
".ods",
".ots",
".fods",
".numbers",
".123",
".wk1",
".wk2",
".wk3",
".wk4",
".wks",
".wb1",
".wb2",
".wb3",
".wq1",
".wq2",
".csv",
".tsv",
".slk",
".sylk",
".dif",
".dbf",
".prn",
".qpw",
".602",
".et",
".eth",
}:
return max(1, file_size // (100 * 1024))
if file_ext in {".epub"}:
return max(1, file_size // (50 * 1024))
if file_ext in {".txt", ".log", ".md", ".markdown", ".htm", ".html", ".xml"}:
return max(1, file_size // 3000)
if file_ext in {
".jpg",
".jpeg",
".png",
".gif",
".bmp",
".tiff",
".webp",
".svg",
".cgm",
".odg",
".pbd",
}:
return 1
if file_ext in {".mp3", ".m4a", ".wav", ".mpga"}:
return max(1, file_size // (1024 * 1024))
if file_ext in {".mp4", ".mpeg", ".webm"}:
return max(1, file_size // (5 * 1024 * 1024))
return max(1, file_size // (80 * 1024))
def estimate_pages_before_processing(self, file_path: str) -> int:
"""
Estimate page count from file before processing (to avoid unnecessary API calls).
This is called BEFORE sending to ETL services to prevent cost on rejected files.
Estimate page count from a local file before processing.
For PDFs, attempts to read the actual page count via pypdf.
For everything else, delegates to :meth:`estimate_pages_from_metadata`.
Args:
file_path: Path to the file
@ -240,7 +385,6 @@ class PageLimitService:
file_ext = Path(file_path).suffix.lower()
file_size = os.path.getsize(file_path)
# PDF files - try to get actual page count
if file_ext == ".pdf":
try:
import pypdf
@ -249,153 +393,6 @@ class PageLimitService:
pdf_reader = pypdf.PdfReader(f)
return len(pdf_reader.pages)
except Exception:
# If PDF reading fails, fall back to size estimation
# Typical PDF: ~100KB per page (conservative estimate)
return max(1, file_size // (100 * 1024))
pass # fall through to size-based estimation
# Word Processing Documents
# Microsoft Word, LibreOffice Writer, WordPerfect, Pages, etc.
elif file_ext in [
".doc",
".docx",
".docm",
".dot",
".dotm", # Microsoft Word
".odt",
".ott",
".sxw",
".stw",
".uot", # OpenDocument/StarOffice Writer
".rtf", # Rich Text Format
".pages", # Apple Pages
".wpd",
".wps", # WordPerfect, Microsoft Works
".abw",
".zabw", # AbiWord
".cwk",
".hwp",
".lwp",
".mcw",
".mw",
".sdw",
".vor", # Other word processors
]:
# Typical word document: ~50KB per page (conservative)
return max(1, file_size // (50 * 1024))
# Presentation Documents
# PowerPoint, Impress, Keynote, etc.
elif file_ext in [
".ppt",
".pptx",
".pptm",
".pot",
".potx", # Microsoft PowerPoint
".odp",
".otp",
".sxi",
".sti",
".uop", # OpenDocument/StarOffice Impress
".key", # Apple Keynote
".sda",
".sdd",
".sdp", # StarOffice Draw/Impress
]:
# Typical presentation: ~200KB per slide (conservative)
return max(1, file_size // (200 * 1024))
# Spreadsheet Documents
# Excel, Calc, Numbers, Lotus, etc.
elif file_ext in [
".xls",
".xlsx",
".xlsm",
".xlsb",
".xlw",
".xlr", # Microsoft Excel
".ods",
".ots",
".fods", # OpenDocument Spreadsheet
".numbers", # Apple Numbers
".123",
".wk1",
".wk2",
".wk3",
".wk4",
".wks", # Lotus 1-2-3
".wb1",
".wb2",
".wb3",
".wq1",
".wq2", # Quattro Pro
".csv",
".tsv",
".slk",
".sylk",
".dif",
".dbf",
".prn",
".qpw", # Data formats
".602",
".et",
".eth", # Other spreadsheets
]:
# Spreadsheets typically have 1 sheet = 1 page for ETL
# Conservative: ~100KB per sheet
return max(1, file_size // (100 * 1024))
# E-books
elif file_ext in [".epub"]:
# E-books vary widely, estimate by size
# Typical e-book: ~50KB per page
return max(1, file_size // (50 * 1024))
# Plain Text and Markup Files
elif file_ext in [
".txt",
".log", # Plain text
".md",
".markdown", # Markdown
".htm",
".html",
".xml", # Markup
]:
# Plain text: ~3000 bytes per page
return max(1, file_size // 3000)
# Image Files
# Each image is typically processed as 1 page
elif file_ext in [
".jpg",
".jpeg", # JPEG
".png", # PNG
".gif", # GIF
".bmp", # Bitmap
".tiff", # TIFF
".webp", # WebP
".svg", # SVG
".cgm", # Computer Graphics Metafile
".odg",
".pbd", # OpenDocument Graphics
]:
# Each image = 1 page
return 1
# Audio Files (transcription = typically 1 page per minute)
# Note: These should be handled by audio transcription flow, not ETL
elif file_ext in [".mp3", ".m4a", ".wav", ".mpga"]:
# Audio files: estimate based on duration
# Fallback: ~1MB per minute of audio, 1 page per minute transcript
return max(1, file_size // (1024 * 1024))
# Video Files (typically not processed for pages, but just in case)
elif file_ext in [".mp4", ".mpeg", ".webm"]:
# Video files: very rough estimate
# Typically wouldn't be page-based, but use conservative estimate
return max(1, file_size // (5 * 1024 * 1024))
# Other/Unknown Document Types
else:
# Conservative estimate: ~80KB per page
# This catches: .sgl, .sxg, .uof, .uos1, .uos2, .web, and any future formats
return max(1, file_size // (80 * 1024))
return self.estimate_pages_from_metadata(file_ext, file_size)
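A condensed sketch of the size-based estimation introduced above may help when reading the long extension tables. It keeps the same input handling (stringified sizes, bare extensions) but covers only a handful of extensions; `BYTES_PER_PAGE` and `estimate_pages` are hypothetical names, and the divisors are copied from the diff.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Bytes-per-page heuristics taken from the diff, for a few extensions only.
BYTES_PER_PAGE = {
    ".pdf": 100 * 1024,
    ".docx": 50 * 1024,
    ".pptx": 200 * 1024,
    ".txt": 3000,
}
DEFAULT_BYTES_PER_PAGE = 80 * 1024  # fallback used for unknown extensions


def estimate_pages(file_name_or_ext: str, file_size: int | str | None = None) -> int:
    """Estimate a page count from a file name (or bare extension) and byte size."""
    try:
        size = int(file_size) if file_size is not None else 0
    except (ValueError, TypeError):
        size = 0  # unparsable size from a cloud API
    if size <= 0:
        return 1

    ext = PurePosixPath(file_name_or_ext).suffix.lower() if file_name_or_ext else ""
    # A bare extension such as ".pdf" may yield no suffix, so use it as-is.
    if not ext and file_name_or_ext.startswith("."):
        ext = file_name_or_ext.lower()

    return max(1, size // BYTES_PER_PAGE.get(ext, DEFAULT_BYTES_PER_PAGE))
```

Because the function touches neither the filesystem nor the database, it can be called with metadata from a cloud connector before any download happens.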


@ -0,0 +1,158 @@
"""Vision autocomplete service — agent-based with scoped filesystem.
Optimized pipeline:
1. Start the SSE stream immediately so the UI shows progress.
2. Derive a KB search query from window_title (no separate LLM call).
3. Run KB filesystem pre-computation and agent graph compilation in PARALLEL.
4. Inject pre-computed KB files as initial state and stream the agent.
"""
import logging
from collections.abc import AsyncGenerator
from langchain_core.messages import HumanMessage
from sqlalchemy.ext.asyncio import AsyncSession
from app.agents.autocomplete import create_autocomplete_agent, stream_autocomplete_agent
from app.services.llm_service import get_vision_llm
from app.services.new_streaming_service import VercelStreamingService
logger = logging.getLogger(__name__)
PREP_STEP_ID = "autocomplete-prep"
def _derive_kb_query(app_name: str, window_title: str) -> str:
parts = [p for p in (window_title, app_name) if p]
return " ".join(parts)
def _is_vision_unsupported_error(e: Exception) -> bool:
msg = str(e).lower()
return "content must be a string" in msg or "does not support image" in msg
# ---------------------------------------------------------------------------
# Main entry point
# ---------------------------------------------------------------------------
async def stream_vision_autocomplete(
screenshot_data_url: str,
search_space_id: int,
session: AsyncSession,
*,
app_name: str = "",
window_title: str = "",
) -> AsyncGenerator[str, None]:
"""Analyze a screenshot with a vision-LLM agent and stream a text completion."""
streaming = VercelStreamingService()
vision_error_msg = (
"The selected model does not support vision. "
"Please set a vision-capable model (e.g. GPT-4o, Gemini) in your search space settings."
)
llm = await get_vision_llm(session, search_space_id)
if not llm:
yield streaming.format_message_start()
yield streaming.format_error("No Vision LLM configured for this search space")
yield streaming.format_done()
return
# Start SSE stream immediately so the UI has something to show
yield streaming.format_message_start()
kb_query = _derive_kb_query(app_name, window_title)
# Show a preparation step while KB search + agent compile run
yield streaming.format_thinking_step(
step_id=PREP_STEP_ID,
title="Searching knowledge base",
status="in_progress",
items=[kb_query] if kb_query else [],
)
try:
agent, kb = await create_autocomplete_agent(
llm,
search_space_id=search_space_id,
kb_query=kb_query,
app_name=app_name,
window_title=window_title,
)
except Exception as e:
if _is_vision_unsupported_error(e):
logger.warning("Vision autocomplete: model does not support vision: %s", e)
yield streaming.format_error(vision_error_msg)
yield streaming.format_done()
return
logger.error("Failed to create autocomplete agent: %s", e, exc_info=True)
yield streaming.format_error("Autocomplete failed. Please try again.")
yield streaming.format_done()
return
has_kb = kb.has_documents
doc_count = len(kb.files) if has_kb else 0 # type: ignore[arg-type]
yield streaming.format_thinking_step(
step_id=PREP_STEP_ID,
title="Searching knowledge base",
status="complete",
items=[f"Found {doc_count} document{'s' if doc_count != 1 else ''}"]
if kb_query
else ["Skipped"],
)
# Build agent input with pre-computed KB as initial state
if has_kb:
instruction = (
"Analyze this screenshot, then explore the knowledge base documents "
"listed above — read the chunk index of any document whose title "
"looks relevant and check matched chunks for useful facts. "
"Finally, generate a concise autocomplete for the active text area, "
"enhanced with any relevant KB information you found."
)
else:
instruction = (
"Analyze this screenshot and generate a concise autocomplete "
"for the active text area based on what you see."
)
user_message = HumanMessage(
content=[
{"type": "text", "text": instruction},
{"type": "image_url", "image_url": {"url": screenshot_data_url}},
]
)
input_data: dict = {"messages": [user_message]}
if has_kb:
input_data["files"] = kb.files
input_data["messages"] = [kb.ls_ai_msg, kb.ls_tool_msg, user_message]
logger.info(
"Autocomplete: injected %d KB files into agent initial state", doc_count
)
else:
logger.info(
"Autocomplete: no KB documents found, proceeding with screenshot only"
)
# Stream the agent (message_start already sent above)
try:
async for sse in stream_autocomplete_agent(
agent,
input_data,
streaming,
emit_message_start=False,
):
yield sse
except Exception as e:
if _is_vision_unsupported_error(e):
logger.warning("Vision autocomplete: model does not support vision: %s", e)
yield streaming.format_error(vision_error_msg)
yield streaming.format_done()
else:
logger.error("Vision autocomplete streaming error: %s", e, exc_info=True)
yield streaming.format_error("Autocomplete failed. Please try again.")
yield streaming.format_done()
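The helper logic in the autocomplete service above can be checked without LangChain or a database. The sketch below restates the query derivation and the error classification from the diff, plus a plain-dict version of the multimodal message content; `build_user_content` is a hypothetical helper, and the data URL is a made-up placeholder.

```python
def derive_kb_query(app_name: str, window_title: str) -> str:
    """Join the non-empty parts, window title first, into a KB search query."""
    parts = [p for p in (window_title, app_name) if p]
    return " ".join(parts)


def is_vision_unsupported_error(e: Exception) -> bool:
    """Heuristic match on provider error text for models without image input."""
    msg = str(e).lower()
    return "content must be a string" in msg or "does not support image" in msg


def build_user_content(instruction: str, screenshot_data_url: str) -> list[dict]:
    """Content list in the text + image_url shape used for the HumanMessage."""
    return [
        {"type": "text", "text": instruction},
        {"type": "image_url", "image_url": {"url": screenshot_data_url}},
    ]


content = build_user_content("Analyze this screenshot.", "data:image/png;base64,AAAA")
```

The error check is string-based, so it presumably depends on the exact wording providers return; any message outside those two phrases falls through to the generic failure path.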


@ -0,0 +1,193 @@
import logging
from typing import Any
from litellm import Router
logger = logging.getLogger(__name__)
VISION_AUTO_MODE_ID = 0
VISION_PROVIDER_MAP = {
"OPENAI": "openai",
"ANTHROPIC": "anthropic",
"GOOGLE": "gemini",
"AZURE_OPENAI": "azure",
"VERTEX_AI": "vertex_ai",
"BEDROCK": "bedrock",
"XAI": "xai",
"OPENROUTER": "openrouter",
"OLLAMA": "ollama_chat",
"GROQ": "groq",
"TOGETHER_AI": "together_ai",
"FIREWORKS_AI": "fireworks_ai",
"DEEPSEEK": "openai",
"MISTRAL": "mistral",
"CUSTOM": "custom",
}
class VisionLLMRouterService:
_instance = None
_router: Router | None = None
_model_list: list[dict] = []
_router_settings: dict = {}
_initialized: bool = False
def __new__(cls):
if cls._instance is None:
cls._instance = super().__new__(cls)
return cls._instance
@classmethod
def get_instance(cls) -> "VisionLLMRouterService":
if cls._instance is None:
cls._instance = cls()
return cls._instance
@classmethod
def initialize(
cls,
global_configs: list[dict],
router_settings: dict | None = None,
) -> None:
instance = cls.get_instance()
if instance._initialized:
logger.debug("Vision LLM Router already initialized, skipping")
return
model_list = []
for config in global_configs:
deployment = cls._config_to_deployment(config)
if deployment:
model_list.append(deployment)
if not model_list:
logger.warning(
"No valid vision LLM configs found for router initialization"
)
return
instance._model_list = model_list
instance._router_settings = router_settings or {}
default_settings = {
"routing_strategy": "usage-based-routing",
"num_retries": 3,
"allowed_fails": 3,
"cooldown_time": 60,
"retry_after": 5,
}
final_settings = {**default_settings, **instance._router_settings}
try:
instance._router = Router(
model_list=model_list,
routing_strategy=final_settings.get(
"routing_strategy", "usage-based-routing"
),
num_retries=final_settings.get("num_retries", 3),
allowed_fails=final_settings.get("allowed_fails", 3),
cooldown_time=final_settings.get("cooldown_time", 60),
set_verbose=False,
)
instance._initialized = True
logger.info(
"Vision LLM Router initialized with %d deployments, strategy: %s",
len(model_list),
final_settings.get("routing_strategy"),
)
except Exception as e:
logger.error(f"Failed to initialize Vision LLM Router: {e}")
instance._router = None
@classmethod
def _config_to_deployment(cls, config: dict) -> dict | None:
try:
if not config.get("model_name") or not config.get("api_key"):
return None
if config.get("custom_provider"):
model_string = f"{config['custom_provider']}/{config['model_name']}"
else:
provider = config.get("provider", "").upper()
provider_prefix = VISION_PROVIDER_MAP.get(provider, provider.lower())
model_string = f"{provider_prefix}/{config['model_name']}"
litellm_params: dict[str, Any] = {
"model": model_string,
"api_key": config.get("api_key"),
}
if config.get("api_base"):
litellm_params["api_base"] = config["api_base"]
if config.get("api_version"):
litellm_params["api_version"] = config["api_version"]
if config.get("litellm_params"):
litellm_params.update(config["litellm_params"])
deployment: dict[str, Any] = {
"model_name": "auto",
"litellm_params": litellm_params,
}
if config.get("rpm"):
deployment["rpm"] = config["rpm"]
if config.get("tpm"):
deployment["tpm"] = config["tpm"]
return deployment
except Exception as e:
logger.warning(f"Failed to convert vision config to deployment: {e}")
return None
@classmethod
def get_router(cls) -> Router | None:
instance = cls.get_instance()
return instance._router
@classmethod
def is_initialized(cls) -> bool:
instance = cls.get_instance()
return instance._initialized and instance._router is not None
@classmethod
def get_model_count(cls) -> int:
instance = cls.get_instance()
return len(instance._model_list)
def is_vision_auto_mode(config_id: int | None) -> bool:
return config_id == VISION_AUTO_MODE_ID
def build_vision_model_string(
provider: str, model_name: str, custom_provider: str | None
) -> str:
if custom_provider:
return f"{custom_provider}/{model_name}"
prefix = VISION_PROVIDER_MAP.get(provider.upper(), provider.lower())
return f"{prefix}/{model_name}"
def get_global_vision_llm_config(config_id: int) -> dict | None:
from app.config import config
if config_id == VISION_AUTO_MODE_ID:
return {
"id": VISION_AUTO_MODE_ID,
"name": "Auto (Fastest)",
"provider": "AUTO",
"model_name": "auto",
"is_auto_mode": True,
}
if config_id > 0:
return None
for cfg in config.GLOBAL_VISION_LLM_CONFIGS:
if cfg.get("id") == config_id:
return cfg
return None
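The model-string rule used throughout this file (custom provider wins, otherwise a mapped prefix, otherwise the lower-cased provider name) is easy to verify on its own. This sketch uses a four-entry excerpt of the provider map; `PROVIDER_MAP` and `build_model_string` are stand-in names for illustration, and the model names are examples only.

```python
from __future__ import annotations

# Excerpt of the provider-to-prefix mapping from the diff.
PROVIDER_MAP = {
    "OPENAI": "openai",
    "GOOGLE": "gemini",
    "AZURE_OPENAI": "azure",
    "OLLAMA": "ollama_chat",
}


def build_model_string(
    provider: str, model_name: str, custom_provider: str | None = None
) -> str:
    """Return a '<prefix>/<model>' string in the form LiteLLM expects."""
    if custom_provider:
        return f"{custom_provider}/{model_name}"
    # Unknown providers fall back to their lower-cased name as the prefix.
    prefix = PROVIDER_MAP.get(provider.upper(), provider.lower())
    return f"{prefix}/{model_name}"
```

The lookup is case-insensitive on the provider, which matters for global configs where the provider is a free-form YAML string rather than an enum.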


@ -0,0 +1,134 @@
"""
Service for fetching and caching the vision-capable model list.
Reuses the same OpenRouter public API and local fallback as the LLM model
list service, but filters for models that accept image input.
"""
import json
import logging
import time
from pathlib import Path
import httpx
logger = logging.getLogger(__name__)
OPENROUTER_API_URL = "https://openrouter.ai/api/v1/models"
FALLBACK_FILE = (
Path(__file__).parent.parent / "config" / "vision_model_list_fallback.json"
)
CACHE_TTL_SECONDS = 86400 # 24 hours
_cache: list[dict] | None = None
_cache_timestamp: float = 0
OPENROUTER_SLUG_TO_VISION_PROVIDER: dict[str, str] = {
"openai": "OPENAI",
"anthropic": "ANTHROPIC",
"google": "GOOGLE",
"mistralai": "MISTRAL",
"x-ai": "XAI",
}
def _format_context_length(length: int | None) -> str | None:
if not length:
return None
if length >= 1_000_000:
return f"{length / 1_000_000:g}M"
if length >= 1_000:
return f"{length / 1_000:g}K"
return str(length)
async def _fetch_from_openrouter() -> list[dict] | None:
try:
async with httpx.AsyncClient(timeout=15) as client:
response = await client.get(OPENROUTER_API_URL)
response.raise_for_status()
data = response.json()
return data.get("data", [])
except Exception as e:
logger.warning("Failed to fetch from OpenRouter API for vision models: %s", e)
return None
def _load_fallback() -> list[dict]:
try:
with open(FALLBACK_FILE, encoding="utf-8") as f:
return json.load(f)
except Exception as e:
logger.error("Failed to load vision model fallback list: %s", e)
return []
def _is_vision_model(model: dict) -> bool:
"""Return True if the model accepts image input and outputs text."""
arch = model.get("architecture", {})
input_mods = arch.get("input_modalities", [])
output_mods = arch.get("output_modalities", [])
return "image" in input_mods and "text" in output_mods
def _process_vision_models(raw_models: list[dict]) -> list[dict]:
processed: list[dict] = []
for model in raw_models:
model_id: str = model.get("id", "")
name: str = model.get("name", "")
context_length = model.get("context_length")
if "/" not in model_id:
continue
if not _is_vision_model(model):
continue
provider_slug, model_name = model_id.split("/", 1)
context_window = _format_context_length(context_length)
processed.append(
{
"value": model_id,
"label": name,
"provider": "OPENROUTER",
"context_window": context_window,
}
)
native_provider = OPENROUTER_SLUG_TO_VISION_PROVIDER.get(provider_slug)
if native_provider:
if native_provider == "GOOGLE" and not model_name.startswith("gemini-"):
continue
processed.append(
{
"value": model_name,
"label": name,
"provider": native_provider,
"context_window": context_window,
}
)
return processed
async def get_vision_model_list() -> list[dict]:
global _cache, _cache_timestamp
if _cache is not None and (time.time() - _cache_timestamp) < CACHE_TTL_SECONDS:
return _cache
raw_models = await _fetch_from_openrouter()
if raw_models is None:
logger.info("Using fallback vision model list")
return _load_fallback()
processed = _process_vision_models(raw_models)
_cache = processed
_cache_timestamp = time.time()
return processed
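To illustrate the filtering and formatting steps in this service, the sketch below runs them over two made-up records shaped like the fields the code reads (id, name, context_length, architecture). The sample data is an assumption for demonstration and is not real OpenRouter output.

```python
from __future__ import annotations


def format_context_length(length: int | None) -> str | None:
    """Render a token count as a short label such as '128K' or '1M'."""
    if not length:
        return None
    if length >= 1_000_000:
        return f"{length / 1_000_000:g}M"
    if length >= 1_000:
        return f"{length / 1_000:g}K"
    return str(length)


def is_vision_model(model: dict) -> bool:
    """True if the model accepts image input and produces text output."""
    arch = model.get("architecture", {})
    return "image" in arch.get("input_modalities", []) and "text" in arch.get(
        "output_modalities", []
    )


# Hypothetical sample records for the sketch.
sample = [
    {
        "id": "openai/gpt-4o",
        "name": "GPT-4o",
        "context_length": 128000,
        "architecture": {"input_modalities": ["text", "image"], "output_modalities": ["text"]},
    },
    {
        "id": "example/text-only",
        "name": "Text Only",
        "context_length": 8192,
        "architecture": {"input_modalities": ["text"], "output_modalities": ["text"]},
    },
]
vision_ids = [m["id"] for m in sample if is_vision_model(m)]
```

Records missing the architecture key are treated as non-vision models, since both modality lists default to empty.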


@ -1,6 +1,7 @@
"""Celery tasks for document processing."""
import asyncio
import contextlib
import logging
import os
from uuid import UUID
@ -10,6 +11,7 @@ from app.config import config
from app.services.notification_service import NotificationService
from app.services.task_logging_service import TaskLoggingService
from app.tasks.celery_tasks import get_celery_session_maker
from app.tasks.connector_indexers.local_folder_indexer import index_local_folder
from app.tasks.document_processors import (
add_extension_received_document,
add_youtube_video_document,
@ -141,21 +143,30 @@ async def _delete_document_background(document_id: int) -> None:
retry_backoff_max=300,
max_retries=5,
)
def delete_folder_documents_task(self, document_ids: list[int]):
"""Celery task to batch-delete documents orphaned by folder deletion."""
def delete_folder_documents_task(
self,
document_ids: list[int],
folder_subtree_ids: list[int] | None = None,
):
"""Celery task to delete documents first, then the folder rows."""
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(_delete_folder_documents(document_ids))
loop.run_until_complete(
_delete_folder_documents(document_ids, folder_subtree_ids)
)
finally:
loop.close()
async def _delete_folder_documents(document_ids: list[int]) -> None:
"""Delete chunks in batches, then document rows for each orphaned document."""
async def _delete_folder_documents(
document_ids: list[int],
folder_subtree_ids: list[int] | None = None,
) -> None:
"""Delete chunks in batches, then document rows, then folder rows."""
from sqlalchemy import delete as sa_delete, select
from app.db import Chunk, Document
from app.db import Chunk, Document, Folder
async with get_celery_session_maker()() as session:
batch_size = 500
@ -177,6 +188,12 @@ async def _delete_folder_documents(document_ids: list[int]) -> None:
await session.delete(doc)
await session.commit()
if folder_subtree_ids:
await session.execute(
sa_delete(Folder).where(Folder.id.in_(folder_subtree_ids))
)
await session.commit()
@celery_app.task(
name="delete_search_space_background",
@ -1243,3 +1260,154 @@ async def _process_circleback_meeting(
heartbeat_task.cancel()
if notification:
_stop_heartbeat(notification.id)
# ===== Local folder indexing task =====
@celery_app.task(name="index_local_folder", bind=True)
def index_local_folder_task(
self,
search_space_id: int,
user_id: str,
folder_path: str,
folder_name: str,
exclude_patterns: list[str] | None = None,
file_extensions: list[str] | None = None,
root_folder_id: int | None = None,
enable_summary: bool = False,
target_file_paths: list[str] | None = None,
):
"""Celery task to index a local folder. Config is passed directly — no connector row."""
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
_index_local_folder_async(
search_space_id=search_space_id,
user_id=user_id,
folder_path=folder_path,
folder_name=folder_name,
exclude_patterns=exclude_patterns,
file_extensions=file_extensions,
root_folder_id=root_folder_id,
enable_summary=enable_summary,
target_file_paths=target_file_paths,
)
)
finally:
loop.close()
async def _index_local_folder_async(
search_space_id: int,
user_id: str,
folder_path: str,
folder_name: str,
exclude_patterns: list[str] | None = None,
file_extensions: list[str] | None = None,
root_folder_id: int | None = None,
enable_summary: bool = False,
target_file_paths: list[str] | None = None,
):
"""Run local folder indexing with notification + heartbeat."""
is_batch = bool(target_file_paths)
is_full_scan = not target_file_paths
file_count = len(target_file_paths) if target_file_paths else None
if is_batch:
doc_name = f"{folder_name} ({file_count} file{'s' if file_count != 1 else ''})"
else:
doc_name = folder_name
notification = None
notification_id: int | None = None
heartbeat_task = None
async with get_celery_session_maker()() as session:
try:
notification = (
await NotificationService.document_processing.notify_processing_started(
session=session,
user_id=UUID(user_id),
document_type="LOCAL_FOLDER_FILE",
document_name=doc_name,
search_space_id=search_space_id,
)
)
notification_id = notification.id
_start_heartbeat(notification_id)
heartbeat_task = asyncio.create_task(_run_heartbeat_loop(notification_id))
except Exception:
logger.warning(
"Failed to create notification for local folder indexing",
exc_info=True,
)
async def _heartbeat_progress(completed_count: int) -> None:
"""Refresh heartbeat and optionally update notification progress."""
if notification:
with contextlib.suppress(Exception):
await NotificationService.document_processing.notify_processing_progress(
session=session,
notification=notification,
stage="indexing",
stage_message=f"Syncing files ({completed_count}/{file_count or '?'})",
)
try:
_indexed, _skipped_or_failed, _rfid, err = await index_local_folder(
session=session,
search_space_id=search_space_id,
user_id=user_id,
folder_path=folder_path,
folder_name=folder_name,
exclude_patterns=exclude_patterns,
file_extensions=file_extensions,
root_folder_id=root_folder_id,
enable_summary=enable_summary,
target_file_paths=target_file_paths,
on_heartbeat_callback=_heartbeat_progress
if (is_batch or is_full_scan)
else None,
)
if notification:
try:
await session.refresh(notification)
if err:
await NotificationService.document_processing.notify_processing_completed(
session=session,
notification=notification,
error_message=err,
)
else:
await NotificationService.document_processing.notify_processing_completed(
session=session,
notification=notification,
)
except Exception:
logger.warning(
"Failed to update notification after local folder indexing",
exc_info=True,
)
except Exception as e:
logger.exception(f"Local folder indexing failed: {e}")
if notification:
try:
await session.refresh(notification)
await NotificationService.document_processing.notify_processing_completed(
session=session,
notification=notification,
error_message=str(e)[:200],
)
except Exception:
pass
raise
finally:
if heartbeat_task:
heartbeat_task.cancel()
if notification_id is not None:
_stop_heartbeat(notification_id)

@ -39,7 +39,6 @@ from app.agents.new_chat.llm_config import (
)
from app.db import (
ChatVisibility,
Document,
NewChatMessage,
NewChatThread,
Report,
@ -63,74 +62,6 @@ _perf_log = get_perf_logger()
_background_tasks: set[asyncio.Task] = set()
def format_mentioned_documents_as_context(documents: list[Document]) -> str:
"""
Format mentioned documents as context for the agent.
Uses the same XML structure as knowledge_base.format_documents_for_context
to ensure citations work properly with chunk IDs.
"""
if not documents:
return ""
context_parts = ["<mentioned_documents>"]
context_parts.append(
"The user has explicitly mentioned the following documents from their knowledge base. "
"These documents are directly relevant to the query and should be prioritized as primary sources. "
"Use [citation:CHUNK_ID] format for citations (e.g., [citation:123])."
)
context_parts.append("")
for doc in documents:
# Build metadata JSON
metadata = doc.document_metadata or {}
metadata_json = json.dumps(metadata, ensure_ascii=False)
# Get URL from metadata
url = (
metadata.get("url")
or metadata.get("source")
or metadata.get("page_url")
or ""
)
context_parts.append("<document>")
context_parts.append("<document_metadata>")
context_parts.append(f" <document_id>{doc.id}</document_id>")
context_parts.append(
f" <document_type>{doc.document_type.value}</document_type>"
)
context_parts.append(f" <title><![CDATA[{doc.title}]]></title>")
context_parts.append(f" <url><![CDATA[{url}]]></url>")
context_parts.append(
f" <metadata_json><![CDATA[{metadata_json}]]></metadata_json>"
)
context_parts.append("</document_metadata>")
context_parts.append("")
context_parts.append("<document_content>")
# Use chunks if available (preferred for proper citations)
if hasattr(doc, "chunks") and doc.chunks:
for chunk in doc.chunks:
context_parts.append(
f" <chunk id='{chunk.id}'><![CDATA[{chunk.content}]]></chunk>"
)
else:
# Fallback to document content if chunks not loaded
# Use document ID as chunk ID prefix for consistency
context_parts.append(
f" <chunk id='{doc.id}'><![CDATA[{doc.content}]]></chunk>"
)
context_parts.append("</document_content>")
context_parts.append("</document>")
context_parts.append("")
context_parts.append("</mentioned_documents>")
return "\n".join(context_parts)
def format_mentioned_surfsense_docs_as_context(
documents: list[SurfsenseDocsDocument],
) -> str:
@ -1317,6 +1248,7 @@ async def stream_new_chat(
firecrawl_api_key=firecrawl_api_key,
thread_visibility=visibility,
disabled_tools=disabled_tools,
mentioned_document_ids=mentioned_document_ids,
)
_perf_log.info(
"[stream_new_chat] Agent created in %.3fs", time.perf_counter() - _t0
@ -1340,18 +1272,9 @@ async def stream_new_chat(
thread.needs_history_bootstrap = False
await session.commit()
# Fetch mentioned documents if any (with chunks for proper citations)
mentioned_documents: list[Document] = []
if mentioned_document_ids:
result = await session.execute(
select(Document)
.options(selectinload(Document.chunks))
.filter(
Document.id.in_(mentioned_document_ids),
Document.search_space_id == search_space_id,
)
)
mentioned_documents = list(result.scalars().all())
# Mentioned KB documents are now handled by KnowledgeBaseSearchMiddleware
# which merges them into the scoped filesystem with full document
# structure. Only SurfSense docs and report context are inlined here.
# Fetch mentioned SurfSense docs if any
mentioned_surfsense_docs: list[SurfsenseDocsDocument] = []
@ -1379,15 +1302,10 @@ async def stream_new_chat(
)
recent_reports = list(recent_reports_result.scalars().all())
# Format the user query with context (mentioned documents + SurfSense docs)
# Format the user query with context (SurfSense docs + reports only)
final_query = user_query
context_parts = []
if mentioned_documents:
context_parts.append(
format_mentioned_documents_as_context(mentioned_documents)
)
if mentioned_surfsense_docs:
context_parts.append(
format_mentioned_surfsense_docs_as_context(mentioned_surfsense_docs)
@ -1479,7 +1397,7 @@ async def stream_new_chat(
yield streaming_service.format_start_step()
# Initial thinking step - analyzing the request
if mentioned_documents or mentioned_surfsense_docs:
if mentioned_surfsense_docs:
initial_title = "Analyzing referenced content"
action_verb = "Analyzing"
else:
@ -1490,18 +1408,6 @@ async def stream_new_chat(
query_text = user_query[:80] + ("..." if len(user_query) > 80 else "")
processing_parts.append(query_text)
if mentioned_documents:
doc_names = []
for doc in mentioned_documents:
title = doc.title
if len(title) > 30:
title = title[:27] + "..."
doc_names.append(title)
if len(doc_names) == 1:
processing_parts.append(f"[{doc_names[0]}]")
else:
processing_parts.append(f"[{len(doc_names)} documents]")
if mentioned_surfsense_docs:
doc_names = []
for doc in mentioned_surfsense_docs:
@ -1527,7 +1433,7 @@ async def stream_new_chat(
# These ORM objects (with eagerly-loaded chunks) can be very large.
# They're only needed to build context strings already copied into
# final_query / langchain_messages — release them before streaming.
del mentioned_documents, mentioned_surfsense_docs, recent_reports
del mentioned_surfsense_docs, recent_reports
del langchain_messages, final_query
# Check if this is the first assistant response so we can generate

@ -42,9 +42,9 @@ from .jira_indexer import index_jira_issues
# Issue tracking and project management
from .linear_indexer import index_linear_issues
from .luma_indexer import index_luma_events
# Documentation and knowledge management
from .luma_indexer import index_luma_events
from .notion_indexer import index_notion_pages
from .obsidian_indexer import index_obsidian_vault
from .slack_indexer import index_slack_messages

@ -28,6 +28,7 @@ from app.indexing_pipeline.connector_document import ConnectorDocument
from app.indexing_pipeline.document_hashing import compute_identifier_hash
from app.indexing_pipeline.indexing_pipeline_service import IndexingPipelineService
from app.services.llm_service import get_user_long_context_llm
from app.services.page_limit_service import PageLimitService
from app.services.task_logging_service import TaskLoggingService
from app.tasks.connector_indexers.base import (
check_document_by_unique_identifier,
@ -50,7 +51,10 @@ async def _should_skip_file(
file_id = file.get("id", "")
file_name = file.get("name", "Unknown")
if skip_item(file):
skip, unsup_ext = skip_item(file)
if skip:
if unsup_ext:
return True, f"unsupported:{unsup_ext}"
return True, "folder/non-downloadable"
if not file_id:
return True, "missing file_id"
@ -250,6 +254,121 @@ async def _download_and_index(
return batch_indexed, download_failed + batch_failed
async def _remove_document(session: AsyncSession, file_id: str, search_space_id: int):
"""Remove a document that was deleted in Dropbox."""
primary_hash = compute_identifier_hash(
DocumentType.DROPBOX_FILE.value, file_id, search_space_id
)
existing = await check_document_by_unique_identifier(session, primary_hash)
if not existing:
result = await session.execute(
select(Document).where(
Document.search_space_id == search_space_id,
Document.document_type == DocumentType.DROPBOX_FILE,
cast(Document.document_metadata["dropbox_file_id"], String) == file_id,
)
)
existing = result.scalar_one_or_none()
if existing:
await session.delete(existing)
async def _index_with_delta_sync(
dropbox_client: DropboxClient,
session: AsyncSession,
connector_id: int,
search_space_id: int,
user_id: str,
cursor: str,
task_logger: TaskLoggingService,
log_entry: object,
max_files: int,
on_heartbeat_callback: HeartbeatCallbackType | None = None,
enable_summary: bool = True,
) -> tuple[int, int, int, str]:
"""Delta sync using Dropbox cursor-based change tracking.
Returns (indexed_count, skipped_count, unsupported_count, new_cursor).
"""
await task_logger.log_task_progress(
log_entry,
f"Starting delta sync from cursor: {cursor[:20]}...",
{"stage": "delta_sync", "cursor_prefix": cursor[:20]},
)
entries, new_cursor, error = await dropbox_client.get_changes(cursor)
if error:
err_lower = error.lower()
if "401" in error or "authentication expired" in err_lower:
raise Exception(
f"Dropbox authentication failed. Please re-authenticate. (Error: {error})"
)
raise Exception(f"Failed to fetch Dropbox changes: {error}")
if not entries:
logger.info("No changes detected since last sync")
return 0, 0, 0, new_cursor or cursor
logger.info(f"Processing {len(entries)} change entries")
renamed_count = 0
skipped = 0
unsupported_count = 0
files_to_download: list[dict] = []
files_processed = 0
for entry in entries:
if files_processed >= max_files:
break
files_processed += 1
tag = entry.get(".tag")
if tag == "deleted":
path_lower = entry.get("path_lower", "")
name = entry.get("name", "")
file_id = entry.get("id", "")
if file_id:
await _remove_document(session, file_id, search_space_id)
logger.debug(f"Processed deletion: {name or path_lower}")
continue
if tag != "file":
continue
skip, msg = await _should_skip_file(session, entry, search_space_id)
if skip:
if msg and msg.startswith("unsupported:"):
unsupported_count += 1
elif msg and "renamed" in msg.lower():
renamed_count += 1
else:
skipped += 1
continue
files_to_download.append(entry)
batch_indexed, failed = await _download_and_index(
dropbox_client,
session,
files_to_download,
connector_id=connector_id,
search_space_id=search_space_id,
user_id=user_id,
enable_summary=enable_summary,
on_heartbeat=on_heartbeat_callback,
)
indexed = renamed_count + batch_indexed
logger.info(
f"Delta sync complete: {indexed} indexed, {skipped} skipped, "
f"{unsupported_count} unsupported, {failed} failed"
)
return indexed, skipped, unsupported_count, new_cursor or cursor
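
The loops in this diff bucket each skip reason in a fixed order: an `unsupported:` prefix is checked first, then a case-insensitive `renamed` match, and everything else counts as a plain skip. A minimal sketch of that precedence, assuming only the message strings seen in the diff (`classify_skip` is a hypothetical helper, not a function in the codebase):

```python
from collections import Counter
from typing import Optional


def classify_skip(msg: Optional[str]) -> str:
    """Map a skip message to the counter it would increment."""
    if msg and msg.startswith("unsupported:"):
        return "unsupported"
    if msg and "renamed" in msg.lower():
        return "renamed"
    return "skipped"


def tally(messages: list[Optional[str]]) -> Counter:
    """Count skip reasons per bucket, as the indexing loops do inline."""
    return Counter(classify_skip(m) for m in messages)
```

Because the prefix check runs first, a message such as `unsupported:.renamed` is counted as unsupported rather than as a rename.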
async def _index_full_scan(
dropbox_client: DropboxClient,
session: AsyncSession,
@ -265,8 +384,11 @@ async def _index_full_scan(
incremental_sync: bool = True,
on_heartbeat_callback: HeartbeatCallbackType | None = None,
enable_summary: bool = True,
) -> tuple[int, int]:
"""Full scan indexing of a folder."""
) -> tuple[int, int, int]:
"""Full scan indexing of a folder.
Returns (indexed, skipped, unsupported_count).
"""
await task_logger.log_task_progress(
log_entry,
f"Starting full scan of folder: {folder_name}",
@ -278,8 +400,15 @@ async def _index_full_scan(
},
)
page_limit_service = PageLimitService(session)
pages_used, pages_limit = await page_limit_service.get_page_usage(user_id)
remaining_quota = pages_limit - pages_used
batch_estimated_pages = 0
page_limit_reached = False
renamed_count = 0
skipped = 0
unsupported_count = 0
files_to_download: list[dict] = []
all_files, error = await get_files_in_folder(
@ -299,14 +428,36 @@ async def _index_full_scan(
if incremental_sync:
skip, msg = await _should_skip_file(session, file, search_space_id)
if skip:
if msg and "renamed" in msg.lower():
if msg and msg.startswith("unsupported:"):
unsupported_count += 1
elif msg and "renamed" in msg.lower():
renamed_count += 1
else:
skipped += 1
continue
elif skip_item(file):
else:
item_skip, item_unsup = skip_item(file)
if item_skip:
if item_unsup:
unsupported_count += 1
else:
skipped += 1
continue
file_pages = PageLimitService.estimate_pages_from_metadata(
file.get("name", ""), file.get("size")
)
if batch_estimated_pages + file_pages > remaining_quota:
if not page_limit_reached:
logger.warning(
"Page limit reached during Dropbox full scan, "
"skipping remaining files"
)
page_limit_reached = True
skipped += 1
continue
batch_estimated_pages += file_pages
files_to_download.append(file)
batch_indexed, failed = await _download_and_index(
@ -320,11 +471,20 @@ async def _index_full_scan(
on_heartbeat=on_heartbeat_callback,
)
if batch_indexed > 0 and files_to_download and batch_estimated_pages > 0:
pages_to_deduct = max(
1, batch_estimated_pages * batch_indexed // len(files_to_download)
)
await page_limit_service.update_page_usage(
user_id, pages_to_deduct, allow_exceed=True
)
indexed = renamed_count + batch_indexed
logger.info(
f"Full scan complete: {indexed} indexed, {skipped} skipped, {failed} failed"
f"Full scan complete: {indexed} indexed, {skipped} skipped, "
f"{unsupported_count} unsupported, {failed} failed"
)
return indexed, skipped
return indexed, skipped, unsupported_count
async def _index_selected_files(
@ -338,12 +498,18 @@ async def _index_selected_files(
enable_summary: bool,
incremental_sync: bool = True,
on_heartbeat: HeartbeatCallbackType | None = None,
) -> tuple[int, int, list[str]]:
) -> tuple[int, int, int, list[str]]:
"""Index user-selected files using the parallel pipeline."""
page_limit_service = PageLimitService(session)
pages_used, pages_limit = await page_limit_service.get_page_usage(user_id)
remaining_quota = pages_limit - pages_used
batch_estimated_pages = 0
files_to_download: list[dict] = []
errors: list[str] = []
renamed_count = 0
skipped = 0
unsupported_count = 0
for file_path, file_name in file_paths:
file, error = await get_file_by_path(dropbox_client, file_path)
@ -355,15 +521,31 @@ async def _index_selected_files(
if incremental_sync:
skip, msg = await _should_skip_file(session, file, search_space_id)
if skip:
if msg and "renamed" in msg.lower():
if msg and msg.startswith("unsupported:"):
unsupported_count += 1
elif msg and "renamed" in msg.lower():
renamed_count += 1
else:
skipped += 1
continue
elif skip_item(file):
skipped += 1
else:
item_skip, item_unsup = skip_item(file)
if item_skip:
if item_unsup:
unsupported_count += 1
else:
skipped += 1
continue
file_pages = PageLimitService.estimate_pages_from_metadata(
file.get("name", ""), file.get("size")
)
if batch_estimated_pages + file_pages > remaining_quota:
display = file_name or file_path
errors.append(f"File '{display}': page limit would be exceeded")
continue
batch_estimated_pages += file_pages
files_to_download.append(file)
batch_indexed, _failed = await _download_and_index(
@ -377,7 +559,15 @@ async def _index_selected_files(
on_heartbeat=on_heartbeat,
)
return renamed_count + batch_indexed, skipped, errors
if batch_indexed > 0 and files_to_download and batch_estimated_pages > 0:
pages_to_deduct = max(
1, batch_estimated_pages * batch_indexed // len(files_to_download)
)
await page_limit_service.update_page_usage(
user_id, pages_to_deduct, allow_exceed=True
)
return renamed_count + batch_indexed, skipped, unsupported_count, errors
async def index_dropbox_files(
@ -386,7 +576,7 @@ async def index_dropbox_files(
search_space_id: int,
user_id: str,
items_dict: dict,
) -> tuple[int, int, str | None]:
) -> tuple[int, int, str | None, int]:
"""Index Dropbox files for a specific connector.
items_dict format:
@ -417,7 +607,7 @@ async def index_dropbox_files(
await task_logger.log_task_failure(
log_entry, error_msg, None, {"error_type": "ConnectorNotFound"}
)
return 0, 0, error_msg
return 0, 0, error_msg, 0
token_encrypted = connector.config.get("_token_encrypted", False)
if token_encrypted and not config.SECRET_KEY:
@ -428,7 +618,7 @@ async def index_dropbox_files(
"Missing SECRET_KEY",
{"error_type": "MissingSecretKey"},
)
return 0, 0, error_msg
return 0, 0, error_msg, 0
connector_enable_summary = getattr(connector, "enable_summary", True)
dropbox_client = DropboxClient(session, connector_id)
@ -437,9 +627,13 @@ async def index_dropbox_files(
max_files = indexing_options.get("max_files", 500)
incremental_sync = indexing_options.get("incremental_sync", True)
include_subfolders = indexing_options.get("include_subfolders", True)
use_delta_sync = indexing_options.get("use_delta_sync", True)
folder_cursors: dict = connector.config.get("folder_cursors", {})
total_indexed = 0
total_skipped = 0
total_unsupported = 0
selected_files = items_dict.get("files", [])
if selected_files:
@ -447,7 +641,7 @@ async def index_dropbox_files(
(f.get("path", f.get("path_lower", f.get("id", ""))), f.get("name"))
for f in selected_files
]
indexed, skipped, file_errors = await _index_selected_files(
indexed, skipped, unsupported, file_errors = await _index_selected_files(
dropbox_client,
session,
file_tuples,
@ -459,6 +653,7 @@ async def index_dropbox_files(
)
total_indexed += indexed
total_skipped += skipped
total_unsupported += unsupported
if file_errors:
logger.warning(
f"File indexing errors for connector {connector_id}: {file_errors}"
@ -471,25 +666,66 @@ async def index_dropbox_files(
)
folder_name = folder.get("name", "Root")
logger.info(f"Using full scan for folder {folder_name}")
indexed, skipped = await _index_full_scan(
dropbox_client,
session,
connector_id,
search_space_id,
user_id,
folder_path,
folder_name,
task_logger,
log_entry,
max_files,
include_subfolders,
incremental_sync=incremental_sync,
enable_summary=connector_enable_summary,
saved_cursor = folder_cursors.get(folder_path)
can_use_delta = (
use_delta_sync and saved_cursor and connector.last_indexed_at
)
if can_use_delta:
logger.info(f"Using delta sync for folder {folder_name}")
indexed, skipped, unsup, new_cursor = await _index_with_delta_sync(
dropbox_client,
session,
connector_id,
search_space_id,
user_id,
saved_cursor,
task_logger,
log_entry,
max_files,
enable_summary=connector_enable_summary,
)
folder_cursors[folder_path] = new_cursor
total_unsupported += unsup
else:
logger.info(f"Using full scan for folder {folder_name}")
indexed, skipped, unsup = await _index_full_scan(
dropbox_client,
session,
connector_id,
search_space_id,
user_id,
folder_path,
folder_name,
task_logger,
log_entry,
max_files,
include_subfolders,
incremental_sync=incremental_sync,
enable_summary=connector_enable_summary,
)
total_unsupported += unsup
total_indexed += indexed
total_skipped += skipped
# Persist latest cursor for this folder
try:
latest_cursor, cursor_err = await dropbox_client.get_latest_cursor(
folder_path
)
if latest_cursor and not cursor_err:
folder_cursors[folder_path] = latest_cursor
except Exception as e:
logger.warning(f"Failed to get latest cursor for {folder_path}: {e}")
# Persist folder cursors to connector config
if folders:
cfg = dict(connector.config)
cfg["folder_cursors"] = folder_cursors
connector.config = cfg
flag_modified(connector, "config")
if total_indexed > 0 or folders:
await update_connector_last_indexed(session, connector, True)
@ -498,12 +734,18 @@ async def index_dropbox_files(
await task_logger.log_task_success(
log_entry,
f"Successfully completed Dropbox indexing for connector {connector_id}",
{"files_processed": total_indexed, "files_skipped": total_skipped},
{
"files_processed": total_indexed,
"files_skipped": total_skipped,
"files_unsupported": total_unsupported,
},
)
logger.info(
f"Dropbox indexing completed: {total_indexed} indexed, {total_skipped} skipped"
f"Dropbox indexing completed: {total_indexed} indexed, "
f"{total_skipped} skipped, {total_unsupported} unsupported"
)
return total_indexed, total_skipped, None
return total_indexed, total_skipped, None, total_unsupported
except SQLAlchemyError as db_error:
await session.rollback()
@ -514,7 +756,7 @@ async def index_dropbox_files(
{"error_type": "SQLAlchemyError"},
)
logger.error(f"Database error: {db_error!s}", exc_info=True)
return 0, 0, f"Database error: {db_error!s}"
return 0, 0, f"Database error: {db_error!s}", 0
except Exception as e:
await session.rollback()
await task_logger.log_task_failure(
@ -524,4 +766,4 @@ async def index_dropbox_files(
{"error_type": type(e).__name__},
)
logger.error(f"Failed to index Dropbox files: {e!s}", exc_info=True)
return 0, 0, f"Failed to index Dropbox files: {e!s}"
return 0, 0, f"Failed to index Dropbox files: {e!s}", 0
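
The page-limit handling added in this file works in two steps: a file joins the download batch only if its estimated pages still fit in the remaining quota, and after indexing the deduction is prorated by the share of admitted files that were actually indexed. A minimal sketch of that arithmetic, assuming the page estimates are already known (`admit_within_quota` and `prorated_deduction` are hypothetical helper names, not functions in the codebase):

```python
def admit_within_quota(
    estimates: list[int], remaining_quota: int
) -> tuple[list[int], int]:
    """Return (indices of admitted files, total estimated pages admitted).

    A file is skipped when adding it would exceed the remaining quota;
    later, smaller files can still be admitted, as in the loops in the diff.
    """
    admitted: list[int] = []
    batch_pages = 0
    for i, pages in enumerate(estimates):
        if batch_pages + pages > remaining_quota:
            continue
        batch_pages += pages
        admitted.append(i)
    return admitted, batch_pages


def prorated_deduction(batch_pages: int, indexed: int, admitted: int) -> int:
    """Pages to deduct after a batch, scaled by the fraction indexed."""
    if indexed <= 0 or admitted <= 0 or batch_pages <= 0:
        return 0
    # Integer division rounds down; the floor of 1 ensures that a
    # successful batch always deducts at least one page.
    return max(1, batch_pages * indexed // admitted)
```

The proration is an approximation: it assumes indexed files carry an average share of the batch estimate, so the deducted amount can differ from the sum of the estimates of the files that succeeded.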

@ -25,7 +25,11 @@ from app.connectors.google_drive import (
get_files_in_folder,
get_start_page_token,
)
from app.connectors.google_drive.file_types import should_skip_file as skip_mime
from app.connectors.google_drive.file_types import (
is_google_workspace_file,
should_skip_by_extension,
should_skip_file as skip_mime,
)
from app.db import Document, DocumentStatus, DocumentType, SearchSourceConnectorType
from app.indexing_pipeline.connector_document import ConnectorDocument
from app.indexing_pipeline.document_hashing import compute_identifier_hash
@ -34,6 +38,7 @@ from app.indexing_pipeline.indexing_pipeline_service import (
PlaceholderInfo,
)
from app.services.llm_service import get_user_long_context_llm
from app.services.page_limit_service import PageLimitService
from app.services.task_logging_service import TaskLoggingService
from app.tasks.connector_indexers.base import (
check_document_by_unique_identifier,
@ -77,6 +82,10 @@ async def _should_skip_file(
if skip_mime(mime_type):
return True, "folder/shortcut"
if not is_google_workspace_file(mime_type):
ext_skip, unsup_ext = should_skip_by_extension(file_name)
if ext_skip:
return True, f"unsupported:{unsup_ext}"
if not file_id:
return True, "missing file_id"
@ -327,6 +336,12 @@ async def _process_single_file(
return 1, 0, 0
return 0, 1, 0
page_limit_service = PageLimitService(session)
estimated_pages = PageLimitService.estimate_pages_from_metadata(
file_name, file.get("size")
)
await page_limit_service.check_page_limit(user_id, estimated_pages)
markdown, drive_metadata, error = await download_and_extract_content(
drive_client, file
)
@ -363,6 +378,9 @@ async def _process_single_file(
)
await pipeline.index(document, connector_doc, user_llm)
await page_limit_service.update_page_usage(
user_id, estimated_pages, allow_exceed=True
)
logger.info(f"Successfully indexed Google Drive file: {file_name}")
return 1, 0, 0
@ -458,18 +476,24 @@ async def _index_selected_files(
user_id: str,
enable_summary: bool,
on_heartbeat: HeartbeatCallbackType | None = None,
) -> tuple[int, int, list[str]]:
) -> tuple[int, int, int, list[str]]:
"""Index user-selected files using the parallel pipeline.
Phase 1 (serial): fetch metadata + skip checks.
Phase 2+3 (parallel): download, ETL, index via _download_and_index.
Returns (indexed_count, skipped_count, errors).
Returns (indexed_count, skipped_count, unsupported_count, errors).
"""
page_limit_service = PageLimitService(session)
pages_used, pages_limit = await page_limit_service.get_page_usage(user_id)
remaining_quota = pages_limit - pages_used
batch_estimated_pages = 0
files_to_download: list[dict] = []
errors: list[str] = []
renamed_count = 0
skipped = 0
unsupported_count = 0
for file_id, file_name in file_ids:
file, error = await get_file_by_id(drive_client, file_id)
@ -480,12 +504,23 @@ async def _index_selected_files(
skip, msg = await _should_skip_file(session, file, search_space_id)
if skip:
if msg and "renamed" in msg.lower():
if msg and msg.startswith("unsupported:"):
unsupported_count += 1
elif msg and "renamed" in msg.lower():
renamed_count += 1
else:
skipped += 1
continue
file_pages = PageLimitService.estimate_pages_from_metadata(
file.get("name", ""), file.get("size")
)
if batch_estimated_pages + file_pages > remaining_quota:
display = file_name or file_id
errors.append(f"File '{display}': page limit would be exceeded")
continue
batch_estimated_pages += file_pages
files_to_download.append(file)
await _create_drive_placeholders(
@ -507,7 +542,15 @@ async def _index_selected_files(
on_heartbeat=on_heartbeat,
)
return renamed_count + batch_indexed, skipped, errors
if batch_indexed > 0 and files_to_download and batch_estimated_pages > 0:
pages_to_deduct = max(
1, batch_estimated_pages * batch_indexed // len(files_to_download)
)
await page_limit_service.update_page_usage(
user_id, pages_to_deduct, allow_exceed=True
)
return renamed_count + batch_indexed, skipped, unsupported_count, errors
# ---------------------------------------------------------------------------
@ -530,8 +573,11 @@ async def _index_full_scan(
include_subfolders: bool = False,
on_heartbeat_callback: HeartbeatCallbackType | None = None,
enable_summary: bool = True,
) -> tuple[int, int]:
"""Full scan indexing of a folder."""
) -> tuple[int, int, int]:
"""Full scan indexing of a folder.
Returns (indexed, skipped, unsupported_count).
"""
await task_logger.log_task_progress(
log_entry,
f"Starting full scan of folder: {folder_name} (include_subfolders={include_subfolders})",
@ -545,8 +591,15 @@ async def _index_full_scan(
# ------------------------------------------------------------------
# Phase 1 (serial): collect files, run skip checks, track renames
# ------------------------------------------------------------------
page_limit_service = PageLimitService(session)
pages_used, pages_limit = await page_limit_service.get_page_usage(user_id)
remaining_quota = pages_limit - pages_used
batch_estimated_pages = 0
page_limit_reached = False
renamed_count = 0
skipped = 0
unsupported_count = 0
files_processed = 0
files_to_download: list[dict] = []
folders_to_process = [(folder_id, folder_name)]
@ -587,12 +640,28 @@ async def _index_full_scan(
skip, msg = await _should_skip_file(session, file, search_space_id)
if skip:
if msg and "renamed" in msg.lower():
if msg and msg.startswith("unsupported:"):
unsupported_count += 1
elif msg and "renamed" in msg.lower():
renamed_count += 1
else:
skipped += 1
continue
file_pages = PageLimitService.estimate_pages_from_metadata(
file.get("name", ""), file.get("size")
)
if batch_estimated_pages + file_pages > remaining_quota:
if not page_limit_reached:
logger.warning(
"Page limit reached during Google Drive full scan, "
"skipping remaining files"
)
page_limit_reached = True
skipped += 1
continue
batch_estimated_pages += file_pages
files_to_download.append(file)
page_token = next_token
@ -636,11 +705,20 @@ async def _index_full_scan(
on_heartbeat=on_heartbeat_callback,
)
if batch_indexed > 0 and files_to_download and batch_estimated_pages > 0:
pages_to_deduct = max(
1, batch_estimated_pages * batch_indexed // len(files_to_download)
)
await page_limit_service.update_page_usage(
user_id, pages_to_deduct, allow_exceed=True
)
indexed = renamed_count + batch_indexed
logger.info(
f"Full scan complete: {indexed} indexed, {skipped} skipped, {failed} failed"
f"Full scan complete: {indexed} indexed, {skipped} skipped, "
f"{unsupported_count} unsupported, {failed} failed"
)
return indexed, skipped
return indexed, skipped, unsupported_count
async def _index_with_delta_sync(
@ -658,8 +736,11 @@ async def _index_with_delta_sync(
include_subfolders: bool = False,
on_heartbeat_callback: HeartbeatCallbackType | None = None,
enable_summary: bool = True,
) -> tuple[int, int]:
"""Delta sync using change tracking."""
) -> tuple[int, int, int]:
"""Delta sync using change tracking.
Returns (indexed, skipped, unsupported_count).
"""
await task_logger.log_task_progress(
log_entry,
f"Starting delta sync from token: {start_page_token[:20]}...",
@ -679,15 +760,22 @@ async def _index_with_delta_sync(
if not changes:
logger.info("No changes detected since last sync")
return 0, 0
return 0, 0, 0
logger.info(f"Processing {len(changes)} changes")
# ------------------------------------------------------------------
# Phase 1 (serial): handle removals, collect files for download
# ------------------------------------------------------------------
page_limit_service = PageLimitService(session)
pages_used, pages_limit = await page_limit_service.get_page_usage(user_id)
remaining_quota = pages_limit - pages_used
batch_estimated_pages = 0
page_limit_reached = False
renamed_count = 0
skipped = 0
unsupported_count = 0
files_to_download: list[dict] = []
files_processed = 0
@ -709,12 +797,28 @@ async def _index_with_delta_sync(
skip, msg = await _should_skip_file(session, file, search_space_id)
if skip:
if msg and "renamed" in msg.lower():
if msg and msg.startswith("unsupported:"):
unsupported_count += 1
elif msg and "renamed" in msg.lower():
renamed_count += 1
else:
skipped += 1
continue
file_pages = PageLimitService.estimate_pages_from_metadata(
file.get("name", ""), file.get("size")
)
if batch_estimated_pages + file_pages > remaining_quota:
if not page_limit_reached:
logger.warning(
"Page limit reached during Google Drive delta sync, "
"skipping remaining files"
)
page_limit_reached = True
skipped += 1
continue
batch_estimated_pages += file_pages
files_to_download.append(file)
# ------------------------------------------------------------------
@ -742,11 +846,20 @@ async def _index_with_delta_sync(
on_heartbeat=on_heartbeat_callback,
)
if batch_indexed > 0 and files_to_download and batch_estimated_pages > 0:
pages_to_deduct = max(
1, batch_estimated_pages * batch_indexed // len(files_to_download)
)
await page_limit_service.update_page_usage(
user_id, pages_to_deduct, allow_exceed=True
)
indexed = renamed_count + batch_indexed
logger.info(
f"Delta sync complete: {indexed} indexed, {skipped} skipped, {failed} failed"
f"Delta sync complete: {indexed} indexed, {skipped} skipped, "
f"{unsupported_count} unsupported, {failed} failed"
)
return indexed, skipped
return indexed, skipped, unsupported_count
# ---------------------------------------------------------------------------
@ -766,8 +879,11 @@ async def index_google_drive_files(
max_files: int = 500,
include_subfolders: bool = False,
on_heartbeat_callback: HeartbeatCallbackType | None = None,
) -> tuple[int, int, str | None]:
"""Index Google Drive files for a specific connector."""
) -> tuple[int, int, str | None, int]:
"""Index Google Drive files for a specific connector.
Returns (indexed, skipped, error_or_none, unsupported_count).
"""
task_logger = TaskLoggingService(session, search_space_id)
log_entry = await task_logger.log_task_start(
task_name="google_drive_files_indexing",
@ -793,7 +909,7 @@ async def index_google_drive_files(
await task_logger.log_task_failure(
log_entry, error_msg, None, {"error_type": "ConnectorNotFound"}
)
return 0, 0, error_msg
return 0, 0, error_msg, 0
await task_logger.log_task_progress(
log_entry,
@ -812,7 +928,7 @@ async def index_google_drive_files(
"Missing Composio account",
{"error_type": "MissingComposioAccount"},
)
return 0, 0, error_msg
return 0, 0, error_msg, 0
pre_built_credentials = build_composio_credentials(connected_account_id)
else:
token_encrypted = connector.config.get("_token_encrypted", False)
@ -827,6 +943,7 @@ async def index_google_drive_files(
0,
0,
"SECRET_KEY not configured but credentials are marked as encrypted",
0,
)
connector_enable_summary = getattr(connector, "enable_summary", True)
@ -839,7 +956,7 @@ async def index_google_drive_files(
await task_logger.log_task_failure(
log_entry, error_msg, {"error_type": "MissingParameter"}
)
return 0, 0, error_msg
return 0, 0, error_msg, 0
target_folder_id = folder_id
target_folder_name = folder_name or "Selected Folder"
@ -850,9 +967,11 @@ async def index_google_drive_files(
use_delta_sync and start_page_token and connector.last_indexed_at
)
documents_unsupported = 0
if can_use_delta:
logger.info(f"Using delta sync for connector {connector_id}")
documents_indexed, documents_skipped = await _index_with_delta_sync(
documents_indexed, documents_skipped, du = await _index_with_delta_sync(
drive_client,
session,
connector,
@ -868,8 +987,9 @@ async def index_google_drive_files(
on_heartbeat_callback,
connector_enable_summary,
)
documents_unsupported += du
logger.info("Running reconciliation scan after delta sync")
ri, rs = await _index_full_scan(
ri, rs, ru = await _index_full_scan(
drive_client,
session,
connector,
@ -887,9 +1007,14 @@ async def index_google_drive_files(
)
documents_indexed += ri
documents_skipped += rs
documents_unsupported += ru
else:
logger.info(f"Using full scan for connector {connector_id}")
documents_indexed, documents_skipped = await _index_full_scan(
(
documents_indexed,
documents_skipped,
documents_unsupported,
) = await _index_full_scan(
drive_client,
session,
connector,
@ -924,14 +1049,17 @@ async def index_google_drive_files(
{
"files_processed": documents_indexed,
"files_skipped": documents_skipped,
"files_unsupported": documents_unsupported,
"sync_type": "delta" if can_use_delta else "full",
"folder": target_folder_name,
},
)
logger.info(
f"Google Drive indexing completed: {documents_indexed} indexed, {documents_skipped} skipped"
f"Google Drive indexing completed: {documents_indexed} indexed, "
f"{documents_skipped} skipped, {documents_unsupported} unsupported"
)
return documents_indexed, documents_skipped, None
return documents_indexed, documents_skipped, None, documents_unsupported
except SQLAlchemyError as db_error:
await session.rollback()
@ -942,7 +1070,7 @@ async def index_google_drive_files(
{"error_type": "SQLAlchemyError"},
)
logger.error(f"Database error: {db_error!s}", exc_info=True)
return 0, 0, f"Database error: {db_error!s}"
return 0, 0, f"Database error: {db_error!s}", 0
except Exception as e:
await session.rollback()
await task_logger.log_task_failure(
@ -952,7 +1080,7 @@ async def index_google_drive_files(
{"error_type": type(e).__name__},
)
logger.error(f"Failed to index Google Drive files: {e!s}", exc_info=True)
return 0, 0, f"Failed to index Google Drive files: {e!s}"
return 0, 0, f"Failed to index Google Drive files: {e!s}", 0
async def index_google_drive_single_file(
@ -1154,7 +1282,7 @@ async def index_google_drive_selected_files(
session, connector_id, credentials=pre_built_credentials
)
indexed, skipped, errors = await _index_selected_files(
indexed, skipped, unsupported, errors = await _index_selected_files(
drive_client,
session,
files,
@ -1165,6 +1293,11 @@ async def index_google_drive_selected_files(
on_heartbeat=on_heartbeat_callback,
)
if unsupported > 0:
file_text = "file was" if unsupported == 1 else "files were"
unsup_msg = f"{unsupported} {file_text} not supported"
errors.append(unsup_msg)
await session.commit()
if errors:
@ -1172,7 +1305,12 @@ async def index_google_drive_selected_files(
log_entry,
f"Batch file indexing completed with {len(errors)} error(s)",
"; ".join(errors),
{"indexed": indexed, "skipped": skipped, "error_count": len(errors)},
{
"indexed": indexed,
"skipped": skipped,
"unsupported": unsupported,
"error_count": len(errors),
},
)
else:
await task_logger.log_task_success(
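
The page-quota gate added in the Google Drive hunks above can be sketched in isolation. This is only an illustration under stated assumptions: `estimate_pages` is a hypothetical stand-in for `PageLimitService.estimate_pages_from_metadata` (whose real heuristics are not shown in this diff), and files are plain dicts.

```python
from __future__ import annotations


def estimate_pages(name: str, size: int | None) -> int:
    """Hypothetical estimator: one page per 100 KB, never fewer than one page."""
    if not size:
        return 1
    return max(1, size // 100_000)


def admit_files(files: list[dict], remaining_quota: int) -> tuple[list[dict], int]:
    """Admit files while the running page estimate fits the remaining quota.

    Returns (files_to_download, skipped). A file that does not fit is skipped,
    but the loop continues, so a later, smaller file can still be admitted.
    """
    batch_estimated_pages = 0
    files_to_download: list[dict] = []
    skipped = 0
    for file in files:
        file_pages = estimate_pages(file.get("name", ""), file.get("size"))
        if batch_estimated_pages + file_pages > remaining_quota:
            skipped += 1
            continue
        batch_estimated_pages += file_pages
        files_to_download.append(file)
    return files_to_download, skipped


files = [
    {"name": "a.pdf", "size": 300_000},  # 3 estimated pages
    {"name": "b.pdf", "size": 500_000},  # 5 estimated pages
    {"name": "c.txt", "size": 50_000},   # 1 estimated page
]
admitted, skipped = admit_files(files, remaining_quota=4)
```

Because the check uses `continue` rather than `break`, the gate is not strictly "stop at the first overflow": the warning is logged once, yet every remaining file is still evaluated against the quota.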

File diff suppressed because it is too large


@ -28,6 +28,7 @@ from app.indexing_pipeline.connector_document import ConnectorDocument
from app.indexing_pipeline.document_hashing import compute_identifier_hash
from app.indexing_pipeline.indexing_pipeline_service import IndexingPipelineService
from app.services.llm_service import get_user_long_context_llm
from app.services.page_limit_service import PageLimitService
from app.services.task_logging_service import TaskLoggingService
from app.tasks.connector_indexers.base import (
check_document_by_unique_identifier,
@ -55,7 +56,10 @@ async def _should_skip_file(
file_id = file.get("id")
file_name = file.get("name", "Unknown")
if skip_item(file):
skip, unsup_ext = skip_item(file)
if skip:
if unsup_ext:
return True, f"unsupported:{unsup_ext}"
return True, "folder/onenote/remote"
if not file_id:
return True, "missing file_id"
@ -289,12 +293,18 @@ async def _index_selected_files(
user_id: str,
enable_summary: bool,
on_heartbeat: HeartbeatCallbackType | None = None,
) -> tuple[int, int, list[str]]:
) -> tuple[int, int, int, list[str]]:
"""Index user-selected files using the parallel pipeline."""
page_limit_service = PageLimitService(session)
pages_used, pages_limit = await page_limit_service.get_page_usage(user_id)
remaining_quota = pages_limit - pages_used
batch_estimated_pages = 0
files_to_download: list[dict] = []
errors: list[str] = []
renamed_count = 0
skipped = 0
unsupported_count = 0
for file_id, file_name in file_ids:
file, error = await get_file_by_id(onedrive_client, file_id)
@ -305,12 +315,23 @@ async def _index_selected_files(
skip, msg = await _should_skip_file(session, file, search_space_id)
if skip:
if msg and "renamed" in msg.lower():
if msg and msg.startswith("unsupported:"):
unsupported_count += 1
elif msg and "renamed" in msg.lower():
renamed_count += 1
else:
skipped += 1
continue
file_pages = PageLimitService.estimate_pages_from_metadata(
file.get("name", ""), file.get("size")
)
if batch_estimated_pages + file_pages > remaining_quota:
display = file_name or file_id
errors.append(f"File '{display}': page limit would be exceeded")
continue
batch_estimated_pages += file_pages
files_to_download.append(file)
batch_indexed, _failed = await _download_and_index(
@ -324,7 +345,15 @@ async def _index_selected_files(
on_heartbeat=on_heartbeat,
)
return renamed_count + batch_indexed, skipped, errors
if batch_indexed > 0 and files_to_download and batch_estimated_pages > 0:
pages_to_deduct = max(
1, batch_estimated_pages * batch_indexed // len(files_to_download)
)
await page_limit_service.update_page_usage(
user_id, pages_to_deduct, allow_exceed=True
)
return renamed_count + batch_indexed, skipped, unsupported_count, errors
# ---------------------------------------------------------------------------
@ -346,8 +375,11 @@ async def _index_full_scan(
include_subfolders: bool = True,
on_heartbeat_callback: HeartbeatCallbackType | None = None,
enable_summary: bool = True,
) -> tuple[int, int]:
"""Full scan indexing of a folder."""
) -> tuple[int, int, int]:
"""Full scan indexing of a folder.
Returns (indexed, skipped, unsupported_count).
"""
await task_logger.log_task_progress(
log_entry,
f"Starting full scan of folder: {folder_name}",
@ -358,8 +390,15 @@ async def _index_full_scan(
},
)
page_limit_service = PageLimitService(session)
pages_used, pages_limit = await page_limit_service.get_page_usage(user_id)
remaining_quota = pages_limit - pages_used
batch_estimated_pages = 0
page_limit_reached = False
renamed_count = 0
skipped = 0
unsupported_count = 0
files_to_download: list[dict] = []
all_files, error = await get_files_in_folder(
@ -378,11 +417,28 @@ async def _index_full_scan(
for file in all_files[:max_files]:
skip, msg = await _should_skip_file(session, file, search_space_id)
if skip:
if msg and "renamed" in msg.lower():
if msg and msg.startswith("unsupported:"):
unsupported_count += 1
elif msg and "renamed" in msg.lower():
renamed_count += 1
else:
skipped += 1
continue
file_pages = PageLimitService.estimate_pages_from_metadata(
file.get("name", ""), file.get("size")
)
if batch_estimated_pages + file_pages > remaining_quota:
if not page_limit_reached:
logger.warning(
"Page limit reached during OneDrive full scan, "
"skipping remaining files"
)
page_limit_reached = True
skipped += 1
continue
batch_estimated_pages += file_pages
files_to_download.append(file)
batch_indexed, failed = await _download_and_index(
@ -396,11 +452,20 @@ async def _index_full_scan(
on_heartbeat=on_heartbeat_callback,
)
if batch_indexed > 0 and files_to_download and batch_estimated_pages > 0:
pages_to_deduct = max(
1, batch_estimated_pages * batch_indexed // len(files_to_download)
)
await page_limit_service.update_page_usage(
user_id, pages_to_deduct, allow_exceed=True
)
indexed = renamed_count + batch_indexed
logger.info(
f"Full scan complete: {indexed} indexed, {skipped} skipped, {failed} failed"
f"Full scan complete: {indexed} indexed, {skipped} skipped, "
f"{unsupported_count} unsupported, {failed} failed"
)
return indexed, skipped
return indexed, skipped, unsupported_count
async def _index_with_delta_sync(
@ -416,8 +481,11 @@ async def _index_with_delta_sync(
max_files: int,
on_heartbeat_callback: HeartbeatCallbackType | None = None,
enable_summary: bool = True,
) -> tuple[int, int, str | None]:
"""Delta sync using OneDrive change tracking. Returns (indexed, skipped, new_delta_link)."""
) -> tuple[int, int, int, str | None]:
"""Delta sync using OneDrive change tracking.
Returns (indexed, skipped, unsupported_count, new_delta_link).
"""
await task_logger.log_task_progress(
log_entry,
"Starting delta sync",
@ -437,12 +505,19 @@ async def _index_with_delta_sync(
if not changes:
logger.info("No changes detected since last sync")
return 0, 0, new_delta_link
return 0, 0, 0, new_delta_link
logger.info(f"Processing {len(changes)} delta changes")
page_limit_service = PageLimitService(session)
pages_used, pages_limit = await page_limit_service.get_page_usage(user_id)
remaining_quota = pages_limit - pages_used
batch_estimated_pages = 0
page_limit_reached = False
renamed_count = 0
skipped = 0
unsupported_count = 0
files_to_download: list[dict] = []
files_processed = 0
@ -465,12 +540,28 @@ async def _index_with_delta_sync(
skip, msg = await _should_skip_file(session, change, search_space_id)
if skip:
if msg and "renamed" in msg.lower():
if msg and msg.startswith("unsupported:"):
unsupported_count += 1
elif msg and "renamed" in msg.lower():
renamed_count += 1
else:
skipped += 1
continue
file_pages = PageLimitService.estimate_pages_from_metadata(
change.get("name", ""), change.get("size")
)
if batch_estimated_pages + file_pages > remaining_quota:
if not page_limit_reached:
logger.warning(
"Page limit reached during OneDrive delta sync, "
"skipping remaining files"
)
page_limit_reached = True
skipped += 1
continue
batch_estimated_pages += file_pages
files_to_download.append(change)
batch_indexed, failed = await _download_and_index(
@ -484,11 +575,20 @@ async def _index_with_delta_sync(
on_heartbeat=on_heartbeat_callback,
)
if batch_indexed > 0 and files_to_download and batch_estimated_pages > 0:
pages_to_deduct = max(
1, batch_estimated_pages * batch_indexed // len(files_to_download)
)
await page_limit_service.update_page_usage(
user_id, pages_to_deduct, allow_exceed=True
)
indexed = renamed_count + batch_indexed
logger.info(
f"Delta sync complete: {indexed} indexed, {skipped} skipped, {failed} failed"
f"Delta sync complete: {indexed} indexed, {skipped} skipped, "
f"{unsupported_count} unsupported, {failed} failed"
)
return indexed, skipped, new_delta_link
return indexed, skipped, unsupported_count, new_delta_link
# ---------------------------------------------------------------------------
@ -502,7 +602,7 @@ async def index_onedrive_files(
search_space_id: int,
user_id: str,
items_dict: dict,
) -> tuple[int, int, str | None]:
) -> tuple[int, int, str | None, int]:
"""Index OneDrive files for a specific connector.
items_dict format:
@ -529,7 +629,7 @@ async def index_onedrive_files(
await task_logger.log_task_failure(
log_entry, error_msg, None, {"error_type": "ConnectorNotFound"}
)
return 0, 0, error_msg
return 0, 0, error_msg, 0
token_encrypted = connector.config.get("_token_encrypted", False)
if token_encrypted and not config.SECRET_KEY:
@ -540,7 +640,7 @@ async def index_onedrive_files(
"Missing SECRET_KEY",
{"error_type": "MissingSecretKey"},
)
return 0, 0, error_msg
return 0, 0, error_msg, 0
connector_enable_summary = getattr(connector, "enable_summary", True)
onedrive_client = OneDriveClient(session, connector_id)
@ -552,12 +652,13 @@ async def index_onedrive_files(
total_indexed = 0
total_skipped = 0
total_unsupported = 0
# Index selected individual files
selected_files = items_dict.get("files", [])
if selected_files:
file_tuples = [(f["id"], f.get("name")) for f in selected_files]
indexed, skipped, _errors = await _index_selected_files(
indexed, skipped, unsupported, _errors = await _index_selected_files(
onedrive_client,
session,
file_tuples,
@ -568,6 +669,7 @@ async def index_onedrive_files(
)
total_indexed += indexed
total_skipped += skipped
total_unsupported += unsupported
# Index selected folders
folders = items_dict.get("folders", [])
@ -581,7 +683,7 @@ async def index_onedrive_files(
if can_use_delta:
logger.info(f"Using delta sync for folder {folder_name}")
indexed, skipped, new_delta_link = await _index_with_delta_sync(
indexed, skipped, unsup, new_delta_link = await _index_with_delta_sync(
onedrive_client,
session,
connector_id,
@ -596,6 +698,7 @@ async def index_onedrive_files(
)
total_indexed += indexed
total_skipped += skipped
total_unsupported += unsup
if new_delta_link:
await session.refresh(connector)
@ -605,7 +708,7 @@ async def index_onedrive_files(
flag_modified(connector, "config")
# Reconciliation full scan
ri, rs = await _index_full_scan(
ri, rs, ru = await _index_full_scan(
onedrive_client,
session,
connector_id,
@ -621,9 +724,10 @@ async def index_onedrive_files(
)
total_indexed += ri
total_skipped += rs
total_unsupported += ru
else:
logger.info(f"Using full scan for folder {folder_name}")
indexed, skipped = await _index_full_scan(
indexed, skipped, unsup = await _index_full_scan(
onedrive_client,
session,
connector_id,
@ -639,6 +743,7 @@ async def index_onedrive_files(
)
total_indexed += indexed
total_skipped += skipped
total_unsupported += unsup
# Store new delta link for this folder
_, new_delta_link, _ = await onedrive_client.get_delta(folder_id=folder_id)
@ -657,12 +762,18 @@ async def index_onedrive_files(
await task_logger.log_task_success(
log_entry,
f"Successfully completed OneDrive indexing for connector {connector_id}",
{"files_processed": total_indexed, "files_skipped": total_skipped},
{
"files_processed": total_indexed,
"files_skipped": total_skipped,
"files_unsupported": total_unsupported,
},
)
logger.info(
f"OneDrive indexing completed: {total_indexed} indexed, {total_skipped} skipped"
f"OneDrive indexing completed: {total_indexed} indexed, "
f"{total_skipped} skipped, {total_unsupported} unsupported"
)
return total_indexed, total_skipped, None
return total_indexed, total_skipped, None, total_unsupported
except SQLAlchemyError as db_error:
await session.rollback()
@ -673,7 +784,7 @@ async def index_onedrive_files(
{"error_type": "SQLAlchemyError"},
)
logger.error(f"Database error: {db_error!s}", exc_info=True)
return 0, 0, f"Database error: {db_error!s}"
return 0, 0, f"Database error: {db_error!s}", 0
except Exception as e:
await session.rollback()
await task_logger.log_task_failure(
@ -683,4 +794,4 @@ async def index_onedrive_files(
{"error_type": type(e).__name__},
)
logger.error(f"Failed to index OneDrive files: {e!s}", exc_info=True)
return 0, 0, f"Failed to index OneDrive files: {e!s}"
return 0, 0, f"Failed to index OneDrive files: {e!s}", 0
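
The proportional deduction that appears in each indexing path (`batch_estimated_pages * batch_indexed // len(files_to_download)`) is easier to follow with numbers. The sketch below is a worked example, not the project's code: the `pages_to_deduct` function name is hypothetical, and returning `0` stands in for the diff's behavior of not calling `update_page_usage` at all when the guard fails.

```python
def pages_to_deduct(batch_estimated_pages: int, batch_indexed: int, files_queued: int) -> int:
    """Scale the batch page estimate by the fraction of queued files indexed."""
    if batch_indexed <= 0 or files_queued <= 0 or batch_estimated_pages <= 0:
        return 0  # assumption: models "no deduction" when the guard is false
    # Integer arithmetic: multiply first, then floor-divide, with a floor of 1 page.
    return max(1, batch_estimated_pages * batch_indexed // files_queued)


full = pages_to_deduct(40, 10, 10)     # every queued file indexed
partial = pages_to_deduct(40, 7, 10)   # 7 of 10 indexed
tiny = pages_to_deduct(3, 1, 10)       # rounds down to 0, clamped to 1
nothing = pages_to_deduct(40, 0, 10)   # nothing indexed, nothing deducted
```

Here `full` is 40, `partial` is 28, `tiny` is 1, and `nothing` is 0. The scaling treats all queued files as equally sized, so a batch where only the largest files fail will be charged slightly more than the pages actually indexed.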


@ -1,43 +1,17 @@
"""
Document processors module for background tasks.
This module provides a collection of document processors for different content types
and sources. Each processor is responsible for handling a specific type of document
processing task in the background.
Available processors:
- Extension processor: Handle documents from browser extension
- Markdown processor: Process markdown files
- File processors: Handle files using different ETL services (Unstructured, LlamaCloud, Docling)
- YouTube processor: Process YouTube videos and extract transcripts
Content extraction is handled by ``app.etl_pipeline.EtlPipelineService``.
This package keeps orchestration (save, notify, page-limit) and
non-ETL processors (extension, markdown, youtube).
"""
# URL crawler
# Extension processor
from .extension_processor import add_extension_received_document
# File processors
from .file_processors import (
add_received_file_document_using_docling,
add_received_file_document_using_llamacloud,
add_received_file_document_using_unstructured,
)
# Markdown processor
from .markdown_processor import add_received_markdown_file_document
# YouTube processor
from .youtube_processor import add_youtube_video_document
__all__ = [
# Extension processing
"add_extension_received_document",
"add_received_file_document_using_docling",
"add_received_file_document_using_llamacloud",
# File processing with different ETL services
"add_received_file_document_using_unstructured",
# Markdown file processing
"add_received_markdown_file_document",
# YouTube video processing
"add_youtube_video_document",
]


@ -0,0 +1,91 @@
"""
Lossless file-to-markdown converters for text-based formats.
These converters handle file types that can be faithfully represented as
markdown without any external ETL/OCR service:
- CSV / TSV → markdown table (stdlib ``csv``)
- HTML / HTM / XHTML → markdown (``markdownify``)
"""
from __future__ import annotations
import csv
from collections.abc import Callable
from pathlib import Path
from markdownify import markdownify
# The stdlib csv module defaults to a 128 KB field-size limit which is too
# small for real-world exports (e.g. chat logs, CRM dumps). We raise it once
# at import time so every csv.reader call in this module can handle large fields.
csv.field_size_limit(2**31 - 1)
def _escape_pipe(cell: str) -> str:
"""Escape literal pipe characters inside a markdown table cell."""
return cell.replace("|", "\\|")
def csv_to_markdown(file_path: str, *, delimiter: str = ",") -> str:
"""Convert a CSV (or TSV) file to a markdown table.
The first row is treated as the header. An empty file returns an
empty string so the caller can decide how to handle it.
"""
with open(file_path, encoding="utf-8", newline="") as fh:
reader = csv.reader(fh, delimiter=delimiter)
rows = list(reader)
if not rows:
return ""
header, *body = rows
col_count = len(header)
lines: list[str] = []
header_cells = [_escape_pipe(c.strip()) for c in header]
lines.append("| " + " | ".join(header_cells) + " |")
lines.append("| " + " | ".join(["---"] * col_count) + " |")
for row in body:
padded = row + [""] * (col_count - len(row))
cells = [_escape_pipe(c.strip()) for c in padded[:col_count]]
lines.append("| " + " | ".join(cells) + " |")
return "\n".join(lines) + "\n"
def tsv_to_markdown(file_path: str) -> str:
"""Convert a TSV file to a markdown table."""
return csv_to_markdown(file_path, delimiter="\t")
def html_to_markdown(file_path: str) -> str:
"""Convert an HTML file to markdown via ``markdownify``."""
html = Path(file_path).read_text(encoding="utf-8")
return markdownify(html).strip()
_CONVERTER_MAP: dict[str, Callable[..., str]] = {
".csv": csv_to_markdown,
".tsv": tsv_to_markdown,
".html": html_to_markdown,
".htm": html_to_markdown,
".xhtml": html_to_markdown,
}
def convert_file_directly(file_path: str, filename: str) -> str:
"""Dispatch to the appropriate lossless converter based on file extension.
Raises ``ValueError`` if the extension is not supported.
"""
suffix = Path(filename).suffix.lower()
converter = _CONVERTER_MAP.get(suffix)
if converter is None:
raise ValueError(
f"No direct converter for extension '{suffix}' (file: {filename})"
)
return converter(file_path)
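
To see how the table builder treats short rows, long rows, and literal pipes, the row handling of `csv_to_markdown` can be reproduced on an in-memory string. This is a sketch for illustration: `rows_to_markdown` is a hypothetical name, and it reads from `io.StringIO` instead of a file path so it runs without the module above.

```python
import csv
import io


def rows_to_markdown(text: str, delimiter: str = ",") -> str:
    """Mirror the padding, truncation, and pipe-escaping of csv_to_markdown."""
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    if not rows:
        return ""

    header, *body = rows
    col_count = len(header)

    def fmt(cells: list) -> str:
        # Strip whitespace and escape pipes so cells cannot break the table.
        escaped = [c.strip().replace("|", "\\|") for c in cells]
        return "| " + " | ".join(escaped) + " |"

    lines = [fmt(header), "| " + " | ".join(["---"] * col_count) + " |"]
    for row in body:
        padded = row + [""] * (col_count - len(row))  # pad short rows
        lines.append(fmt(padded[:col_count]))          # truncate long rows
    return "\n".join(lines) + "\n"


sample = "name,note\nada,a|b\nbob\ncy,x,extra\n"
table = rows_to_markdown(sample)
```

The header fixes the column count: the `bob` row is padded with an empty cell, and the trailing `extra` value in the last row is dropped, which is worth keeping in mind given the module docstring's "lossless" claim.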


@ -0,0 +1,193 @@
"""
Document helper functions for deduplication, migration, and connector updates.
Provides reusable logic shared across file processors and ETL strategies.
"""
import logging
from sqlalchemy.ext.asyncio import AsyncSession
from app.db import Document, DocumentStatus, DocumentType
from app.utils.document_converters import generate_unique_identifier_hash
from .base import (
check_document_by_unique_identifier,
check_duplicate_document,
)
# ---------------------------------------------------------------------------
# Unique identifier helpers
# ---------------------------------------------------------------------------
def get_google_drive_unique_identifier(
connector: dict | None,
filename: str,
search_space_id: int,
) -> tuple[str, str | None]:
"""
Get unique identifier hash, using file_id for Google Drive (stable across renames).
Returns:
Tuple of (primary_hash, legacy_hash or None).
For Google Drive: (file_id-based hash, filename-based hash for migration).
For other sources: (filename-based hash, None).
"""
if connector and connector.get("type") == DocumentType.GOOGLE_DRIVE_FILE:
metadata = connector.get("metadata", {})
file_id = metadata.get("google_drive_file_id")
if file_id:
primary_hash = generate_unique_identifier_hash(
DocumentType.GOOGLE_DRIVE_FILE, file_id, search_space_id
)
legacy_hash = generate_unique_identifier_hash(
DocumentType.GOOGLE_DRIVE_FILE, filename, search_space_id
)
return primary_hash, legacy_hash
primary_hash = generate_unique_identifier_hash(
DocumentType.FILE, filename, search_space_id
)
return primary_hash, None
# ---------------------------------------------------------------------------
# Document deduplication and migration
# ---------------------------------------------------------------------------
async def handle_existing_document_update(
session: AsyncSession,
existing_document: Document,
content_hash: str,
connector: dict | None,
filename: str,
primary_hash: str,
) -> tuple[bool, Document | None]:
"""
Handle update logic for an existing document.
Returns:
Tuple of (should_skip_processing, document_to_return):
- (True, document): Content unchanged, return existing document
- (False, None): Content changed, needs re-processing
"""
if existing_document.unique_identifier_hash != primary_hash:
existing_document.unique_identifier_hash = primary_hash
logging.info(f"Migrated document to file_id-based identifier: {filename}")
if existing_document.content_hash == content_hash:
if connector and connector.get("type") == DocumentType.GOOGLE_DRIVE_FILE:
connector_metadata = connector.get("metadata", {})
new_name = connector_metadata.get("google_drive_file_name")
doc_metadata = existing_document.document_metadata or {}
old_name = doc_metadata.get("FILE_NAME") or doc_metadata.get(
"google_drive_file_name"
)
if new_name and old_name and old_name != new_name:
from sqlalchemy.orm.attributes import flag_modified
existing_document.title = new_name
if not existing_document.document_metadata:
existing_document.document_metadata = {}
existing_document.document_metadata["FILE_NAME"] = new_name
existing_document.document_metadata["google_drive_file_name"] = new_name
flag_modified(existing_document, "document_metadata")
await session.commit()
logging.info(
f"File renamed in Google Drive: '{old_name}' → '{new_name}' "
f"(no re-processing needed)"
)
logging.info(f"Document for file {filename} unchanged. Skipping.")
return True, existing_document
# Content has changed — guard against content_hash collision before
# expensive ETL processing.
collision_doc = await check_duplicate_document(session, content_hash)
if collision_doc and collision_doc.id != existing_document.id:
logging.warning(
"Content-hash collision for %s: identical content exists in "
"document #%s (%s). Skipping re-processing.",
filename,
collision_doc.id,
collision_doc.document_type,
)
if DocumentStatus.is_state(
existing_document.status, DocumentStatus.PENDING
) or DocumentStatus.is_state(
existing_document.status, DocumentStatus.PROCESSING
):
await session.delete(existing_document)
await session.commit()
return True, None
return True, existing_document
logging.info(f"Content changed for file {filename}. Updating document.")
return False, None
async def find_existing_document_with_migration(
session: AsyncSession,
primary_hash: str,
legacy_hash: str | None,
content_hash: str | None = None,
) -> Document | None:
"""
Find existing document, checking primary hash, legacy hash, and content_hash.
Supports migration from filename-based to file_id-based hashing for
Google Drive files, with content_hash fallback for cross-source dedup.
"""
existing_document = await check_document_by_unique_identifier(session, primary_hash)
if not existing_document and legacy_hash:
existing_document = await check_document_by_unique_identifier(
session, legacy_hash
)
if existing_document:
logging.info(
"Found legacy document (filename-based hash), "
"will migrate to file_id-based hash"
)
if not existing_document and content_hash:
existing_document = await check_duplicate_document(session, content_hash)
if existing_document:
logging.info(
f"Found duplicate content from different source (content_hash match). "
f"Original document ID: {existing_document.id}, "
f"type: {existing_document.document_type}"
)
return existing_document
# ---------------------------------------------------------------------------
# Connector helpers
# ---------------------------------------------------------------------------
async def update_document_from_connector(
document: Document | None,
connector: dict | None,
session: AsyncSession,
) -> None:
"""Update document type, metadata, and connector_id from connector info."""
if not document or not connector:
return
if "type" in connector:
document.document_type = connector["type"]
if "metadata" in connector:
if not document.document_metadata:
document.document_metadata = connector["metadata"]
else:
merged = {**document.document_metadata, **connector["metadata"]}
document.document_metadata = merged
if "connector_id" in connector:
document.connector_id = connector["connector_id"]
await session.commit()
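
The lookup order in `find_existing_document_with_migration` (primary hash, then legacy hash, then content hash) can be sketched with plain dictionaries. Everything here is an assumption made for illustration: `find_existing`, the two in-memory stores, and the hash strings are hypothetical, and the real function queries the database through `check_document_by_unique_identifier` and `check_duplicate_document`.

```python
from __future__ import annotations


def find_existing(
    by_identifier: dict[str, str],
    by_content: dict[str, str],
    primary_hash: str,
    legacy_hash: str | None,
    content_hash: str | None = None,
) -> str | None:
    """Return the first match in priority order, or None if nothing matches."""
    doc = by_identifier.get(primary_hash)
    if doc is None and legacy_hash:
        # Older rows were keyed by filename; finding one triggers migration.
        doc = by_identifier.get(legacy_hash)
    if doc is None and content_hash:
        # Last resort: identical content indexed from a different source.
        doc = by_content.get(content_hash)
    return doc


by_identifier = {"hash(file_id)": "doc-new", "hash(filename)": "doc-legacy"}
by_content = {"hash(content)": "doc-duplicate"}

hit_primary = find_existing(by_identifier, by_content, "hash(file_id)", "hash(filename)")
hit_legacy = find_existing(by_identifier, by_content, "missing", "hash(filename)")
hit_content = find_existing(by_identifier, by_content, "missing", None, "hash(content)")
miss = find_existing(by_identifier, by_content, "missing", None)
```

Each fallback runs only when the previous lookup found nothing, so a document already stored under its file_id-based hash never reaches the legacy or content-hash branches.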


@ -0,0 +1,204 @@
"""
Unified document save/update logic for file processors.
"""
import logging
from sqlalchemy.exc import SQLAlchemyError
from sqlalchemy.ext.asyncio import AsyncSession
from app.db import Document, DocumentStatus, DocumentType
from app.services.llm_service import get_user_long_context_llm
from app.utils.document_converters import (
create_document_chunks,
embed_text,
generate_content_hash,
generate_document_summary,
)
from ._helpers import (
find_existing_document_with_migration,
get_google_drive_unique_identifier,
handle_existing_document_update,
)
from .base import get_current_timestamp, safe_set_chunks
# ---------------------------------------------------------------------------
# Summary generation
# ---------------------------------------------------------------------------
async def _generate_summary(
markdown_content: str,
file_name: str,
etl_service: str,
user_llm,
enable_summary: bool,
) -> tuple[str, list[float]]:
"""
Generate a document summary and embedding.
Docling uses its own large-document summary strategy; other ETL services
use the standard ``generate_document_summary`` helper.
"""
if not enable_summary:
summary = f"File: {file_name}\n\n{markdown_content[:4000]}"
return summary, embed_text(summary)
if etl_service == "DOCLING":
from app.services.docling_service import create_docling_service
docling_service = create_docling_service()
summary_text = await docling_service.process_large_document_summary(
content=markdown_content, llm=user_llm, document_title=file_name
)
meta = {
"file_name": file_name,
"etl_service": etl_service,
"document_type": "File Document",
}
parts = ["# DOCUMENT METADATA"]
for key, value in meta.items():
if value:
formatted_key = key.replace("_", " ").title()
parts.append(f"**{formatted_key}:** {value}")
enhanced = "\n".join(parts) + "\n\n# DOCUMENT SUMMARY\n\n" + summary_text
return enhanced, embed_text(enhanced)
# Standard summary (Unstructured / LlamaCloud / others)
meta = {
"file_name": file_name,
"etl_service": etl_service,
"document_type": "File Document",
}
return await generate_document_summary(markdown_content, user_llm, meta)
# ---------------------------------------------------------------------------
# Unified save function
# ---------------------------------------------------------------------------
async def save_file_document(
session: AsyncSession,
file_name: str,
markdown_content: str,
search_space_id: int,
user_id: str,
etl_service: str,
connector: dict | None = None,
enable_summary: bool = True,
) -> Document | None:
"""
Process and store a file document with deduplication and migration support.
Handles both creating new documents and updating existing ones. This is
the single implementation behind the per-ETL-service wrapper functions.
Args:
session: Database session
file_name: Name of the processed file
markdown_content: Markdown content to store
search_space_id: ID of the search space
user_id: ID of the user
etl_service: Name of the ETL service (UNSTRUCTURED, LLAMACLOUD, DOCLING)
connector: Optional connector info for Google Drive files
enable_summary: Whether to generate an AI summary
Returns:
Document object if successful, None if duplicate detected
"""
try:
primary_hash, legacy_hash = get_google_drive_unique_identifier(
connector, file_name, search_space_id
)
content_hash = generate_content_hash(markdown_content, search_space_id)
existing_document = await find_existing_document_with_migration(
session, primary_hash, legacy_hash, content_hash
)
if existing_document:
should_skip, doc = await handle_existing_document_update(
session,
existing_document,
content_hash,
connector,
file_name,
primary_hash,
)
if should_skip:
return doc
user_llm = await get_user_long_context_llm(session, user_id, search_space_id)
if not user_llm:
raise RuntimeError(
f"No long context LLM configured for user {user_id} "
f"in search space {search_space_id}"
)
summary_content, summary_embedding = await _generate_summary(
markdown_content, file_name, etl_service, user_llm, enable_summary
)
chunks = await create_document_chunks(markdown_content)
doc_metadata = {"FILE_NAME": file_name, "ETL_SERVICE": etl_service}
if existing_document:
existing_document.title = file_name
existing_document.content = summary_content
existing_document.content_hash = content_hash
existing_document.embedding = summary_embedding
existing_document.document_metadata = doc_metadata
await safe_set_chunks(session, existing_document, chunks)
existing_document.source_markdown = markdown_content
existing_document.content_needs_reindexing = False
existing_document.updated_at = get_current_timestamp()
existing_document.status = DocumentStatus.ready()
await session.commit()
await session.refresh(existing_document)
return existing_document
doc_type = DocumentType.FILE
if connector and connector.get("type") == DocumentType.GOOGLE_DRIVE_FILE:
doc_type = DocumentType.GOOGLE_DRIVE_FILE
document = Document(
search_space_id=search_space_id,
title=file_name,
document_type=doc_type,
document_metadata=doc_metadata,
content=summary_content,
embedding=summary_embedding,
chunks=chunks,
content_hash=content_hash,
unique_identifier_hash=primary_hash,
source_markdown=markdown_content,
content_needs_reindexing=False,
updated_at=get_current_timestamp(),
created_by_id=user_id,
connector_id=connector.get("connector_id") if connector else None,
status=DocumentStatus.ready(),
)
session.add(document)
await session.commit()
await session.refresh(document)
return document
except SQLAlchemyError as db_error:
await session.rollback()
if "ix_documents_content_hash" in str(db_error):
logging.warning(
"content_hash collision during commit for %s (%s). Skipping.",
file_name,
etl_service,
)
return None
raise db_error
except Exception as e:
await session.rollback()
raise RuntimeError(
f"Failed to process file document using {etl_service}: {e!s}"
) from e

View file

@ -14,88 +14,19 @@ from app.utils.document_converters import (
create_document_chunks,
generate_content_hash,
generate_document_summary,
generate_unique_identifier_hash,
)
from ._helpers import (
find_existing_document_with_migration,
get_google_drive_unique_identifier,
)
from .base import (
check_document_by_unique_identifier,
check_duplicate_document,
get_current_timestamp,
safe_set_chunks,
)
def _get_google_drive_unique_identifier(
connector: dict | None,
filename: str,
search_space_id: int,
) -> tuple[str, str | None]:
"""
Get unique identifier hash for a file, with special handling for Google Drive.
For Google Drive files, uses file_id as the unique identifier (doesn't change on rename).
For other files, uses filename.
Args:
connector: Optional connector info dict with type and metadata
filename: The filename (used for non-Google Drive files or as fallback)
search_space_id: The search space ID
Returns:
Tuple of (primary_hash, legacy_hash or None)
"""
if connector and connector.get("type") == DocumentType.GOOGLE_DRIVE_FILE:
metadata = connector.get("metadata", {})
file_id = metadata.get("google_drive_file_id")
if file_id:
primary_hash = generate_unique_identifier_hash(
DocumentType.GOOGLE_DRIVE_FILE, file_id, search_space_id
)
legacy_hash = generate_unique_identifier_hash(
DocumentType.GOOGLE_DRIVE_FILE, filename, search_space_id
)
return primary_hash, legacy_hash
primary_hash = generate_unique_identifier_hash(
DocumentType.FILE, filename, search_space_id
)
return primary_hash, None
async def _find_existing_document_with_migration(
session: AsyncSession,
primary_hash: str,
legacy_hash: str | None,
content_hash: str | None = None,
) -> Document | None:
"""
Find existing document, checking both new hash and legacy hash for migration,
with fallback to content_hash for cross-source deduplication.
"""
existing_document = await check_document_by_unique_identifier(session, primary_hash)
if not existing_document and legacy_hash:
existing_document = await check_document_by_unique_identifier(
session, legacy_hash
)
if existing_document:
logging.info(
"Found legacy document (filename-based hash), will migrate to file_id-based hash"
)
# Fallback: check by content_hash to catch duplicates from different sources
if not existing_document and content_hash:
existing_document = await check_duplicate_document(session, content_hash)
if existing_document:
logging.info(
f"Found duplicate content from different source (content_hash match). "
f"Original document ID: {existing_document.id}, type: {existing_document.document_type}"
)
return existing_document
async def _handle_existing_document_update(
session: AsyncSession,
existing_document: Document,
@ -224,7 +155,7 @@ async def add_received_markdown_file_document(
try:
# Generate unique identifier hash (uses file_id for Google Drive, filename for others)
primary_hash, legacy_hash = _get_google_drive_unique_identifier(
primary_hash, legacy_hash = get_google_drive_unique_identifier(
connector, file_name, search_space_id
)
@ -232,7 +163,7 @@ async def add_received_markdown_file_document(
content_hash = generate_content_hash(file_in_markdown, search_space_id)
# Check if document exists (with migration support for Google Drive and content_hash fallback)
existing_document = await _find_existing_document_with_migration(
existing_document = await find_existing_document_with_migration(
session, primary_hash, legacy_hash, content_hash
)

View file

@ -0,0 +1,107 @@
"""Document versioning: snapshot creation and cleanup.
Rules:
- 30-minute debounce window: if the latest version was created < 30 min ago,
overwrite it instead of creating a new row.
- Maximum 20 versions per document.
- Versions older than 90 days are cleaned up.
"""
from datetime import UTC, datetime, timedelta
from sqlalchemy import delete, func, select
from sqlalchemy.ext.asyncio import AsyncSession
from app.db import Document, DocumentVersion
MAX_VERSIONS_PER_DOCUMENT = 20
DEBOUNCE_MINUTES = 30
RETENTION_DAYS = 90
def _now() -> datetime:
return datetime.now(UTC)
async def create_version_snapshot(
session: AsyncSession,
document: Document,
) -> DocumentVersion | None:
"""Snapshot the document's current state into a DocumentVersion row.
Returns the created/updated DocumentVersion, or None if nothing was done.
"""
now = _now()
latest = (
await session.execute(
select(DocumentVersion)
.where(DocumentVersion.document_id == document.id)
.order_by(DocumentVersion.version_number.desc())
.limit(1)
)
).scalar_one_or_none()
if latest is not None:
age = now - latest.created_at.replace(tzinfo=UTC)
if age < timedelta(minutes=DEBOUNCE_MINUTES):
latest.source_markdown = document.source_markdown
latest.content_hash = document.content_hash
latest.title = document.title
latest.created_at = now
await session.flush()
return latest
max_num = (
await session.execute(
select(func.coalesce(func.max(DocumentVersion.version_number), 0)).where(
DocumentVersion.document_id == document.id
)
)
).scalar_one()
version = DocumentVersion(
document_id=document.id,
version_number=max_num + 1,
source_markdown=document.source_markdown,
content_hash=document.content_hash,
title=document.title,
created_at=now,
)
session.add(version)
await session.flush()
# Cleanup: remove versions older than 90 days
cutoff = now - timedelta(days=RETENTION_DAYS)
await session.execute(
delete(DocumentVersion).where(
DocumentVersion.document_id == document.id,
DocumentVersion.created_at < cutoff,
)
)
# Cleanup: cap at MAX_VERSIONS_PER_DOCUMENT
count = (
await session.execute(
select(func.count())
.select_from(DocumentVersion)
.where(DocumentVersion.document_id == document.id)
)
).scalar_one()
if count > MAX_VERSIONS_PER_DOCUMENT:
excess = count - MAX_VERSIONS_PER_DOCUMENT
oldest_ids_result = await session.execute(
select(DocumentVersion.id)
.where(DocumentVersion.document_id == document.id)
.order_by(DocumentVersion.version_number.asc())
.limit(excess)
)
oldest_ids = [row[0] for row in oldest_ids_result.all()]
if oldest_ids:
await session.execute(
delete(DocumentVersion).where(DocumentVersion.id.in_(oldest_ids))
)
await session.flush()
return version

View file

@ -0,0 +1,153 @@
"""Per-parser document extension sets for the ETL pipeline.
Every consumer (file_classifier, connector-level skip checks, ETL pipeline
validation) imports from here so there is a single source of truth.
Extensions already covered by PLAINTEXT_EXTENSIONS, AUDIO_EXTENSIONS, or
DIRECT_CONVERT_EXTENSIONS in file_classifier are NOT repeated here -- these
sets are exclusively for the "document" ETL path (Docling / LlamaParse /
Unstructured).
"""
from pathlib import PurePosixPath
# ---------------------------------------------------------------------------
# Per-parser document extension sets (from official documentation)
# ---------------------------------------------------------------------------
DOCLING_DOCUMENT_EXTENSIONS: frozenset[str] = frozenset(
{
".pdf",
".docx",
".xlsx",
".pptx",
".png",
".jpg",
".jpeg",
".tiff",
".tif",
".bmp",
".webp",
}
)
LLAMAPARSE_DOCUMENT_EXTENSIONS: frozenset[str] = frozenset(
{
".pdf",
".docx",
".doc",
".xlsx",
".xls",
".pptx",
".ppt",
".docm",
".dot",
".dotm",
".pptm",
".pot",
".potx",
".xlsm",
".xlsb",
".xlw",
".rtf",
".epub",
".png",
".jpg",
".jpeg",
".gif",
".bmp",
".tiff",
".tif",
".webp",
".svg",
".odt",
".ods",
".odp",
".hwp",
".hwpx",
}
)
UNSTRUCTURED_DOCUMENT_EXTENSIONS: frozenset[str] = frozenset(
{
".pdf",
".docx",
".doc",
".xlsx",
".xls",
".pptx",
".ppt",
".png",
".jpg",
".jpeg",
".bmp",
".tiff",
".tif",
".heic",
".rtf",
".epub",
".odt",
".eml",
".msg",
".p7s",
}
)
AZURE_DI_DOCUMENT_EXTENSIONS: frozenset[str] = frozenset(
{
".pdf",
".docx",
".xlsx",
".pptx",
".png",
".jpg",
".jpeg",
".bmp",
".tiff",
".tif",
".heif",
}
)
# ---------------------------------------------------------------------------
# Union (used by classify_file for routing) + service lookup
# ---------------------------------------------------------------------------
DOCUMENT_EXTENSIONS: frozenset[str] = (
DOCLING_DOCUMENT_EXTENSIONS
| LLAMAPARSE_DOCUMENT_EXTENSIONS
| UNSTRUCTURED_DOCUMENT_EXTENSIONS
| AZURE_DI_DOCUMENT_EXTENSIONS
)
_SERVICE_MAP: dict[str, frozenset[str]] = {
"DOCLING": DOCLING_DOCUMENT_EXTENSIONS,
"LLAMACLOUD": LLAMAPARSE_DOCUMENT_EXTENSIONS,
"UNSTRUCTURED": UNSTRUCTURED_DOCUMENT_EXTENSIONS,
}
def get_document_extensions_for_service(etl_service: str | None) -> frozenset[str]:
"""Return the document extensions supported by *etl_service*.
When *etl_service* is ``LLAMACLOUD`` and Azure Document Intelligence
credentials are configured, the set is dynamically expanded to include
Azure DI's supported extensions (e.g. ``.heif``).
Falls back to the full union when the service is ``None`` or unknown.
"""
extensions = _SERVICE_MAP.get(etl_service or "", DOCUMENT_EXTENSIONS)
if etl_service == "LLAMACLOUD":
from app.config import config as app_config
if getattr(app_config, "AZURE_DI_ENDPOINT", None) and getattr(
app_config, "AZURE_DI_KEY", None
):
extensions = extensions | AZURE_DI_DOCUMENT_EXTENSIONS
return extensions
def is_supported_document_extension(filename: str) -> bool:
"""Return True if the file's extension is in the supported document set."""
suffix = PurePosixPath(filename).suffix.lower()
return suffix in DOCUMENT_EXTENSIONS

View file

@ -11,6 +11,8 @@ import hmac
import json
import logging
import time
from random import SystemRandom
from string import ascii_letters, digits
from uuid import UUID
from cryptography.fernet import Fernet
@ -18,6 +20,25 @@ from fastapi import HTTPException
logger = logging.getLogger(__name__)
_PKCE_CHARS = ascii_letters + digits + "-._~"
_PKCE_RNG = SystemRandom()
def generate_code_verifier(length: int = 128) -> str:
"""Generate a PKCE code_verifier (RFC 7636, 43-128 unreserved chars)."""
return "".join(_PKCE_RNG.choice(_PKCE_CHARS) for _ in range(length))
def generate_pkce_pair(length: int = 128) -> tuple[str, str]:
"""Generate a PKCE code_verifier and its S256 code_challenge."""
verifier = generate_code_verifier(length)
challenge = (
base64.urlsafe_b64encode(hashlib.sha256(verifier.encode()).digest())
.decode()
.rstrip("=")
)
return verifier, challenge
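
For context, the S256 transform used by `generate_pkce_pair` is the one an authorization server applies when it checks the verifier sent with the token request. A minimal sketch of that check follows; `s256_challenge` and `verify_pkce` are hypothetical names that are not part of this commit, and the sample values in the usage note are the test vector from RFC 7636, Appendix B.

```python
import base64
import hashlib
import hmac


def s256_challenge(verifier: str) -> str:
    """Compute BASE64URL(SHA256(verifier)) without '=' padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def verify_pkce(verifier: str, expected_challenge: str) -> bool:
    """Check a code_verifier against a stored code_challenge (S256)."""
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(s256_challenge(verifier), expected_challenge)
```

With the RFC 7636 Appendix B values, `verify_pkce("dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk", "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM")` returns `True`.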
class OAuthStateManager:
"""Manages secure OAuth state parameters with HMAC signatures."""

View file

@ -46,8 +46,6 @@ dependencies = [
"redis>=5.2.1",
"firecrawl-py>=4.9.0",
"boto3>=1.35.0",
"litellm>=1.80.10",
"langchain-litellm>=0.3.5",
"fake-useragent>=2.2.0",
"trafilatura>=2.0.0",
"fastapi-users[oauth,sqlalchemy]>=15.0.3",
@ -75,6 +73,9 @@ dependencies = [
"langchain-community>=0.4.1",
"deepagents>=0.4.12",
"stripe>=15.0.0",
"azure-ai-documentintelligence>=1.0.2",
"litellm>=1.83.0",
"langchain-litellm>=0.6.4",
]
[dependency-groups]

View file

@ -3,6 +3,7 @@
Prerequisites: PostgreSQL + pgvector only.
External system boundaries are mocked:
- ETL parsing: LlamaParse (external API) and Docling (heavy library)
- LLM summarization, text embedding, text chunking (external APIs)
- Redis heartbeat (external infrastructure)
- Task dispatch is swapped via DI (InlineTaskDispatcher)
@ -11,6 +12,7 @@ External system boundaries are mocked:
from __future__ import annotations
import contextlib
import os
from collections.abc import AsyncGenerator
from unittest.mock import AsyncMock, MagicMock
@ -298,3 +300,59 @@ def _mock_redis_heartbeat(monkeypatch):
"app.tasks.celery_tasks.document_tasks._run_heartbeat_loop",
AsyncMock(),
)
_MOCK_ETL_MARKDOWN = "# Mocked Document\n\nThis is mocked ETL content."
@pytest.fixture(autouse=True)
def _mock_etl_parsing(monkeypatch):
"""Mock ETL parsing services — LlamaParse and Docling are external boundaries.
Preserves the real contract: empty/corrupt files raise an error just like
the actual services would, so tests covering failure paths keep working.
"""
def _reject_empty(file_path: str) -> None:
if os.path.getsize(file_path) == 0:
raise RuntimeError(f"Cannot parse empty file: {file_path}")
# -- LlamaParse mock (external API) --------------------------------
async def _fake_llamacloud_parse(file_path: str, estimated_pages: int) -> str:
_reject_empty(file_path)
return _MOCK_ETL_MARKDOWN
monkeypatch.setattr(
"app.etl_pipeline.parsers.llamacloud.parse_with_llamacloud",
_fake_llamacloud_parse,
)
# -- Docling mock (heavy library boundary) -------------------------
async def _fake_docling_parse(file_path: str, filename: str) -> str:
_reject_empty(file_path)
return _MOCK_ETL_MARKDOWN
monkeypatch.setattr(
"app.etl_pipeline.parsers.docling.parse_with_docling",
_fake_docling_parse,
)
class _FakeDoclingResult:
class Document:
@staticmethod
def export_to_markdown():
return _MOCK_ETL_MARKDOWN
document = Document()
class _FakeDocumentConverter:
def convert(self, file_path):
_reject_empty(file_path)
return _FakeDoclingResult()
monkeypatch.setattr(
"docling.document_converter.DocumentConverter",
_FakeDocumentConverter,
)

View file

@ -2,12 +2,11 @@
Integration tests for backend file upload limit enforcement.
These tests verify that the API rejects uploads that exceed:
- Max files per upload (10)
- Max per-file size (50 MB)
- Max total upload size (200 MB)
- Max per-file size (500 MB)
The limits mirror the frontend's DocumentUploadTab.tsx constants and are
enforced server-side to protect against direct API calls.
No file count or total size limits are enforced; the frontend batches
uploads in groups of 5 and there is no cap on how many files a user can
upload in a single session.
Prerequisites:
- PostgreSQL + pgvector
@ -24,60 +23,12 @@ pytestmark = pytest.mark.integration
# ---------------------------------------------------------------------------
# Test A: File count limit
# ---------------------------------------------------------------------------
class TestFileCountLimit:
"""Uploading more than 10 files in a single request should be rejected."""
async def test_11_files_returns_413(
self,
client: httpx.AsyncClient,
headers: dict[str, str],
search_space_id: int,
):
files = [
("files", (f"file_{i}.txt", io.BytesIO(b"test content"), "text/plain"))
for i in range(11)
]
resp = await client.post(
"/api/v1/documents/fileupload",
headers=headers,
files=files,
data={"search_space_id": str(search_space_id)},
)
assert resp.status_code == 413
assert "too many files" in resp.json()["detail"].lower()
async def test_10_files_accepted(
self,
client: httpx.AsyncClient,
headers: dict[str, str],
search_space_id: int,
cleanup_doc_ids: list[int],
):
files = [
("files", (f"file_{i}.txt", io.BytesIO(b"test content"), "text/plain"))
for i in range(10)
]
resp = await client.post(
"/api/v1/documents/fileupload",
headers=headers,
files=files,
data={"search_space_id": str(search_space_id)},
)
assert resp.status_code == 200
cleanup_doc_ids.extend(resp.json().get("document_ids", []))
# ---------------------------------------------------------------------------
# Test B: Per-file size limit
# Test: Per-file size limit (500 MB)
# ---------------------------------------------------------------------------
class TestPerFileSizeLimit:
"""A single file exceeding 50 MB should be rejected."""
"""A single file exceeding 500 MB should be rejected."""
async def test_oversized_file_returns_413(
self,
@ -85,7 +36,7 @@ class TestPerFileSizeLimit:
headers: dict[str, str],
search_space_id: int,
):
oversized = io.BytesIO(b"\x00" * (50 * 1024 * 1024 + 1))
oversized = io.BytesIO(b"\x00" * (500 * 1024 * 1024 + 1))
resp = await client.post(
"/api/v1/documents/fileupload",
headers=headers,
@ -102,11 +53,11 @@ class TestPerFileSizeLimit:
search_space_id: int,
cleanup_doc_ids: list[int],
):
at_limit = io.BytesIO(b"\x00" * (50 * 1024 * 1024))
at_limit = io.BytesIO(b"\x00" * (500 * 1024 * 1024))
resp = await client.post(
"/api/v1/documents/fileupload",
headers=headers,
files=[("files", ("exact50mb.txt", at_limit, "text/plain"))],
files=[("files", ("exact500mb.txt", at_limit, "text/plain"))],
data={"search_space_id": str(search_space_id)},
)
assert resp.status_code == 200
@ -114,26 +65,23 @@ class TestPerFileSizeLimit:
# ---------------------------------------------------------------------------
# Test C: Total upload size limit
# Test: Multiple files accepted without count limit
# ---------------------------------------------------------------------------
class TestTotalSizeLimit:
"""Multiple files whose combined size exceeds 200 MB should be rejected."""
class TestNoFileCountLimit:
"""Many files in a single request should be accepted."""
async def test_total_size_over_200mb_returns_413(
async def test_many_files_accepted(
self,
client: httpx.AsyncClient,
headers: dict[str, str],
search_space_id: int,
cleanup_doc_ids: list[int],
):
chunk_size = 45 * 1024 * 1024 # 45 MB each
files = [
(
"files",
(f"chunk_{i}.txt", io.BytesIO(b"\x00" * chunk_size), "text/plain"),
)
for i in range(5) # 5 x 45 MB = 225 MB > 200 MB
("files", (f"file_{i}.txt", io.BytesIO(b"test content"), "text/plain"))
for i in range(20)
]
resp = await client.post(
"/api/v1/documents/fileupload",
@ -141,5 +89,5 @@ class TestTotalSizeLimit:
files=files,
data={"search_space_id": str(search_space_id)},
)
assert resp.status_code == 413
assert "total upload size" in resp.json()["detail"].lower()
assert resp.status_code == 200
cleanup_doc_ids.extend(resp.json().get("document_ids", []))
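
The contract these tests exercise can be summarised as a small predicate. The server-side code is not shown in this diff, so the sketch below is an assumption about its behaviour rather than the actual endpoint logic; `upload_status` and `MAX_FILE_SIZE_BYTES` are hypothetical names.

```python
MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024  # 500 MB, the value used in the tests


def upload_status(file_sizes: list[int]) -> int:
    """Return the HTTP status a request with these file sizes would receive.

    Only the per-file size is checked; there is no limit on the number of
    files or on their combined size.
    """
    if any(size > MAX_FILE_SIZE_BYTES for size in file_sizes):
        return 413  # Payload Too Large
    return 200
```

A file of exactly 500 MB is accepted, and one byte more is rejected, which matches the boundary the two size tests probe.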
