mirror of https://github.com/MODSetter/SurfSense.git
synced 2026-04-28 18:36:23 +02:00
chore: updated docs for docling
parent 50f84e1d0a
commit a0aa29eeb0
5 changed files with 39 additions and 14 deletions

@@ -4,14 +4,7 @@ description: Setting up SurfSense using Docker
full: true
---

## Known Limitations

⚠️ **Important Note:** Currently, the following features have limited functionality when running in Docker:

- **Ollama integration:** Local Ollama models do not work when running SurfSense in Docker. Please use other LLM providers like OpenAI or Gemini instead.
- **Web crawler functionality:** The web crawler feature currently doesn't work properly within the Docker environment.

We're actively working to resolve these limitations in future releases.

# Docker Installation

@@ -28,6 +21,7 @@ Before you begin, ensure you have:
- **File Processing ETL Service** (choose one):
  - Unstructured.io API key (Supports 34+ formats)
  - LlamaIndex API key (enhanced parsing, supports 50+ formats)
  - Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
- Other required API keys

## Installation Steps

@@ -97,7 +91,7 @@ Before you begin, ensure you have:
| STT_SERVICE_API_KEY | API key for the Speech-to-Text service |
| STT_SERVICE_API_BASE | (Optional) Custom API base URL for the Speech-to-Text service |
| FIRECRAWL_API_KEY | API key for Firecrawl service for web crawling |
| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats) or `LLAMACLOUD` (supports 50+ formats including legacy document types) |
| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats), `LLAMACLOUD` (supports 50+ formats including legacy document types), or `DOCLING` (local processing, supports PDF, Office docs, images, HTML, CSV) |
| UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing (required if ETL_SERVICE=UNSTRUCTURED) |
| LLAMA_CLOUD_API_KEY | API key for LlamaCloud service for document parsing (required if ETL_SERVICE=LLAMACLOUD) |

@@ -152,7 +146,7 @@ For more details, see the [Uvicorn documentation](https://www.uvicorn.org/#comma
| ------------------------------- | ---------------------------------------------------------- |
| NEXT_PUBLIC_FASTAPI_BACKEND_URL | URL of the backend service (e.g., `http://localhost:8000`) |
| NEXT_PUBLIC_FASTAPI_BACKEND_AUTH_TYPE | Same value as set in backend AUTH_TYPE i.e `GOOGLE` for OAuth with Google, `LOCAL` for email/password authentication |
| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED` or `LLAMACLOUD` - affects supported file formats in upload interface |
| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED`, `LLAMACLOUD`, or `DOCLING` - affects supported file formats in upload interface |
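
Since `NEXT_PUBLIC_ETL_SERVICE` should match the backend `ETL_SERVICE`, a quick consistency check can catch a mismatch before the containers are built. The following is a minimal sketch, not part of SurfSense: `parse_env` and `etl_settings_match` are hypothetical helpers, and the parser assumes plain `KEY=VALUE` lines rather than full dotenv syntax.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def etl_settings_match(backend_env_text: str, frontend_env_text: str) -> bool:
    """True only if both files set the ETL service, and to the same value."""
    backend = parse_env(backend_env_text).get("ETL_SERVICE")
    frontend = parse_env(frontend_env_text).get("NEXT_PUBLIC_ETL_SERVICE")
    return backend is not None and backend == frontend
```

In practice the two arguments would be the contents of the backend and frontend `.env` files, read with `pathlib.Path(...).read_text()`; the file locations depend on your checkout and are left out here.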

2. **Build and Start Containers**

@@ -67,7 +67,7 @@ To set up Google OAuth:

## File Upload's

SurfSense supports two ETL (Extract, Transform, Load) services for converting files to LLM-friendly formats:
SurfSense supports three ETL (Extract, Transform, Load) services for converting files to LLM-friendly formats:

### Option 1: Unstructured

@@ -85,6 +85,16 @@ Files are converted using [LlamaIndex](https://www.llamaindex.ai/) which offers
2. Sign up for a LlamaCloud account to access their parsing services
3. LlamaCloud provides enhanced parsing capabilities for complex documents

### Option 3: Docling (Recommended for Privacy)

Files are processed locally using [Docling](https://github.com/DS4SD/docling) - IBM's open-source document parsing library.

1. **No API key required** - all processing happens locally
2. **Privacy-focused** - documents never leave your system
3. **Supported formats**: PDF, Office documents (Word, Excel, PowerPoint), images (PNG, JPEG, TIFF, BMP, WebP), HTML, CSV, AsciiDoc
4. **Enhanced features**: Advanced table detection, image extraction, and structured document parsing
5. **GPU acceleration** support for faster processing (when available)

**Note**: You only need to set up one of these services.
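
As a rough illustration of the "set up only one service" rule, the sketch below checks a backend environment for the key that the chosen ETL service needs. The variable names `ETL_SERVICE`, `UNSTRUCTURED_API_KEY`, and `LLAMA_CLOUD_API_KEY` come from the environment tables in these docs; the helper `missing_etl_settings` is hypothetical and not part of SurfSense.

```python
from typing import Mapping, Optional

# Key required by each ETL service, per the docs' environment tables.
# DOCLING processes files locally and needs no API key.
REQUIRED_KEY: dict[str, Optional[str]] = {
    "UNSTRUCTURED": "UNSTRUCTURED_API_KEY",
    "LLAMACLOUD": "LLAMA_CLOUD_API_KEY",
    "DOCLING": None,
}


def missing_etl_settings(env: Mapping[str, str]) -> list[str]:
    """Return a list of problems with the ETL settings in `env` (hypothetical helper)."""
    service = env.get("ETL_SERVICE", "")
    if service not in REQUIRED_KEY:
        return [f"ETL_SERVICE must be one of {sorted(REQUIRED_KEY)}, got {service!r}"]
    key = REQUIRED_KEY[service]
    if key is not None and not env.get(key):
        return [f"{key} is required when ETL_SERVICE={service}"]
    return []
```

Passing `os.environ` (or a parsed `.env` mapping) returns an empty list when the configuration is consistent, so a Docling setup passes with no API key at all.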

---

@@ -16,6 +16,7 @@ Before beginning the manual installation, ensure you have completed all the [pre
- **File Processing ETL Service** (choose one):
  - Unstructured.io API key (Supports 34+ formats)
  - LlamaIndex API key (enhanced parsing, supports 50+ formats)
  - Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
- Other required API keys

## Backend Setup

@@ -67,7 +68,7 @@ Edit the `.env` file and set the following variables:
| STT_SERVICE_API_KEY | API key for the Speech-to-Text service |
| STT_SERVICE_API_BASE | (Optional) Custom API base URL for the Speech-to-Text service |
| FIRECRAWL_API_KEY | API key for Firecrawl service for web crawling |
| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats) or `LLAMACLOUD` (supports 50+ formats including legacy document types) |
| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats), `LLAMACLOUD` (supports 50+ formats including legacy document types), or `DOCLING` (local processing, supports PDF, Office docs, images, HTML, CSV) |
| UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing (required if ETL_SERVICE=UNSTRUCTURED) |
| LLAMA_CLOUD_API_KEY | API key for LlamaCloud service for document parsing (required if ETL_SERVICE=LLAMACLOUD) |

@@ -198,7 +199,7 @@ Edit the `.env` file and set:
| ------------------------------- | ------------------------------------------- |
| NEXT_PUBLIC_FASTAPI_BACKEND_URL | Backend URL (e.g., `http://localhost:8000`) |
| NEXT_PUBLIC_FASTAPI_BACKEND_AUTH_TYPE | Same value as set in backend AUTH_TYPE i.e `GOOGLE` for OAuth with Google, `LOCAL` for email/password authentication |
| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED` or `LLAMACLOUD` - affects supported file formats in upload interface |
| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED`, `LLAMACLOUD`, or `DOCLING` - affects supported file formats in upload interface |

### 2. Install Dependencies
