![new_header](https://github.com/user-attachments/assets/e236b764-0ddc-42ff-a1f1-8fbb3d2e0e65)
<div align="center">
<a href="https://discord.gg/ejRNvftDp9">
<img src="https://img.shields.io/discord/1359368468260192417" alt="Discord">
</a>
</div>

# SurfSense
While tools like NotebookLM and Perplexity are impressive and highly effective for researching any topic or query, SurfSense extends this capability by integrating with your personal knowledge base. It is a highly customizable AI research agent connected to external sources such as search engines (Tavily, LinkUp), Slack, Linear, Jira, ClickUp, Confluence, Notion, YouTube, GitHub, and Discord, with more to come.

<div align="center">
<a href="https://trendshift.io/repositories/13606" target="_blank"><img src="https://trendshift.io/api/badge/repositories/13606" alt="MODSetter%2FSurfSense | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</div>

# Video
https://github.com/user-attachments/assets/d9221908-e0de-4b2f-ac3a-691cf4b202da
## Podcast Sample
https://github.com/user-attachments/assets/a0a16566-6967-4374-ac51-9b3e07fbecd7
## Key Features
### 💡 **Idea**:
Have your own highly customizable private NotebookLM and Perplexity integrated with external sources.
### 📁 **Multiple File Format Uploading Support**
Save content from your personal files *(documents, images, videos, and more, with support for **50+ file extensions**)* to your personal knowledge base.
### 🔍 **Powerful Search**
Quickly research or find anything in your saved content.
### 💬 **Chat with your Saved Content**
Interact in natural language and get cited answers.
### 📄 **Cited Answers**
Get cited answers, just like Perplexity.
### 🔔 **Privacy & Local LLM Support**
Works flawlessly with local LLMs served through Ollama.
### 🏠 **Self Hostable**
Open source and easy to deploy locally.
### 🎙️ Podcasts
- Blazingly fast podcast generation agent. (Creates a 3-minute podcast in under 20 seconds.)
- Convert your chat conversations into engaging audio content
- Support for local TTS providers (Kokoro TTS)
- Support for multiple TTS providers (OpenAI, Azure, Google Vertex AI)
### 📊 **Advanced RAG Techniques**
- Supports 100+ LLMs.
- Supports 6000+ Embedding Models.
- Supports all major Rerankers (Pinecone, Cohere, Flashrank, etc.).
- Uses Hierarchical Indices (2-tiered RAG setup).
- Utilizes Hybrid Search (Semantic + Full Text Search combined with Reciprocal Rank Fusion).
- RAG as a Service API Backend.
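
The Reciprocal Rank Fusion step mentioned in the list above can be sketched in a few lines of Python. This is only an illustration of the general technique, not SurfSense's actual implementation: the function name `reciprocal_rank_fusion`, the document IDs, and the constant `k=60` (a commonly used default) are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed default, not a SurfSense setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a semantic retriever and a full-text retriever.
semantic_hits = ["doc_a", "doc_b", "doc_c"]
full_text_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([semantic_hits, full_text_hits])
print(fused)
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise to the top, which is why the fusion needs only rank positions and not comparable scores from the two retrievers.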
### **External Sources**
- Search Engines (Tavily, LinkUp)
- Slack
- Linear
- Jira
- ClickUp
- Confluence
- Notion
- YouTube Videos
- GitHub
- Discord
- and more to come.
## 📄 **Supported File Extensions**
> **Note**: File format support depends on your ETL service configuration. LlamaCloud supports 50+ formats, Unstructured supports 34+ core formats, and Docling supports core formats with local, privacy-focused processing that requires no API key.
### Documents & Text
**LlamaCloud**: `.pdf`, `.doc`, `.docx`, `.docm`, `.dot`, `.dotm`, `.rtf`, `.txt`, `.xml`, `.epub`, `.odt`, `.wpd`, `.pages`, `.key`, `.numbers`, `.602`, `.abw`, `.cgm`, `.cwk`, `.hwp`, `.lwp`, `.mw`, `.mcw`, `.pbd`, `.sda`, `.sdd`, `.sdp`, `.sdw`, `.sgl`, `.sti`, `.sxi`, `.sxw`, `.stw`, `.sxg`, `.uof`, `.uop`, `.uot`, `.vor`, `.wps`, `.zabw`
**Unstructured**: `.doc`, `.docx`, `.odt`, `.rtf`, `.pdf`, `.xml`, `.txt`, `.md`, `.markdown`, `.rst`, `.html`, `.org`, `.epub`
**Docling**: `.pdf`, `.docx`, `.html`, `.htm`, `.xhtml`, `.adoc`, `.asciidoc`
### Presentations
**LlamaCloud**: `.ppt`, `.pptx`, `.pptm`, `.pot`, `.potm`, `.potx`, `.odp`, `.key`
**Unstructured**: `.ppt`, `.pptx`
**Docling**: `.pptx`
### Spreadsheets & Data
**LlamaCloud**: `.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.xlw`, `.csv`, `.tsv`, `.ods`, `.fods`, `.numbers`, `.dbf`, `.123`, `.dif`, `.sylk`, `.slk`, `.prn`, `.et`, `.uos1`, `.uos2`, `.wk1`, `.wk2`, `.wk3`, `.wk4`, `.wks`, `.wq1`, `.wq2`, `.wb1`, `.wb2`, `.wb3`, `.qpw`, `.xlr`, `.eth`
**Unstructured**: `.xls`, `.xlsx`, `.csv`, `.tsv`
**Docling**: `.xlsx`, `.csv`
### Images
**LlamaCloud**: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.svg`, `.tiff`, `.webp`, `.html`, `.htm`, `.web`
**Unstructured**: `.jpg`, `.jpeg`, `.png`, `.bmp`, `.tiff`, `.heic`
**Docling**: `.jpg`, `.jpeg`, `.png`, `.bmp`, `.tiff`, `.tif`, `.webp`
### Audio & Video *(Always Supported)*
`.mp3`, `.mpga`, `.m4a`, `.wav`, `.mp4`, `.mpeg`, `.webm`
### Email & Communication
**Unstructured**: `.eml`, `.msg`, `.p7s`
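
Because support depends on the configured ETL service, a quick pre-upload check can be sketched as follows. This is a minimal illustration, not SurfSense code: the `is_supported` helper, the dictionary keys, and the abbreviated extension sets (a small subset of the lists above) are hypothetical.

```python
from pathlib import Path

# Abbreviated copies of the lists above (assumption: extend to match your setup).
ETL_EXTENSIONS = {
    "LLAMACLOUD": {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".png", ".jpg"},
    "UNSTRUCTURED": {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".md", ".eml", ".msg"},
    "DOCLING": {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".html", ".png"},
}

# Audio and video formats are listed as always supported, whatever the ETL service.
ALWAYS_SUPPORTED = {".mp3", ".mpga", ".m4a", ".wav", ".mp4", ".mpeg", ".webm"}


def is_supported(filename, etl_service):
    """Return True if the file's extension is covered by the chosen ETL service."""
    ext = Path(filename).suffix.lower()
    if ext in ALWAYS_SUPPORTED:
        return True
    # Unknown service names fall back to an empty set, so nothing matches.
    return ext in ETL_EXTENSIONS.get(etl_service.upper(), set())


print(is_supported("notes.md", "unstructured"))
print(is_supported("notes.md", "docling"))
print(is_supported("talk.MP3", "docling"))
```

The extension is lowercased before the lookup so that `talk.MP3` and `talk.mp3` are treated the same way.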
### 🔖 Cross Browser Extension
- The SurfSense extension can be used to save any webpage you like.
- Its main use case is saving webpages that sit behind authentication.

## FEATURE REQUESTS AND FUTURE
**SurfSense is actively being developed.** While it's not yet production-ready, you can help us speed up the process.
Join the [SurfSense Discord](https://discord.gg/ejRNvftDp9) and help shape the future of SurfSense!
## 🚀 Roadmap
Stay up to date with our development progress and upcoming features!
Check out our public roadmap and contribute your ideas or feedback:
**View the Roadmap:** [SurfSense Roadmap on GitHub Projects](https://github.com/users/MODSetter/projects/2)
## How to get started?
### Installation Options
SurfSense provides two installation methods:
1. **[Docker Installation](https://www.surfsense.net/docs/docker-installation)** - The easiest way to get SurfSense up and running with all dependencies containerized.
   - Includes pgAdmin for database management through a web UI
   - Supports environment variable customization via `.env` file
   - Flexible deployment options (full stack or core services only)
   - No need to manually edit configuration files between environments
   - See [Docker Setup Guide](DOCKER_SETUP.md) for detailed instructions
   - For deployment scenarios and options, see [Deployment Guide](DEPLOYMENT_GUIDE.md)
2. **[Manual Installation (Recommended)](https://www.surfsense.net/docs/manual-installation)** - For users who prefer more control over their setup or need to customize their deployment.

Both installation guides include detailed OS-specific instructions for Windows, macOS, and Linux.

Before installation, make sure to complete the [prerequisite setup steps](https://www.surfsense.net/docs/), including:
- PGVector setup
- **File Processing ETL Service** (choose one):
  - Unstructured.io API key (supports 34+ formats)
  - LlamaIndex API key (enhanced parsing, supports 50+ formats)
  - Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
- Other required API keys

## Screenshots
**Research Agent**

![updated_researcher](https://github.com/user-attachments/assets/e22c5d86-f511-4c72-8c50-feba0c1561b4)

**Search Spaces**

![search_spaces](https://github.com/user-attachments/assets/e254c38c-f937-44b6-9e9d-770db583d099)

**Manage Documents**

![documents](https://github.com/user-attachments/assets/7001e306-eb06-4009-89c6-8fadfdc3fc4d)

**Podcast Agent**

![podcasts](https://github.com/user-attachments/assets/6cb82ffd-9e14-4172-bc79-67faf34c4c1c)

**Agent Chat**

![git_chat](https://github.com/user-attachments/assets/bb352d52-1c6d-4020-926b-722d0b98b491)

**Browser Extension**

![ext1](https://github.com/user-attachments/assets/1f042b7a-6349-422b-94fb-d40d0df16c40)
![ext2](https://github.com/user-attachments/assets/a9b9f1aa-2677-404d-b0a0-c1b2dddf24a7)

## Tech Stack
### **BackEnd**
- **FastAPI**: Modern, fast web framework for building APIs with Python
- **PostgreSQL with pgvector**: Database with vector search capabilities for similarity searches
- **SQLAlchemy**: SQL toolkit and ORM (Object-Relational Mapping) for database interactions
- **Alembic**: Database migration tool for SQLAlchemy
- **FastAPI Users**: Authentication and user management with JWT and OAuth support
- **LangGraph**: Framework for developing AI agents
- **LangChain**: Framework for developing AI-powered applications
- **LLM Integration**: Integration with LLMs through LiteLLM
- **Rerankers**: Advanced result ranking for improved search relevance
- **Hybrid Search**: Combines vector similarity and full-text search for optimal results using Reciprocal Rank Fusion (RRF)
- **Vector Embeddings**: Document and text embeddings for semantic search
- **pgvector**: PostgreSQL extension for efficient vector similarity operations
- **Chonkie**: Advanced document chunking and embedding library
  - Uses `AutoEmbeddings` for flexible embedding model selection
  - Uses `LateChunker` for optimized document chunking based on the embedding model's max sequence length

---
### **FrontEnd**
- **Next.js 15.2.3**: React framework featuring App Router, server components, automatic code-splitting, and optimized rendering.
- **React 19.0.0**: JavaScript library for building user interfaces.
- **TypeScript**: Static type-checking for JavaScript, enhancing code quality and developer experience.
- **Vercel AI SDK Kit UI Stream Protocol**: Used to build a scalable chat UI.
- **Tailwind CSS 4.x**: Utility-first CSS framework for building custom UI designs.
- **Shadcn**: Headless components library.
- **Lucide React**: Icon set implemented as React components.
- **Framer Motion**: Animation library for React.
- **Sonner**: Toast notification library.
- **Geist**: Font family from Vercel.
- **React Hook Form**: Form state management and validation.
- **Zod**: TypeScript-first schema validation with static type inference.
- **@hookform/resolvers**: Resolvers for using validation libraries with React Hook Form.
- **@tanstack/react-table**: Headless UI for building powerful tables & datagrids.
### **DevOps**
- **Docker**: Container platform for consistent deployment across environments
- **Docker Compose**: Tool for defining and running multi-container Docker applications
- **pgAdmin**: Web-based PostgreSQL administration tool included in Docker setup
### **Extension**
Manifest v3 on Plasmo
## Future Work
- Add More Connectors.
- Patch minor bugs.
- Document Chat **[REIMPLEMENT]**
- Document Podcasts
## Contribute
Contributions are very welcome! A contribution can be as small as a ⭐ or even finding and creating issues.
Fine-tuning the backend is always desired.
For detailed contribution guidelines, please see our [CONTRIBUTING.md](CONTRIBUTING.md) file.
## Star History
<a href="https://www.star-history.com/#MODSetter/SurfSense&Date">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=MODSetter/SurfSense&type=Date&theme=dark" />
<source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=MODSetter/SurfSense&type=Date" />
<img alt="Star History Chart" src="https://api.star-history.com/svg?repos=MODSetter/SurfSense&type=Date" />
</picture>
</a>