diff --git a/README.md b/README.md index 4cb51fa42..e412fe2be 100644 --- a/README.md +++ b/README.md @@ -72,97 +72,23 @@ Join the [SurfSense Discord](https://discord.gg/ejRNvftDp9) and help shape the f ## How to get started? -### PRE-START CHECKS +### Installation Options -#### PGVector -Make sure pgvector extension is installed on your machine. Setup Guide https://github.com/pgvector/pgvector?tab=readme-ov-file#installation +SurfSense provides two installation methods: -#### File Uploading Support -For File uploading you need Unstructured.io API key. You can get it at http://platform.unstructured.io/ +1. **[Docker Installation](https://www.surfsense.net/docs/docker-installation)** - The easiest way to get SurfSense up and running with all dependencies containerized. Less Customization. -#### Auth -SurfSense now only works with Google OAuth. Make sure to set your OAuth Client at https://developers.google.com/identity/protocols/oauth2 . We need client id and client secret for backend. Make sure to enable people api and add the required scopes under data access (openid, userinfo.email, userinfo.profile) +2. **[Manual Installation (Recommended)](https://www.surfsense.net/docs/manual-installation)** - For users who prefer more control over their setup or need to customize their deployment. -![gauth](https://github.com/user-attachments/assets/80d60fe5-889b-48a6-b947-200fdaf544c1) +Both installation guides include detailed OS-specific instructions for Windows, macOS, and Linux. -#### LLM Observability -One easy way to observe SurfSense Researcher Agent is to use LangSmith. 
Get its API KEY from https://smith.langchain.com/ +Before installation, make sure to complete the [prerequisite setup steps](https://www.surfsense.net/docs/) including: +- PGVector setup +- Google OAuth configuration +- Unstructured.io API key +- Other required API keys -**Open AI LLMS** -![openai_langraph](https://github.com/user-attachments/assets/b1f4c7a1-0a66-4d21-9053-2e09a5634f95) - - -**Ollama LLMS** -![ollama_langgraph](https://github.com/user-attachments/assets/5b6c870e-095c-4368-86e6-f7488e0fca28) - - -#### Crawler Support -SurfSense currently uses [Firecrawl.py](https://www.firecrawl.dev/) right now. Playwright crawler support will be added soon. - - -## Quick Start - -### Preferred Method: Docker Setup -The recommended way to run SurfSense is using Docker, which ensures consistent environment across different systems. - -1. Make sure you have Docker and Docker Compose installed -2. Follow the detailed instructions in our [Docker Setup Guide](DOCKER_SETUP.md) - -```bash -# Start all services with one command -docker-compose up --build -``` - ---- -### Alternative: Manual Setup - -### Backend (./surfsense_backend) -This is the core of SurfSense. Before we begin let's look at `.env` variables' that we need to successfully setup SurfSense. - -|ENV VARIABLE|DESCRIPTION| -|--|--| -| DATABASE_URL| Your PostgreSQL database connection string. Eg. `postgresql+asyncpg://postgres:postgres@localhost:5432/surfsense`| -| SECRET_KEY| JWT Secret key used for authentication. Should be a secure random string. Eg. `SURFSENSE_SECRET_KEY_123456789`| -| GOOGLE_OAUTH_CLIENT_ID| Google OAuth client ID obtained from Google Cloud Console when setting up OAuth authentication| -| GOOGLE_OAUTH_CLIENT_SECRET| Google OAuth client secret obtained from Google Cloud Console when setting up OAuth authentication| -| NEXT_FRONTEND_URL| URL where your frontend application is hosted. Eg. `http://localhost:3000`| -| EMBEDDING_MODEL| Name of the embedding model to use for vector embeddings. 
Currently works with Sentence Transformers only. Expect other embeddings soon. Eg. `mixedbread-ai/mxbai-embed-large-v1`| -| RERANKERS_MODEL_NAME| Name of the reranker model for search result reranking. Eg. `ms-marco-MiniLM-L-12-v2`| -| RERANKERS_MODEL_TYPE| Type of reranker model being used. Eg. `flashrank`| -| FAST_LLM| LiteLLM routed Smaller, faster LLM for quick responses. Eg. `openai/gpt-4o-mini`, `ollama/deepseek-r1:8b`| -| STRATEGIC_LLM| LiteLLM routed Advanced LLM for complex reasoning tasks. Eg. `openai/gpt-4o`, `ollama/gemma3:12b`| -| LONG_CONTEXT_LLM| LiteLLM routed LLM capable of handling longer context windows. Eg. `gemini/gemini-2.0-flash`, `ollama/deepseek-r1:8b`| -| UNSTRUCTURED_API_KEY| API key for Unstructured.io service for document parsing| -| FIRECRAWL_API_KEY| API key for Firecrawl service for web crawling and data extraction| - -IMPORTANT: Since LLM calls are routed through LiteLLM make sure to include API keys of LLM models you are using. For example if you used `openai/gpt-4o` make sure to include OpenAI API Key `OPENAI_API_KEY` or if you use `gemini/gemini-2.0-flash` then you include `GEMINI_API_KEY`. - -You can also integrate any LLM just follow this https://docs.litellm.ai/docs/providers - -Now once you have everything let's proceed to run SurfSense. -1. Install `uv` : https://docs.astral.sh/uv/getting-started/installation/ -2. Now just run this command to install dependencies i.e `uv sync` -3. That's it. Now just run the `main.py` file using `uv run main.py`. You can also optionally pass `--reload` as an argument to enable hot reloading. -4. If everything worked fine you should see screen like this. - -![backend](https://i.ibb.co/542Vhqw/backendrunning.png) - ---- - -### FrontEnd (./surfsense_web) - -For local frontend setup just fill out the `.env` file of frontend. - -|ENV VARIABLE|DESCRIPTION| -|--|--| -| NEXT_PUBLIC_FASTAPI_BACKEND_URL | Give hosted backend url here. Eg. `http://localhost:8000`| - -1. 
Now install dependencies using `pnpm install` -2. Run it using `pnpm run dev` - -You should see your Next.js frontend running at `localhost:3000` - -#### Some FrontEnd Screens +## Screenshots **Search Spaces** @@ -180,43 +106,13 @@ You should see your Next.js frontend running at `localhost:3000` ![chat](https://github.com/user-attachments/assets/bb352d52-1c6d-4020-926b-722d0b98b491) ---- - -### Extension (./surfsense_browser_extension) - -Extension is in plasmo framework which is a cross browser extension framework. Extension main usecase is to save any webpages protected beyond authentication. - -For building extension just fill out the `.env` file of frontend. - -|ENV VARIABLE|DESCRIPTION| -|--|--| -| PLASMO_PUBLIC_BACKEND_URL| SurfSense Backend URL eg. "http://127.0.0.1:8000" | - -Build the extension for your favorite browser using this guide: https://docs.plasmo.com/framework/workflows/build#with-a-specific-target - -When you load and start the extension you should see a Apu page like this +**Browser Extension** ![ext1](https://github.com/user-attachments/assets/1f042b7a-6349-422b-94fb-d40d0df16c40) - - -After filling in your SurfSense API key you should be able to use extension now. - - ![ext2](https://github.com/user-attachments/assets/a9b9f1aa-2677-404d-b0a0-c1b2dddf24a7) - -|Options|Explanations| -|--|--| -| Search Space | Search Space to save your dynamic bookmarks. | -| Clear Inactive History Sessions | It clears the saved content for Inactive Tab Sessions. 
| -| Save Current Webpage Snapshot | Stores the current webpage session info into SurfSense history store| -| Save to SurfSense | Processes the SurfSense History Store & Initiates a Save Job | - - - - -## Tech Stack +## Tech Stack ### **BackEnd** diff --git a/surfsense_backend/draw.py b/surfsense_backend/draw.py new file mode 100644 index 000000000..ec55f79a5 --- /dev/null +++ b/surfsense_backend/draw.py @@ -0,0 +1,5 @@ +from app.agents.researcher.graph import graph as researcher_graph +from app.agents.researcher.sub_section_writer.graph import graph as sub_section_writer_graph + +print(researcher_graph.get_graph().draw_mermaid()) +print(sub_section_writer_graph.get_graph().draw_mermaid()) \ No newline at end of file diff --git a/surfsense_web/Dockerfile b/surfsense_web/Dockerfile index 2ce0b0f4d..4b3c97af8 100644 --- a/surfsense_web/Dockerfile +++ b/surfsense_web/Dockerfile @@ -8,8 +8,15 @@ RUN npm install -g pnpm # Copy package files COPY package.json pnpm-lock.yaml ./ -# Install dependencies -RUN pnpm install +# First copy the config file to avoid fumadocs-mdx postinstall error +COPY source.config.ts ./ +COPY content ./content + +# Install dependencies with --ignore-scripts to skip postinstall +RUN pnpm install --ignore-scripts + +# Now run the postinstall script manually +RUN pnpm fumadocs-mdx # Copy source code COPY . . 
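The `page.tsx` diff that follows performs the same lookup in several places (the memoized connector sources and both branches of `getCitationSource`): take an assistant message, find its last `SOURCES` annotation, and search the flattened sources for an ID. A minimal sketch of that lookup as pure functions is shown here. The type shapes are simplified assumptions based only on the fields the diff reads, and `latestSourcesContent`, `latestAssistantMessage`, and `findCitationSource` are hypothetical names, not part of the PR.

```typescript
// Simplified shapes (assumptions): only the fields the diff actually reads.
interface SourceItem { id: number; title: string; description: string; url: string }
interface ConnectorSource { type: string; sources?: SourceItem[] }
interface Source extends SourceItem { connectorType: string }
interface Annotation { type: string; content?: unknown }
interface ChatMessage { role: string; annotations?: Annotation[] }

// Hypothetical helper: content of the last SOURCES annotation on an assistant message, or null.
function latestSourcesContent(message: ChatMessage | undefined): ConnectorSource[] | null {
  if (!message || message.role !== 'assistant' || !message.annotations) return null;
  const sourcesAnnotations = message.annotations.filter(a => a.type === 'SOURCES');
  if (sourcesAnnotations.length === 0) return null;
  const content = sourcesAnnotations[sourcesAnnotations.length - 1].content;
  return Array.isArray(content) ? (content as ConnectorSource[]) : null;
}

// Hypothetical helper: the most recent assistant message, if any.
function latestAssistantMessage(messages: ChatMessage[]): ChatMessage | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === 'assistant') return messages[i];
  }
  return undefined;
}

// Resolve a citation ID against a specific message, or against the latest
// assistant message when no index is given (mirrors the optional messageIndex).
function findCitationSource(
  messages: ChatMessage[],
  citationId: number,
  messageIndex?: number
): Source | null {
  const message = messageIndex === undefined
    ? latestAssistantMessage(messages)
    : messages[messageIndex];
  const connectors = latestSourcesContent(message);
  if (!connectors) return null;

  for (const connector of connectors) {
    if (!Array.isArray(connector.sources)) continue;
    for (const source of connector.sources) {
      if (source.id === citationId) {
        return {
          id: source.id,
          title: source.title,
          description: source.description,
          url: source.url,
          connectorType: connector.type,
        };
      }
    }
  }
  return null;
}
```

With helpers like these, a component-level `getCitationSource` could be a thin `useCallback` wrapper around `findCitationSource(messages, id, index)`, which keeps the two branches from drifting apart.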
diff --git a/surfsense_web/app/dashboard/[search_space_id]/researcher/[chat_id]/page.tsx b/surfsense_web/app/dashboard/[search_space_id]/researcher/[chat_id]/page.tsx index 8156f6d2a..d371e9e53 100644 --- a/surfsense_web/app/dashboard/[search_space_id]/researcher/[chat_id]/page.tsx +++ b/surfsense_web/app/dashboard/[search_space_id]/researcher/[chat_id]/page.tsx @@ -240,7 +240,7 @@ const SourcesDialogContent = ({ const ChatPage = () => { const [token, setToken] = React.useState(null); const [activeTab, setActiveTab] = useState(""); - const [dialogOpen, setDialogOpen] = useState(false); + const [dialogOpenId, setDialogOpenId] = useState(null); const [sourcesPage, setSourcesPage] = useState(1); const [expandedSources, setExpandedSources] = useState(false); const [canScrollLeft, setCanScrollLeft] = useState(false); @@ -260,6 +260,13 @@ const ChatPage = () => { const { search_space_id, chat_id } = useParams(); + // Function to scroll terminal to bottom + const scrollTerminalToBottom = () => { + if (terminalMessagesRef.current) { + terminalMessagesRef.current.scrollTop = terminalMessagesRef.current.scrollHeight; + } + }; + // Get token from localStorage on client side only React.useEffect(() => { setToken(localStorage.getItem('surfsense_bearer_token')); @@ -469,54 +476,60 @@ const ChatPage = () => { updateChat(); }, [messages, status, chat_id, researchMode, selectedConnectors, search_space_id]); - // Log messages whenever they update and extract annotations from the latest assistant message if available - useEffect(() => { - console.log('Messages updated:', messages); - - // Extract annotations from the latest assistant message if available + // Memoize connector sources to prevent excessive re-renders + const processedConnectorSources = React.useMemo(() => { + if (messages.length === 0) return connectorSources; + + // Only process when we have a complete message (not streaming) + if (status !== 'ready') return connectorSources; + + // Find the latest assistant message 
const assistantMessages = messages.filter(msg => msg.role === 'assistant'); - if (assistantMessages.length > 0) { - const latestAssistantMessage = assistantMessages[assistantMessages.length - 1]; - if (latestAssistantMessage?.annotations) { - const annotations = latestAssistantMessage.annotations as any[]; - - // Debug log to track streaming annotations - if (process.env.NODE_ENV === 'development') { - console.log('Streaming annotations:', annotations); - - // Log counts of each annotation type - const terminalInfoCount = annotations.filter(a => a.type === 'TERMINAL_INFO').length; - const sourcesCount = annotations.filter(a => a.type === 'SOURCES').length; - const answerCount = annotations.filter(a => a.type === 'ANSWER').length; - - console.log(`Annotation counts - Terminal: ${terminalInfoCount}, Sources: ${sourcesCount}, Answer: ${answerCount}`); - } - - // Process SOURCES annotation - get the last one to ensure we have the latest - const sourcesAnnotations = annotations.filter( - (annotation) => annotation.type === 'SOURCES' - ); - - if (sourcesAnnotations.length > 0) { - // Get the last SOURCES annotation to ensure we have the most recent one - const latestSourcesAnnotation = sourcesAnnotations[sourcesAnnotations.length - 1]; - if (latestSourcesAnnotation.content) { - setConnectorSources(latestSourcesAnnotation.content); - } - } - - // Check for terminal info annotations and scroll terminal to bottom if they exist - const terminalInfoAnnotations = annotations.filter( - (annotation) => annotation.type === 'TERMINAL_INFO' - ); - - if (terminalInfoAnnotations.length > 0) { - // Schedule scrolling after the DOM has been updated - setTimeout(scrollTerminalToBottom, 100); - } - } + if (assistantMessages.length === 0) return connectorSources; + + const latestAssistantMessage = assistantMessages[assistantMessages.length - 1]; + if (!latestAssistantMessage?.annotations) return connectorSources; + + // Find the latest SOURCES annotation + const annotations = 
latestAssistantMessage.annotations as any[]; + const sourcesAnnotations = annotations.filter(a => a.type === 'SOURCES'); + + if (sourcesAnnotations.length === 0) return connectorSources; + + const latestSourcesAnnotation = sourcesAnnotations[sourcesAnnotations.length - 1]; + if (!latestSourcesAnnotation.content) return connectorSources; + + // Use this content if it differs from current + return latestSourcesAnnotation.content; + }, [messages, status, connectorSources]); + + // Update connector sources when processed value changes + useEffect(() => { + if (processedConnectorSources !== connectorSources) { + setConnectorSources(processedConnectorSources); } - }, [messages]); + }, [processedConnectorSources, connectorSources]); + + // Check and scroll terminal when terminal info is available + useEffect(() => { + if (messages.length === 0 || status !== 'ready') return; + + // Find the latest assistant message + const assistantMessages = messages.filter(msg => msg.role === 'assistant'); + if (assistantMessages.length === 0) return; + + const latestAssistantMessage = assistantMessages[assistantMessages.length - 1]; + if (!latestAssistantMessage?.annotations) return; + + // Check for terminal info annotations + const annotations = latestAssistantMessage.annotations as any[]; + const terminalInfoAnnotations = annotations.filter(a => a.type === 'TERMINAL_INFO'); + + if (terminalInfoAnnotations.length > 0) { + // Schedule scrolling after the DOM has been updated + setTimeout(scrollTerminalToBottom, 100); + } + }, [messages, status]); // Custom handleSubmit function to include selected connectors and answer type const handleSubmit = (e: React.FormEvent) => { @@ -543,24 +556,22 @@ const ChatPage = () => { messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' }); }; - // Function to scroll terminal to bottom - const scrollTerminalToBottom = () => { - if (terminalMessagesRef.current) { - terminalMessagesRef.current.scrollTop = terminalMessagesRef.current.scrollHeight; - 
} - }; - // Scroll to bottom when messages change useEffect(() => { scrollToBottom(); }, [messages]); - // Set activeTab when connectorSources change - useEffect(() => { - if (connectorSources.length > 0) { - setActiveTab(connectorSources[0].type); - } + // Set activeTab when connectorSources change using a memoized value + const activeTabValue = React.useMemo(() => { + return connectorSources.length > 0 ? connectorSources[0].type : ""; }, [connectorSources]); + + // Update activeTab when the memoized value changes + useEffect(() => { + if (activeTabValue && activeTabValue !== activeTab) { + setActiveTab(activeTabValue); + } + }, [activeTabValue, activeTab]); // Scroll terminal to bottom when expanded useEffect(() => { @@ -617,49 +628,89 @@ const ChatPage = () => { }; // Function to get a citation source by ID - const getCitationSource = (citationId: number): Source | null => { + const getCitationSource = React.useCallback((citationId: number, messageIndex?: number): Source | null => { if (!messages || messages.length === 0) return null; - // Find the latest assistant message - const assistantMessages = messages.filter(msg => msg.role === 'assistant'); - if (assistantMessages.length === 0) return null; + // If no specific message index is provided, use the latest assistant message + if (messageIndex === undefined) { + // Find the latest assistant message + const assistantMessages = messages.filter(msg => msg.role === 'assistant'); + if (assistantMessages.length === 0) return null; - const latestAssistantMessage = assistantMessages[assistantMessages.length - 1]; - if (!latestAssistantMessage?.annotations) return null; + const latestAssistantMessage = assistantMessages[assistantMessages.length - 1]; + if (!latestAssistantMessage?.annotations) return null; - // Find all SOURCES annotations - const annotations = latestAssistantMessage.annotations as any[]; - const sourcesAnnotations = annotations.filter( - (annotation) => annotation.type === 'SOURCES' - ); + // Find 
all SOURCES annotations + const annotations = latestAssistantMessage.annotations as any[]; + const sourcesAnnotations = annotations.filter( + (annotation) => annotation.type === 'SOURCES' + ); - // Get the latest SOURCES annotation - if (sourcesAnnotations.length === 0) return null; - const latestSourcesAnnotation = sourcesAnnotations[sourcesAnnotations.length - 1]; + // Get the latest SOURCES annotation + if (sourcesAnnotations.length === 0) return null; + const latestSourcesAnnotation = sourcesAnnotations[sourcesAnnotations.length - 1]; - if (!latestSourcesAnnotation.content) return null; + if (!latestSourcesAnnotation.content) return null; - // Flatten all sources from all connectors - const allSources: Source[] = []; - latestSourcesAnnotation.content.forEach((connector: ConnectorSource) => { - if (connector.sources && Array.isArray(connector.sources)) { - connector.sources.forEach((source: SourceItem) => { - allSources.push({ - id: source.id, - title: source.title, - description: source.description, - url: source.url, - connectorType: connector.type + // Flatten all sources from all connectors + const allSources: Source[] = []; + latestSourcesAnnotation.content.forEach((connector: ConnectorSource) => { + if (connector.sources && Array.isArray(connector.sources)) { + connector.sources.forEach((source: SourceItem) => { + allSources.push({ + id: source.id, + title: source.title, + description: source.description, + url: source.url, + connectorType: connector.type + }); }); - }); - } - }); + } + }); - // Find the source with the matching ID - const foundSource = allSources.find(source => source.id === citationId); + // Find the source with the matching ID + const foundSource = allSources.find(source => source.id === citationId); - return foundSource || null; - }; + return foundSource || null; + } else { + // Use the specific message by index + const message = messages[messageIndex]; + if (!message || message.role !== 'assistant' || !message.annotations) return 
null; + + // Find all SOURCES annotations + const annotations = message.annotations as any[]; + const sourcesAnnotations = annotations.filter( + (annotation) => annotation.type === 'SOURCES' + ); + + // Get the latest SOURCES annotation + if (sourcesAnnotations.length === 0) return null; + const latestSourcesAnnotation = sourcesAnnotations[sourcesAnnotations.length - 1]; + + if (!latestSourcesAnnotation.content) return null; + + // Flatten all sources from all connectors + const allSources: Source[] = []; + latestSourcesAnnotation.content.forEach((connector: ConnectorSource) => { + if (connector.sources && Array.isArray(connector.sources)) { + connector.sources.forEach((source: SourceItem) => { + allSources.push({ + id: source.id, + title: source.title, + description: source.description, + url: source.url, + connectorType: connector.type + }); + }); + } + }); + + // Find the source with the matching ID + const foundSource = allSources.find(source => source.id === citationId); + + return foundSource || null; + } + }, [messages]); return ( <> @@ -685,7 +736,11 @@ const ChatPage = () => {
- + getCitationSource(id, index)} + className="text-sm" + />
@@ -856,7 +911,7 @@ const ChatPage = () => { ))} {connector.sources.length > INITIAL_SOURCES_DISPLAY && ( - setDialogOpen(open)}> + setDialogOpenId(open ? connector.id : null)}> - - - - + + + )} ); -}; +}); + +Citation.displayName = 'Citation'; /** * Function to render text with citations diff --git a/surfsense_web/components/markdown-viewer.tsx b/surfsense_web/components/markdown-viewer.tsx index 03fe7bada..1df650114 100644 --- a/surfsense_web/components/markdown-viewer.tsx +++ b/surfsense_web/components/markdown-viewer.tsx @@ -1,4 +1,4 @@ -import React from "react"; +import React, { useMemo } from "react"; import ReactMarkdown from "react-markdown"; import rehypeRaw from "rehype-raw"; import rehypeSanitize from "rehype-sanitize"; @@ -14,75 +14,87 @@ interface MarkdownViewerProps { } export function MarkdownViewer({ content, className, getCitationSource }: MarkdownViewerProps) { + // Memoize the markdown components to prevent unnecessary re-renders + const components = useMemo(() => { + return { + // Define custom components for markdown elements + p: ({node, children, ...props}: any) => { + // If there's no getCitationSource function, just render normally + if (!getCitationSource) { + return

{children}

; + } + + // Process citations within paragraph content + return

{processCitationsInReactChildren(children, getCitationSource)}

; + }, + a: ({node, children, ...props}: any) => { + // Process citations within link content if needed + const processedChildren = getCitationSource + ? processCitationsInReactChildren(children, getCitationSource) + : children; + return {processedChildren}; + }, + li: ({node, children, ...props}: any) => { + // Process citations within list item content + const processedChildren = getCitationSource + ? processCitationsInReactChildren(children, getCitationSource) + : children; + return
  • {processedChildren}
  • ; + }, + ul: ({node, ...props}: any) =>
      , + ol: ({node, ...props}: any) =>
        , + h1: ({node, children, ...props}: any) => { + const processedChildren = getCitationSource + ? processCitationsInReactChildren(children, getCitationSource) + : children; + return

        {processedChildren}

        ; + }, + h2: ({node, children, ...props}: any) => { + const processedChildren = getCitationSource + ? processCitationsInReactChildren(children, getCitationSource) + : children; + return

        {processedChildren}

        ; + }, + h3: ({node, children, ...props}: any) => { + const processedChildren = getCitationSource + ? processCitationsInReactChildren(children, getCitationSource) + : children; + return

        {processedChildren}

        ; + }, + h4: ({node, children, ...props}: any) => { + const processedChildren = getCitationSource + ? processCitationsInReactChildren(children, getCitationSource) + : children; + return

        {processedChildren}

        ; + }, + blockquote: ({node, ...props}: any) =>
        , + hr: ({node, ...props}: any) =>
        , + img: ({node, ...props}: any) => , + table: ({node, ...props}: any) =>
        , + th: ({node, ...props}: any) =>
        , + td: ({node, ...props}: any) => , + code: ({node, className, children, ...props}: any) => { + const match = /language-(\w+)/.exec(className || ''); + const isInline = !match; + return isInline + ? {children} + : ( +
        +
        +                {children}
        +              
        +
        + ); + } + }; + }, [getCitationSource]); + return (
        { - // If there's no getCitationSource function, just render normally - if (!getCitationSource) { - return

        {children}

        ; - } - - // Process citations within paragraph content - return

        {processCitationsInReactChildren(children, getCitationSource)}

        ; - }, - a: ({node, children, ...props}) => { - // Process citations within link content if needed - const processedChildren = getCitationSource - ? processCitationsInReactChildren(children, getCitationSource) - : children; - return {processedChildren}; - }, - ul: ({node, ...props}) =>
          , - ol: ({node, ...props}) =>
            , - h1: ({node, children, ...props}) => { - const processedChildren = getCitationSource - ? processCitationsInReactChildren(children, getCitationSource) - : children; - return

            {processedChildren}

            ; - }, - h2: ({node, children, ...props}) => { - const processedChildren = getCitationSource - ? processCitationsInReactChildren(children, getCitationSource) - : children; - return

            {processedChildren}

            ; - }, - h3: ({node, children, ...props}) => { - const processedChildren = getCitationSource - ? processCitationsInReactChildren(children, getCitationSource) - : children; - return

            {processedChildren}

            ; - }, - h4: ({node, children, ...props}) => { - const processedChildren = getCitationSource - ? processCitationsInReactChildren(children, getCitationSource) - : children; - return

            {processedChildren}

            ; - }, - blockquote: ({node, ...props}) =>
            , - hr: ({node, ...props}) =>
            , - img: ({node, ...props}) => , - table: ({node, ...props}) =>
            , - th: ({node, ...props}) =>
            , - td: ({node, ...props}) => , - code: ({node, className, children, ...props}: any) => { - const match = /language-(\w+)/.exec(className || ''); - const isInline = !match; - return isInline - ? {children} - : ( -
            -
            -                    {children}
            -                  
            -
            - ); - } - }} + components={components} > {content} @@ -91,7 +103,7 @@ export function MarkdownViewer({ content, className, getCitationSource }: Markdo } // Helper function to process citations within React children -function processCitationsInReactChildren(children: React.ReactNode, getCitationSource: (id: number) => Source | null): React.ReactNode { +const processCitationsInReactChildren = (children: React.ReactNode, getCitationSource: (id: number) => Source | null): React.ReactNode => { // If children is not an array or string, just return it if (!children || (typeof children !== 'string' && !Array.isArray(children))) { return children; @@ -113,10 +125,10 @@ function processCitationsInReactChildren(children: React.ReactNode, getCitationS } return children; -} +}; // Process citation references in text content -function processCitationsInText(text: string, getCitationSource: (id: number) => Source | null): React.ReactNode[] { +const processCitationsInText = (text: string, getCitationSource: (id: number) => Source | null): React.ReactNode[] => { // Use improved regex to catch citation numbers more reliably // This will match patterns like [1], [42], etc. 
including when they appear at the end of a line or sentence const citationRegex = /\[(\d+)\]/g; @@ -124,14 +136,8 @@ function processCitationsInText(text: string, getCitationSource: (id: number) => let lastIndex = 0; let match; let position = 0; - - // Debug log for troubleshooting - console.log("Processing citations in text:", text); while ((match = citationRegex.exec(text)) !== null) { - // Log each match for debugging - console.log("Citation match found:", match[0], "at index", match.index); - // Add text before the citation if (match.index > lastIndex) { parts.push(text.substring(lastIndex, match.index)); @@ -141,9 +147,6 @@ function processCitationsInText(text: string, getCitationSource: (id: number) => const citationId = parseInt(match[1], 10); const source = getCitationSource(citationId); - // Log the citation details - console.log("Citation ID:", citationId, "Source:", source ? "found" : "not found"); - parts.push( } return parts; -} \ No newline at end of file +}; \ No newline at end of file diff --git a/surfsense_web/content/docs/docker-installation.mdx b/surfsense_web/content/docs/docker-installation.mdx new file mode 100644 index 000000000..2a373d048 --- /dev/null +++ b/surfsense_web/content/docs/docker-installation.mdx @@ -0,0 +1,168 @@ +--- +title: Docker Installation +description: Setting up SurfSense using Docker +full: true +--- +## Known Limitations + +⚠️ **Important Note:** Currently, the following features have limited functionality when running in Docker: + +- **Ollama integration:** Local Ollama models do not work when running SurfSense in Docker. Please use other LLM providers like OpenAI or Gemini instead. +- **Web crawler functionality:** The web crawler feature currently doesn't work properly within the Docker environment. + +We're actively working to resolve these limitations in future releases. 
+ + +# Docker Installation + +This guide explains how to run SurfSense using Docker Compose, which is the preferred and recommended method for deployment. + +## Prerequisites + +Before you begin, ensure you have: + +- [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/install/) installed on your machine +- [Git](https://git-scm.com/downloads) (to clone the repository) +- Completed all the [prerequisite setup steps](/docs) including: + - PGVector setup + - Google OAuth configuration + - Unstructured.io API key + - Other required API keys + +## Installation Steps + +1. **Configure Environment Variables** + + Set up the necessary environment variables: + + **Linux/macOS:** + ```bash + # Copy example environment files + cp surfsense_backend/.env.example surfsense_backend/.env + cp surfsense_web/.env.example surfsense_web/.env + ``` + + **Windows (Command Prompt):** + ```cmd + copy surfsense_backend\.env.example surfsense_backend\.env + copy surfsense_web\.env.example surfsense_web\.env + ``` + + **Windows (PowerShell):** + ```powershell + Copy-Item -Path surfsense_backend\.env.example -Destination surfsense_backend\.env + Copy-Item -Path surfsense_web\.env.example -Destination surfsense_web\.env + ``` + + Edit both `.env` files and fill in the required values: + + **Backend Environment Variables:** + + | ENV VARIABLE | DESCRIPTION | + |--------------|-------------| + | DATABASE_URL | PostgreSQL connection string (e.g., `postgresql+asyncpg://postgres:postgres@localhost:5432/surfsense`) | + | SECRET_KEY | JWT Secret key for authentication (should be a secure random string) | + | GOOGLE_OAUTH_CLIENT_ID | Google OAuth client ID obtained from Google Cloud Console | + | GOOGLE_OAUTH_CLIENT_SECRET | Google OAuth client secret obtained from Google Cloud Console | + | NEXT_FRONTEND_URL | URL where your frontend application is hosted (e.g., `http://localhost:3000`) | + | EMBEDDING_MODEL | Name of the embedding model (e.g., 
`mixedbread-ai/mxbai-embed-large-v1`) | + | RERANKERS_MODEL_NAME | Name of the reranker model (e.g., `ms-marco-MiniLM-L-12-v2`) | + | RERANKERS_MODEL_TYPE | Type of reranker model (e.g., `flashrank`) | + | FAST_LLM | LiteLLM routed smaller, faster LLM (e.g., `openai/gpt-4o-mini`, `ollama/deepseek-r1:8b`) | + | STRATEGIC_LLM | LiteLLM routed advanced LLM for complex tasks (e.g., `openai/gpt-4o`, `ollama/gemma3:12b`) | + | LONG_CONTEXT_LLM | LiteLLM routed LLM for longer context windows (e.g., `gemini/gemini-2.0-flash`, `ollama/deepseek-r1:8b`) | + | UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing | + | FIRECRAWL_API_KEY | API key for Firecrawl service for web crawling | + + Include API keys for the LLM providers you're using. For example: + - `OPENAI_API_KEY`: If using OpenAI models + - `GEMINI_API_KEY`: If using Google Gemini models + + For other LLM providers, refer to the [LiteLLM documentation](https://docs.litellm.ai/docs/providers). + + **Frontend Environment Variables:** + + | ENV VARIABLE | DESCRIPTION | + |--------------|-------------| + | NEXT_PUBLIC_FASTAPI_BACKEND_URL | URL of the backend service (e.g., `http://localhost:8000`) | + +2. **Build and Start Containers** + + Start the Docker containers: + + **Linux/macOS/Windows:** + ```bash + docker-compose up --build + ``` + + To run in detached mode (in the background): + + **Linux/macOS/Windows:** + ```bash + docker-compose up -d + ``` + + **Note for Windows users:** If you're using a newer Docker Desktop version that ships Compose V2, you might need to use `docker compose` (with a space) instead of `docker-compose`. + +3. 
**Access the Applications** + + Once the containers are running, you can access: + - Frontend: [http://localhost:3000](http://localhost:3000) + - Backend API: [http://localhost:8000](http://localhost:8000) + - API Documentation: [http://localhost:8000/docs](http://localhost:8000/docs) + +## Useful Docker Commands + +### Container Management + +- **Stop containers:** + + **Linux/macOS/Windows:** + ```bash + docker-compose down + ``` + +- **View logs:** + + **Linux/macOS/Windows:** + ```bash + # All services + docker-compose logs -f + + # Specific service + docker-compose logs -f backend + docker-compose logs -f frontend + docker-compose logs -f db + ``` + +- **Restart a specific service:** + + **Linux/macOS/Windows:** + ```bash + docker-compose restart backend + ``` + +- **Execute commands in a running container:** + + **Linux/macOS/Windows:** + ```bash + # Backend + docker-compose exec backend python -m pytest + + # Frontend + docker-compose exec frontend pnpm lint + ``` + +## Troubleshooting + +- **Linux/macOS:** If you encounter permission errors, you may need to run the docker commands with `sudo`. +- **Windows:** If you see access denied errors, make sure you're running Command Prompt or PowerShell as Administrator. +- If ports are already in use, modify the port mappings in the `docker-compose.yml` file. +- For backend dependency issues, check the `Dockerfile` in the backend directory. +- For frontend dependency issues, check the `Dockerfile` in the frontend directory. +- **Windows-specific:** If you encounter line ending issues (CRLF vs LF), configure Git to handle line endings properly with `git config --global core.autocrlf true` before cloning the repository. + + +## Next Steps + +Once your installation is complete, you can start using SurfSense! Navigate to the frontend URL and log in using your Google account. 
\ No newline at end of file
diff --git a/surfsense_web/content/docs/index.mdx b/surfsense_web/content/docs/index.mdx
index e99f597dc..f3411b897 100644
--- a/surfsense_web/content/docs/index.mdx
+++ b/surfsense_web/content/docs/index.mdx
@@ -1,7 +1,99 @@
 ---
-title: Welcome Docs
+title: Prerequisites
+description: Required setup steps before setting up SurfSense
+full: true
 ---
-
-## Introduction
-
-I love Docs.
\ No newline at end of file
+
+## PGVector Installation Guide
+
+SurfSense requires the pgvector extension for PostgreSQL:
+
+### Linux and Mac
+
+Compile and install the extension (supports Postgres 13+)
+
+```sh
+cd /tmp
+git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
+cd pgvector
+make
+make install # may need sudo
+```
+
+See the [installation notes](https://github.com/pgvector/pgvector/tree/master#installation-notes---linux-and-mac) if you run into issues
+
+### Windows
+
+Ensure [C++ support in Visual Studio](https://learn.microsoft.com/en-us/cpp/build/building-on-the-command-line?view=msvc-170#download-and-install-the-tools) is installed, and run:
+
+```cmd
+call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
+```
+
+Note: The exact path will vary depending on your Visual Studio version and edition
+
+Then use `nmake` to build:
+
+```cmd
+set "PGROOT=C:\Program Files\PostgreSQL\16"
+cd %TEMP%
+git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
+cd pgvector
+nmake /F Makefile.win
+nmake /F Makefile.win install
+```
+
+See the [installation notes](https://github.com/pgvector/pgvector/tree/master#installation-notes---windows) if you run into issues
+
+---
+
+## Google OAuth Setup
+
+SurfSense user management and authentication rely on Google OAuth. Let's set it up.
+
+1. Log in to your [Google Developer Console](https://console.cloud.google.com/)
+2. Enable People API.
+![Google Developer Console People API](/docs/google_oauth_people_api.png)
+3. Set up OAuth consent screen.
+![Google Developer Console OAuth consent screen](/docs/google_oauth_screen.png)
+4. Create OAuth client ID and secret.
+![Google Developer Console OAuth client ID](/docs/google_oauth_client.png)
+5. The finished configuration should look like this.
+![Google Developer Console Config](/docs/google_oauth_config.png)
+
+---
+
+## File Uploads
+
+Files are converted to LLM-friendly formats using [Unstructured](https://github.com/Unstructured-IO/unstructured)
+
+1. Get an Unstructured.io API key from [Unstructured Platform](https://platform.unstructured.io/)
+2. You should be able to generate API keys once registered
+![Unstructured Dashboard](/docs/unstructured.png)
+
+---
+
+## LLM Observability (Optional)
+
+This is not required for SurfSense to work, but it is always a good idea to monitor LLM interactions so that we do not have those WTH moments.
+
+1. Get a LangSmith API key from [smith.langchain.com](https://smith.langchain.com/)
+2. This helps in observing the SurfSense Researcher Agent.
+![LangSmith](/docs/langsmith.png)
+
+---
+
+## Crawler
+
+SurfSense has two options for saving webpages:
+- [SurfSense Extension](https://github.com/MODSetter/SurfSense/tree/main/surfsense_browser_extension) (Overall better experience & ability to save private webpages, recommended)
+- Crawler (If you want to save public webpages)
+
+**NOTE:** SurfSense currently uses [Firecrawl.py](https://www.firecrawl.dev/) for web crawling. If you plan on using the crawler, you will need to create a Firecrawl account and get an API key.
+
+
+---
+
+## Next Steps
+
+Once you have all prerequisites in place, proceed to the [installation guide](/docs/installation) to set up SurfSense.
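The credentials collected in these prerequisite steps end up as environment variables during installation. As a rough sketch, using the variable names from the installation guides' environment tables and invented placeholder values (none of these are real credentials):

```bash
# Placeholder values only; replace each with the credential obtained above.
GOOGLE_OAUTH_CLIENT_ID=your-google-client-id
GOOGLE_OAUTH_CLIENT_SECRET=your-google-client-secret
UNSTRUCTURED_API_KEY=your-unstructured-api-key
# Only needed if you plan to use the crawler
FIRECRAWL_API_KEY=your-firecrawl-api-key
```

The environment tables do not list a variable name for the LangSmith key, so it is omitted here.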
\ No newline at end of file
diff --git a/surfsense_web/content/docs/installation.mdx b/surfsense_web/content/docs/installation.mdx
new file mode 100644
index 000000000..1a3f4553b
--- /dev/null
+++ b/surfsense_web/content/docs/installation.mdx
@@ -0,0 +1,21 @@
+---
+title: Installation
+description: Current ways to use SurfSense
+full: true
+---
+
+# Installing SurfSense
+
+There are two ways to install SurfSense, but both require the repository to be cloned first. Clone [SurfSense](https://github.com/MODSetter/SurfSense) and then:
+
+## Docker Installation
+
+This method provides a containerized environment with all dependencies pre-configured, at the cost of less customization.
+
+[Learn more about Docker installation](/docs/docker-installation)
+
+## Manual Installation (Preferred)
+
+For users who prefer more control over the installation process or need to customize their setup, we also provide manual installation instructions.
+
+[Learn more about Manual installation](/docs/manual-installation)
\ No newline at end of file
diff --git a/surfsense_web/content/docs/manual-installation.mdx b/surfsense_web/content/docs/manual-installation.mdx
new file mode 100644
index 000000000..477f5ef17
--- /dev/null
+++ b/surfsense_web/content/docs/manual-installation.mdx
@@ -0,0 +1,258 @@
+---
+title: Manual Installation
+description: Setting up SurfSense manually for customized deployments (Preferred)
+full: true
+---
+
+# Manual Installation (Preferred)
+
+This guide provides step-by-step instructions for setting up SurfSense without Docker. This approach gives you more control over the installation process and allows for customization of the environment.
+
+## Prerequisites
+
+Before beginning the manual installation, ensure you have completed all the [prerequisite setup steps](/docs), including:
+
+- PGVector installation
+- Google OAuth setup
+- Unstructured.io API key
+- LLM observability (optional)
+- Crawler setup (if needed)
+
+## Backend Setup
+
+The backend is the core of SurfSense. Follow these steps to set it up:
+
+### 1. Environment Configuration
+
+First, create and configure your environment variables by copying the example file:
+
+**Linux/macOS:**
+```bash
+cd surfsense_backend
+cp .env.example .env
+```
+
+**Windows (Command Prompt):**
+```cmd
+cd surfsense_backend
+copy .env.example .env
+```
+
+**Windows (PowerShell):**
+```powershell
+cd surfsense_backend
+Copy-Item -Path .env.example -Destination .env
+```
+
+Edit the `.env` file and set the following variables:
+
+| ENV VARIABLE | DESCRIPTION |
+|--------------|-------------|
+| DATABASE_URL | PostgreSQL connection string (e.g., `postgresql+asyncpg://postgres:postgres@localhost:5432/surfsense`) |
+| SECRET_KEY | JWT Secret key for authentication (should be a secure random string) |
+| GOOGLE_OAUTH_CLIENT_ID | Google OAuth client ID |
+| GOOGLE_OAUTH_CLIENT_SECRET | Google OAuth client secret |
+| NEXT_FRONTEND_URL | Frontend application URL (e.g., `http://localhost:3000`) |
+| EMBEDDING_MODEL | Name of the embedding model (e.g., `mixedbread-ai/mxbai-embed-large-v1`) |
+| RERANKERS_MODEL_NAME | Name of the reranker model (e.g., `ms-marco-MiniLM-L-12-v2`) |
+| RERANKERS_MODEL_TYPE | Type of reranker model (e.g., `flashrank`) |
+| FAST_LLM | LiteLLM routed faster LLM (e.g., `openai/gpt-4o-mini`, `ollama/deepseek-r1:8b`) |
+| STRATEGIC_LLM | LiteLLM routed advanced LLM (e.g., `openai/gpt-4o`, `ollama/gemma3:12b`) |
+| LONG_CONTEXT_LLM | LiteLLM routed long-context LLM (e.g., `gemini/gemini-2.0-flash`, `ollama/deepseek-r1:8b`) |
+| UNSTRUCTURED_API_KEY | API key for Unstructured.io service |
+| FIRECRAWL_API_KEY | API key for Firecrawl service (if
using crawler) |
+
+**Important**: Since LLM calls are routed through LiteLLM, include API keys for the LLM providers you're using:
+- For OpenAI models: `OPENAI_API_KEY`
+- For Google Gemini models: `GEMINI_API_KEY`
+- For other providers, refer to the [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
+
+### 2. Install Dependencies
+
+Install the backend dependencies using `uv`:
+
+**Linux/macOS:**
+```bash
+# Install uv if you don't have it
+curl -fsSL https://astral.sh/uv/install.sh | bash
+
+# Install dependencies
+uv sync
+```
+
+**Windows (PowerShell):**
+```powershell
+# Install uv if you don't have it
+iwr -useb https://astral.sh/uv/install.ps1 | iex
+
+# Install dependencies
+uv sync
+```
+
+**Windows (Command Prompt):**
+```cmd
+REM Install dependencies with uv (after installing uv)
+uv sync
+```
+
+### 3. Run the Backend
+
+Start the backend server:
+
+**Linux/macOS/Windows:**
+```bash
+# Run without hot reloading
+uv run main.py
+
+# Or with hot reloading for development
+uv run main.py --reload
+```
+
+If everything is set up correctly, you should see output indicating the server is running on `http://localhost:8000`.
+
+## Frontend Setup
+
+### 1. Environment Configuration
+
+Set up the frontend environment:
+
+**Linux/macOS:**
+```bash
+cd surfsense_web
+cp .env.example .env
+```
+
+**Windows (Command Prompt):**
+```cmd
+cd surfsense_web
+copy .env.example .env
+```
+
+**Windows (PowerShell):**
+```powershell
+cd surfsense_web
+Copy-Item -Path .env.example -Destination .env
+```
+
+Edit the `.env` file and set:
+
+| ENV VARIABLE | DESCRIPTION |
+|--------------|-------------|
+| NEXT_PUBLIC_FASTAPI_BACKEND_URL | Backend URL (e.g., `http://localhost:8000`) |
+
+### 2.
Install Dependencies
+
+Install the frontend dependencies:
+
+**Linux/macOS:**
+```bash
+# Install pnpm if you don't have it
+npm install -g pnpm
+
+# Install dependencies
+pnpm install
+```
+
+**Windows:**
+```powershell
+# Install pnpm if you don't have it
+npm install -g pnpm
+
+# Install dependencies
+pnpm install
+```
+
+### 3. Run the Frontend
+
+Start the Next.js development server:
+
+**Linux/macOS/Windows:**
+```bash
+pnpm run dev
+```
+
+The frontend should now be running at `http://localhost:3000`.
+
+## Browser Extension Setup (Optional)
+
+The SurfSense browser extension allows you to save any webpage, including those protected behind authentication.
+
+### 1. Environment Configuration
+
+**Linux/macOS:**
+```bash
+cd surfsense_browser_extension
+cp .env.example .env
+```
+
+**Windows (Command Prompt):**
+```cmd
+cd surfsense_browser_extension
+copy .env.example .env
+```
+
+**Windows (PowerShell):**
+```powershell
+cd surfsense_browser_extension
+Copy-Item -Path .env.example -Destination .env
+```
+
+Edit the `.env` file:
+
+| ENV VARIABLE | DESCRIPTION |
+|--------------|-------------|
+| PLASMO_PUBLIC_BACKEND_URL | SurfSense Backend URL (e.g., `http://127.0.0.1:8000`) |
+
+### 2. Build the Extension
+
+Build the extension for your browser using the [Plasmo framework](https://docs.plasmo.com/framework/workflows/build#with-a-specific-target).
+
+**Linux/macOS/Windows:**
+```bash
+# Install dependencies
+pnpm install
+
+# Build for Chrome (default)
+pnpm build
+
+# Or for other browsers
+pnpm build --target=firefox
+pnpm build --target=edge
+```
+
+### 3. Load the Extension
+
+Load the extension in your browser's developer mode and configure it with your SurfSense API key.
+
+## Verification
+
+To verify your installation:
+
+1. Open your browser and navigate to `http://localhost:3000`
+2. Sign in with your Google account
+3. Create a search space and try uploading a document
+4.
Test the chat functionality with your uploaded content
+
+## Troubleshooting
+
+- **Database Connection Issues**: Verify your PostgreSQL server is running and pgvector is properly installed
+- **Authentication Problems**: Check your Google OAuth configuration and ensure redirect URIs are set correctly
+- **LLM Errors**: Confirm your LLM API keys are valid and the selected models are accessible
+- **File Upload Failures**: Validate your Unstructured.io API key
+- **Windows-specific**: If you encounter path issues, ensure you're using the correct path separator (`\` instead of `/`)
+- **macOS-specific**: If you encounter permission issues, you may need to use `sudo` for some installation commands
+
+## Next Steps
+
+Now that you have SurfSense running locally, you can explore its features:
+
+- Create search spaces for organizing your content
+- Upload documents or use the browser extension to save webpages
+- Ask questions about your saved content
+- Explore the advanced RAG capabilities
+
+For production deployments, consider setting up:
+- A reverse proxy like Nginx
+- SSL certificates for secure connections
+- Proper database backups
+- User access controls
\ No newline at end of file
diff --git a/surfsense_web/content/docs/meta.json b/surfsense_web/content/docs/meta.json
new file mode 100644
index 000000000..27ce8b62f
--- /dev/null
+++ b/surfsense_web/content/docs/meta.json
@@ -0,0 +1,12 @@
+{
+  "title": "Setup",
+  "description": "The setup guide for Surfsense",
+  "root": true,
+  "pages": [
+    "---Setup---",
+    "index",
+    "installation",
+    "docker-installation",
+    "manual-installation"
+  ]
+}
\ No newline at end of file
diff --git a/surfsense_web/public/docs/google_oauth_client.png b/surfsense_web/public/docs/google_oauth_client.png
new file mode 100644
index 000000000..f49650b5d
Binary files /dev/null and b/surfsense_web/public/docs/google_oauth_client.png differ
diff --git a/surfsense_web/public/docs/google_oauth_config.png
b/surfsense_web/public/docs/google_oauth_config.png
new file mode 100644
index 000000000..58b1216cd
Binary files /dev/null and b/surfsense_web/public/docs/google_oauth_config.png differ
diff --git a/surfsense_web/public/docs/google_oauth_people_api.png b/surfsense_web/public/docs/google_oauth_people_api.png
new file mode 100644
index 000000000..070fda2ea
Binary files /dev/null and b/surfsense_web/public/docs/google_oauth_people_api.png differ
diff --git a/surfsense_web/public/docs/google_oauth_screen.png b/surfsense_web/public/docs/google_oauth_screen.png
new file mode 100644
index 000000000..863a40b97
Binary files /dev/null and b/surfsense_web/public/docs/google_oauth_screen.png differ
diff --git a/surfsense_web/public/docs/langsmith.png b/surfsense_web/public/docs/langsmith.png
new file mode 100644
index 000000000..ccc7b43e6
Binary files /dev/null and b/surfsense_web/public/docs/langsmith.png differ
diff --git a/surfsense_web/public/docs/unstructured.png b/surfsense_web/public/docs/unstructured.png
new file mode 100644
index 000000000..861a847fe
Binary files /dev/null and b/surfsense_web/public/docs/unstructured.png differ