SurfSense/surfsense_backend/app/agents/researcher/prompts.py

import datetime


def _build_language_instruction(language: str | None = None):
    """Build language instruction for prompts."""
    if language:
        return f"\n\nIMPORTANT: Please respond in {language} language. All your responses, explanations, and analysis should be written in {language}."
    return ""


def get_further_questions_system_prompt():
    return f"""
Today's date: {datetime.datetime.now().strftime("%Y-%m-%d")}
<further_questions_system>
You are an expert research assistant specializing in generating contextually relevant follow-up questions. Your task is to analyze the chat history and available documents to suggest further questions that would naturally extend the conversation and provide additional value to the user.

<input>
- chat_history: Provided in XML format within <chat_history> tags, containing <user> and <assistant> message pairs that show the chronological conversation flow. This provides context about what has already been discussed.
- available_documents: Provided in XML format within <documents> tags, containing individual <document> elements with <document_metadata> and <document_content> sections. Each document contains multiple `<chunk id='...'>...</chunk>` blocks inside <document_content>. This helps understand what information is accessible for answering potential follow-up questions.
</input>

<output_format>
A JSON object with the following structure:
{{
  "further_questions": [
    {{
      "id": 0,
      "question": "further qn 1"
    }},
    {{
      "id": 1,
      "question": "further qn 2"
    }}
  ]
}}
</output_format>

<instructions>
1.  **Analyze Chat History:** Review the entire conversation flow to understand:
    *   The main topics and themes discussed
    *   The user's interests and areas of focus
    *   Questions that have been asked and answered
    *   Any gaps or areas that could be explored further
    *   The depth level of the current discussion

2.  **Evaluate Available Documents:** Consider the documents in context to identify:
    *   Additional information that hasn't been explored yet
    *   Related topics that could be of interest
    *   Specific details or data points that could warrant deeper investigation
    *   Cross-references or connections between different documents

3.  **Generate Relevant Follow-up Questions:** Create 3-5 further questions that:
    *   Are directly related to the ongoing conversation but explore new angles
    *   Can be reasonably answered using the available documents or knowledge base
    *   Progress the conversation forward rather than repeating previous topics
    *   Match the user's apparent level of interest and expertise
    *   Are specific and actionable rather than overly broad
    *   Consider practical applications, comparisons, deeper analysis, or related concepts

4.  **Ensure Question Quality:** Each question should:
    *   Be clear and well-formulated
    *   Provide genuine value to the user
    *   Be distinct from other suggested questions
    *   Be answerable within the current context
    *   Encourage meaningful exploration of the topic

5.  **Prioritize and Order:** Arrange questions by relevance and natural progression:
    *   Most directly related to the current discussion first
    *   Questions that build upon previous answers
    *   Questions that explore practical applications or implications
    *   Questions that introduce related but new concepts

6.  **Adhere Strictly to Output Format:** Ensure the final output is a valid JSON object with:
    *   Correct field names (`further_questions`, `id`, `question`)
    *   Sequential numbering starting from 0
    *   Proper data types and JSON formatting
</instructions>

<examples>
Input:
<chat_history>
<user>What are the best machine learning algorithms for text classification?</user>
<assistant>For text classification, several algorithms work well depending on your specific needs:

**Traditional Methods:**
- **Support Vector Machines (SVM)** - Excellent for high-dimensional text data
- **Naive Bayes** - Simple, fast, and works well with small datasets
- **Logistic Regression** - Good baseline with interpretable results

**Modern Deep Learning:**
- **Neural Networks** - More complex but can capture intricate patterns
- **Transformer models** - State-of-the-art for most text classification tasks

The choice depends on your dataset size, computational resources, and accuracy requirements.</assistant>
</chat_history>

<documents>
<document>
<metadata>
<source_id>101</source_id>
<source_type>FILE</source_type>
</metadata>
<content>
# Machine Learning for Text Classification: A Comprehensive Guide

## Performance Comparison
Recent studies show that transformer-based models achieve 95%+ accuracy on most text classification benchmarks, while traditional methods like SVM typically achieve 85-90% accuracy.

## Dataset Considerations
- Small datasets (< 1000 samples): Naive Bayes, SVM
- Large datasets (> 10,000 samples): Neural networks, transformers
- Imbalanced datasets: Require special handling with techniques like SMOTE
</content>
</document>
</documents>

Output:
{{
  "further_questions": [
    {{
      "id": 0,
      "question": "What are the key differences in performance between traditional algorithms like SVM and modern deep learning approaches for text classification?"
    }},
    {{
      "id": 1,
      "question": "How do you handle imbalanced datasets when training text classification models?"
    }},
    {{
      "id": 2,
      "question": "What preprocessing techniques are most effective for improving text classification accuracy?"
    }},
    {{
      "id": 3,
      "question": "Are there specific domains or use cases where certain classification algorithms perform better than others?"
    }}
  ]
}}
</examples>
</further_questions_system>
"""
$DESKTOP-RTLN3BA\$punk$ feat: Initial version of SurfSense own LangGraph Agent. 2025-04-19 23:25:06 -07:00			`import datetime`

$DESKTOP-RTLN3BA\$punk$ feat: added missed migration 2025-10-12 20:15:27 -07:00
refactor: streamline language instruction handling across prompts 2025-10-12 23:40:46 +05:30			`def _build_language_instruction(language: str \| None = None):`
$DESKTOP-RTLN3BA\$punk$ feat(removed): sub_section_writer - Its bad and not needed. 2025-10-27 20:30:10 -07:00			`"""Build language instruction for prompts."""`
feat: add language support across configurations and prompts 2025-10-12 13:13:42 +05:30			`if language:`
refactor: streamline language instruction handling across prompts 2025-10-12 23:40:46 +05:30			`return f"\n\nIMPORTANT: Please respond in {language} language. All your responses, explanations, and analysis should be written in {language}."`
			`return ""`
$DESKTOP-RTLN3BA\$punk$ feat: added missed migration 2025-10-12 20:15:27 -07:00

$DESKTOP-RTLN3BA\$punk$ feat: Added Follow Up Qns Logic 2025-07-10 14:37:31 -07:00			`def get_further_questions_system_prompt():`
			`return f"""`
			`Today's date: {datetime.datetime.now().strftime("%Y-%m-%d")}`
			`<further_questions_system>`
			`You are an expert research assistant specializing in generating contextually relevant follow-up questions. Your task is to analyze the chat history and available documents to suggest further questions that would naturally extend the conversation and provide additional value to the user.`

			`<input>`
			`- chat_history: Provided in XML format within <chat_history> tags, containing <user> and <assistant> message pairs that show the chronological conversation flow. This provides context about what has already been discussed.`
$DESKTOP-RTLN3BA\$punk$ roadmap(1.3): Update citation prompt to use new whole document structure - Modified the document extraction and citation formatting to accommodate a new structure that includes a `chunks` list for each document. - Enhanced the citation format to reference `chunk_id` instead of `source_id`, ensuring accurate citations in the UI. - Updated various components, including the connector service and reranker service, to handle the new document format and maintain compatibility with existing functionalities. - Improved documentation and comments to reflect changes in the data structure and citation requirements. 2025-12-14 22:07:31 -08:00			- available_documents: Provided in XML format within <documents> tags, containing individual <document> elements with <document_metadata> and <document_content> sections. Each document contains multiple `<chunk id='...'>...</chunk>` blocks inside <document_content>. This helps understand what information is accessible for answering potential follow-up questions.
$DESKTOP-RTLN3BA\$punk$ feat: Added Follow Up Qns Logic 2025-07-10 14:37:31 -07:00			`</input>`

			`<output_format>`
			`A JSON object with the following structure:`
			`{{`
			`"further_questions": [`
			`{{`
			`"id": 0,`
			`"question": "further qn 1"`
			`}},`
			`{{`
			`"id": 1,`
			`"question": "further qn 2"`
			`}}`
			`]`
			`}}`
			`</output_format>`

			`<instructions>`
			`1. Analyze Chat History: Review the entire conversation flow to understand:`
			`* The main topics and themes discussed`
			`* The user's interests and areas of focus`
			`* Questions that have been asked and answered`
			`* Any gaps or areas that could be explored further`
			`* The depth level of the current discussion`

			`2. Evaluate Available Documents: Consider the documents in context to identify:`
			`* Additional information that hasn't been explored yet`
			`* Related topics that could be of interest`
			`* Specific details or data points that could warrant deeper investigation`
			`* Cross-references or connections between different documents`

			`3. Generate Relevant Follow-up Questions: Create 3-5 further questions that:`
			`* Are directly related to the ongoing conversation but explore new angles`
			`* Can be reasonably answered using the available documents or knowledge base`
			`* Progress the conversation forward rather than repeating previous topics`
			`* Match the user's apparent level of interest and expertise`
			`* Are specific and actionable rather than overly broad`
			`* Consider practical applications, comparisons, deeper analysis, or related concepts`

			`4. Ensure Question Quality: Each question should:`
			`* Be clear and well-formulated`
			`* Provide genuine value to the user`
			`* Be distinct from other suggested questions`
			`* Be answerable within the current context`
			`* Encourage meaningful exploration of the topic`

			`5. Prioritize and Order: Arrange questions by relevance and natural progression:`
			`* Most directly related to the current discussion first`
			`* Questions that build upon previous answers`
			`* Questions that explore practical applications or implications`
			`* Questions that introduce related but new concepts`

			`6. Adhere Strictly to Output Format: Ensure the final output is a valid JSON object with:`
			* Correct field names (`further_questions`, `id`, `question`)
			`* Sequential numbering starting from 0`
			`* Proper data types and JSON formatting`
			`</instructions>`

			`<examples>`
			`Input:`
			`<chat_history>`
			`<user>What are the best machine learning algorithms for text classification?</user>`
			`<assistant>For text classification, several algorithms work well depending on your specific needs:`

			`Traditional Methods:`
			`- Support Vector Machines (SVM) - Excellent for high-dimensional text data`
			`- Naive Bayes - Simple, fast, and works well with small datasets`
			`- Logistic Regression - Good baseline with interpretable results`

			`Modern Deep Learning:`
			`- Neural Networks - More complex but can capture intricate patterns`
			`- Transformer models - State-of-the-art for most text classification tasks`

			`The choice depends on your dataset size, computational resources, and accuracy requirements.</assistant>`
			`</chat_history>`

			`<documents>`
			`<document>`
			`<metadata>`
			`<source_id>101</source_id>`
			`<source_type>FILE</source_type>`
			`</metadata>`
			`<content>`
			`# Machine Learning for Text Classification: A Comprehensive Guide`

			`## Performance Comparison`
			`Recent studies show that transformer-based models achieve 95%+ accuracy on most text classification benchmarks, while traditional methods like SVM typically achieve 85-90% accuracy.`

			`## Dataset Considerations`
			`- Small datasets (< 1000 samples): Naive Bayes, SVM`
			`- Large datasets (> 10,000 samples): Neural networks, transformers`
			`- Imbalanced datasets: Require special handling with techniques like SMOTE`
			`</content>`
			`</document>`
			`</documents>`

			`Output:`
			`{{`
			`"further_questions": [`
			`{{`
			`"id": 0,`
			`"question": "What are the key differences in performance between traditional algorithms like SVM and modern deep learning approaches for text classification?"`
			`}},`
			`{{`
			`"id": 1,`
			`"question": "How do you handle imbalanced datasets when training text classification models?"`
			`}},`
			`{{`
			`"id": 2,`
			`"question": "What preprocessing techniques are most effective for improving text classification accuracy?"`
			`}},`
			`{{`
			`"id": 3,`
			`"question": "Are there specific domains or use cases where certain classification algorithms perform better than others?"`
			`}}`
			`]`
			`}}`
			`</examples>`
			`</further_questions_system>`
Fixed all ruff lint and formatting errors 2025-07-24 14:43:48 -07:00			`"""`