SurfSense

mirror of https://github.com/MODSetter/SurfSense.git synced 2026-06-26 21:39:43 +02:00

Author	SHA1	Message	Date
CREDO23	aca23b4731	wire persist_scratch_index into scratch reindex	2026-06-17 14:59:24 +02:00
CREDO23	34de6c6f87	batch chunk inserts in persist_scratch_index	2026-06-17 14:59:24 +02:00
CREDO23	32a6e54ce6	Merge remote-tracking branch 'upstream/dev' into features/documents-injestion-layered-cached	2026-06-14 11:30:33 +02:00
CREDO23	7d55aaf2c1	feat(indexing): reconcile chunks incrementally on re-index index() now loads existing rows and applies a content diff instead of delete-all/reinsert-all: unchanged chunks keep their rows and embeddings (zero HNSW/GIN churn), moved chunks get a position-only UPDATE, and only new texts are embedded, batched with the summary embedding. First index keeps the cache-aware build_chunk_embeddings path.	2026-06-12 18:53:08 +02:00
CREDO23	8d413ea5c2	refactor(indexing): expose chunk_markdown and embed_batch helpers Split _compute so the incremental edit path can reuse the exact same chunker selection and embedding entry points (and their test patch targets) without going through the doc-level cache.	2026-06-12 18:52:57 +02:00
CREDO23	f82dedf712	feat(indexing): add pure chunk reconciler for content-addressed diffs Greedy multiset match on chunk text decides which rows keep their embeddings, which texts need embedding, and which rows are deleted. No DB, no embeddings; fully unit-tested (reuse, head insert, middle edit, deletion, duplicates, reorder, full rewrite).	2026-06-12 18:52:46 +02:00
CREDO23	91d947ff79	refactor(embedding-cache): rename index cache to embedding cache The cached payload is the indexing pipeline's embeddings (markdown is chunked then embedded), so "embedding cache" names the expensive output directly and removes the "index" ambiguity (DB index vs vector index vs indexing phase). Renames the service, settings, eligibility, eviction task, metrics, config flags (INDEX_CACHE_* -> EMBEDDING_CACHE_*), object prefix, and the table (index_cache_embedding_sets -> embedding_cache_sets) with its constraint and indexes. Migration 161 renamed accordingly.	2026-06-12 17:00:01 +02:00
CREDO23	4e4f7f34fa	feat(index-cache): add TTL/size eviction task and daily schedule	2026-06-12 16:48:18 +02:00
CREDO23	019aa7bf76	feat(index-cache): serve chunk embeddings from cache during indexing	2026-06-12 16:48:18 +02:00
CREDO23	e8938c119b	feat(index-cache): add recall/remember service	2026-06-12 16:48:10 +02:00
CREDO23	daccd304ee	feat(index-cache): add settings, eligibility, and config flags	2026-06-12 16:48:10 +02:00
CREDO23	ad6da7c6af	feat(index-cache): add embedding blob store sharing the cache backend	2026-06-12 16:48:01 +02:00
CREDO23	f541114544	feat(index-cache): add cached embedding set table and repository	2026-06-12 16:48:01 +02:00
CREDO23	59fa4c38c3	feat(index-cache): add pickle-free blob serialization	2026-06-12 16:48:01 +02:00
CREDO23	cf208365b4	feat(index-cache): add embedding set value objects	2026-06-12 16:48:01 +02:00
Anish Sarkar	8e8cf96faa	feat(error-handling): implement LLM error adaptation and classification for chat streaming - Introduced LLMErrorCategory and adapt_llm_exception to normalize LLM exceptions. - Updated llm_retryable_message and llm_permanent_message to utilize the new adaptation logic. - Enhanced classify_stream_exception to classify provider errors and return user-friendly messages. - Added tests for error classification and adaptation to ensure robustness. - Updated frontend error handling to display appropriate messages based on new classifications.	2026-06-12 05:03:14 +05:30
CREDO23	8699befaa0	fix(indexing): log and recover session in rollback_and_persist_failure	2026-06-10 00:10:25 +02:00
$DESKTOP-RTLN3BA\$punk$ DESKTOP-RTLN3BA\$punk	ce952d2ad1	chore: linting	2026-06-09 00:42:26 -07:00
Anish Sarkar	81fa219b30	feat(backend): Remove LLM summaries from document indexing	2026-06-04 00:50:19 +05:30
Anish Sarkar	c4abbd6e20	feat(pipeline): enrich ETL and indexing failure telemetry	2026-05-22 17:49:46 +05:30
Anish Sarkar	cea5605e32	feat(indexing): track indexing and connector outcomes	2026-05-21 23:03:43 +05:30
guangyang1206	2f3a33c9d5	feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits Document_chunker currently splits Markdown tables mid-row when the table is larger than a single chunk window, producing garbled rows that are useless for RAG retrieval (issue #1334). Changes: - document_chunker.py: add chunk_text_hybrid() that detects Markdown table blocks with a regex, emits each table as an indivisible single chunk, and feeds the surrounding prose through the normal chunk_text() chunker. - indexing_pipeline_service.py: route normal (non-code) documents through chunk_text_hybrid instead of chunk_text so tables are protected by default. Fixes #1334	2026-05-05 12:48:04 +08:00
$DESKTOP-RTLN3BA\$punk$ DESKTOP-RTLN3BA\$punk	4bee367d4a	feat: added ai file sorting	2026-04-14 01:43:30 -07:00
Anish Sarkar	a8b83dcf3f	feat: add folder_id support in ConnectorDocument and indexing pipeline for improved document organization	2026-04-08 17:48:50 +05:30
Anish Sarkar	76c760b8dd	fix: improve the notification content and improve tooltip	2026-04-07 23:00:52 +05:30
Anish Sarkar	000c2d9b5b	style: simplify LLM model terminology in UI	2026-04-02 10:11:35 +05:30
$DESKTOP-RTLN3BA\$punk$ DESKTOP-RTLN3BA\$punk	2cc2d339e6	feat: made agent file system optimized	2026-03-28 16:39:46 -07:00
Anish Sarkar	bd6e335cb3	feat: enhance performance logging in indexing pipeline - Added performance logging to the `index_batch_parallel` method, capturing metrics for document indexing duration and concurrency. - Introduced timing measurements for both the overall indexing process and the parallel document gathering phase, improving observability of the indexing workflow. - Updated logging statements to provide detailed insights into the number of documents processed, indexed, and failed during the indexing operation.	2026-03-26 23:10:49 +05:30
Anish Sarkar	4fd776e7ef	feat: implement parallel indexing for Google Calendar and Gmail connectors - Refactored Google Calendar and Gmail indexers to utilize the new `index_batch_parallel` method for concurrent document indexing, enhancing performance. - Updated the indexing logic to replace serial processing with parallel execution, allowing for improved efficiency in handling multiple documents. - Adjusted logging and error handling to accommodate the new parallel processing approach, ensuring robust operation during indexing. - Enhanced unit tests to validate the functionality of the parallel indexing method and its integration with existing workflows.	2026-03-26 19:34:04 +05:30
Anish Sarkar	e5cb6bfacf	feat: implement parallel document indexing in IndexingPipelineService - Added `index_batch_parallel` method to enable concurrent indexing of documents with bounded concurrency, improving performance and efficiency. - Refactored existing indexing logic to utilize `asyncio.to_thread` for non-blocking execution of embedding and chunking functions. - Introduced unit tests to validate the functionality of the new parallel indexing method, ensuring robustness and error handling during document processing.	2026-03-26 19:33:49 +05:30
Anish Sarkar	f7b52470eb	feat: enhance Google connectors indexing with content extraction and document migration - Added `download_and_extract_content` function to extract content from Google Drive files as markdown. - Updated Google Drive indexer to utilize the new content extraction method. - Implemented document migration logic to update legacy Composio document types to their native Google types. - Introduced identifier hashing for stable document identification. - Improved file pre-filtering to handle unchanged and rename-only files efficiently.	2026-03-25 18:33:44 +05:30
$DESKTOP-RTLN3BA\$punk$ DESKTOP-RTLN3BA\$punk	d8a05ae4d5	feat: refactor agent tools management and add UI integration - Added endpoint to list agent tools with metadata, excluding hidden tools. - Updated NewChatRequest and RegenerateRequest schemas to include disabled tools. - Integrated disabled tools management in the NewChatPage and Composer components. - Improved tool instructions and visibility in the system prompt. - Refactored tool registration to support hidden tools and default enabled states. - Enhanced document chunk creation to handle strict zip behavior. - Cleaned up imports and formatting across various files for consistency.	2026-03-10 17:36:26 -07:00
CREDO23	929445afd9	feat: use batch embedding in IndexingPipelineService.index	2026-03-09 16:13:44 +02:00
CREDO23	cb4b155b9d	feat: re-export embed_texts from document_embedder	2026-03-09 15:54:02 +02:00
Anish Sarkar	6d00b0debf	Merge remote-tracking branch 'upstream/dev' into refactor/upload-document-adapter-class	2026-03-01 22:35:17 +05:30
$DESKTOP-RTLN3BA\$punk$ DESKTOP-RTLN3BA\$punk	0e723a5b8b	feat: perf optimizations - improved search_knowledgebase_tool - Added new endpoint to batch-fetch comments for multiple messages, reducing the number of API calls. - Introduced CommentBatchRequest and CommentBatchResponse schemas for handling batch requests and responses. - Updated chat_comments_service to validate message existence and permissions before fetching comments. - Enhanced frontend with useBatchCommentsPreload hook to optimize comment loading for assistant messages.	2026-02-27 17:19:25 -08:00
$DESKTOP-RTLN3BA\$punk$ DESKTOP-RTLN3BA\$punk	664c43ca13	feat: add performance logging middleware and enhance performance tracking across services - Introduced RequestPerfMiddleware to log request performance metrics, including slow request thresholds. - Updated various services and retrievers to utilize the new performance logging utility for better tracking of execution times. - Enhanced existing methods with detailed performance logs for operations such as embedding, searching, and indexing. - Removed deprecated logging setup in stream_new_chat and replaced it with the new performance logger.	2026-02-27 16:32:30 -08:00
Anish Sarkar	23a98d802c	refactor: implement UploadDocumentAdapter for file indexing and reindexing	2026-02-28 01:38:32 +05:30
$DESKTOP-RTLN3BA\$punk$ DESKTOP-RTLN3BA\$punk	e9892c8fe9	feat: added configurable summary calculation and various improvements - Replaced direct embedding calls with a utility function across various components to streamline embedding logic. - Added enable_summary flag to several models and routes to control summary generation behavior.	2026-02-26 18:24:57 -08:00
$DESKTOP-RTLN3BA\$punk$ DESKTOP-RTLN3BA\$punk	23f553ef84	Merge branch 'dev' of https://github.com/MODSetter/SurfSense into dev	2026-02-26 13:01:24 -08:00
$DESKTOP-RTLN3BA\$punk$ DESKTOP-RTLN3BA\$punk	aabc24f82c	feat: enhance performance logging and caching in various components - Introduced slow callback logging in FastAPI to identify blocking calls. - Added performance logging for agent creation and tool loading processes. - Implemented caching for MCP tools to reduce redundant server calls. - Enhanced sandbox management with in-process caching for improved efficiency. - Refactored several functions for better readability and performance tracking. - Updated tests to ensure proper functionality of new features and optimizations.	2026-02-26 13:00:31 -08:00
Anish Sarkar	9ccee054a5	chore: ran linting	2026-02-26 03:05:20 +05:30
CREDO23	c50d661d7d	fix wrong status key in adapter error reporting	2026-02-25 21:00:55 +02:00
CREDO23	d0fdd3224a	fix metadata keys casing and set content_needs_reindexing in adapter	2026-02-25 20:39:18 +02:00
CREDO23	cad400be1b	add file upload adapter and make index() return refreshed document	2026-02-25 19:56:59 +02:00
CREDO23	86ecb82c6e	fix: tighten indexing pipeline exception handling and logging	2026-02-25 17:44:35 +02:00
CREDO23	5be58b78ad	simplify indexing pipeline DB error handling	2026-02-25 16:59:09 +02:00
CREDO23	66d7d3da8a	fix bugs in indexing pipeline exception handling	2026-02-25 16:27:12 +02:00
CREDO23	b6c25628c8	add structured logging to indexing pipeline	2026-02-25 16:04:35 +02:00
CREDO23	610080bfef	extract persistence helpers into document_persistence.py	2026-02-25 15:30:25 +02:00
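
Commit f82dedf712 above describes a pure chunk reconciler: a greedy multiset match on chunk text that decides which rows keep their embeddings, which texts need embedding, and which rows are deleted. The sketch below illustrates that idea only; it is not the repository's code. The names `ExistingChunk`, `ReconcilePlan`, and `reconcile` are hypothetical, and the tie-breaking rule (reuse the earliest-positioned matching row first) is an assumption.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExistingChunk:
    """A stored chunk row (hypothetical shape, for illustration only)."""
    row_id: int
    position: int
    text: str


@dataclass
class ReconcilePlan:
    kept: list[int] = field(default_factory=list)                  # row ids left untouched
    moved: list[tuple[int, int]] = field(default_factory=list)     # (row_id, new_position)
    to_embed: list[tuple[int, str]] = field(default_factory=list)  # (position, new text)
    deleted: list[int] = field(default_factory=list)               # row ids to remove


def reconcile(existing: list[ExistingChunk], new_texts: list[str]) -> ReconcilePlan:
    """Greedy multiset match on chunk text. No DB access, no embedding calls."""
    # Group existing rows by text; duplicates form a queue ordered by position.
    pool: dict[str, deque[ExistingChunk]] = defaultdict(deque)
    for chunk in sorted(existing, key=lambda c: c.position):
        pool[chunk.text].append(chunk)

    plan = ReconcilePlan()
    for position, text in enumerate(new_texts):
        candidates = pool.get(text)
        if candidates:
            # Identical text: reuse the row and its embedding.
            row = candidates.popleft()
            if row.position == position:
                plan.kept.append(row.row_id)
            else:
                plan.moved.append((row.row_id, position))  # position-only update
        else:
            # Unseen text: only these need to be embedded.
            plan.to_embed.append((position, text))

    # Rows never matched by any new text are deleted.
    plan.deleted = sorted(row.row_id for rows in pool.values() for row in rows)
    return plan
```

For example, reconciling rows `a, b, c` against the new texts `x, a, c` embeds only `x`, moves the row for `a` to position 1, keeps the row for `c`, and deletes the row for `b`.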


63 commits
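
Commit 2f3a33c9d5 in the log describes table-aware chunking: Markdown table blocks are detected with a regex, each table is emitted as one indivisible chunk, and the surrounding prose goes through the normal chunker. The sketch below shows one way that could look. It is not the repository's `chunk_text_hybrid`: the regex, the function `chunk_text_table_aware`, and the stand-in prose chunker `naive_chunk_text` are all assumptions made for illustration.

```python
import re
from typing import Callable

# Assumed table detector: a run of consecutive lines that start with a pipe
# (optionally indented). The real pattern in the repository may differ.
TABLE_BLOCK = re.compile(r"(?:^[ \t]*\|.*(?:\n|$))+", re.MULTILINE)


def naive_chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Stand-in prose chunker: split on blank lines, then hard-wrap by size."""
    chunks: list[str] = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        for start in range(0, len(paragraph), max_chars):
            chunks.append(paragraph[start:start + max_chars])
    return chunks


def chunk_text_table_aware(
    markdown: str,
    chunk_prose: Callable[[str], list[str]] = naive_chunk_text,
) -> list[str]:
    """Emit each Markdown table as a single chunk; chunk the prose around it."""
    chunks: list[str] = []
    cursor = 0
    for match in TABLE_BLOCK.finditer(markdown):
        # Prose before the table goes through the ordinary chunker.
        chunks.extend(chunk_prose(markdown[cursor:match.start()]))
        # The table itself is never split, whatever its size.
        chunks.append(match.group(0).strip())
        cursor = match.end()
    # Trailing prose after the last table.
    chunks.extend(chunk_prose(markdown[cursor:]))
    return chunks
```

Passing a prose chunker with a very small window still leaves each table whole, which is the property the commit message says prevents mid-row splits.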