Commit graph

63 commits

Author SHA1 Message Date
CREDO23
aca23b4731 wire persist_scratch_index into scratch reindex 2026-06-17 14:59:24 +02:00
CREDO23
34de6c6f87 batch chunk inserts in persist_scratch_index 2026-06-17 14:59:24 +02:00
CREDO23
32a6e54ce6 Merge remote-tracking branch 'upstream/dev' into features/documents-injestion-layered-cached 2026-06-14 11:30:33 +02:00
CREDO23
7d55aaf2c1 feat(indexing): reconcile chunks incrementally on re-index
index() now loads existing rows and applies a content diff instead of
delete-all/reinsert-all: unchanged chunks keep their rows and embeddings
(zero HNSW/GIN churn), moved chunks get a position-only UPDATE, and only
new texts are embedded, batched with the summary embedding. First index
keeps the cache-aware build_chunk_embeddings path.
2026-06-12 18:53:08 +02:00
CREDO23
8d413ea5c2 refactor(indexing): expose chunk_markdown and embed_batch helpers
Split _compute so the incremental edit path can reuse the exact same chunker
selection and embedding entry points (and their test patch targets) without
going through the doc-level cache.
2026-06-12 18:52:57 +02:00
CREDO23
f82dedf712 feat(indexing): add pure chunk reconciler for content-addressed diffs
Greedy multiset match on chunk text decides which rows keep their embeddings,
which texts need embedding, and which rows are deleted. No DB, no embeddings;
fully unit-tested (reuse, head insert, middle edit, deletion, duplicates,
reorder, full rewrite).
2026-06-12 18:52:46 +02:00
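
A minimal sketch of the greedy multiset match described in f82dedf712, included only to illustrate the idea. The names `ChunkPlan` and `reconcile`, and the `(row_id, position, text)` tuple shape, are assumptions made for this example and are not taken from the repository. Each new chunk text claims the earliest unused existing row with identical text; unclaimed rows are deleted and unmatched texts are queued for embedding.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class ChunkPlan:
    """Hypothetical result type: what to do with each row or text."""
    keep: list[int] = field(default_factory=list)               # row ids left untouched
    move: list[tuple[int, int]] = field(default_factory=list)   # (row id, new position)
    embed: list[tuple[int, str]] = field(default_factory=list)  # (position, new text)
    delete: list[int] = field(default_factory=list)             # row ids to remove


def reconcile(existing: list[tuple[int, int, str]], new_texts: list[str]) -> ChunkPlan:
    """Greedy multiset match on chunk text. No DB access, no embeddings."""
    # Group existing rows by text, in position order, so duplicate texts
    # are matched one-for-one rather than all collapsing onto one row.
    pool: dict[str, deque[tuple[int, int]]] = defaultdict(deque)
    for row_id, position, text in sorted(existing, key=lambda row: row[1]):
        pool[text].append((row_id, position))

    plan = ChunkPlan()
    for new_position, text in enumerate(new_texts):
        candidates = pool.get(text)
        if candidates:
            row_id, old_position = candidates.popleft()
            if old_position == new_position:
                plan.keep.append(row_id)                  # same text, same place
            else:
                plan.move.append((row_id, new_position))  # position-only update
        else:
            plan.embed.append((new_position, text))       # genuinely new text

    # Whatever was never claimed no longer exists in the new content.
    for leftovers in pool.values():
        plan.delete.extend(row_id for row_id, _ in leftovers)
    return plan


plan = reconcile([(1, 0, "a"), (2, 1, "b"), (3, 2, "c")], ["x", "a", "c"])
```

In this example a head insert shifts "a" by one position, so row 1 is moved rather than re-embedded, row 3 is kept, row 2 is deleted, and only "x" needs an embedding.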
CREDO23
91d947ff79 refactor(embedding-cache): rename index cache to embedding cache
The cached payload is the indexing pipeline's embeddings (markdown is
chunked then embedded), so "embedding cache" names the expensive output
directly and removes the "index" ambiguity (DB index vs vector index vs
indexing phase). Renames the service, settings, eligibility, eviction
task, metrics, config flags (INDEX_CACHE_* -> EMBEDDING_CACHE_*), object
prefix, and the table (index_cache_embedding_sets -> embedding_cache_sets)
with its constraint and indexes. Migration 161 renamed accordingly.
2026-06-12 17:00:01 +02:00
CREDO23
4e4f7f34fa feat(index-cache): add TTL/size eviction task and daily schedule 2026-06-12 16:48:18 +02:00
CREDO23
019aa7bf76 feat(index-cache): serve chunk embeddings from cache during indexing 2026-06-12 16:48:18 +02:00
CREDO23
e8938c119b feat(index-cache): add recall/remember service 2026-06-12 16:48:10 +02:00
CREDO23
daccd304ee feat(index-cache): add settings, eligibility, and config flags 2026-06-12 16:48:10 +02:00
CREDO23
ad6da7c6af feat(index-cache): add embedding blob store sharing the cache backend 2026-06-12 16:48:01 +02:00
CREDO23
f541114544 feat(index-cache): add cached embedding set table and repository 2026-06-12 16:48:01 +02:00
CREDO23
59fa4c38c3 feat(index-cache): add pickle-free blob serialization 2026-06-12 16:48:01 +02:00
CREDO23
cf208365b4 feat(index-cache): add embedding set value objects 2026-06-12 16:48:01 +02:00
Anish Sarkar
8e8cf96faa feat(error-handling): implement LLM error adaptation and classification for chat streaming
- Introduced LLMErrorCategory and adapt_llm_exception to normalize LLM exceptions.
- Updated llm_retryable_message and llm_permanent_message to utilize the new adaptation logic.
- Enhanced classify_stream_exception to classify provider errors and return user-friendly messages.
- Added tests for error classification and adaptation to ensure robustness.
- Updated frontend error handling to display appropriate messages based on new classifications.
2026-06-12 05:03:14 +05:30
CREDO23
8699befaa0 fix(indexing): log and recover session in rollback_and_persist_failure 2026-06-10 00:10:25 +02:00
DESKTOP-RTLN3BA\$punk
ce952d2ad1 chore: linting 2026-06-09 00:42:26 -07:00
Anish Sarkar
81fa219b30 feat(backend): Remove LLM summaries from document indexing 2026-06-04 00:50:19 +05:30
Anish Sarkar
c4abbd6e20 feat(pipeline): enrich ETL and indexing failure telemetry 2026-05-22 17:49:46 +05:30
Anish Sarkar
cea5605e32 feat(indexing): track indexing and connector outcomes 2026-05-21 23:03:43 +05:30
guangyang1206
2f3a33c9d5 feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits
document_chunker currently splits Markdown tables mid-row when the table is
larger than a single chunk window, producing garbled rows that are useless
for RAG retrieval (issue #1334).

Changes:
- document_chunker.py: add chunk_text_hybrid() that detects Markdown table
  blocks with a regex, emits each table as an indivisible single chunk, and
  feeds the surrounding prose through the normal chunk_text() chunker.
- indexing_pipeline_service.py: route normal (non-code) documents through
  chunk_text_hybrid instead of chunk_text so tables are protected by default.

Fixes #1334
2026-05-05 12:48:04 +08:00
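
A rough sketch of the table-aware splitting described in 2f3a33c9d5, under stated assumptions. The regex below is a simplified stand-in (any run of consecutive lines starting with `|`), not the pattern used in document_chunker.py, and `chunk_prose` is a hypothetical callable standing in for the normal `chunk_text()` chunker.

```python
import re
from typing import Callable

# Simplified assumption: a table block is a run of consecutive lines whose
# first non-blank character is a pipe. The real pattern may be stricter.
TABLE_BLOCK = re.compile(r"(?:^[ \t]*\|.*(?:\n|$))+", re.MULTILINE)


def chunk_text_hybrid_sketch(
    text: str, chunk_prose: Callable[[str], list[str]]
) -> list[str]:
    """Emit each Markdown table as one chunk; chunk the prose in between."""
    chunks: list[str] = []
    cursor = 0
    for match in TABLE_BLOCK.finditer(text):
        prose = text[cursor:match.start()]
        if prose.strip():
            chunks.extend(chunk_prose(prose))   # prose goes through the normal chunker
        chunks.append(match.group(0).strip())   # the table stays indivisible
        cursor = match.end()
    tail = text[cursor:]
    if tail.strip():
        chunks.extend(chunk_prose(tail))
    return chunks


doc = "Intro text.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nOutro."
chunks = chunk_text_hybrid_sketch(doc, lambda prose: [prose.strip()])
```

With the trivial prose chunker used here, the document yields three chunks: the intro, the whole table, and the outro. A table larger than the chunk window would still come out as a single chunk, which is the trade-off the commit message describes.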
DESKTOP-RTLN3BA\$punk
4bee367d4a feat: added AI file sorting 2026-04-14 01:43:30 -07:00
Anish Sarkar
a8b83dcf3f feat: add folder_id support in ConnectorDocument and indexing pipeline for improved document organization 2026-04-08 17:48:50 +05:30
Anish Sarkar
76c760b8dd fix: improve the notification content and improve tooltip 2026-04-07 23:00:52 +05:30
Anish Sarkar
000c2d9b5b style: simplify LLM model terminology in UI 2026-04-02 10:11:35 +05:30
DESKTOP-RTLN3BA\$punk
2cc2d339e6 feat: optimized agent file system 2026-03-28 16:39:46 -07:00
Anish Sarkar
bd6e335cb3 feat: enhance performance logging in indexing pipeline
- Added performance logging to the `index_batch_parallel` method, capturing metrics for document indexing duration and concurrency.
- Introduced timing measurements for both the overall indexing process and the parallel document gathering phase, improving observability of the indexing workflow.
- Updated logging statements to provide detailed insights into the number of documents processed, indexed, and failed during the indexing operation.
2026-03-26 23:10:49 +05:30
Anish Sarkar
4fd776e7ef feat: implement parallel indexing for Google Calendar and Gmail connectors
- Refactored Google Calendar and Gmail indexers to utilize the new `index_batch_parallel` method for concurrent document indexing, enhancing performance.
- Updated the indexing logic to replace serial processing with parallel execution, allowing for improved efficiency in handling multiple documents.
- Adjusted logging and error handling to accommodate the new parallel processing approach, ensuring robust operation during indexing.
- Enhanced unit tests to validate the functionality of the parallel indexing method and its integration with existing workflows.
2026-03-26 19:34:04 +05:30
Anish Sarkar
e5cb6bfacf feat: implement parallel document indexing in IndexingPipelineService
- Added `index_batch_parallel` method to enable concurrent indexing of documents with bounded concurrency, improving performance and efficiency.
- Refactored existing indexing logic to utilize `asyncio.to_thread` for non-blocking execution of embedding and chunking functions.
- Introduced unit tests to validate the functionality of the new parallel indexing method, ensuring robustness and error handling during document processing.
2026-03-26 19:33:49 +05:30
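
The bounded-concurrency pattern named in e5cb6bfacf could look roughly like the following. This is a sketch, not the `IndexingPipelineService` code: `embed_blocking` is a hypothetical stand-in for the blocking chunking and embedding work, and the function name, parameters, and return shape are assumptions. It uses `asyncio.Semaphore` to cap in-flight documents and `asyncio.to_thread` to keep blocking work off the event loop.

```python
import asyncio


def embed_blocking(text: str) -> int:
    """Hypothetical stand-in for blocking chunk + embed work."""
    if not text:
        raise ValueError("empty document")
    return len(text)


async def index_batch_parallel_sketch(
    docs: list[str], max_concurrency: int = 4
) -> tuple[list[int], list[BaseException]]:
    """Index documents concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def index_one(doc: str) -> int:
        async with semaphore:
            # Run the blocking function in a worker thread so the
            # event loop stays responsive.
            return await asyncio.to_thread(embed_blocking, doc)

    # return_exceptions=True lets one failing document be reported
    # without cancelling the rest of the batch.
    results = await asyncio.gather(
        *(index_one(doc) for doc in docs), return_exceptions=True
    )
    indexed = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return indexed, failed


indexed, failed = asyncio.run(
    index_batch_parallel_sketch(["a", "", "ccc"], max_concurrency=2)
)
```

Here the empty document fails while the other two are indexed, which mirrors the per-document error handling the commit message mentions.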
Anish Sarkar
f7b52470eb feat: enhance Google connectors indexing with content extraction and document migration
- Added `download_and_extract_content` function to extract content from Google Drive files as markdown.
- Updated Google Drive indexer to utilize the new content extraction method.
- Implemented document migration logic to update legacy Composio document types to their native Google types.
- Introduced identifier hashing for stable document identification.
- Improved file pre-filtering to handle unchanged and rename-only files efficiently.
2026-03-25 18:33:44 +05:30
DESKTOP-RTLN3BA\$punk
d8a05ae4d5 feat: refactor agent tools management and add UI integration
- Added endpoint to list agent tools with metadata, excluding hidden tools.
- Updated NewChatRequest and RegenerateRequest schemas to include disabled tools.
- Integrated disabled tools management in the NewChatPage and Composer components.
- Improved tool instructions and visibility in the system prompt.
- Refactored tool registration to support hidden tools and default enabled states.
- Enhanced document chunk creation to handle strict zip behavior.
- Cleaned up imports and formatting across various files for consistency.
2026-03-10 17:36:26 -07:00
CREDO23
929445afd9 feat: use batch embedding in IndexingPipelineService.index 2026-03-09 16:13:44 +02:00
CREDO23
cb4b155b9d feat: re-export embed_texts from document_embedder 2026-03-09 15:54:02 +02:00
Anish Sarkar
6d00b0debf Merge remote-tracking branch 'upstream/dev' into refactor/upload-document-adapter-class 2026-03-01 22:35:17 +05:30
DESKTOP-RTLN3BA\$punk
0e723a5b8b feat: perf optimizations
- improved search_knowledgebase_tool
- Added new endpoint to batch-fetch comments for multiple messages, reducing the number of API calls.
- Introduced CommentBatchRequest and CommentBatchResponse schemas for handling batch requests and responses.
- Updated chat_comments_service to validate message existence and permissions before fetching comments.
- Enhanced frontend with useBatchCommentsPreload hook to optimize comment loading for assistant messages.
2026-02-27 17:19:25 -08:00
DESKTOP-RTLN3BA\$punk
664c43ca13 feat: add performance logging middleware and enhance performance tracking across services
- Introduced RequestPerfMiddleware to log request performance metrics, including slow request thresholds.
- Updated various services and retrievers to utilize the new performance logging utility for better tracking of execution times.
- Enhanced existing methods with detailed performance logs for operations such as embedding, searching, and indexing.
- Removed deprecated logging setup in stream_new_chat and replaced it with the new performance logger.
2026-02-27 16:32:30 -08:00
Anish Sarkar
23a98d802c refactor: implement UploadDocumentAdapter for file indexing and reindexing 2026-02-28 01:38:32 +05:30
DESKTOP-RTLN3BA\$punk
e9892c8fe9 feat: added configurable summary calculation and various improvements
- Replaced direct embedding calls with a utility function across various components to streamline embedding logic.
- Added enable_summary flag to several models and routes to control summary generation behavior.
2026-02-26 18:24:57 -08:00
DESKTOP-RTLN3BA\$punk
23f553ef84 Merge branch 'dev' of https://github.com/MODSetter/SurfSense into dev 2026-02-26 13:01:24 -08:00
DESKTOP-RTLN3BA\$punk
aabc24f82c feat: enhance performance logging and caching in various components
- Introduced slow callback logging in FastAPI to identify blocking calls.
- Added performance logging for agent creation and tool loading processes.
- Implemented caching for MCP tools to reduce redundant server calls.
- Enhanced sandbox management with in-process caching for improved efficiency.
- Refactored several functions for better readability and performance tracking.
- Updated tests to ensure proper functionality of new features and optimizations.
2026-02-26 13:00:31 -08:00
Anish Sarkar
9ccee054a5 chore: ran linting 2026-02-26 03:05:20 +05:30
CREDO23
c50d661d7d fix wrong status key in adapter error reporting 2026-02-25 21:00:55 +02:00
CREDO23
d0fdd3224a fix metadata keys casing and set content_needs_reindexing in adapter 2026-02-25 20:39:18 +02:00
CREDO23
cad400be1b add file upload adapter and make index() return refreshed document 2026-02-25 19:56:59 +02:00
CREDO23
86ecb82c6e fix: tighten indexing pipeline exception handling and logging 2026-02-25 17:44:35 +02:00
CREDO23
5be58b78ad simplify indexing pipeline DB error handling 2026-02-25 16:59:09 +02:00
CREDO23
66d7d3da8a fix bugs in indexing pipeline exception handling 2026-02-25 16:27:12 +02:00
CREDO23
b6c25628c8 add structured logging to indexing pipeline 2026-02-25 16:04:35 +02:00
CREDO23
610080bfef extract persistence helpers into document_persistence.py 2026-02-25 15:30:25 +02:00