21 commits

Author SHA1 Message Date
DhruvTilva
e16e4e2c5c fix: guard missing text_as_html in Table element markdown conversion
When the Unstructured API returns a Table element without text_as_html
in its metadata (e.g. local install or free-tier API), the lambda was
raising KeyError: 'text_as_html', crashing the entire document
indexing pipeline for any file containing tables.

Guard the key access with .get() and fall back to the plain extracted
text content (x) so the pipeline continues and the table content is
still indexed, just without HTML formatting.
2026-06-25 23:52:15 +05:30
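
The guard described in this commit can be sketched as follows. This is a minimal illustration, assuming each element is a plain dict with `metadata` and `text` keys; the helper name `table_to_markdown` is hypothetical (the commit message describes a lambda, not a named function).

```python
def table_to_markdown(element: dict) -> str:
    """Return the table's HTML when present, otherwise its plain text.

    Hypothetical helper: the element shape (dict with "metadata" and
    "text" keys) is an assumption made for this sketch.
    """
    metadata = element.get("metadata") or {}
    # .get() returns None instead of raising KeyError when the
    # text_as_html key is missing from the metadata.
    html = metadata.get("text_as_html")
    # Fall back to the plain extracted text so the table is still indexed.
    return html if html else element.get("text", "")
```

With this shape, an element lacking `text_as_html` yields its plain text instead of aborting the whole indexing run.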
CREDO23
5a71769dba fix(chunks): set position on remaining chunk insert paths
document_converters, the github size-fallback chunker, revert_service
restores, and the kb-persistence middleware now write explicit positions
(the middleware read path also orders by position).
2026-06-12 18:53:08 +02:00
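
One way to picture explicit positions on insert and position-ordered reads is sketched below. The `Chunk` dataclass, the helper names, and the zero-based numbering are assumptions for illustration, not the project's actual model.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical stand-in for the real chunk model.
    content: str
    position: int


def build_chunks(texts: list[str]) -> list[Chunk]:
    # Write an explicit position for every chunk instead of relying on
    # insertion order (zero-based numbering is an assumption).
    return [Chunk(content=text, position=i) for i, text in enumerate(texts)]


def ordered(chunks: list[Chunk]) -> list[Chunk]:
    # Read path: order by the stored position.
    return sorted(chunks, key=lambda chunk: chunk.position)
```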
Anish Sarkar
81fa219b30 feat(backend): Remove LLM summaries from document indexing 2026-06-04 00:50:19 +05:30
DESKTOP-RTLN3BA\$punk
9d6e9b7e2d feat: enhance task management and timeout configurations in multi-agent chat
- Added new environment variables for controlling task execution limits, including `SURFSENSE_SUBAGENT_INVOKE_TIMEOUT_SECONDS`, `SURFSENSE_TASK_BATCH_CONCURRENCY`, and `SURFSENSE_TASK_BATCH_MAX_SIZE`.
- Updated documentation to reflect new batch processing capabilities for `task` calls, allowing for concurrent execution of multiple subagent tasks.
- Improved error handling and receipt generation for deliverables, ensuring consistent feedback on task status.
- Refactored middleware to incorporate search space ID for better task management.
2026-05-27 14:58:10 -07:00
CREDO23
a3d6fa6196 perf(document-converters): offload sync embed_text/embed_texts to thread
generate_document_summary and create_document_chunks are async helpers
called from the chat path and from many connector indexers. Both wrapped
embed_text/embed_texts directly inside the coroutine, blocking the event
loop for the full duration of the embedding call.
2026-05-20 10:03:42 +02:00
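
A minimal sketch of offloading a blocking embedding call to a worker thread, using the standard `asyncio.to_thread`. The `embed_text` body here is a stand-in that only sleeps, and `embed_text_async` is a hypothetical wrapper name; the commit message does not say which offloading call the project uses.

```python
import asyncio
import time


def embed_text(text: str) -> list[float]:
    """Stand-in for a blocking embedding call (not the real model)."""
    time.sleep(0.05)  # simulate synchronous work
    return [float(len(text))]


async def embed_text_async(text: str) -> list[float]:
    # Run the sync call in a worker thread so the event loop can keep
    # serving other coroutines while the embedding is computed.
    return await asyncio.to_thread(embed_text, text)


async def main() -> list[list[float]]:
    # Both calls run concurrently in separate threads.
    return await asyncio.gather(
        embed_text_async("a"),
        embed_text_async("bcd"),
    )
```

Calling `asyncio.run(main())` drives both embeddings without blocking the loop for the duration of either call.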
Anish Sarkar
683a4c17dd feat: implement thread-safe embedding access in document converters
- Added a reentrant lock to ensure thread-safe access to the tokenizer and embedding model, preventing runtime errors during concurrent operations.
- Updated the `truncate_for_embedding` and `embed_text` functions to utilize the lock, ensuring safe execution in multi-threaded environments.
- Enhanced the `embed_texts` function to maintain thread safety while processing multiple texts for embedding.
2026-03-27 11:31:00 +05:30
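
The reentrant-lock pattern can be sketched like this. The function bodies are simplified stand-ins (character truncation instead of tokenizer-based truncation, a length instead of a real vector), and `_embedding_lock` and `embed_many` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# A reentrant lock lets the same thread acquire it again, which matters
# when one locked function calls another locked function.
_embedding_lock = threading.RLock()


def truncate_for_embedding(text: str, max_chars: int = 8) -> str:
    # Assumption: truncation by characters, purely for illustration.
    with _embedding_lock:
        return text[:max_chars]


def embed_text(text: str) -> list[float]:
    with _embedding_lock:
        # Nested acquisition: a plain threading.Lock would deadlock here.
        truncated = truncate_for_embedding(text)
        return [float(len(truncated))]


def embed_many(texts: list[str]) -> list[list[float]]:
    # Several threads call embed_text; the lock serializes model access.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(embed_text, texts))
```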
DESKTOP-RTLN3BA\$punk
d8a05ae4d5 feat: refactor agent tools management and add UI integration
- Added endpoint to list agent tools with metadata, excluding hidden tools.
- Updated NewChatRequest and RegenerateRequest schemas to include disabled tools.
- Integrated disabled tools management in the NewChatPage and Composer components.
- Improved tool instructions and visibility in the system prompt.
- Refactored tool registration to support hidden tools and default enabled states.
- Enhanced document chunk creation to handle strict zip behavior.
- Cleaned up imports and formatting across various files for consistency.
2026-03-10 17:36:26 -07:00
CREDO23
6eabfe2396 perf: conditional batch embedding — batch for API, sequential for local 2026-03-09 19:12:43 +02:00
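
A minimal sketch of the conditional strategy named in this commit: one batched request for an API-backed model, one call per text for a local model. The signature, the injected `embed_one` and `embed_batch` callables, and the `is_api_backend` flag are assumptions for illustration.

```python
from typing import Callable

Vector = list[float]


def embed_texts(
    texts: list[str],
    embed_one: Callable[[str], Vector],
    embed_batch: Callable[[list[str]], list[Vector]],
    is_api_backend: bool,
) -> list[Vector]:
    if not texts:
        return []
    if is_api_backend:
        # Remote API: a single batched request avoids per-text round trips.
        return embed_batch(texts)
    # Local model: embed sequentially, one text at a time.
    return [embed_one(text) for text in texts]
```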
CREDO23
c4f2e9a3a5 feat: use batch embedding in create_document_chunks 2026-03-09 16:21:14 +02:00
CREDO23
15aeec1fcb feat: add embed_texts batch embedding utility 2026-03-09 15:53:40 +02:00
DESKTOP-RTLN3BA\$punk
e9892c8fe9 feat: added configurable summary calculation and various improvements
- Replaced direct embedding calls with a utility function across various components to streamline embedding logic.
- Added enable_summary flag to several models and routes to control summary generation behavior.
2026-02-26 18:24:57 -08:00
Anish Sarkar
4526b656a4 fix: update default date range for Google Calendar events and improve query parameter handling 2026-01-30 19:55:48 +05:30
DESKTOP-RTLN3BA\$punk
48fc70a08b chore: cleanup 2026-01-07 19:07:06 -08:00
DESKTOP-RTLN3BA\$punk
c99cd710ea feat: add unique identifier hash for documents to prevent duplicates across various connectors 2025-10-14 21:11:19 -07:00
DESKTOP-RTLN3BA\$punk
ebcfd97a0e fix: added basic context window check for summarization
- Need to optimize. Better to have a separate class with long-form summarization using chunking.
2025-08-28 22:58:55 -07:00
DESKTOP-RTLN3BA\$punk
9ef2ddd15c refactor: Remove deprecated document_title parameter from generate_document_summary function 2025-08-18 20:56:53 -07:00
DESKTOP-RTLN3BA\$punk
1c4c61eb04 feat: Fixed Document Summary Content across connectors and processors 2025-08-18 20:51:48 -07:00
Utkarsh-Patel-13
d359a59f6d Fixed all ruff lint and formatting errors 2025-07-24 14:43:48 -07:00
DESKTOP-RTLN3BA\$punk
d8f2c5f7cf fix: generate content hash based on search space id as well.
- Allows reindexing in separate search spaces.
2025-06-10 13:56:23 -07:00
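
A sketch of a content hash that includes the search space id, so identical content indexed in two search spaces produces two different hashes. The function name, the SHA-256 choice, and the `id:content` layout are assumptions, not the project's confirmed implementation.

```python
import hashlib


def generate_content_hash(content: str, search_space_id: int) -> str:
    # Prefix the content with the search space id so the same document
    # hashes differently in each search space (layout is an assumption).
    combined = f"{search_space_id}:{content}"
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()
```

Duplicate detection then stays scoped to one search space: the same text hashes identically within a space and differently across spaces.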
DESKTOP-RTLN3BA\$punk
5411bac8e0 feat: Added content based hashing to prevent duplicates and fix resync issues 2025-05-28 23:52:00 -07:00
DESKTOP-RTLN3BA\$punk
da23012970 feat: SurfSense v0.0.6 init 2025-03-14 18:53:14 -07:00