SurfSense

mirror of https://github.com/MODSetter/SurfSense.git synced 2026-04-26 01:06:23 +02:00

Author	SHA1	Message	Date
Anish Sarkar	8455451ce1	chore: ran linting	2026-04-08 05:20:03 +05:30
Anish Sarkar	d072ca60bb	test: enhance file classification tests for Azure DI configuration	2026-04-08 05:13:17 +05:30
Anish Sarkar	20fa93f0ba	refactor: make Azure Document Intelligence an internal LLAMACLOUD accelerator instead of a standalone ETL service	2026-04-08 03:26:24 +05:30
Anish Sarkar	1fa8d1220b	feat: add support for Azure Document Intelligence in ETL pipeline	2026-04-08 00:59:12 +05:30
Anish Sarkar	0a26a6c5bb	chore: ran linting	2026-04-07 05:55:39 +05:30
Anish Sarkar	aba5f6a124	refactor: improve file handling logic in Dropbox and OneDrive connectors to include unsupported file extension information	2026-04-07 05:19:23 +05:30
Anish Sarkar	a624c86b04	refactor: update file skipping logic in Dropbox, Google Drive, and OneDrive connectors to return unsupported extension information	2026-04-07 05:11:15 +05:30
Anish Sarkar	122be76133	refactor: update _index_selected_files method signatures in Dropbox, Google Drive, and OneDrive indexers to include unsupported file count, enhancing error reporting and consistency across connectors	2026-04-07 03:16:46 +05:30
Anish Sarkar	3a1d700817	refactor: enhance file skipping logic across Dropbox, Google Drive, and OneDrive connectors to return unsupported extensions, improving error reporting and maintainability	2026-04-07 03:16:34 +05:30
Anish Sarkar	e7beeb2a36	refactor: unify file skipping logic across Dropbox, Google Drive, and OneDrive connectors by replacing classification checks with a centralized service-based approach, enhancing maintainability and consistency in file handling	2026-04-07 02:19:31 +05:30
Anish Sarkar	f03bf05aaa	refactor: enhance Google Drive indexer to support file extension filtering, improving file handling and error reporting	2026-04-06 22:34:49 +05:30
Anish Sarkar	63a75052ca	Merge remote-tracking branch 'upstream/dev' into feat/unified-etl-pipeline	2026-04-06 22:04:51 +05:30
Anish Sarkar	dc7047f64d	refactor: implement file type classification for supported extensions across Dropbox, Google Drive, and OneDrive connectors, enhancing file handling and error management	2026-04-06 22:03:47 +05:30
Anish Sarkar	47f4be08d9	refactor: remove allowed_formats from DocumentConverter initialization in DoclingService to allow acceptance of all supported formats	2026-04-06 19:31:42 +05:30
Anish Sarkar	caca491774	test: add unit tests for Dropbox integration, covering delta sync methods, file type filtering, and re-authentication behavior	2026-04-06 18:36:48 +05:30
Anish Sarkar	f8913adaa3	test: add unit tests for content extraction from cloud connectors and ETL pipeline functionality	2026-04-05 17:46:04 +05:30
Anish Sarkar	a2b3541046	chore: ran linting	2026-04-04 03:11:56 +05:30
Anish Sarkar	0d2acc665d	Merge remote-tracking branch 'upstream/dev' into feat/page-limit-connectors	2026-04-04 03:08:27 +05:30
Anish Sarkar	ce40da80ea	feat: implement page limit estimation and enforcement in file based connector indexers - Added a static method `estimate_pages_from_metadata` to `PageLimitService` for estimating page counts based on file metadata. - Integrated page limit checks in Google Drive, Dropbox, and OneDrive indexers to prevent exceeding user quotas during file indexing. - Updated relevant indexing methods to utilize the new page estimation logic and enforce limits accordingly. - Enhanced tests for page limit functionality, ensuring accurate estimation and enforcement across different file types.	2026-04-04 02:51:28 +05:30
Anish Sarkar	9c0af6569d	feat: implement page limit checks in local folder indexing to manage user page usage	2026-04-03 19:13:25 +05:30
Anish Sarkar	edda5b98cb	chore: ran linting	2026-04-03 17:38:29 +05:30
Anish Sarkar	b759bb36a9	feat: add direct conversion support for CSV, TSV, and HTML files in local folder indexing	2026-04-03 17:36:48 +05:30
Anish Sarkar	746c730b2e	chore: ran linting	2026-04-03 13:14:40 +05:30
Anish Sarkar	62b44889d1	Merge remote-tracking branch 'upstream/dev' into feat/local-folder-sync	2026-04-03 11:42:43 +05:30
Anish Sarkar	2b9d79d44c	feat: add integration tests for batch processing of local folder indexing, covering multiple file scenarios and error handling	2026-04-03 10:04:14 +05:30
Anish Sarkar	1fa8e1cc83	feat: refactor folder indexing to support batch processing of multiple files, enhancing performance and error handling	2026-04-03 10:02:36 +05:30
DESKTOP-RTLN3BA\$punk	62e698d8aa	refactor: streamline document upload limits and enhance handling of mentioned documents - Updated maximum file size limit to 500 MB per file. - Removed restrictions on the number of files per upload and total upload size. - Enhanced handling of user-mentioning documents in the knowledge base search middleware. - Improved document reading and processing logic to accommodate new features and optimizations.	2026-04-02 19:39:10 -07:00
Anish Sarkar	53df393cf7	refactor: streamline local folder indexing logic by removing unused imports, enhancing content hashing, and improving document creation process	2026-04-02 23:28:23 +05:30
Anish Sarkar	c27d24a117	feat: enhance folder indexing by adding root folder ID support and implement folder creation and cleanup logic	2026-04-02 22:41:45 +05:30
Anish Sarkar	caf2525ab5	fix: update folder ID collection logic to include deleted directories and adjust test cases for document titles	2026-04-02 22:29:07 +05:30
Anish Sarkar	22ee5c99cc	refactor: remove Local Folder connector and related tasks, implement new folder indexing endpoints	2026-04-02 22:21:31 +05:30
Anish Sarkar	775dea7894	feat: add integration and unit tests for local folder indexing and document versioning	2026-04-02 11:12:16 +05:30
DESKTOP-RTLN3BA\$punk	ad0e77c3d6	feat: enhance knowledge base search with date filtering	2026-03-31 20:13:46 -07:00
DESKTOP-RTLN3BA\$punk	a9fd45844d	feat: integrate Stripe for page purchases and reconciliation tasks	2026-03-31 18:39:45 -07:00
Anish Sarkar	272de1bb40	feat: add integration and unit tests for Dropbox indexing pipeline and parallel downloads	2026-03-30 22:19:15 +05:30
Anish Sarkar	04691d572b	chore: ran linting	2026-03-30 01:50:41 +05:30
Anish Sarkar	5a3eece397	Merge remote-tracking branch 'upstream/dev' into feat/onedrive-connector	2026-03-29 11:55:06 +05:30
DESKTOP-RTLN3BA\$punk	2cc2d339e6	feat: made agent file system optimized	2026-03-28 16:39:46 -07:00
Anish Sarkar	028c88be72	feat: add integration and unit tests for OneDrive indexing pipeline and parallel downloads	2026-03-28 16:39:47 +05:30
Anish Sarkar	489e48644f	fix: revert native excel parsing	2026-03-27 22:15:24 +05:30
Anish Sarkar	3da0ffd683	feat: add native Excel parsing and improve Google Drive content extraction - Introduced a new utility for parsing .xlsx files into markdown format, enhancing the ability to process Excel documents natively. - Updated the Google Drive content extractor to utilize the new Excel parsing functionality, allowing for better handling of spreadsheet files. - Enhanced file type detection and export logic to support various document formats, improving overall content extraction accuracy. - Added unit tests to ensure the correctness of the new Excel parsing feature and its integration with existing content extraction workflows.	2026-03-27 21:47:14 +05:30
Anish Sarkar	4e0749f907	fix: update file skipping logic for failed documents in Google Drive indexer - Modified the `_should_skip_file` function to skip previously failed documents during processing, improving error handling. - Updated the corresponding test to reflect the new behavior, ensuring that failed documents are correctly identified and skipped during automatic sync.	2026-03-27 20:01:08 +05:30
Anish Sarkar	00934ff462	feat: enhance Google Drive client with improved logging and thread-safe operations - Added logging to track the start and end of file download and export processes, improving visibility into execution time. - Implemented per-thread HTTP transport for concurrent downloads and exports, ensuring thread safety. - Refactored download and export methods to utilize resolved credentials, enhancing functionality. - Updated unit tests to validate the new threading and logging features, ensuring robust parallel execution.	2026-03-27 19:25:45 +05:30
Anish Sarkar	d2a4b238d7	feat: enhance Google Drive client with thread-safe download and export methods - Implemented per-thread HTTP transport for concurrent downloads to ensure thread safety. - Refactored `download_file` and `download_file_to_disk` methods to utilize blocking calls on separate threads, improving performance during file operations. - Added logging to track the start and end of download and export processes, providing better visibility into execution time. - Updated unit tests to verify parallel execution of download and export operations, ensuring efficiency in handling multiple requests.	2026-03-27 19:25:03 +05:30
Anish Sarkar	0bc1c766ff	feat: migrate Confluence and Jira indexers to unified parallel pipeline - Refactored Confluence and Jira indexers to utilize the shared IndexingPipelineService for improved document processing. - Updated the `_build_connector_doc` function in both indexers to create ConnectorDocument instances with enhanced metadata and fallback summaries. - Modified the `index_confluence_pages` and `index_jira_issues` functions to return a tuple of (indexed_count, skipped_count, warning_or_error_message) for better error handling and reporting. - Added unit tests for both indexers to validate the new parallel processing logic and ensure correct document creation and indexing behavior.	2026-03-27 16:02:09 +05:30
Anish Sarkar	db6dd058dd	feat: migrate Linear and Notion indexers to unified parallel pipeline - Refactored Linear and Notion indexers to utilize the shared IndexingPipelineService for improved document deduplication, summarization, chunking, and embedding with bounded parallel indexing. - Updated the `_build_connector_doc` function in both indexers to create ConnectorDocument instances with enhanced metadata and fallback summaries. - Modified the `index_linear_issues` and `index_notion_pages` functions to return a tuple of (indexed_count, skipped_count, warning_or_error_message) for better error handling and reporting. - Added unit tests for both indexers to validate the new parallel processing logic and ensure correct document creation and indexing behavior.	2026-03-27 11:19:32 +05:30
Anish Sarkar	7c7f8b216c	feat: implement batch indexing for selected Google Drive files - Introduced `index_google_drive_selected_files` function to enable indexing of multiple user-selected files in parallel, improving efficiency. - Refactored existing indexing logic to handle batch processing, including error handling for individual file failures. - Added unit tests for the new batch indexing functionality, ensuring robustness and proper error collection during the indexing process.	2026-03-27 00:17:07 +05:30
Anish Sarkar	c016962064	feat: implement parallel file downloading and indexing in Google Drive indexer - Added `_download_files_parallel` function to enable concurrent downloading of files from Google Drive, improving efficiency in document processing. - Introduced `_download_and_index` function to handle the parallel downloading and indexing phases, streamlining the overall workflow. - Updated `_index_full_scan` and `_index_with_delta_sync` methods to utilize the new parallel downloading functionality, enhancing performance. - Added unit tests to validate the new parallel downloading and indexing logic, ensuring robustness and error handling during document processing.	2026-03-26 23:53:26 +05:30
Anish Sarkar	4fd776e7ef	feat: implement parallel indexing for Google Calendar and Gmail connectors - Refactored Google Calendar and Gmail indexers to utilize the new `index_batch_parallel` method for concurrent document indexing, enhancing performance. - Updated the indexing logic to replace serial processing with parallel execution, allowing for improved efficiency in handling multiple documents. - Adjusted logging and error handling to accommodate the new parallel processing approach, ensuring robust operation during indexing. - Enhanced unit tests to validate the functionality of the parallel indexing method and its integration with existing workflows.	2026-03-26 19:34:04 +05:30
Anish Sarkar	e5cb6bfacf	feat: implement parallel document indexing in IndexingPipelineService - Added `index_batch_parallel` method to enable concurrent indexing of documents with bounded concurrency, improving performance and efficiency. - Refactored existing indexing logic to utilize `asyncio.to_thread` for non-blocking execution of embedding and chunking functions. - Introduced unit tests to validate the functionality of the new parallel indexing method, ensuring robustness and error handling during document processing.	2026-03-26 19:33:49 +05:30
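Several of the commits listed above describe indexing documents in parallel with bounded concurrency and moving blocking chunking and embedding work off the event loop with `asyncio.to_thread`. The following is a minimal sketch of that general pattern, not the repository's actual code: `embed_and_chunk`, `index_batch_parallel_sketch`, and the `max_concurrency` parameter are hypothetical names chosen for illustration, and the real `IndexingPipelineService` may be structured quite differently.

```python
import asyncio
import time


def embed_and_chunk(text: str) -> dict:
    """Hypothetical stand-in for blocking chunking/embedding work."""
    if not text.strip():
        raise ValueError("empty document")
    time.sleep(0.01)  # simulate CPU- or I/O-bound work
    return {"chunks": text.split(), "length": len(text)}


async def index_batch_parallel_sketch(docs: list[str], max_concurrency: int = 4):
    """Index docs concurrently; return (indexed_count, failed_count, results)."""
    # The semaphore caps how many documents are processed at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def index_one(doc: str) -> dict:
        async with semaphore:
            # Run the blocking function in a worker thread so the
            # event loop stays responsive.
            return await asyncio.to_thread(embed_and_chunk, doc)

    # return_exceptions=True keeps one failing document from
    # cancelling the rest of the batch.
    outcomes = await asyncio.gather(
        *(index_one(d) for d in docs), return_exceptions=True
    )
    results = [o for o in outcomes if not isinstance(o, Exception)]
    failed = len(outcomes) - len(results)
    return len(results), failed, results


if __name__ == "__main__":
    indexed, failed, _ = asyncio.run(
        index_batch_parallel_sketch(["alpha beta", "", "gamma"], max_concurrency=2)
    )
    print(indexed, failed)
```

Collecting per-document failures instead of raising mirrors the commit messages' emphasis on error handling for individual files, though the exact return shape here is an assumption.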


114 commits