SurfSense

mirror of https://github.com/MODSetter/SurfSense.git synced 2026-06-28 21:49:40 +02:00

guangyang1206 2f3a33c9d5 feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits	2026-05-05 12:48:04 +08:00

Document_chunker currently splits Markdown tables mid-row when the table is larger than a single chunk window, producing garbled rows that are useless for RAG retrieval (issue #1334).

Changes:
- document_chunker.py: add chunk_text_hybrid() that detects Markdown table blocks with a regex, emits each table as an indivisible single chunk, and feeds the surrounding prose through the normal chunk_text() chunker.
- indexing_pipeline_service.py: route normal (non-code) documents through chunk_text_hybrid instead of chunk_text so tables are protected by default.

Fixes #1334
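
The table-aware approach described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in document_chunker.py: the regex `TABLE_BLOCK`, the stand-in prose chunker `simple_chunk_text`, the `max_chars` parameter, and the function name `chunk_text_hybrid_sketch` are all assumptions made for the example, and the real `chunk_text()` may work quite differently.

```python
import re
from typing import Callable, List

# Hypothetical pattern (an assumption, not the project's regex): a header row,
# a delimiter row such as |---|:--:|, then zero or more body rows.
TABLE_BLOCK = re.compile(
    r"^[ \t]*\|.*\|[ \t]*\n"                          # header row
    r"[ \t]*\|[ \t:|-]*-[ \t:|-]*\|[ \t]*(?:\n|$)"    # delimiter row
    r"(?:[ \t]*\|.*\|[ \t]*(?:\n|$))*",               # body rows
    re.MULTILINE,
)


def simple_chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Stand-in for chunk_text(): greedily pack paragraphs up to max_chars."""
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def chunk_text_hybrid_sketch(
    text: str,
    chunk_prose: Callable[[str], List[str]] = simple_chunk_text,
) -> List[str]:
    """Emit each Markdown table as one chunk; chunk the prose between tables normally."""
    chunks: List[str] = []
    cursor = 0
    for match in TABLE_BLOCK.finditer(text):
        prose = text[cursor:match.start()]
        if prose.strip():
            chunks.extend(chunk_prose(prose))
        # The table is never passed to the prose chunker, so it cannot be split mid-row.
        chunks.append(match.group(0).strip())
        cursor = match.end()
    tail = text[cursor:]
    if tail.strip():
        chunks.extend(chunk_prose(tail))
    return chunks


if __name__ == "__main__":
    doc = (
        "Intro paragraph.\n\n"
        "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |\n\n"
        "Closing paragraph."
    )
    for chunk in chunk_text_hybrid_sketch(doc):
        print(repr(chunk))
```

One trade-off of this design is that a very large table becomes a single chunk that may exceed the usual chunk window, so downstream embedding limits would still need to be considered.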
adapters	refactor: implement UploadDocumentAdapter for file indexing and reindexing	2026-02-28 01:38:32 +05:30
__init__.py	test: add ConnectorDocument unit tests and factory fixture	2026-02-24 22:20:08 +02:00
connector_document.py	feat: add folder_id support in ConnectorDocument and indexing pipeline for improved document organization	2026-04-08 17:48:50 +05:30
document_chunker.py	feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits	2026-05-05 12:48:04 +08:00
document_embedder.py	feat: re-export embed_texts from document_embedder	2026-03-09 15:54:02 +02:00
document_hashing.py	feat: made agent file sytem optimized	2026-03-28 16:39:46 -07:00
document_persistence.py	fix bugs in indexing pipeline exception handling	2026-02-25 16:27:12 +02:00
document_summarizer.py	feat: enhance performance logging and caching in various components	2026-02-26 13:00:31 -08:00
exceptions.py	style: simplify LLM model terminology in UI	2026-04-02 10:11:35 +05:30
indexing_pipeline_service.py	feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits	2026-05-05 12:48:04 +08:00
pipeline_logger.py	feat: enhance performance logging and caching in various components	2026-02-26 13:00:31 -08:00