SurfSense/surfsense_backend/app/indexing_pipeline
guangyang1206 2f3a33c9d5 feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits
document_chunker.py currently splits Markdown tables mid-row when the table is
larger than a single chunk window, producing garbled rows that are useless
for RAG retrieval (issue #1334).

Changes:
- document_chunker.py: add chunk_text_hybrid() that detects Markdown table
  blocks with a regex, emits each table as an indivisible single chunk, and
  feeds the surrounding prose through the normal chunk_text() chunker.
- indexing_pipeline_service.py: route normal (non-code) documents through
  chunk_text_hybrid instead of chunk_text so tables are protected by default.
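
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the regex `TABLE_BLOCK`, the stand-in `simple_chunk_text()` (used here in place of the real `chunk_text()`), and the name `chunk_text_hybrid_sketch()` are all assumptions made for the example.

```python
import re
from typing import Callable, List

# Hypothetical pattern (an assumption, not the commit's regex): a header row,
# a separator row such as |---|:--:|, then any number of body rows.
TABLE_BLOCK = re.compile(
    r"^[ \t]*\|.*\|[ \t]*\n"                # header row
    r"[ \t]*\|[ \t:|-]+\|[ \t]*(?:\n|$)"    # separator row
    r"(?:[ \t]*\|.*\|[ \t]*(?:\n|$))*",     # body rows
    re.MULTILINE,
)


def simple_chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Stand-in for the real chunk_text(): fixed-size character windows."""
    text = text.strip()
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def chunk_text_hybrid_sketch(
    text: str,
    chunk_text: Callable[[str], List[str]] = simple_chunk_text,
) -> List[str]:
    """Emit each Markdown table as one chunk; chunk the prose around it normally."""
    chunks: List[str] = []
    cursor = 0
    for match in TABLE_BLOCK.finditer(text):
        # Prose between the previous table (or document start) and this table.
        chunks.extend(chunk_text(text[cursor:match.start()]))
        # The table itself is never split, regardless of its size.
        chunks.append(match.group(0).strip())
        cursor = match.end()
    # Trailing prose after the last table.
    chunks.extend(chunk_text(text[cursor:]))
    return chunks


doc = (
    "Intro paragraph.\n\n"
    "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |\n\n"
    "Closing paragraph.\n"
)
chunks = chunk_text_hybrid_sketch(doc)
```

With this sketch the table stays in a single chunk even when the prose chunker uses a window far smaller than the table. The trade-off is that a very large table yields a chunk larger than the normal window.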

Fixes #1334
2026-05-05 12:48:04 +08:00
adapters refactor: implement UploadDocumentAdapter for file indexing and reindexing 2026-02-28 01:38:32 +05:30
__init__.py test: add ConnectorDocument unit tests and factory fixture 2026-02-24 22:20:08 +02:00
connector_document.py feat: add folder_id support in ConnectorDocument and indexing pipeline for improved document organization 2026-04-08 17:48:50 +05:30
document_chunker.py feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits 2026-05-05 12:48:04 +08:00
document_embedder.py feat: re-export embed_texts from document_embedder 2026-03-09 15:54:02 +02:00
document_hashing.py feat: made agent file system optimized 2026-03-28 16:39:46 -07:00
document_persistence.py fix bugs in indexing pipeline exception handling 2026-02-25 16:27:12 +02:00
document_summarizer.py feat: enhance performance logging and caching in various components 2026-02-26 13:00:31 -08:00
exceptions.py style: simplify LLM model terminology in UI 2026-04-02 10:11:35 +05:30
indexing_pipeline_service.py feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits 2026-05-05 12:48:04 +08:00
pipeline_logger.py feat: enhance performance logging and caching in various components 2026-02-26 13:00:31 -08:00