mirror of
https://github.com/MODSetter/SurfSense.git
synced 2026-05-06 14:22:47 +02:00
Document_chunker currently splits Markdown tables mid-row when the table is larger than a single chunk window, producing garbled rows that are useless for RAG retrieval (issue #1334).

Changes:
- document_chunker.py: add chunk_text_hybrid() that detects Markdown table blocks with a regex, emits each table as an indivisible single chunk, and feeds the surrounding prose through the normal chunk_text() chunker.
- indexing_pipeline_service.py: route normal (non-code) documents through chunk_text_hybrid instead of chunk_text so tables are protected by default.

Fixes #1334

| | | |
|---|---|---|
| .. | ||
| agents | ||
| config | ||
| connectors | ||
| etl_pipeline | ||
| indexing_pipeline | ||
| observability | ||
| prompts | ||
| retriever | ||
| routes | ||
| schemas | ||
| services | ||
| tasks | ||
| templates | ||
| utils | ||
| __init__.py | ||
| app.py | ||
| celery_app.py | ||
| db.py | ||
| exceptions.py | ||
| rate_limiter.py | ||
| users.py | ||
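
The table-aware chunking described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in document_chunker.py: the regex, the `chunk_text_hybrid_sketch` name, and the `naive_chunk_text` stand-in for the project's chunk_text() are all assumptions made for the example.

```python
import re
from typing import Callable, List

# Assumption: a table block is a header row, a separator row containing at
# least one dash, and any number of body rows, each starting with a pipe.
TABLE_BLOCK = re.compile(
    r"^[ \t]*\|.*\n"                   # header row
    r"[ \t]*\|[ \t:|-]*-[ \t:|-]*\n"   # separator row, e.g. |---|:--:|
    r"(?:[ \t]*\|.*\n)*",              # zero or more body rows
    re.MULTILINE,
)


def naive_chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Hypothetical stand-in for chunk_text(): fixed-size character windows."""
    text = text.strip()
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def chunk_text_hybrid_sketch(
    text: str,
    chunk_prose: Callable[[str], List[str]] = naive_chunk_text,
) -> List[str]:
    """Emit each Markdown table as one chunk; chunk the prose around it normally."""
    # Ensure the last line ends with a newline so a trailing table row matches.
    if not text.endswith("\n"):
        text += "\n"

    chunks: List[str] = []
    cursor = 0
    for match in TABLE_BLOCK.finditer(text):
        # Prose between the previous table (or start of text) and this table.
        prose = text[cursor:match.start()]
        if prose.strip():
            chunks.extend(chunk_prose(prose))
        # The whole table is kept as a single, indivisible chunk.
        chunks.append(match.group(0).strip())
        cursor = match.end()

    # Prose after the last table.
    tail = text[cursor:]
    if tail.strip():
        chunks.extend(chunk_prose(tail))
    return chunks
```

Passing a prose chunker with a small window, for instance `lambda t: naive_chunk_text(t, max_chars=10)`, splits the surrounding paragraphs into short pieces while any table still comes back whole, regardless of its size. A real implementation would likely also need to decide what to do with tables that exceed the embedding model's input limit, which this sketch does not address.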