mirror of
https://github.com/MODSetter/SurfSense.git
synced 2026-05-06 14:22:47 +02:00
Document_chunker currently splits Markdown tables mid-row when the table is larger than a single chunk window, producing garbled rows that are useless for RAG retrieval (issue #1334). Changes: - document_chunker.py: add chunk_text_hybrid() that detects Markdown table blocks with a regex, emits each table as an indivisible single chunk, and feeds the surrounding prose through the normal chunk_text() chunker. - indexing_pipeline_service.py: route normal (non-code) documents through chunk_text_hybrid instead of chunk_text so tables are protected by default. Fixes #1334 |
||
|---|---|---|
| .. | ||
| alembic | ||
| app | ||
| scripts | ||
| tests | ||
| .dockerignore | ||
| .env.example | ||
| .gitignore | ||
| .python-version | ||
| alembic.ini | ||
| celery_worker.py | ||
| Dockerfile | ||
| main.py | ||
| pyproject.toml | ||
| uv.lock | ||