SurfSense

mirror of https://github.com/MODSetter/SurfSense.git synced 2026-06-12 20:45:20 +02:00

Author	SHA1	Message	Date
Thierry CH.	00dd9df44f	Merge `dcebfc4756` into `4c28ba5295`	2026-06-12 10:46:11 -07:00
CREDO23	dcebfc4756	Merge remote-tracking branch 'upstream/dev' into features/documents-injestion-layered-cached	2026-06-12 19:35:34 +02:00
CREDO23	311570b4f0	test(indexing): cover the edit path and make integration caches hermetic. Real-DB tests assert unchanged chunk rows survive edits, only new text is embedded, removed rows are deleted with positions compacted, and the kill switch restores full-replace. An autouse fixture disables the ETL/embedding caches so a developer's .env can't leak cache hits into unrelated tests.	2026-06-12 18:53:21 +02:00
CREDO23	052e9ef4d1	refactor(chunks): order chunk reads by (document_id, position). Presentation and citation ordering moves off Chunk.id/created_at to the explicit position column (id kept as tiebreaker). Vector and ts_rank ranking order_by clauses are untouched.	2026-06-12 18:53:21 +02:00
CREDO23	5a71769dba	fix(chunks): set position on remaining chunk insert paths. document_converters, the github size-fallback chunker, revert_service restores, and the kb-persistence middleware now write explicit positions (the middleware read path also orders by position).	2026-06-12 18:53:08 +02:00
CREDO23	7d55aaf2c1	feat(indexing): reconcile chunks incrementally on re-index. index() now loads existing rows and applies a content diff instead of delete-all/reinsert-all: unchanged chunks keep their rows and embeddings (zero HNSW/GIN churn), moved chunks get a position-only UPDATE, and only new texts are embedded, batched with the summary embedding. First index keeps the cache-aware build_chunk_embeddings path.	2026-06-12 18:53:08 +02:00
CREDO23	fd495e1b2f	feat(observability): add chunk reconcile metric and kill-switch flag. surfsense.indexing.reconcile.chunks counts reused/embedded/deleted chunks per re-index. CHUNK_RECONCILE_ENABLED (default on) falls back to delete-all + full re-embed if the diff path ever misbehaves.	2026-06-12 18:52:57 +02:00
CREDO23	8d413ea5c2	refactor(indexing): expose chunk_markdown and embed_batch helpers. Split _compute so the incremental edit path can reuse the exact same chunker selection and embedding entry points (and their test patch targets) without going through the doc-level cache.	2026-06-12 18:52:57 +02:00
CREDO23	f82dedf712	feat(indexing): add pure chunk reconciler for content-addressed diffs. Greedy multiset match on chunk text decides which rows keep their embeddings, which texts need embedding, and which rows are deleted. No DB, no embeddings; fully unit-tested (reuse, head insert, middle edit, deletion, duplicates, reorder, full rewrite).	2026-06-12 18:52:46 +02:00
CREDO23	c6e71c851c	feat(chunks): add explicit position column with backfill migration. Chunk ids stop reflecting document order once incremental re-indexing keeps unchanged rows across edits. Backfill preserves the historical id ordering so behavior is identical on day one.	2026-06-12 18:52:45 +02:00
CREDO23	412493ae08	test(embedding-cache): add integration tests for service, repository, and store. Covers the public cache surface against real Postgres and a real local file backend (no mocks): recall miss, remember->recall vector/text/order round-trip, the dimension-mismatch refusal, the repository SQL behind eviction and dedup (size sum, coldest ordering, TTL cutoff, duplicate-key no-op, reuse counter), and the blob store save/load round-trip and delete.	2026-06-12 17:33:21 +02:00
CREDO23	91d947ff79	refactor(embedding-cache): rename index cache to embedding cache. The cached payload is the indexing pipeline's embeddings (markdown is chunked then embedded), so "embedding cache" names the expensive output directly and removes the "index" ambiguity (DB index vs vector index vs indexing phase). Renames the service, settings, eligibility, eviction task, metrics, config flags (INDEX_CACHE_* -> EMBEDDING_CACHE_*), object prefix, and the table (index_cache_embedding_sets -> embedding_cache_sets) with its constraint and indexes. Migration 161 renamed accordingly.	2026-06-12 17:00:01 +02:00
CREDO23	8cf578d965	test(index-cache): add unit tests and repoint embed/chunk patch targets	2026-06-12 16:48:18 +02:00
CREDO23	4e4f7f34fa	feat(index-cache): add TTL/size eviction task and daily schedule	2026-06-12 16:48:18 +02:00
CREDO23	019aa7bf76	feat(index-cache): serve chunk embeddings from cache during indexing	2026-06-12 16:48:18 +02:00
CREDO23	e8938c119b	feat(index-cache): add recall/remember service	2026-06-12 16:48:10 +02:00
CREDO23	4d6378e031	feat(observability): add index cache hit/miss and eviction metrics	2026-06-12 16:48:10 +02:00
CREDO23	daccd304ee	feat(index-cache): add settings, eligibility, and config flags	2026-06-12 16:48:10 +02:00
CREDO23	ad6da7c6af	feat(index-cache): add embedding blob store sharing the cache backend	2026-06-12 16:48:01 +02:00
CREDO23	f541114544	feat(index-cache): add cached embedding set table and repository	2026-06-12 16:48:01 +02:00
CREDO23	59fa4c38c3	feat(index-cache): add pickle-free blob serialization	2026-06-12 16:48:01 +02:00
CREDO23	cf208365b4	feat(index-cache): add embedding set value objects	2026-06-12 16:48:01 +02:00
CREDO23	0fb1d3d37b	feat(etl-cache): route all file-based sources through the parse cache. Every file ingestion path (Dropbox, Google Drive / Composio Drive, OneDrive, local folder, Obsidian, and the legacy upload handlers) now parses via the extract_with_cache facade instead of calling EtlPipelineService.extract directly, so identical bytes are deduplicated globally regardless of source. vision_llm is passed through, keeping the existing cacheability gate intact.	2026-06-12 14:47:25 +02:00
CREDO23	99cf212c31	test: fix auth-mode mismatch and stale QuotaInsufficientError kwargs. Pin AUTH_TYPE=LOCAL (and REGISTRATION_ENABLED=TRUE) in the test bootstrap so the email/password auth routers mount during integration tests regardless of a developer's .env=GOOGLE; without this the upload tests 404 on registration. Also update three tests to the current QuotaInsufficientError signature (balance_micros) after used_micros/limit_micros were removed.	2026-06-12 12:19:49 +02:00
CREDO23	0808fbcdee	feat(etl-cache): emit hit/miss and eviction metrics	2026-06-12 11:57:03 +02:00
CREDO23	9efe24879d	feat(observability): add etl cache lookup and eviction metrics	2026-06-12 11:57:03 +02:00
CREDO23	d5e0280097	test(etl-cache): cover two-phase eviction task on real infra	2026-06-12 11:54:36 +02:00
CREDO23	1460173dad	test(etl-cache): cover extract_with_cache end-to-end	2026-06-12 11:50:57 +02:00
CREDO23	c49a0f1233	test(etl-cache): cover store, service, and repository on real infra	2026-06-12 11:50:57 +02:00
CREDO23	3dec3231d0	test(etl-cache): cover over-budget eviction selection	2026-06-12 11:50:52 +02:00
CREDO23	a3e7047c35	test(etl-cache): cover cacheability gate rules	2026-06-12 11:50:52 +02:00
CREDO23	dddacbe762	test(etl-cache): cover content-addressing dedup and key shape	2026-06-12 11:50:52 +02:00
CREDO23	ce1e90386f	refactor(etl-cache): extract pure cacheability gate	2026-06-12 11:50:51 +02:00
CREDO23	5af594c405	docs(env): document ETL_CACHE_* settings	2026-06-12 11:23:50 +02:00
CREDO23	d898716cf4	feat(migration): add etl_cache_parses table	2026-06-12 11:23:50 +02:00
CREDO23	0dc2ccc003	feat(tasks): route extraction through etl cache	2026-06-12 11:23:50 +02:00
CREDO23	1c05980ffb	feat(celery): schedule etl cache eviction	2026-06-12 11:23:50 +02:00
CREDO23	9f29a885b1	feat(db): register CachedParse model	2026-06-12 11:23:50 +02:00
CREDO23	5c4eec26cc	feat(config): add ETL_CACHE_* settings	2026-06-12 11:23:50 +02:00
CREDO23	324ba141a6	feat(etl-cache): add eviction task and public API	2026-06-12 11:23:40 +02:00
CREDO23	7ad39fd995	feat(etl-cache): add eviction policy	2026-06-12 11:23:40 +02:00
CREDO23	758da06c4f	feat(etl-cache): add extract_with_cache	2026-06-12 11:23:40 +02:00
CREDO23	41dea96af4	feat(etl-cache): add EtlCacheService	2026-06-12 11:23:40 +02:00
CREDO23	87fdb37fa3	feat(etl-cache): expose storage layer	2026-06-12 11:23:40 +02:00
CREDO23	a6f2457c7c	feat(etl-cache): add MarkdownCacheStore for cache blobs	2026-06-12 11:22:57 +02:00
CREDO23	217d040e9e	feat(etl-cache): resolve cache blob storage backend	2026-06-12 11:22:57 +02:00
CREDO23	d9b1b491e9	feat(etl-cache): add cache blob object-key builder	2026-06-12 11:22:57 +02:00
CREDO23	8d3238bcd1	feat(etl-cache): expose cache persistence layer	2026-06-12 11:22:57 +02:00
CREDO23	ea10127979	feat(etl-cache): add CachedParseRepository data access	2026-06-12 11:22:57 +02:00
CREDO23	c624235780	feat(etl-cache): add CachedParse table model	2026-06-12 11:22:48 +02:00
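
Note on f82dedf712 / 7d55aaf2c1: the greedy multiset match those messages describe can be pictured with the minimal sketch below. It is not the repository's code. The names `ReconcilePlan` and `reconcile`, the `(row_id, text)` input shape, and the first-come matching order are all assumptions made for illustration; only the three outcomes (keep row and embedding, embed new text, delete leftover row) come from the commit messages.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class ReconcilePlan:
    # Hypothetical result type; the real reconciler's shape is not shown in the log.
    # (row_id, new_position): text unchanged, so the row and its embedding are reused.
    keep: list[tuple[int, int]] = field(default_factory=list)
    # (new_position, text): no existing row has this text, so it must be embedded.
    embed: list[tuple[int, str]] = field(default_factory=list)
    # row ids whose text was not claimed by any new chunk.
    delete: list[int] = field(default_factory=list)


def reconcile(existing: list[tuple[int, str]], new_texts: list[str]) -> ReconcilePlan:
    """Pure diff: no DB access and no embedding calls.

    `existing` is assumed to be (row_id, text) pairs in current position order.
    """
    # Multiset of existing rows keyed by chunk text; duplicates queue up in order.
    pool: dict[str, deque[int]] = defaultdict(deque)
    for row_id, text in existing:
        pool[text].append(row_id)

    plan = ReconcilePlan()
    for position, text in enumerate(new_texts):
        candidates = pool.get(text)
        if candidates:
            # Greedy: claim the earliest unclaimed row with identical text.
            plan.keep.append((candidates.popleft(), position))
        else:
            plan.embed.append((position, text))

    # Whatever is left in the pool no longer appears in the document.
    plan.delete = sorted(row_id for ids in pool.values() for row_id in ids)
    return plan
```

Under these assumptions, calling `reconcile([(1, "a"), (2, "b"), (3, "a"), (4, "c")], ["x", "a", "b", "a", "a"])` keeps rows 1, 2, and 3 at new positions 1, 2, and 3, schedules "x" (position 0) and the third "a" (position 4) for embedding, and marks row 4 for deletion. A caller would then issue position-only updates for kept rows whose position changed, which matches the "moved chunks get a position-only UPDATE" behavior the log mentions.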


6665 commits