Commit graph

6648 commits

Author SHA1 Message Date
CREDO23
7d55aaf2c1 feat(indexing): reconcile chunks incrementally on re-index
index() now loads existing rows and applies a content diff instead of
delete-all/reinsert-all: unchanged chunks keep their rows and embeddings
(zero HNSW/GIN churn), moved chunks get a position-only UPDATE, and only
new texts are embedded, batched with the summary embedding. First index
keeps the cache-aware build_chunk_embeddings path.
2026-06-12 18:53:08 +02:00
CREDO23
fd495e1b2f feat(observability): add chunk reconcile metric and kill-switch flag
surfsense.indexing.reconcile.chunks counts reused/embedded/deleted chunks per
re-index. CHUNK_RECONCILE_ENABLED (default on) falls back to delete-all +
full re-embed if the diff path ever misbehaves.
2026-06-12 18:52:57 +02:00
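
A kill-switch like the one described in this commit could be read roughly as follows. This is a minimal sketch, not the project's code: only the CHUNK_RECONCILE_ENABLED name and its default-on behavior come from the commit message. The helper names, the accepted "off" spellings, and the path labels are assumptions made for illustration.

```python
import os

# Assumed set of spellings that turn the flag off; the real parser may differ.
_FALSY = {"0", "false", "no", "off"}


def chunk_reconcile_enabled(env=None):
    """Return True unless CHUNK_RECONCILE_ENABLED is explicitly switched off."""
    env = os.environ if env is None else env
    raw = env.get("CHUNK_RECONCILE_ENABLED")
    if raw is None:
        return True  # default on, as stated in the commit message
    return raw.strip().lower() not in _FALSY


def choose_index_path(has_existing_rows, env=None):
    """Hypothetical dispatcher: pick the diff path or the full re-embed path."""
    if has_existing_rows and chunk_reconcile_enabled(env):
        return "reconcile"
    # First index, or flag switched off: delete-all + full re-embed.
    return "full_reembed"
```

Passing the environment as a mapping keeps the flag easy to exercise in tests without mutating os.environ.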
CREDO23
8d413ea5c2 refactor(indexing): expose chunk_markdown and embed_batch helpers
Split _compute so the incremental edit path can reuse the exact same chunker
selection and embedding entry points (and their test patch targets) without
going through the doc-level cache.
2026-06-12 18:52:57 +02:00
CREDO23
f82dedf712 feat(indexing): add pure chunk reconciler for content-addressed diffs
Greedy multiset match on chunk text decides which rows keep their embeddings,
which texts need embedding, and which rows are deleted. No DB, no embeddings;
fully unit-tested (reuse, head insert, middle edit, deletion, duplicates,
reorder, full rewrite).
2026-06-12 18:52:46 +02:00
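
The greedy multiset match described in this commit can be sketched as below. This illustrates the technique under stated assumptions and is not the reconciler from the repository: the ReconcilePlan type, the reconcile function, and the tuple shapes are hypothetical names chosen for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class ReconcilePlan:
    # (row_id, new_position): row and embedding are reused, only position may change
    keep: list = field(default_factory=list)
    # (new_position, text): text has no spare existing row and must be embedded
    embed: list = field(default_factory=list)
    # row_ids whose text no longer appears in the document
    delete: list = field(default_factory=list)


def reconcile(existing, new_texts):
    """existing: [(row_id, text)] in old order; new_texts: [str] in new order."""
    # Multiset of existing rows keyed by text; duplicates queue up in old order.
    pool = defaultdict(deque)
    for row_id, text in existing:
        pool[text].append(row_id)

    plan = ReconcilePlan()
    for position, text in enumerate(new_texts):
        if pool[text]:
            # Greedy: take the earliest unused row with identical text.
            plan.keep.append((pool[text].popleft(), position))
        else:
            plan.embed.append((position, text))

    # Whatever is left in the pool was not matched by any new chunk.
    for row_ids in pool.values():
        plan.delete.extend(row_ids)
    return plan


plan = reconcile(
    existing=[(1, "a"), (2, "b"), (3, "a"), (4, "c")],
    new_texts=["x", "a", "a", "a", "b"],
)
```

In this example the two existing "a" rows are reused for the first two "a" chunks, the third "a" and the new "x" need embeddings, "b" moves to the end, and the "c" row is deleted. The function touches neither a database nor an embedding model, which is what makes this style of reconciler easy to unit-test.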
CREDO23
c6e71c851c feat(chunks): add explicit position column with backfill migration
Chunk ids stop reflecting document order once incremental re-indexing keeps
unchanged rows across edits. Backfill preserves the historical id ordering
so behavior is identical on day one.
2026-06-12 18:52:45 +02:00
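
A backfill that preserves historical id ordering could look like the sketch below. It uses an in-memory SQLite database purely so the example is self-contained, whereas the commits in this log test against Postgres. The table layout, the document_id column, and 0-based positions are assumptions for illustration, not the actual migration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id INTEGER, content TEXT)"
)
conn.executemany(
    "INSERT INTO chunks (id, document_id, content) VALUES (?, ?, ?)",
    [(10, 1, "intro"), (11, 2, "other doc"), (12, 1, "body"), (15, 1, "outro")],
)

# Step 1: add the explicit ordering column.
conn.execute("ALTER TABLE chunks ADD COLUMN position INTEGER")

# Step 2: backfill. A chunk's position is the number of chunks in the same
# document with a smaller id, so the old id ordering is reproduced exactly.
conn.execute(
    """
    UPDATE chunks
    SET position = (
        SELECT COUNT(*)
        FROM chunks AS earlier
        WHERE earlier.document_id = chunks.document_id
          AND earlier.id < chunks.id
    )
    """
)

rows = conn.execute(
    "SELECT id, position FROM chunks WHERE document_id = 1 ORDER BY position"
).fetchall()
```

After the backfill, ordering by position gives the same result as ordering by id, so readers see no change until incremental re-indexing starts reusing rows.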
CREDO23
412493ae08 test(embedding-cache): add integration tests for service, repository, and store
Covers the public cache surface against real Postgres and a real local file
backend (no mocks): recall miss, remember->recall vector/text/order round-trip,
the dimension-mismatch refusal, the repository SQL behind eviction and dedup
(size sum, coldest ordering, TTL cutoff, duplicate-key no-op, reuse counter),
and the blob store save/load round-trip and delete.
2026-06-12 17:33:21 +02:00
CREDO23
91d947ff79 refactor(embedding-cache): rename index cache to embedding cache
The cached payload is the indexing pipeline's embeddings (markdown is
chunked then embedded), so "embedding cache" names the expensive output
directly and removes the "index" ambiguity (DB index vs vector index vs
indexing phase). Renames the service, settings, eligibility, eviction
task, metrics, config flags (INDEX_CACHE_* -> EMBEDDING_CACHE_*), object
prefix, and the table (index_cache_embedding_sets -> embedding_cache_sets)
with its constraint and indexes. Migration 161 renamed accordingly.
2026-06-12 17:00:01 +02:00
CREDO23
8cf578d965 test(index-cache): add unit tests and repoint embed/chunk patch targets 2026-06-12 16:48:18 +02:00
CREDO23
4e4f7f34fa feat(index-cache): add TTL/size eviction task and daily schedule 2026-06-12 16:48:18 +02:00
CREDO23
019aa7bf76 feat(index-cache): serve chunk embeddings from cache during indexing 2026-06-12 16:48:18 +02:00
CREDO23
e8938c119b feat(index-cache): add recall/remember service 2026-06-12 16:48:10 +02:00
CREDO23
4d6378e031 feat(observability): add index cache hit/miss and eviction metrics 2026-06-12 16:48:10 +02:00
CREDO23
daccd304ee feat(index-cache): add settings, eligibility, and config flags 2026-06-12 16:48:10 +02:00
CREDO23
ad6da7c6af feat(index-cache): add embedding blob store sharing the cache backend 2026-06-12 16:48:01 +02:00
CREDO23
f541114544 feat(index-cache): add cached embedding set table and repository 2026-06-12 16:48:01 +02:00
CREDO23
59fa4c38c3 feat(index-cache): add pickle-free blob serialization 2026-06-12 16:48:01 +02:00
CREDO23
cf208365b4 feat(index-cache): add embedding set value objects 2026-06-12 16:48:01 +02:00
CREDO23
0fb1d3d37b feat(etl-cache): route all file-based sources through the parse cache
Every file ingestion path (Dropbox, Google Drive / Composio Drive, OneDrive,
local folder, Obsidian, and the legacy upload handlers) now parses via the
extract_with_cache facade instead of calling EtlPipelineService.extract
directly, so identical bytes are deduplicated globally regardless of source.
vision_llm is passed through, keeping the existing cacheability gate intact.
2026-06-12 14:47:25 +02:00
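
The global deduplication of identical bytes can be illustrated with a content-addressed lookup. This is a sketch in the spirit of the extract_with_cache facade, not its real signature: the ParseCache class, the get_or_parse method, the in-memory dict, and the cacheable flag (standing in for the cacheability gate, whose rules are not shown in this log) are all assumptions.

```python
import hashlib


class ParseCache:
    """Toy content-addressed parse cache keyed by the SHA-256 of the file bytes."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_parse(self, data, parse, cacheable=True):
        if not cacheable:
            # Gate closed: always parse, never read or write the cache.
            return parse(data)
        key = hashlib.sha256(data).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = parse(data)
        self._store[key] = result
        return result


calls = []


def fake_parse(data):
    # Stand-in for an expensive ETL extraction; records each invocation.
    calls.append(data)
    return data.decode("utf-8").upper()


cache = ParseCache()
from_dropbox = cache.get_or_parse(b"same bytes", fake_parse)
from_onedrive = cache.get_or_parse(b"same bytes", fake_parse)
```

Because the key depends only on the bytes, the second call is served from the cache even though it notionally comes from a different source, and the parser runs once.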
CREDO23
99cf212c31 test: fix auth-mode mismatch and stale QuotaInsufficientError kwargs
Pin AUTH_TYPE=LOCAL (and REGISTRATION_ENABLED=TRUE) in the test bootstrap so
the email/password auth routers mount during integration tests even when a
developer's .env sets AUTH_TYPE=GOOGLE; without this the upload tests 404 on registration.
Also update three tests to the current QuotaInsufficientError signature
(balance_micros) after used_micros/limit_micros were removed.
2026-06-12 12:19:49 +02:00
CREDO23
0808fbcdee feat(etl-cache): emit hit/miss and eviction metrics 2026-06-12 11:57:03 +02:00
CREDO23
9efe24879d feat(observability): add etl cache lookup and eviction metrics 2026-06-12 11:57:03 +02:00
CREDO23
d5e0280097 test(etl-cache): cover two-phase eviction task on real infra 2026-06-12 11:54:36 +02:00
CREDO23
1460173dad test(etl-cache): cover extract_with_cache end-to-end 2026-06-12 11:50:57 +02:00
CREDO23
c49a0f1233 test(etl-cache): cover store, service, and repository on real infra 2026-06-12 11:50:57 +02:00
CREDO23
3dec3231d0 test(etl-cache): cover over-budget eviction selection 2026-06-12 11:50:52 +02:00
CREDO23
a3e7047c35 test(etl-cache): cover cacheability gate rules 2026-06-12 11:50:52 +02:00
CREDO23
dddacbe762 test(etl-cache): cover content-addressing dedup and key shape 2026-06-12 11:50:52 +02:00
CREDO23
ce1e90386f refactor(etl-cache): extract pure cacheability gate 2026-06-12 11:50:51 +02:00
CREDO23
5af594c405 docs(env): document ETL_CACHE_* settings 2026-06-12 11:23:50 +02:00
CREDO23
d898716cf4 feat(migration): add etl_cache_parses table 2026-06-12 11:23:50 +02:00
CREDO23
0dc2ccc003 feat(tasks): route extraction through etl cache 2026-06-12 11:23:50 +02:00
CREDO23
1c05980ffb feat(celery): schedule etl cache eviction 2026-06-12 11:23:50 +02:00
CREDO23
9f29a885b1 feat(db): register CachedParse model 2026-06-12 11:23:50 +02:00
CREDO23
5c4eec26cc feat(config): add ETL_CACHE_* settings 2026-06-12 11:23:50 +02:00
CREDO23
324ba141a6 feat(etl-cache): add eviction task and public API 2026-06-12 11:23:40 +02:00
CREDO23
7ad39fd995 feat(etl-cache): add eviction policy 2026-06-12 11:23:40 +02:00
CREDO23
758da06c4f feat(etl-cache): add extract_with_cache 2026-06-12 11:23:40 +02:00
CREDO23
41dea96af4 feat(etl-cache): add EtlCacheService 2026-06-12 11:23:40 +02:00
CREDO23
87fdb37fa3 feat(etl-cache): expose storage layer 2026-06-12 11:23:40 +02:00
CREDO23
a6f2457c7c feat(etl-cache): add MarkdownCacheStore for cache blobs 2026-06-12 11:22:57 +02:00
CREDO23
217d040e9e feat(etl-cache): resolve cache blob storage backend 2026-06-12 11:22:57 +02:00
CREDO23
d9b1b491e9 feat(etl-cache): add cache blob object-key builder 2026-06-12 11:22:57 +02:00
CREDO23
8d3238bcd1 feat(etl-cache): expose cache persistence layer 2026-06-12 11:22:57 +02:00
CREDO23
ea10127979 feat(etl-cache): add CachedParseRepository data access 2026-06-12 11:22:57 +02:00
CREDO23
c624235780 feat(etl-cache): add CachedParse table model 2026-06-12 11:22:48 +02:00
CREDO23
205a63b9bc feat(etl-cache): add EtlCacheSettings resolved from config 2026-06-12 11:22:48 +02:00
CREDO23
b84debd999 feat(etl-cache): expose cache schema value objects 2026-06-12 11:22:48 +02:00
CREDO23
3c9ea0011d feat(etl-cache): add EvictionCandidate value object 2026-06-12 11:22:48 +02:00
CREDO23
24f824b597 feat(etl-cache): add ParseKey cache identity value object 2026-06-12 11:22:48 +02:00
DESKTOP-RTLN3BA\$punk
c855be8ccd fix(auto_reload): update task to use a lambda for user_id in async call 2026-06-11 16:51:18 -07:00