Commit graph

2766 commits

Author SHA1 Message Date
Dmitry Maranik
81fc467187 test(connectors): regression tests for cross-search-space index authorization
Two integration tests pinning the connector index endpoint's authorization:

- cross-space index (attacker owns space B, connector lives in victim's
  space A, request passes search_space_id=B) is rejected with 404 at the
  search-space reconciliation, before the permission check (which would
  otherwise pass for the attacker's own space).
- same-space index authorizes check_permission against the connector's
  own search space, not the caller-supplied query param.

Mirrors the existing tests/integration harness (direct handler calls with
the savepoint-rolled-back db_session; check_permission patched so the test
needs no real RBAC wiring).
2026-06-16 16:18:40 -07:00
Dmitry Maranik
e1ea82d7cf fix(connectors): scope index endpoint authorization to the connector's own search space
The POST /search-source-connectors/{connector_id}/index endpoint loaded
the connector by id and then called check_permission() against the
client-supplied search_space_id query parameter (the caller's own space)
rather than the connector's own search_space_id, and never verified that
the two matched.

A user could therefore index another user's connector by passing their
own search_space_id: the indexer ran with the victim connector's stored
credentials and wrote the fetched content into the attacker's search
space. The read/update/delete handlers already authorize against
connector.search_space_id; this brings the index handler in line.

Reject a connector that does not belong to the requested search space
(404, to avoid disclosing connectors in other spaces) and authorize the
permission check against connector.search_space_id.
2026-06-16 15:58:30 -07:00
DESKTOP-RTLN3BA\$punk
5d99489f4b feat(migration): implement chunk position backfill with batched updates and indexing for improved performance 2026-06-16 15:19:56 -07:00
Anish Sarkar
9b7e278114 refactor(config): update GATEWAY_ENABLED variable to FALSE and adjust related configurations for improved messaging gateway handling 2026-06-16 23:49:26 +05:30
Anish Sarkar
2a840fcc10 refactor(backend): derive frontend and backend urls from SURFSENSE_PUBLIC_URL 2026-06-16 02:10:50 +05:30
Anish Sarkar
6b31997599 Merge remote-tracking branch 'upstream/dev' into experiment/lean-url-port-architecture 2026-06-15 20:52:15 +05:30
Rohan Verma
69bdcf5946
Merge pull request #1491 from AnishSarkar22/feat/unified-model-connections
feat: Fix model attribution for prefix-stripped token usage callbacks
2026-06-14 17:50:48 -07:00
Anish Sarkar
0c15a37618 chore: update dependencies in pyproject.toml and uv.lock, removing flower 2026-06-14 20:29:52 +05:30
CREDO23
32a6e54ce6 Merge remote-tracking branch 'upstream/dev' into features/documents-injestion-layered-cached 2026-06-14 11:30:33 +02:00
Anish Sarkar
d9a4f14f99 feat(token-tracking): enhance model metadata reconciliation by adding bare model name handling 2026-06-14 12:18:22 +05:30
Anish Sarkar
7926814070 refactor(model-connections): remove unused fields and update verification logic 2026-06-14 02:46:19 +05:30
Anish Sarkar
c7409c8995 chore: ran linting 2026-06-13 21:59:35 +05:30
Anish Sarkar
ceace003aa feat(local-models): add documentation for connecting local model providers 2026-06-13 21:52:45 +05:30
Anish Sarkar
ab5423d2d2 Merge remote-tracking branch 'upstream/dev' into feat/unified-model-connections 2026-06-13 19:04:49 +05:30
Anish Sarkar
76843f42f1 refactor(anonymous-models): remove description field from anonymous model responses and update related UI components 2026-06-13 16:30:26 +05:30
Anish Sarkar
576c56628a chore(config): update global LLM configuration example with improved setup instructions, parameter naming, and enhanced comments for clarity 2026-06-13 14:57:14 +05:30
Anish Sarkar
e104193ddf refactor(provider-configuration): standardize provider parameter naming across various modules and improve quota error handling in tests 2026-06-13 14:23:32 +05:30
Anish Sarkar
4a6a282a46 feat(runtime-cooldown): implement Redis-based shared cooldown management for model selection 2026-06-13 13:53:01 +05:30
Anish Sarkar
bd4a04f2e7 feat(database-migrations): add migration to remove legacy model config tables and remove stale model connection code 2026-06-13 12:45:43 +05:30
Anish Sarkar
8fe9c21e76 feat(token-tracking): add model metadata registration and enhance token usage tracking 2026-06-13 03:08:35 +05:30
Anish Sarkar
5e86885a03 feat(model-connections): integrate model provider connections panel and connection card components 2026-06-13 02:40:22 +05:30
Anish Sarkar
15d9983669 feat(model-connections): enhance model selection facts and auto pinning logic 2026-06-13 02:19:27 +05:30
Anish Sarkar
45d27ba879 feat(model-connections): enhance auto mode with auto pinning 2026-06-13 01:39:26 +05:30
Anish Sarkar
9f6210ad08 feat(model-connections): add test preview functionality for model connections 2026-06-13 00:12:04 +05:30
CREDO23
dcebfc4756 Merge remote-tracking branch 'upstream/dev' into features/documents-injestion-layered-cached 2026-06-12 19:35:34 +02:00
Anish Sarkar
55f004e1da feat(model-connections): improve model discovery error handling and enhance UI components 2026-06-12 22:50:50 +05:30
Anish Sarkar
407f2a9612 feat(model-connections): enhance model connection functionality with preview and selection features 2026-06-12 22:41:21 +05:30
CREDO23
311570b4f0 test(indexing): cover the edit path and make integration caches hermetic
Real-DB tests assert unchanged chunk rows survive edits, only new text is
embedded, removed rows are deleted with positions compacted, and the kill
switch restores full-replace. An autouse fixture disables the ETL/embedding
caches so a developer's .env can't leak cache hits into unrelated tests.
2026-06-12 18:53:21 +02:00
CREDO23
052e9ef4d1 refactor(chunks): order chunk reads by (document_id, position)
Presentation and citation ordering moves off Chunk.id/created_at to the
explicit position column (id kept as tiebreaker). Vector and ts_rank
ranking order_by clauses are untouched.
2026-06-12 18:53:21 +02:00
CREDO23
5a71769dba fix(chunks): set position on remaining chunk insert paths
document_converters, the github size-fallback chunker, revert_service
restores, and the kb-persistence middleware now write explicit positions
(the middleware read path also orders by position).
2026-06-12 18:53:08 +02:00
CREDO23
7d55aaf2c1 feat(indexing): reconcile chunks incrementally on re-index
index() now loads existing rows and applies a content diff instead of
delete-all/reinsert-all: unchanged chunks keep their rows and embeddings
(zero HNSW/GIN churn), moved chunks get a position-only UPDATE, and only
new texts are embedded, batched with the summary embedding. First index
keeps the cache-aware build_chunk_embeddings path.
2026-06-12 18:53:08 +02:00
CREDO23
fd495e1b2f feat(observability): add chunk reconcile metric and kill-switch flag
surfsense.indexing.reconcile.chunks counts reused/embedded/deleted chunks per
re-index. CHUNK_RECONCILE_ENABLED (default on) falls back to delete-all +
full re-embed if the diff path ever misbehaves.
2026-06-12 18:52:57 +02:00
CREDO23
8d413ea5c2 refactor(indexing): expose chunk_markdown and embed_batch helpers
Split _compute so the incremental edit path can reuse the exact same chunker
selection and embedding entry points (and their test patch targets) without
going through the doc-level cache.
2026-06-12 18:52:57 +02:00
CREDO23
f82dedf712 feat(indexing): add pure chunk reconciler for content-addressed diffs
Greedy multiset match on chunk text decides which rows keep their embeddings,
which texts need embedding, and which rows are deleted. No DB, no embeddings;
fully unit-tested (reuse, head insert, middle edit, deletion, duplicates,
reorder, full rewrite).
2026-06-12 18:52:46 +02:00
CREDO23
c6e71c851c feat(chunks): add explicit position column with backfill migration
Chunk ids stop reflecting document order once incremental re-indexing keeps
unchanged rows across edits. Backfill preserves the historical id ordering
so behavior is identical on day one.
2026-06-12 18:52:45 +02:00
CREDO23
412493ae08 test(embedding-cache): add integration tests for service, repository, and store
Covers the public cache surface against real Postgres and a real local file
backend (no mocks): recall miss, remember->recall vector/text/order round-trip,
the dimension-mismatch refusal, the repository SQL behind eviction and dedup
(size sum, coldest ordering, TTL cutoff, duplicate-key no-op, reuse counter),
and the blob store save/load round-trip and delete.
2026-06-12 17:33:21 +02:00
CREDO23
91d947ff79 refactor(embedding-cache): rename index cache to embedding cache
The cached payload is the indexing pipeline's embeddings (markdown is
chunked then embedded), so "embedding cache" names the expensive output
directly and removes the "index" ambiguity (DB index vs vector index vs
indexing phase). Renames the service, settings, eligibility, eviction
task, metrics, config flags (INDEX_CACHE_* -> EMBEDDING_CACHE_*), object
prefix, and the table (index_cache_embedding_sets -> embedding_cache_sets)
with its constraint and indexes. Migration 161 renamed accordingly.
2026-06-12 17:00:01 +02:00
CREDO23
8cf578d965 test(index-cache): add unit tests and repoint embed/chunk patch targets 2026-06-12 16:48:18 +02:00
CREDO23
4e4f7f34fa feat(index-cache): add TTL/size eviction task and daily schedule 2026-06-12 16:48:18 +02:00
CREDO23
019aa7bf76 feat(index-cache): serve chunk embeddings from cache during indexing 2026-06-12 16:48:18 +02:00
CREDO23
e8938c119b feat(index-cache): add recall/remember service 2026-06-12 16:48:10 +02:00
CREDO23
4d6378e031 feat(observability): add index cache hit/miss and eviction metrics 2026-06-12 16:48:10 +02:00
CREDO23
daccd304ee feat(index-cache): add settings, eligibility, and config flags 2026-06-12 16:48:10 +02:00
CREDO23
ad6da7c6af feat(index-cache): add embedding blob store sharing the cache backend 2026-06-12 16:48:01 +02:00
CREDO23
f541114544 feat(index-cache): add cached embedding set table and repository 2026-06-12 16:48:01 +02:00
CREDO23
59fa4c38c3 feat(index-cache): add pickle-free blob serialization 2026-06-12 16:48:01 +02:00
CREDO23
cf208365b4 feat(index-cache): add embedding set value objects 2026-06-12 16:48:01 +02:00
CREDO23
0fb1d3d37b feat(etl-cache): route all file-based sources through the parse cache
Every file ingestion path (Dropbox, Google Drive / Composio Drive, OneDrive,
local folder, Obsidian, and the legacy upload handlers) now parses via the
extract_with_cache facade instead of calling EtlPipelineService.extract
directly, so identical bytes are deduplicated globally regardless of source.
vision_llm is passed through, keeping the existing cacheability gate intact.
2026-06-12 14:47:25 +02:00
CREDO23
99cf212c31 test: fix auth-mode mismatch and stale QuotaInsufficientError kwargs
Pin AUTH_TYPE=LOCAL (and REGISTRATION_ENABLED=TRUE) in the test bootstrap so
the email/password auth routers mount during integration tests regardless of a
developer's .env=GOOGLE; without this the upload tests 404 on registration.
Also update three tests to the current QuotaInsufficientError signature
(balance_micros) after used_micros/limit_micros were removed.
2026-06-12 12:19:49 +02:00
CREDO23
0808fbcdee feat(etl-cache): emit hit/miss and eviction metrics 2026-06-12 11:57:03 +02:00