SurfSense

mirror of https://github.com/MODSetter/SurfSense.git synced 2026-06-18 21:15:16 +02:00

Author	SHA1	Message	Date
CREDO23	220d9c4fbb	add INDEXING_CHUNK_INSERT_BATCH_SIZE config	2026-06-17 14:59:19 +02:00
DESKTOP-RTLN3BA\$punk	0fe650fd8e	Merge commit '7ce409c580' into dev	2026-06-16 22:48:14 -07:00
okxint	a12cd21f2f	fix(image-gen): resolve relative URLs returned by Xinference and compatible backends Some OpenAI-compatible image backends (e.g. Xinference) return a relative URL like /files/image.png in data[0].url instead of an absolute one. Browsers cannot resolve these, causing images to fail to load. Track the provider's api_base after resolving model config via to_litellm(). When the returned URL starts with "/", extract the origin (scheme + host + port) from api_base and prepend it to produce a full absolute URL. No behaviour change for providers that return absolute URLs (OpenAI, Azure, etc). Closes #1496	2026-06-17 10:57:39 +05:30
DESKTOP-RTLN3BA\$punk	b9702b3245	chore: linting	2026-06-16 16:27:16 -07:00
DESKTOP-RTLN3BA\$punk	da64433439	fix(db): reap orphaned idle-in-transaction sessions on the Celery engine The long-running ingestion/podcast/video tasks run on a separate Celery engine (NullPool), so the web engine's idle_in_transaction_session_timeout did not cover them, which is exactly where the original 11h zombie (INSERT INTO chunks) came from. Apply the same protection to the Celery engine with a generous 60-minute default so a worker that hangs/crashes mid-transaction can't hold locks on documents/chunks indefinitely, while never reaping a legitimate per-document embed window. - config + .env.example: DB_CELERY_IDLE_IN_TX_TIMEOUT_MS (default 3600000). Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-16 16:26:04 -07:00
DESKTOP-RTLN3BA\$punk	89cc3b37ee	fix(db): prevent boot-time index DDL from hanging FastAPI startup A single abandoned "idle in transaction" session held locks on the documents table, which blocked the non-concurrent CREATE INDEX (hnsw) run inside the FastAPI lifespan. Each API restart queued another CREATE INDEX behind an advisory lock, leaving the server stuck at "Waiting for application startup." indefinitely and freezing ingestion writes. Changes: - setup_indexes(): build every index with CREATE INDEX CONCURRENTLY (non-blocking ShareUpdateExclusiveLock) under a per-session lock_timeout, and make each statement non-fatal so a contended/slow build is retried next boot instead of wedging startup. Drop leftover invalid indexes before rebuilding. - create_db_and_tables(): apply lock_timeout to extension/create_all DDL and gate the whole bootstrap behind DB_BOOTSTRAP_ON_STARTUP. - engine: set idle_in_transaction_session_timeout (asyncpg) so an abandoned transaction is reaped automatically. - config + .env.example: DB_BOOTSTRAP_ON_STARTUP, DB_DDL_LOCK_TIMEOUT_MS, DB_IDLE_IN_TX_TIMEOUT_MS. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-16 16:18:49 -07:00
Dmitry Maranik	e1ea82d7cf	fix(connectors): scope index endpoint authorization to the connector's own search space The POST /search-source-connectors/{connector_id}/index endpoint loaded the connector by id and then called check_permission() against the client-supplied search_space_id query parameter (the caller's own space) rather than the connector's own search_space_id, and never verified that the two matched. A user could therefore index another user's connector by passing their own search_space_id: the indexer ran with the victim connector's stored credentials and wrote the fetched content into the attacker's search space. The read/update/delete handlers already authorize against connector.search_space_id; this brings the index handler in line. Reject a connector that does not belong to the requested search space (404, to avoid disclosing connectors in other spaces) and authorize the permission check against connector.search_space_id.	2026-06-16 15:58:30 -07:00
CREDO23	7584312712	style(podcasts): fix ruff issues in podcast spec schema Remove duplicate typing import and format legacy minute coercion guard.	2026-06-16 23:57:36 +02:00
CREDO23	16d226e5ce	refactor(podcasts): plan transcript length from midpoint seconds	2026-06-16 23:38:28 +02:00
CREDO23	116c38feac	refactor(podcasts): build DurationTarget from brief seconds config	2026-06-16 23:38:28 +02:00
CREDO23	af08e2f033	refactor(podcasts): propose brief with min_seconds and max_seconds	2026-06-16 23:38:28 +02:00
CREDO23	d0ed5b94d9	refactor(podcasts): use shared second-based brief duration defaults	2026-06-16 23:38:28 +02:00
CREDO23	845653cbac	feat(podcasts): pass min_seconds and max_seconds when proposing brief	2026-06-16 23:38:27 +02:00
CREDO23	085442ed9a	feat(podcasts): use seconds defaults on create podcast request	2026-06-16 23:38:27 +02:00
CREDO23	32e0d21604	feat(podcasts): store brief duration in seconds with legacy load	2026-06-16 23:38:27 +02:00
CREDO23	9583e8f250	feat(podcasts): add shared duration limit constants	2026-06-16 23:38:27 +02:00
Anish Sarkar	9b7e278114	refactor(config): update GATEWAY_ENABLED variable to FALSE and adjust related configurations for improved messaging gateway handling	2026-06-16 23:49:26 +05:30
CREDO23	1d70af4684	fix(podcasts): guard public stream against missing audio	2026-06-16 20:09:08 +02:00
CREDO23	0c2808640a	fix(podcasts): guard stream against missing audio	2026-06-16 20:09:08 +02:00
CREDO23	d2558e546e	feat(podcasts): add audio_exists storage helper	2026-06-16 20:09:08 +02:00
Anish Sarkar	2a840fcc10	refactor(backend): derive frontend and backend urls from SURFSENSE_PUBLIC_URL	2026-06-16 02:10:50 +05:30
Rohan Verma	69bdcf5946	Merge pull request #1491 from AnishSarkar22/feat/unified-model-connections feat: Fix model attribution for prefix-stripped token usage callbacks	2026-06-14 17:50:48 -07:00
CREDO23	32a6e54ce6	Merge remote-tracking branch 'upstream/dev' into features/documents-injestion-layered-cached	2026-06-14 11:30:33 +02:00
Anish Sarkar	d9a4f14f99	feat(token-tracking): enhance model metadata reconciliation by adding bare model name handling	2026-06-14 12:18:22 +05:30
Anish Sarkar	7926814070	refactor(model-connections): remove unused fields and update verification logic	2026-06-14 02:46:19 +05:30
Anish Sarkar	c7409c8995	chore: ran linting	2026-06-13 21:59:35 +05:30
Anish Sarkar	ceace003aa	feat(local-models): add documentation for connecting local model providers	2026-06-13 21:52:45 +05:30
Anish Sarkar	ab5423d2d2	Merge remote-tracking branch 'upstream/dev' into feat/unified-model-connections	2026-06-13 19:04:49 +05:30
Anish Sarkar	76843f42f1	refactor(anonymous-models): remove description field from anonymous model responses and update related UI components	2026-06-13 16:30:26 +05:30
Anish Sarkar	576c56628a	chore(config): update global LLM configuration example with improved setup instructions, parameter naming, and enhanced comments for clarity	2026-06-13 14:57:14 +05:30
Anish Sarkar	4a6a282a46	feat(runtime-cooldown): implement Redis-based shared cooldown management for model selection	2026-06-13 13:53:01 +05:30
Anish Sarkar	bd4a04f2e7	feat(database-migrations): add migration to remove legacy model config tables and remove stale model connection code	2026-06-13 12:45:43 +05:30
Anish Sarkar	8fe9c21e76	feat(token-tracking): add model metadata registration and enhance token usage tracking	2026-06-13 03:08:35 +05:30
Anish Sarkar	5e86885a03	feat(model-connections): integrate model provider connections panel and connection card components	2026-06-13 02:40:22 +05:30
Anish Sarkar	15d9983669	feat(model-connections): enhance model selection facts and auto pinning logic	2026-06-13 02:19:27 +05:30
Anish Sarkar	45d27ba879	feat(model-connections): enhance auto mode with auto pinning	2026-06-13 01:39:26 +05:30
Anish Sarkar	9f6210ad08	feat(model-connections): add test preview functionality for model connections	2026-06-13 00:12:04 +05:30
CREDO23	dcebfc4756	Merge remote-tracking branch 'upstream/dev' into features/documents-injestion-layered-cached	2026-06-12 19:35:34 +02:00
Anish Sarkar	55f004e1da	feat(model-connections): improve model discovery error handling and enhance UI components	2026-06-12 22:50:50 +05:30
Anish Sarkar	407f2a9612	feat(model-connections): enhance model connection functionality with preview and selection features	2026-06-12 22:41:21 +05:30
CREDO23	052e9ef4d1	refactor(chunks): order chunk reads by (document_id, position) Presentation and citation ordering moves off Chunk.id/created_at to the explicit position column (id kept as tiebreaker). Vector and ts_rank ranking order_by clauses are untouched.	2026-06-12 18:53:21 +02:00
CREDO23	5a71769dba	fix(chunks): set position on remaining chunk insert paths document_converters, the github size-fallback chunker, revert_service restores, and the kb-persistence middleware now write explicit positions (the middleware read path also orders by position).	2026-06-12 18:53:08 +02:00
CREDO23	7d55aaf2c1	feat(indexing): reconcile chunks incrementally on re-index index() now loads existing rows and applies a content diff instead of delete-all/reinsert-all: unchanged chunks keep their rows and embeddings (zero HNSW/GIN churn), moved chunks get a position-only UPDATE, and only new texts are embedded, batched with the summary embedding. First index keeps the cache-aware build_chunk_embeddings path.	2026-06-12 18:53:08 +02:00
CREDO23	fd495e1b2f	feat(observability): add chunk reconcile metric and kill-switch flag surfsense.indexing.reconcile.chunks counts reused/embedded/deleted chunks per re-index. CHUNK_RECONCILE_ENABLED (default on) falls back to delete-all + full re-embed if the diff path ever misbehaves.	2026-06-12 18:52:57 +02:00
CREDO23	8d413ea5c2	refactor(indexing): expose chunk_markdown and embed_batch helpers Split _compute so the incremental edit path can reuse the exact same chunker selection and embedding entry points (and their test patch targets) without going through the doc-level cache.	2026-06-12 18:52:57 +02:00
CREDO23	f82dedf712	feat(indexing): add pure chunk reconciler for content-addressed diffs Greedy multiset match on chunk text decides which rows keep their embeddings, which texts need embedding, and which rows are deleted. No DB, no embeddings; fully unit-tested (reuse, head insert, middle edit, deletion, duplicates, reorder, full rewrite).	2026-06-12 18:52:46 +02:00
CREDO23	c6e71c851c	feat(chunks): add explicit position column with backfill migration Chunk ids stop reflecting document order once incremental re-indexing keeps unchanged rows across edits. Backfill preserves the historical id ordering so behavior is identical on day one.	2026-06-12 18:52:45 +02:00
CREDO23	91d947ff79	refactor(embedding-cache): rename index cache to embedding cache The cached payload is the indexing pipeline's embeddings (markdown is chunked then embedded), so "embedding cache" names the expensive output directly and removes the "index" ambiguity (DB index vs vector index vs indexing phase). Renames the service, settings, eligibility, eviction task, metrics, config flags (INDEX_CACHE_* -> EMBEDDING_CACHE_*), object prefix, and the table (index_cache_embedding_sets -> embedding_cache_sets) with its constraint and indexes. Migration 161 renamed accordingly.	2026-06-12 17:00:01 +02:00
CREDO23	4e4f7f34fa	feat(index-cache): add TTL/size eviction task and daily schedule	2026-06-12 16:48:18 +02:00
CREDO23	019aa7bf76	feat(index-cache): serve chunk embeddings from cache during indexing	2026-06-12 16:48:18 +02:00
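
The relative-URL fix described in commit a12cd21f2f can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function name `absolutize_image_url` and its signature are hypothetical, and the guard for protocol-relative URLs (`//host/...`) is an extra assumption the commit message does not mention.

```python
from typing import Optional
from urllib.parse import urlsplit


def absolutize_image_url(url: str, api_base: Optional[str]) -> str:
    """Hypothetical helper: turn a relative image URL into an absolute one.

    If `url` starts with "/", prepend the origin (scheme + host + port)
    taken from `api_base`. Anything else is returned unchanged.
    """
    # Absolute URLs (OpenAI, Azure, etc.) pass through untouched.
    if not url.startswith("/"):
        return url
    # Assumption: protocol-relative URLs are already browser-resolvable.
    if url.startswith("//"):
        return url
    if not api_base:
        return url

    parts = urlsplit(api_base)
    # Without a scheme and host there is no origin to prepend.
    if not parts.scheme or not parts.netloc:
        return url

    # netloc already includes the port when one is present.
    return f"{parts.scheme}://{parts.netloc}{url}"
```

Note that only the origin of `api_base` is used; any path component such as `/v1` is dropped, which matches the "scheme + host + port" wording in the commit message.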


2322 commits
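
The "greedy multiset match on chunk text" from commit f82dedf712 might look something like the sketch below. It is written from the commit message alone: the names `ReconcilePlan` and `reconcile`, the tuple shapes, and the choice to match duplicates in old-position order are all assumptions, not the repository's actual reconciler.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class ReconcilePlan:
    """Hypothetical result type; field names are illustrative only."""
    keep: List[int] = field(default_factory=list)                # row ids left untouched
    move: List[Tuple[int, int]] = field(default_factory=list)    # (row_id, new_position)
    embed: List[Tuple[int, str]] = field(default_factory=list)   # (new_position, text)
    delete: List[int] = field(default_factory=list)              # row ids to remove


def reconcile(existing: List[Tuple[int, int, str]], new_texts: List[str]) -> ReconcilePlan:
    """Diff existing (row_id, position, text) rows against the new chunk texts.

    Pure function: no database access and no embedding calls.
    """
    # Multiset of existing rows keyed by text; duplicates queue up in position order.
    pool: Dict[str, Deque[Tuple[int, int]]] = defaultdict(deque)
    for row_id, position, text in sorted(existing, key=lambda r: r[1]):
        pool[text].append((row_id, position))

    plan = ReconcilePlan()
    for new_pos, text in enumerate(new_texts):
        candidates = pool.get(text)
        if candidates:
            # Greedy: reuse the earliest unmatched row with identical text.
            row_id, old_pos = candidates.popleft()
            if old_pos == new_pos:
                plan.keep.append(row_id)
            else:
                plan.move.append((row_id, new_pos))
        else:
            # No reusable row: this text needs a fresh embedding.
            plan.embed.append((new_pos, text))

    # Rows never matched are stale and can be deleted.
    plan.delete = sorted(row_id for rows in pool.values() for row_id, _ in rows)
    return plan
```

For example, with existing rows `[(1, 0, "a"), (2, 1, "b"), (3, 2, "c")]` and new texts `["x", "a", "c"]`, only `"x"` is embedded, row 1 gets a position-only move, row 3 is kept as-is, and row 2 is deleted.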