Some OpenAI-compatible image backends (e.g. Xinference) return a relative
URL like /files/image.png in data[0].url instead of an absolute one.
Browsers cannot resolve these, causing images to fail to load.
Track the provider's api_base after resolving model config via to_litellm().
When the returned URL starts with "/", extract the origin (scheme + host + port)
from api_base and prepend it to produce a full absolute URL.
No behaviour change for providers that return absolute URLs (OpenAI, Azure, etc).
Closes#1496
The POST /search-source-connectors/{connector_id}/index endpoint loaded
the connector by id and then called check_permission() against the
client-supplied search_space_id query parameter (the caller's own space)
rather than the connector's own search_space_id, and never verified that
the two matched.
A user could therefore index another user's connector by passing their
own search_space_id: the indexer ran with the victim connector's stored
credentials and wrote the fetched content into the attacker's search
space. The read/update/delete handlers already authorize against
connector.search_space_id; this brings the index handler in line.
Reject a connector that does not belong to the requested search space
(404, to avoid disclosing connectors in other spaces) and authorize the
permission check against connector.search_space_id.
Presentation and citation ordering moves off Chunk.id/created_at to the
explicit position column (id kept as tiebreaker). Vector and ts_rank
ranking order_by clauses are untouched.
document_converters, the github size-fallback chunker, revert_service
restores, and the kb-persistence middleware now write explicit positions
(the middleware read path also orders by position).
index() now loads existing rows and applies a content diff instead of
delete-all/reinsert-all: unchanged chunks keep their rows and embeddings
(zero HNSW/GIN churn), moved chunks get a position-only UPDATE, and only
new texts are embedded, batched with the summary embedding. First index
keeps the cache-aware build_chunk_embeddings path.
surfsense.indexing.reconcile.chunks counts reused/embedded/deleted chunks per
re-index. CHUNK_RECONCILE_ENABLED (default on) falls back to delete-all +
full re-embed if the diff path ever misbehaves.
Split _compute so the incremental edit path can reuse the exact same chunker
selection and embedding entry points (and their test patch targets) without
going through the doc-level cache.
Greedy multiset match on chunk text decides which rows keep their embeddings,
which texts need embedding, and which rows are deleted. No DB, no embeddings;
fully unit-tested (reuse, head insert, middle edit, deletion, duplicates,
reorder, full rewrite).
Chunk ids stop reflecting document order once incremental re-indexing keeps
unchanged rows across edits. Backfill preserves the historical id ordering
so behavior is identical on day one.
The cached payload is the indexing pipeline's embeddings (markdown is
chunked then embedded), so "embedding cache" names the expensive output
directly and removes the "index" ambiguity (DB index vs vector index vs
indexing phase). Renames the service, settings, eligibility, eviction
task, metrics, config flags (INDEX_CACHE_* -> EMBEDDING_CACHE_*), object
prefix, and the table (index_cache_embedding_sets -> embedding_cache_sets)
with its constraint and indexes. Migration 161 renamed accordingly.
Every file ingestion path (Dropbox, Google Drive / Composio Drive, OneDrive,
local folder, Obsidian, and the legacy upload handlers) now parses via the
extract_with_cache facade instead of calling EtlPipelineService.extract
directly, so identical bytes are deduplicated globally regardless of source.
vision_llm is passed through, keeping the existing cacheability gate intact.