Make Vestige a durable, local, semantically-searchable retrieval layer over an external system of record (GitHub Issues first), citing back to the canonical record. Unlike a live ticket-system MCP proxy, Vestige keeps a durable embedded index: searchable offline, joinable with the rest of memory, temporally versioned, and re-syncable idempotently with no duplication.

Phases 1-2 of #57 plus a GitHub reference connector and source-aware search:

- Source envelope on KnowledgeNode/IngestInput (source_system, source_id, source_url, source_updated_at, content_hash, synced_at, source_project, source_type, source_author). Migration V17: nullable columns (additive), partial UNIQUE index on (source_system, source_id), connector_cursors table.
- Idempotent sync primitives in vestige-core: upsert_by_source (content-hash change detection), connector cursor checkpoints, reconcile_source_tombstones (invalidate-don't-delete via bitemporal valid_until).
- Connector contract + run_sync driver + GitHub Issues connector behind the optional `connectors` feature (on by default in vestige-mcp, off in the core library default so non-connector consumers link no HTTP client).
- source_sync MCP tool ({"repo": "owner/name"}); token from GITHUB_TOKEN env only. Search results gain a sourceRecord citation for connector memories.

Adversarial review fixes: GitHub `since` Z-form (the `+00:00` offset corrupted the cursor server-side), un-tombstone clears superseded_by too, cursor never advances past a failing record, Link next-url host-pinned (token-leak guard), records_seen counts new records only.

Verified: cargo check/test/clippy -D warnings green across the workspace (default and connectors features); 483 core tests pass. Version bump to 2.1.27 and tag deferred to release.

Refs #57

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
External-Source Connectors
Status: v2.1.27 — GitHub Issues connector (reference). Redmine and others follow the same contract. Tracking issue: #57.
Connectors let Vestige act as a durable, local retrieval and reasoning layer over a long-lived external system — a ticket tracker, an issue board, a support queue — without replacing it. The external system stays the source of truth. Vestige indexes its records, embeds them for semantic recall, links them into the memory graph, and cites back to the canonical record.
Why this is different from a ticket-system MCP
The official GitHub / Jira MCP servers are live API proxies: every query hits the upstream API, so search is rate-limited, keyword-only, and online-only, with no memory of past state. Vestige instead keeps a durable local index of the records, so you can:
- search the history offline and semantically (embeddings, not just keywords),
- join ticket history with the rest of your memory in one search,
- see a point-in-time view (records carry temporal validity),
- and re-sync idempotently — re-running never duplicates a record.
Quick start (GitHub Issues)
- (Optional but recommended) Export a token so you get the authenticated rate limit (5,000 req/hr vs 60 for anonymous) and access to private repos:

      export GITHUB_TOKEN=ghp_xxx   # or VESTIGE_GITHUB_TOKEN

  The token is read only from the environment; it is never passed as a tool argument and never logged.

- Ask your agent to run the source_sync MCP tool:

      { "repo": "samvallad33/vestige" }

- Search as normal. Connector-sourced results carry a sourceRecord object with the canonical issue URL:

      {
        "content": "[samvallad33/vestige#57] Roadmap: external source connectors …",
        "sourceRecord": {
          "system": "github",
          "id": "57",
          "url": "https://github.com/samvallad33/vestige/issues/57",
          "project": "samvallad33/vestige",
          "type": "issue",
          "author": "samvallad33",
          "tombstoned": false
        }
      }
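A client that receives such a result might turn the sourceRecord into a one-line citation. The sketch below is illustrative only: the `SourceRecord` struct and `citation` function are hypothetical names, the fields mirror the JSON keys shown above (`type` becomes `kind` because `type` is a Rust keyword), and the output format is an arbitrary choice.

```rust
/// Hypothetical client-side mirror of the `sourceRecord` JSON object.
#[derive(Debug, Clone)]
pub struct SourceRecord {
    pub system: String,
    pub id: String,
    pub url: String,
    pub project: String,
    pub kind: String, // the JSON key is "type"
    pub author: String,
    pub tombstoned: bool,
}

/// Render a short citation that points back to the canonical record.
pub fn citation(rec: &SourceRecord) -> String {
    // Flag records that a reconcile pass has invalidated.
    let status = if rec.tombstoned {
        " (no longer visible upstream)"
    } else {
        ""
    };
    format!(
        "{}#{} by {} <{}>{}",
        rec.project, rec.id, rec.author, rec.url, status
    )
}
```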
The source_sync tool
| Field | Type | Default | Meaning |
|---|---|---|---|
| repo | string | none (required) | owner/name, e.g. samvallad33/vestige. |
| source | string | github | External system. Currently only github. |
| reconcile | bool | false | Also tombstone local memories for issues no longer visible upstream (an extra full-enumeration pass). |
| max_pages | int | 10 | API pages to fetch this run (≤100 issues each). Lets a first sync of a large repo resume across calls. |
The tool returns counts (created / updated / unchanged / tombstoned),
the saved cursor, whether it ran authenticated, and a hint for the next step.
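The argument table can be read as a small struct with defaults. The following is a minimal sketch, not the actual vestige-mcp type: `SourceSyncArgs` and `split_repo` are hypothetical names, and the validation rule (exactly one `/`, both halves non-empty) is an assumption about what "owner/name" requires.

```rust
/// Hypothetical argument shape for `source_sync`; field names follow the table above.
#[derive(Debug, Clone, PartialEq)]
pub struct SourceSyncArgs {
    pub repo: String,
    pub source: String,
    pub reconcile: bool,
    pub max_pages: u32,
}

impl SourceSyncArgs {
    /// Build arguments for `repo`, filling in the documented defaults.
    pub fn new(repo: &str) -> Result<Self, String> {
        let (owner, name) = split_repo(repo)?;
        Ok(Self {
            repo: format!("{}/{}", owner, name),
            source: "github".to_string(),
            reconcile: false,
            max_pages: 10,
        })
    }
}

/// Split `owner/name` into its two parts, rejecting any other shape.
pub fn split_repo(repo: &str) -> Result<(&str, &str), String> {
    match repo.split_once('/') {
        Some((owner, name))
            if !owner.is_empty() && !name.is_empty() && !name.contains('/') =>
        {
            Ok((owner, name))
        }
        _ => Err(format!("expected owner/name, got {:?}", repo)),
    }
}
```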
Idempotent, incremental sync
Each run:
- resumes from the saved cursor (the high-water mark on the record's upstream update time), minus a small overlap window so same-second / clock-skewed updates are never missed;
- pages issues in ascending update order (state=all, so closing an issue is not mistaken for a deletion), folding each issue + its comments into one memory;
- routes each record through an idempotent upsert keyed on (source_system, source_id):
  - unseen record → insert,
  - changed content (by content hash) → update in place + re-embed,
  - unchanged content → no-op (only the "last seen" time advances);
- advances and persists the cursor only after the run, so an interruption re-scans rather than skips.
Re-running source_sync on the same repo is therefore safe and cheap — it picks
up only what changed.
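The three upsert outcomes and the cursor overlap can be sketched with an in-memory map. This is a stand-in under stated assumptions, not the vestige-core implementation: `MemoryStore`, `upsert`, and `resume_from` are hypothetical names, timestamps are plain seconds, and `DefaultHasher` replaces whatever content hash the real code uses.

```rust
use std::collections::hash_map::{DefaultHasher, Entry};
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum UpsertOutcome {
    Created,
    Updated,
    Unchanged,
}

#[derive(Debug, Clone)]
pub struct StoredRecord {
    pub content: String,
    pub content_hash: u64,
    pub synced_at: u64,
}

/// In-memory store keyed on (source_system, source_id).
#[derive(Default)]
pub struct MemoryStore {
    records: HashMap<(String, String), StoredRecord>,
}

fn hash_content(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

impl MemoryStore {
    /// Insert, update in place, or no-op, depending on the content hash.
    pub fn upsert(&mut self, system: &str, id: &str, content: &str, now: u64) -> UpsertOutcome {
        let hash = hash_content(content);
        match self.records.entry((system.to_string(), id.to_string())) {
            Entry::Vacant(slot) => {
                slot.insert(StoredRecord {
                    content: content.to_string(),
                    content_hash: hash,
                    synced_at: now,
                });
                UpsertOutcome::Created
            }
            Entry::Occupied(mut slot) => {
                let existing = slot.get_mut();
                existing.synced_at = now; // "last seen" always advances
                if existing.content_hash == hash {
                    UpsertOutcome::Unchanged
                } else {
                    existing.content = content.to_string();
                    existing.content_hash = hash;
                    UpsertOutcome::Updated
                }
            }
        }
    }

    pub fn record_count(&self) -> usize {
        self.records.len()
    }
}

/// Where the next run starts: the saved cursor minus the overlap window.
pub fn resume_from(cursor: Option<u64>, overlap_secs: u64) -> u64 {
    cursor.map_or(0, |c| c.saturating_sub(overlap_secs))
}
```

Because the overlap window re-fetches a few already-seen records, the hash comparison is what keeps that re-scan from creating duplicates or spurious updates.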
Deletions (tombstoning)
Neither GitHub nor Redmine exposes a deletion feed, so an incremental sync can
never see a delete. Pass reconcile: true to run a reconciliation pass: Vestige
enumerates the currently-visible issue ids and invalidates (does not purge)
any local record no longer present. A tombstoned record keeps its content for
audit but drops out of "currently valid" retrieval (sourceRecord.tombstoned is
true). If the record reappears upstream, the next sync un-tombstones it.
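A reconciliation pass of this kind can be sketched as a set difference that stamps an end-of-validity time instead of deleting. The code below is an illustration under assumptions: `LocalRecord`, `reconcile_tombstones`, and `untombstone` are hypothetical names, and a single `valid_until` field stands in for the real temporal model.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, PartialEq)]
pub struct LocalRecord {
    pub content: String,
    /// `None` means currently valid; `Some(t)` means tombstoned at time `t`.
    pub valid_until: Option<u64>,
}

/// Invalidate every still-valid record whose id is absent from `live_ids`.
/// Content is kept for audit. Returns how many records were newly tombstoned.
pub fn reconcile_tombstones(
    records: &mut HashMap<String, LocalRecord>,
    live_ids: &HashSet<String>,
    now: u64,
) -> usize {
    let mut tombstoned = 0;
    for (id, rec) in records.iter_mut() {
        if !live_ids.contains(id) && rec.valid_until.is_none() {
            rec.valid_until = Some(now);
            tombstoned += 1;
        }
    }
    tombstoned
}

/// Clear the tombstone when a record is seen upstream again.
/// Returns true if the record had been tombstoned.
pub fn untombstone(rec: &mut LocalRecord) -> bool {
    rec.valid_until.take().is_some()
}
```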
The source envelope
Every connector-ingested memory carries structured provenance, distinct from the
legacy free-form source label:
| Field | Purpose |
|---|---|
| source_system | github, redmine, … (namespaces ids). |
| source_id | Native id (issue number, ticket id). |
| source_url | Canonical link back, i.e. the citation. |
| source_updated_at | Upstream update time (the sync cursor field). |
| content_hash | Change detector → idempotency. |
| synced_at | When the connector last saw the record live. |
| source_project | Repo / project / space. |
| source_type | issue, comment, … |
| source_author | Reporter / author upstream. |
(source_system, source_id) is enforced unique, so there is exactly one memory
per external record. Legacy memories (agent- or user-authored) have no envelope
and are completely unaffected.
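As a rough picture, the envelope maps onto a plain struct. The field names below come from the table; everything else is an assumption made for illustration: the field types (strings, with the last three optional), the `unique_key` helper, and the hypothetical `github_issue_envelope` constructor. The real type in vestige-core may differ.

```rust
/// Sketch of the provenance envelope; field types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SourceEnvelope {
    pub source_system: String,
    pub source_id: String,
    pub source_url: String,
    pub source_updated_at: String, // assumed to be a timestamp string
    pub content_hash: String,
    pub synced_at: String,
    pub source_project: Option<String>,
    pub source_type: Option<String>,
    pub source_author: Option<String>,
}

impl SourceEnvelope {
    /// The pair that is enforced unique: one memory per external record.
    pub fn unique_key(&self) -> (&str, &str) {
        (&self.source_system, &self.source_id)
    }
}

/// Hypothetical helper that fills the envelope for a GitHub issue.
pub fn github_issue_envelope(
    repo: &str,
    number: u64,
    author: &str,
    updated_at: &str,
    synced_at: &str,
    content_hash: &str,
) -> SourceEnvelope {
    SourceEnvelope {
        source_system: "github".to_string(),
        source_id: number.to_string(),
        source_url: format!("https://github.com/{}/issues/{}", repo, number),
        source_updated_at: updated_at.to_string(),
        content_hash: content_hash.to_string(),
        synced_at: synced_at.to_string(),
        source_project: Some(repo.to_string()),
        source_type: Some("issue".to_string()),
        source_author: Some(author.to_string()),
    }
}
```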
Building
The connector HTTP client is behind the connectors cargo feature, which is
on by default in the MCP server (vestige-mcp). A build without it still
exposes the source_sync tool but returns a clear "rebuild with --features connectors" message. The core library (vestige-core) leaves the feature
off by default, so library consumers that don't need connectors link no HTTP
client.
    # default MCP build already includes connectors
    cargo build -p vestige-mcp --release

    # explicit, or for the core lib
    cargo build -p vestige-core --features connectors
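The "tool is always exposed, but answers with a rebuild hint" behaviour can be sketched with a compile-time feature check. This is a simplified illustration, not the project's code: `source_sync_entry` and `REBUILD_HINT` are hypothetical names, and the enabled branch only returns a placeholder string.

```rust
/// Hint returned when the binary was built without the `connectors` feature.
pub const REBUILD_HINT: &str = "rebuild with --features connectors";

/// Hypothetical tool entry point: always present, but only functional
/// when the `connectors` cargo feature was enabled at build time.
#[allow(unexpected_cfgs)]
pub fn source_sync_entry(repo: &str) -> Result<String, String> {
    if cfg!(feature = "connectors") {
        // Placeholder: a real build would call into the connector here.
        Ok(format!("would sync {}", repo))
    } else {
        Err(format!(
            "source_sync unavailable for {}: {}",
            repo, REBUILD_HINT
        ))
    }
}
```

Note that `cfg!` still compiles both branches. Code that references an HTTP client would need `#[cfg(feature = "connectors")]` on the items themselves, so that a build without the feature links no client.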
Writing a new connector
Implement the Connector trait in vestige_core::connectors (fetch a window of
records updated since a cursor, page forward, and optionally enumerate live ids
for reconciliation), produce NormalizedRecords with a filled
SourceEnvelope, and hand them to run_sync. The GitHub connector
(crates/vestige-core/src/connectors/github.rs) is the reference
implementation. The sync driver, idempotent upsert, cursor checkpointing, and
tombstone reconciliation are all reused for free.
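To make the division of labour concrete, here is a toy version of the contract. It is a sketch under loud assumptions: the real Connector trait, NormalizedRecord, and run_sync signatures are not shown in this document, so `ConnectorSketch`, `RecordSketch`, `InMemoryTracker`, and `run_sync_sketch` are hypothetical stand-ins with invented signatures. Consult crates/vestige-core/src/connectors/github.rs for the actual shapes.

```rust
/// Stand-in for a normalized record (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct RecordSketch {
    pub source_id: String,
    pub updated_at: u64,
    pub content: String,
}

/// Hypothetical connector contract: fetch pages since a cursor, and
/// optionally enumerate live ids for reconciliation.
pub trait ConnectorSketch {
    fn source_system(&self) -> &str;
    /// One page of records updated at or after `since`, ascending by update time.
    fn fetch_page(&self, since: u64, page: usize) -> Vec<RecordSketch>;
    /// Ids currently visible upstream; `None` if enumeration is unsupported.
    fn live_ids(&self) -> Option<Vec<String>> {
        None
    }
}

/// Toy connector backed by a vector instead of an HTTP API.
pub struct InMemoryTracker {
    pub records: Vec<RecordSketch>,
    pub page_size: usize,
}

impl ConnectorSketch for InMemoryTracker {
    fn source_system(&self) -> &str {
        "demo"
    }

    fn fetch_page(&self, since: u64, page: usize) -> Vec<RecordSketch> {
        let mut hits: Vec<RecordSketch> = self
            .records
            .iter()
            .filter(|r| r.updated_at >= since)
            .cloned()
            .collect();
        hits.sort_by_key(|r| r.updated_at);
        hits.into_iter()
            .skip(page * self.page_size)
            .take(self.page_size)
            .collect()
    }

    fn live_ids(&self) -> Option<Vec<String>> {
        Some(self.records.iter().map(|r| r.source_id.clone()).collect())
    }
}

/// Page forward until an empty page or `max_pages`; return the records seen
/// and the new high-water mark to persist as the cursor.
pub fn run_sync_sketch<C: ConnectorSketch>(
    connector: &C,
    cursor: u64,
    max_pages: usize,
) -> (Vec<RecordSketch>, u64) {
    let mut seen = Vec::new();
    let mut high_water = cursor;
    for page in 0..max_pages {
        let batch = connector.fetch_page(cursor, page);
        if batch.is_empty() {
            break;
        }
        for rec in batch {
            high_water = high_water.max(rec.updated_at);
            seen.push(rec);
        }
    }
    (seen, high_water)
}
```

The point of the split is that a new connector only answers "what changed since the cursor?" and "what is live?", while the driver owns paging limits and cursor advancement.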