vestige/docs/CONNECTORS.md
Sam Valladares 50e7f2d0fb feat(connectors): external-source connector layer + GitHub Issues (#57)
Make Vestige a durable, local, semantically-searchable retrieval layer over an
external system of record (GitHub Issues first), citing back to the canonical
record. Unlike a live ticket-system MCP proxy, Vestige keeps a durable embedded
index: searchable offline, joinable with the rest of memory, temporally
versioned, and re-syncable idempotently with no duplication.

Phases 1-2 of #57 plus a GitHub reference connector and source-aware search:

- Source envelope on KnowledgeNode/IngestInput (source_system, source_id,
  source_url, source_updated_at, content_hash, synced_at, source_project,
  source_type, source_author). Migration V17: nullable columns (additive),
  partial UNIQUE index on (source_system, source_id), connector_cursors table.
- Idempotent sync primitives in vestige-core: upsert_by_source (content-hash
  change detection), connector cursor checkpoints, reconcile_source_tombstones
  (invalidate-don't-delete via bitemporal valid_until).
- Connector contract + run_sync driver + GitHub Issues connector behind the
  optional `connectors` feature (on by default in vestige-mcp, off in the core
  library default so non-connector consumers link no HTTP client).
- source_sync MCP tool ({"repo": "owner/name"}); token from GITHUB_TOKEN env
  only. Search results gain a sourceRecord citation for connector memories.

Adversarial review fixes: GitHub `since` Z-form (the `+00:00` offset corrupted
the cursor server-side), un-tombstone clears superseded_by too, cursor never
advances past a failing record, Link next-url host-pinned (token-leak guard),
records_seen counts new records only.

Verified: cargo check/test/clippy -D warnings green across the workspace
(default and connectors features); 483 core tests pass. Version bump to 2.1.27
and tag deferred to release.

Refs #57

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 01:21:59 -05:00


External-Source Connectors

Status: v2.1.27. The GitHub Issues connector is the reference implementation; Redmine and other systems are meant to follow the same contract. Tracking issue: #57.

Connectors let Vestige act as a durable, local retrieval and reasoning layer over a long-lived external system — a ticket tracker, an issue board, a support queue — without replacing it. The external system stays the source of truth. Vestige indexes its records, embeds them for semantic recall, links them into the memory graph, and cites back to the canonical record.

Why this is different from a ticket-system MCP

The official GitHub / Jira MCP servers are live API proxies: every query hits the upstream API, so it is rate-limited, keyword-only, and online-only, and the server keeps no memory of past state. Vestige instead keeps a durable local index of the records, so you can:

  • search the history offline and semantically (embeddings, not just keywords),
  • join ticket history with the rest of your memory in one search,
  • see a point-in-time view (records carry temporal validity),
  • and re-sync idempotently — re-running never duplicates a record.

Quick start (GitHub Issues)

  1. (Optional but recommended) export a token so you get the authenticated rate limit (5,000 req/hr vs 60 for anonymous) and access to private repos:

    export GITHUB_TOKEN=ghp_xxx   # or VESTIGE_GITHUB_TOKEN
    

    The token is read only from the environment — never passed as a tool argument, never logged.

  2. Ask your agent to run the source_sync MCP tool:

    { "repo": "samvallad33/vestige" }
    
  3. Search as normal. Connector-sourced results carry a sourceRecord object with the canonical issue URL:

    {
      "content": "[samvallad33/vestige#57] Roadmap: external source connectors …",
      "sourceRecord": {
        "system": "github",
        "id": "57",
        "url": "https://github.com/samvallad33/vestige/issues/57",
        "project": "samvallad33/vestige",
        "type": "issue",
        "author": "samvallad33",
        "tombstoned": false
      }
    }
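
The environment-only token rule from step 1 can be sketched as follows. This is a minimal illustration, not the connector's actual code: `resolve_token`, `token_from_env`, and `describe_auth` are hypothetical names, and the order in which the two variables are checked is an assumption.

```rust
/// Resolve the GitHub token from environment-style lookups only.
/// `lookup` stands in for `std::env::var(name).ok()` so the logic is testable.
/// Assumption: GITHUB_TOKEN is checked before VESTIGE_GITHUB_TOKEN.
fn resolve_token(lookup: impl Fn(&str) -> Option<String>) -> Option<String> {
    for name in ["GITHUB_TOKEN", "VESTIGE_GITHUB_TOKEN"] {
        if let Some(value) = lookup(name) {
            let value = value.trim().to_string();
            // An empty variable is treated the same as an unset one.
            if !value.is_empty() {
                return Some(value);
            }
        }
    }
    None
}

/// Entry point that reads the real process environment.
fn token_from_env() -> Option<String> {
    resolve_token(|name| std::env::var(name).ok())
}

/// Log-safe description: reports only whether a token is present,
/// never the token itself.
fn describe_auth(token: &Option<String>) -> &'static str {
    if token.is_some() {
        "authenticated"
    } else {
        "anonymous"
    }
}
```

Taking the lookup as a closure keeps the function free of global state, and there is deliberately no code path that accepts the token as a tool argument.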
    

The source_sync tool

| Field | Type | Default | Meaning |
|---|---|---|---|
| repo | string | (none) | (required) owner/name, e.g. samvallad33/vestige. |
| source | string | github | External system. Currently only github. |
| reconcile | bool | false | Also tombstone local memories for issues no longer visible upstream (an extra full-enumeration pass). |
| max_pages | int | 10 | API pages to fetch this run (≤100 issues each). Lets a first sync of a large repo resume across calls. |

The tool returns counts (created / updated / unchanged / tombstoned), the saved cursor, whether it ran authenticated, and a hint for the next step.
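
As a rough sketch, the arguments and their documented defaults could be modelled like this. The struct name `SourceSyncArgs`, its methods, and the strictness of the owner/name check are assumptions made for illustration; only the field names and defaults come from the table above.

```rust
/// Arguments of the `source_sync` tool, mirroring the table above.
/// Hypothetical type: the real tool parses its arguments from JSON.
#[derive(Debug, Clone, PartialEq)]
struct SourceSyncArgs {
    repo: String,
    source: String,
    reconcile: bool,
    max_pages: u32,
}

impl SourceSyncArgs {
    /// Build arguments for `repo`, filling in the documented defaults.
    /// The validation rule (exactly one '/', both halves non-empty)
    /// is an assumption about what "owner/name" permits.
    fn new(repo: &str) -> Result<Self, String> {
        let invalid = || format!("repo must be owner/name, got {:?}", repo);
        let (owner, name) = repo.split_once('/').ok_or_else(invalid)?;
        if owner.is_empty() || name.is_empty() || name.contains('/') {
            return Err(invalid());
        }
        Ok(Self {
            repo: repo.to_string(),
            source: "github".to_string(),
            reconcile: false,
            max_pages: 10,
        })
    }

    /// Upper bound on issues fetched in one run (at most 100 per page).
    fn max_issues_per_run(&self) -> u32 {
        self.max_pages * 100
    }
}
```

With the defaults, one call covers at most 1,000 issues, which is why a first sync of a large repo may take several calls.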

Idempotent, incremental sync

Each run:

  1. resumes from the saved cursor (the high-water mark on the record's upstream update time), minus a small overlap window so same-second / clock-skewed updates are never missed;
  2. pages issues in ascending update order (state=all, so closing an issue is not mistaken for a deletion), folding each issue + its comments into one memory;
  3. routes each record through an idempotent upsert keyed on (source_system, source_id):
    • unseen record → insert,
    • changed content (by content hash) → update in place + re-embed,
    • unchanged content → no-op (only the "last seen" time advances);
  4. advances and persists the cursor only after the run, so an interruption re-scans rather than skips.
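
Steps 1 and 4 can be sketched with plain integer timestamps. This is an illustration under stated assumptions: times are epoch seconds, the 60-second overlap is an invented value, and `scan_start` / `next_cursor` are hypothetical helper names. The rule that the cursor never advances past a failing record is taken from the commit notes for this feature.

```rust
/// Overlap re-scanned on every run, in seconds. The value is an assumption.
const OVERLAP_SECS: u64 = 60;

/// Step 1: where this run starts scanning, given the saved cursor (if any).
/// With no cursor the scan starts from the beginning of history.
fn scan_start(saved_cursor: Option<u64>) -> u64 {
    saved_cursor.map_or(0, |c| c.saturating_sub(OVERLAP_SECS))
}

/// Step 4: the cursor to persist after the run.
/// `outcomes` holds, in ascending update order, each record's upstream
/// update time and whether it was processed successfully.
fn next_cursor(saved_cursor: Option<u64>, outcomes: &[(u64, bool)]) -> Option<u64> {
    let mut cursor = saved_cursor;
    for &(updated_at, ok) in outcomes {
        if !ok {
            // Stop here so the failing record is re-scanned next run.
            break;
        }
        // Records inside the overlap window are older than the saved
        // cursor, so take the maximum rather than overwriting.
        cursor = Some(cursor.map_or(updated_at, |c| c.max(updated_at)));
    }
    cursor
}
```

Because the cursor only moves forward and is written after the run, an interrupted run simply repeats work that the idempotent upsert turns into no-ops.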

Re-running source_sync on the same repo is therefore safe and cheap — it picks up only what changed.
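
The three upsert outcomes can be illustrated with an in-memory map keyed on (source_system, source_id). This is a sketch, not vestige-core's `upsert_by_source`: the signature and the `Store` type are hypothetical, the real store is a database, and `DefaultHasher` merely stands in for whatever content hash the library uses.

```rust
use std::collections::hash_map::{DefaultHasher, Entry};
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
enum UpsertOutcome {
    Created,
    Updated,
    Unchanged,
}

struct StoredRecord {
    content: String,
    content_hash: u64,
    synced_at: u64,
}

/// Hypothetical in-memory stand-in for the memory store.
#[derive(Default)]
struct Store {
    by_source: HashMap<(String, String), StoredRecord>,
}

/// Stand-in change detector; not a stable or cryptographic hash.
fn content_hash(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

impl Store {
    fn upsert_by_source(&mut self, system: &str, id: &str, content: &str, now: u64) -> UpsertOutcome {
        let hash = content_hash(content);
        match self.by_source.entry((system.to_string(), id.to_string())) {
            // Unseen record: insert.
            Entry::Vacant(slot) => {
                slot.insert(StoredRecord { content: content.to_string(), content_hash: hash, synced_at: now });
                UpsertOutcome::Created
            }
            Entry::Occupied(mut slot) => {
                let existing = slot.get_mut();
                // The "last seen" time advances in both remaining cases.
                existing.synced_at = now;
                if existing.content_hash == hash {
                    return UpsertOutcome::Unchanged;
                }
                // Changed content: update in place (re-embedding omitted).
                existing.content = content.to_string();
                existing.content_hash = hash;
                UpsertOutcome::Updated
            }
        }
    }
}
```

Syncing the same issue twice with identical content yields `Created` then `Unchanged`, and the map still holds a single entry for that key.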

Deletions (tombstoning)

Neither GitHub nor Redmine exposes a deletion feed, so an incremental sync can never see a delete. Pass reconcile: true to run a reconciliation pass: Vestige enumerates the currently-visible issue ids and invalidates (does not purge) any local record no longer present. A tombstoned record keeps its content for audit but drops out of "currently valid" retrieval (sourceRecord.tombstoned is true). If the record reappears upstream, the next sync un-tombstones it.
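
A reconciliation pass of this kind might look like the sketch below. The types and function names are hypothetical and the real implementation works against the database; the `valid_until` field and the rule that un-tombstoning also clears `superseded_by` come from the commit notes for this feature.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical slice of a local record: only the validity fields.
struct LocalRecord {
    valid_until: Option<u64>,
    superseded_by: Option<String>,
}

fn is_tombstoned(rec: &LocalRecord) -> bool {
    rec.valid_until.is_some()
}

/// Invalidate (do not delete) every local record whose id is no longer
/// visible upstream. Returns how many records were newly tombstoned.
fn reconcile_tombstones(
    local: &mut HashMap<String, LocalRecord>,
    live_ids: &HashSet<String>,
    now: u64,
) -> usize {
    let mut tombstoned = 0;
    for (id, rec) in local.iter_mut() {
        if !live_ids.contains(id) && !is_tombstoned(rec) {
            // Content is kept for audit; only validity ends.
            rec.valid_until = Some(now);
            tombstoned += 1;
        }
    }
    tombstoned
}

/// Called when a later sync sees the record upstream again.
fn untombstone(rec: &mut LocalRecord) {
    rec.valid_until = None;
    rec.superseded_by = None;
}
```

Running the pass twice with the same live set tombstones nothing the second time, so reconciliation is as repeatable as the sync itself.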

The source envelope

Every connector-ingested memory carries structured provenance, distinct from the legacy free-form source label:

| Field | Purpose |
|---|---|
| source_system | github, redmine, … (namespaces ids). |
| source_id | Native id (issue number, ticket id). |
| source_url | Canonical link back (the citation). |
| source_updated_at | Upstream update time (the sync cursor field). |
| content_hash | Change detector → idempotency. |
| synced_at | When the connector last saw the record live. |
| source_project | Repo / project / space. |
| source_type | issue, comment, … |
| source_author | Reporter / author upstream. |

The pair (source_system, source_id) is enforced unique by a partial UNIQUE index, so there is exactly one memory per external record. Legacy memories (agent- or user-authored) have no envelope, since the envelope columns are nullable, and are completely unaffected.
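
For orientation, the envelope can be pictured as a plain struct. The field names come from the table above; the field types, the `github_issue` constructor, and the `key` / `citation_prefix` helpers are assumptions for illustration and may differ from the actual `SourceEnvelope` in vestige-core.

```rust
/// Sketch of the provenance envelope; types are assumed, not confirmed.
#[derive(Debug, Clone, PartialEq)]
struct SourceEnvelope {
    source_system: String,
    source_id: String,
    source_url: String,
    source_updated_at: u64, // assumed: epoch seconds
    content_hash: String,
    synced_at: u64, // assumed: epoch seconds
    source_project: Option<String>,
    source_type: Option<String>,
    source_author: Option<String>,
}

impl SourceEnvelope {
    /// Hypothetical constructor for a GitHub issue record.
    fn github_issue(project: &str, number: u64, author: &str, updated_at: u64, content_hash: &str) -> Self {
        Self {
            source_system: "github".to_string(),
            source_id: number.to_string(),
            source_url: format!("https://github.com/{}/issues/{}", project, number),
            source_updated_at: updated_at,
            content_hash: content_hash.to_string(),
            synced_at: updated_at,
            source_project: Some(project.to_string()),
            source_type: Some("issue".to_string()),
            source_author: Some(author.to_string()),
        }
    }

    /// The uniqueness key: one memory per (system, id) pair.
    fn key(&self) -> (&str, &str) {
        (self.source_system.as_str(), self.source_id.as_str())
    }

    /// Short prefix in the style of the search example, e.g. "[owner/name#57]".
    /// Falling back to the system name when no project is set is an assumption.
    fn citation_prefix(&self) -> String {
        let scope = self.source_project.as_deref().unwrap_or(self.source_system.as_str());
        format!("[{}#{}]", scope, self.source_id)
    }
}
```

Keeping the id namespaced by system means a GitHub issue 57 and a Redmine ticket 57 never collide on the uniqueness key.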

Building

The connector HTTP client is behind the connectors cargo feature, which is on by default in the MCP server (vestige-mcp). A build without it still exposes the source_sync tool, but calling the tool returns a clear "rebuild with --features connectors" message instead of syncing. The core library (vestige-core) leaves the feature off by default, so library consumers that don't need connectors link no HTTP client.

# default MCP build already includes connectors
cargo build -p vestige-mcp --release

# explicit, or for the core lib
cargo build -p vestige-core --features connectors

Writing a new connector

Implement the Connector trait in vestige_core::connectors (fetch a window of records updated since a cursor, page forward, and optionally enumerate live ids for reconciliation), produce NormalizedRecords with a filled SourceEnvelope, and hand them to run_sync. The GitHub connector (crates/vestige-core/src/connectors/github.rs) is the reference implementation. The sync driver, idempotent upsert, cursor checkpointing, and tombstone reconciliation are all reused for free.
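
The shape of such a contract might resemble the sketch below. Everything here is hypothetical: the method names, signatures, `Page` type, trimmed-down `SourceEnvelope`, and the `drain_pages` driver are invented for illustration and are not the real trait or `run_sync` in vestige_core::connectors. Consult github.rs for the actual definitions.

```rust
/// Trimmed-down stand-ins for the real types.
struct SourceEnvelope {
    source_system: String,
    source_id: String,
    source_updated_at: u64,
}

struct NormalizedRecord {
    envelope: SourceEnvelope,
    content: String,
}

struct Page {
    records: Vec<NormalizedRecord>,
    next_page: Option<u32>,
}

/// Hypothetical contract: fetch a window since a cursor, page forward,
/// and optionally enumerate live ids for reconciliation.
trait Connector {
    fn system(&self) -> &str;
    fn fetch_page(&self, since: u64, page: u32) -> Result<Page, String>;
    /// `None` means the connector cannot enumerate live ids.
    fn list_live_ids(&self) -> Result<Option<Vec<String>>, String> {
        Ok(None)
    }
}

/// Toy connector over (id, updated_at, content) tuples held in memory.
struct InMemoryConnector {
    records: Vec<(String, u64, String)>,
    page_size: usize,
}

impl Connector for InMemoryConnector {
    fn system(&self) -> &str {
        "memory"
    }

    fn fetch_page(&self, since: u64, page: u32) -> Result<Page, String> {
        // Ascending update order, restricted to the requested window.
        let mut matching: Vec<&(String, u64, String)> =
            self.records.iter().filter(|r| r.1 >= since).collect();
        matching.sort_by_key(|r| r.1);
        let start = (page as usize).saturating_sub(1) * self.page_size;
        let records = matching
            .iter()
            .skip(start)
            .take(self.page_size)
            .map(|r| NormalizedRecord {
                envelope: SourceEnvelope {
                    source_system: self.system().to_string(),
                    source_id: r.0.clone(),
                    source_updated_at: r.1,
                },
                content: r.2.clone(),
            })
            .collect();
        let more = start + self.page_size < matching.len();
        Ok(Page { records, next_page: if more { Some(page + 1) } else { None } })
    }
}

/// Stand-in driver: pages forward until exhausted or `max_pages` is hit.
fn drain_pages(conn: &dyn Connector, since: u64, max_pages: u32) -> Result<Vec<NormalizedRecord>, String> {
    let mut out = Vec::new();
    let mut page = Some(1u32);
    let mut fetched = 0;
    while let Some(p) = page {
        if fetched == max_pages {
            break;
        }
        let result = conn.fetch_page(since, p)?;
        out.extend(result.records);
        page = result.next_page;
        fetched += 1;
    }
    Ok(out)
}
```

A connector written this way stays small because it only translates upstream records; paging limits, upserts, cursors, and tombstones live in the shared driver.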