feat(sigma): add Sigma Computing context-source adapter (#316)

* feat(sigma): add Sigma Computing context-source adapter

Closes #168

Adds a full ingest adapter for Sigma Computing so `ktx ingest` can pull data model specs and workbook summaries into the ktx context layer. The implementation follows the same fetch → chunk → project → LLM pattern used by the Looker, Metabase, and MetricFlow adapters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(sigma): address PR review comments

- Remove manifest from rawFiles; moves to peerFileIndex so fetchedAt changes don't mark all work units dirty every run
- Fix workbookFilter.updatedSince eviction bug: fetch full universe first, apply filter client-side, evict only on archived/deleted
- Remove measure projection entirely; project() writes measures: [] and the sigma_ingest skill surfaces Lookup/aggregation formulas as wiki prose
- Remove joins projection (v1 limitation); project() writes joins: [] and Lookup relationships are described in wiki prose instead
- Remove write-back dead code: createDataModel, updateDataModel, SigmaDataModelPushResult, mutate/post/put
- Fix emitBatches notes pluralization bug ('2 data modelss' → '2 data models')
- Add tokenInflight dedup on ensureToken to coalesce concurrent auth requests
- Retry spec fetch when existing staged spec is null (transient failure cache)
- Drop unused WorkbookFilter import from client-port.ts
- Note in docs that joins are not projected from Sigma data models in this release

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* updates

* fix(sigma): restore sigma in local adapter test + small cleanups

The gdrive↔sigma merge dropped 'sigma' from the expected adapter source list in local-adapters.test.ts while keeping gdrive, so the slow TS suite failed even though the source registers both. Add 'sigma' back at its registration position (after metabase, before gdrive).

Also:
- Move the orphaned SigmaPullConfig docstring onto the schema it documents and drop the stale BullMQ reference (standalone ktx has no BullMQ; the config lives in the ingest job's bundleRef.config).
- Drop an O(n^2) find() round-trip in fetch() when building the active data-model list; filter once and reuse for the eviction id set.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>
Co-authored-by: Luca Martial <48870843+luca-martial@users.noreply.github.com>
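The `tokenInflight` dedup on `ensureToken` mentioned in the review fixes can be pictured roughly as follows. This is a minimal sketch, not the adapter's actual code: only the names `ensureToken` and `tokenInflight` come from the commit message, while `TokenCache`, `Token`, and the injected `fetchToken` function are hypothetical names introduced for illustration.

```typescript
// Hypothetical token shape; the adapter's real types are not shown in this commit view.
interface Token {
  accessToken: string;
  expiresAt: number; // epoch milliseconds
}

class TokenCache {
  private token: Token | null = null;
  // Shared promise for the auth request currently in progress, if any.
  private tokenInflight: Promise<Token> | null = null;
  // Assumed to perform the actual client-credentials request.
  private readonly fetchToken: () => Promise<Token>;

  constructor(fetchToken: () => Promise<Token>) {
    this.fetchToken = fetchToken;
  }

  async ensureToken(): Promise<string> {
    // Reuse a cached token while it is still valid.
    if (this.token !== null && this.token.expiresAt > Date.now()) {
      return this.token.accessToken;
    }
    // Start at most one auth request; concurrent callers await the same promise.
    if (this.tokenInflight === null) {
      this.tokenInflight = this.fetchToken()
        .then((t) => {
          this.token = t;
          return t;
        })
        .finally(() => {
          // Clear the slot on success or failure so a later call can retry.
          this.tokenInflight = null;
        });
    }
    return (await this.tokenInflight).accessToken;
  }
}
```

With this shape, several `ensureToken()` calls issued before the first auth response arrives would share one request, and a failed request does not leave a rejected promise cached.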
2026-07-01 08:59:39 +02:00 · 2026-06-30 16:14:57 -07:00 · 2026-06-30 16:14:57 -07:00 · acd20ac248
commit acd20ac248
parent 139ac08320
41 changed files with 3610 additions and 6 deletions
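The `workbookFilter.updatedSince` fix described in the commit message (fetch the full set, filter client-side, evict only archived or deleted workbooks) might look something like the sketch below. The function name `planWorkbooks` and the fields `isArchived` and `isExploration` are assumptions made for illustration; the Sigma API response shape and the adapter's internals are not shown on this page.

```typescript
// Hypothetical workbook summary; field names are assumptions, not the Sigma API schema.
interface WorkbookSummary {
  id: string;
  updatedAt: string; // ISO 8601 timestamp
  isArchived: boolean;
  isExploration: boolean;
}

// Mirrors the documented workbookFilter options.
interface WorkbookFilter {
  includeArchived?: boolean;
  includeExplorations?: boolean;
  updatedSince?: string;
}

function planWorkbooks(
  universe: WorkbookSummary[], // every workbook returned by the API
  stagedIds: string[], // ids staged by a previous run
  filter: WorkbookFilter = {},
): { toIngest: WorkbookSummary[]; toEvict: string[] } {
  const since = filter.updatedSince ? Date.parse(filter.updatedSince) : null;

  // The filter decides what to ingest on this run.
  const toIngest = universe.filter((wb) => {
    if (wb.isArchived && !filter.includeArchived) return false;
    if (wb.isExploration && !filter.includeExplorations) return false;
    if (since !== null && Date.parse(wb.updatedAt) < since) return false;
    return true;
  });

  // Eviction ignores updatedSince: a staged workbook is dropped only when it
  // no longer exists or is archived (and archived workbooks are not included).
  const live = new Set(
    universe
      .filter((wb) => !wb.isArchived || filter.includeArchived === true)
      .map((wb) => wb.id),
  );
  const toEvict = stagedIds.filter((id) => !live.has(id));

  return { toIngest, toEvict };
}
```

Keeping the eviction set independent of `updatedSince` means a workbook that is merely older than the cutoff stays staged instead of being removed on every filtered run.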
--- a/docs-site/content/docs/integrations/context-sources.mdx
+++ b/docs-site/content/docs/integrations/context-sources.mdx
@@ -1,6 +1,6 @@
 ---
 title: Context Sources
-description: Ingest semantic context from dbt, MetricFlow, LookML, Metabase, Looker, Notion, and Google Drive.
+description: Ingest semantic context from dbt, MetricFlow, LookML, Metabase, Looker, Notion, Sigma, and Google Drive.
 ---

 Context sources feed your existing analytics tooling into **ktx**. During ingestion, **ktx** extracts metadata from each source and uses a reconciliation agent to reconcile it with your existing semantic layer and knowledge base - preserving accepted edits rather than overwriting.
@@ -27,7 +27,7 @@ LookML uses top-level `repoUrl`, and MetricFlow uses nested

 | Field | Required | Description |
 |-------|----------|-------------|
-| `driver` | Yes | Source connector: `dbt`, `metricflow`, `lookml`, `metabase`, `looker`, `notion`, or `gdrive` |
+| `driver` | Yes | Source connector: `dbt`, `metricflow`, `lookml`, `metabase`, `looker`, `notion`, `sigma`, or `gdrive` |
 | `source_dir` | For local file sources | Absolute or project-relative source directory |
 | `repo_url` | For Git-hosted dbt sources | Git repository URL |
 | `repoUrl` | For Git-hosted LookML sources | Git repository URL |
@@ -378,6 +378,101 @@ Create an integration at [notion.so/my-integrations](https://www.notion.so/my-in

 ---

+## Sigma
+
+Ingests data model definitions and workbook metadata from a Sigma workspace as semantic context. Uses the Sigma REST API to fetch data model specs and workbook summaries.
+
+### What it provides
+
+- Data model names, folder paths, and ownership metadata
+- Page and element definitions within each data model
+- Column identifiers and data types where available
+- Workbook names, paths, descriptions, and version metadata
+
+### Connection config
+
+```yaml title="ktx.yaml"
+connections:
+  sigma-main:
+    driver: sigma
+    api_url: https://api.sigmacomputing.com   # Omit for GCP US (default)
+    client_id: "<your-client-id>"
+    client_secret_ref: env:SIGMA_CLIENT_SECRET
+```
+
+For the AWS US region, override `api_url`:
+
+```yaml title="ktx.yaml"
+connections:
+  sigma-main:
+    driver: sigma
+    api_url: https://aws-api.sigmacomputing.com
+    client_id: "<your-client-id>"
+    client_secret_ref: env:SIGMA_CLIENT_SECRET
+```
+
+### Authentication
+
+| Method | Config |
+|--------|--------|
+| OAuth client credentials | `client_id` + `client_secret_ref: env:SIGMA_CLIENT_SECRET` |
+
+Generate a client in Sigma: **Administration → Developer Access → Add New Client**.
+
+### What gets ingested
+
+- Active data model specs, organized by folder into work units
+- Workbook metadata (name, path, description, version) — archived and exploration workbooks excluded by default
+- Models backed by CSV uploads or unsupported connector subtypes are listed in the manifest but skipped during spec fetch (a Sigma API limitation)
+
+### Warehouse connection mapping
+
+`connectionMappings` is optional. Without it, **ktx** produces wiki knowledge only — no semantic-layer sources are written and warehouse validation is skipped. To get semantic-layer output and enable `sl_validate`, map each Sigma internal connection UUID to a **ktx** warehouse connection ID:
+
+```yaml title="ktx.yaml"
+connections:
+  sigma-main:
+    driver: sigma
+    client_id: "<your-client-id>"
+    client_secret_ref: env:SIGMA_CLIENT_SECRET
+    connectionMappings:
+      "<sigma-internal-uuid>": snowflake-prod   # data models using this connection get SL sources
+```
+
+Find the Sigma connection UUID in **Administration → Connections** or from the `source.connectionId` field in a fetched data model spec. Data model elements whose `connectionId` has no mapping are ingested as wiki-only.
+
+### Workbook filter
+
+For large workspaces, use `workbookFilter` to limit which workbooks are ingested:
+
+```yaml title="ktx.yaml"
+connections:
+  sigma-main:
+    driver: sigma
+    client_id: "<your-client-id>"
+    client_secret_ref: env:SIGMA_CLIENT_SECRET
+    workbookFilter:
+      includeArchived: false       # default
+      includeExplorations: false   # default
+      updatedSince: "2026-01-01T00:00:00Z"   # only recently updated workbooks
+```
+
+| Field | Default | Description |
+|-------|---------|-------------|
+| `includeArchived` | `false` | Include archived workbooks |
+| `includeExplorations` | `false` | Include exploration workbooks |
+| `updatedSince` | — | ISO 8601 timestamp; only workbooks updated at or after this time are ingested |
+
+### Notes
+
+- `connectionMappings` is optional for wiki-only ingest; it is required to generate semantic-layer sources and run warehouse validation
+- Context ingest (`ktx ingest sigma-main`) fetches from the Sigma API directly
+- Ingest is incremental: items whose `updatedAt` timestamp is unchanged since the last run are skipped
+- Models backed by CSV uploads or unsupported connector subtypes cannot have their spec exported; these are skipped with a warning (a Sigma API limitation)
+- Joins are not projected from Sigma data models in this release; `joins: []` is always written by the projection step. Lookup relationships visible in data model specs are captured as wiki knowledge instead.
+
+---
+
 ## Google Drive

 Ingests Google Docs from a shared Google Drive folder as wiki-ready knowledge content. This v1 implementation is knowledge-only and ingests Google Docs MIME types only.