feat(sigma): add Sigma Computing context-source adapter (#316)

* feat(sigma): add Sigma Computing context-source adapter

Closes #168

Adds a full ingest adapter for Sigma Computing so `ktx ingest` can pull
data model specs and workbook summaries into the ktx context layer. The
implementation follows the same fetch → chunk → project → LLM pattern
used by the Looker, Metabase, and MetricFlow adapters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(sigma): address PR review comments

- Remove manifest from rawFiles; moves to peerFileIndex so fetchedAt
  changes don't mark all work units dirty every run
- Fix workbookFilter.updatedSince eviction bug: fetch full universe first,
  apply filter client-side, evict only on archived/deleted
- Remove measure projection entirely; project() writes measures: [] and
  the sigma_ingest skill surfaces Lookup/aggregation formulas as wiki prose
- Remove joins projection (v1 limitation); project() writes joins: [] and
  Lookup relationships are described in wiki prose instead
- Remove write-back dead code: createDataModel, updateDataModel,
  SigmaDataModelPushResult, mutate/post/put
- Fix emitBatches notes pluralization bug ('2 data modelss' → '2 data models')
- Add tokenInflight dedup on ensureToken to coalesce concurrent auth requests
- Retry spec fetch when existing staged spec is null (transient failure cache)
- Drop unused WorkbookFilter import from client-port.ts
- Note in docs that joins are not projected from Sigma data models in this release

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* updates

* fix(sigma): restore sigma in local adapter test + small cleanups

The gdrive↔sigma merge dropped 'sigma' from the expected adapter source
list in local-adapters.test.ts while keeping gdrive, so the slow TS suite
failed even though the source registers both. Add 'sigma' back at its
registration position (after metabase, before gdrive).

Also:
- Move the orphaned SigmaPullConfig docstring onto the schema it documents
  and drop the stale BullMQ reference (standalone ktx has no BullMQ; the
  config lives in the ingest job's bundleRef.config).
- Drop an O(n^2) find() round-trip in fetch() when building the active
  data-model list; filter once and reuse for the eviction id set.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>
Co-authored-by: Luca Martial <48870843+luca-martial@users.noreply.github.com>
Matt Senick (Sigma) 2026-06-30 16:14:57 -07:00 committed by GitHub
parent 139ac08320
commit acd20ac248
41 changed files with 3610 additions and 6 deletions


@@ -1,6 +1,6 @@
---
title: Context Sources
-description: Ingest semantic context from dbt, MetricFlow, LookML, Metabase, Looker, Notion, and Google Drive.
+description: Ingest semantic context from dbt, MetricFlow, LookML, Metabase, Looker, Notion, Sigma, and Google Drive.
---
Context sources feed your existing analytics tooling into **ktx**. During ingestion, **ktx** extracts metadata from each source and uses a reconciliation agent to reconcile it with your existing semantic layer and knowledge base - preserving accepted edits rather than overwriting.
@@ -27,7 +27,7 @@ LookML uses top-level `repoUrl`, and MetricFlow uses nested
| Field | Required | Description |
|-------|----------|-------------|
-| `driver` | Yes | Source connector: `dbt`, `metricflow`, `lookml`, `metabase`, `looker`, `notion`, or `gdrive` |
+| `driver` | Yes | Source connector: `dbt`, `metricflow`, `lookml`, `metabase`, `looker`, `notion`, `sigma`, or `gdrive` |
| `source_dir` | For local file sources | Absolute or project-relative source directory |
| `repo_url` | For Git-hosted dbt sources | Git repository URL |
| `repoUrl` | For Git-hosted LookML sources | Git repository URL |
@@ -378,6 +378,101 @@ Create an integration at [notion.so/my-integrations](https://www.notion.so/my-in
---
## Sigma
Ingests data model definitions and workbook metadata from a Sigma workspace as semantic context. Uses the Sigma REST API to fetch data model specs and workbook summaries.
### What it provides
- Data model names, folder paths, and ownership metadata
- Page and element definitions within each data model
- Column identifiers and data types where available
- Workbook names, paths, descriptions, and version metadata
### Connection config
```yaml title="ktx.yaml"
connections:
  sigma-main:
    driver: sigma
    api_url: https://api.sigmacomputing.com # Omit for GCP US (default)
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
```
For the AWS US region, override `api_url`:
```yaml title="ktx.yaml"
connections:
  sigma-main:
    driver: sigma
    api_url: https://aws-api.sigmacomputing.com
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
```
### Authentication
| Method | Config |
|--------|--------|
| OAuth client credentials | `client_id` + `client_secret_ref: env:SIGMA_CLIENT_SECRET` |
Generate a client in Sigma: **Administration → Developer Access → Add New Client**.
### What gets ingested
- Active data model specs, organized by folder into work units
- Workbook metadata (name, path, description, version) — archived and exploration workbooks excluded by default
- Models backed by CSV uploads or unsupported connector subtypes are listed in the manifest but skipped during spec fetch (a Sigma API limitation)
### Warehouse connection mapping
`connectionMappings` is optional. Without it, **ktx** produces wiki knowledge only — no semantic-layer sources are written and warehouse validation is skipped. To get semantic-layer output and enable `sl_validate`, map each Sigma internal connection UUID to a **ktx** warehouse connection ID:
```yaml title="ktx.yaml"
connections:
  sigma-main:
    driver: sigma
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
    connectionMappings:
      "<sigma-internal-uuid>": snowflake-prod # data models using this connection get SL sources
```
Find the Sigma connection UUID in **Administration → Connections** or from the `source.connectionId` field in a fetched data model spec. Data model elements whose `connectionId` has no mapping are ingested as wiki-only.
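The routing rule above can be sketched in TypeScript. This is only an illustration under stated assumptions: `DataModelElement`, `Routed`, and `routeElements` are hypothetical names, not the adapter's actual types or functions, and the only behaviour taken from the text is that a mapped `connectionId` yields semantic-layer output while an unmapped one falls back to wiki-only.

```typescript
// Hypothetical shapes for illustration; not the adapter's real types.
interface DataModelElement {
  id: string;
  connectionId?: string; // mirrors `source.connectionId` in a fetched spec
}

interface Routed {
  semanticLayer: { elementId: string; warehouse: string }[];
  wikiOnly: string[];
}

// Split elements by whether their Sigma connection UUID has a ktx mapping.
function routeElements(
  elements: DataModelElement[],
  connectionMappings: Record<string, string> = {},
): Routed {
  // A Map avoids accidental hits on inherited object properties.
  const mappings = new Map<string, string>(Object.entries(connectionMappings));
  const routed: Routed = { semanticLayer: [], wikiOnly: [] };

  for (const element of elements) {
    const warehouse =
      element.connectionId !== undefined
        ? mappings.get(element.connectionId)
        : undefined;

    if (warehouse !== undefined) {
      routed.semanticLayer.push({ elementId: element.id, warehouse });
    } else {
      // No mapping (or no connectionId at all): wiki knowledge only.
      routed.wikiOnly.push(element.id);
    }
  }
  return routed;
}
```

With no `connectionMappings` at all, every element ends up in `wikiOnly`, which matches the wiki-only default described above.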
### Workbook filter
At large scale, you can limit which workbooks are fetched during ingest using `workbookFilter`:
```yaml title="ktx.yaml"
connections:
  sigma-main:
    driver: sigma
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
    workbookFilter:
      includeArchived: false # default
      includeExplorations: false # default
      updatedSince: "2026-01-01T00:00:00Z" # only recently updated workbooks
```
| Field | Default | Description |
|-------|---------|-------------|
| `includeArchived` | `false` | Include archived workbooks |
| `includeExplorations` | `false` | Include exploration workbooks |
| `updatedSince` | — | ISO 8601 timestamp; only workbooks updated at or after this time are kept |
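As a rough sketch of how the three filter fields could combine, the TypeScript below applies them to an in-memory workbook list (the commit message notes that the filter is applied client-side after the full list is fetched). The `WorkbookSummary` field names (`isArchived`, `isExploration`, `updatedAt`) and the function `applyWorkbookFilter` are assumptions made for illustration; the real Sigma API payload and adapter code may differ.

```typescript
// Hypothetical workbook shape; field names are assumptions.
interface WorkbookSummary {
  id: string;
  updatedAt: string; // ISO 8601 timestamp
  isArchived: boolean;
  isExploration: boolean;
}

// Mirrors the documented `workbookFilter` keys.
interface WorkbookFilter {
  includeArchived?: boolean;
  includeExplorations?: boolean;
  updatedSince?: string;
}

function applyWorkbookFilter(
  workbooks: WorkbookSummary[],
  filter: WorkbookFilter = {},
): WorkbookSummary[] {
  const since =
    filter.updatedSince !== undefined ? Date.parse(filter.updatedSince) : undefined;
  if (since !== undefined && Number.isNaN(since)) {
    throw new Error(`updatedSince is not a valid timestamp: ${filter.updatedSince}`);
  }

  return workbooks.filter((wb) => {
    // Both include flags default to false, as in the table above.
    if (wb.isArchived && !filter.includeArchived) return false;
    if (wb.isExploration && !filter.includeExplorations) return false;
    // Keep workbooks updated at or after the cutoff.
    if (since !== undefined && Date.parse(wb.updatedAt) < since) return false;
    return true;
  });
}
```

Filtering a complete list in memory, rather than asking the API for a narrower list, keeps a workbook that merely falls outside `updatedSince` distinguishable from one that was archived or deleted.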
### Notes
- `connectionMappings` is optional for wiki-only ingest; it is required to generate semantic-layer sources and run warehouse validation
- Context ingest (`ktx ingest sigma-main`) fetches from the Sigma API directly
- Ingest is incremental: items whose `updatedAt` timestamp is unchanged since the last run are skipped
- Models backed by CSV uploads or unsupported connector subtypes cannot have their spec exported; these are skipped with a warning (a Sigma API limitation)
- Joins are not projected from Sigma data models in this release; `joins: []` is always written by the projection step. Lookup relationships visible in data model specs are captured as wiki knowledge instead.
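The incremental behaviour noted above (skipping items whose `updatedAt` is unchanged) can be pictured with a small TypeScript sketch. It assumes the previous run's timestamps are available as a map from item id to `updatedAt`; how the adapter actually persists that state is not described here, and `RemoteItem` and `selectChanged` are hypothetical names.

```typescript
// Hypothetical item shape: anything with an id and an `updatedAt` timestamp.
interface RemoteItem {
  id: string;
  updatedAt: string;
}

// Return only the items that are new or whose `updatedAt` differs from
// the value recorded on the previous run.
function selectChanged(
  remote: RemoteItem[],
  lastSeen: ReadonlyMap<string, string>,
): RemoteItem[] {
  return remote.filter((item) => lastSeen.get(item.id) !== item.updatedAt);
}

// Usage: only "dm-2" (changed) and "dm-3" (new) would be re-processed.
const previousRun = new Map<string, string>([
  ["dm-1", "2026-05-01T10:00:00Z"],
  ["dm-2", "2026-05-01T10:00:00Z"],
]);
const toProcess = selectChanged(
  [
    { id: "dm-1", updatedAt: "2026-05-01T10:00:00Z" },
    { id: "dm-2", updatedAt: "2026-06-15T08:30:00Z" },
    { id: "dm-3", updatedAt: "2026-06-20T12:00:00Z" },
  ],
  previousRun,
);
```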
---
## Google Drive
Ingests Google Docs from a shared Google Drive folder as wiki-ready knowledge content. This v1 implementation is knowledge-only and ingests Google Docs MIME types only.