feat(sigma): add Sigma Computing context-source adapter (#316)

* feat(sigma): add Sigma Computing context-source adapter

Closes #168

Adds a full ingest adapter for Sigma Computing so `ktx ingest` can pull
data model specs and workbook summaries into the ktx context layer. The
implementation follows the same fetch → chunk → project → LLM pattern
used by the Looker, Metabase, and MetricFlow adapters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(sigma): address PR review comments

- Remove manifest from rawFiles; move it to peerFileIndex so fetchedAt
  changes don't mark all work units dirty every run
- Fix workbookFilter.updatedSince eviction bug: fetch full universe first,
  apply filter client-side, evict only on archived/deleted
- Remove measure projection entirely; project() writes measures: [] and
  the sigma_ingest skill surfaces Lookup/aggregation formulas as wiki prose
- Remove joins projection (v1 limitation); project() writes joins: [] and
  Lookup relationships are described in wiki prose instead
- Remove write-back dead code: createDataModel, updateDataModel,
  SigmaDataModelPushResult, mutate/post/put
- Fix emitBatches notes pluralization bug ('2 data modelss' → '2 data models')
- Add tokenInflight dedup on ensureToken to coalesce concurrent auth requests
- Retry spec fetch when existing staged spec is null (transient failure cache)
- Drop unused WorkbookFilter import from client-port.ts
- Note in docs that joins are not projected from Sigma data models in this release
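
The `tokenInflight` dedup on `ensureToken` mentioned above can be pictured roughly as follows. This is a minimal sketch, not the adapter's actual code: the `TokenCache` class, the `Token` shape, and the injected `requestToken` function are hypothetical names used only for illustration.

```typescript
// Hypothetical token shape; the real Sigma token response is not shown in this commit.
type Token = { value: string; expiresAt: number };

class TokenCache {
  private token: Token | null = null;
  // Shared promise for the auth request currently in flight, if any.
  private tokenInflight: Promise<Token> | null = null;
  private readonly requestToken: () => Promise<Token>;

  constructor(requestToken: () => Promise<Token>) {
    this.requestToken = requestToken;
  }

  async ensureToken(now: number = Date.now()): Promise<string> {
    // Reuse a cached token that has not expired yet.
    if (this.token && this.token.expiresAt > now) return this.token.value;

    let inflight = this.tokenInflight;
    if (!inflight) {
      // First caller starts the request; later callers await the same promise.
      inflight = this.requestToken().then(
        (t) => {
          this.token = t;
          this.tokenInflight = null;
          return t;
        },
        (err) => {
          // Clear the slot on failure so the next call can retry.
          this.tokenInflight = null;
          throw err;
        },
      );
      this.tokenInflight = inflight;
    }
    return (await inflight).value;
  }
}
```

With this shape, several concurrent `ensureToken()` calls made before the first response arrives share a single auth request.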

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* updates

* fix(sigma): restore sigma in local adapter test + small cleanups

The gdrive↔sigma merge dropped 'sigma' from the expected adapter source
list in local-adapters.test.ts while keeping gdrive, so the slow TS suite
failed even though the source registers both. Add 'sigma' back at its
registration position (after metabase, before gdrive).

Also:
- Move the orphaned SigmaPullConfig docstring onto the schema it documents
  and drop the stale BullMQ reference (standalone ktx has no BullMQ; the
  config lives in the ingest job's bundleRef.config).
- Drop an O(n^2) find() round-trip in fetch() when building the active
  data-model list; filter once and reuse for the eviction id set.
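
As a rough illustration of the "filter once and reuse" cleanup, the sketch below builds the active list in one pass and derives the eviction id set from it. The `DataModelSummary` shape, the `archived` flag, and both function names are assumptions for illustration, not the adapter's real types.

```typescript
// Hypothetical summary shape; field names are assumed for this sketch.
type DataModelSummary = { id: string; name: string; archived: boolean };

// Filter the listing once, then derive the id set from the filtered result
// instead of calling find() on the full list for every id.
function splitActive(all: DataModelSummary[]): {
  active: DataModelSummary[];
  activeIds: Set<string>;
} {
  const active = all.filter((m) => !m.archived);
  const activeIds = new Set(active.map((m) => m.id));
  return { active, activeIds };
}

// Staged ids that are no longer active are candidates for eviction.
function idsToEvict(stagedIds: string[], activeIds: Set<string>): string[] {
  return stagedIds.filter((id) => !activeIds.has(id));
}
```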

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>
Co-authored-by: Luca Martial <48870843+luca-martial@users.noreply.github.com>
Matt Senick (Sigma) 2026-06-30 16:14:57 -07:00 committed by GitHub
parent 139ac08320
commit acd20ac248
GPG key ID: B5690EEEBB952194
41 changed files with 3610 additions and 6 deletions


@ -8,7 +8,7 @@ can also capture free-form text into **ktx** memory. Database connections build
enriched context — schema plus AI-generated descriptions, embeddings, and
relationship evidence — and require a configured model and embeddings.
Context-source connections ingest metadata from tools such as dbt, Looker,
Metabase, MetricFlow, LookML, and Notion. Pass `--text` or `--file` to capture
Metabase, MetricFlow, LookML, Notion, and Sigma. Pass `--text` or `--file` to capture
inline text or text files into memory instead.
## Command signature


@ -193,7 +193,7 @@ sources. This is equivalent to passing `--skip-sources` in scripted setup.
| Flag | Description |
|------|-------------|
| `--source <type>` | Context-source connector type: `dbt`, `metricflow`, `metabase`, `looker`, `lookml`, or `notion` |
| `--source <type>` | Context-source connector type: `dbt`, `metricflow`, `metabase`, `looker`, `lookml`, `notion`, or `sigma` |
| `--source-connection-id <id>` | Connection id for context-source setup |
| `--source-path <path>` | Local source path for dbt, MetricFlow, or LookML |
| `--source-git-url <url>` | Git URL for dbt, MetricFlow, or LookML |
@ -278,6 +278,13 @@ ktx setup \
  --notion-crawl-mode selected_roots \
  --notion-root-page-id abc123def456

# Add a Sigma source
ktx setup \
  --source sigma \
  --source-connection-id sigma-main \
  --source-client-id your-client-id \
  --source-client-secret-ref env:SIGMA_CLIENT_SECRET

# Install project-scoped agent integration for Codex
ktx setup --agents --target codex
```


@ -119,6 +119,7 @@ context-source drivers share the map.
| `dbt` | Context source | `driver`, one of `source_dir` or `repo_url` | `branch`, `path`, `profiles_path`, `target`, `project_name` |
| `metricflow` | Context source | `driver`, `metricflow.repoUrl` | `metricflow.branch`, `metricflow.path`, `metricflow.auth_token_ref` |
| `notion` | Context source | `driver`, `auth_token_ref` | `crawl_mode`, `root_*_ids`, `max_*_per_run` |
| `sigma` | Context source | `driver`, `client_id`, `client_secret_ref` | `api_url` |
### Warehouse drivers
@ -345,6 +346,31 @@ connections:
| `max_knowledge_creates_per_run` | Max new wiki pages created per run (0-25). |
| `max_knowledge_updates_per_run` | Max existing wiki pages updated per run (0-100). |
### Sigma
```yaml
connections:
  sigma-main:
    driver: sigma
    api_url: https://api.sigmacomputing.com
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
    workbookFilter:
      includeArchived: false
      includeExplorations: false
      updatedSince: "2026-01-01T00:00:00Z"
```
| Field | Purpose |
|-------|---------|
| `api_url` | Sigma API base URL. Defaults to `https://api.sigmacomputing.com` (GCP US). Override for AWS US (`https://aws-api.sigmacomputing.com`) or other regions. |
| `client_id` | Sigma OAuth client ID. Required. |
| `client_secret` / `client_secret_ref` | Literal secret or reference. Prefer the `_ref`. |
| `connectionMappings` | Maps Sigma internal connection UUIDs to **ktx** warehouse connection IDs. Enables `sl_validate` for projected semantic-layer sources. |
| `workbookFilter.includeArchived` | Include archived workbooks during ingest. Default: `false`. |
| `workbookFilter.includeExplorations` | Include exploration workbooks during ingest. Default: `false`. |
| `workbookFilter.updatedSince` | ISO 8601 date string. Only workbooks updated on or after this date are fetched. Useful for limiting ingest scope at large scale. |
## `setup`
Captured by the setup wizard. The only field **ktx** still reads is


@ -102,6 +102,7 @@ Supported source types:
| `looker` | Looker API | Explores, looks, dashboards, and model metadata |
| `metabase` | Metabase API | Questions, dashboards, table metadata, and mappings |
| `notion` | Notion API | Wiki pages and business knowledge |
| `sigma` | Sigma API | Data model specs, pages, element metadata, and workbook metadata |
Context-source ingest writes semantic source YAML and wiki Markdown, reconciling
with local edits.


@ -1,6 +1,6 @@
---
title: Context Sources
description: Ingest semantic context from dbt, MetricFlow, LookML, Metabase, Looker, Notion, and Google Drive.
description: Ingest semantic context from dbt, MetricFlow, LookML, Metabase, Looker, Notion, Sigma, and Google Drive.
---
Context sources feed your existing analytics tooling into **ktx**. During ingestion, **ktx** extracts metadata from each source and uses a reconciliation agent to reconcile it with your existing semantic layer and knowledge base - preserving accepted edits rather than overwriting.
@ -27,7 +27,7 @@ LookML uses top-level `repoUrl`, and MetricFlow uses nested
| Field | Required | Description |
|-------|----------|-------------|
| `driver` | Yes | Source connector: `dbt`, `metricflow`, `lookml`, `metabase`, `looker`, `notion`, or `gdrive` |
| `driver` | Yes | Source connector: `dbt`, `metricflow`, `lookml`, `metabase`, `looker`, `notion`, `sigma`, or `gdrive` |
| `source_dir` | For local file sources | Absolute or project-relative source directory |
| `repo_url` | For Git-hosted dbt sources | Git repository URL |
| `repoUrl` | For Git-hosted LookML sources | Git repository URL |
@ -378,6 +378,101 @@ Create an integration at [notion.so/my-integrations](https://www.notion.so/my-in
---
## Sigma
Ingests data model definitions and workbook metadata from a Sigma workspace as semantic context. Uses the Sigma REST API to fetch data model specs and workbook summaries.
### What it provides
- Data model names, folder paths, and ownership metadata
- Page and element definitions within each data model
- Column identifiers and data types where available
- Workbook names, paths, descriptions, and version metadata
### Connection config
```yaml title="ktx.yaml"
connections:
  sigma-main:
    driver: sigma
    api_url: https://api.sigmacomputing.com # Omit for GCP US (default)
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
```
For the AWS US region, override `api_url`:
```yaml title="ktx.yaml"
connections:
  sigma-main:
    driver: sigma
    api_url: https://aws-api.sigmacomputing.com
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
```
### Authentication
| Method | Config |
|--------|--------|
| OAuth client credentials | `client_id` + `client_secret_ref: env:SIGMA_CLIENT_SECRET` |
Generate a client in Sigma: **Administration → Developer Access → Add New Client**.
### What gets ingested
- Active data model specs, organized by folder into work units
- Workbook metadata (name, path, description, version) — archived and exploration workbooks excluded by default
- Models backed by CSV uploads or unsupported connector subtypes are listed in the manifest but skipped during spec fetch (a Sigma API limitation)
### Warehouse connection mapping
`connectionMappings` is optional. Without it, **ktx** produces wiki knowledge only — no semantic-layer sources are written and warehouse validation is skipped. To get semantic-layer output and enable `sl_validate`, map each Sigma internal connection UUID to a **ktx** warehouse connection ID:
```yaml title="ktx.yaml"
connections:
  sigma-main:
    driver: sigma
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
    connectionMappings:
      "<sigma-internal-uuid>": snowflake-prod # data models using this connection get SL sources
```
Find the Sigma connection UUID in **Administration → Connections** or from the `source.connectionId` field in a fetched data model spec. Data model elements whose `connectionId` has no mapping are ingested as wiki-only.
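
The mapping rule above can be sketched as a small lookup. This is an illustrative sketch only: `resolveTarget`, `ProjectionTarget`, and the `kind` labels are hypothetical names, not part of the **ktx** API.

```typescript
// Sigma internal connection UUID -> ktx warehouse connection ID.
type ConnectionMappings = Record<string, string>;

// Hypothetical result type for this sketch.
type ProjectionTarget =
  | { kind: "semantic-layer"; warehouseConnectionId: string }
  | { kind: "wiki-only" };

function resolveTarget(
  sigmaConnectionId: string | undefined,
  mappings: ConnectionMappings = {},
): ProjectionTarget {
  // Only own keys count as mappings; anything else falls back to wiki-only.
  if (
    sigmaConnectionId !== undefined &&
    Object.prototype.hasOwnProperty.call(mappings, sigmaConnectionId)
  ) {
    return { kind: "semantic-layer", warehouseConnectionId: mappings[sigmaConnectionId] };
  }
  return { kind: "wiki-only" };
}
```

Under these assumptions, an element with an unmapped or missing `connectionId` resolves to `wiki-only`, which matches the behavior described when `connectionMappings` is omitted.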
### Workbook filter
At large scale, you can limit which workbooks are fetched during ingest using `workbookFilter`:
```yaml title="ktx.yaml"
connections:
  sigma-main:
    driver: sigma
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
    workbookFilter:
      includeArchived: false # default
      includeExplorations: false # default
      updatedSince: "2026-01-01T00:00:00Z" # only recently updated workbooks
```
| Field | Default | Description |
|-------|---------|-------------|
| `includeArchived` | `false` | Include archived workbooks |
| `includeExplorations` | `false` | Include exploration workbooks |
| `updatedSince` | — | ISO 8601 date; only workbooks updated on or after this date are fetched |
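
The three filter fields can be read as a single predicate applied to the workbook listing. The sketch below is one possible reading, not the adapter's implementation: the `Workbook` shape and its `isArchived`, `isExploration`, and `updatedAt` fields are assumed names.

```typescript
// Hypothetical workbook summary; field names are assumptions for this sketch.
type Workbook = { id: string; updatedAt: string; isArchived: boolean; isExploration: boolean };

type WorkbookFilter = {
  includeArchived?: boolean;
  includeExplorations?: boolean;
  updatedSince?: string; // ISO 8601
};

function applyWorkbookFilter(all: Workbook[], filter: WorkbookFilter = {}): Workbook[] {
  const since = filter.updatedSince ? Date.parse(filter.updatedSince) : undefined;
  return all.filter((wb) => {
    if (wb.isArchived && !filter.includeArchived) return false;
    if (wb.isExploration && !filter.includeExplorations) return false;
    // "On or after": drop only workbooks strictly older than updatedSince.
    if (since !== undefined && Date.parse(wb.updatedAt) < since) return false;
    return true;
  });
}
```

Note that in this sketch an unparseable date yields `NaN`, so the comparison is false and the workbook is kept; a real implementation may prefer to reject such input.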
### Notes
- `connectionMappings` is optional for wiki-only ingest; it is required to generate semantic-layer sources and run warehouse validation
- Context ingest (`ktx ingest sigma-main`) fetches from the Sigma API directly
- Ingest is incremental: items whose `updatedAt` timestamp is unchanged since the last run are skipped
- Models backed by CSV uploads or unsupported connector subtypes cannot have their spec exported; these are skipped with a warning (a Sigma API limitation)
- Joins are not projected from Sigma data models in this release; `joins: []` is always written by the projection step. Lookup relationships visible in data model specs are captured as wiki knowledge instead.
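
The incremental rule in the notes above amounts to comparing each item's `updatedAt` with the value recorded on the previous run. A minimal sketch, assuming a hypothetical `Item` shape and a `lastSeen` map keyed by item id (neither name comes from the adapter):

```typescript
// Hypothetical item shape; only id and updatedAt matter for this sketch.
type Item = { id: string; updatedAt: string };

// Returns the items that are new or whose updatedAt differs from the last run.
function selectChanged(current: Item[], lastSeen: Map<string, string>): Item[] {
  return current.filter((item) => lastSeen.get(item.id) !== item.updatedAt);
}
```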
---
## Google Drive
Ingests Google Docs from a shared Google Drive folder as wiki-ready knowledge content. This v1 implementation is knowledge-only and ingests Google Docs MIME types only.