omnigraph/docs/dev/rfc-008-deprecate-omnigraph-yaml.md
aaltshuler 9002cfd5b9 docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus
Adopts the unify-embedded/remote draft as RFC-009 with three alignment
amendments: (1) the promised 'companion config-authority RFC' is RFC-008,
already landed through stage 4 — referenced, not re-proposed; (2) open
question 3 is answered by the two-surface architecture (embedded graphs
list enumerates the cluster catalog via read_serving_snapshot, never
omnigraph.yaml); (3) Phase 2 salvages PR #139's reviewed-clean
omnigraph-api-types extraction instead of rebuilding. Adds the
cycle's two no-referee bugs (alias positional, write-if-absent flush) as
concrete parity-matrix motivation, and RFC-007's addressing/credential
chains as RemoteClient constructor inputs.

Corpus alignment: RFC-002's header now maps each of its pieces to the
successor that landed or superseded it (007/008/009) with a do-not-
implement-from-here-unchecked warning; RFC-007 gains the RFC-009
relationship; RFC-008 stage 5 notes the Phases-4/5 easing; dev index row.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:33:11 +03:00

178 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# RFC: Deprecate `omnigraph.yaml` — One Concern per Config Surface
**Status:** Proposed
**Date:** 2026-06-11
**Builds on:** [rfc-007-operator-config.md](rfc-007-operator-config.md) (the
operator layer that absorbs the identity/credential keys),
[rfc-005-server-cluster-boot.md](rfc-005-server-cluster-boot.md) (Landed —
cluster-booted serving), RFC-006 storage roots (landed: #186/#190/#194).
**Supersedes in part:** RFC-007's "project layer" framing (§Relationship
below) and [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md)'s
assumption that `omnigraph.yaml` remains the project manifest.
**Target release:** staged; final removal at the next major (see Sequencing).
## Summary
Retire `omnigraph.yaml`. It is three unrelated concerns wearing one
filename — server deployment config, project/CLI conveniences, and operator
identity — and the mixture is not a cosmetic wart but the root cause of a
recurring class of problems: operators keeping personal copies of "project"
files, repo checkouts able to carry credential-adjacent keys (the #139
security findings), `omnigraph init` scaffolding config into unrelated
directories, and every config discussion needing a paragraph to establish
which of the three files is meant.
The end state is **two config surfaces with single owners**:
| Surface | Owner | Declares |
|---|---|---|
| **Cluster config** (`cluster.yaml` + catalog) | the team, in a repo | what the system *is*: graphs, schemas, queries, policies, storage |
| **Operator config** (`~/.omnigraph/`) | one person, in `$HOME` | who *I* am: identity, credentials, known servers, ergonomics |
plus **flags/env** for the zero-config tier (one graph, one server, no
control plane) — which already works today with no file at all.
`omnigraph.yaml` has no role left once every key has a better home. This
RFC gives each key that home, and stages the retirement so that no working
setup breaks without a loud warning, a migration command, and a full
deprecation cycle first.
## Motivation
- **It breaks the ownership logic.** A config file must have one owner. A
file that carries `graphs:` (team-owned, reviewable) next to `cli.actor`
(one person's identity) and `auth.env_file` (credential loading) can be
neither safely committed nor sensibly personal. Every real deployment
this cycle tripped on it: per-operator copies in `~/exp/intel`,
graph-scoped alias URIs that only make sense per-person, the #139
findings where a checkout-supplied file could redirect tokens.
- **The cluster made it redundant.** Since RFC-005/006, a cluster
deployment serves from the applied catalog — `--cluster` mode does not
read `omnigraph.yaml` *at all*. Stored queries, policies, bindings, and
graph addressing all have authoritative homes. What remains in
`omnigraph.yaml` for cluster users is dead weight that can silently
disagree with what is actually serving.
- **Two declarative dialects is one too many.** `cluster.yaml` and
`omnigraph.yaml` both declare graphs/queries/policies with different
schemas, different validation strictness, and different lifecycle
guarantees. Maintaining, documenting, and testing both — and explaining
when each applies — is a permanent tax (the "programming integrated over
time" lens says: this forks on every config-surface change).
## Non-Goals
- **Breaking anyone now.** Every `omnigraph.yaml` that works today keeps
working through the entire deprecation window, with warnings.
- **Retiring the zero-config tier.** `omnigraph-server s3://bucket/g.omni
--bind …` plus env vars stays first-class forever — that tier needs *no*
file, which is the point.
- **Forcing the control plane on single-graph users.** The migration target
for a multi-graph yaml deployment is a *minimal* cluster (file-rooted,
no bucket required, `cluster.yaml` barely longer than the `graphs:` map
it replaces) — but a single graph never needs even that.
- **Touching `cluster.yaml`** — its schema and strictness are unchanged.
## Where every key goes (the complete migration map)
The full `OmnigraphConfig` surface (verified against
`crates/omnigraph-server/src/config.rs:182-207`):
| `omnigraph.yaml` key | Concern | New home |
|---|---|---|
| `graphs.<name>.uri` | what exists / where | `cluster.yaml` `graphs:` (storage-root-derived) — or a flag/env for the zero-config tier |
| `graphs.<name>.queries`, top-level `queries:` | what exists | cluster catalog (`.gq` discovery, RFC-004/#183) |
| `graphs.<name>.policy.file`, top-level `policy.file`, `server.policy.file` | what's enforced | `cluster.yaml` `policies:` + `applies_to` bindings |
| `server.bind` | deployment runtime | `--bind` / env (already authoritative; the key is a default) |
| `server.graph` | deployment runtime | `--target`-style flag / env in the zero-config tier; meaningless under cluster boot |
| `graphs.<name>.bearer_token_env`, `auth.env_file` | credentials | operator credentials chain (RFC-007 §D4) |
| `cli.actor` | identity | `operator.actor` (RFC-007 §D3) |
| `cli.output_format`, `cli.table_*` | personal ergonomics | `defaults:` in operator config (RFC-007 §D2) |
| `cli.graph`, `cli.branch` | personal targeting | operator config: named servers + a per-operator default target (RFC-007 PR 3) |
| `aliases.<name>` | a personal name conflated with a content pointer | **splits in two** (RFC-007 §D2 "bindings, not content"): the referenced `.gq` file's *content* becomes a catalog stored query (team-reviewed); the *binding* becomes an operator alias referencing that name. `config migrate` proposes both halves but cannot publish catalog content itself — that is a `cluster apply` |
| `query.roots` | discovery convenience | obsolete — cluster query discovery (#183) replaced it |
| `project.name` | label | dropped (the cluster's `metadata.name` is the deployment label) |
Two placements worth defending:
- **Aliases are operator config, not cluster config.** The stored query is
the shared contract (catalog-owned, digest-pinned); an alias is one
person's shorthand with their favorite default params and target. Putting
aliases in the cluster would force team review on personal ergonomics;
leaving them per-directory recreates today's problem. Per-operator,
keyed by server/graph name, is the AWS-profile shape.
- **Multi-graph serving without a control plane migrates to a minimal
cluster, not to a new file.** The honest cost: `cluster import` + `apply`
once, on a `file://` root next to the graphs. The honest benefit: one
declarative dialect, one validation path, one serving source — and the
upgrade path to buckets/approvals is a one-line `storage:` change instead
of a re-platform.
## Deprecation mechanics
Per Hyrum's Law (the repo's own deny-list: shipped observable behavior is
contract), retirement is staged, loud, and tooled:
1. **Warn** *(landed)*. Loading `omnigraph.yaml` emits a one-line deprecation notice
naming the replacement for each key actually present in the file (not a
generic banner — the migration map above, applied to *your* file).
Suppressible per-process (`OMNIGRAPH_SUPPRESS_YAML_DEPRECATION=1`) for
CI logs during the window.
2. **Migrate** *(landed)*. `omnigraph config migrate` reads an existing
`omnigraph.yaml` and writes the split: the team half as a ready-to-review
`cluster.yaml` (+ moves query/policy files into the checkout layout),
the personal half merged into `~/.omnigraph/config.yaml` — printing a
diff-style summary and touching nothing without `--write`. The command
is the test of the migration map's completeness: any key it cannot
place is a bug in this RFC.
3. **Stop scaffolding** *(landed)*. `omnigraph init` stops generating
`omnigraph.yaml` (it scaffolded one into cwd — the source of the
test-pollution bug). **No replacement scaffold**: a minimal
`cluster.yaml` is five lines; a generator would be a second copy of the
schema to keep in sync, producing a file that is unusable until
hand-edited anyway (Terraform has no config scaffolder either). New
users copy from the cluster quick-start; migrants get a ready-to-review
`cluster.yaml` from `config migrate`.
4. **Opt-in strict** *(landed — the release gap to stages 13 collapsed: no version boundary was crossed between them, so all four ship in the same release)*. `OMNIGRAPH_NO_LEGACY_CONFIG=1` turns the warning into
an error — for teams that finished migrating and want regressions caught.
5. **Remove at the next major** *(eased by [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) Phases 45: declared plane capabilities and route alignment shrink the yaml-boot removal diff)*. Loading the file becomes an error pointing
at `config migrate`. The `OmnigraphConfig` code path, the dual
query-registry loaders, and the yaml-mode server boot source are deleted
— the payoff that makes the whole exercise worth it.
Stages 13 can land in one release once RFC-007 PRs 12 exist (the operator
layer must exist before anything can migrate *to* it). Stage 4 the release
after. Stage 5 at the major, with the removal listed in release notes from
stage 1 onward.
## What this deletes, eventually
- The `OmnigraphConfig` struct and its 12-key surface, the
`load_config`/`load_cli_config` pair and its env-side-effect, the
scaffolder, and the legacy resolution paths (`resolve_cli_graph`'s dual
modes — finding #11's root cause).
- The yaml-mode multi-graph server boot (`ServerConfigMode::Multi` keeps
existing — cluster boot constructs it — but its `omnigraph.yaml` source
goes).
- An entire class of documentation ("which file does X go in?") and the
#139 security surface (a checkout cannot hijack what no longer loads).
## Relationship to RFC-007 and RFC-002
RFC-007 ships the operator layer this RFC migrates *to*; its "project
layer" language should be read as transitional — after this RFC, the
project layer **is** the cluster checkout, and RFC-007's PR 3 (project
`server:` references) applies to `cluster.yaml`-adjacent operator targeting
rather than to `omnigraph.yaml`. RFC-002's locator/state-layer work, if
resumed, targets the two-surface world directly. RFC-002's file-naming
decisions (`~/.omnigraph/` as the one dir) are unaffected.
## Open questions
- **Window length**: one minor release between warn (stage 1) and strict
(stage 4), or two? Cookbooks, skills, and the deployment docs all need
the same pass; the migration command makes a short window defensible.
- **`omnigraph login` vs `config migrate` ordering** — both write
`~/.omnigraph/`; whichever lands first establishes the file-locking and
atomic-write helpers the other reuses.
- **Does the MCP server config** (RFC-003) reference `omnigraph.yaml`
anywhere that needs the same treatment? To be audited in stage 1.