omnigraph/docs/dev/rfc-010-cli-planes-restructure.md
Andrew Altshuler 2ddb88fad9
docs(rfc): RFC-010 — apply verification-comment current-state fixups (#215)
Folds in the Codex verification review (kept verbatim with per-point
Resolution notes):

- `graphs list` is marked remote-only today in the current-state table
  (the embedded arm bails; it rides GraphClient only to share the resolver).
- `init` is noted as positional-URI-only today (no `--target`); adding
  `--target` to init is part of the proposal, entangled with the
  init→cluster apply signpost, not current state.
- Validated-fact #1 now describes the post-collapse reality
  (`GraphClient::resolve*`; only the two factories call `apply_server_flag`),
  dropping the stale "16 call sites" count.
- The Authority rule carries a flag-shape caveat: `--graph` is already a
  global flag requiring `--server`, so the cluster-managed resolver and its
  flag shape are deferred to a later slice; the illustrative
  `--cluster <dir> --graph <id>` spelling is marked not-final.

Docs-only; no code change.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 22:24:09 +03:00

22 KiB
Raw Blame History

RFC: Restructure the CLI Around Explicit Planes

Status: Proposed Date: 2026-06-13 Audience: CLI/server/cluster maintainers Builds on: rfc-009-unify-access-paths.md (Phases 3a3c landed — the embedded/remote data-plane fork is now one GraphClient enum; this RFC expands RFC-009 Phase 4 from a narrow embedded-vs-remote capability table into the full plane model, and leaves Phase 5 route alignment where it is), rfc-007-operator-config.md (operator --server/--graph/--target addressing — the surfaces this RFC makes uniform across planes), rfc-008-deprecate-omnigraph-yaml.md. Sequencing: post-v0.7.0, after RFC-009 Phase 3c (done).

Summary

The CLI silently spans three planes — data, storage/maintenance, and control — and forces the operator to know which plane each verb lives on and address a graph differently per plane. The same graph you query as --server prod --graph knowledge you must maintain as s3://bucket/knowledge.omni. Plane restrictions (graphs list is server-only, optimize is storage-only) are accidental — discovered by hitting a cryptic error, not declared.

This RFC makes the plane model explicit and coherent with three moves:

  1. One graph-addressing model across every verb (--target/--graph/ positional URI/--server), resolving to a storage URI for maintenance and a remote client for data — instead of two different ways to name one graph.
  2. A declared, per-subcommand capability surface (RFC-009 Phase 4): each verb declares its plane(s); wrong-plane invocations get an honest "this is storage-plane, --server doesn't apply" error from one table, not scattered bail!s.
  3. Plane-grouped --help so the model is legible at a glance.

No new server feature. Storage maintenance stays off the wire — deliberately.

Current state of affairs

The CLI has 23 top-level commands. They divide into three planes, addressed three different ways:

Plane Verbs Reaches the graph by Addressing surface
Data query, mutate, load, ingest, branch *, snapshot, export, commit *, schema show/apply (and graphs list, remote-only today — see note) embedded engine or HTTP server (one GraphClient) positional URI or --target / --graph / --server (config aliases)
Storage / maintenance init, optimize, repair, cleanup, schema plan, queries validate embedded engine only, directly on storage (file:// or s3://) positional URI or --targetno --server / --graph (except init, which today takes only a required positional URI — no --target)
Control cluster validate/plan/apply/approve/status/refresh/import/force-unlock a cluster directory (file:// or s3://), not a graph URI --config <dir>

What's confusing (validated facts)

  1. Two names for one graph. Data verbs resolve --server prod --graph knowledge through GraphClient::resolve* (the embedded/remote fork collapsed in RFC-009 Phases 3a3c; only the two GraphClient factories call apply_server_flag). Maintenance verbs instead use resolve_uri/resolve_local_uri and accept only a positional URI or --target — so to compact the graph you query as --server prod --graph knowledge you must type s3://bucket/knowledge.omni. One graph, two addressing vocabularies.

    Note (graphs list). It is routed through GraphClient only to share the addressing/token resolver; its embedded arm fails loudly, so it is remote-only today (the later capability table and Relationship to RFC-009 record it as remote-now / embedded-cluster-later).

  2. Plane restrictions are accidental, not declared. graphs list is server-only and optimize/repair/cleanup/init are storage-only purely by code shape. Point optimize at an https:// URL and you get whatever Omnigraph::open says about an https URI — accidental error text that, per Hyrum's Law, is already someone's dependency. The capability is real but unstated.

  3. The split is per-subcommand, and the family names hide it. schema plan is storage-only (resolve_local_uri) while schema show/schema apply are data-plane (the graph client). queries validate opens the graph to typecheck while queries list only reads the registry config. The plane is a property of the subcommand, not the family.

  4. Maintenance has no server/cluster counterpart at all. There is no HTTP route and no cluster subcommand for optimize/cleanup/repair (verified: nothing in the server route table, nothing in omnigraph-cluster/src). For a server-backed deployment you run the same CLI against the storage URI, out-of-band from the serving process. This is correct (maintenance is heavyweight, destructive, single-operator — it should not be a multi-tenant HTTP surface), but it is undocumented in the CLI's own shape, so it reads as an omission rather than a decision.

  5. init has a hidden control-plane twin. Bare init creates a single graph from storage; in cluster mode the equivalent is cluster apply (graph-creation stage, with ledger/recovery/approval semantics). Same intent, two entry points, no signpost between them.

  6. Flat --help. All 23 commands list as one undifferentiated block, so the plane a verb belongs to is tribal knowledge.

The net effect: a new operator must already know OmniGraph's plane architecture to predict which flags work on which verb and how to name a graph. The CLI does not teach its own model.

Target CLI ergonomics

The throughline: you name a graph one way, and the CLI tells you what works where. Simple examples of the end state:

One name for a graph, everywhere

A config target knowledge works on every verb that touches that graph:

omnigraph query    --target knowledge --query q.gq          # data (embedded or remote, auto)
omnigraph load     --target knowledge --data rows.jsonl     # data
omnigraph optimize --target knowledge                       # maintenance (resolves to its storage URI)
omnigraph cleanup  --target knowledge --keep 10 --confirm
omnigraph repair   --target knowledge --confirm

The positional URI form still works everywhere, unchanged:

omnigraph optimize s3://bucket/knowledge.omni

Data plane: same command, embedded or remote

You don't pick "local vs server" syntax — resolution decides:

omnigraph query ./local.omni                     --query q.gq   # opens engine directly
omnigraph query --server prod --graph knowledge  --query q.gq   # over HTTP
omnigraph query --target knowledge               --query q.gq   # whichever the config says

Maintenance: --target must resolve to direct storage (loud if not)

$ omnigraph optimize --target prod
error: `--target prod` resolves to a remote server (https://prod…).
       `optimize` is a storage-plane command and needs direct storage access.
       Pass the graph's s3://… URI, or use --cluster <dir> --graph <id>.

Cluster-managed graphs get an explicit, intentional path (no implicit cluster.yaml peeking):

omnigraph optimize --cluster ./cluster --graph knowledge

Wrong-plane = one honest, stable error

$ omnigraph optimize --server prod
error: `optimize` is a storage-plane command; `--server` addresses the data
       plane and does not apply here. Use --target <name> or a storage URI.

$ omnigraph graphs list ./local.omni
error: `graphs list` needs a remote multi-graph server (http/https) today.
       (Embedded cluster-catalog enumeration is planned — RFC-009.)

--help teaches the model

DATA PLANE        run against a graph (embedded or --server)
  query  mutate  load  branch  snapshot  export  commit  schema show  schema apply

STORAGE / MAINTENANCE   direct storage access; no server
  init  optimize  repair  cleanup  schema plan  queries validate

CONTROL PLANE     manage a cluster directory
  cluster

INSPECT / SESSION
  graphs list  queries list  lint  policy  embed  login  logout  config

Exceptions, signposted (not silent)

omnigraph init --schema s.pg ./new.omni            # plain path: fine

$ omnigraph init --target knowledge --schema s.pg  # cluster-managed target: redirected
error: `knowledge` is a cluster-managed graph. Create it via `cluster apply`
       (which records ledger + recovery + approvals), not `init`.

In one line: one way to name a graph, the right flags accepted per verb, and a CLI that tells you its planes instead of making you memorize them.

Proposed shape (mechanism)

One addressing model for every graph-addressing verb

Route all graph-addressing verbs — data and maintenance — through one resolver that turns (positional URI | --target | --graph | --server) into either a storage URI (file:///s3://) → embedded execution, or a remote GraphClient → HTTP execution, per the verb's declared plane.

Authority rule (the precedence must not be silent). --target is an operator/legacy target lookup; cluster.yaml is a different authority surface (read only by cluster commands and --cluster boot). A maintenance verb must not quietly consult both and invent a precedence. The rule:

  • A maintenance verb's --target resolves through the operator/legacy config and its URI must already be direct storage; a target that resolves to a remote (http(s)://) URL fails loudly (see the example above).

  • Cluster-managed graphs are addressed explicitly via a cluster-root + graph-id pair (spelled --cluster <dir> --graph <id> for illustration), so reading cluster state is an intentional mode — never an implicit fallback between operator config and cluster.yaml.

    Flag-shape caveat (deferred). --graph is already a global flag that requires = "server" and appends /graphs/<id> to a remote URL — a different meaning, and clap won't permit --graph without --server. So the cluster-maintenance addressing needs either a distinct flag (e.g. --cluster-graph <id>) or an explicit global-flag migration. This is why the cluster-managed resolver is deferred to a later slice (it also rides the applied-state-vs-declared-config open question below); the operator/legacy --target path lands first.

A declared, per-subcommand capability surface (RFC-009 Phase 4, expanded)

One table, per subcommand (family-level rows hide exactly the cases the table exists to make non-accidental):

Command Data (embedded) Data (remote) Storage (direct) Config / session Notes
query, mutate, load, ingest ingest is the deprecated alias of load
branch create/list/delete/merge
snapshot, export, commit list/show
schema show
schema apply declarative alternative: cluster apply
schema plan local resolver today
queries validate opens the graph to typecheck
init cluster-managed graphs → cluster apply
optimize, repair, cleanup
graphs list (later) remote today; embedded-cluster later (RFC-009)
queries list reads the registry config; no graph
lint --schema file, or opens a local graph
policy validate/test/explain reads policy files + config
embed local tooling (files + embedding API)
login, logout, config, version session / config; no graph

The resolver consults this table. A wrong-plane invocation produces one honest, stable message instead of N ad-hoc bail!s and accidental open errors.

Plane-grouped --help

Group the command list by plane (the --help block shown under Target CLI ergonomics). Cosmetic, zero behavior change, highest legibility-per-line.

Maintenance stays off the wire (decision, not omission)

This RFC does not add server routes for optimize/cleanup/repair:

  • Serving = the server. Multi-tenant, safe-for-many-callers data plane.
  • Storage maintenance = the CLI against storage, addressed uniformly, run by an operator or a scheduled job with storage access.

Adding maintenance-over-HTTP would re-introduce a heavyweight, destructive multi-tenant surface and add a plane rather than clarify the three we have. A future cluster-driven maintenance reconciler (scheduled compaction/GC as a control-plane policy) is explicitly out of scope — net-new design (who runs it, with what resource bounds), not a CLI restructure.

init is an explicit exception (decision)

Direct-storage init against a plain URI/target stays. But if a target resolves to a cluster-managed graph root, init refuses and signposts cluster apply (which records ledger, recovery, and approval artifacts) rather than initializing that root out of band. This closes the "hidden twin" of the current state.

Compatibility

Additive and low-risk:

  • --target/--graph on maintenance verbs is new capability; the positional URI form keeps working unchanged.
  • Grouped --help is cosmetic.
  • Capability-surface error text changes the message you get on a wrong-plane or misaddressed invocation. Per Hyrum's Law that text is observable; the change is deliberate, release-noted, and replaces an accidental Omnigraph::open string with a stable, declared one — a net improvement, but flagged.

No engine, server, or wire-protocol change. The work is CLI-internal: the shared resolver, the capability table, and help grouping.

Test plan

Extend the existing CLI suites rather than adding a duplicate harness:

  • parity_matrix.rs — capability exclusions (the per-subcommand plane table becomes the source of truth for which verbs are remote-only / storage-only).
  • cli_data.rs — maintenance wrong-plane errors (optimize --server, optimize --target <remote>), and --target resolving to direct storage.
  • cli_schema_config.rsgraphs list plane behavior, schema plan vs schema show/apply plane split, and plane-grouped --help output.
  • system_local.rs--server / operator-targeting edge cases end-to-end.

Pin the new wrong-plane error strings deliberately: this RFC is intentionally replacing accidental Omnigraph::open strings with stable capability errors, and those strings become observable behavior (Hyrum).

Relationship to RFC-009

RFC-009 Phase 4 was scoped as "declared plane capabilities" for the embedded-vs-remote axis only. This RFC subsumes and broadens that phase into the full three-plane, per-subcommand model (adds uniform maintenance addressing, the authority rule, and help grouping). RFC-009 Phase 5 (remote load/load route alignment) is unaffected and remains in RFC-009.

graphs list reconciliation: RFC-009's answered open question (pinned in parity_matrix.rs's exclusions comment) targets graphs list becoming Both-capability once the embedded arm enumerates the cluster catalog. This RFC aligns with that rather than superseding it: the capability table shows graphs list as remote today, embedded-cluster later.

Open questions

  1. Capability-table location — a CLI-internal const, or surfaced (e.g. in --help and a machine-readable omnigraph capabilities for tooling)?
  2. --cluster <dir> --graph <id> for maintenance — does the maintenance command resolve the storage URI from the applied cluster state, or from the declared cluster.yaml? (Applied state is the truth the server serves; declared config may be ahead of it.)

Review comments (Codex, 2026-06-13)

Overall take: the direction is right. The planes already exist; making them declared in code, help text, and error messages should reduce operator surprise. Keeping storage maintenance off HTTP is also the right boundary: optimize, repair, and cleanup are direct-storage operator actions, not a multi-tenant serving surface.

Before implementation, tighten these points:

  1. Resolver authority needs a sharper rule. The proposal says maintenance resolves storage URIs "from cluster.yaml / operator config", but those are different authority surfaces. Today --target is an operator/legacy graph-target lookup; cluster config is read by cluster commands and by --cluster server boot. Do not make a maintenance command silently consult both and pick a precedence. Either:

    • --target on maintenance means an operator/legacy target whose URI is already direct storage, with remote targets failing loudly; or
    • add an explicit cluster-root/config resolver for this case, so reading cluster state is an intentional mode.

    Resolution (accepted): both — --target resolves through operator/legacy config and must be direct storage (remote → loud fail); cluster-managed graphs use the explicit --cluster <dir> --graph <id> resolver. See Authority rule under Proposed shape.

  2. graphs list conflicts with RFC-009's target shape. This RFC classifies graphs list as remote-only, while RFC-009's answered open question says it becomes Both-capability once the embedded arm enumerates the cluster catalog. Pick one direction here: either this RFC explicitly supersedes that target, or the capability table should show graphs list as remote today and embedded-cluster later.

    Resolution (accepted): align, don't supersede. The table shows graphs list remote-today / embedded-cluster-later. See Relationship to RFC-009.

  3. The capability table should be per subcommand, not per family. The family-level rows hide the exact cases the table is supposed to make non-accidental. At minimum, call out:

    • schema plan as local/storage-backed today, while schema show and schema apply route through the graph client;
    • queries validate versus queries list, which do not have the same plane shape;
    • lint, policy, embed, login, logout, config, and version, so enumeration/session/tooling commands are intentionally classified instead of falling outside the model.

    Resolution (accepted): the capability table is now per-subcommand and classifies every command, including the session/tooling group.

  4. init should be an explicit exception. Direct-storage init is fine. A cluster-managed graph should be created by cluster apply, with ledger, recovery, and approval semantics. If a named target resolves to a cluster-managed graph root, init should signpost cluster apply rather than quietly initializing that root out of band.

    Resolution (accepted): promoted from open question to a decision. See init is an explicit exception.

Testing notes for the implementation slice:

  • Extend the existing CLI suites rather than adding a new duplicate harness: parity_matrix.rs for capability exclusions, cli_data.rs for maintenance wrong-plane errors, cli_schema_config.rs for graphs list / help behavior, and system_local.rs for --server / operator-targeting edge cases.

  • Pin the new wrong-plane error strings deliberately. This RFC is intentionally replacing accidental Omnigraph::open strings with stable capability errors, and those strings become observable behavior.

    Resolution (accepted): captured as the Test plan section.

Verification comments (Codex, 2026-06-13)

Follow-up verification against the current CLI/server code found a few remaining current-state nits. These are doc-shape issues, not objections to the proposal:

  1. Current-state table overstates graphs list. The table under Current state of affairs still lists graphs list with data verbs that reach the graph by embedded engine or HTTP. Current code routes it through GraphClient only to share the resolver, but the embedded arm fails loudly; the later RFC text correctly says remote today / embedded-cluster later. Make the current-state row match that.

    Resolution (accepted): the Data row now marks graphs list remote-only today, with a note that it rides GraphClient only to share the resolver.

  2. Current-state table overstates init addressing. init is grouped with maintenance verbs whose addressing surface is positional URI or --target. Current init only accepts a required positional URI and has no --target or config path. The proposal can add that capability, but the current-state table should not describe it as already present.

    Resolution (accepted): the Storage row now calls out that init takes only a required positional URI today (no --target); adding --target to init is part of the proposal, entangled with the initcluster apply signpost, not current state.

  3. apply_server_flag call-site count is stale. The text says data verbs resolve --server prod --graph knowledge through apply_server_flag at 16 call sites. Current code has the fork collapsed: data verbs call GraphClient::resolve*, and only the two GraphClient factories call apply_server_flag. Rephrase the verified fact around GraphClient, not the old pre-collapse call-site count.

    Resolution (accepted): validated-fact #1 now describes the post-collapse reality (GraphClient::resolve*; the two factories call apply_server_flag), dropping the stale count.

  4. --cluster <dir> --graph <id> collides with today's global --graph semantics. The target ergonomics section proposes that flag shape for maintenance, but current --graph is a global flag that requires --server and appends /graphs/<id> to a remote server URL. Either choose a separate cluster-maintenance graph flag shape, or call out the clap/global flag migration explicitly as part of the implementation.

    Resolution (accepted): the Authority rule now carries a flag-shape caveat — the cluster-managed resolver (and its flag shape, e.g. --cluster-graph vs a --graph migration) is deferred to a later slice; the operator/legacy --target path lands first. The illustrative --cluster <dir> --graph <id> spelling is marked as not-final.