feat(cli): cluster-managed maintenance addressing + init signpost (RFC-010 Slice 3) (#221)

* feat(cluster): cluster_root_for_graph_uri detection helper (RFC-010 Slice 3)

Public helper the CLI uses to refuse `init` into a cluster-managed location:
given a graph storage URI of the cluster layout (`<root>/graphs/<id>.omni`),
return the cluster root if `<root>` holds `__cluster/state.json`, else None.

Cheap by construction — a URI that doesn't match the `<root>/graphs/<id>.omni`
shape returns None with zero I/O, so ordinary `init` targets never probe
storage. Works for file:// and s3:// via the storage adapter. Adds two
ClusterStore accessors (`display_root`, `has_state`).
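
A minimal sketch of the "shape first, I/O second" idea described above, for local paths only. The function names `candidate_cluster_root` and `cluster_root_for_local_path` are hypothetical and are not the helper's real internals; the actual `cluster_root_for_graph_uri` is async and goes through the storage adapter so that s3:// works too.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical shape check: returns the candidate cluster root if `path`
/// looks like `<root>/graphs/<id>.omni`. Pure path inspection, no I/O.
fn candidate_cluster_root(path: &Path) -> Option<PathBuf> {
    // The final component must be `<id>.omni` with a non-empty id.
    let file = path.file_name()?.to_str()?;
    let id = file.strip_suffix(".omni")?;
    if id.is_empty() {
        return None;
    }
    // Its parent directory must be named `graphs`.
    let graphs = path.parent()?;
    if graphs.file_name()?.to_str()? != "graphs" {
        return None;
    }
    // Whatever precedes `graphs` is the candidate root.
    Some(graphs.parent()?.to_path_buf())
}

/// Hypothetical local-only detection: storage is probed only after the
/// shape matched, so ordinary `init` targets return `None` without I/O.
fn cluster_root_for_local_path(path: &Path) -> Option<PathBuf> {
    let root = candidate_cluster_root(path)?;
    root.join("__cluster")
        .join("state.json")
        .is_file()
        .then_some(root)
}
```

Splitting the check in two keeps the cheap path obvious: a target such as `plain.omni` fails the shape test and never touches the filesystem.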

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(cli): cluster-managed maintenance addressing + init signpost (RFC-010 Slice 3)

Two cluster-graph-aware CLI behaviors, sharing the cluster-resolution path.

Maintenance addressing. `optimize`/`repair`/`cleanup` gain
`--cluster <dir|s3://…> --cluster-graph <id>`, which resolves the graph's
storage URI from the served cluster snapshot (the same truth a `--cluster`
server boots from — `read_serving_snapshot*`) and opens it embedded. The
operator no longer hand-types `<storage>/graphs/<id>.omni`. A distinct flag is
required because the global `--graph` is `requires = server` and means a remote
multi-graph id. clap enforces both-or-neither and exclusion with the positional
URI / `--target`; an unserved graph errors loudly, pointing at `cluster apply`.
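
The addressing rules above can be modelled in plain Rust, as a sketch only. The `Addressing` enum and `classify` function are hypothetical; in the real CLI these constraints are declared with clap attributes (`requires`, `conflicts_with_all`), and the wording of the conflict error below is invented for illustration.

```rust
/// Hypothetical model of how a maintenance verb's target can be addressed.
#[derive(Debug, PartialEq)]
enum Addressing {
    /// `--cluster <dir|uri> --cluster-graph <id>`
    ClusterGraph { cluster: String, graph_id: String },
    /// Positional storage URI and/or `--target <name>`, resolved as before.
    Local { uri: Option<String>, target: Option<String> },
}

/// Applies the both-or-neither rule and the exclusion with URI / `--target`.
fn classify(
    uri: Option<&str>,
    target: Option<&str>,
    cluster: Option<&str>,
    cluster_graph: Option<&str>,
) -> Result<Addressing, String> {
    match (cluster, cluster_graph) {
        (Some(cluster), Some(graph_id)) => {
            // Cluster addressing excludes the other two ways of naming a graph.
            if uri.is_some() || target.is_some() {
                return Err(
                    "--cluster cannot be combined with a storage URI or --target".to_string(),
                );
            }
            Ok(Addressing::ClusterGraph {
                cluster: cluster.to_string(),
                graph_id: graph_id.to_string(),
            })
        }
        (None, None) => Ok(Addressing::Local {
            uri: uri.map(str::to_string),
            target: target.map(str::to_string),
        }),
        // Exactly one of the two cluster flags was given.
        _ => Err("--cluster and --cluster-graph must be given together".to_string()),
    }
}
```

Matching on the `(cluster, cluster_graph)` pair makes the half-specified case a single catch-all arm, which is the same shape the CLI's defensive fallback takes.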

init signpost. `init` refuses a cluster-managed positional path (the
`<root>/graphs/<id>.omni` layout where `<root>` holds `__cluster/state.json`,
detected by `cluster_root_for_graph_uri`) and points at `cluster apply` — graphs
in an established cluster are created with ledger/recovery/approvals, not by
hand. The check is gated on the path shape, so ordinary `init` does no extra I/O
and existing pre-apply cluster-graph inits are unaffected.

planes guard remediation now also mentions `--cluster … --cluster-graph …`
(the two Slice-1 guard-string tests track it). Docs updated (cli-reference
Command planes, maintenance.md, cluster.md §7); the stale "no S3-hosted cluster
directories" limitation is dropped (RFC-006 landed it).

Tests (cli_cluster.rs, reusing the apply-a-cluster fixture): resolve by id,
unknown-id error, `--cluster` requires `--cluster-graph`, init refusal +
signpost, and ordinary init still works.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(cli): resolve cluster graphs from the state ledger, not the serving snapshot

Addresses the Greptile review on #221. `read_serving_snapshot*` does
all-or-nothing serving validation — recovery-sidecar checks plus a digest
verify of every catalog payload (query .gq, policy blobs). Using it to resolve
a maintenance target coupled `optimize`/`repair`/`cleanup` to the readiness of
unrelated resources: a single corrupt policy blob, or a pending recovery sweep,
would block the command before it could touch the graph — worst for `repair`,
the tool you reach for *when the cluster is degraded*.

Add `omnigraph_cluster::resolve_graph_storage_uri(cluster, graph_id)`: read the
state ledger, confirm the graph is in the applied revision, return
`graph_root(id)` — the URI is deterministically derivable, no catalog
validation. The CLI's cluster resolver now calls it.
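
A rough sketch of the ledger-only lookup, under stated assumptions: `AppliedRevision` is a hypothetical stand-in for whatever the state ledger deserializes to, `resolve_from_ledger` is not the real `resolve_graph_storage_uri` (which is async and returns a diagnostic type), and the error wording is illustrative. Only the `<root>/graphs/<id>.omni` layout is taken from this change.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the applied revision recorded in the state ledger.
struct AppliedRevision {
    graph_ids: BTreeSet<String>,
}

/// Deterministic layout: `<root>/graphs/<id>.omni`.
fn graph_root(cluster_root: &str, graph_id: &str) -> String {
    format!(
        "{}/graphs/{}.omni",
        cluster_root.trim_end_matches('/'),
        graph_id
    )
}

/// Resolves a graph's storage URI from ledger membership alone. Nothing here
/// reads or verifies catalog payloads, so unrelated damage cannot block it.
fn resolve_from_ledger(
    cluster_root: &str,
    applied: &AppliedRevision,
    graph_id: &str,
) -> Result<String, String> {
    if !applied.graph_ids.contains(graph_id) {
        return Err(format!(
            "graph `{}` is not applied in cluster `{}`; run `cluster apply`",
            graph_id, cluster_root
        ));
    }
    Ok(graph_root(cluster_root, graph_id))
}
```

Because the URI is derived rather than looked up in a validated snapshot, the only failure mode in this sketch is an id that the applied revision does not contain.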

Test: `optimize --cluster … --cluster-graph …` still resolves after the catalog
payloads (`__cluster/resources/`) are removed — the ledger-only path is not
blocked by degraded/unrelated catalog state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Andrew Altshuler 2026-06-14 02:52:21 +03:00 committed by GitHub
parent d6cf5b298c
commit 6144bb18d6
13 changed files with 401 additions and 14 deletions


@ -238,6 +238,13 @@ pub(crate) enum Command {
target: Option<String>,
#[arg(long)]
config: Option<PathBuf>,
/// Cluster directory or storage-root URI; with --cluster-graph, resolves
/// the graph's storage URI from the served cluster state.
#[arg(long, conflicts_with_all = ["uri", "target"], requires = "cluster_graph")]
cluster: Option<String>,
/// Graph id within --cluster.
#[arg(long, requires = "cluster")]
cluster_graph: Option<String>,
#[arg(long)]
json: bool,
},
@ -249,6 +256,13 @@ pub(crate) enum Command {
target: Option<String>,
#[arg(long)]
config: Option<PathBuf>,
/// Cluster directory or storage-root URI; with --cluster-graph, resolves
/// the graph's storage URI from the served cluster state.
#[arg(long, conflicts_with_all = ["uri", "target"], requires = "cluster_graph")]
cluster: Option<String>,
/// Graph id within --cluster.
#[arg(long, requires = "cluster")]
cluster_graph: Option<String>,
/// Publish verified maintenance drift. Without this flag, repair only
/// previews what it would do.
#[arg(long)]
@ -268,6 +282,13 @@ pub(crate) enum Command {
target: Option<String>,
#[arg(long)]
config: Option<PathBuf>,
/// Cluster directory or storage-root URI; with --cluster-graph, resolves
/// the graph's storage URI from the served cluster state.
#[arg(long, conflicts_with_all = ["uri", "target"], requires = "cluster_graph")]
cluster: Option<String>,
/// Graph id within --cluster.
#[arg(long, requires = "cluster")]
cluster_graph: Option<String>,
/// Number of recent versions to keep per table. Either `--keep` or
/// `--older-than` (or both) must be set.
#[arg(long)]


@ -513,6 +513,37 @@ pub(crate) fn resolve_local_uri(
Ok(resolve_local_graph(config, cli_uri, cli_target, operation)?.uri)
}
/// Resolve a storage-plane verb's target to a direct storage URI (RFC-010
/// Slice 3). `--cluster <dir|uri> --cluster-graph <id>` resolves the graph's
/// storage URI from the cluster's **applied state ledger** (see
/// `resolve_cluster_graph_uri`); otherwise the ordinary positional-URI /
/// `--target` path. clap enforces both-or-neither and exclusion with
/// `uri`/`--target`, so the mismatched arm is defensive.
pub(crate) async fn resolve_storage_uri(
config: &OmnigraphConfig,
cli_uri: Option<String>,
cli_target: Option<&str>,
cluster: Option<&str>,
cluster_graph: Option<&str>,
operation: &str,
) -> Result<String> {
match (cluster, cluster_graph) {
(Some(cluster), Some(graph_id)) => resolve_cluster_graph_uri(cluster, graph_id).await,
(None, None) => resolve_local_uri(config, cli_uri, cli_target, operation),
_ => bail!("--cluster and --cluster-graph must be given together"),
}
}
/// Look up a graph's storage URI from a cluster's applied state ledger. Uses
/// the lightweight `resolve_graph_storage_uri` (NOT the full serving-snapshot
/// validation), so maintenance — especially `repair` — works even when an
/// unrelated catalog payload is corrupt or a recovery sweep is pending.
async fn resolve_cluster_graph_uri(cluster: &str, graph_id: &str) -> Result<String> {
omnigraph_cluster::resolve_graph_storage_uri(cluster, graph_id)
.await
.map_err(|diagnostic| color_eyre::eyre::eyre!("{}", diagnostic.message))
}
pub(crate) fn resolve_branch(
config: &OmnigraphConfig,
cli_branch: Option<String>,


@ -147,6 +147,16 @@ async fn main() -> Result<()> {
}
}
Command::Init { schema, uri, force } => {
// RFC-010 Slice 3: graphs inside an established cluster are created
// by `cluster apply` (which records ledger/recovery/approvals), not
// by hand-running `init` into the cluster's storage layout.
if let Some(root) = omnigraph_cluster::cluster_root_for_graph_uri(&uri).await {
bail!(
"`{uri}` is inside cluster `{root}`. Graphs in a cluster are created by \
`cluster apply` (which records ledger, recovery, and approvals), not `init`. \
Declare the graph in cluster.yaml and run `cluster apply`."
);
}
let schema_source = fs::read_to_string(&schema)?;
ensure_local_graph_parent(&uri)?;
Omnigraph::init_with_options(
@ -783,10 +793,20 @@ async fn main() -> Result<()> {
uri,
target,
config,
cluster,
cluster_graph,
json,
} => {
let config = load_cli_config(config.as_ref())?;
-let uri = resolve_local_uri(&config, uri, target.as_deref(), "optimize")?;
+let uri = resolve_storage_uri(
+&config,
+uri,
+target.as_deref(),
+cluster.as_deref(),
+cluster_graph.as_deref(),
+"optimize",
+)
+.await?;
let db = Omnigraph::open(&uri).await?;
let stats = db.optimize().await?;
if json {
@ -823,12 +843,22 @@ async fn main() -> Result<()> {
uri,
target,
config,
cluster,
cluster_graph,
confirm,
force,
json,
} => {
let config = load_cli_config(config.as_ref())?;
-let uri = resolve_local_uri(&config, uri, target.as_deref(), "repair")?;
+let uri = resolve_storage_uri(
+&config,
+uri,
+target.as_deref(),
+cluster.as_deref(),
+cluster_graph.as_deref(),
+"repair",
+)
+.await?;
let db = Omnigraph::open(&uri).await?;
let stats = db
.repair(omnigraph::db::RepairOptions { confirm, force })
@ -906,13 +936,23 @@ async fn main() -> Result<()> {
uri,
target,
config,
cluster,
cluster_graph,
keep,
older_than,
confirm,
json,
} => {
let config = load_cli_config(config.as_ref())?;
-let uri = resolve_local_uri(&config, uri, target.as_deref(), "cleanup")?;
+let uri = resolve_storage_uri(
+&config,
+uri,
+target.as_deref(),
+cluster.as_deref(),
+cluster_graph.as_deref(),
+"cleanup",
+)
+.await?;
let older_than_dur = older_than.as_deref().map(parse_duration_arg).transpose()?;


@ -139,7 +139,7 @@ pub(crate) fn guard_addressing(cli: &Cli) -> Result<()> {
// required positional URI), so its remediation drops the `--target` half.
Plane::Storage => match cli.command {
Command::Init { .. } => "Pass a storage URI.",
-_ => "Use --target <name> or a storage URI.",
+_ => "Use --target <name>, a storage URI, or --cluster <dir> --cluster-graph <id>.",
},
Plane::Control => "It operates on a cluster directory (pass --config <dir>).",
Plane::Session => "It does not address a graph.",


@ -950,3 +950,138 @@ graphs:
assert!(!leaked.contains("phantom") && !leaked.contains("9999"), "{leaked}");
}
// ── RFC-010 Slice 3: cluster-managed maintenance addressing + init signpost ──
/// Stand up an applied, served cluster with the `knowledge` graph and return
/// its directory guard. Mirrors the e2e setup (fixture → init → import → apply).
fn applied_knowledge_cluster() -> tempfile::TempDir {
let temp = tempdir().unwrap();
write_cluster_config_fixture(temp.path());
init_cluster_derived_graph(temp.path());
let import = cluster_json(temp.path(), "import");
assert_eq!(import["ok"], true, "{import}");
let apply = cluster_json(temp.path(), "apply");
assert_eq!(apply["converged"], true, "{apply}");
temp
}
#[test]
fn optimize_resolves_a_cluster_graph_by_id() {
let temp = applied_knowledge_cluster();
// No hand-typed storage path: address the graph by cluster dir + id.
let out = output_success(
cli()
.arg("optimize")
.arg("--cluster")
.arg(temp.path())
.arg("--cluster-graph")
.arg("knowledge")
.arg("--json"),
);
let payload = parse_stdout_json(&out);
assert!(
payload["tables"].as_array().is_some(),
"optimize did not run against the resolved cluster graph: {payload}"
);
}
#[test]
fn optimize_unknown_cluster_graph_id_errors() {
let temp = applied_knowledge_cluster();
let out = output_failure(
cli()
.arg("optimize")
.arg("--cluster")
.arg(temp.path())
.arg("--cluster-graph")
.arg("does-not-exist")
.arg("--json"),
);
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(
stderr.contains("is not applied in cluster") && stderr.contains("cluster apply"),
"expected an unapplied-graph error pointing at cluster apply; got: {stderr}"
);
}
#[test]
fn cluster_flag_requires_cluster_graph() {
// clap enforces both-or-neither.
let out = output_failure(
cli()
.arg("optimize")
.arg("--cluster")
.arg(".")
.arg("--json"),
);
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(
stderr.contains("cluster-graph") || stderr.contains("required"),
"expected --cluster to require --cluster-graph; got: {stderr}"
);
}
#[test]
fn init_refuses_a_cluster_managed_path_and_signposts_cluster_apply() {
let temp = applied_knowledge_cluster();
// Hand-init a NEW graph into the established cluster's storage layout.
let out = output_failure(
cli()
.arg("init")
.arg("--schema")
.arg(temp.path().join("people.pg"))
.arg(temp.path().join("graphs").join("sneaky.omni")),
);
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(
stderr.contains("cluster apply"),
"init into a cluster-managed path should signpost `cluster apply`; got: {stderr}"
);
// And it did not create the graph.
assert!(!temp.path().join("graphs").join("sneaky.omni").exists());
}
#[test]
fn init_outside_a_cluster_still_works() {
// Regression guard: ordinary init (no cluster layout) is unaffected.
let temp = tempdir().unwrap();
let schema = fixture("test.pg");
let out = output_success(
cli()
.arg("init")
.arg("--schema")
.arg(&schema)
.arg(temp.path().join("plain.omni")),
);
assert!(stdout_string(&out).contains("initialized"));
}
#[test]
fn optimize_by_cluster_works_when_catalog_payloads_are_degraded() {
// Robustness (Greptile, #221): maintenance resolves the graph URI from the
// state ledger alone, so an unrelated corrupt/missing catalog payload (or a
// pending recovery sweep) does NOT block it — unlike the full serving-snapshot
// read. This is what keeps `repair --cluster` usable on a degraded cluster.
let temp = applied_knowledge_cluster();
// Remove the verified catalog payloads (queries/policies) — a serving read
// would refuse with a catalog-payload diagnostic; the ledger-only resolve
// must not care.
let resources = temp.path().join("__cluster").join("resources");
if resources.exists() {
fs::remove_dir_all(&resources).unwrap();
}
let out = output_success(
cli()
.arg("optimize")
.arg("--cluster")
.arg(temp.path())
.arg("--cluster-graph")
.arg("knowledge")
.arg("--json"),
);
assert!(
parse_stdout_json(&out)["tables"].as_array().is_some(),
"optimize should resolve via the ledger despite degraded catalog payloads"
);
}


@ -165,7 +165,7 @@ fn optimize_with_server_flag_errors_wrong_plane() {
assert!(
stderr.contains("`optimize` is a storage-plane command")
&& stderr.contains("--server/--graph address the data plane and do not apply")
-&& stderr.contains("Use --target <name> or a storage URI."),
+&& stderr.contains("Use --target <name>, a storage URI, or --cluster <dir> --cluster-graph <id>."),
"wrong-plane guard message not found; got: {stderr}"
);
}


@ -121,7 +121,7 @@ fn schema_plan_with_server_flag_errors_wrong_plane() {
let stderr = String::from_utf8_lossy(&output.stderr);
assert!(
stderr.contains("`schema plan` is a storage-plane command")
-&& stderr.contains("Use --target <name> or a storage URI."),
+&& stderr.contains("Use --target <name>, a storage URI, or --cluster <dir> --cluster-graph <id>."),
"schema plan wrong-plane message not found; got: {stderr}"
);
}