Commit graph

7 commits

Author SHA1 Message Date
Ragnor Comerford
937fd6382d
mr-668: remove POST /graphs and CLI graphs create (defer runtime graph mgmt)
The POST /graphs runtime-create endpoint shipped in PR 7/10 has three
unresolved high-severity bugs:

  - flock-on-renamed-inode race: the YAML flock is taken on
    omnigraph.yaml itself, then a temp file is renamed over it.
    Cross-process writers end up locking different inodes — both
    believing they hold exclusive access.
  - duplicate-check outside the file lock: precheck runs against
    the in-memory registry only; the locked closure does
    config.graphs.insert(...) unconditionally. Concurrent same-id
    POSTs can persist the loser in YAML while the in-memory registry
    keeps the winner — they disagree after restart.
  - best_effort_cleanup_init_artifacts deletes _schema.pg /
    _schema.ir.json / __schema_state.json on any init failure. An
    accidental re-init against an existing graph's URI destroys its
    schema; subsequent open() fails at read_text(_schema.pg).

The correct fix is a Lance-style cluster catalog (reserve → init →
publish with recovery sidecars), parallel to the engine's existing
__manifest discipline. That work is out of scope for v0.7.0.

For now, disable runtime add/remove from the network and CLI surface.
Operators add graphs by editing omnigraph.yaml and restarting. The
GET /graphs read-only enumeration stays.

Removed:
- POST /graphs handler + router fragment + utoipa registration
- 13 post_graphs_* server tests + 3 composite POST tests +
  multi_mode_app_with_real_config / post_graph helpers
- CLI omnigraph graphs create subcommand + its handler + cli.rs tests
- system_remote.rs combined list+create test trimmed to list-only
- YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError,
  staging_path, hash_config_file, AppState::config_hash field +
  threading through new_multi and open_multi_graph_state
- fs2 dependency (verified absent from cargo tree)
- sha2/fs2 imports in config.rs (only the rewrite path used them)
- Cedar PolicyAction::GraphCreate variant + "graph_create" match arms
  + action def in Cedar schema + graph_create_action_authorizes_against_server_resource test
- GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec /
  GraphPolicySpec API types (only the POST handler / CLI imported them)

Kept:
- GET /graphs (read-only enumeration) and graph_list Cedar action
- omnigraph graphs list CLI subcommand
- All multi-graph startup, mode inference, cluster routes,
  per-graph + server-level Cedar policies
- server_settings_drive_multi_graph_startup_end_to_end (the test
  that covers operator-authored YAML + restart — the path that
  survives)
- best_effort_cleanup_init_artifacts and the three init failpoints
  (still reachable from CLI `omnigraph init`; preflight fix deferred
  as a follow-up)
- GraphRegistry::insert and its concurrency tests — production
  callers gone, but the method is the natural seam for the future
  cluster-catalog work

Also fixed (transcript issue 4):
- ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI
  advertises the management route correctly (was previously rewritten
  to /graphs/{graph_id}/graphs)
- multi_mode_openapi_keeps_healthz_flat → renamed to
  multi_mode_openapi_keeps_management_paths_flat, asserts both
  /healthz and /graphs stay flat
- multi_mode_openapi_prefixes_operation_ids_with_cluster skips
  /graphs in addition to /healthz

Doc fixes:
- docs/user/cli.md: graphs list example was --target http://...,
  but --target is a config-graph-name lookup; corrected to --uri.
  Removed the graphs create example.
- docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml
  ownership", and "POST /graphs body shape" sections. Added a
  paragraph stating runtime add/remove is not exposed in v0.7.0.
- docs/user/policy.md: dropped graph_create action; reworded the
  "Configuration" line to clarify that server-scoped rules (graph_list)
  take neither branch_scope nor target_branch_scope.
- docs/releases/v0.7.0.md: rewrote release narrative — multi-graph
  mode ships; runtime add/remove deferred.
- AGENTS.md: HTTP server bullet and capability matrix row updated to
  reflect read-only GET /graphs and the operator-edit workflow.
- openapi.json regenerated; /graphs has only .get, no .post.

Diff: 17 files, +123 −1525 LOC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 17:49:38 +02:00
Ragnor Comerford
d11c18fb27
mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10)
PR 9 — the final integration PR for MR-668 multi-graph server work.
Closes the v0.7.0 release.

Composite lifecycle tests (closes gaps flagged in PR 7's coverage
review):
  - `multi_graph_lifecycle_post_query_restart_persistence` — POST a
    graph, query it via cluster route, reload the config from disk
    and confirm `load_server_settings` sees the rewritten YAML.
    Validates the "restart resolves orphans" failure-mode story.
  - `per_graph_policy_enforced_on_post_created_graph` — POST a graph
    with a per-graph policy attached, then send authenticated read
    and change requests. Per-graph Cedar enforcement fires correctly
    on a POST-created graph (engine-layer policy reinstalled via
    `Omnigraph::with_policy` inside the create flow).
  - `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent
    POSTs with distinct graph_ids all return 201. Caught a real
    race in `rewrite_atomic` (see below).

Race fix — `rewrite_atomic_with_modify`:

The first composite test surfaced a real bug. The old
`rewrite_atomic(path, new_config, expected_hash)` captured the
baseline hash OUTSIDE the flock, then called rewrite_atomic which
re-acquired it inside. Under concurrent writers:

  - POST A: captures baseline H0, calls rewrite_atomic.
  - POST B: captures baseline H0 too (before A's update lands).
  - A: acquires flock, on-disk == H0, writes H1, releases.
  - A: updates baseline H0 → H1.
  - B: tries to acquire flock — waits.
  - B: acquires flock. On-disk is now H1. Expected (captured
       before A finished) is H0. MISMATCH → spurious Drift error.

Worse: even if the timing happens to align, B's `updated` config
was constructed from BYTES read before the flock. B writes a config
that doesn't include A's new graph — silent data loss.

The fix: new `config::rewrite_atomic_with_modify(path, baseline,
modify)` takes a closure. Inside the flock + baseline mutex:
  1. Read on-disk bytes, hash, compare to baseline.
  2. Parse on-disk YAML.
  3. Call `modify(parsed)` to produce the new config — receives
     fresh on-disk state, returns the modification.
  4. Serialize + write + fsync + rename + update baseline.

Everything is read-modify-write under the same critical section.
Concurrent writers serialize cleanly. Test confirmed this is no
longer a race.

The old `rewrite_atomic(path, new_config, expected_hash)` API stays
for tests that don't need the read-modify-write shape; the POST
handler switches to the new shape.

Version bump v0.6.0 → v0.7.0:
  - All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server)
    plus their inter-crate `path` dep version constraints.
  - `Cargo.lock` regenerated by `cargo build --workspace`.
  - `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server
    row updated to mention multi-graph + cluster routes + atomic YAML
    rewrite.
  - `openapi.json` regenerated.

Docs:
  - `docs/releases/v0.7.0.md` (new) — release notes with breaking
    changes, new features, deferred items (DELETE, `delete_prefix`,
    actor forwarding), and the single→multi migration recipe.
  - `docs/user/server.md` — substantial section additions for the
    two modes, mode inference, cluster endpoint table, management
    endpoints, `omnigraph.yaml` ownership contract, `POST /graphs`
    body shape + status codes.
  - `docs/user/cli.md` — `omnigraph graphs list/create` section,
    deferred-DELETE note.
  - `docs/user/policy.md` — server-scoped Cedar actions
    (`graph_create`, `graph_list`), per-graph vs server-level policy
    composition, example server-level policy.

Workspace test pass: 573 tests green across all crates. Zero
failures. MR-731 spoof regression still pinned and passing across
the entire 10-PR series.

This commit closes MR-668. v0.7.0 is ready for tagging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:32:49 +02:00
Ragnor Comerford
a4e6cb689a
mr-668: POST /graphs runtime create endpoint (PR 7/10)
PR 7 of the MR-668 multi-graph server work. Operators can now add a
graph to a running multi-graph server without restarting:

  curl -X POST http://server/graphs \
    -H "Content-Type: application/json" \
    -d '{
          "graph_id": "beta",
          "uri": "/data/beta.omni",
          "schema": { "source": "node Person { name: String @key }\n" },
          "policy": { "file": "./policies/beta.yaml" }
        }'

DELETE remains deferred (out of v0.7.0 scope per the trimmed plan —
no `delete_prefix`, no tombstones).

Body shape (decision 7):
  - Nested `schema: { source: "..." }` (mirrors the `policy: { file }`
    pattern; leaves room for future fields without breakage).
  - Optional nested `policy: { file: "..." }` for per-graph Cedar.
  - 32 MiB body limit (reuses `INGEST_REQUEST_BODY_LIMIT_BYTES`).
  - Asymmetric with `SchemaApplyRequest` which keeps flat
    `schema_source: String` — documented in api.rs.

Atomic YAML rewrite + drift detection:
  - New `config::rewrite_atomic(path, new_config, expected_hash)`:
    flock → re-read + hash check → serialize → write `.tmp` → fsync
    → rename → fsync parent dir. Returns the new hash for the caller
    to update its in-memory baseline.
  - New `config::hash_config_file(path)` — SHA-256 of the on-disk
    bytes, used at startup and after each rewrite.
  - New `RewriteAtomicError { Drift | Io | Serialize }` enum.
  - `AppState.config_hash: Option<Arc<Mutex<[u8;32]>>>` carries the
    in-memory baseline. Updated after every successful rewrite so
    subsequent POSTs don't false-trigger drift.
  - The mutex is `std::sync::Mutex` (brief critical section, no .await
    inside). The flock itself serializes file access process-wide
    AND across multiple server instances (defense in depth).
  - All sync I/O runs inside `tokio::task::spawn_blocking` — flock
    is sync.

Handler ordering (the load-bearing sequence):
  1. Mode check: 405 in single mode.
  2. Cedar authorize: `GraphCreate` against `Omnigraph::Server::"root"`.
  3. Validate body: `GraphId::try_from` (regex + reserved-name), empty
     schema/uri checks, per-graph policy file parse.
  4. Pre-check registry for duplicate graph_id / duplicate uri (409).
  5. `Omnigraph::init` the new engine.
  6. Atomic YAML rewrite (drift detection inside).
  7. Publish in registry (atomic re-check via `GraphRegistry::insert`).

Failure modes (documented in handler rustdoc):
  - Init fails → orphan storage at `req.uri` (PR 2a cleans up schema
    files; Lance datasets remain orphans until `delete_prefix` lands).
  - YAML rewrite fails (drift, IO) → orphan storage; YAML unchanged.
  - Registry insert fails (race) → YAML has entry but registry doesn't;
    next restart opens it cleanly.

New dependency: `fs2 = "0.4"` (workspace + omnigraph-server). POSIX-only
file locking. Linux/macOS deployment supported; Windows out of scope.

Tests (10 new in `tests/server.rs::multi_graph_startup`):
  - `post_graphs_creates_a_new_graph_end_to_end` — happy path, includes
    YAML inspection to confirm the rewrite landed.
  - `post_graphs_baseline_hash_updates_between_rewrites` — two POSTs in
    a row both succeed (drift baseline updates correctly).
  - `post_graphs_duplicate_graph_id_returns_409`
  - `post_graphs_duplicate_uri_returns_409`
  - `post_graphs_invalid_graph_id_returns_400` (reserved name)
  - `post_graphs_empty_schema_source_returns_400`
  - `post_graphs_returns_405_in_single_mode`
  - `post_graphs_yaml_drift_detection_returns_503` — operator hand-edits
    omnigraph.yaml; server refuses to clobber.
  - `hash_config_file_is_deterministic_and_detects_changes`
  - `rewrite_atomic_refuses_when_hash_drifts`

OpenAPI: `server_graphs_create` registered in `ApiDoc::paths(...)`;
openapi.json regenerated.

Result: 225 server tests green (74 lib + 66 openapi + 85 integration),
all MR-731 regressions still pinned.

LOC: ~580 lib.rs net (handler + helpers), ~120 config.rs (rewrite
machinery), +71 api.rs (request/response shapes), +332 tests/server.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:38:58 +02:00
Ragnor Comerford
4df361d152
mr-668: multi-graph startup + mode inference (PR 5/10)
PR 5 of the MR-668 multi-graph server work. This is the first PR that
makes multi mode actually usable end-to-end: operators invoking
`omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:`
map and no single-mode selector now get a running multi-graph server.

Mode inference (MR-668 decision 2, four-rule matrix in
`load_server_settings`):
  1. CLI `<URI>` positional        → Single
  2. CLI `--target <name>`         → Single (URI from graphs.<name>)
  3. `server.graph` in config      → Single (URI from graphs.<name>)
  4. `--config` + non-empty `graphs:` + no single-mode selector
                                   → Multi (all entries in `graphs:`)
  5. otherwise                     → error with migration hint

Rule 5's error message names every escape hatch so operators can fix
their invocation without grepping docs.

Config schema extensions:
  - `TargetConfig.policy: PolicySettings` (per-graph Cedar policy file).
    `#[serde(default)]` so existing single-graph YAMLs keep parsing.
  - `ServerDefaults.policy: PolicySettings` (server-level Cedar policy
    for management endpoints — loaded in PR 5, wired into `GET /graphs`
    in PR 6b).
  - `OmnigraphConfig::resolve_target_policy_file(name)` and
    `resolve_server_policy_file()` helpers — both resolve relative to
    the config file's `base_dir`.

Public types added to `omnigraph-server`:
  - `ServerConfigMode { Single { uri, policy_file } | Multi { graphs,
    config_path, server_policy_file } }`.
  - `GraphStartupConfig { graph_id, uri, policy_file }` — one entry
    per graph in multi mode.

`ServerConfig` shape change:
  - WAS: `{ uri: String, bind, policy_file, allow_unauthenticated }`.
  - NOW: `{ mode: ServerConfigMode, bind, allow_unauthenticated }`.
  - Breaking for any code that constructs `ServerConfig` directly.
    `main.rs` is unaffected (uses `load_server_settings`).

`serve()` now forks on `ServerConfig.mode`:
  - Single: existing flow via `AppState::open_with_bearer_tokens_and_policy`.
  - Multi: parallel open via `futures::stream::iter(graphs)
    .map(open_single_graph).buffer_unordered(4).collect()`. Bound 4 is
    a rule-of-thumb for I/O-bound work — at N≤10 this trades startup
    latency for a small amount of concurrent S3/Lance open pressure.
    Fail-fast: first open error aborts startup; in-flight opens drop
    their engine via Arc (Lance datasets close cleanly).

New helper `open_single_graph(GraphStartupConfig)`:
  - Validates `GraphId` per the regex in PR 1.
  - `Omnigraph::open(uri).await` with descriptive error context.
  - Loads per-graph policy file and re-applies it via
    `Omnigraph::with_policy` (engine-layer enforcement, MR-722).
  - Returns `Arc<GraphHandle>` ready for the registry.

Routing middleware bug fix:
  - `Router::nest("/graphs/{graph_id}", inner)` rewrites
    `request.uri().path()` to the inner suffix (e.g. `/snapshot`).
    The previous middleware tried to parse `{graph_id}` from
    `request.uri().path()` and got 400 instead of 200. Fixed by reading
    from `axum::extract::OriginalUri` request extension, which preserves
    the pre-rewrite URI.
  - Caught by the two new tests
    `cluster_routes_dispatch_per_graph_handle` and
    `cluster_route_for_unknown_graph_returns_404`.

Tests (14 new, all passing):
  - Four-rule matrix: one test per branch + the joint case
    `mode_inference_cli_uri_overrides_graphs_map` + the empty-graphs-map
    error case.
  - Per-graph + server-level policy file path resolution.
  - Reserved `GraphId` rejection at startup.
  - End-to-end multi-graph routing: two graphs side by side, each
    cluster route hits the right engine.
  - Unknown graph id under cluster prefix → 404.
  - Flat routes 404 in multi mode.

Inline `ServerConfig` test (`serve_refuses_to_start_in_state_1_without_unauthenticated`)
and three `server_settings_*` tests updated to the new `mode` shape.

Result: 211 server tests green (74 lib + 71 integration + 66 openapi),
MR-731 regression test still pinned and passing.

LOC: +45 config.rs, +281 lib.rs (net), +395 tests/server.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:09:12 +02:00
Andrew Altshuler
a275306a15
policy: CLI policy injection — local writes go through engine enforce (MR-722) (#104)
Closes the CLI side of the policy chassis fan-out. Before this commit,
CLI direct-engine writes bypassed Cedar entirely because the CLI never
called `Omnigraph::with_policy(...)` for non-`policy validate|test|explain`
subcommands. After this commit, every CLI direct-engine writer
(change, load, ingest, branch create/delete/merge, schema apply) opens
the engine via a new `open_local_db_with_policy(uri, &config)` helper
that installs the configured `PolicyEngine` when `policy.file` is set,
and threads the resolved actor through to the `_as` writer methods.

Actor identity resolution:

- New top-level `--as <ACTOR>` global flag on the CLI overrides config.
- New `cli.actor` field in `omnigraph.yaml` provides a default actor.
- Precedence: `--as` > `cli.actor` > None.
- When policy is configured and neither is set, the engine-layer
  footgun guard fires and the write is denied — silent bypass via
  "I forgot the actor" is exactly what the guard prevents.
- Remote HTTP writes ignore both — bearer-token-resolved server-side.

Helpers added in main.rs:

- `open_local_db_with_policy(uri, &config) -> Result<Omnigraph>` —
  opens the DB and installs the PolicyEngine when configured. Without
  policy this is identical to a bare `Omnigraph::open`.
- `resolve_cli_actor(cli_as, &config) -> Option<&str>` — implements
  the flag > config > None precedence.

Engine: added `load_file_as` to the loader as the actor-aware mirror of
`load_file`, so CLI file-path loads flow through the same enforce gate
as in-memory `load_as` calls.

Test rewrite: `local_cli_policy_tooling_is_end_to_end_while_local_writes_stay_unenforced`
was the explicit assertion of the pre-chassis hole. Renamed and split:

- `local_cli_policy_tooling_is_end_to_end` — sanity for the read-only
  policy CLI surfaces (validate/test/explain), unchanged behavior.
- `local_cli_change_enforces_engine_layer_policy` — the new assertion:
  policy installed + no actor → footgun-guard denial; `--as act-bruno`
  on protected main → Cedar denial; `--as act-ragnor` (admins-write
  rule) on main → permit, write committed.

POLICY_E2E_YAML gains an `admins-write` rule so the permit case has
a non-trivial actor to exercise.

docs/user/policy.md updated with `cli.actor` + `--as <ACTOR>` usage.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 04:06:21 +03:00
andrew
1a26e2e654 Rename config targets to graphs 2026-04-14 04:12:14 +03:00
andrew
338289656a Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00