omnigraph/docs/user/schema-language.md
Andrew Altshuler aadfa11ecb
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
schema: HTTP allow_data_loss exposure + e2e drop coverage (MR-694 follow-up) (#107)
The schema-lint chassis v1.2 (PR #100) shipped `--allow-data-loss` on
the CLI, but `SchemaApplyRequest` had no equivalent field — Hard-mode
drops were CLI-only. This commit closes that feature gap and adds e2e
test coverage for drop modes across HTTP + CLI, plus data preservation
on additive apply, plus a CLI↔SDK plan-parity assertion.

Feature gap closed:

- `crates/omnigraph-server/src/api.rs` — added `allow_data_loss: bool`
  (default false via `#[serde(default)]`) to `SchemaApplyRequest`.
  Added `Default` derive so test usages can use `..Default::default()`.
- `crates/omnigraph-server/src/lib.rs` — `server_schema_apply` now
  constructs `SchemaApplyOptions { allow_data_loss: request.allow_data_loss }`
  and threads through to `apply_schema_as`.
- `crates/omnigraph-cli/src/main.rs` — remote-URI schema-apply path
  used to bail with "--allow-data-loss not yet supported on remote";
  now forwards the flag into the JSON payload so the CLI behaves
  identically against local and remote URIs.
- `openapi.json` — regenerated; only diff is the new field on
  `SchemaApplyRequest`.

Tests added (8 new):

* `crates/omnigraph-server/tests/server.rs` (+5):
  - `schema_apply_route_soft_drops_property_via_http` — POST schema
    removing nullable property, verify catalog reflects the drop AND
    `snapshot_at_version(pre)` still has `age` in the field list
    (time-travel reachability is the Soft contract).
  - `schema_apply_route_soft_drops_node_type_via_http` — POST schema
    removing `Company` node + cascading `WorksAt` edge.
  - `schema_apply_route_hard_drops_property_with_allow_data_loss` —
    POST with `allow_data_loss: true`, verify plan step reports
    `mode: hard`.
  - `schema_apply_route_keeps_drops_soft_without_flag` — same schema
    without flag, verify `mode: soft`. Pins default semantics against
    accidental Hard promotion.
  - `schema_apply_route_additive_property_preserves_existing_rows` —
    load fixture, POST adding nullable property, verify row count
    preserved (SDK suite covers data preservation on drops + renames;
    additive AddProperty wasn't pinned).
  Plus helpers `schema_without_age` and `schema_without_company`.

* `crates/omnigraph-cli/tests/cli.rs` (+3):
  - `schema_apply_allow_data_loss_flag_promotes_drops_to_hard` — CLI
    `omnigraph schema apply --allow-data-loss --schema X.pg --json`,
    verify plan step has `mode: hard`.
  - `schema_apply_without_allow_data_loss_keeps_soft_drops` — without
    flag, verify Soft.
  - `schema_plan_parity_cli_and_sdk` — same `.pg` source through
    `Omnigraph::plan_schema` (SDK) and `omnigraph schema plan --json`
    (CLI), assert the steps array is byte-identical post-JSON. HTTP
    has no `/schema/plan` endpoint; apply-side parity is implicitly
    covered by the HTTP drop tests + CLI drop tests using identical
    fixtures.

Docs:

- `docs/user/schema-language.md` — new "Destructive drops" section
  documenting Soft vs Hard semantics and that `allow_data_loss` is
  now honored uniformly across CLI / HTTP / SDK.

Verification: every new test passes; full `cargo test --workspace --locked`
green; `scripts/check-agents-md.sh` passes.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 01:56:46 +03:00

88 lines
4.6 KiB
Markdown

# Schema Language (`.pg`)
Pest grammar at `crates/omnigraph-compiler/src/schema/schema.pest`. AST at `schema/ast.rs`. Catalog at `catalog/mod.rs`.
## Top-level declarations
- `interface <Name> { property* }` — reusable property contracts.
- `node <Name> [implements <Iface>, ...] { property* | constraint* }`
- `edge <Name>: <FromType> -> <ToType> [@card(min..max)] { property* | constraint* }`
- Comments: line `//` and block `/* … */`.
## Property declarations
`<ident>: <TypeRef> [annotation*]`
## Built-in scalar types
| Scalar | Arrow type |
|---|---|
| `String` | Utf8 |
| `Blob` | LargeBinary |
| `Bool` | Boolean |
| `I32` / `I64` | Int32 / Int64 |
| `U32` / `U64` | UInt32 / UInt64 |
| `F32` / `F64` | Float32 / Float64 |
| `Date` | Date32 |
| `DateTime` | Date64 |
| `Vector(<dim>)` | FixedSizeList(Float32, dim), `1 ≤ dim ≤ i32::MAX` |
| `[<scalar>]` | List(scalar) |
| `enum(v1, v2, …)` | Utf8 with sorted/dedup'd set of allowed string values |
| `<scalar>?` | Same as scalar but `nullable: true` |
## Constraints (body level)
| Constraint | On | Effect |
|---|---|---|
| `@key(p, …)` | node | Primary key; implies index on key columns; `key_property()` returns the first key |
| `@unique(p, …)` | node, edge | Uniqueness across listed columns |
| `@index(p, …)` | node, edge | Build a scalar (BTREE) index on the columns |
| `@range(p, min..max)` | node | Numeric range validation (open ranges allowed) |
| `@check(p, "regex")` | node | Regex pattern validation |
| `@card(min..max?)` | edge | Edge multiplicity — default `0..*`; `0..1`, `1..1`, `1..*`, etc. |
Edge bodies only allow `@unique` and `@index`.
## Annotations
- `@<ident>` or `@<ident>(<literal>)` on any declaration or property.
- Known annotations:
- `@embed` on a Vector property — names the *source* property whose text gets embedded into this vector at ingest (`embed_sources` map in NodeType).
- `@description("…")`, `@instruction("…")` on query declarations (carried through to clients).
- Custom annotations are accepted by the parser and surfaced in catalog metadata; unrecognized annotations don't fail compilation.
## Catalog construction
- Pass 0: collect interfaces.
- Pass 1: collect nodes, expand `implements`, build constraint and `@embed` mappings, build the Arrow schema for each node table (`id: Utf8` plus all properties; blob columns get `LargeBinary`).
- Pass 2: collect edges, validate that `from_type` / `to_type` exist, normalize edge names case-insensitively for lookup, validate constraints for edges. Edge Arrow schema: `id: Utf8, src: Utf8, dst: Utf8` plus edge properties.
## Schema IR & stable type IDs
- `SCHEMA_IR_VERSION = 1` (`catalog/schema_ir.rs`).
- Each interface/node/edge currently gets a `stable_type_id` from a kind+name hash.
- Rename-preserving accepted IDs are an architectural invariant, but the current hash-on-name implementation is a known gap until migration carries IDs across `@rename_from`.
- Serialized as JSON for diff/migration plans.
## Schema migration planning
`plan_schema_migration(accepted, desired) -> SchemaMigrationPlan { supported, steps[] }` with step types:
- `AddType { type_kind, name }`
- `RenameType { type_kind, from, to }`
- `AddProperty { type_kind, type_name, property_name, property_type }`
- `RenameProperty { type_kind, type_name, from, to }`
- `AddConstraint { type_kind, type_name, constraint }`
- `UpdateTypeMetadata { … annotations }`
- `UpdatePropertyMetadata { … annotations }`
- `UnsupportedChange { entity, reason }` (forces `supported=false`)
`apply_schema()` returns `SchemaApplyResult { supported, applied, manifest_version, steps }` and is gated by an internal `__schema_apply_lock__` system branch so concurrent schema applies serialize.
## Destructive drops — `--allow-data-loss`
`DropProperty` and `DropType` steps default to `Soft` mode: the catalog tombstones the entry but the prior column / dataset remains time-travel-reachable via `snapshot_at_version(prev)` until `omnigraph cleanup` runs. Soft drops are reversible.
Pass `--allow-data-loss` (CLI) or `allow_data_loss: true` (HTTP `POST /schema/apply` body, SDK `SchemaApplyOptions`) to promote every drop in the plan to `Hard` mode. Hard drops run `cleanup_old_versions` on the affected dataset immediately after the manifest publish, making the prior column / dataset unreachable. **Irreversible.**
The flag is honored uniformly across transports — `omnigraph schema apply --allow-data-loss`, `POST /schema/apply { schema_source, allow_data_loss: true }`, and `apply_schema_with_options(.., SchemaApplyOptions { allow_data_loss: true })` produce identical plans and identical effects.