mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-21 02:28:07 +02:00
Merge branch 'main' into ragnorc/omnigraph-mcp-crate
Bring the MCP feature branch up to date with main (14 commits). One conflict — compiler/parser.rs: main's `NanoError` → `CompilerError` rename vs this branch's `@mcp` / per-param `@description` parser additions; resolved by keeping the new parsing under the renamed error type. The CLI `queries list` change (#280, surfacing `@description`/`@instruction`) auto-merged with this branch's `mcp_expose`/`tool_name` columns.
This commit is contained in:
commit
fbf455a250
110 changed files with 6396 additions and 2511 deletions
|
|
@ -8,10 +8,9 @@ This page explains what the policy says and how to change it.
|
|||
|
||||
| Setting | Value | Why |
|
||||
|---|---|---|
|
||||
| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws`, `CODEOWNERS matches source`, `CODEOWNERS not hand-edited` | Every PR must pass the AWS-feature build/test, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. **`Test Workspace` is deliberately NOT required** — it runs only on push to `main` (post-merge), tags, and manual `workflow_dispatch`, to keep PR turnaround fast (it was the ~15min+ slow gate). It is therefore *not* listed here: a required check that never reports on PRs (the `test` job is `if: github.event_name != 'pull_request'`) would leave every PR permanently pending — the same job-never-reports trap the CODEOWNERS contexts call out below. The trade-off (a regression lands on `main` and is caught by the post-merge run, so `main` can briefly go red) and its mitigations are documented in [ci.md](ci.md). The two CODEOWNERS contexts must equal the job `name:` values in `.github/workflows/codeowners.yml` **verbatim** — a context naming a job that never reports (the old `CODEOWNERS / drift` used the job *id*, and the job was path-filtered) leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. |
|
||||
| **Required approving reviews** | `1` | At least one reviewer. With a 2-person team, going higher would block all merges when one person is unavailable. |
|
||||
| **Require code-owner reviews** | `true` | The reviewer must be a code owner per `.github/CODEOWNERS`. This is what makes the codeowners chassis enforced. |
|
||||
| **Dismiss stale reviews on new commits** | `true` | A push after approval invalidates the prior review. Prevents the "approve, then sneak in unreviewed changes" pattern. |
|
||||
| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws` | Every PR must pass the AWS-feature build/test and AGENTS.md link integrity. **`Test Workspace` is deliberately NOT required** — it runs only on push to `main` (post-merge), tags, and manual `workflow_dispatch`, to keep PR turnaround fast (it was the ~15min+ slow gate). It is therefore *not* listed here: a required check that never reports on PRs (the `test` job is `if: github.event_name != 'pull_request'`) would leave every PR permanently pending — the job-never-reports trap. The trade-off (a regression lands on `main` and is caught by the post-merge run, so `main` can briefly go red) and its mitigations are documented in [ci.md](ci.md). Each required context must equal a job `name:` that actually reports on PRs **verbatim** — a context naming a job that never reports leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. |
|
||||
| **Required approving reviews** | `0` | No human-review gate. With a 2-person team where both maintainers own everything, requiring an approval meant every PR needed the *other* person (or an admin/bypass override) — friction with no real review value. CI checks are the gate; maintainers merge their own PRs once checks pass. Raise this to `1` if an outside-contributor flow ever needs a review gate. |
|
||||
| **Require code-owner reviews** | `false` | CODEOWNERS was removed entirely (see the git history of `.github/`); there is no code-owner review requirement. |
|
||||
| **Require linear history** | `true` | No merge commits — squash or rebase only. Matches recent practice. |
|
||||
| **Disallow force pushes** | `true` | No history rewrites on `main`. |
|
||||
| **Disallow branch deletions** | `true` | `main` cannot be deleted. |
|
||||
|
|
@ -57,7 +56,7 @@ Outputs the live policy. Compare against `.github/branch-protection.json` to det
|
|||
|
||||
- **Audit trail**: `git log .github/branch-protection.json` shows every change with a reviewable diff and a merge commit.
|
||||
- **Disaster recovery**: if branch protection is accidentally removed or weakened via the UI, the JSON is the canonical recovery point.
|
||||
- **Consistency**: pairs with `.github/codeowners-roles.yml` (the CODEOWNERS source of truth). Repository policy lives in the repository.
|
||||
- **Consistency**: repository policy lives in the repository, reviewed like code.
|
||||
|
||||
## What this gates
|
||||
|
||||
|
|
@ -65,11 +64,11 @@ After branch protection is applied, every PR targeting `main` must:
|
|||
|
||||
1. Pass all listed status checks.
|
||||
2. Be up-to-date with `main` (rebase or merge-from-main).
|
||||
3. Have at least one approving review from a code owner for the touched paths.
|
||||
4. Have all review conversations resolved.
|
||||
5. Be squash- or rebase-merged (no merge commits).
|
||||
3. Have all review conversations resolved.
|
||||
4. Be squash- or rebase-merged (no merge commits).
|
||||
|
||||
Even repository admins are subject to these rules.
|
||||
No human approval is required (`required_approving_review_count: 0`). Repository
|
||||
admins can override the gates (`enforce_admins: false`).
|
||||
|
||||
## Subsequent hardening (not in this PR)
|
||||
|
||||
|
|
@ -77,7 +76,7 @@ The branch-protection policy is the foundation. Future hardening adds:
|
|||
|
||||
- **Required signed commits** (`required_signatures: true`) — once maintainers enroll GPG/SSH signing.
|
||||
- **Tag protection** for `v*` tags via `repos/.../tags/protection`.
|
||||
- **Required reviewers from specific teams** for high-leverage paths (e.g., `docs/dev/invariants.md`) via CODEOWNERS tier expansion + the N-unique-approvers CI workaround.
|
||||
- **Required reviewers from specific teams** for high-leverage paths (e.g., `docs/dev/invariants.md`) via a GitHub ruleset's path-scoped required-review rule, if a review gate is ever reintroduced.
|
||||
- **More required CI checks**: `cargo deny`, `cargo audit`, `cargo fmt --check`, `cargo clippy -D warnings`, CodeQL, secret scanning, schema-lint (MR-946).
|
||||
|
||||
See the hardening playbook for the full plan.
|
||||
|
|
|
|||
217
docs/dev/bug-case-fix.md
Normal file
217
docs/dev/bug-case-fix.md
Normal file
|
|
@ -0,0 +1,217 @@
|
|||
# Bug case study: camelCase property filters lowercased at runtime
|
||||
|
||||
**Issue:** [#283](https://github.com/ModernRelay/omnigraph/issues/283) (mirrored
|
||||
in the dev-graph as `iss-990`)
|
||||
**Reported on:** 0.7.0 (release binary)
|
||||
**Status of code:** present on `v0.7.0`; fixed on branch `fix/iss-283-camelcase-filter` (read pushdown + pending mutation scan)
|
||||
**Severity:** correctness — a valid, lint-clean query fails at run time.
|
||||
|
||||
## Symptom
|
||||
|
||||
A read query that filters on a **camelCase** schema field lints and plans
|
||||
cleanly but fails when it executes:
|
||||
|
||||
```text
|
||||
No field named reponame. Column names are case sensitive.
|
||||
```
|
||||
|
||||
Minimal repro:
|
||||
|
||||
```pg
|
||||
node SourceDocument {
|
||||
repoName: String @index
|
||||
}
|
||||
```
|
||||
|
||||
```gq
|
||||
query find($repoName: String) {
|
||||
match { $d: SourceDocument { repoName: $repoName } }
|
||||
return { $d.repoName }
|
||||
}
|
||||
```
|
||||
|
||||
`omnigraph lint` passes; running the query errors. The operator workaround is to
|
||||
rename the field to all-lowercase (`repo`), which is why this looked like a
|
||||
schema-design quirk rather than an engine bug.
|
||||
|
||||
## Root cause
|
||||
|
||||
The filter-pushdown path builds the Lance scan predicate's column reference with
|
||||
`datafusion::prelude::col(property)`:
|
||||
|
||||
- **Site:** `crates/omnigraph/src/exec/query.rs` — `ir_expr_to_expr`:
|
||||
```rust
|
||||
IRExpr::PropAccess { property, .. } => Some(col(property)),
|
||||
```
|
||||
- `col(&str)` runs DataFusion's SQL **identifier normalization**
|
||||
(`Column::from_qualified_name` → `parse_identifiers_normalized(.., false)`),
|
||||
which **lowercases unquoted identifiers**. So `col("repoName")` resolves to a
|
||||
column named `reponame`.
|
||||
- Lance stores columns **case-preserved** (`repoName`) and resolves them
|
||||
case-sensitively, so the scan can't find `reponame` and errors.
|
||||
|
||||
The IR is not at fault: the parser and lowering preserve the original case
|
||||
(`property: pm.prop_name.clone()`), which is exactly why the compiler resolves
|
||||
`repoName` and **lint passes**. The case is destroyed only at the
|
||||
engine → Lance boundary.
|
||||
|
||||
There is a **second** boundary with the same root cause but a *different*
|
||||
parser: the pending-batch scan in `table_store.rs::scan_pending_batches` splices
|
||||
the mutation predicate string into a DataFusion `SELECT … WHERE {filter}` over a
|
||||
`MemTable`, and DataFusion's SQL parser lowercases the unquoted column the same
|
||||
way (`repoName` → `reponame`). See **Part 2** of the fix — it surfaces only on a
|
||||
*chained* mutation that re-reads the pending side, which is why a single
|
||||
update/delete on a camelCase predicate looked fine.
|
||||
|
||||
### Why the rest of the engine is unaffected
|
||||
|
||||
The two pushdown sites above were the offenders; the remaining paths already
|
||||
treat column names case-sensitively and handle camelCase correctly:
|
||||
|
||||
- **Projection / return** uses the real Arrow field name (`f.name()`).
|
||||
- **In-memory filtering** (the fallback for non-pushable predicates) looks the
|
||||
column up by the preserved property name against the batch schema.
|
||||
- **The committed Lance mutation scan** (`Scanner::filter(&str)`) preserves an
|
||||
unquoted identifier's case, so committed-row matching on a camelCase predicate
|
||||
already worked.
|
||||
|
||||
So the read bug surfaces for predicates that *are* pushed down (e.g. an equality
|
||||
on a scalar camelCase column), and the mutation bug only for the pending-side
|
||||
re-scan of a chained mutation.
|
||||
|
||||
### Why it slipped through
|
||||
|
||||
The `ir_filter_to_expr` unit tests only use the all-lowercase field `count`, so
|
||||
no test exercised a camelCase property. Nothing in CI compared the emitted
|
||||
column name against the schema's casing.
|
||||
|
||||
## Fix
|
||||
|
||||
There are **two** engine→Lance boundaries that lose case, and they need
|
||||
**different** fixes because the two consumers disagree on quoting semantics.
|
||||
|
||||
### Part 1 — read pushdown (`exec/query.rs`, `ir_expr_to_expr`)
|
||||
|
||||
Use DataFusion's case-preserving column constructor, `ident()`, instead of
|
||||
`col()`:
|
||||
|
||||
```rust
|
||||
IRExpr::PropAccess { property, .. } => Some(datafusion::prelude::ident(property)),
|
||||
```
|
||||
|
||||
`ident()` builds `Expr::Column(Column::new_unqualified(property))` with no SQL
|
||||
parse and no normalization, so the case is preserved. Property references here
|
||||
are always bare column names (the variable is dropped via `..`), so there is no
|
||||
qualified-name (`a.b`) handling to lose.
|
||||
|
||||
This is the right layer and the right shape:
|
||||
|
||||
- It is a **no-op for the lowercase columns that work today** (`slug`, `id`,
|
||||
`status`, …) — lowercasing those was already a no-op — so there is no
|
||||
regression risk for the common case.
|
||||
- It makes pushdown **consistent** with projection and in-memory filtering,
|
||||
which already use case-preserved names.
|
||||
- It also restores **index use** for camelCase columns: today such a filter
|
||||
errors before the BTREE is even considered.
|
||||
|
||||
### Part 2 — pending mutation scan (`table_store.rs`, `scan_pending_batches`)
|
||||
|
||||
`update`/`delete` predicates lower through `predicate_to_sql(..)` into a single
|
||||
**SQL string** (`format!("{} {} {}", column, op, value_sql)`). That one string
|
||||
is consumed by **two** different parsers, and *they disagree on what quoting
|
||||
means*:
|
||||
|
||||
- The **committed** side passes the string to Lance's `Scanner::filter(&str)`.
|
||||
Lance **preserves an unquoted identifier's case** (so unquoted camelCase
|
||||
*already works* on the committed scan) but treats a double-quoted `"col"` as a
|
||||
**string literal** — `"repoName" = 'acme'` parses as `'repoName' = 'acme'`,
|
||||
a constant-false predicate that silently matches **zero** committed rows.
|
||||
- The **pending** side splices the same string into a DataFusion
|
||||
`SELECT … FROM pending WHERE {filter}` over a `MemTable`. DataFusion's SQL
|
||||
parser **lowercases** an unquoted identifier (`repoName` → `reponame`) and
|
||||
fails to resolve against the case-sensitive `MemTable` schema.
|
||||
|
||||
So no single quoting choice for the column satisfies both: quoting fixes the
|
||||
pending side but breaks the committed side, and vice versa. The fix keeps the
|
||||
predicate **unquoted** (what the committed Lance scan needs) and makes the
|
||||
*pending* context case-preserving instead, by disabling SQL identifier
|
||||
normalization on its `SessionContext`:
|
||||
|
||||
```rust
|
||||
let mut config = SessionConfig::new();
|
||||
config.options_mut().sql_parser.enable_ident_normalization = false;
|
||||
let ctx = SessionContext::new_with_config(config);
|
||||
```
|
||||
|
||||
`predicate_to_sql` itself never lowercased anything (it copies the preserved
|
||||
property name), so its emitted string is unchanged — it gains only a comment
|
||||
recording the unquoted contract. The projection list in the same function is
|
||||
already double-quoted and is unaffected (quoted identifiers are case-preserved
|
||||
under either normalization setting).
|
||||
|
||||
Rejected alternatives: banning/normalizing camelCase at the compiler (a real
|
||||
usability regression — camelCase fields are legitimate), lowercasing column
|
||||
names in storage (a breaking on-disk change), merely making lint *warn* (a
|
||||
band-aid that leaves the runtime broken), or **quoting the column in
|
||||
`predicate_to_sql`** (empirically breaks 7 existing lowercase-column mutation
|
||||
tests because Lance reads `"col"` as a string literal — see Part 2).
|
||||
|
||||
## Scope and caveats
|
||||
|
||||
- **Not Windows-specific.** The original report's environment was Windows, but
|
||||
the cause is platform-independent.
|
||||
- **The mutation path was only *partially* broken, and not where first
|
||||
assumed.** The committed side of `scan_with_pending(..)` (Lance
|
||||
`Scanner::filter(&str)`) and `delete`'s `delete_where(..)` / `Dataset::delete`
|
||||
preserve an unquoted identifier's case, so a *single* `update`/`delete` on a
|
||||
camelCase predicate already worked. Only the **pending** side — the in-memory
|
||||
`MemTable` re-scan that a *chained* mutation hits — lowercased the column.
|
||||
This was confirmed empirically: a single update+delete on `repoName` passes
|
||||
unfixed; a chained update that re-reads the pending side fails with
|
||||
`No field named reponame`. The fix is Part 2 above (disable identifier
|
||||
normalization on the pending `SessionContext`), **not** quoting the column.
|
||||
The eventual MR-A migration (`delete_where` → Lance 7
|
||||
`DeleteBuilder::execute_uncommitted`, structured `Expr`) is the longer-term
|
||||
shape but is out of scope here.
|
||||
- **Check the coercion lookup.** Adjacent to the fix, the literal-coercion step
|
||||
(`prop_data_type(.., schema)`, which keeps the BTREE usable) also resolves the
|
||||
column by name. Confirm it uses the preserved name; if it mishandles case a
|
||||
camelCase filter would resolve but lose its index — a silent perf regression,
|
||||
not a crash.
|
||||
- **Do not use `col(r#""repoName""#)` as the general read-path fix.** Quoting
|
||||
would preserve this one name, but it routes through SQL identifier parsing and
|
||||
changes qualified-name semantics. The IR property here is already a bare
|
||||
column name, so `ident(property)` / `Column::new_unqualified(property)` is the
|
||||
precise structured expression.
|
||||
- **Do not "fix" the mutation string by quoting the column.** It is tempting to
|
||||
reuse a `quote_ident` helper symmetric with `literal_to_sql`'s value escaping,
|
||||
but the column quote-rules differ between the two consumers of the predicate
|
||||
string: Lance's `Scanner::filter(&str)` reads `"col"` as a *string literal*
|
||||
(silently matching nothing), while DataFusion's `ctx.sql` reads it as a
|
||||
case-preserved identifier. Because the committed Lance scan already preserves
|
||||
the *unquoted* identifier's case, the column must stay unquoted and the
|
||||
pending DataFusion context must be told not to normalize — not the reverse.
|
||||
|
||||
## Validation (test-first)
|
||||
|
||||
1. **Red:** add an `ir_filter_to_expr` test asserting the emitted
|
||||
`Expr::Column` name for a camelCase property is `repoName`, not `reponame`.
|
||||
Fails on current code.
|
||||
2. **Green:** apply the `col` → `ident` change (Part 1) and the pending-context
|
||||
`enable_ident_normalization = false` change (Part 2).
|
||||
3. **End-to-end:** a camelCase `@index` field with
|
||||
`match { T { camelField: $x } }` returns the row (the unit test alone can't
|
||||
catch an engine↔Lance boundary regression).
|
||||
4. **Mutation parity:** with the same camelCase field, cover:
|
||||
- `update T where camelField == $x set otherField = ...` updates the intended
|
||||
row.
|
||||
- `delete T where camelField == $x` deletes the intended row and cascades as
|
||||
expected.
|
||||
- A chained update that hits the pending side of `scan_with_pending` still
|
||||
works, so both the committed Lance scan and pending DataFusion `MemTable`
|
||||
predicate paths are case-preserving.
|
||||
5. **Index preservation:** keep or add a plan/trace assertion that the
|
||||
camelCase `@index` equality predicate still reaches the scalar-index path.
|
||||
A result-only test can pass while silently falling back to a full scan.
|
||||
6. Run the full engine suite (`cargo test -p omnigraph-engine`) — in particular
|
||||
the existing BTREE index-eligibility tests, which `ident()` must not disturb.
|
||||
|
|
@ -3,7 +3,7 @@
|
|||
`.github/workflows/`:
|
||||
|
||||
- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repository PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
|
||||
- **`Test Workspace` does not run on pull requests.** The job is gated `if: github.event_name != 'pull_request'`, so the full workspace + failpoints suite runs only on push to `main` (post-merge), on `v*` tags, and on manual `workflow_dispatch`. This was a deliberate PR-latency trade-off — it was the slowest gate (~15min warm, up to the 75min cold ceiling). `RustFS S3 Integration` `needs: test`, so it is push-/dispatch-only for the same reason. The fast PR gates remain: `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws`, and the two CODEOWNERS checks. `Test Workspace` is correspondingly **not** in the required-check list (`.github/branch-protection.json`); see [branch-protection.md](branch-protection.md).
|
||||
- **`Test Workspace` does not run on pull requests.** The job is gated `if: github.event_name != 'pull_request'`, so the full workspace + failpoints suite runs only on push to `main` (post-merge), on `v*` tags, and on manual `workflow_dispatch`. This was a deliberate PR-latency trade-off — it was the slowest gate (~15min warm, up to the 75min cold ceiling). `RustFS S3 Integration` `needs: test`, so it is push-/dispatch-only for the same reason. The fast PR gates remain: `Classify Changes`, `Check AGENTS.md Links`, and `Test omnigraph-server --features aws`. `Test Workspace` is correspondingly **not** in the required-check list (`.github/branch-protection.json`); see [branch-protection.md](branch-protection.md).
|
||||
- **Consequences to internalize:** (1) a regression that the suite would catch now lands on `main` and turns the post-merge run red, rather than being blocked pre-merge — `main` can briefly break, so run `cargo test --workspace --locked` locally before merging anything non-trivial, or trigger this workflow on your branch via the Actions "Run workflow" button. (2) `openapi.json` is no longer auto-regenerated on PRs (that step is inside the `test` job); for server/API changes, regenerate it locally with `OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi` and commit it, or the strict drift check fails the post-merge `main` run.
|
||||
- **Applying this policy:** removing `Test Workspace` from the JSON is inert until an admin runs `./scripts/apply-branch-protection.sh`. **Run it immediately after this change merges** — until then GitHub still requires a `Test Workspace` context that no longer reports on PRs, which leaves every open PR permanently pending (the job-never-reports trap).
|
||||
- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
|
||||
|
|
|
|||
|
|
@ -1,58 +0,0 @@
|
|||
# Code ownership
|
||||
|
||||
`.github/CODEOWNERS` is **generated** — not hand-edited. The source of truth is `.github/codeowners-roles.yml`, expanded by `.github/scripts/render-codeowners.py`. CI rejects drift between the two and rejects direct edits to `CODEOWNERS` that don't accompany a yml change.
|
||||
|
||||
This setup gives every role change a reviewable PR and a permanent in-repository audit trail (`git log .github/codeowners-roles.yml`).
|
||||
|
||||
## Who owns what
|
||||
|
||||
The tables below are **generated** from `.github/codeowners-roles.yml` by `.github/scripts/render-codeowners.py` (the same render that produces `.github/CODEOWNERS`). They are the always-current "who owns what at this commit" view — don't edit them by hand; edit the yml and re-render.
|
||||
|
||||
<!-- BEGIN GENERATED OWNERSHIP — edit codeowners-roles.yml + run render-codeowners.py -->
|
||||
|
||||
**Path → owners** (GitHub applies *last match wins*; the `*` catch-all is listed first and is overridden by the specific patterns below it):
|
||||
|
||||
| Path | Owners | Role(s) |
|
||||
|---|---|---|
|
||||
| `*` | @aaltshuler @ragnorc | engineering |
|
||||
| `crates/**` | @aaltshuler @ragnorc | engineering |
|
||||
| `docs/**` | @aaltshuler @ragnorc | docs |
|
||||
| `README.md` | @aaltshuler @ragnorc | docs |
|
||||
| `AGENTS.md` | @aaltshuler @ragnorc | docs |
|
||||
| `CLAUDE.md` | @aaltshuler @ragnorc | docs |
|
||||
| `SECURITY.md` | @aaltshuler @ragnorc | docs |
|
||||
|
||||
**Roles**:
|
||||
|
||||
| Role | Members | Description |
|
||||
|---|---|---|
|
||||
| `engineering` | @aaltshuler @ragnorc | All production code under crates/**. Engine, CLI, server, compiler. |
|
||||
| `docs` | @aaltshuler @ragnorc | Documentation under docs/**, plus repo-level docs (README.md, AGENTS.md, CLAUDE.md symlink, SECURITY.md). |
|
||||
|
||||
<!-- END GENERATED OWNERSHIP -->
|
||||
|
||||
GitHub treats multiple owners on a CODEOWNERS line as **"any one of them satisfies the review requirement"**. To require N distinct approvers on a specific path, layer a CI check on top (not currently configured).
|
||||
|
||||
## How to change role membership or path mappings
|
||||
|
||||
1. Edit `.github/codeowners-roles.yml`.
|
||||
2. Open a PR. **CI re-renders for you**: the `CODEOWNERS` workflow regenerates `.github/CODEOWNERS` and the ownership tables above and auto-commits them back to your PR branch on same-repository PRs — you don't have to run the script locally (though you can: `python3 .github/scripts/render-codeowners.py`, requires PyYAML).
|
||||
|
||||
On a fork (where CI can't push back), the workflow instead fails with the diff so you can run the script and commit it yourself.
|
||||
|
||||
CI fails the PR if:
|
||||
- a fork PR left a generated artifact out of sync, or
|
||||
- `CODEOWNERS` was edited without a corresponding yml change (the `CODEOWNERS not hand-edited` check).
|
||||
|
||||
## How to add a new role
|
||||
|
||||
1. Add a new entry to `roles:` in the yml with a `description` and `members` list.
|
||||
2. Reference the role from `paths:` (or `default:`).
|
||||
3. Regenerate + commit as above.
|
||||
|
||||
## Why a generator, not direct CODEOWNERS edits?
|
||||
|
||||
- **Audit trail**: `git log .github/codeowners-roles.yml` is the canonical record of every role change. The rendered `CODEOWNERS` is a derived artifact.
|
||||
- **Roles are first-class**: paths reference roles, not raw handles. Renaming a person or rotating a role updates one place, not every path.
|
||||
- **Future extension**: scheduled rotation (weekly on-call, quarterly leads) plugs into the same yml without changing the path mappings. Not enabled today.
|
||||
- **Consistency with the product**: omnigraph itself enforces auditable Cedar policy. The repository's code-owner policy follows the same "policy as reviewed code" pattern.
|
||||
|
|
@ -28,7 +28,6 @@ constraints. User-facing behavior should still be documented through
|
|||
| Three-way merge implementation and conflicts | [merge.md](merge.md) |
|
||||
| Diff/change-feed implementation | [changes.md](../user/branching/changes.md) |
|
||||
| Branch protection policy | [branch-protection.md](branch-protection.md) |
|
||||
| CODEOWNERS source of truth | [codeowners.md](codeowners.md) |
|
||||
|
||||
## Language, Runtime, And Boundaries
|
||||
|
||||
|
|
@ -63,6 +62,16 @@ The `docs/rfcs/` track is the **public, externally-authorable** RFC process. The
|
|||
maintainer/internal RFCs below (`rfc-00N-*.md`) are a separate, team-owned
|
||||
track; don't conflate the two.
|
||||
|
||||
## Case Studies
|
||||
|
||||
Worked write-ups of specific bugs — root cause, fix, and the reasoning that
|
||||
ruled out the tempting-but-wrong alternatives. Read these for the debugging
|
||||
pattern, not just the outcome.
|
||||
|
||||
| Area | Read |
|
||||
|---|---|
|
||||
| camelCase property filters lowercased at runtime (#283) — two engine→Lance boundaries, two different fixes | [bug-case-fix.md](bug-case-fix.md) |
|
||||
|
||||
## Active Implementation Plans
|
||||
|
||||
Working documents for in-flight feature work. Removed when the work lands.
|
||||
|
|
|
|||
|
|
@ -53,7 +53,13 @@ converge the physical state.
|
|||
versioning, fragments, branches, compaction, cleanup, and index primitives.
|
||||
DataFusion should own relational execution where it fits. Do not add custom
|
||||
WALs, transaction managers, buffer pools, page formats, or local clones of
|
||||
substrate behavior. Read [lance.md](lance.md) before guessing.
|
||||
substrate behavior. Read [lance.md](lance.md) before guessing. Respecting the
|
||||
substrate also means *using* it idiomatically, not only refraining from
|
||||
rebuilding it: reuse long-lived handles instead of re-opening per call,
|
||||
resolve latest state through the substrate's cheap primitive instead of
|
||||
re-scanning, and share its caches/session. Re-deriving per call what the
|
||||
substrate keeps warm is a substrate violation even when no code is
|
||||
reimplemented.
|
||||
|
||||
2. **Graph visibility is manifest-atomic.** Lance commits are per dataset.
|
||||
OmniGraph's graph-level atomicity comes from publishing one manifest update
|
||||
|
|
@ -126,6 +132,18 @@ converge the physical state.
|
|||
a substitute for missing lower-level assertions. Read [testing.md](testing.md)
|
||||
before adding tests.
|
||||
|
||||
15. **One source of truth, cheaply derived.** Lance and the manifest are the
|
||||
source of truth. Everything the engine needs at runtime is a derived view of
|
||||
them: read or projected on demand, held warm, refreshed by a cheap probe. Two
|
||||
failure modes are forbidden. A *parallel copy* the engine maintains can drift
|
||||
from the source, and that divergence compounds over time. *Cold
|
||||
re-derivation* rebuilds the view from the full source on every call, so its
|
||||
cost grows with history. Invariants 1 and 7, and the deny-list "state that
|
||||
drifts" and "manifest-derivable reconciler" items, are instances; so is
|
||||
bounding a read's cost to its working set rather than the commit count. This
|
||||
is the structural face of "engineering is programming integrated over time":
|
||||
both failure modes are liabilities that compound as the system grows.
|
||||
|
||||
## Current Truth Matrix
|
||||
|
||||
| Area | Current state | Source |
|
||||
|
|
@ -252,6 +270,37 @@ them explicit.
|
|||
- **Resource bounds:** some operations still lack enforced per-query memory or
|
||||
time budgets. New long-running work should add explicit bounds rather than
|
||||
widening the gap.
|
||||
- **Read-path re-derivation (largely closed by the query-latency work):**
|
||||
snapshot resolution used to re-open a fresh coordinator per read (a full
|
||||
`__manifest` re-scan plus two commit-graph scans), open each table through the
|
||||
namespace (two more `__manifest` scans per table), validate the schema twice,
|
||||
and share no Lance `Session`. That was an O(commits) cost that never warmed up.
|
||||
Fix 1 (warm coordinator reuse behind a `latest_version_id` probe), Fix 2 (open
|
||||
tables by location+version), finding A (validate once), and Fix 3 (a held
|
||||
`Dataset`-handle cache keyed by `(table, branch, version, e_tag when Lance
|
||||
exposes it)` plus one shared `Session` per graph) remove that tax: a warm
|
||||
same-branch read does one probe, one schema read, and zero opens on a repeat.
|
||||
Non-main branch freshness compares the manifest incarnation (`version` plus
|
||||
manifest-location e_tag when available, otherwise Lance manifest timestamp),
|
||||
because Lance branch names can be deleted/recreated at the same version number;
|
||||
the manifest e_tag is carried into synthetic snapshot ids when available, and
|
||||
a detected same-branch manifest refresh clears read caches as the fallback for
|
||||
e_tag-less table locations/topology. Remaining: the internal metadata tables
|
||||
(`__manifest`, `_graph_commits`) are still not compacted, so the probe and
|
||||
refresh cost still grows with fragment count on a long-lived graph (the
|
||||
`optimize`-covers-internal-tables follow-up); the commit graph is not yet
|
||||
reconcilable from the manifest; and the traversal id-map is still rebuilt.
|
||||
- **Commit-graph parent under concurrency:** `record_graph_commit` now refreshes
|
||||
the commit-graph head from storage before appending, so a same-branch write
|
||||
after an external commit no longer forks the commit DAG by parenting off a
|
||||
stale cached head (the single-process fork, pre-existing for non-strict
|
||||
inserts and widened to strict ops by Fix 1's `refresh_manifest_only`, is now
|
||||
closed). Residual: two processes writing disjoint tables can still pass their
|
||||
per-table manifest CAS and append off the same parent (a refresh-then-append
|
||||
TOCTOU). The convergent fix is reconcile-from-manifest (parent = the commit at
|
||||
the manifest version the publisher CAS'd against; `manifest_version` is on
|
||||
every commit row), composing with the manifest-to-commit-graph atomicity gap;
|
||||
it needs commit-graph append ordering or a Lance append-CAS to fully close.
|
||||
|
||||
## Deny-list
|
||||
|
||||
|
|
@ -277,6 +326,10 @@ case is exceptional.
|
|||
- Cost-blind plan choice when statistics are available or required.
|
||||
- Hidden statistics for behavior that affects planning or operator choice.
|
||||
- Hash-map iteration order in result ordering, plan choice, or migration output.
|
||||
- Cold re-derivation on the hot path: rebuilding from the full source what could
|
||||
be held warm and refreshed cheaply, so cost scales with history rather than the
|
||||
working set (the cost face of invariant 15; "state that drifts" above is its
|
||||
shadow-copy face).
|
||||
- String-flattened SQL/filter generation when a structured pushdown API is
|
||||
available.
|
||||
- Eager multi-hop cross-product materialization when factorization fits.
|
||||
|
|
@ -313,6 +366,8 @@ Use this as yes/no/NA for any non-trivial design or PR:
|
|||
- Are stats/capabilities exposed when behavior depends on them?
|
||||
- Are existing known gaps left no worse and documented if touched?
|
||||
- Does the test live at the same boundary as the change?
|
||||
- Is this operation's cost bounded with respect to history and scale, or does it
|
||||
re-derive warm state from cold storage per call?
|
||||
- Does the change avoid every deny-list pattern, or justify the exception?
|
||||
|
||||
## Maintenance Policy
|
||||
|
|
|
|||
|
|
@ -1,7 +1,7 @@
|
|||
# RFC: Server Boots from Cluster State — Phase 5 of the Cluster Control Plane
|
||||
|
||||
**Status:** Landed (5A policy bindings #175; 5B/5C the `--cluster` boot mode — one PR)
|
||||
**Implementation deviations:** (1) cluster mode reuses `ServerConfigMode::Multi` (a new settings *source*, not a new enum variant; `config_path` carries the cluster dir). (2) Stored queries load via `QueryRegistry::from_specs` from verified blob *content*, not blob paths. (3) More than one policy bundle binding a single scope is a boot error (the serving pipeline holds one bundle per graph + one server-level; stacking is a later slice). (4) `GET /graphs` keeps its closed-by-default contract — without a cluster-bound bundle there is no server-level Cedar engine, so enumeration refuses.
|
||||
**Implementation deviations:** (1) cluster mode reuses `ServerConfigMode::Multi` (a new settings *source*, not a new enum variant; `config_path` carries the cluster dir). (2) Stored queries load via `QueryRegistry::from_specs` from verified blob *content*, not blob paths. (3) More than one policy bundle binding a single scope is a boot error (the serving pipeline holds one bundle per graph + one server-level; stacking is a later slice). (4) `GET /graphs` keeps its closed-by-default contract — without a cluster-bound bundle there is no server-level Cedar engine, so enumeration refuses. (5) Graph-attributed startup failures quarantine that graph by default; operators can restore all-or-nothing boot with `--require-all-graphs` / `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`.
|
||||
**Date:** 2026-06-10
|
||||
**Builds on:** Phase 4 complete ([rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md), Landed): `cluster apply` converges graphs, schemas, stored queries, and policies into the cluster catalog. Normative context: [cluster-config-specs.md](cluster-config-specs.md) (the migration model's "window 2"), [cluster-axioms.md](cluster-axioms.md) (axiom 15), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) (Phase 5 rollout, Compatibility Stance #7–#9, exit criterion 7).
|
||||
**Target release:** unversioned (phased — see Sequencing).
|
||||
|
|
@ -46,8 +46,8 @@ Mode inference gains rule 0: `--cluster <dir>` → **Cluster mode**, which is al
|
|||
|
||||
`load_server_settings` grows a cluster branch that reads, in order:
|
||||
|
||||
1. `__cluster/state.json` — **missing state is a boot error** ("run `cluster import` + `cluster apply` first"). Pending recovery sidecars under `__cluster/recoveries/` are also a boot error (`cluster_recovery_pending`): a server must not start serving a ledger that a sweep is about to rewrite.
|
||||
2. **Graph set** = state's `graph.<id>` resources (tombstoned graphs are absent by construction). Each graph's URI is the derived root `<dir>/graphs/<id>.omni`. A recorded graph whose root does not open is a boot error — same fail-fast posture as today's bad URI.
|
||||
1. `__cluster/state.json` — **missing state is a boot error** ("run `cluster import` + `cluster apply` first"). Invalid or unattributable recovery sidecars under `__cluster/recoveries/` are also a boot error: a server must not start if it cannot prove the blast radius. Valid graph-attributed sidecars quarantine that graph by default and are logged as `cluster_recovery_pending`; `--require-all-graphs` promotes them back to a boot error.
|
||||
2. **Graph set** = state's `graph.<id>` resources (tombstoned graphs are absent by construction). Each graph's URI is the derived root `<dir>/graphs/<id>.omni`. A recorded graph whose root does not open quarantines that graph by default; `--require-all-graphs` restores the original fail-fast posture.
|
||||
3. **Stored queries** = state's `query.<graph>.<name>` entries, content loaded from the catalog blob at the recorded digest. Blob-missing or digest-mismatched is a boot error (the catalog verification semantics from Stage 3B, applied at boot). Queries type-check at engine open exactly as today (`validate_and_attach` — unchanged).
|
||||
4. **Policies** = state's `policy.<name>` entries, content from catalog blobs, bindings from the applied metadata of D3: bundles bound to `cluster` load as the server-level Cedar engine (`PolicyEngine::load_server`); bundles bound to graphs load per-graph (`PolicyEngine::load_graph`) and install via `with_policy` — the existing two-gate structure, unchanged.
|
||||
5. `cluster.yaml` is parsed **only** to validate that the directory is a cluster root (and for nothing else — explicitly not for resource content; a divergence between desired config and applied state is *served as applied*, visible via `cluster plan`).
|
||||
|
|
@ -76,16 +76,19 @@ State's `StateResource` records only a digest. To make the ledger serving-suffic
|
|||
|
||||
### D4. Readiness and failure posture
|
||||
|
||||
Boot is fail-fast, matching the server's existing stance (bad policy YAML refuses boot):
|
||||
Cluster-global failures are fail-fast, matching the server's existing stance (bad policy YAML refuses boot). Graph-local failures quarantine the affected graph by default so a single bad graph cannot crash-loop an otherwise healthy cluster. Operators who prefer the original all-or-nothing contract pass `--require-all-graphs` or set `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`, which promotes every graph-local quarantine/open/settings failure to a boot error.
|
||||
|
||||
| Condition | Behavior |
|
||||
|---|---|
|
||||
| `state.json` missing / unparseable / unsupported version | boot error |
|
||||
| pending recovery sidecars | boot error (run any state-mutating cluster command to sweep) |
|
||||
| recorded graph root missing or unopenable | boot error |
|
||||
| invalid/unreadable/unattributable recovery sidecars | boot error (run any state-mutating cluster command to sweep or inspect) |
|
||||
| valid graph-attributed recovery sidecars | quarantine that graph; strict mode boot error |
|
||||
| recorded graph root missing or unopenable | quarantine that graph; strict mode boot error |
|
||||
| query/policy blob missing or digest-mismatched | boot error (run `cluster refresh` + `apply` to self-heal, then restart) |
|
||||
| policy entry without `applies_to` metadata | boot error ("re-run cluster apply", D3) |
|
||||
| stored query fails type-check against the live schema | boot error (existing `validate_and_attach` behavior) |
|
||||
| stored query fails parse/type-check against the live schema | quarantine that graph; strict mode boot error |
|
||||
| embedding provider configuration for one graph cannot resolve | quarantine that graph; strict mode boot error |
|
||||
| every applied graph is quarantined or fails startup | boot error (`cluster_no_healthy_graphs`) |
|
||||
| state lock held | **not** an error — boot takes no lock; it reads a point-in-time snapshot of an immutable-once-written state file (the CAS discipline means a concurrent apply produces a *new* file atomically; the server reads whichever was current at open) |
|
||||
|
||||
### D5. MCP presentation (`@mcp(expose, tool_name)`) in cluster mode
|
||||
|
|
@ -124,7 +127,7 @@ Rollback is the same switch in reverse — nothing in cluster mode mutates `omni
|
|||
- *Axiom 5*: the server serves deployed reality (applied digests), never desired intent; D3 keeps the ledger the single serving source.
|
||||
- *Axiom 12*: boot reads without the lock but relies on the atomic-replace write discipline; it never writes state.
|
||||
- *Axiom 14 / Stance #9*: the expose-all bridge is named, scoped to cluster mode, and carries its Phase 6 sunset.
|
||||
- *Loud failures (deny-list)*: every degraded condition is a typed boot error with a remedy; no partial serving, no silent fallback to the yaml.
|
||||
- *Loud failures (deny-list)*: every degraded condition is either a typed cluster-global boot error with a remedy or an explicit graph quarantine logged at startup; no silent fallback to the yaml. `--require-all-graphs` is the opt-in all-or-nothing mode for operators who treat any degraded graph as fatal.
|
||||
- *Respect the boundaries*: `omnigraph-cluster` stays free of HTTP; the server reads the catalog through a small read-only loader (either a `pub` read surface on `omnigraph-cluster` or a thin module in the server consuming the documented file formats — implementation picks the one that keeps `omnigraph-cluster` dependency-light; the state/blob formats are already a documented contract).
|
||||
|
||||
## Sequencing
|
||||
|
|
@ -132,7 +135,7 @@ Rollback is the same switch in reverse — nothing in cluster mode mutates `omni
|
|||
| Slice | Scope | Gate |
|
||||
|---|---|---|
|
||||
| **5A: serving metadata in state** | `applies_to` recorded on policy resources at apply + sweep roll-forward; additive state schema; `status`/plan surfacing | In-crate tests: metadata written/rolled-forward; old state parses; re-apply backfills |
|
||||
| **5B: `--cluster` boot mode** | Flag + mode inference rule 0; catalog loader (state → `GraphStartupConfig`s + registries + policy engines); readiness table; OpenAPI regen if surface shifts | Server tests: boot from a converged fixture dir, serve `/graphs/{id}/query` + stored queries + Cedar gates; every D4 row refuses boot; e2e: `cluster apply` then serve — "applied means serving" |
|
||||
| **5B: `--cluster` boot mode** | Flag + mode inference rule 0; catalog loader (state → `GraphStartupConfig`s + registries + policy engines); readiness table; OpenAPI regen if surface shifts | Server tests: boot from a converged fixture dir, serve `/graphs/{id}/query` + stored queries + Cedar gates; D4 cluster-global rows refuse boot; graph-local rows quarantine by default and refuse under `--require-all-graphs`; e2e: `cluster apply` then serve — "applied means serving" |
|
||||
| **5C: docs + caveat retirement** | `cluster-config.md` mode-switch section; `server.md`/`deployment.md`; retire the "not serving" caveats for cluster-mode deployments; migration guide (D6) | `check-agents-md.sh`; doc accuracy review |
|
||||
|
||||
## Exit-criteria coverage
|
||||
|
|
|
|||
|
|
@ -24,8 +24,9 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
| `merge_truth_table.rs` | Merge-pair truth table (MR-786): all 9×9 `(left_op, right_op)` cells from `{noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel}`. Adding a new op to `OpVariant` forces a compile error in `build_case` until the new row + column are dispositioned. 36 executable cells run through real `branch_merge` with a structured oracle (`MergeOutcome` / `MergeConflictKind` + graph-state assert); 45 cells involving `dropProperty`/`addLabel`/`removeLabel` are recorded as `Unsupported` until the mutation grammar grows. |
|
||||
| `writes.rs` | Direct-publish writes: cancellation, non-strict insert/merge rebase under the per-table queue, strict stale-write conflicts, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) |
|
||||
| `staged_writes.rs` | TableStore staged-write primitives (`stage_append`, `stage_merge_insert`, `commit_staged`, `scan_with_staged`, `count_rows_with_staged`) — primitive-level only; engine code uses the in-memory `MutationStaging` accumulator instead |
|
||||
| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
|
||||
| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
|
||||
| `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
|
||||
| `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
|
||||
| `lifecycle.rs` | Graph lifecycle, schema state |
|
||||
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
|
||||
| `changes.rs` | `diff_between` / `diff_commits` |
|
||||
|
|
@ -126,5 +127,14 @@ When you pick up any change, walk through this:
|
|||
6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
|
||||
7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.
|
||||
8. **Verify your change makes an existing test fail before it makes the new one pass.** If you can break the code without breaking a test, your coverage gap is the problem to fix first.
|
||||
9. **Bound hot-path cost at history depth.** If the change touches a read or open path, add or extend a test that asserts a *bounded* cost (e.g. a warm same-branch read performs zero `Dataset::open`, or a fixed object-op count) against a fixture with realistic *commit-history depth*, not just realistic row counts. Cost that scales with history is invisible on a shallow fixture and only bites in production. See "Cost-budget tests" below.
|
||||
|
||||
## Cost-budget tests: bound hot-path cost at history depth
|
||||
|
||||
Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and degrade silently in production. The engine read path historically had no cost assertion, and fixtures carry shallow commit history, so an O(commits)-per-query cost stayed green in CI and only surfaced on a long-lived graph (read snapshot resolution re-scanned the internal manifest and commit-graph tables on every query, and those tables were never compacted). Guard against the class:
|
||||
|
||||
- **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
|
||||
- **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
|
||||
- This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).
|
||||
|
||||
When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change.
|
||||
|
|
|
|||
|
|
@ -178,6 +178,17 @@ are left at `Lance HEAD = manifest_pinned + 1`.
|
|||
post_commit_pin)` it intends to commit + the writer kind +
|
||||
actor_id.
|
||||
2. **Phase B**: writer's per-table `commit_staged` loop runs.
|
||||
- **Phase-B confirmation (`BranchMerge` only)**: a `BranchMerge` writer
|
||||
advances each table's HEAD by *several* commits (append → upsert →
|
||||
delete), so a bare "HEAD moved" is ambiguous — it could be a complete
|
||||
publish or one crashed mid-sequence. After the whole per-table loop
|
||||
finishes, the writer re-writes the sidecar stamping each pin's
|
||||
`confirmed_version` with the exact achieved version, then proceeds to
|
||||
Phase C. This is the commit point of the recovery WAL: a crash *after*
|
||||
confirmation rolls forward to those versions; a crash *during* Phase B
|
||||
(sidecar still unconfirmed) rolls back. Other writers don't confirm —
|
||||
their drift is derived state (index coverage, compaction) that a partial
|
||||
roll-forward never corrupts.
|
||||
3. **Phase C**: publisher commits the manifest.
|
||||
4. **Phase D**: writer deletes the sidecar.
|
||||
|
||||
|
|
@ -197,7 +208,10 @@ recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`:
|
|||
- For each sidecar in `__recovery/`, compare every named table's
|
||||
Lance HEAD to the manifest pin. Classify per the all-or-nothing
|
||||
decision tree (RolledPastExpected / NoMovement / UnexpectedAtP1 /
|
||||
UnexpectedMultistep / InvariantViolation).
|
||||
UnexpectedMultistep / IncompletePhaseB / InvariantViolation). For a
|
||||
`BranchMerge` sidecar, a moved HEAD with no `confirmed_version` classifies
|
||||
as `IncompletePhaseB` (a partial multi-commit publish) and forces roll-back;
|
||||
with a `confirmed_version`, roll-forward targets exactly that version.
|
||||
- If any table is `InvariantViolation` (Lance HEAD < manifest pinned —
|
||||
should be impossible), **abort** with a loud error and leave the
|
||||
sidecar on disk for operator review.
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue