mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-27 02:39:38 +02:00
* refactor(storage): gate test-only TableStore::append_batch behind cfg(test)
The inherent append_batch is used only by in-source recovery test setup, but
the non-test lib build (cfg(test) off) cannot see those callers and emitted a
dead_code warning. Gating the method #[cfg(test)] silences the false positive
and enforces its own doc contract ("no new engine call sites") by construction
— engine code physically cannot call a cfg(test) method.
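As a minimal sketch of the gating described above (the struct body, `new`, and `row_count` are hypothetical stand-ins, not the real `TableStore`), a method marked `#[cfg(test)]` is simply absent from non-test builds, so no warning is emitted and no engine call site can compile against it:

```rust
/// Hypothetical, simplified stand-in for the real `TableStore`.
pub struct TableStore {
    rows: Vec<u64>,
}

impl TableStore {
    /// Ordinary constructor, present in every build.
    pub fn new() -> Self {
        Self { rows: Vec::new() }
    }

    /// Ordinary accessor, present in every build.
    pub fn row_count(&self) -> usize {
        self.rows.len()
    }

    /// Test-only setup helper. With `cfg(test)` off this item is not
    /// compiled at all, so there is nothing for the dead_code lint to flag
    /// and nothing for non-test code to call.
    #[cfg(test)]
    pub fn append_batch(&mut self, batch: &[u64]) {
        self.rows.extend_from_slice(batch);
    }
}
```

A call to `append_batch` from non-test code would fail to compile with a "no method named `append_batch`" error, which is the "by construction" enforcement the commit refers to.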
* test(failpoints): harden fault-injection harness + reproduce roll-forward CAS race
Hardens the test infrastructure around the process-global `fail` registry, and
adds a deterministic red repro for the open-time recovery sweep's roll-forward
CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next
commit — this commit is intentionally red (rule 12: red→green visible in log).
Harness:
- One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate
is removed and cluster tests reuse the engine type via `omnigraph/failpoints`.
- `#[serial]` on every failpoint test (the registry is process-global, so shared
names interfere under parallelism); `serial_test` added to cluster dev-deps.
- `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release)
replaces fixed-`sleep` cross-thread coordination; the three concurrent tests
now rendezvous deterministically. The reached flag doubles as a fired-assert.
- Compile-checked `failpoints::names` catalog (engine + cluster); every call
site references a const, and `failpoint_names_guard.rs` enforces "no string
literal names" by source-walk, so a typo is a build error not a silent no-fire.
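The park-first / wait-until-reached / release handshake can be sketched with a `Mutex` and a `Condvar` from the standard library. This is an illustration only: the method names and internals below are assumptions, and the real `helpers::failpoint::Rendezvous` may be built differently.

```rust
use std::sync::{Arc, Condvar, Mutex};

#[derive(Default)]
struct State {
    reached: bool,
    released: bool,
}

/// Hypothetical rendezvous: one thread parks at a point, another waits
/// until that point is reached and later releases the parked thread.
#[derive(Clone, Default)]
pub struct Rendezvous {
    inner: Arc<(Mutex<State>, Condvar)>,
}

impl Rendezvous {
    pub fn new() -> Self {
        Self::default()
    }

    /// Called by the thread that hits the failpoint: record arrival,
    /// then block until `release` is called.
    pub fn park(&self) {
        let (lock, cv) = &*self.inner;
        let mut st = lock.lock().unwrap();
        st.reached = true;
        cv.notify_all();
        while !st.released {
            st = cv.wait(st).unwrap();
        }
    }

    /// Called by the test thread: block until the other thread has parked.
    pub fn wait_until_reached(&self) {
        let (lock, cv) = &*self.inner;
        let mut st = lock.lock().unwrap();
        while !st.reached {
            st = cv.wait(st).unwrap();
        }
    }

    /// Let the parked thread continue.
    pub fn release(&self) {
        let (lock, cv) = &*self.inner;
        lock.lock().unwrap().released = true;
        cv.notify_all();
    }

    /// Doubles as a "the failpoint actually fired" assertion.
    pub fn was_reached(&self) -> bool {
        self.inner.0.lock().unwrap().reached
    }
}
```

Because both sides wait on a condition rather than a fixed `sleep`, the ordering does not depend on scheduler timing, and `was_reached` gives the test a way to fail loudly if the failpoint never fired.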
Red repro:
- New `recovery.before_roll_forward_publish` failpoint at the sweep's
classify -> publish-CAS window (the only injection point there).
- `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two
concurrent open-sweeps race one pending sidecar; the sweep parked at the
failpoint loses its publish CAS to the other and fails the open with
`ExpectedVersionMismatch`. FAILS at this commit by design.
* fix(recovery): converge roll-forward when the manifest advances concurrently
The open-time recovery sweep classified a pending sidecar as RolledPastExpected,
then published a manifest CAS at the sidecar's pinned expected_version. Under a
concurrent writer that advanced the manifest past expected during the
classify -> publish window, the CAS failed with ExpectedVersionMismatch and
`?`-propagated, failing the whole Omnigraph::open.
iss-schema-apply-reopen-recovery-race.
A roll-forward's postcondition is "the manifest reflects the sidecar's committed
Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an
ExpectedVersionMismatch, re-read the live manifest and check whether the
sidecar's intent is already satisfied (every pinned table at a version >= the
one we observed and tried to publish; added tables registered; tombstones gone
— sound under the heal-first invariant, documented at the check). If satisfied,
this is convergence: record the RolledForward audit + delete the sidecar
idempotently. If only partway, defer to the next pass. Either way the open no
longer fails. Other errors still propagate; a genuine logical conflict
resurfaces via the classifier's InvariantViolation.
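The satisfied-or-partway decision above can be sketched as a pure check over the sidecar and the re-read manifest. All types and names here are hypothetical, and "tombstones gone" is modeled under one assumed reading (tables the sidecar tombstones are no longer registered); the real check may differ.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical view of a pending sidecar's intent.
pub struct Sidecar {
    /// Table name -> version this sweep observed and tried to publish.
    pub pinned: BTreeMap<String, u64>,
    /// Tables the sidecar adds to the manifest.
    pub added: BTreeSet<String>,
    /// Tables the sidecar expects to be gone (assumed reading).
    pub tombstoned: BTreeSet<String>,
}

/// Hypothetical view of the live manifest: registered table -> version.
pub struct Manifest {
    pub versions: BTreeMap<String, u64>,
}

#[derive(Debug, PartialEq)]
pub enum Resolution {
    /// Intent already satisfied: audit and delete the sidecar.
    Converged,
    /// Only partway there: leave the sidecar for the next pass.
    Deferred,
}

/// Decide what to do after losing the publish CAS. Neither branch
/// fails the open.
pub fn resolve_cas_loss(sidecar: &Sidecar, live: &Manifest) -> Resolution {
    let pinned_ok = sidecar
        .pinned
        .iter()
        .all(|(table, want)| live.versions.get(table).map_or(false, |have| have >= want));
    let added_ok = sidecar
        .added
        .iter()
        .all(|table| live.versions.contains_key(table));
    let tombstones_ok = sidecar
        .tombstoned
        .iter()
        .all(|table| !live.versions.contains_key(table));

    if pinned_ok && added_ok && tombstones_ok {
        Resolution::Converged
    } else {
        Resolution::Deferred
    }
}
```

The `>=` comparison is what makes a lost CAS harmless: a concurrent writer that moved a table past the pinned version still leaves the sidecar's intent satisfied.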
Turns the red repro from the previous commit green. The roll-BACK twin
(iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and
still needs a cross-process lease — the known-gap is updated accordingly.
* Address PR review: harden failpoint name guard + dedupe converge audit
Two issues surfaced in PR review of the failpoint hardening + recovery fix:
1. Name guard had a line-split blind spot. It scanned per line, so a call
wrapped across lines (`park_first(\n "name",\n)`) put the literal on a
different line than the call prefix and bypassed the "no string-literal
failpoint names" check — and one such literal
(`mutation.delete_node_pre_primary_delete`) had slipped through. Make the
guard whitespace/newline-tolerant (skip past the open paren to the first
argument token) so wrapping can't hide a literal, and convert the bypassed
site to the `names::` const.
2. Convergence path could append a duplicate recovery audit. When a
roll-forward publish loses its CAS but the manifest already reached the
sidecar's goal, `converge_or_defer_roll_forward` recorded a RolledForward
audit unconditionally. Under the heal-first invariant, whoever advanced the
manifest already healed this sidecar (audit + delete), so a second row
landed in `_graph_commit_recoveries` for one recovery event. Gate the
audit+delete on the sidecar still being present: absent => the winner
completed it, return success with no duplicate row. The convergence
regression test now asserts exactly one audit row.
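For the first review item, a whitespace-tolerant scan can be sketched as follows. The function name and signature are hypothetical, and this is not the logic in `failpoint_names_guard.rs`; it only illustrates skipping past the open paren to the first argument token instead of matching per line.

```rust
/// Return the byte offsets of every `call_prefix` occurrence (for example
/// `"park_first("`) whose first argument is a string literal, however the
/// call is wrapped across lines.
pub fn literal_name_sites(source: &str, call_prefix: &str) -> Vec<usize> {
    let mut hits = Vec::new();
    let mut from = 0;
    while let Some(rel) = source[from..].find(call_prefix) {
        let start = from + rel;
        let after = start + call_prefix.len();
        // Skip spaces, tabs, and newlines between the paren and the argument.
        let first_token = source[after..].trim_start();
        if first_token.starts_with('"') {
            hits.push(start);
        }
        from = after;
    }
    hits
}
```

This sketch ignores comments and raw-string literals between the paren and the argument; a production guard would need to decide how to treat those.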
* docs(dev): remove the schema-apply recovery-flake handoff (fixed by this PR)
The handoff was a transient investigation note for
`iss-schema-apply-reopen-recovery-race`, which this PR fixes (the converge
helper + the red→green regression). Its rationale now lives durably in the
dev-graph issue, the PR/commit history, and invariants.md, so the handoff is
obsolete. Drop the doc, its dev-index row, and the dangling reference from the
RFC-013 handoff; the doc cross-link check stays green.
* fix(recovery): include added-table registrations in the converge audit
The CAS-loss convergence audit built outcomes only from `sidecar.tables`,
omitting the `additional_registrations` that the normal `roll_forward_all`
audit includes. For a SchemaApply sidecar with added types, a converge-path
audit row would be incomplete versus the normal roll-forward path for the same
recovery kind. Mirror the roll-forward outcome construction (append a
registration outcome per added table) so both paths emit the same audit shape.
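The shape of that fix can be sketched as one shared outcome builder used by both paths. The enum, its variants, and the function signature are hypothetical; only `sidecar.tables` and `additional_registrations` come from the commit text.

```rust
/// Hypothetical audit outcome rows.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Published { table: String, version: u64 },
    Registered { table: String },
}

/// Build audit outcomes from the pinned tables, then append one
/// registration outcome per added table, so the converge path and the
/// normal roll-forward path produce the same audit shape.
pub fn audit_outcomes(tables: &[(&str, u64)], additional_registrations: &[&str]) -> Vec<Outcome> {
    let mut out: Vec<Outcome> = tables
        .iter()
        .map(|&(table, version)| Outcome::Published {
            table: table.to_string(),
            version,
        })
        .collect();
    out.extend(
        additional_registrations
            .iter()
            .map(|&table| Outcome::Registered {
                table: table.to_string(),
            }),
    );
    out
}
```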
102 lines
6.8 KiB
Markdown
# Developer Docs

**Audience:** contributors, maintainers, and coding agents

This is the contributor-facing entry point. These docs explain architecture,
invariants, implementation contracts, test ownership, and upstream Lance
constraints. User-facing behavior should still be documented through
[docs/user/index.md](../user/index.md) and the relevant public reference docs.

## Required For Every Non-Trivial Change

| Need | Read |
|---|---|
| Architectural rules, known gaps, deny-list | [invariants.md](invariants.md) |
| Upstream Lance source-of-truth index | [lance.md](lance.md) |
| Existing test coverage and test placement | [testing.md](testing.md) |

## Architecture And Storage

| Area | Read |
|---|---|
| System structure, L1/L2 framing, component diagrams | [architecture.md](architecture.md) |
| On-disk layout, manifest schema, URI behavior | [storage.md](../user/concepts/storage.md) |
| Direct-publish writes, D2, staged writes, recovery sidecars | [writes.md](writes.md) |
| Query execution, mutation execution, loader flow | [execution.md](execution.md) |
| Index lifecycle and graph topology indexes | [indexes.md](../user/search/indexes.md) |
| Branch and commit internals | [branches-commits.md](../user/branching/index.md) |
| Three-way merge implementation and conflicts | [merge.md](merge.md) |
| Diff/change-feed implementation | [changes.md](../user/branching/changes.md) |
| Branch protection policy | [branch-protection.md](branch-protection.md) |

## Language, Runtime, And Boundaries

| Area | Read |
|---|---|
| Schema grammar, catalog, migration planner | [schema-language.md](../user/schema/index.md) |
| Query grammar, IR, lints, mutation restrictions | [query-language.md](../user/queries/index.md) |
| Embedding client and `@embed` integration | [embeddings.md](../user/search/embeddings.md) |
| Cedar policy surface and server gating | [policy.md](../user/operations/policy.md) |
| Server auth, OpenAPI, endpoint handlers | [server.md](../user/operations/server.md) |
| Error taxonomy and serialization | [errors.md](../user/operations/errors.md) |
| Constants and tunables | [constants.md](../user/reference/constants.md) |
| Transaction model public contract | [transactions.md](../user/branching/transactions.md) |
| User-doc coherence cleanup ledger | [docs-issues.md](docs-issues.md) |

## Project Operations

| Area | Read |
|---|---|
| CI and release workflows | [ci.md](ci.md) |
| Install and deployment packaging | [install.md](../user/install.md), [deployment.md](../user/deployment.md) |
| Release history | [releases/](../releases/) |

## Contribution & Governance

| Area | Read |
|---|---|
| How to contribute (external) | [CONTRIBUTING.md](../../CONTRIBUTING.md) |
| Governance model, roles, decision authority | [GOVERNANCE.md](../../GOVERNANCE.md) |
| Public contribution RFC track | [rfcs/](../rfcs/) |

The `docs/rfcs/` track is the **public, externally-authorable** RFC process. The
maintainer/internal RFCs below (`rfc-00N-*.md`) are a separate, team-owned
track; don't conflate the two.

## Case Studies

Worked write-ups of specific bugs: root cause, fix, and the reasoning that
ruled out the tempting-but-wrong alternatives. Read these for the debugging
pattern, not just the outcome.

| Area | Read |
|---|---|
| camelCase property filters lowercased at runtime (#283): two engine→Lance boundaries, two different fixes | [bug-case-fix.md](bug-case-fix.md) |

## Active Implementation Plans

Working documents for in-flight feature work. Removed when the work lands.

| Area | Read |
|---|---|
| Schema-lint chassis v1 (MR-694): `--allow-data-loss`, soft/hard drops | [schema-lint-v1-plan.md](schema-lint-v1-plan.md) |
| Inline + stored queries, request/response envelope, MCP (MR-656 / MR-976 / MR-969) | [rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) |
| Config & CLI architecture: layered config, client targeting, file naming (MR-973 / MR-974 / MR-981) | [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md) |
| MCP server surface: full tool parity, stored queries, modular auth (MR-969 / MR-956 / MR-974) | [rfc-003-mcp-server-surface.md](rfc-003-mcp-server-surface.md) |
| Future cluster control plane: declarative as-code config, JSON state ledger, reconciler | [cluster-config-specs.md](cluster-config-specs.md), [cluster-axioms.md](cluster-axioms.md), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) |
| Cluster graph & schema apply: Phase 4 sidecars, roll-forward recovery, approval artifacts | [rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md) |
| Server boots from cluster state: Phase 5 mode switch, applied-revision serving | [rfc-005-server-cluster-boot.md](rfc-005-server-cluster-boot.md) |
| Per-operator config: `~/.omnigraph/` identity, keyed credentials, named servers (the operator slice of RFC-002) | [rfc-007-operator-config.md](rfc-007-operator-config.md) |
| Deprecate `omnigraph.yaml`: one concern per config surface; key-by-key migration map and staged retirement | [rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md) |
| Unify CLI embedded/remote access paths: parity referee, shared wire-DTO crate, `GraphClient` trait, declared plane capabilities | [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) |
| Restructure the CLI around explicit planes: one graph-addressing model, declared capability surface, plane-grouped help (expands RFC-009 Phase 4) | [rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md) |
| CLI refactoring: one addressing & config model post-`omnigraph.yaml`, covering scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) |
| Provider-independent embedding configuration: one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
| Write-path latency: capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
| RFC-013 handoff: current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work | [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) |

## Boundary

Developer docs may mention implementation details, stale gaps, upstream Lance
blockers, and review rules. User docs should not require that context unless
the detail changes the public contract.