diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index d4ecfa5..e937724 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -8,9 +8,9 @@ # CI fails if this file drifts from its source, and rejects PRs that # edit this file directly without also editing the yml. -* @ragnorc +* @ragnorc @aaltshuler -crates/** @ragnorc +crates/** @ragnorc @aaltshuler docs/** @ragnorc README.md @ragnorc AGENTS.md @ragnorc diff --git a/.github/DISCUSSION_TEMPLATE/rfc.yml b/.github/DISCUSSION_TEMPLATE/rfc.yml new file mode 100644 index 0000000..2a63525 --- /dev/null +++ b/.github/DISCUSSION_TEMPLATE/rfc.yml @@ -0,0 +1,34 @@ +labels: ["rfc"] +body: + - type: markdown + attributes: + value: | + Use this to **incubate an RFC** β€” socialize a design and reach rough + consensus before writing the formal document. When it's ready, graduate + it into a pull request that adds `docs/rfcs/NNNN-title.md` + (see [docs/rfcs/README.md](../blob/main/docs/rfcs/README.md)); a + maintainer merging that PR is acceptance. + + For a plain feature request or open-ended idea, use the **Ideas** + category instead. For bugs, open an [Issue](../../issues/new/choose). + - type: textarea + id: problem + attributes: + label: Problem / motivation + description: What needs solving, and why is it worth the long-run cost? + validations: + required: true + - type: textarea + id: sketch + attributes: + label: Proposed direction (sketch) + description: A rough shape of the design. Detail comes later in the RFC document. + validations: + required: true + - type: textarea + id: invariants + attributes: + label: Invariants touched + description: Which items in docs/dev/invariants.md does this affect or risk? Any deny-list brush? + validations: + required: false diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml new file mode 100644 index 0000000..8e19465 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -0,0 +1,55 @@ +name: Bug report +description: Report a reproducible problem or wrong behavior in OmniGraph. +title: "bug: " +labels: ["bug", "needs-triage"] +body: + - type: markdown + attributes: + value: | + Issues are for **reporting problems** β€” concrete, reproducible bugs. + For ideas, feature requests, or questions, please use + [Discussions](../../discussions) instead. + For a security vulnerability, follow [SECURITY.md](../../blob/main/SECURITY.md) β€” do **not** file it here. + + A maintainer will triage this; once labelled **`accepted`** it's open for a pull request + (see [GOVERNANCE.md](../../blob/main/GOVERNANCE.md)). + - type: textarea + id: what-happened + attributes: + label: What happened + description: What went wrong, and what you expected instead. + validations: + required: true + - type: textarea + id: repro + attributes: + label: Steps to reproduce + description: Minimal steps, commands, schema/query, or a failing snippet. + placeholder: | + 1. omnigraph init ... + 2. omnigraph ... + 3. observed: ... / expected: ... + validations: + required: true + - type: input + id: version + attributes: + label: Version + description: Output of `omnigraph --version` (or the engine/crate version) and how you installed it. + validations: + required: true + - type: input + id: environment + attributes: + label: Environment + description: OS, architecture, and storage backend (local FS / S3 / RustFS / MinIO). + validations: + required: false + - type: textarea + id: logs + attributes: + label: Logs / output + description: Relevant error text or logs. Will be rendered as code. + render: shell + validations: + required: false diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100644 index 0000000..50720b8 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1,13 @@ +# Issues are for problem reports only. Disable blank issues so everything is +# routed: bugs through the form, everything else to Discussions / SECURITY.md. +blank_issues_enabled: false +contact_links: + - name: πŸ’‘ Idea, feature request, or RFC + url: https://github.com/ModernRelay/omnigraph/discussions + about: Propose features and designs in Discussions. RFCs graduate from there into a docs/rfcs/ pull request. + - name: ❓ Question or help + url: https://github.com/ModernRelay/omnigraph/discussions + about: Ask in Discussions β€” questions are not tracked as Issues. + - name: πŸ”’ Security vulnerability + url: https://github.com/ModernRelay/omnigraph/blob/main/SECURITY.md + about: Report security issues privately per SECURITY.md β€” never as a public Issue. diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md new file mode 100644 index 0000000..2a548c7 --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -0,0 +1,29 @@ + + +## What & why + + + +## Backing issue / RFC + + + +- [ ] Fixes an **accepted** issue: Closes # +- [ ] Implements / is an **accepted** RFC: +- [ ] **Trivial fast-lane** (typo / docs / dependency bump / comment / one-line CI) β€” no issue/RFC required + +## Checklist + +- [ ] Change is focused (one logical change) +- [ ] Tests added/updated for behavior changes (or N/A) +- [ ] Public docs updated if user-facing surface changed (or N/A) +- [ ] Reviewed against [docs/dev/invariants.md](../blob/main/docs/dev/invariants.md) β€” no Hard Invariant weakened, no deny-list item hit (or justified) + +## Notes for reviewers + + diff --git a/.github/codeowners-roles.yml b/.github/codeowners-roles.yml index c5e36a9..ce4014d 100644 --- a/.github/codeowners-roles.yml +++ b/.github/codeowners-roles.yml @@ -22,6 +22,7 @@ roles: compiler. members: - ragnorc + - aaltshuler docs: description: > diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 8d9c687..2d77ef0 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,10 +1,29 @@ # Contributing -Small bug fixes and documentation improvements are welcome directly through pull -requests. +Thanks for your interest in OmniGraph. This page is the practical how-to; the +rules and decision authority behind it live in [GOVERNANCE.md](GOVERNANCE.md). -For larger changes, please open an issue or design discussion first so the -proposed direction is clear before implementation starts. +## Start in the right place + +| I want to… | Go to | Notes | +|---|---|---| +| **Report a bug** or wrong behavior | **[Open an Issue](../../issues/new/choose)** | Concrete and reproducible. A maintainer triages it; once labelled **`accepted`** it's open for a PR. | +| **Suggest a feature / share an idea / ask** | **[Start a Discussion](../../discussions)** | Ideas and questions live here, not in Issues. | +| **Propose a design / RFC** | **An RFC pull request** | Anyone can author one β€” see [docs/rfcs/README.md](docs/rfcs/README.md). A maintainer merging it is acceptance. | +| **Fix something / implement a change** | **A pull request** | Must link an `accepted` issue or an accepted RFC β€” unless it's trivial (below). | +| **Report a security vulnerability** | **[SECURITY.md](SECURITY.md)** | Do **not** open a public Issue. | + +### When can I just open a PR? +The **trivial fast-lane** β€” open directly, no prior issue/RFC needed: typo and +wording fixes, doc corrections, dependency bumps, comment fixes, obvious +one-line CI tweaks. Anything more substantial needs a backing `accepted` issue +or accepted RFC first, so the *why* is agreed before the *how* is reviewed. A PR +that turns out to be non-trivial will be redirected β€” that's about process, not +the merit of the change. + +> **Maintainers (ModernRelay team)** follow a separate internal process and are +> not bound by the intake rules above. Everyone is bound by review, CODEOWNERS, +> branch protection, and CI. ## Development @@ -49,6 +68,11 @@ CI runs both. ## Pull Requests -- keep changes focused -- include tests for behavior changes when practical -- update public docs when the user-facing surface changes +- **Link the backing issue or RFC** (`Closes #123`, or reference the RFC) β€” or + mark the PR as trivial per the fast-lane. +- Keep changes focused; one logical change per PR. +- Include tests for behavior changes when practical. +- Update public docs when the user-facing surface changes. + +New to the codebase? Read [AGENTS.md](AGENTS.md) β€” the architecture map and the +always-on invariants every change is reviewed against. diff --git a/GOVERNANCE.md b/GOVERNANCE.md new file mode 100644 index 0000000..5878f1f --- /dev/null +++ b/GOVERNANCE.md @@ -0,0 +1,106 @@ +# Governance + +This document describes how **external contributions** to OmniGraph are +proposed, accepted, and merged. It exists so an outside contributor can answer, +without asking: *where does my report/idea/change go, who decides, and what has +to happen before code lands?* + +> **Scope.** This governs the public contribution surface β€” Issues, +> Discussions, RFCs, and pull requests from people outside the ModernRelay +> team. **Maintainers operate under a separate internal process** and are not +> bound by the intake gates below. Everyone, maintainer or not, is still bound +> by the universal gates: branch protection on `main` and CODEOWNERS review +> (see [docs/dev/branch-protection.md](docs/dev/branch-protection.md) and +> [docs/dev/codeowners.md](docs/dev/codeowners.md)). + +## Roles + +| Role | Who | Authority | +|---|---|---| +| **Maintainer** | The code owners in [`.github/CODEOWNERS`](.github/CODEOWNERS) (generated from [`.github/codeowners-roles.yml`](.github/codeowners-roles.yml)) | Validate issues, accept/reject RFCs, review and merge PRs, set direction. Final decision authority. | +| **Contributor** | Anyone else | Report problems (Issues), propose ideas (Discussions), author RFCs, and open pull requests. | + +Decision authority rests with the maintainers. CODEOWNERS is the single source +of truth for who that is; this document does not duplicate the list. + +## The three channels + +Each channel has one job. Using the right one is the first thing we ask of a +contribution. + +| Channel | Purpose | Not for | +|---|---|---| +| **[Issues](../../issues)** | **Report a problem** β€” a bug, a regression, a documented behavior that's wrong. Something concrete and reproducible. | Feature requests, ideas, questions, or design proposals (β†’ Discussions). | +| **[Discussions](../../discussions)** | **Propose and explore** β€” new ideas, feature requests, questions, and the incubation of RFCs. | Bug reports (β†’ Issues). | +| **Pull requests** | **Land a sanctioned change** β€” a fix for a *validated* issue, an *accepted* RFC, or a trivial change (see fast-lane). | Substantive change with no backing issue/RFC β€” it will be redirected. | + +## How a change becomes mergeable + +``` + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ bug ───────────┐ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€ idea / feature ────────┐ + β–Ό β”‚ β–Ό β”‚ + Issue (problem report) β”‚ Discussion (idea / RFC incubation) β”‚ + β”‚ β”‚ β”‚ β”‚ + maintainer triage β”‚ rough consensus β”‚ + β”‚ β”‚ β”‚ graduate β”‚ + β–Ό β”‚ β–Ό β”‚ + label: accepted ──────────┐ β”‚ RFC PR (docs/rfcs/NNNN-*.md) β”‚ + β”‚ β”‚ β”‚ β”‚ β”‚ + β”‚ β”‚ β”‚ maintainer review β”‚ + β–Ό β–Ό β”‚ β–Ό β”‚ + Pull request ◀──────────┴──────────│── merged == accepted β”‚ + (links the issue or the accepted RFC) β—€β”€β”€β”€β”€β”€β”€β”€β”˜ (implementation PRs reference it) β”‚ + β”‚ + review + CODEOWNERS + branch protection + β–Ό + merged +``` + +### Issues β†’ validated +A new issue starts unlabeled. A maintainer triages it and, if it's a real, +in-scope problem, applies the **`accepted`** label. **Only `accepted` issues are +open for a contributor PR.** This prevents the "I fixed an issue you hadn't +agreed was a problem" rejection. Want to fix something? Get the issue accepted +first, or pick one already labelled `accepted` / `help wanted`. + +### Discussions β†’ RFCs β†’ accepted +Ideas and feature requests start in **Discussions**. Anyone β€” including external +contributors β€” may then **author an RFC** by opening a pull request that adds +`docs/rfcs/NNNN-title.md` (see [docs/rfcs/README.md](docs/rfcs/README.md)). The +RFC is reviewed as code; **a maintainer merging it is the act of acceptance** +(it becomes the durable decision record). Implementation PRs then reference the +accepted RFC. + +Authoring an RFC is open to everyone; **accepting one is a maintainer +decision.** Maintainers may also decline an RFC, with rationale, by closing it. + +### Pull requests β†’ sanctioned +A contributor PR must do one of: +1. link a maintainer-**`accepted`** issue it fixes, or +2. be (or reference) an **accepted RFC**, or +3. qualify for the **trivial fast-lane**. + +**Trivial fast-lane** β€” these may be opened directly, no prior issue/RFC: +typo and wording fixes, documentation corrections, dependency bumps, comment +fixes, and obviously-correct one-line CI tweaks. When in doubt, open an Issue or +Discussion first; a PR that turns out to be non-trivial will be asked to. + +A substantive PR with no backing issue/RFC will be closed with a pointer to the +right channel β€” not as a judgment of the idea, but to keep design discussion +where it's reviewable. + +## What maintainers do *not* gate +Maintainers' own changes do not pass through the intake gates above β€” the team +runs a separate internal process. The universal gates (review, CODEOWNERS, +branch protection, CI) apply to everyone. Enforcement of the intake rules is, to +start, **by convention and review** (PR template + labels); an automated check +keyed to author association may be added later if volume warrants. + +## Code of conduct & security +- Conduct: [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md). +- Security issues are **not** public Issues β€” see [SECURITY.md](SECURITY.md). + +## Changing this document +Governance changes the same way code does: a pull request, reviewed by +maintainers. This file describes the external surface; the internal maintainer +process is intentionally out of scope here. diff --git a/crates/omnigraph/src/db/manifest.rs b/crates/omnigraph/src/db/manifest.rs index 7fcf7de..3b2886f 100644 --- a/crates/omnigraph/src/db/manifest.rs +++ b/crates/omnigraph/src/db/manifest.rs @@ -48,6 +48,22 @@ const OBJECT_TYPE_TABLE_VERSION: &str = "table_version"; const OBJECT_TYPE_TABLE_TOMBSTONE: &str = "table_tombstone"; const TABLE_VERSION_MANAGEMENT_KEY: &str = "table_version_management"; +/// Apply pending internal-schema migrations against `__manifest` on the +/// open-for-write path, independent of a publish. +/// +/// `Omnigraph::open(ReadWrite)` calls this before the coordinator reads branch +/// state, so branch-observing code (`branch_list`, the schema-apply +/// blocking-branch checks) sees the post-migration graph. In particular the +/// v2β†’v3 step sweeps legacy `__run__*` staging branches off `__manifest` +/// (MR-770); running it here closes the window where those branches would +/// otherwise block schema apply before the first publish runs the migration. +/// +/// Idempotent: a no-op stamp read when the on-disk version already matches. +pub(crate) async fn migrate_on_open(root_uri: &str) -> Result<()> { + let mut dataset = open_manifest_dataset(root_uri, None).await?; + migrations::migrate_internal_schema(&mut dataset).await +} + /// Immutable point-in-time view of the database. /// /// Cheap to create (no storage I/O). All reads within a query go through one diff --git a/crates/omnigraph/src/db/manifest/migrations.rs b/crates/omnigraph/src/db/manifest/migrations.rs index bbb7995..e2801fe 100644 --- a/crates/omnigraph/src/db/manifest/migrations.rs +++ b/crates/omnigraph/src/db/manifest/migrations.rs @@ -46,7 +46,11 @@ use crate::error::{OmniError, Result}; /// - v2 β€” `__manifest.object_id` carries the unenforced-PK annotation, /// engaging Lance's bloom-filter conflict resolver at commit time. Added /// alongside `expected_table_versions` OCC on `ManifestBatchPublisher::publish`. -pub(super) const INTERNAL_MANIFEST_SCHEMA_VERSION: u32 = 2; +/// - v3 β€” one-time sweep of legacy `__run__` staging branches left on the +/// `__manifest` dataset by the pre-v0.4.0 Run state machine (removed in +/// MR-771). Once swept, the `is_internal_run_branch` defense-in-depth guard +/// is no longer needed (MR-770). +pub(super) const INTERNAL_MANIFEST_SCHEMA_VERSION: u32 = 3; const INTERNAL_SCHEMA_VERSION_KEY: &str = "omnigraph:internal_schema_version"; const OBJECT_ID_PK_KEY: &str = "lance-schema:unenforced-primary-key"; @@ -89,6 +93,10 @@ pub(super) async fn migrate_internal_schema(dataset: &mut Dataset) -> Result<()> migrate_v1_to_v2(dataset).await?; current = 2; } + 2 => { + migrate_v2_to_v3(dataset).await?; + current = 3; + } other => { return Err(OmniError::manifest_internal(format!( "no internal-schema migration registered for v{} β†’ v{}", @@ -122,6 +130,51 @@ async fn migrate_v1_to_v2(dataset: &mut Dataset) -> Result<()> { set_stamp(dataset, 2).await } +/// v2 β†’ v3: sweep legacy `__run__` staging branches off the `__manifest` +/// dataset, then bump the stamp. +/// +/// The pre-v0.4.0 Run state machine (removed in MR-771) created graph-level +/// staging branches named `__run__` on `__manifest`. MR-771 stopped +/// creating them but left any pre-existing ones in place; Lance's +/// `list_branches` still enumerates them, so they leak into `branch_list()` +/// and count as blocking branches at schema-apply time. This one-time sweep +/// removes them so the `is_internal_run_branch` guard can retire (MR-770). +/// +/// The `"__run__"` prefix is inlined here on purpose: this migration must keep +/// working after the `run_registry` module (the guard) is deleted, so it does +/// not depend on it. +/// +/// Idempotent under both sequential retry and concurrent runners: each run +/// re-enumerates `list_branches` fresh, and `force_delete_branch` tolerates a +/// branch that is already gone β€” so a crash before the stamp bump, or a second +/// process opening the same legacy graph at the same time, never errors out. +async fn migrate_v2_to_v3(dataset: &mut Dataset) -> Result<()> { + const LEGACY_RUN_BRANCH_PREFIX: &str = "__run__"; + let branches = dataset + .list_branches() + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + let run_branches: Vec = branches + .into_keys() + .filter(|name| { + name.trim_start_matches('/') + .starts_with(LEGACY_RUN_BRANCH_PREFIX) + }) + .collect(); + for name in run_branches { + // `force_delete_branch` deletes even when the `BranchContents` is + // already gone. Plain `delete_branch` errors "BranchContents not + // found", which would fail a second concurrent open (or a retry that + // raced another runner) after the first one swept the branch. Force is + // exactly Lance's documented path for cleaning up zombie branches. + dataset + .force_delete_branch(&name) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + } + set_stamp(dataset, 3).await +} + async fn set_stamp(dataset: &mut Dataset, version: u32) -> Result<()> { dataset .update_schema_metadata([(INTERNAL_SCHEMA_VERSION_KEY.to_string(), version.to_string())]) diff --git a/crates/omnigraph/src/db/manifest/tests.rs b/crates/omnigraph/src/db/manifest/tests.rs index effa0b5..885a2a8 100644 --- a/crates/omnigraph/src/db/manifest/tests.rs +++ b/crates/omnigraph/src/db/manifest/tests.rs @@ -1461,6 +1461,80 @@ async fn test_publish_migrates_pre_stamp_manifest_to_current_version() { assert!(reopened.snapshot().entry("node:Person").is_some()); } +#[tokio::test] +async fn test_v2_to_v3_sweeps_legacy_run_branches_on_write_open() { + let dir = tempfile::tempdir().unwrap(); + let uri = dir.path().to_str().unwrap(); + let catalog = build_test_catalog(); + let mut mc = ManifestCoordinator::init(uri, &catalog).await.unwrap(); + + // Synthesize a pre-MR-770 graph: several stale `__run__` staging branches + // left on `__manifest` (a real legacy graph accumulates one per run), plus + // a real user branch that must survive the sweep. Multiple run branches + // exercise the migration's delete loop on a single reused dataset handle. + mc.create_branch("__run__01J9LEGACY").await.unwrap(); + mc.create_branch("__run__01J9SECOND").await.unwrap(); + mc.create_branch("__run__01J9THIRD").await.unwrap(); + mc.create_branch("feature").await.unwrap(); + let before = mc.list_branches().await.unwrap(); + assert_eq!( + before.iter().filter(|b| b.starts_with("__run__")).count(), + 3, + "precondition: three legacy run branches exist on __manifest; got {before:?}", + ); + + // Rewind the internal-schema stamp to v2 so the next write-open runs the + // v2 β†’ v3 sweep arm (init stamps at the current version, which is past it). + { + let mut ds = open_manifest_dataset(uri, None).await.unwrap(); + ds.update_schema_metadata([( + "omnigraph:internal_schema_version".to_string(), + Some("2".to_string()), + )]) + .await + .unwrap(); + let post = open_manifest_dataset(uri, None).await.unwrap(); + assert_eq!(super::migrations::read_stamp(&post), 2, "stamp rewound to v2"); + } + + // A no-op publish forces the open-for-write path, which runs the migration. + let mut expected = HashMap::new(); + expected.insert("node:Person".to_string(), 1); + GraphNamespacePublisher::new(uri, None) + .publish(&[], &expected) + .await + .unwrap(); + + // Stamp advanced to current; the legacy run branch is physically gone from + // `__manifest` (checked via the raw, unfiltered manifest list β€” not the + // guard-filtered `branch_list`), and the real branch + `main` survive. + let post = open_manifest_dataset(uri, None).await.unwrap(); + assert_eq!( + super::migrations::read_stamp(&post), + super::migrations::INTERNAL_MANIFEST_SCHEMA_VERSION, + ); + let reopened = ManifestCoordinator::open(uri).await.unwrap(); + let after = reopened.list_branches().await.unwrap(); + assert!( + !after.iter().any(|b| b.starts_with("__run__")), + "legacy run branch must be swept; got {after:?}", + ); + assert!(after.iter().any(|b| b == "feature"), "user branch must survive"); + assert!(after.iter().any(|b| b == "main"), "main must survive"); + + // Idempotent: a second write-open finds the stamp at current and does not + // re-run the sweep or error. + GraphNamespacePublisher::new(uri, None) + .publish(&[], &expected) + .await + .unwrap(); + let final_ds = open_manifest_dataset(uri, None).await.unwrap(); + assert_eq!( + super::migrations::read_stamp(&final_ds), + super::migrations::INTERNAL_MANIFEST_SCHEMA_VERSION, + ); +} + #[tokio::test] async fn test_publish_rejects_manifest_stamped_at_future_version() { let dir = tempfile::tempdir().unwrap(); diff --git a/crates/omnigraph/src/db/mod.rs b/crates/omnigraph/src/db/mod.rs index 8702f88..13e1c74 100644 --- a/crates/omnigraph/src/db/mod.rs +++ b/crates/omnigraph/src/db/mod.rs @@ -3,7 +3,6 @@ pub mod graph_coordinator; pub mod manifest; mod omnigraph; mod recovery_audit; -mod run_registry; mod schema_state; pub(crate) mod write_queue; @@ -15,7 +14,6 @@ pub use omnigraph::{ CleanupPolicyOptions, InitOptions, MergeOutcome, Omnigraph, OpenMode, SchemaApplyOptions, SchemaApplyResult, SkipReason, TableCleanupStats, TableOptimizeStats, }; -pub(crate) use run_registry::is_internal_run_branch; pub(crate) const SCHEMA_APPLY_LOCK_BRANCH: &str = "__schema_apply_lock__"; @@ -69,5 +67,8 @@ pub(crate) fn is_schema_apply_lock_branch(name: &str) -> bool { } pub(crate) fn is_internal_system_branch(name: &str) -> bool { - is_internal_run_branch(name) || is_schema_apply_lock_branch(name) + // Legacy `__run__*` staging branches (Run state machine, removed MR-771) + // are swept off `__manifest` by the v2β†’v3 internal-schema migration, so the + // only internal branch the engine still creates is the schema-apply lock. + is_schema_apply_lock_branch(name) } diff --git a/crates/omnigraph/src/db/omnigraph.rs b/crates/omnigraph/src/db/omnigraph.rs index f10739e..8d23ebb 100644 --- a/crates/omnigraph/src/db/omnigraph.rs +++ b/crates/omnigraph/src/db/omnigraph.rs @@ -347,6 +347,16 @@ impl Omnigraph { mode: OpenMode, ) -> Result { let root = normalize_root_uri(uri)?; + // Apply pending internal-schema migrations before the coordinator reads + // branch state, so `branch_list` and the schema-apply blocking-branch + // checks observe the post-migration graph β€” notably the v2β†’v3 sweep of + // legacy `__run__*` staging branches (MR-770). ReadWrite only: a + // read-only open must not trigger object-store writes, so a read-only + // open of an unmigrated legacy graph still lists `__run__*` until its + // first read-write open (an accepted, documented limitation). + if matches!(mode, OpenMode::ReadWrite) { + crate::db::manifest::migrate_on_open(&root).await?; + } // Open the coordinator first so the schema-staging recovery sweep can // compare its snapshot against any leftover staging files. let mut coordinator = GraphCoordinator::open(&root, Arc::clone(&storage)).await?; @@ -1494,12 +1504,6 @@ pub(crate) fn normalize_branch_name(branch: &str) -> Result> { } pub(crate) fn ensure_public_branch_ref(branch: &str, operation: &str) -> Result<()> { - if super::is_internal_run_branch(branch) { - return Err(OmniError::manifest(format!( - "{} does not allow internal run ref '{}'", - operation, branch - ))); - } if is_internal_system_branch(branch) { return Err(OmniError::manifest(format!( "{} does not allow internal system ref '{}'", @@ -1903,7 +1907,6 @@ fn json_value_from_array(array: &dyn Array, row: usize) -> Result Company #[tokio::test] async fn test_apply_schema_succeeds_after_load() { // Historical: schema apply used to be blocked by leftover - // `__run__` branches. A defense-in-depth filter now skips - // internal system branches, and run branches were made - // ephemeral on every terminal state β€” so in practice no - // `__run__` branch survives publish. The filter still guards - // the invariant. + // `__run__` branches. The Run state machine was removed in + // MR-771, so a fresh graph never creates a `__run__` branch; + // legacy ones are swept by the v2β†’v3 manifest migration. This + // asserts the invariant a current graph upholds: publish leaves + // no `__run__` branch behind, so schema apply proceeds. let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); @@ -2264,8 +2267,8 @@ edge WorksAt: Person -> Company let all_branches = db.coordinator.read().await.all_branches().await.unwrap(); assert!( - !all_branches.iter().any(|b| is_internal_run_branch(b)), - "run branch should be deleted after publish, got: {:?}", + !all_branches.iter().any(|b| b.starts_with("__run__")), + "no __run__ branch should exist after publish, got: {:?}", all_branches ); @@ -2277,6 +2280,56 @@ edge WorksAt: Person -> Company assert!(result.applied, "schema apply should have applied"); } + /// Regression (MR-770): a pre-v0.4.0 graph that still carries a stale + /// `__run__*` branch on `__manifest` must not block schema apply. The + /// v2β†’v3 sweep runs in `Omnigraph::open(ReadWrite)` β€” before the + /// schema-apply blocking-branch check β€” so apply succeeds with no + /// intervening publish. + /// + /// Confirmed to fail before the open-time migration landed: the reopened + /// graph still listed `__run__legacy`, and `apply_schema` returned + /// "found non-main branches: __run__legacy". + #[tokio::test] + async fn legacy_run_branch_is_swept_on_open_and_does_not_block_schema_apply() { + let dir = tempfile::tempdir().unwrap(); + let uri = dir.path().to_str().unwrap(); + let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); + + // Synthesize a legacy graph: a stale `__run__` branch on `__manifest` + // plus the manifest stamp rewound to v2 (pre-sweep). + db.branch_create("__run__legacy").await.unwrap(); + drop(db); + { + let mut ds = lance::Dataset::open(&format!("{}/__manifest", uri)) + .await + .unwrap(); + ds.update_schema_metadata([( + "omnigraph:internal_schema_version".to_string(), + Some("2".to_string()), + )]) + .await + .unwrap(); + } + + // Reopen (ReadWrite): the open-time migration must sweep `__run__legacy` + // before any branch-observing code runs. + let db = Omnigraph::open(uri).await.unwrap(); + let branches = db.branch_list().await.unwrap(); + assert!( + !branches.iter().any(|b| b.starts_with("__run__")), + "open-time migration must sweep legacy __run__ branches; got {branches:?}", + ); + + // Schema apply must proceed with no intervening publish β€” the + // blocking-branch check no longer sees `__run__legacy`. + let desired = TEST_SCHEMA.replace( + " age: I32?\n}", + " age: I32?\n nickname: String?\n}", + ); + let result = db.apply_schema(&desired).await.unwrap(); + assert!(result.applied, "schema apply should have applied"); + } + #[tokio::test] async fn test_apply_schema_adds_index_for_existing_property() { let dir = tempfile::tempdir().unwrap(); diff --git a/crates/omnigraph/src/db/omnigraph/schema_apply.rs b/crates/omnigraph/src/db/omnigraph/schema_apply.rs index 76fa58f..0a01163 100644 --- a/crates/omnigraph/src/db/omnigraph/schema_apply.rs +++ b/crates/omnigraph/src/db/omnigraph/schema_apply.rs @@ -61,11 +61,11 @@ async fn plan_schema_for_apply( ) -> Result { db.ensure_schema_state_valid().await?; let branches = db.coordinator.read().await.all_branches().await?; - // Skip `main` and internal system branches. The schema-apply lock branch - // is excluded because it is the cluster-wide schema-apply serializer. - // `__run__*` branches are no longer created; the filter remains as - // defense-in-depth for legacy graphs with leftover staging branches. - // A future production sweep will let this guard go. + // Skip `main` and internal system branches (the schema-apply lock branch, + // the cluster-wide schema-apply serializer). Legacy `__run__*` staging + // branches were swept off `__manifest` by the v2β†’v3 migration that runs in + // `Omnigraph::open(ReadWrite)` before this check (MR-770), so they no + // longer appear here. let blocking_branches = branches .into_iter() .filter(|branch| branch != "main" && !is_internal_system_branch(branch)) diff --git a/crates/omnigraph/src/db/run_registry.rs b/crates/omnigraph/src/db/run_registry.rs deleted file mode 100644 index ee3d336..0000000 --- a/crates/omnigraph/src/db/run_registry.rs +++ /dev/null @@ -1,16 +0,0 @@ -// The Run state machine has been removed. Mutations now write directly -// to target tables and use the publisher's `expected_table_versions` -// CAS for cross-table OCC; `__run__` staging branches and the -// `_graph_runs.lance` state machine no longer exist. -// -// What remains is the branch-name predicate, kept as a defense-in-depth -// guard against users naming a public branch `__run__*`. A future -// production sweep of legacy `_graph_runs.lance` rows and stale -// `__run__*` branches will let this predicate (and this file) go too. - -pub(crate) const INTERNAL_RUN_BRANCH_PREFIX: &str = "__run__"; - -pub(crate) fn is_internal_run_branch(name: &str) -> bool { - name.trim_start_matches('/') - .starts_with(INTERNAL_RUN_BRANCH_PREFIX) -} diff --git a/crates/omnigraph/src/exec/merge.rs b/crates/omnigraph/src/exec/merge.rs index a161e42..88ec3cb 100644 --- a/crates/omnigraph/src/exec/merge.rs +++ b/crates/omnigraph/src/exec/merge.rs @@ -1092,9 +1092,9 @@ impl Omnigraph { target: &str, actor_id: Option<&str>, ) -> Result { - if is_internal_run_branch(source) || is_internal_run_branch(target) { + if is_internal_system_branch(source) || is_internal_system_branch(target) { return Err(OmniError::manifest(format!( - "branch_merge does not allow internal run refs ('{}' -> '{}')", + "branch_merge does not allow internal system refs ('{}' -> '{}')", source, target ))); } diff --git a/crates/omnigraph/src/exec/mod.rs b/crates/omnigraph/src/exec/mod.rs index a114b78..4076414 100644 --- a/crates/omnigraph/src/exec/mod.rs +++ b/crates/omnigraph/src/exec/mod.rs @@ -35,7 +35,7 @@ use time::format_description::well_known::Rfc3339; use crate::db::commit_graph::CommitGraph; use crate::db::manifest::ManifestCoordinator; -use crate::db::{MergeOutcome, Omnigraph, is_internal_run_branch}; +use crate::db::{MergeOutcome, Omnigraph, is_internal_system_branch}; use crate::db::{ReadTarget, Snapshot}; use crate::embedding::EmbeddingClient; use crate::error::{MergeConflict, MergeConflictKind, OmniError, Result}; diff --git a/crates/omnigraph/src/loader/mod.rs b/crates/omnigraph/src/loader/mod.rs index 9876309..9d7329d 100644 --- a/crates/omnigraph/src/loader/mod.rs +++ b/crates/omnigraph/src/loader/mod.rs @@ -288,21 +288,24 @@ async fn load_jsonl_reader( let mut node_rows: HashMap> = HashMap::new(); let mut edge_rows: HashMap> = HashMap::new(); - for (line_num, line) in reader.lines().enumerate() { - let line = line?; - let line = line.trim(); - if line.is_empty() { - continue; - } - let value: JsonValue = serde_json::from_str(line).map_err(|e| { - OmniError::manifest(format!("invalid JSON on line {}: {}", line_num + 1, e)) + // Parse a stream of JSON values. Accepts both compact JSONL (one object + // per line) and pretty-printed JSON where a single object spans multiple + // lines β€” serde's streaming deserializer treats any whitespace (including + // newlines) between top-level values as a separator. + for (idx, parsed) in serde_json::Deserializer::from_reader(reader) + .into_iter::() + .enumerate() + { + let record_num = idx + 1; + let value: JsonValue = parsed.map_err(|e| { + OmniError::manifest(format!("invalid JSON at record {}: {}", record_num, e)) })?; if let Some(type_name) = value.get("type").and_then(|v| v.as_str()) { if !catalog.node_types.contains_key(type_name) { return Err(OmniError::manifest(format!( - "line {}: unknown node type '{}'", - line_num + 1, + "record {}: unknown node type '{}'", + record_num, type_name ))); } @@ -317,8 +320,8 @@ async fn load_jsonl_reader( } else if let Some(edge_name) = value.get("edge").and_then(|v| v.as_str()) { if catalog.lookup_edge_by_name(edge_name).is_none() { return Err(OmniError::manifest(format!( - "line {}: unknown edge type '{}'", - line_num + 1, + "record {}: unknown edge type '{}'", + record_num, edge_name ))); } @@ -326,14 +329,14 @@ async fn load_jsonl_reader( .get("from") .and_then(|v| v.as_str()) .ok_or_else(|| { - OmniError::manifest(format!("line {}: edge missing 'from'", line_num + 1)) + OmniError::manifest(format!("record {}: edge missing 'from'", record_num)) })? .to_string(); let to = value .get("to") .and_then(|v| v.as_str()) .ok_or_else(|| { - OmniError::manifest(format!("line {}: edge missing 'to'", line_num + 1)) + OmniError::manifest(format!("record {}: edge missing 'to'", record_num)) })? .to_string(); let data = value @@ -347,8 +350,8 @@ async fn load_jsonl_reader( .push((from, to, data)); } else { return Err(OmniError::manifest(format!( - "line {}: expected 'type' or 'edge' field", - line_num + 1 + "record {}: expected 'type' or 'edge' field", + record_num ))); } } diff --git a/crates/omnigraph/tests/writes.rs b/crates/omnigraph/tests/writes.rs index 13cb10f..0a309c9 100644 --- a/crates/omnigraph/tests/writes.rs +++ b/crates/omnigraph/tests/writes.rs @@ -371,11 +371,10 @@ async fn cancelled_mutation_future_leaves_no_state() { // Cancel-safety property: no graph-level run/staging state remains. // - // Note: `branch_list()` already filters `__run__*` via - // `is_internal_system_branch`, so a runtime "no `__run__` branches" check - // would be vacuous. The structural property that no `__run__` branches - // can ever be created is enforced by deletion of `begin_run` etc. in - // (verified by the build itself β€” those symbols no longer exist). + // No `__run__` branches can ever be created: the Run state machine + // (`begin_run` etc.) was deleted in MR-771 β€” verified by the build itself, + // those symbols no longer exist. Any legacy `__run__*` branch on an + // upgraded graph is swept by the v2β†’v3 manifest migration. // // (1) The branch list is unchanged: cancellation/completion cannot // synthesize new public branches. @@ -442,34 +441,40 @@ async fn repeated_loads_do_not_accumulate_branches() { assert_eq!(db.branch_list().await.unwrap(), vec!["main".to_string()]); } -/// User code must not be able to write to internal `__run__*` names. -/// The branch-name guard predicate is kept as defense-in-depth; it -/// will be removed once a future production sweep retires the legacy -/// branches. +/// After MR-770, `__run__*` is an ordinary branch name β€” the Run state machine +/// and its `is_internal_run_branch` guard are gone. The surviving internal-ref +/// guard still rejects the active `__schema_apply_lock__` branch on the public +/// create/merge APIs. #[tokio::test] -async fn public_branch_apis_reject_internal_run_refs() { +async fn public_branch_apis_reject_internal_system_refs() { let dir = tempfile::tempdir().unwrap(); let mut db = init_and_load(&dir).await; - let create_err = db.branch_create("__run__synthetic").await.unwrap_err(); + // `__run__*` is no longer reserved β€” creating it now succeeds. + db.branch_create("__run__formerly_reserved") + .await + .expect("__run__ prefix is a normal branch name post-MR-770"); + + // The schema-apply lock branch is still rejected on public branch APIs. + let create_err = db.branch_create("__schema_apply_lock__").await.unwrap_err(); let OmniError::Manifest(err) = create_err else { panic!("expected Manifest error"); }; assert!( - err.message.contains("internal run ref"), + err.message.contains("internal system ref"), "unexpected error: {}", err.message ); let merge_err = db - .branch_merge("__run__synthetic", "main") + .branch_merge("__schema_apply_lock__", "main") .await .unwrap_err(); let OmniError::Manifest(err) = merge_err else { panic!("expected Manifest error"); }; assert!( - err.message.contains("internal run refs"), + err.message.contains("internal system refs"), "unexpected error: {}", err.message ); diff --git a/docs/dev/codeowners.md b/docs/dev/codeowners.md index 14bba0b..50c4dc7 100644 --- a/docs/dev/codeowners.md +++ b/docs/dev/codeowners.md @@ -14,8 +14,8 @@ The tables below are **generated** from `.github/codeowners-roles.yml` by `.gith | Path | Owners | Role(s) | |---|---|---| -| `*` | @ragnorc | engineering | -| `crates/**` | @ragnorc | engineering | +| `*` | @ragnorc @aaltshuler | engineering | +| `crates/**` | @ragnorc @aaltshuler | engineering | | `docs/**` | @ragnorc | docs | | `README.md` | @ragnorc | docs | | `AGENTS.md` | @ragnorc | docs | @@ -26,7 +26,7 @@ The tables below are **generated** from `.github/codeowners-roles.yml` by `.gith | Role | Members | Description | |---|---|---| -| `engineering` | @ragnorc | All production code under crates/**. Engine, CLI, server, compiler. | +| `engineering` | @ragnorc @aaltshuler | All production code under crates/**. Engine, CLI, server, compiler. | | `docs` | @ragnorc | Documentation under docs/**, plus repo-level docs (README.md, AGENTS.md, CLAUDE.md symlink, SECURITY.md). | diff --git a/docs/dev/index.md b/docs/dev/index.md index 600c969..1e41342 100644 --- a/docs/dev/index.md +++ b/docs/dev/index.md @@ -51,6 +51,18 @@ constraints. User-facing behavior should still be documented through | Install and deployment packaging | [install.md](../user/install.md), [deployment.md](../user/deployment.md) | | Release history | [releases/](../releases/) | +## Contribution & Governance + +| Area | Read | +|---|---| +| How to contribute (external) | [CONTRIBUTING.md](../../CONTRIBUTING.md) | +| Governance model, roles, decision authority | [GOVERNANCE.md](../../GOVERNANCE.md) | +| Public contribution RFC track | [rfcs/](../rfcs/) | + +The `docs/rfcs/` track is the **public, externally-authorable** RFC process. The +maintainer/internal RFCs below (`rfc-00N-*.md`) are a separate, team-owned +track; don't conflate the two. + ## Active Implementation Plans Working documents for in-flight feature work. Removed when the work lands. diff --git a/docs/dev/writes.md b/docs/dev/writes.md index 4f03f2a..c346858 100644 --- a/docs/dev/writes.md +++ b/docs/dev/writes.md @@ -14,8 +14,11 @@ publisher's row-level CAS on `__manifest` is the single fence. - No `RunRecord`, no `_graph_runs.lance`, no `_graph_run_actors.lance`. - No `omnigraph run *` CLI subcommands and no `/runs/*` HTTP endpoints. -- No `__run__` staging branches. (Legacy on-disk artifacts from - pre-MR-771 repos are inert; MR-770 sweeps them in production.) +- No `__run__` staging branches; `__run__*` is no longer a reserved + name. The branch-name guard was removed in MR-770, and any stale + `__run__*` branch on an upgraded graph is swept off `__manifest` by the + v2β†’v3 internal-schema migration on first read-write open. (The inert + `_graph_runs.lance` bytes remain until a `delete_prefix` primitive lands.) - Cancelled mutation futures leave **no graph-level state** β€” only orphaned Lance fragments, which the existing `omnigraph cleanup` pipe reclaims. @@ -241,9 +244,14 @@ list`. ## Migration code -`db/manifest/migrations.rs` does not change. Active deletion of -`_graph_runs.lance` belongs in MR-770 (the production sweep) β€” this PR -stops *creating* run state but does not destroy legacy bytes on disk. +`db/manifest/migrations.rs` carries the v2β†’v3 internal-schema step (MR-770): +a one-time sweep that deletes legacy `__run__*` staging branches off +`__manifest`. It runs in `Omnigraph::open(ReadWrite)` (via +`manifest::migrate_on_open`, before the coordinator reads branch state) and +again on the publisher's write path; both are idempotent once the stamp is at +v3. Deleting the inert `_graph_runs.lance` / `_graph_run_actors.lance` dataset +*bytes* is still deferred β€” it needs a `StorageAdapter::delete_prefix` +primitive β€” but those bytes are invisible to graph-level state. ## Mid-query partial failure: closed by MR-794 diff --git a/docs/releases/v0.6.1.md b/docs/releases/v0.6.1.md index aafe1af..0acc34b 100644 --- a/docs/releases/v0.6.1.md +++ b/docs/releases/v0.6.1.md @@ -7,6 +7,7 @@ v0.6.1 focuses on operational polish after v0.6.0: stored-query registries, safe - **Stored-query registries.** `omnigraph.yaml` can declare curated `queries:` blocks per graph. Servers load and type-check them at startup, `omnigraph queries validate` checks them offline, `omnigraph queries list` shows exposed queries and typed params, `GET /queries` exposes a typed catalog, and `POST /queries/{name}` invokes a stored query without accepting ad hoc `.gq` source from the client. - **Stored-query policy gate.** New Cedar action `invoke_query` gates the stored-query invocation surface. Stored mutations are double-gated: `invoke_query` to reach the stored query and `change` for the actual write. - **Safer branch deletion.** `branch_delete` now treats the manifest as the authority, flips branch visibility atomically, and reclaims per-table/commit-graph forks as derived state. If best-effort reclaim is interrupted, `cleanup` reconciles orphaned forks; reusing a branch name before cleanup reports an actionable error. +- **Legacy `__run__` cleanup (MR-770).** Removed the last functional remnant of the Run state machine (retired in v0.4.0): the `__run__` branch-name guard. A new v2β†’v3 `__manifest` internal-schema migration sweeps any stale `__run__*` staging branches on the first read-write open, so `__run__*` is no longer a reserved branch name. This closes the "unpromoted `__run__` branches block reads" condition behind the zombie-run cascade incident; the inert `_graph_runs.lance` row cleanup is tracked separately (it needs a `delete_prefix` primitive). - **Blob-safe optimize.** `omnigraph optimize` skips tables with `Blob` properties instead of failing the whole sweep on Lance's blob-v2 compaction decode bug. Skips are visible in human output, `--json` as `skipped`, `TableOptimizeStats.skipped`, and logs; non-blob tables still compact normally. - **Deployment improvements.** The container entrypoint now composes `OMNIGRAPH_TARGET_URI` with `OMNIGRAPH_CONFIG`, so operators can keep the graph URI in env while loading policy/query config from a mounted file. The local RustFS bootstrap pins RustFS beta.3 and allows the current insecure local-dev default credentials. - **Windows release support.** Tagged and edge releases now publish Windows x86_64 archives containing `omnigraph.exe` and `omnigraph-server.exe`, with a PowerShell installer and Windows install docs. @@ -17,6 +18,7 @@ v0.6.1 focuses on operational polish after v0.6.0: stored-query registries, safe - A graph selected by name (`--target` or `server.graph`) now uses `graphs..policy` and `graphs..queries`. Top-level `policy` / `queries` blocks are only for anonymous bare-URI single-graph mode; using them with a named graph now fails loudly with migration guidance. - `mcp.expose` defaults to `true` for stored-query registry entries. Set `mcp: { expose: false }` for service-only queries that should not appear in the catalog. - `invoke_query` is graph-scoped, not branch-scoped. Branch/snapshot access remains enforced by the inner `read` / `change` gate. +- **Legacy `__run__` migration.** Graphs created before v0.4.0 are migrated automatically on the first **read-write** open by a v0.6.1 binary (one-time `__manifest` stamp v2β†’v3 sweep of stale `__run__*` branches). No action required. Two caveats: (1) a graph opened **read-only** still lists any stale `__run__*` branch until its first read-write open, since the migration is write-path-only like all manifest migrations β€” long-lived read-only deployments should be opened read-write once after upgrading; (2) the inert `_graph_runs.lance` / `_graph_run_actors.lance` dataset bytes are left in place until a future `delete_prefix` primitive (they are invisible to graph-level state). - Blob tables are not compacted until the upstream Lance fix lands, so fragment count and deleted-row space on blob tables are not reclaimed by `optimize`. Reads, writes, and query results are unaffected; no on-disk migration is required. - `TableOptimizeStats` is now `#[non_exhaustive]` and gains a `skipped: Option` field (so does the new `SkipReason` enum). This is a source-level change only for downstream code that built this returned result struct by literal β€” rare, since it is produced by `optimize` and consumed by reading its fields; field access is unaffected, and `#[non_exhaustive]` keeps future additions non-breaking. diff --git a/docs/rfcs/0000-template.md b/docs/rfcs/0000-template.md new file mode 100644 index 0000000..48f4bda --- /dev/null +++ b/docs/rfcs/0000-template.md @@ -0,0 +1,54 @@ +# RFC NNNN: + +| | | +|---|---| +| **Status** | Proposed | +| **Author(s)** | <your name / handle> | +| **Discussion** | <link to the originating Discussion, if any> | +| **Implementation** | <issue/PR links, filled in as work lands> | + +> Status is maintained by maintainers: `Proposed` while the PR is open, +> `Accepted` on merge, `Declined` on close, `Superseded by NNNN` later. + +## Summary + +One paragraph: what this changes, in plain terms. + +## Motivation + +What problem does this solve, and why is it worth the ongoing cost? Tie it to a +concrete need (a Discussion, a recurring issue, a user request). Per the +project's first principle, argue the *long-run liability*, not just the +short-term convenience. + +## Guide-level explanation + +Explain the change as you'd teach it to a user or contributor: new commands, +syntax, API shapes, behavior. Examples first. + +## Reference-level design + +The precise design: data structures, IR/AST/planner changes, storage/format +impact, migration path, error behavior. Enough that a reviewer can find the +holes. + +## Invariants & deny-list check + +Which Hard Invariants in [../dev/invariants.md](../dev/invariants.md) does this +touch? Does it brush against any deny-list item β€” and if so, why is this the +justified exception? State explicitly that no invariant is weakened, or which +Known Gap moves. + +## Drawbacks & alternatives + +What does this cost, what did you reject, and why. "Do nothing" is a valid +alternative to weigh. + +## Reversibility + +Is this reversible? On-disk/wire/format and substrate choices are near-permanent +and demand more evidence; a CLI flag or doc is cheap to undo. Say which this is. + +## Unresolved questions + +What's deliberately left open for review to settle. diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md new file mode 100644 index 0000000..99cdd76 --- /dev/null +++ b/docs/rfcs/README.md @@ -0,0 +1,66 @@ +# RFCs + +Substantial changes to OmniGraph β€” new user-facing surface, format or protocol +changes, anything irreversible or cross-cutting β€” go through a lightweight RFC +so the design is agreed *as reviewable code* before implementation starts. This +is the public RFC track, open to **anyone, including external contributors**. + +This complements the always-on review bar in +[../dev/invariants.md](../dev/invariants.md): the invariants say *what every +change must respect*; an RFC says *why this particular change is worth making and +how*. + +> **Two tracks, don't conflate them.** This `docs/rfcs/` directory is the +> **public contribution** track (anyone authors; maintainers accept). The +> maintainer-internal RFCs under `docs/dev/rfc-00N-*.md` are a separate, +> team-owned track for in-flight internal work. If you're an outside +> contributor, you're in the right place here. + +## When you need one + +- **RFC required:** new query/schema/CLI/HTTP surface; on-disk or wire-format + changes; a new substrate dependency; anything the deny-list in + [../dev/invariants.md](../dev/invariants.md) flags; anything irreversible + ("reversibility shapes evidence demand"). +- **RFC not required:** bug fixes for an `accepted` issue, and the trivial + fast-lane (typos, docs, deps) β€” see [../../CONTRIBUTING.md](../../CONTRIBUTING.md). + +If you're unsure, start a [Discussion](../../../discussions); a maintainer will +tell you whether it needs an RFC. + +## Lifecycle + +``` +Discussion (incubate, get rough consensus) + β”‚ graduate + β–Ό +RFC pull request β†’ adds docs/rfcs/NNNN-title.md (Status: Proposed) + β”‚ +maintainer review ──▢ changes requested / declined (PR closed, with rationale) + β”‚ + β–Ό +merged == Accepted (the merged file is the durable decision record) + β”‚ + β–Ό +Implementation PR(s) reference the accepted RFC +``` + +- **Author:** anyone. **Acceptance:** a maintainer decision, performed by + merging the RFC PR. Declining is closing it with rationale. +- The merged RFC *is* the accepted record β€” there is no separate sign-off step. +- Later reversals don't edit history: supersede with a new RFC that links back + and flip the old one's `Status` to `Superseded`. + +## Numbering & naming + +- File: `docs/rfcs/NNNN-kebab-title.md`, where `NNNN` is the next free + zero-padded integer (`0001`, `0002`, …). `0000-template.md` is reserved. +- Pick the number when you open the PR; if it collides with another in-flight + RFC, the second to merge bumps theirs. + +## Status values + +`Proposed` (open PR) Β· `Accepted` (merged) Β· `Declined` (closed) Β· +`Superseded by NNNN` Β· `Implemented` (set once the work lands, optional). + +Copy [0000-template.md](0000-template.md) to start. diff --git a/docs/user/audit.md b/docs/user/audit.md index e8abe5b..ab028ac 100644 --- a/docs/user/audit.md +++ b/docs/user/audit.md @@ -4,4 +4,4 @@ - `_as` variants of every write API let callers override the actor: `mutate_as`, `ingest_as`, `branch_merge_as`, `apply_schema_as`, etc. - Actor IDs are persisted on `GraphCommit.actor_id` with split storage in `_graph_commit_actors.lance` (the commit graph is split into `_graph_commits.lance` for the linkage and `_graph_commit_actors.lance` for the actor map). - HTTP server uses the bearer-token actor automatically; CLI uses the local user / explicit env (no implicit actor). -- Pre-v0.4.0 graphs also stored actor IDs on `RunRecord.actor_id` in `_graph_runs.lance` / `_graph_run_actors.lance`. The Run state machine was removed in MR-771; those files are inert post-v0.4.0 and reclaimed by MR-770's production sweep. +- Pre-v0.4.0 graphs also stored actor IDs on `RunRecord.actor_id` in `_graph_runs.lance` / `_graph_run_actors.lance`. The Run state machine was removed in MR-771; those files are inert post-v0.4.0. The v2β†’v3 manifest migration sweeps any stale `__run__*` branches on first write-open (MR-770); the inert dataset bytes remain until a `delete_prefix` primitive lands. diff --git a/docs/user/branches-commits.md b/docs/user/branches-commits.md index c1894f9..0565186 100644 --- a/docs/user/branches-commits.md +++ b/docs/user/branches-commits.md @@ -9,8 +9,8 @@ Lance supports branching at the dataset level: a branch is a named lineage of ve OmniGraph builds *graph branches* on top by branching every sub-table coherently: - `branch_create(name)` / `branch_create_from(target, name)` β€” disallowed name `main`; fails if branch exists; ensures the schema-apply lock is idle. Atomic and authority-first like `branch_delete`: it flips the `__manifest` branch (authority), then creates the derived commit-graph branch, force-dropping any orphaned commit-graph ref left by an incomplete prior delete (the manifest branch is fresh, so a same-named commit-graph branch is provably a zombie). If commit-graph creation fails, the manifest branch is rolled back so the name never half-exists. -- `branch_list()` β€” returns public branches, **filters internal** `__run__…` and `__schema_apply_lock__` prefixes. -- `branch_delete(name)` β€” refuses if there are descendants or active runs on the branch. The manifest is the single authority for branch existence: deletion flips the `__manifest` branch ref first (one atomic op), after which the branch is gone from every snapshot. The owned per-table forks and the commit-graph branch are derived state, reclaimed best-effort with `force_delete_branch` after the flip. A failure during that reclaim (transient object-store error) does not fail the call or block the authority flip; the leftover forks are unreachable orphans that the [`cleanup`](maintenance.md) reconciler converges. One consequence: if a delete's best-effort reclaim fails, reusing that branch name before the next `cleanup` surfaces a clear error pointing at `cleanup` (the stale fork would otherwise collide on first write). +- `branch_list()` β€” returns public branches, **filters the internal** `__schema_apply_lock__` branch. +- `branch_delete(name)` β€” refuses if there are descendants on the branch, or if it is the current branch. The manifest is the single authority for branch existence: deletion flips the `__manifest` branch ref first (one atomic op), after which the branch is gone from every snapshot. The owned per-table forks and the commit-graph branch are derived state, reclaimed best-effort with `force_delete_branch` after the flip. A failure during that reclaim (transient object-store error) does not fail the call or block the authority flip; the leftover forks are unreachable orphans that the [`cleanup`](maintenance.md) reconciler converges. One consequence: if a delete's best-effort reclaim fails, reusing that branch name before the next `cleanup` surfaces a clear error pointing at `cleanup` (the stale fork would otherwise collide on first write). - **Lazy forking**: a branch only forks a sub-table when that sub-table is first mutated on it. Pure-read branches share fragments with their source. A fork collision is classified by the manifest authority, not by Lance branch versions: if the live manifest already records the fork on the active branch, a concurrent first-write won and the caller gets a retryable "refresh and retry"; if the manifest does not, a physical branch there is an orphan and the caller is pointed at `cleanup`. - `sync_branch(branch)` β€” re-binds the in-memory handle to the latest head of the branch. @@ -51,10 +51,10 @@ Notes: ## L2 β€” Internal system branches -Filtered from `branch_list()` but visible to internals: +Internal or legacy branch refs: -- `__schema_apply_lock__` β€” serializes schema migrations. -- `__run__<run-id>` β€” legacy from the pre-v0.4.0 Run state machine (removed in MR-771). The branch-name guard predicate `is_internal_run_branch` is kept as defense-in-depth so users cannot create a branch matching the legacy prefix; the filter will be removed once production legacy branches are swept (MR-770). +- `__schema_apply_lock__` β€” serializes schema migrations; filtered from `branch_list()` but visible to internals. +- `__run__<run-id>` β€” legacy from the pre-v0.4.0 Run state machine (removed in MR-771). These are swept off `__manifest` on the first read-write open by the v2β†’v3 internal-schema migration (MR-770), and `__run__*` is no longer a reserved name. Known limitation: a pre-v0.4.0 graph opened **read-only** still surfaces any stale `__run__*` branch in `branch_list()` until its first read-write open (the migration is write-path-only, like all manifest migrations). ## L2 β€” Recovery audit trail diff --git a/docs/user/constants.md b/docs/user/constants.md index 8f13555..210155e 100644 --- a/docs/user/constants.md +++ b/docs/user/constants.md @@ -4,11 +4,11 @@ |---|---|---| | `MANIFEST_DIR` | `__manifest` | `db/manifest/layout.rs` | | Commit graph dir | `_graph_commits.lance` | `db/commit_graph.rs` | -| Run registry dir (legacy, removed MR-771) | `_graph_runs.lance` | inert post-v0.4.0; reclaimed by MR-770 | -| Run branch prefix (legacy, removed MR-771) | `__run__` | filtered by `is_internal_run_branch` defense-in-depth | +| Run registry dir (legacy, removed MR-771) | `_graph_runs.lance` | inert post-v0.4.0; bytes remain until a `delete_prefix` primitive lands | +| Run branch prefix (legacy, removed MR-771/MR-770) | `__run__` | swept off `__manifest` by the v2β†’v3 migration; no longer a reserved name | | Schema apply lock | `__schema_apply_lock__` | `db/mod.rs` | | Manifest publisher retry budget | `PUBLISHER_RETRY_BUDGET = 5` | `db/manifest/publisher.rs` | -| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` | `db/manifest/migrations.rs` | +| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 3` | `db/manifest/migrations.rs` | | Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | `exec/merge.rs` | | Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | `db/omnigraph/optimize.rs` | | Lance blob compaction support | `LANCE_SUPPORTS_BLOB_COMPACTION = false` | `db/omnigraph/optimize.rs` | diff --git a/docs/user/storage.md b/docs/user/storage.md index c22d4d6..d1c52b5 100644 --- a/docs/user/storage.md +++ b/docs/user/storage.md @@ -22,7 +22,7 @@ OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordin - `edges/{fnv1a64-hex(edge_type_name)}` β€” one Lance dataset per edge type - `__manifest/` β€” the catalog of all sub-tables and their published versions - `_graph_commits.lance` / `_graph_commit_actors.lance` β€” the commit graph and its actor map - - (legacy `_graph_runs.lance` / `_graph_run_actors.lance` from pre-v0.4.0 graphs are inert; the run state machine was removed in MR-771 and these files are cleaned up via MR-770's production sweep) + - (legacy `_graph_runs.lance` / `_graph_run_actors.lance` from pre-v0.4.0 graphs are inert; the run state machine was removed in MR-771. The v2β†’v3 manifest migration sweeps stale `__run__*` branches on first write-open; the inert dataset bytes themselves remain until a `delete_prefix` storage primitive lands) - **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`): - `object_type` ∈ `table | table_version | table_tombstone` - `table_key` ∈ `node:<TypeName> | edge:<EdgeName>` @@ -47,6 +47,7 @@ Adding a new on-disk shape change is one constant bump (`INTERNAL_MANIFEST_SCHEM |---|---| | v1 (implicit, pre-stamp) | `__manifest.object_id` had no PK annotation; publisher had no row-level CAS protection. | | v2 | `__manifest.object_id` carries `lance-schema:unenforced-primary-key=true`; row-level CAS engaged. Stamped as `omnigraph:internal_schema_version=2`. | +| v3 | One-time sweep of legacy `__run__*` staging branches (pre-v0.4.0 Run state machine, removed MR-771) off `__manifest`. Runs at `Omnigraph::open(ReadWrite)` and on publish. Stamped as `omnigraph:internal_schema_version=3`. | ## On-disk layout @@ -91,7 +92,7 @@ flowchart TB - **Graph root** is one directory (or S3 prefix). Everything below is part of one OmniGraph graph. - **`__manifest/`** is a Lance dataset whose rows describe which sub-table version is published at which graph-branch. Reading a snapshot starts here. - **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe. -- **`_graph_commits.lance`** is an L2 dataset that records the graph-level commit DAG, with a paired `_graph_commit_actors.lance` for the actor map. (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; MR-770 sweeps these in production.) +- **`_graph_commits.lance`** is an L2 dataset that records the graph-level commit DAG, with a paired `_graph_commit_actors.lance` for the actor map. (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; the v2β†’v3 migration sweeps their stale `__run__*` branches, and the dataset bytes are reclaimed once `delete_prefix` lands.) - **`_graph_commit_recoveries.lance`** β€” one row per recovery sweep action. Joined to `_graph_commits.lance` by `graph_commit_id`; the linked commit row carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join. See `crates/omnigraph/src/db/recovery_audit.rs`. - **`__recovery/{ulid}.json`** β€” transient sidecar files written by the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) before Phase B begins, deleted after Phase C succeeds. A sidecar persisting after process exit means the writer crashed in the Phase B β†’ Phase C window; the next `Omnigraph::open` recovery sweep processes it. Steady-state directory is empty. See `crates/omnigraph/src/db/manifest/recovery.rs`. - **`_refs/branches/{name}.json`** is graph-level branch metadata β€” pointers from a branch name to the manifest version it heads. diff --git a/scripts/check-agents-md.sh b/scripts/check-agents-md.sh index abc6469..02a177a 100755 --- a/scripts/check-agents-md.sh +++ b/scripts/check-agents-md.sh @@ -34,10 +34,15 @@ PY canonical=() while IFS= read -r line; do canonical+=("$line") -done < <(find docs -type f -name '*.md' ! -path 'docs/releases/*' ! -path 'docs/internal/*' | sort) +done < <(find docs -type f -name '*.md' ! -path 'docs/releases/*' ! -path 'docs/internal/*' ! -path 'docs/rfcs/*' | sort) if [[ -d docs/releases ]]; then canonical+=("docs/releases/") fi +# RFCs are a growing collection (like releases): represent the directory, not +# every per-RFC file. The dir must be linked from an audience index. +if [[ -d docs/rfcs ]]; then + canonical+=("docs/rfcs/") +fi linked=() for index_file in "${index_files[@]}"; do