From 98530a0e8ae59971c18459686b9827de248e3a7d Mon Sep 17 00:00:00 2001
From: Andrew Altshuler <andrew@collectivelab.io>
Date: Thu, 2 Jul 2026 01:15:28 +0300
Subject: [PATCH] ci: shard the RustFS S3 integration job across parallel
 runners (#321)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* ci: shard the RustFS S3 integration job across parallel runners

The RustFS S3 Integration job chronically hit its 75-minute timeout (e.g. on
the v0.8.0 release run) and got cancelled. Root cause is compile time, not test
time: the S3 tests each run in seconds (the write_cost_s3 step took 0.2m once the
engine was built), but the job ran six serial `cargo test` steps across four
crates plus a `--features failpoints` rebuild, and on a cold cache (any Cargo.lock
change, e.g. a release version bump) every suite must recompile the omnigraph-engine
+ Lance/DataFusion tree, summing to ~75m.

Split the suites into a `strategy.matrix.shard` (engine / server / cluster / cli /
failpoints), one suite per shard on its own runner with a per-shard rust-cache key
and `fail-fast: false`. Wall-clock becomes the slowest single shard (~40m cold,
~25m warm) instead of the sum. Bundling suites would not help — each crate adds its
own unique-dep compile on top of the shared substrate — so each gets its own shard;
the failpoints shard is isolated because its distinct feature set recompiles the
engine tree. Timeout lowered 75 -> 50 (headroom over the worst cold shard).

The job is renamed `RustFS S3 Integration (<shard>)`; it is not a required check,
so branch protection is unaffected. Docs updated in docs/dev/ci.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci: drop the write_cost_s3 cost gate from the correctness job

The RustFS integration job is a correctness gate. write_cost_s3 is a
deterministic IO-count COST gate (RFC-013 step-3a data-table opener, flat
across commit depth) — a performance contract, not a correctness test.
Cost/perf contracts belong on a dedicated harness with a stable runner and
their own cadence, not on the every-merge correctness path. Remove the step
from the engine shard; a comment + testing.md record how to run it on demand
and note it's pending a dedicated cost harness. The local write_cost.rs
opener/scan-split guard still runs every-PR, so the split stays covered; only
the S3 acceptance of the opener term moves off the correctness path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/ci.yml | 43 ++++++++++++++++++++++++++++++++++++----
 docs/dev/ci.md           |  2 +-
 docs/dev/testing.md      |  5 ++---
 3 files changed, 42 insertions(+), 8 deletions(-)
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index f5462aeb..0a0e841a 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -292,16 +292,38 @@ jobs:
         run: cargo test --locked -p omnigraph-server --features aws
 
   rustfs_integration:
-    name: RustFS S3 Integration
+    name: RustFS S3 Integration (${{ matrix.shard }})
     # `needs: test` means this is push-/dispatch-only too: on pull_request the
     # `test` job is skipped, so this dependent is skipped with it. S3
     # integration runs post-merge on `main`, alongside the workspace suite.
+    #
+    # Sharded across parallel runners (one suite per shard). The S3 tests
+    # themselves run in seconds — the job's wall-clock is almost entirely the
+    # `cargo test` COMPILE, and every suite must build the omnigraph-engine
+    # tree (Lance/DataFusion). On a cold cache (any Cargo.lock change, e.g. a
+    # release version bump) the six suites summed to ~75m and tripped the
+    # timeout. Running each suite on its own runner makes wall-clock the
+    # slowest single shard (~40m cold, ~25m warm) instead of the sum. Bundling
+    # suites would NOT help: each crate adds its own unique-dep compile on top
+    # of the shared substrate, so a combined shard would still approach the sum.
+    # `fail-fast: false` so one shard's failure still lets the others report.
+    # The `failpoints` shard is isolated because `--features failpoints`
+    # compiles a distinct engine variant that must not serialize behind the rest.
     needs:
       - classify_changes
       - test
     if: needs.classify_changes.outputs.run_rustfs_ci == 'true'
     runs-on: ubuntu-latest
-    timeout-minutes: 75
+    timeout-minutes: 50
+    strategy:
+      fail-fast: false
+      matrix:
+        shard:
+          - engine
+          - server
+          - cluster
+          - cli
+          - failpoints
     permissions:
       contents: read
     env:
@@ -332,6 +354,10 @@ jobs:
       - name: Cache Rust build data
         uses: Swatinem/rust-cache@v2
         with:
+          # Per-shard cache: shards compile different targets (and the
+          # failpoints shard a distinct feature set), so keep their caches
+          # separate to avoid cross-shard thrash.
+          key: ${{ matrix.shard }}
           workspaces: |
             . -> target
 
@@ -372,12 +398,18 @@ jobs:
             --bucket "${OMNIGRAPH_S3_TEST_BUCKET}" >/dev/null 2>&1 || true
 
       - name: Run RustFS storage tests
+        if: matrix.shard == 'engine'
         run: cargo test --locked -p omnigraph-engine --test s3_storage -- --nocapture
 
-      - name: Run RustFS write-path cost gate (RFC-013 step 3a opener)
-        run: cargo test --locked -p omnigraph-engine --test write_cost_s3 -- --nocapture
+      # NOTE: the RFC-013 step-3a data-table opener COST gate (write_cost_s3) used
+      # to run here. It is a deterministic IO-count gate, not a correctness test —
+      # performance/cost contracts belong in a dedicated perf harness on a stable
+      # runner + own cadence, not on the every-merge correctness path. Moved out of
+      # CI pending that harness; run it on demand with a bucket set:
+      #   OMNIGRAPH_S3_TEST_BUCKET=… cargo test -p omnigraph-engine --test write_cost_s3
 
       - name: Run RustFS server smoke
+        if: matrix.shard == 'server'
         # No name filter: every test in the s3 target is bucket-gated, and a
         # filter matching nothing passes vacuously (which silently ran zero
         # tests here for a while — the old filter said s3_repo, the test
@@ -385,12 +417,15 @@ jobs:
         run: cargo test --locked -p omnigraph-server --test s3 -- --nocapture
 
       - name: Run RustFS cluster e2e
+        if: matrix.shard == 'cluster'
         run: cargo test --locked -p omnigraph-cluster --test s3_cluster -- --nocapture
 
       - name: Run RustFS CLI smoke
+        if: matrix.shard == 'cli'
         run: cargo test --locked -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow -- --nocapture
 
       - name: Run RustFS recovery-sidecar lifecycle
+        if: matrix.shard == 'failpoints'
         # Sidecar put/list/delete through the S3 storage backend on a
         # real bucket (the failpoint only wedges the publisher; the
         # sidecar I/O is exercised for real). Name filter `s3_` matches
diff --git a/docs/dev/ci.md b/docs/dev/ci.md
index a287481e..aed0eed5 100644
--- a/docs/dev/ci.md
+++ b/docs/dev/ci.md
@@ -8,7 +8,7 @@
   - **Applying this policy:** removing `Test Workspace` from the JSON is inert until an admin runs `./scripts/apply-branch-protection.sh`. **Run it immediately after this change merges** — until then GitHub still requires a `Test Workspace` context that no longer reports on PRs, which leaves every open PR permanently pending (the job-never-reports trap).
 - **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
 - **Windows binary build job**: `cargo build --release --locked -p omnigraph-cli -p omnigraph-server` on windows-latest with smoke checks for `omnigraph.exe version`, `omnigraph-server.exe --help`, and PowerShell installer syntax.
-- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_graph_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.
+- **RustFS S3 integration**: spins up RustFS in Docker and runs the bucket-gated S3 suites against it. **Sharded across parallel runners** (`strategy.matrix.shard`: `engine` = `s3_storage`, `server` = server `s3`, `cluster` = `s3_cluster`, `cli` = `local_cli_s3_end_to_end_init_load_read_flow`, `failpoints` = `failpoints s3_`), one suite per shard with `fail-fast: false` and a per-shard `rust-cache` key. This job carries **correctness** suites only; the RFC-013 `write_cost_s3` **cost** gate was removed (cost/perf contracts belong in a dedicated harness, not the correctness path). The tests run in seconds; the wall-clock is the per-shard `cargo test` **compile** of the engine tree, so on a cold cache (any `Cargo.lock` change) six serial steps summed past the old 75-min timeout — sharding makes wall-clock the slowest single shard (~40m cold, ~25m warm). `needs: test`, so like `Test Workspace` it is push-/dispatch-only. Not a required check.
 - **release-edge.yml**: on every push to main, retags `edge`, builds Linux x86_64 / Linux arm64 / macOS arm64 archives and Windows x86_64 zip + sha256, publishes a rolling prerelease, then smoke-tests the Windows PowerShell installer against `edge`.
 - **release.yml**: on `v*` tags, builds the Linux x86_64 / Linux arm64 / macOS arm64 archives and Windows x86_64 zip release matrix, updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by pushing the regenerated formula to `ModernRelay/homebrew-tap`, and smoke-tests the Windows PowerShell installer against the tag.
 - **package.yml**: manual ECR image build; emits two image tags per commit (`<sha>`, `<sha>-aws`) via CodeBuild.
diff --git a/docs/dev/testing.md b/docs/dev/testing.md
index a40170d9..10b494f8 100644
--- a/docs/dev/testing.md
+++ b/docs/dev/testing.md
@@ -74,10 +74,9 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
 
 ## RustFS / S3 integration
 
-CI runs these S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml` → `rustfs_integration` job):
+CI runs these S3-backed **correctness** tests against a containerized RustFS server (`.github/workflows/ci.yml` → `rustfs_integration` job, sharded one suite per runner):
 
 - `cargo test -p omnigraph-engine --test s3_storage` (lifecycle/branching + the e_tag-present CSR topology cache-key reuse test — the path local FS can't reach since its e_tag is `None`)
-- `cargo test -p omnigraph-engine --test write_cost_s3` (RFC-013 step 3a's data-table opener cost gate — flat across commit depth on S3; the term local FS can't reproduce)
 - `cargo test -p omnigraph-server --test s3` (single-graph serving + config-free `--cluster s3://` boot)
 - `cargo test -p omnigraph-cluster --test s3_cluster` (full control-plane lifecycle on the bucket)
 - `cargo test -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow`
@@ -140,7 +139,7 @@ Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and deg
 
 - **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
 - **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
-- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest` fragment scans, lineage rows included) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
+- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest` fragment scans, lineage rows included) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears. **`write_cost_s3` is a cost (IO-count) gate, not a correctness test, so it was pulled out of the every-merge `rustfs_integration` CI job — run it on demand (`OMNIGRAPH_S3_TEST_BUCKET=… cargo test -p omnigraph-engine --test write_cost_s3`) pending a dedicated cost/perf harness. The local `write_cost.rs` opener/scan-split guard still runs every-PR, so the split itself stays covered; only the S3 acceptance of the opener term is off the correctness path.**
 - **Count on the handle that does the reads, not just the one a measured op opens.** Lance's IO-counted tests attach the `IOTracker` to the (warm, cached) dataset and read `incremental_stats()` per request — the tracker MUST be on the handle performing the reads, or warm-handle reads escape. A per-op tracker installed at measure time cannot see reads on a long-lived handle opened earlier (the warm coordinator's `__manifest` handle, reused across writes), so such reads were silently undercounted. Wrap a depth-swept body in `cost_harness` so the manifest tracker is installed before the graph opens and `manifest_reads` is **ground truth** (handle-age-irrelevant). The `version_probes` counter is the freshness-probe *call* count; ground truth additionally reveals that a write's probe does ~3 object-store RPCs (a read's probe is a 0-IO cache hit). `manifest_reads_capture_warm_probe` is the guard that this stays true.
 - This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).