ci: shard the RustFS S3 integration job across parallel runners (#321)

* ci: shard the RustFS S3 integration job across parallel runners

The RustFS S3 Integration job chronically hit its 75-minute timeout (e.g. on
the v0.8.0 release run) and got cancelled. Root cause is compile time, not test
time: the S3 tests each run in seconds (the write_cost_s3 step took 0.2m once the
engine was built), but the job ran six serial `cargo test` steps across four
crates plus a `--features failpoints` rebuild, and on a cold cache (any Cargo.lock
change, e.g. a release version bump) every suite must recompile the omnigraph-engine
+ Lance/DataFusion tree, summing to ~75m.

Split the suites into a `strategy.matrix.shard` (engine / server / cluster / cli /
failpoints), one suite per shard on its own runner with a per-shard rust-cache key
and `fail-fast: false`. Wall-clock becomes the slowest single shard (~40m cold,
~25m warm) instead of the sum. Bundling suites would not help — each crate adds its
own unique-dep compile on top of the shared substrate — so each gets its own shard;
the failpoints shard is isolated because its distinct feature set recompiles the
engine tree. Timeout lowered 75 -> 50 (headroom over the worst cold shard).

The job is renamed `RustFS S3 Integration (<shard>)`; it is not a required check,
so branch protection is unaffected. Docs updated in docs/dev/ci.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci: drop the write_cost_s3 cost gate from the correctness job

The RustFS integration job is a correctness gate. write_cost_s3 is a
deterministic IO-count COST gate (RFC-013 step-3a data-table opener, flat
across commit depth) — a performance contract, not a correctness test.
Cost/perf contracts belong on a dedicated harness with a stable runner and
their own cadence, not on the every-merge correctness path. Remove the step
from the engine shard; a comment + testing.md record how to run it on demand
and note it's pending a dedicated cost harness. The local write_cost.rs
opener/scan-split guard still runs every-PR, so the split stays covered; only
the S3 acceptance of the opener term moves off the correctness path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Andrew Altshuler 2026-07-02 01:15:28 +03:00 committed by GitHub
parent 23e838ffa8
commit 98530a0e8a
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 42 additions and 8 deletions

View file

@ -8,7 +8,7 @@
- **Applying this policy:** removing `Test Workspace` from the JSON is inert until an admin runs `./scripts/apply-branch-protection.sh`. **Run it immediately after this change merges** — until then GitHub still requires a `Test Workspace` context that no longer reports on PRs, which leaves every open PR permanently pending (the job-never-reports trap).
- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
- **Windows binary build job**: `cargo build --release --locked -p omnigraph-cli -p omnigraph-server` on windows-latest with smoke checks for `omnigraph.exe version`, `omnigraph-server.exe --help`, and PowerShell installer syntax.
- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_graph_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.
- **RustFS S3 integration**: spins up RustFS in Docker and runs the bucket-gated S3 suites against it. **Sharded across parallel runners** (`strategy.matrix.shard`: `engine` = `s3_storage`, `server` = server `s3`, `cluster` = `s3_cluster`, `cli` = `local_cli_s3_end_to_end_init_load_read_flow`, `failpoints` = `failpoints s3_`), one suite per shard with `fail-fast: false` and a per-shard `rust-cache` key. This job carries **correctness** suites only; the RFC-013 `write_cost_s3` **cost** gate was removed (cost/perf contracts belong in a dedicated harness, not the correctness path). The tests run in seconds; the wall-clock is the per-shard `cargo test` **compile** of the engine tree, so on a cold cache (any `Cargo.lock` change) six serial steps summed past the old 75-min timeout — sharding makes wall-clock the slowest single shard (~40m cold, ~25m warm). `needs: test`, so like `Test Workspace` it is push-/dispatch-only. Not a required check.
- **release-edge.yml**: on every push to main, retags `edge`, builds Linux x86_64 / Linux arm64 / macOS arm64 archives and Windows x86_64 zip + sha256, publishes a rolling prerelease, then smoke-tests the Windows PowerShell installer against `edge`.
- **release.yml**: on `v*` tags, builds the Linux x86_64 / Linux arm64 / macOS arm64 archives and Windows x86_64 zip release matrix, updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by pushing the regenerated formula to `ModernRelay/homebrew-tap`, and smoke-tests the Windows PowerShell installer against the tag.
- **package.yml**: manual ECR image build; emits two image tags per commit (`<sha>`, `<sha>-aws`) via CodeBuild.