omnigraph/docs/dev/ci.md
Andrew Altshuler 98530a0e8a
ci: shard the RustFS S3 integration job across parallel runners (#321)
* ci: shard the RustFS S3 integration job across parallel runners

The RustFS S3 Integration job chronically hit its 75-minute timeout (e.g. on
the v0.8.0 release run) and got cancelled. Root cause is compile time, not test
time: the S3 tests each run in seconds (the write_cost_s3 step took 0.2m once the
engine was built), but the job ran six serial `cargo test` steps across four
crates plus a `--features failpoints` rebuild, and on a cold cache (any Cargo.lock
change, e.g. a release version bump) every suite must recompile the omnigraph-engine
+ Lance/DataFusion tree, summing to ~75m.

Split the suites into a `strategy.matrix.shard` (engine / server / cluster / cli /
failpoints), one suite per shard on its own runner with a per-shard rust-cache key
and `fail-fast: false`. Wall-clock becomes the slowest single shard (~40m cold,
~25m warm) instead of the sum. Bundling suites would not help — each crate adds its
own unique-dep compile on top of the shared substrate — so each gets its own shard;
the failpoints shard is isolated because its distinct feature set recompiles the
engine tree. Timeout lowered 75 -> 50 (headroom over the worst cold shard).

The job is renamed `RustFS S3 Integration (<shard>)`; it is not a required check,
so branch protection is unaffected. Docs updated in docs/dev/ci.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci: drop the write_cost_s3 cost gate from the correctness job

The RustFS integration job is a correctness gate. write_cost_s3 is a
deterministic IO-count COST gate (RFC-013 step-3a data-table opener, flat
across commit depth) — a performance contract, not a correctness test.
Cost/perf contracts belong on a dedicated harness with a stable runner and
their own cadence, not on the every-merge correctness path. Remove the step
from the engine shard; a comment + testing.md record how to run it on demand
and note it's pending a dedicated cost harness. The local write_cost.rs
opener/scan-split guard still runs every-PR, so the split stays covered; only
the S3 acceptance of the opener term moves off the correctness path.

2026-07-02 01:15:28 +03:00


CI / Release Workflows

.github/workflows/:

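  • ci.yml: skips text-only changes; otherwise runs cargo test --workspace --locked on ubuntu-latest with the protobuf compiler installed. Includes an OpenAPI-drift check designed to auto-commit the regenerated openapi.json for same-repository PRs; that step lives inside the test job, so it does not run on PRs while Test Workspace is gated off them (see the consequences below). Also runs the AGENTS.md cross-link integrity check (scripts/check-agents-md.sh).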
    • Test Workspace does not run on pull requests. The job is gated with if: github.event_name != 'pull_request', so the full workspace + failpoints suite runs only on push to main (post-merge), on v* tags, and on manual workflow_dispatch. This was a deliberate PR-latency trade-off: it was the slowest gate (~15 min warm, up to the 75 min cold ceiling). RustFS S3 Integration declares needs: test, so it is push-/dispatch-only for the same reason. The fast PR gates remain: Classify Changes, Check AGENTS.md Links, and Test omnigraph-server --features aws. Accordingly, Test Workspace is not in the required-check list (.github/branch-protection.json); see branch-protection.md.
    • Consequences to internalize: (1) A regression that the suite would catch now lands on main and turns the post-merge run red instead of being blocked pre-merge. main can briefly break, so run cargo test --workspace --locked locally before merging anything non-trivial, or trigger this workflow on your branch via the Actions "Run workflow" button. (2) openapi.json is no longer auto-regenerated on PRs (that step is inside the test job). For server/API changes, regenerate it locally with OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi and commit the result; otherwise the strict drift check fails the post-merge main run.
    • Applying this policy: removing Test Workspace from the JSON has no effect until an admin runs ./scripts/apply-branch-protection.sh. Run it immediately after this change merges. Until then, GitHub still requires a Test Workspace context that no longer reports on PRs, which leaves every open PR permanently pending (the job-never-reports trap).
  • AWS feature build job: runs cargo build and cargo test with -p omnigraph-server --features aws on ubuntu-latest.
  • Windows binary build job: cargo build --release --locked -p omnigraph-cli -p omnigraph-server on windows-latest with smoke checks for omnigraph.exe version, omnigraph-server.exe --help, and PowerShell installer syntax.
  • RustFS S3 integration: spins up RustFS in Docker and runs the bucket-gated S3 suites against it. The job is sharded across parallel runners via strategy.matrix.shard (engine = s3_storage, server = server s3, cluster = s3_cluster, cli = local_cli_s3_end_to_end_init_load_read_flow, failpoints = failpoints s3_), one suite per shard, with fail-fast: false and a per-shard rust-cache key. It carries correctness suites only; the RFC-013 write_cost_s3 cost gate was removed because cost/perf contracts belong in a dedicated harness, not on the correctness path. The tests themselves run in seconds; wall-clock time is dominated by each cargo test invocation compiling the engine tree. On a cold cache (any Cargo.lock change), the previous six serial steps summed past the old 75-min timeout, whereas sharding reduces wall-clock to the slowest single shard (~40m cold, ~25m warm). The job declares needs: test, so like Test Workspace it is push-/dispatch-only. Not a required check.
  • release-edge.yml: on every push to main, retags edge, builds Linux x86_64 / Linux arm64 / macOS arm64 archives and Windows x86_64 zip + sha256, publishes a rolling prerelease, then smoke-tests the Windows PowerShell installer against edge.
  • release.yml: on v* tags, builds the Linux x86_64 / Linux arm64 / macOS arm64 archives and Windows x86_64 zip release matrix, updates the Homebrew tap (scripts/update-homebrew-formula.sh) by pushing the regenerated formula to ModernRelay/homebrew-tap, and smoke-tests the Windows PowerShell installer against the tag.
  • package.yml: manual ECR image build; emits two image tags per commit (<sha>, <sha>-aws) via CodeBuild.
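
The gating and sharding described for ci.yml can be pictured with the fragment below. It is a minimal sketch, not a copy of the real workflow: the job id rustfs-s3, the step layout, the cache key prefix, and the scripts/run-s3-shard.sh helper are hypothetical placeholders, and the protobuf and RustFS setup steps are omitted. Only the trigger set, the if: gate, needs: test, the shard names, fail-fast: false, the job name pattern, and the 50-minute timeout come from the text above and the commit that introduced sharding.

```yaml
# Illustrative fragment only; see .github/workflows/ci.yml for the real jobs.
on:
  pull_request:
  push:
    branches: [main]
    tags: ['v*']
  workflow_dispatch:

jobs:
  test:
    name: Test Workspace
    # Skipped on pull requests; runs on push to main, v* tags, and manual dispatch.
    if: github.event_name != 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # (protobuf compiler install omitted)
      - run: cargo test --workspace --locked

  rustfs-s3:  # hypothetical job id
    name: RustFS S3 Integration (${{ matrix.shard }})
    # Depends on the gated job above, so it is skipped whenever that job is skipped.
    needs: test
    runs-on: ubuntu-latest
    timeout-minutes: 50
    strategy:
      # One failing shard does not cancel the others.
      fail-fast: false
      matrix:
        shard: [engine, server, cluster, cli, failpoints]
    steps:
      - uses: actions/checkout@v4
      - uses: Swatinem/rust-cache@v2
        with:
          # Per-shard cache key (prefix is an assumption).
          key: rustfs-s3-${{ matrix.shard }}
      # (RustFS Docker startup omitted)
      # Hypothetical helper that maps the shard name to its cargo test invocation.
      - run: ./scripts/run-s3-shard.sh "${{ matrix.shard }}"
```

Because each matrix entry gets its own runner and cache, total wall-clock is bounded by the slowest shard rather than by the sum of all suites, which is the point of the split.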