mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-07-03 02:51:04 +02:00
CI / Release Workflows
.github/workflows/:
- ci.yml: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repository PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).

  `Test Workspace` does not run on pull requests. The job is gated `if: github.event_name != 'pull_request'`, so the full workspace + failpoints suite runs only on push to `main` (post-merge), on `v*` tags, and on manual `workflow_dispatch`. This was a deliberate PR-latency trade-off: it was the slowest gate (~15min warm, up to the 75min cold ceiling). `RustFS S3 Integration` has `needs: test`, so it is push-/dispatch-only for the same reason. The fast PR gates remain: `Classify Changes`, `Check AGENTS.md Links`, and `Test omnigraph-server --features aws`. `Test Workspace` is correspondingly not in the required-check list (`.github/branch-protection.json`); see branch-protection.md.
  - Consequences to internalize: (1) a regression that the suite would catch now lands on `main` and turns the post-merge run red, rather than being blocked pre-merge. `main` can briefly break, so run `cargo test --workspace --locked` locally before merging anything non-trivial, or trigger this workflow on your branch via the Actions "Run workflow" button. (2) `openapi.json` is no longer auto-regenerated on PRs (that step is inside the `test` job); for server/API changes, regenerate it locally with `OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi` and commit it, or the strict drift check fails the post-merge `main` run.
  - Applying this policy: removing `Test Workspace` from the JSON is inert until an admin runs `./scripts/apply-branch-protection.sh`. Run it immediately after this change merges. Until then GitHub still requires a `Test Workspace` context that no longer reports on PRs, which leaves every open PR permanently pending (the job-never-reports trap).
- AWS feature build job: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
- Windows binary build job: `cargo build --release --locked -p omnigraph-cli -p omnigraph-server` on windows-latest with smoke checks for `omnigraph.exe version`, `omnigraph-server.exe --help`, and PowerShell installer syntax.
- RustFS S3 integration: spins up RustFS in Docker and runs the bucket-gated S3 suites against it. Sharded across parallel runners (`strategy.matrix.shard`: `engine`=`s3_storage`, `server`=`servers3`, `cluster`=`s3_cluster`, `cli`=`local_cli_s3_end_to_end_init_load_read_flow`, `failpoints`=`failpoints s3_`), one suite per shard with `fail-fast: false` and a per-shard `rust-cache` key. This job carries correctness suites only; the RFC-013 `write_cost_s3` cost gate was removed (cost/perf contracts belong in a dedicated harness, not the correctness path). The tests run in seconds; the wall-clock is the per-shard `cargo test` compile of the engine tree, so on a cold cache (any `Cargo.lock` change) six serial steps summed past the old 75-min timeout. Sharding makes wall-clock the slowest single shard (~40m cold, ~25m warm). It has `needs: test`, so like `Test Workspace` it is push-/dispatch-only. Not a required check.
- release-edge.yml: on every push to main, retags `edge`, builds Linux x86_64 / Linux arm64 / macOS arm64 archives and Windows x86_64 zip + sha256, publishes a rolling prerelease, then smoke-tests the Windows PowerShell installer against `edge`.
- release.yml: on `v*` tags, builds the Linux x86_64 / Linux arm64 / macOS arm64 archives and Windows x86_64 zip release matrix, updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by pushing the regenerated formula to `ModernRelay/homebrew-tap`, and smoke-tests the Windows PowerShell installer against the tag.
- package.yml: manual ECR image build; emits two image tags per commit (`<sha>`, `<sha>-aws`) via CodeBuild.
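
Because `Test Workspace` no longer runs on pull requests, the local pre-merge routine described under ci.yml can be wrapped in a small shell helper. The following is a minimal sketch, not a script shipped in the repository: the function names `run` and `premerge_checks` and the `EXECUTE` / `API_CHANGED` variables are hypothetical, and the trailing `git status --short` is an illustrative addition. Only the two `cargo` commands come from the text above. By default it only prints the commands; it runs them when `EXECUTE=1`.

```shell
#!/bin/sh
# Hypothetical pre-merge helper (an assumption, not part of the repository).
# Prints each command; executes it only when EXECUTE=1.

run() {
  printf '+ %s\n' "$*"
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"
  fi
}

premerge_checks() {
  # Full workspace suite, which CI runs only post-merge, on tags, or on dispatch.
  run cargo test --workspace --locked || return 1

  # For server/API changes: regenerate openapi.json locally so the strict
  # drift check does not fail the post-merge run on main.
  if [ "${API_CHANGED:-0}" = "1" ]; then
    run env OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi || return 1
    # Illustrative extra step: list modified files so the regenerated
    # openapi.json can be committed by hand.
    run git status --short || return 1
  fi
}

premerge_checks
```

Saved under a hypothetical name such as `premerge.sh`, `EXECUTE=1 API_CHANGED=1 sh premerge.sh` would run the suite, regenerate `openapi.json`, and list the files left to commit. Running it with no variables set is a dry run.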