omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-12 01:45:14 +02:00

Author	SHA1	Message	Date
aaltshuler	7f32e6f1bc	ci: raise Test Workspace timeout to 75 minutes A cold rust-cache (every Cargo.lock change) means a full workspace + failpoints-feature build on the 2-core runner, which now exceeds 45 minutes on slow runner days — and because a timed-out run never saves its cache, an undersized budget self-perpetuates: every retry starts cold and dies identically (observed four consecutive 45-minute cancellations on main and PR #188 after #186's lock bump). Warm-cache runs stay ~15 minutes; 75 is headroom matching the rustfs job's budget, not a target. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 10:38:54 +03:00
aaltshuler	711e04a161	ci: pin RustFS to 1.0.0-beta.8 beta.4+ refuses the rustfsadmin/rustfsadmin test credentials unless RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true is set — acceptable for the ephemeral CI container and the local bootstrap script (which already passed it). The three S3 suites were validated against the beta.8 binary locally before this bump. The pin stays explicit, never `latest`, so future upgrades remain deliberate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:44:05 +03:00
aaltshuler	211b37e6de	test(cluster): failpoint tests for crash-mid-apply and state CAS race The apply-side coverage the implementation spec's hard gate requires before Phase 4 graph-moving apply: - crash after the payload phase: state.json byte-identical, blobs inert on disk, lock released, no phantom statuses, nothing acknowledged; a plain re-run repairs via skip-if-exists blob reuse. - CAS race: a cfg_callback rewrites state.json at the exact read->write window (the state.lock:false concurrent-writer scenario); apply surfaces state_cas_mismatch, acknowledges nothing, reports the persisted status snapshot, leaves the concurrent writer's state on disk; a re-run converges. CI's failpoints step now runs both the engine and cluster suites. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 02:14:06 +03:00
Andrew Altshuler	c2a97f4559	ci: drop per-PR Windows release build; bind to release tags (#155 ) The `test_windows_binaries` job ran a full Windows --release build + smoke test on every code PR. It was a non-required (non-blocking) check, so it never gated a merge — it only burned the slowest/most expensive runner (windows-latest, --release, 75-min ceiling) on every code change. Windows binary validation is already covered (better) on release tags: release.yml's `smoke_windows_installer` (on v* tags) builds the release binaries, installs via scripts/install.ps1, and smoke-runs `omnigraph.exe version` + `omnigraph-server.exe --help` — the same smoke test plus the real installer path. Nothing `needs:` the removed job. Trade-off (accepted): a PR that breaks the Windows build or install.ps1 syntax is now caught at release-cut rather than at PR time. install.ps1 and platform-specific code change rarely; the cost savings on every PR outweigh the earlier signal. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 22:25:33 +03:00
Andrew Altshuler	854ad0afcb	feat(server): compose OMNIGRAPH_TARGET_URI with OMNIGRAPH_CONFIG in entrypoint (#129 ) The container entrypoint's URI and config branches were mutually exclusive, so a deployment driven by OMNIGRAPH_TARGET_URI could never load a policy file. Forward --config alongside the positional URI when OMNIGRAPH_CONFIG is also set (the URI still wins via resolve_target_uri), enabling Cedar policy without changing how the URI is provided. Add docker/entrypoint_test.sh (arg-composition cases) + a CI job, and document the env-var contract in docs/user/deployment.md. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 20:17:55 +01:00
Ragnor Comerford	24413844ae	Add Windows release binaries (#127 ) * Add Windows release binaries * Fix Windows installer downloads	2026-05-30 14:23:40 +02:00
Andrew Altshuler	cb80fa40f1	exec/query: structured Expr pushdown via Scanner::filter_expr (unblocks CompOp::Contains) (#113 ) * exec/query: pushdown IR filters via DataFusion Expr (Scanner::filter_expr) Switches `execute_node_scan` from string-flattened Lance SQL pushdown (`build_lance_filter` + `scanner.filter(&str)`) to structured DataFusion Expr pushdown (`build_lance_filter_expr` + `scanner.filter_expr(Expr)`). ## What this enables 1. `CompOp::Contains` now pushes down. `ir_filter_to_sql` returned `None` for list-contains (the comment said "Can't pushdown list contains") because string SQL can't easily express it. With Expr, it lowers to DataFusion's `array_has(col, value)` builtin via the `nested_expressions` feature, and pushes down to Lance's scan layer the same way Eq/Lt/etc. do. Pinned by the new regression test `end_to_end::ir_filter_with_list_contains_pushes_down`. 2. DataFusion 53's optimizer rules now reach our predicates. Once the Expr lands at the Lance scanner, DF's planner runs: - `IN`-list vectorized eq kernel (DF #20528) - `PhysicalExprSimplifier` (DF #20111) - CASE WHEN x THEN y ELSE NULL shortcut (DF #20097) - Push limit into hash join (DF #20228) None of these were applicable before because the string SQL path short-circuited the optimizer. ## Scope This is one of three string-flattened pushdown sites; the other two (`hydrate_nodes`/Expand pushdown at query.rs:771-796 and the mutation delete path in `exec/mutation.rs::predicate_to_sql`) stay on the SQL string path for now: - The Expand pushdown still serializes through `hydrate_nodes`'s `extra_filter_sql: Option<&str>` parameter. Migrating it changes the `TableStorage` trait surface (`scan_stream(filter: Option<&str>)` → `Option<Expr>`) and the cascading call sites — out of scope for this MR. - The mutation delete predicate still goes through `Dataset::delete(&str)` in Lance 6.0.1. MR-A (delete two-phase via Lance #6658, gated on the Lance v7 bump per issue #112) will migrate that path to `DeleteBuilder::execute_uncommitted` taking an Expr. The existing `ir_filter_to_sql` / `ir_expr_to_sql` / `literal_to_sql` helpers stay in place to serve the remaining string-SQL consumers (mutation predicates). They get retired when the other call sites migrate. ## Cargo Enables the `nested_expressions` feature on the `datafusion` workspace dep. Lance already pulls in `datafusion-functions-nested` transitively (it's listed in their feature set), so this just exposes the `datafusion::functions_nested::expr_fn::array_has` re-export. No transitive dep change (Cargo.lock unchanged). ## Tests - New: `ir_filter_with_list_contains_pushes_down` — pins the case that was previously impossible (`ir_filter_to_sql` returning `None`). - 906/906 workspace tests still pass. - 417/417 engine integration tests pass (was 416 + the new one). - 19/19 failpoints (recovery canary). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: pin rustfs/rustfs to 1.0.0-beta.3 (last known-good before creds-policy break) The RustFS S3 Integration job started failing 2026-05-23 with all 3 tests panicking on the first PUT: HTTP error: error sending request The "Dump RustFS logs on failure" step revealed the container was dying at startup: [FATAL] Server encountered an error and is shutting down: Default root credentials are not allowed on non-loopback listeners; set RUSTFS_ACCESS_KEY and RUSTFS_SECRET_KEY to non-default values, bind to loopback, or set RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true for local development only `rustfs/rustfs:latest` was updated 2026-05-21 (1.0.0-beta.4) with a credentials-policy check that rejects `rustfsadmin`/`rustfsadmin` as "default" values. PR #111 passed yesterday because it ran against beta.3; today's runs against beta.4 fail at container startup. This is unrelated to PR #113's Expr-pushdown refactor — the bump just happened to hit the same week. Pin to 1.0.0-beta.3 (2026-05-14, last tag before the change). The right long-term fix is one of: - Rotate the CI creds to less-default values (less coupling to RustFS's "default" set definition) - Set `RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true` per the error message - Use a workflow service container with controlled lifecycle Deferred — pinning is the minimal restore. Also incidentally documents which version we tested against, which `:latest` never did. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 12:47:33 +01:00
Ragnor Comerford	675568ce85	ci: fold failpoints test into Test Workspace job The standalone test_failpoints_feature job took 21min on first run (cold cache; the omnigraph-engine crate has lance + datafusion deps that make any fresh build expensive). Folding into Test Workspace shares the warm cache so the failpoints invocation is incremental — ~30s vs 21min on subsequent runs, and within the workspace job's existing budget. The failpoints feature is gated behind a Cargo flag and only adds the small `fail` crate dep + a few feature-gated code paths; it doesn't change the dep tree of any other crate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 21:15:14 +02:00
Ragnor Comerford	052b6e680f	MR-794 step 2: address PR #68 follow-up review (Cubic) — pending dedupe + projection guard + CI Three new findings from Cubic on commit `3223b51`: * Pending edge cardinality counted within-input duplicates (P2): count_src_per_edge's pending walk added every row to the count, including duplicate rows that finalize will collapse via dedupe_merge_batches_by_id. A LoadMode::Merge with the same edge id twice would over-count → spurious @card violation. Fix: when dedupe_key_column is Some, walk pending in reverse, track seen keys via HashSet, count only the kept (last-occurrence) rows. Mirrors finalize-time dedupe so cardinality counts what stage_merge_insert actually publishes. * scan_with_pending silently disabled merge-shadow when projection omitted key_column (P2): if a caller passed Some("id") as key_column but their projection didn't include "id", the filter_out_rows_where_string_in helper passed batches through unchanged — silently degrading to union semantics. Fix: validate up front that projection contains key_column when both are Some; return a typed Lance error otherwise. Tightened the helper too: missing column is now an internal error (was a silent passthrough). * Cascade-vs-explicit delete test was too weak (P2): asserted only that edge count decreased after delete. The cascade alone could satisfy that even if the explicit second-delete silently no-op'd. Strengthened: assert post_knows == 0, which only holds when both ops landed (Bob→Diana would survive if op-2 no-op'd). CI gap: also added test_failpoints_feature job to .github/workflows/ci.yml. The workspace test runs without --features failpoints (the feature is behind a Cargo flag), so the failpoints test suite was never exercised by CI before now. The new job builds + runs `cargo test -p omnigraph-engine --features failpoints --test failpoints` on every full CI run, mirroring the test_aws_feature pattern. New tests on tests/runs.rs: * load_merge_mode_dedupes_within_pending_for_cardinality_count (Cubic P2 #2 — pending-vs-pending dedup, distinct from the load_merge_mode_dedupes_edge_for_cardinality_count test which covers committed-vs-pending dedup). * scan_with_pending_rejects_key_column_missing_from_projection (Cubic P2 #3 — verifies the up-front validation rejects bad callers and that the happy path still works correctly). Local test results: * tests/runs.rs: 23/23 passed * tests/failpoints.rs --features failpoints: 7/7 passed (includes the two new finalize→publisher residual tests landed in `3223b51`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 20:47:45 +02:00
Ragnor Comerford	a335d98854	Refactor AGENTS.md from encyclopedia to map; move spec into docs/ Splits the 990-line AGENTS.md into a 184-line map (architecture, where-to-find index, always-on invariants, capability matrix, maintenance contract) plus 18 new docs/*.md files holding the deep content per topic (storage, schema and query languages, indexes, embeddings, branches/commits, runs, merge, changes, execution, policy, server, CLI reference, audit, errors, CI, constants, v0.3.1 notes). Adds scripts/check-agents-md.sh and a check_agents_md CI job that verifies every docs/ link in AGENTS.md resolves and every doc in the canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 23:31:08 +02:00
Ragnor Comerford	bcddbdf485	Test merge commit; push openapi.json via separate clone Restore the default pull_request checkout (refs/pull/N/merge) so tests see the merged state. The openapi.json auto-commit now uses a separate shallow clone of the PR branch, so the pushed commit contains only the spec change rather than the merge-commit tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 12:10:40 +02:00
Ragnor Comerford	a157f6a17c	Fold openapi.json auto-sync into main CI test job The separate openapi-sync workflow was duplicating the workspace build (~15 min cold-cache compile), paying the cost twice per PR. Fold the regen + auto-commit into the existing test job: one compile, shared rust-cache, same drift-check semantics. - Same-repo PRs: OMNIGRAPH_UPDATE_OPENAPI=1 during the test run, then commit the regenerated spec back to the PR branch - Fork PRs / pushes: env var empty, test stays in strict drift-check mode - openapi_spec_is_up_to_date treats empty env value as unset, so the conditional workflow env expression works Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:00:46 +02:00
andrew	7a3bf5c758	Add aws feature + SecretsManagerTokenSource backend Introduces an opt-in AWS Secrets Manager backend for bearer tokens, behind the `aws` Cargo feature. Default builds (on-prem, local dev) don't pull in the AWS SDK and don't pay its compile cost. - New Cargo feature `aws` gates the `aws-config` + `aws-sdk-secretsmanager` optional deps. Default features remain empty. - New `auth::aws::SecretsManagerTokenSource` implements `TokenSource` by fetching a JSON `{"actor_id": "token", ...}` payload from a named Secrets Manager secret. Credentials resolve via the AWS default chain (env, shared config, IMDSv2 instance role, ECS task role) so no explicit plumbing is needed under an IAM role. - New `resolve_token_source()` dispatches based on the `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` env var. If the var is set but the binary was built without `--features aws`, returns a clear rebuild instruction rather than silently falling back. - `serve()` now uses `resolve_token_source()` and logs which source was selected at startup. - `parse_json_secret_payload()` is factored out as a free function so the payload validation (trim whitespace, reject blank actor/token, reject non-object) is unit-testable without the AWS SDK. - New CI job `test_aws_feature` builds + tests with `--features aws`. Not in this PR (follow-ups): - Background refresh loop for rotation. `SecretsManagerTokenSource` advertises `supports_refresh: true` but the AppState-level refresh task isn't wired yet. - Config-YAML dispatch (today the AWS source is selected via env var only; eventually `server.bearer_tokens.source` in `omnigraph.yaml`). Tests: - Default-feature build: 33 lib + 41 integration + 64 openapi. - `--features aws` build: 32 lib (one test is cfg-gated) + 41 + 64. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 03:48:51 +03:00
andrew	33bdab1fcb	Prepare v0.2.2 release	2026-04-14 20:13:00 +03:00
andrew	ff83e97cb5	Scope RustFS CI to relevant changes	2026-04-12 15:33:41 +03:00
andrew	af7a74bf2c	Skip heavy CI on text-only changes	2026-04-11 15:22:11 +03:00
andrew	446075f333	Update workflow actions and add Homebrew install docs	2026-04-11 04:01:39 +03:00
andrew	338289656a	Initial public Omnigraph repository	2026-04-10 20:49:41 +03:00

18 commits