Commit graph

29 commits

Author SHA1 Message Date
Ragnor Comerford
24413844ae
Add Windows release binaries (#127)
* Add Windows release binaries

* Fix Windows installer downloads
2026-05-30 14:23:40 +02:00
Andrew Altshuler
587fbeabd8
ci(publish-crates): set User-Agent + treat "already exists" as success (#117)
Two related fixes uncovered while recovering the v0.5.0 publish.

1. crates.io API requires a User-Agent header. The `publish_if_new`
   skip check was doing a bare `curl -fsSL https://crates.io/api/...`
   which crates.io rejects with HTTP 403. With `-f` curl exits
   non-zero, the pipeline returns empty, the script doesn't recognize
   already-published crates, and we fall through to a real publish
   attempt. On a re-run that means cargo publish errors with
   "already exists on crates.io index" for crates that DID publish
   successfully on the previous run.

   Fix: send a `User-Agent: ModernRelay-omnigraph-ci (URL)` header.

2. Defense in depth: even with the UA, the API could hiccup. If the
   skip check misses an existing version and cargo publish errors
   with "already exists on crates.io index", treat as success
   instead of failing the whole run. This makes the workflow
   re-runnable after any partial publish without needing manual
   intervention.

Both fixes are required to recover from the v0.5.0 partial publish
where omnigraph-compiler@0.5.0 made it through but the run failed
before omnigraph-policy / engine / server / cli — re-triggering the
workflow now succeeds end-to-end.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 14:19:17 +01:00
Andrew Altshuler
1a9f8b1f7f
ci(publish-crates): include omnigraph-policy in the publish list (#116)
omnigraph-policy is a new crate this release cycle (Cedar policy
engine, MR-722). It wasn't added to the publish list when it was
created, so v0.5.0's tag-triggered publish run succeeded for
omnigraph-compiler but failed at omnigraph-engine:

  failed to prepare local package for uploading
  Caused by:
    no matching package named `omnigraph-policy` found
    location searched: crates.io index
    required by package `omnigraph-engine v0.5.0`

omnigraph-policy has no internal omnigraph-* deps so it can publish
after omnigraph-compiler (either could go first). omnigraph-engine
depends on both; server on the three; cli on everything.

publish_if_new is idempotent — re-running with the v0.5.0 tag after
this lands will skip omnigraph-compiler (already published), then
publish policy + engine + server + cli.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 14:09:58 +01:00
Andrew Altshuler
cb80fa40f1
exec/query: structured Expr pushdown via Scanner::filter_expr (unblocks CompOp::Contains) (#113)
* exec/query: pushdown IR filters via DataFusion Expr (Scanner::filter_expr)

Switches `execute_node_scan` from string-flattened Lance SQL pushdown
(`build_lance_filter` + `scanner.filter(&str)`) to structured DataFusion
Expr pushdown (`build_lance_filter_expr` + `scanner.filter_expr(Expr)`).

## What this enables

1. **`CompOp::Contains` now pushes down.** `ir_filter_to_sql` returned
   `None` for list-contains (the comment said *"Can't pushdown list
   contains"*) because string SQL can't easily express it. With Expr,
   it lowers to DataFusion's `array_has(col, value)` builtin via the
   `nested_expressions` feature, and pushes down to Lance's scan layer
   the same way Eq/Lt/etc. do. Pinned by the new regression test
   `end_to_end::ir_filter_with_list_contains_pushes_down`.

2. **DataFusion 53's optimizer rules now reach our predicates.** Once
   the Expr lands at the Lance scanner, DF's planner runs:
   - `IN`-list vectorized eq kernel (DF #20528)
   - `PhysicalExprSimplifier` (DF #20111)
   - CASE WHEN x THEN y ELSE NULL shortcut (DF #20097)
   - Push limit into hash join (DF #20228)
   None of these were applicable before because the string SQL path
   short-circuited the optimizer.

## Scope

This is one of three string-flattened pushdown sites; the other two
(`hydrate_nodes`/Expand pushdown at query.rs:771-796 and the mutation
delete path in `exec/mutation.rs::predicate_to_sql`) stay on the SQL
string path for now:

- The Expand pushdown still serializes through `hydrate_nodes`'s
  `extra_filter_sql: Option<&str>` parameter. Migrating it changes the
  `TableStorage` trait surface (`scan_stream(filter: Option<&str>)` →
  `Option<Expr>`) and the cascading call sites — out of scope for this
  MR.
- The mutation delete predicate still goes through `Dataset::delete(&str)`
  in Lance 6.0.1. MR-A (delete two-phase via Lance #6658, gated on the
  Lance v7 bump per issue #112) will migrate that path to
  `DeleteBuilder::execute_uncommitted` taking an Expr.

The existing `ir_filter_to_sql` / `ir_expr_to_sql` / `literal_to_sql`
helpers stay in place to serve the remaining string-SQL consumers
(mutation predicates). They get retired when the other call sites
migrate.

## Cargo

Enables the `nested_expressions` feature on the `datafusion` workspace
dep. Lance already pulls in `datafusion-functions-nested` transitively
(it's listed in their feature set), so this just exposes the
`datafusion::functions_nested::expr_fn::array_has` re-export. No
transitive dep change (Cargo.lock unchanged).

## Tests

- New: `ir_filter_with_list_contains_pushes_down` — pins the case that
  was previously impossible (`ir_filter_to_sql` returning `None`).
- 906/906 workspace tests still pass.
- 417/417 engine integration tests pass (was 416 + the new one).
- 19/19 failpoints (recovery canary).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: pin rustfs/rustfs to 1.0.0-beta.3 (last known-good before creds-policy break)

The RustFS S3 Integration job started failing 2026-05-23 with all 3
tests panicking on the first PUT:

  HTTP error: error sending request

The "Dump RustFS logs on failure" step revealed the container was
dying at startup:

  [FATAL] Server encountered an error and is shutting down:
  Default root credentials are not allowed on non-loopback listeners;
  set RUSTFS_ACCESS_KEY and RUSTFS_SECRET_KEY to non-default values,
  bind to loopback, or set RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true
  for local development only

`rustfs/rustfs:latest` was updated 2026-05-21 (1.0.0-beta.4) with a
credentials-policy check that rejects `rustfsadmin`/`rustfsadmin` as
"default" values. PR #111 passed yesterday because it ran against
beta.3; today's runs against beta.4 fail at container startup.

This is unrelated to PR #113's Expr-pushdown refactor — the bump
just happened to hit the same week.

Pin to 1.0.0-beta.3 (2026-05-14, last tag before the change). The
right long-term fix is one of:
  - Rotate the CI creds to less-default values (less coupling to
    RustFS's "default" set definition)
  - Set `RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true` per the
    error message
  - Use a workflow service container with controlled lifecycle

Deferred — pinning is the minimal restore. Also incidentally
documents *which* version we tested against, which `:latest` never
did.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 12:47:33 +01:00
Andrew Altshuler
730712b73f
codeowners: yml source of truth + generator + drift CI (#88)
* codeowners: generator + drift CI + initial roles

Source-of-truth approach to CODEOWNERS: yml is hand-edited, CODEOWNERS
is generated and CI-enforced. Every role change is a reviewable PR
with a permanent in-repo audit trail. No GitHub UI clicks, no shadow
state.

Initial roles:

  engineering  @aaltshuler            owns crates/** + default (.github/,
                                       scripts/, Cargo.*, openapi.json,
                                       everything else not docs)

  docs         @aaltshuler @ragnorc   owns docs/**, README.md, AGENTS.md,
                                       CLAUDE.md, SECURITY.md

Per GitHub semantics, multiple owners on a CODEOWNERS line means "any
one satisfies the review" — for docs, either named member can approve.
Strict "N distinct approvers" would need a CI workaround (not wired
today; tracked for future hardening).

Components:

- .github/codeowners-roles.yml — source of truth. Edit this.
- .github/scripts/render-codeowners.py — generator (PyYAML; ~100 LoC).
- .github/CODEOWNERS — generated. CI rejects hand-edits.
- .github/workflows/codeowners.yml — two checks:
  * drift: re-render and assert CODEOWNERS matches.
  * noedit: reject PRs that edit CODEOWNERS without editing the yml.
- docs/codeowners.md — explains the source-of-truth pattern, how to
  change roles, how to add new roles.
- AGENTS.md topic-index row.

What's NOT in this PR:

- Branch protection on main (separate PR; needs `gh api` call against
  the org).
- Required-reviewer enforcement (depends on branch protection landing).
- Required CI status checks (depends on branch protection landing).
- Scheduled rotation (the schedule: block in the yml + a weekly
  workflow). Today's roles are stable; rotation isn't needed yet.
- Linear-as-source-of-truth integration (Approach 4 from the design
  discussion; deferred).

Verified:
- Generator output is deterministic (idempotent re-runs).
- scripts/check-agents-md.sh OK (28 links, 28 docs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* codeowners: fix catch-all ordering (Devin review #88)

Devin caught a real bug: GitHub CODEOWNERS uses "last match wins"
semantics, but the generator emitted the catch-all `*` AFTER specific
patterns. Net effect: `*` won for every file, silently nullifying the
docs role and never routing reviews to @ragnorc.

Fix is one-line — emit the default `*` line before iterating the
specific paths. Also:

- Added a regression assertion in the generator: after rendering, the
  first non-comment line must start with `*` if a default is
  configured. Generator exits non-zero otherwise. Catches the same
  class of mistake in any future refactor.
- Rewrote the yml header comment, which incorrectly stated "keep
  more-specific paths after broader patterns" (correct for GitHub
  semantics but the generator was doing the opposite — so the comment
  read as a description of behavior when it was actually a contradicted
  intention).

Verified by re-rendering: `*` is now line 12, `crates/**` is line 14,
`docs/**` is line 15, etc. README.md matches both `*` and `README.md`;
`README.md` is later → wins → @aaltshuler + @ragnorc both assigned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:26:06 +03:00
Andrew Altshuler
92e3886cfa
ci: add publish-crates workflow for crates.io releases (#74)
The release.yml workflow builds binaries and updates Homebrew but never
published to crates.io — v0.4.0 and v0.4.1 are missing from the registry
even though the local Cargo.toml and the v0.4.1 tag are at 0.4.1.

This adds a separate workflow that:
- auto-publishes on every v* tag push (future releases self-publish)
- can be manually dispatched with a tag input (catch up on v0.4.1)
- is idempotent: skips a crate if its current crates.io version already
  matches local, so a partial failure is safe to retry
- gates on CARGO_REGISTRY_TOKEN (already in repo secrets); skips cleanly
  if the token is ever rotated out

Publishes in dependency order: omnigraph-compiler → omnigraph-engine →
omnigraph-server → omnigraph-cli. Path-only deps in Cargo.toml carry
explicit version fields, so cargo publish strips paths and resolves
against crates.io.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:48:37 +03:00
Ragnor Comerford
675568ce85
ci: fold failpoints test into Test Workspace job
The standalone test_failpoints_feature job took 21min on first run
(cold cache; the omnigraph-engine crate has lance + datafusion deps
that make any fresh build expensive). Folding into Test Workspace
shares the warm cache so the failpoints invocation is incremental —
~30s vs 21min on subsequent runs, and within the workspace job's
existing budget.

The failpoints feature is gated behind a Cargo flag and only adds
the small `fail` crate dep + a few feature-gated code paths; it
doesn't change the dep tree of any other crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:15:14 +02:00
Ragnor Comerford
052b6e680f
MR-794 step 2: address PR #68 follow-up review (Cubic) — pending dedupe + projection guard + CI
Three new findings from Cubic on commit 3223b51:

* **Pending edge cardinality counted within-input duplicates** (P2):
  count_src_per_edge's pending walk added every row to the count,
  including duplicate rows that finalize will collapse via
  dedupe_merge_batches_by_id. A LoadMode::Merge with the same edge id
  twice would over-count → spurious @card violation. Fix: when
  dedupe_key_column is Some, walk pending in reverse, track seen keys
  via HashSet, count only the kept (last-occurrence) rows. Mirrors
  finalize-time dedupe so cardinality counts what stage_merge_insert
  actually publishes.

* **scan_with_pending silently disabled merge-shadow when projection
  omitted key_column** (P2): if a caller passed Some("id") as
  key_column but their projection didn't include "id", the
  filter_out_rows_where_string_in helper passed batches through
  unchanged — silently degrading to union semantics. Fix: validate
  up front that projection contains key_column when both are Some;
  return a typed Lance error otherwise. Tightened the helper too:
  missing column is now an internal error (was a silent passthrough).

* **Cascade-vs-explicit delete test was too weak** (P2): asserted
  only that edge count decreased after delete. The cascade alone
  could satisfy that even if the explicit second-delete silently
  no-op'd. Strengthened: assert post_knows == 0, which only holds
  when both ops landed (Bob→Diana would survive if op-2 no-op'd).

CI gap: also added test_failpoints_feature job to .github/workflows/ci.yml.
The workspace test runs without --features failpoints (the feature is
behind a Cargo flag), so the failpoints test suite was never exercised
by CI before now. The new job builds + runs
`cargo test -p omnigraph-engine --features failpoints --test failpoints`
on every full CI run, mirroring the test_aws_feature pattern.

New tests on tests/runs.rs:

* load_merge_mode_dedupes_within_pending_for_cardinality_count
  (Cubic P2 #2 — pending-vs-pending dedup, distinct from the
  load_merge_mode_dedupes_edge_for_cardinality_count test which
  covers committed-vs-pending dedup).
* scan_with_pending_rejects_key_column_missing_from_projection
  (Cubic P2 #3 — verifies the up-front validation rejects bad
  callers and that the happy path still works correctly).

Local test results:

* tests/runs.rs: 23/23 passed
* tests/failpoints.rs --features failpoints: 7/7 passed (includes the
  two new finalize→publisher residual tests landed in 3223b51).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:47:45 +02:00
Ragnor Comerford
a9430978fb
Merge pull request #60 from ModernRelay/ragnorc/omnigraph-spec
Add AGENTS.md (map) + docs/ knowledge base + CI link check
2026-04-29 00:15:19 +02:00
Ragnor Comerford
a335d98854
Refactor AGENTS.md from encyclopedia to map; move spec into docs/
Splits the 990-line AGENTS.md into a 184-line map (architecture,
where-to-find index, always-on invariants, capability matrix,
maintenance contract) plus 18 new docs/*.md files holding the deep
content per topic (storage, schema and query languages, indexes,
embeddings, branches/commits, runs, merge, changes, execution, policy,
server, CLI reference, audit, errors, CI, constants, v0.3.1 notes).

Adds scripts/check-agents-md.sh and a check_agents_md CI job that
verifies every docs/ link in AGENTS.md resolves and every doc in the
canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:31:08 +02:00
Andrew Altshuler
372f793ad6
Drop macOS x86_64 build target (#55)
Stop producing the omnigraph-macos-x86_64 archive in both the
stable and edge release workflows. The macos-15-intel runner
build was the slowest of the matrix and Apple Silicon is now
the default Mac developer target.

- release.yml + release-edge.yml: drop the macos-15-intel matrix entry
- install.sh: drop the Darwin/x86_64 case so Intel Macs get a clear
  "no prebuilt binary" error instead of attempting an absent download
- update-homebrew-formula.sh: drop the MACOS_X86_* variables and emit
  an arm64-only Homebrew formula. The on_macos block now declares
  `depends_on arch: :arm64` so Intel `brew install` fails fast with
  a clear architecture message instead of installing an arm64 binary
  that errors at exec time.

Linux x86_64 build is unaffected.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:19:26 +03:00
andrew
a1b00e2d06 Fix release.yml: move HOMEBREW_TAP_TOKEN guard into steps
GitHub Actions rejects `secrets.*` in job-level `if:` conditions at
runtime (job-level `if` is evaluated before secrets are available),
causing the workflow to abort in 0s with "workflow file issue" on
every trigger. Moving the guard into a step-level check that writes
`HOMEBREW_TAP_SKIP` to GITHUB_ENV lets the rest of the steps
conditionally no-op when the tap token isn't configured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 19:24:41 +03:00
Ragnor Comerford
567ebe5f24
Merge pull request #24 from ModernRelay/ragnorc/explore-api
Add static OpenAPI spec and clean up operation IDs
2026-04-19 15:36:49 +02:00
Ragnor Comerford
bcddbdf485
Test merge commit; push openapi.json via separate clone
Restore the default pull_request checkout (refs/pull/N/merge) so tests
see the merged state. The openapi.json auto-commit now uses a separate
shallow clone of the PR branch, so the pushed commit contains only the
spec change rather than the merge-commit tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:10:40 +02:00
Ragnor Comerford
a157f6a17c
Fold openapi.json auto-sync into main CI test job
The separate openapi-sync workflow was duplicating the workspace build
(~15 min cold-cache compile), paying the cost twice per PR. Fold the
regen + auto-commit into the existing test job: one compile, shared
rust-cache, same drift-check semantics.

- Same-repo PRs: OMNIGRAPH_UPDATE_OPENAPI=1 during the test run, then
  commit the regenerated spec back to the PR branch
- Fork PRs / pushes: env var empty, test stays in strict drift-check mode
- openapi_spec_is_up_to_date treats empty env value as unset, so the
  conditional workflow env expression works

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:00:46 +02:00
andrew
987c51c376 package caller: pass AWS secrets via secrets: inherit
GitHub Actions doesn't expose the 'secrets' context in 'with:' when
calling a reusable workflow. The companion PR on the shared workflow
(ModernRelay/.github) moves the four AWS values into
on.workflow_call.secrets; this caller drops them from 'with:' and adds
'secrets: inherit' so all four flow through masked.

Trailing from PRs #33 and #34.
2026-04-18 21:54:08 +03:00
andrew
8086a0099c package workflow: read AWS config from secrets, not variables
On a public repo, Actions variables are not masked in workflow logs.
The AWS role ARN and artifact bucket name embed the AWS account ID —
not catastrophic, but norm-preserving to keep them out of public logs.

Switch all four values (region, role, project, bucket) from
`${{ vars.* }}` to `${{ secrets.* }}`. When secrets are passed via
`with:` to a reusable workflow, GitHub's masking still applies because
the value is added to the run's mask list as soon as the secret
reference is resolved.

Followup to #33 — should have landed as secrets from the start.
2026-04-18 21:43:12 +03:00
Ragnor Comerford
9de2079263
Merge remote-tracking branch 'origin/main' into ragnorc/explore-api
# Conflicts:
#	CONTRIBUTING.md
2026-04-18 20:24:39 +02:00
andrew
807c1ba4dc Add manual-dispatch Package workflow for CodeBuild image builds
Invokes the shared omnigraph-package reusable workflow twice per run —
once with default features, once with --features aws — producing two
ECR tags per source commit:

  <sha>         (default features)
  <sha>-aws     (--features aws → SecretsManagerTokenSource)

Manual-dispatch only for now. Neither release.yml nor release-edge.yml
currently invokes the CodeBuild-backed packaging path; this gives
operators a way to produce on-demand image variants without wiring
packaging into the tag/push cadence.

Prerequisites:
- Repo vars AWS_REGION, AWS_ROLE_TO_ASSUME, AWS_CODEBUILD_PACKAGE_PROJECT,
  AWS_ARTIFACT_BUCKET must be set.
- Shared workflow must support the `features` and `image_tag_suffix`
  inputs.

Uses @main as the shared-workflow ref until a versioned tag is cut.
2026-04-18 16:29:43 +03:00
andrew
7a3bf5c758 Add aws feature + SecretsManagerTokenSource backend
Introduces an opt-in AWS Secrets Manager backend for bearer tokens,
behind the `aws` Cargo feature. Default builds (on-prem, local dev)
don't pull in the AWS SDK and don't pay its compile cost.

- New Cargo feature `aws` gates the `aws-config` + `aws-sdk-secretsmanager`
  optional deps. Default features remain empty.
- New `auth::aws::SecretsManagerTokenSource` implements `TokenSource` by
  fetching a JSON `{"actor_id": "token", ...}` payload from a named
  Secrets Manager secret. Credentials resolve via the AWS default chain
  (env, shared config, IMDSv2 instance role, ECS task role) so no
  explicit plumbing is needed under an IAM role.
- New `resolve_token_source()` dispatches based on the
  `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` env var. If the var is set
  but the binary was built without `--features aws`, returns a clear
  rebuild instruction rather than silently falling back.
- `serve()` now uses `resolve_token_source()` and logs which source was
  selected at startup.
- `parse_json_secret_payload()` is factored out as a free function so
  the payload validation (trim whitespace, reject blank actor/token,
  reject non-object) is unit-testable without the AWS SDK.
- New CI job `test_aws_feature` builds + tests with `--features aws`.

Not in this PR (follow-ups):
- Background refresh loop for rotation. `SecretsManagerTokenSource`
  advertises `supports_refresh: true` but the AppState-level refresh
  task isn't wired yet.
- Config-YAML dispatch (today the AWS source is selected via env var
  only; eventually `server.bearer_tokens.source` in `omnigraph.yaml`).

Tests:
- Default-feature build: 33 lib + 41 integration + 64 openapi.
- `--features aws` build: 32 lib (one test is cfg-gated) + 41 + 64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 03:48:51 +03:00
Ragnor Comerford
dda9728473
Add openapi.json auto-sync workflow 2026-04-17 19:09:36 +02:00
andrew
ad7027c7e9 Automate Homebrew tap updates on release tags 2026-04-15 17:57:21 +03:00
andrew
33bdab1fcb Prepare v0.2.2 release 2026-04-14 20:13:00 +03:00
andrew
ff83e97cb5 Scope RustFS CI to relevant changes 2026-04-12 15:33:41 +03:00
andrew
af7a74bf2c Skip heavy CI on text-only changes 2026-04-11 15:22:11 +03:00
andrew
446075f333 Update workflow actions and add Homebrew install docs 2026-04-11 04:01:39 +03:00
andrew
816b24d05e Fix public binary install flow 2026-04-11 02:19:21 +03:00
andrew
cbb312e74f Split binary and source install flows 2026-04-10 23:26:09 +03:00
andrew
338289656a Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00