Commit graph

14 commits

Author SHA1 Message Date
Andrew Altshuler
854ad0afcb
feat(server): compose OMNIGRAPH_TARGET_URI with OMNIGRAPH_CONFIG in entrypoint (#129)
The container entrypoint's URI and config branches were mutually
exclusive, so a deployment driven by OMNIGRAPH_TARGET_URI could never
load a policy file. Forward --config alongside the positional URI when
OMNIGRAPH_CONFIG is also set (the URI still wins via resolve_target_uri),
enabling Cedar policy without changing how the URI is provided.

Add docker/entrypoint_test.sh (arg-composition cases) + a CI job, and
document the env-var contract in docs/user/deployment.md.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 20:17:55 +01:00
Ragnor Comerford
24413844ae
Add Windows release binaries (#127)
* Add Windows release binaries

* Fix Windows installer downloads
2026-05-30 14:23:40 +02:00
Andrew Altshuler
cb80fa40f1
exec/query: structured Expr pushdown via Scanner::filter_expr (unblocks CompOp::Contains) (#113)
* exec/query: pushdown IR filters via DataFusion Expr (Scanner::filter_expr)

Switches `execute_node_scan` from string-flattened Lance SQL pushdown
(`build_lance_filter` + `scanner.filter(&str)`) to structured DataFusion
Expr pushdown (`build_lance_filter_expr` + `scanner.filter_expr(Expr)`).

## What this enables

1. **`CompOp::Contains` now pushes down.** `ir_filter_to_sql` returned
   `None` for list-contains (the comment said *"Can't pushdown list
   contains"*) because string SQL can't easily express it. With Expr,
   it lowers to DataFusion's `array_has(col, value)` builtin via the
   `nested_expressions` feature, and pushes down to Lance's scan layer
   the same way Eq/Lt/etc. do. Pinned by the new regression test
   `end_to_end::ir_filter_with_list_contains_pushes_down`.

2. **DataFusion 53's optimizer rules now reach our predicates.** Once
   the Expr lands at the Lance scanner, DF's planner runs:
   - `IN`-list vectorized eq kernel (DF #20528)
   - `PhysicalExprSimplifier` (DF #20111)
   - CASE WHEN x THEN y ELSE NULL shortcut (DF #20097)
   - Push limit into hash join (DF #20228)
   None of these were applicable before because the string SQL path
   short-circuited the optimizer.

## Scope

This is one of three string-flattened pushdown sites; the other two
(`hydrate_nodes`/Expand pushdown at query.rs:771-796 and the mutation
delete path in `exec/mutation.rs::predicate_to_sql`) stay on the SQL
string path for now:

- The Expand pushdown still serializes through `hydrate_nodes`'s
  `extra_filter_sql: Option<&str>` parameter. Migrating it changes the
  `TableStorage` trait surface (`scan_stream(filter: Option<&str>)` →
  `Option<Expr>`) and the cascading call sites — out of scope for this
  MR.
- The mutation delete predicate still goes through `Dataset::delete(&str)`
  in Lance 6.0.1. MR-A (delete two-phase via Lance #6658, gated on the
  Lance v7 bump per issue #112) will migrate that path to
  `DeleteBuilder::execute_uncommitted` taking an Expr.

The existing `ir_filter_to_sql` / `ir_expr_to_sql` / `literal_to_sql`
helpers stay in place to serve the remaining string-SQL consumers
(mutation predicates). They get retired when the other call sites
migrate.

## Cargo

Enables the `nested_expressions` feature on the `datafusion` workspace
dep. Lance already pulls in `datafusion-functions-nested` transitively
(it's listed in their feature set), so this just exposes the
`datafusion::functions_nested::expr_fn::array_has` re-export. No
transitive dep change (Cargo.lock unchanged).

## Tests

- New: `ir_filter_with_list_contains_pushes_down` — pins the case that
  was previously impossible (`ir_filter_to_sql` returning `None`).
- 906/906 workspace tests still pass.
- 417/417 engine integration tests pass (was 416 + the new one).
- 19/19 failpoints (recovery canary).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: pin rustfs/rustfs to 1.0.0-beta.3 (last known-good before creds-policy break)

The RustFS S3 Integration job started failing 2026-05-23 with all 3
tests panicking on the first PUT:

  HTTP error: error sending request

The "Dump RustFS logs on failure" step revealed the container was
dying at startup:

  [FATAL] Server encountered an error and is shutting down:
  Default root credentials are not allowed on non-loopback listeners;
  set RUSTFS_ACCESS_KEY and RUSTFS_SECRET_KEY to non-default values,
  bind to loopback, or set RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true
  for local development only

`rustfs/rustfs:latest` was updated 2026-05-21 (1.0.0-beta.4) with a
credentials-policy check that rejects `rustfsadmin`/`rustfsadmin` as
"default" values. PR #111 passed yesterday because it ran against
beta.3; today's runs against beta.4 fail at container startup.

This is unrelated to PR #113's Expr-pushdown refactor — the bump
just happened to hit the same week.

Pin to 1.0.0-beta.3 (2026-05-14, last tag before the change). The
right long-term fix is one of:
  - Rotate the CI creds to less-default values (less coupling to
    RustFS's "default" set definition)
  - Set `RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true` per the
    error message
  - Use a workflow service container with controlled lifecycle

Deferred — pinning is the minimal restore. Also incidentally
documents *which* version we tested against, which `:latest` never
did.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 12:47:33 +01:00
Ragnor Comerford
675568ce85
ci: fold failpoints test into Test Workspace job
The standalone test_failpoints_feature job took 21min on first run
(cold cache; the omnigraph-engine crate has lance + datafusion deps
that make any fresh build expensive). Folding into Test Workspace
shares the warm cache so the failpoints invocation is incremental —
~30s vs 21min on subsequent runs, and within the workspace job's
existing budget.

The failpoints feature is gated behind a Cargo flag and only adds
the small `fail` crate dep + a few feature-gated code paths; it
doesn't change the dep tree of any other crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:15:14 +02:00
Ragnor Comerford
052b6e680f
MR-794 step 2: address PR #68 follow-up review (Cubic) — pending dedupe + projection guard + CI
Three new findings from Cubic on commit 3223b51:

* **Pending edge cardinality counted within-input duplicates** (P2):
  count_src_per_edge's pending walk added every row to the count,
  including duplicate rows that finalize will collapse via
  dedupe_merge_batches_by_id. A LoadMode::Merge with the same edge id
  twice would over-count → spurious @card violation. Fix: when
  dedupe_key_column is Some, walk pending in reverse, track seen keys
  via HashSet, count only the kept (last-occurrence) rows. Mirrors
  finalize-time dedupe so cardinality counts what stage_merge_insert
  actually publishes.

* **scan_with_pending silently disabled merge-shadow when projection
  omitted key_column** (P2): if a caller passed Some("id") as
  key_column but their projection didn't include "id", the
  filter_out_rows_where_string_in helper passed batches through
  unchanged — silently degrading to union semantics. Fix: validate
  up front that projection contains key_column when both are Some;
  return a typed Lance error otherwise. Tightened the helper too:
  missing column is now an internal error (was a silent passthrough).

* **Cascade-vs-explicit delete test was too weak** (P2): asserted
  only that edge count decreased after delete. The cascade alone
  could satisfy that even if the explicit second-delete silently
  no-op'd. Strengthened: assert post_knows == 0, which only holds
  when both ops landed (Bob→Diana would survive if op-2 no-op'd).

CI gap: also added test_failpoints_feature job to .github/workflows/ci.yml.
The workspace test runs without --features failpoints (the feature is
behind a Cargo flag), so the failpoints test suite was never exercised
by CI before now. The new job builds + runs
`cargo test -p omnigraph-engine --features failpoints --test failpoints`
on every full CI run, mirroring the test_aws_feature pattern.

New tests on tests/runs.rs:

* load_merge_mode_dedupes_within_pending_for_cardinality_count
  (Cubic P2 #2 — pending-vs-pending dedup, distinct from the
  load_merge_mode_dedupes_edge_for_cardinality_count test which
  covers committed-vs-pending dedup).
* scan_with_pending_rejects_key_column_missing_from_projection
  (Cubic P2 #3 — verifies the up-front validation rejects bad
  callers and that the happy path still works correctly).

Local test results:

* tests/runs.rs: 23/23 passed
* tests/failpoints.rs --features failpoints: 7/7 passed (includes the
  two new finalize→publisher residual tests landed in 3223b51).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:47:45 +02:00
Ragnor Comerford
a335d98854
Refactor AGENTS.md from encyclopedia to map; move spec into docs/
Splits the 990-line AGENTS.md into a 184-line map (architecture,
where-to-find index, always-on invariants, capability matrix,
maintenance contract) plus 18 new docs/*.md files holding the deep
content per topic (storage, schema and query languages, indexes,
embeddings, branches/commits, runs, merge, changes, execution, policy,
server, CLI reference, audit, errors, CI, constants, v0.3.1 notes).

Adds scripts/check-agents-md.sh and a check_agents_md CI job that
verifies every docs/ link in AGENTS.md resolves and every doc in the
canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:31:08 +02:00
Ragnor Comerford
bcddbdf485
Test merge commit; push openapi.json via separate clone
Restore the default pull_request checkout (refs/pull/N/merge) so tests
see the merged state. The openapi.json auto-commit now uses a separate
shallow clone of the PR branch, so the pushed commit contains only the
spec change rather than the merge-commit tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:10:40 +02:00
Ragnor Comerford
a157f6a17c
Fold openapi.json auto-sync into main CI test job
The separate openapi-sync workflow was duplicating the workspace build
(~15 min cold-cache compile), paying the cost twice per PR. Fold the
regen + auto-commit into the existing test job: one compile, shared
rust-cache, same drift-check semantics.

- Same-repo PRs: OMNIGRAPH_UPDATE_OPENAPI=1 during the test run, then
  commit the regenerated spec back to the PR branch
- Fork PRs / pushes: env var empty, test stays in strict drift-check mode
- openapi_spec_is_up_to_date treats empty env value as unset, so the
  conditional workflow env expression works

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:00:46 +02:00
andrew
7a3bf5c758 Add aws feature + SecretsManagerTokenSource backend
Introduces an opt-in AWS Secrets Manager backend for bearer tokens,
behind the `aws` Cargo feature. Default builds (on-prem, local dev)
don't pull in the AWS SDK and don't pay its compile cost.

- New Cargo feature `aws` gates the `aws-config` + `aws-sdk-secretsmanager`
  optional deps. Default features remain empty.
- New `auth::aws::SecretsManagerTokenSource` implements `TokenSource` by
  fetching a JSON `{"actor_id": "token", ...}` payload from a named
  Secrets Manager secret. Credentials resolve via the AWS default chain
  (env, shared config, IMDSv2 instance role, ECS task role) so no
  explicit plumbing is needed under an IAM role.
- New `resolve_token_source()` dispatches based on the
  `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` env var. If the var is set
  but the binary was built without `--features aws`, returns a clear
  rebuild instruction rather than silently falling back.
- `serve()` now uses `resolve_token_source()` and logs which source was
  selected at startup.
- `parse_json_secret_payload()` is factored out as a free function so
  the payload validation (trim whitespace, reject blank actor/token,
  reject non-object) is unit-testable without the AWS SDK.
- New CI job `test_aws_feature` builds + tests with `--features aws`.

Not in this PR (follow-ups):
- Background refresh loop for rotation. `SecretsManagerTokenSource`
  advertises `supports_refresh: true` but the AppState-level refresh
  task isn't wired yet.
- Config-YAML dispatch (today the AWS source is selected via env var
  only; eventually `server.bearer_tokens.source` in `omnigraph.yaml`).

Tests:
- Default-feature build: 33 lib + 41 integration + 64 openapi.
- `--features aws` build: 32 lib (one test is cfg-gated) + 41 + 64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 03:48:51 +03:00
andrew
33bdab1fcb Prepare v0.2.2 release 2026-04-14 20:13:00 +03:00
andrew
ff83e97cb5 Scope RustFS CI to relevant changes 2026-04-12 15:33:41 +03:00
andrew
af7a74bf2c Skip heavy CI on text-only changes 2026-04-11 15:22:11 +03:00
andrew
446075f333 Update workflow actions and add Homebrew install docs 2026-04-11 04:01:39 +03:00
andrew
338289656a Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00