Stop producing the omnigraph-macos-x86_64 archive in both the
stable and edge release workflows. The macos-15-intel runner
build was the slowest of the matrix and Apple Silicon is now
the default Mac developer target.
- release.yml + release-edge.yml: drop the macos-15-intel matrix entry
- install.sh: drop the Darwin/x86_64 case so Intel Macs get a clear
"no prebuilt binary" error instead of attempting an absent download
- update-homebrew-formula.sh: drop the MACOS_X86_* variables and emit
an arm64-only Homebrew formula. The on_macos block now declares
`depends_on arch: :arm64` so Intel `brew install` fails fast with
a clear architecture message instead of installing an arm64 binary
that errors at exec time.
Linux x86_64 build is unaffected.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep machine-local state (.claude/, .worktrees/, local
omnigraph.yaml, CLAUDE.md, and schema design notes) from
showing up as untracked in git status.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both Cursor Bugbot and Cubic flagged that the inbound `headers().get(...)`
call constructed `HeaderName::from_static("x-request-id")` inline instead
of reusing the `X_REQUEST_ID` constant defined at the top of the file.
The two were already kept in sync by both being `from_static("x-request-id")`,
but a future rename would have to touch both sites or risk silent drift
between read and write.
Also drops the now-unused `header` module import.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-request ULID minted at the edge, exposed in request extensions and
on the response header. Caller-supplied X-Request-Id is echoed when
well-formed (1..=128 ASCII printable characters); otherwise rejected
and replaced with a fresh ULID so the value is always safe to log.
Companion to the TypeScript SDK redesign — clients now correlate logs
across the wire by reading X-Request-Id from response headers (and the
SDK already surfaces it on every OmnigraphError as `requestId`).
No spec change required; the header is a transport-layer concern.
Tests:
- mint a ULID when no header is provided
- echo a valid caller-supplied id
- reject overlong header (200 chars), mint a fresh ULID
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add operation descriptions and examples to utoipa annotations so the
generated TypeScript SDK has rich JSDoc, and so future Python/Go SDKs
and any /openapi.json docs UI benefit from the same effort.
- Doc comments on all 18 handlers (utoipa picks up summary/description)
- #[schema(example = ...)] on free-text fields (query_source,
schema_source, NDJSON data) and i64 timestamps
- Destructive/irreversible warnings on change, applySchema, ingest,
mergeBranches, deleteBranch, publishRun, abortRun
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Parallel per-type load writes + omnigraph optimize/cleanup CLI
## MR-677.3 — parallel per-type load writes
The load path already groups records into one RecordBatch per type and
makes one Lance commit per table (loader::mod.rs:249-..), but those
commits ran sequentially. Wrap node and edge write loops in
`futures::stream::buffered(N)` against a new helper
`write_batches_concurrently`. Concurrency tunable via
`OMNIGRAPH_LOAD_CONCURRENCY` (default 8).
## MR-676 — `omnigraph optimize` and `omnigraph cleanup`
New CLI subcommands that walk every node + edge table in the repo:
- `omnigraph optimize <uri>` — runs Lance `compact_files` on each
table to merge small fragments into fewer larger ones.
- `omnigraph cleanup <uri> --keep N | --older-than 7d --confirm` —
runs Lance `cleanup_old_versions` to prune historical manifests +
unique fragments. Requires `--confirm` because it's destructive.
Supports both count-based and time-based retention (or both AND'd
together). Time uses chrono `DateTime<Utc>` (added as a workspace
dep, default-features off).
Both commands run their per-table loops in parallel (8-way bounded,
`OMNIGRAPH_MAINTENANCE_CONCURRENCY` env override). Smoke-tested
against the 114-table prod graph: optimize went 7m15s sequential
→ 1m28s parallel. cleanup --keep 1 removed 137 historical versions
across 114 tables in 1m57s without disrupting `/healthz` or query
responses.
Public API on `Omnigraph`:
pub async fn optimize(&mut self) -> Result<Vec<TableOptimizeStats>>
pub async fn cleanup(&mut self, opts: CleanupPolicyOptions)
-> Result<Vec<TableCleanupStats>>
All 10 existing loader tests still pass.
Closes MR-676.
Partially addresses MR-677 (the .3 — parallel by type — piece;
MR-677.1 is for the `omnigraph embed` path, not load, since load
doesn't call Gemini directly. .2 was already in place).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: regenerate openapi.json
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
BFS now emits Vec<u32> dense ids directly with HashSet<u32> per-source
dedup. Only the deduped set is stringified for Lance's IN-list. The
post-hydrate alignment uses a dense-indexed Vec<Option<u32>> instead of
HashMap<&str, usize>, giving O(1) lookup without repeated string hashing.
End-to-end on the bench_expand harness (release, M-series):
query baseline after speedup
1k hop3 460.2 ms 23.7 ms 19x
10k hop2 4.21 s 139.9 ms 30x
10k hop3 40.59 s 898.5 ms 45x
30k hop2 11.71 s 490.2 ms 24x
30k hop3 197.38 s 3.22 s 61x
The cost lived in stringifying every (src,dst) pair and re-hashing the
strings during alignment; once dense ids stay dense, the BFS inner loop
and the final fan-out both collapse to integer ops.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove vestigial code left from removed hasher variants: unused
BuildHasherDefault import, PhantomData suppression line, orphan planning
comments for Variant C/E. Also drop an unused `mut` on the PRNG closure
binding. No behavior change; compiles warning-free.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The BFS in execute_expand emits one (src_idx, dst_id) pair per edge, so
dst_id_list contains heavy duplication when multi-hop traversals revisit
the same destination nodes. hydrate_nodes then built an
"id IN ('a', 'b', ...)" filter from the full list, passing it verbatim
to Lance. On a 30k-node Person graph, a 3-hop query produced a 15.4M-
entry IN-list against a 30k-row target — 512x more entries than unique
ids.
Deduplicate before the Lance scan; the post-hydrate alignment HashMap
already fans results back out to the original (src, dst) pairs, so
output is bit-identical.
Bench numbers (crates/omnigraph/examples/bench_expand.rs, min of 2-3
runs, release build):
query before after speedup
1k hop3 460 ms 28 ms 16x
10k hop2 4.21 s 188 ms 22x
10k hop3 40.59 s 1.30 s 31x
30k hop2 11.71 s 678 ms 17x
30k hop3 197.38 s 4.86 s 41x
All existing omnigraph-engine tests pass (72/72, 0 failures).
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
GitHub Actions rejects `secrets.*` in job-level `if:` conditions at
runtime (job-level `if` is evaluated before secrets are available),
causing the workflow to abort in 0s with "workflow file issue" on
every trigger. Moving the guard into a step-level check that writes
`HOMEBREW_TAP_SKIP` to GITHUB_ENV lets the rest of the steps
conditionally no-op when the tap token isn't configured.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run branches are transactional scaffolding — the durable audit lives
on RunRecord. Invariant: every terminal state (Published, Aborted,
Failed) deletes the __run__ branch.
- Add `terminate_run` helper: appends terminal RunRecord, then
deletes the run branch. Delete errors are swallowed — the record
is authoritative; `cleanup_terminal_run_branches_for_target`
retries on later `branch_delete` of the target.
- Wire into `publish_run_as`, `abort_run`, `fail_run`.
- Include `Failed` in the cleanup filter (was `Published | Aborted`
only) for legacy-repo GC during branch_delete.
- Cleanup now checks `coordinator.all_branches()` first to skip
branches already deleted by a concurrent handle — avoids Lance
NotFound when two handles publish/clean up independently.
- Drop `Failed` from `ensure_branch_delete_safe` — post-fix, Failed
means the branch is already gone, so there's no reason to block
target deletion (MR-674 "Downstream effects").
Tests:
- New regression: `run_branches_do_not_accumulate_across_repeated_loads`
— 10 loads + 1 abort → `branch_list() == ["main"]`.
- New `failed_load_deletes_run_branch` asserts Failed path cleans up.
- Rename `abort_run_keeps_target_unchanged_and_preserves_hidden_branch_for_inspection`
→ `abort_run_leaves_target_unchanged_and_deletes_run_branch`, invert
the hidden-branch assertion.
- Rewrite `public_{load,mutation}_preserves_staged_edge_ids_on_publish`
to capture staged IDs before publish instead of inspecting the run
branch after (branch is gone now).
- Update MR-670 regression test to assert the run branch is *absent*
after publish.
Deferred to follow-up: `--keep-run-branch` debug flag, `omnigraph run gc`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Files where inline tests crowded out production code (test/prod ratio
≥ 0.8) move to sibling files via `#[path]`. Files where production
dominates (query_input.rs, schema_plan.rs) stay inline — extracting
would add noise, not reduce it.
- ir/lower.rs: 1239 → 577 lines (ratio 1.15)
- catalog/mod.rs: 594 → 326 lines (ratio 0.83)
- query/lint.rs: 562 → 314 lines (ratio 0.80)
catalog/tests.rs uses the shorter name since it's inside a module
directory (no ambiguity with filename).
All 229 compiler tests green, identical count to before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
typecheck.rs, schema/parser.rs, and query/parser.rs each had
~1000-line inline `mod tests` blocks that overshadowed the production
code in the file. Move each to a sibling `*_tests.rs` using
`#[path = "..."] mod tests;`.
- typecheck.rs: 2865 → 1708 lines; typecheck_tests.rs: 1156 lines
- schema/parser.rs: 1950 → 994 lines; parser_tests.rs: 955 lines
- query/parser.rs: 1737 → 803 lines; parser_tests.rs: 933 lines
No visibility change — the sibling module still has `use super::*`
access to crate-privates. No semantic edits beyond de-indenting by
4 spaces (mechanical). All 229 compiler tests green, identical
count to before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The inline `mod tests` in crates/omnigraph/src/db/omnigraph.rs had grown
to ~620 lines, mixing tests that need crate-private access with tests
that only exercise the public API. Splits the latter out.
- tests/lifecycle.rs: 10 init/open/snapshot/drift tests
- tests/schema_apply.rs: 5 plan/apply tests
- omnigraph.rs: 10 tests remain inline because they use
db.coordinator, db.table_store(), ManifestCoordinator,
SCHEMA_APPLY_LOCK_BRANCH, or is_internal_run_branch — all
crate-private and intentionally kept so.
No behavior change. Zero semantic edits to the tests themselves beyond
replacing db.snapshot() (pub(crate)) with snapshot_main helper at
integration-test boundaries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AWS CodeBuild shares an outbound IP pool with many other AWS customers,
so anonymous Docker Hub pulls (100/6h per IP) rate-limit quickly. The
aws-feature variant in Package run 24642508475 hit 429 on debian:bookworm-slim.
ECR Public hosts the same official Debian images at
public.ecr.aws/debian/debian, has no pull rate limit, and is
anonymously accessible. Same upstream image, just mirrored on AWS.
Published `__run__` branches are intentionally retained after publish
for post-publish inspection (runs.rs tests verify edge IDs match
between run branch and main). `apply_schema` was counting them as
"non-main" branches and refusing to run — permanently blocking schema
evolution after any load or change, with no CLI recovery path
(`branch_delete` rejects internal refs, `run abort` rejects Published
runs).
Fix: `apply_schema` filters `is_internal_system_branch` (covers both
`__run__*` and the schema-apply lock) rather than just the lock.
Run branches remain available for inspection.
Regression: test_apply_schema_succeeds_after_load_creates_published_run_branch
pins that schema apply succeeds after a load even while the run
branch is still present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes flaky omnigraph-server integration suite under parallel cargo
test. Lance defaults to a 100 MB FairSpillPool per Omnigraph instance
(lance-datafusion/src/exec.rs:316). That's fine in prod (one server
process, bounded concurrent sorts) but too small when cargo test
spawns many Omnigraph instances in parallel, each running concurrent
BTree index builds during load.
Failure signature:
Lance("create BTree index on node:Person(id): ... LanceError(IO):
Not enough memory to continue external sort. ... 0.0 B remain
available for the total pool")
Before: 10/41 OOM-fail on parallel run; passed with --test-threads=1.
After: 41/41 pass in parallel in ~3s.
[env] in .cargo/config.toml applies to cargo-launched processes only.
Shipped binaries (release tarballs, Docker images) are unaffected —
they inherit whatever the runtime env provides, defaulting to Lance's
100 MB when unset.
Restore the default pull_request checkout (refs/pull/N/merge) so tests
see the merged state. The openapi.json auto-commit now uses a separate
shallow clone of the PR branch, so the pushed commit contains only the
spec change rather than the merge-commit tree.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The target → graph rename shipped in PR #17 but omnigraph.example.yaml
still used the old form (`targets:` / `cli.target`). Since the serde
struct uses `rename = "graphs"` without a `targets` alias, the example
wouldn't deserialize against current code.
Update the example to the new form. No alias is being added — the
deserialization error for old configs is loud and clear, which is the
better migration signal for a young project.
The separate openapi-sync workflow was duplicating the workspace build
(~15 min cold-cache compile), paying the cost twice per PR. Fold the
regen + auto-commit into the existing test job: one compile, shared
rust-cache, same drift-check semantics.
- Same-repo PRs: OMNIGRAPH_UPDATE_OPENAPI=1 during the test run, then
commit the regenerated spec back to the PR branch
- Fork PRs / pushes: env var empty, test stays in strict drift-check mode
- openapi_spec_is_up_to_date treats empty env value as unset, so the
conditional workflow env expression works
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub Actions doesn't expose the 'secrets' context in 'with:' when
calling a reusable workflow. The companion PR on the shared workflow
(ModernRelay/.github) moves the four AWS values into
on.workflow_call.secrets; this caller drops them from 'with:' and adds
'secrets: inherit' so all four flow through masked.
Trailing from PRs #33 and #34.
On a public repo, Actions variables are not masked in workflow logs.
The AWS role ARN and artifact bucket name embed the AWS account ID —
not catastrophic, but norm-preserving to keep them out of public logs.
Switch all four values (region, role, project, bucket) from
`${{ vars.* }}` to `${{ secrets.* }}`. When secrets are passed via
`with:` to a reusable workflow, GitHub's masking still applies because
the value is added to the run's mask list as soon as the secret
reference is resolved.
Followup to #33 — should have landed as secrets from the start.
Invokes the shared omnigraph-package reusable workflow twice per run —
once with default features, once with --features aws — producing two
ECR tags per source commit:
<sha> (default features)
<sha>-aws (--features aws → SecretsManagerTokenSource)
Manual-dispatch only for now. Neither release.yml nor release-edge.yml
currently invokes the CodeBuild-backed packaging path; this gives
operators a way to produce on-demand image variants without wiring
packaging into the tag/push cadence.
Prerequisites:
- Repo vars AWS_REGION, AWS_ROLE_TO_ASSUME, AWS_CODEBUILD_PACKAGE_PROJECT,
AWS_ARTIFACT_BUCKET must be set.
- Shared workflow must support the `features` and `image_tag_suffix`
inputs.
Uses @main as the shared-workflow ref until a versioned tag is cut.
- docs/deployment.md: new "Token sources" section listing the three
bearer-token source precedences (AWS SM, JSON file/env, single token).
New "Build Variants" section explaining default vs aws builds and
their release-artifact naming. New "AWS Secrets Manager" section
covering env var, secret payload format, IAM role credential
discovery, and the hard error for feature-less builds.
- CONTRIBUTING.md: documents the `aws` feature and the two test
commands contributors should run when touching auth code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces an opt-in AWS Secrets Manager backend for bearer tokens,
behind the `aws` Cargo feature. Default builds (on-prem, local dev)
don't pull in the AWS SDK and don't pay its compile cost.
- New Cargo feature `aws` gates the `aws-config` + `aws-sdk-secretsmanager`
optional deps. Default features remain empty.
- New `auth::aws::SecretsManagerTokenSource` implements `TokenSource` by
fetching a JSON `{"actor_id": "token", ...}` payload from a named
Secrets Manager secret. Credentials resolve via the AWS default chain
(env, shared config, IMDSv2 instance role, ECS task role) so no
explicit plumbing is needed under an IAM role.
- New `resolve_token_source()` dispatches based on the
`OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` env var. If the var is set
but the binary was built without `--features aws`, returns a clear
rebuild instruction rather than silently falling back.
- `serve()` now uses `resolve_token_source()` and logs which source was
selected at startup.
- `parse_json_secret_payload()` is factored out as a free function so
the payload validation (trim whitespace, reject blank actor/token,
reject non-object) is unit-testable without the AWS SDK.
- New CI job `test_aws_feature` builds + tests with `--features aws`.
Not in this PR (follow-ups):
- Background refresh loop for rotation. `SecretsManagerTokenSource`
advertises `supports_refresh: true` but the AppState-level refresh
task isn't wired yet.
- Config-YAML dispatch (today the AWS source is selected via env var
only; eventually `server.bearer_tokens.source` in `omnigraph.yaml`).
Tests:
- Default-feature build: 33 lib + 41 integration + 64 openapi.
- `--features aws` build: 32 lib (one test is cfg-gated) + 41 + 64.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure refactor. No behavior change. Introduces a TokenSource trait so
additional backends (AWS Secrets Manager, Vault, etc.) can plug in
behind feature flags without touching the server wiring.
- New module crates/omnigraph-server/src/auth.rs with the TokenSource
trait and a single EnvOrFileTokenSource implementation that delegates
to the existing server_bearer_tokens_from_env() function.
- serve() now constructs EnvOrFileTokenSource and calls load() instead
of calling the free function directly.
- The trait has a supports_refresh() hook (false for env/file) for
future implementations that can rotate without restart.
- async-trait added to omnigraph-server deps; it's already in the
workspace.
Tests:
- Unit tests in auth.rs covering load paths and the default supports_refresh
/ name values.
- Existing 128 tests (lib + integration + openapi) pass unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>