Merge pull request #181 from ModernRelay/feat/container-cluster-mode

feat(docker): cluster-mode container + AWS/Railway recipes
Andrew Altshuler 2026-06-10 23:57:34 +03:00 committed by GitHub
commit c3ff076e89
6 changed files with 107 additions and 2 deletions

@ -2,3 +2,4 @@
!Dockerfile
!docker/entrypoint.sh
!target/release/omnigraph-server
!target/release/omnigraph

@ -11,9 +11,13 @@ RUN groupadd --system omnigraph \
&& useradd --system --gid omnigraph --create-home --home-dir /var/lib/omnigraph omnigraph
COPY target/release/omnigraph-server /usr/local/bin/omnigraph-server
# The CLI ships in the image so the cluster day-2 loop (cluster
# apply/approve/status, data loads by explicit URI) runs in-container via
# `docker exec` / ECS exec / `railway shell` — no omnigraph.yaml required.
COPY target/release/omnigraph /usr/local/bin/omnigraph
COPY docker/entrypoint.sh /usr/local/bin/omnigraph-entrypoint
RUN chmod 0755 /usr/local/bin/omnigraph-server /usr/local/bin/omnigraph-entrypoint
RUN chmod 0755 /usr/local/bin/omnigraph-server /usr/local/bin/omnigraph /usr/local/bin/omnigraph-entrypoint
ENV OMNIGRAPH_BIND=0.0.0.0:8080

@ -9,6 +9,17 @@ fi
bind="${OMNIGRAPH_BIND:-0.0.0.0:8080}"
# Cluster mode first, and exclusive (the server's mode-inference rule 0):
# a deployment serves from cluster state XOR omnigraph.yaml, never a merge.
# Fail fast here with the same contract the server enforces.
if [ -n "${OMNIGRAPH_CLUSTER:-}" ]; then
if [ -n "${OMNIGRAPH_TARGET_URI:-}" ] || [ -n "${OMNIGRAPH_CONFIG:-}" ] || [ -n "${OMNIGRAPH_TARGET:-}" ]; then
echo "OMNIGRAPH_CLUSTER is an exclusive boot source; unset OMNIGRAPH_TARGET_URI/OMNIGRAPH_CONFIG/OMNIGRAPH_TARGET" >&2
exit 64
fi
exec "$SERVER_BIN" --cluster "${OMNIGRAPH_CLUSTER}" --bind "${bind}"
fi
# URI comes from the env var (the positional arg wins over any config
# `graphs` block in resolve_target_uri). OMNIGRAPH_CONFIG, when also set,
# is forwarded as --config purely to supply a policy file — the two
@ -28,6 +39,8 @@ fi
cat >&2 <<'EOF'
omnigraph-server container startup requires one of:
- OMNIGRAPH_CLUSTER (serve a cluster directory's applied revision;
exclusive — cannot combine with the others)
- OMNIGRAPH_TARGET_URI
- OMNIGRAPH_CONFIG

@ -58,6 +58,26 @@ got=$(sh "$ep" some-uri --bind 1.2.3.4:9 --extra)
check "explicit args passthrough" \
"ARGS: some-uri --bind 1.2.3.4:9 --extra" "$got"
got=$(OMNIGRAPH_CLUSTER="/var/lib/omnigraph/company-brain" OMNIGRAPH_BIND="0.0.0.0:8080" sh "$ep")
check "CLUSTER only (Phase 5 mode switch)" \
"ARGS: --cluster /var/lib/omnigraph/company-brain --bind 0.0.0.0:8080" "$got"
# Exclusivity: OMNIGRAPH_CLUSTER refuses every combination, exit 64.
for combo in "OMNIGRAPH_TARGET_URI=s3://b/g" "OMNIGRAPH_CONFIG=/etc/o.yaml" "OMNIGRAPH_TARGET=active"; do
if out=$(env "$combo" OMNIGRAPH_CLUSTER="/data/cluster" sh "$ep" 2>&1); then
echo "FAIL: CLUSTER + ${combo%%=*} unexpectedly succeeded: $out"
fail=1
else
status=$?
if [ "$status" -ne 64 ]; then
echo "FAIL: CLUSTER + ${combo%%=*} exited $status, want 64"
fail=1
else
echo "ok: CLUSTER + ${combo%%=*} refused (64)"
fi
fi
done
if [ "$fail" -ne 0 ]; then
echo "entrypoint_test: FAILED"
exit 1

@ -229,7 +229,8 @@ with an in-flight apply.
- **Replicas**: any number of `--cluster` servers can serve the same config
directory; boot is read-only. Roll out a change by `apply` once, then
restarting replicas (serving is static per process — there is no hot
reload yet).
reload yet). Container/cloud recipes (AWS ECS+EFS, Railway volumes):
[deployment.md](deployment.md#cluster-mode-in-containers-aws-railway).
- **The directory is the deployable unit**: config, catalog, ledger,
approvals, and graph data all live under it. Back it up as a whole;
version the *config files* (not `__cluster/` or `graphs/`) in git.

@ -45,6 +45,72 @@ omnigraph-server s3://my-bucket/graphs/example/releases/2026-04-10-v0.1.0 \
--bind 0.0.0.0:8080
```
## Cluster Mode in Containers (AWS, Railway)
A cluster-booted deployment serves a **cluster directory** (config + state
ledger + content-addressed catalog + graph data) from a mounted volume — the
one structural difference from the stateless S3 single-graph shape, which
needs no volume at all. The container contract:
```bash
docker run -d \
-v /srv/company-brain:/var/lib/omnigraph/cluster \
-e OMNIGRAPH_CLUSTER=/var/lib/omnigraph/cluster \
-e OMNIGRAPH_SERVER_BEARER_TOKEN=... \
-p 8080:8080 <image>
```
`OMNIGRAPH_CLUSTER` is exclusive: combining it with `OMNIGRAPH_TARGET_URI`,
`OMNIGRAPH_CONFIG`, or `OMNIGRAPH_TARGET` fails fast (exit 64), the same
rule the server itself enforces. The image also ships the `omnigraph` CLI,
so the day-2 loop runs in-container with no `omnigraph.yaml`:
```bash
docker exec -it <container> sh -c \
'omnigraph cluster apply --as <you> --config /var/lib/omnigraph/cluster'
# then restart the container to pick up the applied state
```
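The exclusivity rule can also be checked before the container is ever started, for example in a deploy script. The following is a minimal sketch, not the shipped entrypoint: `check_cluster_env` is a hypothetical helper name, and it only mirrors the documented contract (exit status 64 when `OMNIGRAPH_CLUSTER` is combined with any of the other three variables).

```bash
# Hypothetical pre-flight helper (not part of the image): mirrors the
# documented rule that OMNIGRAPH_CLUSTER is an exclusive boot source.
check_cluster_env() {
  if [ -n "${OMNIGRAPH_CLUSTER:-}" ]; then
    for name in OMNIGRAPH_TARGET_URI OMNIGRAPH_CONFIG OMNIGRAPH_TARGET; do
      # Indirect lookup of the variable named by $name (POSIX sh has no ${!name}).
      eval "value=\${$name:-}"
      if [ -n "$value" ]; then
        echo "OMNIGRAPH_CLUSTER is exclusive; unset $name" >&2
        return 64
      fi
    done
  fi
  return 0
}
```

Calling `check_cluster_env || exit $?` at the top of a deploy script would surface a conflicting environment before `docker run`, rather than as a container that exits immediately.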
### AWS (ECS/Fargate + EFS)
1. Push the image to ECR (the `package.yml` workflow builds it).
2. Create an EFS filesystem; mount it in the task definition at
`/var/lib/omnigraph/cluster`.
3. Task environment: `OMNIGRAPH_CLUSTER=/var/lib/omnigraph/cluster`, bearer
tokens via Secrets Manager/SSM into `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON`
(or the `--features aws` build's native Secrets Manager source).
4. ALB in front for TLS; target the container's 8080 with `/healthz` checks.
5. Day-2: ECS exec into the task → edit/upload config on the volume →
`omnigraph cluster apply --as <you> --config /var/lib/omnigraph/cluster`
→ force a new deployment (restart).
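
Step 5 could be scripted roughly as follows. This is a sketch under stated assumptions: `day2_commands` is a hypothetical helper that only prints the two AWS CLI invocations (`aws ecs execute-command` and `aws ecs update-service --force-new-deployment`) so they can be reviewed before running, and the cluster, service, task, and container names are placeholders, not values from this repository. ECS exec must already be enabled on the service for the first command to work.

```bash
# Hypothetical dry-run helper: prints the day-2 commands instead of running them.
# Arguments: ECS cluster, service, task ID, container name, actor for --as,
# and optionally the cluster directory (defaults to the documented mount path).
day2_commands() {
  ecs_cluster=$1
  service=$2
  task=$3
  container=$4
  actor=$5
  dir=${6:-/var/lib/omnigraph/cluster}
  printf '%s\n' \
    "aws ecs execute-command --cluster $ecs_cluster --task $task --container $container --interactive --command \"omnigraph cluster apply --as $actor --config $dir\"" \
    "aws ecs update-service --cluster $ecs_cluster --service $service --force-new-deployment"
}
```

For example, `day2_commands prod omnigraph-svc <task-id> omnigraph alice` prints the apply command followed by the forced redeploy; both names there are illustrative.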
For a deployment that doesn't need the cluster control plane, the classic
stateless shape — `OMNIGRAPH_TARGET_URI=s3://bucket/graph.omni`, no volume —
remains the simplest AWS architecture (see Binary/Container Deployment
above).
### Railway
1. Create a service from the image; attach a **volume** mounted at
`/var/lib/omnigraph/cluster`.
2. Variables: `OMNIGRAPH_CLUSTER=/var/lib/omnigraph/cluster`,
`OMNIGRAPH_SERVER_BEARER_TOKEN=<token>`. Railway terminates TLS at its
edge and routes to the exposed 8080.
3. Day-2: `railway shell` (or `railway run`) → `omnigraph cluster apply
--as <you> --config /var/lib/omnigraph/cluster` → redeploy/restart the
service.
### Constraints (current honest list)
- **Cluster directories are local-filesystem** — the volume is mandatory;
S3-hosted cluster dirs are not supported.
- **No hot reload** — applied changes serve on the next restart.
- **Single-writer apply** — run `cluster apply` from one place at a time
(the state lock enforces this; CI or one operator shell, not both).
- **Multi-replica serving off a shared volume (EFS) is documented but
unvalidated** — boot is lock-free read-only so it should compose, but it
is not yet exercised by tests.
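
Because applied changes only serve after a restart, a rollout script usually wants to wait for the restarted server to answer before moving on. A minimal sketch, assuming the `/healthz` endpoint mentioned in the AWS recipe and `curl` on the host; `wait_healthy` and the `PROBE` override are hypothetical names introduced here for illustration.

```bash
# Hypothetical helper: poll a health endpoint until it answers or we give up.
# PROBE is an assumption added so the probe command can be swapped out;
# by default it is curl, where -f makes HTTP errors (>= 400) count as failures.
wait_healthy() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if ${PROBE:-curl -fsS -o /dev/null} "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up waiting for $url after $attempts attempts" >&2
  return 1
}
```

A local rollout might then read `docker restart <container> && wait_healthy http://localhost:8080/healthz`, restarting replicas one at a time so at least one keeps serving.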
## One-Command Local RustFS Bootstrap
The easiest local S3-backed deployment path is: