From f80b73b3fea00b46e0ec417dfc38505a316c5756 Mon Sep 17 00:00:00 2001
From: Claude
Date: Wed, 18 Feb 2026 14:42:38 +0000
Subject: [PATCH] Correct test coverage analysis with accurate counts

Previous version undercounted tests significantly. Corrected findings:
- brightstaff state has 26 tests (memory: 16, postgresql: 10), not 0
- brightstaff pipeline_processor has 5 tests, not 0
- hermesllm streaming buffers have 12 tests across 4 modules
- common has 36 tests across 10 files
- Total: ~370 tests (314 Rust + 54 E2E/integration + Hurl/REST)
- Now properly documents what IS well-tested alongside the gaps

https://claude.ai/code/session_01Shz5qKiTB9m6oxzEZWJVKk
---
 TEST_COVERAGE_ANALYSIS.md | 270 +++++++++++++++++++++++---------------
 1 file changed, 167 insertions(+), 103 deletions(-)

diff --git a/TEST_COVERAGE_ANALYSIS.md b/TEST_COVERAGE_ANALYSIS.md
index cc1940bc..04445962 100644
--- a/TEST_COVERAGE_ANALYSIS.md
+++ b/TEST_COVERAGE_ANALYSIS.md
@@ -4,9 +4,9 @@
 
 ## Executive Summary
 
-The Plano codebase has significant test coverage gaps across all components. The Rust crates have ~262 unit tests covering roughly 0.35% of ~75,800 lines of code. The Python CLI has 29 tests covering 4 of 12 modules. The JavaScript/TypeScript apps and packages have **zero tests**. The E2E suite covers the core happy-path flows well but lacks error, edge-case, and performance scenarios.
+The Plano codebase has **~370 automated tests**: ~297 Rust unit tests, ~65 Python tests (29 CLI + 50 E2E + 4 archgw integration), 10 Hurl/REST manual test files, and zero JS/TS tests. Coverage is strong in the LLM translation layer (hermesllm) and behavioral signals (brightstaff/signals), moderate in state management and configuration, and weak in the WASM gateway plugins and several Python CLI modules.
 
-Below is a prioritized breakdown of gaps and recommendations.
+Below is a detailed breakdown by component with prioritized improvement recommendations.
 
 ---
 
@@ -14,40 +14,86 @@ Below is a prioritized breakdown of gaps and recommendations.
 
 ### Current State
 
-| Crate | LOC | Tests | Status |
-|-------|-----|-------|--------|
-| common | 3,912 | 33 | Partial |
-| hermesllm | 17,540 | 134 | Partial |
-| prompt_gateway | 1,717 | 4 | Critical gap |
-| llm_gateway | 1,399 | 0 | Critical gap |
-| brightstaff | 13,342 | 91 | Partial |
-| **Total** | **~75,800** | **262** | |
+| Crate | Tests | Files With Tests | Status |
+|-------|-------|------------------|--------|
+| hermesllm | 148 | 21 | Good — broad coverage of provider translation |
+| brightstaff | 126 | 11 | Good — signals/state/routing well tested; handler endpoints less so |
+| common | 36 | 10 | Moderate — core utilities covered; some gaps |
+| prompt_gateway | 4 | 2 | Weak — WASM filter mostly untested |
+| llm_gateway | 0 | 0 | None — WASM filter completely untested |
+| **Total** | **~314** | **44** | |
 
-### Critical Gaps
+### Well-Tested Areas
 
-**llm_gateway — 0 tests, 1,399 LOC.** This WASM filter handles LLM request/response processing and streaming. It has no tests at all. `stream_context.rs` alone is ~1,000 lines of complex streaming logic with zero coverage.
+- **hermesllm provider translation (148 tests):** Request/response transforms for all providers (OpenAI, Anthropic, Bedrock, Gemini, Mistral) are thoroughly tested. Streaming response parsing (20 tests), endpoint resolution (11 tests), request generation (16 tests), and cross-provider format conversion (~45 tests) are solid.
+- **hermesllm streaming buffers (12 tests):** SSE chunk processor (6 tests), Anthropic streaming buffer (3 tests), Responses API streaming buffer (2 tests), and passthrough buffer (1 test) have coverage.
+- **brightstaff signals/analyzer (48 tests):** Character n-gram similarity, token cosine similarity, layered matching, frustration/escalation/positive-feedback detection are thoroughly tested.
+- **brightstaff state management (26 tests):** In-memory state (16 tests) and PostgreSQL persistence (10 tests) have good unit test coverage.
+- **brightstaff function calling (17 tests):** Tool extraction, JSON fixing, hallucination detection, and tool call verification are well covered.
+- **brightstaff routing models (17 tests):** Orchestrator model v1 (9 tests) and router model v1 (8 tests) are tested.
+- **brightstaff pipeline processor (5 tests):** Has basic test coverage (4 tokio::test + 1 sync test).
+- **brightstaff agent selector (5 tests):** Listener lookup and agent map creation are tested.
+- **brightstaff response handler (5 tests):** Response transformation has tests.
+- **common rate limiting (8 tests):** Rate limit logic with token quotas and header-based selectors is tested.
+- **common OpenAI API (9 tests):** Chat completion parsing and request conversions covered.
 
-**prompt_gateway — 4 tests, 1,717 LOC.** The WASM filter for prompt processing and guardrails has near-zero coverage. Untested modules include `filter_context.rs`, `http_context.rs`, `context.rs`, and `metrics.rs`. The intent-matching logic in `stream_context.rs` (~900 lines) has only 1 test.
+### Gaps and Recommendations
 
-**brightstaff pipeline and state management — ~2,200 LOC untested.** The core request pipeline (`handlers/pipeline_processor.rs`, 834 lines), state persistence layer (`state/memory.rs`, `state/postgresql.rs`, `state/response_state_processor.rs` — 1,370 lines combined), and key handler endpoints (`handlers/llm.rs`, `handlers/agent_chat_completions.rs`) have no tests.
+#### Gap 1: `llm_gateway` crate — 0 tests (1,399 LOC)
 
-### Partially Covered Areas Needing More Tests
+This WASM filter handles all LLM request/response processing and streaming. `stream_context.rs` (~1,000 lines) manages streaming chunk assembly and response forwarding with zero coverage.
 
-- **hermesllm streaming transforms** — The non-streaming request/response transforms are well-tested (134 tests), but the streaming buffer modules (`sse.rs`, `amazon_bedrock_binary_frame.rs`, `to_openai_streaming.rs`, `to_anthropic_streaming.rs` — ~5,000 LOC) are untested.
-- **common/routing.rs, common/errors.rs, common/http.rs, common/stats.rs, common/tracing.rs** — Utility modules totaling ~560 lines with no coverage.
-- **brightstaff router services** — `llm_router.rs` and `plano_orchestrator.rs` (~400 lines) lack tests despite handling routing decisions.
+**Recommendation:** Extract core logic from the WASM host context into pure, testable functions. Test streaming chunk reassembly, header manipulation, error response construction, and the filter lifecycle. Consider a thin WASM shim over well-tested logic modules.
 
-### Recommendations
+#### Gap 2: `prompt_gateway` crate — 4 tests (1,717 LOC)
 
-1. **Add unit tests for llm_gateway.** Start with `stream_context.rs` — test streaming chunk assembly, partial frame handling, error recovery, and the filter lifecycle. A WASM-mocking test harness or extracting the core logic into testable pure functions would help.
+The WASM prompt filter has tests only in `tools.rs` (3 tests) and `stream_context.rs` (1 test). The filter/HTTP context lifecycle (`filter_context.rs`, `http_context.rs`), prompt guard logic, and metrics collection are untested.
 
-2. **Add unit tests for prompt_gateway filter logic.** Test `http_context.rs` request/response handling, `filter_context.rs` lifecycle, and the guardrail filtering paths in `stream_context.rs`.
+**Recommendation:** Add tests for intent matching and prompt guard/jailbreak detection in `stream_context.rs`. Test `http_context.rs` request parsing and response construction. Same architectural approach as llm_gateway — separate testable logic from WASM host bindings.
 
-3. **Test the brightstaff pipeline processor.** This is the central message processing pipeline.
Mock the downstream dependencies and test the orchestration logic, error paths, and streaming assembly.
+#### Gap 3: brightstaff handler endpoints — limited coverage
 
-4. **Test state persistence.** Both the in-memory and PostgreSQL backends need tests for basic CRUD, concurrent access, state expiration, and connection failure recovery.
+Several handler modules have no unit tests:
+- `handlers/llm.rs` (553 LOC) — LLM chat handler
+- `handlers/agent_chat_completions.rs` (418 LOC) — Multi-agent orchestration
+- `handlers/router_chat.rs` (159 LOC) — Router endpoint
+- `handlers/utils.rs` (288 LOC) — Handler utilities
 
-5. **Test hermesllm streaming transforms.** The SSE parser, Bedrock binary frame decoder, and streaming-to-OpenAI/Anthropic converters need unit tests, especially for edge cases like partial frames, malformed chunks, and connection resets.
+The pipeline_processor has only 5 tests for 834 LOC — basic flow is covered but error paths and edge cases are not.
+
+**Recommendation:** Add tests for error paths in `pipeline_processor.rs` (malformed requests, downstream failures, timeout handling). Add handler-level tests for `llm.rs` and `agent_chat_completions.rs` using `mockito` (already a dev dependency) to mock HTTP backends.
+
+#### Gap 4: hermesllm streaming *transforms* — 0 tests
+
+While the streaming *buffers* (SSE parser, Anthropic buffer, etc.) have tests, the streaming *transform* modules that convert between formats during streaming are untested:
+- `transforms/response_streaming/to_openai_streaming.rs`
+- `transforms/response_streaming/to_anthropic_streaming.rs`
+
+Also untested: `apis/streaming_shapes/amazon_bedrock_binary_frame.rs` (AWS Event Stream binary decoding) and `apis/streaming_shapes/chat_completions_streaming_buffer.rs`.
+
+**Recommendation:** Add tests for the streaming transform modules.
The Bedrock binary frame decoder is particularly important — it parses a proprietary binary protocol and failures here are hard to diagnose in production.
+
+#### Gap 5: common utility modules — no tests
+
+Several `common` modules lack tests:
+- `routing.rs` — Provider routing logic
+- `errors.rs` — Error types (ClientError, ServerError)
+- `http.rs` — HTTP utilities and CallArgs
+- `stats.rs` — Metrics traits
+- `api/prompt_guard.rs` — Prompt guard types
+- `api/zero_shot.rs` — Zero-shot classification types
+
+**Recommendation:** Add tests for `routing.rs` (routing decisions), `http.rs` (CallArgs construction, URL handling), and `prompt_guard.rs` (guard rule evaluation). The error/stats/consts modules are mostly type definitions and don't need extensive testing.
+
+#### Gap 6: brightstaff state — edge cases
+
+The state backends have solid basic coverage (26 tests total), but lack tests for:
+- Concurrent access patterns
+- State expiration/eviction
+- Connection failure recovery (PostgreSQL)
+- Large conversation histories
+
+**Recommendation:** Add tokio::test cases for concurrent read/write scenarios in the memory backend and connection pool behavior in the PostgreSQL backend.
 
 ---
 
@@ -55,42 +101,45 @@ Below is a prioritized breakdown of gaps and recommendations.
 
 ### Current State
 
-| Module | LOC | Tested?
|
-|--------|-----|---------|
-| config_generator.py | 514 | Yes |
-| versioning.py | 70 | Yes |
-| init_cmd.py | 303 | Yes |
-| trace_cmd.py | 993 | Minimal (2 tests) |
-| main.py | 441 | No |
-| targets.py | 365 | No |
-| core.py | 234 | No |
-| docker_cli.py | 143 | No |
-| template_sync.py | 122 | No |
-| utils.py | 285 | Partial |
+| Test File | Tests | Modules Covered |
+|-----------|-------|-----------------|
+| test_config_generator.py | 11 (5 functions + 6 parametrized) | config_generator, utils |
+| test_version_check.py | 18 (4 classes, 18 methods) | versioning |
+| test_init.py | 4 | init_cmd |
+| test_trace_cmd.py | 2 | trace_cmd (minimal) |
+| **Total** | **35 executions** | **5 of 13 modules** |
 
-**29 total tests across 4 files. 8 of 12 modules are untested or minimally tested.**
+### Well-Tested Areas
 
-### Critical Gaps
+- **versioning.py (18 tests):** Version parsing, comparison, PyPI fetching, network error handling, and environment variable overrides are thoroughly tested across 4 test classes.
+- **config_generator.py (11 tests):** Happy-path config validation, schema validation errors (6 parametrized cases), and legacy format conversion are covered.
+- **init_cmd.py (4 tests):** Clean init, template init, overwrite protection, and force overwrite are tested.
 
-**main.py — 0 tests, 441 LOC.** All CLI commands (`up`, `down`, `build`, `logs`, `cli_agent`, `generate_prompt_targets`) are untested. The `up` command alone contains complex logic for port conflict detection, API key validation, and container orchestration.
+### Gaps and Recommendations
 
-**targets.py — 0 tests, 365 LOC.** The AST-based Python code parser for extracting prompt targets from Flask/FastAPI routes and Pydantic models is entirely untested. This is complex parsing logic prone to edge cases.
+#### Gap 7: `main.py` — 0 tests (441 LOC)
 
-**core.py — 0 tests, 234 LOC.** Docker container lifecycle management (start, stop, health check retry loop, timeout handling) is untested.
+The CLI entry point defines all Click commands (`up`, `down`, `build`, `logs`, `cli_agent`, `generate_prompt_targets`). None have tests. The `up` command has complex logic for port conflict detection, API key validation, and container orchestration.
 
-**docker_cli.py — 0 tests, 143 LOC.** All 7 Docker subprocess wrapper functions lack tests.
+**Recommendation:** Add tests using Click's `CliRunner`. Test `planoai up` with mocked Docker calls (validate argument handling, port conflict error messages, API key resolution). Test `planoai down` and `planoai build` for basic argument handling and error paths.
 
-**trace_cmd.py — 2 tests for 993 LOC.** Only gRPC server bind error handling is tested. The trace collection, OTEL processing, and trace analysis logic are untested.
+#### Gap 8: `targets.py` — 0 tests (365 LOC)
 
-### Recommendations
+AST-based Python code parser that extracts prompt targets from Flask/FastAPI routes and Pydantic models. This is complex parsing logic prone to edge cases with decorators, type annotations, and docstrings.
 
-6. **Add CLI command tests using Click's CliRunner.** Test `planoai up`, `planoai down`, and `planoai build` with mocked Docker operations. Verify argument validation, error messages, and exit codes.
+**Recommendation:** Create test fixtures with sample Flask/FastAPI app files and verify extracted prompt targets. Test edge cases: nested decorators, complex type hints (Optional, Union, List[dict]), missing docstrings, and unsupported patterns.
 
-7. **Add tests for targets.py.** Test Flask route extraction, FastAPI route extraction, Pydantic model field parsing, type annotation handling, and edge cases (nested decorators, complex type hints, missing docstrings).
+#### Gap 9: `core.py` and `docker_cli.py` — 0 tests (377 LOC combined)
 
-8. **Add tests for core.py with mocked subprocess/Docker calls.** Test the health check retry loop, container state transitions (not found → start, running → restart), timeout behavior, and port forwarding.
+Container lifecycle management and Docker subprocess wrappers are untested.
 
-9. **Add a shared conftest.py** with common fixtures for environment setup, temporary config files, and Docker mocking.
+**Recommendation:** Mock `subprocess.run` / `subprocess.Popen` and test the health check retry loop, container state transitions, and error handling. A shared `conftest.py` with Docker mock fixtures would benefit multiple test files.
+
+#### Gap 10: `trace_cmd.py` — 2 tests for 993 LOC
+
+Only gRPC bind error handling is tested. Trace collection, OTEL span processing, and trace analysis logic (the bulk of the module) are untested.
+
+**Recommendation:** Add tests for trace data parsing and the analysis/summarization logic. Mock gRPC server interactions for collection tests.
 
 ---
 
@@ -98,85 +147,100 @@ Below is a prioritized breakdown of gaps and recommendations.
 
 ### Current State
 
-**Zero test files. No test framework configured. No test scripts in any package.json.**
-
-The codebase has 70+ TypeScript/React source files across two Next.js apps (`apps/www`, `apps/katanemo-www`) and shared packages (`packages/ui`, `packages/shared-styles`).
-
-Quality tooling is limited to type checking (`tsc --noEmit`) and linting (Biome).
-
-### Notable Untested Code
-
-- **`apps/www/src/utils/asciiBuilder.ts`** (425 lines) — Pure utility functions for ASCII diagram generation (`calculateCenterPadding`, `createArrow`, `buildBox`, `fixDiagramSpacing`, `createFlowDiagram`). This is the most testable code in the frontend.
-- **`packages/ui/src/`** — 5 shared UI components (Navbar, Footer, Logo, Button, Dialog) used across apps.
-- **`apps/www/src/app/api/contact/route.ts`** — API route handler.
+**Zero test files. No test framework configured.** The codebase has 70+ TypeScript/React source files across two Next.js apps and shared packages. Quality tooling is limited to type checking and Biome linting.
 
 ### Recommendations
 
-10.
**Set up Vitest** (or Jest) in the Turbo workspace with a root-level `test` script. Add `@testing-library/react` for component testing.
+#### Gap 11: No test infrastructure
 
-11. **Add unit tests for `asciiBuilder.ts`.** These are pure functions with clear inputs and outputs — ideal first candidates.
+**Recommendation:** Set up Vitest in the Turbo workspace. Add `@testing-library/react` for component testing. Priority candidates:
+- `apps/www/src/utils/asciiBuilder.ts` (425 lines of pure utility functions — ideal for unit tests)
+- `packages/ui/src/` (shared UI components reused across apps)
 
-12. **Add component tests for shared `packages/ui` components.** These are reused across apps and should have rendering and interaction tests.
-
-Note: The JS/TS apps are marketing websites, not the core proxy. Prioritize this lower than Rust and Python testing.
+**Note:** These are marketing websites, not the core proxy. Prioritize this lower than Rust and Python testing.
 
 ---
 
-## 4. E2E Tests (`tests/e2e/`)
+## 4. E2E and Integration Tests (`tests/`)
 
 ### Current State
 
-~40 active tests across 4 test files, covering:
-- OpenAI and Anthropic SDK integration (streaming and non-streaming)
-- Model alias routing and format translation
-- Function calling end-to-end flows
-- OpenAI Responses API (v1/responses)
-- Conversation state management (memory backend)
-- Cross-provider format translation (OpenAI client → Claude model, etc.)
+| Suite | Tests | Coverage |
+|-------|-------|----------|
+| tests/e2e/test_prompt_gateway.py | 12 | Prompt routing, guardrails, cross-provider SDK compatibility |
+| tests/e2e/test_model_alias_routing.py | 19 | Model aliases, format translation, streaming, error handling |
+| tests/e2e/test_openai_responses_api_client.py | 17 | Responses API across all providers (passthrough, chat completions, Bedrock, Anthropic) |
+| tests/e2e/test_openai_responses_api_client_with_state.py | 2 | Multi-turn conversation state (memory backend) |
+| tests/archgw/test_prompt_gateway.py | 3 | Prompt gateway with mock HTTP server (including 404/500 errors) |
+| tests/archgw/test_llm_gateway.py | 1 | LLM gateway with provider hints |
+| **Total** | **54** | |
 
-### Gaps
+**Additional manual tests:** 3 Hurl files and 6 REST files for exploratory/manual testing.
 
-**Error and failure scenarios are underrepresented.** Only 2 tests cover error handling (400 errors with aliases). There are no tests for:
-- Upstream provider unavailability or timeouts
-- Malformed request payloads
-- Rate limiting behavior
-- Invalid API keys
-- Partial stream failures or disconnections
+### Well-Tested Areas
 
-**Bedrock tests are all skipped.** 6 AWS Bedrock tests are marked as unreliable and skipped, leaving this provider path untested in CI.
+- **Cross-provider format translation:** OpenAI client → Claude model, Anthropic client → OpenAI model, etc. covered via model alias routing tests.
+- **OpenAI Responses API:** Comprehensive coverage across all 4 providers in both streaming and non-streaming modes, with and without tools.
+- **Prompt gateway routing:** Intent matching, parameter gathering, default targets, and jailbreak detection tested end-to-end.
+- **Error handling basics:** 400 errors with invalid aliases, nonexistent aliases, and unsupported parameters.
+- **archgw mock-server tests:** 404 and 500 upstream error handling tested with `pytest_httpserver`.
 
-**PostgreSQL state storage is untested.** State management E2E tests only use the memory backend. The PostgreSQL backend (which is the production path) has no E2E coverage.
+### Gaps and Recommendations
 
-**No concurrent request testing.** There are no tests validating behavior under concurrent load or verifying resource cleanup.
+#### Gap 12: Error and failure scenarios underrepresented
 
-**No configuration validation E2E tests.** Invalid config files, missing required fields, and config hot-reload are not tested end-to-end.
+Only a few tests cover error paths. Missing scenarios:
+- Upstream provider timeouts
+- 5xx errors from LLM providers during streaming
+- Malformed/incomplete streaming responses
+- Rate limiting behavior end-to-end
+- Invalid or expired API keys
 
-### Recommendations
+**Recommendation:** Add E2E error scenario tests. Use a mock upstream that returns errors/timeouts to test resilience behavior without depending on real provider availability.
 
-13. **Add E2E error scenario tests.** Test upstream timeouts, 5xx errors from providers, malformed responses, and rate limit responses. These are the scenarios most likely to cause production incidents.
+#### Gap 13: Bedrock tests unreliable
 
-14. **Fix or replace the skipped Bedrock tests.** If Bedrock is flaky, consider using a mock provider or stub that mimics the Bedrock binary event stream format.
+Several AWS Bedrock tests are marked as skipped/unreliable, reducing coverage of this provider path.
 
-15. **Add PostgreSQL state storage E2E tests.** Use a PostgreSQL container in Docker Compose and test state persistence, multi-turn retrieval, and state cleanup.
+**Recommendation:** Add a mock Bedrock endpoint (or use the archgw mock server pattern from `tests/archgw/`) that returns Bedrock-formatted responses including the binary event stream format. This would make Bedrock tests deterministic.
 
-16.
**Add concurrent request tests.** Use `pytest-xdist` (already in dependencies) to validate behavior under parallel requests.
+#### Gap 14: PostgreSQL state storage not E2E tested
+
+State management E2E tests only use the memory backend. PostgreSQL is the production persistence backend.
+
+**Recommendation:** Add a PostgreSQL container to the E2E Docker Compose setup and add tests for multi-turn state persistence, session retrieval, and cleanup.
+
+#### Gap 15: No concurrent request / load tests
+
+There are no tests for behavior under concurrent requests or verifying proper resource cleanup.
+
+**Recommendation:** Add parallel request tests using `pytest-xdist` (already in dependencies) or `asyncio.gather`. Test for race conditions in state writes and resource cleanup.
+
+#### Gap 16: No configuration validation E2E tests
+
+Invalid configs, missing required fields, and misconfigured providers are not tested end-to-end.
+
+**Recommendation:** Add tests that pass intentionally invalid configs to `planoai up` and verify the error messages and exit behavior.
 
 ---
 
 ## Priority Summary
 
-| Priority | Area | Recommendation |
-|----------|------|----------------|
-| **P0** | Rust: llm_gateway | Add unit tests for streaming response handling (#1) |
-| **P0** | Rust: prompt_gateway | Add unit tests for filter logic and guardrails (#2) |
-| **P0** | Rust: brightstaff pipeline | Test the core pipeline processor (#3) |
-| **P1** | Rust: state persistence | Test memory and PostgreSQL backends (#4) |
-| **P1** | Rust: streaming transforms | Test hermesllm streaming modules (#5) |
-| **P1** | Python: CLI commands | Test main.py commands with CliRunner (#6) |
-| **P1** | Python: targets.py | Test AST parsing logic (#7) |
-| **P1** | E2E: error scenarios | Test upstream failures, timeouts, rate limits (#13) |
-| **P2** | Python: core.py | Test Docker lifecycle management (#8) |
-| **P2** | E2E: PostgreSQL state | Test production state backend (#15) |
-| **P2** | E2E: Bedrock | Fix skipped Bedrock tests (#14) |
-| **P3** | JS/TS: test setup | Set up Vitest, test utilities (#10, #11) |
-| **P3** | E2E: concurrency | Add parallel request tests (#16) |
+| Priority | Area | Gap | Recommendation |
+|----------|------|-----|----------------|
+| **P0** | Rust: llm_gateway | 0 tests, 1,399 LOC | Extract logic from WASM, add unit tests (#1) |
+| **P0** | Rust: prompt_gateway | 4 tests, 1,717 LOC | Test intent matching, prompt guards, filter lifecycle (#2) |
+| **P1** | Rust: handler endpoints | llm.rs, agent_chat_completions.rs untested | Add handler-level tests with mockito (#3) |
+| **P1** | Rust: streaming transforms | to_openai_streaming, to_anthropic_streaming, bedrock binary | Add streaming transform unit tests (#4) |
+| **P1** | Rust: common utilities | routing.rs, http.rs, prompt_guard.rs | Add tests for routing decisions and HTTP utils (#5) |
+| **P1** | Python: main.py | 0 tests, 441 LOC | Test CLI commands with CliRunner (#7) |
+| **P1** | Python: targets.py | 0 tests, 365 LOC | Test AST parsing with sample app fixtures (#8) |
+|
**P1** | E2E: error scenarios | Few error path tests | Add timeout/5xx/rate-limit E2E tests (#12) |
+| **P2** | Rust: state edge cases | No concurrent/expiration tests | Add async edge case tests (#6) |
+| **P2** | Python: core.py/docker_cli.py | 0 tests, 377 LOC | Mock subprocess, test lifecycle (#9) |
+| **P2** | Python: trace_cmd.py | 2 tests for 993 LOC | Test trace processing logic (#10) |
+| **P2** | E2E: Bedrock | Tests skipped as unreliable | Use mock Bedrock endpoint (#13) |
+| **P2** | E2E: PostgreSQL state | Only memory backend tested | Add PG to Docker Compose (#14) |
+| **P3** | JS/TS | 0 tests, no framework | Set up Vitest, test asciiBuilder.ts (#11) |
+| **P3** | E2E: concurrency | No parallel request tests | Add concurrent request tests (#15) |
+| **P3** | E2E: config validation | No invalid config tests | Test error handling for bad configs (#16) |
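
The Gap 9 recommendation (mock `subprocess.run` and test the health check retry loop) could look roughly like the sketch below. It is an illustration under stated assumptions, not code from `core.py` or `docker_cli.py`: `wait_until_healthy`, `fake_run_sequence`, the container name `plano`, and the use of `docker inspect` for the health status are all hypothetical stand-ins, and a real retry loop would also sleep between attempts.

```python
import subprocess
from unittest import mock


def wait_until_healthy(container: str, retries: int = 3) -> bool:
    """Hypothetical stand-in for a health check retry loop (not Plano's code)."""
    for _ in range(retries):
        result = subprocess.run(
            ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0 and result.stdout.strip() == "healthy":
            return True
    return False


def fake_run_sequence(statuses):
    """Build one fake CompletedProcess per expected docker call (test helper)."""
    return [
        subprocess.CompletedProcess(args=[], returncode=0, stdout=s + "\n", stderr="")
        for s in statuses
    ]


def test_becomes_healthy_on_third_attempt():
    # Patch subprocess.run so that no real Docker daemon is needed.
    outputs = fake_run_sequence(["starting", "starting", "healthy"])
    with mock.patch("subprocess.run", side_effect=outputs) as run:
        assert wait_until_healthy("plano", retries=3) is True
        assert run.call_count == 3


def test_gives_up_after_retries():
    outputs = fake_run_sequence(["starting", "starting"])
    with mock.patch("subprocess.run", side_effect=outputs) as run:
        assert wait_until_healthy("plano", retries=2) is False
        assert run.call_count == 2
```

Patching `subprocess.run` at the module level keeps the test independent of Docker, which is what would let a shared `conftest.py` fixture reuse the same fakes across test files.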