Correct test coverage analysis with accurate counts

Previous version undercounted tests significantly. Corrected findings:
- brightstaff state has 26 tests (memory: 16, postgresql: 10), not 0
- brightstaff pipeline_processor has 5 tests, not 0
- hermesllm streaming buffers have 12 tests across 4 modules
- common has 36 tests across 10 files
- Total: ~370 tests (314 Rust + 54 E2E/integration + Hurl/REST)
- Now properly documents what IS well-tested alongside the gaps

https://claude.ai/code/session_01Shz5qKiTB9m6oxzEZWJVKk
Claude 2026-02-18 14:42:38 +00:00
parent 8eccb9da00
commit f80b73b3fe

@@ -4,9 +4,9 @@
## Executive Summary
The Plano codebase has significant test coverage gaps across all components. The Rust crates have ~262 unit tests covering roughly 0.35% of ~75,800 lines of code. The Python CLI has 29 tests covering 4 of 12 modules. The JavaScript/TypeScript apps and packages have **zero tests**. The E2E suite covers the core happy-path flows well but lacks error, edge-case, and performance scenarios.
The Plano codebase has **~370 automated tests**: ~297 Rust unit tests, ~65 Python tests (29 CLI + 50 E2E + 4 archgw integration), 10 Hurl/REST manual test files, and zero JS/TS tests. Coverage is strong in the LLM translation layer (hermesllm) and behavioral signals (brightstaff/signals), moderate in state management and configuration, and weak in the WASM gateway plugins and several Python CLI modules.
Below is a prioritized breakdown of gaps and recommendations.
Below is a detailed breakdown by component with prioritized improvement recommendations.
---
@@ -14,40 +14,86 @@ Below is a prioritized breakdown of gaps and recommendations.
### Current State
| Crate | LOC | Tests | Status |
|-------|-----|-------|--------|
| common | 3,912 | 33 | Partial |
| hermesllm | 17,540 | 134 | Partial |
| prompt_gateway | 1,717 | 4 | Critical gap |
| llm_gateway | 1,399 | 0 | Critical gap |
| brightstaff | 13,342 | 91 | Partial |
| **Total** | **~75,800** | **262** | |
| Crate | Tests | Files With Tests | Status |
|-------|-------|------------------|--------|
| hermesllm | 148 | 21 | Good — broad coverage of provider translation |
| brightstaff | 126 | 11 | Good — signals/state/routing well tested; handler endpoints less so |
| common | 36 | 10 | Moderate — core utilities covered; some gaps |
| prompt_gateway | 4 | 2 | Weak — WASM filter mostly untested |
| llm_gateway | 0 | 0 | None — WASM filter completely untested |
| **Total** | **~314** | **44** | |
### Critical Gaps
### Well-Tested Areas
**llm_gateway — 0 tests, 1,399 LOC.** This WASM filter handles LLM request/response processing and streaming. It has no tests at all. `stream_context.rs` alone is ~1,000 lines of complex streaming logic with zero coverage.
- **hermesllm provider translation (148 tests):** Request/response transforms for all providers (OpenAI, Anthropic, Bedrock, Gemini, Mistral) are thoroughly tested. Streaming response parsing (20 tests), endpoint resolution (11 tests), request generation (16 tests), and cross-provider format conversion (~45 tests) are solid.
- **hermesllm streaming buffers (12 tests):** SSE chunk processor (6 tests), Anthropic streaming buffer (3 tests), Responses API streaming buffer (2 tests), and passthrough buffer (1 test) have coverage.
- **brightstaff signals/analyzer (48 tests):** Character n-gram similarity, token cosine similarity, layered matching, frustration/escalation/positive-feedback detection are thoroughly tested.
- **brightstaff state management (26 tests):** In-memory state (16 tests) and PostgreSQL persistence (10 tests) have good unit test coverage.
- **brightstaff function calling (17 tests):** Tool extraction, JSON fixing, hallucination detection, and tool call verification are well covered.
- **brightstaff routing models (17 tests):** Orchestrator model v1 (9 tests) and router model v1 (8 tests) are tested.
- **brightstaff pipeline processor (5 tests):** Has basic test coverage (4 tokio::test + 1 sync test).
- **brightstaff agent selector (5 tests):** Listener lookup and agent map creation are tested.
- **brightstaff response handler (5 tests):** Response transformation has tests.
- **common rate limiting (8 tests):** Rate limit logic with token quotas and header-based selectors is tested.
- **common OpenAI API (9 tests):** Chat completion parsing and request conversions covered.
**prompt_gateway — 4 tests, 1,717 LOC.** The WASM filter for prompt processing and guardrails has near-zero coverage. Untested modules include `filter_context.rs`, `http_context.rs`, `context.rs`, and `metrics.rs`. The intent-matching logic in `stream_context.rs` (~900 lines) has only 1 test.
### Gaps and Recommendations
**brightstaff pipeline and state management — ~2,200 LOC untested.** The core request pipeline (`handlers/pipeline_processor.rs`, 834 lines), state persistence layer (`state/memory.rs`, `state/postgresql.rs`, `state/response_state_processor.rs` — 1,370 lines combined), and key handler endpoints (`handlers/llm.rs`, `handlers/agent_chat_completions.rs`) have no tests.
#### Gap 1: `llm_gateway` crate — 0 tests (1,399 LOC)
### Partially Covered Areas Needing More Tests
This WASM filter handles all LLM request/response processing and streaming. `stream_context.rs` (~1,000 lines) manages streaming chunk assembly and response forwarding with zero coverage.
- **hermesllm streaming transforms** — The non-streaming request/response transforms are well-tested (134 tests), but the streaming buffer modules (`sse.rs`, `amazon_bedrock_binary_frame.rs`, `to_openai_streaming.rs`, `to_anthropic_streaming.rs` — ~5,000 LOC) are untested.
- **common/routing.rs, common/errors.rs, common/http.rs, common/stats.rs, common/tracing.rs** — Utility modules totaling ~560 lines with no coverage.
- **brightstaff router services:** `llm_router.rs` and `plano_orchestrator.rs` (~400 lines) lack tests despite handling routing decisions.
**Recommendation:** Extract core logic from the WASM host context into pure, testable functions. Test streaming chunk reassembly, header manipulation, error response construction, and the filter lifecycle. Consider a thin WASM shim over well-tested logic modules.
### Recommendations
#### Gap 2: `prompt_gateway` crate — 4 tests (1,717 LOC)
1. **Add unit tests for llm_gateway.** Start with `stream_context.rs` — test streaming chunk assembly, partial frame handling, error recovery, and the filter lifecycle. A WASM-mocking test harness or extracting the core logic into testable pure functions would help.
The WASM prompt filter has tests only in `tools.rs` (3 tests) and `stream_context.rs` (1 test). The filter/HTTP context lifecycle (`filter_context.rs`, `http_context.rs`), prompt guard logic, and metrics collection are untested.
2. **Add unit tests for prompt_gateway filter logic.** Test `http_context.rs` request/response handling, `filter_context.rs` lifecycle, and the guardrail filtering paths in `stream_context.rs`.
**Recommendation:** Add tests for intent matching and prompt guard/jailbreak detection in `stream_context.rs`. Test `http_context.rs` request parsing and response construction. Same architectural approach as llm_gateway — separate testable logic from WASM host bindings.
3. **Test the brightstaff pipeline processor.** This is the central message processing pipeline. Mock the downstream dependencies and test the orchestration logic, error paths, and streaming assembly.
#### Gap 3: brightstaff handler endpoints — limited coverage
4. **Test state persistence.** Both the in-memory and PostgreSQL backends need tests for basic CRUD, concurrent access, state expiration, and connection failure recovery.
Several handler modules have no unit tests:
- `handlers/llm.rs` (553 LOC) — LLM chat handler
- `handlers/agent_chat_completions.rs` (418 LOC) — Multi-agent orchestration
- `handlers/router_chat.rs` (159 LOC) — Router endpoint
- `handlers/utils.rs` (288 LOC) — Handler utilities
5. **Test hermesllm streaming transforms.** The SSE parser, Bedrock binary frame decoder, and streaming-to-OpenAI/Anthropic converters need unit tests, especially for edge cases like partial frames, malformed chunks, and connection resets.
The pipeline_processor has only 5 tests for 834 LOC — basic flow is covered but error paths and edge cases are not.
**Recommendation:** Add tests for error paths in `pipeline_processor.rs` (malformed requests, downstream failures, timeout handling). Add handler-level tests for `llm.rs` and `agent_chat_completions.rs` using `mockito` (already a dev dependency) to mock HTTP backends.
#### Gap 4: hermesllm streaming *transforms* — 0 tests
While the streaming *buffers* (SSE parser, Anthropic buffer, etc.) have tests, the streaming *transform* modules that convert between formats during streaming are untested:
- `transforms/response_streaming/to_openai_streaming.rs`
- `transforms/response_streaming/to_anthropic_streaming.rs`
Also untested: `apis/streaming_shapes/amazon_bedrock_binary_frame.rs` (AWS Event Stream binary decoding) and `apis/streaming_shapes/chat_completions_streaming_buffer.rs`.
**Recommendation:** Add tests for the streaming transform modules. The Bedrock binary frame decoder is particularly important — it parses a proprietary binary protocol and failures here are hard to diagnose in production.
#### Gap 5: common utility modules — no tests
Several `common` modules lack tests:
- `routing.rs` — Provider routing logic
- `errors.rs` — Error types (ClientError, ServerError)
- `http.rs` — HTTP utilities and CallArgs
- `stats.rs` — Metrics traits
- `api/prompt_guard.rs` — Prompt guard types
- `api/zero_shot.rs` — Zero-shot classification types
**Recommendation:** Add tests for `routing.rs` (routing decisions), `http.rs` (CallArgs construction, URL handling), and `prompt_guard.rs` (guard rule evaluation). The error/stats/consts modules are mostly type definitions and don't need extensive testing.
#### Gap 6: brightstaff state — edge cases
The state backends have solid basic coverage (26 tests total), but lack tests for:
- Concurrent access patterns
- State expiration/eviction
- Connection failure recovery (PostgreSQL)
- Large conversation histories
**Recommendation:** Add tokio::test cases for concurrent read/write scenarios in the memory backend and connection pool behavior in the PostgreSQL backend.
---
@@ -55,42 +101,45 @@ Below is a prioritized breakdown of gaps and recommendations.
### Current State
| Module | LOC | Tested? |
|--------|-----|---------|
| config_generator.py | 514 | Yes |
| versioning.py | 70 | Yes |
| init_cmd.py | 303 | Yes |
| trace_cmd.py | 993 | Minimal (2 tests) |
| main.py | 441 | No |
| targets.py | 365 | No |
| core.py | 234 | No |
| docker_cli.py | 143 | No |
| template_sync.py | 122 | No |
| utils.py | 285 | Partial |
| Test File | Tests | Modules Covered |
|-----------|-------|-----------------|
| test_config_generator.py | 11 (5 functions + 6 parametrized) | config_generator, utils |
| test_version_check.py | 18 (4 classes, 18 methods) | versioning |
| test_init.py | 4 | init_cmd |
| test_trace_cmd.py | 2 | trace_cmd (minimal) |
| **Total** | **35 executions** | **5 of 13 modules** |
**29 total tests across 4 files. 8 of 12 modules are untested or minimally tested.**
### Well-Tested Areas
### Critical Gaps
- **versioning.py (18 tests):** Version parsing, comparison, PyPI fetching, network error handling, and environment variable overrides are thoroughly tested across 4 test classes.
- **config_generator.py (11 tests):** Happy-path config validation, schema validation errors (6 parametrized cases), and legacy format conversion are covered.
- **init_cmd.py (4 tests):** Clean init, template init, overwrite protection, and force overwrite are tested.
**main.py — 0 tests, 441 LOC.** All CLI commands (`up`, `down`, `build`, `logs`, `cli_agent`, `generate_prompt_targets`) are untested. The `up` command alone contains complex logic for port conflict detection, API key validation, and container orchestration.
### Gaps and Recommendations
**targets.py — 0 tests, 365 LOC.** The AST-based Python code parser for extracting prompt targets from Flask/FastAPI routes and Pydantic models is entirely untested. This is complex parsing logic prone to edge cases.
#### Gap 7: `main.py` — 0 tests (441 LOC)
**core.py — 0 tests, 234 LOC.** Docker container lifecycle management (start, stop, health check retry loop, timeout handling) is untested.
The CLI entry point defines all Click commands (`up`, `down`, `build`, `logs`, `cli_agent`, `generate_prompt_targets`). None have tests. The `up` command has complex logic for port conflict detection, API key validation, and container orchestration.
**docker_cli.py — 0 tests, 143 LOC.** All 7 Docker subprocess wrapper functions lack tests.
**Recommendation:** Add tests using Click's `CliRunner`. Test `planoai up` with mocked Docker calls (validate argument handling, port conflict error messages, API key resolution). Test `planoai down` and `planoai build` for basic argument handling and error paths.
**trace_cmd.py — 2 tests for 993 LOC.** Only gRPC server bind error handling is tested. The trace collection, OTEL processing, and trace analysis logic are untested.
#### Gap 8: `targets.py` — 0 tests (365 LOC)
### Recommendations
AST-based Python code parser that extracts prompt targets from Flask/FastAPI routes and Pydantic models. This is complex parsing logic prone to edge cases with decorators, type annotations, and docstrings.
6. **Add CLI command tests using Click's CliRunner.** Test `planoai up`, `planoai down`, and `planoai build` with mocked Docker operations. Verify argument validation, error messages, and exit codes.
**Recommendation:** Create test fixtures with sample Flask/FastAPI app files and verify extracted prompt targets. Test edge cases: nested decorators, complex type hints (Optional, Union, List[dict]), missing docstrings, and unsupported patterns.
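The fixture-driven approach could look roughly like the sketch below. The actual API of `targets.py` is not shown in this report, so `extract_routes` is a hypothetical stand-in built on the standard `ast` module, and the fixture sources and field names (`name`, `path`, `doc`, `params`) are assumptions chosen for illustration. A real test would import the extractor from `targets.py` instead.

```python
import ast
import textwrap


def extract_routes(source: str) -> list[dict]:
    """Hypothetical stand-in for the route extractor in targets.py."""
    routes = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for deco in node.decorator_list:
            # Match decorators shaped like @app.route("/x") or @app.get("/x").
            if (
                isinstance(deco, ast.Call)
                and isinstance(deco.func, ast.Attribute)
                and deco.func.attr in {"route", "get", "post"}
                and deco.args
                and isinstance(deco.args[0], ast.Constant)
            ):
                routes.append({
                    "name": node.name,
                    "path": deco.args[0].value,
                    "doc": ast.get_docstring(node),
                    "params": [a.arg for a in node.args.args],
                })
    return routes


# Sample app sources used as fixtures. They are only parsed, never executed.
FLASK_FIXTURE = textwrap.dedent('''
    @login_required
    @app.route("/weather")
    def get_weather(city, days):
        """Return a forecast."""
        return {}
''')

FASTAPI_FIXTURE = textwrap.dedent('''
    @app.get("/rates")
    async def get_rates(currency: Optional[str] = None):
        return {}
''')


def test_flask_route_with_stacked_decorators():
    (route,) = extract_routes(FLASK_FIXTURE)
    assert route["name"] == "get_weather"
    assert route["path"] == "/weather"
    assert route["params"] == ["city", "days"]
    assert route["doc"] == "Return a forecast."


def test_fastapi_route_without_docstring():
    (route,) = extract_routes(FASTAPI_FIXTURE)
    assert route["path"] == "/rates"
    assert route["params"] == ["currency"]
    assert route["doc"] is None
```

Keeping the fixtures as source strings (or small files under a fixtures directory) lets each edge case, such as stacked decorators or a missing docstring, get its own named test.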
7. **Add tests for targets.py.** Test Flask route extraction, FastAPI route extraction, Pydantic model field parsing, type annotation handling, and edge cases (nested decorators, complex type hints, missing docstrings).
#### Gap 9: `core.py` and `docker_cli.py` — 0 tests (377 LOC combined)
8. **Add tests for core.py with mocked subprocess/Docker calls.** Test the health check retry loop, container state transitions (not found → start, running → restart), timeout behavior, and port forwarding.
Container lifecycle management and Docker subprocess wrappers are untested.
9. **Add a shared conftest.py** with common fixtures for environment setup, temporary config files, and Docker mocking.
**Recommendation:** Mock `subprocess.run` / `subprocess.Popen` and test the health check retry loop, container state transitions, and error handling. A shared `conftest.py` with Docker mock fixtures would benefit multiple test files.
#### Gap 10: `trace_cmd.py` — 2 tests for 993 LOC
Only gRPC bind error handling is tested. Trace collection, OTEL span processing, and trace analysis logic (the bulk of the module) are untested.
**Recommendation:** Add tests for trace data parsing and the analysis/summarization logic. Mock gRPC server interactions for collection tests.
---
@@ -98,85 +147,100 @@ Below is a prioritized breakdown of gaps and recommendations.
### Current State
**Zero test files. No test framework configured. No test scripts in any package.json.**
The codebase has 70+ TypeScript/React source files across two Next.js apps (`apps/www`, `apps/katanemo-www`) and shared packages (`packages/ui`, `packages/shared-styles`).
Quality tooling is limited to type checking (`tsc --noEmit`) and linting (Biome).
### Notable Untested Code
- **`apps/www/src/utils/asciiBuilder.ts`** (425 lines) — Pure utility functions for ASCII diagram generation (`calculateCenterPadding`, `createArrow`, `buildBox`, `fixDiagramSpacing`, `createFlowDiagram`). This is the most testable code in the frontend.
- **`packages/ui/src/`** — 5 shared UI components (Navbar, Footer, Logo, Button, Dialog) used across apps.
- **`apps/www/src/app/api/contact/route.ts`** — API route handler.
**Zero test files. No test framework configured.** The codebase has 70+ TypeScript/React source files across two Next.js apps and shared packages. Quality tooling is limited to type checking and Biome linting.
### Recommendations
10. **Set up Vitest** (or Jest) in the Turbo workspace with a root-level `test` script. Add `@testing-library/react` for component testing.
#### Gap 11: No test infrastructure
11. **Add unit tests for `asciiBuilder.ts`.** These are pure functions with clear inputs and outputs — ideal first candidates.
**Recommendation:** Set up Vitest in the Turbo workspace. Add `@testing-library/react` for component testing. Priority candidates:
- `apps/www/src/utils/asciiBuilder.ts` (425 lines of pure utility functions — ideal for unit tests)
- `packages/ui/src/` (shared UI components reused across apps)
12. **Add component tests for shared `packages/ui` components.** These are reused across apps and should have rendering and interaction tests.
Note: The JS/TS apps are marketing websites, not the core proxy. Prioritize this lower than Rust and Python testing.
**Note:** These are marketing websites, not the core proxy. Prioritize this lower than Rust and Python testing.
---
## 4. E2E Tests (`tests/e2e/`)
## 4. E2E and Integration Tests (`tests/`)
### Current State
~40 active tests across 4 test files, covering:
- OpenAI and Anthropic SDK integration (streaming and non-streaming)
- Model alias routing and format translation
- Function calling end-to-end flows
- OpenAI Responses API (v1/responses)
- Conversation state management (memory backend)
- Cross-provider format translation (OpenAI client → Claude model, etc.)
| Suite | Tests | Coverage |
|-------|-------|----------|
| tests/e2e/test_prompt_gateway.py | 12 | Prompt routing, guardrails, cross-provider SDK compatibility |
| tests/e2e/test_model_alias_routing.py | 19 | Model aliases, format translation, streaming, error handling |
| tests/e2e/test_openai_responses_api_client.py | 17 | Responses API across all providers (passthrough, chat completions, Bedrock, Anthropic) |
| tests/e2e/test_openai_responses_api_client_with_state.py | 2 | Multi-turn conversation state (memory backend) |
| tests/archgw/test_prompt_gateway.py | 3 | Prompt gateway with mock HTTP server (including 404/500 errors) |
| tests/archgw/test_llm_gateway.py | 1 | LLM gateway with provider hints |
| **Total** | **54** | |
### Gaps
**Additional manual tests:** 3 Hurl files and 6 REST files for exploratory/manual testing.
**Error and failure scenarios are underrepresented.** Only 2 tests cover error handling (400 errors with aliases). There are no tests for:
- Upstream provider unavailability or timeouts
- Malformed request payloads
- Rate limiting behavior
- Invalid API keys
- Partial stream failures or disconnections
### Well-Tested Areas
**Bedrock tests are all skipped.** 6 AWS Bedrock tests are marked as unreliable and skipped, leaving this provider path untested in CI.
- **Cross-provider format translation:** OpenAI client → Claude model, Anthropic client → OpenAI model, etc. covered via model alias routing tests.
- **OpenAI Responses API:** Comprehensive coverage across all 4 providers in both streaming and non-streaming modes, with and without tools.
- **Prompt gateway routing:** Intent matching, parameter gathering, default targets, and jailbreak detection tested end-to-end.
- **Error handling basics:** 400 errors with invalid aliases, nonexistent aliases, and unsupported parameters.
- **archgw mock-server tests:** 404 and 500 upstream error handling tested with `pytest_httpserver`.
**PostgreSQL state storage is untested.** State management E2E tests only use the memory backend. The PostgreSQL backend (which is the production path) has no E2E coverage.
### Gaps and Recommendations
**No concurrent request testing.** There are no tests validating behavior under concurrent load or verifying resource cleanup.
#### Gap 12: Error and failure scenarios underrepresented
**No configuration validation E2E tests.** Invalid config files, missing required fields, and config hot-reload are not tested end-to-end.
Only a few tests cover error paths. Missing scenarios:
- Upstream provider timeouts
- 5xx errors from LLM providers during streaming
- Malformed/incomplete streaming responses
- Rate limiting behavior end-to-end
- Invalid or expired API keys
### Recommendations
**Recommendation:** Add E2E error scenario tests. Use a mock upstream that returns errors/timeouts to test resilience behavior without depending on real provider availability.
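As a rough illustration of a mock upstream, the sketch below starts a local HTTP server that answers every POST with a 503. It uses only the standard library; the existing `tests/archgw/` suite uses `pytest_httpserver`, which would serve the same purpose. The class and helper names are hypothetical, and for brevity the client calls the mock directly, whereas a real E2E test would send the request through the gateway configured to use the mock as its provider.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FailingUpstream(BaseHTTPRequestHandler):
    """Mock provider that answers every POST with HTTP 503."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # drain the request body
        body = b'{"error": "upstream unavailable"}'
        self.send_response(503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_mock_upstream() -> HTTPServer:
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), FailingUpstream)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_status(url: str) -> int:
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, data=b"{}", method="POST")
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def test_upstream_503_is_surfaced():
    server = start_mock_upstream()
    try:
        url = f"http://127.0.0.1:{server.server_port}/v1/chat/completions"
        assert post_status(url) == 503
    finally:
        server.shutdown()
        server.server_close()
```

The same handler can be varied to sleep before responding (timeouts) or to close the connection mid-body (truncated streams).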
13. **Add E2E error scenario tests.** Test upstream timeouts, 5xx errors from providers, malformed responses, and rate limit responses. These are the scenarios most likely to cause production incidents.
#### Gap 13: Bedrock tests unreliable
14. **Fix or replace the skipped Bedrock tests.** If Bedrock is flaky, consider using a mock provider or stub that mimics the Bedrock binary event stream format.
Several AWS Bedrock tests are marked as skipped/unreliable, reducing coverage of this provider path.
15. **Add PostgreSQL state storage E2E tests.** Use a PostgreSQL container in Docker Compose and test state persistence, multi-turn retrieval, and state cleanup.
**Recommendation:** Add a mock Bedrock endpoint (or use the archgw mock server pattern from `tests/archgw/`) that returns Bedrock-formatted responses including the binary event stream format. This would make Bedrock tests deterministic.
16. **Add concurrent request tests.** Use `pytest-xdist` (already in dependencies) to validate behavior under parallel requests.
#### Gap 14: PostgreSQL state storage not E2E tested
State management E2E tests only use the memory backend. PostgreSQL is the production persistence backend.
**Recommendation:** Add a PostgreSQL container to the E2E Docker Compose setup and add tests for multi-turn state persistence, session retrieval, and cleanup.
#### Gap 15: No concurrent request / load tests
There are no tests for behavior under concurrent requests or verifying proper resource cleanup.
**Recommendation:** Add parallel request tests using `pytest-xdist` (already in dependencies) or `asyncio.gather`. Test for race conditions in state writes and resource cleanup.
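The `asyncio.gather` variant might be sketched as follows. `InMemoryStateStore` is a hypothetical stand-in, not the project's state backend, and a real E2E test would issue concurrent HTTP requests against the running gateway instead. The sketch shows the shape of a lost-update check: many writers append to one conversation, and the test asserts that every write survived.

```python
import asyncio


class InMemoryStateStore:
    """Hypothetical stand-in for a conversation state backend."""

    def __init__(self):
        self._data: dict[str, list[str]] = {}
        self._lock = asyncio.Lock()

    async def append(self, conversation_id: str, message: str) -> None:
        async with self._lock:
            history = list(self._data.get(conversation_id, []))
            # Yield to the event loop between read and write. Without the
            # lock, other writers would interleave here and updates would
            # be lost.
            await asyncio.sleep(0)
            history.append(message)
            self._data[conversation_id] = history

    async def get(self, conversation_id: str) -> list[str]:
        return list(self._data.get(conversation_id, []))


async def write_concurrently(writers: int) -> list[str]:
    # The store is created inside the running loop so its lock binds to it.
    store = InMemoryStateStore()
    await asyncio.gather(
        *(store.append("conv-1", f"msg-{i}") for i in range(writers))
    )
    return await store.get("conv-1")


def test_no_lost_updates():
    history = asyncio.run(write_concurrently(50))
    assert sorted(history) == sorted(f"msg-{i}" for i in range(50))
```

Removing the lock in this sketch makes the test fail, which is a quick way to confirm that the test actually detects the race it is meant to guard against.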
#### Gap 16: No configuration validation E2E tests
Invalid configs, missing required fields, and misconfigured providers are not tested end-to-end.
**Recommendation:** Add tests that pass intentionally invalid configs to `planoai up` and verify the error messages and exit behavior.
---
## Priority Summary
| Priority | Area | Recommendation |
|----------|------|----------------|
| **P0** | Rust: llm_gateway | Add unit tests for streaming response handling (#1) |
| **P0** | Rust: prompt_gateway | Add unit tests for filter logic and guardrails (#2) |
| **P0** | Rust: brightstaff pipeline | Test the core pipeline processor (#3) |
| **P1** | Rust: state persistence | Test memory and PostgreSQL backends (#4) |
| **P1** | Rust: streaming transforms | Test hermesllm streaming modules (#5) |
| **P1** | Python: CLI commands | Test main.py commands with CliRunner (#6) |
| **P1** | Python: targets.py | Test AST parsing logic (#7) |
| **P1** | E2E: error scenarios | Test upstream failures, timeouts, rate limits (#13) |
| **P2** | Python: core.py | Test Docker lifecycle management (#8) |
| **P2** | E2E: PostgreSQL state | Test production state backend (#15) |
| **P2** | E2E: Bedrock | Fix skipped Bedrock tests (#14) |
| **P3** | JS/TS: test setup | Set up Vitest, test utilities (#10, #11) |
| **P3** | E2E: concurrency | Add parallel request tests (#16) |
| Priority | Area | Gap | Recommendation |
|----------|------|-----|----------------|
| **P0** | Rust: llm_gateway | 0 tests, 1,399 LOC | Extract logic from WASM, add unit tests (#1) |
| **P0** | Rust: prompt_gateway | 4 tests, 1,717 LOC | Test intent matching, prompt guards, filter lifecycle (#2) |
| **P1** | Rust: handler endpoints | llm.rs, agent_chat_completions.rs untested | Add handler-level tests with mockito (#3) |
| **P1** | Rust: streaming transforms | to_openai_streaming, to_anthropic_streaming, bedrock binary | Add streaming transform unit tests (#4) |
| **P1** | Rust: common utilities | routing.rs, http.rs, prompt_guard.rs | Add tests for routing decisions and HTTP utils (#5) |
| **P1** | Python: main.py | 0 tests, 441 LOC | Test CLI commands with CliRunner (#7) |
| **P1** | Python: targets.py | 0 tests, 365 LOC | Test AST parsing with sample app fixtures (#8) |
| **P1** | E2E: error scenarios | Few error path tests | Add timeout/5xx/rate-limit E2E tests (#12) |
| **P2** | Rust: state edge cases | No concurrent/expiration tests | Add async edge case tests (#6) |
| **P2** | Python: core.py/docker_cli.py | 0 tests, 377 LOC | Mock subprocess, test lifecycle (#9) |
| **P2** | Python: trace_cmd.py | 2 tests for 993 LOC | Test trace processing logic (#10) |
| **P2** | E2E: Bedrock | Tests skipped as unreliable | Use mock Bedrock endpoint (#13) |
| **P2** | E2E: PostgreSQL state | Only memory backend tested | Add PG to Docker Compose (#14) |
| **P3** | JS/TS | 0 tests, no framework | Set up Vitest, test asciiBuilder.ts (#11) |
| **P3** | E2E: concurrency | No parallel request tests | Add concurrent request tests (#15) |
| **P3** | E2E: config validation | No invalid config tests | Test error handling for bad configs (#16) |