diff --git a/TEST_COVERAGE_ANALYSIS.md b/TEST_COVERAGE_ANALYSIS.md
index 1c00170a..322ae536 100644
--- a/TEST_COVERAGE_ANALYSIS.md
+++ b/TEST_COVERAGE_ANALYSIS.md
@@ -224,6 +224,234 @@ Invalid configs, missing required fields, and misconfigured providers are not te
 
 ---
 
+## 5. Test Organization: Current State and Restructuring Proposals
+
+### Current Layout
+
+```
+plano/
+├── crates/                          # Rust — all tests inline
+│   ├── brightstaff/src/*.rs         #   #[cfg(test)] mod tests at bottom of source files
+│   ├── common/src/*.rs              #   same pattern
+│   ├── hermesllm/src/*.rs           #   same pattern
+│   ├── llm_gateway/src/*.rs         #   (no tests)
+│   └── prompt_gateway/src/*.rs      #   (deprecated)
+├── cli/test/                        # Python CLI unit tests
+│   ├── test_config_generator.py
+│   ├── test_init.py
+│   ├── test_trace_cmd.py
+│   └── test_version_check.py
+├── tests/
+│   ├── archgw/                      # Integration tests (mock HTTP server, no API keys needed)
+│   │   ├── test_prompt_gateway.py
+│   │   ├── test_llm_gateway.py
+│   │   ├── common.py
+│   │   ├── common.sh
+│   │   ├── config.yaml
+│   │   ├── docker-compose.yaml
+│   │   └── pyproject.toml           # separate dependency management
+│   ├── e2e/                         # E2E tests (requires Docker + real API keys)
+│   │   ├── test_prompt_gateway.py
+│   │   ├── test_model_alias_routing.py
+│   │   ├── test_openai_responses_api_client.py
+│   │   ├── test_openai_responses_api_client_with_state.py
+│   │   ├── common.py
+│   │   ├── common_scripts.sh
+│   │   ├── run_e2e_tests.sh         # master orchestrator
+│   │   ├── run_prompt_gateway_tests.sh
+│   │   ├── run_model_alias_tests.sh
+│   │   ├── run_responses_state_tests.sh
+│   │   ├── docker-compose.yaml
+│   │   └── pyproject.toml           # separate dependency management
+│   ├── hurl/                        # Manual HTTP tests (Hurl format)
+│   └── rest/                        # Manual HTTP tests (VS Code REST Client)
+```
+
+### Problems With the Current Organization
+
+**1. Three separate Python dependency environments.** `cli/test/`, `tests/e2e/`, and `tests/archgw/` each have their own `pyproject.toml` and `uv.lock`. This means three `uv sync` calls, three separate dependency graphs, and duplicated dependencies (`pytest`, `requests`, `deepdiff`). Updating a shared dependency means touching three files.
+
+**2. Duplicated test infrastructure.** `tests/e2e/common.py` and `tests/archgw/common.py` define overlapping constants (`PROMPT_GATEWAY_ENDPOINT`, `LLM_GATEWAY_ENDPOINT`, test fixture data). Changes to endpoint ports or test data need to be mirrored across both files.
+
+**3. No conftest.py files.** Neither E2E nor integration tests use `conftest.py` for shared fixtures. Common setup (endpoint URLs, test data factories, health-check waits) is imported via `common.py` modules or inlined. This misses pytest's fixture scoping and teardown capabilities.
+
+**4. Shell-script-orchestrated E2E.** The E2E suite is driven by bash scripts (`run_e2e_tests.sh`, `run_prompt_gateway_tests.sh`, etc.) that start/stop Docker containers, install the CLI, and invoke pytest. This makes it hard to run individual E2E tests locally, and the lifecycle management (container start → test → container stop → reconfigure → restart → test) is fragile. A failure in one suite can leave containers running.
+
+**5. `tests/archgw/` naming is stale.** The directory name `archgw` references the project's old name (Arch Gateway). It should reflect its purpose (mock-based integration testing).
+
+**6. No Rust integration test directory.** All Rust tests are inline unit tests (`#[cfg(test)] mod tests`). There is no `crates/*/tests/` directory for integration tests that exercise multiple modules together. The only exception is `brightstaff/src/handlers/integration_tests.rs`, which is a source file masquerading as an integration test.
+
+**7. No shared Rust test utilities.** Test helper functions (creating mock configs, building test requests/responses, constructing state objects) are duplicated across inline test modules. There is no `testutils` crate or shared test module.
+
+### Restructuring Proposals
+
+#### Proposal A: Unify Python test dependencies
+
+Consolidate the three separate `pyproject.toml` files into a single test workspace, or at minimum extract shared dependencies into a common base.
+
+**Suggested approach:** Create a single `tests/pyproject.toml` that covers both `archgw/` and `e2e/` suites, with dependency groups:
+
+```toml
+[project]
+name = "plano-tests"
+dependencies = [
+    "pytest>=8.3",
+    "requests>=2.29",
+    "deepdiff>=8.0",
+    "pytest-retry>=1.6",
+    "pytest-xdist>=3.5",
+]
+
+[project.optional-dependencies]
+e2e = ["anthropic", "openai"]         # only needed for real-provider E2E
+integration = ["pytest-httpserver"]    # only needed for mock-server tests
+```
+
+CLI tests (`cli/test/`) can stay separate since they live within the CLI package. The E2E and integration tests should share a single dependency environment.
+
+#### Proposal B: Replace shell orchestration with pytest fixtures
+
+Replace `run_e2e_tests.sh` and per-suite shell scripts with pytest session-scoped fixtures that manage Docker container lifecycle:
+
+```python
+# tests/e2e/conftest.py
+import pytest, subprocess, time
+
+@pytest.fixture(scope="session")
+def plano_container(request):
+    config = request.param  # e.g., "config_model_alias.yaml"
+    subprocess.run(["planoai", "up", config, "--detach"], check=True)
+    _wait_for_healthy("http://localhost:12000/healthz")
+    yield
+    subprocess.run(["planoai", "down"], check=True)
+```
+
+Benefits:
+- Run individual tests with `pytest tests/e2e/test_model_alias_routing.py` without needing a shell wrapper.
+- pytest handles cleanup on failure (fixture teardown runs even on test errors).
+- Parametrize the container fixture to test multiple configs without separate scripts.
+- Compatible with `pytest-xdist` for parallel suite execution.
+
+#### Proposal C: Consolidate shared Python test fixtures
+
+Create a `tests/conftest.py` (or `tests/shared/`) with common fixtures shared by both `archgw/` and `e2e/`:
+
+```
+tests/
+├── conftest.py              # shared fixtures: endpoints, test data factories, health-check helpers
+├── integration/             # renamed from archgw/
+│   ├── conftest.py          # integration-specific fixtures (pytest-httpserver)
+│   ├── test_llm_gateway.py
+│   └── docker-compose.yaml
+├── e2e/
+│   ├── conftest.py          # e2e-specific fixtures (container lifecycle)
+│   ├── test_model_alias_routing.py
+│   ├── test_openai_responses_api_client.py
+│   ├── test_openai_responses_api_client_with_state.py
+│   └── docker-compose.yaml
+└── configs/                 # test config YAML files shared across suites
+    ├── model_alias_routing.yaml
+    ├── memory_state_v1_responses.yaml
+    └── ...
+```
+
+Key changes:
+- Rename `archgw/` → `integration/` to reflect its purpose.
+- Promote shared constants and test data from `common.py` into `conftest.py` fixtures.
+- Remove deprecated prompt_gateway test files from both suites.
+- Centralize test config YAML files in `tests/configs/`.
+
+#### Proposal D: Add a Rust `testutils` crate
+
+Create a workspace-internal crate for shared test utilities:
+
+```
+crates/
+├── testutils/               # new crate, #[cfg(test)] only
+│   ├── Cargo.toml
+│   └── src/
+│       ├── lib.rs
+│       ├── fixtures.rs      # factory functions for Configuration, OpenAiChatCompletionRequest, etc.
+│       ├── mock_server.rs   # wrappers around mockito for common provider mocking patterns
+│       └── assertions.rs    # custom assert helpers for comparing LLM responses
+```
+
+Then in other crates:
+```toml
+[dev-dependencies]
+testutils = { path = "../testutils" }
+```
+
+This eliminates duplication of test helper functions across `brightstaff`, `hermesllm`, and `common` inline test modules.
+
+#### Proposal E: Add Rust integration tests for brightstaff
+
+Move `integration_tests.rs` out of `src/handlers/` and into a proper `tests/` directory:
+
+```
+crates/brightstaff/
+├── src/
+│   └── handlers/
+│       └── (no more integration_tests.rs here)
+└── tests/                   # Cargo integration test directory
+    ├── pipeline_flow.rs     # AgentSelector → PipelineProcessor → ResponseHandler
+    ├── state_roundtrip.rs   # write state → read state across memory/PG backends
+    └── routing_decisions.rs # config → routing decision validation
+```
+
+Cargo automatically discovers `tests/*.rs` as integration tests. They compile as separate binaries and test the crate's public API, which is the right boundary for these multi-module tests.
+
+#### Proposal F: Add pytest markers for test categorization
+
+Currently there is no way to selectively run tests by category. Add markers:
+
+```python
+# tests/conftest.py
+def pytest_configure(config):
+    config.addinivalue_line("markers", "streaming: tests that exercise streaming responses")
+    config.addinivalue_line("markers", "bedrock: tests that require/mock AWS Bedrock")
+    config.addinivalue_line("markers", "state: tests that exercise state persistence")
+    config.addinivalue_line("markers", "error_handling: tests for error/failure paths")
+    config.addinivalue_line("markers", "slow: tests that take >10s")
+```
+
+Then annotate tests:
+```python
+@pytest.mark.streaming
+@pytest.mark.bedrock
+def test_openai_client_streaming_with_bedrock():
+    ...
+```
+
+This enables targeted test runs: `pytest -m "streaming and not bedrock"`, `pytest -m error_handling`, etc.
+
+#### Proposal G: Standardize CI test jobs
+
+The current CI has 6+ separate jobs for Python tests, each with its own setup, Docker build, and teardown. Consolidate into a clearer structure:
+
+| Job | What it runs | Docker needed | API keys needed |
+|-----|-------------|---------------|-----------------|
+| `rust-unit` | `cargo test --lib` | No | No |
+| `python-cli-unit` | `cd cli && uv run pytest` | No | No |
+| `integration` | `cd tests/integration && uv run pytest` | Yes (compose) | No |
+| `e2e` | `cd tests/e2e && uv run pytest` | Yes (plano container) | Yes |
+
+This reduces the current 6+ Python-related jobs to 3, with clearer separation of what requires external dependencies.
+
+### Proposal Priority
+
+| Proposal | Effort | Impact | Priority |
+|----------|--------|--------|----------|
+| **C: Consolidate fixtures + rename archgw/** | Low | High — reduces duplication, clearer naming | Do first |
+| **F: Add pytest markers** | Low | Medium — enables selective test runs | Do first |
+| **B: Replace shell orchestration with fixtures** | Medium | High — enables local E2E dev, better cleanup | Do second |
+| **A: Unify Python test dependencies** | Medium | Medium — simplifies maintenance | Do second |
+| **D: Rust testutils crate** | Medium | Medium — enables faster test writing | Do with new Rust tests |
+| **E: Rust integration test directory** | Low | Medium — proper Cargo conventions | Do with new Rust tests |
+| **G: Consolidate CI jobs** | Medium | Medium — faster CI, simpler config | Do after A-C |
+
+---
+
 ## Priority Summary
 
 | Priority | Area | Gap | Recommendation |