flakestorm/docs/TESTING_GUIDE.md

25 KiB
Raw Blame History

Testing Guide

This guide explains how to run, write, and expand tests for flakestorm. It covers the remaining testing items from the implementation checklist.


Table of Contents

  1. Running Tests
  2. Test Structure
  3. V2 Integration Tests
  4. Writing Tests: Agent Adapters
  5. Writing Tests: Orchestrator
  6. Writing Tests: Report Generation
  7. Integration Tests
  8. CLI Tests
  9. Test Fixtures

Running Tests

Prerequisites

# Install dev dependencies
# Create virtual environment first
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Then install
pip install -e ".[dev]"

# Or manually
pip install pytest pytest-asyncio pytest-cov

Running All Tests

# Full test suite
pytest

# With coverage report
pytest --cov=src/flakestorm --cov-report=html

# Verbose output
pytest -v

# Run specific test file
pytest tests/test_config.py

# Run specific test class
pytest tests/test_assertions.py::TestContainsChecker

# Run specific test
pytest tests/test_assertions.py::TestContainsChecker::test_contains_match

Test Categories

# Unit tests only (fast)
pytest tests/test_config.py tests/test_mutations.py tests/test_assertions.py

# Performance tests (requires Rust module)
pytest tests/test_performance.py

# Integration tests (requires Ollama)
pytest tests/test_integration.py

# V2 integration tests (chaos, contract, replay)
pytest tests/test_chaos_integration.py tests/test_contract_integration.py tests/test_replay_integration.py

Test Structure

tests/
├── __init__.py
├── conftest.py                    # Shared fixtures
├── test_config.py                 # Configuration loading tests
├── test_mutations.py              # Mutation engine tests
├── test_assertions.py             # Assertion checkers tests
├── test_performance.py            # Rust/Python bridge tests
├── test_adapters.py               # Agent adapter tests
├── test_orchestrator.py           # Orchestrator tests
├── test_reports.py                # Report generation tests
├── test_cli.py                    # CLI command tests
├── test_integration.py            # Full integration tests
├── test_chaos_integration.py      # V2: chaos (tool/LLM faults, interceptor)
├── test_contract_integration.py   # V2: contract (N×M matrix, score, critical fail)
└── test_replay_integration.py     # V2: replay (session → replay → pass/fail)

V2 Integration Tests

V2 adds three integration test modules; all gaps are closed (see GAP_VERIFICATION).

Module What it tests
test_chaos_integration.py Chaos interceptor, tool faults (match_url/tool *), LLM faults (truncated, empty, garbage, rate_limit, response_drift).
test_contract_integration.py Contract engine: invariants × chaos matrix, reset between cells, resilience score (severity-weighted), critical failure → FAIL.
test_replay_integration.py Replay loader (file/format), ReplayRunner verification against contract, contract resolution by name/path.

For CI pipelines that use V2, run the full suite including these; flakestorm ci runs mutation, contract (if configured), chaos-only (if configured), and replay (if configured), then computes the overall weighted score from scoring.weights.


Writing Tests: Agent Adapters

Location: tests/test_adapters.py

What to Test

  1. HTTPAgentAdapter

    • Sends correct HTTP request format
    • Handles successful responses
    • Handles error responses (4xx, 5xx)
    • Respects timeout settings
    • Retries on transient failures
    • Extracts response using JSONPath
  2. PythonAgentAdapter

    • Imports module correctly
    • Calls sync and async functions
    • Handles exceptions gracefully
    • Measures latency correctly
  3. LangChainAgentAdapter

    • Invokes LangChain agents correctly
    • Handles different chain types

Example Test File

# tests/test_adapters.py
"""Tests for agent adapters."""

import pytest
from unittest.mock import AsyncMock, MagicMock, patch
import asyncio

# Import the modules to test
from flakestorm.core.protocol import (
    HTTPAgentAdapter,
    PythonAgentAdapter,
    AgentResponse,
)
from flakestorm.core.config import AgentConfig, AgentType


class TestHTTPAgentAdapter:
    """Tests for HTTP agent adapter."""

    @pytest.fixture
    def http_config(self):
        """Create a test HTTP agent config."""
        return AgentConfig(
            endpoint="http://localhost:8000/chat",
            type=AgentType.HTTP,
            timeout=30,
            request_template='{"message": "{prompt}"}',
            response_path="$.reply",
        )

    @pytest.fixture
    def adapter(self, http_config):
        """Create adapter instance."""
        return HTTPAgentAdapter(http_config)

    @pytest.mark.asyncio
    async def test_invoke_success(self, adapter):
        """Test successful invocation."""
        with patch("httpx.AsyncClient.post") as mock_post:
            mock_response = MagicMock()
            mock_response.status_code = 200
            mock_response.json.return_value = {"reply": "Hello there!"}
            mock_post.return_value = mock_response

            result = await adapter.invoke("Hello")

            assert isinstance(result, AgentResponse)
            assert result.text == "Hello there!"
            assert result.latency_ms > 0

    @pytest.mark.asyncio
    async def test_invoke_formats_request(self, adapter):
        """Test that request template is formatted correctly."""
        with patch("httpx.AsyncClient.post") as mock_post:
            mock_response = MagicMock()
            mock_response.status_code = 200
            mock_response.json.return_value = {"reply": "OK"}
            mock_post.return_value = mock_response

            await adapter.invoke("Test prompt")

            # Verify the request body
            call_args = mock_post.call_args
            assert '"message": "Test prompt"' in str(call_args)

    @pytest.mark.asyncio
    async def test_invoke_timeout(self, adapter):
        """Test timeout handling."""
        with patch("httpx.AsyncClient.post") as mock_post:
            mock_post.side_effect = asyncio.TimeoutError()

            with pytest.raises(TimeoutError):
                await adapter.invoke("Hello")

    @pytest.mark.asyncio
    async def test_invoke_http_error(self, adapter):
        """Test HTTP error handling."""
        with patch("httpx.AsyncClient.post") as mock_post:
            mock_response = MagicMock()
            mock_response.status_code = 500
            mock_response.text = "Internal Server Error"
            mock_post.return_value = mock_response

            with pytest.raises(Exception):
                await adapter.invoke("Hello")


class TestPythonAgentAdapter:
    """Tests for Python function adapter."""

    @pytest.fixture
    def python_config(self):
        """Create a test Python agent config."""
        return AgentConfig(
            endpoint="tests.fixtures.mock_agent:handle_message",
            type=AgentType.PYTHON,
            timeout=30,
        )

    @pytest.mark.asyncio
    async def test_invoke_sync_function(self):
        """Test invoking a sync function."""
        # Create a mock module with a sync function
        def mock_handler(prompt: str) -> str:
            return f"Echo: {prompt}"

        with patch.dict("sys.modules", {"mock_module": MagicMock(handler=mock_handler)}):
            config = AgentConfig(
                endpoint="mock_module:handler",
                type=AgentType.PYTHON,
            )
            adapter = PythonAgentAdapter(config)

            # This would need the actual implementation to work
            # For now, test the structure

    @pytest.mark.asyncio
    async def test_invoke_async_function(self):
        """Test invoking an async function."""
        async def mock_handler(prompt: str) -> str:
            await asyncio.sleep(0.01)
            return f"Async Echo: {prompt}"

        # Similar test structure


class TestAgentAdapterFactory:
    """Tests for adapter factory function."""

    def test_creates_http_adapter(self):
        """Factory creates HTTP adapter for HTTP type."""
        from flakestorm.core.protocol import create_agent_adapter

        config = AgentConfig(
            endpoint="http://localhost:8000/chat",
            type=AgentType.HTTP,
        )
        adapter = create_agent_adapter(config)
        assert isinstance(adapter, HTTPAgentAdapter)

    def test_creates_python_adapter(self):
        """Factory creates Python adapter for Python type."""
        from flakestorm.core.protocol import create_agent_adapter

        config = AgentConfig(
            endpoint="my_module:my_function",
            type=AgentType.PYTHON,
        )
        adapter = create_agent_adapter(config)
        assert isinstance(adapter, PythonAgentAdapter)

How to Run

# Run adapter tests
pytest tests/test_adapters.py -v

# Run with coverage
pytest tests/test_adapters.py --cov=src/flakestorm/core/protocol

Writing Tests: Orchestrator

Location: tests/test_orchestrator.py

What to Test

  1. Mutation Generation Phase

    • Generates correct number of mutations
    • Handles all mutation types
    • Handles LLM failures gracefully
  2. Test Execution Phase

    • Runs mutations in parallel
    • Respects concurrency limits
    • Handles agent failures
    • Measures latency correctly
  3. Result Aggregation

    • Calculates statistics correctly
    • Scores results with correct weights
    • Groups results by mutation type

Example Test File

# tests/test_orchestrator.py
"""Tests for the flakestorm orchestrator."""

import pytest
from unittest.mock import AsyncMock, MagicMock, patch
from datetime import datetime

from flakestorm.core.orchestrator import Orchestrator, OrchestratorState
from flakestorm.core.config import FlakeStormConfig, AgentConfig, MutationConfig
from flakestorm.mutations.types import Mutation, MutationType
from flakestorm.assertions.verifier import CheckResult


class TestOrchestratorState:
    """Tests for orchestrator state tracking."""

    def test_initial_state(self):
        """State initializes correctly."""
        state = OrchestratorState()
        assert state.total_mutations == 0
        assert state.completed_mutations == 0
        assert state.completed_at is None

    def test_state_updates(self):
        """State updates as tests run."""
        state = OrchestratorState()
        state.total_mutations = 10
        state.completed_mutations = 5
        assert state.completed_mutations == 5


class TestOrchestrator:
    """Tests for main orchestrator."""

    @pytest.fixture
    def mock_config(self):
        """Create a minimal test config."""
        return FlakeStormConfig(
            agent=AgentConfig(
                endpoint="http://localhost:8000/chat",
                type="http",
            ),
            golden_prompts=["Test prompt 1", "Test prompt 2"],
            mutations=MutationConfig(
                count=5,
                types=[MutationType.PARAPHRASE],
            ),
        )

    @pytest.fixture
    def mock_agent(self):
        """Create a mock agent adapter."""
        agent = AsyncMock()
        agent.invoke.return_value = MagicMock(
            text="Agent response",
            latency_ms=100.0,
        )
        return agent

    @pytest.fixture
    def mock_mutation_engine(self):
        """Create a mock mutation engine."""
        engine = AsyncMock()
        engine.generate_mutations.return_value = [
            Mutation(
                original="Test",
                mutated="Test variation",
                type=MutationType.PARAPHRASE,
                difficulty=1.0,
            )
        ]
        return engine

    @pytest.fixture
    def mock_verifier(self):
        """Create a mock verifier."""
        verifier = MagicMock()
        verifier.verify.return_value = [
            CheckResult(passed=True, check_type="contains", details="OK")
        ]
        return verifier

    @pytest.mark.asyncio
    async def test_run_generates_mutations(
        self, mock_config, mock_agent, mock_mutation_engine, mock_verifier
    ):
        """Orchestrator generates mutations for all golden prompts."""
        orchestrator = Orchestrator(
            config=mock_config,
            agent=mock_agent,
            mutation_engine=mock_mutation_engine,
            verifier=mock_verifier,
        )

        await orchestrator.run()

        # Should have called generate_mutations for each golden prompt
        assert mock_mutation_engine.generate_mutations.call_count == 2

    @pytest.mark.asyncio
    async def test_run_invokes_agent(
        self, mock_config, mock_agent, mock_mutation_engine, mock_verifier
    ):
        """Orchestrator invokes agent for each mutation."""
        orchestrator = Orchestrator(
            config=mock_config,
            agent=mock_agent,
            mutation_engine=mock_mutation_engine,
            verifier=mock_verifier,
        )

        await orchestrator.run()

        # Should have invoked agent for each mutation
        # 2 golden prompts × 1 mutation each = 2 invocations
        assert mock_agent.invoke.call_count >= 2

    @pytest.mark.asyncio
    async def test_run_returns_results(
        self, mock_config, mock_agent, mock_mutation_engine, mock_verifier
    ):
        """Orchestrator returns complete test results."""
        orchestrator = Orchestrator(
            config=mock_config,
            agent=mock_agent,
            mutation_engine=mock_mutation_engine,
            verifier=mock_verifier,
        )

        results = await orchestrator.run()

        assert results is not None
        assert hasattr(results, "statistics")
        assert hasattr(results, "mutations")

    @pytest.mark.asyncio
    async def test_handles_agent_failure(
        self, mock_config, mock_mutation_engine, mock_verifier
    ):
        """Orchestrator handles agent failures gracefully."""
        failing_agent = AsyncMock()
        failing_agent.invoke.side_effect = Exception("Agent error")

        orchestrator = Orchestrator(
            config=mock_config,
            agent=failing_agent,
            mutation_engine=mock_mutation_engine,
            verifier=mock_verifier,
        )

        # Should not raise, should mark test as failed
        results = await orchestrator.run()
        assert results is not None

Writing Tests: Report Generation

Location: tests/test_reports.py

What to Test

  1. HTMLReportGenerator

    • Generates valid HTML
    • Contains all required sections
    • Includes statistics
    • Includes mutation details
  2. JSONReportGenerator

    • Generates valid JSON
    • Contains all required fields
    • Serializes datetime correctly
  3. TerminalReporter

    • Formats output correctly
    • Handles different result types

Example Test File

# tests/test_reports.py
"""Tests for report generation."""

import pytest
import json
from datetime import datetime
from pathlib import Path
import tempfile

from flakestorm.reports.models import TestResults, TestStatistics, MutationResult
from flakestorm.reports.html import HTMLReportGenerator
from flakestorm.reports.json_export import JSONReportGenerator


class TestHTMLReportGenerator:
    """Tests for HTML report generation."""

    @pytest.fixture
    def sample_results(self):
        """Create sample test results."""
        return TestResults(
            config=None,  # Simplified for testing
            mutations=[
                MutationResult(
                    mutation=None,
                    response="Test response",
                    latency_ms=100.0,
                    passed=True,
                    checks=[],
                )
            ],
            statistics=TestStatistics(
                total_mutations=10,
                passed_mutations=8,
                failed_mutations=2,
                robustness_score=0.8,
                avg_latency_ms=150.0,
                p50_latency_ms=120.0,
                p95_latency_ms=300.0,
                p99_latency_ms=450.0,
                by_type=[],
            ),
            timestamp=datetime.now(),
        )

    def test_generate_returns_string(self, sample_results):
        """Generator returns HTML string."""
        generator = HTMLReportGenerator(sample_results)
        html = generator.generate()

        assert isinstance(html, str)
        assert len(html) > 0

    def test_generate_valid_html(self, sample_results):
        """Generated HTML is valid."""
        generator = HTMLReportGenerator(sample_results)
        html = generator.generate()

        assert "<html" in html
        assert "</html>" in html
        assert "<head>" in html
        assert "<body>" in html

    def test_contains_robustness_score(self, sample_results):
        """Report contains robustness score."""
        generator = HTMLReportGenerator(sample_results)
        html = generator.generate()

        assert "0.8" in html or "80%" in html

    def test_save_creates_file(self, sample_results):
        """save() creates file on disk."""
        with tempfile.TemporaryDirectory() as tmpdir:
            generator = HTMLReportGenerator(sample_results)
            path = generator.save(Path(tmpdir) / "report.html")

            assert path.exists()
            assert path.read_text().startswith("<!DOCTYPE html>")


class TestJSONReportGenerator:
    """Tests for JSON report generation."""

    @pytest.fixture
    def sample_results(self):
        """Create sample test results."""
        return TestResults(
            config=None,
            mutations=[],
            statistics=TestStatistics(
                total_mutations=10,
                passed_mutations=8,
                failed_mutations=2,
                robustness_score=0.8,
                avg_latency_ms=150.0,
                p50_latency_ms=120.0,
                p95_latency_ms=300.0,
                p99_latency_ms=450.0,
                by_type=[],
            ),
            timestamp=datetime(2024, 1, 15, 12, 0, 0),
        )

    def test_generate_valid_json(self, sample_results):
        """Generator produces valid JSON."""
        generator = JSONReportGenerator(sample_results)
        json_str = generator.generate()

        # Should not raise
        data = json.loads(json_str)
        assert isinstance(data, dict)

    def test_contains_statistics(self, sample_results):
        """JSON contains statistics."""
        generator = JSONReportGenerator(sample_results)
        data = json.loads(generator.generate())

        assert "statistics" in data
        assert data["statistics"]["robustness_score"] == 0.8

Integration Tests

Location: tests/test_integration.py

Prerequisites

Integration tests require:

  1. Ollama running locally
  2. A model pulled (e.g., ollama pull qwen2.5-coder:7b)
  3. A mock agent running

Example Test File

# tests/test_integration.py
"""Integration tests for full flakestorm workflow."""

import pytest
import asyncio
from pathlib import Path
import tempfile

# Skip all tests if Ollama is not running
pytest_plugins = ["pytest_asyncio"]


def ollama_available():
    """Check if Ollama is running."""
    from flakestorm.integrations.huggingface import HuggingFaceModelProvider
    return HuggingFaceModelProvider.verify_ollama_connection()


@pytest.mark.skipif(not ollama_available(), reason="Ollama not running")
class TestFullWorkflow:
    """Integration tests for complete test runs."""

    @pytest.mark.asyncio
    async def test_full_run_with_mock_agent(self):
        """Test complete workflow with mock agent."""
        # This test would:
        # 1. Start a mock agent
        # 2. Create config
        # 3. Run flakestorm
        # 4. Verify results
        pass

    @pytest.mark.asyncio
    async def test_mutation_generation(self):
        """Test that mutation engine generates valid mutations."""
        from flakestorm.mutations.engine import MutationEngine
        from flakestorm.core.config import LLMConfig

        config = LLMConfig(
            model="qwen2.5-coder:7b",
            host="http://localhost:11434",
        )
        engine = MutationEngine(config)

        mutations = await engine.generate_mutations(
            prompt="Hello, world!",
            types=[MutationType.PARAPHRASE],
            count=3,
        )

        assert len(mutations) > 0
        assert all(m.mutated != "Hello, world!" for m in mutations)

CLI Tests

Location: tests/test_cli.py

How to Test CLI Commands

Use the CliRunner from Typer for testing:

# tests/test_cli.py
"""Tests for CLI commands."""

import pytest
from typer.testing import CliRunner
import tempfile
from pathlib import Path

from flakestorm.cli.main import app

runner = CliRunner()


class TestInitCommand:
    """Tests for `flakestorm init`."""

    def test_init_creates_config(self):
        """init creates flakestorm.yaml."""
        with tempfile.TemporaryDirectory() as tmpdir:
            result = runner.invoke(
                app, ["init", "--dir", tmpdir]
            )
            assert result.exit_code == 0
            assert (Path(tmpdir) / "flakestorm.yaml").exists()

    def test_init_no_overwrite(self):
        """init doesn't overwrite existing config."""
        with tempfile.TemporaryDirectory() as tmpdir:
            config_path = Path(tmpdir) / "flakestorm.yaml"
            config_path.write_text("existing: content")

            result = runner.invoke(
                app, ["init", "--dir", tmpdir]
            )
            # Should warn about existing file
            assert "exists" in result.output.lower() or result.exit_code != 0


class TestVerifyCommand:
    """Tests for `flakestorm verify`."""

    def test_verify_valid_config(self):
        """verify accepts valid config."""
        with tempfile.TemporaryDirectory() as tmpdir:
            config_path = Path(tmpdir) / "flakestorm.yaml"
            config_path.write_text("""
agent:
  endpoint: "http://localhost:8000/chat"
  type: http

golden_prompts:
  - "Test prompt"
""")
            result = runner.invoke(
                app, ["verify", "--config", str(config_path)]
            )
            assert result.exit_code == 0

    def test_verify_invalid_config(self):
        """verify rejects invalid config."""
        with tempfile.TemporaryDirectory() as tmpdir:
            config_path = Path(tmpdir) / "flakestorm.yaml"
            config_path.write_text("invalid: yaml: content:")

            result = runner.invoke(
                app, ["verify", "--config", str(config_path)]
            )
            assert result.exit_code != 0


class TestHelpCommand:
    """Tests for help output."""

    def test_main_help(self):
        """Main help displays commands."""
        result = runner.invoke(app, ["--help"])
        assert result.exit_code == 0
        assert "run" in result.output
        assert "init" in result.output

    def test_run_help(self):
        """Run command help displays options."""
        result = runner.invoke(app, ["run", "--help"])
        assert result.exit_code == 0
        assert "--config" in result.output
        assert "--output" in result.output

Test Fixtures

Shared Fixtures in conftest.py

# tests/conftest.py
"""Shared test fixtures."""

import pytest
from pathlib import Path
import tempfile


@pytest.fixture
def temp_dir():
    """Create a temporary directory."""
    with tempfile.TemporaryDirectory() as tmpdir:
        yield Path(tmpdir)


@pytest.fixture
def sample_config_yaml():
    """Sample valid config YAML."""
    return """
agent:
  endpoint: "http://localhost:8000/chat"
  type: http
  timeout: 30

golden_prompts:
  - "Test prompt 1"
  - "Test prompt 2"

mutations:
  count: 5
  types:
    - paraphrase
    - noise

invariants:
  - type: latency
    max_ms: 5000
"""


@pytest.fixture
def config_file(temp_dir, sample_config_yaml):
    """Create a config file in temp directory."""
    config_path = temp_dir / "flakestorm.yaml"
    config_path.write_text(sample_config_yaml)
    return config_path

Summary: Remaining Test Items

Checklist Item Test File Status
Test agent adapters tests/test_adapters.py Template provided above
Test orchestrator tests/test_orchestrator.py Template provided above
Test report generation tests/test_reports.py Template provided above
Test CLI commands tests/test_cli.py Template provided above
Full integration test tests/test_integration.py Template provided above

Quick Start

  1. Copy the templates above to create test files
  2. Run: pytest tests/test_<module>.py -v
  3. Add more test cases as needed
  4. Run full suite: pytest

Happy testing! 🧪