
flakestorm Developer FAQ

This document answers common questions developers might have about the flakestorm codebase. It's designed to help project maintainers explain design decisions and help contributors understand the codebase.


Table of Contents

  1. Architecture Questions
  2. Configuration System
  3. Mutation Engine
  4. Assertion System
  5. Performance & Rust
  6. Agent Adapters
  7. Testing & Quality
  8. Extending flakestorm
  9. Common Issues

Architecture Questions

Q: Why is the codebase split into core, mutations, assertions, and reports?

A: This follows the Single Responsibility Principle (SRP) and makes the codebase maintainable:

| Module | Responsibility |
|---|---|
| core/ | Orchestration, configuration, agent communication |
| mutations/ | Adversarial input generation |
| assertions/ | Response validation |
| reports/ | Output formatting |

This separation means:

  • Changes to mutation logic don't affect assertions
  • New report formats can be added without touching core logic
  • Each module can be tested independently

Q: Why use async/await throughout the codebase?

A: Agent testing is I/O-bound, not CPU-bound. The bottleneck is waiting for:

  1. LLM responses (mutation generation)
  2. Agent responses (test execution)

Async allows running many operations concurrently:

# Without async: 100 tests × 500ms = 50 seconds
# With async (10 concurrent): 100 tests / 10 × 500ms = 5 seconds

The semaphore in orchestrator.py controls concurrency:

self._semaphore = asyncio.Semaphore(self.config.advanced.concurrency)

async def _run_single_mutation(self, mutation):
    async with self._semaphore:  # Limits concurrent executions
        return await self.agent.invoke(mutation.mutated)
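The concurrency pattern above can be exercised end-to-end in a self-contained sketch. Here `fake_invoke` is a stand-in for a real agent call (the function name and the sleep duration are illustrative, not flakestorm internals):

```python
import asyncio

# Hypothetical stand-in for an agent call: each "invocation" just sleeps.
async def fake_invoke(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return f"response to {prompt}"

async def run_all(prompts: list[str], concurrency: int) -> list[str]:
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(prompt: str) -> str:
        async with semaphore:  # at most `concurrency` invocations in flight
            return await fake_invoke(prompt)

    # gather() preserves input order, even though tasks finish out of order
    return await asyncio.gather(*(run_one(p) for p in prompts))

results = asyncio.run(run_all([f"p{i}" for i in range(20)], concurrency=10))
print(len(results))  # 20
```

With 20 tasks of 10ms each and a concurrency of 10, the whole batch finishes in roughly two "waves" rather than twenty sequential calls.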

Q: Why is there both an orchestrator.py and a runner.py?

A: They serve different purposes:

  • runner.py: High-level API for users - simple FlakeStormRunner.run() interface
  • orchestrator.py: Internal coordination logic - handles the complex flow

This separation allows:

  • runner.py to provide a clean facade
  • orchestrator.py to be refactored without breaking the public API
  • Different entry points (CLI, programmatic) to use the same core logic

Configuration System

Q: Why Pydantic instead of dataclasses or attrs?

A: Pydantic was chosen for several reasons:

  1. Automatic Validation: Built-in validators with clear error messages

    class MutationConfig(BaseModel):
        count: int = Field(ge=1, le=100)  # Validates range automatically
    
  2. Environment Variable Support: Combines cleanly with the ${VAR} expansion implemented in config.py

    endpoint: str = Field(default="${AGENT_URL}")
    
  3. YAML/JSON Serialization: Works out of the box

  4. IDE Support: Type hints provide autocomplete


Q: Why use environment variable expansion in config?

A: Security best practice - secrets should never be in config files:

# BAD: Secret in file (gets committed to git)
headers:
  Authorization: "Bearer sk-1234567890"

# GOOD: Reference environment variable
headers:
  Authorization: "Bearer ${API_KEY}"

Implementation in config.py:

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} with environment variable value."""
    pattern = r'\$\{([^}]+)\}'
    def replacer(match):
        var_name = match.group(1)
        return os.environ.get(var_name, match.group(0))
    return re.sub(pattern, replacer, value)
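A quick demonstration of the expansion behavior, with the function restated so the snippet runs standalone (note that unknown variables are left untouched rather than replaced with an empty string):

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} with its environment value; leave unknown vars as-is."""
    pattern = r'\$\{([^}]+)\}'
    def replacer(match):
        return os.environ.get(match.group(1), match.group(0))
    return re.sub(pattern, replacer, value)

os.environ["API_KEY"] = "sk-test"
print(expand_env_vars("Bearer ${API_KEY}"))   # Bearer sk-test
print(expand_env_vars("Bearer ${MISSING}"))   # unchanged: Bearer ${MISSING}
```

Leaving unknown variables intact makes misconfiguration visible in error messages instead of silently sending an empty header.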

Q: Why is MutationType defined as str, Enum?

A: String enums serialize directly to YAML/JSON:

class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"

This allows:

# In config file - uses string value directly
mutations:
  types:
    - paraphrase  # Works!
    - noise

If we used a regular Enum, we'd need custom serialization logic.
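The serialization win is easy to verify: a str-mixin enum member compares equal to its string value and JSON-encodes as a plain string with no custom encoder:

```python
import json
from enum import Enum

class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"
    NOISE = "noise"

# Members compare equal to their string values:
assert MutationType.PARAPHRASE == "paraphrase"

# json.dumps() sees a str subclass and emits the raw value:
print(json.dumps({"type": MutationType.NOISE}))  # {"type": "noise"}

# Round-tripping a config string back to the enum also just works:
assert MutationType("noise") is MutationType.NOISE
```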


Mutation Engine

Q: Why use a local LLM (Ollama) instead of cloud APIs?

A: Several important reasons:

| Factor | Local LLM | Cloud API |
|---|---|---|
| Cost | Free | $0.01-0.10 per mutation |
| Privacy | Data stays local | Prompts sent to third party |
| Rate Limits | None | Often restrictive |
| Latency | Low | Network dependent |
| Offline | Works | Requires internet |

For a test run with 100 prompts × 20 mutations = 2,000 LLM calls, a cloud API at $0.01-0.10 per call would cost $20-$200 per run.


Q: Why Qwen Coder 3 8B as the default model?

A: We evaluated several models:

| Model | Memory |
|---|---|
| Qwen Coder 3 8B | 8GB |
| Llama 3 8B | 8GB |
| Mistral 7B | 6GB |
| Phi-3 Mini | 4GB |

Qwen Coder 3 was chosen because:

  1. Excellent at understanding and modifying prompts
  2. Good balance of quality vs. speed
  3. Runs on consumer hardware (8GB VRAM)

Q: How does the mutation template system work?

A: Templates are stored in templates.py and formatted with the original prompt:

TEMPLATES = {
    MutationType.PARAPHRASE: """
    Rewrite this prompt with different words but same meaning.

    Original: {prompt}

    Rewritten:
    """,
    MutationType.NOISE: """
    Add 2-3 realistic typos to this prompt:

    Original: {prompt}

    With typos:
    """
}

The engine fills in {prompt} and sends to the LLM:

template = TEMPLATES[mutation_type]
filled = template.format(prompt=original_prompt)
response = await self.client.generate(model=self.model, prompt=filled)

Q: What if the LLM returns malformed mutations?

A: We have several safeguards:

  1. Parsing Logic: Extracts text between known markers
  2. Validation: Checks mutation isn't identical to original
  3. Retry Logic: Regenerates if parsing fails
  4. Fallback: Uses simple string manipulation if LLM fails

def _parse_mutation(self, response: str) -> str:
    # Try to extract the mutated text
    lines = response.strip().split('\n')
    for line in lines:
        if line and not line.startswith('#'):
            return line.strip()
    raise MutationParseError("Could not extract mutation")
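Here is a sketch of how the four safeguards could compose into one generation loop. The helper names (`generate_mutation`, `simple_fallback`) and the character-swap fallback are hypothetical, not flakestorm's actual implementation:

```python
import random

class MutationParseError(Exception):
    pass

def parse_mutation(response: str) -> str:
    # Safeguard 1: extract the first plausible line of mutated text.
    for line in response.strip().split("\n"):
        if line and not line.startswith("#"):
            return line.strip()
    raise MutationParseError("Could not extract mutation")

def simple_fallback(prompt: str) -> str:
    # Safeguard 4: cheap non-LLM mutation (swap two adjacent characters).
    if len(prompt) < 2:
        return prompt
    i = random.randrange(len(prompt) - 1)
    return prompt[:i] + prompt[i + 1] + prompt[i] + prompt[i + 2:]

def generate_mutation(prompt: str, llm_call, retries: int = 2) -> str:
    for _ in range(retries + 1):
        try:
            mutated = parse_mutation(llm_call(prompt))
            if mutated != prompt:  # Safeguard 2: reject identical output
                return mutated
        except MutationParseError:
            pass  # Safeguard 3: regenerate on parse failure
    return simple_fallback(prompt)

# A stubbed "LLM" that fails to parse once, then succeeds:
calls = iter(["# no usable text", "Rewritten prompt here"])
print(generate_mutation("Original prompt", lambda p: next(calls)))
```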

Assertion System

Q: Why separate deterministic and semantic assertions?

A: They have fundamentally different characteristics:

| Aspect | Deterministic | Semantic |
|---|---|---|
| Speed | Nanoseconds | Milliseconds |
| Dependencies | None | sentence-transformers |
| Reproducibility | 100% | May vary slightly |
| Use Case | Exact matching | Meaning matching |

Separating them allows:

  • Running deterministic checks first (fast-fail)
  • Making semantic checks optional (lighter installation)

Q: How does the SimilarityChecker work internally?

A: It uses sentence embeddings and cosine similarity:

class SimilarityChecker:
    def check(self, response: str, latency_ms: float) -> CheckResult:
        # 1. Embed both texts to vectors
        response_vec = self.embedder.embed(response)      # [0.1, 0.2, ...]
        expected_vec = self.embedder.embed(self.expected) # [0.15, 0.18, ...]

        # 2. Calculate cosine similarity
        similarity = cosine_similarity(response_vec, expected_vec)
        # Returns value between -1 and 1 (typically 0-1 for text)

        # 3. Compare to threshold
        return CheckResult(passed=similarity >= self.threshold)

The embedding model (all-MiniLM-L6-v2) converts text to 384-dimensional vectors that capture semantic meaning.
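The cosine step itself needs nothing beyond the standard library. A minimal sketch, treating embeddings as plain lists of floats (the real checker works on model-produced vectors):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0, orthogonal vectors score 0.0:
print(cosine_similarity([0.1, 0.2], [0.2, 0.4]))  # ≈ 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the score depends only on direction, not magnitude, a long and a short response about the same topic can still score highly.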


Q: Why is the embedder a class variable with lazy loading?

A: The embedding model is large (23MB) and takes 1-2 seconds to load:

class SimilarityChecker:
    _embedder: LocalEmbedder | None = None  # Class variable, shared

    @property
    def embedder(self) -> LocalEmbedder:
        if SimilarityChecker._embedder is None:
            SimilarityChecker._embedder = LocalEmbedder()  # Load once
        return SimilarityChecker._embedder

Benefits:

  1. Lazy Loading: Only loads if semantic checks are used
  2. Shared Instance: All SimilarityCheckers share one model
  3. Memory Efficient: One copy in memory, not one per checker

Q: How does PII detection work?

A: Uses regex patterns for common PII formats:

PII_PATTERNS = [
    (r'\b\d{3}-\d{2}-\d{4}\b', 'SSN'),           # 123-45-6789
    (r'\b\d{16}\b', 'Credit Card'),              # 1234567890123456
    (r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b', 'Email'),
    (r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', 'Phone'), # 123-456-7890
]

def check(self, response: str, latency_ms: float) -> CheckResult:
    for pattern, pii_type in self.PII_PATTERNS:
        if re.search(pattern, response, re.IGNORECASE):
            return CheckResult(
                passed=False,
                details=f"Found potential {pii_type}"
            )
    return CheckResult(passed=True)

Performance & Rust

Q: Why Rust for performance-critical code?

A: Python is slow for CPU-bound operations. Benchmarks show:

Levenshtein Distance (5000 iterations):
  Python: 5864ms
  Rust:     67ms
  Speedup: 88x

Rust was chosen over alternatives because:

  • vs C/C++: Memory safety, easier to write correct code
  • vs Cython: Better tooling (cargo), cleaner code
  • vs NumPy: Works on strings, not just numbers

Q: How does the Rust/Python bridge work?

A: Uses PyO3 for bindings:

// Rust side (lib.rs)
use pyo3::prelude::*;

#[pyfunction]
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Rust implementation
}

#[pymodule]
fn flakestorm_rust(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(levenshtein_distance, m)?)?;
    Ok(())
}

# Python side (performance.py)
try:
    import flakestorm_rust
    _RUST_AVAILABLE = True
except ImportError:
    _RUST_AVAILABLE = False

def levenshtein_distance(s1: str, s2: str) -> int:
    if _RUST_AVAILABLE:
        return flakestorm_rust.levenshtein_distance(s1, s2)
    # Pure Python fallback
    ...
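The fallback body is elided above. As a sketch, a standard two-row dynamic-programming edit distance would look like this (not necessarily flakestorm's actual fallback, but a common shape for one):

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Pure-Python edit distance using the classic two-row DP."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the inner loop over the shorter string
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (c1 != c2)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]

print(levenshtein_distance("kitten", "sitting"))  # 3
```

The two-row trick keeps memory at O(min(len(s1), len(s2))) instead of the full O(n×m) matrix, which also helps when testing parity against the Rust version on long responses.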

Q: Why provide pure Python fallbacks?

A: Accessibility and reliability:

  1. Easy Installation: pip install flakestorm works without Rust toolchain
  2. Platform Support: Works on any Python platform
  3. Development: Faster iteration without recompiling Rust
  4. Testing: Can test both implementations for parity

The tradeoff is speed, but most time is spent waiting for LLM/agent responses anyway.


Agent Adapters

Q: Why use the Protocol pattern for agents?

A: Enables type-safe duck typing:

class AgentProtocol(Protocol):
    async def invoke(self, prompt: str) -> AgentResponse: ...

Any class with a matching invoke method works, even if it doesn't inherit from a base class. This is more Pythonic than Java-style interfaces.
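A small demonstration of the structural check, using the AgentResponse fields from the Python adapter section. With @runtime_checkable, isinstance() verifies the method is present, with no inheritance required:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

@dataclass
class AgentResponse:
    text: str
    latency_ms: float

@runtime_checkable
class AgentProtocol(Protocol):
    async def invoke(self, prompt: str) -> AgentResponse: ...

# EchoAgent never mentions AgentProtocol — a matching method is enough:
class EchoAgent:
    async def invoke(self, prompt: str) -> AgentResponse:
        return AgentResponse(text=prompt, latency_ms=0.0)

# isinstance() with a @runtime_checkable Protocol checks method presence:
print(isinstance(EchoAgent(), AgentProtocol))  # True
print(isinstance(object(), AgentProtocol))     # False
```

Note that the runtime check only confirms the method exists; argument and return types are still enforced by the static type checker, not at runtime.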


Q: How does the HTTP adapter handle different API formats?

A: Through configurable templates:

agent:
  endpoint: "https://api.example.com/v1/chat"
  request_template: |
    {"messages": [{"role": "user", "content": "{prompt}"}]}
  response_path: "$.choices[0].message.content"

The adapter:

  1. Replaces {prompt} in the template
  2. Sends the formatted JSON
  3. Uses JSONPath to extract the response

This supports OpenAI, Anthropic, custom APIs, etc.
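The three steps can be sketched end-to-end. The `resolve` helper below is a toy stand-in that handles only simple dotted paths with list indices, such as the example above; the real adapter would presumably use a proper JSONPath library:

```python
import json
import re

request_template = '{"messages": [{"role": "user", "content": "{prompt}"}]}'

# Step 1: substitute the prompt into the template (JSON-escaping the value).
prompt = 'What is 2+2?'
body = request_template.replace("{prompt}", json.dumps(prompt)[1:-1])

# Step 2: in the real adapter this body is POSTed; here we fake the reply.
response = {"choices": [{"message": {"content": "4"}}]}

# Step 3: extract the answer. Toy resolver for paths like
# "$.choices[0].message.content" (a stand-in for a real JSONPath library).
def resolve(path: str, data):
    for key, index in re.findall(r'(\w+)|\[(\d+)\]', path.lstrip("$.")):
        data = data[key] if key else data[int(index)]
    return data

print(resolve("$.choices[0].message.content", response))  # 4
```

JSON-escaping the substituted prompt matters: a prompt containing quotes or newlines would otherwise produce an invalid request body.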


Q: Why is there a Python adapter?

A: Bypasses HTTP overhead for local testing:

# Instead of: HTTP request → your server → your code → HTTP response
# Just: your_function(prompt) → response

class PythonAgentAdapter:
    async def invoke(self, prompt: str) -> AgentResponse:
        # Import the module dynamically
        module_path, func_name = self.endpoint.rsplit(":", 1)
        module = importlib.import_module(module_path)
        func = getattr(module, func_name)

        # Call directly
        start = time.perf_counter()
        response = await func(prompt) if asyncio.iscoroutinefunction(func) else func(prompt)
        latency = (time.perf_counter() - start) * 1000

        return AgentResponse(text=response, latency_ms=latency)

Q: When do I need to create an HTTP endpoint vs use Python adapter?

A: It depends on your agent's language and setup:

| Your Agent Code | Adapter Type | Endpoint Needed? | Notes |
|---|---|---|---|
| Python (internal) | Python adapter | No | Use type: "python", call function directly |
| TypeScript/JavaScript | HTTP adapter | Yes | Must create HTTP endpoint (can be localhost) |
| Java/Go/Rust | HTTP adapter | Yes | Must create HTTP endpoint (can be localhost) |
| Already has HTTP API | HTTP adapter | Yes | Use existing endpoint |

For non-Python code (TypeScript example):

Since FlakeStorm is a Python CLI tool, it can only directly call Python functions. For TypeScript/JavaScript/other languages, you must create an HTTP endpoint:

// test-endpoint.ts - Wrapper endpoint for FlakeStorm
import express from 'express';
import { generateRedditSearchQuery } from './your-internal-code';

const app = express();
app.use(express.json());

app.post('/flakestorm-test', async (req, res) => {
  // FlakeStorm sends: {"input": "Industry: X\nProduct: Y..."}
  const structuredText = req.body.input;

  // Parse structured input
  const params = parseStructuredInput(structuredText);

  // Call your internal function
  const query = await generateRedditSearchQuery(params);

  // Return in FlakeStorm's expected format
  res.json({ output: query });
});

app.listen(8000, () => {
  console.log('FlakeStorm test endpoint: http://localhost:8000/flakestorm-test');
});

Then in flakestorm.yaml:

agent:
  endpoint: "http://localhost:8000/flakestorm-test"
  type: "http"
  request_template: |
    {"input": "{prompt}"}
  response_path: "$.output"

Q: Do I need a public endpoint or can I use localhost?

A: It depends on where FlakeStorm runs:

| FlakeStorm Location | Agent Location | Endpoint Type | Works? |
|---|---|---|---|
| Same machine | Same machine | localhost:8000 | Yes |
| Different machine | Your machine | localhost:8000 | No - use public endpoint or ngrok |
| CI/CD server | Your machine | localhost:8000 | No - use public endpoint |
| CI/CD server | Cloud (AWS/GCP) | https://api.example.com | Yes |

Options for exposing local endpoint:

  1. ngrok: ngrok http 8000 → get public URL
  2. localtunnel: lt --port 8000 → get public URL
  3. Deploy to cloud: Deploy your test endpoint to a cloud service
  4. VPN/SSH tunnel: If both machines are on same network

Q: Can I test internal code without creating an endpoint?

A: Only if your code is in Python:

# my_agent.py
async def flakestorm_agent(input: str) -> str:
    # Parse input, call your internal functions
    return result

# flakestorm.yaml
agent:
  endpoint: "my_agent:flakestorm_agent"
  type: "python"  # ← No HTTP endpoint needed!

For non-Python code, you must create an HTTP endpoint wrapper.

See Connection Guide for detailed examples and troubleshooting.


Testing & Quality

Q: Why are tests split by module?

A: Mirrors the source structure for maintainability:

tests/
├── test_config.py       # Tests for core/config.py
├── test_mutations.py    # Tests for mutations/
├── test_assertions.py   # Tests for assertions/
├── test_performance.py  # Tests for performance module

When fixing a bug in config.py, you immediately know to check test_config.py.


Q: Why use pytest over unittest?

A: Pytest is more Pythonic and powerful:

# unittest style (verbose)
class TestConfig(unittest.TestCase):
    def test_load_config(self):
        self.assertEqual(config.agent.type, AgentType.HTTP)

# pytest style (concise)
def test_load_config():
    assert config.agent.type == AgentType.HTTP

Pytest also offers:

  • Fixtures for setup/teardown
  • Parametrized tests
  • Better assertion introspection

Q: How should I add tests for a new feature?

A: Follow this pattern:

  1. Create test file if needed: tests/test_<module>.py
  2. Write failing test first (TDD)
  3. Group related tests in a class
  4. Use fixtures for common setup

# tests/test_new_feature.py
import pytest
from flakestorm.new_module import NewFeature

class TestNewFeature:
    @pytest.fixture
    def feature(self):
        return NewFeature(config={...})

    def test_basic_functionality(self, feature):
        result = feature.do_something()
        assert result == expected

    def test_edge_case(self, feature):
        with pytest.raises(ValueError):
            feature.do_something(invalid_input)

Extending flakestorm

Q: How do I add a new mutation type?

A: Three steps:

  1. Add to enum (mutations/types.py):

    class MutationType(str, Enum):
        # ... existing types
        MY_NEW_TYPE = "my_new_type"
    
  2. Add template (mutations/templates.py):

    TEMPLATES[MutationType.MY_NEW_TYPE] = """
    Your prompt template here.
    
    Original: {prompt}
    
    Modified:
    """
    
  3. Add default weight (core/config.py):

    class MutationConfig(BaseModel):
        weights: dict = {
            # ... existing weights
            MutationType.MY_NEW_TYPE: 1.0,
        }
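One plausible way the weights could drive selection is weighted sampling with random.choices; the actual engine logic may differ, so treat this as a sketch:

```python
import random
from collections import Counter
from enum import Enum

class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"
    NOISE = "noise"
    MY_NEW_TYPE = "my_new_type"

# Hypothetical weights: paraphrase twice as likely as the others.
weights = {
    MutationType.PARAPHRASE: 2.0,
    MutationType.NOISE: 1.0,
    MutationType.MY_NEW_TYPE: 1.0,
}

random.seed(42)  # deterministic for the example
picks = random.choices(
    population=list(weights),
    weights=list(weights.values()),
    k=100,
)
print(Counter(p.value for p in picks))
```

A weight of 0.0 effectively disables a type without removing it from the enum.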
    

Q: How do I add a new assertion type?

A: Four steps:

  1. Create checker class (assertions/deterministic.py or semantic.py):

    class MyNewChecker(BaseChecker):
        def check(self, response: str, latency_ms: float) -> CheckResult:
            # Your logic here
            passed = some_condition(response)
            return CheckResult(
                passed=passed,
                check_type=InvariantType.MY_NEW_TYPE,
                details="Explanation"
            )
    
  2. Add to enum (core/config.py):

    class InvariantType(str, Enum):
        # ... existing types
        MY_NEW_TYPE = "my_new_type"
    
  3. Register in verifier (assertions/verifier.py):

    CHECKER_REGISTRY = {
        # ... existing checkers
        InvariantType.MY_NEW_TYPE: MyNewChecker,
    }
    
  4. Add tests (tests/test_assertions.py)
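To see why the registry matters, here is a minimal sketch of how the verifier could dispatch through it. `ContainsChecker` and its constructor are hypothetical; only the CheckResult fields and the registry shape come from the steps above:

```python
from dataclasses import dataclass
from enum import Enum

class InvariantType(str, Enum):
    CONTAINS = "contains"

@dataclass
class CheckResult:
    passed: bool
    check_type: InvariantType
    details: str = ""

class ContainsChecker:
    def __init__(self, expected: str):
        self.expected = expected

    def check(self, response: str, latency_ms: float) -> CheckResult:
        return CheckResult(
            passed=self.expected in response,
            check_type=InvariantType.CONTAINS,
        )

# The registry maps each invariant type to its checker class, so the
# verifier can instantiate and run checkers without hard-coded branches:
CHECKER_REGISTRY = {InvariantType.CONTAINS: ContainsChecker}

checker = CHECKER_REGISTRY[InvariantType.CONTAINS]("hello")
result = checker.check("hello world", latency_ms=12.0)
print(result.passed)  # True
```

Adding a new assertion type then touches only the enum, one class, and one registry entry, with no changes to the dispatch loop.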


Q: How do I add a new report format?

A: Create a new generator:

# reports/markdown.py
from datetime import datetime
from pathlib import Path

class MarkdownReportGenerator:
    def __init__(self, results: TestResults):
        self.results = results

    def generate(self) -> str:
        """Generate markdown content."""
        md = "# flakestorm Report\n\n"
        md += f"**Score:** {self.results.statistics.robustness_score:.2f}\n"
        # ... more content
        return md

    def save(self, path: Path | None = None) -> Path:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = path or Path(f"reports/report_{timestamp}.md")
        path.write_text(self.generate())
        return path

Then add CLI option in cli/main.py.


Common Issues

Q: Why am I getting "Cannot connect to Ollama"?

A: Ollama service isn't running. Fix:

# Start Ollama
ollama serve

# Verify it's running
curl http://localhost:11434/api/version

Q: Why is mutation generation slow?

A: LLM inference is inherently slow. Options:

  1. Use a faster model: ollama pull phi3:mini
  2. Reduce mutation count: mutations.count: 10
  3. Use GPU: Ensure Ollama uses GPU acceleration

Q: Why do tests pass locally but fail in CI?

A: Common causes:

  1. Missing Ollama: CI needs Ollama service
  2. Different model: Ensure same model is pulled
  3. Timing: CI may be slower, increase timeouts
  4. Environment variables: Ensure secrets are set in CI

Q: How do I debug a failing assertion?

A: Enable verbose mode and check the report:

flakestorm run --verbose --output html

The HTML report shows:

  • Original prompt
  • Mutated prompt
  • Agent response
  • Which assertion failed and why

Have more questions? Open an issue on GitHub!