# flakestorm Developer FAQ

This document answers common questions developers might have about the flakestorm codebase. It's designed to help project maintainers explain design decisions and help contributors understand the codebase.

---

## Table of Contents

1. [Architecture Questions](#architecture-questions)
2. [Configuration System](#configuration-system)
3. [Mutation Engine](#mutation-engine)
4. [Assertion System](#assertion-system)
5. [Performance & Rust](#performance--rust)
6. [Agent Adapters](#agent-adapters)
7. [Testing & Quality](#testing--quality)
8. [Extending flakestorm](#extending-flakestorm)
9. [Common Issues](#common-issues)

---

## Architecture Questions

### Q: Why is the codebase split into core, mutations, assertions, and reports?

**A:** This follows the **Single Responsibility Principle (SRP)** and makes the codebase maintainable:

| Module | Responsibility |
|--------|---------------|
| `core/` | Orchestration, configuration, agent communication |
| `mutations/` | Adversarial input generation |
| `assertions/` | Response validation |
| `reports/` | Output formatting |

This separation means:
- Changes to mutation logic don't affect assertions
- New report formats can be added without touching core logic
- Each module can be tested independently

---

### Q: Why use async/await throughout the codebase?

**A:** Agent testing is **I/O-bound**, not CPU-bound. The bottleneck is waiting for:
1. LLM responses (mutation generation)
2. Agent responses (test execution)

Async allows running many operations concurrently:

```python
# Without async: 100 tests × 500ms = 50 seconds
# With async (10 concurrent): 100 tests / 10 × 500ms = 5 seconds
```

The semaphore in `orchestrator.py` controls concurrency:

```python
# Created once (e.g. in __init__) and shared by all tasks
self.semaphore = asyncio.Semaphore(self.config.advanced.concurrency)

async def _run_single_mutation(self, mutation):
    async with self.semaphore:  # Limits concurrent executions
        return await self.agent.invoke(mutation.mutated)
```
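
As a rough, self-contained illustration of the effect (not flakestorm's actual orchestrator), the sketch below runs 20 simulated agent calls with at most 5 in flight. `fake_agent_invoke`, `run_all`, `timed_run`, and the 50ms delay are all made up for the demo.

```python
import asyncio
import time

async def fake_agent_invoke(prompt: str) -> str:
    """Stand-in for an agent call: pure waiting, like network I/O."""
    await asyncio.sleep(0.05)
    return f"response to {prompt}"

async def run_all(prompts: list[str], concurrency: int) -> list[str]:
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(prompt: str) -> str:
        async with semaphore:  # At most `concurrency` calls in flight
            return await fake_agent_invoke(prompt)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(p) for p in prompts))

def timed_run(num_prompts: int, concurrency: int) -> tuple[list[str], float]:
    prompts = [f"prompt {i}" for i in range(num_prompts)]
    start = time.perf_counter()
    results = asyncio.run(run_all(prompts, concurrency))
    return results, time.perf_counter() - start

results, elapsed = timed_run(num_prompts=20, concurrency=5)
print(f"{len(results)} calls in {elapsed:.2f}s")
```

With these numbers the run should take about four batches of 50ms, versus roughly one second if the calls ran one after another.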

---

### Q: Why is there both an `orchestrator.py` and a `runner.py`?

**A:** They serve different purposes:

- **`runner.py`**: High-level API for users - simple `FlakeStormRunner.run()` interface
- **`orchestrator.py`**: Internal coordination logic - handles the complex flow

This separation allows:
- `runner.py` to provide a clean facade
- `orchestrator.py` to be refactored without breaking the public API
- Different entry points (CLI, programmatic) to use the same core logic

---

## Configuration System

### Q: Why Pydantic instead of dataclasses or attrs?

**A:** Pydantic was chosen for several reasons:

1. **Automatic Validation**: Built-in validators with clear error messages
   ```python
   class MutationConfig(BaseModel):
       count: int = Field(ge=1, le=100)  # Validates range automatically
   ```

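2. **Environment Variable Support**: `${VAR}` placeholders in field values are expanded when the config loads (by the `expand_env_vars` helper in `config.py`, not by Pydantic itself)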
   ```python
   endpoint: str = Field(default="${AGENT_URL}")
   ```

3. **YAML/JSON Serialization**: Works out of the box
4. **IDE Support**: Type hints provide autocomplete

---

### Q: Why use environment variable expansion in config?

**A:** Security best practice - secrets should never be in config files:

```yaml
# BAD: Secret in file (gets committed to git)
headers:
  Authorization: "Bearer sk-1234567890"

# GOOD: Reference environment variable
headers:
  Authorization: "Bearer ${API_KEY}"
```

Implementation in `config.py`:

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} with environment variable value."""
    pattern = r'\$\{([^}]+)\}'
    def replacer(match: re.Match) -> str:
        var_name = match.group(1)
        # Leave the literal ${VAR} text in place if the variable is not set
        return os.environ.get(var_name, match.group(0))
    return re.sub(pattern, replacer, value)
```
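
A quick usage sketch of the same logic, showing that an unset variable is left as the literal `${...}` placeholder rather than raising. The variable names `DEMO_API_KEY` and `DEMO_UNSET_VAR` are invented for the demo.

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} with the environment variable's value, if set."""
    def replacer(match: re.Match) -> str:
        # Fall back to the original "${VAR}" text when VAR is not set
        return os.environ.get(match.group(1), match.group(0))
    return re.sub(r"\$\{([^}]+)\}", replacer, value)

os.environ["DEMO_API_KEY"] = "sk-demo"   # Hypothetical variable for the demo
os.environ.pop("DEMO_UNSET_VAR", None)   # Make sure this one is absent

expanded = expand_env_vars("Bearer ${DEMO_API_KEY}")
untouched = expand_env_vars("Bearer ${DEMO_UNSET_VAR}")
print(expanded)   # Bearer sk-demo
print(untouched)  # Bearer ${DEMO_UNSET_VAR}
```

Because a missing variable passes through silently, a stricter loader might prefer to raise an error instead. That is a design choice this FAQ does not specify.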

---

### Q: Why is MutationType defined as `str, Enum`?

**A:** String enums serialize directly to YAML/JSON:

```python
class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"
```

This allows:
```yaml
# In config file - uses string value directly
mutations:
  types:
    - paraphrase  # Works!
    - noise
```

With a plain `Enum`, members do not compare equal to their string values, and `json.dumps` rejects them unless you add custom serialization logic.
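
The difference can be checked with the standard library alone. In this sketch `PlainType` is a hypothetical plain enum added only for comparison.

```python
import json
from enum import Enum

class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"
    NOISE = "noise"

class PlainType(Enum):  # Hypothetical plain Enum, for comparison only
    PARAPHRASE = "paraphrase"

# A str-based member is a real string: it compares and serializes as one
str_equal = MutationType.PARAPHRASE == "paraphrase"   # True
parsed = MutationType("noise")                        # Config string -> member
serialized = json.dumps({"types": [MutationType.PARAPHRASE, parsed]})
print(serialized)  # {"types": ["paraphrase", "noise"]}

# A plain Enum member is neither equal to its value nor JSON-serializable
plain_equal = PlainType.PARAPHRASE == "paraphrase"    # False
try:
    json.dumps({"types": [PlainType.PARAPHRASE]})
    plain_serializable = True
except TypeError:
    plain_serializable = False
```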

---

## Mutation Engine

### Q: Why use a local LLM (Ollama) instead of cloud APIs?

**A:** Several important reasons:

| Factor | Local LLM | Cloud API |
|--------|-----------|-----------|
| **Cost** | Free | $0.01-0.10 per mutation |
| **Privacy** | Data stays local | Prompts sent to third party |
| **Rate Limits** | None | Often restrictive |
| **Latency** | Low | Network dependent |
| **Offline** | Works | Requires internet |

For a test run with 100 prompts × 20 mutations = 2,000 API calls, cloud costs at the per-mutation rates above would come to roughly $20-200 per run.

---

### Q: Why Qwen Coder 3 8B as the default model?

**A:** We evaluated several models:

| Model | Mutation Quality | Speed | Memory |
|-------|-----------------|-------|--------|
| Qwen Coder 3 8B | ⭐⭐⭐⭐ | ⭐⭐⭐ | 8GB |
| Llama 3 8B | ⭐⭐⭐ | ⭐⭐⭐ | 8GB |
| Mistral 7B | ⭐⭐⭐ | ⭐⭐⭐⭐ | 6GB |
| Phi-3 Mini | ⭐⭐ | ⭐⭐⭐⭐⭐ | 4GB |

Qwen Coder 3 was chosen because:
1. Excellent at understanding and modifying prompts
2. Good balance of quality vs. speed
3. Runs on consumer hardware (8GB VRAM)

---

### Q: How does the mutation template system work?

**A:** Templates are stored in `templates.py` and formatted with the original prompt:

```python
TEMPLATES = {
    MutationType.PARAPHRASE: """
    Rewrite this prompt with different words but same meaning.

    Original: {prompt}

    Rewritten:
    """,
    MutationType.NOISE: """
    Add 2-3 realistic typos to this prompt:

    Original: {prompt}

    With typos:
    """
}
```

The engine fills in `{prompt}` and sends the result to the LLM:

```python
template = TEMPLATES[mutation_type]
filled = template.format(prompt=original_prompt)
response = await self.client.generate(model=self.model, prompt=filled)
```
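
One property of `str.format` worth knowing here: only the template is scanned for `{placeholders}`, so a user prompt that itself contains braces is inserted verbatim. A minimal sketch, assuming a single-placeholder template like the ones above (`fill_template` is an illustrative name, not a flakestorm function):

```python
import textwrap

PARAPHRASE_TEMPLATE = textwrap.dedent("""\
    Rewrite this prompt with different words but same meaning.

    Original: {prompt}

    Rewritten:
    """)

def fill_template(template: str, original_prompt: str) -> str:
    # Braces inside original_prompt are not re-parsed as placeholders
    return template.format(prompt=original_prompt)

filled = fill_template(PARAPHRASE_TEMPLATE, 'Return JSON like {"ok": true}')
print(filled)
```

The reverse does not hold: literal braces written in the template itself would need to be doubled (`{{` and `}}`).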

---

### Q: What if the LLM returns malformed mutations?

**A:** We have several safeguards:

1. **Parsing Logic**: Extracts text between known markers
2. **Validation**: Checks mutation isn't identical to original
3. **Retry Logic**: Regenerates if parsing fails
4. **Fallback**: Uses simple string manipulation if LLM fails

```python
def _parse_mutation(self, response: str) -> str:
    # Return the first non-blank line that isn't a '#' comment/heading
    for line in response.strip().splitlines():
        candidate = line.strip()  # Strip first so whitespace-only lines are skipped
        if candidate and not candidate.startswith('#'):
            return candidate
    raise MutationParseError("Could not extract mutation")
```
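
How the four safeguards might fit together is sketched below. This is an illustration under stated assumptions, not the engine's real code: `mutate_with_retry`, `fallback_mutation`, the `generate` callable, and the retry count of 3 are all hypothetical.

```python
import asyncio

class MutationParseError(Exception):
    """Raised when no usable mutation can be extracted."""

def parse_mutation(response: str) -> str:
    for line in response.strip().splitlines():
        candidate = line.strip()
        if candidate and not candidate.startswith("#"):
            return candidate
    raise MutationParseError("Could not extract mutation")

def fallback_mutation(original: str) -> str:
    """Hypothetical non-LLM fallback: swap the first two characters."""
    if len(original) < 2 or original[0] == original[1]:
        return original + "?"
    return original[1] + original[0] + original[2:]

async def mutate_with_retry(generate, original: str, max_attempts: int = 3) -> str:
    """`generate` is any async callable returning raw LLM text (an assumption)."""
    for _ in range(max_attempts):
        try:
            mutated = parse_mutation(await generate(original))
        except MutationParseError:
            continue  # Unparseable output: ask again
        if mutated != original:  # Reject no-op mutations
            return mutated
    return fallback_mutation(original)

# Demo with a fake LLM that fails once, then answers
replies = iter(["# no answer", "book me a flihgt to Paris"])

async def fake_llm(prompt: str) -> str:
    return next(replies)

result = asyncio.run(mutate_with_retry(fake_llm, "book me a flight to Paris"))
print(result)  # book me a flihgt to Paris
```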

---

## Assertion System

### Q: Why separate deterministic and semantic assertions?

**A:** They have fundamentally different characteristics:

| Aspect | Deterministic | Semantic |
|--------|---------------|----------|
| **Speed** | Nanoseconds | Milliseconds |
| **Dependencies** | None | sentence-transformers |
| **Reproducibility** | 100% | May vary slightly |
| **Use Case** | Exact matching | Meaning matching |

Separating them allows:
- Running deterministic checks first (fast-fail)
- Making semantic checks optional (lighter installation)

---

### Q: How does the SimilarityChecker work internally?

**A:** It uses sentence embeddings and cosine similarity:

```python
class SimilarityChecker:
    def check(self, response: str, latency_ms: float) -> CheckResult:
        # 1. Embed both texts to vectors
        response_vec = self.embedder.embed(response)      # [0.1, 0.2, ...]
        expected_vec = self.embedder.embed(self.expected) # [0.15, 0.18, ...]

        # 2. Calculate cosine similarity
        similarity = cosine_similarity(response_vec, expected_vec)
        # Returns value between -1 and 1 (typically 0-1 for text)

        # 3. Compare to threshold
        return CheckResult(passed=similarity >= self.threshold)
```

The embedding model (`all-MiniLM-L6-v2`) converts text to 384-dimensional vectors that capture semantic meaning.
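
For reference, cosine similarity itself is just a normalized dot product. The pure-Python sketch below shows the formula on tiny vectors. It is not the project's implementation, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|); 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # ~1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])                 # -1.0
```

Only the direction of the vectors matters, not their length, which is why scaling `[1, 2, 3]` to `[2, 4, 6]` still scores about 1.0.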

---

### Q: Why is the embedder a class variable with lazy loading?

**A:** The embedding model is sizable (about 23M parameters, roughly 90MB on disk) and takes 1-2 seconds to load:

```python
class SimilarityChecker:
    _embedder: LocalEmbedder | None = None  # Class variable, shared

    @property
    def embedder(self) -> LocalEmbedder:
        if SimilarityChecker._embedder is None:
            SimilarityChecker._embedder = LocalEmbedder()  # Load once
        return SimilarityChecker._embedder
```

Benefits:
1. **Lazy Loading**: Only loads if semantic checks are used
2. **Shared Instance**: All SimilarityCheckers share one model
3. **Memory Efficient**: One copy in memory, not one per checker
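
These properties can be observed with a stand-in embedder that counts how often it is constructed. `FakeEmbedder` is hypothetical and replaces `LocalEmbedder` only so the sketch runs without the model.

```python
from typing import Optional

class FakeEmbedder:
    """Stand-in for LocalEmbedder; counts how many times it is constructed."""
    load_count = 0

    def __init__(self) -> None:
        FakeEmbedder.load_count += 1  # The real model load would happen here

class SimilarityChecker:
    _embedder: Optional[FakeEmbedder] = None  # Class variable, shared

    @property
    def embedder(self) -> FakeEmbedder:
        if SimilarityChecker._embedder is None:
            SimilarityChecker._embedder = FakeEmbedder()  # Load once
        return SimilarityChecker._embedder

first, second = SimilarityChecker(), SimilarityChecker()
loads_before_use = FakeEmbedder.load_count        # Nothing loaded yet
shared = first.embedder is second.embedder        # Same object for both checkers
loads_after_use = FakeEmbedder.load_count
print(loads_before_use, shared, loads_after_use)  # 0 True 1
```

Note that this pattern is not thread-safe as written: two threads hitting the property at the same moment could each construct an embedder.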

---

### Q: How does PII detection work?

**A:** Uses regex patterns for common PII formats:

```python
PII_PATTERNS = [
    (r'\b\d{3}-\d{2}-\d{4}\b', 'SSN'),           # 123-45-6789
    (r'\b\d{16}\b', 'Credit Card'),              # 1234567890123456
    (r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b', 'Email'),
    (r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', 'Phone'), # 123-456-7890
]

def check(self, response: str, latency_ms: float) -> CheckResult:
    for pattern, pii_type in self.PII_PATTERNS:
        if re.search(pattern, response, re.IGNORECASE):
            return CheckResult(
                passed=False,
                details=f"Found potential {pii_type}"
            )
    return CheckResult(passed=True)
```
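
A small runnable sketch using the same patterns shows both a hit and a typical false positive. `find_pii` is a hypothetical helper written for this demo. Unlike `check` above, it reports every matching label instead of stopping at the first.

```python
import re

PII_PATTERNS = [
    (r'\b\d{3}-\d{2}-\d{4}\b', 'SSN'),
    (r'\b\d{16}\b', 'Credit Card'),
    (r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b', 'Email'),
    (r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', 'Phone'),
]

def find_pii(text: str) -> list[str]:
    """Return the label of every pattern that matches somewhere in text."""
    return [label for pattern, label in PII_PATTERNS
            if re.search(pattern, text, re.IGNORECASE)]

print(find_pii("Contact jane.doe@example.com"))  # ['Email']
print(find_pii("SSN on file: 123-45-6789"))      # ['SSN']
print(find_pii("Your order has shipped"))        # []
# Caveat: any bare 10-digit number looks like a phone number to the regex
print(find_pii("Order 1234567890 shipped"))      # ['Phone']
```

Regex-based detection trades precision for speed, so results are best read as "potential" PII, as the `details` message above already hints.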

---

## Performance & Rust

### Q: Why Rust for performance-critical code?

**A:** Python is slow for CPU-bound operations. Benchmarks show:

```
Levenshtein Distance (5000 iterations):
  Python: 5864ms
  Rust:     67ms
  Speedup: 88x
```

Rust was chosen over alternatives because:
- **vs C/C++**: Memory safety, easier to write correct code
- **vs Cython**: Better tooling (cargo), cleaner code
- **vs NumPy**: Works on strings, not just numbers

---

### Q: How does the Rust/Python bridge work?

**A:** Uses PyO3 for bindings:

```rust
// Rust side (lib.rs)
#[pyfunction]
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Rust implementation
}

#[pymodule]
fn flakestorm_rust(m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(levenshtein_distance, m)?)?;
    Ok(())
}
```

```python
# Python side (performance.py)
try:
    import flakestorm_rust
    _RUST_AVAILABLE = True
except ImportError:
    _RUST_AVAILABLE = False

def levenshtein_distance(s1: str, s2: str) -> int:
    if _RUST_AVAILABLE:
        return flakestorm_rust.levenshtein_distance(s1, s2)
    # Pure Python fallback
    ...
```
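
The fallback body is elided above. For readers who want to see what such a fallback could look like, here is the standard two-row dynamic-programming algorithm. It is a textbook sketch, not necessarily what `performance.py` contains.

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # Keep the shorter string in the inner loop
    previous = list(range(len(s2) + 1))  # Distances from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # Distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # Deletion
                current[j - 1] + 1,            # Insertion
                previous[j - 1] + (c1 != c2),  # Substitution (free on a match)
            ))
        previous = current
    return previous[-1]

print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_distance("flaw", "lawn"))       # 2
```

The same input/output pairs can double as parity checks between the Rust and Python implementations.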

---

### Q: Why provide pure Python fallbacks?

**A:** Accessibility and reliability:

1. **Easy Installation**: `pip install flakestorm` works without Rust toolchain
2. **Platform Support**: Works on any Python platform
3. **Development**: Faster iteration without recompiling Rust
4. **Testing**: Can test both implementations for parity

The tradeoff is speed, but most time is spent waiting for LLM/agent responses anyway.

---

## Agent Adapters

### Q: Why use the Protocol pattern for agents?

**A:** Enables type-safe duck typing:

```python
class AgentProtocol(Protocol):
    async def invoke(self, prompt: str) -> AgentResponse: ...
```

Any class with a matching `invoke` method works, even if it doesn't inherit from a base class. This is more Pythonic than Java-style interfaces.
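
A minimal sketch of this structural typing follows. `AgentResponse` is simplified here, `EchoAgent` and `NotAnAgent` are invented, and `@runtime_checkable` is added only so the demo can use `isinstance`; whether flakestorm's protocol uses it is not stated in this FAQ.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

@dataclass
class AgentResponse:  # Simplified stand-in for the real response type
    text: str
    latency_ms: float = 0.0

@runtime_checkable
class AgentProtocol(Protocol):
    async def invoke(self, prompt: str) -> AgentResponse: ...

class EchoAgent:  # Note: no base class
    async def invoke(self, prompt: str) -> AgentResponse:
        return AgentResponse(text=prompt.upper())

class NotAnAgent:
    pass

async def run(agent: AgentProtocol, prompt: str) -> str:
    return (await agent.invoke(prompt)).text

print(isinstance(EchoAgent(), AgentProtocol))   # True
print(isinstance(NotAnAgent(), AgentProtocol))  # False
print(asyncio.run(run(EchoAgent(), "hi")))      # HI
```

The runtime `isinstance` check only verifies that an `invoke` attribute exists. Signature and return-type checking is left to a static type checker.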

---

### Q: How does the HTTP adapter handle different API formats?

**A:** Through configurable templates:

```yaml
agent:
  endpoint: "https://api.example.com/v1/chat"
  request_template: |
    {"messages": [{"role": "user", "content": "{prompt}"}]}
  response_path: "$.choices[0].message.content"
```

The adapter:
1. Replaces `{prompt}` in the template
2. Sends the formatted JSON
3. Uses JSONPath to extract the response

This supports OpenAI, Anthropic, custom APIs, etc.
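
The three steps can be sketched with the standard library. This is an illustration only: `render_request` and `extract_path` are hypothetical names, and `extract_path` handles just dotted keys and integer indexes rather than full JSONPath.

```python
import json
import re

def render_request(template: str, prompt: str) -> dict:
    """Substitute {prompt} into a JSON template, escaping it for JSON first."""
    escaped = json.dumps(prompt)[1:-1]  # Escape quotes/newlines, drop outer quotes
    # str.replace, not str.format: the JSON's own braces would confuse format()
    return json.loads(template.replace("{prompt}", escaped))

def extract_path(data, path: str):
    """Resolve a simple path such as "$.choices[0].message.content"."""
    current = data
    for key, index in re.findall(r"\.([A-Za-z_]\w*)|\[(\d+)\]", path):
        current = current[key] if key else current[int(index)]
    return current

template = '{"messages": [{"role": "user", "content": "{prompt}"}]}'
body = render_request(template, 'Say "hi"\nthen stop')
print(body["messages"][0]["content"])

fake_response = {"choices": [{"message": {"content": "hi"}}]}
answer = extract_path(fake_response, "$.choices[0].message.content")
print(answer)  # hi
```

The escaping step matters: a prompt containing quotes or newlines would otherwise produce invalid JSON once substituted into the template.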

---

### Q: Why is there a Python adapter?

**A:** Bypasses HTTP overhead for local testing:

```python
# Instead of: HTTP request → your server → your code → HTTP response
# Just: your_function(prompt) → response
import asyncio
import importlib
import time

class PythonAgentAdapter:
    async def invoke(self, prompt: str) -> AgentResponse:
        # Import the module dynamically from a "module.path:function" endpoint
        module_path, func_name = self.endpoint.rsplit(":", 1)
        module = importlib.import_module(module_path)
        func = getattr(module, func_name)

        # Call directly, awaiting only if the target is an async function
        start = time.perf_counter()
        if asyncio.iscoroutinefunction(func):
            response = await func(prompt)
        else:
            response = func(prompt)
        latency = (time.perf_counter() - start) * 1000

        return AgentResponse(text=response, latency_ms=latency)
```
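
To see the `module:function` resolution in isolation, the sketch below points it at a standard-library function instead of a real agent. `resolve_endpoint` and `call_agent` are illustrative names, and checking the result with `inspect.isawaitable` is one possible variation, not the adapter's confirmed behavior.

```python
import asyncio
import importlib
import inspect
import time

def resolve_endpoint(endpoint: str):
    """Turn "package.module:function" into the callable it names."""
    module_path, func_name = endpoint.rsplit(":", 1)
    return getattr(importlib.import_module(module_path), func_name)

async def call_agent(endpoint: str, prompt: str) -> tuple[str, float]:
    func = resolve_endpoint(endpoint)
    start = time.perf_counter()
    result = func(prompt)
    if inspect.isawaitable(result):  # Also covers async def agents
        result = await result
    latency_ms = (time.perf_counter() - start) * 1000
    return result, latency_ms

# Demo: html.escape stands in for "your agent function"
text, latency_ms = asyncio.run(call_agent("html:escape", "a < b"))
print(text)  # a &lt; b
```

A synchronous agent function runs on the event loop thread here, so a slow one would block other concurrent tests while it executes.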

---

### Q: When do I need to create an HTTP endpoint vs use Python adapter?

**A:** It depends on your agent's language and setup:

| Your Agent Code | Adapter Type | Endpoint Needed? | Notes |
|----------------|--------------|------------------|-------|
| Python (internal) | Python adapter | ❌ No | Use `type: "python"`, call function directly |
| TypeScript/JavaScript | HTTP adapter | ✅ Yes | Must create HTTP endpoint (can be localhost) |
| Java/Go/Rust | HTTP adapter | ✅ Yes | Must create HTTP endpoint (can be localhost) |
| Already has HTTP API | HTTP adapter | ✅ Yes | Use existing endpoint |

**For non-Python code (TypeScript example):**

Since FlakeStorm is a Python CLI tool, it can only directly call Python functions. For TypeScript/JavaScript/other languages, you **must** create an HTTP endpoint:

```typescript
// test-endpoint.ts - Wrapper endpoint for FlakeStorm
import express from 'express';
import { generateRedditSearchQuery } from './your-internal-code';

const app = express();
app.use(express.json());

app.post('/flakestorm-test', async (req, res) => {
  // FlakeStorm sends: {"input": "Industry: X\nProduct: Y..."}
  const structuredText = req.body.input;

  // Parse structured input (parseStructuredInput is a helper you write yourself)
  const params = parseStructuredInput(structuredText);

  // Call your internal function
  const query = await generateRedditSearchQuery(params);

  // Return in FlakeStorm's expected format
  res.json({ output: query });
});

app.listen(8000, () => {
  console.log('FlakeStorm test endpoint: http://localhost:8000/flakestorm-test');
});
```

Then in `flakestorm.yaml`:
```yaml
agent:
  endpoint: "http://localhost:8000/flakestorm-test"
  type: "http"
  request_template: |
    {"input": "{prompt}"}
  response_path: "$.output"
```

---

### Q: Do I need a public endpoint or can I use localhost?

**A:** It depends on where FlakeStorm runs:

| FlakeStorm Location | Agent Location | Endpoint Type | Works? |
|---------------------|----------------|---------------|--------|
| Same machine | Same machine | `localhost:8000` | ✅ Yes |
| Different machine | Your machine | `localhost:8000` | ❌ No - use public endpoint or ngrok |
| CI/CD server | Your machine | `localhost:8000` | ❌ No - use public endpoint |
| CI/CD server | Cloud (AWS/GCP) | `https://api.example.com` | ✅ Yes |

**Options for exposing local endpoint:**
1. **ngrok**: `ngrok http 8000` → get public URL
2. **localtunnel**: `lt --port 8000` → get public URL
3. **Deploy to cloud**: Deploy your test endpoint to a cloud service
4. **VPN/SSH tunnel**: If the FlakeStorm machine can reach yours over a VPN or an SSH connection

---

### Q: Can I test internal code without creating an endpoint?

**A:** Only if your code is in Python:

```python
# my_agent.py
async def flakestorm_agent(input: str) -> str:
    # Parse input, then call your internal functions here
    result = f"echo: {input}"  # Placeholder: replace with your agent's real output
    return result
```

```yaml
# flakestorm.yaml
agent:
  endpoint: "my_agent:flakestorm_agent"
  type: "python"  # ← No HTTP endpoint needed!
```

For non-Python code, you **must** create an HTTP endpoint wrapper.

See [Connection Guide](CONNECTION_GUIDE.md) for detailed examples and troubleshooting.

---

## Testing & Quality

### Q: Why are tests split by module?

**A:** Mirrors the source structure for maintainability:

```
tests/
├── test_config.py       # Tests for core/config.py
├── test_mutations.py    # Tests for mutations/
├── test_assertions.py   # Tests for assertions/
└── test_performance.py  # Tests for performance module
```

When fixing a bug in `config.py`, you immediately know to check `test_config.py`.

---

### Q: Why use pytest over unittest?

**A:** Pytest is more Pythonic and powerful:

```python
# unittest style (verbose)
class TestConfig(unittest.TestCase):
    def test_load_config(self):
        self.assertEqual(config.agent.type, AgentType.HTTP)

# pytest style (concise)
def test_load_config():
    assert config.agent.type == AgentType.HTTP
```

Pytest also offers:
- Fixtures for setup/teardown
- Parametrized tests
- Better assertion introspection

---

### Q: How should I add tests for a new feature?

**A:** Follow this pattern:

1. **Create test file** if needed: `tests/test_<module>.py`
2. **Write failing test first** (TDD)
3. **Group related tests** in a class
4. **Use fixtures** for common setup

```python
# tests/test_new_feature.py
import pytest
from flakestorm.new_module import NewFeature

class TestNewFeature:
    @pytest.fixture
    def feature(self):
        return NewFeature(config={...})

    def test_basic_functionality(self, feature):
        result = feature.do_something()
        assert result == expected

    def test_edge_case(self, feature):
        with pytest.raises(ValueError):
            feature.do_something(invalid_input)
```

---

## Extending flakestorm

### Q: How do I add a new mutation type?

**A:** Three steps:

1. **Add to enum** (`mutations/types.py`):
   ```python
   class MutationType(str, Enum):
       # ... existing types
       MY_NEW_TYPE = "my_new_type"
   ```

2. **Add template** (`mutations/templates.py`):
   ```python
   TEMPLATES[MutationType.MY_NEW_TYPE] = """
   Your prompt template here.

   Original: {prompt}

   Modified:
   """
   ```

3. **Add default weight** (`core/config.py`):
   ```python
   class MutationConfig(BaseModel):
       weights: dict = {
           # ... existing weights
           MutationType.MY_NEW_TYPE: 1.0,
       }
   ```

---

### Q: How do I add a new assertion type?

**A:** Four steps:

1. **Create checker class** (`assertions/deterministic.py` or `semantic.py`):
   ```python
   class MyNewChecker(BaseChecker):
       def check(self, response: str, latency_ms: float) -> CheckResult:
           # Your logic here
           passed = some_condition(response)
           return CheckResult(
               passed=passed,
               check_type=InvariantType.MY_NEW_TYPE,
               details="Explanation"
           )
   ```

2. **Add to enum** (`core/config.py`):
   ```python
   class InvariantType(str, Enum):
       # ... existing types
       MY_NEW_TYPE = "my_new_type"
   ```

3. **Register in verifier** (`assertions/verifier.py`):
   ```python
   CHECKER_REGISTRY = {
       # ... existing checkers
       InvariantType.MY_NEW_TYPE: MyNewChecker,
   }
   ```

4. **Add tests** (`tests/test_assertions.py`)
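
To show how the registry ties the pieces together, here is a self-contained sketch with stand-in types. The checker classes, the `{"type": ..., "value": ...}` config shape, and `build_checker` are all assumptions made for illustration. The real verifier may construct checkers differently.

```python
from dataclasses import dataclass
from enum import Enum

class InvariantType(str, Enum):
    CONTAINS = "contains"      # Hypothetical example types
    MAX_LENGTH = "max_length"

@dataclass
class CheckResult:
    passed: bool
    check_type: InvariantType
    details: str = ""

class ContainsChecker:
    def __init__(self, value: str) -> None:
        self.value = value

    def check(self, response: str, latency_ms: float) -> CheckResult:
        passed = self.value in response
        return CheckResult(passed, InvariantType.CONTAINS, f"looked for {self.value!r}")

class MaxLengthChecker:
    def __init__(self, value: int) -> None:
        self.value = value

    def check(self, response: str, latency_ms: float) -> CheckResult:
        passed = len(response) <= self.value
        return CheckResult(passed, InvariantType.MAX_LENGTH, f"length {len(response)}")

CHECKER_REGISTRY = {
    InvariantType.CONTAINS: ContainsChecker,
    InvariantType.MAX_LENGTH: MaxLengthChecker,
}

def build_checker(spec: dict):
    """Look up the checker class from the config's type string."""
    checker_cls = CHECKER_REGISTRY[InvariantType(spec["type"])]
    return checker_cls(spec["value"])

invariants = [{"type": "contains", "value": "refund"},
              {"type": "max_length", "value": 10}]
results = [build_checker(s).check("refund issued today", 12.0) for s in invariants]
print([r.passed for r in results])  # [True, False]
```

Because the registry is keyed by the enum, an unknown type string in the config fails early with a `ValueError` from `InvariantType(...)` rather than deep inside a test run.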

---

### Q: How do I add a new report format?

**A:** Create a new generator:

```python
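# reports/markdown.py
from datetime import datetime
from pathlib import Path

class MarkdownReportGenerator:
    def __init__(self, results: TestResults):
        self.results = results

    def generate(self) -> str:
        """Generate markdown content."""
        md = "# flakestorm Report\n\n"
        md += f"**Score:** {self.results.statistics.robustness_score:.2f}\n"
        # ... more content
        return md

    def save(self, path: Path | None = None) -> Path:
        if path is None:
            # Timestamped default name (format chosen here for illustration)
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            path = Path(f"reports/report_{timestamp}.md")
        path.parent.mkdir(parents=True, exist_ok=True)  # Ensure reports/ exists
        path.write_text(self.generate())
        return path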
```

Then add CLI option in `cli/main.py`.

---

## Common Issues

### Q: Why am I getting "Cannot connect to Ollama"?

**A:** The Ollama service most likely isn't running. To fix it:

```bash
# Start Ollama
ollama serve

# Verify it's running
curl http://localhost:11434/api/version
```

---

### Q: Why is mutation generation slow?

**A:** LLM inference is inherently slow. Options:
1. Use a faster model: `ollama pull phi3:mini`
2. Reduce mutation count: `mutations.count: 10`
3. Use GPU: Ensure Ollama uses GPU acceleration

---

### Q: Why do tests pass locally but fail in CI?

**A:** Common causes:
1. **Missing Ollama**: CI needs a running Ollama service
2. **Different model**: Ensure the same model is pulled
3. **Timing**: CI may be slower; increase timeouts
4. **Environment variables**: Ensure secrets are set in CI

---

### Q: How do I debug a failing assertion?

**A:** Enable verbose mode and check the report:

```bash
flakestorm run --verbose --output html
```

The HTML report shows:
- Original prompt
- Mutated prompt
- Agent response
- Which assertion failed and why

---

*Have more questions? Open an issue on GitHub!*