mirror of https://github.com/flakestorm/flakestorm.git
synced 2026-04-26 09:16:25 +02:00

Fix .gitignore to allow docs files and add documentation files

- Fix .gitignore pattern: un-ignore docs/ directory first, then ignore docs/*, then un-ignore specific files
- Add all documentation files referenced in README.md:
  - USAGE_GUIDE.md
  - CONFIGURATION_GUIDE.md
  - TEST_SCENARIOS.md
  - MODULES.md
  - DEVELOPER_FAQ.md
  - PUBLISHING.md
  - CONTRIBUTING.md
  - API_SPECIFICATION.md
  - TESTING_GUIDE.md
  - IMPLEMENTATION_CHECKLIST.md
- Pre-commit hooks fixed trailing whitespace and end-of-file formatting

parent 4dd882a2d2
commit 69e0f8deeb

11 changed files with 5936 additions and 2 deletions

679 docs/DEVELOPER_FAQ.md (new file)
@@ -0,0 +1,679 @@
# flakestorm Developer FAQ

This document answers common questions developers might have about the flakestorm codebase. It's designed to help project maintainers explain design decisions and help contributors understand the codebase.

---

## Table of Contents

1. [Architecture Questions](#architecture-questions)
2. [Configuration System](#configuration-system)
3. [Mutation Engine](#mutation-engine)
4. [Assertion System](#assertion-system)
5. [Performance & Rust](#performance--rust)
6. [Agent Adapters](#agent-adapters)
7. [Testing & Quality](#testing--quality)
8. [Extending flakestorm](#extending-flakestorm)
9. [Common Issues](#common-issues)

---

## Architecture Questions

### Q: Why is the codebase split into core, mutations, assertions, and reports?

**A:** This follows the **Single Responsibility Principle (SRP)** and makes the codebase maintainable:

| Module | Responsibility |
|--------|---------------|
| `core/` | Orchestration, configuration, agent communication |
| `mutations/` | Adversarial input generation |
| `assertions/` | Response validation |
| `reports/` | Output formatting |

This separation means:
- Changes to mutation logic don't affect assertions
- New report formats can be added without touching core logic
- Each module can be tested independently

---

### Q: Why use async/await throughout the codebase?

**A:** Agent testing is **I/O-bound**, not CPU-bound. The bottleneck is waiting for:
1. LLM responses (mutation generation)
2. Agent responses (test execution)

Async allows running many operations concurrently:

```python
# Without async: 100 tests × 500ms = 50 seconds
# With async (10 concurrent): 100 tests / 10 × 500ms = 5 seconds
```

The semaphore in `orchestrator.py` controls concurrency:

```python
semaphore = asyncio.Semaphore(self.config.advanced.concurrency)

async def _run_single_mutation(self, mutation):
    async with semaphore:  # Limits concurrent executions
        return await self.agent.invoke(mutation.mutated)
```
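
The semaphore pattern above can be combined into a runnable sketch (names here are illustrative, not the project's actual code): the semaphore caps how many fake agent calls run at once, while `asyncio.gather` drives all of them concurrently.

```python
import asyncio

async def invoke_agent(prompt: str, semaphore: asyncio.Semaphore) -> str:
    # The semaphore admits at most N coroutines into this block at a time
    async with semaphore:
        await asyncio.sleep(0)  # stand-in for real agent I/O
        return f"response to {prompt}"

async def run_all(prompts: list[str], concurrency: int) -> list[str]:
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [invoke_agent(p, semaphore) for p in prompts]
    # gather preserves input order, regardless of completion order
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all([f"test-{i}" for i in range(5)], concurrency=2))
print(results[0])  # response to test-0
```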

---

### Q: Why is there both an `orchestrator.py` and a `runner.py`?

**A:** They serve different purposes:

- **`runner.py`**: High-level API for users - simple `EntropixRunner.run()` interface
- **`orchestrator.py`**: Internal coordination logic - handles the complex flow

This separation allows:
- `runner.py` to provide a clean facade
- `orchestrator.py` to be refactored without breaking the public API
- Different entry points (CLI, programmatic) to use the same core logic

---

## Configuration System

### Q: Why Pydantic instead of dataclasses or attrs?

**A:** Pydantic was chosen for several reasons:

1. **Automatic Validation**: Built-in validators with clear error messages

   ```python
   class MutationConfig(BaseModel):
       count: int = Field(ge=1, le=100)  # Validates range automatically
   ```

2. **Environment Variable Support**: Native expansion

   ```python
   endpoint: str = Field(default="${AGENT_URL}")
   ```

3. **YAML/JSON Serialization**: Works out of the box
4. **IDE Support**: Type hints provide autocomplete

---

### Q: Why use environment variable expansion in config?

**A:** Security best practice - secrets should never be in config files:

```yaml
# BAD: Secret in file (gets committed to git)
headers:
  Authorization: "Bearer sk-1234567890"

# GOOD: Reference environment variable
headers:
  Authorization: "Bearer ${API_KEY}"
```

Implementation in `config.py`:

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} with environment variable value."""
    pattern = r'\$\{([^}]+)\}'

    def replacer(match):
        var_name = match.group(1)
        return os.environ.get(var_name, match.group(0))

    return re.sub(pattern, replacer, value)
```
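
A quick standalone demonstration of the expansion behavior (the function is reproduced in condensed form so the example runs on its own): variables that are not set are left intact rather than silently replaced with an empty string, so misconfiguration surfaces downstream.

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} with the environment variable's value; leave unset vars as-is."""
    return re.sub(r'\$\{([^}]+)\}', lambda m: os.environ.get(m.group(1), m.group(0)), value)

os.environ["API_KEY"] = "sk-test"
print(expand_env_vars("Bearer ${API_KEY}"))               # Bearer sk-test
print(expand_env_vars("Bearer ${FLAKESTORM_UNSET_VAR}"))  # unset: left intact
```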

---

### Q: Why is MutationType defined as `str, Enum`?

**A:** String enums serialize directly to YAML/JSON:

```python
class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"
```

This allows:

```yaml
# In config file - uses string value directly
mutations:
  types:
    - paraphrase  # Works!
    - noise
```

If we used a regular Enum, we'd need custom serialization logic.
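
A self-contained sketch of the round trip (the enum members here are illustrative): because each member *is* a string, `json.dumps` needs no custom encoder, and the raw string from a config file converts straight back to the enum.

```python
import json
from enum import Enum

class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"
    NOISE = "noise"

# Serializes as a plain string, no custom encoder needed
print(json.dumps({"types": [MutationType.PARAPHRASE]}))  # {"types": ["paraphrase"]}

# The config-file string converts straight back to the enum member
print(MutationType("noise") is MutationType.NOISE)  # True
```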

---

## Mutation Engine

### Q: Why use a local LLM (Ollama) instead of cloud APIs?

**A:** Several important reasons:

| Factor | Local LLM | Cloud API |
|--------|-----------|-----------|
| **Cost** | Free | $0.01-0.10 per mutation |
| **Privacy** | Data stays local | Prompts sent to third party |
| **Rate Limits** | None | Often restrictive |
| **Latency** | Low | Network dependent |
| **Offline** | Works | Requires internet |

For a test run with 100 prompts × 20 mutations = 2000 API calls, cloud costs would add up quickly.

---

### Q: Why Qwen Coder 3 8B as the default model?

**A:** We evaluated several models:

| Model | Mutation Quality | Speed | Memory |
|-------|-----------------|-------|--------|
| Qwen Coder 3 8B | ⭐⭐⭐⭐ | ⭐⭐⭐ | 8GB |
| Llama 3 8B | ⭐⭐⭐ | ⭐⭐⭐ | 8GB |
| Mistral 7B | ⭐⭐⭐ | ⭐⭐⭐⭐ | 6GB |
| Phi-3 Mini | ⭐⭐ | ⭐⭐⭐⭐⭐ | 4GB |

Qwen Coder 3 was chosen because:
1. Excellent at understanding and modifying prompts
2. Good balance of quality vs. speed
3. Runs on consumer hardware (8GB VRAM)

---

### Q: How does the mutation template system work?

**A:** Templates are stored in `templates.py` and formatted with the original prompt:

```python
TEMPLATES = {
    MutationType.PARAPHRASE: """
Rewrite this prompt with different words but same meaning.

Original: {prompt}

Rewritten:
""",
    MutationType.NOISE: """
Add 2-3 realistic typos to this prompt:

Original: {prompt}

With typos:
"""
}
```

The engine fills in `{prompt}` and sends to the LLM:

```python
template = TEMPLATES[mutation_type]
filled = template.format(prompt=original_prompt)
response = await self.client.generate(model=self.model, prompt=filled)
```

---

### Q: What if the LLM returns malformed mutations?

**A:** We have several safeguards:

1. **Parsing Logic**: Extracts text between known markers
2. **Validation**: Checks mutation isn't identical to original
3. **Retry Logic**: Regenerates if parsing fails
4. **Fallback**: Uses simple string manipulation if LLM fails

```python
def _parse_mutation(self, response: str) -> str:
    # Try to extract the mutated text
    lines = response.strip().split('\n')
    for line in lines:
        if line and not line.startswith('#'):
            return line.strip()
    raise MutationParseError("Could not extract mutation")
```
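
The retry and fallback safeguards are not shown above; here is one hedged sketch of how they could fit together (`generate_with_fallback` and the character-swap fallback are illustrative, not the project's actual implementation):

```python
import random

class MutationParseError(Exception):
    pass

def parse_mutation(response: str) -> str:
    # Same extraction logic as _parse_mutation above
    for line in response.strip().split("\n"):
        line = line.strip()
        if line and not line.startswith("#"):
            return line
    raise MutationParseError("Could not extract mutation")

def generate_with_fallback(original: str, llm_generate, max_retries: int = 3) -> str:
    """Retry the LLM a few times; fall back to simple string noise if it keeps failing."""
    for _ in range(max_retries):
        try:
            mutation = parse_mutation(llm_generate(original))
            if mutation != original:  # reject no-op mutations
                return mutation
        except MutationParseError:
            continue
    # Fallback: swap two adjacent characters so the run never stalls
    chars = list(original)
    i = random.randrange(len(chars) - 1)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```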

---

## Assertion System

### Q: Why separate deterministic and semantic assertions?

**A:** They have fundamentally different characteristics:

| Aspect | Deterministic | Semantic |
|--------|---------------|----------|
| **Speed** | Nanoseconds | Milliseconds |
| **Dependencies** | None | sentence-transformers |
| **Reproducibility** | 100% | May vary slightly |
| **Use Case** | Exact matching | Meaning matching |

Separating them allows:
- Running deterministic checks first (fast-fail)
- Making semantic checks optional (lighter installation)

---

### Q: How does the SimilarityChecker work internally?

**A:** It uses sentence embeddings and cosine similarity:

```python
class SimilarityChecker:
    def check(self, response: str, latency_ms: float) -> CheckResult:
        # 1. Embed both texts to vectors
        response_vec = self.embedder.embed(response)       # [0.1, 0.2, ...]
        expected_vec = self.embedder.embed(self.expected)  # [0.15, 0.18, ...]

        # 2. Calculate cosine similarity
        similarity = cosine_similarity(response_vec, expected_vec)
        # Returns value between -1 and 1 (typically 0-1 for text)

        # 3. Compare to threshold
        return CheckResult(passed=similarity >= self.threshold)
```

The embedding model (`all-MiniLM-L6-v2`) converts text to 384-dimensional vectors that capture semantic meaning.
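
`cosine_similarity` itself is not defined in the snippet; a minimal pure-Python version of the formula dot(a, b) / (|a|·|b|) looks like this:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```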

---

### Q: Why is the embedder a class variable with lazy loading?

**A:** The embedding model is large (23MB) and takes 1-2 seconds to load:

```python
class SimilarityChecker:
    _embedder: LocalEmbedder | None = None  # Class variable, shared

    @property
    def embedder(self) -> LocalEmbedder:
        if SimilarityChecker._embedder is None:
            SimilarityChecker._embedder = LocalEmbedder()  # Load once
        return SimilarityChecker._embedder
```

Benefits:
1. **Lazy Loading**: Only loads if semantic checks are used
2. **Shared Instance**: All SimilarityCheckers share one model
3. **Memory Efficient**: One copy in memory, not one per checker

---

### Q: How does PII detection work?

**A:** Uses regex patterns for common PII formats:

```python
PII_PATTERNS = [
    (r'\b\d{3}-\d{2}-\d{4}\b', 'SSN'),                        # 123-45-6789
    (r'\b\d{16}\b', 'Credit Card'),                           # 1234567890123456
    (r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b', 'Email'),
    (r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', 'Phone'),              # 123-456-7890
]

def check(self, response: str, latency_ms: float) -> CheckResult:
    for pattern, pii_type in self.PII_PATTERNS:
        if re.search(pattern, response, re.IGNORECASE):
            return CheckResult(
                passed=False,
                details=f"Found potential {pii_type}"
            )
    return CheckResult(passed=True)
```

---

## Performance & Rust

### Q: Why Rust for performance-critical code?

**A:** Python is slow for CPU-bound operations. Benchmarks show:

```
Levenshtein Distance (5000 iterations):
  Python: 5864ms
  Rust:     67ms
  Speedup:  88x
```

Rust was chosen over alternatives because:
- **vs C/C++**: Memory safety, easier to write correct code
- **vs Cython**: Better tooling (cargo), cleaner code
- **vs NumPy**: Works on strings, not just numbers

---

### Q: How does the Rust/Python bridge work?

**A:** Uses PyO3 for bindings:

```rust
// Rust side (lib.rs)
#[pyfunction]
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Rust implementation
}

#[pymodule]
fn flakestorm_rust(m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(levenshtein_distance, m)?)?;
    Ok(())
}
```

```python
# Python side (performance.py)
try:
    import flakestorm_rust
    _RUST_AVAILABLE = True
except ImportError:
    _RUST_AVAILABLE = False

def levenshtein_distance(s1: str, s2: str) -> int:
    if _RUST_AVAILABLE:
        return flakestorm_rust.levenshtein_distance(s1, s2)
    # Pure Python fallback
    ...
```
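
The pure Python fallback is elided above; as an illustrative sketch (not necessarily the project's exact fallback), the classic two-row dynamic-programming version looks like this:

```python
def levenshtein_py(s1: str, s2: str) -> int:
    """Pure-Python Levenshtein distance (two-row dynamic programming)."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string as the inner row
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            insert_cost = current[j] + 1
            delete_cost = previous[j + 1] + 1
            substitute_cost = previous[j] + (c1 != c2)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein_py("kitten", "sitting"))  # 3
```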

---

### Q: Why provide pure Python fallbacks?

**A:** Accessibility and reliability:

1. **Easy Installation**: `pip install flakestorm` works without Rust toolchain
2. **Platform Support**: Works on any Python platform
3. **Development**: Faster iteration without recompiling Rust
4. **Testing**: Can test both implementations for parity

The tradeoff is speed, but most time is spent waiting for LLM/agent responses anyway.

---

## Agent Adapters

### Q: Why use the Protocol pattern for agents?

**A:** Enables type-safe duck typing:

```python
class AgentProtocol(Protocol):
    async def invoke(self, prompt: str) -> AgentResponse: ...
```

Any class with a matching `invoke` method works, even if it doesn't inherit from a base class. This is more Pythonic than Java-style interfaces.
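
A minimal self-contained sketch (the `AgentResponse` fields shown are assumptions): `EchoAgent` never inherits from anything, yet satisfies the protocol purely by shape, so a type checker accepts it wherever `AgentProtocol` is expected.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol

@dataclass
class AgentResponse:
    text: str
    latency_ms: float = 0.0

class AgentProtocol(Protocol):
    async def invoke(self, prompt: str) -> AgentResponse: ...

# No inheritance: EchoAgent conforms to the protocol by shape alone
class EchoAgent:
    async def invoke(self, prompt: str) -> AgentResponse:
        return AgentResponse(text=f"echo: {prompt}")

def describe(agent: AgentProtocol) -> str:
    # Type-checks for any conforming class
    return type(agent).__name__

response = asyncio.run(EchoAgent().invoke("hello"))
print(response.text)          # echo: hello
print(describe(EchoAgent()))  # EchoAgent
```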

---

### Q: How does the HTTP adapter handle different API formats?

**A:** Through configurable templates:

```yaml
agent:
  endpoint: "https://api.example.com/v1/chat"
  request_template: |
    {"messages": [{"role": "user", "content": "{prompt}"}]}
  response_path: "$.choices[0].message.content"
```

The adapter:
1. Replaces `{prompt}` in the template
2. Sends the formatted JSON
3. Uses JSONPath to extract the response

This supports OpenAI, Anthropic, custom APIs, etc.
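
A naive standalone sketch of steps 1 and 3 (the manual dictionary walk stands in for a real JSONPath library, and the string substitution does no JSON escaping, so it is illustration only):

```python
import json

request_template = '{"messages": [{"role": "user", "content": "{prompt}"}]}'

def build_request(prompt: str) -> dict:
    # Step 1: fill the template (a real adapter must JSON-escape the prompt)
    return json.loads(request_template.replace("{prompt}", prompt))

def extract(payload: dict) -> str:
    # Step 3: the equivalent of "$.choices[0].message.content" done by hand
    return payload["choices"][0]["message"]["content"]

print(build_request("hi")["messages"][0]["content"])            # hi
print(extract({"choices": [{"message": {"content": "ok"}}]}))   # ok
```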

---

### Q: Why is there a Python adapter?

**A:** Bypasses HTTP overhead for local testing:

```python
# Instead of: HTTP request → your server → your code → HTTP response
# Just: your_function(prompt) → response

class PythonAgentAdapter:
    async def invoke(self, prompt: str) -> AgentResponse:
        # Import the module dynamically
        module_path, func_name = self.endpoint.rsplit(":", 1)
        module = importlib.import_module(module_path)
        func = getattr(module, func_name)

        # Call directly
        start = time.perf_counter()
        response = await func(prompt) if asyncio.iscoroutinefunction(func) else func(prompt)
        latency = (time.perf_counter() - start) * 1000

        return AgentResponse(text=response, latency_ms=latency)
```

---

## Testing & Quality

### Q: Why are tests split by module?

**A:** Mirrors the source structure for maintainability:

```
tests/
├── test_config.py       # Tests for core/config.py
├── test_mutations.py    # Tests for mutations/
├── test_assertions.py   # Tests for assertions/
├── test_performance.py  # Tests for performance module
```

When fixing a bug in `config.py`, you immediately know to check `test_config.py`.

---

### Q: Why use pytest over unittest?

**A:** Pytest is more Pythonic and powerful:

```python
# unittest style (verbose)
class TestConfig(unittest.TestCase):
    def test_load_config(self):
        self.assertEqual(config.agent.type, AgentType.HTTP)

# pytest style (concise)
def test_load_config():
    assert config.agent.type == AgentType.HTTP
```

Pytest also offers:
- Fixtures for setup/teardown
- Parametrized tests
- Better assertion introspection
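
As a small illustration of parametrized tests (the test case itself is hypothetical): one function covers several input/expectation pairs, and each pair is reported as a separate test.

```python
import pytest

@pytest.mark.parametrize(
    "raw,expected",
    [
        ("paraphrase", True),
        ("noise", True),
        ("unknown_type", False),
    ],
)
def test_is_known_mutation(raw, expected):
    # Hypothetical check: is the string a recognized mutation type?
    known = {"paraphrase", "noise"}
    assert (raw in known) == expected
```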

---

### Q: How should I add tests for a new feature?

**A:** Follow this pattern:

1. **Create test file** if needed: `tests/test_<module>.py`
2. **Write failing test first** (TDD)
3. **Group related tests** in a class
4. **Use fixtures** for common setup

```python
# tests/test_new_feature.py
import pytest
from flakestorm.new_module import NewFeature

class TestNewFeature:
    @pytest.fixture
    def feature(self):
        return NewFeature(config={...})

    def test_basic_functionality(self, feature):
        result = feature.do_something()
        assert result == expected

    def test_edge_case(self, feature):
        with pytest.raises(ValueError):
            feature.do_something(invalid_input)
```

---

## Extending flakestorm

### Q: How do I add a new mutation type?

**A:** Three steps:

1. **Add to enum** (`mutations/types.py`):

```python
class MutationType(str, Enum):
    # ... existing types
    MY_NEW_TYPE = "my_new_type"
```

2. **Add template** (`mutations/templates.py`):

```python
TEMPLATES[MutationType.MY_NEW_TYPE] = """
Your prompt template here.

Original: {prompt}

Modified:
"""
```

3. **Add default weight** (`core/config.py`):

```python
class MutationConfig(BaseModel):
    weights: dict = {
        # ... existing weights
        MutationType.MY_NEW_TYPE: 1.0,
    }
```
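
How the weights drive selection is not shown here; one plausible sketch uses `random.choices`, which draws types proportionally to their weights (the string keys and the sampling helper are assumptions, not the project's actual code):

```python
import random

# Hypothetical weight table: my_new_type should be drawn twice as often
weights = {"paraphrase": 1.0, "noise": 1.0, "my_new_type": 2.0}

def pick_mutation_type(rng: random.Random) -> str:
    types, ws = zip(*weights.items())
    return rng.choices(types, weights=ws, k=1)[0]

rng = random.Random(42)
counts = {t: 0 for t in weights}
for _ in range(4000):
    counts[pick_mutation_type(rng)] += 1
# With weight 2.0, my_new_type is drawn roughly twice as often as the others
print(counts["my_new_type"] > counts["paraphrase"])  # True
```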

---

### Q: How do I add a new assertion type?

**A:** Four steps:

1. **Create checker class** (`assertions/deterministic.py` or `semantic.py`):

```python
class MyNewChecker(BaseChecker):
    def check(self, response: str, latency_ms: float) -> CheckResult:
        # Your logic here
        passed = some_condition(response)
        return CheckResult(
            passed=passed,
            check_type=InvariantType.MY_NEW_TYPE,
            details="Explanation"
        )
```

2. **Add to enum** (`core/config.py`):

```python
class InvariantType(str, Enum):
    # ... existing types
    MY_NEW_TYPE = "my_new_type"
```

3. **Register in verifier** (`assertions/verifier.py`):

```python
CHECKER_REGISTRY = {
    # ... existing checkers
    InvariantType.MY_NEW_TYPE: MyNewChecker,
}
```

4. **Add tests** (`tests/test_assertions.py`)

---

### Q: How do I add a new report format?

**A:** Create a new generator:

```python
# reports/markdown.py
from datetime import datetime
from pathlib import Path

class MarkdownReportGenerator:
    def __init__(self, results: TestResults):
        self.results = results

    def generate(self) -> str:
        """Generate markdown content."""
        md = "# flakestorm Report\n\n"
        md += f"**Score:** {self.results.statistics.robustness_score:.2f}\n"
        # ... more content
        return md

    def save(self, path: Path | None = None) -> Path:
        if path is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            path = Path(f"reports/report_{timestamp}.md")
        path.write_text(self.generate())
        return path
```

Then add a CLI option in `cli/main.py`.

---

## Common Issues

### Q: Why am I getting "Cannot connect to Ollama"?

**A:** The Ollama service isn't running. Fix:

```bash
# Start Ollama
ollama serve

# Verify it's running
curl http://localhost:11434/api/version
```

---

### Q: Why is mutation generation slow?

**A:** LLM inference is inherently slow. Options:
1. Use a faster model: `ollama pull phi3:mini`
2. Reduce mutation count: `mutations.count: 10`
3. Use GPU: Ensure Ollama uses GPU acceleration

---

### Q: Why do tests pass locally but fail in CI?

**A:** Common causes:
1. **Missing Ollama**: CI needs the Ollama service
2. **Different model**: Ensure the same model is pulled
3. **Timing**: CI may be slower; increase timeouts
4. **Environment variables**: Ensure secrets are set in CI

---

### Q: How do I debug a failing assertion?

**A:** Enable verbose mode and check the report:

```bash
flakestorm run --verbose --output html
```

The HTML report shows:
- Original prompt
- Mutated prompt
- Agent response
- Which assertion failed and why

---

*Have more questions? Open an issue on GitHub!*