# flakestorm Developer FAQ
This document answers common questions developers might have about the flakestorm codebase. It's designed to help project maintainers explain design decisions and help contributors understand the codebase.

---
## Table of Contents
1. [Architecture Questions](#architecture-questions)
2. [Configuration System](#configuration-system)
3. [Mutation Engine](#mutation-engine)
4. [Assertion System](#assertion-system)
5. [Performance & Rust](#performance--rust)
6. [Agent Adapters](#agent-adapters)
7. [Testing & Quality](#testing--quality)
8. [Extending flakestorm](#extending-flakestorm)
9. [Common Issues](#common-issues)
---
## Architecture Questions
### Q: Why is the codebase split into core, mutations, assertions, and reports?
**A:** This follows the **Single Responsibility Principle (SRP)** and makes the codebase maintainable:
| Module | Responsibility |
|--------|---------------|
| `core/` | Orchestration, configuration, agent communication |
| `mutations/` | Adversarial input generation |
| `assertions/` | Response validation |
| `reports/` | Output formatting |
This separation means:
- Changes to mutation logic don't affect assertions
- New report formats can be added without touching core logic
- Each module can be tested independently
---
### Q: Why use async/await throughout the codebase?
**A:** Agent testing is **I/O-bound**, not CPU-bound. The bottleneck is waiting for:
1. LLM responses (mutation generation)
2. Agent responses (test execution)
Async allows running many operations concurrently:
```python
# Without async: 100 tests × 500ms = 50 seconds
# With async (10 concurrent): 100 tests / 10 × 500ms = 5 seconds
```
The semaphore in `orchestrator.py` controls concurrency:
```python
self._semaphore = asyncio.Semaphore(self.config.advanced.concurrency)

async def _run_single_mutation(self, mutation):
    async with self._semaphore:  # Limits concurrent executions
        return await self.agent.invoke(mutation.mutated)
```
---
### Q: Why is there both an `orchestrator.py` and a `runner.py`?
**A:** They serve different purposes:
- **`runner.py`**: High-level API for users - simple `EntropixRunner.run()` interface
- **`orchestrator.py`**: Internal coordination logic - handles the complex flow
This separation allows:
- `runner.py` to provide a clean facade
- `orchestrator.py` to be refactored without breaking the public API
- Different entry points (CLI, programmatic) to use the same core logic
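
The facade relationship can be sketched roughly like this (the `Orchestrator` internals and method names below are illustrative, not the actual flakestorm code):

```python
from dataclasses import dataclass

@dataclass
class Orchestrator:
    """Stand-in for the internal coordinator (illustrative only)."""
    config: dict

    def execute(self) -> str:
        # The real orchestrator schedules mutations, invokes the agent, etc.
        return f"ran with concurrency={self.config['concurrency']}"

class EntropixRunner:
    """Thin public facade: callers only ever see run()."""

    def __init__(self, config: dict):
        self._orchestrator = Orchestrator(config)

    def run(self) -> str:
        return self._orchestrator.execute()
```

Because callers depend only on `run()`, the orchestrator can be restructured freely without breaking the public API.
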
---
## Configuration System
### Q: Why Pydantic instead of dataclasses or attrs?
**A:** Pydantic was chosen for several reasons:
1. **Automatic Validation**: Built-in validators with clear error messages
```python
class MutationConfig(BaseModel):
    count: int = Field(ge=1, le=100)  # Validates range automatically
```
2. **Environment Variable Support**: Native expansion
```python
endpoint: str = Field(default="${AGENT_URL}")
```
3. **YAML/JSON Serialization**: Works out of the box
4. **IDE Support**: Type hints provide autocomplete
---
### Q: Why use environment variable expansion in config?
**A:** Security best practice - secrets should never be in config files:
```yaml
# BAD: Secret in file (gets committed to git)
headers:
  Authorization: "Bearer sk-1234567890"

# GOOD: Reference an environment variable
headers:
  Authorization: "Bearer ${API_KEY}"
```
Implementation in `config.py`:
```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} with its environment variable value."""
    pattern = r'\$\{([^}]+)\}'

    def replacer(match: re.Match) -> str:
        var_name = match.group(1)
        # Leave the placeholder untouched if the variable is unset
        return os.environ.get(var_name, match.group(0))

    return re.sub(pattern, replacer, value)
```
---
### Q: Why is MutationType defined as `str, Enum`?
**A:** String enums serialize directly to YAML/JSON:
```python
class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"
```
This allows:
```yaml
# In config file - uses string values directly
mutations:
  types:
    - paraphrase  # Works!
    - noise
```
If we used a regular Enum, we'd need custom serialization logic.
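
A quick self-contained demonstration of why `str, Enum` round-trips cleanly:

```python
import json
from enum import Enum

class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"
    NOISE = "noise"

# A str-based enum member *is* a string, so it compares and serializes as one:
assert MutationType.PARAPHRASE == "paraphrase"
assert MutationType("noise") is MutationType.NOISE   # config value -> enum member
assert json.dumps(MutationType.NOISE) == '"noise"'   # no custom encoder needed
```
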

---
## Mutation Engine
### Q: Why use a local LLM (Ollama) instead of cloud APIs?
**A:** Several important reasons:
| Factor | Local LLM | Cloud API |
|--------|-----------|-----------|
| **Cost** | Free | $0.01-0.10 per mutation |
| **Privacy** | Data stays local | Prompts sent to third party |
| **Rate Limits** | None | Often restrictive |
| **Latency** | Low | Network dependent |
| **Offline** | Works | Requires internet |
For a test run with 100 prompts × 20 mutations = 2000 API calls, cloud costs would add up quickly.

---
### Q: Why Qwen Coder 3 8B as the default model?
**A:** We evaluated several models:
| Model | Mutation Quality | Speed | Memory |
|-------|-----------------|-------|--------|
| Qwen Coder 3 8B | ⭐⭐⭐⭐ | ⭐⭐⭐ | 8GB |
| Llama 3 8B | ⭐⭐⭐ | ⭐⭐⭐ | 8GB |
| Mistral 7B | ⭐⭐⭐ | ⭐⭐⭐⭐ | 6GB |
| Phi-3 Mini | ⭐⭐ | ⭐⭐⭐⭐⭐ | 4GB |
Qwen Coder 3 was chosen because:
1. Excellent at understanding and modifying prompts
2. Good balance of quality vs. speed
3. Runs on consumer hardware (8GB VRAM)
---
### Q: How does the mutation template system work?
**A:** Templates are stored in `templates.py` and formatted with the original prompt:
```python
TEMPLATES = {
    MutationType.PARAPHRASE: """
Rewrite this prompt with different words but the same meaning.
Original: {prompt}
Rewritten:
""",
    MutationType.NOISE: """
Add 2-3 realistic typos to this prompt:
Original: {prompt}
With typos:
""",
}
```
The engine fills in `{prompt}` and sends to the LLM:
```python
template = TEMPLATES[mutation_type]
filled = template.format(prompt=original_prompt)
response = await self.client.generate(model=self.model, prompt=filled)
```
---
### Q: What if the LLM returns malformed mutations?
**A:** We have several safeguards:
1. **Parsing Logic**: Extracts text between known markers
2. **Validation**: Checks mutation isn't identical to original
3. **Retry Logic**: Regenerates if parsing fails
4. **Fallback**: Uses simple string manipulation if LLM fails
```python
def _parse_mutation(self, response: str) -> str:
    # Try to extract the mutated text
    lines = response.strip().split('\n')
    for line in lines:
        if line and not line.startswith('#'):
            return line.strip()
    raise MutationParseError("Could not extract mutation")
```
---
## Assertion System
### Q: Why separate deterministic and semantic assertions?
**A:** They have fundamentally different characteristics:
| Aspect | Deterministic | Semantic |
|--------|---------------|----------|
| **Speed** | Nanoseconds | Milliseconds |
| **Dependencies** | None | sentence-transformers |
| **Reproducibility** | 100% | May vary slightly |
| **Use Case** | Exact matching | Meaning matching |
Separating them allows:
- Running deterministic checks first (fast-fail)
- Making semantic checks optional (lighter installation)
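
The fast-fail ordering can be sketched like this (the checker functions are illustrative; the real verifier uses checker classes):

```python
def verify(response: str, deterministic_checks, semantic_checks) -> bool:
    # Cheap, dependency-free checks run first...
    for check in deterministic_checks:
        if not check(response):
            return False  # fast-fail: never pay for embeddings
    # ...and only then the slower, embedding-based checks.
    for check in semantic_checks:
        if not check(response):
            return False
    return True

not_empty = lambda r: bool(r.strip())
mentions_refund = lambda r: "refund" in r.lower()
```
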
---
### Q: How does the SimilarityChecker work internally?
**A:** It uses sentence embeddings and cosine similarity:
```python
class SimilarityChecker:
    def check(self, response: str, latency_ms: float) -> CheckResult:
        # 1. Embed both texts as vectors
        response_vec = self.embedder.embed(response)       # [0.1, 0.2, ...]
        expected_vec = self.embedder.embed(self.expected)  # [0.15, 0.18, ...]

        # 2. Calculate cosine similarity
        # (returns a value between -1 and 1, typically 0-1 for text)
        similarity = cosine_similarity(response_vec, expected_vec)

        # 3. Compare to the threshold
        return CheckResult(passed=similarity >= self.threshold)
```
The embedding model (`all-MiniLM-L6-v2`) converts text to 384-dimensional vectors that capture semantic meaning.
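
Cosine similarity itself is small enough to show in full; a pure-Python sketch (the real checker operates on the sentence-transformers vectors above):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle-based similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Because it measures angle rather than magnitude, scaling a vector does not change the score, which is what makes it a good fit for comparing embeddings.
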

---
### Q: Why is the embedder a class variable with lazy loading?
**A:** The embedding model is large (23MB) and takes 1-2 seconds to load:
```python
class SimilarityChecker:
    _embedder: LocalEmbedder | None = None  # Class variable, shared

    @property
    def embedder(self) -> LocalEmbedder:
        if SimilarityChecker._embedder is None:
            SimilarityChecker._embedder = LocalEmbedder()  # Load once
        return SimilarityChecker._embedder
```
Benefits:
1. **Lazy Loading**: Only loads if semantic checks are used
2. **Shared Instance**: All SimilarityCheckers share one model
3. **Memory Efficient**: One copy in memory, not one per checker
---
### Q: How does PII detection work?
**A:** Uses regex patterns for common PII formats:
```python
PII_PATTERNS = [
    (r'\b\d{3}-\d{2}-\d{4}\b', 'SSN'),            # 123-45-6789
    (r'\b\d{16}\b', 'Credit Card'),               # 1234567890123456
    (r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b', 'Email'),
    (r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', 'Phone'),  # 123-456-7890
]

def check(self, response: str, latency_ms: float) -> CheckResult:
    for pattern, pii_type in self.PII_PATTERNS:
        if re.search(pattern, response, re.IGNORECASE):
            return CheckResult(
                passed=False,
                details=f"Found potential {pii_type}",
            )
    return CheckResult(passed=True)
```
---
## Performance & Rust
### Q: Why Rust for performance-critical code?
**A:** Python is slow for CPU-bound operations. Benchmarks show:
```
Levenshtein Distance (5000 iterations):
Python: 5864ms
Rust: 67ms
Speedup: 88x
```
Rust was chosen over alternatives because:
- **vs C/C++**: Memory safety, easier to write correct code
- **vs Cython**: Better tooling (cargo), cleaner code
- **vs NumPy**: Works on strings, not just numbers
---
### Q: How does the Rust/Python bridge work?
**A:** Uses PyO3 for bindings:
```rust
// Rust side (lib.rs)
#[pyfunction]
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Rust implementation
}

#[pymodule]
fn flakestorm_rust(m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(levenshtein_distance, m)?)?;
    Ok(())
}
```
```python
# Python side (performance.py)
try:
    import flakestorm_rust
    _RUST_AVAILABLE = True
except ImportError:
    _RUST_AVAILABLE = False

def levenshtein_distance(s1: str, s2: str) -> int:
    if _RUST_AVAILABLE:
        return flakestorm_rust.levenshtein_distance(s1, s2)
    # Pure Python fallback
    ...
```
---
### Q: Why provide pure Python fallbacks?
**A:** Accessibility and reliability:
1. **Easy Installation**: `pip install flakestorm` works without Rust toolchain
2. **Platform Support**: Works on any Python platform
3. **Development**: Faster iteration without recompiling Rust
4. **Testing**: Can test both implementations for parity
The tradeoff is speed, but most time is spent waiting for LLM/agent responses anyway.
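
For reference, a pure-Python fallback might look like the classic two-row dynamic-programming version (a sketch; the actual fallback in `performance.py` may differ):

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character edits turning s1 into s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep s2 as the shorter string (smaller rows)
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (c1 != c2),   # substitution
            ))
        previous = current
    return previous[-1]
```

Keeping only two rows in memory makes the fallback O(min(m, n)) in space, which matters when diffing long agent responses.
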

---
## Agent Adapters
### Q: Why use the Protocol pattern for agents?
**A:** Enables type-safe duck typing:
```python
class AgentProtocol(Protocol):
    async def invoke(self, prompt: str) -> AgentResponse: ...
```
Any class with a matching `invoke` method works, even if it doesn't inherit from a base class. This is more Pythonic than Java-style interfaces.
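
A self-contained sketch of that structural typing (`EchoAgent` is illustrative, not part of flakestorm; `runtime_checkable` is added here only so the match can be verified at runtime):

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

@dataclass
class AgentResponse:
    text: str
    latency_ms: float

@runtime_checkable
class AgentProtocol(Protocol):
    async def invoke(self, prompt: str) -> AgentResponse: ...

class EchoAgent:  # no inheritance, yet it satisfies AgentProtocol
    async def invoke(self, prompt: str) -> AgentResponse:
        return AgentResponse(text=prompt.upper(), latency_ms=0.1)

response = asyncio.run(EchoAgent().invoke("ping"))
```

Note that `runtime_checkable` only verifies method names at `isinstance` time; static type checkers like mypy verify the full signatures.
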

---
### Q: How does the HTTP adapter handle different API formats?
**A:** Through configurable templates:
```yaml
agent:
  endpoint: "https://api.example.com/v1/chat"
  request_template: |
    {"messages": [{"role": "user", "content": "{prompt}"}]}
  response_path: "$.choices[0].message.content"
```
The adapter:
1. Replaces `{prompt}` in the template
2. Sends the formatted JSON
3. Uses JSONPath to extract the response
This supports OpenAI, Anthropic, custom APIs, etc.
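
A simplified sketch of those three steps (real code would use a JSONPath library for `response_path`; here the path `$.choices[0].message.content` is navigated by hand, and the helper names are hypothetical):

```python
import json

REQUEST_TEMPLATE = '{"messages": [{"role": "user", "content": "{prompt}"}]}'

def build_request(prompt: str) -> dict:
    # json.dumps escapes quotes/newlines so the filled template stays valid JSON
    raw = REQUEST_TEMPLATE.replace('"{prompt}"', json.dumps(prompt))
    return json.loads(raw)

def extract_response(body: dict) -> str:
    # Hand-rolled equivalent of response_path "$.choices[0].message.content"
    return body["choices"][0]["message"]["content"]
```
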

---
### Q: Why is there a Python adapter?
**A:** Bypasses HTTP overhead for local testing:
```python
# Instead of: HTTP request → your server → your code → HTTP response
# Just:       your_function(prompt) → response

class PythonAgentAdapter:
    async def invoke(self, prompt: str) -> AgentResponse:
        # Import the module dynamically
        module_path, func_name = self.endpoint.rsplit(":", 1)
        module = importlib.import_module(module_path)
        func = getattr(module, func_name)

        # Call directly, timing the invocation
        start = time.perf_counter()
        response = await func(prompt) if asyncio.iscoroutinefunction(func) else func(prompt)
        latency = (time.perf_counter() - start) * 1000
        return AgentResponse(text=response, latency_ms=latency)
```
---
## Testing & Quality
### Q: Why are tests split by module?
**A:** Mirrors the source structure for maintainability:
```
tests/
├── test_config.py       # Tests for core/config.py
├── test_mutations.py    # Tests for mutations/
├── test_assertions.py   # Tests for assertions/
└── test_performance.py  # Tests for performance module
```
When fixing a bug in `config.py`, you immediately know to check `test_config.py`.

---
### Q: Why use pytest over unittest?
**A:** Pytest is more Pythonic and powerful:
```python
# unittest style (verbose)
class TestConfig(unittest.TestCase):
    def test_load_config(self):
        self.assertEqual(config.agent.type, AgentType.HTTP)

# pytest style (concise)
def test_load_config():
    assert config.agent.type == AgentType.HTTP
```
Pytest also offers:
- Fixtures for setup/teardown
- Parametrized tests
- Better assertion introspection
---
### Q: How should I add tests for a new feature?
**A:** Follow this pattern:
1. **Create test file** if needed: `tests/test_<module>.py`
2. **Write failing test first** (TDD)
3. **Group related tests** in a class
4. **Use fixtures** for common setup
```python
# tests/test_new_feature.py
import pytest

from flakestorm.new_module import NewFeature


class TestNewFeature:
    @pytest.fixture
    def feature(self):
        return NewFeature(config={...})

    def test_basic_functionality(self, feature):
        result = feature.do_something()
        assert result == expected

    def test_edge_case(self, feature):
        with pytest.raises(ValueError):
            feature.do_something(invalid_input)
```
---
## Extending flakestorm
### Q: How do I add a new mutation type?
**A:** Three steps:
1. **Add to enum** (`mutations/types.py`):
```python
class MutationType(str, Enum):
    # ... existing types
    MY_NEW_TYPE = "my_new_type"
```
2. **Add template** (`mutations/templates.py`):
```python
TEMPLATES[MutationType.MY_NEW_TYPE] = """
Your prompt template here.
Original: {prompt}
Modified:
"""
```
3. **Add default weight** (`core/config.py`):
```python
class MutationConfig(BaseModel):
    weights: dict = {
        # ... existing weights
        MutationType.MY_NEW_TYPE: 1.0,
    }
```
---
### Q: How do I add a new assertion type?
**A:** Four steps:
1. **Create checker class** (`assertions/deterministic.py` or `semantic.py`):
```python
class MyNewChecker(BaseChecker):
    def check(self, response: str, latency_ms: float) -> CheckResult:
        # Your logic here
        passed = some_condition(response)
        return CheckResult(
            passed=passed,
            check_type=InvariantType.MY_NEW_TYPE,
            details="Explanation",
        )
```
2. **Add to enum** (`core/config.py`):
```python
class InvariantType(str, Enum):
    # ... existing types
    MY_NEW_TYPE = "my_new_type"
```
3. **Register in verifier** (`assertions/verifier.py`):
```python
CHECKER_REGISTRY = {
    # ... existing checkers
    InvariantType.MY_NEW_TYPE: MyNewChecker,
}
```
4. **Add tests** (`tests/test_assertions.py`)
---
### Q: How do I add a new report format?
**A:** Create a new generator:
```python
# reports/markdown.py
class MarkdownReportGenerator:
    def __init__(self, results: TestResults):
        self.results = results

    def generate(self) -> str:
        """Generate Markdown content."""
        md = "# flakestorm Report\n\n"
        md += f"**Score:** {self.results.statistics.robustness_score:.2f}\n"
        # ... more content
        return md

    def save(self, path: Path | None = None) -> Path:
        path = path or Path(f"reports/report_{timestamp}.md")
        path.write_text(self.generate())
        return path
```
Then add CLI option in `cli/main.py`.

---
## Common Issues
### Q: Why am I getting "Cannot connect to Ollama"?
**A:** Ollama service isn't running. Fix:
```bash
# Start Ollama
ollama serve
# Verify it's running
curl http://localhost:11434/api/version
```
---
### Q: Why is mutation generation slow?
**A:** LLM inference is inherently slow. Options:
1. Use a faster model: `ollama pull phi3:mini`
2. Reduce mutation count: `mutations.count: 10`
3. Use GPU: Ensure Ollama uses GPU acceleration
---
### Q: Why do tests pass locally but fail in CI?
**A:** Common causes:
1. **Missing Ollama**: CI needs Ollama service
2. **Different model**: Ensure same model is pulled
3. **Timing**: CI may be slower, increase timeouts
4. **Environment variables**: Ensure secrets are set in CI
---
### Q: How do I debug a failing assertion?
**A:** Enable verbose mode and check the report:
```bash
flakestorm run --verbose --output html
```
The HTML report shows:
- Original prompt
- Mutated prompt
- Agent response
- Which assertion failed and why
---
*Have more questions? Open an issue on GitHub!*