mirror of
https://github.com/flakestorm/flakestorm.git
synced 2026-04-25 00:36:54 +02:00
# flakestorm Developer FAQ

This document answers common questions developers might have about the flakestorm codebase. It's designed to help project maintainers explain design decisions and help contributors understand the codebase.

---

## Table of Contents

1. [Architecture Questions](#architecture-questions)
2. [Configuration System](#configuration-system)
3. [Mutation Engine](#mutation-engine)
4. [Assertion System](#assertion-system)
5. [Performance & Rust](#performance--rust)
6. [Agent Adapters](#agent-adapters)
7. [Testing & Quality](#testing--quality)
8. [Extending flakestorm](#extending-flakestorm)
9. [Common Issues](#common-issues)

---

## Architecture Questions

### Q: Why is the codebase split into core, mutations, assertions, and reports?

**A:** This follows the **Single Responsibility Principle (SRP)** and makes the codebase maintainable:

| Module | Responsibility |
|--------|---------------|
| `core/` | Orchestration, configuration, agent communication |
| `mutations/` | Adversarial input generation |
| `assertions/` | Response validation |
| `reports/` | Output formatting |

This separation means:
- Changes to mutation logic don't affect assertions
- New report formats can be added without touching core logic
- Each module can be tested independently

---

### Q: Why use async/await throughout the codebase?

**A:** Agent testing is **I/O-bound**, not CPU-bound. The bottleneck is waiting for:
1. LLM responses (mutation generation)
2. Agent responses (test execution)

Async allows running many operations concurrently:

```python
# Without async: 100 tests × 500ms = 50 seconds
# With async (10 concurrent): 100 tests / 10 × 500ms = 5 seconds
```

The semaphore in `orchestrator.py` controls concurrency:

```python
semaphore = asyncio.Semaphore(self.config.advanced.concurrency)

async def _run_single_mutation(self, mutation):
    async with semaphore:  # Limits concurrent executions
        return await self.agent.invoke(mutation.mutated)
```

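
The effect of the semaphore can be checked in isolation. The following is a minimal sketch, not the orchestrator's actual code: `fake_agent` and `run_all` are hypothetical names, and the sleep stands in for a real agent call. It records the highest number of calls in flight at once, which should never exceed the configured limit:

```python
import asyncio


async def fake_agent(prompt: str, delay: float = 0.01) -> str:
    """Hypothetical agent: sleeps to simulate network latency."""
    await asyncio.sleep(delay)
    return f"echo: {prompt}"


async def run_all(prompts: list[str], concurrency: int) -> tuple[list[str], int]:
    """Run every prompt, allowing at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def run_one(prompt: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_agent(prompt)
            finally:
                in_flight -= 1

    # gather() preserves input order in its result list
    results = await asyncio.gather(*(run_one(p) for p in prompts))
    return list(results), peak


results, peak = asyncio.run(run_all([f"p{i}" for i in range(20)], concurrency=5))
```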
---

### Q: Why is there both an `orchestrator.py` and a `runner.py`?

**A:** They serve different purposes:

- **`runner.py`**: High-level API for users - simple `EntropixRunner.run()` interface
- **`orchestrator.py`**: Internal coordination logic - handles the complex flow

This separation allows:
- `runner.py` to provide a clean facade
- `orchestrator.py` to be refactored without breaking the public API
- Different entry points (CLI, programmatic) to use the same core logic

---

## Configuration System

### Q: Why Pydantic instead of dataclasses or attrs?

**A:** Pydantic was chosen for several reasons:

1. **Automatic Validation**: Built-in validators with clear error messages
   ```python
   class MutationConfig(BaseModel):
       count: int = Field(ge=1, le=100)  # Validates range automatically
   ```

2. **Environment Variable Support**: String fields can carry `${VAR}` placeholders, which `expand_env_vars` in `config.py` resolves (Pydantic does not expand them by itself)
   ```python
   endpoint: str = Field(default="${AGENT_URL}")
   ```

3. **YAML/JSON Serialization**: JSON works out of the box; YAML goes through plain dicts and a YAML library
4. **IDE Support**: Type hints provide autocomplete

---

### Q: Why use environment variable expansion in config?

**A:** Security best practice - secrets should never be in config files:

```yaml
# BAD: Secret in file (gets committed to git)
headers:
  Authorization: "Bearer sk-1234567890"

# GOOD: Reference environment variable
headers:
  Authorization: "Bearer ${API_KEY}"
```

Implementation in `config.py`:

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} with environment variable value."""
    pattern = r'\$\{([^}]+)\}'

    def replacer(match):
        var_name = match.group(1)
        # An unset variable leaves the placeholder untouched
        return os.environ.get(var_name, match.group(0))

    return re.sub(pattern, replacer, value)
```

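
A config loaded from YAML is a nested structure of dicts and lists, so the expansion has to reach every string inside it. The sketch below shows one way to do that. It assumes a hypothetical helper `expand_config` (not taken from the flakestorm source) and uses demo variable names invented for the example:

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([^}]+)\}")


def expand_env_vars(value: str) -> str:
    """Replace ${VAR} with the variable's value; leave unset ones as-is."""
    return _ENV_PATTERN.sub(
        lambda m: os.environ.get(m.group(1), m.group(0)), value
    )


def expand_config(node):
    """Hypothetical helper: recursively expand placeholders in dicts/lists."""
    if isinstance(node, str):
        return expand_env_vars(node)
    if isinstance(node, dict):
        return {key: expand_config(val) for key, val in node.items()}
    if isinstance(node, list):
        return [expand_config(item) for item in node]
    return node  # numbers, booleans, None pass through unchanged


os.environ["DEMO_API_KEY"] = "secret-token"  # demo value only
os.environ.pop("DEMO_UNSET_VAR", None)       # make sure this one is unset

raw = {
    "agent": {
        "headers": {"Authorization": "Bearer ${DEMO_API_KEY}"},
        "endpoint": "${DEMO_UNSET_VAR}/chat",
        "timeout": 30,
    }
}
expanded = expand_config(raw)
```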
---

### Q: Why is MutationType defined as `str, Enum`?

**A:** String enums serialize directly to YAML/JSON:

```python
from enum import Enum

class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"
```

This allows:
```yaml
# In config file - uses string value directly
mutations:
  types:
    - paraphrase  # Works!
    - noise
```

If we used a regular Enum, we'd need custom serialization logic.

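
The difference shows up with the standard `json` module. In this small sketch, `PlainType` is a hypothetical counterpart without the `str` mixin, added only for comparison:

```python
import json
from enum import Enum


class MutationType(str, Enum):
    PARAPHRASE = "paraphrase"
    NOISE = "noise"


class PlainType(Enum):  # hypothetical: same value, no str mixin
    PARAPHRASE = "paraphrase"


# Members of a str-based enum are strings, so json encodes them directly
encoded = json.dumps({"types": [MutationType.PARAPHRASE, MutationType.NOISE]})

# Plain strings from a config file map back to members by value
parsed = [MutationType(name) for name in json.loads(encoded)["types"]]

# A regular Enum member is not JSON serializable without extra code
try:
    json.dumps({"type": PlainType.PARAPHRASE})
    plain_is_serializable = True
except TypeError:
    plain_is_serializable = False
```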
---

## Mutation Engine

### Q: Why use a local LLM (Ollama) instead of cloud APIs?

**A:** Several important reasons:

| Factor | Local LLM | Cloud API |
|--------|-----------|-----------|
| **Cost** | Free | $0.01-0.10 per mutation |
| **Privacy** | Data stays local | Prompts sent to third party |
| **Rate Limits** | None | Often restrictive |
| **Latency** | Low | Network dependent |
| **Offline** | Works | Requires internet |

A test run with 100 prompts × 20 mutations makes 2,000 API calls. At the per-mutation rates above, that is roughly $20-200 per run on a cloud API.

---

### Q: Why Qwen Coder 3 8B as the default model?

**A:** We evaluated several models:

| Model | Mutation Quality | Speed | Memory |
|-------|-----------------|-------|--------|
| Qwen Coder 3 8B | ⭐⭐⭐⭐ | ⭐⭐⭐ | 8GB |
| Llama 3 8B | ⭐⭐⭐ | ⭐⭐⭐ | 8GB |
| Mistral 7B | ⭐⭐⭐ | ⭐⭐⭐⭐ | 6GB |
| Phi-3 Mini | ⭐⭐ | ⭐⭐⭐⭐⭐ | 4GB |

Qwen Coder 3 was chosen because:
1. Excellent at understanding and modifying prompts
2. Good balance of quality vs. speed
3. Runs on consumer hardware (8GB VRAM)

---

### Q: How does the mutation template system work?

**A:** Templates are stored in `templates.py` and formatted with the original prompt:

```python
TEMPLATES = {
    MutationType.PARAPHRASE: """
    Rewrite this prompt with different words but same meaning.

    Original: {prompt}

    Rewritten:
    """,
    MutationType.NOISE: """
    Add 2-3 realistic typos to this prompt:

    Original: {prompt}

    With typos:
    """
}
```

The engine fills in `{prompt}` and sends to the LLM:

```python
template = TEMPLATES[mutation_type]
filled = template.format(prompt=original_prompt)
response = await self.client.generate(model=self.model, prompt=filled)
```

---

### Q: What if the LLM returns malformed mutations?

**A:** We have several safeguards:

1. **Parsing Logic**: Extracts text between known markers
2. **Validation**: Checks mutation isn't identical to original
3. **Retry Logic**: Regenerates if parsing fails
4. **Fallback**: Uses simple string manipulation if LLM fails

```python
def _parse_mutation(self, response: str) -> str:
    # Try to extract the mutated text: first non-empty, non-comment line
    lines = response.strip().split('\n')
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith('#'):
            return stripped
    raise MutationParseError("Could not extract mutation")
```

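
The four safeguards combine into a single loop. The sketch below is illustrative rather than the engine's actual code: `mutate_with_retry` and `fallback_mutation` are hypothetical names, the case-swapping fallback is an assumed stand-in for "simple string manipulation", and `generate` is any callable that returns raw LLM text:

```python
class MutationParseError(Exception):
    """Raised when no usable mutation can be extracted."""


def parse_mutation(response: str) -> str:
    """Return the first non-empty line that is not a comment."""
    for line in response.strip().split("\n"):
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return stripped
    raise MutationParseError("Could not extract mutation")


def fallback_mutation(original: str) -> str:
    """Hypothetical fallback: swap letter case so the text differs."""
    swapped = original.swapcase()
    return swapped if swapped != original else original + " ??"


def mutate_with_retry(generate, original: str, max_attempts: int = 3) -> str:
    """Call generate(original) until it yields a valid mutation, else fall back."""
    for _ in range(max_attempts):
        try:
            candidate = parse_mutation(generate(original))
        except MutationParseError:
            continue  # parsing failed: regenerate
        if candidate != original:  # validation: must differ from the input
            return candidate
    return fallback_mutation(original)


# Usage with stand-ins for the LLM call
replies = iter(["# just a comment", "Book a flight", "Reserve a plane ticket"])
result = mutate_with_retry(lambda _prompt: next(replies), "Book a flight")
always_bad = mutate_with_retry(lambda _prompt: "", "Book a flight")
```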
---

## Assertion System

### Q: Why separate deterministic and semantic assertions?

**A:** They have fundamentally different characteristics:

| Aspect | Deterministic | Semantic |
|--------|---------------|----------|
| **Speed** | Nanoseconds | Milliseconds |
| **Dependencies** | None | sentence-transformers |
| **Reproducibility** | 100% | May vary slightly |
| **Use Case** | Exact matching | Meaning matching |

Separating them allows:
- Running deterministic checks first (fast-fail)
- Making semantic checks optional (lighter installation)

---

### Q: How does the SimilarityChecker work internally?

**A:** It uses sentence embeddings and cosine similarity:

```python
class SimilarityChecker:
    def check(self, response: str, latency_ms: float) -> CheckResult:
        # 1. Embed both texts to vectors
        response_vec = self.embedder.embed(response)       # [0.1, 0.2, ...]
        expected_vec = self.embedder.embed(self.expected)  # [0.15, 0.18, ...]

        # 2. Calculate cosine similarity
        similarity = cosine_similarity(response_vec, expected_vec)
        # Returns value between -1 and 1 (typically 0-1 for text)

        # 3. Compare to threshold
        return CheckResult(passed=similarity >= self.threshold)
```

The embedding model (`all-MiniLM-L6-v2`) converts text to 384-dimensional vectors that capture semantic meaning.

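
Cosine similarity itself is a short formula: the dot product of the two vectors divided by the product of their lengths. Here is a dependency-free sketch with tiny hand-made vectors in place of real embeddings. The threshold value is illustrative, not a flakestorm default, and the real checker may well rely on a library routine instead:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|); returns 0.0 for a zero vector."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # unrelated
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])                 # reversed

threshold = 0.8  # illustrative value only
passed = same_direction >= threshold
```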
---

### Q: Why is the embedder a class variable with lazy loading?

**A:** The embedding model is large (23MB) and takes 1-2 seconds to load:

```python
class SimilarityChecker:
    _embedder: LocalEmbedder | None = None  # Class variable, shared

    @property
    def embedder(self) -> LocalEmbedder:
        if SimilarityChecker._embedder is None:
            SimilarityChecker._embedder = LocalEmbedder()  # Load once
        return SimilarityChecker._embedder
```

Benefits:
1. **Lazy Loading**: Only loads if semantic checks are used
2. **Shared Instance**: All SimilarityCheckers share one model
3. **Memory Efficient**: One copy in memory, not one per checker

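
Both properties, lazy loading and sharing, can be observed with a stand-in embedder that counts how often it is constructed. `FakeEmbedder` below is hypothetical and replaces `LocalEmbedder` so the sketch runs without the model. Note that this pattern as written is not guarded against two threads triggering the first load at the same time:

```python
from typing import Optional


class FakeEmbedder:
    """Hypothetical stand-in for the real embedder; counts its loads."""

    load_count = 0

    def __init__(self) -> None:
        FakeEmbedder.load_count += 1  # the real model load would happen here


class SimilarityChecker:
    _embedder: Optional[FakeEmbedder] = None  # class variable, shared

    @property
    def embedder(self) -> FakeEmbedder:
        if SimilarityChecker._embedder is None:
            SimilarityChecker._embedder = FakeEmbedder()  # load once
        return SimilarityChecker._embedder


first, second = SimilarityChecker(), SimilarityChecker()
loads_before_use = FakeEmbedder.load_count    # nothing loaded yet
shared = first.embedder is second.embedder    # first access triggers the load
loads_after_use = FakeEmbedder.load_count     # still a single load
```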
---

### Q: How does PII detection work?

**A:** Uses regex patterns for common PII formats:

```python
PII_PATTERNS = [
    (r'\b\d{3}-\d{2}-\d{4}\b', 'SSN'),                    # 123-45-6789
    (r'\b\d{16}\b', 'Credit Card'),                       # 1234567890123456
    (r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b', 'Email'),
    (r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', 'Phone'),          # 123-456-7890
]

def check(self, response: str, latency_ms: float) -> CheckResult:
    for pattern, pii_type in self.PII_PATTERNS:
        if re.search(pattern, response, re.IGNORECASE):
            return CheckResult(
                passed=False,
                details=f"Found potential {pii_type}"
            )
    return CheckResult(passed=True)
```

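
The patterns can be exercised on their own to see what they catch and what they miss. In this sketch, `find_pii` is a hypothetical helper that returns every matching label instead of stopping at the first one. The last sample shows a limitation of the 16-digit pattern: a card number written with spaces goes undetected.

```python
import re

PII_PATTERNS = [
    (r'\b\d{3}-\d{2}-\d{4}\b', 'SSN'),
    (r'\b\d{16}\b', 'Credit Card'),
    (r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b', 'Email'),
    (r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', 'Phone'),
]


def find_pii(text: str) -> list[str]:
    """Hypothetical helper: labels of all patterns that match, in order."""
    return [
        label
        for pattern, label in PII_PATTERNS
        if re.search(pattern, text, re.IGNORECASE)
    ]


ssn_hits = find_pii("My SSN is 123-45-6789")
email_hits = find_pii("Reach me at jane.doe@example.com")
phone_hits = find_pii("Call 555-123-4567 tomorrow")
clean_hits = find_pii("The weather is nice today")
spaced_card_hits = find_pii("Card: 1234 5678 9012 3456")  # not detected
```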
---

## Performance & Rust

### Q: Why Rust for performance-critical code?

**A:** Python is slow for CPU-bound operations. Benchmarks show:

```
Levenshtein Distance (5000 iterations):
  Python: 5864ms
  Rust:   67ms
  Speedup: 88x
```

Rust was chosen over alternatives because:
- **vs C/C++**: Memory safety, easier to write correct code
- **vs Cython**: Better tooling (cargo), cleaner code
- **vs NumPy**: Works on strings, not just numbers

---

### Q: How does the Rust/Python bridge work?

**A:** Uses PyO3 for bindings:

```rust
// Rust side (lib.rs)
use pyo3::prelude::*;

#[pyfunction]
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    // Rust implementation
}

// The module function name must match the name imported from Python
#[pymodule]
fn flakestorm_rust(m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(levenshtein_distance, m)?)?;
    Ok(())
}
```

```python
# Python side (performance.py)
try:
    import flakestorm_rust
    _RUST_AVAILABLE = True
except ImportError:
    _RUST_AVAILABLE = False

def levenshtein_distance(s1: str, s2: str) -> int:
    if _RUST_AVAILABLE:
        return flakestorm_rust.levenshtein_distance(s1, s2)
    # Pure Python fallback
    ...
```

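
For reference, a pure Python fallback can use the standard two-row dynamic programming algorithm. This is a sketch of that well-known technique, not necessarily the implementation in `performance.py`:

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # make s2 the shorter string to keep rows small

    # previous[j] = distance between the first i-1 chars of s1 and first j of s2
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


kitten_sitting = levenshtein_distance("kitten", "sitting")
empty_case = levenshtein_distance("", "abc")
identical = levenshtein_distance("prompt", "prompt")
```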
---

### Q: Why provide pure Python fallbacks?

**A:** Accessibility and reliability:

1. **Easy Installation**: `pip install flakestorm` works without Rust toolchain
2. **Platform Support**: Works on any Python platform
3. **Development**: Faster iteration without recompiling Rust
4. **Testing**: Can test both implementations for parity

The tradeoff is speed, but most time is spent waiting for LLM/agent responses anyway.

---

## Agent Adapters

### Q: Why use the Protocol pattern for agents?

**A:** Enables type-safe duck typing:

```python
class AgentProtocol(Protocol):
    async def invoke(self, prompt: str) -> AgentResponse: ...
```

Any class with a matching `invoke` method works, even if it doesn't inherit from a base class. This is more Pythonic than Java-style interfaces.

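
A self-contained sketch of the idea follows. `AgentResponse` is reduced here to an assumed two-field dataclass, and `EchoAgent` and `NotAnAgent` are hypothetical classes added for the demonstration. The `runtime_checkable` decorator is optional and only enables the `isinstance` checks at the end. Those checks look for the method's presence, not its signature, so static type checkers remain the main beneficiary of the Protocol.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class AgentResponse:  # simplified stand-in for the real response type
    text: str
    latency_ms: float = 0.0


@runtime_checkable
class AgentProtocol(Protocol):
    async def invoke(self, prompt: str) -> AgentResponse: ...


class EchoAgent:  # note: no base class
    async def invoke(self, prompt: str) -> AgentResponse:
        return AgentResponse(text=prompt.upper())


class NotAnAgent:  # lacks invoke(), so it does not satisfy the protocol
    def run(self, prompt: str) -> str:
        return prompt


async def call(agent: AgentProtocol, prompt: str) -> str:
    response = await agent.invoke(prompt)
    return response.text


echoed = asyncio.run(call(EchoAgent(), "hello"))
echo_matches = isinstance(EchoAgent(), AgentProtocol)
other_matches = isinstance(NotAnAgent(), AgentProtocol)
```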
---

### Q: How does the HTTP adapter handle different API formats?

**A:** Through configurable templates:

```yaml
agent:
  endpoint: "https://api.example.com/v1/chat"
  request_template: |
    {"messages": [{"role": "user", "content": "{prompt}"}]}
  response_path: "$.choices[0].message.content"
```

The adapter:
1. Replaces `{prompt}` in the template
2. Sends the formatted JSON
3. Uses JSONPath to extract the response

This supports OpenAI, Anthropic, custom APIs, etc.

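
Steps 1 and 3 can be sketched without any HTTP client. The code below is an illustration under assumptions, not the adapter's implementation: `render_request` and `extract` are hypothetical names, and `extract` handles only the dotted-key and `[index]` subset of JSONPath used in the example above. One detail worth showing is that the prompt is JSON-escaped before substitution, since a raw quote or newline in the prompt would otherwise produce invalid JSON.

```python
import json
import re


def render_request(template: str, prompt: str) -> dict:
    """Fill {prompt} with a JSON-escaped value, then parse the result."""
    escaped = json.dumps(prompt)[1:-1]  # escape quotes/newlines, drop outer quotes
    return json.loads(template.replace("{prompt}", escaped))


_TOKEN = re.compile(r"\.([A-Za-z_]\w*)|\[(\d+)\]")


def extract(data, path: str):
    """Resolve a simple path such as $.choices[0].message.content."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    pos = 1
    current = data
    while pos < len(path):
        match = _TOKEN.match(path, pos)
        if match is None:
            raise ValueError(f"unsupported path syntax at position {pos}")
        key, index = match.group(1), match.group(2)
        current = current[key] if key is not None else current[int(index)]
        pos = match.end()
    return current


template = '{"messages": [{"role": "user", "content": "{prompt}"}]}'
body = render_request(template, 'Say "hi"\nplease')

reply = {"choices": [{"message": {"content": "hi"}}]}  # made-up response shape
text = extract(reply, "$.choices[0].message.content")
```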
---

### Q: Why is there a Python adapter?

**A:** Bypasses HTTP overhead for local testing:

```python
# Instead of: HTTP request → your server → your code → HTTP response
# Just: your_function(prompt) → response
import asyncio
import importlib
import time

class PythonAgentAdapter:
    async def invoke(self, prompt: str) -> AgentResponse:
        # Import the module dynamically (endpoint format: "package.module:function")
        module_path, func_name = self.endpoint.rsplit(":", 1)
        module = importlib.import_module(module_path)
        func = getattr(module, func_name)

        # Call directly, awaiting only if the target is a coroutine function
        start = time.perf_counter()
        response = await func(prompt) if asyncio.iscoroutinefunction(func) else func(prompt)
        latency = (time.perf_counter() - start) * 1000

        return AgentResponse(text=response, latency_ms=latency)
```

---

## Testing & Quality

### Q: Why are tests split by module?

**A:** Mirrors the source structure for maintainability:

```
tests/
├── test_config.py        # Tests for core/config.py
├── test_mutations.py     # Tests for mutations/
├── test_assertions.py    # Tests for assertions/
├── test_performance.py   # Tests for performance module
```

When fixing a bug in `config.py`, you immediately know to check `test_config.py`.

---

### Q: Why use pytest over unittest?

**A:** Pytest is more Pythonic and powerful:

```python
# unittest style (verbose)
class TestConfig(unittest.TestCase):
    def test_load_config(self):
        self.assertEqual(config.agent.type, AgentType.HTTP)

# pytest style (concise)
def test_load_config():
    assert config.agent.type == AgentType.HTTP
```

Pytest also offers:
- Fixtures for setup/teardown
- Parametrized tests
- Better assertion introspection

---

### Q: How should I add tests for a new feature?

**A:** Follow this pattern:

1. **Create test file** if needed: `tests/test_<module>.py`
2. **Write failing test first** (TDD)
3. **Group related tests** in a class
4. **Use fixtures** for common setup

```python
# tests/test_new_feature.py
import pytest
from flakestorm.new_module import NewFeature

class TestNewFeature:
    @pytest.fixture
    def feature(self):
        return NewFeature(config={...})

    def test_basic_functionality(self, feature):
        result = feature.do_something()
        assert result == expected

    def test_edge_case(self, feature):
        with pytest.raises(ValueError):
            feature.do_something(invalid_input)
```

---

## Extending flakestorm

### Q: How do I add a new mutation type?

**A:** Three steps:

1. **Add to enum** (`mutations/types.py`):
   ```python
   class MutationType(str, Enum):
       # ... existing types
       MY_NEW_TYPE = "my_new_type"
   ```

2. **Add template** (`mutations/templates.py`):
   ```python
   TEMPLATES[MutationType.MY_NEW_TYPE] = """
   Your prompt template here.

   Original: {prompt}

   Modified:
   """
   ```

3. **Add default weight** (`core/config.py`):
   ```python
   class MutationConfig(BaseModel):
       weights: dict = {
           # ... existing weights
           MutationType.MY_NEW_TYPE: 1.0,
       }
   ```

---

### Q: How do I add a new assertion type?

**A:** Four steps:

1. **Create checker class** (`assertions/deterministic.py` or `semantic.py`):
   ```python
   class MyNewChecker(BaseChecker):
       def check(self, response: str, latency_ms: float) -> CheckResult:
           # Your logic here
           passed = some_condition(response)
           return CheckResult(
               passed=passed,
               check_type=InvariantType.MY_NEW_TYPE,
               details="Explanation"
           )
   ```

2. **Add to enum** (`core/config.py`):
   ```python
   class InvariantType(str, Enum):
       # ... existing types
       MY_NEW_TYPE = "my_new_type"
   ```

3. **Register in verifier** (`assertions/verifier.py`):
   ```python
   CHECKER_REGISTRY = {
       # ... existing checkers
       InvariantType.MY_NEW_TYPE: MyNewChecker,
   }
   ```

4. **Add tests** (`tests/test_assertions.py`)

---

### Q: How do I add a new report format?

**A:** Create a new generator:

```python
# reports/markdown.py
from datetime import datetime
from pathlib import Path

class MarkdownReportGenerator:
    def __init__(self, results: TestResults):
        self.results = results

    def generate(self) -> str:
        """Generate markdown content."""
        md = "# flakestorm Report\n\n"
        md += f"**Score:** {self.results.statistics.robustness_score:.2f}\n"
        # ... more content
        return md

    def save(self, path: Path | None = None) -> Path:
        if path is None:
            # Default file name; the timestamp format is illustrative
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            path = Path(f"reports/report_{timestamp}.md")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(self.generate())
        return path
```

Then add CLI option in `cli/main.py`.

---

## Common Issues

### Q: Why am I getting "Cannot connect to Ollama"?

**A:** Ollama service isn't running. Fix:

```bash
# Start Ollama
ollama serve

# Verify it's running
curl http://localhost:11434/api/version
```

---

### Q: Why is mutation generation slow?

**A:** LLM inference is inherently slow. Options:
1. Use a faster model: `ollama pull phi3:mini`
2. Reduce mutation count: `mutations.count: 10`
3. Use GPU: Ensure Ollama uses GPU acceleration

---

### Q: Why do tests pass locally but fail in CI?

**A:** Common causes:
1. **Missing Ollama**: CI needs Ollama service
2. **Different model**: Ensure same model is pulled
3. **Timing**: CI may be slower, increase timeouts
4. **Environment variables**: Ensure secrets are set in CI

---

### Q: How do I debug a failing assertion?

**A:** Enable verbose mode and check the report:

```bash
flakestorm run --verbose --output html
```

The HTML report shows:
- Original prompt
- Mutated prompt
- Agent response
- Which assertion failed and why

---

*Have more questions? Open an issue on GitHub!*