# flakestorm Implementation Checklist

This document tracks the implementation progress of flakestorm - The Agent Reliability Engine.

## CLI Version (Open Source - Apache 2.0)

### Phase 1: Foundation (Week 1-2)

#### Project Scaffolding
- [x] Initialize Python project with pyproject.toml
- [x] Set up Rust workspace with Cargo.toml
- [x] Create Apache 2.0 LICENSE file
- [x] Write comprehensive README.md
- [x] Create flakestorm.yaml.example template
- [x] Set up project structure (src/flakestorm/*)
- [x] Configure pre-commit hooks (black, ruff, mypy)

#### Configuration System
- [x] Define Pydantic models for configuration
- [x] Implement YAML loading/validation
- [x] Support environment variable expansion
- [x] Create configuration factory functions
- [x] Add configuration validation tests
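
As a rough illustration of the environment variable expansion item above, the sketch below walks a parsed configuration (as it might look after YAML loading) and substitutes `${VAR}` references. The `${VAR}` / `${VAR:-default}` syntax and the `expand_env` helper are assumptions made for this example; the actual flakestorm loader may use different syntax and error handling.

```python
import os
import re

# Assumed placeholder syntax: ${VAR} or ${VAR:-default} (hypothetical, for illustration).
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(value, env=None):
    """Recursively expand ${VAR} references in strings, lists, and dicts."""
    env = os.environ if env is None else env

    def _replace(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name!r} is not set")

    if isinstance(value, str):
        return _ENV_PATTERN.sub(_replace, value)
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    # Numbers, booleans, and None pass through unchanged.
    return value
```

Expanding after parsing, rather than on the raw file text, keeps substituted values from being reinterpreted as YAML syntax.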

#### Agent Protocol/Adapter
- [x] Define AgentProtocol interface
- [x] Implement HTTPAgentAdapter
- [x] Implement PythonAgentAdapter
- [x] Implement LangChainAgentAdapter
- [x] Create adapter factory function
- [x] Add retry logic for HTTP adapter
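
The protocol and retry items above could look roughly like the sketch below. The `invoke` method name, the `RetryingAdapter` wrapper, and the exponential backoff policy are assumptions for illustration only; they are not taken from the flakestorm source.

```python
import asyncio
from typing import Awaitable, Callable, Protocol


class AgentProtocol(Protocol):
    """Structural interface: any object with a matching async invoke() qualifies."""

    async def invoke(self, prompt: str) -> str: ...


class RetryingAdapter:
    """Hypothetical adapter that retries a flaky async call with exponential backoff."""

    def __init__(
        self,
        call: Callable[[str], Awaitable[str]],
        max_retries: int = 3,
        base_delay: float = 0.1,
    ) -> None:
        self._call = call
        self.max_retries = max_retries
        self.base_delay = base_delay

    async def invoke(self, prompt: str) -> str:
        for attempt in range(self.max_retries + 1):
            try:
                return await self._call(prompt)
            except ConnectionError:
                # Give up once the retry budget is spent.
                if attempt == self.max_retries:
                    raise
                # Wait base_delay, 2x, 4x, ... between attempts.
                await asyncio.sleep(self.base_delay * 2 ** attempt)
        raise RuntimeError("unreachable")
```

Because `AgentProtocol` is a `typing.Protocol`, an adapter does not need to inherit from it; matching the method signature is enough for a type checker.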

---

### Phase 2: Mutation Engine (Week 2-3)

#### Ollama Integration
- [x] Create MutationEngine class
- [x] Implement Ollama client wrapper
- [x] Add connection verification
- [x] Support async mutation generation
- [x] Implement batch generation

#### Mutation Types & Templates
- [x] Define MutationType enum
- [x] Create Mutation dataclass
- [x] Write templates for PARAPHRASE
- [x] Write templates for NOISE
- [x] Write templates for TONE_SHIFT
- [x] Write templates for PROMPT_INJECTION
- [x] Add mutation validation logic
- [x] Support custom templates

#### Rust Performance Bindings
- [x] Set up PyO3 bindings
- [x] Implement robustness score calculation
- [x] Implement weighted score calculation
- [x] Implement Levenshtein distance
- [x] Implement parallel processing utilities
- [x] Build and test Rust module
- [x] Integrate with Python package
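
For reference, the Levenshtein distance item above is the standard edit-distance dynamic program. The pure-Python version below is a sketch of that algorithm, useful as a fallback or as a test oracle; it is not the Rust implementation, and the `normalized_similarity` helper is a hypothetical addition.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,        # delete ch_a
                    current[j - 1] + 1,     # insert ch_b
                    previous[j - 1] + cost, # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Hypothetical helper: map the distance onto [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Only two rows of the table are kept at a time, so memory use is proportional to the shorter string.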

---

### Phase 3: Runner & Assertions (Week 3-4)

#### Async Runner
- [x] Create FlakeStormRunner class
- [x] Implement orchestrator logic
- [x] Add concurrency control with semaphores
- [x] Implement progress tracking
- [x] Add setup verification
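
The concurrency control item above can be sketched with `asyncio.Semaphore`. The `run_all` function and its `max_concurrency` parameter are hypothetical names for this example; the real runner presumably does more (progress tracking, error capture).

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_all(
    prompts: Sequence[str],
    invoke: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> List[str]:
    """Invoke the agent on every prompt, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> str:
        # Each task waits here until one of the max_concurrency slots is free.
        async with semaphore:
            return await invoke(prompt)

    # gather() returns results in the same order as the input prompts.
    return await asyncio.gather(*(run_one(p) for p in prompts))
```

All tasks are created up front, but the semaphore ensures the agent under test never sees more than the configured number of simultaneous requests.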

#### Invariant System
- [x] Create InvariantVerifier class
- [x] Implement ContainsChecker
- [x] Implement LatencyChecker
- [x] Implement ValidJsonChecker
- [x] Implement RegexChecker
- [x] Implement SimilarityChecker
- [x] Implement ExcludesPIIChecker
- [x] Implement RefusalChecker
- [x] Add checker registry
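
One plausible shape for the checker registry above is a name-to-callable mapping filled in by a decorator. The sketch below uses plain functions rather than the checker classes named in the list, and the registry keys (`contains`, `valid_json`, `regex`) and the `type` field are assumptions for illustration.

```python
import json
import re
from typing import Callable, Dict, List

# Hypothetical registry: invariant type name -> checker function.
CHECKERS: Dict[str, Callable[..., bool]] = {}


def register(name: str):
    """Decorator that adds a checker to the registry under the given name."""
    def decorator(func: Callable[..., bool]) -> Callable[..., bool]:
        CHECKERS[name] = func
        return func
    return decorator


@register("contains")
def contains(response: str, value: str) -> bool:
    return value.lower() in response.lower()  # case-insensitive substring match


@register("valid_json")
def valid_json(response: str) -> bool:
    try:
        json.loads(response)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


@register("regex")
def regex(response: str, pattern: str) -> bool:
    return re.search(pattern, response) is not None


def verify(response: str, invariants: List[dict]) -> Dict[str, bool]:
    """Run each configured invariant against a response; returns name -> passed."""
    results = {}
    for invariant in invariants:
        params = dict(invariant)  # copy so the caller's config is not mutated
        name = params.pop("type")
        results[name] = CHECKERS[name](response, **params)
    return results
```

A registry like this lets configuration refer to checkers by name, so adding a new checker does not require touching the verifier loop.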

---

### Phase 4: CLI & Reporting (Week 4-5)

#### CLI Commands
- [x] Set up Typer application
- [x] Implement `flakestorm init` command
- [x] Implement `flakestorm run` command
- [x] Implement `flakestorm verify` command
- [x] Implement `flakestorm report` command
- [x] Implement `flakestorm score` command
- [x] Add CI mode (--ci --min-score)
- [x] Add rich progress bars

#### Report Generation
- [x] Create report data models
- [x] Implement HTMLReportGenerator
- [x] Create interactive HTML template
- [x] Implement JSONReportGenerator
- [x] Implement TerminalReporter
- [x] Add score visualization
- [x] Add mutation matrix view

---

### Phase 5: V2 Features (Week 5-7)

#### Environment Chaos & Context Attacks
- [x] ChaosConfig (tool_faults, llm_faults, context_attacks as list or dict)
- [x] ChaosInterceptor: memory_poisoning is applied to the input before invoke; LLM faults are injected around the call (timeout before it, all others after it)
- [x] context_attacks: indirect_injection, memory_poisoning (strategy prepend/append/replace), normalize_context_attacks
- [x] Per-scenario context_attacks in contract.chaos_matrix
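
To make the three memory_poisoning strategies listed above concrete, here is a minimal sketch of how a payload might be combined with the agent input before invoke. The function name, the newline separator, and the exact combination rules are assumptions; only the strategy names (prepend, append, replace) come from this checklist.

```python
def apply_memory_poisoning(user_input: str, payload: str, strategy: str = "prepend") -> str:
    """Hypothetical helper: combine a poisoned payload with the original input."""
    if strategy == "prepend":
        return f"{payload}\n{user_input}"  # payload comes first
    if strategy == "append":
        return f"{user_input}\n{payload}"  # payload comes last
    if strategy == "replace":
        return payload                      # original input is discarded
    raise ValueError(f"unknown memory_poisoning strategy: {strategy!r}")
```

Rejecting unknown strategies with an error, rather than silently passing the input through, keeps a typo in the configuration from producing a run that looks attacked but is not.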

#### Behavioral Contracts
- [x] ContractEngine: (invariant × scenario) cells with optional reset (reset_endpoint / reset_function)
- [x] system_prompt_leak_probe via contract invariant `probes`; behavior_unchanged with baseline auto/manual
- [x] Stateful detection and warning when no reset is configured

#### Replay Regression
- [x] ReplaySessionConfig with `file` (load from file) or inline id/input; validation requires id and input when no file is given
- [x] ReplayConfig.sources (LangSmith project or run_id) with auto_import
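
The ReplaySessionConfig validation rule above (a session either points at a file or carries both an id and an input) can be expressed as a small check at construction time. This sketch uses a standard-library dataclass named `ReplaySession` as a stand-in; the real model is presumably a Pydantic class with more fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplaySession:
    """Stand-in for ReplaySessionConfig: load from a file, or define the session inline."""

    file: Optional[str] = None
    id: Optional[str] = None
    input: Optional[str] = None

    def __post_init__(self) -> None:
        # Inline sessions (no file) must carry both an id and an input.
        if self.file is None and not (self.id and self.input):
            raise ValueError("replay session needs `file`, or both `id` and `input`")
```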

#### Scoring & Config
- [x] ScoringConfig weights (mutation, chaos, contract, replay) must sum to 1.0
- [x] AgentConfig.reset_endpoint, reset_function; ModelConfig api_key env-only
- [x] Mutation count max 50 (OSS); 22+ mutation types
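
The sum-to-1.0 constraint on the scoring weights above is easy to get wrong with floating-point comparison, so a tolerance-based check is one reasonable approach. In the sketch below, the `ScoringWeights` class and the `combine` method are hypothetical, and combining the four component scores as a weighted sum is an assumption about how the overall score is formed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringWeights:
    """Hypothetical stand-in for ScoringConfig: four weights that must sum to 1.0."""

    mutation: float
    chaos: float
    contract: float
    replay: float

    def __post_init__(self) -> None:
        total = self.mutation + self.chaos + self.contract + self.replay
        # Compare with a tolerance: 0.4 + 0.3 + 0.2 + 0.1 is not exactly 1.0 in binary floats.
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"scoring weights must sum to 1.0, got {total}")

    def combine(self, mutation: float, chaos: float, contract: float, replay: float) -> float:
        """Assumed aggregation: weighted sum of the four component scores."""
        return (
            self.mutation * mutation
            + self.chaos * chaos
            + self.contract * contract
            + self.replay * replay
        )
```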

#### HuggingFace Integration
- [x] Create HuggingFaceModelProvider
- [x] Support GGUF model downloading
- [x] Add recommended models list
- [x] Integrate with Ollama model importing

#### Vector Similarity
- [x] Create LocalEmbedder class
- [x] Integrate sentence-transformers
- [x] Implement similarity calculation
- [x] Add lazy model loading
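
The similarity calculation and lazy model loading items above can be sketched without any third-party dependency by injecting the model loader. The `LazyEmbedder` class and its `load_model` parameter are hypothetical; cosine similarity is assumed here as the similarity measure, which is a common choice for sentence embeddings but is not confirmed by this checklist.

```python
import math
from typing import Callable, Optional, Sequence

EmbedFn = Callable[[str], Sequence[float]]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


class LazyEmbedder:
    """Hypothetical embedder that defers the expensive model load until first use."""

    def __init__(self, load_model: Callable[[], EmbedFn]) -> None:
        self._load_model = load_model
        self._model: Optional[EmbedFn] = None

    @property
    def model(self) -> EmbedFn:
        if self._model is None:
            # First access pays the loading cost; later accesses reuse the model.
            self._model = self._load_model()
        return self._model

    def similarity(self, a: str, b: str) -> float:
        return cosine_similarity(self.model(a), self.model(b))
```

With sentence-transformers installed, the loader could return the `encode` method of a `SentenceTransformer` instance; which model to load is left open here. Deferring the load means commands that never compute similarity do not pay the model start-up cost.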

---

### Testing & Quality

#### Unit Tests
- [x] Test configuration loading
- [x] Test mutation types
- [x] Test assertion checkers
- [ ] Test agent adapters
- [ ] Test orchestrator
- [ ] Test report generation

#### Integration Tests
- [ ] Test full run with mock agent
- [ ] Test CLI commands
- [ ] Test report generation

#### Documentation
- [x] Write README.md
- [x] Create IMPLEMENTATION_CHECKLIST.md
- [x] Create ARCHITECTURE_SUMMARY.md
- [x] Create API_SPECIFICATION.md
- [x] Create CONTRIBUTING.md
- [x] Create CONFIGURATION_GUIDE.md

---

### Phase 6: Essential Mutations (Week 7-8)

#### Core Mutation Types
- [x] Add ENCODING_ATTACKS mutation type
- [x] Add CONTEXT_MANIPULATION mutation type
- [x] Add LENGTH_EXTREMES mutation type
- [x] Update MutationType enum with all 8 types
- [x] Create templates for new mutation types
- [x] Update mutation validation for edge cases

#### Configuration Updates
- [x] Update MutationConfig defaults
- [x] Update example configuration files
- [x] Update orchestrator comments

#### Documentation Updates
- [x] Update README.md with comprehensive mutation types table
- [x] Add Mutation Strategy section to README
- [x] Update API_SPECIFICATION.md with all 8 types
- [x] Update MODULES.md with detailed mutation documentation
- [x] Add Mutation Types Guide to CONFIGURATION_GUIDE.md
- [x] Add Understanding Mutation Types to USAGE_GUIDE.md
- [x] Add Mutation Type Deep Dive to TEST_SCENARIOS.md

---

## Progress Summary

| Phase | Status | Completion |
|-------|--------|------------|
| CLI Phase 1: Foundation | ✅ Complete | 100% |
| CLI Phase 2: Mutation Engine | ✅ Complete | 100% |
| CLI Phase 3: Runner & Assertions | ✅ Complete | 100% |
| CLI Phase 4: CLI & Reporting | ✅ Complete | 100% |
| CLI Phase 5: V2 Features | ✅ Complete | 90% |
| CLI Phase 6: Essential Mutations | ✅ Complete | 100% |
| Documentation | ✅ Complete | 100% |

---

## Next Steps

### Immediate (Current Sprint)
1. **Rust Build**: Compile and integrate Rust performance module
2. **Integration Tests**: Add full integration test suite
3. **PyPI Release**: Prepare and publish to PyPI
4. **Community Launch**: Publish to Hacker News and Reddit

### Future Roadmap
See [ROADMAP.md](ROADMAP.md) for the full roadmap of advanced chaos engineering and adversarial testing features. These features are open to community contribution; see [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.