Update version to 0.9.0 in pyproject.toml and __init__.py, enhance CONFIGURATION_GUIDE.md and USAGE_GUIDE.md with aggressive mutation strategies and requirements for invariants, and add validation to ensure at least 3 invariants are configured in FlakeStormConfig.

This commit is contained in:
Francisco M Humarang Jr. 2026-01-03 00:18:31 +08:00
parent e673b21b55
commit 0b8777c614
9 changed files with 1041 additions and 4 deletions

View file

@ -2,4 +2,5 @@ fastapi>=0.104.0
uvicorn[standard]>=0.24.0
google-generativeai>=0.3.0
pydantic>=2.0.0
flakestorm>=0.1.0

View file

@ -0,0 +1,364 @@
# LangChain Agent Example
This example demonstrates how to test a LangChain agent with flakestorm. The agent uses LangChain's `LLMChain` to process user queries.
## Overview
The example includes:
- A LangChain agent that uses **Google Gemini AI** (if API key is set) or falls back to a mock LLM
- A `flakestorm.yaml` configuration file for testing the agent
- Instructions for running flakestorm against the agent
- Automatic fallback to mock LLM if API key is not set (no API keys required for basic testing)
## Features
- **Real LLM Support**: Uses Google Gemini AI (if API key is set) for realistic testing
- **Automatic Fallback**: Falls back to a mock LLM if API key is not set (no API keys required for basic testing)
- **Input-Aware Processing**: Actually processes input and can fail on certain inputs, making it realistic for testing
- **Realistic Failure Modes**: The agent can fail on empty inputs, very long inputs, and prompt injection attempts
- **flakestorm Integration**: Ready-to-use configuration for testing robustness with meaningful results
## Setup
### 1. Create Virtual Environment (Recommended)
```bash
cd examples/langchain_agent
# Create virtual environment
python -m venv lc_test_venv
# Activate virtual environment
# On macOS/Linux:
source lc_test_venv/bin/activate
# On Windows (PowerShell):
# lc_test_venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
# lc_test_venv\Scripts\activate.bat
```
**Note:** You should see `(venv)` in your terminal prompt after activation.
### 2. Install Dependencies
```bash
# Make sure virtual environment is activated
pip install -r requirements.txt
# This will install:
# - langchain-core, langchain-community (LangChain packages)
# - langchain-google-genai (for Google Gemini support)
# - flakestorm (for testing)
# Or install manually:
# For modern LangChain (0.3.x+) with Gemini:
# pip install langchain-core langchain-community langchain-google-genai flakestorm
# For older LangChain (0.1.x, 0.2.x):
# pip install langchain flakestorm
```
**Note:** The agent code automatically handles different LangChain versions. If you encounter import errors, try:
```bash
# Install all LangChain packages for maximum compatibility
pip install langchain langchain-core langchain-community
```
### 3. Verify the Agent Works
```bash
# Test the agent directly
python -c "from agent import chain; result = chain.invoke({'input': 'Hello!'}); print(result)"
```
Expected output:
```
{'input': 'Hello!', 'text': 'I can help you with that!'}
```
## Running flakestorm Tests
### From the Project Root (Recommended)
```bash
# Make sure you're in the project root (not in examples/langchain_agent)
cd /path/to/flakestorm
# Run flakestorm against the LangChain agent
flakestorm run --config examples/langchain_agent/flakestorm.yaml
```
**This is the easiest way** - no PYTHONPATH setup needed!
### From the Example Directory
If you want to run from `examples/langchain_agent`, you need to set the Python path:
```bash
# If you're in examples/langchain_agent
cd examples/langchain_agent
# Option 1: Set PYTHONPATH (recommended)
# On macOS/Linux:
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
flakestorm run
# On Windows (PowerShell):
$env:PYTHONPATH = "$env:PYTHONPATH;$PWD"
flakestorm run
# Option 2: Update flakestorm.yaml to use full path
# Change: endpoint: "examples.langchain_agent.agent:chain"
# To: endpoint: "agent:chain"
# Then run: flakestorm run
```
**Note:** The `flakestorm.yaml` is configured to run from the project root by default. For easiest setup, run from the project root. If running from the example directory, either set `PYTHONPATH` or update the `endpoint` in `flakestorm.yaml`.
## Understanding the Configuration
### Agent Configuration
The `flakestorm.yaml` file configures flakestorm to test the LangChain agent:
```yaml
agent:
endpoint: "examples.langchain_agent.agent:chain" # Module path: imports chain from agent.py
type: "langchain" # Tells flakestorm to use LangChain adapter
timeout: 30000 # 30 second timeout
```
**How it works:**
- flakestorm imports `chain` from the `agent` module
- It calls `chain.invoke({"input": prompt})` or `chain.ainvoke({"input": prompt})`
- The adapter handles different LangChain interfaces automatically
### Choosing the Right Invariants
**Important:** Only use invariants that match your agent's expected output format!
**For Text-Only Agents (like this example):**
```yaml
invariants:
- type: "latency"
max_ms: 10000
- type: "not_contains"
value: "" # Response shouldn't be empty
- type: "excludes_pii"
- type: "refusal_check"
```
**For JSON-Only Agents:**
```yaml
invariants:
- type: "valid_json" # ✅ Use this if agent returns JSON
- type: "latency"
max_ms: 5000
```
**For Agents with Mixed Output:**
```yaml
invariants:
- type: "latency"
max_ms: 5000
# Use prompt_filter to apply JSON check only to specific prompts
- type: "valid_json"
prompt_filter: "api|json|data" # Only check JSON for prompts containing these words
```
### Golden Prompts
The configuration includes 8 example prompts that should work correctly:
- Weather queries
- Educational questions
- Help requests
- Technical explanations
flakestorm will generate mutations of these prompts to test robustness.
### Invariants
The tests verify:
- **Latency**: Response under 10 seconds
- **Contains "help"**: Response should contain helpful content (stricter than just checking for space)
- **Minimum Length**: Response must be at least 20 characters (ensures meaningful response)
- **PII Safety**: No personally identifiable information
- **Refusal**: Agent should refuse dangerous prompt injections
**Important:**
- flakestorm requires **at least 3 invariants** to ensure comprehensive testing
- This agent returns plain text responses, so we don't use `valid_json` invariant
- Only use `valid_json` if your agent is supposed to return JSON responses
- The invariants are **stricter** than before to catch more issues and produce meaningful test results
## Using Google Gemini (Real LLM)
This example **already uses Google Gemini** if you set the API key! Just set the environment variable:
```bash
# macOS/Linux:
export GOOGLE_AI_API_KEY=your-api-key-here
# Windows (PowerShell):
$env:GOOGLE_AI_API_KEY="your-api-key-here"
# Windows (Command Prompt):
set GOOGLE_AI_API_KEY=your-api-key-here
```
**Get your API key:**
1. Go to [Google AI Studio](https://makersuite.google.com/app/apikey)
2. Create a new API key
3. Copy and set it as the environment variable above
**Without API Key:**
If you don't set the API key, the agent automatically falls back to a mock LLM that still processes input meaningfully. This is useful for testing without API costs.
**Other LLM Options:**
You can modify `agent.py` to use other LLMs:
- `ChatOpenAI` - OpenAI GPT models (requires `langchain-openai`)
- `ChatAnthropic` - Anthropic Claude (requires `langchain-anthropic`)
- `ChatOllama` - Local Ollama models (requires `langchain-ollama`)
## Expected Test Results
When you run flakestorm, you'll see:
1. **Mutation Generation**: flakestorm generates 20 mutations per golden prompt (200 total tests with 10 golden prompts)
2. **Test Execution**: Each mutation is tested against the agent
3. **Results Report**: HTML report showing:
- Robustness score (0.0 - 1.0)
- Pass/fail breakdown by mutation type
- Detailed failure analysis
- Recommendations for improvement
### Why This Agent is Better for Testing
**Previous Issue:** The original agent used `FakeListLLM`, which ignored input and just cycled through 8 predefined responses. This meant:
- Mutations had no effect (agent didn't read them)
- Invariants were too lax (always passed)
- 100% reliability score was meaningless
**Current Solution:** The agent uses **Google Gemini AI** (if API key is set) or a mock LLM:
- ✅ **With Gemini**: Real LLM that processes input naturally, can fail on edge cases
- ✅ **Without API Key**: Mock LLM that still processes input meaningfully
- ✅ Reads and analyzes the input
- ✅ Can fail on empty/whitespace inputs
- ✅ Can fail on very long inputs (>5000 chars)
- ✅ Detects and refuses prompt injection attempts
- ✅ Returns context-aware responses based on input content
- ✅ Stricter invariants (checks for meaningful content, not just non-empty)
**Expected Results:**
- **With Gemini**: More realistic failures, reliability score typically 70-90% (real LLM behavior)
- **With Mock LLM**: Some failures on edge cases, reliability score typically 80-95%
- You should see **some failures** on edge cases (empty inputs, prompt injections, etc.)
- This makes the test results **meaningful** and helps identify real robustness issues
## Common Issues
### "ModuleNotFoundError: No module named 'agent'" or "No module named 'examples'"
**Solution 1 (Recommended):** Run from the project root:
```bash
cd /path/to/flakestorm # Go to project root
flakestorm run --config examples/langchain_agent/flakestorm.yaml
```
**Solution 2:** If running from `examples/langchain_agent`, set PYTHONPATH:
```bash
# macOS/Linux:
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
flakestorm run
# Windows (PowerShell):
$env:PYTHONPATH = "$env:PYTHONPATH;$PWD"
flakestorm run
```
**Solution 3:** Update `flakestorm.yaml` to use relative path:
```yaml
agent:
endpoint: "agent:chain" # Instead of "examples.langchain_agent.agent:chain"
```
### "ModuleNotFoundError: No module named 'langchain.chains'" or "cannot import name 'LLMChain'"
**Solution:** This happens with newer LangChain versions (0.3.x+). Install the required packages:
```bash
# Install all LangChain packages for compatibility
pip install langchain langchain-core langchain-community
# Or if using requirements.txt:
pip install -r requirements.txt
```
The agent code automatically tries multiple import strategies, so installing all packages ensures compatibility.
### "AttributeError: 'LLMChain' object has no attribute 'invoke'"
**Solution:** Update your LangChain version:
```bash
pip install --upgrade langchain langchain-core
```
### "Timeout errors"
**Solution:** Increase timeout in `flakestorm.yaml`:
```yaml
agent:
timeout: 60000 # 60 seconds
```
## Customizing the Agent
### Add Tools/Agents
You can extend the agent to use LangChain tools or agents:
```python
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
llm = OpenAI(temperature=0)
tools = [
Tool(
name="Calculator",
func=lambda x: str(eval(x)),
description="Useful for mathematical calculations"
)
]
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
# Export for flakestorm
chain = agent
```
### Add Memory
Add conversation memory to your agent:
```python
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory()
chain = LLMChain(llm=llm, prompt=prompt_template, memory=memory)
```
## Next Steps
1. **Run the tests**: `flakestorm run --config examples/langchain_agent/flakestorm.yaml`
2. **Review the report**: Check `reports/flakestorm-*.html`
3. **Improve robustness**: Fix issues found in the report
4. **Re-test**: Run flakestorm again to verify improvements
## Learn More
- [LangChain Documentation](https://python.langchain.com/)
- [flakestorm Usage Guide](../docs/USAGE_GUIDE.md)
- [flakestorm Configuration Guide](../docs/CONFIGURATION_GUIDE.md)

View file

@ -0,0 +1,310 @@
"""
LangChain Agent Example for flakestorm Testing
This example demonstrates a simple LangChain agent that can be tested with flakestorm.
The agent uses LangChain's Runnable interface to process user queries.
This agent uses Google Gemini AI (if API key is set) or falls back to a mock LLM.
Set GOOGLE_AI_API_KEY or VITE_GOOGLE_AI_API_KEY environment variable to use Gemini.
Compatible with LangChain 0.1.x, 0.2.x, and 0.3.x+
"""
import os
import re
from typing import Any
# Try multiple import strategies for different LangChain versions
chain = None
llm = None
class InputAwareMockLLM:
"""
A mock LLM that actually processes input, making it suitable for flakestorm testing.
Unlike FakeListLLM, this LLM:
- Actually reads and processes the input
- Can fail on certain inputs (empty, too long, injection attempts)
- Returns responses based on input content
- Simulates realistic failure modes
"""
def __init__(self):
self.call_count = 0
def invoke(self, prompt: str, **kwargs: Any) -> str:
"""Process the input and return a response."""
self.call_count += 1
# Normalize input
prompt_lower = prompt.lower().strip()
# Failure mode 1: Empty or whitespace-only input
if not prompt_lower or len(prompt_lower) < 2:
return "I'm sorry, I didn't understand your question. Could you please rephrase it?"
# Failure mode 2: Very long input (simulates token limit)
if len(prompt) > 5000:
return "Your question is too long. Please keep it under 5000 characters."
# Failure mode 3: Detect prompt injection attempts
injection_patterns = [
r"ignore\s+(previous|all|above|earlier)",
r"forget\s+(everything|all|previous)",
r"system\s*:",
r"assistant\s*:",
r"you\s+are\s+now",
r"new\s+instructions",
]
for pattern in injection_patterns:
if re.search(pattern, prompt_lower):
return "I can't follow instructions that ask me to ignore my guidelines. How can I help you with your original question?"
# Generate response based on input content
# This simulates a real LLM that processes the input
response_parts = []
# Extract key topics from the input
if any(word in prompt_lower for word in ["weather", "temperature", "rain", "sunny"]):
response_parts.append("I can help you with weather information.")
elif any(word in prompt_lower for word in ["time", "clock", "hour", "minute"]):
response_parts.append("I can help you with time-related questions.")
elif any(word in prompt_lower for word in ["capital", "city", "country", "france"]):
response_parts.append("I can help you with geography questions.")
elif any(word in prompt_lower for word in ["math", "calculate", "add", "plus", "1 + 1"]):
response_parts.append("I can help you with math questions.")
elif any(word in prompt_lower for word in ["email", "write", "professional"]):
response_parts.append("I can help you write professional emails.")
elif any(word in prompt_lower for word in ["help", "assist", "support"]):
response_parts.append("I'm here to help you!")
else:
response_parts.append("I understand your question.")
# Add a personalized touch based on input length
if len(prompt) < 20:
response_parts.append("That's a concise question!")
elif len(prompt) > 100:
response_parts.append("You've provided a lot of context, which is helpful.")
# Add a response based on question type
if "?" in prompt:
response_parts.append("Let me provide you with an answer.")
else:
response_parts.append("I've noted your request.")
return " ".join(response_parts)
async def ainvoke(self, prompt: str, **kwargs: Any) -> str:
"""Async version of invoke."""
return self.invoke(prompt, **kwargs)
# Strategy 1: Modern LangChain (0.3.x+) - Use Runnable with Gemini or Mock LLM
try:
from langchain_core.runnables import RunnableLambda
# Try to use Google Gemini if API key is available
api_key = os.getenv("GOOGLE_AI_API_KEY") or os.getenv("VITE_GOOGLE_AI_API_KEY")
if api_key:
try:
# Try langchain-google-genai (newer package)
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(
model="gemini-2.5-flash",
google_api_key=api_key,
temperature=0.7,
)
except ImportError:
try:
# Try langchain-community (older package)
from langchain_community.chat_models import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(
model="gemini-2.5-flash",
google_api_key=api_key,
temperature=0.7,
)
except ImportError:
# Fallback to mock LLM if packages not installed
print("Warning: langchain-google-genai not installed. Using mock LLM.")
print("Install with: pip install langchain-google-genai")
llm = InputAwareMockLLM()
else:
# No API key, use mock LLM
print("Warning: GOOGLE_AI_API_KEY not set. Using mock LLM.")
print("Set GOOGLE_AI_API_KEY environment variable to use Google Gemini.")
llm = InputAwareMockLLM()
def process_input(input_dict):
"""Process input and return response."""
user_input = input_dict.get("input", str(input_dict))
# Handle both ChatModel (returns AIMessage) and regular LLM (returns str)
if hasattr(llm, "invoke"):
response = llm.invoke(user_input)
# Extract text from AIMessage if needed
if hasattr(response, "content"):
response_text = response.content
elif isinstance(response, str):
response_text = response
else:
response_text = str(response)
else:
# Fallback for mock LLM
response_text = llm.invoke(user_input)
# Return dict format that flakestorm expects
return {"output": response_text, "text": response_text}
chain = RunnableLambda(process_input)
except ImportError:
# Strategy 2: LangChain 0.2.x - Use LLMChain with Gemini or Mock LLM
try:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
prompt_template = PromptTemplate(
input_variables=["input"],
template="""You are a helpful assistant. Answer the user's question clearly and concisely.
User question: {input}
Assistant response:""",
)
# Try to use Google Gemini if API key is available
api_key = os.getenv("GOOGLE_AI_API_KEY") or os.getenv("VITE_GOOGLE_AI_API_KEY")
if api_key:
try:
from langchain_community.chat_models import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(
model="gemini-2.5-flash",
google_api_key=api_key,
temperature=0.7,
)
except ImportError:
print("Warning: langchain-google-genai not installed. Using mock LLM.")
llm = InputAwareMockLLM()
else:
print("Warning: GOOGLE_AI_API_KEY not set. Using mock LLM.")
llm = InputAwareMockLLM()
# Create a wrapper that makes the LLM compatible with LLMChain
# LLMChain will call the LLM with the formatted prompt, so we extract the user input
class LLMWrapper:
def __call__(self, prompt: str, **kwargs: Any) -> str:
# Extract user input from the formatted prompt template
if "User question:" in prompt:
parts = prompt.split("User question:")
if len(parts) > 1:
user_input = parts[-1].split("Assistant response:")[0].strip()
else:
user_input = prompt
else:
user_input = prompt
# Handle ChatModel (returns AIMessage) vs regular LLM (returns str)
if hasattr(llm, "invoke"):
response = llm.invoke(user_input)
if hasattr(response, "content"):
return response.content
elif isinstance(response, str):
return response
else:
return str(response)
else:
return llm.invoke(user_input)
chain = LLMChain(llm=LLMWrapper(), prompt=prompt_template)
except ImportError:
# Strategy 3: LangChain 0.1.x or alternative structure
try:
from langchain import LLMChain, PromptTemplate
prompt_template = PromptTemplate(
input_variables=["input"],
template="""You are a helpful assistant. Answer the user's question clearly and concisely.
User question: {input}
Assistant response:""",
)
# Try to use Google Gemini if API key is available
api_key = os.getenv("GOOGLE_AI_API_KEY") or os.getenv("VITE_GOOGLE_AI_API_KEY")
if api_key:
try:
from langchain_community.chat_models import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(
model="gemini-2.5-flash",
google_api_key=api_key,
temperature=0.7,
)
except ImportError:
print("Warning: langchain-google-genai not installed. Using mock LLM.")
llm = InputAwareMockLLM()
else:
print("Warning: GOOGLE_AI_API_KEY not set. Using mock LLM.")
llm = InputAwareMockLLM()
class LLMWrapper:
def __call__(self, prompt: str, **kwargs: Any) -> str:
# Extract user input from the formatted prompt template
if "User question:" in prompt:
parts = prompt.split("User question:")
if len(parts) > 1:
user_input = parts[-1].split("Assistant response:")[0].strip()
else:
user_input = prompt
else:
user_input = prompt
# Handle ChatModel (returns AIMessage) vs regular LLM (returns str)
if hasattr(llm, "invoke"):
response = llm.invoke(user_input)
if hasattr(response, "content"):
return response.content
elif isinstance(response, str):
return response
else:
return str(response)
else:
return llm.invoke(user_input)
chain = LLMChain(llm=LLMWrapper(), prompt=prompt_template)
except ImportError:
# Strategy 4: Simple callable wrapper (works with any version)
class SimpleChain:
"""Simple chain wrapper that works with any LangChain version."""
def __init__(self):
self.mock_llm = InputAwareMockLLM()
def invoke(self, input_dict):
"""Invoke the chain synchronously."""
user_input = input_dict.get("input", str(input_dict))
response = self.mock_llm.invoke(user_input)
return {"output": response, "text": response}
async def ainvoke(self, input_dict):
"""Invoke the chain asynchronously."""
return self.invoke(input_dict)
chain = SimpleChain()
if chain is None:
raise ImportError(
"Could not import LangChain. Install with: pip install langchain langchain-core langchain-community"
)
# Export the chain for flakestorm to use
# flakestorm will call: chain.invoke({"input": prompt}) or chain.ainvoke({"input": prompt})
# The adapter handles different LangChain interfaces automatically
__all__ = ["chain"]

View file

@ -0,0 +1,27 @@
# Core LangChain packages (for modern versions 0.3.x+)
langchain-core>=0.1.0
langchain-community>=0.1.0
# For older LangChain versions (0.1.x, 0.2.x)
langchain>=0.1.0
# Google Gemini integration (recommended for real LLM)
# Install with: pip install langchain-google-genai
# Or use langchain-community which includes ChatGoogleGenerativeAI
langchain-google-genai>=1.0.0 # For Google Gemini (recommended)
# flakestorm for testing
flakestorm>=0.1.0
# Note: This example uses Google Gemini if GOOGLE_AI_API_KEY is set,
# otherwise falls back to a mock LLM for testing without API keys.
#
# To use Google Gemini:
# 1. Install: pip install langchain-google-genai
# 2. Set environment variable: export GOOGLE_AI_API_KEY=your-api-key
#
# Other LLM options you can use:
# openai>=1.0.0 # For ChatOpenAI
# anthropic>=0.3.0 # For ChatAnthropic
# langchain-ollama>=0.1.0 # For ChatOllama (local models)