mirror of
https://github.com/dograh-hq/dograh.git
synced 2026-06-10 08:05:22 +02:00
219 lines
6 KiB
Markdown
219 lines
6 KiB
Markdown
# Gender Prediction Service
|
|
|
|
An internal service for predicting gender from first names using SSA (Social Security Administration) baby names data with GenderAPI fallback for uncertain predictions.
|
|
|
|
## Overview
|
|
|
|
This service provides gender prediction with:
|
|
- **Local model** built from 145 years of SSA data (1880-2024)
|
|
- **104,819 unique names** with confidence scores
|
|
- **Compressed storage** (2.21 MB model file)
|
|
- **GenderAPI fallback** for unknown or low-confidence names
|
|
|
|
## Data Source
|
|
|
|
The SSA baby names data is already downloaded in the `names/` directory from:
|
|
https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
|
|
|
|
## Building the Model
|
|
|
|
### Prerequisites
|
|
- Python 3.11+
|
|
- SSA data files in `names/` directory (already included)
|
|
|
|
### Build Steps
|
|
|
|
```bash
|
|
# Navigate to the gender service directory
|
|
cd dograh/api/services/gender/
|
|
|
|
# Run the model builder
|
|
python build_model.py
|
|
```
|
|
|
|
This will:
|
|
1. Process all 145 year files (yob1880.txt to yob2024.txt)
|
|
2. Aggregate name counts across all years
|
|
3. Calculate confidence scores based on gender ratios
|
|
4. Generate a compressed `model.txt` file (~2.21 MB)
|
|
|
|
### Model Output
|
|
|
|
The builder generates `model.txt` with:
|
|
- **Version**: Model version number
|
|
- **Metadata**: Build date, statistics, thresholds
|
|
- **Names**: Compressed array format `[male_count, female_count, confidence]`
|
|
|
|
Example output:
|
|
```
|
|
Model saved to: .../services/gender/model.txt
|
|
File size: 2.21 MB
|
|
|
|
Model statistics:
|
|
Total names: 104,819
|
|
High confidence names (≥0.85): 1,711
|
|
Confidence percentage: 1.6%
|
|
```
|
|
|
|
## Using the Service
|
|
|
|
### Basic Usage
|
|
|
|
```python
|
|
from services.gender.gender_service import GenderService
|
|
|
|
# Initialize the service
|
|
service = GenderService()
|
|
|
|
# Predict gender for a single name
|
|
result = await service.predict("John")
|
|
print(f"Gender: {result.gender}") # "male"
|
|
print(f"Confidence: {result.confidence}") # 0.996
|
|
print(f"Source: {result.source}") # "model"
|
|
|
|
# Get salutation for a name
|
|
greeting = await service.get_salutation("John")
|
|
print(f"Salutation: {greeting}") # "Mr."
|
|
|
|
greeting = await service.get_salutation("Mary")
|
|
print(f"Salutation: {greeting}") # "Ms."
|
|
|
|
greeting = await service.get_salutation("Unknown")
|
|
print(f"Salutation: {greeting}") # "Dear"
|
|
|
|
# Clean up
|
|
await service.close()
|
|
```
|
|
|
|
### Configuration Options
|
|
|
|
```python
|
|
# Custom configuration
|
|
service = GenderService(
|
|
model_path="custom/path/to/model.txt", # Default: ./model.txt
|
|
confidence_threshold=0.85, # Default: 0.85
|
|
gender_api_key="your-api-key", # Default: from GENDER_API_KEY env
|
|
gender_api_url="https://..." # Default: GenderAPI v2 endpoint
|
|
)
|
|
```
|
|
|
|
### Salutation Generation
|
|
|
|
```python
|
|
# Get appropriate salutation based on gender
|
|
salutation = await service.get_salutation("John") # "Mr."
|
|
salutation = await service.get_salutation("Mary") # "Ms."
|
|
salutation = await service.get_salutation("Unknown") # "Dear"
|
|
|
|
# Custom confidence threshold for salutation
|
|
salutation = await service.get_salutation(
|
|
"Taylor", # Ambiguous name
|
|
confidence_threshold=0.9 # Higher threshold
|
|
) # Returns "Dear" due to low confidence
|
|
|
|
# Salutation logic:
|
|
# - "Mr." for male with confidence >= threshold
|
|
# - "Ms." for female with confidence >= threshold
|
|
# - "Dear" for unknown gender or low confidence
|
|
```
|
|
|
|
### Batch Predictions
|
|
|
|
```python
|
|
# Predict multiple names at once
|
|
names = ["Alice", "Bob", "Charlie", "Diana"]
|
|
results = await service.batch_predict(names)
|
|
|
|
for name, result in zip(names, results):
|
|
print(f"{name}: {result.gender} ({result.confidence:.2f})")
|
|
```
|
|
|
|
### Response Format
|
|
|
|
```python
|
|
class GenderPrediction:
|
|
gender: "male" | "female" | "unknown" # Predicted gender
|
|
confidence: float # 0.0 to 1.0
|
|
source: "model" | "genderapi" # Prediction source
|
|
```
|
|
|
|
### Service Statistics
|
|
|
|
```python
|
|
# Get service statistics
|
|
stats = await service.get_stats()
|
|
print(f"Total names: {stats['model']['total_names']:,}")
|
|
print(f"High confidence: {stats['model']['high_confidence_names']:,}")
|
|
print(f"Cached names in Redis: {stats['cache']['cached_names']}")
|
|
print(f"Cache TTL: {stats['cache']['ttl_seconds']} seconds")
|
|
print(f"API enabled: {stats['api']['enabled']}")
|
|
```
|
|
|
|
### Cache Management
|
|
|
|
```python
|
|
# Clear Redis cache
|
|
await service.clear_cache()
|
|
```
|
|
|
|
## Environment Variables
|
|
|
|
```bash
|
|
# Required: Redis connection URL
|
|
export REDIS_URL=redis://localhost:6379
|
|
|
|
# Optional: Set GenderAPI key for fallback
|
|
export GENDERAPI_API_KEY=your-api-key-here
|
|
|
|
# Optional: Override confidence threshold (default: 0.85)
|
|
export CONFIDENCE_THRESHOLD=0.85
|
|
```
|
|
|
|
## How It Works
|
|
|
|
1. **Name normalization**: Converts to lowercase, strips whitespace
|
|
2. **Local model check**: Looks up name in pre-built model
|
|
3. **Confidence evaluation**: If confidence ≥ 0.85, returns local prediction
|
|
4. **Redis cache check**: Checks Redis for previously fetched API results
|
|
5. **API fallback**: For unknown/low-confidence names, calls GenderAPI
|
|
6. **Redis caching**: Stores API responses in Redis with 30-day TTL
|
|
|
|
## Testing
|
|
|
|
Run the test suite to verify the service:
|
|
|
|
```bash
|
|
python test_service.py
|
|
```
|
|
|
|
This tests:
|
|
- High-confidence predictions
|
|
- Ambiguous names
|
|
- Edge cases (empty strings, special characters)
|
|
- International names (with API key)
|
|
- Batch predictions
|
|
|
|
## Model Updates
|
|
|
|
The model should be rebuilt annually when new SSA data is released:
|
|
|
|
1. Download new year file (e.g., yob2025.txt) to `names/` directory
|
|
2. Run `python build_model.py` to rebuild
|
|
3. Test with `python test_service.py`
|
|
4. Commit the updated `model.txt`
|
|
|
|
## Performance
|
|
|
|
- **Model size**: 2.21 MB (compressed JSON)
|
|
- **Load time**: < 100ms
|
|
- **Prediction time**: < 1ms (local), < 5ms (Redis cache), < 500ms (API)
|
|
- **Memory usage**: ~10 MB for model in memory
|
|
- **Cache**: Redis-based with 30-day TTL
|
|
- **Scalability**: Shared cache across all service instances
|
|
|
|
## Limitations
|
|
|
|
- Based on US SSA data (may not work well for non-US names)
|
|
- Historical bias in older data
|
|
- Unisex names have lower confidence
|
|
- Requires GenderAPI key for comprehensive coverage
|