mirror of
https://github.com/dograh-hq/dograh.git
synced 2026-06-07 07:55:16 +02:00
Gender Prediction Service
An internal service for predicting gender from first names using SSA (Social Security Administration) baby names data with GenderAPI fallback for uncertain predictions.
Overview
This service provides gender prediction with:
- Local model built from 145 years of SSA data (1880-2024)
- 104,819 unique names with confidence scores
- Compressed storage (2.21 MB model file)
- GenderAPI fallback for unknown or low-confidence names
Data Source
The SSA baby names data is already downloaded in the names/ directory from:
https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
Building the Model
Prerequisites
- Python 3.11+
- SSA data files in the names/ directory (already included)
Build Steps
# Navigate to the gender service directory
cd dograh/api/services/gender/
# Run the model builder
python build_model.py
This will:
- Process all 145 year files (yob1880.txt to yob2024.txt)
- Aggregate name counts across all years
- Calculate confidence scores based on gender ratios
- Generate a compressed model.txt file (~2.21 MB)
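The aggregation and scoring steps above can be sketched roughly as follows. This is an illustration, not the actual build_model.py: the function names are hypothetical, and the confidence formula (the majority gender's share of all recorded births for a name) is an assumption about how the gender ratio becomes a score. The SSA year files hold one `Name,Sex,Count` record per line.

```python
import csv
import io
from collections import defaultdict


def aggregate_counts(year_files):
    """Sum male/female counts per lowercase name across SSA-style year files.

    Each element of year_files is an iterable of text lines: Name,Sex,Count.
    """
    totals = defaultdict(lambda: [0, 0])  # name -> [male_count, female_count]
    for lines in year_files:
        for name, sex, count in csv.reader(lines):
            index = 0 if sex == "M" else 1
            totals[name.lower()][index] += int(count)
    return totals


def score(male_count, female_count):
    """Return (gender, confidence) using the majority share (assumed formula)."""
    total = male_count + female_count
    if total == 0:
        return "unknown", 0.0
    if male_count >= female_count:
        return "male", male_count / total
    return "female", female_count / total


# Two tiny in-memory stand-ins for year files; the rows are illustrative only.
sample_files = [
    io.StringIO("Mary,F,7065\nJohn,M,9655\nJohn,F,46\n"),
    io.StringIO("Mary,F,6919\nJohn,M,8769\n"),
]
totals = aggregate_counts(sample_files)
for name, (male, female) in sorted(totals.items()):
    gender, confidence = score(male, female)
    print(f"{name}: [{male}, {female}, {confidence:.3f}] -> {gender}")
```

Under this assumed formula, a name recorded almost exclusively for one gender scores close to 1.0, while an evenly split name scores close to 0.5.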
Model Output
The builder generates model.txt with:
- Version: Model version number
- Metadata: Build date, statistics, thresholds
- Names: Compressed array format [male_count, female_count, confidence]
Example output:
Model saved to: .../services/gender/model.txt
File size: 2.21 MB
Model statistics:
Total names: 104,819
High confidence names (≥0.85): 1,711
Confidence percentage: 1.6%
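To make the file layout concrete, here is a minimal sketch of writing and reading a model with the three parts listed above. The key names (`version`, `metadata`, `names`), the use of gzip, and both helper functions are assumptions for illustration; the real model.txt may be encoded differently.

```python
import gzip
import json


def dump_model(names, version=1):
    """Serialize a name table to gzip-compressed JSON (assumed encoding)."""
    payload = {
        "version": version,
        "metadata": {"total_names": len(names)},
        "names": names,  # name -> [male_count, female_count, confidence]
    }
    text = json.dumps(payload, separators=(",", ":"))
    return gzip.compress(text.encode("utf-8"))


def load_model(blob):
    """Inverse of dump_model: decompress and parse the JSON payload."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Illustrative entries only; real counts come from the builder.
names = {"john": [18424, 46, 0.998], "mary": [0, 13984, 1.0]}
model = load_model(dump_model(names))
male_count, female_count, confidence = model["names"]["john"]
print(model["version"], male_count, female_count, confidence)
```

Storing each name as a three-element array rather than an object with named keys avoids repeating the key strings for every entry, which helps keep the file small.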
Using the Service
Basic Usage
from services.gender.gender_service import GenderService
# Initialize the service
service = GenderService()
# Predict gender for a single name (the awaits below must run inside an async function)
result = await service.predict("John")
print(f"Gender: {result.gender}") # "male"
print(f"Confidence: {result.confidence}") # 0.996
print(f"Source: {result.source}") # "model"
# Get salutation for a name
greeting = await service.get_salutation("John")
print(f"Salutation: {greeting}") # "Mr."
greeting = await service.get_salutation("Mary")
print(f"Salutation: {greeting}") # "Ms."
greeting = await service.get_salutation("Unknown")
print(f"Salutation: {greeting}") # "Dear"
# Clean up
await service.close()
Configuration Options
# Custom configuration
service = GenderService(
    model_path="custom/path/to/model.txt",  # Default: ./model.txt
    confidence_threshold=0.85,              # Default: 0.85
    gender_api_key="your-api-key",          # Default: from GENDERAPI_API_KEY env
    gender_api_url="https://..."            # Default: GenderAPI v2 endpoint
)
Salutation Generation
# Get appropriate salutation based on gender
salutation = await service.get_salutation("John") # "Mr."
salutation = await service.get_salutation("Mary") # "Ms."
salutation = await service.get_salutation("Unknown") # "Dear"
# Custom confidence threshold for salutation
salutation = await service.get_salutation(
    "Taylor",                  # Ambiguous name
    confidence_threshold=0.9   # Higher threshold
)  # Returns "Dear" due to low confidence
# Salutation logic:
# - "Mr." for male with confidence >= threshold
# - "Ms." for female with confidence >= threshold
# - "Dear" for unknown gender or low confidence
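The salutation rules listed above reduce to a small pure function. The sketch below is one way to express them; `choose_salutation` is a hypothetical helper, not part of the service's API.

```python
def choose_salutation(gender, confidence, threshold=0.85):
    """Map a prediction to a salutation using the documented rules."""
    if confidence >= threshold:
        if gender == "male":
            return "Mr."
        if gender == "female":
            return "Ms."
    # Unknown gender or confidence below the threshold
    return "Dear"


print(choose_salutation("male", 0.996))                   # Mr.
print(choose_salutation("female", 0.60, threshold=0.9))   # Dear
print(choose_salutation("unknown", 1.0))                  # Dear
```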
Batch Predictions
# Predict multiple names at once
names = ["Alice", "Bob", "Charlie", "Diana"]
results = await service.batch_predict(names)
for name, result in zip(names, results):
    print(f"{name}: {result.gender} ({result.confidence:.2f})")
Response Format
from typing import Literal

class GenderPrediction:
    gender: Literal["male", "female", "unknown"]  # Predicted gender
    confidence: float                             # 0.0 to 1.0
    source: Literal["model", "genderapi"]         # Prediction source
Service Statistics
# Get service statistics
stats = await service.get_stats()
print(f"Total names: {stats['model']['total_names']:,}")
print(f"High confidence: {stats['model']['high_confidence_names']:,}")
print(f"Cached names in Redis: {stats['cache']['cached_names']}")
print(f"Cache TTL: {stats['cache']['ttl_seconds']} seconds")
print(f"API enabled: {stats['api']['enabled']}")
Cache Management
# Clear Redis cache
await service.clear_cache()
Environment Variables
# Required: Redis connection URL
export REDIS_URL=redis://localhost:6379
# Optional: Set GenderAPI key for fallback
export GENDERAPI_API_KEY=your-api-key-here
# Optional: Override confidence threshold (default: 0.85)
export CONFIDENCE_THRESHOLD=0.85
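For illustration, the variables above could be read in Python along these lines. `load_settings` and the keys of the returned dictionary are hypothetical; only the variable names and the 0.85 default come from this document.

```python
import os


def load_settings(env=None):
    """Read the service settings from an environment mapping."""
    env = os.environ if env is None else env
    return {
        # Required: raises KeyError if REDIS_URL is missing
        "redis_url": env["REDIS_URL"],
        # Optional: None disables the GenderAPI fallback in this sketch
        "gender_api_key": env.get("GENDERAPI_API_KEY"),
        # Optional: falls back to the documented default of 0.85
        "confidence_threshold": float(env.get("CONFIDENCE_THRESHOLD", "0.85")),
    }


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"REDIS_URL": "redis://localhost:6379"})
print(settings)
```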
How It Works
- Name normalization: Converts to lowercase, strips whitespace
- Local model check: Looks up name in pre-built model
- Confidence evaluation: If confidence ≥ the configured threshold (default 0.85), returns the local prediction
- Redis cache check: Checks Redis for previously fetched API results
- API fallback: For unknown/low-confidence names, calls GenderAPI
- Redis caching: Stores API responses in Redis with 30-day TTL
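The lookup order above can be sketched as a single async function. This is a simplified stand-in, not the service's code: a plain dictionary replaces Redis (so there is no TTL), `fake_gender_api` is a placeholder that returns a canned answer, the "taylor" values are made up, and reporting cached results with source "genderapi" is an assumption.

```python
import asyncio

CONFIDENCE_THRESHOLD = 0.85
MODEL = {"john": ("male", 0.996), "taylor": ("female", 0.60)}  # illustrative values
CACHE = {}      # stands in for Redis; no TTL handling in this sketch
API_CALLS = []  # records which names reached the (fake) API


async def fake_gender_api(name):
    """Placeholder for the GenderAPI request; returns a canned answer."""
    API_CALLS.append(name)
    return ("female", 0.90)


async def predict(name):
    key = name.strip().lower()                   # 1. normalize
    local = MODEL.get(key)                       # 2. local model check
    if local and local[1] >= CONFIDENCE_THRESHOLD:
        return (*local, "model")                 # 3. confident local result
    if key in CACHE:                             # 4. cache check
        return (*CACHE[key], "genderapi")
    result = await fake_gender_api(key)          # 5. API fallback
    CACHE[key] = result                          # 6. cache the API response
    return (*result, "genderapi")


async def main():
    return [await predict(n) for n in ["  John ", "Taylor", "taylor"]]


results = asyncio.run(main())
for row in results:
    print(row)
```

The second lookup of "taylor" is served from the cache, so the placeholder API is called only once.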
Testing
Run the test suite to verify the service:
python test_service.py
This tests:
- High-confidence predictions
- Ambiguous names
- Edge cases (empty strings, special characters)
- International names (with API key)
- Batch predictions
Model Updates
The model should be rebuilt annually when new SSA data is released:
- Download the new year file (e.g., yob2025.txt) to the names/ directory
- Run python build_model.py to rebuild
- Test with python test_service.py
- Commit the updated model.txt
Performance
- Model size: 2.21 MB (compressed JSON)
- Load time: < 100ms
- Prediction time: < 1ms (local), < 5ms (Redis cache), < 500ms (API)
- Memory usage: ~10 MB for model in memory
- Cache: Redis-based with 30-day TTL
- Scalability: Shared cache across all service instances
Limitations
- Based on US SSA data (may not work well for non-US names)
- Historical bias in older data
- Unisex names have lower confidence
- Requires GenderAPI key for comprehensive coverage