# Gender Prediction Service

An internal service that predicts gender from first names using SSA (Social Security Administration) baby names data, with a GenderAPI fallback for unknown or low-confidence names.

## Overview

This service provides gender prediction with:
- **Local model** built from 145 years of SSA data (1880-2024)
- **104,819 unique names** with confidence scores
- **Compressed storage** (2.21 MB model file)
- **GenderAPI fallback** for unknown or low-confidence names

## Data Source

The SSA baby names data is already included in the `names/` directory. It was downloaded from:
https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data

## Building the Model

### Prerequisites
- Python 3.11+
- SSA data files in `names/` directory (already included)

### Build Steps

```bash
# Navigate to the gender service directory
cd dograh/api/services/gender/

# Run the model builder
python build_model.py
```

This will:
1. Process all 145 yearly files (`yob1880.txt` to `yob2024.txt`)
2. Aggregate name counts across all years
3. Calculate confidence scores based on gender ratios
4. Generate a compressed `model.txt` file (~2.21 MB)
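
Steps 2 and 3 can be sketched roughly as follows. SSA yearly files contain comma-separated `name,sex,count` lines. This README does not state the confidence formula, so the sketch assumes a simple majority share, `max(male, female) / total`; the real `build_model.py` may differ, for example by also weighting for sample size. The function names and the inlined sample lines are illustrative only.

```python
from collections import defaultdict


def aggregate_counts(lines):
    """Sum male/female counts per lowercase name from `name,sex,count` lines."""
    counts = defaultdict(lambda: [0, 0])  # name -> [male_count, female_count]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, sex, count = line.split(",")
        index = {"M": 0, "F": 1}[sex]  # raises KeyError on an unexpected sex code
        counts[name.lower()][index] += int(count)
    return counts


def majority_share(male_count, female_count):
    """Assumed confidence: fraction of records held by the majority gender."""
    total = male_count + female_count
    return max(male_count, female_count) / total if total else 0.0


# Illustrative sample lines, inlined instead of read from names/yob*.txt
sample_lines = [
    "Mary,F,7065", "John,M,9655", "John,F,46",
    "Mary,F,6919", "John,M,8769", "Mary,M,27",
]

# Same shape as the model entries: [male_count, female_count, confidence]
model = {
    name: [male, female, round(majority_share(male, female), 3)]
    for name, (male, female) in aggregate_counts(sample_lines).items()
}
```

Aggregating before computing the ratio means a name's confidence reflects all years combined rather than any single year.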

### Model Output

The builder generates `model.txt` with:
- **Version**: Model version number
- **Metadata**: Build date, statistics, thresholds
- **Names**: Compressed array format `[male_count, female_count, confidence]`

Example output:
```
Model saved to: .../services/gender/model.txt
File size: 2.21 MB

Model statistics:
  Total names: 104,819
  High confidence names (≥0.85): 1,711
  Confidence percentage: 1.6%
```

## Using the Service

### Basic Usage

```python
import asyncio

from services.gender.gender_service import GenderService


async def main():
    # Initialize the service
    service = GenderService()

    # Predict gender for a single name
    result = await service.predict("John")
    print(f"Gender: {result.gender}")          # "male"
    print(f"Confidence: {result.confidence}")  # 0.996
    print(f"Source: {result.source}")          # "model"

    # Get salutation for a name
    greeting = await service.get_salutation("John")
    print(f"Salutation: {greeting}")           # "Mr."

    greeting = await service.get_salutation("Mary")
    print(f"Salutation: {greeting}")           # "Ms."

    greeting = await service.get_salutation("Unknown")
    print(f"Salutation: {greeting}")           # "Dear"

    # Clean up
    await service.close()


# The service methods are coroutines, so they must run inside an event loop
asyncio.run(main())
```

The remaining snippets assume an initialized `service` and an async context (for example, inside `main()`).

### Configuration Options

```python
# Custom configuration
service = GenderService(
    model_path="custom/path/to/model.txt",  # Default: ./model.txt
    confidence_threshold=0.85,              # Default: 0.85
    gender_api_key="your-api-key",         # Default: from GENDERAPI_API_KEY env
    gender_api_url="https://..."           # Default: GenderAPI v2 endpoint
)
```

### Salutation Generation

```python
# Get appropriate salutation based on gender
salutation = await service.get_salutation("John")     # "Mr."
salutation = await service.get_salutation("Mary")     # "Ms."
salutation = await service.get_salutation("Unknown")  # "Dear"

# Custom confidence threshold for salutation
salutation = await service.get_salutation(
    "Taylor",                    # Ambiguous name
    confidence_threshold=0.9     # Higher threshold
)  # Returns "Dear" due to low confidence

# Salutation logic:
# - "Mr." for male with confidence >= threshold
# - "Ms." for female with confidence >= threshold
# - "Dear" for unknown gender or low confidence
```
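
The salutation rules in the comment above amount to a small decision function. The following is a minimal sketch of that logic only, not the service's actual code; `choose_salutation` is a hypothetical helper that takes an already computed gender and confidence.

```python
def choose_salutation(gender, confidence, confidence_threshold=0.85):
    """Map a prediction to "Mr.", "Ms.", or the neutral "Dear"."""
    if confidence >= confidence_threshold:
        if gender == "male":
            return "Mr."
        if gender == "female":
            return "Ms."
    # Unknown gender or confidence below the threshold
    return "Dear"
```

Falling back to "Dear" whenever the prediction is uncertain avoids addressing someone with the wrong title.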

### Batch Predictions

```python
# Predict multiple names at once
names = ["Alice", "Bob", "Charlie", "Diana"]
results = await service.batch_predict(names)

for name, result in zip(names, results):
    print(f"{name}: {result.gender} ({result.confidence:.2f})")
```

### Response Format

```python
from typing import Literal


class GenderPrediction:
    gender: Literal["male", "female", "unknown"]  # Predicted gender
    confidence: float                             # 0.0 to 1.0
    source: Literal["model", "genderapi"]         # Prediction source
```

### Service Statistics

```python
# Get service statistics
stats = await service.get_stats()
print(f"Total names: {stats['model']['total_names']:,}")
print(f"High confidence: {stats['model']['high_confidence_names']:,}")
print(f"Cached names in Redis: {stats['cache']['cached_names']}")
print(f"Cache TTL: {stats['cache']['ttl_seconds']} seconds")
print(f"API enabled: {stats['api']['enabled']}")
```

### Cache Management

```python
# Clear Redis cache
await service.clear_cache()
```

## Environment Variables

```bash
# Required: Redis connection URL
export REDIS_URL=redis://localhost:6379

# Optional: Set GenderAPI key for fallback
export GENDERAPI_API_KEY=your-api-key-here

# Optional: Override confidence threshold (default: 0.85)
export CONFIDENCE_THRESHOLD=0.85
```

## How It Works

1. **Name normalization**: Converts to lowercase, strips whitespace
2. **Local model check**: Looks up name in pre-built model
3. **Confidence evaluation**: If confidence meets the threshold (default 0.85), returns the local prediction
4. **Redis cache check**: Checks Redis for previously fetched API results
5. **API fallback**: On a cache miss, calls GenderAPI for unknown/low-confidence names
6. **Redis caching**: Stores API responses in Redis with 30-day TTL
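
The six steps above can be sketched as a single lookup function. This is a simplified illustration under stated assumptions, not the service's implementation: `InMemoryCache` stands in for Redis so the example runs without a server, `fake_genderapi` stands in for the real HTTP call and returns a made-up result, and predictions are plain `(gender, confidence, source)` tuples. It also assumes the API fallback is enabled.

```python
import asyncio

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days


class InMemoryCache:
    """Stand-in for Redis; stores values in a dict and ignores expiry."""

    def __init__(self):
        self._data = {}

    async def get(self, key):
        return self._data.get(key)

    async def set(self, key, value, ttl):
        self._data[key] = value


api_calls = []  # records which names reached the (fake) API


async def fake_genderapi(name):
    """Hypothetical API stub; the returned values are made up."""
    api_calls.append(name)
    return ("female", 0.97, "genderapi")


async def predict(name, model, cache, call_api, threshold=0.85):
    key = name.strip().lower()                       # 1. normalize
    entry = model.get(key)                           # 2. local model check
    if entry is not None:
        male, female, confidence = entry
        if confidence >= threshold and male != female:  # 3. confident enough
            gender = "male" if male > female else "female"
            return (gender, confidence, "model")
    cached = await cache.get(key)                    # 4. cache check
    if cached is not None:
        return cached
    result = await call_api(key)                     # 5. API fallback
    await cache.set(key, result, ttl=CACHE_TTL_SECONDS)  # 6. cache the response
    return result


async def demo():
    # Model entries use the [male_count, female_count, confidence] layout
    model = {"john": [18424, 46, 0.998], "taylor": [100, 120, 0.545]}
    cache = InMemoryCache()
    first = await predict(" John ", model, cache, fake_genderapi)
    second = await predict("Taylor", model, cache, fake_genderapi)
    third = await predict("Taylor", model, cache, fake_genderapi)
    return first, second, third


results = asyncio.run(demo())
```

In the demo, the first lookup is answered by the local model, the second falls through to the API stub, and the third is served from the cache without another API call.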

## Testing

Run the test suite to verify the service:

```bash
python test_service.py
```

This tests:
- High-confidence predictions
- Ambiguous names
- Edge cases (empty strings, special characters)
- International names (with API key)
- Batch predictions

## Model Updates

The model should be rebuilt annually when new SSA data is released:

1. Download the new year's file (e.g., `yob2025.txt`) to the `names/` directory
2. Run `python build_model.py` to rebuild
3. Test with `python test_service.py`
4. Commit the updated `model.txt`

## Performance

- **Model size**: 2.21 MB (compressed JSON)
- **Load time**: < 100ms
- **Prediction time**: < 1ms (local), < 5ms (Redis cache), < 500ms (API)
- **Memory usage**: ~10 MB for model in memory
- **Cache**: Redis-based with 30-day TTL
- **Scalability**: Shared cache across all service instances

## Limitations

- Based on US SSA data (may not work well for non-US names)
- Historical bias in older data
- Unisex names have lower confidence
- Requires GenderAPI key for comprehensive coverage