mirror of https://github.com/dograh-hq/dograh.git synced 2026-06-07 07:55:16 +02:00

Abhishek Kumar 4f2a629340 Initial Commit 🚀 🚀

2025-09-09 14:37:32 +05:30

6 KiB

Raw Blame History

Gender Prediction Service

An internal service for predicting gender from first names using SSA (Social Security Administration) baby names data with GenderAPI fallback for uncertain predictions.

Overview

This service provides gender prediction with:

Local model built from 145 years of SSA data (1880-2024)
104,819 unique names with confidence scores
Compressed storage (2.21 MB model file)
GenderAPI fallback for unknown or low-confidence names

Data Source

The SSA baby names data is already downloaded in the names/ directory from: https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data

Building the Model

Prerequisites

Python 3.11+
SSA data files in names/ directory (already included)

Build Steps

# Navigate to the gender service directory
cd dograh/api/services/gender/

# Run the model builder
python build_model.py

This will:

Process all 145 year files (yob1880.txt to yob2024.txt)
Aggregate name counts across all years
Calculate confidence scores based on gender ratios
Generate a compressed model.txt file (~2.21 MB)

Model Output

The builder generates model.txt with:

Version: Model version number
Metadata: Build date, statistics, thresholds
Names: Compressed array format [male_count, female_count, confidence]

Example output:

Model saved to: .../services/gender/model.txt
File size: 2.21 MB

Model statistics:
  Total names: 104,819
  High confidence names (≥0.85): 1,711
  Confidence percentage: 1.6%

Using the Service

Basic Usage

from services.gender.gender_service import GenderService

# Initialize the service
service = GenderService()

# Predict gender for a single name
result = await service.predict("John")
print(f"Gender: {result.gender}")       # "male"
print(f"Confidence: {result.confidence}") # 0.996
print(f"Source: {result.source}")       # "model"

# Get salutation for a name
greeting = await service.get_salutation("John")
print(f"Salutation: {greeting}")        # "Mr."

greeting = await service.get_salutation("Mary")
print(f"Salutation: {greeting}")        # "Ms."

greeting = await service.get_salutation("Unknown")
print(f"Salutation: {greeting}")        # "Dear"

# Clean up
await service.close()

Configuration Options

# Custom configuration
service = GenderService(
    model_path="custom/path/to/model.txt",  # Default: ./model.txt
    confidence_threshold=0.85,              # Default: 0.85
    gender_api_key="your-api-key",         # Default: from GENDER_API_KEY env
    gender_api_url="https://..."           # Default: GenderAPI v2 endpoint
)

Salutation Generation

# Get appropriate salutation based on gender
salutation = await service.get_salutation("John")     # "Mr."
salutation = await service.get_salutation("Mary")     # "Ms."
salutation = await service.get_salutation("Unknown")  # "Dear"

# Custom confidence threshold for salutation
salutation = await service.get_salutation(
    "Taylor",                    # Ambiguous name
    confidence_threshold=0.9     # Higher threshold
)  # Returns "Dear" due to low confidence

# Salutation logic:
# - "Mr." for male with confidence >= threshold
# - "Ms." for female with confidence >= threshold  
# - "Dear" for unknown gender or low confidence

Batch Predictions

# Predict multiple names at once
names = ["Alice", "Bob", "Charlie", "Diana"]
results = await service.batch_predict(names)

for name, result in zip(names, results):
    print(f"{name}: {result.gender} ({result.confidence:.2f})")

Response Format

class GenderPrediction:
    gender: "male" | "female" | "unknown"  # Predicted gender
    confidence: float                      # 0.0 to 1.0
    source: "model" | "genderapi"         # Prediction source

Service Statistics

# Get service statistics
stats = await service.get_stats()
print(f"Total names: {stats['model']['total_names']:,}")
print(f"High confidence: {stats['model']['high_confidence_names']:,}")
print(f"Cached names in Redis: {stats['cache']['cached_names']}")
print(f"Cache TTL: {stats['cache']['ttl_seconds']} seconds")
print(f"API enabled: {stats['api']['enabled']}")

Cache Management

# Clear Redis cache
await service.clear_cache()

Environment Variables

# Required: Redis connection URL
export REDIS_URL=redis://localhost:6379

# Optional: Set GenderAPI key for fallback
export GENDERAPI_API_KEY=your-api-key-here

# Optional: Override confidence threshold (default: 0.85)
export CONFIDENCE_THRESHOLD=0.85

How It Works

Name normalization: Converts to lowercase, strips whitespace
Local model check: Looks up name in pre-built model
Confidence evaluation: If confidence ≥ 0.85, returns local prediction
Redis cache check: Checks Redis for previously fetched API results
API fallback: For unknown/low-confidence names, calls GenderAPI
Redis caching: Stores API responses in Redis with 30-day TTL

Testing

Run the test suite to verify the service:

python test_service.py

This tests:

High-confidence predictions
Ambiguous names
Edge cases (empty strings, special characters)
International names (with API key)
Batch predictions

Model Updates

The model should be rebuilt annually when new SSA data is released:

Download new year file (e.g., yob2025.txt) to names/ directory
Run python build_model.py to rebuild
Test with python test_service.py
Commit the updated model.txt

Performance

Model size: 2.21 MB (compressed JSON)
Load time: < 100ms
Prediction time: < 1ms (local), < 5ms (Redis cache), < 500ms (API)
Memory usage: ~10 MB for model in memory
Cache: Redis-based with 30-day TTL
Scalability: Shared cache across all service instances

Limitations

Based on US SSA data (may not work well for non-US names)
Historical bias in older data
Unisex names have lower confidence
Requires GenderAPI key for comprehensive coverage

6 KiB Raw Blame History