# Scaling Checklist

Concrete techniques for when the complexity checklist in SKILL.md confirms that scale is a real problem. Apply the levels in order: each one targets the bottleneck that remains after the previous level has been exhausted.

---

## Level 0: Optimize First

Before adding infrastructure, exhaust these:

- [ ] Slow or frequent database queries are backed by indexes (confirm with the query planner, e.g. `EXPLAIN`)
- [ ] N+1 queries eliminated
- [ ] Connection pooling configured
- [ ] Slow endpoints profiled and optimized
- [ ] Static assets served via CDN
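
The first two checklist items can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `authors`/`posts` schema, the index name, and the helper functions are hypothetical and exist only for this example. It contrasts an N+1 access pattern with a single joined query, and shows one way to ask the query planner whether an index is used.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    CREATE INDEX idx_posts_author_id ON posts(author_id);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO posts VALUES (1, 1, 'Notes'), (2, 1, 'Engines'), (3, 2, 'Compilers');
""")


def titles_n_plus_one(conn):
    """N+1 pattern: one query for the authors, then one more query per author."""
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        result[name] = [title for (title,) in rows]
    return result


def titles_single_query(conn):
    """Same result with a single round trip, using a LEFT JOIN."""
    rows = conn.execute("""
        SELECT a.name, p.title
        FROM authors a
        LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts keep an empty list
            result[name].append(title)
    return result


def uses_index(conn, sql, params=()):
    """Return True if SQLite's plan for `sql` mentions an index lookup."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable plan detail is the last column of each row.
    return any("INDEX" in row[-1] for row in plan)
```

Both functions return the same mapping; the difference is the number of round trips, which grows with the number of authors in the first version. Other databases expose similar planner output through their own `EXPLAIN` variants.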

## Level 1: Read-Heavy

**Symptom**: Database reads are the bottleneck.

| Technique | When | Trade-off |
|-----------|------|-----------|
| Application cache (in-memory) | Small, frequently accessed data | Stale data, memory pressure |
| Redis/Memcached | Shared cache across instances | Network hop, cache invalidation complexity |
| Read replicas | High read volume, slight staleness OK | Replication lag, eventual consistency |
| CDN | Static or semi-static content | Cache invalidation delay |
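
As a rough sketch of the first row, an in-process cache with a time-to-live might look like the following. The `TTLCache` class and its `get_or_load` method are hypothetical names for this example, not part of any library. The TTL bounds how stale an entry can get, which is the trade-off the table names.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Minimal in-memory cache-aside helper with a per-entry time-to-live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        """Return the cached value, or call `loader(key)` on a miss or expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        value = loader(key)  # e.g. a database read
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop an entry explicitly, e.g. right after the underlying row changes."""
        self._store.pop(key, None)
```

This sketch has no size limit, so it does nothing about memory pressure; a real deployment would likely add eviction or use an existing cache library. With multiple application instances, each process holds its own copy, which is where a shared cache such as Redis or Memcached comes in.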

## Level 2: Write-Heavy

**Symptom**: Database writes or processing are the bottleneck.

| Technique | When | Trade-off |
|-----------|------|-----------|
| Async task queue (Celery, SQS) | Work can be deferred | Eventual consistency, failure handling |
| Write-behind cache | Batch frequent writes | Data loss risk on crash |
| Event streaming (Kafka) | Multiple consumers of same data | Operational complexity, ordering guaranteed only within a partition |
| CQRS | Reads and writes have diverged significantly | Two models to maintain |
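
The write-behind row can be sketched as a small buffer that coalesces writes per key and flushes them in batches. `WriteBehindBuffer` and `flush_fn` are hypothetical names for this example; `flush_fn` stands in for whatever performs the real batched write. Anything still pending when the process crashes is lost, which is the data-loss risk noted in the table.

```python
from typing import Any, Callable, Dict, Hashable


class WriteBehindBuffer:
    """Buffer writes in memory and hand them to `flush_fn` in batches."""

    def __init__(self, flush_fn: Callable[[Dict[Hashable, Any]], None], max_batch: int = 100):
        self._flush_fn = flush_fn  # assumed to persist a whole batch at once
        self._max_batch = max_batch
        self._pending: Dict[Hashable, Any] = {}

    def write(self, key: Hashable, value: Any) -> None:
        """Record a write; repeated writes to one key keep only the latest value."""
        self._pending[key] = value
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Persist everything buffered so far. Call on a timer and at shutdown."""
        if not self._pending:
            return
        batch, self._pending = self._pending, {}
        self._flush_fn(batch)
```

This version is single-threaded and does not retry a failed flush; a production variant would need locking, a periodic flush, and a policy for batches that fail to persist.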

## Level 3: Traffic Spikes

**Symptom**: Individual instances can't handle peak load.

| Technique | When | Trade-off |
|-----------|------|-----------|
| Horizontal scaling + load balancer | Stateless services | Session management, deploy complexity |
| Auto-scaling | Unpredictable traffic patterns | Cold start latency, cost spikes |
| Rate limiting | Protect against abuse/spikes | Legitimate users may be throttled |
| Circuit breakers | Downstream services degrade | Partial functionality during failures |
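
Rate limiting is often implemented with a token bucket. The following is a minimal single-process sketch; the `TokenBucket` class and its parameter names are assumptions for this example. It allows bursts up to `capacity` and a sustained rate of `rate_per_sec`, and rejects requests once the bucket is empty, which is how legitimate users can end up throttled.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter: refills continuously, spends one token per request."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock  # injectable so tests can control time
        self._tokens = float(capacity)  # start full to allow an initial burst
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if available, else False."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With several instances behind a load balancer, each process would enforce its own limit, so a shared store is usually needed for a global limit.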

## Level 4: Data Growth

**Symptom**: Single database can't hold or query all the data efficiently.

| Technique | When | Trade-off |
|-----------|------|-----------|
| Table partitioning | Time-series or naturally partitioned data | Query complexity, partition management |
| Archival / cold storage | Old data rarely accessed | Access latency for archived data |
| Database sharding | Partitioning insufficient, clear shard key exists | Cross-shard queries, operational burden |
| Search index (Elasticsearch) | Full-text or complex queries on large datasets | Index lag, another system to operate |
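
To make the sharding row concrete, here is a minimal sketch of hash-based shard routing. `shard_for` and `group_by_shard` are hypothetical helpers, and the shard key is assumed to be a string such as a user ID. A stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard number in [0, num_shards) deterministically."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(keys: Iterable[str], num_shards: int) -> Dict[int, List[str]]:
    """Group keys by shard, as a cross-shard lookup would need before fanning out."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        groups[shard_for(key, num_shards)].append(key)
    return dict(groups)
```

Plain modulo routing remaps most keys when `num_shards` changes, so resharding means moving data; consistent hashing or a lookup table are common alternatives. `group_by_shard` hints at the cross-shard cost: a query touching many keys becomes one query per shard plus a merge step.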

## Level 5: Multi-Region

**Symptom**: Users are geographically distributed and cross-region latency matters.

| Technique | When | Trade-off |
|-----------|------|-----------|
| CDN + edge caching | Static/semi-static content | Cache invalidation |
| Read replicas per region | Read-heavy, slight staleness OK | Replication lag |
| Active-passive failover | Disaster recovery | Failover time, cost of standby |
| Active-active multi-region | True global low-latency required | Conflict resolution, extreme complexity |
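
The conflict-resolution cost of active-active can be seen in even the simplest strategy, last-write-wins. The sketch below is an assumption-laden illustration: the `Versioned` record and `merge_lww` function are hypothetical, and timestamps are assumed to come from reasonably synchronized clocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value tagged with the time and region of the write that produced it."""
    value: str
    timestamp_ms: int
    region: str


def merge_lww(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the newer write; break timestamp ties by region name.

    The tie-break makes the result independent of argument order, so every
    region converges on the same winner.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))
```

Last-write-wins converges, but it silently discards the losing write, and clock skew between regions can make the "wrong" write win. Avoiding that requires merge logic specific to the data, which is a large part of the complexity the table refers to.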

---

## Decision Rule

Always start at Level 0. Move to the next level only when you have **measured evidence** that the current level is insufficient. Skipping levels is how you end up with Kafka for a TODO app.
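
One way to turn "measured evidence" into a check is to compare a latency percentile against an explicit budget. This is a minimal sketch; the `percentile` and `exceeds_budget` helpers, the nearest-rank method, and the idea of a fixed millisecond budget are assumptions for illustration, not a prescribed process.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def exceeds_budget(latencies_ms: Sequence[float], budget_ms: float, pct: int = 95) -> bool:
    """True if the chosen percentile of measured latency is over the budget."""
    return percentile(latencies_ms, pct) > budget_ms
```

The samples would come from real production or load-test measurements; only when a check like this fails after Level 0 work is done does moving up a level have evidence behind it.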