Scaling Checklist

Concrete techniques for when the complexity checklist in SKILL.md confirms scale is a real problem. Apply the levels in order: each one addresses the bottleneck that remains after the previous level.

Level 0: Optimize First

Before adding infrastructure, exhaust these:

Database queries have proper indexes
N+1 queries eliminated
Connection pooling configured
Slow endpoints profiled and optimized
Static assets served via CDN
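
As an illustration of the first two items, the sketch below contrasts an N+1 access pattern with a single joined query, using Python's built-in sqlite3 module. The schema (authors, posts), the index name, and the function names are hypothetical and chosen only for this example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a small in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL REFERENCES authors(id),
            title TEXT NOT NULL
        );
        -- Index the column used for lookups and joins.
        CREATE INDEX idx_posts_author_id ON posts(author_id);
        INSERT INTO authors (id, name) VALUES (1, 'ada'), (2, 'grace');
        INSERT INTO posts (author_id, title) VALUES
            (1, 'first'), (1, 'second'), (2, 'third');
    """)
    return conn


def titles_n_plus_one(conn: sqlite3.Connection) -> dict:
    """Anti-pattern: one query for the authors, then one more query per author."""
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        result[name] = [title for (title,) in rows]
    return result


def titles_joined(conn: sqlite3.Connection) -> dict:
    """Same result in a single round trip using a JOIN."""
    result = {}
    rows = conn.execute("""
        SELECT a.name, p.title
        FROM authors a
        LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """).fetchall()
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts still get an empty list
            result[name].append(title)
    return result


def lookup_plan(conn: sqlite3.Connection) -> str:
    """Return SQLite's query plan text for the per-author lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT title FROM posts WHERE author_id = ?", (1,)
    ).fetchall()
    return " ".join(row[-1] for row in rows)
```

Both functions return the same mapping, but the joined version issues one query regardless of how many authors exist. Inspecting the plan with lookup_plan is one way to check that a lookup actually uses the index rather than scanning the table.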

Level 1: Read-Heavy

Symptom: Database reads are the bottleneck.

Technique	When	Trade-off
Application cache (in-memory)	Small, frequently accessed data	Stale data, memory pressure
Redis/Memcached	Shared cache across instances	Network hop, cache invalidation complexity
Read replicas	High read volume, slight staleness OK	Replication lag, eventual consistency
CDN	Static or semi-static content	Cache invalidation delay
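
The first row of the table can be sketched as a small in-process cache with a time-to-live. This is a minimal, single-threaded illustration: the class name TTLCache, its parameters, and the injected clock are assumptions made for the example, not part of any library.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds (not thread-safe)."""

    def __init__(
        self,
        ttl_seconds: float,
        max_entries: int = 1024,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock  # injectable so expiry can be tested deterministically
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        """Return the cached value, or call loader(key) on a miss or expiry."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader(key)
        if key not in self._entries and len(self._entries) >= self._max:
            # Bound memory use: evict the oldest-inserted key
            # (dicts preserve insertion order).
            self._entries.pop(next(iter(self._entries)))
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop a key explicitly, e.g. right after the underlying row changes."""
        self._entries.pop(key, None)
```

The TTL bounds how stale a value can get and max_entries bounds memory pressure, which are the two trade-offs listed for this technique. Each application instance holds its own copy, which is the gap a shared cache such as Redis or Memcached fills.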

Level 2: Write-Heavy

Symptom: Database writes or processing are the bottleneck.

Technique	When	Trade-off
Async task queue (Celery, SQS)	Work can be deferred	Eventual consistency, failure handling
Write-behind cache	Batch frequent writes	Data loss risk on crash
Event streaming (Kafka)	Multiple consumers of same data	Operational complexity, ordering guarantees
CQRS	Reads and writes have diverged significantly	Two models to maintain
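
To make the write-behind row concrete, here is a minimal sketch that buffers counter increments in memory and hands them to a flush callback in batches. The class name WriteBehindCounter and the flush callback are hypothetical; in a real system the callback would perform one batched database write.

```python
from typing import Callable, Dict


class WriteBehindCounter:
    """Coalesces frequent increments and writes them out in batches."""

    def __init__(
        self,
        flush: Callable[[Dict[str, int]], None],
        max_pending: int = 100,
    ) -> None:
        self._flush = flush              # hypothetical sink, e.g. a batched UPDATE
        self._max_pending = max_pending  # flush after this many buffered writes
        self._pending: Dict[str, int] = {}
        self._pending_writes = 0

    def increment(self, key: str, amount: int = 1) -> None:
        """Record an increment in memory; flush once enough writes are buffered."""
        self._pending[key] = self._pending.get(key, 0) + amount
        self._pending_writes += 1
        if self._pending_writes >= self._max_pending:
            self.flush()

    def flush(self) -> None:
        """Send the buffered totals to the sink and reset the buffer."""
        if not self._pending:
            return
        batch, self._pending = self._pending, {}
        self._pending_writes = 0
        self._flush(batch)
```

Anything still sitting in the buffer is lost if the process crashes before flush runs, which is the data loss risk named in the table. A production version would typically also flush on a timer and on shutdown.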

Level 3: Traffic Spikes

Symptom: Individual instances can't handle peak load.

Technique	When	Trade-off
Horizontal scaling + load balancer	Stateless services	Session management, deploy complexity
Auto-scaling	Unpredictable traffic patterns	Cold start latency, cost spikes
Rate limiting	Protect against abuse/spikes	Legitimate users may be throttled
Circuit breakers	Downstream services degrade	Partial functionality during failures
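
Rate limiting is often implemented with a token bucket. The sketch below is one possible single-process version, assuming a hypothetical TokenBucket class with an injectable clock; a limiter shared across instances would need a shared store instead.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to capacity, refilled at rate_per_sec (not thread-safe)."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits, else False."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would typically keep one bucket per client or API key and respond with HTTP 429 when allow returns False. The capacity controls how large a burst a legitimate user can send before being throttled.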

Level 4: Data Growth

Symptom: Single database can't hold or query all the data efficiently.

Technique	When	Trade-off
Table partitioning	Time-series or naturally partitioned data	Query complexity, partition management
Archival / cold storage	Old data rarely accessed	Access latency for archived data
Database sharding	Partitioning insufficient, clear shard key exists	Cross-shard queries, operational burden
Search index (Elasticsearch)	Full-text or complex queries on large datasets	Index lag, another system to operate
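
For the sharding row, a common starting point is to hash the shard key and take the result modulo the number of shards. The function names below are hypothetical, and hash-modulo routing is only one of several possible schemes.

```python
import hashlib
from typing import Dict, Iterable, List


def shard_for(shard_key: str, shard_count: int) -> int:
    """Map a shard key to a shard index in [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # Use a stable hash: Python's built-in hash() of str varies between processes.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def group_by_shard(shard_keys: Iterable[str], shard_count: int) -> Dict[int, List[str]]:
    """Group keys by shard so a multi-key lookup issues one query per shard."""
    groups: Dict[int, List[str]] = {}
    for key in shard_keys:
        groups.setdefault(shard_for(key, shard_count), []).append(key)
    return groups
```

Queries that carry the shard key go to exactly one shard. Queries that do not must fan out to every shard and merge the results, which is the cross-shard cost listed in the table. With plain modulo routing, changing shard_count remaps most keys, so resharding is part of the operational burden.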

Level 5: Multi-Region

Symptom: Users are geographically distributed and latency matters.

Technique	When	Trade-off
CDN + edge caching	Static/semi-static content	Cache invalidation
Read replicas per region	Read-heavy, slight staleness OK	Replication lag
Active-passive failover	Disaster recovery	Failover time, cost of standby
Active-active multi-region	True global low-latency required	Conflict resolution, extreme complexity
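
The per-region read replica row implies a routing rule: writes go to the single primary, while reads go to a replica in the caller's region when one exists. A minimal sketch of that rule follows; the Topology class, region names, and connection strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Topology:
    """Single-primary layout with optional read replicas per region."""

    primary_dsn: str
    replica_dsn_by_region: Dict[str, str]

    def dsn_for(self, client_region: str, is_write: bool) -> str:
        """Pick the database endpoint for one request."""
        if is_write:
            # All writes go to the primary, wherever the client is.
            return self.primary_dsn
        # Reads prefer the local replica and fall back to the primary.
        return self.replica_dsn_by_region.get(client_region, self.primary_dsn)


# Example values are placeholders, not real endpoints.
topology = Topology(
    primary_dsn="postgres://primary.us-east.example/db",
    replica_dsn_by_region={"eu-west": "postgres://replica.eu-west.example/db"},
)
```

Reads served by a replica can lag behind the primary, so a request that must see its own recent write may need to read from the primary instead.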

Decision Rule

Always start at Level 0. Move to the next level only when you have measured evidence that the current level is insufficient. Skipping levels is how you end up with Kafka for a TODO app.
