feat: modify architecture examples and scaling checklist to system architecture skill

Anish Sarkar 2026-03-15 04:58:58 +05:30
parent b22fe012d5
commit 056d3c456b
3 changed files with 302 additions and 183 deletions

@@ -0,0 +1,76 @@
# Scaling Checklist
Concrete techniques for when the complexity checklist in SKILL.md confirms that scale is a real problem. Apply the levels in order: each one addresses the bottleneck that remains after the previous level.
---
## Level 0: Optimize First
Before adding infrastructure, exhaust these:
- [ ] Database queries have proper indexes
- [ ] N+1 queries eliminated
- [ ] Connection pooling configured
- [ ] Slow endpoints profiled and optimized
- [ ] Static assets served via CDN
## Level 1: Read-Heavy
**Symptom**: Database reads are the bottleneck.
| Technique | When | Trade-off |
|-----------|------|-----------|
| Application cache (in-memory) | Small, frequently accessed data | Stale data, memory pressure |
| Redis/Memcached | Shared cache across instances | Network hop, cache invalidation complexity |
| Read replicas | High read volume, slight staleness OK | Replication lag, eventual consistency |
| CDN | Static or semi-static content | Cache invalidation delay |
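
The first row of the table (application cache) can be sketched as a small in-process cache with a time-to-live. This is an illustrative sketch, not a recommended implementation: `TTLCache`, `get_or_load`, and the `loader` callback are hypothetical names, and the eviction policy (drop the oldest entry) is an assumption made to keep the example short. The TTL bounds how stale an entry can get, and `max_entries` bounds memory pressure, which are the two trade-offs the table lists.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float,
        max_entries: int = 1024,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no trip to the database
        value = loader(key)  # miss or expired: go to the source of truth
        self._store.pop(key, None)  # re-inserting moves the key to the newest position
        if len(self._store) >= self._max:
            # Dicts preserve insertion order, so the first key is the oldest entry.
            self._store.pop(next(iter(self._store)))
        self._store[key] = (now + self._ttl, value)
        return value
```

A caller would wrap the expensive read, for example `cache.get_or_load(user_id, load_user_from_db)`, where `load_user_from_db` stands in for whatever query the application already has.
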
## Level 2: Write-Heavy
**Symptom**: Database writes or processing are the bottleneck.
| Technique | When | Trade-off |
|-----------|------|-----------|
| Async task queue (Celery, SQS) | Work can be deferred | Eventual consistency, failure handling |
| Write-behind cache | Batch frequent writes | Data loss risk on crash |
| Event streaming (Kafka) | Multiple consumers of same data | Operational complexity, ordering guarantees |
| CQRS | Reads and writes have diverged significantly | Two models to maintain |
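
To make the write-behind row concrete, here is a minimal sketch of a buffer that coalesces writes per key and flushes them in batches. `WriteBehindBuffer` and the `flush` callback are hypothetical names; a real system would also flush on a timer and handle flush failures, which this sketch deliberately leaves out.

```python
from typing import Callable, Dict


class WriteBehindBuffer:
    """Buffer writes in memory and persist them in batches."""

    def __init__(self, flush: Callable[[Dict[str, int]], None], max_pending: int = 100) -> None:
        self._flush_fn = flush  # e.g. a function that issues one bulk UPDATE
        self._max_pending = max_pending
        self._pending: Dict[str, int] = {}

    def put(self, key: str, value: int) -> None:
        # Later writes to the same key overwrite earlier ones, so only the
        # latest value per key reaches the database.
        self._pending[key] = value
        if len(self._pending) >= self._max_pending:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        batch, self._pending = self._pending, {}
        # Anything still in self._pending when the process crashes is lost;
        # this is the "data loss risk on crash" trade-off from the table.
        self._flush_fn(batch)
```

With `max_pending=2`, two writes to `"views:1"` followed by one write to `"views:2"` produce a single batch containing the latest value for each key.
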
## Level 3: Traffic Spikes
**Symptom**: Individual instances can't handle peak load.
| Technique | When | Trade-off |
|-----------|------|-----------|
| Horizontal scaling + load balancer | Stateless services | Session management, deploy complexity |
| Auto-scaling | Unpredictable traffic patterns | Cold start latency, cost spikes |
| Rate limiting | Protect against abuse/spikes | Legitimate users may be throttled |
| Circuit breakers | Downstream services degrade | Partial functionality during failures |
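
Rate limiting is often implemented as a token bucket. The sketch below is one possible in-process version under stated assumptions: `TokenBucket`, `allow`, and the parameter names are hypothetical, and a multi-instance deployment would need shared state (for example in Redis) rather than a per-process object.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_second`."""

    def __init__(
        self,
        rate_per_second: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never exceeding capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should reject, e.g. with HTTP 429
```

The capacity controls how large a burst a legitimate user can send before being throttled, which is the knob to tune against the trade-off in the table.
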
## Level 4: Data Growth
**Symptom**: Single database can't hold or query all the data efficiently.
| Technique | When | Trade-off |
|-----------|------|-----------|
| Table partitioning | Time-series or naturally partitioned data | Query complexity, partition management |
| Archival / cold storage | Old data rarely accessed | Access latency for archived data |
| Database sharding | Partitioning insufficient, clear shard key exists | Cross-shard queries, operational burden |
| Search index (Elasticsearch) | Full-text or complex queries on large datasets | Index lag, another system to operate |
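
For the sharding row, the core of the technique is a stable mapping from shard key to shard. The following is a sketch using hash-modulo routing; `shard_for` is a hypothetical helper, and hash-modulo is only one option (range-based and directory-based routing are alternatives).

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Use a cryptographic digest rather than the built-in hash(), which is
    # randomized per process for strings and would route inconsistently.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Every query that includes the shard key can be routed to a single shard. Queries without it must fan out to all shards, and changing `num_shards` remaps most keys, which is part of the operational burden noted in the table.
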
## Level 5: Multi-Region
**Symptom**: Users are geographically distributed and latency matters.
| Technique | When | Trade-off |
|-----------|------|-----------|
| CDN + edge caching | Static/semi-static content | Cache invalidation |
| Read replicas per region | Read-heavy, slight staleness OK | Replication lag |
| Active-passive failover | Disaster recovery | Failover time, cost of standby |
| Active-active multi-region | True global low-latency required | Conflict resolution, extreme complexity |
---
## Decision Rule
Always start at Level 0. Move to the next level only when you have **measured evidence** that the current level is insufficient. Skipping levels is how you end up with Kafka for a TODO app.