feat: modify architecture examples and scaling checklist to system architecture skill

2026-05-04 13:22:41 +02:00 · 2026-03-15 04:58:58 +05:30
commit 056d3c456b
parent b22fe012d5
3 changed files with 302 additions and 183 deletions
--- a/.cursor/skills/system-architecture/scaling-checklist.md
+++ b/.cursor/skills/system-architecture/scaling-checklist.md
@@ -0,0 +1,76 @@
+# Scaling Checklist
+
+Concrete techniques for when the complexity checklist in SKILL.md confirms scale is a real problem. Work through the levels in order: each level targets the bottleneck that typically remains once the previous level has been addressed.
+
+---
+
+## Level 0: Optimize First
+
+Before adding infrastructure, exhaust these:
+
+- [ ] Database queries have proper indexes
+- [ ] N+1 queries eliminated
+- [ ] Connection pooling configured
+- [ ] Slow endpoints profiled and optimized
+- [ ] Static assets served via CDN
+
+## Level 1: Read-Heavy
+
+**Symptom**: Database reads are the bottleneck.
+
+| Technique | When | Trade-off |
+|-----------|------|-----------|
+| Application cache (in-memory) | Small, frequently accessed data | Stale data, memory pressure |
+| Redis/Memcached | Shared cache across instances | Network hop, cache invalidation complexity |
+| Read replicas | High read volume, slight staleness OK | Replication lag, eventual consistency |
+| CDN | Static or semi-static content | Cache invalidation delay |
+
+## Level 2: Write-Heavy
+
+**Symptom**: Database writes or processing are the bottleneck.
+
+| Technique | When | Trade-off |
+|-----------|------|-----------|
+| Async task queue (Celery, SQS) | Work can be deferred | Eventual consistency, failure handling |
+| Write-behind cache | Batch frequent writes | Data loss risk on crash |
+| Event streaming (Kafka) | Multiple consumers of same data | Operational complexity, ordering guaranteed only within a partition |
+| CQRS | Reads and writes have diverged significantly | Two models to maintain |
+
+## Level 3: Traffic Spikes
+
+**Symptom**: Individual instances can't handle peak load.
+
+| Technique | When | Trade-off |
+|-----------|------|-----------|
+| Horizontal scaling + load balancer | Stateless services | Session management, deploy complexity |
+| Auto-scaling | Unpredictable traffic patterns | Cold start latency, cost spikes |
+| Rate limiting | Protect against abuse/spikes | Legitimate users may be throttled |
+| Circuit breakers | Downstream services degrade | Partial functionality during failures |
+
+## Level 4: Data Growth
+
+**Symptom**: Single database can't hold or query all the data efficiently.
+
+| Technique | When | Trade-off |
+|-----------|------|-----------|
+| Table partitioning | Time-series or naturally partitioned data | Query complexity, partition management |
+| Archival / cold storage | Old data rarely accessed | Access latency for archived data |
+| Database sharding | Partitioning insufficient, clear shard key exists | Cross-shard queries, operational burden |
+| Search index (Elasticsearch) | Full-text or complex queries on large datasets | Index lag, another system to operate |
+
+## Level 5: Multi-Region
+
+**Symptom**: Users are geographically distributed and latency matters.
+
+| Technique | When | Trade-off |
+|-----------|------|-----------|
+| CDN + edge caching | Static/semi-static content | Cache invalidation |
+| Read replicas per region | Read-heavy, slight staleness OK | Replication lag |
+| Active-passive failover | Disaster recovery | Failover time, cost of standby |
+| Active-active multi-region | True global low-latency required | Conflict resolution, extreme complexity |
+
+---
+
+## Decision Rule
+
+Always start at Level 0. Move to the next level only when you have **measured evidence** that the current level is insufficient. Skipping levels is how you end up with Kafka for a TODO app.