Scaling Checklist

Concrete techniques for when the complexity checklist in SKILL.md confirms that scale is a real problem. Apply the levels in order: each one targets the bottleneck that typically remains after the previous level has been applied.


Level 0: Optimize First

Before adding infrastructure, exhaust these:

  • Database queries have proper indexes
  • N+1 queries eliminated
  • Connection pooling configured
  • Slow endpoints profiled and optimized
  • Static assets served via CDN
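As an illustration of the first two items, the sketch below contrasts an N+1 access pattern with a single JOIN over an indexed foreign key. It uses Python's built-in `sqlite3` module and a hypothetical `authors`/`posts` schema invented for this example. It is not taken from any real codebase, and a production ORM would express the same idea differently.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with hypothetical authors/posts tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL REFERENCES authors(id),
            title TEXT NOT NULL
        );
        -- Index the foreign key so lookups by author avoid a full table scan.
        CREATE INDEX idx_posts_author_id ON posts(author_id);
        INSERT INTO authors (id, name) VALUES (1, 'ada'), (2, 'linus');
        INSERT INTO posts (author_id, title) VALUES
            (1, 'first'), (1, 'second'), (2, 'third');
        """
    )
    return conn


def titles_n_plus_one(conn: sqlite3.Connection):
    """N+1 pattern: one query for the authors, then one more query per author."""
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries = 1
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_query(conn: sqlite3.Connection):
    """Same data from one JOIN (authors without posts would be omitted here)."""
    result = {}
    rows = conn.execute(
        """
        SELECT a.name, p.title
        FROM authors AS a
        JOIN posts AS p ON p.author_id = a.id
        ORDER BY a.id, p.id
        """
    ).fetchall()
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result, 1
```

With two authors the N+1 version issues three queries and the JOIN issues one. The gap grows linearly with the number of parent rows, which is why this pattern is worth removing before reaching for a cache.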

Level 1: Read-Heavy

Symptom: Database reads are the bottleneck.

| Technique | When | Trade-off |
| --- | --- | --- |
| Application cache (in-memory) | Small, frequently accessed data | Stale data, memory pressure |
| Redis/Memcached | Shared cache across instances | Network hop, cache invalidation complexity |
| Read replicas | High read volume, slight staleness OK | Replication lag, eventual consistency |
| CDN | Static or semi-static content | Cache invalidation delay |
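To make the in-memory cache row and its stale-data trade-off concrete, here is a minimal sketch of a time-to-live cache. The `TTLCache` class, its `get_or_load` method, and the injectable `clock` parameter are hypothetical names chosen for this example. The sketch is single-threaded and unbounded, so a real service would also need locking and an eviction policy.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Per-process cache whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        """Return the cached value, calling loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Cache hit: the value may be up to ttl_seconds out of date.
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop an entry explicitly, for example right after the source row changes."""
        self._entries.pop(key, None)
```

The TTL is an upper bound on how stale a read can be. Calling `invalidate` on writes narrows that window within one process, but each instance keeps its own copy, which is the usual reason to move to a shared cache such as Redis or Memcached.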

Level 2: Write-Heavy

Symptom: Database writes or processing are the bottleneck.

| Technique | When | Trade-off |
| --- | --- | --- |
| Async task queue (Celery, SQS) | Work can be deferred | Eventual consistency, failure handling |
| Write-behind cache | Batch frequent writes | Data loss risk on crash |
| Event streaming (Kafka) | Multiple consumers of same data | Operational complexity, ordering guarantees |
| CQRS | Reads and writes have diverged significantly | Two models to maintain |
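The write-behind row can be sketched as a small buffer that coalesces writes in memory and hands them to the database in batches. `WriteBehindBuffer`, `flush_fn`, and `max_pending` are hypothetical names for this example, and the sketch assumes a single-threaded caller. Anything still held in `_pending` is lost if the process crashes, which is the trade-off named in the table.

```python
from typing import Any, Callable, Dict, Hashable


class WriteBehindBuffer:
    """Collect writes in memory and pass them to flush_fn in batches."""

    def __init__(self, flush_fn: Callable[[Dict[Hashable, Any]], None],
                 max_pending: int = 100) -> None:
        self._flush_fn = flush_fn      # hypothetical callback, e.g. a bulk upsert
        self._max_pending = max_pending
        self._pending: Dict[Hashable, Any] = {}

    def put(self, key: Hashable, value: Any) -> None:
        """Record a write; repeated writes to one key collapse to the latest value."""
        self._pending[key] = value
        if len(self._pending) >= self._max_pending:
            self.flush()

    def flush(self) -> int:
        """Send all pending writes as one batch and return how many were sent."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, {}
        try:
            self._flush_fn(batch)
        except Exception:
            # Keep the failed batch so a later flush can retry; newer writes win.
            batch.update(self._pending)
            self._pending = batch
            raise
        return len(batch)
```

In practice a timer or shutdown hook would also call `flush`, so that a quiet period does not leave writes sitting in memory indefinitely.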

Level 3: Traffic Spikes

Symptom: Individual instances can't handle peak load.

| Technique | When | Trade-off |
| --- | --- | --- |
| Horizontal scaling + load balancer | Stateless services | Session management, deploy complexity |
| Auto-scaling | Unpredictable traffic patterns | Cold start latency, cost spikes |
| Rate limiting | Protect against abuse/spikes | Legitimate users may be throttled |
| Circuit breakers | Downstream services degrade | Partial functionality during failures |
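One common way to implement the rate-limiting row is a token bucket, sketched below. This is an illustrative choice, not something the checklist prescribes. The `TokenBucket` class and its parameters are hypothetical names, and the sketch keeps state in one process, so a multi-instance deployment would need shared state (for example in Redis).

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_second` tokens."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock            # injectable so tests need no sleeping
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request fits, else False."""
        now = self._clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would typically keep one bucket per client key and answer with HTTP 429 when `allow()` returns `False`. The `capacity` setting controls how much legitimate burst traffic gets through before throttling begins.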

Level 4: Data Growth

Symptom: Single database can't hold or query all the data efficiently.

| Technique | When | Trade-off |
| --- | --- | --- |
| Table partitioning | Time-series or naturally partitioned data | Query complexity, partition management |
| Archival / cold storage | Old data rarely accessed | Access latency for archived data |
| Database sharding | Partitioning insufficient, clear shard key exists | Cross-shard queries, operational burden |
| Search index (Elasticsearch) | Full-text or complex queries on large datasets | Index lag, another system to operate |
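For the sharding row, the sketch below shows one simple routing scheme, hash-modulo on the shard key. The function names and the example keys are hypothetical. Hash-modulo is only one option, and because changing `shard_count` remaps most keys, real systems often prefer consistent hashing or a lookup directory.

```python
import hashlib
from typing import Dict, Iterable, List


def shard_for(shard_key: str, shard_count: int) -> int:
    """Map a shard key to a shard index using a hash that is stable across runs."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # Python's built-in hash() is salted per process, so use a real digest instead.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def group_by_shard(keys: Iterable[str], shard_count: int) -> Dict[int, List[str]]:
    """Group keys by target shard, as a cross-shard query planner would have to."""
    groups: Dict[int, List[str]] = {}
    for key in keys:
        groups.setdefault(shard_for(key, shard_count), []).append(key)
    return groups
```

A lookup by a single key touches exactly one shard. A query over many keys must fan out to every shard returned by `group_by_shard` and merge the results in the application, which is the cross-shard cost listed in the table.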

Level 5: Multi-Region

Symptom: Users are geographically distributed, latency matters.

| Technique | When | Trade-off |
| --- | --- | --- |
| CDN + edge caching | Static/semi-static content | Cache invalidation |
| Read replicas per region | Read-heavy, slight staleness OK | Replication lag |
| Active-passive failover | Disaster recovery | Failover time, cost of standby |
| Active-active multi-region | True global low-latency required | Conflict resolution, extreme complexity |

Decision Rule

Always start at Level 0. Move to the next level only when you have measured evidence that the current level is insufficient. Skipping levels is how you end up with Kafka for a TODO app.