Scaling Checklist
Concrete techniques to apply once the complexity checklist in SKILL.md confirms that scale is a real problem. Work through the levels in order: each level addresses the bottleneck that remains after the previous one is exhausted.
Level 0: Optimize First
Before adding infrastructure, exhaust these:
- Database queries have proper indexes
- N+1 queries eliminated
- Connection pooling configured
- Slow endpoints profiled and optimized
- Static assets served via CDN
Level 1: Read-Heavy
Symptom: Database reads are the bottleneck.
| Technique | When | Trade-off |
|---|---|---|
| Application cache (in-memory) | Small, frequently accessed data | Stale data, memory pressure |
| Redis/Memcached | Shared cache across instances | Network hop, cache invalidation complexity |
| Read replicas | High read volume, slight staleness OK | Replication lag, eventual consistency |
| CDN | Static or semi-static content | Cache invalidation delay |
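The sketch below is a minimal in-memory application cache using the cache-aside pattern. It checks the cache first, reads through to the database on a miss, and expires entries after a TTL to bound staleness. `TTLCache`, `get_or_load`, and the injectable `clock` are hypothetical names chosen for illustration. The same pattern applies to a shared Redis or Memcached cache, with the dictionary replaced by network calls.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Minimal in-process cache-aside helper with per-entry expiry.

    Entries older than ttl_seconds are treated as missing, which bounds how
    stale a read can be. There is no size limit, so memory grows with the
    number of distinct keys (the "memory pressure" trade-off).
    """

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._data: Dict[Hashable, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._data.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # fresh hit: the database is not touched
        value = loader()  # miss or expired: read through to the source
        self._data[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        # Call after a write so the next read reloads instead of serving stale data.
        self._data.pop(key, None)
```

For example, `cache.get_or_load("user:42", lambda: load_user(42))` would run the hypothetical `load_user` at most once per TTL window, unless `invalidate("user:42")` is called in between.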
Level 2: Write-Heavy
Symptom: Database writes or processing are the bottleneck.
| Technique | When | Trade-off |
|---|---|---|
| Async task queue (Celery, SQS) | Work can be deferred | Eventual consistency, failure handling |
| Write-behind cache | Batch frequent writes | Data loss risk on crash |
| Event streaming (Kafka) | Multiple consumers of the same data | Operational complexity; ordering guaranteed only within a partition |
| CQRS | Reads and writes have diverged significantly | Two models to maintain |
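The following is a minimal single-threaded sketch of the write-behind cache row. Writes accumulate in memory, repeated writes to the same key are coalesced, and a batch goes to storage once a size threshold is reached. `WriteBehindBuffer` and `flush_fn` are hypothetical. A production version would also flush on a timer and at shutdown.

```python
from typing import Any, Callable, Dict, List, Tuple


class WriteBehindBuffer:
    """Collects writes in memory and flushes them to storage in batches.

    Repeated writes to the same key are coalesced, so only the latest value
    is persisted. Anything still buffered is lost if the process crashes
    before flush() runs, which is the data-loss risk noted in the table.
    """

    def __init__(self, flush_fn: Callable[[List[Tuple[str, Any]]], None],
                 max_pending: int = 100) -> None:
        self.flush_fn = flush_fn
        self.max_pending = max_pending
        self._pending: Dict[str, Any] = {}

    def write(self, key: str, value: Any) -> None:
        self._pending[key] = value  # overwrite: last write to a key wins
        if len(self._pending) >= self.max_pending:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        batch = list(self._pending.items())
        self.flush_fn(batch)  # one batched write instead of many small ones
        self._pending.clear()  # cleared only after flush_fn succeeds
```

The buffer is cleared only after `flush_fn` returns, so a flush that raises keeps the pending writes for a retry. A crash still loses them.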
Level 3: Traffic Spikes
Symptom: Individual instances can't handle peak load.
| Technique | When | Trade-off |
|---|---|---|
| Horizontal scaling + load balancer | Stateless services | Session management, deploy complexity |
| Auto-scaling | Unpredictable traffic patterns | Cold start latency, cost spikes |
| Rate limiting | Protect against abuse/spikes | Legitimate users may be throttled |
| Circuit breakers | Downstream services degrade | Partial functionality during failures |
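A token bucket is one common rate-limiting algorithm. Fixed and sliding windows are others. The sketch below is a minimal single-process version. The injectable `clock` parameter is an assumption added so the behavior can be tested deterministically. With several instances behind a load balancer, the bucket state would need to live in a shared store instead.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, then `refill_rate` requests/second on average."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should reject, e.g. with HTTP 429 Too Many Requests
```

This limiter implies the trade-off in the table: a legitimate client that bursts past `capacity` is throttled in exactly the same way as an abusive one.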
Level 4: Data Growth
Symptom: A single database can't hold or efficiently query all the data.
| Technique | When | Trade-off |
|---|---|---|
| Table partitioning | Time-series or naturally partitioned data | Query complexity, partition management |
| Archival / cold storage | Old data rarely accessed | Access latency for archived data |
| Database sharding | Partitioning insufficient, clear shard key exists | Cross-shard queries, operational burden |
| Search index (Elasticsearch) | Full-text or complex queries on large datasets | Index lag, another system to operate |
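The sketch below shows hash-based shard routing for the sharding row. It assumes string shard keys and a fixed shard count. `shard_for` and `moved_fraction` are hypothetical helpers, not part of any library.

```python
import hashlib
from typing import Sequence


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), because str hashing is
    randomized per Python process and would route the same key to different
    shards on different servers.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys: Sequence[str], old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_n) != shard_for(k, new_n))
    return moved / len(keys)
```

With plain modulo routing, growing from 4 to 5 shards remaps roughly four out of five keys. This is part of the operational burden. Consistent hashing or a lookup directory is the usual way to reduce that movement.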
Level 5: Multi-Region
Symptom: Users are geographically distributed and latency matters.
| Technique | When | Trade-off |
|---|---|---|
| CDN + edge caching | Static/semi-static content | Cache invalidation |
| Read replicas per region | Read-heavy, slight staleness OK | Replication lag |
| Active-passive failover | Disaster recovery | Failover time, cost of standby |
| Active-active multi-region | True global low-latency required | Conflict resolution, extreme complexity |
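Active-active setups need a conflict-resolution rule. One of the simplest is last-writer-wins (LWW). The sketch below assumes that each write carries a wall-clock timestamp and a region identifier. `Versioned` and `lww_merge` are illustrative names. LWW is only one option (CRDTs and application-level merges are others), and it silently drops one of two concurrent writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int  # writer's wall-clock time (assumes reasonably synced clocks)
    region: str        # tie-breaker so every region picks the same winner


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the version with the later timestamp; break ties by region name.

    The comparison key is deterministic, so merge order does not matter and
    all regions converge on the same value.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))
```

Because the merge is deterministic and order-independent, replicas that exchange versions in any order end up with the same value. Any write that loses is discarded.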
Decision Rule
Always start at Level 0. Move to the next level only when you have measured evidence that the current level is insufficient. Skipping levels is how you end up with Kafka for a TODO app.
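One way to make "measured evidence" concrete is to compare a latency percentile from real measurements against a target. The sketch below assumes you already collect per-request latencies in milliseconds and have chosen a p95 target. The percentile choice, the nearest-rank method, and `should_escalate` are hypothetical choices, not a standard.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_escalate(latencies_ms: Sequence[float], p95_target_ms: float) -> bool:
    """True only if the measured p95 latency exceeds the (hypothetical) target."""
    return percentile(latencies_ms, 95) > p95_target_ms
```

Using a percentile rather than a single slow request means that one outlier does not trigger a move to the next level, while a sustained tail does.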