feat: move architecture examples and scaling checklist into separate files in system architecture skill

Anish Sarkar 2026-03-15 04:58:58 +05:30
parent b22fe012d5
commit 056d3c456b
3 changed files with 302 additions and 183 deletions


@@ -1,192 +1,42 @@
---
name: system-architecture
description: Design systems with appropriate complexity—no more, no less. Use this skill when the user asks to architect applications, design system boundaries, or make structural decisions. Generates pragmatic architectures that solve real problems without premature abstraction.
description: Design systems with appropriate complexity - no more, no less. Use when the user asks to architect applications, design system boundaries, plan service decomposition, evaluate monolith vs microservices, make scaling decisions, or review structural trade-offs. Applies to new system design, refactoring, and migration planning.
---
This skill guides creation of system architectures that match actual requirements, not imagined future needs. Design real structures with clear boundaries, explicit trade-offs, and appropriate complexity.
# System Architecture
The user provides architecture requirements: a system to design, a scaling challenge, or structural decisions to make. They may include context about team size, expected load, or existing constraints.
Design real structures with clear boundaries, explicit trade-offs, and appropriate complexity. Match architecture to actual requirements, not imagined future needs.
<architecture_design_thinking>
## Design Thinking
## Workflow
Before designing, understand the actual constraints:
When the user requests an architecture, follow these steps:
- **Scale**: What's the real load? 100 users? 10,000? 10 million? Design for 10x current, not 1000x.
- **Team**: How many developers? Microservices for a 3-person team is organizational overhead, not architecture.
- **Lifespan**: Prototype? MVP? Long-term product? Temporary systems need temporary solutions.
- **Change vectors**: What actually varies? Abstract only where you have evidence of variation.
```
Task Progress:
- [ ] Step 1: Clarify constraints
- [ ] Step 2: Identify domains
- [ ] Step 3: Map data flow
- [ ] Step 4: Draw boundaries with rationale
- [ ] Step 5: Run complexity checklist
- [ ] Step 6: Present architecture with trade-offs
```
The goal is to solve the problem with the least complexity that allows future adaptation. Every abstraction, every boundary, every pattern has a cost. Pay only for what you need.
**Step 1 - Clarify constraints.** Ask about:
Then design systems that are:
| Constraint | Question | Why it matters |
|------------|----------|----------------|
| Scale | What's the real load? (users, requests/sec, data size) | Design for 10x current, not 1000x |
| Team | How many developers? How many teams? | Deployable units ≤ number of teams |
| Lifespan | Prototype? MVP? Long-term product? | Temporary systems need temporary solutions |
| Change vectors | What actually varies? | Abstract only where you have evidence of variation |
- Appropriate to actual scale and team size
- Easy to understand and modify
- Explicit about trade-offs and constraints
- Deletable—components can be removed without archaeology
</architecture_design_thinking>
**Step 2 - Identify domains.** Group by business capability, not technical layer. Look for things that change for different reasons and at different rates.
<architecture_guidelines>
## Architecture Guidelines
**Step 3 - Map data flow.** Trace: where does data enter → how does it transform → where does it exit? Make the flow obvious.
### Draw Boundaries at Real Differences
Separate concerns that change for different reasons and at different rates.
**Step 4 - Draw boundaries.** Every boundary needs a reason: different team, different change rate, different compliance requirement, or different scaling need.
<example_good title="Meaningful boundary">
# Users and Billing are separate bounded contexts
# - Different teams own them
# - Different change cadences (users: weekly, billing: quarterly)
# - Different compliance requirements
src/
  users/      # User management domain
    models.py
    services.py
    api.py
  billing/    # Billing domain
    models.py
    services.py
    api.py
  shared/     # Truly shared utilities
    auth.py
</example_good>
<example_bad title="Ceremony without purpose">
# UserService → UserRepository → UserRepositoryImpl
# ...when you'll never swap the database
src/
  interfaces/
    IUserRepository.py     # One implementation exists
  repositories/
    UserRepositoryImpl.py  # Wraps SQLAlchemy, which is already a repository
  services/
    UserService.py         # Just calls the repository
</example_bad>
### Make Dependencies Explicit and Directional
Core logic depends on nothing. Infrastructure depends on core.
<example_good title="Clear dependency direction">
# Dependency flows inward: infrastructure → application → domain
domain/            # Pure business logic, no imports from outer layers
  order.py         # Order entity with business rules
application/       # Use cases, orchestrates domain
  place_order.py   # Imports from domain/, not infrastructure/
infrastructure/    # External concerns
  postgres.py      # Implements persistence, imports from application/
  stripe.py        # Implements payments
</example_good>
### Follow the Data
Architecture should make data flow obvious. Where does it enter? How does it transform? Where does it exit?
<example_good title="Obvious data flow">
Request → Validate → Transform → Store → Respond
# Each step is a clear function/module:
api/routes.py # Request enters
validators.py # Validation
transformers.py # Business logic transformation
repositories.py # Storage
serializers.py # Response shaping
</example_good>
### Design for Failure
Networks fail. Databases time out. Services crash. Build this into the structure.
<example_good title="Failure-aware design">
class OrderService:
    def place_order(self, order: Order) -> Result:
        # Explicit failure handling at boundaries
        inventory = self.inventory.reserve(order.items)
        if inventory.failed:
            return Result.failure("Items unavailable", retry=False)
        payment = self.payments.charge(order.total)
        if payment.failed:
            self.inventory.release(inventory.reservation_id)  # Compensate
            return Result.failure("Payment failed", retry=True)
        return Result.success(order)
</example_good>
### Design for Operations
You will debug this at 3am. Can you trace a request? Can you replay a failure?
<example_good title="Observable architecture">
# Every request gets a correlation ID
# Every service logs with that ID
# Every error includes context for reproduction
@trace
def handle_request(request):
    log.info("Processing", request_id=request.id, user=request.user_id)
    try:
        result = process(request)
        log.info("Completed", request_id=request.id, result=result.status)
        return result
    except Exception as e:
        log.error("Failed", request_id=request.id, error=str(e),
                  context=request.to_dict())  # Full context for replay
        raise
</example_good>
</architecture_guidelines>
<architecture_anti_patterns>
## Patterns to Avoid
Avoid premature distribution:
- Microservices before you've earned them → Start with a well-structured monolith
- Event sourcing for CRUD apps → Use simple state storage
- Message queues between functions in the same process → Just call the function
- Distributed transactions → Redesign to avoid them or accept eventual consistency
Avoid abstraction theater:
- Repository wrapping an ORM → The ORM is already a repository
- Interfaces with one implementation "for testing" → Mock at boundaries, not everywhere
- AbstractFactoryFactoryBean → Just instantiate the thing
- DI containers for simple object graphs → Constructor injection is enough
Avoid cargo cult patterns:
- "Clean architecture" with 7 layers for a TODO app → Match layers to actual complexity
- DDD tactical patterns without strategic design → Aggregates need bounded contexts
- Hexagonal ports when you have one adapter → Just call the database
- CQRS when reads and writes are identical → Add complexity when reads/writes diverge
Avoid future-proofing:
- "We might need to swap databases" → You won't; if you do, you'll rewrite anyway
- "This could become multi-tenant" → Build it when you have the second tenant
- "Microservices will help us scale the team" → They help at 50+ engineers, not 4
</architecture_anti_patterns>
<architecture_success_criteria>
## Success Criteria
Your architecture is right-sized when:
1. **You can draw it**: The dependency graph fits on a whiteboard. If it doesn't, simplify.
2. **You can explain it**: A new team member understands data flow in under 30 minutes.
3. **You can change it**: Adding a feature touches 1-3 modules, not 10.
4. **You can delete it**: Removing a component doesn't require archaeology to find hidden dependencies.
5. **You can debug it**: Tracing a request from entry to exit takes minutes, not hours.
6. **It matches your team**: Number of deployable units ≤ number of teams.
</architecture_success_criteria>
<architecture_complexity_decisions>
## When to Add Complexity
Add complexity when you have:
- **Measured evidence**: Profiling shows the bottleneck, not intuition
- **Proven need**: The simpler solution has failed or is failing
- **Operational capacity**: The team can deploy, monitor, and debug the added complexity
- **Clear benefit**: The gain exceeds the cognitive and operational cost
A checklist before adding architectural complexity:
**Step 5 - Run complexity checklist.** Before adding any non-trivial pattern:
```
[ ] Have I tried the simple solution?
@@ -197,17 +47,90 @@ A checklist before adding architectural complexity:
```
If any answer is "no", keep it simple.
</architecture_complexity_decisions>
<architecture_iteration>
**Step 6 - Present the architecture** using the output template below.
## Output Template
```markdown
### System: [Name]
**Constraints**:
- Scale: [current and expected load]
- Team: [size and structure]
- Lifespan: [prototype / MVP / long-term]
**Architecture**:
[Component diagram or description of components and their relationships]
**Data Flow**:
[How data enters → transforms → exits]
**Key Boundaries**:
| Boundary | Reason | Change Rate |
|----------|--------|-------------|
| ... | ... | ... |
**Trade-offs**:
- Chose X over Y because [reason]
- Accepted [limitation] to gain [benefit]
**Complexity Justification**:
- [Each non-trivial pattern] → [why it's needed, with evidence]
```
## Core Principles
1. **Boundaries at real differences.** Separate concerns that change for different reasons and at different rates.
2. **Dependencies flow inward.** Core logic depends on nothing. Infrastructure depends on core.
3. **Follow the data.** Architecture should make data flow obvious.
4. **Design for failure.** Networks fail. Databases time out. Build compensation into the structure.
5. **Design for operations.** You will debug this at 3am. Every request needs a trace. Every error needs context for replay.
For concrete good/bad examples of each principle, see [examples.md](examples.md).
## Anti-Patterns
| Don't | Do Instead |
|-------|------------|
| Microservices for a 3-person team | Well-structured monolith |
| Event sourcing for CRUD | Simple state storage |
| Message queues within the same process | Just call the function |
| Distributed transactions | Redesign to avoid, or accept eventual consistency |
| Repository wrapping an ORM | Use the ORM directly |
| Interfaces with one implementation | Mock at boundaries only |
| AbstractFactoryFactoryBean | Just instantiate the thing |
| DI containers for simple graphs | Constructor injection is enough |
| Clean Architecture for a TODO app | Match layers to actual complexity |
| DDD tactics without strategic design | Aggregates need bounded contexts |
| Hexagonal ports with one adapter | Just call the database |
| CQRS when reads = writes | Add when they diverge |
| "We might swap databases" | You won't; rewrite if you do |
| "Multi-tenant someday" | Build it when you have tenant #2 |
| "Microservices for team scale" | Helps at 50+ engineers, not 4 |
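The "Constructor injection is enough" row can be sketched as follows (a minimal illustration with made-up class names, not a prescribed API):

```python
# Constructor injection: wire dependencies at one composition root.
# No DI container needed for a small object graph.
class EmailSender:
    def send(self, to: str, body: str) -> str:
        return f"sent to {to}"

class SignupService:
    def __init__(self, sender: EmailSender):
        self.sender = sender  # passed in, so tests can pass a fake instead

    def signup(self, email: str) -> str:
        return self.sender.send(email, "welcome")

# Composition root: the one place objects are wired together.
service = SignupService(EmailSender())
```

Swapping the sender for a test double is a one-line change at the composition root; no container, registry, or interface file required.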
## Success Criteria
Your architecture is right-sized when:
1. **You can draw it** - dependency graph fits on a whiteboard
2. **You can explain it** - new team member understands data flow in 30 minutes
3. **You can change it** - adding a feature touches 1-3 modules, not 10
4. **You can delete it** - removing a component needs no archaeology
5. **You can debug it** - tracing a request takes minutes, not hours
6. **It matches your team** - deployable units ≤ number of teams
## When the Simple Solution Isn't Enough
If the complexity checklist says "yes, scale is real", see [scaling-checklist.md](scaling-checklist.md) for concrete techniques covering caching, async processing, partitioning, horizontal scaling, and multi-region.
## Iterative Architecture
Architecture is discovered, not designed upfront. For complex systems:
Architecture is discovered, not designed upfront:
1. **Start with the obvious structure**: Group by domain, not by technical layer
2. **Let hotspots emerge**: Monitor which modules change together—they might belong together
3. **Extract when painful**: Only split a module when its current form causes measurable problems
4. **Document decisions**: Record why boundaries exist so future you knows what's load-bearing
</architecture_iteration>
1. **Start obvious** - group by domain, not by technical layer
2. **Let hotspots emerge** - monitor which modules change together
3. **Extract when painful** - split only when the current form causes measurable problems
4. **Document decisions** - record why boundaries exist so future you knows what's load-bearing
Every senior engineer has a graveyard of over-engineered systems they regret. Learn from their pain. Build boring systems that work.


@@ -0,0 +1,120 @@
# Architecture Examples
Concrete good/bad examples for each core principle in SKILL.md.
---
## Boundaries at Real Differences
**Good** - Meaningful boundary:
```
# Users and Billing are separate bounded contexts
# - Different teams own them
# - Different change cadences (users: weekly, billing: quarterly)
# - Different compliance requirements
src/
  users/      # User management domain
    models.py
    services.py
    api.py
  billing/    # Billing domain
    models.py
    services.py
    api.py
  shared/     # Truly shared utilities
    auth.py
```
**Bad** - Ceremony without purpose:
```
# UserService → UserRepository → UserRepositoryImpl
# ...when you'll never swap the database
src/
  interfaces/
    IUserRepository.py     # One implementation exists
  repositories/
    UserRepositoryImpl.py  # Wraps SQLAlchemy, which is already a repository
  services/
    UserService.py         # Just calls the repository
```
---
## Dependencies Flow Inward
**Good** - Clear dependency direction:
```
# Dependency flows inward: infrastructure → application → domain
domain/            # Pure business logic, no imports from outer layers
  order.py         # Order entity with business rules
application/       # Use cases, orchestrates domain
  place_order.py   # Imports from domain/, not infrastructure/
infrastructure/    # External concerns
  postgres.py      # Implements persistence, imports from application/
  stripe.py        # Implements payments
```
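The same inward direction can be sketched at code level (module and function names are illustrative, not from the source):

```python
# domain/order.py - pure business rules, imports nothing from outer layers.
class Order:
    def __init__(self, items):
        self.items = items  # list of (name, price) pairs

    def total(self) -> int:
        return sum(price for _, price in self.items)

# application/place_order.py - depends only on the domain.
def place_order(order: Order, save) -> int:
    # 'save' is injected by infrastructure, so this layer stays
    # persistence-free; a Postgres or in-memory saver both fit.
    save(order)
    return order.total()
```

Infrastructure supplies `save` at the edge; the domain and application layers never import it.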
---
## Follow the Data
**Good** - Obvious data flow:
```
Request → Validate → Transform → Store → Respond
# Each step is a clear function/module:
api/routes.py # Request enters
validators.py # Validation
transformers.py # Business logic transformation
repositories.py # Storage
serializers.py # Response shaping
```
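The pipeline above can be sketched as plain functions (a minimal illustration; the field names and validation rule are invented for the example):

```python
# Request -> Validate -> Transform -> Store -> Respond, one step per function.
def validate(payload: dict) -> dict:
    if "amount" not in payload:
        raise ValueError("amount required")
    return payload

def transform(payload: dict) -> dict:
    # Business transformation: dollars to integer cents.
    return {"amount_cents": int(payload["amount"] * 100)}

def handle(payload: dict, store) -> dict:
    record = transform(validate(payload))
    store(record)                       # storage
    return {"status": "ok", **record}   # response shaping
```

Each stage has one job, so tracing where data enters, changes, and exits means reading four short functions in order.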
---
## Design for Failure
**Good** - Failure-aware design with compensation:
```python
class OrderService:
    def place_order(self, order: Order) -> Result:
        inventory = self.inventory.reserve(order.items)
        if inventory.failed:
            return Result.failure("Items unavailable", retry=False)
        payment = self.payments.charge(order.total)
        if payment.failed:
            self.inventory.release(inventory.reservation_id)  # Compensate
            return Result.failure("Payment failed", retry=True)
        return Result.success(order)
```
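The `Result` type is assumed by the example above; a minimal sketch of what it might look like (names chosen to match the example, not a real library):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Result:
    ok: bool
    value: Any = None
    error: Optional[str] = None
    retry: bool = False  # lets callers distinguish retryable failures

    @classmethod
    def success(cls, value: Any) -> "Result":
        return cls(ok=True, value=value)

    @classmethod
    def failure(cls, error: str, retry: bool = False) -> "Result":
        return cls(ok=False, error=error, retry=retry)

    @property
    def failed(self) -> bool:
        return not self.ok
```

Returning a `Result` instead of raising makes failure handling explicit at service boundaries, which is the point of the compensation example.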
---
## Design for Operations
**Good** - Observable architecture:
```python
@trace
def handle_request(request):
    log.info("Processing", request_id=request.id, user=request.user_id)
    try:
        result = process(request)
        log.info("Completed", request_id=request.id, result=result.status)
        return result
    except Exception as e:
        log.error("Failed", request_id=request.id, error=str(e),
                  context=request.to_dict())  # Full context for replay
        raise
```
Key elements:
- Every request gets a correlation ID
- Every service logs with that ID
- Every error includes full context for reproduction


@@ -0,0 +1,76 @@
# Scaling Checklist
Concrete techniques for when the complexity checklist in SKILL.md confirms scale is a real problem. Apply in order - each level solves the previous level's bottleneck.
---
## Level 0: Optimize First
Before adding infrastructure, exhaust these:
- [ ] Database queries have proper indexes
- [ ] N+1 queries eliminated
- [ ] Connection pooling configured
- [ ] Slow endpoints profiled and optimized
- [ ] Static assets served via CDN
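The N+1 item deserves a concrete shape. A language-level sketch, with hypothetical fetch functions standing in for real queries:

```python
# N+1 vs batched: one query per order vs one query for all customers.
def load_orders_n_plus_one(orders, fetch_customer):
    # 1 query for orders + N queries for customers - avoid this.
    return [(o, fetch_customer(o["customer_id"])) for o in orders]

def load_orders_batched(orders, fetch_customers_by_ids):
    # 1 query for orders + 1 query for all needed customers.
    ids = {o["customer_id"] for o in orders}
    customers = fetch_customers_by_ids(ids)  # returns {id: customer}
    return [(o, customers[o["customer_id"]]) for o in orders]
```

Most ORMs expose this as eager loading; the win is the same either way: query count stops growing with result-set size.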
## Level 1: Read-Heavy
**Symptom**: Database reads are the bottleneck.
| Technique | When | Trade-off |
|-----------|------|-----------|
| Application cache (in-memory) | Small, frequently accessed data | Stale data, memory pressure |
| Redis/Memcached | Shared cache across instances | Network hop, cache invalidation complexity |
| Read replicas | High read volume, slight staleness OK | Replication lag, eventual consistency |
| CDN | Static or semi-static content | Cache invalidation delay |
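The application-cache row, with its stale-data trade-off, can be sketched in a few lines (illustrative only: no eviction policy, not thread-safe):

```python
# Minimal in-process TTL cache sketch.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: the staleness trade-off made explicit
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
```

The TTL is the staleness budget: raising it cuts database load, lowering it cuts the window in which users see old data.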
## Level 2: Write-Heavy
**Symptom**: Database writes or processing are the bottleneck.
| Technique | When | Trade-off |
|-----------|------|-----------|
| Async task queue (Celery, SQS) | Work can be deferred | Eventual consistency, failure handling |
| Write-behind cache | Batch frequent writes | Data loss risk on crash |
| Event streaming (Kafka) | Multiple consumers of same data | Operational complexity, ordering guarantees |
| CQRS | Reads and writes have diverged significantly | Two models to maintain |
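The async-task-queue row can be illustrated with an in-process stand-in for Celery or SQS (this shows the deferral pattern, not the durability or retry properties of a real broker):

```python
# Defer expensive work off the request path via a background worker.
import queue
import threading

tasks = queue.Queue()
results = []

def worker():
    while True:
        job = tasks.get()
        if job is None:  # shutdown sentinel
            break
        results.append(job())  # expensive work happens here, later
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_request(payload):
    # Respond immediately; processing is eventually consistent.
    tasks.put(lambda: f"processed {payload}")
    return "accepted"
```

The request path now does a cheap enqueue; the trade-off from the table shows up as the gap between "accepted" and "processed", and as the failure handling the worker must own.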
## Level 3: Traffic Spikes
**Symptom**: Individual instances can't handle peak load.
| Technique | When | Trade-off |
|-----------|------|-----------|
| Horizontal scaling + load balancer | Stateless services | Session management, deploy complexity |
| Auto-scaling | Unpredictable traffic patterns | Cold start latency, cost spikes |
| Rate limiting | Protect against abuse/spikes | Legitimate users may be throttled |
| Circuit breakers | Downstream services degrade | Partial functionality during failures |
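The circuit-breaker row, sketched minimally (illustrative; production deployments typically use a library and add half-open probing and per-call timeouts):

```python
# Fail fast once a downstream dependency keeps failing.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int, reset_after: float):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the call, degrade gracefully
            self.opened_at = None  # window elapsed: allow a retry
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0
        return result
```

While the breaker is open, callers get the fallback instantly instead of queueing behind a dying dependency; that is the "partial functionality" trade-off in the table.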
## Level 4: Data Growth
**Symptom**: Single database can't hold or query all the data efficiently.
| Technique | When | Trade-off |
|-----------|------|-----------|
| Table partitioning | Time-series or naturally partitioned data | Query complexity, partition management |
| Archival / cold storage | Old data rarely accessed | Access latency for archived data |
| Database sharding | Partitioning insufficient, clear shard key exists | Cross-shard queries, operational burden |
| Search index (Elasticsearch) | Full-text or complex queries on large datasets | Index lag, another system to operate |
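For the sharding row, the core mechanism is a stable shard key. A sketch (the shard count and key are illustrative):

```python
# Route each row to one database by hashing a stable shard key.
import hashlib

NUM_SHARDS = 4  # illustrative

def shard_for(user_id: str) -> int:
    # Use a stable hash; Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Everything keyed by `user_id` lands on one shard, so single-user queries stay local; the table's trade-off appears the moment a query spans users, or when `NUM_SHARDS` must change and rows have to move.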
## Level 5: Multi-Region
**Symptom**: Users are geographically distributed, latency matters.
| Technique | When | Trade-off |
|-----------|------|-----------|
| CDN + edge caching | Static/semi-static content | Cache invalidation |
| Read replicas per region | Read-heavy, slight staleness OK | Replication lag |
| Active-passive failover | Disaster recovery | Failover time, cost of standby |
| Active-active multi-region | True global low-latency required | Conflict resolution, extreme complexity |
---
## Decision Rule
Always start at Level 0. Move to the next level only when you have **measured evidence** that the current level is insufficient. Skipping levels is how you end up with Kafka for a TODO app.