nomyo-router/race_condition_fixes.md
alpha-nerd-nomyo 067cdf641a feat: add timestamp index and improve cache concurrency
- Added index on token_time_series timestamp for faster queries
- Introduced cache locks to prevent race conditions
2026-01-16 16:47:24 +01:00

131 lines
5.5 KiB
Markdown

# Race Condition Fixes in router.py
This document summarizes all the race condition fixes that have been implemented in the router.py file.
## Summary of Issues Found and Fixed
### 1. Cache Access Race Conditions
**Problem:** The `_models_cache`, `_loaded_models_cache`, and `_error_cache` dictionaries were being accessed concurrently without proper synchronization, leading to potential data corruption and inconsistent state.
**Solution:** Added three new asyncio.Lock objects to protect cache access:
- `_models_cache_lock` - protects `_models_cache`
- `_loaded_models_cache_lock` - protects `_loaded_models_cache`
- `_error_cache_lock` - protects `_error_cache`
**Implementation:**
- All cache reads now acquire the appropriate lock before accessing the cache
- All cache writes now acquire the appropriate lock before modifying the cache
- Cache entries are checked for freshness while holding the lock
- Stale entries are removed while holding the lock
### 2. Timestamp Calculation Race Condition
**Problem:** In the `token_worker()` function, the timestamp calculation was performed inside the lock, which could lead to:
- Inconsistent timestamps across multiple token entries
- Potential delays in processing due to lock contention
- Race conditions if the datetime object was modified between calculation and use
**Solution:** Moved timestamp calculation outside the lock:
- Calculate timestamp before acquiring the buffer lock
- Pass the pre-calculated timestamp into the critical section
- This ensures consistent timestamps while minimizing lock duration
**Implementation:**
```python
# Calculate timestamp once before acquiring lock
now = datetime.now(tz=timezone.utc)
timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp())
# Then use it inside the lock
async with buffer_lock:
time_series_buffer.append({
'endpoint': endpoint,
'model': model,
'input_tokens': prompt,
'output_tokens': comp,
'total_tokens': prompt + comp,
'timestamp': timestamp # Use pre-calculated timestamp
})
```
### 3. Buffer Access Race Conditions
**Problem:** The `token_buffer` and `time_series_buffer` were accessed concurrently from multiple sources (token_worker, flush_buffer, flush_remaining_buffers) without proper synchronization.
**Solution:** The existing `buffer_lock` was correctly used, but we ensured all access patterns properly acquire the lock.
**Implementation:**
- All reads and writes to `token_buffer` are protected by `buffer_lock`
- All reads and writes to `time_series_buffer` are protected by `buffer_lock`
- Buffer copies are made while holding the lock, then released before DB operations
### 4. Usage Count Race Conditions
**Problem:** The `usage_counts` and `token_usage_counts` dictionaries were accessed concurrently without proper synchronization.
**Solution:** The existing `usage_lock` and `token_usage_lock` were correctly used throughout the codebase.
**Implementation:**
- All increments/decrements of `usage_counts` are protected by `usage_lock`
- All increments/decrements of `token_usage_counts` are protected by `token_usage_lock`
- The `publish_snapshot()` function acquires `usage_lock` before creating snapshots
### 5. Subscriber Management Race Conditions
**Problem:** The `_subscribers` set was accessed concurrently without proper synchronization.
**Solution:** The existing `_subscribers_lock` was correctly used throughout the codebase.
**Implementation:**
- All additions/removals from `_subscribers` are protected by `_subscribers_lock`
- The `subscribe()` function acquires the lock before adding new subscribers
- The `unsubscribe()` function acquires the lock before removing subscribers
- The `publish_snapshot()` function acquires the lock before iterating over subscribers
## Testing Recommendations
To verify the race condition fixes:
1. **Concurrent Request Testing:**
- Send multiple simultaneous requests to the same endpoint/model
- Verify that usage counts remain consistent
- Verify that token counts are accurately tracked
2. **Cache Consistency Testing:**
- Rapidly query `/api/tags` and `/api/ps` endpoints
- Verify that cached responses remain consistent
- Verify that stale cache entries are properly invalidated
3. **Timestamp Consistency Testing:**
- Send multiple requests in quick succession
- Verify that timestamps in time_series_buffer are consistent and accurate
- Verify that timestamps are properly rounded to minute boundaries
4. **Load Testing:**
- Simulate high load with many concurrent connections
- Verify that the system remains stable under load
- Verify that no deadlocks occur
## Performance Considerations
The race condition fixes introduce additional locking, which may have performance implications:
1. **Lock Granularity:** The locks are fine-grained (one per cache type), minimizing contention.
2. **Lock Duration:** Locks are held for minimal durations - typically just for cache reads/writes.
3. **Concurrent Operations:** Multiple endpoints/models can be processed concurrently as long as they don't contend for the same cache entry.
4. **Monitoring:** Consider adding metrics to track lock contention and wait times.
## Future Improvements
1. **Read-Write Locks:** Consider using read-write locks if read-heavy workloads are identified.
2. **Cache Invalidation:** Implement more sophisticated cache invalidation strategies.
3. **Lock Timeouts:** Add timeout mechanisms to prevent deadlocks.
4. **Performance Monitoring:** Add instrumentation to track lock contention and performance.