Merge commit 'a8390532f7' as 'ai-context/workbench-ui'

This commit is contained in:
elpresidank 2026-04-05 21:08:02 -05:00
commit 1a72bfdec0
310 changed files with 56430 additions and 0 deletions

View file

@ -0,0 +1,289 @@
# Socket Reliability Refactor
## Overview
This document outlines a comprehensive refactor to address critical issues in the TrustGraph UI WebSocket connection handling that are causing exponential retry storms and excessive logging.
## Current Problems
### Issue #1: Dual Retry System Conflict ⚠️ CRITICAL
**Problem**: Two independent retry mechanisms create multiplicative retry storms:
1. **BaseApi Socket-Level Reconnection** (`trustgraph-socket.ts`)
- Triggers on `onClose()` events
- 10 attempts with exponential backoff (2-60 seconds)
- Handles socket-level connection failures
2. **ServiceCall Request-Level Retries** (`service-call.ts`)
- Triggers on send failures and timeouts
- 3 retries per request with backoff
- **Calls `socket.reopen()` which triggers BaseApi reconnection**
**Result**: Single connection failure → 3 request retries × 10 socket reconnections = **30+ retry attempts**
```typescript
// service-call.ts:160, 174 - PROBLEM LINES
console.log("Reopen...");
this.socket.reopen(); // ← Triggers BaseApi reconnection
```
### Issue #2: SocketProvider Dependency Loop ✅ FIXED
**Status**: Resolved by removing `socket` from dependency array
### Issue #3: Inconsistent Request Retry Backoff ⚠️ MEDIUM
**Location**: `service-call.ts:170`
```typescript
// Inconsistent retry strategies:
setTimeout(this.attempt.bind(this), backoffDelay); // Exponential backoff ✅
setTimeout(this.attempt.bind(this), 500); // Fixed 500ms ❌ (spams)
setTimeout(this.attempt.bind(this), backoffDelay); // Exponential backoff ✅
```
### Issue #4: Concurrent Socket Reopen Calls ⚠️ MEDIUM
**Problem**: Multiple failed requests simultaneously call `socket.reopen()`:
- No coordination between ServiceCalls
- Redundant reconnection attempts
- Race conditions in connection state
## Proposed Solution: Centralized Retry Strategy
### Architectural Decision
**Adopt Option A: Let BaseApi handle ALL reconnection logic**
**Rationale**:
- ✅ Single source of truth for connection state
- ✅ BaseApi already has robust exponential backoff
- ✅ Eliminates retry system conflicts
- ✅ Cleaner separation of concerns
- ✅ Minimal code changes required
### Implementation Plan
#### Phase 1: Remove ServiceCall Reconnection Triggers
**File**: `src/api/trustgraph/service-call.ts`
**Changes**:
1. Remove `this.socket.reopen()` calls (lines 160, 174)
2. Replace with passive waiting for socket reconnection
3. Standardize backoff for all retry paths
```typescript
// BEFORE (service-call.ts:156-161)
console.log("Reopen...");
this.socket.reopen(); // ← REMOVE THIS
// AFTER
console.log("Message send failure, waiting for socket reconnection...");
// Let BaseApi handle reconnection, just retry the request
```
#### Phase 2: Improve Request Queueing Strategy
**Current Behavior**: ServiceCall attempts fail when socket is not ready
**New Behavior**: ServiceCall waits for socket to become available
```typescript
// Enhanced attempt() method logic
attempt() {
if (this.complete) return;
this.retries--;
if (this.retries < 0) {
// Give up after retries exhausted
this.error("Ran out of retries");
return;
}
if (this.socket.ws && this.socket.ws.readyState === WebSocket.OPEN) {
// Socket ready - send message
try {
this.socket.ws.send(JSON.stringify(this.msg));
this.timeoutId = setTimeout(this.onTimeout.bind(this), this.timeout);
} catch (e) {
// Send failed - wait and retry (no socket reopen)
setTimeout(this.attempt.bind(this), this.calculateBackoff());
}
} else {
// Socket not ready - wait for BaseApi to reconnect
console.log("Request", this.mid, "waiting for socket reconnection...");
setTimeout(this.attempt.bind(this), this.calculateBackoff());
}
}
calculateBackoff() {
return Math.min(
SOCKET_RECONNECTION_TIMEOUT * Math.pow(2, 3 - this.retries) + Math.random() * 1000,
30000
);
}
```
#### Phase 3: Enhanced BaseApi Connection Management
**File**: `src/api/trustgraph/trustgraph-socket.ts`
**Improvements**:
1. Add connection state tracking
2. Prevent redundant reconnection attempts
3. Improve logging for debugging
```typescript
class BaseApi {
reconnectionState: 'idle' | 'reconnecting' | 'failed' = 'idle';
scheduleReconnect() {
// Prevent concurrent reconnection attempts
if (this.reconnectionState === 'reconnecting') {
console.log("[socket] Reconnection already in progress, skipping");
return;
}
if (this.reconnectTimer) return;
this.reconnectionState = 'reconnecting';
// ... existing logic
}
onOpen() {
console.log("[socket open]");
this.reconnectAttempts = 0;
this.reconnectionState = 'idle'; // Reset state
// Clear any pending reconnect timer
if (this.reconnectTimer) {
clearTimeout(this.reconnectTimer);
this.reconnectTimer = undefined;
}
}
}
```
## Expected Benefits
### Immediate Impact
- **80-90% reduction in retry attempts** - eliminates dual retry system
- **Cleaner logs** - single source of reconnection messages
- **Predictable behavior** - one retry algorithm instead of two
### Log Message Changes
```
// BEFORE: Chaotic dual retry messages
[socket] Reconnecting in 2000ms (attempt 1)
Request test-123 timed out
Message send failure, retry...
Reopen...
[socket] Reconnecting in 4000ms (attempt 2)
Request test-123 ran out of retries
Request test-456 timed out
Reopen...
[socket] Reconnecting in 8000ms (attempt 3)
// AFTER: Clean, coordinated messages
[socket] Reconnecting in 2000ms (attempt 1)
Request test-123 waiting for socket reconnection...
Request test-456 waiting for socket reconnection...
[socket open]
Request test-123 sent successfully
Request test-456 sent successfully
```
### Performance Improvements
- **Reduced CPU usage** - fewer concurrent timers and retry loops
- **Less network spam** - coordinated reconnection attempts
- **Better user experience** - faster recovery from connection issues
## Risk Assessment
### Low Risk Changes
- ✅ Removing `socket.reopen()` calls from ServiceCall
- ✅ Standardizing backoff calculations
- ✅ Adding connection state tracking
### Potential Issues
- ⚠️ **Request timeout behavior may change** - requests may take longer to fail
- ⚠️ **Need to test edge cases** - rapid API key changes, server restarts
- ⚠️ **Verify inflight request cleanup** - ensure requests don't hang indefinitely
### Mitigation Strategies
1. **Preserve existing timeout behavior** - requests should still timeout appropriately
2. **Add circuit breaker** - stop retrying after socket reconnection gives up
3. **Comprehensive testing** - test connection failure scenarios
## Testing Strategy
### Unit Tests
- Mock WebSocket state transitions
- Verify ServiceCall doesn't trigger socket reopens
- Test backoff calculations are consistent
### Integration Tests
- Test connection failure and recovery scenarios
- Verify request queueing during reconnection
- Test concurrent request handling
### Manual Testing Scenarios
1. **Server shutdown** - verify clean reconnection behavior
2. **Network interruption** - test mobile/wifi scenarios
3. **API key changes** - ensure proper socket recreation
4. **High load** - multiple concurrent requests during connection issues
## Implementation Timeline
### Phase 1: Core Fixes (1-2 hours)
- Remove `socket.reopen()` calls from ServiceCall
- Standardize ServiceCall backoff calculations
- Add basic connection state tracking
### Phase 2: Enhanced Reliability (2-3 hours)
- Implement request queueing improvements
- Add comprehensive logging
- Enhanced error handling
### Phase 3: Testing & Validation (2-4 hours)
- Unit test coverage
- Integration testing
- Performance validation
**Total Estimated Effort**: 5-9 hours
## Success Metrics
### Quantitative Goals
- **Reduce retry attempts by 80%+** during connection failures
- **Eliminate concurrent socket reopen calls**
- **Standardize all retry backoff to exponential**
### Qualitative Goals
- **Cleaner, more understandable logs**
- **Predictable connection recovery behavior**
- **Better separation of concerns in codebase**
## Future Enhancements
### Potential Improvements (Out of Scope)
1. **Request prioritization** - critical requests retry faster
2. **Connection health monitoring** - proactive reconnection
3. **Metrics collection** - track connection reliability
4. **Advanced queueing** - persist important requests across sessions
### Monitoring Additions
```typescript
// Connection reliability metrics
interface SocketMetrics {
connectionAttempts: number;
successfulConnections: number;
averageReconnectionTime: number;
requestsLostDuringReconnection: number;
}
```
## Conclusion
This refactor addresses the root cause of socket retry storms by establishing BaseApi as the single authority for connection management. The changes are surgical and low-risk, focusing on removing the problematic dual retry system while preserving all existing functionality.
**Next Steps**: Implement Phase 1 changes and validate that retry storms are eliminated before proceeding with enhanced features.