# Socket Reliability Refactor ## Overview This document outlines a comprehensive refactor to address critical issues in the TrustGraph UI WebSocket connection handling that are causing exponential retry storms and excessive logging. ## Current Problems ### Issue #1: Dual Retry System Conflict ⚠️ CRITICAL **Problem**: Two independent retry mechanisms create multiplicative retry storms: 1. **BaseApi Socket-Level Reconnection** (`trustgraph-socket.ts`) - Triggers on `onClose()` events - 10 attempts with exponential backoff (2-60 seconds) - Handles socket-level connection failures 2. **ServiceCall Request-Level Retries** (`service-call.ts`) - Triggers on send failures and timeouts - 3 retries per request with backoff - **Calls `socket.reopen()` which triggers BaseApi reconnection** **Result**: Single connection failure → 3 request retries × 10 socket reconnections = **30+ retry attempts** ```typescript // service-call.ts:160, 174 - PROBLEM LINES console.log("Reopen..."); this.socket.reopen(); // ← Triggers BaseApi reconnection ``` ### Issue #2: SocketProvider Dependency Loop ✅ FIXED **Status**: Resolved by removing `socket` from dependency array ### Issue #3: Inconsistent Request Retry Backoff ⚠️ MEDIUM **Location**: `service-call.ts:170` ```typescript // Inconsistent retry strategies: setTimeout(this.attempt.bind(this), backoffDelay); // Exponential backoff ✅ setTimeout(this.attempt.bind(this), 500); // Fixed 500ms ❌ (spams) setTimeout(this.attempt.bind(this), backoffDelay); // Exponential backoff ✅ ``` ### Issue #4: Concurrent Socket Reopen Calls ⚠️ MEDIUM **Problem**: Multiple failed requests simultaneously call `socket.reopen()`: - No coordination between ServiceCalls - Redundant reconnection attempts - Race conditions in connection state ## Proposed Solution: Centralized Retry Strategy ### Architectural Decision **Adopt Option A: Let BaseApi handle ALL reconnection logic** **Rationale**: - ✅ Single source of truth for connection state - ✅ BaseApi already has robust exponential backoff - ✅ Eliminates retry system conflicts - ✅ Cleaner separation of concerns - ✅ Minimal code changes required ### Implementation Plan #### Phase 1: Remove ServiceCall Reconnection Triggers **File**: `src/api/trustgraph/service-call.ts` **Changes**: 1. Remove `this.socket.reopen()` calls (lines 160, 174) 2. Replace with passive waiting for socket reconnection 3. Standardize backoff for all retry paths ```typescript // BEFORE (service-call.ts:156-161) console.log("Reopen..."); this.socket.reopen(); // ← REMOVE THIS // AFTER console.log("Message send failure, waiting for socket reconnection..."); // Let BaseApi handle reconnection, just retry the request ``` #### Phase 2: Improve Request Queueing Strategy **Current Behavior**: ServiceCall attempts fail when socket is not ready **New Behavior**: ServiceCall waits for socket to become available ```typescript // Enhanced attempt() method logic attempt() { if (this.complete) return; this.retries--; if (this.retries < 0) { // Give up after retries exhausted this.error("Ran out of retries"); return; } if (this.socket.ws && this.socket.ws.readyState === WebSocket.OPEN) { // Socket ready - send message try { this.socket.ws.send(JSON.stringify(this.msg)); this.timeoutId = setTimeout(this.onTimeout.bind(this), this.timeout); } catch (e) { // Send failed - wait and retry (no socket reopen) setTimeout(this.attempt.bind(this), this.calculateBackoff()); } } else { // Socket not ready - wait for BaseApi to reconnect console.log("Request", this.mid, "waiting for socket reconnection..."); setTimeout(this.attempt.bind(this), this.calculateBackoff()); } } calculateBackoff() { return Math.min( SOCKET_RECONNECTION_TIMEOUT * Math.pow(2, 3 - this.retries) + Math.random() * 1000, 30000 ); } ``` #### Phase 3: Enhanced BaseApi Connection Management **File**: `src/api/trustgraph/trustgraph-socket.ts` **Improvements**: 1. Add connection state tracking 2. Prevent redundant reconnection attempts 3. Improve logging for debugging ```typescript class BaseApi { reconnectionState: 'idle' | 'reconnecting' | 'failed' = 'idle'; scheduleReconnect() { // Prevent concurrent reconnection attempts if (this.reconnectionState === 'reconnecting') { console.log("[socket] Reconnection already in progress, skipping"); return; } if (this.reconnectTimer) return; this.reconnectionState = 'reconnecting'; // ... existing logic } onOpen() { console.log("[socket open]"); this.reconnectAttempts = 0; this.reconnectionState = 'idle'; // Reset state // Clear any pending reconnect timer if (this.reconnectTimer) { clearTimeout(this.reconnectTimer); this.reconnectTimer = undefined; } } } ``` ## Expected Benefits ### Immediate Impact - **80-90% reduction in retry attempts** - eliminates dual retry system - **Cleaner logs** - single source of reconnection messages - **Predictable behavior** - one retry algorithm instead of two ### Log Message Changes ``` // BEFORE: Chaotic dual retry messages [socket] Reconnecting in 2000ms (attempt 1) Request test-123 timed out Message send failure, retry... Reopen... [socket] Reconnecting in 4000ms (attempt 2) Request test-123 ran out of retries Request test-456 timed out Reopen... [socket] Reconnecting in 8000ms (attempt 3) // AFTER: Clean, coordinated messages [socket] Reconnecting in 2000ms (attempt 1) Request test-123 waiting for socket reconnection... Request test-456 waiting for socket reconnection... [socket open] Request test-123 sent successfully Request test-456 sent successfully ``` ### Performance Improvements - **Reduced CPU usage** - fewer concurrent timers and retry loops - **Less network spam** - coordinated reconnection attempts - **Better user experience** - faster recovery from connection issues ## Risk Assessment ### Low Risk Changes - ✅ Removing `socket.reopen()` calls from ServiceCall - ✅ Standardizing backoff calculations - ✅ Adding connection state tracking ### Potential Issues - ⚠️ **Request timeout behavior may change** - requests may take longer to fail - ⚠️ **Need to test edge cases** - rapid API key changes, server restarts - ⚠️ **Verify inflight request cleanup** - ensure requests don't hang indefinitely ### Mitigation Strategies 1. **Preserve existing timeout behavior** - requests should still timeout appropriately 2. **Add circuit breaker** - stop retrying after socket reconnection gives up 3. **Comprehensive testing** - test connection failure scenarios ## Testing Strategy ### Unit Tests - Mock WebSocket state transitions - Verify ServiceCall doesn't trigger socket reopens - Test backoff calculations are consistent ### Integration Tests - Test connection failure and recovery scenarios - Verify request queueing during reconnection - Test concurrent request handling ### Manual Testing Scenarios 1. **Server shutdown** - verify clean reconnection behavior 2. **Network interruption** - test mobile/wifi scenarios 3. **API key changes** - ensure proper socket recreation 4. **High load** - multiple concurrent requests during connection issues ## Implementation Timeline ### Phase 1: Core Fixes (1-2 hours) - Remove `socket.reopen()` calls from ServiceCall - Standardize ServiceCall backoff calculations - Add basic connection state tracking ### Phase 2: Enhanced Reliability (2-3 hours) - Implement request queueing improvements - Add comprehensive logging - Enhanced error handling ### Phase 3: Testing & Validation (2-4 hours) - Unit test coverage - Integration testing - Performance validation **Total Estimated Effort**: 5-9 hours ## Success Metrics ### Quantitative Goals - **Reduce retry attempts by 80%+** during connection failures - **Eliminate concurrent socket reopen calls** - **Standardize all retry backoff to exponential** ### Qualitative Goals - **Cleaner, more understandable logs** - **Predictable connection recovery behavior** - **Better separation of concerns in codebase** ## Future Enhancements ### Potential Improvements (Out of Scope) 1. **Request prioritization** - critical requests retry faster 2. **Connection health monitoring** - proactive reconnection 3. **Metrics collection** - track connection reliability 4. **Advanced queueing** - persist important requests across sessions ### Monitoring Additions ```typescript // Connection reliability metrics interface SocketMetrics { connectionAttempts: number; successfulConnections: number; averageReconnectionTime: number; requestsLostDuringReconnection: number; } ``` ## Conclusion This refactor addresses the root cause of socket retry storms by establishing BaseApi as the single authority for connection management. The changes are surgical and low-risk, focusing on removing the problematic dual retry system while preserving all existing functionality. **Next Steps**: Implement Phase 1 changes and validate that retry storms are eliminated before proceeding with enhanced features.