mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-07-01 17:39:39 +02:00
Squashed 'ai-context/workbench-ui/' content from commit 32e36a5c
git-subtree-dir: ai-context/workbench-ui git-subtree-split: 32e36a5c2131e429a7081cfaf67dabad3193cda3
This commit is contained in:
commit
a8390532f7
310 changed files with 56430 additions and 0 deletions
289
docs/tech-specs/socket-reliability.md
Normal file
289
docs/tech-specs/socket-reliability.md
Normal file
|
|
@ -0,0 +1,289 @@
|
|||
# Socket Reliability Refactor
|
||||
|
||||
## Overview
|
||||
|
||||
This document outlines a comprehensive refactor to address critical issues in the TrustGraph UI WebSocket connection handling that are causing exponential retry storms and excessive logging.
|
||||
|
||||
## Current Problems
|
||||
|
||||
### Issue #1: Dual Retry System Conflict ⚠️ CRITICAL
|
||||
|
||||
**Problem**: Two independent retry mechanisms create multiplicative retry storms:
|
||||
|
||||
1. **BaseApi Socket-Level Reconnection** (`trustgraph-socket.ts`)
|
||||
- Triggers on `onClose()` events
|
||||
- 10 attempts with exponential backoff (2-60 seconds)
|
||||
- Handles socket-level connection failures
|
||||
|
||||
2. **ServiceCall Request-Level Retries** (`service-call.ts`)
|
||||
- Triggers on send failures and timeouts
|
||||
- 3 retries per request with backoff
|
||||
- **Calls `socket.reopen()` which triggers BaseApi reconnection**
|
||||
|
||||
**Result**: Single connection failure → 3 request retries × 10 socket reconnections = **30+ retry attempts**
|
||||
|
||||
```typescript
|
||||
// service-call.ts:160, 174 - PROBLEM LINES
|
||||
console.log("Reopen...");
|
||||
this.socket.reopen(); // ← Triggers BaseApi reconnection
|
||||
```
|
||||
|
||||
### Issue #2: SocketProvider Dependency Loop ✅ FIXED
|
||||
**Status**: Resolved by removing `socket` from dependency array
|
||||
|
||||
### Issue #3: Inconsistent Request Retry Backoff ⚠️ MEDIUM
|
||||
**Location**: `service-call.ts:170`
|
||||
|
||||
```typescript
|
||||
// Inconsistent retry strategies:
|
||||
setTimeout(this.attempt.bind(this), backoffDelay); // Exponential backoff ✅
|
||||
setTimeout(this.attempt.bind(this), 500); // Fixed 500ms ❌ (spams)
|
||||
setTimeout(this.attempt.bind(this), backoffDelay); // Exponential backoff ✅
|
||||
```
|
||||
|
||||
### Issue #4: Concurrent Socket Reopen Calls ⚠️ MEDIUM
|
||||
**Problem**: Multiple failed requests simultaneously call `socket.reopen()`:
|
||||
- No coordination between ServiceCalls
|
||||
- Redundant reconnection attempts
|
||||
- Race conditions in connection state
|
||||
|
||||
## Proposed Solution: Centralized Retry Strategy
|
||||
|
||||
### Architectural Decision
|
||||
|
||||
**Adopt Option A: Let BaseApi handle ALL reconnection logic**
|
||||
|
||||
**Rationale**:
|
||||
- ✅ Single source of truth for connection state
|
||||
- ✅ BaseApi already has robust exponential backoff
|
||||
- ✅ Eliminates retry system conflicts
|
||||
- ✅ Cleaner separation of concerns
|
||||
- ✅ Minimal code changes required
|
||||
|
||||
### Implementation Plan
|
||||
|
||||
#### Phase 1: Remove ServiceCall Reconnection Triggers
|
||||
|
||||
**File**: `src/api/trustgraph/service-call.ts`
|
||||
|
||||
**Changes**:
|
||||
1. Remove `this.socket.reopen()` calls (lines 160, 174)
|
||||
2. Replace with passive waiting for socket reconnection
|
||||
3. Standardize backoff for all retry paths
|
||||
|
||||
```typescript
|
||||
// BEFORE (service-call.ts:156-161)
|
||||
console.log("Reopen...");
|
||||
this.socket.reopen(); // ← REMOVE THIS
|
||||
|
||||
// AFTER
|
||||
console.log("Message send failure, waiting for socket reconnection...");
|
||||
// Let BaseApi handle reconnection, just retry the request
|
||||
```
|
||||
|
||||
#### Phase 2: Improve Request Queueing Strategy
|
||||
|
||||
**Current Behavior**: ServiceCall attempts fail when socket is not ready
|
||||
|
||||
**New Behavior**: ServiceCall waits for socket to become available
|
||||
|
||||
```typescript
|
||||
// Enhanced attempt() method logic
|
||||
attempt() {
|
||||
if (this.complete) return;
|
||||
|
||||
this.retries--;
|
||||
if (this.retries < 0) {
|
||||
// Give up after retries exhausted
|
||||
this.error("Ran out of retries");
|
||||
return;
|
||||
}
|
||||
|
||||
if (this.socket.ws && this.socket.ws.readyState === WebSocket.OPEN) {
|
||||
// Socket ready - send message
|
||||
try {
|
||||
this.socket.ws.send(JSON.stringify(this.msg));
|
||||
this.timeoutId = setTimeout(this.onTimeout.bind(this), this.timeout);
|
||||
} catch (e) {
|
||||
// Send failed - wait and retry (no socket reopen)
|
||||
setTimeout(this.attempt.bind(this), this.calculateBackoff());
|
||||
}
|
||||
} else {
|
||||
// Socket not ready - wait for BaseApi to reconnect
|
||||
console.log("Request", this.mid, "waiting for socket reconnection...");
|
||||
setTimeout(this.attempt.bind(this), this.calculateBackoff());
|
||||
}
|
||||
}
|
||||
|
||||
calculateBackoff() {
|
||||
return Math.min(
|
||||
SOCKET_RECONNECTION_TIMEOUT * Math.pow(2, 3 - this.retries) + Math.random() * 1000,
|
||||
30000
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
#### Phase 3: Enhanced BaseApi Connection Management
|
||||
|
||||
**File**: `src/api/trustgraph/trustgraph-socket.ts`
|
||||
|
||||
**Improvements**:
|
||||
1. Add connection state tracking
|
||||
2. Prevent redundant reconnection attempts
|
||||
3. Improve logging for debugging
|
||||
|
||||
```typescript
|
||||
class BaseApi {
|
||||
reconnectionState: 'idle' | 'reconnecting' | 'failed' = 'idle';
|
||||
|
||||
scheduleReconnect() {
|
||||
// Prevent concurrent reconnection attempts
|
||||
if (this.reconnectionState === 'reconnecting') {
|
||||
console.log("[socket] Reconnection already in progress, skipping");
|
||||
return;
|
||||
}
|
||||
|
||||
if (this.reconnectTimer) return;
|
||||
|
||||
this.reconnectionState = 'reconnecting';
|
||||
// ... existing logic
|
||||
}
|
||||
|
||||
onOpen() {
|
||||
console.log("[socket open]");
|
||||
this.reconnectAttempts = 0;
|
||||
this.reconnectionState = 'idle'; // Reset state
|
||||
|
||||
// Clear any pending reconnect timer
|
||||
if (this.reconnectTimer) {
|
||||
clearTimeout(this.reconnectTimer);
|
||||
this.reconnectTimer = undefined;
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Expected Benefits
|
||||
|
||||
### Immediate Impact
|
||||
- **80-90% reduction in retry attempts** - eliminates dual retry system
|
||||
- **Cleaner logs** - single source of reconnection messages
|
||||
- **Predictable behavior** - one retry algorithm instead of two
|
||||
|
||||
### Log Message Changes
|
||||
```
|
||||
// BEFORE: Chaotic dual retry messages
|
||||
[socket] Reconnecting in 2000ms (attempt 1)
|
||||
Request test-123 timed out
|
||||
Message send failure, retry...
|
||||
Reopen...
|
||||
[socket] Reconnecting in 4000ms (attempt 2)
|
||||
Request test-123 ran out of retries
|
||||
Request test-456 timed out
|
||||
Reopen...
|
||||
[socket] Reconnecting in 8000ms (attempt 3)
|
||||
|
||||
// AFTER: Clean, coordinated messages
|
||||
[socket] Reconnecting in 2000ms (attempt 1)
|
||||
Request test-123 waiting for socket reconnection...
|
||||
Request test-456 waiting for socket reconnection...
|
||||
[socket open]
|
||||
Request test-123 sent successfully
|
||||
Request test-456 sent successfully
|
||||
```
|
||||
|
||||
### Performance Improvements
|
||||
- **Reduced CPU usage** - fewer concurrent timers and retry loops
|
||||
- **Less network spam** - coordinated reconnection attempts
|
||||
- **Better user experience** - faster recovery from connection issues
|
||||
|
||||
## Risk Assessment
|
||||
|
||||
### Low Risk Changes
|
||||
- ✅ Removing `socket.reopen()` calls from ServiceCall
|
||||
- ✅ Standardizing backoff calculations
|
||||
- ✅ Adding connection state tracking
|
||||
|
||||
### Potential Issues
|
||||
- ⚠️ **Request timeout behavior may change** - requests may take longer to fail
|
||||
- ⚠️ **Need to test edge cases** - rapid API key changes, server restarts
|
||||
- ⚠️ **Verify inflight request cleanup** - ensure requests don't hang indefinitely
|
||||
|
||||
### Mitigation Strategies
|
||||
1. **Preserve existing timeout behavior** - requests should still timeout appropriately
|
||||
2. **Add circuit breaker** - stop retrying after socket reconnection gives up
|
||||
3. **Comprehensive testing** - test connection failure scenarios
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
- Mock WebSocket state transitions
|
||||
- Verify ServiceCall doesn't trigger socket reopens
|
||||
- Test backoff calculations are consistent
|
||||
|
||||
### Integration Tests
|
||||
- Test connection failure and recovery scenarios
|
||||
- Verify request queueing during reconnection
|
||||
- Test concurrent request handling
|
||||
|
||||
### Manual Testing Scenarios
|
||||
1. **Server shutdown** - verify clean reconnection behavior
|
||||
2. **Network interruption** - test mobile/wifi scenarios
|
||||
3. **API key changes** - ensure proper socket recreation
|
||||
4. **High load** - multiple concurrent requests during connection issues
|
||||
|
||||
## Implementation Timeline
|
||||
|
||||
### Phase 1: Core Fixes (1-2 hours)
|
||||
- Remove `socket.reopen()` calls from ServiceCall
|
||||
- Standardize ServiceCall backoff calculations
|
||||
- Add basic connection state tracking
|
||||
|
||||
### Phase 2: Enhanced Reliability (2-3 hours)
|
||||
- Implement request queueing improvements
|
||||
- Add comprehensive logging
|
||||
- Enhanced error handling
|
||||
|
||||
### Phase 3: Testing & Validation (2-4 hours)
|
||||
- Unit test coverage
|
||||
- Integration testing
|
||||
- Performance validation
|
||||
|
||||
**Total Estimated Effort**: 5-9 hours
|
||||
|
||||
## Success Metrics
|
||||
|
||||
### Quantitative Goals
|
||||
- **Reduce retry attempts by 80%+** during connection failures
|
||||
- **Eliminate concurrent socket reopen calls**
|
||||
- **Standardize all retry backoff to exponential**
|
||||
|
||||
### Qualitative Goals
|
||||
- **Cleaner, more understandable logs**
|
||||
- **Predictable connection recovery behavior**
|
||||
- **Better separation of concerns in codebase**
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
### Potential Improvements (Out of Scope)
|
||||
1. **Request prioritization** - critical requests retry faster
|
||||
2. **Connection health monitoring** - proactive reconnection
|
||||
3. **Metrics collection** - track connection reliability
|
||||
4. **Advanced queueing** - persist important requests across sessions
|
||||
|
||||
### Monitoring Additions
|
||||
```typescript
|
||||
// Connection reliability metrics
|
||||
interface SocketMetrics {
|
||||
connectionAttempts: number;
|
||||
successfulConnections: number;
|
||||
averageReconnectionTime: number;
|
||||
requestsLostDuringReconnection: number;
|
||||
}
|
||||
```
|
||||
|
||||
## Conclusion
|
||||
|
||||
This refactor addresses the root cause of socket retry storms by establishing BaseApi as the single authority for connection management. The changes are surgical and low-risk, focusing on removing the problematic dual retry system while preserving all existing functionality.
|
||||
|
||||
**Next Steps**: Implement Phase 1 changes and validate that retry storms are eliminated before proceeding with enhanced features.
|
||||
Loading…
Add table
Add a link
Reference in a new issue