Structure the tech specs directory (#836)

Tech spec some subdirectories for different languages
2026-04-26 00:46:22 +02:00 · 2026-04-21 16:06:41 +01:00 · 2026-04-21 16:06:41 +01:00 · e7efb673ef
commit e7efb673ef
parent 48da6c5f8b
423 changed files with 0 additions and 0 deletions
--- a/docs/tech-specs/zh-cn/import-export-graceful-shutdown.zh-cn.md
+++ b/docs/tech-specs/zh-cn/import-export-graceful-shutdown.zh-cn.md
@ -0,0 +1,725 @@
+---
+layout: default
+title: "导入/导出 优雅关闭 技术规范"
+parent: "Chinese (Beta)"
+---
+
+# 导入/导出 优雅关闭 技术规范
+
+> **Beta Translation:** This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.
+
+## 问题陈述
+
+<<<<<<< HEAD
+TrustGraph 网关在导入和导出操作中，在 WebSocket 连接关闭时，目前存在消息丢失的问题。 这是由于存在竞争条件，导致正在传输的消息在到达目的地之前被丢弃（对于导入，是丢弃到 Pulsar 队列的消息；对于导出，是丢弃到 WebSocket 客户端的消息）。
+
+### 导入端问题
+1. 发布者的 asyncio.Queue 缓冲区在关闭时未被清空
+2. 在确保所有排队的消息到达 Pulsar 之前，WebSocket 连接关闭
+=======
+TrustGraph 网关在导入和导出操作的 WebSocket 关闭过程中，目前存在消息丢失的问题。 这是由于存在竞争条件，导致正在传输的消息在到达目的地之前被丢弃（对于导入，是丢弃到 Pulsar 队列的消息；对于导出，是丢弃到 WebSocket 客户端的消息）。
+
+### 导入端问题
+1. 发布者的 asyncio.Queue 缓冲区在关闭时未被清空
+2. 在确保所有排队的消息到达 Pulsar 之前，WebSocket 被关闭
+>>>>>>> 82edf2d (New md files from RunPod)
+3. 没有用于确认消息成功传递的机制
+
+### 导出端问题
+1. 消息在成功传递到客户端之前，已在 Pulsar 中得到确认
+2. 预定义的超时时间会导致队列已满时消息丢失
+3. 没有用于处理慢速消费者的反压机制
+4. 存在多个缓冲区，数据可能在此处丢失
+
+## 架构概述
+
+```
+Import Flow:
+Client -> Websocket -> TriplesImport -> Publisher -> Pulsar Queue
+
+Export Flow:
+Pulsar Queue -> Subscriber -> TriplesExport -> Websocket -> Client
+```
+
+## 建议的修复方案
+
+### 1. 发布者改进（导入端）
+
+#### A. 优雅的队列清空
+
+**文件**: `trustgraph-base/trustgraph/base/publisher.py`
+
+```python
+class Publisher:
+    def __init__(self, client, topic, schema=None, max_size=10,
+                 chunking_enabled=True, drain_timeout=5.0):
+        self.client = client
+        self.topic = topic
+        self.schema = schema
+        self.q = asyncio.Queue(maxsize=max_size)
+        self.chunking_enabled = chunking_enabled
+        self.running = True
+        self.draining = False  # New state for graceful shutdown
+        self.task = None
+        self.drain_timeout = drain_timeout
+
+    async def stop(self):
+        """Initiate graceful shutdown with draining"""
+        self.running = False
+        self.draining = True
+        
+        if self.task:
+            # Wait for run() to complete draining
+            await self.task
+
+    async def run(self):
+        """Enhanced run method with integrated draining logic"""
+        while self.running or self.draining:
+            try:
+                producer = self.client.create_producer(
+                    topic=self.topic,
+                    schema=JsonSchema(self.schema),
+                    chunking_enabled=self.chunking_enabled,
+                )
+
+                drain_end_time = None
+                
+                while self.running or self.draining:
+                    try:
+                        # Start drain timeout when entering drain mode
+                        if self.draining and drain_end_time is None:
+                            drain_end_time = time.time() + self.drain_timeout
+                            logger.info(f"Publisher entering drain mode, timeout={self.drain_timeout}s")
+                        
+                        # Check drain timeout
+                        if self.draining and time.time() > drain_end_time:
+                            if not self.q.empty():
+                                logger.warning(f"Drain timeout reached with {self.q.qsize()} messages remaining")
+                            self.draining = False
+                            break
+                        
+                        # Calculate wait timeout based on mode
+                        if self.draining:
+                            # Shorter timeout during draining to exit quickly when empty
+                            timeout = min(0.1, drain_end_time - time.time())
+                        else:
+                            # Normal operation timeout
+                            timeout = 0.25
+                        
+                        # Get message from queue
+                        id, item = await asyncio.wait_for(
+                            self.q.get(),
+                            timeout=timeout
+                        )
+                        
+                        # Send the message (single place for sending)
+                        if id:
+                            producer.send(item, { "id": id })
+                        else:
+                            producer.send(item)
+                            
+                    except asyncio.TimeoutError:
+                        # If draining and queue is empty, we're done
+                        if self.draining and self.q.empty():
+                            logger.info("Publisher queue drained successfully")
+                            self.draining = False
+                            break
+                        continue
+                        
+                    except asyncio.QueueEmpty:
+                        # If draining and queue is empty, we're done  
+                        if self.draining and self.q.empty():
+                            logger.info("Publisher queue drained successfully")
+                            self.draining = False
+                            break
+                        continue
+                
+                # Flush producer before closing
+                if producer:
+                    producer.flush()
+                    producer.close()
+
+            except Exception as e:
+                logger.error(f"Exception in publisher: {e}", exc_info=True)
+
+            if not self.running and not self.draining:
+                return
+
+            # If handler drops out, sleep a retry
+            await asyncio.sleep(1)
+
+    async def send(self, id, item):
+        """Send still works normally - just adds to queue"""
+        if self.draining:
+            # Optionally reject new messages during drain
+            raise RuntimeError("Publisher is shutting down, not accepting new messages")
+        await self.q.put((id, item))
+```
+
+**主要设计优势：**
+**单一发送位置：** 所有 `producer.send()` 调用都发生在 `run()` 方法内的单一位置。
+**清晰的状态机：** 三个明确的状态 - 运行中、正在排空、已停止。
+**超时保护：** 在排空过程中不会无限期挂起。
+**更好的可观察性：** 清晰地记录排空进度和状态转换。
+**可选的消息拒绝：** 可以在关闭阶段拒绝新的消息。
+
+#### B. 改进的关闭顺序
+
+**文件：** `trustgraph-flow/trustgraph/gateway/dispatch/triples_import.py`
+
+```python
+class TriplesImport:
+    async def destroy(self):
+        """Enhanced destroy with proper shutdown order"""
+        # Step 1: Stop accepting new messages
+        self.running.stop()
+        
+        # Step 2: Wait for publisher to drain its queue
+        logger.info("Draining publisher queue...")
+        await self.publisher.stop()
+        
+        # Step 3: Close websocket only after queue is drained
+        if self.ws:
+            await self.ws.close()
+```
+
+### 2. 订阅者改进（导出端）
+
+#### A. 集成排空模式
+
+**文件**: `trustgraph-base/trustgraph/base/subscriber.py`
+
+```python
+class Subscriber:
+    def __init__(self, client, topic, subscription, consumer_name,
+                 schema=None, max_size=100, metrics=None,
+                 backpressure_strategy="block", drain_timeout=5.0):
+        # ... existing init ...
+        self.backpressure_strategy = backpressure_strategy
+        self.running = True
+        self.draining = False  # New state for graceful shutdown
+        self.drain_timeout = drain_timeout
+        self.pending_acks = {}  # Track messages awaiting delivery
+        
+    async def stop(self):
+        """Initiate graceful shutdown with draining"""
+        self.running = False
+        self.draining = True
+        
+        if self.task:
+            # Wait for run() to complete draining
+            await self.task
+            
+    async def run(self):
+        """Enhanced run method with integrated draining logic"""
+        while self.running or self.draining:
+            if self.metrics:
+                self.metrics.state("stopped")
+
+            try:
+                self.consumer = self.client.subscribe(
+                    topic = self.topic,
+                    subscription_name = self.subscription,
+                    consumer_name = self.consumer_name,
+                    schema = JsonSchema(self.schema),
+                )
+
+                if self.metrics:
+                    self.metrics.state("running")
+
+                logger.info("Subscriber running...")
+                drain_end_time = None
+
+                while self.running or self.draining:
+                    # Start drain timeout when entering drain mode
+                    if self.draining and drain_end_time is None:
+                        drain_end_time = time.time() + self.drain_timeout
+                        logger.info(f"Subscriber entering drain mode, timeout={self.drain_timeout}s")
+                        
+                        # Stop accepting new messages from Pulsar during drain
+                        self.consumer.pause_message_listener()
+                    
+                    # Check drain timeout
+                    if self.draining and time.time() > drain_end_time:
+                        async with self.lock:
+                            total_pending = sum(
+                                q.qsize() for q in 
+                                list(self.q.values()) + list(self.full.values())
+                            )
+                            if total_pending > 0:
+                                logger.warning(f"Drain timeout reached with {total_pending} messages in queues")
+                        self.draining = False
+                        break
+                    
+                    # Check if we can exit drain mode
+                    if self.draining:
+                        async with self.lock:
+                            all_empty = all(
+                                q.empty() for q in 
+                                list(self.q.values()) + list(self.full.values())
+                            )
+                            if all_empty and len(self.pending_acks) == 0:
+                                logger.info("Subscriber queues drained successfully")
+                                self.draining = False
+                                break
+                    
+                    # Process messages only if not draining
+                    if not self.draining:
+                        try:
+                            msg = await asyncio.to_thread(
+                                self.consumer.receive,
+                                timeout_millis=250
+                            )
+                        except _pulsar.Timeout:
+                            continue
+                        except Exception as e:
+                            logger.error(f"Exception in subscriber receive: {e}", exc_info=True)
+                            raise e
+
+                        if self.metrics:
+                            self.metrics.received()
+
+                        # Process the message
+                        await self._process_message(msg)
+                    else:
+                        # During draining, just wait for queues to empty
+                        await asyncio.sleep(0.1)
+
+            except Exception as e:
+                logger.error(f"Subscriber exception: {e}", exc_info=True)
+
+            finally:
+                # Negative acknowledge any pending messages
+                for msg in self.pending_acks.values():
+                    self.consumer.negative_acknowledge(msg)
+                self.pending_acks.clear()
+
+                if self.consumer:
+                    self.consumer.unsubscribe()
+                    self.consumer.close()
+                    self.consumer = None
+
+            if self.metrics:
+                self.metrics.state("stopped")
+
+            if not self.running and not self.draining:
+                return
+
+            # If handler drops out, sleep a retry
+            await asyncio.sleep(1)
+
+    async def _process_message(self, msg):
+        """Process a single message with deferred acknowledgment"""
+        # Store message for later acknowledgment
+        msg_id = str(uuid.uuid4())
+        self.pending_acks[msg_id] = msg
+        
+        try:
+            id = msg.properties()["id"]
+        except:
+            id = None
+            
+        value = msg.value()
+        delivery_success = False
+        
+        async with self.lock:
+            # Deliver to specific subscribers
+            if id in self.q:
+                delivery_success = await self._deliver_to_queue(
+                    self.q[id], value
+                )
+            
+            # Deliver to all subscribers
+            for q in self.full.values():
+                if await self._deliver_to_queue(q, value):
+                    delivery_success = True
+        
+        # Acknowledge only on successful delivery
+        if delivery_success:
+            self.consumer.acknowledge(msg)
+            del self.pending_acks[msg_id]
+        else:
+            # Negative acknowledge for retry
+            self.consumer.negative_acknowledge(msg)
+            del self.pending_acks[msg_id]
+                
+    async def _deliver_to_queue(self, queue, value):
+        """Deliver message to queue with backpressure handling"""
+        try:
+            if self.backpressure_strategy == "block":
+                # Block until space available (no timeout)
+                await queue.put(value)
+                return True
+                
+            elif self.backpressure_strategy == "drop_oldest":
+                # Drop oldest message if queue full
+                if queue.full():
+                    try:
+                        queue.get_nowait()
+                        if self.metrics:
+                            self.metrics.dropped()
+                    except asyncio.QueueEmpty:
+                        pass
+                await queue.put(value)
+                return True
+                
+            elif self.backpressure_strategy == "drop_new":
+                # Drop new message if queue full
+                if queue.full():
+                    if self.metrics:
+                        self.metrics.dropped()
+                    return False
+                await queue.put(value)
+                return True
+                
+        except Exception as e:
+            logger.error(f"Failed to deliver message: {e}")
+            return False
+```
+
+**主要设计优势（与发布者模式匹配）：**
+<<<<<<< HEAD
+**单一处理位置：** 所有消息处理都在 `run()` 方法中进行。
+=======
+**单一处理位置：** 所有消息处理都发生在 `run()` 方法中。
+>>>>>>> 82edf2d (New md files from RunPod)
+**清晰的状态机：** 三个明确的状态 - 运行中、正在排空、已停止。
+**排空期间暂停：** 在排空现有队列时，停止从 Pulsar 接收新消息。
+**超时保护：** 在排空过程中不会无限期挂起。
+**正确的清理：** 在关闭时，会对任何未发送的消息进行否定确认。
+
+#### B. 导出处理程序改进
+
+**文件：** `trustgraph-flow/trustgraph/gateway/dispatch/triples_export.py`
+
+```python
+class TriplesExport:
+    async def destroy(self):
+        """Enhanced destroy with graceful shutdown"""
+        # Step 1: Signal stop to prevent new messages
+        self.running.stop()
+        
+        # Step 2: Wait briefly for in-flight messages
+        await asyncio.sleep(0.5)
+        
+        # Step 3: Unsubscribe and stop subscriber (triggers queue drain)
+        if hasattr(self, 'subs'):
+            await self.subs.unsubscribe_all(self.id)
+            await self.subs.stop()
+        
+        # Step 4: Close websocket last
+        if self.ws and not self.ws.closed:
+            await self.ws.close()
+            
+    async def run(self):
+        """Enhanced run with better error handling"""
+        self.subs = Subscriber(
+            client = self.pulsar_client, 
+            topic = self.queue,
+            consumer_name = self.consumer, 
+            subscription = self.subscriber,
+            schema = Triples,
+            backpressure_strategy = "block"  # Configurable
+        )
+        
+        await self.subs.start()
+        
+        self.id = str(uuid.uuid4())
+        q = await self.subs.subscribe_all(self.id)
+        
+        consecutive_errors = 0
+        max_consecutive_errors = 5
+        
+        while self.running.get():
+            try:
+                resp = await asyncio.wait_for(q.get(), timeout=0.5)
+                await self.ws.send_json(serialize_triples(resp))
+                consecutive_errors = 0  # Reset on success
+                
+            except asyncio.TimeoutError:
+                continue
+                
+            except queue.Empty:
+                continue
+                
+            except Exception as e:
+                logger.error(f"Exception sending to websocket: {str(e)}")
+                consecutive_errors += 1
+                
+                if consecutive_errors >= max_consecutive_errors:
+                    logger.error("Too many consecutive errors, shutting down")
+                    break
+                    
+                # Brief pause before retry
+                await asyncio.sleep(0.1)
+        
+        # Graceful cleanup handled in destroy()
+```
+
+### 3. Socket 级别改进
+
+**文件**: `trustgraph-flow/trustgraph/gateway/endpoint/socket.py`
+
+```python
+class SocketEndpoint:
+    async def listener(self, ws, dispatcher, running):
+        """Enhanced listener with graceful shutdown"""
+        async for msg in ws:
+            if msg.type == WSMsgType.TEXT:
+                await dispatcher.receive(msg)
+                continue
+            elif msg.type == WSMsgType.BINARY:
+                await dispatcher.receive(msg)
+                continue
+            else:
+                # Graceful shutdown on close
+                logger.info("Websocket closing, initiating graceful shutdown")
+                running.stop()
+                
+                # Allow time for dispatcher cleanup
+                await asyncio.sleep(1.0)
+                break
+                
+    async def handle(self, request):
+        """Enhanced handler with better cleanup"""
+        # ... existing setup code ...
+        
+        try:
+            async with asyncio.TaskGroup() as tg:
+                running = Running()
+                
+                dispatcher = await self.dispatcher(
+                    ws, running, request.match_info
+                )
+                
+                worker_task = tg.create_task(
+                    self.worker(ws, dispatcher, running)
+                )
+                
+                lsnr_task = tg.create_task(
+                    self.listener(ws, dispatcher, running)
+                )
+                
+        except ExceptionGroup as e:
+            logger.error("Exception group occurred:", exc_info=True)
+            
+            # Attempt graceful dispatcher shutdown
+            try:
+                await asyncio.wait_for(
+                    dispatcher.destroy(), 
+                    timeout=5.0
+                )
+            except asyncio.TimeoutError:
+                logger.warning("Dispatcher shutdown timed out")
+            except Exception as de:
+                logger.error(f"Error during dispatcher cleanup: {de}")
+                
+        except Exception as e:
+            logger.error(f"Socket exception: {e}", exc_info=True)
+            
+        finally:
+            # Ensure dispatcher cleanup
+            if dispatcher and hasattr(dispatcher, 'destroy'):
+                try:
+                    await dispatcher.destroy()
+                except:
+                    pass
+                    
+            # Ensure websocket is closed
+            if ws and not ws.closed:
+                await ws.close()
+                
+        return ws
+```
+
+## 配置选项
+
+添加配置支持，用于调整行为：
+
+```python
+# config.py
+class GracefulShutdownConfig:
+    # Publisher settings
+    PUBLISHER_DRAIN_TIMEOUT = 5.0  # Seconds to wait for queue drain
+    PUBLISHER_FLUSH_TIMEOUT = 2.0  # Producer flush timeout
+    
+    # Subscriber settings  
+    SUBSCRIBER_DRAIN_TIMEOUT = 5.0  # Seconds to wait for queue drain
+    BACKPRESSURE_STRATEGY = "block"  # Options: "block", "drop_oldest", "drop_new"
+    SUBSCRIBER_MAX_QUEUE_SIZE = 100  # Maximum queue size before backpressure
+    
+    # Socket settings
+    SHUTDOWN_GRACE_PERIOD = 1.0  # Seconds to wait for graceful shutdown
+    MAX_CONSECUTIVE_ERRORS = 5  # Maximum errors before forced shutdown
+    
+    # Monitoring
+    LOG_QUEUE_STATS = True  # Log queue statistics on shutdown
+    METRICS_ENABLED = True  # Enable metrics collection
+```
+
+## 测试策略
+
+### 单元测试
+
+```python
+async def test_publisher_queue_drain():
+    """Verify Publisher drains queue on shutdown"""
+    publisher = Publisher(...)
+    
+    # Fill queue with messages
+    for i in range(10):
+        await publisher.send(f"id-{i}", {"data": i})
+    
+    # Stop publisher
+    await publisher.stop()
+    
+    # Verify all messages were sent
+    assert publisher.q.empty()
+    assert mock_producer.send.call_count == 10
+
+async def test_subscriber_deferred_ack():
+    """Verify Subscriber only acks on successful delivery"""
+    subscriber = Subscriber(..., backpressure_strategy="drop_new")
+    
+    # Fill queue to capacity
+    queue = await subscriber.subscribe("test")
+    for i in range(100):
+        await queue.put({"data": i})
+    
+    # Try to add message when full
+    msg = create_mock_message()
+    await subscriber._process_message(msg)
+    
+    # Verify negative acknowledgment
+    assert msg.negative_acknowledge.called
+    assert not msg.acknowledge.called
+```
+
+### 集成测试
+
+```python
+async def test_import_graceful_shutdown():
+    """Test import path handles shutdown gracefully"""
+    # Setup
+    import_handler = TriplesImport(...)
+    await import_handler.start()
+    
+    # Send messages
+    messages = []
+    for i in range(100):
+        msg = {"metadata": {...}, "triples": [...]}
+        await import_handler.receive(msg)
+        messages.append(msg)
+    
+    # Shutdown while messages in flight
+    await import_handler.destroy()
+    
+    # Verify all messages reached Pulsar
+    received = await pulsar_consumer.receive_all()
+    assert len(received) == 100
+
+async def test_export_no_message_loss():
+    """Test export path doesn't lose acknowledged messages"""
+    # Setup Pulsar with test messages
+    for i in range(100):
+        await pulsar_producer.send({"data": i})
+    
+    # Start export handler
+    export_handler = TriplesExport(...)
+    export_task = asyncio.create_task(export_handler.run())
+    
+    # Receive some messages
+    received = []
+    for _ in range(50):
+        msg = await websocket.receive()
+        received.append(msg)
+    
+    # Force shutdown
+    await export_handler.destroy()
+    
+    # Continue receiving until websocket closes
+    while not websocket.closed:
+        try:
+            msg = await websocket.receive()
+            received.append(msg)
+        except:
+            break
+    
+    # Verify no acknowledged messages were lost
+    assert len(received) >= 50
+```
+
+## 部署计划
+
+<<<<<<< HEAD
+### 第一阶段：关键修复（第一周）
+修复订阅者确认时序问题（防止消息丢失）
+添加发布者队列排空功能
+部署到测试环境
+
+### 第二阶段：平滑关闭（第二周）
+=======
+### 第一阶段：关键修复 (第一周)
+修复订阅者确认时序问题 (防止消息丢失)
+添加发布者队列排空功能
+部署到测试环境
+
+### 第二阶段：平滑关闭 (第二周)
+>>>>>>> 82edf2d (New md files from RunPod)
+实施关闭协调
+添加反压策略
+性能测试
+
+<<<<<<< HEAD
+### 第三阶段：监控与调优（第三周）
+=======
+### 第三阶段：监控与调优 (第三周)
+>>>>>>> 82edf2d (New md files from RunPod)
+添加队列深度的指标
+添加消息丢失的告警
+根据生产数据调整超时值
+
+## 监控与告警
+
+### 跟踪指标
+`publisher.queue.depth` - 当前发布者队列大小
+`publisher.messages.dropped` - 关闭期间丢失的消息数量
+`subscriber.messages.negatively_acknowledged` - 失败的交付
+`websocket.graceful_shutdowns` - 成功的平滑关闭
+`websocket.forced_shutdowns` - 强制/超时关闭
+
+### 告警
+<<<<<<< HEAD
+发布者队列深度超过 80% 的容量
+=======
+发布者队列深度超过 80% 容量
+>>>>>>> 82edf2d (New md files from RunPod)
+关闭期间出现任何消息丢失
+订阅者否定确认率 > 1%
+关闭超时
+
+## 向后兼容性
+
+所有更改都保持向后兼容性：
+在没有配置的情况下，默认行为保持不变
+现有部署继续正常运行
+如果新的功能不可用，则会进行优雅降级
+
+## 安全注意事项
+
+没有引入新的攻击向量
+反压可防止内存耗尽攻击
+可配置的限制可防止资源滥用
+
+## 性能影响
+
+正常运行期间的开销很小
+<<<<<<< HEAD
+关闭可能需要长达 5 秒（可配置）
+内存使用量受队列大小限制
+CPU 影响可以忽略不计（<1% 的增加）
+=======
+关闭可能需要长达 5 秒钟 (可配置)
+内存使用量受队列大小限制
+CPU 影响可以忽略不计 (<1% 的增长)
+>>>>>>> 82edf2d (New md files from RunPod)