---
layout: default
title: "Import/Export Graceful Shutdown Technical Specification"
parent: "Chinese (Beta)"
---

# Import/Export Graceful Shutdown Technical Specification

> **Beta Translation:** This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.
## Problem Statement

The TrustGraph gateway currently loses messages when a WebSocket connection closes during import and export operations. A race condition causes in-flight messages to be dropped before they reach their destination (for imports, messages bound for the Pulsar queue; for exports, messages bound for the WebSocket client).

### Import-Side Problems

1. The publisher's asyncio.Queue buffer is not drained on shutdown
2. The WebSocket connection is closed before all queued messages are guaranteed to reach Pulsar
3. There is no mechanism to confirm successful message delivery

### Export-Side Problems

1. Messages are acknowledged in Pulsar before they are successfully delivered to the client
2. Fixed timeouts cause message loss when queues are full
3. There is no backpressure mechanism for slow consumers
4. Multiple buffers exist where data can be lost

## Architecture Overview

```
Import Flow:
Client -> Websocket -> TriplesImport -> Publisher -> Pulsar Queue

Export Flow:
Pulsar Queue -> Subscriber -> TriplesExport -> Websocket -> Client
```
## Proposed Fixes

### 1. Publisher Improvements (Import Side)

#### A. Graceful Queue Draining

**File**: `trustgraph-base/trustgraph/base/publisher.py`

```python
import asyncio
import logging
import time

from pulsar.schema import JsonSchema

logger = logging.getLogger(__name__)


class Publisher:

    def __init__(self, client, topic, schema=None, max_size=10,
                 chunking_enabled=True, drain_timeout=5.0):
        self.client = client
        self.topic = topic
        self.schema = schema
        self.q = asyncio.Queue(maxsize=max_size)
        self.chunking_enabled = chunking_enabled
        self.running = True
        self.draining = False  # New state for graceful shutdown
        self.task = None
        self.drain_timeout = drain_timeout

    async def stop(self):
        """Initiate graceful shutdown with draining"""
        self.running = False
        self.draining = True

        if self.task:
            # Wait for run() to complete draining
            await self.task

    async def run(self):
        """Enhanced run method with integrated draining logic"""
        while self.running or self.draining:
            try:
                producer = self.client.create_producer(
                    topic=self.topic,
                    schema=JsonSchema(self.schema),
                    chunking_enabled=self.chunking_enabled,
                )

                drain_end_time = None

                while self.running or self.draining:
                    try:
                        # Start drain timeout when entering drain mode
                        if self.draining and drain_end_time is None:
                            drain_end_time = time.time() + self.drain_timeout
                            logger.info(f"Publisher entering drain mode, timeout={self.drain_timeout}s")

                        # Check drain timeout
                        if self.draining and time.time() > drain_end_time:
                            if not self.q.empty():
                                logger.warning(f"Drain timeout reached with {self.q.qsize()} messages remaining")
                            self.draining = False
                            break

                        # Calculate wait timeout based on mode
                        if self.draining:
                            # Shorter timeout during draining to exit quickly when empty
                            timeout = min(0.1, drain_end_time - time.time())
                        else:
                            # Normal operation timeout
                            timeout = 0.25

                        # Get message from queue
                        id, item = await asyncio.wait_for(
                            self.q.get(),
                            timeout=timeout
                        )

                        # Send the message (single place for sending)
                        if id:
                            producer.send(item, { "id": id })
                        else:
                            producer.send(item)

                    except asyncio.TimeoutError:
                        # If draining and queue is empty, we're done
                        if self.draining and self.q.empty():
                            logger.info("Publisher queue drained successfully")
                            self.draining = False
                            break
                        continue

                # Flush producer before closing
                if producer:
                    producer.flush()
                    producer.close()

            except Exception as e:
                logger.error(f"Exception in publisher: {e}", exc_info=True)

            if not self.running and not self.draining:
                return

            # If handler drops out, sleep and retry
            await asyncio.sleep(1)

    async def send(self, id, item):
        """Send still works normally - just adds to queue"""
        if self.draining:
            # Optionally reject new messages during drain
            raise RuntimeError("Publisher is shutting down, not accepting new messages")
        await self.q.put((id, item))
```

**Key design advantages:**

**Single send location:** All `producer.send()` calls happen in a single place inside the `run()` method.

**Clear state machine:** Three explicit states - running, draining, stopped.

**Timeout protection:** Draining never hangs indefinitely.

**Better observability:** Drain progress and state transitions are logged clearly.

**Optional message rejection:** New messages can be rejected during the shutdown phase.

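The drain loop above can be exercised in isolation. Below is a minimal, self-contained sketch of the pattern (names are illustrative; a plain list stands in for the Pulsar producer):

```python
import asyncio
import time

async def drain(q: asyncio.Queue, sent: list, drain_timeout: float = 2.0) -> bool:
    """Drain q into `sent`, giving up after drain_timeout seconds.

    Returns True if the queue emptied in time, False otherwise.
    """
    drain_end_time = time.time() + drain_timeout
    while True:
        if time.time() > drain_end_time:
            # Timed out; report whether anything is still queued
            return q.empty()
        try:
            # Short wait so an empty queue is noticed quickly
            item = await asyncio.wait_for(q.get(), timeout=0.1)
            sent.append(item)  # producer.send(...) in the real code
        except asyncio.TimeoutError:
            if q.empty():
                return True  # drained successfully

async def main():
    q = asyncio.Queue()
    for i in range(5):
        q.put_nowait(i)
    sent = []
    ok = await drain(q, sent)
    return ok, sent

print(asyncio.run(main()))  # (True, [0, 1, 2, 3, 4])
```

On timeout with messages still queued the sketch returns False, mirroring the "drain timeout reached" warning path in `Publisher.run()`.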
#### B. Improved Shutdown Sequence

**File:** `trustgraph-flow/trustgraph/gateway/dispatch/triples_import.py`

```python
class TriplesImport:
    async def destroy(self):
        """Enhanced destroy with proper shutdown order"""
        # Step 1: Stop accepting new messages
        self.running.stop()

        # Step 2: Wait for publisher to drain its queue
        logger.info("Draining publisher queue...")
        await self.publisher.stop()

        # Step 3: Close websocket only after queue is drained
        if self.ws:
            await self.ws.close()
```
### 2. Subscriber Improvements (Export Side)

#### A. Integrated Drain Mode

**File**: `trustgraph-base/trustgraph/base/subscriber.py`

```python
import asyncio
import logging
import time
import uuid

import _pulsar
from pulsar.schema import JsonSchema

logger = logging.getLogger(__name__)


class Subscriber:

    def __init__(self, client, topic, subscription, consumer_name,
                 schema=None, max_size=100, metrics=None,
                 backpressure_strategy="block", drain_timeout=5.0):
        # ... existing init ...
        self.backpressure_strategy = backpressure_strategy
        self.running = True
        self.draining = False  # New state for graceful shutdown
        self.drain_timeout = drain_timeout
        self.pending_acks = {}  # Track messages awaiting delivery

    async def stop(self):
        """Initiate graceful shutdown with draining"""
        self.running = False
        self.draining = True

        if self.task:
            # Wait for run() to complete draining
            await self.task

    async def run(self):
        """Enhanced run method with integrated draining logic"""
        while self.running or self.draining:

            if self.metrics:
                self.metrics.state("stopped")

            try:
                self.consumer = self.client.subscribe(
                    topic = self.topic,
                    subscription_name = self.subscription,
                    consumer_name = self.consumer_name,
                    schema = JsonSchema(self.schema),
                )

                if self.metrics:
                    self.metrics.state("running")

                logger.info("Subscriber running...")
                drain_end_time = None

                while self.running or self.draining:

                    # Start drain timeout when entering drain mode
                    if self.draining and drain_end_time is None:
                        drain_end_time = time.time() + self.drain_timeout
                        logger.info(f"Subscriber entering drain mode, timeout={self.drain_timeout}s")

                        # Stop accepting new messages from Pulsar during drain
                        self.consumer.pause_message_listener()

                    # Check drain timeout
                    if self.draining and time.time() > drain_end_time:
                        async with self.lock:
                            total_pending = sum(
                                q.qsize() for q in
                                list(self.q.values()) + list(self.full.values())
                            )
                            if total_pending > 0:
                                logger.warning(f"Drain timeout reached with {total_pending} messages in queues")
                        self.draining = False
                        break

                    # Check if we can exit drain mode
                    if self.draining:
                        async with self.lock:
                            all_empty = all(
                                q.empty() for q in
                                list(self.q.values()) + list(self.full.values())
                            )
                        if all_empty and len(self.pending_acks) == 0:
                            logger.info("Subscriber queues drained successfully")
                            self.draining = False
                            break

                    # Process messages only if not draining
                    if not self.draining:
                        try:
                            msg = await asyncio.to_thread(
                                self.consumer.receive,
                                timeout_millis=250
                            )
                        except _pulsar.Timeout:
                            continue
                        except Exception as e:
                            logger.error(f"Exception in subscriber receive: {e}", exc_info=True)
                            raise e

                        if self.metrics:
                            self.metrics.received()

                        # Process the message
                        await self._process_message(msg)
                    else:
                        # During draining, just wait for queues to empty
                        await asyncio.sleep(0.1)

            except Exception as e:
                logger.error(f"Subscriber exception: {e}", exc_info=True)

            finally:
                # Negative acknowledge any pending messages
                for msg in self.pending_acks.values():
                    self.consumer.negative_acknowledge(msg)
                self.pending_acks.clear()

                if self.consumer:
                    self.consumer.unsubscribe()
                    self.consumer.close()
                    self.consumer = None

            if self.metrics:
                self.metrics.state("stopped")

            if not self.running and not self.draining:
                return

            # If handler drops out, sleep and retry
            await asyncio.sleep(1)

    async def _process_message(self, msg):
        """Process a single message with deferred acknowledgment"""
        # Store message for later acknowledgment
        msg_id = str(uuid.uuid4())
        self.pending_acks[msg_id] = msg

        try:
            id = msg.properties()["id"]
        except KeyError:
            id = None

        value = msg.value()
        delivery_success = False

        async with self.lock:
            # Deliver to specific subscribers
            if id in self.q:
                delivery_success = await self._deliver_to_queue(
                    self.q[id], value
                )

            # Deliver to all subscribers
            for q in self.full.values():
                if await self._deliver_to_queue(q, value):
                    delivery_success = True

        # Acknowledge only on successful delivery
        if delivery_success:
            self.consumer.acknowledge(msg)
            del self.pending_acks[msg_id]
        else:
            # Negative acknowledge for retry
            self.consumer.negative_acknowledge(msg)
            del self.pending_acks[msg_id]

    async def _deliver_to_queue(self, queue, value):
        """Deliver message to queue with backpressure handling"""
        try:
            if self.backpressure_strategy == "block":
                # Block until space available (no timeout)
                await queue.put(value)
                return True

            elif self.backpressure_strategy == "drop_oldest":
                # Drop oldest message if queue full
                if queue.full():
                    try:
                        queue.get_nowait()
                        if self.metrics:
                            self.metrics.dropped()
                    except asyncio.QueueEmpty:
                        pass
                await queue.put(value)
                return True

            elif self.backpressure_strategy == "drop_new":
                # Drop new message if queue full
                if queue.full():
                    if self.metrics:
                        self.metrics.dropped()
                    return False
                await queue.put(value)
                return True

        except Exception as e:
            logger.error(f"Failed to deliver message: {e}")
            return False
```
**Key design advantages (matching the publisher pattern):**

**Single processing location:** All message handling happens in the `run()` method.

**Clear state machine:** Three explicit states - running, draining, stopped.

**Pause during drain:** New messages stop being received from Pulsar while the existing queues drain.

**Timeout protection:** Draining never hangs indefinitely.

**Proper cleanup:** Any undelivered messages are negatively acknowledged on shutdown.

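The three backpressure strategies can also be tried against plain `asyncio.Queue`s, independent of Pulsar and the `Subscriber` class. A minimal sketch (function name and scenario are illustrative):

```python
import asyncio

async def deliver(queue: asyncio.Queue, value, strategy: str) -> bool:
    """Deliver value to queue under one of three backpressure strategies."""
    if strategy == "block":
        await queue.put(value)   # wait until space is available
        return True
    if strategy == "drop_oldest":
        if queue.full():
            queue.get_nowait()   # evict the oldest message
        await queue.put(value)
        return True
    if strategy == "drop_new":
        if queue.full():
            return False         # refuse the new message
        await queue.put(value)
        return True
    raise ValueError(f"unknown strategy: {strategy}")

async def main():
    q = asyncio.Queue(maxsize=2)
    await deliver(q, 1, "drop_oldest")
    await deliver(q, 2, "drop_oldest")
    await deliver(q, 3, "drop_oldest")   # queue full: 1 is evicted
    oldest_kept = q.get_nowait()         # first remaining message

    q2 = asyncio.Queue(maxsize=1)
    await deliver(q2, "a", "drop_new")
    accepted = await deliver(q2, "b", "drop_new")  # full: refused
    return oldest_kept, accepted

print(asyncio.run(main()))  # (2, False)
```

The return value is what drives the deferred acknowledgment above: `False` maps to a negative acknowledgment, so Pulsar redelivers rather than silently losing the message.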
#### B. Export Handler Improvements

**File:** `trustgraph-flow/trustgraph/gateway/dispatch/triples_export.py`

```python
class TriplesExport:
    async def destroy(self):
        """Enhanced destroy with graceful shutdown"""
        # Step 1: Signal stop to prevent new messages
        self.running.stop()

        # Step 2: Wait briefly for in-flight messages
        await asyncio.sleep(0.5)

        # Step 3: Unsubscribe and stop subscriber (triggers queue drain)
        if hasattr(self, 'subs'):
            await self.subs.unsubscribe_all(self.id)
            await self.subs.stop()

        # Step 4: Close websocket last
        if self.ws and not self.ws.closed:
            await self.ws.close()

    async def run(self):
        """Enhanced run with better error handling"""
        self.subs = Subscriber(
            client = self.pulsar_client,
            topic = self.queue,
            consumer_name = self.consumer,
            subscription = self.subscriber,
            schema = Triples,
            backpressure_strategy = "block"  # Configurable
        )

        await self.subs.start()

        self.id = str(uuid.uuid4())
        q = await self.subs.subscribe_all(self.id)

        consecutive_errors = 0
        max_consecutive_errors = 5

        while self.running.get():
            try:
                resp = await asyncio.wait_for(q.get(), timeout=0.5)
                await self.ws.send_json(serialize_triples(resp))
                consecutive_errors = 0  # Reset on success

            except asyncio.TimeoutError:
                continue

            except Exception as e:
                logger.error(f"Exception sending to websocket: {str(e)}")
                consecutive_errors += 1

                if consecutive_errors >= max_consecutive_errors:
                    logger.error("Too many consecutive errors, shutting down")
                    break

                # Brief pause before retry
                await asyncio.sleep(0.1)

        # Graceful cleanup handled in destroy()
```
### 3. Socket-Level Improvements

**File**: `trustgraph-flow/trustgraph/gateway/endpoint/socket.py`

```python
class SocketEndpoint:
    async def listener(self, ws, dispatcher, running):
        """Enhanced listener with graceful shutdown"""
        async for msg in ws:
            if msg.type == WSMsgType.TEXT:
                await dispatcher.receive(msg)
                continue
            elif msg.type == WSMsgType.BINARY:
                await dispatcher.receive(msg)
                continue
            else:
                # Graceful shutdown on close
                logger.info("Websocket closing, initiating graceful shutdown")
                running.stop()

                # Allow time for dispatcher cleanup
                await asyncio.sleep(1.0)
                break

    async def handle(self, request):
        """Enhanced handler with better cleanup"""
        # ... existing setup code ...

        dispatcher = None

        try:
            async with asyncio.TaskGroup() as tg:
                running = Running()

                dispatcher = await self.dispatcher(
                    ws, running, request.match_info
                )

                worker_task = tg.create_task(
                    self.worker(ws, dispatcher, running)
                )

                lsnr_task = tg.create_task(
                    self.listener(ws, dispatcher, running)
                )

        except ExceptionGroup as e:
            logger.error("Exception group occurred:", exc_info=True)

            # Attempt graceful dispatcher shutdown
            try:
                await asyncio.wait_for(
                    dispatcher.destroy(),
                    timeout=5.0
                )
            except asyncio.TimeoutError:
                logger.warning("Dispatcher shutdown timed out")
            except Exception as de:
                logger.error(f"Error during dispatcher cleanup: {de}")

        except Exception as e:
            logger.error(f"Socket exception: {e}", exc_info=True)

        finally:
            # Ensure dispatcher cleanup
            if dispatcher and hasattr(dispatcher, 'destroy'):
                try:
                    await dispatcher.destroy()
                except Exception:
                    pass

            # Ensure websocket is closed
            if ws and not ws.closed:
                await ws.close()

        return ws
```
## Configuration Options

Add configuration support for tuning the behavior:

```python
# config.py
class GracefulShutdownConfig:
    # Publisher settings
    PUBLISHER_DRAIN_TIMEOUT = 5.0    # Seconds to wait for queue drain
    PUBLISHER_FLUSH_TIMEOUT = 2.0    # Producer flush timeout

    # Subscriber settings
    SUBSCRIBER_DRAIN_TIMEOUT = 5.0   # Seconds to wait for queue drain
    BACKPRESSURE_STRATEGY = "block"  # Options: "block", "drop_oldest", "drop_new"
    SUBSCRIBER_MAX_QUEUE_SIZE = 100  # Maximum queue size before backpressure

    # Socket settings
    SHUTDOWN_GRACE_PERIOD = 1.0      # Seconds to wait for graceful shutdown
    MAX_CONSECUTIVE_ERRORS = 5       # Maximum errors before forced shutdown

    # Monitoring
    LOG_QUEUE_STATS = True           # Log queue statistics on shutdown
    METRICS_ENABLED = True           # Enable metrics collection
```
## Testing Strategy

### Unit Tests

```python
async def test_publisher_queue_drain():
    """Verify Publisher drains queue on shutdown"""
    publisher = Publisher(...)

    # Fill queue with messages
    for i in range(10):
        await publisher.send(f"id-{i}", {"data": i})

    # Stop publisher
    await publisher.stop()

    # Verify all messages were sent
    assert publisher.q.empty()
    assert mock_producer.send.call_count == 10

async def test_subscriber_deferred_ack():
    """Verify Subscriber only acks on successful delivery"""
    subscriber = Subscriber(..., backpressure_strategy="drop_new")

    # Fill queue to capacity
    queue = await subscriber.subscribe("test")
    for i in range(100):
        await queue.put({"data": i})

    # Try to add message when full
    msg = create_mock_message()
    await subscriber._process_message(msg)

    # Verify negative acknowledgment (nack happens on the consumer)
    assert subscriber.consumer.negative_acknowledge.called
    assert not subscriber.consumer.acknowledge.called
```
### Integration Tests

```python
async def test_import_graceful_shutdown():
    """Test import path handles shutdown gracefully"""
    # Setup
    import_handler = TriplesImport(...)
    await import_handler.start()

    # Send messages
    messages = []
    for i in range(100):
        msg = {"metadata": {...}, "triples": [...]}
        await import_handler.receive(msg)
        messages.append(msg)

    # Shutdown while messages in flight
    await import_handler.destroy()

    # Verify all messages reached Pulsar
    received = await pulsar_consumer.receive_all()
    assert len(received) == 100

async def test_export_no_message_loss():
    """Test export path doesn't lose acknowledged messages"""
    # Setup Pulsar with test messages
    for i in range(100):
        await pulsar_producer.send({"data": i})

    # Start export handler
    export_handler = TriplesExport(...)
    export_task = asyncio.create_task(export_handler.run())

    # Receive some messages
    received = []
    for _ in range(50):
        msg = await websocket.receive()
        received.append(msg)

    # Force shutdown
    await export_handler.destroy()

    # Continue receiving until websocket closes
    while not websocket.closed:
        try:
            msg = await websocket.receive()
            received.append(msg)
        except Exception:
            break

    # Verify no acknowledged messages were lost
    assert len(received) >= 50
```
## Deployment Plan

### Phase 1: Critical Fixes (Week 1)

Fix the subscriber acknowledgment timing issue (prevents message loss)

Add publisher queue draining

Deploy to the test environment

### Phase 2: Graceful Shutdown (Week 2)

Implement shutdown coordination

Add backpressure strategies

Performance testing

### Phase 3: Monitoring and Tuning (Week 3)

Add metrics for queue depth

Add alerting for message loss

Tune timeout values based on production data
## Monitoring and Alerting

### Metrics to Track

`publisher.queue.depth` - Current publisher queue size

`publisher.messages.dropped` - Messages lost during shutdown

`subscriber.messages.negatively_acknowledged` - Failed deliveries

`websocket.graceful_shutdowns` - Successful graceful shutdowns

`websocket.forced_shutdowns` - Forced/timed-out shutdowns

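One lightweight way to accumulate these counters in-process is a small recorder object that the publisher and subscriber increment. This is an illustrative sketch only (the class and method names are hypothetical, not TrustGraph's actual metrics API):

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ShutdownMetrics:
    """Minimal in-process recorder for the metrics listed above.

    A real deployment would export these through a metrics backend
    such as Prometheus rather than keeping them in memory.
    """
    counters: Counter = field(default_factory=Counter)
    queue_depth: int = 0

    def inc(self, name: str, n: int = 1):
        self.counters[name] += n

    def set_depth(self, depth: int):
        self.queue_depth = depth

    def nack_rate(self) -> float:
        """Fraction of received messages that were negatively acknowledged."""
        received = self.counters["subscriber.messages.received"]
        if received == 0:
            return 0.0
        return self.counters["subscriber.messages.negatively_acknowledged"] / received

m = ShutdownMetrics()
m.inc("subscriber.messages.received", 200)
m.inc("subscriber.messages.negatively_acknowledged", 3)
print(m.nack_rate())  # 0.015
```

A rate like 0.015 (1.5%) would trip the "> 1%" alert threshold listed below.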
### Alerts

Publisher queue depth exceeds 80% of capacity

Any message loss during shutdown

Subscriber negative-acknowledgment rate > 1%

Shutdown timeouts
## Backward Compatibility

All changes maintain backward compatibility:

Default behavior is unchanged when no configuration is provided

Existing deployments continue to operate normally

Graceful degradation when the new features are unavailable
## Security Considerations

No new attack vectors are introduced

Backpressure prevents memory-exhaustion attacks

Configurable limits prevent resource abuse
## Performance Impact

Minimal overhead during normal operation

Shutdown may take up to 5 seconds (configurable)

Memory usage is bounded by queue sizes

CPU impact is negligible (<1% increase)