--- layout: default title: "Especificação Técnica de Desligamento Gratuito de Importação/Exportação" parent: "Portuguese (Beta)" --- # Especificação Técnica de Desligamento Gratuito de Importação/Exportação > **Beta Translation:** This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta. ## Declaração do Problema Atualmente, o gateway TrustGraph experimenta perda de mensagens durante o fechamento do WebSocket tanto em operações de importação quanto de exportação. Isso ocorre devido a condições de corrida em que as mensagens em trânsito são descartadas antes de atingir seu destino (filas Pulsar para importações, clientes WebSocket para exportações). ### Problemas no Lado da Importação 1. O buffer da fila asyncio do publicador não é esvaziado durante o desligamento. 2. O WebSocket é fechado antes de garantir que as mensagens enfileiradas cheguem ao Pulsar. 3. Não há mecanismo de confirmação para a entrega bem-sucedida da mensagem. ### Problemas no Lado da Exportação 1. As mensagens são confirmadas no Pulsar antes da entrega bem-sucedida aos clientes. 2. Os tempos limite fixos causam a perda de mensagens quando as filas estão cheias. 3. Não há mecanismo de controle de fluxo para lidar com consumidores lentos. 4. Múltiplos pontos de buffer onde os dados podem ser perdidos. ## Visão Geral da Arquitetura ``` Import Flow: Client -> Websocket -> TriplesImport -> Publisher -> Pulsar Queue Export Flow: Pulsar Queue -> Subscriber -> TriplesExport -> Websocket -> Client ``` ## Correções Propostas ### 1. Melhorias no Publicador (Lado da Importação) #### A. Esvaziamento Gratuito da Fila **Arquivo**: `trustgraph-base/trustgraph/base/publisher.py` ```python class Publisher: def __init__(self, client, topic, schema=None, max_size=10, chunking_enabled=True, drain_timeout=5.0): self.client = client self.topic = topic self.schema = schema self.q = asyncio.Queue(maxsize=max_size) self.chunking_enabled = chunking_enabled self.running = True self.draining = False # New state for graceful shutdown self.task = None self.drain_timeout = drain_timeout async def stop(self): """Initiate graceful shutdown with draining""" self.running = False self.draining = True if self.task: # Wait for run() to complete draining await self.task async def run(self): """Enhanced run method with integrated draining logic""" while self.running or self.draining: try: producer = self.client.create_producer( topic=self.topic, schema=JsonSchema(self.schema), chunking_enabled=self.chunking_enabled, ) drain_end_time = None while self.running or self.draining: try: # Start drain timeout when entering drain mode if self.draining and drain_end_time is None: drain_end_time = time.time() + self.drain_timeout logger.info(f"Publisher entering drain mode, timeout={self.drain_timeout}s") # Check drain timeout if self.draining and time.time() > drain_end_time: if not self.q.empty(): logger.warning(f"Drain timeout reached with {self.q.qsize()} messages remaining") self.draining = False break # Calculate wait timeout based on mode if self.draining: # Shorter timeout during draining to exit quickly when empty timeout = min(0.1, drain_end_time - time.time()) else: # Normal operation timeout timeout = 0.25 # Get message from queue id, item = await asyncio.wait_for( self.q.get(), timeout=timeout ) # Send the message (single place for sending) if id: producer.send(item, { "id": id }) else: producer.send(item) except asyncio.TimeoutError: # If draining and queue is empty, we're done if self.draining and self.q.empty(): logger.info("Publisher queue drained successfully") self.draining = False break continue except asyncio.QueueEmpty: # If draining and queue is empty, we're done if self.draining and self.q.empty(): logger.info("Publisher queue drained successfully") self.draining = False break continue # Flush producer before closing if producer: producer.flush() producer.close() except Exception as e: logger.error(f"Exception in publisher: {e}", exc_info=True) if not self.running and not self.draining: return # If handler drops out, sleep a retry await asyncio.sleep(1) async def send(self, id, item): """Send still works normally - just adds to queue""" if self.draining: # Optionally reject new messages during drain raise RuntimeError("Publisher is shutting down, not accepting new messages") await self.q.put((id, item)) ``` **Principais Vantagens do Design:** <<<<<<< HEAD **Local de Envio Único**: Todas as chamadas de `producer.send()` ocorrem em um único local dentro do método `run()`. ======= **Local de Envio Único**: Todas as chamadas `producer.send()` ocorrem em um único local dentro do método `run()`. >>>>>>> 82edf2d (New md files from RunPod) **Máquina de Estados Clara**: Três estados claros - em execução, esvaziando, parado. **Proteção por Timeout**: Não fica indefinidamente travado durante o esvaziamento. **Melhor Observabilidade**: Registro claro do progresso do esvaziamento e das transições de estado. **Rejeição de Mensagens Opcional**: Pode rejeitar novas mensagens durante a fase de desligamento. #### B. Ordem de Desligamento Aprimorada **Arquivo**: `trustgraph-flow/trustgraph/gateway/dispatch/triples_import.py` ```python class TriplesImport: async def destroy(self): """Enhanced destroy with proper shutdown order""" # Step 1: Stop accepting new messages self.running.stop() # Step 2: Wait for publisher to drain its queue logger.info("Draining publisher queue...") await self.publisher.stop() # Step 3: Close websocket only after queue is drained if self.ws: await self.ws.close() ``` ### 2. Melhorias para o Assinante (Lado de Exportação) #### A. Padrão de Drenagem Integrado **Arquivo**: `trustgraph-base/trustgraph/base/subscriber.py` ```python class Subscriber: def __init__(self, client, topic, subscription, consumer_name, schema=None, max_size=100, metrics=None, backpressure_strategy="block", drain_timeout=5.0): # ... existing init ... self.backpressure_strategy = backpressure_strategy self.running = True self.draining = False # New state for graceful shutdown self.drain_timeout = drain_timeout self.pending_acks = {} # Track messages awaiting delivery async def stop(self): """Initiate graceful shutdown with draining""" self.running = False self.draining = True if self.task: # Wait for run() to complete draining await self.task async def run(self): """Enhanced run method with integrated draining logic""" while self.running or self.draining: if self.metrics: self.metrics.state("stopped") try: self.consumer = self.client.subscribe( topic = self.topic, subscription_name = self.subscription, consumer_name = self.consumer_name, schema = JsonSchema(self.schema), ) if self.metrics: self.metrics.state("running") logger.info("Subscriber running...") drain_end_time = None while self.running or self.draining: # Start drain timeout when entering drain mode if self.draining and drain_end_time is None: drain_end_time = time.time() + self.drain_timeout logger.info(f"Subscriber entering drain mode, timeout={self.drain_timeout}s") # Stop accepting new messages from Pulsar during drain self.consumer.pause_message_listener() # Check drain timeout if self.draining and time.time() > drain_end_time: async with self.lock: total_pending = sum( q.qsize() for q in list(self.q.values()) + list(self.full.values()) ) if total_pending > 0: logger.warning(f"Drain timeout reached with {total_pending} messages in queues") self.draining = False break # Check if we can exit drain mode if self.draining: async with self.lock: all_empty = all( q.empty() for q in list(self.q.values()) + list(self.full.values()) ) if all_empty and len(self.pending_acks) == 0: logger.info("Subscriber queues drained successfully") self.draining = False break # Process messages only if not draining if not self.draining: try: msg = await asyncio.to_thread( self.consumer.receive, timeout_millis=250 ) except _pulsar.Timeout: continue except Exception as e: logger.error(f"Exception in subscriber receive: {e}", exc_info=True) raise e if self.metrics: self.metrics.received() # Process the message await self._process_message(msg) else: # During draining, just wait for queues to empty await asyncio.sleep(0.1) except Exception as e: logger.error(f"Subscriber exception: {e}", exc_info=True) finally: # Negative acknowledge any pending messages for msg in self.pending_acks.values(): self.consumer.negative_acknowledge(msg) self.pending_acks.clear() if self.consumer: self.consumer.unsubscribe() self.consumer.close() self.consumer = None if self.metrics: self.metrics.state("stopped") if not self.running and not self.draining: return # If handler drops out, sleep a retry await asyncio.sleep(1) async def _process_message(self, msg): """Process a single message with deferred acknowledgment""" # Store message for later acknowledgment msg_id = str(uuid.uuid4()) self.pending_acks[msg_id] = msg try: id = msg.properties()["id"] except: id = None value = msg.value() delivery_success = False async with self.lock: # Deliver to specific subscribers if id in self.q: delivery_success = await self._deliver_to_queue( self.q[id], value ) # Deliver to all subscribers for q in self.full.values(): if await self._deliver_to_queue(q, value): delivery_success = True # Acknowledge only on successful delivery if delivery_success: self.consumer.acknowledge(msg) del self.pending_acks[msg_id] else: # Negative acknowledge for retry self.consumer.negative_acknowledge(msg) del self.pending_acks[msg_id] async def _deliver_to_queue(self, queue, value): """Deliver message to queue with backpressure handling""" try: if self.backpressure_strategy == "block": # Block until space available (no timeout) await queue.put(value) return True elif self.backpressure_strategy == "drop_oldest": # Drop oldest message if queue full if queue.full(): try: queue.get_nowait() if self.metrics: self.metrics.dropped() except asyncio.QueueEmpty: pass await queue.put(value) return True elif self.backpressure_strategy == "drop_new": # Drop new message if queue full if queue.full(): if self.metrics: self.metrics.dropped() return False await queue.put(value) return True except Exception as e: logger.error(f"Failed to deliver message: {e}") return False ``` <<<<<<< HEAD **Principais Vantagens do Design (compatível com o padrão do Editor):** **Local de Processamento Único**: Todo o processamento de mensagens ocorre no método `run()` **Máquina de Estados Clara**: Três estados claros - em execução, esvaziando, parado **Pausa Durante o Esvaziamento**: Interrompe a aceitação de novas mensagens do Pulsar enquanto esvazia as filas existentes **Proteção por Timeout**: Não fica indefinidamente travado durante o esvaziamento ======= **Principais Vantagens do Design (compatível com o padrão do Publicador):** **Local de Processamento Único**: Todo o processamento de mensagens ocorre no método `run()` **Máquina de Estados Clara**: Três estados claros - em execução, esvaziando, parado **Pausa Durante o Esvaziamento**: Interrompe a aceitação de novas mensagens do Pulsar enquanto esvazia as filas existentes **Proteção por Timeout**: Não fica indefinidamente bloqueado durante o esvaziamento >>>>>>> 82edf2d (New md files from RunPod) **Limpeza Adequada**: Reconhece negativamente quaisquer mensagens não entregues durante o desligamento #### B. Melhorias no Manipulador de Exportação **Arquivo**: `trustgraph-flow/trustgraph/gateway/dispatch/triples_export.py` ```python class TriplesExport: async def destroy(self): """Enhanced destroy with graceful shutdown""" # Step 1: Signal stop to prevent new messages self.running.stop() # Step 2: Wait briefly for in-flight messages await asyncio.sleep(0.5) # Step 3: Unsubscribe and stop subscriber (triggers queue drain) if hasattr(self, 'subs'): await self.subs.unsubscribe_all(self.id) await self.subs.stop() # Step 4: Close websocket last if self.ws and not self.ws.closed: await self.ws.close() async def run(self): """Enhanced run with better error handling""" self.subs = Subscriber( client = self.pulsar_client, topic = self.queue, consumer_name = self.consumer, subscription = self.subscriber, schema = Triples, backpressure_strategy = "block" # Configurable ) await self.subs.start() self.id = str(uuid.uuid4()) q = await self.subs.subscribe_all(self.id) consecutive_errors = 0 max_consecutive_errors = 5 while self.running.get(): try: resp = await asyncio.wait_for(q.get(), timeout=0.5) await self.ws.send_json(serialize_triples(resp)) consecutive_errors = 0 # Reset on success except asyncio.TimeoutError: continue except queue.Empty: continue except Exception as e: logger.error(f"Exception sending to websocket: {str(e)}") consecutive_errors += 1 if consecutive_errors >= max_consecutive_errors: logger.error("Too many consecutive errors, shutting down") break # Brief pause before retry await asyncio.sleep(0.1) # Graceful cleanup handled in destroy() ``` ### 3. Melhorias no Nível de Socket **Arquivo**: `trustgraph-flow/trustgraph/gateway/endpoint/socket.py` ```python class SocketEndpoint: async def listener(self, ws, dispatcher, running): """Enhanced listener with graceful shutdown""" async for msg in ws: if msg.type == WSMsgType.TEXT: await dispatcher.receive(msg) continue elif msg.type == WSMsgType.BINARY: await dispatcher.receive(msg) continue else: # Graceful shutdown on close logger.info("Websocket closing, initiating graceful shutdown") running.stop() # Allow time for dispatcher cleanup await asyncio.sleep(1.0) break async def handle(self, request): """Enhanced handler with better cleanup""" # ... existing setup code ... try: async with asyncio.TaskGroup() as tg: running = Running() dispatcher = await self.dispatcher( ws, running, request.match_info ) worker_task = tg.create_task( self.worker(ws, dispatcher, running) ) lsnr_task = tg.create_task( self.listener(ws, dispatcher, running) ) except ExceptionGroup as e: logger.error("Exception group occurred:", exc_info=True) # Attempt graceful dispatcher shutdown try: await asyncio.wait_for( dispatcher.destroy(), timeout=5.0 ) except asyncio.TimeoutError: logger.warning("Dispatcher shutdown timed out") except Exception as de: logger.error(f"Error during dispatcher cleanup: {de}") except Exception as e: logger.error(f"Socket exception: {e}", exc_info=True) finally: # Ensure dispatcher cleanup if dispatcher and hasattr(dispatcher, 'destroy'): try: await dispatcher.destroy() except: pass # Ensure websocket is closed if ws and not ws.closed: await ws.close() return ws ``` ## Opções de Configuração Adicionar suporte para configuração para ajustar o comportamento: ```python # config.py class GracefulShutdownConfig: # Publisher settings PUBLISHER_DRAIN_TIMEOUT = 5.0 # Seconds to wait for queue drain PUBLISHER_FLUSH_TIMEOUT = 2.0 # Producer flush timeout # Subscriber settings SUBSCRIBER_DRAIN_TIMEOUT = 5.0 # Seconds to wait for queue drain BACKPRESSURE_STRATEGY = "block" # Options: "block", "drop_oldest", "drop_new" SUBSCRIBER_MAX_QUEUE_SIZE = 100 # Maximum queue size before backpressure # Socket settings SHUTDOWN_GRACE_PERIOD = 1.0 # Seconds to wait for graceful shutdown MAX_CONSECUTIVE_ERRORS = 5 # Maximum errors before forced shutdown # Monitoring LOG_QUEUE_STATS = True # Log queue statistics on shutdown METRICS_ENABLED = True # Enable metrics collection ``` ## Estratégia de Testes ### Testes Unitários ```python async def test_publisher_queue_drain(): """Verify Publisher drains queue on shutdown""" publisher = Publisher(...) # Fill queue with messages for i in range(10): await publisher.send(f"id-{i}", {"data": i}) # Stop publisher await publisher.stop() # Verify all messages were sent assert publisher.q.empty() assert mock_producer.send.call_count == 10 async def test_subscriber_deferred_ack(): """Verify Subscriber only acks on successful delivery""" subscriber = Subscriber(..., backpressure_strategy="drop_new") # Fill queue to capacity queue = await subscriber.subscribe("test") for i in range(100): await queue.put({"data": i}) # Try to add message when full msg = create_mock_message() await subscriber._process_message(msg) # Verify negative acknowledgment assert msg.negative_acknowledge.called assert not msg.acknowledge.called ``` ### Testes de Integração ```python async def test_import_graceful_shutdown(): """Test import path handles shutdown gracefully""" # Setup import_handler = TriplesImport(...) await import_handler.start() # Send messages messages = [] for i in range(100): msg = {"metadata": {...}, "triples": [...]} await import_handler.receive(msg) messages.append(msg) # Shutdown while messages in flight await import_handler.destroy() # Verify all messages reached Pulsar received = await pulsar_consumer.receive_all() assert len(received) == 100 async def test_export_no_message_loss(): """Test export path doesn't lose acknowledged messages""" # Setup Pulsar with test messages for i in range(100): await pulsar_producer.send({"data": i}) # Start export handler export_handler = TriplesExport(...) export_task = asyncio.create_task(export_handler.run()) # Receive some messages received = [] for _ in range(50): msg = await websocket.receive() received.append(msg) # Force shutdown await export_handler.destroy() # Continue receiving until websocket closes while not websocket.closed: try: msg = await websocket.receive() received.append(msg) except: break # Verify no acknowledged messages were lost assert len(received) >= 50 ``` ## Plano de Implementação ### Fase 1: Correções Críticas (Semana 1) Corrigir o tempo de reconhecimento do assinante (evitar perda de mensagens) Adicionar esvaziamento da fila do publicador Implantar no ambiente de teste ### Fase 2: Desligamento Gradual (Semana 2) <<<<<<< HEAD Implementar coordenação de desligamento ======= Implementar a coordenação de desligamento >>>>>>> 82edf2d (New md files from RunPod) Adicionar estratégias de backpressure Testes de desempenho ### Fase 3: Monitoramento e Ajuste (Semana 3) Adicionar métricas para profundidade da fila Adicionar alertas para perda de mensagens Ajustar valores de timeout com base em dados de produção ## Monitoramento e Alertas ### Métricas a serem Monitoradas `publisher.queue.depth` - Tamanho atual da fila do publicador `publisher.messages.dropped` - Mensagens perdidas durante o desligamento `subscriber.messages.negatively_acknowledged` - Entregas falhadas `websocket.graceful_shutdowns` - Desligamentos graduais bem-sucedidos `websocket.forced_shutdowns` - Desligamentos forçados/por timeout ### Alertas Profundidade da fila do publicador > 80% da capacidade Qualquer perda de mensagens durante o desligamento Taxa de reconhecimento negativo do assinante > 1% Timeout de desligamento excedido ## Compatibilidade com Versões Anteriores Todas as alterações mantêm a compatibilidade com versões anteriores: Comportamento padrão inalterado sem configuração Implantações existentes continuam a funcionar Degradação gradual se novos recursos não estiverem disponíveis ## Considerações de Segurança Nenhum novo vetor de ataque introduzido O backpressure impede ataques de esgotamento de memória <<<<<<< HEAD Limites configuráveis ​​evitam o abuso de recursos ======= Limites configuráveis evitam o abuso de recursos >>>>>>> 82edf2d (New md files from RunPod) ## Impacto no Desempenho Sobrecarga mínima durante a operação normal O desligamento pode levar até 5 segundos a mais (configurável) O uso de memória é limitado pelos limites do tamanho da fila O impacto na CPU é insignificante (<1% de aumento)