PageIndex/examples/workspace/12345678-abcd-4321-abcd-123456789abc.json
Ray 77722838e1
Restructure examples directory and improve document storage (#189)
* Consolidate tests/ into examples/documents/

* Add line_count and reorder structure keys

* Lazy-load documents with _meta.json index

* Update demo script and add pre-shipped workspace

* Extract shared helpers for JSON reading and meta entry building
2026-03-28 04:28:59 +08:00

{
"id": "12345678-abcd-4321-abcd-123456789abc",
"type": "pdf",
"path": "../documents/attention-residuals.pdf",
"doc_name": "attention-residuals.pdf",
"doc_description": "This document introduces \"Attention Residuals\" (AttnRes) and its scalable variant \"Block AttnRes,\" novel mechanisms for replacing fixed residual accumulation in neural networks with learned, input-dependent depth-wise attention, addressing limitations of standard residual connections while optimizing memory, computation, and scalability for large-scale training and inference.",
"page_count": 21,
"structure": [
{
"title": "Preface",
"node_id": "0000",
"start_index": 1,
"end_index": 2,
"summary": "The partial document introduces \"Attention Residuals\" (AttnRes), a novel approach to replace fixed residual accumulation in large language models (LLMs) with learned, input-dependent softmax attention over preceding layer outputs. This method addresses issues like uncontrolled hidden-state growth and dilution of layer contributions caused by standard residual connections with PreNorm. To enhance scalability, the document proposes \"Block AttnRes,\" which partitions layers into blocks and applies attention at the block level, reducing memory and communication overhead while maintaining performance gains. The document highlights system optimizations, such as cross-stage caching and a two-phase computation strategy, to make Block AttnRes efficient for large-scale training. Experiments confirm consistent improvements across model sizes, with AttnRes mitigating PreNorm dilution, leading to more uniform output magnitudes, gradient distributions, and better downstream task performance. Key contributions include the introduction of AttnRes and Block AttnRes, scalable infrastructure optimizations, and comprehensive evaluations demonstrating their effectiveness."
},
{
"title": "Introduction",
"node_id": "0001",
"start_index": 2,
"end_index": 3,
"summary": "The partial document introduces \"Attention Residuals\" (AttnRes), a novel mechanism that replaces fixed residual accumulation in deep networks with learned softmax attention over depth. It highlights the limitations of standard residual connections, such as uniform layer contributions, irreversible information loss, and output growth, and draws parallels between depth-wise accumulation and sequence modeling in RNNs. AttnRes enables selective, content-aware aggregation of information across layers using attention weights, addressing these limitations. The document also proposes a scalable variant, Block AttnRes, which reduces memory and communication overhead for large-scale training. Key contributions include the development of AttnRes and Block AttnRes, system optimizations for scalability, and comprehensive evaluations demonstrating improved training dynamics, bounded hidden-state magnitudes, and better gradient distribution. The approach is validated through scaling law experiments, ablations, and downstream benchmarks, showing consistent performance improvements over standard residual connections."
},
{
"title": "Motivation",
"node_id": "0002",
"start_index": 3,
"end_index": 3,
"summary": "The partial document discusses the concept of Attention Residuals in the context of deep learning models, particularly Transformers. It begins by introducing the notation and structure of input sequences and layers in a Transformer model. The document then explains residual learning, highlighting its importance in training deep networks by enabling gradients to bypass transformations through identity mapping. It expands on the limitations of traditional residual connections and highway networks, such as lack of selective access to earlier layer outputs, irreversible information loss, and output growth issues that destabilize training. To address these limitations, the document proposes Attention Residuals (AttnRes), a mechanism inspired by the duality of time and depth in sequence modeling. This approach introduces layer-specific attention weights to selectively aggregate information from all preceding layers, offering a unified view of time and depth while maintaining computational feasibility."
},
{
"title": "Attention Residuals: A Unified View of Time and Depth",
"node_id": "0003",
"start_index": 3,
"end_index": 4,
"summary": "The partial document discusses the concept of \"Attention Residuals\" as a mechanism to address limitations in training deep networks with residual connections. It begins by explaining residual learning, its benefits in gradient flow, and its limitations, such as lack of selective access, irreversible information loss, and output growth. The document introduces \"Attention Residuals\" (AttnRes), which generalizes residual connections by allowing layers to selectively aggregate information from all preceding layers using attention mechanisms. It describes \"Full Attention Residuals,\" which compute attention weights over depth with softmax normalization, and highlights their computational and memory overhead. To address scalability challenges, the document proposes \"Block Attention Residuals,\" which partition layers into blocks, reducing memory and communication overhead by applying attention at the block level. The text also outlines the intra-block accumulation process and its efficiency in distributed training setups.",
"nodes": [
{
"title": "Full Attention Residuals",
"node_id": "0004",
"start_index": 4,
"end_index": 4,
"summary": "The partial document discusses \"Attention Residuals\" in neural networks, focusing on two main approaches: Full Attention Residuals and Block Attention Residuals. \n\n1. **Full Attention Residuals**: This method computes attention weights using a kernel function with RMS normalization to prevent large-magnitude outputs from dominating. It introduces no additional memory overhead during vanilla training but incurs communication and memory overhead in distributed training due to the need to retain and transmit layer outputs across stages. A blockwise optimization strategy is proposed to reduce memory I/O by batching attention computation within groups of layers.\n\n2. **Block Attention Residuals**: This approach partitions layers into blocks, reducing memory and communication overhead by summing layer outputs within each block and applying attention only to block-level representations. This reduces the complexity from O(Ld) to O(Nd), where N is the number of blocks. The method ensures normalization to avoid biases from magnitude differences between blocks.\n\nThe document highlights the trade-offs between memory, computation, and communication overheads in these methods and introduces strategies to optimize their efficiency in distributed training setups."
},
{
"title": "Block Attention Residuals",
"node_id": "0005",
"start_index": 4,
"end_index": 5,
"summary": "The partial document discusses \"Attention Residuals,\" focusing on two main variants: Full Attention Residuals (Full AttnRes) and Block Attention Residuals (Block AttnRes). \n\n1. **Full Attention Residuals (Full AttnRes):**\n - Defines attention weights using a kernel function with RMS normalization to prevent large-magnitude outputs from dominating.\n - Requires O(L²d) arithmetic and O(Ld) memory, with no additional memory overhead during vanilla training.\n - Highlights challenges in large-scale training, such as memory and communication overhead under pipeline parallelism.\n - Introduces blockwise optimization to reduce memory I/O but notes that cross-stage communication remains a bottleneck.\n\n2. **Block Attention Residuals (Block AttnRes):**\n - Partitions layers into blocks, reducing memory and communication overhead from O(Ld) to O(Nd) by summing layer outputs within blocks and applying attention over block-level representations.\n - Provides PyTorch-style pseudocode for implementation, detailing intra-block accumulation and inter-block attention mechanisms.\n - Improves efficiency by reducing memory and computation requirements, with block count N interpolating between Full AttnRes (N=L) and standard residual connections (N=1).\n - Enhances inference latency and bounds KV cache size through blockwise optimization.\n\nThe document also addresses infrastructure challenges for large-scale training, emphasizing the need to manage communication overhead and optimize system design for block-based attention mechanisms."
}
]
},
{
"title": "Infrastructure Design",
"node_id": "0006",
"start_index": 5,
"end_index": 6,
"summary": "The partial document describes the concept and implementation of Block Attention Residuals (Block AttnRes), a mechanism designed to improve memory and computational efficiency in attention-based models. It introduces inter-block attention, where attention is computed over block representations and partial sums, reducing memory and computation from O(L) and O(L²) to O(N) and O(N²), respectively. The document provides PyTorch-style pseudocode for the implementation, detailing how block representations and partial sums are managed across layers. It highlights the efficiency benefits of using block representations instead of individual outputs, with empirical findings suggesting that a block count of N≈8 balances performance and resource usage. \n\nThe document also addresses infrastructure challenges in large-scale training and inference. It discusses pipeline communication optimizations, such as cross-stage caching, to reduce redundant data transmission and improve efficiency during distributed training. For inference, it proposes a two-phase computation strategy and memory-efficient prefilling to handle long-context scenarios. An example of cache-based pipeline communication is provided, illustrating how caching minimizes communication overhead in distributed systems.",
"nodes": [
{
"title": "Training",
"node_id": "0007",
"start_index": 6,
"end_index": 7,
"summary": "The partial document discusses the optimization of Attention Residuals (AttnRes) in training and inference for large-scale distributed systems. It introduces cross-stage caching to address communication and memory overheads in pipeline parallelism, reducing redundant data transmission and improving efficiency. The document details a two-phase computation strategy for Block AttnRes, which includes parallel inter-block attention and sequential intra-block attention with online softmax merging. This approach minimizes memory access and I/O overhead while maintaining a low training overhead. Additionally, it highlights the memory-efficient prefilling scheme for long-context inputs and explains how Block AttnRes compresses representations to reduce storage requirements. The document also provides algorithmic details and performance improvements in both training and inference scenarios."
},
{
"title": "Inference",
"node_id": "0008",
"start_index": 7,
"end_index": 8,
"summary": "The partial document describes the technical details and implementation of Attention Residuals (AttnRes) in neural network architectures. It introduces a two-phase computation strategy for block-based attention, optimizing memory and computational efficiency. Phase 1 handles parallel inter-block attention, while Phase 2 processes sequential intra-block attention with an online softmax merge. The document highlights memory overhead reduction through cross-stage caching, sequence-sharded prefilling, and kernel fusion, achieving minimal training and inference latency overhead. It compares memory access costs across different residual mechanisms and demonstrates the efficiency of AttnRes, particularly in Block AttnRes, which compresses block representations. Experimental results show that AttnRes improves scaling behavior and validation loss compared to baseline models, with negligible parameter overhead and consistent performance gains across compute ranges."
}
]
},
{
"title": "Experiments",
"node_id": "0009",
"start_index": 8,
"end_index": 8,
"summary": "The partial document discusses the technical details and performance of the Attention Residuals (AttnRes) mechanism in transformer architectures. It highlights the memory efficiency and reduced inference latency of AttnRes compared to prior residual mechanisms like mHC. The document provides a breakdown of memory access costs for different schemes, emphasizing the two-phase inference schedule of AttnRes and its memory-efficient prefilling strategy, which significantly reduces memory overhead through sharding and chunked prefill techniques. It also describes the integration of AttnRes into a Mixture-of-Experts (MoE) Transformer architecture, detailing its minimal parameter addition and initialization strategy to ensure stable training. Additionally, the document presents experimental results, including scaling laws and validation loss comparisons across model variants, demonstrating that AttnRes achieves consistently lower loss while maintaining similar scaling behavior to the baseline.",
"nodes": [
{
"title": "Scaling Laws",
"node_id": "0010",
"start_index": 8,
"end_index": 9,
"summary": "The partial document discusses the implementation and evaluation of Attention Residuals (AttnRes) in transformer architectures. It highlights the memory efficiency and reduced inference latency of AttnRes compared to prior residual mechanisms like mHC. The document introduces a two-phase inference schedule for AttnRes, optimizing memory access costs and reducing per-device memory usage through sharding and chunked prefill techniques. It describes the integration of AttnRes into a Mixture-of-Experts (MoE) Transformer architecture, maintaining minimal parameter overhead and ensuring stable training through specific initialization strategies. Experiments compare scaling laws and validation loss across model sizes, showing that both Full and Block AttnRes outperform baselines and mHC in terms of loss and compute efficiency. The main results include training recipes for large-scale models, leveraging hybrid attention mechanisms and progressive sequence length extension without additional modifications."
},
{
"title": "Main Results",
"node_id": "0011",
"start_index": 9,
"end_index": 11,
"summary": "The partial document discusses the concept of Attention Residuals (AttnRes) in transformer models, comparing its performance and efficiency against baseline models and other methods. Key points include:\n\n1. **Model Configurations and Validation Loss**: Comparison of Baseline, Block AttnRes, Full AttnRes, and mHC(-lite) models across various configurations, showing that AttnRes consistently achieves lower validation loss, with Block AttnRes closely tracking Full AttnRes.\n\n2. **Scaling Laws**: Analysis of scaling behavior, demonstrating that Block AttnRes achieves significant compute efficiency and narrows the performance gap with Full AttnRes at larger scales.\n\n3. **Training Recipe**: Description of the training process for large models, including pre-training and mid-training phases, use of hybrid attention mechanisms, and progressive sequence length extension.\n\n4. **Training Dynamics**: Examination of validation loss, output magnitude, and gradient magnitude during training, highlighting how Block AttnRes mitigates issues like PreNorm dilution and uneven gradient flow.\n\n5. **Downstream Performance**: Evaluation of AttnRes on various benchmarks for language understanding, reasoning, and code/math tasks, showing consistent improvements over the baseline, particularly in multi-step reasoning and compositional tasks.\n\n6. **Ablation Study**: Validation of key design choices in AttnRes, comparing it with prior methods like DenseFormer and mHC. Full AttnRes achieves the best performance, while Block AttnRes offers a memory-efficient trade-off with competitive results.\n\n7. **Cross-Layer Access**: Exploration of different granularities of cross-layer access, with Block AttnRes providing an effective balance between performance and memory efficiency, and Full AttnRes offering the best results at higher memory costs."
},
{
"title": "Ablation Study",
"node_id": "0012",
"start_index": 11,
"end_index": 12,
"summary": "The partial document focuses on the development and evaluation of Attention Residuals (AttnRes), a novel mechanism for improving Transformer models. Key points include:\n\n1. **Ablation Studies**: The document evaluates the impact of various design choices in AttnRes, such as input-dependent queries, input-independent mixing, softmax vs. sigmoid, multihead attention, and RMSNorm. Results show that input-dependent queries and RMSNorm improve performance, while softmax outperforms sigmoid due to sharper selection.\n\n2. **Comparison with Prior Methods**: AttnRes is compared against baseline PreNorm, DenseFormer, and mHC. AttnRes achieves superior performance, with Full AttnRes and Block AttnRes showing significant improvements in validation loss.\n\n3. **Cross-Layer Access**: Different granularities of cross-layer access are analyzed. Full AttnRes achieves the best performance, while Block AttnRes offers a memory-efficient trade-off. Sliding-window aggregation (SWA) is less effective, highlighting the importance of selectively accessing distant layers.\n\n4. **Performance on Benchmarks**: AttnRes outperforms the baseline on various benchmarks, particularly in multi-step reasoning tasks, code generation, and knowledge-oriented tasks, demonstrating its effectiveness in compositional tasks.\n\n5. **Optimal Architecture Analysis**: The study explores how AttnRes reshapes architectural scaling under fixed compute and parameter budgets. AttnRes favors deeper models with a shift in the optimal depthwidthattention trade-off, achieving consistently lower losses across configurations compared to the baseline.\n\n6. **Validation Loss Trends**: The document provides detailed validation loss trends across different configurations and block sizes, showing graceful degradation with increasing block size and highlighting the efficiency of finer-grained configurations."
},
{
"title": "Analysis",
"node_id": "0013",
"start_index": 12,
"end_index": 12,
"summary": "The partial document discusses the evaluation and analysis of Attention Residuals (AttnRes) in Transformer architectures. Key points include:\n\n1. **Architecture Sweep**: A study under fixed compute and parameter budgets to analyze validation loss across different configurations of model depth (dmodel/Lb) and attention heads (H/Lb). AttnRes consistently outperforms the baseline in all configurations, with a notable shift in optimal depth from dmodel/Lb ≈ 60 (baseline) to dmodel/Lb ≈ 45 (AttnRes).\n\n2. **Component Design Ablations**:\n - **Input-dependent query**: Improves performance but adds computational complexity.\n - **Input-independent mixing**: Degrades performance compared to learned queries.\n - **Softmax vs. Sigmoid**: Softmax performs better due to sharper selection among sources.\n - **Multihead Attention**: Depth aggregation across heads reduces performance, indicating uniform depth-wise mixtures are optimal.\n - **RMSNorm on Keys**: Removing RMSNorm negatively impacts performance, especially for block-level representations, by preventing bias in attention weights.\n\n3. **Optimal Architecture Analysis**: Investigates how AttnRes influences depthwidthattention trade-offs under fixed compute and parameter constraints. AttnRes favors deeper models and achieves lower loss compared to conventional Transformer designs.",
"nodes": [
{
"title": "Optimal Architecture",
"node_id": "0014",
"start_index": 12,
"end_index": 13,
"summary": "The partial document discusses the concept of Attention Residuals (AttnRes) in Transformer architectures, focusing on their design, performance, and analysis. Key points include:\n\n1. **Component Design Ablations**: The document evaluates various modifications to the attention mechanism, such as input-dependent queries, input-independent mixing, softmax vs. sigmoid, multihead attention, and RMSNorm on keys. These experiments highlight the impact of each component on performance, with findings such as the importance of softmax for competitive normalization and RMSNorm for preventing bias in attention weights.\n\n2. **Optimal Architecture Analysis**: A controlled study under fixed compute and parameter budgets examines how AttnRes reshapes architectural scaling preferences. Results show that AttnRes favors deeper, narrower networks compared to baseline Transformers, achieving lower validation loss across configurations. The optimal configuration shifts to a lower dmodel/Lb ratio, indicating better exploitation of depth.\n\n3. **Learned AttnRes Patterns**: Visualization of learned attention weights reveals key insights:\n - Preserved locality with layers attending strongly to immediate predecessors while forming selective skip connections.\n - Layer specialization, with embeddings retaining weight and distinct patterns in pre-attention and pre-MLP layers.\n - Block AttnRes effectively preserves structural patterns while acting as implicit regularization.\n\n4. **Performance Trends**: AttnRes consistently outperforms the baseline across configurations, with lower validation loss and sharper, more decisive weight distributions in block attention settings."
},
{
"title": "Analyzing Learned AttnRes Patterns",
"node_id": "0015",
"start_index": 13,
"end_index": 14,
"summary": "The partial document discusses Attention Residuals (AttnRes) in deep learning models, focusing on their structure, behavior, and benefits. Key points include:\n\n1. **Depth-wise Attention Weight Distributions**: Analysis of weight distributions in a 16-head model with full and block Attention Residuals, highlighting diagonal dominance (locality), learned skip connections, and sharper weight distributions in block settings.\n\n2. **Learned AttnRes Patterns**: Observations include preserved locality, layer specialization, and the ability of block AttnRes to maintain essential information pathways while acting as implicit regularization.\n\n3. **Comparison of Residual Update Mechanisms**: A detailed comparison of various residual connection methods, including their update rules, weight types (fixed, learned, or dynamic), and source access.\n\n4. **Sequence-Depth Duality**: Exploration of the analogy between residual connections and RNNs, emphasizing how AttnRes replaces depth-wise recurrence with direct cross-layer attention for improved information propagation.\n\n5. **Residual Connections as Structured Matrices**: Formalization of residual connections as depth mixing matrices, comparing different methods based on weight generation and structural constraints.\n\nThe document emphasizes the advantages of AttnRes in leveraging depth, preserving structure, and enabling efficient information flow across layers."
}
]
}
]
},
{
"title": "Discussions",
"node_id": "0016",
"start_index": 14,
"end_index": 14,
"summary": "The partial document discusses various residual update mechanisms in neural network architectures, comparing their update rules, weight types (fixed, learned-static, or input-dependent), and sources of earlier representations. It categorizes methods into single-state recurrence, multi-state recurrence, and cross-layer access, providing examples like Residual, ReZero, LayerScale, Highway, DeepNorm, KEEL, DenseNet, DenseFormer, MRLA, and AttnRes. The document explores the sequence-depth duality, drawing parallels between residual connections and recurrent neural networks (RNNs), and highlights how AttnRes replaces depth-wise recurrence with direct cross-layer attention. Additionally, it formalizes residual connections as structured matrices, introducing a depth mixing matrix to analyze how different methods aggregate outputs from previous layers, and discusses their weight generation and rank constraints.",
"nodes": [
{
"title": "Sequence-Depth Duality",
"node_id": "0017",
"start_index": 14,
"end_index": 14,
"summary": "The partial document discusses various residual update mechanisms in neural network architectures, comparing their update rules, weight types (fixed, learned-static, or input-dependent), and sources of earlier representations. It categorizes methods into single-state recurrence, multi-state recurrence, and cross-layer access, providing examples like Residual, ReZero, LayerScale, Highway, DeepNorm, KEEL, DenseNet, DenseFormer, MRLA, and AttnRes. The document explores the sequence-depth duality, drawing parallels between residual connections and recurrent neural networks (RNNs), and highlights how AttnRes replaces depth-wise recurrence with cross-layer attention. Additionally, it formalizes residual connections as structured matrices, introducing a depth mixing matrix to analyze how different methods aggregate outputs from previous layers, and discusses their weight generation and rank constraints."
},
{
"title": "Residual Connections as Structured Matrices",
"node_id": "0018",
"start_index": 14,
"end_index": 16,
"summary": "The partial document discusses various residual update mechanisms in neural networks, comparing their weight types (fixed, learned, or input-dependent) and source accessibility. It introduces AttnRes, a novel approach that replaces fixed residual accumulation with learned, input-dependent depth-wise attention, inspired by the sequence-depth duality. The document explores structured matrix perspectives, showing how residual variants can be viewed as depth-wise linear attention. It highlights the limitations of existing methods like single-state recurrence and multi-state recurrence, and contrasts them with AttnRes, which provides selective access to earlier-layer outputs. The paper also introduces Block AttnRes, a scalable variant that partitions layers into blocks to reduce memory and computational overhead while retaining performance gains. Empirical results validate the effectiveness of AttnRes and Block AttnRes, with discussions on normalization, scaling, depth stability, and cross-layer connectivity. The document concludes by emphasizing the practicality and scalability of Block AttnRes for large-scale models."
},
{
"title": "Prior Residuals as Depth-Wise Linear Attention",
"node_id": "0019",
"start_index": 16,
"end_index": 16,
"summary": "The partial document discusses the concept of Attention Residuals (AttnRes), which replaces traditional residual accumulation with learned, input-dependent depth-wise attention. It explores the structured-matrix perspective, sequence-depth duality, and the role of state expansion in depth-wise linear attention. The document addresses challenges in normalization, scaling, and depth stability, comparing PreNorm and PostNorm approaches and introducing AttnRes as a solution to avoid cumulative magnitude growth and gradient vanishing. It highlights multi-state recurrence methods, cross-layer connectivity strategies, and the advantages of AttnRes in selectively accessing earlier-layer outputs. The introduction of Block AttnRes is proposed to address memory constraints in large-scale models by partitioning layers into blocks, reducing computational overhead while maintaining performance. Empirical studies validate the effectiveness of AttnRes and Block AttnRes, with scalability and efficiency improvements highlighted as key contributions."
}
]
},
{
"title": "Related Work",
"node_id": "0020",
"start_index": 16,
"end_index": 16,
"summary": "The partial document discusses the concept of Attention Residuals (AttnRes) and its application as depth-wise attention in neural networks. It explores the structured-matrix perspective, highlighting how existing residual variants can be interpreted as linear attention mechanisms over the depth axis. The document addresses challenges in normalization, scaling, and depth stability, comparing PreNorm and PostNorm approaches and introducing AttnRes as a solution to avoid cumulative magnitude growth and gradient vanishing. It also examines multi-state recurrence methods, cross-layer connectivity strategies, and their limitations, proposing AttnRes as a method that selectively aggregates earlier-layer outputs with softmax-normalized, input-dependent weights. The introduction of Block AttnRes is detailed as a scalable alternative to Full AttnRes, reducing memory and computational overhead by partitioning layers into blocks while maintaining performance gains. Empirical validation and practical implementation strategies, such as cross-stage caching and two-phase computation, are also discussed."
},
{
"title": "Conclusion",
"node_id": "0021",
"start_index": 16,
"end_index": 20,
"summary": "The partial document discusses the concept of Attention Residuals (AttnRes), introducing a novel approach to residual accumulation in neural networks by leveraging depth-wise attention mechanisms. It explores the sequence-depth duality, interpreting residual variants as linear attention over the depth axis. The document highlights the challenges of normalization placement and gradient propagation in standard residual updates, comparing PreNorm and PostNorm methods, and presents AttnRes as a solution that avoids cumulative magnitude growth and gradient vanishing. It also examines multi-state recurrence and cross-layer connectivity, contrasting AttnRes with existing methods like Hyper-Connections, DenseNet, and MUDDFormer, emphasizing its selective access to earlier-layer outputs and efficient scaling. The introduction of Block AttnRes addresses memory constraints by partitioning layers into blocks, reducing computational overhead while maintaining performance. Empirical studies validate the scalability and efficiency of AttnRes, with future directions focusing on finer-grained blocking and hardware advancements."
},
{
"title": "Contributions",
"node_id": "0022",
"start_index": 20,
"end_index": 21,
"summary": "The partial document discusses the concept of \"Attention Residuals\" and provides a technical explanation of optimized inference input/output (I/O) for Full Attention Residuals. It highlights the inefficiencies of a naïve implementation, where memory traffic scales linearly with depth, and introduces a two-phase scheduling approach to reduce I/O costs. The document explains the partitioning of layers into blocks and details the two phases: Phase 1 (batched inter-block attention) and Phase 2 (sequential intra-block attention). It provides mathematical formulations for read and write costs during these phases and demonstrates how batching inter-block reads reduces per-layer I/O complexity from O(L) to O(S+N). The approach maintains the model architecture while optimizing inference efficiency. Additionally, the document lists the contributors to the work, with equal contributions noted for some authors."
},
{
"title": "Optimized Inference I/O for Full Attention Residuals",
"node_id": "0023",
"start_index": 21,
"end_index": 21,
"summary": "The partial document discusses an optimized inference I/O strategy for Full Attention Residuals (Full AttnRes) to reduce memory traffic, which scales linearly with model depth in a naïve implementation. It introduces a two-phase scheduling approach for inference, dividing the model into blocks to batch inter-block and intra-block computations. Phase 1 handles batched inter-block attention, reducing redundant memory reads by reusing key-value pairs across layers within a block. Phase 2 processes sequential intra-block dependencies. The document provides detailed calculations for read and write costs during both phases, showing that the proposed method reduces per-layer I/O complexity from O(L) to O(S+N), where S is the block size and N is the number of blocks. The approach maintains the model architecture while optimizing memory efficiency during inference."
}
],
"pages": [
{
"page": 1,
"content": "ATTENTIONRESIDUALS\nTECHNICALREPORT OFATTENTIONRESIDUALS\nKimi Team\n/gtbhttps://github.com/MoonshotAI/Attention-Residuals\nABSTRACT\nResidual connections [12] with PreNorm [60] are standard in modern LLMs, yet they accumulate\nall layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state\ngrowth with depth, progressively diluting each layers contribution [27]. We proposeAttention\nResiduals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding\nlayer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-\ndependent weights. To address the memory and communication overhead of attending over all\npreceding layer outputs for large-scale model training, we introduceBlock AttnRes, which partitions\nlayers into blocks and attends over block-level representations, reducing the memory footprint while\npreserving most of the gains of full AttnRes. Combined with cache-based pipeline communication\nand a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for\nstandard residual connections with minimal overhead.\nScaling law experiments confirm that the improvement is consistent across model sizes, and ablations\nvalidate the benefit of content-dependent depth-wise selection. 
We further integrate AttnRes into the Kimi Linear architecture [69] (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.\n[Figure 1 diagrams omitted: (a) Standard Residuals, (b) Full Attention Residuals, (c) Block Attention Residuals]\nFigure 1: Overview of Attention Residuals. (a) Standard Residuals: standard residual connections with uniform additive accumulation. (b) Full AttnRes: each layer selectively aggregates all previous layer outputs via learned attention weights. (c) Block AttnRes: layers are grouped into blocks, reducing memory from O(Ld) to O(Nd).\narXiv:2603.15031v1 [cs.CL] 16 Mar 2026"
},
{
"page": 2,
"content": "Attention ResidualsTECHNICALREPORT\n1 Introduction\nStandard residual connections [12] are thede factobuilding block of modern LLMs [35, 51, 9]. The update hl=\nhl1+fl1(hl1)is widely understood as agradient highwaythat lets gradients bypass transformations via identity\nmappings, enabling stable training at depth. Yet residuals also play a second role that has received less attention.\nUnrolling the recurrence shows that every layer receives the same uniformly-weighted sum of all prior layer outputs;\nresiduals define how information aggregates across depth. Unlike sequence mixing and expert routing, which now\nemploy learnable input-dependent weighting [53, 20, 9], this depth-wise aggregation remains governed by fixed unit\nweights, with no mechanism to selectively emphasize or suppress individual layer contributions.\nIn practice, PreNorm [60] has become the dominant paradigm, yet its unweighted accumulation causes hidden-state\nmagnitudes to grow as O(L) with depth, progressively diluting each layers relative contribution [27]. Early-layer\ninformation is buried and cannot be selectively retrieved; empirically, a significant fraction of layers can be pruned with\nminimal loss [11]. Recent efforts such as scaled residual paths [54] and multi-stream recurrences [72] remain bound to\nthe additive recurrence, while methods that do introduce cross-layer access [36, 56] are difficult to scale. The situation\nparallels the challenges that recurrent neural networks (RNNs) faced over the sequence dimension before attention\nmechanism provided an alternative.\nWe observe a formal duality between depth-wise accumulation and the sequential recurrence in RNNs. Building\non this duality, we proposeAttention Residuals (AttnRes), which replaces the fixed accumulation hl=P\nivi\nwithhl=P\niαi→l·vi, where αi→laresoftmax attention weights computed from a single learned pseudo-query\nwl∈Rdper layer. 
This lightweight mechanism enables selective, content-aware retrieval across depth with only one d-dimensional vector per layer. Indeed, standard residual connections and prior recurrence-based variants can all be shown to perform depth-wise linear attention; AttnRes generalizes them to depth-wise softmax attention, completing for depth the same linear-to-softmax transition that proved transformative over sequences (§6.2, §6.1).\nIn standard training, Full AttnRes adds negligible overhead, since the layer outputs it requires are already retained for backpropagation. At scale, however, activation recomputation and pipeline parallelism are routinely employed, and these activations must now be explicitly preserved and communicated across pipeline stages. We introduce Block AttnRes to maintain efficiency in this regime: layers are partitioned into N blocks, each reduced to a single representation via standard residuals, with cross-block attention applied only over the N block-level summaries. This brings both memory and communication down to O(Nd), and together with infrastructure optimizations (§4), Block AttnRes serves as a drop-in replacement for standard residual connections with marginal training cost and negligible inference latency overhead.\nScaling law experiments confirm that AttnRes consistently outperforms the baseline across compute budgets, with Block AttnRes matching the loss of a baseline trained with 1.25× more compute. We further integrate AttnRes into the Kimi Linear architecture [69] (48B total / 3B activated parameters) and pre-train on 1.4T tokens. Analysis of the resulting training dynamics reveals that AttnRes mitigates PreNorm dilution, with output magnitudes remaining bounded across depth and gradient norms distributing more uniformly across layers. 
On downstream benchmarks, our final model improves over the baseline across all evaluated tasks.\nContributions\n• Attention Residuals. We propose AttnRes, which replaces fixed residual accumulation with learned softmax attention over depth, and its scalable variant Block AttnRes that reduces memory and communication from O(Ld) to O(Nd). Through a unified structured-matrix analysis, we show that standard residuals and prior recurrence-based variants correspond to depth-wise linear attention, while AttnRes performs depth-wise softmax attention.\n• Infrastructure for scale. We develop system optimizations that make Block AttnRes practical and efficient at scale, including cross-stage caching that eliminates redundant transfers under pipeline parallelism and a two-phase inference strategy that amortizes cross-block attention via online softmax [31]. The resulting training overhead is marginal, and the inference latency overhead is less than 2% on typical inference workloads.\n• Comprehensive evaluation and analysis. We validate AttnRes through scaling law experiments, component ablations, and downstream benchmarks on a 48B-parameter model pre-trained on 1.4T tokens, demonstrating consistent improvements over standard residual connections. Training dynamics analysis further reveals that AttnRes mitigates PreNorm dilution, yielding bounded hidden-state magnitudes and more uniform gradient distribution across depth."
},
{
"page": 3,
"content": "Attention ResidualsTECHNICALREPORT\n2 Motivation\nNotation.Consider a batch of input sequences with shape B×T×d , where Bis the batch size, Tis the sequence\nlength, and dis the hidden dimension. For clarity, we write formulas for a single token: hl∈Rddenotes the hidden state\nentering layer l, where l∈ {1, . . . , L} is the layer index and Lis the total number of layers. The token embedding is h1.\nThe function flrepresents the transformation applied by layer l. In Transformer models, we treat each self-attention or\nMLP as an individuallayer.\n2.1 Training Deep Networks via Residuals\nResidual Learning.Residual learning [12] proves to be a critical technique in training deep networks as it allows\ngradients to bypass transformations. Specifically, each layer updates the hidden state as:\nhl=hl1+fl1(hl1)\nExpanding this recurrence, the hidden state at layer lis the sum of the embedding and all preceding layer outputs:\nhl=h 1+Pl1\ni=1fi(hi). The key insight behind residual connections isidentity mapping: each layer preserves a direct\npath for both information and gradients to flow unchanged. During back-propagation, the gradient with respect to an\nintermediate hidden state is:\n∂L\n∂hl=∂L\n∂hL·L1Y\nj=l\u0012\nI+∂fj\n∂hj\u0013\nExpanding this product yields Iplus higher-order terms involving the layer Jacobians ∂fj/∂hj. The identity term is\nalways preserved, providing a direct gradient path from the loss to any layer regardless of depth.\nGeneralizing Residuals.While effective, the fixed unit coefficients in the residual update treat every layers con-\ntribution uniformly, offering no mechanism to adapt the mixing across depth. Highway networks [45] relax this by\nintroducing learned element-wise gates:\nhl= (1g l)⊙h l1+gl⊙fl1(hl1)\nwhere gl∈[0,1]dinterpolates between the transformation and the identity path. 
More generally, both are instances of a weighted recurrence h_l = α_l·h_{l-1} + β_l·f_{l-1}(h_{l-1}), with the residual setting α_l = β_l = 1 and Highway setting α_l = 1 − g_l, β_l = g_l.\nLimitations. Whether fixed or gated, both approaches share a fundamental constraint: each layer can only access its immediate input h_{l-1}, a single compressed state that conflates all earlier layer outputs, rather than the individual outputs themselves. This entails several limitations: (1) no selective access: different layer types (e.g., attention vs. MLP) receive the same aggregated state, despite potentially benefiting from different weightings; (2) irreversible loss: information lost through aggregation cannot be selectively recovered in deeper layers; and (3) output growth: later layers learn increasingly larger outputs to gain influence over the accumulated residual, which can destabilize training. These limitations motivate a mechanism that lets each layer selectively aggregate information from all preceding layers.\n3 Attention Residuals: A Unified View of Time and Depth\nThe limitations discussed above are reminiscent of similar bottlenecks in sequence modeling, suggesting that we seek similar solutions for the depth dimension.\nThe Duality of Time and Depth. Like RNNs over time, residual connections compress all prior information into a single state h_l over depth. For sequence modeling, the Transformer improved upon RNNs by replacing recurrence with attention [3, 52], allowing each position to selectively access all previous positions with data-dependent weights. We propose the same methodology for depth:\nh_l = α_{0→l}·h_1 + Σ_{i=1}^{l-1} α_{i→l}·f_i(h_i)    (1)\nwhere α_{i→l} are layer-specific attention weights satisfying Σ_{i=0}^{l-1} α_{i→l} = 1. Unlike sequence length (which can reach millions of tokens), network depth is typically modest (L < 1000), making O(L²) attention over depth computationally feasible. We call this approach Attention Residuals, abbreviated as AttnRes."
},
{
"page": 4,
"content": "Attention ResidualsTECHNICALREPORT\n3.1 Full Attention Residuals\nThe attention weights can be written as αi→l=ϕ(q l,ki)for a kernel function ϕ:Rd×Rd→R≥0, where qland\nkiare query and key vectors [23, 70]. Different choices of ϕrecover different residual variants (§6.2); we adopt\nϕ(q,k) = exp\u0000\nqRMSNorm(k)\u0001\n[66] with normalization, yieldingsoftmaxattention over depth:\nαi→l=ϕ(ql,ki)\nPl1\nj=0ϕ(ql,kj)(2)\nFor each layerl, we define:\nql=w l,k i=vi=\u001ah1 i= 0\nfi(hi) 1≤i≤l1(3)\nwhere the query ql=w lis a layer-specific learnable vector in Rd. The RMSNorm inside ϕprevents layers with\nlarge-magnitude outputs from dominating the attention weights. The input to layerlis then:\nhl=l1X\ni=0αi→l·vi (4)\nWe call this formfull attention residuals. For each token, Full AttnRes requires O(L2d)arithmetic and O(Ld) memory\nto store layer outputs. Since depth is far smaller than sequence length, the arithmetic cost is modest.\nOverhead.The O(Ld) memory overlaps entirely with the activations already retained for backpropagation, so Full\nAttnRes introduces no additional memory overhead in vanilla training. At scale, however, activation recomputation and\npipeline parallelism are widely adopted: layer outputs that would otherwise be freed and recomputed must now be kept\nalive for all subsequent layers, and under pipeline parallelism each must further be transmitted across stage boundaries.\nBoth the memory and communication overhead then grow asO(Ld).\nBlockwise optimization.A deliberate design choice in Full AttnRes is that thepseudo-query wlis a learned parameter\ndecoupled from the layers forward computation. 
This independence means that attention weights for any group of layers can be computed in parallel without waiting for their sequential outputs, and in particular permits grouping the L layers into N blocks of S layers each and batching the attention computation within each block, reducing per-layer memory I/O from O(Ld) to O((S+N)d) (we defer the detailed two-phase strategy to §4). Under current distributed training regimes, however, the dominant cost is not local memory bandwidth but cross-stage communication under pipeline parallelism: every layer output must still be transmitted between stages, and this O(Ld) communication overhead cannot be alleviated by local batching. This motivates the Block AttnRes variant introduced below, which reduces the number of cross-stage representations from L to N. We anticipate that future interconnect improvements will make the full O(Ld) communication practical, fully realizing the potential of Full AttnRes.\n3.2 Block Attention Residuals\nWe propose Block Attention Residuals, which partitions the L layers into N blocks: within each block, the layer outputs are reduced to a single representation via summation, and across blocks, we apply full attention over only N block-level representations and the token embedding. This reduces both memory and communication overhead from O(Ld) to O(Nd).\nIntra-Block Accumulation. Specifically, we divide the L layers into N blocks of S = L/N layers each, assuming L is divisible by N; otherwise, the last block contains the remaining L mod N layers. Let B_n denote the set of layer indices in block n (n = 1, . . . , N). To form a block, we sum all of its layer outputs:\nb_n = Σ_{j∈B_n} f_j(h_j)    (5)\nWe further denote b_n^i as the partial sum over the first i layers in B_n, so that b_n = b_n^S. When L is not divisible by N, the final partial sum is taken as the last block's representation. 
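The intra-block bookkeeping of Eq. 5 (block sums b_n and prefix partial sums b_n^i) amounts to the following sketch, shown for the divisible case L = NS with random stand-ins for the layer outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, N = 8, 12, 4
S = L // N                               # S layers per block
layer_outputs = rng.normal(size=(L, d))  # stand-ins for f_j(h_j)

# Eq. 5: each block representation b_n sums the S layer outputs in its block
blocks = layer_outputs.reshape(N, S, d)
b = blocks.sum(axis=1)                   # [N, d]: only N vectors cross stages, not L

# Partial sums b_n^i are prefix sums inside a block; b_n = b_n^S
partials = np.cumsum(blocks[0], axis=0)  # b_1^1 .. b_1^S
assert np.allclose(partials[-1], b[0])
```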
As in Full AttnRes, the RMSNorm inside ϕ prevents magnitude differences between complete blocks and partial sums from biasing the attention weights."
},
{
"page": 5,
"content": "Attention ResidualsTECHNICALREPORT\n1 def block_attn_res(blocks: list[Tensor], partial_block: Tensor, proj: Linear, norm: RMSNorm) -> Tensor:\n2 \"\"\"\n3 Inter-block attention: attend over block reps + partial sum.\n4 blocks:\n5 N tensors of shape [B, T, D]: completed block representations for each previous block\n6 partial_block:\n7 [B, T, D]: intra-block partial sum (b_n^i)\n8 \"\"\"\n9 V = torch.stack(blocks + [partial_block]) # [N+1, B, T, D]\n10 K = norm(V)\n11 logits = torch.einsum('d, n b t d -> n b t', proj.weight.squeeze(), K)\n12 h = torch.einsum('n b t, n b t d -> b t d', logits.softmax(0), V)\n13 return h\n14\n15 def forward(self, blocks: list[Tensor], hidden_states: Tensor) -> tuple[list[Tensor], Tensor]:\n16 partial_block = hidden_states\n17 # apply block attnres before attn\n18 # blocks already include token embedding\n19 h = block_attn_res(blocks, partial_block, self.attn_res_proj, self.attn_res_norm)\n20\n21 # if reaches block boundary, start new block\n22 # block_size counts ATTN + MLP; each transformer layer has 2\n23 if self.layer_number % (self.block_size // 2) == 0:\n24 blocks.append(partial_block)\n25 partial_block = None\n26\n27 # self-attention layer\n28 attn_out = self.attn(self.attn_norm(h))\n29 partial_block = partial_block + attn_out if partial_block is not None else attn_out\n30\n31 # apply block attnres before MLP\n32 h = block_attn_res(blocks, partial_block, self.mlp_res_proj, self.mlp_res_norm)\n33\n34 # MLP layer\n35 mlp_out = self.mlp(self.mlp_norm(h))\n36 partial_block = partial_block + mlp_out\n37\n38 return blocks, partial_block\nFigure 2: PyTorch-style pseudo code for Block Attention Residuals. block_attn_res computes softmax attention over block\nrepresentations using a learned pseudo-query wl;forward is a single-layer pass that maintains partial_block (bi\nn, intra-block\nresidual) andblocks([b 0, . . . 
, b_{n-1}], the inter-block history).\nInter-Block Attention. In Full AttnRes, the input to layer l is computed by attending over all outputs up to f_{l-1}(h_{l-1}). The block-wise variant replaces these individual outputs with block representations, defining b_0 = h_1 so that the token embedding is always included as a source. For the i-th layer in block n, the value matrix is:\nV = [b_0, b_1, . . . , b_{n-1}] if i = 1 (first layer of block n);  V = [b_0, b_1, . . . , b_{n-1}, b_n^{i-1}] if i ≥ 2 (subsequent layers)    (6)\nKeys and attention weights follow Eq. 3 and Eq. 2. The input of the very first layer of the network is the token embedding, i.e. b_0 = h_1. In each block, the first layer receives the previous block representations and the token embedding, and the subsequent layers additionally attend to the partial sum b_n^{i-1}. The final output layer aggregates all N block representations. Fig. 2 provides PyTorch-style pseudocode for Block AttnRes.\nEfficiency. Since each layer now attends over N block representations rather than L individual outputs, memory reduces from O(L) to O(N) and computation from O(L²) to O(N²). The block count N interpolates between two extremes: N = L recovers Full AttnRes, while N = 1 reduces to standard residual connections with the embedding isolated as b_0. Empirically, we find that N ≈ 8 recovers most of the benefit across model scales, requiring only eight stored hidden states per token (see §5).\nBeyond memory and computation, the block structure also benefits inference latency: block boundaries define the dispatch granularity for the blockwise optimization described in §3, and the fixed block count N bounds the KV cache size. The parallel inter-block results are merged with the sequential intra-block partial sums via online softmax [31], preserving exact equivalence (§4).\n4 Infrastructure Design\nBlock AttnRes introduces additional system challenges compared to standard residual connections. 
For large-scale\nmodel training, block representations must be propagated across pipeline stages, causing heavy communication in a\n5"
},
{
"page": 6,
"content": "Attention ResidualsTECHNICALREPORT\nRANK0\nRANK1\nRANK2\nRANK3[b0] [ ]\n[b0] [b1]\n[b0,b1] [ ]\n[b0,b1] [b2]+ [b 1,b2][ ]\n+ [b 1,b2][b3]\n+ [b 2,b3][ ]\n+ [b 2,b3][b4]VIRTUALSTAGE0 VIRTUALSTAGE1\n1 2\n1 2\n1 2\n1 21 2\n1 2\n1 2\n1 2\nFigure 3: Cache-based pipeline communication example with 4 physical ranks and 2 virtual stages per rank, where hatched boxes\ndenote end of AttnRes blocks. Numbers indicate micro-batch indices. Each rank caches previously received blocks; stage transitions\nonly transmit incremental blocks (+[b 1,b2]) instead of the full history.\nnaïve implementation. During inference, repeated access to accumulated block representations increases latency, while\nlong-context prefilling amplifies the memory cost of caching block representations. We address these challenges with\ncross-stage caching in training, and with a two-phase computation strategy together with a memory-efficient prefilling\nscheme in inference.\n4.1 Training\nFor small-scale training, AttnRes adds a tiny computation overhead and no extra memory usage, as the activations\nneed to be saved for backpropagation regardless. Under large-scale distributed training, pipeline parallelism poses the\nprimary infrastructure challenge for AttnRes. Full AttnRes requires all Llayer outputs to be transmitted across stages;\nBlock AttnRes reduces this to Nblock representations, and the optimizations below further minimize the remaining\noverhead.\nPipeline communication.With standard residual connections, pipeline parallelism [18] transfers a fixed-size hidden\nstate between adjacent stages, independent of pipeline depth. Block AttnRes requires all accumulated block representa-\ntions at each stage for inter-block attention, and naïvely transmitting the full history at every transition incurs redundant\ncommunication.\nConsider an interleaved pipeline schedule [33] with Pphysical stages and Vvirtual stages per physical stage. 
For simplicity, assume each physical stage produces on average N_p block representations of dimension d per token.¹ With C = PV total chunks (each physical stage in each virtual stage), the j-th chunk accumulates jN_p blocks. Naïvely transmitting all accumulated blocks at every transition incurs per-token communication cost:\nComm_naïve = Σ_{j=1}^{C-1} jN_p·d = (C(C−1)/2)·N_p·d.    (7)\nCross-stage caching. Since each physical stage processes multiple virtual stages in succession, we can eliminate this redundancy by caching blocks locally: blocks received during earlier virtual stages remain in local memory and need not be re-transmitted. The first virtual stage (v = 1) has no cache and accumulates normally; for v ≥ 2, each transition conveys only the PN_p incremental blocks accumulated since the receiver's corresponding chunk in the previous virtual stage. Total communication reduces to:\nComm_cached = (P(P−1)/2)·N_p·d [first virtual stage] + (V−1)·P²·N_p·d [subsequent virtual stages].    (8)\nCaching reduces peak per-transition cost from O(C) to O(P), a V× improvement that enables full overlap with computation during steady-state 1F1B. The backward pass benefits from the same scheme. Fig. 3 illustrates this optimization with P = 4 and V = 2: for the second virtual stage, caching eliminates 6 redundant block transmissions.\n¹ In practice, block boundaries need not align with physical stage boundaries. For example, in Fig. 3, each block spans two physical stages, so only every other transition involves a newly completed block."
},
{
"page": 7,
"content": "Attention ResidualsTECHNICALREPORT\nAlgorithm 1:Two-phase computation for blockn\nInput:Pseudo queries{w l}l∈Bn, block representations{b 0, . . . ,b n1}\n/* Phase 1: Parallel inter-block attention */\n1Q←[w l]l∈Bn //[S, d]\n2K,V←[b 0;. . .;b n1]//[n, d]\n3{o(1)\nl, m(1)\nl, (1)\nl}l∈Bn←ATTNWITHSTATS(Q,K,V)// Return LSE\n4\n/* Phase 2: Sequential intra-block attention + Onlinesoftmaxmerge */\n5i←0\n6forl∈ B ndo\n7ifi= 0then\n8h l←o(1)\nl/(1)\nl// Inter-block only\n9else\n10o(2)\nl, m(2)\nl, (2)\nl←ATTNWITHSTATS(w l,bi\nn,bi\nn)// Intra-block\n11m l←max(m(1)\nl, m(2)\nl)\n12h l←em(1)\nlmlo(1)\nl+em(2)\nlmlo(2)\nl\nem(1)\nlml(1)\nl+em(2)\nlml(2)\nl// Online softmax merge\n13i←i+ 1\n14bi\nn←bi1\nn+fl(hl)// Update partial sum;b0\nn:=0\n15return{h l}l∈Bn\nMemory overhead.With cross-stage caching, each block is stored exactly once across all Vvirtual stages, which\nbecomes negligible relative to standard per-layer activation cache. Crucially, the per-layer activation footprint remains\nidentical to standard architectures, as activation checkpointing eliminates all inter-block attention intermediates, and the\ncheckpointed inputp lmatches the memory size of the hidden stateh lit replaces.\nIn terms of wall-clock time, Block AttnRes adds negligible training overhead when pipeline parallelism is not enabled;\nunder pipeline parallelism, the measured end-to-end overhead is less than 4%.\n4.2 Inference\nThe two-phase computation strategy described below applies to both Full and Block AttnRes: in either case, layers are\ngrouped into blocks of size S, with Phase 1 batching the inter-block queries and Phase 2 handling sequential intra-block\nlookback. For Full AttnRes, this reduces per-layer I/O from O(Ld) toO((S+N)d) (detailed derivation shown in\nAppendix B); Block AttnRes further reduces the stored representations from LtoN, since each block is compressed\ninto a single vector. 
In what follows, we focus on Block AttnRes and detail the two-phase computation strategy together with a sequence-sharded prefilling scheme for long-context inputs.\nTwo-phase computation strategy. The layer-wise attention computation of Block AttnRes resembles autoregressive decoding, where block representations serve as a shared KV cache reused across layers. A naïve implementation computes the attention residual at every layer, each requiring a full pass over all preceding blocks, resulting in O(L·N) memory accesses. Since the pseudo-query vectors are decoupled from the forward computation (§3), all S = L/N queries within a block can be batched into a single matrix multiplication, amortizing memory access from S reads to 1. Algorithm 1 instantiates a two-phase computation strategy exploiting this property.\n• Phase 1 computes inter-block attention for all S layers simultaneously via a single batched query against the cached block representations, returning both outputs and softmax statistics (max and log-sum-exp). This amortizes the memory access cost, reducing reads from S times to just once per block.\n• Phase 2 computes intra-block attention sequentially for each layer using the evolving partial sum, then merges with Phase 1 outputs through online softmax [31]. Because the online-softmax merge is elementwise, this phase naturally admits kernel fusion with surrounding operations, further reducing I/O overhead.\nWith the two-phase design, Phase 2 preserves an I/O footprint similar to that of standard residual connections, whereas the main additional cost arises from Phase 1 inter-block attention. Because these inter-block reads are amortized across"
},
{
"page": 8,
"content": "Attention ResidualsTECHNICALREPORT\nall layers in a block through batching, the total per-layer memory access cost remains only (N\nS+ 3)d reads and 2d\nwrites (Table 1). This is substantially lower than the residual-stream I/O of prior residual generalizations such as (m)HC\nunder typical settings. In practice, Phase 1 can also partially overlap with the computation of the first layer in the block,\nfurther reducing its wall-clock impact. As a result, the end-to-end inference latency overhead is less than 2% on typical\ninference workloads.\nTable 1: Memory access cost per token per layer incurred by the residual mechanism under each scheme. The internal I/O of the layer\nfunction flis excluded. For AttnRes, both Full and Block variants use the two-phase inference schedule described in Appendix B;\namortized costs are averaged overNlayers within a block. Typical values:L=128,N=8,S=L/N=16,m=4.\nOperation Read WriteTotal I/O\nSymbolic Typical\nStandard Residuals Residual Merge2d d3d3d\nmHC (mstreams)Computeα l,βl,Al md m2+2m\n(8m+2)d+2m2+4m 34dApplyα l md+m d\nApplyβ l d+m md\nApplyA l md+m2md\nResidual Merge2md md\nAttnResFullPhase 1 (amortized)(N1)d d(S+N)d24dPhase 2(S1)d d\nBlockPhase 1 (amortized)N\nSd d \u0000N\nS+5\u0001\nd 5.5dPhase 23d d\nMemory-efficient prefilling.Storing block representations during prefilling requires N·T·d elements, which incurs\n15 GB of memory for a 128K-token sequence with 8 blocks. We mitigate this by sharding these representations along\nthe sequence dimension across Ptensor-parallel devices, allowing Phase 1 to execute independently on local sequence\nshards. The Phase 2 online- softmax merge then integrates into the standard TP all-reduce communication path: the\noutput is reduce-scattered, merged locally, and reconstructed via all-gather, naturally admitting kernel fusion with\noperations like RMSNorm . 
This reduces the per-device memory footprint to N·(T/P)·d, lowering the 128K-context example from 15 GB to roughly 1.9 GB per device. Combined with chunked prefill (e.g., 16K chunk size), the overhead further reduces to under 0.3 GB per device.\n5 Experiments\nArchitecture Details. Our architecture is identical to Kimi Linear [69], a Mixture-of-Experts (MoE) Transformer following the Moonlight [28] / DeepSeek-V3 [9] design, which interleaves Kimi Delta Attention (KDA) and Multi-Head Latent Attention (MLA) layers in a 3:1 ratio, each followed by an MoE feed-forward layer. The only modification is the addition of AttnRes to the residual connections; all other components (model depth, hidden dimensions, expert routing, and MLP structure) remain unchanged. AttnRes introduces only one RMSNorm and one pseudo-query vector w_l ∈ R^d per layer, amounting to a negligible fraction of the total parameter count. Crucially, all pseudo-query vectors must be initialized to zero. This ensures that the initial attention weights α_{i→l} are uniform across source layers, which reduces AttnRes to an equal-weight average at the start of training and prevents training volatility, as we validated empirically.\n5.1 Scaling Laws\nWe sweep five model sizes (Table 2) and train three variants per size: a PreNorm baseline, Full AttnRes, and Block AttnRes with ≈8 blocks. They are trained with an 8192-token context window and a cosine learning rate schedule. Within each scaling-law size group, all variants share identical hyperparameters selected under the baseline to ensure fair comparison; this setup intentionally favors the baseline and thus makes the comparison conservative. Following standard practice, we fit power-law curves of the form L = A×C^α [22, 15], where L is validation loss and C is compute measured in PFLOP/s-days.\nScaling Behavior. Fig. 4 presents the fitted scaling curves. 
The Baseline follows L = 1.891×C^−0.057, while Block AttnRes fits L = 1.870×C^−0.058, and Full AttnRes fits L = 1.865×C^−0.057. All three variants exhibit a similar slope, but AttnRes consistently achieves lower loss across the entire compute range. Based on the fitted curves, at 5.6"
},
{
"page": 9,
"content": "Attention ResidualsTECHNICALREPORT\nTable 2: Baseline vs Block AttnRes ( N= 8 ) vs Full AttnRes vs mHC(-lite) [64]: Model configurations, Hyperparameters, and\nValidation Loss.\n# Act.\nParams†TokensL bH d model dff lr batch size‡ Val. Loss\nBaseline Block AttnRes Full AttnRes mHC(-lite)\n194M 038.7B 12 12 0896 4002.99×103192 1.931 1.9091.8991.906\n241M 045.4B 13 13 0960 4322.80×103256 1.895 1.875 1.8741.869\n296M 062.1B 14 14 1024 4642.50×103320 1.829 1.8091.8041.807\n436M 087.9B 16 16 1168 5282.20×103384 1.766 1.7461.7371.747\n528M 119.0B 17 17 1264 5602.02×103432 1.719 1.6931.6921.694\n†Denotes the number of activated parameters in our MoE models, excluding embeddings.\n‡All models were trained with a context length of 8192.\n⋆Lb=L/2denotes the number of Transformer blocks.\n0.5 1 2 51.71.81.9\n1.25×\nPFLOP/s-daysLossBaseline:1.891×C0.057\nFull AttnRes:1.865×C0.057\nBlock AttnRes:1.870×C0.058\nFigure 4: Scaling law curves for Attention Residuals. Both Full and Block AttnRes consistently outperform the baseline across all\nscales. Block AttnRes closely tracks Full AttnRes, recovering most of the gain at the largest scale.\nPFLOP/s-days, Block AttnRes reaches 1.692 versus the Baselines 1.714, equivalent to a 1.25× compute advantage.\nThe gap between Full and Block AttnRes narrows with scale, shrinking to just 0.001 at the largest size. We also list\nmHC(-lite) [64] in Table 2 for reference. 
Full AttnRes outperforms mHC, while Block AttnRes matches it at lower memory I/O per layer: 5.5d versus 34d for mHC with m = 4 streams (Table 1).\n5.2 Main Results\nTraining recipe. The largest models we study are based on the full Kimi Linear 48B configuration: 27 Transformer blocks (54 layers) with 8 out of 256 routed experts plus 1 shared expert, yielding 48B total and 3B activated parameters. This model applies Block AttnRes with 6 layers per block, producing 9 blocks plus the token embedding for a total of 10 depth-wise sources.\nWe follow the same data and training recipe as the Kimi Linear 1.4T-token runs [69]: all models are pre-trained with a 4096-token context window, the Muon optimizer [28], and a WSD (Warmup-Stable-Decay) learning rate schedule [16], with a global batch size of 8M tokens. Training of the final model proceeds in two stages: (i) a WSD pre-training phase on 1T tokens, followed by (ii) a mid-training phase on ≈400B high-quality tokens, following the annealing recipe of Moonlight [28].\nAfter mid-training, we continue training with a longer sequence length of 32K tokens. Since our architecture uses hybrid KDA/MLA attention [69], where MLA operates without positional encodings (NoPE) [61], context extension requires no modifications such as YaRN [37] or attention temperature rescaling."
},
{
"page": 10,
"content": "Figure 5: Training dynamics of Baseline and Block AttnRes. (a) Validation loss during training. (b) Each transformer block's output magnitude at the end of training. (c) Each transformer block's gradient magnitude (×10⁵).\nTraining dynamics. We compare the training dynamics of our final Baseline and Block AttnRes models over 1T tokens in Fig. 5.\n• Validation loss: AttnRes achieves consistently lower validation loss throughout training, with the gap widening during the decay phase and resulting in a notably lower final loss.\n• Output magnitude: The Baseline suffers from the PreNorm dilution problem [60, 27]: as hidden-state magnitudes grow monotonically with depth, deeper layers are compelled to learn increasingly large outputs from fixed-scale normalized inputs to remain influential. Block AttnRes confines this growth within each block, as selective aggregation at block boundaries resets the accumulation, yielding a bounded periodic pattern.\n• Gradient magnitude: With all residual weights fixed to 1, the Baseline provides no means of regulating gradient flow across depth, leading to disproportionately large gradients in the earliest layers. The learnable softmax weights in Block AttnRes (Fig. 8) introduce competition among sources for probability mass, resulting in a substantially more uniform gradient distribution.\nTable 3: Performance comparison of AttnRes with the baseline, both after the same pre-training recipe. 
Best per-row results are bolded.\nCategory | Benchmark | Baseline | AttnRes\nGeneral | MMLU | 73.5 | 74.6\nGeneral | MMLU-Pro | 52.2 | 52.2\nGeneral | GPQA-Diamond | 36.9 | 44.4\nGeneral | BBH | 76.3 | 78.0\nGeneral | ARC-Challenge | 64.6 | 65.7\nGeneral | HellaSwag | 83.2 | 83.4\nGeneral | TriviaQA | 69.9 | 71.8\nMath & Code | GSM8K | 81.7 | 82.4\nMath & Code | MGSM | 64.9 | 66.1\nMath & Code | Math | 53.5 | 57.1\nMath & Code | CMath | 84.7 | 85.1\nMath & Code | HumanEval | 59.1 | 62.2\nMath & Code | MBPP | 72.0 | 73.9\nChinese | CMMLU | 82.0 | 82.9\nChinese | C-Eval | 79.6 | 82.5\nDownstream performance. Following the evaluation protocol of Kimi Linear [69], we assess both models across three areas (Table 3):"
},
{
"page": 11,
"content": "Table 4: Ablation on key components of AttnRes (16-layer model).\nVariant | Loss\nBaseline (PreNorm) | 1.766\nDenseFormer [36] | 1.767\nmHC [59] | 1.747\nAttnRes Full | 1.737\n  w/ input-dependent query | 1.731\n  w/ input-independent mixing | 1.749\n  w/ sigmoid | 1.741\n  w/o RMSNorm | 1.743\nSWA (W = 1 + 8) | 1.764\nBlock (S = 4) | 1.746\n  w/ multihead (H = 16) | 1.752\n  w/o RMSNorm | 1.750\nFigure 6: Effect of block size on validation loss (16-layer model). Validation loss by block size S = 32, 16, 8, 4, 2: 1.757, 1.753, 1.748, 1.746, 1.746; reference lines mark the Baseline (1.766) and Full AttnRes, i.e. S=1 (1.737).\n• Language understanding and reasoning: MMLU [13], MMLU-Pro Hard [55], GPQA-Diamond [41], BBH [48], ARC-Challenge [6], HellaSwag [65], and TriviaQA [21].\n• Reasoning (Code and Math): GSM8K [7], MGSM [44], Math [25], CMath [14], HumanEval [5], and MBPP [1].\n• Chinese language understanding: CMMLU [26] and C-Eval [19].\nAs shown in Table 3, Block AttnRes matches or outperforms the baseline on all benchmarks. The improvements are particularly pronounced on multi-step reasoning tasks such as GPQA-Diamond (+7.5) and Minerva Math (+3.6), as well as code generation such as HumanEval (+3.1), while knowledge-oriented benchmarks such as MMLU (+1.1) and TriviaQA (+1.9) also show solid gains. This pattern is consistent with the hypothesis that improved depth-wise information flow benefits compositional tasks, where later layers can selectively retrieve and build upon earlier representations.\n5.3 Ablation Study\nWe conduct ablation studies on the 16-head model from Table 2 to validate key design choices in AttnRes (Table 4). All models share identical hyperparameters and compute budget.\nComparison with prior methods. We compare AttnRes against the PreNorm baseline (loss 1.766) and two representative methods that generalize residual connections. 
DenseFormer [36] grants each layer access to all previous outputs but combines them with fixed, input-independent scalar coefficients; it shows no gain over the baseline (1.767), highlighting the importance of input-dependent weighting. mHC [59] introduces input dependence through m parallel streams with learned mixing matrices, improving to 1.747. AttnRes takes this further with explicit content-dependent selection via softmax attention: Full AttnRes achieves 1.737 and Block AttnRes 1.746, outperforming both methods with only a single query vector per layer.\nCross-layer access. We compare three granularities of cross-layer access. Full AttnRes follows directly from the time-depth duality (§ 3), applying attention over all previous layers, and achieves the lowest loss (1.737). A simple way to reduce its memory cost is sliding-window aggregation (SWA), which retains only the most recent W=8 layer outputs plus the token embedding; it improves over the baseline (1.764) but falls well short of both Full and Block AttnRes, suggesting that selectively accessing distant layers matters more than attending to many nearby ones.\nBlock AttnRes offers a better trade-off: with block size S=4 it reaches 1.746 while keeping memory overhead constant per layer. Fig. 6 sweeps S across the full spectrum from S=1 (i.e. Full AttnRes) to increasingly coarse groupings. Loss degrades gracefully as S grows, with S=2, 4, 8 all landing near 1.746 while larger blocks (S=16, 32) move toward the baseline. In practice, we fix the number of blocks to ≈8 for infrastructure efficiency (§ 4). As future hardware alleviates memory capacity constraints, adopting finer-grained block sizes or Full AttnRes represents a natural pathway to further improve performance."
},
{
"page": 12,
"content": "Figure 7: Architecture sweep under fixed compute (≈6.5×10¹⁹ FLOPs, ≈2.3×10⁸ active parameters). Each cell reports validation loss for a (d_model/Lb, H/Lb) configuration, where Lb = L/2 is the number of Transformer blocks; the star marks the optimum. Heatmap cell values, read row-wise:\n(a) Baseline: 2.017 1.909 1.875 1.851 1.858 | 1.990 1.902 1.862 1.852 1.862 | 1.973 1.883 1.859 1.849 1.854 | 1.952 1.868 1.850 1.849 1.857 | 1.926 1.857 1.851 1.858 1.847\n(b) Attention Residuals: 1.954 1.890 1.843 1.828 1.824 | 1.931 1.863 1.830 1.817 1.818 | 1.917 1.841 1.819 1.812 1.817 | 1.893 1.823 1.815 1.813 1.813 | 1.877 1.816 1.820 1.806 1.802\nComponent design. We further ablate individual components of the attention mechanism:\n• Input-dependent query. A natural extension is to make the query input-dependent by projecting it from the current hidden state. This further lowers loss to 1.731, but introduces a d×d projection per layer and requires sequential memory access during decoding, so we default to the learned query.\n• Input-independent mixing. We removed the query and key and replaced them with learnable, input-independent scalars to weigh previous layers, which hurts performance (1.749 vs. 1.737).\n• softmax vs. sigmoid. Replacing softmax with sigmoid degrades performance (1.741). We attribute this to softmax's competitive normalization, which forces sharper selection among sources.\n• Multihead attention. We test per-head depth aggregation (H=16) on Block AttnRes, allowing different channel groups to attend to different source layers. This hurts performance (1.752 vs. 1.746), indicating that the optimal depth-wise mixture is largely uniform across channels: when a layer's output is relevant, it is relevant as a whole.\n• RMSNorm on keys. Removing RMSNorm degrades both Full AttnRes (1.743) and Block AttnRes (1.750). 
For Full AttnRes, it prevents individual layers with naturally larger outputs from dominating the softmax. This becomes even more critical for Block AttnRes, as block-level representations accumulate over more layers and can develop large magnitude differences; RMSNorm prevents these from biasing the attention weights.\n5.4 Analysis\n5.4.1 Optimal Architecture\nTo understand how AttnRes reshapes optimal architectural scaling, we perform a controlled capacity reallocation study under a fixed compute and parameter budget. Our central question is whether AttnRes alters the preferred depth-width-attention trade-off, and in particular, given its potential strength on the depth dimension, whether it favors deeper models compared to conventional Transformer design heuristics. To isolate structural factors directly coupled to depth, we fix the per-expert MLP expansion ratio based on internal empirical observations (d_ff/d_model ≈ 0.45). We further fix total training compute (FLOPs ≈ 6.5×10¹⁹) and active parameters (≈ 2.3×10⁸), ensuring that any performance variation arises purely from architectural reallocation rather than overall capacity differences. Under this constrained budget, we enumerate 25 configurations on a 5×5 grid over d_model/Lb ∈ {15, 30, 45, 60, 75} and H/Lb ∈ {0.3, 0.4, 0.5, 0.6, 0.7}, where Lb = L/2 is the number of Transformer blocks and H the number of attention heads. The results are shown in Fig. 7.\nBoth heatmaps exhibit a shared pattern: loss decreases with growing d_model/Lb and shrinking H/Lb, and both methods reach their optima at H/Lb ≈ 0.3. Despite this shared trend, AttnRes achieves a lower loss than the baseline in each of the 25 configurations, by 0.019 to 0.063. The most apparent difference lies in the location of the optimum: the baseline achieves its lowest loss at d_model/Lb ≈ 60 (1.847), whereas AttnRes shifts it to d_model/Lb ≈ 45 (1.802). Under a fixed"
},
{
"page": 13,
"content": "Figure 8: Depth-wise attention weight distributions for a 16-head model with full (top) and block (bottom) Attention Residuals, averaged over tokens. The model has 16 attention and 16 MLP layers. Each row shows how the l-th attention (left) or MLP (right) layer distributes weight over previous sources. Diagonal dominance indicates locality remains the primary information pathway, while persistent weights on source 0 (embedding) and occasional off-diagonal concentrations reveal learned skip connections. Block attention (N=8) recovers the essential structure with sharper, more decisive weight distributions.\nparameter budget, a lower d_model/Lb corresponds to a deeper, narrower network, suggesting that AttnRes can exploit additional depth more effectively. We note that this preference for depth does not directly translate to a deployment recommendation, as deeper models generally incur higher inference latency due to their sequential computation [39]. Rather, this sweep serves as a diagnostic that reveals where AttnRes benefits most, and this depth preference can be factored into the architecture selection alongside inference cost.\n5.4.2 Analyzing Learned AttnRes Patterns\nWe visualize the learned weights α_{i→l} in Fig. 8 for the 16-head model (from Table 2) with both full and block (N=8) AttnRes. Each heatmap shows how the l-th attention or MLP layer (rows) allocates its attention over previous sources (columns), with pre-attention and pre-MLP layers shown separately. 
We highlight three key observations:\n• Preserved locality. Each layer attends most strongly to its immediate predecessor, yet selective off-diagonal concentrations emerge (e.g., layer 4 attending to early sources, layers 15-16 reaching back under the block setting), indicating learned skip connections beyond the standard residual path.\n• Layer specialization. The embedding h_1 retains non-trivial weight throughout, especially in pre-attention layers. Pre-MLP inputs show sharper diagonal reliance on recent representations, while pre-attention inputs maintain broader receptive fields, consistent with attention routing information across layers and MLPs operating locally.\n• Block AttnRes preserves structure. Diagonal dominance, embedding persistence, and layer specialization all transfer from the full to the block variant, suggesting that block-wise compression acts as implicit regularization while preserving the essential information pathways."
},
{
"page": 14,
"content": "Table 5: Comparison of residual update mechanisms. Weight: whether the mixing coefficients are architecture-fixed, learned-static (fixed after training), or input-dependent (dynamic). Source: which earlier representations layer l can access. Normalization is omitted from most formulas for clarity.\nMethod | Update rule | Weight | Source\nSingle-state recurrence: layer l receives only h_{l-1}\nResidual [12] | h_l = h_{l-1} + f_{l-1}(h_{l-1}) | Fixed | h_{l-1}\nReZero [2] | h_l = h_{l-1} + α_l · f_{l-1}(h_{l-1}) | Static | h_{l-1}\nLayerScale [50] | h_l = h_{l-1} + diag(λ_l) · f_{l-1}(h_{l-1}) | Static | h_{l-1}\nHighway [45] | h_l = (1 - g_l) ⊙ h_{l-1} + g_l ⊙ f_{l-1}(h_{l-1}) | Dynamic | h_{l-1}\nDeepNorm [54] | h_l = Norm(α h_{l-1} + f_{l-1}(h_{l-1})) | Fixed | h_{l-1}\nKEEL [4] | h_l = Norm(α h_{l-1} + f_{l-1}(Norm(h_{l-1}))) | Fixed | h_{l-1}\nMulti-state recurrence: layer l receives m streams\nSiameseNorm [27] | h^1_l = Norm(h^1_{l-1} + y_{l-1}); h^2_l = h^2_{l-1} + y_{l-1} | Fixed | 2 streams\nHC/mHC [72, 59] | H_l = H_{l-1} A_l + f_{l-1}(H_{l-1} α_{l-1}) β⊤_{l-1} | Dynamic | m streams\nDDL [67] | H_l = (I - β_l k_l k_l⊤) H_{l-1} + β_l k_l v_l⊤ | Dynamic | d_v streams\nCross-layer access: layer l can access individual earlier-layer outputs\nDenseNet [17] | h_l = ConvPool([h_1; f_1(h_1); …; f_{l-1}(h_{l-1})]) | Static | [h_1, …, h_{l-1}]\nDenseFormer [36] | h_l = α_{0→l} h_1 + Σ_{i=1}^{l-1} α_{i→l} f_i(h_i) | Static | [h_1, …, h_{l-1}]\nMRLA [10]¹ | h_l = Σ_{i=1}^{l-1} σ(ConvPool(f_{l-1}(h_{l-1})))⊤ σ(ConvPool(f_i(h_i))) Conv(f_i(h_i)) | Dynamic | [h_1, …, h_{l-1}]\nAttnRes (ours), Full² | h_l ∝ Σ_{i=0}^{l-1} ϕ(w_l, k_i) v_i | Dynamic | [h_1, …, h_{l-1}]\nAttnRes (ours), Block³ | h_l ∝ Σ_{i=0}^{n-1} ϕ(w_l, k_i) v_i + ϕ(w_l, k^j_n) v^j_n | Dynamic | [b_0, …, b_{n-1}, b^j_n]\n¹ ConvPool: pooling operation followed by convolution (channel projection).\n² ϕ(q, k) = exp(q⊤ RMSNorm(k)); k_i = v_i; v_0 = h_1, v_{i≥1} = f_i(h_i). softmax jointly normalized over all sources.\n³ Same ϕ and normalization as Full; v_i = b_i, v^j_n = b^j_n.\n6 Discussions\n6.1 Sequence-Depth Duality\nResidual connections propagate information over depth via a fixed recurrence h_l = h_{l-1} + f_{l-1}(h_{l-1}), much as RNNs propagate information over time. Test-Time Training (TTT) [46] formalizes the sequence side of this analogy (cf. 
Fast Weight Programmers [43, 32]), casting each recurrent step as gradient descent on a self-supervised loss:\nW_t = W_{t-1} - η ∇ℓ(W_{t-1}; x_t),  (9)\nwhere a slow network parameterizes ℓ and the state W is updated once per token. When f is linear, this reduces to vanilla linear attention S_t = S_{t-1} + k_t v_t⊤. The standard residual exhibits the same additive form along depth, with h_l serving as the state and each layer f_l acting as one “gradient step.”\nAs noted by [4], this duality extends to richer variants (Table 5). Data-dependent gates on the sequence side [47, 63] correspond to Highway networks [45] on the depth side; the delta rule [42, 62, 69] corresponds to DDL [67]; and MRLA [10] mirrors GLA's [63] gated linear attention. These methods all refine the recurrent update while remaining within the recurrence paradigm. AttnRes goes a step further and replaces depth-wise recurrence with direct cross-layer attention, just as Transformers replaced temporal recurrence with self-attention. Since the number of layers in current architectures remains well within the practical regime of softmax attention, we adopt vanilla depth-wise attention. Incorporating more expressive yet memory-efficient (e.g. linear-complexity) alternatives is a natural direction for future work.\n6.2 Residual Connections as Structured Matrices\nThe residual variants discussed above can all be viewed as weighted aggregations over previous layer outputs. We formalize this with a depth mixing matrix M ∈ R^{L×L}, where M_{i→l} is the weight that layer l assigns to the output of layer i. The variants differ in how these weights arise (fixed, learned, or input-dependent) and whether M is constrained to low rank or allowed to be dense. The semiseparable rank of M [8] offers a unified lens for comparing them.\nConcretely, the input to layer l is h_l = Σ_{i=0}^{l-1} M_{i→l} v_i, where v_0 = h_1 (embedding) and v_i = f_i(h_i) for i ≥ 1. Fig. 9 visualizes M for representative methods; we derive each below."
},
{
"page": 15,
"content": "Figure 9: Depth mixing matrices M for four residual variants (L=4; Block AttnRes uses block size S=2). Highway is shown with scalar gates for clarity. AttnRes panels show unnormalized ϕ scores; background colors group entries that share the same source (Full AttnRes) or the same source block (Block AttnRes). [Panels: Highway, with entries built from gates g and carry products γ×; (m)HC, with entries of the form β⊤ A× α; Full AttnRes, with entries ϕ(w_l, k_i); Block AttnRes, with shared block keys such as ϕ(w_l, k_1 + k_2).]\n• Standard residual [12], h_l = h_{l-1} + f_{l-1}(h_{l-1}). Expanding gives h_l = Σ_{i=0}^{l-1} v_i, so M_{i→l} = 1 for all i < l and M is an all-ones lower-triangular matrix mapping [v_0, v_1, …, v_{L-1}] to [h_1, h_2, …, h_L].\n• Highway [45], h_l = (1 - g_l) h_{l-1} + g_l f_{l-1}(h_{l-1}) (written here with scalar gates for clarity; the element-wise extension is straightforward). Defining the carry product γ×_{i→l} := Π_{j=i+1}^{l} (1 - g_j), the weights are M_{0→l} = γ×_{1→l} for the embedding and M_{i→l} = g_{i+1} γ×_{i+1→l} for i ≥ 1. Since the cumulative products factor through scalar gates, M is 1-semiseparable [8], the same rank as the standard residual but with input-dependent weights. The weights sum to one by construction, making Highway a softmax-free depth-wise instance of stick-breaking attention [49].\n• (m)HC [72, 59] maintains m parallel streams H_l ∈ R^{d×m}, updated via\nH_l = H_{l-1} A_l + f_{l-1}(H_{l-1} α_{l-1}) β⊤_{l-1},\nwhere A_l ∈ R^{m×m} is a learned transition matrix, α_{l-1} ∈ R^m mixes streams into a single input for f_{l-1}, and β_{l-1} ∈ R^m distributes the output back across streams. Unrolling the recurrence gives the effective weight\nM_{i→l} = β_i⊤ A×_{i+1→l} α_l,  (10)\nwhere A×_{i→j} := Π_{k=i+1}^{j} A_k. 
The m×m transitions render M m-semiseparable [8]. mHC [59, 64] further constrains each A_l to be doubly stochastic, stabilizing the cumulative products across depth.\n• Full AttnRes computes M_{i→l} = α_{i→l} via ϕ(w_l, k_i) = exp(w_l⊤ RMSNorm(k_i)) with normalization, where k_i = v_i are input-dependent layer outputs, yielding a dense, rank-L M.\n• Block AttnRes partitions layers into N blocks B_1, …, B_N. For sources i in a completed earlier block B_n, all share the block-level key/value b_n, so M_{i→l} = α_{n→l} for every i ∈ B_n. Within the current block, each layer additionally attends over the evolving partial sum b^{i-1}_n, introducing one extra distinct source per intra-block position. The effective rank of M therefore lies between N and N+S (where S is the block size), interpolating between the standard residual (N=1) and Full AttnRes (N=L).\nPracticality. The structured-matrix perspective serves two purposes. First, it enables analytical insights that are not apparent from the recurrence form alone. The input-dependent M of AttnRes, for instance, reveals depth-wise attention sinks (§ 5.4.2), where certain layers consistently attract high weight regardless of input, mirroring the same phenomenon in sequence-wise attention [57]. Second, it informs new designs by exposing which properties of the kernel ϕ matter. For example, when ϕ decomposes as ϕ(q, k) = φ(q)⊤ φ(k) for some feature map φ [23], depth-wise attention collapses into a recurrence—precisely the structure underlying the MRLA-GLA and DDL-DeltaNet correspondences noted above."
},
{
"page": 16,
"content": "Prior Residuals as Depth-Wise Linear Attention. The structured-matrix perspective further relates to the sequence-depth duality by showing that existing residual variants are, in effect, instances of linear attention over the depth axis. For example, the unrolled (m)HC weight M_{i→l} = β_i⊤ A×_{i+1→l} α_l (Eq. 10) admits a natural attention interpretation in which α_l plays the role of a query issued by layer l, β_i serves as a key summarizing the contribution of layer i, and the cumulative transition A×_{i+1→l} acts as a depth-relative positional operator [69] governing the query-key interaction across intervening layers. Notably, the m parallel streams correspond to state expansion [40, 29] along the depth axis, expanding the recurrent state from d to d×m and thereby increasing the semiseparable rank of M. [58] show that replacing A×_{i+1→l} with the identity matrix still yields competitive performance, highlighting the role of state expansion. Through this lens, methods like (m)HC thus act as depth-wise linear attention with matrix-valued states, while AttnRes acts as depth-wise softmax attention.\n7 Related Work\nNormalization, Scaling, and Depth Stability. The standard residual update h_{l+1} = h_l + f_l(h_l) [12] presents a fundamental tension between normalization placement and gradient propagation. PostNorm [52] maintains bounded magnitudes but distorts gradients, as repeated normalization on the residual path compounds into gradient vanishing at depth [60]. PreNorm [34, 60] restores a clean identity path yet introduces unbounded magnitude growth: since ∥h_l∥ grows as O(L), each layer's relative contribution shrinks, compelling deeper layers to produce ever-larger outputs and limiting effective depth [27]. Subsequent work reconciles both desiderata via scaled residual paths [54], hybrid normalization [73], amplified skip connections [4], or learned element-wise gates [45] (see Table 5). 
AttnRes sidesteps this tension by replacing the additive recurrence with selective aggregation over individual earlier-layer outputs, avoiding both the cumulative magnitude growth of PreNorm and the repeated scale contraction of PostNorm.\nMulti-State Recurrence. All single-state methods above condition layer l only on h_{l-1}, from which individual earlier-layer contributions cannot be selectively retrieved. Several methods address this by widening the recurrence to multiple parallel streams: Hyper-Connections [72] and its stabilized variant mHC [59] maintain m streams with learned mixing matrices; DDL [67] maintains a matrix state updated via a delta-rule erase-and-write mechanism; SiameseNorm [27] maintains two parameter-shared streams—one PreNorm and one PostNorm—to preserve identity gradients and bounded representations. While these methods alleviate information compression, they still condition on the immediate predecessor's state; AttnRes is orthogonal, providing selective access to individual earlier-layer outputs while remaining compatible with any normalization or gating scheme. We discuss the formal connection to Hyper-Connections in § 6.2.\nCross-Layer Connectivity. A separate line of work bypasses the single-state bottleneck by giving each layer direct access to individual earlier-layer outputs. The simplest approach uses static weights: DenseNet [17] concatenates all preceding feature maps; ELMo [38] computes a softmax-weighted sum of layer representations with learned scalar weights; DenseFormer [36] and ANCRe [68] assign learned per-layer scalar coefficients fixed after training. For input-dependent aggregation, MUDDFormer [56] generates position-dependent weights via a small MLP across four decoupled streams; MRLA [10] applies element-wise sigmoid gating over all previous layers, though its separable query-key product is closer to linear attention than softmax-based retrieval. 
Other methods trade full cross-layer access for more targeted designs: Value Residual Learning [71] accesses only a single earlier layer; LAuReL [30] augments the residual with low-rank projections over the previous k activations; Dreamer [24] combines sequence attention with depth attention and sparse experts. AttnRes combines softmax-normalized, input-dependent weights with selective access to all preceding layers through a single d-dimensional pseudo-query per layer, and introduces a block structure reducing cost from O(L²) to O(LN). Cache-based pipeline communication and a two-phase computation strategy (§ 4) make Block AttnRes practical at scale with negligible overhead.\nConclusion\nInspired by the duality between sequence and depth, we introduce AttnRes, which replaces fixed, uniform residual accumulation with learned, input-dependent depth-wise attention. We validate the method through ablation studies and scaling law experiments, showing that its gains persist across scales. Because Full AttnRes must access all preceding layer outputs at every layer, the memory footprint of cross-layer aggregation grows as O(Ld), which is prohibitive for large-scale models on current hardware. We therefore introduce Block AttnRes, which partitions layers into N blocks and attends over block-level representations. Empirically, using about 8 blocks recovers most of the gains of Full AttnRes, while finer-grained blocking remains a promising direction as future hardware constraints relax. Together with cross-stage caching and a two-phase computation strategy, Block AttnRes is practical at scale, incurring only marginal training overhead and minimal inference overhead."
},
{
"page": 17,
"content": "Attention ResidualsTECHNICALREPORT\nReferences\n[1] Jacob Austin et al.Program Synthesis with Large Language Models. 2021. arXiv: 2108.07732 [cs.PL] .URL:\nhttps://arxiv.org/abs/2108.07732.\n[2] Thomas Bachlechner et al.ReZero is All You Need: Fast Convergence at Large Depth. 2020. arXiv: 2003.04887\n[cs.LG].URL:https://arxiv.org/abs/2003.04887.\n[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.Neural Machine Translation by Jointly Learning to\nAlign and Translate. 2016. arXiv:1409.0473 [cs.CL].URL:https://arxiv.org/abs/1409.0473.\n[4] Chen Chen and Lai Wei.Post-LayerNorm Is Back: Stable, ExpressivE, and Deep. 2026. arXiv: 2601.19895\n[cs.LG].URL:https://arxiv.org/abs/2601.19895.\n[5] Mark Chen et al.Evaluating Large Language Models Trained on Code. 2021. arXiv: 2107.03374 [cs.LG] .\nURL:https://arxiv.org/abs/2107.03374.\n[6] Peter Clark et al. “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”. In:\narXiv:1803.05457v1(2018).\n[7] Karl Cobbe et al.Training Verifiers to Solve Math Word Problems. 2021. arXiv: 2110.14168 [cs.LG] .URL:\nhttps://arxiv.org/abs/2110.14168.\n[8] Tri Dao and Albert Gu. “Transformers are SSMs: Generalized Models and Efficient Algorithms Through\nStructured State Space Duality”. In:CoRRabs/2405.21060 (2024).DOI: 10.48550/ARXIV.2405.21060 . arXiv:\n2405.21060.URL:https://doi.org/10.48550/arXiv.2405.21060.\n[9] DeepSeek-AI et al.DeepSeek-V3 Technical Report. 2025. arXiv: 2412.19437 [cs.CL] .URL: https://arxiv.\norg/abs/2412.19437.\n[10] Yanwen Fang et al.Cross-Layer Retrospective Retrieving via Layer Attention. 2023. arXiv: 2302 . 03985\n[cs.CV].URL:https://arxiv.org/abs/2302.03985.\n[11] Andrey Gromov et al.The Unreasonable Ineffectiveness of the Deeper Layers. 2025. arXiv: 2403.17887\n[cs.CL].URL:https://arxiv.org/abs/2403.17887.\n[12] Kaiming He et al.Deep Residual Learning for Image Recognition. 2015. 
arXiv: 1512.03385 [cs.CV] .URL:\nhttps://arxiv.org/abs/1512.03385.\n[13] Dan Hendrycks et al.Measuring Massive Multitask Language Understanding. 2021. arXiv: 2009.03300\n[cs.CY].URL:https://arxiv.org/abs/2009.03300.\n[14] Dan Hendrycks et al.Measuring Mathematical Problem Solving With the MATH Dataset. 2021. arXiv: 2103.\n03874 [cs.LG].URL:https://arxiv.org/abs/2103.03874.\n[15] Jordan Hoffmann et al.Training Compute-Optimal Large Language Models. 2022. arXiv: 2203.15556 [cs.CL] .\nURL:https://arxiv.org/abs/2203.15556.\n[16] Shengding Hu et al.MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training\nStrategies. 2024. arXiv:2404.06395 [cs.CL].URL:https://arxiv.org/abs/2404.06395.\n[17] Gao Huang et al.Densely Connected Convolutional Networks. 2018. arXiv: 1608.06993 [cs.CV] .URL:\nhttps://arxiv.org/abs/1608.06993.\n[18] Yanping Huang et al. “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism”. In:\nAdvances in NeurIPS. 2019.\n[19] Yuzhen Huang et al. “C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models”. In:\nAdvances in NeurIPS36 (2023), pp. 6299163010.\n[20] Robert A. Jacobs et al. “Adaptive Mixtures of Local Experts”. In:Neural Computation3.1 (1991), pp. 7987.\nDOI:10.1162/neco.1991.3.1.79.\n[21] Mandar Joshi et al. “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension”.\nIn:arXiv preprint arXiv:1705.03551(2017).\n[22] Jared Kaplan et al.Scaling Laws for Neural Language Models. 2020. arXiv: 2001.08361 [cs.LG] .URL:\nhttps://arxiv.org/abs/2001.08361.\n[23] Angelos Katharopoulos et al. “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention”.\nIn:Proceedings of ICML. Ed. by Hal Daumé III and Aarti Singh. PMLR, 2020, pp. 51565165.URL: https:\n//proceedings.mlr.press/v119/katharopoulos20a.html.\n[24] Jonas Knupp et al.Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves. 
2026.\narXiv:2601.21582 [cs.AI].URL:https://arxiv.org/abs/2601.21582.\n[25] Aitor Lewkowycz et al.Solving Quantitative Reasoning Problems with Language Models. 2022. arXiv: 2206.\n14858 [cs.CL].URL:https://arxiv.org/abs/2206.14858.\n17"
},
{
"page": 18,
"content": "Attention Residuals TECHNICAL REPORT\n[26] Haonan Li et al. “CMMLU: Measuring massive multitask language understanding in Chinese”. In: Findings of the Association for Computational Linguistics: ACL 2024. Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 11260–11285. DOI: 10.18653/v1/2024.findings-acl.671. URL: https://aclanthology.org/2024.findings-acl.671/.\n[27] Tianyu Li et al. SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm. 2026. arXiv: 2602.08064 [cs.LG]. URL: https://arxiv.org/abs/2602.08064.\n[28] Jingyuan Liu et al. Muon is Scalable for LLM Training. 2025. arXiv: 2502.16982 [cs.LG]. URL: https://arxiv.org/abs/2502.16982.\n[29] Brian Mak and Jeffrey Flanigan. Residual Matrix Transformers: Scaling the Size of the Residual Stream. 2025. arXiv: 2506.22696 [cs.LG]. URL: https://arxiv.org/abs/2506.22696.\n[30] Gaurav Menghani, Ravi Kumar, and Sanjiv Kumar. LAuReL: Learned Augmented Residual Layer. 2025. arXiv: 2411.07501 [cs.LG]. URL: https://arxiv.org/abs/2411.07501.\n[31] Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax. 2018. arXiv: 1805.02867 [cs.PF]. URL: https://arxiv.org/abs/1805.02867.\n[32] Tsendsuren Munkhdalai et al. “Metalearned Neural Memory”. In: ArXiv abs/1907.09720 (2019). URL: https://api.semanticscholar.org/CorpusID:198179407.\n[33] Deepak Narayanan et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. 2021. arXiv: 2104.04473 [cs.CL]. URL: https://arxiv.org/abs/2104.04473.\n[34] Toan Q. Nguyen and Julian Salazar. “Transformers without Tears: Improving the Normalization of Self-Attention”. In: Proceedings of IWSLT. Ed. by Jan Niehues et al. 2019. URL: https://aclanthology.org/2019.iwslt-1.17/.\n[35] OpenAI et al. GPT-4 Technical Report. 2024. arXiv: 2303.08774 [cs.CL]. URL: https://arxiv.org/abs/2303.08774.\n[36] Matteo Pagliardini et al. DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging. 2024. arXiv: 2402.02622 [cs.CL]. URL: https://arxiv.org/abs/2402.02622.\n[37] Bowen Peng et al. “Yarn: Efficient context window extension of large language models”. In: arXiv preprint arXiv:2309.00071 (2023).\n[38] Matthew E. Peters et al. “Deep Contextualized Word Representations”. In: Proceedings of NAACL. 2018, pp. 2227–2237. URL: https://aclanthology.org/N18-1202/.\n[39] Reiner Pope et al. Efficiently Scaling Transformer Inference. 2022. arXiv: 2211.05102 [cs.LG].\n[40] Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion. 2024. arXiv: 2404.07904 [cs.CL].\n[41] David Rein et al. “GPQA: A graduate-level Google-proof Q&A benchmark”. In: First Conference on Language Modeling. 2024.\n[42] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. “Linear Transformers Are Secretly Fast Weight Programmers”. In: Proceedings of ICML. Ed. by Marina Meila and Tong Zhang. PMLR, 2021, pp. 9355–9366. URL: https://proceedings.mlr.press/v139/schlag21a.html.\n[43] Jürgen Schmidhuber. “Learning to control fast-weight memories: An alternative to dynamic recurrent networks”. In: Neural Computation 4.1 (1992), pp. 131–139.\n[44] Freda Shi et al. Language Models are Multilingual Chain-of-Thought Reasoners. 2022. arXiv: 2210.03057 [cs.CL]. URL: https://arxiv.org/abs/2210.03057.\n[45] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway Networks. 2015. arXiv: 1505.00387 [cs.LG]. URL: https://arxiv.org/abs/1505.00387.\n[46] Yu Sun et al. “Learning to (Learn at Test Time): RNNs with Expressive Hidden States”. In: ArXiv abs/2407.04620 (2024). URL: https://api.semanticscholar.org/CorpusID:271039606.\n[47] Yutao Sun et al. Retentive Network: A Successor to Transformer for Large Language Models. 2023. arXiv: 2307.08621 [cs.CL].\n[48] Mirac Suzgun et al. “Challenging big-bench tasks and whether chain-of-thought can solve them”. In: arXiv preprint arXiv:2210.09261 (2022).\n[49] Shawn Tan et al. “Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study”. In: Proceedings of ICLR. 2025.\n[50] Hugo Touvron et al. Going deeper with Image Transformers. 2021. arXiv: 2103.17239 [cs.CV]. URL: https://arxiv.org/abs/2103.17239.\n[51] Hugo Touvron et al. LLaMA: Open and Efficient Foundation Language Models. 2023. arXiv: 2302.13971 [cs.CL].\n18"
},
{
"page": 19,
"content": "[52] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in NeurIPS. Ed. by I. Guyon et al. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.\n[53] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in NeurIPS. Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.\n[54] Hongyu Wang et al. DeepNet: Scaling Transformers to 1,000 Layers. 2022. arXiv: 2203.00555 [cs.CL]. URL: https://arxiv.org/abs/2203.00555.\n[55] Yubo Wang et al. “MMLU-Pro: A more robust and challenging multi-task language understanding benchmark”. In: Advances in NeurIPS 37 (2024), pp. 95266–95290.\n[56] Da Xiao et al. “MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections”. In: Proceedings of ICML. 2025.\n[57] Guangxuan Xiao et al. “Efficient streaming language models with attention sinks”. In: arXiv preprint arXiv:2309.17453 (2023).\n[58] Tian Xie. Your DeepSeek mHC Might Not Need the “m”. Zhihu blog post. 2026. URL: https://zhuanlan.zhihu.com/p/2010852389670908320.\n[59] Zhenda Xie et al. mHC: Manifold-Constrained Hyper-Connections. 2026. arXiv: 2512.24880 [cs.CL]. URL: https://arxiv.org/abs/2512.24880.\n[60] Ruibin Xiong et al. On Layer Normalization in the Transformer Architecture. 2020. arXiv: 2002.04745 [cs.LG]. URL: https://arxiv.org/abs/2002.04745.\n[61] Bowen Yang et al. Rope to Nope and Back Again: A New Hybrid Attention Strategy. 2025. arXiv: 2501.18795 [cs.CL]. URL: https://arxiv.org/abs/2501.18795.\n[62] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. “Gated Delta Networks: Improving Mamba2 with Delta Rule”. In: Proceedings of ICLR. 2025. URL: https://openreview.net/forum?id=r8H7xhYPwz.\n[63] Songlin Yang et al. “Gated Linear Attention Transformers with Hardware-Efficient Training”. In: Proceedings of ICML. PMLR, 2024.\n[64] Yongyi Yang and Jianyang Gao. mHC-lite: You Don’t Need 20 Sinkhorn-Knopp Iterations. 2026. arXiv: 2601.05732 [cs.LG]. URL: https://arxiv.org/abs/2601.05732.\n[65] Rowan Zellers et al. “HellaSwag: Can a Machine Really Finish Your Sentence?” In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.\n[66] Biao Zhang and Rico Sennrich. “Root mean square layer normalization”. In: Advances in NeurIPS 32 (2019).\n[67] Yifan Zhang et al. Deep Delta Learning. 2026. arXiv: 2601.00417 [cs.LG]. URL: https://arxiv.org/abs/2601.00417.\n[68] Yilang Zhang et al. ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling. 2026. arXiv: 2602.09009 [cs.LG]. URL: https://arxiv.org/abs/2602.09009.\n[69] Yu Zhang et al. Kimi Linear: An Expressive, Efficient Attention Architecture. 2025. arXiv: 2510.26692 [cs.CL].\n[70] Shu Zhong et al. Understanding Transformer from the Perspective of Associative Memory. 2025. arXiv: 2505.19488 [cs.LG]. URL: https://arxiv.org/abs/2505.19488.\n[71] Zhanchao Zhou et al. “Value Residual Learning”. In: Proceedings of ACL. Ed. by Wanxiang Che et al. Vienna, Austria, 2025, pp. 28341–28356. URL: https://aclanthology.org/2025.acl-long.1375/.\n[72] Defa Zhu et al. Hyper-Connections. 2025. arXiv: 2409.19606 [cs.LG]. URL: https://arxiv.org/abs/2409.19606.\n[73] Zhijian Zhuo et al. HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization. 2025. arXiv: 2503.04598 [cs.CL]. URL: https://arxiv.org/abs/2503.04598.\n19"
},
{
"page": 20,
"content": "A Contributions\nThe authors are listed in order of the significance of their contributions, with those in project leadership roles appearing last.\nGuangyu Chen\nYu Zhang\nJianlin Su\nWeixin Xu\nSiyuan Pan\nYaoyu Wang\nYucheng Wang\nGuanduo Chen\nBohong Yin\nYutian Chen\nJunjie Yan\nMing Wei\nY. Zhang\nFanqing Meng\nChao Hong\nXiaotong Xie\nShaowei Liu\nEnzhe Lu\nYunpeng Tai\nYanru Chen\nXin Men\nHaiqing Guo\nY. Charles\nHaoyu Lu\nLin Sui\nJinguo Zhu\nZaida Zhou\nWeiran He\nWeixiao Huang\nXinran Xu\nYuzhi Wang\nGuokun Lai\nYulun Du\nYuxin Wu\nZhilin Yang\nXinyu Zhou\nEqual contribution\n20"
},
{
"page": 21,
"content": "B Optimized Inference I/O for Full Attention Residuals\nA naïve implementation of Full AttnRes scans all preceding layer outputs at every layer, so memory traffic scales linearly with depth. As noted in §4.2, however, the pseudo-query w_l is a learned parameter independent of both the input and the hidden state. We can therefore batch inter-block accesses across layers in a two-phase schedule, bringing total I/O well below the naïve bound.\nNote that the block partition introduced below is purely an inference scheduling device. Unlike Block AttnRes, it leaves the model architecture unchanged and does not replace per-layer sources with block summaries; it simply makes the amortization argument concrete.\nSetup\nLet the model have L layers and hidden dimension d, partitioned into N contiguous blocks of size S = L/N. Inference proceeds one block at a time: Phase 1 jointly computes inter-block attention for all S layers in the block against all preceding blocks, and Phase 2 walks through intra-block dependencies sequentially.\nPhase 1: Batched Inter-block Attention\nConsider block n with its S layers. The queries {w_l}_{l ∈ B_n} are all known before execution begins, so the (n-1)S preceding key-value pairs need only be read once from HBM and reused across all S queries. The read cost for block n is therefore\nRead_inter^{(n)} = 2(n-1)Sd,   (11)\nwhere the factor of 2 accounts for both keys and values. Summing over all N blocks and using SN = L:\nRead_inter = Σ_{n=1}^{N} 2(n-1)Sd = 2Sd · N(N-1)/2 = dL(N-1).   (12)\nPhase 1 also writes one d-dimensional output per layer, giving Write_inter^{(n)} = Sd per block and\nWrite_inter = Ld   (13)\nin total.\nPhase 2: Sequential Intra-block Attention\nPhase 1 covers all sources before the current block. Within the block, however, each layer depends on those before it, so these must be handled in order. Layer t (1 ≤ t ≤ S) reads t-1 intra-block key-value pairs at a cost of 2(t-1)d. Summing over one block:\nRead_intra^{(n)} = Σ_{t=1}^{S} 2(t-1)d = S(S-1)d.   (14)\nPhase 2 also writes one output per layer, so Write_intra^{(n)} = Sd.\nTotal Amortized I/O per Layer\nSumming both phases over all N blocks:\nRead_total = dL(N-1) + N·S(S-1)d,   Write_total = 2Ld.   (15)\nDividing by L and using SN = L:\nRead_per-layer = (N-1)d + (S-1)d = (S+N-2)d,   Write_per-layer = 2d,   (16)\nTotal I/O per layer = (S+N)d.   (17)\nBatching inter-block reads thus brings per-layer I/O from O(L) down to O(S+N). The schedule follows the same two-phase split as Block AttnRes: inter-block attention accounts for the bulk of the traffic, while sequential computation stays local within each block.\n21"
}
]
}