
Achieving linear-time operations with a shift in attention mechanisms in AI architectures – Mamba and RWKV (Receptance Weighted Key Value)

Rapid advancements in AI

The field of Large Language Models (LLMs) is currently experiencing rapid development, with a significant focus on processing long sequences more efficiently.


Transformers

with their global attention mechanism

Transformers, renowned for their success in AI tasks, use "Attention" - a global attention mechanism that attends to every element in the data sequence when processing a single element. While comprehensive, this method can become inefficient and computationally demanding, especially with longer sequences.

Mamba

with state space

Mamba is a new architecture for LLMs (and other use cases) which uses Selective State Space Models (SSMs) - in effect, a selective attention strategy. SSMs dynamically filter and process information based on its content, allowing the model to selectively remember or ignore parts of the input, akin to a detective meticulously picking out crucial clues from a plethora of information. This results in significant improvements in processing speed and scaling, particularly with longer sequences.

Historically, state space models go back to the last century. Kalman's 1960 paper introduced the paradigm of describing and controlling a system through its state, initially in the context of sampled-data control; state space methods have since been applied across various fields, including engineering, statistics, computer science, and economics, to solve a wide range of dynamical systems problems.

Mamba operates with linear time complexity, meaning its processing time grows at a linear rate as the sequence lengthens. This makes it particularly attractive for long sequences.
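
To make the idea concrete, here is a minimal, illustrative Python (NumPy) sketch of a selective state-space recurrence. It is not Mamba's actual implementation - the real model uses learned projections, a different discretisation, and a hardware-aware parallel scan - but it shows how content-dependent ("selective") parameters update a hidden state in a single linear pass over the sequence. All names and shapes below are assumptions made for the sketch.

```python
import numpy as np

def selective_ssm_scan(x, W_delta, A, W_B, W_C):
    """Toy selective state-space recurrence (illustrative, not Mamba's real kernel).

    x        : (seq_len, d_in)   input sequence
    W_delta  : (d_in,)           projection giving a content-dependent step size
    A        : (d_state,)        negative decay parameters (diagonal state matrix)
    W_B, W_C : (d_in, d_state)   projections giving content-dependent B and C
    Returns  : (seq_len, d_in)   outputs, computed in a single O(seq_len) pass
    """
    seq_len, d_in = x.shape
    h = np.zeros((d_in, A.shape[0]))            # hidden state: one row per channel
    y = np.zeros_like(x)
    for t in range(seq_len):                    # one linear pass over the sequence
        u = x[t]
        delta = np.log1p(np.exp(u @ W_delta))   # softplus: how strongly to update
        B = u @ W_B                             # content-dependent "write" direction
        C = u @ W_C                             # content-dependent "read" direction
        A_bar = np.exp(delta * A)               # discretised decay: what to remember
        h = A_bar * h + delta * np.outer(u, B)  # selectively forget, then write
        y[t] = h @ C                            # read the state back out
    return y

rng = np.random.default_rng(0)
d_in, d_state, seq_len = 8, 16, 256
y = selective_ssm_scan(
    rng.standard_normal((seq_len, d_in)),
    rng.standard_normal(d_in),
    -np.abs(rng.standard_normal(d_state)),      # keep the decay stable (A < 0)
    0.1 * rng.standard_normal((d_in, d_state)),
    0.1 * rng.standard_normal((d_in, d_state)),
)
print(y.shape)  # (256, 8) -- doubling seq_len roughly doubles the work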

Mamba models for text generation, question answering, and text classification have started appearing on Hugging Face in the weeks around the publication of this post.
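
For instance, a Mamba checkpoint can be loaded through the standard transformers API. This is a sketch under two assumptions: that your installed transformers version includes Mamba support, and that a checkpoint such as state-spaces/mamba-130m-hf is available on the Hub.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes a transformers version with Mamba support and that the
# "state-spaces/mamba-130m-hf" checkpoint is available on the Hugging Face Hub.
model_id = "state-spaces/mamba-130m-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Mamba handles long sequences by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```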

Shift in Attention

Transformer’s Quadratic Attention: computes pairwise interactions between all elements in the input sequence. This results in quadratic computational complexity O(n²), where n is the sequence length. Advantage: it captures complex dependencies. Drawback: it becomes computationally expensive and memory-intensive for long sequences.
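
A bare-bones NumPy sketch of full scaled dot-product attention makes the O(n²) cost visible: the score matrix has one entry for every pair of positions.

```python
import numpy as np

def full_attention(Q, K, V):
    """Standard scaled dot-product attention: O(n^2) time and memory."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n, n): one entry per pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = full_attention(Q, K, V)   # materialises a 1024 x 1024 score matrix
```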

Linear Attention: reduces computational complexity to linear O(n) by approximating the attention mechanism in a way that avoids computing all pairwise interactions. 
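
One common approach (kernelised linear attention) replaces the softmax with a feature map φ, so the products can be reordered as φ(Q)(φ(K)ᵀV) and the n × n matrix is never formed. A minimal sketch, assuming the non-causal case and φ(x) = elu(x) + 1:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelised (non-causal) linear attention: O(n * d^2) instead of O(n^2 * d)."""
    def phi(x):                          # simple positive feature map: elu(x) + 1
        return np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                        # (d, d): summary of all keys and values
    Z = Kf.sum(axis=0)                   # (d,):   summary used for normalisation
    return (Qf @ KV) / ((Qf @ Z)[:, None] + eps)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)   # never builds an n x n attention matrix
```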

RWKV

with the power of combination

RWKV takes inspiration from both Transformers and RNNs, leveraging the strengths of both approaches -

  • from Transformers: attention to capture global dependencies in the input sequence.
  • from RNNs: modelling sequential data by maintaining an internal memory state.
  • It uses a stack of Transformer layers, replacing self-attention with a recurrent attention mechanism. This enables the network to retain information from previous time steps and capture long-range dependencies effectively. RWKV incorporates a linear attention mechanism, allowing it to function as both a Transformer and an RNN (a simplified sketch of this recurrence follows the list). Moreover, for training improvements, it employs -
  • a deep stacking technique - a hierarchical structure in which the model is composed of multiple layers, each building on the previous ones.
  • distributing the training process across multiple processors for parallel training.
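
As referenced above, here is a much-simplified Python sketch of RWKV-style recurrent linear attention. It omits the per-step bonus term and the numerical stabilisation of the real implementation, but it shows the key idea: a running numerator and denominator act as the internal memory state, so each token costs O(1) and the whole sequence costs O(n).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rwkv_time_mixing(r, k, v, w):
    """Simplified RWKV-style recurrence (per-channel, numerically naive).

    r, k, v : (seq_len, d) receptance, key, and value sequences
    w       : (d,)         learned per-channel decay (positive)
    """
    seq_len, d = k.shape
    num = np.zeros(d)                 # weighted sum of past values (memory state)
    den = np.zeros(d)                 # corresponding normaliser    (memory state)
    out = np.zeros_like(v)
    decay = np.exp(-w)                # how quickly old information fades
    for t in range(seq_len):
        weight = np.exp(k[t])         # content-dependent weight for this token
        num = decay * num + weight * v[t]
        den = decay * den + weight
        wkv = num / (den + 1e-8)      # attention-like weighted average of values
        out[t] = sigmoid(r[t]) * wkv  # receptance gates what is read out
    return out

seq_len, d = 128, 16
rng = np.random.default_rng(0)
out = rwkv_time_mixing(
    rng.standard_normal((seq_len, d)),
    rng.standard_normal((seq_len, d)),
    rng.standard_normal((seq_len, d)),
    np.abs(rng.standard_normal(d)),
)
print(out.shape)  # (128, 16)
```
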
Business Implications

The potential of Mamba and RWKV is particularly exciting. Their ability to handle long sequences with greater efficiency opens the door to faster, more efficient, and scalable AI-driven solutions, setting the stage for transformative advancements across numerous fields. The next few months will be critical in determining whether the Mamba and RWKV architectures can deliver on their promises.
