Rapid advancements in AI
The field of Large Language Models (LLMs) is developing rapidly, with a significant focus on architectures that can process long sequences more efficiently.
Transformers
with their global attention mechanism
Transformers, renowned for their success across AI tasks, rely on a global attention mechanism: when processing a single element, the model attends to every other element in the data sequence. While comprehensive, this approach becomes inefficient and computationally demanding, especially with longer sequences.
Mamba
with state space models
Mamba is a new architecture for LLMs (and other use cases) built on Selective State Space Models (SSMs), which you could also call a selective attention strategy. SSMs dynamically filter and process information based on content, allowing the model to selectively remember or ignore parts of the input, akin to a detective meticulously picking out crucial clues from a plethora of information. The result is significantly greater efficiency in processing speed and scaling, particularly with longer sequences.
Historically, the idea goes back to the last century: Kalman's 1960 work introduced the paradigm of "controlling the state" in the context of sampled-data control, and state space models have since been applied across various fields, including engineering, statistics, computer science, and economics, to solve a wide range of dynamical systems problems.

Mamba operates with linear time complexity, meaning its processing time increases at a linear rate as the sequence lengthens. This makes it particularly significant for long sequences.
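To make the "selectively remember or ignore" idea concrete, here is a minimal NumPy sketch of a selective state-space recurrence. It is not Mamba's optimized parallel-scan kernel, and the weight matrices, shapes, and softplus step size are illustrative assumptions, but it shows where the linear cost comes from: the sequence is processed in a single pass while carrying only a fixed-size state.

```python
import numpy as np

def selective_ssm(x, d_state=16, seed=0):
    """Toy selective scan: x is (seq_len, d_model); returns (seq_len, d_model)."""
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape
    # Input-dependent projections: B, C, and the step size all depend on x[t],
    # which is the "selective" part (weights here are random, for illustration).
    W_B = rng.normal(scale=0.1, size=(d_model, d_state))
    W_C = rng.normal(scale=0.1, size=(d_model, d_state))
    W_dt = rng.normal(scale=0.1, size=(d_model,))
    A = -np.abs(rng.normal(size=(d_model, d_state)))   # per-channel decay rates

    h = np.zeros((d_model, d_state))                   # fixed-size hidden state
    ys = np.empty_like(x)
    for t in range(seq_len):                           # single pass -> O(seq_len)
        dt = np.log1p(np.exp(x[t] @ W_dt))             # input-dependent step (softplus)
        B_t = x[t] @ W_B                               # how strongly to write x[t]
        C_t = x[t] @ W_C                               # how to read the state back
        h = np.exp(dt * A) * h + dt * x[t][:, None] * B_t[None, :]
        ys[t] = h @ C_t                                # per-channel readout
    return ys

out = selective_ssm(np.random.default_rng(1).normal(size=(32, 8)))
print(out.shape)  # (32, 8)
```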
Mamba models for text generation, question answering, and text classification have started appearing on Hugging Face in the weeks around the publication of this post.
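As a rough sketch of what using such a checkpoint looks like, the snippet below loads a Mamba model through the transformers Auto classes. The model ID is just an example, and support may require a recent transformers release or extra packages such as mamba-ssm; check the model card for exact requirements.

```python
# Hedged sketch: assumes a Mamba checkpoint loadable via transformers' Auto
# classes (the model ID below is an example and may need extra dependencies).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "state-spaces/mamba-130m-hf"  # example checkpoint; substitute as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("State space models are", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```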
Shift in Attention
Transformer’s Quadratic Attention: computes pairwise interactions between all elements in the input sequence, resulting in quadratic computational complexity O(n²), where n is the sequence length. Advantage: captures complex dependencies. Disadvantage: becomes computationally expensive and memory-intensive for long sequences.
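For reference, here is a bare-bones (single-head, unmasked) scaled dot-product attention in NumPy; the explicit (n, n) score matrix is what makes the cost and memory quadratic in sequence length. This is the textbook formulation rather than any particular library's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def quadratic_attention(Q, K, V):
    """Q, K, V: (n, d). Materialises an explicit (n, n) score matrix -> O(n^2)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n, n): every token scores every token
    weights = softmax(scores, axis=-1)
    return weights @ V                  # (n, d)

n, d = 512, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(quadratic_attention(Q, K, V).shape)  # (512, 64)
```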
Linear Attention: reduces computational complexity to linear O(n) by approximating the attention mechanism in a way that avoids computing all pairwise interactions.
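One common route to O(n) is kernelized linear attention: replace the softmax with a feature map φ and regroup the matrix products so the (n, n) matrix is never formed. The sketch below uses φ(x) = elu(x) + 1, one popular choice; other linear-attention variants differ in the details of the approximation.

```python
import numpy as np

def feature_map(x):
    """phi(x) = elu(x) + 1, a simple positive feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Q, K, V: (n, d). Cost is O(n * d^2); no (n, n) matrix is ever built."""
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = Kf.T @ V                        # (d, d): keys and values summarised once
    Z = Qf @ Kf.sum(axis=0)              # (n,): per-query normaliser
    return (Qf @ KV) / Z[:, None]        # (n, d)

n, d = 512, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)   # (512, 64)
```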
RWKV
with the power of combination
RWKV takes inspiration from both Transformers and RNNs to leverage the strengths of both approaches: it can be trained in parallel like a Transformer, yet run at inference time like an RNN with a fixed-size state instead of a growing attention cache.
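The payoff of that combination is illustrated below: a linear-attention-style model can be evaluated recurrently, one token at a time, carrying only a fixed-size state. This is a generic sketch of causal linear attention as a recurrence, not RWKV's actual time-mixing equations, but it shows how RNN-style inference keeps memory constant regardless of sequence length.

```python
import numpy as np

def feature_map(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def recurrent_linear_attention(Q, K, V):
    """Causal linear attention as an RNN: constant-size state (S, z) per step."""
    n, d = Q.shape
    S = np.zeros((d, d))                 # running sum of phi(k) v^T
    z = np.zeros(d)                      # running sum of phi(k)
    out = np.empty_like(V)
    for t in range(n):                   # token-by-token, like RNN inference
        kf = feature_map(K[t])
        S += np.outer(kf, V[t])
        z += kf
        qf = feature_map(Q[t])
        out[t] = (qf @ S) / (qf @ z)
    return out

n, d = 256, 32
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(recurrent_linear_attention(Q, K, V).shape)  # (256, 32)
```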
Business Implications
The potential of Mamba and RWKV is particularly exciting. Their ability to handle long sequences with greater efficiency opens the door to faster, more efficient, and scalable AI-driven solutions, setting the stage for transformative advances across numerous fields. The next few months will be critical in determining whether the Mamba and RWKV architectures can deliver on their promises.