Quick Refresher
Rest assured, we’re not revisiting the Transformer model architecture and paper for the 100th time. However, this model is so versatile that it’s easy to forget the breadth of its applications. Here’s a quick refresher:
It is an architecture based on the self-attention mechanism.
Despite drawbacks like difficult training and high memory requirements, it has established itself as the default model for both Natural Language Processing and Computer Vision. Consequently, the Transformer has become an essential prerequisite for every data scientist's daily work. Familiarity with its layers, architectures, inputs, and outputs is very important for being able to work with these models effectively.
Let’s start by revisiting RNNs:
Natural Language Processing problems involve long-term dependencies. Even with hacks like bi-directional, multi-layer LSTM/GRU variants, RNNs suffer from the vanishing gradient problem and have to handle the input sequence one word at a time.
Pros: Considered the core of seq2seq (with attention). Popular and successful for variable-length representations such as sequences (e.g. languages), images, etc. Gated variants such as LSTM and GRU help with long-range error propagation.
Cons: The sequentiality prohibits parallelization within instances. Even with gating, long-range dependencies remain tricky, and it is hard to model hierarchy-like domains such as languages.
A big question with seq2seq-like models: is one hidden state really enough to capture the global information needed for translation? No, right?
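To make that bottleneck concrete, here is a minimal sketch in PyTorch (hypothetical sizes, randomly initialized weights) of a GRU encoder squeezing a whole input sequence into one final hidden state:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only
vocab_size, emb_dim, hidden_dim = 1000, 64, 128
batch_size, seq_len = 4, 20

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))  # input word ids
outputs, h_n = encoder(embedding(tokens))

# outputs: per-step hidden states, shape (batch, seq_len, hidden_dim)
# h_n:     the single final hidden state, shape (1, batch, hidden_dim)
# A vanilla seq2seq decoder sees only h_n -- the entire sentence squeezed
# into one fixed-size vector, which is exactly the bottleneck in question.
print(outputs.shape, h_n.shape)
```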
and CNNs:
Pros: Trivial to parallelize (per layer) and fits the intuition that most dependencies are local.
Cons: The path length between positions is at best logarithmic (with dilated convolutions), and text requires left-padding.
Alternate approach: use a hierarchical convolutional seq2seq architecture (ConvS2S)
(https://arxiv.org/abs/1705.03122)
The idea here was that close input elements interact in the lower layers, while long-term dependencies are captured at the higher layers.
However, the number of operations needed to relate an input position to an output position grows with the distance between those positions (the architecture has to grow in height).
That number grows as O(n) for ConvS2S and O(n log n) for ByteNet, which makes it harder to learn dependencies between distant positions. RNNs, meanwhile, handle sequences word by word, which is an obstacle to parallelization.
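As a rough back-of-the-envelope illustration in plain Python (assuming a kernel size of 3), here is how many layers a convolutional stack needs before two positions n tokens apart can interact, versus the single step self-attention needs:

```python
import math

def plain_conv_layers(distance):
    """Stacked kernel-3 convolutions: reach grows by 1 position per side per layer -> O(n)."""
    return distance

def dilated_conv_layers(distance):
    """Dilations 1, 2, 4, ...: reach after L layers is 2**L - 1 per side -> O(log n)."""
    return math.ceil(math.log2(distance + 1))

for n in (8, 64, 512):
    # Self-attention connects any two positions in a single step (O(1) path length)
    print(f"distance={n:4d}  conv={plain_conv_layers(n):4d}  "
          f"dilated={dilated_conv_layers(n):2d}  attention=1")
```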
A prominent need for an architectural shift
A shift that can:
- Parallelize seq2seq computation.
- Reduce sequential computation.
The Transformer achieves parallelization by replacing recurrence with attention and encoding each symbol's position in the sequence. This, in turn, leads to significantly shorter training times: a constant O(1) number of operations is enough to learn a dependency between two symbols, regardless of how far apart they are in the sequence.
As an alternative to convolutions, the Transformer encodes each position and applies the attention mechanism to relate any two distant words of the inputs and outputs to each other; this computation can be parallelized, accelerating training. The number of sequential operations needed to relate two symbols from the input/output sequences is reduced to a constant O(1). The Transformer achieves this with the multi-head attention mechanism, which models dependencies regardless of their distance in the input or output sentence.
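As a rough sketch in PyTorch (hypothetical dimensions, randomly initialized projections), scaled dot-product attention relates every position to every other position in one matrix multiplication, and multi-head attention simply runs several of these in parallel:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, d_model, n_heads = 2, 10, 64, 8
d_head = d_model // n_heads

x = torch.randn(batch, seq_len, d_model)  # token embeddings (+ positions)

def split_heads(t):
    # (batch, seq, d_model) -> (batch, heads, seq, d_head)
    return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

q = split_heads(nn.Linear(d_model, d_model)(x))
k = split_heads(nn.Linear(d_model, d_model)(x))
v = split_heads(nn.Linear(d_model, d_model)(x))

# Every position attends to every other position in a single step
scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)   # (batch, heads, seq, seq)
weights = F.softmax(scores, dim=-1)
out = (weights @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
print(out.shape)  # (2, 10, 64)
```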
The key idea is to learn a context vector (say Cv) that carries global information about all the given inputs, and to measure how important each input is by comparing Cv (for example, with cosine similarity or Euclidean distance) against the input hidden states coming from the fully connected layer.
We do this for each input Xi and thus obtain an attention weight θi, i.e. θi = cosine_similarity(Cv, Xi). For the input hidden states X1...Xk, we learn a set of weights θ1...θk that measures how well each input answers the query, and this is used to generate the output.
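A minimal sketch of that idea in PyTorch (toy random vectors; in a real model Cv and the hidden states are learned, not fixed):

```python
import torch
import torch.nn.functional as F

k, hidden_dim = 5, 16
Cv = torch.randn(hidden_dim)       # context vector (learned in practice)
X = torch.randn(k, hidden_dim)     # hidden states X1..Xk

# θi = cosine_similarity(Cv, Xi), then normalize into attention weights
theta = F.cosine_similarity(Cv.unsqueeze(0), X, dim=-1)  # shape (k,)
weights = F.softmax(theta, dim=-1)

# The output is a weighted combination of the inputs
output = (weights.unsqueeze(-1) * X).sum(dim=0)          # shape (hidden_dim,)
print(weights, output.shape)
```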
Encoder-Decoder Architecture
The OG architecture
Encoder: This component takes text as input, usually a complete sentence, or an image, and does some self-attention, GPU-killing magic that extracts its features (as a continuous representation embedding), similar to how a CNN extracts features from an image.
Decoder: Takes the output of the encoder and uses it to generate text (or whatever the target modality is). The architecture is similar to the encoder but with two special attributes: 1. Cross-Attention Layer 2. Masked Multi-Head Self-Attention
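To illustrate the masked self-attention part, here is a rough PyTorch sketch (toy sizes) of the causal mask that stops each position from attending to later positions:

```python
import torch
import torch.nn.functional as F

seq_len, d = 5, 8
q = k = v = torch.randn(1, seq_len, d)   # decoder states (toy values)

# Upper-triangular mask: position i may only attend to positions <= i
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = q @ k.transpose(-2, -1) / d ** 0.5
scores = scores.masked_fill(causal_mask, float("-inf"))  # block "future" positions
weights = F.softmax(scores, dim=-1)  # rows sum to 1 and are lower-triangular
print(weights[0])
```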
T5
reframes all NLP tasks into a unified text-to-text format, allowing it to be applied to a wide variety of tasks including translation, summarization, question answering, and more.
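A quick usage sketch, assuming the Hugging Face transformers library and the public t5-small checkpoint: the same model handles different tasks purely through the text prefix in the prompt.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Any task is expressed as text-in, text-out -- the prefix tells T5 what to do
prompt = "translate English to German: The house is wonderful."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```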
BART
a denoising autoencoder for pretraining sequence-to-sequence models.
Encoder-only Architecture
This architecture eliminates the decoder and retains only the encoder, and is highly parallelizable.
Due to its self-attention mechanism, it can process the entire input sequence at once, so the encoder-only Transformer architecture is efficient for tasks that do not require sequence-to-sequence transformations. In NLP, applications include Sentiment Analysis, Text Classification and Named Entity Recognition; in Computer Vision, Image Captioning (with a caveat).
Vision Transformer
It treats an image as a sequence of patches and applies the Transformer model to this sequence.
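A minimal sketch of that patching step in PyTorch (assumed 224x224 images and 16x16 patches); real ViT implementations typically do the same thing with one strided convolution:

```python
import torch
import torch.nn as nn

B, C, H, W, P, d_model = 2, 3, 224, 224, 16, 768
img = torch.randn(B, C, H, W)

# Cut the image into non-overlapping P x P patches
patches = img.unfold(2, P, P).unfold(3, P, P)                  # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)                    # (B, H/P, W/P, C, P, P)
patches = patches.reshape(B, (H // P) * (W // P), C * P * P)   # (B, 196, 768)

# Each flattened patch becomes a "token" embedding for the Transformer encoder
to_embedding = nn.Linear(C * P * P, d_model)
tokens = to_embedding(patches)                                 # (B, 196, d_model)
print(tokens.shape)
```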
BERT
pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context.
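A quick usage sketch, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint, showing the encoder-only, fill-in-the-blank flavour of BERT:

```python
from transformers import pipeline

# Masked-language-model head on top of the BERT encoder
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Transformers are the default model for [MASK] processing."):
    print(prediction["token_str"], round(prediction["score"], 3))
```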
Decoder-only Architecture
This is the architecture that has seen the most traction in recent times: GPT (Generative Pre-trained Transformer) is the most famous decoder-only model!
This architecture is used to generate output data based on a fixed input. The input can be a prompt, such as a piece of text or an image with some parts missing, and the output is generated based on that prompt. The input sequence is fed directly into the decoder, which generates the output sequence by attending to it through self-attention mechanisms.
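A quick usage sketch, assuming the Hugging Face transformers library and the public gpt2 checkpoint: the prompt is fed straight into the decoder, which continues it token by token.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The decoder attends (causally) to the prompt plus everything generated so far
result = generator("The Transformer architecture is",
                   max_new_tokens=30, num_return_sequences=1)
print(result[0]["generated_text"])
```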
MaMMUT
very practical for Image Captioning, Visual Question Answering, and Open-Vocabulary Object Detection.
PaLM
uses the Google-developed Pathways machine learning system to train a model across multiple pods of tensor processing units.
Let’s summarize: