Attention Is All You Need Explained

"Attention Is All You Need" is a 2017 landmark[1] [2] research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al.[3] It is considered a foundational[4] paper in modern artificial intelligence, as the transformer approach has become the main architecture of large language models like those based on GPT. At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but the authors go further in the paper, foreseeing the technique's potential for other tasks like question answering and what is now known as multimodal Generative AI.[5]

The paper's title is a reference to the song "All You Need Is Love" by the Beatles.[6] The name "Transformer" was picked because Uszkoreit liked the sound of that word.[7]

An early design document was titled "Transformers: Iterative Self-Attention and Processing for Various Tasks", and included an illustration of six characters from the Transformers animated show. The team was named Team Transformer.

Some early examples the team tried their Transformer architecture on included English-to-German translation, generating Wikipedia articles on "The Transformer", and parsing. These convinced the team that the Transformer was a general-purpose language model, and not just good for translation.

The paper has been cited more than 100,000 times.[8]

Authors

The authors of the paper are: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. All eight authors were "equal contributors" to the paper; the listed order was randomized. The Wired article highlights the group's diversity:[6]

Six of the eight authors were born outside the United States; the other two are children of two green-card-carrying Germans who were temporarily in California and a first-generation American whose family had fled persecution, respectively.
The paper credits the different contributions of each of the authors as follows:

  1. Jakob Uszkoreit for introducing the idea of replacing RNNs with self-attention.
  2. Ashish Vaswani and Illia Polosukhin for designing and implementing the first transformer model.
  3. Noam Shazeer for introducing the ideas of multi-headed attention, scaled dot-product attention and parameter-free position representation.
  4. Llion Jones for being responsible for the original codebase, testing new variations of the transformer models, and for efficient inference and visualizations for the paper itself.
  5. Niki Parmar for the empirical design, testing and implementation of many variations of transformer models in both the tensor2tensor library and the paper's original codebase.
  6. Lukasz Kaiser and Aidan Gomez for designing and implementing various parts of the tensor2tensor library to suit the needs of the paper's translation model, which greatly helped improve results.

Ashish Vaswani and Noam Shazeer are additionally credited with being involved "in nearly every aspect of the paper".

By 2023, all eight authors had left Google; seven went on to found their own AI start-ups, while Łukasz Kaiser joined OpenAI.

Methods Discussed & Introduced

The paper is best known for introducing the Transformer architecture, which underlies most modern Large Language Models (LLMs). A key reason the architecture is preferred over its predecessors by most modern LLMs is its parallelizability: the operations needed for training can be accelerated on GPUs, allowing both faster training and larger models to be trained.

The following mechanisms were introduced by the paper as part of the development of the transformer architecture.

Scaled Dot-Product Attention & Self-Attention

The use of scaled dot-product attention and the self-attention mechanism, instead of an RNN or LSTM (which rely on recurrence), allows for better performance as described in the following paragraph. The paper defines scaled dot-product attention as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Since the Query (Q), Key (K) and Value (V) matrices all come from the same source (i.e. the input sequence / context window), the model eliminates the need for RNNs entirely, ensuring the architecture is parallelizable. This differs from the original form of the attention mechanism introduced in 2014. Additionally, the paper discusses a scaling factor, applied in the manner shown above, that was found to be most effective when based on the dimension of the key vectors (represented as $d_k$ and initially set to 64 within the paper).

In the specific context of translation which the paper focused on, the encoder-decoder attention layers take their Query matrix from the decoder (the target-language side), while the Key and Value matrices come from the encoder's representation of the source sentence; in the self-attention layers, all three come from the same sequence.
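
To make the computation concrete, the following is a minimal NumPy sketch of scaled dot-product self-attention as given by the formula above; the projection matrices W_q, W_k, W_v and the dimensions used here are illustrative placeholders rather than values from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # (seq_len, d_v)

# Self-attention: Q, K and V are all projections of the same input sequence.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8                     # illustrative sizes
X = rng.normal(size=(seq_len, d_model))              # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)                                  # (5, 8)
```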

Multi-Head Attention

In the self-attention mechanism, queries (Q), keys (K), and values (V) are dynamically generated for each input sequence (limited typically by the size of the context window), allowing the model to focus on different parts of the input sequence at different steps. Multi-head attention enhances this process by introducing multiple parallel attention heads. Each attention head learns different linear projections of the Q, K, and V matrices. This allows the model to capture different aspects of the relationships between words in the sequence simultaneously, rather than focusing on a single aspect.

By doing this, multi-head attention ensures that the input embeddings are updated from a more varied and diverse set of perspectives. After the attention outputs from all heads are calculated, they are concatenated and passed through a final linear transformation to generate the output.
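
The following is a minimal NumPy sketch of multi-head attention in this spirit; the head count, weight shapes and the column-slicing scheme for the heads are illustrative choices, and a full implementation would also include masking and bias terms.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (as in the previous sketch)."""
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project X into per-head Q/K/V, attend in each head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)      # columns for this head
        heads.append(attention(X @ W_q[:, s], X @ W_k[:, s], X @ W_v[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o      # (seq_len, d_model)

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 5, 16, 4               # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (5, 16)
```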

Positional Encoding

Since the Transformer model is not a recurrent model and does not rely on processing the text in sequence in order to perform encoding and decoding, the paper relies on sine and cosine functions to encode the position of each token into its embedding. The method introduced in the paper is given below:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

wherein $pos$, $i$, and $d_{model}$ correspond to the position of the word, the current dimension index, and the dimension of the model, respectively. The sine function is used for even indices of the embedding while the cosine function is used for odd indices. The resultant $PE$ embedding is then added to the word embedding at the corresponding position within the current context window. The paper specifically comments on why this method was chosen, describing:

"We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training."

Training

While the primary focus of the paper at the time was to improve machine translation, the paper also discussed the use of the architecture on English constituency parsing, with both limited and large-sized datasets, achieving a high score without task-specific tuning. This indicated the model's promise for a wide variety of general-purpose seq2seq tasks.

Dataset

The English-to-German translation model was trained on the 2014 WMT English-German dataset, consisting of nearly 4.5 million sentence pairs derived from TED Talks and high-quality news articles. A separate translation model was trained on the much larger 2014 WMT English-French dataset, consisting of 36 million sentence pairs. Both datasets were encoded with byte-pair encoding.

Hardware

The models were trained using 8 NVIDIA P100 GPUs. The base models were trained for 100,000 steps (each step taking about 0.4 seconds) and the big models for 300,000 steps (about 1.0 second per step): the base model trained for a total of about 12 hours, and the big model for a total of about 3.5 days. Both the base and big models outperform the 2017 state of the art on both English-German and English-French translation while achieving a comparatively lower training cost.
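
These totals are roughly consistent with the step counts and the (approximate) per-step times:

$$100{,}000 \times 0.4\,\mathrm{s} = 40{,}000\,\mathrm{s} \approx 11\,\mathrm{hours}, \qquad 300{,}000 \times 1.0\,\mathrm{s} = 300{,}000\,\mathrm{s} \approx 3.5\,\mathrm{days}$$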

Hyperparameters and Regularization

For their Transformer models, the authors increased the learning rate linearly for the first 4,000 (warm-up) steps and then decreased it proportionally to the inverse square root of the current step number. Dropout was applied to the output of each sub-layer before it was added and normalized, as well as to the sums of the embeddings and the positional encodings, with the dropout rate set to 0.1. Label smoothing was applied with a value of 0.1, which "improves accuracy and BLEU score".
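
As a concrete illustration, this warm-up-then-decay schedule can be written as a small Python function; d_model = 512 and warmup_steps = 4000 are the base-model values used in the paper.

```python
def transformer_learning_rate(step, d_model=512, warmup_steps=4000):
    """Linear warm-up for the first `warmup_steps` steps, then decay
    proportional to the inverse square root of the step number."""
    step = max(step, 1)                              # avoid division by zero
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises until step 4000, then falls off as 1/sqrt(step).
for step in (100, 4000, 100000):
    print(step, transformer_learning_rate(step))
```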

Notes and References

  1. Love, Julia (2023-07-10). "AI Researcher Who Helped Write Landmark Paper Is Leaving Google". Retrieved 2024-04-01.
  2. Goldman, Sharon (2024-03-20). "'Attention is All You Need' creators look beyond Transformers for AI at Nvidia GTC: 'The world needs something better'". Retrieved 2024-04-01.
  3. Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2016-05-19). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].
  4. Shinde, Gitanjali; Wasatkar, Namrata; Mahalle, Parikshit (2024-06-06). Data-Centric Artificial Intelligence for Multidisciplinary Applications. p. 75. ISBN 9781040031131.
  5. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need". Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc.
  6. Levy, Steven (2024-03-20). "8 Google Employees Invented Modern AI. Here's the Inside Story". Wired. ISSN 1059-1028.
  7. Marche, Stephen (2024-08-23). "Was Linguistic A.I. Created by Accident?". The New Yorker. ISSN 0028-792X. Retrieved 2024-08-24.
  8. "Meet the $4 Billion AI Superstars That Google Lost". Bloomberg. 2023-07-13.