Transformers: The Secret Behind Modern NLP

Ramya Surati
Jun 3, 2024

Although GPT is the tech buzzword we hear all the time, the true game changer is the Transformer architecture, which originated in a 2017 research paper. GPT, the Generative Pre-trained Transformer, is simply one particular implementation of the Transformer architecture, which is why the two are so closely linked.

The Transformer has revolutionized natural language processing (NLP) and set new benchmarks for scalability and model performance. Because it relies primarily on attention mechanisms to process data, it is more effective and efficient across a wide variety of tasks. This blog will dig into the components and mechanisms of the Transformer architecture and explain why it is so revolutionary in the field of NLP.

Recurrent neural networks (RNNs) dominated NLP until the Transformer arrived. Although these models could handle sequential data and capture temporal relationships, their sequential nature made them computationally costly and poor at capturing long-range dependencies.

To work around these challenges, the Transformer architecture introduced a new mechanism called self-attention, which lets the model attend to every position in the input sequence simultaneously.

Transformer Architecture

Components of Transformer Architecture:

1. Input Embeddings:

Words or tokens are converted into dense vectors (embeddings) that capture their semantic meaning. These embeddings are essential for transforming textual data into a format that the model can process.
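
As a rough illustration, an embedding layer is essentially a learned lookup table from token IDs to dense vectors. Here is a minimal PyTorch sketch; the vocabulary size and embedding width are assumed values, not ones tied to any specific model:

import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512              # assumed vocabulary size and model width
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[12, 47, 305, 7]])  # a toy batch of token IDs
vectors = embedding(token_ids)                # shape: (1, 4, 512) dense vectors
print(vectors.shape)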

2. Positional Encoding:

Since Transformers do not inherently understand the order of sequences, positional encodings are added to the input embeddings. These encodings provide information about the position of each token in the sequence, allowing the model to take the order into account.
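
The original paper uses fixed sinusoidal encodings, where even dimensions get a sine and odd dimensions a cosine of position-dependent frequencies. A rough PyTorch sketch of that formula (the model width is assumed, and d_model is taken to be even):

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
    angle_rates = torch.pow(10000.0, -torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * angle_rates)   # even dimensions
    pe[:, 1::2] = torch.cos(positions * angle_rates)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=512)
# The encoding is simply added, element-wise, to the input embeddings.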

3. Self-Attention Mechanism:

The self-attention mechanism enables the model to focus on different parts of the input sequence when encoding a particular token. This mechanism calculates a weighted sum of values, where the weights are determined by the similarity between the query and the keys.
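
Concretely, the paper's scaled dot-product attention computes softmax(QK^T / sqrt(d_k))V. A minimal sketch, with toy tensor sizes assumed:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of each query to every key
    weights = F.softmax(scores, dim=-1)             # attention weights sum to 1 per query
    return weights @ V                              # weighted sum of the values

Q = K = V = torch.randn(6, 64)                # 6 tokens, 64-dimensional vectors (toy sizes)
out = scaled_dot_product_attention(Q, K, V)   # shape: (6, 64)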

4. Multi-Head Attention:

Instead of performing a single attention function, multi-head attention runs several attention mechanisms in parallel. The outputs are concatenated and linearly transformed, allowing the model to capture different aspects of relationships between tokens.
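
To see the shapes involved without writing the projections by hand, PyTorch ships an nn.MultiheadAttention module; a small usage sketch with assumed dimensions:

import torch
import torch.nn as nn

d_model, num_heads = 512, 8                 # 8 parallel attention heads
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(1, 6, d_model)              # (batch, sequence length, model width)
out, attn_weights = mha(x, x, x)            # self-attention: Q, K and V all come from x
# out has shape (1, 6, 512); each head attends to a different learned subspace.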

5. Feed-Forward Neural Networks:

Each position in the sequence is passed through a fully connected feed-forward neural network. This network consists of two linear transformations with a ReLU activation in between, applied independently to each position.
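
A minimal sketch of this position-wise network, using the widths reported in the original paper:

import torch.nn as nn

d_model, d_ff = 512, 2048       # model width and hidden width from the paper
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),   # first linear transformation
    nn.ReLU(),                  # non-linearity
    nn.Linear(d_ff, d_model),   # project back to the model width
)
# The same two-layer network is applied to every position independently.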

6. Residual Connections and Layer Normalization:

Residual connections (skip connections) are used around each sub-layer (attention and feed-forward layers), followed by layer normalization. These connections help in training deeper models by addressing issues like vanishing gradients.
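
One common way to express this pattern in code is a small wrapper around each sub-layer; this is just a sketch, and the class name is mine rather than a standard API:

import torch.nn as nn

class SublayerConnection(nn.Module):
    # Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x)).
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))   # sublayer is the attention or feed-forward block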

7. Encoder and Decoder Stacks:

The Transformer model consists of an encoder and a decoder stack, each composed of multiple identical layers. The encoder processes the input sequence, while the decoder generates the output sequence.

  • Encoder: Each encoder layer has two main sub-layers: multi-head self-attention and feed-forward neural network.
  • Decoder: Each decoder layer has three main sub-layers: masked multi-head self-attention, multi-head attention (attending to the encoder’s output), and feed-forward neural network.
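
For quick experiments, PyTorch bundles the full encoder-decoder stack as nn.Transformer. A brief sketch with assumed shapes (token embedding and positional encoding are omitted for brevity):

import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(1, 10, 512)   # already-embedded source sequence (batch, length, d_model)
tgt = torch.randn(1, 7, 512)    # already-embedded target tokens generated so far
out = model(src, tgt)           # (1, 7, 512): one representation per target position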

Working:

i. Encoding Phase:

The input sequence is first tokenized and embedded. Positional encodings are added to these embeddings, which are then passed through the encoder stack. Each encoder layer applies self-attention and feed-forward transformations to produce a set of encoded representations.

ii. Decoding Phase:

The decoder takes the encoded representations and the previously generated tokens as inputs. It applies masked self-attention (to prevent attending to future tokens) and multi-head attention to generate the next token in the sequence. This process is repeated until the entire output sequence is generated.
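
A hypothetical greedy decoding loop illustrates this repetition; model.decode, bos_id and eos_id are assumed names for illustration, not a real library API:

import torch

def greedy_decode(model, memory, bos_id, eos_id, max_len=50):
    generated = [bos_id]                        # start with a beginning-of-sequence token
    for _ in range(max_len):
        tgt = torch.tensor([generated])         # tokens produced so far
        n = len(generated)
        causal_mask = torch.triu(torch.ones(n, n), diagonal=1).bool()   # hide future positions
        logits = model.decode(tgt, memory, tgt_mask=causal_mask)        # assumed model interface
        next_token = int(logits[0, -1].argmax())   # pick the most likely next token
        generated.append(next_token)
        if next_token == eos_id:                   # stop at end-of-sequence
            break
    return generated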

Imagine we have a short sentence and we want to understand how each word in the sentence relates to every other word.

Example

Let’s take the sentence: “The cat sat on the mat.”

Step-by-Step Simplified Explanation

1. Tokenization:

Break the sentence into individual words (tokens): [“The”, “cat”, “sat”, “on”, “the”, “mat”]

2. Embedding:

Convert each word into a numerical vector (embedding) that captures some meaning of the word. For simplicity, let’s use small numbers:

  • “The” -> [1]
  • “cat” -> [2]
  • “sat” -> [3]
  • “on” -> [4]
  • “the” -> [1]
  • “mat” -> [5]

3. Positional Encoding:

Add information about the position of each word in the sentence. For simplicity, let’s just add the position index:

  • “The” -> [1 + 0] = [1]
  • “cat” -> [2 + 1] = [3]
  • “sat” -> [3 + 2] = [5]
  • “on” -> [4 + 3] = [7]
  • “the” -> [1 + 4] = [5]
  • “mat” -> [5 + 5] = [10]

4. Self-Attention Mechanism:

Calculate how much attention each word should pay to every other word. For simplicity, let’s assume each word looks equally at every other word.

5. Attention Scores:

We’ll compute simple attention scores. For simplicity, let’s assume they are just a function of the embeddings:

Attention(“The”) = [1, 1, 1, 1, 1, 1]

Attention(“cat”) = [3, 3, 3, 3, 3, 3]

Attention(“sat”) = [5, 5, 5, 5, 5, 5]

Attention(“on”) = [7, 7, 7, 7, 7, 7]

Attention(“the”) = [5, 5, 5, 5, 5, 5]

Attention(“mat”) = [10, 10, 10, 10, 10, 10]

6. Weighted Sum:

  • Each word’s new representation is a weighted sum of the attention scores and embeddings of all words. For simplicity, let’s just average the attention scores:
  • New “The” -> (1 + 1 + 1 + 1 + 1 + 1) / 6 = 1
  • New “cat” -> (3 + 3 + 3 + 3 + 3 + 3) / 6 = 3
  • New “sat” -> (5 + 5 + 5 + 5 + 5 + 5) / 6 = 5
  • New “on” -> (7 + 7 + 7 + 7 + 7 + 7) / 6 = 7
  • New “the” -> (5 + 5 + 5 + 5 + 5 + 5) / 6 = 5
  • New “mat” -> (10 + 10 + 10 + 10 + 10 + 10) / 6 = 10

Simplified Version:

Input Sentence: [“The”, “cat”, “sat”, “on”, “the”, “mat”]

1. Tokenization: [“The”, “cat”, “sat”, “on”, “the”, “mat”]

2. Embedding: [1, 2, 3, 4, 1, 5]

3. Positional Encoding:

[1, 3, 5, 7, 5, 10]

4. Self-Attention Scores:

[[1, 1, 1, 1, 1, 1],

[3, 3, 3, 3, 3, 3],

[5, 5, 5, 5, 5, 5],

[7, 7, 7, 7, 7, 7],

[5, 5, 5, 5, 5, 5],

[10, 10, 10, 10, 10, 10]]

5. New Representation:

[1, 3, 5, 7, 5, 10]
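
The same toy computation can be written in a few lines of Python (PyTorch is used here purely for the tensor arithmetic); the numbers match the walkthrough above:

import torch

tokens = ["The", "cat", "sat", "on", "the", "mat"]
embeddings = torch.tensor([1., 2., 3., 4., 1., 5.])   # toy one-number "embeddings"
positions = torch.arange(6).float()                   # 0, 1, 2, 3, 4, 5

encoded = embeddings + positions                      # [1, 3, 5, 7, 5, 10]

# Toy "attention": every word looks equally at every other word, so each row
# of the score matrix just repeats that word's encoded value.
scores = encoded.unsqueeze(1).repeat(1, 6)            # 6 x 6 matrix of attention scores
new_repr = scores.mean(dim=1)                         # average each row -> [1, 3, 5, 7, 5, 10]
print(new_repr)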

Advantages of Transformers over RNNs:

Parallel Processing:

  • Transformers allow for parallel processing of sequences, significantly speeding up training and inference compared to RNNs, which process tokens sequentially.

Handling Long-Range Dependencies:

  • The self-attention mechanism can capture dependencies between tokens regardless of their distance in the sequence, improving the model’s ability to understand context and relationships.

Real-World Applications

Transformers have been the backbone of many state-of-the-art models, such as BERT (Bidirectional Encoder Representations from Transformers) for understanding context and GPT (Generative Pre-trained Transformer) for text generation.

These models have achieved remarkable results in benchmarks and have been widely adopted in applications like:

  • Chatbots and Virtual Assistants: Providing more natural and contextually relevant interactions.
  • Language Translation: Enabling more accurate and fluent translations between languages.
  • Sentiment Analysis: Analyzing customer feedback and social media to gauge public opinion.

Conclusion

The Transformer architecture has undoubtedly revolutionized the field of NLP, offering a robust and scalable solution for a wide range of language tasks. By leveraging self-attention and multi-head attention mechanisms, Transformers can process and generate text with unprecedented accuracy and efficiency. As research and development in this area continue to advance, we can expect even more innovative applications and improvements in how machines understand and interact with human language.

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

Cristina, S. (2023). The Transformer model. MachineLearningMastery.com. Available at: https://machinelearningmastery.com/the-transformer-model/.

Transformer neural networks: A step-by-step breakdown (no date). Built In. Available at: https://builtin.com/artificial-intelligence/transformer-neural-network.
