Transformer Networks

An Introduction

by

Department of Aerospace Engineering,

Indian Institute of Space Science and Technology,

Thiruvananthapuram, Kerala - 695547

2025-10-19

Contents

  1. Recurrent Neural Networks

  2. Attention Mechanism

  3. Transformer Architecture

Recurrent Neural Networks

RNNs

  • Sequence-to-single-output processing

  • Sequence-to-sequence mapping

RNNs

  • Implementing word-embedding algorithms
    • To map the vocabulary into a vector space that captures the meaning/semantic relations between words
    • Example: “King” - “Man” + “Woman” = “Queen” (a toy sketch of this arithmetic follows)
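
A minimal toy sketch of this vector arithmetic in Python; the 4-dimensional embeddings below are made-up placeholders for illustration, while real embeddings are learned by algorithms such as word2vec or GloVe:

    import numpy as np

    # Toy 4-dimensional embeddings (hypothetical values, for illustration only);
    # real word embeddings are learned, e.g. by word2vec or GloVe.
    emb = {
        "king":  np.array([0.9, 0.8, 0.1, 0.2]),
        "man":   np.array([0.1, 0.9, 0.0, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9, 0.1]),
        "queen": np.array([0.9, 0.0, 1.0, 0.2]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # "King" - "Man" + "Woman" should land closest to "Queen"
    target = emb["king"] - emb["man"] + emb["woman"]
    print(max(emb, key=lambda w: cosine(emb[w], target)))  # -> queen (with these toy vectors)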

RNNs

  • RNNs take two inputs at each timestep: the current input element and the previous hidden state

  • The output of an RNN can be produced at every timestep (one output per input) or only once at the end of the sequence, as shown in the sketch below
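
A minimal sketch of a vanilla RNN, assuming made-up dimensions and random weights, showing the two inputs per timestep and the two output styles:

    import numpy as np

    # Minimal vanilla RNN sketch (dimensions and random weights are illustrative).
    d_in, d_h = 8, 16
    rng = np.random.default_rng(0)
    W_xh = rng.normal(scale=0.1, size=(d_in, d_h))   # input  -> hidden
    W_hh = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden -> hidden

    def rnn_step(x_t, h_prev):
        # Two inputs per timestep: current token x_t and previous hidden state h_prev
        return np.tanh(x_t @ W_xh + h_prev @ W_hh)

    x_seq = rng.normal(size=(5, d_in))   # a sequence of 5 token embeddings
    h = np.zeros(d_h)
    outputs = []
    for x_t in x_seq:                    # sequential processing, one timestep at a time
        h = rnn_step(x_t, h)
        outputs.append(h)                # per-timestep ("immediate") outputs
    final = outputs[-1]                  # single output at the end of the sequence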

RNNs - limitations

  • Slow to train: processing each input sequentially

  • Long sequences lead to vanishing gradients: the hidden state cannot retain information about distant inputs

RNNs - limitations - LSTMs & GRUs

  • To overcome the vanishing gradient problem

  • LSTM: Long Short-Term Memory units. GRU: Gated Recurrent Units.

  • Memory gates in the LSTM: input, forget, and output (GRUs use update and reset gates); a single-step sketch follows
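
A single LSTM step written out explicitly, as a sketch with placeholder random weights, to show how the input, forget, and output gates act on the cell state:

    import numpy as np

    d_in, d_h = 8, 16
    rng = np.random.default_rng(1)
    W = {g: rng.normal(scale=0.1, size=(d_in + d_h, d_h)) for g in ("i", "f", "o", "c")}
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev):
        z = np.concatenate([x_t, h_prev])
        i = sigmoid(z @ W["i"])          # input gate: how much new information to write
        f = sigmoid(z @ W["f"])          # forget gate: how much old memory to erase
        o = sigmoid(z @ W["o"])          # output gate: how much memory to expose
        c_new = np.tanh(z @ W["c"])      # candidate cell content
        c = f * c_prev + i * c_new       # cell state carries long-range memory
        h = o * np.tanh(c)
        return h, c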

RNNs - limitations

  • LSTMs and GRUs still struggle to interpret very long sequences, such as those encountered in LLMs, when related elements are far apart

  • Example:

    • Sequential networks (RNNs and their variants) can fill in this phrase: “The clouds are in the —-”

    • But they may fail to interpret and fill in this phrase:

      “I grew up in Germany with my parents, I spent many years there and have proper knowledge about their culture. That’s why I speak fluent —–”

  • They cannot grasp the context of the sequence efficiently.
    • “The bank is on the river”
    • “The bank approved the loan”
  • Sequential networks are slow to train
    • They do not fully utilize the parallel processing power of GPUs.

Attention Mechanism

Attention mechanism

  • To inject contextual information into each element of the input sequence

  • Let the input sequence of embedded words be \[ \mathbf{X} = \left[\mathbf{x}_1,\dots,\mathbf{x}_n\right]^T \in \mathbb{R}^{n\times d_{model}} \]

  • Now, each input token \(\mathbf{x}_i\) is projected linearly into three different spaces. \[ \mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \ \ \ \mathbf{K} = \mathbf{X}\mathbf{W}_K, \ \ \ \mathbf{V} = \mathbf{X}\mathbf{W}_V \]

  • Where,

    • \(\mathbf{Q}\) - Query: what this token is looking for
    • \(\mathbf{K}\) - Key: what this token contains that others might look for
    • \(\mathbf{V}\) - Value: the information content to pass along if selected
    • \(d_{model}\) - size of the word-embedding vector

Attention mechanism

  • The attention output is computed as \[ \mathbf{Z} = \text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_{model}}}\right)\mathbf{V} \]

  • Where,

    • \(\mathbf{Q}\mathbf{K}^T\) - Score matrix: similarity scores between all pairs of tokens (how much each query matches each key)
    • \(\frac{1}{\sqrt{d_{model}}}\) - Scaling factor to prevent large dot-product values
    • \(\text{softmax}(.)\) - Converts scores to probabilities that sum to 1
    • \(\text{multiply by }\mathbf{V}\) - Takes a weighted average of all values, where the weights are the attention probabilities (a numerical sketch of this computation follows)
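
A numerical sketch of the attention formula above, using random placeholder inputs and projection matrices:

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(X, W_Q, W_K, W_V):
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # project tokens into query/key/value spaces
        scores = Q @ K.T / np.sqrt(K.shape[-1])        # similarity between every query and every key
        A = softmax(scores)                            # each row sums to 1: attention weights
        return A @ V                                   # weighted average of the values

    n, d_model = 6, 32
    rng = np.random.default_rng(2)
    X = rng.normal(size=(n, d_model))                  # n embedded tokens
    W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
    Z = attention(X, W_Q, W_K, W_V)                    # Z has shape (n, d_model)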

Attention mechanism

  • The attention output \(\mathbf{Z}\in\mathbb{R}^{n\times d_{model}}\) is a new representation of \(\mathbf{X}\), where each token’s new embedding contains information aggregated from all other tokens.

  • The goal of the attention layer is to mix contextual information across all tokens efficiently.

Attention mechanism

  • In the previous animation, the fourth timestep marks the beginning of the decoder stage

Attention mechanism

  • The attention decoder RNN takes in the embedding of the <END> token and an initial decoder hidden state \(\mathbf{h}_{init}\) to produce a new hidden state vector \(\mathbf{h}_4\). The output of this step is discarded.

  • Attention step: \(\mathbf{h}_4 \to\) context vector \(\mathbf{C}_4\).

  • \(\mathbf{h}_4\) and \(\mathbf{C}_4\) are concatenated and passed through a feedforward NN to get the final output of the 4th timestep, and the process is repeated

Attention mechanism

  • Attention score matrix visualization

Transformer Networks

Transformer

  • The overall encoder-decoder architecture is similar to sequence-to-sequence RNN models

Transformer

  • The original architecture stacks 6 layers in both the encoder and the decoder, but the number of layers is a hyperparameter

Transformer - architecture schematic

Transformer - encoder

  • Input embedding
  • Positional encoding
  • Multi-headed attention layers
  • Residual connections
  • Layer normalization and feedforward network

Transformer - encoder - input embedding

  • Tokens are converted to vectors using a word-embedding algorithm and a fixed vocabulary

Transformer - encoder - positional encoding

  • Transformers do not have a recurrence mechanism like RNNs

  • Positional encodings are added to the input embeddings to provide each token’s position information

  • Position vectors are generated from combinations of sines and cosines of different frequencies, so they can be computed for any sequence length (a sketch follows)
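
A sketch of the sinusoidal positional encoding used in the original paper (assuming an even \(d_{model}\)):

    import numpy as np

    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
        i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
        angles = pos / np.power(10000.0, 2 * i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                      # even indices
        pe[:, 1::2] = np.cos(angles)                      # odd indices
        return pe

    # Added elementwise to the input embeddings before the first encoder layer:
    # X = token_embeddings + positional_encoding(n, d_model)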

Transformer - encoder - stack of encoders

Transformer - encoder - Multi-head self-attention mechanism

  • The input tensor is split into \(h\) segments along the embedding dimension, and each segment is fed to its own self-attention head

  • Each attention head uses different \(\mathbf{W}_Q,\mathbf{W}_K,\mathbf{W}_V\) matrices to capture different kinds of context, such as syntax, long-range dependencies, and local neighbors (see the sketch below)
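
A self-contained sketch of splitting \(d_{model}\) across \(h\) heads, each with its own projection matrices (random placeholders), and concatenating the head outputs:

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def head(X, W_Q, W_K, W_V):
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

    def multi_head_attention(X, heads):
        # each head projects to d_model / h dimensions; outputs are concatenated back to d_model
        return np.concatenate([head(X, *w) for w in heads], axis=-1)

    n, d_model, h = 6, 32, 4
    rng = np.random.default_rng(3)
    X = rng.normal(size=(n, d_model))
    heads = [tuple(rng.normal(scale=0.1, size=(d_model, d_model // h)) for _ in range(3))
             for _ in range(h)]
    Z = multi_head_attention(X, heads)   # (n, d_model); the paper also applies an output projection W_O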

Transformer - encoder - Multi-head self-attention mechanism

  • The score matrix shows the relevance between words in the example.

Transformer - encoder - Normalization and residual connection

  • Layer normalization is applied for training stability, and residual connections help prevent vanishing gradients

Transformer - encoder - Feedforward network

  • A position-wise feedforward network provides additional refinement of each token’s representation

  • Here too, a residual connection and layer normalization are applied (see the encoder-layer sketch below)

  • The output of the final encoder layer is then sent to the decoder
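
A sketch of one encoder layer in the post-norm arrangement of the original paper; self_attention below stands in for the multi-head block sketched earlier, and all dimensions are illustrative:

    import numpy as np

    def layer_norm(x, eps=1e-6):
        mu = x.mean(axis=-1, keepdims=True)
        sigma = x.std(axis=-1, keepdims=True)
        return (x - mu) / (sigma + eps)

    def feed_forward(x, W1, b1, W2, b2):
        # position-wise FFN: the same two-layer network applied to every token independently
        return np.maximum(0.0, x @ W1 + b1) @ W2 + b2    # ReLU in the hidden layer

    def encoder_layer(x, self_attention, ffn_params):
        x = layer_norm(x + self_attention(x))            # residual connection, then LayerNorm
        x = layer_norm(x + feed_forward(x, *ffn_params)) # second residual + LayerNorm
        return x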

Transformer - decoder

  • Overview of a single decoder layer

  • Each decoder layer has two multi-headed attention mechanisms, each slightly different

  • The output of the decoder is then passed through a softmax layer to get the probability of the word that comes next

  • Output embedding and positional encoding are similar to those in the encoder

Transformer - decoder - Masked multi-head self-attention mechanism

  • A causal mask prevents positions/tokens from attending to subsequent positions/tokens

  • This mask ensures that the prediction for a particular position depends only on previous positions (a sketch of the mask follows)
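
A sketch of the causal (look-ahead) mask; masked scores are set to \(-\infty\) so that softmax assigns them zero weight:

    import numpy as np

    def causal_mask(n):
        # position i may only attend to positions j <= i
        return np.triu(np.full((n, n), -np.inf), k=1)    # -inf strictly above the diagonal

    def masked_attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(Q.shape[0])
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = e / e.sum(axis=-1, keepdims=True)      # zero weight on future positions
        return weights @ V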

Transformer - decoder - Multi-head cross-attention mechanism

  • This is where the input from encoder comes in.

  • Here, the decoder’s queries attend to the encoder’s keys and values, so the correlation between the tokens of the two languages is determined (see the sketch below)

  • This is followed by the feedforward network, which further refines the translated representation
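
A sketch of cross-attention with random placeholder weights: the queries come from the decoder side, while the keys and values come from the encoder output:

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def cross_attention(dec_x, enc_out, W_Q, W_K, W_V):
        Q = dec_x @ W_Q              # what the target-side tokens are looking for
        K = enc_out @ W_K            # what the source-side tokens contain
        V = enc_out @ W_V            # the source-side information to pass along
        return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

    n_src, n_tgt, d_model = 7, 5, 32
    rng = np.random.default_rng(4)
    enc_out = rng.normal(size=(n_src, d_model))
    dec_x = rng.normal(size=(n_tgt, d_model))
    W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
    Z = cross_attention(dec_x, enc_out, W_Q, W_K, W_V)   # (n_tgt, d_model)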

Transformer - decoder - linear classifier and output probabilities

  • The final decoder output is passed through a linear layer and a softmax activation function

  • It generates the probability of the next word in the sequence

  • The probabilities are looked up against the words in the vocabulary of the target language, and the most probable word is selected (a sketch follows)
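
A sketch of this final projection step; the vocabulary size and weights below are placeholders:

    import numpy as np

    d_model, vocab_size = 32, 1000
    rng = np.random.default_rng(5)
    W_vocab = rng.normal(scale=0.1, size=(d_model, vocab_size))   # linear classifier weights

    def next_word_distribution(dec_output_last):
        logits = dec_output_last @ W_vocab     # one score per word in the target vocabulary
        e = np.exp(logits - logits.max())
        return e / e.sum()                     # softmax: probabilities that sum to 1

    probs = next_word_distribution(rng.normal(size=d_model))
    next_word_id = int(np.argmax(probs))       # index into the target-language vocabulary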

Conclusion

  1. RNNs and their variants fail to interpret long sequences, and they under-utilize the available computing resources.

  2. The attention mechanism is used to interpret the context of short and long sequences. It is the core of Transformer networks.

  3. The Transformer also has an encoder-decoder type architecture.

  4. The Transformer takes the entire sequence as input at once. It does not have a recurrence feature like an RNN.

  5. The multi-headed attention mechanism parallelizes computation and captures more context from the sequence.

  6. Positional encoding and word embedding are used to properly transform words into vectors. Hence the name, Transformer networks.

References