Transformer Networks

An Introduction

by

Department of Aerospace Engineering,

Indian Institute of Space Science and Technology,

Thiruvananthapuram, Kerala - 695547

2025-10-19

Contents

  1. Recurrent Neural Networks

  2. Attention Mechanism

  3. Transformer Architecture

Recurrent Neural Networks

RNNs

  • Sequence-to-single-output processing

  • Sequence-to-sequence mapping

RNNs

  • Implementing word-embedding algorithms
    • To map the vocabulary into a vector space that captures the meaning/semantic relations between words
    • Example: “King” - “Man” + “Woman” = “Queen” (a toy sketch of this arithmetic follows)
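
A minimal toy sketch of this vector arithmetic in Python; the 4-dimensional embeddings below are made-up placeholders for illustration, while real embeddings are learned by algorithms such as word2vec or GloVe:

    import numpy as np

    # Toy 4-dimensional embeddings (hypothetical values, for illustration only);
    # real word embeddings are learned, e.g. by word2vec or GloVe.
    emb = {
        "king":  np.array([0.9, 0.8, 0.1, 0.2]),
        "man":   np.array([0.1, 0.9, 0.0, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9, 0.1]),
        "queen": np.array([0.9, 0.0, 1.0, 0.2]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # "King" - "Man" + "Woman" should land closest to "Queen"
    target = emb["king"] - emb["man"] + emb["woman"]
    print(max(emb, key=lambda w: cosine(emb[w], target)))  # -> queen (with these toy vectors)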

RNNs

  • RNNs take two inputs at each timestep: the current input element and the previous hidden state

  • The output of an RNN can be produced at every timestep (one output per input) or only once at the end of the sequence, as shown in the sketch below
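
A minimal sketch of a vanilla RNN, assuming made-up dimensions and random weights, showing the two inputs per timestep and the two output styles:

    import numpy as np

    # Minimal vanilla RNN sketch (dimensions and random weights are illustrative).
    d_in, d_h = 8, 16
    rng = np.random.default_rng(0)
    W_xh = rng.normal(scale=0.1, size=(d_in, d_h))   # input  -> hidden
    W_hh = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden -> hidden

    def rnn_step(x_t, h_prev):
        # Two inputs per timestep: current token x_t and previous hidden state h_prev
        return np.tanh(x_t @ W_xh + h_prev @ W_hh)

    x_seq = rng.normal(size=(5, d_in))   # a sequence of 5 token embeddings
    h = np.zeros(d_h)
    outputs = []
    for x_t in x_seq:                    # sequential processing, one timestep at a time
        h = rnn_step(x_t, h)
        outputs.append(h)                # per-timestep ("immediate") outputs
    final = outputs[-1]                  # single output at the end of the sequence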

RNNs - limitations

  • Slow to train: processing each input sequentially

  • Long sequences lead to vanishing gradients: the hidden state cannot retain information about distant inputs

RNNs - limitations - LSTMs & GRUs

  • To overcome the vanishing gradient problem

  • LSTM: Long Short-Term Memory units. GRU: Gated Recurrent Units.

  • Memory gates in the LSTM: input, forget, and output (GRUs use update and reset gates); a single-step sketch follows
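
A single LSTM step written out explicitly, as a sketch with placeholder random weights, to show how the input, forget, and output gates act on the cell state:

    import numpy as np

    d_in, d_h = 8, 16
    rng = np.random.default_rng(1)
    W = {g: rng.normal(scale=0.1, size=(d_in + d_h, d_h)) for g in ("i", "f", "o", "c")}
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev):
        z = np.concatenate([x_t, h_prev])
        i = sigmoid(z @ W["i"])          # input gate: how much new information to write
        f = sigmoid(z @ W["f"])          # forget gate: how much old memory to erase
        o = sigmoid(z @ W["o"])          # output gate: how much memory to expose
        c_new = np.tanh(z @ W["c"])      # candidate cell content
        c = f * c_prev + i * c_new       # cell state carries long-range memory
        h = o * np.tanh(c)
        return h, c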

RNNs - limitations

  • LSTMs and GRUs still struggle to interpret very long sequences, such as those encountered in LLMs, when related elements are far apart

  • Example:

    • Sequential networks (RNNs and their variants) can fill in this phrase: “The clouds are in the —-”

    • But they may fail to interpret and fill in this phrase:

      “I grew up in Germany with my parents, I spent many years there and have proper knowledge about their culture. That’s why I speak fluent —–”

  • They cannot grasp the context of the sequence efficiently.
    • “The bank is on the river”
    • “The bank approved the loan”
  • Sequential networks are slow to train
    • They do not fully utilize the parallel processing power of GPUs.

Attention Mechanism

Attention mechanism

  • To inject contextual information into each element of the input sequence

  • Let the input sequence of embedded words be \[ \mathbf{X} = \left[\mathbf{x}_1,\dots,\mathbf{x}_n\right]^T \in \mathbb{R}^{n\times d_{model}} \]

  • Now, each input token \(\mathbf{x}_i\) is projected linearly into three different spaces. \[ \mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \ \ \ \mathbf{K} = \mathbf{X}\mathbf{W}_K, \ \ \ \mathbf{V} = \mathbf{X}\mathbf{W}_V \]

  • Where,

    • \(\mathbf{Q}\) - Query: what this token is looking for
    • \(\mathbf{K}\) - Key: what this token contains that others might look for
    • \(\mathbf{V}\) - Value: the information content to pass along if selected
    • \(d_{model}\) - size of the word-embedding vector

Attention mechanism

  • The attention output is computed as \[ \mathbf{Z} = \text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_{model}}}\right)\mathbf{V} \]

  • Where,

    • \(\mathbf{Q}\mathbf{K}^T\) - Score matrix: similarity scores between all pairs of tokens (how much each query matches each key)
    • \(\frac{1}{\sqrt{d_{model}}}\) - Scaling factor to prevent large dot-product values
    • \(\text{softmax}(.)\) - Converts scores to probabilities that sum to 1
    • \(\text{multiply by }\mathbf{V}\) - Takes a weighted average of all values, where the weights are the attention probabilities (a numerical sketch of this computation follows)
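
A numerical sketch of the attention formula above, using random placeholder inputs and projection matrices:

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(X, W_Q, W_K, W_V):
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # project tokens into query/key/value spaces
        scores = Q @ K.T / np.sqrt(K.shape[-1])        # similarity between every query and every key
        A = softmax(scores)                            # each row sums to 1: attention weights
        return A @ V                                   # weighted average of the values

    n, d_model = 6, 32
    rng = np.random.default_rng(2)
    X = rng.normal(size=(n, d_model))                  # n embedded tokens
    W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
    Z = attention(X, W_Q, W_K, W_V)                    # Z has shape (n, d_model)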

Attention mechanism

  • The attention output \(\mathbf{Z}\in\mathbb{R}^{n\times d_{model}}\) is a new representation of \(\mathbf{X}\), where each token’s new embedding contains information aggregated from all other tokens.

  • The goal of the attention layer is to mix contextual information across all tokens efficiently.

Attention mechanism

  • In the previous animation, the fourth timestep marks the beginning of the decoder stage

Attention mechanism

  • The attention decoder RNN takes in the embedding of the <END> token and an initial decoder hidden state \(\mathbf{h}_{init}\) to produce a new hidden state vector \(\mathbf{h}_4\). The output of this step is discarded.

  • Attention step: \(\mathbf{h}_4 \to\) context vector \(\mathbf{C}_4\).

  • \(\mathbf{h}_4\) and \(\mathbf{C}_4\) are concatenated and passed through a feedforward NN to get the final output of the 4th timestep, and the process is repeated

Attention mechanism

  • Attention score matrix visualization

Transformer Networks

Transformer

  • The overall encoder-decoder architecture is similar to sequence-to-sequence RNN models

Transformer

  • The original architecture stacks 6 layers in both the encoder and the decoder, but the number of layers is a hyperparameter

Transformer - architecture schematic

Transformer - encoder

  • Input embedding
  • Positional encoding
  • Multi-headed attention layers
  • Residual connections
  • Layer normalization and feedforward network

Transformer - encoder - input embedding

  • Tokens are converted to vectors using a word-embedding algorithm and a fixed vocabulary

Transformer - encoder - positional encoding

  • Transformers do not have a recurrence mechanism like RNNs

  • Positional encodings are added to the input embeddings to provide each token’s position information

  • Position vectors are generated from combinations of sines and cosines of different frequencies, so they can be computed for any sequence length (a sketch follows)
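
A sketch of the sinusoidal positional encoding used in the original paper (assuming an even \(d_{model}\)):

    import numpy as np

    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
        i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
        angles = pos / np.power(10000.0, 2 * i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                      # even indices
        pe[:, 1::2] = np.cos(angles)                      # odd indices
        return pe

    # Added elementwise to the input embeddings before the first encoder layer:
    # X = token_embeddings + positional_encoding(n, d_model)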

Transformer - encoder - stack of encoders

Transformer - encoder - Multi-head self-attention mechanism

  • The input tensor is split into \(h\) segments along the embedding dimension, and each segment is fed to its own self-attention head

  • Each attention head uses different \(\mathbf{W}_Q,\mathbf{W}_K,\mathbf{W}_V\) matrices to capture different kinds of context, such as syntax, long-range dependencies, and local neighbors (see the sketch below)
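
A self-contained sketch of splitting \(d_{model}\) across \(h\) heads, each with its own projection matrices (random placeholders), and concatenating the head outputs:

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def head(X, W_Q, W_K, W_V):
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

    def multi_head_attention(X, heads):
        # each head projects to d_model / h dimensions; outputs are concatenated back to d_model
        return np.concatenate([head(X, *w) for w in heads], axis=-1)

    n, d_model, h = 6, 32, 4
    rng = np.random.default_rng(3)
    X = rng.normal(size=(n, d_model))
    heads = [tuple(rng.normal(scale=0.1, size=(d_model, d_model // h)) for _ in range(3))
             for _ in range(h)]
    Z = multi_head_attention(X, heads)   # (n, d_model); the paper also applies an output projection W_O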

Transformer - encoder - Multi-head self-attention mechanism

  • The score matrix shows the relevance between words in the example.

Transformer - encoder - Normalization and residual connection

  • Layer normalization is applied for training stability, and residual connections help prevent vanishing gradients

Transformer - encoder - Feedforward network

  • A position-wise feedforward network provides additional refinement of each token’s representation

  • Here too, a residual connection and layer normalization are applied (see the encoder-layer sketch below)

  • The output of the final encoder layer is then sent to the decoder
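
A sketch of one encoder layer in the post-norm arrangement of the original paper; self_attention below stands in for the multi-head block sketched earlier, and all dimensions are illustrative:

    import numpy as np

    def layer_norm(x, eps=1e-6):
        mu = x.mean(axis=-1, keepdims=True)
        sigma = x.std(axis=-1, keepdims=True)
        return (x - mu) / (sigma + eps)

    def feed_forward(x, W1, b1, W2, b2):
        # position-wise FFN: the same two-layer network applied to every token independently
        return np.maximum(0.0, x @ W1 + b1) @ W2 + b2    # ReLU in the hidden layer

    def encoder_layer(x, self_attention, ffn_params):
        x = layer_norm(x + self_attention(x))            # residual connection, then LayerNorm
        x = layer_norm(x + feed_forward(x, *ffn_params)) # second residual + LayerNorm
        return x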

Transformer - decoder

  • Overview of a single decoder layer

  • Each decoder layer has two multi-headed attention mechanisms, each slightly different

  • The output of the decoder is then passed through a softmax layer to get the probability of the word that comes next

  • Output embedding and positional encoding are similar to those in the encoder

Transformer - decoder - Masked multi-head self-attention mechanism

  • A causal mask prevents positions/tokens from attending to subsequent positions/tokens

  • This mask ensures that the prediction for a particular position depends only on previous positions (a sketch of the mask follows)
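
A sketch of the causal (look-ahead) mask; masked scores are set to \(-\infty\) so that softmax assigns them zero weight:

    import numpy as np

    def causal_mask(n):
        # position i may only attend to positions j <= i
        return np.triu(np.full((n, n), -np.inf), k=1)    # -inf strictly above the diagonal

    def masked_attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(Q.shape[0])
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = e / e.sum(axis=-1, keepdims=True)      # zero weight on future positions
        return weights @ V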

Transformer - decoder - Multi-head cross-attention mechanism

  • This is where the input from encoder comes in.

  • Here, the decoder’s queries attend to the encoder’s keys and values, so the correlation between the tokens of the two languages is determined (see the sketch below)

  • This is followed by the feedforward network, which further refines the translated representation
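
A sketch of cross-attention with random placeholder weights: the queries come from the decoder side, while the keys and values come from the encoder output:

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def cross_attention(dec_x, enc_out, W_Q, W_K, W_V):
        Q = dec_x @ W_Q              # what the target-side tokens are looking for
        K = enc_out @ W_K            # what the source-side tokens contain
        V = enc_out @ W_V            # the source-side information to pass along
        return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

    n_src, n_tgt, d_model = 7, 5, 32
    rng = np.random.default_rng(4)
    enc_out = rng.normal(size=(n_src, d_model))
    dec_x = rng.normal(size=(n_tgt, d_model))
    W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
    Z = cross_attention(dec_x, enc_out, W_Q, W_K, W_V)   # (n_tgt, d_model)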

Transformer - decoder - linear classifier and output probabilities

  • The final decoder output is passed through a linear layer and a softmax activation function

  • It generates the probability of the next word in the sequence

  • The probabilities are looked up against the words in the vocabulary of the target language, and the most probable word is selected (a sketch follows)
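
A sketch of this final projection step; the vocabulary size and weights below are placeholders:

    import numpy as np

    d_model, vocab_size = 32, 1000
    rng = np.random.default_rng(5)
    W_vocab = rng.normal(scale=0.1, size=(d_model, vocab_size))   # linear classifier weights

    def next_word_distribution(dec_output_last):
        logits = dec_output_last @ W_vocab     # one score per word in the target vocabulary
        e = np.exp(logits - logits.max())
        return e / e.sum()                     # softmax: probabilities that sum to 1

    probs = next_word_distribution(rng.normal(size=d_model))
    next_word_id = int(np.argmax(probs))       # index into the target-language vocabulary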

Conclusion

  1. RNNs and their variants fail to interpret long sequences, and they under-utilize the available computing resources.

  2. The attention mechanism is used to interpret the context of short and long sequences. It is the core of Transformer networks.

  3. The Transformer also has an encoder-decoder type architecture.

  4. The Transformer takes the entire sequence as input at once. It does not have a recurrence feature like an RNN.

  5. The multi-headed attention mechanism parallelizes computation and captures more context from the sequence.

  6. Positional encoding and word embedding are used to properly transform words into vectors. Hence the name, Transformer networks.

References