DL Lesson 54 – Transformers Intro | Dataplexa

Transformers – Introduction

Transformers represent one of the most important breakthroughs in the history of deep learning.

They completely changed how sequence data is processed by removing recurrence and convolution entirely, and relying purely on attention mechanisms.

Today, transformers power modern language models, vision models, speech systems, and multi-modal AI systems.


Why Transformers Were Created

Earlier sequence models such as RNNs, LSTMs, and GRUs process data step by step.

This sequential nature creates two major problems:

First, training is slow because computations cannot be fully parallelized.

Second, long-range dependencies are difficult to capture reliably, even with gated architectures.

Transformers solve both problems at the same time.


The Core Idea Behind Transformers

Instead of processing sequences sequentially, transformers process the entire sequence at once.

They do this by allowing every token in the sequence to attend to every other token simultaneously.

This is made possible through self-attention.


What Is Self-Attention?

Self-attention allows each token in a sequence to compute relationships with all other tokens in the same sequence.

Each token learns:

• What information it should focus on
• How strongly it should attend to other tokens
• Which tokens are irrelevant for the current representation

This creates context-aware representations for every token.


Query, Key, and Value

Self-attention is built from three vectors, each produced from a token's embedding by a learned linear projection:

Query (Q) – what the token is looking for
Key (K) – what the token contains
Value (V) – the information to pass forward

Every token generates its own Q, K, and V vectors.
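The projections above can be sketched in a few lines of NumPy. The sizes and the random weight matrices here are purely illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8                 # illustrative sizes
X = rng.standard_normal((seq_len, d_model))     # one embedding per token

# Learned projection matrices (random here, for illustration only)
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

# Every token (row of X) gets its own Q, K, and V vector
Q = X @ W_q
K = X @ W_k
V = X @ W_v

print(Q.shape, K.shape, V.shape)    # each is (seq_len, d_k) = (4, 8)
```

Note that the projections are applied row-wise, so each token's Q, K, and V depend only on that token's own embedding; interaction between tokens happens later, in the attention step.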


Scaled Dot-Product Attention

The attention score is computed using a dot product between Query and Key vectors.

These scores are scaled, normalized using softmax, and applied to the Value vectors.

# Scaled dot-product attention (NumPy; assumes Q, K, V arrays with key dimension d_k)
import numpy as np

scores = (Q @ K.T) / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V

This produces a weighted combination of values based on relevance.


Why Scaling Is Important

As dimensionality increases, dot products can grow very large.

Large values push softmax into extreme regions, making gradients unstable.

Scaling by the square root of the dimension keeps training numerically stable.
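A quick numeric check makes this concrete. In the NumPy sketch below (sizes chosen only for illustration), the unscaled scores have a standard deviation roughly √d_k times larger than the scaled ones, which is exactly what pushes softmax toward a one-hot distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 512
q = rng.standard_normal(d_k)
keys = rng.standard_normal((8, d_k))

raw = keys @ q                 # unscaled scores: variance grows with d_k
scaled = raw / np.sqrt(d_k)    # scaled scores: variance stays near 1

print("std raw:", raw.std(), "std scaled:", scaled.std())
print("softmax peak, raw:", softmax(raw).max())
print("softmax peak, scaled:", softmax(scaled).max())
```

With large unscaled scores, almost all the softmax mass collapses onto a single key, and the gradients through the other positions vanish; scaling keeps the distribution soft enough to train.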


Parallelism in Transformers

Since all tokens are processed simultaneously, transformers can be trained using massive parallel computation.

This allows:

• Faster training
• Better utilization of GPUs and TPUs
• Efficient handling of long sequences


Encoder–Decoder Transformer Architecture

A standard transformer consists of two main components:

Encoder – builds representations of input tokens
Decoder – generates output tokens using attention

Both are built using stacked layers of self-attention and feed-forward networks.
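To show how these pieces fit together, here is a minimal single-layer encoder sketch in NumPy. It is deliberately simplified: single-head attention, random weights in place of learned parameters, and no dropout or masking. It does include the residual connections and layer normalization that real transformer layers use:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 16, 4    # illustrative sizes

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Random matrices stand in for learned parameters
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

def encoder_layer(x):
    # Self-attention sub-layer
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    attn = softmax((Q @ K.T) / np.sqrt(d_model)) @ V
    x = layer_norm(x + attn)                 # residual connection + norm
    # Position-wise feed-forward sub-layer
    ff = np.maximum(0, x @ W1) @ W2          # ReLU feed-forward
    return layer_norm(x + ff)                # residual connection + norm

X = rng.standard_normal((seq_len, d_model))
out = encoder_layer(X)
print(out.shape)    # same shape as the input: (4, 8)
```

Because each layer maps a (seq_len, d_model) array to another array of the same shape, layers can be stacked; a full encoder simply applies several such layers in sequence.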


Position Information in Transformers

Because transformers do not process tokens sequentially, they have no inherent notion of order.

To solve this, positional information is added explicitly using positional encodings.

These encodings allow the model to understand relative and absolute token positions.
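One classic choice, introduced in the original transformer paper, is the fixed sinusoidal encoding: each position gets a vector of sines and cosines at different frequencies, which is added element-wise to the token embeddings. A NumPy sketch (sizes are illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Classic fixed sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims use sine
    pe[:, 1::2] = np.cos(angles)               # odd dims use cosine
    return pe

pe = sinusoidal_positions(seq_len=6, d_model=8)
# The encodings are simply added to the embeddings:
#   x = token_embeddings + pe
print(pe.shape)    # (6, 8)
```

Because each frequency varies smoothly with position, nearby positions get similar vectors, which is one reason these encodings can convey relative as well as absolute position. Many modern models instead learn positional embeddings directly; the next lesson covers this in detail.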


Why Transformers Are So Powerful

Transformers excel because they:

• Capture global dependencies efficiently
• Scale extremely well with data and compute
• Enable transfer learning and pretraining

This is why most modern AI systems are transformer-based.


Real-World Impact

Transformers power:

• Large language models
• Image recognition systems
• Speech recognition pipelines
• Recommendation engines

They form the backbone of modern AI.


Exercises

Exercise 1:
What key limitation of RNNs do transformers eliminate?

Sequential processing and limited parallelism.

Exercise 2:
Why is self-attention more flexible than recurrence?

Because it allows every token to attend to all other tokens simultaneously.

Quick Check

Q: Do transformers require recurrence to process sequences?

No. Transformers remove recurrence entirely.

Next, we will dive deeper into positional encoding and understand how transformers learn order without sequence processing.