AI Course
Lesson 93: Vision Transformers (ViT)
Vision Transformers, commonly called ViT, are a modern approach to computer vision that applies transformer architectures directly to images. Unlike traditional convolutional neural networks (CNNs), a ViT treats an image as a sequence of patches and processes them with attention mechanisms.
This approach allows the model to understand global relationships across the entire image instead of focusing only on local pixel regions.
Real-World Connection
Vision Transformers are used in high-accuracy image classification systems, medical image analysis, satellite imagery interpretation, and large-scale vision tasks where understanding global context is important.
Companies use ViT-based models when working with large datasets and complex visual patterns that traditional CNNs may struggle with.
Why Vision Transformers Were Introduced
Convolutional neural networks are excellent at capturing local patterns but may miss long-range relationships. Vision Transformers solve this by allowing every image patch to attend to every other patch.
- Better global context understanding
- Scales well with large datasets
- Highly flexible architecture
How Vision Transformers Work
The ViT pipeline converts an image into small fixed-size patches. Each patch is flattened and treated as a token, much like a word in a sentence (a short sketch of the first steps follows the list below).
- Split image into patches
- Embed each patch into a vector
- Add positional embeddings
- Apply transformer encoder layers
- Predict class labels
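The snippet below is a minimal sketch of the first three steps: it extracts 16 x 16 patches from a 224 x 224 image, projects each patch to a 64-dimensional vector, and adds a learned positional embedding. The layer name PatchEmbedding and the specific sizes (16, 64, 196) are illustrative choices for this lesson, not fixed requirements.

import tensorflow as tf
from tensorflow.keras import layers

class PatchEmbedding(layers.Layer):
    """Splits an image into patches, projects them, and adds position embeddings."""

    def __init__(self, patch_size=16, embed_dim=64, num_patches=196):
        super().__init__()
        # A convolution with kernel size == stride == patch size extracts
        # non-overlapping patches and projects each one to embed_dim in one step.
        self.projection = layers.Conv2D(embed_dim, patch_size, strides=patch_size)
        self.flatten = layers.Reshape((-1, embed_dim))
        self.position_embedding = layers.Embedding(num_patches, embed_dim)
        self.num_patches = num_patches

    def call(self, images):
        x = self.flatten(self.projection(images))       # (batch, 196, 64)
        positions = tf.range(self.num_patches)          # patch indices 0..195
        return x + self.position_embedding(positions)   # add learned positions

# Example: embed a random batch of two 224 x 224 RGB images
tokens = PatchEmbedding()(tf.random.normal((2, 224, 224, 3)))
print(tokens.shape)  # (2, 196, 64)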
Key Difference: ViT vs CNN
CNNs rely on convolution kernels to detect local features, while ViTs rely on self-attention to capture global dependencies across the image.
This makes Vision Transformers especially powerful for complex images with long-range visual relationships.
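The toy comparison below makes this concrete. It works on a made-up sequence of 196 patch embeddings (not the full model from the next section) and uses a 1-D convolution as a simple stand-in for the 2-D convolutions in a real CNN: the convolution mixes only a small window of neighboring tokens, while self-attention lets every token attend to all 196 tokens at once.

import tensorflow as tf
from tensorflow.keras import layers

tokens = tf.random.normal((1, 196, 64))  # a sequence of 196 patch embeddings

# Convolution: each output position only sees a window of 3 neighboring tokens
local_out = layers.Conv1D(64, kernel_size=3, padding="same")(tokens)

# Self-attention: each output position is a weighted mix of all 196 tokens
global_out = layers.MultiHeadAttention(num_heads=4, key_dim=64)(tokens, tokens)

print(local_out.shape, global_out.shape)  # (1, 196, 64) (1, 196, 64)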
Vision Transformer Example (Code)
Below is a simplified example showing how a Vision Transformer model can be built with TensorFlow and Keras. To keep it short, it omits the positional-embedding step listed in the pipeline above.
import tensorflow as tf
from tensorflow.keras import layers

# Input: a 224 x 224 RGB image
inputs = layers.Input(shape=(224, 224, 3))

# Patch extraction: a 16 x 16 convolution with stride 16 cuts the image into
# 14 x 14 non-overlapping patches and projects each patch to a 64-dim vector
patches = layers.Conv2D(
    filters=64,
    kernel_size=16,
    strides=16,
    padding="valid",
)(inputs)

# Flatten the 14 x 14 grid into a sequence of 196 patch tokens
x = layers.Reshape((-1, 64))(patches)

# Transformer encoder: 4 blocks of self-attention followed by a feed-forward layer
for _ in range(4):
    x1 = layers.LayerNormalization()(x)
    attention = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x1, x1)
    x2 = layers.Add()([attention, x])  # residual connection around the attention
    x3 = layers.LayerNormalization()(x2)
    x = layers.Dense(64, activation="relu")(x3)

# Classification head: pool over all patch tokens and predict 10 classes
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
What This Code Is Doing
The image is split into non-overlapping patches by a convolution whose kernel size and stride both equal the patch size (16). The resulting patch vectors are treated as tokens and passed through a stack of transformer encoder blocks that use self-attention.
Finally, the model aggregates information across all patches and predicts the output class.
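Once defined, the model can be compiled and exercised like any other Keras model. The optimizer, loss, and the random training batch below are illustrative placeholders for a quick smoke test, not recommended settings.

import numpy as np

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()  # prints layer output shapes: 196 patch tokens of size 64 per image

# One tiny training step on random data, just to check that everything runs
x_dummy = np.random.rand(8, 224, 224, 3).astype("float32")
y_dummy = np.random.randint(0, 10, size=(8,))
model.fit(x_dummy, y_dummy, epochs=1, verbose=0)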
Understanding the Output
The model outputs a probability distribution over image classes. The class with the highest probability is selected as the prediction.
Because self-attention lets every patch influence every other patch, ViT models often do well on datasets where the relevant evidence is spread across the whole image.
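For example, running the model from the code section above on a single image (a random array here, standing in for a real photo) yields a vector of 10 probabilities, and the predicted class is simply the index of the largest one.

import numpy as np

image = np.random.rand(1, 224, 224, 3).astype("float32")  # stand-in for a real image
probs = model.predict(image, verbose=0)                    # shape (1, 10), rows sum to 1
predicted_class = int(np.argmax(probs, axis=-1)[0])
print(predicted_class, float(probs[0, predicted_class]))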
Advantages of Vision Transformers
- Strong global feature modeling
- Highly parallelizable
- Works well with large-scale datasets
Limitations of Vision Transformers
- Require large training datasets (or large-scale pretraining) to perform well
- Computationally expensive, since self-attention cost grows quadratically with the number of patches
- Less data-efficient than CNNs on small datasets because they lack convolutional inductive biases
Practice Questions
Practice 1: Vision Transformers process images as sequences of what?
Practice 2: Which mechanism allows ViT to model global relationships?
Practice 3: Vision Transformers perform best when trained on what?
Quick Quiz
Quiz 1: Vision Transformers are based on which architecture?
Quiz 2: What is a key advantage of ViT over CNNs?
Quiz 3: What step converts image patches into vectors?
Coming up next: Multimodal Models — combining vision, text, and other data types.