AI Course
Lesson 93: Vision Transformers (ViT)
Vision Transformers, commonly called ViT, are a modern approach to computer vision that applies transformer architectures directly to images. Unlike traditional convolutional neural networks (CNNs), a ViT treats an image as a sequence of patches and processes them with attention mechanisms.
This approach allows the model to understand global relationships across the entire image instead of focusing only on local pixel regions.
Real-World Connection
Vision Transformers are used in high-accuracy image classification systems, medical image analysis, satellite imagery interpretation, and large-scale vision tasks where understanding global context is important.
Companies use ViT-based models when working with large datasets and complex visual patterns that traditional CNNs may struggle with.
Why Vision Transformers Were Introduced
Convolutional neural networks are excellent at capturing local patterns but may miss long-range relationships. Vision Transformers solve this by allowing every image patch to attend to every other patch.
- Better global context understanding
- Scales well with large datasets
- Highly flexible architecture
How Vision Transformers Work
The ViT pipeline converts an image into small fixed-size patches. Each patch is flattened and treated as a token, much like a word in a sentence (a short sketch of the first steps follows the list below).
- Split image into patches
- Embed each patch into a vector
- Add positional embeddings
- Apply transformer encoder layers
- Predict class labels
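The snippet below is a minimal sketch of the first three steps: it extracts 16 x 16 patches from a 224 x 224 image, projects each patch to a 64-dimensional vector, and adds a learned positional embedding. The layer name PatchEmbedding and the specific sizes (16, 64, 196) are illustrative choices for this lesson, not fixed requirements.

import tensorflow as tf
from tensorflow.keras import layers

class PatchEmbedding(layers.Layer):
    """Splits an image into patches, projects them, and adds position embeddings."""

    def __init__(self, patch_size=16, embed_dim=64, num_patches=196):
        super().__init__()
        # A convolution with kernel size == stride == patch size extracts
        # non-overlapping patches and projects each one to embed_dim in one step.
        self.projection = layers.Conv2D(embed_dim, patch_size, strides=patch_size)
        self.flatten = layers.Reshape((-1, embed_dim))
        self.position_embedding = layers.Embedding(num_patches, embed_dim)
        self.num_patches = num_patches

    def call(self, images):
        x = self.flatten(self.projection(images))       # (batch, 196, 64)
        positions = tf.range(self.num_patches)          # patch indices 0..195
        return x + self.position_embedding(positions)   # add learned positions

# Example: embed a random batch of two 224 x 224 RGB images
tokens = PatchEmbedding()(tf.random.normal((2, 224, 224, 3)))
print(tokens.shape)  # (2, 196, 64)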
Key Difference: ViT vs CNN
CNNs rely on convolution kernels to detect local features, while ViTs rely on self-attention to capture global dependencies across the image.
This makes Vision Transformers especially powerful for complex images with long-range visual relationships.
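The toy comparison below makes this concrete. It works on a made-up sequence of 196 patch embeddings (not the full model from the next section) and uses a 1-D convolution as a simple stand-in for the 2-D convolutions in a real CNN: the convolution mixes only a small window of neighboring tokens, while self-attention lets every token attend to all 196 tokens at once.

import tensorflow as tf
from tensorflow.keras import layers

tokens = tf.random.normal((1, 196, 64))  # a sequence of 196 patch embeddings

# Convolution: each output position only sees a window of 3 neighboring tokens
local_out = layers.Conv1D(64, kernel_size=3, padding="same")(tokens)

# Self-attention: each output position is a weighted mix of all 196 tokens
global_out = layers.MultiHeadAttention(num_heads=4, key_dim=64)(tokens, tokens)

print(local_out.shape, global_out.shape)  # (1, 196, 64) (1, 196, 64)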
Vision Transformer Example (Code)
Below is a simplified example showing how a Vision Transformer model can be built with TensorFlow and Keras. To keep it short, it omits the positional-embedding step listed in the pipeline above.
import tensorflow as tf
from tensorflow.keras import layers

# Input: a 224 x 224 RGB image
inputs = layers.Input(shape=(224, 224, 3))

# Patch extraction: a 16 x 16 convolution with stride 16 cuts the image into
# 14 x 14 non-overlapping patches and projects each patch to a 64-dim vector
patches = layers.Conv2D(
    filters=64,
    kernel_size=16,
    strides=16,
    padding="valid",
)(inputs)

# Flatten the 14 x 14 grid into a sequence of 196 patch tokens
x = layers.Reshape((-1, 64))(patches)

# Transformer encoder: 4 blocks of self-attention followed by a feed-forward layer
for _ in range(4):
    x1 = layers.LayerNormalization()(x)
    attention = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x1, x1)
    x2 = layers.Add()([attention, x])  # residual connection around the attention
    x3 = layers.LayerNormalization()(x2)
    x = layers.Dense(64, activation="relu")(x3)

# Classification head: pool over all patch tokens and predict 10 classes
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
What This Code Is Doing
The image is split into non-overlapping patches by a convolution whose kernel size and stride both equal the patch size (16). The resulting patch vectors are treated as tokens and passed through a stack of transformer encoder blocks that use self-attention.
Finally, the model aggregates information across all patches and predicts the output class.
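Once defined, the model can be compiled and exercised like any other Keras model. The optimizer, loss, and the random training batch below are illustrative placeholders for a quick smoke test, not recommended settings.

import numpy as np

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()  # prints layer output shapes: 196 patch tokens of size 64 per image

# One tiny training step on random data, just to check that everything runs
x_dummy = np.random.rand(8, 224, 224, 3).astype("float32")
y_dummy = np.random.randint(0, 10, size=(8,))
model.fit(x_dummy, y_dummy, epochs=1, verbose=0)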
Understanding the Output
The model outputs a probability distribution over image classes. The class with the highest probability is selected as the prediction.
Because self-attention lets every patch influence every other patch, ViT models often do well on datasets where the relevant evidence is spread across the whole image.
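For example, running the model from the code section above on a single image (a random array here, standing in for a real photo) yields a vector of 10 probabilities, and the predicted class is simply the index of the largest one.

import numpy as np

image = np.random.rand(1, 224, 224, 3).astype("float32")  # stand-in for a real image
probs = model.predict(image, verbose=0)                    # shape (1, 10), rows sum to 1
predicted_class = int(np.argmax(probs, axis=-1)[0])
print(predicted_class, float(probs[0, predicted_class]))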
Advantages of Vision Transformers
- Strong global feature modeling
- Highly parallelizable
- Works well with large-scale datasets
Limitations of Vision Transformers
- Require large training datasets (or large-scale pretraining) to perform well
- Computationally expensive, since self-attention cost grows quadratically with the number of patches
- Less data-efficient than CNNs on small datasets because they lack convolutional inductive biases
Practice Questions
Practice 1: Vision Transformers process images as sequences of what?
Practice 2: Which mechanism allows ViT to model global relationships?
Practice 3: Vision Transformers perform best when trained on what?
Quick Quiz
Quiz 1: Vision Transformers are based on which architecture?
Quiz 2: What is a key advantage of ViT over CNNs?
Quiz 3: What step converts image patches into vectors?
Coming up next: Multimodal Models — combining vision, text, and other data types.