
Lesson 93: Vision Transformers (ViT)

Vision Transformers, commonly called ViT, are a modern approach to computer vision that applies the transformer architecture directly to images. Unlike traditional convolutional neural networks, ViT treats an image as a sequence of patches and processes them using attention mechanisms.

This approach allows the model to understand global relationships across the entire image instead of focusing only on local pixel regions.
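
For instance, with the common setup of a 224×224 image split into 16×16 patches, the sequence length and per-patch size work out as in the short sketch below (illustrative numbers only, not part of any specific library).

# Illustrative patch arithmetic for a typical ViT input (assumed sizes)
image_size = 224      # height and width of the input image
patch_size = 16       # height and width of each square patch

patches_per_side = image_size // patch_size      # 14
num_patches = patches_per_side ** 2              # 196 patch tokens per image
patch_dim = patch_size * patch_size * 3          # 768 values per flattened RGB patch

print(num_patches, patch_dim)                    # 196 768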

Real-World Connection

Vision Transformers are used in high-accuracy image classification systems, medical image analysis, satellite imagery interpretation, and large-scale vision tasks where understanding global context is important.

Companies use ViT-based models when working with large datasets and complex visual patterns that traditional CNNs may struggle with.

Why Vision Transformers Were Introduced

Convolutional neural networks are excellent at capturing local patterns but may miss long-range relationships. Vision Transformers address this by allowing every image patch to attend to every other patch, as the short sketch after the list below illustrates.

  • Better global context understanding
  • Scales well with large datasets
  • Highly flexible architecture
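
To make "every patch attends to every other patch" concrete, the minimal sketch below passes a dummy batch of patch tokens through a single Keras self-attention layer. The sizes (196 patches, 64-dimensional embeddings, 4 heads) are illustrative assumptions, not fixed requirements.

import tensorflow as tf
from tensorflow.keras import layers

# Dummy batch of 196 patch tokens, each embedded into 64 dimensions (illustrative sizes)
tokens = tf.random.normal((1, 196, 64))

attention_layer = layers.MultiHeadAttention(num_heads=4, key_dim=64)
output, scores = attention_layer(tokens, tokens, return_attention_scores=True)

print(output.shape)   # (1, 196, 64)    - updated representation of each patch token
print(scores.shape)   # (1, 4, 196, 196) - each head relates every patch to every other patch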

How Vision Transformers Work

The ViT pipeline splits an image into fixed-size patches. Each patch is flattened and treated as a token, similar to a word in a sentence. A short sketch after the list walks through the first three steps.

  • Split image into patches
  • Embed each patch into a vector
  • Add positional embeddings
  • Apply transformer encoder layers
  • Predict class labels
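
Below is a minimal sketch of the first three steps on a dummy image, assuming a 224×224 RGB input, 16×16 patches, and a 64-dimensional embedding. It uses tf.image.extract_patches for the split purely for illustration; the full model later in this lesson uses a convolution instead.

import tensorflow as tf
from tensorflow.keras import layers

image = tf.random.normal((1, 224, 224, 3))            # dummy image batch

# 1. Split the image into non-overlapping 16x16 patches
patches = tf.image.extract_patches(
    images=image,
    sizes=[1, 16, 16, 1],
    strides=[1, 16, 16, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)
patches = tf.reshape(patches, (1, -1, 16 * 16 * 3))   # (1, 196, 768)

# 2. Embed each flattened patch into a 64-dimensional vector
embedded = layers.Dense(64)(patches)                  # (1, 196, 64)

# 3. Add learnable positional embeddings, one per patch position
positions = tf.range(196)
position_embeddings = layers.Embedding(input_dim=196, output_dim=64)(positions)
tokens = embedded + position_embeddings               # broadcasts over the batch

print(tokens.shape)                                   # (1, 196, 64)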

Key Difference: ViT vs CNN

CNNs rely on convolution kernels to detect local features, while ViTs rely on self-attention to capture global dependencies across the image.

This makes Vision Transformers especially powerful for complex images with long-range visual relationships.

Vision Transformer Example (Code)

Below is a simplified example showing how a Vision Transformer model can be created using TensorFlow and Keras.


import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(224, 224, 3))

# Patch extraction: a 16x16 convolution with stride 16 splits the image into
# non-overlapping patches and projects each one to a 64-dimensional embedding
patches = layers.Conv2D(
    filters=64,
    kernel_size=16,
    strides=16,
    padding="valid"
)(inputs)

# Flatten the 14x14 patch grid into a sequence of 196 tokens of size 64
# (positional embeddings are omitted to keep this example short)
x = layers.Reshape((-1, 64))(patches)

# Transformer encoder: 4 blocks of self-attention and feed-forward layers,
# each with layer normalization and a residual connection
for _ in range(4):
    x1 = layers.LayerNormalization()(x)
    attention = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x1, x1)
    x2 = layers.Add()([attention, x])              # residual around attention
    x3 = layers.LayerNormalization()(x2)
    mlp = layers.Dense(64, activation="relu")(x3)
    x = layers.Add()([mlp, x2])                    # residual around the feed-forward layer

# Classification head: average over all patch tokens, then predict 10 classes
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
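
As a quick sanity check, the model defined above can be run on a random batch of images; the output shape follows from the 224×224 input and the 10-class softmax head.

import numpy as np

dummy_images = np.random.rand(2, 224, 224, 3).astype("float32")
predictions = model.predict(dummy_images)

print(predictions.shape)   # (2, 10) - one probability distribution per image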

What This Code Is Doing

The image is split into patches using a convolution layer. These patches are treated as tokens and passed through transformer encoder layers using self-attention.

Finally, the model aggregates information across all patches and predicts the output class.

Understanding the Output

The model outputs a probability distribution over image classes. The class with the highest probability is selected as the prediction.

Because attention considers the entire image, predictions often improve on complex datasets.
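
For example, taking the index of the largest probability gives the predicted class. This minimal sketch assumes the predictions array produced by the sanity check above.

import numpy as np

# predictions has shape (batch_size, 10); pick the most likely class per image
predicted_classes = np.argmax(predictions, axis=1)
print(predicted_classes)   # e.g. [3 7]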

Advantages of Vision Transformers

  • Strong global feature modeling
  • Highly parallelizable
  • Works well with large-scale datasets

Limitations of Vision Transformers

  • Require large datasets to perform well
  • Computationally expensive
  • Less efficient on small datasets

Practice Questions

Practice 1: Vision Transformers process images as sequences of what?



Practice 2: Which mechanism allows ViT to model global relationships?



Practice 3: Vision Transformers perform best when trained on what?



Quick Quiz

Quiz 1: Vision Transformers are based on which architecture?





Quiz 2: What is a key advantage of ViT over CNNs?





Quiz 3: What step converts image patches into vectors?





Coming up next: Multimodal Models — combining vision, text, and other data types.