Transformers in Vision

Aim

To understand the core principles of Vision Transformers (ViT), namely patch-based image representation and the self-attention mechanism. To apply a pretrained or lightly fine-tuned ViT model to a subset of CIFAR-10, and to visualize the patch embeddings and self-attention maps for selected images.
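
The two ideas named in the aim can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the model used in the exercise: the image is random data shaped like a CIFAR-10 sample (32x32x3), the patch size of 4 and embedding width of 16 are arbitrary choices, the weights are random rather than pretrained, and the helper names image_to_patches and self_attention are hypothetical. A real ViT also adds a class token, positional embeddings, multiple heads, and MLP blocks, all omitted here.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)

# Assumption: random values stand in for one CIFAR-10 image.
image = rng.random((32, 32, 3))

# Assumption: patch size 4 gives an 8 x 8 grid of 64 patches, each of length 48.
patches = image_to_patches(image, patch_size=4)

# Linear patch embedding with random (untrained) weights.
embed_dim = 16
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
tokens = patches @ w_embed

# Random projection matrices for queries, keys, and values.
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(embed_dim, embed_dim)) for _ in range(3))
output, attention = self_attention(tokens, w_q, w_k, w_v)

print(patches.shape, tokens.shape, attention.shape)
```

Each row of attention holds the weights one patch assigns to every patch in the image. Reshaping a row to the 8 x 8 patch grid gives the kind of attention map the aim asks to visualize.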