Virtual Labs

1. What limitation of CNNs primarily motivated the development of Vision Transformers?

a: CNNs cannot process color images
b: CNNs are limited to local receptive fields and cannot directly model long-range spatial dependencies
c: CNNs do not use learnable parameters
d: CNNs require positional encoding

2. How does a Vision Transformer process an image?

a: By directly applying convolutions
b: By flattening the entire image into one vector
c: By splitting the image into fixed-size patches
d: By applying pooling operations

3. Why are positional embeddings required in Vision Transformers?

a: To reduce model size
b: To encode spatial information of patches
c: To normalize patch values
d: To perform classification

4. What role does the CLS token play in a Vision Transformer?

a: It stores the final image representation for classification
b: It represents the background of the image
c: It replaces patch embeddings
d: It performs pooling

5. Which operation allows ViTs to model long-range dependencies?

a: Pooling
b: Self-attention
c: Dropout
d: Normalization

6. In the attention equation Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, what is the role of the softmax function?

a: Increasing the dimensionality of vectors
b: Removing unnecessary neurons during training
c: Converting attention scores into probabilities
d: Initializing model parameters

7. Why is the term sqrt(d_k) used in the denominator of the attention equation?

a: To increase numerical instability
b: To reduce the number of parameters
c: To scale down large dot-product values
d: To perform normalization

8. What does the Query-Key dot product QK^T represent in self-attention?

a: The similarity between different patches
b: The pixel intensity of image patches
c: The classification score
d: The loss value

9. What is the purpose of using multiple attention heads in a Vision Transformer?

a: To process images in grayscale only
b: To reduce the total number of model parameters
c: To ensure all heads learn the same attention pattern
d: To allow the model to attend to different representation subspaces simultaneously

10. Which statement best describes self-attention in Vision Transformers?

a: Each patch attends only to its neighboring patches
b: Self-attention replaces the need for labels
c: Attention is applied only in the final layer
d: Each patch attends to all other patches in the image
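
For reference after attempting the quiz, the attention equation quoted in question 6 can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a full Vision Transformer layer: the function name, the patch count, the dimension d_k, and the random Q, K, V matrices are all assumptions made for the example, and a real model would produce Q, K, V from learned projections of the patch embeddings.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head.

    q, k, v: arrays of shape (num_patches, d_k). Hypothetical helper for
    illustration only.
    """
    d_k = q.shape[-1]
    # Pairwise dot products between every query row and every key row.
    scores = q @ k.T / np.sqrt(d_k)
    # Subtract the row-wise maximum before exponentiating (numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    # Normalize each row so the weights for one query sum to 1.
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Assumed toy sizes: 4 patches, 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
num_patches, d_k = 4, 8
q = rng.normal(size=(num_patches, d_k))
k = rng.normal(size=(num_patches, d_k))
v = rng.normal(size=(num_patches, d_k))

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape)          # one output vector per patch
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

The `weights` array has one row per patch and one column per patch, so each row can be read as how strongly that patch draws on every patch in the image, including itself.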

Transformers in Vision