Transformers in Vision
1. What limitation of CNNs primarily motivated the development of Vision Transformers?
2. How does a Vision Transformer process an image?
3. Why are positional embeddings required in Vision Transformers?
4. What role does the CLS token play in a Vision Transformer?
5. Which operation allows ViTs to model long-range dependencies?
6. In the attention equation Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V, what is the role of the softmax function?
7. Why is the term sqrt(d_k) used in the denominator of the attention equation?
8. What does the Query-Key dot product QK^T represent in self-attention?
9. What is the purpose of using multiple attention heads in a Vision Transformer?
10. Which statement best describes self-attention in Vision Transformers?
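
For reference while working through questions 6 to 8, the attention equation quoted in question 6 can be written out in a few lines of code. The following is a minimal sketch in plain Python, not a production implementation. The function names (softmax, attention) and the toy Q, K, V values are assumptions chosen for illustration; in a real Vision Transformer, Q, K and V would come from learned projections of the patch embeddings.

```python
import math


def softmax(row):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on small list-of-lists matrices.

    Q and K have shape (n_tokens, d_k); V has shape (n_tokens, d_v).
    Returns (output, weights).
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    # scores[i][j] = (Q[i] . K[j]) / sqrt(d_k)
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K]
        for q_row in Q
    ]
    # Each row of scores becomes a set of non-negative weights that sum to 1.
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted sum of the rows of V.
    d_v = len(V[0])
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(d_v)]
        for w_row in weights
    ]
    return output, weights


# Three hypothetical patch tokens with d_k = d_v = 2 (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Every token's weight row spans all tokens, regardless of where the corresponding patches sit in the image.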