Transformers in Vision

1. Why are pretrained weights commonly used when training Vision Transformers on small datasets?
2. What information do self-attention maps provide in Vision Transformers?
3. Compared to CNNs, Vision Transformers primarily differ by using:
4. Why is patch size an important hyperparameter in Vision Transformers?
5. What is a key advantage of visualizing attention maps in ViTs?
6. Why are early layers of a pretrained Vision Transformer often frozen during fine-tuning?
7. What is the effect of reducing patch size in a Vision Transformer?
8. During evaluation, why is gradient computation disabled in the ViT code?
9. What insight is gained by visualizing self-attention maps in a Vision Transformer?
10. Compared to CNN feature maps, attention maps in ViTs primarily show:
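
As a worked illustration for questions 4 and 7 on patch size, the sketch below counts how many tokens a Vision Transformer would process for a square image at several patch sizes. It is a minimal sketch, not taken from any particular ViT codebase: the function name vit_token_stats is hypothetical, and it assumes non-overlapping square patches, an image side divisible by the patch side, and one extra class token. The 224-pixel image side is likewise only an example value.

```python
def vit_token_stats(image_size: int, patch_size: int) -> dict:
    """Count patches and attention-matrix entries for a square image.

    Assumptions (illustrative, not from the source): non-overlapping
    square patches and a single class token prepended to the sequence.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    seq_len = num_patches + 1          # +1 for the assumed class token
    attn_entries = seq_len ** 2        # one attention map is seq_len x seq_len

    return {
        "num_patches": num_patches,
        "seq_len": seq_len,
        "attn_entries": attn_entries,
    }


if __name__ == "__main__":
    # Halving the patch side quadruples the number of patches.
    for patch in (32, 16, 8):
        stats = vit_token_stats(224, patch)
        print(patch, stats["num_patches"], stats["seq_len"], stats["attn_entries"])
```

Under these assumptions, smaller patches give finer spatial detail but a longer token sequence, and the per-head attention map grows with the square of that sequence length, so compute and memory rise sharply as the patch size shrinks.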