Awesome Transformer in Vision

A curated list of vision transformer resources. Feel free to open a pull request or an issue to add papers.

Awesome Surveys

| Title | Venue | BibTeX |
|---|---|---|
| A Survey on Visual Transformer | ArXiv | Bib |
| Intriguing Properties of Vision Transformers | ArXiv | Code |
| CVPR 2021 Vision Transformer Papers (43 papers) | GitHub | -- |

Transformer in Vision

| Task | Reg | Det | Seg | Trk | Other |
|---|---|---|---|---|---|
| Explanation | Image Recognition | Object Detection | Image Segmentation | Object Tracking | Other types |

You can add a tag for a domain that contains several transformer-based works.


(Please list entries in reverse chronological order, newest first.)

| Title | Venue | Task | Code | BibTeX |
|---|---|---|---|---|
| Generative Video Transformer: Can Objects be the Words? | ICML 2021 | Cls | -- | -- |
| Tracking Instances as Queries | ArXiv | Seg | -- | -- |
| Instances as Queries | ArXiv | Seg | -- | GitHub |
| OadTR: Online Action Detection with Transformers | CVPRW | Det | -- | GitHub |
| An Empirical Study of Training Self-Supervised Vision Transformers | ArXiv | Other | -- | -- |
| End-to-end Temporal Action Detection with Transformer | ArXiv | Cls | -- | GitHub |
| MlTr: Multi-label Classification with Transformer | ArXiv | Cls | -- | GitHub |
| Delving Deep into the Generalization of Vision Transformers under Distribution Shifts | ArXiv | Other | -- | -- |
| Improved Transformer for High-Resolution GANs | ArXiv | Other | -- | -- |
| BEIT: BERT Pre-Training of Image Transformers | ArXiv | Cls | -- | GitHub |
| XCiT: Cross-Covariance Image Transformers | ArXiv | Other | -- | -- |
| Semi-Autoregressive Transformer for Image Captioning | ArXiv | Other | -- | -- |
| Long-Short Temporal Contrastive Learning of Video Transformers | ArXiv | Other | -- | -- |
| Uformer: A General U-Shaped Transformer for Image Restoration | ArXiv | Other | -- | GitHub |
| Video Super-Resolution Transformer | ArXiv | Other | -- | GitHub |
| DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification | ArXiv | Cls | -- | GitHub |
| Semantic Correspondence with Transformers | ArXiv | Other | -- | GitHub |
| Glance-and-Gaze Vision Transformer | ArXiv | Other | -- | GitHub |
| Few-Shot Segmentation via Cycle-Consistent Transformer | ArXiv | Seg | -- | -- |
| Self-Supervised Learning with Swin Transformers | ArXiv | Other | -- | GitHub |
| Visual Grounding with Transformers | ArXiv | Other | -- | -- |
| Associating Objects with Transformers for Video Object Segmentation | ArXiv | Seg | -- | GitHub |
| When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations | ArXiv | Other | -- | -- |
| Anticipative Video Transformer | ArXiv | Other | -- | GitHub |
| An Attention Free Transformer | ArXiv | Other | -- | -- |
| Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks | ArXiv | Other | GitHub | -- |
| TransVOS: Video Object Segmentation with Transformers | ArXiv | Seg | -- | -- |
| You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection | ArXiv | Det | GitHub | -- |
| ResT: An Efficient Transformer for Visual Recognition | ArXiv | Reg | GitHub | -- |
| Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length | ArXiv | Other | -- | -- |
| SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ArXiv | Seg | -- | -- |
| Aggregating Nested Transformers | ArXiv | Other | -- | -- |
| End-to-End Video Object Detection with Spatial-Temporal Transformers | ArXiv | Det | GitHub | -- |
| HOTR: End-to-End Human-Object Interaction Detection with Transformers | CVPR 2021 | Other | GitHub | -- |
| Line Segment Detection Using Transformers without Edges | CVPR 2021 | Other | -- | -- |
| Boosting Crowd Counting with Transformers | ArXiv | Other | -- | -- |
| Vision Transformers for Dense Prediction | ArXiv | Other | -- | -- |
| Points as Queries: Weakly Semi-supervised Object Detection by Points | ArXiv | Other | -- | -- |
| Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet | ArXiv | Reg | GitHub | |
| Bottleneck Transformers for Visual Recognition | ArXiv | Reg | GitHub | |
| SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation | ArXiv | Seg | -- | |
| TrackFormer: Multi-Object Tracking with Transformers | ArXiv | Trk | -- | |

| Title | Venue | Task | Code | BibTeX |
|---|---|---|---|---|
| End-to-End Video Instance Segmentation with Transformers | ArXiv | Seg | -- | -- |
| Training data-efficient image transformers & distillation through attention | ArXiv | Reg | GitHub | |
| An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | ICLR | Reg | GitHub | |
| Toward Transformer-Based Object Detection | ArXiv | Det | -- | |
| Rethinking Transformer-based Set Prediction for Object Detection | ArXiv | Det | -- | |
| UP-DETR: Unsupervised Pre-training for Object Detection with Transformers | ArXiv | Det | -- | |
| Deformable DETR: Deformable Transformers for End-to-End Object Detection | ArXiv | Det | GitHub | |
| End-to-End Object Detection with Transformers | ECCV | Det | GitHub | |
| Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers | ArXiv | Seg | GitHub | |
| MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers | ArXiv | Seg | -- | |
| TransTrack: Multiple-Object Tracking with Transformer | ArXiv | Trk | GitHub | |

| Title | Venue | Task | Code | BibTeX |
|---|---|---|---|---|
| Attention Is All You Need | NeurIPS'17 | -- | GitHub | |

