
Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

by Jinghuan Shang, Srijan Das and Michael S. Ryoo at NeurIPS 2022

We present 3DTRL, a plug-and-play layer for Transformers that uses 3D camera transformations to recover tokens in 3D space and thereby learns viewpoint-agnostic representations. Check our paper and project page for more details.

Quick link: [Usage] [Dataset] [Image Classification] [Action Recognition] [Video Alignment]

With 3DTRL, we can align videos from multiple viewpoints, including ego-centric (first-person) and third-person view videos.

Multi-view Video Alignment Results (columns: third-person view, first-person view GT, ours, DeiT+TCN)

3DTRL recovers pseudo-depth of images, producing semantically meaningful results.

Overview of 3DTRL

Usage

Directory Structure

├── _doc                            # images, gifs, etc. for readme
├── action_recognition              # all files related to action recognition; works standalone
│   ├── configs                     # config files for TimeSformer and +3DTRL
│   ├── timesformer
│   │   ├── datasets                # data pipeline for action recognition
│   │   └── models                  # definitions of TimeSformer and +3DTRL
│   └── script.sh                   # launch script for action recognition
│
├── backbone                        # modules used by 3DTRL (depth and camera estimators)
├── model                           # Transformer models with 3DTRL plug-in (ViT, Swin, TnT)
├── data_pipeline                   # dataset class for video alignment
├── i1k_configs                     # configuration files for ImageNet-1K training
│
├── 3dtrl_env.yml                   # conda env for image classification and video alignment
├── i1k.sh                          # launch script for ImageNet-1K jobs
├── imagenet_train.py               # entry point of ImageNet-1K training
├── imagenet_val.py                 # entry point of ImageNet-1K evaluation
├── multiview_video_alignment.py    # entry point of video alignment
└── utils.py                        # some utility functions
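
For orientation, below is a minimal, hypothetical sketch of what a 3DTRL-style plug-in layer does (all class and parameter names are invented for illustration; the actual implementation lives in model/ and backbone/): each patch token gets an estimated pseudo-depth, each image gets an estimated camera, tokens are recovered as 3D points, and the recovered coordinates are embedded back into the token stream.

import torch
import torch.nn as nn

class Pseudo3DTokenLayer(nn.Module):
    """Hypothetical sketch of a 3DTRL-style layer: estimate per-token pseudo-depth
    and a per-image camera, recover tokens as 3D points, and re-embed them."""
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.depth_head = nn.Linear(dim, 1)            # per-token pseudo-depth
        self.camera_head = nn.Sequential(              # per-image camera estimate
            nn.LayerNorm(dim), nn.Linear(dim, 6))      # 3 rotation + 3 translation params
        self.pos_embed_3d = nn.Linear(3, dim)          # embed recovered 3D coordinates
        side = int(num_tokens ** 0.5)                  # assume a square patch grid
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, side),
                                torch.linspace(-1, 1, side), indexing="ij")
        self.register_buffer("grid", torch.stack([xs, ys], dim=-1).reshape(1, -1, 2))

    def forward(self, tokens):                         # tokens: (B, N, dim) patch tokens
        depth = self.depth_head(tokens)                # (B, N, 1)
        cam = self.camera_head(tokens.mean(dim=1))     # (B, 6)
        grid = self.grid.expand(tokens.size(0), -1, -1)
        pts_cam = torch.cat([grid * depth, depth], dim=-1)    # back-project to camera space
        pts_world = pts_cam + cam[:, 3:].unsqueeze(1)         # translation only, for brevity
        return tokens + self.pos_embed_3d(pts_world)   # inject 3D positional information

A layer like this would typically sit between Transformer blocks; see model/ and the paper for the actual depth and camera estimator designs.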

Image Classification

Environment:

conda env create -f 3dtrl_env.yml

Run:

conda activate 3dtrl
bash i1k.sh num_gpu your_imagenet_dir
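
For example, a run on 8 GPUs might look like the following (the GPU count and ImageNet path are illustrative):

bash i1k.sh 8 /path/to/imagenet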

Credit: We build our code for image classification on top of timm.

Video Alignment

FTPV Dataset

We release the First-Third Person View (FTPV) dataset (including MC, Panda, Lift, and Can used in our paper) on Google Drive. Download and unzip it. Please consider citing our paper if you use the datasets. Note: the drive also includes the Pouring dataset introduced by the TCN paper, because I had a hard time finding a valid source to download it during my research and am re-sharing it for your convenience. Please cite TCN if you use Pouring.

Environment:

conda env create -f 3dtrl_env.yml

Run:

conda activate 3dtrl
python multiview_video_alignment.py --data dataset_name [--model vit_3dtrl] [--train_videos num_video_used]
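
For example, to train alignment with ViT+3DTRL (the dataset-name string and video count below are illustrative; check the script's argument parser for accepted values):

python multiview_video_alignment.py --data can --model vit_3dtrl --train_videos 5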

Action Recognition

Environment: we follow TimeSformer to set up the virtual environment. Then run:

cd action_recognition
bash script.sh your_config_file data_location log_location
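
For example (the config filename and paths below are illustrative; pick a config from action_recognition/configs):

bash script.sh configs/your_config.yaml /path/to/dataset /path/to/logs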

Cite 3DTRL

If you find our research useful, please consider citing:

@inproceedings{3dtrl,
    title={Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space},
    author={Jinghuan Shang and Srijan Das and Michael S Ryoo},
    booktitle={Advances in Neural Information Processing Systems},
    year={2022},
}
