dancing-in-style

https://github.com/leeswu/dancing-in-style

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: leeswu
License: apache-2.0
Language: Python
Default Branch: main
Size: 102 MB

Statistics

Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed almost 2 years ago

Dancing in Style: Classifying Dance Videos By Style

Authors:

| Sophie Wu | Han Dao | Nathan Guzman |
| --- | --- | --- |
| Stanford University | Stanford University | Stanford University |

Abstract

The task of classifying human actions in videos has been a significant area of computer vision research, with dance being a culturally rich and stylistically diverse subset. In this project, we present an approach to classifying dance videos by style using an enhanced Two-Stream Inflated 3D ConvNet (I3D) model. We combine the Kinetics-700 and Let's Dance datasets to create a robust dataset for training and evaluation. Our experiments include extensive hyperparameter tuning to improve model performance. We also experimented with transfer learning by fine-tuning two existing I3D models, based on the ResNet50 and ResNet50 (NonLocal Dot Product) backbones, on our dance-focused dataset. The best fine-tuned model reaches 56.90% top-1 accuracy and 86.31% top-5 accuracy across 19 dance styles.

Introduction

Classifying human actions in videos has long been a major area of computer vision research. We focus on a narrower subset: learning to recognize different styles of dance. Dance is an integral part of cultures around the world, and each style is a product of unique regional, historical, and social influences. Automatically and systematically classifying these forms of dance would help preserve and study them, potentially revealing new insights into what makes each dance style unique.

This project was completed as the final project for our CS231N Deep Learning for Computer Vision class at Stanford University, which we took in Spring 2024.

Dataset and Features

Creating the Datasets

To develop a comprehensive dataset for dance video classification, we utilized two primary sources: the Kinetics-700 dataset and the Let's Dance dataset.

Filtering Kinetics-700: We began by filtering the Kinetics-700 dataset to extract videos specifically related to dance. This involved identifying and selecting videos labeled with dance-related actions. The identified dance categories included 19 unique labels.
Combining with Let's Dance Dataset: Next, we incorporated the Let's Dance dataset, which contains detailed annotations for various dance styles. We mapped the dance labels from Let's Dance to align with those from Kinetics-700, ensuring consistency across our combined dataset.
Data Augmentation: To increase the size and variability of our dataset, we applied several data augmentation techniques to the training split. These techniques included horizontal flip, vertical flip, color jitter, and noise addition.
Splitting the Dataset: The final merged dataset was split into three sets:
- Training Set (90%)
- Validation Set (5%)
- Test Set (5%)
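The 90/5/5 split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_dataset`, the fixed seed, and the file names are assumptions, and the sketch does not stratify by class.

```python
import random

def split_dataset(video_names, train_frac=0.90, val_frac=0.05, seed=0):
    """Shuffle video names and split them into train/val/test lists."""
    names = sorted(video_names)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)     # seeded shuffle (seed value is an assumption)
    n_train = round(len(names) * train_frac)
    n_val = round(len(names) * val_frac)
    train = names[:n_train]
    val = names[n_train:n_train + n_val]
    test = names[n_train + n_val:]         # the remainder becomes the test set
    return train, val, test

# Hypothetical file names, for illustration only.
videos = [f"video_{i:04d}.mp4" for i in range(200)]
train, val, test = split_dataset(videos)
print(len(train), len(val), len(test))
```

Splitting before augmenting keeps augmented copies of a video out of the validation and test sets, which matches the statement above that augmentation was applied to the training split only.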

Annotations

We generated annotation files for each split, listing the video names and their corresponding integer-encoded labels.

Class Labels

The class labels were mapped to the following dance styles:

- 0: belly dancing
- 1: breakdancing
- 2: country line dancing
- 3: cumbia
- 4: dancing ballet
- 5: dancing charleston
- 6: dancing gangnam style
- 7: dancing macarena
- 8: jumpstyle dancing
- 9: krumping
- 10: moon walking
- 11: mosh pit dancing
- 12: robot dancing
- 13: salsa dancing
- 14: square dancing
- 15: swing dancing
- 16: tango dancing
- 17: tap dancing
- 18: zumba
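As a rough illustration of the annotation step, the sketch below encodes the label mapping and emits one `video_name label_id` line per video. The space-separated layout is an assumption modeled on the video annotation lists MMAction2 reads; the helper name and the clip names are hypothetical.

```python
# Class names in label order, as listed in the mapping above.
LABELS = [
    "belly dancing", "breakdancing", "country line dancing", "cumbia",
    "dancing ballet", "dancing charleston", "dancing gangnam style",
    "dancing macarena", "jumpstyle dancing", "krumping", "moon walking",
    "mosh pit dancing", "robot dancing", "salsa dancing", "square dancing",
    "swing dancing", "tango dancing", "tap dancing", "zumba",
]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}

def make_annotation_lines(samples):
    """Turn (video_name, style) pairs into 'video_name label_id' lines."""
    return [f"{video} {LABEL_TO_ID[style]}" for video, style in samples]

# Hypothetical clips, for illustration only.
samples = [("clip_a.mp4", "tango dancing"), ("clip_b.mp4", "zumba")]
lines = make_annotation_lines(samples)
print("\n".join(lines))
```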

Methods

Overview

Our primary model, the Two-Stream Inflated 3D ConvNet (I3D), builds on state-of-the-art image classification architectures by extending them into the spatio-temporal domain. This section includes the mathematical formulations of our input, output, and loss functions, and details the modifications made to existing models to enhance performance on action recognition tasks.

Two-Stream Inflated 3D ConvNet (I3D)

The I3D model inflates 2D ConvNet filters and pooling kernels into 3D, enabling the network to learn spatio-temporal features directly from video data. The architecture is based on the Inception-v1 network, which we extend by converting 2D filters $N \times N$ into 3D filters $N \times N \times N$. The model comprises two parallel streams: one for RGB frames and one for optical flow.
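To make the inflation step concrete, here is a minimal pure-Python sketch of the initialization described by Carreira and Zisserman (2017): the 2D kernel is repeated along a new temporal axis and divided by the temporal length, so that a video made of identical frames produces the same response as the 2D filter on a single frame. The function name and the example kernel are illustrative assumptions; this is not the code used in this repository.

```python
def inflate_2d_kernel(kernel_2d, time_dim):
    """Repeat an N x N kernel `time_dim` times along a new temporal axis.

    Each copy is scaled by 1 / time_dim, so summing the 3D kernel over
    time recovers the original 2D kernel.
    """
    return [
        [[w / time_dim for w in row] for row in kernel_2d]
        for _ in range(time_dim)
    ]

# Example 3 x 3 kernel (arbitrary values, for illustration only).
kernel_2d = [
    [1.0, 0.0, -1.0],
    [2.0, 0.0, -2.0],
    [1.0, 0.0, -1.0],
]
kernel_3d = inflate_2d_kernel(kernel_2d, time_dim=3)  # shape: 3 x 3 x 3
print(len(kernel_3d), len(kernel_3d[0]), len(kernel_3d[0][0]))
```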

Training Procedure

We trained the I3D model on the K700-2020 dataset, focusing specifically on dance videos. To implement our training procedures, we built on top of MMAction2, an open-source toolbox for video understanding based on PyTorch.

Algorithm Steps

Initialization: Inflate 2D filters from a pre-trained Inception-v1 model into 3D filters.
Data Augmentation: Apply random cropping, resizing, and flipping to the input frames.
Forward Pass: Compute the spatio-temporal features using 3D convolutions and pooling.
Loss Computation: Calculate the cross-entropy loss between the predicted and true labels.
Backpropagation: Update the network weights using gradient descent with momentum.
Testing: Evaluate the model on test sets by averaging predictions across video frames.
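The loss, update, and test-time averaging steps above can be sketched in plain Python. This is an illustrative sketch rather than the MMAction2 implementation: the function names are hypothetical, the momentum update follows the common "velocity = momentum * velocity + gradient" form, and the learning rate and momentum values are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])

def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v + g, then w <- w - lr * v."""
    new_v = [momentum * v + g for v, g in zip(velocity, grads)]
    new_w = [w - lr * v for w, v in zip(weights, new_v)]
    return new_w, new_v

def average_predictions(frame_probs):
    """Average per-frame class probabilities into one video-level prediction."""
    n = len(frame_probs)
    return [sum(col) / n for col in zip(*frame_probs)]

loss = cross_entropy([2.0, 0.5, -1.0], target=0)
weights, velocity = sgd_momentum_step([1.0], [0.5], [0.0], lr=0.1)
video_probs = average_predictions([[0.2, 0.8], [0.6, 0.4]])
print(loss, weights, velocity, video_probs)
```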

Hyperparameter Tuning

We performed hyperparameter tuning using Optuna to identify the optimal learning rate, dropout rate, and optimizer for our model. The best performing hyperparameters were:

- Learning rate: $7.87 \times 10^{-4}$
- Dropout rate: 0.2966
- Optimizer: SGD
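One way these values could be expressed in an MMAction2-style (MMEngine) Python config is sketched below. This fragment is an assumption about how such a config might look, not the authors' actual file; the momentum and weight decay values are not stated in the source and are placeholders.

```python
# Sketch of config overrides carrying the tuned hyperparameters.
optim_wrapper = dict(
    optimizer=dict(
        type="SGD",            # optimizer selected by the search
        lr=7.87e-4,            # tuned learning rate
        momentum=0.9,          # assumption: not stated in the source
        weight_decay=1e-4,     # assumption: not stated in the source
    )
)

model = dict(
    cls_head=dict(
        dropout_ratio=0.2966,  # tuned dropout rate, applied in the classification head
    )
)
```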

Transfer Learning and Finetuning

We experimented with transfer learning by fine-tuning existing I3D models. Finetuning was performed by loading pretrained weights from two different I3D models (ResNet50 and ResNet50 (NonLocal Dot Product)) and adapting them to our dance-focused dataset.
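In an MMEngine-based config, fine-tuning of this kind is typically set up by pointing `load_from` at a pretrained checkpoint and resizing the classification head. The fragment below is a sketch under those assumptions: the checkpoint path is hypothetical, and the epoch count mirrors the 10-epoch run in the results table rather than a confirmed setting from the authors' config.

```python
# Hypothetical local path to a pretrained I3D checkpoint.
load_from = "checkpoints/i3d_r50_pretrained.pth"

model = dict(
    cls_head=dict(
        num_classes=19,  # resize the head to the 19 dance styles
    )
)

# Epoch-based training loop; max_epochs mirrors the 10-epoch run in the results table.
train_cfg = dict(type="EpochBasedTrainLoop", max_epochs=10, val_interval=1)
```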

Experimental Results

Evaluation Metrics

We used the following metric to evaluate our model's performance:

- Accuracy: The proportion of correct predictions out of the total number of predictions.
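A minimal sketch of this metric, generalized to top-k so that it covers both the top-1 and top-5 figures reported in the fine-tuning results. The function name and the toy scores are illustrative assumptions.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

# Toy example: 4 samples, 3 classes.
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 2, 2]
print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=2))
```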

Initial Experiment Results

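Initial results showed that the baseline I3D model achieved moderate accuracy in classifying dance styles. Performance improved with hyperparameter tuning and transfer learning; the largest gains came from fine-tuning pretrained models, which reached 51.61% and 56.90% top-1 accuracy, compared with roughly 15.5% validation accuracy in the run reported in Table 1.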

Hyperparameter Tuning Results

Table 1 summarizes the training and validation losses, as well as the validation accuracy over 10 epochs:

| Epoch | Training Loss | Validation Loss | Validation Accuracy (%) |
|-------|---------------|-----------------|-------------------------|
| 1 | 2.90550 | 2.81120 | 15.32 |
| 2 | 2.88860 | 2.79430 | 15.47 |
| 3 | 2.87170 | 2.77740 | 15.41 |
| 4 | 2.85480 | 2.76050 | 15.48 |
| 5 | 2.83790 | 2.74360 | 15.44 |
| 6 | 2.82100 | 2.72670 | 15.45 |
| 7 | 2.80420 | 2.70980 | 15.46 |
| 8 | 2.78730 | 2.69300 | 15.43 |
| 9 | 2.77040 | 2.67600 | 15.49 |
| 10 | 2.75350 | 2.65910 | 15.45 |
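For context when reading these numbers, it can help to compare them with a uniform-guess reference for 19 classes. The short calculation below is an illustrative aid, not part of the original analysis. It gives a chance accuracy of about 5.26% (assuming balanced classes, which the source does not state) and a cross-entropy of ln 19, about 2.944, for a model that predicts every class with equal probability.

```python
import math

num_classes = 19

# Accuracy of uniform random guessing, in percent (assumes balanced classes).
chance_accuracy = 100.0 / num_classes

# Cross-entropy loss of a prediction that assigns probability 1/19 to every class.
chance_loss = math.log(num_classes)

print(f"chance accuracy: {chance_accuracy:.2f}%")
print(f"uniform-prediction loss: {chance_loss:.4f}")
```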

Transfer Learning and Finetuning Results

The best results from fine-tuning the I3D models on our dataset are summarized below:

| Pretrain Model | Learning Rate | Epochs | Top-1 Accuracy (%) | Top-5 Accuracy (%) |
|---------------------|---------------|--------|--------------------|--------------------|
| ResNet50 | $1 \times 10^{-2}$ | 3 | 51.61 | 82.80 |
| ResNet50 (NonLocal) | Optimal: $7.87 \times 10^{-4}$ | 10 | 56.90 | 86.31 |

Conclusion and Future Work

In this study, we explored several techniques, including hyperparameter tuning and transfer learning, to improve the performance of our model. Through extensive experimentation, we identified hyperparameters that improved model performance. Our results indicate that transfer learning is effective for this domain: general action recognition models can be fine-tuned to recognize more specific variations of actions, such as dance styles.

Future Work

Future work includes:

- Implementing additional data augmentation techniques.
- Extending training time to further enhance model performance.
- Conducting a broader hyperparameter search.
- Exploring ensemble methods to improve robustness and accuracy.
- Performing further pretraining and finetuning on different pretrained models.
- Exploring additional regularization techniques to reduce overfitting.

References

Carreira, J., & Zisserman, A. (2017). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CoRR, abs/1705.07750.
Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., & Zisserman, A. (2020). A short note on the kinetics-700-2020 human action dataset. arXiv preprint arXiv:2010.10864.
Carreira, J., & Zisserman, A. (2019). Kinetics-700 dataset. arXiv preprint arXiv:1907.06987.
Castro, D., Hickson, S., Sangkloy, P., Mittal, B., Dai, S., Hays, J., & Essa, I. A. (2018). Let’s Dance: Learning from Online Dance Videos. CoRR, abs/1801.07388.
Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. IEEE International Conference on Computer Vision (ICCV), 6202–6211.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1725–1732.
MMAction2 Contributors. (2020). OpenMMLab’s Next Generation Video Understanding Toolbox and Benchmark. Retrieved from https://github.com/open-mmlab/mmaction2.
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. CoRR, abs/1406.2199.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. IEEE International Conference on Computer Vision (ICCV), 4489–4497.
Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. IEEE International Conference on Computer Vision (ICCV), 3551–3558.
Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. CVPR.

Owner

Name: Sophie Wu
Login: leeswu
Kind: user

Repositories: 1
Profile: https://github.com/leeswu

Dependencies

.github/workflows/lint.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite

.github/workflows/merge_stage_test.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
codecov/codecov-action v1.0.14 composite

.github/workflows/pr_stage_test.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
codecov/codecov-action v1.0.14 composite

.github/workflows/publish-to-pypi.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite

.circleci/docker/Dockerfile docker

pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build

docker/Dockerfile docker

pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build

docker/serve/Dockerfile docker

pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build

requirements/build.txt pypi

Pillow *
decord >=0.4.1
einops *
matplotlib *
numpy *
opencv-contrib-python *
scipy *
torch >=1.3

requirements/docs.txt pypi

docutils ==0.18.1
einops *
modelindex *
myst-parser *
opencv-python *
scipy *
sphinx ==6.1.3
sphinx-notfound-page *
sphinx-tabs *
sphinx_copybutton *
sphinx_markdown_tables *
sphinxcontrib-jquery *
tabulate *

requirements/mminstall.txt pypi

mmcv >=2.0.0rc4,<2.2.0
mmengine >=0.7.1,<1.0.0

requirements/multimodal.txt pypi

transformers >=4.28.0

requirements/optional.txt pypi

PyTurboJPEG *
av >=9.0
future *
imgaug *
librosa *
lmdb *
moviepy *
openai-clip *
packaging *
pims *
soundfile *
tensorboard *
wandb *

requirements/readthedocs.txt pypi

mmcv *
titlecase *
torch *
torchvision *

requirements/tests.txt pypi

coverage * test
flake8 * test
interrogate * test
isort ==4.3.21 test
parameterized * test
pytest * test
pytest-runner * test
xdoctest >=0.10.0 test
yapf * test

requirements.txt pypi

setup.py pypi

tools/data/activitynet/environment.yml pypi

decorator ==4.4.2
intel-openmp ==2019.0
joblib ==0.15.1
mkl ==2019.0
numpy ==1.18.4
olefile ==0.46
pandas ==1.0.3
python-dateutil ==2.8.1
pytz ==2020.1
six ==1.14.0
youtube-dl *

tools/data/gym/environment.yml pypi

decorator ==4.4.2
intel-openmp ==2019.0
joblib ==0.15.1
mkl ==2019.0
numpy ==1.18.4
olefile ==0.46
pandas ==1.0.3
python-dateutil ==2.8.1
pytz ==2020.1
six ==1.14.0
youtube-dl *

tools/data/hvu/environment.yml pypi

decorator ==4.4.2
intel-openmp ==2019.0
joblib ==0.15.1
mkl ==2019.0
numpy ==1.18.4
olefile ==0.46
pandas ==1.0.3
python-dateutil ==2.8.1
pytz ==2020.1
six ==1.14.0
youtube-dl *

tools/data/kinetics/environment.yml pypi

decorator ==4.4.2
intel-openmp ==2019.0
joblib ==0.15.1
mkl ==2019.0
numpy ==1.18.4
olefile ==0.46
pandas ==1.0.3
python-dateutil ==2.8.1
pytz ==2020.1
six ==1.14.0
youtube-dl *