Recent Releases of videoswin
videoswin - v2.0
Summary
Keras 3 implementation of Video Swin Transformer. The official PyTorch weights have been converted to be Keras 3 compatible. This implementation supports running the model on multiple backends, i.e., TensorFlow, PyTorch, and JAX.
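For context, Keras 3 selects its backend through the `KERAS_BACKEND` environment variable, which must be set before `keras` is imported. A minimal sketch (the model-loading step is commented out and only illustrative):

```python
import os

# Keras 3 picks its compute backend from this environment variable;
# it must be set before the first `import keras`.
os.environ['KERAS_BACKEND'] = 'jax'  # or 'tensorflow', 'torch'

# import keras
# keras.backend.backend() would then report 'jax'
```

Switching the value to `'tensorflow'` or `'torch'` runs the same model code on those backends instead.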
Full Changelog: https://github.com/innat/VideoSwin/compare/v1.1...v2.0
- Jupyter Notebook
Published by innat almost 2 years ago
videoswin - v1.1
TensorFlow SavedModel format weights. Details.
Published by innat over 2 years ago
videoswin - v1.0
Checkpoints of VideoSwin in Keras
Checkpoints of VideoSwin: Video Swin Transformer model in Keras. The pretrained weights are ported from the official PyTorch model. The following is the list of all available models in `.h5` format.
Checkpoint Naming Style
To encode the model variations briefly, the general naming format is:
```python
dataset = 'K400'             # 'K400', 'SSV2'
pretrained_dataset = 'IN1K'  # 'IN1K', 'IN22K'
size = 'B'                   # 'T', 'S', 'B'
patch_size = (2, 4, 4)
window_size = (8, 7, 7)      # (8,7,7), (16,7,7)
num_frames = 32
input_size = 224

checkpoint_name = (
    f'TFVideoSwin{size}'
    f'{dataset}'
    f'{pretrained_dataset}'
    f'P{"".join(map(str, patch_size))}'
    f'W{"".join(map(str, window_size))}'
    f'{num_frames}x{input_size}.h5'
)
# checkpoint_name -> 'TFVideoSwinBK400IN1KP244W87732x224.h5'
```
Here, size represents tiny (T), small (S), and base (B). The pretrained_dataset refers to the pretrained weights used to initialize the video swin model while training. For example, IN22K (ImageNet-22K) pretrained 2D Swin image models are used to initialize the 3D video swin model. The dataset refers to the benchmark dataset, i.e., Kinetics or Something-Something-V2. The patch_size and window_size refer to internal parameters of the model architecture. The num_frames and input_size for video swin are 32 and 224 respectively. In the Keras implementation, the checkpoints are also available in SavedModel and H5 formats. Check the release page of v1.1 for the SavedModel checkpoints.
| Model Name |
|-------------------------------------|
| TFVideoSwinTK400IN1KP244W87732x224.h5 |
| TFVideoSwinSK400IN1KP244W87732x224.h5 |
| TFVideoSwinBSSV2K400P244W167732x224.h5 |
| TFVideoSwinBK600IN22KP244W87732x224.h5 |
| TFVideoSwinBK400IN22KP244W87732x224.h5 |
| TFVideoSwinBK400IN1KP244W87732x224.h5 |
Here, IN1K and IN22K refer to ImageNet-1K and ImageNet-22K. P244 refers to a patch_size of [2,4,4], and W877 refers to a window_size of [8,7,7]. All these models output logits, which makes it easy to add a custom head on top for further downstream tasks. Check the notebook.
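As an illustration of this naming scheme, here is a hypothetical helper (not part of the repo) that decodes one of the checkpoint file names above back into its components:

```python
import re

# Hypothetical helper: parses a checkpoint file name following the
# naming scheme described above. The alternations cover only the
# variants listed in the table (e.g., patch P244, windows 877/1677).
PATTERN = re.compile(
    r'^TFVideoSwin'
    r'(?P<size>T|S|B)'                  # tiny / small / base
    r'(?P<dataset>K400|K600|SSV2)'      # benchmark dataset
    r'(?P<pretrained>IN1K|IN22K|K400)'  # initialization weights
    r'P(?P<patch>244)'                  # patch_size (2,4,4)
    r'W(?P<window>877|1677)'            # window_size (8,7,7) or (16,7,7)
    r'(?P<frames>\d+)x(?P<input>\d+)\.h5$'
)

def parse_checkpoint_name(name):
    """Return the naming components of a checkpoint file name as a dict."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f'unrecognized checkpoint name: {name}')
    return m.groupdict()

info = parse_checkpoint_name('TFVideoSwinBSSV2K400P244W167732x224.h5')
# -> {'size': 'B', 'dataset': 'SSV2', 'pretrained': 'K400',
#     'patch': '244', 'window': '1677', 'frames': '32', 'input': '224'}
```

Note how, for the SSV2 checkpoint, the "pretrained" slot holds K400 rather than an ImageNet tag, since that model was initialized from Kinetics-400 weights.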
Published by innat over 2 years ago