https://github.com/ajaymin28/humanactionrecognition

CLIP based human action recognition, alignment of text and image using Prompt engineering.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Keywords

clip computervision kaggle-dataset prompt-engineering pytorch
Last synced: 5 months ago

Repository

CLIP based human action recognition, alignment of text and image using Prompt engineering.

Basic Info
  • Host: GitHub
  • Owner: ajaymin28
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 2.48 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
clip computervision kaggle-dataset prompt-engineering pytorch
Created about 2 years ago · Last pushed 8 months ago
Metadata Files
Readme

README.md

Human Action Recognition on Static Images


📝 Introduction

Human Action Recognition (HAR) is a core computer vision task with applications such as surveillance, healthcare, and human-computer interaction. Traditional approaches relied on video inputs or handcrafted features and often struggled with complex actions. Recent advances instead work from static images with multimodal models such as CLIP, which align textual and visual embeddings for robust action recognition.
This project uses prompt engineering and Top-K accuracy evaluation to show how CLIP achieves state-of-the-art performance on static image action recognition, outperforming traditional CNNs.
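The core idea can be illustrated with a minimal zero-shot sketch using OpenAI's `clip` package; the class names, prompt template, and image path below are illustrative assumptions rather than this repository's exact pipeline:

```python
# Minimal zero-shot sketch: score an image against engineered action prompts with CLIP.
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical action classes and a simple prompt template (prompt engineering).
actions = ["clapping", "cycling", "drinking", "running", "sleeping"]
prompts = [f"a photo of a person {a}" for a in actions]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image embedding and each prompt embedding.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for prob, action in sorted(zip(probs[0].tolist(), actions), reverse=True):
    print(f"{action}: {prob:.3f}")
```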


🚀 Highlights

  • Static Image Action Recognition: No video required; CLIP recognizes human actions from single images.
  • Multimodal Modeling: Combines textual prompts and visual features via CLIP.
  • Prompt Engineering: Custom textual prompts enhance action label discrimination.
  • Top-K Evaluation: Assesses model ranking capabilities beyond Top-1 accuracy.
  • Visualization: Self-attention maps reveal model interpretability.

🧪 Experiments

Dataset

Dataset Classes

Prompt Engineering (fine-grained relabelling of the data)

Dataset Classes
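As a hedged illustration of the relabelling idea, coarse dataset labels can be mapped to richer textual descriptions before they are tokenized by CLIP's text encoder; the labels and descriptions below are hypothetical examples, not the actual mapping used here:

```python
# Hypothetical fine-grained relabelling: coarse labels -> richer CLIP prompts.
FINE_GRAINED_PROMPTS = {
    "texting":      "a person looking down at a mobile phone and typing a message",
    "using_laptop": "a person sitting at a desk and typing on a laptop computer",
    "drinking":     "a person holding a cup or bottle and drinking from it",
    "hugging":      "two people standing and embracing each other in a hug",
}

def build_prompt(label: str) -> str:
    """Return the engineered prompt for a label, falling back to a generic template."""
    return FINE_GRAINED_PROMPTS.get(label, f"a photo of a person {label.replace('_', ' ')}")
```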


Evaluation Metrics

  • Top-K Accuracy:
    • Evaluates whether the true label appears in the top-K predictions (commonly K=5)
    • Especially useful when multiple actions are plausible for a single image
    • For the test data, which has no ground-truth labels, the top-3 model predictions are shown instead (see the sketch below)
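A minimal sketch of how Top-K accuracy can be computed in PyTorch; the variable names and the toy tensors are assumptions for illustration:

```python
import torch

def top_k_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose true label appears among the k highest-scoring classes."""
    topk = logits.topk(k, dim=-1).indices             # (N, k) predicted class ids
    hits = (topk == targets.unsqueeze(1)).any(dim=1)  # True where the label is in the top k
    return hits.float().mean().item()

# Toy example: 3 samples, 5 classes.
logits = torch.tensor([[0.10, 0.70, 0.10, 0.05, 0.05],
                       [0.30, 0.20, 0.40, 0.05, 0.05],
                       [0.60, 0.10, 0.10, 0.10, 0.10]])
targets = torch.tensor([1, 0, 4])
print(top_k_accuracy(logits, targets, k=2))  # 2 of 3 labels are in the top-2 -> ~0.667
```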

Results

  • ResNet18: Struggled with inter-class variance and showed limited generalization.
  • CLIP: Achieved strong semantic alignment between text and image features. Prompt engineering further boosted recognition of fine-grained actions.

📊 Analysis & Visualization

  • CLIP can overfit on the training set but still generalizes well (as observed in the loss/accuracy plots).
  • No ground truth for the test set: top-3 predictions are shown for interpretability.
  • Attention maps highlight the synergy between textual prompts and visual focus (see the sketch below).
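A hedged sketch of how such an attention overlay could be rendered, assuming the CLS-token attention over image patches has already been extracted from the vision transformer (e.g. via a forward hook); the 7x7 patch grid and file name are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def show_attention(image_path: str, attn: np.ndarray, grid: int = 7) -> None:
    """Overlay a grid x grid attention heatmap on the input image."""
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    heatmap = attn.reshape(grid, grid)
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)  # scale to [0, 1]
    upsampled = np.kron(heatmap, np.ones((224 // grid, 224 // grid)))  # nearest-neighbour upsample
    plt.imshow(img)
    plt.imshow(upsampled, alpha=0.5, cmap="jet")
    plt.axis("off")
    plt.show()

# Example call with random weights over a 7x7 patch grid:
# show_attention("example.jpg", np.random.rand(49))
```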

Attention maps



Owner

  • Name: Jaimin Bhoi
  • Login: ajaymin28
  • Kind: user
  • Location: Bangalore
  • Company: Tata Consultancy Services

A computer engineer, on my way to make computers do great things.

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1