https://github.com/ajaymin28/humanactionrecognition

CLIP based human action recognition, alignment of text and image using Prompt engineering.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Keywords

clip computervision kaggle-dataset prompt-engineering pytorch
Last synced: 5 months ago

Repository

CLIP based human action recognition, alignment of text and image using Prompt engineering.

Basic Info
  • Host: GitHub
  • Owner: ajaymin28
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 2.48 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
clip computervision kaggle-dataset prompt-engineering pytorch
Created about 2 years ago · Last pushed 8 months ago
Metadata Files
Readme

README.md

Human Action Recognition on Static Images


📝 Introduction

Human Action Recognition (HAR) is a core computer vision task with applications such as surveillance, healthcare, and human-computer interaction. Traditional approaches relied on video inputs or handcrafted features and often struggled with complex actions. Recent advances instead work from static images with multimodal models such as CLIP, which align textual and visual embeddings for robust action recognition.
This project uses prompt engineering and Top-K accuracy evaluation to show how CLIP achieves state-of-the-art performance on static image action recognition, outperforming traditional CNNs.
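The core idea can be illustrated with a minimal zero-shot sketch using OpenAI's `clip` package; the class names, prompt template, and image path below are illustrative assumptions rather than this repository's exact pipeline:

```python
# Minimal zero-shot sketch: score an image against engineered action prompts with CLIP.
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical action classes and a simple prompt template (prompt engineering).
actions = ["clapping", "cycling", "drinking", "running", "sleeping"]
prompts = [f"a photo of a person {a}" for a in actions]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image embedding and each prompt embedding.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for prob, action in sorted(zip(probs[0].tolist(), actions), reverse=True):
    print(f"{action}: {prob:.3f}")
```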


🚀 Highlights

  • Static Image Action Recognition: No video required; CLIP recognizes human actions from single images.
  • Multimodal Modeling: Combines textual prompts and visual features via CLIP.
  • Prompt Engineering: Custom textual prompts enhance action label discrimination.
  • Top-K Evaluation: Assesses model ranking capabilities beyond Top-1 accuracy.
  • Visualization: Self-attention maps reveal model interpretability.

🧪 Experiments

Dataset

Dataset Classes

Prompt Engineering (fine-grained relabelling of the data)

Dataset Classes
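As a hedged illustration of the relabelling idea, coarse dataset labels can be mapped to richer textual descriptions before they are tokenized by CLIP's text encoder; the labels and descriptions below are hypothetical examples, not the actual mapping used here:

```python
# Hypothetical fine-grained relabelling: coarse labels -> richer CLIP prompts.
FINE_GRAINED_PROMPTS = {
    "texting":      "a person looking down at a mobile phone and typing a message",
    "using_laptop": "a person sitting at a desk and typing on a laptop computer",
    "drinking":     "a person holding a cup or bottle and drinking from it",
    "hugging":      "two people standing and embracing each other in a hug",
}

def build_prompt(label: str) -> str:
    """Return the engineered prompt for a label, falling back to a generic template."""
    return FINE_GRAINED_PROMPTS.get(label, f"a photo of a person {label.replace('_', ' ')}")
```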


Evaluation Metrics

  • Top-K Accuracy:
    • Evaluates whether the true label appears in the top-K predictions (commonly K=5)
    • Especially useful when multiple actions are plausible for a single image
    • For the test data, which has no ground-truth labels, the top-3 model predictions are shown instead (see the sketch below)
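A minimal sketch of how Top-K accuracy can be computed in PyTorch; the variable names and the toy tensors are assumptions for illustration:

```python
import torch

def top_k_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose true label appears among the k highest-scoring classes."""
    topk = logits.topk(k, dim=-1).indices             # (N, k) predicted class ids
    hits = (topk == targets.unsqueeze(1)).any(dim=1)  # True where the label is in the top k
    return hits.float().mean().item()

# Toy example: 3 samples, 5 classes.
logits = torch.tensor([[0.10, 0.70, 0.10, 0.05, 0.05],
                       [0.30, 0.20, 0.40, 0.05, 0.05],
                       [0.60, 0.10, 0.10, 0.10, 0.10]])
targets = torch.tensor([1, 0, 4])
print(top_k_accuracy(logits, targets, k=2))  # 2 of 3 labels are in the top-2 -> ~0.667
```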

Results

  • ResNet18: Struggled with inter-class variance and showed limited generalization.
  • CLIP: Achieved strong semantic alignment between text and image features. Prompt engineering further boosted recognition of fine-grained actions.

📊 Analysis & Visualization

  • CLIP can overfit on the training set but still generalizes well (as observed in the loss/accuracy plots).
  • No ground truth for the test set: top-3 predictions are shown for interpretability.
  • Attention maps highlight the synergy between textual prompts and visual focus (see the sketch below).
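A hedged sketch of how such an attention overlay could be rendered, assuming the CLS-token attention over image patches has already been extracted from the vision transformer (e.g. via a forward hook); the 7x7 patch grid and file name are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def show_attention(image_path: str, attn: np.ndarray, grid: int = 7) -> None:
    """Overlay a grid x grid attention heatmap on the input image."""
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    heatmap = attn.reshape(grid, grid)
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)  # scale to [0, 1]
    upsampled = np.kron(heatmap, np.ones((224 // grid, 224 // grid)))  # nearest-neighbour upsample
    plt.imshow(img)
    plt.imshow(upsampled, alpha=0.5, cmap="jet")
    plt.axis("off")
    plt.show()

# Example call with random weights over a 7x7 patch grid:
# show_attention("example.jpg", np.random.rand(49))
```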

Attention maps



Owner

  • Name: Jaimin Bhoi
  • Login: ajaymin28
  • Kind: user
  • Location: Bangalore
  • Company: Tata Consultancy Services

A computer engineer, on my way to make computers do great things.

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1