https://github.com/cosmaadrian/multimodal-depression-from-video

Official source code for the paper: "Reading Between the Frames Multi-Modal Non-Verbal Depression Detection in Videos"

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org, scholar.google
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (10.9%) to scientific vocabulary

Keywords

depression-detection multimodal-deep-learning non-verbal-communication video-classification

Last synced: 10 months ago · JSON representation

Repository

Official source code for the paper: "Reading Between the Frames Multi-Modal Non-Verbal Depression Detection in Videos"

Basic Info

Host: GitHub
Owner: cosmaadrian
License: other
Language: Python
Default Branch: master
Homepage:
Size: 370 KB

Statistics

Stars: 20
Watchers: 3
Forks: 2
Open Issues: 2
Releases: 0

Topics

depression-detection multimodal-deep-learning non-verbal-communication video-classification

Created about 3 years ago · Last pushed about 2 years ago

Metadata Files

Readme License

Reading Between the 🎞️ Frames:
Multi-Modal Depression Detection in Videos from Non-Verbal Cues

Accepted at the 2024 edition of European Conference on Information Retrieval (ECIR).

[David Gimeno-Gómez](https://scholar.google.es/citations?user=DVRSla8AAAAJ&hl=en), [Ana-Maria Bucur](https://scholar.google.com/citations?user=TQuQ5IAAAAAJ&hl=en), [Adrian Cosma](https://scholar.google.com/citations?user=cdYk_RUAAAAJ&hl=en), [Carlos-D. Martínez-Hinarejos](https://scholar.google.es/citations?user=M_EmUoIAAAAJ&hl=en), [Paolo Rosso](https://scholar.google.es/citations?user=HFKXPH8AAAAJ&hl=en)

[📘 Introduction](#intro) | [🛠️ Data Preparation](#preparation) | [💪 Training and Evaluation](#training) | [📖 Citation](#citation) | [📝 License](#license)

TL;DR

[📜 Arxiv Link](https://arxiv.org/abs/2401.02746) [🛝 Google Slides](https://docs.google.com/presentation/d/1QskN661UrygIYqaqa1yGkuLz4O4zwbnal_gWO5lOJ0s/edit?usp=sharing)

We extract high-level non-verbal cues using pretrained models, process them using a modality-specific encoder, condition the resulting embeddings with positional and modality embeddings, and process the sequence with a transformer encoder to perform the final classification.

📘 Introduction

Depression, a prominent contributor to global disability, affects a substantial portion of the population. Efforts to detect depression from social media texts have been prevalent, yet only a few works explored depression detection from user-generated video content. In this work, we address this research gap by proposing a simple and flexible multi-modal temporal model capable of discerning non-verbal depression cues from diverse modalities in noisy, real-world videos. We show that, for in-the-wild videos, using additional high-level non-verbal cues is crucial to achieving good performance, and we extracted and processed audio speech embeddings, face emotion embeddings, face, body and hand landmarks, and gaze and blinking information. Through extensive experiments, we show that our model achieves state-of-the-art results on three key benchmark datasets for depression detection from video by a substantial margin.

🛠️ Data Preparation

Downloading the datasets

For D-Vlog, the features extracted by the authors are publicly available here. Original vlog videos are available upon request. Please contact the original paper authors.
For DAIC-WOZ and E-DAIC, the features are only available upon request here.

Extracting non-verbal modalities

Click here for detailed tutorial

#### D-Vlog - To extract the audio embeddings: ``` conda create -y -n pase+ python=3.7 conda activate pase+ bash ./scripts/conda_envs/prepare_pase+_env.sh bash ./scripts/features scripts/feature_extraction/extract-dvlog-pase+-feats.sh conda deactivate pase+ ``` - To extract face, body, and hand landmarks: ``` conda create -y -n landmarks python=3.8 conda activate landmarks bash scripts/conda_envs/prepare_landmarks_env.sh scripts/feature_extraction/extract-dvlog-landmarks.sh conda deactivate landmarks ``` - To extract face EmoNet embeddings: ``` conda create -y -n emonet python=3.8 conda activate emonet bash ./scripts/conda_envs/prepare_emonet_env.sh bash ./scripts/feature_extraction/extract-dvlog-emonet-feats.sh conda deactivate emonet ``` - To extract gaze tracking: ``` conda create -y -n mpiigaze python=3.8 conda activate mpiigaze bash ./scripts/conda_envs/prepare_mpiigaze_env.sh bash ./scripts/feature_extraction/extract-dvlog-gaze-feats.sh conda deactivate mpiigaze ``` - To extract blinking features: ``` conda create -y -n instblink python=3.7 conda activate instblink bash ./scripts/conda_envs/prepare_instblink_env.sh bash ./scripts/feature_extraction/extract-dvlog-blinking-feats.sh conda deactivate instblink ``` #### DAIC-WOZ - To pre-process the DAIC-WOZ features: ``` conda activate landmarks bash ./scripts/feature_extraction/extract-daicwoz-features.sh conda deactivate ``` #### E-DAIC - To pre-process the DAIC-WOZ features: ``` conda activate landmarks bash ./scripts/feature_extraction/extract-edaic-features.sh conda deactivate ```

Implementation Detail

Once all the data has been pre-processed, you should indicate the absule path to the directory where it is stored in the 'configs/env_config.yaml' file for each one of the corresponding datasets.

In addition, you can continue working in the 'landmarks' environment, since it has everything we need for training and evaluating our model:

conda activate landmarks

Modality distributions for D-Vlog

Once extracted, the modalities for D-Vlog should be similar to the following plots (counts / missing frames):

💪 Training and Evaluation

To train and evaluate the models and the results reported in the paper, you can run the following commands:

cd experiments/ bash run-exps.sh

📖 Citation

If you found our work useful, please cite our paper:

Reading Between the Frames: Multi-Modal Non-Verbal Depression Detection in Videos

@InProceedings{gimeno2024videodepression, author="Gimeno-G{\'o}mez, David and Bucur, Ana-Maria and Cosma, Adrian and Mart{\'i}nez-Hinarejos, Carlos-David and Rosso, Paolo", editor="Goharian, Nazli and Tonellotto, Nicola and He, Yulan and Lipani, Aldo and McDonald, Graham and Macdonald, Craig and Ounis, Iadh", title="Reading Between the Frames: Multi-modal Depression Detection in Videos from Non-verbal Cues", booktitle="Advances in Information Retrieval", year="2024", publisher="Springer Nature Switzerland", address="Cham", pages="191--209", isbn="978-3-031-56027-9" }

This repository is based on the Acumen ✨ Template ✨.

📝 License

This work is protected by CC BY-NC-ND 4.0 License (Non-Commercial & No Derivatives)

Owner

Name: Adrian Cosma
Login: cosmaadrian
Kind: user
Location: Bucharest, Romania
Company: University Politehnica of Bucharest

Repositories: 21
Profile: https://github.com/cosmaadrian

Mercenary Researcher

GitHub Events

Total

Issues event: 2
Watch event: 24
Issue comment event: 1
Fork event: 2

Last Year

Issues event: 2
Watch event: 24
Issue comment event: 1
Fork event: 2

Dependencies

requirements.txt pypi

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/cosmaadrian/multimodal-depression-from-video

Science Score: 23.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

Reading Between the 🎞️ Frames:
Multi-Modal Depression Detection in Videos from Non-Verbal Cues

Accepted at the 2024 edition of European Conference on Information Retrieval (ECIR).

TL;DR

📘 Introduction

🛠️ Data Preparation

Downloading the datasets

Extracting non-verbal modalities

Implementation Detail

Modality distributions for D-Vlog

💪 Training and Evaluation

📖 Citation

📝 License

Owner

GitHub Events

Total

Last Year

Dependencies

https://github.com/cosmaadrian/multimodal-depression-from-video

Science Score: 23.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

Reading Between the 🎞️ Frames:Multi-Modal Depression Detection in Videos from Non-Verbal Cues

Accepted at the 2024 edition of European Conference on Information Retrieval (ECIR).

TL;DR

📘 Introduction

🛠️ Data Preparation

Downloading the datasets

Extracting non-verbal modalities

Implementation Detail

Modality distributions for D-Vlog

💪 Training and Evaluation

📖 Citation

📝 License

Owner

GitHub Events

Total

Last Year

Dependencies

Reading Between the 🎞️ Frames:
Multi-Modal Depression Detection in Videos from Non-Verbal Cues