simmc2-multimodal_coreference_resolution
The second Situated Interactive MultiModal Conversations (SIMMC 2.0) Challenge 2021. Project focused on the second task, "Multimodal Coreference Resolution", carried out as the dissertation for the MPhil in Machine Learning and Machine Intelligence at the University of Cambridge.
https://github.com/alejandrosantorum/simmc2-multimodal_coreference_resolution
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org, ieee.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.3%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
SIMMC 2.0 Challenge 2021: Multimodal Coreference Resolution task
Project carried out as my dissertation for the MPhil in Machine Learning and Machine Intelligence at the University of Cambridge, awarded with a distinction (band A+).
An article based on this research, titled "Entity Resolution in Situated Dialog With Unimodal and Multimodal Transformers", was accepted and published in IEEE/ACM Transactions on Audio, Speech, and Language Processing. For more information, see the IEEE TASLP section below.
Multimodal Coreference Resolution is the task of identifying which words or expressions co-refer to the same entity within a given context. The context is defined using both visual (images, videos) and natural language (dialog, textual descriptions) modalities. This is a crucial problem because many visual agents must link coreferences (e.g. pronouns) to their corresponding referents (e.g. nouns or noun phrases) before solving their main task, such as object grounding or visual question answering.
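To make the task concrete, here is a toy illustration (not the project's code): resolving a pronoun in the latest user turn to object IDs mentioned earlier in the dialog. The data structure, object IDs, and the naive "most recently referenced objects" rule are all assumptions for demonstration only.

```python
def resolve_mentions(dialog_history, mention):
    """Return the object IDs a mention co-refers to, using a naive
    'most recently referenced objects' heuristic (illustrative only)."""
    for turn in reversed(dialog_history):
        if turn["referred_objects"]:
            return turn["referred_objects"]
    return []

# Hypothetical two-turn dialog: the user asks about jackets, the system replies.
history = [
    {"utterance": "Show me some red jackets.", "referred_objects": [12, 57]},
    {"utterance": "Sure, here are two options.", "referred_objects": []},
]

# "Do you have them in blue?" -> "them" resolves to the jackets shown earlier.
print(resolve_mentions(history, "them"))  # → [12, 57]
```

A real resolver must of course also use the visual scene, not just the dialog history; this sketch only shows the input/output shape of the task.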
The final version of the master's thesis can be found in the `thesis_report` directory.
Challenge overview
The second Situated Interactive MultiModal Conversations (SIMMC 2.0) Challenge 2021. This project focuses on the second subtask: Multimodal Coreference Resolution.
The official GitHub Repository of the challenge is published by Meta Research team.
The Multimodal Coreference Resolution subtask is proposed as part of the SIMMC 2.0 track of the Tenth Dialog System Technology Challenge (DSTC10).
Clone this repository to download the code and experiments:

```bash
$ git clone https://github.com/AlejandroSantorum/simmc2-Multimodal_Coreference_Resolution.git
```
Configuration (recommended)
Using a virtual environment is recommended to minimize the chance of conflicts.
Setting up a virtual environment: venv
In the cloned repository folder, create a virtual environment (you can change the name "env"):

```bash
python3 -m venv env
```

Activate the environment "env":

```bash
source env/bin/activate
```

Install the requirements using pip:

```bash
pip install -r requirements.txt
```

You may need to use pip3 instead:

```bash
pip3 install -r requirements.txt
```
Python requirements
The requirements to run the code are listed in requirements.txt:
* numpy - The fundamental package for scientific computing with Python.
* torch - Optimized tensor Python library for deep learning using GPUs and CPUs.
* transformers - State-of-the-art Machine Learning library for PyTorch, TensorFlow and JAX.
* tqdm - Python library to make loops show a smart progress meter.
* tensorboardX - Python library to watch tensors flow without Tensorflow.
Download the Dataset
The dataset is hosted in Meta's GitHub repository with Git LFS. The `data` folder contains the whole dataset and the instructions to download it.
Make sure to install and update Git LFS before cloning the repository:

```bash
$ git lfs install
```

```bash
$ git clone https://github.com/facebookresearch/simmc2.git
```

You may need to pull using Git LFS:

```bash
$ git lfs pull
```
The dataset can also be downloaded from the following link: data.zip.
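Once downloaded, the dialogs are plain JSON and easy to inspect. The sketch below parses a minimal snippet mimicking the SIMMC 2.0 dialog format; the field names (`dialogue_data`, `transcript`, `transcript_annotated`, `act_attributes`, `objects`) are assumptions based on the public dataset layout, and the real files are much larger.

```python
import json

# Minimal hand-written snippet mimicking the SIMMC 2.0 dialog JSON
# structure (field names assumed; the real splits contain thousands
# of dialogs with scene and annotation metadata).
sample = """
{
  "dialogue_data": [
    {
      "dialogue": [
        {
          "transcript": "Show me some red jackets.",
          "transcript_annotated": {"act_attributes": {"objects": [12, 57]}}
        }
      ]
    }
  ]
}
"""

data = json.loads(sample)
first_turn = data["dialogue_data"][0]["dialogue"][0]

# The coreference targets for a turn are the annotated object IDs.
print(first_turn["transcript_annotated"]["act_attributes"]["objects"])  # → [12, 57]
```

To load a real split, replace `json.loads(sample)` with `json.load(open(path))` on one of the files in the cloned `data` folder.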
Models
The folder `models` contains all the systems investigated in this project:
* `gpt2_baseline`: replication of the baseline model introduced in "SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations", using the code from the SIMMC 2.0 repository.
* `KAIST_BART_based`: replication of the model described in "Tackling Situated Multi-Modal Task-Oriented Dialogs with a Single Transformer Model", using the open source code of the original team.
* `bart_coref`: the previous system modified by removing the task heads not related to coreference resolution.
* `bart_only_coref`: the previous system modified by removing all task-specific heads except the one performing coreference resolution. Auxiliary task heads are also deleted. This is the main directory where the majority of experiments on the BART-based model are available and described.
* `uniter_based`: replication of the model described in "UNITER-Based Situated Coreference Resolution with Rich Multimodal Input", using the open source code of the original team. This is also the main directory where the majority of experiments on the UNITER-based model are available and described.
* `combined`: a simple system that combines the predictions of one of the best BART-based models with one of the best UNITER-based models, obtaining a performance boost from the ensemble of both solutions.
Main models, proposed modifications and improvements
- The BART-based model was the main system studied, since it won the DSTC10 challenge. First, we showed that the other task-specific heads did not benefit overall performance. Additionally, the auxiliary task heads (the empty_coref and attributes heads) proved unhelpful.
The proposed improvements consist of including textual descriptions of the objects in BART's input in the form of a list of attributes.
Moreover, including an auxiliary task head that predicts the number of referred objects in the last utterance boosted the overall performance of the model. The object predictions are adjusted according to the output of this auxiliary head using a set of heuristics.
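One plausible form of such a heuristic is sketched below; the exact rules used in the experiments live in the model directories, so treat this as an assumption about the idea, not the implementation: given the count k predicted by the auxiliary head, keep the k objects with the highest coreference scores.

```python
def apply_count_heuristic(object_scores, predicted_count):
    """Illustrative count-based heuristic (an assumption, not the
    project's exact rule): keep the `predicted_count` objects with
    the highest coreference scores.

    object_scores: dict mapping object ID -> coreference score.
    Returns the selected object IDs in ascending order."""
    ranked = sorted(object_scores, key=object_scores.get, reverse=True)
    return sorted(ranked[:predicted_count])

# Hypothetical per-object scores from the coreference head.
scores = {11: 0.92, 42: 0.35, 57: 0.81, 63: 0.10}

# Auxiliary head predicts 2 referred objects -> keep the top-2 by score.
print(apply_count_heuristic(scores, 2))  # → [11, 57]
```

With a plain 0.5 threshold the same scores would also select two objects here, but when the threshold and the predicted count disagree, the count can rescue low-scoring true referents or drop spurious high-scoring ones.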
- The UNITER-based model was studied with a focus on its multimodal properties. First, we showed that using object IDs was not necessary, and overall performance increased after removing them.
Similarly to the BART system, including an auxiliary task head that predicts the number of referred objects in the last utterance boosted the overall performance of the model. The object predictions are again adjusted according to the output of this auxiliary head using a set of heuristics.
- Model combination: the UNITER- and BART-based models are combined, since the UNITER-based model is better at identifying referred objects previously mentioned in the conversation context, whereas the BART-based model excels at recognizing items referred to only in the last user utterance. Each model focuses on the set of objects matching its strengths.
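A minimal sketch of this combination rule, assuming set-valued predictions per turn and a precomputed set of previously mentioned objects (the exact routing logic in the `combined` directory may differ): UNITER's predictions are trusted for objects from the dialog history, BART's for objects new to the last utterance.

```python
def combine_predictions(bart_pred, uniter_pred, previously_mentioned):
    """Illustrative ensemble rule (an assumption based on the described
    strengths of each model). All arguments are sets of object IDs;
    `previously_mentioned` holds objects referenced before the last
    user utterance. Returns the combined prediction, sorted."""
    from_history = uniter_pred & previously_mentioned   # trust UNITER here
    from_last_turn = bart_pred - previously_mentioned   # trust BART here
    return sorted(from_history | from_last_turn)

# Hypothetical turn: objects 8 and 9 appeared earlier in the dialog.
print(combine_predictions({3, 14}, {3, 8}, previously_mentioned={8, 9}))
# → [3, 8, 14]
```

Object 8 survives because UNITER predicted it and it is in the history; objects 3 and 14 survive because BART predicted them as new mentions.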
Main results
| Model description | Author | Overall F1 Score |
| :---------- | :------------ | :-------: |
| UNITER-based model submitted to DSTC10 | Y. Huang et al. (NYU Shanghai) | 0.674 |
| UNITER-based model improved post-evaluation | Y. Huang et al. (NYU Shanghai) | 0.728 |
| BART-based model submitted to DSTC10 | H. Lee et al. (KAIST AI Lab) | 0.743 |
| UNITER-based model post-evaluation + removing obj. IDs | A. Santorum (Cambridge University) | 0.758 |
| UNITER-based model post-eval + noIDs + auxiliary task head | A. Santorum (Cambridge University) | 0.761 |
| BART-based model + non-visual input attributes | A. Santorum (Cambridge University) | 0.760 |
| BART-based model + visual+non-visual input attributes | A. Santorum (Cambridge University) | 0.771 |
| BART-based model + auxiliary task head | A. Santorum (Cambridge University) | 0.752 |
| BART-based model + aux. task head + all input attrs. | A. Santorum (Cambridge University) | 0.775 |
| Model combination (version legal on DSTC10) | A. Santorum (Cambridge University) | 0.800 |
| Model combination (overpowered w.r.t. DSTC10 rules) | A. Santorum (Cambridge University) | 0.806 |
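The F1 scores above compare predicted and ground-truth object ID sets per turn. A minimal micro-averaged version of such a metric is sketched below; the official SIMMC 2.0 evaluation script may aggregate differently, so this is for intuition only.

```python
def object_f1(predictions, references):
    """Micro-averaged F1 over per-turn object ID sets (illustrative;
    the official SIMMC 2.0 scorer may differ in aggregation).

    predictions, references: parallel lists of iterables of object IDs."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)   # correctly predicted objects
        fp += len(pred - ref)   # predicted but not referred
        fn += len(ref - pred)   # referred but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two hypothetical turns: one missed object (63) in the first turn.
preds = [[11, 57], [3]]
refs = [[11, 57, 63], [3]]
print(round(object_f1(preds, refs), 3))  # → 0.857
```

Here precision is 1.0 (no false positives) and recall is 0.75 (3 of 4 referred objects found), giving F1 = 2(1.0)(0.75)/1.75 ≈ 0.857.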
License
The project is licensed under the BSD 3-Clause License. A copy of the license is provided along with the code.
Acknowledgements
I would like to thank my two supervisors, Dr. Svetlana Stoyanchev and Prof. Kate Knill, for their countless suggestions, guidance and support throughout the whole project. I am truly grateful to them for inspiring and teaching me during these last months. I would also like to thank Dr. Simon Keizer and Dr. Rama Doddipatla for their time and insightful discussions at every meeting. I am highly thankful to Toshiba Europe Ltd. for the computing resources provided during the internship, and to the Cambridge University Engineering Department for all its support over the last year.
Attribution
This project has been developed by Alejandro Santorum Varela (2022) as part of the dissertation for the MPhil in Machine Learning and Machine Intelligence at the University of Cambridge, supervised by Dr. Svetlana Stoyanchev (Toshiba Europe Ltd.) and Dr. Kate Knill (Cambridge University Engineering Department).
If you happen to use the open sourced code of this project in your work or research, please cite it:
A. Santorum, "Multimodal Coreference Resolution", https://github.com/AlejandroSantorum/simmc2-MultimodalCoreferenceResolution, 2022. MLMI MPhil dissertation at the University of Cambridge.
The corresponding BibTex entry is
@misc{santorum2022multimodal,
    author = {Alejandro Santorum},
    title = {Multimodal Coreference Resolution},
    year = {2022},
    howpublished = {\url{https://github.com/AlejandroSantorum/simmc2-Multimodal_Coreference_Resolution}},
    note = {MLMI MPhil dissertation at the University of Cambridge}
}
Article in IEEE TASLP
An article based on the work of this project was written and accepted in IEEE/ACM Transactions on Audio, Speech, and Language Processing, with the title "Entity Resolution in Situated Dialog With Unimodal and Multimodal Transformers". The paper is available at ieeexplore.ieee.org.
The BibTex entry is
@article{santorum2023entity,
    author = {Alejandro Santorum and Svetlana Stoyanchev and Simon Keizer and Rama Doddipatla and Kate Knill},
    title = {Entity Resolution in Situated Dialog with Unimodal and Multimodal Transformers},
    journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    year = {2023},
    doi = {10.1109/TASLP.2023.3304468},
    volume = {32},
    pages = {704--713}
}
Owner
- Name: Alejandro Santorum
- Login: AlejandroSantorum
- Kind: user
- Location: London (UK)
- Website: https://www.linkedin.com/in/alejandro-santorum/
- Repositories: 16
- Profile: https://github.com/AlejandroSantorum
Data Engineer. MPhil in Machine Learning @ University of Cambridge. Computer Science & Mathematics graduate.