simmc2-multimodal_coreference_resolution
The second Situated Interactive MultiModal Conversations (SIMMC 2.0) Challenge 2021. Project focused on the second task, "Multimodal Coreference Resolution", carried out as the dissertation for the MPhil in Machine Learning and Machine Intelligence at the University of Cambridge.
https://github.com/alejandrosantorum/simmc2-multimodal_coreference_resolution
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org, ieee.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.3%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
SIMMC 2.0 Challenge 2021: Multimodal Coreference Resolution task
Project carried out as my dissertation for the MPhil in Machine Learning and Machine Intelligence at the University of Cambridge, awarded with a distinction (band A+).
An article based on this research, titled "Entity Resolution in Situated Dialog With Unimodal and Multimodal Transformers", was accepted and published in IEEE/ACM Transactions on Audio, Speech, and Language Processing. For more information, see the IEEE TASLP section below.
Multimodal Coreference Resolution is the task of identifying which words or expressions co-refer to the same entity within a given context. The context is defined using both visual (images, videos) and natural language (dialog, textual descriptions) modalities. This is a crucial problem because many visual agents must link coreferences (e.g. pronouns) to their corresponding referents (e.g. nouns or noun phrases) before solving their main task, such as object grounding or visual question answering.
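To make the task concrete, here is a toy illustration (not the project's code): resolving a pronoun in the latest user turn to object IDs mentioned earlier in the dialog. The data structure, object IDs, and the naive "most recently referenced objects" rule are all assumptions for demonstration only.

```python
def resolve_mentions(dialog_history, mention):
    """Return the object IDs a mention co-refers to, using a naive
    'most recently referenced objects' heuristic (illustrative only)."""
    for turn in reversed(dialog_history):
        if turn["referred_objects"]:
            return turn["referred_objects"]
    return []

# Hypothetical two-turn dialog: the user asks about jackets, the system replies.
history = [
    {"utterance": "Show me some red jackets.", "referred_objects": [12, 57]},
    {"utterance": "Sure, here are two options.", "referred_objects": []},
]

# "Do you have them in blue?" -> "them" resolves to the jackets shown earlier.
print(resolve_mentions(history, "them"))  # → [12, 57]
```

A real resolver must of course also use the visual scene, not just the dialog history; this sketch only shows the input/output shape of the task.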
The final version of the master's thesis can be found in the `thesis_report` directory.
Challenge overview
The second Situated Interactive MultiModal Conversations (SIMMC 2.0) Challenge 2021. This project focuses on the second subtask: Multimodal Coreference Resolution.
The official GitHub Repository of the challenge is published by Meta Research team.
The Multimodal Coreference Resolution subtask is proposed as part of the SIMMC 2.0 track of the Tenth Dialog System Technology Challenge (DSTC10).
Clone this repository to download the code and experiments:

```bash
$ git clone https://github.com/AlejandroSantorum/simmc2-Multimodal_Coreference_Resolution.git
```
Configuration (recommended)
Using a virtual environment is recommended to minimize the chance of conflicts.
Setting up a virtual environment: venv
In the cloned repository folder, create a virtual environment (you can change the name "env"):

```bash
python3 -m venv env
```

Activate the environment "env":

```bash
source env/bin/activate
```

Install the requirements using pip:

```bash
pip install -r requirements.txt
```

You may need to use pip3 instead:

```bash
pip3 install -r requirements.txt
```
Python requirements
The requirements to run the code are listed in requirements.txt:
* numpy - The fundamental package for scientific computing with Python.
* torch - Optimized tensor Python library for deep learning using GPUs and CPUs.
* transformers - State-of-the-art Machine Learning library for PyTorch, TensorFlow and JAX.
* tqdm - Python library to make loops show a smart progress meter.
* tensorboardX - Python library to watch tensors flow without Tensorflow.
Download the Dataset
The dataset is hosted in Meta's GitHub repository with Git LFS. The `data` folder contains the whole dataset and the instructions to download it.
Make sure to install and update Git LFS before cloning the repository:

```bash
$ git lfs install
```

```bash
$ git clone https://github.com/facebookresearch/simmc2.git
```

You may need to pull using Git LFS:

```bash
$ git lfs pull
```
The dataset can also be downloaded from the following link: data.zip.
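Once downloaded, the dialogs are plain JSON and easy to inspect. The sketch below parses a minimal snippet mimicking the SIMMC 2.0 dialog format; the field names (`dialogue_data`, `transcript`, `transcript_annotated`, `act_attributes`, `objects`) are assumptions based on the public dataset layout, and the real files are much larger.

```python
import json

# Minimal hand-written snippet mimicking the SIMMC 2.0 dialog JSON
# structure (field names assumed; the real splits contain thousands
# of dialogs with scene and annotation metadata).
sample = """
{
  "dialogue_data": [
    {
      "dialogue": [
        {
          "transcript": "Show me some red jackets.",
          "transcript_annotated": {"act_attributes": {"objects": [12, 57]}}
        }
      ]
    }
  ]
}
"""

data = json.loads(sample)
first_turn = data["dialogue_data"][0]["dialogue"][0]

# The coreference targets for a turn are the annotated object IDs.
print(first_turn["transcript_annotated"]["act_attributes"]["objects"])  # → [12, 57]
```

To load a real split, replace `json.loads(sample)` with `json.load(open(path))` on one of the files in the cloned `data` folder.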
Models
The folder `models` contains all the systems investigated in this project:
* `gpt2_baseline`: replication of the baseline model introduced in "SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations", using the code from the SIMMC 2.0 repository.
* `KAIST_BART_based`: replication of the model described in "Tackling Situated Multi-Modal Task-Oriented Dialogs with a Single Transformer Model", using the open source code of the original team.
* `bart_coref`: the previous system modified by removing the task heads not related to coreference resolution.
* `bart_only_coref`: the previous system modified by removing all task-specific heads except the one performing coreference resolution. Auxiliary task heads are also deleted. This is the main directory where the majority of experiments on the BART-based model are available and described.
* `uniter_based`: replication of the model described in "UNITER-Based Situated Coreference Resolution with Rich Multimodal Input", using the open source code of the original team. This is also the main directory where the majority of experiments on the UNITER-based model are available and described.
* `combined`: a simple system that combines the predictions of one of the best BART-based models with one of the best UNITER-based models, obtaining a performance boost from the ensemble of both solutions.
Main models, proposed modifications and improvements
- The BART-based model was the main system studied, since it won the DSTC10 challenge. First, we showed that the other task-specific heads did not benefit overall performance. Additionally, the auxiliary task heads (the empty_coref and attributes heads) proved unhelpful.
The proposed improvements consist of including textual descriptions of the objects in BART's input in the form of a list of attributes.
Moreover, including an auxiliary task head that predicts the number of referred objects in the last utterance boosted the overall performance of the model. The object predictions are adjusted according to the output of this auxiliary head using a set of heuristics.
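One plausible form of such a heuristic is sketched below; the exact rules used in the experiments live in the model directories, so treat this as an assumption about the idea, not the implementation: given the count k predicted by the auxiliary head, keep the k objects with the highest coreference scores.

```python
def apply_count_heuristic(object_scores, predicted_count):
    """Illustrative count-based heuristic (an assumption, not the
    project's exact rule): keep the `predicted_count` objects with
    the highest coreference scores.

    object_scores: dict mapping object ID -> coreference score.
    Returns the selected object IDs in ascending order."""
    ranked = sorted(object_scores, key=object_scores.get, reverse=True)
    return sorted(ranked[:predicted_count])

# Hypothetical per-object scores from the coreference head.
scores = {11: 0.92, 42: 0.35, 57: 0.81, 63: 0.10}

# Auxiliary head predicts 2 referred objects -> keep the top-2 by score.
print(apply_count_heuristic(scores, 2))  # → [11, 57]
```

With a plain 0.5 threshold the same scores would also select two objects here, but when the threshold and the predicted count disagree, the count can rescue low-scoring true referents or drop spurious high-scoring ones.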
- The UNITER-based model was studied with a focus on its multimodal properties. First, we showed that using object IDs was not necessary, and overall performance increased after removing them.
Similarly to the BART system, including an auxiliary task head that predicts the number of referred objects in the last utterance boosted the overall performance of the model. The object predictions are again adjusted according to the output of this auxiliary head using a set of heuristics.
- Model combination: the UNITER- and BART-based models are combined, since the UNITER-based model is better at identifying referred objects previously mentioned in the conversation context, whereas the BART-based model excels at recognizing items referred to only in the last user utterance. Each model focuses on the set of objects matching its strengths.
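A minimal sketch of this combination rule, assuming set-valued predictions per turn and a precomputed set of previously mentioned objects (the exact routing logic in the `combined` directory may differ): UNITER's predictions are trusted for objects from the dialog history, BART's for objects new to the last utterance.

```python
def combine_predictions(bart_pred, uniter_pred, previously_mentioned):
    """Illustrative ensemble rule (an assumption based on the described
    strengths of each model). All arguments are sets of object IDs;
    `previously_mentioned` holds objects referenced before the last
    user utterance. Returns the combined prediction, sorted."""
    from_history = uniter_pred & previously_mentioned   # trust UNITER here
    from_last_turn = bart_pred - previously_mentioned   # trust BART here
    return sorted(from_history | from_last_turn)

# Hypothetical turn: objects 8 and 9 appeared earlier in the dialog.
print(combine_predictions({3, 14}, {3, 8}, previously_mentioned={8, 9}))
# → [3, 8, 14]
```

Object 8 survives because UNITER predicted it and it is in the history; objects 3 and 14 survive because BART predicted them as new mentions.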
Main results
| Model description | Author | Overall F1 Score |
| :---------- | :------------ | :-------: |
| UNITER-based model submitted to DSTC10 | Y. Huang et al. (NYU Shanghai) | 0.674 |
| UNITER-based model improved post-evaluation | Y. Huang et al. (NYU Shanghai) | 0.728 |
| BART-based model submitted to DSTC10 | H. Lee et al. (KAIST AI Lab) | 0.743 |
| UNITER-based model post-evaluation + removing obj. IDs | A. Santorum (Cambridge University) | 0.758 |
| UNITER-based model post-eval + noIDs + auxiliary task head | A. Santorum (Cambridge University) | 0.761 |
| BART-based model + non-visual input attributes | A. Santorum (Cambridge University) | 0.760 |
| BART-based model + visual+non-visual input attributes | A. Santorum (Cambridge University) | 0.771 |
| BART-based model + auxiliary task head | A. Santorum (Cambridge University) | 0.752 |
| BART-based model + aux. task head + all input attrs. | A. Santorum (Cambridge University) | 0.775 |
| Model combination (version legal on DSTC10) | A. Santorum (Cambridge University) | 0.800 |
| Model combination (overpowered w.r.t. DSTC10 rules) | A. Santorum (Cambridge University) | 0.806 |
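The F1 scores above compare predicted and ground-truth object ID sets per turn. A minimal micro-averaged version of such a metric is sketched below; the official SIMMC 2.0 evaluation script may aggregate differently, so this is for intuition only.

```python
def object_f1(predictions, references):
    """Micro-averaged F1 over per-turn object ID sets (illustrative;
    the official SIMMC 2.0 scorer may differ in aggregation).

    predictions, references: parallel lists of iterables of object IDs."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)   # correctly predicted objects
        fp += len(pred - ref)   # predicted but not referred
        fn += len(ref - pred)   # referred but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two hypothetical turns: one missed object (63) in the first turn.
preds = [[11, 57], [3]]
refs = [[11, 57, 63], [3]]
print(round(object_f1(preds, refs), 3))  # → 0.857
```

Here precision is 1.0 (no false positives) and recall is 0.75 (3 of 4 referred objects found), giving F1 = 2(1.0)(0.75)/1.75 ≈ 0.857.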
License
The project is licensed under the BSD 3-Clause License. A copy of the license is provided along with the code.
Acknowledgements
I would like to thank my two supervisors, Dr. Svetlana Stoyanchev and Prof. Kate Knill, for their countless suggestions, guidance and support throughout the whole project. I am truly grateful to them for inspiring and teaching me during these last months. I would also like to thank Dr. Simon Keizer and Dr. Rama Doddipatla for their time and insightful discussions at every meeting. I am highly thankful to Toshiba Europe Ltd. for the computing resources provided during the internship, and to the Cambridge University Engineering Department for all its support over the last year.
Attribution
This project has been developed by Alejandro Santorum Varela (2022) as part of the dissertation for the MPhil in Machine Learning and Machine Intelligence at the University of Cambridge, supervised by Dr. Svetlana Stoyanchev (Toshiba Europe Ltd.) and Dr. Kate Knill (Cambridge University Engineering Department).
If you happen to use the open sourced code of this project in your work or research, please cite it:
A. Santorum, "Multimodal Coreference Resolution", https://github.com/AlejandroSantorum/simmc2-MultimodalCoreferenceResolution, 2022. MLMI MPhil dissertation at the University of Cambridge.
The corresponding BibTex entry is
@misc{santorum2022multimodal,
    author = {Alejandro Santorum},
    title = {Multimodal Coreference Resolution},
    year = {2022},
    howpublished = {\url{https://github.com/AlejandroSantorum/simmc2-Multimodal_Coreference_Resolution}},
    note = {MLMI MPhil dissertation at the University of Cambridge}
}
Article in IEEE TASLP
An article based on the work of this project was written and accepted in IEEE/ACM Transactions on Audio, Speech, and Language Processing, with the title "Entity Resolution in Situated Dialog With Unimodal and Multimodal Transformers". The paper is available at ieeexplore.ieee.org.
The BibTex entry is
@article{santorum2023entity,
    author = {Alejandro Santorum and Svetlana Stoyanchev and Simon Keizer and Rama Doddipatla and Kate Knill},
    title = {Entity Resolution in Situated Dialog with Unimodal and Multimodal Transformers},
    journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    year = {2023},
    doi = {10.1109/TASLP.2023.3304468},
    volume = {32},
    pages = {704--713}
}
Owner
- Name: Alejandro Santorum
- Login: AlejandroSantorum
- Kind: user
- Location: London (UK)
- Website: https://www.linkedin.com/in/alejandro-santorum/
- Repositories: 16
- Profile: https://github.com/AlejandroSantorum
Data Engineer. MPhil in Machine Learning @ University of Cambridge. Computer Science & Mathematics graduate.