https://github.com/amazon-science/question-answering-nlu

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.6%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: amazon-science
License: other
Language: Python
Default Branch: main
Size: 22.5 KB

Statistics

Stars: 7
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 0

Created almost 5 years ago · Last pushed over 4 years ago

Metadata Files


Question Answering NLU

Question Answering NLU (QANLU) is an approach that maps the NLU task into question answering, leveraging pre-trained question-answering models to perform well in few-shot settings. Instead of training an intent classifier or a slot tagger, for example, we can ask the model intent- and slot-related questions in natural language:

```
Context : I'm looking for a cheap flight to Boston.

Question: Is the user looking to book a flight?
Answer  : Yes

Question: Is the user asking about departure time?
Answer  : No

Question: What price is the user looking for?
Answer  : cheap

Question: Where is the user flying from?
Answer  : (empty)
```

Thus, by asking questions for each intent and slot in natural language, we can effectively construct an NLU hypothesis. For more details, please read the paper: Language model is all you need: Natural language understanding as question answering.
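As a rough illustration of how the answers to such questions could be assembled into an NLU hypothesis, here is a minimal sketch. The label names (`book_flight`, `from_city`, and so on) and the `build_hypothesis` helper are hypothetical and are not part of this repository; the sketch also assumes that intent questions are answered with "Yes" or "No" and that an unanswered slot question yields an empty string.

```python
# Hypothetical question inventory; label names and wording are illustrative only.
INTENT_QUESTIONS = {
    "book_flight": "Is the user looking to book a flight?",
    "departure_time": "Is the user asking about departure time?",
}
SLOT_QUESTIONS = {
    "price": "What price is the user looking for?",
    "from_city": "Where is the user flying from?",
}


def build_hypothesis(answers):
    """Turn a {question: answer} mapping into intents and slots.

    An intent is kept when its question was answered "Yes"; a slot is kept
    when its question received a non-empty answer.
    """
    intents = [
        name
        for name, question in INTENT_QUESTIONS.items()
        if answers.get(question, "").strip().lower() == "yes"
    ]
    slots = {
        name: answers[question].strip()
        for name, question in SLOT_QUESTIONS.items()
        if answers.get(question, "").strip()
    }
    return {"intents": intents, "slots": slots}


# Answers copied from the example above; "" stands for the empty answer.
example_answers = {
    "Is the user looking to book a flight?": "Yes",
    "Is the user asking about departure time?": "No",
    "What price is the user looking for?": "cheap",
    "Where is the user flying from?": "",
}
hypothesis = build_hypothesis(example_answers)
print(hypothesis)
```

In practice the answers would come from a question-answering model rather than a hand-written dictionary.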

This repository contains code to transform MATIS++ NLU data (e.g. utterances and intent / slot annotations) into SQuAD 2.0 format question-answering data that can be used by QANLU. MATIS++ includes the original English version of ATIS and a translation into eight languages: German, Spanish, French, Japanese, Hindi, Portuguese, Turkish, and Chinese.

To create a SQuAD-style dataset, we first need to create a list of questions for each intent and a list of questions for each slot. Questions in English are saved in the MATIS_questions.json file. In order to parse data in languages other than English, you need to provide questions in that language (or translate the English questions we provide in this repository).
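To make the target format concrete, the following sketch builds one SQuAD 2.0 style paragraph from an utterance and a list of slot questions. It is not the logic of atis.py: the `to_squad_paragraph` helper, the simplified `slot_values` annotation format, and the question layout are assumptions made for illustration, and intent questions are left out. The field names (`context`, `qas`, `answers`, `answer_start`, `is_impossible`) follow the public SQuAD 2.0 JSON layout.

```python
import json


def to_squad_paragraph(utterance, slot_values, slot_questions, id_prefix):
    """Build one SQuAD 2.0 paragraph for an utterance.

    slot_values maps slot name -> value text found in the utterance
    (a hypothetical, simplified annotation format). Slots without a value
    in the utterance become unanswerable questions (is_impossible=True).
    """
    qas = []
    for slot, questions in slot_questions.items():
        value = slot_values.get(slot)
        start = utterance.find(value) if value else -1
        answerable = start >= 0
        for i, question in enumerate(questions):
            qas.append({
                "id": f"{id_prefix}-{slot}-{i}",
                "question": question,
                "is_impossible": not answerable,
                "answers": (
                    [{"text": value, "answer_start": start}] if answerable else []
                ),
            })
    return {"context": utterance, "qas": qas}


# Illustrative inputs; the question wording is taken from the example above.
utterance = "I'm looking for a cheap flight to Boston."
slot_questions = {
    "price": ["What price is the user looking for?"],
    "from_city": ["Where is the user flying from?"],
}
paragraph = to_squad_paragraph(utterance, {"price": "cheap"}, slot_questions, "utt0")
dataset = {"version": "v2.0", "data": [{"title": "example", "paragraphs": [paragraph]}]}
print(json.dumps(dataset, indent=2))
```

The unanswerable `from_city` question is what SQuAD 2.0 calls a negative example.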

While we can have a number of questions for each intent and slot, sometimes QANLU will perform better if it sees one question per intent and slot. We control this with the optional --single_q argument. If you call the atis.py script with that argument, only the first question in the list is chosen for each intent and slot. Otherwise, all questions for each intent and slot are used.
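The effect of the flag can be pictured with a small sketch. The `select_questions` helper and the example questions are hypothetical; the sketch only mirrors the behaviour described above and is not taken from atis.py.

```python
def select_questions(questions_by_label, single_q=False):
    """Return the questions to use for each intent or slot label.

    With single_q=True only the first question of each list is kept;
    otherwise every question is kept.
    """
    return {
        label: questions[:1] if single_q else list(questions)
        for label, questions in questions_by_label.items()
    }


# Illustrative question lists (not the contents of MATIS_questions.json).
questions_by_label = {
    "price": ["What price is the user looking for?", "How much does the user want to pay?"],
    "from_city": ["Where is the user flying from?"],
}
print(select_questions(questions_by_label, single_q=True))
print(select_questions(questions_by_label, single_q=False))
```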

Run the following to parse MATIS NLU data into SQuAD:

    python atis.py \
        --data_path <path to the data> \
        --languages <de,en,es,fr,ja,hi,pt,tr,zh> \
        --qas_file <path to intent and slot questions json file> \
        --output_dir <path to where output files are stored> \
        [--single_q]

The output of this process is in SQuAD format and can be used to train question-answering models. The next step is to train a question-answering model; see here for a guide. Alternatively, you can download a QA model trained on SQuAD-v2 directly from huggingface here and fine-tune it with the MATIS++ NLU data parsed into SQuAD format. Note that the model must be trained on SQuAD-v2 in order to support negative examples.

A QANLU model trained using SQuAD-v2 and MATIS++ (English) is also available from huggingface here.

To calculate precision, recall, and F1 for predictions made on QANLU test sets by the fine-tuned question-answering model, call:

python calculate_pr.py \ --pred_file <full path to the predictions file created by transformers> \ --test_file <full path to the test file that the predictions are for>
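For readers unfamiliar with these metrics, the sketch below shows one common way to compute them from predicted and gold answers keyed by question id. It assumes exact string matching and treats an empty string as "no answer"; the matching rules actually used by calculate_pr.py may differ, and the `precision_recall_f1` helper and the example data are hypothetical.

```python
def precision_recall_f1(predictions, gold):
    """Compute precision, recall, and F1 over non-empty answers.

    predictions and gold map question id -> answer text, where "" means
    that no answer was predicted (or that the question is unanswerable).
    """
    true_pos = sum(
        1 for qid, pred in predictions.items() if pred and pred == gold.get(qid, "")
    )
    predicted = sum(1 for pred in predictions.values() if pred)
    relevant = sum(1 for answer in gold.values() if answer)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative data: two correct answers, one spurious answer, one miss.
gold = {"q1": "cheap", "q2": "", "q3": "Boston", "q4": "monday"}
predictions = {"q1": "cheap", "q2": "Denver", "q3": "Boston", "q4": ""}
precision, recall, f1 = precision_recall_f1(predictions, gold)
print(precision, recall, f1)
```

F1 here is the harmonic mean of precision and recall.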

Example

In this example, we show how to train QANLU on English MATIS (i.e. the original ATIS). We assume that MATIS has been downloaded into a folder called MATIS in the root directory of this repository.

The first step is to convert the data into SQuAD format:

```
mkdir data

python atis.py \
    --data_path MATIS/data/train_dev_test \
    --languages en \
    --qas_file MATIS_questions.json \
    --output_dir data
```

The next step is to fine-tune a SQuAD-trained QA model on the data we just created. For this example, we will use the deepset/roberta-base-squad2 model from huggingface. To do the fine-tuning, we will use the run_squad.py script from here (assuming 8 GPUs present):

```
mkdir models

python -m torch.distributed.launch --nproc_per_node=8 run_squad.py \
    --model_type roberta \
    --model_name_or_path deepset/roberta-base-squad2 \
    --do_train \
    --do_eval \
    --do_lower_case \
    --train_file data/matis_en_train_squad.json \
    --predict_file data/matis_en_test_squad.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 64 \
    --output_dir models/qanlu/ \
    --per_gpu_train_batch_size 8 \
    --overwrite_output_dir \
    --version_2_with_negative \
    --save_steps 100000 \
    --gradient_accumulation_steps 8 \
    --seed $RANDOM
```

Once the model is fine-tuned with MATIS++ data, it will be saved in models/qanlu/. The final step is to calculate performance metrics:

python calculate_pr.py --pred_file models/qanlu/predictions_.json --test_file data/matis_en_test.json >> results_matis_en.txt

The output should look like this:

    atis_en.txt
    Precision: 0.9613439306358381
    Recall: 0.9582283039250991
    F1: 0.9597835888187556
    Results: {'slot': {'Precision': 0.9613439306358381, 'Recall': 0.9582283039250991, 'F1': 0.9597835888187556}}

Citation

If you use this work, please cite:

    @inproceedings{namazifar2021language,
      title={Language model is all you need: Natural language understanding as question answering},
      author={Namazifar, Mahdi and Papangelis, Alexandros and Tur, Gokhan and Hakkani-T{\"u}r, Dilek},
      booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
      pages={7803--7807},
      year={2021},
      organization={IEEE}
    }

Security

See CONTRIBUTING for more information.

License

This library is licensed under the CC BY NC License.

Owner

Name: Amazon Science
Login: amazon-science
Kind: organization

Website: https://amazon.science
Twitter: AmazonScience
Repositories: 80
Profile: https://github.com/amazon-science

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
