https://github.com/amir22010/unsupervisedqa

Unsupervised Question answering via Cloze Translation

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: Amir22010
License: other
Default Branch: master
Size: 32.2 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Fork of facebookresearch/UnsupervisedQA

Created almost 7 years ago · Last pushed almost 7 years ago

https://github.com/Amir22010/UnsupervisedQA/blob/master/

# UnsupervisedQA

Code, data and models supporting the experiments in the ACL 2019 paper: [Unsupervised Question Answering by Cloze Translation](https://arxiv.org/abs/1906.04980).

Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we take some of the first steps towards unsupervised QA, and develop an approach that, *without using the SQuAD training data at all*, achieves 56.4 F1 on SQuAD v1.1, and 64.5 F1 when the answer is a named entity mention.

This repository provides code to run pre-trained models to generate synthetic question answering data. We also make a very large synthetic training dataset for extractive question answering available.

## Dataset Downloads

We make available a dataset of 4 million SQuAD-like question answering datapoints, automatically generated by the unsupervised system described in the paper.

The data can be downloaded [here](https://dl.fbaipublicfiles.com/UnsupervisedQA/UnsupervisedQAData.tar.gz). The data is in the SQuAD v1 format, and contains:

| Fold | # Paragraphs | # QA pairs |
| :-----------------: | :-----------: | :-----------: |
| `unsupervised_qa_train.json` | 782,556 | 3,915,498 |
| `unsupervised_qa_dev.json` | 1,000 | 4,795 |
| `unsupervised_qa_test.json` | 1,000 | 4,804 |

Using this training data to fine-tune BERT-Large for reading comprehension will achieve over 50.0 F1 on the SQuAD v1.1 development set, using an appropriate early stopping strategy on the unsupervised_qa dev set.

## Models and Code

In addition to the above data, this repository provides functionality to generate synthetic training data from user-provided documents.

### Installation:

The code is built to run on top of [UnsupervisedMT](https://github.com/facebookresearch/UnsupervisedMT), and requires all of its dependencies. Additional requirements are [spaCy](https://spacy.io/) (for NER and noun chunking), [attrs](https://www.attrs.org/en/stable/), and [NLTK](https://www.nltk.org/) and [allennlp](https://github.com/allenai/allennlp) (for constituency parsing).
It was developed to run on Ubuntu Linux 18.04 and Python 3.7, with CUDA 9.

(Optionally) Create a conda environment to keep things clean:

```
conda create -n uqa37 python=3.7 && conda activate uqa37
```

The recommended way to install is shown below, which should install and handle all dependencies:

```
# clone the repo
git clone https://github.com/facebookresearch/UnsupervisedQA.git
cd UnsupervisedQA

# install python dependencies:
pip install -r requirements.txt

# install UnsupervisedMT and its dependencies
./install_tools.sh
```

### Models:

Four UNMT models are made available for download:

* Sentence Cloze boundaries, Noun Phrase Answers
* Sentence Cloze boundaries, Named Entity Answers
* Sub-clause Cloze boundaries, Named Entity Answers
* Sub-clause Cloze boundaries, Named Entity Answers, Wh Heuristics (best downstream performance)

The models can be downloaded using the script:

```
./download_models.sh
```

This will download all the models and unzip them to the appropriate directory. Each unzipped model is about 850MB, so the total space requirement is about 3.5GB.

### Usage:

You can generate reading comprehension training data using `unsupervisedqa.generate_synthetic_qa_data`.

This script will allow you to generate unsupervised question answering data using the `identity`, `noisy cloze` or `unsupervised NMT` methods explored in the paper, as well as specifying several different configurations (i.e. whether to use subclause shortening, whether to use named entity answers, and whether to use the wh heuristic).
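To give a feel for what a cloze is and what the `identity` baseline with the wh heuristic might produce, here is a minimal sketch. It is not the repository's implementation: the `MASK` token, the `WH_WORDS` table, and both function names are assumptions made purely for illustration, and the real label-to-wh-word mapping may differ.

```python
# Illustrative sketch only; this is NOT the code in unsupervisedqa.
MASK = "MASK"  # hypothetical mask token

# Hypothetical mapping from spaCy-style named entity labels to wh-words.
WH_WORDS = {
    "PERSON": "Who",
    "ORG": "Who",
    "GPE": "Where",
    "LOC": "Where",
    "DATE": "When",
    "TIME": "When",
}
DEFAULT_WH = "What"


def make_cloze(sentence: str, answer_start: int, answer_text: str) -> str:
    """Replace the answer span in `sentence` with the mask token."""
    answer_end = answer_start + len(answer_text)
    if sentence[answer_start:answer_end] != answer_text:
        raise ValueError("answer_text does not occur at answer_start")
    return sentence[:answer_start] + MASK + sentence[answer_end:]


def identity_question(cloze: str, answer_type: str) -> str:
    """Turn a cloze into a pseudo-question without any translation model."""
    wh_word = WH_WORDS.get(answer_type, DEFAULT_WH)
    # Keep the capital letter only when the wh-word opens the sentence.
    if not cloze.startswith(MASK):
        wh_word = wh_word.lower()
    # Swap the first mask for the wh-word and end with a question mark.
    return cloze.replace(MASK, wh_word, 1).rstrip(".") + "?"


sentence = "The Eiffel Tower was completed in 1889."
cloze = make_cloze(sentence, sentence.index("1889"), "1889")
print(cloze)
print(identity_question(cloze, "DATE"))
```

The `unmt` method replaces the last step with a learned cloze-to-question translation model, which is why it needs the downloaded models and a GPU.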
This script provides the following command line arguments:

```
usage: generate_synthetic_qa_data.py [-h] [--input_file_format {txt,jsonl}]
                                     [--output_file_formats OUTPUT_FILE_FORMATS]
                                     [--translation_method {identity,noisy_cloze,unmt}]
                                     [--use_subclause_clozes]
                                     [--use_named_entity_clozes]
                                     [--use_wh_heuristic]
                                     input_file output_file

Generate synthetic training data for extractive QA tasks without supervision

positional arguments:
  input_file            input file, see readme for formatting info
  output_file           Path to write generated data to, see readme for
                        formatting info

optional arguments:
  -h, --help            show this help message and exit
  --input_file_format {txt,jsonl}
                        input file format, see readme for more info, default
                        is txt
  --output_file_formats OUTPUT_FILE_FORMATS
                        comma-separated list of output file formats, from
                        [jsonl, squad], an output file will be created for
                        each format. Default is 'jsonl,squad'
  --translation_method {identity,noisy_cloze,unmt}
                        define the method to generate clozes -- either the
                        Unsupervised NMT method (unmt), or the identity or
                        noisy cloze baseline methods. UNMT is recommended for
                        downstream performance, but the noisy_cloze is
                        relatively strong on downstream QA and fast to
                        generate. Default is unmt
  --use_subclause_clozes
                        pass this flag to shorten clozes with constituency
                        parsing instead of using sentence boundaries
                        (recommended for downstream performance)
  --use_named_entity_clozes
                        pass this flag to use named entity answer prior
                        instead of noun phrases (recommended for downstream
                        performance)
  --use_wh_heuristic    pass this flag to use the wh-word heuristic
                        (recommended for downstream performance). Only
                        compatible with named entity clozes
```

The input format is specified by the `--input_file_format` argument. The input can be either a `.txt` file of paragraphs, one per line, from which questions and answers are generated, or a `.jsonl` file with each line containing a JSON-serialised dict of the format `{"text": text of paragraph, "paragraph_id" : your unique identifier for the paragraph}`.

The output format can be specified by the user using the `--output_file_formats` argument. The user can choose between the `jsonl` and `squad` formats. Requesting the `squad` format will output a file in the SQuAD v1.1 format, ready to be plugged into downstream extractive QA tasks. The `jsonl` format provides more metadata than the `squad` format. Its fields are explained below:

```
{
    "cloze_id": unique identifier for this datapoint
    "paragraph": data on the paragraph this datapoint was generated from
    "source_text": the text from the paragraph the cloze was generated from
    "source_start": character index in paragraph where "source_text" starts
    "cloze_text": the text of the cloze question the question is generated from
    "answer_text": the answer text of the (cloze) question
    "answer_start": the character index that the answer starts at in the paragraph
    "constituency_parse": the constituency parse of the "source_text" if available, otherwise null,
    "root_label": the node label of the root of the constituency parse if available, otherwise null,
    "answer_type": The named entity label of the answer (if using named entity clozes) otherwise "NOUNPHRASE"
    "question_text": the text of the natural question, translated from "cloze_text"
}
```

A working example to produce unsupervised NMT-translated questions, using the model trained with wh heuristics, named entity answers and subclause shortening, is below:

```
python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output \
    --input_file_format "txt" \
    --output_file_formats "jsonl,squad" \
    --translation_method unmt \
    --use_named_entity_clozes \
    --use_subclause_clozes \
    --use_wh_heuristic
```

### I'm running out of GPU memory

The repository requires a CUDA-enabled GPU (this is a requirement of UnsupervisedMT), but you can reduce the amount of GPU memory required by lowering the batch sizes. To do so, edit the `unsupervisedqa/configs.py` file and adjust `CONSTITUENCY_BATCH_SIZE` and `UNMT_BATCH_SIZE`.

### Training Your own question translation models

This repository only provides functionality to run the pre-trained unsupervised question translation models from the paper. Users who want to train new question translation models should use the training functionality in [UnsupervisedMT](https://github.com/facebookresearch/UnsupervisedMT), or consider the newer and more powerful [XLM](https://github.com/facebookresearch/XLM) repository.

To train question translation models in [UnsupervisedMT](https://github.com/facebookresearch/UnsupervisedMT), first prepare a large corpus of cloze questions (potentially using the functionality in this repository) and a large corpus of natural questions. Preprocess these corpora by adapting [UnsupervisedMT/NMT/get_data_enfr.sh](https://github.com/facebookresearch/UnsupervisedMT/blob/master/NMT/get_data_enfr.sh), and train using the example script in [UnsupervisedMT/README](https://github.com/facebookresearch/UnsupervisedMT#train-the-nmt-model), with appropriate edits to the args (e.g. en -> cloze and fr -> question) and paths.

## References

Please cite [[1]](https://arxiv.org/abs/1906.04980) and [[2]](https://arxiv.org/abs/1804.07755) if you found the resources in this repository useful.

### Unsupervised Question Answering by Cloze Translation

[1] P. Lewis, L. Denoyer, S. Riedel [*Unsupervised Question Answering by Cloze Translation*](https://arxiv.org/abs/1906.04980)

```
@inproceedings{lewis2019unsupervisedqa,
  title={Unsupervised Question Answering by Cloze Translation},
  author={Lewis, Patrick and Denoyer, Ludovic and Riedel, Sebastian},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year={2019}
}
```

### Phrase-Based & Neural Unsupervised Machine Translation

[2] G. Lample, M. Ott, A. Conneau, L. Denoyer, MA. Ranzato [*Phrase-Based & Neural Unsupervised Machine Translation*](https://arxiv.org/abs/1804.07755)

```
@inproceedings{lample2018phrase,
  title={Phrase-Based \& Neural Unsupervised Machine Translation},
  author={Lample, Guillaume and Ott, Myle and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}
```

## License

See the [LICENSE](LICENSE) file for more details.

## Troubleshooting

If you run into problems installing dependencies (particularly allennlp), installing libffi may help:

```
apt-get install libffi6 libffi-dev
```
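
For the `.jsonl` input format described under Usage, each line needs a `text` and a `paragraph_id` field. A minimal sketch for producing such a file from paragraphs already held in memory follows. The helper name `write_jsonl_input`, the `para-N` ID scheme, and the file name are hypothetical choices, not part of this repository.

```python
import json
from pathlib import Path
from typing import Iterable


def write_jsonl_input(paragraphs: Iterable[str], path: str) -> int:
    """Write one {"text", "paragraph_id"} JSON object per line; return the count."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for index, text in enumerate(paragraphs):
            text = text.strip()
            if not text:
                continue  # skip empty paragraphs
            # "para-N" is an assumed ID scheme; any unique identifier works.
            record = {"text": text, "paragraph_id": f"para-{index}"}
            # json.dumps escapes embedded newlines, so each record stays on one line.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    written = write_jsonl_input(
        ["First paragraph of a document.", "Second paragraph."],
        "my_input.jsonl",
    )
    print(f"wrote {written} paragraphs")
```

The resulting file can then be passed as `input_file` together with `--input_file_format jsonl`.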

Owner

Name: Amir Khan
Login: Amir22010
Kind: user
Location: India

Repositories: 3
Profile: https://github.com/Amir22010

Working on developing state-of-the-art AI solutions, mainly in the computer vision, chatbot and NLP domains. Building awesome AI as a professional developer 😍.


Science Score: 10.0%
