https://github.com/amazon-science/sc2qa-dril
Code for Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning, EMNLP 2021
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to arxiv.org)
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 11.8%, to scientific vocabulary)
Repository
Code for Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning, EMNLP 2021
Basic Info
- Host: GitHub
- Owner: amazon-science
- License: other
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/pdf/2109.04689.pdf
- Size: 218 KB
Statistics
- Stars: 4
- Watchers: 6
- Forks: 1
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning
This repository contains code for our EMNLP 2021 paper: Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning. Li Zhou, Kevin Small, Yong Zhang, Sandeep Atluri.
(SC)^2QA Dataset
We provide code and scripts to construct the public version of the (SC)^2QA dataset from Common Crawl's news data stream. Compared with the internal version of (SC)^2QA used in our paper, this public version is larger: it includes 50,441 question-article pairs and 65,332 {question, article, summary, length constraint} 4-tuples. We also provide an even larger question-article pairs dataset with 529,039 pairs.
For your convenience, we also provide the constructed dataset, which you can load directly with Hugging Face's load_dataset API.
```python
!pip install datasets
from datasets import load_dataset

sc2qa_dataset = load_dataset("sc2qa/sc2qa_commoncrawl")
```
This will load {Question, Article, Summary, Length Constraint} 4-tuples. However, if you only want question-article pairs (i.e., the output of step 1 below), you can do

```python
!pip install datasets
from datasets import load_dataset

sc2q_dataset = load_dataset("sc2qa/sc2q_commoncrawl")
```
We also provide an even larger question-article pairs dataset (without summaries), which includes articles from an expanded domain list.

```python
!pip install datasets
from datasets import load_dataset

sc2q_dataset_large = load_dataset("sc2qa/sc2q_commoncrawl_large")
```

You can skip the rest of this section if you use the load_dataset API above.
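Each example in the 4-tuple dataset pairs a question with an article, a summary (answer), and a length constraint. A small helper like the following can group examples by their constraint; note that the `length_constraint` field name below is an assumed column name, not verified against the dataset card:

```python
def group_by_length_constraint(examples):
    # Bucket {question, article, summary, length constraint} 4-tuples
    # by their length constraint. The "length_constraint" key is an
    # assumed field name for illustration.
    buckets = {}
    for ex in examples:
        buckets.setdefault(ex["length_constraint"], []).append(ex)
    return buckets
```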
The following steps construct the dataset from scratch. Appendix A of our paper describes each step in detail.
Install dependencies
bash
pip3 install -r requirements.txt
Step 1 Collect Question-Article Pairs
bash
cd CommonCrawl_Question_Mining
bash collect_qa.sh
This script downloads WARC files from 2019/01 to 2021/09 from Common Crawl's news data stream, and then filters the news articles with a set of rules defined in CommonCrawl_Question_Mining/collect_qa_step2.py.
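The repository's rule-based filter can be approximated by a minimal sketch. The exact rules live in the script above; the predicates below (question-word prefix, trailing `?`, minimum article length) are illustrative assumptions, not the repository's actual rule set:

```python
QUESTION_WORDS = ("what", "why", "how", "when", "where", "who",
                  "which", "can", "should", "is", "are", "do", "does")

def looks_like_question_headline(title: str) -> bool:
    # Illustrative stand-in for the headline rules: keep titles that
    # end with '?' and open with a common question word.
    t = title.strip()
    return t.endswith("?") and t.lower().startswith(QUESTION_WORDS)

def keep_pair(title: str, article: str, min_words: int = 100) -> bool:
    # Keep a (question, article) pair only if the headline is a
    # question and the article body is long enough to summarize.
    return looks_like_question_headline(title) and len(article.split()) >= min_words
```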
Step 2 Collect {Question, Article, Summary, Length Constraint} 4-Tuples as Training and Validation Set
bash
bash collect_qasl.sh
This script calls BART, PEGASUS, and CTRLSum models to generate length-constrained summaries for the articles from step 1, and then uses our pre-trained question-answering model to filter out summaries that are likely incorrect answers.
In total, there are 65,332 {Question, Article, Summary, Length Constraint} 4-tuples. We use the first 57,332 as the training set and the last 8,000 as the validation set.
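The fixed split above can be reproduced with a simple slice, assuming the 4-tuples are kept in their original order:

```python
TRAIN_SIZE, VAL_SIZE = 57_332, 8_000  # fixed split sizes from the README

def split_4tuples(examples):
    # First 57,332 4-tuples form the training set; the last 8,000
    # form the validation set.
    assert len(examples) == TRAIN_SIZE + VAL_SIZE, "expects all 65,332 4-tuples"
    return examples[:TRAIN_SIZE], examples[TRAIN_SIZE:]
```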
Step 3 Collect Articles as Test Set (Optional)
bash
bash collect_test_set_articles.sh
This script randomly samples 10,000 news articles from 2021/09 in Common Crawl's news data stream. There are no ground-truth questions for these articles.
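The sampling step amounts to drawing 10,000 distinct articles from the crawled pool; a minimal sketch (the function name and seed are illustrative, not taken from the script):

```python
import random

def sample_test_articles(articles, n=10_000, seed=0):
    # Draw n distinct articles (without replacement) to form an
    # unlabeled test set, mirroring the random sampling in the script.
    return random.Random(seed).sample(articles, n)
```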
D-S-DRIL Model
Install dependencies
bash
pip3 install -r requirements.txt
Train an Answer Generation Model using DRIL
bash
cd D-S-DRIL
bash scripts/train_ag.sh
Train an answer generation model using the DRIL method proposed in the paper. The model samples summaries (answers) during training and calculates gradients based on the question reconstruction loss.
We use Amazon EC2 p3dn.24xlarge GPU instances for training (8 GPUs, each with 32 GB of memory). Training takes about 8 hours. If you encounter a GPU out-of-memory error, consider reducing the batch size.
During training, model checkpoints are saved to ag_model_output/checkpoint-*. Each checkpoint folder has a trainer_state.json file recording the current best model checkpoint. At the end of training, the best model is saved to ag_model_output/. However, you may hit a GPU out-of-memory error at this final save; if so, manually copy the best checkpoint from ag_model_output/checkpoint-* to ag_model_output/, then copy the model configuration file from scripts/model_config/config.json to ag_model_output/config.json.
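That manual recovery can be scripted. The sketch below assumes the paths named in this README and the standard Hugging Face trainer_state.json layout (its best_model_checkpoint field); treat it as an illustration rather than part of the repository:

```python
import json
import shutil
from pathlib import Path

def promote_best_checkpoint(output_dir="ag_model_output",
                            config_src="scripts/model_config/config.json"):
    # Read the latest checkpoint's trainer_state.json to find the best
    # checkpoint, copy its files into output_dir, and restore the model
    # config -- the manual recovery described above.
    out = Path(output_dir)
    latest = max(out.glob("checkpoint-*"),
                 key=lambda p: int(p.name.split("-")[1]))
    state = json.loads((latest / "trainer_state.json").read_text())
    best = Path(state["best_model_checkpoint"])
    for f in best.iterdir():
        if f.is_file():
            shutil.copy2(f, out / f.name)
    shutil.copy2(config_src, out / "config.json")
```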
Train a Question Generation Model
bash
bash scripts/train_qg.sh
Train a question generation model that generates questions based on summaries of articles.
Inference
bash
bash scripts/inference.sh [validation|test]
Runs inference to generate question-answer pairs for articles in the validation set or the test set.
How to Cite
If you find this repository useful, please cite the following paper.
@inproceedings{zhou-etal-2021-generating,
title = "Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning",
author = "Zhou, Li and Small, Kevin and Zhang, Yong and Atluri, Sandeep",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
year = "2021",
pages = "5103--5135",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.416",
}
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Li Zhou | l****l@a****m | 3 |
| dependabot[bot] | 4****] | 1 |
| Amazon GitHub Automation | 5****o | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: about 1 hour
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 3
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- dependabot[bot] (3)