https://github.com/amazon-science/sc2qa-dril
Code for Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning, EMNLP 2021
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to arxiv.org)
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 11.8%, to scientific vocabulary)
Repository
Code for Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning, EMNLP 2021
Basic Info
- Host: GitHub
- Owner: amazon-science
- License: other
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/pdf/2109.04689.pdf
- Size: 218 KB
Statistics
- Stars: 4
- Watchers: 6
- Forks: 1
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning
This repository contains code for our EMNLP 2021 paper: Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning. Li Zhou, Kevin Small, Yong Zhang, Sandeep Atluri.
(SC)^2QA Dataset
We provide code and scripts to construct the public version of the (SC)^2QA dataset from Common Crawl's news data stream. Compared with the internal version of (SC)^2QA used in our paper, this public version is larger: it includes 50,441 question-article pairs and 65,332 {question, article, summary, length constraint} 4-tuples. We also provide an even larger question-article pairs dataset with 529,039 pairs.
For your convenience, we also provide the constructed dataset, which you can load directly with Hugging Face's load_dataset API.
```python
!pip install datasets
from datasets import load_dataset

sc2qa_dataset = load_dataset("sc2qa/sc2qa_commoncrawl")
```
This will load {Question, Article, Summary, Length Constraint} 4-tuples. However, if you only want question-article pairs (i.e., the output of step 1 below), you can do

```python
!pip install datasets
from datasets import load_dataset

sc2q_dataset = load_dataset("sc2qa/sc2q_commoncrawl")
```
We also provide an even larger question-article pairs dataset (without summaries), which includes articles from an expanded domain list.

```python
!pip install datasets
from datasets import load_dataset

sc2q_dataset_large = load_dataset("sc2qa/sc2q_commoncrawl_large")
```

You can skip the rest of this section if you use the load_dataset API above.
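Each example in the 4-tuple dataset pairs a question with an article, a summary (answer), and a length constraint. A small helper like the following can group examples by their constraint; note that the `length_constraint` field name below is an assumed column name, not verified against the dataset card:

```python
def group_by_length_constraint(examples):
    # Bucket {question, article, summary, length constraint} 4-tuples
    # by their length constraint. The "length_constraint" key is an
    # assumed field name for illustration.
    buckets = {}
    for ex in examples:
        buckets.setdefault(ex["length_constraint"], []).append(ex)
    return buckets
```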
The following steps construct the dataset from scratch. Appendix A of our paper describes each step in detail.
Install dependencies
bash
pip3 install -r requirements.txt
Step 1 Collect Question-Article Pairs
bash
cd CommonCrawl_Question_Mining
bash collect_qa.sh
This script downloads WARC files from 2019/01 to 2021/09 from Common Crawl's news data stream, and then filters the news articles with a set of rules defined in CommonCrawl_Question_Mining/collect_qa_step2.py.
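The repository's rule-based filter can be approximated by a minimal sketch. The exact rules live in the script above; the predicates below (question-word prefix, trailing `?`, minimum article length) are illustrative assumptions, not the repository's actual rule set:

```python
QUESTION_WORDS = ("what", "why", "how", "when", "where", "who",
                  "which", "can", "should", "is", "are", "do", "does")

def looks_like_question_headline(title: str) -> bool:
    # Illustrative stand-in for the headline rules: keep titles that
    # end with '?' and open with a common question word.
    t = title.strip()
    return t.endswith("?") and t.lower().startswith(QUESTION_WORDS)

def keep_pair(title: str, article: str, min_words: int = 100) -> bool:
    # Keep a (question, article) pair only if the headline is a
    # question and the article body is long enough to summarize.
    return looks_like_question_headline(title) and len(article.split()) >= min_words
```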
Step 2 Collect {Question, Article, Summary, Length Constraint} 4-Tuples as Training and Validation Set
bash
bash collect_qasl.sh
This script calls BART, PEGASUS, and CTRLSum models to generate length-constrained summaries for the articles from step 1, and then uses our pre-trained question-answering model to filter out summaries that are likely incorrect answers.
In total, there are 65,332 {Question, Article, Summary, Length Constraint} 4-tuples. We use the first 57,332 as the training set and the last 8,000 as the validation set.
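The fixed split above can be reproduced with a simple slice, assuming the 4-tuples are kept in their original order:

```python
TRAIN_SIZE, VAL_SIZE = 57_332, 8_000  # fixed split sizes from the README

def split_4tuples(examples):
    # First 57,332 4-tuples form the training set; the last 8,000
    # form the validation set.
    assert len(examples) == TRAIN_SIZE + VAL_SIZE, "expects all 65,332 4-tuples"
    return examples[:TRAIN_SIZE], examples[TRAIN_SIZE:]
```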
Step 3 Collect Articles as Test Set (Optional)
bash
bash collect_test_set_articles.sh
This script randomly samples 10,000 news articles from 2021/09 in Common Crawl's news data stream. There are no ground-truth questions for these articles.
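The sampling step amounts to drawing 10,000 distinct articles from the crawled pool; a minimal sketch (the function name and seed are illustrative, not taken from the script):

```python
import random

def sample_test_articles(articles, n=10_000, seed=0):
    # Draw n distinct articles (without replacement) to form an
    # unlabeled test set, mirroring the random sampling in the script.
    return random.Random(seed).sample(articles, n)
```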
D-S-DRIL Model
Install dependencies
bash
pip3 install -r requirements.txt
Train an Answer Generation Model using DRIL
bash
cd D-S-DRIL
bash scripts/train_ag.sh
Train an answer generation model using the DRIL method proposed in the paper. The model samples summaries (answers) during training and calculates gradients based on the question reconstruction loss.
We use Amazon EC2 p3dn.24xlarge GPU instances for training (8 GPUs, each with 32 GB of memory). Training takes about 8 hours. If you encounter a GPU out-of-memory error, consider reducing the batch size.
During training, model checkpoints are saved to ag_model_output/checkpoint-*. Each checkpoint folder has a trainer_state.json file recording the current best model checkpoint. At the end of training, the best model is saved to ag_model_output/. However, you may hit a GPU out-of-memory error at this final save; if so, manually copy the best checkpoint from ag_model_output/checkpoint-* to ag_model_output/, then copy the model configuration file from scripts/model_config/config.json to ag_model_output/config.json.
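That manual recovery can be scripted. The sketch below assumes the paths named in this README and the standard Hugging Face trainer_state.json layout (its best_model_checkpoint field); treat it as an illustration rather than part of the repository:

```python
import json
import shutil
from pathlib import Path

def promote_best_checkpoint(output_dir="ag_model_output",
                            config_src="scripts/model_config/config.json"):
    # Read the latest checkpoint's trainer_state.json to find the best
    # checkpoint, copy its files into output_dir, and restore the model
    # config -- the manual recovery described above.
    out = Path(output_dir)
    latest = max(out.glob("checkpoint-*"),
                 key=lambda p: int(p.name.split("-")[1]))
    state = json.loads((latest / "trainer_state.json").read_text())
    best = Path(state["best_model_checkpoint"])
    for f in best.iterdir():
        if f.is_file():
            shutil.copy2(f, out / f.name)
    shutil.copy2(config_src, out / "config.json")
```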
Train a Question Generation Model
bash
bash scripts/train_qg.sh
Train a question generation model that generates questions based on summaries of articles.
Inference
bash
bash scripts/inference.sh [validation|test]
Runs inference to generate question-answer pairs for articles in the validation set or the test set.
How to Cite
If you find this repository useful, please cite the following paper.
@inproceedings{zhou-etal-2021-generating,
title = "Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning",
author = "Zhou, Li and Small, Kevin and Zhang, Yong and Atluri, Sandeep",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
year = "2021",
pages = "5103--5135",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.416",
}
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Li Zhou | l****l@a****m | 3 |
| dependabot[bot] | 4****] | 1 |
| Amazon GitHub Automation | 5****o | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: about 1 hour
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 3
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- dependabot[bot] (3)