https://github.com/amazon-science/sc2qa-dril

Code for Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning, EMNLP 2021

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary

Keywords

answer-generation imitation-learning natural-language-generation question-answer-generation question-answering question-generation

Keywords from Contributors

generation transformers controllable interactive archival projection sequences observability autograding hacking
Last synced: 5 months ago

Repository

Code for Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning, EMNLP 2021

Basic Info
Statistics
  • Stars: 4
  • Watchers: 6
  • Forks: 1
  • Open Issues: 2
  • Releases: 0
Topics
answer-generation imitation-learning natural-language-generation question-answer-generation question-answering question-generation
Created almost 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme License

README.md

Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning

This repository contains code for our EMNLP 2021 paper: Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning. Li Zhou, Kevin Small, Yong Zhang, Sandeep Atluri.

Table of Contents

(SC)^2QA Dataset

We provide code and scripts to construct the public version of the (SC)^2QA dataset from Common Crawl's news data stream. Compared with the internal version of (SC)^2QA used in our paper, this public version is larger, including 50,441 question-article pairs and 65,332 {question, article, summary, length constraint} 4-tuples. We also provide an even larger question-article pairs dataset with 529,039 pairs.

For convenience, we also provide the constructed dataset; you can load it directly with Hugging Face's load_dataset API.

```python
!pip install datasets

from datasets import load_dataset

sc2qa_dataset = load_dataset("sc2qa/sc2qa_commoncrawl")
```

This loads {Question, Article, Summary, Length Constraint} 4-tuples. However, if you only want question-article pairs (i.e., the output of step 1 below), you can do

```python
!pip install datasets

from datasets import load_dataset

sc2q_dataset = load_dataset("sc2qa/sc2q_commoncrawl")
```

We also provide an even larger question-article pairs dataset (without summaries), which includes articles from an expanded domain list.

```python
!pip install datasets

from datasets import load_dataset

sc2q_dataset_large = load_dataset("sc2qa/sc2q_commoncrawl_large")
```

You can skip the rest of this section if you use the load_dataset API above.
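Whichever variant you load, the records behave like ordinary Hugging Face dataset rows. As a minimal sketch of working with the 4-tuples (the field names below are assumptions; inspect the dataset's column names after loading to confirm them):

```python
# Toy records standing in for (SC)^2QA 4-tuples; the real column names may
# differ -- check `sc2qa_dataset.column_names` after loading.
records = [
    {"question": "Why do markets dip in September?",
     "summary": "Markets often dip in September because ...",
     "length_constraint": 30},
    {"question": "How does a vaccine work?",
     "summary": "A vaccine trains the immune system by ...",
     "length_constraint": 60},
]

# Keep only examples whose answer must fit a short length budget.
short_budget = [r for r in records if r["length_constraint"] <= 40]
print(len(short_budget))  # expect 1
```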

The following steps construct the dataset; Appendix A of our paper describes each step in detail.

Install dependencies

```bash
pip3 install -r requirements.txt
```

Step 1 Collect Question-Article Pairs

```bash
cd CommonCrawl_Question_Mining
bash collect_qa.sh
```

This script downloads WARC files from 2019/01 through 2021/09 from Common Crawl's news data stream, and then filters the news articles with a set of rules defined in CommonCrawl_Question_Mining/collect_qa_step2.py.
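One core rule mines articles whose headlines read like self-contained questions. The snippet below is a hypothetical approximation of that kind of filter, not the repository's actual rules (those live in the step-2 script):

```python
# Hypothetical headline filter: keep articles whose title looks like a
# self-contained question. The real mining rules are in the repo's
# step-2 script and are likely more elaborate.
QUESTION_WORDS = {"what", "why", "how", "when", "who", "where", "which",
                  "can", "should", "is", "are", "do", "does", "will"}

def looks_like_question_headline(title: str) -> bool:
    """True if the headline ends with '?' and starts with a question word."""
    title = title.strip()
    if not title.endswith("?"):
        return False
    first_word = title.split()[0].lower()
    return first_word in QUESTION_WORDS

print(looks_like_question_headline("Why Do Cats Purr?"))   # expect True
print(looks_like_question_headline("Cats Purr: A Study"))  # expect False
```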

Step 2 Collect {Question, Article, Summary, Length Constraint} 4-Tuples as Training and Validation Set

```bash
bash collect_qasl.sh
```

This script calls the BART, PEGASUS, and CTRLSum models to generate length-constrained summaries for the articles from step 1, and then uses our pre-trained question-answering model to filter out summaries that are likely incorrect answers.

In total, we have 65,332 {Question, Article, Summary, Length Constraint} 4-tuples. We use the first 57,332 4-tuples as training set and the last 8,000 4-tuples as validation set.
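The split is positional rather than random, so it can be sketched as a simple slice (the sizes are from the text above; whether the 4-tuples are pre-shuffled is not stated here):

```python
# Fixed positional split described above: first 57,332 4-tuples train,
# last 8,000 validate.
def split_train_valid(tuples, n_train=57_332):
    return tuples[:n_train], tuples[n_train:]

all_tuples = list(range(65_332))   # stand-in for the 65,332 4-tuples
train, valid = split_train_valid(all_tuples)
print(len(train), len(valid))      # expect 57332 8000
```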

Step 3 Collect Articles as Test Set (Optional)

```bash
bash collect_test_set_articles.sh
```

This script randomly samples 10,000 news articles from 2021/09 in Common Crawl's news data stream. There are no ground-truth questions for these articles.
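Uniform sampling without replacement, as the script presumably does, is one line in the standard library (the pool, seed, and sample size here are stand-ins; the real script draws 10,000 articles):

```python
import random

# Sample articles uniformly without replacement. The seed and the small
# k are for this sketch only; the real script samples 10,000 articles
# and may not fix a seed.
random.seed(0)
article_ids = [f"article-{i}" for i in range(100)]  # stand-in pool
test_set = random.sample(article_ids, k=10)
print(len(test_set))  # expect 10
```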

D-S-DRIL Model

Install dependencies

```bash
pip3 install -r requirements.txt
```

Train an Answer Generation Model using DRIL

```bash
cd D-S-DRIL
bash scripts/train_ag.sh
```

This trains an answer generation model with the DRIL method proposed in the paper. During training, the model samples summaries (answers) and computes gradients based on the question reconstruction loss.
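At a very high level, the objective favors summaries from which the original question can be reconstructed. The following is a toy numerical caricature of that idea, not the paper's actual differentiable implementation (which backpropagates through the sampling step):

```python
import math

def question_reconstruction_nll(token_probs):
    """Average negative log-likelihood the question generation model
    assigns to the original question's tokens, given a sampled summary."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities of the question under two sampled
# summaries: the first reconstructs the question well, the second poorly.
samples = [[0.9, 0.8, 0.95], [0.2, 0.1, 0.3]]
losses = [question_reconstruction_nll(s) for s in samples]

# Monte-Carlo estimate of the loss; training pushes the answer generation
# model toward summaries like the first one.
mc_loss = sum(losses) / len(losses)
print(losses[0] < losses[1])  # expect True
```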

We use Amazon EC2 p3dn.24xlarge GPU instances for training (8 GPUs, each with 32GB of memory). Training takes about 8 hours. If you encounter GPU out-of-memory errors, consider reducing the batch size.

During training, model checkpoints are saved to ag_model_output/checkpoint-*. Each checkpoint folder contains a trainer_state.json file that records the current best checkpoint. At the end of training, the best model is saved to ag_model_output/. You may, however, hit a GPU out-of-memory error at this saving step; if so, manually copy the best model checkpoint from ag_model_output/checkpoint-* to ag_model_output/, and then copy the model configuration file from scripts/model_config/config.json to ag_model_output/config.json.
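A minimal sketch of that manual recovery (checkpoint-500 is a placeholder; use the step named as best in trainer_state.json; the first three lines only fabricate stand-in files so the sketch runs end to end):

```shell
# Stand-in setup: in practice the trainer has already created
# ag_model_output/checkpoint-<step>/ with the real weights.
mkdir -p ag_model_output/checkpoint-500 scripts/model_config
touch ag_model_output/checkpoint-500/pytorch_model.bin
echo '{}' > scripts/model_config/config.json

# Copy the best checkpoint's weights up one level, then restore the config.
cp ag_model_output/checkpoint-500/pytorch_model.bin ag_model_output/
cp scripts/model_config/config.json ag_model_output/config.json
```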

Train a Question Generation Model

```bash
bash scripts/train_qg.sh
```

Train a question generation model that generates questions based on summaries of articles.

Inference

```bash
bash scripts/inference.sh [validation|test]
```

Run inference to generate question-answer pairs for articles in the validation set or test set.

How to Cite

If you find this repository useful, please cite the following paper.

```
@inproceedings{zhou-etal-2021-generating,
    title = "Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning",
    author = "Zhou, Li and Small, Kevin and Zhang, Yong and Atluri, Sandeep",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2021",
    pages = "5103--5135",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.416",
}
```

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 5
  • Total Committers: 3
  • Avg Commits per committer: 1.667
  • Development Distribution Score (DDS): 0.4
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Li Zhou (l****l@a****m): 3 commits
  • dependabot[bot] (4****]): 1 commit
  • Amazon GitHub Automation (5****o): 1 commit
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 hour
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • dependabot[bot] (3)
Top Labels
Issue Labels
Pull Request Labels
  • dependencies (3)