https://github.com/creolehustle/sci_long-summarization
Resources for the NAACL 2018 paper "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents"
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
○DOI references
-
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (13.0%) to scientific vocabulary
Last synced: 10 months ago
·
JSON representation
Repository
Resources for the NAACL 2018 paper "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents"
Basic Info
- Host: GitHub
- Owner: CreoleHustle
- License: apache-2.0
- Language: Python
- Default Branch: master
- Size: 34.7 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of armancohan/long-summarization
Created over 1 year ago
· Last pushed over 1 year ago
https://github.com/CreoleHustle/SCI_long-summarization/blob/master/
## A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
This repository contains data and code for the NAACL 2018 paper ["A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents"](https://arxiv.org/abs/1804.05685). Please note that the code is not actively maintained.
### Data
Two datasets of long and structured documents (scientific papers) are provided. The datasets are obtained from ArXiv and PubMed OpenAccess repositories.
### Get the datasets
ArXiv dataset: [Download](https://drive.google.com/file/d/1b3rmCSIoh6VhD4HKWjI4HOW-cSwcwbeC/view?usp=sharing) ([mirror](https://archive.org/download/armancohan-long-summarization-paper-code/arxiv-dataset.zip))
PubMed dataset: [Download](https://drive.google.com/file/d/1lvsqvsFi3W-pE1SqNZI0s8NR9rC1tsja/view?usp=sharing) ([mirror](https://archive.org/download/armancohan-long-summarization-paper-code/pubmed-dataset.zip))
The datasets are rather large. You need about 5G disk space to download and about 15G additional space when extracting the files. Each `tar` file consists of 4 files. `train.txt`, `val.txt`, `test.txt` respectively correspond to the training, validation, and test sets. The `vocab` file is a plaintext file for the vocabulary.
### Format of the data
The files are in jsonlines format where each line is a json object corresponding to one scientific paper from ArXiv or PubMed.
The abstract, sections and body are all sentence tokenized. The json objects are in the following format:
```
{
'article_id': str,
'abstract_text': List[str],
'article_text': List[str],
'section_names': List[str],
'sections': List[List[str]]
}
```
### Huggingface Datasets:
You can access the dataset in Huggingface datasets library:
https://huggingface.co/datasets/scientific_papers
### Tensorflow datasets
The dataset is also available on [Tensorflow Datasets](https://www.tensorflow.org/datasets/catalog/scientific_papers) which makes it easy to use within Tensorflow or colab.
### Code
The code is based on the pointer-generator network code by [See et al. (2017)](https://github.com/abisee/pointer-generator). Refer to their repo for documentation about the structure of the code.
You will need `python 3.6` and `Tensorflow 1.5` to run the code. The code might run with later versions of Tensorflow but it is not tested. Checkout other dependencies in `requirements.txt` file. A small sample of the dataset is already provided in this repo. To run the code with the sample data unzip the files in the `data` directory and simply execute the run script: `./run.sh`. To train the model with the entire dataset, first convert the jsonlines files to binary using the the following script: `scripts/json_to_bin.py` and modify the corresponding training data path in the `run.sh` script.
### Citing
If you ended up finding this paper or repo useful please cite:
```
"A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents"
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian
NAACL-HLT 2018
```
Another relevant reference is Pointer-Generator network by See et al. (2017):
```
"Get to the point: Summarization with pointer-generator networks."
Abigail See, Peter J. Liu, and Christopher D. Manning.
ACL (2017).
```
Owner
- Name: CreoleCreator
- Login: CreoleHustle
- Kind: user
- Repositories: 1
- Profile: https://github.com/CreoleHustle