https://github.com/dimits-ts/longsum0_dict_learning

Long-Span Summarization

Science Score: 23.0%

This score indicates how likely this project is to be science-related, based on the following indicators (findings are noted where present):

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.2%) to scientific vocabulary
Last synced: 5 months ago

Repository

Long-Span Summarization

Basic Info
  • Host: GitHub
  • Owner: dimits-ts
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 146 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of potsawee/longsum0
Created almost 3 years ago · Last pushed over 2 years ago

https://github.com/dimits-ts/longsum0_dict_learning/blob/main/

Long-Span Summarization
=====================================================
Code for ACL 2021 paper "[Long-Span Summarization via Local Attention and Content Selection](https://arxiv.org/abs/2105.03801)" (previously the title was "Long-Span Dependencies in Transformer-based Summarization Systems").
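
For intuition, here is a minimal toy sketch (not this repository's implementation) of the local self-attention idea behind LoBART: each position attends only to keys within a fixed window, so attention cost grows linearly in input length rather than quadratically.

    import torch

    def local_attention_mask(seq_len, window_width):
        # True where attention is allowed: |query_pos - key_pos| <= window/2
        idx = torch.arange(seq_len)
        return (idx[None, :] - idx[:, None]).abs() <= window_width // 2

    scores = torch.randn(8, 8)                         # toy attention scores
    mask = local_attention_mask(8, window_width=4)
    scores = scores.masked_fill(~mask, float("-inf"))  # block out-of-window keys
    attn = scores.softmax(dim=-1)                      # rows sum to 1 over their window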


Requirements
--------------------------------------
- python 3.7
- torch 1.2.0
- transformers (HuggingFace) 2.11.0
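
A quick way to check that the pinned versions are the ones installed (newer HuggingFace releases changed APIs, so the scripts may break on other versions):

    import torch, transformers

    print(torch.__version__)         # expect 1.2.0
    print(transformers.__version__)  # expect 2.11.0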

Overview
--------------------------------------
1. train/ = training scripts for BART, LoBART, HierarchicalModel (MCS)
2. decode/ = running decoding (inference) for BART, LoBART, MCS-extractive, MCS-attention, MCS-combined
3. data/ = data modules, pre-processing, and sub-directories containing train/dev/test data
4. models/ = defined LoBART & HierarchicalModel
5. traintime_select/ = scripts for processing data for training (aka ORACLE methods, pad-rand, pad-lead, no-pad)
6. conf/ = configuration files for training

Pipeline (before training starts)
--------------------------------------
- Download data, e.g. Spotify Podcast, arXiv, PubMed & put in data/
- Basic pre-processing (train/dev/test) & put in data/
- ORACLE processing (train/dev/) & put in data/
- Train HierModel (aka MCS) using data with basic pre-processing
- MCS processing & put in data/
- Train BART or LoBART using data above
- Decode BART or LoBART (note that if MCS is applied, run MCS first, i.e. save the MCS output somewhere and load it at decode time)

Data Preparation
--------------------------------------
**Spotify Podcast**
- Download link: https://podcastsdataset.byspotify.com/
- See ```data/podcast_processor.py```
- We recommend splitting the data into chunks such that each chunk contains 10k instances, e.g. id0-id9999 in podcast_set0, as sketched below
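
A hedged sketch of that chunking scheme (directory names are illustrative; ```data/podcast_processor.py``` defines the real layout):

    # Illustrative only: map instance ids to 10k-sized chunks, e.g.
    # id0-id9999 -> podcast_set0, id10000-id19999 -> podcast_set1, ...
    CHUNK_SIZE = 10_000

    def chunk_name(instance_id):
        return "podcast_set{}".format(instance_id // CHUNK_SIZE)

    assert chunk_name(0) == "podcast_set0"
    assert chunk_name(9999) == "podcast_set0"
    assert chunk_name(10000) == "podcast_set1"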

**arXiv & PubMed**
- Download link: https://github.com/armancohan/long-summarization
- See ```data/arxiv_processor.py``` (very minimal pre-processing done)

Training BART & LoBART
--------------------------------------
**Training**:

    python train/train_abssum.py conf.txt

**Configurations**:

Settings in conf.txt, e.g. conf/bart_podcast_v0.txt; a hypothetical example is sketched after this list.
- **bart_weights** - pre-trained BART weights, e.g. facebook/bart-large-cnn
- **bart_tokenizer** - pre-trained tokenizer, e.g. facebook/bart-large
- **model_name** - model name to be saved
- **selfattn** - full | local
- **multiple_input_span** - maximum input span (multiple of 1024)
- **window_width** - local self-attention width
- **save_dir** - directory to save checkpoints
- **dataset** - dataset name, e.g. podcast
- **data_dir** - path to data
- **optimizer** - optimizer (currently only adam supported)
- **max_target_len** - maximum target length
- **lr0** - initial learning rate
- **warmup** - number of learning-rate warmup steps
- **batch_size** - batch size
- **gradient_accum** - number of gradient accumulation steps
- **valid_step** - save a checkpoint every this many steps
- **total_step** - maximum number of training steps
- **early_stop** - stop training if validation loss fails to improve this many times
- **random_seed** - random seed
- **use_gpu** - True | False
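
To make the fields concrete, here is a hypothetical configuration rendered as a Python dict. All values below are illustrative guesses, not the contents of conf/bart_podcast_v0.txt:

    # Hypothetical values for illustration; see conf/bart_podcast_v0.txt for
    # the actual file format and settings.
    example_conf = {
        "bart_weights": "facebook/bart-large-cnn",
        "bart_tokenizer": "facebook/bart-large",
        "model_name": "lobart_podcast_demo",  # hypothetical
        "selfattn": "local",
        "multiple_input_span": 4,             # 4 x 1024 = 4096 input positions
        "window_width": 1024,
        "save_dir": "checkpoints/",
        "dataset": "podcast",
        "data_dir": "data/podcast/",
        "optimizer": "adam",
        "max_target_len": 144,                # hypothetical
        "lr0": 2e-5,                          # hypothetical
        "warmup": 1000,                       # hypothetical
        "batch_size": 1,
        "gradient_accum": 8,                  # effective batch size = 1 x 8
        "valid_step": 2000,                   # hypothetical
        "total_step": 100000,                 # hypothetical
        "early_stop": 3,                      # hypothetical
        "random_seed": 42,
        "use_gpu": True,
    }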

Decoding (Inference) BART & LoBART
--------------------------------------
**Decoding**:

    python decode/decode_abssum.py \
        --load model_checkpoint \
        --selfattn [full|local] \
        --multiple_input_span INT \
        --window_width INT \
        --decode_dir output_dir \
        --dataset [podcast|arxiv|pubmed] \
        --datapath path_to_dataset \
        --start_id 0 \
        --end_id 1000 \
        [--num_beams NUM_BEAMS] \
        [--max_length MAX_LENGTH] \
        [--min_length MIN_LENGTH] \
        [--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE] \
        [--length_penalty LENGTH_PENALTY] \
        [--random_order [RANDOM_ORDER]] \
        [--use_gpu [True|False]]
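
These flags map onto standard HuggingFace generation arguments. A hedged sketch of the equivalent library calls (illustrative only, not the code of decode_abssum.py; the argument values are placeholders):

    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

    input_ids = tokenizer.encode("long source document ...", return_tensors="pt")
    summary_ids = model.generate(
        input_ids,
        num_beams=4,             # --num_beams
        max_length=144,          # --max_length
        min_length=56,           # --min_length
        no_repeat_ngram_size=3,  # --no_repeat_ngram_size
        length_penalty=2.0,      # --length_penalty
    )
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))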

Training Hierarchical Model
--------------------------------------
    python train/train_hiermodel.py conf.txt

See conf/hiermodel_v1.txt for an example config file.

Training-time Content Selection
--------------------------------------
**step1**: run oracle selection {pad|nopad} per sample/instance

    python traintime_select/oracle_select_{pad|nopad}.py

**step2**: combine all test samples into one file

    python traintime_select/oracle_select_combine.py

See traintime_select/README.md for more information about arguments.

Test-time Content Selection (e.g. MCS inference)
--------------------------------------
**step1**: run decoding to get attention & extractive labelling predictions (per sample)

    python decode/inference_hiermodel.py

**step2**: combine all test samples into one file

    python decode/inference_hiermodel_combine.py

See decode/README.md for more information about arguments.

Analysis
-----------------------------------------
Requires the package ```pytorch_memlab```. Arguments:
- **localattn** - True if LoBART, False if BART
- **X** - maximum input length
- **Y** - maximum target length
- **W** - local attention width (only if localattn == True)
- **B** - batch size

**Memory (BART & LoBART)**

    python analysis/memory_inspect.py localattn X Y W B

**Time (BART & LoBART)**

    python analysis/speed_inspect.py localattn X Y W B num_iterations
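
For reference, a minimal sketch of the kind of reporting ```pytorch_memlab``` provides (the instrumentation inside the scripts above may differ):

    import torch
    from pytorch_memlab import MemReporter

    model = torch.nn.Linear(1024, 1024)
    loss = model(torch.randn(8, 1024)).sum()
    loss.backward()
    MemReporter(model).report()  # per-tensor memory breakdown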


Results using this repository
-----------------------------------------
The outputs of our systems are available -- click the dataset name in each table to download (after unzipping, the files are named id_decoded.txt). Note that podcast IDs follow the order in the metadata, and arxiv/pubmed IDs follow the order of the text file in the original data download. If you need to convert these IDs into article_id, refer to [id_lists](https://drive.google.com/file/d/116Hw7aWp13AU3K65Bu0jzxgOMOSplT4B/view?usp=sharing).
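
A hypothetical sketch of that id-to-article_id join (the file names and the id_lists format here are assumptions; check the downloaded files):

    # Assumption: one article_id per line, in the same order as the decoded ids.
    def load_id_list(path):
        with open(path) as f:
            return [line.strip() for line in f]

    article_ids = load_id_list("id_lists/pubmed_test_ids.txt")  # hypothetical name
    summaries = {}
    for i, article_id in enumerate(article_ids):
        with open("decode_dir/{}_decoded.txt".format(i)) as f:  # the id_decoded.txt files
            summaries[article_id] = f.read().strip()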

- BART(1k,truncate)

|   Data  | ROUGE-1 | ROUGE-2 | ROUGE-L |
|:-------:|:-------:|:-------:|:-------:|
| [Podcast](https://drive.google.com/file/d/1-jCanm14LUIeozU5GwIYcptiUtxPtUK2/view?usp=sharing) |  26.43  |   9.22  |   18.35 |
|  [arXiv](https://drive.google.com/file/d/1-LzzOsMshwf4NUK-RDLqVa0M81ul4CwR/view?usp=sharing)  |  44.96  |  17.25  |  39.76  |
|  [PubMed](https://drive.google.com/file/d/1-EvvQQ8ijk9cjua8vRMetXzJqorrsxqC/view?usp=sharing) |  45.06  |  18.27  |  40.84  |

- BART(1k,ORC-padrand) + ContentSelection

|   Data  | ROUGE-1 | ROUGE-2 | ROUGE-L |
|:-------:|:-------:|:-------:|:-------:|
| [Podcast](https://drive.google.com/file/d/1-klu-SZV_3JGuk-TORhZRaDGIP1BxalS/view?usp=sharing) |  27.28  |  9.82   |  19.00  |
|  [arXiv](https://drive.google.com/file/d/1-XfjvTFNVP3JszlLP2WNRneP106d8ftQ/view?usp=sharing)  |  47.68  |  19.77  |  42.25  |
|  [PubMed](https://drive.google.com/file/d/1-ACWhTSI2NQJIoTXo05Rcm5q3L39L9m-/view?usp=sharing) |  46.49  |  19.45  |  42.04  |

- LoBART(N=4096,W=1024,ORC-padrand)

|   Data  | ROUGE-1 | ROUGE-2 | ROUGE-L |
|:-------:|:-------:|:-------:|:-------:|
| [Podcast](https://drive.google.com/file/d/1Y85RUahLn0wuwks1w7Fp7BrhASpfVIO3/view?usp=sharing) |  27.36  |  10.04  |  19.33  |
|  [arXiv](https://drive.google.com/file/d/1-K7oEBwIXMybqfPgM6jbn9cMnx5WXfIy/view?usp=sharing)  |  46.59  |  18.72  |  41.24  |
|  [PubMed](https://drive.google.com/file/d/1-Gbqxc4zkxd3CGX84LtqloZjjMIlHF3H/view?usp=sharing) |  47.47  |  20.47  |  43.02  |

- LoBART(N=4096,W=1024,ORC-padrand) + ContentSelection. This is the best configuration reported in the paper.

|   Data  | ROUGE-1 | ROUGE-2 | ROUGE-L |
|:-------:|:-------:|:-------:|:-------:|
| [Podcast](https://drive.google.com/file/d/1f2IpAjrhLU_z5uImB1jmaHHwXkaIPRDR/view?usp=sharing) |  27.81  |  10.30  |   19.61 |
|  [arXiv](https://drive.google.com/file/d/1b1JHD5VkBhhvYjkEKT0YLDsq5CHtviHJ/view?usp=sharing)  |  48.79  |  20.55  |  43.31  |
|  [PubMed](https://drive.google.com/file/d/1pM7SH6UL5HZozhJxzqJrFKxnkiaVYfvq/view?usp=sharing) |  48.06  |  20.96  |  43.56  |
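
To score downloaded outputs against references, one option is Google's rouge_score package. A hedged sketch; the paper's exact ROUGE configuration (stemming, tokenization, summary-level vs. sentence-level ROUGE-L) may differ:

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    scores = scorer.score("the reference summary ...",  # target
                          "the decoded summary ...")    # prediction
    print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure)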

Trained Weights
-----------------------------------------
TRC=Truncate-training, ORC=Oracle-training

|   Model  | Trained on Data |
|:--------:|:------------:|
|LoBART(N=4096,W=1024)\_TRC|[Podcast](https://drive.google.com/file/d/1ZXQ0KP3CHJxWZdK88ebNslV6hlcLgTvA/view?usp=sharing), [arXiv](https://drive.google.com/file/d/1gwX-FCXib5WF9p-dTx-mWIQ3lPL7Psn8/view?usp=sharing), [PubMed](https://drive.google.com/file/d/18TtN-jwW4WadBAUA7P8vcgB6x4BlFHJf/view?usp=sharing)|
|LoBART(N=4096,W=1024)\_ORC|[Podcast](https://drive.google.com/file/d/1JdZpJsgrvjTqA1NqPbiKteNL3CuTjMRC/view?usp=sharing), [arXiv](https://drive.google.com/file/d/1H9Bw2ighKT8LJe-iNK2iLk7lwiQAEli0/view?usp=sharing), [PubMed](https://drive.google.com/file/d/1vvJHKmPI1E284RugWuW_ZaFJO1taG-pb/view?usp=sharing)|
|Hierarchical-Model|[Podcast](https://drive.google.com/file/d/1jF7ydOXVNBj01-aWi18_60H2_D7sFsFo/view?usp=sharing), [arXiv](https://drive.google.com/file/d/1EDZ-XfhDxQUwtbb3y_bH5T7zL_rnUklS/view?usp=sharing), [PubMed](https://drive.google.com/file/d/1yUfY7hEZTQfInYM9BeAdTGsdhz0KRiBa/view?usp=sharing)|

Citation
-----------------------------------------

	@inproceedings{manakul-gales-2021-long,
	    title = "Long-Span Summarization via Local Attention and Content Selection",
	    author = "Manakul, Potsawee  and Gales, Mark",
	    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
	    month = aug,
	    year = "2021",
	    address = "Online",
	    publisher = "Association for Computational Linguistics",
	    url = "https://aclanthology.org/2021.acl-long.470",
	    doi = "10.18653/v1/2021.acl-long.470",
	    pages = "6026--6041",
	}
    

Owner

  • Name: Dimitris Tsirmpas
  • Login: dimits-ts
  • Kind: user

I like playing around with data and building stuff.
