factorsum

Code for the paper Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents: https://arxiv.org/abs/2205.12486

https://github.com/fonsc/factorsum

Science Score: 28.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary


Repository


Basic Info

Host: GitHub
Owner: fonsc
Language: Python
Default Branch: main
Homepage:
Size: 103 KB

Statistics

Stars: 11
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed over 2 years ago

Metadata Files

Readme Citation

FactorSum

Supporting code for the paper Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents.

Abstract

We argue that disentangling content selection from the budget used to cover salient content improves the performance and applicability of abstractive summarizers. Our method, FactorSum, does this disentanglement by factorizing summarization into two steps through an energy function:

Intrinsic importance model: generation of abstractive summary views.
Extrinsic importance model: combination of these views into a final summary, following a budget and content guidance.

This extrinsic guidance may come from different sources, including from an advisor model such as BART or BigBird, or in oracle mode -- from the reference. This factorization achieves significantly higher ROUGE scores on multiple benchmarks for long document summarization, namely PubMed, arXiv, and GovReport. Most notably, our model is effective for domain adaptation. When trained only on PubMed samples, it achieves a 46.29 ROUGE-1 score on arXiv, which indicates a strong performance due to more flexible budget adaptation and content selection less dependent on domain-specific textual structure.
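The second step can be pictured as a budgeted selection problem. The following is a minimal sketch under stated assumptions, not the energy function from the paper: it greedily picks summary-view sentences that add the most new word overlap with a guidance text, stopping when nothing else fits the token budget. The function name `combine_views`, the whitespace tokenization, and the overlap score are all hypothetical choices made for illustration.

```python
def combine_views(view_sentences, guidance, budget):
    """Greedily select sentences covering guidance words within a token budget.

    Illustrative only: whitespace tokens and unigram overlap stand in for
    whatever scoring the real extrinsic importance model uses.
    """
    guidance_words = set(guidance.lower().split())
    covered, selected, used = set(), [], 0
    # Summary views may repeat sentences; drop duplicates but keep order.
    candidates = list(dict.fromkeys(view_sentences))
    while True:
        best, best_gain = None, 0
        for sentence in candidates:
            words = sentence.lower().split()
            if used + len(words) > budget:
                continue  # sentence does not fit in the remaining budget
            # Gain = guidance words this sentence would newly cover.
            gain = len((set(words) & guidance_words) - covered)
            if gain > best_gain:
                best, best_gain = sentence, gain
        if best is None:
            break  # nothing fits, or nothing adds new guidance content
        selected.append(best)
        covered |= set(best.lower().split()) & guidance_words
        used += len(best.split())
        candidates.remove(best)
    return selected


views = [
    "the model factorizes summarization",
    "we evaluate on arxiv and pubmed",
    "the weather is nice",
    "the model factorizes summarization",
]
guidance = "the model factorizes summarization and is evaluated on arxiv"
summary = combine_views(views, guidance, budget=10)
```

Changing only `budget` changes which sentences are kept while the content signal stays the same, which is the kind of separation between content and budget decisions the abstract describes.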

Getting started

Clone this repository and install the dependencies:

```bash
git clone https://github.com/thefonseca/factorsum.git
cd factorsum

# Optional: checkout the arXiv version 2205.12486v2 for reproducibility
git checkout 2205.12486v2

# Install dependencies
pip install -r requirements.txt
```

Usage

Example: summarizing a single document using budget guidance and source content guidance.

```python
# Assumption: FactorSum is importable from factorsum.model,
# the module used by the command-line tool.
from factorsum.model import FactorSum

training_domain = 'arxiv'
budget_guidance = 200  # budget guidance in tokens

model = FactorSum(training_domain)
summary = model.summarize(
    document,  # a document string
    budget_guidance=budget_guidance,
    source_token_budget=budget_guidance,  # number of tokens to use from source document as content guidance
    verbose=True,
)
```

A command-line tool is provided to explore summary samples and parameters. For instance, to see the summary for sample 230 from the arXiv test set, use the following command (GPU recommended):

```shell
python -m factorsum.model --doc_id 230 --dataset_name arxiv --split test \
    --budget_guidance=200 --content_guidance_type source
```

It outputs the target abstract, the generated summary, and the evaluation scores.

Colab Playground

A Colab notebook is available for summary generation.

Reproducing the evaluation results

The evaluation procedure relies on the following data:

- The arXiv, PubMed, and GovReport summarization datasets.
- The document views dataset generated by the sampling procedure (refer to Section 2.1 "Sampling Document Views" in the paper).
- The summary views predicted from the document views (see Section 2.1.1 in the paper).

For convenience, we provide all the preprocessed resources, which can be downloaded using this command:

```shell
python -m factorsum.data download
```

Alternatively, you can use the instructions below to prepare the resources from scratch.

Prepare data from scratch

Preprocess the summarization datasets (test splits):

```shell
python -m factorsum.data prepare_dataset scientific_papers arxiv --split test

python -m factorsum.data prepare_dataset scientific_papers pubmed --split test

python -m factorsum.data prepare_dataset ccdv/govreport-summarization govreport --split test
```

Then generate the document views for each dataset:

```shell
python -m factorsum.data prepare_dataset scientific_papers arxiv --split test --sample_type random --sample_factor 5 --views_per_doc 20

python -m factorsum.data prepare_dataset scientific_papers pubmed --split test --sample_type random --sample_factor 5 --views_per_doc 20

python -m factorsum.data prepare_dataset ccdv/govreport-summarization govreport --split test --sample_type random --sample_factor 5 --views_per_doc 20
```
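To give a rough picture of what random document views look like, here is a minimal sketch. It is not the repository's sampling code. It assumes, as one possible reading of the options above, that a sample factor of 5 means each view keeps about one fifth of the document's sentences, and that 20 such views are drawn per document. The function name `sample_document_views` and its parameters are hypothetical.

```python
import random


def sample_document_views(sentences, sample_factor=5, views_per_doc=20, seed=17):
    """Return `views_per_doc` random sentence subsets, each in document order.

    Assumption (not confirmed by the source): every view keeps roughly
    len(sentences) / sample_factor sentences.
    """
    rng = random.Random(seed)
    view_size = min(len(sentences), max(1, len(sentences) // sample_factor))
    views = []
    for _ in range(views_per_doc):
        # Sample positions without replacement, then restore document order.
        indices = sorted(rng.sample(range(len(sentences)), view_size))
        views.append([sentences[i] for i in indices])
    return views


sentences = [f"sentence {i}" for i in range(50)]
views = sample_document_views(sentences)
```

Each view is a shorter context than the full document, which is what the intrinsic importance model summarizes into a summary view.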

Download the intrinsic importance model checkpoints:

```shell
python -m factorsum.utils download_models --model_dir ./artifacts
```

Currently, the checkpoints are:

- arXiv: model-rs86h5g0:v0
- PubMed: model-cku41vkj:v0
- GovReport: model-2oklw1wt:v0

Finally, generate summary views using the run_summarization.py script (slightly adapted from the original Hugging Face script). The following command generates summary views for the arXiv test set using the model checkpoint in artifacts/model-rs86h5g0:v0:

```bash
# Set these as shell variables on their own lines so the ${...} expansions below can see them.
MODEL_PATH='artifacts/model-rs86h5g0:v0'
DATASET='arxiv'
SPLIT='test'
python scripts/run_summarization.py \
    --model_name_or_path "${MODEL_PATH}" \
    --do_predict \
    --output_dir output/"${DATASET}-${SPLIT}-summary_views" \
    --per_device_eval_batch_size=8 \
    --overwrite_output_dir \
    --predict_with_generate \
    --validation_file "data/${DATASET}-random_k_5_samples_20_${SPLIT}.csv" \
    --test_file "data/${DATASET}-random_k_5_samples_20_${SPLIT}.csv" \
    --text_column source \
    --summary_column target \
    --generation_max_length 128 \
    --generation_num_beams 4
```

It writes a generated_predictions.pkl file to the output_dir folder. To use the summary views, copy this file to the data folder according to this naming convention:

```shell
cp output/"${DATASET}-${SPLIT}-summary_views/generated_predictions.pkl" data/"${DATASET}-${SPLIT}-summary_views-bart-${TRAINING_DOMAIN}-run=${RUN_ID}.pkl"
```

For instance, for the arXiv test set in-domain summary views we would have:

```shell
cp output/arxiv-test-summary_views/generated_predictions.pkl data/arxiv-test-summary_views-bart-arxiv-run=rs86h5g0.pkl
```

To generate summary views in a cross-domain setting, just set the variables MODEL_PATH and DATASET accordingly.
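The naming convention for summary view files can be spelled out with a small helper. This is a sketch for illustration, not part of the repository: the function name `summary_views_path` is hypothetical, and deriving the run ID by stripping the `model-` prefix and the `:v0` suffix from the checkpoint name is an assumption based on the single in-domain example given above.

```python
from pathlib import Path


def summary_views_path(dataset, split, training_domain, checkpoint, data_dir="data"):
    """Build the expected location of a summary views file.

    Assumption: the run ID is the checkpoint name without the "model-"
    prefix and the ":<version>" suffix, e.g. "model-rs86h5g0:v0" -> "rs86h5g0".
    """
    name = checkpoint.split(":")[0]
    run_id = name[len("model-"):] if name.startswith("model-") else name
    filename = f"{dataset}-{split}-summary_views-bart-{training_domain}-run={run_id}.pkl"
    return Path(data_dir) / filename


# In-domain: arXiv test set, views from the arXiv checkpoint.
in_domain = summary_views_path("arxiv", "test", "arxiv", "model-rs86h5g0:v0")

# Cross-domain: arXiv test set, views from the PubMed checkpoint.
cross_domain = summary_views_path("arxiv", "test", "pubmed", "model-cku41vkj:v0")
```

Under these assumptions, the dataset part of the file name follows DATASET while the training domain and run ID follow the checkpoint in MODEL_PATH.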

Hyperparameters

Refer to the file config.py for hyperparameter definitions.

In-domain evaluation

The in-domain summarization results in Table 2 in the paper are obtained with the following command:

```shell
python -m evaluation.factorsum evaluate --dataset_name arxiv --split test --output_dir output
```

where dataset_name is arxiv, pubmed, or govreport. By default, scores and summary predictions are saved to the ./output folder.
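For readers unfamiliar with the reported metric, the sketch below shows how a ROUGE-1 F-measure is computed from unigram overlap. It is a simplification for illustration only: it uses lowercased whitespace tokens, whereas the evaluation in this repository relies on the rouge_score package listed in the dependencies, so the numbers it produces are not comparable with the paper's tables. The function name `rouge1_fmeasure` is hypothetical.

```python
from collections import Counter


def rouge1_fmeasure(reference, prediction):
    """Simplified ROUGE-1 F-measure over lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each token counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_fmeasure("the cat sat on the mat", "the cat lay on a mat")
```

Here four of the six tokens on each side overlap, so precision, recall, and the F-measure are all 4/6.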

Cross-domain evaluation

(Results in Table 3 of the paper) To specify the training domain of the intrinsic model, use the training_domain option. The following example performs cross-domain evaluation on the arXiv dataset, using summary views generated by a model trained on PubMed.

```shell
python -m evaluation.factorsum evaluate --dataset_name arxiv --split test --training_domain pubmed
```

Varying budget guidance

Results for the experiments with varying budget guidance (Appendix D in the paper) are obtained with the following command:

```shell
python -m evaluation.budgets --dataset_name <dataset_name> --split test
```

where dataset_name is arxiv, pubmed, or govreport.

Baselines

PEGASUS predictions:

```shell
python scripts/run_summarization.py \
    --model_name_or_path google/pegasus-arxiv \
    --do_predict \
    --output_dir /output \
    --per_device_eval_batch_size 4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --generation_max_length 256 \
    --generation_num_beams 8 \
    --val_max_target_length 256 \
    --max_source_length 1024 \
    --dataset_name scientific_papers \
    --dataset_config arxiv \
    --predict_split test
```

BigBird predictions:

```shell
python scripts/run_summarization.py \
    --model_name_or_path google/bigbird-pegasus-large-arxiv \
    --do_predict \
    --output_dir /content/output \
    --per_device_eval_batch_size 4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --report_to none \
    --generation_max_length 256 \
    --generation_num_beams 5 \
    --val_max_target_length 256 \
    --max_source_length 3072 \
    --dataset_name scientific_papers \
    --dataset_config arxiv \
    --predict_split test
```

Training the intrinsic importance model

First, make sure the data for all splits are available (processing of the training sets might take several minutes):

```bash
python -m factorsum.data prepare_dataset scientific_papers arxiv
python -m factorsum.data prepare_dataset scientific_papers pubmed
python -m factorsum.data prepare_dataset ccdv/govreport-summarization govreport
```

Then run the training script as follows:

```bash
# Set DATASET as a shell variable on its own line so the ${DATASET} expansions below can see it.
DATASET='arxiv'
python scripts/run_summarization.py \
    --model_name_or_path facebook/bart-base \
    --do_train \
    --do_eval \
    --do_predict \
    --output_dir output/"${DATASET}"-k_5_samples_20 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --predict_with_generate \
    --gradient_accumulation_steps 4 \
    --generation_max_length 128 \
    --generation_num_beams 4 \
    --val_max_target_length 128 \
    --max_source_length 1024 \
    --max_target_length 128 \
    --fp16 \
    --save_total_limit 2 \
    --save_strategy steps \
    --evaluation_strategy steps \
    --save_steps 5000 \
    --eval_steps 5000 \
    --max_steps 50000 \
    --learning_rate 5e-5 \
    --report_to wandb \
    --metric_for_best_model eval_rouge1_fmeasure \
    --load_best_model_at_end \
    --max_train_samples 4000000 \
    --max_eval_samples 10000 \
    --max_predict_samples 10000 \
    --train_file data/"${DATASET}"-random_k_5_samples_20_train.csv \
    --validation_file data/"${DATASET}"-random_k_5_samples_20_validation.csv \
    --test_file data/"${DATASET}"-random_k_5_samples_20_test.csv \
    --text_column source \
    --summary_column target \
    --seed 17
```

Note: to use mixed precision (--fp16) you need a compatible CUDA device.
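As a quick sanity check on the training settings above, the effective batch size and the number of training examples processed can be worked out from the flags. The sketch below assumes a single device (`num_devices = 1` is an assumption, not something stated in the command) and that each of the `--max_steps` steps is one optimizer update.

```python
# Values taken from the training command above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 50_000
max_train_samples = 4_000_000

# Assumption: training runs on a single device.
num_devices = 1

# Examples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total examples processed over the whole run.
samples_seen = effective_batch_size * max_steps

print(effective_batch_size)  # 16
print(samples_seen)  # 800000
print(samples_seen <= max_train_samples)  # True
```

Under these assumptions the run processes 800,000 document views, which is below the cap set by --max_train_samples.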

Citation

```bibtex
@inproceedings{fonseca2022factorizing,
    author = {Fonseca, Marcio and Ziser, Yftah and Cohen, Shay B.},
    booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
    location = {Abu Dhabi},
    publisher = {Association for Computational Linguistics},
    title = {Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents},
    year = {2022}
}
```

Owner

Name: Marcio Fonseca
Login: fonsc
Kind: user
Location: Edinburgh, UK
Company: University of Edinburgh

Website: marciofonseca.me
Repositories: 5
Profile: https://github.com/fonsc

Citation (CITATION.bib)

@inproceedings{fonseca-etal-2022-factorizing,
    title = "Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents",
    author = "Fonseca, Marcio  and
      Ziser, Yftah  and
      Cohen, Shay B.",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.426",
    pages = "6341--6364",
    abstract = "We argue that disentangling content selection from the budget used to cover salient content improves the performance and applicability of abstractive summarizers. Our method, FactorSum, does this disentanglement by factorizing summarization into two steps through an energy function: (1) generation of abstractive summary views covering salient information in subsets of the input document (document views); (2) combination of these views into a final summary, following a budget and content guidance. This guidance may come from different sources, including from an advisor model such as BART or BigBird, or in oracle mode {--} from the reference. This factorization achieves significantly higher ROUGE scores on multiple benchmarks for long document summarization, namely PubMed, arXiv, and GovReport. Most notably, our model is effective for domain adaptation. When trained only on PubMed samples, it achieves a 46.29 ROUGE-1 score on arXiv, outperforming PEGASUS trained in domain by a large margin. Our experimental results indicate that the performance gains are due to more flexible budget adaptation and processing of shorter contexts provided by partial document views.",
}

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Dependencies

pyproject.toml pypi

accelerate >=0.17.1
datasets >=2.9.1
diskcache >=5.4.0
fire >=0.5.0
gdown >=4.5.1
optimum >=1.6.3
p_tqdm >=1.4.0
requests >=2.28.1
rich >=13.3.2
rouge_score >=0.1.2
summa >=1.2.0
textdistance >=4.5.0
torch >=2.0.0
transformers >=4.27.1
wandb >=0.13.4

requirements.txt pypi

accelerate ==0.16.0
datasets ==2.9.0
diskcache ==5.4.0
fire ==0.4.0
gdown ==4.5.1
optimum ==1.6.3
p_tqdm ==1.4.0
requests ==2.28.1
rich ==13.0.1
rouge-score ==0.1.2
summa ==1.2.0
textdistance ==4.5.0
transformers ==4.26.0
wandb ==0.13.4
