factorsum
Code for the paper Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents: https://arxiv.org/abs/2205.12486
FactorSum
Supporting code for the paper Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents.
Abstract
We argue that disentangling content selection from the budget used to cover salient content improves the performance and applicability of abstractive summarizers. Our method, FactorSum, does this disentanglement by factorizing summarization into two steps through an energy function:
- Intrinsic importance model: generation of abstractive summary views.
- Extrinsic importance model: combination of these views into a final summary, following a budget and content guidance.
This extrinsic guidance may come from different sources, including from an advisor model such as BART or BigBird, or in oracle mode -- from the reference. This factorization achieves significantly higher ROUGE scores on multiple benchmarks for long document summarization, namely PubMed, arXiv, and GovReport. Most notably, our model is effective for domain adaptation. When trained only on PubMed samples, it achieves a 46.29 ROUGE-1 score on arXiv, which indicates a strong performance due to more flexible budget adaptation and content selection less dependent on domain-specific textual structure.
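To make the second step more concrete, here is a toy sketch of combining candidate sentences under a budget and a content guidance. It is not the energy function from the paper: the `select_views` name, the whitespace tokenization, and the word-coverage objective are all assumptions chosen only to illustrate the idea of greedy selection under a budget.

```python
def select_views(sentences, guidance, budget):
    """Greedily pick sentences that cover new guidance words within a word budget.

    Toy stand-in for the extrinsic importance model; the real objective differs.
    """
    guidance_words = set(guidance.lower().split())
    selected, covered, used = [], set(), 0
    remaining = list(sentences)
    while remaining:
        best, best_gain = None, 0
        for sent in remaining:
            words = sent.lower().split()
            if used + len(words) > budget:
                continue  # this sentence would exceed the budget
            gain = len((set(words) & guidance_words) - covered)
            if gain > best_gain:
                best, best_gain = sent, gain
        if best is None:
            break  # nothing fits the budget or adds new coverage
        selected.append(best)
        covered |= set(best.lower().split()) & guidance_words
        used += len(best.split())
        remaining.remove(best)
    return selected


views = [
    "the model factorizes summarization",
    "we thank the reviewers",
    "budget guidance controls summary length",
]
summary = select_views(views, "summarization budget model length", budget=10)
print(" ".join(summary))
```

Lowering `budget` shortens the output without changing which content is preferred, which is the separation of content and budget decisions that the abstract describes.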
Getting started
Clone this repository and install the dependencies:
```bash
git clone https://github.com/thefonseca/factorsum.git
cd factorsum

# Optional: check out the arXiv version 2205.12486v2 for reproducibility
git checkout 2205.12486v2

# Install dependencies
pip install -r requirements.txt
```
Usage
Example: summarizing a single document using a budget guidance and source content guidance.
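```python
# Import path assumed from the `python -m factorsum.model` entry point used in this README.
from factorsum.model import FactorSum

training_domain = 'arxiv'
budget_guidance = 200  # budget guidance in tokens
model = FactorSum(training_domain)
summary = model.summarize(
    document,  # a document string
    budget_guidance=budget_guidance,
    source_token_budget=budget_guidance,  # number of tokens to use from source document as content guidance
    verbose=True,
)
```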
A command-line tool is provided to explore summary samples and parameters. For instance,
to see the summary for sample 230 from the arXiv test set, use the following command (GPU recommended):
```shell
python -m factorsum.model --doc_id 230 --dataset_name arxiv --split test \
    --budget_guidance=200 --content_guidance_type source
```
It outputs the target abstract, the generated summary, and the evaluation scores.
Colab Playground
A Colab notebook is available for summary generation.
Reproducing the evaluation results
The evaluation procedure relies on the following data:
- The arXiv, PubMed, and GovReport summarization datasets.
- The document views dataset generated by the sampling procedure (refer to Section 2.1 "Sampling Document Views" in the paper).
- The summary views predicted from the document views (see Section 2.1.1 in the paper).
For convenience, we provide all the preprocessed resources, which can be downloaded using this command:
```shell
python -m factorsum.data download
```
Alternatively, you can use the instructions below to prepare the resources from scratch.
Prepare data from scratch
Preprocess the summarization datasets (test splits):
```shell
python -m factorsum.data prepare_dataset scientific_papers arxiv --split test
python -m factorsum.data prepare_dataset scientific_papers pubmed --split test
python -m factorsum.data prepare_dataset ccdv/govreport-summarization govreport --split test
```
Then generate the document views for each dataset:
```shell
python -m factorsum.data prepare_dataset scientific_papers arxiv --split test --sample_type random --sample_factor 5 --views_per_doc 20
python -m factorsum.data prepare_dataset scientific_papers pubmed --split test --sample_type random --sample_factor 5 --views_per_doc 20
python -m factorsum.data prepare_dataset ccdv/govreport-summarization govreport --split test --sample_type random --sample_factor 5 --views_per_doc 20
```
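As a rough illustration of what a document view is, the sketch below draws random subsets of a document's sentences. It assumes uniform sampling without replacement that preserves sentence order; the `sample_document_views` function and its `keep_fraction` parameter are hypothetical and do not correspond to the repository's sampling options. Section 2.1 of the paper defines the actual procedure.

```python
import random


def sample_document_views(sentences, views_per_doc=20, keep_fraction=0.2, seed=17):
    """Return `views_per_doc` random sentence subsets, each kept in document order.

    `keep_fraction` is a hypothetical knob for this sketch only.
    """
    if not sentences:
        return []
    rng = random.Random(seed)  # local generator keeps the sampling reproducible
    n_keep = max(1, round(len(sentences) * keep_fraction))
    views = []
    for _ in range(views_per_doc):
        indices = sorted(rng.sample(range(len(sentences)), n_keep))
        views.append([sentences[i] for i in indices])
    return views


document_sentences = [f"Sentence {i}." for i in range(10)]
document_views = sample_document_views(document_sentences)
print(len(document_views), "views of", len(document_views[0]), "sentences each")
```

Each view is much shorter than the full document, so the summarizer that produces summary views only ever processes short contexts.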
Download the intrinsic importance model checkpoints:
```shell
python -m factorsum.utils download_models --model_dir ./artifacts
```
Currently, the checkpoints are:
- arXiv: model-rs86h5g0:v0
- PubMed: model-cku41vkj:v0
- GovReport: model-2oklw1wt:v0
Finally, generate summary views using the `run_summarization.py` script (slightly adapted from the original Hugging Face script). The following command generates summary views for the arXiv test set using the model checkpoint in `artifacts/model-rs86h5g0:v0`:
```bash
# Set the variables as separate statements so that they are expanded in the arguments below.
MODEL_PATH='artifacts/model-rs86h5g0:v0'
DATASET='arxiv'
SPLIT='test'
python scripts/run_summarization.py \
    --model_name_or_path "${MODEL_PATH}" \
    --do_predict \
    --output_dir output/"${DATASET}-${SPLIT}-summary_views" \
    --per_device_eval_batch_size=8 \
    --overwrite_output_dir \
    --predict_with_generate \
    --validation_file "data/${DATASET}-random_k_5_samples_20_${SPLIT}.csv" \
    --test_file "data/${DATASET}-random_k_5_samples_20_${SPLIT}.csv" \
    --text_column source \
    --summary_column target \
    --generation_max_length 128 \
    --generation_num_beams 4
```
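This writes a `generated_predictions.pkl` file to the `output_dir` folder. To use the summary views, copy this file to the data folder following the naming convention below, where `TRAINING_DOMAIN` is the dataset the intrinsic model was trained on and `RUN_ID` is the identifier in the checkpoint name (for example, `rs86h5g0` for `model-rs86h5g0:v0`):
```shell
cp output/"${DATASET}-${SPLIT}-summary_views/generated_predictions.pkl" data/"${DATASET}-${SPLIT}-summary_views-bart-${TRAINING_DOMAIN}-run=${RUN_ID}.pkl"
```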
For instance, for the arXiv test set in-domain summary views we would have:
```shell
cp output/arxiv-test-summary_views/generated_predictions.pkl data/arxiv-test-summary_views-bart-arxiv-run=rs86h5g0.pkl
```
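The naming convention can also be computed programmatically. The helper below is a minimal sketch: `summary_views_path` is a hypothetical function, not part of the repository, and it assumes that the run identifier is the part of the checkpoint name between `model-` and `:v0`, as the arXiv example suggests.

```python
# Checkpoint names as listed in this README.
CHECKPOINTS = {
    "arxiv": "model-rs86h5g0:v0",
    "pubmed": "model-cku41vkj:v0",
    "govreport": "model-2oklw1wt:v0",
}


def summary_views_path(dataset, split, training_domain, data_dir="data"):
    """Build the expected summary views file name (hypothetical helper)."""
    checkpoint = CHECKPOINTS[training_domain]
    # "model-rs86h5g0:v0" -> "rs86h5g0" (assumed mapping from checkpoint to run id)
    run_id = checkpoint.split(":")[0].split("-", 1)[1]
    return f"{data_dir}/{dataset}-{split}-summary_views-bart-{training_domain}-run={run_id}.pkl"


print(summary_views_path("arxiv", "test", "arxiv"))
print(summary_views_path("arxiv", "test", "pubmed"))  # cross-domain: PubMed model, arXiv data
```

The first call reproduces the destination path of the in-domain example; the second shows the name the same convention would give for a cross-domain run.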
To generate summary views in a cross-domain setting, just set the variables MODEL_PATH and DATASET accordingly.
Hyperparameters
Refer to the file config.py for hyperparameter definitions.
In-domain evaluation
The in-domain summarization results in Table 2 in the paper
are obtained with the following command:
```shell
python -m evaluation.factorsum evaluate --dataset_name arxiv --split test --output_dir output
```
where dataset_name is arxiv, pubmed, or govreport. By default, scores and summary predictions
are saved to the ./output folder.
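For readers unfamiliar with the reported metric, the following is a simplified sketch of a ROUGE-1 F-measure based on unigram overlap. It assumes lowercased whitespace tokenization and no stemming, so its values will differ from those produced by the `rouge_score` package listed in the dependencies; `rouge1_f` is a hypothetical name used only here.

```python
from collections import Counter


def rouge1_f(reference, prediction):
    """Unigram-overlap F-measure between a reference and a predicted summary."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f(
    "we factorize content and budget decisions",
    "we factorize budget decisions",
)
print(f"ROUGE-1 F-measure: {score:.4f}")
```

Because the F-measure balances precision and recall, a summary that is much shorter or much longer than the reference is penalized, which is one reason the budget matters for the reported scores.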
Cross-domain evaluation
(Results in Table 3 of the paper)
To specify the training domain of the intrinsic model, use the training_domain option.
The following example performs cross-domain evaluation on the arXiv dataset, using summary
views generated by a model trained on PubMed.
```shell
python -m evaluation.factorsum evaluate --dataset_name arxiv --split test --training_domain pubmed
```
Varying budget guidance
Results for the experiments with varying budget guidance (Appendix D in the paper) are obtained with the following command:
```shell
python -m evaluation.budgets --dataset_name <dataset_name> --split test
```
where dataset_name is arxiv, pubmed, or govreport.
Baselines
PEGASUS predictions:
```shell
python scripts/run_summarization.py \
    --model_name_or_path google/pegasus-arxiv \
    --do_predict \
    --output_dir /output \
    --per_device_eval_batch_size 4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --generation_max_length 256 \
    --generation_num_beams 8 \
    --val_max_target_length 256 \
    --max_source_length 1024 \
    --dataset_name scientific_papers \
    --dataset_config arxiv \
    --predict_split test
```
BigBird predictions:
```shell
python scripts/run_summarization.py \
    --model_name_or_path google/bigbird-pegasus-large-arxiv \
    --do_predict \
    --output_dir /content/output \
    --per_device_eval_batch_size 4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --report_to none \
    --generation_max_length 256 \
    --generation_num_beams 5 \
    --val_max_target_length 256 \
    --max_source_length 3072 \
    --dataset_name scientific_papers \
    --dataset_config arxiv \
    --predict_split test
```
Training the intrinsic importance model
First, make sure the data for all splits are available (processing of the training sets might take several minutes):
```bash
python -m factorsum.data prepare_dataset scientific_papers arxiv
python -m factorsum.data prepare_dataset scientific_papers pubmed
python -m factorsum.data prepare_dataset ccdv/govreport-summarization govreport
```
Then run the training script as follows:
```bash
# Set the variable as a separate statement so that it is expanded in the arguments below.
DATASET='arxiv'
python scripts/run_summarization.py \
    --model_name_or_path facebook/bart-base \
    --do_train \
    --do_eval \
    --do_predict \
    --output_dir output/"${DATASET}"-k_5_samples_20 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --predict_with_generate \
    --gradient_accumulation_steps 4 \
    --generation_max_length 128 \
    --generation_num_beams 4 \
    --val_max_target_length 128 \
    --max_source_length 1024 \
    --max_target_length 128 \
    --fp16 \
    --save_total_limit 2 \
    --save_strategy steps \
    --evaluation_strategy steps \
    --save_steps 5000 \
    --eval_steps 5000 \
    --max_steps 50000 \
    --learning_rate 5e-5 \
    --report_to wandb \
    --metric_for_best_model eval_rouge1_fmeasure \
    --load_best_model_at_end \
    --max_train_samples 4000000 \
    --max_eval_samples 10000 \
    --max_predict_samples 10000 \
    --train_file data/"${DATASET}"-random_k_5_samples_20_train.csv \
    --validation_file data/"${DATASET}"-random_k_5_samples_20_validation.csv \
    --test_file data/"${DATASET}"-random_k_5_samples_20_test.csv \
    --text_column source \
    --summary_column target \
    --seed 17
```
Note: to use mixed precision (--fp16) you need a compatible CUDA device.
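As a quick sanity check on the training settings, the arithmetic below derives the effective batch size and the number of training samples processed from the flags in the training command. The single-device setting (`num_devices = 1`) is an assumption; scale it to match your hardware.

```python
# Values taken from the training command in this README.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 50_000
eval_steps = 5_000

num_devices = 1  # assumption: single GPU

# Samples contributing to each optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total training samples processed over the whole run.
samples_seen = effective_batch_size * max_steps
# Number of evaluation (and checkpoint) points during training.
num_evaluations = max_steps // eval_steps

print(f"effective batch size: {effective_batch_size}")
print(f"training samples processed: {samples_seen:,}")
print(f"evaluations during training: {num_evaluations}")
```

Under these assumptions the run processes 800,000 samples, which is below the `--max_train_samples 4000000` cap, so `--max_steps` is what ends training.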
Citation
```bibtex
@inproceedings{fonseca2022factorizing,
  author = {Fonseca, Marcio and Ziser, Yftah and Cohen, Shay B.},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  location = {Abu Dhabi},
  publisher = {Association for Computational Linguistics},
  title = {Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents},
  year = {2022}
}
```
Owner
- Name: Marcio Fonseca
- Login: fonsc
- Kind: user
- Location: Edinburgh, UK
- Company: University of Edinburgh
- Website: marciofonseca.me
- Repositories: 5
- Profile: https://github.com/fonsc
Citation (CITATION.bib)
@inproceedings{fonseca-etal-2022-factorizing,
title = "Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents",
author = "Fonseca, Marcio and
Ziser, Yftah and
Cohen, Shay B.",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.426",
pages = "6341--6364",
abstract = "We argue that disentangling content selection from the budget used to cover salient content improves the performance and applicability of abstractive summarizers. Our method, FactorSum, does this disentanglement by factorizing summarization into two steps through an energy function: (1) generation of abstractive summary views covering salient information in subsets of the input document (document views); (2) combination of these views into a final summary, following a budget and content guidance. This guidance may come from different sources, including from an advisor model such as BART or BigBird, or in oracle mode {--} from the reference. This factorization achieves significantly higher ROUGE scores on multiple benchmarks for long document summarization, namely PubMed, arXiv, and GovReport. Most notably, our model is effective for domain adaptation. When trained only on PubMed samples, it achieves a 46.29 ROUGE-1 score on arXiv, outperforming PEGASUS trained in domain by a large margin. Our experimental results indicate that the performance gains are due to more flexible budget adaptation and processing of shorter contexts provided by partial document views.",
}
Dependencies
- accelerate >=0.17.1
- datasets >=2.9.1
- diskcache >=5.4.0
- fire >=0.5.0
- gdown >=4.5.1
- optimum >=1.6.3
- p_tqdm >=1.4.0
- requests >=2.28.1
- rich >=13.3.2
- rouge_score >=0.1.2
- summa >=1.2.0
- textdistance >=4.5.0
- torch >=2.0.0
- transformers >=4.27.1
- wandb >=0.13.4
- accelerate ==0.16.0
- datasets ==2.9.0
- diskcache ==5.4.0
- fire ==0.4.0
- gdown ==4.5.1
- optimum ==1.6.3
- p_tqdm ==1.4.0
- requests ==2.28.1
- rich ==13.0.1
- rouge-score ==0.1.2
- summa ==1.2.0
- textdistance ==4.5.0
- transformers ==4.26.0
- wandb ==0.13.4