https://github.com/disi-unibo-nlp/sci-lay
Science Score: 13.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: disi-unibo-nlp
- Language: Python
- Default Branch: master
- Size: 10.2 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization
Abstract: Scientific document summarization (SDS) aims to condense complex and long articles in both technical and plain-language terms to facilitate the accessibility and dissemination of scientific findings. Existing datasets suffer from a deficiency in source heterogeneity, as their data predominantly stem from a single common resource, hindering effective model training and generalizability. First, we introduce SciLay, a novel dataset that includes documents from multiple natural science journals with expert-authored technical and lay summaries. Second, we propose PrunePert, a new transformer-based model that incorporates a differentiable perturbed top-k encoder layer to prune irrelevant tokens in end-to-end learning. Experimental results show that our model achieves a nearly 2x speed-up compared to a state-of-the-art linear transformer, remaining comparable in effectiveness. Additional examinations underscore the importance of employing a training dataset that includes different sources to enhance the generalizability of the models.
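The differentiable perturbed top-k layer mentioned in the abstract is not spelled out here; a minimal NumPy sketch of perturbed top-k selection (averaging hard top-k masks over Gaussian-perturbed scores) conveys the idea. The function name, sampling scheme, and defaults below are illustrative assumptions, not the repository's actual implementation:

```python
import numpy as np

def perturbed_topk(scores, k, n_samples=200, sigma=0.2, seed=None):
    """Soft top-k mask via Gaussian perturbation (illustrative sketch).

    Draws noisy copies of the token scores, takes a hard top-k per copy,
    and averages the resulting 0/1 masks. The Monte-Carlo average is a
    smooth function of the scores, so gradients can flow through it
    during training, while a single hard top-k can be used at inference.
    """
    rng = np.random.default_rng(seed)
    d = scores.shape[-1]
    # Noisy copies of the scores: shape (n_samples, d)
    noisy = scores + sigma * rng.standard_normal((n_samples, d))
    # Indices of the k largest entries in each noisy copy
    idx = np.argpartition(-noisy, k - 1, axis=-1)[:, :k]
    # Hard 0/1 selection mask per sample
    masks = np.zeros_like(noisy)
    np.put_along_axis(masks, idx, 1.0, axis=-1)
    # Monte-Carlo average: values in [0, 1], summing to k
    return masks.mean(axis=0)

# Toy token scores: indices 0 and 3 clearly dominate
scores = np.array([3.0, 1.0, 0.5, 2.5, -1.0])
mask = perturbed_topk(scores, k=2, n_samples=1000, sigma=0.2, seed=0)
```

With a small sigma relative to the score gaps, the soft mask concentrates on the true top-k tokens; larger sigma spreads mass across borderline tokens, which is what makes the selection trainable end to end.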
SciLay
SciLay is publicly available on HuggingFace: https://huggingface.co/datasets/disi-unibo-nlp/SciLay
PrunePert
Getting Started
# install the Python dependencies
pip install -r requirements.txt
# clone BARTScore, used as an evaluation metric
git clone https://github.com/neulab/BARTScore.git
Pegasus + PrunePert Fine-tuning
CUDA_VISIBLE_DEVICES=0 python3 run_generation.py \
--logging online \
--run_name pegasus_hard_top50%_200n_0.2sigma_decr_layer3_all_technical_text \
--use_topk \
--decr_sigma \
--n_samples 200 \
--top_p 0.50 \
--sigma 0.2 \
--encoder_topk_layer 3 \
--topk_inference hard \
--do_train \
--do_predict \
--do_val \
--output_dir output_technical_text \
--dataset_name ccdv/arxiv-summarization \
--model_name_or_path ccdv/lsg-pegasus-large-4096 \
--log_level error \
--gradient_accumulation_steps 1 \
--max_target_length 512 \
--generation_max_length 512 \
--num_train_epochs 1 \
--learning_rate 5e-5 \
--save_strategy epoch \
--evaluation_strategy epoch \
--gradient_checkpointing \
--load_best_model_at_end \
--predict_with_generate \
--overwrite_cache \
--metric_for_best_model eval_rouge1 \
--save_total_limit 1 \
--group_by_length \
--sortish_sampler \
--weight_decay 0.01 \
--label_smoothing_factor 0.1 \
--include_inputs_for_metrics \
--remove_unused_columns \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1
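The `--decr_sigma` flag together with `--sigma 0.2` suggests the perturbation noise is annealed over the course of training, though the exact schedule is not documented here. A purely hypothetical linear decay might look like:

```python
def decayed_sigma(step, total_steps, sigma_start=0.2, sigma_end=0.0):
    """Hypothetical linear annealing of the perturbation noise sigma.

    The repository's actual --decr_sigma schedule may differ; this only
    illustrates the general idea of shrinking the noise as training
    progresses, so the soft top-k mask hardens toward a discrete
    selection by the end of training.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return sigma_start + (sigma_end - sigma_start) * frac
```

Annealing sigma trades exploration early in training (smooth, high-entropy masks) for a near-hard selection late in training, matching the `--topk_inference hard` setting used at prediction time.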
Owner
- Name: DISI UniBo NLP
- Login: disi-unibo-nlp
- Kind: user
- Location: Italy
- Website: https://disi-unibo-nlp.github.io/
- Repositories: 20
- Profile: https://github.com/disi-unibo-nlp
NLU Research Group at the Department of Computer Science and Engineering (DISI), University of Bologna
GitHub Events
Total
- Issues event: 1
- Watch event: 2
Last Year
- Issues event: 1
- Watch event: 2