https://github.com/disi-unibo-nlp/sci-lay

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.4%) to scientific vocabulary
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: disi-unibo-nlp
  • Language: Python
  • Default Branch: master
  • Size: 10.2 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme

README.md

What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization

Abstract: Scientific document summarization (SDS) aims to condense complex and long articles in both technical and plain-language terms to facilitate the accessibility and dissemination of scientific findings. Existing datasets suffer from a deficiency in source heterogeneity, as their data predominantly stem from a single common resource, hindering effective model training and generalizability. First, we introduce SciLay, a novel dataset that includes documents from multiple natural science journals with expert-authored technical and lay summaries. Second, we propose PrunePert, a new transformer-based model that incorporates a differentiable perturbed top-k encoder layer to prune irrelevant tokens in end-to-end learning. Experimental results show that our model achieves a nearly 2x speed-up compared to a state-of-the-art linear transformer, remaining comparable in effectiveness. Additional examinations underscore the importance of employing a training dataset that includes different sources to enhance the generalizability of the models.
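The differentiable perturbed top-k selection described in the abstract can be sketched with a Monte-Carlo estimator: add Gaussian noise to the token scores several times, take a hard top-k on each noisy copy, and average the 0/1 indicator vectors into a soft mask. This is a minimal illustrative sketch, not the repository's code; the function name is hypothetical, and the defaults `n_samples=200` and `sigma=0.2` simply mirror the `--n_samples` and `--sigma` flags used in the fine-tuning command in this README.

```python
import numpy as np

def perturbed_topk_mask(scores, k, n_samples=200, sigma=0.2, rng=None):
    """Monte-Carlo estimate of a soft (differentiable) top-k token mask.

    Each sample perturbs the scores with Gaussian noise and takes a hard
    top-k; averaging the resulting indicator vectors yields a mask in
    [0, 1] whose entries sum exactly to k. Clearly dominant tokens get
    weight ~1, borderline tokens get fractional weight.
    """
    rng = np.random.default_rng(rng)
    scores = np.asarray(scores, dtype=float)
    mask = np.zeros_like(scores)
    for _ in range(n_samples):
        noisy = scores + sigma * rng.standard_normal(scores.shape)
        topk = np.argsort(noisy)[-k:]       # indices of the k largest noisy scores
        onehot = np.zeros_like(scores)
        onehot[topk] = 1.0
        mask += onehot
    return mask / n_samples

# Hypothetical token scores: tokens 0 and 2 clearly dominate, so the
# soft mask concentrates on them.
mask = perturbed_topk_mask([3.0, 0.1, 2.9, -1.0, 0.2], k=2, sigma=0.2, rng=0)
```

In a transformer encoder layer, a mask like this would be used to prune (down-weight) the tokens outside the top-k during training, while inference can switch to a plain hard top-k, as the `--topk_inference hard` flag below suggests.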

SciLay

SciLay is publicly available on HuggingFace: https://huggingface.co/datasets/disi-unibo-nlp/SciLay

PrunePert

Getting Started

```shell
pip install -r requirements.txt
git clone https://github.com/neulab/BARTScore.git
```

Pegasus + PrunePert Fine-tuning

```shell
CUDA_VISIBLE_DEVICES=0 python3 run_generation.py \
  --logging online \
  --run_name pegasus_hard_top50%_200n_0.2sigma_decr_layer3_all_technical_text \
  --use_topk \
  --decr_sigma \
  --n_samples 200 \
  --top_p 0.50 \
  --sigma 0.2 \
  --encoder_topk_layer 3 \
  --topk_inference hard \
  --do_train \
  --do_predict \
  --do_val \
  --output_dir output_technical_text \
  --dataset_name ccdv/arxiv-summarization \
  --model_name_or_path ccdv/lsg-pegasus-large-4096 \
  --log_level error \
  --gradient_accumulation_steps 1 \
  --max_target_length 512 \
  --generation_max_length 512 \
  --num_train_epochs 1 \
  --learning_rate 5e-5 \
  --save_strategy epoch \
  --evaluation_strategy epoch \
  --gradient_checkpointing \
  --load_best_model_at_end \
  --predict_with_generate \
  --overwrite_cache \
  --metric_for_best_model eval_rouge1 \
  --save_total_limit 1 \
  --group_by_length \
  --sortish_sampler \
  --weight_decay 0.01 \
  --label_smoothing_factor 0.1 \
  --include_inputs_for_metrics \
  --remove_unused_columns \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1
```

Owner

  • Name: DISI UniBo NLP
  • Login: disi-unibo-nlp
  • Kind: user
  • Location: Italy

NLU Research Group @ University of Bologna @ Department of Computer Science and Engineering (DISI)

GitHub Events

Total
  • Issues event: 1
  • Watch event: 2
Last Year
  • Issues event: 1
  • Watch event: 2