Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: DiLi-Lab
- Language: Python
- Default Branch: main
- Size: 2.9 MB
Statistics
- Stars: 14
- Watchers: 6
- Forks: 3
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
ScanDL

This repository contains the code to reproduce the experiments in ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts.
Summary
- Our proposed model ScanDL is the first diffusion model for synthetic scanpath generation
- ScanDL is able to exhibit human-like reading behavior

Setup
Clone this repository
```bash
git clone git@github.com:dili-lab/scandl
cd scandl
```
Install requirements
The code is based on PyTorch and the Hugging Face libraries.
```bash
pip install -r requirements.txt
```
Download data
The CELER data can be downloaded from this link; follow the instructions provided there.
The ZuCo data can be downloaded from this OSF repository. You can use scripts/get_zuco_data.sh to download the ZuCo data automatically. Note that ZuCo is a large dataset and requires a lot of storage.
Make sure you adapt the path to the folder that contains both the CELER and the ZuCo data in the file CONSTANTS.py. If you use the above bash script scripts/get_zuco_data.sh, the ZuCo path is data/.
Make sure there are no whitespaces in the ZuCo directory names (there might be when you download the data). You might want to check sp_load_celer_zuco.load_zuco() for the expected spelling of the directories.
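Whitespace in the downloaded directory names can also be stripped programmatically. The sketch below operates on a throwaway demo_zuco directory, since the actual layout depends on the download; verify the expected names against sp_load_celer_zuco.load_zuco() before renaming anything.

```python
import os

# Strip whitespace from directory names (demo layout only; the real ZuCo
# layout depends on the download, so check load_zuco() for expected names).
root = "demo_zuco"
os.makedirs(os.path.join(root, "task 1"), exist_ok=True)
for name in os.listdir(root):
    if " " in name:
        os.rename(os.path.join(root, name),
                  os.path.join(root, name.replace(" ", "")))
print(sorted(os.listdir(root)))  # ['task1']
```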
Preprocess data
Preprocessing the eye-tracking data takes time. It is thus recommended to perform the preprocessing once for each setting and save the preprocessed data in a directory processed_data.
This not only saves time if training is performed several times but it also ensures the same data splits for each training run in the same setting.
For preprocessing and saving the data, run
```bash
python -m scripts.create_data_splits
```
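The point of preprocessing once and saving the result is that every training run in a setting sees identical splits. Conceptually, that guarantee comes from deterministic splitting, as in this generic sketch (not the repository's actual splitting code):

```python
import random

def split_ids(ids, train_frac=0.8, seed=77):
    """Deterministically shuffle and split IDs; the same seed yields the same split."""
    rng = random.Random(seed)
    shuffled = sorted(ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Two calls with the same seed produce identical train/test splits.
train_a, test_a = split_ids(range(10))
train_b, test_b = split_ids(range(10))
print(train_a == train_b and test_a == test_b)  # True
```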
Training
Execute the following commands to perform the training.
Notes
- To execute the training commands, you need GPUs set up with CUDA.
- --nproc_per_node indicates the number of GPUs over which you want to split training.
- If you want to start multiple training processes at the same time, choose a different --master_port for each of them.
- --load_train_data processed_data means that the preprocessed data is loaded from the folder processed_data. If the data has not been preprocessed and saved, omit this argument.
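The master-port requirement for concurrent runs can be satisfied mechanically; a trivial sketch (12233 is the base port used in the commands below, the offsets are arbitrary):

```python
# Derive distinct --master_port values for concurrent training runs.
# 12233 is the base port from the commands in this README; offsets are arbitrary.
base_port = 12233
ports = [base_port + i for i in range(3)]
print(ports)  # [12233, 12234, 12235]
```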
Training Commands
To execute the training commands below, you need GPUs set up with CUDA.
New Reader setting
```bash
python -m torch.distributed.launch \
--nproc_per_node=4 \
--master_port=12233 \
--use_env scripts/sp_run_train.py \
--corpus celer \
--inference cv \
--load_train_data processed_data \
--num_transformer_heads 8 \
--num_transformer_layers 12 \
--hidden_dim 256 \
--noise_schedule sqrt \
--learning_steps 80000 \
--log_interval 500 \
--eval_interval 500 \
--save_interval 5000 \
--data_split_criterion reader
```
New Sentence setting
```bash
python -m torch.distributed.launch \
--nproc_per_node=4 \
--master_port=12233 \
--use_env scripts/sp_run_train.py \
--corpus celer \
--inference cv \
--load_train_data processed_data \
--num_transformer_heads 8 \
--num_transformer_layers 12 \
--hidden_dim 256 \
--noise_schedule sqrt \
--learning_steps 80000 \
--log_interval 500 \
--eval_interval 500 \
--save_interval 5000 \
--data_split_criterion sentence
```
New Reader/New Sentence setting
```bash
python -m torch.distributed.launch \
--nproc_per_node=4 \
--master_port=12233 \
--use_env scripts/sp_run_train.py \
--corpus celer \
--inference cv \
--load_train_data processed_data \
--num_transformer_heads 8 \
--num_transformer_layers 12 \
--hidden_dim 256 \
--noise_schedule sqrt \
--learning_steps 80000 \
--log_interval 500 \
--eval_interval 500 \
--save_interval 5000 \
--data_split_criterion combined
```
Cross-dataset setting
```bash
python -m torch.distributed.launch \
--nproc_per_node=4 \
--master_port=12233 \
--use_env scripts/sp_run_train.py \
--corpus celer \
--inference zuco \
--load_train_data processed_data \
--num_transformer_heads 8 \
--num_transformer_layers 12 \
--hidden_dim 256 \
--noise_schedule sqrt \
--learning_steps 80000 \
--log_interval 500 \
--eval_interval 500 \
--save_interval 5000 \
--notes cross_dataset \
--data_split_criterion scanpath
```
Ablation: without positional embedding and BERT embedding (New Reader/New Sentence)
```bash
python -m torch.distributed.launch \
--nproc_per_node=4 \
--master_port=12233 \
--use_env scripts/sp_run_train_ablation.py \
--corpus celer \
--inference cv \
--load_train_data processed_data \
--num_transformer_heads 8 \
--num_transformer_layers 12 \
--hidden_dim 256 \
--noise_schedule sqrt \
--learning_steps 80000 \
--log_interval 50 \
--eval_interval 500 \
--save_interval 5000 \
--data_split_criterion combined \
--notes ablation-no-pos-bert
```
Ablation: without condition (sentence): unconditional scanpath generation (New Reader/New Sentence)
```bash
python -m torch.distributed.launch \
--nproc_per_node=4 \
--master_port=12233 \
--use_env scripts/sp_run_train_ablation_no_condition.py \
--corpus celer \
--inference cv \
--num_transformer_heads 8 \
--num_transformer_layers 12 \
--hidden_dim 256 \
--noise_schedule sqrt \
--learning_steps 80000 \
--log_interval 50 \
--eval_interval 500 \
--save_interval 5000 \
--data_split_criterion combined \
--notes ablation-no-condition
```
Ablation: cosine noise schedule (New Reader/New Sentence)
```bash
python -m torch.distributed.launch \
--nproc_per_node=4 \
--master_port=12233 \
--use_env scripts/sp_run_train.py \
--corpus celer \
--inference cv \
--load_train_data processed_data \
--num_transformer_heads 8 \
--num_transformer_layers 12 \
--hidden_dim 256 \
--noise_schedule cosine \
--learning_steps 80000 \
--log_interval 500 \
--eval_interval 500 \
--save_interval 5000 \
--data_split_criterion combined
```
Ablation: linear noise schedule (New Reader/New Sentence)
```bash
python -m torch.distributed.launch \
--nproc_per_node=4 \
--master_port=12233 \
--use_env scripts/sp_run_train.py \
--corpus celer \
--inference cv \
--load_train_data processed_data \
--num_transformer_heads 8 \
--num_transformer_layers 12 \
--hidden_dim 256 \
--noise_schedule linear \
--learning_steps 80000 \
--log_interval 500 \
--eval_interval 500 \
--save_interval 5000 \
--data_split_criterion combined
```
Inference
NOTES
- Set checkpoint-path to the folder name within the directory that refers to your trained model.
- --no_gpus indicates the number of GPUs across which you split the inference. It is recommended to set it to 1; if inference is split across multiple GPUs, each process produces a separate output file, and these files have to be combined before evaluation can be run on them.
- --bsz is the batch size.
- --cv must be given for the cross-validation settings; it is not given for the cross-dataset setting.
- --load_test_data processed_data is given if the data has already been preprocessed, split, and saved before training; otherwise, omit it. It is never given for the ablation case of unconditional scanpath generation.
- If you run several inference processes at the same time, make sure to choose a different --seed for each of them.
- During training, the model is saved at many checkpoints. If you want to run inference on every checkpoint, omit the argument --run_only_on. However, inference is quite costly in terms of time, so it is sensible to run inference only on specific checkpoints; for that purpose, the exact path to the saved model must be given.
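If inference is nevertheless split across several GPUs, the per-process output files must be combined before evaluation. A minimal, hypothetical sketch of such a merge (the rank-based file naming here is illustrative, not the repository's actual scheme):

```python
import glob

def merge_outputs(pattern: str, merged_path: str) -> int:
    """Concatenate line-based output files matching `pattern` into one file.

    Returns the number of files merged. The naming scheme is hypothetical.
    """
    paths = sorted(glob.glob(pattern))
    with open(merged_path, "w") as out:
        for path in paths:
            with open(path) as f:
                out.write(f.read())
    return len(paths)

# Throwaway demo with two fake per-process files.
for rank in range(2):
    with open(f"demo_decode.rank{rank}.txt", "w") as f:
        f.write(f"scanpath-from-rank-{rank}\n")
n = merge_outputs("demo_decode.rank*.txt", "demo_merged.txt")
print(n)  # 2
```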
Inference Commands
Adapt the following paths/variables:
- [MODEL_DIR]
- [FOLD_IDX]
- [STEPS]
For the following settings:
- New Reader
- New Sentence
- New Reader/New Sentence
- Ablation: cosine noise schedule (New Reader/New Sentence)
- Ablation: linear noise schedule (New Reader/New Sentence)
```bash
python -u scripts/sp_run_decode.py \
--model_dir checkpoint-path/[MODEL_DIR] \
--seed 60 \
--split test \
--cv \
--no_gpus 1 \
--bsz 24 \
--run_only_on 'checkpoint-path/[MODEL_DIR]/fold-[FOLD_IDX]/ema_0.9999_0[STEPS].pt' \
--load_test_data processed_data
```
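For concreteness, the placeholder pattern in --run_only_on expands as follows; the model directory, fold index, and step count below are made-up examples:

```python
# Expand the --run_only_on placeholder pattern; the values are illustrative.
model_dir, fold_idx, steps = "scandl_celer_cv", 0, 80000
ckpt = f"checkpoint-path/{model_dir}/fold-{fold_idx}/ema_0.9999_0{steps}.pt"
print(ckpt)  # checkpoint-path/scandl_celer_cv/fold-0/ema_0.9999_080000.pt
```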
Cross-dataset:
```bash
python -u scripts/sp_run_decode.py \
--model_dir checkpoint-path/[MODEL_DIR] \
--seed 60 \
--split test \
--no_gpus 1 \
--bsz 24 \
--run_only_on 'checkpoint-path/[MODEL_DIR]/ema_0.9999_0[STEPS].pt' \
--load_test_data processed_data
```
Ablation: without positional embedding and BERT embedding (New Reader/New Sentence)
```bash
python -u scripts/sp_run_decode_ablation.py \
--model_dir checkpoint-path/[MODEL_DIR] \
--seed 60 \
--split test \
--cv \
--no_gpus 1 \
--bsz 24 \
--load_test_data processed_data \
--run_only_on 'checkpoint-path/[MODEL_DIR]/fold-[FOLD_IDX]/ema_0.9999_0[STEPS].pt'
```
Ablation: without condition (sentence): unconditional scanpath generation (New Reader/New Sentence)
```bash
python -u scripts/sp_run_decode_ablation_no_condition.py \
--model_dir checkpoint-path/[MODEL_DIR] \
--seed 60 \
--split test \
--cv \
--no_gpus 1 \
--bsz 24 \
--run_only_on 'checkpoint-path/[MODEL_DIR]/fold-[FOLD_IDX]/ema_0.9999_0[STEPS].pt'
```
Evaluation
To run the evaluation on the ScanDL output, again specify the model directory [MODEL_DIR] within generation_outputs.
The argument --cv should be used for the evaluation on all cross-validation settings.
For all cases except for the Cross-dataset:
```bash
python -m scripts.sp_eval --generation_outputs [MODEL_DIR] --cv
```
For the Cross-dataset setting:
```bash
python -m scripts.sp_eval --generation_outputs [MODEL_DIR]
```
Psycholinguistic Analysis
To run the psycholinguistic analysis, first compute reading measures as well as psycholinguistic effects:
Set MODEL_DIR to be the model directory in generation_outputs.
NOTES
- --seed should be the same seed as used during inference.
- Set --setting to 'reader' for the New Reader setting, 'sentence' for the New Sentence setting, 'combined' for the New Reader/New Sentence setting, and 'cross_dataset' for the cross-dataset setting (train on CELER, test on ZuCo).
- Set --steps to the number of training steps of the saved model checkpoint on which you ran inference (e.g., 80000).
```bash
python model_analyses/psycholinguistic_analysis.py --model [MODEL_DIR] --steps [N_STEPS] --setting [SETTING] --seed [SEED]
```
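As a quick reference, the --setting values map onto the paper's evaluation settings like so (the dictionary is just for illustration):

```python
# Map --setting values to the evaluation settings described above.
SETTINGS = {
    "reader": "New Reader",
    "sentence": "New Sentence",
    "combined": "New Reader/New Sentence",
    "cross_dataset": "Cross-dataset (train on CELER, test on ZuCo)",
}
print(SETTINGS["cross_dataset"])
```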
The reading measure files will be stored in the directory pl_analysis/reading_measures.
To fit the generalized linear models, run
```bash
Rscript --vanilla model_analyses/compute_effects.R --setting [SETTING] --steps [N_STEPS]
```
The fitted models will be saved as RDS-files in the directory model_fits.
To compare the effect sizes between the different models, run
```bash
Rscript --vanilla model_analyses/analyze_fit.R --setting [SETTING] --steps [N_STEPS]
```
Citation
If you are using ScanDL, please consider citing our work:
```bibtex
@inproceedings{bolliger2023scandl,
  author    = {Bolliger, Lena S. and Reich, David R. and Haller, Patrick and Jakobi, Deborah N. and Prasse, Paul and J{\"a}ger, Lena A.},
  title     = {{S}can{DL}: {A} Diffusion Model for Generating Synthetic Scanpaths on Texts},
  booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing},
  year      = {2023},
  publisher = {Association for Computational Linguistics},
}
```
Acknowledgements
As indicated in the paper, our code is based on the implementation of DiffuSeq.
Owner
- Name: Digital Linguistics Lab, Department of Computational Linguistics, University of Zurich
- Login: DiLi-Lab
- Kind: organization
- Email: jaeger@cl.uzh.ch
- Repositories: 1
- Profile: https://github.com/DiLi-Lab
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this work, please cite it as below."
title: "ScanDL: A diffusion model for generating synthetic scanpaths on texts"
authors:
- family-names: Bolliger
given-names: Lena S.
- family-names: Reich
given-names: David R.
- family-names: Haller
given-names: Patrick
- family-names: Jakobi
given-names: Deborah N.
- family-names: Prasse
given-names: Paul
- family-names: Jäger
given-names: Lena A.
date-released: 2023-12-01
conference: "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)"
location: "Singapore, Singapore"
publisher: "Association for Computational Linguistics"
abstract: "Eye movements in reading play a crucial role in psycholinguistic research studying the cognitive mechanisms underlying human language processing. More recently, the tight coupling between eye movements and cognition has also been leveraged for language-related machine learning tasks such as the interpretability, enhancement, and pre-training of language models, as well as the inference of reader- and text-specific properties. However, scarcity of eye movement data and its unavailability at application time poses a major challenge for this line of research. Initially, this problem was tackled by resorting to cognitive models for synthesizing eye movement data. However, for the sole purpose of generating human-like scanpaths, purely data-driven machine-learning-based methods have proven to be more suitable. Following recent advances in adapting diffusion processes to discrete data, we propose ScanDL, a novel discrete sequence-to-sequence diffusion model that generates synthetic scanpaths on texts. By leveraging pre-trained word representations and jointly embedding both the stimulus text and the fixation sequence, our model captures multi-modal interactions between the two inputs. We evaluate ScanDL within- and across-dataset and demonstrate that it significantly outperforms state-of-the-art scanpath generation methods. Finally, we provide an extensive psycholinguistic analysis that underlines the model's ability to exhibit human-like reading behavior."
GitHub Events
Total
- Issues event: 6
- Watch event: 2
- Issue comment event: 11
- Push event: 1
- Fork event: 2
Last Year
- Issues event: 6
- Watch event: 2
- Issue comment event: 11
- Push event: 1
- Fork event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 7
- Total pull requests: 6
- Average time to close issues: 24 days
- Average time to close pull requests: 2 days
- Total issue authors: 6
- Total pull request authors: 4
- Average comments per issue: 2.29
- Average comments per pull request: 0.0
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 4
- Pull requests: 0
- Average time to close issues: about 1 month
- Average time to close pull requests: N/A
- Issue authors: 3
- Pull request authors: 0
- Average comments per issue: 3.25
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- giuseppecartella (2)
- foxquan (1)
- SiQube (1)
- chuan111111 (1)
- nisar2 (1)
- BaiYunpeng1949 (1)
Pull Request Authors
- SiQube (3)
- prassepaul (1)
- dependabot[bot] (1)
- hallerp (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- blobfile ==2.0.1
- datasets ==2.10.1
- h5py *
- numpy ==1.23.4
- pandas ==1.5.3
- plotly ==5.17.0
- scikit-learn ==1.2.2
- torch ==2.0.0
- tqdm ==4.64.1
- transformers ==4.36.0
- wandb ==0.14.0
- wordfreq *