potec
This repository contains the Potsdam Textbook Corpus (PoTeC) which is a natural reading eye-tracking corpus.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary
Keywords
Repository
This repository contains the Potsdam Textbook Corpus (PoTeC) which is a natural reading eye-tracking corpus.
Basic Info
- Host: GitHub
- Owner: DiLi-Lab
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://osf.io/dn5hp/
- Size: 16.7 MB
Statistics
- Stars: 11
- Watchers: 4
- Forks: 1
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
PoTeC - Potsdam Textbook Corpus
:star2: :star2: If you'd like to ask a question, have noticed a mistake, or want to say anything regarding the data, please use the Discussions tab on GitHub. We're happy to hear your ideas and feedback!
This repository contains the Potsdam Textbook Corpus (PoTeC), a natural reading eye-tracking corpus. Four groups of participants (expert and beginner students of physics and biology) read 12 short texts taken from physics and biology textbooks while their eye movements were monitored. The final dataset contains the reading data of 75 participants, each reading all 12 texts. The study follows a 2x2x2 fully-crossed factorial design:
* Factor 1: Study discipline of the participant, with the levels physics and biology
* Factor 2: Study level of the participant, with the levels beginner and expert
* Factor 3: Text domain, with the levels physics and biology
|          | Physics | Biology |
|----------|---------|---------|
| Beginner | 12      | 16      |
| Expert   | 20      | 27      |
| Total    | 32      | 43      |
The two participant factors (study discipline and study level) are quasi-experimental and manipulated between subjects. The readers' text comprehension as well as their background knowledge of the topics presented in the texts were assessed with multiple-choice questions.
More information is found in the following READMEs:
* preprocessing
* participants
* stimuli
* eye-tracking data
* additional processing
For a detailed description of the data types, format and content, please refer to the CODEBOOK.
Download the data
The data files are stored in an OSF repository. If this GitHub repository has been cloned, they can be downloaded and extracted automatically using the following script:
```bash
# run with python (or python3)
python download_data_files.py

# OR, to extract the files directly:
python download_data_files.py --extract
```
Alternatively, they can be downloaded manually from the OSF repository and extracted into the respective folders.
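For reference, the files can also be fetched programmatically without the helper script. The sketch below is only illustrative: it uses the generic OSF bulk-download endpoint for the project id `dn5hp` (taken from the homepage above), and the output file name is arbitrary; `download_data_files.py` remains the supported route and also places the files into the expected folders.

```python
# Illustrative sketch: download the whole OSF storage of project dn5hp as one
# zip archive. The endpoint below is the generic OSF bulk-download URL, which
# is an assumption here; use download_data_files.py for the supported workflow.
import requests

OSF_ZIP_URL = "https://files.osf.io/v1/resources/dn5hp/providers/osfstorage/?zip="

response = requests.get(OSF_ZIP_URL, stream=True, timeout=60)
response.raise_for_status()

with open("potec_osf_files.zip", "wb") as outfile:
    for chunk in response.iter_content(chunk_size=1 << 20):
        outfile.write(chunk)
```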
pymovements integration
PoTeC is integrated into the pymovements package, which makes it easy to download the raw data and process it further. The following code snippet shows how to download the data:
```bash
pip install pymovements
```

```python
import pymovements as pm

dataset = pm.Dataset('PoTeC', path='data/PoTeC')
dataset.download()
```
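After the download, the same `Dataset` object can be used to load the recordings. The snippet below is a minimal sketch based on the pymovements `Dataset` API as documented (`load()` populates `dataset.gaze` with one `GazeDataFrame` per recording); consult the pymovements documentation for the processing steps you actually need.

```python
import pymovements as pm

dataset = pm.Dataset('PoTeC', path='data/PoTeC')
dataset.download()   # fetch the files from OSF
dataset.load()       # parse the recordings into gaze dataframes

# dataset.gaze holds one GazeDataFrame per recording; inspect the first one
print(len(dataset.gaze))
print(dataset.gaze[0].frame.head())
```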
Note on reading the data files using pandas
The German text p3 includes the word "null". If, for example, the word features are read with pandas, the word "null" is by default interpreted as an NA value. To avoid this behavior, `pd.read_csv` can be called with the following arguments:
```python
import pandas as pd

pd.read_csv('word_features_p3.tsv', sep='\t',
            keep_default_na=False,
            na_values=['#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
                       '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NaN', 'None', 'n/a',
                       'nan', ''])
```
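As a quick sanity check that the workaround behaves as intended, the sketch below only assumes what is stated above, namely that the literal token "null" occurs somewhere in `word_features_p3.tsv`:

```python
import pandas as pd

na_values = ['#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
             '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NaN', 'None', 'n/a',
             'nan', '']

# with keep_default_na=False, the string "null" is kept as text instead of
# being converted to NaN by pandas' default NA handling
word_features = pd.read_csv('word_features_p3.tsv', sep='\t',
                            keep_default_na=False, na_values=na_values)

print((word_features == 'null').any().any())  # expected: True
```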
Data Overview
The data used to create the corpus and the data obtained during the experiments are made available in various stages. The data is stored in separate subfolders, each of which contains a README with more information about the data and how to use it. For a detailed description of the data types, format, and content, please refer to the CODEBOOK.
This repository contains the following data:
* Eye-tracking data
  * raw eye-tracking data
  * preprocessed eye-tracking data
* Stimuli
  * stimuli texts
  * text and background questions
* Anonymized participant data
* Scripts (in Python)
  * scripts to preprocess the data
  * additional scripts that have been used to process the data further
The scripts were run using Python 3.9 with the dependencies specified in the requirements.txt file.
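One standard way to reproduce that environment (assuming Python 3.9 is installed; the virtual-environment name is arbitrary):

```bash
# create an isolated Python 3.9 environment and install the pinned dependencies
python3.9 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```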
Technical set-up
The experiment was run with the following technical set-up:
|                       | Setting                | Value                                                         |
|-----------------------|------------------------|---------------------------------------------------------------|
| Technical set-up      | Eye-tracking device    | EyeLink 1000, desktop-mounted camera system with a 35 mm lens |
|                       | Sampling rate          | 1000 Hz                                                       |
|                       | Monitor size           | 47.5 x 30 cm, 22 inch                                         |
|                       | Monitor resolution     | 1680 x 1050 pixels                                            |
|                       | Eye-to-screen distance | 61 cm                                                         |
|                       | Eye-to-camera distance | 65 cm                                                         |
|                       | Experiment software    | Experiment Builder software provided by SR Research          |
| Stimulus presentation | Background color       | Black                                                         |
|                       | Font color             | White                                                         |
|                       | Font size              | 18                                                            |
|                       | Font                   | Courier (monospaced)                                          |
|                       | Stimulus size          | On average 158 words shown on multiple lines on one page     |
|                       | Number of characters per degree of visual angle (middle of screen) | 2.8              |
|                       | Line spacing           |                                                               |
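For orientation, the 2.8 characters per degree figure can be related to the other values in the table. The sketch below derives pixels per degree of visual angle at the screen centre from the monitor width, horizontal resolution, and viewing distance; the implied character width in pixels is a back-of-the-envelope inference, not a documented value.

```python
import math

# values taken from the technical set-up table above
monitor_width_cm = 47.5
horizontal_resolution_px = 1680
eye_to_screen_cm = 61.0
chars_per_degree = 2.8

# size of one degree of visual angle on the screen (at the screen centre)
cm_per_degree = 2 * eye_to_screen_cm * math.tan(math.radians(0.5))
px_per_cm = horizontal_resolution_px / monitor_width_cm
px_per_degree = cm_per_degree * px_per_cm

# implied width of one character of the monospaced font (an inference)
char_width_px = px_per_degree / chars_per_degree

print(f"{px_per_degree:.1f} px per degree, ~{char_width_px:.1f} px per character")
# roughly 37.7 px per degree and ~13.5 px per character
```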
Stimuli
The stimuli texts are made available via this website. Note that in order to publish the stimuli, we acquired licenses for the texts from the respective publishers. Please find more information on the website.
Stimuli Annotation
The stimuli have been manually annotated with part-of-speech tags and other linguistic information. The annotations are described in a separate file: ANNOTATION.
Citation
```bibtex
@article{potec,
  url     = {\url{https://github.com/DiLi-Lab/PoTeC}},
  author  = {Jakobi, Deborah N. and Kern, Thomas and Reich, David R. and Haller, Patrick and J\"ager, Lena A.},
  title   = {{PoTeC}: A {German} Naturalistic Eye-tracking-while-reading Corpus},
  year    = {2025},
  note    = {In press},
  journal = {Behavior Research Methods}
}
```
Repository Structure
PoTeC data
├── CODEBOOK.md
├── README.md
├── download_data_files.py
├── requirements.txt
├── additional_scripts
│ ├── ADDITIONAL_SCRIPTS.md
│ ├── analyses
│ │ ├── analyses.R
│ │ ├── run_bayesian_models.R
│ │ ├── run_freq_models.R
│ │ ├── count_reader_texts.py
│ │ ├── get_validation_scores.py
│ │ ├── analyse_online_survey.py
│ │ └── visualizations.ipynb
│ ├── codebook
│ │ ├── create_codebook_tables.py
│ │ ├── all_cols_description.csv
│ │ └── all_codebook_texts.csv
│ ├── add_syntax_trees.py
│ ├── annotate_constituency_trees_manually.py
│ ├── compute_reading_measures.py
│ ├── generate_scanpaths.py
│ ├── merge_reading_measures.py
│ ├── surprisal.py
│ ├── get_surprisal.py
│ ├── merge_fixations_and_coordinates.py
│ ├── merge_scanpaths.py
│ └── psycholing_analysis_plots.r
├── eyetracking_data
│ ├── EYETRACKING_DATA.md
│ ├── original_uncorrected_fixation_report.txt
│ ├── asc_files
│ │ └── ...
│ ├── fixations
│ │ └── ...
│ ├── fixations_uncorrected
│ │ └── ...
│ ├── raw_data
│ │ └── ...
│ ├── reading_measures
│ │ └── ...
│ ├── reading_measures_merged
│ │ └── ...
│ ├── scanpaths
│ │ └── ...
│ └── scanpaths_merged
│ └── ...
├── participants
│ ├── README.md
│ ├── ParticipantBriefing.pdf
│   ├── answer_coding_online_survey.csv
│ ├── response_accuracy_online_survey.csv
│ ├── response_data_online_survey.csv
│ ├── participant_response_accuracy.tsv
│ └── participant_data.tsv
├── preprocessing_scripts
│ ├── PREPROCESSING_SCRIPTS.md
│ ├── char_index_to_word_index.py
│ ├── create_word_aoi_limits.py
│ ├── correct_fixations.py
│ ├── split_fixation_report.py
│ ├── asc_to_csv.py
│ ├── aoi_to_word.tsv
│ ├── sent_limits.json
│ └── word_limits.json
└── stimuli
├── ANNOTATION.md
├── STIMULI.md
├── practice_items.txt
├── generate_word_aois.py
├── manually_corrected_dependency_trees.tsv
├── manually_corrected_constituency_trees.tsv
├── uncorrected_dependency_trees.tsv
├── uncorrected_constituency_trees.tsv
├── images
│ └── ...
├── aoi_texts
│ └── ...
├── word_aoi_texts
│ └── ...
├── stimuli
│ ├── stimuli.bib
│ ├── items.tsv
│ └── stimuli.tsv
└── word_features
└── ...
Owner
- Name: Digital Linguistics Lab, Department of Computational Linguistics, University of Zurich
- Login: DiLi-Lab
- Kind: organization
- Email: jaeger@cl.uzh.ch
- Repositories: 1
- Profile: https://github.com/DiLi-Lab
Citation (citation.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
preferred-citation:
authors:
- family-names: "Jakobi"
given-names: "Deborah N."
- family-names: "Kern"
given-names: "Thomas"
- family-names: "Reich"
given-names: "David R."
- family-names: "Haller"
given-names: "Patrick"
- family-names: "Jäger"
given-names: "Lena A."
title: "PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus"
type: generic
year: 2025
journal: "Behavior Research Methods"
note: "in press"
GitHub Events
Total
- Watch event: 4
- Delete event: 1
- Push event: 10
- Create event: 1
Last Year
- Watch event: 4
- Delete event: 1
- Push event: 10
- Create event: 1
Dependencies
- benepar ==0.2.0
- nltk ==3.8.1
- numpy ==1.24.3
- pandas ==2.0.1
- plotly ==5.18.0
- protobuf ==3.20.0
- requests ==2.31.0
- spacy ==3.7.2
- torch ==2.1.0
- tqdm ==4.65.0
- transformers ==4.30.2