potec

This repository contains the Potsdam Textbook Corpus (PoTeC) which is a natural reading eye-tracking corpus.

https://github.com/dili-lab/potec

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary

Keywords

dataset eye-tracking fair-data reading

Repository


Basic Info
  • Host: GitHub
  • Owner: DiLi-Lab
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage: https://osf.io/dn5hp/
  • Size: 16.7 MB
Statistics
  • Stars: 11
  • Watchers: 4
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed 8 months ago

README.md

PoTeC - Potsdam Textbook Corpus

:star2: :star2: If you'd like to ask a question, have noticed a mistake, or want to say anything regarding the data, please use the Discussions tab on GitHub. We're happy to hear your ideas and feedback!

This repository contains the Potsdam Textbook Corpus (PoTeC), a natural reading eye-tracking corpus. Four groups of participants (expert- and beginner-level students of physics and biology) read 12 short texts taken from physics and biology textbooks while their eye movements were monitored. The final dataset contains the reading data of 75 participants, each of whom read all 12 texts. The study follows a 2x2x2 fully-crossed factorial design:

* Factor 1: Study discipline of the participant, with the levels physics or biology
* Factor 2: Study level of the participant, with the levels beginner or expert
* Factor 3: Text domain, with the levels physics or biology

|              | Physics | Biology |
|--------------|---------|---------|
| Beginner     | 12      | 16      |
| Expert       | 20      | 27      |
|              |         |         |
| Total        | 32      | 43      |

The two participant factors (study discipline and study level) are quasi-experimental and vary between subjects. The readers' text comprehension as well as their background knowledge of the topics presented in the texts were assessed by multiple-choice questions.
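As a quick sanity check on the design, the eight cells of the 2x2x2 layout and the group sizes from the table above can be enumerated in a few lines of Python. This is only an illustrative sketch: the variable names are made up for this example, and the counts are copied by hand from the table rather than read from the participant files.

```python
from itertools import product

DISCIPLINES = ("physics", "biology")
LEVELS = ("beginner", "expert")
TEXT_DOMAINS = ("physics", "biology")

# Participants per (study level, study discipline), copied from the table.
GROUP_SIZES = {
    ("beginner", "physics"): 12,
    ("beginner", "biology"): 16,
    ("expert", "physics"): 20,
    ("expert", "biology"): 27,
}

# Every combination of the three factors is one cell of the design.
design_cells = list(product(DISCIPLINES, LEVELS, TEXT_DOMAINS))

# Column totals of the table and the overall number of participants.
per_discipline = {
    d: sum(n for (_, disc), n in GROUP_SIZES.items() if disc == d)
    for d in DISCIPLINES
}
total_participants = sum(GROUP_SIZES.values())

for discipline, level, text_domain in design_cells:
    # Each participant group reads texts from both domains.
    n = GROUP_SIZES[(level, discipline)]
    print(f"{level} {discipline} students reading {text_domain} texts: n={n}")

print("per discipline:", per_discipline)
print("total:", total_participants)
```

Because every participant reads all 12 texts, each participant group appears in two cells, one per text domain.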

More information can be found in the following READMEs:

* preprocessing
* participants
* stimuli
* eye-tracking data
* additional processing

For a detailed description of the data types, format and content, please refer to the CODEBOOK.

Download the data

The data files are stored in an OSF repository. If this GitHub repository has been cloned, they can be downloaded and extracted automatically using the following script:

```bash
# or python3
python download_data_files.py

# OR to extract the files directly
python download_data_files.py --extract
```

Alternatively, they can be downloaded manually from the OSF repository and extracted into the respective folders.
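After downloading and extracting, it can be useful to confirm that the data folders are in place. The sketch below checks for the `eyetracking_data` subfolders named in the repository structure of this README. The helper name `missing_data_dirs` is hypothetical, and which folders are actually filled by the OSF download is an assumption, so adjust the list as needed.

```python
from pathlib import Path

# Subfolders taken from the repository structure listed in this README.
# Assumption: these are the folders that the download populates.
EXPECTED_DIRS = [
    "eyetracking_data/asc_files",
    "eyetracking_data/fixations",
    "eyetracking_data/fixations_uncorrected",
    "eyetracking_data/raw_data",
    "eyetracking_data/reading_measures",
    "eyetracking_data/reading_measures_merged",
    "eyetracking_data/scanpaths",
    "eyetracking_data/scanpaths_merged",
]


def missing_data_dirs(repo_root="."):
    """Return the expected folders that are absent or empty (hypothetical helper)."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        folder = root / rel
        # A folder counts as missing if it does not exist or has no entries.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_data_dirs():
        print(f"missing or empty: {rel}")
```

Run from the repository root, the script prints nothing when all listed folders exist and contain files.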

pymovements integration

PoTeC is integrated into the pymovements package. The package makes it easy to download the raw data and process it further. The following code snippet shows how to download the data:

```python
# pip install pymovements
import pymovements as pm

dataset = pm.Dataset('PoTeC', path='data/PoTeC')

dataset.download()
```

Note on reading the data files using pandas

The German text p3 includes the word "null". If, for example, the word features are read using pandas, the word "null" is interpreted as an NA value by default. To avoid this behavior, `pd.read_csv` can be called with the following arguments:

    import pandas as pd

    pd.read_csv(
        'word_features_p3.tsv',
        sep='\t',
        keep_default_na=False,
        na_values=[
            '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
            '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NaN', 'None', 'n/a',
            'nan', '',
        ],
    )
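The effect can be reproduced without the corpus files. The following minimal sketch reads a small in-memory TSV twice, once with the pandas defaults and once with the arguments given above. The toy rows and the column names `word_index` and `word` are invented for this example and are not taken from the actual word feature files.

```python
import io

import pandas as pd

# Toy stand-in for a word features file (hypothetical columns and rows).
tsv = "word_index\tword\n0\teins\n1\tnull\n2\tzwei\n"

# The NA markers listed above: the pandas defaults without 'null' and 'NULL'.
NA_VALUES = [
    '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
    '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NaN', 'None', 'n/a',
    'nan', '',
]

# Default behavior: the string "null" is converted to a missing value.
default = pd.read_csv(io.StringIO(tsv), sep='\t')

# With the explicit list, "null" survives as an ordinary string.
fixed = pd.read_csv(
    io.StringIO(tsv),
    sep='\t',
    keep_default_na=False,
    na_values=NA_VALUES,
)

print(default['word'].isna().tolist())  # [False, True, False]
print(fixed['word'].tolist())           # ['eins', 'null', 'zwei']
```

Listing the remaining markers explicitly keeps genuinely missing entries, such as empty cells, recognizable as NA.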

Data Overview

The data that was used to create the corpus and that was obtained during the experiments is made available in various stages. The data is stored in separate subfolders, each of which contains a README that provides more information about the data and how to use it. For a detailed description of the data types, format and content, please refer to the CODEBOOK.

This repository contains the following data:

* Eye-tracking data
    * raw eye-tracking data
    * preprocessed eye-tracking data
* Stimuli
    * stimuli texts
    * text and background questions
* Anonymized participant data
* Scripts (in Python)
    * scripts to preprocess the data
    * additional scripts that have been used to process the data further

The scripts were run using Python 3.9 with the dependencies specified in the requirements.txt file.

Technical set-up

The experiment was run with the following technical set-up:

|                       | Setting                                                  | Value                                                         |
|-----------------------|----------------------------------------------------------|---------------------------------------------------------------|
|                       |                                                          |                                                               |
| Technical set-up      | Eye-tracking device                                      | Eyelink 1000, desktop mounted camera system with a 35 mm lens |
|                       | Sampling rate                                            | 1000 Hz                                                       |
|                       | Monitor size                                             | 47.5x30 cm, 22 inch                                           |
|                       | Monitor resolution                                       | 1680x1050 pixels                                              |
|                       | Eye-to-screen distance                                   | 61 cm                                                         |
|                       | Eye-to-camera distance                                   | 65 cm                                                         |
|                       | Experiment software                                      | Experiment Builder software provided by SR Research           |
|                       |                                                          |                                                               |
| Stimulus presentation | Background color                                         | Black                                                         |
|                       | Font color                                               | White                                                         |
|                       | Font size                                                | 18                                                            |
|                       | Font                                                     | Courier (monospaced)                                          |
|                       | Stimulus size                                            | On average 158 words shown on multiple lines on one page      |
|                       | Number of characters per visual angle (middle of screen) | 2.8 characters per degree of visual angle                     |
|                       | Line spacing                                             |                                                               |
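The monitor size, resolution and eye-to-screen distance in the table are enough to convert between pixels and degrees of visual angle. The sketch below does this for the horizontal axis at the screen centre. It assumes a flat screen viewed head-on and a uniform pixel pitch, and the function name `pixels_per_degree` is made up for this example rather than taken from the repository scripts.

```python
import math

# Values copied from the technical set-up table.
SCREEN_WIDTH_CM = 47.5
SCREEN_WIDTH_PX = 1680
EYE_TO_SCREEN_CM = 61.0
CHARS_PER_DEGREE = 2.8


def pixels_per_degree(width_cm, width_px, distance_cm):
    """Pixels spanned by one degree of visual angle centred on the screen."""
    # Length on the screen subtended by one degree, centred at fixation.
    cm_per_degree = 2 * distance_cm * math.tan(math.radians(0.5))
    # Convert that length to pixels using the horizontal pixel density.
    return cm_per_degree * width_px / width_cm


ppd = pixels_per_degree(SCREEN_WIDTH_CM, SCREEN_WIDTH_PX, EYE_TO_SCREEN_CM)
px_per_char = ppd / CHARS_PER_DEGREE

print(f"{ppd:.1f} px per degree, {px_per_char:.1f} px per character")
```

With these values, one degree covers close to 38 pixels, which together with the reported 2.8 characters per degree corresponds to a little over 13 pixels per character.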

Stimuli

The stimuli texts are made available via this website. Note that in order to publish the stimuli, we acquired licenses for the texts from the respective publishers. Please find more information on the website.

Stimuli Annotation

The stimuli have been manually annotated with part-of-speech tags and other linguistic information. The annotations are described in a separate file: ANNOTATION.

Citation

    @article{potec,
        url={\url{https://github.com/DiLi-Lab/PoTeC}},
        author={Jakobi, Deborah N. and Kern, Thomas and Reich, David R. and Haller, Patrick and J\"ager, Lena A.},
        title={{PoTeC}: A {German} Naturalistic Eye-tracking-while-reading Corpus},
        year={2025},
        note={In press},
        journal={Behavior Research Methods}
    }

Repository Structure

PoTeC data
├── CODEBOOK.md
├── README.md
├── download_data_files.py
├── requirements.txt
├── additional_scripts
│   ├── ADDITIONAL_SCRIPTS.md
│   ├── analyses
│   │   ├── analyses.R
│   │   ├── run_bayesian_models.R
│   │   ├── run_freq_models.R
│   │   ├── count_reader_texts.py
│   │   ├── get_validation_scores.py
│   │   ├── analyse_online_survey.py
│   │   └── visualizations.ipynb
│   ├── codebook
│   │   ├── create_codebook_tables.py
│   │   ├── all_cols_description.csv
│   │   └── all_codebook_texts.csv
│   ├── add_syntax_trees.py
│   ├── annotate_constituency_trees_manually.py
│   ├── compute_reading_measures.py
│   ├── generate_scanpaths.py
│   ├── merge_reading_measures.py
│   ├── surprisal.py
│   ├── get_surprisal.py
│   ├── merge_fixations_and_coordinates.py
│   ├── merge_scanpaths.py
│   └── psycholing_analysis_plots.r
├── eyetracking_data
│   ├── EYETRACKING_DATA.md
│   ├── original_uncorrected_fixation_report.txt
│   ├── asc_files
│   │    └── ...
│   ├── fixations
│   │    └── ...
│   ├── fixations_uncorrected
│   │    └── ...
│   ├── raw_data
│   │    └── ...
│   ├── reading_measures
│   │    └── ...
│   ├── reading_measures_merged
│   │    └── ...
│   ├── scanpaths
│   │    └── ...
│   └── scanpaths_merged
│        └── ...
├── participants
│   ├── README.md
│   ├── ParticipantBriefing.pdf
│   ├── answer_coding_online_survey.csv
│   ├── response_accuracy_online_survey.csv
│   ├── response_data_online_survey.csv
│   ├── participant_response_accuracy.tsv
│   └── participant_data.tsv
├── preprocessing_scripts
│   ├── PREPROCESSING_SCRIPTS.md
│   ├── char_index_to_word_index.py
│   ├── create_word_aoi_limits.py
│   ├── correct_fixations.py
│   ├── split_fixation_report.py
│   ├── asc_to_csv.py
│   ├── aoi_to_word.tsv
│   ├── sent_limits.json
│   └── word_limits.json
└── stimuli
    ├── ANNOTATION.md
    ├── STIMULI.md
    ├── practice_items.txt
    ├── generate_word_aois.py
    ├── manually_corrected_dependency_trees.tsv
    ├── manually_corrected_constituency_trees.tsv
    ├── uncorrected_dependency_trees.tsv
    ├── uncorrected_constituency_trees.tsv
    ├── images
    │   └── ...
    ├── aoi_texts
    │   └── ...
    ├── word_aoi_texts
    │   └── ...
    ├── stimuli
    │   ├── stimuli.bib
    │   ├── items.tsv
    │   └── stimuli.tsv
    └── word_features
        └── ...

Owner

  • Name: Digital Linguistics Lab, Department of Computational Linguistics, University of Zurich
  • Login: DiLi-Lab
  • Kind: organization
  • Email: jaeger@cl.uzh.ch

Citation (citation.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
preferred-citation:
  authors:
  - family-names: "Jakobi"
    given-names: "Deborah N."
  - family-names: "Kern"
    given-names: "Thomas"
  - family-names: "Reich"
    given-names: "David R."
  - family-names: "Haller"
    given-names: "Patrick"
  - family-names: "Jäger"
    given-names: "Lena A."
  title: "PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus"
  type: generic
  year: 2025
  journal: "Behavior Research Methods"
  note: "in press"

GitHub Events

Total
  • Watch event: 4
  • Delete event: 1
  • Push event: 10
  • Create event: 1
Last Year
  • Watch event: 4
  • Delete event: 1
  • Push event: 10
  • Create event: 1

Dependencies

requirements.txt pypi
  • benepar ==0.2.0
  • nltk ==3.8.1
  • numpy ==1.24.3
  • pandas ==2.0.1
  • plotly ==5.18.0
  • protobuf ==3.20.0
  • requests ==2.31.0
  • spacy ==3.7.2
  • torch ==2.1.0
  • tqdm ==4.65.0
  • transformers ==4.30.2