potec

This repository contains the Potsdam Textbook Corpus (PoTeC) which is a natural reading eye-tracking corpus.

https://github.com/dili-lab/potec

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary

Keywords

dataset eye-tracking fair-data reading

Last synced: 11 months ago

Repository


Basic Info

Host: GitHub
Owner: DiLi-Lab
Language: Jupyter Notebook
Default Branch: main
Homepage: https://osf.io/dn5hp/
Size: 16.7 MB

Statistics

Stars: 11
Watchers: 4
Forks: 1
Open Issues: 0
Releases: 0

Topics

dataset eye-tracking fair-data reading

Created about 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme Citation

PoTeC - Potsdam Textbook Corpus

:star2: :star2: If you'd like to ask a question, have noticed a mistake, or want to say anything regarding the data, please use the Discussions tab on GitHub. We're happy to hear your ideas and feedback!

This repository contains the Potsdam Textbook Corpus (PoTeC), a natural reading eye-tracking corpus. Four groups of participants (expert- and beginner-level students of physics and biology) read 12 short texts taken from physics and biology textbooks while their eye movements were monitored. The final dataset contains the reading data of 75 participants, each of whom read all 12 texts. The study follows a 2x2x2 fully-crossed factorial design:

* Factor 1: Study discipline of the participant, with the levels physics and biology
* Factor 2: Study level of the participant, with the levels beginner and expert
* Factor 3: Text domain, with the levels physics and biology

|              | Physics | Biology |
|--------------|---------|---------|
| Beginner     | 12      | 16      |
| Expert       | 20      | 27      |
|              |         |         |
| Total        | 32      | 43      |

Factors 1 and 2 (study discipline and study level) are quasi-experimental and vary between subjects. The readers' text comprehension as well as their background knowledge of the topics presented in the texts were assessed with multiple-choice questions.
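The design and the group sizes can be written down compactly. The following is a minimal sketch that enumerates the eight design cells and re-adds the participant counts from the table above; the variable names are illustrative and do not correspond to columns in the PoTeC data files.

```python
from itertools import product

# Levels of the three factors described above.
disciplines = ("physics", "biology")    # Factor 1: study discipline
levels = ("beginner", "expert")         # Factor 2: study level
text_domains = ("physics", "biology")   # Factor 3: text domain

# All combinations of the three two-level factors (2 x 2 x 2 = 8 cells).
cells = list(product(disciplines, levels, text_domains))

# Participant counts per group, copied from the table above.
participants = {
    ("physics", "beginner"): 12,
    ("physics", "expert"): 20,
    ("biology", "beginner"): 16,
    ("biology", "expert"): 27,
}

# Totals per study discipline and overall.
per_discipline = {
    d: sum(n for (disc, _), n in participants.items() if disc == d)
    for d in disciplines
}
total = sum(participants.values())

print(len(cells), per_discipline, total)
```

The per-discipline sums and the overall sum reproduce the "Total" row of the table and the 75 participants mentioned above.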

More information can be found in the following READMEs:

* preprocessing
* participants
* stimuli
* eye-tracking data
* additional processing

For a detailed description of the data types, format and content, please refer to the CODEBOOK.

Download the data

The data files are stored in an OSF repository. If this GitHub repository has been cloned, they can be downloaded and extracted automatically using the following script:

```bash
# or python3
python download_data_files.py

# OR to extract the files directly
python download_data_files.py --extract
```

Alternatively, they can be downloaded manually from the OSF repository and extracted into the respective folders.

`pymovements` integration

PoTeC is integrated into the pymovements package. The package makes it easy to download the raw data and process it further. The following code snippet shows how to download the data:

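```python
# Install the package first, e.g. with: pip install pymovements
import pymovements as pm

# Define the dataset and the local directory it is stored in.
dataset = pm.Dataset('PoTeC', path='data/PoTeC')

# Download the data files.
dataset.download()
```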

Note on reading the data files using `pandas`

The German text p3 includes the word "null". If, for example, the word features are read using pandas, the word "null" is interpreted as an NA value. To avoid this behavior, `pd.read_csv` can be called with the following arguments:

```python
import pandas as pd

pd.read_csv(
    'word_features_p3.tsv',
    sep='\t',
    keep_default_na=False,
    na_values=['#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
               '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NaN', 'None', 'n/a', 'nan', ''],
)
```
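The effect can be reproduced without the actual data files. The following is a minimal sketch that uses a hypothetical two-row TSV held in memory; the column names `word_index` and `word` are assumptions made for illustration and may differ from the real word feature files.

```python
import io

import pandas as pd

# Hypothetical excerpt: one row contains the German word "null".
tsv = "word_index\tword\n1\tnull\n2\tEnergie\n"

# NA markers to keep; "null" and "NULL" are deliberately left out.
na_values = ['#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
             '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NaN', 'None', 'n/a', 'nan', '']

# Default behavior: pandas treats the string "null" as a missing value.
default = pd.read_csv(io.StringIO(tsv), sep='\t')

# With the custom NA list, "null" is kept as an ordinary string.
safe = pd.read_csv(io.StringIO(tsv), sep='\t',
                   keep_default_na=False, na_values=na_values)

print(default['word'].tolist())
print(safe['word'].tolist())
```

Passing an explicit `na_values` list rather than only `keep_default_na=False` keeps the other common NA markers, such as empty cells, recognized as missing.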

Data Overview

The data used to create the corpus and obtained during the experiments is made available at various stages of processing. It is stored in subfolders, each of which contains a README with more information about the data and how to use it. For a detailed description of the data types, format and content, please refer to the CODEBOOK.

This repository contains the following data:

* Eye-tracking data
    * raw eye-tracking data
    * preprocessed eye-tracking data
* Stimuli
    * stimuli texts
    * text and background questions
* Anonymized participant data
* Scripts (in Python)
    * scripts to preprocess the data
    * additional scripts that have been used to process the data further

The scripts were run using Python 3.9 with the dependencies specified in the requirements.txt file.

Technical set-up

The experiment was run with the following technical set-up:

|                       | Setting                                                  | Value                                                         |
|-----------------------|----------------------------------------------------------|---------------------------------------------------------------|
| Technical set-up      | Eye-tracking device                                      | EyeLink 1000, desktop-mounted camera system with a 35 mm lens |
|                       | Sampling rate                                            | 1000 Hz                                                       |
|                       | Monitor size                                             | 47.5x30 cm, 22 inch                                           |
|                       | Monitor resolution                                       | 1680x1050 pixels                                              |
|                       | Eye-to-screen distance                                   | 61 cm                                                         |
|                       | Eye-to-camera distance                                   | 65 cm                                                         |
|                       | Experiment software                                      | Experiment Builder software provided by SR Research           |
|                       |                                                          |                                                               |
| Stimulus presentation | Background color                                         | Black                                                         |
|                       | Font color                                               | White                                                         |
|                       | Font size                                                | 18                                                            |
|                       | Font                                                     | Courier (monospaced)                                          |
|                       | Stimulus size                                            | On average 158 words shown on multiple lines on one page      |
|                       | Number of characters per visual angle (middle of screen) | 2.8 characters per degree of visual angle                     |
|                       | Line spacing                                             |                                                               |
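The monitor geometry in the table is enough to convert between screen pixels and degrees of visual angle, which is often needed when working with raw gaze coordinates. The following is a minimal sketch, assuming a flat screen viewed head-on and gaze near the middle of the screen; the function `pixels_per_degree` is illustrative and not part of the PoTeC scripts.

```python
import math

# Values taken from the technical set-up table.
MONITOR_WIDTH_CM = 47.5
MONITOR_WIDTH_PX = 1680
EYE_TO_SCREEN_CM = 61.0
CHARS_PER_DEGREE = 2.8


def pixels_per_degree(size_cm: float, size_px: int, distance_cm: float) -> float:
    """Return the number of pixels covered by one degree of visual angle
    at the middle of the screen."""
    # Length on the screen subtended by 1 degree, centered on the line of sight.
    cm_per_degree = 2 * distance_cm * math.tan(math.radians(0.5))
    return cm_per_degree * size_px / size_cm


px_per_deg = pixels_per_degree(MONITOR_WIDTH_CM, MONITOR_WIDTH_PX, EYE_TO_SCREEN_CM)

# Approximate horizontal character width implied by 2.8 characters per degree.
char_width_px = px_per_deg / CHARS_PER_DEGREE

print(f"{px_per_deg:.1f} px per degree, about {char_width_px:.1f} px per character")
```

The conversion is only approximate away from the screen center, where one degree of visual angle covers slightly more pixels.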

Stimuli

The stimuli texts are made available via this website. Note that in order to publish the stimuli, we acquired licenses for the texts from the respective publishers. More information is available on the website.

Stimuli Annotation

The stimuli have been manually annotated with part-of-speech tags and other linguistic information. The annotations are described in a separate file: ANNOTATION.

Citation

@article{potec,
    url={\url{https://github.com/DiLi-Lab/PoTeC}},
    author={Jakobi, Deborah N. and Kern, Thomas and Reich, David R. and Haller, Patrick and J\"ager, Lena A.},
    title={{PoTeC}: A {German} Naturalistic Eye-tracking-while-reading Corpus},
    year={2025},
    note={In press},
    journal={Behavior Research Methods}
}

Repository Structure

PoTeC data
├── CODEBOOK.md
├── README.md
├── download_data_files.py
├── requirements.txt
├── additional_scripts
│   ├── ADDITIONAL_SCRIPTS.md
│   ├── analyses
│   │   ├── analyses.R
│   │   ├── run_bayesian_models.R
│   │   ├── run_freq_models.R
│   │   ├── count_reader_texts.py
│   │   ├── get_validation_scores.py
│   │   ├── analyse_online_survey.py
│   │   └── visualizations.ipynb
│   ├── codebook
│   │   ├── create_codebook_tables.py
│   │   ├── all_cols_description.csv
│   │   └── all_codebook_texts.csv
│   ├── add_syntax_trees.py
│   ├── annotate_constituency_trees_manually.py
│   ├── compute_reading_measures.py
│   ├── generate_scanpaths.py
│   ├── merge_reading_measures.py
│   ├── surprisal.py
│   ├── get_surprisal.py
│   ├── merge_fixations_and_coordinates.py
│   ├── merge_scanpaths.py
│   └── psycholing_analysis_plots.r
├── eyetracking_data
│   ├── EYETRACKING_DATA.md
│   ├── original_uncorrected_fixation_report.txt
│   ├── asc_files
│   │    └── ...
│   ├── fixations
│   │    └── ...
│   ├── fixations_uncorrected
│   │    └── ...
│   ├── raw_data
│   │    └── ...
│   ├── reading_measures
│   │    └── ...
│   ├── reading_measures_merged
│   │    └── ...
│   ├── scanpaths
│   │    └── ...
│   └── scanpaths_merged
│        └── ...
├── participants
│   ├── README.md
│   ├── ParticipantBriefing.pdf
│   ├── answer_coding_online_survey.csv
│   ├── response_accuracy_online_survey.csv
│   ├── response_data_online_survey.csv
│   ├── participant_response_accuracy.tsv
│   └── participant_data.tsv
├── preprocessing_scripts
│   ├── PREPROCESSING_SCRIPTS.md
│   ├── char_index_to_word_index.py
│   ├── create_word_aoi_limits.py
│   ├── correct_fixations.py
│   ├── split_fixation_report.py
│   ├── asc_to_csv.py
│   ├── aoi_to_word.tsv
│   ├── sent_limits.json
│   └── word_limits.json
└── stimuli
    ├── ANNOTATION.md
    ├── STIMULI.md
    ├── practice_items.txt
    ├── generate_word_aois.py
    ├── manually_corrected_dependency_trees.tsv
    ├── manually_corrected_constituency_trees.tsv
    ├── uncorrected_dependency_trees.tsv
    ├── uncorrected_constituency_trees.tsv
    ├── images
    │   └── ...
    ├── aoi_texts
    │   └── ...
    ├── word_aoi_texts
    │   └── ...
    ├── stimuli
    │   ├── stimuli.bib
    │   ├── items.tsv
    │   └── stimuli.tsv
    └── word_features
        └── ...

Owner

Name: Digital Linguistics Lab, Department of Computational Linguistics, University of Zurich
Login: DiLi-Lab
Kind: organization
Email: jaeger@cl.uzh.ch

Repositories: 1
Profile: https://github.com/DiLi-Lab

Citation (citation.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
preferred-citation:
  authors:
  - family-names: "Jakobi"
    given-names: "Deborah N."
  - family-names: "Kern"
    given-names: "Thomas"
  - family-names: "Reich"
    given-names: "David R."
  - family-names: "Haller"
    given-names: "Patrick"
  - family-names: "Jäger"
    given-names: "Lena A."
  title: "PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus"
  type: generic
  year: 2025
  journal: "Behavior Research Methods"
  note: "in press"

GitHub Events

Total

Watch event: 4
Delete event: 1
Push event: 10
Create event: 1

Last Year

Watch event: 4
Delete event: 1
Push event: 10
Create event: 1

Dependencies

requirements.txt pypi

benepar ==0.2.0
nltk ==3.8.1
numpy ==1.24.3
pandas ==2.0.1
plotly ==5.18.0
protobuf ==3.20.0
requests ==2.31.0
spacy ==3.7.2
torch ==2.1.0
tqdm ==4.65.0
transformers ==4.30.2
