reading-in-the-wild-columbus

Reading in the Wild - Columbus Subset

https://github.com/aiot-mlsys-lab/reading-in-the-wild-columbus



README.md

Reading in the Wild - Columbus Subset

[📝 Blogpost] [📂 Project Page] [📊 Data Explorer] [📄 Paper]

Introduction

The Reading in the Wild dataset is a first-of-its-kind large-scale multimodal dataset collected using Meta's smart glasses under Project Aria. It contains 100 hours of reading and non-reading egocentric videos in diverse and realistic scenarios, along with eye gaze and head pose data collected during both kinds of activity. The dataset can be used to develop models that identify reading activities and that classify different types of reading activities in real-world scenarios.

The dataset contains two subsets: the Seattle subset and the Columbus subset. This repository is for the Columbus subset. The Seattle subset is maintained separately here.

Overview of Columbus Subset


The Columbus subset contains around 20 hours of data from 31 subjects, covering reading and non-reading activities in indoor scenarios. It was collected for zero-shot experiments. It contains examples of hard negatives (where text is present but is not being read), searching/browsing (which produces confusing gaze patterns), and reading non-English texts (where the reading direction differs).


As summarized in the following chart, the Columbus subset covers reading across three medium types: digital, print, and objects. It also covers three types of content: paragraphs of long continuous text, short texts such as posters and nutrition labels, and non-textual content such as illustrative diagrams.

[Figure: medium types and content types covered by the Columbus subset]

Comparison to Existing Datasets

Compared to existing egocentric video datasets as well as reading datasets, our dataset is the first reading dataset that contains high-frequency eye-gaze, diverse and realistic egocentric videos, and hard negative (HN) samples.

[Figure: comparison with existing egocentric video and reading datasets]

Models

A base model (v1_default) trained on the training data of the Seattle subset can be found here. The model takes three inputs: a 5° FoV RGB crop (64x64) from the RGB camera of the glasses, centered on the wearer's eye gaze; 3D gaze velocities from the eye tracking cameras, sampled at 60 Hz over a 2 s span; and 3D head orientation and velocity from the IMU sensors, sampled at 60 Hz over a 2 s span. The model can work with any combination of these three modalities.


Besides the base model, the following variants are also provided here:

+ v1_1s: uses a shorter 1 s span for gaze data
+ v1_15Hz: uses a lower 15 Hz sampling frequency for gaze data
+ v1_large: uses a larger RGB crop size of 128x128
+ v1_medium: outputs categorical predictions for medium (no-read, print, digital, and objects)
+ v1_mode: outputs categorical predictions for reading mode (no-read, walk, out-loud, engaged, scan, write/type, and skim)
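As a rough illustration of what the variants change, the sketch below computes the number of gaze samples per input window and the RGB crop size for the base model and three of its variants. The dictionary keys and helper names are hypothetical, and the sketch assumes each variant differs from the base model only in the one setting listed for it.

```python
# Settings of the base model as described in the text (key names are hypothetical).
BASE = {"gaze_hz": 60, "gaze_span_s": 2.0, "rgb_crop": 64}

# Assumption: each variant overrides exactly one base setting.
OVERRIDES = {
    "v1_default": {},
    "v1_1s": {"gaze_span_s": 1.0},
    "v1_15Hz": {"gaze_hz": 15},
    "v1_large": {"rgb_crop": 128},
}


def variant_settings(name):
    """Merge the base settings with the overrides of one variant."""
    return {**BASE, **OVERRIDES[name]}


def gaze_samples(settings):
    """Samples per window = sampling frequency x window length."""
    return round(settings["gaze_hz"] * settings["gaze_span_s"])


for name in OVERRIDES:
    settings = variant_settings(name)
    crop = settings["rgb_crop"]
    print(f"{name}: {gaze_samples(settings)} gaze samples, {crop}x{crop} RGB crop")
```

Under these assumptions the base model sees 120 gaze samples per window, while v1_1s and v1_15Hz see 60 and 30 respectively.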

For details of the base model and its variants, please refer to here.

Getting Started

Setup

Use conda to create a new environment and install the required packages. The codebase has been tested with Python 3.12 and PyTorch 2.4.

```bash
conda env create -f environment.yml
```

Once the environment is created, activate it:

```bash
conda activate ritw-osu
```

Download

Dataset

The dataset is hosted on the Hugging Face Hub on this page. To download it, run:

```bash
python -m ritw.download --config-name config.yaml
```

An example config file is provided in `/config/download.yaml`. The dataset is downloaded to the folder indicated by `local_dir`. The config file also lets you specify filters to download only part of the dataset.

Models

Download the models from here and put them inside the models/ folder.

Inference

The inference pipeline is configurable via a config file. An example config is shown below:

```yaml
# Example: ../config/config.yaml
# conf/config.yaml
start_time: 0.0
snippet_gap: 0.01667  # roughly 1/60 seconds
# folder/single. If mode is single, provide the input file name in the config.
# If mode is folder, inference is run on all files in the root dir.
mode: "folder"
modalities:
  - "gaze"
  - "imu"
  - "rgb"
  - "gaze+rgb"
  - "gaze+imu"
  - "imu+rgb"
  - "gaze+imu+rgb"
output_save_path: "output/"
root_dir: "/path/to/ritw/dataset/"
model_name:
  - "v1_default"
  - "v0"
num_workers: 4  # adjust based on available CPU cores
```

The config lets you select the modalities and models to run inference on. To add a new model, put it in the `ritw-osu/models` directory and add its name to the config file.
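The seven `modalities` entries in the example are exactly the non-empty combinations of the three input streams. A minimal sketch for generating that list rather than typing it by hand follows; `modality_combinations` is a hypothetical helper and not part of the `ritw` package. The sketch also checks that the snippet gap in the example corresponds to one step at 60 Hz.

```python
from itertools import combinations

STREAMS = ["gaze", "imu", "rgb"]


def modality_combinations(streams):
    """Return every non-empty combination of streams, joined with '+'."""
    combos = []
    for size in range(1, len(streams) + 1):
        for subset in combinations(streams, size):
            combos.append("+".join(subset))
    return combos


modalities = modality_combinations(STREAMS)
print(modalities)

# The example config uses a snippet gap of 0.01667 s, i.e. roughly 1/60 s.
snippet_gap = 1 / 60
print(f"{snippet_gap:.5f}")
```

The generated names match the ones in the example config, although the order of the two-stream entries differs.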

To run prediction, create a `config.yaml` file (see `predict.yaml` for reference) and save it to `ritw-osu/config`. Then run:

```bash
python -m ritw.predict --config-name config.yaml
```

The command runs each file and model combination in a separate process. The output is saved as CSV files in the directory `<output_save_path>/<model_name>`.
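To make the output layout concrete, the sketch below enumerates one job per (file, model) pair and the path its CSV would be written to, following the `<output_save_path>/<model_name>` pattern. The recording names, the one-CSV-per-recording naming, and the `plan_jobs` helper are assumptions for illustration; the actual file naming used by `ritw.predict` is not specified here.

```python
from itertools import product
from pathlib import PurePosixPath


def plan_jobs(files, model_names, output_save_path):
    """List (file, model, output CSV path) for every file/model pair."""
    jobs = []
    for file_name, model_name in product(files, model_names):
        out_dir = PurePosixPath(output_save_path) / model_name
        # Assumed naming: one CSV per recording, named after the input file stem.
        out_csv = out_dir / (PurePosixPath(file_name).stem + ".csv")
        jobs.append((file_name, model_name, str(out_csv)))
    return jobs


# Hypothetical recording names; model names taken from the example config.
jobs = plan_jobs(["recording_a", "recording_b"], ["v1_default", "v0"], "output/")
for job in jobs:
    print(job)
```

With two recordings and two models this yields four independent jobs, which is the unit of work that gets its own process.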

Evaluation

The evaluation module allows you to benchmark the performance of the models based on the inference results. It utilizes metadata from the recordings, applies configurable filters to focus on a specific subset of the dataset, computes various performance metrics for each modality, and outputs a summary table in Markdown format.

Below is an example configuration file (conf/config.yaml) for the evaluation module:

```yaml
# Example: ../config/config.yaml
metadata_file: "data/metadata.csv"
result_dir: "output/v1_default"
target_recall: 0.9
metrics:
  - "F1"
  - "Acc"
  - "P@R=0.9"
  - "T@R=0.9"
  - "Acc@R=0.9"
  - "F1@R=0.9"
  - "AUC"
modalities:
  - "gaze"
  - "rgb"
  - "imu"
  - "gaze+imu"
  - "imu+rgb"
  - "gaze+rgb"
  - "gaze+imu+rgb"
filters:
  ContainsNonText:
    - "images"
  ShortTextOrPara:
    - "paragraphs"
  Medium:
    - "digital"
  Platform:
    - "laptop"
```
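One plausible reading of the `filters` block is that a recording is kept only if, for every listed column, its metadata value is among the allowed values. The sketch below applies that rule to a few made-up metadata rows. The row values, the `recording` column, and the `apply_filters` helper are hypothetical, and the actual semantics in `ritw.evaluate` may differ.

```python
def apply_filters(rows, filters):
    """Keep rows whose value is allowed for every filtered column."""
    return [
        row
        for row in rows
        if all(row.get(column) in allowed for column, allowed in filters.items())
    ]


# Filter columns and allowed values taken from the example config.
filters = {
    "ContainsNonText": ["images"],
    "ShortTextOrPara": ["paragraphs"],
    "Medium": ["digital"],
    "Platform": ["laptop"],
}

# Made-up metadata rows for illustration only.
rows = [
    {"recording": "a", "ContainsNonText": "images", "ShortTextOrPara": "paragraphs",
     "Medium": "digital", "Platform": "laptop"},
    {"recording": "b", "ContainsNonText": "images", "ShortTextOrPara": "paragraphs",
     "Medium": "print", "Platform": "book"},
    {"recording": "c", "ContainsNonText": "images", "ShortTextOrPara": "paragraphs",
     "Medium": "digital", "Platform": "phone"},
]

kept = apply_filters(rows, filters)
print([row["recording"] for row in kept])
```

Under this reading, an empty `filters` mapping keeps every recording, and each additional column narrows the evaluated subset.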

After setting up your environment and ensuring that the metadata and prediction CSV files are available, run the evaluation module from the project's root directory:

```bash
python -m ritw.evaluate --config-name config.yaml
```
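The metric names in the example config suggest quantities measured at a fixed recall. As a sketch of how such numbers can be computed, the code below picks the highest score threshold whose recall reaches the target (read here as `T@R=0.9`) and reports the precision at that threshold (read as `P@R=0.9`). This interpretation, the toy labels and scores, and the helper names are assumptions; the evaluation module may define these metrics differently.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def threshold_at_recall(labels, scores, target_recall):
    """Return the highest threshold whose recall is at least target_recall."""
    if 1 not in labels:
        raise ValueError("recall is undefined without positive labels")
    for threshold in sorted(set(scores), reverse=True):
        _, recall = precision_recall(labels, scores, threshold)
        if recall >= target_recall:
            return threshold
    return min(scores)  # fallback; the lowest score always gives recall 1.0


# Toy data: 10 reading (1) and 5 non-reading (0) snippets with made-up scores.
labels = [1] * 10 + [0] * 5
scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.2,
          0.58, 0.5, 0.3, 0.1, 0.05]

threshold = threshold_at_recall(labels, scores, target_recall=0.9)
precision, recall = precision_recall(labels, scores, threshold)
print(threshold, precision, recall)
```

Fixing the recall first makes models comparable at the same miss rate, which is why the threshold itself is worth reporting alongside precision.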

License

The Reading in the Wild - Columbus Subset dataset and code are released by The Ohio State University under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Data and code may not be used for commercial purposes. For more information, please refer to the LICENSE file included in this repository.

Attribution

When using the dataset and code, please attribute it as follows:

```bibtex
@inproceedings{yang25reading,
  title={Reading Recognition in the Wild},
  author={Charig Yang and Samiul Alam and Shakhrul Iman Siam and Michael Proulx and Lambert Mathias and Kiran Somasundaram and Luis Pesqueira and James Fort and Sheroze Sheriffdeen and Omkar Parkhi and Carl Ren and Mi Zhang and Yuning Chai and Richard Newcombe and Hyo Jin Kim},
  booktitle={arXiv Preprint},
  year={2025},
  url={https://arxiv.org/abs/2505.24848},
}
```

Owner

  • Name: OSU AIoT-MLSys Lab
  • Login: AIoT-MLSys-Lab
  • Kind: organization
  • Location: United States of America

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Reading Recognition in the Wild
message: 'If you use this software, please cite it as below.'
type: dataset
authors:
  - given-names: Charig
    family-names: Yang
  - given-names: Samiul
    family-names: Alam
    email: alam.140@osu.edu
    affiliation: OSU
    orcid: 'https://orcid.org/0000-0002-8458-4642'
  - given-names: Shakhrul Iman
    family-names: Siam
  - given-names: Michael J.
    family-names: Proulx
  - given-names: Lambert
    family-names: Mathias
  - given-names: Kiran
    family-names: Somasundaram
  - given-names: Luis
    family-names: Pesqueira
  - given-names: James
    family-names: Fort
  - given-names: Sheroze
    family-names: Sheriffdeen
  - given-names: Omkar
    family-names: Parkhi
  - given-names: Carl
    family-names: Ren
  - given-names: Mi
    family-names: Zhang
  - given-names: Yuning
    family-names: Chai
  - given-names: Richard
    family-names: Newcombe
  - given-names: Hyo Jin
    family-names: Kim
identifiers:
  - type: doi
    value: 10.48550/arXiv.2505.24848
repository-code: >-
  https://github.com/AIoT-MLSys-Lab/Reading-in-the-Wild-Columbu
url: 'https://www.projectaria.com/datasets/reading-in-the-wild/'
repository-artifact: 'https://huggingface.co/datasets/OSU-AIoT-MLSys-Lab'
abstract: >-
  To enable egocentric contextual AI in always-on smart
  glasses, it is crucial to be able to keep a record of the
  user's interactions with the world, including during
  reading. In this paper, we introduce a new task of reading
  recognition to determine when the user is reading. We
  first introduce the first-of-its-kind large-scale
  multimodal Reading in the Wild dataset, containing 100
  hours of reading and non-reading videos in diverse and
  realistic scenarios. We then identify three modalities
  (egocentric RGB, eye gaze, head pose) that can be used to
  solve the task, and present a flexible transformer model
  that performs the task using these modalities, either
  individually or combined. We show that these modalities
  are relevant and complementary to the task, and
  investigate how to efficiently and effectively encode each
  modality. Additionally, we show the usefulness of this
  dataset towards classifying types of reading, extending
  current reading understanding studies conducted in
  constrained settings to larger scale, diversity and
  realism.
