ritcis
RITCIS handwritten letters database for training and testing machine learning based classification algorithms
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (12.5%) to scientific vocabulary
Repository
RITCIS handwritten letters database for training and testing machine learning based classification algorithms
Basic Info
- Host: GitHub
- Owner: csalvaggio
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 1.69 GB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
RITCIS Database of Handwritten Letters
The MNIST database of handwritten digits has long been a standard data set for training and testing machine learning architectures aimed at the classification problem.
The RITCIS database of handwritten letters (uppercase) has been developed to serve this same purpose. This database has been created using samples prepared by students, staff, and faculty at the Rochester Institute of Technology's Chester F. Carlson Center for Imaging Science as part of the course titled Image Processing (IMGS.361) [Fall Semesters of AY2023-2024/AY2024-2025].
A sample handwritten letters form used to collect the samples for the RITCIS database is shown below:
To date, 134 forms have been prepared and used to create the current instantiation of the database. This instantiation is split into 42,142 training images and 8,429 test images.
The format of the training/test images and labels files, located in the datasets directory above, are identical to those used for the MNIST database files. Data readers designed for reading MNIST database files should work identically on the RITCIS database files.
The compressed database files located in datasets are named as
test-images-028-ubyte.gz
test-labels-028-ubyte.gz
train-images-028-ubyte.gz
train-labels-028-ubyte.gz
where the 028 represents the square dimension of the images (28x28 pixels in this example). Square-dimensioned versions of the data are provided at 28, 56, 112, and 224 pixels. Be sure to decompress these database files prior to use on your system.
Data Preparation
The handwritten letters (paper/physical) forms (located in the samples/ppt directory) were filled out by students, staff, and faculty at the Rochester Institute of Technology. Forms were scanned utilizing a Xerox AltaLink B8170 WorkCentre. The following modifications were made to the default scan preferences:
- Output Color: Color
- Resolution: 600 dpi
- Quality/File Size: Lowest Compression / Largest File Size
The PDFs generated from the scans are located in the samples/pdfs directory. Each PDF contains 10 scanned forms.
Data Processing Pipeline
The handwritten letter forms are processed to the final datasets using the following scripts:
extract_tiles.py
usage: extract_tiles.py [-h] [-d TILES_BASE_DIRECTORY] [-c] [-o] [-s]
[-r STRUCTURING_ELEMENT_RADIUS]
[-t ABSOLUTE_TOLERANCE]
starting_sample pdf_filename
Extract tiles from handwritten letter data set development sheets
positional arguments:
starting_sample starting sample ID number
pdf_filename file path of the PDF containing the scanned
handwritten letter data set development sheets
options:
-h, --help show this help message and exit
-d TILES_BASE_DIRECTORY, --tiles-base-directory TILES_BASE_DIRECTORY
base directory for extracted letter tiles [default is
None]
-c, --use-centroid the letters will be centered using its centroid value,
otherwise the letter's bounding box center will be
used [default is False]
-o, --use-original-dimensions
the letters will retain their original dimensions,
otherwise they will be size normalized while retaining
their original aspect ratio [default is False]
-s, --square create square tiles [default is False]
-r STRUCTURING_ELEMENT_RADIUS, --radius STRUCTURING_ELEMENT_RADIUS
structuring element radius to use for character
cleanup using morphological closings/openings [default
is 3]
-t ABSOLUTE_TOLERANCE, --tolerance ABSOLUTE_TOLERANCE
absolute tolerance for determining if channel
probability density functions are different [default
is 0.025]
This script performs the extraction of an individual tile for each handwritten letter on the handwritten letter forms.
Each page of the provided PDF file is converted to an image at a resolution of 600 dpi. Each page's image is aligned to a fixed grid using the fiducials located in the four (4) corners of the form utilizing a quad-to-quad perspective mapping function. This allows for the individual letter regions (denoted by the dots on the form) to be extracted easily from predefined locations.
The extracted tiles undergo a series of processing steps to prepare them for later ingestion into the database. The steps are:
- Screening for editorial marking (colored scribbles) to determine if the current tile should be excluded from further processing
- Conversion from color to greyscale for each tile
- Inversion of the greyscale (to create white letters on a black background)
- Binary thresholding (at a fixed level of 127) to prepare the tile for cleanup
- Cleanup using a morphological opening and closing operation to minimize noise and small letter defects
- Computation of the centroid of each letter within the tile AND computation of a bounding box for each letter
- Letter is centered within the tile using either the centroid OR bounding box (user-selectable option)
- Letter is scaled to fit 0.7 of the tile height or width to size normalize the letter representations
- Tile is rejected if portions of the letter fall within a buffer region around the outside of the tile
- Tiles are written to individual image files (PNG format by default), separated in directories by tile size and sample ID number (these images are not include in this repository due to size limitations)
curate_tiles.py
usage: curate_tiles.py [-h] [-t TILES_PATH] [-e TILE_EXTENSION]
sample_to_curate
Curate the tiles (remove bad instances) that have been extracted from the
RITCIS handwritten letters forms
positional arguments:
sample_to_curate sample ID number to curate
options:
-h, --help show this help message and exit
-t TILES_PATH, --tiles-path TILES_PATH
path to the directory containing the RITCIS data tiles
(the directory containing the "full" resolution folder
should be selected [default is "samples/tiles"]
-e TILE_EXTENSION, --extension TILE_EXTENSION
tile image extension [default is "png"]
This script allows for visual inspection of each tile from an individual sample form and keeps/removes that tile from the extracted set. This allows for removal of tiles that made it through the previous screening process that should be excluded from the database.
construct_dataset.py
usage: construct_dataset.py [-h] [-e TILE_EXTENSION] tiles_path dataset_path
Construct the RITCIS handwritten letters data file in the same fashion as the
MNIST data set
positional arguments:
tiles_path path to the directory containing the RITCIS data tiles
to be included in the data set (i.e. the specific tile
resolution directory)
dataset_path path to the directory to contain the constructed data
set
options:
-h, --help show this help message and exit
-e TILE_EXTENSION, --extension TILE_EXTENSION
tile image extension [default is "png"]
This script will create MNIST style datasets for the tiles in the specified tile size directory. Tiles will be randomized and divided into 5/6 training data and 1/6 test data sets. This split can be modified in the code if desired. The database files are compressed using gzip to identically mimic the MNIST distribution.
Requirements
All processing scripts require Python 3 and the following non-standard modules:
- OpenCV (cv2)
- pdf2image
Performance
This dataset has undergone limited testing to this point. The results by algorithm are as follows:
| CLASSIFIER | PREPROCESSING | TEST ERROR RATE (%) | | ---------- | ------------- | ------------------- | | k-NN [k=3] (L1-norm) | (above) | 13.0 | | k-NN [k=3] (L2-norm) | (above) | 12.0 | | k-NN [k=3] (L3-norm) | (above) | 11.5 |
Acknowledgements
The author would like to acknowledge the students of the AY2023-2024/AY2024-2025 Fall semester offerings of the course titled Image Processing (IMGS.361) at the Rochester Institute of Technology's Chester F. Carlson Center for Imaging Science for their assistance in preparing the data provided in this repository as well as their vetting of the data's performance in their final classification project:
(listed alphabetically)
AY2023-2024
- Shey Cajigas
- Troy Church
- Nick Duggan
- Jett Forward
- Elizabeth Husarek
- Grace Kachmaryk
- Ellias Kim
- Danny Klosinski
- Robert Mancini
- Anna Mason (TA)
- Ryan McDonald
- Parker Mei
- Lauren Mowrey
- Sarah Pool
- Emily Rivera Ojeda
- Micah Ross (TA)
- Maxwell Schaefer
- Luke Spinosa
- Anna Steele
- Kailey Switzer
- Karla van Aardt
- Cheney Zhang
AY2024-2025
- Carly Adams
- Will Barden
- Teddy Batkin
- Alexander Benanti
- Terrlyn Byrd
- Luke Callahan
- Josie Clapp
- Kyle Cummings
- Brandon Faunce
- Lillian Freeman
- Anthony Guarino
- Jason Kwong
- Adele Jones
- Aidan Montag
- Ellie Nixon
- Evelyn Sutkus
- Gian-Mateo Tifone
- Stavros Viron
- Mason Wahlers
- Jonathan Wheeler
- Cooper White
- Preston Yates
If you find this database useful and utilize it in your research, please attribute this repository in your publications as follows
Plain Text
Salvaggio, Carl (2024). RITCIS Database of Handwritten Letters. GitHub. URL: https://github.com/csalvaggio/ritcis.
BibTeX
@misc{salvaggio_ritcis_2024,
title = {RITCIS Database of Handwritten Letters},
author = {Carl Salvaggio},
year = 2024,
publisher = {Rochester Institute of Technology},
howpublished = {https://github.com/csalvaggio/ritcis}
}
Copyright (C) 2024, Rochester Institute of Technology
Owner
- Name: Carl Salvaggio
- Login: csalvaggio
- Kind: user
- Location: Rochester, NY, USA
- Company: Rochester Institute of technology
- Website: https://www.cis.rit.edu/~cnspci/index.php
- Repositories: 1
- Profile: https://github.com/csalvaggio
Carl is a Professor of @Imaging Science at @RIT specializing in remote sensing, image processing, and computer vision research.
GitHub Events
Total
- Push event: 2
Last Year
- Push event: 2