https://github.com/bluegreen-labs/weahtr_model
HTR/OCR for climate data model development
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (12.9%) to scientific vocabulary
Keywords
Repository
HTR/OCR for climate data model development
Basic Info
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
- Releases: 0
Topics
Metadata Files
README.md
OCR Climate (proof of concept)
Introduction
The Congo basin eco-climatological data recovery and valorisation (COBECORE, contract BR/175/A3/COBECORE) project made the static analog archives of the Institut National pour lEtude Agronomique du Congo belge (INEAC) digitally accessible. Although a part of the climate data was transcribed using citizen science in the Jungle Weather programme (~350K values), the vast majority of the data remains digitized but not transcribed (~7M values for the main climate variables alone, i.e. temperature, rainfall). Within this context machine learning, and optical character recognition in particular, promises to bring a workable solution.
Optical Character (handwritten) Recognition (OCR)
Within the context of COBECORE handwritten text recognition happens within the very specific context of tabulated data. Given the context of this problem, i.e. snippets of tables with numbers, it can be approached solving a captcha. With the "noise" of table cell boundary lines and other smudges on old paper this comparison is more than fitting.
Generally, solving captcha problems can be done using a CRNN + CTC loss setup. The Keras introduction into the captcha problem provides the baseline, to be expanded to handwritten text in another demo - accounting for varying input image sizes. In a quick test I used the vanilla Keras handwritten text recognition code, adapted to the COBECORE climate data formatting (separate labels and images, some cropping) to learn to recognize the value of climate variables.
A simple test on a subset of the data (~10K images, instead of the total dataset of ~350K) shows reasonable performance (see image below). Given these results the exercise should be expanded to the full dataset, including many more writing styles to increase model robustness.

Next steps
- formalize the code (potentially Keras in R for consistency)
- formalize accuracy metrics (evaluation)
- expand training data and matching testing
- generate true out of sample data (extracted from novel tables)
- consider image augmentation and contrast boosting options to increase accuracy
- other model setups simpleHTR
- layout detection instead of template matching
- ...
Setup
Getting Tensorflow to work (sort of consistently)
Install docker per usual and add the user to the docker user group.
sudo usermod -aG docker $USER
Install the Nvidia container toolkit.
sudo apt install nvidia-container-toolkit
Configure the toolkit to recognize the GPUs in docker.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Run grab the default image to built upon, this should fire up a console in which the GPUs should be active.
docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:23.09-tf2-py3
Mounting external files can be done using:
docker run --gpus all -it --rm -v $(pwd):/current_project/ nvcr.io/nvidia/tensorflow:23.09-tf2-py3
Which will set /current_project/ as the mount point for the current (present) working directory (pwd).
Custom Dockerfile
A docker file is included which builds a custom environment, using the tensorflow setup as a basis, adding Jupyter notebooks for easier interaction with the docker container and the option to visualize things (the other option is forwarding X sessions which is security wise probably iffy).
Owner
- Name: BlueGreen Labs
- Login: bluegreen-labs
- Kind: organization
- Email: info@bluegreenlabs.org
- Location: Melsele, Belgium
- Website: http://bluegreenlabs.org
- Repositories: 17
- Profile: https://github.com/bluegreen-labs
BlueGreen open science labs & consulting, providing environmental research infrastructure and editorial solutions.
GitHub Events
Total
- Public event: 1
Last Year
- Public event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- khufkens (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- nvcr.io/nvidia/tensorflow 23.09-tf2-py3 build
- JamesIves/github-pages-deploy-action v4.4.1 composite
- actions/checkout v3 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite