daiR
daiR: an R package for OCR with Google Document AI - Published in JOSS (2021)
ocr-fileformat
Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://github.com/aim-uofa/adelaidet
AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.
imgocr
Python 3 package for Chinese/English OCR that uses the paddleocr-v5 ONNX model (~20 MB), with ultra-fast inference speed. Based on ppocr-v5-onnx model inference; open-source SOTA for Chinese and English OCR.
taulu
Taulu is a Python package designed to segment tabular data in scanned or photographed documents.
handprint
Apply different text recognition services to images of handwritten documents.
gt-mufilevelrules
OCR-D-Level-Rules can be created automatically with gt-MufiLevelRules from the encodings published by MUFI: The Medieval Unicode Font Initiative.
https://github.com/amazon-science/semimtr-text-recognition
Multimodal Semi-Supervised Learning for Text Recognition (SemiMTR)
https://github.com/ai-forever/ocr-model
An easy-to-run OCR model pipeline based on CRNN and CTC loss
apple-ocr
Easy-to-Use Apple Vision wrapper for text extraction, scalar representation and clustering using K-means.
docutron
Docutron Toolkit: detection and segmentation analysis for legal data extraction over documents.
https://github.com/adithya-s-k/omniparse
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
https://github.com/amirabbasasadi/persianocr
Simple word-level OCR program for the Persian language based on Recurrent Neural Networks using Pytorch and OpenCV
https://github.com/caltechlibrary/htr-test-cases
Images of documents for testing HTR.
dtrocr
A PyTorch implementation of DTrOCR: Decoder-only Transformer for Optical Character Recognition
keyboardgt
A collection of different keyboards for transcription software (Aletheia, Transkribus, LAREX, QURATOR-neat, eScriptorium)
lines_merge_ocr
This script merges lines in plain text extracted from a PDF produced by OCR.
reichsanzeiger-gt
Ground truth for German newspaper "Deutscher Reichsanzeiger und Preußischer Staatsanzeiger" (1819–1945)
german-brazilian-newspapers-dataset_1
The GBN Dataset consists of German-Brazilian historical newspapers, along with their digital and binarized images and ground truth files.
langchain-chatbot
AI Chatbot for analyzing/extracting information from data in conversational format.
https://github.com/bytedance/dolphin
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
wordart
The official code of CornerTransformer (ECCV 2022, Oral) on top of MMOCR.
hemdig-framework
HEMDIG(pt) framework: methods, tools, and digital newspaper archives in Portuguese. Postdoctoral research.
https://github.com/amazon-science/textadain-robust-recognition
TextAdaIN: Paying Attention to Shortcut Learning in Text Recognizers
https://github.com/ai-forever/readingpipeline
Text reading pipeline that combines segmentation and OCR-models.
german-brazilian-newspapers-dataset_2
The GBN Dataset consists of German-Brazilian historical newspapers, along with their digital and binarized images and ground truth files.
https://github.com/awslabs/aws-ai-solution-kit
Machine Learning APIs for common use cases, including: General OCR (Simplified/Traditional Chinese), Custom OCR, Image Similarity, Object Recognition, Face Detection, Face Comparison, Human Image Segmentation, Human Attribute Recognition, Pornography Detection, Image Super Resolution, Text Similarity, Car License Plate, etc.
https://github.com/bytedance/sptsv2
The official implementation of SPTS v2: Single-Point Text Spotting
https://github.com/cemenenkoff/runedark-public
Runedark is a color-based automation framework (for OSRS).
cremma-medieval
Transcription corpora for training HTR models for medieval manuscripts from the 12th to the 15th century.
cataloguessegmentationocr
Dataset and models for layout analysis and HTR of catalogs