Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.9%) to scientific vocabulary
Keywords
alto-xml
handwritten-text-recognition
hocr
htr
layout-analysis
neural-networks
ocr
optical-character-recognition
page-xml
Last synced: 6 months ago
Repository
OCR engine for all the languages
Basic Info
- Host: GitHub
- Owner: mittagessen
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: http://kraken.re
- Size: 29.6 MB
Statistics
- Stars: 870
- Watchers: 30
- Forks: 148
- Open Issues: 32
- Releases: 13
Created almost 11 years ago
· Last pushed 6 months ago
Metadata Files
Readme
License
Citation
README.rst
Description
===========
.. image:: https://github.com/mittagessen/kraken/actions/workflows/test.yml/badge.svg
    :target: https://github.com/mittagessen/kraken/actions/workflows/test.yml

kraken is a turn-key OCR system optimized for historical and non-Latin script
material.

kraken's main features are:

- Fully trainable layout analysis, reading order, and character recognition
- Right-to-Left, BiDi, and Top-to-Bottom script support
- ALTO, PageXML, abbyyXML, and hOCR output
- Word bounding boxes and character cuts
- Multi-script recognition support
- Public repository of model files
- Variable recognition network architecture

Installation
============
kraken can be run on Linux or Mac OS X (both x64 and ARM). Installation is
through the on-board *pip* utility. To avoid polluting the global state of your
distribution's package manager, it is recommended to use virtual environments.
If you do not have such a setup or do not wish to handle virtual environments
yourself, you can use `pipx`.

.. code-block:: console

  $ sudo apt install pipx
  $ pipx install kraken

kraken works with any Python interpreter between 3.9 and 3.11. The
installation may fail because `pipx` defaults to an unsupported interpreter
version. In that case, install a compatible interpreter version such as 3.11
and then specify this version explicitly:

.. code-block:: console

  $ sudo apt install python3.11-full
  $ pipx install --python python3.11 kraken

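Before installing, it can help to check whether the interpreter you intend to
use falls in the supported range. The following is a minimal sketch, not part
of kraken: the helper name ``is_supported`` is hypothetical, and the bounds are
simply the 3.9 to 3.11 range stated above.

```python
import sys

# Supported range as stated in the installation notes: Python 3.9 through 3.11.
MIN_SUPPORTED = (3, 9)
MAX_SUPPORTED = (3, 11)


def is_supported(version=None):
    """Return True if the (major, minor) part of `version` is in range.

    `version` may be any tuple-like starting with (major, minor); it
    defaults to the interpreter running this script.
    """
    if version is None:
        version = sys.version_info
    major_minor = (version[0], version[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if is_supported():
        print(f"Python {current} is within the supported range.")
    else:
        print(f"Python {current} is outside 3.9-3.11; "
              "consider passing --python to pipx.")
```

Running the script with the interpreter that ``pipx`` would pick up shows
whether the explicit ``--python`` option is likely to be needed.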
Installation using pip
----------------------
Create and activate a separate virtual environment using whatever tool you
like, then install kraken from PyPI:

.. code-block:: console

  $ pip install kraken

or by running pip in the git repository:

.. code-block:: console

  $ pip install .

If you want direct PDF and multi-image TIFF/JPEG2000 support, it is necessary to
install the `pdf` extras from PyPI:

.. code-block:: console

  $ pip install kraken[pdf]

Finally, you'll have to scrounge up a model to do the actual recognition of
characters. To download the default model for printed French text and place it
in the kraken directory for the current user:

::

  $ kraken get 10.5281/zenodo.10592716

A list of libre models available in the central repository can be retrieved by
running:

::

  $ kraken list

Quickstart
==========
Recognizing text on an image using the default parameters including the
prerequisite steps of binarization and page segmentation:

::

  $ kraken -i image.tif image.txt binarize segment ocr

To binarize a single image using the nlbin algorithm:

::

  $ kraken -i image.tif bw.png binarize

To segment an image (binarized or not) with the new baseline segmenter:

::

  $ kraken -i image.tif lines.json segment -bl

To segment and OCR an image using the default model(s):

::

  $ kraken -i image.tif image.txt segment -bl ocr -m catmus-print-fondue-large.mlmodel

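The segment-and-OCR invocation above can be repeated over a directory of images
with a small wrapper script. This is a sketch under assumptions, not part of
kraken: the helper names ``build_command``, ``plan_batch`` and ``run_batch`` are
hypothetical, the script only shells out to the ``kraken`` executable with the
options shown in this section, and it assumes ``kraken`` is on the ``PATH``.

```python
import subprocess
from pathlib import Path


def build_command(image, output, model=None):
    """Build the argument list for one segment + OCR run.

    Mirrors the command line shown above: baseline segmentation (-bl)
    followed by recognition, optionally with an explicit model (-m).
    """
    cmd = ["kraken", "-i", str(image), str(output), "segment", "-bl", "ocr"]
    if model is not None:
        cmd += ["-m", str(model)]
    return cmd


def plan_batch(image_dir, pattern="*.tif", model=None):
    """Return one command per matching image, writing <name>.txt beside it."""
    images = sorted(Path(image_dir).glob(pattern))
    return [build_command(p, p.with_suffix(".txt"), model) for p in images]


def run_batch(image_dir, pattern="*.tif", model=None):
    """Run the planned commands one after another, stopping on the first failure."""
    for cmd in plan_batch(image_dir, pattern, model):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical directory name; adjust to where your scans live.
    run_batch("scans", model="catmus-print-fondue-large.mlmodel")
```

Keeping command construction separate from execution makes it easy to print
the planned commands first and inspect them before any OCR is run.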
All subcommands and options are documented. Use the ``--help`` option to get more
information.

Documentation
=============
Have a look at the `docs <https://kraken.readthedocs.io/>`_.

Related Software
================
These days kraken is quite closely linked to the eScriptorium project developed
in the same eScripta research group. eScriptorium provides a user-friendly
interface for annotating data, training models, and inference (but also much
more). There is a gitter channel that is mostly intended for coordinating
technical development but is also a spot to find people with experience
applying kraken to a wide variety of material.

Funding
=======
kraken is developed at the École Pratique des Hautes Études, Université PSL.

.. container:: twocol

   .. container::

        .. image:: https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg
           :width: 100
           :alt: Co-financed by the European Union

   .. container::

        This project was funded in part by the European Union (ERC, MiDRASH,
        project number 101071829).

.. container:: twocol

   .. container::

        .. image:: https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg
           :width: 100
           :alt: Co-financed by the European Union

   .. container::

        This project was partially funded through the RESILIENCE project, funded from
        the European Union’s Horizon 2020 Framework Programme for Research and
        Innovation.

.. container:: twocol

   .. container::

        .. image:: https://projet.biblissima.fr/sites/default/files/2021-11/biblissima-baseline-sombre-ia.png
           :width: 400
           :alt: Received funding from the Programme d’investissements d’Avenir

   .. container::

        This work received French government funding managed by the Agence
        Nationale de la Recherche under the Programme d’Investissements d’Avenir,
        reference ANR-21-ESRE-0005 (Biblissima+).

Owner
- Login: mittagessen
- Kind: user
- Location: Paris
- Company: École Pratique des Hautes Études, Aoroc
- Website: http://l.unchti.me
- Repositories: 31
- Profile: https://github.com/mittagessen
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Kiessling"
    given-names: "Benjamin"
    orcid: "https://orcid.org/0000-0001-9543-7827"
title: "The Kraken OCR system"
version: 6.0
date-released: 2025-08-29
url: "https://kraken.re"
GitHub Events
Total
- Issues event: 92
- Watch event: 121
- Delete event: 3
- Issue comment event: 150
- Push event: 106
- Pull request review event: 1
- Pull request review comment event: 2
- Pull request event: 27
- Fork event: 19
- Create event: 9
Last Year
- Issues event: 92
- Watch event: 121
- Delete event: 3
- Issue comment event: 150
- Push event: 106
- Pull request review event: 1
- Pull request review comment event: 2
- Pull request event: 27
- Fork event: 19
- Create event: 9
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 104
- Total pull requests: 41
- Average time to close issues: 3 months
- Average time to close pull requests: about 1 month
- Total issue authors: 54
- Total pull request authors: 14
- Average comments per issue: 2.84
- Average comments per pull request: 0.98
- Merged pull requests: 19
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 44
- Pull requests: 13
- Average time to close issues: about 1 month
- Average time to close pull requests: 27 days
- Issue authors: 27
- Pull request authors: 5
- Average comments per issue: 1.45
- Average comments per pull request: 0.23
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- johnlockejrr (18)
- megamattc (8)
- PonteIneptique (7)
- dstoekl (6)
- jesusbft (6)
- fattynoparents (6)
- alexislitvine (5)
- stweil (5)
- bertsky (4)
- alitarek-dot (3)
- dantetemplar (3)
- Klyma79 (2)
- sk-dataocean-online (2)
- tarrinw (2)
- jeffreycwitt (2)
Pull Request Authors
- stweil (12)
- mittagessen (6)
- jesusbft (6)
- PonteIneptique (6)
- Arch-W (2)
- fattynoparents (2)
- anutkk (2)
- Evarin (1)
- Shreejan-git (1)
- saiprabhath2002 (1)
- particitae (1)
- sadra-barikbin (1)
- johnlockejrr (1)
- rlskoeser (1)
Top Labels
Issue Labels
enhancement (1)
beginner-friendly (1)
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 9,503 last month (pypi)
- Total dependent packages: 6
- Total dependent repositories: 24
- Total versions: 100
- Total maintainers: 2
pypi.org: kraken
OCR/HTR engine for all the languages
- Homepage: https://kraken.re
- Documentation: https://kraken.readthedocs.io/
- License: Apache
- Latest release: 5.3.0 (published over 1 year ago)
Rankings
Dependent packages count: 1.4%
Dependent repos count: 3.0%
Average: 3.2%
Downloads: 5.2%
Maintainers (2)
Last synced: 6 months ago
Dependencies
.github/workflows/test.yml
actions
- actions/checkout v2 composite
- actions/download-artifact v2 composite
- actions/setup-python v2 composite
- actions/setup-python v1 composite
- actions/upload-artifact v2 composite
- conda-incubator/setup-miniconda v2 composite
- crazy-max/ghaction-github-pages v2 composite
- marvinpinto/action-automatic-releases latest composite
- pypa/gh-action-pypi-publish release/v1 composite
environment.yml
conda
- albumentations
- click >=8.1
- imagemagick >=7.1.0
- jinja2 ~=3.0
- jsonschema
- lxml
- numpy ~=1.23.0
- pillow ~=9.2.0
- pip
- pyarrow
- python >=3.9
- python-bidi
- pytorch-cpu >=1.11,<1.14
- pytorch-lightning ~=2.0.0
- pyvips
- regex
- requests
- rich
- scikit-image ~=0.21.0
- scikit-learn ~=1.2.1
- scipy ~=1.11.0
- shapely ~=1.8.5
- threadpoolctl ~=3.2
- torchmetrics >=1.1.0
- torchvision-cpu >=0.5.0
pyproject.toml
pypi
setup.py
pypi