kraken

OCR engine for all the languages

https://github.com/mittagessen/kraken

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.9%) to scientific vocabulary

Keywords

alto-xml handwritten-text-recognition hocr htr layout-analysis neural-networks ocr optical-character-recognition page-xml
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: mittagessen
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage: http://kraken.re
  • Size: 29.6 MB
Statistics
  • Stars: 870
  • Watchers: 30
  • Forks: 148
  • Open Issues: 32
  • Releases: 13
Created almost 11 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.rst

Description
===========

.. image:: https://github.com/mittagessen/kraken/actions/workflows/test.yml/badge.svg
    :target: https://github.com/mittagessen/kraken/actions/workflows/test.yml

kraken is a turn-key OCR system optimized for historical and non-Latin script
material.

kraken's main features are:

  - Fully trainable layout analysis, reading order, and character recognition
  - Right-to-Left, BiDi, and Top-to-Bottom script support
  - ALTO, PageXML, abbyyXML, and hOCR output
  - Word bounding boxes and character cuts
  - Multi-script recognition support
  - Public repository of model files
  - Variable recognition network architecture

Installation
============

kraken runs on Linux and Mac OS X (both x64 and ARM). Installation is
through the on-board *pip* utility. To avoid polluting the global state of your
distribution's package manager, it is recommended to use virtual environments.
If you do not have such a setup or do not wish to handle virtual environments
yourself, you can use ``pipx``.

.. code-block:: console

   $ sudo apt install pipx
   $ pipx install kraken

kraken works with any Python interpreter between 3.9 and 3.11. The
installation may fail because ``pipx`` defaults to an unsupported interpreter
version. In that case, install a compatible interpreter version such as 3.11
and specify it explicitly:

.. code-block:: console

   $ sudo apt install python3.11-full
   $ pipx install --python python3.11 kraken
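
Before installing, it can help to check whether a given interpreter falls
inside the supported range. The following is a minimal sketch, not part of
kraken: the function name ``is_supported`` is hypothetical, and treating both
ends of the 3.9 to 3.11 range as inclusive is an assumption based on the text
above.

```python
import sys

# Supported range as stated in this README; inclusive bounds are an assumption.
MIN_SUPPORTED = (3, 9)
MAX_SUPPORTED = (3, 11)


def is_supported(version=None):
    """Return True if a (major, minor, ...) version lies within the range.

    Defaults to the interpreter running this script.
    """
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if is_supported():
        print(f"Python {current} is within the supported range.")
    else:
        print(f"Python {current} is outside 3.9-3.11; "
              "consider: pipx install --python python3.11 kraken")
```

Running the script with the interpreter that ``pipx`` would pick up shows
whether the explicit ``--python`` option is likely to be needed.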


Installation using pip
----------------------

Create and activate a separate virtual environment using whatever tool you
like, then install kraken from PyPI:

.. code-block:: console

  $ pip install kraken

or install from source by running pip in a checkout of the git repository:

.. code-block:: console

  $ pip install .

If you want direct PDF and multi-image TIFF/JPEG2000 support, it is necessary to
install the ``pdf`` extras from PyPI:

.. code-block:: console

   $ pip install kraken[pdf]

Finally, you'll have to scrounge up a model to do the actual recognition of
characters. To download the default model for printed French text and place it
in the kraken directory for the current user, run:

::

  $ kraken get 10.5281/zenodo.10592716

A list of libre models available in the central repository can be retrieved by
running:

::

  $ kraken list

Quickstart
==========

Recognizing text on an image using the default parameters including the
prerequisite steps of binarization and page segmentation:

::

  $ kraken -i image.tif image.txt binarize segment ocr

To binarize a single image using the nlbin algorithm:

::

  $ kraken -i image.tif bw.png binarize

To segment an image (binarized or not) with the new baseline segmenter:

::

  $ kraken -i image.tif lines.json segment -bl


To segment and OCR an image using an explicitly specified recognition model:

::

  $ kraken -i image.tif image.txt segment -bl ocr -m catmus-print-fondue-large.mlmodel

All subcommands and options are documented. Use the ``--help`` option to get more
information.
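
When many images need processing, the command lines above can be assembled
from a script. The sketch below is an illustration only: the helpers
``build_kraken_command`` and ``run_kraken`` are hypothetical and not part of
kraken. It relies solely on the ``kraken -i INPUT OUTPUT SUBCOMMAND...``
invocation shown in this section and assumes the ``kraken`` executable is on
``PATH``.

```python
import shlex
import subprocess


def build_kraken_command(image, output, steps):
    """Assemble `kraken -i IMAGE OUTPUT STEP...` as an argument list.

    Each entry in `steps` is a subcommand with its options, e.g. "segment -bl".
    """
    if not steps:
        raise ValueError("at least one subcommand is required")
    command = ["kraken", "-i", str(image), str(output)]
    for step in steps:
        # Split "segment -bl" into ["segment", "-bl"] using shell-like rules.
        command.extend(shlex.split(step))
    return command


def run_kraken(image, output, steps):
    """Run the assembled command; raises CalledProcessError on failure."""
    return subprocess.run(build_kraken_command(image, output, steps), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_kraken() to actually execute it.
    cmd = build_kraken_command("image.tif", "image.txt",
                               ["binarize", "segment", "ocr"])
    print(shlex.join(cmd))
```

Passing an argument list to ``subprocess.run`` rather than a single shell
string avoids quoting problems with file names that contain spaces.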

Documentation
=============

Have a look at the docs at https://kraken.re.

Related Software
================

These days kraken is quite closely linked to the eScriptorium project
developed in the same eScripta research group. eScriptorium provides a
user-friendly interface for annotating data, training models, and inference
(but also much more). There is a gitter channel that is mostly intended for
coordinating technical development but is also a spot to find people with
experience in applying kraken to a wide variety of material.

Funding
=======

kraken is developed at the École Pratique des Hautes Études, Université PSL.

.. container:: twocol

   .. container::

        .. image:: https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg
          :width: 100
          :alt: Co-financed by the European Union

   .. container::

        This project was funded in part by the European Union. (ERC, MiDRASH,
        project number 101071829).

.. container:: twocol

   .. container::

        .. image:: https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg
          :width: 100
          :alt: Co-financed by the European Union

   .. container::

        This project was partially funded through the RESILIENCE project, funded from
        the European Union’s Horizon 2020 Framework Programme for Research and
        Innovation.


.. container:: twocol

   .. container::

      .. image:: https://projet.biblissima.fr/sites/default/files/2021-11/biblissima-baseline-sombre-ia.png
         :width: 400
         :alt: Received funding from the Programme d’investissements d’Avenir

   .. container::

        This work benefited from state aid managed by the Agence Nationale de la
        Recherche under the Programme d’Investissements d’Avenir, reference
        ANR-21-ESRE-0005 (Biblissima+).


Owner

  • Login: mittagessen
  • Kind: user
  • Location: Paris
  • Company: École Pratique des Hautes Études, Aoroc

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Kiessling"
  given-names: "Benjamin"
  orcid: "https://orcid.org/0000-0001-9543-7827"
title: "The Kraken OCR system"
version: 6.0
date-released: 2025-08-29
url: "https://kraken.re"

GitHub Events

Total
  • Issues event: 92
  • Watch event: 121
  • Delete event: 3
  • Issue comment event: 150
  • Push event: 106
  • Pull request review event: 1
  • Pull request review comment event: 2
  • Pull request event: 27
  • Fork event: 19
  • Create event: 9
Last Year
  • Issues event: 92
  • Watch event: 121
  • Delete event: 3
  • Issue comment event: 150
  • Push event: 106
  • Pull request review event: 1
  • Pull request review comment event: 2
  • Pull request event: 27
  • Fork event: 19
  • Create event: 9

Issues and Pull Requests

All Time
  • Total issues: 104
  • Total pull requests: 41
  • Average time to close issues: 3 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 54
  • Total pull request authors: 14
  • Average comments per issue: 2.84
  • Average comments per pull request: 0.98
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 44
  • Pull requests: 13
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 27 days
  • Issue authors: 27
  • Pull request authors: 5
  • Average comments per issue: 1.45
  • Average comments per pull request: 0.23
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • johnlockejrr (18)
  • megamattc (8)
  • PonteIneptique (7)
  • dstoekl (6)
  • jesusbft (6)
  • fattynoparents (6)
  • alexislitvine (5)
  • stweil (5)
  • bertsky (4)
  • alitarek-dot (3)
  • dantetemplar (3)
  • Klyma79 (2)
  • sk-dataocean-online (2)
  • tarrinw (2)
  • jeffreycwitt (2)
Pull Request Authors
  • stweil (12)
  • mittagessen (6)
  • jesusbft (6)
  • PonteIneptique (6)
  • Arch-W (2)
  • fattynoparents (2)
  • anutkk (2)
  • Evarin (1)
  • Shreejan-git (1)
  • saiprabhath2002 (1)
  • particitae (1)
  • sadra-barikbin (1)
  • johnlockejrr (1)
  • rlskoeser (1)
Top Labels
Issue Labels
enhancement (1) beginner-friendly (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 9,503 last-month
  • Total dependent packages: 6
  • Total dependent repositories: 24
  • Total versions: 100
  • Total maintainers: 2
pypi.org: kraken

OCR/HTR engine for all the languages

  • Versions: 100
  • Dependent Packages: 6
  • Dependent Repositories: 24
  • Downloads: 9,503 Last month
Rankings
Dependent packages count: 1.4%
Dependent repos count: 3.0%
Average: 3.2%
Downloads: 5.2%
Maintainers (2)

Dependencies

.github/workflows/test.yml actions
  • actions/checkout v2 composite
  • actions/download-artifact v2 composite
  • actions/setup-python v2 composite
  • actions/setup-python v1 composite
  • actions/upload-artifact v2 composite
  • conda-incubator/setup-miniconda v2 composite
  • crazy-max/ghaction-github-pages v2 composite
  • marvinpinto/action-automatic-releases latest composite
  • pypa/gh-action-pypi-publish release/v1 composite
environment.yml conda
  • albumentations
  • click >=8.1
  • imagemagick >=7.1.0
  • jinja2 ~=3.0
  • jsonschema
  • lxml
  • numpy ~=1.23.0
  • pillow ~=9.2.0
  • pip
  • pyarrow
  • python >=3.9
  • python-bidi
  • pytorch-cpu >=1.11,<1.14
  • pytorch-lightning ~=2.0.0
  • pyvips
  • regex
  • requests
  • rich
  • scikit-image ~=0.21.0
  • scikit-learn ~=1.2.1
  • scipy ~=1.11.0
  • shapely ~=1.8.5
  • threadpoolctl ~=3.2
  • torchmetrics >=1.1.0
  • torchvision-cpu >=0.5.0
pyproject.toml pypi
setup.py pypi