Recent Releases of kraken
kraken - 5.2.9 - Bugfix release
What's Changed
- Pins python-bidi to a version that supports our internal data structure mangling
- Fixes a small regression in pretraining
- Various PageXML serialization improvements
- ketos now prints a helpful message when trying to use a binary file with the -t/-e options expecting manifest files
- Fixes serialization of dummy boxes by @PonteIneptique in https://github.com/mittagessen/kraken/pull/612
- Update alto to not produce Polygon tag on default blocks by @PonteIneptique in https://github.com/mittagessen/kraken/pull/620
- corrected mask of patch by @saiprabhath2002 in https://github.com/mittagessen/kraken/pull/617
New Contributors
- @saiprabhath2002 made their first contribution in https://github.com/mittagessen/kraken/pull/617
Full Changelog: https://github.com/mittagessen/kraken/compare/5.2.5...5.2.9
- Python
Published by mittagessen over 1 year ago
kraken - 5.2.5 Bugfix release
- Fixes XML serialization of segmentation results (#597)
- Removes regression in polygonization code introduced with performance enhancements (#605)
- extract_polygons() now raises an exception when processing baselines < 5px in length (#606)
- Various small improvements to contrib/segmentation_overlay.py
- ketos compile progress bar now displays elapsed/remaining time (#504)
Published by mittagessen almost 2 years ago
kraken - Hotfix release
- Fixes a regression in container-based binary dataset building
- Fixes spurious updates of validation metrics after sanity checking
Published by mittagessen almost 2 years ago
kraken - Hotfix for segmentation training
What's Changed
- Hotfix for segmentation training
Published by mittagessen almost 2 years ago
kraken - Hotfix for no_segmentation mode recognition
Hotfix release fixing a regression in no_segmentation recognition.
Published by mittagessen almost 2 years ago
kraken - 5.2.1 hotfix release
This release contains two small fixes for a regression related to bumping lightning up to 2.2 and a crash in Segmentation instantiation occurring when the first region type does not contain a region/dict.
Published by mittagessen almost 2 years ago
kraken - 5.0 release with minor bugfixes
Kraken 5.x is a major release introducing trainable reading order, a cleaner API, and changes resulting in a ~50% performance improvement of recognition inference, in addition to a large number of smaller bug fixes and stability improvements.
What's Changed
- Trainable reading order based on a neural order relation operator adapted from this method (https://github.com/mittagessen/kraken/pull/492)
- Updates to the ALTO/PageXML templates and the serializer which correct serialization of region and line taxonomies, use UUIDs, and reuse identifiers from input XML files in output.
- Requirements are now mostly pinned to avoid pytorch/lightning accuracy and speed regressions that popped up semi-regularly with more free package versions.
- Threadpool limits are now set in all CLI drivers to prevent slowdown from unreasonably large numbers of threads in libraries like OpenCV. As a result the --threads option of all commands has been split into --workers and --threads.
- kraken.repo methods have been adapted to the new Zenodo API. They also correctly handle versioned records now.
- A small fix enabling recognition inference with AMP.
- Support for --fixed-splits in ketos test (@PonteIneptique)
- Performance increase for polygon extraction by @Evarin in https://github.com/mittagessen/kraken/pull/555
- Speed up legacy polygon extraction by @anutkk in https://github.com/mittagessen/kraken/pull/586
- New container classes in kraken.containers replace the previous dicts produced and expected by segment/rpred/serialize.
- kraken.serialize.serialize_segmentation() has been removed as part of the container class rework.
- train/rotrain/segtrain/pretrain cosine annealing scheduling now allows setting the final learning rate with --cos-min-lr.
- Lots of PEP8/whitespace/spelling mistake fixes from @stweil
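As a rough illustration of what a final learning rate means for cosine annealing, the sketch below uses the standard closed-form cosine schedule with a floor value. The function name and its parameters are hypothetical and this is not kraken's implementation; min_lr only plays the role that --cos-min-lr plays on the command line.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int, base_lr: float, min_lr: float) -> float:
    """Standard cosine annealing from base_lr down to min_lr (hypothetical helper)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside the schedule stay at its start or end value.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the learning rate at a few points of a 100-step schedule.
    for step in (0, 25, 50, 75, 100):
        print(step, cosine_annealed_lr(step, 100, base_lr=1e-3, min_lr=1e-5))
```

With a floor of zero the schedule decays all the way to zero; a non-zero floor keeps the last steps training at a small but usable rate.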
New features
Reading order training
Reading order can now be learned with ketos rotrain and reading order models can be added to segmentation model files. The training process is documented here.
Upgrade guide
Command line
Polygon extractor
The polygon extractor takes a page image, baselines, and their bounding polygons, then dewarps and masks out each line.
The new polygon extractor reduces line extraction time 30x, roughly halving inference time and significantly speeding up training from XML files and compilation of datasets. It should be noted that polygon extraction does not concern data in the legacy bounding box format nor does it touch the segmentation process as it is only a preprocessing step in the recognizer on an already existing segmentation.
Not all improvements in the polygon extractor are backward compatible, causing models trained with data extracted with the old implementation to suffer from a slight reduction in accuracy (usually <0.25 percentage points). Therefore models now contain a flag in their metadata indicating which implementation has been used to train them. This flag can be overridden, e.g.:
$ kraken --no-legacy-polygons -i ... ... ocr ...
to enable all speedups for a slight increase in character error rate.
For training, the new extractor is enabled by default, i.e. models trained with kraken 5.x will perform slightly worse on earlier kraken versions but will still work. It is possible to force the use of only backwards-compatible speedups:
$ ketos compile --legacy-polygons ...
$ ketos train --legacy-polygons ...
$ ketos pretrain --legacy-polygons ...
Threads and Multiprocessing
The command line tools now handle multiprocessing and thread pools more completely and configurably. --workers has been split into --threads and --workers, the former option limiting the size of thread pools (as much as possible) for intra-op parallelization, the latter setting the number of worker processes, usually for the purpose of data loading in training and dataset compilation.
API changes
While 5.x preserves the general OCR functional blocks, the existing dictionary-based data structures have been replaced with container classes and the XML parser has been reworked.
Container classes
For straightforward processing little has changed. Most keys of the dictionaries have been converted into attributes of their respective classes.
The segmentation methods now return a Segmentation object containing Region and BaselineLine/BBoxLine objects:
```
>>> pageseg.segment(im)
{'text_direction': 'horizontal-lr',
 'boxes': [(x1, y1, x2, y2),...],
 'script_detection': False
}

>>> blla.segment(im)
{'text_direction': '$dir',
 'type': 'baseline',
 'lines': [{'baseline': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'boundary': [[x0, y0, x1, y1], ... [x_m, y_m]]},
           ...
           {'baseline': [[x0, ...]], 'boundary': [[x0, ...]]}]
 'regions': [{'region': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'type': 'image'},
             ...
             {'region': [[x0, ...]], 'type': 'text'}]
}
```
becomes:
```
>>> pageseg.segment(im)
Segmentation(type='bbox',
             imagename=None,
             text_direction='horizontal-lr',
             script_detection=False,
             lines=[BBoxLine(id='f1d5b1e2-030c-41d5-b299-8a114eb0996e',
                             bbox=[34, 198, 279, 251],
                             text=None,
                             base_dir=None,
                             type='bbox',
                             imagename=None,
                             tags=None,
                             split=None,
                             regions=None,
                             text_direction='horizontal-lr'),
                    BBoxLine(...)],
             line_orders=[])

>>> blla.segment(im)
Segmentation(type='baseline',
             imagename=im,
             text_direction='horizontal-lr',
             script_detection=False,
             lines=[BaselineLine(id='50ab1a29-c3b6-4659-9713-ff246b21d2dc',
                                 baseline=[[183, 284], [272, 282]],
                                 boundary=[[183, 284], ... ,[183, 284]],
                                 text=None,
                                 base_dir=None,
                                 type='baselines',
                                 tags={'type': 'default'},
                                 split=None,
                                 regions=['e28ccb6b-2874-4be0-8e0d-38948f0fdf09']),
                    ...],
             regions={'text': [Region(id='e28ccb6b-2874-4be0-8e0d-38948f0fdf09',
                                      boundary=[[123, 218], ..., [123, 218]],
                                      tags={'type': 'text'}),
                               ...],
                      'foo': [Region(...), ...]},
             line_orders=[])
```
The recognizer now yields BaselineOCRRecords/BBoxOCRRecords which both inherit from the BaselineLine/BBoxLine classes:
```
>>> record = rpred(network=model,
                   im=im,
                   segmentation=baseline_seg)
>>> record = next(rpred.rpred(im))
>>> record
BaselineOCRRecord pred: 'predicted text' baseline: ...
>>> record.type
'baselines'
>>> record.line
BaselineLine(...)
>>> record.prediction
'predicted text'
```
One complication is the new serialization function which now accepts a
Segmentation object instead of a list of ocr_records and ancillary metadata:
```
>>> records = list(x for x in rpred(...))
>>> serialize(records,
              image_name=im.filename,
              image_size=im.size,
              writing_mode='horizontal-tb',
              scripts=['Latn', 'Hebr'],
              regions=[{...}],
              template='alto',
              template_source='native',
              processing_steps=proc_steps)
```
becomes:
```
>>> import dataclasses
>>> baseline_seg
Segmentation(...)
>>> records = list(x for x in rpred(..., segmentation=baseline_seg))
>>> results = dataclasses.replace(baseline_seg, lines=records)
>>> serialize(results,
              image_size=im.size,
              writing_mode='horizontal-tb',
              scripts=['Latn', 'Hebr'],
              template='alto',
              template_source='native',
              processing_steps=proc_steps)
```
This requires the construction of a new Segmentation object that contains the
records produced by the text predictor. The most straightforward way to create
this new Segmentation is through the dataclasses.replace function as our
container classes are immutable.
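The copy-instead-of-mutate pattern described above can be sketched with stand-in classes. Line and Seg below are hypothetical frozen dataclasses, not kraken's BaselineLine or Segmentation, and the snippet assumes the real containers behave like frozen dataclasses, as the text states:

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace
from typing import List, Optional


@dataclass(frozen=True)
class Line:
    """Stand-in for a line container; not kraken's BaselineLine."""
    id: str
    text: Optional[str] = None


@dataclass(frozen=True)
class Seg:
    """Stand-in for a segmentation container; not kraken's Segmentation."""
    type: str
    lines: List[Line] = field(default_factory=list)


seg = Seg(type='baselines', lines=[Line(id='l1'), Line(id='l2')])

# Pretend these records came out of the recognizer.
records = [replace(line, text=f'text for {line.id}') for line in seg.lines]

# Direct assignment fails on a frozen dataclass ...
try:
    seg.lines = records
except FrozenInstanceError:
    pass

# ... so a modified copy is built instead; the original stays untouched.
results = replace(seg, lines=records)
```

dataclasses.replace copies every field that is not overridden, so metadata such as the segmentation type carries over to the new object without being restated.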
Lastly, serialize_segmentation has been removed. The serialize function now
accepts Segmentation objects which do not contain text predictions:
```
>>> serialize_segmentation(segresult={'text_direction': '$dir',
                                      'type': 'baseline',
                                      'lines': [{'baseline': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'boundary': [[x0, y0, x1, y1], ... [x_m, y_m]]},
                                                ...
                                                {'baseline': [[x0, ...]], 'boundary': [[x0, ...]]}]
                                      'regions': [{'region': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'type': 'image'},
                                                  ...
                                                  {'region': [[x0, ...]], 'type': 'text'}]
                                     },
                           image_name=im.filename,
                           image_size=im.size,
                           template='alto',
                           template_source='native',
                           processing_steps=proc_steps)
```
is replaced by:
```
>>> baseline_seg
Segmentation(...)
>>> serialize(baseline_seg,
              image_size=im.size,
              writing_mode='horizontal-tb',
              scripts=['Latn', 'Hebr'],
              template='alto',
              template_source='native',
              processing_steps=proc_steps)
```
XML parsing
The kraken.lib.xml.parse_{xml,alto,page} methods have been replaced by a single kraken.lib.xml.XMLPage class.
```
>>> parse_xml('xyz.xml')
{'image': impath,
 'lines': [{'boundary': [[x0, y0], ...],
            'baseline': [[x0, y0], ...],
            'text': 'apdjfqpf',
            'tags': {'type': 'default', ...}},
           ...
           {...}],
 'regions': {'region_type_0': [[[x0, y0], ...], ...], ...}}
```
becomes
```
>>> XMLPage('xyz.xml')
XMLPage xyz.xml (format: alto, image: impath)
```
As the parser is now aware of reading order, the XMLPage.lines attribute is an
unordered dict of BaselineLine/BBoxLine container classes. As ALTO/PageXML
files can generally contain multiple different reading orders, the
XMLPage.get_sorted_lines()/XMLPage.get_sorted_regions() methods on the object
provide an ordered view of lines or regions. The default order
line_implicit/region_implicit corresponds to the order produced by the
previous parsers, i.e. the order formed by the sequence of elements in the XML
tree.
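The relationship between an unordered collection of lines and several named reading orders can be sketched as follows. This is a toy model, not the internals of XMLPage: the dictionaries, the sorted_lines helper, and the order name explicit_0 are all made up for illustration; only the name line_implicit comes from the text above.

```python
from typing import Dict, List

# Unordered mapping of line id -> line content (plain strings for brevity).
lines: Dict[str, str] = {'l3': 'third', 'l1': 'first', 'l2': 'second'}

# Several reading orders over the same lines. Names other than
# 'line_implicit' are invented for this example.
reading_orders: Dict[str, List[str]] = {
    'line_implicit': ['l3', 'l1', 'l2'],  # order of elements in the XML tree
    'explicit_0': ['l1', 'l2', 'l3'],     # an order declared in the file
}


def sorted_lines(order: str = 'line_implicit') -> List[str]:
    """Return line contents following the chosen reading order."""
    if order not in reading_orders:
        raise ValueError(f'unknown reading order: {order}')
    return [lines[line_id] for line_id in reading_orders[order]]
```

Keeping the lines themselves unordered and storing each order as a list of identifiers lets one file carry any number of reading orders without duplicating line data.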
XMLPage objects can be converted into a Segmentation container using the
XMLPage.to_container() method:
```
>>> XMLPage('xyz.xml').to_container()
Segmentation(...)
```
Full Changelog: https://github.com/mittagessen/kraken/compare/4.3.13...5.2
Published by mittagessen almost 2 years ago
kraken - 4.3.10
This is mostly a bugfix release but also includes a couple of minor improvements and changes.
Changes
- Deterministic mode is now set to 'warn' preventing crashes in deterministic recognition training (CTC loss does not have a deterministic implementation).
- contrib/extract_lines.py now works with binary datasets
- 'Word' error rate has been added as a validation metric in recognition training
- The fine-tuning options (--resize) add/both have been renamed to union/new. (Thibault Clérice) #488
- Tensorboard logging now also logs a couple of training images
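For readers unfamiliar with the metric, word error rate is commonly defined as the word-level edit distance between prediction and ground truth divided by the number of ground truth words. The sketch below follows that common definition; the function names and the whitespace tokenization are assumptions, and kraken's own computation may differ in its details.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(references: List[str], hypotheses: List[str]) -> float:
    """Total word-level edit distance divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    if words == 0:
        raise ValueError('references contain no words')
    return errors / words
```

For example, comparing the reference "the quick brown fox" with the prediction "the quack brown" gives one substitution and one deletion over four reference words.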
Published by github-actions[bot] almost 3 years ago
kraken - 4.3.5
This is just another hotfix release.
Changes
- 799ee78: Propagation of the --raise-on-error for raising non-blocking errors in blla segmentation (Thibault Clérice) #444
- d81e898: adds pl_logger to default hyperparams dict (Benjamin Kiessling)
Published by github-actions[bot] about 3 years ago
kraken - 4.3.4
This is a hotfix release to 4.3.0 correcting a regression in the CLI, fixing pretrain validation losses, and the conda environment files.
Commits
- ac5fab6: Invalid type in click option definition for loggers (Benjamin Kiessling)
- 0cb9e0e: fix validation loss computation in pretrain (Benjamin Kiessling)
- 7d5069b: Remove former development raise in segmentation (Thibault Clérice) #441
- 0e3d10f: Install coremltools from pip for conda environments (Benjamin Kiessling)
Published by github-actions[bot] about 3 years ago
kraken - 4.3.0
What's Changed
- Pretraining has been reimplemented to be more faithful to the original publication for more stable memory consumption and easier hyperparameter selection
- Learning rate warmup and backbone freezing in recognition training with --warmup and --freeze-backbone (mostly to enable fine-tuning pretrained models)
- Enable ketos compile to create precompiled datasets with lines without a corresponding transcription with the --keep-empty-lines switch (mostly for pretraining models).
- --failed-sample-threshold in training modules, aborting training after a certain number of samples failed to load
- tensorboard logging with --logger/--log-dir options
- Change codec construction during training when training and validation dataset alphabets don't match. Previously, code points that only exist in the validation set were copied to the model codec. Now the model codec only contains trained code points.
- Replace ocr_record with new smart classes BaselineOCRRecord and BBoxOCRRecord. These keep track of reading/display order, compute bounding polygons from the whole line bounding polygon, and average confidences when slicing.
- ALTO parsing now deals with any reasonable PointsType (see https://github.com/altoxml/schema/issues/49)
- The fallback line orientation heuristic now takes into account the principal text orientation defined with --text-direction instead of assuming horizontal lines (--text-direction horizontal-lr/-rl).
- Baseline segmentation now supports padding of input images with --pad.
- CLI now allows serialization with custom jinja2 templates through the --template option.
- Switch validation metrics computation to torchmetrics.
- Various bugfixes, mostly to deal with shapely shenanigans.
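The codec change listed above can be pictured with a small sketch: labels are assigned only to code points seen in the training data, and validation-only code points are reported rather than added. The build_codec helper and its label numbering are hypothetical and not kraken's codec class.

```python
from typing import Dict, Iterable, Set, Tuple


def build_codec(train_lines: Iterable[str],
                val_lines: Iterable[str]) -> Tuple[Dict[str, int], Set[str]]:
    """Map training code points to labels and report validation-only ones."""
    train_alphabet = set(''.join(train_lines))
    val_alphabet = set(''.join(val_lines))
    # Label 0 is left free here, as is common for a CTC blank (an assumption).
    codec = {char: label
             for label, char in enumerate(sorted(train_alphabet), start=1)}
    # Code points seen only in validation data are NOT added to the codec.
    unseen = val_alphabet - train_alphabet
    return codec, unseen
```

A model cannot learn to emit a label it never sees during training, so leaving such code points out keeps the output layer limited to symbols the model can actually produce.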
Thanks
- @sixtyfive, @anutkk, @stweil, @colibrisson, @PonteIneptique for their contributions to this release.
Full Changelog: https://github.com/mittagessen/kraken/compare/4.2.0...4.3.0
Published by mittagessen about 3 years ago
kraken - 3.0.6
This is mainly a bugfix release containing small improvements such as additional tests, typing, spelling corrections, additional contrib scripts, and fixes for rarely used functionality.
Bugfixes
- Orthography and missing help messages in the CLI drivers
- Documentation for batch input specifications
- Fix a regression in early stopping when training on GPU
- Fix a regression in polygonization in the presence of regions
- Do not duplicate regions during serialization
- Add dummy String beneath TextLine w/o text in ALTO to avoid standard-violating empty TextLines
- The codec loading functionality of ketos train and KrakenTrainer actually loads a given codec now.
- Fall back to simple scaling when centerline dewarping fails
- Drop (duplicate) short option form -p for --pad in all ketos commands
Features
- The forced alignment script contrib/forced_alignment_overlay.py now preserves the input file and only replaces the character cuts.
- Add reading order tests
- Explicit model sanity checks in blla.segment()
- Add baseline offset options to repolygonization script
- Make codec self-synchronizing
- Add TextEquiv for Word and TextLine in PAGE XML output
- Raise PIL image size limit to 20k*20k image dimensions
Published by github-actions[bot] over 4 years ago