chocomufin

Tools for normalizing the use of some characters and checking file consistencies

https://github.com/ponteineptique/choco-mufin

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: PonteIneptique
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 809 KB
Statistics
  • Stars: 11
  • Watchers: 2
  • Forks: 3
  • Open Issues: 5
  • Releases: 4
Created almost 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Choco-Mufin

[CHaracter Ocr COordination for MUFI iN texts]

Tools for normalizing the use of some characters and checking file consistency. They mainly target the overly diverse ways of transcribing medieval data (allographetic and graphematic, for example) while keeping information such as abbreviations, hence MUFI.

Install

pip install chocomufin

Commands

The workflow is generally the following: you generate a conversion table (chocomufin generate table.csv your-files.xml), then use this table to either control your files (chocomufin control table.csv your-files.xml) or convert them (chocomufin convert table.csv your-files.xml). Conversion automatically adds a suffix, which you can define with --suffix.

Table of conversion

Syntax

A conversion table MUST contain at least a char and a replacement column, SHOULD contain regex and allow columns (with either true or empty values), and MAY contain any additional columns.

The columns have the following effect:

  • char is used to match a value in the XML or the text.
  • replacement is used to replace what was found in char.
  • regex, if true, means char and replacement should be parsed as regex.
  • allow, if true, will indicate that replacement should be ignored, and that the value(s) in char are valid.

Any other column should be seen as a comment.
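The column rules above can be checked before a table is used. The following is a minimal sketch using only the Python standard library; it is not part of chocomufin, and the function name `check_table_header` is hypothetical.

```python
import csv
import io

REQUIRED = ("char", "replacement")  # MUST be present
RECOMMENDED = ("regex", "allow")    # SHOULD be present


def check_table_header(csv_text):
    """Return (missing_required, missing_recommended, comment_columns).

    Hypothetical helper: it only inspects the header row of the table.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing_required = [c for c in REQUIRED if c not in header]
    missing_recommended = [c for c in RECOMMENDED if c not in header]
    # Any column outside the four known ones is treated as a comment.
    comments = [c for c in header if c not in REQUIRED + RECOMMENDED]
    return missing_required, missing_recommended, comments


sample = "lineno,char,replacement,regex,allow\n1,V,U,,\n"
print(check_table_header(sample))
```

A table is usable when the first returned list is empty; the other two lists are informational only.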

Example

In the following table:

```csv
lineno,char,replacement,regex,allow
1,V,U,,
2,[a-ik-uw-zA-IK-UW-Z],,true,true
3,(\S)(\.)(\S),\g<1>\g<2> \g<3>,true,
4,_,,,true
```

  • Line no. 1 will replace any V with a U;
  • Line no. 2 will allow any character in the defined range: those characters won't be replaced and will be accepted as is;
  • Line no. 3 will insert a space after any dot that has a non-space character on each side, reusing the groups captured by the regex in the replacement;
  • Line no. 4 will allow _ in the text, and not replace it with anything.

As table:

| lineno | char                   | replacement        | regex | allow |
|--------|------------------------|--------------------|-------|-------|
| 1      | V                      | U                  |       |       |
| 2      | `[a-ik-uw-zA-IK-UW-Z]` |                    | true  | true  |
| 3      | `(\S)(\.)(\S)`         | `\g<1>\g<2> \g<3>` | true  |       |
| 4      | `_`                    |                    |       | true  |

In this context, lineno is not used at all by chocomufin; it only serves as documentation. Like any additional column, it is treated as a comment and does not break chocomufin.
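To make the effect of each row concrete, here is a rough sketch that applies the example table to a string. It is not chocomufin's implementation: it uses the standard `re` module, and it assumes that rows are applied in order and that rows with allow set to true are simply skipped. The helper names `load_rows`, `is_true` and `convert` are hypothetical.

```python
import csv
import io
import re

# The example table, kept as a raw string so backslashes survive.
TABLE = r"""lineno,char,replacement,regex,allow
1,V,U,,
2,[a-ik-uw-zA-IK-UW-Z],,true,true
3,(\S)(\.)(\S),\g<1>\g<2> \g<3>,true,
4,_,,,true
"""


def load_rows(csv_text):
    """Parse the conversion table into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def is_true(row, column):
    """A cell counts as true only if it contains the word 'true'."""
    return (row.get(column) or "").strip().lower() == "true"


def convert(text, rows):
    """Apply every non-allow row in order (assumed behaviour)."""
    for row in rows:
        if is_true(row, "allow"):
            continue  # allowed characters are left untouched
        if is_true(row, "regex"):
            text = re.sub(row["char"], row["replacement"], text)
        else:
            text = text.replace(row["char"], row["replacement"])
    return text


rows = load_rows(TABLE)
print(convert("VINUM_bonum.est", rows))  # prints "UINUM_bonum. est"
```

Row 1 turns V into U, row 3 inserts a space after the dot, and rows 2 and 4 leave the letters and the underscore as they are.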

Github Action Template

Just replace the path to table.csv and the file that needs to be tested, then save this file on your repository in .github/workflows/chocomufin.yml:

```yaml
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: ChocoMufin

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: 3.8
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install chocomufin
    - name: Run ChocoMufin
      run: |
        chocomufin control table.csv **/*.xml
```


Logo by Alix Chagué.

The file original_mufi_json's content is under CC BY-SA 4.0 and comes from https://mufi.info/m.php?p=mufiexport

Owner

  • Name: Thibault Clérice
  • Login: PonteIneptique
  • Kind: user
  • Location: Chantilly, France
  • Company: PSL ENS - Lattice

Simply working on stuff.

Citation (CITATION.CFF)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Clérice
    given-names: Thibault
    orcid: https://orcid.org/0000-0002-0136-4434
  - family-names: Pinche
    given-names: Ariane
    orcid: https://orcid.org/0000-0002-7843-5050
title: "Choco-Mufin, a tool for controlling characters used in OCR and HTR projects"
version: 0.0.4
date-released: 2021-09-01
url: "https://github.com/PonteIneptique/choco-mufin"
doi: 10.5281/zenodo.5356154

GitHub Events

Total
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 13
  • Pull request event: 9
  • Fork event: 1
  • Create event: 4
Last Year
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 13
  • Pull request event: 9
  • Fork event: 1
  • Create event: 4

Committers

Last synced: about 3 years ago

All Time
  • Total Commits: 33
  • Total Committers: 3
  • Avg Commits per committer: 11.0
  • Development Distribution Score (DDS): 0.242
Top Committers
Name Email Commits
Thibault Clérice l****e@g****m 25
Thibault Clérice 1****e@u****m 6
Alix Chagué 3****z@u****m 2

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 12
  • Total pull requests: 8
  • Average time to close issues: 5 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 1.67
  • Average comments per pull request: 1.63
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: about 10 hours
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.25
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • PonteIneptique (7)
  • alix-tz (3)
  • PaulineJac (1)
  • ArianePinche (1)
Pull Request Authors
  • PonteIneptique (9)
  • alix-tz (4)
Top Labels
Issue Labels
enhancement (3) documentation (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 469 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 18
  • Total maintainers: 1
pypi.org: chocomufin

[CHaracter Ocr COordination for MUFI iN texts] A simple script to maintain a reasonable training set of HTR/OCR characters

  • Versions: 18
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 469 Last month
Rankings
Dependent packages count: 10.1%
Downloads: 14.0%
Stargazers count: 18.5%
Forks count: 19.1%
Average: 25.8%
Dependent repos count: 67.3%
Maintainers (1)
Last synced: 8 months ago

Dependencies

requirements.txt pypi
  • Unidecode ==1.2.0
  • click >=8.0.0
  • lxml ==4.6.3
  • mufidecode ==0.1.0
  • regex *
  • tabulate ==0.8.9
  • tqdm ==4.61.1
.github/workflows/test.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
setup.py pypi