https://github.com/google-research/nisaba

Finite-state script normalization and processing utilities

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.5%) to scientific vocabulary

Keywords

bengali brahmic-scripts devanagari finite-state finite-state-automata finite-state-transducers grammars gujarati gurmukhi indic-languages kannada malayalam oriya pynini sinhala tamil telugu unicode unicode-normalization writing-systems

Keywords from Contributors

deep-neural-networks distributed research jax tpu reinforcement-learning video-processing stream-processing pipeline-framework perception
Last synced: 4 months ago

Repository

Finite-state script normalization and processing utilities

Basic Info
  • Host: GitHub
  • Owner: google-research
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 2.15 MB
Statistics
  • Stars: 43
  • Watchers: 6
  • Forks: 4
  • Open Issues: 19
  • Releases: 1
Created about 5 years ago · Last pushed 5 months ago
Metadata Files
Readme Contributing License

README.md

Nisaba

Named after Nisaba — the Sumerian goddess of writing and scribe of the gods (𒀭𒉀).

About

Collection of finite-state transducer-based (FST) tools for visual normalization, well-formedness, transliteration and NFC normalization of various scripts from South Asia and beyond. Nisaba provides these APIs in Python and C++. Currently supported script families:

Nisaba primarily relies on OpenGrm Pynini, which is a Python toolkit for finite-state grammar development. OpenGrm Pynini, like its C++ counterpart Thrax, compiles grammars expressed as strings, regular expressions, and context-dependent rewrite rules into weighted finite-state transducers (WFSTs). It uses the OpenFst library and its Python extension to create, access and manipulate compiled grammars.
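To make the NFC normalization mentioned above concrete, here is a minimal sketch that uses only Python's standard `unicodedata` module. It is not Nisaba's API and does not involve finite-state transducers; it only illustrates the kind of input/output behavior that NFC normalization of Brahmic text implies. The helper name `to_nfc` and the sample strings are chosen for illustration.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the Unicode NFC form of text (standard library only, not Nisaba)."""
    return unicodedata.normalize("NFC", text)


# Bengali vowel sign O can be typed as two code points: U+09C7 then U+09BE.
decomposed = "\u0995\u09c7\u09be"  # KA + vowel sign E + vowel sign AA
composed = "\u0995\u09cb"          # KA + vowel sign O

# The two spellings render identically but differ as raw strings.
assert decomposed != composed
# After NFC normalization both collapse to the same composed sequence.
assert to_nfc(decomposed) == to_nfc(composed) == composed

# Devanagari QA (U+0958) is a composition exclusion, so its NFC form is the
# two-code-point sequence KA (U+0915) + NUKTA (U+093C).
assert to_nfc("\u0958") == "\u0915\u093c"
```

Normalizing both sides of a comparison this way is what lets visually identical strings be treated as equal.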

Building and testing

This library builds on any system supported by Bazel, a versatile multiplatform build and test tool. The following examples assume a Debian Linux distribution, but should also apply, with minor modifications, to other Linux and non-Linux platforms that Bazel supports.

Prerequisites

Bazel or Bazelisk

Your operating system may offer a pre-built Bazel package that is easy to install, as the Debian-specific example below shows:

```shell
sudo apt-get install bazel
```

Alternatively, e.g., on macOS, a user-friendly Bazel launcher called Bazelisk can be installed:

```shell
BAZEL=bazelisk-darwin-amd64
curl -LO "https://github.com/bazelbuild/bazelisk/releases/latest/download/$BAZEL"
chmod +x $BAZEL
```

When using Bazelisk, simply replace the command bazel in the examples below with $BAZEL.

C++ and Python

Nisaba requires a modern C++ compiler that supports the C++17 standard (e.g., the GCC 10 release series) and Python 3. Assuming these are already present, the remaining dependencies are the Python 3 development headers and the Python 3 package installer pip.

```shell
sudo apt-get install python3-dev
sudo apt-get install python3-pip
```

Example Debian configuration: gcc (10.2.0), bazel (3.7.2), python3 (3.8.6) and pip (20.1.1).
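Before building, it can help to confirm that the prerequisites listed above are on the `PATH`. The following is a minimal sketch, not part of Nisaba: the tool list in `REQUIRED_TOOLS`, the helper names, and the Python 3.8 floor are assumptions for illustration (the floor simply mirrors the example configuration above, since the README does not state a minimum version).

```python
import shutil
import sys

# Hypothetical tool list based on the prerequisites described above; adjust
# the names to your platform (for example, a Bazelisk binary instead of bazel).
REQUIRED_TOOLS = ("bazel", "g++", "python3", "pip3")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


def python_is_recent(minimum=(3, 8)):
    """Check the running interpreter against an assumed minimum version."""
    return sys.version_info[:2] >= tuple(minimum)


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:8} {path or 'NOT FOUND'}")
    print("python >= 3.8:", python_is_recent())
```

A missing entry only means the tool was not found under that name; it does not verify compiler or Bazel versions.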

Getting and building the code

  1. Locally, make sure you are in some sort of a virtual environment (venv, virtualenv, conda, etc).

  2. Clone the repository (note that this example clones the main repository rather than a fork, but a fork can be used as well):

    ```shell
    git clone https://github.com/google-research/nisaba.git
    cd nisaba
    ```

  3. Build all the targets using Bazel (this example uses optimized mode):

    ```shell
    bazel build -c opt ...
    ```

    The above command builds the Nisaba artifacts using all the remote repository dependencies, including OpenFst, Pynini and Thrax, that are specified in the Bazel WORKSPACE file. The resulting artifacts are located in the bazel-bin/nisaba directory.

    If the above command fails due to missing Python prerequisites, please install them using the pip Python package manager and try again:

    ```shell
    pip3 install --upgrade pip
    pip3 install -r requirements.txt
    ```

  4. Make sure the small unit tests are passing:

    ```shell
    bazel test -c opt --test_size_filters=-large,-enormous ...
    ```

    The above command should produce something along the following lines:

    ```shell
    ...
    //nisaba/scripts/brahmic:cc_test                    PASSED in 0.4s
    //nisaba/scripts/brahmic:far_cc_test                PASSED in 0.2s
    //nisaba/scripts/brahmic:far_test                   PASSED in 2.0s
    //nisaba/scripts/brahmic:fixed_test                 PASSED in 0.2s
    //nisaba/scripts/brahmic:fst_properties_test        PASSED in 2.3s
    //nisaba/scripts/brahmic:iso_test                   PASSED in 0.3s
    //nisaba/scripts/brahmic:nfc_test                   PASSED in 0.2s
    //nisaba/scripts/brahmic:nfc_utf8_test              PASSED in 0.2s
    //nisaba/scripts/brahmic:py_test                    PASSED in 2.1s
    //nisaba/scripts/brahmic:util_test                  PASSED in 1.9s
    //nisaba/scripts/brahmic:visual_norm_test           PASSED in 0.3s
    //nisaba/scripts/brahmic:visual_norm_utf8_test      PASSED in 0.3s
    //nisaba/scripts/brahmic:wellformed_test            PASSED in 0.2s
    //nisaba/scripts/brahmic:wellformed_utf8_test       PASSED in 0.2s
    ...
    ```

    You may also want to run all the tests, but depending on your host configuration these may take a long time:

    ```shell
    bazel test -c opt ...
    ```

Contributions

NOTE: We don't accept pull requests (PRs) at the moment.

License

Nisaba is licensed under the terms of the Apache license. See LICENSE for more information.

Citation

If you use this software in a publication, please cite the accompanying paper from EACL 2021:

```bibtex
@inproceedings{nisaba-eacl2021,
  title = {Finite-state script normalization and processing utilities: The {N}isaba {B}rahmic library},
  author = {Cibu Johny and Lawrence Wolf-Sonkin and Alexander Gutkin and Brian Roark},
  booktitle = {16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations},
  address = {[Online], Kyiv, Ukraine},
  month = apr,
  year = {2021},
  pages = {14--23},
  publisher = {Association for Computational Linguistics},
  doi = {10.18653/v1/2021.eacl-demos.3},
  url = {https://www.aclweb.org/anthology/2021.eacl-demos.3},
}
```

Mandatory disclaimer

This is not an official Google product.

Owner

  • Name: Google Research
  • Login: google-research
  • Kind: organization
  • Location: Earth

GitHub Events

Total
  • Watch event: 3
  • Delete event: 47
  • Issue comment event: 9
  • Push event: 124
  • Pull request event: 99
  • Create event: 42
Last Year
  • Watch event: 3
  • Delete event: 47
  • Issue comment event: 9
  • Push event: 124
  • Pull request event: 99
  • Create event: 42

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 402
  • Total Committers: 16
  • Avg Commits per committer: 25.125
  • Development Distribution Score (DDS): 0.58
Past Year
  • Commits: 98
  • Committers: 10
  • Avg Commits per committer: 9.8
  • Development Distribution Score (DDS): 0.602
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Alexander Gutkin | a****n@g****m | 169 |
| Cibu Johny | c****u@g****m | 107 |
| Isin Demirsahin | i****n@g****m | 80 |
| Nisaba Authors | n****y@g****m | 12 |
| Lawrence Wolf-Sonkin | l****s@g****m | 11 |
| Anna Katanova | a****a@g****m | 5 |
| Kyle Gorman | k****g@g****m | 5 |
| Brian Roark | r****k@g****m | 4 |
| Richard Levasseur | r****r@g****m | 2 |
| Derek Mauro | d****o@g****m | 1 |
| Jesse Emond | e****d@g****m | 1 |
| John Cater | j****r@g****m | 1 |
| Lawrence Wolf-Sonkin | w****n@g****m | 1 |
| Mauricio Alfonso | m****o@g****m | 1 |
| Stephen Thorne | s****e@g****m | 1 |
| Yilei Yang | y****g@g****m | 1 |
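
The All Time figures above can be reproduced from the per-committer counts in this table. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; the page does not spell out the formula, but this definition is consistent with the 0.58 shown. The function name is hypothetical.

```python
def development_distribution_score(commit_counts):
    """Assumed definition: 1 - (commits by top committer / total commits)."""
    total = sum(commit_counts)
    if total == 0:
        raise ValueError("no commits to score")
    return 1 - max(commit_counts) / total


# Per-committer commit counts copied from the Top Committers table.
all_time = [169, 107, 80, 12, 11, 5, 5, 4, 2, 1, 1, 1, 1, 1, 1, 1]

total_commits = sum(all_time)                # 402
committers = len(all_time)                   # 16
average = total_commits / committers         # 25.125
dds = development_distribution_score(all_time)

print(total_commits, committers, average, round(dds, 2))
```

Under this definition a score near 0 means one person wrote almost everything, and a score near 1 means commits are spread widely.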

Issues and Pull Requests

Last synced: 5 months ago

All Time
  • Total issues: 8
  • Total pull requests: 384
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 6 days
  • Total issue authors: 5
  • Total pull request authors: 1
  • Average comments per issue: 4.0
  • Average comments per pull request: 0.07
  • Merged pull requests: 184
  • Bot issues: 0
  • Bot pull requests: 384
Past Year
  • Issues: 0
  • Pull requests: 58
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.24
  • Merged pull requests: 29
  • Bot issues: 0
  • Bot pull requests: 58
Top Authors
Issue Authors
  • raydoc (3)
  • Snorlaxcode (2)
  • sinaahmadi (1)
  • ramSeraph (1)
  • coderarjob (1)
Pull Request Authors
  • copybara-service[bot] (384)
Top Labels
Issue Labels
  • enhancement (5)
  • bug (3)
Pull Request Labels
  • copybara (360)

Dependencies

.github/workflows/linux-ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
.github/workflows/macos-ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
requirements.in pypi
  • networkx >=2.5
  • numpy >=1.26.1
  • pandas >=1.0.5
  • pycountry >=22.3.5
  • pyemd >=0.4.1
requirements.txt pypi
  • networkx ==3.1
  • numpy ==1.26.2
  • pandas ==2.0.1
  • pycountry ==22.3.5
  • pyemd ==1.0.0
  • python-dateutil ==2.8.2
  • pytz ==2023.3
  • setuptools ==67.7.2
  • six ==1.16.0
  • tzdata ==2023.3
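
As a quick consistency check on the two dependency listings above, the following sketch confirms that every lower bound in requirements.in is met by the corresponding pin in requirements.txt. It is an illustration only: the helper names are hypothetical, and the version parsing handles plain dotted numeric versions, not the full Python version specifier grammar.

```python
def parse_version(text, width=3):
    """Turn '1.26.2' into (1, 26, 2), padding short versions with zeros."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


# Lower bounds from requirements.in and exact pins from requirements.txt,
# copied from the listings above.
LOWER_BOUNDS = {"networkx": "2.5", "numpy": "1.26.1", "pandas": "1.0.5",
                "pycountry": "22.3.5", "pyemd": "0.4.1"}
PINNED = {"networkx": "3.1", "numpy": "1.26.2", "pandas": "2.0.1",
          "pycountry": "22.3.5", "pyemd": "1.0.0", "python-dateutil": "2.8.2",
          "pytz": "2023.3", "setuptools": "67.7.2", "six": "1.16.0",
          "tzdata": "2023.3"}


def unsatisfied(lower_bounds, pinned):
    """Return names whose pin is missing or below the declared lower bound."""
    return [name for name, minimum in lower_bounds.items()
            if name not in pinned
            or parse_version(pinned[name]) < parse_version(minimum)]


print(unsatisfied(LOWER_BOUNDS, PINNED))
```

The pins that have no entry in requirements.in (for example python-dateutil and pytz) are not checked by this sketch.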