https://github.com/google-research/nisaba
Finite-state script normalization and processing utilities
Science Score: 39.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.5%) to scientific vocabulary
Repository
Statistics
- Stars: 43
- Watchers: 6
- Forks: 4
- Open Issues: 19
- Releases: 1
Metadata Files
README.md
Nisaba
Named after Nisaba — the Sumerian goddess of writing and scribe of the gods (𒀭𒉀).

About
Nisaba is a collection of finite-state transducer (FST) based tools for visual normalization, well-formedness checking, transliteration and NFC normalization of various scripts from South Asia and beyond. Nisaba provides these APIs in Python and C++. Currently supported script families:
- Brahmic scripts (documentation).
- Alphabets and abjads (documentation).
- Natural transliteration for Brahmic scripts (documentation).
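As a rough illustration of what NFC normalization means for a Brahmic script, the sketch below uses only Python's standard `unicodedata` module. It does not use Nisaba's FST-based API; the helper name `to_nfc` and the sample strings are assumptions chosen for this example.

```python
import unicodedata

# Devanagari letter NA (U+0928) followed by the combining sign NUKTA (U+093C).
decomposed = "\u0928\u093c"
# The precomposed Devanagari letter NNNA (U+0929), which renders identically.
precomposed = "\u0929"


def to_nfc(text: str) -> str:
    """Return the Unicode NFC form of `text` (standard library, not Nisaba)."""
    return unicodedata.normalize("NFC", text)


# The two spellings differ as code point sequences...
assert decomposed != precomposed
# ...but NFC maps both to the same canonical (here: precomposed) form.
assert to_nfc(decomposed) == to_nfc(precomposed) == precomposed

# Not every precomposed letter survives NFC: QA (U+0958) is a composition
# exclusion, so its NFC form is KA (U+0915) followed by NUKTA (U+093C).
assert to_nfc("\u0958") == "\u0915\u093c"
```

Visually identical strings can therefore differ at the code point level, which is one reason script normalization is useful before comparing or indexing text.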
Nisaba primarily relies on OpenGrm Pynini, which is a Python toolkit for finite-state grammar development. OpenGrm Pynini, like its C++ counterpart Thrax, compiles grammars expressed as strings, regular expressions, and context-dependent rewrite rules into weighted finite-state transducers (WFSTs). It uses the OpenFst library and its Python extension to create, access and manipulate compiled grammars.
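To make the idea of a context-dependent rewrite rule compiled into a transducer more concrete, here is a toy, unweighted transducer written in plain Python. It is only a conceptual sketch and does not use Pynini, Thrax or OpenFst; the rule (rewrite `a` as `b` when it immediately follows `c`), the three-letter alphabet, and the names `TRANSITIONS` and `transduce` are all assumptions made up for this example.

```python
from typing import Dict, List, Tuple

State = int
Arc = Tuple[State, str]  # (next state, output symbol)

# Hypothetical rule over the alphabet {a, b, c}: a -> b / c _
# State 0: the previous input symbol was not "c" (also the start state).
# State 1: the previous input symbol was "c".
TRANSITIONS: Dict[Tuple[State, str], Arc] = {
    (0, "a"): (0, "a"),
    (0, "b"): (0, "b"),
    (0, "c"): (1, "c"),
    (1, "a"): (0, "b"),  # the rewrite fires: "a" right after "c" becomes "b"
    (1, "b"): (0, "b"),
    (1, "c"): (1, "c"),
}


def transduce(text: str) -> str:
    """Run the toy transducer over `text` and return the output string."""
    state: State = 0
    output: List[str] = []
    for symbol in text:
        if (state, symbol) not in TRANSITIONS:
            raise ValueError(f"symbol {symbol!r} is outside the toy alphabet")
        state, out = TRANSITIONS[(state, symbol)]
        output.append(out)
    return "".join(output)


print(transduce("caacab"))  # prints "cbacbb"
```

A grammar compiler such as Pynini builds machines of this general kind from a declarative rule, so the transition table does not have to be written by hand.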
Building and testing
This library will build on any system that supports Bazel, a versatile multiplatform build and test tool. The following examples assume a Debian Linux distribution, but should also apply with minor modifications to other Linux and non-Linux platforms that Bazel supports.
Prerequisites
Bazel or Bazelisk
Your operating system may permit an easy installation of a pre-built Bazel package, as the Debian-specific example below shows:
shell
sudo apt-get install bazel
Alternatively, e.g., on macOS, a user-friendly Bazel launcher called Bazelisk can be installed:
shell
BAZEL=bazelisk-darwin-amd64
curl -LO "https://github.com/bazelbuild/bazelisk/releases/latest/download/$BAZEL"
chmod +x $BAZEL
When using Bazelisk, simply replace the command bazel in the examples below
with $BAZEL.
C++ and Python
Nisaba requires a modern C++ compiler that supports the C++17 standard (e.g., the GCC 10 release series) and Python 3. Assuming these are already present, the required dependencies are the Python 3 development headers and the Python 3 package installer pip.
shell
sudo apt-get install python3-dev
sudo apt-get install python3-pip
Example Debian configuration: gcc (10.2.0), bazel (3.7.2), python3 (3.8.6) and pip (20.1.1).
Getting and building the code
Locally, make sure you are in some sort of a virtual environment (venv, virtualenv, conda, etc.).
Clone the repository (please note, this example clones the main repository rather than a fork, but a forked repo can be used as well):
shell
git clone https://github.com/google-research/nisaba.git
cd nisaba
Build all the targets using Bazel (this example uses optimized mode):
shell
bazel build -c opt ...
The above command will build Nisaba artifacts using all the remote repository dependencies, including OpenFst, Pynini and Thrax, that are specified in the Bazel WORKSPACE file. The resulting artifacts are located in the bazel-bin/nisaba directory.
If the above command fails due to missing Python prerequisites, please install them using the pip Python package manager and try again:
shell
pip3 install --upgrade pip
pip3 install -r requirements.txt
Make sure the small unit tests are passing:
shell
bazel test -c opt --test_size_filters=-large,-enormous ...
The above command should produce something along the following lines:
shell
...
//nisaba/scripts/brahmic:cc_test PASSED in 0.4s
//nisaba/scripts/brahmic:far_cc_test PASSED in 0.2s
//nisaba/scripts/brahmic:far_test PASSED in 2.0s
//nisaba/scripts/brahmic:fixed_test PASSED in 0.2s
//nisaba/scripts/brahmic:fst_properties_test PASSED in 2.3s
//nisaba/scripts/brahmic:iso_test PASSED in 0.3s
//nisaba/scripts/brahmic:nfc_test PASSED in 0.2s
//nisaba/scripts/brahmic:nfc_utf8_test PASSED in 0.2s
//nisaba/scripts/brahmic:py_test PASSED in 2.1s
//nisaba/scripts/brahmic:util_test PASSED in 1.9s
//nisaba/scripts/brahmic:visual_norm_test PASSED in 0.3s
//nisaba/scripts/brahmic:visual_norm_utf8_test PASSED in 0.3s
//nisaba/scripts/brahmic:wellformed_test PASSED in 0.2s
//nisaba/scripts/brahmic:wellformed_utf8_test PASSED in 0.2s
...
You may also want to run all the tests, but depending on your host configuration these may take a long time:
shell
bazel test -c opt ...
Contributions
NOTE: We don't accept pull requests (PRs) at the moment.
License
Nisaba is licensed under the terms of the Apache license. See LICENSE for more information.
Citation
If you use this software in a publication, please cite the accompanying paper from EACL 2021:
bibtex
@inproceedings{nisaba-eacl2021,
title = {Finite-state script normalization and processing utilities: The {N}isaba {B}rahmic library},
author = {Cibu Johny and Lawrence Wolf-Sonkin and Alexander Gutkin and Brian Roark},
booktitle = {16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations},
address = {[Online], Kyiv, Ukraine},
month = apr,
year = {2021},
pages = {14--23},
publisher = {Association for Computational Linguistics},
doi = {10.18653/v1/2021.eacl-demos.3},
url = {https://www.aclweb.org/anthology/2021.eacl-demos.3},
}
Mandatory disclaimer
This is not an official Google product.
Owner
- Name: Google Research
- Login: google-research
- Kind: organization
- Location: Earth
- Website: https://research.google
- Repositories: 226
- Profile: https://github.com/google-research
GitHub Events
Total
- Watch event: 3
- Delete event: 47
- Issue comment event: 9
- Push event: 124
- Pull request event: 99
- Create event: 42
Last Year
- Watch event: 3
- Delete event: 47
- Issue comment event: 9
- Push event: 124
- Pull request event: 99
- Create event: 42
Committers
Last synced: about 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Alexander Gutkin | a****n@g****m | 169 |
| Cibu Johny | c****u@g****m | 107 |
| Isin Demirsahin | i****n@g****m | 80 |
| Nisaba Authors | n****y@g****m | 12 |
| Lawrence Wolf-Sonkin | l****s@g****m | 11 |
| Anna Katanova | a****a@g****m | 5 |
| Kyle Gorman | k****g@g****m | 5 |
| Brian Roark | r****k@g****m | 4 |
| Richard Levasseur | r****r@g****m | 2 |
| Derek Mauro | d****o@g****m | 1 |
| Jesse Emond | e****d@g****m | 1 |
| John Cater | j****r@g****m | 1 |
| Lawrence Wolf-Sonkin | w****n@g****m | 1 |
| Mauricio Alfonso | m****o@g****m | 1 |
| Stephen Thorne | s****e@g****m | 1 |
| Yilei Yang | y****g@g****m | 1 |
Issues and Pull Requests
Last synced: 5 months ago
All Time
- Total issues: 8
- Total pull requests: 384
- Average time to close issues: about 2 months
- Average time to close pull requests: 6 days
- Total issue authors: 5
- Total pull request authors: 1
- Average comments per issue: 4.0
- Average comments per pull request: 0.07
- Merged pull requests: 184
- Bot issues: 0
- Bot pull requests: 384
Past Year
- Issues: 0
- Pull requests: 58
- Average time to close issues: N/A
- Average time to close pull requests: 1 day
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.24
- Merged pull requests: 29
- Bot issues: 0
- Bot pull requests: 58
Top Authors
Issue Authors
- raydoc (3)
- Snorlaxcode (2)
- sinaahmadi (1)
- ramSeraph (1)
- coderarjob (1)
Pull Request Authors
- copybara-service[bot] (384)
Dependencies
- actions/cache v3 composite
- actions/checkout v3 composite
- actions/cache v3 composite
- actions/checkout v3 composite
- networkx >=2.5
- numpy >=1.26.1
- pandas >=1.0.5
- pycountry >=22.3.5
- pyemd >=0.4.1
- networkx ==3.1
- numpy ==1.26.2
- pandas ==2.0.1
- pycountry ==22.3.5
- pyemd ==1.0.0
- python-dateutil ==2.8.2
- pytz ==2023.3
- setuptools ==67.7.2
- six ==1.16.0
- tzdata ==2023.3