g2p

Grapheme-to-Phoneme transductions that preserve input and output indices, and support cross-lingual g2p!

https://github.com/roedoejet/g2p

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 26 committers (3.8%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.2%) to scientific vocabulary

Keywords from Contributors

mesh sequences interactive hacking network-simulation
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 174
  • Watchers: 9
  • Forks: 33
  • Open Issues: 30
  • Releases: 19
Created almost 7 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Citation

README.md

Gᵢ2Pᵢ

Grapheme-to-Phoneme transformations that preserve input and output indices!

This library is for handling arbitrary conversions between input and output segments while preserving indices.
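
To make the idea of index preservation concrete, here is a minimal sketch in plain Python. It does not use g2p itself: the function name `transduce_with_indices` and the two toy rules are hypothetical, and the real library's data structures may differ. It only illustrates what it means to keep a link between each input position and the output positions it produced.

```python
def transduce_with_indices(text, rules):
    """Apply the first matching rule at each position, left to right.

    `rules` is a list of (input, output) pairs with non-empty inputs.
    Returns the output string and a list of (input_index, output_index) edges.
    Toy illustration only; not g2p's implementation.
    """
    output = []
    edges = []
    i = 0
    while i < len(text):
        for src, dst in rules:
            if text.startswith(src, i):
                break
        else:
            # No rule matched: copy the character through unchanged.
            src, dst = text[i], text[i]
        out_start = len(output)
        output.extend(dst)
        # Link every matched input character to every character it produced.
        for in_idx in range(i, i + len(src)):
            for out_idx in range(out_start, out_start + len(dst)):
                edges.append((in_idx, out_idx))
        i += len(src)
    return "".join(output), edges


out, edges = transduce_with_indices("shin", [("sh", "ʃ"), ("i", "ɪ")])
print(out)    # ʃɪn
print(edges)  # [(0, 0), (1, 0), (2, 1), (3, 2)]
```

Here both `s` and `h` point at the single output character `ʃ`, which is the kind of alignment a downstream tool needs in order to map output segments back onto the original text.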

See also:

  • Gᵢ2Pᵢ documentation
  • 7-part series on the Mother Tongues Blog
  • Gᵢ2Pᵢ Studio

Background

The initial version of this package was developed by Patrick Littell to allow for g2p from community orthographies to IPA and back again in ReadAlong-Studio. We then decided to pull the g2p mechanism out of Convertextract, which allows transducer relations to be declared in CSV files, and turn it into its own library. Here it is! For an in-depth series on the motivation and how to use this tool, have a look at this 7-part series on the Mother Tongues Blog, or for a more technical overview, have a look at this paper.

Install

The best way to install is with pip: `pip install g2p`. This command installs the latest release published on PyPI (g2p releases).

You can also use hatch (see hatch installation instructions) to set up an isolated local development environment, which may be useful if you wish to contribute new mappings:

```sh
$ git clone https://github.com/roedoejet/g2p.git
$ cd g2p
$ hatch shell
```

You can also simply install an "editable" version with pip (but it is recommended to do this in a virtual environment or a conda environment):

```sh
$ git clone https://github.com/roedoejet/g2p.git
$ cd g2p
$ pip install -e .
```

Usage

The easiest way to create a transducer is to use the g2p.make_g2p function.

To use it, first import the function:

from g2p import make_g2p

Then, call it with an argument for in_lang and out_lang. Both must be strings equal to the name of a particular mapping.

```python
>>> transducer = make_g2p('dan', 'eng-arpabet')
>>> transducer('hej').output_string
'HH EH Y'
```

There must be a valid path between the in_lang and out_lang in order for this to work. If you've edited a mapping or added a custom mapping, you must update g2p to include it by running `g2p update`.
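
The "valid path" requirement can be pictured as reachability in a directed graph of mappings. The sketch below is a plain-Python illustration under stated assumptions: the `NETWORK` dictionary and the `find_path` helper are hypothetical, the edges shown are only an assumed example chain for Danish, and this is not how g2p itself stores or searches its network.

```python
from collections import deque

# Hypothetical mapping network: each code lists the codes it converts to directly.
NETWORK = {
    "dan": ["dan-ipa"],
    "dan-ipa": ["eng-ipa"],
    "eng-ipa": ["eng-arpabet"],
    "eng-arpabet": [],
}


def find_path(network, in_lang, out_lang):
    """Breadth-first search for a chain of mappings; returns None if there is none."""
    queue = deque([[in_lang]])
    seen = {in_lang}
    while queue:
        path = queue.popleft()
        if path[-1] == out_lang:
            return path
        for nxt in network.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(find_path(NETWORK, "dan", "eng-arpabet"))
# ['dan', 'dan-ipa', 'eng-ipa', 'eng-arpabet']
print(find_path(NETWORK, "eng-arpabet", "dan"))
# None
```

The second call returns `None` because the edges are directed: a mapping from A to B does not imply one from B to A.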

Writing mapping files

Mapping files are written as either CSV or JSON files.

CSV

CSV files write each new rule as a new line and consist of at least two columns, and up to four. The first column is required and corresponds to the rule's input. The second column is also required and corresponds to the rule's output. The third column is optional and corresponds to the context before the rule input. The fourth column is also optional and corresponds to the context after the rule input. For example:

  1. This mapping describes two rules: a -> b and c -> d.

```csv
a,b
c,d
```

  2. This mapping describes two rules: a -> b / c _ d (see footnote 1) and a -> e.

```csv
a,b,c,d
a,e
```

The g2p studio exports its rules to CSV format.
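
As a rough illustration of the column layout described above, the following sketch reads rule rows with Python's standard `csv` module and labels each column. The `parse_rules` helper is hypothetical and is not part of g2p; the key names are borrowed from the JSON format so that the two representations line up.

```python
import csv
import io

# Column order as described above: input, output, context before, context after.
FIELDS = ["in", "out", "context_before", "context_after"]


def parse_rules(csv_text):
    """Turn CSV rule rows into dictionaries keyed by column meaning."""
    rules = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        if not 2 <= len(row) <= 4:
            raise ValueError(f"expected 2 to 4 columns, got {len(row)}: {row}")
        # zip stops at the shorter sequence, so optional columns are simply absent.
        rules.append(dict(zip(FIELDS, row)))
    return rules


rules = parse_rules("a,b,c,d\na,e\n")
print(rules)
# [{'in': 'a', 'out': 'b', 'context_before': 'c', 'context_after': 'd'},
#  {'in': 'a', 'out': 'e'}]
```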

JSON

JSON files are written as an array of objects where each object corresponds to a new rule. The following two examples illustrate how the examples from the CSV section above would be written in JSON:

  1. This mapping describes two rules: a -> b and c -> d.

```json
[
  {
    "in": "a",
    "out": "b"
  },
  {
    "in": "c",
    "out": "d"
  }
]
```

  2. This mapping describes two rules: a -> b / c _ d (see footnote 1) and a -> e.

```json
[
  {
    "in": "a",
    "out": "b",
    "context_before": "c",
    "context_after": "d"
  },
  {
    "in": "a",
    "out": "e"
  }
]
```
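
For readers unfamiliar with contextual rules, here is a small sketch of how a rule such as a -> b / c _ d could be expressed with regular-expression lookarounds. The helpers `rule_to_regex` and `apply_rules` are hypothetical and only approximate the idea: the sketch treats contexts as literal strings and applies the rules one after another, which is a simplification, and g2p's actual matching and rule ordering may behave differently.

```python
import re

RULES = [
    {"in": "a", "out": "b", "context_before": "c", "context_after": "d"},
    {"in": "a", "out": "e"},
]


def rule_to_regex(rule):
    """Build a pattern that matches the rule input only in its context."""
    pattern = re.escape(rule["in"])
    before = rule.get("context_before", "")
    after = rule.get("context_after", "")
    if before:
        pattern = f"(?<={re.escape(before)})" + pattern  # lookbehind
    if after:
        pattern += f"(?={re.escape(after)})"  # lookahead
    return re.compile(pattern)


def apply_rules(text, rules):
    """Apply each rule in order to the whole string."""
    for rule in rules:
        # A function replacement avoids backslash interpretation in the output.
        text = rule_to_regex(rule).sub(lambda match, out=rule["out"]: out, text)
    return text


print(apply_rules("cad", RULES))  # cbd
print(apply_rules("bad", RULES))  # bed
```

In `cad` the first rule fires because `a` sits between `c` and `d`; in `bad` the context does not match, so the context-free second rule applies instead.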

Python

You can also write your rules programmatically in Python. For example:

```python
from g2p.mappings import Mapping, Rule
from g2p.transducer import Transducer

# Two rules: a -> b / c _ d, and a -> e everywhere else
mapping = Mapping(rules=[
    Rule(rule_input="a", rule_output="b", context_before="c", context_after="d"),
    Rule(rule_input="a", rule_output="e")
])

transducer = Transducer(mapping)
transducer('cad').output_string  # returns "cbd"
```

CLI

update

If you edit or add new mappings to the g2p.mappings.langs folder, you need to update g2p. You do this by running g2p update

convert

If you want to convert a string on the command line, you can use g2p convert <input_text> <in_lang> <out_lang>

Ex. g2p convert hej dan eng-arpabet would produce HH EH Y

If you have written your own mapping that is not included in the standard g2p library, you can point to its configuration file using the --config flag, as in g2p convert <input_text> <in_lang> <out_lang> --config path/to/config.yml. This will add the mappings defined in your configuration to the existing g2p network, so be careful to avoid namespace errors.

generate-mapping

If your language has a mapping to IPA and you want to generate a mapping between that and the English IPA mapping, you can use g2p generate-mapping <in_lang> --ipa. Remember to run g2p update before so that it has the latest mappings for your language.

Ex. g2p generate-mapping dan --ipa will produce a mapping from dan-ipa to eng-ipa. You must also run g2p update afterwards to update g2p. The resulting mapping will be added to the folder in g2p.mappings.langs.generated

Note: if your language goes through an intermediate representation, e.g., lang -> lang-equiv -> lang-ipa, specify both the <in_lang> and <out_lang> of your final IPA mapping to g2p generate-mapping. E.g., to generate crl-ipa -> eng-ipa, you would run g2p generate-mapping --ipa crl-equiv crl-ipa.

g2p workflow diagram

The interactions between g2p update and g2p generate-mapping are not fully intuitive, so this diagram should help understand what's going on:

Text DB: this is the textual database of g2p conversion rules created by contributors. It consists of these files:

  • g2p/mappings/langs/*/*.csv
  • g2p/mappings/langs/*/*.json
  • g2p/mappings/langs/*/*.yaml

Gen DB: this is the part of the textual database that is generated when running the g2p generate-mapping command:

  • g2p/mappings/langs/generated/*

Compiled DB: this contains the same info as Text DB + Gen DB, but in a format optimized for fast reading by the machine. This is what any program using g2p reads: g2p convert, readalongs align, convertextract, and also g2p generate-mapping. It consists of these files:

  • g2p/mappings/langs/langs.json.gz
  • g2p/mappings/langs/network.json.gz
  • g2p/static/languages-network.json

So, when you write a new g2p mapping for a language, say lll, and you want to be able to convert text from lll to eng-ipa or eng-arpabet, you need to do the following:

  1. Write the mapping from lll to lll-ipa in g2p/mappings/langs/lll/. You've just updated Text DB.
  2. Run g2p update to regenerate Compiled DB from the current Text DB and Gen DB, i.e., to incorporate your new mapping rules.
  3. Run g2p generate-mapping --ipa lll to generate g2p/mappings/langs/generated/lll-ipa_to_eng-ipa.json. This is not based on what you wrote directly, but rather on what's in Compiled DB.
  4. Run g2p update again. g2p generate-mapping updates Gen DB only, so what gets written there will only be reflected in Compiled DB when you run g2p update once more.

Once you have the Compiled DB, it is then possible to use the g2p convert command, create time-aligned audiobooks with readalongs align, or convert files with the convertextract library.
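
The command sequence in the steps above (after the mapping has been written by hand) can be scripted. The following is a minimal sketch, assuming the `g2p` command is on your PATH; the helper names `workflow_commands` and `run_workflow` are hypothetical, and by default the script only prints the commands rather than running them.

```python
import subprocess


def workflow_commands(lang):
    """Commands to run once the mapping for `lang` exists in Text DB."""
    return [
        ["g2p", "update"],                           # Text DB + Gen DB -> Compiled DB
        ["g2p", "generate-mapping", "--ipa", lang],  # Compiled DB -> Gen DB
        ["g2p", "update"],                           # pick up the generated mapping
    ]


def run_workflow(lang, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in workflow_commands(lang):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


run_workflow("lll")
```

Passing `dry_run=False` would execute the three commands in order and raise an error if any of them fails.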

Studio

You can also run the g2p Studio which is a web interface for creating custom lookup tables to be used with g2p. To run the g2p Studio either visit https://g2p-studio.herokuapp.com/ or run it locally with python run_studio.py.

API for Developers

There is also a REST API available for use in your own applications. To launch it from the command-line use python run_studio.py or uvicorn g2p.app:APP. The API documentation will be viewable (with the ability to use it interactively) at http://localhost:5000/api/v1/docs - an OpenAPI definition is also available at http://localhost:5000/api/v1/openapi.json .

Maintainers

@roedoejet. @joanise.

Contributing

Feel free to dive in! Open an issue or submit PRs.

This repo follows the Contributor Covenant Code of Conduct.

Have a look at Contributing.md for help using our standardized formatting conventions and pre-commit hooks.

Adding a new mapping

To add a new mapping, follow these steps:

  1. Determine your language's ISO 639-3 code.
  2. Add a folder with your language's ISO 639-3 code to g2p/mappings/langs
  3. Add a configuration file at g2p/mappings/langs/<yourlangISOcode>/config-g2p.yaml. Here is the basic template for a configuration:

```yaml
<<: &shared
  language_name: <This is the actual name of the language>
mappings:
  - display_name: This is a description of the mapping
    in_lang: This is your language's ISO 639-3 code
    out_lang: This is the output of the mapping
    type: mapping
    authors:
      - <YourNameHere>
    rules_path: <FilenameOfMapping>
    <<: *shared
```

  4. Add a mapping file. Look at the other mappings for examples, or visit the g2p studio to practise your mappings. Mappings are defined in either a CSV or JSON file. See writing mapping files for more info.
  5. Start a development shell with hatch shell (or install an editable version with pip install -e .), then update with g2p update.
  6. Add some tests in g2p/tests/public/data/<YourIsoCode>.psv. Each line in the file will run a test with the following structure: <in_lang>|<out_lang>|<input_string>|<expected_output>
  7. Run python3 run_tests.py langs to make sure your tests pass.
  8. Make sure you have checked all the boxes and make a [pull request](https://github.com/roedoejet/g2p/pulls)!
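
The `.psv` test format described above is simple enough to check before running the test suite. This sketch parses pipe-separated lines into named fields; the `LangTest` type and `parse_psv` helper are hypothetical and are not part of g2p's own test runner. The sample line reuses the Danish example from the Usage section.

```python
from typing import NamedTuple


class LangTest(NamedTuple):
    in_lang: str
    out_lang: str
    input_string: str
    expected_output: str


def parse_psv(text):
    """Parse <in_lang>|<out_lang>|<input_string>|<expected_output> lines."""
    cases = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("|")
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 fields, got {len(fields)}")
        cases.append(LangTest(*fields))
    return cases


cases = parse_psv("dan|eng-arpabet|hej|HH EH Y\n")
print(cases[0].expected_output)  # HH EH Y
```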

Adding a new language for support with ReadAlongs

This repo is used extensively by ReadAlongs. In order to make your language supported by ReadAlongs, you must add a mapping from your language's orthography to IPA. So, for example, to add Danish (ISO 639-3: dan), the steps above must be followed. The in_lang for the mapping must be dan and the out_lang must be suffixed with 'ipa' as in dan-ipa. The following is the proper configuration:

```yaml
mappings:
  - display_name: Danish to IPA
    language_name: Danish
    in_lang: dan
    out_lang: dan-ipa
    type: mapping
    authors:
      - Aidan Pine
    rules_path: dan_to_ipa.csv
    abbreviations_path: dan_abbs.csv
    rule_ordering: as-written
    case_sensitive: false
    norm_form: 'none'
```

Then, you can generate the mapping between dan-ipa and eng-ipa by running g2p generate-mapping dan --ipa. This will add the mapping to g2p/mappings/langs/generated. Do not edit this file, but feel free to have a look. Then, run g2p update and submit a pull request, and tada! Your language is supported by ReadAlongs as well!

Footnotes

1 If this notation is unfamiliar, have a look at phonological rewrite rules

Contributors

This project exists thanks to all the people who contribute.

Citation

If you use this work in a project of yours and write about it, please cite us using the following:

Aidan Pine, Patrick Littell, Eric Joanis, David Huggins-Daines, Christopher Cox, Fineen Davis, Eddie Antonio Santos, Shankhalika Srikanth, Delasie Torkornoo, and Sabrina Yu. 2022. Gᵢ2Pᵢ Rule-based, index-preserving grapheme-to-phoneme transformations. In Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 52–60, Dublin, Ireland. Association for Computational Linguistics.

Or in BibTeX:

@inproceedings{pine-etal-2022-gi22pi,
    title = "{G}$_i$2{P}$_i$ Rule-based, index-preserving grapheme-to-phoneme transformations",
    author = "Pine, Aidan and Littell, Patrick and Joanis, Eric and Huggins-Daines, David and Cox, Christopher and Davis, Fineen and Antonio Santos, Eddie and Srikanth, Shankhalika and Torkornoo, Delasie and Yu, Sabrina",
    booktitle = "Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.computel-1.7",
    pages = "52--60",
    abstract = "This paper describes the motivation and implementation details for a rule-based, index-preserving grapheme-to-phoneme engine {`}G$_i$2P$_i$' implemented in pure Python and released under the open source MIT license. The engine and interface have been designed to prioritize the developer experience of potential contributors without requiring a high level of programming knowledge. {`}G$_i$2P$_i$' already provides mappings for 30 (mostly Indigenous) languages, and the package is accompanied by a web-based interactive development environment, a RESTful API, and extensive documentation to encourage the addition of more mappings in the future. We also present three downstream applications of {`}G$_i$2P$_i$' and show results of a preliminary evaluation.",
}

License

MIT. See LICENSE for the Copyright and license statements.

Owner

  • Name: Aidan Pine
  • Login: roedoejet
  • Kind: user
  • Company: @nrc-cnrc

Linguist and developer interested in language revitalization.

Citation (CITATION.cff)

cff-version: 1.2.0
message: >-
  If you use this work in a project of yours and write about it, please cite
  our ComputEL 2022 paper using the following citation data.
title: Gi2Pi
url: https://github.com/roedoejet/g2p
preferred-citation:
  type: conference-paper
  title: >-
    G$_i$2P$_i$ Rule-based, index-preserving grapheme-to-phoneme
    transformations
  authors:
    - given-names: Aidan
      family-names: Pine
      email: Aidan.Pine@nrc-cnrc.gc.ca
      affiliation: National Research Council Canada
    - given-names: Patrick
      family-names: Littell
      email: Patrick.Littell@nrc-cnrc.gc.ca
      affiliation: National Research Council Canada
    - given-names: Eric
      family-names: Joanis
      email: Eric.Joanis@nrc-cnrc.gc.ca
      affiliation: National Research Council Canada
    - given-names: David
      family-names: Huggins-Daines
      email: dhdaines@gmail.com
      affiliation: Independent Researcher
    - given-names: Christopher
      family-names: Cox
      email: christopher.cox@carleton.ca
      affiliation: Carleton University
    - given-names: Fineen
      family-names: Davis
      email: fineen.davis@gmail.com
      affiliation: Wiichihitotaak ILR Inc
    - given-names: Eddie
      family-names: Antonio Santos
      email: eddie.santos@ucdconnect.ie
      affiliation: University College Dublin
    - given-names: Shankhalika
      family-names: Srikanth
      email: ssrikanth@uvic.ca
      affiliation: University of Victoria
    - given-names: Delasie
      family-names: Torkornoo
      email: delasie.torkornoo@carleton.ca
      affiliation: Carleton University
    - given-names: Sabrina
      family-names: Yu
      email: sab.yu@mail.utoronto.ca
      affiliation: University of Toronto
  collection-title: >-
    Proceedings of the Fifth Workshop on the Use of Computational Methods in the
    Study of Endangered Languages
  year: 2022
  month: 5
  publisher:
    name: Association for Computational Linguistics
  url: https://aclanthology.org/2022.computel-1.7
  start: 52
  end: 60
  location:
    name: Dublin, Ireland
  abstract: >-
    This paper describes the motivation and implementation details for a
    rule-based, index-preserving grapheme-to-phoneme engine {`}G$_i$2P$_i$'
    implemented in pure Python and released under the open source MIT license.
    The engine and interface have been designed to prioritize the developer
    experience of potential contributors without requiring a high level of
    programming knowledge. {`}G$_i$2P$_i$' already provides mappings for 30
    (mostly Indigenous) languages, and the package is accompanied by a web-based
    interactive development environment, a RESTful API, and extensive
    documentation to encourage the addition of more mappings in the future. We
    also present three downstream applications of {`}G$_i$2P$_i$' and show
    results of a preliminary evaluation.

GitHub Events

Total
  • Create event: 31
  • Issues event: 4
  • Release event: 2
  • Watch event: 38
  • Delete event: 26
  • Issue comment event: 60
  • Push event: 98
  • Pull request review comment event: 17
  • Pull request review event: 32
  • Pull request event: 44
  • Fork event: 5
Last Year
  • Create event: 31
  • Issues event: 4
  • Release event: 2
  • Watch event: 38
  • Delete event: 26
  • Issue comment event: 60
  • Push event: 98
  • Pull request review comment event: 17
  • Pull request review event: 32
  • Pull request event: 44
  • Fork event: 5

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 1,384
  • Total Committers: 26
  • Avg Commits per committer: 53.231
  • Development Distribution Score (DDS): 0.711
Past Year
  • Commits: 206
  • Committers: 9
  • Avg Commits per committer: 22.889
  • Development Distribution Score (DDS): 0.558
Top Committers
Name Email Commits
roedoejet a****e@s****a 400
Eric Joanis E****s@c****a 321
Eric Joanis e****s@n****a 143
David Huggins-Daines d****d@e****a 115
shankhalika-katya 5****a 72
Davis f****3@u****a 63
Aidan Pine h****o@a****a 49
David Huggins-Daines d****s@g****m 48
Bradley Ellert b****t@c****a 29
shankhalika-katya s****1@g****m 29
Eddie Antonio Santos e****s@u****a 26
saltyseadawg s****7@g****m 23
Del d****o@y****m 20
Eric Joanis E****s@n****a 12
Patrick Littell P****l@n****a 10
Eddie Antonio Santos e****s@u****e 4
Littell P****l@d****a 4
dependabot[bot] 4****] 4
Olivia Chen o****n@O****t 3
Delasie Torkornoo d****o@c****a 2
Olivia Chen o****n@O****l 2
Akwiratékha' Martin t****u@h****m 1
CarpenterSaw c****r@g****m 1
Dante d****n@g****m 1
Davis F****s@d****a 1
MENGZHEGENG g****k@g****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 82
  • Total pull requests: 239
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 9 days
  • Total issue authors: 7
  • Total pull request authors: 12
  • Average comments per issue: 1.26
  • Average comments per pull request: 2.67
  • Merged pull requests: 203
  • Bot issues: 0
  • Bot pull requests: 14
Past Year
  • Issues: 8
  • Pull requests: 39
  • Average time to close issues: 17 days
  • Average time to close pull requests: 4 days
  • Issue authors: 4
  • Pull request authors: 7
  • Average comments per issue: 1.88
  • Average comments per pull request: 1.97
  • Merged pull requests: 31
  • Bot issues: 0
  • Bot pull requests: 7
Top Authors
Issue Authors
  • joanise (49)
  • roedoejet (19)
  • dhdaines (6)
  • MENGZHEGENG (2)
  • marctessier (2)
  • goodzack (1)
  • adelgiudice (1)
Pull Request Authors
  • joanise (164)
  • dhdaines (42)
  • roedoejet (23)
  • dependabot[bot] (17)
  • littell (4)
  • MENGZHEGENG (4)
  • deltork (3)
  • eddieantonio (2)
  • goodzack (1)
  • dacerron (1)
  • kubicra (1)
  • soghomon-b (1)
Top Labels
Issue Labels
bug (4) enhancement (2)
Pull Request Labels
dependencies (19) python (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 25,831 last-month
  • Total docker downloads: 414
  • Total dependent packages: 5
  • Total dependent repositories: 12
  • Total versions: 35
  • Total maintainers: 2
pypi.org: g2p

Module for creating context-aware, rule-based G2P mappings that preserve indices

  • Versions: 35
  • Dependent Packages: 5
  • Dependent Repositories: 12
  • Downloads: 25,831 Last month
  • Docker Downloads: 414
Rankings
Dependent packages count: 1.9%
Docker downloads count: 3.9%
Dependent repos count: 4.2%
Downloads: 5.0%
Average: 5.2%
Stargazers count: 7.7%
Forks count: 8.5%
Maintainers (2)
Last synced: 6 months ago

Dependencies

.github/workflows/codeql.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/matrix-tests.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/pythonpublish.yml actions
  • actions/checkout v3 composite
  • actions/create-release v1 composite
  • actions/setup-python v4 composite
  • mathieudutour/github-tag-action v5.5 composite
.github/workflows/tests.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • codecov/codecov-action v3 composite
Dockerfile docker
  • debian bullseye-slim build
.github/workflows/docs.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
requirements.txt pypi
pyproject.toml pypi
  • click >=8.0.4
  • coloredlogs >=15.0.1
  • networkx >=2.6
  • openpyxl *
  • panphon >=0.19
  • pydantic >=2.3
  • pyyaml >=5.2
  • regex *
  • text_unidecode *
  • tqdm *