https://github.com/ai4bharat/indic-punct

Including support for alphabets and different spoken form of words

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references
  Found 3 DOI reference(s) in README
✓ Academic publication links
  Links to: arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity
  Low similarity (6.8%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: AI4Bharat
License: mit
Language: Python
Default Branch: main
Size: 428 KB

Statistics

Stars: 1
Watchers: 0
Forks: 4
Open Issues: 0
Releases: 0

Fork of Open-Speech-EkStep/indic-punct

Created over 3 years ago · Last pushed over 2 years ago

Metadata Files

Readme License

Indic Punct Library

About

Inverse text normalization (ITN) is part of the automatic speech recognition (ASR) post-processing pipeline. ITN converts the raw spoken-form output of the ASR model into its written form to improve readability. Our ITN pipeline currently handles only numbers. We have developed and open-sourced WFST (weighted finite-state transducer) based ITN support for 11 Indic languages: Hindi, Gujarati, Telugu, Marathi, Punjabi, Tamil, Bengali, Malayalam, Odia, Assamese, and Kannada, using NVIDIA's NeMo toolkit.
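To make the spoken-to-written conversion concrete, here is a minimal pure-Python sketch for English number words. It is only an illustration of the task: it does not use WFSTs or NeMo, and the names `toy_itn` and `words_to_number` are hypothetical, not part of this library.

```python
# Toy ITN for English number words. Illustrative only; the library itself
# uses WFST grammars, which this sketch does not attempt to reproduce.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "lakh": 100_000,
          "million": 1_000_000, "crore": 10_000_000}
NUMBER_WORDS = set(UNITS) | set(TENS) | set(SCALES) | {"hundred"}


def words_to_number(tokens):
    """Convert a run of lowercase number words into an integer."""
    total, current = 0, 0
    for tok in tokens:
        if tok in UNITS:
            current += UNITS[tok]
        elif tok in TENS:
            current += TENS[tok]
        elif tok == "hundred":
            current = max(current, 1) * 100
        else:  # thousand, lakh, million, crore
            total += max(current, 1) * SCALES[tok]
            current = 0
    return total + current


def toy_itn(sentence):
    """Replace each maximal run of number words with its digits."""
    out, run = [], []
    for tok in sentence.split():
        if tok.lower() in NUMBER_WORDS:
            run.append(tok.lower())
            continue
        if run:
            out.append(str(words_to_number(run)))
            run = []
        out.append(tok)
    if run:
        out.append(str(words_to_number(run)))
    return " ".join(out)


print(toy_itn("I have twenty cars"))
```

A rule-based loop like this breaks down quickly on ambiguous words (for example "one" used as a pronoun), which is one reason grammar-based approaches such as WFSTs are used in practice.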

Installation Instructions

    git clone https://github.com/Open-Speech-EkStep/indic-punct.git
    cd indic-punct
    bash install.sh
    python setup.py bdist_wheel
    pip install -e .

Usage

Currently (v 2.0.6) we support the following languages:

Punctuation:
- Hindi ('hi')
- English ('en')
- Gujarati ('gu')
- Telugu ('te')
- Marathi ('mr')
- Kannada ('kn')
- Punjabi ('pa')
- Tamil ('ta')
- Bengali ('bn')
- Odia ('or')
- Malayalam ('ml')
- Assamese ('as')

Inverse Text Normalization:
- Hindi
- English
- Gujarati
- Telugu
- Marathi
- Punjabi
- Tamil
- Bengali
- Malayalam
- Odia
- Assamese
- Kannada

We are planning to add other Indic languages.

Punctuation

```buildoutcfg
from punctuate.punctuate_text import Punctuation

hindi = Punctuation('hi')  # loads model in memory
english = Punctuation('en')
gujarati = Punctuation('gu')
telugu = Punctuation('te')
marathi = Punctuation('mr')
kannada = Punctuation('kn')
punjabi = Punctuation('pa')
tamil = Punctuation('ta')
bengali = Punctuation('bn')
odia = Punctuation('or')
malayalam = Punctuation('ml')
assamese = Punctuation('as')

hindi.punctuate_text(["इस श्रेणी में केवल निम्नलिखित उपश्रेणी है", "मेहुल को भारत को सौंप दिया जाए"])
english.punctuate_text(['how are you', 'great how about you'])
gujarati.punctuate_text(['નમસ્તે તમે કેમ છો', 'મારે કામે જવુ જ પડશે'])
telugu.punctuate_text(['రోహిత్ శర్మ విరాట్ కోహ్లీ రాహుల్ మరియు మహమ్మద్ షమీ భారతదేశం కోసం ఆడతారు'])
marathi.punctuate_text(['पण रामायण हिंदुत्व किंवा आजच्या भारतापुरते मर्यादित नाही तर इंडोनेशिया मलेशिया थायलंड कंबोडिया फिलिपिन्स व्हिएतनाम इत्यादींमध्येही प्रचलित आहे'])
kannada.punctuate_text(['ಬಿಜೆಪಿ ಕಾಂಗ್ರೆಸ್ ಮತ್ತು ಜನತಾದಳವು ಪ್ರತಿಷ್ಠಿತ ಸ್ಥಾನಗಳನ್ನು ಗಳಿಸಲು ಎಲ್ಲಾ ಹಂತಗಳನ್ನು ಹಿಂತೆಗೆದುಕೊಳ್ಳುತ್ತಿವೆ'])
punjabi.punctuate_text(['ਸਰੀਰ ਵਿੱਚ ਕੈਲਸ਼ੀਅਮ ਜ਼ਿੰਕ ਆਇਰਨ ਆਦਿ ਪੌਸ਼ਟਿਕ ਤੱਤਾਂ ਦੀ ਕਮੀ ਹੁੰਦੀ ਹੈ'])
tamil.punctuate_text(['உங்கள் பெயர் என்ன'])
bengali.punctuate_text(['যে কুড়ুলটা দিয়ে এই ধ্বংসলীলা হয়েছিল সেটিকে নিয়ে কী করা উচিত'])
odia.punctuate_text(['ମୋର ଅନେକ କଲମ ପେନ୍ସିଲ୍ ନୋଟବୁକ୍ ବହି ଏବଂ ଟେବୁଲ୍ ଅଛି', 'ଭାରତର ରାଜଧାନୀ କ’ଣ'])
malayalam.punctuate_text(['നിങ്ങൾ എവിടെ താമസിക്കുന്നു', 'ഇന്ന് ഒരു നല്ല ദിവസമാണ്'])
assamese.punctuate_text(['তোমাৰ ভাল নে'])

----Outputs----
['इस श्रेणी में केवल निम्नलिखित उपश्रेणी है। ', 'मेहुल को भारत को सौंप दिया जाए। ']
['How are you?', 'Great, how about you?']
['નમસ્તે તમે કેમ છો? ', 'મારે કામે જવુ જ પડશે। ']
['రోహిత్ శర్మ, విరాట్ కోహ్లీ, రాహుల్ మరియు మహమ్మద్ షమీ భారతదేశం కోసం ఆడతారు.']
['पण रामायण हिंदुत्व किंवा आजच्या भारतापुरते मर्यादित नाही तर इंडोनेशिया, मलेशिया, थायलंड, कंबोडिया, फिलिपिन्स, व्हिएतनाम इत्यादींमध्येही प्रचलित आहे.']
['ಬಿಜೆಪಿ, ಕಾಂಗ್ರೆಸ್ ಮತ್ತು ಜನತಾದಳವು ಪ್ರತಿಷ್ಠಿತ ಸ್ಥಾನಗಳನ್ನು ಗಳಿಸಲು ಎಲ್ಲಾ ಹಂತಗಳನ್ನು ಹಿಂತೆಗೆದುಕೊಳ್ಳುತ್ತಿವೆ.']
['ਸਰੀਰ ਵਿੱਚ ਕੈਲਸ਼ੀਅਮ ਜ਼ਿੰਕ, ਆਇਰਨ ਆਦਿ ਪੌਸ਼ਟਿਕ ਤੱਤਾਂ ਦੀ ਕਮੀ ਹੁੰਦੀ ਹੈ।']
['உங்கள் பெயர் என்ன? ']
['যে কুড়ুলটা দিয়ে এই ধ্বংসলীলা হয়েছিল, সেটিকে নিয়ে কী করা উচিত?']
['ମୋର ଅନେକ କଲମ ପେନ୍ସିଲ୍, ନୋଟବୁକ୍ ବହି ଏବଂ ଟେବୁଲ୍ ଅଛି।','ଭାରତର ରାଜଧାନୀ କ’ଣ?']
['നിങ്ങൾ എവിടെ താമസിക്കുന്നു? ', 'ഇന്ന് ഒരു നല്ല ദിവസമാണ്. ']
['তোমাৰ ভাল নে? ']
```
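Since each `Punctuation(...)` constructor loads a model into memory, an application that serves several languages may want to load each model only on first use. The following is a minimal sketch of such a cache. `ModelCache` is a hypothetical helper, not part of this library, and it accepts any factory callable so it can be tried without the models installed.

```python
from typing import Callable, Dict


class ModelCache:
    """Create each language's model at most once and reuse it afterwards."""

    def __init__(self, factory: Callable[[str], object]):
        self._factory = factory            # e.g. the Punctuation class
        self._models: Dict[str, object] = {}

    def get(self, lang: str):
        # Build the model only the first time a language is requested.
        if lang not in self._models:
            self._models[lang] = self._factory(lang)
        return self._models[lang]


# Stand-in factory so the sketch runs without the library installed.
load_count = {"n": 0}


def fake_factory(lang: str):
    load_count["n"] += 1
    return f"model-for-{lang}"


cache = ModelCache(fake_factory)
cache.get("hi")
cache.get("hi")   # served from the cache, no second load
cache.get("en")
print(load_count["n"])
```

With the library installed, one would presumably pass the `Punctuation` class itself as the factory, so that `cache.get('hi')` returns the loaded Hindi model; that usage is an assumption based on the constructor calls shown above.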

Inverse Text Normalization

```buildoutcfg
from inverse_text_normalization.run_predict import inverse_normalize_text

inverse_normalize_text(['I have twenty cars', 'The army had four thousand six hundred forty six horses'], lang='en')
inverse_normalize_text(['दस लाख एक हज़ार चार सौ बीस', 'चार करोड़ चार लाख'], lang='hi')
inverse_normalize_text(['મારી પાસે ત્રણ બિલાડીઓ છે', 'ચાર કરોડ ચાર લાખ', 'તેને એક હજાર ચારસો ચાર રૂપિયા આપો'], lang='gu')
inverse_normalize_text(['ఏడు లక్షల నాలుగు వేల తొమ్మిది వందల యాభై ఒకటి', 'నేను ఏడు వందల పదమూడు సినిమాలు చూశాను'], lang='te')
inverse_normalize_text(['रीटाकडे नऊशे वीस मांजरी आहेत','बत्तीस कोटी एकवीस लाख सदतीस हजार चारशे बारा'], lang='mr')
inverse_normalize_text(['ਬਾਰਾਂ ਲੱਖ ਵੀਹ ਹਜਾਰ ਸੱਤ ਸੌ ਪੰਦਰਾਂ','ਮੇਰੇ ਕੋਲ ਦਸ ਰੁਪਏ ਹਨ'], lang='pa')
inverse_normalize_text(['ஒன்று நூறு முப்பத்து ஒன்பது படங்கள் பார்த்திருக்கிறேன்','தொண்ணூற்றிநான்கு கோடி ஐந்து இலட்சம் முந்நூறு இருபத்து இரண்டு'], lang='ta')
inverse_normalize_text(['আমার পাঁচটি কলম আছে', 'তিনি দুইশত সাতটি সিনেমা দেখেছেন'], lang='bn')
inverse_normalize_text(['ഇരുനൂറ്റി അമ്പത് രൂപ ഞാൻ അവന് കൊടുത്തു', 'അവൻ എനിക്ക് പത്ത് യൂറോ തന്നു'], lang='ml')
inverse_normalize_text(['ମୋ ହାତରେ ପାଞ୍ଚ ଡଲାର ଅଛି', 'ମୋ ହାତରେ ପାଞ୍ଚ ଶହ ଟଙ୍କା ଅଛି', 'ମୋ ହାତରେ ସାତ ଶହ ୟୁରୋ ଅଛି'], lang='or')
inverse_normalize_text(['মই 10 বাকচ মিঠাই বিতৰণ কৰিলো', 'নিৰান্নব্বৈটা কোটি পাঁচ লাখ আঠশ বাইছ'], lang='as')
inverse_normalize_text(['ನನ್ನ ಕೈಯಲ್ಲಿ ಐದು ಡಾಲರ್ ಇದೆ', 'ನನ್ನ ಬ್ಯಾಗ್ ನಲ್ಲಿ ಐದು ನೂರು ರೂಪಾಯಿ ಪೆನ್ನಿದೆ', 'ನನ್ನ ಖಾತೆಯಲ್ಲಿ ಐದು ಕೋಟಿ ಯೂರೋ ಇದೆ'], lang='kn')

----Outputs----
['I have 20 cars', 'The army had 4646 horses']
['10,01,420', '4,04,00,000']
['મારી પાસે 3 બિલાડીઓ છે', '4,04,00,000', 'તેને ₹ 1,404 આપો']
['7,04,951', 'నేను 713 సినిమాలు చూశాను']
['रीटाकडे 920 मांजरी आहेत','32,21,37,412']
['12,20,715', 'ਮੇਰੇ ਕੋਲ ₹ 10 ਹਨ']
['139 படங்கள் பார்த்திருக்கிறேன்', '94,05,00,322']
['আমার 5 কলম আছে', 'তিনি 207 সিনেমা দেখেছেন']
['₹ 250 ഞാൻ അവന് കൊടുത്തു', 'അവൻ എനിക്ക് € 10 തന്നു']
['ମୋ ହାତରେ $ 5 ଅଛି', 'ମୋ ହାତରେ ₹ 500 ଅଛି', 'ମୋ ହାତରେ € 700 ଅଛି']
['মই 10 বাকচ মিঠাই বিতৰণ কৰিলো', '99,05,00,822']
['ನನ್ನ ಕೈಯಲ್ಲಿ $ 5 ಇದೆ', 'ನನ್ನ ಬ್ಯಾಗ್ ನಲ್ಲಿ ₹ 500 ಪೆನ್ನಿದೆ', 'ನನ್ನ ಖಾತೆಯಲ್ಲಿ € 5,00,00,000 ಇದೆ']
```
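The Indic outputs above group digits in the Indian numbering style: the last three digits form one group and the remaining digits are grouped in pairs (for example `10,01,420`). A small sketch of that grouping rule follows. `group_indian` is a hypothetical helper written for illustration and is not taken from the library.

```python
def group_indian(n: int) -> str:
    """Format an integer with Indian-style digit grouping (3, then 2, 2, ...)."""
    digits = str(abs(n))
    if len(digits) <= 3:
        grouped = digits
    else:
        head, tail = digits[:-3], digits[-3:]
        pairs = []
        # Peel two-digit groups off the right end of the remaining digits.
        while len(head) > 2:
            pairs.insert(0, head[-2:])
            head = head[:-2]
        pairs.insert(0, head)
        grouped = ",".join(pairs + [tail])
    return ("-" if n < 0 else "") + grouped


print(group_indian(1001420))    # 10,01,420
print(group_indian(40400000))   # 4,04,00,000
```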

Citation

```
@misc{https://doi.org/10.48550/arxiv.2203.16825,
  doi = {10.48550/ARXIV.2203.16825},
  url = {https://arxiv.org/abs/2203.16825},
  author = {Gupta, Anirudh and Chhimwal, Neeraj and Dhuriya, Ankur and Gaur, Rishabh and Shah, Priyanshi and Chadha, Harveen Singh and Raghavan, Vivek},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages},
  publisher = {arXiv},
  year = {2022},
}
```

Owner

Name: AI4Bhārat
Login: AI4Bharat
Kind: organization
Email: opensource@ai4bharat.org
Location: India

Website: https://ai4bharat.org
Twitter: AI4Bharat
Repositories: 37
Profile: https://github.com/AI4Bharat

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

GitHub Events

Total

Watch event: 2
Pull request event: 2
Fork event: 1

Last Year

Watch event: 2
Pull request event: 2
Fork event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 0
Total pull requests: 5
Average time to close issues: N/A
Average time to close pull requests: about 3 hours
Total issue authors: 0
Total pull request authors: 3
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: 1 minute
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Pull Request Authors

janaab11 (2)
ryback123 (2)
safikhanSoofiyani (1)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

setup.py pypi

certifi ==2020.12.5
indic-nlp-library ==0.81
inflect ==5.3.0
numpy ==1.20.2
pandas ==1.2.4
python-dateutil ==2.8.1
pytz ==2021.1
scipy ==1.5.4
sentencepiece ==0.1.94
six ==1.15.0
tokenizers ==0.9.4
torch ==1.7.1
torchvision ==0.8.2
tqdm ==4.60.0
transformers ==4.0.1
wget *
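Because most of these dependencies are pinned to exact versions, an environment that already has newer packages installed may not match. A minimal sketch for comparing pins against what is installed, using the standard-library `importlib.metadata` (Python 3.8+), is shown below. `find_mismatches` and the `PINS` subset are illustrative assumptions, not part of the project.

```python
from importlib import metadata

# A subset of the pins listed above, for illustration.
PINS = {
    "numpy": "1.20.2",
    "torch": "1.7.1",
    "transformers": "4.0.1",
    "tokenizers": "0.9.4",
}


def find_mismatches(pins, lookup=metadata.version):
    """Return {package: (pinned, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for pkg, (wanted, found) in find_mismatches(PINS).items():
    print(f"{pkg}: pinned {wanted}, installed {found}")
```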
