https://github.com/ai4bharat/indic-punct

Includes support for alphabets and different spoken forms of words

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.8%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: AI4Bharat
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 428 KB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 4
  • Open Issues: 0
  • Releases: 0
Fork of Open-Speech-EkStep/indic-punct
Created over 3 years ago · Last pushed over 2 years ago
Metadata Files
  • Readme
  • License

README.md

Indic Punct Library

About

Inverse text normalization (ITN) is part of the automatic speech recognition (ASR) post-processing pipeline. ITN converts the raw spoken-form output of an ASR model into its written form to improve readability. Our ITN pipeline currently handles only numbers. We have developed and open-sourced WFST (weighted finite-state transducer) based ITN support for 11 Indic languages (Hindi, Gujarati, Telugu, Marathi, Punjabi, Tamil, Bengali, Malayalam, Odia, Assamese, and Kannada) using NVIDIA's NeMo toolkit.
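To make the spoken-to-written conversion concrete, here is a minimal sketch of ITN for English cardinal numbers. It is a plain rule-based parser written for illustration only; it is not the WFST grammar this library uses, and the names `words_to_number` and `toy_itn` are hypothetical.

```python
# Toy illustration of inverse text normalization for English cardinals.
# Plain rule-based parsing, NOT the WFST approach used by the library.

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}
NUMBER_WORDS = set(UNITS) | set(TENS) | set(SCALES) | {"hundred"}


def words_to_number(words):
    """Convert a list of lowercase English number words to an int."""
    total, current = 0, 0
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            # A bare "hundred" is read as "one hundred".
            current = max(current, 1) * 100
        else:
            # Scale word such as "thousand": close the current group.
            total += max(current, 1) * SCALES[word]
            current = 0
    return total + current


def toy_itn(sentence):
    """Replace each run of number words in `sentence` with digits."""
    out, run = [], []
    for token in sentence.split():
        if token.lower() in NUMBER_WORDS:
            run.append(token.lower())
            continue
        if run:
            out.append(str(words_to_number(run)))
            run = []
        out.append(token)
    if run:
        out.append(str(words_to_number(run)))
    return " ".join(out)


print(toy_itn("I have twenty cars"))
print(toy_itn("The army had four thousand six hundred forty six horses"))
```

A real ITN system also has to handle ambiguity (for example, "one" used as a pronoun), which is one reason weighted grammars are a common choice over a greedy parser like this one.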

Installation Instructions

    git clone https://github.com/Open-Speech-EkStep/indic-punct.git
    cd indic-punct
    bash install.sh
    python setup.py bdist_wheel
    pip install -e .

Usage

Currently (v2.0.6), we support the following languages:

  • Punctuation:
    • Hindi ('hi')
    • English ('en')
    • Gujarati ('gu')
    • Telugu ('te')
    • Marathi ('mr')
    • Kannada ('kn')
    • Punjabi ('pa')
    • Tamil ('ta')
    • Bengali ('bn')
    • Odia ('or')
    • Malayalam ('ml')
    • Assamese ('as')

  • Inverse Text Normalization:
    • Hindi
    • English
    • Gujarati
    • Telugu
    • Marathi
    • Punjabi
    • Tamil
    • Bengali
    • Malayalam
    • Odia
    • Assamese
    • Kannada

We are planning to add other Indic languages.
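Since both features take a short language code, it can help to validate the code before loading anything. The sketch below is one way to do that; the `SUPPORTED_LANGUAGES` table is built from the codes that appear in this README, and `require_supported` is a hypothetical helper, not part of the library.

```python
# Language codes as they appear in this README (v2.0.6).
SUPPORTED_LANGUAGES = {
    "hi": "Hindi", "en": "English", "gu": "Gujarati", "te": "Telugu",
    "mr": "Marathi", "kn": "Kannada", "pa": "Punjabi", "ta": "Tamil",
    "bn": "Bengali", "or": "Odia", "ml": "Malayalam", "as": "Assamese",
}


def require_supported(lang):
    """Return the language name for `lang`, or raise ValueError.

    Hypothetical helper for illustration; the library does not ship it.
    """
    try:
        return SUPPORTED_LANGUAGES[lang]
    except KeyError:
        known = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(
            f"Unsupported language code {lang!r}; expected one of: {known}"
        ) from None


print(require_supported("ta"))
```

Failing early with a clear message is usually cheaper than discovering an unsupported code after a model download or load has started.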

Punctuation

```buildoutcfg
from punctuate.punctuate_text import Punctuation

hindi = Punctuation('hi')  # loads model in memory
english = Punctuation('en')
gujarati = Punctuation('gu')
telugu = Punctuation('te')
marathi = Punctuation('mr')
kannada = Punctuation('kn')
punjabi = Punctuation('pa')
tamil = Punctuation('ta')
bengali = Punctuation('bn')
odia = Punctuation('or')
malayalam = Punctuation('ml')
assamese = Punctuation('as')

hindi.punctuate_text(["इस श्रेणी में केवल निम्नलिखित उपश्रेणी है", "मेहुल को भारत को सौंप दिया जाए"])
english.punctuate_text(['how are you', 'great how about you'])
gujarati.punctuate_text(['નમસ્તે તમે કેમ છો', 'મારે કામે જવુ જ પડશે'])
telugu.punctuate_text(['రోహిత్ శర్మ విరాట్ కోహ్లీ రాహుల్ మరియు మహమ్మద్ షమీ భారతదేశం కోసం ఆడతారు'])
marathi.punctuate_text(['पण रामायण हिंदुत्व किंवा आजच्या भारतापुरते मर्यादित नाही तर इंडोनेशिया मलेशिया थायलंड कंबोडिया फिलिपिन्स व्हिएतनाम इत्यादींमध्येही प्रचलित आहे'])
kannada.punctuate_text(['ಬಿಜೆಪಿ ಕಾಂಗ್ರೆಸ್ ಮತ್ತು ಜನತಾದಳವು ಪ್ರತಿಷ್ಠಿತ ಸ್ಥಾನಗಳನ್ನು ಗಳಿಸಲು ಎಲ್ಲಾ ಹಂತಗಳನ್ನು ಹಿಂತೆಗೆದುಕೊಳ್ಳುತ್ತಿವೆ'])
punjabi.punctuate_text(['ਸਰੀਰ ਵਿੱਚ ਕੈਲਸ਼ੀਅਮ ਜ਼ਿੰਕ ਆਇਰਨ ਆਦਿ ਪੌਸ਼ਟਿਕ ਤੱਤਾਂ ਦੀ ਕਮੀ ਹੁੰਦੀ ਹੈ'])
tamil.punctuate_text(['உங்கள் பெயர் என்ன'])
bengali.punctuate_text(['যে কুড়ুলটা দিয়ে এই ধ্বংসলীলা হয়েছিল সেটিকে নিয়ে কী করা উচিত'])
odia.punctuate_text(['ମୋର ଅନେକ କଲମ ପେନ୍ସିଲ୍ ନୋଟବୁକ୍ ବହି ଏବଂ ଟେବୁଲ୍ ଅଛି', 'ଭାରତର ରାଜଧାନୀ କ’ଣ'])
malayalam.punctuate_text(['നിങ്ങൾ എവിടെ താമസിക്കുന്നു', 'ഇന്ന് ഒരു നല്ല ദിവസമാണ്'])
assamese.punctuate_text(['তোমাৰ ভাল নে'])

----Outputs----
['इस श्रेणी में केवल निम्नलिखित उपश्रेणी है। ', 'मेहुल को भारत को सौंप दिया जाए। ']
['How are you?', 'Great, how about you?']
['નમસ્તે તમે કેમ છો? ', 'મારે કામે જવુ જ પડશે। ']
['రోహిత్ శర్మ, విరాట్ కోహ్లీ, రాహుల్ మరియు మహమ్మద్ షమీ భారతదేశం కోసం ఆడతారు.']
['पण रामायण हिंदुत्व किंवा आजच्या भारतापुरते मर्यादित नाही तर इंडोनेशिया, मलेशिया, थायलंड, कंबोडिया, फिलिपिन्स, व्हिएतनाम इत्यादींमध्येही प्रचलित आहे.']
['ಬಿಜೆಪಿ, ಕಾಂಗ್ರೆಸ್ ಮತ್ತು ಜನತಾದಳವು ಪ್ರತಿಷ್ಠಿತ ಸ್ಥಾನಗಳನ್ನು ಗಳಿಸಲು ಎಲ್ಲಾ ಹಂತಗಳನ್ನು ಹಿಂತೆಗೆದುಕೊಳ್ಳುತ್ತಿವೆ.']
['ਸਰੀਰ ਵਿੱਚ ਕੈਲਸ਼ੀਅਮ ਜ਼ਿੰਕ, ਆਇਰਨ ਆਦਿ ਪੌਸ਼ਟਿਕ ਤੱਤਾਂ ਦੀ ਕਮੀ ਹੁੰਦੀ ਹੈ।']
['உங்கள் பெயர் என்ன? ']
['যে কুড়ুলটা দিয়ে এই ধ্বংসলীলা হয়েছিল, সেটিকে নিয়ে কী করা উচিত?']
['ମୋର ଅନେକ କଲମ ପେନ୍ସିଲ୍, ନୋଟବୁକ୍ ବହି ଏବଂ ଟେବୁଲ୍ ଅଛି।','ଭାରତର ରାଜଧାନୀ କ’ଣ?']
['നിങ്ങൾ എവിടെ താമസിക്കുന്നു? ', 'ഇന്ന് ഒരു നല്ല ദിവസമാണ്. ']
['তোমাৰ ভাল নে? ']
```
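The comment in the example above notes that constructing `Punctuation` loads a model into memory, so an application that serves several languages may want to build each instance only once. A minimal sketch of that idea follows, assuming construction is the expensive step. `make_cached_loader` and `StandInModel` are hypothetical names; the stand-in class exists only so the sketch runs without the library installed, and in real use the factory would be the `Punctuation` class.

```python
from functools import lru_cache


def make_cached_loader(factory):
    """Return a loader that calls `factory(lang)` at most once per language code."""

    @lru_cache(maxsize=None)
    def load(lang):
        return factory(lang)

    return load


class StandInModel:
    """Placeholder for a real model class; counts how often it is constructed."""

    constructed = 0

    def __init__(self, lang):
        StandInModel.constructed += 1
        self.lang = lang


get_model = make_cached_loader(StandInModel)
first = get_model("hi")
second = get_model("hi")  # served from the cache, no second construction
other = get_model("en")
print(first is second, StandInModel.constructed)
```

Loading lazily in this way also avoids paying the memory cost for languages that are never requested.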

Inverse Text Normalization

```buildoutcfg
from inverse_text_normalization.run_predict import inverse_normalize_text

inverse_normalize_text(['I have twenty cars', 'The army had four thousand six hundred forty six horses'], lang='en')
inverse_normalize_text(['दस लाख एक हज़ार चार सौ बीस', 'चार करोड़ चार लाख'], lang='hi')
inverse_normalize_text(['મારી પાસે ત્રણ બિલાડીઓ છે', 'ચાર કરોડ ચાર લાખ', 'તેને એક હજાર ચારસો ચાર રૂપિયા આપો'], lang='gu')
inverse_normalize_text(['ఏడు లక్షల నాలుగు వేల తొమ్మిది వందల యాభై ఒకటి', 'నేను ఏడు వందల పదమూడు సినిమాలు చూశాను'], lang='te')
inverse_normalize_text(['रीटाकडे नऊशे वीस मांजरी आहेत','बत्तीस कोटी एकवीस लाख सदतीस हजार चारशे बारा'], lang='mr')
inverse_normalize_text(['ਬਾਰਾਂ ਲੱਖ ਵੀਹ ਹਜਾਰ ਸੱਤ ਸੌ ਪੰਦਰਾਂ','ਮੇਰੇ ਕੋਲ ਦਸ ਰੁਪਏ ਹਨ'], lang='pa')
inverse_normalize_text(['ஒன்று நூறு முப்பத்து ஒன்பது படங்கள் பார்த்திருக்கிறேன்','தொண்ணூற்றிநான்கு கோடி ஐந்து இலட்சம் முந்நூறு இருபத்து இரண்டு'], lang='ta')
inverse_normalize_text(['আমার পাঁচটি কলম আছে', 'তিনি দুইশত সাতটি সিনেমা দেখেছেন'], lang='bn')
inverse_normalize_text(['ഇരുനൂറ്റി അമ്പത് രൂപ ഞാൻ അവന് കൊടുത്തു', 'അവൻ എനിക്ക് പത്ത് യൂറോ തന്നു'], lang='ml')
inverse_normalize_text(['ମୋ ହାତରେ ପାଞ୍ଚ ଡଲାର ଅଛି', 'ମୋ ହାତରେ ପାଞ୍ଚ ଶହ ଟଙ୍କା ଅଛି', 'ମୋ ହାତରେ ସାତ ଶହ ୟୁରୋ ଅଛି'], lang='or')
inverse_normalize_text(['মই 10 বাকচ মিঠাই বিতৰণ কৰিলো', 'নিৰান্নব্বৈটা কোটি পাঁচ লাখ আঠশ বাইছ'], lang='as')
inverse_normalize_text(['ನನ್ನ ಕೈಯಲ್ಲಿ ಐದು ಡಾಲರ್ ಇದೆ', 'ನನ್ನ ಬ್ಯಾಗ್ ನಲ್ಲಿ ಐದು ನೂರು ರೂಪಾಯಿ ಪೆನ್ನಿದೆ', 'ನನ್ನ ಖಾತೆಯಲ್ಲಿ ಐದು ಕೋಟಿ ಯೂರೋ ಇದೆ'], lang='kn')

----Outputs----
['I have 20 cars', 'The army had 4646 horses']
['10,01,420', '4,04,00,000']
['મારી પાસે 3 બિલાડીઓ છે', '4,04,00,000', 'તેને ₹ 1,404 આપો']
['7,04,951', 'నేను 713 సినిమాలు చూశాను']
['रीटाकडे 920 मांजरी आहेत','32,21,37,412']
['12,20,715', 'ਮੇਰੇ ਕੋਲ ₹ 10 ਹਨ']
['139 படங்கள் பார்த்திருக்கிறேன்', '94,05,00,322']
['আমার 5 কলম আছে', 'তিনি 207 সিনেমা দেখেছেন']
['₹ 250 ഞാൻ അവന് കൊടുത്തു', 'അവൻ എനിക്ക് € 10 തന്നു']
['ମୋ ହାତରେ $ 5 ଅଛି', 'ମୋ ହାତରେ ₹ 500 ଅଛି', 'ମୋ ହାତରେ € 700 ଅଛି']
['মই 10 বাকচ মিঠাই বিতৰণ কৰিলো', '99,05,00,822']
['ನನ್ನ ಕೈಯಲ್ಲಿ $ 5 ಇದೆ', 'ನನ್ನ ಬ್ಯಾಗ್ ನಲ್ಲಿ ₹ 500 ಪೆನ್ನಿದೆ', 'ನನ್ನ ಖಾತೆಯಲ್ಲಿ € 5,00,00,000 ಇದೆ']
```
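Several of the outputs above (for example `10,01,420` and `4,04,00,000`) use the Indian digit-grouping convention: the last three digits form one group, and the remaining digits are grouped in pairs. The sketch below shows that grouping rule in isolation. `format_indian` is a hypothetical helper written for illustration, not a function exposed by the library.

```python
def format_indian(n):
    """Group a non-negative integer Indian-style: last three digits, then pairs."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two digits at a time off the right end of the remaining prefix.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


for value in (1001420, 40400000, 1404, 920):
    print(value, "->", format_indian(value))
```

For the four values in the loop, this reproduces the groupings `10,01,420`, `4,04,00,000`, `1,404`, and `920` seen in the sample outputs.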

Citation

```
@misc{https://doi.org/10.48550/arxiv.2203.16825,
  doi = {10.48550/ARXIV.2203.16825},
  url = {https://arxiv.org/abs/2203.16825},
  author = {Gupta, Anirudh and Chhimwal, Neeraj and Dhuriya, Ankur and Gaur, Rishabh and Shah, Priyanshi and Chadha, Harveen Singh and Raghavan, Vivek},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```

Owner

  • Name: AI4Bhārat
  • Login: AI4Bharat
  • Kind: organization
  • Email: opensource@ai4bharat.org
  • Location: India

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

GitHub Events

Total
  • Watch event: 2
  • Pull request event: 2
  • Fork event: 1
Last Year
  • Watch event: 2
  • Pull request event: 2
  • Fork event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: about 3 hours
  • Total issue authors: 0
  • Total pull request authors: 3
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • janaab11 (2)
  • ryback123 (2)
  • safikhanSoofiyani (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

setup.py pypi
  • certifi ==2020.12.5
  • indic-nlp-library ==0.81
  • inflect ==5.3.0
  • numpy ==1.20.2
  • pandas ==1.2.4
  • python-dateutil ==2.8.1
  • pytz ==2021.1
  • scipy ==1.5.4
  • sentencepiece ==0.1.94
  • six ==1.15.0
  • tokenizers ==0.9.4
  • torch ==1.7.1
  • torchvision ==0.8.2
  • tqdm ==4.60.0
  • transformers ==4.0.1
  • wget *