https://github.com/ai4bharat/indic-punct
Including support for alphabets and different spoken form of words
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (6.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: AI4Bharat
- License: mit
- Language: Python
- Default Branch: main
- Size: 428 KB
Statistics
- Stars: 1
- Watchers: 0
- Forks: 4
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Indic Punct Library
About
Inverse text normalization (ITN) is part of the Automatic Speech Recognition (ASR) post-processing pipeline. ITN converts the raw spoken-form output of an ASR model into its written form to improve readability. Our ITN pipeline currently handles only numbers. We have developed and open-sourced WFST (weighted finite-state transducer) based ITN support for 11 Indic languages (Hindi, Gujarati, Telugu, Marathi, Punjabi, Tamil, Bengali, Malayalam, Odia, Assamese, and Kannada) using NVIDIA's NeMo toolkit.
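To make the task concrete, here is a minimal sketch of number-only ITN for English. It is a hand-written toy parser, not the WFST grammars this library builds with NeMo; the names `words_to_number` and `toy_itn` are hypothetical and exist only for illustration.

```python
# Toy illustration of number-only ITN for English.
# Assumption: this is NOT how indic-punct works internally (it uses WFSTs);
# it only shows the spoken-form -> written-form mapping on simple cardinals.

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "lakh": 100_000, "crore": 10_000_000}
NUMBER_WORDS = set(UNITS) | set(TENS) | set(SCALES) | {"hundred"}


def words_to_number(tokens):
    """Convert a run of lowercase number words into an int."""
    total, current = 0, 0
    for tok in tokens:
        if tok in UNITS:
            current += UNITS[tok]
        elif tok in TENS:
            current += TENS[tok]
        elif tok == "hundred":
            current = max(current, 1) * 100
        else:  # thousand, lakh, crore: close the current group
            total += max(current, 1) * SCALES[tok]
            current = 0
    return total + current


def toy_itn(sentence):
    """Replace each maximal run of number words with its digit form."""
    out, run = [], []
    for tok in sentence.split():
        if tok.lower() in NUMBER_WORDS:
            run.append(tok.lower())
            continue
        if run:
            out.append(str(words_to_number(run)))
            run = []
        out.append(tok)
    if run:
        out.append(str(words_to_number(run)))
    return " ".join(out)


print(toy_itn("I have twenty cars"))
```

A rule-based toy like this breaks quickly (for example, it would rewrite the "one" in "one of them"), which is one reason production ITN systems rely on weighted grammars instead.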
Installation Instructions
git clone https://github.com/Open-Speech-EkStep/indic-punct.git
cd indic-punct
bash install.sh
python setup.py bdist_wheel
pip install -e .
Usage
Currently (v 2.0.6) we support the following languages:
- Punctuation:
  - Hindi ('hi')
  - English ('en')
  - Gujarati ('gu')
  - Telugu ('te')
  - Marathi ('mr')
  - Kannada ('kn')
  - Punjabi ('pa')
  - Tamil ('ta')
  - Bengali ('bn')
  - Odia ('or')
  - Malayalam ('ml')
  - Assamese ('as')
- Inverse Text Normalization:
  - Hindi
  - English
  - Gujarati
  - Telugu
  - Marathi
  - Punjabi
  - Tamil
  - Bengali
  - Malayalam
  - Odia
  - Assamese
  - Kannada
We are planning to add other Indic languages.
Punctuation
```python
from punctuate.punctuate_text import Punctuation

# Each constructor call loads the model for that language into memory.
hindi = Punctuation('hi')
english = Punctuation('en')
gujarati = Punctuation('gu')
telugu = Punctuation('te')
marathi = Punctuation('mr')
kannada = Punctuation('kn')
punjabi = Punctuation('pa')
tamil = Punctuation('ta')
bengali = Punctuation('bn')
odia = Punctuation('or')
malayalam = Punctuation('ml')
assamese = Punctuation('as')

# punctuate_text takes a list of unpunctuated sentences.
hindi.punctuate_text(["इस श्रेणी में केवल निम्नलिखित उपश्रेणी है", "मेहुल को भारत को सौंप दिया जाए"])
english.punctuate_text(['how are you', 'great how about you'])
gujarati.punctuate_text(['નમસ્તે તમે કેમ છો', 'મારે કામે જવુ જ પડશે'])
telugu.punctuate_text(['రోహిత్ శర్మ విరాట్ కోహ్లీ రాహుల్ మరియు మహమ్మద్ షమీ భారతదేశం కోసం ఆడతారు'])
marathi.punctuate_text(['पण रामायण हिंदुत्व किंवा आजच्या भारतापुरते मर्यादित नाही तर इंडोनेशिया मलेशिया थायलंड कंबोडिया फिलिपिन्स व्हिएतनाम इत्यादींमध्येही प्रचलित आहे'])
kannada.punctuate_text(['ಬಿಜೆಪಿ ಕಾಂಗ್ರೆಸ್ ಮತ್ತು ಜನತಾದಳವು ಪ್ರತಿಷ್ಠಿತ ಸ್ಥಾನಗಳನ್ನು ಗಳಿಸಲು ಎಲ್ಲಾ ಹಂತಗಳನ್ನು ಹಿಂತೆಗೆದುಕೊಳ್ಳುತ್ತಿವೆ'])
punjabi.punctuate_text(['ਸਰੀਰ ਵਿੱਚ ਕੈਲਸ਼ੀਅਮ ਜ਼ਿੰਕ ਆਇਰਨ ਆਦਿ ਪੌਸ਼ਟਿਕ ਤੱਤਾਂ ਦੀ ਕਮੀ ਹੁੰਦੀ ਹੈ'])
tamil.punctuate_text(['உங்கள் பெயர் என்ன'])
bengali.punctuate_text(['যে কুড়ুলটা দিয়ে এই ধ্বংসলীলা হয়েছিল সেটিকে নিয়ে কী করা উচিত'])
odia.punctuate_text(['ମୋର ଅନେକ କଲମ ପେନ୍ସିଲ୍ ନୋଟବୁକ୍ ବହି ଏବଂ ଟେବୁଲ୍ ଅଛି', 'ଭାରତର ରାଜଧାନୀ କ’ଣ'])
malayalam.punctuate_text(['നിങ്ങൾ എവിടെ താമസിക്കുന്നു', 'ഇന്ന് ഒരു നല്ല ദിവസമാണ്'])
assamese.punctuate_text(['তোমাৰ ভাল নে'])

# ----Outputs----
['इस श्रेणी में केवल निम्नलिखित उपश्रेणी है। ', 'मेहुल को भारत को सौंप दिया जाए। ']
['How are you?', 'Great, how about you?']
['નમસ્તે તમે કેમ છો? ', 'મારે કામે જવુ જ પડશે। ']
['రోహిత్ శర్మ, విరాట్ కోహ్లీ, రాహుల్ మరియు మహమ్మద్ షమీ భారతదేశం కోసం ఆడతారు.']
['पण रामायण हिंदुत्व किंवा आजच्या भारतापुरते मर्यादित नाही तर इंडोनेशिया, मलेशिया, थायलंड, कंबोडिया, फिलिपिन्स, व्हिएतनाम इत्यादींमध्येही प्रचलित आहे.']
['ಬಿಜೆಪಿ, ಕಾಂಗ್ರೆಸ್ ಮತ್ತು ಜನತಾದಳವು ಪ್ರತಿಷ್ಠಿತ ಸ್ಥಾನಗಳನ್ನು ಗಳಿಸಲು ಎಲ್ಲಾ ಹಂತಗಳನ್ನು ಹಿಂತೆಗೆದುಕೊಳ್ಳುತ್ತಿವೆ.']
['ਸਰੀਰ ਵਿੱਚ ਕੈਲਸ਼ੀਅਮ ਜ਼ਿੰਕ, ਆਇਰਨ ਆਦਿ ਪੌਸ਼ਟਿਕ ਤੱਤਾਂ ਦੀ ਕਮੀ ਹੁੰਦੀ ਹੈ।']
['உங்கள் பெயர் என்ன? ']
['যে কুড়ুলটা দিয়ে এই ধ্বংসলীলা হয়েছিল, সেটিকে নিয়ে কী করা উচিত?']
['ମୋର ଅନେକ କଲମ ପେନ୍ସିଲ୍, ନୋଟବୁକ୍ ବହି ଏବଂ ଟେବୁଲ୍ ଅଛି।','ଭାରତର ରାଜଧାନୀ କ’ଣ?']
['നിങ്ങൾ എവിടെ താമസിക്കുന്നു? ', 'ഇന്ന് ഒരു നല്ല ദിവസമാണ്. ']
['তোমাৰ ভাল নে? ']
```
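Because each `Punctuation(...)` call loads a model into memory, a service that handles several languages may prefer to load models on first use and reuse them afterwards. The sketch below shows one way to do that; `PunctuatorRegistry` and its `factory` argument are hypothetical names that are not part of this library, and the language codes are the ones listed above.

```python
from typing import Callable, Dict

# Language codes taken from the "Punctuation" list in this README.
PUNCTUATION_LANGS = frozenset(
    {"hi", "en", "gu", "te", "mr", "kn", "pa", "ta", "bn", "or", "ml", "as"}
)


class PunctuatorRegistry:
    """Hypothetical helper: builds one model per language, on first request.

    `factory` is any callable that turns a language code into a model.
    In real use one would presumably pass the library's `Punctuation`
    class; here it is injected so the sketch runs without the library.
    """

    def __init__(self, factory: Callable[[str], object]) -> None:
        self._factory = factory
        self._models: Dict[str, object] = {}

    def get(self, lang: str) -> object:
        if lang not in PUNCTUATION_LANGS:
            raise ValueError(f"unsupported language code: {lang!r}")
        if lang not in self._models:
            # First request for this language: build and remember the model.
            self._models[lang] = self._factory(lang)
        return self._models[lang]


# Stand-in factory so the example is self-contained.
registry = PunctuatorRegistry(lambda lang: f"<model for {lang}>")
print(registry.get("hi"))
```

With the library installed, the stand-in factory could be swapped for `Punctuation`, so that `registry.get('hi').punctuate_text([...])` loads the Hindi model only once.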
Inverse Text Normalization
```python
from inverse_text_normalization.run_predict import inverse_normalize_text

# inverse_normalize_text takes a list of sentences and a language code.
inverse_normalize_text(['I have twenty cars', 'The army had four thousand six hundred forty six horses'], lang='en')
inverse_normalize_text(['दस लाख एक हज़ार चार सौ बीस', 'चार करोड़ चार लाख'], lang='hi')
inverse_normalize_text(['મારી પાસે ત્રણ બિલાડીઓ છે', 'ચાર કરોડ ચાર લાખ', 'તેને એક હજાર ચારસો ચાર રૂપિયા આપો'], lang='gu')
inverse_normalize_text(['ఏడు లక్షల నాలుగు వేల తొమ్మిది వందల యాభై ఒకటి', 'నేను ఏడు వందల పదమూడు సినిమాలు చూశాను'], lang='te')
inverse_normalize_text(['रीटाकडे नऊशे वीस मांजरी आहेत','बत्तीस कोटी एकवीस लाख सदतीस हजार चारशे बारा'], lang='mr')
inverse_normalize_text(['ਬਾਰਾਂ ਲੱਖ ਵੀਹ ਹਜਾਰ ਸੱਤ ਸੌ ਪੰਦਰਾਂ','ਮੇਰੇ ਕੋਲ ਦਸ ਰੁਪਏ ਹਨ'], lang='pa')
inverse_normalize_text(['ஒன்று நூறு முப்பத்து ஒன்பது படங்கள் பார்த்திருக்கிறேன்','தொண்ணூற்றிநான்கு கோடி ஐந்து இலட்சம் முந்நூறு இருபத்து இரண்டு'], lang='ta')
inverse_normalize_text(['আমার পাঁচটি কলম আছে', 'তিনি দুইশত সাতটি সিনেমা দেখেছেন'], lang='bn')
inverse_normalize_text(['ഇരുനൂറ്റി അമ്പത് രൂപ ഞാൻ അവന് കൊടുത്തു', 'അവൻ എനിക്ക് പത്ത് യൂറോ തന്നു'], lang='ml')
inverse_normalize_text(['ମୋ ହାତରେ ପାଞ୍ଚ ଡଲାର ଅଛି', 'ମୋ ହାତରେ ପାଞ୍ଚ ଶହ ଟଙ୍କା ଅଛି', 'ମୋ ହାତରେ ସାତ ଶହ ୟୁରୋ ଅଛି'], lang='or')
inverse_normalize_text(['মই 10 বাকচ মিঠাই বিতৰণ কৰিলো', 'নিৰান্নব্বৈটা কোটি পাঁচ লাখ আঠশ বাইছ'], lang='as')
inverse_normalize_text(['ನನ್ನ ಕೈಯಲ್ಲಿ ಐದು ಡಾಲರ್ ಇದೆ', 'ನನ್ನ ಬ್ಯಾಗ್ ನಲ್ಲಿ ಐದು ನೂರು ರೂಪಾಯಿ ಪೆನ್ನಿದೆ', 'ನನ್ನ ಖಾತೆಯಲ್ಲಿ ಐದು ಕೋಟಿ ಯೂರೋ ಇದೆ'], lang='kn')

# ----Outputs----
['I have 20 cars', 'The army had 4646 horses']
['10,01,420', '4,04,00,000']
['મારી પાસે 3 બિલાડીઓ છે', '4,04,00,000', 'તેને ₹ 1,404 આપો']
['7,04,951', 'నేను 713 సినిమాలు చూశాను']
['रीटाकडे 920 मांजरी आहेत','32,21,37,412']
['12,20,715', 'ਮੇਰੇ ਕੋਲ ₹ 10 ਹਨ']
['139 படங்கள் பார்த்திருக்கிறேன்', '94,05,00,322']
['আমার 5 কলম আছে', 'তিনি 207 সিনেমা দেখেছেন']
['₹ 250 ഞാൻ അവന് കൊടുത്തു', 'അവൻ എനിക്ക് € 10 തന്നു']
['ମୋ ହାତରେ $ 5 ଅଛି', 'ମୋ ହାତରେ ₹ 500 ଅଛି', 'ମୋ ହାତରେ € 700 ଅଛି']
['মই 10 বাকচ মিঠাই বিতৰণ কৰিলো', '99,05,00,822']
['ನನ್ನ ಕೈಯಲ್ಲಿ $ 5 ಇದೆ', 'ನನ್ನ ಬ್ಯಾಗ್ ನಲ್ಲಿ ₹ 500 ಪೆನ್ನಿದೆ', 'ನನ್ನ ಖಾತೆಯಲ್ಲಿ € 5,00,00,000 ಇದೆ']
```
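The Indic-language outputs above group digits in the Indian numbering style (for example `10,01,420` and `4,04,00,000`): the last three digits form one group and the remaining digits are grouped in pairs. The English sample (`4646`) is not grouped. As a rough sketch of that formatting rule, independent of how the library implements it, a hypothetical helper `format_indian` could look like this:

```python
def format_indian(n: int) -> str:
    """Group digits as lakh/crore style: last three digits, then pairs.

    Hypothetical helper for illustration; not part of indic-punct.
    """
    digits = str(abs(n))
    if len(digits) <= 3:
        grouped = digits
    else:
        head, tail = digits[:-3], digits[-3:]
        parts = []
        # Peel two digits at a time off the right end of the head.
        while len(head) > 2:
            parts.insert(0, head[-2:])
            head = head[:-2]
        if head:
            parts.insert(0, head)
        grouped = ",".join(parts + [tail])
    return ("-" if n < 0 else "") + grouped


for value in (1001420, 40400000, 1404, 920):
    print(format_indian(value))
```

The test values are taken from the sample outputs, so the printed strings can be compared directly with them.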
Citation
```
@misc{https://doi.org/10.48550/arxiv.2203.16825,
  doi = {10.48550/ARXIV.2203.16825},
  url = {https://arxiv.org/abs/2203.16825},
  author = {Gupta, Anirudh and Chhimwal, Neeraj and Dhuriya, Ankur and Gaur, Rishabh and Shah, Priyanshi and Chadha, Harveen Singh and Raghavan, Vivek},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
Owner
- Name: AI4Bhārat
- Login: AI4Bharat
- Kind: organization
- Email: opensource@ai4bharat.org
- Location: India
- Website: https://ai4bharat.org
- Twitter: AI4Bharat
- Repositories: 37
- Profile: https://github.com/AI4Bharat
Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!
GitHub Events
Total
- Watch event: 2
- Pull request event: 2
- Fork event: 1
Last Year
- Watch event: 2
- Pull request event: 2
- Fork event: 1
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 5
- Average time to close issues: N/A
- Average time to close pull requests: about 3 hours
- Total issue authors: 0
- Total pull request authors: 3
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- janaab11 (2)
- ryback123 (2)
- safikhanSoofiyani (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- certifi ==2020.12.5
- indic-nlp-library ==0.81
- inflect ==5.3.0
- numpy ==1.20.2
- pandas ==1.2.4
- python-dateutil ==2.8.1
- pytz ==2021.1
- scipy ==1.5.4
- sentencepiece ==0.1.94
- six ==1.15.0
- tokenizers ==0.9.4
- torch ==1.7.1
- torchvision ==0.8.2
- tqdm ==4.60.0
- transformers ==4.0.1
- wget *