https://github.com/cabralpinto/active-learning-syllabification
Language Agnostic Syllabification with Active Learning
Statistics
- Stars: 3
- Watchers: 3
- Forks: 1
- Open Issues: 0
- Releases: 0
README.md
Language Agnostic Syllabification with Active Learning
This repository contains an implementation of a language-agnostic syllabification method based on active learning. Syllabification, the process of splitting a word into syllables, is a key step in speech synthesis and recognition. Active learning reduces the need for large labeled datasets. By adapting the neural network from Krantz et al. (2019) and training it with active learning, we surpassed the accuracy obtained by training on the full Portuguese and Italian datasets while labeling only a small fraction of the data: 384 words (1.4% of the dataset) for Portuguese and 528 words (0.6% of the dataset) for Italian.
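To make the idea of active learning concrete, here is a minimal pool-based sketch in Python. It is not the repository's MATLAB code: the toy 1-D threshold classifier stands in for the neural network, the `oracle` function stands in for a human annotator, and least-confidence sampling is an assumed query strategy that may differ from the one actually used. All names (`train`, `prob_one`, `active_learning`, `oracle`) are hypothetical.

```python
import math

def train(labeled):
    """Toy stand-in for the network: a 1-D threshold between the two classes."""
    largest_zero = max(x for x, y in labeled if y == 0)
    smallest_one = min(x for x, y in labeled if y == 1)
    return (largest_zero + smallest_one) / 2

def prob_one(threshold, x):
    """Model confidence that x belongs to class 1 (logistic around the threshold)."""
    return 1 / (1 + math.exp(threshold - x))

def predict(threshold, x):
    """Hard prediction: class 1 if x lies above the threshold."""
    return int(x > threshold)

def active_learning(pool, oracle, seed_items, budget):
    """Label `budget` extra items, always querying the least confident one."""
    labeled = [(x, oracle(x)) for x in seed_items]
    unlabeled = [x for x in pool if x not in seed_items]
    for _ in range(budget):
        threshold = train(labeled)
        # Least-confidence sampling (assumed strategy): probability closest to 0.5.
        query = min(unlabeled, key=lambda x: abs(prob_one(threshold, x) - 0.5))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))  # ask the annotator for a label
    return train(labeled), labeled

# Hypothetical data: 201 points in [-10, 10]; the true class boundary is at 3.0.
pool = [i / 10 for i in range(-100, 101)]
oracle = lambda x: int(x > 3.0)

threshold, labeled = active_learning(pool, oracle, seed_items=[-10.0, 10.0], budget=8)
accuracy = sum(predict(threshold, x) == oracle(x) for x in pool) / len(pool)
print(f"labeled {len(labeled)} of {len(pool)} items, pool accuracy {accuracy:.3f}")
```

In this toy setting the queries concentrate around the class boundary, which is why a handful of labels is enough. The same loop structure applies when the pool is a list of unlabeled words and the model is a sequence labeler, although the confidence measure would then have to be defined over whole words.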
🚀 Usage
Prerequisites
Before running the project, ensure that you have the following:
- MATLAB R2021a (or a newer version)
- Statistics and Machine Learning Toolbox
- Text Analytics Toolbox
Running the Project
- Clone this repository to your local machine or download the ZIP archive.
- Open MATLAB and navigate to the root directory of the cloned repository.
- Locate the `src` folder and open the `main.m` file.
- Run the `main.m` script to execute the project.
📊 Results
The method reaches high accuracy with very little labeled data. The following results were obtained:
- Porlex v3 (Portuguese dataset): Achieved an accuracy of 96.8% using only 384 words, which corresponds to 1.4% of the original dataset.
- PhonItalia (Italian dataset): Achieved an accuracy of 82.0% using only 528 words, which corresponds to 0.6% of the original dataset.
- Lexique 2 (French dataset): Achieved an accuracy of 95.8% using only 208 words, which is less than 0.01% of the whole dataset.
For both Portuguese and Italian, these results surpass the accuracy achieved by training the network on the entire dataset (95.6% and 81%, respectively).
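The comparison above can be checked with a few lines of arithmetic. The Python sketch below uses only the figures quoted in this section. The dataset sizes it derives are rough estimates implied by the rounded percentages, not numbers reported by the authors, and the dictionary layout is purely illustrative.

```python
# Figures quoted in the Results section (accuracies in percent).
results = {
    "Porlex v3 (Portuguese)": {"active": 96.8, "full": 95.6, "words": 384, "fraction": 0.014},
    "PhonItalia (Italian)": {"active": 82.0, "full": 81.0, "words": 528, "fraction": 0.006},
}

gains = {}
implied_sizes = {}
for name, r in results.items():
    # Accuracy gain of active learning over full-dataset training, in percentage points.
    gains[name] = round(r["active"] - r["full"], 1)
    # Approximate dataset size implied by "words" being "fraction" of the dataset.
    implied_sizes[name] = round(r["words"] / r["fraction"])
    print(f"{name}: +{gains[name]} points, about {implied_sizes[name]} words in the dataset")
```

Under these figures the gains are 1.2 and 1.0 percentage points, and the implied dataset sizes are roughly 27,400 and 88,000 words.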
📜 License
This project is licensed under the MIT License.
🎉 Acknowledgments
We would like to acknowledge the work of Krantz et al. (2019) for providing the neural network architecture used in this project. Their research serves as a foundation for our active learning adaptation.
📬 Contact
If you have any questions, suggestions, or just want to say hello, feel free to email me at jmcabralpinto@gmail.com.
Owner
- Name: João Cabral Pinto
- Login: cabralpinto
- Kind: user
- Repositories: 1
- Profile: https://github.com/cabralpinto