https://github.com/astorfi/punctuation-restoration
A TensorFlow Implementation of Punctuation Restoration.
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.4%) to scientific vocabulary
Repository
A TensorFlow Implementation of Punctuation Restoration.
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of k9luo/Punctuation-Restoration
Created over 5 years ago
· Last pushed over 5 years ago
https://github.com/astorfi/Punctuation-Restoration/blob/main/
Punctuation Restoration
====================================================================

## Requirements

Imagine that you are building software for transcribing speech to text. The speech transcription part works perfectly, but it cannot transcribe punctuation. The task is to train a predictive model that ingests a sequence of text and adds punctuation (period, comma, or question mark) in the appropriate locations. This task is important for all downstream data processing jobs.

**Example input:**

```
this is a string of text with no punctuation this is a new sentence
```

**Example output:**

```
this is a string of text with no punctuation. this is a new sentence.
```

## Solution

My solution is largely based on [Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration](https://www.isca-speech.org/archive/Interspeech_2016/pdfs/1517.PDF). The architecture is defined as follows:

1. Obtain word embeddings from [GloVe](https://nlp.stanford.edu/projects/glove/).
2. The word embeddings are then processed by densely connected [Bi-LSTM](https://arxiv.org/pdf/1303.5778.pdf) layers.
3. These Bi-LSTM layers are followed by an RNN with an attention mechanism and a [conditional random field (CRF)](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers) log-likelihood loss.

The experiments are performed on the IWSLT dataset, which consists of TED Talk transcripts. The detailed analysis can be found in this [notebook](https://github.com/k9luo/Punctuation-Restoration/blob/main/main.ipynb).

## Setup and Installation

First, clone the repo:

```
git clone https://github.com/k9luo/Punctuation-Restoration.git
```

Second, download the pretrained [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings and create a new conda virtual environment with `setup.sh`, or do these steps manually. Note that running `setup.sh` installs the GPU version of TensorFlow:

```
sh setup.sh
```

Third, activate the virtual environment:

```
conda activate restore_punct
```

Fourth, add the new virtual environment to Jupyter Notebook:

```
python -m ipykernel install --user --name=restore_punct
```

## Training and Inference

Please run `python main.py`.
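
To make the task from the Requirements section concrete, punctuation restoration is commonly framed as sequence labelling: each word receives one label naming the punctuation mark that follows it, or a "no punctuation" label. The sketch below illustrates that framing only. It is not taken from this repository, and the helper names (`to_labeled_tokens`, `restore`) and the label names (`PERIOD`, `COMMA`, `QUESTION`, `O`) are assumptions made for illustration; the preprocessing in `main.py` may differ.

```python
# Illustrative sketch, not the repository's code. Label names are assumed;
# the README only says the marks are period, comma, and question mark.
PUNCT_TO_LABEL = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}
LABEL_TO_PUNCT = {label: mark for mark, label in PUNCT_TO_LABEL.items()}


def to_labeled_tokens(text):
    """Turn punctuated text into (word, label) pairs, one label per word."""
    pairs = []
    for token in text.lower().split():
        label = "O"  # "O" means no punctuation follows this word
        # A trailing mark becomes the label of the word it is attached to.
        if token[-1] in PUNCT_TO_LABEL:
            label = PUNCT_TO_LABEL[token[-1]]
            token = token[:-1]
        if token:  # skip tokens that were only a punctuation mark
            pairs.append((token, label))
    return pairs


def restore(words, labels):
    """Re-insert punctuation given one (predicted) label per word."""
    return " ".join(
        word + LABEL_TO_PUNCT.get(label, "") for word, label in zip(words, labels)
    )


pairs = to_labeled_tokens(
    "this is a string of text with no punctuation. this is a new sentence."
)
words = [word for word, _ in pairs]
labels = [label for _, label in pairs]
print(restore(words, labels))
```

In this framing, a trained model would see only `words` and predict `labels`; here the labels come from the reference text, so `restore` simply reproduces the punctuated sentence.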
Owner
- Name: Sina Torfi
- Login: astorfi
- Kind: user
- Location: San Jose
- Company: Meta
- Website: https://astorfi.github.io/
- Repositories: 196
- Profile: https://github.com/astorfi
PhD & Developer working on Deep Learning, Computer Vision & NLP