https://github.com/camel-lab/qalb

Code for "Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models"

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 1 of 2 committers (50.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.2%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: CAMeL-Lab
Language: Python
Default Branch: master
Homepage: https://www.aclweb.org/anthology/papers/D/D18/D18-1097/
Size: 43.6 MB

Statistics

Stars: 0
Watchers: 4
Forks: 4
Open Issues: 0
Releases: 0

Created over 8 years ago · Last pushed over 6 years ago

Metadata Files

Readme

QALB

Code for Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models.

Setup

Python 2, Python 3, and TensorFlow 1.4 are required to run this project. Python 2 is used to run the M2 scorer described under Evaluations.

Training & testing

python -m ai.tests.qalb runs a generic character-level model. python -m ai.tests.char_qalb runs the hybrid character-level + fastText model. To run inference rather than training, pass --decode=path/to/file.txt, which runs the model in inference mode on that text file.

See the individual test scripts for more information on the flags that can be passed to them directly via the terminal.

Evaluations

To compute the F1 score, use python2 ai/tests/m2scripts/m2scorer.py --beta 1 -v $1 $2 where $1 is the system output file and $2 is the .m2 gold file.
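
As a reminder of what the reported score means, here is a minimal sketch of how precision, recall, and F-beta follow from edit counts. It is not the code in m2scorer.py. The function name f_beta and its arguments are hypothetical, and the handling of empty denominators (treated as 1.0) is an assumption made for this illustration.

```python
def f_beta(correct, proposed, gold, beta=1.0):
    """Return (precision, recall, F-beta) from edit counts.

    correct:  number of system edits that match a gold edit
    proposed: number of edits the system proposed
    gold:     number of edits in the gold annotation
    """
    # Assumption: an empty denominator yields a perfect score of 1.0.
    precision = correct / proposed if proposed else 1.0
    recall = correct / gold if gold else 1.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    score = (1 + beta ** 2) * precision * recall / denominator
    return precision, recall, score


# Example: 6 correct edits out of 8 proposed, with 12 gold edits.
precision, recall, f1 = f_beta(6, 8, 12)
print(precision, recall, round(f1, 4))
```

With --beta 1, precision and recall are weighted equally, which gives the usual F1 score.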

To compute the Levenshtein score, use python levenshtein.py $1 $2 where $1 is the output file and $2 is the .gold file.
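
For readers unfamiliar with the metric, the sketch below computes the plain Levenshtein (edit) distance between two strings with the standard dynamic-programming recurrence. It is an illustration only: how levenshtein.py aggregates or normalizes distances across the two files is not described here, and the function name levenshtein is a placeholder.

```python
def levenshtein(source, target):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions that turn source into target."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the target line rather than with the product of both lengths.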

Analysis

python analysis.py can be used to break down any .m2 file into a more human-readable format.

Useful UNIX commands

To remove the document IDs from the *.sent* files, use

cut -d' ' -f2- ai/datasets/data/qalb/FILENAME

This can be piped to give a word count:

# *.sent* file
cut -d' ' -f2- ai/datasets/data/qalb/QALB.train.sent.sbw | awk '{print NF}'
# *.gold* file
cat ai/datasets/data/qalb/QALB.train.gold.sbw | awk '{print NF}'

Or a character count:

cat ai/datasets/data/qalb/QALB.train.gold.sbw | awk '{ print length($0); }'

The total number of usable characters excludes newlines. For example, the total number of characters in a *.sent* file can be computed with

cut -d' ' -f2- ai/datasets/data/qalb/QALB.train.sent.sbw | awk '{ print length($0); }' | awk '{s+=$1} END {print s}'

This matches the result of running cut -d' ' -f2- ai/datasets/data/qalb/QALB.train.sent.sbw | wc and subtracting the number of lines from the number of characters.
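
The same count can be sketched in Python, which may be convenient when the number is needed inside a script. This assumes each *.sent* line starts with a document ID followed by a single space, as the cut command above implies. The helper names strip_doc_id and total_characters are hypothetical and are not part of the repository.

```python
def strip_doc_id(line):
    """Drop the first space-delimited field, like cut -d' ' -f2-."""
    parts = line.rstrip("\n").split(" ", 1)
    # Like cut, keep the whole line when it contains no delimiter.
    return parts[1] if len(parts) == 2 else parts[0]


def total_characters(lines):
    """Sum the line lengths after removing IDs, newlines excluded."""
    return sum(len(strip_doc_id(line)) for line in lines)


sample = ["DOC1 ab cd\n", "DOC2 e\n"]
print(total_characters(sample))

# On a real file (path taken from the commands above):
# with open("ai/datasets/data/qalb/QALB.train.sent.sbw") as handle:
#     print(total_characters(handle))
```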

To obtain a histogram of the character or word counts, pipe the output through sort -n and uniq -c. For instance,

# Characters
cat ai/datasets/data/qalb/QALB.train.gold.sbw | awk '{ print length($0); }' | sort -n | uniq -c
# Words
cat ai/datasets/data/qalb/QALB.train.gold.sbw | awk '{print NF}' | sort -n | uniq -c
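
A rough Python counterpart of the sort and uniq -c histograms is sketched below using collections.Counter. The function name length_histograms is hypothetical. Words are split on runs of whitespace, which is assumed to match awk's default field splitting.

```python
from collections import Counter


def length_histograms(lines):
    """Return (character_histogram, word_histogram), each a list of
    (length, number_of_lines) pairs sorted by length."""
    lines = list(lines)  # allow the input to be a one-shot iterator
    char_hist = Counter(len(line.rstrip("\n")) for line in lines)
    word_hist = Counter(len(line.split()) for line in lines)
    return sorted(char_hist.items()), sorted(word_hist.items())


sample = ["a b\n", "c d\n", "efg\n"]
char_hist, word_hist = length_histograms(sample)
print(char_hist)
print(word_hist)
```

Each pair lists the length first and the number of lines second, whereas uniq -c prints the count first.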

Owner

Name: CAMeL Lab
Login: CAMeL-Lab
Kind: organization
Location: Abu Dhabi, UAE

Website: http://camel-lab.com
Repositories: 22
Profile: https://github.com/CAMeL-Lab

The Computational Approaches to Modeling Language (CAMeL) Lab at New York University Abu Dhabi

Committers

Last synced: over 2 years ago

All Time

Total Commits: 196
Total Committers: 2
Avg Commits per committer: 98.0
Development Distribution Score (DDS): 0.01

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Daniel Watson	w**6@g**m	194
Daniel Watson	d**n@n**u	2

Committer Domains (Top 20 + Academic)

nyu.edu: 1
