https://github.com/camel-lab/qalb

Code for "Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models"

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.2%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
Statistics
  • Stars: 0
  • Watchers: 4
  • Forks: 4
  • Open Issues: 0
  • Releases: 0
Created over 8 years ago · Last pushed over 6 years ago
Metadata Files
Readme

README.md

QALB

Code for Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models.

Setup

Python 2, Python 3, and TensorFlow 1.4 are required to run this project (the M2 scorer in the Evaluations section is invoked with python2).

Training & testing

python -m ai.tests.qalb runs a generic character-level model. python -m ai.tests.char_qalb runs the hybrid character-level + fastText model. To run inference instead of training, pass --decode=path/to/file.txt, which runs the model in inference mode on that text file.

See the individual test scripts for more information on the flags that can be passed to them directly via the terminal.

Evaluations

To compute the F1 score, use python2 ai/tests/m2scripts/m2scorer.py --beta 1 -v $1 $2 where $1 is the system output file and $2 is the .m2 gold file.
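
For reference, the score reported with --beta 1 is the usual harmonic combination of precision and recall over edits. The sketch below shows the standard F-beta formula only; the function name, its arguments, and the convention for empty denominators are assumptions made for illustration and are not taken from m2scorer.py.

```python
def f_beta(correct, proposed, gold, beta=1.0):
    """Standard F-beta from edit counts (hypothetical helper, not part of m2scorer.py).

    correct:  number of proposed edits that match a gold edit
    proposed: number of edits in the system output
    gold:     number of edits in the gold annotation
    """
    # Assumption: an empty denominator counts as a perfect score for that term.
    precision = correct / proposed if proposed else 1.0
    recall = correct / gold if gold else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta set to 1, precision and recall are weighted equally, which is what the --beta 1 flag above requests.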

To compute the Levenshtein score, use python levenshtein.py $1 $2 where $1 is the output file and $2 is the .gold file.
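
As a rough illustration of what a Levenshtein score measures, the following sketch computes the classic edit distance between two strings with unit costs for insertion, deletion, and substitution. It is not the code in levenshtein.py, which may normalize or aggregate the distance differently; the function name here is hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b with unit-cost edits."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Only two rows of the dynamic-programming table are kept, so memory grows with the length of the second sequence rather than with the product of both lengths.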

Analysis

python analysis.py can be used to break down any .m2 file into a more human-readable format.

Useful UNIX commands

To remove the document IDs from the *.sent* files, use

cut -d' ' -f2- ai/datasets/data/qalb/FILENAME

This can be piped to give a word count:

# *.sent* file
cut -d' ' -f2- ai/datasets/data/qalb/QALB.train.sent.sbw | awk '{print NF}'
# *.gold* file
cat ai/datasets/data/qalb/QALB.train.gold.sbw | awk '{print NF}'
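
The same word counts can be reproduced in Python, which may be convenient inside an analysis script. This is a minimal sketch that mirrors the cut and awk commands above; the helper names are hypothetical and do not exist in the repository.

```python
def strip_doc_id(line):
    """Drop the first space-delimited field, like `cut -d' ' -f2-`."""
    head, sep, rest = line.partition(" ")
    # Like cut, a line without the delimiter is passed through unchanged.
    return rest if sep else head


def word_count(line):
    """Number of whitespace-separated fields, like awk's NF."""
    return len(line.split())


def word_counts(lines, has_doc_id=True):
    """Yield one word count per line; set has_doc_id=False for *.gold* files."""
    for line in lines:
        line = line.rstrip("\n")
        if has_doc_id:
            line = strip_doc_id(line)
        yield word_count(line)
```

An open file object can be passed directly as lines, for example `with open(path) as f: counts = list(word_counts(f))`.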

Or a character count:

cat ai/datasets/data/qalb/QALB.train.gold.sbw | awk '{ print length($0); }'

The total number of usable characters doesn't include newlines, so as an example, the number of characters in a *.sent* file can be obtained automatically with

cut -d' ' -f2- ai/datasets/data/qalb/QALB.train.sent.sbw | awk '{ print length($0); }' | awk '{s+=$1} END {print s}'

This matches the result of running cut -d' ' -f2- ai/datasets/data/qalb/QALB.train.sent.sbw | wc and subtracting the number of lines from the number of characters.
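
The relationship between the two counts can be checked with a short sketch. It assumes every line ends with a newline and that the text is ASCII, so that bytes and characters coincide; the function name is hypothetical.

```python
def total_chars(lines):
    """Sum of line lengths, excluding the trailing newline of each line."""
    return sum(len(line.rstrip("\n")) for line in lines)


# Two example lines, each terminated by a newline.
example = ["ab\n", "cde\n"]

# Counting every character and subtracting one newline per line
# gives the same total as summing the newline-free lengths.
with_newlines = sum(len(line) for line in example)
same_total = with_newlines - len(example)
```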

To obtain a histogram of the character counts, simply pipe sort and uniq. For instance,

# Characters
cat ai/datasets/data/qalb/QALB.train.gold.sbw | awk '{ print length($0); }' | sort -n | uniq -c
# Words
cat ai/datasets/data/qalb/QALB.train.gold.sbw | awk '{print NF}' | sort -n | uniq -c
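
A Python counterpart of the sort and uniq step might look like the sketch below. The function name is hypothetical; it accepts any iterable of per-line counts, whether characters or words.

```python
from collections import Counter


def histogram(values):
    """Return (value, count) pairs sorted by value, like `sort -n | uniq -c`.

    Note that uniq -c prints the count first; here the value comes first.
    """
    return sorted(Counter(values).items())
```

For example, `histogram(len(line.rstrip("\n")) for line in open(path))` would give the character-count histogram for a file at path.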

Owner

  • Name: CAMeL Lab
  • Login: CAMeL-Lab
  • Kind: organization
  • Location: Abu Dhabi, UAE

The Computational Approaches to Modeling Language (CAMeL) Lab at New York University Abu Dhabi

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 196
  • Total Committers: 2
  • Avg Commits per committer: 98.0
  • Development Distribution Score (DDS): 0.01
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name           Email            Commits
Daniel Watson  w****6@g****m    194
Daniel Watson  d****n@n****u    2
Committer Domains (Top 20 + Academic)
nyu.edu: 1