mweaswsd
Repo for the paper "MWE as WSD: Solving Multi-Word Expression Identification with Word Sense Disambiguation"
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.3%) to scientific vocabulary
Keywords
Repository
Repo for the paper "MWE as WSD: Solving Multi-Word Expression Identification with Word Sense Disambiguation"
Basic Info
Statistics
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
MWE as WSD
Repo for the paper MWE as WSD: Solving Multiword Expression Identification with Word
Sense Disambiguation.
Installation
```shell
cd MWEasWSD
pip install -e .
```
Data
```shell
cd data && ./get-data.sh
```
While our repository contains all the code necessary to recreate the modified SemCor data
used in our experiments, we also make the fully processed data with synthetic negatives and
our annotations available in the data/augmented directory.
Note that the data has been converted to a JSON format.
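For a quick look at the processed data, here is a minimal sketch (the filename below is assumed from the training suffix used later in this README, and field names depend on what the preprocessing emits, so only the keys are printed):

```python
# Peek at the first record of the augmented SemCor data.
# The path is an assumption; check data/augmented for the actual filename.
import json

path = "data/augmented/semcor.fixed.annotated.autoneg.jsonl"  # assumed path
with open(path) as f:
    first = json.loads(next(f))
print(sorted(first.keys()))  # field names are repo-specific
```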
If you use the SemCor data in any capacity, please cite the original
authors as mentioned here.
Preprocessing
Generate .jsonl files and split data.
```shell
# Convert the data into .jsonl files
python scripts/preprocessing/wsd_xml.py data/WSD_Evaluation_Framework/Training_Corpora/SemCor
cp data/WSD_Evaluation_Framework/Evaluation_Datasets/ALL/ALL.data.xml data/WSD_Evaluation_Framework/Evaluation_Datasets/ALL/all.data.xml
cp data/WSD_Evaluation_Framework/Evaluation_Datasets/ALL/ALL.gold.key.txt data/WSD_Evaluation_Framework/Evaluation_Datasets/ALL/all.gold.key.txt
for d in "data/WSD_Evaluation_Framework/Evaluation_Datasets/"*"/"; do
    python scripts/preprocessing/wsd_xml.py "$d"
done

python scripts/preprocessing/connlu.py --datatype cupt data/parseme_mwe/EN/train.cupt data/cupt_train.jsonl
python scripts/preprocessing/connlu.py --datatype cupt data/parseme_mwe/EN/test.cupt data/cupt_test.jsonl

python scripts/preprocessing/dimsum.py data/dimsum-data/dimsum16.test data/dimsum_test.jsonl
python scripts/preprocessing/dimsum.py data/dimsum-data/dimsum16.train data/dimsum_train.jsonl

# Split the training data and fix the existing MWE annotations in SemCor
python scripts/preprocessing/random_split.py data/cupt_train.jsonl
python scripts/preprocessing/random_split.py data/dimsum_train.jsonl
python scripts/preprocessing/fix_existing_mwes.py data/WSD_Evaluation_Framework/Training_Corpora/SemCor/semcor.jsonl
```
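As a quick sanity check after preprocessing (not part of the repo's scripts), the generated .jsonl files can be counted record by record:

```python
# Count records in each generated .jsonl file (one JSON object per line).
from pathlib import Path

for path in sorted(Path("data").glob("*.jsonl")):
    with path.open() as f:
        print(f"{path.name}: {sum(1 for _ in f)} records")
```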
Apply annotations, add automatic negatives
```shell
python scripts/preprocessing/apply_mwe_annotations.py data/annotations.jsonl data/WSD_Evaluation_Framework/Training_Corpora/SemCor/semcor.fixed.jsonl
python scripts/preprocessing/auto_add_negative_mwes.py data/WSD_Evaluation_Framework/Training_Corpora/SemCor/semcor.fixed.annotated.jsonl --target_percent 0.55
```
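As a loose illustration of what `--target_percent` controls (this is an assumption about its semantics, not taken from the script; auto_add_negative_mwes.py defines the real behaviour), if negatives should end up as 55% of all MWE candidates while positives stay fixed, the required number of negatives would be:

```python
# Hypothetical arithmetic only: how many negative candidates to add so that
# negatives make up `target_percent` of all MWE candidates.
# This is one plausible reading of --target_percent; the script is authoritative.
def negatives_needed(num_positives: int, target_percent: float) -> int:
    # Solve n / (num_positives + n) = target_percent for n, with positives fixed.
    return round(num_positives * target_percent / (1.0 - target_percent))

print(negatives_needed(100, 0.55))  # ~122 negatives for every 100 positives
```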
Model download
Pre-trained models can be downloaded from the Hugging Face Hub.
| Data                     | Architecture     | Download                                         |
|--------------------------|------------------|--------------------------------------------------|
| SemCor                   | Bi-encoder       | https://huggingface.co/Jotanner/mweaswsd         |
| SemCor + PARSEME/DiMSUM  | Bi-encoder       | https://huggingface.co/Jotanner/mweaswsd-ft      |
| SemCor                   | DCA Poly-encoder | https://huggingface.co/Jotanner/mweaswsd-dca     |
| SemCor + PARSEME/DiMSUM  | DCA Poly-encoder | https://huggingface.co/Jotanner/mweaswsd-dca-ft  |
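The checkpoints can also be fetched programmatically with huggingface_hub, as in the sketch below; snapshot_download only downloads the files, and the resulting checkpoint is then passed to this repo's scripts (e.g. via `--model` or `--load_model`) rather than loaded with transformers directly:

```python
# Download one of the pre-trained checkpoints from the Hugging Face Hub.
# snapshot_download fetches the repository files; loading the checkpoint itself
# is handled by this repo's training/evaluation scripts.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Jotanner/mweaswsd-ft")
print("checkpoint files downloaded to:", local_dir)
```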
Training
To replicate our top Bi-encoder run:
```shell
python scripts/training/train.py \
--max_epochs 15 \
--batch_size 16 \
--accumulate_grad_batches 2 \
--gpus 1 \
--swa true \
--gradient_clip_val 1.0 \
--lr 0.00001 \
--run_name replicate-top \
--encoder bert-base-uncased \
--enable_checkpointing true \
--mwe_processing true \
--train_data_suffix fixed.annotated.autoneg
```
For the distinct codes attention (DCA) Poly-encoder:
```shell
python scripts/training/train.py \
--max_epochs 15 \
--batch_size 16 \
--accumulate_grad_batches 2 \
--gpus 1 \
--swa false \
--gradient_clip_val 1.0 \
--lr 0.00001 \
--run_name replicate-top-dca \
--encoder bert-base-uncased \
--enable_checkpointing true \
--weight_decay 0.01 \
--dropout 0.1 \
--mwe_processing true \
--head_config configs/poly_distinct_codes_128.yaml \
--train_data_suffix fixed.annotated.autoneg
```
Finetune on DiMSUM/PARSEME data
```shell
export BASEMODEL=checkpoints/replicate-top/ep14_0.73f1.ckpt  # set this to the base model you want to fine-tune from

# Add candidates using the model we want to fine-tune, so it learns from its own mistakes
python scripts/preprocessing/auto_add_negative_mwes.py data/dimsum_train_train.jsonl \
    --target_percent 1.0 --pipeline dimsum_train --no_filter_candidates --model $BASEMODEL
python scripts/preprocessing/assign_gold_senses.py data/dimsum_train_train.autoneg.jsonl
mkdir data/dimsum_train_train
mv data/dimsum_train_train.* data/dimsum_train_train

python scripts/preprocessing/auto_add_negative_mwes.py data/cupt_train_train.jsonl \
    --target_percent 1.0 --pipeline cupt_train --no_filter_candidates --model $BASEMODEL
python scripts/preprocessing/assign_gold_senses.py data/cupt_train_train.autoneg.jsonl
mkdir data/cupt_train_train
mv data/cupt_train_train.* data/cupt_train_train

python scripts/training/train.py \
    --data mixed_finetune \
    --load_model $BASEMODEL \
    --max_epochs 3 \
    --batch_size 16 \
    --accumulate_grad_batches 2 \
    --gpus 1 \
    --swa true \
    --gradient_clip_val 1.0 \
    --lr 0.00001 \
    --run_name mixed_finetune \
    --enable_checkpointing true \
    --mwe_processing true \
    --wsd_processing false \
    --limit_val_batches 0 \
    --mwe_eval_pipelines cupt_sample dimsum_sample \
    --checkpoint_metric val/cupt_sample/mwe_pipeline_f1 \
    --train_data_suffix autoneg.sense \
    --limit_key_candidates False
```
To fine-tune on only a single dataset, change the `--data` argument to `--data cupt` or `--data dimsum`.
Evaluation
```shell
export MODEL=checkpoints/replicate-top/ep14_0.73f1.ckpt  # set this to the model you want to evaluate

# WSD evaluation
python scripts/training/wsd_eval.py --data data/WSD_Evaluation_Framework/Evaluation_Datasets/ALL/ --model $MODEL --batch_size 2

# Write MWE predictions for the PARSEME and DiMSUM test sets
python scripts/training/mwe_out.py cupt_test --model $MODEL --output compare/results.cupt
python scripts/training/mwe_out.py dimsum_test --model $MODEL --output compare/results.dimsum

# Score them with the official evaluation scripts
python data/parseme_mwe/bin/evaluate.py --pred compare/results.cupt --gold data/parseme_mwe/EN/test.cupt
conda activate py27  # dimsum eval requires python 2.7
python data/dimsum-data/scripts/dimsumeval.py data/dimsum-data/dimsum16.test compare/results.dimsum
```
If the DiMSUM scorer errors out, it may be necessary to comment out lines 204, 206, 456, and 458-472. This does not change the MWE scoring, but prevents the errors.
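The official PARSEME and DiMSUM scorers report MWE precision, recall, and F1 with their own matching rules. As a rough, scorer-independent sanity check, an exact-match F1 over predicted vs. gold MWE token-index groups could be computed like this (illustrative only, not the official metric):

```python
# Illustrative only: exact-match F1 over MWE token-index groups.
# The official PARSEME/DiMSUM scorers use their own matching rules and are the ones to report.
def exact_match_f1(gold: list[frozenset[int]], pred: list[frozenset[int]]) -> float:
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical example: one MWE predicted correctly, one missed, one spurious.
gold = [frozenset({3, 4}), frozenset({7, 9})]
pred = [frozenset({3, 4}), frozenset({11, 12})]
print(exact_match_f1(gold, pred))  # 0.5
```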
Owner
- Name: Josh Tanner
- Login: Mindful
- Kind: user
- Location: Seattle
- Company: @mantra-inc
- Website: https://joshuatanner.dev/
- Twitter: mindful_jt
- Repositories: 34
- Profile: https://github.com/Mindful
NLP & Software Engineering
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this research, please cite the paper"
authors:
- family-names: "Tanner"
given-names: "Joshua"
- family-names: "Hoffman"
given-names: "Jacob"
title: "MWE as WSD"
version: 0.1.0
doi: https://doi.org/10.48550/arXiv.2303.06623
date-released: 2023-03-14
url: "https://github.com/Mindful/MWEasWSD"
preferred-citation:
title: "MWE as WSD: Solving Multiword Expression Identification with Word Sense Disambiguation"
type: article
authors:
- family-names: Tanner
given-names: Joshua
- family-names: Hoffman
given-names: Jacob
year: 2023
journal: ArXiv
doi: https://doi.org/10.48550/arXiv.2303.06623
url: https://arxiv.org/abs/2303.06623
GitHub Events
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 2
- Total pull requests: 2
- Average time to close issues: 9 days
- Average time to close pull requests: about 12 hours
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 2.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 2
- Average time to close issues: 9 days
- Average time to close pull requests: about 12 hours
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 2.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Yusuke196 (2)
Pull Request Authors
- Yusuke196 (4)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- fugashi *
- gdown *
- gitpython *
- ipadic ==1.0.0
- jsonlines *
- jupyter *
- lxml *
- matplotlib *
- nltk *
- pytorch-lightning *
- scikit-learn *
- spacy ==3.4.4
- torch *
- torchmetrics *
- tqdm *
- transformers *
- unidic-lite *
- wandb *
- fugashi *
- gdown *
- gitpython *
- ipadic ==1.0.0
- jsonlines *
- jupyter *
- lxml *
- matplotlib *
- nltk *
- pytorch-lightning *
- scikit-learn *
- torch *
- torchmetrics *
- tqdm *
- transformers *
- unidic-lite *
- wandb *