mweaswsd

Repo for the paper "MWE as WSD: Solving Multi-Word Expression Identification with Word Sense Disambiguation"

https://github.com/mindful/mweaswsd

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.3%) to scientific vocabulary

Keywords

multiword-expressions mwe nlp word-sense-disambiguation wsd
Last synced: 6 months ago

Repository

Repo for the paper "MWE as WSD: Solving Multi-Word Expression Identification with Word Sense Disambiguation"

Basic Info
  • Host: GitHub
  • Owner: Mindful
  • License: agpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1.12 MB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
multiword-expressions mwe nlp word-sense-disambiguation wsd
Created almost 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

MWE as WSD

Repo for the paper MWE as WSD: Solving Multiword Expression Identification with Word Sense Disambiguation.

Installation

```shell
cd MWEasWSD
pip install -e .
```

Data

```shell
cd data && ./get-data.sh
```

While our repository contains all the code necessary to recreate the modified SemCor data used in our experiments, we also make the fully processed data with synthetic negatives and our annotations available in the data/augmented directory. Note that the data has been converted to a JSON format. If you use the SemCor data in any capacity, please cite the original authors as mentioned here.
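The converted records are one JSON object per line; a minimal sketch of loading such a .jsonl file with the standard library (the `tokens` field name is an illustrative assumption, not this repo's actual schema):

```python
import json

def read_jsonl(path):
    """Yield one parsed record per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Each record can then be consumed lazily, e.g. `for record in read_jsonl("data/augmented/semcor.jsonl"): ...`.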

Preprocessing

Generate .jsonl files and split the data.

```shell
# Convert the data into .jsonl files
python scripts/preprocessing/wsd_xml.py data/WSD_Evaluation_Framework/Training_Corpora/SemCor
cp data/WSD_Evaluation_Framework/Evaluation_Datasets/ALL/ALL.data.xml data/WSD_Evaluation_Framework/Evaluation_Datasets/ALL/all.data.xml
cp data/WSD_Evaluation_Framework/Evaluation_Datasets/ALL/ALL.gold.key.txt data/WSD_Evaluation_Framework/Evaluation_Datasets/ALL/all.gold.key.txt
for d in "data/WSD_Evaluation_Framework/Evaluation_Datasets/"*"/"; do
    python scripts/preprocessing/wsd_xml.py "$d"
done

python scripts/preprocessing/connlu.py --data_type cupt data/parseme_mwe/EN/train.cupt data/cupt_train.jsonl
python scripts/preprocessing/connlu.py --data_type cupt data/parseme_mwe/EN/test.cupt data/cupt_test.jsonl

python scripts/preprocessing/dimsum.py data/dimsum-data/dimsum16.test data/dimsum_test.jsonl
python scripts/preprocessing/dimsum.py data/dimsum-data/dimsum16.train data/dimsum_train.jsonl

python scripts/preprocessing/random_split.py data/cupt_train.jsonl
python scripts/preprocessing/random_split.py data/dimsum_train.jsonl
python scripts/preprocessing/fix_existing_mwes.py data/WSD_Evaluation_Framework/Training_Corpora/SemCor/semcor.jsonl
```
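The random-split step above presumably shuffles a .jsonl file and partitions it into training and validation portions; a hedged sketch of what such a split could look like (the `split_jsonl` name, the 90/10 ratio, and the `_train`/`_val` output suffixes are all assumptions, not the script's actual behavior):

```python
import json
import random

def split_jsonl(path, train_frac=0.9, seed=0):
    """Shuffle records and write <stem>_train.jsonl / <stem>_val.jsonl next to the input."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(records)  # seeded for reproducibility
    cut = int(len(records) * train_frac)
    stem = path.rsplit(".jsonl", 1)[0]
    for name, part in (("train", records[:cut]), ("val", records[cut:])):
        with open(f"{stem}_{name}.jsonl", "w", encoding="utf-8") as f:
            f.writelines(json.dumps(r) + "\n" for r in part)
    return cut, len(records) - cut
```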

Apply annotations and add automatic negatives:

```shell
python scripts/preprocessing/apply_mwe_annotations.py data/annotations.jsonl data/WSD_Evaluation_Framework/Training_Corpora/SemCor/semcor.fixed.jsonl
python scripts/preprocessing/auto_add_negative_mwes.py data/WSD_Evaluation_Framework/Training_Corpora/SemCor/semcor.fixed.annotated.jsonl --target_percent 0.55
```
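One reading of `--target_percent 0.55` is that synthetic negatives are added until they make up 55% of the MWE examples; under that assumption (the script's actual semantics may differ), the required negative count follows from n_neg / (n_neg + n_pos) >= target:

```python
import math

def negatives_needed(num_positives, target_percent):
    """Smallest n_neg such that n_neg / (n_neg + num_positives) >= target_percent."""
    # Rearranging n / (n + p) >= t gives n >= t * p / (1 - t).
    return math.ceil(target_percent * num_positives / (1.0 - target_percent))
```

For the 0.55 target used above, a corpus with 100 positive MWE instances would need 123 synthetic negatives.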

Model download

Pre-trained models can be downloaded from the Hugging Face Hub.

| Data                    | Architecture     | Download                                        |
|-------------------------|------------------|-------------------------------------------------|
| SemCor                  | Bi-encoder       | https://huggingface.co/Jotanner/mweaswsd        |
| SemCor + PARSEME/DiMSUM | Bi-encoder       | https://huggingface.co/Jotanner/mweaswsd-ft     |
| SemCor                  | DCA Poly-encoder | https://huggingface.co/Jotanner/mweaswsd-dca    |
| SemCor + PARSEME/DiMSUM | DCA Poly-encoder | https://huggingface.co/Jotanner/mweaswsd-dca-ft |
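The checkpoints in the table can also be fetched programmatically; a sketch using `huggingface_hub`'s `snapshot_download` (the `download_checkpoint` helper and its lookup keys are illustrative, not part of this repo):

```python
# Hub repository IDs taken from the download table above.
MODEL_HUB_IDS = {
    ("semcor", "bi-encoder"): "Jotanner/mweaswsd",
    ("semcor+ft", "bi-encoder"): "Jotanner/mweaswsd-ft",
    ("semcor", "dca-poly-encoder"): "Jotanner/mweaswsd-dca",
    ("semcor+ft", "dca-poly-encoder"): "Jotanner/mweaswsd-dca-ft",
}

def download_checkpoint(data, architecture, local_dir="checkpoints"):
    """Fetch all files of a pretrained model from the Hugging Face Hub."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    repo_id = MODEL_HUB_IDS[(data, architecture)]
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

For example, `download_checkpoint("semcor", "bi-encoder")` pulls the base Bi-encoder trained on SemCor.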

Training

To replicate our top Bi-encoder run:

```shell
python scripts/training/train.py \
    --max_epochs 15 \
    --batch_size 16 \
    --accumulate_grad_batches 2 \
    --gpus 1 \
    --swa true \
    --gradient_clip_val 1.0 \
    --lr 0.00001 \
    --run_name replicate-top \
    --encoder bert-base-uncased \
    --enable_checkpointing true \
    --mwe_processing true \
    --train_data_suffix fixed.annotated.autoneg
```

For the distinct codes attention Poly-encoder:

```shell
python scripts/training/train.py \
    --max_epochs 15 \
    --batch_size 16 \
    --accumulate_grad_batches 2 \
    --gpus 1 \
    --swa false \
    --gradient_clip_val 1.0 \
    --lr 0.00001 \
    --run_name replicate-top-dca \
    --encoder bert-base-uncased \
    --enable_checkpointing true \
    --weight_decay 0.01 \
    --dropout 0.1 \
    --mwe_processing true \
    --head_config configs/poly_distinct_codes_128.yaml \
    --train_data_suffix fixed.annotated.autoneg
```

Finetune on DiMSUM/PARSEME data

```shell
export BASE_MODEL=checkpoints/replicate-top/ep14_0.73f1.ckpt # set this to the base model you want to fine-tune from

# add candidates using the model we want to finetune, so it learns from its own mistakes
python scripts/preprocessing/auto_add_negative_mwes.py data/dimsum_train_train.jsonl \
    --target_percent 1.0 --pipeline dimsum_train --no_filter_candidates --model $BASE_MODEL
python scripts/preprocessing/assign_gold_senses.py data/dimsum_train_train.autoneg.jsonl
mkdir data/dimsum_train_train
mv data/dimsum_train_train.* data/dimsum_train_train

python scripts/preprocessing/auto_add_negative_mwes.py data/cupt_train_train.jsonl \
    --target_percent 1.0 --pipeline cupt_train --no_filter_candidates --model $BASE_MODEL
python scripts/preprocessing/assign_gold_senses.py data/cupt_train_train.autoneg.jsonl
mkdir data/cupt_train_train
mv data/cupt_train_train.* data/cupt_train_train

python scripts/training/train.py \
    --data mixed_finetune \
    --load_model $BASE_MODEL \
    --max_epochs 3 \
    --batch_size 16 \
    --accumulate_grad_batches 2 \
    --gpus 1 \
    --swa true \
    --gradient_clip_val 1.0 \
    --lr 0.00001 \
    --run_name mixed_finetune \
    --enable_checkpointing true \
    --mwe_processing true \
    --wsd_processing false \
    --limit_val_batches 0 \
    --mwe_eval_pipelines cupt_sample dimsum_sample \
    --checkpoint_metric val/cupt_sample/mwe_pipeline_f1 \
    --train_data_suffix autoneg.sense \
    --limit_key_candidates False
```

To finetune on only a single dataset, change the --data argument to --data cupt or --data dimsum.

Evaluation

```shell
export MODEL=checkpoints/replicate-top/ep14_0.73f1.ckpt # set this to the model you want to evaluate

python scripts/training/wsd_eval.py --data data/WSD_Evaluation_Framework/Evaluation_Datasets/ALL/ --model $MODEL --batch_size 2

python scripts/training/mwe_out.py cupt_test --model $MODEL --output compare/results.cupt
python scripts/training/mwe_out.py dimsum_test --model $MODEL --output compare/results.dimsum

python data/parseme_mwe/bin/evaluate.py --pred compare/results.cupt --gold data/parseme_mwe/EN/test.cupt
conda activate py27 # dimsum eval requires python 2.7
python data/dimsum-data/scripts/dimsumeval.py data/dimsum-data/dimsum16.test compare/results.dimsum
```

If the DiMSUM scorer errors out, it may be necessary to comment out lines 204, 206, 456, and 458-472. This does not affect MWE scoring, but prevents the errors.
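Rather than editing the scorer by hand, the listed lines can be commented out with a small helper; this `comment_out_lines` function is an illustrative sketch, not part of the repo (the line numbers come from the note above and may differ across scorer versions):

```python
from pathlib import Path

# 1-indexed line numbers to disable in the scorer, per the note above.
LINES_TO_COMMENT = {204, 206, 456, *range(458, 473)}

def comment_out_lines(path, line_numbers):
    """Prefix the given 1-indexed lines with '#', writing a .bak backup first."""
    p = Path(path)
    original = p.read_text().splitlines(keepends=True)
    p.with_suffix(p.suffix + ".bak").write_text("".join(original))
    patched = [
        "#" + line if i in line_numbers and not line.lstrip().startswith("#") else line
        for i, line in enumerate(original, start=1)
    ]
    p.write_text("".join(patched))
```

Usage would be `comment_out_lines("data/dimsum-data/scripts/dimsumeval.py", LINES_TO_COMMENT)`; the backup lets you diff or restore the original scorer afterwards.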

Owner

  • Name: Josh Tanner
  • Login: Mindful
  • Kind: user
  • Location: Seattle
  • Company: @mantra-inc

NLP & Software Engineering

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this research, please cite the paper"
authors:
- family-names: "Tanner"
  given-names: "Joshua"
- family-names: "Hoffman"
  given-names: "Jacob"
title: "MWE as WSD"
version: 0.1.0
doi: 10.48550/arXiv.2303.06623
date-released: 2023-03-14
url: "https://github.com/Mindful/MWEasWSD"

preferred-citation:
  title: "MWE as WSD: Solving Multiword Expression Identification with Word Sense Disambiguation"
  type: article
  authors:
    - family-names: Tanner
      given-names: Joshua
    - family-names: Hoffman
      given-names: Jacob 
  year: 2023
  journal: ArXiv
  doi: 10.48550/arXiv.2303.06623
  url: https://arxiv.org/abs/2303.06623

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 2
  • Total pull requests: 2
  • Average time to close issues: 9 days
  • Average time to close pull requests: about 12 hours
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 2.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 2
  • Average time to close issues: 9 days
  • Average time to close pull requests: about 12 hours
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 2.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Yusuke196 (2)
Pull Request Authors
  • Yusuke196 (4)

Dependencies

pyproject.toml pypi
requirements.txt pypi
  • fugashi *
  • gdown *
  • gitpython *
  • ipadic ==1.0.0
  • jsonlines *
  • jupyter *
  • lxml *
  • matplotlib *
  • nltk *
  • pytorch-lightning *
  • scikit-learn *
  • spacy ==3.4.4
  • torch *
  • torchmetrics *
  • tqdm *
  • transformers *
  • unidic-lite *
  • wandb *
setup.py pypi
  • fugashi *
  • gdown *
  • gitpython *
  • ipadic ==1.0.0
  • jsonlines *
  • jupyter *
  • lxml *
  • matplotlib *
  • nltk *
  • pytorch-lightning *
  • scikit-learn *
  • torch *
  • torchmetrics *
  • tqdm *
  • transformers *
  • unidic-lite *
  • wandb *