https://github.com/boostcampaitech5/level2_klue-nlp-11

level2_klue-nlp-11 created by GitHub Classroom

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (3.1%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: boostcampaitech5
  • Language: Python
  • Default Branch: main
  • Size: 87.9 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed almost 3 years ago

https://github.com/boostcampaitech5/level2_klue-nlp-11/blob/main/

# Relation Extraction Competition
> Boostcamp AI Tech 5 Level 2   


## Leader Board
**Private 1st**

![lb](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/102800474/073aa3a9-c997-4e9f-b63a-5ea6dde13053)
## Overview
Relation Extraction is the task of predicting the relation between the entities in a sentence.
## Team Members & Roles
| [line1029](https://github.com/line1029) | [Minwoo0206](https://github.com/Minwoo0206) | [jaekwanyda](https://github.com/jaekwanyda) | [wjdals3406](https://github.com/wjdals3406) | [jiho-hong](https://github.com/jiho-hong) |
|:-:|:-:|:-:|:-:|:-:|
## Collaboration
### Meeting
### Tools
- Notion
- Git
- W&B
## Skill
- PyTorch
- HuggingFace
- Pandas
## Directory
```
level2_klue-nlp-11
├── README.md
├── config
│   ├── config.py
│   ├── config.yaml
│   └── sweep_config.yaml
├── dataloader.py
├── inference.py
├── models.py
├── pretraining.py
├── sweep.py
├── train.py
└── utils
    ├── dict_label_to_num.pkl
    ├── dict_num_to_label.pkl
    ├── losses.py
    ├── metrics.py
    ├── seed.py
    └── utils.py
```
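## EDA
The data consists of the files below; no separate validation dataset is provided.
- train.csv: 32,470 samples
- test_data.csv: 7,765 samples

![Untitled](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/c2ed7146-4f9f-47ca-bb69-1fecfb8b15c8)
### Observations
1. The label distribution is imbalanced.
2. The label distribution differs by source.
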
![Untitled 1](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/a48ea5fe-731a-4851-b3ec-98be6413917b)

source = wikitree

![Untitled 2](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/fa8e5690-bcb2-4179-8337-97852bdb5e87)

source = wikipedia

![Untitled 3](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/60e697d3-8aad-4a4d-90ee-3b079349ea11)

source = policy_briefing

![Untitled 4](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/0fb84fa1-f676-4286-b5df-6c6c6660a984)

## Data Experiments
### Data Split
No validation dataset was provided, so one had to be built from the train data. The data split ratio was 10%, and the split took the sentence, subject entity, and object entity into account.

![Untitled 5](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/9a36f254-8767-4229-a1d9-4a08e6d525cc)

Two ways of splitting the train and validation dataset by sentence (case 1 and case 2) were compared by how closely val_f1_score tracked test_f1_score on the test dataset. Case 2 gave the smaller gap between val_f1_score and test_f1_score, and the validation dataset was taken as 10% of the train dataset.

- value = |val_f1_score - test_f1_score|

|        | avg      | seed a   | seed b   | seed c   |
| ------ | -------- | -------- | -------- | -------- |
| case 1 | 0.869156 | 1.15831  | 0.554001 | 0.895157 |
| case 2 | 0.163386 | 0.203056 | 0.067093 | 0.22001  |

### Typed Entity Marker
This experiment draws on [Matching the Blanks](https://aclanthology.org/P19-1279.pdf) and [An Improved Baseline for Sentence-level Relation Extraction](https://arxiv.org/pdf/2102.01373.pdf).

A. Matching the Blanks

![Untitled 6](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/c48f8229-97bc-4fd8-9332-f2d90f3cf15a)

The [E1] token and [E2] token representations are concatenated with the [CLS] token representation and passed to the classifier. Three variants were compared:

1. [CLS] token only
2. [CLS], [E1]-start, [E1]-end, [E2]-start, [E2]-end tokens
3. [CLS], [E1]-start, [E2]-start tokens

Variant 3 gave the best result.

- **LB Score: 64 → 70.43**

B. An Improved Baseline for Sentence-level Relation Extraction

Instead of plain [E1] and [E2] tokens, the markers also carry the entity type. Two variants were compared:

1. Typed entity marker: add the entity type as a special token, as in the paper's example `<S:PERSON> Bill </S:PERSON> was born in <O:CITY> Seattle </O:CITY>.`
2. Typed entity marker (punct): mark the entity and its type with punctuation that the tokenizer already handles, for example: @ * person * Bill @ was born in # ^ city ^ Seattle #.

Variant 2 gave the better result.

- **LB Score: 70.43 → 71.028**

### Semantic Typing
Following the idea of BERT Next Sentence Prediction, a query sentence about the entity pair is placed before the original sentence ([Unified Semantic Typing with Meaningful Label Inference](https://arxiv.org/pdf/2205.01826v1.pdf)).

- Query templates, shown by their slots:
  1) [Subject] [Object] ?
  2) [Subject] [Object] [Subject:type] [Object:type] .
  3) [Object] [Subject:type] [Subject] [Object:type].
- Applying the typed entity marker (punct) to the query was also compared:
  - 1: Bill Seattle ? + sentence2
  - 2: @ * person * Bill # ^ city ^ Seattle ? + sentence2

Variant 2 gave the better result.
- **LB Score: 71.028 → 74.2119**

### Confusion Matrix
A Confusion Matrix was used to analyze the model's predictions per label, in particular the errors involving no_relation.

![image](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/0f5c6a23-9e0a-4a68-99c1-60927c0ceb96)

### Data Augmentation
To address the imbalance between no_relation and the other labels, the data was augmented with backtranslation. The LB Score changed as follows.

- **LB Score: 72.052 → 73.8259**

### Source Token
Since the label distribution differs by data source (wikitree, wikipedia, policy_briefing), a token identifying the source was added to the input. The source tokens [WT], [WP], and [PB] were combined with the Semantic Typing input in two ways:

- 1: [CLS] [WT] sentence1 [SEP] sentence2
- 2: [CLS] sentence1 [SEP] [WT] sentence2
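The input formats from the Typed Entity Marker and Source Token experiments can be sketched as plain string operations. This is a minimal illustration, not the repository's dataloader: the function names, the dict layout with inclusive `start`/`end` character indices, and the source-to-token mapping are assumptions made for the example. In a real pipeline the tokenizer would normally insert [CLS] and [SEP] itself.

```python
# Hypothetical mapping from data source to source token (assumed from the initials).
SOURCE_TOKENS = {"wikitree": "[WT]", "wikipedia": "[WP]", "policy_briefing": "[PB]"}


def typed_entity_marker_punct(sentence, subj, obj):
    """Wrap subject and object with punctuation-based typed markers.

    subj / obj are dicts with 'start', 'end' (inclusive character indices,
    an assumption of this sketch) and 'type'.
    Subject format: @ * type * entity @   Object format: # ^ type ^ entity #
    """
    spans = sorted(
        [
            (subj["start"], subj["end"], subj["type"], "@", "*"),
            (obj["start"], obj["end"], obj["type"], "#", "^"),
        ],
        reverse=True,  # edit the right-most span first so earlier indices stay valid
    )
    for start, end, etype, outer, inner in spans:
        entity = sentence[start:end + 1]
        marked = f"{outer} {inner} {etype} {inner} {entity} {outer}"
        sentence = sentence[:start] + marked + sentence[end + 1:]
    return sentence


def add_source_token(sentence1, sentence2, source, position=1):
    """Build the two source-token layouts as illustrative strings."""
    token = SOURCE_TOKENS[source]
    if position == 1:
        return f"[CLS] {token} {sentence1} [SEP] {sentence2}"
    return f"[CLS] {sentence1} [SEP] {token} {sentence2}"


if __name__ == "__main__":
    text = "Bill was born in Seattle."
    subject = {"start": 0, "end": 3, "type": "person"}
    obj = {"start": 17, "end": 23, "type": "city"}
    print(typed_entity_marker_punct(text, subject, obj))
    # -> @ * person * Bill @ was born in # ^ city ^ Seattle #.
```

Processing the spans from right to left avoids recomputing offsets after the first insertion.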
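## Loss Function
Cross Entropy Loss, Class-Balanced Loss, and Focal Loss were compared.
### CE Loss
- The standard classification loss.
- Used as the baseline loss.
### CB Loss
- Reweights each class by its effective number of samples.
- [Class-Balanced Loss Based on Effective Number of Samples](https://arxiv.org/pdf/1901.05555.pdf)
### Focal Loss
- Down-weights well-classified examples so that training focuses on hard examples.
- [Focal Loss for Dense Object Detection](https://arxiv.org/pdf/1708.02002.pdf)

With Focal Loss, the LB Score changed as follows.
- **LB Score: 73.8014 → 74.0923**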
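For reference, Focal Loss for a multi-class problem can be written as FL = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below is a framework-free illustration of that formula, not the code in `utils/losses.py`; the function names and the default `gamma=2.0` are assumptions. With `gamma=0` it reduces to Cross Entropy Loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(batch_logits, targets, gamma=2.0):
    """Mean focal loss over a batch.

    batch_logits: list of per-sample logit lists
    targets: list of true class indices
    gamma: focusing parameter (assumed default; gamma=0 gives cross entropy)
    """
    losses = []
    for logits, target in zip(batch_logits, targets):
        p_t = softmax(logits)[target]
        # (1 - p_t) ** gamma shrinks the loss of confident, correct predictions
        losses.append(-((1.0 - p_t) ** gamma) * math.log(p_t))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    labels = [0, 2]
    print("cross entropy:", focal_loss(logits, labels, gamma=0.0))
    print("focal (gamma=2):", focal_loss(logits, labels, gamma=2.0))
```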
## Modeling
### TAPT
This experiment follows [Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/pdf/2004.10964.pdf). Rather than only fine-tuning a huggingface pretrained-model on the downstream-task, the model is first pretrained further on task-specific data. The huggingface pretrained-model was re-pretrained with MLM (Masked Language Model) on the task data, which comes from wikitree, wikipedia, and policy_briefing. In addition, a task-specific re-pretraining variant that masks the subject and object was tried.

### LSTM
RNN layers were also tried. KLUE-RoBERTa with 3 linear layers (base) was compared against variants using LSTM, Bi-LSTM, and GRU.

|         | validation f1 | train loss | validation loss |
| ------- | ------------- | ---------- | --------------- |
| base    | 85.852        | 0.4184     | 0.5429          |
| LSTM    | 81.134        | 0.5868     | 0.7024          |
| Bi-LSTM | 85.535        | 0.1236     | 0.6249          |
| GRU     | 85.608        | 0.1750     | 0.6273          |
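The task-specific re-pretraining mentioned under TAPT can be pictured as MLM in which the masked positions are the entity spans rather than random tokens. The sketch below shows only that masking step on an already tokenized sentence. It assumes the masking targets exactly the subject and object spans; the function name, the inclusive token-index spans, and the `None` convention for ignored positions are all hypothetical and not taken from `pretraining.py`.

```python
def mask_entities(tokens, entity_spans, mask_token="[MASK]"):
    """Replace tokens inside the given spans with the mask token.

    tokens: list of token strings
    entity_spans: list of (start, end) token indices, end inclusive (assumption)
    Returns (masked_tokens, labels) where labels holds the original token at
    masked positions and None elsewhere (positions to ignore in the MLM loss).
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end in entity_spans:
        for i in range(start, end + 1):
            labels[i] = tokens[i]
            masked[i] = mask_token
    return masked, labels


if __name__ == "__main__":
    toks = ["Bill", "was", "born", "in", "Seattle", "."]
    masked, labels = mask_entities(toks, [(0, 0), (4, 4)])
    print(masked)
    print(labels)
```

With a real tokenizer the spans would need to be mapped from character offsets to subword indices first.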
## Hyper-parameter Tuning
wandb sweep was used to tune the hyper-parameters.
### Hyper-parameter list
- batch-size : 16, 24, 32
- learning_rate : 1e-05, 2e-05
- loss_function : Cross-Entropy, Focal Loss
- warm_up_ratio : 0, 0.1, 0.3, 0.6
- weight_decay : 0, 0.01
- lr_scheduler : Linear, Invsqrt, Cosine Annealing w/ Hard Restart
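The hyper-parameter list above maps naturally onto a wandb sweep configuration, which is a dictionary with `method`, `metric`, and `parameters` keys. The dictionary below is a sketch and not the contents of `config/sweep_config.yaml`: the search method, the metric name `val_f1`, and the string values chosen for the loss functions and schedulers are assumptions.

```python
# Sketch of a sweep configuration built from the listed values.
sweep_config = {
    "method": "grid",  # assumption: the search strategy is not stated in the text
    "metric": {"name": "val_f1", "goal": "maximize"},  # assumption: metric name
    "parameters": {
        "batch_size": {"values": [16, 24, 32]},
        "learning_rate": {"values": [1e-05, 2e-05]},
        "loss_function": {"values": ["cross_entropy", "focal"]},
        "warm_up_ratio": {"values": [0, 0.1, 0.3, 0.6]},
        "weight_decay": {"values": [0, 0.01]},
        "lr_scheduler": {"values": ["linear", "invsqrt", "cosine_hard_restart"]},
    },
}


def count_grid_runs(config):
    """Number of runs a full grid search over the config would launch."""
    total = 1
    for spec in config["parameters"].values():
        total *= len(spec["values"])
    return total


if __name__ == "__main__":
    print(count_grid_runs(sweep_config))  # 3 * 2 * 2 * 4 * 2 * 3 = 288
```

A full grid over these values is 288 runs, which is one reason a random or Bayesian method may be preferable in practice. A dictionary of this shape can be passed to `wandb.sweep` to register the sweep.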
## Ensemble
- Models were evaluated by F1 and AUPRC.
- Hard Voting, Soft Voting, and Weighted Voting (Hard, Soft) were compared.
- Weighted Voting (Hard) was used.

|                        | micro_f1 | auprc   |
| ---------------------- | -------- | ------- |
| LB Public Score        | 76.7790  | 81.5786 |
| LB Private Score (1st) | 76.3907  | 83.4108 |
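The voting schemes named above differ only in what is aggregated: hard voting counts predicted labels, soft voting averages class probabilities, and the weighted variants scale each model's contribution. The sketch below illustrates them in plain Python. The function names and the example weights are hypothetical, and how the team derived its weights is not stated in the text.

```python
from collections import defaultdict


def hard_vote(predictions, weights=None):
    """Weighted majority vote over predicted labels.

    predictions: one list of predicted labels per model (same length each)
    weights: optional per-model weights; equal weights give plain hard voting
    """
    weights = weights or [1.0] * len(predictions)
    voted = []
    for sample_preds in zip(*predictions):
        scores = defaultdict(float)
        for label, w in zip(sample_preds, weights):
            scores[label] += w
        voted.append(max(scores, key=scores.get))
    return voted


def soft_vote(probabilities, weights=None):
    """Weighted average of class probabilities.

    probabilities: per model, a list of per-sample probability lists
    """
    weights = weights or [1.0] * len(probabilities)
    total = sum(weights)
    averaged = []
    for sample_probs in zip(*probabilities):
        n_classes = len(sample_probs[0])
        averaged.append([
            sum(w * p[c] for p, w in zip(sample_probs, weights)) / total
            for c in range(n_classes)
        ])
    return averaged


if __name__ == "__main__":
    preds = [[0, 1], [0, 2], [1, 2]]                   # three models, two samples
    print(hard_vote(preds))                            # -> [0, 2]
    print(hard_vote(preds, weights=[0.2, 0.2, 0.9]))   # -> [1, 2]
```

In the weighted call the third model outweighs the other two combined, so its label wins on the first sample.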
## Retrospective
### What Went Well
- Base Code built on pytorch_lightning, torchmetrics, and huggingface.
- Git workflow in which each Branch is merged into the main branch.
- Both offline meeting and online meeting.
### What to Improve
- Handling of the train/validation dataset and seed.
- commit message formatting, source code review, and issue usage.
## References
[1] [Matching the Blanks: Distributional Similarity for Relation Learning](https://aclanthology.org/P19-1279.pdf)

[2] [An Improved Baseline for Sentence-level Relation Extraction](https://arxiv.org/pdf/2102.01373.pdf)

[3] [Unified Semantic Typing with Meaningful Label Inference](https://arxiv.org/pdf/2205.01826v1.pdf)

[4] [Class-Balanced Loss Based on Effective Number of Samples](https://arxiv.org/pdf/1901.05555.pdf)

[5] [Focal Loss for Dense Object Detection](https://arxiv.org/pdf/1708.02002.pdf)

[6] [Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/pdf/2004.10964.pdf)

Owner

  • Name: 부스트캠프 AI Tech 5기
  • Login: boostcampaitech5
  • Kind: organization
  • Email: boostcamp_ai@connect.or.kr
  • Location: Korea, South

Boostcamp AI Tech, a learning community for the sustainable growth of AI engineers.

GitHub Events

Total
  • Fork event: 1
Last Year
  • Fork event: 1