https://github.com/boostcampaitech5/level2_klue-nlp-11

level2_klue-nlp-11 created by GitHub Classroom


Basic Info

Host: GitHub
Owner: boostcampaitech5
Language: Python
Default Branch: main
Size: 87.9 KB

Statistics

Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 0

Created about 3 years ago · Last pushed almost 3 years ago


# Relation Extraction Competition
> Boostcamp AI Tech 5 Level 2   




## Leader Board
 **Private 1st**
![lb](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/102800474/073aa3a9-c997-4e9f-b63a-5ea6dde13053)




## Overview

Relation Extraction predicts the attributes of, and the relation between, the words (entities) in a sentence.




## Team & Roles

### Members
|[line1029](https://github.com/line1029)|[Minwoo0206](https://github.com/Minwoo0206)|[jaekwanyda](https://github.com/jaekwanyda)|[wjdals3406](https://github.com/wjdals3406)|[jiho-hong](https://github.com/jiho-hong)|
|:-:|:-:|:-:|:-:|:-:|

### Roles

|    |                                                          |
| ------ | ------------------------------------------------------------ |
|  |  ,  ,  ,   |
|  |    ,                         |
|  |    ,  ,                      |
|  |    ,                         |
|  |    ,                                           |




## Collaboration

### Meeting

-           
-   10         
-   4          

### Tools

- Notion
- Git
- W&B




## Skill

- PyTorch
- HuggingFace
- Pandas




## Directory
```
level2_klue-nlp-11
├── README.md
├── config
│   ├── config.py
│   ├── config.yaml
│   └── sweep_config.yaml
├── dataloader.py
├── inference.py
├── models.py
├── pretraining.py
├── sweep.py
├── train.py
└── utils
    ├── dict_label_to_num.pkl
    ├── dict_num_to_label.pkl
    ├── losses.py
    ├── metrics.py
    ├── seed.py
    └── utils.py
```




## EDA

       . 

   validation dataset   .

- train.csv: 32,470 rows in total
- test_data.csv: 7,765 rows in total
-  
   
   ![Untitled](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/c2ed7146-4f9f-47ca-bb69-1fecfb8b15c8)
 

###  

            .

1. The label distribution is imbalanced.
2. The label distribution differs by source.


  

![Untitled 1](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/a48ea5fe-731a-4851-b3ec-98be6413917b)




  


source = wikitree

![Untitled 2](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/fa8e5690-bcb2-4179-8337-97852bdb5e87)




source = wikipedia

![Untitled 3](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/60e697d3-8aad-4a4d-90ee-3b079349ea11)




source = policy_briefing

![Untitled 4](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/0fb84fa1-f676-4286-b5df-6c6c6660a984)









## Data Experiments

### Data Split

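No validation dataset was provided, so we built our own validation dataset. For the data split we held out 10% of the training data, and we took the sentence, subject entity, and object entity into account when deciding how to split.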

-   

    ![Untitled 5](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/9a36f254-8767-4229-a1d9-4a08e6d525cc)
    

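We compared two ways of splitting the train and validation dataset. For each one we measured how closely val_f1_score tracked test_f1_score on the test dataset.

1. Split train and validation dataset without checking for duplicate sentences
2. Split train and validation dataset so that duplicate sentences stay on the same side

Case 2 gave the smaller gap between val_f1_score and test_f1_score. We therefore used it, with the validation dataset set to 10% of the train dataset.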

-  
    
    value = |val_f1_score - test_f1_score|
    
    | split  | avg      | seed a   | seed b   | seed c   |
    | ------ | -------- | -------- | -------- | -------- |
    | case 1 | 0.869156 | 1.15831  | 0.554001 | 0.895157 |
    | case 2 | 0.163386 | 0.203056 | 0.067093 | 0.22001  |
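
A minimal sketch of a split that keeps duplicate sentences on one side, as in case 2. It assumes each row is a dict with a `sentence` key; `split_by_sentence`, `val_ratio`, and the toy rows are hypothetical names for illustration, not the code in `dataloader.py`.

```python
import random
from collections import defaultdict


def split_by_sentence(rows, val_ratio=0.1, seed=42):
    """Split rows into (train, val) so that no sentence appears on both sides."""
    # Group rows that share exactly the same sentence text
    groups = defaultdict(list)
    for row in rows:
        groups[row["sentence"]].append(row)

    # Shuffle the unique sentences, not the rows, so groups stay intact
    sentences = sorted(groups)
    random.Random(seed).shuffle(sentences)

    target = int(len(rows) * val_ratio)
    train, val = [], []
    for sentence in sentences:
        # Fill the validation side first, then send everything else to train
        bucket = val if len(val) < target else train
        bucket.extend(groups[sentence])
    return train, val


# Toy data: 100 rows built from 40 distinct sentences (made up for the example)
rows = [{"sentence": f"sentence {i % 40}", "label": "no_relation"} for i in range(100)]
train, val = split_by_sentence(rows)
```

Because whole groups are moved at once, the validation side can end up slightly larger than the requested 10%.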

### Typed Entity Marker

We applied the entity-marking methods from [Matching the Blanks](https://aclanthology.org/P19-1279.pdf) and [An Improved Baseline for Sentence-level Relation Extraction](https://arxiv.org/pdf/2102.01373.pdf).

A. Matching the Blanks

![Untitled 6](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/c48f8229-97bc-4fd8-9332-f2d90f3cf15a)

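We wrap the entities in [E1] and [E2] tokens, concatenate their hidden states with that of the [CLS] token, and feed the result to the classifier. We compared three variants:

1. [CLS] token only
2. [CLS], [E1]-start, [E1]-end, [E2]-start, [E2]-end tokens
3. [CLS], [E1]-start, [E2]-start tokens

Of these, variant 3 performed best.

- **LB Score: 64 → 70.43**

B. An Improved Baseline for Sentence-level Relation Extraction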

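Instead of the [E1] and [E2] tokens, the paper marks entities in one of two ways:

1. Add special tokens that carry the entity type.
: `<S:PERSON> Bill </S:PERSON> was born in <O:CITY> Seattle </O:CITY>.` → Typed entity marker
2. Use symbols that the tokenizer already knows from its corpus, so that no new special tokens are needed.
: `@ * person * Bill @ was born in # ^ city ^ Seattle #.` → Typed entity marker (punct)

Of these, variant 2 performed best.

- **LB Score: 70.43 → 71.028**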
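
The Typed entity marker (punct) format can be sketched as a small string transform. This is an illustration only: `add_typed_markers` and the dict layout (`start`, `end`, `type`, with an inclusive `end` character index) are assumptions, not the team's preprocessing code.

```python
def add_typed_markers(sentence, subj, obj):
    """Wrap the subject in '@ * type * ... @' and the object in '# ^ type ^ ... #'.

    subj / obj are dicts with 'start' and 'end' (inclusive character indices)
    and 'type' (the entity type written into the marker).
    """
    spans = [
        (subj["start"], subj["end"], "@", "*", subj["type"]),
        (obj["start"], obj["end"], "#", "^", obj["type"]),
    ]
    # Rewrite the rightmost span first so the other span's offsets stay valid
    for start, end, outer, inner, etype in sorted(spans, reverse=True):
        word = sentence[start:end + 1]
        marked = f"{outer} {inner} {etype} {inner} {word} {outer}"
        sentence = sentence[:start] + marked + sentence[end + 1:]
    return sentence


marked = add_typed_markers(
    "Bill was born in Seattle.",
    subj={"start": 0, "end": 3, "type": "person"},
    obj={"start": 17, "end": 23, "type": "city"},
)
print(marked)  # @ * person * Bill @ was born in # ^ city ^ Seattle #.
```

Because the markers are ordinary punctuation, this variant needs no change to the tokenizer vocabulary.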

### Semantic Typing

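Inspired by the Next Sentence Prediction objective that BERT is pre-trained with, we put a query sentence about the two entities in front of the original sentence, following [Unified Semantic Typing with Meaningful Label Inference](https://arxiv.org/pdf/2205.01826v1.pdf).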

-  
    
    1) [Subject] [Object]   ?
    
    2) [Subject] [Object]  [Subject:type] [Object:type] .
    
    3) [Object] [Subject:type] [Subject] [Object:type].
    
-       Typed entity marker(punct)      .
    
    1: Bill Seattle   ? + sentence2
    
    2: @ * person * Bill # ^ city ^ Seattle   ? + sentence2
    

  2         .

- **LB Score: 71.028 → 74.2119**
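
A rough sketch of how such a sentence pair could be built. The English query template is a hypothetical stand-in, and `build_pair` is an illustrative name; only the pair layout (query first, original sentence as sentence2) is taken from the description above.

```python
def build_pair(sentence, subj, obj, template="What is the relation between {subj} and {obj}?"):
    """Return (sentence1, sentence2): a query about the entities, then the input sentence.

    `template` is a made-up English stand-in for the query wording.
    subj / obj may be plain words or already carry typed entity markers.
    """
    query = template.format(subj=subj, obj=obj)
    return query, sentence


# Variant 1: plain entity words
plain = build_pair("Bill was born in Seattle.", "Bill", "Seattle")

# Variant 2: entities written with Typed entity marker (punct) pieces
typed = build_pair("Bill was born in Seattle.", "@ * person * Bill @", "# ^ city ^ Seattle #")

# A BERT-style tokenizer called with two text arguments would encode such a
# pair as: [CLS] sentence1 [SEP] sentence2 [SEP]
```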

### Confusion Matrix

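To analyze the model's predictions we plotted a Confusion Matrix, which shows which labels the model confuses with one another. Many of the errors involved no_relation.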

- 
    
    ![image](https://github.com/boostcampaitech5/level2_klue-nlp-11/assets/74582277/0f5c6a23-9e0a-4a68-99c1-60927c0ceb96)
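
As a small illustration of this kind of analysis, the most frequent (true, predicted) error pairs can be counted directly. The function name and the toy predictions below are made up for the example.

```python
from collections import Counter


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true_label, predicted_label) pairs that are errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


# Toy labels and predictions (not real model output)
y_true = ["per:employee_of", "per:employee_of", "org:top_members/employees", "no_relation", "no_relation"]
y_pred = ["no_relation", "no_relation", "org:top_members/employees", "no_relation", "per:employee_of"]

worst = top_confusions(y_true, y_pred, k=1)
```

Here `worst` holds the pair that accounts for the most errors, which is the cell one would look for first in the plotted matrix.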
    

### Data Augmentation

To reduce the confusion between no_relation and the other labels, we augmented the data with backtranslation. This improved the LB Score.

- **LB Score: 72.052 → 73.8259**

### Source Token

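Because the label distribution differs by data source (wikitree, wikipedia, policy_briefing), we expected that a token identifying the source would help the model predict the label. We added the [WT], [WP], and [PB] tokens to the Semantic Typing input in two ways.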

- 1: [CLS] [WT] sentence1 [SEP] sentence2
- 2: [CLS] sentence1 [SEP] [WT] sentence2

     source token       .  label  wikitree wikipedia   policy_briefing     .
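
The two placements can be sketched as follows. `SOURCE_TOKENS` and `add_source_token` are illustrative names, and the example sentences are placeholders.

```python
# Mapping from the dataset's source column to the token named in the text
SOURCE_TOKENS = {"wikitree": "[WT]", "wikipedia": "[WP]", "policy_briefing": "[PB]"}


def add_source_token(sentence1, sentence2, source, position=1):
    """Put the source token in front of sentence1 (position 1) or sentence2 (position 2)."""
    token = SOURCE_TOKENS[source]
    if position == 1:
        return f"{token} {sentence1}", sentence2
    return sentence1, f"{token} {sentence2}"


pair_1 = add_source_token("query sentence", "original sentence", "wikitree", position=1)
pair_2 = add_source_token("query sentence", "original sentence", "wikitree", position=2)

# With a Hugging Face tokenizer, new tokens like these are usually registered with
# tokenizer.add_special_tokens({"additional_special_tokens": [...]}) followed by
# model.resize_token_embeddings(len(tokenizer)), so they are not split into pieces.
```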




## Loss Function

To deal with the label imbalance, we compared Cross Entropy Loss with Class-Balanced Loss and Focal Loss.

### CE Loss

-      loss.
- baseline   loss,  loss .

### CB Loss

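- Re-weights the loss of each class by the inverse of its effective number of samples, so that rare classes contribute more.
- Paper: [Class-Balanced Loss Based on Effective Number of Samples](https://arxiv.org/pdf/1901.05555.pdf)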
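
A minimal sketch of the class weights from the paper, where a class with n samples gets a weight proportional to (1 - β) / (1 - βⁿ). The class names, counts, and `beta` value are made up; this is not the code in `utils/losses.py`.

```python
def class_balanced_weights(counts, beta=0.999):
    """Per-class weights proportional to (1 - beta) / (1 - beta ** n).

    counts maps each class to its (positive) number of training samples.
    The weights are rescaled so that they sum to the number of classes.
    """
    raw = {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}


# Hypothetical class counts, chosen only to show the effect
weights = class_balanced_weights({"frequent": 9000, "medium": 500, "rare": 40})
```

The resulting weights would then multiply the per-class loss terms (for example as the class weights of a weighted cross entropy).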

### Focal Loss

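- Down-weights examples the model already classifies well, so that training focuses on hard examples.
- Paper: [Focal Loss for Dense Object Detection](https://arxiv.org/pdf/1708.02002.pdf)

Focal Loss performed best.

- **LB Score: 73.8014 → 74.0923**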
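
For reference, Focal Loss for a single example is FL = -(1 - p_t)^γ · log(p_t), where p_t is the predicted probability of the true class. The plain-Python sketch below only illustrates the formula; the `gamma` default and the logits are arbitrary, and the training code itself would use tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example; gamma=0 reduces to cross entropy."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = [4.0, 0.0, 0.0]  # the model is already confident in class 0
hard = [0.0, 1.0, 0.0]  # the model prefers the wrong class

ce_easy, fl_easy = focal_loss(easy, 0, gamma=0.0), focal_loss(easy, 0)
ce_hard, fl_hard = focal_loss(hard, 0, gamma=0.0), focal_loss(hard, 0)
```

Comparing `fl_easy / ce_easy` with `fl_hard / ce_hard` shows that the easy example is scaled down far more than the hard one.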




## Modeling

### TAPT

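We based this on [Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/pdf/2004.10964.pdf). The usual approach fine-tunes a huggingface pretrained-model directly on the downstream-task, but the paper shows that task-specific pretraining helps. We therefore continued pretraining the huggingface pretrained-model with the MLM (Masked Language Model) objective.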

          . re-pretraining    wikitree, wikipedia, policy_briefing   wikitree wikipedia    pretraining         . , task-specific  re-pretraining  subject object   masking            .

  

### LSTM

             RNN  . KLUE-RoBERTa 3 linear layer    LSTM, Bi-LSTM, GRU  .



None of the recurrent variants beat the base model on validation f1.

- 
    
    
    |         | validation f1 | train loss | validation loss |
    | ------- | ------------- | ---------- | --------------- |
    | base    | 85.852        | 0.4184     | 0.5429          |
    | LSTM    | 81.134        | 0.5868     | 0.7024          |
    | Bi-LSTM | 85.535        | 0.1236     | 0.6249          |
    | GRU     | 85.608        | 0.1750     | 0.6273          |




## Hyper-parameter Tuning

We tuned the hyper-parameters with wandb sweep.

### Hyper-parameter list

- batch-size : 16, 24, 32
- learning_rate : 1e-05, 2e-05
- loss_function : Cross-Entropy, Focal Loss
- warm_up_ratio : 0, 0.1, 0.3, 0.6
- weight_decay : 0, 0.01
- lr_scheduler : Linear, Invsqrt, Cosine Annealing w/ Hard Restart
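
The list above could be written as a W&B sweep configuration roughly like this. The search `method`, the metric name, and the exact parameter keys and value spellings are assumptions; the values actually used are in `config/sweep_config.yaml`.

```python
# Sketch of a sweep definition in the dict form that wandb.sweep() accepts
sweep_config = {
    "method": "grid",  # assumption: the search strategy is not stated in the text
    "metric": {"name": "val_f1", "goal": "maximize"},  # assumption: metric name
    "parameters": {
        "batch_size": {"values": [16, 24, 32]},
        "learning_rate": {"values": [1e-5, 2e-5]},
        "loss_function": {"values": ["cross_entropy", "focal"]},
        "warm_up_ratio": {"values": [0, 0.1, 0.3, 0.6]},
        "weight_decay": {"values": [0, 0.01]},
        "lr_scheduler": {"values": ["linear", "invsqrt", "cosine_hard_restart"]},
    },
}

# Number of runs a full grid search over these values would need
n_runs = 1
for spec in sweep_config["parameters"].values():
    n_runs *= len(spec["values"])

# Typical usage (not executed here):
#   sweep_id = wandb.sweep(sweep_config, project="...")
#   wandb.agent(sweep_id, function=train)
```

A full grid over these values is large, which is one reason a random or Bayesian `method` is often chosen instead.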




## Ensemble

-  F1  AUPRC        .
- We tried Hard Voting, Soft Voting, and Weighted Voting (Hard, Soft).
- Weighted Voting (Hard) gave the best result.

|                        | micro_f1 | auprc   |
| ---------------------- | -------- | ------- |
| LB Public Score        | 76.7790  | 81.5786 |
| LB Private Score (1st) | 76.3907  | 83.4108 |
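
The voting schemes can be sketched in a few lines. The function names, toy predictions, and weights are illustrative assumptions; how the real weights were chosen is not shown here.

```python
from collections import defaultdict


def weighted_hard_vote(predictions, weights):
    """predictions: one list of predicted labels per model; weights: one float per model."""
    voted = []
    for i in range(len(predictions[0])):
        score = defaultdict(float)
        for preds, w in zip(predictions, weights):
            score[preds[i]] += w  # each model adds its weight to the label it chose
        voted.append(max(score, key=score.get))
    return voted


def weighted_soft_vote(probabilities, weights):
    """probabilities: one list of per-sample class-probability lists per model."""
    total = sum(weights)
    averaged = []
    for sample in zip(*probabilities):  # the same sample across all models
        n_classes = len(sample[0])
        averaged.append(
            [sum(w * p[c] for p, w in zip(sample, weights)) / total for c in range(n_classes)]
        )
    return averaged


# Three toy models voting on two samples
preds = [["x", "y"], ["y", "y"], ["y", "x"]]
hard_a = weighted_hard_vote(preds, weights=[0.5, 0.3, 0.3])
hard_b = weighted_hard_vote(preds, weights=[0.7, 0.2, 0.2])

soft = weighted_soft_vote([[[0.8, 0.2]], [[0.4, 0.6]]], weights=[1.0, 1.0])
```

Plain Hard Voting and Soft Voting are the special case where every weight is equal.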




##  

###  

- Base Code pytorch_lightning, torchmetrics, huggingface         .
-    main branch    Branch       main merge   Git .
-                    .
- offline meeting, online meeting       .

###  

-    train/validation dataset, seed             .
- commit message formatting, source code review, issue        .
-               .
-          .




## References

[1] [Matching the Blanks: Distributional Similarity for Relation Learning](https://aclanthology.org/P19-1279.pdf)

[2] [An Improved Baseline for Sentence-level Relation Extraction](https://arxiv.org/pdf/2102.01373.pdf)

[3] [Unified Semantic Typing with Meaningful Label Inference](https://arxiv.org/pdf/2205.01826v1.pdf)

[4] [Class-Balanced Loss Based on Effective Number of Samples](https://arxiv.org/pdf/1901.05555.pdf)

[5] [Focal Loss for Dense Object Detection](https://arxiv.org/pdf/1708.02002.pdf)

[6] [Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/pdf/2004.10964.pdf)