indonesian-sentence-embeddings

Embedding Representation for Indonesian Sentences!

https://github.com/lazarusnlp/indonesian-sentence-embeddings

Keywords

indonesia indonesian natural-language-processing sbert semantic-textual-similarity sentence-embeddings sentence-transformers unsupervised-learning

Repository

Basic Info

Host: GitHub
Owner: LazarusNLP
License: apache-2.0
Language: Jupyter Notebook
Default Branch: main
Homepage: https://lazarusnlp.github.io/indonesian-sentence-embeddings/
Size: 1.56 MB

Statistics

Stars: 17
Watchers: 1
Forks: 2
Open Issues: 0
Releases: 1

Created over 2 years ago · Last pushed over 1 year ago

Indonesian Sentence Embeddings

Inspired by Thai Sentence Vector Benchmark, we decided to embark on the journey of training Indonesian sentence embedding models!

Evaluation

Semantic Textual Similarity

We followed the approach of the Thai Sentence Vector Benchmark project and translated the STS-B dev and test sets to Indonesian via the Google Translate API. We use this dataset to evaluate our models' Spearman correlation score on the translated test set.

You can find the translated dataset on 🤗 HuggingFace Hub.

Further, we also evaluate our models on the SemRel2024 dataset, which contains human-annotated Indonesian semantic textual relatedness (STR) data. The dataset consists of two splits, dev and test, and we report our models' Spearman correlation score on both.
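A minimal sketch of the scoring step for both STS datasets, assuming each sentence pair is scored by the cosine similarity of its two embeddings (the similarity function is our assumption; the text only specifies Spearman correlation). The toy vectors and gold scores are made up for illustration; in practice the embeddings would come from the model under evaluation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical embeddings for three sentence pairs and their gold scores.
emb_a = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
emb_b = [[1.0, 0.1], [1.0, 0.0], [1.0, 0.0]]
gold = [4.8, 2.5, 0.2]

predicted = [cosine(a, b) for a, b in zip(emb_a, emb_b)]
score = spearman(predicted, gold)
print(f"Spearman correlation: {100 * score:.2f}%")
```

Because Spearman correlation only compares rankings, the scale of the gold scores does not matter, only their order relative to the predicted similarities.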

Retrieval

To evaluate our models' capability to perform retrieval tasks, we evaluate them on Indonesian subsets of MIRACL and TyDiQA datasets. In both datasets, the model's ability to retrieve relevant documents given a query is tested. We employ R@1 (top-1 accuracy), MRR@10, and nDCG@10 metrics to measure our model's performance.
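The three retrieval metrics can be sketched per query as below, assuming binary relevance labels (a document is either relevant or not) and a ranked list of document IDs produced by the model. The function names and document IDs are hypothetical; the reported numbers would be the mean of these values over all queries.

```python
import math


def recall_at_1(ranked_ids, relevant_ids):
    """Top-1 accuracy: 1.0 if the first retrieved document is relevant."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking for one query: the only relevant document is at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}

r1 = recall_at_1(ranked, relevant)
mrr = mrr_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
print(r1, mrr, round(ndcg, 4))
```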

Classification

For text classification, we perform emotion classification and sentiment analysis on the EmoT and SmSA subsets of IndoNLU, respectively. We follow the same approach as the Thai Sentence Vector Benchmark and simply fit a Linear SVC on the sentence representations of our texts with their corresponding labels. Thus, unlike conventional fine-tuning, where the backbone model is also updated, the Sentence Transformer stays frozen in our case and only the classification head is trained.

Further, we evaluate our models using the official MTEB code, which contains two Indonesian classification subtasks: MassiveIntentClassification (id) and MassiveScenarioClassification (id).
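The classification results are reported as accuracy and macro-averaged F1. The sketch below shows how these two numbers can be computed from a classifier's predictions; it deliberately leaves out the encoder and the Linear SVC themselves, and the labels are invented placeholders rather than the actual EmoT or SmSA label sets.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical gold labels and predictions from a classifier trained on
# frozen sentence embeddings.
y_true = ["happy", "sad", "anger", "happy", "sad", "anger"]
y_pred = ["happy", "sad", "happy", "happy", "anger", "anger"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"Accuracy: {100 * acc:.2f}%  F1 Macro: {100 * f1:.2f}%")
```

Macro averaging weights every class equally, so it penalizes a model that does well only on the frequent classes, which is why it can differ noticeably from accuracy on imbalanced data.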

Pair Classification

We followed MTEB's PairClassification evaluation procedure for pair classification. Specifically, for zero-shot natural language inference tasks, all neutral pairs are dropped, while contradictions and entailments are re-mapped to 0s and 1s. The maximum average precision (AP) score is obtained by searching for the best threshold value.

We leverage the IndoNLI dataset's two test subsets: test_lay and test_expert.
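The label re-mapping and the AP computation can be sketched as follows. This is an illustration under assumptions, not MTEB's code: the label strings, the example pairs, and the similarity scores are all hypothetical, and the scores stand in for whatever similarity the embedding model assigns to each premise and hypothesis pair.

```python
def to_binary_pairs(examples):
    """Drop neutral pairs; map contradiction -> 0 and entailment -> 1."""
    mapping = {"contradiction": 0, "entailment": 1}
    return [
        (premise, hypothesis, mapping[label])
        for premise, hypothesis, label in examples
        if label in mapping
    ]


def average_precision(labels, scores):
    """Mean of precision@k taken at each rank k where a positive pair occurs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for position, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


# Hypothetical NLI examples: (premise, hypothesis, label).
examples = [
    ("p1", "h1", "entailment"),
    ("p2", "h2", "neutral"),
    ("p3", "h3", "contradiction"),
    ("p4", "h4", "entailment"),
]
pairs = to_binary_pairs(examples)
labels = [label for _, _, label in pairs]

# Hypothetical similarity scores for the three remaining pairs, in order.
scores = [0.9, 0.6, 0.4]
ap = average_precision(labels, scores)
print(f"AP: {100 * ap:.2f}%")
```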

Methods

(Unsupervised) SimCSE

We followed SimCSE: Simple Contrastive Learning of Sentence Embeddings and trained a sentence embedding model in an unsupervised fashion. Unsupervised SimCSE lets us leverage unlabeled corpora, which are plentiful, and contrastively learn sentence representations by applying different dropout masks in the encoder. This suits the current lack of supervised Indonesian sentence similarity datasets, which makes SimCSE a natural first move into this field. We used the Sentence Transformers implementation of SimCSE.
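To make the contrastive objective concrete, here is a small sketch of an in-batch contrastive (InfoNCE-style) loss over two views of the same sentences. It is a plain-Python illustration, not the Sentence Transformers implementation: the two "views" are hand-written vectors standing in for the same sentence encoded twice with different dropout masks, and the temperature of 0.05 is an illustrative assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(view_a, view_b, temperature=0.05):
    """Mean cross-entropy where row i of view_a must pick row i of view_b.

    Every other row of view_b in the batch acts as a negative example.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Two sentences, each "encoded" twice; view_b is a slightly perturbed view_a.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = contrastive_loss(view_a, view_b)
# Reversing view_b pairs every sentence with the wrong positive.
mismatched_loss = contrastive_loss(view_a, list(reversed(view_b)))
print(aligned_loss, mismatched_loss)
```

The loss is near zero when each sentence is closest to its own second view and grows large when the views are mismatched, which is the signal that pulls the two dropout views of a sentence together and pushes other sentences apart.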

ConGen

Like SimCSE, ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation is another unsupervised technique to train a sentence embedding model. Since it is in part a distillation method, ConGen relies on a teacher model, which is distilled into a student model. The original paper proposes back-translation as the best data augmentation technique. However, due to limited resources, we implemented word deletion, which was found to be on par with back-translation despite its simplicity. We used the official ConGen implementation, which is written on top of the Sentence Transformers library.
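A word-deletion augmentation of this kind might look like the sketch below. The function name, the whitespace tokenization, and the default of deleting a single word are our assumptions for illustration, not the project's actual augmentation code.

```python
import random


def delete_words(sentence, n_deletions=1, rng=None):
    """Return a copy of `sentence` with up to `n_deletions` random words removed.

    Words are split on whitespace, and at least one word is always kept.
    """
    rng = rng or random.Random()
    words = sentence.split()
    n = min(n_deletions, len(words) - 1)
    if n <= 0:
        return sentence
    drop = set(rng.sample(range(len(words)), n))
    return " ".join(word for i, word in enumerate(words) if i not in drop)


# A seeded generator makes the augmentation reproducible.
original = "Saya suka makan nasi goreng"
augmented = delete_words(original, n_deletions=1, rng=random.Random(0))
print(augmented)
```

Each training sentence would then be paired with its augmented copy, giving the model two slightly different views of the same meaning without needing a translation system.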

SCT

SCT: An Efficient Self-Supervised Cross-View Training For Sentence Embedding is another unsupervised technique to train a sentence embedding model. It is very similar to ConGen in its knowledge distillation methodology, but it also supports a self-supervised training procedure without a teacher model. The original paper proposes back-translation as its data augmentation technique, but we implemented single-word deletion and found it to perform better than our back-translated corpus. We used the official SCT implementation, which is written on top of the Sentence Transformers library.

Pretrained Models

| Model | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
| --- | :-----: | --- | --- | --- | :--------: |
| SimCSE-IndoBERT Base | 125M | IndoBERT Base | N/A | Wikipedia | |
| ConGen-IndoBERT Lite Base | 12M | IndoBERT Lite Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| ConGen-IndoBERT Base | 125M | IndoBERT Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| ConGen-SimCSE-IndoBERT Base | 125M | SimCSE-IndoBERT Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| ConGen-Indo-e5 Small | 118M | multilingual-e5-small | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| SCT-IndoBERT Base | 125M | IndoBERT Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| all-IndoBERT Base | 125M | IndoBERT Base | N/A | See: README | ✅ |
| all-IndoBERT Base-v2 | 125M | IndoBERT Base | N/A | See: README | ✅ |
| all-IndoBERT Base-v4 | 125M | IndoBERT Base | N/A | See: README | ✅ |
| all-NusaBERT Base-v4 | 111M | NusaBERT Base | N/A | See: README | ✅ |
| all-NusaBERT Large-v4 | 337M | NusaBERT Large | N/A | See: README | ✅ |
| all-Indo-e5 Small-v2 | 118M | multilingual-e5-small | N/A | See: README | ✅ |
| all-Indo-e5 Small-v3 | 118M | multilingual-e5-small | N/A | See: README | ✅ |
| all-Indo-e5 Small-v4 | 118M | multilingual-e5-small | N/A | See: README | ✅ |
| distiluse-base-multilingual-cased-v2 | 134M | DistilBERT Base Multilingual | mUSE | See: SBERT | ✅ |
| paraphrase-multilingual-mpnet-base-v2 | 125M | XLM-RoBERTa Base | paraphrase-mpnet-base-v2 | See: SBERT | ✅ |
| multilingual-e5-small | 118M | Multilingual-MiniLM-L12-H384 | See: arXiv | See: 🤗 | ✅ |
| multilingual-e5-base | 278M | XLM-RoBERTa Base | See: arXiv | See: 🤗 | ✅ |
| multilingual-e5-large | 560M | XLM-RoBERTa Large | See: arXiv | See: 🤗 | ✅ |

Deprecated Models

| Model | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
| --- | :-----: | --- | --- | --- | :--------: |
| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 125M | [IndoRoBERTa Base](https://huggingface.co/flax-community/indonesian-roberta-base) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) | ✅ |
| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 125M | [IndoBERT Base p2](https://huggingface.co/indobenchmark/indobert-base-p2) | N/A | See: [README](./training/all/) | ✅ |

Results

Semantic Textual Similarity

Machine Translated Indonesian STS-B

| Model | Spearman's Correlation (%) ↑ |
| --- | :--------------------------: |
| SimCSE-IndoBERT Base | 70.13 |
| ConGen-IndoBERT Lite Base | 79.97 |
| ConGen-IndoBERT Base | 80.47 |
| ConGen-SimCSE-IndoBERT Base | 81.16 |
| ConGen-Indo-e5 Small | 80.94 |
| SCT-IndoBERT Base | 74.56 |
| all-IndoBERT Base | 73.84 |
| all-IndoBERT Base-v2 | 76.03 |
| all-IndoBERT Base-v4 | 75.99 |
| all-NusaBERT Base-v4 | 77.65 |
| all-NusaBERT Large-v4 | 79.23 |
| all-Indo-e5 Small-v2 | 79.57 |
| all-Indo-e5 Small-v3 | 79.95 |
| all-Indo-e5 Small-v4 | 79.85 |
| distiluse-base-multilingual-cased-v2 | 75.08 |
| paraphrase-multilingual-mpnet-base-v2 | 83.83 |
| multilingual-e5-small | 78.89 |
| multilingual-e5-base | 79.72 |
| multilingual-e5-large | 79.44 |

SemRel2024: Semantic Textual Relatedness (STR)

| Model | dev Spearman's Correlation (%) ↑ | test Spearman's Correlation (%) ↑ |
| --- | :--------------------------------: | :---------------------------------: |
| SimCSE-IndoBERT Base | 30.64 | 36.77 |
| ConGen-IndoBERT Lite Base | 35.95 | 41.73 |
| ConGen-IndoBERT Base | 35.05 | 39.14 |
| ConGen-SimCSE-IndoBERT Base | 33.71 | 37.73 |
| ConGen-Indo-e5 Small | 36.35 | 42.47 |
| SCT-IndoBERT Base | 41.50 | 43.25 |
| all-IndoBERT Base | 42.87 | 38.78 |
| all-IndoBERT Base-v2 | 41.68 | 40.42 |
| all-IndoBERT Base-v4 | 41.38 | 38.05 |
| all-NusaBERT Base-v4 | 42.11 | 41.55 |
| all-NusaBERT Large-v4 | 40.21 | 42.25 |
| all-Indo-e5 Small-v2 | 39.79 | 43.85 |
| all-Indo-e5 Small-v3 | 40.25 | 42.60 |
| all-Indo-e5 Small-v4 | 40.20 | 42.90 |
| distiluse-base-multilingual-cased-v2 | 37.22 | 49.35 |
| paraphrase-multilingual-mpnet-base-v2 | 34.56 | 37.51 |
| multilingual-e5-small | 41.92 | 49.60 |
| multilingual-e5-base | 41.29 | 45.04 |
| multilingual-e5-large | 39.20 | 45.04 |

Retrieval

MIRACL

| Model | R@1 (%) ↑ | MRR@10 (%) ↑ | nDCG@10 (%) ↑ |
| --- | :-------: | :----------: | :-----------: |
| SimCSE-IndoBERT Base | 36.04 | 48.25 | 39.70 |
| ConGen-IndoBERT Lite Base | 46.04 | 59.06 | 51.01 |
| ConGen-IndoBERT Base | 45.93 | 58.58 | 49.95 |
| ConGen-SimCSE-IndoBERT Base | 45.83 | 58.27 | 49.91 |
| ConGen-Indo-e5 Small | 55.00 | 66.74 | 58.95 |
| SCT-IndoBERT Base | 40.41 | 47.29 | 40.68 |
| all-IndoBERT Base | 65.52 | 75.92 | 70.13 |
| all-IndoBERT Base-v2 | 67.18 | 76.59 | 70.16 |
| all-IndoBERT Base-v4 | 67.91 | 77.37 | 70.97 |
| all-NusaBERT Base-v4 | 67.08 | 77.47 | 71.24 |
| all-NusaBERT Large-v4 | 68.43 | 78.29 | 71.99 |
| all-Indo-e5 Small-v2 | 68.33 | 78.33 | 73.04 |
| all-Indo-e5 Small-v3 | 68.12 | 78.22 | 73.09 |
| all-Indo-e5 Small-v4 | 68.33 | 78.41 | 73.23 |
| distiluse-base-multilingual-cased-v2 | 41.35 | 54.93 | 48.79 |
| paraphrase-multilingual-mpnet-base-v2 | 52.81 | 65.07 | 57.97 |
| multilingual-e5-small | 70.20 | 79.61 | 74.80 |
| multilingual-e5-base | 70.00 | 79.50 | 75.16 |
| multilingual-e5-large | 70.83 | 80.58 | 76.16 |

TyDiQA

| Model | R@1 (%) ↑ | MRR@10 (%) ↑ | nDCG@10 (%) ↑ |
| --- | :-------: | :----------: | :-----------: |
| SimCSE-IndoBERT Base | 61.94 | 69.89 | 73.52 |
| ConGen-IndoBERT Lite Base | 75.22 | 81.55 | 84.13 |
| ConGen-IndoBERT Base | 73.09 | 80.32 | 83.29 |
| ConGen-SimCSE-IndoBERT Base | 72.38 | 79.37 | 82.51 |
| ConGen-Indo-e5 Small | 84.60 | 89.30 | 91.27 |
| SCT-IndoBERT Base | 76.81 | 83.16 | 85.87 |
| all-IndoBERT Base | 88.14 | 91.47 | 92.91 |
| all-IndoBERT Base-v2 | 87.61 | 90.91 | 92.31 |
| all-IndoBERT Base-v4 | 89.02 | 92.59 | 93.91 |
| all-NusaBERT Base-v4 | 92.74 | 94.95 | 95.73 |
| all-NusaBERT Large-v4 | 93.62 | 95.77 | 96.56 |
| all-Indo-e5 Small-v2 | 93.27 | 95.63 | 96.46 |
| all-Indo-e5 Small-v3 | 93.27 | 95.72 | 96.58 |
| all-Indo-e5 Small-v4 | 93.45 | 95.66 | 96.43 |
| distiluse-base-multilingual-cased-v2 | 70.44 | 77.94 | 81.56 |
| paraphrase-multilingual-mpnet-base-v2 | 81.41 | 87.05 | 89.44 |
| multilingual-e5-small | 91.50 | 94.34 | 95.39 |
| multilingual-e5-base | 93.45 | 95.88 | 96.69 |
| multilingual-e5-large | 94.69 | 96.71 | 97.44 |

Classification

MTEB - Massive Intent Classification `(id)`

| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
| --- | :------------: | :------------: |
| SimCSE-IndoBERT Base | 59.71 | 57.70 |
| ConGen-IndoBERT Lite Base | 62.41 | 60.94 |
| ConGen-IndoBERT Base | 61.14 | 60.02 |
| ConGen-SimCSE-IndoBERT Base | 60.93 | 59.50 |
| ConGen-Indo-e5 Small | 62.92 | 60.18 |
| SCT-IndoBERT Base | 55.66 | 54.48 |
| all-IndoBERT Base | 58.40 | 57.21 |
| all-IndoBERT Base-v2 | 58.31 | 57.11 |
| all-IndoBERT Base-v4 | 57.80 | 56.71 |
| all-NusaBERT Base-v4 | 62.10 | 60.38 |
| all-NusaBERT Large-v4 | 61.41 | 59.93 |
| all-Indo-e5 Small-v2 | 61.51 | 59.24 |
| all-Indo-e5 Small-v3 | 61.63 | 59.29 |
| all-Indo-e5 Small-v4 | 61.38 | 59.07 |
| distiluse-base-multilingual-cased-v2 | 55.99 | 52.44 |
| paraphrase-multilingual-mpnet-base-v2 | 65.43 | 63.55 |
| multilingual-e5-small | 64.16 | 61.33 |
| multilingual-e5-base | 66.63 | 63.88 |
| multilingual-e5-large | 70.04 | 67.66 |

MTEB - Massive Scenario Classification `(id)`

| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
| --- | :------------: | :------------: |
| SimCSE-IndoBERT Base | 66.14 | 65.56 |
| ConGen-IndoBERT Lite Base | 67.25 | 66.53 |
| ConGen-IndoBERT Base | 67.72 | 67.32 |
| ConGen-SimCSE-IndoBERT Base | 67.12 | 66.64 |
| ConGen-Indo-e5 Small | 66.92 | 66.29 |
| SCT-IndoBERT Base | 61.89 | 60.97 |
| all-IndoBERT Base | 66.37 | 66.31 |
| all-IndoBERT Base-v2 | 66.02 | 65.97 |
| all-IndoBERT Base-v4 | 66.33 | 66.14 |
| all-NusaBERT Base-v4 | 70.17 | 70.18 |
| all-NusaBERT Large-v4 | 70.10 | 70.38 |
| all-Indo-e5 Small-v2 | 67.02 | 66.86 |
| all-Indo-e5 Small-v3 | 67.27 | 67.13 |
| all-Indo-e5 Small-v4 | 67.33 | 67.24 |
| distiluse-base-multilingual-cased-v2 | 65.25 | 63.45 |
| paraphrase-multilingual-mpnet-base-v2 | 70.72 | 70.58 |
| multilingual-e5-small | 67.92 | 67.23 |
| multilingual-e5-base | 70.70 | 70.26 |
| multilingual-e5-large | 74.11 | 73.82 |

IndoNLU - Emotion Classification (EmoT)

| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
| --- | :------------: | :------------: |
| SimCSE-IndoBERT Base | 55.45 | 55.78 |
| ConGen-IndoBERT Lite Base | 58.18 | 58.84 |
| ConGen-IndoBERT Base | 57.04 | 57.06 |
| ConGen-SimCSE-IndoBERT Base | 59.54 | 60.37 |
| ConGen-Indo-e5 Small | 60.00 | 60.52 |
| SCT-IndoBERT Base | 61.13 | 61.70 |
| all-IndoBERT Base | 57.27 | 57.47 |
| all-IndoBERT Base-v2 | 58.86 | 59.31 |
| all-IndoBERT Base-v4 | 61.36 | 61.81 |
| all-NusaBERT Base-v4 | 53.18 | 53.01 |
| all-NusaBERT Large-v4 | 63.18 | 63.17 |
| all-Indo-e5 Small-v2 | 58.18 | 57.99 |
| all-Indo-e5 Small-v3 | 56.81 | 56.46 |
| all-Indo-e5 Small-v4 | 56.94 | 57.04 |
| distiluse-base-multilingual-cased-v2 | 63.63 | 64.13 |
| paraphrase-multilingual-mpnet-base-v2 | 63.18 | 63.78 |
| multilingual-e5-small | 64.54 | 65.04 |
| multilingual-e5-base | 68.63 | 69.07 |
| multilingual-e5-large | 74.77 | 74.66 |

IndoNLU - Sentiment Analysis (SmSA)

| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
| --- | :------------: | :------------: |
| SimCSE-IndoBERT Base | 85.6 | 81.50 |
| ConGen-IndoBERT Lite Base | 81.2 | 75.59 |
| ConGen-IndoBERT Base | 85.4 | 82.12 |
| ConGen-SimCSE-IndoBERT Base | 83.0 | 78.74 |
| ConGen-Indo-e5 Small | 84.2 | 80.21 |
| SCT-IndoBERT Base | 82.0 | 76.92 |
| all-IndoBERT Base | 84.4 | 79.79 |
| all-IndoBERT Base-v2 | 83.4 | 79.04 |
| all-IndoBERT Base-v4 | 82.4 | 77.82 |
| all-NusaBERT Base-v4 | 84.2 | 78.68 |
| all-NusaBERT Large-v4 | 84.8 | 81.01 |
| all-Indo-e5 Small-v2 | 82.0 | 78.15 |
| all-Indo-e5 Small-v3 | 82.6 | 78.98 |
| all-Indo-e5 Small-v4 | 82.6 | 79.14 |
| distiluse-base-multilingual-cased-v2 | 78.8 | 73.64 |
| paraphrase-multilingual-mpnet-base-v2 | 89.6 | 86.56 |
| multilingual-e5-small | 83.6 | 79.51 |
| multilingual-e5-base | 89.4 | 86.22 |
| multilingual-e5-large | 90.0 | 86.50 |

Pair Classification

IndoNLI

| Model | test_lay AP (%) ↑ | test_expert AP (%) ↑ |
| --- | :-----------------: | :--------------------: |
| SimCSE-IndoBERT Base | 56.06 | 50.72 |
| ConGen-IndoBERT Lite Base | 69.44 | 53.74 |
| ConGen-IndoBERT Base | 71.14 | 56.35 |
| ConGen-SimCSE-IndoBERT Base | 70.80 | 56.59 |
| ConGen-Indo-e5 Small | 70.51 | 55.67 |
| SCT-IndoBERT Base | 59.82 | 53.41 |
| all-IndoBERT Base | 72.01 | 56.79 |
| all-IndoBERT Base-v2 | 71.36 | 56.83 |
| all-IndoBERT Base-v4 | 70.99 | 58.99 |
| all-NusaBERT Base-v4 | 73.07 | 59.86 |
| all-NusaBERT Large-v4 | 73.26 | 61.14 |
| all-Indo-e5 Small-v2 | 76.29 | 57.05 |
| all-Indo-e5 Small-v3 | 75.21 | 56.62 |
| all-Indo-e5 Small-v4 | 75.05 | 57.42 |
| distiluse-base-multilingual-cased-v2 | 58.48 | 50.50 |
| paraphrase-multilingual-mpnet-base-v2 | 74.87 | 57.96 |
| multilingual-e5-small | 63.97 | 51.85 |
| multilingual-e5-base | 60.25 | 50.91 |
| multilingual-e5-large | 61.39 | 51.62 |

Credits

Indonesian Sentence Embeddings is developed with love by:

Owner

Name: LazarusNLP
Login: LazarusNLP
Kind: organization
Location: Indonesia

Website: https://lazarusnlp.github.io/
Repositories: 1
Profile: https://github.com/LazarusNLP

Lazarus NLP is a collective initiative to revive the dying languages of Indonesia through speech and language technology.

Citation (CITATION.cff)

abstract: Embedding Representation for Indonesian Sentences!
authors:
- affiliation: Lazarus NLP
  family-names: Wongso
  given-names: Wilson
  orcid: 0000-0003-0896-1941
- affiliation: Lazarus NLP
  family-names: Joyoadikusumo
  given-names: Ananto
  orcid: 0009-0004-1761-4137
- affiliation: Lazarus NLP
  family-names: Setiawan
  given-names: David Samuel
  orcid: 0009-0001-2286-8439
- affiliation: Lazarus NLP
  family-names: Limcorn
  given-names: Steven
cff-version: 1.2.0
date-released: '2024-04-17'
doi: 10.5281/zenodo.10983756
license:
- apache-2.0
repository-code: https://github.com/LazarusNLP/indonesian-sentence-embeddings/tree/v0.0.1
title: 'LazarusNLP/indonesian-sentence-embeddings: v0.0.1'
type: software
version: v0.0.1

Science Score: 67.0%
