polyglot-or-not

Are foundation LMs multilingual knowledge bases? (EMNLP 2023)

https://github.com/daniel-furman/polyglot-or-not


Keywords

emnlp2023 machine-learning nlp pytorch text-generation transformer

Repository

Basic Info
Statistics
  • Stars: 19
  • Watchers: 4
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed about 2 years ago

README.md

Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge in Foundation Models

License Python 3.9+ Code style: black

This is the repository for the following paper: Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge in Foundation Models, which was published in EMNLP 2023 (Main). It contains several research artifacts, including:

  1. The code for running the fact-completion test
  2. Our dataset of factual associations translated into 20 languages
  3. A demo of contrastive knowledge assessment

Method

Given a factual association such as "The capital of France is **Paris**", we determine whether a model adequately "knows" the correct completion with the following test:

  • Step 1: prompt the model to predict the likelihood of the token "Paris" following the stem "The capital of France is".
  • Step 2: prompt the model to predict the average likelihood of a set of false, counterfactual tokens following the same stem.

If the value from Step 1 is greater than the value from Step 2, we conclude that the model adequately recalls that fact. Formally, this is an application of the Contrastive Knowledge Assessment proposed in [1].
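
The two-step comparison above can be sketched in a few lines. This is a minimal illustration rather than the repository's implementation: the function name `recalls_fact` and all probability values are hypothetical, and in practice the probabilities would be read from a model's next-token distribution over its vocabulary.

```python
from statistics import mean
from typing import Sequence


def recalls_fact(p_true: float, p_counterfacts: Sequence[float]) -> bool:
    """Return True when the true completion beats the average counterfactual.

    p_true: probability the model assigns to the correct token (Step 1).
    p_counterfacts: probabilities assigned to the false tokens (Step 2).
    """
    if not p_counterfacts:
        raise ValueError("at least one counterfactual probability is required")
    return p_true > mean(p_counterfacts)


# Hypothetical probabilities for the stem "The capital of France is".
p_paris = 0.62                # made-up value for the true token "Paris"
p_false = [0.05, 0.02, 0.01]  # made-up values for counterfactual tokens
print(recalls_fact(p_paris, p_false))
```

Because the test only compares two likelihoods under the same stem, it needs the model's probability for arbitrary tokens, not just its top prediction.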

Models Evaluated

We evaluate 5 foundation models of interest, such as Llama [2], in a multilingual setting. We perform this assessment with 303k fact-completions spanning 20 languages (see Table 1).

In addition to our multilingual assessment, we also scored a diverse set of ~30 models (like Mistral, Llama-2, and Falcon) on the English-only subset of our dataset, which comprises 26.3k fact-completions.

While we would have liked to test closed-source models, such as OpenAI's GPT-4, such models do not provide vocabulary-wide probabilities at inference. They are thus currently incompatible with our contrastive knowledge assessment test. As such, our study demonstrates the need for all LLMs, open and closed, to produce vocabulary-wide probabilities for more robust evaluations.

Data Release

We present 303k unique fact-completions in Polyglot-or-Not/Fact-Completion, which are in the form of {stem, fact, counterfact} triples. See the dataset viewer for a closer look.

  • 20 Latin/Cyrillic script languages are included. The ISO 639-1 language codes are: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, and uk.
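
As a rough sketch of how one such triple could be represented in code, the snippet below pairs a record type with the 20 language codes listed above. The class name `FactCompletion`, its field names, and the counterfactual value "Rome" are illustrative assumptions; the released dataset's actual column names and layout may differ.

```python
from dataclasses import dataclass

# The 20 ISO 639-1 codes listed above.
LANGUAGES = (
    "bg", "ca", "cs", "da", "de", "en", "es", "fr", "hr", "hu",
    "it", "nl", "pl", "pt", "ro", "ru", "sl", "sr", "sv", "uk",
)


@dataclass(frozen=True)
class FactCompletion:
    """One {stem, fact, counterfact} triple (hypothetical field names)."""

    lang: str         # ISO 639-1 code of the language the stem is written in
    stem: str         # prompt text that the model is asked to complete
    fact: str         # the true completion
    counterfact: str  # a false completion for the same stem

    def __post_init__(self) -> None:
        # Reject language codes outside the released set.
        if self.lang not in LANGUAGES:
            raise ValueError(f"unsupported language code: {self.lang!r}")


# Illustrative English example; "Rome" is a made-up counterfactual.
example = FactCompletion("en", "The capital of France is", "Paris", "Rome")
print(example)
```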

The factual associations were originally sourced from English-language Wikidata curated in the T-REx dataset [3] as utilized in factual association research such as [1] and [4]. We used the Google Translate API alongside bespoke wrapper code to programmatically generate the non-English cuts.

Test Results

### Multilingual leaderboard

| model | accuracy (%) | params | n tokens |
|------------------|:--------------:|:--------------:|:--------------:|
| llama-33b | 79.31 (+/- 0.74) | 32.5B | 1.4T |
| m-bert | 62.00 (+/- 0.87) | 110M | - |
| bloom-7b1 | 57.70 (+/- 0.88) | 7.1B | 341B |
| xlm-roberta | 56.03 (+/- 0.90) | 355M | 295B |
| mt5-xl | 52.51 (+/- 0.91) | 3.7B | - |
| Random Baseline | 50 | - | - |

Table 1: Multilingual test leaderboard. Here, accuracy refers to the average performance of each model across 20 distinct languages. The uncertainty estimates represent averaged 95% confidence intervals computed from 10000 bootstrap iterations per language. Params and n tokens record each model's number of parameters and number of dataset tokens, respectively (when such data is available). These results reveal that models struggle to recall facts in a multilingual setting, as compared to their English-only performance (Table 2). For instance, on average, Llama-33B's accuracy decreased by approximately 11% from English to non-English languages.
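
For readers unfamiliar with bootstrap confidence intervals, the following is a minimal sketch of how an interval for a single language's accuracy could be computed. It assumes the percentile method and per-example 0/1 outcomes; the caption above states only the number of iterations and the confidence level, so the exact procedure used for the leaderboard may differ. The function name `bootstrap_ci` and the example data are hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_iter: int = 10000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy over 0/1 outcomes."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the outcomes with replacement and record each resample's accuracy.
    accuracies = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_iter)
    )
    # Read the interval bounds off the sorted resampled accuracies.
    lower = accuracies[int(n_iter * alpha / 2)]
    upper = accuracies[min(n_iter - 1, int(n_iter * (1 - alpha / 2)))]
    return lower, upper


# Made-up outcomes: 80 correct and 20 incorrect fact-completions.
outcomes = [1] * 80 + [0] * 20
low, high = bootstrap_ci(outcomes)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Under this reading, the multilingual figure would be obtained by computing one such interval per language and averaging the interval widths across the 20 languages.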

### English-only leaderboard

| model | accuracy (%) | params | n tokens |
|------------------|:--------------:|:--------------:|:--------------:|
| falcon-180b | 91.53 (+/- 0.34) | 180B | 3.5T |
| llama-2-70b | 90.86 (+/- 0.35) | 70B | 2T |
| llama-65b | 89.56 (+/- 0.37) | 65.2B | 1.4T |
| llama-33b | 89.40 (+/- 0.38) | 32.5B | 1.4T |
| llama-2-13b | 87.51 (+/- 0.40) | 13B | 2T |
| falcon-40b | 87.01 (+/- 0.41) | 40B | 1T |
| mistral-7b-v0.1 | 86.88 (+/- 0.41) | 7.3B | - |
| llama-13b | 86.66 (+/- 0.42) | 12.5B | 1T |
| llama-2-7b | 86.22 (+/- 0.42) | 7B | 2T |
| llama-7b | 85.53 (+/- 0.43) | 6.7B | 1T |
| mpt-30b | 85.09 (+/- 0.43) | 30B | 1T |
| redpajama-7b | 85.07 (+/- 0.44) | 7B | 800B |
| mpt-7b | 83.39 (+/- 0.46) | 7B | 1T |
| opt-30b | 82.09 (+/- 0.47) | 30B | 180B |
| redpajama-3b | 82.09 (+/- 0.47) | 3B | 800B |
| opt-13b | 81.94 (+/- 0.46) | 13B | 30B |
| gpt-neox-20b | 81.50 (+/- 0.47) | 20B | 420B |
| falcon-7b | 81.34 (+/- 0.47) | 7B | 1.5T |
| gpt-j-6b | 81.14 (+/- 0.47) | 6B | 420B |
| pythia-12b | 80.53 (+/- 0.48) | 12B | 420B |
| t5-v1-xxl | 76.55 (+/- 0.52) | 13B | 34B |
| bloom-7b1 | 76.16 (+/- 0.51) | 7.1B | 341B |
| gpt2-xl | 73.76 (+/- 0.54) | 1.5B | - |
| bert | 72.60 (+/- 0.54) | 110M | - |
| m-bert | 71.80 (+/- 0.55) | 110M | - |
| stablelm-7b | 68.85 (+/- 0.55) | 7B | 1.5T |
| distilgpt2 | 64.23 (+/- 0.59) | 82M | - |
| mt5-xxl | 61.58 (+/- 0.59) | 13B | - |
| xlm-roberta | 61.55 (+/- 0.59) | 355M | 295B |
| mt5-xl | 59.96 (+/- 0.59) | 3.7B | - |
| Random Baseline | 50 | - | - |

Table 2: Monolingual test leaderboard. Accuracy represents performance on English-only data. The uncertainty estimates are 95% confidence intervals computed from 10000 bootstrap iterations. Params and n tokens record each model's number of parameters and number of dataset tokens, respectively (when such data is available). Consistent with the trends in Table 1, Llamas of varying sizes emerge as the front-runners.

Llama-33B performance across languages

Figure 1: Llama-33B's test performance across languages. Accuracy denotes the model's performance assessed individually for each language. The Llama-33B model demonstrates higher proficiency with languages utilizing the Latin script as compared to those using the Cyrillic script (Ukrainian, Bulgarian, Russian, and Serbian). A chi-squared test substantiates a significant dependency of the model's test performance on the language script (χ² = 3570.576, p < 0.001).
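
To illustrate the kind of test referenced in the caption, here is a minimal sketch of a Pearson chi-squared test of independence on a 2x2 table of script (Latin, Cyrillic) against outcome (correct, incorrect). The table layout, the absence of a continuity correction, and the counts are all assumptions made for illustration; they are not the study's data, and the function name `chi_squared_2x2` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def chi_squared_2x2(table: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Pearson chi-squared statistic and p-value for a 2x2 table (1 d.o.f.)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of script and outcome.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # With one degree of freedom, the upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up counts: rows are (Latin, Cyrillic), columns are (correct, incorrect).
counts = [[90, 10], [70, 30]]
statistic, p_value = chi_squared_2x2(counts)
print(f"chi2 = {statistic:.3f}, p = {p_value:.6f}")
```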

Authors

Advisor

Citation

Please cite this repository as follows if you use its data or code:

@misc{schott2023polyglot,
  doi = {10.48550/arXiv.2305.13675},
  title = {Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge in Foundation Models},
  author = {Tim Schott and Daniel Furman and Shreshta Bhat},
  year = {2023},
  eprint = {2305.13675},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}

Bibliography

[1] Calibrating Factual Knowledge in Pretrained Language Models. Dong, Qingxiu, Damai Dai, Yifan Song, Jingjing Xu, Zhifang Sui, and Lei Li. In Findings of the Association for Computational Linguistics: EMNLP 2022. arXiv:2210.03329 (2022).

[2] Llama: Open and Efficient Foundation Language Models. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. https://arxiv.org/abs/2302.13971v1 (2023).

  • Llama weights were accessed with the approval of Meta AI and used in accordance with the License (see link for more details).

[3] T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. ElSahar, Hady, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon S. Hare, Frédérique Laforest, and Elena Paslaru Bontas Simperl. International Conference on Language Resources and Evaluation (2018).

[4] Mass Editing Memory in a Transformer. Meng, Kevin, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. arXiv preprint arXiv:2210.07229 (2022).

Owner

  • Name: Daniel Furman
  • Login: daniel-furman
  • Kind: user
  • Location: San Francisco
  • Company: @twosixcapital

Master’s student, UC Berkeley School of Information. University of Pennsylvania alum. DS @twosixcapital. Prev MLE @understory.ai.
