https://github.com/ai-forever/mgpt

Multilingual Generative Pretrained Model

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.0%) to scientific vocabulary

Keywords

chinese english generative-model gpt-2 gpt-3 hindi korean multilingual multilingual-models natural-language-generation natural-language-processing russian transformer transformers

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: ai-forever
License: other
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 6.18 MB

Statistics

Stars: 205
Watchers: 12
Forks: 22
Open Issues: 7
Releases: 0

Created about 4 years ago · Last pushed about 2 years ago

Metadata Files

Readme License

mGPT

Multilingual Generative Pretrained Transformer

We introduce mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from 25 linguistically diverse language families using Wikipedia and the C4 Corpus. We detail the design and pretraining procedure. The models undergo intrinsic and extrinsic evaluation: language modeling in all languages, downstream evaluation on cross-lingual NLU datasets and benchmarks in 33 languages, and world knowledge probing in 23 languages. The in-context learning abilities are on par with contemporaneous language models while covering a larger number of languages, including underrepresented and low-resource languages of the Commonwealth of Independent States and the indigenous peoples in Russia. The source code and the language models are available under the MIT license.

[Paper] [Habr (Russian)] [Hugging Face mGPT-1.3B Model Card] [Hugging Face mGPT-13B Model Card] [Papers With Code]

### Setting up environment

`pip install -r requirements.txt`

Pretrain data

The model was pretrained on 600 GB of text, mainly from MC4 and Wikipedia:
- MC4
- Wikipedia (version 20201101)

The Wikipedia texts are extracted from the dumps (v. 20201101) with WikiExtractor (Attardi, 2015). The training data was deduplicated by computing a 64-bit hash of each text in the corpus and keeping only texts with a unique hash. We also filter the documents based on their text compression rate using zlib. The most strongly and weakly compressing deduplicated texts are discarded.
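The deduplication and compression-based filtering described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the use of BLAKE2b as the 64-bit hash, the helper names, and the ratio thresholds are all assumptions, since the source does not specify them.

```python
import hashlib
import zlib


def hash64(text: str) -> int:
    """Return a 64-bit hash of the text (BLAKE2b with an 8-byte digest; assumed choice)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; higher means more repetitive text."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def deduplicate_and_filter(texts, min_ratio=1.2, max_ratio=8.0):
    """Keep one copy per unique hash, then drop texts with extreme compression ratios.

    The default thresholds are hypothetical; the source does not state the cut-offs.
    """
    seen = set()
    kept = []
    for text in texts:
        if not text:
            continue
        h = hash64(text)
        if h in seen:
            continue  # exact duplicate of an earlier text
        seen.add(h)
        if min_ratio <= compression_ratio(text) <= max_ratio:
            kept.append(text)
    return kept
```

In practice the thresholds would be tuned on the corpus itself, for example by inspecting the distribution of compression ratios and cutting off both tails.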

Transformers usage 🤗

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Use the GPU when one is available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/mGPT")
model = GPT2LMHeadModel.from_pretrained("sberbank-ai/mGPT").to(device)

# Prompt: "Alexander Sergeyevich Pushkin was born in "
text = "Александр Сергеевич Пушкин родился в "
input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
out = model.generate(
    input_ids,
    min_length=100,
    max_length=100,
    eos_token_id=5,
    pad_token_id=1,
    top_k=10,
    top_p=0.0,
    no_repeat_ngram_size=5,
)
generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)
# Example output:
# Александр Сергеевич Пушкин родился в г. Санкт-Петербурге.
```

Choosing best parameters:

In general: min_length=100, eos_token_id=5, pad_token_id=1, do_sample=True, top_k=0, top_p=0.8, no_repeat_ngram_size=4

English Generation: top_p=0.95, top_k=0
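The recommended settings above can be collected into reusable keyword arguments. The sketch below is one possible arrangement: the helper name `generation_kwargs` and the language-keyed override table are illustrative, and `pad_token_id` is used because that is the keyword `generate` in Hugging Face Transformers accepts for the padding token id.

```python
# General-purpose sampling settings listed under "Choosing best parameters".
GENERAL_PARAMS = {
    "min_length": 100,
    "eos_token_id": 5,
    "pad_token_id": 1,  # keyword name accepted by `generate` for the padding token id
    "do_sample": True,
    "top_k": 0,
    "top_p": 0.8,
    "no_repeat_ngram_size": 4,
}

# Per-language overrides; only English is given in the source.
LANGUAGE_OVERRIDES = {
    "en": {"top_p": 0.95, "top_k": 0},
}


def generation_kwargs(lang=None):
    """Return the general settings, updated with overrides for `lang` if any exist."""
    params = dict(GENERAL_PARAMS)
    params.update(LANGUAGE_OVERRIDES.get(lang, {}))
    return params
```

With a model and `input_ids` prepared as in the Transformers usage section, the call would look like `model.generate(input_ids, **generation_kwargs("en"))`.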

Examples

mGPT Generation Examples

mGPT Fine-tuning example

Languages supported

Afrikaans (af), Arabic (ar), Armenian (hy), Azerbaijani (az), Basque (eu), Bashkir (ba), Belarusian (be), Bengali (bn), Bulgarian (bg), Burmese (my), Buryat (bxr), Chuvash (cv), Danish (da), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Javanese (jv), Kalmyk (xal), Kazakh (kk), Korean (ko), Kyrgyz (ky), Latvian (lv), Lithuanian (lt), Malay (ms), Malayalam (ml), Marathi (mr), Mongolian (mn), Ossetian (os), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Spanish (es), Swedish (sv), Swahili (sw), Tatar (tt), Telugu (te), Thai (th), Turkish (tr), Turkmen (tk), Tuvan (tyv), Ukrainian (uk), Uzbek (uz), Vietnamese (vi), Yakut (sah), Yoruba (yo)

Pretraining

[mGPT-1.3B Model Card] [mGPT-13B Model Card]

We use the DeepSpeed library and Megatron-LM. We pretrain our LMs with a total batch size of 2048 and a context window of 512 tokens. The total number of training steps is 600k, and the models have seen 400B tokens during pretraining. Pretraining took 14 days on a cluster of 256 V100 GPUs for mGPT-1.3B and 22 days on 512 V100 GPUs for mGPT-13B.
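To put the compute figures in comparable units, the short sketch below converts the quoted durations and cluster sizes into GPU-days and computes the number of tokens in one batch. It assumes the batch size is counted in sequences and that every sequence fills the 512-token window; the helper names are illustrative.

```python
# Figures quoted in the pretraining description.
BATCH_SIZE = 2048      # sequences per step (assumed unit)
CONTEXT_WINDOW = 512   # tokens per sequence

RUNS = {
    "mGPT-1.3B": {"days": 14, "gpus": 256},
    "mGPT-13B": {"days": 22, "gpus": 512},
}


def tokens_per_step(batch_size=BATCH_SIZE, context=CONTEXT_WINDOW):
    """Upper bound on tokens per step if every sequence fills the context window."""
    return batch_size * context


def gpu_days(days, gpus):
    """Wall-clock days multiplied by the number of GPUs."""
    return days * gpus


for name, run in RUNS.items():
    print(f"{name}: {gpu_days(**run):,} V100 GPU-days")
print(f"Tokens per step (upper bound): {tokens_per_step():,}")
```

Under these assumptions the 1.3B run amounts to 3,584 V100 GPU-days and the 13B run to 11,264.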

Monolingual models:

Habr article about the monolingual mGPT-1.3B models (Russian)

Monolingual models on Hugging Face:
- 🇦🇲 mGPT-1.3B Armenian
- 🇦🇿 mGPT-1.3B Azerbaijan
- 🍯 mGPT-1.3B Bashkir
- 🇧🇾 mGPT-1.3B Belorussian
- 🇧🇬 mGPT-1.3B Bulgarian
- 🌞 mGPT-1.3B Buryat
- 🌳 mGPT-1.3B Chuvash
- 🇬🇪 mGPT-1.3B Georgian
- 🌸 mGPT-1.3B Kalmyk
- 🇰🇿 mGPT-1.3B Kazakh
- 🇰🇬 mGPT-1.3B Kirgiz
- 🐻 mGPT-1.3B Mari
- 🇲🇳 mGPT-1.3B Mongol
- 🐆 mGPT-1.3B Ossetian
- 🇮🇷 mGPT-1.3B Persian
- 🇷🇴 mGPT-1.3B Romanian
- 🇹🇯 mGPT-1.3B Tajik
- ☕ mGPT-1.3B Tatar
- 🇹🇲 mGPT-1.3B Turkmen
- 🐎 mGPT-1.3B Tuvan
- 🇺🇦 mGPT-1.3B Ukranian
- 🇺🇿 mGPT-1.3B Uzbek
- 💎 mGPT-1.3B Yakut

Contributing

We welcome community contributions to the model, including improvements to both inference and training techniques.

## Cite Us

@article{shliazhko2024mgpt,
  title={mGPT: Few-Shot Learners Go Multilingual},
  author={Shliazhko, Oleh and Fenogenova, Alena and Tikhonova, Maria and Kozlova, Anastasia and Mikhailov, Vladislav and Shavrina, Tatiana},
  journal={Transactions of the Association for Computational Linguistics},
  volume={12},
  pages={58--79},
  year={2024},
  publisher={MIT Press One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA~…}
}

Owner

Name: AI Forever
Login: ai-forever
Kind: organization
Location: Armenia

Repositories: 60
Profile: https://github.com/ai-forever

Creating ML for the future. AI projects you already know. We are a non-profit organization with members from all over the world.

GitHub Events

Total

Watch event: 9
Fork event: 1

Last Year

Watch event: 9
Fork event: 1

Committers

Last synced: 11 months ago

All Time

Total Commits: 15
Total Committers: 5
Avg Commits per committer: 3.0
Development Distribution Score (DDS): 0.333

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Tatiana Shavrina	r**s@g**m	10
Oleh Shliazhko	s**g@g**m	2
Vladislav Mikhailov	4****v	1
alenusch	a**3@g**m	1
Rybolos	S**O@s**u	1

Committer Domains (Top 20 + Academic)

sberbank.ru: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 13
Total pull requests: 1
Average time to close issues: 11 days
Average time to close pull requests: N/A
Total issue authors: 11
Total pull request authors: 1
Average comments per issue: 1.54
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 1.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

ajesujoba (2)
ymoslem (2)
riyajatar37003 (1)
edmondja (1)
rovodanica (1)
AK391 (1)
malteos (1)
p0mad (1)
unnir (1)
0x7o (1)
