yalm-100b

Pretrained language model with 100B parameters

https://github.com/yandex/yalm-100b

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.1%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: yandex
License: apache-2.0
Language: Python
Default Branch: main
Size: 304 KB

Statistics

Stars: 3,755
Watchers: 48
Forks: 296
Open Issues: 19
Releases: 0

Created about 4 years ago · Last pushed almost 3 years ago

Metadata Files

Readme License Citation

YaLM 100B

YaLM 100B is a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.

The model leverages 100 billion parameters. It took 65 days to train the model on a cluster of 800 A100 graphics cards and 1.7 TB of online texts, books, and countless other sources in both English and Russian.

Training details and best practices on acceleration and stabilization can be found in articles on Medium (English) and Habr (Russian).

We used DeepSpeed to train the model and drew inspiration from the Megatron-LM example. However, the code in this repo is not the code that was used to train the model. Rather, it is the stock example from the DeepSpeed repo with the minimal changes needed to run inference with our model.

Setup

Make sure you have 200 GB of free disk space before downloading the weights. The model (the code is based on microsoft/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3) is meant to run on multiple GPUs with tensor parallelism. It was tested on 4 (A100 80g) and 8 (V100 32g) GPUs, but it can work with other configurations that provide ≈200 GB of GPU memory in total and divide the weight dimensions correctly (e.g. 16, 64, 128).
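As a rough sanity check on the figures above, the following sketch estimates the weight size from the parameter count and tests whether a candidate GPU setup satisfies both conditions. The helper names `weights_size_gb` and `config_fits` are hypothetical, the 2 bytes per parameter (half precision) is an assumption, and the split dimension of 128 is an illustrative value rather than one read from this repository's code.

```python
from typing import Sequence


def weights_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate raw weight size in GB (10**9 bytes).

    bytes_per_param=2 assumes half-precision storage (an assumption).
    """
    return num_params * bytes_per_param / 1e9


def config_fits(num_gpus: int, mem_per_gpu_gb: float,
                split_dims: Sequence[int], required_gb: float = 200.0) -> bool:
    """True if total GPU memory covers required_gb and num_gpus divides
    every dimension that tensor parallelism would split."""
    enough_memory = num_gpus * mem_per_gpu_gb >= required_gb
    divides_evenly = all(dim % num_gpus == 0 for dim in split_dims)
    return enough_memory and divides_evenly


# Hypothetical split dimension used only for illustration.
EXAMPLE_SPLIT_DIMS = [128]

tested_a100 = config_fits(4, 80, EXAMPLE_SPLIT_DIMS)  # 4 x A100 80 GB
tested_v100 = config_fits(8, 32, EXAMPLE_SPLIT_DIMS)  # 8 x V100 32 GB
odd_count = config_fits(3, 80, EXAMPLE_SPLIT_DIMS)    # 3 does not divide 128
```

The check ignores activation memory and framework overhead, so a configuration that passes it may still need more headroom in practice.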

Downloading checkpoint

Run bash download/download.sh to download model weights and vocabulary.
By default, weights will be downloaded to ./yalm100b_checkpoint/weights/, and vocabulary will be downloaded to ./yalm100b_checkpoint/vocab/.
Alternatively, you can clone our Hugging Face (HF) repo and pull the checkpoint.
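Before and after downloading, it can help to confirm that enough disk space is available and that the default directories were populated. The sketch below uses only the Python standard library; the helper names are hypothetical, and it assumes the default locations named above (`./yalm100b_checkpoint/weights/` and `./yalm100b_checkpoint/vocab/`). It does not verify file contents or checksums.

```python
import shutil
from pathlib import Path

REQUIRED_FREE_GB = 200  # disk space figure from the setup notes


def has_enough_disk(path: str = ".", required_gb: float = REQUIRED_FREE_GB) -> bool:
    """Check free space on the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


def missing_checkpoint_dirs(root: str = "./yalm100b_checkpoint") -> list:
    """Return the expected sub-directories that are absent or empty."""
    missing = []
    for name in ("weights", "vocab"):
        sub = Path(root) / name
        if not sub.is_dir() or not any(sub.iterdir()):
            missing.append(name)
    return missing
```

An empty list from `missing_checkpoint_dirs()` only means both directories exist and contain something; it is not proof that the download completed.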

Docker

We published an image on Docker Hub; it can be pulled with docker/pull.sh. It is compatible with A100 and V100.
Alternatively, you can build the Docker image from source using docker/build.sh (which just builds the image from docker/Dockerfile).
To run the container, use docker/run.sh (volumes, name, and other parameters can be changed).

Usage

You can start with the following scripts:
- examples/generate_interactive.sh: interactive generation from the command line, the simplest way to try the model.
- examples/generate_conditional_sampling.sh: conditional generation with a sampling strategy. Top-p is used by default; feel free to change the temperature or use top-k. Input is jsonlines (example: examples/example_cond_input.json); output will be the same jsonlines with a generated text field added to each line.
- examples/generate_conditional_greedy.sh: same as the previous one, but generation is greedy. Suitable for solving problems with few-shot prompts.
- examples/generate_unconditional.sh: unconditional generation. No input is used; output will be jsonlines.
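For readers unfamiliar with the decoding options mentioned above, here is a minimal, dependency-free sketch of temperature scaling, top-k, top-p (nucleus), and greedy selection over a single next-token distribution. It illustrates the general techniques only and is not the code used by the scripts in this repository; all function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    kept = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable token ids whose cumulative
    probability reaches top_p, then renormalize."""
    kept, cumulative = [], 0.0
    for i in sorted(range(len(probs)), key=lambda i: probs[i], reverse=True):
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def greedy(probs):
    """Pick the single most probable token id (deterministic)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample(dist, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

Greedy selection always returns the same continuation for a given prompt, which is why it tends to suit few-shot problem solving, while sampling with top-p or top-k trades determinism for variety.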

License

The model is published under the Apache 2.0 license, which permits both research and commercial use. Megatron-LM is licensed under the Megatron-LM license.

Training details

Dataset composition

The dataset used to train YaLM-100B consists of the following parts (rough percentages are measured in tokens seen by the model):

25% The Pile — open English dataset by Eleuther AI team
75% Texts in Russian collected by our team (percentages of the whole dataset are given)
- 49% Russian web pages from the Yandex Search index, filtered from ~100 TB to ~1 TB by the following heuristics:
  1. LSH Deduplication — clusters of similar texts were truncated to just one text each
  2. Length filtration — too short or too long texts or texts with too few natural sentences were discarded.
  3. Entropy filtration — texts with too high or too low entropy were discarded
  4. Domain filtration — domains with repetitive texts (like online retail) were discarded
  5. Classifier filtration — dataset of good texts was collected in a manner similar to WebText from pages linked in tweets in Russian that have at least one reply. Then a classifier was trained to distinguish those good texts from random pages from the dataset. Texts from the original crawled dataset with low classifier scores were then discarded
- 12% News from various sources from Yandex Search index
- 10% Books from the dataset used in Russian Distributional Thesaurus
- 3% Misc texts from the Taiga Dataset
- 1.5% Dialogues from social media preprocessed in a manner similar to how Reddit is processed in The Pile
- 0.5% Russian portion of Wikipedia

Some subsets were traversed up to 3 times during the training.
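The length and entropy heuristics listed above could look roughly like the sketch below. The source does not say how entropy was measured or which thresholds were used, so this version assumes character-level Shannon entropy, and every threshold is a hypothetical placeholder.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def keep_text(text: str, min_chars: int = 200, max_chars: int = 100_000,
              min_entropy: float = 3.0, max_entropy: float = 6.0) -> bool:
    """Apply a length filter, then an entropy filter.

    All default thresholds are illustrative assumptions, not the values
    used to build the training set.
    """
    if not (min_chars <= len(text) <= max_chars):
        return False  # too short or too long
    return min_entropy <= char_entropy(text) <= max_entropy
```

Very low entropy usually indicates repetitive text (for example a single repeated character), while very high entropy tends to indicate noise such as encoded binary data, which is why both tails are discarded.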

Training process

The model was trained on a cluster of 800 A100 GPUs for ~65 days. In that time it consumed 300B tokens. TensorBoard logs with the LR and ramp-up schedule, training metrics, and our "thermometers" are available on the HF page.
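The figures above allow a few back-of-the-envelope estimates. The sketch below derives GPU-hours, average cluster throughput, and the number of tokens drawn from a subset with a given share. Because the inputs are approximate (~65 days, rough percentages), the results are order-of-magnitude estimates; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total accelerator time, assuming all GPUs ran for the whole period."""
    return num_gpus * days * 24


def tokens_per_second(total_tokens: float, days: float) -> float:
    """Average cluster-wide throughput over the whole run."""
    return total_tokens / (days * SECONDS_PER_DAY)


def subset_tokens(total_tokens: float, percent: float) -> float:
    """Tokens seen from a subset that makes up `percent` of the mix."""
    return total_tokens * percent / 100


total_gpu_hours = gpu_hours(800, 65)            # 800 A100 GPUs for ~65 days
throughput = tokens_per_second(300e9, 65)       # 300B tokens in ~65 days
per_gpu = throughput / 800                      # average per-GPU rate
pile_tokens = subset_tokens(300e9, 25)          # The Pile at 25% of the mix
```

These averages include any downtime or restarts during the run, so peak throughput would have been higher than the computed mean.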

Owner

Name: Yandex
Login: yandex
Kind: organization
Email: opensource-support@yandex-team.ru
Location: Moscow, Russia

Website: https://tech.yandex.com/
Repositories: 85
Profile: https://github.com/yandex

Yandex open source projects and technologies

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Khrushchev"
  given-names: "Mikhail"
- family-names: "Vasilev"
  given-names: "Ruslan"
- family-names: "Petrov"
  given-names: "Alexey"
- family-names: "Zinov"
  given-names: "Nikolay"
title: "YaLM 100B"
date-released: 2022-06-23
url: "https://github.com/yandex/YaLM-100B"
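For projects that cite software through BibTeX, the fields above can be mapped to an `@software` entry. The sketch below hard-codes the values from the CITATION.cff shown here rather than parsing the file, and the citation key `yalm100b` and the helper name `cff_to_bibtex` are hypothetical choices.

```python
def cff_to_bibtex(cff: dict, key: str) -> str:
    """Render a minimal BibTeX @software entry from CFF-style fields."""
    authors = " and ".join(
        f"{a['family-names']}, {a['given-names']}" for a in cff["authors"]
    )
    year = str(cff["date-released"])[:4]  # ISO date, so the year comes first
    lines = [
        f"@software{{{key},",
        f"  author = {{{authors}}},",
        f"  title = {{{cff['title']}}},",
        f"  year = {{{year}}},",
        f"  url = {{{cff['url']}}}",
        "}",
    ]
    return "\n".join(lines)


# Values copied from the CITATION.cff above.
YALM_CFF = {
    "authors": [
        {"family-names": "Khrushchev", "given-names": "Mikhail"},
        {"family-names": "Vasilev", "given-names": "Ruslan"},
        {"family-names": "Petrov", "given-names": "Alexey"},
        {"family-names": "Zinov", "given-names": "Nikolay"},
    ],
    "title": "YaLM 100B",
    "date-released": "2022-06-23",
    "url": "https://github.com/yandex/YaLM-100B",
}

entry = cff_to_bibtex(YALM_CFF, key="yalm100b")
```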

GitHub Events

Total

Watch event: 54
Fork event: 5

Last Year

Watch event: 54
Fork event: 5

Committers

Last synced: over 1 year ago

All Time

Total Commits: 10
Total Committers: 3
Avg Commits per committer: 3.333
Development Distribution Score (DDS): 0.3

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Vasilev Ruslan	5****g	7
Nikolay Zinov	n**v@y**u	2
Alexey Petrov	p****a	1
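The summary figures in this section can be reproduced from the commit counts in the table. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits, a common definition that matches the values shown here but is not stated on this page; the function names are hypothetical.

```python
def average_commits(commits: dict) -> float:
    """Mean number of commits per committer (0.0 when there are none)."""
    if not commits:
        return 0.0
    return sum(commits.values()) / len(commits)


def development_distribution_score(commits: dict) -> float:
    """Assumed definition: 1 - (top committer's commits / total commits)."""
    total = sum(commits.values())
    if total == 0:
        return 0.0
    return 1 - max(commits.values()) / total


# Commit counts from the Top Committers table.
ALL_TIME = {"Vasilev Ruslan": 7, "Nikolay Zinov": 2, "Alexey Petrov": 1}

avg = average_commits(ALL_TIME)
dds = development_distribution_score(ALL_TIME)
```

A score near 0 means one person wrote almost all commits, while a score near 1 means work is spread across many contributors.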

Committer Domains (Top 20 + Academic)

yandex-team.ru: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 51
Total pull requests: 9
Average time to close issues: 23 days
Average time to close pull requests: about 1 year
Total issue authors: 27
Total pull request authors: 4
Average comments per issue: 1.9
Average comments per pull request: 0.33
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

AlexanderKozhevin (2)
alexander-potemkin (1)
tonymacx86PRO (1)
TatianaShavrina (1)
alkavan (1)
finetunej (1)
spanarek (1)
justheuristic (1)
mgrankin (1)
joshlk (1)
CommanderTvis (1)
Vbansal21 (1)
ghosthamlet (1)
githubuser100007 (1)
hostingmydata (1)

Pull Request Authors

lostmsu (2)
IgorDuino (2)
erjanmx (2)
e0xextazy (1)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

docker/Dockerfile docker

nvcr.io/nvidia/pytorch 20.11-py3 build

megatron_lm/requirements.txt pypi

numpy *
pybind11 *
regex *
six *
torch *

megatron_lm/setup.py pypi
