jubench-megatron-lm

JUPITER Benchmark Suite: Megatron-LM Benchmark

https://github.com/fzj-jsc/jubench-megatron-lm

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ✓ Institutional organization owner: Organization fzj-jsc has institutional domain (www.fz-juelich.de)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.8%) to scientific vocabulary

Last synced: 10 months ago

Repository


Basic Info

Host: GitHub
Owner: FZJ-JSC
License: MIT
Language: Shell
Default Branch: main
Size: 3.47 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 2

Created almost 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

JUPITER Benchmark Suite: Megatron-LM

This benchmark is part of the JUPITER Benchmark Suite. See the repository of the suite for some general remarks.

This repository contains the Megatron-LM NLP/LLM benchmark. DESCRIPTION.md contains details for compilation, execution, and evaluation.

The required source code (Megatron-LM, Apex) is included in the ./src/ subdirectory as submodules from the upstream repositories: github.com/NVIDIA/Megatron-LM for Megatron-LM and github.com/NVIDIA/apex for Apex. Sample data files are also included.

Overview of Benchmark

Description Of Folder Structure

benchmark
- aux
  - tokenizers
  - script used for getting data and tokenizers: get_shrink_data_and_tokenizers.sh
  - script used for preprocessing data: job_preprocess_data.sbatch
  - sample 10MB oscar dataset obtained using get_shrink_data_and_tokenizers.sh
- env
  - script for activating the python virtual env: activate.bash
  - script to set up the python virtual env: setup_venv.sh
- slurm
  - sbatch scripts for the 13B and 175B models, to be used when running without JUBE
- jube
  - accompanying files for the JUBE run and the JUBE YAML file
src
- data : contains the preprocessed data (*.idx and *.bin files)
- compile_build.sh : script to build the software dependencies
- variables.bash : file that sets important paths
- prebuild_kernels.py : script to prebuild fused kernels

Workflow Without JUBE:

Getting Data and Tokenizers

Follow these steps if the data and tokenizers are not already present in this repository:

Step 1: Set the NLP_BENCH_ROOT variable in your bash shell: export NLP_BENCH_ROOT=<rootdir path of this benchmark>
Step 2: cd benchmark/aux/
Step 3: Run bash get_shrink_data_and_tokenizers.sh to get the tokenizers and shrink the raw data oscar-1GB.jsonl.xz to oscar-10MB.jsonl.xz

Preprocessing Data

If your src/data folder does not contain preprocessed data (*.idx and *.bin files), first complete Steps 1 to 5 of "Workflow With Preprocessed Data And Tokenizers Available", then execute sbatch job_preprocess_data.sbatch from the benchmark/aux directory.

The job_preprocess_data.sbatch script in benchmark/aux/ preprocesses oscar-10MB.jsonl.xz and puts the result in src/data/. The script can be modified to preprocess any data of choice.
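Whether preprocessing is still needed can be checked with a small script. The following is only a sketch, not part of the benchmark: the helper name has_preprocessed_data and the fallback to the current directory are assumptions made for illustration. It looks for at least one *.idx and one *.bin file in src/data.

```python
import os
from pathlib import Path


def has_preprocessed_data(data_dir):
    """Return True if data_dir holds at least one *.idx and one *.bin file."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return False
    has_idx = any(data_dir.glob("*.idx"))
    has_bin = any(data_dir.glob("*.bin"))
    return has_idx and has_bin


if __name__ == "__main__":
    # NLP_BENCH_ROOT is the variable exported in the workflow steps;
    # falling back to the current directory is an assumption of this sketch.
    root = Path(os.environ.get("NLP_BENCH_ROOT", "."))
    if has_preprocessed_data(root / "src" / "data"):
        print("Preprocessed data found in src/data.")
    else:
        print("No preprocessed data found; run job_preprocess_data.sbatch.")
```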

Workflow With Preprocessed Data And Tokenizers Available

Step 1: cd into the folder of this benchmark
Step 2: Set the NLP_BENCH_ROOT variable in your bash shell: export NLP_BENCH_ROOT=<rootdir path of this benchmark>
Step 3: Set TORCH_CUDA_ARCH_LIST according to your GPU's compute capability in benchmark/env/activate.bash
Step 4: Run bash benchmark/env/setup_venv.sh
Step 5: Run bash src/compile_build.sh
Step 6: Run sbatch benchmark/slurm/jobscript_13B.sbatch or sbatch benchmark/slurm/jobscript_175B.sbatch
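If the compute capability of the target GPU is not known, it can be queried from PyTorch. The snippet below is a minimal sketch and assumes that a PyTorch installation is already available on a GPU node; the helper arch_list_entry is a hypothetical name, not part of the benchmark. It prints a line that could be placed in benchmark/env/activate.bash.

```python
def arch_list_entry(major, minor):
    """Format a CUDA compute capability as a TORCH_CUDA_ARCH_LIST entry."""
    return f"{major}.{minor}"


if __name__ == "__main__":
    try:
        import torch  # assumed to be installed; not required for the helper
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # get_device_capability returns a (major, minor) tuple for the device
        major, minor = torch.cuda.get_device_capability(0)
        print(f'export TORCH_CUDA_ARCH_LIST="{arch_list_entry(major, minor)}"')
    else:
        print("No CUDA device visible; set TORCH_CUDA_ARCH_LIST manually.")
```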

The metric tokens_per_sec should be calculated as (1.0/$elapsed_time_per_iteration)*$global_batch_size*$sequence_length, where the values are obtained from the *.out file.

For submission, the throughput tokens_per_sec is converted into the time a hypothetical training run would require. The conversion assumes a training with 20 million tokens and uses the formula [ time_to_report_in_seconds ] = [tokens] / [tokens/second]. Example: a 13B model result of 59463.14 tokens/sec gives a duration of 20,000,000 / 59463.14 = 336.34 seconds.

Hint: sequence_length can be found in the jobscript.

Workflow With JUBE:

Step 1: cd into the folder of this benchmark
Step 2: Set TORCH_CUDA_ARCH_LIST according to your GPU's compute capability in benchmark/env/activate.bash
Step 3: Execute either jube run benchmark/jube/nlp_benchmark.yaml --tag 175 for the 175B model or jube run benchmark/jube/nlp_benchmark.yaml --tag 13 for the 13B model
Step 4: Wait for the benchmark to run, then repeat jube continue nlp_benchmark_run -i last until no steps in the "wait" state remain
Step 5: After the benchmark finishes, run jube result -a nlp_benchmark_run -i last to print the benchmark results

Example result from JUBE:

```
| system        | version | queue   | JobID    | JobTime    | ModelSize (Billion Param) | Nodes | BatchSize | PipelineParallel | TensorParallel | Iterations | AvgTFLOPs/GPU | Tokens/sec | time_to_report_in_seconds |
|---------------|---------|---------|----------|------------|---------------------------|-------|-----------|------------------|----------------|------------|---------------|------------|---------------------------|
| juwelsbooster | 2024.01 | booster | 10011638 | "00:30:00" | 13                        | 8     | 1024      | 4                | 2              | 20         | 206.885       | 60777.68   | 329.07                    |
```
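The throughput metric and the time conversion described in this README can be reproduced in a few lines of Python. This is only a sketch of the two formulas. The function names are chosen for illustration, and elapsed_time_per_iteration is assumed to be given in seconds (if the log reports milliseconds, divide by 1000 first).

```python
def tokens_per_sec(elapsed_time_per_iteration, global_batch_size, sequence_length):
    """Throughput in tokens per second.

    elapsed_time_per_iteration is assumed to be in seconds.
    """
    return (1.0 / elapsed_time_per_iteration) * global_batch_size * sequence_length


def time_to_report_in_seconds(throughput, tokens=20_000_000):
    """Time a hypothetical training with `tokens` tokens would take."""
    return tokens / throughput


if __name__ == "__main__":
    # Throughput values quoted in this README
    for throughput in (59463.14, 60777.68):
        seconds = time_to_report_in_seconds(throughput)
        print(f"{throughput} tokens/sec -> {seconds:.2f} s")
```

Rounded to two decimals, the two throughput values quoted in this README give 336.34 s and 329.07 s, matching the worked example and the JUBE example table.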

Owner

Name: Jülich Supercomputing Centre
Login: FZJ-JSC
Kind: organization
Location: Germany

Website: https://www.fz-juelich.de/en/ias/jsc
Twitter: fzj_jsc
Repositories: 29
Profile: https://github.com/FZJ-JSC

Jülich Supercomputing Centre provides HPC resources and expertise. Part of Forschungszentrum Jülich.

Citation (CITATION.cff)

cff-version: 1.2.0
title: "JUPITER Benchmark Suite: Megatron-LM"
message: >-
  In addition to citing this benchmark repository, please also cite either the JUPITER Benchmark Suite or the accompanying SC24 paper
authors:
  - given-names: Chelsea
    family-names: John
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0003-3777-7393'
  - given-names: Stefan
    family-names: Kesselheim
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0003-0940-5752'
  - given-names: Carolin
    family-names: Penke
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0002-4043-3885'
  - given-names: Jan
    family-names: Ebert
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0001-7118-0481'
  - given-names: Stepan
    family-names: Nassyr
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0002-0035-244X'
  - given-names: Andreas
    family-names: Herten
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0002-7150-2505'
  - given-names: Sebastian
    family-names: Achilles
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0002-1943-6803'
abstract: "The Megatron-LM benchmark of the JUPITER Benchmark Suite"
identifiers:
  - type: doi
    value: 10.5281/zenodo.12788115
    description: Version-agnostic Zenodo Identifier
repository-code: 'https://github.com/FZJ-JSC/jubench-megatron-lm/'
license: MIT
date-released: '2024-07-13'
references:
  - title: "JUPITER Benchmark Suite"
    type: software
    doi: 10.5281/zenodo.12737073
