jubench-megatron-lm

JUPITER Benchmark Suite: Megatron-LM Benchmark

https://github.com/fzj-jsc/jubench-megatron-lm

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
    Organization fzj-jsc has institutional domain (www.fz-juelich.de)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.8%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: FZJ-JSC
  • License: mit
  • Language: Shell
  • Default Branch: main
  • Size: 3.47 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created almost 2 years ago · Last pushed almost 2 years ago

README.md

JUPITER Benchmark Suite: Megatron-LM

This benchmark is part of the JUPITER Benchmark Suite. See the repository of the suite for some general remarks.

This repository contains the Megatron-LM NLP/LLM benchmark. DESCRIPTION.md contains details for compilation, execution, and evaluation.

The required source code (Megatron-LM, Apex) is included in the ./src/ subdirectory as submodules from the upstream repositories: github.com/NVIDIA/Megatron-LM for Megatron-LM and github.com/NVIDIA/apex for Apex. Sample data files are also included.

Overview of Benchmark

Description Of Folder Structure

  • benchmark
    • aux
      • tokenizers
      • script for getting data and tokenizers: get_shrink_data_and_tokenizers.sh
      • script for preprocessing data: job_preprocess_data.sbatch
      • sample 10 MB OSCAR dataset obtained using get_shrink_data_and_tokenizers.sh
    • env
      • script for activating the Python virtual environment: activate.bash
      • script for setting up the Python virtual environment: setup_venv.sh
    • slurm
      • sbatch scripts for the 13B and 175B models, to be used when running without JUBE
    • jube
      • contains the JUBE YAML file and accompanying files for a JUBE run
  • src
    • data : contains the preprocessed data (*.idx and *.bin files)
    • compile_build.sh : script to build the software dependencies
    • variables.bash : file that sets important paths
    • prebuild_kernels.py : script to prebuild fused kernels

Workflow Without JUBE:

Getting Data and Tokenizers

Follow these steps if the data and tokenizers are not already present in this repository:

  • Step 1: Set the NLP_BENCH_ROOT variable in your bash shell: export NLP_BENCH_ROOT=<rootdir path of this benchmark>
  • Step 2: cd benchmark/aux/
  • Step 3: Run bash get_shrink_data_and_tokenizers.sh to get the tokenizers and shrink the raw data oscar-1GB.jsonl.xz to oscar-10MB.jsonl.xz

Preprocessing Data

If your src/data folder does not contain preprocessed data (*.idx and *.bin files), run sbatch job_preprocess_data.sbatch from the benchmark/aux directory after completing Step 5 of "Workflow With Preprocessed Data And Tokenizers Available".

The job_preprocess_data.sbatch script in benchmark/aux/ preprocesses oscar-10MB.jsonl.xz and places the result in src/data/. The script can be modified to preprocess any other dataset.
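
Whether preprocessing is still needed can be checked before submitting the job. The following is a minimal sketch; the function name has_preprocessed_data is hypothetical and not part of the repository, and it only tests for the presence of files, not their validity.

```shell
# Hypothetical helper (not part of the repository): succeeds if the
# directory given as $1 contains at least one *.idx file and at least
# one *.bin file. An unmatched glob stays literal, so ls fails.
has_preprocessed_data() {
    ls "$1"/*.idx > /dev/null 2>&1 && ls "$1"/*.bin > /dev/null 2>&1
}

# Usage, from the root folder of the benchmark:
#   has_preprocessed_data src/data || echo "preprocessing needed"
```

If the check fails, job_preprocess_data.sbatch still has to be submitted from benchmark/aux once the environment has been built.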

Workflow With Preprocessed Data And Tokenizers Available

  • Step 1: cd into the root folder of this benchmark
  • Step 2: Set the NLP_BENCH_ROOT variable in your bash shell: export NLP_BENCH_ROOT=<rootdir path of this benchmark>
  • Step 3: Set TORCH_CUDA_ARCH_LIST in benchmark/env/activate.bash according to your GPU's compute capability
  • Step 4: Run bash benchmark/env/setup_venv.sh
  • Step 5: Run bash src/compile_build.sh
  • Step 6: Run sbatch benchmark/slurm/jobscript_13B.sbatch or sbatch benchmark/slurm/jobscript_175B.sbatch
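
The steps above can be strung together in a small wrapper. This is only a sketch: the functions run and run_without_jube and the DRY_RUN switch are hypothetical and not part of the repository. Step 3 remains a manual edit of benchmark/env/activate.bash.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around Steps 1, 2, 4, 5 and 6.
# Usage: run_without_jube <rootdir path of this benchmark> <13B|175B>
# Note: outside of dry-run mode this changes the caller's working directory.
run_without_jube() {
    case "$2" in
        13B|175B) ;;
        *) echo "unknown model: $2 (expected 13B or 175B)" >&2; return 1 ;;
    esac
    # Step 1: change into the root folder of the benchmark
    run cd "$1" || return 1
    # Step 2: export the root path used by the scripts
    export NLP_BENCH_ROOT="$1"
    # Step 4: set up the Python virtual environment
    run bash benchmark/env/setup_venv.sh || return 1
    # Step 5: build the software dependencies
    run bash src/compile_build.sh || return 1
    # Step 6: submit the job script for the chosen model
    run sbatch "benchmark/slurm/jobscript_$2.sbatch"
}
```

Calling the wrapper with DRY_RUN=1 prints the commands without executing them, which is a convenient way to check paths before submitting a job.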

The *.out output file contains result logs of the following form; these are the lines relevant for evaluation:

```
[default3]: iteration 10/ 292968 | consumed samples: 10240 | elapsed time per iteration (s): 35.8651 | learning rate: 4.734E-06 | global batch size: 1024 | lm loss: 1.332803E+01 | loss scale: 4096.0 | grad norm: 42.627 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 28.551 | TFLOPs: 199.03 |
[default3]: iteration 20/ 292968 | consumed samples: 20480 | elapsed time per iteration (s): 34.9991 | learning rate: 9.467E-06 | global batch size: 1024 | lm loss: 1.010884E+01 | loss scale: 4096.0 | grad norm: 13.038 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 29.258 | TFLOPs: 203.96 |
[default3]: iteration 30/ 292968 | consumed samples: 30720 | elapsed time per iteration (s): 34.8709 | learning rate: 1.420E-05 | global batch size: 1024 | lm loss: 9.072961E+00 | loss scale: 4096.0 | grad norm: 26.640 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 29.365 | TFLOPs: 204.71 |
[default3]: iteration 40/ 292968 | consumed samples: 40960 | elapsed time per iteration (s): 35.3346 | learning rate: 1.893E-05 | global batch size: 1024 | lm loss: 8.486469E+00 | loss scale: 4096.0 | grad norm: 3.441 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 28.980 | TFLOPs: 202.02 |
[default3]: iteration 50/ 292968 | consumed samples: 51200 | elapsed time per iteration (s): 35.3357 | learning rate: 2.367E-05 | global batch size: 1024 | lm loss: 8.
```
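
The elapsed time per iteration can be extracted from such a log with standard tools. The following is a minimal sketch, assuming the log lines keep the |-separated layout shown above. The function name average_elapsed_time is hypothetical, and averaging over all logged iterations is an assumption rather than the documented evaluation rule; DESCRIPTION.md describes the evaluation.

```shell
# Hypothetical helper (not part of the repository): print the mean of
# "elapsed time per iteration (s)" over all log lines in the file $1.
# Exits with status 1 if no matching field is found.
average_elapsed_time() {
    LC_ALL=C awk -F'|' '
        {
            for (i = 1; i <= NF; i++) {
                # Each field looks like " key: value "
                if (index($i, "elapsed time per iteration (s):") > 0) {
                    split($i, kv, ":")
                    sum += kv[2]
                    n++
                }
            }
        }
        END {
            if (n == 0) exit 1
            printf "%.4f\n", sum / n
        }
    ' "$1"
}
```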

Calculate the metric tokens_per_sec as (1.0/$elapsed_time_per_iteration)*$global_batch_size*$sequence_length, with the elapsed time per iteration and the global batch size taken from the *.out file.
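
As a worked sketch of this formula, the function below evaluates it with awk, since plain shell arithmetic is integer-only. The function name and argument order are hypothetical; SEQUENCE_LENGTH in the usage comment is a placeholder for the value found in the jobscript.

```shell
# Hypothetical helper (not part of the repository).
# Usage: tokens_per_sec <elapsed_time_per_iteration> <global_batch_size> <sequence_length>
tokens_per_sec() {
    LC_ALL=C awk -v t="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (1.0 / t) * b * s }'
}

# Usage with the first log line shown earlier (35.8651 s, batch size 1024):
#   tokens_per_sec 35.8651 1024 "$SEQUENCE_LENGTH"
```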

For submission, the throughput tokens_per_sec is converted into the time that a hypothetical training run would require. The conversion assumes a training with 20 million tokens and uses the formula [ time_to_report_in_seconds ] = [tokens] / [tokens/second]. Example: for a 13B model result with Tokens/sec: 59463.14, we obtain a duration of 20,000,000 / 59463.14 = 336.34 seconds.
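
The conversion can be sketched as a one-line helper. The function name time_to_report and its optional second argument are hypothetical; the default of 20,000,000 tokens is the value stated above.

```shell
# Hypothetical helper (not part of the repository).
# Usage: time_to_report <tokens_per_sec> [total_tokens]
# total_tokens defaults to the 20 million tokens assumed for submission.
time_to_report() {
    LC_ALL=C awk -v rate="$1" -v tokens="${2:-20000000}" \
        'BEGIN { printf "%.2f\n", tokens / rate }'
}

# Usage:
#   time_to_report 59463.14
```

With the throughput from the example, the call prints 336.34, in line with the duration given in the text.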

Hint: sequence_length can be found in the jobscript.

Workflow With JUBE:

  • Step 1: cd into the root folder of this benchmark
  • Step 2: Set TORCH_CUDA_ARCH_LIST in benchmark/env/activate.bash according to your GPU's compute capability
  • Step 3: Execute either jube run benchmark/jube/nlp_benchmark.yaml --tag 175 for the 175B model or jube run benchmark/jube/nlp_benchmark.yaml --tag 13 for the 13B model
  • Step 4: Wait for the benchmark to run, then repeat jube continue nlp_benchmark_run -i last until no steps in the "wait" state remain
  • Step 5: After the benchmark finishes, run jube result -a nlp_benchmark_run -i last to print the benchmark results

Example result from JUBE:

```
| system        | version | queue   | JobID    | JobTime    | ModelSize (Billion Param) | Nodes | BatchSize | PipelineParallel | TensorParallel | Iterations | AvgTFLOPs/GPU | Tokens/sec | time_to_report_in_seconds |
|---------------|---------|---------|----------|------------|---------------------------|-------|-----------|------------------|----------------|------------|---------------|------------|---------------------------|
| juwelsbooster | 2024.01 | booster | 10011638 | "00:30:00" | 13                        | 8     | 1024      | 4                | 2              | 20         | 206.885       | 60777.68   | 329.07                    |
```

Owner

  • Name: Jülich Supercomputing Centre
  • Login: FZJ-JSC
  • Kind: organization
  • Location: Germany

Jülich Supercomputing Centre provides HPC resources and expertise. Part of Forschungszentrum Jülich.

Citation (CITATION.cff)

cff-version: 1.2.0
title: "JUPITER Benchmark Suite: Megatron-LM"
message: >-
  In addition to citing this benchmark repository, please also cite either the JUPITER Benchmark Suite or the accompanying SC24 paper
authors:
  - given-names: Chelsea
    family-names: John
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0003-3777-7393'
  - given-names: Stefan
    family-names: Kesselheim
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0003-0940-5752'
  - given-names: Carolin
    family-names: Penke
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0002-4043-3885'
  - given-names: Jan
    family-names: Ebert
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0001-7118-0481'
  - given-names: Stepan
    family-names: Nassyr
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0002-0035-244X'
  - given-names: Andreas
    family-names: Herten
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0002-7150-2505'
  - given-names: Sebastian
    family-names: Achilles
    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
    orcid: 'https://orcid.org/0000-0002-1943-6803'
abstract: "The Megatron-LM benchmark of the JUPITER Benchmark Suite"
identifiers:
  - type: doi
    value: 10.5281/zenodo.12788115
    description: Version-agnostic Zenodo Identifier
repository-code: 'https://github.com/FZJ-JSC/jubench-megatron-lm/'
license: MIT
date-released: '2024-07-13'
references:
  - title: "JUPITER Benchmark Suite"
    type: software
    doi: 10.5281/zenodo.12737073
