foundation-model-benchmarking-tool
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark for performance across instance type and serving stack options.
https://github.com/aws-samples/foundation-model-benchmarking-tool
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.9%) to scientific vocabulary
Keywords
Repository
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark for performance across instance type and serving stack options.
Basic Info
- Host: GitHub
- Owner: aws-samples
- License: mit-0
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://aws-samples.github.io/foundation-model-benchmarking-tool/
- Size: 95.9 MB
Statistics
- Stars: 250
- Watchers: 8
- Forks: 43
- Open Issues: 52
- Releases: 50
Topics
Metadata Files
README.md
FMBench
Benchmark any Foundation Model (FM) on any AWS Generative AI service [Amazon SageMaker, Amazon Bedrock, Amazon EKS, Amazon EC2, or Bring your own endpoint.]
Amazon Bedrock | Amazon SageMaker | Amazon EKS | Amazon EC2
What's new: Benchmarks for Llama-4 models on Amazon EC2.
FMBench is a Python package for running performance and accuracy benchmarks for any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. The FMs can be deployed on these platforms directly through FMBench, or, if they are already deployed, they can be benchmarked through the Bring-your-own-endpoint mode supported by FMBench.
Here are some salient features of FMBench:
- Highly flexible: allows any combination of instance types (g5, p4d, p5, Inf2), inference containers (DeepSpeed, TensorRT, HuggingFace TGI, and others), and parameters such as tensor parallelism and rolling batch, as long as they are supported by the underlying platform.
- Benchmark any model: can be used to benchmark open-source models, third-party models, and proprietary models trained by enterprises on their own data. Benchmarking includes both performance benchmarking and model evaluations (accuracy measurement given ground truth). NEW: model evaluations done by a Panel of LLM Evaluators added in release 2.0.0.
- Run anywhere: can be run on any AWS platform where Python runs, such as Amazon EC2, Amazon SageMaker, or even AWS CloudShell. It is important to run this tool on an AWS platform so that internet round-trip time is not included in the end-to-end response latency.
Intro Video
Determine the optimal price|performance serving stack for your generative AI workload
Use FMBench to benchmark an LLM on any AWS generative AI service for price and performance (inference latency, transactions/minute). Here is one of the plots generated by FMBench to help answer the price-performance question for the Llama2-13b model when hosted on Amazon SageMaker (the instance types in the legend have been blurred out on purpose; you can find them in the actual plot generated by running FMBench).

Determine the optimal model for your generative AI workload
Use FMBench to determine model accuracy using a panel of LLM evaluators (PoLL [1]). Here is one of the plots generated by FMBench to help answer the accuracy question for various FMs on Amazon Bedrock (the model IDs in the charts have been blurred out on purpose; you can find them in the actual plot generated by running FMBench).


Models benchmarked
Configuration files for the following models are available in this repo's configs folder.
Full list of benchmarked models
| Model | Amazon EC2 | Amazon SageMaker | Amazon Bedrock |
|:--------------------------------|:-------------------------------|:-------------------------------------------|:-----------------------------------|
| Deepseek-R1 distilled | g6e | g6e | |
| Qwen2.5-72b | g5, g6e | | |
| Amazon Nova | | | On-demand |
| Anthropic Claude-3 Sonnet | | | On-demand, provisioned |
| Anthropic Claude-3 Haiku | | | On-demand |
| Mistral-7b-instruct | inf2, trn1 | g4dn, g5, p3, p4d, p5 | On-demand |
| Mistral-7b-AWQ | | p5 | |
| Mixtral-8x7b-instruct | | | On-demand |
| Llama-4-Scout-17B-16E-Instruct | g6e | | |
| Llama3.3-70b instruct | | | On-demand |
| Llama3.2-1b instruct | g5 | | |
| Llama3.2-3b instruct | g5 | | |
| Llama3.1-8b instruct | g5, p4d, p4de, p5, p5e, g6e, g6, inf2, trn1 | g4dn, g5, p3, inf2, trn1 | On-demand |
| Llama3.1-70b instruct | p4d, p4de, p5, p5e, g6e, g5, inf2, trn1 | inf2, trn1 | On-demand |
| Llama3-8b instruct | g5, g6e, inf2, trn1, c8g | g4dn, g5, p3, inf2, trn1, p4d, p5e | On-demand |
| Llama3-70b instruct | g5 | g4dn, g5, p3, inf2, trn1, p4d | On-demand |
| Llama2-13b chat | | g4dn, g5, p3, inf2, trn1, p4d | On-demand |
| Llama2-70b chat | | g4dn, g5, p3, inf2, trn1, p4d | On-demand |
| NousResearch-Hermes-70b | | g5, inf2, trn1 | On-demand |
| Amazon Titan text lite | | | On-demand |
| Amazon Titan text express | | | On-demand |
| Cohere Command text | | | On-demand |
| Cohere Command light text | | | On-demand |
| AI21 J2 Mid | | | On-demand |
| AI21 J2 Ultra | | | On-demand |
| Gemma-2b | | g4dn, g5, p3 | |
| Phi-3-mini-4k-instruct | | g4dn, g5, p3 | |
| distilbert-base-uncased | | g4dn, g5, p3 | |
New in this release
2.1.6
- Add a synthetic dataset for meeting summarization and a config file for benchmarking `Llama-4-Scout-17B-16E` using this dataset.
2.1.5
- `Llama-4-Scout-17B-16E` config file for the `g6e.48xlarge` instance type using `vLLM`.
2.1.4
- `Llama3.1-8b` config file for the `p5en` instance type.
- Remove `vllm` from `pyproject.toml`.
2.1.3
- SGLang support.
2.1.2
- Deepseek prompt updates.
- Handle case for < 1 txn/minute.
2.1.1
- Optimized prompt templates and config files for DeepSeek-R1 and Amazon Nova for the `ConvFinQA` and `LongBench` datasets.
Getting started
FMBench is available as a Python package on PyPI and is run as a command line tool once it is installed. All data, including metrics, reports, and results, is stored in an Amazon S3 bucket.
[!IMPORTANT] All documentation for `FMBench` is available on the `FMBench` website.
You can run FMBench on either a SageMaker notebook or on an EC2 VM; both options are described here as part of the documentation. You can even run FMBench as a Docker container. A Quickstart guide for SageMaker is provided below as well.
The following sections discuss running the FMBench tool itself, as distinct from where the FM being benchmarked is actually deployed. For example, we could run FMBench on EC2 while the model being benchmarked is deployed on SageMaker or even Bedrock.
Quickstart
FMBench on an Amazon SageMaker Notebook
- Each `FMBench` run works with a configuration file that contains the information about the model, the deployment steps, and the tests to run. A typical `FMBench` workflow involves either directly using an already provided config file from the `configs` folder in the `FMBench` GitHub repo or editing an already provided config file as per your own requirements (say you want to try benchmarking on a different instance type, or a different inference container, etc.).
- A simple config file with key parameters annotated is included in this repo; see `config-llama2-7b-g5-quick.yml`. This file benchmarks performance of Llama2-7b on an `ml.g5.xlarge` instance and an `ml.g5.2xlarge` instance. You can use this config file as-is for this Quickstart.
- Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an Amazon IAM role, and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created that contains all the files (configuration files, datasets) required to run `FMBench`, and a write S3 bucket is created that will hold the metrics and reports generated by `FMBench`. The CloudFormation stack takes about 5 minutes to create.
|AWS Region | Link |
|:------------------------:|:-----------:|
|us-east-1 (N. Virginia) | (launch stack button) |
|us-west-2 (Oregon) | (launch stack button) |
|us-gov-west-1 (GovCloud West) | (launch stack button) |
- Once the CloudFormation stack is created, navigate to SageMaker Notebooks and open the `fmbench-notebook`.
- On the `fmbench-notebook` open a Terminal and run the following commands.

```{.bash}
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
uv venv .fmbench_python312 --python 3.12
source .fmbench_python312/bin/activate
uv pip install -U fmbench
```

- Now you are ready to run `fmbench` with the following command line. We will use a sample config file placed in the S3 bucket by the CloudFormation stack for a quick first run.
  - We benchmark performance for the `Llama2-7b` model on an `ml.g5.xlarge` and an `ml.g5.2xlarge` instance type, using the `huggingface-pytorch-tgi-inference` inference container. This test takes about 30 minutes to complete and costs about $0.20.
  - It uses a simple rule of 750 words equals 1000 tokens; to get a more accurate representation of token counts use the Llama2 tokenizer (instructions are provided in the next section). It is strongly recommended that for more accurate results on token throughput you use a tokenizer specific to the model you are testing rather than the default tokenizer. See instructions provided later in this document on how to use a custom tokenizer, and the token-counting sketch after this list.

```{.bash}
account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
region=`aws configure get region`
fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/llama2/7b/config-llama2-7b-g5-quick.yml > fmbench.log 2>&1
```

- Open another terminal window and do a `tail -f` on the `fmbench.log` file to see all the traces being generated at runtime.

```{.bash}
tail -f fmbench.log
```

- For streaming support on SageMaker and Bedrock check out these config files:
- The generated reports and metrics are available in the `sagemaker-fmbench-write-<replace_w_your_aws_region>-<replace_w_your_aws_account_id>` bucket. The metrics and report files are also downloaded locally in the `results` directory (created by `FMBench`), and the benchmarking report is available as a markdown file called `report.md` in the `results` directory. You can view the rendered Markdown report in the SageMaker notebook itself or download the metrics and report files to your machine for offline analysis.
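For illustration, here is a minimal Python sketch (not part of FMBench) that contrasts the 750-words ≈ 1000-tokens rule with a count from a model-specific Hugging Face tokenizer. The model id is an assumption; substitute the tokenizer for whichever model you are benchmarking (the Llama2 tokenizer requires accepting the model license on Hugging Face).

```python
# Illustrative only, not FMBench code: compare the word-count heuristic used in
# the quickstart config against an exact count from a model-specific tokenizer.
from transformers import AutoTokenizer

text = "FMBench measures inference latency and throughput for foundation models on AWS."

# heuristic from the quickstart: 1000 tokens per 750 words
word_count = len(text.split())
approx_tokens = word_count * 1000 / 750

# exact count from a model-specific tokenizer (model id is a placeholder;
# use the tokenizer that matches the model you are testing)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
exact_tokens = len(tokenizer.encode(text, add_special_tokens=False))

print(f"words={word_count}, heuristic~{approx_tokens:.0f} tokens, tokenizer={exact_tokens} tokens")
```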
If you would like to understand what is being done under the hood by the CloudFormation template, see the DIY version (with gory details).
FMBench on SageMaker in GovCloud
No special steps are required for running FMBench on GovCloud. The CloudFormation link for us-gov-west-1 has been provided in the section above.
- Not all models available via Bedrock or other services may be available in GovCloud. The following commands show how to run `FMBench` to benchmark the Amazon Titan Text Express model in GovCloud. See the Amazon Bedrock GovCloud page for more details.
```{.bash}
account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
region=`aws configure get region`
fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/bedrock/config-bedrock-titan-text-express.yml > fmbench.log 2>&1
```
Running FMBench via the FMBench-orchestrator
FMBench on Amazon EC2 via the FMBench orchestrator
If you want to benchmark FMs on Amazon EC2 then you can use the fmbench-orchestrator as a quick and simple way to get started. The orchestrator is a Python program that can be installed on an EC2 machine; it in turn launches other EC2 machines for benchmarking purposes. The orchestrator installs and runs FMBench on these EC2 machines, downloads the benchmarking results from them, and finally terminates them once the benchmarking has finished.
As an example, consider a scenario in which you want to benchmark, say, the Llama3.1-8b model on a g5.2xlarge, g6.2xlarge, p4d.24xlarge, p5e.48xlarge, and a trn1.32xlarge. Usually this would mean that you have to create these EC2 instances, install the prerequisites, install FMBench, run FMBench, download the results, and then repeat the process for the next instance. This is tedious work. The orchestrator makes this convenient by doing all of it for you, in parallel: it spawns all these EC2 VMs, performs the steps above on each, and at the end of the test the results from all the instances are downloaded to the orchestrator VM while the spawned EC2 VMs are automatically terminated. See the orchestrator README for more details.
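To make the orchestration idea concrete, below is a highly simplified boto3 sketch of the launch-and-terminate pattern the orchestrator automates. This is not the orchestrator's actual code; the AMI id, key name, and instance types are placeholders.

```python
# Simplified sketch of the EC2 lifecycle the orchestrator manages (assumption:
# not the orchestrator's real implementation). AMI id and key name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# launch one benchmarking VM per instance type to test
instance_types = ["g5.2xlarge", "g6e.2xlarge"]
launched = []
for itype in instance_types:
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI
        InstanceType=itype,
        KeyName="my-key",                  # placeholder key pair
        MinCount=1,
        MaxCount=1,
    )
    launched.append(resp["Instances"][0]["InstanceId"])

# wait until the VMs are running, then (conceptually) install and run FMBench on
# each one, download the results, and finally terminate everything
ec2.get_waiter("instance_running").wait(InstanceIds=launched)
# ... run benchmarks and collect results here ...
ec2.terminate_instances(InstanceIds=launched)
```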
Results
Depending upon the experiments in the config file, the FMBench run may take a few minutes to several hours. Once the run completes, you can find the report and metrics in the local results-* folder in the directory from which FMBench was run. The report and metrics are also written to the write S3 bucket set in the config file.
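If you prefer to pull the generated reports straight from the write bucket rather than using the local copies, a small boto3 sketch like the following could work. It is not part of FMBench; the bucket name follows the `sagemaker-fmbench-write-<region>-<account>` pattern described earlier, and the key layout inside the bucket is an assumption you should adjust to your run.

```python
# Hypothetical helper, not shipped with FMBench: fetch every report.md from the
# write bucket that FMBench populates during a run.
import boto3

region = "us-east-1"                                    # assumption: your AWS region
account = boto3.client("sts").get_caller_identity()["Account"]
bucket = f"sagemaker-fmbench-write-{region}-{account}"  # write bucket per this README

s3 = boto3.client("s3", region_name=region)

# walk the bucket and download any markdown report found (key layout is assumed)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("report.md"):
            local_name = obj["Key"].replace("/", "_")
            s3.download_file(bucket, obj["Key"], local_name)
            print(f"downloaded {obj['Key']} -> {local_name}")
```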
Here is a screenshot of the report.md file generated by FMBench.

Benchmark models deployed on different AWS Generative AI services (Docs)
FMBench comes packaged with configuration files for benchmarking models on different AWS Generative AI services, i.e., Bedrock, SageMaker, EKS, and EC2, or even bring your own endpoint.
Enhancements
View the ISSUES on GitHub and add any that you think would be a beneficial iteration to this benchmarking harness.
Security
See CONTRIBUTING for more information.
License
This library is licensed under the MIT-0 License. See the LICENSE file.
Star History
Support
- Schedule Demo - send us an email
- Community Discord
- Our emails: aroraai@amazon.com / madhurpt@amazon.com
Contributors
References
[1] Pat Verga et al., "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models", arXiv:2404.18796, 2024.
Owner
- Name: AWS Samples
- Login: aws-samples
- Kind: organization
- Website: https://amazon.com/aws
- Repositories: 6,789
- Profile: https://github.com/aws-samples
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Arora"
  given-names: "Amit"
  orcid: "https://orcid.org/0000-0001-6697-2724"
- family-names: "Prashant"
  given-names: "Madhur"
  orcid: "https://orcid.org/0009-0002-4086-2003"
title: "FMBench: benchmark any foundation model on any AWS GenAI service"
version: 2.0.1
doi: 10.5281/zenodo.13324418
date-released: 2024-08-14
url: "https://aws-samples.github.io/foundation-model-benchmarking-tool/"
GitHub Events
Total
- Create event: 25
- Issues event: 17
- Release event: 14
- Watch event: 55
- Issue comment event: 9
- Push event: 193
- Pull request review comment event: 22
- Pull request event: 111
- Pull request review event: 71
- Fork event: 15
Last Year
- Create event: 25
- Issues event: 17
- Release event: 14
- Watch event: 55
- Issue comment event: 9
- Push event: 193
- Pull request review comment event: 22
- Pull request event: 111
- Pull request review event: 71
- Fork event: 15
Committers
Last synced: 6 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Madhur Prashant | 1****h | 462 |
| Amit Arora | a****i@a****m | 365 |
| Madhur prashant | M****t@a****m | 109 |
| annn | 4****8 | 109 |
| dheerajoruganty | d****b@g****m | 88 |
| madhur | m****s@g****m | 15 |
| Ubuntu | u****u@i****l | 10 |
| Karanbir Bains | b****b@a****m | 9 |
| Ubuntu | u****u@i****l | 8 |
| Rajesh Ramchander | R****r@g****m | 6 |
| Ubuntu | u****u@i****l | 6 |
| Ubuntu | u****u@i****l | 6 |
| Ubuntu | u****u@i****l | 6 |
| EC2 Default User | e****r@i****l | 5 |
| EC2 Default User | e****r@i****l | 5 |
| Ubuntu | u****u@i****l | 4 |
| Ubuntu | u****u@i****l | 4 |
| EC2 Default User | e****r@i****l | 4 |
| Jim Burtoft | 3****t | 3 |
| Ubuntu | u****u@i****l | 3 |
| Ubuntu | u****u@i****l | 3 |
| Ubuntu | u****u@i****l | 3 |
| Ubuntu | u****u@i****l | 2 |
| Ubuntu | u****u@i****l | 2 |
| Ubuntu | u****u@i****l | 2 |
| Ubuntu | u****u@i****l | 2 |
| Ubuntu | u****u@i****l | 2 |
| Ubuntu | u****u@i****l | 2 |
| EC2 Default User | e****r@i****l | 2 |
| EC2 Default User | e****r@i****l | 2 |
| and 31 more... | ||
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 58
- Total pull requests: 209
- Average time to close issues: 21 days
- Average time to close pull requests: 2 days
- Total issue authors: 16
- Total pull request authors: 9
- Average comments per issue: 0.34
- Average comments per pull request: 0.05
- Merged pull requests: 173
- Bot issues: 0
- Bot pull requests: 10
Past Year
- Issues: 20
- Pull requests: 109
- Average time to close issues: 7 days
- Average time to close pull requests: 2 days
- Issue authors: 10
- Pull request authors: 6
- Average comments per issue: 0.2
- Average comments per pull request: 0.09
- Merged pull requests: 95
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- madhurprash (28)
- aarora79 (14)
- athewsey (4)
- Yiwen-Zhang (3)
- jimburtoft (3)
- NidPlays (2)
- prasad-nair-amd (2)
- antara678 (2)
- prasadwrites (1)
- wilalbto (1)
- zhimin-z (1)
- dheerajoruganty (1)
- dickren123 (1)
- RajeshRamchander (1)
- lxning (1)
Pull Request Authors
- madhurprash (208)
- aarora79 (59)
- dheerajoruganty (48)
- antara678 (39)
- dependabot[bot] (21)
- haozhx23 (4)
- tonyksong (4)
- jimburtoft (4)
- bainskb (3)
- RajeshRamchander (2)
- fespigares (2)
- athewsey (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: unknown
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 59
proxy.golang.org: github.com/aws-samples/foundation-model-benchmarking-tool
- Documentation: https://pkg.go.dev/github.com/aws-samples/foundation-model-benchmarking-tool#section-documentation
- License: mit-0
- Latest release: v2.1.6+incompatible (published 11 months ago)
Rankings
Dependencies
- datasets ==2.16.1
- ipywidgets ==8.1.1
- pandas ==2.1.4
- sagemaker ==2.203.0
- seaborn ==0.13.1
- tomark ==0.1.4
- transformers ==4.36.2