https://github.com/awslabs/sagemaker-hyperpod-usage-report

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: awslabs
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 577 KB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 3
  • Open Issues: 1
  • Releases: 0
Created about 1 year ago · Last pushed 10 months ago
Metadata Files
Readme Changelog Contributing License Code of conduct Security

README.md

HyperPod Cluster Usage Report

Overview

Usage reporting in SageMaker HyperPod EKS-orchestrated clusters provides visibility into compute resource consumption. The capability allows organizations to implement transparent cost attribution, allocating cluster costs to teams, projects, or departments based on their actual usage. By tracking metrics such as GPU/CPU hours and Neuron Core utilization over time, usage reporting complements SageMaker HyperPod's Task Governance functionality, ensuring fair cost attribution in shared multi-tenant clusters by:

  • Eliminating guesswork in cost allocation
  • Directly linking expenses to measurable resource consumption
  • Enforcing usage-based accountability in shared infrastructure environments

Table of Contents

  1. Set up Usage Reporting
  2. Generate Reports
  3. Clean Up Resources
  4. Local Development
  5. Attributions and Open Source Acknowledgments
  6. Contributing
  7. License

Set up Usage Reporting

Usage reporting in SageMaker HyperPod requires deploying the SageMaker HyperPod usage report infrastructure using a CloudFormation stack and installing the SageMaker HyperPod usage report Kubernetes operator using a Helm chart.

To successfully deploy and use the SageMaker HyperPod usage report, you should meet the following prerequisites.

Prerequisites

  • Have a running EKS-orchestrated SageMaker HyperPod cluster (Kubernetes version >= 1.30) with the Task Governance add-on.

  • Have AWS CLI, kubectl, and Helm (package manager for Kubernetes - version >= 3.17.1) installed.

  • A Python environment (version >= 3.9).

  • Clone the GitHub repository sagemaker-hyperpod-usage-report.

    ```sh
    git clone https://github.com/awslabs/sagemaker-hyperpod-usage-report
    ```

  • Set the following local environment variables in your terminal:

Note - To install the usage report, you need an Installer IAM role with appropriate permissions. You can either create a new IAM role and leave the role policies blank for now, or reuse an existing role such as your current administrator role. Use the selected role name in the USAGE_REPORT_INSTALLER_ROLE_NAME variable. You will populate the role policies in the upcoming configuration steps.

```sh
# Set up the environment variable
export AWS_ACCOUNT=<account number>
export AWS_REGION=<region>
export HYPERPOD_CLUSTER_NAME=<hyperpod cluster name>
export EKS_CLUSTER_NAME=<eks cluster name>
export USAGE_REPORT_INSTALLER_ROLE_NAME=<Installer IAM role name>
# Keep the operator name under 22 characters if you customize it
export USAGE_REPORT_OPERATOR_NAME=hyperpod-usage-report
export HYPERPOD_CLUSTER_ID=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --region $AWS_REGION | jq -r '.ClusterArn | split("/")[-1]')

aws configure set region $AWS_REGION
```
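The `jq` filter above keeps the last `/`-separated segment of the cluster ARN. As a minimal sketch of the same extraction without `jq`, the hypothetical helper `cluster_id_from_arn` below uses POSIX parameter expansion. The helper name and the sample ARN are made up for illustration and are not part of the repository.

```sh
# Hypothetical helper (not part of the repository): print the text after
# the last "/" in an ARN, mirroring jq's split("/")[-1].
cluster_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Example with a made-up ARN; the account and cluster IDs are placeholders.
cluster_id_from_arn "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcd1234efgh"
```

This can be handy when the ARN is already available in a variable and `jq` is not installed.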
Verify the content of your variables:
```sh
echo "AWS_ACCOUNT is $AWS_ACCOUNT"
echo "AWS_REGION is $AWS_REGION"
echo "HYPERPOD_CLUSTER_NAME is $HYPERPOD_CLUSTER_NAME"
echo "EKS_CLUSTER_NAME is $EKS_CLUSTER_NAME"
echo "USAGE_REPORT_INSTALLER_ROLE_NAME is $USAGE_REPORT_INSTALLER_ROLE_NAME"
echo "USAGE_REPORT_OPERATOR_NAME is $USAGE_REPORT_OPERATOR_NAME"
echo "HYPERPOD_CLUSTER_ID is $HYPERPOD_CLUSTER_ID"
```
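Printing the variables shows their values but does not stop you from continuing with an empty one. The following sketch, assuming a POSIX shell, wraps the same check in a function that fails when a variable is empty or when the operator name reaches 22 characters. The function name `check_usage_report_env` is hypothetical.

```sh
# Hypothetical pre-flight check (not part of the repository).
# Returns 0 when every required variable is non-empty and the operator
# name is shorter than 22 characters; otherwise reports each problem.
check_usage_report_env() {
  status=0
  for name in AWS_ACCOUNT AWS_REGION HYPERPOD_CLUSTER_NAME EKS_CLUSTER_NAME \
              USAGE_REPORT_INSTALLER_ROLE_NAME USAGE_REPORT_OPERATOR_NAME \
              HYPERPOD_CLUSTER_ID; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      status=1
    fi
  done
  operator=${USAGE_REPORT_OPERATOR_NAME:-}
  if [ "${#operator}" -ge 22 ]; then
    echo "USAGE_REPORT_OPERATOR_NAME must be under 22 characters" >&2
    status=1
  fi
  return "$status"
}
```

Running `check_usage_report_env && echo "environment looks complete"` before the next steps gives a single pass/fail signal.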
  • Set up kubectl authentication and context for accessing the EKS cluster

    • Start by running the aws eks update-kubeconfig command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command.

      ```sh
      aws eks update-kubeconfig --region $AWS_REGION --name $EKS_CLUSTER_NAME
      ```

    • You can verify that you are connected to the EKS cluster by running:

      ```sh
      kubectl config current-context
      ```

      Expected output:

      arn:aws:eks:$AWS_REGION:$AWS_ACCOUNT:cluster/$EKS_CLUSTER_NAME

  • Generate and attach the required IAM policies.

    • Populate the IAM policy document for your Installer role from the template provided in permissions/usage-report-installer-policy.json.template.

      ```sh
      INPUT_FILE="permissions/usage-report-installer-policy.json.template"
      OUTPUT_FILE="permissions/usage-report-installer-policy.json"
      sed \
        -e "s/AWS_REGION/$AWS_REGION/g" \
        -e "s/AWS_ACCOUNT/$AWS_ACCOUNT/g" \
        -e "s/USAGE_REPORT_OPERATOR_NAME/$USAGE_REPORT_OPERATOR_NAME/g" \
        -e "s/HYPERPOD_CLUSTER_ID/$HYPERPOD_CLUSTER_ID/g" \
        -e "s/EKS_CLUSTER_NAME/$EKS_CLUSTER_NAME/g" \
        -e "s/USAGE_REPORT_INSTALLER_ROLE_NAME/$USAGE_REPORT_INSTALLER_ROLE_NAME/g" \
        "$INPUT_FILE" > "$OUTPUT_FILE"
      ```

    • Attach the permissions/usage-report-installer-policy.json IAM policy to the IAM Installer role that performs AWS CLI, kubectl, and helm operations. This ensures usage report installers have the required permissions to install and manage SageMaker HyperPod Usage report data capture.

      To embed the inline policy in an existing role, use the following command:

      ```sh
      aws iam put-role-policy \
        --role-name $USAGE_REPORT_INSTALLER_ROLE_NAME \
        --policy-name sagemaker-hyperpod-usage-report \
        --policy-document file://permissions/usage-report-installer-policy.json
      ```

      To verify that the policy has been added correctly, run:

      ```sh
      aws iam get-role-policy \
        --role-name $USAGE_REPORT_INSTALLER_ROLE_NAME \
        --policy-name sagemaker-hyperpod-usage-report
      ```

  • Create a dedicated Kubernetes namespace for the usage report operator:

    • In sagemaker-hyperpod-usage-report, run the following command to create the namespace $USAGE_REPORT_OPERATOR_NAME:

      ```sh
      INPUT_FILE="permissions/usage-report-namespace.yaml.template"
      OUTPUT_FILE="permissions/usage-report-namespace.yaml"
      sed \
        -e "s/NAMESPACE/$USAGE_REPORT_OPERATOR_NAME/g" \
        "$INPUT_FILE" > "$OUTPUT_FILE"

      kubectl apply -f permissions/usage-report-namespace.yaml
      ```

  • Create custom RBAC permissions for deploying the HyperPod usage report Kubernetes operator helm chart on the cluster:

    • In sagemaker-hyperpod-usage-report, run the following command to set up the RBAC permissions in your EKS cluster.

      ```sh
      INPUT_FILE="permissions/usage-report-installer-cluster-policy.yaml.template"
      OUTPUT_FILE="permissions/usage-report-installer-cluster-policy.yaml"
      sed \
        -e "s/NAMESPACE/$USAGE_REPORT_OPERATOR_NAME/g" \
        -e "s/ROLE_NAME/$USAGE_REPORT_INSTALLER_ROLE_NAME/g" \
        "$INPUT_FILE" > "$OUTPUT_FILE"

      kubectl apply -f permissions/usage-report-installer-cluster-policy.yaml
      ```

    • Enable the access entry for the EKS cluster.

      ```sh
      aws eks update-cluster-config --name $EKS_CLUSTER_NAME --access-config authenticationMode=API_AND_CONFIG_MAP
      ```

      Note: If you receive an error message indicating Unsupported authentication mode update, no further action is necessary as the authentication mode has already been configured.
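Each `sed` command in the prerequisites renders a template by replacing placeholder tokens with the values of your environment variables. If a token is left unreplaced, the rendered file can still be syntactically valid and fail only later. The sketch below is one way to catch that early. The helper name `assert_rendered` is hypothetical, and it assumes that none of your real values contain a placeholder token as a substring.

```sh
# Hypothetical helper (not part of the repository).
# Usage: assert_rendered FILE PLACEHOLDER...
# Succeeds only when FILE contains none of the given placeholder tokens.
assert_rendered() {
  file=$1
  shift
  status=0
  for placeholder in "$@"; do
    # -F treats the token as a fixed string rather than a regex.
    if grep -qF -- "$placeholder" "$file"; then
      echo "unrendered placeholder $placeholder in $file" >&2
      status=1
    fi
  done
  return "$status"
}

# Example call for the installer policy rendered earlier:
# assert_rendered permissions/usage-report-installer-policy.json \
#   AWS_REGION AWS_ACCOUNT USAGE_REPORT_OPERATOR_NAME HYPERPOD_CLUSTER_ID \
#   EKS_CLUSTER_NAME USAGE_REPORT_INSTALLER_ROLE_NAME
```

A failure here usually means the matching environment variable was empty or the template was edited.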

Install SageMaker HyperPod Usage Report Infrastructure using CloudFormation

The following installation steps assume you are using the role $USAGE_REPORT_INSTALLER_ROLE_NAME you specified above.

Retrieve the CloudFormation Template

You can find the CloudFormation template in the /cloudformation directory. The template provisions the following AWS resources:

  • Storage infrastructure: An S3 bucket (s3://$AWS_ACCOUNT-$AWS_REGION-$HYPERPOD_CLUSTER_ID-usage-report-<random string>) to capture usage data, with associated IAM role allowing pods to write data to the bucket.
  • Query infrastructure: An Athena database for querying and aggregating usage data.
  • Processing infrastructure: An AWS Lambda function triggered daily by a CloudWatch Event rule to perform automated usage data aggregation and reporting.

Input Parameters to the CloudFormation Template

| Parameter | Required | Default Value | Notes |
|------------------------------|----------|-----------------------|-------|
| EKSClusterName | Yes | - | Name of the EKS cluster |
| HyperPodClusterId | Yes | - | Id of the HyperPod cluster |
| UsageReportInstallerRoleName | Yes | - | Name of the IAM role for usage reporting installation |
| DataRententionDays | No | 180 | Data retention days for S3 Bucket |
| InstallPodIdentityAddon | No | "true" | Whether to install the Pod Identity Addon. Allowed values: "true", "false" |
| UsageReportOperatorNameSpace | No | hyperpod-usage-report | Kubernetes cluster namespace where usage report operator is installed |
| OperatorServiceAccount | No | hyperpod-usage-report | Service account used by usage report operator pod identity for permissions to access AWS resources |

Deploy the Stack

Run the following stack creation command:

```sh
cd sagemaker-hyperpod-usage-report
aws cloudformation create-stack \
  --region $AWS_REGION \
  --stack-name $USAGE_REPORT_OPERATOR_NAME \
  --template-body file://cloudformation/usage-report.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=EKSClusterName,ParameterValue=$EKS_CLUSTER_NAME \
    ParameterKey=HyperPodClusterId,ParameterValue=$HYPERPOD_CLUSTER_ID \
    ParameterKey=UsageReportOperatorNameSpace,ParameterValue=$USAGE_REPORT_OPERATOR_NAME \
    ParameterKey=OperatorServiceAccount,ParameterValue=$USAGE_REPORT_OPERATOR_NAME \
    ParameterKey=UsageReportInstallerRoleName,ParameterValue=$USAGE_REPORT_INSTALLER_ROLE_NAME
```

Verify the CloudFormation stack creation status:

```sh
aws cloudformation describe-stacks --stack-name $USAGE_REPORT_OPERATOR_NAME \
  --region $AWS_REGION --query 'Stacks[0].StackStatus' --output text
```
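Stack creation is asynchronous, so the status command above may need to be repeated until the stack settles. The following sketch polls it in a loop. The function names `classify_stack_status` and `wait_for_usage_report_stack` and the 15-second interval are assumptions for illustration, not part of the repository.

```sh
# Hypothetical helper: map a CloudFormation stack status to
# "complete", "failed", "pending", or "unknown".
classify_stack_status() {
  case "$1" in
    CREATE_COMPLETE) echo complete ;;
    *ROLLBACK* | *FAILED) echo failed ;;
    *_IN_PROGRESS) echo pending ;;
    *) echo unknown ;;
  esac
}

# Hypothetical helper: poll the stack until creation completes or fails.
wait_for_usage_report_stack() {
  while :; do
    status=$(aws cloudformation describe-stacks \
      --stack-name "$USAGE_REPORT_OPERATOR_NAME" --region "$AWS_REGION" \
      --query 'Stacks[0].StackStatus' --output text) || return 1
    case "$(classify_stack_status "$status")" in
      complete) return 0 ;;
      pending) sleep 15 ;;
      *) echo "stack status: $status" >&2; return 1 ;;
    esac
  done
}
```

The AWS CLI also ships a built-in waiter, `aws cloudformation wait stack-create-complete --stack-name $USAGE_REPORT_OPERATOR_NAME`, which may be simpler if you do not need custom handling of failure states.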

CloudFormation Outputs

| Output Name | Description |
|-------------------|-------------------------------|
| DatabaseName | Name of the created database |
| UsageReportBucket | Name of the created S3 Bucket |
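The outputs listed above can be read from the command line once the stack is created. A minimal sketch, assuming the stack name is `$USAGE_REPORT_OPERATOR_NAME` as in the deployment command; the helper name `stack_output` is hypothetical.

```sh
# Hypothetical helper (not part of the repository): print one stack output
# value by its key, for example DatabaseName or UsageReportBucket.
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$USAGE_REPORT_OPERATOR_NAME" \
    --region "$AWS_REGION" \
    --query "Stacks[0].Outputs[?OutputKey=='$1'].OutputValue" \
    --output text
}

# Example calls (they require AWS credentials and an existing stack):
# stack_output DatabaseName
# stack_output UsageReportBucket
```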

Note

  • If the CloudFormation stack status indicates a ROLLBACK state, you can investigate the failure reason by using the AWS CLI command below or by checking the AWS CloudFormation console directly:

    ```sh
    aws cloudformation describe-stack-events \
      --stack-name $USAGE_REPORT_OPERATOR_NAME \
      --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[LogicalResourceId,ResourceStatusReason]'
    ```

  • Ensure that the eks-auth:AssumeRoleForPodIdentity permission is included in the IAM execution role for the SageMaker HyperPod cluster.
  • If the stack creation fails with the error eks-pod-identity-agent already exists, recreate the stack with the additional parameter ParameterKey=InstallPodIdentityAddon,ParameterValue=false:

    ```sh
    aws cloudformation create-stack \
      --region $AWS_REGION \
      --stack-name $USAGE_REPORT_OPERATOR_NAME \
      --template-body file://cloudformation/usage-report.yaml \
      --capabilities CAPABILITY_NAMED_IAM \
      --parameters \
        ParameterKey=EKSClusterName,ParameterValue=$EKS_CLUSTER_NAME \
        ParameterKey=HyperPodClusterId,ParameterValue=$HYPERPOD_CLUSTER_ID \
        ParameterKey=UsageReportInstallerRoleName,ParameterValue=$USAGE_REPORT_INSTALLER_ROLE_NAME \
        ParameterKey=UsageReportOperatorNameSpace,ParameterValue=$USAGE_REPORT_OPERATOR_NAME \
        ParameterKey=OperatorServiceAccount,ParameterValue=$USAGE_REPORT_OPERATOR_NAME \
        ParameterKey=InstallPodIdentityAddon,ParameterValue=false
    ```

Install the SageMaker HyperPod Usage Report Kubernetes Operator using Helm

Overview

The values.yaml Helm chart in the /helm_chart directory configures the SageMaker HyperPod usage report Kubernetes operator, which provisions and manages the following cluster resources:

  • Namespace: hyperpod-usage-report (default)
  • Service Account: hyperpod-usage-report (default)
  • RBAC rules granting the operator cluster-scoped permissions to:
    • Monitor cluster resources (clusterqueues, workloads, namespaces, pods)
    • Retrieve node-level metadata
    • Manage leader election (if there are multiple replicas of the operator) using Kubernetes leases
  • Kubernetes operator collecting and storing usage report data in S3.

Configure the Helm Chart

You can configure the Helm chart by either updating the values.yaml file or by providing parameters directly during the helm install command. Any values passed as parameters during installation override the settings in the values.yaml file.

| Parameter | Description | Default Value | Required | Notes |
|---------------------|-------------|-------------------------|----------|-------|
| replicaCount | Number of operator replicas to run | 2 | No | |
| namespace | Namespace where the operator will be installed | "hyperpod-usage-report" | No | Can be modified to deploy in a different namespace |
| serviceAccount.name | Name of the service account | "hyperpod-usage-report" | No | Can be modified if using custom naming |
| s3BucketName | Name of the S3 bucket that was created from the CloudFormation template | | Yes | Operator will start storing the usage report data in this bucket. |
| clusterName | Name of the EKS Cluster | | Yes | |
| region | Specify the AWS region | | Yes | example: us-west-2 |

Install the Helm Chart

To install the Helm chart, run the following command:

```sh
cd helm_chart

# retrieve s3 bucket name
USAGE_REPORT_S3_BUCKET=$(aws cloudformation describe-stack-resources \
  --stack-name $USAGE_REPORT_OPERATOR_NAME \
  --query 'StackResources[?ResourceType==`AWS::S3::Bucket`].PhysicalResourceId' \
  --output text)

# verification
echo $USAGE_REPORT_S3_BUCKET

helm install $USAGE_REPORT_OPERATOR_NAME \
  ./SageMakerHyperPodUsageReportChart \
  -n $USAGE_REPORT_OPERATOR_NAME \
  --set region=$AWS_REGION \
  --set serviceAccount.name=$USAGE_REPORT_OPERATOR_NAME \
  --set clusterName=$HYPERPOD_CLUSTER_NAME \
  --set s3BucketName=$USAGE_REPORT_S3_BUCKET
```

Verify the Operator Installation

Verify the operator installation:

```sh
kubectl get pods -n $USAGE_REPORT_OPERATOR_NAME
```

You can start submitting jobs to the cluster. Raw job usage data is stored in the S3 bucket path $USAGE_REPORT_S3_BUCKET/raw/.

Notes

  • Before installing the operator through the Helm chart, make sure the HyperPod Usage Report CloudFormation stack creation has completed.
  • A pre-existing namespace $USAGE_REPORT_OPERATOR_NAME is required to install the Helm chart (check with kubectl get namespaces). If you don't have it yet, refer to the prerequisites to create the namespace.
  • When uninstalling the $USAGE_REPORT_OPERATOR_NAME Helm chart, the associated namespace is automatically deleted, which invalidates the RBAC permissions. You must restore the namespace-level RBAC configurations previously set in the cluster by re-running the steps in the prerequisites section.

Generate Reports

Overview

You can use the run.py script to extract and export usage metrics for your SageMaker HyperPod cluster.

Install Required Dependencies

```sh
cd sagemaker-hyperpod-usage-report/report_generation
pip install -e .

# retrieve Athena database name
USAGE_REPORT_DATABASE=$(aws cloudformation describe-stack-resources \
  --stack-name $USAGE_REPORT_OPERATOR_NAME \
  --query 'StackResources[?ResourceType==`AWS::Glue::Database`].PhysicalResourceId' \
  --output text)

DATABASE_WORKGROUP_NAME=$(aws cloudformation describe-stack-resources \
  --stack-name $USAGE_REPORT_OPERATOR_NAME \
  --query 'StackResources[?ResourceType==`AWS::Athena::WorkGroup`].PhysicalResourceId' \
  --output text)

# verification
echo $USAGE_REPORT_DATABASE
echo $DATABASE_WORKGROUP_NAME
```

Generate the Report

To generate a usage report and export it to a specified S3 bucket, provide the following parameters to the run.py Python script:

Parameters for the run.py Script

| Parameter | Description | Example Value | Required |
|---------------------------|-------------|---------------|----------|
| --start-date | Beginning date for report data | 2025-04-15 | Yes |
| --end-date | Ending date for report data | 2025-04-17 | Yes |
| --format | Output format of the report | csv or pdf | Yes |
| --database-name | Name of the database to query | usage_report | Yes |
| --database-workgroup-name | Name of Athena's workgroup | usage_report_workgroup | Yes |
| --type | Type of report to generate | detailed or summary | Yes |
| --output-report-location | Directory where report will be saved | s3://bucket-name/path | Yes |
| --cluster-name | Name of the HyperPod cluster | my-hyperpod-cluster | Yes |
| --namespace | Filter report by namespace (optional) | ml-namespace-a | No |
| --task | Filter report by task name (optional) | training-job-1 | No |

Note:

  • Select a date range that falls within the previous 180 days from the current date (unless you customized DataRententionDays when installing the CloudFormation stack).

  • A good practice is to create a separate folder in your S3 bucket to serve as the destination for generated usage reports.

  • The --namespace parameter allows you to filter reports to show only data for a specific namespace. If not specified, the report will include data for all namespaces.

  • The --task parameter allows you to filter reports to show only data for a specific task. If not specified, the report will include data for all tasks.
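Because the date range has to stay inside the retention window, it can help to check the dates before calling run.py. The sketch below is a hypothetical helper, not something run.py provides: it takes the start date, the end date, and the earliest allowed date, all as YYYY-MM-DD, and compares them as integers after removing the dashes. How run.py itself treats the boundaries is not covered here.

```sh
# Hypothetical helper (not part of the repository).
# Usage: validate_report_range START END EARLIEST   (all YYYY-MM-DD)
validate_report_range() {
  for d in "$1" "$2" "$3"; do
    case "$d" in
      [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
      *) echo "not a YYYY-MM-DD date: $d" >&2; return 1 ;;
    esac
  done
  # 2025-04-15 becomes 20250415, so numeric order equals date order.
  start=$(printf '%s' "$1" | tr -d '-')
  end=$(printf '%s' "$2" | tr -d '-')
  earliest=$(printf '%s' "$3" | tr -d '-')
  if [ "$start" -gt "$end" ]; then
    echo "start date is after end date" >&2
    return 1
  fi
  if [ "$start" -lt "$earliest" ]; then
    echo "start date is older than the retention window" >&2
    return 1
  fi
}

# Example, assuming GNU date and the default 180-day retention
# (on macOS/BSD, "date -v-180d +%Y-%m-%d" is the equivalent):
# validate_report_range 2025-04-15 2025-04-17 "$(date -d '180 days ago' +%Y-%m-%d)"
```

The pattern only checks the shape of each date, not whether it exists on the calendar.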

Use the following command to generate and export the report:

```sh
python run.py \
  --start-date <Start date of the report, e.g. 2025-04-22> \
  --end-date <End date of the report, e.g. 2025-04-22> \
  --format <csv or pdf> \
  --database-name $USAGE_REPORT_DATABASE \
  --database-workgroup-name $DATABASE_WORKGROUP_NAME \
  --type <detailed or summary> \
  --output-report-location s3://$USAGE_REPORT_S3_BUCKET/<usage report output folder> \
  --cluster-name $HYPERPOD_CLUSTER_NAME \
  --namespace <namespace, optional> \
  --task <task name, optional>
```

Note

  • Ensure that the S3 bucket specified in --output-report-location has the necessary permissions to accept the report files.
  • The --cluster-name value should match the name of your SageMaker HyperPod cluster.
  • You can find all original captured data in the raw directory of your S3 bucket $USAGE_REPORT_S3_BUCKET/raw or in the Athena console.

Usage Examples

Generate a report for all namespaces (default behavior)

```sh
python run.py \
  --start-date 2025-04-15 \
  --end-date 2025-04-17 \
  --format csv \
  --database-name $USAGE_REPORT_DATABASE \
  --database-workgroup-name $DATABASE_WORKGROUP_NAME \
  --type summary \
  --output-report-location s3://$USAGE_REPORT_S3_BUCKET/reports/ \
  --cluster-name $HYPERPOD_CLUSTER_NAME
```

Generate a report filtered by namespace

```sh
python run.py \
  --start-date 2025-04-15 \
  --end-date 2025-04-17 \
  --format pdf \
  --database-name $USAGE_REPORT_DATABASE \
  --database-workgroup-name $DATABASE_WORKGROUP_NAME \
  --type detailed \
  --output-report-location s3://$USAGE_REPORT_S3_BUCKET/reports/ \
  --cluster-name $HYPERPOD_CLUSTER_NAME \
  --namespace ml-namespace-a
```

Generate a summary report for a specific namespace

```sh
python run.py \
  --start-date 2025-04-15 \
  --end-date 2025-04-17 \
  --format csv \
  --database-name $USAGE_REPORT_DATABASE \
  --database-workgroup-name $DATABASE_WORKGROUP_NAME \
  --type summary \
  --output-report-location s3://$USAGE_REPORT_S3_BUCKET/reports/ \
  --cluster-name $HYPERPOD_CLUSTER_NAME \
  --namespace data-science-namespace
```

Generate a report for a specific task

```sh
python run.py \
  --start-date 2025-04-15 \
  --end-date 2025-04-17 \
  --format csv \
  --database-name $USAGE_REPORT_DATABASE \
  --database-workgroup-name $DATABASE_WORKGROUP_NAME \
  --type summary \
  --output-report-location s3://$USAGE_REPORT_S3_BUCKET/reports/ \
  --cluster-name $HYPERPOD_CLUSTER_NAME \
  --task training-job-1
```

Generate a report for a specific namespace and task

```sh
python run.py \
  --start-date 2025-04-15 \
  --end-date 2025-04-17 \
  --format pdf \
  --database-name $USAGE_REPORT_DATABASE \
  --database-workgroup-name $DATABASE_WORKGROUP_NAME \
  --type detailed \
  --output-report-location s3://$USAGE_REPORT_S3_BUCKET/reports/ \
  --cluster-name $HYPERPOD_CLUSTER_NAME \
  --namespace ml-namespace-a \
  --task inference-job-2
```

Output File Naming Convention

The output file follows the naming convention: <report-type>-report-<start-date>-<end-date>.<format>.

When filters are applied, the filename includes the filter names after the date range: <report-type>-report-<start-date>-<end-date>-<namespace>-<task>.<format>.

Examples:

  • Summary report for all namespaces: summary-report-2025-04-15-2025-04-17.csv
  • Summary report for ML Namespace A: summary-report-2025-04-15-2025-04-17-ml-namespace-a.csv
  • Detailed report for Data Science Namespace: detailed-report-2025-04-15-2025-04-17-data-science-namespace.pdf
  • Report for specific namespace and task: summary-report-2025-04-15-2025-04-17-ml-namespace-a-training-job-1.csv
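If a script needs to locate a generated report, the naming convention above can be reproduced in a few lines. The helper `report_filename` below is a hypothetical sketch, not code from run.py. The task-only case (a task without a namespace) is an assumption extrapolated from the documented pattern.

```sh
# Hypothetical helper (not part of the repository).
# Usage: report_filename TYPE START END FORMAT [NAMESPACE] [TASK]
report_filename() {
  report_type=$1
  start_date=$2
  end_date=$3
  format=$4
  namespace=${5:-}
  task=${6:-}
  name="${report_type}-report-${start_date}-${end_date}"
  # Filters are appended after the date range, namespace first.
  if [ -n "$namespace" ]; then name="${name}-${namespace}"; fi
  if [ -n "$task" ]; then name="${name}-${task}"; fi
  printf '%s.%s\n' "$name" "$format"
}

# Prints: summary-report-2025-04-15-2025-04-17-ml-namespace-a.csv
report_filename summary 2025-04-15 2025-04-17 csv ml-namespace-a
```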

Clean Up Resources

Overview

When you no longer need your SageMaker HyperPod usage reporting infrastructure, follow these steps to clean up Kubernetes and AWS resources (in that order). Proper resource deletion helps prevent unnecessary costs.

Delete the Kubernetes Resources

To uninstall the Helm chart, run the following command:

```sh
cd sagemaker-hyperpod-usage-report/helm_chart
helm uninstall $USAGE_REPORT_OPERATOR_NAME --namespace $USAGE_REPORT_OPERATOR_NAME
```

Ensure that you uninstalled the SageMaker HyperPod usage report Kubernetes operator:

```sh
kubectl get pods --namespace $USAGE_REPORT_OPERATOR_NAME
```

Delete the AWS Resources

To delete the CloudFormation stack and the resources it created, run the following command:

```sh
aws cloudformation delete-stack --region $AWS_REGION --stack-name $USAGE_REPORT_OPERATOR_NAME
```

Ensure that the stack is properly deleted:

```sh
aws cloudformation describe-stacks --stack-name $USAGE_REPORT_OPERATOR_NAME \
  --region $AWS_REGION --query 'Stacks[0].StackStatus' --output text
```

Note: To prevent accidental deletion, you should delete the S3 buckets created by the CloudFormation stack manually:

  • $USAGE_REPORT_S3_BUCKET
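One way to perform the manual bucket deletion is `aws s3 rb` with `--force`, which removes the objects and then the bucket. The wrapper below is a hypothetical sketch that asks for the bucket name twice as a guard. It assumes the bucket is not versioned, because `--force` does not remove old object versions.

```sh
# Hypothetical helper (not part of the repository): delete a bucket only
# when its name is passed twice, as a guard against accidental deletion.
delete_usage_report_bucket() {
  bucket=${1:-}
  confirmation=${2:-}
  if [ -z "$bucket" ] || [ "$bucket" != "$confirmation" ]; then
    echo "refusing to delete: repeat the bucket name to confirm" >&2
    return 1
  fi
  # "rb --force" first deletes the objects, then the bucket itself.
  aws s3 rb "s3://$bucket" --force
}

# Example call (irreversible; requires AWS credentials):
# delete_usage_report_bucket "$USAGE_REPORT_S3_BUCKET" "$USAGE_REPORT_S3_BUCKET"
```

Export any reports you still need before running it, since the raw usage data is deleted along with the bucket.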

Local Development

Run Unit Tests

To run the unit tests locally:

```bash
cd report_generation
pytest
```

This will execute all test cases in the test directory. The test suite includes unit tests for all major components of the usage report functionality.

Attributions and Open Source Acknowledgments

See ./attributions for credits.

Contributing

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

GitHub Events

Total
  • Watch event: 1
  • Delete event: 6
  • Public event: 1
  • Push event: 7
  • Pull request review comment event: 11
  • Pull request review event: 9
  • Pull request event: 13
  • Create event: 4
Last Year
  • Watch event: 1
  • Delete event: 6
  • Public event: 1
  • Push event: 7
  • Pull request review comment event: 11
  • Pull request review event: 9
  • Pull request event: 13
  • Create event: 4

Dependencies

report_generation/requirements.txt pypi
  • awswrangler >=3.0.0
  • boto3 >=1.26.0
  • coverage >=7.0.0
  • fpdf >=1.7.2
  • mock >=5.0.0
  • pandas >=1.5.0
  • pytest >=7.0.0
  • pytest-cov >=4.0.0
report_generation/setup.py pypi
  • awswrangler >=3.0.0
  • boto3 >=1.26.0
  • fpdf >=1.7.2
  • pandas >=1.5.0