https://github.com/awslabs/sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors


Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    3 of 44 committers (6.8%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.0%) to scientific vocabulary

Keywords

amazon aws debugging deep-learning machine-learning sagemaker

Keywords from Contributors

mxnet onnx cryptocurrencies cryptography transformers jax deep-neural-networks pose-estimation person-reid semantic-segmentation
Last synced: 5 months ago

Repository

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors

Basic Info
  • Host: GitHub
  • Owner: awslabs
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 12.3 MB
Statistics
  • Stars: 162
  • Watchers: 23
  • Forks: 83
  • Open Issues: 88
  • Releases: 44
Topics
amazon aws debugging deep-learning machine-learning sagemaker
Created over 6 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Contributing License Code of conduct Codeowners

README.md

Amazon SageMaker Debugger


Overview

Amazon SageMaker Debugger automates the debugging process for machine learning training jobs. Debugger lets you run your own training script unchanged (the Zero Script Change experience) while using the built-in Hook and Rule features to capture tensors, gives you the flexibility to build customized Hooks and Rules that configure exactly which tensors to save, and makes those tensors available for analysis by saving them in an Amazon S3 bucket, all through a flexible and powerful API.

The smdebug library powers Debugger by retrieving the saved tensors from the S3 bucket during the training job. smdebug retrieves and filters the tensors generated by Debugger, such as gradients, weights, and biases.
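
For instance, once a job has saved tensors, you can query and filter them through smdebug's trial API. The following is a minimal sketch; the S3 prefix and tensor name are hypothetical placeholders:

```python
from smdebug.trials import create_trial

# Hypothetical S3 prefix where a Debugger job saved its tensors
trial = create_trial("s3://my-bucket/debug-output")

# Filter the saved tensors by regex, e.g. keep only gradient tensors
print(trial.tensor_names(regex=".*gradient.*"))

# Inspect one tensor (the name is illustrative) at its last saved step
t = trial.tensor("dense/kernel:0")
print(t.steps(), t.value(t.steps()[-1]))
```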

Debugger helps you develop better, faster, and cheaper models by requiring only minimal modifications to your estimator, tracing tensors, catching anomalies during training, and supporting iterative model pruning.

Debugger supports TensorFlow, PyTorch, MXNet, and XGBoost frameworks.

The following list is a summary of the main functionalities of Debugger:

  • Run and debug training jobs of your model on SageMaker when using supported containers
  • No changes needed to your training script if using AWS Deep Learning Containers with Debugger fully integrated
  • Minimal changes to your training script if using AWS containers with script mode or custom containers
  • Full visibility into any tensor retrieved from targeted parts of the training jobs
  • Real-time training job monitoring through Rules
  • Automated anomaly detection and state assertions through built-in and custom Rules on SageMaker
  • Actions on your training jobs based on the status of Rules
  • Interactive exploration of saved tensors
  • Distributed training support
  • TensorBoard support

See How it works for more details.


Install the smdebug library

The smdebug library runs on Python 3. Install using the following command:

```
pip install smdebug
```
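
To confirm the installation, a quick check (a sketch, assuming the installed smdebug release exposes a `__version__` attribute):

```python
# Print the installed smdebug version (assumes __version__ is exposed)
import smdebug
print(smdebug.__version__)
```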


Debugger-supported Frameworks

For a complete overview of Amazon SageMaker Debugger and how it works, see the Use Debugger in AWS Containers developer guide.

AWS Deep Learning Containers with zero code change

Debugger is installed by default in AWS Deep Learning Containers with TensorFlow, PyTorch, MXNet, and XGBoost. The following framework containers enable you to use Debugger with no changes to your training script, by automatically adding SageMaker Debugger's Hook.

The following framework versions are available in AWS Deep Learning Containers for the zero script change experience.

| Framework | Version |
| --- | --- |
| TensorFlow | 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1 |
| MXNet | 1.6, 1.7 |
| PyTorch | 1.4, 1.5, 1.6 |
| XGBoost | 0.90-2, 1.0-1 (as a built-in algorithm) |

Note: Debugger with zero script change is partially available for TensorFlow v2.1.0. The inputs, outputs, gradients, and layers built-in collections are currently not available for this TensorFlow version.

AWS training containers with script mode

The smdebug library supports frameworks other than the ones listed above while using AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script.

| Framework | Versions |
| --- | --- |
| TensorFlow | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1 |
| Keras (with TensorFlow backend) | 2.3 |
| MXNet | 1.4, 1.5, 1.6, 1.7 |
| PyTorch | 1.2, 1.3, 1.4, 1.5, 1.6 |
| XGBoost | 0.90-2, 1.0-1 (as a framework) |

Debugger on custom containers or local machines

You can also use the full set of Debugger features in custom containers with the SageMaker Python SDK. Furthermore, smdebug is an open-source library, so you can install it on your local machine for advanced use cases that cannot run in the SageMaker environment and for constructing smdebug custom hooks and rules.


How It Works

Amazon SageMaker Debugger uses the construct of a Hook to save the values of requested tensors throughout the training process. You can then set up a Rule job that simultaneously monitors and validates these tensors to ensure that training is progressing as expected.

A Rule checks for conditions such as vanishing gradients, exploding tensor values, or poor weight initialization. Rules are attached to Amazon CloudWatch events, so that when a rule is triggered it changes the state of the corresponding CloudWatch event. You can configure any action on the CloudWatch event, such as stopping the training job, saving you time and money.
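
As an illustration of such an action, the sketch below stops the training job from a Lambda function subscribed to the training job state-change event. The event field names are assumptions to verify against the actual event payload; only stop_training_job is a documented SageMaker API call.

```python
import boto3

sagemaker_client = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Assumed event shape: verify against a real
    # "SageMaker Training Job State Change" CloudWatch event.
    detail = event.get("detail", {})
    job_name = detail["TrainingJobName"]  # assumed field name
    statuses = detail.get("DebugRuleEvaluationStatuses", [])  # assumed field name
    if any(s.get("RuleEvaluationStatus") == "IssuesFound" for s in statuses):
        # Stop the job to avoid paying for training that is going wrong
        sagemaker_client.stop_training_job(TrainingJobName=job_name)
```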

Debugger can be used inside or outside of SageMaker. However, the built-in rules that AWS provides are only available for SageMaker training. Usage scenarios can be classified into the following cases.

Using SageMaker Debugger on AWS Deep Learning Containers with zero training script change

Use Debugger built-in hook configurations and rules while setting up the estimator and monitor your training job.

For a full guide and examples of using the built-in rules, see Running a Rule with zero script change on AWS Deep Learning Containers.

To see a complete list of built-in rules and their functionalities, see List of Debugger Built-in Rules.

Using SageMaker Debugger on AWS training containers with script mode

You can use Debugger with your training script on your own container with only a minimal modification to your training script to add Debugger's Hook. For an example code template for using Debugger in your own container with TensorFlow 2.x frameworks, see Run Debugger in custom container. See the following instruction pages to set up Debugger in your preferred framework:

  • TensorFlow
  • MXNet
  • PyTorch
  • XGBoost

Using SageMaker Debugger on custom containers

Debugger is available for any deep learning model that you bring to Amazon SageMaker. The AWS CLI, the SageMaker Estimator API, and the Debugger APIs enable you to use any Docker base image to build and customize containers to train and debug your models. To use Debugger with customized containers, go to Use Debugger in Custom Training Containers.
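
As a sketch of the SageMaker Python SDK side of this setup (the image URI, role, and S3 path below are hypothetical placeholders; your container must also register the smdebug hook in its training script):

```python
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig
from sagemaker.estimator import Estimator

# Generic Estimator pointing at a custom training image (URI is hypothetical)
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # Tell Debugger where to save tensors and which collections to capture
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path="s3://my-bucket/debug-output",  # hypothetical bucket
        collection_configs=[CollectionConfig(name="losses")],
    ),
)
estimator.fit()
```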

Using SageMaker Debugger on a non-SageMaker environment

Using the smdebug library, you can create custom hooks and rules (or manually analyze the tensors) and modify your training script to enable tensor analysis in a non-SageMaker environment, such as your local machine. For an example, see Run Debugger locally.
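
For example, a custom rule subclasses smdebug's Rule and implements invoke_at_step to inspect the trial's tensors at each saved step. The following sketch follows the custom-rule pattern from the smdebug analysis documentation; the rule name, collection, threshold, and output path are chosen for illustration:

```python
from smdebug.rules.rule import Rule
from smdebug.rules.rule_invoker import invoke_rule
from smdebug.trials import create_trial

class GradientTooLargeRule(Rule):
    """Illustrative rule: fires when any gradient's mean absolute value is too large."""

    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(collection="gradients"):
            abs_mean = self.base_trial.tensor(tname).reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True  # rule condition met
        return False

trial = create_trial("~/smd_outputs/")  # hypothetical local output path
invoke_rule(GradientTooLargeRule(trial, threshold=5.0))
```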


Examples

SageMaker Notebook Examples

To find a collection of demonstrations using Debugger, see SageMaker Debugger Example Notebooks.

Run Debugger rules with zero script change

This example shows how to use Debugger with Zero Script Change for your training script on a SageMaker DLC.

```python
import sagemaker as sm
from sagemaker.debugger import rule_configs, Rule, CollectionConfig

# Choose a built-in rule to monitor your training job
rule = Rule.sagemaker(
    rule_configs.exploding_tensor(),
    # configure your rule if applicable
    rule_parameters={"tensor_regex": ".*"},
    # specify collections to save for processing your rule
    collections_to_save=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="losses"),
    ],
)

# Pass the rule to the estimator
sagemaker_simple_estimator = sm.tensorflow.TensorFlow(
    entry_point="script.py",  # replace script.py with your own training script
    role=sm.get_execution_role(),
    framework_version="1.15",
    py_version="py3",
    # argument for smdebug below
    rules=[rule],
)

sagemaker_simple_estimator.fit()
tensors_path = sagemaker_simple_estimator.latest_job_debugger_artifacts_path()

import smdebug.trials as smd
trial = smd.create_trial(out_dir=tensors_path)
print(f"Saved these tensors: {trial.tensor_names()}")
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
```

That's it! When you configure the sagemaker_simple_estimator, you simply specify the entry_point to your training script. When you call the sagemaker_simple_estimator.fit() API, SageMaker automatically monitors your training job with the specified Rules and creates a CloudWatch event that tracks each Rule's status, so you can take action based on it.

If you want additional configuration and control, see Running SageMaker jobs with Debugger for more information.

Run Debugger in custom container

The following example shows how to set up a hook to debug a training model using Debugger in your own container. The example is for containers using the TensorFlow 2.x framework, with GradientTape used to configure the hook.

```python
import smdebug.tensorflow as smd
hook = smd.KerasHook(out_dir=args.out_dir)

model = tf.keras.models.Sequential([ ... ])
for epoch in range(n_epochs):
    for data, labels in dataset:
        dataset_labels = labels
        # wrap the tape to capture tensors
        with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
            logits = model(data, training=True)  # (32, 10)
            loss_value = cce(labels, logits)
        grads = tape.gradient(loss_value, model.variables)
        opt.apply_gradients(zip(grads, model.variables))
        acc = train_acc_metric(dataset_labels, logits)
        # manually save metric values
        hook.record_tensor_value(tensor_name="accuracy", tensor_value=acc)
```

To see the full script, refer to the tf_keras_gradienttape.py example script. For a notebook example of using BYOC in PyTorch, see Using Amazon SageMaker Debugger with Your Own PyTorch Container.

Run Debugger locally

This example shows how to use Debugger for the Keras model.fit() API.

To use Debugger, simply add a callback hook:

```python
import smdebug.tensorflow as smd
hook = smd.KerasHook(out_dir='~/smd_outputs/')

model = tf.keras.models.Sequential([ ... ])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)

# Add the hook as a callback
model.fit(x_train, y_train, epochs=2, callbacks=[hook])
model.evaluate(x_test, y_test, callbacks=[hook])

# Create a trial to inspect the saved tensors
trial = smd.create_trial(out_dir='~/smd_outputs/')
print(f"Saved these tensors: {trial.tensor_names()}")
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
```


SageMaker Debugger in Action

  • Through the model pruning process using Debugger and smdebug, you can iteratively identify the importance of weights and cut neurons below a threshold you define. This process lets you train the model with significantly fewer neurons, which means a lighter, more efficient, faster, and cheaper model without compromising accuracy. The following accuracy-versus-number-of-parameters graph, produced in Studio, shows that model accuracy started at about 0.9 with 12 million parameters (the data point moves from right to left as pruning progresses), improved during the first few pruning iterations, held steady until the parameter count was cut down to 6 million, and started to degrade afterwards.

Debugger Iterative Model Pruning using ResNet (figure)

Debugger provides tools to observe this training process and gives you complete control over your model. See the Using SageMaker Debugger and SageMaker Experiments for iterative model pruning notebook for the full example and more information.

  • Use Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training. Debugger XGBoost Visualization Example

  • Use Debugger with TensorFlow in SageMaker Studio to run built-in rules and visualize the loss. Debugger TensorFlow Visualization Example


Further Documentation and References

| Section | Description |
| --- | --- |
| SageMaker Training | For SageMaker users, we recommend starting with this page on how to run SageMaker training jobs with SageMaker Debugger |
| Frameworks | See the framework pages for details on what's supported and how to modify your training script if applicable |
| APIs for Saving Tensors | Full description of our APIs for saving tensors |
| Programming Model for Analysis | Description of the programming model provided by the APIs that lets you interactively explore saved tensors and write your own Rules to monitor training jobs |

License

This library is licensed under the Apache 2.0 License.

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

GitHub Events

Total
  • Watch event: 4
  • Issue comment event: 1
  • Member event: 1
Last Year
  • Watch event: 4
  • Issue comment event: 1
  • Member event: 1

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 773
  • Total Committers: 44
  • Avg Commits per committer: 17.568
  • Development Distribution Score (DDS): 0.745
Top Committers
Name Email Commits
Rahul Huilgol h****r@a****m 197
Nihal Harish n****h@g****m 163
Jared T Nielsen j****n@g****m 58
Amol Lele 1****l@u****m 43
Vikas-kum v****r@a****m 40
Neelesh Dodda n****a@a****m 32
mariumof 9****f@u****m 27
Edward J Kim k****w@a****m 24
Vandana Kannan v****k@u****m 22
Kaustubh Milindrao Sardar k****r@g****m 21
Denis Davydenko d****a@g****m 18
Anirudh a****c@g****m 14
Andrea Olgiati o****g@a****m 13
lakshya97 l****n@b****u 13
Allen Liu l****5@o****m 12
Nathalie Rauschmayr n****r@g****m 8
Vikas89 d****4@g****m 7
Ubuntu u****u@i****l 7
Zeeshan Ashraf s****f@g****m 7
Miyoung m****9@g****m 5
Tianyi Wei t****5@j****u 5
adimux a****m@c****e 4
Ben Snyder j****r@g****m 4
Jihyeong Lee l****n@a****m 3
zaoliu-aws 1****s@u****m 3
dependabot[bot] 4****]@u****m 2
Danny Key 1****n@u****m 2
ShiboXing s****6@p****u 2
Pedro Larroy p****s@g****m 2
Abhinav Sharma a****1@g****m 1
and 14 more...

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 9
  • Total pull requests: 97
  • Average time to close issues: 3 months
  • Average time to close pull requests: 12 days
  • Total issue authors: 9
  • Total pull request authors: 21
  • Average comments per issue: 0.89
  • Average comments per pull request: 0.55
  • Merged pull requests: 69
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • anotinelg (1)
  • fredsensibill (1)
  • tejaschumbalkar (1)
  • NRauschmayr (1)
  • vandanavk (1)
  • ChristopherBrix (1)
  • plamb-viso (1)
  • haixiw (1)
  • ZhangPei25 (1)
Pull Request Authors
  • mariumof (30)
  • yl-to (19)
  • MZSHAN (9)
  • ShiboXing (4)
  • johnbensnyder (4)
  • jleeleee (4)
  • zaoliu-aws (4)
  • dependabot[bot] (4)
  • adimux (3)
  • atqy (3)
  • ztlevi (2)
  • dkey-amazon (2)
  • NRauschmayr (2)
  • ntw-au (1)
  • josephevans (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (4)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 64,851 last month
  • Total dependent packages: 0
  • Total dependent repositories: 100
  • Total versions: 297
  • Total maintainers: 1
pypi.org: smdebug

Amazon SageMaker Debugger is an offering from AWS which helps you automate the debugging of machine learning training jobs.

  • Versions: 297
  • Dependent Packages: 0
  • Dependent Repositories: 100
  • Downloads: 64,851 last month
  • Docker Downloads: 0
Rankings
Downloads: 1.4%
Dependent repos count: 1.5%
Docker downloads count: 4.1%
Average: 4.6%
Forks count: 4.8%
Stargazers count: 5.7%
Dependent packages count: 10.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

config/profiler/requirements.txt pypi
  • awscli ==1.22.75
  • flaky ==3.7.0
  • pandas ==1.2.5
  • pip >=20.3.3
  • pre-commit ==2.13.0
  • pyYaml ==5.4.1
  • pytest ==6.2.4
  • pytest-cov ==2.12.1
  • pytest-html ==3.1.1
  • pytest-xdist ==2.3.0
  • sagemaker ==2.47.1
  • wheel ==0.36.2