pytorch_optimizer

optimizer & lr scheduler & loss function collections in PyTorch

https://github.com/kozistr/pytorch_optimizer

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (5.6%) to scientific vocabulary

Keywords

adabelief adai adamp adan ademamix deep-learning diffgrad gradient-centralization learning-rate-scheduling lookahead loss-functions madgrad muon optimizer pytorch radam ranger sam scion

Keywords from Contributors

interactive mesh interpretability profiles sequences generic projection embedded hacking network-simulation
Last synced: 6 months ago

Repository

optimizer & lr scheduler & loss function collections in PyTorch

Basic Info
Statistics
  • Stars: 348
  • Watchers: 7
  • Forks: 29
  • Open Issues: 10
  • Releases: 0
Topics
adabelief adai adamp adan ademamix deep-learning diffgrad gradient-centralization learning-rate-scheduling lookahead loss-functions madgrad muon optimizer pytorch radam ranger sam scion
Created over 4 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Code of conduct Citation Codeowners Security

README.md

pytorch-optimizer

| | |
|---------|---|
| Build | workflow, Documentation Status |
| Quality | codecov, CodeQL, black, ruff, Type Checker |
| Package | PyPI version, PyPI pyversions |
| Status | PyPI downloads, PyPI monthly downloads |
| License | apache |

Reasons to use pytorch-optimizer:

  • Wide range of supported optimizers. Currently, 128 optimizers (+ bitsandbytes, qgalore, torchao), 16 lr schedulers, and 13 loss functions are supported!
  • Including many variants such as ADOPT, Cautious, AdamD, StableAdamW, and Gradient Centralization
  • Easy to use, clean, and tested code
  • Active maintenance
  • Somewhat more optimized than the original implementations

Highly inspired by pytorch-optimizer.

Getting Started

For more, see the stable documentation or latest documentation.

Most optimizers are under the MIT or Apache 2.0 license, but a few, such as Fromage and Nero, are under the CC BY-NC-SA 4.0 license, which is non-commercial. Please double-check the license before using them in your work.

Installation

```bash
$ pip3 install pytorch-optimizer
```

From v2.12.0 and v3.1.0, you can use bitsandbytes, q-galore-torch, and torchao optimizers, respectively. Please check the bnb requirements and the q-galore-torch and torchao installation guides before installing them.

From v3.0.0, Python 3.7 support is dropped. However, you can still install this package on Python 3.7 with the --ignore-requires-python option.

Simple Usage

```python
from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters())

# or you can use the optimizer loader, simply passing the name of the optimizer.

from pytorch_optimizer import load_optimizer

optimizer = load_optimizer(optimizer='adamp')(model.parameters())

# if you install the bitsandbytes package, you can use its 8-bit optimizers from pytorch-optimizer.

optimizer = load_optimizer(optimizer='bnb_adamw8bit')(model.parameters())
```

Also, you can load the optimizer via torch.hub.

```python
import torch

model = YourModel()

opt = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt(model.parameters())
```

If you want to build the optimizer with parameters & configs, there's the create_optimizer() API.

```python
from pytorch_optimizer import create_optimizer

optimizer = create_optimizer(
    model,
    'adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
    use_orthograd=False,
)
```

Supported Optimizers

You can check the supported optimizers with the code below.

```python
from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()
```

or you can also search them with the filter(s).

```python
from pytorch_optimizer import get_supported_optimizers

get_supported_optimizers('adam*')
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw']

get_supported_optimizers(['adam', 'ranger'])
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw', 'ranger', 'ranger21']
```

| Optimizer | Description | Official Code | Paper | Citation |
|-----------|-------------|---------------|-------|----------|
| AdaBelief | Adapting Step-sizes by the Belief in Observed Gradients | github | https://arxiv.org/abs/2010.07468 | cite |
| AdaBound | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | github | https://openreview.net/forum?id=Bkg3g2R9FX | cite |
| AdaHessian | An Adaptive Second Order Optimizer for Machine Learning | github | https://arxiv.org/abs/2006.00719 | cite |
| AdamD | Improved bias-correction in Adam | | https://arxiv.org/abs/2110.10828 | cite |
| AdamP | Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | github | https://arxiv.org/abs/2006.08217 | cite |
| diffGrad | An Optimization Method for Convolutional Neural Networks | github | https://arxiv.org/abs/1909.11015v3 | cite |
| MADGRAD | A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic | github | https://arxiv.org/abs/2101.11075 | cite |
| RAdam | On the Variance of the Adaptive Learning Rate and Beyond | github | https://arxiv.org/abs/1908.03265 | cite |
| Ranger | a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer | github | https://bit.ly/3zyspC3 | cite |
| Ranger21 | a synergistic deep learning optimizer | github | https://arxiv.org/abs/2106.13731 | cite |
| Lamb | Large Batch Optimization for Deep Learning | github | https://arxiv.org/abs/1904.00962 | cite |
| Shampoo | Preconditioned Stochastic Tensor Optimization | github | https://arxiv.org/abs/1802.09568 | cite |
| Nero | Learning by Turning: Neural Architecture Aware Optimisation | github | https://arxiv.org/abs/2102.07227 | cite |
| Adan | Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | github | https://arxiv.org/abs/2208.06677 | cite |
| Adai | Disentangling the Effects of Adaptive Learning Rate and Momentum | github | https://arxiv.org/abs/2006.15815 | cite |
| SAM | Sharpness-Aware Minimization | github | https://arxiv.org/abs/2010.01412 | cite |
| ASAM | Adaptive Sharpness-Aware Minimization | github | https://arxiv.org/abs/2102.11600 | cite |
| GSAM | Surrogate Gap Guided Sharpness-Aware Minimization | github | https://openreview.net/pdf?id=edONMAnhLu- | cite |
| D-Adaptation | Learning-Rate-Free Learning by D-Adaptation | github | https://arxiv.org/abs/2301.07733 | cite |
| AdaFactor | Adaptive Learning Rates with Sublinear Memory Cost | github | https://arxiv.org/abs/1804.04235 | cite |
| Apollo | An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | github | https://arxiv.org/abs/2009.13586 | cite |
| NovoGrad | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | github | https://arxiv.org/abs/1905.11286 | cite |
| Lion | Symbolic Discovery of Optimization Algorithms | github | https://arxiv.org/abs/2302.06675 | cite |
| Ali-G | Adaptive Learning Rates for Interpolation with Gradients | github | https://arxiv.org/abs/1906.05661 | cite |
| SM3 | Memory-Efficient Adaptive Optimization | github | https://arxiv.org/abs/1901.11150 | cite |
| AdaNorm | Adaptive Gradient Norm Correction based Optimizer for CNNs | github | https://arxiv.org/abs/2210.06364 | cite |
| RotoGrad | Gradient Homogenization in Multitask Learning | github | https://openreview.net/pdf?id=T8wHz4rnuGL | cite |
| A2Grad | Optimal Adaptive and Accelerated Stochastic Gradient Descent | github | https://arxiv.org/abs/1810.00553 | cite |
| AccSGD | Accelerating Stochastic Gradient Descent For Least Squares Regression | github | https://arxiv.org/abs/1704.08227 | cite |
| SGDW | Decoupled Weight Decay Regularization | github | https://arxiv.org/abs/1711.05101 | cite |
| ASGD | Adaptive Gradient Descent without Descent | github | https://arxiv.org/abs/1910.09529 | cite |
| Yogi | Adaptive Methods for Nonconvex Optimization | | NIPS 2018 | cite |
| SWATS | Improving Generalization Performance by Switching from Adam to SGD | | https://arxiv.org/abs/1712.07628 | cite |
| Fromage | On the distance between two neural networks and the stability of learning | github | https://arxiv.org/abs/2002.03432 | cite |
| MSVAG | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | github | https://arxiv.org/abs/1705.07774 | cite |
| AdaMod | An Adaptive and Momental Bound Method for Stochastic Learning | github | https://arxiv.org/abs/1910.12249 | cite |
| AggMo | Aggregated Momentum: Stability Through Passive Damping | github | https://arxiv.org/abs/1804.00325 | cite |
| QHAdam | Quasi-hyperbolic momentum and Adam for deep learning | github | https://arxiv.org/abs/1810.06801 | cite |
| PID | A PID Controller Approach for Stochastic Optimization of Deep Networks | github | CVPR 18 | cite |
| Gravity | a Kinematic Approach on Optimization in Deep Learning | github | https://arxiv.org/abs/2101.09192 | cite |
| AdaSmooth | An Adaptive Learning Rate Method based on Effective Ratio | | https://arxiv.org/abs/2204.00825v1 | cite |
| SRMM | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | github | https://arxiv.org/abs/2201.01652 | cite |
| AvaGrad | Domain-independent Dominance of Adaptive Methods | github | https://arxiv.org/abs/1912.01823 | cite |
| PCGrad | Gradient Surgery for Multi-Task Learning | github | https://arxiv.org/abs/2001.06782 | cite |
| AMSGrad | On the Convergence of Adam and Beyond | | https://openreview.net/pdf?id=ryQu7f-RZ | cite |
| Lookahead | k steps forward, 1 step back | github | https://arxiv.org/abs/1907.08610 | cite |
| PNM | Manipulating Stochastic Gradient Noise to Improve Generalization | github | https://arxiv.org/abs/2103.17182 | cite |
| GC | Gradient Centralization | github | https://arxiv.org/abs/2004.01461 | cite |
| AGC | Adaptive Gradient Clipping | github | https://arxiv.org/abs/2102.06171 | cite |
| Stable WD | Understanding and Scheduling Weight Decay | github | https://arxiv.org/abs/2011.11152 | cite |
| Softplus T | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | | https://arxiv.org/abs/1908.00700 | cite |
| Un-tuned w/u | On the adequacy of untuned warmup for adaptive optimization | | https://arxiv.org/abs/1910.04209 | cite |
| Norm Loss | An efficient yet effective regularization method for deep neural networks | | https://arxiv.org/abs/2103.06583 | cite |
| AdaShift | Decorrelation and Convergence of Adaptive Learning Rate Methods | github | https://arxiv.org/abs/1810.00143v4 | cite |
| AdaDelta | An Adaptive Learning Rate Method | | https://arxiv.org/abs/1212.5701v1 | cite |
| Amos | An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | github | https://arxiv.org/abs/2210.11693 | cite |
| SignSGD | Compressed Optimisation for Non-Convex Problems | github | https://arxiv.org/abs/1802.04434 | cite |
| Sophia | A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | github | https://arxiv.org/abs/2305.14342 | cite |
| Prodigy | An Expeditiously Adaptive Parameter-Free Learner | github | https://arxiv.org/abs/2306.06101 | cite |
| PAdam | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | github | https://arxiv.org/abs/1806.06763 | cite |
| LOMO | Full Parameter Fine-tuning for Large Language Models with Limited Resources | github | https://arxiv.org/abs/2306.09782 | cite |
| AdaLOMO | Low-memory Optimization with Adaptive Learning Rate | github | https://arxiv.org/abs/2310.10195 | cite |
| Tiger | A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious | github | | cite |
| CAME | Confidence-guided Adaptive Memory Efficient Optimization | github | https://aclanthology.org/2023.acl-long.243/ | cite |
| WSAM | Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term | github | https://arxiv.org/abs/2305.15817 | cite |
| Aida | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | github | https://arxiv.org/abs/2203.13273 | cite |
| GaLore | Memory-Efficient LLM Training by Gradient Low-Rank Projection | github | https://arxiv.org/abs/2403.03507 | cite |
| Adalite | Adalite optimizer | github | https://github.com/VatsaDev/adalite | cite |
| bSAM | SAM as an Optimal Relaxation of Bayes | github | https://arxiv.org/abs/2210.01620 | cite |
| Schedule-Free | Schedule-Free Optimizers | github | https://github.com/facebookresearch/schedule_free | cite |
| FAdam | Adam is a natural gradient optimizer using diagonal empirical Fisher information | github | https://arxiv.org/abs/2405.12807 | cite |
| Grokfast | Accelerated Grokking by Amplifying Slow Gradients | github | https://arxiv.org/abs/2405.20233 | cite |
| Kate | Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad | github | https://arxiv.org/abs/2403.02648 | cite |
| StableAdamW | Stable and low-precision training for large-scale vision-language models | | https://arxiv.org/abs/2304.13013 | cite |
| AdamMini | Use Fewer Learning Rates To Gain More | github | https://arxiv.org/abs/2406.16793 | cite |
| TRAC | Adaptive Parameter-free Optimization | github | https://arxiv.org/abs/2405.16642 | cite |
| AdamG | Towards Stability of Parameter-free Optimization | | https://arxiv.org/abs/2405.04376 | cite |
| AdEMAMix | Better, Faster, Older | github | https://arxiv.org/abs/2409.03137 | cite |
| SOAP | Improving and Stabilizing Shampoo using Adam | github | https://arxiv.org/abs/2409.11321 | cite |
| ADOPT | Modified Adam Can Converge with Any β2 with the Optimal Rate | github | https://arxiv.org/abs/2411.02853 | cite |
| FTRL | Follow The Regularized Leader | | https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41159.pdf | |
| Cautious | Improving Training with One Line of Code | github | https://arxiv.org/pdf/2411.16085v1 | cite |
| DeMo | Decoupled Momentum Optimization | github | https://arxiv.org/abs/2411.19870 | cite |
| MicroAdam | Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | github | https://arxiv.org/abs/2405.15593 | cite |
| Muon | MomentUm Orthogonalized by Newton-schulz | github | https://x.com/kellerjordan0/status/1842300916864844014 | cite |
| LaProp | Separating Momentum and Adaptivity in Adam | github | https://arxiv.org/abs/2002.04839 | cite |
| APOLLO | SGD-like Memory, AdamW-level Performance | github | https://arxiv.org/abs/2412.05270 | cite |
| MARS | Unleashing the Power of Variance Reduction for Training Large Models | github | https://arxiv.org/abs/2411.10438 | cite |
| SGDSaI | No More Adam: Learning Rate Scaling at Initialization is All You Need | github | https://arxiv.org/abs/2411.10438 | cite |
| Grams | Gradient Descent with Adaptive Momentum Scaling | | https://arxiv.org/abs/2412.17107 | cite |
| OrthoGrad | Grokking at the Edge of Numerical Stability | github | https://arxiv.org/abs/2501.04697 | cite |
| Adam-ATAN2 | Scaling Exponents Across Parameterizations and Optimizers | | https://arxiv.org/abs/2407.05872 | cite |
| SPAM | Spike-Aware Adam with Momentum Reset for Stable LLM Training | github | https://arxiv.org/abs/2501.06842 | cite |
| TAM | Torque-Aware Momentum | | https://arxiv.org/abs/2412.18790 | cite |
| FOCUS | First Order Concentrated Updating Scheme | github | https://arxiv.org/abs/2501.12243 | cite |
| PSGD | Preconditioned Stochastic Gradient Descent | github | https://arxiv.org/abs/1512.04202 | cite |
| EXAdam | The Power of Adaptive Cross-Moments | github | https://arxiv.org/abs/2412.20302 | cite |
| GCSAM | Gradient Centralized Sharpness Aware Minimization | github | https://arxiv.org/abs/2501.11584 | cite |
| LookSAM | Towards Efficient and Scalable Sharpness-Aware Minimization | github | https://arxiv.org/abs/2203.02714 | cite |
| SCION | Training Deep Learning Models with Norm-Constrained LMOs | github | https://arxiv.org/abs/2502.07529 | cite |
| COSMOS | SOAP with Muon | github | | |
| StableSPAM | How to Train in 4-Bit More Stably than 16-Bit Adam | github | https://arxiv.org/abs/2502.17055 | |
| AdaGC | Improving Training Stability for Large Language Model Pretraining | | https://arxiv.org/abs/2502.11034 | cite |
| Simplified-Ademamix | Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants | github | https://arxiv.org/abs/2502.02431 | cite |
| Fira | Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? | github | https://arxiv.org/abs/2410.01623 | cite |
| RACS & Alice | Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension | | https://arxiv.org/pdf/2502.07752 | cite |
| VSGD | Variational Stochastic Gradient Descent for Deep Neural Networks | github | https://openreview.net/forum?id=xu4ATNjcdy | cite |
| SNSM | Subset-Norm and Subspace-Momentum: Faster Memory-Efficient Adaptive Optimization with Convergence Guarantees | github | https://arxiv.org/abs/2411.07120 | cite |
| AdamC | Why Gradients Rapidly Increase Near the End of Training | | https://arxiv.org/abs/2506.02285 | cite |
| AdaMuon | Adaptive Muon Optimizer | | https://arxiv.org/abs/2507.11005v1 | cite |
| SPlus | A Stable Whitening Optimizer for Efficient Neural Network Training | github | https://arxiv.org/abs/2506.07254 | cite |
| EmoNavi | An emotion-driven optimizer that feels loss and navigates accordingly | github | | |
| Refined Schedule-Free | Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training | | https://arxiv.org/abs/2507.09846 | cite |
| FriendlySAM | Friendly Sharpness-Aware Minimization | github | https://openaccess.thecvf.com/content/CVPR2024/papers/Li_Friendly_Sharpness-Aware_Minimization_CVPR_2024_paper.pdf | cite |
| AdaGO | AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates | | https://arxiv.org/abs/2509.02981 | cite |

Supported LR Scheduler

You can check the supported learning rate schedulers with the code below.

```python
from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()
```

or you can also search them with the filter(s).

```python
from pytorch_optimizer import get_supported_lr_schedulers

get_supported_lr_schedulers('cosine*')
# ['cosine', 'cosineannealing', 'cosineannealingwithwarmrestart', 'cosineannealingwithwarmup']

get_supported_lr_schedulers(['cosine', 'warm*'])
# ['cosine', 'cosineannealing', 'cosineannealingwithwarmrestart', 'cosineannealingwithwarmup', 'warmupstabledecay']
```

| LR Scheduler | Description | Official Code | Paper | Citation |
|-----------------|-------------|---------------|-------|----------|
| Explore-Exploit | Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule | | https://arxiv.org/abs/2003.03977 | cite |
| Chebyshev | Acceleration via Fractal Learning Rate Schedules | | https://arxiv.org/abs/2103.01338 | cite |
| REX | Revisiting Budgeted Training with an Improved Schedule | github | https://arxiv.org/abs/2107.04197 | cite |
| WSD | Warmup-Stable-Decay learning rate scheduler | github | https://arxiv.org/abs/2404.06395 | cite |

Supported Loss Function

You can check the supported loss functions with the code below.

```python
from pytorch_optimizer import get_supported_loss_functions

supported_loss_functions = get_supported_loss_functions()
```

or you can also search them with the filter(s).

```python
from pytorch_optimizer import get_supported_loss_functions

get_supported_loss_functions('focal')
# ['bcefocalloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']

get_supported_loss_functions(['focal', 'bce*'])
# ['bcefocalloss', 'bceloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']
```

| Loss Functions | Description | Official Code | Paper | Citation |
|-----------------|-------------|---------------|-------|----------|
| Label Smoothing | Rethinking the Inception Architecture for Computer Vision | | https://arxiv.org/abs/1512.00567 | cite |
| Focal | Focal Loss for Dense Object Detection | | https://arxiv.org/abs/1708.02002 | cite |
| Focal Cosine | Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble | | https://arxiv.org/abs/2007.07805 | cite |
| LDAM | Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss | github | https://arxiv.org/abs/1906.07413 | cite |
| Jaccard (IOU) | IoU Loss for 2D/3D Object Detection | | https://arxiv.org/abs/1908.03851 | cite |
| Bi-Tempered | The Principle of Unchanged Optimality in Reinforcement Learning Generalization | | https://arxiv.org/abs/1906.03361 | cite |
| Tversky | Tversky loss function for image segmentation using 3D fully convolutional deep networks | | https://arxiv.org/abs/1706.05721 | cite |
| Lovasz Hinge | A tractable surrogate for the optimization of the intersection-over-union measure in neural networks | github | https://arxiv.org/abs/1705.08790 | cite |

Useful Resources

Several optimization ideas to regularize and stabilize training. Most of these ideas are applied in the Ranger21 optimizer.

Also, most of the figures are taken from the Ranger21 paper.

| | | |
|---|---|---|
| Adaptive Gradient Clipping | Gradient Centralization | Softplus Transformation |
| Gradient Normalization | Norm Loss | Positive-Negative Momentum |
| Linear learning rate warmup | Stable weight decay | Explore-exploit learning rate schedule |
| Lookahead | Chebyshev learning rate schedule | (Adaptive) Sharpness-Aware Minimization |
| On the Convergence of Adam and Beyond | Improved bias-correction in Adam | Adaptive Gradient Norm Correction |

Adaptive Gradient Clipping

This idea was originally proposed in the NFNet (Normalizer-Free Network) paper. AGC (Adaptive Gradient Clipping) clips gradients based on the unit-wise ratio of gradient norms to parameter norms.
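
A minimal sketch of the unit-wise clipping rule (illustrative only; the `clip_factor` and `eps` values are assumptions, and the library's own implementation may differ):

```python
import torch

def agc(param: torch.Tensor, grad: torch.Tensor,
        clip_factor: float = 0.01, eps: float = 1e-3) -> torch.Tensor:
    """Clip `grad` so each unit's gradient norm stays below clip_factor * its parameter norm."""
    if param.ndim < 2:  # 1-D params (biases, norms) are usually left unclipped
        return grad
    dims = tuple(range(1, param.ndim))
    p_norm = param.norm(dim=dims, keepdim=True).clamp_min(eps)
    g_norm = grad.norm(dim=dims, keepdim=True).clamp_min(1e-6)
    max_norm = p_norm * clip_factor
    # rescale only the units whose gradient norm exceeds the allowed ratio
    return torch.where(g_norm > max_norm, grad * (max_norm / g_norm), grad)
```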

Gradient Centralization


Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.
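
A minimal sketch of the centralization step (an illustration of the idea, not the library's exact code):

```python
import torch

def centralize_gradient(grad: torch.Tensor) -> torch.Tensor:
    """Subtract the mean over all dims except the first, so each output unit's gradient has zero mean."""
    if grad.ndim < 2:  # GC is typically applied only to conv/linear weights
        return grad
    return grad - grad.mean(dim=tuple(range(1, grad.ndim)), keepdim=True)
```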

Softplus Transformation

Running the final variance denominator through the softplus function lifts extremely tiny values, keeping them numerically viable.
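
A toy comparison of the plain epsilon denominator against a softplus-transformed one (the `beta` value here is illustrative):

```python
import torch
import torch.nn.functional as F

# Compare Adam's usual `sqrt(v_hat) + eps` denominator with a softplus-transformed one.
# softplus smoothly floors tiny values instead of relying on a fixed epsilon.
sqrt_v_hat = torch.tensor([1e-8, 1e-4, 1e-1])
plain = sqrt_v_hat + 1e-8
smoothed = F.softplus(sqrt_v_hat, beta=50.0)  # larger beta -> closer to identity for larger values
print(plain, smoothed)
```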

Gradient Normalization

Norm Loss


Positive-Negative Momentum


Linear learning rate warmup


Stable weight decay


Explore-exploit learning rate schedule


Lookahead

k steps forward, 1 step back. Lookahead keeps an exponential moving average of the weights, which is updated and substituted for the current weights every k lookahead steps (5 by default).
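
A minimal sketch of the slow/fast weight synchronization performed every `k` steps (illustrative, not the library's wrapper API):

```python
import torch

def lookahead_sync(slow: torch.Tensor, fast: torch.Tensor, alpha: float = 0.5) -> None:
    """Called every k fast-optimizer steps: move the slow weights toward the
    fast weights, then reset the fast weights to the slow ones."""
    slow.lerp_(fast, alpha)   # slow <- slow + alpha * (fast - slow)
    fast.copy_(slow)
```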

Chebyshev learning rate schedule

Acceleration via Fractal Learning Rate Schedules.

(Adaptive) Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.
In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
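
A hedged sketch of the usual two-pass SAM training step. The constructor arguments and the `first_step` / `second_step` methods follow the common reference implementation and are assumptions here; check the library documentation for the exact interface.

```python
import torch
from torch import nn
from pytorch_optimizer import SAM, AdamP  # interface assumed to mirror the reference SAM implementation

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = SAM(model.parameters(), base_optimizer=AdamP, rho=0.05)

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))

criterion(model(x), y).backward()
optimizer.first_step(zero_grad=True)   # ascend to the worst-case point in the rho-neighborhood

criterion(model(x), y).backward()      # second forward/backward at the perturbed weights
optimizer.second_step(zero_grad=True)  # take the real step from the original weights
```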

On the Convergence of Adam and Beyond

Adam's convergence issues can be fixed by endowing the algorithm with a 'long-term memory' of past gradients.
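
A toy sketch of that long-term memory: AMSGrad normalizes by the running maximum of the second-moment estimate rather than the current one.

```python
import torch

v_max = torch.zeros(3)
for step in range(5):
    v = torch.rand(3)                  # stand-in for the EMA of squared gradients at this step
    v_max = torch.maximum(v_max, v)    # 'long-term memory' of past second moments
    denom = v_max.sqrt() + 1e-8        # dividing by the max keeps the effective step size from inflating
```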

Improved bias-correction in Adam

With the default bias-correction, Adam may actually make larger than requested gradient updates early in training.
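
A toy calculation of that effect (illustrative numbers; this is not the AdamD algorithm itself): at the first step, bias-corrected Adam takes a near-full-size step even for a tiny gradient.

```python
import torch

g = torch.tensor(1e-4)                       # a tiny first gradient
beta1, beta2, lr, eps, t = 0.9, 0.999, 1e-3, 1e-8, 1

m = (1 - beta1) * g                          # first-moment EMA, initialized at zero
v = (1 - beta2) * g ** 2                     # second-moment EMA, initialized at zero
m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)

print(lr * m_hat / (v_hat.sqrt() + eps))     # ~1e-3: a full lr-sized step despite the tiny gradient
print(lr * m / (v_hat.sqrt() + eps))         # ~1e-4: much smaller if the numerator is left un-debiased
```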

Adaptive Gradient Norm Correction

Corrects the norm of the gradient in each iteration based on the adaptive training history of gradient norms.

Cautious optimizer

Updates only occur when the proposed update direction aligns with the current gradient.
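
A minimal sketch of the masking idea (the exact rescaling of the mask varies between implementations):

```python
import torch

def cautious(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Keep only the update components whose sign agrees with the current gradient."""
    mask = (update * grad > 0).to(update.dtype)
    return update * mask
```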

Adam-ATAN2

Adam-atan2 is a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter.
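
A minimal sketch of the atan2-style update (the scale constants `a` and `b` are illustrative assumptions):

```python
import torch

def adam_atan2_update(m_hat: torch.Tensor, v_hat: torch.Tensor,
                      lr: float = 1e-3, a: float = 1.27, b: float = 1.0) -> torch.Tensor:
    """atan2 is bounded and defined at zero, so no epsilon is needed in the denominator."""
    return lr * a * torch.atan2(m_hat, b * v_hat.sqrt())
```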

Frequently asked questions

here

Visualization

here

Citation

Please cite the original authors of the optimization algorithms; you can easily find them in the tables above. If you use this software, please cite it as below, or get the citation from the "Cite this repository" button on GitHub.

@software{Kim_pytorch_optimizer_optimizer_2021,
    author = {Kim, Hyeongchan},
    month = jan,
    title = {{pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch}},
    url = {https://github.com/kozistr/pytorch_optimizer},
    version = {3.1.0},
    year = {2021}
}

Maintainer

Hyeongchan Kim / @kozistr

Owner

  • Name: Hyeongchan Kim
  • Login: kozistr
  • Kind: user
  • Location: South Korea, Seoul
  • Company: @toss

pursue generalist better than specialist

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Kim
    given-names: Hyeongchan
    orcid: https://orcid.org/0000-0002-1729-0580
title: "pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch"
version: 2.12.0
date-released: 2021-09-21
url: "https://github.com/kozistr/pytorch_optimizer"

GitHub Events

Total
  • Create event: 81
  • Release event: 13
  • Issues event: 83
  • Watch event: 95
  • Delete event: 71
  • Issue comment event: 150
  • Push event: 263
  • Pull request review comment event: 17
  • Pull request review event: 31
  • Pull request event: 165
  • Fork event: 11
Last Year
  • Create event: 81
  • Release event: 13
  • Issues event: 83
  • Watch event: 95
  • Delete event: 71
  • Issue comment event: 150
  • Push event: 263
  • Pull request review comment event: 17
  • Pull request review event: 31
  • Pull request event: 165
  • Fork event: 11

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 3,700
  • Total Committers: 13
  • Avg Commits per committer: 284.615
  • Development Distribution Score (DDS): 0.009
Past Year
  • Commits: 804
  • Committers: 6
  • Avg Commits per committer: 134.0
  • Development Distribution Score (DDS): 0.014
Top Committers
Name Email Commits
kozistr k****r@g****m 3,667
ferris f****s@d****m 7
Luciferian Ink L****k@p****m 5
dowon k****5@n****m 5
Michał Dyczko 4****o 3
青龍聖者@bdsqlsz 8****9@q****m 2
dependabot[bot] 4****] 2
Kyle Vedder k****r@g****m 2
Georg Wölflein g****7@g****m 2
Aidin a****l@g****m 2
Jan Beitner j****r@i****m 1
C2D 5****8 1
Anke Tang t****e@f****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 109
  • Total pull requests: 281
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 12 hours
  • Total issue authors: 34
  • Total pull request authors: 15
  • Average comments per issue: 1.61
  • Average comments per pull request: 1.03
  • Merged pull requests: 257
  • Bot issues: 0
  • Bot pull requests: 5
Past Year
  • Issues: 55
  • Pull requests: 137
  • Average time to close issues: 5 days
  • Average time to close pull requests: about 9 hours
  • Issue authors: 17
  • Pull request authors: 8
  • Average comments per issue: 0.98
  • Average comments per pull request: 1.07
  • Merged pull requests: 120
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sdbds (22)
  • redknightlois (15)
  • 67372a (10)
  • Vectorrent (9)
  • kozistr (6)
  • ogencoglu (5)
  • aliencaocao (4)
  • hatonosuke (3)
  • gesen2egee (3)
  • Bing-su (3)
  • LiutongZhou (3)
  • Yura52 (2)
  • robotzheng (2)
  • michaldyczko (2)
  • muooon (2)
Pull Request Authors
  • kozistr (280)
  • dependabot[bot] (8)
  • Vectorrent (8)
  • kylevedder (4)
  • AidinHamedi (4)
  • sdbds (4)
  • michaldyczko (3)
  • i404788 (3)
  • Bing-su (2)
  • Mirza-Samad-Ahmed-Baig (2)
  • hatonosuke (2)
  • tanganke (2)
  • georg-wolflein (1)
  • liveck (1)
  • jdb78 (1)
Top Labels
Issue Labels
feature request (71) bug (29) feature (4) enhancement (2) dependencies (2) question (1) performance (1)
Pull Request Labels
documentation (143) dependencies (89) feature (82) size/L (79) feature request (77) size/M (68) bug (62) enhancement (62) size/XS (61) size/S (51) refactoring (51) size/XL (31) size/XXL (16) cleanup (15) test (13) performance (9) optimizer (4) github_actions (3) automerge (3)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 150,266 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 86
  • Total maintainers: 1
pypi.org: pytorch_optimizer

optimizer & lr scheduler & objective function collections in PyTorch

  • Versions: 86
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 150,266 Last month
Rankings
Downloads: 1.5%
Stargazers count: 5.8%
Forks count: 9.6%
Average: 9.7%
Dependent packages count: 10.1%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

poetry.lock pypi
  • astroid 2.8.6 develop
  • atomicwrites 1.4.0 develop
  • attrs 21.4.0 develop
  • black 21.12b0 develop
  • click 8.0.4 develop
  • colorama 0.4.4 develop
  • coverage 6.3.2 develop
  • importlib-metadata 4.11.3 develop
  • iniconfig 1.1.1 develop
  • isort 5.10.1 develop
  • lazy-object-proxy 1.7.1 develop
  • mccabe 0.6.1 develop
  • mypy-extensions 0.4.3 develop
  • packaging 21.3 develop
  • pathspec 0.9.0 develop
  • platformdirs 2.5.2 develop
  • pluggy 1.0.0 develop
  • py 1.11.0 develop
  • pylint 2.11.1 develop
  • pyparsing 3.0.8 develop
  • pytest 7.1.2 develop
  • pytest-cov 3.0.0 develop
  • toml 0.10.2 develop
  • tomli 1.2.3 develop
  • typed-ast 1.5.3 develop
  • wrapt 1.13.3 develop
  • zipp 3.8.0 develop
  • numpy 1.21.1
  • torch 1.11.0
  • typing-extensions 4.2.0
pyproject.toml pypi
  • black ==21.12b0 develop
  • click ==8.0.4 develop
  • isort ==5.10.1 develop
  • pylint ==2.11.1 develop
  • pytest ==7.1.2 develop
  • pytest-cov ==3.0.0 develop
  • numpy ^1.21.1
  • python ^3.7
  • torch ^1.11.0
requirements-dev.txt pypi
  • astroid ==2.8.6 development
  • atomicwrites ==1.4.0 development
  • attrs ==21.4.0 development
  • black ==21.12b0 development
  • click ==8.0.4 development
  • colorama ==0.4.4 development
  • coverage ==6.3.2 development
  • importlib-metadata ==4.11.3 development
  • iniconfig ==1.1.1 development
  • isort ==5.10.1 development
  • lazy-object-proxy ==1.7.1 development
  • mccabe ==0.6.1 development
  • mypy-extensions ==0.4.3 development
  • numpy ==1.21.1 development
  • packaging ==21.3 development
  • pathspec ==0.9.0 development
  • platformdirs ==2.5.2 development
  • pluggy ==1.0.0 development
  • py ==1.11.0 development
  • pylint ==2.11.1 development
  • pyparsing ==3.0.8 development
  • pytest ==7.1.2 development
  • pytest-cov ==3.0.0 development
  • toml ==0.10.2 development
  • tomli ==1.2.3 development
  • torch ==1.11.0 development
  • typed-ast ==1.5.3 development
  • typing-extensions ==4.2.0 development
  • wrapt ==1.13.3 development
  • zipp ==3.8.0 development
requirements.txt pypi
  • numpy ==1.21.1
  • torch ==1.11.0
  • typing-extensions ==4.2.0
.github/workflows/ci.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • codecov/codecov-action v3 composite
.github/workflows/publish.yml actions
  • actions/checkout v3 composite
  • actions/create-release v1 composite
  • actions/setup-python v4 composite
requirements-docs.txt pypi
  • markdown-include ==0.8.1
  • mdx_truly_sane_lists ==1.3
  • mkdocs ==1.5.2
  • mkdocs-awesome-pages-plugin ==2.9.2
  • mkdocs-material ==9.3.1
  • mkdocstrings-python ==1.7.0
  • numpy *
  • pymdown-extensions ==10.3
  • torch ==2.1.0