https://github.com/cloneofsimo/policy-optimization-torch

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.2%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: cloneofsimo
  • Language: Python
  • Default Branch: master
  • Size: 2.15 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 4 years ago · Last pushed almost 4 years ago
Metadata Files
Readme

README.md

Policy Optimization implemented with PyTorch

Benchmarking VPG & PPO with various weights ($\Psi$) for the policy gradient term, where $\Psi$ is used as:

$$\begin{align} g_\theta := & \nabla J_\theta (\pi_\theta) \\ =& \mathbb{E}\big[ \nabla_\theta \sum \log \pi_\theta(a_t|s_t) R \big] \\ =& \mathbb{E}\big[ \sum_{t = 0} ^ \infty \Psi_t \nabla_\theta \log \pi_\theta(a_t | s_t ) \big] \end{align}$$
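As a rough illustration of the last line of the equation, the gradient can be estimated by differentiating a surrogate loss in which $\Psi_t$ acts as a constant weight on $\log \pi_\theta(a_t | s_t)$. The sketch below is not taken from this repo: the function name `policy_gradient_loss`, the tiny linear policy, and the sample data are all assumptions made for the example, and the expectation is approximated by a mean over the sampled timesteps.

```python
import torch
from torch.distributions import Categorical


def policy_gradient_loss(logits, actions, psi):
    """Surrogate loss whose gradient is -mean_t[psi_t * grad log pi(a_t | s_t)]."""
    log_prob = Categorical(logits=logits).log_prob(actions)
    # psi is a fixed weight here, so detach it to keep gradients from flowing through it.
    return -(psi.detach() * log_prob).mean()


torch.manual_seed(0)
policy = torch.nn.Linear(4, 2)           # hypothetical policy: 4-dim observation, 2 actions
obs = torch.randn(5, 4)                  # made-up batch of 5 observations
actions = torch.tensor([0, 1, 1, 0, 1])  # made-up sampled actions
psi = torch.tensor([1.0, 0.5, -0.2, 0.3, 2.0])  # made-up weights Psi_t

loss = policy_gradient_loss(policy(obs), actions, psi)
loss.backward()  # policy.weight.grad now holds the negated gradient estimate
```

Minimizing this loss with any optimizer performs gradient ascent on the objective, which is why the sign is flipped.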

How does this differ from Spinning Up RL's Implementation?

This repo is heavily based on Spinning Up RL's implementation of VPG and PPO.

However, there are some changes. First, similar to other object-oriented pipelines such as Timm and Avalanche, we have structured the training code for these policy optimization algorithms as Pythonic OOP on top of PyTorch.

We have also removed some features that are not essential to the algorithms themselves, and added some new features that are not present in the original implementations.

Results

Where $\Psi$ is one of:

  1. Discounted return:

$$ \Psi_t = \sum_{l = 0} ^ \infty \gamma^l r_l $$

  2. Reward-to-go:

$$ \Psi_t = \sum_{l = 0} ^ \infty \gamma^{t + l} r_{t + l} $$

  3. Reward-to-go with baseline:

$$ \Psi_t = \sum_{l = 0} ^ \infty \gamma^{t + l} r_{t + l} - b(s_t) $$

Where we have used the baseline $b(s_t) = V(s_t)$.

  4. Discounted Temporal Difference Residual:

$$ \Psi_t = \delta_t := r_t + \gamma V(s_{t+1}) - V(s_t) $$

  5. Generalized Advantage Estimation:

$$ \Psi_t = \sum_{l = 0} ^ \infty (\gamma \lambda)^l \delta_{t + l} $$

Where $\delta_{t}$ is the discounted TD residual defined in 4.
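The five choices of $\Psi$ listed above can be sketched for a single finite episode as follows. This is a plain-Python illustration written directly from the formulas, not the buffer code used in this repo. The function names are hypothetical, the infinite sums are truncated at the end of the episode, and `values` is assumed to hold one extra entry for the state after the last step (0 if that state is terminal).

```python
def discounted_return(rewards, gamma):
    """Choice 1: the same scalar sum_l gamma^l * r_l for every timestep."""
    return sum(gamma ** l * r for l, r in enumerate(rewards))


def reward_to_go(rewards, gamma):
    """Choice 2: Psi_t = sum_l gamma^(t+l) * r_(t+l), built by a backward pass."""
    out = [0.0] * len(rewards)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc += gamma ** t * rewards[t]
        out[t] = acc
    return out


def reward_to_go_with_baseline(rewards, values, gamma):
    """Choice 3: reward-to-go minus the baseline b(s_t) = V(s_t)."""
    rtg = reward_to_go(rewards, gamma)
    return [rtg[t] - values[t] for t in range(len(rewards))]


def td_residuals(rewards, values, gamma):
    """Choice 4: delta_t = r_t + gamma * V(s_(t+1)) - V(s_t)."""
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]


def gae(rewards, values, gamma, lam):
    """Choice 5: Psi_t = sum_l (gamma * lam)^l * delta_(t+l), via backward recursion."""
    deltas = td_residuals(rewards, values, gamma)
    out = [0.0] * len(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):
        acc = deltas[t] + gamma * lam * acc
        out[t] = acc
    return out


# Made-up 3-step episode; the final entry of `values` is the terminal value 0.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
print(reward_to_go(rewards, gamma=0.5))        # [1.75, 0.75, 0.25]
print(gae(rewards, values, gamma=0.5, lam=0.5))  # [0.96875, 0.875, 0.5]
```

The backward passes keep each estimator linear in the episode length, instead of re-summing the tail of the episode for every timestep.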

Note

With the original definition from the paper proposing GAE, the estimator for $\lambda>0$ has the same value as 5. By their definition, setting $\lambda=0$ yields 4. This does not follow directly from evaluating 5 at $\lambda=0$, because the $l=0$ term then contains ${0}^{0}$, which is not well defined. It's not a big deal in practice. See equation (16) of the paper for more details.
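One way to see how this plays out numerically: Python evaluates `0.0 ** 0` as `1.0`, and under that convention the sum in 5 evaluated directly at $\lambda=0$ collapses to the single term $\delta_t$, matching 4. The snippet below is only a sketch of that observation. The helper `gae_direct` and the residual values are made up for the example, and other languages or libraries may treat $0^0$ differently.

```python
def gae_direct(deltas, t, gamma, lam):
    """Evaluate sum_l (gamma * lam)^l * delta_(t+l) term by term for a finite episode."""
    return sum((gamma * lam) ** l * deltas[t + l]
               for l in range(len(deltas) - t))


deltas = [0.75, 0.75, 0.5]  # made-up TD residuals for a 3-step episode

# With lam = 0 the l = 0 term is (0.0 ** 0) * delta_t, which Python treats as 1 * delta_t.
lam_zero = [gae_direct(deltas, t, gamma=0.5, lam=0.0) for t in range(len(deltas))]
print(lam_zero == deltas)  # True
```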

Updates!

Faster buffer, CUDA support, and verified to work on a continuous environment (gym's BipedalWalker-v3).

Fixed minor bugs.

Owner

  • Name: Simo Ryu
  • Login: cloneofsimo
  • Kind: user
  • Company: Corca AI

Cats are Turing machines cloneofsimo@gmail.com

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0