rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

https://github.com/google-research/rliable

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary

Keywords

benchmarking evaluation-metrics google machine-learning reinforcement-learning rl

Keywords from Contributors

gym-environment gym distributed
Last synced: 6 months ago

Repository

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

Basic Info
  • Host: GitHub
  • Owner: google-research
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Homepage: https://agarwl.github.io/rliable
  • Size: 1.85 MB
Statistics
  • Stars: 840
  • Watchers: 10
  • Forks: 49
  • Open Issues: 4
  • Releases: 0
Topics
benchmarking evaluation-metrics google machine-learning reinforcement-learning rl
Created over 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Citation

README.md


rliable is an open-source Python library for reliable evaluation, even with a handful of runs, on reinforcement learning and machine learning benchmarks. The paper's recommendations, per desideratum:

Uncertainty in aggregate performance
  • Current approach — Point estimates:
    • Ignore statistical uncertainty
    • Hinder results reproducibility
  • Recommendation: Interval estimates using stratified bootstrap confidence intervals (CIs)

Performance variability across tasks and runs
  • Current approach — Tables with task mean scores:
    • Overwhelming beyond a few tasks
    • Standard deviations frequently omitted
    • Incomplete picture for multimodal and heavy-tailed distributions
  • Recommendation: Score distributions (performance profiles):
    • Show the tail distribution of scores on combined runs across tasks
    • Allow qualitative comparisons
    • Easily read off any score percentile

Aggregate metrics for summarizing benchmark performance
  • Current approach — Mean:
    • Often dominated by performance on outlier tasks
  • Current approach — Median:
    • Statistically inefficient (requires a large number of runs to claim improvements)
    • Poor indicator of overall performance: 0 scores on nearly half the tasks doesn't change it
  • Recommendation: Interquartile Mean (IQM) across all runs (see the numerical sketch after this list):
    • Performance on the middle 50% of combined runs
    • Robust to outlier scores but more statistically efficient than the median
  • To show other aspects of performance gains, report Probability of Improvement and Optimality Gap
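To make IQM concrete, here is a minimal numpy/scipy sketch, independent of rliable's own implementation: it drops the bottom and top 25% of the combined run scores and averages the middle 50%, so a single outlier task moves the mean but not the IQM.

```python
import numpy as np
from scipy import stats

# Combined normalized scores across runs and tasks; one outlier score (10.0).
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 10.0])

mean = scores.mean()                                 # 1.6, dragged up by the outlier
median = np.median(scores)                           # 0.45, but uses little of the data
iqm = stats.trim_mean(scores, proportiontocut=0.25)  # 0.45, mean of the middle 50%
print(mean, median, iqm)
```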

rliable provides support for:

  • Stratified Bootstrap Confidence Intervals (CIs)
  • Performance Profiles (with plotting functions)
  • Aggregate metrics
    • Interquartile Mean (IQM) across all runs
    • Optimality Gap
    • Probability of Improvement

Interactive colab

We provide a Colab at bit.ly/statisticalprecipicecolab, which shows how to use the library with examples of published algorithms on widely used benchmarks including Atari 100k, ALE, DM Control and Procgen.

Data for individual runs on Atari 100k, ALE, DM Control and Procgen

You can access the data for individual runs via the public GCP bucket at https://console.cloud.google.com/storage/browser/rl-benchmark-data (you might need to sign in with your Gmail account to use GCloud). The interactive Colab above also lets you access the data programmatically.
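For scripted downloads, a minimal sketch using the google-cloud-storage package may help (assuming the bucket allows anonymous reads; the `atari_100k/` prefix is a hypothetical example, so browse the bucket linked above for the actual layout):

```python
# pip install google-cloud-storage
from google.cloud import storage

# Public, read-only access without GCP credentials.
client = storage.Client.create_anonymous_client()

# NOTE: the prefix below is illustrative; check the bucket for real paths.
for blob in client.list_blobs('rl-benchmark-data', prefix='atari_100k/'):
    print(blob.name)  # list available files
    # blob.download_to_filename(blob.name.replace('/', '_'))  # uncomment to fetch
```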

Paper

For more details, refer to the accompanying NeurIPS 2021 paper (Outstanding Paper Award): Deep Reinforcement Learning at the Edge of the Statistical Precipice.

Installation

To install rliable, run:

```
pip install -U rliable
```

To install the latest version of rliable as a package, run:

```
pip install git+https://github.com/google-research/rliable
```

To import rliable, we suggest:

```python
from rliable import library as rly
from rliable import metrics
from rliable import plot_utils
```

Aggregate metrics with 95% Stratified Bootstrap CIs

IQM, Optimality Gap, Median, Mean

```python
algorithms = ['DQN (Nature)', 'DQN (Adam)', 'C51', 'REM', 'Rainbow',
              'IQN', 'M-IQN', 'DreamerV2']
# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices, each of which is of size (num_runs x num_games).
atari_200m_normalized_score_dict = ...
aggregate_func = lambda x: np.array([
    metrics.aggregate_median(x),
    metrics.aggregate_iqm(x),
    metrics.aggregate_mean(x),
    metrics.aggregate_optimality_gap(x)])
aggregate_scores, aggregate_score_cis = rly.get_interval_estimates(
    atari_200m_normalized_score_dict, aggregate_func, reps=50000)
fig, axes = plot_utils.plot_interval_estimates(
    aggregate_scores, aggregate_score_cis,
    metric_names=['Median', 'IQM', 'Mean', 'Optimality Gap'],
    algorithms=algorithms, xlabel='Human Normalized Score')
```
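Since the snippet above leaves the score dictionary as a `...` placeholder, the following self-contained variant substitutes synthetic scores (the algorithm names, shapes, and score values here are purely illustrative) so the pipeline can be smoke-tested end to end:

```python
import numpy as np
from rliable import library as rly
from rliable import metrics
from rliable import plot_utils

rng = np.random.default_rng(seed=0)
num_runs, num_games = 5, 26  # illustrative sizes, not the real benchmark

# Synthetic stand-in for atari_200m_normalized_score_dict:
# algorithm -> (num_runs x num_games) matrix of normalized scores.
score_dict = {
    'AlgorithmA': rng.gamma(shape=2.0, scale=0.5, size=(num_runs, num_games)),
    'AlgorithmB': rng.gamma(shape=2.5, scale=0.5, size=(num_runs, num_games)),
}
aggregate_func = lambda x: np.array([
    metrics.aggregate_median(x),
    metrics.aggregate_iqm(x),
    metrics.aggregate_mean(x),
    metrics.aggregate_optimality_gap(x)])
aggregate_scores, aggregate_score_cis = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=2000)  # fewer reps than 50000 for speed
fig, axes = plot_utils.plot_interval_estimates(
    aggregate_scores, aggregate_score_cis,
    metric_names=['Median', 'IQM', 'Mean', 'Optimality Gap'],
    algorithms=list(score_dict), xlabel='Normalized Score')
```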

Probability of Improvement

```python
# Load ProcGen scores as a dictionary containing pairs of normalized score
# matrices for pairs of algorithms we want to compare.
procgen_algorithm_pairs = {.. , 'x,y': (score_x, score_y), ..}
average_probabilities, average_prob_cis = rly.get_interval_estimates(
    procgen_algorithm_pairs, metrics.probability_of_improvement, reps=2000)
plot_utils.plot_probability_of_improvement(
    average_probabilities, average_prob_cis)
```
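As we read the paper, the underlying statistic for a pair (X, Y) is, per task, the probability that a randomly chosen run of X beats a randomly chosen run of Y (ties counting one half, as in the Mann-Whitney U statistic), averaged across tasks. A standalone numpy sketch of that definition, separate from rliable's implementation:

```python
import numpy as np

def prob_of_improvement(scores_x, scores_y):
    """Average over tasks of P(run of X > run of Y); inputs are (runs x tasks)."""
    per_task = []
    for t in range(scores_x.shape[1]):
        x, y = scores_x[:, t], scores_y[:, t]
        greater = (x[:, None] > y[None, :]).mean()  # all run pairings
        ties = (x[:, None] == y[None, :]).mean()    # ties count 1/2
        per_task.append(greater + 0.5 * ties)
    return float(np.mean(per_task))

rng = np.random.default_rng(0)
score_x = rng.normal(1.1, 0.3, size=(10, 16))  # illustrative run x task scores
score_y = rng.normal(1.0, 0.3, size=(10, 16))
print(prob_of_improvement(score_x, score_y))   # > 0.5 favors X over Y
```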

Sample Efficiency Curve

```python
algorithms = ['DQN (Nature)', 'DQN (Adam)', 'C51', 'REM', 'Rainbow',
              'IQN', 'M-IQN', 'DreamerV2']
# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices across all 200 million frames, each of which is of size
# (num_runs x num_games x 200), where scores are recorded every million frames.
ale_all_frames_scores_dict = ...
frames = np.array([1, 10, 25, 50, 75, 100, 125, 150, 175, 200]) - 1
ale_frames_scores_dict = {algorithm: score[:, :, frames] for algorithm, score
                          in ale_all_frames_scores_dict.items()}
iqm = lambda scores: np.array([metrics.aggregate_iqm(scores[..., frame])
                               for frame in range(scores.shape[-1])])
iqm_scores, iqm_cis = rly.get_interval_estimates(
    ale_frames_scores_dict, iqm, reps=50000)
plot_utils.plot_sample_efficiency_curve(
    frames + 1, iqm_scores, iqm_cis, algorithms=algorithms,
    xlabel=r'Number of Frames (in millions)',
    ylabel='IQM Human Normalized Score')
```

<img src="https://raw.githubusercontent.com/google-research/rliable/master/images/alelegend.png">

Performance Profiles

```python
# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices, each of which is of size (num_runs x num_games).
atari_200m_normalized_score_dict = ...
# Human normalized score thresholds.
atari_200m_thresholds = np.linspace(0.0, 8.0, 81)
score_distributions, score_distributions_cis = rly.create_performance_profile(
    atari_200m_normalized_score_dict, atari_200m_thresholds)
# Plot score distributions.
fig, ax = plt.subplots(ncols=1, figsize=(7, 5))
plot_utils.plot_performance_profiles(
    score_distributions, atari_200m_thresholds,
    performance_profile_cis=score_distributions_cis,
    colors=dict(zip(algorithms, sns.color_palette('colorblind'))),
    xlabel=r'Human Normalized Score $(\tau)$',
    ax=ax)
```

<img src="https://raw.githubusercontent.com/google-research/rliable/master/images/alelegend.png">

The above profile can also be plotted with non-linear scaling as follows:

```python
plot_utils.plot_performance_profiles(
    perf_prof_atari_200m, atari_200m_tau,
    performance_profile_cis=perf_prof_atari_200m_cis,
    use_non_linear_scaling=True,
    xticks=[0.0, 0.5, 1.0, 2.0, 4.0, 8.0],
    colors=dict(zip(algorithms, sns.color_palette('colorblind'))),
    xlabel=r'Human Normalized Score $(\tau)$',
    ax=ax)
```
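For intuition, the profile value at a threshold tau is simply the fraction of combined (run, task) scores exceeding tau. A minimal numpy sketch of that quantity follows (rliable's create_performance_profile additionally computes stratified bootstrap CIs):

```python
import numpy as np

def run_score_profile(score_matrix, thresholds):
    """Fraction of combined (run, task) scores above each threshold tau.

    score_matrix: (num_runs x num_tasks) array; returns one value per tau."""
    flat = score_matrix.reshape(-1)
    return np.array([(flat > tau).mean() for tau in thresholds])

rng = np.random.default_rng(0)
scores = rng.gamma(2.0, 0.5, size=(5, 26))  # illustrative normalized scores
taus = np.linspace(0.0, 8.0, 81)            # same threshold grid as above
profile = run_score_profile(scores, taus)   # non-increasing curve in [0, 1]
```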

Dependencies

The code was tested under Python>=3.7 and uses these packages:

  • arch == 5.3.0
  • scipy >= 1.7.0
  • numpy >= 0.9.0
  • absl-py >= 1.16.4
  • seaborn >= 0.11.2

Citing

If you find this open-source release useful, please cite it in your paper:

@article{agarwal2021deep,
  title={Deep Reinforcement Learning at the Edge of the Statistical Precipice},
  author={Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel
          and Courville, Aaron and Bellemare, Marc G},
  journal={Advances in Neural Information Processing Systems},
  year={2021}
}

Disclaimer: This is not an official Google product.

Owner

  • Name: Google Research
  • Login: google-research
  • Kind: organization
  • Location: Earth

Citation (CITATION.bib)

@article{agarwal2021deep,
  title={Deep reinforcement learning at the edge of the statistical precipice},
  author={Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel and Courville, Aaron C and Bellemare, Marc},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

GitHub Events

Total
  • Issues event: 5
  • Watch event: 78
  • Fork event: 4
Last Year
  • Issues event: 5
  • Watch event: 78
  • Fork event: 4

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 63
  • Total Committers: 10
  • Avg Commits per committer: 6.3
  • Development Distribution Score (DDS): 0.222
Past Year
  • Commits: 4
  • Committers: 3
  • Avg Commits per committer: 1.333
  • Development Distribution Score (DDS): 0.5
Top Committers
Name Email Commits
Rishabh Agarwal 1****l 49
Quentin Gallouédec 4****c 4
zclzc 3****c 2
Dennis Soemers d****s@g****m 2
Sebastian Markgraf S****f@t****e 1
RLiable Team n****y@g****m 1
Michael Panchenko 3****h 1
Jet 3****s 1
Antonin RAFFIN a****n@e****g 1
Michael Panchenko m****o@a****e 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 20
  • Total pull requests: 10
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 16 days
  • Total issue authors: 17
  • Total pull request authors: 8
  • Average comments per issue: 2.9
  • Average comments per pull request: 1.3
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: 3 months
  • Average time to close pull requests: N/A
  • Issue authors: 4
  • Pull request authors: 0
  • Average comments per issue: 0.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • amantuer (2)
  • MischaPanch (2)
  • slerman12 (2)
  • zhefan (2)
  • HYDesmondLiu (1)
  • STAY-Melody (1)
  • e-pet (1)
  • ezhang7423 (1)
  • MarcoMeter (1)
  • RongpingZhou (1)
  • kidd12138 (1)
  • xkianteb (1)
  • qgallouedec (1)
  • nirbhayjm (1)
  • TaoHuang13 (1)
Pull Request Authors
  • MischaPanch (3)
  • qgallouedec (3)
  • araffin (2)
  • DennisSoemers (2)
  • Aladoro (1)
  • sebimarkgraf (1)
  • lkevinzc (1)
  • jjshoots (1)
Top Labels
Issue Labels
documentation (1) bug (1) enhancement (1) good first issue (1) help wanted (1)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 1,618 last-month
  • Total docker downloads: 10
  • Total dependent packages: 6
    (may contain duplicates)
  • Total dependent repositories: 15
    (may contain duplicates)
  • Total versions: 12
  • Total maintainers: 2
pypi.org: rliable

rliable: Reliable evaluation on reinforcement learning and machine learning benchmarks.

  • Versions: 11
  • Dependent Packages: 6
  • Dependent Repositories: 15
  • Downloads: 1,613 Last month
  • Docker Downloads: 10
Rankings
Dependent packages count: 1.6%
Stargazers count: 2.5%
Downloads: 3.1%
Average: 3.5%
Dependent repos count: 3.7%
Docker downloads count: 4.1%
Forks count: 6.2%
Maintainers (2)
Last synced: 6 months ago
pypi.org: rliable-fork

rliable: Reliable evaluation on reinforcement learning and machine learning benchmarks.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 5 Last month
Rankings
Dependent packages count: 10.5%
Average: 34.8%
Dependent repos count: 59.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • absl-py *
  • arch *
  • numpy *
  • scipy *
  • seaborn *