rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

https://github.com/google-research/rliable

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary

Keywords

benchmarking evaluation-metrics google machine-learning reinforcement-learning rl

Keywords from Contributors

gym-environment gym distributed
Last synced: 6 months ago

Repository

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

Basic Info
  • Host: GitHub
  • Owner: google-research
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Homepage: https://agarwl.github.io/rliable
  • Size: 1.85 MB
Statistics
  • Stars: 840
  • Watchers: 10
  • Forks: 49
  • Open Issues: 4
  • Releases: 0
Topics
benchmarking evaluation-metrics google machine-learning reinforcement-learning rl
Created over 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Citation

README.md


rliable is an open-source Python library for reliable evaluation, even with a handful of runs, on reinforcement learning and machine learning benchmarks. The paper's recommendations, per desideratum:

Uncertainty in aggregate performance
  • Current approach — Point estimates:
    • Ignore statistical uncertainty
    • Hinder results reproducibility
  • Recommendation: Interval estimates using stratified bootstrap confidence intervals (CIs)

Performance variability across tasks and runs
  • Current approach — Tables with task mean scores:
    • Overwhelming beyond a few tasks
    • Standard deviations frequently omitted
    • Incomplete picture for multimodal and heavy-tailed distributions
  • Recommendation: Score distributions (performance profiles):
    • Show the tail distribution of scores on combined runs across tasks
    • Allow qualitative comparisons
    • Easily read off any score percentile

Aggregate metrics for summarizing benchmark performance
  • Current approach — Mean:
    • Often dominated by performance on outlier tasks
  • Current approach — Median:
    • Statistically inefficient (requires a large number of runs to claim improvements)
    • Poor indicator of overall performance: 0 scores on nearly half the tasks doesn't change it
  • Recommendation: Interquartile Mean (IQM) across all runs (see the numerical sketch after this list):
    • Performance on the middle 50% of combined runs
    • Robust to outlier scores but more statistically efficient than the median
  • To show other aspects of performance gains, report Probability of Improvement and Optimality Gap
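To make IQM concrete, here is a minimal numpy/scipy sketch, independent of rliable's own implementation: it drops the bottom and top 25% of the combined run scores and averages the middle 50%, so a single outlier task moves the mean but not the IQM.

```python
import numpy as np
from scipy import stats

# Combined normalized scores across runs and tasks; one outlier score (10.0).
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 10.0])

mean = scores.mean()                                 # 1.6, dragged up by the outlier
median = np.median(scores)                           # 0.45, but uses little of the data
iqm = stats.trim_mean(scores, proportiontocut=0.25)  # 0.45, mean of the middle 50%
print(mean, median, iqm)
```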

rliable provides support for:

  • Stratified Bootstrap Confidence Intervals (CIs)
  • Performance Profiles (with plotting functions)
  • Aggregate metrics
    • Interquartile Mean (IQM) across all runs
    • Optimality Gap
    • Probability of Improvement

Interactive colab

We provide a Colab at bit.ly/statisticalprecipicecolab, which shows how to use the library with examples of published algorithms on widely used benchmarks including Atari 100k, ALE, DM Control and Procgen.

Data for individual runs on Atari 100k, ALE, DM Control and Procgen

You can access the data for individual runs via the public GCP bucket at https://console.cloud.google.com/storage/browser/rl-benchmark-data (you might need to sign in with your Gmail account to use GCloud). The interactive Colab above also lets you access the data programmatically.
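For scripted downloads, a minimal sketch using the google-cloud-storage package may help (assuming the bucket allows anonymous reads; the `atari_100k/` prefix is a hypothetical example, so browse the bucket linked above for the actual layout):

```python
# pip install google-cloud-storage
from google.cloud import storage

# Public, read-only access without GCP credentials.
client = storage.Client.create_anonymous_client()

# NOTE: the prefix below is illustrative; check the bucket for real paths.
for blob in client.list_blobs('rl-benchmark-data', prefix='atari_100k/'):
    print(blob.name)  # list available files
    # blob.download_to_filename(blob.name.replace('/', '_'))  # uncomment to fetch
```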

Paper

For more details, refer to the accompanying NeurIPS 2021 paper (Outstanding Paper Award): Deep Reinforcement Learning at the Edge of the Statistical Precipice.

Installation

To install rliable, run:

```
pip install -U rliable
```

To install the latest version of rliable as a package, run:

```
pip install git+https://github.com/google-research/rliable
```

To import rliable, we suggest:

```python
from rliable import library as rly
from rliable import metrics
from rliable import plot_utils
```

Aggregate metrics with 95% Stratified Bootstrap CIs

IQM, Optimality Gap, Median, Mean

```python
algorithms = ['DQN (Nature)', 'DQN (Adam)', 'C51', 'REM', 'Rainbow',
              'IQN', 'M-IQN', 'DreamerV2']
# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices, each of which is of size (num_runs x num_games).
atari_200m_normalized_score_dict = ...
aggregate_func = lambda x: np.array([
    metrics.aggregate_median(x),
    metrics.aggregate_iqm(x),
    metrics.aggregate_mean(x),
    metrics.aggregate_optimality_gap(x)])
aggregate_scores, aggregate_score_cis = rly.get_interval_estimates(
    atari_200m_normalized_score_dict, aggregate_func, reps=50000)
fig, axes = plot_utils.plot_interval_estimates(
    aggregate_scores, aggregate_score_cis,
    metric_names=['Median', 'IQM', 'Mean', 'Optimality Gap'],
    algorithms=algorithms, xlabel='Human Normalized Score')
```
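Since the snippet above leaves the score dictionary as a `...` placeholder, the following self-contained variant substitutes synthetic scores (the algorithm names, shapes, and score values here are purely illustrative) so the pipeline can be smoke-tested end to end:

```python
import numpy as np
from rliable import library as rly
from rliable import metrics
from rliable import plot_utils

rng = np.random.default_rng(seed=0)
num_runs, num_games = 5, 26  # illustrative sizes, not the real benchmark

# Synthetic stand-in for atari_200m_normalized_score_dict:
# algorithm -> (num_runs x num_games) matrix of normalized scores.
score_dict = {
    'AlgorithmA': rng.gamma(shape=2.0, scale=0.5, size=(num_runs, num_games)),
    'AlgorithmB': rng.gamma(shape=2.5, scale=0.5, size=(num_runs, num_games)),
}
aggregate_func = lambda x: np.array([
    metrics.aggregate_median(x),
    metrics.aggregate_iqm(x),
    metrics.aggregate_mean(x),
    metrics.aggregate_optimality_gap(x)])
aggregate_scores, aggregate_score_cis = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=2000)  # fewer reps than 50000 for speed
fig, axes = plot_utils.plot_interval_estimates(
    aggregate_scores, aggregate_score_cis,
    metric_names=['Median', 'IQM', 'Mean', 'Optimality Gap'],
    algorithms=list(score_dict), xlabel='Normalized Score')
```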

Probability of Improvement

```python
# Load ProcGen scores as a dictionary containing pairs of normalized score
# matrices for pairs of algorithms we want to compare.
procgen_algorithm_pairs = {.. , 'x,y': (score_x, score_y), ..}
average_probabilities, average_prob_cis = rly.get_interval_estimates(
    procgen_algorithm_pairs, metrics.probability_of_improvement, reps=2000)
plot_utils.plot_probability_of_improvement(
    average_probabilities, average_prob_cis)
```
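As we read the paper, the underlying statistic for a pair (X, Y) is, per task, the probability that a randomly chosen run of X beats a randomly chosen run of Y (ties counting one half, as in the Mann-Whitney U statistic), averaged across tasks. A standalone numpy sketch of that definition, separate from rliable's implementation:

```python
import numpy as np

def prob_of_improvement(scores_x, scores_y):
    """Average over tasks of P(run of X > run of Y); inputs are (runs x tasks)."""
    per_task = []
    for t in range(scores_x.shape[1]):
        x, y = scores_x[:, t], scores_y[:, t]
        greater = (x[:, None] > y[None, :]).mean()  # all run pairings
        ties = (x[:, None] == y[None, :]).mean()    # ties count 1/2
        per_task.append(greater + 0.5 * ties)
    return float(np.mean(per_task))

rng = np.random.default_rng(0)
score_x = rng.normal(1.1, 0.3, size=(10, 16))  # illustrative run x task scores
score_y = rng.normal(1.0, 0.3, size=(10, 16))
print(prob_of_improvement(score_x, score_y))   # > 0.5 favors X over Y
```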

Sample Efficiency Curve

```python
algorithms = ['DQN (Nature)', 'DQN (Adam)', 'C51', 'REM', 'Rainbow',
              'IQN', 'M-IQN', 'DreamerV2']
# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices across all 200 million frames, each of which is of size
# (num_runs x num_games x 200), where scores are recorded every million frames.
ale_all_frames_scores_dict = ...
frames = np.array([1, 10, 25, 50, 75, 100, 125, 150, 175, 200]) - 1
ale_frames_scores_dict = {algorithm: score[:, :, frames] for algorithm, score
                          in ale_all_frames_scores_dict.items()}
iqm = lambda scores: np.array([metrics.aggregate_iqm(scores[..., frame])
                               for frame in range(scores.shape[-1])])
iqm_scores, iqm_cis = rly.get_interval_estimates(
    ale_frames_scores_dict, iqm, reps=50000)
plot_utils.plot_sample_efficiency_curve(
    frames + 1, iqm_scores, iqm_cis, algorithms=algorithms,
    xlabel=r'Number of Frames (in millions)',
    ylabel='IQM Human Normalized Score')
```

<img src="https://raw.githubusercontent.com/google-research/rliable/master/images/alelegend.png">

Performance Profiles

```python
# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices, each of which is of size (num_runs x num_games).
atari_200m_normalized_score_dict = ...
# Human normalized score thresholds.
atari_200m_thresholds = np.linspace(0.0, 8.0, 81)
score_distributions, score_distributions_cis = rly.create_performance_profile(
    atari_200m_normalized_score_dict, atari_200m_thresholds)
# Plot score distributions.
fig, ax = plt.subplots(ncols=1, figsize=(7, 5))
plot_utils.plot_performance_profiles(
    score_distributions, atari_200m_thresholds,
    performance_profile_cis=score_distributions_cis,
    colors=dict(zip(algorithms, sns.color_palette('colorblind'))),
    xlabel=r'Human Normalized Score $(\tau)$',
    ax=ax)
```

<img src="https://raw.githubusercontent.com/google-research/rliable/master/images/alelegend.png">

The above profile can also be plotted with non-linear scaling as follows:

```python
plot_utils.plot_performance_profiles(
    perf_prof_atari_200m, atari_200m_tau,
    performance_profile_cis=perf_prof_atari_200m_cis,
    use_non_linear_scaling=True,
    xticks=[0.0, 0.5, 1.0, 2.0, 4.0, 8.0],
    colors=dict(zip(algorithms, sns.color_palette('colorblind'))),
    xlabel=r'Human Normalized Score $(\tau)$',
    ax=ax)
```
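For intuition, the profile value at a threshold tau is simply the fraction of combined (run, task) scores exceeding tau. A minimal numpy sketch of that quantity follows (rliable's create_performance_profile additionally computes stratified bootstrap CIs):

```python
import numpy as np

def run_score_profile(score_matrix, thresholds):
    """Fraction of combined (run, task) scores above each threshold tau.

    score_matrix: (num_runs x num_tasks) array; returns one value per tau."""
    flat = score_matrix.reshape(-1)
    return np.array([(flat > tau).mean() for tau in thresholds])

rng = np.random.default_rng(0)
scores = rng.gamma(2.0, 0.5, size=(5, 26))  # illustrative normalized scores
taus = np.linspace(0.0, 8.0, 81)            # same threshold grid as above
profile = run_score_profile(scores, taus)   # non-increasing curve in [0, 1]
```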

Dependencies

The code was tested under Python>=3.7 and uses these packages:

  • arch == 5.3.0
  • scipy >= 1.7.0
  • numpy >= 0.9.0
  • absl-py >= 1.16.4
  • seaborn >= 0.11.2

Citing

If you find this open-source release useful, please cite it in your paper:

@article{agarwal2021deep,
  title={Deep Reinforcement Learning at the Edge of the Statistical Precipice},
  author={Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel
          and Courville, Aaron and Bellemare, Marc G},
  journal={Advances in Neural Information Processing Systems},
  year={2021}
}

Disclaimer: This is not an official Google product.

Owner

  • Name: Google Research
  • Login: google-research
  • Kind: organization
  • Location: Earth

Citation (CITATION.bib)

@article{agarwal2021deep,
  title={Deep reinforcement learning at the edge of the statistical precipice},
  author={Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel and Courville, Aaron C and Bellemare, Marc},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

GitHub Events

Total
  • Issues event: 5
  • Watch event: 78
  • Fork event: 4
Last Year
  • Issues event: 5
  • Watch event: 78
  • Fork event: 4

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 63
  • Total Committers: 10
  • Avg Commits per committer: 6.3
  • Development Distribution Score (DDS): 0.222
Past Year
  • Commits: 4
  • Committers: 3
  • Avg Commits per committer: 1.333
  • Development Distribution Score (DDS): 0.5
Top Committers
Name Email Commits
Rishabh Agarwal 1****l 49
Quentin Gallouédec 4****c 4
zclzc 3****c 2
Dennis Soemers d****s@g****m 2
Sebastian Markgraf S****f@t****e 1
RLiable Team n****y@g****m 1
Michael Panchenko 3****h 1
Jet 3****s 1
Antonin RAFFIN a****n@e****g 1
Michael Panchenko m****o@a****e 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 20
  • Total pull requests: 10
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 16 days
  • Total issue authors: 17
  • Total pull request authors: 8
  • Average comments per issue: 2.9
  • Average comments per pull request: 1.3
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: 3 months
  • Average time to close pull requests: N/A
  • Issue authors: 4
  • Pull request authors: 0
  • Average comments per issue: 0.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • amantuer (2)
  • MischaPanch (2)
  • slerman12 (2)
  • zhefan (2)
  • HYDesmondLiu (1)
  • STAY-Melody (1)
  • e-pet (1)
  • ezhang7423 (1)
  • MarcoMeter (1)
  • RongpingZhou (1)
  • kidd12138 (1)
  • xkianteb (1)
  • qgallouedec (1)
  • nirbhayjm (1)
  • TaoHuang13 (1)
Pull Request Authors
  • MischaPanch (3)
  • qgallouedec (3)
  • araffin (2)
  • DennisSoemers (2)
  • Aladoro (1)
  • sebimarkgraf (1)
  • lkevinzc (1)
  • jjshoots (1)
Top Labels
Issue Labels
documentation (1) bug (1) enhancement (1) good first issue (1) help wanted (1)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 1,618 last-month
  • Total docker downloads: 10
  • Total dependent packages: 6
    (may contain duplicates)
  • Total dependent repositories: 15
    (may contain duplicates)
  • Total versions: 12
  • Total maintainers: 2
pypi.org: rliable

rliable: Reliable evaluation on reinforcement learning and machine learning benchmarks.

  • Versions: 11
  • Dependent Packages: 6
  • Dependent Repositories: 15
  • Downloads: 1,613 Last month
  • Docker Downloads: 10
Rankings
Dependent packages count: 1.6%
Stargazers count: 2.5%
Downloads: 3.1%
Average: 3.5%
Dependent repos count: 3.7%
Docker downloads count: 4.1%
Forks count: 6.2%
Maintainers (2)
Last synced: 6 months ago
pypi.org: rliable-fork

rliable: Reliable evaluation on reinforcement learning and machine learning benchmarks.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 5 Last month
Rankings
Dependent packages count: 10.5%
Average: 34.8%
Dependent repos count: 59.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • absl-py *
  • arch *
  • numpy *
  • scipy *
  • seaborn *