Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 20 committers (10.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.4%) to scientific vocabulary

Keywords from Contributors

hetnets dwpcs hetnet-connectivity-search networks metagenomics github-pages life-sciences carpentries-lab training stable
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: phiresky
  • License: other
  • Language: HTML
  • Default Branch: main
  • Size: 32.3 MB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 4 years ago · Last pushed over 4 years ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

Bayesian and Attentive Aggregation for Cooperative Multi-Agent Deep Reinforcement Learning

My master's thesis. Written in Pandoc-Markdown.

To install the dependencies:

poetry install

To compile the website:

poetry run build/build.sh && poetry run manubot webpage

Then open webpage/index.html in your browser.

The above steps are automatically run by GitHub CI whenever you push.

To compile the LaTeX PDF:

DISPLAY=:0 poetry run build/build-pdf.sh

The resulting PDF is in output/manuscript.pdf.

The hosted version is here:

https://phiresky.github.io/masters-thesis/

The LaTeX PDF is here:

https://phiresky.github.io/masters-thesis/manuscript.pdf

This repository is based on Manubot: https://github.com/manubot/rootstock with the following changes:

  1. Add the KIT ALR LaTeX thesis template and the build/build-pdf.sh script, which builds the thesis with LaTeX so the result is indistinguishable from a thesis written in LaTeX directly. This (sadly) does not run in GitHub CI due to laziness.
  2. Minor styling changes to the HTML template in build/themes/default.html
  3. A pandoc filter that automatically converts all headings to title case (that is a great idea -> That is a Great Idea)
  4. A pandoc filter that automatically converts SVGs to PDF (including complex ones that Inkscape / the normal LaTeX svg package can't handle)
  5. Switch from the pandoc-xnos pandoc filter to pandoc-crossref, mostly because I'm more familiar with its syntax
  6. Switch from the pandoc-manubot-cite pandoc filter to pandoc-url2cite, because it's my own project
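The title-casing from change 3 can be sketched as follows. This is only an illustration of the casing logic, not the repository's actual filter: the function name and small-word list are assumptions, and a real pandoc filter would apply this to Header elements of the document AST (e.g. via panflute) rather than to raw strings.

```python
# Hypothetical sketch of title-case logic for a pandoc heading filter.
# The SMALL_WORDS set is an assumption; adjust to taste.
SMALL_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "for", "with", "is"}

def title_case(heading: str) -> str:
    """Capitalize each word of a heading, leaving small words lowercase
    (except when they come first)."""
    out = []
    for i, word in enumerate(heading.split()):
        if i > 0 and word.lower() in SMALL_WORDS:
            out.append(word.lower())
        else:
            out.append(word[:1].upper() + word[1:])
    return " ".join(out)

print(title_case("that is a great idea"))  # → That is a Great Idea
```

In an actual filter this function would run over the inline text of every Header node, so authors can write headings in plain lowercase Markdown.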

Owner

  • Login: phiresky
  • Kind: user
  • Location: Germany

Fan of FOSS.

Citation (citation-cache.json)

{
	"_info": "Auto-generated by pandoc-url2cite. Feel free to modify, keys will never be overwritten.",
	"urls": {
		"https://arxiv.org/abs/1909.07528": {
			"fetched": "2021-06-14T14:19:50.607Z",
			"bibtex": [
				"",
				"@article{baker_emergent_2020,",
				"   title = {Emergent {Tool} {Use} {From} {Multi}-{Agent} {Autocurricula}},",
				"   url = {http://arxiv.org/abs/1909.07528},",
				"   abstract = {Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using moveable boxes which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.},",
				"   urldate = {2021-06-14},",
				"   journal = {arXiv:1909.07528 [cs, stat]},",
				"   author = {Baker, Bowen and Kanitscheider, Ingmar and Markov, Todor and Wu, Yi and Powell, Glenn and McGrew, Bob and Mordatch, Igor},",
				"   month = feb,",
				"   year = {2020},",
				"   note = {arXiv: 1909.07528},",
				"   keywords = {Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems, Statistics - Machine Learning},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1909.07528",
				"abstract": "Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using moveable boxes which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"author": [
					{
						"family": "Baker",
						"given": "Bowen"
					},
					{
						"family": "Kanitscheider",
						"given": "Ingmar"
					},
					{
						"family": "Markov",
						"given": "Todor"
					},
					{
						"family": "Wu",
						"given": "Yi"
					},
					{
						"family": "Powell",
						"given": "Glenn"
					},
					{
						"family": "McGrew",
						"given": "Bob"
					},
					{
						"family": "Mordatch",
						"given": "Igor"
					}
				],
				"container-title": "arXiv:1909.07528 [cs, stat]",
				"id": "https://arxiv.org/abs/1909.07528",
				"issued": {
					"date-parts": [
						[
							2020,
							2
						]
					]
				},
				"keyword": "Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems, Statistics - Machine Learning",
				"note": "arXiv: 1909.07528",
				"title": "Emergent Tool Use From Multi-Agent Autocurricula",
				"type": "article-journal"
			}
		},
		"https://jmlr.org/beta/papers/v20/18-476.html": {
			"fetched": "2021-06-14T14:19:51.257Z",
			"bibtex": [
				"",
				"@article{huttenrauch_deep_2019,",
				"   title = {Deep {Reinforcement} {Learning} for {Swarm} {Systems}},",
				"   volume = {20},",
				"   issn = {1533-7928},",
				"   url = {http://jmlr.org/papers/v20/18-476.html},",
				"   language = {en},",
				"   number = {54},",
				"   urldate = {2021-06-14},",
				"   journal = {Journal of Machine Learning Research},",
				"   author = {Hüttenrauch, Maximilian and Šošić, Adrian and Neumann, Gerhard},",
				"   year = {2019},",
				"   pages = {1--31},",
				"}",
				""
			],
			"csl": {
				"ISSN": "1533-7928",
				"URL": "http://jmlr.org/papers/v20/18-476.html",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"author": [
					{
						"family": "Hüttenrauch",
						"given": "Maximilian"
					},
					{
						"family": "Šošić",
						"given": "Adrian"
					},
					{
						"family": "Neumann",
						"given": "Gerhard"
					}
				],
				"container-title": "Journal of Machine Learning Research",
				"id": "https://jmlr.org/beta/papers/v20/18-476.html",
				"issue": "54",
				"issued": {
					"date-parts": [
						[
							2019
						]
					]
				},
				"page": "1-31",
				"title": "Deep Reinforcement Learning for Swarm Systems",
				"type": "article-journal",
				"volume": "20"
			}
		},
		"http://proceedings.mlr.press/v80/yang18d.html": {
			"fetched": "2021-06-14T14:19:51.907Z",
			"bibtex": [
				"",
				"@inproceedings{yang_mean_2018,",
				"   title = {Mean {Field} {Multi}-{Agent} {Reinforcement} {Learning}},",
				"   url = {http://proceedings.mlr.press/v80/yang18d.html},",
				"   abstract = {Existing multi-agent reinforcement learning methods are limited typically to a small number of agents. When the agent number increases largely, the learning becomes intractable due to the curse of ...},",
				"   language = {en},",
				"   urldate = {2021-06-14},",
				"   booktitle = {International {Conference} on {Machine} {Learning}},",
				"   publisher = {PMLR},",
				"   author = {Yang, Yaodong and Luo, Rui and Li, Minne and Zhou, Ming and Zhang, Weinan and Wang, Jun},",
				"   month = jul,",
				"   year = {2018},",
				"   pages = {5571--5580},",
				"}",
				""
			],
			"csl": {
				"URL": "http://proceedings.mlr.press/v80/yang18d.html",
				"abstract": "Existing multi-agent reinforcement learning methods are limited typically to a small number of agents. When the agent number increases largely, the learning becomes intractable due to the curse of ...",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"author": [
					{
						"family": "Yang",
						"given": "Yaodong"
					},
					{
						"family": "Luo",
						"given": "Rui"
					},
					{
						"family": "Li",
						"given": "Minne"
					},
					{
						"family": "Zhou",
						"given": "Ming"
					},
					{
						"family": "Zhang",
						"given": "Weinan"
					},
					{
						"family": "Wang",
						"given": "Jun"
					}
				],
				"container-title": "International Conference on Machine Learning",
				"id": "http://proceedings.mlr.press/v80/yang18d.html",
				"issued": {
					"date-parts": [
						[
							2018,
							7
						]
					]
				},
				"page": "5571-5580",
				"publisher": "PMLR",
				"title": "Mean Field Multi-Agent Reinforcement Learning",
				"type": "paper-conference"
			}
		},
		"https://arxiv.org/abs/1707.06347": {
			"fetched": "2021-06-14T14:19:55.511Z",
			"bibtex": [
				"",
				"@article{schulman_proximal_2017,",
				"   title = {Proximal {Policy} {Optimization} {Algorithms}},",
				"   url = {http://arxiv.org/abs/1707.06347},",
				"   abstract = {We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a \"surrogate\" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.},",
				"   urldate = {2021-06-14},",
				"   journal = {arXiv:1707.06347 [cs]},",
				"   author = {Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg},",
				"   month = aug,",
				"   year = {2017},",
				"   note = {arXiv: 1707.06347},",
				"   keywords = {Computer Science - Machine Learning},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1707.06347",
				"abstract": "We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a \"surrogate\" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"author": [
					{
						"family": "Schulman",
						"given": "John"
					},
					{
						"family": "Wolski",
						"given": "Filip"
					},
					{
						"family": "Dhariwal",
						"given": "Prafulla"
					},
					{
						"family": "Radford",
						"given": "Alec"
					},
					{
						"family": "Klimov",
						"given": "Oleg"
					}
				],
				"container-title": "arXiv:1707.06347 [cs]",
				"id": "https://arxiv.org/abs/1707.06347",
				"issued": {
					"date-parts": [
						[
							2017,
							8
						]
					]
				},
				"keyword": "Computer Science - Machine Learning",
				"note": "arXiv: 1707.06347",
				"title": "Proximal Policy Optimization Algorithms",
				"type": "article-journal"
			}
		},
		"https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#results": {
			"fetched": "2021-06-14T14:19:57.454Z",
			"bibtex": [
				"",
				"@misc{noauthor_ppo_nodate,",
				"   title = {{PPO} — {Stable} {Baselines3} 1.1.0a11 documentation},",
				"   url = {https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#results},",
				"   urldate = {2021-06-14},",
				"   journal = {stable-baselines3.readthedocs.io},",
				"}",
				""
			],
			"csl": {
				"URL": "https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#results",
				"author": [
					{
						"family": "Raffin",
						"given": "Antonin"
					},
					{
						"family": "Hill",
						"given": "Ashley"
					},
					{
						"family": "Ernestus",
						"given": "Maximilian"
					},
					{
						"family": "Gleave",
						"given": "Adam"
					},
					{
						"family": "Kanervisto",
						"given": "Anssi"
					},
					{
						"family": "Dormann",
						"given": "Noah"
					}
				],
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"issued": {
					"date-parts": [
						[
							2020
						]
					]
				},
				"container-title": "stable-baselines3.readthedocs.io",
				"id": "https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html_x35_results",
				"title": "PPO — Stable Baselines3 1.1.0a11 documentation",
				"type": ""
			}
		},
		"https://openreview.net/forum?id=qYZD-AO1Vn": {
			"fetched": "2021-06-14T14:19:58.208Z",
			"bibtex": [
				"",
				"@inproceedings{otto_differentiable_2020,",
				"   title = {Differentiable {Trust} {Region} {Layers} for {Deep} {Reinforcement} {Learning}},",
				"   url = {https://openreview.net/forum?id=qYZD-AO1Vn},",
				"   abstract = {Trust region methods are a popular tool in reinforcement learning as they yield robust policy updates in continuous and discrete action spaces. However, enforcing such trust regions in deep...},",
				"   language = {en},",
				"   urldate = {2021-06-14},",
				"   author = {Otto, Fabian and Becker, Philipp and Ngo, Vien Anh and Ziesche, Hanna Carolin Maria and Neumann, Gerhard},",
				"   month = sep,",
				"   year = {2020},",
				"}",
				""
			],
			"csl": {
				"URL": "https://openreview.net/forum?id=qYZD-AO1Vn",
				"abstract": "Trust region methods are a popular tool in reinforcement learning as they yield robust policy updates in continuous and discrete action spaces. However, enforcing such trust regions in deep...",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"author": [
					{
						"family": "Otto",
						"given": "Fabian"
					},
					{
						"family": "Becker",
						"given": "Philipp"
					},
					{
						"family": "Ngo",
						"given": "Vien Anh"
					},
					{
						"family": "Ziesche",
						"given": "Hanna Carolin Maria"
					},
					{
						"family": "Neumann",
						"given": "Gerhard"
					}
				],
				"id": "https://openreview.net/forum?id_x61_qYZD-AO1Vn",
				"issued": {
					"date-parts": [
						[
							2020,
							9
						]
					]
				},
				"title": "Differentiable Trust Region Layers for Deep Reinforcement Learning",
				"type": "paper-conference"
			}
		},
		"https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html": {
			"fetched": "2021-06-14T14:20:00.087Z",
			"bibtex": [
				"",
				"@article{vaswani_attention_2017,",
				"   title = {Attention is {All} you {Need}},",
				"   volume = {30},",
				"   url = {https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html},",
				"   language = {en},",
				"   urldate = {2021-06-14},",
				"   journal = {Advances in Neural Information Processing Systems},",
				"   author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Łukasz and Polosukhin, Illia},",
				"   year = {2017},",
				"}",
				""
			],
			"csl": {
				"URL": "https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"author": [
					{
						"family": "Vaswani",
						"given": "Ashish"
					},
					{
						"family": "Shazeer",
						"given": "Noam"
					},
					{
						"family": "Parmar",
						"given": "Niki"
					},
					{
						"family": "Uszkoreit",
						"given": "Jakob"
					},
					{
						"family": "Jones",
						"given": "Llion"
					},
					{
						"family": "Gomez",
						"given": "Aidan N."
					},
					{
						"family": "Kaiser",
						"given": "Łukasz"
					},
					{
						"family": "Polosukhin",
						"given": "Illia"
					}
				],
				"container-title": "Advances in Neural Information Processing Systems",
				"id": "https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html",
				"issued": {
					"date-parts": [
						[
							2017
						]
					]
				},
				"title": "Attention is All you Need",
				"type": "article-journal",
				"volume": "30"
			}
		},
		"https://openreview.net/forum?id=ufZN2-aehFa": {
			"fetched": "2021-06-14T14:20:02.638Z",
			"bibtex": [
				"",
				"@inproceedings{volpp_bayesian_2020,",
				"   title = {Bayesian {Context} {Aggregation} for {Neural} {Processes}},",
				"   url = {https://openreview.net/forum?id=ufZN2-aehFa},",
				"   abstract = {Formulating scalable probabilistic regression models with reliable uncertainty estimates has been a long-standing challenge in machine learning research.  Recently, casting probabilistic regression...},",
				"   language = {en},",
				"   urldate = {2021-06-14},",
				"   author = {Volpp, Michael and Flürenbrock, Fabian and Grossberger, Lukas and Daniel, Christian and Neumann, Gerhard},",
				"   month = sep,",
				"   year = {2020},",
				"}",
				""
			],
			"csl": {
				"URL": "https://openreview.net/forum?id=ufZN2-aehFa",
				"abstract": "Formulating scalable probabilistic regression models with reliable uncertainty estimates has been a long-standing challenge in machine learning research. Recently, casting probabilistic regression...",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"author": [
					{
						"family": "Volpp",
						"given": "Michael"
					},
					{
						"family": "Flürenbrock",
						"given": "Fabian"
					},
					{
						"family": "Grossberger",
						"given": "Lukas"
					},
					{
						"family": "Daniel",
						"given": "Christian"
					},
					{
						"family": "Neumann",
						"given": "Gerhard"
					}
				],
				"id": "https://openreview.net/forum?id_x61_ufZN2-aehFa",
				"issued": {
					"date-parts": [
						[
							2020,
							9
						]
					]
				},
				"title": "Bayesian Context Aggregation for Neural Processes",
				"type": "paper-conference"
			}
		},
		"https://www.springer.com/gp/book/9780387310732": {
			"fetched": "2021-06-14T14:20:03.663Z",
			"bibtex": [
				"",
				"@book{bishop_pattern_2006,",
				"   address = {New York},",
				"   series = {Information {Science} and {Statistics}},",
				"   title = {Pattern {Recognition} and {Machine} {Learning}},",
				"   isbn = {9780387310732},",
				"   url = {https://www.springer.com/gp/book/9780387310732},",
				"   abstract = {The dramatic growth in practical applications for machine learning over the last ten years has been accompanied by many important developments in the underlying algorithms and techniques. For example, Bayesian methods have grown from a specialist niche to become mainstream, while graphical models have emerged as a general framework for describing and applying probabilistic techniques. The practical applicability of Bayesian methods has been greatly enhanced by the development of a range of approximate inference algorithms such as variational Bayes and expectation propagation, while new models based on kernels have had a significant impact on both algorithms and applications. This completely new textbook reflects these recent developments while providing a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first-year PhD students, as well as researchers and practitioners. No previous knowledge of pattern recognition or machine learning concepts is assumed. Familiarity with multivariate calculus and basic linear algebra is required, and some experience in the use of probabilities would be helpful though not essential as the book includes a self-contained introduction to basic probability theory. The book is suitable for courses on machine learning, statistics, computer science, signal processing, computer vision, data mining, and bioinformatics. Extensive support is provided for course instructors, including more than 400 exercises, graded according to difficulty. Example solutions for a subset of the exercises are available from the book web site, while solutions for the remainder can be obtained by instructors from the publisher. The book is supported by a great deal of additional material, and the reader is encouraged to visit the book web site for the latest information. Christopher M. Bishop is Deputy Director of Microsoft Research Cambridge, and holds a Chair in Computer Science at the University of Edinburgh. He is a Fellow of Darwin College Cambridge, a Fellow of the Royal Academy of Engineering, and a Fellow of the Royal Society of Edinburgh. His previous textbook \"Neural Networks for Pattern Recognition\" has been widely adopted. Coming soon: *For students, worked solutions to a subset of exercises available on a public web site (for exercises marked \"www\" in the text) *For instructors, worked solutions to remaining exercises from the Springer web site *Lecture slides to accompany each chapter *Data sets available for download},",
				"   language = {en},",
				"   urldate = {2021-06-14},",
				"   publisher = {Springer-Verlag},",
				"   author = {Bishop, Christopher},",
				"   year = {2006},",
				"}",
				""
			],
			"csl": {
				"ISBN": "9780387310732",
				"URL": "https://www.springer.com/gp/book/9780387310732",
				"abstract": "The dramatic growth in practical applications for machine learning over the last ten years has been accompanied by many important developments in the underlying algorithms and techniques. For example, Bayesian methods have grown from a specialist niche to become mainstream, while graphical models have emerged as a general framework for describing and applying probabilistic techniques. The practical applicability of Bayesian methods has been greatly enhanced by the development of a range of approximate inference algorithms such as variational Bayes and expectation propagation, while new models based on kernels have had a significant impact on both algorithms and applications. This completely new textbook reflects these recent developments while providing a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first-year PhD students, as well as researchers and practitioners. No previous knowledge of pattern recognition or machine learning concepts is assumed. Familiarity with multivariate calculus and basic linear algebra is required, and some experience in the use of probabilities would be helpful though not essential as the book includes a self-contained introduction to basic probability theory. The book is suitable for courses on machine learning, statistics, computer science, signal processing, computer vision, data mining, and bioinformatics. Extensive support is provided for course instructors, including more than 400 exercises, graded according to difficulty. Example solutions for a subset of the exercises are available from the book web site, while solutions for the remainder can be obtained by instructors from the publisher. The book is supported by a great deal of additional material, and the reader is encouraged to visit the book web site for the latest information. Christopher M. Bishop is Deputy Director of Microsoft Research Cambridge, and holds a Chair in Computer Science at the University of Edinburgh. He is a Fellow of Darwin College Cambridge, a Fellow of the Royal Academy of Engineering, and a Fellow of the Royal Society of Edinburgh. His previous textbook \"Neural Networks for Pattern Recognition\" has been widely adopted. Coming soon: *For students, worked solutions to a subset of exercises available on a public web site (for exercises marked \"www\" in the text) *For instructors, worked solutions to remaining exercises from the Springer web site *Lecture slides to accompany each chapter *Data sets available for download",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"author": [
					{
						"family": "Bishop",
						"given": "Christopher"
					}
				],
				"collection-title": "Information Science and Statistics",
				"id": "https://www.springer.com/gp/book/9780387310732",
				"issued": {
					"date-parts": [
						[
							2006
						]
					]
				},
				"publisher": "Springer-Verlag",
				"publisher-place": "New York",
				"title": "Pattern Recognition and Machine Learning",
				"type": "book"
			}
		},
		"https://arxiv.org/abs/1802.05438": {
			"fetched": "2021-06-14T14:59:08.068Z",
			"bibtex": [
				"",
				"@article{yang_mean_2020,",
				"   title = {Mean {Field} {Multi}-{Agent} {Reinforcement} {Learning}},",
				"   url = {http://arxiv.org/abs/1802.05438},",
				"   abstract = {Existing multi-agent reinforcement learning methods are limited typically to a small number of agents. When the agent number increases largely, the learning becomes intractable due to the curse of the dimensionality and the exponential growth of agent interactions. In this paper, we present {\\textbackslash}emph\\{Mean Field Reinforcement Learning\\} where the interactions within the population of agents are approximated by those between a single agent and the average effect from the overall population or neighboring agents; the interplay between the two entities is mutually reinforced: the learning of the individual agent's optimal policy depends on the dynamics of the population, while the dynamics of the population change according to the collective patterns of the individual policies. We develop practical mean field Q-learning and mean field Actor-Critic algorithms and analyze the convergence of the solution to Nash equilibrium. Experiments on Gaussian squeeze, Ising model, and battle games justify the learning effectiveness of our mean field approaches. In addition, we report the first result to solve the Ising model via model-free reinforcement learning methods.},",
				"   urldate = {2021-06-14},",
				"   journal = {arXiv:1802.05438 [cs]},",
				"   author = {Yang, Yaodong and Luo, Rui and Li, Minne and Zhou, Ming and Zhang, Weinan and Wang, Jun},",
				"   month = dec,",
				"   year = {2020},",
				"   note = {arXiv: 1802.05438},",
				"   keywords = {Computer Science - Multiagent Systems, Computer Science - Artificial Intelligence, Computer Science - Machine Learning},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1802.05438",
				"abstract": "Existing multi-agent reinforcement learning methods are limited typically to a small number of agents. When the agent number increases largely, the learning becomes intractable due to the curse of the dimensionality and the exponential growth of agent interactions. In this paper, we present emph{Mean Field Reinforcement Learning} where the interactions within the population of agents are approximated by those between a single agent and the average effect from the overall population or neighboring agents; the interplay between the two entities is mutually reinforced: the learning of the individual agent’s optimal policy depends on the dynamics of the population, while the dynamics of the population change according to the collective patterns of the individual policies. We develop practical mean field Q-learning and mean field Actor-Critic algorithms and analyze the convergence of the solution to Nash equilibrium. Experiments on Gaussian squeeze, Ising model, and battle games justify the learning effectiveness of our mean field approaches. In addition, we report the first result to solve the Ising model via model-free reinforcement learning methods.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"author": [
					{
						"family": "Yang",
						"given": "Yaodong"
					},
					{
						"family": "Luo",
						"given": "Rui"
					},
					{
						"family": "Li",
						"given": "Minne"
					},
					{
						"family": "Zhou",
						"given": "Ming"
					},
					{
						"family": "Zhang",
						"given": "Weinan"
					},
					{
						"family": "Wang",
						"given": "Jun"
					}
				],
				"container-title": "arXiv:1802.05438 [cs]",
				"id": "https://arxiv.org/abs/1802.05438",
				"issued": {
					"date-parts": [
						[
							2020,
							12
						]
					]
				},
				"keyword": "Computer Science - Multiagent Systems, Computer Science - Artificial Intelligence, Computer Science - Machine Learning",
				"note": "arXiv: 1802.05438",
				"title": "Mean Field Multi-Agent Reinforcement Learning",
				"type": "article-journal"
			}
		},
		"https://arxiv.org/abs/1703.04908": {
			"fetched": "2021-06-14T15:29:27.728Z",
			"bibtex": [
				"",
				"@article{mordatch_emergence_2018,",
				"   title = {Emergence of {Grounded} {Compositional} {Language} in {Multi}-{Agent} {Populations}},",
				"   url = {http://arxiv.org/abs/1703.04908},",
				"   abstract = {By capturing statistical patterns in large corpora, machine learning has enabled significant advances in natural language processing, including in machine translation, question answering, and sentiment analysis. However, for agents to intelligently interact with humans, simply capturing the statistical patterns is insufficient. In this paper we investigate if, and how, grounded compositional language can emerge as a means to achieve goals in multi-agent populations. Towards this end, we propose a multi-agent learning environment and learning methods that bring about emergence of a basic compositional language. This language is represented as streams of abstract discrete symbols uttered by agents over time, but nonetheless has a coherent structure that possesses a defined vocabulary and syntax. We also observe emergence of non-verbal communication such as pointing and guiding when language communication is unavailable.},",
				"   urldate = {2021-06-14},",
				"   journal = {arXiv:1703.04908 [cs]},",
				"   author = {Mordatch, Igor and Abbeel, Pieter},",
				"   month = jul,",
				"   year = {2018},",
				"   note = {arXiv: 1703.04908},",
				"   keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1703.04908",
				"abstract": "By capturing statistical patterns in large corpora, machine learning has enabled significant advances in natural language processing, including in machine translation, question answering, and sentiment analysis. However, for agents to intelligently interact with humans, simply capturing the statistical patterns is insufficient. In this paper we investigate if, and how, grounded compositional language can emerge as a means to achieve goals in multi-agent populations. Towards this end, we propose a multi-agent learning environment and learning methods that bring about emergence of a basic compositional language. This language is represented as streams of abstract discrete symbols uttered by agents over time, but nonetheless has a coherent structure that possesses a defined vocabulary and syntax. We also observe emergence of non-verbal communication such as pointing and guiding when language communication is unavailable.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							14
						]
					]
				},
				"author": [
					{
						"family": "Mordatch",
						"given": "Igor"
					},
					{
						"family": "Abbeel",
						"given": "Pieter"
					}
				],
				"container-title": "arXiv:1703.04908 [cs]",
				"id": "https://arxiv.org/abs/1703.04908",
				"issued": {
					"date-parts": [
						[
							2018,
							7
						]
					]
				},
				"keyword": "Computer Science - Artificial Intelligence, Computer Science - Computation and Language",
				"note": "arXiv: 1703.04908",
				"title": "Emergence of Grounded Compositional Language in Multi-Agent Populations",
				"type": "article-journal"
			}
		},
		"https://ieeexplore.ieee.org/abstract/document/9173524": {
			"fetched": "2021-06-20T13:36:49.738Z",
			"bibtex": [
				"",
				"@article{hu_occlusion-based_2020,",
				"   title = {Occlusion-{Based} {Coordination} {Protocol} {Design} for {Autonomous} {Robotic} {Shepherding} {Tasks}},",
				"   issn = {2379-8939},",
				"   url = {https://ieeexplore.ieee.org/abstract/document/9173524},",
				"   doi = {10.1109/TCDS.2020.3018549},",
				"   abstract = {The robotic shepherding problem has earned significant research interest over the last few decades due to its potential application in precision agriculture. In this paper, we first modeled the sheep flocking behavior using adaptive protocols and artificial potential field methods. Then we designed a coordination algorithm for the robotic dogs. An occlusion-based motion control strategy was proposed to herd the sheep to the desired location. Compared to formation based techniques, the proposed control strategy provides more flexibility and efficiency when herding a large number of sheep. Simulation and lab-based experiments, using real robots and global vision-based tracking system, were carried out to validate the effectiveness of the proposed approach.},",
				"   urldate = {2021-06-20},",
				"   journal = {IEEE Transactions on Cognitive and Developmental Systems},",
				"   author = {Hu, Junyan and Turgut, Ali Emre and Krajník, Tomáš and Lennox, Barry and Arvin, Farshad},",
				"   year = {2020},",
				"   keywords = {Dogs, Robot kinematics, Task analysis, Protocols, Adaptation models, Trajectory, Autonomous robots, bio-inspired swarm intelligence, shepherding, multi-robot coordination, mobile robotics.},",
				"   pages = {1--1},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1109/TCDS.2020.3018549",
				"ISSN": "2379-8939",
				"URL": "https://ieeexplore.ieee.org/abstract/document/9173524",
				"abstract": "The robotic shepherding problem has earned significant research interest over the last few decades due to its potential application in precision agriculture. In this paper, we first modeled the sheep flocking behavior using adaptive protocols and artificial potential field methods. Then we designed a coordination algorithm for the robotic dogs. An occlusion-based motion control strategy was proposed to herd the sheep to the desired location. Compared to formation based techniques, the proposed control strategy provides more flexibility and efficiency when herding a large number of sheep. Simulation and lab-based experiments, using real robots and global vision-based tracking system, were carried out to validate the effectiveness of the proposed approach.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							20
						]
					]
				},
				"author": [
					{
						"family": "Hu",
						"given": "Junyan"
					},
					{
						"family": "Turgut",
						"given": "Ali Emre"
					},
					{
						"family": "Krajník",
						"given": "Tomáš"
					},
					{
						"family": "Lennox",
						"given": "Barry"
					},
					{
						"family": "Arvin",
						"given": "Farshad"
					}
				],
				"container-title": "IEEE Transactions on Cognitive and Developmental Systems",
				"id": "https://ieeexplore.ieee.org/abstract/document/9173524",
				"issued": {
					"date-parts": [
						[
							2020
						]
					]
				},
				"keyword": "Dogs, Robot kinematics, Task analysis, Protocols, Adaptation models, Trajectory, Autonomous robots, bio-inspired swarm intelligence, shepherding, multi-robot coordination, mobile robotics.",
				"page": "1-1",
				"title": "Occlusion-Based Coordination Protocol Design for Autonomous Robotic Shepherding Tasks",
				"type": "article-journal"
			}
		},
		"https://github.com/DLR-RM/stable-baselines3": {
			"fetched": "2021-07-13T13:54:50.346Z",
			"bibtex": [],
			"csl": {
				"author": [
					{
						"family": "Raffin",
						"given": "Antonin"
					},
					{
						"family": "Hill",
						"given": "Ashley"
					},
					{
						"family": "Ernestus",
						"given": "Maximilian"
					},
					{
						"family": "Gleave",
						"given": "Adam"
					},
					{
						"family": "Kanervisto",
						"given": "Anssi"
					},
					{
						"family": "Dormann",
						"given": "Noah"
					}
				],
				"container-title": "GitHub repository",
				"id": "https://github.com/DLR-RM/stable-baselines3",
				"issued": {
					"date-parts": [
						[
							2019
						]
					]
				},
				"publisher": "https://github.com/DLR-RM/stable-baselines3; GitHub",
				"title": "Stable Baselines3",
				"type": ""
			}
		},
		"https://www.semanticscholar.org/paper/Using-M-Embeddings-to-Learn-Control-Strategies-for-Gebhardt-H%C3%BCttenrauch/9f550815f8858e7c4c8aef23665fa5817884f1b3": {
			"fetched": "2021-06-20T17:30:05.842Z",
			"bibtex": [
				"",
				"@misc{gebhardt_using_2019,",
				"   title = {Using {M}-{Embeddings} to {Learn} {Control} {Strategies} for {Robot} {Swarms}},",
				"   url = {https://www.semanticscholar.org/paper/Using-M-Embeddings-to-Learn-Control-Strategies-for-Gebhardt-H%C3%BCttenrauch/9f550815f8858e7c4c8aef23665fa5817884f1b3},",
				"   abstract = {Neural networks usually have a predefined structure which requires that the number of inputs and outputs is known in advance. In the case of swarms this is a severe limitation, as we might not always have the same number of agents in the swarm. However, also in other situations we might have to deal with variable numbers of homogeneous observations, as for example point clouds. Furthermore, such data has usually no ordering (i.e., if we exchange two swarm agents, we still have semantically the same state of the swarm, if we exchange two points in a pointcloud, it still represents the same 3D structure) which cannot be exploited by standard neural network architectures. In this paper, we present a structure, called the deep M-embeddings which are inspired by the kernel mean embeddings and allow for a compact representation of a variable set of homogeneous inputs as a fixed size feature vector. In experimental evaluations, we show that this representation allows to learn complex policies in a multi-agent environment outperforming a standard multi-layer perceptron both in the achieved average episode return and in sample efficiency.},",
				"   language = {en},",
				"   urldate = {2021-06-20},",
				"   journal = {www.semanticscholar.org},",
				"   author = {Gebhardt, Gregor H. W. and Hüttenrauch, Maximilian and Neumann, G.},",
				"   year = {2019},",
				"}",
				""
			],
			"csl": {
				"URL": "https://www.semanticscholar.org/paper/Using-M-Embeddings-to-Learn-Control-Strategies-for-Gebhardt-H%C3%BCttenrauch/9f550815f8858e7c4c8aef23665fa5817884f1b3",
				"abstract": "Neural networks usually have a predefined structure which requires that the number of inputs and outputs is known in advance. In the case of swarms this is a severe limitation, as we might not always have the same number of agents in the swarm. However, also in other situations we might have to deal with variable numbers of homogeneous observations, as for example point clouds. Furthermore, such data has usually no ordering (i.e., if we exchange two swarm agents, we still have semantically the same state of the swarm, if we exchange two points in a pointcloud, it still represents the same 3D structure) which cannot be exploited by standard neural network architectures. In this paper, we present a structure, called the deep M-embeddings which are inspired by the kernel mean embeddings and allow for a compact representation of a variable set of homogeneous inputs as a fixed size feature vector. In experimental evaluations, we show that this representation allows to learn complex policies in a multi-agent environment outperforming a standard multi-layer perceptron both in the achieved average episode return and in sample efficiency.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							20
						]
					]
				},
				"author": [
					{
						"family": "Gebhardt",
						"given": "Gregor H. W."
					},
					{
						"family": "Hüttenrauch",
						"given": "Maximilian"
					},
					{
						"family": "Neumann",
						"given": "G."
					}
				],
				"container-title": "www.semanticscholar.org",
				"id": "https://www.semanticscholar.org/paper/Using-M-Embeddings-to-Learn-Control-Strategies-for-Gebhardt-H%C3%BCttenrauch/9f550815f8858e7c4c8aef23665fa5817884f1b3",
				"issued": {
					"date-parts": [
						[
							2019
						]
					]
				},
				"title": "Using M-Embeddings to Learn Control Strategies for Robot Swarms",
				"type": ""
			}
		},
		"https://ieeexplore.ieee.org/document/9049415": {
			"fetched": "2021-06-22T13:46:09.892Z",
			"bibtex": [
				"",
				"@article{liu_attentive_2020,",
				"   title = {Attentive {Relational} {State} {Representation} in {Decentralized} {Multiagent} {Reinforcement} {Learning}},",
				"   issn = {2168-2275},",
				"   url = {https://ieeexplore.ieee.org/document/9049415},",
				"   doi = {10.1109/TCYB.2020.2979803},",
				"   abstract = {In multiagent reinforcement learning (MARL), it is crucial for each agent to model the relation with its neighbors. Existing approaches usually resort to concatenate the features of multiple neighbors, fixing the size and the identity of the inputs. But these settings are inflexible and unscalable. In this article, we propose an attentive relational encoder (ARE), which is a novel scalable feedforward neural module, to attentionally aggregate an arbitrary-sized neighboring feature set for state representation in the decentralized MARL. The ARE actively selects the relevant information from the neighboring agents and is permutation invariant, computationally efficient, and flexible to interactive multiagent systems. Our method consistently outperforms the latest competing decentralized MARL methods in several multiagent tasks. In particular, it shows strong cooperative performance in challenging StarCraft micromanagement tasks and achieves over a 96\\% winning rate against the most difficult noncheating built-in artificial intelligence bots.},",
				"   urldate = {2021-06-22},",
				"   journal = {IEEE Transactions on Cybernetics},",
				"   author = {Liu, Xiangyu and Tan, Ying},",
				"   year = {2020},",
				"   keywords = {Aggregates, Multi-agent systems, Reinforcement learning, Task analysis, Protocols, Decision making, Topology, Agent modeling, attentive relational encoder (ARE), decentralized learning, multiagent reinforcement learning (MARL), state representation},",
				"   pages = {1--13},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1109/TCYB.2020.2979803",
				"ISSN": "2168-2275",
				"URL": "https://ieeexplore.ieee.org/document/9049415",
				"abstract": "In multiagent reinforcement learning (MARL), it is crucial for each agent to model the relation with its neighbors. Existing approaches usually resort to concatenate the features of multiple neighbors, fixing the size and the identity of the inputs. But these settings are inflexible and unscalable. In this article, we propose an attentive relational encoder (ARE), which is a novel scalable feedforward neural module, to attentionally aggregate an arbitrary-sized neighboring feature set for state representation in the decentralized MARL. The ARE actively selects the relevant information from the neighboring agents and is permutation invariant, computationally efficient, and flexible to interactive multiagent systems. Our method consistently outperforms the latest competing decentralized MARL methods in several multiagent tasks. In particular, it shows strong cooperative performance in challenging StarCraft micromanagement tasks and achieves over a 96% winning rate against the most difficult noncheating built-in artificial intelligence bots.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							22
						]
					]
				},
				"author": [
					{
						"family": "Liu",
						"given": "Xiangyu"
					},
					{
						"family": "Tan",
						"given": "Ying"
					}
				],
				"container-title": "IEEE Transactions on Cybernetics",
				"id": "https://ieeexplore.ieee.org/document/9049415",
				"issued": {
					"date-parts": [
						[
							2020
						]
					]
				},
				"keyword": "Aggregates, Multi-agent systems, Reinforcement learning, Task analysis, Protocols, Decision making, Topology, Agent modeling, attentive relational encoder (ARE), decentralized learning, multiagent reinforcement learning (MARL), state representation",
				"page": "1-13",
				"title": "Attentive Relational State Representation in Decentralized Multiagent Reinforcement Learning",
				"type": "article-journal"
			}
		},
		"https://arxiv.org/abs/1706.02275": {
			"fetched": "2021-06-22T13:47:53.130Z",
			"bibtex": [
				"",
				"@article{lowe_multi-agent_2020,",
				"   title = {Multi-{Agent} {Actor}-{Critic} for {Mixed} {Cooperative}-{Competitive} {Environments}},",
				"   url = {http://arxiv.org/abs/1706.02275},",
				"   abstract = {We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.},",
				"   urldate = {2021-06-22},",
				"   journal = {arXiv:1706.02275 [cs]},",
				"   author = {Lowe, Ryan and Wu, Yi and Tamar, Aviv and Harb, Jean and Abbeel, Pieter and Mordatch, Igor},",
				"   month = mar,",
				"   year = {2020},",
				"   note = {arXiv: 1706.02275},",
				"   keywords = {Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1706.02275",
				"abstract": "We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							22
						]
					]
				},
				"author": [
					{
						"family": "Lowe",
						"given": "Ryan"
					},
					{
						"family": "Wu",
						"given": "Yi"
					},
					{
						"family": "Tamar",
						"given": "Aviv"
					},
					{
						"family": "Harb",
						"given": "Jean"
					},
					{
						"family": "Abbeel",
						"given": "Pieter"
					},
					{
						"family": "Mordatch",
						"given": "Igor"
					}
				],
				"container-title": "arXiv:1706.02275 [cs]",
				"id": "https://arxiv.org/abs/1706.02275",
				"issued": {
					"date-parts": [
						[
							2020,
							3
						]
					]
				},
				"keyword": "Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing",
				"note": "arXiv: 1706.02275",
				"title": "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments",
				"type": "article-journal"
			}
		},
		"https://arxiv.org/abs/1806.00877": {
			"fetched": "2021-06-25T13:41:29.173Z",
			"bibtex": [
				"",
				"@article{wai_multi-agent_2019,",
				"   title = {Multi-{Agent} {Reinforcement} {Learning} via {Double} {Averaging} {Primal}-{Dual} {Optimization}},",
				"   url = {http://arxiv.org/abs/1806.00877},",
				"   abstract = {Despite the success of single-agent reinforcement learning, multi-agent reinforcement learning (MARL) remains challenging due to complex interactions between agents. Motivated by decentralized applications such as sensor networks, swarm robotics, and power grids, we study policy evaluation in MARL, where agents with jointly observed state-action pairs and private local rewards collaborate to learn the value of a given policy. In this paper, we propose a double averaging scheme, where each agent iteratively performs averaging over both space and time to incorporate neighboring gradient information and local reward information, respectively. We prove that the proposed algorithm converges to the optimal solution at a global geometric rate. In particular, such an algorithm is built upon a primal-dual reformulation of the mean squared projected Bellman error minimization problem, which gives rise to a decentralized convex-concave saddle-point problem. To the best of our knowledge, the proposed double averaging primal-dual optimization algorithm is the first to achieve fast finite-time convergence on decentralized convex-concave saddle-point problems.},",
				"   urldate = {2021-06-25},",
				"   journal = {arXiv:1806.00877 [cs, math, stat]},",
				"   author = {Wai, Hoi-To and Yang, Zhuoran and Wang, Zhaoran and Hong, Mingyi},",
				"   month = jan,",
				"   year = {2019},",
				"   note = {arXiv: 1806.00877},",
				"   keywords = {Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1806.00877",
				"abstract": "Despite the success of single-agent reinforcement learning, multi-agent reinforcement learning (MARL) remains challenging due to complex interactions between agents. Motivated by decentralized applications such as sensor networks, swarm robotics, and power grids, we study policy evaluation in MARL, where agents with jointly observed state-action pairs and private local rewards collaborate to learn the value of a given policy. In this paper, we propose a double averaging scheme, where each agent iteratively performs averaging over both space and time to incorporate neighboring gradient information and local reward information, respectively. We prove that the proposed algorithm converges to the optimal solution at a global geometric rate. In particular, such an algorithm is built upon a primal-dual reformulation of the mean squared projected Bellman error minimization problem, which gives rise to a decentralized convex-concave saddle-point problem. To the best of our knowledge, the proposed double averaging primal-dual optimization algorithm is the first to achieve fast finite-time convergence on decentralized convex-concave saddle-point problems.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							25
						]
					]
				},
				"author": [
					{
						"family": "Wai",
						"given": "Hoi-To"
					},
					{
						"family": "Yang",
						"given": "Zhuoran"
					},
					{
						"family": "Wang",
						"given": "Zhaoran"
					},
					{
						"family": "Hong",
						"given": "Mingyi"
					}
				],
				"container-title": "arXiv:1806.00877 [cs, math, stat]",
				"id": "https://arxiv.org/abs/1806.00877",
				"issued": {
					"date-parts": [
						[
							2019,
							1
						]
					]
				},
				"keyword": "Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning",
				"note": "arXiv: 1806.00877",
				"title": "Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization",
				"type": "article-journal"
			}
		},
		"https://arxiv.org/abs/1911.10635": {
			"fetched": "2021-06-25T14:17:16.578Z",
			"bibtex": [
				"",
				"@article{zhang_multi-agent_2021,",
				"   title = {Multi-{Agent} {Reinforcement} {Learning}: {A} {Selective} {Overview} of {Theories} and {Algorithms}},",
				"   shorttitle = {Multi-{Agent} {Reinforcement} {Learning}},",
				"   url = {http://arxiv.org/abs/1911.10635},",
				"   abstract = {Recent years have witnessed significant advances in reinforcement learning (RL), which has registered great success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one single agent, which naturally fall into the realm of multi-agent RL (MARL), a domain with a relatively long history, and has recently re-emerged due to advances in single-agent RL techniques. Though empirically successful, theoretical foundations for MARL are relatively lacking in the literature. In this chapter, we provide a selective overview of MARL, with focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results of MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to the existing reviews on MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, (non-)convergence of policy-based methods for learning in games, etc. Some of the new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an assessment of the current state of the field on the mark, to identify fruitful future research directions on theoretical studies of MARL. We expect this chapter to serve as continuing stimulus for researchers interested in working on this exciting while challenging topic.},",
				"   urldate = {2021-06-25},",
				"   journal = {arXiv:1911.10635 [cs, stat]},",
				"   author = {Zhang, Kaiqing and Yang, Zhuoran and Başar, Tamer},",
				"   month = apr,",
				"   year = {2021},",
				"   note = {arXiv: 1911.10635},",
				"   keywords = {Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems, Statistics - Machine Learning},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1911.10635",
				"abstract": "Recent years have witnessed significant advances in reinforcement learning (RL), which has registered great success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one single agent, which naturally fall into the realm of multi-agent RL (MARL), a domain with a relatively long history, and has recently re-emerged due to advances in single-agent RL techniques. Though empirically successful, theoretical foundations for MARL are relatively lacking in the literature. In this chapter, we provide a selective overview of MARL, with focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results of MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to the existing reviews on MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, (non-)convergence of policy-based methods for learning in games, etc. Some of the new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an assessment of the current state of the field on the mark, to identify fruitful future research directions on theoretical studies of MARL. We expect this chapter to serve as continuing stimulus for researchers interested in working on this exciting while challenging topic.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							25
						]
					]
				},
				"author": [
					{
						"family": "Zhang",
						"given": "Kaiqing"
					},
					{
						"family": "Yang",
						"given": "Zhuoran"
					},
					{
						"family": "Başar",
						"given": "Tamer"
					}
				],
				"container-title": "arXiv:1911.10635 [cs, stat]",
				"id": "https://arxiv.org/abs/1911.10635",
				"issued": {
					"date-parts": [
						[
							2021,
							4
						]
					]
				},
				"keyword": "Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems, Statistics - Machine Learning",
				"note": "arXiv: 1911.10635",
				"title": "Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms",
				"title-short": "Multi-Agent Reinforcement Learning",
				"type": "article-journal"
			}
		},
		"https://ieeexplore.ieee.org/document/6415291": {
			"fetched": "2021-07-19T12:25:05.206Z",
			"bibtex": [],
			"csl": {
				"DOI": "10.1109/TSP.2013.2241057",
				"author": [
					{
						"family": "Kar",
						"given": "Soummya"
					},
					{
						"family": "Moura",
						"given": "José M. F."
					},
					{
						"family": "Poor",
						"given": "H. Vincent"
					}
				],
				"container-title": "IEEE Transactions on Signal Processing",
				"id": "https://ieeexplore.ieee.org/document/6415291",
				"issue": "7",
				"issued": {
					"date-parts": [
						[
							2013
						]
					]
				},
				"page": "1848-1862",
				"title": "QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through Consensus + Innovations",
				"title-short": "QD-learning",
				"type": "article-journal",
				"volume": "61"
			}
		},
		"http://proceedings.mlr.press/v80/zhang18n.html": {
			"fetched": "2021-06-25T14:49:32.901Z",
			"bibtex": [
				"",
				"@inproceedings{zhang_fully_2018,",
				"   title = {Fully {Decentralized} {Multi}-{Agent} {Reinforcement} {Learning} with {Networked} {Agents}},",
				"   url = {http://proceedings.mlr.press/v80/zhang18n.html},",
				"   abstract = {We consider the fully decentralized multi-agent reinforcement learning (MARL) problem, where the agents are connected via a time-varying and possibly sparse communication network. Specifically, we ...},",
				"   language = {en},",
				"   urldate = {2021-06-25},",
				"   booktitle = {International {Conference} on {Machine} {Learning}},",
				"   publisher = {PMLR},",
				"   author = {Zhang, Kaiqing and Yang, Zhuoran and Liu, Han and Zhang, Tong and Basar, Tamer},",
				"   month = jul,",
				"   year = {2018},",
				"   pages = {5872--5881},",
				"}",
				""
			],
			"csl": {
				"URL": "http://proceedings.mlr.press/v80/zhang18n.html",
				"abstract": "We consider the fully decentralized multi-agent reinforcement learning (MARL) problem, where the agents are connected via a time-varying and possibly sparse communication network. Specifically, we ...",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							25
						]
					]
				},
				"author": [
					{
						"family": "Zhang",
						"given": "Kaiqing"
					},
					{
						"family": "Yang",
						"given": "Zhuoran"
					},
					{
						"family": "Liu",
						"given": "Han"
					},
					{
						"family": "Zhang",
						"given": "Tong"
					},
					{
						"family": "Basar",
						"given": "Tamer"
					}
				],
				"container-title": "International Conference on Machine Learning",
				"id": "http://proceedings.mlr.press/v80/zhang18n.html",
				"issued": {
					"date-parts": [
						[
							2018,
							7
						]
					]
				},
				"page": "5872-5881",
				"publisher": "PMLR",
				"title": "Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents",
				"type": "paper-conference"
			}
		},
		"https://link.springer.com/chapter/10.1007/978-3-642-27645-3_15": {
			"fetched": "2021-06-25T14:49:35.869Z",
			"bibtex": [
				"",
				"@incollection{oliehoek_decentralized_2012,",
				"   address = {Berlin, Heidelberg},",
				"   series = {Adaptation, {Learning}, and {Optimization}},",
				"   title = {Decentralized {POMDPs}},",
				"   isbn = {9783642276453},",
				"   url = {https://doi.org/10.1007/978-3-642-27645-3_15},",
				"   abstract = {This chapter presents an overview of the decentralized POMDP (Dec-POMDP) framework. In a Dec-POMDP, a team of agents collaborates to maximize a global reward based on local information only. This means that agents do not observe a Markovian signal during execution and therefore the agents’ individual policies map from histories to actions. Searching for an optimal joint policy is an extremely hard problem: it is NEXP-complete. This suggests, assuming NEXP≠EXP, that any optimal solution method will require doubly exponential time in the worst case. This chapter focuses on planning for Dec-POMDPs over a finite horizon. It covers the forward heuristic search approach to solving Dec-POMDPs, as well as the backward dynamic programming approach. Also, it discusses how these relate to the optimal Q-value function of a Dec-POMDP. Finally, it provides pointers to other solution methods and further related topics.},",
				"   language = {en},",
				"   urldate = {2021-06-25},",
				"   booktitle = {Reinforcement {Learning}: {State}-of-the-{Art}},",
				"   publisher = {Springer},",
				"   author = {Oliehoek, Frans A.},",
				"   editor = {Wiering, Marco and van Otterlo, Martijn},",
				"   year = {2012},",
				"   doi = {10.1007/978-3-642-27645-3_15},",
				"   keywords = {Multi Agent System ,  Multiagent System ,  Autonomous Agent ,  International Joint ,  Observation History },",
				"   pages = {471--503},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1007/978-3-642-27645-3_15",
				"ISBN": "9783642276453",
				"URL": "https://doi.org/10.1007/978-3-642-27645-3_15",
				"abstract": "This chapter presents an overview of the decentralized POMDP (Dec- POMDP) framework. In a Dec-POMDP, a team of agents collaborates to maximize a global reward based on local information only. This means that agents do not observe a Markovian signal during execution and therefore the agents’ individual policies map fromhistories to actions. Searching for an optimal joint policy is an extremely hard problem: it is NEXP-complete. This suggests, assuming NEXP≠EXP, that any optimal solution method will require doubly exponential time in the worst case. This chapter focuses on planning for Dec-POMDPs over a finite horizon. It covers the forward heuristic search approach to solving Dec-POMDPs, as well as the backward dynamic programming approach. Also, it discusses how these relate to the optimal Q-value function of a Dec-POMDP. Finally, it provides pointers to other solution methods and further related topics.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							25
						]
					]
				},
				"author": [
					{
						"family": "Oliehoek",
						"given": "Frans A."
					}
				],
				"collection-title": "Adaptation, Learning, and Optimization",
				"container-title": "Reinforcement Learning: State-of-the-Art",
				"editor": [
					{
						"family": "Wiering",
						"given": "Marco"
					},
					{
						"dropping-particle": "van",
						"family": "Otterlo",
						"given": "Martijn"
					}
				],
				"id": "https://link.springer.com/chapter/10.1007/978-3-642-27645-3_15",
				"issued": {
					"date-parts": [
						[
							2012
						]
					]
				},
				"keyword": "Multi Agent System , Multiagent System , Autonomous Agent , International Joint , Observation History",
				"page": "471-503",
				"publisher": "Springer",
				"publisher-place": "Berlin, Heidelberg",
				"title": "Decentralized POMDPs",
				"type": "chapter"
			}
		},
		"https://link.springer.com/chapter/10.1007/978-3-319-71682-4_5": {
			"fetched": "2021-06-25T14:49:38.340Z",
			"bibtex": [
				"",
				"@inproceedings{gupta_cooperative_2017,",
				"   address = {Cham},",
				"   series = {Lecture {Notes} in {Computer} {Science}},",
				"   title = {Cooperative {Multi}-agent {Control} {Using} {Deep} {Reinforcement} {Learning}},",
				"   isbn = {9783319716824},",
				"   url = {https://link.springer.com/chapter/10.1007/978-3-319-71682-4_5},",
				"   doi = {10.1007/978-3-319-71682-4_5},",
				"   abstract = {This work considers the problem of learning cooperative policies in complex, partially observable domains without explicit communication. We extend three classes of single-agent deep reinforcement learning algorithms based on policy gradient, temporal-difference error, and actor-critic methods to cooperative multi-agent systems. To effectively scale these algorithms beyond a trivial number of agents, we combine them with a multi-agent variant of curriculum learning. The algorithms are benchmarked on a suite of cooperative control tasks, including tasks with discrete and continuous actions, as well as tasks with dozens of cooperating agents. We report the performance of the algorithms using different neural architectures, training procedures, and reward structures. We show that policy gradient methods tend to outperform both temporal-difference and actor-critic methods and that curriculum learning is vital to scaling reinforcement learning algorithms in complex multi-agent domains.},",
				"   language = {en},",
				"   urldate = {2021-06-25},",
				"   booktitle = {Autonomous {Agents} and {Multiagent} {Systems}},",
				"   publisher = {Springer International Publishing},",
				"   author = {Gupta, Jayesh K. and Egorov, Maxim and Kochenderfer, Mykel},",
				"   editor = {Sukthankar, Gita and Rodriguez-Aguilar, Juan A.},",
				"   year = {2017},",
				"   pages = {66--83},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1007/978-3-319-71682-4_5",
				"ISBN": "9783319716824",
				"URL": "https://link.springer.com/chapter/10.1007/978-3-319-71682-4_5",
				"abstract": "This work considers the problem of learning cooperative policies in complex, partially observable domains without explicit communication. We extend three classes of single-agent deep reinforcement learning algorithms based on policy gradient, temporal-difference error, and actor-critic methods to cooperative multi-agent systems. To effectively scale these algorithms beyond a trivial number of agents, we combine them with a multi-agent variant of curriculum learning. The algorithms are benchmarked on a suite of cooperative control tasks, including tasks with discrete and continuous actions, as well as tasks with dozens of cooperating agents. We report the performance of the algorithms using different neural architectures, training procedures, and reward structures. We show that policy gradient methods tend to outperform both temporal-difference and actor-critic methods and that curriculum learning is vital to scaling reinforcement learning algorithms in complex multi-agent domains.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							25
						]
					]
				},
				"author": [
					{
						"family": "Gupta",
						"given": "Jayesh K."
					},
					{
						"family": "Egorov",
						"given": "Maxim"
					},
					{
						"family": "Kochenderfer",
						"given": "Mykel"
					}
				],
				"collection-title": "Lecture Notes in Computer Science",
				"container-title": "Autonomous Agents and Multiagent Systems",
				"editor": [
					{
						"family": "Sukthankar",
						"given": "Gita"
					},
					{
						"family": "Rodriguez-Aguilar",
						"given": "Juan A."
					}
				],
				"id": "https://link.springer.com/chapter/10.1007/978-3-319-71682-4_5",
				"issued": {
					"date-parts": [
						[
							2017
						]
					]
				},
				"page": "66-83",
				"publisher": "Springer International Publishing",
				"publisher-place": "Cham",
				"title": "Cooperative Multi-agent Control Using Deep Reinforcement Learning",
				"type": "paper-conference"
			}
		},
		"https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193": {
			"fetched": "2021-06-25T14:49:40.232Z",
			"bibtex": [
				"",
				"@misc{noauthor_foerster_nodate,",
				"   title = {Foerster},",
				"   url = {https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193},",
				"   urldate = {2021-06-25},",
				"   journal = {www.aaai.org},",
				"}",
				""
			],
			"csl": {
				"URL": "https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							25
						]
					]
				},
				"container-title": "www.aaai.org",
				"id": "https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193",
				"title": "Foerster",
				"type": ""
			}
		},
		"https://proceedings.neurips.cc/paper/2016/hash/c7635bfd99248a2cdef8249ef7bfbef4-Abstract.html": {
			"fetched": "2021-06-25T14:49:41.264Z",
			"bibtex": [
				"",
				"@article{foerster_learning_2016,",
				"   title = {Learning to {Communicate} with {Deep} {Multi}-{Agent} {Reinforcement} {Learning}},",
				"   volume = {29},",
				"   url = {https://proceedings.neurips.cc/paper/2016/hash/c7635bfd99248a2cdef8249ef7bfbef4-Abstract.html},",
				"   language = {en},",
				"   urldate = {2021-06-25},",
				"   journal = {Advances in Neural Information Processing Systems},",
				"   author = {Foerster, Jakob and Assael, Ioannis Alexandros and de Freitas, Nando and Whiteson, Shimon},",
				"   year = {2016},",
				"}",
				""
			],
			"csl": {
				"URL": "https://proceedings.neurips.cc/paper/2016/hash/c7635bfd99248a2cdef8249ef7bfbef4-Abstract.html",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							25
						]
					]
				},
				"author": [
					{
						"family": "Foerster",
						"given": "Jakob"
					},
					{
						"family": "Assael",
						"given": "Ioannis Alexandros"
					},
					{
						"dropping-particle": "de",
						"family": "Freitas",
						"given": "Nando"
					},
					{
						"family": "Whiteson",
						"given": "Shimon"
					}
				],
				"container-title": "Advances in Neural Information Processing Systems",
				"id": "https://proceedings.neurips.cc/paper/2016/hash/c7635bfd99248a2cdef8249ef7bfbef4-Abstract.html",
				"issued": {
					"date-parts": [
						[
							2016
						]
					]
				},
				"title": "Learning to Communicate with Deep Multi-Agent Reinforcement Learning",
				"type": "article-journal",
				"volume": "29"
			}
		},
		"https://ojs.aaai.org/index.php/AAAI/article/view/11371": {
			"fetched": "2021-06-25T14:49:42.555Z",
			"bibtex": [
				"",
				"@article{zheng_magent:_2018,",
				"   title = {{MAgent}: {A} {Many}-{Agent} {Reinforcement} {Learning} {Platform} for {Artificial} {Collective} {Intelligence}},",
				"   volume = {32},",
				"   copyright = {Copyright (c)},",
				"   issn = {2374-3468},",
				"   shorttitle = {{MAgent}},",
				"   url = {https://ojs.aaai.org/index.php/AAAI/article/view/11371},",
				"   language = {en},",
				"   number = {1},",
				"   urldate = {2021-06-25},",
				"   journal = {Proceedings of the AAAI Conference on Artificial Intelligence},",
				"   author = {Zheng, Lianmin and Yang, Jiacheng and Cai, Han and Zhou, Ming and Zhang, Weinan and Wang, Jun and Yu, Yong},",
				"   month = apr,",
				"   year = {2018},",
				"   keywords = {learning environment},",
				"}",
				""
			],
			"csl": {
				"ISSN": "2374-3468",
				"URL": "https://ojs.aaai.org/index.php/AAAI/article/view/11371",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							25
						]
					]
				},
				"author": [
					{
						"family": "Zheng",
						"given": "Lianmin"
					},
					{
						"family": "Yang",
						"given": "Jiacheng"
					},
					{
						"family": "Cai",
						"given": "Han"
					},
					{
						"family": "Zhou",
						"given": "Ming"
					},
					{
						"family": "Zhang",
						"given": "Weinan"
					},
					{
						"family": "Wang",
						"given": "Jun"
					},
					{
						"family": "Yu",
						"given": "Yong"
					}
				],
				"container-title": "Proceedings of the AAAI Conference on Artificial Intelligence",
				"id": "https://ojs.aaai.org/index.php/AAAI/article/view/11371",
				"issue": "1",
				"issued": {
					"date-parts": [
						[
							2018,
							4
						]
					]
				},
				"keyword": "learning environment",
				"title": "MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence",
				"title-short": "MAgent",
				"type": "article-journal",
				"volume": "32"
			}
		},
		"https://www.nature.com/articles/nature24270": {
			"fetched": "2021-06-25T14:49:46.064Z",
			"bibtex": [
				"",
				"@article{silver_mastering_2017,",
				"   title = {Mastering the game of {Go} without human knowledge},",
				"   volume = {550},",
				"   copyright = {2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.},",
				"   issn = {1476-4687},",
				"   url = {https://www.nature.com/articles/nature24270},",
				"   doi = {10.1038/nature24270},",
				"   abstract = {A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.},",
				"   language = {en},",
				"   number = {7676},",
				"   urldate = {2021-06-25},",
				"   journal = {Nature},",
				"   author = {Silver, David and Schrittwieser, Julian and Simonyan, Karen and Antonoglou, Ioannis and Huang, Aja and Guez, Arthur and Hubert, Thomas and Baker, Lucas and Lai, Matthew and Bolton, Adrian and Chen, Yutian and Lillicrap, Timothy and Hui, Fan and Sifre, Laurent and van den Driessche, George and Graepel, Thore and Hassabis, Demis},",
				"   month = oct,",
				"   year = {2017},",
				"   pages = {354--359},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1038/nature24270",
				"ISSN": "1476-4687",
				"URL": "https://www.nature.com/articles/nature24270",
				"abstract": "A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							25
						]
					]
				},
				"author": [
					{
						"family": "Silver",
						"given": "David"
					},
					{
						"family": "Schrittwieser",
						"given": "Julian"
					},
					{
						"family": "Simonyan",
						"given": "Karen"
					},
					{
						"family": "Antonoglou",
						"given": "Ioannis"
					},
					{
						"family": "Huang",
						"given": "Aja"
					},
					{
						"family": "Guez",
						"given": "Arthur"
					},
					{
						"family": "Hubert",
						"given": "Thomas"
					},
					{
						"family": "Baker",
						"given": "Lucas"
					},
					{
						"family": "Lai",
						"given": "Matthew"
					},
					{
						"family": "Bolton",
						"given": "Adrian"
					},
					{
						"family": "Chen",
						"given": "Yutian"
					},
					{
						"family": "Lillicrap",
						"given": "Timothy"
					},
					{
						"family": "Hui",
						"given": "Fan"
					},
					{
						"family": "Sifre",
						"given": "Laurent"
					},
					{
						"dropping-particle": "van den",
						"family": "Driessche",
						"given": "George"
					},
					{
						"family": "Graepel",
						"given": "Thore"
					},
					{
						"family": "Hassabis",
						"given": "Demis"
					}
				],
				"container-title": "Nature",
				"id": "https://www.nature.com/articles/nature24270",
				"issue": "7676",
				"issued": {
					"date-parts": [
						[
							2017,
							10
						]
					]
				},
				"page": "354-359",
				"title": "Mastering the game of Go without human knowledge",
				"type": "article-journal",
				"volume": "550"
			}
		},
		"https://ieeexplore.ieee.org/abstract/document/9137257": {
			"fetched": "2021-06-25T14:49:48.728Z",
			"bibtex": [
				"",
				"@article{chen_mean_2021,",
				"   title = {Mean {Field} {Deep} {Reinforcement} {Learning} for {Fair} and {Efficient} {UAV} {Control}},",
				"   volume = {8},",
				"   issn = {2327-4662},",
				"   url = {https://ieeexplore.ieee.org/abstract/document/9137257},",
				"   doi = {10.1109/JIOT.2020.3008299},",
				"   abstract = {Unmanned aerial vehicles (UAVs) can provide flexible network coverage services. UAVs can be applied in a large number of scenarios, such as emergency communication and network access in areas without terrestrial network coverage. However, UAVs are limited to relatively short communication range and restricted energy resources. In extreme conditions such as disasters, there may also be a problem that the communication bandwidth is limited and the UAV cannot communicate with the server with a large amount of information, so a decentralized solution is expected. In addition, the interaction between multiple objectives and multiple UAVs leads to a huge state space, which makes large-scale practical applications difficult. To simplify complex interactions, we modeled the UAV control problem with mean-field game (MFG). We propose a new UAV control method, the mean-field trust region policy optimization (MFTRPO), which uses the MFG method to construct the Hamilton-Jacobi-Bellman/Fokker-Planck-Kolmogorov equation that obtains the optimal solution and solves the difficulties in the practical application through the trust region policy optimization and neural network feature embedding methods. The proposed method: 1) maximizes communication efficiency while ensuring fair communication range and network connectivity; 2) fuses the mean-field theory with deep reinforcement learning techniques; and 3) is scalable and adaptive. We conduct extensive simulations for performance evaluation. The simulation results have shown that MFTRPO significantly and consistently outperforms two commonly used baseline methods in terms of coverage, fairness, and energy consumption.},",
				"   number = {2},",
				"   urldate = {2021-06-25},",
				"   journal = {IEEE Internet of Things Journal},",
				"   author = {Chen, Dezhi and Qi, Qi and Zhuang, Zirui and Wang, Jingyu and Liao, Jianxin and Han, Zhu},",
				"   month = jan,",
				"   year = {2021},",
				"   keywords = {Unmanned aerial vehicles, Internet of Things, Aerospace electronics, Games, Energy consumption, Mathematical model, Reinforcement learning, Mean field, multiagent deep reinforcement learning (DRL), trust region policy optimization (TRPO), unmanned aerial vehicle (UAV)},",
				"   pages = {813--828},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1109/JIOT.2020.3008299",
				"ISSN": "2327-4662",
				"URL": "https://ieeexplore.ieee.org/abstract/document/9137257",
				"abstract": "Unmanned aerial vehicles (UAVs) can provide flexible network coverage services. UAVs can be applied in a large number of scenarios, such as emergency communication and network access in areas without terrestrial network coverage. However, UAVs are limited to relatively short communication range and restricted energy resources. In extreme conditions such as disasters, there may also be a problem that the communication bandwidth is limited and the UAV cannot communicate with the server with a large amount of information, so a decentralized solution is expected. In addition, the interaction between multiple objectives and multiple UAVs leads to a huge state space, which makes large-scale practical applications difficult. To simplify complex interactions, we modeled the UAV control problem with mean-field game (MFG). We propose a new UAV control method, the mean-field trust region policy optimization (MFTRPO), which uses the MFG method to construct the Hamilton-Jacobi-Bellman/Fokker-Planck-Kolmogorov equation that obtains the optimal solution and solves the difficulties in the practical application through the trust region policy optimization and neural network feature embedding methods. The proposed method: 1) maximizes communication efficiency while ensuring fair communication range and network connectivity; 2) fuses the mean-field theory with deep reinforcement learning techniques; and 3) is scalable and adaptive. We conduct extensive simulations for performance evaluation. The simulation results have shown that MFTRPO significantly and consistently outperforms two commonly used baseline methods in terms of coverage, fairness, and energy consumption.",
				"accessed": {
					"date-parts": [
						[
							2021,
							6,
							25
						]
					]
				},
				"author": [
					{
						"family": "Chen",
						"given": "Dezhi"
					},
					{
						"family": "Qi",
						"given": "Qi"
					},
					{
						"family": "Zhuang",
						"given": "Zirui"
					},
					{
						"family": "Wang",
						"given": "Jingyu"
					},
					{
						"family": "Liao",
						"given": "Jianxin"
					},
					{
						"family": "Han",
						"given": "Zhu"
					}
				],
				"container-title": "IEEE Internet of Things Journal",
				"id": "https://ieeexplore.ieee.org/abstract/document/9137257",
				"issue": "2",
				"issued": {
					"date-parts": [
						[
							2021,
							1
						]
					]
				},
				"keyword": "Unmanned aerial vehicles, Internet of Things, Aerospace electronics, Games, Energy consumption, Mathematical model, Reinforcement learning, Mean field, multiagent deep reinforcement learning (DRL), trust region policy optimization (TRPO), unmanned aerial vehicle (UAV)",
				"page": "813-828",
				"title": "Mean Field Deep Reinforcement Learning for Fair and Efficient UAV Control",
				"type": "article-journal",
				"volume": "8"
			}
		},
		"https://ieeexplore.ieee.org/document/976029": {
			"fetched": "2021-07-05T12:07:42.787Z",
			"bibtex": [
				"",
				"@article{egerstedt_formation_2001,",
				"   title = {Formation constrained multi-agent control},",
				"   volume = {17},",
				"   issn = {2374-958X},",
				"   url = {https://ieeexplore.ieee.org/document/976029},",
				"   doi = {10.1109/70.976029},",
				"   abstract = {We propose a model independent coordination strategy for multi-agent formation control. The main theorem states that under a bounded tracking error assumption, our method stabilizes the formation error. We illustrate the usefulness of the method by applying it to rigid body constrained motions.},",
				"   number = {6},",
				"   urldate = {2021-07-05},",
				"   journal = {IEEE Transactions on Robotics and Automation},",
				"   author = {Egerstedt, M. and Hu, Xiaoming},",
				"   month = dec,",
				"   year = {2001},",
				"   keywords = {Robot kinematics, Mobile robots, Stability, Robot control, Trajectory, Navigation, Robustness, Manufacturing, Distributed control, Vehicle dynamics},",
				"   pages = {947--951},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1109/70.976029",
				"ISSN": "2374-958X",
				"URL": "https://ieeexplore.ieee.org/document/976029",
				"abstract": "We propose a model independent coordination strategy for multi-agent formation control. The main theorem states that under a bounded tracking error assumption, our method stabilizes the formation error. We illustrate the usefulness of the method by applying it to rigid body constrained motions.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							5
						]
					]
				},
				"author": [
					{
						"family": "Egerstedt",
						"given": "M."
					},
					{
						"family": "Hu",
						"given": "Xiaoming"
					}
				],
				"container-title": "IEEE Transactions on Robotics and Automation",
				"id": "https://ieeexplore.ieee.org/document/976029",
				"issue": "6",
				"issued": {
					"date-parts": [
						[
							2001,
							12
						]
					]
				},
				"keyword": "Robot kinematics, Mobile robots, Stability, Robot control, Trajectory, Navigation, Robustness, Manufacturing, Distributed control, Vehicle dynamics",
				"page": "947-951",
				"title": "Formation constrained multi-agent control",
				"type": "article-journal",
				"volume": "17"
			}
		},
		"https://www.sciencedirect.com/science/article/abs/pii/S0005109816301911": {
			"fetched": "2021-07-05T13:38:06.005Z",
			"bibtex": [
				"",
				"@article{zhou_cooperative_2016,",
				"   title = {Cooperative pursuit with {Voronoi} partitions},",
				"   volume = {72},",
				"   issn = {0005-1098},",
				"   url = {https://www.sciencedirect.com/science/article/pii/S0005109816301911},",
				"   doi = {10.1016/j.automatica.2016.05.007},",
				"   abstract = {This work considers a pursuit–evasion game in which a number of pursuers are attempting to capture a single evader. Cooperation among multiple agents can be difficult to achieve, as it may require the selection of actions in the joint input space of all agents. This work presents a decentralized, real-time algorithm for cooperative pursuit of a single evader by multiple pursuers in bounded, simply-connected planar domains. The algorithm is based on minimizing the area of the generalized Voronoi partition of the evader. The pursuers share state information but compute their inputs independently. No assumptions are made about the evader’s control strategies other than requiring the evader control inputs to conform to a speed limit. Proof of guaranteed capture is shown when the domain is convex and the players’ motion models are kinematic. Simulation results are presented showing the efficiency and effectiveness of this strategy.},",
				"   language = {en},",
				"   urldate = {2021-07-05},",
				"   journal = {Automatica},",
				"   author = {Zhou, Zhengyuan and Zhang, Wei and Ding, Jerry and Huang, Haomiao and Stipanović, Dušan M. and Tomlin, Claire J.},",
				"   month = oct,",
				"   year = {2016},",
				"   keywords = {Pursuit–evasion games, Voronoi, Cooperative pursuit},",
				"   pages = {64--72},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1016/j.automatica.2016.05.007",
				"ISSN": "0005-1098",
				"URL": "https://www.sciencedirect.com/science/article/pii/S0005109816301911",
				"abstract": "This work considers a pursuit–evasion game in which a number of pursuers are attempting to capture a single evader. Cooperation among multiple agents can be difficult to achieve, as it may require the selection of actions in the joint input space of all agents. This work presents a decentralized, real-time algorithm for cooperative pursuit of a single evader by multiple pursuers in bounded, simply-connected planar domains. The algorithm is based on minimizing the area of the generalized Voronoi partition of the evader. The pursuers share state information but compute their inputs independently. No assumptions are made about the evader’s control strategies other than requiring the evader control inputs to conform to a speed limit. Proof of guaranteed capture is shown when the domain is convex and the players’ motion models are kinematic. Simulation results are presented showing the efficiency and effectiveness of this strategy.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							5
						]
					]
				},
				"author": [
					{
						"family": "Zhou",
						"given": "Zhengyuan"
					},
					{
						"family": "Zhang",
						"given": "Wei"
					},
					{
						"family": "Ding",
						"given": "Jerry"
					},
					{
						"family": "Huang",
						"given": "Haomiao"
					},
					{
						"family": "Stipanović",
						"given": "Dušan M."
					},
					{
						"family": "Tomlin",
						"given": "Claire J."
					}
				],
				"container-title": "Automatica",
				"id": "https://www.sciencedirect.com/science/article/abs/pii/S0005109816301911",
				"issued": {
					"date-parts": [
						[
							2016,
							10
						]
					]
				},
				"keyword": "Pursuit–evasion games, Voronoi, Cooperative pursuit",
				"page": "64-72",
				"title": "Cooperative pursuit with Voronoi partitions",
				"type": "article-journal",
				"volume": "72"
			}
		},
		"https://dl.acm.org/doi/10.1145/3292500.3330701": {
			"fetched": "2021-07-07T10:28:02.599Z",
			"bibtex": [
				"",
				"@inproceedings{akiba_optuna:_2019,",
				"   address = {Anchorage, AK, USA},",
				"   series = {{KDD} '19},",
				"   title = {Optuna: {A} {Next}-generation {Hyperparameter} {Optimization} {Framework}},",
				"   isbn = {9781450362016},",
				"   shorttitle = {Optuna},",
				"   url = {https://doi.org/10.1145/3292500.3330701},",
				"   doi = {10.1145/3292500.3330701},",
				"   abstract = {The purpose of this study is to introduce new design-criteria for next-generation hyperparameter optimization software. The criteria we propose include (1) define-by-run API that allows users to construct the parameter search space dynamically, (2) efficient implementation of both searching and pruning strategies, and (3) easy-to-setup, versatile architecture that can be deployed for various purposes, ranging from scalable distributed computing to light-weight experiment conducted via interactive interface. In order to prove our point, we will introduce Optuna, an optimization software which is a culmination of our effort in the development of a next generation optimization software. As an optimization software designed with define-by-run principle, Optuna is particularly the first of its kind. We will present the design-techniques that became necessary in the development of the software that meets the above criteria, and demonstrate the power of our new design through experimental results and real world applications. Our software is available under the MIT license (https://github.com/pfnet/optuna/).},",
				"   urldate = {2021-07-07},",
				"   booktitle = {Proceedings of the 25th {ACM} {SIGKDD} {International} {Conference} on {Knowledge} {Discovery} \\& {Data} {Mining}},",
				"   publisher = {Association for Computing Machinery},",
				"   author = {Akiba, Takuya and Sano, Shotaro and Yanase, Toshihiko and Ohta, Takeru and Koyama, Masanori},",
				"   month = jul,",
				"   year = {2019},",
				"   keywords = {black-box optimization, hyperparameter optimization, Bayesian optimization, machine learning system},",
				"   pages = {2623--2631},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1145/3292500.3330701",
				"ISBN": "9781450362016",
				"URL": "https://doi.org/10.1145/3292500.3330701",
				"abstract": "The purpose of this study is to introduce new design-criteria for next-generation hyperparameter optimization software. The criteria we propose include (1) define-by-run API that allows users to construct the parameter search space dynamically, (2) efficient implementation of both searching and pruning strategies, and (3) easy-to-setup, versatile architecture that can be deployed for various purposes, ranging from scalable distributed computing to light-weight experiment conducted via interactive interface. In order to prove our point, we will introduce Optuna, an optimization software which is a culmination of our effort in the development of a next generation optimization software. As an optimization software designed with define-by-run principle, Optuna is particularly the first of its kind. We will present the design-techniques that became necessary in the development of the software that meets the above criteria, and demonstrate the power of our new design through experimental results and real world applications. Our software is available under the MIT license (https://github.com/pfnet/optuna/).",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							7
						]
					]
				},
				"author": [
					{
						"family": "Akiba",
						"given": "Takuya"
					},
					{
						"family": "Sano",
						"given": "Shotaro"
					},
					{
						"family": "Yanase",
						"given": "Toshihiko"
					},
					{
						"family": "Ohta",
						"given": "Takeru"
					},
					{
						"family": "Koyama",
						"given": "Masanori"
					}
				],
				"collection-title": "KDD ’19",
				"container-title": "Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining",
				"id": "https://dl.acm.org/doi/10.1145/3292500.3330701",
				"issued": {
					"date-parts": [
						[
							2019,
							7
						]
					]
				},
				"keyword": "black-box optimization, hyperparameter optimization, Bayesian optimization, machine learning system",
				"page": "2623-2631",
				"publisher": "Association for Computing Machinery",
				"publisher-place": "Anchorage, AK, USA",
				"title": "Optuna: A Next-generation Hyperparameter Optimization Framework",
				"title-short": "Optuna",
				"type": "paper-conference"
			}
		},
		"https://arxiv.org/abs/1703.06182": {
			"fetched": "2021-07-09T18:06:24.872Z",
			"bibtex": [
				"",
				"@article{omidshafiei_deep_2017,",
				"   title = {Deep {Decentralized} {Multi}-task {Multi}-{Agent} {Reinforcement} {Learning} under {Partial} {Observability}},",
				"   url = {http://arxiv.org/abs/1703.06182},",
				"   abstract = {Many real-world tasks involve multiple agents with partial observability and limited communication. Learning is challenging in these settings due to local viewpoints of agents, which perceive the world as non-stationary due to concurrently-exploring teammates. Approaches that learn specialized policies for individual tasks face problems when applied to the real world: not only do agents have to learn and store distinct policies for each task, but in practice identities of tasks are often non-observable, making these approaches inapplicable. This paper formalizes and addresses the problem of multi-task multi-agent reinforcement learning under partial observability. We introduce a decentralized single-task learning approach that is robust to concurrent interactions of teammates, and present an approach for distilling single-task policies into a unified policy that performs well across multiple related tasks, without explicit provision of task identity.},",
				"   urldate = {2021-07-09},",
				"   journal = {arXiv:1703.06182 [cs]},",
				"   author = {Omidshafiei, Shayegan and Pazis, Jason and Amato, Christopher and How, Jonathan P. and Vian, John},",
				"   month = jul,",
				"   year = {2017},",
				"   note = {arXiv: 1703.06182},",
				"   keywords = {Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1703.06182",
				"abstract": "Many real-world tasks involve multiple agents with partial observability and limited communication. Learning is challenging in these settings due to local viewpoints of agents, which perceive the world as non-stationary due to concurrently-exploring teammates. Approaches that learn specialized policies for individual tasks face problems when applied to the real world: not only do agents have to learn and store distinct policies for each task, but in practice identities of tasks are often non-observable, making these approaches inapplicable. This paper formalizes and addresses the problem of multi-task multi-agent reinforcement learning under partial observability. We introduce a decentralized single-task learning approach that is robust to concurrent interactions of teammates, and present an approach for distilling single-task policies into a unified policy that performs well across multiple related tasks, without explicit provision of task identity.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							9
						]
					]
				},
				"author": [
					{
						"family": "Omidshafiei",
						"given": "Shayegan"
					},
					{
						"family": "Pazis",
						"given": "Jason"
					},
					{
						"family": "Amato",
						"given": "Christopher"
					},
					{
						"family": "How",
						"given": "Jonathan P."
					},
					{
						"family": "Vian",
						"given": "John"
					}
				],
				"container-title": "arXiv:1703.06182 [cs]",
				"id": "https://arxiv.org/abs/1703.06182",
				"issued": {
					"date-parts": [
						[
							2017,
							7
						]
					]
				},
				"keyword": "Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems",
				"note": "arXiv: 1703.06182",
				"title": "Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability",
				"type": "article-journal"
			}
		},
		"https://arxiv.org/abs/2009.14471": {
			"fetched": "2021-07-09T18:17:34.148Z",
			"bibtex": [
				"",
				"@article{terry_pettingzoo:_2021,",
				"   title = {{PettingZoo}: {Gym} for {Multi}-{Agent} {Reinforcement} {Learning}},",
				"   shorttitle = {{PettingZoo}},",
				"   url = {http://arxiv.org/abs/2009.14471},",
				"   abstract = {This paper introduces the PettingZoo library and the accompanying Agent Environment Cycle (\"AEC\") games model. PettingZoo is a library of diverse sets of multi-agent environments with a universal, elegant Python API. PettingZoo was developed with the goal of accelerating research in Multi-Agent Reinforcement Learning (\"MARL\"), by making work more interchangeable, accessible and reproducible akin to what OpenAI's Gym library did for single-agent reinforcement learning. PettingZoo's API, while inheriting many features of Gym, is unique amongst MARL APIs in that it's based around the novel AEC games model. We argue, in part through case studies on major problems in popular MARL environments, that the popular game models are poor conceptual models of the games commonly used with MARL, that they promote severe bugs that are hard to detect, and that the AEC games model addresses these problems.},",
				"   urldate = {2021-07-09},",
				"   journal = {arXiv:2009.14471 [cs, stat]},",
				"   author = {Terry, J. K. and Black, Benjamin and Grammel, Nathaniel and Jayakumar, Mario and Hari, Ananth and Sullivan, Ryan and Santos, Luis and Perez, Rodrigo and Horsch, Caroline and Dieffendahl, Clemens and Williams, Niall L. and Lokesh, Yashas and Ravi, Praveen},",
				"   month = jun,",
				"   year = {2021},",
				"   note = {arXiv: 2009.14471},",
				"   keywords = {Computer Science - Machine Learning, Computer Science - Multiagent Systems, Statistics - Machine Learning},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/2009.14471",
				"abstract": "This paper introduces the PettingZoo library and the accompanying Agent Environment Cycle (\"AEC\") games model. PettingZoo is a library of diverse sets of multi-agent environments with a universal, elegant Python API. PettingZoo was developed with the goal of accelerating research in Multi-Agent Reinforcement Learning (\"MARL\"), by making work more interchangeable, accessible and reproducible akin to what OpenAI’s Gym library did for single-agent reinforcement learning. PettingZoo’s API, while inheriting many features of Gym, is unique amongst MARL APIs in that it’s based around the novel AEC games model. We argue, in part through case studies on major problems in popular MARL environments, that the popular game models are poor conceptual models of the games commonly used with MARL, that they promote severe bugs that are hard to detect, and that the AEC games model addresses these problems.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							9
						]
					]
				},
				"author": [
					{
						"family": "Terry",
						"given": "J. K."
					},
					{
						"family": "Black",
						"given": "Benjamin"
					},
					{
						"family": "Grammel",
						"given": "Nathaniel"
					},
					{
						"family": "Jayakumar",
						"given": "Mario"
					},
					{
						"family": "Hari",
						"given": "Ananth"
					},
					{
						"family": "Sullivan",
						"given": "Ryan"
					},
					{
						"family": "Santos",
						"given": "Luis"
					},
					{
						"family": "Perez",
						"given": "Rodrigo"
					},
					{
						"family": "Horsch",
						"given": "Caroline"
					},
					{
						"family": "Dieffendahl",
						"given": "Clemens"
					},
					{
						"family": "Williams",
						"given": "Niall L."
					},
					{
						"family": "Lokesh",
						"given": "Yashas"
					},
					{
						"family": "Ravi",
						"given": "Praveen"
					}
				],
				"container-title": "arXiv:2009.14471 [cs, stat]",
				"id": "https://arxiv.org/abs/2009.14471",
				"issued": {
					"date-parts": [
						[
							2021,
							6
						]
					]
				},
				"keyword": "Computer Science - Machine Learning, Computer Science - Multiagent Systems, Statistics - Machine Learning",
				"note": "arXiv: 2009.14471",
				"title": "PettingZoo: Gym for Multi-Agent Reinforcement Learning",
				"title-short": "PettingZoo",
				"type": "article-journal"
			}
		},
		"http://proceedings.mlr.press/v97/iqbal19a.html": {
			"fetched": "2021-07-09T20:04:44.342Z",
			"bibtex": [
				"",
				"@inproceedings{iqbal_actor-attention-critic_2019,",
				"   title = {Actor-{Attention}-{Critic} for {Multi}-{Agent} {Reinforcement} {Learning}},",
				"   url = {http://proceedings.mlr.press/v97/iqbal19a.html},",
				"   abstract = {Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in single-agent settings. We present an actor-critic algorithm tha...},",
				"   language = {en},",
				"   urldate = {2021-07-09},",
				"   booktitle = {International {Conference} on {Machine} {Learning}},",
				"   publisher = {PMLR},",
				"   author = {Iqbal, Shariq and Sha, Fei},",
				"   month = may,",
				"   year = {2019},",
				"   pages = {2961--2970},",
				"}",
				""
			],
			"csl": {
				"URL": "http://proceedings.mlr.press/v97/iqbal19a.html",
				"abstract": "Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in single-agent settings. We present an actor-critic algorithm tha...",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							9
						]
					]
				},
				"author": [
					{
						"family": "Iqbal",
						"given": "Shariq"
					},
					{
						"family": "Sha",
						"given": "Fei"
					}
				],
				"container-title": "International Conference on Machine Learning",
				"id": "http://proceedings.mlr.press/v97/iqbal19a.html",
				"issued": {
					"date-parts": [
						[
							2019,
							5
						]
					]
				},
				"page": "2961-2970",
				"publisher": "PMLR",
				"title": "Actor-Attention-Critic for Multi-Agent Reinforcement Learning",
				"type": "paper-conference"
			}
		},
		"https://www.mdpi.com/2076-3417/11/11/4948": {
			"fetched": "2021-07-10T10:01:16.914Z",
			"bibtex": [
				"",
				"@article{canese_multi-agent_2021,",
				"   title = {Multi-{Agent} {Reinforcement} {Learning}: {A} {Review} of {Challenges} and {Applications}},",
				"   volume = {11},",
				"   copyright = {http://creativecommons.org/licenses/by/3.0/},",
				"   shorttitle = {Multi-{Agent} {Reinforcement} {Learning}},",
				"   url = {https://www.mdpi.com/2076-3417/11/11/4948},",
				"   doi = {10.3390/app11114948},",
				"   abstract = {In this review, we present an analysis of the most used multi-agent reinforcement learning algorithms. Starting with the single-agent reinforcement learning algorithms, we focus on the most critical issues that must be taken into account in their extension to multi-agent scenarios. The analyzed algorithms were grouped according to their features. We present a detailed taxonomy of the main multi-agent approaches proposed in the literature, focusing on their related mathematical models. For each algorithm, we describe the possible application fields, while pointing out its pros and cons. The described multi-agent algorithms are compared in terms of the most important characteristics for multi-agent reinforcement learning applications—namely, nonstationarity, scalability, and observability. We also describe the most common benchmark environments used to evaluate the performances of the considered methods.},",
				"   language = {en},",
				"   number = {11},",
				"   urldate = {2021-07-10},",
				"   journal = {Applied Sciences},",
				"   author = {Canese, Lorenzo and Cardarilli, Gian Carlo and Di Nunzio, Luca and Fazzolari, Rocco and Giardino, Daniele and Re, Marco and Spanò, Sergio},",
				"   month = jan,",
				"   year = {2021},",
				"   keywords = {machine learning, reinforcement learning, multi-agent, swarm},",
				"   pages = {4948},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.3390/app11114948",
				"URL": "https://www.mdpi.com/2076-3417/11/11/4948",
				"abstract": "In this review, we present an analysis of the most used multi-agent reinforcement learning algorithms. Starting with the single-agent reinforcement learning algorithms, we focus on the most critical issues that must be taken into account in their extension to multi-agent scenarios. The analyzed algorithms were grouped according to their features. We present a detailed taxonomy of the main multi-agent approaches proposed in the literature, focusing on their related mathematical models. For each algorithm, we describe the possible application fields, while pointing out its pros and cons. The described multi-agent algorithms are compared in terms of the most important characteristics for multi-agent reinforcement learning applications—namely, nonstationarity, scalability, and observability. We also describe the most common benchmark environments used to evaluate the performances of the considered methods.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							10
						]
					]
				},
				"author": [
					{
						"family": "Canese",
						"given": "Lorenzo"
					},
					{
						"family": "Cardarilli",
						"given": "Gian Carlo"
					},
					{
						"family": "Di Nunzio",
						"given": "Luca"
					},
					{
						"family": "Fazzolari",
						"given": "Rocco"
					},
					{
						"family": "Giardino",
						"given": "Daniele"
					},
					{
						"family": "Re",
						"given": "Marco"
					},
					{
						"family": "Spanò",
						"given": "Sergio"
					}
				],
				"container-title": "Applied Sciences",
				"id": "https://www.mdpi.com/2076-3417/11/11/4948",
				"issue": "11",
				"issued": {
					"date-parts": [
						[
							2021,
							1
						]
					]
				},
				"keyword": "machine learning, reinforcement learning, multi-agent, swarm",
				"page": "4948",
				"title": "Multi-Agent Reinforcement Learning: A Review of Challenges and Applications",
				"title-short": "Multi-Agent Reinforcement Learning",
				"type": "article-journal",
				"volume": "11"
			}
		},
		"https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193": {
			"fetched": "2021-07-10T10:01:18.549Z",
			"bibtex": [
				"",
				"@inproceedings{noauthor_foerster_nodate,",
				"   title = {Counterfactual {Multi}-{Agent} {Policy} {Gradients}},",
				"   url = {https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193},",
				"   urldate = {2021-07-10},",
				"   booktitle = {Proceedings of the {AAAI} {Conference} on {Artificial} {Intelligence}},",
				"   author = {Foerster, Jakob and Farquhar, Gregory and Afouras, Triantafyllos and Nardelli, Nantas and Whiteson, Shimon},",
				"   year = {2018},",
				"}",
				""
			],
			"csl": {
				"URL": "https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							10
						]
					]
				},
				"author": [
					{
						"family": "Foerster",
						"given": "Jakob"
					},
					{
						"family": "Farquhar",
						"given": "Gregory"
					},
					{
						"family": "Afouras",
						"given": "Triantafyllos"
					},
					{
						"family": "Nardelli",
						"given": "Nantas"
					},
					{
						"family": "Whiteson",
						"given": "Shimon"
					}
				],
				"container-title": "Proceedings of the AAAI Conference on Artificial Intelligence",
				"id": "https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193",
				"issued": {
					"date-parts": [
						[
							2018
						]
					]
				},
				"title": "Counterfactual Multi-Agent Policy Gradients",
				"type": "paper-conference"
			}
		},
		"https://ieeexplore.ieee.org/document/4399095": {
			"fetched": "2021-07-10T10:19:58.560Z",
			"bibtex": [
				"",
				"@inproceedings{matignon_hysteretic_2007,",
				"   title = {Hysteretic {Q}-learning: an algorithm for {Decentralized} {Reinforcement} {Learning} in {Cooperative} {Multi}-{Agent} {Teams}},",
				"   shorttitle = {Hysteretic {Q}-learning},",
				"   url = {https://ieeexplore.ieee.org/document/4399095},",
				"   doi = {10.1109/IROS.2007.4399095},",
				"   abstract = {Multi-agent systems (MAS) are a field of study of growing interest in a variety of domains such as robotics or distributed controls. The article focuses on decentralized reinforcement learning (RL) in cooperative MAS, where a team of independent learning robots (IL) try to coordinate their individual behavior to reach a coherent joint behavior. We assume that each robot has no information about its teammates' actions. To date, RL approaches for such ILs did not guarantee convergence to the optimal joint policy in scenarios where the coordination is difficult. We report an investigation of existing algorithms for the learning of coordination in cooperative MAS, and suggest a Q-learning extension for ILs, called hysteretic Q-learning. This algorithm does not require any additional communication between robots. Its advantages are showing off and compared to other methods on various applications: bi-matrix games, collaborative ball balancing task and pursuit domain.},",
				"   urldate = {2021-07-10},",
				"   booktitle = {2007 {IEEE}/{RSJ} {International} {Conference} on {Intelligent} {Robots} and {Systems}},",
				"   author = {Matignon, Laetitia and Laurent, Guillaume J. and Le Fort-Piat, Nadine},",
				"   month = oct,",
				"   year = {2007},",
				"   note = {ISSN: 2153-0866},",
				"   keywords = {Hysteresis, Learning, Robot kinematics, Multiagent systems, Distributed control, Convergence, Stochastic processes, Game theory, Intelligent robots, USA Councils},",
				"   pages = {64--69},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1109/IROS.2007.4399095",
				"URL": "https://ieeexplore.ieee.org/document/4399095",
				"abstract": "Multi-agent systems (MAS) are a field of study of growing interest in a variety of domains such as robotics or distributed controls. The article focuses on decentralized reinforcement learning (RL) in cooperative MAS, where a team of independent learning robots (IL) try to coordinate their individual behavior to reach a coherent joint behavior. We assume that each robot has no information about its teammates’ actions. To date, RL approaches for such ILs did not guarantee convergence to the optimal joint policy in scenarios where the coordination is difficult. We report an investigation of existing algorithms for the learning of coordination in cooperative MAS, and suggest a Q-learning extension for ILs, called hysteretic Q-learning. This algorithm does not require any additional communication between robots. Its advantages are showing off and compared to other methods on various applications: bi-matrix games, collaborative ball balancing task and pursuit domain.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							10
						]
					]
				},
				"author": [
					{
						"family": "Matignon",
						"given": "Laetitia"
					},
					{
						"family": "Laurent",
						"given": "Guillaume J."
					},
					{
						"family": "Le Fort-Piat",
						"given": "Nadine"
					}
				],
				"container-title": "2007 IEEE/RSJ International Conference on Intelligent Robots and Systems",
				"id": "https://ieeexplore.ieee.org/document/4399095",
				"issued": {
					"date-parts": [
						[
							2007,
							10
						]
					]
				},
				"keyword": "Hysteresis, Learning, Robot kinematics, Multiagent systems, Distributed control, Convergence, Stochastic processes, Game theory, Intelligent robots, USA Councils",
				"note": "ISSN: 2153-0866",
				"page": "64-69",
				"title": "Hysteretic Q-learning: An algorithm for Decentralized Reinforcement Learning in Cooperative Multi-Agent Teams",
				"title-short": "Hysteretic Q-learning",
				"type": "paper-conference"
			}
		},
		"https://link.springer.com/article/10.1007/s11537-007-0657-8": {
			"fetched": "2021-07-10T10:40:45.133Z",
			"bibtex": [
				"",
				"@article{lasry_mean_2007,",
				"   title = {Mean field games},",
				"   volume = {2},",
				"   issn = {1861-3624},",
				"   url = {https://doi.org/10.1007/s11537-007-0657-8},",
				"   doi = {10.1007/s11537-007-0657-8},",
				"   abstract = {We survey here some recent studies concerning what we call mean-field models by analogy with Statistical Mechanics and Physics. More precisely, we present three examples of our mean-field approach to modelling in Economics and Finance (or other related subjects...). Roughly speaking, we are concerned with situations that involve a very large number of “rational players” with a limited information (or visibility) on the “game”. Each player chooses his optimal strategy in view of the global (or macroscopic) informations that are available to him and that result from the actions of all players. In the three examples we mention here, we derive a mean-field problem which consists in nonlinear differential equations. These equations are of a new type and our main goal here is to study them and establish their links with various fields of Analysis. We show in particular that these nonlinear problems are essentially well-posed problems i.e., have unique solutions. In addition, we give various limiting cases, examples and possible extensions. And we mention many open problems.},",
				"   language = {en},",
				"   number = {1},",
				"   urldate = {2021-07-10},",
				"   journal = {Japanese Journal of Mathematics},",
				"   author = {Lasry, Jean-Michel and Lions, Pierre-Louis},",
				"   month = mar,",
				"   year = {2007},",
				"   pages = {229--260},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1007/s11537-007-0657-8",
				"ISSN": "1861-3624",
				"URL": "https://doi.org/10.1007/s11537-007-0657-8",
				"abstract": "We survey here some recent studies concerning what we call mean-field models by analogy with Statistical Mechanics and Physics. More precisely, we present three examples of our mean-field approach to modelling in Economics and Finance (or other related subjects...). Roughly speaking, we are concerned with situations that involve a very large number of “rational players” with a limited information (or visibility) on the “game”. Each player chooses his optimal strategy in view of the global (or macroscopic) informations that are available to him and that result from the actions of all players. In the three examples we mention here, we derive a mean-field problem which consists in nonlinear differential equations. These equations are of a new type and our main goal here is to study them and establish their links with various fields of Analysis. We show in particular that these nonlinear problems are essentially well-posed problems i.e., have unique solutions. In addition, we give various limiting cases, examples and possible extensions. And we mention many open problems.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							10
						]
					]
				},
				"author": [
					{
						"family": "Lasry",
						"given": "Jean-Michel"
					},
					{
						"family": "Lions",
						"given": "Pierre-Louis"
					}
				],
				"container-title": "Japanese Journal of Mathematics",
				"id": "https://link.springer.com/article/10.1007/s11537-007-0657-8",
				"issue": "1",
				"issued": {
					"date-parts": [
						[
							2007,
							3
						]
					]
				},
				"page": "229-260",
				"title": "Mean field games",
				"type": "article-journal",
				"volume": "2"
			}
		},
		"https://link.springer.com/chapter/10.1007/978-3-540-28650-9_4": {
			"fetched": "2021-07-10T10:40:47.569Z",
			"bibtex": [
				"",
				"@incollection{rasmussen_gaussian_2004,",
				"   address = {Berlin, Heidelberg},",
				"   series = {Lecture {Notes} in {Computer} {Science}},",
				"   title = {Gaussian {Processes} in {Machine} {Learning}},",
				"   isbn = {9783540286509},",
				"   url = {https://doi.org/10.1007/978-3-540-28650-9_4},",
				"   abstract = {We give a basic introduction to Gaussian Process regression models. We focus on understanding the role of the stochastic process and how it is used to define a distribution over functions. We present the simple equations for incorporating training data and examine how to learn the hyperparameters using the marginal likelihood. We explain the practical advantages of Gaussian Process and end with conclusions and a look at the current trends in GP work.},",
				"   language = {en},",
				"   urldate = {2021-07-10},",
				"   booktitle = {Advanced {Lectures} on {Machine} {Learning}: {ML} {Summer} {Schools} 2003, {Canberra}, {Australia}, {February} 2 - 14, 2003, {Tübingen}, {Germany}, {August} 4 - 16, 2003, {Revised} {Lectures}},",
				"   publisher = {Springer},",
				"   author = {Rasmussen, Carl Edward},",
				"   editor = {Bousquet, Olivier and von Luxburg, Ulrike and Rätsch, Gunnar},",
				"   year = {2004},",
				"   doi = {10.1007/978-3-540-28650-9_4},",
				"   keywords = {Covariance Function, Gaussian Process, Marginal Likelihood, Posterior Variance, Joint Gaussian Distribution},",
				"   pages = {63--71},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1007/978-3-540-28650-9_4",
				"ISBN": "9783540286509",
				"URL": "https://doi.org/10.1007/978-3-540-28650-9_4",
				"abstract": "We give a basic introduction to Gaussian Process regression models. We focus on understanding the role of the stochastic process and how it is used to define a distribution over functions. We present the simple equations for incorporating training data and examine how to learn the hyperparameters using the marginal likelihood. We explain the practical advantages of Gaussian Process and end with conclusions and a look at the current trends in GP work.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							10
						]
					]
				},
				"author": [
					{
						"family": "Rasmussen",
						"given": "Carl Edward"
					}
				],
				"collection-title": "Lecture Notes in Computer Science",
				"container-title": "Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2 - 14, 2003, Tübingen, Germany, August 4 - 16, 2003, Revised Lectures",
				"editor": [
					{
						"family": "Bousquet",
						"given": "Olivier"
					},
					{
						"dropping-particle": "von",
						"family": "Luxburg",
						"given": "Ulrike"
					},
					{
						"family": "Rätsch",
						"given": "Gunnar"
					}
				],
				"id": "https://link.springer.com/chapter/10.1007/978-3-540-28650-9_4",
				"issued": {
					"date-parts": [
						[
							2004
						]
					]
				},
				"keyword": "Covariance Function, Gaussian Process, Marginal Likelihood, Posterior Variance, Joint Gaussian Distribution",
				"page": "63-71",
				"publisher": "Springer",
				"publisher-place": "Berlin, Heidelberg",
				"title": "Gaussian Processes in Machine Learning",
				"type": "chapter"
			}
		},
		"https://dl.acm.org/doi/10.5555/3295222.3295385": {
			"fetched": "2021-07-12T14:03:50.952Z",
			"bibtex": [
				"",
				"@inproceedings{lowe_multi-agent_2017,",
				"   address = {Long Beach, California, USA},",
				"   series = {{NIPS}'17},",
				"   title = {Multi-agent actor-critic for mixed cooperative-competitive environments},",
				"   isbn = {9781510860964},",
				"   url = {https://dl.acm.org/doi/10.5555/3295222.3295385},",
				"   abstract = {We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.},",
				"   urldate = {2021-07-12},",
				"   booktitle = {Proceedings of the 31st {International} {Conference} on {Neural} {Information} {Processing} {Systems}},",
				"   publisher = {Curran Associates Inc.},",
				"   author = {Lowe, Ryan and Wu, Yi and Tamar, Aviv and Harb, Jean and Abbeel, Pieter and Mordatch, Igor},",
				"   month = dec,",
				"   year = {2017},",
				"   pages = {6382--6393},",
				"}",
				""
			],
			"csl": {
				"ISBN": "9781510860964",
				"URL": "https://dl.acm.org/doi/10.5555/3295222.3295385",
				"abstract": "We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							12
						]
					]
				},
				"author": [
					{
						"family": "Lowe",
						"given": "Ryan"
					},
					{
						"family": "Wu",
						"given": "Yi"
					},
					{
						"family": "Tamar",
						"given": "Aviv"
					},
					{
						"family": "Harb",
						"given": "Jean"
					},
					{
						"family": "Abbeel",
						"given": "Pieter"
					},
					{
						"family": "Mordatch",
						"given": "Igor"
					}
				],
				"collection-title": "NIPS’17",
				"container-title": "Proceedings of the 31st International Conference on Neural Information Processing Systems",
				"id": "https://dl.acm.org/doi/10.5555/3295222.3295385",
				"issued": {
					"date-parts": [
						[
							2017,
							12
						]
					]
				},
				"page": "6382-6393",
				"publisher": "Curran Associates Inc.",
				"publisher-place": "Long Beach, California, USA",
				"title": "Multi-agent actor-critic for mixed cooperative-competitive environments",
				"type": "paper-conference"
			}
		},
		"https://arxiv.org/abs/1705.08926": {
			"fetched": "2021-07-18T10:53:24.775Z",
			"bibtex": [
				"",
				"@article{foerster_counterfactual_2017,",
				"   title = {Counterfactual {Multi}-{Agent} {Policy} {Gradients}},",
				"   url = {http://arxiv.org/abs/1705.08926},",
				"   abstract = {Cooperative multi-agent systems can be naturally used to model many real world problems, such as network packet routing and the coordination of autonomous vehicles. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.},",
				"   urldate = {2021-07-18},",
				"   journal = {arXiv:1705.08926 [cs]},",
				"   author = {Foerster, Jakob and Farquhar, Gregory and Afouras, Triantafyllos and Nardelli, Nantas and Whiteson, Shimon},",
				"   month = dec,",
				"   year = {2017},",
				"   note = {arXiv: 1705.08926},",
				"   keywords = {Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1705.08926",
				"abstract": "Cooperative multi-agent systems can be naturally used to model many real world problems, such as network packet routing and the coordination of autonomous vehicles. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents’ policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent’s action, while keeping the other agents’ actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							18
						]
					]
				},
				"author": [
					{
						"family": "Foerster",
						"given": "Jakob"
					},
					{
						"family": "Farquhar",
						"given": "Gregory"
					},
					{
						"family": "Afouras",
						"given": "Triantafyllos"
					},
					{
						"family": "Nardelli",
						"given": "Nantas"
					},
					{
						"family": "Whiteson",
						"given": "Shimon"
					}
				],
				"container-title": "arXiv:1705.08926 [cs]",
				"id": "https://arxiv.org/abs/1705.08926",
				"issued": {
					"date-parts": [
						[
							2017,
							12
						]
					]
				},
				"keyword": "Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems",
				"note": "arXiv: 1705.08926",
				"title": "Counterfactual Multi-Agent Policy Gradients",
				"type": "article-journal"
			}
		},
		"https://link.springer.com/article/10.1007/s43154-021-00048-3": {
			"fetched": "2021-07-20T18:24:04.599Z",
			"bibtex": [
				"",
				"@article{drew_multi-agent_2021,",
				"   title = {Multi-{Agent} {Systems} for {Search} and {Rescue} {Applications}},",
				"   volume = {2},",
				"   issn = {2662-4087},",
				"   url = {https://doi.org/10.1007/s43154-021-00048-3},",
				"   doi = {10.1007/s43154-021-00048-3},",
				"   abstract = {The goal of this review is to evaluate the current status of multi-robot systems in the context of search and rescue. This includes an investigation of their current use in the field, what major technical challenge areas currently preclude more widespread use, and which key topics will drive future development and adoption.},",
				"   language = {en},",
				"   number = {2},",
				"   urldate = {2021-07-20},",
				"   journal = {Current Robotics Reports},",
				"   author = {Drew, Daniel S.},",
				"   month = jun,",
				"   year = {2021},",
				"   pages = {189--200},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1007/s43154-021-00048-3",
				"ISSN": "2662-4087",
				"URL": "https://doi.org/10.1007/s43154-021-00048-3",
				"abstract": "The goal of this review is to evaluate the current status of multi-robot systems in the context of search and rescue. This includes an investigation of their current use in the field, what major technical challenge areas currently preclude more widespread use, and which key topics will drive future development and adoption.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"family": "Drew",
						"given": "Daniel S."
					}
				],
				"container-title": "Current Robotics Reports",
				"id": "https://link.springer.com/article/10.1007/s43154-021-00048-3",
				"issue": "2",
				"issued": {
					"date-parts": [
						[
							2021,
							6
						]
					]
				},
				"page": "189-200",
				"title": "Multi-Agent Systems for Search and Rescue Applications",
				"type": "article-journal",
				"volume": "2"
			}
		},
		"https://doi.org/10.3929/ethz-a-010831954": {
			"fetched": "2021-07-20T18:24:08.611Z",
			"bibtex": [
				"",
				"@techreport{waibel_drone_2017,",
				"   type = {Report},",
				"   title = {Drone shows: {Creative} potential and best practices},",
				"   copyright = {http://rightsstatements.org/page/InC-NC/1.0/},",
				"   shorttitle = {Drone shows},",
				"   url = {https://www.research-collection.ethz.ch/handle/20.500.11850/125498},",
				"   language = {en},",
				"   urldate = {2021-07-20},",
				"   institution = {ETH Zurich},",
				"   author = {Waibel, Markus and Keays, Bill and Augugliaro, Federico},",
				"   year = {2017},",
				"   doi = {10.3929/ethz-a-010831954},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.3929/ethz-a-010831954",
				"URL": "https://www.research-collection.ethz.ch/handle/20.500.11850/125498",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"family": "Waibel",
						"given": "Markus"
					},
					{
						"family": "Keays",
						"given": "Bill"
					},
					{
						"family": "Augugliaro",
						"given": "Federico"
					}
				],
				"genre": "Report",
				"id": "https://doi.org/10.3929/ethz-a-010831954",
				"issued": {
					"date-parts": [
						[
							2017
						]
					]
				},
				"publisher": "ETH Zurich",
				"title": "Drone shows: Creative potential and best practices",
				"title-short": "Drone shows",
				"type": "report"
			}
		},
		"https://doi.org/10.1117/12.830408": {
			"fetched": "2021-07-20T18:24:12.806Z",
			"bibtex": [
				"",
				"@inproceedings{burkle_collaborating_2009,",
				"   title = {Collaborating miniature drones for surveillance and reconnaissance},",
				"   volume = {7480},",
				"   url = {https://www.spiedigitallibrary.org/conference-proceedings-of-spie/7480/74800H/Collaborating-miniature-drones-for-surveillance-and-reconnaissance/10.1117/12.830408.short},",
				"   doi = {10.1117/12.830408},",
				"   abstract = {The use of miniature Unmanned Aerial Vehicles (UAVs), e.g. quadrocopters, has gained great popularity over the last years. Some complex application scenarios for micro UAVs call for the formation of swarms of multiple drones. In this paper a platform for the creation of such swarms is presented. It consists of commercial quadrocopters enhanced with on-board processing and communication units enabling autonomy of individual drones. Furthermore, a generic ground control station has been realized. Different co-operation strategies for teams of UAVs are currently evaluated with an agent based simulation tool. Finally, complex application scenarios for multiple micro UAVs are presented.},",
				"   urldate = {2021-07-20},",
				"   booktitle = {Unmanned/{Unattended} {Sensors} and {Sensor} {Networks} {VI}},",
				"   publisher = {International Society for Optics and Photonics},",
				"   author = {Bürkle, Axel},",
				"   month = sep,",
				"   year = {2009},",
				"   pages = {74800H},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1117/12.830408",
				"URL": "https://www.spiedigitallibrary.org/conference-proceedings-of-spie/7480/74800H/Collaborating-miniature-drones-for-surveillance-and-reconnaissance/10.1117/12.830408.short",
				"abstract": "The use of miniature Unmanned Aerial Vehicles (UAVs), e.g. quadrocopters, has gained great popularity over the last years. Some complex application scenarios for micro UAVs call for the formation of swarms of multiple drones. In this paper a platform for the creation of such swarms is presented. It consists of commercial quadrocopters enhanced with on-board processing and communication units enabling autonomy of individual drones. Furthermore, a generic ground control station has been realized. Different co-operation strategies for teams of UAVs are currently evaluated with an agent based simulation tool. Finally, complex application scenarios for multiple micro UAVs are presented.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"family": "Bürkle",
						"given": "Axel"
					}
				],
				"container-title": "Unmanned/Unattended Sensors and Sensor Networks VI",
				"id": "https://doi.org/10.1117/12.830408",
				"issued": {
					"date-parts": [
						[
							2009,
							9
						]
					]
				},
				"page": "74800H",
				"publisher": "International Society for Optics and Photonics",
				"title": "Collaborating miniature drones for surveillance and reconnaissance",
				"type": "paper-conference",
				"volume": "7480"
			}
		},
		"https://apps.dtic.mil/sti/citations/AD1039921": {
			"fetched": "2021-07-20T18:24:18.320Z",
			"bibtex": [
				"",
				"@techreport{sanders_drone_2017,",
				"   title = {Drone {Swarms}},",
				"   url = {https://apps.dtic.mil/sti/citations/AD1039921},",
				"   abstract = {Drone swarms are here. The United States, China, and Russia are on the forefront of drone swarm development and utilization. However, the low cost and easy accessibility to drones allow non-state actors to utilize drones in imaginative and creative ways, to include swarming. The aim of the monograph is to address the following question: What utility do drone swarms provide the military? Drone swarms provide numerous advantages, to include persistent intelligence, surveillance, reconnaissance, and targeting; low-risk and low-cost to military personnel and organizations; and the potential to paralyze personal and organizational decision making. In contrast, drone swarms come with vulnerabilities and challenges. The vulnerabilities range from an adversary hacking to the existence of counter swarm weapons, and some challenges include organizational resistance and international law. Drone swarms are here and are coming to a battlefield soon, and it is time to address how best to employ them. After outlining the potential benefits and limitations of drone swarms, the monograph concludes with four recommendations: the need for narrative, establishing a drone swarm doctrine, understanding human-drone interface, and an organizational transition for drone swarm employment.},",
				"   language = {en},",
				"   urldate = {2021-07-20},",
				"   institution = {US Army School for Advanced Military Studies Fort Leavenworth United States},",
				"   author = {Sanders, Andrew W.},",
				"   month = may,",
				"   year = {2017},",
				"}",
				""
			],
			"csl": {
				"URL": "https://apps.dtic.mil/sti/citations/AD1039921",
				"abstract": "Drone swarms are here. The United States, China, and Russia are on the forefront of drone swarm development and utilization. However, the low cost and easy accessibility to drones allow non-state actors to utilize drones in imaginative and creative ways, to include swarming. The aim of the monograph is to address the following question: What utility do drone swarms provide the military? Drone swarms provide numerous advantages, to include persistent intelligence, surveillance, reconnaissance, and targeting; low-risk and low-cost to military personnel and organizations; and the potential to paralyze personal and organizational decision making. In contrast, drone swarms come with vulnerabilities and challenges. The vulnerabilities range from an adversary hacking to the existence of counter swarm weapons, and some challenges include organizational resistance and international law. Drone swarms are here and are coming to a battlefield soon, and it is time to address how best to employ them. After outlining the potential benefits and limitations of drone swarms, the monograph concludes with four recommendations: the need for narrative, establishing a drone swarm doctrine, understanding human-drone interface, and an organizational transition for drone swarm employment.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"family": "Sanders",
						"given": "Andrew W."
					}
				],
				"id": "https://apps.dtic.mil/sti/citations/AD1039921",
				"issued": {
					"date-parts": [
						[
							2017,
							5
						]
					]
				},
				"publisher": "US Army School for Advanced Military Studies Fort Leavenworth United States",
				"title": "Drone Swarms",
				"type": "report"
			}
		},
		"https://ieeexplore.ieee.org/document/7568316": {
			"fetched": "2021-07-20T18:24:21.709Z",
			"bibtex": [
				"",
				"@inproceedings{ceraso_controlling_2016,",
				"   title = {Controlling swarms of medical nanorobots using {CPPSO} on a {GPU}},",
				"   url = {https://ieeexplore.ieee.org/document/7568316},",
				"   doi = {10.1109/HPCSim.2016.7568316},",
				"   abstract = {Nanotechnology has the potential to revolutionize our lives and to provide technological solutions to our problems in energy, the environment and medicine. This paper describes a swarm intelligence-based control mechanism for medical nanorobots that operates as artificial platelets to search for wounds within the human body. We present a coloured perceptive particle swarm (CPPSO) algorithm to control the movement of nanorobots in self-assembly. To predict emergent nanorobot behaviors, we designed a parallel simulator that models how nanorobots interact with each other and the environment. We will show that due to their implicitly parallel structure, swarm intelligence algorithms can benefit from GPU-based implementations. The algorithm is implemented with CUDA. With the GPU-based implementation adopted here, we find that CPPSO is faster than a PPSO implementation.},",
				"   urldate = {2021-07-20},",
				"   booktitle = {2016 {International} {Conference} on {High} {Performance} {Computing} \\& {Simulation} ({HPCS})},",
				"   author = {Ceraso, Davide and Spezzano, Giandomenico},",
				"   month = jul,",
				"   year = {2016},",
				"   keywords = {Particle swarm optimization, Graphics processing units, Blood, Algorithm design and analysis, Wounds, Nanoscale devices, Standards},",
				"   pages = {58--65},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1109/HPCSim.2016.7568316",
				"URL": "https://ieeexplore.ieee.org/document/7568316",
				"abstract": "Nanotechnology has the potential to revolutionize our lives and to provide technological solutions to our problems in energy, the environment and medicine. This paper describes a swarm intelligence-based control mechanism for medical nanorobots that operates as artificial platelets to search for wounds within the human body. We present a coloured perceptive particle swarm (CPPSO) algorithm to control the movement of nanorobots in self-assembly. To predict emergent nanorobot behaviors, we designed a parallel simulator that models how nanorobots interact with each other and the environment. We will show that due to their implicitly parallel structure, swarm intelligence algorithms can benefit from GPU-based implementations. The algorithm is implemented with CUDA. With the GPU-based implementation adopted here, we find that CPPSO is faster than a PPSO implementation.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"family": "Ceraso",
						"given": "Davide"
					},
					{
						"family": "Spezzano",
						"given": "Giandomenico"
					}
				],
				"container-title": "2016 International Conference on High Performance Computing & Simulation (HPCS)",
				"id": "https://ieeexplore.ieee.org/document/7568316",
				"issued": {
					"date-parts": [
						[
							2016,
							7
						]
					]
				},
				"keyword": "Particle swarm optimization, Graphics processing units, Blood, Algorithm design and analysis, Wounds, Nanoscale devices, Standards",
				"page": "58-65",
				"title": "Controlling swarms of medical nanorobots using CPPSO on a GPU",
				"type": "paper-conference"
			}
		},
		"https://link.springer.com/article/10.1007/s00146-018-0845-5": {
			"fetched": "2021-07-20T18:24:24.638Z",
			"bibtex": [
				"",
				"@article{turchin_classification_2020,",
				"   title = {Classification of global catastrophic risks connected with artificial intelligence},",
				"   volume = {35},",
				"   issn = {1435-5655},",
				"   url = {https://doi.org/10.1007/s00146-018-0845-5},",
				"   doi = {10.1007/s00146-018-0845-5},",
				"   abstract = {A classification of the global catastrophic risks of AI is presented, along with a comprehensive list of previously identified risks. This classification allows the identification of several new risks. We show that at each level of AI’s intelligence power, separate types of possible catastrophes dominate. Our classification demonstrates that the field of AI risks is diverse, and includes many scenarios beyond the commonly discussed cases of a paperclip maximizer or robot-caused unemployment. Global catastrophic failure could happen at various levels of AI development, namely, (1) before it starts self-improvement, (2) during its takeoff, when it uses various instruments to escape its initial confinement, or (3) after it successfully takes over the world and starts to implement its goal system, which could be plainly unaligned, or feature-flawed friendliness. AI could also halt at later stages of its development either due to technical glitches or ontological problems. Overall, we identified around several dozen scenarios of AI-driven global catastrophe. The extent of this list illustrates that there is no one simple solution to the problem of AI safety, and that AI safety theory is complex and must be customized for each AI development level.},",
				"   language = {en},",
				"   number = {1},",
				"   urldate = {2021-07-20},",
				"   journal = {AI \\& SOCIETY},",
				"   author = {Turchin, Alexey and Denkenberger, David},",
				"   month = mar,",
				"   year = {2020},",
				"   pages = {147--163},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1007/s00146-018-0845-5",
				"ISSN": "1435-5655",
				"URL": "https://doi.org/10.1007/s00146-018-0845-5",
				"abstract": "A classification of the global catastrophic risks of AI is presented, along with a comprehensive list of previously identified risks. This classification allows the identification of several new risks. We show that at each level of AI’s intelligence power, separate types of possible catastrophes dominate. Our classification demonstrates that the field of AI risks is diverse, and includes many scenarios beyond the commonly discussed cases of a paperclip maximizer or robot-caused unemployment. Global catastrophic failure could happen at various levels of AI development, namely, (1) before it starts self-improvement, (2) during its takeoff, when it uses various instruments to escape its initial confinement, or (3) after it successfully takes over the world and starts to implement its goal system, which could be plainly unaligned, or feature-flawed friendliness. AI could also halt at later stages of its development either due to technical glitches or ontological problems. Overall, we identified around several dozen scenarios of AI-driven global catastrophe. The extent of this list illustrates that there is no one simple solution to the problem of AI safety, and that AI safety theory is complex and must be customized for each AI development level.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"family": "Turchin",
						"given": "Alexey"
					},
					{
						"family": "Denkenberger",
						"given": "David"
					}
				],
				"container-title": "AI & SOCIETY",
				"id": "https://link.springer.com/article/10.1007/s00146-018-0845-5",
				"issue": "1",
				"issued": {
					"date-parts": [
						[
							2020,
							3
						]
					]
				},
				"page": "147-163",
				"title": "Classification of global catastrophic risks connected with artificial intelligence",
				"type": "article-journal",
				"volume": "35"
			}
		},
		"https://arxiv.org/abs/1808.00177": {
			"fetched": "2021-07-20T18:24:27.782Z",
			"bibtex": [
				"",
				"@article{openai_learning_2019,",
				"   title = {Learning {Dexterous} {In}-{Hand} {Manipulation}},",
				"   url = {http://arxiv.org/abs/1808.00177},",
				"   abstract = {We use reinforcement learning (RL) to learn dexterous in-hand manipulation policies which can perform vision-based object reorientation on a physical Shadow Dexterous Hand. The training is performed in a simulated environment in which we randomize many of the physical properties of the system like friction coefficients and an object's appearance. Our policies transfer to the physical robot despite being trained entirely in simulation. Our method does not rely on any human demonstrations, but many behaviors found in human manipulation emerge naturally, including finger gaiting, multi-finger coordination, and the controlled use of gravity. Our results were obtained using the same distributed RL system that was used to train OpenAI Five. We also include a video of our results: https://youtu.be/jwSbzNHGflM},",
				"   urldate = {2021-07-20},",
				"   journal = {arXiv:1808.00177 [cs, stat]},",
				"   author = {{OpenAI} and Andrychowicz, Marcin and Baker, Bowen and Chociej, Maciek and Jozefowicz, Rafal and McGrew, Bob and Pachocki, Jakub and Petron, Arthur and Plappert, Matthias and Powell, Glenn and Ray, Alex and Schneider, Jonas and Sidor, Szymon and Tobin, Josh and Welinder, Peter and Weng, Lilian and Zaremba, Wojciech},",
				"   month = jan,",
				"   year = {2019},",
				"   note = {arXiv: 1808.00177},",
				"   keywords = {Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Robotics, Statistics - Machine Learning},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1808.00177",
				"abstract": "We use reinforcement learning (RL) to learn dexterous in-hand manipulation policies which can perform vision-based object reorientation on a physical Shadow Dexterous Hand. The training is performed in a simulated environment in which we randomize many of the physical properties of the system like friction coefficients and an object’s appearance. Our policies transfer to the physical robot despite being trained entirely in simulation. Our method does not rely on any human demonstrations, but many behaviors found in human manipulation emerge naturally, including finger gaiting, multi-finger coordination, and the controlled use of gravity. Our results were obtained using the same distributed RL system that was used to train OpenAI Five. We also include a video of our results: https://youtu.be/jwSbzNHGflM",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"literal": "OpenAI"
					},
					{
						"family": "Andrychowicz",
						"given": "Marcin"
					},
					{
						"family": "Baker",
						"given": "Bowen"
					},
					{
						"family": "Chociej",
						"given": "Maciek"
					},
					{
						"family": "Jozefowicz",
						"given": "Rafal"
					},
					{
						"family": "McGrew",
						"given": "Bob"
					},
					{
						"family": "Pachocki",
						"given": "Jakub"
					},
					{
						"family": "Petron",
						"given": "Arthur"
					},
					{
						"family": "Plappert",
						"given": "Matthias"
					},
					{
						"family": "Powell",
						"given": "Glenn"
					},
					{
						"family": "Ray",
						"given": "Alex"
					},
					{
						"family": "Schneider",
						"given": "Jonas"
					},
					{
						"family": "Sidor",
						"given": "Szymon"
					},
					{
						"family": "Tobin",
						"given": "Josh"
					},
					{
						"family": "Welinder",
						"given": "Peter"
					},
					{
						"family": "Weng",
						"given": "Lilian"
					},
					{
						"family": "Zaremba",
						"given": "Wojciech"
					}
				],
				"container-title": "arXiv:1808.00177 [cs, stat]",
				"id": "https://arxiv.org/abs/1808.00177",
				"issued": {
					"date-parts": [
						[
							2019,
							1
						]
					]
				},
				"keyword": "Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Robotics, Statistics - Machine Learning",
				"note": "arXiv: 1808.00177",
				"title": "Learning Dexterous In-Hand Manipulation",
				"type": "article-journal"
			}
		},
		"http://proceedings.mlr.press/v37/schulman15.html": {
			"fetched": "2021-07-20T19:36:30.876Z",
			"bibtex": [
				"",
				"@inproceedings{schulman_trust_2015,",
				"   title = {Trust {Region} {Policy} {Optimization}},",
				"   url = {http://proceedings.mlr.press/v37/schulman15.html},",
				"   abstract = {In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified scheme, we develop a pr...},",
				"   language = {en},",
				"   urldate = {2021-07-20},",
				"   booktitle = {International {Conference} on {Machine} {Learning}},",
				"   publisher = {PMLR},",
				"   author = {Schulman, John and Levine, Sergey and Abbeel, Pieter and Jordan, Michael and Moritz, Philipp},",
				"   month = jun,",
				"   year = {2015},",
				"   pages = {1889--1897},",
				"}",
				""
			],
			"csl": {
				"URL": "http://proceedings.mlr.press/v37/schulman15.html",
				"abstract": "In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified scheme, we develop a pr...",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"family": "Schulman",
						"given": "John"
					},
					{
						"family": "Levine",
						"given": "Sergey"
					},
					{
						"family": "Abbeel",
						"given": "Pieter"
					},
					{
						"family": "Jordan",
						"given": "Michael"
					},
					{
						"family": "Moritz",
						"given": "Philipp"
					}
				],
				"container-title": "International Conference on Machine Learning",
				"id": "http://proceedings.mlr.press/v37/schulman15.html",
				"issued": {
					"date-parts": [
						[
							2015,
							6
						]
					]
				},
				"page": "1889-1897",
				"publisher": "PMLR",
				"title": "Trust Region Policy Optimization",
				"type": "paper-conference"
			}
		},
		"https://arxiv.org/abs/1801.01290": {
			"fetched": "2021-07-20T19:36:33.974Z",
			"bibtex": [
				"",
				"@article{haarnoja_soft_2018,",
				"   title = {Soft {Actor}-{Critic}: {Off}-{Policy} {Maximum} {Entropy} {Deep} {Reinforcement} {Learning} with a {Stochastic} {Actor}},",
				"   shorttitle = {Soft {Actor}-{Critic}},",
				"   url = {http://arxiv.org/abs/1801.01290},",
				"   abstract = {Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.},",
				"   urldate = {2021-07-20},",
				"   journal = {arXiv:1801.01290 [cs, stat]},",
				"   author = {Haarnoja, Tuomas and Zhou, Aurick and Abbeel, Pieter and Levine, Sergey},",
				"   month = aug,",
				"   year = {2018},",
				"   note = {arXiv: 1801.01290},",
				"   keywords = {Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1801.01290",
				"abstract": "Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"family": "Haarnoja",
						"given": "Tuomas"
					},
					{
						"family": "Zhou",
						"given": "Aurick"
					},
					{
						"family": "Abbeel",
						"given": "Pieter"
					},
					{
						"family": "Levine",
						"given": "Sergey"
					}
				],
				"container-title": "arXiv:1801.01290 [cs, stat]",
				"id": "https://arxiv.org/abs/1801.01290",
				"issued": {
					"date-parts": [
						[
							2018,
							8
						]
					]
				},
				"keyword": "Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning",
				"note": "arXiv: 1801.01290",
				"title": "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor",
				"title-short": "Soft Actor-Critic",
				"type": "article-journal"
			}
		},
		"doi:10.1214/aoms/1177729694": {
			"fetched": "2021-07-20T19:36:35.214Z",
			"bibtex": [
				"",
				"@article{kullback_information_1951,",
				"   title = {On {Information} and {Sufficiency}},",
				"   volume = {22},",
				"   issn = {0003-4851},",
				"   url = {http://projecteuclid.org/euclid.aoms/1177729694},",
				"   doi = {10.1214/aoms/1177729694},",
				"   language = {en},",
				"   number = {1},",
				"   urldate = {2021-07-20},",
				"   journal = {The Annals of Mathematical Statistics},",
				"   author = {Kullback, S. and Leibler, R. A.},",
				"   month = mar,",
				"   year = {1951},",
				"   pages = {79--86},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1214/aoms/1177729694",
				"ISSN": "0003-4851",
				"URL": "http://projecteuclid.org/euclid.aoms/1177729694",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"family": "Kullback",
						"given": "S."
					},
					{
						"family": "Leibler",
						"given": "R. A."
					}
				],
				"container-title": "The Annals of Mathematical Statistics",
				"id": "doi:10.1214/aoms/1177729694",
				"issue": "1",
				"issued": {
					"date-parts": [
						[
							1951,
							3
						]
					]
				},
				"page": "79-86",
				"title": "On Information and Sufficiency",
				"type": "article-journal",
				"volume": "22"
			}
		},
		"https://www.springer.com/gp/book/9783540710493": {
			"fetched": "2021-07-20T19:36:36.721Z",
			"bibtex": [
				"",
				"@book{villani_optimal_2009,",
				"   address = {Berlin Heidelberg},",
				"   series = {Grundlehren der mathematischen {Wissenschaften}},",
				"   title = {Optimal {Transport}: {Old} and {New}},",
				"   isbn = {9783540710493},",
				"   shorttitle = {Optimal {Transport}},",
				"   url = {https://www.springer.com/gp/book/9783540710493},",
				"   abstract = {At the close of the 1980s, the independent contributions of Yann Brenier, Mike Cullen and John Mather launched a revolution in the venerable field of optimal transport founded by G. Monge in the 18th century, which has made breathtaking forays into various other domains of mathematics ever since. The author presents a broad overview of this area, supplying complete and self-contained proofs of all the fundamental results of the theory of optimal transport at the appropriate level of generality. Thus, the book encompasses the broad spectrum ranging from basic theory to the most recent research results. PhD students or researchers can read the entire book without any prior knowledge of the field. A comprehensive bibliography with notes that extensively discuss the existing literature underlines the book’s value as a most welcome reference text on this subject.},",
				"   language = {en},",
				"   urldate = {2021-07-20},",
				"   publisher = {Springer-Verlag},",
				"   author = {Villani, Cédric},",
				"   year = {2009},",
				"   doi = {10.1007/978-3-540-71050-9},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1007/978-3-540-71050-9",
				"ISBN": "9783540710493",
				"URL": "https://www.springer.com/gp/book/9783540710493",
				"abstract": "At the close of the 1980s, the independent contributions of Yann Brenier, Mike Cullen and John Mather launched a revolution in the venerable field of optimal transport founded by G. Monge in the 18th century, which has made breathtaking forays into various other domains of mathematics ever since. The author presents a broad overview of this area, supplying complete and self-contained proofs of all the fundamental results of the theory of optimal transport at the appropriate level of generality. Thus, the book encompasses the broad spectrum ranging from basic theory to the most recent research results. PhD students or researchers can read the entire book without any prior knowledge of the field. A comprehensive bibliography with notes that extensively discuss the existing literature underlines the book’s value as a most welcome reference text on this subject.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							20
						]
					]
				},
				"author": [
					{
						"family": "Villani",
						"given": "Cédric"
					}
				],
				"collection-title": "Grundlehren der mathematischen Wissenschaften",
				"id": "https://www.springer.com/gp/book/9783540710493",
				"issued": {
					"date-parts": [
						[
							2009
						]
					]
				},
				"publisher": "Springer-Verlag",
				"publisher-place": "Berlin Heidelberg",
				"title": "Optimal Transport: Old and New",
				"title-short": "Optimal Transport",
				"type": "book"
			}
		},
		"https://arxiv.org/abs/1506.02438": {
			"fetched": "2021-07-21T12:07:39.040Z",
			"bibtex": [
				"",
				"@article{schulman_high-dimensional_2018,",
				"   title = {High-{Dimensional} {Continuous} {Control} {Using} {Generalized} {Advantage} {Estimation}},",
				"   url = {http://arxiv.org/abs/1506.02438},",
				"   abstract = {Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(lambda). We address the second challenge by using trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.},",
				"   urldate = {2021-07-21},",
				"   journal = {arXiv:1506.02438 [cs]},",
				"   author = {Schulman, John and Moritz, Philipp and Levine, Sergey and Jordan, Michael and Abbeel, Pieter},",
				"   month = oct,",
				"   year = {2018},",
				"   note = {arXiv: 1506.02438},",
				"   keywords = {Computer Science - Machine Learning, Computer Science - Robotics, Electrical Engineering and Systems Science - Systems and Control},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/1506.02438",
				"abstract": "Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(lambda). We address the second challenge by using trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							21
						]
					]
				},
				"author": [
					{
						"family": "Schulman",
						"given": "John"
					},
					{
						"family": "Moritz",
						"given": "Philipp"
					},
					{
						"family": "Levine",
						"given": "Sergey"
					},
					{
						"family": "Jordan",
						"given": "Michael"
					},
					{
						"family": "Abbeel",
						"given": "Pieter"
					}
				],
				"container-title": "arXiv:1506.02438 [cs]",
				"id": "https://arxiv.org/abs/1506.02438",
				"issued": {
					"date-parts": [
						[
							2018,
							10
						]
					]
				},
				"keyword": "Computer Science - Machine Learning, Computer Science - Robotics, Electrical Engineering and Systems Science - Systems and Control",
				"note": "arXiv: 1506.02438",
				"title": "High-Dimensional Continuous Control Using Generalized Advantage Estimation",
				"type": "article-journal"
			}
		},
		"https://www.springer.com/gp/book/9783319289274": {
			"fetched": "2021-07-21T13:59:18.312Z",
			"bibtex": [
				"",
				"@book{oliehoek_concise_2016,",
				"   series = {{SpringerBriefs} in {Intelligent} {Systems}},",
				"   title = {A {Concise} {Introduction} to {Decentralized} {POMDPs}},",
				"   isbn = {9783319289274},",
				"   url = {https://www.springer.com/gp/book/9783319289274},",
				"   abstract = {This book introduces multiagent planning under uncertainty as formalized by decentralized partially observable Markov decision processes (Dec-POMDPs). The intended audience is researchers and graduate students working in the fields of artificial intelligence related to sequential decision making: reinforcement learning, decision-theoretic planning for single agents, classical multiagent planning, decentralized control, and operations research.},",
				"   language = {en},",
				"   urldate = {2021-07-21},",
				"   publisher = {Springer International Publishing},",
				"   author = {Oliehoek, Frans A. and Amato, Christopher},",
				"   year = {2016},",
				"   doi = {10.1007/978-3-319-28929-8},",
				"}",
				""
			],
			"csl": {
				"DOI": "10.1007/978-3-319-28929-8",
				"ISBN": "9783319289274",
				"URL": "https://www.springer.com/gp/book/9783319289274",
				"abstract": "This book introduces multiagent planning under uncertainty as formalized by decentralized partially observable Markov decision processes (Dec-POMDPs). The intended audience is researchers and graduate students working in the fields of artificial intelligence related to sequential decision making: reinforcement learning, decision-theoretic planning for single agents, classical multiagent planning, decentralized control, and operations research.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							21
						]
					]
				},
				"author": [
					{
						"family": "Oliehoek",
						"given": "Frans A."
					},
					{
						"family": "Amato",
						"given": "Christopher"
					}
				],
				"collection-title": "SpringerBriefs in Intelligent Systems",
				"id": "https://www.springer.com/gp/book/9783319289274",
				"issued": {
					"date-parts": [
						[
							2016
						]
					]
				},
				"publisher": "Springer International Publishing",
				"title": "A Concise Introduction to Decentralized POMDPs",
				"type": "book"
			}
		},
		"https://dl.acm.org/doi/10.5555/3091125.3091320": {
			"fetched": "2021-07-21T13:59:23.517Z",
			"bibtex": [
				"",
				"@inproceedings{sosic_inverse_2017,",
				"   address = {São Paulo, Brazil},",
				"   series = {{AAMAS} '17},",
				"   title = {Inverse {Reinforcement} {Learning} in {Swarm} {Systems}},",
				"   url = {https://dl.acm.org/doi/10.5555/3091125.3091320},",
				"   abstract = {Inverse reinforcement learning (IRL) has become a useful tool for learning behavioral models from demonstration data. However, IRL remains mostly unexplored for multi-agent systems. In this paper, we show how the principle of IRL can be extended to homogeneous large-scale problems, inspired by the collective swarming behavior of natural systems. In particular, we make the following contributions to the field: 1) We introduce the swarMDP framework, a sub-class of decentralized partially observable Markov decision processes endowed with a swarm characterization. 2) Exploiting the inherent homogeneity of this framework, we reduce the resulting multi-agent IRL problem to a single-agent one by proving that the agent-specific value functions in this model coincide. 3) To solve the corresponding control problem, we propose a novel heterogeneous learning scheme that is particularly tailored to the swarm setting. Results on two example systems demonstrate that our framework is able to produce meaningful local reward models from which we can replicate the observed global system dynamics.},",
				"   urldate = {2021-07-21},",
				"   booktitle = {Proceedings of the 16th {Conference} on {Autonomous} {Agents} and {MultiAgent} {Systems}},",
				"   publisher = {International Foundation for Autonomous Agents and Multiagent Systems},",
				"   author = {Šošić, Adrian and KhudaBukhsh, Wasiur R. and Zoubir, Abdelhak M. and Koeppl, Heinz},",
				"   month = may,",
				"   year = {2017},",
				"   keywords = {multi-agent systems, swarms, inverse reinforcement learning},",
				"   pages = {1413--1421},",
				"}",
				""
			],
			"csl": {
				"URL": "https://dl.acm.org/doi/10.5555/3091125.3091320",
				"abstract": "Inverse reinforcement learning (IRL) has become a useful tool for learning behavioral models from demonstration data. However, IRL remains mostly unexplored for multi-agent systems. In this paper, we show how the principle of IRL can be extended to homogeneous large-scale problems, inspired by the collective swarming behavior of natural systems. In particular, we make the following contributions to the field: 1) We introduce the swarMDP framework, a sub-class of decentralized partially observable Markov decision processes endowed with a swarm characterization. 2) Exploiting the inherent homogeneity of this framework, we reduce the resulting multi-agent IRL problem to a single-agent one by proving that the agent-specific value functions in this model coincide. 3) To solve the corresponding control problem, we propose a novel heterogeneous learning scheme that is particularly tailored to the swarm setting. Results on two example systems demonstrate that our framework is able to produce meaningful local reward models from which we can replicate the observed global system dynamics.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							21
						]
					]
				},
				"author": [
					{
						"family": "Šošić",
						"given": "Adrian"
					},
					{
						"family": "KhudaBukhsh",
						"given": "Wasiur R."
					},
					{
						"family": "Zoubir",
						"given": "Abdelhak M."
					},
					{
						"family": "Koeppl",
						"given": "Heinz"
					}
				],
				"collection-title": "AAMAS ’17",
				"container-title": "Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems",
				"id": "https://dl.acm.org/doi/10.5555/3091125.3091320",
				"issued": {
					"date-parts": [
						[
							2017,
							5
						]
					]
				},
				"keyword": "multi-agent systems, swarms, inverse reinforcement learning",
				"page": "1413-1421",
				"publisher": "International Foundation for Autonomous Agents; Multiagent Systems",
				"publisher-place": "São Paulo, Brazil",
				"title": "Inverse Reinforcement Learning in Swarm Systems",
				"type": "paper-conference"
			}
		},
		"https://dl.acm.org/doi/10.5555/1642194.1642231": {
			"fetched": "2021-07-24T12:55:08.109Z",
			"bibtex": [
				"",
				"@inproceedings{bowling_rational_2001,",
				"   address = {Seattle, WA, USA},",
				"   series = {{IJCAI}'01},",
				"   title = {Rational and convergent learning in stochastic games},",
				"   isbn = {9781558608122},",
				"   url = {https://dl.acm.org/doi/10.5555/1642194.1642231},",
				"   abstract = {This paper investigates the problem of policy learning in multiagent environments using the stochastic game framework, which we briefly overview. We introduce two properties as desirable for a learning agent when in the presence of other learning agents, namely rationality and convergence. We examine existing reinforcement learning algorithms according to these two properties and notice that they fail to simultaneously meet both criteria. We then contribute a new learning algorithm,WoLF policy hillclimbing, that is based on a simple principle: \"learn quickly while losing, slowly while winning.\" The algorithm is proven to be rational and we present empirical results for a number of stochastic games showing the algorithm converges.},",
				"   urldate = {2021-07-24},",
				"   booktitle = {Proceedings of the 17th international joint conference on {Artificial} intelligence - {Volume} 2},",
				"   publisher = {Morgan Kaufmann Publishers Inc.},",
				"   author = {Bowling, Michael and Veloso, Manuela},",
				"   month = aug,",
				"   year = {2001},",
				"   pages = {1021--1026},",
				"}",
				""
			],
			"csl": {
				"ISBN": "9781558608122",
				"URL": "https://dl.acm.org/doi/10.5555/1642194.1642231",
				"abstract": "This paper investigates the problem of policy learning in multiagent environments using the stochastic game framework, which we briefly overview. We introduce two properties as desirable for a learning agent when in the presence of other learning agents, namely rationality and convergence. We examine existing reinforcement learning algorithms according to these two properties and notice that they fail to simultaneously meet both criteria. We then contribute a new learning algorithm,WoLF policy hillclimbing, that is based on a simple principle: \"learn quickly while losing, slowly while winning.\" The algorithm is proven to be rational and we present empirical results for a number of stochastic games showing the algorithm converges.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							24
						]
					]
				},
				"author": [
					{
						"family": "Bowling",
						"given": "Michael"
					},
					{
						"family": "Veloso",
						"given": "Manuela"
					}
				],
				"collection-title": "IJCAI’01",
				"container-title": "Proceedings of the 17th international joint conference on Artificial intelligence - Volume 2",
				"id": "https://dl.acm.org/doi/10.5555/1642194.1642231",
				"issued": {
					"date-parts": [
						[
							2001,
							8
						]
					]
				},
				"page": "1021-1026",
				"publisher": "Morgan Kaufmann Publishers Inc.",
				"publisher-place": "Seattle, WA, USA",
				"title": "Rational and convergent learning in stochastic games",
				"type": "paper-conference"
			}
		},
		"https://arxiv.org/abs/2107.12808": {
			"fetched": "2021-07-28T13:36:58.128Z",
			"bibtex": [
				"",
				"@article{team_open-ended_2021,",
				"   title = {Open-{Ended} {Learning} {Leads} to {Generally} {Capable} {Agents}},",
				"   url = {http://arxiv.org/abs/2107.12808},",
				"   abstract = {In this work we create agents that can perform well beyond a single, individual task, that exhibit much wider generalisation of behaviour to a massive, rich space of challenges. We define a universe of tasks within an environment domain and demonstrate the ability to train agents that are generally capable across this vast space and beyond. The environment is natively multi-agent, spanning the continuum of competitive, cooperative, and independent games, which are situated within procedurally generated physical 3D worlds. The resulting space is exceptionally diverse in terms of the challenges posed to agents, and as such, even measuring the learning progress of an agent is an open research problem. We propose an iterative notion of improvement between successive generations of agents, rather than seeking to maximise a singular objective, allowing us to quantify progress despite tasks being incomparable in terms of achievable rewards. We show that through constructing an open-ended learning process, which dynamically changes the training task distributions and training objectives such that the agent never stops learning, we achieve consistent learning of new behaviours. The resulting agent is able to score reward in every one of our humanly solvable evaluation levels, with behaviour generalising to many held-out points in the universe of tasks. Examples of this zero-shot generalisation include good performance on Hide and Seek, Capture the Flag, and Tag. Through analysis and hand-authored probe tasks we characterise the behaviour of our agent, and find interesting emergent heuristic behaviours such as trial-and-error experimentation, simple tool use, option switching, and cooperation. Finally, we demonstrate that the general capabilities of this agent could unlock larger scale transfer of behaviour through cheap finetuning.},",
				"   urldate = {2021-07-28},",
				"   journal = {arXiv:2107.12808 [cs]},",
				"   author = {Team, Open-Ended Learning and Stooke, Adam and Mahajan, Anuj and Barros, Catarina and Deck, Charlie and Bauer, Jakob and Sygnowski, Jakub and Trebacz, Maja and Jaderberg, Max and Mathieu, Michael and McAleese, Nat and Bradley-Schmieg, Nathalie and Wong, Nathaniel and Porcel, Nicolas and Raileanu, Roberta and Hughes-Fitt, Steph and Dalibard, Valentin and Czarnecki, Wojciech Marian},",
				"   month = jul,",
				"   year = {2021},",
				"   note = {arXiv: 2107.12808},",
				"   keywords = {Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems},",
				"}",
				""
			],
			"csl": {
				"URL": "http://arxiv.org/abs/2107.12808",
				"abstract": "In this work we create agents that can perform well beyond a single, individual task, that exhibit much wider generalisation of behaviour to a massive, rich space of challenges. We define a universe of tasks within an environment domain and demonstrate the ability to train agents that are generally capable across this vast space and beyond. The environment is natively multi-agent, spanning the continuum of competitive, cooperative, and independent games, which are situated within procedurally generated physical 3D worlds. The resulting space is exceptionally diverse in terms of the challenges posed to agents, and as such, even measuring the learning progress of an agent is an open research problem. We propose an iterative notion of improvement between successive generations of agents, rather than seeking to maximise a singular objective, allowing us to quantify progress despite tasks being incomparable in terms of achievable rewards. We show that through constructing an open-ended learning process, which dynamically changes the training task distributions and training objectives such that the agent never stops learning, we achieve consistent learning of new behaviours. The resulting agent is able to score reward in every one of our humanly solvable evaluation levels, with behaviour generalising to many held-out points in the universe of tasks. Examples of this zero-shot generalisation include good performance on Hide and Seek, Capture the Flag, and Tag. Through analysis and hand-authored probe tasks we characterise the behaviour of our agent, and find interesting emergent heuristic behaviours such as trial-and-error experimentation, simple tool use, option switching, and cooperation. Finally, we demonstrate that the general capabilities of this agent could unlock larger scale transfer of behaviour through cheap finetuning.",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							28
						]
					]
				},
				"author": [
					{
						"family": "Team",
						"given": "Open-Ended Learning"
					},
					{
						"family": "Stooke",
						"given": "Adam"
					},
					{
						"family": "Mahajan",
						"given": "Anuj"
					},
					{
						"family": "Barros",
						"given": "Catarina"
					},
					{
						"family": "Deck",
						"given": "Charlie"
					},
					{
						"family": "Bauer",
						"given": "Jakob"
					},
					{
						"family": "Sygnowski",
						"given": "Jakub"
					},
					{
						"family": "Trebacz",
						"given": "Maja"
					},
					{
						"family": "Jaderberg",
						"given": "Max"
					},
					{
						"family": "Mathieu",
						"given": "Michael"
					},
					{
						"family": "McAleese",
						"given": "Nat"
					},
					{
						"family": "Bradley-Schmieg",
						"given": "Nathalie"
					},
					{
						"family": "Wong",
						"given": "Nathaniel"
					},
					{
						"family": "Porcel",
						"given": "Nicolas"
					},
					{
						"family": "Raileanu",
						"given": "Roberta"
					},
					{
						"family": "Hughes-Fitt",
						"given": "Steph"
					},
					{
						"family": "Dalibard",
						"given": "Valentin"
					},
					{
						"family": "Czarnecki",
						"given": "Wojciech Marian"
					}
				],
				"container-title": "arXiv:2107.12808 [cs]",
				"id": "https://arxiv.org/abs/2107.12808",
				"issued": {
					"date-parts": [
						[
							2021,
							7
						]
					]
				},
				"keyword": "Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems",
				"note": "arXiv: 2107.12808",
				"title": "Open-Ended Learning Leads to Generally Capable Agents",
				"type": "article-journal"
			}
		},
		"https://openreview.net/forum?id=r1etN1rtPB": {
			"fetched": "2021-07-28T15:00:08.830Z",
			"bibtex": [
				"",
				"@inproceedings{engstrom_implementation_2019,",
				"   title = {Implementation {Matters} in {Deep} {RL}: {A} {Case} {Study} on {PPO} and {TRPO}},",
				"   shorttitle = {Implementation {Matters} in {Deep} {RL}},",
				"   url = {https://openreview.net/forum?id=r1etN1rtPB},",
				"   abstract = {We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms: Proximal Policy Optimization (PPO) and Trust Region Policy Optimization...},",
				"   language = {en},",
				"   urldate = {2021-07-28},",
				"   author = {Engstrom, Logan and Ilyas, Andrew and Santurkar, Shibani and Tsipras, Dimitris and Janoos, Firdaus and Rudolph, Larry and Madry, Aleksander},",
				"   month = sep,",
				"   year = {2019},",
				"}",
				""
			],
			"csl": {
				"URL": "https://openreview.net/forum?id=r1etN1rtPB",
				"abstract": "We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms: Proximal Policy Optimization (PPO) and Trust Region Policy Optimization...",
				"accessed": {
					"date-parts": [
						[
							2021,
							7,
							28
						]
					]
				},
				"author": [
					{
						"family": "Engstrom",
						"given": "Logan"
					},
					{
						"family": "Ilyas",
						"given": "Andrew"
					},
					{
						"family": "Santurkar",
						"given": "Shibani"
					},
					{
						"family": "Tsipras",
						"given": "Dimitris"
					},
					{
						"family": "Janoos",
						"given": "Firdaus"
					},
					{
						"family": "Rudolph",
						"given": "Larry"
					},
					{
						"family": "Madry",
						"given": "Aleksander"
					}
				],
				"id": "https://openreview.net/forum?id_x61_r1etN1rtPB",
				"issued": {
					"date-parts": [
						[
							2019,
							9
						]
					]
				},
				"title": "Implementation Matters in Deep RL: A Case Study on PPO and TRPO",
				"title-short": "Implementation Matters in Deep RL",
				"type": "paper-conference"
			}
		}
	}
}

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 365
  • Total Committers: 20
  • Avg Commits per committer: 18.25
  • Development Distribution Score (DDS): 0.449
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers

| Name | Email | Commits |
| --- | --- | --- |
| Daniel Himmelstein | d****n@g****m | 201 |
| phiresky | p****t@g****m | 50 |
| Anthony Gitter | a****r | 37 |
| Vincent Rubinetti | v****i@g****m | 27 |
| Casey Greene | c****e | 22 |
| Venkat Malladi | v****i@g****m | 6 |
| David Slochower | s****r@g****m | 5 |
| Robert Gieseke | r****g@w****e | 3 |
| Michael Hoffman | m****n | 2 |
| Olga Botvinnik | o****k@g****m | 2 |
| C. Titus Brown | t****s@i****g | 1 |
| Dan Siddoway | d****n@b****t | 1 |
| Evan Cofer | e****r@p****u | 1 |
| Ogun Adebali | a****i | 1 |
| Paul Agapow | p****l@a****t | 1 |
| Pete Bachant | p****t@g****m | 1 |
| Ryan A. Hagenson | R****n@g****m | 1 |
| Sebastian Karcher | k****r@u****u | 1 |
| Tiago Lubiana | t****s@u****r | 1 |
| nfry321 | 5****1 | 1 |
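The Development Distribution Score above can be reproduced from these commit counts. A minimal sketch, assuming the common definition (DDS = 1 minus the top committer's share of all commits); the function name is illustrative, not from any library:

```python
def development_distribution_score(commits_per_committer):
    """DDS = 1 - (commits by the most active committer / total commits).

    0.0 means one person wrote everything; values near 1.0 mean
    development is spread evenly across many committers.
    """
    total = sum(commits_per_committer)
    if total == 0:
        return 0.0
    return 1 - max(commits_per_committer) / total

# Commit counts from the table above: 20 committers, 365 commits total.
commits = [201, 50, 37, 27, 22, 6, 5, 3, 2, 2,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(round(development_distribution_score(commits), 3))  # → 0.449
```

With 201 of 365 commits from the top committer, 1 − 201/365 ≈ 0.449, matching the reported score.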

Issues and Pull Requests


All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0