blackjack_linear
Blackjack with linear action value function approximation
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (11.4%) to scientific vocabulary
Repository
Blackjack with linear action value function approximation
Basic Info
- Host: GitHub
- Owner: Dolores2333
- Language: Python
- Default Branch: main
- Size: 45.1 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Blackjack with Action Value Function Approximation
Author: ZHA Mengyue
Math Department of HKUST

Problem Statement
We study playing Blackjack by Action Value Function Approximation. Prediction methods used to update q-value function for option here are linear Monte Carlo, linear Q Learning and linear Temporal Difference. We also test the algorithm under different combination of (M, N). M is the number of decks and N denotes N-1 palyers with 1 dealer. For each configuration, we find the optimal policy after iterations. Outcomes of three pre diction methods are compared by visualization and tables.
This work is based on the previous work done for HW1. We will compare the outcomes with and without linear action value function approximation.
Since details in the game implementation are exactly the same with HW1. We omit them here and please refer to HW1 if you'd like to know more. Next we discuss the design of linear approximation.
x: is the feature of form (playerpoints, dealerpoints, player_action)
w: a linear weight of shape (3, )
For a call on value by (s, a) pair where s = playerpoints, dealerpoints, a = player_action
We have $$Q(s, a) = w^T x$$
The linear approximated Q value function $Q$ is updated by the superviesed learning.
Homework Statement
Assume that in the Blackjack game, there are $m$ decks of cards, and $n$ players (a dealer vs $n-1$ players). The rules of the game are explained above.
(1) Find the optimal policy for the Blackjack, when $m=\infty$ and $n=2$. You can use any of the methods learned so far in class (e.g. linear Monte Carlo, linear TD, or linear Q-Learning). If you use more than one method, do they reach the same optimal policy?
(2) Visualise the value functions and policy as done in Figures 5.1 and 5.2 in Sutton's book.
(3) Redo (1) for different combinations of (m,n), e.g. $m=6, 3, 1$, and $n=3,4,6$. What are differences?
Implementation
File Structure
Since the overal structure is the same with HW1. We will only talk about the new added functions.
We put most new functions in AVAF.py
- MC_linear(): update the Q value function by MC samples propagated by SGD
- QL_linear(): update the Q value function by QL samples propagated by SGD
- TD_liner(): update the Q value function by TD samples propagated by SGD
- Q_function(key, w): calculate the feature $x$ of the given key and return the Q value of for the given key by linear approximation $w^T x$
- calculateQsa(Qsa, w): use the current linear weight w to restore the Q value table Qsa for all keys in Q_sa.
Example
- Prepare the environment
shell
conda create -n Blackjack python=3.6
shell
conda activate Blackjack
Now your working environment is the Blackjack now. Let's install the necessary packages. We have listed all packages in requirement.txt
pip install -r requirement.txt
Now your environment should be fully ready.
- Experiments on a single Instance
The following code blocks plays the Blackjack with m=2 decks and n=3 people where 2 are players and one is the dealer.
python main.py --m=2, --n=3
Note that when $m=\infty$, we use --m=0 instead.
shell
python main.py --m=0, --n=2
- Experiments on instances of combinations of (m, n)
Also you can test the combinations of (m, n) pairs. For example, m= 6, 3, 1 and n= 3, 4, 6
shell
python main.py --m 6 3 1 --n 3 4 6
- Experiments on $m=\infty$
We use --m=0 infers to use infinite decks in the game instead.
- The optimal policy
We store the final Q-value function instead and the optimal poliy are derived from it by either best policy or epsilon greedy policy.
The value.csv are stored in thecorresponding instance folder as:
MCbestvalue.csv
MCepsilonvalue.csv
QLbestvalue.csv
etc.
Tabular Summary for the Experiments
Choices for policy update: policy=['best', 'epsilon']
Choices for policy evaluateion(value function update): update=['MC', 'QL', 'TD']
- best: best policy evaluation
- epsilon: epsilon greedy policy evaluation
- MC_linear: Monte Carlo with liner AVAF
- QL_linear: Q Learning with liner AVAF
- TD_linear: Temporal Difference with liner AVAF
We see that the models with linear approximation perform worse than non-linear approximation. This is because we use very simple feature that underfit the real action value function.
Single Instance of $m=\infty$, $n=2$
| m=$\infty$, n=2 | MClinear | QLlinear | TD_linear | | --------------------- | --------- | --------- | --------- | | best policy | 18.0500% | 24.2700% | 24.8100% | | epsilon greedy policy | 20.1400% | 28.6900% | 27.2700% |
Conclusions:
- epsilon greedy policy outperforms best policy
- With liner approximation, MC, QL, TD perform worse. This may because I did not do any feature engineering. The linear function is to simple and underfit the real Q value function.
Combination of m=[6, 3, 1], n=[3, 4, 6]
We summary the performance of (update_policy) combinations in the tables below.
| MC_best | n=3 | n=4 | n=6 | | ------- | -------- | -------- | -------- | | m=6 | 27.8000% | 28.5600% | 27.9080% | | m=3 | 28.0200% | 27.8667% | 29.2500% | | m=1 | 27.8500% | 28.7833% | 28.9440% |
| MC_epsilon | n=3 | n=4 | n=6 | | ---------- | -------- | -------- | -------- | | m=6 | 29.0600% | 28.2433% | 27.8420% | | m=3 | 28.8650% | 29.1967% | 28.5940% | | m=1 | 29.4400% | 28.4367% | 28.8680% |
| QL_best | n=3 | n=4 | n=6 | | ------- | -------- | -------- | -------- | | m=6 | 35.900% | 37.6067% | 36.8160% | | m=3 | 36.8250% | 36.4067% | 36.5960% | | m=1 | 36.9750% | 36.7433% | 36.9180% |
| QL_epsilon | n=3 | n=4 | n=6 | | ---------- | -------- | -------- | -------- | | m=6 | 36.6950% | 36.8100% | 36.6420% | | m=3 | 36.5900% | 35.4767% | 36.2400% | | m=1 | 36.5000% | 36.1300% | 36.3960% |
| TD_best | n=3 | n=4 | n=6 | | ------- | -------- | -------- | -------- | | m=6 | 36.1950% | 35.9667% | 36.4840% | | m=3 | 35.6800% | 36.1833% | 37.4880% | | m=1 | 36.0950% | 37.3200% | 36.9300% |
| TD_epsilon | n=3 | n=4 | n=6 | | ---------- | -------- | -------- | -------- | | m=6 | 37.1250% | 36.6633% | 37.3560% | | m=3 | 36.5000% | 37.5100% | 37.7860% | | m=1 | 36.8200% | 36.9600% | 36.9720% |
Conclusions
- epsilon greedy policy is slightly better than the best policy
- With linear approximation, MC becomes the worst one. This because when use liner appriximation actually the Q_true got by MC is not suitable for updating all keys been sampled the trajectory.
- For MC_best, the more players are in, the more chance they will win
- For MC_epsilon, fewer players is better.
- For QL_best, fewer decks is better.
- and QL_epsilon fewer players is slightly better.
- For TDbest and TDepsilon, more players is better
Hyperparameters
All settable hyperparameters except for $m$ and $n$ are assigned by the instance level config.json under the instance's folder.
Some hypperparameters has finite many choices and will be generated in the main.py when different instances are created. We will write these hyperparameters into the instance level config.json that inherited from the template config.json (under the INSTANCE folder).
- update: choices in ['MC', 'QL', 'TD']
- name: choices in the combination of form 'update-epsilon' or 'update-best' for policy being epsilon greedy policy and best policy respectively.
- policy: choices in ['epsilongreedypolicy', 'best_policy']
We also has some higher level hyperparameters that are assigned in the template config.json. Note that these hyperparameters are the same for all instances created by call main.py once. They are:
- epochs: number of iterations.
- n_zeros: a constant for determine the value of $\epsilon$ in epsilon greedy policy
- session: denotes how often we summay the performance of a given player in plot.py. For example, if session = 1000, we summary its wins losses and draws every 1000 actions.
- linear: bool type. whether use the linear action value function approxiamation.
- alpha: learning rate when linear is True.
- Remark: if bool(linear) is True then we will append '_linear' to the update.
Visualization
We illustrate the typical plots as examples and you want to see more, please visit the subfolder with path = STORAGE/INSTANCE/pic
Visualization on $m=\infty$, $n=2$
We only take the update=QL as example and you should refer to Blackjack/storage/m0n2/pic/ for outcomes for MC and TD
We see that the policy here is different from non-linear approximation. This is because we use very simple feature that underfit the real action value function.
Value Function Visualization
QLbestvalue visualization

QLepsilonvalue visualization

Remark
Since I forgot to add the labels for x-axis, y-axis and z-axis when doing the experiment, their position and labels are denoted by the following Pseudo Value Function Plot. All axes' arrangements in the figures of this repository follow the left-hand rule. You may refer to the following pic to identify the arrangement and meaning of the x, y, z axes.

Player Performance Visualization
Visualize MC
MCbestplayer_1 visualization

MCepsilonplayer1 vs. MCbestplayer1 visualization

We see clearly that under the update rule MC, the player with epsilon greedy policy performs consistently better than they player with the deterministic best policy. The outcome shows that expolration is important !!!
Compare MC, QL, TD and best, epsilon
We have the following conclusions by observing the player performance visualization on update=[MC, QL, TD] and policy=[best, epsilon]
epsilon greedy policy outperforms the best policy consistently no matter which update strategy we adopt.
For a fixed policy, the performances of update strategies are TD>QL>MC
The reason we guess is that since the TD style fit the requirements by linear action value function approximation better. Because in the setting of TD, supervised learning is easier and this improves the qualioty of linear approximated action value function.
Citation
If you use my Blackjack in any context, please cite this repository:
latex
@article{
ZHA2021:RL_Blackjack_linear,
title={Blackjack with Action Value Function Approximation},
author={ZHA Mengyue},
year={2021},
url={https://github.com/Dolores2333/Blackjack_linear}
}
This work is done by ZHA Mengyue for Homework2 in MATH6450I Reinforcement Learning lectured by Prof Bing-yi Jing in HKUST. Please cite the repository if you use the code and outcomes.
Owner
- Name: Dolores
- Login: Dolores2333
- Kind: user
- Location: Hong Kong
- Company: HKUST
- Repositories: 2
- Profile: https://github.com/Dolores2333
👉 👉 👉
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this repository, please cite it as below."
authors:
- family-names: Zha
given-names: Mengyue
orcid: https://orcid.org/0000-0003-0956-4550
title: "Reinforcement Learning for the Blackjack"
version: 1.0
date-released: 2021-10-8
GitHub Events
Total
Last Year
Dependencies
- matplotlib *
- numpy *
- pandas *
- tqdm *