[Reinforcement Learning] First-visit MC prediction

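When the environment's dynamics are unknown, we can repeatedly simulate episodes to collect sample data and use it to approximate the value function $v_{\pi}$ under a given policy.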
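For reference, the quantities being estimated can be written out as below. The symbols $G_t$, $N(s)$ and $V(s)$ are the usual textbook notation and are an assumption here, since the post itself only names $v_{\pi}$. The return is written without discounting, which matches how the code in this post sums rewards.

```latex
\begin{aligned}
G_t &= R_{t+1} + R_{t+2} + \dots + R_T, \\
v_{\pi}(s) &= \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right], \\
N(s) &\leftarrow N(s) + 1, \qquad
V(s) \leftarrow V(s) + \frac{1}{N(s)}\bigl(G_t - V(s)\bigr).
\end{aligned}
```

The last line is applied only at the first time step $t$ at which $s$ occurs in an episode, so $V(s)$ is the running average of first-visit returns.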

import gym
import numpy as np
from matplotlib import pyplot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from collections import defaultdict
from functools import partial
plt.style.use('ggplot')

# Note: 'Blackjack-v0' and the 4-tuple returned by env.step() match the older gym API.
env = gym.make('Blackjack-v0')


def sample_policy(observation):
    # Stick (0) when the player's sum is 20 or 21, otherwise hit (1).
    score, dealer_score, usable_ace = observation
    return 0 if score >= 20 else 1


def generate_episode(policy, env):
    # Lists for storing the states, actions, and rewards of one episode
    states, actions, rewards = [], [], []

    # Reset the gym environment to get the initial observation
    observation = env.reset()

    while True:
        # Record the current state
        states.append(observation)

        # Select an action with the policy that was passed in and record it
        action = policy(observation)
        actions.append(action)

        # Take the action, move to the next state, and record the reward
        observation, reward, done, info = env.step(action)
        rewards.append(reward)

        # Stop once a terminal state is reached
        if done:
            break

    return states, actions, rewards


def first_visit_mc_prediction(policy, env, n_episodes):
    # Value table and visit counts, both keyed by state
    value_table = defaultdict(float)
    N = defaultdict(int)

    for _ in range(n_episodes):
        # Generate an episode and keep its states and rewards
        states, _, rewards = generate_episode(policy, env)
        returns = 0

        # Walk the episode backwards, accumulating the undiscounted return from step t onward
        for t in range(len(states) - 1, -1, -1):
            R = rewards[t]
            S = states[t]
            returns += R

            # First-visit check: update only if S does not occur earlier in the episode,
            # then move the running average of returns for S toward this return
            if S not in states[:t]:
                N[S] += 1
                value_table[S] += (returns - value_table[S]) / N[S]

    return value_table


value = first_visit_mc_prediction(sample_policy, env, n_episodes=500000)

# Print the last ten entries without removing them; popitem() would delete
# states that the plot further down still needs.
for item in reversed(list(value.items())[-10:]):
    print(item)

((4, 1, False), -0.5786802030456852)
((14, 1, True), -0.43960396039603966)
((4, 9, False), -0.42211055276381915)
((13, 3, True), -0.22764227642276424)
((7, 3, False), -0.5780911062906736)
((12, 1, True), -0.4090909090909092)
((15, 8, True), -0.2540983606557379)
((4, 3, False), -0.534246575342466)
((4, 2, False), -0.48458149779735665)
((4, 8, False), -0.4603174603174603)
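To see what the first-visit check and the incremental average do without running Blackjack, here is a minimal sketch on two hand-made episodes. The function name `first_visit_update` and the toy states "A" and "B" are hypothetical and not part of the original code; the update rule itself mirrors the inner loop of `first_visit_mc_prediction`.

```python
from collections import defaultdict


def first_visit_update(value_table, counts, states, rewards):
    """Apply one episode of undiscounted first-visit MC updates in place."""
    returns = 0.0
    # Walk the episode backwards so `returns` is the return from step t onward.
    for t in range(len(states) - 1, -1, -1):
        returns += rewards[t]
        state = states[t]
        # Only the earliest occurrence of a state in the episode is counted.
        if state not in states[:t]:
            counts[state] += 1
            value_table[state] += (returns - value_table[state]) / counts[state]


value_table = defaultdict(float)
counts = defaultdict(int)

# Hypothetical episode: state "A" is visited at t=0 and again at t=2.
first_visit_update(value_table, counts, ["A", "B", "A"], [1.0, 0.0, 2.0])
# A second hypothetical episode that visits only "A".
first_visit_update(value_table, counts, ["A"], [-1.0])

print(dict(value_table))
print(dict(counts))
```

In the first episode the second visit to "A" is skipped, so "A" is credited once with the full return of 3 and "B" with 2. The second episode contributes a return of -1 for "A", and the running average becomes (3 + (-1)) / 2 = 1.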

def plot_blackjack(V, ax1, ax2):
    player_sum = np.arange(12, 21 + 1)
    dealer_show = np.arange(1, 10 + 1)
    usable_ace = np.array([False, True])

    # state_values[i, j, k] is the value for player_sum[i], dealer_show[j], usable_ace[k]
    state_values = np.zeros((len(player_sum), len(dealer_show), len(usable_ace)))
    for i, player in enumerate(player_sum):
        for j, dealer in enumerate(dealer_show):
            for k, ace in enumerate(usable_ace):
                # Convert NumPy scalars to plain Python types to match the dictionary keys
                state_values[i, j, k] = V[int(player), int(dealer), bool(ace)]

    # X[i, j] = dealer_show[j] and Y[i, j] = player_sum[i], so the grid lines up with
    # state_values[i, j] and with the axis labels set below
    X, Y = np.meshgrid(dealer_show, player_sum)

    ax1.plot_wireframe(X, Y, state_values[:, :, 0])
    ax2.plot_wireframe(X, Y, state_values[:, :, 1])

    for ax in ax1, ax2:
        ax.set_zlim(-1, 1)
        ax.set_ylabel('player sum')
        ax.set_xlabel('dealer showing')
        ax.set_zlabel('state-value')


fig, axes = pyplot.subplots(nrows=2, figsize=(5, 8),
                            subplot_kw={'projection': '3d'})
axes[0].set_title('value function without usable ace')
axes[1].set_title('value function with usable ace')
plot_blackjack(value, axes[0], axes[1])
plt.show()
