DQN-FlappyBird Project Study Notes

Blog: https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html

Code: https://github.com/yanpanlau/Keras-FlappyBird

I. Code Overview

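flappy_bird_utils.py mainly loads the image and audio assets the game needs. wrapped_flappy_bird.py provides the GameState class, which takes an action and returns the updated game state. qlearn.py takes the image input, preprocesses it, uses a convolutional neural network to choose an action, and trains the network with DQN. (In the blog's words: "The network will be trained millions of times, via an algorithm called Q-learning, to maximize the future expected reward.")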

II. Core Ideas

The most important piece of Q-learning is the Q-function; DQN replaces the Q-table with a neural network.

1. What is the Q-function

The Q-function Q(s, a) represents the maximum discounted future reward when we perform action a in state s. In other words, Q(s, a) estimates how good it is to choose action a in state s.

2. What the Q-function is for

Suppose you are in state s and need to decide between action a and action b. If you have this magical Q-function, the answer becomes really simple: pick the action with the highest Q-value! In other words, the policy is $\pi(s) = \arg\max_a Q(s, a)$.
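As a minimal sketch of this greedy policy, the snippet below picks the index of the largest Q-value. The Q-values and the meaning of each action index (0 = do nothing, 1 = flap) are made up for illustration and are not taken from the project.

```python
def greedy_action(q_values):
    """Return the index of the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for two actions (assumed: index 0 = do nothing, index 1 = flap).
q_values = [0.42, 0.17]
best = greedy_action(q_values)
print(best)  # 0
```

During training, a DQN agent usually mixes this greedy choice with occasional random actions so that it keeps exploring.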

3. How to obtain the Q-function

That is where Q-learning comes in.

3.1 Recursive form of the Q-function

$$R_t = r_t + \gamma R_{t+1}$$
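The recursion for the discounted return can be evaluated by walking an episode backwards. The sketch below assumes a finite episode whose return after the last step is zero; the reward list and the value of gamma are illustrative only.

```python
def discounted_returns(rewards, gamma):
    """Compute R_t = r_t + gamma * R_{t+1} for every step, working backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0  # assumed: the return after the final step is zero
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [0.0, 0.0, 1.0]  # made-up rewards: only the last step pays off
print(discounted_returns(rewards, gamma=0.5))  # [0.25, 0.5, 1.0]
```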

Recall the definition of Q-function (maximum discounted future reward if we choose action a in state s):

$$Q(s_t, a_t) = \max R_{t+1}$$

Therefore, we can rewrite the Q-function as follows:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

We can now use an iterative method to solve for the Q-function. Given a transition $(s, a, r, s')$, we convert this experience into a training example for the network: we want $Q(s, a)$ to equal $r + \gamma \max_{a'} Q(s', a')$. Finding the Q-values is now a regression task, with $r + \gamma \max_{a'} Q(s', a')$ as the target and $Q(s, a)$ as the prediction, so we can define the squared error (averaged over a minibatch, the mean squared error, MSE) as the loss function:

$$L = \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]^2$$

If L is getting smaller, the Q-function is converging toward the optimal value, which is our “strategy book”.
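To make the iterative idea concrete, here is a minimal tabular sketch (a dictionary instead of a neural network) that replays one terminal transition and records the squared error each time. The state names, GAMMA, and the learning rate ALPHA are assumptions for illustration; the project itself trains a network rather than updating a table.

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor (illustrative value)
ALPHA = 0.5  # learning rate (illustrative value)


def td_target(Q, r, s_next, actions, terminal):
    """Target r + gamma * max_a' Q(s', a'), or just r at a terminal state."""
    if terminal:
        return r
    return r + GAMMA * max(Q[(s_next, a)] for a in actions)


def q_update(Q, s, a, r, s_next, actions, terminal=False):
    """Move Q(s, a) toward the target; return the squared error before the update."""
    error = td_target(Q, r, s_next, actions, terminal) - Q[(s, a)]
    Q[(s, a)] += ALPHA * error
    return error ** 2


Q = defaultdict(float)  # unseen (state, action) pairs start at 0
actions = [0, 1]
# Replay the same made-up terminal transition five times.
losses = [q_update(Q, "s0", 1, 1.0, "s1", actions, terminal=True) for _ in range(5)]
print(losses)  # [1.0, 0.25, 0.0625, 0.015625, 0.00390625]
```

Each replay halves the remaining error, so the loss shrinks as Q("s0", 1) approaches the target of 1.0.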

3.2 Learning the Q-function with a neural network

The idea of DQN is to use a neural network to compress this Q-table into a set of parameters θ (called weights in a neural network). Instead of handling a large table, I only need to worry about the weights of the network. By tuning the weights well, I can approximate the optimal Q-function with standard neural network training algorithms.

$$Q(s, a) = f_\theta(s)$$

where f is our neural network with input s and weight parameters θ; its output contains one Q-value per action.
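As a rough illustration of a parameterised Q-function, the sketch below uses a tiny linear model as a stand-in for the network: it maps a state vector to one Q-value per action. The feature size, the initialisation, and the helper names make_params and f_theta are hypothetical; the actual project uses a convolutional network over stacked 80x80 frames.

```python
import random


def make_params(n_features, n_actions, seed=0):
    """Create a small random weight matrix theta with one row per action."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
            for _ in range(n_actions)]


def f_theta(theta, s):
    """Linear stand-in for the network: returns one Q-value per action."""
    return [sum(w * x for w, x in zip(row, s)) for row in theta]


theta = make_params(n_features=4, n_actions=2)
s = [1.0, 0.5, -0.5, 0.0]  # made-up state features
q = f_theta(theta, s)
print(len(q))  # 2
```

The table lookup Q[s][a] becomes f_theta(theta, s)[a], so memory depends on the number of weights rather than the number of states.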

The code below demonstrates how it works:

    if t > OBSERVE:
        # Sample a minibatch to train on
        minibatch = random.sample(D, BATCH)

        inputs = np.zeros((BATCH, s_t.shape[1], s_t.shape[2], s_t.shape[3]))  # 32, 80, 80, 4
        targets = np.zeros((inputs.shape[0], ACTIONS))                         # 32, 2

        # Now we do the experience replay
        for i in range(0, len(minibatch)):
            state_t = minibatch[i][0]
            action_t = minibatch[i][1]  # This is the action index
            reward_t = minibatch[i][2]
            state_t1 = minibatch[i][3]
            terminal = minibatch[i][4]

            inputs[i:i + 1] = state_t  # Save s_t as the network input

            targets[i] = model.predict(state_t)  # Predicted Q-value of each action
            Q_sa = model.predict(state_t1)

            # If the episode terminated, the target is just the reward
            if terminal:
                targets[i, action_t] = reward_t
            else:
                targets[i, action_t] = reward_t + GAMMA * np.max(Q_sa)

        loss += model.train_on_batch(inputs, targets)

    s_t = s_t1
    t = t + 1

4. Experience Replay

Experience replay means training on small random samples drawn from the experience pool rather than on the most recent samples, which mitigates the instability of neural network training. It was found that approximating Q-values with non-linear functions such as neural networks is not very stable. The most important trick for solving this problem is called experience replay. During gameplay, all the transitions $(s, a, r, s')$ are stored in a replay memory D (I use Python's deque to store it). When training the network, random mini-batches from the replay memory are used instead of the most recent transition, which greatly improves stability.
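A minimal sketch of such a replay memory, assuming a bounded deque and uniform random sampling, could look like this. The capacity, the batch size, the dummy transitions, and the helper names remember and sample_minibatch are illustrative and not taken from the project.

```python
import random
from collections import deque

REPLAY_MEMORY = 1000  # illustrative capacity
BATCH = 32            # illustrative minibatch size

# A bounded deque silently drops the oldest transition once it is full.
D = deque(maxlen=REPLAY_MEMORY)


def remember(D, s, a, r, s_next, terminal):
    """Store one transition (s, a, r, s', terminal) in the replay memory."""
    D.append((s, a, r, s_next, terminal))


def sample_minibatch(D, batch_size):
    """Draw a uniform random minibatch without replacement."""
    return random.sample(list(D), min(batch_size, len(D)))


# Fill the memory with dummy transitions; integers stand in for game frames.
for t in range(1500):
    remember(D, t, t % 2, 0.1, t + 1, False)

minibatch = sample_minibatch(D, BATCH)
print(len(D), len(minibatch))  # 1000 32
```

Because consecutive frames are highly correlated, sampling uniformly from a long history gives the network batches that look closer to independent samples.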

Training results reference: https://blog.csdn.net/Mr_BigG/article/details/113782135?utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EOPENSEARCH%7Edefault-5.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EOPENSEARCH%7Edefault-5.control

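Going from the start of training to near invincibility took that blogger almost 30 hours. After 50,000 training iterations the bird only flies straight up; at 200,000 it has a rough sense of direction and tries to get past the first pipe; at 300,000 it can mostly find the gap in the first pipe and tries to pass through it; at 400,000 it has a high chance of passing the first pipe and some chance of passing the second; at 1,000,000 it reaches the level of an average player and can reliably pass 5 to 8 pipes; at 2,000,000 it is nearly invincible.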
