Lect6_Value_Function_Approximation
Table of Contents
- Value Function Approximation
- Introduction
- Why is it needed?
- Types of Value Function Approximation
- Which Function Approximator?
- Incremental Methods
- Value Function Approx. by SGD
- Linear Function Approximation
- Incremental Prediction Algorithms
- Control with Value Function Approximation
- Action-Value Function Approximation
- Linear Action-Value Function Approximation
- Incremental Prediction Algorithms
- Convergence of Prediction Algorithms
- Convergence of Control Algorithms
- Batch Methods
Value Function Approximation
Introduction
Why is it needed?
- So far we have represented the value function by a lookup table
- Every state $s$ has an entry $V(s)$
- Or every state-action pair $(s, a)$ has an entry $Q(s,a)$
- Problem with large MDPs:
- There are too many states and/or actions to store in memory
- It’s too slow to learn the value of each state individually
Solution for large MDPs:
- Estimate value function with function approximation
$$
\begin{aligned}
\hat{v}(s,\mathbf{w}) &\approx v_\pi(s) \\
\text{or}\ \hat{q}(s,a,\mathbf{w}) &\approx q_\pi(s,a)
\end{aligned}
$$
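To make the contrast with a lookup table concrete, here is a minimal sketch of a parametric approximator $\hat{v}(s,\mathbf{w})$. The state-space size, the cosine feature map `features`, and the choice of 8 features are all assumptions made for illustration; they do not come from the lecture.

```python
import numpy as np

N_STATES = 1_000_000   # assumed size of the state space
N_FEATURES = 8         # assumed number of features, far fewer than states

def features(state: int) -> np.ndarray:
    """Hypothetical feature vector x(s): cosine (Fourier-style) features."""
    z = state / (N_STATES - 1)          # normalise the state index to [0, 1]
    k = np.arange(N_FEATURES)           # feature frequencies 0, 1, ..., 7
    return np.cos(np.pi * k * z)

def v_hat(state: int, w: np.ndarray) -> float:
    """Approximate value v_hat(s, w) = x(s)^T w (a linear approximator)."""
    return float(features(state) @ w)

# The whole value function is described by 8 numbers
# instead of one table entry per state.
w = np.zeros(N_FEATURES)
w[0] = 1.0
print(v_hat(0, w), v_hat(N_STATES - 1, w))
```

Because nearby states share similar features, changing one component of `w` changes the estimate for many states at once, which is how the approximator generalises to states it has not visited.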
Types of Value Function Approximation
Which Function Approximator?
There are many function approximators, listed below. Of these, we consider only the differentiable ones, such as linear combinations of features and neural networks:
- Linear combinations of features
- Neural network
- Decision tree
- Nearest neighbor
- Fourier / wavelet bases
- …
Incremental Methods
Value Function Approx. by SGD
Goal: find the parameter vector $\mathbf{w}$ that minimises the mean-squared error between the approximate value function $\hat{v}(s,\mathbf{w})$ and the true value function $v_\pi(s)$:
$$
\pmb{J}(\mathbf{w}) = \mathbb{E}_\pi \left[(v_\pi(S) - \hat{v}(S, \mathbf{w}))^2 \right] \tag{1}
$$
Gradient descent finds a local minimum:
$$
\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} \pmb{J}(\mathbf{w}) = \alpha {\color{red}{\mathbb{E}_\pi}} \left[(v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) \right]
$$
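The factor $\frac{1}{2}$ in the step is there to cancel the 2 produced by differentiating the square. A short worked derivation, assuming that $v_\pi(S)$ does not depend on $\mathbf{w}$ and that the gradient can be moved inside the expectation:

$$
\begin{aligned}
\nabla_{\mathbf{w}} \pmb{J}(\mathbf{w})
&= \mathbb{E}_\pi \left[ \nabla_{\mathbf{w}} (v_\pi(S) - \hat{v}(S, \mathbf{w}))^2 \right] \\
&= -2\, \mathbb{E}_\pi \left[ (v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w}) \right] \\
-\frac{1}{2} \alpha \nabla_{\mathbf{w}} \pmb{J}(\mathbf{w})
&= \alpha\, \mathbb{E}_\pi \left[ (v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w}) \right]
\end{aligned}
$$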
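Stochastic gradient descent samples the gradient instead of computing the expectation:
$$
\Delta \mathbf{w} = \alpha (v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})
$$
The expected value of this sampled update is equal to the full gradient update.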
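The sampled update $\Delta \mathbf{w} = \alpha (v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})$ can be sketched in a few lines. This is only an illustration under several assumptions: a toy 5-state problem, a hypothetical two-dimensional feature map, an "oracle" array `v_pi` holding the true values, and uniform state sampling standing in for the on-policy state distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 5
z = np.arange(n_states) / (n_states - 1)      # normalised state index in [0, 1]
X = np.column_stack([np.ones(n_states), z])   # hypothetical features x(s) = [1, z]
v_pi = 2.0 + 3.0 * z                          # assumed oracle true values v_pi(s)

alpha = 0.1                                   # step size
w = np.zeros(2)                               # parameter vector

for _ in range(5000):
    s = rng.integers(n_states)                # sample one state (uniform: an assumption)
    x = X[s]                                  # linear v_hat, so grad_w v_hat(s, w) = x(s)
    error = v_pi[s] - x @ w                   # v_pi(s) - v_hat(s, w)
    w += alpha * error * x                    # SGD step: alpha * error * gradient

print(w)
```

Since the assumed true values are exactly linear in the features, `w` approaches `[2, 3]`. In practice $v_\pi$ is not available, so the oracle value has to be replaced by some estimate of it.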