Paper Review: Bayesian Shrinkage towards Sharp Minimaxity
Motivation and Conclusion
Sparse normal mean model (in general $\epsilon \sim N(0, \sigma^2 I_n)$; here $\sigma^2 = 1$):
$$y = \theta + \epsilon, \quad \epsilon \sim N(0, I_n)$$
A general form of shrinkage prior:
$$\pi(\theta \mid \tau) = \prod_{i=1}^n \frac{1}{\tau}\,\pi_0\!\left( \frac{\theta_i}{\tau} \right), \quad \tau \sim \pi(\tau)$$
If $\pi_0$ is a mixture of Gaussians, the general form leads to global-local shrinkage:
$$\theta_i \sim N(0, \lambda_i^2\tau^2), \quad \lambda_i^2 \sim \pi(\lambda_i^2), \quad \tau \sim \pi(\tau)$$
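To make the setup concrete, here is a minimal NumPy sketch (not from the paper) that simulates the sparse normal-means model and takes one draw from a global-local prior, using half-Cauchy local scales (the horseshoe) as one common instance; the values of $n$, $s$, $\tau$, and the signal size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, s = 1000, 10      # illustrative: n means, s of them nonzero
tau = 0.01           # illustrative global shrinkage level

# Data from the sparse normal-means model y = theta* + eps, eps ~ N(0, I_n).
theta_star = np.zeros(n)
theta_star[:s] = 5.0                     # illustrative signal size
y = theta_star + rng.standard_normal(n)

# One draw from a global-local shrinkage prior: here lambda_i ~ half-Cauchy(0, 1)
# (the horseshoe), so theta_i | lambda_i, tau ~ N(0, lambda_i^2 * tau^2).
lam = np.abs(rng.standard_cauchy(n))
theta_draw = rng.standard_normal(n) * lam * tau
```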
Observations about the contraction rate
Let $\theta^*$ be the true parameter, $s$ the number of nonzero entries in $\theta^*$, and $r_n$ the contraction rate.
Frequentist:
$$\min_{\hat \theta} \max_{\theta^*} \left\| \hat \theta - \theta^* \right\| = \sqrt{(2+o(1))\, s\log \frac{n}{s}}$$
Bayesian:
- Dirichlet-Laplace prior: $r_n \asymp \sqrt{s\log \frac{n}{s}}$ when $\left\|\theta^*\right\| \le \sqrt{s}\,\log^2\frac{n}{s}$
- Horseshoe prior: $r_n = M_n\sqrt{s\log \frac{n}{s}}$ for any sequence $M_n \to \infty$
- In general, a polynomially decaying $\pi_0$ leads to a near-optimal rate; a quick numeric comparison of the sharp and near-optimal radii follows below.
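To see the distinction between near-optimality (an arbitrary diverging factor $M_n$) and sharp minimaxity (the constant pinned to $\sqrt{2}$), here is a quick numeric comparison; the sparsity sequence $s = \sqrt{n}$ and the choice $M_n = \log\log n$ are illustrative assumptions.

```python
import numpy as np

def sharp_radius(n, s):
    """Sharp minimax L2 radius sqrt((2 + o(1)) s log(n/s)), with the o(1) term dropped."""
    return np.sqrt(2 * s * np.log(n / s))

for n in [10**3, 10**5, 10**7]:
    s = int(np.sqrt(n))                                      # illustrative sparsity level
    sharp = sharp_radius(n, s)
    near = np.log(np.log(n)) * np.sqrt(s * np.log(n / s))    # M_n = loglog n, one slowly diverging choice
    print(f"n={n:>8}, s={s:>4}: sharp={sharp:8.2f}, near-optimal={near:8.2f}, ratio={near / sharp:.2f}")
```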
Further questions: How does the order of the polynomial decay of $\pi_0$ affect the contraction rate? And how should $\tau$ be chosen to achieve the (near-)optimal contraction rate given a polynomially decaying $\pi_0$?
Contribution of this paper:
- If the polynomial order $\alpha$ of $\pi_0$ satisfies $\alpha \approx 1$, then $r_n/\sqrt{2s\log\frac{n}{s}} \approx 1$ (Bayesian sharp minimaxity, Thm 2.1).
- Choosing $\tau$ requires knowledge of $s/n$, so the author proposes a Beta prior on $\tau$ to avoid relying on this unknown quantity.
Questions not covered
- How does $r_n$ change as $\alpha_n \to 1$?
- What about $\alpha = 1$ (where Thm 2.1 breaks down)?
- Beyond the contraction rate, how does $\alpha$ affect model selection?
- How does $\alpha$ affect the contraction rate in the linear regression setting?
Bayesian sharp minimaxity
Conditions are imposed on the model sparsity and on $\pi_0$.
For simplicity, $\tau$ is a deterministic value and the $\theta_i$'s are mutually independent.
Remark 1: Conditions on $\tau$:
- $\tau^{\alpha-1} \ge (s/n)^c\sqrt{\log (n/s)}$ for some $c \in (0, 1+w/2)$: $\tau$ cannot be too small, or $\theta$ will be over-shrunk.
- $\tau^{\alpha-1} \prec (s/n)^{\alpha}[\log (n/s)]^{\alpha}$: $\tau$ cannot be too large, or $\theta$ will be insufficiently shrunk.
- $\tau^{\alpha-1} \prec (s/n)^{\alpha}[\log (n/s)]^{(1+\alpha)/2}$: this is the condition for the $L_1$ contraction rate.
These conditions indicate $\alpha \in (1, 1+w/2)$, and $w$ should be as small as possible.
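As a sanity check that the lower and upper constraints of Remark 1 leave a non-empty window for $\tau^{\alpha-1}$, here is a minimal numeric sketch; all parameter values ($n$, $s$, $\alpha$, $w$, $c$) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

n, s = 10**6, 100
alpha, w = 1.05, 0.2     # alpha in (1, 1 + w/2), as the remark above suggests
c = 1.0                  # some c in (0, 1 + w/2)

lower = (s / n)**c * np.sqrt(np.log(n / s))      # tau^(alpha-1) must be at least this ...
upper = (s / n)**alpha * np.log(n / s)**alpha    # ... and of smaller order than this

print(f"lower={lower:.3e}, upper={upper:.3e}, window non-empty: {lower < upper}")
```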
Remark 2: $E^*$ denotes expectation under the true parameter $\theta^*$. The theoretical results indicate that the $L_2$ contraction rate is no greater than $O(\sqrt{s\log (n/s)})$ and the $L_1$ contraction rate is no greater than $O(s\sqrt{\log (n/s)})$.
Remark 3: Note that $\log(n/s) \prec (n/s)^c$ for any fixed $c > 0$. This observation leads to Corollary 2.1, which unifies (2.1) and (2.2).
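A quick numeric look at this fact for, say, $c = 0.1$: the polynomial eventually dominates the logarithm, although the crossover happens only at fairly large $n/s$.

```python
import numpy as np

# log(n/s) versus (n/s)^0.1: the polynomial wins eventually, but late.
for m in [10**2, 10**4, 10**8, 10**16]:   # m stands for n/s
    print(f"n/s={m:.0e}: log(n/s)={np.log(m):6.1f}, (n/s)^0.1={m**0.1:6.1f}")
```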
Remark 4: Corollary 2.1 indicates $\tau \asymp (s/n)^{c/(\alpha-1)}$. Select $c = \alpha + \delta$ for a very small $\delta > 0$, so a good choice is $\tau \asymp (s/n)^{(\alpha+\delta)/(\alpha-1)}$. However, $s$ is unknown; an alternative is $\tau \asymp (1/n)^{(\alpha+\delta)/(\alpha-1)}$. Theorem 2.2 studies the properties of this alternative.
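A quick sketch of how much smaller the $s$-free choice is than the oracle choice; the values of $\alpha$ and $\delta$ are illustrative assumptions.

```python
n, s = 10**6, 100
alpha, delta = 1.05, 0.01      # illustrative values
expo = (alpha + delta) / (alpha - 1)

tau_oracle = (s / n) ** expo   # requires knowing s
tau_safe = (1 / n) ** expo     # s-free alternative: smaller, hence extra shrinkage

print(f"exponent={expo:.1f}, oracle tau={tau_oracle:.3e}, s-free tau={tau_safe:.3e}")
```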
Remark 5: Conditions on $\tau$:
- $\tau^{\alpha-1} \ge (1/n)^c\sqrt{\log (n/s)}$ (the lower bound of Remark 1 with $s$ replaced by $1$)
- $\tau^{\alpha-1} \prec (s/n)^{\alpha}[\log (n/s)]^{(1+\alpha)/2}$
The theoretical results indicate the $L_2$ contraction rate is no greater than $O(\sqrt{s\log n})$ (sub-optimal) and the $L_1$ contraction rate is no greater than $O(s\sqrt{\log n})$. If $\log s \prec \log n$, the sub-optimal rate is asymptotically equivalent to the optimal one. If $s \asymp n^c$ with $c \in (0,1)$, the sub-optimal rate has the same order as the optimal one. If $\log s \sim \log n$, the sub-optimal rate is of strictly greater order.
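The gap between the sub-optimal and optimal radii is the factor $\sqrt{\log n / \log(n/s)}$; a quick sketch of the three regimes (the specific sparsity sequences are illustrative assumptions):

```python
import numpy as np

def gap(n, s):
    """Ratio sqrt(s log n) / sqrt(s log(n/s)) of the sub-optimal to the optimal radius."""
    return np.sqrt(np.log(n) / np.log(n / s))

n = 10**8
print("log s << log n (s ~ log n)  :", round(gap(n, int(np.log(n))), 2))  # -> 1
print("s ~ n^c with c = 0.5        :", round(gap(n, int(n**0.5)), 2))     # -> 1/sqrt(1-c), a constant
print("log s ~ log n (s = n/10)    :", round(gap(n, n // 10), 2))         # diverges as n grows
```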
Remark 6: The theorems above assume a deterministic $\tau$. Now consider a prior $\pi(\tau)$: it should concentrate near zero, but not too quickly, because $\pi(\tau)$ needs to assign some density around $(s/n)^{(\alpha+\delta)/(\alpha-1)}$. Theorem 3.1 provides sufficient conditions on $\pi(\tau)$ to guarantee (2.1) and (2.2).
Remark 7: The prior density of $\tau$ is split over three regions: near zero; from $(s/n)^{(1+w/2)/(\alpha-1)}$ to $(s/n)^{\alpha/(\alpha-1)}$; and above $(s/n)^{\alpha/(\alpha-1)}$. The mass of the first region dominates, the second region carries only a small amount of mass, and the mass of the third region is assumed to decay to zero.
Remark 8: A possible choice of $\pi(\tau)$ is a powered Beta (which may be multi-modal), i.e. $\tau \sim [\mathrm{Beta}(1,n)]^c$ with $c \in (\alpha/(\alpha-1),\ (1+w/2)/(\alpha-1))$.
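A minimal sketch of drawing $\tau$ from this prior; $\alpha$ and $w$ (and hence the admissible range for $c$) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10**4
alpha, w = 1.05, 0.2
c_lo = alpha / (alpha - 1)            # = 21 here
c_hi = (1 + w / 2) / (alpha - 1)      # = 22 here
c = 0.5 * (c_lo + c_hi)               # any c strictly inside (c_lo, c_hi)

# tau ~ [Beta(1, n)]^c: draw a Beta(1, n) variable and raise it to the power c.
tau = rng.beta(1.0, n, size=5) ** c
print(f"c in ({c_lo:.0f}, {c_hi:.0f}), using c={c:.1f}; draws: {tau}")
```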
Remark 9: Note that the restriction on $\theta^*$ is a technical assumption. Without it, it is still possible to achieve a sub-optimal rate; see Theorem 3.2.