Paper Review: Bayesian Shrinkage towards Sharp Minimaxity
Motivation and Conclusion
Sparse normal mean model (in general $\epsilon \sim N(0, \sigma^2 I_n)$; here $\sigma^2 = 1$):
$$y = \theta + \epsilon, \quad \epsilon \sim N(0, I_n)$$
A general form of shrinkage prior:
$$\pi(\theta \mid \tau) = \prod_{i=1}^n \frac{1}{\tau}\,\pi_0\!\left( \frac{\theta_i}{\tau} \right), \quad \tau \sim \pi(\tau)$$
If $\pi_0$ is a mixture of Gaussians, the general form leads to global-local shrinkage:
$$\theta_i \sim N(0, \lambda_i^2\tau^2), \quad \lambda_i^2 \sim \pi(\lambda_i^2), \quad \tau \sim \pi(\tau)$$
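To make the setup concrete, here is a minimal NumPy sketch (not from the paper) that simulates the sparse normal-means model and takes one draw from a global-local prior, using half-Cauchy local scales (the horseshoe) as one common instance; the values of $n$, $s$, $\tau$, and the signal size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, s = 1000, 10      # illustrative: n means, s of them nonzero
tau = 0.01           # illustrative global shrinkage level

# Data from the sparse normal-means model y = theta* + eps, eps ~ N(0, I_n).
theta_star = np.zeros(n)
theta_star[:s] = 5.0                     # illustrative signal size
y = theta_star + rng.standard_normal(n)

# One draw from a global-local shrinkage prior: here lambda_i ~ half-Cauchy(0, 1)
# (the horseshoe), so theta_i | lambda_i, tau ~ N(0, lambda_i^2 * tau^2).
lam = np.abs(rng.standard_cauchy(n))
theta_draw = rng.standard_normal(n) * lam * tau
```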
Observations about the contraction rate
Let $\theta^*$ be the true parameter, $s$ the number of nonzero entries in $\theta^*$, and $r_n$ the contraction rate.
Frequentist:
$$\min_{\hat \theta} \max_{\theta^*} \left\| \hat \theta - \theta^* \right\| = \sqrt{(2+o(1))\, s\log \frac{n}{s}}$$
Bayesian:
- Dirichlet-Laplace prior: $r_n \asymp \sqrt{s\log \frac{n}{s}}$ when $\left\|\theta^*\right\| \le \sqrt{s}\,\log^2\frac{n}{s}$
- Horseshoe prior: $r_n = M_n\sqrt{s\log \frac{n}{s}}$ for any sequence $M_n \to \infty$
- In general, a polynomially decaying $\pi_0$ leads to a near-optimal rate; a quick numeric comparison of the sharp and near-optimal radii follows below.
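To see the distinction between near-optimality (an arbitrary diverging factor $M_n$) and sharp minimaxity (the constant pinned to $\sqrt{2}$), here is a quick numeric comparison; the sparsity sequence $s = \sqrt{n}$ and the choice $M_n = \log\log n$ are illustrative assumptions.

```python
import numpy as np

def sharp_radius(n, s):
    """Sharp minimax L2 radius sqrt((2 + o(1)) s log(n/s)), with the o(1) term dropped."""
    return np.sqrt(2 * s * np.log(n / s))

for n in [10**3, 10**5, 10**7]:
    s = int(np.sqrt(n))                                      # illustrative sparsity level
    sharp = sharp_radius(n, s)
    near = np.log(np.log(n)) * np.sqrt(s * np.log(n / s))    # M_n = loglog n, one slowly diverging choice
    print(f"n={n:>8}, s={s:>4}: sharp={sharp:8.2f}, near-optimal={near:8.2f}, ratio={near / sharp:.2f}")
```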
Further questions: How does the order of the polynomial decay of $\pi_0$ affect the contraction rate? And how should $\tau$ be chosen to achieve the (near-)optimal contraction rate given a polynomially decaying $\pi_0$?
Contribution of this paper:
- If the polynomial order $\alpha$ of $\pi_0$ satisfies $\alpha \approx 1$, then $r_n/\sqrt{2s\log\frac{n}{s}} \approx 1$ (Bayesian sharp minimaxity, Thm 2.1).
- Choosing $\tau$ requires knowledge of $s/n$, so the author proposes a Beta prior on $\tau$ to avoid relying on this unknown quantity.
Questions not covered
- How does $r_n$ change as $\alpha_n \to 1$?
- What about $\alpha = 1$ (where Thm 2.1 breaks down)?
- Beyond the contraction rate, how does $\alpha$ affect model selection?
- How does $\alpha$ affect the contraction rate in the linear regression setting?
Bayesian sharp minimaxity
Conditions are imposed on the model sparsity and on $\pi_0$.
For simplicity, $\tau$ is a deterministic value and the $\theta_i$'s are mutually independent.
Remark 1: Conditions on $\tau$:
- $\tau^{\alpha-1} \ge (s/n)^c\sqrt{\log (n/s)}$ for some $c \in (0, 1+w/2)$: $\tau$ cannot be too small, or $\theta$ will be over-shrunk.
- $\tau^{\alpha-1} \prec (s/n)^{\alpha}[\log (n/s)]^{\alpha}$: $\tau$ cannot be too large, or $\theta$ will be insufficiently shrunk.
- $\tau^{\alpha-1} \prec (s/n)^{\alpha}[\log (n/s)]^{(1+\alpha)/2}$: this is the condition for the $L_1$ contraction rate.
These conditions indicate $\alpha \in (1, 1+w/2)$, and $w$ should be as small as possible.
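As a sanity check that the lower and upper constraints of Remark 1 leave a non-empty window for $\tau^{\alpha-1}$, here is a minimal numeric sketch; all parameter values ($n$, $s$, $\alpha$, $w$, $c$) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

n, s = 10**6, 100
alpha, w = 1.05, 0.2     # alpha in (1, 1 + w/2), as the remark above suggests
c = 1.0                  # some c in (0, 1 + w/2)

lower = (s / n)**c * np.sqrt(np.log(n / s))      # tau^(alpha-1) must be at least this ...
upper = (s / n)**alpha * np.log(n / s)**alpha    # ... and of smaller order than this

print(f"lower={lower:.3e}, upper={upper:.3e}, window non-empty: {lower < upper}")
```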
Remark 2: $E^*$ denotes expectation under the true parameter $\theta^*$. The theoretical results indicate that the $L_2$ contraction rate is no greater than $O(\sqrt{s\log (n/s)})$ and the $L_1$ contraction rate is no greater than $O(s\sqrt{\log (n/s)})$.
Remark 3: Note that $\log(n/s) \prec (n/s)^c$ for any fixed $c > 0$. This observation leads to Corollary 2.1, which unifies (2.1) and (2.2).
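A quick numeric look at this fact for, say, $c = 0.1$: the polynomial eventually dominates the logarithm, although the crossover happens only at fairly large $n/s$.

```python
import numpy as np

# log(n/s) versus (n/s)^0.1: the polynomial wins eventually, but late.
for m in [10**2, 10**4, 10**8, 10**16]:   # m stands for n/s
    print(f"n/s={m:.0e}: log(n/s)={np.log(m):6.1f}, (n/s)^0.1={m**0.1:6.1f}")
```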
Remark 4: Corollary 2.1 indicates $\tau \asymp (s/n)^{c/(\alpha-1)}$. Select $c = \alpha + \delta$ for a very small $\delta > 0$, so a good choice is $\tau \asymp (s/n)^{(\alpha+\delta)/(\alpha-1)}$. However, $s$ is unknown; an alternative is $\tau \asymp (1/n)^{(\alpha+\delta)/(\alpha-1)}$. Theorem 2.2 studies the properties of this alternative.
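A quick sketch of how much smaller the $s$-free choice is than the oracle choice; the values of $\alpha$ and $\delta$ are illustrative assumptions.

```python
n, s = 10**6, 100
alpha, delta = 1.05, 0.01      # illustrative values
expo = (alpha + delta) / (alpha - 1)

tau_oracle = (s / n) ** expo   # requires knowing s
tau_safe = (1 / n) ** expo     # s-free alternative: smaller, hence extra shrinkage

print(f"exponent={expo:.1f}, oracle tau={tau_oracle:.3e}, s-free tau={tau_safe:.3e}")
```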
Remark 5: Conditions on $\tau$:
- $\tau^{\alpha-1} \ge (1/n)^c\sqrt{\log (n/s)}$ (the lower bound of Remark 1 with $s$ replaced by $1$)
- $\tau^{\alpha-1} \prec (s/n)^{\alpha}[\log (n/s)]^{(1+\alpha)/2}$
The theoretical results indicate the $L_2$ contraction rate is no greater than $O(\sqrt{s\log n})$ (sub-optimal) and the $L_1$ contraction rate is no greater than $O(s\sqrt{\log n})$. If $\log s \prec \log n$, the sub-optimal rate is asymptotically equivalent to the optimal one. If $s \asymp n^c$ with $c \in (0,1)$, the sub-optimal rate has the same order as the optimal one. If $\log s \sim \log n$, the sub-optimal rate is of strictly greater order.
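The gap between the sub-optimal and optimal radii is the factor $\sqrt{\log n / \log(n/s)}$; a quick sketch of the three regimes (the specific sparsity sequences are illustrative assumptions):

```python
import numpy as np

def gap(n, s):
    """Ratio sqrt(s log n) / sqrt(s log(n/s)) of the sub-optimal to the optimal radius."""
    return np.sqrt(np.log(n) / np.log(n / s))

n = 10**8
print("log s << log n (s ~ log n)  :", round(gap(n, int(np.log(n))), 2))  # -> 1
print("s ~ n^c with c = 0.5        :", round(gap(n, int(n**0.5)), 2))     # -> 1/sqrt(1-c), a constant
print("log s ~ log n (s = n/10)    :", round(gap(n, n // 10), 2))         # diverges as n grows
```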
Remark 6: The theorems above assume a deterministic $\tau$. Now consider a prior $\pi(\tau)$: it should concentrate near zero, but not too quickly, because $\pi(\tau)$ needs to assign some density around $(s/n)^{(\alpha+\delta)/(\alpha-1)}$. Theorem 3.1 provides sufficient conditions on $\pi(\tau)$ to guarantee (2.1) and (2.2).
Remark 7: The prior density of $\tau$ is split over three regions: near zero; from $(s/n)^{(1+w/2)/(\alpha-1)}$ to $(s/n)^{\alpha/(\alpha-1)}$; and above $(s/n)^{\alpha/(\alpha-1)}$. The mass of the first region dominates, the second region carries only a small amount of mass, and the mass of the third region is assumed to decay to zero.
Remark 8: A possible choice of $\pi(\tau)$ is a powered Beta (which may be multi-modal), i.e. $\tau \sim [\mathrm{Beta}(1,n)]^c$ with $c \in (\alpha/(\alpha-1),\ (1+w/2)/(\alpha-1))$.
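A minimal sketch of drawing $\tau$ from this prior; $\alpha$ and $w$ (and hence the admissible range for $c$) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10**4
alpha, w = 1.05, 0.2
c_lo = alpha / (alpha - 1)            # = 21 here
c_hi = (1 + w / 2) / (alpha - 1)      # = 22 here
c = 0.5 * (c_lo + c_hi)               # any c strictly inside (c_lo, c_hi)

# tau ~ [Beta(1, n)]^c: draw a Beta(1, n) variable and raise it to the power c.
tau = rng.beta(1.0, n, size=5) ** c
print(f"c in ({c_lo:.0f}, {c_hi:.0f}), using c={c:.1f}; draws: {tau}")
```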
Remark 9: Note that the restriction on $\theta^*$ is a technical assumption. Without it, it is still possible to achieve a sub-optimal rate; see Theorem 3.2.