Some phrasings:

a vector of parameters $\theta=(\theta_1,\theta_2,\dots,\theta_k)$, and we want to estimate the joint posterior distribution $p(\theta|X)$.

Radial Basis Function Network

linear aggregation of distance-based similarities using k-means clustering for prototype finding

A collection of similarities, and a linear combination (linear aggregation) of those similarities.

These similarities are in turn distance-based, and the distances are measured against a set of prototypes. How do we find those prototypes? With k-means clustering, as sketched below.
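
A minimal sketch of this pipeline, assuming scikit-learn; the toy data, the number of prototypes, and the Gaussian width gamma are illustrative choices, not prescribed by the text:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = np.sin(3 * X[:, 0]) + X[:, 1]                      # toy regression target

# Step 1: find prototypes (representatives) with k-means
centers = KMeans(n_clusters=10, random_state=0).fit(X).cluster_centers_

# Step 2: distance-based similarities (Gaussian RBF features)
gamma = 2.0
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
Phi = np.exp(-gamma * dists**2)

# Step 3: linear aggregation (linear combination) of the similarities
model = LinearRegression().fit(Phi, y)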

Stacked denoising Autoencoder

The SdA is an MLP (Multi-Layer Perceptron) in which the weights of each intermediate layer are shared with a corresponding denoising autoencoder.

PCA and Autoencoder

Both PCA and autoencoders can perform dimensionality reduction (i.e., find a different subspace).

PCA is restricted to a linear map, while autoencoders can have nonlinear encoders/decoders.

PCA is a method that assumes a linear system, whereas autoencoders (AEs) do not.

One thing to note is that the hidden layer in an AE can have greater dimensionality than the input. In such cases the AE may not be doing dimensionality reduction; instead we view it as a transformation from one feature space to another, in which the data in the new feature space disentangles factors of variation.
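
A rough illustration of the contrast, assuming scikit-learn; here an MLPRegressor trained to reproduce its own input stands in for a small nonlinear autoencoder, and the layer sizes are arbitrary:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(500, 10)

# PCA: a linear map onto a 2-D subspace
Z_pca = PCA(n_components=2).fit_transform(X)

# A tiny "autoencoder": a 2-unit tanh bottleneck trained to reconstruct its input
ae = MLPRegressor(hidden_layer_sizes=(2,), activation='tanh', max_iter=2000)
ae.fit(X, X)            # target = input, so the hidden layer learns a nonlinear code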

Monte Carlo

use Monte Carlo simulation to determine the correct answer.
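
A classic toy illustration (not from the original text): estimating π by drawing uniform random points and counting the fraction that fall inside the unit circle:

import numpy as np

n = 1_000_000
pts = np.random.rand(n, 2)                 # uniform points in the unit square
inside = (pts**2).sum(axis=1) < 1          # points inside the quarter circle
pi_estimate = 4 * inside.mean()
print(pi_estimate)                         # ≈ 3.14; Monte Carlo error shrinks like 1/sqrt(n)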

the CNN in Theano

The convolution operation is the main workhorse for implementing a convolutional layer in Theano.
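
A minimal sketch of that operation, assuming Theano is installed; the batch and filter shapes below are arbitrary:

import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d

x = T.tensor4('x')                                  # (batch, channels, rows, cols)
W = theano.shared(np.random.randn(2, 1, 3, 3).astype(theano.config.floatX), name='W')
f = theano.function([x], conv2d(x, W))              # 'valid' convolution by default

img = np.random.rand(1, 1, 8, 8).astype(theano.config.floatX)
print(f(img).shape)                                 # (1, 2, 6, 6)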

The essence of neural networks

A neural network is, in essence, a language (and the ultimate purpose of a language is expression): we use it to express our understanding of the application problem. For example, we use convolutional layers to express spatial correlation and RNNs to express temporal continuity.

dropout

Heuristically, when we drop out different sets of neurons, it’s rather like we’re training different neural networks (each with a different set of hidden neurons). And so the dropout procedure is like averaging the effects of a very large number of different networks. The different networks will overfit in different ways, and so, hopefully, the net effect of dropout will be to reduce overfitting.
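
A minimal numpy sketch of (inverted) dropout at training time; the keep probability and layer shape are illustrative:

import numpy as np

def dropout(activations, p_keep=0.5, training=True):
    # Randomly zero units and rescale so the expected activation is unchanged
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < p_keep) / p_keep
    return activations * mask

h = np.random.rand(4, 10)              # a batch of hidden-layer activations
h_dropped = dropout(h, p_keep=0.8)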

LeCun and CNNs

Although Yann LeCun et al. had already proposed CNNs by 1993, their performance on vision tasks remained underwhelming at the time. In 2006, Geoffrey Hinton et al. proposed layer-wise pretraining with Deep Belief Networks and achieved a breakthrough in performance; this, together with the Deep Boltzmann Machine later proposed by Ruslan Salakhutdinov, rekindled the vision community's enthusiasm for neural networks and Boltzmann machines.

Maximum likelihood (MLE) and squared error

$$
p(x_i,y_i,e_i\mid\theta)\propto\exp\left(-\frac{1}{2e_i^2}\bigl(y_i-\hat{y}(x_i\mid\theta)\bigr)^2\right)\\
\arg\max\;\log\mathcal{L}(D\mid\theta)=\mathrm{const}-\sum_{i=1}^n\frac{1}{2e_i^2}\bigl(y_i-\hat{y}(x_i\mid\theta)\bigr)^2\\
\arg\min\;\mathcal{L}=\sum_{i=1}^n\frac{1}{2e_i^2}\bigl(y_i-\hat{y}(x_i\mid\theta)\bigr)^2
$$

In the second line $\mathcal{L}$ denotes the (log-)likelihood; in the third line $\mathcal{L}$ denotes the loss.
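
A minimal sketch of minimizing this weighted squared loss with scipy, assuming a straight-line model $\hat{y}(x\mid\theta)=\theta_0+\theta_1 x$ and made-up data and errors:

import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(42)
x = np.linspace(0, 10, 20)
e = 0.5 + 0.5 * rng.rand(20)                    # per-point measurement errors e_i
y = 1.0 + 2.0 * x + e * rng.randn(20)           # data generated around a true line

def squared_loss(theta):
    y_hat = theta[0] + theta[1] * x
    return np.sum((y - y_hat)**2 / (2 * e**2))

theta_mle = minimize(squared_loss, x0=[0.0, 1.0]).x
print(theta_mle)                                # ≈ [1, 2] up to noise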

The squared loss is extremely sensitive to noise

Consider a linear-fit plot: with the squared loss, a few outliers are enough to drag the fitted line away from the bulk of the data.

Loss functions: from the squared loss to the Huber loss

The variety of possible loss functions is quite literally infinite, but one relatively well-motivated option is the Huber loss.

The Huber loss is equivalent to the squared loss for points which are well-fit by the model, but reduces the loss contribution of outliers.
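
A minimal numpy sketch of the Huber loss on a residual r; the transition width delta is an arbitrary choice here:

import numpy as np

def huber_loss(r, delta=1.0):
    # Quadratic for small residuals (like squared loss), linear for large ones (down-weights outliers)
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * (np.abs(r) - 0.5 * delta))

print(huber_loss([0.1, 5.0]))   # the small residual is penalized quadratically, the large one only linearly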

ridge regression & lasso

Ridge regression and the lasso are regularized versions of least-squares regression, using $\ell_2$ and $\ell_1$ penalties respectively on the coefficient vector.
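
A minimal scikit-learn sketch; the regularization strengths alpha are illustrative:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)     # only two informative features

ridge = Ridge(alpha=1.0).fit(X, y)    # l2 penalty: shrinks all coefficients smoothly
lasso = Lasso(alpha=0.1).fit(X, y)    # l1 penalty: drives many coefficients exactly to zero
print((lasso.coef_ != 0).sum(), "nonzero lasso coefficients out of", X.shape[1])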

Addressing the outliers

Frequentist Correction for Outliers: Huber loss

A Bayesian Approach to Outliers: Nuisance Parameters.

From a single model to a mixture

The Bayesian approach to accounting for outliers generally involves modifying the model so that the outliers are accounted for. For this data, it is abundantly clear that a simple straight line is not a good fit to our data. So let’s propose a more complicated model that has the flexibility to account for outliers. One option is to choose a mixture between a signal and a background:

$$
\begin{split}
P(\{x_i\},\{y_i\},\{e_i\}\mid\theta,\{g_i\},\sigma_B)=\;&\frac{g_i}{\sqrt{2\pi e_i^2}}\exp\left[-\frac{\bigl(\hat{y}(x_i\mid\theta)-y_i\bigr)^2}{2e_i^2}\right]\\
&+\frac{1-g_i}{\sqrt{2\pi\sigma_B^2}}\exp\left(-\frac{\bigl(\hat{y}(x_i\mid\theta)-y_i\bigr)^2}{2\sigma_B^2}\right)
\end{split}
$$
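
A sketch of this mixture likelihood in numpy, assuming a straight-line model $\hat{y}(x\mid\theta)=\theta_0+\theta_1 x$ and a fixed background width sigma_B (params packs theta followed by the g_i):

import numpy as np

def log_likelihood(params, x, y, e, sigma_B=50.0):
    theta = params[:2]
    g = np.clip(params[2:], 1e-10, 1 - 1e-10)          # per-point "good data" weights
    dy = y - (theta[0] + theta[1] * x)
    logL_in = -0.5 * np.log(2 * np.pi * e**2) - dy**2 / (2 * e**2)                # foreground component
    logL_out = -0.5 * np.log(2 * np.pi * sigma_B**2) - dy**2 / (2 * sigma_B**2)   # background component
    # g_i * N(dy; 0, e_i) + (1 - g_i) * N(dy; 0, sigma_B), summed over points in log space
    return np.sum(np.logaddexp(np.log(g) + logL_in, np.log(1 - g) + logL_out))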

Handling the nuisance parameters

Our model is much more complicated now: it has 22 parameters rather than 2, but the majority of these can be considered nuisance parameters, which can be marginalized out in the end, just as we marginalized (integrated) over p in the Billiard example.

we’ll use the emcee package to explore the parameter space.
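
A minimal sketch of the emcee API (emcee 3.x assumed), shown on a toy 2-D Gaussian log-posterior; the real model would substitute the mixture log-likelihood above and use 22 dimensions:

import numpy as np
import emcee

def log_prob(theta):
    return -0.5 * np.sum(theta**2)            # toy log-posterior: standard 2-D Gaussian

ndim, nwalkers = 2, 32
p0 = np.random.randn(nwalkers, ndim)          # starting positions of the walkers
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 2000)
samples = sampler.get_chain(discard=200, flat=True)   # drop burn-in, flatten over walkers
print(samples.shape)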

Convolution

Convolution is a form of filtering;
filtering is a layer of feature extraction;
feature extraction is template matching.

convolution ⇒ filtering ⇒ feature extraction ⇒ template matching
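
A small scipy sketch of "convolution as filtering / template matching": sliding an edge-detection template over an image yields a feature map; the Sobel filter and random image here are illustrative:

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(8, 8)
sobel_x = np.array([[1, 0, -1],
                    [2, 0, -2],
                    [1, 0, -1]])              # an edge-detection "template"
feature_map = convolve2d(image, sobel_x, mode='valid')
print(feature_map.shape)                      # (6, 6): one response per template position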

parameter space

we’ll run the MCMC sampler to explore the parameter space.

SVMs vs ANNs

The development of ANNs followed a heuristic path, with applications and extensive experimentation preceding theory. In contrast, the development of SVMs involved sound theory first, then implementation and experiments.

A significant advantage of SVMs is that whilst ANNs can suffer from multiple local minima, the solution to an SVM is global and unique. Two more advantages of SVMs are that they have a simple geometric interpretation and give a sparse solution. Unlike ANNs, the computational complexity of SVMs does not depend on the dimensionality of the input space.
ANNs use empirical risk minimization, whilst SVMs use structural risk minimization. The reason that SVMs often outperform ANNs in practice is that they deal with the biggest problem with ANNs: SVMs are less prone to overfitting.

uninformative: flat vs informative: peaked

uninformative (flat) vs. informative (peaked); maximum entropy?

The prior distribution may be relatively uninformative (i.e., flatter) or informative (i.e., more peaked).

For example, a Beta distribution with $a, b = 10, 10$: $f(x;a,b)=\frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}$.

import numpy as np
import matplotlib.pyplot as plt
import scipy.special as ss   # provides the beta function B(a, b)

a, b = 10, 10
x = np.arange(0, 1, .001)
plt.plot(x, x**(a-1)*(1-x)**(b-1)/ss.beta(a, b), c='r', lw=2)
plt.axvline((a-1)/(a+b-2), ls='--', c='r', alpha=.4)   # mode of the Beta(a, b) density
plt.xlim([0, 1])
plt.show()

Visualization: trace plots

Trace plots are often used to informally assess stochastic convergence (of MCMC sampling). Rigorous demonstration of convergence is an unsolved problem, but simple ideas such as running multiple chains and checking that they are converging to similar distributions are often employed in practice.

Trace plots:

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

def target(prior, lik, n, theta, k):
    # Unnormalized posterior: prior density times binomial likelihood of k heads in n tosses
    return 0 if theta < 0 or theta > 1 else prior.pdf(theta) * lik(n, theta).pmf(k)

def mcmc_coin(niters, prior, lik, n, theta, k, sigma):
    samples = np.zeros(niters + 1)
    samples[0] = theta
    for i in range(niters):
        theta_p = theta + st.norm(0, sigma).rvs()          # Gaussian random-walk proposal
        rho = min(target(prior, lik, n, theta_p, k) / target(prior, lik, n, theta, k), 1)
        u = st.uniform(0, 1).rvs()
        if u < rho:
            theta = theta_p
        samples[i + 1] = theta
    return samples

n, k = 100, 61          # tosses and observed heads
a, b = 10, 10           # Beta prior parameters
prior = st.beta(a, b)
lik = st.binom
niters = 100
sigma = .05
sampless = [mcmc_coin(niters, prior, lik, n, theta, k, sigma)
            for theta in np.arange(.1, 1, .2)]
for samples in sampless:
    plt.plot(samples, '-o')
plt.show()

Sampling

  • Understanding burn-in

    Once the Markov chain has converged, the samples it yields are samples from $p(x,y)$ (the two-dimensional target); the stage before convergence is called the burn-in period.

  • Gibbs sampling

    Gibbs sampling is a type of random walk through parameter space, and hence can be thought of as a Metropolis-Hastings algorithm with a special proposal distribution; a minimal sketch follows this list.
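
A minimal sketch of a Gibbs sampler for a bivariate normal with correlation rho (the target distribution and rho are illustrative assumptions); note how an initial burn-in segment is discarded:

import numpy as np

def gibbs_bivariate_normal(niters=5000, rho=0.8):
    # Alternately draw x | y and y | x from their (normal) full conditionals
    samples = np.zeros((niters, 2))
    x, y = 0.0, 0.0
    for i in range(niters):
        x = np.random.normal(rho * y, np.sqrt(1 - rho**2))
        y = np.random.normal(rho * x, np.sqrt(1 - rho**2))
        samples[i] = x, y
    return samples

samples = gibbs_bivariate_normal()
burn_in = 500
print(np.corrcoef(samples[burn_in:].T))   # empirical correlation ≈ rho after discarding burn-in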

the idea of Deep Learning

The idea of deep learning is to discover multiple levels of representation, with the hope that higher-level features represent more abstract semantics of the data. Such abstract representations learned from a deep network are expected to provide more invariance to intra-class variability.

Parameter tuning

Learning a network useful for classification critically depends on expertise in parameter tuning and some ad hoc tricks.
