Some phrasings:

a vector of parameters $\theta=(\theta_1,\theta_2,\dots,\theta_k)$, and we want to estimate the joint posterior distribution $p(\theta|X)$.

Radial Basis Function Network

linear aggregation of distance-based similarities using k-means clustering for prototype finding

A collection of similarities, and a linear combination (linear aggregation) of those similarities.

These similarities are in turn distance-based, and the distances are measured against a set of prototypes. How do we find those prototypes? With k-means clustering, as sketched below.
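
A minimal sketch of this pipeline, assuming scikit-learn; the toy data, the number of prototypes, and the Gaussian width gamma are illustrative choices, not prescribed by the text:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = np.sin(3 * X[:, 0]) + X[:, 1]                      # toy regression target

# Step 1: find prototypes (representatives) with k-means
centers = KMeans(n_clusters=10, random_state=0).fit(X).cluster_centers_

# Step 2: distance-based similarities (Gaussian RBF features)
gamma = 2.0
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
Phi = np.exp(-gamma * dists**2)

# Step 3: linear aggregation (linear combination) of the similarities
model = LinearRegression().fit(Phi, y)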

Stacked denoising Autoencoder

The SdA is an MLP (Multi-Layer Perceptron) in which the weights of each intermediate layer are shared with a corresponding denoising autoencoder.

PCA and Autoencoder

Both PCA and autoencoders can perform dimensionality reduction (i.e., find a different subspace).

PCA is restricted to a linear map, while autoencoders can have nonlinear encoders/decoders.

PCA is a method that assumes a linear system, whereas autoencoders (AEs) do not.

One thing to note is that the hidden layer in an AE can have greater dimensionality than the input. In such cases the AE may not be doing dimensionality reduction; instead we view it as a transformation from one feature space to another, in which the data in the new feature space disentangles factors of variation.
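
A rough illustration of the contrast, assuming scikit-learn; here an MLPRegressor trained to reproduce its own input stands in for a small nonlinear autoencoder, and the layer sizes are arbitrary:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(500, 10)

# PCA: a linear map onto a 2-D subspace
Z_pca = PCA(n_components=2).fit_transform(X)

# A tiny "autoencoder": a 2-unit tanh bottleneck trained to reconstruct its input
ae = MLPRegressor(hidden_layer_sizes=(2,), activation='tanh', max_iter=2000)
ae.fit(X, X)            # target = input, so the hidden layer learns a nonlinear code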

Monte Carlo

use Monte Carlo simulation to determine the correct answer.
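
A classic toy illustration (not from the original text): estimating π by drawing uniform random points and counting the fraction that fall inside the unit circle:

import numpy as np

n = 1_000_000
pts = np.random.rand(n, 2)                 # uniform points in the unit square
inside = (pts**2).sum(axis=1) < 1          # points inside the quarter circle
pi_estimate = 4 * inside.mean()
print(pi_estimate)                         # ≈ 3.14; Monte Carlo error shrinks like 1/sqrt(n)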

the CNN in Theano

The convolution operation is the main workhorse for implementing a convolutional layer in Theano.
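
A minimal sketch of that operation, assuming Theano is installed; the batch and filter shapes below are arbitrary:

import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d

x = T.tensor4('x')                                  # (batch, channels, rows, cols)
W = theano.shared(np.random.randn(2, 1, 3, 3).astype(theano.config.floatX), name='W')
f = theano.function([x], conv2d(x, W))              # 'valid' convolution by default

img = np.random.rand(1, 1, 8, 8).astype(theano.config.floatX)
print(f(img).shape)                                 # (1, 2, 6, 6)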

The essence of neural networks

A neural network is, in essence, a language (and the ultimate purpose of a language is expression): we use it to express our understanding of the application problem. For example, we use convolutional layers to express spatial correlation and RNNs to express temporal continuity.

dropout

Heuristically, when we drop out different sets of neurons, it’s rather like we’re training different neural networks (each with a different set of hidden neurons). And so the dropout procedure is like averaging the effects of a very large number of different networks. The different networks will overfit in different ways, and so, hopefully, the net effect of dropout will be to reduce overfitting.
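
A minimal numpy sketch of (inverted) dropout at training time; the keep probability and layer shape are illustrative:

import numpy as np

def dropout(activations, p_keep=0.5, training=True):
    # Randomly zero units and rescale so the expected activation is unchanged
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < p_keep) / p_keep
    return activations * mask

h = np.random.rand(4, 10)              # a batch of hidden-layer activations
h_dropped = dropout(h, p_keep=0.8)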

LeCun and CNNs

Although Yann LeCun et al. had already proposed CNNs by 1993, their performance on vision tasks remained underwhelming at the time. In 2006, Geoffrey Hinton et al. proposed layer-wise pretraining with Deep Belief Networks and achieved a breakthrough in performance; this, together with the Deep Boltzmann Machine later proposed by Ruslan Salakhutdinov, rekindled the vision community's enthusiasm for neural networks and Boltzmann machines.

Maximum likelihood (MLE) and squared error

$$
p(x_i,y_i,e_i\mid\theta)\propto\exp\left(-\frac{1}{2e_i^2}\bigl(y_i-\hat{y}(x_i\mid\theta)\bigr)^2\right)\\
\arg\max\;\log\mathcal{L}(D\mid\theta)=\mathrm{const}-\sum_{i=1}^n\frac{1}{2e_i^2}\bigl(y_i-\hat{y}(x_i\mid\theta)\bigr)^2\\
\arg\min\;\mathcal{L}=\sum_{i=1}^n\frac{1}{2e_i^2}\bigl(y_i-\hat{y}(x_i\mid\theta)\bigr)^2
$$

In the second line $\mathcal{L}$ denotes the (log-)likelihood; in the third line $\mathcal{L}$ denotes the loss.
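
A minimal sketch of minimizing this weighted squared loss with scipy, assuming a straight-line model $\hat{y}(x\mid\theta)=\theta_0+\theta_1 x$ and made-up data and errors:

import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(42)
x = np.linspace(0, 10, 20)
e = 0.5 + 0.5 * rng.rand(20)                    # per-point measurement errors e_i
y = 1.0 + 2.0 * x + e * rng.randn(20)           # data generated around a true line

def squared_loss(theta):
    y_hat = theta[0] + theta[1] * x
    return np.sum((y - y_hat)**2 / (2 * e**2))

theta_mle = minimize(squared_loss, x0=[0.0, 1.0]).x
print(theta_mle)                                # ≈ [1, 2] up to noise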

The squared loss is extremely sensitive to noise

Consider a linear-fit plot: with the squared loss, a few outliers are enough to drag the fitted line away from the bulk of the data.

Loss functions: from the squared loss to the Huber loss

The variety of possible loss functions is quite literally infinite, but one relatively well-motivated option is the Huber loss.

The Huber loss is equivalent to the squared loss for points which are well-fit by the model, but reduces the loss contribution of outliers.
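
A minimal numpy sketch of the Huber loss on a residual r; the transition width delta is an arbitrary choice here:

import numpy as np

def huber_loss(r, delta=1.0):
    # Quadratic for small residuals (like squared loss), linear for large ones (down-weights outliers)
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * (np.abs(r) - 0.5 * delta))

print(huber_loss([0.1, 5.0]))   # the small residual is penalized quadratically, the large one only linearly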

ridge regression & lasso

Ridge regression and the lasso are regularized versions of least-squares regression, using $\ell_2$ and $\ell_1$ penalties respectively on the coefficient vector.
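
A minimal scikit-learn sketch; the regularization strengths alpha are illustrative:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)     # only two informative features

ridge = Ridge(alpha=1.0).fit(X, y)    # l2 penalty: shrinks all coefficients smoothly
lasso = Lasso(alpha=0.1).fit(X, y)    # l1 penalty: drives many coefficients exactly to zero
print((lasso.coef_ != 0).sum(), "nonzero lasso coefficients out of", X.shape[1])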

Addressing the outliers

Frequentist Correction for Outliers: Huber loss

A Bayesian Approach to Outliers: Nuisance Parameters.

From a single model to a mixture

The Bayesian approach to accounting for outliers generally involves modifying the model so that the outliers are accounted for. For this data, it is abundantly clear that a simple straight line is not a good fit to our data. So let’s propose a more complicated model that has the flexibility to account for outliers. One option is to choose a mixture between a signal and a background:

$$
\begin{split}
P(\{x_i\},\{y_i\},\{e_i\}\mid\theta,\{g_i\},\sigma_B)=\;&\frac{g_i}{\sqrt{2\pi e_i^2}}\exp\left[-\frac{\bigl(\hat{y}(x_i\mid\theta)-y_i\bigr)^2}{2e_i^2}\right]\\
&+\frac{1-g_i}{\sqrt{2\pi\sigma_B^2}}\exp\left(-\frac{\bigl(\hat{y}(x_i\mid\theta)-y_i\bigr)^2}{2\sigma_B^2}\right)
\end{split}
$$
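
A sketch of this mixture likelihood in numpy, assuming a straight-line model $\hat{y}(x\mid\theta)=\theta_0+\theta_1 x$ and a fixed background width sigma_B (params packs theta followed by the g_i):

import numpy as np

def log_likelihood(params, x, y, e, sigma_B=50.0):
    theta = params[:2]
    g = np.clip(params[2:], 1e-10, 1 - 1e-10)          # per-point "good data" weights
    dy = y - (theta[0] + theta[1] * x)
    logL_in = -0.5 * np.log(2 * np.pi * e**2) - dy**2 / (2 * e**2)                # foreground component
    logL_out = -0.5 * np.log(2 * np.pi * sigma_B**2) - dy**2 / (2 * sigma_B**2)   # background component
    # g_i * N(dy; 0, e_i) + (1 - g_i) * N(dy; 0, sigma_B), summed over points in log space
    return np.sum(np.logaddexp(np.log(g) + logL_in, np.log(1 - g) + logL_out))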

Handling the nuisance parameters

Our model is much more complicated now: it has 22 parameters rather than 2, but the majority of these can be considered nuisance parameters, which can be marginalized out in the end, just as we marginalized (integrated) over p in the Billiard example.

we’ll use the emcee package to explore the parameter space.
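
A minimal sketch of the emcee API (emcee 3.x assumed), shown on a toy 2-D Gaussian log-posterior; the real model would substitute the mixture log-likelihood above and use 22 dimensions:

import numpy as np
import emcee

def log_prob(theta):
    return -0.5 * np.sum(theta**2)            # toy log-posterior: standard 2-D Gaussian

ndim, nwalkers = 2, 32
p0 = np.random.randn(nwalkers, ndim)          # starting positions of the walkers
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 2000)
samples = sampler.get_chain(discard=200, flat=True)   # drop burn-in, flatten over walkers
print(samples.shape)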

Convolution

Convolution is a form of filtering;
filtering is a layer of feature extraction;
feature extraction is template matching.

convolution ⇒ filtering ⇒ feature extraction ⇒ template matching
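
A small scipy sketch of "convolution as filtering / template matching": sliding an edge-detection template over an image yields a feature map; the Sobel filter and random image here are illustrative:

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(8, 8)
sobel_x = np.array([[1, 0, -1],
                    [2, 0, -2],
                    [1, 0, -1]])              # an edge-detection "template"
feature_map = convolve2d(image, sobel_x, mode='valid')
print(feature_map.shape)                      # (6, 6): one response per template position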

parameter space

we’ll run the MCMC sampler to explore the parameter space.

SVMs vs ANNs

The development of ANNs followed a heuristic path, with applications and extensive experimentation preceding theory. In contrast, the development of SVMs involved sound theory first, then implementation and experiments.

A significant advantage of SVMs is that whilst ANNs can suffer from multiple local minima, the solution to an SVM is global and unique. Two more advantages of SVMs are that they have a simple geometric interpretation and give a sparse solution. Unlike ANNs, the computational complexity of SVMs does not depend on the dimensionality of the input space.
ANNs use empirical risk minimization, whilst SVMs use structural risk minimization. The reason that SVMs often outperform ANNs in practice is that they deal with the biggest problem with ANNs: SVMs are less prone to overfitting.

uninformative: flat vs informative: peaked

uninformative (flat) vs. informative (peaked); maximum entropy?

The prior distribution may be relatively uninformative (i.e., flatter) or informative (i.e., more peaked).

For example, a Beta distribution with $a, b = 10, 10$: $f(x;a,b)=\frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}$.

import numpy as np
import matplotlib.pyplot as plt
import scipy.special as ss   # provides the beta function B(a, b)

a, b = 10, 10
x = np.arange(0, 1, .001)
plt.plot(x, x**(a-1)*(1-x)**(b-1)/ss.beta(a, b), c='r', lw=2)
plt.axvline((a-1)/(a+b-2), ls='--', c='r', alpha=.4)   # mode of the Beta(a, b) density
plt.xlim([0, 1])
plt.show()

Visualization: trace plots

Trace plots are often used to informally assess stochastic convergence (of MCMC sampling). Rigorous demonstration of convergence is an unsolved problem, but simple ideas such as running multiple chains and checking that they are converging to similar distributions are often employed in practice.

Trace plots:

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

def target(prior, lik, n, theta, k):
    # Unnormalized posterior: prior density times binomial likelihood of k heads in n tosses
    return 0 if theta < 0 or theta > 1 else prior.pdf(theta) * lik(n, theta).pmf(k)

def mcmc_coin(niters, prior, lik, n, theta, k, sigma):
    samples = np.zeros(niters + 1)
    samples[0] = theta
    for i in range(niters):
        theta_p = theta + st.norm(0, sigma).rvs()          # Gaussian random-walk proposal
        rho = min(target(prior, lik, n, theta_p, k) / target(prior, lik, n, theta, k), 1)
        u = st.uniform(0, 1).rvs()
        if u < rho:
            theta = theta_p
        samples[i + 1] = theta
    return samples

n, k = 100, 61          # tosses and observed heads
a, b = 10, 10           # Beta prior parameters
prior = st.beta(a, b)
lik = st.binom
niters = 100
sigma = .05
sampless = [mcmc_coin(niters, prior, lik, n, theta, k, sigma)
            for theta in np.arange(.1, 1, .2)]
for samples in sampless:
    plt.plot(samples, '-o')
plt.show()

Sampling

  • Understanding burn-in

    Once the Markov chain has converged, the samples it yields are samples from $p(x,y)$ (the two-dimensional target); the stage before convergence is called the burn-in period.

  • Gibbs sampling

    Gibbs sampling is a type of random walk through parameter space, and hence can be thought of as a Metropolis-Hastings algorithm with a special proposal distribution; a minimal sketch follows this list.
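
A minimal sketch of a Gibbs sampler for a bivariate normal with correlation rho (the target distribution and rho are illustrative assumptions); note how an initial burn-in segment is discarded:

import numpy as np

def gibbs_bivariate_normal(niters=5000, rho=0.8):
    # Alternately draw x | y and y | x from their (normal) full conditionals
    samples = np.zeros((niters, 2))
    x, y = 0.0, 0.0
    for i in range(niters):
        x = np.random.normal(rho * y, np.sqrt(1 - rho**2))
        y = np.random.normal(rho * x, np.sqrt(1 - rho**2))
        samples[i] = x, y
    return samples

samples = gibbs_bivariate_normal()
burn_in = 500
print(np.corrcoef(samples[burn_in:].T))   # empirical correlation ≈ rho after discarding burn-in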

the idea of Deep Learning

The idea of deep learning is to discover multiple levels of representation, with the hope that higher-level features represent more abstract semantics of the data. Such abstract representations learned from a deep network are expected to provide more invariance to intra-class variability.

Parameter tuning

Learning a network useful for classification critically depends on expertise in parameter tuning and some ad hoc tricks.
