Nonlinear Regression Models (Part 1): Neural Networks

Study notes, for reference only; corrections are welcome.

PS: This blog was written in a mix of Chinese and English; some English passages are followed by a translation (not done by the blogger).

Nonlinear Regression Models

Neural Networks

Neural networks (Bishop 1995; Ripley 1996; Titterington 2010) are powerful nonlinear regression techniques inspired by theories about how the brain works.

The outcome is modeled by an intermediary set of unobserved variables (called hidden variables or hidden units here).

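Translation

The outcome variable is modeled through a set of intermediate-layer unobserved variables (called hidden variables or hidden units here).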

These hidden units are linear combinations of the original predictors, but they are not estimated in a hierarchical fashion.

As previously stated, each hidden unit is a linear combination of some or all of the predictor variables. However, this linear combination is typically transformed by a nonlinear function $g(\cdot)$, such as the logistic function:
$$h_k(x) = g\left( \beta_{0k} + \sum_{j=1}^{P} x_j \beta_{jk} \right), \qquad g(u) = \frac{1}{1+e^{-u}}$$
The $\beta$ coefficients are similar to regression coefficients; coefficient $\beta_{jk}$ is the effect of the $j$th predictor on the $k$th hidden unit. A neural network model usually involves multiple hidden units to model the outcome.
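
The hidden-unit formula can be sketched in a few lines of NumPy. This is only an illustration: the names `logistic` and `hidden_units`, the array shapes, and the toy sizes are assumptions made here, not part of the original notes.

```python
import numpy as np

def logistic(u):
    """Logistic function g(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def hidden_units(X, beta0, beta):
    """Compute h_k(x) for every row of X.

    X     : (n, P) predictor matrix
    beta0 : (H,)   intercepts beta_{0k}
    beta  : (P, H) slopes beta_{jk}
    """
    return logistic(beta0 + X @ beta)

# Hypothetical toy sizes: n = 4 samples, P = 3 predictors, H = 2 hidden units.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
beta0 = rng.normal(size=2)
beta = rng.normal(size=(3, 2))
h = hidden_units(X, beta0, beta)   # one row per sample, one column per hidden unit
```

Because of the logistic transformation, every entry of `h` lies strictly between 0 and 1.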

There are no constraints that help define these linear combinations. Because of this, there is little likelihood that the coefficients in each unit represent some coherent piece of information.

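Translation

No constraints are placed on the form of these linear combinations. As a result, the coefficients of each hidden unit may not reflect coherent information.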

Once the number of hidden units is defined, each unit must be related to the outcome. Another linear combination connects the hidden units to the outcome:
$$f(x) = \gamma_0 + \sum_{k=1}^{H} \gamma_k h_k$$

For this type of network model and $P$ predictors, there are a total of $H(P+1) + H + 1$ parameters being estimated, which quickly becomes large as $P$ increases.
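
As a quick check of the parameter count, here is a minimal sketch of the full single-layer model. The helper names `n_parameters` and `predict` and the sizes `P = 10`, `H = 5` are assumptions chosen for illustration.

```python
import numpy as np

def n_parameters(P, H):
    """H(P + 1) hidden-layer parameters plus H + 1 output-layer parameters."""
    return H * (P + 1) + H + 1

def predict(X, beta0, beta, gamma0, gamma):
    """f(x) = gamma_0 + sum_k gamma_k h_k(x) with logistic hidden units."""
    h = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
    return gamma0 + h @ gamma

P, H = 10, 5                       # hypothetical sizes
rng = np.random.default_rng(1)
beta0, beta = rng.normal(size=H), rng.normal(size=(P, H))
gamma0, gamma = 0.5, rng.normal(size=H)

y_hat = predict(rng.normal(size=(8, P)), beta0, beta, gamma0, gamma)

# Count the stored values directly and compare with the formula.
total = beta0.size + beta.size + 1 + gamma.size
print(n_parameters(P, H), total)   # both counts are 61 for P = 10, H = 5
```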

Treating this model as a nonlinear regression model, the parameters are usually optimized to minimize the sum of the squared residuals.

Translation

Viewed as a nonlinear regression model, the parameters are chosen to minimize the sum of squared residuals.
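
Written out, the objective might look as follows. The training pairs $(x_i, y_i)$, $i = 1, \dots, n$, and the name $\mathrm{SSE}$ are notation assumed here; the rest follows the definitions of $f$ and $h_k$ given earlier.

$$\mathrm{SSE}(\beta, \gamma) = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 = \sum_{i=1}^{n} \left( y_i - \gamma_0 - \sum_{k=1}^{H} \gamma_k h_k(x_i) \right)^2$$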

This can be a challenging numerical optimization problem (recall that there are no constraints on the parameters of this complex nonlinear model).

The parameters are usually initialized to random values, and then specialized algorithms for solving the equations are used. The back-propagation algorithm is a highly efficient methodology that works with derivatives to find the optimal parameters. However, it is common that a solution to this equation is not a global solution, meaning that we cannot guarantee that the resulting set of parameters is uniformly better than any other set.
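
To make the idea concrete, here is a minimal sketch of back-propagation for the single-layer model, using plain gradient descent on the sum of squared residuals. The step size, number of steps, initialization scale, and toy data are all assumptions; real implementations typically use more refined optimizers.

```python
import numpy as np

def forward(X, params):
    beta0, beta, gamma0, gamma = params
    h = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))   # hidden units, shape (n, H)
    return h, gamma0 + h @ gamma                     # predictions, shape (n,)

def sse(X, y, params):
    return float(np.sum((y - forward(X, params)[1]) ** 2))

def gradients(X, y, params):
    """Propagate the SSE derivative from the output back to the hidden layer."""
    h, f = forward(X, params)
    r = 2.0 * (f - y)                                # dSSE/df for each sample
    dz = np.outer(r, params[3]) * h * (1 - h)        # chain rule, g'(u) = g(u)(1 - g(u))
    # Order matches params: beta0, beta, gamma0, gamma.
    return dz.sum(axis=0), X.T @ dz, r.sum(), h.T @ r

def fit(X, y, H=3, steps=2000, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)                # random starting values
    params = [rng.normal(scale=0.5, size=H),
              rng.normal(scale=0.5, size=(X.shape[1], H)),
              0.0,
              rng.normal(scale=0.5, size=H)]
    for _ in range(steps):
        grads = gradients(X, y, params)
        params = [p - lr * g / len(y) for p, g in zip(params, grads)]
    return params

# Hypothetical toy data with a nonlinear signal.
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

start = fit(X, y, steps=0)     # the random starting values only
fitted = fit(X, y)
print(sse(X, y, start), sse(X, y, fitted))
```

Changing `seed` changes the starting values and can lead to a different local solution, which is the behavior described above.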

Also, neural networks have a tendency to over-fit the relationship between the predictors and the response due to the large number of regression coefficients.

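Translation

In addition, neural networks tend to over-fit the relationship between the predictors and the response because there are too many parameters to estimate.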

To combat this issue, several different approaches have been proposed.

First, the iterative algorithms for solving the regression equations can be halted prematurely. This approach is referred to as early stopping and stops the optimization procedure when some estimate of the error rate starts to increase.
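
A minimal sketch of early stopping is shown below. It assumes that the error estimate comes from a held-out validation set and that training stops after `patience` steps without improvement; both choices, like all function names and the toy data, are assumptions made for illustration.

```python
import numpy as np

def init_params(P, H, seed=0):
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=0.5, size=H), rng.normal(scale=0.5, size=(P, H)),
            0.0, rng.normal(scale=0.5, size=H)]

def forward(X, params):
    beta0, beta, gamma0, gamma = params
    h = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
    return h, gamma0 + h @ gamma

def sse(X, y, params):
    return float(np.sum((y - forward(X, params)[1]) ** 2))

def gradients(X, y, params):
    h, f = forward(X, params)
    r = 2.0 * (f - y)
    dz = np.outer(r, params[3]) * h * (1 - h)
    return dz.sum(axis=0), X.T @ dz, r.sum(), h.T @ r

def fit_early_stopping(X_tr, y_tr, X_val, y_val, H=5, max_steps=5000,
                       lr=0.01, patience=50, seed=0):
    params = init_params(X_tr.shape[1], H, seed)
    best_params, best_val = params, sse(X_val, y_val, params)
    since_best, steps_run = 0, 0
    for _ in range(max_steps):
        grads = gradients(X_tr, y_tr, params)
        params = [p - lr * g / len(y_tr) for p, g in zip(params, grads)]
        steps_run += 1
        val = sse(X_val, y_val, params)       # error estimate on held-out data
        if val < best_val:
            best_params, best_val, since_best = params, val, 0
        else:
            since_best += 1
            if since_best >= patience:        # validation error stopped improving
                break
    return best_params, best_val, steps_run

# Hypothetical noisy toy data, split into training and validation parts.
rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(120, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=120)
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

best_params, best_val, steps_run = fit_early_stopping(X_tr, y_tr, X_val, y_val)
print(steps_run, best_val)
```

The function returns the parameters with the lowest validation error seen so far, not the parameters from the final step.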

Another approach to moderating over-fitting is to use weight decay, a penalization method to regularize the model similar to ridge regression.
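
One common way to write the weight decay objective is sketched below. The penalty parameter $\lambda \ge 0$ and the training pairs $(x_i, y_i)$ are notation assumed here, and implementations differ on whether the intercepts are penalized.

$$\sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 + \lambda \sum_{k=1}^{H} \sum_{j=0}^{P} \beta_{jk}^2 + \lambda \sum_{k=0}^{H} \gamma_k^2$$

As in ridge regression, larger values of $\lambda$ shrink the coefficients toward zero, and $\lambda = 0$ recovers the unpenalized sum of squared residuals.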

The structure of the model described here is the simplest neural network architecture: a single-layer feed-forward network. There are many other kinds, such as models where there is more than one layer of hidden units (i.e., there is a layer of hidden units that models the other hidden units). Also, other model architectures have loops going in both directions between layers.

Given the challenge of estimating a large number of parameters, the fitted model finds parameter estimates that are locally optimal; that is, the algorithm converges, but the resulting parameter estimates are unlikely to be the globally optimal estimates.

Very often, different locally optimal solutions can produce models that are very different but have nearly equivalent performance.

This instability can sometimes hinder the use of neural networks.

As an alternative, several models can be created using different starting values, and the results of these models can be averaged to produce a more stable prediction.
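
The averaging idea can be sketched as follows: fit the same network from several random starts and average the predictions. The training routine, the number of models (5), and the toy data are assumptions for illustration only.

```python
import numpy as np

def fit_network(X, y, H=3, steps=2000, lr=0.01, seed=0):
    """Gradient descent on the SSE from a seed-dependent random start."""
    rng = np.random.default_rng(seed)
    beta0 = rng.normal(scale=0.5, size=H)
    beta = rng.normal(scale=0.5, size=(X.shape[1], H))
    gamma0, gamma = 0.0, rng.normal(scale=0.5, size=H)
    for _ in range(steps):
        h = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
        r = 2.0 * (gamma0 + h @ gamma - y) / len(y)
        dz = np.outer(r, gamma) * h * (1 - h)
        beta0, beta = beta0 - lr * dz.sum(axis=0), beta - lr * (X.T @ dz)
        gamma0, gamma = gamma0 - lr * r.sum(), gamma - lr * (h.T @ r)
    return beta0, beta, gamma0, gamma

def predict(X, params):
    beta0, beta, gamma0, gamma = params
    return gamma0 + (1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))) @ gamma

# Hypothetical toy data.
rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(150, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

# Same data, different starting values.
models = [fit_network(X, y, seed=s) for s in range(5)]
preds = np.stack([predict(X, m) for m in models])    # shape (5, n)
averaged = preds.mean(axis=0)                        # averaged prediction

single_sse = ((preds - y) ** 2).sum(axis=1)
averaged_sse = ((averaged - y) ** 2).sum()
print(single_sse, averaged_sse)
```

Because squared error is convex, the SSE of the averaged prediction can never exceed the mean SSE of the individual models on the same data.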

These models are often adversely affected by high correlation among the predictor variables.

One approach to mitigating this issue is to pre-filter the predictors to remove those involved in high correlations. Alternatively, a feature extraction technique, such as principal component analysis (PCA), can be used prior to modeling to eliminate correlations.
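
Both pre-processing options can be sketched with NumPy alone. The helper names `drop_correlated` and `pca_scores`, the 0.9 threshold, and the greedy filtering rule are assumptions made here; other filtering rules are possible.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Greedy filter: keep a column only if its absolute correlation with
    every column kept so far is below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

def pca_scores(X, n_components):
    """Center and scale the predictors, then project onto the leading
    principal components obtained from the SVD."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

# Hypothetical data: column 1 is a noisy copy of column 0.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 300))
X = np.column_stack([a, a + 0.05 * rng.normal(size=300), b])

kept = drop_correlated(X)        # indices of the columns that survive the filter
scores = pca_scores(X, 2)        # uncorrelated replacement predictors
print(kept, np.corrcoef(scores, rowvar=False)[0, 1])
```

The filter keeps original predictors, which stay interpretable, while the PCA scores are uncorrelated by construction but are mixtures of all the predictors.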