1. The basic theory of multivariate linear regression

Hypothesis: $h_\theta(x)=\theta_0x_0+\theta_1x_1+\ldots+\theta_nx_n=\theta^Tx$ (where $x_0=1$)

Parameters: $\theta_0, \theta_1, \ldots, \theta_n$

Cost Function: $J(\theta_0, \theta_1, \ldots, \theta_n)=\frac{1}{2m}\sum\limits_{i=1}^m\big(h_\theta(x^{(i)})-y^{(i)}\big)^2$
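As a minimal Octave sketch (assuming `X` is the $m\times(n+1)$ design matrix with a leading column of ones, `y` the $m\times 1$ target vector, and `theta` the $(n+1)\times 1$ parameter vector), the cost can be computed in one line:

% vectorized cost J(theta); m is the number of training examples
m = length(y);
J = sum((X * theta - y).^2) / (2*m);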

We can also use the gradient descent method to find the optimized $\theta$.
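For reference, gradient descent repeats the following simultaneous update for every $j = 0, \ldots, n$ until convergence (the Octave implementations are shown in Section 5 below):

$\theta_j := \theta_j - \frac{\alpha}{m}\sum\limits_{i=1}^m\big(h_\theta(x^{(i)})-y^{(i)}\big)x_j^{(i)}$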

2. Feature scaling

  • Method 1: $\dfrac{x_i}{\max-\min}$
  • Method 2 (Mean Normalization): $\dfrac{x_i-\mu}{\max-\min}$

After scaling, each feature falls into a range such as $-1\le x_i\le1$ or $-0.5\le x_i\le0.5$.
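A minimal Octave sketch of Method 2 (mean normalization), assuming `X` is an $m\times n$ matrix of raw feature values without the bias column:

% scale every column of X to roughly [-0.5, 0.5]
mu = mean(X);                       % 1-by-n row of feature means
feat_range = max(X) - min(X);       % 1-by-n row of feature ranges
X_norm = (X - mu) ./ feat_range;    % implicit broadcasting (Octave / MATLAB R2016b+)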

3. Learning rate

  • Too small: slow convergence
  • Too large: (a) may fail to converge; (b) $J(\theta)$ may not decrease on every iteration; (c) slow convergence

TRY!!!
$\alpha = 0.0001, 0.01, 0.1, 1$
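A minimal sketch for comparing learning rates, assuming `X` and `y` as above; plot the recorded cost per iteration and keep the largest $\alpha$ whose curve still decreases on every iteration:

% try several learning rates and record the cost history for each
alphas = [0.0001 0.01 0.1 1];
itera  = 100;
m = length(y);
J_history = zeros(itera, length(alphas));
for k = 1:length(alphas)
    theta = zeros(size(X,2), 1);
    for j = 1:itera
        theta = theta - (alphas(k)/m) * (X' * (X*theta - y));
        J_history(j,k) = sum((X*theta - y).^2) / (2*m);
    end
end
plot(J_history);   % one curve per alpha; a rising curve means alpha is too large
legend('0.0001','0.01','0.1','1');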

4. Normal equation

We can use the normal equation to solve for $\theta$ directly:
$\theta=(X^TX)^{-1}X^Ty$

Derivation of the formula:
Cost Function: $J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^m\big(h_\theta(x^{(i)})-y^{(i)}\big)^2$
so we can vectorize the cost function as follows:
$$\begin{aligned} J(\theta) &=\frac{1}{2}\underbrace{(X\theta-y)^T}_{1\times m}\ \underbrace{(X\theta-y)}_{m\times 1}\\ &=\frac{1}{2}\big(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty\big) \end{aligned}$$
*The constant factor $\frac{1}{m}$ can be ignored, since it does not change the minimizer.

The $\theta$ that satisfies $\frac{\partial J(\theta)}{\partial \theta} =0$ is the optimum, so (using identities (1) and (2) below)
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &=\frac{1}{2}\big(2X^TX\theta-X^Ty-(y^TX)^T+0\big)\\ &=\frac{1}{2}\big(2X^TX\theta-X^Ty-X^Ty\big)\\ &=X^TX\theta-X^Ty=0 \end{aligned}$$
$X^TX\theta=X^Ty$
so we can solve for $\theta$ (dimensions written underneath):
$\underset{n\times1}{\theta} =\Big(\underset{n\times m}{X^T}\ \underset{m\times n}{X}\Big)^{-1}\ \underset{n\times m}{X^T}\ \underset{m\times1}{y}$

*(1) $\frac{\partial (A\theta)}{\partial\theta} = A^T$

*(2) $\frac{\partial (\theta^T A\theta)}{\partial\theta} = 2A\theta$ (this holds for symmetric $A$; here $A=X^TX$ is symmetric)

%% ============= normal equation ==========
% closed-form solution; no initialization or iteration needed
theta_normal = inv(X'*X) * X' * y;
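In practice, `inv` can be numerically unreliable when $X^TX$ is close to singular; a more robust sketch uses `pinv` or the backslash operator:

theta_normal = pinv(X'*X) * X' * y;   % pseudo-inverse also works when X'*X is singular
% or equivalently: theta_normal = X \ y;   % least squares via QR factorization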

More information: Derivation of the Normal Equation for linear regression

5. Vectorization in univariate gradient descent

  • Vectorization
% Vectorization to calculate theta
itera = 3000;
theta_matrix = [0 0];
theta_itera = zeros(itera,2); % record all the theta values during the process
for j = 1:itera
    theta_itera(j,:) = theta_matrix;
    hypothesis = X * theta_matrix';
    theta_matrix = theta_matrix - (alpha/m) * ((hypothesis - y)' * X);
end
  • “for” Loop
% "for" loop to calculate the \theta
itera = 3000;
theta_itera = zeros(length(y),2);
for j = 1:iteratheta_itera(j,:) = theta';  % record all the theta values during the processhypothesis = X * theta;for i = 1:theta_lengththeta(i) = theta(i) - (alpha/m) * ((hypothesis - y)'* X(:,i));  endend
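As a quick sanity check (assuming the vectorized snippet above has been run on the same `X` and `y`), the converged gradient-descent parameters should roughly match the normal-equation solution:

% after convergence the two approaches should agree (up to small error)
theta_gd = theta_matrix';          % final theta from the vectorized loop
theta_ne = pinv(X'*X) * X' * y;    % normal equation
disp(norm(theta_gd - theta_ne));   % should be close to 0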

What if $X^TX$ is non-invertible?

(1) Delete linearly dependent features (e.g. $x_2=2x_1$);
(2) If $m$ (number of samples) $\le n$ (number of features), delete some features so that $m > n$;
(3) Use regularization (see the sketch below).
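As a minimal Octave sketch of the problem (the data here is made up for illustration), `inv(X'*X)` is unreliable when one feature is linearly dependent on another, while `pinv` still returns a minimum-norm least-squares solution:

% x2 = 2*x1, so X'*X is singular and inv(X'*X) is unreliable
X = [ones(4,1), (1:4)', 2*(1:4)'];
y = [2; 4; 6; 8];
theta = pinv(X'*X) * X' * y;   % pseudo-inverse still yields a valid fit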
