1. The basic theory of the multivariate linear regression

Hypothesis: $h_\theta(x)=\theta_0x_0+\theta_1x_1+\ldots+\theta_nx_n=\theta^Tx$, with $x_0=1$

Parameters: $\theta_0, \theta_1, \ldots, \theta_n$

Cost Function: $J(\theta_0, \theta_1, \ldots, \theta_n)=\frac{1}{2m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
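As a minimal sketch of how this cost can be evaluated, the snippet below computes $J(\theta)$ in vectorized form. The toy data and the names `compute_cost`, `J_zero` and `J_exact` are assumptions made for illustration; they are not part of the original notes.

```matlab
% Toy data (hypothetical): 3 samples, 1 feature, generated by y = 1 + 2*x1.
X = [1 1; 1 2; 1 3];      % first column is x_0 = 1
y = [3; 5; 7];
m = length(y);

% Vectorized cost: J(theta) = 1/(2m) * sum((X*theta - y).^2)
compute_cost = @(theta) sum((X * theta - y) .^ 2) / (2 * m);

J_zero  = compute_cost([0; 0]);   % cost at the all-zero starting point
J_exact = compute_cost([1; 2]);   % cost at the parameters that generated y
fprintf('J([0;0]) = %.4f, J([1;2]) = %.4f\n', J_zero, J_exact);
```

The cost is zero only at the parameters that reproduce `y` exactly; any other $\theta$ gives a positive value.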

We can also use the gradient descent method to find the optimal $\theta$.
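For reference, the standard batch gradient descent update for this cost function can be written as follows, where $\alpha$ is the learning rate and all $\theta_j$ are updated simultaneously. The second line is the equivalent vectorized form, assuming $X$ is the $m$-row matrix whose $i$-th row is $x^{(i)}$.

```latex
\begin{aligned}
\theta_j &:= \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)\,x_j^{(i)}, \qquad j = 0, 1, \ldots, n \\
\theta   &:= \theta - \frac{\alpha}{m}\,X^T(X\theta - y)
\end{aligned}
```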

2. Feature scaling

  • Method 1: $\frac{x_i}{\max-\min}$
  • Method 2 (mean normalization): $\frac{x_i-\mu}{\max-\min}$

After scaling, each feature should fall roughly in the range $-1\le x_i\le 1$ or $-0.5\le x_i\le 0.5$.
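A short sketch of mean normalization (Method 2) applied column by column. The sample values and the names `X_raw`, `mu`, `col_rng` and `X_scaled` are hypothetical, and the subtraction relies on implicit expansion (Octave, or MATLAB R2016b and later).

```matlab
% Hypothetical raw features: two columns on very different scales.
X_raw = [2104 3; 1600 3; 2400 4; 1416 2; 3000 4];

mu      = mean(X_raw);               % per-column mean
col_rng = max(X_raw) - min(X_raw);   % per-column (max - min)

% Mean normalization: (x_i - mu) / (max - min), applied to every column
X_scaled = (X_raw - mu) ./ col_rng;

disp(X_scaled);
```

Each scaled column has zero mean and spans a range of exactly 1. The same `mu` and `col_rng` would have to be reused when scaling a new sample before prediction.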

3. Learning rate

  • Too small: slow convergence
  • Too large: (a) $J(\theta)$ may not converge; (b) $J(\theta)$ may not decrease on every iteration; (c) convergence may be slow

Values to try: $\alpha = 0.0001, 0.01, 0.1, 1$
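One way to compare these candidate values is to run a fixed number of iterations for each and look at the resulting cost. The sketch below does this on hypothetical, deliberately unscaled toy data; the iteration count of 100 and the variable names are assumptions. On this data the largest step size is expected to make the cost grow instead of shrink.

```matlab
% Toy data (hypothetical): y = 1 + 2*x1, features left unscaled on purpose.
X = [1 1; 1 2; 1 3];
y = [3; 5; 7];
m = length(y);
cost = @(theta) sum((X * theta - y) .^ 2) / (2 * m);

alphas = [0.0001 0.01 0.1 1];      % candidate learning rates
itera  = 100;                      % assumed iteration budget
final_cost = zeros(size(alphas));
for k = 1:length(alphas)
    theta = zeros(2, 1);
    for j = 1:itera
        % vectorized batch gradient descent step
        theta = theta - (alphas(k) / m) * (X' * (X * theta - y));
    end
    final_cost(k) = cost(theta);
    fprintf('alpha = %-6g  J after %d iterations = %g\n', alphas(k), itera, final_cost(k));
end
```

Plotting $J(\theta)$ against the iteration number for each $\alpha$ gives the same information in more detail.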

4. Normal equation

We can use the normal equation to solve for $\theta$ directly:
$$\theta=(X^TX)^{-1}X^Ty$$

Derivation of the formula:
Cost Function: $J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
We can vectorize the cost function as follows:
$$\begin{aligned} J(\theta) &=\frac{1}{2}\underbrace{(X\theta-y)^T}_{1*m} \underbrace{(X\theta-y)}_{m*1}\\ &=\frac{1}{2}(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty) \end{aligned}$$
*The factor $\frac{1}{m}$ can be ignored, since it does not change the minimizing $\theta$.

The $\theta$ that satisfies $\frac{\partial J(\theta)}{\partial \theta}=0$ is the optimum, so
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &=\frac{1}{2}(2X^TX\theta-X^Ty-(y^TX)^T+0)\\ &= \frac{1}{2}(2X^TX\theta-X^Ty-X^Ty+0)\\ &= X^TX\theta-X^Ty=0 \end{aligned}$$
Solving for $\theta$ gives
$$\mathop{\theta}\limits_{n*1} =(\mathop{X^T}\limits_{n*m} \mathop{X}\limits_{m*n})^{-1} \mathop{X^T}\limits_{n*m} \mathop{y}\limits_{m*1}$$

*(1) $\frac{\partial A\theta}{\partial\theta} = A^T$

*(2) $\frac{\partial \theta^T A\theta}{\partial\theta} = 2A\theta$ (for symmetric $A$, which holds for $A = X^TX$)
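Identity (2) can be spot-checked numerically with central finite differences. This is only a sanity check under assumed values: the matrix `A`, the point `theta` and the step `h` are hypothetical choices.

```matlab
X = [1 1; 1 2; 1 3];
A = X' * X;                      % symmetric 2x2 matrix
theta = [0.5; -1.5];             % arbitrary test point
f = @(t) t' * A * t;             % scalar function theta' * A * theta

h = 1e-6;                        % finite-difference step
num_grad = zeros(2, 1);
for i = 1:2
    e = zeros(2, 1);
    e(i) = 1;                    % unit vector along coordinate i
    num_grad(i) = (f(theta + h * e) - f(theta - h * e)) / (2 * h);
end

ana_grad = 2 * A * theta;        % gradient predicted by identity (2)
disp([num_grad, ana_grad]);
```

The two columns printed should agree up to rounding error.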

%% ============= normal equation ==========
% Solve (X'*X) * theta = X'*y; the backslash operator avoids forming the explicit inverse.
theta_normal = (X' * X) \ (X' * y);
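A small usage sketch on hypothetical data, comparing the explicit inverse with the backslash (left division) operator. The data values and the names `theta_inv` and `theta_bs` are assumptions for illustration.

```matlab
% Hypothetical data: 4 samples, intercept column plus one feature.
X = [1 1; 1 2; 1 3; 1 4];
y = [3.1; 4.9; 7.2; 8.8];

theta_inv = inv(X' * X) * X' * y;    % normal equation with an explicit inverse
theta_bs  = (X' * X) \ (X' * y);     % same linear system solved by left division

disp([theta_inv, theta_bs]);
```

Both forms solve the same system; left division is usually preferred numerically because it does not build the inverse matrix.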

More information: Derivation of the Normal Equation for linear regression

5. Vectorization in univariate gradient descent

  • Vectorization
% Vectorized gradient descent to calculate theta
itera = 3000;
theta_matrix = [0 0];            % theta stored as a 1x2 row vector
theta_itera = zeros(itera,2);    % record all the theta values during the process
for j = 1:itera
    theta_itera(j,:) = theta_matrix;
    hypothesis = X * theta_matrix';
    theta_matrix = theta_matrix - (alpha/m) * ((hypothesis - y)' * X);
end
  • “for” Loop
% "for" loop to calculate the \theta
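itera = 3000;
theta = zeros(2,1);              % assumed initialization: theta as a 2x1 column vector
theta_length = length(theta);
theta_itera = zeros(itera,2);    % one row per iteration
for j = 1:itera
    theta_itera(j,:) = theta';   % record all the theta values during the process
    hypothesis = X * theta;      % computed once per iteration, so all components update simultaneously
    for i = 1:theta_length
        theta(i) = theta(i) - (alpha/m) * ((hypothesis - y)' * X(:,i));
    end
end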
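The two variants should produce the same parameters. The sketch below runs both on hypothetical toy data and compares the results; the data, `alpha = 0.1`, and the names `theta_vec` and `theta_loop` are assumptions.

```matlab
% Toy data (hypothetical): y = 1 + 2*x1
X = [1 0; 1 0.5; 1 1];
y = [1; 2; 3];
m = length(y);
alpha = 0.1;
itera = 3000;

% Vectorized update, theta kept as a row vector
theta_vec = [0 0];
for j = 1:itera
    hypothesis = X * theta_vec';
    theta_vec = theta_vec - (alpha / m) * ((hypothesis - y)' * X);
end

% Component-wise "for" loop update, theta kept as a column vector
theta_loop = zeros(2, 1);
for j = 1:itera
    hypothesis = X * theta_loop;     % fixed before the inner loop
    for i = 1:length(theta_loop)
        theta_loop(i) = theta_loop(i) - (alpha / m) * ((hypothesis - y)' * X(:, i));
    end
end

disp([theta_vec', theta_loop]);
```

Because `hypothesis` is computed before the inner loop, the component-wise version still performs a simultaneous update and matches the vectorized one.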

**** What if $X^TX$ is non-invertible?

(1) Delete the linearly dependent features (e.g. $x_2=2x_1$);
(2) Delete some features when there are too many of them (e.g. $m \le n$, where $m$ is the number of samples and $n$ is the number of features);
(3) Use regularization.
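A minimal sketch of options (1) and (3) on hypothetical data in which $x_2=2x_1$. The regularization strength `lambda = 0.01` is an arbitrary assumption, and the regularized form used here, $(X^TX+\lambda L)^{-1}X^Ty$ with $L$ the identity matrix except for a zero in the top-left entry, is the usual regularized normal equation and is not derived in these notes.

```matlab
% x2 = 2*x1 makes the columns of X linearly dependent.
x1 = [1; 2; 3; 4];
X  = [ones(4, 1), x1, 2 * x1];
y  = [3; 5; 7; 9];                 % generated by y = 1 + 2*x1

fprintf('rank(X''*X) = %d, columns = %d\n', rank(X' * X), size(X, 2));

% Option (1): drop the redundant column, then solve as usual
X_reduced = X(:, 1:2);
theta_reduced = (X_reduced' * X_reduced) \ (X_reduced' * y);

% Option (3): regularize; L leaves the intercept term unpenalized
lambda = 0.01;                     % assumed value
L = eye(3);
L(1, 1) = 0;
theta_reg = (X' * X + lambda * L) \ (X' * y);

disp(theta_reduced');
disp(theta_reg');
```

With a small `lambda`, the regularized system is invertible and its predictions `X * theta_reg` stay close to `y`.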

