Table of Contents

  • Mind Map
  • GOALS
  • CODE WORKS
  • CONTENTS
    • principal components analysis (PCA)
  • WORKS

Mind Map

GOALS

  • eigendecomposition and singular value decomposition
  • inverse and pseudoinverse of matrix
  • PCA

CODE WORKS

Work Here!

CONTENTS

  • Scalars: A scalar is just a single number.
    We write scalars in italics.
    We usually give scalars lower-case variable names.

  • Vectors: A vector is an array of numbers.
    Arranged in order.
    We give vectors lower-case names written in bold typeface (a particular design of type), such as $\boldsymbol{x}$.

  • Matrices: A matrix is a 2-D array of numbers
    We usually give matrices upper-case variable names with bold typeface, such as $\boldsymbol{A}$.
    For example, $\boldsymbol{A}_{i,:}$ denotes the horizontal cross section of $\boldsymbol{A}$ with vertical coordinate $i$.
    This is known as the $i$-th row of $\boldsymbol{A}$. Likewise, $\boldsymbol{A}_{:,i}$ is the $i$-th column of $\boldsymbol{A}$.

  • Tensors: An array of numbers arranged on a regular grid with a variable number of axes is known as a tensor.
    We denote a tensor named "A" with this typeface: $\mathsf{A}$. We identify the element of $\mathsf{A}$ at coordinates $(i, j, k)$ by writing $\mathsf{A}_{i,j,k}$.

```python
import numpy as np

def visualize(name, arr):
    print(name + ' : \n' + str(arr))

# Scalar
Scalar = 2
# Vector
Vector = np.array(range(4))
# Matrix
Matrix = np.random.rand(2, 2)
# Tensor
Tensor = np.random.rand(2, 3, 4)

visualize('Scalar', Scalar)
visualize('Vector', Vector)
visualize('Matrix', Matrix)
visualize('Tensor', Tensor)
```

```
Scalar : 2
Vector : [0 1 2 3]
Matrix :
[[0.90397091 0.54302739]
 [0.04528763 0.68575114]]
Tensor :
[[[0.69245712 0.73931532 0.81628792 0.15778732]
  [0.60964378 0.1632089  0.16241425 0.45382426]
  [0.67702252 0.40530584 0.55579957 0.9105852 ]]

 [[0.90174383 0.7064744  0.68461729 0.44535537]
  [0.96427252 0.63662681 0.98678003 0.16036914]
  [0.53236554 0.68601688 0.48556214 0.42737712]]]
```
  • $\left(\boldsymbol{A}^{\top}\right)_{i,j} = A_{j,i}$
```python
# Transpose
Matrix_t = np.transpose(Matrix)
visualize('Matrix', Matrix)
visualize('Transpose', Matrix_t)
```

```
Matrix :
[[0.28740502 0.9449629 ]
 [0.20801981 0.69508998]]
Transpose :
[[0.28740502 0.20801981]
 [0.9449629  0.69508998]]
```
  • $\boldsymbol{D} = a \cdot \boldsymbol{B} + c$ where $D_{i,j} = a \cdot B_{i,j} + c$
  • We allow the addition of a matrix and a vector, yielding another matrix: $\boldsymbol{C} = \boldsymbol{A} + \boldsymbol{b}$ where $C_{i,j} = A_{i,j} + b_{j}$. In other words, the vector $\boldsymbol{b}$ is added to each row of the matrix. This implicit copying of $\boldsymbol{b}$ to many locations is called broadcasting.
```python
# Broadcasting: the vector's length must equal the matrix's second dimension
Matrix = np.random.rand(2, 4)
ans = Vector + Matrix
visualize('Vector', Vector)
visualize('Matrix', Matrix)
visualize('BroadCast', ans)
```

```
Vector : [0 1 2 3]
Matrix :
[[0.28635705 0.72777232 0.9124285  0.57315184]
 [0.45841184 0.79016666 0.7727094  0.23308759]]
BroadCast :
[[0.28635705 1.72777232 2.9124285  3.57315184]
 [0.45841184 1.79016666 2.7727094  3.23308759]]
```
  • $\boldsymbol{C} = \boldsymbol{A}\boldsymbol{B}$ where $C_{i,j} = \sum_{k} A_{i,k} B_{k,j}$
  • The element-wise product or Hadamard product is denoted $\boldsymbol{A} \odot \boldsymbol{B}$
```python
# Matrix multiplication
Matrix_1 = np.random.rand(2, 2)
Matrix_2 = np.random.rand(2, 2)
visualize('Matrix_1', Matrix_1)
visualize('Matrix_2', Matrix_2)
ans = np.dot(Matrix_1, Matrix_2)
visualize('Matrix production', ans)
# Hadamard (element-wise) product
ans = np.multiply(Matrix_1, Matrix_2)
visualize('Element-wise production', ans)
```

```
Matrix_1 :
[[0.76124234 0.40745662]
 [0.77113541 0.55938339]]
Matrix_2 :
[[0.96024896 0.59472598]
 [0.60984615 0.04059553]]
Matrix production :
[[0.97946802 0.46927151]
 [1.08161979 0.48132273]]
Element-wise production :
[[0.73098217 0.24232504]
 [0.47027396 0.02270846]]
```
  • Matrix multiplication is distributive and associative, but not commutative.
  • However, the dot product between two vectors is commutative: $\boldsymbol{x}^{\top}\boldsymbol{y} = \boldsymbol{y}^{\top}\boldsymbol{x}$
  • $(\boldsymbol{A}\boldsymbol{B})^{\top} = \boldsymbol{B}^{\top}\boldsymbol{A}^{\top}$
  • The value of such a product is a scalar, so it equals its own transpose: $\boldsymbol{x}^{\top}\boldsymbol{y} = \left(\boldsymbol{x}^{\top}\boldsymbol{y}\right)^{\top} = \boldsymbol{y}^{\top}\boldsymbol{x}$ (both identities are checked numerically in the sketch below)
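As a quick numerical sanity check (my addition, not part of the original notes), the sketch below verifies the transpose-of-a-product rule and the commutativity of the vector dot product on random data:

```python
import numpy as np

A = np.random.rand(3, 2)
B = np.random.rand(2, 4)
x = np.random.rand(5)
y = np.random.rand(5)

# (AB)^T == B^T A^T
print(np.allclose(np.dot(A, B).T, np.dot(B.T, A.T)))    # True
# x^T y == y^T x (the dot product of two vectors is a scalar, hence symmetric)
print(np.isclose(np.dot(x, y), np.dot(y, x)))           # True
```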
  • $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$
    $\boldsymbol{A}^{-1}\boldsymbol{A}\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{b}$
    $\boldsymbol{I}_{n}\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{b}$
    $\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{b}$
    However, $\boldsymbol{A}^{-1}$ is primarily useful as a theoretical tool and should not actually be used in practice for most software applications. Because $\boldsymbol{A}^{-1}$ can be represented with only limited precision on a digital computer, algorithms that make use of the value of $\boldsymbol{b}$ can usually obtain more accurate estimates of $\boldsymbol{x}$ (a quick comparison follows below).
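A minimal illustration of that advice (my addition): solving $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$ with np.linalg.solve, which factorizes $\boldsymbol{A}$ and uses $\boldsymbol{b}$ directly, is generally preferred over forming np.linalg.inv(A) and multiplying:

```python
import numpy as np

A = np.random.rand(4, 4)
b = np.random.rand(4)

x_solve = np.linalg.solve(A, b)        # preferred: works with b directly
x_inv = np.dot(np.linalg.inv(A), b)    # explicit inverse, usually less accurate

# Both satisfy Ax ≈ b here; the residual of solve is typically no worse
print(np.linalg.norm(np.dot(A, x_solve) - b))
print(np.linalg.norm(np.dot(A, x_inv) - b))
```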
```python
# Identity matrix
I = np.identity(4)
visualize('Identity matrix', I)
```

```
Identity matrix :
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
```
```python
# Matrix inverse
# np.linalg means linear + algebra
# the matrix must be square (and non-singular)
Matrix = np.random.rand(2, 2)
visualize('Matrix', Matrix)
Matrix_inverse = np.linalg.inv(Matrix)
visualize('Matrix_inverse', Matrix_inverse)
```

```
Matrix :
[[0.7561044  0.65758474]
 [0.54542972 0.12162917]]
Matrix_inverse :
[[-0.45604912  2.46561693]
 [ 2.04509117 -2.83501683]]
```
  • If both $\boldsymbol{x}$ and $\boldsymbol{y}$ are solutions, then $\boldsymbol{z} = \alpha\boldsymbol{x} + (1-\alpha)\boldsymbol{y}$ is also a solution for any real $\alpha$.

  • Determining whether $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$ has a solution thus amounts to testing whether $\boldsymbol{b}$ is in the span of the columns of $\boldsymbol{A}$. This particular span is known as the column space or the range of $\boldsymbol{A}$.

  • For $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$ to have exactly one solution for every value of $\boldsymbol{b}$ (so that $\boldsymbol{A}^{-1}$ exists), the matrix $\boldsymbol{A}$ must be square: we require that $m = n$ and that all of the columns be linearly independent. A square matrix with linearly dependent columns is known as singular.

  • If $\boldsymbol{A}$ is not square, or is square but singular, it can still be possible to solve the equation. However, we cannot use matrix inversion to find the solution.
    When does a system of linear equations have no solution, infinitely many solutions, or exactly one solution? (A rank-based check is sketched right below.)
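One rough way to answer that question numerically (my addition; classify_system is a hypothetical helper) is the Rouché–Capelli criterion, which compares the rank of $\boldsymbol{A}$ with the rank of the augmented matrix $[\boldsymbol{A} \mid \boldsymbol{b}]$:

```python
import numpy as np

def classify_system(A, b):
    """Classify the solutions of Ax = b by comparing ranks (Rouché–Capelli)."""
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.column_stack([A, b]))
    n = A.shape[1]  # number of unknowns
    if rank_A < rank_aug:
        return 'no solution'
    if rank_A == n:
        return 'unique solution'
    return 'infinitely many solutions'

A = np.array([[1.0, 2.0], [2.0, 4.0]])           # rank-deficient (singular) matrix
print(classify_system(A, np.array([1.0, 2.0])))  # b in the column space -> infinitely many solutions
print(classify_system(A, np.array([1.0, 3.0])))  # b outside the column space -> no solution
```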

  • In machine learning, we usually measure the size of vectors using a function called a norm. Formally, the $L^{p}$ norm is given by $\|\boldsymbol{x}\|_{p} = \left(\sum_{i}|x_{i}|^{p}\right)^{\frac{1}{p}}$

  • Norms are functions mapping vectors to non-negative values.

  • More rigorously, a norm is any function $f$ that satisfies the following properties:
    $f(\boldsymbol{x}) = 0 \Rightarrow \boldsymbol{x} = \mathbf{0}$
    $f(\boldsymbol{x} + \boldsymbol{y}) \leq f(\boldsymbol{x}) + f(\boldsymbol{y})$ (the triangle inequality)
    $\forall \alpha \in \mathbb{R},\; f(\alpha\boldsymbol{x}) = |\alpha| f(\boldsymbol{x})$

  • $L^{2}$ norm: denoted simply as $\|\boldsymbol{x}\|$. It is also common to measure the size of a vector using the squared $L^{2}$ norm, which can be calculated simply as $\boldsymbol{x}^{\top}\boldsymbol{x}$; however, the squared $L^{2}$ norm increases very slowly near the origin.

  • The $L^{1}$ norm is commonly used in machine learning when the difference between zero and nonzero elements is very important.

  • Frobenius norm: it measures the size of a matrix, and is analogous to the $L^{2}$ norm of a vector.

```python
# Norms
visualize('Vector', Vector)
norm = np.linalg.norm(Vector, ord=1)
visualize('L1 norm', norm)
norm = np.linalg.norm(Vector, ord=2)
visualize('L2 norm', norm)
norm = np.linalg.norm(Vector, ord=np.inf)
visualize('Infinity norm', norm)
# Frobenius norm
visualize('Matrix', Matrix)
norm = np.linalg.norm(Matrix, ord="fro")
visualize('Frobenius norm', norm)
```

```
Vector :
[0 1 2 3]
L1 norm :
6.0
L2 norm :
3.7416573867739413
Infinity norm :
3.0
Matrix :
[[0.76237551 0.35876482]
 [0.90076455 0.90969587]]
Frobenius norm :
1.5325964773148073
```
  • Formally, a matrix $\boldsymbol{D}$ is diagonal if and only if $D_{i,j} = 0$ for all $i \neq j$.
  • We write $\operatorname{diag}(\boldsymbol{v})$ to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector $\boldsymbol{v}$.
  • The inverse exists only if every diagonal entry is nonzero, and in that case $\operatorname{diag}(\boldsymbol{v})^{-1} = \operatorname{diag}\left([1/v_{1}, \ldots, 1/v_{n}]^{\top}\right)$.
  • A vector $\boldsymbol{x}$ and a vector $\boldsymbol{y}$ are orthogonal to each other if $\boldsymbol{x}^{\top}\boldsymbol{y} = 0$.
  • If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.
  • An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal: $\boldsymbol{A}^{\top}\boldsymbol{A} = \boldsymbol{A}\boldsymbol{A}^{\top} = \boldsymbol{I}$, that is to say $\boldsymbol{A}^{-1} = \boldsymbol{A}^{\top}$.
  • One of the most widely used kinds of matrix decomposition is called eigendecomposition (also known as spectral decomposition), in which we decompose a matrix into a set of eigenvectors and eigenvalues: $\boldsymbol{A}\boldsymbol{v} = \lambda\boldsymbol{v}$
  • Suppose that a matrix $\boldsymbol{A}$ has $n$ linearly independent eigenvectors $\{\boldsymbol{v}^{(1)}, \ldots, \boldsymbol{v}^{(n)}\}$ with corresponding eigenvalues $\{\lambda_{1}, \ldots, \lambda_{n}\}$. We may concatenate all of the eigenvectors to form a matrix $\boldsymbol{V}$ with one eigenvector per column: $\boldsymbol{V} = [\boldsymbol{v}^{(1)}, \ldots, \boldsymbol{v}^{(n)}]$. Likewise, we can concatenate the eigenvalues to form a vector $\boldsymbol{\lambda} = [\lambda_{1}, \ldots, \lambda_{n}]^{\top}$. The eigendecomposition of $\boldsymbol{A}$ is then given by $\boldsymbol{A} = \boldsymbol{V}\operatorname{diag}(\boldsymbol{\lambda})\boldsymbol{V}^{-1}$
```python
# Eigendecomposition
# not every matrix has an eigendecomposition,
# and the results can be complex even for a real matrix
visualize('Matrix', Matrix)
eigen_values, eigen_vector = np.linalg.eig(Matrix)
visualize('eigen_values', eigen_values)
visualize('eigen_vector', eigen_vector)
```

```
Matrix :
[[0.41152147 0.22119277]
 [0.12687741 0.29332044]]
eigen_values :
[0.53006453 0.17477738]
eigen_vector :
[[ 0.88140212 -0.68270023]
 [ 0.4723667   0.73069857]]
```
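To connect the output of np.linalg.eig back to the formula $\boldsymbol{A} = \boldsymbol{V}\operatorname{diag}(\boldsymbol{\lambda})\boldsymbol{V}^{-1}$, a small reconstruction check (my addition) could look like this:

```python
import numpy as np

A = np.random.rand(3, 3)
eigen_values, V = np.linalg.eig(A)   # columns of V are the eigenvectors; values may be complex

# Rebuild A from its eigendecomposition: A = V diag(lambda) V^{-1}
A_rebuilt = np.dot(np.dot(V, np.diag(eigen_values)), np.linalg.inv(V))
print(np.allclose(A, A_rebuilt))     # True (up to floating-point error; imaginary parts are negligible)
```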

  • Not every matrix can be decomposed into eigenvalues and eigenvectors. The eigendecomposition of a matrix tells us many useful facts about the matrix.
  • The matrix is singular if and only if any of the eigenvalues are zero.
  • The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form $f(\boldsymbol{x}) = \boldsymbol{x}^{\top}\boldsymbol{A}\boldsymbol{x}$ subject to $\|\boldsymbol{x}\|_{2} = 1$. Whenever $\boldsymbol{x}$ is equal to an eigenvector of $\boldsymbol{A}$, $f$ takes on the value of the corresponding eigenvalue. (This follows directly by substituting the definition; it is used at the end when proving that the PCA encoding matrix consists of eigenvectors.)
  • A matrix whose eigenvalues are all positive is called positive definite; it guarantees $\boldsymbol{x}^{\top}\boldsymbol{A}\boldsymbol{x} = 0 \Rightarrow \boldsymbol{x} = \mathbf{0}$.
  • A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite; it guarantees $\forall \boldsymbol{x},\ \boldsymbol{x}^{\top}\boldsymbol{A}\boldsymbol{x} \geq 0$.
  • If all eigenvalues are negative, the matrix is negative definite.
  • If all eigenvalues are negative or zero-valued, it is negative semidefinite. (A small numerical definiteness check is sketched below.)
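A minimal sketch (my addition; definiteness is a hypothetical helper) of classifying a symmetric matrix by the signs of its eigenvalues, using np.linalg.eigvalsh (which assumes the input is symmetric/Hermitian):

```python
import numpy as np

def definiteness(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    w = np.linalg.eigvalsh(A)            # eigenvalues of a symmetric matrix, in ascending order
    if np.all(w > tol):
        return 'positive definite'
    if np.all(w >= -tol):
        return 'positive semidefinite'
    if np.all(w < -tol):
        return 'negative definite'
    if np.all(w <= tol):
        return 'negative semidefinite'
    return 'indefinite'

B = np.random.rand(3, 3)
print(definiteness(np.dot(B.T, B)))      # B^T B is positive semidefinite (almost surely positive definite here)
```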
  • The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values.
  • Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition.
  • $\boldsymbol{A} = \boldsymbol{U}\boldsymbol{D}\boldsymbol{V}^{\top}$
    Suppose that $\boldsymbol{A}$ is an $m \times n$ matrix. Then $\boldsymbol{U}$ is defined to be an $m \times m$ matrix, $\boldsymbol{D}$ an $m \times n$ matrix, and $\boldsymbol{V}$ an $n \times n$ matrix.
  • The elements along the diagonal of $\boldsymbol{D}$ are known as the singular values of the matrix $\boldsymbol{A}$. The columns of $\boldsymbol{U}$ are known as the left-singular vectors. The columns of $\boldsymbol{V}$ are known as the right-singular vectors.
```python
# SVD
# every matrix has a singular value decomposition
visualize('Matrix', Matrix)
U, D, V = np.linalg.svd(Matrix)   # note: D is a 1-D array of singular values and V is actually V^T (Vh)
visualize('U', U)
visualize('D', D)
visualize('V', V)
```

```
Matrix :
[[0.58792665 0.32325678]
 [0.59719291 0.79724479]]
U :
[[-0.5446513 -0.8386626]
 [-0.8386626  0.5446513]]
D :
[1.17797449 0.23402443]
V :
[[-0.69700862 -0.71706275]
 [-0.71706275  0.69700862]]
```
  • The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of $\boldsymbol{A}$ is defined as the matrix
    $$\boldsymbol{A}^{+} = \lim_{\alpha \searrow 0}\left(\boldsymbol{A}^{\top}\boldsymbol{A} + \alpha\boldsymbol{I}\right)^{-1}\boldsymbol{A}^{\top}$$
    Practical algorithms for computing the pseudoinverse are not based on this definition, but rather on the formula
    $$\boldsymbol{A}^{+} = \boldsymbol{V}\boldsymbol{D}^{+}\boldsymbol{U}^{\top}$$
    where $\boldsymbol{U}$, $\boldsymbol{D}$ and $\boldsymbol{V}$ come from the singular value decomposition of $\boldsymbol{A}$, and the pseudoinverse $\boldsymbol{D}^{+}$ of a diagonal matrix $\boldsymbol{D}$ is obtained by taking the reciprocal of its non-zero elements and then taking the transpose of the resulting matrix.

  • When $\boldsymbol{A}$ has more columns than rows, solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\boldsymbol{x} = \boldsymbol{A}^{+}\boldsymbol{y}$ with minimal Euclidean norm $\|\boldsymbol{x}\|_{2}$ among all possible solutions.

  • When $\boldsymbol{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\boldsymbol{x}$ for which $\boldsymbol{A}\boldsymbol{x}$ is as close as possible to $\boldsymbol{y}$ in terms of the Euclidean norm $\|\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\|_{2}$. (A short numerical sketch follows.)
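A minimal sketch (my addition) comparing np.linalg.pinv with the SVD-based formula above, on a matrix with more rows than columns (the least-squares case):

```python
import numpy as np

A = np.random.rand(5, 3)   # more rows than columns
y = np.random.rand(5)

# Pseudoinverse via NumPy
A_pinv = np.linalg.pinv(A)

# The same thing built from the SVD: A+ = V D+ U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv_svd = np.dot(Vt.T, np.dot(np.diag(1.0 / s), U.T))
print(np.allclose(A_pinv, A_pinv_svd))                        # True

# x = A+ y minimizes ||Ax - y||_2, i.e. it matches the least-squares solution
x = np.dot(A_pinv, y)
print(np.allclose(x, np.linalg.lstsq(A, y, rcond=None)[0]))   # True
```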

  • The trace operator gives the sum of all of the diagonal entries of a matrix:
    $$\operatorname{Tr}(\boldsymbol{A}) = \sum_{i} A_{i,i}$$

  • $\|\boldsymbol{A}\|_{F} = \sqrt{\operatorname{Tr}\left(\boldsymbol{A}\boldsymbol{A}^{\top}\right)}$
    $\operatorname{Tr}(\boldsymbol{A}) = \operatorname{Tr}\left(\boldsymbol{A}^{\top}\right)$
    $\operatorname{Tr}(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}) = \operatorname{Tr}(\boldsymbol{C}\boldsymbol{A}\boldsymbol{B}) = \operatorname{Tr}(\boldsymbol{B}\boldsymbol{C}\boldsymbol{A})$, even though $\boldsymbol{A}\boldsymbol{B} \in \mathbb{R}^{m \times m}$ and $\boldsymbol{B}\boldsymbol{A} \in \mathbb{R}^{n \times n}$
    $\operatorname{Tr}\left(\prod_{i=1}^{n} \boldsymbol{F}^{(i)}\right) = \operatorname{Tr}\left(\boldsymbol{F}^{(n)} \prod_{i=1}^{n-1} \boldsymbol{F}^{(i)}\right)$
    Another useful fact to keep in mind is that a scalar is its own trace: $a = \operatorname{Tr}(a)$
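A quick numerical check of these trace identities (my addition):

```python
import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(4, 3)
M = np.random.rand(3, 3)

# Frobenius norm via the trace
print(np.isclose(np.linalg.norm(A, ord='fro'),
                 np.sqrt(np.trace(np.dot(A, A.T)))))                # True
# Invariance under transposition and cyclic permutation
print(np.isclose(np.trace(M), np.trace(M.T)))                       # True
print(np.isclose(np.trace(np.dot(A, B)), np.trace(np.dot(B, A))))   # True, even though AB is 3x3 and BA is 4x4
```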

  • The determinant of a square matrix, denoted $\det(\boldsymbol{A})$, is a function mapping matrices to real scalars.

  • The determinant is equal to the product of all the eigenvalues of the matrix.

  • The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation preserves volume.
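A small sketch (my addition) verifying that the determinant equals the product of the eigenvalues:

```python
import numpy as np

A = np.random.rand(3, 3)
eigen_values = np.linalg.eigvals(A)   # may be complex, but the product of eigenvalues of a real matrix is real

# det(A) equals the product of the eigenvalues
print(np.isclose(np.linalg.det(A), np.prod(eigen_values).real))   # True
```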

principal components analysis (PCA)

  • Relation between PCA, SVD, and eigendecomposition
  • What is the difference between the covariance and the sample covariance (normalizing by $1/n$ vs. $1/(n-1)$)? (See the np.cov sketch right below.)
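A small sketch of the two normalizations in NumPy (my addition): np.cov uses the unbiased $1/(n-1)$ sample covariance by default (bias=False), and switches to $1/n$ with bias=True:

```python
import numpy as np

X = np.random.rand(100, 3)          # 100 samples, 3 features
Xc = X - X.mean(axis=0)             # center the data

sample_cov = np.dot(Xc.T, Xc) / (X.shape[0] - 1)   # 1/(n-1)
population_cov = np.dot(Xc.T, Xc) / X.shape[0]     # 1/n

print(np.allclose(sample_cov, np.cov(X, rowvar=False)))                 # True
print(np.allclose(population_cov, np.cov(X, rowvar=False, bias=True)))  # True
```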
  • Suppose we have a collection of $m$ points $\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\}$ in $\mathbb{R}^{n}$. Suppose we would like to apply lossy compression to these points. Lossy compression means storing the points in a way that requires less memory but may lose some precision. We would like to lose as little precision as possible.
  • For each point $\boldsymbol{x}^{(i)} \in \mathbb{R}^{n}$ we will find a corresponding code vector $\boldsymbol{c}^{(i)} \in \mathbb{R}^{l}$. If $l$ is smaller than $n$, it will take less memory to store the code points than the original data. We will want to find some encoding function that produces the code for an input, $f(\boldsymbol{x}) = \boldsymbol{c}$, and a decoding function that produces the reconstructed input given its code, $\boldsymbol{x} \approx g(f(\boldsymbol{x}))$.
  • PCA is defined by our choice of the decoding function. Specifically, to make the decoder very simple, we choose to use matrix multiplication to map the code back into $\mathbb{R}^{n}$. Let $g(\boldsymbol{c}) = \boldsymbol{D}\boldsymbol{c}$, where $\boldsymbol{D} \in \mathbb{R}^{n \times l}$ is the matrix defining the decoding.
  • Computing the optimal code for this decoder could be a difficult problem. To keep the encoding problem easy, PCA constrains the columns of $\boldsymbol{D}$ to be orthogonal to each other. (Note that $\boldsymbol{D}$ is still not technically "an orthogonal matrix" unless $l = n$.)
  • The first thing we need to do is figure out how to generate the optimal code point $\boldsymbol{c}^{*}$ for each input point $\boldsymbol{x}$. One way to do this is to minimize the distance between the input point $\boldsymbol{x}$ and its reconstruction, $g(\boldsymbol{c}^{*})$. We can measure this distance using a norm. In the principal components algorithm, we use the $L^{2}$ norm:
    $$\boldsymbol{c}^{*} = \underset{\boldsymbol{c}}{\arg\min}\;\|\boldsymbol{x} - g(\boldsymbol{c})\|_{2}$$
    We can switch to the squared $L^{2}$ norm instead of the $L^{2}$ norm itself, because both are minimized by the same value of $\boldsymbol{c}$: the $L^{2}$ norm is non-negative and the squaring operation is monotonically increasing for non-negative arguments.
    $$\boldsymbol{c}^{*} = \underset{\boldsymbol{c}}{\arg\min}\;\|\boldsymbol{x} - g(\boldsymbol{c})\|_{2}^{2}$$
    The function being minimized simplifies to
    $$(\boldsymbol{x} - g(\boldsymbol{c}))^{\top}(\boldsymbol{x} - g(\boldsymbol{c}))$$
    (by the definition of the $L^{2}$ norm)
    $$= \boldsymbol{x}^{\top}\boldsymbol{x} - \boldsymbol{x}^{\top}g(\boldsymbol{c}) - g(\boldsymbol{c})^{\top}\boldsymbol{x} + g(\boldsymbol{c})^{\top}g(\boldsymbol{c})$$
    (by the distributive property)
    $$= \boldsymbol{x}^{\top}\boldsymbol{x} - 2\boldsymbol{x}^{\top}g(\boldsymbol{c}) + g(\boldsymbol{c})^{\top}g(\boldsymbol{c})$$
    (because the scalar $g(\boldsymbol{c})^{\top}\boldsymbol{x}$ is equal to the transpose of itself).
    We can now change the function being minimized again, to omit the first term, since this term does not depend on $\boldsymbol{c}$:
    $$\boldsymbol{c}^{*} = \underset{\boldsymbol{c}}{\arg\min}\; -2\boldsymbol{x}^{\top}g(\boldsymbol{c}) + g(\boldsymbol{c})^{\top}g(\boldsymbol{c})$$
    To make further progress, we must substitute in the definition of $g(\boldsymbol{c})$:
    $$\begin{aligned} \boldsymbol{c}^{*} &= \underset{\boldsymbol{c}}{\arg\min}\; -2\boldsymbol{x}^{\top}\boldsymbol{D}\boldsymbol{c} + \boldsymbol{c}^{\top}\boldsymbol{D}^{\top}\boldsymbol{D}\boldsymbol{c} \\ &= \underset{\boldsymbol{c}}{\arg\min}\; -2\boldsymbol{x}^{\top}\boldsymbol{D}\boldsymbol{c} + \boldsymbol{c}^{\top}\boldsymbol{I}_{l}\boldsymbol{c} \end{aligned}$$
    (by the orthogonality and unit norm constraints on $\boldsymbol{D}$)
    $$= \underset{\boldsymbol{c}}{\arg\min}\; -2\boldsymbol{x}^{\top}\boldsymbol{D}\boldsymbol{c} + \boldsymbol{c}^{\top}\boldsymbol{c}$$
    We can solve this optimization problem using vector calculus:
    $$\begin{gathered} \nabla_{\boldsymbol{c}}\left(-2\boldsymbol{x}^{\top}\boldsymbol{D}\boldsymbol{c} + \boldsymbol{c}^{\top}\boldsymbol{c}\right) = \mathbf{0} \\ -2\boldsymbol{D}^{\top}\boldsymbol{x} + 2\boldsymbol{c} = \mathbf{0} \\ \boldsymbol{c} = \boldsymbol{D}^{\top}\boldsymbol{x} \end{gathered}$$
    This makes the algorithm efficient: we can optimally encode $\boldsymbol{x}$ using just a matrix-vector operation. To encode a vector, we apply the encoder function
    $$f(\boldsymbol{x}) = \boldsymbol{D}^{\top}\boldsymbol{x}$$
    Using a further matrix multiplication, we can also define the PCA reconstruction operation:
    $$r(\boldsymbol{x}) = g(f(\boldsymbol{x})) = \boldsymbol{D}\boldsymbol{D}^{\top}\boldsymbol{x}$$
    (A small numerical check of this encoder/decoder pair is sketched right below.)
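A minimal numerical sketch of the encoder/decoder pair derived above (my addition), using a random matrix with orthonormal columns obtained from a QR decomposition as a stand-in for $\boldsymbol{D}$:

```python
import numpy as np

n, l = 5, 2
x = np.random.rand(n)

# Random matrix with l orthonormal columns (a valid PCA-style decoder D)
D, _ = np.linalg.qr(np.random.rand(n, l))

c = np.dot(D.T, x)          # encoder: f(x) = D^T x
x_rec = np.dot(D, c)        # reconstruction: r(x) = D D^T x

# The closed-form code c = D^T x should beat any perturbed code
err_opt = np.linalg.norm(x - x_rec)
err_perturbed = np.linalg.norm(x - np.dot(D, c + 0.1 * np.random.randn(l)))
print(err_opt <= err_perturbed)    # True: c = D^T x minimizes the reconstruction error
```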
  • Next, we need to choose the encoding matrix $\boldsymbol{D}$. To do so, we revisit the idea of minimizing the $L^{2}$ distance between inputs and reconstructions. Since we will use the same matrix $\boldsymbol{D}$ to decode all of the points, we can no longer consider the points in isolation. Instead, we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:
    $$\boldsymbol{D}^{*} = \underset{\boldsymbol{D}}{\arg\min} \sqrt{\sum_{i,j}\left(x_{j}^{(i)} - r\left(\boldsymbol{x}^{(i)}\right)_{j}\right)^{2}} \quad \text{subject to } \boldsymbol{D}^{\top}\boldsymbol{D} = \boldsymbol{I}_{l}$$
    To derive the algorithm for finding $\boldsymbol{D}^{*}$, we will start by considering the case where $l = 1$. In this case, $\boldsymbol{D}$ is just a single vector, $\boldsymbol{d}$, and the problem reduces to
    $$\boldsymbol{d}^{*} = \underset{\boldsymbol{d}}{\arg\min} \sum_{i}\left\|\boldsymbol{x}^{(i)} - \boldsymbol{d}\boldsymbol{d}^{\top}\boldsymbol{x}^{(i)}\right\|_{2}^{2} \quad \text{subject to } \|\boldsymbol{d}\|_{2} = 1$$
    The above formulation is the most direct way of performing the substitution, but is not the most stylistically pleasing way to write the equation. It places the scalar value $\boldsymbol{d}^{\top}\boldsymbol{x}^{(i)}$ on the right of the vector $\boldsymbol{d}$. It is more conventional to write scalar coefficients on the left of the vectors they operate on. We therefore usually write such a formula as
    $$\boldsymbol{d}^{*} = \underset{\boldsymbol{d}}{\arg\min} \sum_{i}\left\|\boldsymbol{x}^{(i)} - \boldsymbol{d}^{\top}\boldsymbol{x}^{(i)}\boldsymbol{d}\right\|_{2}^{2} \quad \text{subject to } \|\boldsymbol{d}\|_{2} = 1$$
    or, exploiting the fact that a scalar is its own transpose, as
    $$\boldsymbol{d}^{*} = \underset{\boldsymbol{d}}{\arg\min} \sum_{i}\left\|\boldsymbol{x}^{(i)} - \boldsymbol{x}^{(i)\top}\boldsymbol{d}\boldsymbol{d}\right\|_{2}^{2} \quad \text{subject to } \|\boldsymbol{d}\|_{2} = 1$$
    The reader should aim to become familiar with such cosmetic rearrangements.
    Let $\boldsymbol{X} \in \mathbb{R}^{m \times n}$ be the matrix defined by stacking all of the vectors describing the points, such that $\boldsymbol{X}_{i,:} = \boldsymbol{x}^{(i)\top}$. We can now rewrite the problem as
    $$\boldsymbol{d}^{*} = \underset{\boldsymbol{d}}{\arg\min}\left\|\boldsymbol{X} - \boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right\|_{F}^{2} \quad \text{subject to } \boldsymbol{d}^{\top}\boldsymbol{d} = 1$$
    Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:
    $$\underset{\boldsymbol{d}}{\arg\min}\left\|\boldsymbol{X} - \boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right\|_{F}^{2} = \underset{\boldsymbol{d}}{\arg\min}\operatorname{Tr}\left(\left(\boldsymbol{X} - \boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right)^{\top}\left(\boldsymbol{X} - \boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right)\right)$$
    $$= \underset{\boldsymbol{d}}{\arg\min}\operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X} - \boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top} - \boldsymbol{d}\boldsymbol{d}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X} + \boldsymbol{d}\boldsymbol{d}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right)$$
    $$= \underset{\boldsymbol{d}}{\arg\min}\operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\right) - \operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right) - \operatorname{Tr}\left(\boldsymbol{d}\boldsymbol{d}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}\right) + \operatorname{Tr}\left(\boldsymbol{d}\boldsymbol{d}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right)$$
    $$= \underset{\boldsymbol{d}}{\arg\min}\; -\operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right) - \operatorname{Tr}\left(\boldsymbol{d}\boldsymbol{d}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}\right) + \operatorname{Tr}\left(\boldsymbol{d}\boldsymbol{d}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right)$$
    (because terms not involving $\boldsymbol{d}$ do not affect the arg min)
    $$= \underset{\boldsymbol{d}}{\arg\min}\; -2\operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right) + \operatorname{Tr}\left(\boldsymbol{d}\boldsymbol{d}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right)$$
    (because we can cycle the order of the matrices inside a trace)
    $$= \underset{\boldsymbol{d}}{\arg\min}\; -2\operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right) + \operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\boldsymbol{d}\boldsymbol{d}^{\top}\right)$$
    (using the same property again). At this point, we re-introduce the constraint:
    $$\underset{\boldsymbol{d}}{\arg\min}\; -2\operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right) + \operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\boldsymbol{d}\boldsymbol{d}^{\top}\right) \quad \text{subject to } \boldsymbol{d}^{\top}\boldsymbol{d} = 1$$
    $$= \underset{\boldsymbol{d}}{\arg\min}\; -2\operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right) + \operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right) \quad \text{subject to } \boldsymbol{d}^{\top}\boldsymbol{d} = 1$$
    (due to the constraint)
    $$= \underset{\boldsymbol{d}}{\arg\min}\; -\operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right) \quad \text{subject to } \boldsymbol{d}^{\top}\boldsymbol{d} = 1$$
    $$\begin{aligned} &= \underset{\boldsymbol{d}}{\arg\max}\; \operatorname{Tr}\left(\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{\top}\right) \quad \text{subject to } \boldsymbol{d}^{\top}\boldsymbol{d} = 1 \\ &= \underset{\boldsymbol{d}}{\arg\max}\; \operatorname{Tr}\left(\boldsymbol{d}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{d}\right) \quad \text{subject to } \boldsymbol{d}^{\top}\boldsymbol{d} = 1 \end{aligned}$$
    This optimization problem may be solved using eigendecomposition (or, perhaps more intuitively, via a Lagrangian formulation).
    Specifically, the optimal $\boldsymbol{d}$ is given by the eigenvector of $\boldsymbol{X}^{\top}\boldsymbol{X}$ corresponding to the largest eigenvalue (so $\boldsymbol{d}$ turns out to be an eigenvector!).
  • This derivation is specific to the case of $l = 1$ and recovers only the first principal component.
    More generally, when we wish to recover a basis of principal components, the matrix $\boldsymbol{D}$ is given by the $l$ eigenvectors corresponding to the largest eigenvalues. This may be shown using proof by induction.
```python
# PCA
# We apply PCA to the iris data set
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width']
df['label'] = iris.target
visualize('iris data', df)
visualize('last five rows', df.tail())
visualize('value_counts of label', df['label'].value_counts())
# Inspect the second sample
X = df.iloc[:, 0:4]
y = df.iloc[:, 4]   # keep the label column as y so it can be concatenated with the components later
visualize("The second row's data", X.iloc[1, 0:4])
visualize("The second row's label", y.iloc[1])
```

```
iris data :
     sepal length  sepal width  petal length  petal width  label
0             5.1          3.5           1.4          0.2      0
1             4.9          3.0           1.4          0.2      0
2             4.7          3.2           1.3          0.2      0
3             4.6          3.1           1.5          0.2      0
4             5.0          3.6           1.4          0.2      0
..            ...          ...           ...          ...    ...
145           6.7          3.0           5.2          2.3      2
146           6.3          2.5           5.0          1.9      2
147           6.5          3.0           5.2          2.0      2
148           6.2          3.4           5.4          2.3      2
149           5.9          3.0           5.1          1.8      2

[150 rows x 5 columns]
last five rows :
     sepal length  sepal width  petal length  petal width  label
145           6.7          3.0           5.2          2.3      2
146           6.3          2.5           5.0          1.9      2
147           6.5          3.0           5.2          2.0      2
148           6.2          3.4           5.4          2.3      2
149           5.9          3.0           5.1          1.8      2
value_counts of label :
2    50
1    50
0    50
Name: label, dtype: int64
The second row's data :
sepal length    4.9
sepal width     3.0
petal length    1.4
petal width     0.2
Name: 1, dtype: float64
The second row's label :
0
```
```python
class PCA():
    def __init__(self):
        pass

    def fit_transform(self, X, n_components):
        n_samples = np.shape(X)[0]
        # Sample covariance matrix of the centered data (note the 1/(n-1) normalization)
        X_centered = X - X.mean(axis=0)
        covariance_matrix = (1 / (n_samples - 1)) * np.dot(np.transpose(X_centered), X_centered)
        eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
        # Sort eigenvalues in descending order; [::-1] reverses the ascending argsort
        idx = eigenvalues.argsort()[::-1]
        # The eigenvectors are the COLUMNS of `eigenvectors`, so reorder columns and keep the first n_components
        eigenvectors = eigenvectors[:, idx][:, :n_components]
        x_transformed = np.dot(X, eigenvectors)
        '''
        # Equivalent route via the SVD of the centered data:
        # np.linalg.svd returns V transposed, with singular values already sorted in descending order
        U, D, Vt = np.linalg.svd(X_centered)
        x_transformed = np.dot(X, Vt[:n_components].T)
        '''
        return x_transformed

model = PCA()
Y = model.fit_transform(X, 2)
principalDF = pd.DataFrame(np.array(Y), columns=['principal component 1', 'principal component 2'])
DF = pd.concat([principalDF, y], axis=1)  # y = df.iloc[:, 4], the label column defined above

fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(1, 1, 1)  # split the canvas into 1 row x 1 column and draw in the first cell (same as 111)
ax.set_xlabel('Principal Component 1', fontsize=10)
ax.set_ylabel('Principal Component 2', fontsize=10)
ax.set_title('2 Components PCA', fontsize=15)

targets = [0, 1, 2]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    idx = DF['label'] == target
    ax.scatter(DF.loc[idx, 'principal component 1'], DF.loc[idx, 'principal component 2'], c=color, s=50)
ax.legend(targets)
ax.grid()
plt.show()
```
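As a sanity check on the hand-rolled class (my addition), the same two components can be obtained with scikit-learn; individual columns may differ by sign and by a constant offset, since sklearn.decomposition.PCA centers the data before projecting while the class above projects the raw X:

```python
from sklearn.decomposition import PCA as SklearnPCA
import numpy as np

sk_model = SklearnPCA(n_components=2)
Y_sklearn = sk_model.fit_transform(X)     # X is the iris feature DataFrame from above

# The recovered subspaces should agree even if signs / offsets differ
print(np.array(Y)[:3])                    # projection from the hand-rolled PCA above
print(Y_sklearn[:3])                      # projection from scikit-learn (centered data)
print(sk_model.explained_variance_ratio_) # fraction of variance captured by each component
```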

Eigen-Decomposition

SVD

WORKS

  • The derivation of PCA using eigendecomposition (with all details explained)




  • PCA Derivation (answers)

  • Slowly paying back the debt from my undergrad years. I used to just skim this book, but I have finally started working through it carefully. Is there any other free write-up online with more detail than this? (a little self-praise, haha)
