Principal Component Analysis

There are many dimensionality reduction techniques available in machine learning, and dimensionality reduction is an integral part of data science. In this article, I describe one of the most widely used of these techniques: Principal Component Analysis (PCA).

Before doing that, we need to know what dimensionality reduction is and why it is so important.

What is Dimensionality Reduction:

Dimensionality reduction is a technique used to reduce the number of dimensions of the feature space. For example, if a dataset has 100 features (columns) and you want to keep only 10, a dimensionality reduction technique can achieve this. In general, it transforms a dataset from an n-dimensional space to an n′-dimensional space, where n′ < n.

Why Dimensionality Reduction?

Dimensionality reduction is important in machine learning in many ways, but the most important reason of all is the 'Curse of Dimensionality'.

In machine learning, we often add as many features as possible at first in the hope of getting more accurate results. Beyond a certain point, however, the performance of the model decreases as the number of features increases, mainly because of overfitting. This is the 'Curse of Dimensionality', and it is why dimensionality reduction is so crucial in machine learning.

Now let's come to PCA.

Principal Component Analysis (PCA):

PCA is a dimensionality reduction technique that identifies correlations and patterns in a dataset so that it can be transformed into a new dataset of significantly lower dimensionality while retaining as much of the important information (variance) as possible.

Mathematics Behind PCA:

The mathematics of PCA can be divided into 5 steps.

1. Standardizing the data
2. Calculating the covariance matrix
3. Calculating the eigenvectors and eigenvalues
4. Computing the principal components
5. Reducing the dimension of the dataset

Let's go through each of these steps separately.

1. Standardizing the data:

Standardizing is the process of scaling the data so that all the variables and their values lie within a similar range.

The formula for standardization is shown below:

$$z^i = \frac{x^i - \mu}{\sigma}$$

where x^i = observation or sample, μ (mu) = mean, σ (sigma) = standard deviation.
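As a minimal sketch of the standardization step, the snippet below applies the formula column by column with NumPy. The toy array `X` is a made-up example, not data from this article, and the population standard deviation is used (the same convention scikit-learn's `StandardScaler` follows).

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), 2 features (columns) on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

mu = X.mean(axis=0)     # per-feature mean
sigma = X.std(axis=0)   # per-feature (population) standard deviation
Z = (X - mu) / sigma    # z = (x - mu) / sigma, applied to every column

print(Z.mean(axis=0))   # each feature now has mean (approximately) 0
print(Z.std(axis=0))    # and standard deviation 1
```

After standardization both columns carry identical values, which shows that the original difference in scale has been removed.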

2. Calculating the covariance matrix:

A covariance matrix expresses how the different variables in the dataset vary together. It is essential to identify highly dependent variables because they contain biased and redundant information, which can hamper the overall performance of the model.

The covariance of two variables x and y is calculated this way:

$$\operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x^i - \bar{x})(y^i - \bar{y})$$

where x^i = values of the x variable, x̄ = mean of the x variable, y^i = values of the y variable, ȳ = mean of the y variable, and n = number of observations.

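If our dataset has more than 2 dimensions, then it has more than one covariance measurement. For example, if we have a dataset with 3 dimensions x, y and z, then the covariance matrix of this dataset will look like this:

$$C = \begin{pmatrix} \operatorname{cov}(x, x) & \operatorname{cov}(x, y) & \operatorname{cov}(x, z) \\ \operatorname{cov}(y, x) & \operatorname{cov}(y, y) & \operatorname{cov}(y, z) \\ \operatorname{cov}(z, x) & \operatorname{cov}(z, y) & \operatorname{cov}(z, z) \end{pmatrix}$$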
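The covariance matrix of a 3-dimensional dataset can be sketched in a few lines of NumPy. The data here is randomly generated purely for illustration, and the sample covariance (dividing by n - 1) is assumed, which is also the default of `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # hypothetical data: 50 samples, columns x, y, z

Xc = X - X.mean(axis=0)             # subtract the mean of every variable
n = X.shape[0]
cov_manual = Xc.T @ Xc / (n - 1)    # 3 x 3 matrix of pairwise covariances

cov_numpy = np.cov(X, rowvar=False) # rowvar=False: columns are the variables

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))
```

The matrix is symmetric, since cov(x, y) = cov(y, x), and its diagonal holds the variance of each variable.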

3. Calculating the eigenvectors and eigenvalues:

Eigenvectors are vectors whose direction does not change when a linear transformation is applied to them; they are only scaled.

An eigenvalue is the scalar by which its eigenvector is scaled under that transformation.

Let A be a square matrix, ν a nonzero vector and λ a scalar satisfying Aν = λν. Then λ is called the eigenvalue associated with the eigenvector ν of A.

Now, let's do some math and find an eigenvector and eigenvalue of a sample matrix.

As you can see in the above calculation, [1,1] is the eigenvector and 2 is the eigenvalue. Now, let's see how we can find the eigenpairs of a sample matrix A.

Substituting our matrix A into the above formula, we get:

With the eigenvalues found, let's find the corresponding eigenvectors, which satisfy AX = λX.

For the eigenvector with λ = 2:

For the eigenvector with λ = 3:

The above shows how we can calculate the eigenvalues and eigenvectors of a matrix.
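The eigenpair calculation can also be checked numerically. The matrix below is an assumption chosen for illustration (the matrix used in the article's figures is not shown in the text); it happens to have eigenvalues 2 and 3, because its characteristic equation det(A - λI) = λ² - 5λ + 6 = 0 factors as (λ - 2)(λ - 3) = 0.

```python
import numpy as np

# Assumed example matrix, not taken from the article's figures.
A = np.array([[4.0, -2.0],
              [1.0,  1.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose COLUMNS are the
# corresponding unit-length eigenvectors.
eigenvalues, eigenvectors = np.linalg.eig(A)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # Check the defining relation A v = lambda v for every pair.
    print(lam, v, np.allclose(A @ v, lam * v))

# Hand-computed eigenvectors satisfy the same relation:
v2 = np.array([1.0, 1.0])   # for lambda = 2
v3 = np.array([2.0, 1.0])   # for lambda = 3
print(np.allclose(A @ v2, 2 * v2), np.allclose(A @ v3, 3 * v3))
```

NumPy normalizes eigenvectors to unit length, so its output is a scaled version of the hand-computed vectors; any nonzero multiple of an eigenvector is still an eigenvector.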

4. Computing the principal components:

Once we have computed the eigenvectors and eigenvalues as shown above, all we have to do is sort them in descending order of eigenvalue. The eigenvector with the highest eigenvalue is the most significant and therefore forms the first principal component.
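A minimal sketch of this ordering step, using randomly generated data as a stand-in for a real dataset. `np.linalg.eigh` is used because a covariance matrix is symmetric; it returns eigenvalues in ascending order, so they are reversed here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 4 features with different spreads.
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
cov = np.cov(X, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending eigenvalues

order = np.argsort(eigenvalues)[::-1]             # indices for descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]             # keep the columns paired with their eigenvalues

# Share of the total variance captured by each principal component.
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)
```

The first column of `eigenvectors` is now the first principal component, the second column the second, and so on.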

5. Reducing the dimension of the dataset:

In the last step, we project the original dataset onto the selected principal components, which capture the largest and most significant share of the information in the dataset.
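The projection in this last step is a single matrix product. The sketch below assumes random toy data and keeps k = 2 components; both choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # hypothetical data: 100 samples, 5 features
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize first

cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]         # descending eigenvalue order

k = 2                                         # number of components to keep (assumed)
W = eigenvectors[:, order[:k]]                # 5 x 2 projection matrix
X_reduced = Z @ W                             # 100 x 2 dataset in the new space

print(X_reduced.shape)
```

Each column of `X_reduced` has a variance equal to the corresponding eigenvalue, and the columns are uncorrelated with one another.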

PCA in Python:

Now, let's assemble all of the above steps into Python code.

```python
import numpy as np
import pandas as pd
from scipy.linalg import eigh
from sklearn.preprocessing import StandardScaler

# Load the MNIST data
d0 = pd.read_csv('./mnist_train.csv')

# Save the labels into a variable l
l = d0['label']

# Drop the label feature and store the pixel data in d
d = d0.drop("label", axis=1)

# Pick the first 15K data points
labels = l.head(15000)
data = d.head(15000)

# Data preprocessing: standardizing the data
standardized_data = StandardScaler().fit_transform(data)

# Find the covariance matrix, which is A^T * A (up to a constant factor
# that does not change the eigenvectors)
sample_data = standardized_data
covar_matrix = np.matmul(sample_data.T, sample_data)

# Find the top two eigenvalues and corresponding eigenvectors for
# projecting onto a 2-dim space. eigh returns eigenvalues in ascending
# order, so indices 782 and 783 (of 784) are the two largest.
# subset_by_index replaces the older eigvals argument (SciPy >= 1.5).
values, vectors = eigh(covar_matrix, subset_by_index=(782, 783))

# Put the eigenvectors in rows, largest eigenvalue first
vectors = vectors.T[::-1]

# Computing the principal components
new_coordinates = np.matmul(vectors, sample_data.T)

# Append the labels to the 2-D projected data
new_coordinates = np.vstack((new_coordinates, labels)).T

# New DataFrame with reduced dimension
dataframe = pd.DataFrame(data=new_coordinates,
                         columns=["1st_principal", "2nd_principal", "label"])
print(dataframe.head())
```

Executing the above code prints the first five rows of the new DataFrame, with the columns 1st_principal, 2nd_principal and label.
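As a cross-check, the same projection can be obtained from scikit-learn's `PCA` class. The sketch below is not part of the original walkthrough: it uses small randomly generated data instead of MNIST so that it runs without the CSV file, and compares the library result with the manual eigen-decomposition. Eigenvectors are only defined up to sign, so absolute values are compared.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 300 samples, 6 correlated features.
data = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))

standardized = StandardScaler().fit_transform(data)

# Library route: PCA with two components.
pca = PCA(n_components=2)
projected = pca.fit_transform(standardized)

# Manual route: eigen-decomposition of A^T * A, as in the steps above.
covar = standardized.T @ standardized
values, vectors = np.linalg.eigh(covar)   # ascending eigenvalues
top2 = vectors[:, ::-1][:, :2]            # two largest, largest first
manual = standardized @ top2

# The two projections agree up to the sign of each component.
print(np.allclose(np.abs(projected), np.abs(manual), atol=1e-6))
print(pca.explained_variance_ratio_)
```

Using the library class is usually more convenient in practice, while the manual version makes each mathematical step visible.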

Limitations of PCA:

Though PCA works well, it has some drawbacks too. Let's discuss some of the significant ones.

Independent variables become less interpretable: After applying PCA to the dataset, your original features are turned into principal components. Principal components are linear combinations of your original features, so they are not as readable and interpretable as the original features.

Data standardization is a must before PCA: You must standardize your data before applying PCA; otherwise the features with the largest scales dominate the variance, and PCA will not be able to find the optimal principal components.

Information loss: Although principal components try to cover the maximum variance among the features in a dataset, if we don't select the number of principal components with care, we may lose some information compared to the original list of features.

Not always good for visualization: PCA works well for dimensionality reduction. However, because it is a linear projection, it may not perform as expected when it comes to data visualization.
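One common way to select the number of principal components with care is to look at the cumulative share of variance they explain. The sketch below is an illustration on synthetic data; the 95% threshold is an assumed target, not a rule from this article.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 10 features driven mostly by 3 latent factors plus small noise.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the covariance matrix, sorted in descending order.
eigenvalues = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]

# Cumulative fraction of the total variance explained by the first k components.
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()

threshold = 0.95   # assumed target; choose what suits the application
# Smallest k whose cumulative explained variance reaches the threshold.
k = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(k)
```

Keeping fewer than `k` components would discard more than 5% of the variance in this example, which is the kind of information loss described above.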

Conclusion:

In this article, I have shown you what dimensionality reduction is and why it is so effective in the field of machine learning. I have also given you succinct information about PCA. There are many dimensionality reduction techniques available; which one to use, and when, depends on your model and your business requirements.

I hope you now have a brief overview of PCA!