Dimensionality Reduction Using t-Distributed Stochastic Neighbor Embedding (t-SNE) on the MNIST Dataset

It is easy for us to visualize two- or three-dimensional data, but beyond three dimensions it becomes much harder to see what the data looks like.

Today we often need to analyze and find patterns in datasets with thousands or even millions of dimensions, which makes visualization a challenge. One tool that can help us better understand such data is dimensionality reduction.

In this post, I will discuss t-SNE, a popular non-linear dimensionality reduction technique, and how to implement it in Python using sklearn. The dataset I have chosen here is the popular MNIST dataset.

Table of Curiosities

What is t-SNE and how does it work?
How is t-SNE different from PCA?
How can we improve upon t-SNE?
What are the limitations?
What can we do next?

Overview

t-Distributed Stochastic Neighbor Embedding, or t-SNE, is a machine learning algorithm that is often used to embed high-dimensional data in a low-dimensional space [1].

In simple terms, the approach of t-SNE can be broken down into two steps. The first step is to represent the high-dimensional data by constructing a probability distribution P, under which similar points have a high probability of being picked and dissimilar points have a low probability. The second step is to create a low-dimensional space with another probability distribution Q that preserves the properties of P as closely as possible.

In step 1, we compute the similarity between two data points using a conditional probability p. For example, the conditional probability of j given i is the probability that x_i would pick x_j as its neighbor, assuming neighbors are picked in proportion to their probability density under a Gaussian distribution centered at x_i [1]. In step 2, we let y_i and y_j be the low-dimensional counterparts of x_i and x_j, respectively. We then define q as the analogous probability of y_i picking y_j, this time using a Student t-distribution in the low-dimensional map. The locations of the low-dimensional data points are determined by minimizing the Kullback–Leibler divergence of the probability distribution P from Q.
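The two steps above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions, not the implementation sklearn uses: it applies one shared Gaussian bandwidth `sigma` to every point (t-SNE normally tunes a bandwidth per point from the perplexity), and the function names and random toy data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distance between every pair of rows in A."""
    sq_norms = np.sum(A ** 2, axis=1)
    d = sq_norms[:, None] + sq_norms[None, :] - 2.0 * A @ A.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off

def conditional_probabilities(X, sigma=1.0):
    """P[i, j] = p_{j|i}: Gaussian similarity centered at x_i, one shared sigma (assumption)."""
    affinities = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)  # a point never picks itself as a neighbor
    return affinities / affinities.sum(axis=1, keepdims=True)

def student_t_similarities(Y):
    """Q[i, j]: similarity in the low-dimensional map under a Student t-distribution (1 degree of freedom)."""
    kernel = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the quantity t-SNE minimizes over the positions Y."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# Toy data: 6 points in 10 dimensions and a random 2D layout for them
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))
Y = rng.normal(size=(6, 2))

P_cond = conditional_probabilities(X)
P = (P_cond + P_cond.T) / (2 * X.shape[0])  # symmetrized joint distribution
Q = student_t_similarities(Y)
cost = kl_divergence(P, Q)
print(cost)
```

In the full algorithm, the positions `Y` would be updated by gradient descent to reduce `cost`; here they are left at their random starting values.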

For more technical details of t-SNE, check out this paper.

I have chosen the MNIST dataset from Kaggle (link) as the example here because it is a simple computer vision dataset of 28x28 pixel images of handwritten digits (0–9). We can think of each instance as a data point embedded in a 784-dimensional space.
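To make the 784-dimensional view concrete, here is a small sketch of how a 28x28 image becomes a single point. The random array is a hypothetical stand-in for one digit, not real MNIST data.

```python
import numpy as np

# Hypothetical stand-in for one MNIST digit: a 28x28 grid of pixel intensities (0-255)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

# Flattening the grid row by row gives one point in 784-dimensional space
point = image.reshape(-1)
print(point.shape)  # (784,)

# Reshaping the point recovers the original image
restored = point.reshape(28, 28)
```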

To see the full Python code, check out my Kaggle kernel.

Without further ado, let’s get to the details!

Exploration

Note that in the original Kaggle competition, the goal is to use the labeled training images to build an ML model that accurately predicts the labels on the test set. For our purposes here, we will only use the training set.

As usual, we check its shape first:

train.shape
--------------------------------------------------------------------
(42000, 785)

There are 42K training instances. The 785 columns are the 784 pixel values plus the ‘label’ column.

We can check the label distribution as well:

label = train["label"]
label.value_counts()
--------------------------------------------------------------------
1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64

Principal Component Analysis (PCA)

Before we implement t-SNE, let’s try PCA, a popular linear method for dimensionality reduction.

After we standardize the data, we can transform it using PCA (with ‘n_components’ set to 2):

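from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Drop the 'label' column (assumed to still be in train) so only the 784 pixel columns are used
train = StandardScaler().fit_transform(train.drop(columns="label"))
pca = PCA(n_components=2)
pca_res = pca.fit_transform(train)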

Let’s make a scatter plot to visualize the result:

import seaborn as sns
sns.scatterplot(x=pca_res[:, 0], y=pca_res[:, 1], hue=label, palette=sns.hls_palette(10), legend='full');

2D Scatter plot of MNIST data after applying PCA

As shown in the scatter plot, PCA with two components does not reveal much meaningful structure among the different labels. One known drawback of PCA is that a linear projection can’t capture non-linear dependencies. Let’s try t-SNE now.
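One way to quantify how much information a two-component projection keeps is the explained variance ratio. The sketch below is an illustration only: it uses sklearn's small built-in digits dataset (8x8 images) as a stand-in, which is an assumption and not the Kaggle MNIST data used in this post.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 1,797 digit images with 64 pixel features each
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance captured by the two principal components
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.1%}")
```

A low retained fraction suggests that a 2D linear projection discards much of the structure in the data.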

t-SNE with sklearn

We will implement t-SNE using sklearn.manifold (documentation):

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
tsne_res = tsne.fit_transform(train)
sns.scatterplot(x=tsne_res[:, 0], y=tsne_res[:, 1], hue=label, palette=sns.hls_palette(10), legend='full');

2D Scatter plot of MNIST data after applying t-SNE

Now we can see that the different clusters are more separable than in the PCA result. Here are a few observations on this plot:

The “5” data points seem to be more spread out than other clusters such as “2” and “4”.
A few “5” and “8” data points are similar to “3”s.
There are two clusters of “7” and “9”, which sit next to each other.

An Approach that Combines Both

It is generally recommended to use PCA or TruncatedSVD to reduce the number of dimensions to a reasonable amount (e.g. 50) before applying t-SNE [2].

Doing so can reduce the level of noise as well as speed up the computations.

Let’s try PCA (50 components) first and then apply t-SNE. Here is the scatter plot:

2D Scatter plot of MNIST data after applying PCA(50 components) and then t-SNE
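The combined approach could be sketched roughly as follows. This is a minimal illustration, not the code from the Kaggle kernel: the helper name `pca_then_tsne` is hypothetical, and the demo uses the first 500 rows of sklearn's built-in digits dataset as a stand-in so that it runs quickly.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def pca_then_tsne(X, n_pca_components=50, random_state=0):
    """Standardize X, reduce it with PCA, then embed the result in 2D with t-SNE."""
    X_scaled = StandardScaler().fit_transform(X)
    X_pca = PCA(n_components=n_pca_components, random_state=random_state).fit_transform(X_scaled)
    return TSNE(n_components=2, random_state=random_state).fit_transform(X_pca)

# Stand-in data (assumption): 500 digit images with 64 pixel features each
X, y = load_digits(return_X_y=True)
X_small, y_small = X[:500], y[:500]

embedding = pca_then_tsne(X_small)
print(embedding.shape)  # (500, 2)
```

With the data from this post, the same helper would be called on the 784 pixel columns, and the two columns of the result plotted with sns.scatterplot as before.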

Compared with the previous scatter plot, we can now separate the 10 clusters better. Here are a few observations:

Most of the “5” data points are not as spread out as before, although a few still look like “3”s.
There is now one cluster of “7” and one cluster of “9”.

In addition, the runtime of this approach was over 60% lower.

For more interactive 3D scatter plots, check out this post.

Limitations

Here are a few limitations of t-SNE:

Unlike PCA, the cost function of t-SNE is non-convex, meaning the optimization can get stuck in a local minimum.
As with other dimensionality reduction techniques, the compressed dimensions and the transformed features become harder to interpret.
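The effect of the non-convex cost function can be observed by running t-SNE from different random starting points and comparing the final cost of each run. The sketch below assumes sklearn's built-in digits dataset as a small stand-in and uses random initialization so that the seed actually changes the starting layout.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Stand-in data (assumption): the first 300 digit images from sklearn's built-in dataset
X, _ = load_digits(return_X_y=True)
X_small = X[:300]

final_costs = {}
for seed in (0, 1, 2):
    # init="random" makes the starting layout depend on the seed
    tsne = TSNE(n_components=2, init="random", random_state=seed)
    tsne.fit_transform(X_small)
    # kl_divergence_ holds the Kullback-Leibler divergence after optimization
    final_costs[seed] = tsne.kl_divergence_

print(final_costs)
```

If the runs end at different values, they have settled in different local minima. A common precaution is to run several seeds and keep the embedding with the lowest final cost.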

Next Steps

Here are a few things that we can try as next steps:

Hyperparameter tuning: try tuning ‘perplexity’ and see its effect on the visualized output.
Try some of the other non-linear techniques, such as Uniform Manifold Approximation and Projection (UMAP), which is sometimes described as a generalization of t-SNE and is based on Riemannian geometry.
Train ML models on the transformed data and compare their performance with that of models trained without dimensionality reduction.
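As a starting point for the perplexity experiment, one could fit t-SNE once per candidate value and compare the resulting plots side by side. The sketch below makes two assumptions: it uses sklearn's built-in digits dataset as a stand-in, and the candidate values 5, 30 and 50 are an arbitrary illustrative choice.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Stand-in data (assumption): the first 300 digit images from sklearn's built-in dataset
X, y = load_digits(return_X_y=True)
X_small, y_small = X[:300], y[:300]

embeddings = {}
for perplexity in (5, 30, 50):
    # perplexity must stay below the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_small)

for perplexity, emb in embeddings.items():
    print(perplexity, emb.shape)
```

Each entry in `embeddings` can then be passed to sns.scatterplot with `hue=y_small` to compare how tight or spread out the clusters look at each setting.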

Summary

Let’s quickly recap.

We implemented t-SNE using sklearn on the MNIST dataset. We compared the visualized output with that of PCA, and lastly, we tried a mixed approach that applies PCA first and then t-SNE.

I hope you enjoyed this blog post. Please share any thoughts that you may have :)


Translated from: https://towardsdatascience.com/dimensionality-reduction-using-t-distributed-stochastic-neighbor-embedding-t-sne-on-the-mnist-9d36a3dd4521
