MachineLearning 12. Dimensionality Reduction with t-SNE and Visualization (Rtsne)

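The 桓峰基因 WeChat public account is publishing a series on machine learning methods for clinical prediction. Each written tutorial comes with a video tutorial, and all of them are free to follow. The machine learning tutorials published so far are:

MachineLearning 1. Principal Component Analysis (PCA)

MachineLearning 2. Factor Analysis

MachineLearning 3. Cluster Analysis

MachineLearning 4. K-Nearest Neighbors (KNN) for cancer diagnosis

MachineLearning 5. Support Vector Machines (SVM) for cancer diagnosis and molecular subtyping

MachineLearning 6. Classification Trees for cancer diagnosis

MachineLearning 7. Regression Trees for cancer diagnosis

MachineLearning 8. Random Forest for cancer diagnosis

MachineLearning 9. Gradient Boosting for cancer diagnosis

MachineLearning 10. Neural Networks for cancer diagnosis

MachineLearning 11. Random forest survival analysis (randomForestSRC)

MachineLearning 12. Dimensionality Reduction with t-SNE and Visualization (Rtsne)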

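This installment introduces t-SNE, one of the most popular dimensionality reduction methods, and applies it to several datasets, with particular attention to single-cell sequencing data.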

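Introduction

t-SNE (t-distributed stochastic neighbor embedding) is a machine learning algorithm for dimensionality reduction, proposed by Laurens van der Maaten and Geoffrey Hinton in 2008. It grew out of SNE (Stochastic Neighbor Embedding; Hinton and Roweis, 2002). t-SNE is a nonlinear method that is well suited to reducing high-dimensional data to two or three dimensions for visualization. At its core it is an embedding model: it maps points from a high-dimensional space into a low-dimensional one while preserving the local structure of the dataset, so points that are neighbors in the original space tend to stay neighbors in the embedding. For visualizing data with nonlinear structure, t-SNE often separates groups more clearly than PCA, which is restricted to linear projections, and it is one of the most widely used dimensionality reduction methods for visualization today.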
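
For readers who want the underlying definitions, the block below summarizes the standard t-SNE formulation from van der Maaten and Hinton (2008). The notation is introduced here for illustration and is not taken from this tutorial: $x_i$ are the high-dimensional points, $y_i$ their low-dimensional images, $\sigma_i$ a per-point bandwidth, and $n$ the number of points.

```latex
\begin{align}
p_{j|i} &= \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}} \\
C &= \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\end{align}
```

Each $\sigma_i$ is chosen so that the conditional distribution $P_i$ has the user-specified perplexity, $2^{H(P_i)}$, where $H$ is the Shannon entropy in bits. The heavy-tailed Student-t kernel in $q_{ij}$ is what distinguishes t-SNE from the original SNE, which used a Gaussian in the low-dimensional space as well.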

Software installation

Set the Tsinghua CRAN mirror to speed up package installation:

options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
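if (!requireNamespace("Rtsne", quietly = TRUE)) install.packages("Rtsne")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")  # used for the plots below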

Loading the data

We prepared three datasets: the iris dataset used in the package example, a breast cancer biopsy dataset, and a single-cell transcriptome dataset.

1. Edgar Anderson's iris data

Description: This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Usage: iris
Format: iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

library(Rtsne)
library(ggplot2)
data("iris")
iris_unique <- unique(iris)  # Remove duplicates
iris_matrix <- as.matrix(iris_unique[, 1:4])
head(iris_matrix)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
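
By default Rtsne checks its input for duplicated rows (the `check_duplicates` argument) and stops if it finds any, which is why duplicates are removed above. The following is a minimal sketch of a variant that deduplicates on the four feature columns only, rather than on whole rows including the label. The names `features`, `keep`, `iris_dedup` and `iris_dedup_matrix` are illustrative and do not appear in the original code.

```r
data("iris")

# Feature matrix only (drop the Species label)
features <- as.matrix(iris[, 1:4])

# duplicated() on a matrix compares whole rows
keep <- !duplicated(features)

iris_dedup <- iris[keep, ]                          # labels stay aligned with the features
iris_dedup_matrix <- features[keep, , drop = FALSE]

cat("rows before:", nrow(iris), " rows after:", nrow(iris_dedup), "\n")
```

Deduplicating on the features guards against the case where two samples share identical measurements but carry different labels, which `unique()` on the full data frame would leave in place.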

2. Breast cancer biopsy data

We have used this dataset several times in this series. The class column is the binary outcome variable: benign or malignant.

Description: This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. There are 699 rows and 11 columns.
Usage: biopsy
Format: This data frame contains the following columns:
ID: sample code number (not unique).
V1: clump thickness.
V2: uniformity of cell size.
V3: uniformity of cell shape.
V4: marginal adhesion.
V5: single epithelial cell size.
V6: bare nuclei (16 values are missing).
V7: bland chromatin.
V8: normal nucleoli.
V9: mitoses.
class: "benign" or "malignant".

library(MASS)
data("biopsy")
head(biopsy)
##        ID V1 V2 V3 V4 V5 V6 V7 V8 V9     class
## 1 1000025  5  1  1  1  2  1  3  1  1    benign
## 2 1002945  5  4  4  5  7 10  3  2  1    benign
## 3 1015425  3  1  1  1  2  2  3  1  1    benign
## 4 1016277  6  8  8  1  3  4  3  7  1    benign
## 5 1017023  4  1  1  3  2  1  3  1  1    benign
## 6 1017122  8 10 10  8  7 10  9  7  1 malignant
data <- unique(na.omit(biopsy[, -1]))
head(data)
##   V1 V2 V3 V4 V5 V6 V7 V8 V9     class
## 1  5  1  1  1  2  1  3  1  1    benign
## 2  5  4  4  5  7 10  3  2  1    benign
## 3  3  1  1  1  2  2  3  1  1    benign
## 4  6  8  8  1  3  4  3  7  1    benign
## 5  4  1  1  3  2  1  3  1  1    benign
## 6  8 10 10  8  7 10  9  7  1 malignant
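
The dataset description mentions 16 missing values in V6, and Rtsne cannot work with missing values, so it is worth confirming what `na.omit` drops before deduplicating. A small sketch, assuming the MASS package is installed; `biopsy_complete` is an illustrative name that is not part of the original code.

```r
library(MASS)
data("biopsy")

# Missing values per column; only V6 should be affected
colSums(is.na(biopsy))

# Drop the ID column, then remove rows with any missing value
biopsy_complete <- na.omit(biopsy[, -1])

# Number of rows removed because of missing values
nrow(biopsy) - nrow(biopsy_complete)
```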

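3. Single-cell transcriptome dataset

The single-cell example data used to be available from the scater package, but it is no longer provided there, so I downloaded the files from GitHub; they work just as well: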

library(scater)
library(scRNAseq)
load("sc_example_counts.RData")
load("sc_example_cell_info.RData")
head(sc_example_cell_info)
##              Cell Mutation_Status Cell_Cycle Treatment
## Cell_001 Cell_001        positive          S    treat1
## Cell_002 Cell_002        positive         G0    treat1
## Cell_003 Cell_003        negative         G1    treat1
## Cell_004 Cell_004        negative          S    treat1
## Cell_005 Cell_005        negative         G1    treat2
## Cell_006 Cell_006        negative         G0    treat1
sc_example_counts = unique(sc_example_counts)
sc_example_counts[1:5, 1:5]
##           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005
## Gene_0001        0      123        2        0        0
## Gene_0002      575       65        3     1561     2311
## Gene_0003        0        0        0        0     1213
## Gene_0004        0        1        0        0        0
## Gene_0005        0        0       11        0        0

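Parameter description

Barnes-Hut implementation of t-Distributed Stochastic Neighbor Embedding
Description: Wrapper for the C++ implementation of Barnes-Hut t-Distributed Stochastic Neighbor Embedding. t-SNE is a method for constructing a low dimensional embedding of high-dimensional data, distances or similarities. Exact t-SNE can be computed by setting theta=0.0.
Usage: Rtsne(X, ...)

X: the input data, usually a numeric matrix with one observation per row; Rtsne also provides methods for data frames and dist objects;
dims: the dimensionality of the embedding, default 2;
perplexity: roughly the effective number of neighbors considered for each point, default 30; it has to be small relative to the number of rows, otherwise Rtsne stops with an error;
theta: the speed/accuracy trade-off of the Barnes-Hut approximation, default 0.5; larger values run faster but are less accurate, and theta = 0.0 computes exact t-SNE at the cost of a longer run time (the examples below use theta = 0);
pca: whether to run an initial PCA step on the input, default TRUE; the examples below set it to FALSE for the iris and single-cell data and to TRUE for the biopsy data.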
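
To make these parameters concrete, here is a minimal sketch on a small synthetic matrix. The matrix `X`, its size, and the variable names are made up for illustration. The `stopifnot` line is a deliberately conservative guard of my own, meant to stay clear of the perplexity error that Rtsne raises when there are too few samples.

```r
library(Rtsne)

set.seed(42)
# Hypothetical data: 60 observations (rows) by 10 variables (columns)
X <- matrix(rnorm(60 * 10), nrow = 60, ncol = 10)

perplexity <- 10
# Conservative guard: keep 3 * perplexity below nrow(X) - 1
stopifnot(3 * perplexity < nrow(X) - 1)

fit <- Rtsne(X,
             dims = 2,                 # dimensionality of the embedding
             perplexity = perplexity,  # effective neighborhood size
             theta = 0.5,              # Barnes-Hut approximation; 0 requests exact t-SNE
             pca = TRUE,               # run an initial PCA step on the input
             check_duplicates = TRUE)  # stop if X contains duplicated rows

dim(fit$Y)  # one row per observation, one column per embedding dimension
```

The embedding coordinates are returned in `fit$Y`, in the same row order as `X`, which is what allows labels to be attached to the result afterwards.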

Worked examples

1. Dimensionality reduction of the iris dataset

Pass in the matrix, fit the model, and inspect the result:

# Set a seed if you want reproducible results
set.seed(123)
tsne_out <- Rtsne(iris_matrix, pca = FALSE, perplexity = 30, theta = 0)  # Run TSNE
res <- as.data.frame(tsne_out$Y)
res$Class = iris_unique$Species
length(unique(res$Class))
## [1] 3
# Plot the embedding; theme_bw() comes before theme() so it does not reset the centered title
ggplot(res, aes(x = V1, y = V2, color = Class)) + geom_point(size = 1.25) + labs(title = "t-SNE", x = "TSNE1", y = "TSNE2") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
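
Since the introduction compares t-SNE with PCA, it can help to plot a PCA projection of the same matrix as a linear baseline. The sketch below uses base R's `prcomp`. The object names (`pca_fit`, `pca_res`, `p_pca`) and the choice to center and scale the variables are assumptions made for illustration, not part of the original analysis.

```r
library(ggplot2)

data("iris")
iris_unique <- unique(iris)
iris_matrix <- as.matrix(iris_unique[, 1:4])

# Linear baseline: project onto the first two principal components
pca_fit <- prcomp(iris_matrix, center = TRUE, scale. = TRUE)
pca_res <- as.data.frame(pca_fit$x[, 1:2])
pca_res$Class <- iris_unique$Species

p_pca <- ggplot(pca_res, aes(x = PC1, y = PC2, color = Class)) +
  geom_point(size = 1.25) +
  labs(title = "PCA", x = "PC1", y = "PC2") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
print(p_pca)
```

Unlike t-SNE, the PCA axes have a direct interpretation as linear combinations of the original variables, and distances between far-apart groups remain meaningful, so the two plots complement each other.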

2. Dimensionality reduction of the biopsy dataset

Pass in the matrix, fit the model, and inspect the result:

set.seed(123)
tsne <- Rtsne(as.matrix(data[, 1:9]), pca = TRUE, perplexity = 10, theta = 0)  # Run TSNE
res <- as.data.frame(tsne$Y)
res$Class = data$class
length(unique(res$Class))
## [1] 2
head(res)
##           V1        V2     Class
## 1  -7.126002  32.10630    benign
## 2 -20.002728 -20.17814    benign
## 3 -12.207140  41.77637    benign
## 4   8.327698 -24.82569    benign
## 5  11.560514  34.23023    benign
## 6  14.006864 -56.03128 malignant
# Plot the embedding; theme_bw() comes before theme() so it does not reset the centered title
ggplot(res, aes(x = V1, y = V2, color = Class)) + geom_point(size = 1.25) + labs(title = "t-SNE", x = "TSNE1", y = "TSNE2") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

3. Dimensionality reduction of the single-cell transcriptome dataset

set.seed(123)
tsne <- Rtsne(as.matrix(t(sc_example_counts)), pca = FALSE, perplexity = 10, theta = 0)  # Run TSNE
res <- as.data.frame(tsne$Y)
# match() aligns the labels to the column (cell) order of the count matrix
res$Class = sc_example_cell_info$Cell_Cycle[match(colnames(sc_example_counts), sc_example_cell_info$Cell)]
length(unique(res$Class))
## [1] 4
head(res)
##          V1         V2 Class
## 1 -52.47414  29.539762     S
## 2  24.23850 -12.023524    G0
## 3  28.90259  48.070417    G1
## 4  72.40250   9.639261     S
## 5 -41.68105  53.376565    G1
## 6  38.48649  12.414251    G0
ggplot(res, aes(x = V1, y = V2, color = Class)) + geom_point(size = 1.25) + labs(title = "t-SNE", x = "TSNE1", y = "TSNE2") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
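
When attaching cell labels to the embedding, the label order has to follow the column order of the count matrix, because each row of `tsne$Y` corresponds to one column (cell) of the counts. The self-contained sketch below uses made-up toy data (`counts`, `cell_info` and their values are hypothetical) to show how `%in%` and `match()` differ when the metadata is sorted differently from the matrix.

```r
# Toy count matrix: columns are cells, in the order C, A, B
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("Gene_1", "Gene_2"),
                                 c("Cell_C", "Cell_A", "Cell_B")))

# Toy metadata, sorted alphabetically by cell name
cell_info <- data.frame(Cell = c("Cell_A", "Cell_B", "Cell_C"),
                        Cell_Cycle = c("G0", "G1", "S"),
                        stringsAsFactors = FALSE)

# %in% filters the metadata but keeps the metadata's own row order
by_in <- cell_info[cell_info$Cell %in% colnames(counts), ]$Cell_Cycle

# match() returns, for each column of counts, the matching metadata row
by_match <- cell_info$Cell_Cycle[match(colnames(counts), cell_info$Cell)]

by_in      # "G0" "G1" "S"   (metadata order: A, B, C)
by_match   # "S"  "G0" "G1"  (count-matrix order: C, A, B)
```

If the metadata and the count matrix already list the cells in the same order, both approaches give the same labels; `match()` simply removes the dependence on that assumption.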

All three datasets are analyzed here so that you can pick the approach that best fits your own data.

References:

van der Maaten, L.J.P., 2014. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research, 15, pp.3221-3245.
van der Maaten, L.J.P. & Hinton, G.E., 2008. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, 9, pp.2579-2605.