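This post introduces t-SNE, one of the best dimensionality reduction methods, and applies it to several datasets, in particular single-cell sequencing data.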


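t-SNE (t-distributed stochastic neighbor embedding) is a machine learning algorithm for dimensionality reduction, proposed by Laurens van der Maaten and Geoffrey Hinton in 2008. It grew out of SNE (Stochastic Neighbor Embedding; Hinton and Roweis, 2002). t-SNE is a nonlinear embedding method: it maps data from a high-dimensional space into a low-dimensional one, usually 2 or 3 dimensions for visualization, while preserving the local structure of the dataset. Because it is nonlinear, it often separates clusters more clearly than PCA in a 2-D plot, which makes it one of the most widely used methods for visualizing high-dimensional data.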
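As background, the quantities behind "preserving local structure" can be written out as follows. These are the standard definitions from van der Maaten and Hinton (2008); the symbols x_i (high-dimensional points), y_i (their low-dimensional images), sigma_i (per-point bandwidth) and n (number of points) are notation introduced here for illustration.

```latex
% Similarity of point j to point i in the high-dimensional space (Gaussian kernel)
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Similarity in the low-dimensional map (Student t kernel, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Cost minimized by gradient descent over the y_i
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Each sigma_i is chosen so that the perplexity of the distribution around x_i matches the perplexity value supplied by the user. The heavy tails of the t kernel let moderately distant points sit further apart in the map, which is the main change relative to SNE.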



# Use the Tsinghua CRAN mirror when installing packages
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))



1. Edgar Anderson's Iris Data

Description: This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Usage: iris

Format: iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

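iris_unique <- unique(iris)  # Remove duplicate rows; Rtsne rejects duplicates by default
iris_matrix <- as.matrix(iris_unique[, 1:4])  # Keep the four numeric measurement columns
head(iris_matrix)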
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
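Rtsne checks for duplicated rows by default (check_duplicates = TRUE) and stops if it finds any, which is why the data is passed through unique() first. Below is a minimal sketch of the same idea, restricted to the columns that are actually fed to t-SNE. dedupe_features is a hypothetical helper written for this illustration; it is not part of Rtsne.

```r
# Hypothetical helper: keep the first occurrence of each distinct feature vector.
dedupe_features <- function(df, feature_cols) {
  keep <- !duplicated(df[, feature_cols, drop = FALSE])
  df[keep, , drop = FALSE]
}

iris_dedup <- dedupe_features(iris, 1:4)
nrow(iris)        # rows before
nrow(iris_dedup)  # rows after removing repeated measurement vectors

# The matrix handed to t-SNE should now contain no repeated rows
feature_matrix <- as.matrix(iris_dedup[, 1:4])
stopifnot(!any(duplicated(feature_matrix)))
```

Unlike unique(iris), this compares only the feature columns, so two rows with identical measurements but different labels would also be collapsed into one.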

2. Biopsy Data on Breast Cancer Patients

Description: This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. There are 699 rows and 11 columns.

Usage: biopsy

Format: This data frame contains the following columns:

  ID: sample code number (not unique).
  V1: clump thickness.
  V2: uniformity of cell size.
  V3: uniformity of cell shape.
  V4: marginal adhesion.
  V5: single epithelial cell size.
  V6: bare nuclei (16 values are missing).
  V7: bland chromatin.
  V8: normal nucleoli.
  V9: mitoses.
  class: “benign” or “malignant”.

library(MASS)  # provides the biopsy data set
head(biopsy)
##        ID V1 V2 V3 V4 V5 V6 V7 V8 V9     class
## 1 1000025  5  1  1  1  2  1  3  1  1    benign
## 2 1002945  5  4  4  5  7 10  3  2  1    benign
## 3 1015425  3  1  1  1  2  2  3  1  1    benign
## 4 1016277  6  8  8  1  3  4  3  7  1    benign
## 5 1017023  4  1  1  3  2  1  3  1  1    benign
## 6 1017122  8 10 10  8  7 10  9  7  1 malignant
data <- unique(na.omit(biopsy[, -1]))  # Drop the ID column, rows with missing values, and duplicates
head(data)
##   V1 V2 V3 V4 V5 V6 V7 V8 V9     class
## 1  5  1  1  1  2  1  3  1  1    benign
## 2  5  4  4  5  7 10  3  2  1    benign
## 3  3  1  1  1  2  2  3  1  1    benign
## 4  6  8  8  1  3  4  3  7  1    benign
## 5  4  1  1  3  2  1  3  1  1    benign
## 6  8 10 10  8  7 10  9  7  1 malignant
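To see how much each preprocessing step removes, the three operations in the line above can be applied one at a time. This is only an illustrative breakdown of the same call; the variable names are made up here.

```r
library(MASS)  # biopsy ships with the MASS package

features <- biopsy[, -1]       # drop the ID column
complete <- na.omit(features)  # drop rows with a missing value
deduped  <- unique(complete)   # drop repeated rows

counts <- c(original = nrow(features),
            complete = nrow(complete),
            deduped  = nrow(deduped))
print(counts)

# Missing values per column; according to the help page only V6 has any
colSums(is.na(features))
```

Note that unique() here compares the nine scores together with the class label, so it only removes rows that agree on all ten columns.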

3. Single-cell transcriptome dataset


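library(scater)  # assumption: an older scater release that still ships these example data sets
data("sc_example_counts")
data("sc_example_cell_info")
head(sc_example_cell_info)
##              Cell Mutation_Status Cell_Cycle Treatment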
## Cell_001 Cell_001        positive          S    treat1
## Cell_002 Cell_002        positive         G0    treat1
## Cell_003 Cell_003        negative         G1    treat1
## Cell_004 Cell_004        negative          S    treat1
## Cell_005 Cell_005        negative         G1    treat2
## Cell_006 Cell_006        negative         G0    treat1
# unique() on a matrix drops duplicated rows (genes); the columns (cells) are unchanged
sc_example_counts <- unique(sc_example_counts)
sc_example_counts[1:5, 1:5]
##           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005
## Gene_0001        0      123        2        0        0
## Gene_0002      575       65        3     1561     2311
## Gene_0003        0        0        0        0     1213
## Gene_0004        0        1        0        0        0
## Gene_0005        0        0       11        0        0
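Rtsne treats each row as one observation, while expression matrices like the one above are stored as genes x cells, so the matrix has to be transposed before cells can be embedded. Here is a small sketch with simulated counts. The toy matrix and its gene and cell names are made up for illustration and are not the scater data.

```r
set.seed(1)
# Toy count matrix: 6 genes (rows) x 4 cells (columns)
toy_counts <- matrix(rpois(24, lambda = 5), nrow = 6,
                     dimnames = list(paste0("Gene_", 1:6), paste0("Cell_", 1:4)))
# Append a gene whose counts repeat those of Gene_1
toy_counts <- rbind(toy_counts, Gene_7 = toy_counts["Gene_1", ])

genes_unique   <- unique(toy_counts)  # removes repeated gene rows
cells_by_genes <- t(genes_unique)     # rows are now cells, one per observation

dim(toy_counts)      # genes x cells
dim(cells_by_genes)  # cells x genes
```

Because unique() works on rows, calling it before the transpose removes repeated genes, not repeated cells. To guard against repeated cells, it would have to be applied after t().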


Barnes-Hut implementation of t-Distributed Stochastic Neighbor Embedding

Description: Wrapper for the C++ implementation of Barnes-Hut t-Distributed Stochastic Neighbor Embedding. t-SNE is a method for constructing a low dimensional embedding of high-dimensional data, distances or similarities. Exact t-SNE can be computed by setting theta=0.0.

Usage: Rtsne(X, …)

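  1. The input to Rtsne should be a matrix, with one row per observation;

  2. dims sets the number of dimensions after reduction, 2 by default;

  3. theta trades speed for accuracy: the larger the value, the less accurate the result. The default is 0.5; the examples here set theta to 0.0, which computes exact t-SNE at the cost of a longer running time;

  4. pca controls whether PCA is first applied to the input data. The examples below set it to TRUE for the biopsy data and FALSE for the iris and single-cell data.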
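Besides the arguments listed above, the examples below also set perplexity, which loosely corresponds to the number of effective neighbors considered for each point (30 by default). Rtsne refuses values that are too large for the sample size. To my understanding the check is roughly 3 * perplexity <= nrow(X) - 1; treat that rule as an assumption and verify it against the installed version. max_perplexity and safe_perplexity are hypothetical helpers, not Rtsne functions.

```r
# Hypothetical helper: largest perplexity allowed for n observations,
# assuming the rule 3 * perplexity <= n - 1
max_perplexity <- function(n) floor((n - 1) / 3)

# Use the requested perplexity, capped at the assumed limit
safe_perplexity <- function(n, requested = 30) min(requested, max_perplexity(n))

max_perplexity(149)       # upper bound for the de-duplicated iris data
safe_perplexity(149, 30)  # 30 is within the bound, so it is kept
safe_perplexity(40, 30)   # a small data set forces a smaller value
```

A cap like this is mainly useful for small data sets, such as a few dozen cells, where the default of 30 would be rejected.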


1. Dimensionality reduction of the iris dataset


library(Rtsne)
library(ggplot2)
set.seed(42)  # Set a seed if you want reproducible results (42 is an arbitrary choice)
tsne_out <- Rtsne(iris_matrix, pca = FALSE, perplexity = 30, theta = 0)  # Run t-SNE
res <- as.data.frame(tsne_out$Y)
res$Class <- iris_unique$Species
length(unique(res$Class))  # number of classes
## [1] 3
# Plot the embedding
ggplot(res, aes(x = V1, y = V2, color = Class)) + geom_point(size = 1.25) +
  labs(title = "t-SNE", x = "TSNE1", y = "TSNE2") +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))  # theme_bw() first, or it resets the centered title
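Since the introduction compares t-SNE with PCA, it can help to draw the PCA projection of the same matrix next to the t-SNE plot. Below is a minimal sketch using base R's prcomp. Scaling the variables is a choice made here, not something the text prescribes.

```r
iris_unique <- unique(iris)
iris_matrix <- as.matrix(iris_unique[, 1:4])

# Linear baseline: project onto the first two principal components
pca_out <- prcomp(iris_matrix, center = TRUE, scale. = TRUE)
pca_res <- as.data.frame(pca_out$x[, 1:2])
pca_res$Class <- iris_unique$Species

# Share of variance captured by each of the first two components
summary(pca_out)$importance["Proportion of Variance", 1:2]

# Base-graphics scatter plot, colored by species
plot(pca_res$PC1, pca_res$PC2, col = as.integer(pca_res$Class), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA")
legend("topright", legend = levels(pca_res$Class), col = 1:3, pch = 19)
```

PCA axes are linear combinations of the original variables and can be interpreted directly. t-SNE axes have no such meaning, so the two plots should be compared only in terms of how the groups separate.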

2. Dimensionality reduction of the biopsy dataset


tsne <- Rtsne(as.matrix(data[, 1:9]), pca = TRUE, perplexity = 10, theta = 0)  # Run t-SNE
res <- as.data.frame(tsne$Y)
res$Class <- data$class
length(unique(res$Class))  # number of classes
## [1] 2
head(res)
##           V1        V2     Class
## 1  -7.126002  32.10630    benign
## 2 -20.002728 -20.17814    benign
## 3 -12.207140  41.77637    benign
## 4   8.327698 -24.82569    benign
## 5  11.560514  34.23023    benign
## 6  14.006864 -56.03128 malignant
# Plot the embedding
ggplot(res, aes(x = V1, y = V2, color = Class)) + geom_point(size = 1.25) +
  labs(title = "t-SNE", x = "TSNE1", y = "TSNE2") +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))  # theme_bw() first, or it resets the centered title

3. Dimensionality reduction of the single-cell transcriptome dataset

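# Transpose so that rows are cells and columns are genes
tsne <- Rtsne(as.matrix(t(sc_example_counts)), pca = FALSE, perplexity = 10, theta = 0)  # Run t-SNE
res <- as.data.frame(tsne$Y)
# Match labels to cells by name so their order follows the columns of the count matrix
res$Class <- sc_example_cell_info$Cell_Cycle[match(colnames(sc_example_counts), sc_example_cell_info$Cell)]
length(unique(res$Class))  # number of classes
## [1] 4
head(res)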
##          V1         V2 Class
## 1 -52.47414  29.539762     S
## 2  24.23850 -12.023524    G0
## 3  28.90259  48.070417    G1
## 4  72.40250   9.639261     S
## 5 -41.68105  53.376565    G1
## 6  38.48649  12.414251    G0
ggplot(res, aes(x = V1, y = V2, color = Class)) + geom_point(size = 1.25) +
  labs(title = "t-SNE", x = "TSNE1", y = "TSNE2") +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))  # theme_bw() first, or it resets the centered title
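t-SNE starts from a random initial layout, so the coordinates printed above will differ from run to run unless a seed is fixed. Below is a sketch of a small wrapper that sets the seed before each call. run_tsne is a hypothetical name, the seed value is arbitrary, and the embedding is only run when the Rtsne package is installed.

```r
# Hypothetical wrapper: fix the RNG seed, then run Rtsne and return the coordinates
run_tsne <- function(X, seed = 42, ...) {
  set.seed(seed)
  out <- Rtsne::Rtsne(X, ...)
  as.data.frame(out$Y)
}

if (requireNamespace("Rtsne", quietly = TRUE)) {
  X <- as.matrix(unique(iris)[, 1:4])
  run1 <- run_tsne(X, pca = FALSE, perplexity = 30, theta = 0)
  run2 <- run_tsne(X, pca = FALSE, perplexity = 30, theta = 0)
  # With the same seed and default settings the two layouts are expected to match
  print(isTRUE(all.equal(run1, run2)))
}
```

Fixing the seed makes a figure repeatable, but it does not make one layout more "correct" than another. Comparing a few seeds is a cheap way to check that the clusters are stable.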





  1. van der Maaten, L.J.P., 2014. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research, 15, pp.3221-3245.

  2. van der Maaten, L.J.P. & Hinton, G.E., 2008. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, 9, pp.2579-2605.

