【Paper Overview】Deep Models: Dimensionality Reduction and Clustering

【Paper 1】InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Title: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
Authors: Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel
Keywords:
Year: 2016
Source: NIPS / arXiv
paper: https://proceedings.neurips.cc/paper/2016/hash/7c9d0b1f96aebd7b5eca8c3edaa19ebb-Abstract.html (proceedings), https://arxiv.org/pdf/1606.03657.pdf (arXiv)
code:
Citation: Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. NIPS 2016:2172–80.

abstract

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.

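Intuitive understanding

Basic research problem
Unsupervised representation learning with GANs.
Existing problems
Representation learning is decoupled from the downstream task. Without access to downstream information (such as the labels of a supervised task), the representation has to anticipate what downstream tasks are likely to need and extract salient, semantically meaningful features from the raw data.
The goal is to use a generative model (GAN) to learn a representation that is interpretable, performs well, and is meaningful.
Main method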
maximizing the mutual information between a fixed small subset of the GAN’s noise variables and the observations

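The noise vector is split into two parts: (1) z, a source of incompressible noise; (2) c, the latent code, which captures the hidden semantic structure of the data distribution. The latent code is made up of L components $c_1, c_2, \ldots, c_L$, with c denoting their concatenation, and is simply assumed to follow a factored distribution: $P(c_1,c_2,\ldots,c_L)=\prod_{i=1}^{L}P(c_i)$.
Mutual information is added to the objective to keep G from ignoring c (a plain GAN is free to find a solution satisfying $P_G(x|c)=P_G(x)$), so $I(c; G(z,c))$ is required to be large. Equivalently, given $x \sim P_G(x)$, the posterior $P_G(c|x)$ should have low entropy.
Using $I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$, the objective is: $\min_G\max_D V_I(D,G)=V(D,G)-\lambda I(c;G(z,c))$
A variational lower bound on the mutual information is derived that avoids the intractable posterior $P_G(c|x)$ by replacing it with an auxiliary distribution $Q(c|x)$. The bound can be approximated with Monte Carlo sampling and added to the GAN objective without changing the GAN training procedure. The objective then becomes:
$\min_{G,Q}\max_D V_{\text{InfoGAN}}(D,G,Q)=V(D,G)-\lambda L_1(G,Q)$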
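The bound has the form $L_1(G,Q)=\mathbb{E}_{c\sim P(c),\,x\sim G(z,c)}[\log Q(c|x)]+H(c)$. Below is a minimal NumPy sketch of its Monte Carlo estimate for a single categorical code. It assumes the outputs of Q are already available as an array; the names `info_lower_bound` and `q_probs` are illustrative and not taken from the paper's code.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def info_lower_bound(codes, q_probs, prior):
    """Monte Carlo estimate of L_1 = E[log Q(c|x)] + H(c).

    codes   : (N,) categorical codes sampled from the prior P(c)
    q_probs : (N, K) Q(c|x) evaluated on x = G(z, c); supplied directly here
    prior   : (K,) the code prior P(c)
    """
    codes = np.asarray(codes)
    q_probs = np.asarray(q_probs, dtype=float)
    log_q = np.log(q_probs[np.arange(len(codes)), codes] + 1e-12)
    return float(log_q.mean() + entropy(prior))

rng = np.random.default_rng(0)
K = 10
prior = np.full(K, 1.0 / K)
codes = rng.integers(0, K, size=1000)

# Case 1: G ignores c, so Q can do no better than output the prior.
uninformative = np.tile(prior, (len(codes), 1))
# Case 2: Q recovers c almost perfectly from the generated sample.
informative = np.full((len(codes), K), 0.01 / (K - 1))
informative[np.arange(len(codes)), codes] = 0.99

print(info_lower_bound(codes, uninformative, prior))  # close to 0
print(info_lower_bound(codes, informative, prior))    # close to H(c) = log 10
```

The two cases illustrate why maximizing the bound discourages the generator from discarding c: the estimate is near zero when the code cannot be recovered and approaches H(c) when it can.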

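Implementation: Q is a neural network that shares its convolutional layers with D and adds one final fully connected layer that outputs the parameters of the distribution $Q(c \mid x)$. For discrete c, a softmax is used and λ is set to 1. For continuous c, the distribution should in principle be chosen to match the true posterior, but simply defaulting to a Gaussian works; λ is then tuned so that the two terms of the objective are on the same scale. The GAN part is based on DC-GAN.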
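A minimal sketch of such a Q head follows, assuming the shared features are already computed and that the code consists of one categorical variable plus a few continuous ones. The layer layout, the log-standard-deviation parameterization, and all names are assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def q_head_log_prob(features, W, b, c_cat, c_cont, n_cat):
    """log Q(c|x) for one categorical code and several continuous codes.

    features : (N, F) output of the layers shared with D (assumed given)
    W, b     : extra fully connected layer, shapes (F, n_cat + 2*n_cont) and (n_cat + 2*n_cont,)
    c_cat    : (N,) integer categorical code
    c_cont   : (N, n_cont) continuous codes
    """
    out = features @ W + b
    n_cont = c_cont.shape[1]
    logits = out[:, :n_cat]
    mean = out[:, n_cat:n_cat + n_cont]
    log_std = out[:, n_cat + n_cont:]
    # Categorical part: softmax over the logits.
    lp_cat = log_softmax(logits)[np.arange(len(c_cat)), c_cat]
    # Continuous part: factored Gaussian log-density.
    z = (c_cont - mean) / np.exp(log_std)
    lp_cont = (-0.5 * z**2 - log_std - 0.5 * np.log(2 * np.pi)).sum(axis=1)
    return lp_cat + lp_cont

rng = np.random.default_rng(0)
N, F, n_cat, n_cont = 4, 16, 10, 2
features = rng.standard_normal((N, F))
W = rng.standard_normal((F, n_cat + 2 * n_cont)) * 0.1
b = np.zeros(n_cat + 2 * n_cont)
c_cat = rng.integers(0, n_cat, size=N)
c_cont = rng.uniform(-1, 1, size=(N, n_cont))
print(q_head_log_prob(features, W, b, c_cat, c_cont, n_cat).shape)  # (4,)
```

Averaging this log-probability over a batch gives the data-dependent term of the lower bound. Because the continuous part is a density rather than a probability, its scale can differ from the categorical part, which is one reason λ may need tuning.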

Results and conclusions
Limitations and outlook
Related work

A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
DC-GAN
T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, “Deep convolutional inverse graphics network,” in NIPS, 2015, pp. 2530–2538.
DC-IGN, the comparison model on 3D images
G. Desjardins, A. Courville, and Y. Bengio, “Disentangling factors of variation via generative entangling,” arXiv preprint arXiv:1210.5474, 2012.
hossRBM, another unsupervised method for learning disentangled representations.

【Paper 2】Interpretable dimensionality reduction of single cell transcriptome data with deep generative models (scvis)

Title: Interpretable dimensionality reduction of single cell transcriptome data with deep generative models
Authors: Jiarui Ding, Anne Condon & Sohrab P. Shah
Keywords:
Year: 2018
Source: Nature Communications
paper: https://www.nature.com/articles/s41467-018-04368-5
code: https://bitbucket.org/jerry00/scvis-dev (TensorFlow)
Citation: Ding, J., Condon, A. & Shah, S.P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat Commun 9, 2002 (2018). https://doi.org/10.1038/s41467-018-04368-5

abstract

Single-cell RNA-sequencing has great potential to discover cell types, identify cell states, trace development lineages, and reconstruct the spatial organization of cells. However, dimension reduction to interpret structure in single-cell sequencing data remains a challenge. Existing algorithms are either not able to uncover the clustering structures in the data or lose global information such as groups of clusters that are close to each other. We present a robust statistical model, scvis, to capture and visualize the low-dimensional structures in single-cell gene expression data. Simulation results demonstrate that low-dimensional representations learned by scvis preserve both the local and global neighbor structures in the data. In addition, scvis is robust to the number of data points and learns a probabilistic parametric mapping function to add new data points to an existing embedding. We then use scvis to analyze four single-cell RNA-sequencing datasets, exemplifying interpretable two-dimensional representations of the high-dimensional single-cell RNA-sequencing data.

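Intuitive understanding

Basic research problem
Dimensionality reduction for single-cell RNA sequencing (scRNA-seq) data.
Existing problems
1. Existing embeddings are fixed and non-parametric, so they do not handle new data points well.
2. Linear dimensionality reduction, such as PCA, cannot preserve the complex structure of the data.
3. Nonlinear dimensionality reduction, such as t-SNE, is computationally expensive, which makes it a poor fit for high-throughput data like cell sequencing, and it is weak at preserving global information (structure).
Main method
Learn a parametric mapping from the high-dimensional data to a low-dimensional space, so that new data can be embedded and the distance relationships between samples are preserved in the projection.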

Data: $D=\{ x_n\}_{n=1}^N$, where $x_n$ is the expression vector of one cell and N is the number of cells.

Assumption: each $x_n$ is governed by a latent low-dimensional vector $z_n$, which is 2- or 3-dimensional and is itself drawn from a prior. The joint distribution of the model is:
$p\left( \mathbf{z}_n|\boldsymbol{\theta } \right) p\left( \mathbf{x}_n|\mathbf{z}_n,\boldsymbol{\theta } \right)$
The distribution of the high-dimensional data (which can be a very complex distribution) is:
$p\left( \mathbf{x}_n|\boldsymbol{\theta } \right) =\int{p\left( \mathbf{z}_n|\boldsymbol{\theta } \right) p\left( \mathbf{x}_n|\mathbf{z}_n,\boldsymbol{\theta } \right) d\mathbf{z}_n}$
The aim is a large likelihood (the integral above): similar cells are assigned similar distributions $p\left( \mathbf{x}_n|\mathbf{z}_n,\boldsymbol{\theta } \right)$, which preserves the cluster structure among cells. To keep similar cells close together and dissimilar cells far apart (the distance property of the projection), the t-SNE objective is imposed as a constraint on the distribution of z.
The distributional assumptions are:
$p\left( \mathbf{z}_n|\boldsymbol{\theta } \right) =\prod_{i=1}^d{\mathcal{N}\left( z_{n,i}|0,1 \right)}=\mathcal{N}\left( \mathbf{z}_n|\mathbf{0},\mathbf{I} \right)$
$p\left( \mathbf{x}_n|\mathbf{z}_n,\boldsymbol{\theta } \right) =\mathcal{T}\left( \mathbf{x}_n|\boldsymbol{\mu }_{\boldsymbol{\theta }}\left( \mathbf{z}_n \right) ,\boldsymbol{\sigma }_{\boldsymbol{\theta }}\left( \mathbf{z}_n \right),\boldsymbol{\nu } \right)$
$p\left( \mathbf{x}_n|\mathbf{z}_n,\boldsymbol{\theta } \right)$ belongs to a location-scale family: it is a Student's t-distribution over $x_n$ conditioned on $z_n$. The location parameter $\boldsymbol{\mu }_{\boldsymbol{\theta }}\left( \mathbf{z}_n \right)$ and the scale parameter $\boldsymbol{\sigma }_{\boldsymbol{\theta }}\left( \mathbf{z}_n \right)$ are both functions of $z_n$ parameterized by a neural network with parameters θ. ν is the degrees-of-freedom parameter and is learned from the data.
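The generative side can be sketched in NumPy as follows. This is a toy illustration only: the one-layer `decoder`, the softplus used for the scale, and the scalar `nu` are assumptions for the example and do not describe the scvis network.

```python
import math
import numpy as np

def student_t_logpdf(x, mu, sigma, nu):
    """Element-wise log-density of a location-scale Student's t-distribution."""
    z = (x - mu) / sigma
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return log_norm - np.log(sigma) - (nu + 1) / 2 * np.log1p(z**2 / nu)

def decoder(z, W, b):
    """Toy one-layer decoder returning a location and a scale per gene."""
    h = z @ W + b
    D = h.shape[1] // 2
    mu = h[:, :D]
    sigma = np.log1p(np.exp(h[:, D:])) + 1e-3  # softplus keeps the scale positive
    return mu, sigma

rng = np.random.default_rng(0)
N, d, D = 5, 2, 8                       # cells, latent dimension, genes
z = rng.standard_normal((N, d))         # z_n ~ N(0, I)
W = rng.standard_normal((d, 2 * D)) * 0.1
b = np.zeros(2 * D)
mu, sigma = decoder(z, W, b)
x = mu + sigma * rng.standard_t(df=3.0, size=(N, D))  # x_n | z_n
log_lik = student_t_logpdf(x, mu, sigma, nu=3.0).sum(axis=1)
print(log_lik.shape)  # (5,)
```

Summing the per-gene log-densities gives $\log p(x_n|z_n,\theta)$ for each cell. The heavier tails of the t-distribution make this term less sensitive to outlying expression values than a Gaussian likelihood would be.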

https://zhuanlan.zhihu.com/p/108895154

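Inference requires the posterior $p\left( \mathbf{z}_n|\mathbf{x}_n,\boldsymbol{\theta } \right)$, which is approximated by a variational distribution $q\left( \mathbf{z}_n|\mathbf{x}_n,\boldsymbol{\phi } \right)$. The variational distribution is a multivariate normal whose mean $\boldsymbol{\mu }_{\boldsymbol{\phi }}\left( \mathbf{x}_n \right)$ and variance $\boldsymbol{\sigma }_{\boldsymbol{\phi}}\left( \mathbf{x}_n \right)$ are both functions of $x_n$ parameterized by a neural network with parameters φ. This variational inference part is a VAE (variational autoencoder).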
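With a diagonal Gaussian q and a standard normal prior, the two standard VAE ingredients are the reparameterized sample and the closed-form KL term of the ELBO. A minimal sketch, assuming the encoder outputs a mean and a log-variance (the log-variance parameterization is a common convention, assumed here):

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per cell."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)

rng = np.random.default_rng(0)
mu = np.array([[1.0, 0.0], [0.0, 0.0]])
log_var = np.zeros_like(mu)
z = sample_latent(mu, log_var, rng)
print(z.shape)                             # (2, 2)
print(kl_to_standard_normal(mu, log_var))  # 0.5 for the first cell, 0 for the second
```

The per-cell ELBO is then the expected log-likelihood under q, estimated from the sampled z, minus this KL term.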
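Finally, the objective function is a weighted combination of the evidence lower bound (ELBO) from variational inference and the asymmetric t-SNE objective.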
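The combination can be sketched as below. This is a simplified illustration: it uses a fixed Gaussian bandwidth instead of a perplexity search, and the weight `alpha`, the per-cell normalization, and the function names are hypothetical choices rather than the settings used by scvis.

```python
import numpy as np

def pairwise_sq_dists(a):
    sq = np.sum(a**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * a @ a.T
    return np.maximum(d2, 0.0)

def conditional_p(x, bandwidth=1.0):
    """p_{j|i}: Gaussian-kernel neighbor probabilities in the input space."""
    logits = -pairwise_sq_dists(x) / (2.0 * bandwidth**2)
    np.fill_diagonal(logits, -np.inf)          # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def conditional_q(z, nu=1.0):
    """q_{j|i}: Student's t-kernel neighbor probabilities in the embedding."""
    w = (1.0 + pairwise_sq_dists(z) / nu) ** (-(nu + 1.0) / 2.0)
    np.fill_diagonal(w, 0.0)
    return w / w.sum(axis=1, keepdims=True)

def tsne_loss(x, z, eps=1e-12):
    """Asymmetric t-SNE cost: sum over i of KL(p_{.|i} || q_{.|i})."""
    p, q = conditional_p(x), conditional_q(z)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def scvis_style_loss(elbo_per_cell, x, z, alpha):
    """Quantity to minimize: negative mean ELBO plus the weighted t-SNE cost."""
    return -float(np.mean(elbo_per_cell)) + alpha * tsne_loss(x, z) / len(x)

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 50))
z = rng.standard_normal((20, 2))
elbo = rng.standard_normal(20)  # placeholder ELBO values, not from a trained model
print(scvis_style_loss(elbo, x, z, alpha=10.0))
```

The cost is asymmetric because each row of p and q is normalized per cell, so the KL divergence is computed separately for the neighborhood of every cell.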

Results and conclusions
Limitations and outlook
Related work

Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In Proc. 2nd International Conference on Learning Representations (Banff, Alberta, 2014).
Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proc. 31st International Conference on Machine Learning (eds Xing, E. P. & Jebara, T.) 1278–1286 (PMLR, Beijing, 2014).
Variational inference, the VAE part
t-SNE

【Paper 3】A deep adversarial variational autoencoder model for dimensionality reduction in single-cell RNA sequencing analysis (DR-A)

Title: A deep adversarial variational autoencoder model for dimensionality reduction in single-cell RNA sequencing analysis
Authors: Eugene Lin, Sudipto Mukherjee & Sreeram Kannan
Keywords: Adversarial autoencoder, Variational autoencoder, Dimensionality reduction, Generative adversarial networks, Single-cell RNA sequencing
Year: 2020
Source: BMC Bioinformatics
paper: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3401-5
code:
Citation: Lin, E., Mukherjee, S. & Kannan, S. A deep adversarial variational autoencoder model for dimensionality reduction in single-cell RNA sequencing analysis. BMC Bioinformatics 21, 64 (2020). https://doi.org/10.1186/s12859-020-3401-5

abstract

Background: Single-cell RNA sequencing (scRNA-seq) is an emerging technology that can assess the function of an individual cell and cell-to-cell variability at the single cell level in an unbiased manner. Dimensionality reduction is an essential first step in downstream analysis of the scRNA-seq data. However, the scRNA-seq data are challenging for traditional methods due to their high dimensional measurements as well as an abundance of dropout events (that is, zero expression measurements).
Results: To overcome these difficulties, we propose DR-A (Dimensionality Reduction with Adversarial variational autoencoder), a data-driven approach to fulfill the task of dimensionality reduction. DR-A leverages a novel adversarial variational autoencoder-based framework, a variant of generative adversarial networks. DR-A is well-suited for unsupervised learning tasks for the scRNA-seq data, where labels for cell types are costly and often impossible to acquire. Compared with existing methods, DR-A is able to provide a more accurate low dimensional representation of the scRNA-seq data. We illustrate this by utilizing DR-A for clustering of scRNA-seq data.
Conclusions: Our results indicate that DR-A significantly enhances clustering performance over state-of-the-art methods.

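Intuitive understanding

Basic research problem
Use VAE + GAN (in other words, an adversarial autoencoder combined with a variational autoencoder) to reduce the dimensionality of scRNA-seq data, and use the result for the downstream clustering task (k-means).
Existing problems

The very high dimensionality of scRNA-seq data.
The large number of zero measurements (dropout) in scRNA-seq data.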

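Main method
Architecture:

Adversarial AutoEncoder: consists of a standard autoencoder and an adversarial network. The encoder also acts as the generator of the adversarial network. The idea is to train the adversarial network and the autoencoder jointly to perform inference. The encoder (that is, the generator) is trained to make the discriminator believe that the latent vectors were drawn from the true prior, while the discriminator is trained to tell vectors sampled from the prior apart from the encoder's latent vectors. Adversarial training ensures that the latent space matches a chosen prior distribution.
Discriminator 1: decides whether a latent code comes from the encoder or from the true prior, which pushes the encoder's latent codes towards the prior.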
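The two opposing objectives on the latent space can be sketched with binary cross-entropy losses. The linear "discriminator" in the usage lines is a stand-in chosen for the example, and the function names are illustrative, not from the DR-A code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def discriminator_loss(d_prior_logits, d_code_logits, eps=1e-12):
    """Binary cross-entropy: prior samples are labeled 1, encoder codes 0."""
    real = -np.log(sigmoid(d_prior_logits) + eps)
    fake = -np.log(1.0 - sigmoid(d_code_logits) + eps)
    return float(np.mean(real) + np.mean(fake))

def encoder_adversarial_loss(d_code_logits, eps=1e-12):
    """The encoder (generator) is rewarded when its codes are scored as prior samples."""
    return float(np.mean(-np.log(sigmoid(d_code_logits) + eps)))

rng = np.random.default_rng(0)
prior_z = rng.standard_normal((64, 2))        # samples from the prior N(0, I)
code_z = rng.standard_normal((64, 2)) + 3.0   # encoder codes that are off-prior
w, bias = np.array([-1.0, -1.0]), 3.0         # hypothetical linear discriminator
d_prior = prior_z @ w + bias
d_code = code_z @ w + bias
print(discriminator_loss(d_prior, d_code))
print(encoder_adversarial_loss(d_code))
```

In training, the two losses are minimized in alternation: the discriminator step updates only the discriminator, and the encoder step updates only the encoder, so that the code distribution drifts towards the prior.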
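Adversarial Variational AutoEncoder (AVAE):

Builds on the Adversarial AutoEncoder and incorporates the variational autoencoder (VAE). It consists of an encoder (generator), a decoder, and a discriminator. The decoder reconstructs the input x from the latent code by minimizing the reconstruction error. The standard Gaussian N(0, I) is used as the prior p(z).
Adversarial Variational AutoEncoder with dual matching (AVAE-DM): adds a second discriminator to AVAE, which distinguishes reconstructed samples from real samples.
The Bhattacharyya distance is introduced to define the objective function.
A ZINB conditional likelihood for p(x|z) is used at the decoder output to reconstruct the scRNA-seq data.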
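A zero-inflated negative binomial mixes a point mass at zero (weight π, modeling dropout) with a negative binomial count distribution. The sketch below uses the common mean and inverse-dispersion parameterization (`mu`, `theta`); whether DR-A uses exactly this parameterization is an assumption, and the parameter values are arbitrary.

```python
import numpy as np
from math import lgamma

_lgamma = np.vectorize(lgamma)

def nb_logpmf(x, mu, theta):
    """Negative binomial log-pmf with mean mu and inverse dispersion theta."""
    return (_lgamma(x + theta) - _lgamma(theta) - _lgamma(x + 1.0)
            + theta * np.log(theta / (theta + mu))
            + x * np.log(mu / (theta + mu)))

def zinb_logpmf(x, mu, theta, pi):
    """ZINB log-pmf: zeros come from dropout (prob. pi) or from the NB itself."""
    x = np.asarray(x, dtype=float)
    nb = nb_logpmf(x, mu, theta)
    zero_case = np.log(pi + (1.0 - pi) * np.exp(nb))
    nonzero_case = np.log1p(-pi) + nb
    return np.where(x == 0, zero_case, nonzero_case)

counts = np.array([0, 0, 3, 1, 0, 7])
print(zinb_logpmf(counts, mu=2.0, theta=1.5, pi=0.3).sum())
```

In a decoder, `mu`, `theta`, and `pi` would be outputs of the network for each gene, and the negative sum of these log-probabilities would serve as the reconstruction loss.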

Results and conclusions
Limitations and outlook
Related work

Makhzani A, Shlens J, Jaitly N, Goodfellow I, Frey B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
Choi E, Lee C. Feature extraction based on the Bhattacharyya distance. Pattern Recogn. 2003;36(8):1703–9.
Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018;15(12):1053.
Grün D, Kester L, Van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014;11(6):637.
zero-inflated negative binomial (ZINB) distribution