Today I'm finally getting into a proper research mindset — all my previous attempts failed, ahem! So let's start by reading a paper from a related field.

——————————————————————————————————————————————

Title: DEEPCON: protein contact prediction using dilated convolutional neural networks with dropout

doi: 10.1093/bioinformatics/btz593

Original link: https://academic.oup.com/bioinformatics/article/36/2/470/5540673

Download link: https://sci-hub.ren/10.1093/bioinformatics/btz593

Abstract
Motivation: Exciting new opportunities have arisen to solve the protein contact prediction problem from the progress in neural networks and the availability of a large number of homologous sequences through high-throughput sequencing. In this work, we study how deep convolutional neural networks (ConvNets) may be best designed and developed to solve this long-standing problem.


Results: With publicly available datasets, we designed and trained various ConvNet architectures. We tested several recent deep learning techniques including wide residual networks, dropouts and dilated convolutions. We studied the improvements in the precision of medium-range and long-range contacts, and compared the performance of our best architectures with the ones used in existing state-of-the-art methods. The proposed ConvNet architectures predict contacts with significantly more precision than the architectures used in several state-of-the-art methods. When trained using the DeepCov dataset consisting of 3456 proteins and tested on PSICOV dataset of 150 proteins, our architectures achieve up to 15% higher precision when L/2 long-range contacts are evaluated. Similarly, when trained using the DNCON2 dataset consisting of 1426 proteins and tested on 84 protein domains in the CASP12 dataset, our single network achieves 4.8% higher precision than the ensembled DNCON2 method when top L long-range contacts are evaluated.

Availability and implementation: DEEPCON is available at https://github.com/badriadhikari/DEEPCON/.



——————————————————————————————

1 Introduction

For a protein whose amino acid sequence is obtained using a protein-sequencing device, three-dimensional (3D) models may be predicted using template modeling or ab initio methods. Template-modeling methods search for similar protein sequences in a database of sequences whose 3D structures have already been determined through wet-lab experiments, and use them to predict 3D models of the input sequence. The total number of protein structures determined through experimental methods such as X-ray crystallography and nuclear magnetic resonance spectroscopy is currently limited to 147817 as of January 2019 (Berman, 2000). Protein sequences for which such templates cannot be found need to be predicted ab initio, i.e. without the use of any structural templates. For structure prediction of protein sequences whose structural templates are not found, predicted protein contacts serve as the driver for folding (Michel et al., 2018; Wang et al., 2017).


Residue-residue contacts or inter-residue contacts (or just contacts) define which pairs of amino acids should be close to each other in the 3D structure, i.e. pairs that are in contact should remain close and those that are not should stay farther apart. As defined in the Critical Assessment of Protein Structure Prediction (CASP) experiments (Monastyrskyy et al., 2016; Moult et al., 2018), a pair of residues in a protein are defined to be in contact if their carbon-beta atoms (carbon-alpha for glycine) are closer than 8 Å in the native (experimental) structure. In a true (or a predicted) contact matrix, not all contacts are equally important. Local contacts (those with sequence separation of less than six residues) and short-range contacts (sequence separation between 6 and 12 residues) are not very useful for building an accurate 3D model. They are required for reconstructing local secondary structures but are not helpful for building folded proteins. However, medium-range contacts, the contact pairs with sequence separation between 12 and 23 residues, and long-range contacts, those separated by at least 24 residues in the protein sequence, are necessary for building accurate models.

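To make this definition concrete, here is a minimal sketch (our own illustration, not code from the paper) that derives a binary contact map from carbon-beta coordinates and classifies residue pairs by sequence separation:

```python
import numpy as np

def contact_map(cb_coords, threshold=8.0):
    """Binary contact map from an (L, 3) array of carbon-beta coordinates
    (carbon-alpha for glycine), using the 8 Angstrom CASP definition."""
    diff = cb_coords[:, None, :] - cb_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # (L, L) pairwise distances
    return (dist < threshold).astype(np.int8)

def separation_class(i, j):
    """Classify a residue pair by sequence separation, as defined above."""
    s = abs(i - j)
    if s < 6:
        return "local"
    elif s < 12:
        return "short-range"
    elif s < 24:
        return "medium-range"
    return "long-range"
```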

All top groups participating in the most recent Critical Assessment of Protein Structure Prediction (CASP) 13 experiment, including DeepMind's AlphaFold method, use contacts (or distance bins) for ab initio protein structure prediction. These contacts can be predicted with relatively high precision for protein sequences that have hundreds to thousands of matches in protein sequence databases such as UNICLUST30 (Mirdita et al., 2017) and Uniref100 (Suzek et al., 2007). The sequence hits obtained, in the form of multiple sequence alignments (MSA), serve as input to algorithms and machine learning methods that predict contact maps. While the overall goal in protein structure prediction is to predict three-dimensional models (3D information) from protein sequences (1D information), predicted protein contacts serve as the intermediate step (2D information). In the absence of machine learning, contacts are predicted from protein sequence alignments based on the principle that evolutionary pressure places constraints on sequence evolution over generations (Marks et al., 2011). The predicted contacts from these coevolution-based methods are a key input to machine-learning-based methods, which generally predict more accurate contacts.


Although Eickholt and Cheng (2012) were the first group to apply deep learning to contact prediction, currently the most successful contact prediction methods use convolutional neural networks (CNNs) fed with a combination of features generated from multiple sequence alignments and other sequence-derived features. After Jinbo Xu's group first applied CNNs to predict contacts (Wang et al., 2017), CNNs were found to be particularly well suited and highly effective for the contact prediction problem, mainly because of their ability to learn cross-channel (cross-feature) information (for example, the relationship between predicted solvent accessibility and predicted secondary structure). In Adhikari et al. (2018), we demonstrate that a single CNN-based network delivers remarkably better performance than a boosted deep belief network. Similarly, in Jones and Kandathil (2018) the authors demonstrate that a basic CNN-based method can easily outperform another state-of-the-art meta-method based on basic neural networks. Although the recent progress in contact prediction was initially attributed mostly to coevolutionary features generated using methods such as CCMpred (Seemayer et al., 2014) and FreeContact (Kaján et al., 2014), recent findings (AlQuraishi, 2019; Jones and Kandathil, 2018; Mirabello and Wallner, 2018) suggest that end-to-end training may be possible in the near future, where the deep learning algorithm contributes the entire performance and these hand-engineered features may turn out to be redundant. Most of the recently successful contact prediction methods, as demonstrated by the recent CASP results, are available for public use. For instance, methods such as RaptorX (Wang et al., 2017), MetaPSICOV (Jones et al., 2015), DNCON2 (Adhikari et al., 2018), PconsC3 (Michel et al., 2017) and PconsC4 (Michel et al., 2018) are available either as downloadable tools or as web servers. Each of these methods uses a very different CNN architecture, a different set of input features, and self-curated datasets to train and benchmark.


From the perspective of input and output data format, the protein contact prediction problem is similar to depth prediction from monocular images (Eigen, 2014) in computer vision, i.e. predicting 3D depth from 2D images. In the depth prediction problem, the input is an image of dimensions H X W X C, where H is height, W is width and C is the number of input channels, and the output is a two-dimensional matrix of size H X W whose values represent depth intensities. Similarly, in the protein contact prediction problem, the output is a contact probability map (matrix) of size L X L and the input is a protein feature volume of dimensions L X L X N, where L is the length of the protein sequence and N is the number of input channels. Depth prediction usually involves three channels (red, green and blue, or hue, saturation and value), while in the latter we have a much higher number of channels, such as 56 or 441 (Jones and Kandathil, 2018). Because of the large number of input channels, the overall input volume becomes large, limiting the depth and width of the deep learning architectures that can be trained and tested. This also greatly affects training time and requires high-end GPUs for training. These challenges imposed by a large number of input channels are also observed in other problems, such as plant genotype prediction from hyperspectral images. A protein contact matrix and its input features are symmetrical along the diagonal. Most current methods (including this work) use both the upper and lower triangles (above and below the diagonal) for training, but for prediction and evaluation average the confidence scores of the upper and lower triangles.

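The symmetrization step mentioned above is a one-liner; this sketch (ours, not the paper's code) averages the two triangles of a predicted probability matrix:

```python
import numpy as np

def symmetrize(pred):
    """Average the upper and lower triangles of a predicted L x L
    contact-probability matrix; the result is exactly symmetric."""
    return (pred + pred.T) / 2.0
```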

The protein structure prediction problem has some additional unique characteristics. First, the input features for proteins are not all two-dimensional. Input features such as the length of a protein sequence are scalar, or 0-dimensional (0D). For such input features, we create a channel with the same value throughout. Similarly, input features such as secondary structure predictions are one-dimensional (1D), and we create two channels for each 1D input feature of length L: a first channel by copying the vector L times to create an L X L matrix, and a second channel by transposing the input feature vector and then copying it L times to create an L X L matrix. Another way to generate a 2D matrix from a 1D vector is to compute the outer product of the vector with its own transpose. Other input features, which are 2D, are copied into the input volume as they are. Also, the length of a protein sequence can vary from a few residues to a few thousand residues, implying a variable input feature volume; the dimension of each 2D feature (transformed into channels) depends on the length of the protein. Unlike real-world images, these input features cannot easily be studied visually to understand what the networks are learning. The training and test datasets we can use are also limited. Unlike other publicly available datasets, the protein structure dataset cannot be significantly increased in size because only around 11 thousand new structures are experimentally determined each year. What further reduces this dataset is the similarity between many of the deposited structures. If we perform some basic redundancy reduction on these proteins, such as keeping only proteins with less than 20% sequence similarity to each other and those with high-resolution structures, the number of proteins available for training reduces to only around five thousand (Wang and Dunbrack, 2003). Despite the fact that the data appear limited compared to many other datasets, some argue that they are sufficient to capture the principles of protein folding. In addition, a typical protein contact map, which is a binary matrix, is around 95% zeros and 5% ones (Jones et al., 2015).

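A sketch of the 0D/1D/2D feature roll-out just described, assuming NumPy arrays; the function name and argument layout are our own:

```python
import numpy as np

def to_2d_channels(scalar_feats, vector_feats, matrix_feats, L):
    """Roll 0D, 1D and 2D input features into a single L x L x N volume."""
    channels = []
    for s in scalar_feats:                        # 0D: one constant channel
        channels.append(np.full((L, L), s, dtype=np.float32))
    for v in vector_feats:                        # 1D: two channels per feature
        row = np.tile(np.asarray(v).reshape(1, L), (L, 1))
        channels.append(row)                      # vector copied L times
        channels.append(row.T)                    # transposed copy
        # an alternative 2D encoding is the outer product: np.outer(v, v)
    for m in matrix_feats:                        # 2D: copied as-is
        channels.append(np.asarray(m, dtype=np.float32))
    return np.stack(channels, axis=-1)            # L x L x N
```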

Three-dimensional models of real-world objects and protein structures have a fundamental difference: protein 3D models are not scalable. An object in the real world, such as a chair, may be tiny or large. Proteins can also be large or small, but the sizes of their structural patterns are always physically fixed. For instance, the size of an alpha helix (a common building block of many protein structures) is the same for proteins of any size. The distance between carbon-alpha atoms is a fixed physical distance whether the helices are in a small helical protein such as the mouse C-MYB protein or a large helical protein such as hemoglobin. Because of this unique characteristic of protein structures, it is yet to be fully understood how useful deep architectures such as U-Nets (Ronneberger et al., 2015) can be for protein contact prediction, although some groups have developed contact prediction methods using such architectures (Michel et al., 2018). Similar to the depth prediction problem, the ultimate goal in the field is to develop methods that can be scaled to predict physical distance maps (in Angstroms). Since raw distance prediction is too difficult, some groups have developed methods that predict contacts at various distance thresholds (Xu, 2018) and demonstrate that such binning of distance ranges improves contact prediction at a single standard distance threshold, such as the standard of 8 Å.


In this work, we demonstrate that dilated residual networks with dropout layers are best suited for addressing the protein contact prediction problem. It can be argued that results on a small dataset may not hold true on large datasets. In addition, results generated using one kind of feature may not hold true for other kinds of input features. For instance, methods that work well for sequence-based features (such as secondary structures) may not work well for features generated directly from multiple sequence alignments (such as covariance matrices and precision matrices). When trained on the DeepCov dataset consisting of 3456 proteins, using the same dataset and input features for training, our method achieves up to 6 and 15% higher precision on the PSICOV150 protein dataset (an independent test set) when top L/5 and L/2 long-range contacts are evaluated, respectively (L is protein length). Similarly, when trained on the DNCON2 dataset consisting of 1426 proteins, using the same dataset for training and testing, a single network in our method achieves 4.8% higher long-range precision (top L). While DNCON2 uses a two-level approach combined with an ensemble of 20 models, we use a single network. Although we trim the input length to 256 for all proteins in our training experiments, after training our models can make predictions for a protein of any length. It is important to note that although our overall architecture is novel, we are not the first group to apply dilated residual networks to protein contact prediction. At the CASP13 meeting, DeepMind's group was the first to present that dilated convolutional layers are a good fit for this problem.


——————————————————————————————

2 Materials and methods

2.1 Datasets

For training the models in our experiments, we use two datasets: (i) the DNCON2 dataset (Adhikari et al., 2018) available at http://sysbio.rnet.missouri.edu/dncon2/ and (ii) the DeepCov dataset (Jones and Kandathil, 2018) available at https://github.com/psipred/DeepCov. Following standard deep learning practice, for each experiment we consider three subsets: a training subset, a validation subset and an independent test set. The DNCON2 dataset consists of 1426 proteins between 30 and 300 residues in length. The dataset was curated before the CASP10 experiment in 2012, and the protein structures in it are of 0 to 2 Å resolution. These proteins are filtered at 30% sequence identity to remove redundancy. Of the full dataset, we use 1230 proteins for training and the remaining 196 as the validation set. We found that the protein with PDB ID 2PNE in the validation set was skipped during evaluation in the DNCON2 method, and hence we skip it too. We train and validate our method using the 1426 proteins and test on 62 protein targets from the CASP12 experiment. The CASP12 protein targets range from 75 to 670 residues in length. For evaluating our method on the CASP12 dataset, we predict contacts for the whole protein target and evaluate the contacts on the full target (or domains). The DeepCov dataset consists of 3456 proteins used to train and validate models, i.e. tune hyperparameters. The models trained using the DeepCov dataset are tested on the independent PSICOV dataset consisting of 150 proteins.


2.2 Contact evaluation

Following the standard in the field (Monastyrskyy et al., 2016), as the evaluation metric of contact prediction accuracy we use the precision of the top L/5 predicted long-range contacts (PL/5-LR), where L is the length of the protein sequence. In addition, we also evaluate our methods using the precision of the top L/2 long-range contacts (PL/2-LR), the precision of the top L long-range contacts (PL-LR) and the precision of all medium- and long-range contacts (PNC-MLR). To calculate PNC-MLR for a protein with NC medium- and long-range contacts, we first round the top NC medium- and long-range predicted probabilities (after ranking the probabilities) to 1s and the rest to 0s. Precision is then calculated as the ratio of the number of matches between the predicted and true matrices to NC.

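These metrics are straightforward to compute from a predicted probability matrix. The sketch below is our own reading of the definitions above (for PL/5-LR use x=0.2 with a minimum separation of 24, for PL/2-LR x=0.5, for PL-LR x=1.0):

```python
import numpy as np

def top_xl_precision(pred, true, x=0.2, min_sep=24):
    """Precision of the top x*L predicted contacts with sequence
    separation >= min_sep (24 for long-range, 12 for medium and long)."""
    L = pred.shape[0]
    iu = np.triu_indices(L, k=min_sep)            # candidate pairs, one triangle
    order = np.argsort(-pred[iu])                 # rank by predicted probability
    top = order[: max(1, int(round(x * L)))]
    return float(true[iu][top].mean())

def pnc_mlr(pred, true):
    """PNC-MLR: precision of the top NC medium- and long-range predictions,
    where NC is the number of true medium- and long-range contacts."""
    L = pred.shape[0]
    iu = np.triu_indices(L, k=12)
    nc = int(true[iu].sum())
    top = np.argsort(-pred[iu])[:nc]
    return float(true[iu][top].mean())
```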

2.3 Input features

We use two kinds of features in our experiments: covariance features (as in the DeepCov method), and sequence features generated using various prediction methods (as in the DNCON2 method). The sequence features consist of eight groups of features as input for training and validating our models. These include scalar values such as sequence length; one-dimensional features such as solvent accessibility and secondary structure predictions; and two-dimensional features such as features computed from the position-specific scoring matrix, sequence separation information, pre-computed statistical potentials, features computed from the input multiple sequence alignment, and coevolution predictions from the alignment. In total, we use 29 unique features as input, which we roll out to 56 input channels. For each 0D (scalar) feature we copy the value into the entire channel. Similarly, we copy each 2D input feature as-is into the final input volume as a single channel. For each 1D input feature of a protein of length L, we create two channels: a first channel by copying the vector L times to create an L X L matrix, and a second channel by transposing the input feature vector and then copying it L times to create an L X L matrix. Although the maximum length of a protein in this dataset is 300, we trim all input features to a length of 256, so that the longest protein has an input volume of 256 X 256 X 56. For training, all proteins with length less than 256 are zero-padded so that all input volumes have the same dimensions of 256 X 256 X 56. The covariance features consist of a 441-channel covariance matrix calculated using the publicly available script (cov21stat) and dataset in the DeepCov package. For this dataset the input volume for each protein is L X L X 441. For generating multiple sequence alignments for the CASP dataset and the DNCON2 dataset, we follow the approach of using HHsearch (Söding, 2005) with the UniProt20 database (2016/02 release) followed by JackHmmer (Finn et al., 2011) with the UniRef90 database (2016 release), as discussed in the DNCON2 method (see our DNCON2 paper for details). These databases are publicly available at http://sysbio.rnet.missouri.edu/bdm_download/dncon2-tool/databases/. The alignments for the proteins in the DeepCov dataset and the PSICOV150 dataset were copied from the corresponding DeepCov repository.

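The trimming and zero-padding described above can be sketched as follows (a simplified illustration; the paper's exact preprocessing may differ):

```python
import numpy as np

def crop_or_pad(volume, size=256):
    """Trim an L x L x N feature volume to size x size x N, or zero-pad
    shorter proteins so every training example has the same dimensions."""
    L, _, n = volume.shape
    out = np.zeros((size, size, n), dtype=volume.dtype)
    s = min(L, size)
    out[:s, :s, :] = volume[:s, :s, :]
    return out
```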

2.4 Training convolutional neural networks

Our ConvNets involve an input layer; a number of 2D convolutional layers with batch normalization or dropout; residual connections; and Rectified Linear Unit (ReLU) activations. In all architectures, the final layer is a convolutional layer with one filter of size 3 X 3 followed by a 'sigmoid' activation to predict contact probabilities. Besides dropout, no other regularizers (such as L2) were used, and ADAM (Kingma and Ba, 2014) was used as the optimizer. Since we only have convolutional layers, the variables are the number and size of filters at each layer, and the dilation rate when dilated convolutional layers are used. Figure 1 summarizes our approach. All the CNN filters in the first layer convolve through the input volume of 256 X 256 X N, producing batch-normalized and 'relu'-activated outputs that are passed as input to the subsequent layers. The number of channels, N, is 56 when the DNCON2 dataset is used and 441 when the DeepCov dataset is used. Error is computed using binary cross entropy, calculated as -(y log(p) + (1-y) log(1-p)), where p is the output of the sigmoid activation of the last layer for each residue pair, and y is 1 if the residue pair is in contact in the experimental structure and 0 otherwise.

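As a minimal Keras sketch of this setup (our own illustration; the depth, filter count and layer ordering here are simplified stand-ins, not the paper's exact configuration):

```python
from tensorflow.keras import layers, models

def build_convnet(input_size=256, num_channels=56, num_layers=8, filters=64):
    inp = layers.Input(shape=(input_size, input_size, num_channels))
    x = layers.Conv2D(filters, 3, padding='same')(inp)   # first convolution
    for _ in range(num_layers - 1):
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.Conv2D(filters, 3, padding='same')(x)
    # final layer: one 3 x 3 filter under a sigmoid, giving contact probabilities
    out = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)
    model = models.Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```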

Although we crop the input length of our training proteins to 256, after training the model can make predictions for a protein of any length. Since we do not use max pooling or any dense connections, a model of arbitrary dimensions can be built for predicting proteins longer than 256 residues. For instance, to predict contacts for a protein of length 500 (i.e. an input volume of 500 X 500 X 56), we first build a model with the same architecture but input dimensions of 500 X 500, then load all the trained weights into this new model and make predictions for the input volume. Since the contact matrix is symmetrical, we average the predictions of the two triangles to generate final predictions. No model ensembling techniques were used. In all our training experiments, we reduce the learning rate by a factor of 0.5 when the loss on the validation dataset does not improve for 10 epochs. We stop the training if the loss on the validation dataset does not improve for 20 epochs in a row, and the output at the last epoch is selected.

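Both procedures map onto standard Keras machinery. The sketch below assumes the `build_convnet` helper from the previous snippet; the weights filename is hypothetical:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

callbacks = [
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10),
    EarlyStopping(monitor='val_loss', patience=20),   # keeps last-epoch weights
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=callbacks)

# Predicting a longer protein: rebuild the same architecture at the new
# size and copy the trained weights, which are independent of input size.
model_256 = build_convnet(input_size=256)
model_256.load_weights('deepcon_weights.h5')          # hypothetical filename
model_500 = build_convnet(input_size=500)
model_500.set_weights(model_256.get_weights())
# pred = model_500.predict(features_500)[0, :, :, 0]
# final = (pred + pred.T) / 2.0                       # average the two triangles
```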

We used the Keras library (https://keras.io/) with the TensorFlow (Abadi et al., 2016) backend for our method development. On a machine with 2 CPUs, 24 GB of memory and an NVIDIA Tesla K80 GPU, training (around 35–45 epochs) takes about 16 h on either the DNCON2 or the DeepCov dataset. For testing larger architectures we used Tesla P100, V100 and Quadro P6000 GPUs.


2.5 Network architectures

We start our training experiments with standard convolutional neural networks (or fully convolutional networks), where each convolutional layer is preceded by a batch normalization layer and ReLU activation. We find that the performance of such networks drops after 32–64 convolutional layers. Next, we design a residual block consisting of two convolutional layers, each preceded by a batch normalization layer and ReLU activation. Fixing the total number of convolutional filters in each layer to 64, we design four residual networks: (i) a regular residual network with depth (number of blocks) as the key parameter; (ii) an architecture with the second batch normalization layer in each residual block replaced by a dropout layer; (iii) an architecture with the last convolutional layer replaced by a dilated convolutional layer with the dilation rate alternating among 1, 2 and 4; and (iv) an architecture that combines the second and third, i.e. we replace the second batch normalization layer with a dropout layer and replace the second convolutional layer with a dilated convolutional layer at alternating dilation rates of 1, 2 and 4 (see Fig. 2). We also tested many other architectural variations, such as filter sizes of 16, 24 and 32, and different ways of alternating between batch normalization layers and dropout layers. On average, these architectures were not significantly better than the four architectures discussed above.

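A sketch of the fourth variant, under our reading of the description above: dropout replaces the second batch normalization, and the second convolution is dilated, with the dilation rate supplied per block:

```python
from tensorflow.keras import layers

def residual_block(x, filters=64, dropout=0.3, dilation=1):
    """Residual block (iv): BN + ReLU + conv, then dropout + ReLU + dilated conv.
    Assumes x already has `filters` channels so the shortcut addition is valid."""
    shortcut = x
    y = layers.BatchNormalization()(x)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Dropout(dropout)(y)                    # replaces the second BN
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same',
                      dilation_rate=dilation)(y)      # dilated second convolution
    return layers.add([shortcut, y])

# Stacking blocks with the dilation rate alternating among 1, 2 and 4:
# for i in range(32):
#     x = residual_block(x, dilation=[1, 2, 4][i % 3])
```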

Using the DNCON2 dataset as our training and validation set and the CASP12 targets as the test dataset, we compared the precision of a fully convolutional network (FCN) and the four residual architectures: a standard residual network, a residual network with dilation, a residual network with dropout, and a residual network with dilation and dropout. On the fully convolutional networks and residual networks, we experimented with adding dropout layers in many ways and found that alternating between batch normalization layers and dropout layers yields the best performance (Zagoruyko and Komodakis, 2016). On our standard residual network with 64 layers, each having 64 3 X 3 filters, we tested dropout values of 0.2, 0.3, 0.4 and 0.5. When we evaluated these models using the precision of the top L/5 long-range contacts (PL/5-LR) and of all medium- and long-range contacts (PNC-MLR), we found that the specific dropout value does not matter as long as the dropout layers are in place. For most of our experiments we fixed 0.3 as the parameter of our dropout layers (i.e. keep 70% of the weights).


Since our findings resonate with those of Zagoruyko and Komodakis (2016), we also hypothesize that the issue of 'diminishing feature reuse' is pronounced in the contact prediction problem and that the dropout layers partially overcome it. In fact, we observed up to a 6% gain in PL/5-LR when we randomly replaced some of the initial batch normalization layers with dropout layers in a fully convolutional network with just 16 layers (each having 64 filters). These findings suggest that dropout layers, when used appropriately, can be highly effective in network architectures for protein contact prediction. When evaluated on the independent CASP12 dataset consisting of 62 targets, we find that the residual network with dilation and dropout outperforms all four other architectures across all the precision metrics: PL/5-LR, PL/2-LR, PL-LR and PNC-MLR (see Fig. 3). We call this residual network with dilation and dropout the DEEPCON method.


——————————————————————————————

3 Results

3.1 Comparison with the network architectures of other methods

To benchmark our DEEPCON architecture, we compared it with the architectures used in the current state-of-the-art methods: Raptor-X (Wang et al., 2017), DNCON2 (Adhikari et al., 2018), DeepCov (Jones and Kandathil, 2018) and PconsC4 (Michel et al., 2018). The Raptor-X method uses residual networks with around 60 convolutional layers. The DNCON2 method uses an ensembled two-level convolutional neural network, each network having 7 layers: six layers with 16 filters each, plus an additional convolutional layer with one filter for generating the final contact probabilities. Each convolutional network here has around 50 thousand parameters. In the DeepCov method, the 441 input channels are reduced to 64 using a Maxout layer, and then fully convolutional networks of various depths are tested. DeepCov performs best at a receptive field size of 15, i.e. 7 convolutional layers. The PconsC4 method directly uses the U-Net architecture, which accepts a 256 X 256 X 64 input volume, i.e. the input channels are first projected to 64 channels. Such an architecture has 31 million parameters.


Using the covariance matrix as the input feature, with the DeepCov dataset for training and validation and the PSICOV 150 proteins as the test dataset, we find that DEEPCON performs similarly to the standard residual architecture (see Fig. 4) on the PL/5-LR evaluation metric. On the PL-LR and PNC-MLR metrics, however, DEEPCON's performance is significantly higher than that of the standard residual network architecture, a fully convolutional CNN architecture, and the U-Net architecture. Similarly, using sequence features as input, with the DNCON2 dataset for training and the CASP12 protein domains for testing, we observe similar results (see Fig. 4). The DEEPCON architecture's high performance on these two very different datasets suggests that deep residual networks with dropout are reliable and work well across a variety of input features for predicting protein contacts.


3.2 Performance of DEEPCON

We compared the performance of our method, DEEPCON, with two state-of-the-art methods whose training datasets are publicly available. First, we compared DEEPCON with the DeepCov method when trained and tested using the same dataset of 3456 proteins. Using the covariance features as input, and with the first 130 proteins as the validation set (as done in DeepCov), we train our models and test them on the independent PSICOV dataset consisting of 150 proteins. Evaluating our predictions on the PSICOV150 dataset, we find that our implementation performs better than the DeepCov method even when we use an architecture similar to the one in the DeepCov method (see Table 1). Next, we compare DEEPCON's performance with the DeepCov method using PL/5-LR and PL-LR. Our comparison (see Table 1) shows that DEEPCON achieves a 5.9% improvement in PL/5-LR and a 15% improvement in PL-LR. For a fair comparison, we only train our models once on the DeepCov dataset. In fact, to reduce training time (and because of resource limitations), we trim all proteins to a length of 256 residues, possibly using less input information than the DeepCov method.


Second, we compared DEEPCON with the DNCON2 method, training on the DNCON2 dataset and testing on the 84 protein domains in the CASP12 dataset. For our predictions, we use the exact same input features (generated for targets instead of domains) as used by DNCON2. On these 84 CASP12 domains, we find that DEEPCON achieves 3.2% higher PL/2-LR and 4.8% higher PL-LR (see Table 1). For completeness, in Table 1 we also report the performance of our DEEPCON architecture when it is trained and validated on the DeepCov dataset with sequence features as input, and on the DNCON2 dataset with covariance features as input.


Furthermore, to verify that the improved contact predictions are indeed useful, we predicted full three-dimensional (3D) models for the 150 proteins in the PSICOV dataset using CONFOLD2 (Adhikari and Cheng, 2018). CONFOLD2 is ideal here because it relies purely on the predicted contacts given as input to build models, unlike methods that use template fragments. It iteratively selects the top 0.1 L, 0.2 L, 0.3 L, etc., up to 4.0 L input contacts, generates up to 200 decoys and automatically selects the top five models. For the PSICOV dataset, using the covariance features as input, we predicted contacts using the DeepCov method and our DEEPCON method, and filtered the contacts to keep only the medium-range and long-range ones. These contacts were supplied as input to CONFOLD2 to obtain full-atom 3D models. Next, we evaluated the top five models predicted by CONFOLD2 and compared the best of the top five. Comparing the predicted contacts using PL-LR and the best models using TM-score (Zhang, 2005) shows that, on average, DEEPCON contact predictions are better than DeepCov contacts (see Fig. 5). From the same figure, we also infer that the improvement in DEEPCON is not limited to proteins for which a large number of homologous sequences is found. We calculated the number of effective homologous sequences (Neff) using the 'colstats' program in the MetaPSICOV package. Finally, following the practice of evaluating the calibration of predicted contacts (Liu et al., 2018; Jones and Kandathil, 2018; Monastyrskyy et al., 2016), i.e. checking whether the predicted probabilities are underestimated or overestimated, we evaluated the precision of contacts predicted at various confidence intervals and found that DEEPCON's probabilities are highly calibrated (see Fig. 6).

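As a reference for this filtering step, the sketch below keeps only medium- and long-range pairs (sequence separation of at least 12) and writes them ranked by predicted probability. The 'i j 0 8 prob' line format is the common CASP RR convention; the exact input format CONFOLD2 expects should be checked against its documentation.

```python
import numpy as np

def write_medium_long_contacts(pred, out_path, min_sep=12):
    """Write ranked medium- and long-range contacts from a symmetric
    L x L probability matrix, one 'i j 0 8 prob' line per pair (1-based)."""
    L = pred.shape[0]
    iu = np.triu_indices(L, k=min_sep)
    probs = pred[iu]
    order = np.argsort(-probs)
    with open(out_path, 'w') as f:
        for k in order:
            f.write(f"{iu[0][k] + 1} {iu[1][k] + 1} 0 8 {probs[k]:.6f}\n")
```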

We believe that there are three reasons for DEEPCON's improved performance. In Table 1 we show that when DEEPCON's architecture is scaled back to an architecture similar to the DeepCov method's, the improvement is still significant. This suggests that the overall training and testing parameters (such as the optimizer) and the overall training framework built using TensorFlow and Keras contribute to the improvement. DEEPCON's deeper architecture, consisting of 32 residual blocks (effectively 64 convolutional layers), and the use of Dropout and BatchNormalization in the residual block, as shown in Figure 3, also contribute to the improvement.


3.3 Evaluation on CASP 13 dataset

Finally, we compare our DEEPCON method, trained using the covariance features from the DeepCov dataset, with a few state-of-the-art methods that accept a multiple sequence alignment as input. For this comparison, we consider the 20 protein targets (consisting of 32 domains) in the CASP13 dataset for which the native structures are publicly available. Although it would be ideal to evaluate on the entire CASP13 dataset, as done by the CASP assessors, we are only able to evaluate the targets whose native structures are available to us. These 20 protein targets include both free-modeling (FM) and template-based modeling (TBM) domains. First, we generated multiple sequence alignments for these 20 protein targets using the publicly available DNCON2 scripts. As done in the DNCON2 method, we use the sequence databases curated in 2012. Next, we predicted contacts using our DEEPCON method and evaluated the contacts on the 32 domains. We also downloaded the PconsC4 tool available at https://github.com/ElofssonLab/PconsC4/ and predicted contacts with it. Similarly, we downloaded the DeepCov method available at https://github.com/psipred/DeepCov and predicted contacts with the same alignment files as input. To obtain a baseline, we also predicted contacts using CCMpred and FreeContact. To obtain reference results, we evaluated the contact predictions of the top group in CASP13 over these 20 targets, and also submitted the alignments to the Raptor-X web server at http://raptorx.uchicago.edu/ContactMap/ and evaluated the contacts it predicted. We use the same alignment files as input to all these methods.


Table 2 shows that our method DEEPCON performs better than the baseline methods CCMpred and FreeContact, and better than PconsC4 and DeepCov. Our comparison of DEEPCON, DeepCov and PconsC4 with the RaptorX web server is only for reference, because RaptorX uses many additional features beyond the covariance calculated from raw alignment files. It is worth noting that the top group in the CASP13 experiment was the RaptorX method. The large difference between the top group's performance in CASP13 and the performance of the RaptorX web server with our alignments as input (see Table 2), together with the fact that the RaptorX method used in the CASP13 experiment is not significantly updated from the published method (as mentioned by Dr. Xu in his presentation), highlights the performance gain that could be achieved with better/larger multiple sequence alignments. To further validate this assumption, we reached out to the top-performing groups in the CASP13 experiment asking for their alignment files, but we did not receive such data for these 20 targets. The differences in precision between the baseline methods (CCMpred and FreeContact) and the deep learning methods (DEEPCON, DeepCov and PconsC4) represent the gain from the use of deep neural networks.


——————————————————————————————

4 Conclusions

We found that regularization using dropout is highly effective in all architectures: fully convolutional, residual, or dilated residual networks. We also found that dilated convolutions yield only negligibly better performance than regular residual networks, but these gains are amplified when dilated convolutions are combined with dropout. We believe that our findings, when combined with techniques used in other methods such as multiple distance thresholds (Adhikari et al., 2018; Jones et al., 2015; Michel et al., 2018), ensembling (Adhikari et al., 2018; Wang et al., 2017) and predicting actual distances using binning (Xu, 2018), can significantly improve the overall contact prediction precision. We also believe that our findings and architectures will be utilized by other researchers to continue this development and obtain better performance with more complex and powerful architectures, larger training and validation datasets, additional input features and output labels, and model ensembling.

