Accurate prediction of molecular targets using a self-supervised image representation learning framework


https://assets.researchsquare.com/files/rs-1477870/v1_covered.pdf?c=1649357561

Code: GitHub - HongxinXiang/ImageMol: ImageMol is a molecular image-based pre-training deep learning framework for computational drug discovery.

ImageMol is used to discover anti-SARS-CoV-2 inhibitors. ImageMol is the first pre-training framework based on molecular images rather than molecular graphs (be careful to distinguish a molecular graph from a molecular image): unlike a molecular graph, a molecular image is composed of a pile of pixels rather than vertices and edges. (Question: if a molecular image is represented as an image, what is the representation of a molecular graph?)

Several ways of representing a molecule:

Supplementary reading on "self-supervised learning + multi-view learning":

Self-Supervised Learning (I): Based on Pretext Tasks (CSDN blog by 马鹏森)

Self-Supervised Learning (II): Based on Contrastive Learning (CSDN blog by 马鹏森)

Self-Supervised Learning (III): Based on Masked Image Modeling (CSDN blog by 马鹏森)

Revisiting Self-Supervised Learning, a digest (CSDN blog by 马鹏森)

What is Multi-View Learning? And how does Co-training relate to it? (CSDN blog by 马鹏森)

To better understand self-supervised learning, as well as the MRD (molecular rationality discrimination) and MCL (mask-based contrastive learning) tasks in this paper, I studied the CVPR 2021 paper "CutPaste: Self-Supervised Learning for Anomaly Detection and Localization"; see my write-up: CutPaste: Self-Supervised Learning for Anomaly Detection and Localization paper notes (defect detection) (CSDN blog by 马鹏森)


Background knowledge:

1. Drug design:
Drug design is the process of finding and inventing new medications based on existing knowledge of a biological target. New chemical drugs are designed according to properties such as the chemical structure, charge, and shape of the target, so that they are likely to produce the desired effect.

2. Computer-aided drug design (CADD):
Drug design carried out with computer molecular modelling techniques is called computer-aided drug design.

Design based on knowledge of the chemical structure of the biological target is called structure-based drug design.

3. Molecular fingerprints

One of the most important problems encountered when comparing the similarity of two compounds is the complexity of the task, which depends on the complexity of the molecular representation.
To make comparisons between molecules computationally easier, a certain degree of simplification or abstraction is required. A molecular fingerprint is such an abstract representation of a molecule: it transforms (encodes) the molecule into a bit string (i.e., a bit vector) that can then easily be compared between molecules. The typical pipeline extracts the structural features of a molecule and then hashes them to generate the bit vector.

So molecular fingerprints are just the labels!!!

Comparing molecules is hard, but comparing bit strings is easy, and comparisons between molecules must be done in a quantifiable way. Each bit of a molecular fingerprint corresponds to a molecular fragment (Figure 1); assuming that similar molecules necessarily share many common fragments, molecules with similar fingerprints have a high probability of also being similar in their 2D structure.
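As a small illustration of how fingerprint bit vectors make molecules easy to compare, here is a minimal sketch assuming RDKit is installed (the two molecules are arbitrary examples):

from rdkit import Chem
from rdkit.Chem import MACCSkeys
from rdkit import DataStructs

# Encode two molecules as MACCS key bit vectors (166 structural keys;
# RDKit's vector has 167 bits, with bit 0 unused).
fp1 = MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles("CCO"))   # ethanol
fp2 = MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles("CCCO"))  # 1-propanol

# Comparing bit strings is easy: Tanimoto similarity = shared on-bits / total on-bits.
print(DataStructs.TanimotoSimilarity(fp1, fp2))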

B.1 Molecular image and fingerprint generation (how this paper generates molecular images and fingerprints)

How are molecular images obtained from the raw data?

==> First, we filter out the molecules in the original dataset that have no SMILES information;
==> Second, we use RDKit to convert the SMILES sequences into molecular images, setting the image size to 224 × 224;
==> Finally, these equally sized molecular images are used as the initial dataset of our method.

In this study, we use images as the molecular representation. To obtain molecular images, RDKit (https://github.com/rdkit/rdkit) is used to generate a standard and unique image [3] for each molecule. Unlike a molecular graph, a molecular image is composed of a pile of pixels rather than vertices and edges.
    In detail, we first filter out molecules without SMILES information in the original dataset. Second, we transform the SMILES sequences to molecular images using RDKit and set the image size to 224 × 224. Finally, these molecular images with the same size will be used as the initial dataset of our method.
    Considering that molecular fingerprints are easy to obtain and can express some prior knowledge about molecules, we chose MACCS keys to assist our pre-training process so that our model learns molecule-related prior knowledge. The MACCS keys are one of the most commonly used structural molecular fingerprints [4] and contain 166 keys related to molecular structure. In our work, we used RDKit to generate a distinct 166-D molecular fingerprint for each molecule.
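A minimal sketch of this generation step, assuming RDKit is installed (ImageMol's exact drawing options may differ):

from rdkit import Chem
from rdkit.Chem import Draw, MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example
if mol is not None:                                # molecules without valid SMILES are filtered out
    img = Draw.MolToImage(mol, size=(224, 224))    # 224 x 224 molecular image (PIL)
    fp = MACCSkeys.GenMACCSKeys(mol)               # MACCS fingerprint bit vector
    img.save("aspirin.png")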

4. Molecular target

A target is a protein, enzyme, or nucleic acid in our body whose activity can be altered by a drug to produce a specific effect. More generally, a target is any structure in an organism that can be recognized or bound by another substance (a ligand, a drug, etc.).

5. Virtual screening (VS)

Virtual screening (VS), also called in silico screening, uses molecular docking software to simulate the interaction between a target and candidate drugs and to estimate their binding affinity before any biological activity screening is performed. This reduces the number of compounds that must be screened experimentally and improves the efficiency of lead discovery, thereby lowering R&D cost and shortening the R&D cycle.

6. Lead compound

A lead compound is a compound with pharmacological or biological activity that can be used to develop a new drug; its chemical structure can be further optimized to improve potency and selectivity and to improve pharmacokinetic properties. A newly discovered lead compound may have defects such as insufficient activity, an unstable chemical structure, high toxicity, poor selectivity, or unreasonable pharmacokinetics. It therefore needs chemical optimization to raise its drug-like properties to a level sufficient for biological or clinical testing, and then further optimization to develop it into an ideal drug; this process is called lead optimization.

7. SMILES: Simplified Molecular Input Line Entry Specification

SMILES is a specification that describes molecular structure with ASCII strings. A SMILES string can be imported by most molecular editing software and converted into a 2D drawing or a 3D model of the molecule (that is, a SMILES string and the corresponding molecular structure, rendered as an image, can be converted into each other).

Example: the SMILES string CCO denotes ethanol, and c1ccccc1 denotes benzene.

Overall logic of the paper:

  • 1. Pretraining: we first need to prepare the dataset, i.e., obtain the molecular image corresponding to each SMILES row of the "data.csv" file.
  • 2. Start to pretrain (train a pre-trained model for downstream tasks). The main goal is to obtain a good molecular encoder (its capability is continually improved through five strategies); the molecular encoder is the model we pretrain.
  • 3. Finetuning (fine-tune the pretrained model on a downstream task, here classification: the model is fine-tuned by computing the cross-entropy loss between the predicted class probability \tilde{\mathcal{Y}}_{n}^{gt} and the true label \mathcal{Y}_{n}^{gt}; a minimal sketch follows below). For concrete details, see the "2. fine-tune" part on the front page of the code walk-through post.
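A minimal fine-tuning sketch under stated assumptions: a plain torchvision ResNet18 trunk stands in for the pretrained molecular encoder, the downstream task is binary classification, and train batches come from a generic DataLoader (none of these names are taken from the ImageMol repo).

import torch
import torch.nn as nn
from torchvision import models

# Pretrained molecular encoder: here a ResNet18 trunk whose FC layer is replaced
# by Identity, so it outputs 512-D latent features (ImageMol loads its own weights).
encoder = models.resnet18()
encoder.fc = nn.Identity()
head = nn.Linear(512, 2)           # task-specific FC classification layer
criterion = nn.CrossEntropyLoss()  # cross-entropy between predictions and true labels

optimizer = torch.optim.SGD(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-3, momentum=0.9)

def finetune_step(images, labels):
    logits = head(encoder(images))  # (batch, 2) logits for the downstream task
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()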


Abstract


The clinical efficacy and safety of a drug are determined by its molecular targets in the human proteome. However, proteome-wide evaluation of all compounds in human, or even animal models, is challenging. In this study, we present an unsupervised pre-training deep learning framework, termed ImageMol, pretrained on 8.5 million unlabeled drug-like molecules, to predict molecular targets of candidate compounds. The ImageMol framework is designed to pretrain chemical representations from unlabeled molecular images based on local- and global-structural characteristics of molecules from pixels.

We demonstrate high performance of ImageMol in evaluation of molecular properties (i.e., drug's metabolism, brain penetration and toxicity) and molecular target profiles (i.e., human immunodeficiency virus) across 10 benchmark datasets. ImageMol shows high accuracy in identifying anti-SARS-CoV-2 molecules across 13 high-throughput experimental datasets from the National Center for Advancing Translational Sciences (NCATS), and we re-prioritized candidate clinical 3CL inhibitors for potential treatment of COVID-19. In summary, ImageMol is an active self-supervised image processing-based strategy that offers a powerful toolbox for computational drug discovery in a variety of human diseases, including COVID-19.


Introduction

Despite recent advances of biomedical research and technologies, drug discovery and development remains a challenging multidimensional task requiring optimization of vital properties of candidate compounds, including pharmacokinetics, efficacy and safety [1, 2]. It was estimated that pharmaceutical companies spent $2.6 billion in 2015, up from $802 million in 2003, on drug approval by the U.S. Food and Drug Administration (FDA) [3]. The increasing cost of drug development resulted from lack of efficacy of the randomized controlled trials, and the unknown pharmacokinetics and safety profiles of candidate compounds [4-6]. Traditional experimental approaches are unfeasible on proteome-wide scale evaluation of molecular targets for all candidate compounds in human, or even animal models. Computational approaches and technologies have been considered a promising solution [7, 8], which can significantly reduce costs and time during the entire pipeline of the drug discovery and development.

The rise of advanced Artificial Intelligence (AI) technologies [9, 10], motivated their application to drug design [11-13] and target identification [14- 16]. One of the fundamental challenges is how to learn molecular representation from chemical structures [17]. Previous molecular representations were based on hand-crafted features, such as fingerprint-based features [16, 18], physiochemical descriptors and pharmacophore-based features [19, 20]. However, these traditional molecular representation methods rely on a large amount of domain knowledge, such as sequence-based [21, 22] and graph-based [23, 24] approaches. Their accuracy in extracting informative vectors for description of molecular identities and biological characteristics of the molecules is limited. Recent advances of unsupervised learning in computer vision [25, 26] suggest that it is possible to apply unsupervised image-based pre-training models for computational drug discovery.

In this study, we presented an unsupervised molecular image pretraining framework (termed ImageMol) with chemical awareness for learning the molecular structures from large-scale molecular images. ImageMol combines an image processing framework with comprehensive molecular chemistry knowledge for extracting fine pixel-level molecular features in a visual computing way. Compared with state-of-the-art methods, ImageMol has two significant improvements:
(1) It utilizes molecular images as the feature representation of compounds with high accuracy and low computing cost;
(2) It exploits an unsupervised pre-trained learning framework to capture the structural information of molecular images from 8.5 million drug-like compounds with diverse biological activities at the human proteome (Figure 1). We demonstrated the high accuracy of ImageMol in a variety of drug discovery tasks. Via ImageMol, we identified anti-SARS-CoV-2 molecules across 13 high-throughput experimental datasets from the National Center for Advancing Translational Sciences (NCATS). In summary, ImageMol provides a powerful pre-training deep learning framework for computational drug discovery.


Fig. 1: A diagram illustrating the ImageMol framework.

(a) A molecular encoder (light blue) is used to extract the latent features of the molecular images.
(b) Five strategies are used to pretrain the molecular encoder:

  1. The structural classifier (dark blue) in multi-granularity chemical clusters classification (MG3C) is used to predict chemical structural information in molecular images.
  2. The rationality classifier (green) in molecular rationality discrimination (MRD) is used to distinguish rational and irrational molecules.
  3. The jigsaw classifier (grey) in jigsaw puzzle prediction (JPP) is used to predict rational permutations.
  4. The contrastive classifier (orange) in MASK-based contrastive learning (MCL) is used to maximize the similarity between the original image and the masked image.
  5. The generator (yellow) in molecular image reconstruction (MIR) is used to restore latent features back to the molecular image, and the discriminator (purple) is used to discriminate between real and fake molecular images.

(c) ImageMol for discovery of anti-SARS-CoV-2 inhibitors. A fully connected (FC) layer is appended to the pretrained molecular encoder for fine-tuning on the COVID-19 dataset.

Subsequently, the fine-tuned model is used for virtual screening of approved drugs in DrugBank. Of the top 20 drugs, 65% have been validated by experimental and clinical evidence as potential inhibitors for COVID-19.

Results

Description of ImageMol

Here, we developed a pre-training deep learning framework, ImageMol, for accurate prediction of molecular targets. ImageMol was pre-trained on 8,506,205 molecular images from two large drug-like databases (ChEMBL [27] and ZINC [28]).

We assembled five pretext tasks to extract biologically relevant structural information:
1) A molecular encoder was designed to extract latent features from 8.5 million molecular images (Fig. 1a);
2) Five pretraining strategies (Supplementary Figures 1-5) are utilized to optimize the latent representation of the molecular encoder by considering the chemical knowledge and structural information from molecular images (Fig. 1b);
3) A pretrained molecular encoder is further fine-tuned on downstream tasks to further improve model performance (Fig. 1c);
4)+5) In addition, two pretext tasks (the multi-granularity chemical clusters classification task and the molecular rationality discrimination task; cf. Methods) are further designed to ensure that ImageMol properly captures meaningful chemical information from images (Supplementary Figures 1-2).

We next evaluated the performance of ImageMol in a variety of drug discovery tasks, including evaluation of the drug's metabolism, brain penetration, toxicity profiles, and molecular target profiles across the human immunodeficiency virus (HIV), SARS-CoV-2, and Alzheimer's disease.

1. Multi-Granularity Chemical Clusters Classification (MG3C): used to predict chemical structural information in molecular images.

The architectural details of the Multi-Granularity Chemical Clusters Classification (MG3C) task.

Figure S1: First, molecular fingerprints are extracted from SMILES
and input into unsupervised multi-granularity clustering to produce clusters of different granularities.
Then, these clusters are uniquely numbered as pseudo-labels of the molecular images.
Finally, the molecular encoder and structural classifier are jointly used to predict the labels of molecular images, optimizing the loss between pseudo-labels and predicted labels during pre-training.

In the multi-granularity chemical clusters classification (MG3C) task (Figure S1), chemical fingerprints are first extracted from SMILES
and input into unsupervised K-means with different K values to produce clusters of different structural granularities.
Then, these clusters are treated as pseudo-labels of the molecular images.
Finally, the molecular encoder and structural classifier are jointly used to predict the labels of molecular images, optimizing the loss between pseudo-labels and predicted labels during pre-training.

The structural classifier is a multi-task learner that receives 512-dimensional features as input and forward-propagates them to 3 fully connected layers with different numbers of neurons (100, 1000 and 10000) for classifying the different clustering granularities.
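A minimal sketch of such a multi-task structural classifier (the class and attribute names are illustrative, not taken from the ImageMol repo):

import torch.nn as nn

class StructuralClassifier(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Three parallel heads, one per clustering granularity (K = 100, 1000, 10000).
        self.head_100 = nn.Linear(feat_dim, 100)
        self.head_1000 = nn.Linear(feat_dim, 1000)
        self.head_10000 = nn.Linear(feat_dim, 10000)

    def forward(self, z):
        # z: (batch, 512) latent features from the molecular encoder.
        return self.head_100(z), self.head_1000(z), self.head_10000(z)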

2. Molecular Rationality Discrimination (MRD): trains the molecular encoder + rationality classifier (to judge whether a molecular structure is rational).

Similar in spirit to the "jigsaw" pretext task in self-supervised learning.

The architectural details of the Molecular Rationality Discrimination (MRD) task.

Figure S2: To construct an irrational molecular image, we first disrupt the molecular structure by using a 3 × 3 grid to decompose the molecular image into 9 patches and randomly shuffling them to form an irrational image. Then, these rational and irrational molecular images are input to the molecular encoder to extract visual features. Finally, these features are forward-propagated to the rationality classifier for rationality judgment.

In the molecular rationality discrimination (MRD) task (Figure S2), we first disrupt the molecular structure to construct an irrational molecular image: a 3 × 3 grid decomposes the molecular image into 9 patches, which are randomly shuffled to form an irrational image. The original images are viewed as rational molecular images. Then, these rational and irrational molecular images are input into the molecular encoder to extract 512-D features. Finally, these features are forward-propagated to the rationality classifier for rationality judgment.

The rationality classifier is a simple MLP that takes 512-dimensional features as input and directly outputs 2-dimensional results (rational or irrational).

Concrete code implementation:

# MRD: Rationality classifier (to discriminate rationality)
import torch.nn as nn

class Matcher(nn.Module):
    def __init__(self):
        super(Matcher, self).__init__()
        self.fc = nn.Linear(512, 2)      # 512-D latent feature -> 2 classes (rational / irrational)
        self.logic = nn.LogSoftmax(dim=1)
        self.apply(weights_init)         # weights_init is defined elsewhere in the ImageMol repo

    def forward(self, x):
        o = self.logic(self.fc(x))
        return o

3. Jigsaw Puzzle Prediction (JPP): trains the molecular encoder + jigsaw classifier (to predict which permutation the patches are in).

Like MRD, this follows the "jigsaw" pretext task from self-supervised learning.

The architectural details of the Jigsaw Puzzle Prediction (JPP) task.

Figure S3: We first use a 3 × 3 grid to decompose the molecular image into 9 patches and assign numbers from 1 to 9. Then, we use different permutations to reorganize the image. Finally, the reorganized images are fed into the molecular encoder and jigsaw classifier to predict the corresponding permutations.

In the jigsaw puzzle prediction (JPP) task (Figure S3), similar to MRD, we first decompose the molecular image into 9 patches and label the original permutation as (1, 2, 3, 4, 5, 6, 7, 8, 9).
Then, we randomly shuffle the permutation and re-stitch the patches into new images such as (7, 1, 6, 2, 0, 5, 4, 3, 8) or (7, 8, 5, 6, 3, 2, 0, 1, 4). In particular, we randomly select from 100 predefined permutations, which can be obtained from permutations_100.npy (https://github.com/fmcarlucci/JigenDG/blob/master/permutations_100.npy).
Finally, the molecular encoder is used to extract features of the rearranged images, which are subsequently input into the jigsaw classifier to predict the permutation (a 100-way classification).

The jigsaw classifier is a simple MLP consisting of a 512-dimensional input layer and a 100-dimensional output layer.
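As a concrete illustration of the 3 × 3 patch decomposition and shuffling shared by MRD and JPP, here is a minimal NumPy sketch (the function name and grid handling are illustrative, assuming a square image whose side is divisible by 3):

import numpy as np

def shuffle_patches(img, perm, grid=3):
    # img: (H, W, C) array; perm: 9 patch indices, e.g. (7, 1, 6, 2, 0, 5, 4, 3, 8).
    h, w = img.shape[0] // grid, img.shape[1] // grid
    patches = [img[r*h:(r+1)*h, c*w:(c+1)*w]
               for r in range(grid) for c in range(grid)]
    out = np.empty_like(img)
    for i, p in enumerate(perm):
        r, c = divmod(i, grid)
        out[r*h:(r+1)*h, c*w:(c+1)*w] = patches[p]  # place patch p at grid position i
    return out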

4. MASK-based Contrastive Learning (MCL): optimizes the molecular encoder.

If this were contrastive learning in the usual self-supervised sense, what would the positive and negative samples be? Isn't this masking really a mask-style pretext task?

The architectural details of the MASK-based Contrastive Learning (MCL) task.

Figure S4: We first randomly mask a 16 × 16 area to obtain a masked image. Then a pair of images (original image and the masked image) are simultaneously fed into the molecular encoder to extract latent features. Finally, we optimize the molecular encoder by maximizing the similarity among the latent feature pairs.

In the MASK-based contrastive learning (MCL) task (Figure S4), we randomly mask a 16 × 16 region of the molecular image, which is filled using the mean of the image (see masked examples in Figure S16). Subsequently, image pairs (original image, masked image) are fed into the molecular encoder to extract features and maximize their similarity.

Here, the Euclidean distance is used to constrain the similarity between two features, and we should minimize the Euclidean distance to ensure greater similarity.
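A minimal sketch of the 16 × 16 mean-fill masking (illustrative only; ImageMol's own implementation lives in its repo):

import numpy as np

def random_mask(img, size=16, rng=np.random):
    # img: (H, W, C) float array; fill a random size x size square with the image mean.
    out = img.copy()
    y = rng.randint(0, img.shape[0] - size + 1)
    x = rng.randint(0, img.shape[1] - size + 1)
    out[y:y+size, x:x+size] = img.mean()
    return out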

5. Molecular Image Reconstruction (MIR): generates 64 × 64 molecular images from latent features and discriminates real molecular images from fake ones.

The architectural details of the Molecular Image Reconstruction (MIR) task.

Figure S5: The generator is used to reconstruct the latent features z^{aug} back into 64 × 64 molecular images X^{rec}. The discriminator accepts the generated image X^{rec} and the real image X^{aug} and discriminates between real and fake.

In the molecular image reconstruction (MIR) task (Figure S5), we build our GAN model based on context encoders [6]. The details of the GAN model are described in Figure S5. In the generator, the latent features z^{aug} are first forwarded to a single-hidden-layer MLP, which accepts a 512-D input and produces a 128-D output. Subsequently, four ConvTranspose2D layers with BatchNorm2D and ReLU are used. In ConvTranspose2D, the numbers represent input channels, output channels, kernel size and stride, respectively. Finally, a ConvTranspose2D layer with a Tanh activation function is used to generate 64 × 64 images.

In the discriminator, X^{aug} is first resized to 64 × 64. Then the resized X^{aug} and X^{rec} are input to a Conv2D with LeakyReLU and three Conv2D layers with BatchNorm2D and LeakyReLU (negative slope 0.2). In Conv2D, the numbers have the same meaning as in ConvTranspose2D. Finally, a Conv2D layer is used to discriminate whether input images are real or fake.
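A minimal sketch of such a generator (512-D latent vector to a 64 × 64 image); the channel widths are illustrative assumptions, not the exact repo configuration:

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(512, 128), nn.ReLU())  # single-hidden-layer MLP
        def up(cin, cout, stride, pad):  # ConvTranspose2D + BatchNorm2D + ReLU block
            return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride, pad),
                                 nn.BatchNorm2d(cout), nn.ReLU())
        self.deconv = nn.Sequential(
            up(128, 256, 1, 0),                             # 1x1 -> 4x4
            up(256, 128, 2, 1),                             # 4x4 -> 8x8
            up(128, 64, 2, 1),                              # 8x8 -> 16x16
            up(64, 32, 2, 1),                               # 16x16 -> 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh())  # 32x32 -> 64x64

    def forward(self, z):
        x = self.fc(z).view(-1, 128, 1, 1)  # treat the MLP output as a 1x1 feature map
        return self.deconv(x)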

Benchmark evaluation of ImageMol


We first evaluated the performance of ImageMol using four types of benchmark datasets (Supplementary Tables 1 and 2): (1) molecular targets (human immunodeficiency virus [HIV] and beta-secretase [BACE, a key target in Alzheimer's disease]); (2) blood-brain barrier penetration (BBBP); (3) drug metabolism; and (4) molecular toxicity, using the Toxicology in the 21st Century (Tox21) and clinical trial toxicity (ClinTox) databases (cf. Methods). Using the area under the receiver operating characteristic (ROC) curve (AUC), ImageMol achieves high values (Fig. 2a) across HIV (AUC=0.821), BACE (AUC=0.902), BBBP (AUC=0.931) and Tox21 (AUC=0.809). In a stratified split, the proportion of each class in the training set, validation set, and test set is the same as in the original dataset. In a scaffold split, the datasets are divided according to molecular substructure: the substructures in the training set, validation set, and test set are disjoint, making this split ideal for testing the robustness and generalizability of in silico models. For a fair comparison in Fig. 2b, we used the same experimental setup as Chemception [29], a state-of-the-art convolutional neural network (CNN) framework. ImageMol achieves higher AUC values on HIV (AUC=0.821) and Tox21 (AUC=0.824), suggesting that ImageMol can capture more biologically relevant information from molecular images than this CNN. We further evaluated the performance of ImageMol in predicting drug metabolism across five major metabolic enzymes: CYP1A2, CYP2C9, CYP2C19, CYP2D6 and CYP3A4 (cf. Methods). Figure 2c shows that ImageMol achieves higher AUC values (ranging from 0.802 to 0.892) in predicting inhibitors vs. non-inhibitors of the five major drug metabolism enzymes compared with two state-of-the-art molecular image-based representation models: ADMET-CNN [30] and QSAR-CNN [31]. Additional results of the detailed comparison are provided in Supplementary Figures 6-7.

We further compared the performance of ImageMol with three types of state-of-the-art molecular representation models: 1) fingerprint-based, 2) sequence-based, and 3) graph-based models. As shown in Fig. 2d, ImageMol outperforms two sequence-based models (SMILES Transformer [22] and the Recurrent Neural Network-based Sequence-to-Sequence model (RNNS2S) [32]) across all four benchmark biomedical datasets with a stratified split. ImageMol has better performance (Fig. 2e) than three sequence-based models (ChemBERTa [21], SMILES Transformer and Mol2Vec [33]) and two graph-based models (Jure's GNN [23] and N-GRAM [34]) under a scaffold split. In addition, we found that ImageMol achieved higher AUC values (Fig. 2f) than traditional MACCS-based and FP4-based methods [35] across multiple machine learning algorithms, including support vector machine, Decision Tree, k-Nearest Neighbors, Naive Bayes (NB), and their ensemble models [35] (Supplementary Table 3). Detailed comparisons of ImageMol with each model/method are provided in the Supplemental Methods and Results. Altogether, ImageMol achieves high performance in multiple drug discovery tasks, outperforming state-of-the-art methods (Fig. 2d-2f and Supplementary Tables 4-5).

Prediction of anti-viral activities across 13 SARS-CoV-2 targets


The ongoing global COVID-19 pandemic caused by the SARS-CoV-2 virus has led to hundreds of millions of confirmed cases and over 6 million deaths worldwide as of March 15, 2022. There is a critical, time-sensitive need to develop effective anti-viral treatment strategies for the COVID-19 pandemic [6, 36]. We therefore tested ImageMol's ability to identify potential anti-SARS-CoV-2 treatments across a variety of SARS-CoV-2 biological assays, including viral replication, viral entry, counterscreens, in vitro infectivity, and live virus infectivity [20]. In total, we evaluated ImageMol across 13 SARS-CoV-2 datasets, including 3C-like (3CL) protease enzymatic activity, angiotensin converting enzyme 2 (ACE2) enzymatic activity, human embryonic kidney 293 (HEK293) cell line toxicity, human fibroblast toxicity (Hcytox), Middle East respiratory syndrome pseudotyped particle entry (MERS-PPE) and its Huh7 tox counterscreen (MERS-PPE_cs), SARS-CoV PPE (CoV-PPE) and its VeroE6 tox counterscreen (CoV-PPE_cs), SARS-CoV-2 cytopathic effect (CPE) and its host tox counterscreen (Cytotox), Spike-ACE2 protein-protein interaction (AlphaLISA) and its TruHit counterscreen, and transmembrane protease serine 2 (TMPRSS2) enzymatic activity (Supplementary Table 6). Across the 13 SARS-CoV-2 targets, ImageMol achieves high AUC values ranging from 72.0% to 82.6% (Fig. 3a). To test whether ImageMol captures biologically relevant features, we used the global average pooling (GAP) layer of ImageMol to extract latent features of each dataset and used t-SNE to visualize them. Fig. 3b reveals that the latent features identified by ImageMol are well clustered according to whether the compounds are active or inactive anti-SARS-CoV-2 agents across 8 targets or endpoints. These observations show that ImageMol can accurately extract discriminative antiviral features from molecular images for downstream tasks. We further compared ImageMol with both deep learning and machine learning frameworks: 1) a graph neural network (GNN) with a series of pre-training strategies (termed Jure's GNN [23]), and 2) REDIAL-2020 [20], a suite of machine learning models for estimating small molecule activities in a range of SARS-CoV-2-related assays. We found that ImageMol significantly outperforms Jure's GNN models across all 13 SARS-CoV-2 targets (Fig. 3a and Supplementary Table 7). For instance, ImageMol shows an over 12% elevated AUC value (AUC = 0.824) compared to Jure's GNN model (AUC = 0.704) in the prediction of 3CL protease inhibitors. We further evaluated the area under the precision-recall curve (AUPR), a metric that is highly sensitive to imbalance between positive and negative labeled data. Compared to Jure's GNN models, the AUPR improvement of ImageMol ranges from 3.0% to 29.1%, with an average performance advantage of 8.5% across the 13 SARS-CoV-2 targets, in particular for 3CL protease inhibitors (29.1% AUPR improvement) and ACE2 enzymatic activity (26.9% AUPR improvement). To compare with REDIAL-2020 [20], we used the same experimental settings and performance evaluation metrics, including accuracy, sensitivity, precision, F1 (the harmonic mean of sensitivity and precision) and AUC. We found that ImageMol outperformed REDIAL-2020 as well (Supplementary Table 8). In summary, these comprehensive evaluations reveal the high accuracy of ImageMol in identifying anti-SARS-CoV-2 molecules across diverse viral targets and phenotypic assays.
Furthermore, ImageMol handles datasets with an extreme imbalance of positive and negative samples better than traditional deep learning pre-trained models [23] or machine learning approaches [20].

Identifying anti-SARS-CoV-2 inhibitors via ImageMol


We next turned to identifying potential anti-SARS-CoV-2 inhibitors, using 3CL protease as a prototypical example since it has been shown to be a promising target for therapeutic development in treating COVID-19 [37, 38]. We focused on 2,501 U.S. FDA-approved drugs from DrugBank [39] to identify ImageMol-predicted 3CL protease inhibitors as repurposable drugs for COVID-19 using a drug repurposing strategy [36]. Via the molecular image representation of the 3CL protease inhibitor vs. non-inhibitor dataset under the ImageMol framework, we found that 3CL inhibitors and non-inhibitors are well separated in a t-distributed Stochastic Neighbor Embedding (t-SNE) plot (Fig. 3c). Molecules with a half-maximal activity concentration (AC50) of less than 10 μM were defined as inhibitors; otherwise they were non-inhibitors. We report the probability of each drug in DrugBank being inferred as a 3CL protease inhibitor (Supplementary Table 9) and visualize the overall probability distribution (Supplementary Figure 8). We found that 12 of the top 20 drugs (60%) have been validated (by cell assays, clinical trials, etc.) as potential SARS-CoV-2 inhibitors (Supplementary Table 9), among which 3 drugs have been further verified as potential 3CL protease inhibitors by biological experiments (Fig. 3d). To test the generalization ability of ImageMol, we used 10 experimentally reported 3CL protease inhibitors as an external validation set (Supplementary Table 10). ImageMol identified 6 out of the 10 known 3CL protease inhibitors (a 60% success rate, Fig. 3e), suggesting high generalizability in anti-SARS-CoV-2 drug discovery. We further used the HEK293 assay to predict anti-SARS-CoV-2 repurposable drugs. We collected experimental evidence for the top 20 drugs as potential SARS-CoV-2 inhibitors (Supplementary Table 11) and found that 13 out of the 20 (65%) have been validated by different experimental assays (such as in vitro cellular assays and clinical trials) as potential inhibitors for the treatment of SARS-CoV-2 (Supplementary Table 11). Meanwhile, 122 drugs have been identified to block SARS-CoV-2 infection [40]. From these drugs, we selected a total of 70 small molecules overlapping with DrugBank to evaluate the performance of the HEK293 model. ImageMol successfully predicted 47 out of 70 (a 67.1% success rate, Supplementary Table 12), suggesting high generalizability of ImageMol for inferring potential candidate drugs in the HEK293 assay as well.

Biological Interpretation of ImageMol


We next used t-SNE to visualize molecular representations from different models to test the biological interpretability of ImageMol. We used the clusters identified by the multi-granularity chemical clusters classification (MG3C) task (cf. Methods) to split the molecular structures. We randomly selected 10% of the clusters obtained from MG3C and sampled 1,000 molecules for each cluster. We performed three comparisons for each molecule: a) MACCS fingerprints with 166-dimensional (166D) features, b) ImageMol without pre-training with 512D features, and c) ImageMol pre-trained 512D features. We found that ImageMol distinguishes molecular structures very well (Fig. 4e and Supplementary Figure 9c), outperforming MACCS fingerprints (Supplementary Figure 9a) and the non-pre-trained model (Supplementary Figure 9b). ImageMol can capture prior knowledge of chemical information from the molecular image representations, including the =O bond, -OH bond, -NH3 group and benzene ring (Fig. 4a). We further used the Davies-Bouldin (DB) index [34] to quantitatively evaluate the clustering results; a smaller DB index represents better performance. We found that ImageMol (DB index=1.98) was better than the MACCS fingerprint (DB index=2.13), and that pre-training significantly improves the molecular representation compared with the non-pre-trained model (DB index=18.48). Gradient-weighted Class Activation Mapping (Grad-CAM) [41] is a commonly used convolutional neural network (CNN) visualization method [42, 43]. Figures 4b and 4c illustrate 12 example molecules of the Grad-CAM visualization of ImageMol (cf. Supplementary Figures 10 and 11). ImageMol accurately attends to the global (Fig. 4b) and the local (Fig. 4c) structural information simultaneously. In addition, we measured the proportion of blank area relative to the entire molecular image across all 13 SARS-CoV-2 datasets (Supplementary Table 13). We found an average sparsity (the proportion of blank area in an image) of 94.9% across the entire dataset, meaning that image-based models could easily be tempted to use blank areas of the image for meaningless inference [31]. Figure 4d shows that ImageMol primarily pays attention to the middle area of the image during predictions; thus, ImageMol indeed predicts based on the molecular structures rather than the meaningless blank areas. We further calculated the coarse-grained and fine-grained hit rates (Supplementary Figure 12). The coarse-grained hit rate shows that ImageMol utilizes the molecular structure of every image for inference, with a ratio of 100%, compared to 90.7% for the QSAR-CNN models [31]. The fine-grained hit rate shows that ImageMol leverages almost all structural information in molecular images for inference, with a ratio of over 99%, reflecting its ability to capture the global information of molecules. In summary, ImageMol captures the biologically relevant chemical information of molecular images at both local and global levels of structural information, outperforming existing state-of-the-art deep learning approaches (Fig. 4).

Ablation analysis of ImageMol


The robustness of a model to hyperparameter tuning is important because the initialization of different parameters can affect model performance [44]. Here, we explore the impact of the pre-training strategies on the hyperparameter tuning of ImageMol. As shown in Supplementary Tables 4-5, ImageMol is more robust than ImageMol_NonPretrained, with an average performance variance of 1.2% versus 2.4%. Therefore, the pre-training strategies improve the robustness of ImageMol to initialization parameters. To explore the impact of pre-training at different data scales, we pretrained ImageMol with 0 million (no pre-training), 0.2 million, 0.6 million, 1 million, and 8.5 million drug-like compounds, respectively, and then evaluated the resulting performance. The average ROC-AUC performance of 0 million (75.7%), 0.2 million (76.9%), 0.6 million (81.6%), 1 million (83.8%) and 8.5 million (85.9%) improved by 1.2% to 10.2% over no pre-training as the pre-training data size increased. Thus, ImageMol can be further improved as more drug-like molecules are used for pre-training. We further investigated the impact of the different pretext tasks using multi-granularity chemical clusters classification (MG3C), jigsaw puzzle prediction (JPP), and MASK-based contrastive learning (MCL) (cf. Methods), respectively. Each pretext task improves the mean AUC value of ImageMol by 0.7% to 4.9%: without pretext task (75.7%), JPP (78.8%), MG3C (80.6%) and MCL (76.4%) (Supplementary Figure 14). The best performance was achieved by assembling all 3 pretext tasks for pre-training (AUC = 85.9%, Supplementary Figure 14). In summary, the pretext tasks integrated into the ImageMol framework synergistically improve performance, and the models can be improved further by hyperparameter tuning and by pre-training on larger drug-like chemical datasets in the future.

Discussion


We presented a self-supervised image processing-based pre-training deep learning framework that combines molecular images and unsupervised learning to learn molecular representations.

We demonstrated the high accuracy of ImageMol across multiple benchmark biomedical datasets with a variety of drug discovery tasks (Figs. 2 and 3). In particular, we identified candidate anti-SARS-CoV-2 agents, which were validated by ongoing clinical and experimental data across 13 biological anti-SARS-CoV-2 assays. If broadly applied, our pre-training deep learning framework will offer a powerful tool for rapid drug discovery and development for various emerging diseases, including COVID-19 pandemic and future pandemics as well.

We highlighted several improvements of ImageMol compared to other state-of-the-art methods.

  • First, ImageMol achieved high performance across diverse tasks of drug discovery, including drug-like property assessment (brain permeability, drug’s metabolism and toxicity) and molecular target prediction across diverse targets, such as Alzheimer’s disease (i.e., BACE) and emerging infectious diseases caused by HIV and SARS-CoV-2 virus.
  • Furthermore, ImageMol outperforms state-of-the-art methods, including traditional deep learning and machine learning models (Fig. 2a-2c).
  • Second, we showed that our self-supervised image-based representation approach outperformed traditional fingerprint-based and graph-based representation methods as well (Fig. 2d-2f).
  • Finally, ImageMol has better interpretability and is more intuitive in identifying biologically relevant chemical structures or substructures for molecular properties and target binding (Figs. 4a-4c).

Via ablation analysis, we showed that the pre-training process using 8.5 million drug-like molecules significantly improved model performance compared to models without pre-training. Thus, integrating additional chemical knowledge (such as atomic properties and 3D structural information) into each image or pixel area may further improve the performance of ImageMol. We found that the five pre-training tasks are well compatible and jointly improve model performance. We acknowledge several limitations of the current study. Although we mitigated the effects of different representations of molecular images through data augmentation, perturbed views (i.e., rotation and scaling) of the input images may still affect the prediction results of ImageMol. We did not optimize for the sparsity of molecular images, which may affect the latent features extracted by the model. It is challenging to explicitly define the chemical properties of atoms and bonds compared to graph-based methods [23, 34], which inevitably leads to insufficient chemical information. Several potential directions may improve ImageMol further:

(1) integration of larger-scale biomedical data and larger-capacity models (such as ViT [45]) in molecular images will inevitably be the focus of future work;
(2) multi-view learning of joint images and other representations (e.g. SMILES and graph) is an important research direction;
(3) introducing more chemical knowledge (including atomic properties, 3D information, etc.) into each image or pixel area is also worth studying. In summary, ImageMol is an active self-supervised image processing-based strategy that offers a powerful toolbox for computational drug discovery in a variety of human diseases, including COVID-19.

Online Methods

Strategies for pre-training ImageMol


Pre-training aims to make the model learn to extract expressive representations by training on large-scale unlabeled datasets; the well-pretrained model is then applied to related downstream tasks and fine-tuned to improve performance.

Defining several effective, task-related pretext tasks is required for pre-training the model.
In this paper, the core of our pretraining strategy is the visual representation of molecules under three principles: consistency, relevance, and rationality.

These principles lead ImageMol to capture meaningful chemical knowledge and structural information from molecular images.

1. Specifically, consistency means that the semantic information of the same chemical structure in different images is consistent, such as -OH, =O, benzene.
2. Relevance means that different augmentations of the same image (such as mask, shuffle) are related in the feature space. For example, the feature distribution of the masked image should be close to that of the original image.
3. Rationality means that the molecular structure must conform to chemical common sense. The model needs to recognize the rationality of a molecule in order to promote its understanding of the molecular structure.

Unlike graph-based and SMILES-based pre-training methods (which consider either only consistency or only relevance), ImageMol is the first molecular image-based pre-training framework and comprehensively considers multiple principles through five effective pretext tasks.


B.2 Pre-task details in pre-training(supplement material)

This section will describe the pre-training details of ImageMol with five pre-tasks. The overall data flow of the ImageMol framework during training is shown in Figure S17.

Raw input data ==> data augmentation ==> jigsaw shuffling and masking are applied to the augmented data to obtain the corresponding shuffled and masked images ==> these three kinds of data are fed into ResNet18 to extract latent features ==> the latent features are consumed by the five tasks.

  1. In general, the original input images X are processed into three different datasets.
  2. Augmented images X^{aug} are obtained by applying data augmentation to X, including RandomHorizontalFlip(), RandomGrayscale(p=0.2) and RandomRotation(degrees=360) from torchvision.
  3. Shuffled images X^{Jig} are obtained by performing a jigsaw puzzle on X^{aug}. The puzzle rule uses "permutations 100" in [5].
  4. Masked images X^{mask} are obtained by adding a mask matrix to X^{aug}; the values in the matrix are filled with the mean value. Examples of masked images are shown in Figure S16.
  5. Then, a batch of data is randomly selected from these three datasets and input into ResNet18 without the classification layer to extract 512-D latent features z^{aug}, z^{jig}, z^{mask}.
  6. Finally, these latent features are input into the sub-network of each task for further processing. (Augmented images are used for tasks 1, 3 and 4; shuffled images are used in tasks 2 and 3; masked images are used in task 5.)

The data flow of the forward propagation of ImageMol framework in pre-training.

Figure S17: Data augmentation techniques are first used to extract different augmentations of the original input images and further permutation and masking to obtain shuffled images and masked images, respectively. These images are then fed into ResNet18 to extract latent features. Finally, augmented images are used for task 1, task 3 and task 4. Shuffled images are used in task 2 and task 3. Masked images are used in task 5.


1. Consistency for pre-training

Considering that the semantic information of the same chemical structure in different images is consistent, the Multi-Granularity Chemical Clusters Classification (MG3C) task is proposed (Supplementary Figure 1), which enforces semantic consistency by predicting the chemical structure of the molecule.

  1. Briefly, multi-granularity clustering is first used to assign multiple clusters of different granularities to each chemical structural fingerprint.
  2. Then, each cluster is assigned as a pseudo-label to the corresponding molecule, so each molecule has multiple pseudo-labels of different granularities;
  3. Finally, the molecular encoder is employed to extract the latent features of the molecular images and a structural classifier is used to classify the pseudo-labels.

In detail:

  1. Specifically, we employed the MACCS keys as the descriptor of molecular fingerprints, a 166-length sequence composed of 0s and 1s. These fingerprint sequences serve as the basis for clustering: the closer two molecular fingerprints are, the more likely they are to be clustered together.
  2. Finally, we use K-means [46] with different K = 100, 1000, 10000 (see Supplementary Section A.2 and Supplementary Figure 15 on the selection of K) to cluster the molecules into granularities ranging from coarse-grained to fine-grained (a clustering sketch follows below).
  3. According to the clustering results, we assign three pseudo-labels to each molecular image, then apply ResNet18 [47] as the molecular encoder to extract latent features and a structural classifier to predict the pseudo-labels of the latent features.
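A minimal sketch of this pseudo-label generation under stated assumptions (scikit-learn's KMeans, and fps as an (N, 166) array of MACCS fingerprints):

import numpy as np
from sklearn.cluster import KMeans

def make_pseudo_labels(fps, ks=(100, 1000, 10000)):
    # One clustering per granularity; each cluster id becomes a pseudo-label,
    # so every molecule receives three pseudo-labels of different granularity.
    return {k: KMeans(n_clusters=k, n_init=10).fit_predict(fps) for k in ks}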


The structural classifier is multi-task, consisting of three parallel fully connected layers corresponding to the three clustering granularities; the layers have 100, 1000 and 10000 neurons, respectively. Formally, a molecular image and its three pseudo-labels are denoted by x_{n} \in \mathbb{R}^{224 \times 224 \times 3}, \mathcal{Y}_{n}^{100} \in \{0, 1, \ldots, 99\}, \mathcal{Y}_{n}^{1000} \in \{0, 1, \ldots, 999\} and \mathcal{Y}_{n}^{10000} \in \{0, 1, \ldots, 9999\}, and the cost function \mathcal{L}_{MG3C} of the multi-granularity chemical clusters classification task is:

\mathcal{L}_{MG3C} = \frac{1}{N} \sum_{n=1}^{N} \sum_{K \in \{100, 1000, 10000\}} \ell\left(w_{K} f_{\theta}\left(x_{n}\right), \mathcal{Y}_{n}^{K}\right)

where f_{\theta} and \theta refer to the mapping function and the corresponding parameters of the molecular encoder, respectively; w_{100}, w_{1000} and w_{10000} represent the parameters of the three fully connected classification layers in the structural classifier with 100, 1000 and 10000 neurons; and \ell is the multinomial logistic loss, i.e., the negative log-softmax function.


Supplementary Section A.2 Selection of K in K-means

In the clustering pseudo-label classification task, we set the K values to 100, 1000, and 10000. To determine the value of K in the K-means method, we first cluster the dataset with different K values, ranging from 1 to 14000, and calculate the sum of squared distances. Then, we plot a curve with K on the x-axis and the sum of squared distances on the y-axis. Finally, a knee point detection algorithm [2] is used to find the knee point of this curve. As shown in Figure S15, the dotted line indicates the K value corresponding to the "elbow" point. Obviously, the larger the K value, the more difficult the clustering pseudo-label classification task becomes for ImageMol. Therefore, we select two K values (K = 100 and 1000) to the left of the "elbow" point and one K value (K = 10000) to the right of it.
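A minimal sketch of this elbow detection, assuming the kneed package (an implementation of a knee point detection algorithm); the SSE values below are synthetic stand-ins for real K-means results:

import numpy as np
from kneed import KneeLocator

ks = np.arange(100, 14001, 100)
sse = 1e8 / ks + 5e3  # synthetic decreasing, convex sum-of-squared-distances curve
knee = KneeLocator(ks, sse, curve="convex", direction="decreasing").knee
print(knee)           # K value at the point of maximum curvature (the "elbow")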

Supplementary Figure S15

The “elbow” point of two datasets with respect to the number of clusters.

Figure S15: The x-axis represents the number of clusters, and the y-axis represents the sum of Euclidean distances between samples in the clusters. We find the "elbow", the point of maximum curvature of the curve, by using a knee point detection algorithm [2].

2. Relevance for pre-training

Based on the assumption that different augmentations (such as mask, shuffle) of the same image are related in the feature space, we use a pixel-level task to reconstruct molecular images from latent features and an image-level task to maximize the correlation between the original sample and the masked sample in that space.
Molecular image reconstruction (MIR). MIR reconstructs the latent features back into molecular images. We input the original molecular image x_{n} into the molecular encoder to obtain the latent feature f_{\theta}\left(x_{n}\right). To make the model learn the correlation between the molecular structures in the image, we shuffle and rearrange the input image x_{n} (as in Section Rationality for pre-training) in the hope that the correct image can be reconstructed.

Detailed structure of the generator and discriminator in MIR:

After that, we define a generator G and a discriminator D to reconstruct the latent features.

1. G is composed of four 2D deconvolution layers, each with a 2D batch normalization layer and a ReLU activation function, plus one deconvolution layer with a Tanh activation function.
2. The discriminator is composed of four 2D convolutional layers, each with a 2D batch normalization layer and a LeakyReLU activation function, plus one 2D convolutional layer with a Sigmoid activation function. For further details of the GAN model, see Supplementary Figure 5. Since it is difficult for the generator to reconstruct the latent features into 224 × 224 molecular images, we simplify the task to reconstructing them into 64 × 64 molecular images. The discriminator accepts 64 × 64 molecular images and distinguishes real from fake images.

  1. In detail, the generator is first used to reconstruct the latent feature f_{\theta}\left(x_{n}\right) into a 64 × 64 molecular image \tilde{x}_{n}^{64 \times 64}=G\left(f_{\theta}\left(x_{n}\right)\right).
  2. Then, we resize the original 224 × 224 molecular image x_{n} to a 64 × 64 molecular image x_{n}^{64 \times 64} and input it into the discriminator together with the molecular image generated by G, obtaining D\left(x_{n}^{64 \times 64}\right) and D\left(G\left(f_{\theta}\left(x_{n}\right)\right)\right).
  3. Finally, we update the parameters of the generator and the discriminator through their cost functions L_{G} and L_{D} respectively,

which are defined as:

\mathcal{L}_{G} = -\frac{1}{N} \sum_{n=1}^{N} D\left(G\left(f_{\theta}\left(x_{n}\right)\right)\right) + \frac{1}{N} \sum_{n=1}^{N}\left\|G\left(f_{\theta}\left(x_{n}\right)\right) - x_{n}^{64 \times 64}\right\|_{2}

\mathcal{L}_{D} = \frac{1}{N} \sum_{n=1}^{N}\left[D\left(G\left(f_{\theta}\left(x_{n}\right)\right)\right) - D\left(x_{n}^{64 \times 64}\right)\right]

For L_{G}, the first term represents the Wasserstein loss, and the second term represents the Euclidean distance between the generated image G\left(f_{\theta}\left(x_{n}\right)\right) and the corresponding real image x_{n}^{64 \times 64}. For L_{D}, we use this loss to approximate the Wasserstein distance between the distributions of the real images x_{n}^{64 \times 64} and the generated images. Finally, the molecular encoder model is updated by using the cost function L_{MIR} derived from the reconstruction objective above.

MASK-based contrastive learning (MCL).

Recently, the performance gap between unsupervised pre-training and supervised learning in computer vision has narrowed, notably owing to the achievements of contrastive learning methods [25, 26]. However, these methods typically rely on a large number of explicit pairwise feature comparisons, which is computationally challenging [48].

Furthermore, in order to maximize the feature extraction ability of the pre-training model, contrastive learning must select good feature pairs, which clearly incurs a huge cost in computing resources. Therefore, to save computing resources and mine the fine-grained information in molecular images, we introduce a simple contrastive learning method for molecular images, namely MASK-based contrastive learning (Supplementary Figure 4). We first use a 16 × 16 square area to randomly mask the molecular images (Supplementary Figure 16), denoted by \tilde{x}_{n}. Then, the masked molecular image \tilde{x}_{n} and the unmasked molecular image x_{n} are simultaneously input into the molecular encoder to extract latent features f_{\theta}\left(\tilde{x}_{n}\right), f_{\theta}\left(x_{n}\right). Finally, the cost function L_{MCL} is introduced to ensure consistency between the latent features extracted from the molecular image before and after masking, formalized as:

\mathcal{L}_{MCL} = \frac{1}{N} \sum_{n=1}^{N}\left\|f_{\theta}\left(\tilde{x}_{n}\right) - f_{\theta}\left(x_{n}\right)\right\|_{2}

where \left\|f_{\theta}\left(\tilde{x}_{n}\right) - f_{\theta}\left(x_{n}\right)\right\|_{2} denotes the Euclidean distance between f_{\theta}\left(\tilde{x}_{n}\right) and f_{\theta}\left(x_{n}\right).

Code implementation:

Distance computation for the Euclidean (L2) distance:

### Compute the MCL loss
hidden_feat_non_mask, _, _, _, _ = model(data_non_mask)  # features of the original (non-masked) images
hidden_feat_mask, _, _, _, _ = model(cl_data_mask)       # features of the masked images
constractive_loss = torch.autograd.Variable(torch.Tensor([0.0])).cuda()
if args.constractive_lambda != 0:
    # Mean Euclidean (L2) distance between the feature pairs
    constractive_loss = (hidden_feat_non_mask - hidden_feat_mask).pow(2).sum(axis=1).sqrt().mean()
AvgConstractiveLoss += constractive_loss.item() / len(train_dataloader)

3. Rationality for pre-training

Inspired by human understanding of the world, we proposed the rationality principle, which means that the structural information described by a molecular image must conform to chemical common sense. We rearrange the original images to construct irrational molecular images and design two pre-training tasks to predict them (Supplementary Figures 2 and 3), which effectively improves the model's understanding of molecular images.

3.1 Molecular rationality discrimination (MRD): details of the molecular rationality discriminator

People can easily tell what is rational from what is irrational, but this is hard for an AI model; hence the molecular rationality discriminator is constructed.

The reason people can easily judge whether the things in an image are reasonable, based on the knowledge they have learned, is that people are very good at summarizing the spatial structure information of an image scene. For example, given an image with the blue sky below the grass and an image with the blue sky above the grass, we can easily tell that the former is unreasonable and the latter is reasonable. However, it is difficult for an artificial intelligence model to attend to this global-level spatial structure information spontaneously during learning.

Motivated by these phenomena, we construct a rational and an irrational molecular image pair for each molecular image to guide the model to learn the structural information. Specifically, as shown in Supplementary Figure 2,

  1. we use a 3 × 3 grid to decompose each molecular image x_{n} into 9 patches and number each patch 1 to 9.
  2. Then, these patch numbers are randomly shuffled, and the patches are re-spliced according to the shuffled order to form an image with the same dimensions as the original.
  3. These disordered images are viewed as irrational samples \hat{x}_{n}.
  4. Subsequently, the original ordered image x_{n} and the shuffled image \hat{x}_{n} are forward-propagated through the molecular encoder to extract latent features f_{\theta}\left(x_{n}\right) and f_{\theta}\left(\hat{x}_{n}\right),
  5. and these features are further input into the rationality classifier to obtain the probability w_{MRD} f_{\theta}\left(x_{n}\right) that the sample is rational.

Here, we define the cost function of the molecular rationality discrimination task, L_{MRD}, to update ResNet18, formalized as:

\mathcal{L}_{MRD} = \frac{1}{N} \sum_{n=1}^{N}\left[\ell\left(w_{MRD} f_{\theta}\left(x_{n}\right), 1\right) + \ell\left(w_{MRD} f_{\theta}\left(\hat{x}_{n}\right), 0\right)\right]

In other words, a rational molecular image should receive the label 1 and an irrational one the label 0, which matches the code implementation below: criterion_matcher(out_cls_false, y_out_cls_false) + criterion_matcher(out_cls_true, y_out_cls_true).

The first term and the second term represent the binary classification loss of the rational image and the irrational image, respectively.
w_{MRD} represents the parameters of the rationality classifier, and \mathcal{Y}_{n}^{MRD} represents the real label, which is 0 (irrational) or 1 (rational).

Code implementation:

    #### Compute the MRD loss
    reasonability_loss = torch.autograd.Variable(torch.Tensor([0.0])).cuda()
    if args.matcher_lambda != 0:
        out_cls_false = matcher(hidden_feat)          # logits for the shuffled (irrational) images
        out_cls_true = matcher(hidden_feat_non_mask)  # logits for the original (rational) images
        # shuffled images (Jigsaw_label > 0) are labeled 0, ordered images are labeled 1
        y_out_cls_false = torch.from_numpy(np.where(Jigsaw_label.numpy().copy() > 0, 0, 1)).cuda().long()
        y_out_cls_true = torch.from_numpy(np.ones(out_cls_true.shape[0])).cuda().long()
        reasonability_loss = criterion_matcher(out_cls_false, y_out_cls_false) + criterion_matcher(out_cls_true, y_out_cls_true)
    AvgReasonabilityLoss += reasonability_loss.item() / len(train_dataloader)

3.2、Jigsaw puzzle prediction (JPP).

JPP, commonly used in computer vision, provides a more fine-grained prediction for discovering the invariance and regularity of molecular images.
Solving jigsaw puzzles on the same molecular images helps the model attend to more global structural information and learn the concept of spatial rationality, improving the generalization of the pre-trained model.
Each permutation of patch numbers is assigned an index (ranging from 0 to 99), which is used as the classification label of the molecular image.

Compared with MRD, JPP provides a more fine-grained prediction to discover the invariance and regularity of molecular images (Supplementary Figure 3) and is widely used in computer vision [49]. Solving a jigsaw puzzle on the same molecular images can help the model pay attention to more global structural information and learn the concept of spatial rationality, improving the generalization of the pre-trained model.

In this task, by using the maximal Hamming distance algorithm in [50], we assign an index (ranging from 0 to 99) to each permutation of patch numbers, which will be used as the classification label \mathcal{Y}_{n}^{J i g} of the molecular image. Similar to the MRD task, the original ordered image x_{n} and the shuffled image \hat{x}_{n} are forward propagated through the molecular encoder to extract latent features f_{\theta}\left(x_{n}\right) and f_{\theta}\left(\hat{x}_{n}\right). Then, an additional jigsaw classifier is introduced to classify the permutation to which the image belongs. The molecular encoder is updated by using the cost function L_{JPP}, which is formalized as:

L_{J P P}=\frac{1}{N} \sum_{n=1}^{N}\left[\ell\left(w_{J i g}\left(f_{\theta}\left(x_{n}\right)\right), \mathcal{Y}_{0}^{J i g}\right)+\ell\left(w_{J i g}\left(f_{\theta}\left(\hat{x}_{n}\right)\right), \mathcal{Y}_{n}^{J i g}\right)\right]

Where the first term and the second term represent the classification loss of the original ordered image and the shuffled image, respectively (\ell denotes the cross-entropy loss, and \mathcal{Y}_{0}^{J i g} is the index of the identity, i.e. unshuffled, permutation). w_{Jig} represents the parameters of the jigsaw classifier.
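For reference, here is a minimal sketch of building the 100-permutation set with a greedy maximal-Hamming-distance search. It follows the general recipe of [50] rather than the paper's exact code; the sampling-based candidate pool is a simplification, and placing the identity permutation at index 0 matches the labeling convention seen in the MRD snippet above (Jigsaw_label > 0 marks shuffled images), though that detail is an inference:

    import random

    def hamming(p, q):
        # number of positions at which two permutations differ
        return sum(a != b for a, b in zip(p, q))

    def select_permutations(n_patches=9, n_perms=100, pool_size=2000, seed=0):
        rng = random.Random(seed)
        identity = tuple(range(n_patches))
        # candidate pool of random permutations ([50] searches all 9! instead)
        pool = {tuple(rng.sample(range(n_patches), n_patches)) for _ in range(pool_size)}
        pool.discard(identity)
        chosen = [identity]  # index 0: the unshuffled (identity) permutation
        while len(chosen) < n_perms:
            # greedily add the candidate whose minimum Hamming distance
            # to everything chosen so far is largest
            best = max(pool, key=lambda p: min(hamming(p, c) for c in chosen))
            chosen.append(best)
            pool.discard(best)
        return chosen  # position in this list = JPP classification label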

4、Pre-training process

We used two large-scale datasets (ZINC and ChEMBL) for unsupervised pre-training; pre-training ImageMol consists of two steps, namely data augmentations and the training process.

In pre-training, we used two large-scale datasets (ZINC and ChEMBL) for unsupervised pre-training. ZINC is a dataset containing 8 million unlabeled molecules sampled from the ZINC15 database, and ChEMBL is a smaller dataset containing ~0.43 million unlabeled molecules. The two datasets have been preprocessed and are publicly available online [23]. Overall, the pre-training of ImageMol consists of two steps, which are data augmentations and training process, respectively. A detailed pre-training data flow can be found in Supplementary Figure 17 and Supplementary Section B.2.

Figure S17: Data augmentation techniques are first used to produce different augmentations of the original input images, which are further permuted and masked to obtain shuffled images and masked images, respectively. These images are then fed into ResNet18 to extract latent features. Finally, augmented images are used for task 1, task 3 and task 4; shuffled images are used in task 2 and task 3; masked images are used in task 5.

4.1、Data augmentations.

Because molecular images are relatively sparse, "random cropping" cannot be used. Instead, each image is horizontally flipped with 50% probability, converted to grayscale with 20% probability, and rotated (by 0°–360°) with 100% probability.

    def load_norm_transform():
        normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        # RandomHorizontalFlip defaults to p=0.5, so p is not passed explicitly
        img_tra = [transforms.CenterCrop(args.imageSize),
                   transforms.RandomHorizontalFlip(),
                   transforms.RandomGrayscale(p=0.2),
                   transforms.RandomRotation(degrees=360)]
        tile_tra = [transforms.RandomHorizontalFlip(),
                    transforms.RandomGrayscale(p=0.2),
                    transforms.RandomRotation(degrees=360),
                    transforms.ToTensor()]
        return normalize, img_tra, tile_tra

Data augmentation is a simple way to effectively augment a limited number of samples and significantly improve the generalization ability and robustness of the model; it has been widely used in supervised and unsupervised representation learning.
However, unlike ordinary images, molecular images are relatively sparse, as they are filled mostly (>90%) with zeros, so the "usable" data is limited to a very small fraction of the image [29]. In view of this limitation, "random cropping" is not applied in our model. Finally, three augmentations are selected in the pre-training stage: RandomHorizontalFlip, RandomGrayscale and RandomRotation. Hence, before the original images are input into our pre-training model, each image has a 50% probability of being horizontally flipped, a 20% probability of being converted to grayscale, and a 100% probability of being rotated between 0° and 360°. The augmentations are provided by PyTorch (https://pytorch.org/).
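For instance, the three augmentations with the stated probabilities can be composed and applied to each image before it enters the encoder (a usage sketch; how the released code wires Compose together is an assumption here):

    from PIL import Image
    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(),       # 50% horizontal flip (default p=0.5)
        transforms.RandomGrayscale(p=0.2),       # 20% grayscale conversion
        transforms.RandomRotation(degrees=360),  # always rotated; angle drawn from [-360, 360]
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    img = Image.open("molecule.png").convert("RGB")  # a 224 x 224 molecular image
    x = train_transform(img)                         # augmented tensor fed to ResNet18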

4.2、Training process.

==> The augmented molecular images x_{n}
==> are input into the molecular encoder (ResNet18)
==> to extract latent features f_{\theta}\left(x_{n}\right);
==> these f_{\theta}\left(x_{n}\right) are used by the five pretext tasks to compute the total cost function L_{ALL},
==> and L_{ALL} is backpropagated to update ResNet18.

Here, we used ResNet18 as our molecular encoder. After using data augmentations to obtain molecular images x_{n}, we forward these molecular images x_{n} to the ResNet18 model to extract latent features f_{\theta}\left(x_{n}\right). Then, these latent features are used by five pretext tasks to calculate the total cost function L_{ALL}, which is defined as a weighted sum of the five pretext-task losses:

L_{A L L}=\sum_{t=1}^{5} \lambda_{t} L_{t}

where L_{t} is the cost function of the t-th pretext task (clustering pseudo-label classification, jigsaw puzzle prediction, molecular rationality discrimination, MASK-based contrastive learning and image reconstruction) and \lambda_{t} is its weight (cf. constractive_lambda and matcher_lambda in the snippets above).

Finally, the total loss function L_{ALL} is used for backpropagation to update ResNet18. Specifically, the cost function L_{ALL} is minimized using mini-batch stochastic gradient descent (SGD). See Supplementary Section A.3 and Supplementary Table 14 for more detailed hyperparameter settings, and Supplementary Section C.1 and Supplementary Figure 18 for the loss record during pre-training.
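A minimal sketch of one pre-training update under this weighted-sum formulation (the weight names and values are illustrative; only constractive_lambda and matcher_lambda appear in the snippets quoted earlier):

    import torch
    import torchvision

    model = torchvision.models.resnet18()  # the molecular encoder
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # mini-batch SGD

    def training_step(losses, weights):
        # losses:  dict mapping pretext-task name -> scalar loss tensor
        # weights: dict mapping pretext-task name -> float lambda_t
        total = sum(weights[name] * loss for name, loss in losses.items())  # L_ALL
        optimizer.zero_grad()
        total.backward()   # backpropagate L_ALL
        optimizer.step()   # update ResNet18
        return total.item()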

C.1 Results on the pre-training

Pre-training details for ImageMol.

Figure S18: The x-axis and y-axis represent epoch number and loss value respectively. For simplicity, clustering pseudo-label classification task, jigsaw puzzle prediction task, molecular rationality discrimination task and MASK-based contrastive learning are simplified to pretext task1, pretext task2, pretext task3 and pretext task4 in this group of figures.

Figure S18 shows the details of how the ImageMol losses change during pre-training. We did not show the training details of the image reconstruction task because its loss is adversarial. In general, the loss of ImageMol on the remaining four pretext tasks shows a decreasing trend and gradually converges, which indicates that ImageMol can learn different information about molecular images from these pretext tasks.

5、Fine-tuning

After pre-training, we fine-tune ResNet18 on downstream tasks. Fine-tuning is not the focus of this paper, so only a simple and common scheme is used to adapt the model to different downstream tasks: a fully connected layer g^{ft} is appended to the pre-trained ResNet18, and the output dimension of g^{ft} equals the number of classes of the downstream task.

==> We first feed the downstream molecular image into ResNet18 to obtain the latent feature representation f_{\theta}\left(x_{n}^{g t}\right);
==> then the latent feature representation f_{\theta}\left(x_{n}^{g t}\right) is passed through the fully connected layer g^{ft} to obtain the class logits g^{f t}\left(f_{\theta}\left(x_{n}^{g t}\right)\right);
==> the logits are normalized with the softmax activation function to obtain the predicted class probabilities \tilde{\mathcal{Y}}_{n}^{g t}=\operatorname{softmax}\left(g^{f t}\left(f_{\theta}\left(x_{n}^{g t}\right)\right)\right);
==> finally, the model is fine-tuned by the cross-entropy loss between the predicted probabilities \tilde{\mathcal{Y}}_{n}^{g t} and the true labels \mathcal{Y}_{n}^{g t}.

After completing the pre-training, we fine-tune the pre-trained ResNet18 on the downstream task. Clearly, the performance of the model could be further improved by designing a more complex fine-tuning scheme for the pre-trained model. However, fine-tuning is not the research focus of this paper, so we only use a simple and common fine-tuning method to adapt the model to different downstream tasks.

In detail, we only add an additional fully connected layer g^{ft} after the ResNet18, and the output dimension of the fully connected layer is equal to the number of classes of the downstream task. In fine-tuning, we first input the molecular image from the downstream task into ResNet18 to obtain the latent feature representation f_{\theta}\left(x_{n}^{g t}\right). Then, we forward the latent feature representation to the fully connected layer g^{ft} to obtain the class logits g^{f t}\left(f_{\theta}\left(x_{n}^{g t}\right)\right) and use the softmax activation function to normalize these logits into the predicted class probabilities \tilde{\mathcal{Y}}_{n}^{g t}=\operatorname{softmax}\left(g^{f t}\left(f_{\theta}\left(x_{n}^{g t}\right)\right)\right). Finally, our model is fine-tuned by calculating the cross-entropy loss between the class probabilities \tilde{\mathcal{Y}}_{n}^{g t} and the true label \mathcal{Y}_{n}^{g t}. In particular, since the data in downstream tasks suffers from class imbalance, we also add class weights to the cross-entropy loss, which is formalized as:

L_{C E}=-\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_{k} \mathcal{Y}_{i, k}^{g t} \log \left(\tilde{\mathcal{Y}}_{i, k}^{g t}\right)
Where N and K respectively represent the number of samples and the number of classes in the downstream task. \lambda_{k} represents the class weight, computed as 1-\frac{N_{k}}{N} (N_{k} is the number of samples of class k). \mathcal{Y}_{i, k}^{g t} and \tilde{\mathcal{Y}}_{i, k}^{g t} represent the true label and the predicted probability of the i-th sample on the k-th class. Finally, the loss function L_{CE} is used for backpropagation to update the parameters of the model. More detailed hyperparameter settings can be found in Supplementary Section A.3 and Supplementary Table 14.
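A minimal sketch of this fine-tuning setup (dataset plumbing is omitted; the class-weight formula follows \lambda_{k}=1-N_{k}/N from the text, while the class counts and layer wiring are illustrative):

    import torch
    import torch.nn as nn
    import torchvision

    num_classes = 2
    encoder = torchvision.models.resnet18()
    encoder.fc = nn.Identity()              # expose the 512-d latent feature f_theta(x)
    g_ft = nn.Linear(512, num_classes)      # the added fully connected layer g^{ft}

    # class weights lambda_k = 1 - N_k / N to counter class imbalance
    class_counts = torch.tensor([900.0, 100.0])  # illustrative N_k per class
    lam = 1.0 - class_counts / class_counts.sum()
    criterion = nn.CrossEntropyLoss(weight=lam)  # softmax + weighted cross-entropy

    optimizer = torch.optim.SGD(list(encoder.parameters()) + list(g_ft.parameters()), lr=0.01)

    def finetune_step(images, labels):
        logits = g_ft(encoder(images))      # g^{ft}(f_theta(x)) -> class logits
        loss = criterion(logits, labels)    # weighted cross-entropy L_CE
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()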

Downstream task details

To evaluate the pre-trained model, we designed three types of downstream tasks related to molecular representation learning: molecular property prediction, drug metabolism prediction and anti-viral activity prediction.

To evaluate our proposed pre-training model, we designed three types of downstream tasks related to molecular representation learning for testing, which are molecular property prediction, drug metabolism prediction and anti-viral activities prediction, respectively.

1、Molecular property prediction(分子性质预测)

Dataset. 

MoleculeNet [51] is a popular benchmark for molecular property prediction. Here, we used five binary classification datasets (Tox21, ClinTox, BBBP, HIV, and BACE) from MoleculeNet to evaluate our ImageMol. See Supplementary Table 1 for details. Among these five datasets, Tox21 is a multi-task dataset with 12 binary classification tasks and 7831 samples. ClinTox has two binary classification tasks and 1478 samples. The three remaining datasets (BBBP, HIV, and BACE) are single binary classification tasks with 2039, 41127 and 1513 samples, respectively.

Comparison method.

For a comprehensive comparison, we selected several different types of popular methods: the SMILES sequence-based pre-training methods (ChemBERTa [21], SMILES Transformer [22], RNNS2S [32] and Mol2Vec [33]), the graph-based pre-training methods (Jure's GNN [23] and N-GRAM [34]) and the molecular image-based method (Chemception [29]). These recently proposed methods show competitive results and superior performance on the molecular property prediction task, so we selected them as representative baselines. Among the sequence-based pre-training methods, ChemBERTa is based on RoBERTa [52] with 12 attention heads and 6 layers, and is pre-trained on 77M unique SMILES sequences from PubChem [53]; the SMILES Transformer builds an encoder-decoder network with 4 transformer [54] blocks, pre-trained with 861,000 unlabeled SMILES sequences randomly sampled from ChEMBL24 [27]; RNNS2S is designed around sequence-to-sequence learning with GRU [55] cells and an attention mechanism, pre-trained using 334,092 valid molecular SMILES sequences from the LogP and PM2-full datasets; Mol2Vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures, pre-trained on 19.9 million compounds. Among the graph-based pre-training methods, Jure's GNN is a cutting-edge node-level self-supervised pre-training method, which first transforms the 2M SMILES sequences sampled from the ZINC15 database [28] into graph structures and uses different pre-training strategies to train Graph Isomorphism Networks (GINs) [56]; the N-GRAM method introduces the N-gram graph and learns a compact representation for each graph in pre-training. Among molecular image-based methods, Chemception [29] has a well-designed CNN architecture focused on molecular property prediction. To quantitatively compare ImageMol with these methods, the ROC-AUC score is calculated as the evaluation metric.

Experimental setting.

Because data splits differ between methods, for a fair comparison we used several different data split schemes to comprehensively evaluate our ImageMol. To compare fairly with RNNS2S [32] and SMILES Transformer [22], we split the original dataset into a training set (80%), validation set (10%) and test set (10%) with a stratified split. To evaluate the stability of our results, we repeat the experiment 20 times with different random seeds and report the mean and variance of these results as the final result. Compared with a stratified split, the scaffold split is a more challenging and realistic evaluation setting because molecular scaffolds do not overlap between the training and test sets. Therefore, we follow the experimental setup of Jure's GNN [23] and use a scaffold split to divide the experimental datasets into a training set (80%), validation set (10%) and test set (10%). The final performance is reported as the mean and variance of the experimental results over 5 different random seeds. In addition, to compare with Chemception [29], we use exactly the same experimental configuration as Chemception, which uses a stratified split into a 4/6 training set, 1/6 validation set and 1/6 test set.
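A minimal sketch of a scaffold split in the spirit described above, using RDKit's Bemis-Murcko scaffolds (the grouping and the 80/10/10 fill order follow the common recipe, not code from the paper):

    from collections import defaultdict
    from rdkit.Chem.Scaffolds import MurckoScaffold

    def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
        # group molecule indices by their Bemis-Murcko scaffold SMILES
        groups = defaultdict(list)
        for idx, smi in enumerate(smiles_list):
            scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
            groups[scaffold].append(idx)
        # fill sets with whole scaffold groups (largest first), so no scaffold
        # is shared between training, validation and test sets
        ordered = sorted(groups.values(), key=len, reverse=True)
        n = len(smiles_list)
        train, valid, test = [], [], []
        for group in ordered:
            if len(train) + len(group) <= frac_train * n:
                train.extend(group)
            elif len(valid) + len(group) <= frac_valid * n:
                valid.extend(group)
            else:
                test.extend(group)
        return train, valid, test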

2、Drug metabolism prediction(药物代谢预测)

Dataset.

In drug discovery, classifying cytochrome P450 inhibitors and non-inhibitors is important for predicting the tendency of molecules to cause significant drug interactions by inhibiting CYP and for determining which subtypes are affected. In this task, we use PubChem Data Set I (training set) and PubChem Data Set II (validation set) from [35] to evaluate the performance of the proposed ImageMol on human cytochrome P450 (CYP) inhibition. PubChem Data Sets I and II are binary classification datasets; both cover the 1A2, 2C9, 2C19, 2D6 and 3A4 isoforms.

Comparison method.

We compare the proposed ImageMol with two recent molecular image-based methods (ADMET-CNN [30] and QSAR-CNN [31]) using the ROC-AUC metric to confirm the superiority of our method on molecular images, and with molecular fingerprint-based methods (MACCS-based and FP4-based methods [35]) using accuracy and ROC-AUC metrics to validate that our method can learn more information from molecular images than from molecular fingerprints. Among the molecular image-based methods, ADMET-CNN established a molecular 2-D image-based CNN model and achieved good performance in predicting ADMET properties (including CYP1A2 inhibitory potency, P-gp inhibitory activity, etc.); QSAR-CNN applied transfer learning and data augmentation to train a molecular image-based DenseNet121 [57] model for developing quantitative structure-activity relationships (QSARs) to predict compound rate constants toward OH radicals. Among the molecular fingerprint-based methods, two types are used in the comparison: traditional machine learning methods (SVM, C4.5 DT, k-NN and NB) and ensemble learning methods (CC-I, CC-II, etc.). In this task, accuracy and ROC-AUC are calculated for comparison.

Experimental setting.

For fairness, we keep the experimental settings consistent with these methods. We use 5-fold cross-validation on PubChem Data Set I to evaluate the performance of our ImageMol, ADMET-CNN and QSAR-CNN. In addition, we use the model trained on PubChem Data Set I to evaluate the performance of all models mentioned in this task on the external validation set, PubChem Data Set II.

3、Anti-SARS-CoV-2 activities prediction(抗病毒活性预测)

Dataset.

Anti-viral activity prediction is vital for developing new drugs to treat COVID-19. We therefore use anti-SARS-CoV-2 activity prediction as our task to prioritize compounds for in vitro screening. The experimental datasets are obtained from the COVID-19 portal [20] of the National Center for Advancing Translational Sciences (NCATS) and include 13 assays, such as Spike-ACE2 protein-protein interaction (AlphaLISA), Spike-ACE2 protein-protein interaction (TruHit Counterscreen), ACE2 enzymatic activity, etc. These 13 assays represent five distinct categories: viral entry, viral replication, live virus infectivity, counterscreen and in vitro infectivity. Because the original datasets are extremely imbalanced (the proportion of positive samples ranges from 0.7% to 7.3%), we filter out samples without an AC50 value to generate our datasets, and label samples with AC50 greater than 10 as non-inhibitors and samples with AC50 less than 10 as inhibitors. An overview of the processed datasets is given in Supplementary Table 6.
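A minimal sketch of this filtering and labeling step (assuming the assay data sits in a pandas DataFrame with an "AC50" column; the column names are illustrative, and the text does not specify how AC50 exactly equal to 10 is handled):

    import pandas as pd

    def build_inhibition_labels(df):
        # drop samples without an AC50 value
        df = df.dropna(subset=["AC50"]).copy()
        # AC50 < 10 -> inhibitor (1); AC50 > 10 -> non-inhibitor (0)
        df["label"] = (df["AC50"] < 10).astype(int)
        return df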

Comparison method.

We chose two representative methods for experimental comparison: Jure's GNN [23] and REDIAL-2020 [20]. Jure's GNN is a pre-training method based on graphs and graph neural networks (GNNs), which uses the molecular graph as input to the GNN and introduces a series of pre-training strategies to obtain better molecular embeddings. REDIAL-2020 is a suite of computational models based on manual features, which extracts a total of 22 features of three different types (19 fingerprint-based, 1 pharmacophore-based and 2 physicochemical-descriptor-based) to train machine learning models from the scikit-learn package. In this task, we used a total of 6 evaluation metrics: accuracy, sensitivity, precision, ROC-AUC, AUPR and F1.

Experimental setting.

To compare our ImageMol with Jure's GNN, we reproduced Jure's GNN using the public source code they provide to extract molecular features and added a fully connected layer for fine-tuning on downstream tasks. We uniformly split these datasets into an 80% training set and a 20% test set, and report the AUC and AUPR results on the test set. We also compared our method with REDIAL-2020. To compare fairly with REDIAL-2020, we use the same experimental configuration as REDIAL-2020; see [20] for the detailed experimental setting. Note that REDIAL-2020 provides a new data preprocessing method and divides the data into training, validation and test sets, so we directly use these divided datasets in our evaluation process (Supplementary Table 13). For the experimental results, we use the model that achieves the best performance on the validation set to evaluate the test set. Finally, accuracy, F1, sensitivity, precision and ROC-AUC metrics are reported in the experiment.

Data availability

The datasets used in this project can be found at the following links:
blood-brain barrier penetration (BBBP): http://deepchem.io.s3-website-us-west-1.amazonaws.com/datasets/BBBP.csv,
beta-secretase (BACE): http://deepchem.io.s3-website-us-west-1.amazonaws.com/datasets/bace.csv,
human immunodeficiency virus (HIV): http://deepchem.io.s3-website-us-west-1.amazonaws.com/datasets/hiv.csv (link currently returns 404 Not Found),
molecular toxicity in the 21st century (Tox21): http://deepchem.io.s3-website-us-west-1.amazonaws.com/datasets/tox21.csv.gz,
clinical trial toxicity (ClinTox): http://deepchem.io.s3-website-us-west-1.amazonaws.com/datasets/clintox.csv.gz,
13 SARS-CoV-2 targets: https://opendata.ncats.nih.gov/covid19/assays (the corresponding datasets are listed in Supplementary Table 6),
5 drug metabolism enzymes: https://pubs.acs.org/doi/abs/10.1021/ci200028n,
approved drugs in DrugBank: https://go.drugbank.com/releases/5-1-9/downloads/approved-drug-links,
122 drugs that block SARS-CoV-2: https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-022-04482-x/MediaObjects/41586_2022_4482_MOESM1_ESM.pdf.
