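Preface

This post follows the paper "Malware Detection by Eating a Whole EXE" and uses the MalConv model it proposes, implemented in TensorFlow, to classify malware.

The original model detects whether a file is malware or not; this post modifies the model (in fact, barely at all) so that it classifies malware into families.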

Dataset

The dataset I use is the Kaggle Microsoft Malware Classification Challenge. It contains nine classes of malware.

Introduction to the MalConv model

The model in the paper

Section 4 of the paper, "Model Architecture", describes the MalConv model. Below is an excerpt from the paper about the model:

To best capture such high level location invariance, we choose to use a convolution network architecture. Combining the convolutional activations with a global max pooling before going to fully connected layers allows our model to produce its activation regardless of the location of the detected features. Rather than perform convolutions on the raw byte values (i.e., using a scaled version of a byte’s value from 0 to 255), we use an embedding layer to map each byte to a fixed length (but learned) feature vector. We avoid the raw byte value as it implies an interpretation that certain byte values are intrinsically “closer” to each-other than other byte values, which we know a priori to be false, as byte value meaning is dependent on context. Training the embedding jointly with the convolution allows even our shallow network to activate for a wider breadth of input patterns. This also gives it a degree of robustness in the face of minor alterations in byte values. Prior work using byte n-grams lack this quality, as they are dependent on exact byte matches (Kolter and Maloof 2006; Raff et al. 2016).
We note a number of difficult design choices that had to be made in developing a neural network architecture for such long input sequences. One of the primary limitations in practice was GPU memory consumption in the first convolution layer. Regardless of convolution size, storing the activations after the first convolution for forward propagation can easily lead to out-of-memory errors during back-propagation. We chose to use large convolutional filters and strides to control the memory used by activations in these early layers.
Attempts to build deep architectures on such long sequences requires aggressive pooling between layers for our data, which results in lopsided memory use. This makes model parallelism in frameworks like Tensorflow difficult to achieve. Instead we chose to create a shallow architecture with a large filter width of 500 bytes combined with an aggressive stride of 500. This allowed us to better balance computational workload in a data-parallel manner using PyTorch (Paszke, Gross, and Chintala 2016). Our convolutional architecture uses the gated convolution approach following Dauphin et al. (2016), with 128 filters.
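To make the memory argument in the excerpt concrete, here is a rough back-of-the-envelope calculation. It is only a sketch under assumptions that are mine, not the paper's: an input length of 2,000,000 (the limit used later in this post), an 8-dimensional embedding, float32 activations, and a single sample. Only the filter width of 500, the stride of 500 and the 128 filters come from the excerpt.

```python
def conv_output_length(seq_len, kernel, stride):
    """Number of output positions of a 1-D convolution without padding."""
    return (seq_len - kernel) // stride + 1


SEQ_LEN = 2000000        # assumption: input length used in this post
EMBED_DIM = 8            # assumption: embedding size used in this post
FILTERS = 128            # from the excerpt
BYTES_PER_FLOAT = 4      # assumption: float32 activations

# Size of the embedded input for one sample
embedding_bytes = SEQ_LEN * EMBED_DIM * BYTES_PER_FLOAT

# Output positions with the large stride versus a stride of 1
strided = conv_output_length(SEQ_LEN, kernel=500, stride=500)
dense = conv_output_length(SEQ_LEN, kernel=500, stride=1)

# Activation size of ONE convolution branch for one sample
strided_bytes = strided * FILTERS * BYTES_PER_FLOAT
dense_bytes = dense * FILTERS * BYTES_PER_FLOAT

print("output positions:", strided, "vs", dense)
print("embedding bytes:", embedding_bytes)
print("conv activation bytes:", strided_bytes, "vs", dense_bytes)
```

Under these assumptions one convolution branch with stride 1 would store roughly 1 GB of activations per sample, against roughly 2 MB with stride 500, which is the trade-off the excerpt describes.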

[Figure: the complete architecture diagram of the MalConv model, from the paper]

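My own (naive) understanding

  1. Because the model has to "eat" every byte of the malware, it inevitably uses a lot of GPU memory and cannot handle very large files, so we train only on files smaller than 2M.
  2. Read the file's byte stream and convert it to an integer array; if it is shorter than 2M (2,000,000 values), pad the tail with 0.
  3. Apply an 8-dimensional embedding to the array to get E.
  4. Split E into two 4-dimensional halves, A and B.
  5. Apply a one-dimensional convolution to A and to B separately, with kernel size = 500, stride = 500, filters = 128.
  6. Pass B through the sigmoid function and multiply the result by A to get G.
  7. Max-pool G to get P.
  8. Feed P into the fully connected layer.
  9. Finally, a softmax classifier outputs the result.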
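Steps 3 to 7 above can be sketched in plain NumPy for a single file. This is a toy illustration, not the TensorFlow code of this post: the function name `gated_conv_global_max` and the tiny sizes in the demo are made up, and the fully connected and softmax layers (steps 8 and 9) are left out. It relies on the fact that when the stride equals the kernel size the windows do not overlap, so the convolution reduces to a reshape followed by a matrix product.

```python
import numpy as np


def gated_conv_global_max(byte_ids, embedding, w_a, b_a, w_b, b_b, kernel):
    """Embedding, split, two strided convolutions, gating and global max pooling.

    w_a and w_b have shape (kernel, in_channels, filters); stride == kernel.
    """
    e = embedding[byte_ids]                   # step 3: (length, 8)
    half = e.shape[1] // 2
    a_in, b_in = e[:, :half], e[:, half:]     # step 4: two (length, 4) halves

    def conv(x, w, b):
        # Non-overlapping windows: flatten each window and multiply by the
        # flattened kernel (any trailing partial window is dropped).
        windows = x.shape[0] // kernel
        flat = x[:windows * kernel].reshape(windows, kernel * x.shape[1])
        return flat @ w.reshape(kernel * x.shape[1], -1) + b

    a = conv(a_in, w_a, b_a)                  # step 5
    b = conv(b_in, w_b, b_b)
    g = a * (1.0 / (1.0 + np.exp(-b)))        # step 6: A * sigmoid(B)
    return g.max(axis=0)                      # step 7: one value per filter


# Toy sizes (hypothetical; the post uses length 2,000,000, kernel 500, 128 filters)
rng = np.random.RandomState(0)
length, kernel, filters = 20, 5, 3
byte_ids = rng.randint(0, 256, size=length)
embedding = rng.normal(size=(256, 8))
w_a = rng.normal(size=(kernel, 4, filters))
w_b = rng.normal(size=(kernel, 4, filters))
b_a = np.zeros(filters)
b_b = np.zeros(filters)

p = gated_conv_global_max(byte_ids, embedding, w_a, b_a, w_b, b_b, kernel)
print(p.shape)
```

Whatever the file length, the output has one entry per filter, which is why the fully connected layer after the pooling can have a fixed input size.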

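My rubbish code

I am very much a beginner and the code quality is not high; all criticism is welcome.

Environment

tensorflow-gpu 1.14.0

Data processing

Because I wanted to run the training on a server, I store the data as TFRecord files.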

import os

import pandas as pd
import tensorflow as tf

source = 'D:/TEMP/data/train/'
label_Path = 'D:/TEMP/data/trainLabels.csv'
file_size = 2000000


def string_to_hexsarray(text):
    # Keep only two-character hex tokens; "??" and any other token are dropped
    return [s for s in text.split() if len(s) == 2 and s != "??"]


def read_file(entry):
    with open(entry, 'r') as f:
        return f.read()


def hexarray_to_bytes(hexarray):
    # Pad with '100' up to file_size tokens (longer inputs are left unchanged)
    hexarray = hexarray + ['100'] * (file_size - len(hexarray))
    return " ".join(hexarray)


def string_to_bytes(text):
    return hexarray_to_bytes(string_to_hexsarray(text))


def open_files_and_get_label(dir_path, label_path, length=2000, size=2000000, end="bytes"):
    """Collect the files that meet the requirements, together with their labels."""
    files = []
    labels = []
    df = pd.read_csv(label_path)
    with os.scandir(dir_path) as entries:
        for entry in entries:
            if entry.name.endswith(end) and os.path.getsize(entry) <= size:
                files.append(entry)
                file_name = entry.name.split('.')[0]
                print(file_name)
                index = df[df['Id'] == file_name].index.tolist()[0]
                labels.append(int(df.iat[index, 1]))
            if len(files) == length:
                break
    return files, labels


def produce_data(source_path, label_path, length=2000, size=2000000, end="bytes"):
    """Build the data set.

    source_path  directory of the data set
    label_path   path of the csv file that stores the labels
    length       number of samples, 2000 by default
    size         maximum file size
    end          file name suffix
    """
    data, label = open_files_and_get_label(source_path, label_path, length, size, end)
    data = list(map(read_file, data))
    data = list(map(string_to_bytes, data))
    # Note: the file content is the dict key, so identical files collapse into one entry
    data = dict(zip(data, label))
    return data


def produce_TFRecord(data, filename):
    """Write a TFRecord file.

    data      the data returned by produce_data
    filename  path of the TFRecord file to create
    """
    print("write tfrecord")
    writer = tf.python_io.TFRecordWriter(filename)
    for train, label in data.items():
        train = bytes(train, encoding="utf8")
        example = tf.train.Example(features=tf.train.Features(feature={
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            'train': tf.train.Feature(bytes_list=tf.train.BytesList(value=[train]))
        }))
        # Serialize the example to a string and write it out
        writer.write(example.SerializeToString())
    writer.close()


def _parse_function(example):
    features = {
        'label': tf.FixedLenFeature([], tf.int64),
        'train': tf.FixedLenFeature([], tf.string)
    }
    parsed_features = tf.parse_single_example(example, features)
    # Perform additional preprocessing on the parsed data.
    label = tf.cast(parsed_features["label"], tf.int32)
    train = tf.cast(parsed_features["train"], tf.string)
    return train, label


def string_to_int(s):
    # Each token is parsed as hexadecimal, so the padding token '100' becomes 256
    return [int(i, 16) for i in str(s, encoding='utf-8').split()]


# Example
if __name__ == "__main__":
    data = produce_data(source, label_Path, length=1000)
    produce_TFRecord(data, 'D:/TEMP/1000_test.tfrecord')

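The .bytes files contain a large number of "??" entries. These stand for memory that is only allocated when the program runs and does not exist in the static file, so I discard every "??".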
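As a small illustration of this filtering, the sketch below parses a single hex-dump line. The sample line is invented, and I am assuming the usual layout of an address column followed by two-character hex tokens; `parse_bytes_line` is a hypothetical helper, not part of the code above, although it applies the same rule (keep tokens of length 2 that are not "??").

```python
def parse_bytes_line(line):
    """Return the byte values on one hex-dump line, skipping the address and '??'."""
    values = []
    for token in line.split():
        # The address column is longer than two characters, so it is skipped too
        if len(token) == 2 and token != "??":
            values.append(int(token, 16))
    return values


sample = "00401000 4D 5A 90 00 ?? ?? FF 00"   # made-up example line
print(parse_bytes_line(sample))   # [77, 90, 144, 0, 255, 0]
```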

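Earlier I said "if it is shorter than 2M, pad the tail with 0", but the code does not actually do that. It pads with a different token, "100". Since the tokens are parsed as hexadecimal, "100" becomes 256, a value that never occurs in the file (file bytes take values in [0x00, 0xff]). The model trained this way performs better.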
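A minimal sketch of this padding scheme, using tiny made-up inputs. `pad_tokens` and `tokens_to_ids` are hypothetical helpers written for this illustration; they mirror the idea of the code above rather than reproduce it.

```python
PAD_TOKEN = "100"   # the padding string used in this post


def pad_tokens(tokens, target_len, pad=PAD_TOKEN):
    """Right-pad a list of hex tokens to target_len (longer lists are not truncated)."""
    return tokens + [pad] * max(0, target_len - len(tokens))


def tokens_to_ids(tokens):
    """Parse every token as a hexadecimal integer."""
    return [int(t, 16) for t in tokens]


ids = tokens_to_ids(pad_tokens(["4D", "5A", "00"], 6))
print(ids)                 # [77, 90, 0, 256, 256, 256]

# Ids now range over 0..256, so a lookup table indexed by them needs 257 rows
vocab_size = max(ids) + 1
print(vocab_size)          # 257
```

The point of the design is that padding (256) stays distinguishable from a genuine 0x00 byte (0), which padding with 0 would not allow.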

Model

import tensorflow as tf
import numpy as np

file_size = 2000000
padding = 'VALID'


# Initialize weights from a truncated normal distribution
def weight_variable(shape):
    initial = tf.truncated_normal(shape=shape, stddev=0.1)
    return tf.Variable(initial)


# Initialize biases
def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)


def conv1d(x, kernel):
    # x: [batch, in_width, in_channels]
    return tf.nn.conv1d(x, kernel, stride=500, padding=padding)


def max_pool(x):
    # The convolution output has 4000 positions, so this is a global max pooling
    return tf.layers.max_pooling1d(x, pool_size=4000, strides=1, padding=padding, data_format='channels_last')


# NOTE: lr is not passed to the optimizer below, so its decay has no effect
lr = tf.Variable(0.001, dtype=tf.float32)

# input.shape = [batch, 2000000]
input_data = tf.placeholder(tf.int32, shape=[None, file_size], name="input")
y = tf.placeholder(tf.float32, shape=[None, 9], name="y")
input_data1 = tf.reshape(input_data, [-1, file_size])

# 8-d embedding; 257 rows: byte values 0..255 plus the padding id 256 (hex '100')
embedding = tf.Variable(tf.random_normal([257, 8]), name="embedding")
x = tf.nn.embedding_lookup(embedding, input_data1)

# slice.shape = [batch, 2000000, 4]
sliceA = tf.slice(x, [0, 0, 0], [-1, -1, 4])
sliceB = tf.slice(x, [0, 0, 4], [-1, -1, 4])

# Weights and biases of the two convolution branches
W_conv1 = weight_variable([500, 4, 128])
b_conv1 = bias_variable([128])
V_conv1 = weight_variable([500, 4, 128])
c_conv1 = bias_variable([128])

# Convolve A and B separately
A = conv1d(sliceA, W_conv1) + b_conv1
B = conv1d(sliceB, V_conv1) + c_conv1

# Gating; G0.shape = [batch, 4000, 128]
G0 = tf.nn.relu(A * tf.nn.sigmoid(B))
P = max_pool(G0)

# Flatten; P.shape = [batch, 128]
P = tf.reshape(P, [-1, 128])

# FC1
W_fc1 = weight_variable([128, 128])
b_fc1 = bias_variable([128])
h_fc1 = tf.nn.relu(tf.matmul(P, W_fc1) + b_fc1)
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# FC2
W_fc2 = weight_variable([128, 9])
b_fc2 = bias_variable([9])
prediction = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

cross_entropy = -tf.reduce_sum(y * tf.log(prediction))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
init = tf.global_variables_initializer()

correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))


def _parse_function(example):
    features = {
        'label': tf.FixedLenFeature([], tf.int64),
        'train': tf.FixedLenFeature([], tf.string)
    }
    parsed_features = tf.parse_single_example(example, features)
    # Perform additional preprocessing on the parsed data.
    label = tf.cast(parsed_features["label"], tf.int32)
    train = tf.cast(parsed_features["train"], tf.string)
    return train, label


# Training set
dataset = tf.data.TFRecordDataset('/input/3000_train.tfrecord')
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()
dataset = dataset.batch(75)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Test set
test = tf.data.TFRecordDataset('/input/1000_test.tfrecord')
test = test.map(_parse_function)
test = test.repeat()
test = test.batch(75)
test_iterator = test.make_initializable_iterator()
# Use test_iterator (not iterator) so that evaluation reads the test records
test_next_element = test_iterator.get_next()


def string_to_int(s):
    return [int(i, 16) for i in str(s, encoding='utf-8').split()]


def prepare_batch(batch_xs, batch_ys):
    out = np.asarray([string_to_int(j) for j in batch_xs], dtype=np.int32)
    # Labels are in [1, 9]; shift them to [0, 8] and one-hot encode
    batch_ys = np.eye(9)[batch_ys.reshape(-1) - 1]
    return out, batch_ys


with tf.Session() as sess:
    sess.run(init)
    sess.run(iterator.initializer)
    sess.run(test_iterator.initializer)
    for epoch in range(21):
        sess.run(tf.assign(lr, 0.001 * (0.95 ** epoch)))
        # 40 batches of 75 samples = one pass over the 3000 training samples
        for batch in range(40):
            batch_xs, batch_ys = sess.run(next_element)
            out, batch_ys = prepare_batch(batch_xs, batch_ys)
            sess.run(train_step, feed_dict={input_data: out, y: batch_ys, keep_prob: 0.7})
        # Accuracy on one batch of 75 test samples
        batch_xs, batch_ys = sess.run(test_next_element)
        out, batch_ys = prepare_batch(batch_xs, batch_ys)
        test_acc = sess.run(accuracy, feed_dict={input_data: out, y: batch_ys, keep_prob: 1.0})
        print("Iter " + str(epoch) + ", Testing Accuracy " + str(test_acc))
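The label handling and the accuracy metric used in the training loop can be checked in isolation with NumPy. This is only a sketch: `one_hot_labels` and `accuracy` are hypothetical helper names, and the labels and probabilities below are made up. It assumes, as the code comments state, that labels range from 1 to 9.

```python
import numpy as np


def one_hot_labels(labels, num_classes=9):
    """Map labels in 1..num_classes to one-hot rows (class k goes to column k - 1)."""
    zero_based = np.asarray(labels, dtype=np.int64).reshape(-1) - 1
    return np.eye(num_classes, dtype=np.float32)[zero_based]


def accuracy(probabilities, one_hot):
    """Fraction of rows whose most probable class matches the one-hot label."""
    predicted = np.argmax(probabilities, axis=1)
    expected = np.argmax(one_hot, axis=1)
    return float(np.mean(predicted == expected))


y = one_hot_labels([1, 9, 3])
print(y.shape)             # (3, 9)
print(y.argmax(axis=1))    # [0 8 2]

# Made-up predictions: the first two rows are right, the third is wrong
probs = np.eye(9)[[0, 8, 5]]
print(accuracy(probs, y))
```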

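In total I generated 3,000 training samples and 1,000 test samples (3000_train.tfrecord and 1000_test.tfrecord). After about 6 epochs of training, the accuracy reaches 0.98.

Finally, the code can be downloaded from GitHub.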

References

Malware Detection by Eating a Whole EXE

MalConv: Lessons learned from Deep Learning on executables

Malware classification model: an implementation of MalConv

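Ramblings

This is my coursework for the course "Information Security Practical Training 2". I had never studied machine learning and did not know TensorFlow, so I was thrown in at the deep end and pieced everything together by copying bit by bit. Many parts are not done well; if I have wasted your time, I am truly sorry. If this post helped you, I will be very happy.

Finally, if you are interested, feel free to visit my blog, although it is all rubbish posts.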

License

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
