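Goal: label samples according to the distribution of each field (for example, the top 10 srcIP and dstIP values) together with other features, so that each sample ends up in one of the classes black/white/ddos/mddos/cdn/unknown.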

Sample session:

-------------choose one--------------
sub domain: DNSQueryName(N)
ip: srcip(S) or dstip(D)
length: DNSRequestLength(R1) or DNSReplyLength(R2)
length too: DNSRequestErrLength(R3) or DNSReplyErrLength(R4)
port: sourcePort(P1) or destPort(P2) or DNSReplyTTL(T)
code: DNSReplyCode(C2) or DNSRequestRRType(C1)
other: DNSRRClass(RR) or DNSReplyIPv4(V)
-------------label or quit------------
black(B) or white(W) or cdn(CDN) or ddos(DDOS) or mddos(M) or unknown(U) or white-like(L)
next(Q) or exit(E)?
***************************************
domain: workgroup. flow count: 206
***************************************
------------srcip-----------------
count                 206
unique                  9
top       162.105.129.122
freq                  150
Name: sourceIP, dtype: object
--------------destip---------------
count             206
unique             12
top       199.7.83.42
freq               82
Name: destIP, dtype: object
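
The four numbers in each block above (count, unique, top, freq) are what pandas `describe()` reports for a column of strings. As a rough illustration of what they mean, the sketch below recomputes them with the standard library only. It is written for Python 3, the helper names `describe_values` and `top_n` are hypothetical, and the IP addresses are made-up sample values rather than data from the original capture.

```python
from collections import Counter


def describe_values(values):
    """Summarize a list of hashable values as count/unique/top/freq."""
    if not values:
        raise ValueError("values must not be empty")
    counter = Counter(values)
    # most_common(1) returns [(value, occurrences)] for the most frequent value.
    top, freq = counter.most_common(1)[0]
    return {
        "count": len(values),    # number of flows
        "unique": len(counter),  # number of distinct values
        "top": top,              # most frequent value
        "freq": freq,            # how often the top value occurs
    }


def top_n(values, n=10):
    """Return the n most frequent values with their counts."""
    return Counter(values).most_common(n)


if __name__ == "__main__":
    # Made-up sample: three flows from one address, one each from two others.
    source_ips = ["10.0.0.1"] * 3 + ["10.0.0.2", "10.0.0.3"]
    print(describe_values(source_ips))
    print(top_n(source_ips, 10))
```

In the session above, a single source address accounts for 150 of 206 flows, which is the kind of skew this summary is meant to expose before a label is chosen.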

Code (Python 2):

import sys
import json
import os
import pandas as pd
import tldextract
# import numpy as np
medata_field = '''
3 = sourceIP
4 = destIP
5 = sourcePort
6 = destPort
7 = protocol
12 = flowStartSeconds
13 = flowEndSecond
54 = DNSReplyCode
55 = DNSQueryName
56 = DNSRequestRRType
57 = DNSRRClass
58 = DNSDelay
59 = DNSReplyTTL
60 = DNSReplyIPv4
61 = DNSReplyIPv6
62 = DNSReplyRRType
77 = DNSReplyName
81 = payload
88 = DNSRequestLength
89 = DNSRequestErrLength
90 = DNSReplyLength
91 = DNSReplyErrLength
'''
medata_field_num = []
medata_field_info = []
for l in medata_field.split("\n"):
    if len(l) == 0: continue
    num, info = l.split(" = ")
    medata_field_num.append(int(num)-1)
    medata_field_info.append(info)
print medata_field_num
print medata_field_info


def extract_domain(domain):
    try:
        ext = tldextract.extract(domain)
        subdomain = ext.subdomain
        if ext.domain == "":
            mdomain = ext.suffix
        else:
            mdomain = ".".join(ext[1:])
        return mdomain
    except Exception,e:
        print "extract_domain error:", e
        return "unknown"


def parse_metadata(path):
    df = pd.read_csv(path, sep="^", header=None)
    dns_df = df.iloc[:, medata_field_num].copy()
    dns_df.columns = medata_field_info
    # print dns_df.tail()
    dns_df["mdomain"] = dns_df["DNSQueryName"].apply(extract_domain)
    # print dns_df.groupby('mdomain').describe()
    # print dns_df.groupby('mdomain').groups
    return dns_df.groupby('mdomain')


def get_data_dist(df, col="sourceIP"):
    # group count by ip dist
    grouped = df.groupby(col)
    # print grouped.head(10)[col]
    print type(grouped.size())
    size = grouped.size()
    print size
    print "-----------top 10-------------"
    print size.nlargest(10)


def get_ipv4_dist(df, col="DNSReplyLength"):
    # group count by ip dist
    df2 = df[df[col] > 0]
    print "filter before length:", len(df), "filter after length:", len(df2)
    grouped = df2.groupby(by="DNSReplyIPv4")
    # print grouped.head(10)[col]
    size = grouped.size()
    print size
    print "-----------top 10-------------"
    print size.nlargest(10)


def move_to(srcpath, domain, dst_path):
    with open(dst_path, "w") as w:
        with open(srcpath) as r:
            for line in r:
                if extract_domain(line.split("^")[55-1]) == domain:
                    w.write(line)


def main():
    history_op = {}
    if os.path.exists("history_op.json"):
        with open("history_op.json") as h:
            history_op = json.load(h)
    print history_op
    for day in range(24, 27):
        for hour in range(0, 24):
            path = "/home/bonelee/latest_metadata_sample/sampled/unknown_sample/debugdogcom-medata_wanted-2017-09-%d-%d.txt" % (day, hour)
            if not os.path.exists(path) or os.path.getsize(path) == 0:
                print path, "passed, file not exists or empty file."
                continue
            print path, "running..."
            try:
                domains_info = parse_metadata(path)
            except IOError, e:
                print e
                continue
            for domain, group in domains_info:
                print "***************************************"
                print "domain:", domain, "flow count:", len(group)
                print "***************************************"
                # print type(group) #<class 'pandas.core.frame.DataFrame'>
                print "------------srcip-----------------"
                print group["sourceIP"].describe()
                print "--------------destip---------------"
                print group["destIP"].describe()
                print "----------------------------------------"
                print "ipv4 address return dist:"
                get_ipv4_dist(group)
                print "----------------------------------------"
                has_judged = False
                need_break = False
                while True:
                    print "-------------choose one--------------"
                    print "sub domain: DNSQueryName(N)"
                    print "ip: srcip(S) or dstip(D)"
                    print "length: DNSRequestLength(R1) or DNSReplyLength(R2)"
                    print "length too: DNSRequestErrLength(R3) or DNSReplyErrLength(R4)"
                    print "port: sourcePort(P1) or destPort(P2) or DNSReplyTTL(T)"
                    print "code: DNSReplyCode(C2) or DNSRequestRRType(C1)"
                    print "other: DNSRRClass(RR) or DNSReplyIPv4(V)"
                    dist_dict = {
                        "R1": "DNSRequestLength",
                        "R2": "DNSReplyLength",
                        "R3": "DNSRequestErrLength",
                        "R4": "DNSReplyErrLength",
                        "P1": "sourcePort",
                        "P2": "destPort",
                        "T": "DNSReplyTTL",
                        "C2": "DNSReplyCode",
                        "C1": "DNSRequestRRType",
                        "RR": "DNSRRClass",
                        "V": "DNSReplyIPv4",
                        "S": "sourceIP",
                        "D": "destIP",
                        "N": "DNSQueryName"
                    }
                    print "-------------label or quit------------"
                    print "black(B) or white(W) or cdn(CDN) or ddos(DDOS) or mddos(M) or unknown(U) or white-like(L)"
                    print "next(Q) or exit(E)?"
                    domain = domain.lower()
                    if "win" == domain[-len("win"):] or "site" == domain[-len("site"):] or "vip" == domain[-len("vip"):]:
                        check = "U"
                        need_break = True
                    elif "lan" in domain or "local" in domain or "dhcp" in domain or "workgroup" in domain or "home" in domain:
                        check = "DDOS"
                        need_break = True
                    elif "cdn" in domain:
                        check = "CDN"
                        need_break = True
                    else:
                        if domain in history_op and not has_judged:
                            print "found history op:", history_op[domain]
                            if not raw_input("OK(Enter for Y)?"):
                                check = history_op[domain]
                                need_break = True
                            else:
                                check = raw_input("Input:")
                        else:
                            check = raw_input("Input:")
                        has_judged = True
                    if check == "Q":
                        print path, "next OK!"
                        break
                    elif check == "E":
                        print path, "Exit!"
                        with open("history_op.json", "w") as f:
                            json.dump(history_op, f)
                        print "saved history_op.json"
                        sys.exit()
                    elif check == "B":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_black/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "B"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "W":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_white/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "W"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "L":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_white_like/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "L"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "CDN":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_cdn/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "CDN"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "DDOS":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_ddos/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "DDOS"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "M":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_mddos/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "M"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "U":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_unknown/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "U"
                        print "Saved OK!"
                        if need_break: break
                    else:
                        if check in dist_dict:
                            get_data_dist(group, dist_dict[check])
                        else:
                            print "unknown input!Choose the following one:"
            print "*******************************"
            print path, "check over..."
            print "*******************************"


if __name__ == "__main__":
    main()
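
The labeling loop above repeats the same three steps for every label and hard-codes a few automatic rules by domain name. One possible way to factor that out is sketched below, assuming Python 3 rather than the Python 2 of the script. `LABEL_DIRS`, `auto_label` and `label_path` are hypothetical names introduced here for illustration. The directory names, keyword lists and filename pattern are copied from the script as-is.

```python
import os

# Label code -> output directory name, as used in the script's paths.
LABEL_DIRS = {
    "B": "labeled_black",
    "W": "labeled_white",
    "L": "labeled_white_like",
    "CDN": "labeled_cdn",
    "DDOS": "labeled_ddos",
    "M": "labeled_mddos",
    "U": "labeled_unknown",
}


def auto_label(domain):
    """Return the label the script assigns without prompting, or None."""
    domain = domain.lower()
    # Suffix rule: domains ending in win/site/vip are labeled unknown.
    if domain.endswith(("win", "site", "vip")):
        return "U"
    # Substring rule: local-network style names are labeled ddos.
    if any(key in domain for key in ("lan", "local", "dhcp", "workgroup", "home")):
        return "DDOS"
    if "cdn" in domain:
        return "CDN"
    return None


def label_path(base_dir, label, day, hour, domain):
    """Build the destination file path for one labeled domain."""
    # Filename pattern copied unchanged from the script.
    filename = "2017-8-%d-%d-%s.txt" % (day, hour, domain)
    return os.path.join(base_dir, LABEL_DIRS[label], filename)


if __name__ == "__main__":
    for name in ("workgroup.", "example.vip", "cdn.example.com", "example.com"):
        print(name, auto_label(name))
    print(label_path("/tmp", "B", 24, 3, "example.com"))
```

With a table like this, each `elif check == ...` branch could collapse into a single lookup followed by `move_to`. Note that the substring rules are broad: a name such as planet.com contains "lan" and would also be labeled DDOS, so they may deserve a manual review step.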

Reposted from: https://www.cnblogs.com/bonelee/p/7608165.html
