Using a pandas DataFrame as machine-learning training data: just use iloc or as_matrix
Sample rows from the KDD99 dataset:
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,domain_u,SF,29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0.00,0.00,0.00,0.00,0.50,1.00,0.00,10,3,0.30,0.30,0.30,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,4,0.00,0.00,0.00,0.00,1.00,0.00,0.00,71,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,230,260,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,19,0.00,0.00,0.00,0.00,1.00,0.00,0.11,3,255,1.00,0.00,0.33,0.07,0.33,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,252,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
1,tcp,smtp,SF,3170,329,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,2,0.00,0.00,0.00,0.00,1.00,0.00,1.00,54,39,0.72,0.11,0.02,0.00,0.02,0.00,0.09,0.13,normal.
0,tcp,http,SF,297,13787,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,177,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,291,3542,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,12,12,0.00,0.00,0.00,0.00,1.00,0.00,0.00,187,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,295,753,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,21,22,0.00,0.00,0.00,0.00,1.00,0.00,0.09,196,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,268,9235,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,5,5,0.00,0.00,0.00,0.00,1.00,0.00,0.00,58,255,1.00,0.00,0.02,0.05,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,227,8841,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,13,13,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,222,19564,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,22,23,0.00,0.00,0.00,0.00,1.00,0.00,0.09,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,740,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,77,33,0.34,0.08,0.34,0.06,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,35195,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,10,0.00,0.00,0.00,0.00,1.00,0.00,0.00,92,44,0.43,0.07,0.43,0.05,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,8325,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,20,0.00,0.00,0.00,0.00,1.00,0.00,0.00,103,54,0.49,0.06,0.49,0.04,0.00,0.00,0.00,0.00,normal.
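Since the records have no header line, reading them with header=None should give integer column labels, with the last column holding the class label. Here is a minimal sketch of that, assuming Python 3 and a recent pandas. The inline SAMPLE string is an illustration rebuilt from two of the records above, not the author's file.

```python
import io

import pandas as pd

# Sixteen zero-valued fields sit between dst_bytes and count in these records.
ZEROS16 = ",".join(["0"] * 16)

# Two records rebuilt from the sample above: one normal, one attack.
SAMPLE = (
    "0,udp,private,SF,105,146," + ZEROS16 + ",1,1,"
    "0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,"
    "0.00,0.00,0.00,0.00,0.00,0.00,normal.\n"
    "0,udp,private,SF,105,146," + ZEROS16 + ",2,2,"
    "0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,"
    "0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.\n"
)

# header=None makes pandas label the columns 0, 1, 2, ... instead of using row 0.
df = pd.read_csv(io.StringIO(SAMPLE), sep=",", header=None)

label_col = df.columns[-1]
labels = df[label_col].tolist()

# Columns that hold text and therefore need encoding before training.
non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

print(df.shape)
print(labels)
print(non_numeric)
```

The non-numeric feature columns found this way (protocol, service, flag) are the ones that have to be turned into numbers before a scikit-learn estimator can use them.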
Code:
# -*- coding:utf-8 -*-
# Written for Python 2 (print statements) and for older pandas / scikit-learn
# releases that still provide as_matrix and sklearn.cross_validation.
import pandas as pd
import pydotplus
from sklearn import cross_validation
from sklearn import preprocessing
from sklearn import tree


def label(x):
    # Binary target: 0 for normal traffic, 1 for anything else.
    if x == "normal.":
        return 0
    else:
        return 1


if __name__ == '__main__':
    data = pd.read_csv('../data/kddcup99/corrected', sep=",", header=None)
    print data.columns
    print data.iloc[0,0], data.iloc[0,1]
    print len(data)
    col_cnt = len(data.columns)

    # Keep only the "normal." and "guess_passwd." records.
    normal = data.loc[data.loc[:, col_cnt-1] == "normal.", :]
    print "normal len:", len(normal)
    guess = data.loc[data.loc[:, col_cnt-1] == "guess_passwd.", :]
    # The message is unchanged from the original; this is the guess_passwd count.
    print "normal len:", len(guess)
    data = pd.concat([normal, guess])
    print len(data)

    # Encode the string-valued feature columns as integers.
    le = preprocessing.LabelEncoder()
    for i in range(col_cnt-1):
        if isinstance(data.iloc[0,i], str):
            print "tranform string column only:", i
            data.loc[:,i] = le.fit_transform(data.loc[:,i])
    # Turn the last column into the 0/1 target.
    data.loc[:,col_cnt-1] = data.loc[:,col_cnt-1].apply(label)
    print data.iloc[0,0], data.iloc[0,1]

    # Features are every column except the last; the label is the last column.
    x = data.iloc[:, range(col_cnt-1)]
    #x = data.iloc[:, [0,4,5,6,7,8,22,23,24,25,26,27,28,29,30]]
    y = data.iloc[:, col_cnt-1]

    # Also OK: convert to a NumPy matrix first.
    # data = data.as_matrix()
    # x = data[:, range(col_cnt-1)]
    # y = data[:, col_cnt-1]

    print "x=>"
    print x.iloc[0:3, :]
    print "y=>"
    print y[-3:]

    #v=load_kdd99("../data/kddcup99/corrected")
    #x,y=get_guess_passwdandNormal(v)
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(x, y)
    print clf
    print cross_validation.cross_val_score(clf, x, y, n_jobs=-1, cv=10)

    # Refit on the full data and export the tree as a PDF.
    clf = clf.fit(x, y)
    dot_data = tree.export_graphviz(clf, out_file=None)
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_pdf("../photo/6/iris-dt.pdf")
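The script above depends on Python 2 print statements, on DataFrame.as_matrix (removed in pandas 1.0) and on sklearn.cross_validation (removed in scikit-learn 0.20). Below is a rough Python 3 sketch of the same idea, using to_numpy() and sklearn.model_selection instead. The toy frame and the helper names make_toy_frame and encode_and_split are hypothetical, introduced only for illustration; they do not come from the original post and do not reproduce its numbers.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier


def make_toy_frame():
    """Build a small KDD99-like frame (made-up values, integer column labels)."""
    rows = []
    for k in range(20):
        rows.append([0, "udp", "private", 105 + k, 146, "normal."])
        rows.append([1, "tcp", "telnet", 120 + k, 200 + k, "guess_passwd."])
    return pd.DataFrame(rows)


def encode_and_split(data):
    """Encode string feature columns, binarize the label, return (x, y)."""
    data = data.copy()
    label_col = data.columns[-1]
    for col in data.columns[:-1]:
        # Only string-valued columns are encoded, as in the script above.
        if isinstance(data[col].iloc[0], str):
            data[col] = LabelEncoder().fit_transform(data[col])
    # 0 for "normal.", 1 for everything else.
    data[label_col] = (data[label_col] != "normal.").astype(int)
    x = data.iloc[:, :-1]   # every column except the last
    y = data.iloc[:, -1]    # the last column
    return x, y


data = make_toy_frame()
x, y = encode_and_split(data)

# NumPy route: to_numpy() stands in for the removed as_matrix().
x_arr = x.to_numpy()
y_arr = y.to_numpy()

clf = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(clf, x, y, cv=5)
print(x.shape, x_arr.shape, scores.mean())
```

Either form can be passed to fit and cross_val_score: the DataFrame slices from iloc or the arrays from to_numpy().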
Output:
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36, 37, 38, 39, 40, 41],
           dtype='int64')
0 udp
311029
normal len: 60593
normal len: 4367
64960
tranform string column only: 1
tranform string column only: 2
tranform string column only: 3
0 2
x=>
   0  1   2  3    4    5  6  7  8  9 ...   31   32   33    34   35  \
0  0  2  15  7  105  146  0  0  0  0 ...  255  254  1.0  0.01  0.0
1  0  2  15  7  105  146  0  0  0  0 ...  255  254  1.0  0.01  0.0
2  0  2  15  7  105  146  0  0  0  0 ...  255  254  1.0  0.01  0.0

    36   37   38   39   40
0  0.0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0  0.0

[3 rows x 41 columns]
y=>
142098    1
142099    1
142101    1
Name: 41, dtype: int64
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
[ 0.9561336   0.99892258  0.99938433  0.99984606  0.99984606  0.99969212
  1.          0.99984604  0.99969207  1.        ]
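The ten values at the end of the output are the per-fold accuracies from cross_val_score. One simple way to summarize them is the mean and the spread across folds, sketched here with only the standard library. The scores are copied from the output above; the variable names are illustrative.

```python
import statistics

# Fold accuracies copied from the cross_val_score output above.
scores = [0.9561336, 0.99892258, 0.99938433, 0.99984606, 0.99984606,
          0.99969212, 1.0, 0.99984604, 0.99969207, 1.0]

mean_acc = statistics.mean(scores)
std_acc = statistics.pstdev(scores)  # population standard deviation over the folds

# Index of the fold with the lowest accuracy.
worst_fold = min(range(len(scores)), key=scores.__getitem__)

print("mean=%.4f std=%.4f worst fold=%d" % (mean_acc, std_acc, worst_fold))
```

Looking at the worst fold separately helps here because a single fold scores noticeably lower than the other nine.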
Reposted from: https://www.cnblogs.com/bonelee/p/7808478.html