A Summary of Python Data Processing Methods

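Python is a concise language that can do a great deal with little code. For data processing, the most commonly used Python packages are pandas and numpy. This summary is based on my participation in a machine learning competition whose main task was to predict the balance of a user's bank card one week later. The work was split into a data preprocessing stage and a prediction stage. This post focuses on the preprocessing stage, which uses many pandas features, and collects them here for future reference.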

1. Reading and saving txt files

Creating a new DataFrame:

import pandas as pd

data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
        'year': [2016, 2016, 2015, 2017, 2016, 2016],
        'population': [2100, 2300, 1000, 700, 500, 500]}
# 'debt' is not a key in data, so that column is created and filled with NaN
frame = pd.DataFrame(data, columns=['year', 'city', 'population', 'debt'])

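Reading:

# the names must match the number of columns in the file
columns_trx = ['id', 'trx', 'cardId', 'date', 'twog', 'name', 'category', 'deal']
# read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrapper is needed
trainTrx = pd.read_csv('trainTrx.txt', header=None, names=columns_trx, sep=',')

Saving:

trainTrx.to_csv('predict.txt', header=False, index=False, sep=',')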
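The read and save steps can be checked with a small round trip. This is only a sketch: the sample rows, the three column names, and the file name sample.txt are made up for illustration, and a temporary directory stands in for the real data folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; the real competition files are not shown in the post.
columns = ['id', 'category', 'deal']
original = pd.DataFrame(
    [[1, 'food', 12.5], [1, 'rent', 800.0], [2, 'food', 7.0]],
    columns=columns,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')
    # Write without a header row or an index column.
    original.to_csv(path, header=False, index=False, sep=',')
    # Read it back, supplying the column names ourselves.
    restored = pd.read_csv(path, header=None, names=columns, sep=',')

print(restored)
```

Because the file has no header row, the list passed to names must have exactly as many entries as the file has columns.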

2. Taking the first and last row for each repeated value in a column

# keep the first row for each id
y1 = bal_value.drop_duplicates(subset='id', keep='first')

# keep the last row for each id
y2 = bal_value.drop_duplicates(subset='id', keep='last')
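A minimal sketch of the two calls on toy data. Only the column name id comes from the text; the balance column and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical balances; rows are assumed to be in chronological order already.
bal_value = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                          'balance': [10, 20, 30, 40, 50]})

first_rows = bal_value.drop_duplicates(subset='id', keep='first')
last_rows = bal_value.drop_duplicates(subset='id', keep='last')

print(first_rows)
print(last_rows)
```

"First" and "last" refer to row order, so the frame should be sorted (for example by date) before calling drop_duplicates if the order matters.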

3. Selecting the rows where a column has missing values

# x is a DataFrame and deal is one of its columns
y1 = x[x['deal'].isnull()]
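A small sketch of the same filter on made-up data, together with its complement notnull(), which is often what the next preprocessing step needs.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the deal column.
x = pd.DataFrame({'id': [1, 2, 3], 'deal': [5.0, np.nan, 7.5]})

missing = x[x['deal'].isnull()]    # rows where deal is NaN
present = x[x['deal'].notnull()]   # rows where deal has a value

print(missing)
print(present)
```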

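4. Replacing values according to a code table

dataCode = {'100': 'China', '2': 'World'}  # code table: code -> name

# data set whose values are to be replaced
dataFrame = pd.DataFrame({'A': ['China', 1, 'China'], 'B': ['World', 5, 6]})
for key in dataCode:
    # replace each name found in column A with its code
    dataFrame['A'] = dataFrame['A'].replace(dataCode[key], key)

print(dataFrame)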
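As an alternative to looping over the code table, Series.replace also accepts a dict, so the whole table can be applied in one call. The sketch below inverts the table first, because the table maps code to name while the replacement goes from name to code. The variable names and the English sample values are my own.

```python
import pandas as pd

data_code = {'100': 'China', '2': 'World'}   # code -> name
name_to_code = {name: code for code, name in data_code.items()}

df = pd.DataFrame({'A': ['China', 1, 'China'], 'B': ['World', 5, 6]})
# Replace every name in column A that appears in the table with its code.
df['A'] = df['A'].replace(name_to_code)

print(df)
```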

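5. Converting every element of a list

# replace every value in errorList with its absolute value; abs can be swapped for another function such as str
absErrorList = list(map(abs, errorList))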

5. Getting the unique values of a list

Use sorted(set(countyId)) to get the unique values in sorted order.

6. Getting the unique values of a DataFrame column

onlyValue = trainTrx['id'].unique()

7. Getting the values that appear in a column of only one of two DataFrames

differValue = list(set(map(str,trainTrx['id'].unique())) ^ set(map(str,trainTrx1['id'].unique())))
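The ^ operator on sets is the symmetric difference: values that occur in exactly one of the two sets. A small sketch with invented ids, which also shows the one-sided difference for comparison:

```python
import pandas as pd

# Hypothetical id columns of two data sets.
trainTrx = pd.DataFrame({'id': [1, 2, 3, 3]})
trainTrx1 = pd.DataFrame({'id': [2, 3, 4]})

ids_a = set(trainTrx['id'].unique().tolist())
ids_b = set(trainTrx1['id'].unique().tolist())

only_in_one = sorted(ids_a ^ ids_b)   # in exactly one of the two frames
only_in_a = sorted(ids_a - ids_b)     # in the first frame but not the second

print(only_in_one, only_in_a)
```

Converting the ids with map(str, ...) is only needed when the two columns have different types, for example integers in one file and strings in the other.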

8. Getting the first or last few rows of a DataFrame

First rows:

trainTrx.head()  # five rows by default
trainTrx.head(10)

Last rows:

trainTrx.tail()  # five rows by default
trainTrx.tail(10)

9. Rebuilding the index of a DataFrame

trainTrx = trainTrx.set_index('id')

10. Grouping a DataFrame by specified columns and summing a specified column

trainTrx = trainTrx.groupby(by=['id', 'category'])['deal'].sum()
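The result of this groupby is a Series indexed by (id, category), not a DataFrame. A sketch on invented transactions, showing how reset_index turns the group keys back into ordinary columns:

```python
import pandas as pd

# Hypothetical transactions; only the column names come from the text.
trainTrx = pd.DataFrame({
    'id': [1, 1, 1, 2],
    'category': ['food', 'food', 'rent', 'food'],
    'deal': [10.0, 5.0, 800.0, 7.0],
})

totals = trainTrx.groupby(by=['id', 'category'])['deal'].sum()  # Series with a MultiIndex
totals_flat = totals.reset_index()                               # id, category, deal columns

print(totals)
print(totals_flat)
```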

11. Checking the pandas version

# worth checking because the behavior of functions can change between pandas versions
pd.show_versions()

12. Deleting rows of a DataFrame that meet a condition

Delete the rows that meet a condition: train = train.drop(train[train[target] == 0].index)

Delete a column (directly with del): del train['name']
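Dropping the rows that match a condition gives the same result as keeping the rows that do not match it, which is often easier to read. A sketch, assuming a made-up frame and using 'label' as a stand-in for the target column:

```python
import pandas as pd

train = pd.DataFrame({'name': ['a', 'b', 'c'], 'label': [0, 1, 0]})
target = 'label'   # hypothetical column name standing in for `target`

dropped = train.drop(train[train[target] == 0].index)   # drop matching rows
filtered = train[train[target] != 0]                    # keep non-matching rows

# drop(columns=...) returns a new frame, whereas del modifies the frame in place.
without_name = filtered.drop(columns=['name'])

print(dropped)
print(without_name)
```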

13. Getting the values of a DataFrame that satisfy conditions on rows or columns

(1) Recommended blog post, which is very detailed: https://blog.csdn.net/GeekLeee/article/details/75268762

>>> import pandas as pd
>>> import numpy as np
>>> from pandas import Series, DataFrame
>>> df = DataFrame({'name':['a','a','b','b'],'classes':[1,2,3,4],'price':[11,22,33,44]})

(2) Extracting the values of a DataFrame that satisfy several conditions

References: https://jingyan.baidu.com/article/0eb457e508b6d303f0a90572.html

https://blog.csdn.net/GeekLeee/article/details/75268762

Get a value by index and column:

>>> s = df.loc[0,'price']
>>> s

Get the value of one column from the rows in which other columns match given values:

>>> sex = df.loc[(df.classes==1)&(df.name=='a'),'price']
>>> sex
0 11

>>> sex = df.loc[(df.classes==1)&(df.name=='a'),'price'].values[0]
>>> sex

13. Collecting the indices of each repeated element in a list

from collections import defaultdict

list_1 = ['2', '5', '6', '11', '2', '535', '2', '2']
d = defaultdict(list)  # element -> list of positions where it occurs

for i, v in enumerate(list_1):
    d[v].append(i)

print(d)

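14. Applying a calculation to every element of a list

Power:

import numpy as np

predictArray = np.array(predictValue)
realArray = np.array(realValue)
errorArray = predictArray - realArray

errorSquareArray = np.power(errorArray, 2)  # element-wise square with the power function

Logarithm (using a lambda expression):

import math

g = lambda x: math.log(x)  # natural logarithm
logY = [g(x) for x in y]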
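Both calculations can also be done without an explicit loop, since numpy functions work element by element on arrays. A sketch with invented values:

```python
import math

import numpy as np

predict_value = [2.0, 4.0, 9.0]   # hypothetical predictions
real_value = [1.0, 4.0, 6.0]      # hypothetical ground truth

error = np.array(predict_value) - np.array(real_value)
squared_error = np.power(error, 2)   # same as error ** 2

y = [1.0, math.e, math.e ** 2]
log_y = np.log(np.array(y))          # natural logarithm of every element

print(squared_error)
print(log_y)
```

np.log returns an array, so call .tolist() on the result if a plain list is needed afterwards.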

15. DataFrame and Series usage compared

(1) Creating a DataFrame

Reference: https://blog.csdn.net/flyfrommath/article/details/68069584

(2) Converting the values of a list to a DataFrame

logYDict = {'popNumDensity': yTrain}  # yTrain is a list
logYDataFrame = pd.DataFrame(logYDict)
y = logYDataFrame

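16. Plotting in Python

(1) Drawing one or more plots in a single figure

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 100)  # sample data; x is not defined in the original snippet

fig1 = plt.figure()
ax1 = fig1.add_subplot(211)
ax1.plot(x, x ** 3)
ax2 = fig1.add_subplot(212)  # create ax2 before plotting on it
ax2.plot(x, np.log(x))

fig2 = plt.figure()
ax3 = fig2.add_subplot(121)
ax3.plot(x, np.sqrt(x))
ax4 = fig2.add_subplot(122)
ax4.plot(x, x ** 2)

plt.show()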

(2) Drawing one or more plots in multiple figures

(3) Drawing a line chart and annotating the line

A very good blog post on the topic: https://www.jianshu.com/p/5ae17ace7984
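For item (3), a minimal sketch of a line chart with one annotation, using matplotlib's Axes.annotate. The data, the label text, and the offsets of the annotation are arbitrary choices for illustration, and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10)
y = x ** 2

fig, ax = plt.subplots()
ax.plot(x, y, marker='o', label='y = x^2')

# Mark the largest value with a text label and an arrow pointing at it.
peak = int(np.argmax(y))
note = ax.annotate('max: %d' % y[peak],
                   xy=(x[peak], y[peak]),               # point being annotated
                   xytext=(x[peak] - 4, y[peak] - 10),  # where the text is placed
                   arrowprops=dict(arrowstyle='->'))
ax.legend()
```

Call plt.show() or fig.savefig(...) afterwards to display or store the figure.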

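17. Calculations in Python:

(1) Addition, subtraction, multiplication, division, floor division, and remainder

Reference: https://blog.csdn.net/u014647208/article/details/53368244

(2) String slicing

Reference: https://www.cnblogs.com/xunbu7/p/8074417.html

Program:

s = '0123456789'  # named s to avoid shadowing the built-in str
print(s[0:3])  # the first through third characters
print(s[:])  # all characters of the string
print(s[6:])  # from the seventh character to the end
print(s[:-3])  # from the start up to, but not including, the third-from-last character
print(s[2])  # the third character
print(s[-1])  # the last character
print(s[::-1])  # a new string in reverse order
print(s[-3:-1])  # from the third-from-last character up to, but not including, the last
print(s[-3:])  # from the third-from-last character to the end
print(s[:-5:-3])  # reverse slice with a step of 3, from the last character and stopping before the fifth from last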

18. Batch renaming files in Python

Reference: http://www.runoob.com/python/os-rename.html
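A minimal sketch of batch renaming with os.listdir and os.rename. The file names and the data_ prefix are invented, and a temporary directory is used so that nothing real is touched.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few hypothetical files to rename.
    for name in ['a.txt', 'b.txt', 'notes.md']:
        open(os.path.join(folder, name), 'w').close()

    # Add a prefix to every .txt file and leave other files alone.
    for name in sorted(os.listdir(folder)):
        if name.endswith('.txt'):
            os.rename(os.path.join(folder, name),
                      os.path.join(folder, 'data_' + name))

    renamed = sorted(os.listdir(folder))

print(renamed)
```

os.rename needs full paths unless the script runs inside the folder itself, which is why each name is joined with the folder path.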

19. Deleting a specified substring in Python

Reference: https://www.cnblogs.com/2bjiujiu/p/7257744.html
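Three common ways to remove text from a string, sketched on made-up inputs: str.replace for a fixed substring, str.strip for leading and trailing whitespace, and re.sub for a pattern.

```python
import re

text = 'id_001_train.txt'

without_suffix = text.replace('_train', '')   # remove a fixed substring
stripped = '  hello  '.strip()                # remove surrounding whitespace
without_digits = re.sub(r'\d+', '', text)     # remove every run of digits

print(without_suffix, stripped, without_digits)
```

Strings are immutable, so each call returns a new string and leaves text unchanged.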

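20. Setting up a Python debugging environment in Sublime

References:

Installing the relevant plugins in Sublime to set up a Python debugging environment:

https://blog.csdn.net/qq_33505303/article/details/78550015

For how to use the pdb commands, see:

https://blog.csdn.net/LaputaFallen/article/details/78567099

https://www.jianshu.com/p/21a30d39c5f0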

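21. Fixing garbled Chinese output when printing in Python

For the usual problems, see this blog post: https://blog.csdn.net/chinwuforwork/article/details/52535643

If the post above does not solve the problem, the encoding you chose is most likely wrong. The following post, which focuses on the form of the encoding, explains how to change it:

https://www.cnblogs.com/apple2016/p/5849412.html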

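22. Installing PyCharm

PyCharm is a time-saving Python IDE. Download the latest free version from the official website and install it.

For how to configure the Python version it uses after installation, see this blog post: https://blog.csdn.net/guifei010/article/details/79210034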

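23. Using arrays in Python

(1) Normally, array = [] is all it takes to create an array in Python (strictly speaking a list, with array as its name), but each array has to be named explicitly. When many arrays are needed, for example array1 = [], array2 = [], array3 = [] and so on, how can a program name and initialize them directly without a large amount of code? I have not yet found a good solution to this problem.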
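One common way around the naming problem in (1) is not to create separate variables at all, but to keep the lists in a dict whose keys play the role of the names. This is a suggestion of mine rather than something from the competition code, and the names array1 to array3 simply mirror the example.

```python
# Keys play the role of the variable names array1, array2, array3.
arrays = {'array%d' % i: [] for i in range(1, 4)}
arrays['array2'].append(42)

# When only the position matters, a list of lists is enough.
array_list = [[] for _ in range(3)]
array_list[0].append('x')

print(arrays)
print(array_list)
```

Note that [[]] * 3 would not work here: it repeats the same list object three times, so appending to one entry would appear to change all of them.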

(2) Building a hashmap structure in Python
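Python's built-in dict is a hash map, so no extra structure has to be built. A short sketch of the basic operations, with invented card names and balances:

```python
balances = {'card_a': 100.0, 'card_b': 250.5}

balances['card_c'] = 0.0                 # insert a new key
balances['card_a'] += 20.0               # update an existing key
missing = balances.get('card_z', 0.0)    # lookup with a default for absent keys
has_b = 'card_b' in balances             # membership test
del balances['card_b']                   # remove a key

for card, amount in balances.items():
    print(card, amount)
```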

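(3) Python comes in 32-bit and 64-bit builds, and a 64-bit build can use more memory. However, some software only supports 32-bit Python. To use both 32-bit and 64-bit Python on the same system, follow the configuration described in this blog post: https://www.cnblogs.com/lyy-totoro/p/6295636.html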

24. Deleting the rows of a pandas.DataFrame that contain a specific string

Reference: https://blog.csdn.net/htbeker/article/details/79645651
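A minimal sketch of one way to do this, using Series.str.contains to build a boolean mask and then keeping the rows that do not match. The data and the search string 'apple' are made up.

```python
import pandas as pd

df = pd.DataFrame({'name': ['apple pie', 'banana', 'apple juice', None],
                   'price': [3, 1, 2, 5]})

# na=False treats missing names as "does not contain";
# regex=False matches the text literally instead of as a pattern.
mask = df['name'].str.contains('apple', na=False, regex=False)
kept = df[~mask]

print(kept)
```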

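25. Python shift()

The shift function moves the data along the index. For example, df.shift(1) shifts all the data down by one row: the first row becomes NaN, the values of the last row are dropped, and the index itself stays unchanged.

Reference: https://blog.csdn.net/qq_18433441/article/details/56665931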
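A small sketch of shift in both directions on an invented balance column. Shifting by 1 puts the previous row's value next to each row, which makes day-over-day changes easy to compute.

```python
import pandas as pd

df = pd.DataFrame({'balance': [100, 120, 90, 150]})

df['previous'] = df['balance'].shift(1)    # value from the row above; first row is NaN
df['next'] = df['balance'].shift(-1)       # value from the row below; last row is NaN
df['change'] = df['balance'] - df['previous']

print(df)
```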

Using machine learning methods

I. DataFrame row and column operations

II. DataFrame file operations

III. DataFrame string operations

IV. DataFrame and NumPy

V. Python and machine learning

VI. Plotting in Python