Python Toolkit: Pandas
Contents
- Reading data
- The DataFrame structure
- The Series structure
- Basic statistics and the value_counts function
- pivot operations
- groupby operations
- Merge operations
- Example 1
- Example 2
- Sorting
- Handling missing values
- Custom functions with apply
- Working with time data
- Plotting
- Working with large datasets
Official API reference: https://pandas.pydata.org/docs/reference/index.html
A general feel for pandas is enough to start with; you can always consult the documentation when you need something specific.
import pandas as pd
Reading data
The API reference lists, among others, three functions for reading plain-text tables into a DataFrame:
read_table(filepath_or_buffer[, sep, …])
Read general delimited file into DataFrame.
read_csv(filepath_or_buffer[, sep, …])
Read a comma-separated values (csv) file into DataFrame.
read_fwf(filepath_or_buffer[, colspecs, …])
Read a table of fixed-width formatted lines into DataFrame.
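Before reading a file from disk, it can help to see how the `sep` parameter handles delimiters other than commas. A minimal sketch, using hypothetical inline data fed through `io.StringIO` so no file is needed:

```python
import io
import pandas as pd

# Hypothetical inline data: a tab-separated table read with an explicit sep.
tsv = "name\tage\nAlice\t30\nBob\t25\n"
frame = pd.read_csv(io.StringIO(tsv), sep="\t")
print(frame.shape)           # (2, 2)
print(list(frame.columns))   # ['name', 'age']
```

The same `sep` argument works when reading a real file path instead of a buffer.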
# Note: read_csv expects delimited text such as CSV. Use forward slashes in the path; a path
# copied straight from Windows Explorer uses backslashes, which must be escaped first.
df = pd.read_csv('./data/titanic.csv')
# Read the first 5 rows
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Read the last 5 rows
df.tail()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
The DataFrame structure
The object returned by the read call above, named df here, is a DataFrame, the core data structure in pandas.
# Print summary information about the data: number of samples, type and count of each column, overall memory usage, etc.
# Rows are data samples and columns are features; most read functions return a DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Other DataFrame attributes
# Return the index
df.index
RangeIndex(start=0, stop=891, step=1)
# Get the name of every feature (column)
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
# The dtype of each column; object is how pandas (and NumPy) represents strings
df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
# Get the underlying value matrix directly
df.values
array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
# Column indexing
age = df['Age']
# Get a single value
print("single:\n",age[0])
# Get the first few values
print("more:\n",age[:5])
# Pull out just the underlying values
age.values[:5]
single:
 22.0
more:
 0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

array([22., 38., 26., 35., 35.])
Usually the index is numeric, but any column can serve as the index, for example the Name column.
# Numeric index by default
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Set Name as the index; reset_index() restores the default
df = df.set_index('Name')
df.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Now ages can be looked up by name
age = df['Age']
print("Braund, Mr. Owen Harris age:",age['Braund, Mr. Owen Harris'])
print("Futrelle, Mrs. Jacques Heath (Lily May Peel) age:",age['Futrelle, Mrs. Jacques Heath (Lily May Peel)'])
Braund, Mr. Owen Harris age: 22.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) age: 35.0
# Restore the default index
df = df.reset_index()
# Select several columns at once
Clichong = df[['Name','Age','Fare']]
Clichong[:5]
Name | Age | Fare | |
---|---|---|---|
0 | Braund, Mr. Owen Harris | 22.0 | 7.2500 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 71.2833 |
2 | Heikkinen, Miss. Laina | 26.0 | 7.9250 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 53.1000 |
4 | Allen, Mr. William Henry | 35.0 | 8.0500 |
Rows can also be selected with the iloc and loc accessors. iloc first:
# iloc examples
# Get a single row
print("df.iloc[2]:\n",df.iloc[2])
# Get a slice of rows
print("\ndf.iloc[1:4]:\n",df.iloc[1:4])
# Rows and columns can be selected together
print("\ndf.iloc[1:4,1:3]:\n",df.iloc[1:4,1:3])
df.iloc[2]:
index                                2
Name            Heikkinen, Miss. Laina
PassengerId                          3
Survived                             1
Pclass                               3
Sex                             female
Age                                 26
SibSp                                0
Parch                                0
Ticket                STON/O2. 3101282
Fare                             7.925
Cabin                              NaN
Embarked                             S
Name: 2, dtype: object

df.iloc[1:4]:
   index                                               Name  PassengerId  \
1      1  Cumings, Mrs. John Bradley (Florence Briggs Th...            2
2      2                             Heikkinen, Miss. Laina            3
3      3       Futrelle, Mrs. Jacques Heath (Lily May Peel)            4

   Survived  Pclass     Sex   Age  SibSp  Parch            Ticket     Fare  \
1         1       1  female  38.0      1      0          PC 17599  71.2833
2         1       3  female  26.0      0      0  STON/O2. 3101282   7.9250
3         1       1  female  35.0      1      0            113803  53.1000

  Cabin Embarked
1   C85        C
2   NaN        S
3  C123        S

df.iloc[1:4,1:3]:
                                                Name  PassengerId
1  Cumings, Mrs. John Bradley (Florence Briggs Th...            2
2                             Heikkinen, Miss. Laina            3
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)            4
# loc examples
# Get all information for Heikkinen, Miss. Laina
df.loc[2]
Name Heikkinen, Miss. Laina
index 2
PassengerId 3
Survived 1
Pclass 3
Sex female
Age 26
SibSp 0
Parch 0
Ticket STON/O2. 3101282
Fare 7.925
Cabin NaN
Embarked S
Name: 2, dtype: object
# Get the sex of Heikkinen, Miss. Laina
df.loc[2,'Sex']
'female'
# Slice several rows, all columns
df.loc[0:2]
Name | index | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | Heikkinen, Miss. Laina | 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
# Slice several rows, a single column
df.loc[0:2,'Name']
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
Name: Name, dtype: object
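The difference between the two accessors is easy to miss: loc is label-based while iloc is position-based, and their slicing rules differ. A small sketch on a toy frame (variable names here are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
print(toy.loc['b', 'x'])        # 20 -- loc selects by label
print(toy.iloc[1, 0])           # 20 -- iloc selects by integer position
# loc slices include the end label; iloc slices exclude the end position
print(len(toy.loc['a':'b']))    # 2
print(len(toy.iloc[0:1]))       # 1
```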
# Set the index again to filter out all male passengers
df = df.set_index('Name') # setting the same index twice raises an error
df['Sex'] == 'male'
Name
Braund, Mr. Owen Harris True
Cumings, Mrs. John Bradley (Florence Briggs Thayer) False
Heikkinen, Miss. Laina False
Futrelle, Mrs. Jacques Heath (Lily May Peel) False
Allen, Mr. William Henry                                True
                                                       ...
Montvila, Rev. Juozas True
Graham, Miss. Margaret Edith False
Johnston, Miss. Catherine Helen "Carrie" False
Behr, Mr. Karl Howell True
Dooley, Mr. Patrick True
Name: Sex, Length: 891, dtype: bool
# The first five male passengers
df[df['Sex'] == 'male'][0:5]
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Moran, Mr. James | 6 | 0 | 3 | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
McCarthy, Mr. Timothy J | 7 | 0 | 1 | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
Palsson, Master. Gosta Leonard | 8 | 0 | 3 | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
# Compute the mean age of all male passengers
df.loc[df['Sex'] == 'male','Age'].mean()
30.72664459161148
df['Sex'] == 'male' : builds a boolean mask selecting male passengers
df.loc[df['Sex'] == 'male','Age'] : selects the ages of all male passengers
df.loc[df['Sex'] == 'male','Age'].mean() : computes their mean age
# Passengers aged 70 or older
df[df['Age'] >= 70]
Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
96 | Goldschmidt, Mr. George B | 97 | 0 | 1 | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
116 | Connors, Mr. Patrick | 117 | 0 | 3 | male | 70.5 | 0 | 0 | 370369 | 7.7500 | NaN | Q |
493 | Artagaveytia, Mr. Ramon | 494 | 0 | 1 | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C |
630 | Barkworth, Mr. Algernon Henry Wilson | 631 | 1 | 1 | male | 80.0 | 0 | 0 | 27042 | 30.0000 | A23 | S |
672 | Mitchell, Mr. Henry Michael | 673 | 0 | 2 | male | 70.0 | 0 | 0 | C.A. 24580 | 10.5000 | NaN | S |
745 | Crosby, Capt. Edward Gifford | 746 | 0 | 1 | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0000 | B22 | S |
851 | Svensson, Mr. Johan | 852 | 0 | 3 | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S |
# Count the passengers aged 70 or older
(df['Age'] >= 70).sum()
7
# Sum the ages of the passengers aged 70 or older
df.loc[df['Age'] >= 70,'Age'].sum()
506.5
Creating a DataFrame
data = {'country':['China','America','India'],'population':[14,5,25]}
df = pd.DataFrame(data)
df
country | population | |
---|---|---|
0 | China | 14 |
1 | America | 5 |
2 | India | 25 |
df[df['population'] >= 14]
country | population | |
---|---|---|
0 | China | 14 |
2 | India | 25 |
Options: set_option
# Set the maximum number of rows and columns to display
# pd.set_option('display.max_columns',20)
pd.set_option('display.max_rows',6)
df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
# Check the current display limits
print("pd.get_option('display.max_rows'):",pd.get_option('display.max_rows'))
print("pd.get_option('display.max_columns'):",pd.get_option('display.max_columns'))
pd.get_option('display.max_rows'): 6
pd.get_option('display.max_columns'): 20
The Series structure
Data read from a file is two-dimensional, i.e. a DataFrame; taking a single column out of it yields a Series.
In other words, a DataFrame is a collection of Series sharing one index, one level up in the hierarchy.
# Create a Series
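The relationship described above can be checked directly, a minimal sketch (names are illustrative): pulling a column out of a DataFrame gives a Series, and a DataFrame can be reassembled from Series sharing an index.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
col = frame['a']
print(type(col).__name__)      # Series -- each column is a Series
# Conversely, a DataFrame can be rebuilt from Series sharing an index
rebuilt = pd.DataFrame({'a': col, 'b': frame['b']})
print(rebuilt.equals(frame))   # True
```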
data = [1,2,3]
index = ['a','b','c']
Clichong = pd.Series(data = data,index = index)
Clichong
a 1
b 2
c 3
dtype: int64
Clichong.loc['b']
2
Clichong.iloc[1]
2
Clichong_cp = Clichong.copy()
Clichong_cp['b']
2
# With inplace=False (the default) the method returns a new object and leaves the original untouched;
# with inplace=True the data is modified in place
Clichong.replace(to_replace=1,value=100,inplace=True)
Clichong
a 100
b 2
c 3
dtype: int64
Clichong.index
Index(['a', 'b', 'c'], dtype='object')
# Rename index labels
Clichong.rename(index={'a':'A'},inplace=True)
Clichong.index
Index(['A', 'b', 'c'], dtype='object')
Two ways to add data are shown below.
# Method 1: assign to a new label directly
Clichong['d'] = 4
Clichong
A 100
b 2
c 3
d 4
dtype: int64
# Method 2: append another Series; ignore_index=True renumbers the index
# (Series.append was removed in pandas 2.0; use pd.concat there)
data = [5,6]
index = ['e','f']
Clichong2 = pd.Series(data=data,index=index)
Clichong = Clichong.append(Clichong2,ignore_index=True)
Clichong
0    100
1      2
2      3
      ...
5      6
6      5
7      6
Length: 8, dtype: int64
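Since Series.append was deprecated in pandas 1.4 and removed in 2.0, the same result is obtained with pd.concat; a short sketch (variable names are illustrative):

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['c', 'd'])
combined = pd.concat([s1, s2])                       # keeps the original labels
renumbered = pd.concat([s1, s2], ignore_index=True)  # renumbers 0..n-1, like append(..., ignore_index=True)
print(list(combined.index))     # ['a', 'b', 'c', 'd']
print(list(renumbered.index))   # [0, 1, 2, 3]
```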
Deleting data: del and drop
# Method 1: delete with del
del Clichong[1]
Clichong
2 3
3 4
4 5
5 6
6 5
7 6
dtype: int64
# Method 2: delete with drop
Clichong.drop([7],inplace=True)
Clichong
4 5
5 6
6 5
dtype: int64
Basic statistics and the value_counts function
# The usual statistics all work the same way: max, min, median, sum, etc.
df['Age'].mean()
29.69911764705882
# Summary statistics for each numeric column
pd.set_option('display.max_rows',8) # set the max display rows to 8
df.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
# Covariance matrix (numeric columns; recent pandas needs numeric_only=True here)
df.cov()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
PassengerId | 66231.000000 | -0.626966 | -7.561798 | 138.696504 | -16.325843 | -0.342697 | 161.883369 |
Survived | -0.626966 | 0.236772 | -0.137703 | -0.551296 | -0.018954 | 0.032017 | 6.221787 |
Pclass | -7.561798 | -0.137703 | 0.699015 | -4.496004 | 0.076599 | 0.012429 | -22.830196 |
Age | 138.696504 | -0.551296 | -4.496004 | 211.019125 | -4.163334 | -2.344191 | 73.849030 |
SibSp | -16.325843 | -0.018954 | 0.076599 | -4.163334 | 1.216043 | 0.368739 | 8.748734 |
Parch | -0.342697 | 0.032017 | 0.012429 | -2.344191 | 0.368739 | 0.649728 | 8.661052 |
Fare | 161.883369 | 6.221787 | -22.830196 | 73.849030 | 8.748734 | 8.661052 | 2469.436846 |
# Correlation matrix
df.corr()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
PassengerId | 1.000000 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
Survived | -0.005007 | 1.000000 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
Pclass | -0.035144 | -0.338481 | 1.000000 | -0.369226 | 0.083081 | 0.018443 | -0.549500 |
Age | 0.036847 | -0.077221 | -0.369226 | 1.000000 | -0.308247 | -0.189119 | 0.096067 |
SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.000000 | 0.414838 | 0.159651 |
Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.000000 | 0.216225 |
Fare | 0.012658 | 0.257307 | -0.549500 | 0.096067 | 0.159651 | 0.216225 | 1.000000 |
# Count how often each value of a column occurs
df['Sex'].value_counts() # most frequent first by default
# ascending=True puts the least frequent first
df['Sex'].value_counts(ascending = True)
female 314
male 577
Name: Sex, dtype: int64
This works well for a column with a few discrete values, but not for a continuous one; there, define intervals and discretize the values first.
# Split the ages into groups; bins gives the number of groups
df['Age'].value_counts(ascending = True,bins = 5)
(64.084, 80.0] 11
(48.168, 64.084] 69
(0.339, 16.336] 100
(32.252, 48.168] 188
(16.336, 32.252] 346
Name: Age, dtype: int64
The bins parameter of value_counts performs this grouping; the cut function also does binning, where a list such as [10,30,50,80] defines custom bin edges and the labels parameter names the groups
data = [1,2,3,4,5,6,7,8,9,0]
bins = [0,3,7,10]
Clichong = pd.cut(data,bins)
Clichong
[(0.0, 3.0], (0.0, 3.0], (0.0, 3.0], (3.0, 7.0], (3.0, 7.0], (3.0, 7.0], (3.0, 7.0], (7.0, 10.0], (7.0, 10.0], NaN]
Categories (3, interval[int64]): [(0, 3] < (3, 7] < (7, 10]]
pd.value_counts(Clichong)
(3, 7] 4
(0, 3] 3
(7, 10] 2
dtype: int64
# Custom bin edges for a column, combined with value_counts; make sure you understand what each piece does
pd.value_counts(pd.cut(df['Age'],[10,30,50,80]))
(10, 30] 345
(30, 50] 241
(50, 80] 64
Name: Age, dtype: int64
# With the labels parameter, each value is tagged with its group name
group_name = ['Young','Mille','Old']
pd.cut(df['Age'],[10,30,50,80],labels = group_name)
0 Young
1 Mille
2 Young
3      Mille
       ...
887 Young
888 NaN
889 Young
890 Mille
Name: Age, Length: 891, dtype: category
Categories (3, object): [Young < Mille < Old]
# The labelled bins can then be counted with value_counts
group_name = ['Young','Mille','Old']
pd.value_counts(pd.cut(df['Age'],[10,30,50,80],labels = group_name))
Young 345
Mille 241
Old 64
Name: Age, dtype: int64
pivot operations
example = pd.DataFrame({
    'Month': ["January", "January", "January", "January",
              "February", "February", "February", "February",
              "March", "March", "March", "March"],
    'Category': ["Transportation", "Grocery", "Household", "Entertainment",
                 "Transportation", "Grocery", "Household", "Entertainment",
                 "Transportation", "Grocery", "Household", "Entertainment"],
    'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]})
example
Month | Category | Amount | |
---|---|---|---|
0 | January | Transportation | 74.0 |
1 | January | Grocery | 235.0 |
2 | January | Household | 175.0 |
3 | January | Entertainment | 100.0 |
... | ... | ... | ... |
8 | March | Transportation | 90.0 |
9 | March | Grocery | 260.0 |
10 | March | Household | 200.0 |
11 | March | Entertainment | 120.0 |
12 rows × 3 columns
# pivot builds a pivot table, letting you reorganize the data any way you like
example_pivot = example.pivot(index = 'Category',columns= 'Month',values = 'Amount')
example_pivot
Month | February | January | March |
---|---|---|---|
Category | |||
Entertainment | 125.0 | 100.0 | 120.0 |
Grocery | 240.0 | 235.0 | 260.0 |
Household | 225.0 | 175.0 | 200.0 |
Transportation | 115.0 | 74.0 | 90.0 |
Month is the month being summarized and Amount the actual spending; the table shows how much was spent on each category in each month.
# Row totals
example_pivot.sum(axis=1)
Category
Entertainment 345.0
Grocery 735.0
Household 600.0
Transportation 279.0
dtype: float64
# Column totals
example_pivot.sum(axis=0)
Month
February 705.0
January 584.0
March 670.0
dtype: float64
df.pivot_table(index='Sex',columns='Pclass',values='Fare')
Pclass | 1 | 2 | 3 |
---|---|---|---|
Sex | |||
female | 106.125798 | 21.970121 | 16.118810 |
male | 67.226127 | 19.741782 | 12.661633 |
Here Pclass is the cabin class and Fare the ticket price.
index sets the attribute to group rows by, columns the attribute spread across columns, and values the quantity being aggregated; the default aggregation is the mean.
To use the maximum or minimum instead, set the aggfunc parameter, which chooses the aggregation function.
# Take the maximum instead of the mean
df.pivot_table(index='Sex',columns='Pclass',values='Fare',aggfunc='max') # or min
Pclass | 1 | 2 | 3 |
---|---|---|---|
Sex | |||
female | 512.3292 | 65.0 | 69.55 |
male | 512.3292 | 73.5 | 69.55 |
# Count the passengers in each cabin class
df.pivot_table(index='Sex',columns='Pclass',values='Fare',aggfunc='count')
Pclass | 1 | 2 | 3 |
---|---|---|---|
Sex | |||
female | 94 | 76 | 144 |
male | 122 | 108 | 347 |
# Total per class, regardless of sex
(df.pivot_table(index='Sex',columns='Pclass',values='Fare',aggfunc='count')).sum(axis=0)
Pclass
1 216
2 184
3 491
dtype: int64
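The manual `sum(axis=0)` above can also be delegated to pivot_table itself via the `margins` parameter, which appends "All" totals. A minimal sketch on synthetic data (the small frame here stands in for the Titanic columns):

```python
import pandas as pd

demo = pd.DataFrame({'Sex': ['f', 'f', 'm', 'm'],
                     'Pclass': [1, 2, 1, 2],
                     'Fare': [100.0, 20.0, 60.0, 10.0]})
# margins=True adds an 'All' row and column holding the totals
counts = demo.pivot_table(index='Sex', columns='Pclass', values='Fare',
                          aggfunc='count', margins=True)
print(counts.loc['All', 'All'])   # 4 -- the overall count
```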
# Split passengers into adults and minors, then compute the mean survival rate
# for each sex within each group ('mean' is also the default aggfunc)
df['Underaged'] = df['Age'] <= 18
df.pivot_table(index = 'Underaged',columns='Sex',values='Survived',aggfunc='mean')
Sex | female | male |
---|---|---|
Underaged | ||
False | 0.760163 | 0.167984 |
True | 0.676471 | 0.338028 |
groupby operations
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],'data':[0,5,10,5,10,15,10,15,20]})
df
key | data | |
---|---|---|
0 | A | 0 |
1 | B | 5 |
2 | C | 10 |
3 | A | 5 |
... | ... | ... |
5 | C | 15 |
6 | A | 10 |
7 | B | 15 |
8 | C | 20 |
9 rows × 2 columns
# The usual loop for per-key totals
for key in ['A','B','C']:
    print(key,df[df['key'] == key].sum())
A key     AAA
data       15
dtype: object
B key     BBB
data       30
dtype: object
C key     CCC
data       45
dtype: object
# groupby replaces the loop above; here aggregating with sum
df.groupby('key').sum()
data | |
---|---|
key | |
A | 15 |
B | 30 |
C | 45 |
# Any aggregation can be used, e.g. the mean
import numpy as np
df.groupby('key').aggregate(np.mean)
data | |
---|---|
key | |
A | 5 |
B | 10 |
C | 15 |
# Mean age per sex (this uses the Titanic data; re-read titanic.csv if df was overwritten above)
df.groupby('Sex')['Age'].mean()
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
# Survival rate per sex
df.groupby('Sex')['Survived'].mean()
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
Some other groupby operations
# Sample data
import numpy as np
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 0.611947 | 1.221803 |
1 | bar | one | 0.688298 | 0.077745 |
2 | foo | two | -0.392495 | -0.856792 |
3 | bar | three | 0.094306 | 1.281799 |
4 | foo | two | -0.524519 | -1.288789 |
5 | bar | two | -1.507810 | 1.012624 |
6 | foo | one | -0.061693 | -1.741863 |
7 | foo | three | -0.745409 | 0.332852 |
# Aggregate with the mean and count methods
print("mean:\n",df.groupby('A').mean())
print("\ncount1:\n",df.groupby('A').count())
print("\ncount2:\n",df.groupby(['A','B']).count())
mean:
             C         D
A
bar  -0.241736  0.790723
foo  -0.222434 -0.466558

count1:
     B  C  D
A
bar  3  3  3
foo  5  5  5

count2:
           C  D
A   B
bar one    1  1
    three  1  1
    two    1  1
foo one    2  2
    three  1  1
    two    2  2
# Combine with numpy to choose the aggregation method
# df.groupby(['A','B']).aggregate(np.sum)
grouped = df.groupby(['A','B'])
grouped.aggregate(np.sum)
C | D | ||
---|---|---|---|
A | B | ||
bar | one | 0.688298 | 0.077745 |
three | 0.094306 | 1.281799 | |
two | -1.507810 | 1.012624 | |
foo | one | 0.550255 | -0.520061 |
three | -0.745409 | 0.332852 | |
two | -0.917015 | -2.145581 |
# With as_index=False the group keys stay as ordinary columns instead of becoming the index
# df.groupby(['A','B'],as_index = False).aggregate(np.sum)
grouped = df.groupby(['A','B'],as_index = False)
grouped.aggregate(np.sum)
A | B | C | D | |
---|---|---|---|---|
0 | bar | one | 0.688298 | 0.077745 |
1 | bar | three | 0.094306 | 1.281799 |
2 | bar | two | -1.507810 | 1.012624 |
3 | foo | one | 0.550255 | -0.520061 |
4 | foo | three | -0.745409 | 0.332852 |
5 | foo | two | -0.917015 | -2.145581 |
# describe shows all the summary statistics per group
grouped.describe().head()
C | D | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
0 | 1.0 | 0.688298 | NaN | 0.688298 | 0.688298 | 0.688298 | 0.688298 | 0.688298 | 1.0 | 0.077745 | NaN | 0.077745 | 0.077745 | 0.077745 | 0.077745 | 0.077745 |
1 | 1.0 | 0.094306 | NaN | 0.094306 | 0.094306 | 0.094306 | 0.094306 | 0.094306 | 1.0 | 1.281799 | NaN | 1.281799 | 1.281799 | 1.281799 | 1.281799 | 1.281799 |
2 | 1.0 | -1.507810 | NaN | -1.507810 | -1.507810 | -1.507810 | -1.507810 | -1.507810 | 1.0 | 1.012624 | NaN | 1.012624 | 1.012624 | 1.012624 | 1.012624 | 1.012624 |
3 | 2.0 | 0.275127 | 0.476335 | -0.061693 | 0.106717 | 0.275127 | 0.443537 | 0.611947 | 2.0 | -0.260030 | 2.095628 | -1.741863 | -1.000947 | -0.260030 | 0.480886 | 1.221803 |
4 | 1.0 | -0.745409 | NaN | -0.745409 | -0.745409 | -0.745409 | -0.745409 | -0.745409 | 1.0 | 0.332852 | NaN | 0.332852 | 0.332852 | 0.332852 | 0.332852 | 0.332852 |
# Pick exactly the statistics you need
grouped = df.groupby('A')
grouped['C'].agg([np.sum,np.mean,np.std])
sum | mean | std | |
---|---|---|---|
A | |||
bar | -0.725207 | -0.241736 | 1.135965 |
foo | -1.112169 | -0.222434 | 0.528136 |
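A more recent idiom for the same task is named aggregation (available since pandas 0.25), which lets you name the output columns directly. A brief sketch, with illustrative column names:

```python
import pandas as pd

frame = pd.DataFrame({'A': ['x', 'x', 'y'], 'C': [1.0, 3.0, 5.0]})
# Each keyword names an output column as (source column, aggregation)
out = frame.groupby('A').agg(c_sum=('C', 'sum'), c_mean=('C', 'mean'))
print(out.loc['x', 'c_sum'])    # 4.0
print(out.loc['y', 'c_mean'])   # 5.0
```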
Merge operations
Example 1
left = pd.DataFrame({'key':['K0','K1','K2','K3'],
                     'A':['A0','A1','A2','A3'],
                     'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K2','K3'],
                      'C':['C0','C1','C2','C3'],
                      'D':['D0','D1','D2','D3']})
left
key | A | B | |
---|---|---|---|
0 | K0 | A0 | B0 |
1 | K1 | A1 | B1 |
2 | K2 | A2 | B2 |
3 | K3 | A3 | B3 |
right
key | C | D | |
---|---|---|---|
0 | K0 | C0 | D0 |
1 | K1 | C1 | D1 |
2 | K2 | C2 | D2 |
3 | K3 | C3 | D3 |
pd.merge(left,right)
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | A3 | B3 | C3 | D3 |
Example 2
left = pd.DataFrame({'key1':['K0','K1','K2','K3'],
                     'key2':['K0','K1','K2','K3'],
                     'A':['A0','A1','A2','A3'],
                     'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key1':['K0','K1','K2','K3'],
                      'key2':['K0','K1','K2','K4'],
                      'C':['C0','C1','C2','C3'],
                      'D':['D0','D1','D2','D3']})
left
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K1 | K1 | A1 | B1 |
2 | K2 | K2 | A2 | B2 |
3 | K3 | K3 | A3 | B3 |
right
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K1 | C1 | D1 |
2 | K2 | K2 | C2 | D2 |
3 | K3 | K4 | C3 | D3 |
# Note that some rows are dropped: the default is an inner join on the shared keys
pd.merge(left,right,on=['key1','key2'])
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | K2 | A2 | B2 | C2 | D2 |
# how='outer' keeps every row from both sides
pd.merge(left,right,on=['key1','key2'],how='outer')
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | K3 | A3 | B3 | NaN | NaN |
4 | K3 | K4 | NaN | NaN | C3 | D3 |
# indicator=True adds a _merge column describing each row's origin
pd.merge(left,right,on=['key1','key2'],how='outer',indicator = True)
key1 | key2 | A | B | C | D | _merge | |
---|---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
1 | K1 | K1 | A1 | B1 | C1 | D1 | both |
2 | K2 | K2 | A2 | B2 | C2 | D2 | both |
3 | K3 | K3 | A3 | B3 | NaN | NaN | left_only |
4 | K3 | K4 | NaN | NaN | C3 | D3 | right_only |
# Or keep only one side's rows, e.g. the left
pd.merge(left,right,on=['key1','key2'],how='left')
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | K3 | A3 | B3 | NaN | NaN |
# Or only the right
pd.merge(left,right,on=['key1','key2'],how='right')
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | K4 | NaN | NaN | C3 | D3 |
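One more merge detail worth knowing: when the two frames share non-key column names, pandas disambiguates them with suffixes, which you can set yourself. A small sketch (frame contents are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['K0', 'K1'], 'val': [10, 20]})
# Overlapping non-key columns get suffixes instead of colliding
merged = pd.merge(left, right, on='key', suffixes=('_l', '_r'))
print(list(merged.columns))   # ['key', 'val_l', 'val_r']
```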
Sorting
data = pd.DataFrame({'group':['a','a','a','b','b','b','c','c','c'],'data':[4,3,2,1,12,3,4,5,7]})
data
group | data | |
---|---|---|
0 | a | 4 |
1 | a | 3 |
2 | a | 2 |
3 | b | 1 |
... | ... | ... |
5 | b | 3 |
6 | c | 4 |
7 | c | 5 |
8 | c | 7 |
9 rows × 2 columns
First sort the group column in descending order, then sort data in ascending order within each group; by selects the columns to sort on, and ascending sets the direction per column: False for descending, True for ascending.
data.sort_values(by=['group','data'],ascending = [False,True],inplace=True)
data
group | data | |
---|---|---|
6 | c | 4 |
7 | c | 5 |
8 | c | 7 |
3 | b | 1 |
... | ... | ... |
4 | b | 12 |
2 | a | 2 |
1 | a | 3 |
0 | a | 4 |
9 rows × 2 columns
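After a sort like the one above, the original row order can be recovered with sort_index, since the old index labels travel with the rows. A minimal sketch (the small frame here is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'group': ['c', 'a', 'b'], 'data': [3, 1, 2]})
shuffled = frame.sort_values(by='data')   # row order becomes 1, 2, 0
restored = shuffled.sort_index()          # back to the original order
print(list(shuffled.index))   # [1, 2, 0]
print(list(restored.index))   # [0, 1, 2]
```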
Handling missing values
data = pd.DataFrame({'k1':['one']*3+['two']*4,'k2':[3,2,1,3,3,4,4]})
data
k1 | k2 | |
---|---|---|
0 | one | 3 |
1 | one | 2 |
2 | one | 1 |
3 | two | 3 |
4 | two | 3 |
5 | two | 4 |
6 | two | 4 |
# Sort: descending
data.sort_values(by='k2',ascending = False)
k1 | k2 | |
---|---|---|
5 | two | 4 |
6 | two | 4 |
0 | one | 3 |
3 | two | 3 |
4 | two | 3 |
1 | one | 2 |
2 | one | 1 |
# Drop duplicate rows
data.drop_duplicates()
k1 | k2 | |
---|---|---|
0 | one | 3 |
1 | one | 2 |
2 | one | 1 |
3 | two | 3 |
5 | two | 4 |
# Consider only one column when checking for duplicates; the first occurrence per k1 is kept
data.drop_duplicates(subset='k1')
k1 | k2 | |
---|---|---|
0 | one | 3 |
3 | two | 3 |
# Add a new column with the assign function
df = pd.DataFrame({'data1':np.random.randn(5),'data2':np.random.randn(5)})
df2 = df.assign(ration = df['data1']/df['data2'])
df2
data1 | data2 | ration | |
---|---|---|---|
0 | 1.046754 | -0.305594 | -3.425314 |
1 | 1.060983 | -0.751471 | -1.411874 |
2 | -0.633877 | -1.355632 | 0.467588 |
3 | -1.036392 | 1.615449 | -0.641551 |
4 | 0.481765 | -0.981919 | -0.490636 |
# Detect missing values with isnull
df = pd.DataFrame([range(3),[0, np.nan,0],[0,0,np.nan],range(3)])
df.isnull()
0 | 1 | 2 | |
---|---|---|---|
0 | False | False | False |
1 | False | True | False |
2 | False | False | True |
3 | False | False | False |
# With a lot of data, check per column or per row whether any value is missing
# Per column
df.isnull().any()
0 False
1 True
2 True
dtype: bool
# Per row
df.isnull().any(axis = 1)
0 False
1 True
2 True
3 False
dtype: bool
# Fill in the missing values
df.fillna(100)
0 | 1 | 2 | |
---|---|---|---|
0 | 0 | 1.0 | 2.0 |
1 | 0 | 100.0 | 0.0 |
2 | 0 | 0.0 | 100.0 |
3 | 0 | 1.0 | 2.0 |
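Beyond filling with a constant, two other common patterns are filling each column with its own mean and dropping incomplete rows. A short sketch under illustrative data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
filled = frame.fillna(frame.mean())   # fill each column with that column's mean
print(filled['a'].tolist())   # [1.0, 2.0, 3.0]
print(len(frame.dropna()))    # 2 -- or drop rows containing any missing value
```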
Custom functions with apply
Write the function first, then pass it to apply; it is then applied to every row (or column) of the data.
data = pd.DataFrame({'food':['A1','A2','B1','B2','B3','C1','C2'],'data':[1,2,3,4,5,6,7]})
data
food | data | |
---|---|---|
0 | A1 | 1 |
1 | A2 | 2 |
2 | B1 | 3 |
3 | B2 | 4 |
4 | B3 | 5 |
5 | C1 | 6 |
6 | C2 | 7 |
# Use apply to map values row by row
def food_map(series):
    if series['food'] == 'A1':
        return 'A'
    elif series['food'] == 'A2':
        return 'A'
    elif series['food'] == 'B1':
        return 'B'
    elif series['food'] == 'B2':
        return 'B'
    elif series['food'] == 'B3':
        return 'B'
    elif series['food'] == 'C1':
        return 'C'
    elif series['food'] == 'C2':
        return 'C'
data['food_map'] = data.apply(food_map,axis = 'columns')
data
food | data | food_map | |
---|---|---|---|
0 | A1 | 1 | A |
1 | A2 | 2 | A |
2 | B1 | 3 | B |
3 | B2 | 4 | B |
4 | B3 | 5 | B |
5 | C1 | 6 | C |
6 | C2 | 7 | C |
# Use map with a dict to do the same kind of mapping
food2Upper = {'A1':'a1','A2':'a2','B1':'b1','B2':'b2','B3':'b3','C1':'c1','C2':'c2'}
data['upper'] = data['food'].map(food2Upper)
data
food | data | food_map | upper | |
---|---|---|---|---|
0 | A1 | 1 | A | a1 |
1 | A2 | 2 | A | a2 |
2 | B1 | 3 | B | b1 |
3 | B2 | 4 | B | b2 |
4 | B3 | 5 | B | b3 |
5 | C1 | 6 | C | c1 |
6 | C2 | 7 | C | c2 |
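map also accepts a plain function, which can replace the long if/elif chain above; here the letter prefix of each food code reproduces food_map's behavior (this shorthand assumes the group is always the first character, as it is in this data):

```python
import pandas as pd

data = pd.DataFrame({'food': ['A1', 'A2', 'B1', 'C2']})
# Taking the letter prefix reproduces the if/elif chain in food_map
data['food_map'] = data['food'].map(lambda s: s[0])
print(data['food_map'].tolist())   # ['A', 'A', 'B', 'C']
```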
Working with time data
# Create a timestamp
ts = pd.Timestamp('2021-1-21')
ts
Timestamp('2021-01-21 00:00:00')
# Print the month
print("ts.month:",ts.month)
print("ts.month_name",ts.month_name())
ts.month: 1
ts.month_name January
# Print the day
print("ts.day:",ts.day)
print("ts.day_name",ts.day_name())
ts.day: 21
ts.day_name Thursday
s = pd.Series(['2017-11-24 00:00:00','2017-11-25 00:00:00','2017-11-26 00:00:00'])
s
0 2017-11-24 00:00:00
1 2017-11-25 00:00:00
2 2017-11-26 00:00:00
dtype: object
# Convert to the standard datetime format
ts = pd.to_datetime(s)
ts
0 2017-11-24
1 2017-11-25
2 2017-11-26
dtype: datetime64[ns]
ts.dt.hour
0 0
1 0
2 0
dtype: int64
ts.dt.weekday
0 4
1 5
2 6
dtype: int64
# Build a time feature of your own: a fixed-frequency range of timestamps
pd.Series(pd.date_range(start='2017-11-24',periods = 10,freq = '12H'))
0 2017-11-24 00:00:00
1 2017-11-24 12:00:00
2 2017-11-25 00:00:00
3   2017-11-25 12:00:00
            ...
6 2017-11-27 00:00:00
7 2017-11-27 12:00:00
8 2017-11-28 00:00:00
9 2017-11-28 12:00:00
Length: 10, dtype: datetime64[ns]
# To use the time column as the index, set parse_dates=True (index_col picks the column)
df = pd.read_csv('./data/flowdata.csv',index_col=0,parse_dates=True)
df
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2009-01-01 00:00:00 | 0.137417 | 0.097500 | 0.016833 |
2009-01-01 03:00:00 | 0.131250 | 0.088833 | 0.016417 |
2009-01-01 06:00:00 | 0.113500 | 0.091250 | 0.016750 |
2009-01-01 09:00:00 | 0.135750 | 0.091500 | 0.016250 |
... | ... | ... | ... |
2013-01-01 15:00:00 | 1.420000 | 1.420000 | 0.096333 |
2013-01-01 18:00:00 | 1.178583 | 1.178583 | 0.083083 |
2013-01-01 21:00:00 | 0.898250 | 0.898250 | 0.077167 |
2013-01-02 00:00:00 | 0.860000 | 0.860000 | 0.075000 |
11697 rows × 3 columns
# With a datetime index the data can be sliced by time
df[pd.Timestamp('2012-01-01 09:00'):pd.Timestamp('2012-01-11 19:00')]
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2012-01-01 09:00:00 | 0.330750 | 0.293583 | 0.029750 |
2012-01-01 12:00:00 | 0.295000 | 0.285167 | 0.031750 |
2012-01-01 15:00:00 | 0.301417 | 0.287750 | 0.031417 |
2012-01-01 18:00:00 | 0.322083 | 0.304167 | 0.038083 |
... | ... | ... | ... |
2012-01-11 09:00:00 | 0.190833 | 0.208833 | 0.022000 |
2012-01-11 12:00:00 | 0.195417 | 0.206167 | 0.022750 |
2012-01-11 15:00:00 | 0.182083 | 0.204083 | 0.022417 |
2012-01-11 18:00:00 | 0.170583 | 0.202750 | 0.022083 |
84 rows × 3 columns
# Select the year 2012 (this __getitem__ form is removed in recent pandas; use df.loc['2012'] there)
df['2012']
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2012-01-01 00:00:00 | 0.307167 | 0.273917 | 0.028000 |
2012-01-01 03:00:00 | 0.302917 | 0.270833 | 0.030583 |
2012-01-01 06:00:00 | 0.331500 | 0.284750 | 0.030917 |
2012-01-01 09:00:00 | 0.330750 | 0.293583 | 0.029750 |
... | ... | ... | ... |
2012-12-31 12:00:00 | 0.651250 | 0.651250 | 0.063833 |
2012-12-31 15:00:00 | 0.629000 | 0.629000 | 0.061833 |
2012-12-31 18:00:00 | 0.617333 | 0.617333 | 0.060583 |
2012-12-31 21:00:00 | 0.846500 | 0.846500 | 0.170167 |
2928 rows × 3 columns
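Partial-string indexing like this can be checked on a small synthetic series (the values below are illustrative, not the flowdata.csv readings). Note that `.loc` is used for the year slice, since plain `df['2012']` row selection has been removed in recent pandas versions:

```python
import numpy as np
import pandas as pd

# 3-hourly synthetic readings spanning the 2011/2012 boundary
idx = pd.date_range('2011-12-30', '2012-01-03', freq='3h')
ts = pd.DataFrame({'flow': np.arange(len(idx), dtype=float)}, index=idx)

# Partial-string indexing: the string '2012' matches every timestamp in that year
year_2012 = ts.loc['2012']
print(year_2012.index.min())  # first timestamp that falls in 2012
```

Any prefix of an ISO timestamp works the same way, e.g. `ts.loc['2012-01']` for a month or `ts.loc['2012-01-02']` for a single day.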
# Slice a specific month
df['2013-01':'2013-01']
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2013-01-01 00:00:00 | 1.688333 | 1.688333 | 0.207333 |
2013-01-01 03:00:00 | 2.693333 | 2.693333 | 0.201500 |
2013-01-01 06:00:00 | 2.220833 | 2.220833 | 0.166917 |
2013-01-01 09:00:00 | 2.055000 | 2.055000 | 0.175667 |
... | ... | ... | ... |
2013-01-01 15:00:00 | 1.420000 | 1.420000 | 0.096333 |
2013-01-01 18:00:00 | 1.178583 | 1.178583 | 0.083083 |
2013-01-01 21:00:00 | 0.898250 | 0.898250 | 0.077167 |
2013-01-02 00:00:00 | 0.860000 | 0.860000 | 0.075000 |
9 rows × 3 columns
# Finer-grained filtering using components of the index
df[(df.index.hour > 8) & (df.index.hour < 12) & (df.index.day == 1) & (df.index.month == 1)]
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2009-01-01 09:00:00 | 0.135750 | 0.091500 | 0.016250 |
2010-01-01 09:00:00 | 0.448167 | 0.524583 | 0.052000 |
2011-01-01 09:00:00 | 0.493167 | 0.502667 | NaN |
2012-01-01 09:00:00 | 0.330750 | 0.293583 | 0.029750 |
2013-01-01 09:00:00 | 2.055000 | 2.055000 | 0.175667 |
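The same attribute-based filtering can be sketched on a small synthetic frame (invented values, not the flowdata readings), which makes it easy to confirm which hours survive the mask:

```python
import numpy as np
import pandas as pd

# Hourly synthetic readings covering the first two days of January 2012
idx = pd.date_range('2012-01-01', periods=48, freq='h')
df = pd.DataFrame({'flow': np.ones(len(idx))}, index=idx)

# Keep rows between 09:00 and 11:00 on January 1st, as in the example above
mask = (df.index.hour > 8) & (df.index.hour < 12) & \
       (df.index.day == 1) & (df.index.month == 1)
subset = df[mask]
print(subset.index.tolist())  # the 09:00, 10:00 and 11:00 rows of Jan 1
```

Each `df.index.<attr>` expression yields a boolean array, so the conditions combine with `&` just like ordinary column filters.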
The resample function
# resample('D') regroups the 3-hourly readings into daily bins, mean() averages each bin, and head() shows the first five days
df.resample('D').mean().head()
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2009-01-01 | 0.125010 | 0.092281 | 0.016635 |
2009-01-02 | 0.124146 | 0.095781 | 0.016406 |
2009-01-03 | 0.113562 | 0.085542 | 0.016094 |
2009-01-04 | 0.140198 | 0.102708 | 0.017323 |
2009-01-05 | 0.128812 | 0.104490 | 0.018167 |
# Resample with a 10-day period instead
df.resample('10D').mean().head()
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2009-01-01 | 0.113854 | 0.085479 | 0.015616 |
2009-01-11 | 0.411610 | 0.453808 | 0.042764 |
2009-01-21 | 1.058358 | 1.107352 | 0.080539 |
2009-01-31 | 0.320679 | 0.295744 | 0.038614 |
2009-02-10 | 0.906675 | 0.988781 | 0.071100 |
# Aggregate by month ('M' labels each bin with the month end)
df.resample('M').mean().head()
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2009-01-31 | 0.517864 | 0.536660 | 0.045597 |
2009-02-28 | 0.516847 | 0.529987 | 0.047238 |
2009-03-31 | 0.373157 | 0.383172 | 0.037508 |
2009-04-30 | 0.163182 | 0.129354 | 0.021356 |
2009-05-31 | 0.178588 | 0.160616 | 0.020744 |
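The daily resampling above can be verified on a tiny synthetic series (the numbers are made up for the sketch): each calendar day collects its eight 3-hourly readings and `mean()` averages them.

```python
import numpy as np
import pandas as pd

# Two days of 3-hourly synthetic readings: 8 values per day
idx = pd.date_range('2009-01-01', periods=16, freq='3h')
ts = pd.Series(np.arange(16, dtype=float), index=idx)

# resample('D') produces one bin per calendar day; mean() averages each bin
daily = ts.resample('D').mean()
print(daily)  # day 1 averages 0..7, day 2 averages 8..15
```

The same pattern works with other aggregations (`sum()`, `max()`, `agg([...])`) and other rules such as `'10D'` or monthly bins.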
Plotting
These are just quick plots; for polished, publication-quality figures you would normally use the matplotlib package directly.
%matplotlib inline
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4).cumsum(0), index=np.arange(0, 100, 10), columns=['A', 'B', 'C', 'D'])
df.plot()
df
A | B | C | D | |
---|---|---|---|---|
0 | -2.089443 | 1.089140 | 1.248430 | 0.082829 |
10 | -3.649157 | -0.520943 | 1.173340 | -1.307401 |
20 | -3.387452 | -1.713291 | 0.212164 | -1.434823 |
30 | -1.511931 | -3.020731 | -0.334770 | -2.991320 |
... | ... | ... | ... | ... |
60 | -0.027079 | -4.617245 | -0.540804 | -4.774767 |
70 | -2.743985 | -4.534934 | -1.393246 | -5.132394 |
80 | -3.336041 | -5.283589 | -2.202856 | -6.250555 |
90 | -1.680583 | -4.269342 | -1.783487 | -5.914622 |
10 rows × 4 columns
# You can choose the kind of plot: scatter, bar, and so on
df = pd.DataFrame(np.random.rand(6, 4), index = ['one', 'two', 'three', 'four', 'five', 'six'], columns = pd.Index(['A', 'B', 'C', 'D'], name = 'Genus'))
df.plot(kind='bar')
df
Genus | A | B | C | D |
---|---|---|---|---|
one | 0.706566 | 0.326933 | 0.169319 | 0.950354 |
two | 0.507864 | 0.763742 | 0.271822 | 0.750882 |
three | 0.274819 | 0.916360 | 0.040567 | 0.688893 |
four | 0.498586 | 0.390705 | 0.545891 | 0.636310 |
five | 0.145746 | 0.681933 | 0.411809 | 0.314657 |
six | 0.721254 | 0.289420 | 0.362825 | 0.218710 |
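A minimal, self-contained version of the bar plot can be sketched as follows (random values; the `Agg` backend is used here so the script also runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, no window required
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 2),
                  index=['one', 'two', 'three'],
                  columns=['A', 'B'])

# plot() returns a matplotlib Axes, so the figure can be customized afterwards
ax = df.plot(kind='bar')
ax.set_ylabel('value')
```

One bar is drawn per cell, so a 3-row, 2-column frame produces six bars grouped by index label.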
tips = pd.read_csv('./tips.csv')
tips.total_bill.plot(kind='hist',bins=50)
tips
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
... | ... | ... | ... | ... | ... | ... | ... |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
df = pd.read_csv('./macrodata.csv')
data = df[['quarter','realgdp','realcons']]
data.plot.scatter('quarter','realgdp')
data
quarter | realgdp | realcons | |
---|---|---|---|
0 | 1.0 | 2710.349 | 1707.4 |
1 | 2.0 | 2778.801 | 1733.7 |
2 | 3.0 | 2775.488 | 1751.8 |
3 | 4.0 | 2785.204 | 1753.7 |
... | ... | ... | ... |
199 | 4.0 | 13141.920 | 9195.3 |
200 | 1.0 | 12925.410 | 9209.2 |
201 | 2.0 | 12901.504 | 9189.0 |
202 | 3.0 | 12990.341 | 9256.0 |
203 rows × 3 columns
Working with large datasets
# The file is fairly large, so even loading it takes a moment
gl = pd.read_csv('./game_logs.csv')
gl.head()
E:\anacanda\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (12,13,14,15,19,20,81,83,85,87,93,94,95,96,97,98,99,100,105,106,108,109,111,112,114,115,117,118,120,121,123,124,126,127,129,130,132,133,135,136,138,139,141,142,144,145,147,148,150,151,153,154,156,157,160) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
date | number_of_game | day_of_week | v_name | v_league | v_game_number | h_name | h_league | h_game_number | v_score | ... | h_player_7_name | h_player_7_def_pos | h_player_8_id | h_player_8_name | h_player_8_def_pos | h_player_9_id | h_player_9_name | h_player_9_def_pos | additional_info | acquisition_info | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18710504 | 0 | Thu | CL1 | na | 1 | FW1 | na | 1 | 0 | ... | Ed Mincher | 7.0 | mcdej101 | James McDermott | 8.0 | kellb105 | Bill Kelly | 9.0 | NaN | Y |
1 | 18710505 | 0 | Fri | BS1 | na | 1 | WS3 | na | 1 | 20 | ... | Asa Brainard | 1.0 | burrh101 | Henry Burroughs | 9.0 | berth101 | Henry Berthrong | 8.0 | HTBF | Y |
2 | 18710506 | 0 | Sat | CL1 | na | 2 | RC1 | na | 1 | 12 | ... | Pony Sager | 6.0 | birdg101 | George Bird | 7.0 | stirg101 | Gat Stires | 9.0 | NaN | Y |
3 | 18710508 | 0 | Mon | CL1 | na | 3 | CH1 | na | 1 | 12 | ... | Ed Duffy | 6.0 | pinke101 | Ed Pinkham | 5.0 | zettg101 | George Zettlein | 1.0 | NaN | Y |
4 | 18710509 | 0 | Tue | BS1 | na | 2 | TRO | na | 1 | 9 | ... | Steve Bellan | 5.0 | pikel101 | Lip Pike | 3.0 | cravb101 | Bill Craver | 6.0 | HTBF | Y |
5 rows × 161 columns
# Dimensions of the dataset
gl.shape
(171907, 161)
# memory_usage='deep' makes info() report the true memory footprint, including object contents (this ran noticeably slowly)
gl.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 860.5 MB
The DataFrame occupies 860.5 MB of memory once loaded.
for dtype in ['float64', 'object', 'int64']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype, mean_usage_mb))
Average memory usage for float64 columns: 1.29 MB
Average memory usage for object columns: 9.51 MB
Average memory usage for int64 columns: 1.12 MB
The loop visits the three dtypes and uses select_dtypes to pick out the columns of each one. Object columns clearly use the most memory, while the int64 and float64 columns are comparable. By converting these three dtypes we can make the data load and process noticeably faster.
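The per-dtype inspection can be reproduced on a small synthetic frame (the column names and sizes here are made up for the sketch). Object columns hold full Python strings, so `memory_usage(deep=True)` reports them as far more expensive than fixed-width numeric columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.arange(1000, dtype='int64'),
    'b': np.arange(1000, dtype='float64'),
})
# dtype=object is forced explicitly so the column stays object
# even on pandas versions that default strings to a string dtype
df['c'] = pd.Series(['row' + str(i) for i in range(1000)], dtype=object)

obj_mb = df.select_dtypes(include=['object']).memory_usage(deep=True).mean() / 1024 ** 2
int_mb = df.select_dtypes(include=['int64']).memory_usage(deep=True).mean() / 1024 ** 2
print(obj_mb, int_mb)  # the object column costs far more per value
```

Without `deep=True`, pandas only counts the 8-byte pointers in an object column and badly underestimates its real footprint.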
First, let's look at the value ranges of the integer types (int64, int32, and so on).
int_types = ["uint8", "int8", "int16","int32","int64"]
for it in int_types:
    print(np.iinfo(it))
Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------
Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------
Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------
Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

gl_int = gl.select_dtypes(include=['int64'])
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html
converted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')
print(mem_usage(gl_int))
print(mem_usage(converted_int))
7.87 MB
1.48 MB
7.87 MB is the footprint when all the integer columns are stored as int64.
1.48 MB is the footprint after downcasting them to the smallest sufficient unsigned type.
The mem_usage() helper simply reports the memory footprint of whatever DataFrame or Series it is given.
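The downcast behaviour is easy to see on a single small Series (invented values): `pd.to_numeric` with `downcast='unsigned'` picks the narrowest unsigned type that can hold every value.

```python
import pandas as pd

# Values that fit comfortably in uint8 but are stored as int64 by default
s = pd.Series([0, 5, 120, 255], dtype='int64')

small = pd.to_numeric(s, downcast='unsigned')
print(s.dtype, '->', small.dtype)  # int64 -> uint8
```

The values are unchanged; only the storage width shrinks, from 8 bytes per element to 1. If any value were negative, `downcast='unsigned'` would leave the dtype alone, and `downcast='integer'` would be the option to try instead.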
gl_float = gl.select_dtypes(include=['float64'])
converted_float = gl_float.apply(pd.to_numeric, downcast='float')
print(mem_usage(gl_float))
print(mem_usage(converted_float))
100.99 MB
50.49 MB
Downcasting the float columns roughly halves their memory usage.
gl_obj = gl.select_dtypes(include=['object']).copy()
converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:, col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:, col] = gl_obj[col]
print(mem_usage(gl_obj))
print(mem_usage(converted_obj))
751.64 MB
51.67 MB
For each object column we compute the ratio of unique values to total values: if fewer than half the values are unique, the column is converted to the category dtype; otherwise it is left unchanged.
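Why this helps can be shown on a single low-cardinality column (synthetic data): category stores each distinct string once plus a small integer code per row, instead of one full Python string per row.

```python
import pandas as pd

# 30,000 rows but only 3 distinct values -- well under the 0.5 ratio
days = pd.Series(['Mon', 'Tue', 'Wed'] * 10000, dtype=object)

as_cat = days.astype('category')

# The categorical version keeps 3 strings plus compact integer codes,
# so it is dramatically smaller than the object version
print(days.memory_usage(deep=True), as_cat.memory_usage(deep=True))
```

For high-cardinality columns (many unique strings) the codes-plus-categories layout saves little or nothing, which is exactly why the loop above skips columns whose unique ratio is 0.5 or more.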