pandas (Part 4): Concatenation and Merging Operations in pandas
Concatenation and merging in pandas
pandas provides two ways to combine data:
- Concatenation: pd.concat, pd.append
- Merging: pd.merge, pd.join
0. A quick review of NumPy concatenation
import numpy as np
nd1 = np.array([1,2,3])
nd2 = np.array([-1,-2,-3,-4])
np.concatenate([nd1,nd2],axis=0)
array([ 1, 2, 3, -1, -2, -3, -4])
nd3 = np.array([[1,2,3],[4,5,6]])
nd3
array([[1, 2, 3],[4, 5, 6]])
np.concatenate([nd1,nd3],axis=1) # arrays with different numbers of dimensions cannot be concatenated
AxisError Traceback (most recent call last)
----> 1 np.concatenate([nd1,nd3],axis=1)
AxisError: axis 1 is out of bounds for array of dimension 1
nd4 = np.random.randint(0,10,size=(3,3))
nd4
array([[4, 6, 1],[9, 3, 7],[9, 6, 3]])
np.concatenate([nd3,nd4],axis=0)
array([[1, 2, 3],[4, 5, 6],[4, 6, 1],[9, 3, 7],[9, 6, 3]])
nd3 + nd4
ValueError Traceback (most recent call last)
----> 1 nd3 + nd4
ValueError: operands could not be broadcast together with shapes (2,3) (3,3)
nd1 + nd3 # arrays with different numbers of dimensions can still be combined via broadcasting
array([[2, 4, 6],[5, 7, 9]])
============================================
Exercise 12:
- Create two 3x3 matrices and concatenate them along each of the two axes (a sketch follows below)
============================================
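A minimal sketch of one possible answer (the matrices are random, so the exact values will differ):
m1 = np.random.randint(0, 10, size=(3, 3))
m2 = np.random.randint(0, 10, size=(3, 3))
np.concatenate([m1, m2], axis=0)   # stack vertically  -> shape (6, 3)
np.concatenate([m1, m2], axis=1)   # stack horizontally -> shape (3, 6)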
To make the examples easier to follow, we first define a helper function that generates a DataFrame:
import pandas as pd
from pandas import DataFrame,Series
# define a function that fills each element from its column name and row label
def make_df(cols, inds):
    data = {c: [c + str(i) for i in inds] for c in cols}
    return DataFrame(data, index=inds)
df1 = make_df(list("abc"),[1,2,4])
df1
a | b | c | |
---|---|---|---|
1 | a1 | b1 | c1 |
2 | a2 | b2 | c2 |
4 | a4 | b4 | c4 |
df2 = make_df(list("abc"),[4,5,6])
df2
a | b | c | |
---|---|---|---|
4 | a4 | b4 | c4 |
5 | a5 | b5 | c5 |
6 | a6 | b6 | c6 |
1. Concatenation with pd.concat()
pandas provides pd.concat, which works like np.concatenate but accepts a few more parameters:
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,keys=None, levels=None, names=None, verify_integrity=False,copy=True)
1) Simple concatenation
Like np.concatenate, rows are stacked by default (axis=0).
pd.concat([df1,df2])
a | b | c | |
---|---|---|---|
1 | a1 | b1 | c1 |
2 | a2 | b2 | c2 |
4 | a4 | b4 | c4 |
4 | a4 | b4 | c4 |
5 | a5 | b5 | c5 |
6 | a6 | b6 | c6 |
pd.concat([df1,df2],axis=1)
a | b | c | a | b | c | |
---|---|---|---|---|---|---|
1 | a1 | b1 | c1 | NaN | NaN | NaN |
2 | a2 | b2 | c2 | NaN | NaN | NaN |
4 | a4 | b4 | c4 | a4 | b4 | c4 |
5 | NaN | NaN | NaN | a5 | b5 | c5 |
6 | NaN | NaN | NaN | a6 | b6 | c6 |
The axis parameter controls the direction of concatenation.
Note that index labels may repeat in the result.
You can also pass ignore_index=True to discard the old labels and re-index.
pd.concat([df1,df2],axis=0,ignore_index=True)
a | b | c | |
---|---|---|---|
0 | a1 | b1 | c1 |
1 | a2 | b2 | c2 |
2 | a4 | b4 | c4 |
3 | a4 | b4 | c4 |
4 | a5 | b5 | c5 |
5 | a6 | b6 | c6 |
Or build a hierarchical (multi-level) index with keys:
concat([x,y],keys=['x','y'])
pd.concat([df1,df2],keys=["教学","品保"])
a | b | c | ||
---|---|---|---|---|
教学 | 1 | a1 | b1 | c1 |
2 | a2 | b2 | c2 | |
4 | a4 | b4 | c4 | |
品保 | 4 | a4 | b4 | c4 |
5 | a5 | b5 | c5 | |
6 | a6 | b6 | c6 |
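With keys, the result carries a MultiIndex on its rows, so each source frame can still be pulled back out. A minimal sketch reusing the frames above (df_keys is a name introduced here for illustration):
df_keys = pd.concat([df1,df2],keys=["教学","品保"])
df_keys.loc["教学"]   # returns just the rows that came from df1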
============================================
Exercise 13:
Think about typical use cases for concatenation.
Using yesterday's material, build a midterm score table ddd for 张三 and 李四.
Suppose a new exam subject "计算机" is added; how would you implement that?
How would you add scores for a new student, 王老五? (A sketch of one possible answer follows below.)
============================================
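A minimal sketch of one possible answer; the table ddd, the subject names and the scores are all made up for illustration:
ddd = DataFrame({"语文":[90,80],"数学":[85,95]}, index=["张三","李四"])
# add a new subject: concatenate column-wise
computer = DataFrame({"计算机":[88,92]}, index=["张三","李四"])
ddd = pd.concat([ddd, computer], axis=1)
# add a new student: concatenate row-wise
wang = DataFrame({"语文":[70],"数学":[75],"计算机":[60]}, index=["王老五"])
ddd = pd.concat([ddd, wang], axis=0)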
2) Mismatched concatenation
Mismatched means the labels along the other axis are not identical: for vertical concatenation the column labels differ, for horizontal concatenation the row labels differ.
df1
a | b | c | |
---|---|---|---|
1 | a1 | b1 | c1 |
2 | a2 | b2 | c2 |
4 | a4 | b4 | c4 |
df3 = make_df(list("abcd"),[1,2,3,4])
df3
a | b | c | d | |
---|---|---|---|---|
1 | a1 | b1 | c1 | d1 |
2 | a2 | b2 | c2 | d2 |
3 | a3 | b3 | c3 | d3 |
4 | a4 | b4 | c4 | d4 |
pd.concat([df1,df3],axis=0)
a | b | c | d | |
---|---|---|---|---|
1 | a1 | b1 | c1 | NaN |
2 | a2 | b2 | c2 | NaN |
4 | a4 | b4 | c4 | NaN |
1 | a1 | b1 | c1 | d1 |
2 | a2 | b2 | c2 | d2 |
3 | a3 | b3 | c3 | d3 |
4 | a4 | b4 | c4 | d4 |
pd.concat([df1,df3],axis=1)
a | b | c | a | b | c | d | |
---|---|---|---|---|---|---|---|
1 | a1 | b1 | c1 | a1 | b1 | c1 | d1 |
2 | a2 | b2 | c2 | a2 | b2 | c2 | d2 |
3 | NaN | NaN | NaN | a3 | b3 | c3 | d3 |
4 | a4 | b4 | c4 | a4 | b4 | c4 | d4 |
There are three ways to handle the mismatch:
- Outer join: fill the missing entries with NaN (the default)
pd.concat([df1,df3],axis=0,join="outer")
# 1. With an outer join the labels on the other axis do not have to match:
#    for axis=0 the union of the column labels is kept, for axis=1 the union of the row labels
# 2. Wherever a label is missing from one of the frames, its values are filled with NaN
a | b | c | d | |
---|---|---|---|---|
1 | a1 | b1 | c1 | NaN |
2 | a2 | b2 | c2 | NaN |
4 | a4 | b4 | c4 | NaN |
1 | a1 | b1 | c1 | d1 |
2 | a2 | b2 | c2 | d2 |
3 | a3 | b3 | c3 | d3 |
4 | a4 | b4 | c4 | d4 |
- Inner join: keep only the labels present in every frame
pd.concat([df1,df3],axis=0,join="inner")
# with an inner join, labels that do not match are dropped entirely
a | b | c | |
---|---|---|---|
1 | a1 | b1 | c1 |
2 | a2 | b2 | c2 |
4 | a4 | b4 | c4 |
1 | a1 | b1 | c1 |
2 | a2 | b2 | c2 |
3 | a3 | b3 | c3 |
4 | a4 | b4 | c4 |
df4 = make_df(list("aecd"),[1,2,3,4])
df4
a | c | d | e | |
---|---|---|---|---|
1 | a1 | c1 | d1 | e1 |
2 | a2 | c2 | d2 | e2 |
3 | a3 | c3 | d3 | e3 |
4 | a4 | c4 | d4 | e4 |
pd.concat([df3,df4],axis=0,join="inner")
a | c | d | |
---|---|---|---|
1 | a1 | c1 | d1 |
2 | a2 | c2 | d2 |
3 | a3 | c3 | d3 |
4 | a4 | c4 | d4 |
1 | a1 | c1 | d1 |
2 | a2 | c2 | d2 |
3 | a3 | c3 | d3 |
4 | a4 | c4 | d4 |
- Restrict the result to a specified axis with join_axes
df5 = make_df(list("abcf"),[2,3,4,8])
df5
a | b | c | f | |
---|---|---|---|---|
2 | a2 | b2 | c2 | f2 |
3 | a3 | b3 | c3 | f3 |
4 | a4 | b4 | c4 | f4 |
8 | a8 | b8 | c8 | f8 |
pd.concat([df1,df4,df5],axis=1,join_axes=[df1.index])
a | b | c | a | c | d | e | a | b | c | f | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | a1 | b1 | c1 | a1 | c1 | d1 | e1 | NaN | NaN | NaN | NaN |
2 | a2 | b2 | c2 | a2 | c2 | d2 | e2 | a2 | b2 | c2 | f2 |
4 | a4 | b4 | c4 | a4 | c4 | d4 | e4 | a4 | b4 | c4 | f4 |
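Note: the join_axes argument was deprecated in pandas 0.25 and removed in 1.0; the documented replacement is to reindex the concatenated result. A minimal sketch of the equivalent call with the frames above:
pd.concat([df1,df4,df5], axis=1).reindex(df1.index)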
df1
a | b | c | |
---|---|---|---|
1 | a1 | b1 | c1 |
2 | a2 | b2 | c2 |
4 | a4 | b4 | c4 |
df1.append(df4)
a | b | c | d | e | |
---|---|---|---|---|---|
1 | a1 | b1 | c1 | NaN | NaN |
2 | a2 | b2 | c2 | NaN | NaN |
4 | a4 | b4 | c4 | NaN | NaN |
1 | a1 | NaN | c1 | d1 | e1 |
2 | a2 | NaN | c2 | d2 | e2 |
3 | a3 | NaN | c3 | d3 | e3 |
4 | a4 | NaN | c4 | d4 | e4 |
df1.append(df4,axis=1) # append() has no axis parameter; it can only append rows
TypeError Traceback (most recent call last)
----> 1 df1.append(df4,axis=1)
TypeError: append() got an unexpected keyword argument 'axis'
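Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on modern pandas the same row-wise append is written with pd.concat. A minimal sketch:
pd.concat([df1, df4])                      # equivalent to df1.append(df4)
pd.concat([df1, df4], ignore_index=True)   # and renumber the rows if desired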
============================================
Exercise 14:
Suppose the final-exam score table ddd2 has no scores for 张三 and only covers 李四, 王老五 and 赵小六; concatenate it with ddd in several different ways.
============================================
3) Appending with append()
Because adding rows at the end is such a common operation, DataFrame provides the dedicated append() method (used in the examples above).
2. Merging
Unlike concat, merge combines two tables based on the values of a shared column (or index).
When merging with pd.merge(), the column that has the same name in both frames is automatically used as the key.
Note that the rows do not need to appear in the same order.
1) One-to-one merge
If every key value appears exactly once in both tables, the merge is one-to-one.
df1 = DataFrame({
"name":["Jack Ma","Gates Bill","MrsWang","Xiaoming"],
"age":[50,60,48,18],
"sex":["男","男","女","男"],
"weight":[60,68,80,50]
})
df1
age | name | sex | weight | |
---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 |
1 | 60 | Gates Bill | 男 | 68 |
2 | 48 | MrsWang | 女 | 80 |
3 | 18 | Xiaoming | 男 | 50 |
df2 = DataFrame({
"name":["Jack Ma","Gates Bill","MrsWang","Xiaoming"],
"home":["杭州","美国","隔壁","河南"],
"work":["teacher","seller","boss","student"],
})
df2
home | name | work | |
---|---|---|---|
0 | 杭州 | Jack Ma | teacher |
1 | 美国 | Gates Bill | seller |
2 | 隔壁 | MrsWang | boss |
3 | 河南 | Xiaoming | student |
pd.concat([df1,df2],axis=1)
age | name | sex | weight | home | name | work | |
---|---|---|---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 | 杭州 | Jack Ma | teacher |
1 | 60 | Gates Bill | 男 | 68 | 美国 | Gates Bill | seller |
2 | 48 | MrsWang | 女 | 80 | 隔壁 | MrsWang | boss |
3 | 18 | Xiaoming | 男 | 50 | 河南 | Xiaoming | student |
df1.merge(df2)
age | name | sex | weight | home | work | |
---|---|---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 | 杭州 | teacher |
1 | 60 | Gates Bill | 男 | 68 | 美国 | seller |
2 | 48 | MrsWang | 女 | 80 | 隔壁 | boss |
3 | 18 | Xiaoming | 男 | 50 | 河南 | student |
pd.merge(df1,df2)
age | name | sex | weight | home | work | |
---|---|---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 | 杭州 | teacher |
1 | 60 | Gates Bill | 男 | 68 | 美国 | seller |
2 | 48 | MrsWang | 女 | 80 | 隔壁 | boss |
3 | 18 | Xiaoming | 男 | 50 | 河南 | student |
2) Many-to-one merge
When a key value from table 1 matches several rows in table 2, the merge is many-to-one (one-to-many).
During the merge, the row on the "one" side is duplicated for every match on the "many" side, and then the tables are combined.
df3 = DataFrame({
"name":["Jack Ma","Gates Bill","MrsWang","Xiaoming","Xiaoming","Jack Ma"],
"home":["杭州","美国","隔壁","河南","山东","安徽"],
"work":["teacher","seller","boss","student","studet","boss"],
})
df3
home | name | work | |
---|---|---|---|
0 | 杭州 | Jack Ma | teacher |
1 | 美国 | Gates Bill | seller |
2 | 隔壁 | MrsWang | boss |
3 | 河南 | Xiaoming | student |
4 | 山东 | Xiaoming | studet |
5 | 安徽 | Jack Ma | boss |
df1.merge(df3)
age | name | sex | weight | home | work | |
---|---|---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 | 杭州 | teacher |
1 | 50 | Jack Ma | 男 | 60 | 安徽 | boss |
2 | 60 | Gates Bill | 男 | 68 | 美国 | seller |
3 | 48 | MrsWang | 女 | 80 | 隔壁 | boss |
4 | 18 | Xiaoming | 男 | 50 | 河南 | student |
5 | 18 | Xiaoming | 男 | 50 | 山东 | studet |
3) Many-to-many merge
When a key value appears several times in table 1 and also several times in table 2, the merge is many-to-many.
Every matching row of table 1 is paired with every matching row of table 2 (a Cartesian product per key).
df4 = DataFrame({
"name":["Jack Ma","Gates Bill","MrsWang","Xiaoming","Jack Ma","Jack Ma","Xiaoming"],
"age":[50,60,48,18,20,30,40],
"sex":["男","男","女","男","M","M","F"],
"weight":[60,68,80,50,40,50,30]
})
df4
age | name | sex | weight | |
---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 |
1 | 60 | Gates Bill | 男 | 68 |
2 | 48 | MrsWang | 女 | 80 |
3 | 18 | Xiaoming | 男 | 50 |
4 | 20 | Jack Ma | M | 40 |
5 | 30 | Jack Ma | M | 50 |
6 | 40 | Xiaoming | F | 30 |
df3
home | name | work | |
---|---|---|---|
0 | 杭州 | Jack Ma | teacher |
1 | 美国 | Gates Bill | seller |
2 | 隔壁 | MrsWang | boss |
3 | 河南 | Xiaoming | student |
4 | 山东 | Xiaoming | studet |
5 | 安徽 | Jack Ma | boss |
df4.merge(df3)
age | name | sex | weight | home | work | |
---|---|---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 | 杭州 | teacher |
1 | 50 | Jack Ma | 男 | 60 | 安徽 | boss |
2 | 20 | Jack Ma | M | 40 | 杭州 | teacher |
3 | 20 | Jack Ma | M | 40 | 安徽 | boss |
4 | 30 | Jack Ma | M | 50 | 杭州 | teacher |
5 | 30 | Jack Ma | M | 50 | 安徽 | boss |
6 | 60 | Gates Bill | 男 | 68 | 美国 | seller |
7 | 48 | MrsWang | 女 | 80 | 隔壁 | boss |
8 | 18 | Xiaoming | 男 | 50 | 河南 | student |
9 | 18 | Xiaoming | 男 | 50 | 山东 | studet |
10 | 40 | Xiaoming | F | 30 | 河南 | student |
11 | 40 | Xiaoming | F | 30 | 山东 | studet |
4) Controlling the merge key
- Use on= to specify explicitly which column is the key; needed when the two tables share more than one column name.
df1
age | name | sex | weight | |
---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 |
1 | 60 | Gates Bill | 男 | 68 |
2 | 48 | MrsWang | 女 | 80 |
3 | 18 | Xiaoming | 男 | 50 |
df5 = DataFrame({
"name":["Jack Ma","Gates Bill","MrsWang","Xiaoming","Xiaoming","Jack Ma"],
"home":["杭州","美国","隔壁","河南","山东","安徽"],
"work":["teacher","seller","boss","student","studet","boss"],
"age":[15,67,89,12,34,56]
})
df5
age | home | name | work | |
---|---|---|---|---|
0 | 15 | 杭州 | Jack Ma | teacher |
1 | 67 | 美国 | Gates Bill | seller |
2 | 89 | 隔壁 | MrsWang | boss |
3 | 12 | 河南 | Xiaoming | student |
4 | 34 | 山东 | Xiaoming | studet |
5 | 56 | 安徽 | Jack Ma | boss |
# when the two tables share more than one column name, specify which one to merge on
df1.merge(df5,on="name",suffixes=["_实际","_假的"])
age_实际 | name | sex | weight | age_假的 | home | work | |
---|---|---|---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 | 15 | 杭州 | teacher |
1 | 50 | Jack Ma | 男 | 60 | 56 | 安徽 | boss |
2 | 60 | Gates Bill | 男 | 68 | 67 | 美国 | seller |
3 | 48 | MrsWang | 女 | 80 | 89 | 隔壁 | boss |
4 | 18 | Xiaoming | 男 | 50 | 12 | 河南 | student |
5 | 18 | Xiaoming | 男 | 50 | 34 | 山东 | studet |
- Use left_on and right_on to name the key column on each side; needed when the key columns of the two tables have different names.
df6 = DataFrame({
"姓名":["Jack Ma","Gates Bill","MrsWang","Xiaoming","Xiaoming","Jack Ma"],
"home":["杭州","美国","隔壁","河南","山东","安徽"],
"work":["teacher","seller","boss","student","studet","boss"],
})
df6
home | work | 姓名 | |
---|---|---|---|
0 | 杭州 | teacher | Jack Ma |
1 | 美国 | seller | Gates Bill |
2 | 隔壁 | boss | MrsWang |
3 | 河南 | student | Xiaoming |
4 | 山东 | studet | Xiaoming |
5 | 安徽 | boss | Jack Ma |
df1
age | name | sex | weight | |
---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 |
1 | 60 | Gates Bill | 男 | 68 |
2 | 48 | MrsWang | 女 | 80 |
3 | 18 | Xiaoming | 男 | 50 |
df1.merge(df6,left_on="name",right_on="姓名")
# if the two tables share no column name, pick one column from each side and merge on those
# note that both key columns are kept in the result; they are not collapsed into one
age | name | sex | weight | home | work | 姓名 | |
---|---|---|---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 | 杭州 | teacher | Jack Ma |
1 | 50 | Jack Ma | 男 | 60 | 安徽 | boss | Jack Ma |
2 | 60 | Gates Bill | 男 | 68 | 美国 | seller | Gates Bill |
3 | 48 | MrsWang | 女 | 80 | 隔壁 | boss | MrsWang |
4 | 18 | Xiaoming | 男 | 50 | 河南 | student | Xiaoming |
5 | 18 | Xiaoming | 男 | 50 | 山东 | studet | Xiaoming |
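Since both key columns survive the merge, one of them is usually dropped afterwards. A minimal sketch:
df1.merge(df6, left_on="name", right_on="姓名").drop(columns="姓名")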
df1.merge(df6,left_on="age",right_on="home") # the two columns share no values, so the result is empty
df1.merge(df6,right_index=True,left_index=True) # merge on the index instead of a column
age | name | sex | weight | home | work | 姓名 | |
---|---|---|---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 | 杭州 | teacher | Jack Ma |
1 | 60 | Gates Bill | 男 | 68 | 美国 | seller | Gates Bill |
2 | 48 | MrsWang | 女 | 80 | 隔壁 | boss | MrsWang |
3 | 18 | Xiaoming | 男 | 50 | 河南 | student | Xiaoming |
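The lesson opener also listed pd.join: DataFrame.join is a convenience method that merges on the index by default, so in this case it produces the same result as the left_index/right_index merge above (the column names of df1 and df6 do not overlap, so no suffixes are needed):
df1.join(df6)   # index-on-index join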
============================================
Exercise 16:
Suppose there are two report cards: besides ddd with 张三, 李四 and 王老五, there is ddd4 with 张三 and 赵小六. How do you merge them?
What if 张三's name in ddd4 was mistyped as 张十三?
Practice the many-to-one and many-to-many cases on your own.
Study left_index and right_index on your own.
============================================
5) Inner and outer merges
- Inner merge: keep only the keys present in both tables (the default)
df7 = DataFrame({
"name":["Jack Ma","Gates Bill","MrsWang","Xiaoming","Xiaowang","MaYun"],
"home":["杭州","美国","隔壁","河南","山东","安徽"],
"work":["teacher","seller","boss","student","studet","boss"],
})
df7
home | name | work | |
---|---|---|---|
0 | 杭州 | Jack Ma | teacher |
1 | 美国 | Gates Bill | seller |
2 | 隔壁 | MrsWang | boss |
3 | 河南 | Xiaoming | student |
4 | 山东 | Xiaowang | studet |
5 | 安徽 | MaYun | boss |
df1.merge(df7,how="inner")
age | name | sex | weight | home | work | |
---|---|---|---|---|---|---|
0 | 50 | Jack Ma | 男 | 60 | 杭州 | teacher |
1 | 60 | Gates Bill | 男 | 68 | 美国 | seller |
2 | 48 | MrsWang | 女 | 80 | 隔壁 | boss |
3 | 18 | Xiaoming | 男 | 50 | 河南 | student |
- Outer merge how='outer': keep every key and fill the gaps with NaN
df1.merge(df7,how="outer")
age | name | sex | weight | home | work | |
---|---|---|---|---|---|---|
0 | 50.0 | Jack Ma | 男 | 60.0 | 杭州 | teacher |
1 | 60.0 | Gates Bill | 男 | 68.0 | 美国 | seller |
2 | 48.0 | MrsWang | 女 | 80.0 | 隔壁 | boss |
3 | 18.0 | Xiaoming | 男 | 50.0 | 河南 | student |
4 | NaN | Xiaowang | NaN | NaN | 山东 | studet |
5 | NaN | MaYun | NaN | NaN | 安徽 | boss |
- Left and right merges: how='left', how='right'
df1["name"][0] = "Peater"
df1
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
age | name | sex | weight | |
---|---|---|---|---|
0 | 50 | Peater | 男 | 60 |
1 | 60 | Gates Bill | 男 | 68 |
2 | 48 | MrsWang | 女 | 80 |
3 | 18 | Xiaoming | 男 | 50 |
df1.merge(df7,how="right")
# left merge: keep every key from the left table; where the right table has no match, fill with NaN; keys that exist only on the right are dropped
# right merge: the opposite of the above
age | name | sex | weight | home | work | |
---|---|---|---|---|---|---|
0 | 60.0 | Gates Bill | 男 | 68.0 | 美国 | seller |
1 | 48.0 | MrsWang | 女 | 80.0 | 隔壁 | boss |
2 | 18.0 | Xiaoming | 男 | 50.0 | 河南 | student |
3 | NaN | Jack Ma | NaN | NaN | 杭州 | teacher |
4 | NaN | Xiaowang | NaN | NaN | 山东 | studet |
5 | NaN | MaYun | NaN | NaN | 安徽 | boss |
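For symmetry, a minimal sketch of the left merge on the same frames (every name in df1 is kept; Peater has no match in df7, so its home and work come back as NaN):
df1.merge(df7, how="left")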
============================================
Exercise 17:
If you only have 张三 and 赵小六's scores for Chinese, maths and English, how do you merge?
Considering the use case, merge ddd and ddd4 in several different ways.
============================================
6) Resolving column-name conflicts
When columns conflict, i.e. several column names are the same, use on= to specify which column is the key, and use suffixes to rename the conflicting columns.
You can pass suffixes= to choose the suffixes yourself (see the sketch below).
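A minimal sketch reusing df1 and df5 from above, which both contain an age column:
df1.merge(df5, on="name", suffixes=["_left", "_right"])   # age becomes age_left / age_right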
============================================
Exercise 18:
Suppose two different students are both called 李四, and ddd5 and ddd6 are both score tables for 张三 and 李四. How do you merge them?
============================================
Homework
3. Case study: analyzing U.S. state population data
First load the files and take a look at a sample of each dataset.
abbr = pd.read_csv("./usapop/state-abbrevs.csv")
abbr.head()
state | abbreviation | |
---|---|---|
0 | Alabama | AL |
1 | Alaska | AK |
2 | Arizona | AZ |
3 | Arkansas | AR |
4 | California | CA |
areas = pd.read_csv("./usapop/state-areas.csv")
areas.head()
state | area (sq. mi) | |
---|---|---|
0 | Alabama | 52423 |
1 | Alaska | 656425 |
2 | Arizona | 114006 |
3 | Arkansas | 53182 |
4 | California | 163707 |
pop = pd.read_csv("./usapop/state-population.csv")
pop.head()
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
4 | AL | under18 | 2011 | 1125763.0 |
Merge the pop and abbrevs DataFrames, joining on the state/region column of one and the abbreviation column of the other.
To keep all the information, use an outer merge.
pop2 = pop.merge(abbr,left_on="state/region",right_on="abbreviation",how="outer")
# use an outer merge (or a left merge)
pop2
state/region | ages | year | population | state | abbreviation | |
---|---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama | AL |
1 | AL | total | 2012 | 4817528.0 | Alabama | AL |
2 | AL | under18 | 2010 | 1130966.0 | Alabama | AL |
3 | AL | total | 2010 | 4785570.0 | Alabama | AL |
4 | AL | under18 | 2011 | 1125763.0 | Alabama | AL |
5 | AL | total | 2011 | 4801627.0 | Alabama | AL |
6 | AL | total | 2009 | 4757938.0 | Alabama | AL |
7 | AL | under18 | 2009 | 1134192.0 | Alabama | AL |
8 | AL | under18 | 2013 | 1111481.0 | Alabama | AL |
9 | AL | total | 2013 | 4833722.0 | Alabama | AL |
10 | AL | total | 2007 | 4672840.0 | Alabama | AL |
11 | AL | under18 | 2007 | 1132296.0 | Alabama | AL |
12 | AL | total | 2008 | 4718206.0 | Alabama | AL |
13 | AL | under18 | 2008 | 1134927.0 | Alabama | AL |
14 | AL | total | 2005 | 4569805.0 | Alabama | AL |
15 | AL | under18 | 2005 | 1117229.0 | Alabama | AL |
16 | AL | total | 2006 | 4628981.0 | Alabama | AL |
17 | AL | under18 | 2006 | 1126798.0 | Alabama | AL |
18 | AL | total | 2004 | 4530729.0 | Alabama | AL |
19 | AL | under18 | 2004 | 1113662.0 | Alabama | AL |
20 | AL | total | 2003 | 4503491.0 | Alabama | AL |
21 | AL | under18 | 2003 | 1113083.0 | Alabama | AL |
22 | AL | total | 2001 | 4467634.0 | Alabama | AL |
23 | AL | under18 | 2001 | 1120409.0 | Alabama | AL |
24 | AL | total | 2002 | 4480089.0 | Alabama | AL |
25 | AL | under18 | 2002 | 1116590.0 | Alabama | AL |
26 | AL | under18 | 1999 | 1121287.0 | Alabama | AL |
27 | AL | total | 1999 | 4430141.0 | Alabama | AL |
28 | AL | total | 2000 | 4452173.0 | Alabama | AL |
29 | AL | under18 | 2000 | 1122273.0 | Alabama | AL |
... | ... | ... | ... | ... | ... | ... |
2514 | USA | under18 | 1999 | 71946051.0 | NaN | NaN |
2515 | USA | total | 2000 | 282162411.0 | NaN | NaN |
2516 | USA | under18 | 2000 | 72376189.0 | NaN | NaN |
2517 | USA | total | 1999 | 279040181.0 | NaN | NaN |
2518 | USA | total | 2001 | 284968955.0 | NaN | NaN |
2519 | USA | under18 | 2001 | 72671175.0 | NaN | NaN |
2520 | USA | total | 2002 | 287625193.0 | NaN | NaN |
2521 | USA | under18 | 2002 | 72936457.0 | NaN | NaN |
2522 | USA | total | 2003 | 290107933.0 | NaN | NaN |
2523 | USA | under18 | 2003 | 73100758.0 | NaN | NaN |
2524 | USA | total | 2004 | 292805298.0 | NaN | NaN |
2525 | USA | under18 | 2004 | 73297735.0 | NaN | NaN |
2526 | USA | total | 2005 | 295516599.0 | NaN | NaN |
2527 | USA | under18 | 2005 | 73523669.0 | NaN | NaN |
2528 | USA | total | 2006 | 298379912.0 | NaN | NaN |
2529 | USA | under18 | 2006 | 73757714.0 | NaN | NaN |
2530 | USA | total | 2007 | 301231207.0 | NaN | NaN |
2531 | USA | under18 | 2007 | 74019405.0 | NaN | NaN |
2532 | USA | total | 2008 | 304093966.0 | NaN | NaN |
2533 | USA | under18 | 2008 | 74104602.0 | NaN | NaN |
2534 | USA | under18 | 2013 | 73585872.0 | NaN | NaN |
2535 | USA | total | 2013 | 316128839.0 | NaN | NaN |
2536 | USA | total | 2009 | 306771529.0 | NaN | NaN |
2537 | USA | under18 | 2009 | 74134167.0 | NaN | NaN |
2538 | USA | under18 | 2010 | 74119556.0 | NaN | NaN |
2539 | USA | total | 2010 | 309326295.0 | NaN | NaN |
2540 | USA | under18 | 2011 | 73902222.0 | NaN | NaN |
2541 | USA | total | 2011 | 311582564.0 | NaN | NaN |
2542 | USA | under18 | 2012 | 73708179.0 | NaN | NaN |
2543 | USA | total | 2012 | 313873685.0 | NaN | NaN |
2544 rows × 6 columns
Drop the abbreviation column (axis=1).
pop2.drop("abbreviation",axis=1,inplace=True)
# the default axis=0 would drop rows
# inplace controls whether the original table is modified in place; the default False returns a new table instead
pop2.head()
state/region | ages | year | population | state | |
---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama |
1 | AL | total | 2012 | 4817528.0 | Alabama |
2 | AL | under18 | 2010 | 1130966.0 | Alabama |
3 | AL | total | 2010 | 4785570.0 | Alabama |
4 | AL | under18 | 2011 | 1125763.0 | Alabama |
Find the columns that contain missing data.
With .isnull().any(), a column is reported as True if it contains even a single missing value.
cond = pop2.isnull().any(axis=1)
cond
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False... 2514 True 2515 True 2516 True 2517 True 2518 True 2519 True 2520 True 2521 True 2522 True 2523 True 2524 True 2525 True 2526 True 2527 True 2528 True 2529 True 2530 True 2531 True 2532 True 2533 True 2534 True 2535 True 2536 True 2537 True 2538 True 2539 True 2540 True 2541 True 2542 True 2543 True Length: 2544, dtype: bool
pop2[cond]
state/region | ages | year | population | state | |
---|---|---|---|---|---|
2448 | PR | under18 | 1990 | NaN | NaN |
2449 | PR | total | 1990 | NaN | NaN |
2450 | PR | total | 1991 | NaN | NaN |
2451 | PR | under18 | 1991 | NaN | NaN |
2452 | PR | total | 1993 | NaN | NaN |
2453 | PR | under18 | 1993 | NaN | NaN |
2454 | PR | under18 | 1992 | NaN | NaN |
2455 | PR | total | 1992 | NaN | NaN |
2456 | PR | under18 | 1994 | NaN | NaN |
2457 | PR | total | 1994 | NaN | NaN |
2458 | PR | total | 1995 | NaN | NaN |
2459 | PR | under18 | 1995 | NaN | NaN |
2460 | PR | under18 | 1996 | NaN | NaN |
2461 | PR | total | 1996 | NaN | NaN |
2462 | PR | under18 | 1998 | NaN | NaN |
2463 | PR | total | 1998 | NaN | NaN |
2464 | PR | total | 1997 | NaN | NaN |
2465 | PR | under18 | 1997 | NaN | NaN |
2466 | PR | total | 1999 | NaN | NaN |
2467 | PR | under18 | 1999 | NaN | NaN |
2468 | PR | total | 2000 | 3810605.0 | NaN |
2469 | PR | under18 | 2000 | 1089063.0 | NaN |
2470 | PR | total | 2001 | 3818774.0 | NaN |
2471 | PR | under18 | 2001 | 1077566.0 | NaN |
2472 | PR | total | 2002 | 3823701.0 | NaN |
2473 | PR | under18 | 2002 | 1065051.0 | NaN |
2474 | PR | total | 2004 | 3826878.0 | NaN |
2475 | PR | under18 | 2004 | 1035919.0 | NaN |
2476 | PR | total | 2003 | 3826095.0 | NaN |
2477 | PR | under18 | 2003 | 1050615.0 | NaN |
... | ... | ... | ... | ... | ... |
2514 | USA | under18 | 1999 | 71946051.0 | NaN |
2515 | USA | total | 2000 | 282162411.0 | NaN |
2516 | USA | under18 | 2000 | 72376189.0 | NaN |
2517 | USA | total | 1999 | 279040181.0 | NaN |
2518 | USA | total | 2001 | 284968955.0 | NaN |
2519 | USA | under18 | 2001 | 72671175.0 | NaN |
2520 | USA | total | 2002 | 287625193.0 | NaN |
2521 | USA | under18 | 2002 | 72936457.0 | NaN |
2522 | USA | total | 2003 | 290107933.0 | NaN |
2523 | USA | under18 | 2003 | 73100758.0 | NaN |
2524 | USA | total | 2004 | 292805298.0 | NaN |
2525 | USA | under18 | 2004 | 73297735.0 | NaN |
2526 | USA | total | 2005 | 295516599.0 | NaN |
2527 | USA | under18 | 2005 | 73523669.0 | NaN |
2528 | USA | total | 2006 | 298379912.0 | NaN |
2529 | USA | under18 | 2006 | 73757714.0 | NaN |
2530 | USA | total | 2007 | 301231207.0 | NaN |
2531 | USA | under18 | 2007 | 74019405.0 | NaN |
2532 | USA | total | 2008 | 304093966.0 | NaN |
2533 | USA | under18 | 2008 | 74104602.0 | NaN |
2534 | USA | under18 | 2013 | 73585872.0 | NaN |
2535 | USA | total | 2013 | 316128839.0 | NaN |
2536 | USA | total | 2009 | 306771529.0 | NaN |
2537 | USA | under18 | 2009 | 74134167.0 | NaN |
2538 | USA | under18 | 2010 | 74119556.0 | NaN |
2539 | USA | total | 2010 | 309326295.0 | NaN |
2540 | USA | under18 | 2011 | 73902222.0 | NaN |
2541 | USA | total | 2011 | 311582564.0 | NaN |
2542 | USA | under18 | 2012 | 73708179.0 | NaN |
2543 | USA | total | 2012 | 313873685.0 | NaN |
96 rows × 5 columns
Inspect the missing data.
Use a boolean mask to show only the rows whose state value is missing (True means missing).
cond_state = pop2["state"].isnull()
Find which state/region codes have a NaN state, using unique() to list the distinct values.
pop2[cond_state]["state/region"].unique()
array(['PR', 'USA'], dtype=object)
Fill in the correct state value for these state/region codes, which removes every NaN from the state column.
Remember this technique for clearing NaN values!
cond_pr = pop2["state/region"] == "PR"
cond_usa = pop2["state/region"] == "USA"
pop2["state"][cond_pr] = "Puerto Rico"
pop2["state"][cond_usa] = "United State"
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
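The warning is triggered because pop2["state"][cond_pr] = ... is chained indexing. A sketch of the warning-free equivalent with .loc, using the same masks (and keeping the "United State" spelling used throughout this notebook):
pop2.loc[cond_pr, "state"] = "Puerto Rico"
pop2.loc[cond_usa, "state"] = "United State"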
pop2.head()
state/region | ages | year | population | state | |
---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama |
1 | AL | total | 2012 | 4817528.0 | Alabama |
2 | AL | under18 | 2010 | 1130966.0 | Alabama |
3 | AL | total | 2010 | 4785570.0 | Alabama |
4 | AL | under18 | 2011 | 1125763.0 | Alabama |
pop2.isnull().any()
state/region False ages False year False population True state False dtype: bool
Merge in the state area data areas, using an outer merge.
Think about why an outer merge is used here.
areas.head()
state | area (sq. mi) | |
---|---|---|
0 | Alabama | 52423 |
1 | Alaska | 656425 |
2 | Arizona | 114006 |
3 | Arkansas | 53182 |
4 | California | 163707 |
pop3 = pop2.merge(areas,how="outer")
pop3.head()
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama | 52423.0 |
1 | AL | total | 2012 | 4817528.0 | Alabama | 52423.0 |
2 | AL | under18 | 2010 | 1130966.0 | Alabama | 52423.0 |
3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
4 | AL | under18 | 2011 | 1125763.0 | Alabama | 52423.0 |
Keep looking for columns with missing data.
pop3.isnull().any()
state/region False ages False year False population True state False area (sq. mi) True dtype: bool
cond_area = pop3["area (sq. mi)"].isnull()
pop3[cond_area]
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
2496 | USA | under18 | 1990 | 64218512.0 | United State | NaN |
2497 | USA | total | 1990 | 249622814.0 | United State | NaN |
2498 | USA | total | 1991 | 252980942.0 | United State | NaN |
2499 | USA | under18 | 1991 | 65313018.0 | United State | NaN |
2500 | USA | under18 | 1992 | 66509177.0 | United State | NaN |
2501 | USA | total | 1992 | 256514231.0 | United State | NaN |
2502 | USA | total | 1993 | 259918595.0 | United State | NaN |
2503 | USA | under18 | 1993 | 67594938.0 | United State | NaN |
2504 | USA | under18 | 1994 | 68640936.0 | United State | NaN |
2505 | USA | total | 1994 | 263125826.0 | United State | NaN |
2506 | USA | under18 | 1995 | 69473140.0 | United State | NaN |
2507 | USA | under18 | 1996 | 70233512.0 | United State | NaN |
2508 | USA | total | 1995 | 266278403.0 | United State | NaN |
2509 | USA | total | 1996 | 269394291.0 | United State | NaN |
2510 | USA | total | 1997 | 272646932.0 | United State | NaN |
2511 | USA | under18 | 1997 | 70920738.0 | United State | NaN |
2512 | USA | under18 | 1998 | 71431406.0 | United State | NaN |
2513 | USA | total | 1998 | 275854116.0 | United State | NaN |
2514 | USA | under18 | 1999 | 71946051.0 | United State | NaN |
2515 | USA | total | 2000 | 282162411.0 | United State | NaN |
2516 | USA | under18 | 2000 | 72376189.0 | United State | NaN |
2517 | USA | total | 1999 | 279040181.0 | United State | NaN |
2518 | USA | total | 2001 | 284968955.0 | United State | NaN |
2519 | USA | under18 | 2001 | 72671175.0 | United State | NaN |
2520 | USA | total | 2002 | 287625193.0 | United State | NaN |
2521 | USA | under18 | 2002 | 72936457.0 | United State | NaN |
2522 | USA | total | 2003 | 290107933.0 | United State | NaN |
2523 | USA | under18 | 2003 | 73100758.0 | United State | NaN |
2524 | USA | total | 2004 | 292805298.0 | United State | NaN |
2525 | USA | under18 | 2004 | 73297735.0 | United State | NaN |
2526 | USA | total | 2005 | 295516599.0 | United State | NaN |
2527 | USA | under18 | 2005 | 73523669.0 | United State | NaN |
2528 | USA | total | 2006 | 298379912.0 | United State | NaN |
2529 | USA | under18 | 2006 | 73757714.0 | United State | NaN |
2530 | USA | total | 2007 | 301231207.0 | United State | NaN |
2531 | USA | under18 | 2007 | 74019405.0 | United State | NaN |
2532 | USA | total | 2008 | 304093966.0 | United State | NaN |
2533 | USA | under18 | 2008 | 74104602.0 | United State | NaN |
2534 | USA | under18 | 2013 | 73585872.0 | United State | NaN |
2535 | USA | total | 2013 | 316128839.0 | United State | NaN |
2536 | USA | total | 2009 | 306771529.0 | United State | NaN |
2537 | USA | under18 | 2009 | 74134167.0 | United State | NaN |
2538 | USA | under18 | 2010 | 74119556.0 | United State | NaN |
2539 | USA | total | 2010 | 309326295.0 | United State | NaN |
2540 | USA | under18 | 2011 | 73902222.0 | United State | NaN |
2541 | USA | total | 2011 | 311582564.0 | United State | NaN |
2542 | USA | under18 | 2012 | 73708179.0 | United State | NaN |
2543 | USA | total | 2012 | 313873685.0 | United State | NaN |
The area (sq. mi) column has missing data; the rows above show that it is the USA entries that have no area.
# the USA area is missing, so we can sum the areas of all the states
usa_area = areas["area (sq. mi)"].sum()
# assign the total area to the USA rows in pop3
pop3["area (sq. mi)"][cond_area] = usa_area
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
pop3
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama | 52423.0 |
1 | AL | total | 2012 | 4817528.0 | Alabama | 52423.0 |
2 | AL | under18 | 2010 | 1130966.0 | Alabama | 52423.0 |
3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
4 | AL | under18 | 2011 | 1125763.0 | Alabama | 52423.0 |
5 | AL | total | 2011 | 4801627.0 | Alabama | 52423.0 |
6 | AL | total | 2009 | 4757938.0 | Alabama | 52423.0 |
7 | AL | under18 | 2009 | 1134192.0 | Alabama | 52423.0 |
8 | AL | under18 | 2013 | 1111481.0 | Alabama | 52423.0 |
9 | AL | total | 2013 | 4833722.0 | Alabama | 52423.0 |
10 | AL | total | 2007 | 4672840.0 | Alabama | 52423.0 |
11 | AL | under18 | 2007 | 1132296.0 | Alabama | 52423.0 |
12 | AL | total | 2008 | 4718206.0 | Alabama | 52423.0 |
13 | AL | under18 | 2008 | 1134927.0 | Alabama | 52423.0 |
14 | AL | total | 2005 | 4569805.0 | Alabama | 52423.0 |
15 | AL | under18 | 2005 | 1117229.0 | Alabama | 52423.0 |
16 | AL | total | 2006 | 4628981.0 | Alabama | 52423.0 |
17 | AL | under18 | 2006 | 1126798.0 | Alabama | 52423.0 |
18 | AL | total | 2004 | 4530729.0 | Alabama | 52423.0 |
19 | AL | under18 | 2004 | 1113662.0 | Alabama | 52423.0 |
20 | AL | total | 2003 | 4503491.0 | Alabama | 52423.0 |
21 | AL | under18 | 2003 | 1113083.0 | Alabama | 52423.0 |
22 | AL | total | 2001 | 4467634.0 | Alabama | 52423.0 |
23 | AL | under18 | 2001 | 1120409.0 | Alabama | 52423.0 |
24 | AL | total | 2002 | 4480089.0 | Alabama | 52423.0 |
25 | AL | under18 | 2002 | 1116590.0 | Alabama | 52423.0 |
26 | AL | under18 | 1999 | 1121287.0 | Alabama | 52423.0 |
27 | AL | total | 1999 | 4430141.0 | Alabama | 52423.0 |
28 | AL | total | 2000 | 4452173.0 | Alabama | 52423.0 |
29 | AL | under18 | 2000 | 1122273.0 | Alabama | 52423.0 |
... | ... | ... | ... | ... | ... | ... |
2514 | USA | under18 | 1999 | 71946051.0 | United State | 3790399.0 |
2515 | USA | total | 2000 | 282162411.0 | United State | 3790399.0 |
2516 | USA | under18 | 2000 | 72376189.0 | United State | 3790399.0 |
2517 | USA | total | 1999 | 279040181.0 | United State | 3790399.0 |
2518 | USA | total | 2001 | 284968955.0 | United State | 3790399.0 |
2519 | USA | under18 | 2001 | 72671175.0 | United State | 3790399.0 |
2520 | USA | total | 2002 | 287625193.0 | United State | 3790399.0 |
2521 | USA | under18 | 2002 | 72936457.0 | United State | 3790399.0 |
2522 | USA | total | 2003 | 290107933.0 | United State | 3790399.0 |
2523 | USA | under18 | 2003 | 73100758.0 | United State | 3790399.0 |
2524 | USA | total | 2004 | 292805298.0 | United State | 3790399.0 |
2525 | USA | under18 | 2004 | 73297735.0 | United State | 3790399.0 |
2526 | USA | total | 2005 | 295516599.0 | United State | 3790399.0 |
2527 | USA | under18 | 2005 | 73523669.0 | United State | 3790399.0 |
2528 | USA | total | 2006 | 298379912.0 | United State | 3790399.0 |
2529 | USA | under18 | 2006 | 73757714.0 | United State | 3790399.0 |
2530 | USA | total | 2007 | 301231207.0 | United State | 3790399.0 |
2531 | USA | under18 | 2007 | 74019405.0 | United State | 3790399.0 |
2532 | USA | total | 2008 | 304093966.0 | United State | 3790399.0 |
2533 | USA | under18 | 2008 | 74104602.0 | United State | 3790399.0 |
2534 | USA | under18 | 2013 | 73585872.0 | United State | 3790399.0 |
2535 | USA | total | 2013 | 316128839.0 | United State | 3790399.0 |
2536 | USA | total | 2009 | 306771529.0 | United State | 3790399.0 |
2537 | USA | under18 | 2009 | 74134167.0 | United State | 3790399.0 |
2538 | USA | under18 | 2010 | 74119556.0 | United State | 3790399.0 |
2539 | USA | total | 2010 | 309326295.0 | United State | 3790399.0 |
2540 | USA | under18 | 2011 | 73902222.0 | United State | 3790399.0 |
2541 | USA | total | 2011 | 311582564.0 | United State | 3790399.0 |
2542 | USA | under18 | 2012 | 73708179.0 | United State | 3790399.0 |
2543 | USA | total | 2012 | 313873685.0 | United State | 3790399.0 |
2544 rows × 6 columns
Drop the rows that still contain missing data.
pop3.isnull().any()
state/region False ages False year False population True state False area (sq. mi) False dtype: bool
Check whether any missing data remains.
pop3.dropna(inplace=True)
pop3.isnull().any()
state/region False ages False year False population False state False area (sq. mi) False dtype: bool
Find the total (all-ages) population data for 2010 with df.query(query string).
pop3.head()
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama | 52423.0 |
1 | AL | total | 2012 | 4817528.0 | Alabama | 52423.0 |
2 | AL | under18 | 2010 | 1130966.0 | Alabama | 52423.0 |
3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
4 | AL | under18 | 2011 | 1125763.0 | Alabama | 52423.0 |
pop3.query("year==2010 & ages=='total' & state=='United State'")
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
2539 | USA | total | 2010 | 309326295.0 | United State | 3790399.0 |
pop_2010 = pop3.query("year==2010 & ages=='total'")
pop_2010
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
91 | AK | total | 2010 | 713868.0 | Alaska | 656425.0 |
101 | AZ | total | 2010 | 6408790.0 | Arizona | 114006.0 |
189 | AR | total | 2010 | 2922280.0 | Arkansas | 53182.0 |
197 | CA | total | 2010 | 37333601.0 | California | 163707.0 |
283 | CO | total | 2010 | 5048196.0 | Colorado | 104100.0 |
293 | CT | total | 2010 | 3579210.0 | Connecticut | 5544.0 |
379 | DE | total | 2010 | 899711.0 | Delaware | 1954.0 |
389 | DC | total | 2010 | 605125.0 | District of Columbia | 68.0 |
475 | FL | total | 2010 | 18846054.0 | Florida | 65758.0 |
485 | GA | total | 2010 | 9713248.0 | Georgia | 59441.0 |
570 | HI | total | 2010 | 1363731.0 | Hawaii | 10932.0 |
581 | ID | total | 2010 | 1570718.0 | Idaho | 83574.0 |
666 | IL | total | 2010 | 12839695.0 | Illinois | 57918.0 |
677 | IN | total | 2010 | 6489965.0 | Indiana | 36420.0 |
762 | IA | total | 2010 | 3050314.0 | Iowa | 56276.0 |
773 | KS | total | 2010 | 2858910.0 | Kansas | 82282.0 |
858 | KY | total | 2010 | 4347698.0 | Kentucky | 40411.0 |
869 | LA | total | 2010 | 4545392.0 | Louisiana | 51843.0 |
954 | ME | total | 2010 | 1327366.0 | Maine | 35387.0 |
965 | MD | total | 2010 | 5787193.0 | Maryland | 12407.0 |
1050 | MA | total | 2010 | 6563263.0 | Massachusetts | 10555.0 |
1061 | MI | total | 2010 | 9876149.0 | Michigan | 96810.0 |
1146 | MN | total | 2010 | 5310337.0 | Minnesota | 86943.0 |
1157 | MS | total | 2010 | 2970047.0 | Mississippi | 48434.0 |
1242 | MO | total | 2010 | 5996063.0 | Missouri | 69709.0 |
1253 | MT | total | 2010 | 990527.0 | Montana | 147046.0 |
1338 | NE | total | 2010 | 1829838.0 | Nebraska | 77358.0 |
1349 | NV | total | 2010 | 2703230.0 | Nevada | 110567.0 |
1434 | NH | total | 2010 | 1316614.0 | New Hampshire | 9351.0 |
1445 | NJ | total | 2010 | 8802707.0 | New Jersey | 8722.0 |
1530 | NM | total | 2010 | 2064982.0 | New Mexico | 121593.0 |
1541 | NY | total | 2010 | 19398228.0 | New York | 54475.0 |
1626 | NC | total | 2010 | 9559533.0 | North Carolina | 53821.0 |
1637 | ND | total | 2010 | 674344.0 | North Dakota | 70704.0 |
1722 | OH | total | 2010 | 11545435.0 | Ohio | 44828.0 |
1733 | OK | total | 2010 | 3759263.0 | Oklahoma | 69903.0 |
1818 | OR | total | 2010 | 3837208.0 | Oregon | 98386.0 |
1829 | PA | total | 2010 | 12710472.0 | Pennsylvania | 46058.0 |
1914 | RI | total | 2010 | 1052669.0 | Rhode Island | 1545.0 |
1925 | SC | total | 2010 | 4636361.0 | South Carolina | 32007.0 |
2010 | SD | total | 2010 | 816211.0 | South Dakota | 77121.0 |
2021 | TN | total | 2010 | 6356683.0 | Tennessee | 42146.0 |
2106 | TX | total | 2010 | 25245178.0 | Texas | 268601.0 |
2117 | UT | total | 2010 | 2774424.0 | Utah | 84904.0 |
2202 | VT | total | 2010 | 625793.0 | Vermont | 9615.0 |
2213 | VA | total | 2010 | 8024417.0 | Virginia | 42769.0 |
2298 | WA | total | 2010 | 6742256.0 | Washington | 71303.0 |
2309 | WV | total | 2010 | 1854146.0 | West Virginia | 24231.0 |
2394 | WI | total | 2010 | 5689060.0 | Wisconsin | 65503.0 |
2405 | WY | total | 2010 | 564222.0 | Wyoming | 97818.0 |
2490 | PR | total | 2010 | 3721208.0 | Puerto Rico | 3515.0 |
2539 | USA | total | 2010 | 309326295.0 | United State | 3790399.0 |
Post-process the query result by making the state column the new row index with set_index.
pop_2010.set_index("state",inplace=True)
pop_2010.shape
(53, 5)
Compute the population density. Note that this is Series / Series, and the result is still a Series.
pop_dense = pop_2010["population"] / pop_2010["area (sq. mi)"]
pop_dense
state Alabama 91.287603 Alaska 1.087509 Arizona 56.214497 Arkansas 54.948667 California 228.051342 Colorado 48.493718 Connecticut 645.600649 Delaware 460.445752 District of Columbia 8898.897059 Florida 286.597129 Georgia 163.409902 Hawaii 124.746707 Idaho 18.794338 Illinois 221.687472 Indiana 178.197831 Iowa 54.202751 Kansas 34.745266 Kentucky 107.586994 Louisiana 87.676099 Maine 37.509990 Maryland 466.445797 Massachusetts 621.815538 Michigan 102.015794 Minnesota 61.078373 Mississippi 61.321530 Missouri 86.015622 Montana 6.736171 Nebraska 23.654153 Nevada 24.448796 New Hampshire 140.799273 New Jersey 1009.253268 New Mexico 16.982737 New York 356.094135 North Carolina 177.617157 North Dakota 9.537565 Ohio 257.549634 Oklahoma 53.778278 Oregon 39.001565 Pennsylvania 275.966651 Rhode Island 681.339159 South Carolina 144.854594 South Dakota 10.583512 Tennessee 150.825298 Texas 93.987655 Utah 32.677188 Vermont 65.085075 Virginia 187.622273 Washington 94.557817 West Virginia 76.519582 Wisconsin 86.851900 Wyoming 5.768079 Puerto Rico 1058.665149 United State 81.607845 dtype: float64
Sort with sort_values() and find the five states with the highest population density.
pop_dense.sort_values(inplace=True)
pop_dense
state Alaska 1.087509 Wyoming 5.768079 Montana 6.736171 North Dakota 9.537565 South Dakota 10.583512 New Mexico 16.982737 Idaho 18.794338 Nebraska 23.654153 Nevada 24.448796 Utah 32.677188 Kansas 34.745266 Maine 37.509990 Oregon 39.001565 Colorado 48.493718 Oklahoma 53.778278 Iowa 54.202751 Arkansas 54.948667 Arizona 56.214497 Minnesota 61.078373 Mississippi 61.321530 Vermont 65.085075 West Virginia 76.519582 United State 81.607845 Missouri 86.015622 Wisconsin 86.851900 Louisiana 87.676099 Alabama 91.287603 Texas 93.987655 Washington 94.557817 Michigan 102.015794 Kentucky 107.586994 Hawaii 124.746707 New Hampshire 140.799273 South Carolina 144.854594 Tennessee 150.825298 Georgia 163.409902 North Carolina 177.617157 Indiana 178.197831 Virginia 187.622273 Illinois 221.687472 California 228.051342 Ohio 257.549634 Pennsylvania 275.966651 Florida 286.597129 New York 356.094135 Delaware 460.445752 Maryland 466.445797 Massachusetts 621.815538 Connecticut 645.600649 Rhode Island 681.339159 New Jersey 1009.253268 Puerto Rico 1058.665149 District of Columbia 8898.897059 dtype: float64
pop_dense.tail()
state Connecticut 645.600649 Rhode Island 681.339159 New Jersey 1009.253268 Puerto Rico 1058.665149 District of Columbia 8898.897059 dtype: float64
Find the five states with the lowest population density.
pop_dense.head()
state Alaska 1.087509 Wyoming 5.768079 Montana 6.736171 North Dakota 9.537565 South Dakota 10.583512 dtype: float64
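As an alternative to sorting first, Series.nlargest and Series.nsmallest return the same answers directly. A minimal sketch:
pop_dense.nlargest(5)    # five highest densities
pop_dense.nsmallest(5)   # five lowest densities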
Key takeaways:
- Use .loc[] indexing consistently
- Use .isnull().any() to find the columns that contain NaN
- Use .unique() to see which keys in a column need attention
- Generally prefer outer or left merges, for one reason: better to end up with NaN in a column than to lose information from the other columns
Review: differences between Series/DataFrame arithmetic and ndarray arithmetic
- Series and DataFrame align on labels instead of broadcasting by position; where a label has no counterpart the result is NaN, or use add with fill_value to fill the gap (see the sketch below)
- ndarray uses broadcasting, repeating existing values to make the shapes match
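A minimal sketch of the fill_value behaviour; the two small Series below are made up for illustration:
s1 = Series([1, 2, 3], index=["a", "b", "c"])
s2 = Series([10, 20], index=["b", "c"])
s1 + s2                    # index 'a' has no match, so the result there is NaN
s1.add(s2, fill_value=0)   # missing entries are treated as 0 instead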