操作环境: window10,Python3.7,Jupyter notebook
数据资料: https://www.lanzous.com/i98lfra

文章目录

  • 1、数据载入与总览
    • 1.1、数据加载
    • 1.2、数据合并
    • 1.3、数据预览和基本统计分析
      • 1.3.1、查看数据形状
      • 1.3.2、查看是否有空值
      • 1.3.3、查看描述性的信息
  • 2、数据清洗
    • 2.1、缺失值处理
      • 2.1.1、查看带有缺失值的列
      • 2.1.2、空值指定填充
    • 2.2、数据转换
      • 2.2.1、候选人去重
      • 2.2.2、添加党派
      • 2.2.3、统计两个党派支持的次数
      • 2.2.4、查询每个政党的捐献额
      • 2.2.5、按照职业汇总对赞助总金额进行排序
      • 2.2.6、职业类型去重
    • 2.3、数据筛选
      • 2.3.1、赞助金额筛选
      • 2.3.2、查看各候选人获得的赞助总金额
      • 2.3.3、数据可视化
      • 2.3.4、选取选举人为Obama、Romney的子集数据
    • 2.4、面元化数据
  • 3、数据聚合与分组运算
    • 3.1、透视表(pivot_table)分析党派和职业
      • 3.1.1、求每个职业为两个党派献金之和
      • 3.1.2、过滤掉总和小于200W的数据
      • 3.1.3、数据可视化
    • 3.2、分组及运算和转换
      • 3.2.1、不同职业对他两人的支持度
      • 3.2.2、不同公司对他两人的支持度
    • 3.3、统计各区间的赞助金额
      • 3.3.1、查看区间金额
      • 3.3.2、绘制Obama和Romney各区间赞助的总金额
      • 3.3.3、过滤掉大金额
      • 3.3.4、百分比堆积图
  • 4、时间处理
    • 4.1、查看数据类型
    • 4.2、str转datetime
  • 5、各州支持率
    • 5.1、数据分组
    • 5.2、候选人各州金额占比
    • 5.3、删除不存在的州
    • 5.4、绘制地图
      • 5.4.1、导入相关库
      • 5.4.2、同一绘制不同颜色方法
      • 5.4.3、查看Obama各州均值
      • 5.4.3、绘制美国地图

1、数据载入与总览

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

1.1、数据加载

contb1 = pd.read_csv('./contb_01.csv')
contb2 = pd.read_csv('./contb_02.csv')
contb3 = pd.read_csv('./contb_03.csv')

1.2、数据合并

contb = pd.concat([contb1, contb2, contb3], axis=0)
contb.head() #查看前五行
cand_nm contbr_nm contbr_st contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt
0 Bachmann, Michelle HARVEY, WILLIAM AL RETIRED RETIRED 250.0 20-JUN-11
1 Bachmann, Michelle HARVEY, WILLIAM AL RETIRED RETIRED 50.0 23-JUN-11
2 Bachmann, Michelle SMITH, LANIER AL INFORMATION REQUESTED INFORMATION REQUESTED 250.0 05-JUL-11
3 Bachmann, Michelle BLEVINS, DARONDA AR NONE RETIRED 250.0 01-AUG-11
4 Bachmann, Michelle WARDENBURG, HAROLD AR NONE RETIRED 300.0 20-JUN-11

字段解释:

cmte_id :候选人ID
cand_nm :候选人姓名
contbr_nm : 捐赠人姓名
contbr_st :捐赠人所在州
contbr_employer : 捐赠人所在公司
contbr_occupation : 捐赠人职业
contb_receipt_amt :捐赠数额(美元)
contb_receipt_dt : 捐款的日期

1.3、数据预览和基本统计分析

1.3.1、查看数据形状

contb.shape
(1001733, 7)
  • 一共有1001733行,7列数据

1.3.2、查看是否有空值

contb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001733 entries, 0 to 1730
Data columns (total 7 columns):
cand_nm              1001733 non-null object
contbr_nm            1001733 non-null object
contbr_st            1001729 non-null object
contbr_employer      988004 non-null object
contbr_occupation    993303 non-null object
contb_receipt_amt    1001733 non-null float64
contb_receipt_dt     1001733 non-null object
dtypes: float64(1), object(6)
memory usage: 61.1+ MB
  • 从上面的结果可以看出:

    • 一共有1001733行数据,其中contbr_st,contbr_employe,contbr_occupation 有空值。

1.3.3、查看描述性的信息

contb.describe()
contb_receipt_amt
count 1.001733e+06
mean 2.982358e+02
std 3.749663e+03
min -3.080000e+04
25% 3.500000e+01
50% 1.000000e+02
75% 2.500000e+02
max 2.014491e+06

2、数据清洗

2.1、缺失值处理

2.1.1、查看带有缺失值的列

cond = contb['contbr_employer'].isnull() #判断是否有缺失值,返回True和False
contb[cond] #筛选满足的数据
cand_nm contbr_nm contbr_st contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt
41 Bachmann, Michelle MINNIS, RITA CA NaN NaN -1500.0 20-JUN-11
264 Bachmann, Michelle BISHOP, GERARD NY NaN NaN -1700.0 28-JUN-11
752 Romney, Mitt KNIGHT, RENA AL NaN NaN -60.0 07-MAR-12
897 Romney, Mitt THE STEWART FIRM L.L.C. AL NaN NaN 250.0 23-MAR-12
1033 Romney, Mitt SELLERS, LEE AL NaN NaN -120.0 20-JAN-12
... ... ... ... ... ... ... ...
1715 Perry, Rick TEXAS ENERGY L.L.C. WY NaN NaN 250.0 30-SEP-11
1717 Perry, Rick HARDER, ROBERT WY NaN RETIRED 100.0 12-NOV-11
1720 Perry, Rick HARDER, ROBERT WY NaN RETIRED 100.0 01-OCT-11
1722 Perry, Rick HARDER, ROBERT WY NaN RETIRED 100.0 29-OCT-11
1723 Perry, Rick HARDER, ROBERT WY NaN RETIRED 100.0 30-NOV-11

2.1.2、空值指定填充

  • 用某个值去填充上面的空值
  • 上面contbr_st,contbr_employe,contbr_occupation 的一些用户没有提供信息,这是使用NOT PROVIDE 进行填充。
# 填充空值,NOT PROVIDE没有提供
contb['contbr_employer'].fillna('NOT PROVIDE', inplace=True) #用NOT PROVIDE填充空值
#填充contbr_occupation为空的值为NOT PROVIDE
contb['contbr_occupation'].fillna('NOT PROVIDE', inplace=True)
contb['contbr_st'].fillna('NOT PROVIDE', inplace=True)
contb.info() # 查看填充后的结果
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001733 entries, 0 to 1730
Data columns (total 7 columns):
cand_nm              1001733 non-null object
contbr_nm            1001733 non-null object
contbr_st            1001733 non-null object
contbr_employer      1001733 non-null object
contbr_occupation    1001733 non-null object
contb_receipt_amt    1001733 non-null float64
contb_receipt_dt     1001733 non-null object
dtypes: float64(1), object(6)
memory usage: 61.1+ MB
  • 现在所有带有空值的数据被 NOT PROVIDE 填充后,不再有空值。

2.2、数据转换

利用字典映射进行转换:党派分析

2.2.1、候选人去重

contb['cand_nm'].unique()
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',"Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy','Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick','Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G','Huntsman, Jon', 'Perry, Rick'], dtype=object)
  • 去重后发现一共有13名候选人参与竞选

2.2.2、添加党派

  • 通过搜索引擎等途径,获取每个总统候选人的所属党派,建立字典parties,候选人名字作为键,所属党派作为对应的值

建立字典:

parties = {'Bachmann, Michelle': 'Republican','Romney, Mitt': 'Republican','Obama, Barack': 'Democrat',"Roemer, Charles E. 'Buddy' III": 'Republican','Pawlenty, Timothy': 'Republican','Johnson, Gary Earl': 'Republican','Paul, Ron': 'Republican','Santorum, Rick': 'Republican','Cain, Herman': 'Republican','Gingrich, Newt': 'Republican','McCotter, Thaddeus G': 'Republican','Huntsman, Jon': 'Republican','Perry, Rick': 'Republican'           }

数据映射:

%%time
contb['party'] = contb['cand_nm'].map(parties)
Wall time: 129 ms
  • 新加党派,使用map字典映射,100万数据,增加一列耗时129ms

查看前5行:

contb.head()
cand_nm contbr_nm contbr_st contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt party
0 Bachmann, Michelle HARVEY, WILLIAM AL RETIRED RETIRED 250.0 20-JUN-11 Republican
1 Bachmann, Michelle HARVEY, WILLIAM AL RETIRED RETIRED 50.0 23-JUN-11 Republican
2 Bachmann, Michelle SMITH, LANIER AL INFORMATION REQUESTED INFORMATION REQUESTED 250.0 05-JUL-11 Republican
3 Bachmann, Michelle BLEVINS, DARONDA AR NONE RETIRED 250.0 01-AUG-11 Republican
4 Bachmann, Michelle WARDENBURG, HAROLD AR NONE RETIRED 300.0 20-JUN-11 Republican

2.2.3、统计两个党派支持的次数

contb['party'].value_counts()
Democrat      593747
Republican    407986
Name: party, dtype: int64

2.2.4、查询每个政党的捐献额

contb.groupby('party')['contb_receipt_amt'].sum()
party
Democrat      1.335028e+08
Republican    1.652498e+08
Name: contb_receipt_amt, dtype: float64

2.2.5、按照职业汇总对赞助总金额进行排序

grouped_occupation = contb.groupby(['contbr_occupation'])['contb_receipt_amt'].sum()
grouped_occupation.sort_values(ascending=False) #ascending=False降序
contbr_occupation
RETIRED                                   48176647.00
ATTORNEY                                  18470473.30
HOMEMAKER                                 17484807.65
INFORMATION REQUESTED PER BEST EFFORTS    15859514.55
INFORMATION REQUESTED                      8742357.59...
PRES OF GAS & ELECTRIC                       -2500.00
AVIATION ATTORNEY                            -2500.00
DREDGING                                     -2500.00
METAL SMITH                                  -4225.00
VENTURE PHILANTHROPIST                       -5000.00
Name: contb_receipt_amt, Length: 45074, dtype: float64

2.2.6、职业类型去重

  • 整理同种类型职业
  • 按照职位进行汇总,计算赞助总金额,发现不少职业都是相同的,只不过是表达形式不同而已,如C.E.0与CEO,都是一个职业
  • 利用函数进行转换:职业与雇主信息分析
  • 建立一个职业对应的字典,把相同职业的不同表达映射为对应的职业,比如C.E.O映射为CEO
#整理一部分相同的职业,如果全部整理需要花费很长时间
occupation = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDE','INFORMATION REQUESTED':'NOT PROVIDE','C.E.O.':'CEO','LAWYER':'ATTORNEY','SELF':'SELF-EMPLOYED','SELF EMPLOYED ':'SELF-EMPLOYED'}
f = lambda x : occupation.get(x, x)
contb['contbr_occupation'] = contb['contbr_occupation'].map(f)
# 统计同种职业捐献的总额,查看前5个
contb.groupby(['contbr_employer'])['contb_receipt_amt'].sum().sort_values(ascending = False)[:10
contbr_employer
RETIRED           41374333.67
SELF-EMPLOYED     37483895.22
NOT PROVIDE       31281997.76
HOMEMAKER         14738524.86
NOT EMPLOYED       8636809.43
NONE               3809582.99
STUDENT             957971.85
REQUESTED           894009.54
MORGAN STANLEY      386129.40
UNEMPLOYED          377088.31
Name: contb_receipt_amt, dtype: float64

2.3、数据筛选

2.3.1、赞助金额筛选

  • 去掉金额小于 “0” 的异常数据
# 捐赠金额大于0
contb_over = contb[contb['contb_receipt_amt'] > 0]
contb_over.head()
cand_nm contbr_nm contbr_st contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt party
0 Bachmann, Michelle HARVEY, WILLIAM AL RETIRED RETIRED 250.0 20-JUN-11 Republican
1 Bachmann, Michelle HARVEY, WILLIAM AL RETIRED RETIRED 50.0 23-JUN-11 Republican
2 Bachmann, Michelle SMITH, LANIER AL NOT PROVIDE NOT PROVIDE 250.0 05-JUL-11 Republican
3 Bachmann, Michelle BLEVINS, DARONDA AR NONE RETIRED 250.0 01-AUG-11 Republican
4 Bachmann, Michelle WARDENBURG, HAROLD AR NONE RETIRED 300.0 20-JUN-11 Republican
#去重前
contb.shape  #(1001733, 8)#去重后
contb_over.shape  #(991477, 8)

2.3.2、查看各候选人获得的赞助总金额

cand_nm_amt = contb_over.groupby(['cand_nm'])['contb_receipt_amt'].sum().sort_values(ascending = False)
cand_nm_amt
cand_nm
Obama, Barack                     1.358776e+08
Romney, Mitt                      8.833591e+07
Paul, Ron                         2.100962e+07
Perry, Rick                       2.030675e+07
Gingrich, Newt                    1.283277e+07
Santorum, Rick                    1.104316e+07
Cain, Herman                      7.101082e+06
Pawlenty, Timothy                 6.004819e+06
Huntsman, Jon                     3.330373e+06
Bachmann, Michelle                2.711439e+06
Johnson, Gary Earl                5.669616e+05
Roemer, Charles E. 'Buddy' III    3.730099e+05
McCotter, Thaddeus G              3.903000e+04
Name: contb_receipt_amt, dtype: float64

2.3.3、数据可视化

plt.figure(figsize=(8, 8))
cand_nm_amt.plot(kind='pie') #

  • 从上面的数据可以看出,支持Obama, Barack 和 Romney, Mitt 的人是最多的。

2.3.4、选取选举人为Obama、Romney的子集数据

方法一:

cond1 = contb_over['cand_nm'] == 'Obama, Barack'
cond2 = contb_over['cand_nm'] == 'Romney, Mitt'
# 这是一个与运算
cond = cond1|cond2
cond.sum() #694283次
contb_vs = contb_over[cond]
contb_vs

方法二:

contb_over.query("cand_nm == 'Obama, Barack' or cand_nm == 'Romney, Mitt'")

方法三:

cond = contb_over['cand_nm'].isin(['Obama, Barack', 'Romney, Mitt'])
contb_over[cond]

结果:

cand_nm contbr_nm contbr_st contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt party
411 Romney, Mitt ELDERBAUM, WILLIAM AA US GOVERNMENT FOREIGN SERVICE OFFICER 25.0 01-FEB-12 Republican
412 Romney, Mitt ELDERBAUM, WILLIAM AA US GOVERNMENT FOREIGN SERVICE OFFICER 110.0 01-FEB-12 Republican
413 Romney, Mitt CARLSEN, RICHARD AE DEFENSE INTELLIGENCE AGENCY INTELLIGENCE ANALYST 250.0 13-APR-12 Republican
414 Romney, Mitt DELUCA, PIERRE AE CISCO ENGINEER 30.0 21-AUG-11 Republican
415 Romney, Mitt SARGENT, MICHAEL AE RAYTHEON TECHNICAL SERVICES CORP COMPUTER SYSTEMS ENGINEER 100.0 07-MAR-12 Republican
... ... ... ... ... ... ... ... ...
201381 Obama, Barack MOUNTS, ROBERT ZZ HQ USFK (FKDC-SA) GS-15 INTERNATIONAL RELATIONS OFFICER 25.0 26-FEB-12 Democrat
201382 Obama, Barack TAITANO, TYRONE ZZ NOT EMPLOYED RETIRED 250.0 20-JAN-12 Democrat
201383 Obama, Barack TUCKER, PAMELA ZZ DODEA EDUCATOR 3.0 20-JAN-12 Democrat
201384 Obama, Barack MOUNTS, ROBERT ZZ HQ USFK (FKDC-SA) GS-15 INTERNATIONAL RELATIONS OFFICER 25.0 26-APR-12 Democrat
201385 Obama, Barack NEAL, AMBER ZZ THE DEPARTMENT OF DEFENSE EDUCATION AC TEACHER 135.0 04-SEP-11 Democrat

2.4、面元化数据

接下来我们对该数据做另一种非常实用的分析,利用cut函数根据出资额大小将数据离散化到多个面元中

contb_over['contb_receipt_amt'].sort_values()
323688          0.01
327537          0.01
327468          0.01
326458          0.01
317091          0.01...
319478     526246.17
344419    1511192.17
344539    1679114.65
326651    1944042.43
325136    2014490.51
Name: contb_receipt_amt, Length: 991477, dtype: float64
  • 他们的金额出现在 (0.01,2014490.51)之间,现在将它们进行范围分组。
bins = [0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]
labels = pd.cut(contb_over['contb_receipt_amt'], bins)
labels
0         (100, 1000]
1           (10, 100]
2         (100, 1000]
3         (100, 1000]
4         (100, 1000]...
1726    (1000, 10000]
1727    (1000, 10000]
1728      (100, 1000]
1729      (100, 1000]
1730    (1000, 10000]
Name: contb_receipt_amt, Length: 991477, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]

3、数据聚合与分组运算

contb_over.columns
Index(['cand_nm', 'contbr_nm', 'contbr_st', 'contbr_employer','contbr_occupation', 'contb_receipt_amt', 'contb_receipt_dt', 'party'],dtype='object')

3.1、透视表(pivot_table)分析党派和职业

  • 按照党派,职业对赞助金额进行汇总,类似Excel表中的透视表功能,聚合函数为sum
ret = contb_over.pivot_table('contb_receipt_amt', index='contbr_occupation', columns='party', aggfunc='sum', fill_value=0)
ret
party Democrat Republican
contbr_occupation
MIXED-MEDIA ARTIST / STORYTELLER 100.0 0.0
AREA VICE PRESIDENT 250.0 0.0
RESEARCH ASSOCIATE 100.0 0.0
TEACHER 500.0 0.0
THERAPIST 3900.0 0.0
... ... ...
ZOOKEEPER 35.0 0.0
ZOOLOGIST 400.0 0.0
ZOOLOGY EDUCATION 25.0 0.0
\NONE\ 0.0 250.0
~ 0.0 75.0

3.1.1、求每个职业为两个党派献金之和

ret['total'] = ret['Democrat'] + ret['Republican']
ret.head()
party Democrat Republican total
contbr_occupation
MIXED-MEDIA ARTIST / STORYTELLER 100.0 0.0 100.0
AREA VICE PRESIDENT 250.0 0.0 250.0
RESEARCH ASSOCIATE 100.0 0.0 100.0
TEACHER 500.0 0.0 500.0
THERAPIST 3900.0 0.0 3900.0

3.1.2、过滤掉总和小于200W的数据

cond = ret['total'] < 2000000 #条件
index = ret[cond].index # 索引
ret_big = ret.drop(labels=index) # 移除
ret_big
party Democrat Republican total
contbr_occupation
ATTORNEY 14302461.84 7.868419e+06 2.217088e+07
CEO 2074974.79 4.211041e+06 6.286015e+06
CONSULTANT 2459912.71 2.544725e+06 5.004638e+06
ENGINEER 951525.55 1.818374e+06 2.769899e+06
EXECUTIVE 1355161.05 4.138850e+06 5.494011e+06
HOMEMAKER 4248875.80 1.363428e+07 1.788315e+07
INVESTOR 884133.00 2.431769e+06 3.315902e+06
MANAGER 762883.22 1.444532e+06 2.207416e+06
NOT PROVIDE 13725187.32 2.097161e+07 3.469680e+07
OWNER 1001567.36 2.408287e+06 3.409854e+06
PHYSICIAN 3735124.94 3.594320e+06 7.329445e+06
PRESIDENT 1878509.95 4.720924e+06 6.599434e+06
PROFESSOR 2165071.08 2.967027e+05 2.461774e+06
REAL ESTATE 528902.09 1.625902e+06 2.154804e+06
RETIRED 25305316.38 2.356124e+07 4.886656e+07
SELF-EMPLOYED 721108.40 1.961786e+06 2.682894e+06

3.1.3、数据可视化

# 绘制图形
ret_big.plot(kind='bar', figsize=(12, 6))

3.2、分组及运算和转换

  • 根据职业和雇主信息分组运算
grouped = contb_over.groupby('cand_nm')
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001D1D57CBE08>
  • 可以用 for 方法遍历出分组的结果

3.2.1、不同职业对他两人的支持度

  • 由于职业和雇主的处理非常相似,我们定义函数get_top_amounts()对两个字段进行分析处理
  • 首先统计各区间的赞助笔数,这里用到unstack(),stack()函数是堆叠,unstack()函数是不要堆叠,即把多层索引变为表格数据
def get_top_amounts(grouped, key, n):#先分组,grouped,然后继续再分return grouped.groupby(key)['contb_receipt_amt'].sum().sort_values(ascending=False)[:n]grouped = contb_over.groupby('cand_nm')
grouped.apply(get_top_amounts, 'contbr_occupation', 7).unstack(level=0)
cand_nm Obama, Barack Romney, Mitt
contbr_occupation
ATTORNEY 14302461.84 5372424.02
CEO NaN 2324297.03
CONSULTANT 2459912.71 NaN
EXECUTIVE NaN 2300947.03
HOMEMAKER 4248875.80 8147446.22
NOT PROVIDE 13725187.32 11638509.84
PHYSICIAN 3735124.94 NaN
PRESIDENT NaN 2491244.89
PROFESSOR 2165071.08 NaN
RETIRED 25305316.38 11508473.59

结论:从数据可以看出,Obama更受精英群体(律师、医生、咨询顾问)的欢迎,Romney则得到更多企业家或企业高管的支持

3.2.2、不同公司对他两人的支持度

  • 同样,使用get_top_amounts()对雇主进行分析处理
def get_top_amounts(grouped, key, n):#先分组,grouped,然后继续再分return grouped.groupby(key)['contb_receipt_amt'].sum().sort_values(ascending=False)[:n]grouped = contb_over.groupby('cand_nm')
grouped.apply(get_top_amounts, 'contbr_employer', 7)
cand_nm        contbr_employer
Obama, Barack  RETIRED            22694558.85SELF-EMPLOYED      18626807.16NOT PROVIDE        13883494.03NOT EMPLOYED        8586308.70HOMEMAKER           2605408.54STUDENT              318831.45VOLUNTEER            257104.00
Romney, Mitt   NOT PROVIDE        12321731.24RETIRED            11506225.71HOMEMAKER           8147196.22SELF-EMPLOYED       7414115.22STUDENT              496490.94CREDIT SUISSE        281150.00MORGAN STANLEY       267266.00
Name: contb_receipt_amt, dtype: float64

结论: Obama:微软,盛德国际律师事务所;Romney:瑞士瑞信银行,摩根斯坦利,高盛公司,巴克莱资本,H.I.G资本

3.3、统计各区间的赞助金额

3.3.1、查看区间金额

labels = pd.cut(contb_vs['contb_receipt_amt'], bins)
contb_vs.groupby(['cand_nm', labels]).size().unstack(level=0, fill_value=0)
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 493 77
(1, 10] 40070 3681
(10, 100] 372280 31853
(100, 1000] 153992 43357
(1000, 10000] 22284 26186
(10000, 100000] 2 1
(100000, 1000000] 3 0
(1000000, 10000000] 4 0

3.3.2、绘制Obama和Romney各区间赞助的总金额

amt_vs = contb_vs.groupby(['cand_nm', labels]).sum().unstack(level=0, fill_value=0)
amt_vs.fillna(0, inplace=True)
amt_vs.plot(kind='bar', figsize=(12,6))

3.3.3、过滤掉大金额

# 过滤掉大金额
amt_vs[:-2].plot(kind='bar', figsize=(12,6))

3.3.4、百分比堆积图

算出每个区间两个候选人收到赞助总金额比:

amt_vs.div(amt_vs.sum(axis = 1), axis=0)[:-2]
contb_receipt_amt
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 0.805182 0.194818
(1, 10] 0.918767 0.081233
(10, 100] 0.910769 0.089231
(100, 1000] 0.710177 0.289823
(1000, 10000] 0.447326 0.552674
(10000, 100000] 0.823120 0.176880

绘制百分比堆积图:

  • 参数 stacked=True
amt_vs.div(amt_vs.sum(axis = 1), axis=0)[:-2].plot(kind='bar', stacked=True, figsize=(12, 6))

4、时间处理

contb_vs.head()
cand_nm contbr_nm contbr_st contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt party
411 Romney, Mitt ELDERBAUM, WILLIAM AA US GOVERNMENT FOREIGN SERVICE OFFICER 25.0 01-FEB-12 Republican
412 Romney, Mitt ELDERBAUM, WILLIAM AA US GOVERNMENT FOREIGN SERVICE OFFICER 110.0 01-FEB-12 Republican
413 Romney, Mitt CARLSEN, RICHARD AE DEFENSE INTELLIGENCE AGENCY INTELLIGENCE ANALYST 250.0 13-APR-12 Republican
414 Romney, Mitt DELUCA, PIERRE AE CISCO ENGINEER 30.0 21-AUG-11 Republican
415 Romney, Mitt SARGENT, MICHAEL AE RAYTHEON TECHNICAL SERVICES CORP COMPUTER SYSTEMS ENGINEER 100.0 07-MAR-12 Republican
  • 在上面的结果中,我们可以看出时间格式为 01-FEB-12 ,拥有英文符号,说明格式是字符串,不能直接用于运算。

4.1、查看数据类型

contb_vs.dtypes
cand_nm               object
contbr_nm             object
contbr_st             object
contbr_employer       object
contbr_occupation     object
contb_receipt_amt    float64
contb_receipt_dt      object
party                 object
dtype: object

4.2、str转datetime

contb_vs['contb_receipt_dt'] = pd.to_datetime(contb_vs['contb_receipt_dt'])
contb_vs.head()
cand_nm contbr_nm contbr_st contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt party
411 Romney, Mitt ELDERBAUM, WILLIAM AA US GOVERNMENT FOREIGN SERVICE OFFICER 25.0 2012-02-01 Republican
412 Romney, Mitt ELDERBAUM, WILLIAM AA US GOVERNMENT FOREIGN SERVICE OFFICER 110.0 2012-02-01 Republican
413 Romney, Mitt CARLSEN, RICHARD AE DEFENSE INTELLIGENCE AGENCY INTELLIGENCE ANALYST 250.0 2012-04-13 Republican
414 Romney, Mitt DELUCA, PIERRE AE CISCO ENGINEER 30.0 2011-08-21 Republican
415 Romney, Mitt SARGENT, MICHAEL AE RAYTHEON TECHNICAL SERVICES CORP COMPUTER SYSTEMS ENGINEER 100.0 2012-03-07 Republican
  • 现在时间是 2012-02-01 的形式,格式为 datetime64

5、各州支持率

5.1、数据分组

根据州和候选人进行分组:

state_vs = contb_vs.groupby(['cand_nm','contbr_st'])['contb_receipt_amt'].sum().unstack(level=0)
state_vs
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 56405.00 135.00
AB 2048.00 NaN
AE 42973.75 5680.00
AK 281840.15 86204.24
AL 543123.48 527303.51
... ... ...
WI 1130155.46 270316.32
WV 169154.47 126725.12
WY 194046.74 252595.84
XX NaN 400250.00
ZZ 5963.00 NaN

数据可视化:

state_vs.fillna(0, inplace=True)
state_vs.plot(kind='bar', figsize=(16, 6))

5.2、候选人各州金额占比

# 所占的比例
state_vs_rate = state_vs.div(state_vs.sum(axis=1), axis=0)
state_vs_rate
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 0.997612 0.002388
AB 1.000000 0.000000
AE 0.883257 0.116743
AK 0.765778 0.234222
AL 0.507390 0.492610
... ... ...
WI 0.806982 0.193018
WV 0.571700 0.428300
WY 0.434456 0.565544
XX 0.000000 1.000000
ZZ 1.000000 0.000000

5.3、删除不存在的州

state_vs_rate.drop(labels=['AA', 'AB', 'AE', 'NOT PROVIDE'], inplace=True)

5.4、绘制地图

  • basemap 工具包, pip install basemap
  • basemap 绘制地图, pip install 无法安装成功
  • python本地库下载: https://www.lfd.uci.edu/~gohlke/pythonlibs/#
  • basemap官网: https://matplotlib.org/basemap/api/basemap_api.html
  • mapShapeFile下载: https://gadm.org/download_country_v3.html
  • 美国地图ShapeFile也可以在该博客的顶部数据链接下载

5.4.1、导入相关库

from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon #导入多边形包
from matplotlib.colors import rgb2hex #rgb2hex表示16进制的颜色

5.4.2、同一绘制不同颜色方法

# 着色
cmap = plt.cm.Reds #blues,0cean....
for i in range(10):print ((i+1) / 10)plt.plot(np.arange(10) + i, c=cmap((i+1) / 10))

输出结果:

0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0

5.4.3、查看Obama各州均值

obama = state_vs_rate['Obama, Barack']
obama
contbr_st
AK    0.765778
AL    0.507390
AP    0.957329
AR    0.772902
AS    1.000000...
WI    0.806982
WV    0.571700
WY    0.434456
XX    0.000000
ZZ    1.000000
Name: Obama, Barack, Length: 64, dtype: float64

5.4.3、绘制美国地图

'''
关键词 描述
llcrnrlon   所需地图域的左下角经度(度)。
llcrnrlat   所需地图域的左下角纬度(度)。
urcrnrlon   所需地图域右上角的经度(度)。
urcrnrlat   所需地图域右上角的纬度(度)。
'''
plt.figure(figsize=(12, 9))
m = Basemap(llcrnrlon = -122,llcrnrlat = 23.41,urcrnrlon = -64,urcrnrlat = 45,projection = 'lcc',lat_1 = 30,lon_0 = -100)m.drawcoastlines(linewidth=1.5) #海岸线
m.drawcountries(linewidth=1.5) # 国家
# m.drawstates() #直接画出州# 读取美国地图的现状,m中就有了各州的形状,数据
m.readshapefile('./USA/gadm36_USA_1', name='states')colors = []
states = []cmap = plt.cm.Reds
#州全称对应缩写缩写
abbr = {'Commonwealth of Kentucky':'KY','Commonwealth of Massachusetts':'MA','Commonwealth of Pennsylvania':'PA','State of Rhode Island and Providence Plantations':'RI'}for shapeinfo in m.states_info:a = shapeinfo['VARNAME_1'] #结果结构AL|Ala,AK|Alaska
#     州的缩写s = a.split('|')[0] # 结果AL,AK表示州的缩写try:rate = obama[s] # 取出obama的州对应的值colors.append(cmap(rate)) #转化颜色states.append(s) #州的简称except:colors.append(cmap(obama[abbr[s]]))#有些州没有检查,需要进行替换states.append(s)
# 州填充颜色
# seg州中的一部分区域,多边形
ax = plt.gca()
for n,seg in enumerate(m.states):c = rgb2hex(colors[n])poly = Polygon(seg,color = c )ax.add_patch(poly)
plt.show()

数据分析入门之2012美国大选政治献金项目相关推荐

  1. 利用python进行数据分析-14.5 2012美国大选数据分析(基础复习)

    美国大选数据分析 导入数据及相关库 数据清洗 填充缺失值 数据转换 面元化数据(分桶) 数据聚合与分组运算 时间处理 str转datetime(datetime为时间序列数据类型) 重新采样和频率转换 ...

  2. 2012美国大选献金项目数据分析

    文章目录 1.数据载入与预览 1.1 数据加载 1.2 数据合并 1.3 数据预览 1.3.1 查看是否有空值 1.3.2用统计学指标快速描述数值型属性的概要 2.数据的预处理 2.1 数据清洗 2. ...

  3. 美国2012政治献金数据分析(附有源数据和题目)

    读取文件usa_election.txt 查看文件样式及基本信息 指定数据截取,将如下字段的数据进行提取,其他数据舍去: cand_nm :候选人姓名 contbr_nm :捐赠人姓名 contbr_ ...

  4. 数据分析---2012美国大选献金项目数据分析

    需求: 1. 加载数据 df = pd.read_csv('./data/usa_election.txt') 2. 对新数据进行总览,查看是否存在缺失数据: 方法一:isnull.notnull.a ...

  5. 美国大选献金项目学习笔记

    目录 要求 代码: 要求 代码: import numpy as np import pandas as pddf = pd.read_csv('./data/usa_election.txt')# ...

  6. 【数据分析】使用pandas和numpy分析美国大选献金项目

    1. 数据载入与总览 1.1 数据加载 #绘图工具 import matplotlib.pyplot as plt %matplotlib inline #数据处理工具 import numpy as ...

  7. 2012美国大选献金项目(最详细解释)

    1. 数据载入和总览 import numpy as npimport pandas as pdfrom pandas import Series,DataFrame 1.1 数据载入 df1 = p ...

  8. pandas - 案例(美国2012年总统候选人政治献金数据分析)

    # 提供数据months = {'JAN' : 1, 'FEB' : 2, 'MAR' : 3, 'APR' : 4, 'MAY' : 5, 'JUN' : 6,'JUL' : 7, 'AUG' : ...

  9. 3 Python数据分析 美国各州人口分析案例 Pandas高级操作 美国大选献金案例 matplotlib

    Python数据分析 1 案例 美国各州人口分析 1.1 数据介绍 数据来源:https://github.com/jakevdp/data-USstates/ 1.1.1 州人口数量表 state- ...

  10. 项目3:美国大选献金目数据分析

    """ 需求: 1.读取数据 2.查看是否有缺失数据 3.用统计学指标快速描述数值型属性的概要 4.空值处理.可能因为忘记填写保密等原因,相关字段出现了空值,将其填充为N ...

最新文章

  1. oracle SQL以结尾函数,Oracle学习笔记--SQL查询和SQL函数(转) -- 迷失de天空 -- 编程爱好者......
  2. mysql万能密钥,mysql错误:指定的密钥太长;最大密钥长度为767字节
  3. JZOJ 4932. 【NOIP2017提高组模拟12.24】B
  4. VTK:Utilities之SortDataArray
  5. php去掉最后一个,号,织梦当前位置去除最后一个''符号
  6. 前端学习(2752):global全局设置
  7. 微服务架构设计总结实践
  8. 树莓派智能小车python_人工智能-树莓派小车(1)——DuerOS语音唤醒
  9. linux_network
  10. python经典题库及答案文库_Python经典题库及答案
  11. Linux文件打包成不同后缀的压缩文件
  12. powerdesigner 显示窗口小工具栏
  13. 使用谷歌浏览器下载的图片webp怎么转成jpg格式图片
  14. 如何更改Eclipse中Properties文件编码格式
  15. 通用发票在线OCR识别,报销场景适用,支持近20种票据
  16. 【运筹帷幄】网站打开慢故障排查思路
  17. 开灯问题~有n盏灯,编号为1~n。1号将灯全部打开,2将按下所有为2的倍数的开关,(这些灯将被关掉)第3个人按下所有编号为3的倍数的开关(该灯如为打开的, 则将它关闭;如关闭的,则将它打开)。
  18. html双引号打不正确,双引号要占一个空格吗 为什么打双引号要空格一下才能出来...
  19. 任意用户注册任意用户密码修改
  20. MATLAB约当标准型,用MATLAB求其约当标准型状态空间表达式

热门文章

  1. java报错establishing_[Microsoft][SQLServer 2000 Driver for JDBC]Error establishing socket错误解决方法总结...
  2. python的shape和reshape_一文浅显易懂:Python中shape()和reshape()的用法和区别
  3. 优惠券管理--优惠券类型
  4. codeforces1467E. Distinctive Roots in a Tree
  5. 微信小程序中 三元表达式
  6. 中创向心力:如何把思想政治教育贯穿职业教育全过程?
  7. Go 每日一库之 go-flags
  8. 将轻松的留给生活,沉重的留给写作——读《灯下尘》(七堇年)
  9. Transformer中Relative Position Bias以及DropPath细节梳理
  10. EOJ 2527 Fj haozi【dfs】