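A Visualization Case Study: The Chilean Presidential Election

This is a case study in data visualization based on the 2021 Chilean presidential election.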


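On December 19, 2021, Chile held the runoff of one of the most polarized presidential elections in its history. In a climate of political antagonism not seen for a long time, the two-round voting system left all the traditional parties behind: none of them qualified for the second round.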

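But the purpose of this article is not to discuss politics. Instead, it investigates an intriguing fact about this election: according to the national electoral service (SERVEL), the number of expressed votes (that is, ballots that were neither blank nor invalid) rose sharply between the two rounds, from 7,028,345 in the first round to 8,271,893 in the second. That is an increase of 17.7%, almost 1.25 million votes! (SERVEL data: https://www.servel.cl/resultados-en-excel-por-mesa-a-partir-del-ano-2012/)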
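As a quick sanity check, the increase can be recomputed from the two totals quoted above. This is only a small arithmetic sketch; the variable names are chosen here for illustration and do not come from the original script.

```python
# Expressed-vote totals quoted in the text (SERVEL figures)
first_round_votes = 7_028_345
second_round_votes = 8_271_893

# Absolute and relative gain between the two rounds
gain = second_round_votes - first_round_votes
gain_pct = 100 * gain / first_round_votes

print(f"Additional expressed votes: {gain:,}")
print(f"Relative increase: {gain_pct:.1f}%")
```

The relative increase is measured against the first-round total, which is presumably also how the per-station mobilization used later in the article is defined.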

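In Chile, abstention is usually high. Of the 15 million voters on the electoral roll, only 47.5% took part in the first round and 55.9% in the second, and yet this is considered one of the country's best turnouts. I will nevertheless focus only on the increase in expressed votes, because the reasons behind abstention are an entirely different question that will not be addressed here.

As mentioned, this election was highly polarized. The increase in expressed votes in the second round benefited the winning candidate, Gabriel Boric. To examine this phenomenon closely, I analyzed SERVEL's voting data, which provides detailed ballot information for all 46,888 polling stations in the country and abroad: votes obtained by each candidate, blank and invalid votes, abstention, location, gender, and so on.

As a reminder, before anything else, let us list the candidates of both rounds and their respective scores. In the second round, Gabriel Boric Font defeated José Antonio Kast with 55.87% of the votes against 44.13%.

In the first round, José Antonio Kast led with 27.91% of the votes, while Gabriel Boric came second with 25.83%. The remaining candidates were Franco Parisi (12.80%), Sebastián Sichel (12.78%), Yasna Provoste (11.60%), Marco Enríquez-Ominami (7.60%) and Eduardo Artés (1.47%).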

# libraries used
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
from matplotlib.ticker import PercentFormatter
import seaborn as sns  # aliased as sns, the name used by the plotting code

# preparation of the data includes:
# - importing first-round and second-round datasets
# - joining data from polling places within the country and abroad
# - cleaning the data
# - counting the total number of expressed ballots per polling station
#   (abstention and blank/invalid votes are not taken into account)
# - computing the percentage of each candidate in each polling station
# - returning a NumPy array of all 46,888 scores (one per polling station),
#   for both the 7 first-round candidates and the 2 second-round candidates
# - calculating the difference of expressed votes between the two rounds
#   to measure the electoral mobilization in each polling station;
#   outliers are capped to narrow the spread of the array when plotted as a heatmap
# all operations are regularly asserted throughout the script to be consistent
# with the totals provided by SERVEL in a separate sheet
# the detailed script is available on GitHub

First-round dataset: https://www.servel.cl/resultados-definitivos-eleccion-presidencial-parlamentarias-y-de-consejeros-as-regionales-2021/

Second-round dataset: https://www.servel.cl/resultados-definitivos-elecciones-segunda-votacion-presidencia-2021/

GitHub: https://github.com/louisleclerc/SERVEL_Data_Analysis/blob/main/SERVEL_EDA.py
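To make the preparation steps more concrete, here is a minimal sketch of the two key computations on toy numbers: the score of each candidate per polling station and the change in expressed votes between the rounds. The data is invented, and defining the change as a percentage of the first-round total is an assumption; the actual script on GitHub may proceed differently.

```python
import numpy as np

# Toy data (hypothetical): 3 polling stations, 2 candidates.
# Rows are polling stations, columns are candidates.
votes_round1 = np.array([[40, 60],
                         [10, 30],
                         [25, 25]])

# Expressed votes per station: blank, invalid and abstention are excluded upstream
expressed_round1 = votes_round1.sum(axis=1)
expressed_round2 = np.array([120, 36, 50])

# Score of each candidate in each station, in percent of expressed votes
scores_round1 = 100 * votes_round1 / expressed_round1[:, None]

# Gain or loss of expressed votes between the rounds, in percent of round 1 (assumed definition)
diff_votes_perc = 100 * (expressed_round2 - expressed_round1) / expressed_round1

print(scores_round1)
print(diff_votes_perc)
```

A positive value in diff_votes_perc means the station mobilized between the rounds, a negative value means it demobilized.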

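Comparing first-round and second-round results across the country's polling stations

For each of the seven candidates, a scatter plot places every polling station along the X-axis according to the score the candidate obtained there in the first round. Each station is then positioned along the Y-axis according to the second-round score, which shows how that station's voters reacted to the runoff.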

# candidate is an array of scores of the 1st-round candidate
# candidate2 is an array of scores of the 2nd-round candidate
# diff_votes_perc is an array of the differences of expressed votes between the two rounds

# plot
sns.scatterplot(
    x=candidate,
    y=candidate2,
    hue=diff_votes_perc,
    palette='coolwarm',
    alpha=0.5,
    ax=ax
)

# legend
ax.legend(
    title='Gain/loss of votes between\nthe two rounds (in %)',
    title_fontsize='small',
    fontsize='small'
)

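A candidate's electorate is made up of the right-most polling stations, where the candidate scored best in the first round. The position of these stations on the Y-axis reveals how they behaved in the second round. In other words, it shows how strongly the first-round candidate's voters supported one of the two runoff candidates.

The plots also show the electoral mobilization of each polling station. The color of a dot indicates whether the number of expressed votes rose or fell between the first and second rounds, following the logic of a heatmap: warm red means strong mobilization, while cool blue means a drop in participation.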

# check distribution of the difference of expressed ballots array
plt.hist(diff_votes_perc)
plt.show()

# narrow range of the array to avoid outliers
diff_votes_perc = np.where(
    diff_votes_perc < -30, -30, diff_votes_perc
)
diff_votes_perc = np.where(
    diff_votes_perc > 55, 55, diff_votes_perc
)
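The two np.where calls cap the array at -30 and 55. As a side note, the same capping can be written with np.clip; the short sketch below checks on a made-up sample that both forms give identical results. The sample values are hypothetical.

```python
import numpy as np

# Hypothetical sample of vote differences (in %), including two outliers
sample = np.array([-80.0, -30.0, 0.0, 12.5, 55.0, 140.0])

# Capping as done in the script, with two successive np.where calls
with_where = np.where(sample < -30, -30, sample)
with_where = np.where(with_where > 55, 55, with_where)

# Equivalent one-liner: values below -30 become -30, values above 55 become 55
with_clip = np.clip(sample, -30, 55)

print(with_clip)
print(np.array_equal(with_where, with_clip))
```

Note that capping changes the outliers' values, so the capped array is only suitable for coloring the plot, not for recomputing totals.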

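In this way, the plots tell us not only whether a first-round candidate's voters supported a runoff candidate, but also whether that support was enthusiastic.

Let us look at an example to make this clearer: the plot comparing the first-round scores of the traditional right-wing candidate Sebastián Sichel with the second-round scores of the far-right candidate José Antonio Kast.

The upper-right part of the plot contains a cluster of blue dots, from which we can draw two conclusions.

First, on whether Kast was supported in the second round: the polling stations where Sichel scored best (right side of the X-axis) backed Kast in the runoff (top of the Y-axis). In other words, Sichel's electorate supported Kast in the second round.
Second, on how enthusiastic that support was: the prevalence of blue reveals an electoral demobilization, since expressed votes were lost between the two rounds. In other words, Sichel's electorate gave Kast a "demobilized support".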
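The same reading can be expressed numerically. The sketch below, on invented numbers, isolates a first-round candidate's strongholds (here taken as the top quarter of stations by first-round score, an arbitrary choice) and summarizes how they voted and mobilized in the second round. None of these arrays or thresholds come from the original script.

```python
import numpy as np

# Hypothetical scores (in %) for eight polling stations
sichel_round1 = np.array([5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 30.0, 40.0])
kast_round2 = np.array([30.0, 35.0, 42.0, 40.0, 50.0, 58.0, 70.0, 82.0])
diff_votes_perc = np.array([20.0, 15.0, 10.0, 5.0, 0.0, -4.0, -8.0, -12.0])

# Strongholds: stations at or above the 75th percentile of the first-round score
threshold = np.percentile(sichel_round1, 75)
strongholds = sichel_round1 >= threshold

# Second-round support and mobilization within the strongholds
support = kast_round2[strongholds].mean()
mobilization = diff_votes_perc[strongholds].mean()

print(f"{strongholds.sum()} strongholds, mean 2nd-round score {support:.1f}%, "
      f"mean change in expressed votes {mobilization:+.1f}%")
```

A high mean score combined with a negative mean change corresponds to the "demobilized support" visible as blue dots in the upper-right corner.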

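Comparing the runoff candidates with themselves

Things get more interesting when we compare the score of the same candidate between the first and second rounds. Naturally, this comparison is only possible for the candidates who actually reached the runoff.

But if a candidate's voters did support him in both rounds, what is the point of looking at this?

It may seem pointless, since we can reasonably expect the polling stations that voted heavily for a candidate in the first round to vote for the same candidate in the second.

To show why it matters, let us first look at the polling stations that voted heavily in the second round for the defeated candidate, Kast.

It looks like a dense swarm with a bulge at the top.

The linear shape of the scatter plot indicates that his electorate is stable: the more a polling station voted for him in the first round, the more likely it was to vote massively for him in the second.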
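One way to put a number on this "linear shape" is the Pearson correlation between the two rounds, together with a fitted straight line. The sketch below uses five invented stations; the original script does not necessarily compute these quantities.

```python
import numpy as np

# Hypothetical scores (in %) of the same candidate in five polling stations
first_round = np.array([5.0, 15.0, 25.0, 35.0, 45.0])
second_round = np.array([22.0, 33.0, 41.0, 55.0, 62.0])

# Pearson correlation: a value close to 1 means an almost perfectly linear relation
r = np.corrcoef(first_round, second_round)[0, 1]

# Least-squares line: second_round is approximately slope * first_round + intercept
slope, intercept = np.polyfit(first_round, second_round, 1)

print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.1f}")
```

On real data, a swarm that widens into a bulge would show up as a lower correlation, because many stations end up far from the fitted line.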

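There are exceptions, though. Some places strongly supported Kast in the second round, with scores as high as 100%, despite a low first-round score. But these are a tiny minority of the 46,888 polling stations, and they generally come at the cost of a demobilized electorate, as the blue dots show.

The most important conclusion to draw from this plot is not the most eye-catching one. The bulge is small, especially on the left. Yet the upper-left and upper-middle part of the plot may be precisely where victory lies. In an election, it is crucial to win over voters who did not vote for you in the first round. Those voters would show up exactly in that part of the plot, drawn in red.

Winning through an electoral burst

To better illustrate the idea of an electoral burst, let us now look at the scores of the winning candidate.

The bulge is much more pronounced and full of dots. It has two parts: a blue one on the left, which strongly supported Boric in the second round without mobilizing, and a larger red one, which mobilized strongly in favor of Boric.

Let us display the same plot with the national average scores Boric obtained in each round.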

# get max values of the data to get limit coordinates
X_max = float(max(candidate))
Y_max = float(max(candidate2))

# plot, same as before
sns.scatterplot(
    x=candidate,
    y=candidate2,
    hue=diff_votes_perc,
    palette='coolwarm',
    alpha=0.5,
    ax=ax
)

# compute national averages of candidates
cand2_mean = float(np.mean(candidate2))
cand_mean = float(np.mean(candidate))

# compute number of polling places
nb_pp = int(len(SERVEL_data) / 7)

# plot national average of 2nd-round candidate
X_plot = np.linspace(0, X_max, nb_pp)
Y_plot = np.linspace(cand2_mean, cand2_mean, nb_pp)
ax.plot(
    X_plot,
    Y_plot,
    color='black',
    linestyle='-.',
    label=f'{candidate2name}\n2nd round: {round(cand2_mean,1)}%'
)

# plot national average of 1st-round candidate
X_plot2 = np.linspace(cand_mean, cand_mean, nb_pp)
Y_plot2 = np.linspace(0, Y_max, nb_pp)
ax.plot(
    X_plot2,
    Y_plot2,
    color='black',
    linestyle=':',
    label=f'{candidate1name}\n1st round: {round(cand_mean, 1)}%'
)
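The script draws each average as a line built from nb_pp identical points. A lighter alternative, sketched here on toy data, is to use Matplotlib's axhline and axvline, which draw a horizontal or vertical line at a given value. Unlike the original, these lines span the whole axes instead of stopping at the data maximum. The arrays and labels below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical scores (in %) for four polling stations
candidate = np.array([12.0, 25.0, 31.0, 40.0])
candidate2 = np.array([35.0, 52.0, 60.0, 71.0])

fig, ax = plt.subplots()
ax.scatter(candidate, candidate2, alpha=0.5)

# Averages of the station scores, as in the script
cand_mean = float(np.mean(candidate))
cand2_mean = float(np.mean(candidate2))

# Horizontal line at the 2nd-round average, vertical line at the 1st-round average
h_line = ax.axhline(cand2_mean, color='black', linestyle='-.',
                    label=f'2nd round: {round(cand2_mean, 1)}%')
v_line = ax.axvline(cand_mean, color='black', linestyle=':',
                    label=f'1st round: {round(cand_mean, 1)}%')
ax.legend(fontsize='small', title='National average')
```

The two lines split the plot into the four quadrants discussed in the following paragraphs.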

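The upper-right quadrant gathers all the polling stations where Boric scored above his national average in both rounds. Compared with Kast's plot, we can see that it is not linear. Instead, the red burst widening towards the top highlights the mobilization in favor of Boric.

The upper-left quadrant is also quite insightful. It gathers all the places that gave Boric few votes in the first round, below his national average. Yet these polling stations clearly favored him in the second round. They also mobilized more, as the red color shows.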
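The quadrants can also be isolated with boolean masks. The following sketch, on six invented stations, selects the upper-left quadrant (below the first-round average, above the second-round average) and measures its mean mobilization. The arrays are hypothetical and the averages are plain means of station scores, as in the script.

```python
import numpy as np

# Hypothetical scores (in %) and vote changes (in %) for six polling stations
candidate = np.array([10.0, 15.0, 30.0, 40.0, 20.0, 35.0])     # 1st round
candidate2 = np.array([62.0, 40.0, 70.0, 80.0, 65.0, 45.0])    # 2nd round
diff_votes_perc = np.array([25.0, -5.0, 10.0, 30.0, 15.0, -10.0])

cand_mean = candidate.mean()
cand2_mean = candidate2.mean()

# Upper-left quadrant: weak in the 1st round, strong in the 2nd round
upper_left = (candidate < cand_mean) & (candidate2 > cand2_mean)

n_upper_left = int(upper_left.sum())
mean_mobilization = diff_votes_perc[upper_left].mean()

print(f"{n_upper_left} stations in the upper-left quadrant, "
      f"mean change in expressed votes: {mean_mobilization:+.1f}%")
```

A clearly positive mean in this quadrant is the numerical counterpart of the red dots described above.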

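The abundance of red in the upper part of the plot underlines that Boric owes his election to a decisive electoral mobilization that reached far beyond his original electorate. This conclusion is consistent with the fact that he finished second in the first round and still won the runoff with 55.87% of the votes.

By contrast, remember that Kast's bulge is almost empty and filled with demobilized polling stations, which means he failed to attract voters beyond the boundaries of his own electorate.

Putting it all together

Here is the complete data for all seven first-round candidates. For each of them, there are three views of the same data:

Expressed votes per polling station, together with the national average scores of the first-round and second-round candidates. There is no color-based information, so as to focus on the shape of the swarm and on where the averages fall.
Heatmap of the electoral mobilization between the two rounds. This is the visualization we have looked at so far.
Votes per region. Another way to display the information, which confronted me with the strangest challenge: ranking regions that follow two different numbering schemes (some by geographical position, others by date of creation).

Below is the script that generates these plots. First, we define the candidates and set the general features of the figure. Since only three subplots are displayed, we can put a custom legend in the upper-right slot in place of a fourth plot.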

for i, candidate in enumerate(
    [Boric, Kast, Provoste, Sichel, Artés, Ominami, Parisi]
):
    fig, axs = plt.subplots(2, 2, figsize=[15,10])

    # extract name of the 1st round candidate
    candidate1name = names[i].title()

    # define candidate 2nd round to compare to
    if i == 1 or i == 3:
        candidate2name = 'José Antonio Kast Rist'
        candidate2 = Kast2
    else:
        candidate2name = 'Gabriel Boric Font'
        candidate2 = Boric2

    # format x and y axis in percentages
    for a, b in [(0,0), (0,1), (1,0), (1,1)]:
        axs[a][b].xaxis.set_major_formatter(PercentFormatter())
        axs[a][b].yaxis.set_major_formatter(PercentFormatter())

    # put the title in the second plot
    axs[0][1].annotate(
        text=f"2nd round behavior of\n{candidate1name}'s electorate",
        xy=[0.5,0.8],
        horizontalalignment='center',
        fontsize=20,
        fontweight='bold'
    )

    # add general description
    axs[0][1].annotate(
        text='Comparison of the results obtained at each round of the\n2021 Chilean presidential elections (by polling station)',
        xy=[0.5,0.6],
        horizontalalignment='center',
        fontsize=12,
        fontstyle='italic'
    )

    # annotate customized legend
    axs[0][1].annotate(
        'Legend:\n'
        '1 - Expressed votes per polling station (in %)\n'
        '2 - Electoral mobilization between the two rounds\n'
        '3 - Vote per region',
        xy=[0.05,0.05],
        horizontalalignment='left',
        fontsize=12,
        fontweight='light',
        backgroundcolor='white',
        bbox=dict(edgecolor='black', facecolor='white', boxstyle='round')
    )

    # fetch limit coordinates of each plot
    X_max = float(max(candidate))
    Y_max = float(max(candidate2))

    # put numbered references of the legend in the upper-right corner of each subplot
    axs[0][0].annotate(
        text='1',
        xy=[X_max,90],
        color='darkred',
        fontsize=20,
        fontweight='black'
    )
    axs[1][0].annotate(
        text='2',
        xy=[X_max,90],
        color='darkred',
        fontsize=20,
        fontweight='black'
    )
    axs[1][1].annotate(
        text='3',
        xy=[X_max,90],
        color='darkred',
        fontsize=20,
        fontweight='black'
    )

    # hide axis
    axs[0][1].axis('off')

    # set labels of the general figure
    fig.supylabel(
        f'{candidate2name} - 2nd round results',
        fontsize=16,
        ha='center',
        va='center'
    )
    fig.supxlabel(
        f'{candidate1name} - 1st round results',
        fontsize=16,
        ha='center',
        va='center'
    )

We now generate the scatter plot with the national averages. As a reminder, we are still inside the same for loop.

    # plot comparison of expressed votes in the first subplot
    sns.scatterplot(
        x=candidate,
        y=candidate2,
        color=colors[i],
        alpha=0.3,
        ax=axs[0][0]
    )

    # define variables to plot national averages of candidates
    cand2_mean = float(np.nanmean(candidate2))
    cand_mean = float(np.nanmean(candidate))
    nb_pp = int(len(SERVEL_data) / 7)

    # plot national average of 2nd-round candidate
    X_plot = np.linspace(0, X_max, nb_pp)
    Y_plot = np.linspace(cand2_mean, cand2_mean, nb_pp)
    axs[0][0].plot(
        X_plot,
        Y_plot,
        color='black',
        linestyle='-.',
        label=f'{candidate2name}\n2nd round: {round(cand2_mean,1)}%'
    )

    # plot national average of first-round candidate
    X_plot2 = np.linspace(cand_mean, cand_mean, nb_pp)
    Y_plot2 = np.linspace(0, Y_max, nb_pp)
    axs[0][0].plot(
        X_plot2,
        Y_plot2,
        color='black',
        linestyle=':',
        label=f'{candidate1name}\n1st round: {round(cand_mean, 1)}%'
    )
    axs[0][0].legend(
        fontsize='small',
        title='National average',
        title_fontsize='small'
    )

Next comes the electoral mobilization heatmap we have already seen.

    # plot electoral mobilization in the third subplot
    sns.scatterplot(
        x=candidate,
        y=candidate2,
        hue=diff_votes_perc,
        palette='coolwarm',
        alpha=0.5,
        ax=axs[1][0]
    )

    # legend showing the gain/loss of votes between the two rounds (in %)
    axs[1][0].legend(
        title='Gain/loss of votes between\nthe two rounds (in %)',
        title_fontsize='small',
        fontsize='small'
    )

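Chile is one of the few countries where regions can be treated as a numerical variable rather than a categorical one.

Finally, a bonus plot by region. For those unfamiliar with Chilean geography, it is the longest country in the world from north to south, stretching over 4,270 km.

It spans every kind of climate, from the driest desert in the world in the north to Antarctica in the south. But it is narrow, squeezed between the ocean and the mighty Andes: on average, it is only 177 km wide.

The regions are stacked on top of one another, with no real east-west layering on the map. We can take advantage of this peculiar geography and color the dots according to their position along the north-south axis. After all, Chile is shaped somewhat like an axis!

(Image source: Wikimedia Commons)

In other words, we can put Chile's regions in numerical order. Not many countries allow that! In most countries, a color plot of the regional layout would be categorical and unlikely to offer much visual insight.

We can therefore display Chile's geographical data as a gradient from north to south. Such a scatter plot gives, at a glance, a hint of roughly where the polling stations are located. It yields interesting insights for some candidates, such as Parisi and Provoste, whose electorates are located in northern Chile.

So, back to the script. We want the regions ordered from north to south. The good news is that they are numbered, apparently from north to south. But if you look at the map of Chile's regions above, you will notice that not all the numbers make sense.

Chile has created new regions several times, and this is where it gets tricky! Originally, the first regions were numbered according to their geographical position. Since then, several new regions have been created, and their numbers reflect the order of their creation rather than their location.

We could imagine plenty of ways to order all the regions correctly from north to south, but none is as simple as doing it by hand with indexing, zip and NumPy.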

    # plot votes according to region in the last subplot
    # reordering the regions from the north is necessary to create a readable heatmap

    # list the unique region identifiers (np.unique returns them sorted)
    regions = np.unique(location_array)

    # zip the region list with a list of their respective position starting from the north
    north_to_south = [3, 1, 4, 15, 5, 12, 14, 13, 16, 2, 6, 10, 11, 8, 9, 17, 7]
    region_position = zip(regions, north_to_south)

    # create an array of the regional position of each polling place
    position_array = np.empty(len(location_array))
    for region, position in region_position:
        position_array[location_array == region] = position

    # stack all arrays of interest into a single one
    ordered_array = np.column_stack(
        [candidate, candidate2, position_array]
    )

    # sort array according to the regional position
    sorted_array = ordered_array[
        np.argsort(ordered_array[:,2])
    ]

    # create plot (positions are cast to strings so that hue is treated as categorical)
    sns.scatterplot(
        x=sorted_array[:,0],
        y=sorted_array[:,1],
        hue=sorted_array[:,2].astype('<U44'),
        palette='Spectral',
        alpha=0.4,
        ax=axs[1][1]
    )

    # readjust labels from north to south
    location_labels = [
        'ARICA',
        'TARAPACA',
        'ANTOFAGASTA',
        'ATACAMA',
        'COQUIMBO',
        'VALPARAISO',
        'METROPOLITANA',
        "O'HIGGINS",
        'MAULE',
        'ÑUBLE',
        'BIOBIO',
        'ARAUCANIA',
        'LOS RIOS',
        'LOS LAGOS',
        'AYSEN',
        'MAGALLANES',
        'EXTRANJERO'
    ]
    axs[1][1].legend(
        labels=location_labels,
        ncol=4,
        fontsize='xx-small'
    )
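The positional list north_to_south only works if it is aligned with the sorted output of np.unique, which is easy to get wrong. A possible variant, sketched below, uses an explicit dictionary from region to rank together with the return_inverse option of np.unique. The region names used as identifiers are an assumption made for this example (the real location_array may hold different codes); the ranks follow the order of the legend labels above.

```python
import numpy as np

# Hypothetical region identifier of five polling places
location_array = np.array(['MAULE', 'ARICA', 'MAGALLANES', 'ARICA', 'MAULE'])

# Explicit north-to-south rank of each region (subset, for illustration)
north_to_south_rank = {'ARICA': 1, 'MAULE': 9, 'MAGALLANES': 16}

# regions holds the sorted unique identifiers; inverse maps each polling place to its index in regions
regions, inverse = np.unique(location_array, return_inverse=True)

# Look up the rank of each unique region once, then broadcast it to every polling place
ranks = np.array([north_to_south_rank[region] for region in regions])
position_array = ranks[inverse]

# Stable sort: polling places ordered from north to south
order = np.argsort(position_array, kind='stable')

print(position_array)
print(location_array[order])
```

With a dictionary, a missing or misspelled region raises a KeyError instead of silently shifting every rank by one position.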

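Going further

There is a lot this analysis does not cover, such as the reasons behind abstention. But the data could also support another study, on gender-based voting, since gender information is available for every polling place except those abroad.

Comparing gender data with the mobilization plots could be interesting, since Boric's second-round victory has been attributed to a strong mobilization of young women voters. (https://www.latercera.com/la-tercera-pm/noticia/mujeres-menores-de-50-anos-el-motor-del-triunfo-de-boric-como-fue-la-participacion-y-preferencias-por-edad-y-sexo/7EIMTWYA2VB7NFVOBVRQCG52XU/)

Although the election took place in December 2021 and this case study comes a little late, analyzing the data remains interesting as another key Chilean vote approaches: the constitutional referendum of September 4, 2022.

This case study was inspired by a data visualization published by Le Monde after the French presidential election. (https://www.lemonde.fr/les-decodeurs/article/2022/05/04/election-presidentielle-2022-quels-reports-de-voix-entre-les-deux-tours_6124672_4355770.html)

Reference article: https://towardsdatascience.com/visualizing-electoral-data-polarization-and-mobilization-during-the-2021-chilean-presidential-528230a98232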
