Analyzing the World Happiness Index with R

The data used in this article is the 2019 World Happiness Report, obtained from Kaggle.

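Data details

The data contains 9 fields:

Rank: overall rank
Country or Region: name of the country or region
Score: happiness score
GDP per capita: contribution of GDP per capita
Social support: social support
Healthy life expectancy: healthy life expectancy
Freedom to make life choices: freedom
Generosity: generosity
Perceptions of corruption: integrity (clean government) index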

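Brief introduction

The happiness ranking is based on a worldwide opinion poll whose questionnaire is designed by Gallup. The questionnaire includes a question known as the Cantril Ladder, which asks respondents to imagine a ladder on which the best possible life for them is a 10 and the worst possible life is a 0, and then to rate their own current life on that scale. The scores come from representative samples around the world, and Gallup weights are applied to make the estimates representative. The report relates the happiness score to six factors: the economy, social support, life expectancy, freedom, integrity (absence of corruption), and generosity (donations).

The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012. On March 20, 2017, the World Happiness Report 2017, which ranked 155 countries by their happiness levels, was released at a United Nations event celebrating the International Day of Happiness. The report continues to gain global recognition as governments, organizations, and civil society increasingly use happiness indicators to inform their decisions. Leading experts across economics, psychology, survey analysis, national statistics, health, and public policy describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.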

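Data analysis

This article uses R to compute descriptive statistics for the World Happiness Index (five-number summary, mean, regression analysis, and so on) and to run a chi-square-based normality test. The main goal is to practice data analysis in R.

Descriptive statistics

Five-number summary, mean, variance, and standard deviation

The five-number summary consists of the median, the lower quartile, the upper quartile, the minimum, and the maximum.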
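As a small illustration of the five-number summary, the sketch below applies base R's `fivenum()` and `quantile()` to a handful of made-up scores. The vector `scores` is hypothetical and only stands in for the real `score` column.

```r
# Hypothetical scores; the real analysis would use data$score from the CSV.
scores <- c(7.8, 7.6, 7.5, 6.9, 5.2, 4.7, 3.1)

# fivenum() returns Tukey's five numbers:
# minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(scores)
names(five) <- c("min", "lower", "median", "upper", "max")
print(five)

# quantile() computes quartiles with R's default interpolation (type 7);
# for small samples the hinges and the quartiles can differ slightly.
print(quantile(scores, probs = c(0, 0.25, 0.5, 0.75, 1)))
```

`summary()` reports the same quantities together with the mean, which is why it is convenient for a whole data frame.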

# Load the data
data = read.table("mydata.csv", sep = ",", header = TRUE, col.names = c("rank", "country", "score", "GDP", "support", "health", "freedom", "generosity", "corruption"))
data
library(reshape2)
# Reshape from wide to long format
data1 = melt(data, id = c("rank", "country", "score"))
data1
# Draw the box plots
boxplot(value ~ variable, data1)
library(ggplot2)
ggplot(data1, aes(x=variable, y=value, fill=variable)) + geom_boxplot()

Box plots of the six factors:
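The wide-to-long reshaping that feeds the box plots can also be sketched in base R, without `reshape2`. The table `wide` below is a made-up two-country example, not the real data, and the column names are chosen only to mirror the script above.

```r
# Toy wide table (made-up values) with two factor columns.
wide <- data.frame(
  country = c("A", "B"),
  GDP = c(1.2, 0.8),
  freedom = c(0.5, 0.4),
  stringsAsFactors = FALSE
)

# Long format: one row per (country, variable) pair,
# which is the shape melt() produces for the id column "country".
long <- data.frame(
  country = rep(wide$country, times = 2),
  variable = rep(c("GDP", "freedom"), each = nrow(wide)),
  value = c(wide$GDP, wide$freedom),
  stringsAsFactors = FALSE
)
print(long)
```

With the data in this shape, a formula such as `value ~ variable` draws one box per factor.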

# Five-number summary and mean
re = summary(data[3:9])
library(knitr)
kable(re,format="markdown")

Five-number summary and mean of each column:

# Variance and standard deviation
data2 = data[3:9]
options(digits = 3)
apply(data2,2,var)
# In apply(), MARGIN = 1 means rows and MARGIN = 2 means columns
apply(data2,2,sd)

Variance of each column:

Standard deviation of each column:
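To make the relationship between the two statistics explicit, the sketch below recomputes the sample variance by hand and compares it with `var()` and `sd()`. The data frame `toy` holds made-up values and is only an illustration.

```r
# Hypothetical miniature of two factor columns (not the real data).
toy <- data.frame(
  GDP = c(1.34, 1.38, 1.49, 0.69),
  freedom = c(0.60, 0.59, 0.60, 0.52)
)

# Sample variance computed by hand, with the n - 1 denominator.
manual_var <- sapply(toy, function(x) sum((x - mean(x))^2) / (length(x) - 1))

# var() gives the same values, and sd() is the square root of var().
builtin_var <- sapply(toy, var)
builtin_sd <- sapply(toy, sd)
print(rbind(manual_var, builtin_var, builtin_sd))
```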

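Ranking analysis based on bar charts

The chart below shows the 15 countries with the highest happiness scores. The top of the list is dominated by Nordic countries such as Finland, Denmark, Norway, and Iceland, and two thirds of the 15 countries are in Europe. Israel is the only Asian country on the list; the rest are mainly in the Americas and Oceania.

# Bar chart
data3 = head(data, 15)
data3
# width is a fixed bar width, so it belongs outside aes()
ggplot(data3, aes(x=reorder(country,-score), y=score))+geom_bar(stat="identity", aes(fill=country), width = 0.7)+geom_text(aes(label=score), vjust=-0.2)+labs(x="country")+theme(axis.text.x = element_text(angle = 315,hjust = 0.2, vjust = 0.2))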

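Total scores of the top 15 countries:

The top 15 countries by GDP per capita and by freedom were also examined. Qatar has the highest GDP per capita and Hong Kong, China ranks 9th; the remaining entries are mainly in Western Asia and Europe, and the Western Asian countries on the list owe their position largely to abundant mineral resources.

# Select the 15 rows with the highest GDP per capita before plotting
data_gdp = head(data[order(-data$GDP), ], 15)
ggplot(data_gdp, aes(x=reorder(country,-GDP), y=GDP))+geom_bar(stat="identity", aes(fill=country), width = 0.7)+geom_text(aes(label=GDP), vjust=-0.2)+labs(x="country")+theme(axis.text.x = element_text(angle = 315,hjust = 0.2, vjust = 0.2))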

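GDP per capita of the top 15 countries by GDP per capita:

Uzbekistan has the highest freedom score, closely followed by Cambodia. Three Asian countries make the list, the rest are mainly in Europe, and Africa has only one entry, Somalia in East Africa.

# Select the 15 rows with the highest freedom score before plotting
data_freedom = head(data[order(-data$freedom), ], 15)
ggplot(data_freedom, aes(x=reorder(country,-freedom), y=freedom))+geom_bar(stat="identity", aes(fill=country), width = 0.7)+geom_text(aes(label=freedom), vjust=-0.2)+labs(x="country")+theme(axis.text.x = element_text(angle = 315,hjust = 0.2, vjust = 0.2))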

Freedom scores of the top 15 countries by freedom:
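The per-column top-15 selections used for the bar charts can be wrapped in a small helper. The function `top_n_by` is a hypothetical name introduced here for illustration, and `toy` is made-up data.

```r
# Hypothetical helper (not part of the original script): return the n rows
# with the largest values in the column named by `col`.
top_n_by <- function(df, col, n = 15) {
  ranked <- df[order(df[[col]], decreasing = TRUE), , drop = FALSE]
  head(ranked, n)
}

# Toy data with made-up values, for illustration only.
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  GDP = c(0.9, 1.6, 1.2, 0.3),
  freedom = c(0.55, 0.20, 0.61, 0.40),
  stringsAsFactors = FALSE
)

top_gdp <- top_n_by(toy, "GDP", n = 2)
top_freedom <- top_n_by(toy, "freedom", n = 2)
print(top_gdp$country)
print(top_freedom$country)
```

Sorting by the plotted column first matters: `head(data, 15)` on the raw file returns the top 15 by overall rank, which is a different set of countries.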

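Comparative analysis based on radar charts

Finland, ranked first, is compared with China, ranked 93rd. Finland does not rank first on any single factor, but apart from generosity it ranks near the top on every component, so its total score is high. China scores lower than Finland on every factor: its freedom and life expectancy are relatively high, its social support and GDP per capita are above average, but its integrity and generosity scores are low.

# Radar chart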
install.packages("fmsb")
library(fmsb)
# Prepare the data
data4 = data[, 4:9]
data4
maxm = apply(data[,4:9], 2, max)
maxm
minm = apply(data[,4:9], 2, min)
minm
maxmin <- data.frame(GDP=c(1.684, 0),support=c(1.624, 0),health=c(1.141, 0),freedom=c(0.631, 0),generosity=c(0.566, 0),corruption = c(0.453, 0))
data5 = rbind(maxmin, data4[1,])
data5 = rbind(data5, data4[93,])
data5
# Draw the chart
colors_border = c('#FFFF00', '#00FF4D')
colors_in = c('#FFFF0055', '#00FF4D55')
radarchart( data5, axistype=6, pcol= colors_border, pfcol= colors_in, plwd=2, plty=1, cglcol="grey", cglty=1, axislabcol="grey",cglwd=0.8, vlcex=0.8
)
# Add the legend
legend(x=1.5, y=1, legend = c("Finland", "China"), bty = "n", pch=20, col=colors_in , text.col = "black", cex=0.9, pt.cex=3)

Factor values for Finland and China:
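The script above types the axis maxima in by hand. As a sketch, the same input table can be derived from the data itself. `build_radar_input` is a hypothetical helper and `factors` contains made-up values. The layout it builds (maxima in row 1, minima in row 2, then one row per series) is the one `fmsb::radarchart()` expects by default.

```r
# Hypothetical factor table (made-up values), standing in for data[, 4:9].
factors <- data.frame(
  GDP = c(1.34, 1.03, 0.31),
  support = c(1.59, 1.13, 0.52),
  health = c(0.99, 0.89, 0.19)
)

# Build the table for a radar chart: row 1 holds the axis maxima,
# row 2 the axis minima (fixed at 0 here), then the selected rows.
build_radar_input <- function(df, rows) {
  limits <- as.data.frame(rbind(sapply(df, max), rep(0, ncol(df))))
  names(limits) <- names(df)
  out <- rbind(limits, df[rows, , drop = FALSE])
  rownames(out) <- c("max", "min", paste0("row", rows))
  out
}

radar_input <- build_radar_input(factors, rows = c(1, 2))
print(radar_input)
```

Deriving the limits this way keeps the chart correct if the input file changes, at the cost of axes that are no longer fixed across datasets.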

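Regression analysis

The score was regressed on each of the six factors separately with a simple linear regression model. The score shows a clear positive relationship with GDP, support, health, and freedom, a weak positive relationship with corruption, and no clear relationship with generosity.

# Scatter plots
library(dplyr)
library(ggpmisc)
# Relationship between score and GDP
ggplot(data, aes(x = GDP, y = score)) +
  # scatter points
  geom_point(color = "blue", size = 2, alpha = 5/10) +
  # regression line
  geom_smooth(method = lm, formula = y ~ I(x), se = FALSE) +
  # regression equation: sep separates the equation from the adjusted R², label.x and label.y set the position, parse turns the label into an expression
  stat_poly_eq(aes(label = paste(..eq.label.., ..adj.rr.label.., sep = '~~~~~')), formula = y ~ x, parse = TRUE, size = 4, label.x = 0.01, label.y = 1) +
  # add the ANOVA table
  stat_fit_tb(tb.type = 'fit.anova', label.y = 0.92, label.x = 0.01, size = 4, parse = TRUE) +
  theme_classic()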

Relationship between score and GDP:

# Relationship between score and support
ggplot(data, aes(x = support, y = score)) + geom_point(color = "blue", size = 2, alpha = 5/10) +geom_smooth(method = lm,formula=y~I(x),se=FALSE) +stat_poly_eq(aes(label = paste(..eq.label.., ..adj.rr.label.., sep = '~~~~~')), formula = y ~ x, parse = TRUE, size = 4,label.x = 0.01, label.y = 1) + stat_fit_tb(tb.type = 'fit.anova', label.y = 0.92, label.x = 0.01, size = 4, parse = TRUE) + theme_classic()

Relationship between score and support:

# Relationship between score and health
ggplot(data, aes(x = health, y = score)) + geom_point(color = "blue", size = 2, alpha = 5/10) +geom_smooth(method = lm,formula=y~I(x),se=FALSE) +stat_poly_eq(aes(label = paste(..eq.label.., ..adj.rr.label.., sep = '~~~~~')), formula = y ~ x, parse = TRUE, size = 4,label.x = 0.01, label.y = 1) + stat_fit_tb(tb.type = 'fit.anova', label.y = 0.92, label.x = 0.01, size = 4, parse = TRUE) + theme_classic()

Relationship between score and health:

# Relationship between score and freedom
ggplot(data, aes(x = freedom, y = score)) + geom_point(color = "blue", size = 2, alpha = 5/10) +geom_smooth(method = lm,formula=y~I(x),se=FALSE) +stat_poly_eq(aes(label = paste(..eq.label.., ..adj.rr.label.., sep = '~~~~~')), formula = y ~ x, parse = TRUE, size = 4,label.x = 0.01, label.y = 1) + stat_fit_tb(tb.type = 'fit.anova', label.y = 0.92, label.x = 0.01, size = 4, parse = TRUE) + theme_classic()

Relationship between score and freedom:

# Relationship between score and generosity
ggplot(data, aes(x = generosity, y = score)) + geom_point(color = "blue", size = 2, alpha = 5/10) +geom_smooth(method = lm,formula=y~I(x),se=FALSE) +stat_poly_eq(aes(label = paste(..eq.label.., ..adj.rr.label.., sep = '~~~~~')), formula = y ~ x, parse = TRUE, size = 4,label.x = 0.01, label.y = 1) + stat_fit_tb(tb.type = 'fit.anova', label.y = 0.92, label.x = 0.01, size = 4, parse = TRUE) + theme_classic()

Relationship between score and generosity:

# Relationship between score and corruption
ggplot(data, aes(x = corruption, y = score)) + geom_point(color = "blue", size = 2, alpha = 5/10) +geom_smooth(method = lm,formula=y~I(x),se=FALSE) +stat_poly_eq(aes(label = paste(..eq.label.., ..adj.rr.label.., sep = '~~~~~')), formula = y ~ x, parse = TRUE, size = 4,label.x = 0.01, label.y = 1) + stat_fit_tb(tb.type = 'fit.anova', label.y = 0.92, label.x = 0.01, size = 4, parse = TRUE) + theme_classic()

Relationship between score and corruption:
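The six single-factor fits can also be summarised numerically instead of being read off the plots. The sketch below runs on synthetic data in which one predictor drives the score and the other does not. `fit_one`, `toy`, and `noise_factor` are hypothetical names introduced for illustration.

```r
set.seed(1)
# Synthetic stand-in for the real data: score depends on GDP only.
n <- 100
toy <- data.frame(GDP = runif(n, 0, 1.7), noise_factor = runif(n, 0, 0.6))
toy$score <- 3.4 + 2.2 * toy$GDP + rnorm(n, sd = 0.5)

# Fit score ~ predictor and return the slope and R^2 of that fit.
fit_one <- function(df, predictor) {
  fit <- lm(reformulate(predictor, response = "score"), data = df)
  c(slope = unname(coef(fit)[2]), r_squared = summary(fit)$r.squared)
}

# One row per predictor.
results <- t(sapply(c("GDP", "noise_factor"), fit_one, df = toy))
print(round(results, 3))
```

On the real data, the vector of predictor names would be the six factor columns.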

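A multiple regression of the score on all six factors shows that the t-tests for GDP, support, and freedom are highly significant (***), health is fairly significant (**), corruption is weakly significant, and generosity is not significant, which is close to the simple linear regression results. The F-test is significant (p-value < 2.2e-16) and the adjusted R² is 0.7703, so the model fits the data reasonably well.

# Multiple linear regression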
tdata = data[, 3:9]
tlm<-lm(formula= score~GDP+support+health+freedom+generosity+corruption,data=tdata)
summary(tlm)

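Multiple linear regression of score on the six factors:

The following multiple linear regression equation can be written for the happiness score and the factors: score = 1.7952 + 0.7754*GDP + 1.1242*support + 1.0781*health + 1.4548*freedom + 0.4898*generosity + 0.9723*corruption.
Because generosity does not pass the significance test, it is removed and the model is refitted on the remaining predictors. After removing generosity, the t-test significance of corruption improves, the F-test remains significant, and the adjusted R² is unchanged.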

tlm<-lm(formula= score~GDP+support+health+freedom+corruption,data=tdata)
summary(tlm)

Multiple linear regression of score on all factors except generosity:
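Dropping a non-significant predictor and checking that the fit does not get worse can be sketched as a nested-model comparison. The data below are synthetic and built so that `generosity` has no effect; the real comparison would use `tdata`.

```r
set.seed(42)
n <- 150
toy <- data.frame(
  GDP = runif(n, 0, 1.7),
  freedom = runif(n, 0, 0.6),
  generosity = runif(n, 0, 0.5)
)
# By construction, generosity does not enter the synthetic score.
toy$score <- 2 + 1.5 * toy$GDP + 1.8 * toy$freedom + rnorm(n, sd = 0.4)

full <- lm(score ~ GDP + freedom + generosity, data = toy)
reduced <- update(full, . ~ . - generosity)

# Adjusted R^2 of both models, and an F-test of the dropped term.
adj_r2 <- c(full = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(adj_r2)
print(anova(reduced, full))
```

A large p-value in the `anova()` table would support leaving the term out; the adjusted R² values give the same comparison on the scale used in the text.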

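The following multiple linear regression equation can be written for the happiness score and all factors except generosity: score = 1.8689 + 0.7755*GDP + 1.1180*support + 1.0084*health + 1.5340*freedom + 1.1176*corruption.
Among the factors that affect the happiness score, GDP, social support, and healthy life expectancy show a high correlation, freedom shows a moderate correlation, the integrity of government shows a low correlation, and generosity shows a very low correlation.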
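As a worked example of how the reduced regression equation above is used, the sketch below evaluates it for one set of factor values. The coefficients are the ones reported in the text. `predict_score` is a hypothetical helper and the input values describe an imaginary country.

```r
# Coefficients as reported in the reduced regression equation.
coefs <- c(intercept = 1.8689, GDP = 0.7755, support = 1.1180,
           health = 1.0084, freedom = 1.5340, corruption = 1.1176)

# Hypothetical helper: evaluate the fitted equation for one set of factor values.
predict_score <- function(factors, coefs) {
  unname(coefs["intercept"] + sum(coefs[names(factors)] * factors))
}

# Made-up factor values for an imaginary country, for illustration only.
example <- c(GDP = 1.0, support = 1.0, health = 1.0,
             freedom = 0.5, corruption = 0.1)
print(predict_score(example, coefs))
```

With a fitted model object, `predict(tlm, newdata = ...)` does the same calculation without copying coefficients by hand.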

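Chi-square-based normality test

A chi-square test was applied to the score paired with freedom, which was strongly significant in the regression, and to the score paired with generosity, which was not significant. Note that chisq.test() applied to a two-column matrix treats the matrix as a contingency table and runs Pearson's chi-squared test on it.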

chisq.test(cbind(data[, 3], data[, 7]), correct = TRUE)

Chi-square test result for score and freedom:

chisq.test(cbind(data[, 3], data[, 8]), correct = TRUE)

Chi-square test result for score and generosity:
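For comparison, one common way to test normality with a chi-square statistic is a goodness-of-fit test on binned values. The sketch below is an assumption about what such a test could look like, not the method used above. It runs on synthetic values, and the number of bins `k` is an arbitrary choice.

```r
set.seed(123)
# Synthetic values standing in for a column such as data$score.
x <- rnorm(150, mean = 5.4, sd = 1.1)

# Cut the real line into k bins that are equally likely
# under a normal distribution fitted to the sample.
k <- 8
breaks <- qnorm(seq(0, 1, length.out = k + 1), mean = mean(x), sd = sd(x))
observed <- table(cut(x, breaks = breaks))
expected <- rep(length(x) / k, k)

# Pearson statistic; the degrees of freedom are k - 1 - 2
# because the mean and standard deviation were estimated from the data.
stat <- sum((observed - expected)^2 / expected)
p_value <- pchisq(stat, df = k - 3, lower.tail = FALSE)
print(c(statistic = stat, p_value = p_value))
```

A small p-value would suggest the values are not well described by a normal distribution. `shapiro.test()` in base R is another option worth considering for samples of this size.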