Predicting Happiness Using Random Forest

Why do we try to predict happiness? Being able to predict happiness means that we can manipulate or try to improve certain components in order to increase our own happiness, and, for governments, possibly national happiness. I found Random Forest (RF) to be the simplest and most efficient approach, so let’s get started!


Contents:

  1. The Data
  2. Random Forest Model
  3. Data Cleaning
  4. Training and Testing
  5. Feature Importances
  6. Modifying the Number of Variables
  7. Evaluating the Model

The Data:

The data, obtained from the #WorldValuesSurvey, contains more than 290 questions and consists of ~69k responses after removing those with missing data for happiness levels. It is a cross-national survey run over many years, and the questionnaire can be found on the website. In particular, we will be looking at the 2017–2020 data set. The size of the data set makes it well suited for machine learning.


Random Forest Model:

To start with, we will be using the RF classifier*, since we would like the machine to predict the level of happiness in groups (Very happy, Quite happy, Not very happy, Not at all happy).

*Side note: an RF regressor is used instead when predicting a number that can take a range of values, e.g. any value between 0 and 1.

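As a quick illustration of the distinction, here is a minimal sketch on toy data (not the WVS set):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classifier: the target is one of a fixed set of groups, e.g. happiness levels 1-4
clf = RandomForestClassifier().fit([[0, 1], [1, 0], [1, 1], [0, 0]], [1, 2, 3, 4])

# Regressor: the target is a continuous number, e.g. any value between 0 and 1
reg = RandomForestRegressor().fit([[0, 1], [1, 0], [1, 1], [0, 0]], [0.1, 0.4, 0.8, 0.3])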

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
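
The snippets that follow assume the survey responses have already been loaded into a DataFrame named df. A minimal sketch, where the file name is hypothetical and should point to whichever WVS CSV export you downloaded:

# Load the WVS wave 7 responses (file name is a placeholder)
df = pd.read_csv("WVS_Cross-National_Wave_7.csv")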

Data cleaning: Selecting the data:

Let’s start by keeping only the question columns and removing negative values* in the responses to Q46, which asks about happiness levels.


var = "Q46"
df = df[df.columns[32:349]]
df = df[df[var] > 0]

*Negative values indicate that respondents said they don’t know, gave no answer, were not asked, or that the response was missing. These values would make it harder for the machine to classify the responses, since they increase the number of categories and are not what we are looking for.


The data set remaining is shown below:

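The screenshot of the resulting DataFrame is not reproduced here; if you are following along, a quick way to inspect what remains:

# Check the remaining shape and preview the first rows
print(df.shape)   # expect ~69k rows and 317 columns after the slice above
df.head()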

Further data cleaning:

The next concern is that we would have to deal with missing values in other columns. There are 3 options to consider:


  1. Replace the missing values with 0
  2. Replace the missing values with the mean
  3. Drop the rows with missing values (the data set becomes empty)

Since the third option is not viable, we will have to check which option, 1 or 2, would give the highest accuracy. In this case, I found that replacing with 0 makes it more accurate.


df.fillna(0, inplace=True)
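
For comparison, option 2 (mean imputation) would be a one-line change:

# Option 2: fill each column's missing values with that column's mean
df.fillna(df.mean(), inplace=True)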

Prepare train labels:

Now we set the ‘label’, i.e. the feature we want the machine to predict, and split the data into train and test sets.


train_labels = np.array(df[var])
train_features = df.drop(var, axis=1)
feature_list = list(train_features.columns)
train_features = np.array(train_features)
train_features, test_features, train_labels, test_labels = train_test_split(
    train_features, train_labels, test_size=0.25, random_state=42)

Train and Test the Model:

The process of training and testing is simple. To improve the predictive power and/or model speed, we can simply modify the parameters within the RF classifier.


Increasing accuracy:

n_estimators — number of trees the algorithm builds before majority voting


max_features — maximum number of features random forest considers to split a node


min_samples_leaf — the minimum number of samples required to be at a leaf node.


Increasing speed:

n_jobs — the number of processors the model is allowed to use. If 1, it uses only one processor; if -1, there is no limit.


random_state — makes the model’s output replicable, i.e. it always produces the same results given the same hyperparameters and training data.


oob_score — whether to use out-of-bag samples to estimate the generalization accuracy, a built-in alternative to cross-validation.


rf = RandomForestClassifier(n_estimators=1000, oob_score=True, n_jobs=-1,
                            random_state=42, max_features="auto",
                            min_samples_leaf=12)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
print(metrics.accuracy_score(test_labels, predictions))
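
Since oob_score=True was set, scikit-learn also exposes the out-of-bag accuracy estimate after fitting, which gives a cross-validation-style check without a separate holdout:

# Out-of-bag accuracy estimate from the trained forest
print(rf.oob_score_)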

The model takes 1.3 minutes to train on ~52k rows and more than 290 columns, and 1 second to test. The accuracy was 63.70%. If we had chosen to fill the missing values with the mean instead, the accuracy would have been 63.55%. But what matters is finding out what influences the machine’s prediction, since those are the variables we want to look at. We certainly cannot expect everyone to answer 290+ questions, or try to work on all 290 aspects to improve happiness (that would cost a lot). So we’ll be looking at the feature importances.


Feature Importances:

If you recall, feature_list contains the columns of all other variables except Q46. The goal is to understand which variables influence the prediction.


importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2))
                       for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

x_values = list(range(len(importances)))
# Make a bar chart
plt.bar(x_values, importances, orientation='vertical', color='r', edgecolor='k', linewidth=1.2)
# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances')

Feature importances sum to 1. We notice that certain variables have a greater influence on the prediction than others, and that almost every variable has some influence, albeit an extremely small one, simply because there are so many variables. The next step is to keep improving our model so that we can better understand happiness.


Modifying the number of variables:

Let’s take the top 20 features and set up a new model using just these 20 variables (plus var itself). We’ll repeat the data cleaning and use the same RF model. I got an accuracy of 64.47%. If we had chosen to replace missing values with the mean, the accuracy would have been 64.41%. What is surprising here is that with a smaller number of variables, the model becomes more accurate (from 63.70% to 64.47%). This is likely because the other variables were generating noise in the model and making it less accurate.

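A minimal sketch of this step, reusing the names from the earlier snippets (the rerun of the data cleaning is omitted):

# Keep the 20 most important questions plus the target itself
top_features = [feature for feature, importance in feature_importances[:20]]
df_top = df[top_features + [var]]

# Repeat the split and refit the same RF model on the reduced data
labels = np.array(df_top[var])
features = np.array(df_top.drop(var, axis=1))
train_f, test_f, train_l, test_l = train_test_split(
    features, labels, test_size=0.25, random_state=42)
rf.fit(train_f, train_l)
print(metrics.accuracy_score(test_l, rf.predict(test_f)))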

Let’s look at the Feature Importances again:

This time, it is easier to tell which variables are more important. You may refer to the questionnaire found on the WVS website for more detailed information. I will give a summary of the topics the questions covered.


Evaluating the model:

Let’s look at the graph of actual vs predicted values for the first 200 test values. For greater visibility across the whole test set, let’s also do a simple count of the differences between predicted and actual values (predicted minus actual).

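A sketch of both checks, assuming the arrays from the train/test split above:

# Actual vs predicted happiness levels for the first 200 test rows
plt.plot(test_labels[:200], 'b-', label='actual')
plt.plot(predictions[:200], 'r--', label='predicted')
plt.xlabel('Test observation'); plt.ylabel('Happiness level (Q46)')
plt.legend(); plt.show()

# Simple count of the prediction errors (predicted minus actual)
from collections import Counter
print(Counter(predictions - test_labels))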

The model appears to err slightly more on the negative side than the positive when predicting happiness levels, but can otherwise be considered balanced!


Insights:

What I have done is to examine the key questions, out of the more than 290 in the WVS, that are most relevant to happiness levels. This means that we can focus specifically on these aspects when examining happiness.


Looking at the questionnaire, we would also notice that Q261 and Q262 capture the same thing (age and year born), so we could remove one of them to include another feature. Q266, Q267 and Q268 (the countries of birth of the respondent and their parents) appear to be repeats, but are not exactly the same, since immigration or cross-cultural marriage may occur. Nonetheless, we could consider removing two of them, since such occurrences are minimal.

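If we do decide to drop the redundant questions, it is a one-liner applied to the full question DataFrame from the cleaning step (keeping Q261 and Q266 here is an arbitrary choice):

# Drop year born (duplicates age) and the parents' countries of birth
df = df.drop(columns=['Q262', 'Q267', 'Q268'])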

The general topics are:

Individual level: life satisfaction, health, finances, freedom, age, safety, religion, marriage, and family.
National level: country, perception of corruption, democracy/political influence, national pride.


In particular, health, finances and age were the top features deemed important by the machine. In this sense, the individual-level factors have a greater influence on one’s happiness level than the national-level factors.


However, I noticed that the WVS did not have data on sleep hours, which was a key element observed in my earlier post. Nonetheless, it is still very useful, as we can take these aspects forward for further analysis! I’ll be back with more insights into the correlation between these aspects and happiness, to determine how we can improve our happiness levels. Until then, remember to stay happy!


Translated from: https://towardsdatascience.com/predicting-happiness-using-random-forest-1e6477affc24
