Predicting Happiness Using Random Forest

Why do we try to predict happiness? Being able to predict happiness means that we can manipulate or try to improve certain components in order to increase our own happiness, and, for governments, possibly national happiness. I found Random Forest (RF) to be the simplest and most efficient approach, so let’s get started!


Contents:

  1. The Data
  2. Random Forest Model
  3. Data Cleaning
  4. Training and Testing
  5. Feature Importances
  6. Modifying the Number of Variables
  7. Evaluating the Model

The Data:

The data, obtained from the #WorldValuesSurvey, contains more than 290 questions and consists of ~69k responses after removing those with missing data for happiness levels. It is a cross-national survey run over many years, and the questionnaire can be found on the website. In particular, we will be looking at the 2017–2020 data set. The size of the data set makes it well suited for machine learning.


Random Forest Model:

To start with, we will be using the RF classifier*, since we would like the machine to predict the level of happiness in groups (Very happy, Quite happy, Not very happy, Not at all happy).

*Side note: an RF regressor is used instead when predicting a number that can take a range of values, e.g. any value between 0 and 1.

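As a quick illustration of the distinction, here is a minimal sketch on toy data (not the WVS set):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classifier: the target is one of a fixed set of groups, e.g. happiness levels 1-4
clf = RandomForestClassifier().fit([[0, 1], [1, 0], [1, 1], [0, 0]], [1, 2, 3, 4])

# Regressor: the target is a continuous number, e.g. any value between 0 and 1
reg = RandomForestRegressor().fit([[0, 1], [1, 0], [1, 1], [0, 0]], [0.1, 0.4, 0.8, 0.3])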

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
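
The snippets that follow assume the survey responses have already been loaded into a DataFrame named df. A minimal sketch, where the file name is hypothetical and should point to whichever WVS CSV export you downloaded:

# Load the WVS wave 7 responses (file name is a placeholder)
df = pd.read_csv("WVS_Cross-National_Wave_7.csv")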

Data cleaning: Selecting the data:

Let’s start by keeping only the question columns and removing negative values* in the responses to Q46, which asks about happiness levels.


var = "Q46"
df = df[df.columns[32:349]]
df = df[df[var] > 0]

*Negative values indicate that respondents said they don’t know, gave no answer, were not asked, or that the response was missing. These values would make it harder for the machine to classify the responses, since they increase the number of categories and are not what we are looking for.


The data set remaining is shown below:

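The screenshot of the resulting DataFrame is not reproduced here; if you are following along, a quick way to inspect what remains:

# Check the remaining shape and preview the first rows
print(df.shape)   # expect ~69k rows and 317 columns after the slice above
df.head()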

Further data cleaning:

The next concern is that we would have to deal with missing values in other columns. There are 3 options to consider:


  1. Replace the missing values with 0
  2. Replace the missing values with the mean
  3. Drop the rows with missing values (the data set becomes empty)

Since the third option is not viable, we will have to check which option, 1 or 2, would give the highest accuracy. In this case, I found that replacing with 0 makes it more accurate.


df.fillna(0, inplace=True)
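
For comparison, option 2 (mean imputation) would be a one-line change:

# Option 2: fill each column's missing values with that column's mean
df.fillna(df.mean(), inplace=True)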

Prepare train labels:

Now we set the ‘label’, i.e. the feature we want the machine to predict, and split the data into train and test sets.


train_labels = np.array(df[var])
train_features = df.drop(var, axis=1)
feature_list = list(train_features.columns)
train_features = np.array(train_features)
train_features, test_features, train_labels, test_labels = train_test_split(
    train_features, train_labels, test_size=0.25, random_state=42)

Train and Test the Model:

The process of training and testing is simple. To improve the predictive power and/or model speed, we can simply modify the parameters within the RF classifier.


Increasing accuracy:

n_estimators — number of trees the algorithm builds before majority voting


max_features — maximum number of features random forest considers to split a node


min_samples_leaf — the minimum number of samples required to be at a leaf node.


Increasing speed:

n_jobs — the number of processors the model is allowed to use. If 1, it uses only one processor; if -1, there is no limit.


random_state — makes the model’s output replicable, i.e. it always produces the same results given the same hyperparameters and training data.


oob_score — whether to use out-of-bag samples to estimate the generalization accuracy, a built-in alternative to cross-validation.


rf = RandomForestClassifier(n_estimators=1000, oob_score=True, n_jobs=-1,
                            random_state=42, max_features="auto",
                            min_samples_leaf=12)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
print(metrics.accuracy_score(test_labels, predictions))
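
Since oob_score=True was set, scikit-learn also exposes the out-of-bag accuracy estimate after fitting, which gives a cross-validation-style check without a separate holdout:

# Out-of-bag accuracy estimate from the trained forest
print(rf.oob_score_)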

The model takes 1.3 minutes to train on ~52k rows and more than 290 columns, and 1 second to test. The accuracy was 63.70%. If we had chosen to fill the missing values with the mean instead, the accuracy would have been 63.55%. But what matters is finding out what influences the machine’s prediction, since those are the variables we want to look at. We certainly cannot expect everyone to answer 290+ questions, or try to work on all 290 aspects to improve happiness (that would cost a lot). So we’ll be looking at the feature importances.


Feature Importances:

If you recall, feature_list contains the columns of all other variables except Q46. The goal is to understand which variables influence the prediction.


importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2))
                       for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

x_values = list(range(len(importances)))
# Make a bar chart
plt.bar(x_values, importances, orientation='vertical', color='r', edgecolor='k', linewidth=1.2)
# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances')

Feature importances sum to 1. We notice that certain variables have a greater influence on the prediction than others, and that almost every variable has some influence, albeit an extremely small one, simply because there are so many variables. The next step is to keep improving our model so that we can better understand happiness.


Modifying the number of variables:

Let’s take the top 20 features and set up a new model using just these 20 variables (plus var itself). We’ll repeat the data cleaning and use the same RF model. I got an accuracy of 64.47%. If we had chosen to replace missing values with the mean, the accuracy would have been 64.41%. What is surprising here is that with a smaller number of variables, the model becomes more accurate (from 63.70% to 64.47%). This is likely because the other variables were generating noise in the model and making it less accurate.

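A minimal sketch of this step, reusing the names from the earlier snippets (the rerun of the data cleaning is omitted):

# Keep the 20 most important questions plus the target itself
top_features = [feature for feature, importance in feature_importances[:20]]
df_top = df[top_features + [var]]

# Repeat the split and refit the same RF model on the reduced data
labels = np.array(df_top[var])
features = np.array(df_top.drop(var, axis=1))
train_f, test_f, train_l, test_l = train_test_split(
    features, labels, test_size=0.25, random_state=42)
rf.fit(train_f, train_l)
print(metrics.accuracy_score(test_l, rf.predict(test_f)))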

Let’s look at the Feature Importances again:

This time, it is easier to tell which variables are more important. You may refer to the questionnaire found on the WVS website for more detailed information. I will give a summary of the topics the questions covered.


Evaluating the model:

Let’s look at the graph of actual vs predicted values for the first 200 test values. For greater visibility across the whole test set, let’s also do a simple count of the differences between predicted and actual values (predicted minus actual).

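A sketch of both checks, assuming the arrays from the train/test split above:

# Actual vs predicted happiness levels for the first 200 test rows
plt.plot(test_labels[:200], 'b-', label='actual')
plt.plot(predictions[:200], 'r--', label='predicted')
plt.xlabel('Test observation'); plt.ylabel('Happiness level (Q46)')
plt.legend(); plt.show()

# Simple count of the prediction errors (predicted minus actual)
from collections import Counter
print(Counter(predictions - test_labels))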

The model appears to err slightly more on the negative side than the positive when predicting happiness levels, but can otherwise be considered balanced!


Insights:

What I have done is to examine the key questions, out of the more than 290 in the WVS, that are most relevant to happiness levels. This means that we can focus specifically on these aspects when examining happiness.


Looking at the questionnaire, we would also notice that Q261 and Q262 capture the same thing (age and year born), so we could remove one of them to include another feature. Q266, Q267 and Q268 (the countries of birth of the respondent and their parents) appear to be repeats, but are not exactly the same, since immigration or cross-cultural marriage may occur. Nonetheless, we could consider removing two of them, since such occurrences are minimal.

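If we do decide to drop the redundant questions, it is a one-liner applied to the full question DataFrame from the cleaning step (keeping Q261 and Q266 here is an arbitrary choice):

# Drop year born (duplicates age) and the parents' countries of birth
df = df.drop(columns=['Q262', 'Q267', 'Q268'])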

The general topics are:

Individual level: life satisfaction, health, finances, freedom, age, safety, religion, marriage, and family.
National level: country, perception of corruption, democracy/political influence, national pride.


In particular, health, finances and age were the top features deemed important by the machine. In this sense, the individual-level factors have a greater influence on one’s happiness level than the national-level factors.


However, I noticed that the WVS did not have data on sleep hours, which was a key element observed in my earlier post. Nonetheless, it is still very useful, as we can take these aspects forward for further analysis! I’ll be back with more insights into the correlation between these aspects and happiness, to determine how we can improve our happiness levels. Until then, remember to stay happy!


Translated from: https://towardsdatascience.com/predicting-happiness-using-random-forest-1e6477affc24
