The Power of Algorithmic Crowds

Introduction

One of the most fascinating historical examples of the power of crowds appears in James Surowiecki’s “The Wisdom of Crowds.” A team of engineers, oceanographers, salvage crew members, and mathematicians was asked to estimate where a sunken submarine, the Scorpion, could be found (the Navy did not have the manpower to search the entire area and wanted a more specific guess). Individually, the guessers were wildly inaccurate, but the group’s combined guess was only about 220 yards (roughly 200 m) from the submarine’s actual position!

It has been shown repeatedly that a reliable way to arrive at an estimate is to ask many people from diverse backgrounds: the more diverse, the better. How can we apply this sociological concept to machine learning?

Ensemble Learning

An ensemble model is simply a collection of models whose outputs are combined, for example by averaging or voting, to provide a “crowd’s guess.” Just as humans have biases, models carry inherent assumptions and biases of their own. Combining a few models tends to reduce error, as long as their mistakes are not strongly correlated.
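
As a rough illustration of why averaging helps, the sketch below simulates five hypothetical estimators of the same quantity and compares the error of a single estimator with the error of their average. The function name `simulate`, the true value, and the Gaussian noise model are all assumptions made for this example; in particular, it assumes the estimators' errors are independent, which real models trained on the same data rarely are, so the gain in practice is usually smaller.

```python
import random
import statistics

def simulate(n_models=5, n_trials=10_000, noise=1.0, seed=0):
    """Return (mean abs error of one estimator, mean abs error of the average)."""
    rng = random.Random(seed)
    truth = 10.0  # hypothetical quantity every model tries to estimate
    single_errors, crowd_errors = [], []
    for _ in range(n_trials):
        # Assumption: each model's error is independent Gaussian noise.
        guesses = [truth + rng.gauss(0.0, noise) for _ in range(n_models)]
        single_errors.append(abs(guesses[0] - truth))
        crowd_errors.append(abs(statistics.fmean(guesses) - truth))
    return statistics.fmean(single_errors), statistics.fmean(crowd_errors)

single_mae, crowd_mae = simulate()
print(f"single model mean abs error: {single_mae:.3f}")
print(f"averaged mean abs error:     {crowd_mae:.3f}")
```

Under these assumptions the averaged guess has a noticeably smaller error than any single estimator, which is the effect the ensemble relies on.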

The Data

For this example, I used credit card fraud data from Kaggle. The first order of business was to pick an evaluation metric. The classes are heavily imbalanced (99.8% of the transactions are labeled normal and the other 0.2% fraudulent), so accuracy was out of the picture: a model that labels every transaction as normal would already score 99.8%. For this type of problem, precision, recall, or F1 score is a better choice. For simplicity, I chose precision.
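
To make the metric choice concrete, here is a minimal sketch that scores a trivial "always normal" classifier on a hypothetical label set with the same 99.8% / 0.2% split. The data and the helper names (`accuracy`, `precision`, `recall`) are made up for illustration and are not taken from the Kaggle dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    """TP / (TP + FP); None when the model flags nothing as fraud."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if (tp + fp) else None

def recall(y_true, y_pred):
    """TP / (TP + FN); None when there are no fraudulent samples."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else None

# Hypothetical labels: 9,980 normal (0) and 20 fraudulent (1) transactions.
y_true = [0] * 9980 + [1] * 20
always_normal = [0] * len(y_true)  # a "model" that never flags fraud

print(accuracy(y_true, always_normal))   # 0.998
print(precision(y_true, always_normal))  # None (nothing was flagged)
print(recall(y_true, always_normal))     # 0.0
```

The useless classifier looks excellent by accuracy and is exposed immediately by precision and recall.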

The Models

After a brief look at the data, I trained five models: K-Nearest Neighbors (KNN), Logistic Regression, Random Forest, XGBoost, and Naive Bayes. The precision of each model is shown in the chart below.

Figure: Precisions for five different algorithms.

Clearly, the Naive Bayes and KNN classifiers were not ideal, but the other three did quite well. The maximum precision achieved by an individual model was 0.96. Keep that number in mind.

A Democracy (Almost)

One way to turn these outputs into a “crowd-like” format is to have each model vote according to its own prediction. For example, if the Random Forest predicts 0, it votes “0.” If all the other algorithms predict 1, they each vote “1.” The final tally is then four 1s and one 0, so the final output is 1.
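
The unweighted tally described above can be sketched in a few lines. The function name `majority_vote` and the dictionary layout are choices made for this example, not the author's code.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most models.

    `predictions` maps a model name to its 0/1 prediction for one transaction.
    """
    tally = Counter(predictions.values())
    label, _count = tally.most_common(1)[0]
    return label

# The example from the text: Random Forest says 0, the other four say 1.
votes = {
    "random_forest": 0,
    "knn": 1,
    "logistic_regression": 1,
    "xgboost": 1,
    "naive_bayes": 1,
}
print(majority_vote(votes))  # 1
```

With five voters and two labels there can be no tie, so no tie-breaking rule is needed here.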

I used a “weighted voting model” in which the best-performing models get the most votes. The chart below details how many votes I gave each algorithm.

These numbers were determined from the relative precision of each model. Now let’s have them vote!
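
A weighted vote might look like the sketch below. The vote counts in `VOTES` are hypothetical placeholders, not the author's actual numbers, and sending ties to 0 ("normal") is likewise an assumption of this example.

```python
# Hypothetical vote counts; stronger models get more votes.
VOTES = {
    "xgboost": 3,
    "random_forest": 3,
    "logistic_regression": 2,
    "knn": 1,
    "naive_bayes": 1,
}

def weighted_vote(predictions, weights):
    """Return 1 if the models predicting 1 hold more votes than those predicting 0.

    Assumption: a tie is resolved as 0 (normal transaction).
    """
    votes_for_1 = sum(weights[m] for m, p in predictions.items() if p == 1)
    votes_for_0 = sum(weights[m] for m, p in predictions.items() if p == 0)
    return 1 if votes_for_1 > votes_for_0 else 0

# Two strong models flag fraud; three weaker models disagree.
predictions = {
    "xgboost": 1,
    "random_forest": 1,
    "logistic_regression": 0,
    "knn": 0,
    "naive_bayes": 0,
}
print(weighted_vote(predictions, VOTES))  # 1 (6 votes against 4)
```

In this example a simple head count would output 0 (three models against two), while the weighted tally sides with the two stronger models.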

The Results

After running the voting method, I received the following results:

Prediction: 0, Actual: 0 → 56,868 (true negative)

Prediction: 1, Actual: 0 → 2 (false positive)

Prediction: 0, Actual: 1 → 21 (false negative)

Prediction: 1, Actual: 1 → 71 (true positive)

This translates to a precision of 71 / (71 + 2) ≈ 0.973, higher than the 0.96 of the best individual model! Notice that for this data, the false negative is the costliest outcome, since it means the transaction was fraudulent but our models predicted a normal transaction.
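
The arithmetic behind that figure can be checked directly from the four counts, treating "prediction 1, actual 1" as true positives. Recall and F1 are not reported in the article; they are computed here from the same counts only because false negatives are singled out as the costliest outcome.

```python
# Confusion-matrix counts reported for the voting ensemble.
tn, fp, fn, tp = 56_868, 2, 21, 71

precision = tp / (tp + fp)  # share of flagged transactions that were fraud
recall = tp / (tp + fn)     # share of fraud that was flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}")  # 0.973
print(f"recall    = {recall:.3f}")     # 0.772
print(f"F1        = {f1:.3f}")         # 0.861
```

Precision only looks at the 73 flagged transactions, so the 21 missed frauds show up in recall and F1 but not in the 0.973.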

Just to show that the crowd is better than the individual, here’s a chart of how many times each algorithm was correctly “outvoted” by the crowd (i.e., one algorithm voted incorrectly, but the other algorithms voted correctly and hence “outvoted” it).

Figure: Occurrences of individual misses but crowd successes.

As expected, our Naive Bayes model was overridden most often, but even the Random Forest was correctly outvoted 23 times by the other algorithms! The Naive Bayes model contributed to that outvoting as well, so it is still useful even if it is not the best performer.
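
One way to count these "correct outvotes" is sketched below: for each model, count the samples where the model was wrong but the ensemble's final answer was right. The function name and the tiny data set are hypothetical and only illustrate the bookkeeping; they do not reproduce the article's numbers.

```python
def count_correctly_outvoted(model_preds, crowd_preds, y_true):
    """For each model, count samples where it was wrong but the crowd was right."""
    counts = {}
    for name, preds in model_preds.items():
        counts[name] = sum(
            p != t and c == t
            for p, c, t in zip(preds, crowd_preds, y_true)
        )
    return counts

# Tiny hypothetical example with five transactions.
y_true = [0, 1, 1, 0, 1]
crowd = [0, 1, 1, 0, 0]  # the ensemble's final answers (wrong on the last one)
model_preds = {
    "naive_bayes": [1, 0, 1, 0, 0],
    "random_forest": [0, 1, 0, 0, 0],
}

counts = count_correctly_outvoted(model_preds, crowd, y_true)
print(counts)  # {'naive_bayes': 2, 'random_forest': 1}
```

The last transaction is not counted for either model, because the crowd got it wrong as well.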

Conclusion

The benefits of an ensemble method include turning even weak models into useful votes. This approach has very few drawbacks and belongs in every data scientist’s toolbox. Better results could probably be achieved with hyperparameter optimization, but I used the stock, untuned models for this example.

If you enjoyed this article, you can follow me for more content like this. Thanks for reading!

Translated from: https://towardsdatascience.com/the-power-of-algorithmic-crowds-be930baf2139
