Critical Thinking

As Alexander Pope said, to err is human. By that metric, who is more human than us data scientists? We devise wrong hypotheses constantly and then spend time working on them just to find out how wrong we were.

When looking at mistakes from an experiment, a data scientist needs to be critical, always on the lookout for something that others may have missed. But sometimes, in our day-to-day routine, we can easily get lost in little details. When this happens, we often fail to look at the overall picture, ultimately failing to deliver what the business wants.

Our business partners have hired us to generate value. We won’t be able to generate that value unless we develop business-oriented critical thinking, including having a more holistic perspective of the business at hand. So here is some practical advice for your day-to-day work as a data scientist. These recommendations will help you to be more diligent and more impactful at the same time.

1. Beware of the Clean Data Syndrome

Tell me how many times this has happened to you: You get a data set and start working on it straight away. You create neat visualizations and start building models. Maybe you even present automatically generated descriptive analytics to your business counterparts!

But do you ever ask, “Does this data actually make sense?” Incorrectly assuming that the data is clean could lead you toward very wrong hypotheses. Not only that, but you’re also missing an important analytical opportunity with this assumption.

You can actually discern a lot of important patterns by looking at discrepancies in the data. For example, if you notice that a particular column has more than 50 percent of its values missing, you might think about dropping the column. But what if the values are missing because the data collection instrument has an error? By calling attention to this, you could help the business improve its processes.
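
Before dropping a sparse column, it can help to check whether the gaps cluster around one source. Here is a minimal sketch in plain Python; the records, the `device` field, and the helper `missing_share_by_group` are all made up for illustration, and `None` is assumed to mark a missing reading.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing reading. The "device" field
# and every value here are invented for illustration.
rows = [
    {"device": "sensor_a", "temperature": 21.5},
    {"device": "sensor_a", "temperature": 22.1},
    {"device": "sensor_a", "temperature": 20.9},
    {"device": "sensor_b", "temperature": None},
    {"device": "sensor_b", "temperature": None},
    {"device": "sensor_b", "temperature": None},
    {"device": "sensor_b", "temperature": 19.8},
    {"device": "sensor_c", "temperature": 23.0},
]


def missing_share_by_group(records, group_key, value_key):
    """Return {group: fraction of records whose value is missing}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        group = record[group_key]
        total[group] += 1
        if record.get(value_key) is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


shares = missing_share_by_group(rows, "device", "temperature")
for device, share in sorted(shares.items()):
    print(f"{device}: {share:.0%} missing")
```

If almost all of the gaps come from a single device, as in this toy data, that is a finding worth reporting to the business rather than a column to drop quietly.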

Or what if you’re given a distribution of customers that shows a ratio of 90 percent men versus 10 percent women, but the business is a cosmetics company that predominantly markets its products to women? You could assume you have clean data and show the results as is, or you can use common sense and ask the business partner if the labels are switched.
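
A sanity check like this can be written down once the business has told you what to expect. The sketch below compares observed category shares with expected ones and flags large gaps. The expected shares, the tolerance, and the function name `flag_unexpected_shares` are assumptions for illustration, not figures from any real company.

```python
def flag_unexpected_shares(observed_counts, expected_shares, tolerance=0.25):
    """Return categories whose observed share differs from the expected
    share by more than `tolerance` (absolute difference)."""
    total = sum(observed_counts.values())
    flagged = {}
    for category, expected in expected_shares.items():
        observed = observed_counts.get(category, 0) / total
        if abs(observed - expected) > tolerance:
            flagged[category] = (observed, expected)
    return flagged


# Observed counts match the 90/10 split in the example above.
observed = {"men": 900, "women": 100}
# Hypothetical prior supplied by the business partner.
expected = {"men": 0.2, "women": 0.8}

flagged = flag_unexpected_shares(observed, expected)
for category, (obs, exp) in flagged.items():
    print(f"{category}: observed {obs:.0%}, expected about {exp:.0%}")
```

A flag here does not prove the labels are switched. It only tells you which question to take back to the business partner.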

Such errors are widespread. Catching them not only improves future data collection but also keeps other teams from using bad data, which prevents the company from making wrong decisions.

2. Be Aware of the Business

Source: Fab.com Beginnings

You probably know fab.com. If you don’t, it’s a website that sells selected health and fitness items. But the site’s origins weren’t in e-commerce. Fab.com started as Fabulis.com, a social networking site for gay men. One of the site’s most popular features was called the “Gay Deal of the Day.”

One day, the deal was for hamburgers. Half of the deal’s buyers were women, despite the fact that they weren’t the site’s target users. This fact caused the data team to realize that they had an untapped market for selling goods to women. So Fabulis.com changed its business model to serve this newfound market.

Be on the lookout for anything out of the ordinary. Be ready to ask questions. If you see something unexpected in the data, you may have struck gold. Data can help a business optimize revenue, but sometimes it has the power to change the direction of the company as well.

Source: Flickr Origins as “Game Neverending”

Another famous example of this is Flickr, which started out as a multiplayer game. Only when the founders noticed that people were using it as a photo upload service did the company pivot to the photo-sharing app we know it as today.

Try to see patterns that others would miss. Do you see a discrepancy in some buying patterns or maybe something you can’t seem to explain? That might be an opportunity in disguise when you look through a wider lens.

3. Focus on the Right Metrics

What do we want to optimize for? Most businesses fail to answer this simple question.

Every business problem is a little different and should, therefore, be optimized differently. For example, a website owner might ask you to optimize for daily active users. Daily active users is a metric defined as the number of people who open a product on a given day. But is that the right metric? Probably not! In reality, it’s just a vanity metric, meaning one that makes you look good but gives you nothing to act on. This metric will almost always increase if you are spending marketing dollars across various channels to bring more and more customers to your site.

Instead, I would recommend optimizing the percentage of users who are active, which gives a better idea of how the product is performing. A big marketing campaign might bring a lot of users to the site, but if only a few of them become active, the campaign was a failure and the site’s stickiness is very low. The percentage of active users measures stickiness; the raw count of daily active users does not. If the percentage of active users is increasing, that is a much stronger sign that people like the website.
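
To make the difference concrete, here is a small sketch with invented numbers. The periods, the user counts, and the helper `active_share` are hypothetical; the point is only that the two metrics can move in opposite directions.

```python
# Hypothetical figures around a marketing campaign (all numbers invented).
periods = [
    {"label": "before campaign", "registered": 10_000, "daily_active": 2_000},
    {"label": "after campaign", "registered": 50_000, "daily_active": 4_000},
]


def active_share(registered, daily_active):
    """Fraction of registered users who were active on the day."""
    return daily_active / registered


shares = [active_share(p["registered"], p["daily_active"]) for p in periods]

for period, share in zip(periods, shares):
    print(
        f"{period['label']}: {period['daily_active']} daily active users, "
        f"{share:.0%} of registered users active"
    )
```

In this toy data the count of daily active users doubles while the share of active users falls, which is exactly the situation where the vanity metric looks good and the product is getting less sticky.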

Another example of looking at the wrong metric happens when we create classification models. We often try to increase accuracy for such models. But do we really want accuracy as a metric of our model performance?

Source: Pixabay

Imagine that we’re predicting whether an asteroid will hit the Earth. If we want to optimize for accuracy, we can just predict “no impact” every time, and we will be 99.99 percent accurate. That 0.01 percent error could be hugely impactful, though. What if that 0.01 percent is a planet-killing asteroid? A model can be highly accurate but not at all valuable. A better metric would be the F1 score, which would be zero in this case: the recall of such a model is zero because it never predicts an asteroid hitting the Earth.
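
The arithmetic is easy to check by hand. The sketch below uses 10,000 made-up observations with a single positive case and a "model" that always predicts zero. The function `classification_metrics` is written here for illustration, and it follows the common convention of treating an undefined precision, recall, or F1 (zero denominator) as zero.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive).
    Undefined ratios (zero denominator) are reported as 0.0 by convention."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented data: one impact in 10,000 observations.
y_true = [0] * 9_999 + [1]
# The lazy model: always predict "no impact".
y_pred = [0] * 10_000

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy rewards the model for the 9,999 easy cases, while recall and F1 only look at how it handles the one case that matters.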

When it comes to data science, designing the project and choosing the evaluation metrics is much more important than the modeling itself. The metrics need to reflect the business goal, and aiming for the wrong goal defeats the whole purpose of modeling. For example, F1 or PR AUC (the area under the precision-recall curve) is a better metric for asteroid prediction because both take the model’s precision and recall into account. If we optimize for accuracy, our whole modeling effort could be in vain.

4. Statistics Lie Sometimes

Be skeptical of any statistics that get quoted to you. Statistics have been used to lie in advertisements, in workplaces, and in a lot of other arenas in the past. People will do anything to get sales or promotions.

For example, do you remember Colgate’s claim that 80 percent of dentists recommended its brand? This statistic seems pretty good at first. If so many dentists recommend Colgate, I should use it too, right? It turns out that during the survey, the dentists could choose multiple brands rather than just one. So other brands could be just as popular as Colgate.
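
Here is a toy version of such a multiple-choice survey. The five responses and the names "Brand B" and "Brand C" are invented and are not the real survey data. The sketch only shows how several brands can each be "recommended by 80 percent" at once.

```python
from collections import Counter

# Invented responses: each dentist may recommend several brands.
responses = [
    {"Colgate", "Brand B"},
    {"Colgate", "Brand B", "Brand C"},
    {"Colgate", "Brand B"},
    {"Colgate", "Brand C"},
    {"Brand B", "Brand C"},
]


def recommendation_shares(answers):
    """Share of respondents who included each brand in their answer."""
    counts = Counter(brand for answer in answers for brand in answer)
    return {brand: count / len(answers) for brand, count in counts.items()}


shares = recommendation_shares(responses)
for brand, share in sorted(shares.items()):
    print(f"{brand}: recommended by {share:.0%} of dentists")
print(f"Sum of shares: {sum(shares.values()):.0%}")
```

Because the shares add up to well over 100 percent, a headline figure for one brand says nothing about how it ranks against the others.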

Marketing departments are just myth-creation machines. We often see such examples in our daily lives. Take, for example, this 1992 ad from Chevrolet. Looking at just the graph and not at the axis labels, you would think Nissan/Datsun must be dreadful truck manufacturers. In fact, the graph indicates that more than 95 percent of the Nissan and Datsun trucks sold in the previous 10 years were still running. And the small difference might just be due to sample sizes and the types of trucks sold by each company. As a general rule, never trust a chart that doesn’t label the Y-axis.

As the pandemic continues, we’re seeing even more such examples, with a lot of studies promoting cures for COVID-19. This past June in India, a man claimed to have made a medicine for coronavirus that cured 100 percent of patients in seven days. This news predictably caused a big stir, but only after he was asked about the sample size did we understand what was actually happening. With a sample size of 100, the claim was utterly ridiculous on its face. Worse, the way the sample was selected was hugely flawed. His organization selected asymptomatic and mildly symptomatic patients with a mean age between 35 and 45 and no pre-existing conditions. I was dumbfounded: this was not even a random sample. So not only was the study useless, it was actually unethical.
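
To see why the way a sample is selected matters so much, consider a toy simulation in which the "medicine" does nothing at all. The recovery probabilities below are arbitrary numbers chosen for illustration, not medical estimates, and the severity mix of the population is equally made up.

```python
import random

# Arbitrary illustrative probabilities of recovering within the study
# window WITHOUT any treatment. These are NOT medical estimates.
RECOVERY_PROB = {"mild": 0.98, "severe": 0.60}


def simulate_recovery_rate(severities, rng):
    """Simulate who recovers on their own and return the recovery rate."""
    recovered = sum(rng.random() < RECOVERY_PROB[s] for s in severities)
    return recovered / len(severities)


rng = random.Random(0)

# Hypothetical patient population: 80% mild cases, 20% severe cases.
population = ["mild"] * 800 + ["severe"] * 200

random_sample = rng.sample(population, 100)
hand_picked = [s for s in population if s == "mild"][:100]

random_rate = simulate_recovery_rate(random_sample, rng)
hand_picked_rate = simulate_recovery_rate(hand_picked, rng)

print(f"Recovery rate, random sample:      {random_rate:.0%}")
print(f"Recovery rate, mild cases only:    {hand_picked_rate:.0%}")
```

Under these made-up assumptions, a sample of hand-picked mild cases shows a near-perfect recovery rate even though nobody was treated. A high cure rate in such a sample tells you about the sample, not about the medicine.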

When you see charts and statistics, remember to evaluate them carefully. Make sure the statistics were sampled correctly and are being used in an ethical, honest way.

5. Don’t Give in to Fallacies

Photo by Jonathan Petersson on Unsplash

During the summer of 1913 in a casino in Monaco, gamblers watched in amazement as the roulette wheel landed on black an astonishing 26 times in a row. Since red and black are almost equally likely on any spin (each slightly under one half, because of the green zero), they were confident that red was “due.” It was a field day for the casino and a perfect example of the gambler’s fallacy, a.k.a. the Monte Carlo fallacy.
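
A quick simulation shows why red is never "due." This is a sketch that assumes a single-zero (European) wheel with 18 black, 18 red, and 1 green pocket; the pocket numbering is simplified and the function names are my own.

```python
import random

P_BLACK = 18 / 37  # assumed single-zero wheel: 18 black, 18 red, 1 green


def spin(rng):
    """One spin of a simplified wheel: pocket 0 is green, 1-18 black, 19-36 red."""
    pocket = rng.randrange(37)
    if pocket == 0:
        return "green"
    return "black" if pocket <= 18 else "red"


def red_share_after_black_streak(n_spins, streak_length, rng):
    """Share of reds among spins that directly follow a run of at least
    `streak_length` consecutive blacks."""
    streak = 0
    followers = 0
    reds = 0
    for _ in range(n_spins):
        outcome = spin(rng)
        if streak >= streak_length:
            followers += 1
            reds += outcome == "red"
        streak = streak + 1 if outcome == "black" else 0
    return reds / followers if followers else float("nan")


rng = random.Random(42)
share = red_share_after_black_streak(500_000, 5, rng)

print(f"Share of red right after 5+ blacks: {share:.3f}")
print(f"Share of red on any spin:           {18 / 37:.3f}")
print(f"Chance of 26 blacks in a row:       {P_BLACK ** 26:.2e}")
```

The wheel has no memory, so the share of red after a black streak stays close to the share of red on any spin. The streak itself is extremely unlikely in advance, but once it has happened it tells you nothing about the next spin.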

This happens in everyday life outside of casinos too. People tend to avoid long strings of the same answer. Sometimes they do so while sacrificing accuracy of judgment for the sake of getting a pattern of decisions that look fairer or more probable. For example, an admissions office may reject the next application they see if they have approved three applications in a row, even if the application should have been accepted on merit.

The world works on probabilities. There are seven billion of us, each doing something every second of our lives. Because of that sheer volume, rare events are bound to happen. But we shouldn’t put our money on them.

Think also of the spurious correlations we see regularly. One such graph appears to show that organic food sales cause autism. Or is it the other way around? Just because two variables move in tandem doesn’t mean that one causes the other. Correlation does not imply causation, and as data scientists, it is our job to be on the lookout for such fallacies, biases, and spurious correlations. We can’t allow oversimplified conclusions to cloud our work.
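
Two series that merely trend upward over the same period will look strongly correlated even when they have nothing to do with each other. The sketch below generates two independent, made-up series and compares the correlation of their levels with the correlation of their period-to-period changes. The helper functions and all parameters are assumptions for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def differences(series):
    """Period-to-period changes, a simple way to remove a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(7)
n = 200

# Two independent series that share nothing except an upward trend.
series_a = [1.0 * t + rng.gauss(0, 5) for t in range(n)]
series_b = [3.0 * t + rng.gauss(0, 15) for t in range(n)]

level_corr = pearson(series_a, series_b)
diff_corr = pearson(differences(series_a), differences(series_b))

print(f"Correlation of levels:  {level_corr:.2f}")
print(f"Correlation of changes: {diff_corr:.2f}")
```

The levels correlate strongly only because both series grow over time. Once the shared trend is removed, little of the relationship remains, which is a useful first check before reading anything causal into a chart.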

Data scientists have a big role to play in any organization. A good data scientist must be both technical and business-driven to do the job well. Thus, we need to make a conscious effort to understand the business’s needs while also polishing our technical skills.

Continue Learning

If you want to learn more about how to apply data science in a business context, I would like to call out the AI for Everyone course by Andrew Ng, which focuses on spotting opportunities to apply AI to problems in your own organization, working with an AI team, and building an AI strategy in your company.

Thanks for the read. I am going to be writing more beginner-friendly posts in the future too. Follow me on Medium or subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.

This post was first published here.

Translated from: https://towardsdatascience.com/5-essential-business-oriented-critical-thinking-skills-for-data-science-ac25fa69aafc
