Data Science and AI: From Promises to Frustration

I originally wrote this article for publication on Dataiku’s blog.

Data Science, From Enthusiastic Commitment to Frustrating Results

A growing interest in data science and artificial intelligence (AI) has been observed over the past decade, both in professional and private spheres. A variety of uses have appeared, such as selecting sitcom scenarios to gain viewers, writing articles for the press automatically, developing stronger-than-human AI for particularly difficult games, or creating smart chatbots to assist customers. Furthermore, numerous influential actors in the technology space demonstrate strong enthusiasm towards AI:

“AI is going to be the next big shift in technology.” Satya Nadella, Microsoft CEO

“It is a renaissance, it is a golden age.” Jeff Bezos, Amazon CEO

“AI […] is more profound than […] fire or electricity.” Sundar Pichai, Google CEO

Beyond these promises, significant investments are being made in data science and AI. Both retrospective and prospective figures are massive: according to one market study, the market was worth $20.7 billion in 2018 and is expected to grow at a compound rate of 33% through 2026. The 2019 NewVantage Partners report “Data and Innovation: Leveraging Big Data and AI to Accelerate Business Transformation” showed that the majority of the 65 companies surveyed are responding proactively and positively to data science and AI (see Figure 1 below, left panel).
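
To make the growth figure above concrete, here is a minimal sketch of the compound growth arithmetic. It assumes that the 33% figure is an annual rate applied every year from 2018 to 2026, which the text does not state explicitly; the function name `project_market_size` is hypothetical and used only for illustration.

```python
def project_market_size(base_value: float, annual_rate: float, years: int) -> float:
    """Compound a starting value at a fixed annual growth rate."""
    return base_value * (1.0 + annual_rate) ** years


# Assumption: $20.7 billion in 2018, compounding at 33% per year
# for the 8 years between 2018 and 2026.
projected_2026 = project_market_size(20.7, 0.33, 2026 - 2018)
print(f"Projected 2026 market size: ${projected_2026:.1f} billion")
```

Under these assumptions, the market would grow close to tenfold over the period, which gives a sense of the scale of the expectations placed on the field.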

Furthermore, numerous countries are investing significant resources in AI initiatives: the Digital Europe funding program dedicated 2.5 billion euros to AI research, the DARPA funding campaign raised two billion dollars to fund the next wave of AI technologies, and China declared its goal to lead the world in AI by 2030. Altogether, these observations suggest that most stakeholders are aware of the importance of data science and AI, and are committed to developing them in their activities.

Figure 1. Major investments in data science, but with unsatisfactory performance due to cultural limitations. From “Data and Innovation: Leveraging Big Data and AI to Accelerate Business Transformation”, NewVantage Partners, 2019.

Despite these dynamic trends of investment and constructive attitudes from companies, there are also some alarming — sometimes even shocking — observations regarding the deployment of data science and AI. Indeed, the rate of data science projects which do not succeed may be as high as 85% or 87%. This low performance does not only affect young companies — it can also be observed within data-mature businesses.

The study “Artificial Intelligence Global Adoption Trends & Strategies”, conducted by IDC in 2019 among 2,473 organizations using AI in their operations, reported that one-fourth of them have an AI project failure rate higher than 50%, while another one-fourth consider that they have developed an AI strategy at scale.

The report of NewVantage Partners mentioned above has also studied the nature of the obstacles met by companies while deploying data science and AI programs: the main limitations appear to be cultural rather than technological (see Figure 1 above, right panel). In this sense, this study reports that 77.1% of consulted companies have identified AI adoption by business experts as a major challenge.

Figure 2. A) Hype cycles as described by Gartner. B) The history of AI alternates between hype periods and AI winters. Adapted from Yasnitsky L.N. (2020) Whether Be New “Winter” of Artificial Intelligence?. In: Antipova T. (eds) Integrated Science in Digital Age. ICIS 2019. Lecture Notes in Networks and Systems, vol 78. Springer, Cham.

To sum up, many have high expectations for data science and AI, given the media coverage and the level of investment. However, in practice, its performance is low and restrained by cultural obstacles. This contradictory gap between expectations and the “reality” is not rare in the technology sector, and can lead to a “trough of disillusionment” as described by Gartner in “hype cycles” (see Figure 2 above, panel A). In fact, AI has already fluctuated between such highs and lows three times. In the history of AI, hype periods of intense investment and development were followed by an “AI winter”, with no activity in the field (see Figure 2 above, panel B).

The next question we should ask ourselves, then, is: are we about to live through a fourth “AI winter” with a “data science bubble burst”, or will the field continue to flourish? The answer will depend on us, and we firmly believe that adoption will be a key ingredient in the success or failure of AI in the coming decade. This article describes the different cultural obstacles to successful data science and AI and, in light of this analysis, presents a systemic and organizational lever to address these obstacles and foster adoption at scale: a new position named the “Data Adoption Officer.”

The Cultural Pains of Data Science

In the following sections, we will look at a variety of cultural obstacles met during the deployment at scale of data science and AI. To simplify the complex landscape, the limitations listed below are grouped along four main dimensions: the data scientists, the resources, the hermeticism of data science and AI, and MLOps.

Pain 1: Data Scientist, The S̶e̶x̶i̶e̶s̶t̶ Trickiest Job of the 21st Century

While the data scientist has been advertised as the sexiest job of the 21st century by an article in the Harvard Business Review, it may also be one of the trickiest positions, for the reasons described below: skill shortages, heterogeneous profiles, an ill-defined job description, and significant turnover.

Skill Shortages

In their study Artificial Intelligence Global Adoption Trends & Strategies, the IDC reports that skill shortages are identified by companies as a major cause of AI project failure. Although numerous graduate programs in data science and AI have appeared in the last few years, the need for skillful data scientists is not fulfilled. In August 2018, LinkedIn stated that there was a shortage of 151,000 data scientists in the U.S.

Aside from this quantitative shortage, there is also a qualitative one. Since the data science discipline is fairly young, most senior data scientists come from other fields. Thus, while there are more and more juniors arriving on the labor market, companies struggle to find experienced profiles to drive a dynamic vision and strategy and enable skill transfer.

Heterogeneous Profiles

One consequence of having many career changers among data scientists on the labor market is the strong heterogeneity of profiles, both in terms of educational background (see Figure 3 below, panel A) and previous experience (see Figure 3 below, panel B). Every field has its own culture, made of distinct methodologies and knowledge bases. Similarly, each prior job title comes with its own roadmap, scope, and goals.

Furthermore, a Kaggle poll showed that 66% of the participants learned data science on their own (likely overestimated due to selection bias), which contributes to the diverse nature of this discipline. Without a doubt, this diversity may generate interesting synergies and values. However, it also slows down the composition and acquisition of a common frame of reference. The data science community has a vital need of social cohesion to consolidate its unity.

Figure 3. Taken and adapted from a 2018 Indeed study of data scientist profiles. A) Distribution of data scientists’ backgrounds. B) Distribution of prior job titles for different types of data positions. The Gini index of the prior job title distribution reflects heterogeneity and is highest for Data Scientist (85%), compared with ML Engineer (73%), Data Analyst (78%), Data Engineer (79%), and Software Engineer (53%).
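
The heterogeneity measure mentioned in the caption can be illustrated with a short sketch. It assumes that the study used the Gini impurity (one minus the sum of squared category shares); this is an assumption, since the caption does not give the formula. The job-title lists below are invented purely for illustration and are not taken from the Indeed study.

```python
from collections import Counter
from typing import Iterable


def gini_impurity(labels: Iterable[str]) -> float:
    """Return 1 - sum(p_i^2), where p_i is the share of each distinct label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical prior job titles (illustrative only, not real survey data).
concentrated = ["Software Engineer"] * 8 + ["Intern"] * 2
spread_out = ["Analyst", "Researcher", "Engineer", "Consultant", "Statistician"] * 2

print(gini_impurity(concentrated))
print(gini_impurity(spread_out))
```

A value near 0 means almost everyone shares the same prior title, while values closer to 1 mean prior titles are spread across many categories. This is the sense in which a higher index is read as a more heterogeneous population.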

Data Scientist, an Ill-Defined Position

Data scientists are expected to have numerous talents, from hard sciences to soft skills through business expertise. Altogether, the ideal data scientist endowed with all these abilities rather looks like a “five-legged sheep” (see Figure 4 below). Such a mix of objectives, roles, and functions makes it difficult for data scientists to handle all these aspects at once. And if data scientists are five-legged sheep, then a lead data scientist and a Chief Data Officer (CDO) may perhaps be a chimera. Indeed, NewVantage Partners reported in their 2019 study “Data and Innovation: Leveraging Big Data and AI to Accelerate Business Transformation” that the CDO’s role is changing gradually and is not well described, leaving the CDO rather helpless.

Figure 4. A radar plot of all the expected abilities of the ideal data scientist, who, as a matter of fact, earns the “five-legged sheep” distinction.

Significant Turnover

According to a study published in the Training Industry Quarterly, an employee needs one to two years to be fully productive when starting a new job. Given the numerous challenges that data scientists face in their duties, as presented earlier, the figure for them is probably closer to two years. In 2015, a global poll conducted by KDNuggets among data scientists revealed that half of the participants had stayed in their previous role for less than 18 months. Such turnover in data teams is a productivity killer. However, the participants stated that they were willing to stay longer in their current position.

Since KDNuggets’ survey, however, data scientist tenure seems to have increased: the average tenure at the previous data science job was 2.6 years for data scientists who took part in a survey conducted by Burtch Works and who changed jobs in 2018 (17.6% of the participants). Even though this is a clear improvement since 2015, the average tenure for a data scientist still seems clearly below that of other jobs in the US (4.6 years according to the Bureau of Labor Statistics). Below is a non-exhaustive list of reasons why data scientists change jobs:

  • Mismatch between expectation and reality
  • Company politics
  • Overloaded by data duties, which most of the time are not machine learning-related
  • Working in isolation
  • Lack of macroscopic vision concerning the roadmap
  • Infrastructure and/or data not sufficiently mature (cold start in ML)

Losing an asset in which a company has invested negatively affects a data team’s activity. Thus, retaining data scientists should be a major concern for companies. To help retain talent, companies may want to consider Happiness Management, which is closely related to adoption and to the points listed above, since:

“One is not happy because one succeeds, one succeeds because one is happy.”

Pain 2: Suboptimal Usage of Resources

Resources have different origins, and can be split into three types: tools, data and humans. All of these resources suffer from suboptimal usages as described below.

A Multiplicity of Tools Penalizing Quality

The data lifecycle is not a long, quiet river; rather, it is composed of numerous stages. Each stage can be divided into different scopes, whether functional, technological, or thematic. And for each scope of each stage, numerous solutions exist. As a result, the data and AI tool landscape in 2019 was fairly crowded (see Figure 5 below).

Figure 5. A well-known representation of the data and AI landscape in 2019, displaying all the related tools ordered by functions, technologies and/or themes.

This multiplicity of tools obliges data teams to include a variety of technologies in their pipelines, each with its own complex way of working. It takes great discipline and expertise to consolidate best practices across space and time for each technology and reach utopian “generalized best practices.” The reality is often far from this, which leads to decreased performance and even hazards.

The Data Management Bottleneck

Greg Hanson from Informatica has a great (overstated) sentence which summarizes the following paragraph:

“The hardest part of AI and analytics is not AI, it is data management.”

As a matter of fact, data science and AI do not escape the GIGO constraint: Garbage In, Garbage Out. In other words, the quality of the information coming out cannot be better than the quality of information that went in. The inputs that AI models receive come from a long sequence of actions (see Figure 6 below, panel A), with possible alterations of quality at each step, justifying Greg Hanson’s slogan.

Data quality is indeed a major concern among data scientists. In the 2017 Kaggle survey “The State of Data Science & Machine Learning,” three of the top five obstacles encountered by data scientists are related to data management (see Figure 6 below, panel B). Finally, data management teams can sometimes be overwhelmed by pressing short-term duties, which blocks any attempt to develop a long-term vision or program to widen the data horizon. Since most of a data scientist’s work depends greatly on efficient data management, its inertia can be an important limitation.

Figure 6. A) The hierarchy of needs in data science, from the most elementary (bottom of the triangle) to the most refined (top of the triangle). B) The distribution of the main obstacles met by data scientists among the participants of the 2017 Kaggle survey, The State of Data Science & Machine Learning. Three items from the top five depend on data management (framed in red).

Blindfolded Human Resources

Knowing quantitatively what one knows is not easy and requires rare metacognitive skills. Knowing what a team knows, both through its individuals and as a group, is therefore even more difficult. This is particularly true for data science, which draws on vast knowledge grounded in many fields (statistics, computer science, mathematics, machine learning, etc.) and complemented by the need for business expertise. Skills management may greatly benefit the data science activity of an organization.

Skills management also echoes the observations made under Pain 1 about happiness management, since it may facilitate and optimize the upskilling track of each data scientist. It may also stimulate R&D activities by identifying blind spots to investigate, improve recruitment by matching candidates with existing assets, and identify priority needs in terms of external service providers (consulting, expertise, etc.).

Pain 3: A Hermetic Field Preventing Global Adoption

Since data science and AI are complex subjects, there is a risk of rejection by the general public and by business experts. This hermetic field may generate either fear or disinterest. Sometimes, it is the siloed structure of companies that slows down collaboration. Because of this, it is important that the data community tries to include everyone in its activity, to prevent any widening of inequalities. These ideas are discussed below.

Fear of the Unknown

Change can be frightening, all the more so when it is not intelligible. Because AI is complex, and sometimes suffers from a black-box effect, the related new technologies generate fear, as observed in a survey commissioned by Saegus in which 60% of the participants reported such worries. Beyond the “unknown” factor, many other characteristics may create anxiety in the population. For instance, people dread robots that may replace them at work, much as happened with the mechanization of agriculture (50% of the workforce worked in agriculture in 1870, compared with 2% nowadays). This is not just a delusion: according to one report, 47% of jobs in the U.S. may potentially be replaced (or augmented) by AI.

Another source of fear is related to possibly dysfunctional AI, as publicized by numerous cases of bad press: an autonomous vehicle that killed a pedestrian in Arizona in 2018, the inability of almost all models to predict France’s victory at the 2018 soccer World Cup, smart speakers that switch on loudly in the middle of the night, etc. In addition, even when AI models are not dysfunctional, they might simply escape our control: biased algorithms favoring male candidates in Amazon’s recruitment process, two chatbots collaborating to trade goods that built their own non-human language, or Microsoft’s chatbot Tay, which published violent content on Twitter.

Even if models avoid these downsides, another risk is the misuse of these technologies: autonomous weapons such as Google’s Maven project, social manipulation during elections, mass surveillance and privacy violations, etc. Some of these concerns are legitimate, and many scientists have shared them, as when Stephen Hawking said:

“Success in creating effective AI could be the biggest event in the history of our civilization. Or the worst.”

There is still a lot of progress to be made to protect users and people from these risks since there is a legal vacuum on most of the related topics (see Figure 7 below). It is a duty of the data community to take part in the legal framing of AI. The corresponding protection may facilitate the global adoption of AI.

Figure 7. Counts of countries by level of legal maturity on the different segments of AI.

The Clash of Cultures

It is essential that data teams and business experts collaborate fluidly, which is currently not always the case. A survey by Spiceworks showed that seven in ten IT professionals considered collaboration a major priority. In the worst-case scenario, data scientists are unable to take advice from business experts, while business experts do not take into consideration the data-driven elements that data scientists provide when making a decision (see Figure 8 below, panel A).

Most of the time, there is good will on both sides, but difficulties in collaborating can still emerge from existing silos in the company, which are common. Statistics and consequences are presented in panel B of Figure 8 below: silos are often harmful to developing AI solutions at scale. Breaking these silos is the main way to lift this constraint.

Figure 8. A) Worst-case scenario where the level of dialogue between data scientists and business experts is close to zero. B) Silos remain deeply anchored in the structural organization of companies, which is harmful given the statistics (taken from two surveys) and the consequences.

Data Science and Inequalities

Digital inequalities tend to replicate social inequality in terms of socio-economic status (education, gender, age, place of residence, professional situation, etc.). For instance, in France, the Insee reports in a study that:

  • 15% of the active population has had no connection to the internet in a year.
  • 38% of users lack at least one digital skill (search, communication, software manipulation, problem solving).
  • 25% of the population does not know how to get information online.

With such skills gaps on digital basics, the risk that AI will strengthen inequalities is real. Having two “parallel worlds”, with data-aware people on one side and everyone else on the other, can only be detrimental to data science. AI inequalities would be particularly strong along the following dimensions:

  • Level of education
  • Access to work
  • Access to information
  • Access to technologies
  • Biased AI favoring a ruling minority

In business, such inequality may appear with regards to transgenerational issues, as young people become more fluent in these technologies. Generally speaking, it is critical to advocate for an inclusive data science, serving everyone, with no one left behind.

Pain 4: The Long Path Full of Risks Towards Value Generation Along a Project’s Lifecycle

The integration of the DevOps culture into data science and AI projects definitely deserves to be briefly mentioned here, but the gigantic technical aspect of MLOps will not be explicitly covered. Below, only general and cultural factors constraining data science from blooming and associated with MLOps are presented: the peculiarities of machine learning which render DevOps adaptation to AI not that simple, the youthfulness of the MLOps tools, and the difficult cohabitation of agile methodologies and data science.

Machine Learning and DevOps

As mentioned previously, an AI project is a succession of numerous stages, from ideation to industrialization through proof of concept (POC). It is beneficial to manage projects systematically and automatically throughout their entire lifecycle, and in this respect DevOps may be a source of inspiration. However, the specificities of machine learning complicate the application of DevOps principles to an AI project. The peculiarities of ML with regard to DevOps are presented in the table in Figure 9 below.

Figure 9. Table presenting the specificities of AI and ML that make it difficult to apply DevOps principles per se.

MLOps Tools Still Under Development

Because of the AI and ML specificities presented in the previous paragraph, dedicated methodologies and tools are needed, known as MLOps. This discipline is even younger than data science. As a consequence, the existing MLOps tools, even if really promising, are still under development. Notable examples include ModelDB, Kubeflow, MLflow, Pachyderm, and DVC.

Because of their youthfulness, they would benefit from longer-term testing. From a general point of view, most of these tools were the results of local initiatives (sometimes redundant). Up until now, no methodological standards have been agreed upon in the community. One effect is that there is often a lack of interoperability between these technologies.

Agile Methodology and “Clumsy” Data Science

Agile methodologies are essential in many cases to facilitate collaboration and converge on a solution that satisfies everybody, while improving the productivity of the team. They are used particularly widely in the IT sector. However, some data scientists consider their work incompatible with these methods. Indeed, in addition to the specificities of AI and ML presented previously, it is often really difficult to estimate the duration of some tasks, such as data wrangling, which may undermine the concept of a sprint.

Furthermore, some intermediate results that cannot be anticipated may be central to a Go/No-Go decision, which may abruptly interrupt a project. Whatever the reasons for a non-agile methodology (if any) within a data team, beyond being a possibly suboptimal way of working, it may negatively affect any attempt to collaborate with other departments.

Conclusion

To sum up, while significant investments are being made to develop data science and AI in most business sectors, the success rate of AI projects is unacceptably low. The difficulties companies encounter in deploying AI at scale are mainly cultural and can be described along four dimensions.

The first dimension is related to the data scientist population, which is not easy to handle: skill shortages on the labor market, heterogeneous profiles, overly ambitious job descriptions, and high turnover. The second dimension is associated with the suboptimal usage of resources, including tools, data and talent.

The third dimension corresponds to the siloed nature of data science and AI, which generates fears, cultural clashes and inequality, stifling collective data adoption. Finally, the fourth dimension is represented by MLOps issues: the specificities of AI that complicate the application of DevOps principles, young dedicated tools that are still in development, and, sometimes, a lack of agility in AI projects.

This diagnosis is absolutely necessary before planning any action to overcome cultural obstacles. One may employ a specific, stand-alone solution for each obstacle, but this may not be the most efficient approach: low accountability, unbalanced agenda, lack of coordination, etc. This is why, at Saegus, we have developed a holistic solution: an ingrained organizational lever that addresses every obstacle and fosters adoption at scale. This is a new job named the “Data Adoption Officer”. Our motto concerning data adoption is:

“Technological evolution must be complemented by cultural progress. Adoption is a major issue, so for individual and collective fulfillment and for deployment at scale, teams need an efficient, fruitful, inclusive, and humanistic data science practice.”

Original article: https://medium.com/data-by-saegus/data-science-and-ai-from-promises-to-frustration-cd0cffc60933
