Agile Data Science

TL;DR;

  • I have encountered a lot of resistance in the data science community to agile methodology, and specifically to the Scrum framework.
  • I don't see it this way, and I claim that most disciplines would improve by adopting an agile mindset.
  • We will go through a typical Scrum sprint to highlight the compatibility of the data science process and the agile development process.
  • Finally, we discuss when Scrum is not an appropriate process to follow: for example, if you are a consultant working on many projects at a time, or if your work requires deep concentration on a single, narrow issue (narrow enough that you alone can solve it).

I recently came across a Medium post which claims that Scrum is awful for data science. I have to disagree, and I would like to make a case for Agile Data Science.

The ideas in this post are significantly influenced by the Agile Data Science 2.0 book (which I highly recommend) and by personal experience. I am eager to hear about other experiences, so please share them in the comments.

First, we need to agree on what data science is and how it solves business problems so we can investigate the process of data science and how agile (and specifically Scrum) can improve it.

What is Data Science?

There are countless definitions online. For example, Wikipedia gives the following description:

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data.

In my opinion, it is quite an accurate definition of what data science tries to accomplish. But I would simplify this definition further.

Data Science solves business problems by combining business understanding, data and algorithms.

Compared to the definition in Wikipedia, I would like to stress that data scientists should aim to solve business problems rather than “extract knowledge and insights.”

How Does Data Science Solve Business Problems?

So data science is here to solve business problems. We need to accomplish a few things along the way:

  1. Understand the business problem;
  2. Identify and acquire the available data;
  3. Clean, transform and prepare the data;
  4. Select and fit an appropriate "model" for the given data;
  5. Deploy the model to "production": this is our attempt at solving the given problem;
  6. Monitor performance.

As with everything, there are countless ways to go about implementing those steps, but I will try to persuade you that the agile (incremental and iterative) approach brings the most value to the company and the most joy to data scientists.

Agile Data Science Manifesto

I took this from page 6 in the Agile Data Science 2.0 book, so you are encouraged to read the original, but here it is:

  • Iterate, iterate, iterate: tables, charts, reports, predictions.
  • Ship intermediate output. Even failed experiments have output.
  • Prototype experiments over implementing tasks.
  • Integrate the tyrannical opinion of data in product management.
  • Climb up and down the data-value pyramid as you work.
  • Discover and pursue the critical path to a killer product.
  • Get meta. Describe the process, not just the end state.

Not all of these points are self-explanatory, and I encourage you to go and read what Russell Jurney has to say, but I hope the main idea is clear: we share intermediate output, and we iterate to achieve value.

Given the above preliminaries, let us go over a standard week for a Scrum team, assuming a one-week sprint.

Scrum Team Sprint

Day 1

There are many variations of sprint structure, but I will assume that planning is done on Monday morning. The team decides which user stories from the product backlog will be moved to the sprint backlog. The most pressing issue for our business, as evident from the backlog ranking, is customer fraud: fraudulent transactions are driving our valuable customers off our platform. During the previous backlog refinement session, the team already discussed this task, and the product owner got additional information from the Fraud Investigation team. So during the meeting, the team decides to start with a simple experiment (and is already thinking of interesting iterations further down the road): an initial model based on simple features of the transaction and the participating users. Work is split so that the data scientist can go and have a look at the data the team identified for this problem. The data engineer will set up the pipeline for integrating model output into the DWH (data warehouse) systems, and the full-stack engineer starts to set up a transaction review page and an alert system for the Fraud Investigation team.

Day 2

At the start of Tuesday, the whole team gathers to share progress. The data scientist shows a few graphs which indicate that, even with limited features, we will have a decent model. At the same time, the data engineer is already halfway through setting up the system to score incoming transactions with the new model. The full-stack engineer is also progressing nicely, and after just a few minutes everyone is back at their desks working on the agreed tasks.

Day 3

As on Tuesday, the team starts Wednesday with a standup meeting to share their progress. There is already a simple model built, along with some accuracy and error rate numbers. The data engineer shows the infrastructure for transaction scoring, and the team discusses how the features arrive at the system and what needs to be done for them to be ready for the algorithm. The full-stack engineer shows the admin panel where transaction metadata is displayed, along with the triggering mechanism. Another discussion follows on the threshold value at which the model output triggers a message to a fraud analyst. The team agrees that we need to be able to adjust this value, since different models might have different output distributions and, depending on other variables, we might want to increase or decrease the number of approved transactions.
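The adjustable threshold the team discusses can be illustrated with a minimal sketch. It assumes the model emits a fraud score between 0 and 1 per transaction; the names `ScoredTransaction` and `select_for_review`, and the example scores, are hypothetical and not taken from the original post.

```python
from dataclasses import dataclass


@dataclass
class ScoredTransaction:
    # Hypothetical record: one transaction plus the model's output for it.
    transaction_id: str
    fraud_score: float  # assumed to lie in [0, 1]; higher means more suspicious


def select_for_review(transactions, threshold):
    """Return the transactions whose score meets or exceeds the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [t for t in transactions if t.fraud_score >= threshold]


# Made-up scores, for illustration only.
scored = [
    ScoredTransaction("tx-001", 0.05),
    ScoredTransaction("tx-002", 0.62),
    ScoredTransaction("tx-003", 0.91),
]

# A stricter threshold sends fewer transactions to the fraud analysts.
strict = select_for_review(scored, threshold=0.9)
# A looser threshold sends more of them.
loose = select_for_review(scored, threshold=0.5)

print([t.transaction_id for t in strict])  # ['tx-003']
print([t.transaction_id for t in loose])   # ['tx-002', 'tx-003']
```

Keeping the threshold as a parameter rather than a constant is what lets the team retune it when a new model with a different score distribution is deployed.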

Day 4

On Thursday, the team already has all the pieces and, during the standup, discusses how to integrate them. The team also outlines how best to monitor models in production, so that model performance can be evaluated and degradation detected before it causes any real damage. They agree that a simple dashboard for monitoring accuracy and error rates will suffice for now.
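As a rough sketch of the numbers such a dashboard might display, the function below computes accuracy plus false positive and false negative rates from model decisions and the outcomes later confirmed by analysts. The function name `monitoring_metrics` and the sample data are assumptions made for illustration; the post does not specify which error rates the team tracks.

```python
def monitoring_metrics(predicted, actual):
    """Compute accuracy and error rates from boolean fraud flags.

    predicted: model decisions (True = flagged as fraud)
    actual:    confirmed outcomes (True = really fraud)
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # fraud, flagged
    tn = sum(1 for p, a in pairs if not p and not a)  # legitimate, not flagged
    fp = sum(1 for p, a in pairs if p and not a)      # legitimate, flagged
    fn = sum(1 for p, a in pairs if not p and a)      # fraud, missed

    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else None,
        # Share of legitimate transactions that were wrongly flagged.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        # Share of fraudulent transactions that were missed.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else None,
    }


# Made-up labels, for illustration only.
predicted = [True, False, True, False, False]
actual = [True, False, False, True, False]
metrics = monitoring_metrics(predicted, actual)
print(metrics)
```

For fraud, where positives are rare, accuracy alone can look good even when most fraud is missed, which is one reason to show the two error rates next to it.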

Day 5

Friday is demo day. During standup, the team discusses the last remaining issues with the first iteration of transaction fraud detection. Team members prepare for the meeting with the fraud analysts who will be using this solution.

During the demo, the team shows the fraud analysts what they have built. The team presents performance metrics and what they imply for the analysts' work. All feedback is converted into tasks for future sprints.

Another vital part of the sprint is the retrospective, a meeting where the team discusses three things:

  1. What went well in the sprint;
  2. What could be improved;
  3. What we will commit to improving in the next sprint.

Further down the road

During the next sprint, the team works on the next most important item from the product backlog. It might be feedback from the fraud analysts, or it might be something else that the product owner thinks will improve the overall business the most. Meanwhile, the team closely monitors the performance of the initial version of the solution. It will continue to do so because ML solutions are sensitive to changes in the underlying assumptions that the model made about the data distribution.
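One common way to watch for changes in the data distribution is the Population Stability Index (PSI), sketched below for model scores in the range 0 to 1. This is an illustrative technique chosen for this example, not something the original post prescribes; the function name, bin count and sample values are all assumptions.

```python
import math


def population_stability_index(baseline, recent, n_bins=10, eps=1e-6):
    """PSI between two samples of scores assumed to lie in [0, 1].

    Both samples are bucketed into equal-width bins, and the share of
    values in each bin is compared between the two samples.
    """
    if not baseline or not recent:
        raise ValueError("both samples must be non-empty")

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # A score of exactly 1.0 goes into the last bin.
            idx = min(int(v * n_bins), n_bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected = bin_shares(baseline)
    observed = bin_shares(recent)
    return sum((o - e) * math.log(o / e) for o, e in zip(observed, expected))


# Made-up score samples, for illustration only.
baseline_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40]
shifted_scores = [0.55, 0.60, 0.62, 0.70, 0.75, 0.80, 0.85, 0.90]

psi_same = population_stability_index(baseline_scores, baseline_scores)
psi_shifted = population_stability_index(baseline_scores, shifted_scores)
print(psi_same, psi_shifted)
```

Identical samples give a PSI of zero, and the value grows as the two distributions diverge. What level should trigger an investigation or a retrain is a judgment call for the team and the product owner.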

Discussion

The above is a relatively "clean" exposition of the Scrum process for data science solutions. The real world is rarely like that, but I wanted to convey a few points:

  1. Data science cannot stand on its own. If we are going to have an impact on the real world, data science has to be part of a wider, cross-functional team;
  2. Iteration is critical in data science, and we should expose the artifacts of those iterations to our stakeholders to receive feedback as fast as possible;
  3. Scrum is a framework designed for iterative progress, which makes it a very good fit for data science work.

However, Scrum is not a framework for every endeavor. If your job requires you to think deeply about one problem for days, then Scrum and agile would probably be disruptive and counterproductive. Also, if your work requires you to handle many small, unrelated data science tasks, following Scrum would be inappropriate, and Kanban might be worth considering instead. Typical product data science work is not like that, though. Iteration is king, and getting feedback fast is key to providing the right solutions to business problems.

In summary

Data science is a perfect fit for Scrum with a single modification: we do not expect to ship finished models. Instead, we ship artifacts of our work and solicit feedback from our stakeholders so we can make progress faster. Project managers might not like data science for the unpredictability of its progress, but iteration is not at fault; it is the only way forward.

I would like to know what you think about agile data science. What has worked for you and your team? What didn't work? I hope you will leave a comment!

Translated from: https://towardsdatascience.com/agile-data-science-data-science-can-and-should-be-agile-c719a511b868
