Data Scientists Should Know Software Engineering Best Practices


Opinion

Introduction

I have been eagerly researching, speaking to friends and testing new ideas that will help make me a more indispensable Data Scientist. Of course, there is no way I am going to try to progress in my career without sharing what I learn with the people who have helped me get this far (in case you don’t know, that’s you guys and gals!).

Following a recent poll I carried out on my LinkedIn profile, I was surprised to see how many people thought Data Scientists must know programming standards and follow engineering best practices.

Statisticians are often disappointed by the lack of fundamental statistics knowledge among many Data Scientists (myself included). Mathematicians believe that a solid understanding of the underlying principles must come before application, an understanding that, in various scenarios, I admittedly do not have. Software Engineers expect Data Scientists to carry out their experiments while following basic programming principles.

What stung me the most is that every “yes” voter was working as a Data Scientist at the time of the poll, many of them in leading roles, including the likes of 4x Kaggle Grandmaster Abhishek Thakur. OK, I admit that the role you want determines how deep an understanding of Statistics and other Math concepts such as Probability, Linear Algebra and Calculus is required (although the basics are absolutely essential). But Software Engineering practices?

I was once among the Data Scientists who believe we are solely Data Scientists, not Software Engineers, and hence that our only responsibility is to extract valuable insights from data. That responsibility still stands; however, this poll disrupted my mental model and sent me down a deep train of thought…

Why must I know the fundamentals of Software Engineering when my job title is Data Scientist?

I remembered the goal: to become an indispensable Data Scientist. Am I saying that if I don’t know or learn the fundamentals of Software Engineering I am not indispensable? Mmm, yeah, basically. Note, however, that this statement makes an assumption: that you are a Data Scientist writing code that will most likely make it to production.

On that note, I’ve curated a list of fundamental software engineering principles that should apply to Data Scientists. Not having a Software Engineering background myself, I consulted many friends who are Software Engineers to help me make the list, as well as to teach me how to write better production code.

Here are some of the best practices Data Scientists should know:

Clean Code

Note: I want to start off by apologizing to R users. I have not done much research into coding in R, so many of the clean code tips are aimed mainly at Python users.

The first programming language I learnt was Python because I am a fluent English speaker and, to me, Python significantly resembled the English language. Technically, this refers to the high readability of the Python programming language, which was a deliberate choice by the designers of Python, following the realization that code is read more often than it is written.

When a veteran Python developer (a Pythonista) calls portions of code not “Pythonic”, they usually mean that these lines of code do not follow the common guidelines and fail to express its intent in what is considered the best (hear: most readable) way. — The Hitchhikers Guide to Python


I am going to list a few factors that constitute clean code, but I do not plan to go into too much detail, since there are many great resources out there that cover these topics better than I ever could, such as PEP 8 and Clean Code in Python:

Meaningful and pronounceable naming conventions
Clarity beats consistency
Searchable names
Make your code easy to read!

Remember, not only will others read your code, but you will too. If you can’t remember what something means, imagine what hope someone else will have.
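To make the naming points above a little more concrete, here is a minimal sketch in Python. Everything in it (the function names, the session tuples, the constant) is hypothetical and invented purely for illustration. Both functions return the same result, but only the second tells the reader what is going on.

```python
# Hypothetical example: the data layout and all names are made up for illustration.

# Hard to read: cryptic names and an unexplained "magic" number.
def calc(d, t):
    return [x for x in d if x[1] > t * 86400]


# Easier to read: pronounceable, searchable names and a named constant.
SECONDS_PER_DAY = 86_400


def sessions_longer_than(sessions, min_days):
    """Return the (user_id, duration_in_seconds) pairs lasting more than min_days."""
    min_seconds = min_days * SECONDS_PER_DAY
    return [
        (user_id, duration_in_seconds)
        for user_id, duration_in_seconds in sessions
        if duration_in_seconds > min_seconds
    ]


sessions = [("alice", 100_000), ("bob", 50_000)]
print(sessions_longer_than(sessions, min_days=1))
```

Searching a codebase for SECONDS_PER_DAY finds every place the constant is used, whereas searching for 86400 relies on everyone having typed the number the same way.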

Modularity


This one can be partially blamed on the way we learn Data Science. I would be surprised if a Data Scientist could not spin up a Jupyter Notebook and begin exploring. But that is all Jupyter Notebooks are for: EXPERIMENTS! Unfortunately, many Data Science courses do not do a good job of moving us from a Jupyter Notebook to scripts, which are much more effective for production environments.

When we talk of modular code, we mean code that is separated into independent modules. Executed effectively, modularization makes code easier to package, test and maintain, and allows it to be reused.
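As a rough sketch of what that can look like, here is a made-up notebook-style analysis split into small, independent functions. The CSV layout and every function name are assumptions for the sake of the example, and in a real project each step might live in its own file.

```python
# Hypothetical pipeline: the data layout and function names are illustrative only.
import csv
import io
import statistics


def load_rows(csv_text):
    """Parse CSV text into a list of dictionaries (one per row)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def clean_rows(rows):
    """Drop rows with an empty price and convert the remaining prices to float."""
    return [
        {**row, "price": float(row["price"])}
        for row in rows
        if row["price"].strip()
    ]


def summarise(rows):
    """Compute simple summary statistics for the price column."""
    prices = [row["price"] for row in rows]
    return {"count": len(prices), "mean_price": statistics.mean(prices)}


def run_pipeline(csv_text):
    """Glue the steps together; each step can be tested or swapped on its own."""
    return summarise(clean_rows(load_rows(csv_text)))


if __name__ == "__main__":
    raw = "item,price\napple,1.0\npear,\nplum,3.0\n"
    print(run_pipeline(raw))
```

Because each function takes plain inputs and returns plain outputs, any of them can be imported into a notebook for exploration or into a script for production without copying cells around.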

In the video linked below, Abhishek Thakur builds a Machine Learning package for a Kaggle competition; it was my first exposure to modularity. In the past, I’ve also heard Abhishek mention that he learned more about modularity and software engineering best practices as a whole by reading through the scikit-learn code on GitHub.

Demo link

Some other things that contribute to writing good modularized code are:


Don’t Repeat Yourself (DRY): Don’t repeat yourself (DRY, or sometimes do not repeat yourself) is a principle of software development aimed at reducing repetition of software patterns, replacing it with abstractions or using data normalization to avoid redundancy. (Source: Wikipedia)

Single Responsibility Principle (SRP): The single-responsibility principle (SRP) is a computer-programming principle that states that every module, class or function in a computer program should have responsibility over a single part of that program’s functionality, which it should encapsulate. (Source: Wikipedia)

Open-Closed Principle: In object-oriented programming, the open–closed principle states “software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification”; that is, such an entity can allow its behaviour to be extended without modifying its source code. (Source: Wikipedia)
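Here is one small way the three principles above might show up together in everyday data work. This is only a sketch under my own assumptions: the missing-value filling task, the FILL_STRATEGIES registry and all function names are hypothetical and not taken from any library.

```python
# Hypothetical example: a tiny missing-value filler built from stdlib pieces.
import statistics

# Open-closed: new strategies are added to this registry
# without editing fill_missing() itself.
FILL_STRATEGIES = {
    "mean": statistics.mean,
    "median": statistics.median,
}


def fill_missing(values, strategy="mean"):
    """Single responsibility: replace None entries in one column with a fill value."""
    observed = [v for v in values if v is not None]
    fill_value = FILL_STRATEGIES[strategy](observed)
    return [fill_value if v is None else v for v in values]


def fill_missing_columns(table, strategy="mean"):
    """DRY: every column reuses fill_missing instead of a copy-pasted loop."""
    return {name: fill_missing(column, strategy) for name, column in table.items()}


# Extending behaviour without modifying the functions above.
FILL_STRATEGIES["zero"] = lambda observed: 0

table = {"age": [20, None, 40], "income": [None, 10, 30]}
print(fill_missing_columns(table))
print(fill_missing_columns(table, strategy="zero"))
```

A registry dictionary is just one lightweight way to get the "open for extension" behaviour; subclassing or passing in a callable would work too.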

Refactoring

Code refactoring may be defined as the process of restructuring existing code without changing the external behaviour of the code at runtime.


Refactoring is intended to improve the design, structure, and/or implementation of the software (its non-functional attributes), while preserving its functionality. — Wikipedia


There are many advantages to refactoring our code: for example, improved readability and reduced complexity, which in turn make our source code much easier to maintain and give us an internal architecture that improves the extensibility of the code we write.

Furthermore, we cannot talk about code refactoring without talking about improving performance. The goal is to write a program that runs faster and uses less memory, especially if we have an end user who will be executing some task.
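As a minimal sketch of the idea, here is a hypothetical min-max scaling function before and after refactoring. Both function names are made up for this example. The refactored version keeps the same external behaviour but is easier to read, and it computes the minimum and maximum once instead of on every loop iteration.

```python
# Hypothetical before/after pair; the names are illustrative only.

# Before: works, but rescans the whole list on every iteration.
def scale_before(values):
    result = []
    for i in range(len(values)):
        lo = min(values)
        hi = max(values)
        if hi - lo == 0:
            result.append(0.0)
        else:
            result.append((values[i] - lo) / (hi - lo))
    return result


# After: same external behaviour, clearer structure, min/max computed once.
def min_max_scale(values):
    """Scale values to the range [0, 1]; constant input maps to all zeros."""
    if not values:
        return []
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    if value_range == 0:
        return [0.0 for _ in values]
    return [(value - lowest) / value_range for value in values]


# Refactoring must not change results, so compare old and new on sample inputs.
for sample in ([], [5, 5], [1, 2, 3, 5]):
    assert min_max_scale(sample) == scale_before(sample)
```

Keeping the old function around just long enough to compare outputs is a cheap way to gain confidence that the behaviour really is unchanged.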

For more on refactoring in Python, see the link below!


Testing

Note: I learnt briefly about testing (and the majority of the other ideas covered in this post) in the Deployment of Machine Learning Models Udemy course.

Data Science is a funny field in the sense that our code may still run even though it contains errors, whereas in software-related projects the code will typically throw an error. Consequently, we can end up with misleading insights (and possibly no job). Hence, tests are imperative, and if you know how to write them, your price goes up.

Here are some reasons why we run tests:

Ensure we get the correct outputs
Easier updates to code
Prevent pushing broken code to production
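To illustrate the "runs fine but gives the wrong answer" problem, here is a small hypothetical sketch. The rating data and both functions are invented for this example. The tests are written as plain functions with assert statements, a style that a test runner such as pytest can collect, although here they are simply called directly.

```python
# Hypothetical example: averaging ratings where None marks a missing value.

def mean_rating(ratings):
    """Average the ratings, skipping entries recorded as None (missing)."""
    observed = [r for r in ratings if r is not None]
    if not observed:
        raise ValueError("no ratings to average")
    return sum(observed) / len(observed)


def buggy_mean_rating(ratings):
    """Runs without error, but silently counts missing entries as zero."""
    return sum(r or 0 for r in ratings) / len(ratings)


def test_missing_values_are_skipped():
    assert mean_rating([4, None, 2]) == 3.0


def test_empty_input_is_rejected():
    try:
        mean_rating([None, None])
    except ValueError:
        return
    raise AssertionError("expected ValueError")


if __name__ == "__main__":
    test_missing_values_are_skipped()
    test_empty_input_is_rejected()
    # The buggy version raises no error; only a test would reveal the wrong answer.
    print(buggy_mean_rating([4, None, 2]))
```

The buggy version would happily feed a skewed average into a report or a model, which is exactly the kind of silent error that a test on known inputs catches.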

I’m sure there are more reasons, but for now I will stop here. Check out the link below for more on testing.


Code Review

Code reviews are done to improve code quality by promoting the best programming practices, so that code is ready for production. Additionally, they benefit everyone, since they tend to have positive effects on team and company culture.

The main reason for a code review is to catch errors, though reviews are also extremely useful for improving readability and ensuring that coding standards are met.

A really great article that goes more into depth is linked below…


Wrap Up

It’s fair to say that this is a whole load of things to learn, but for that exact reason it increases the overall value of a Data Science practitioner. Being able to whip up a Jupyter Notebook is no longer enough to make you stand out as a Data Scientist, because everyone can do it. If you want to be above average, you have to do above-average things, and in this instance that may involve learning software engineering best practices.

Let’s continue the conversation on LinkedIn…


Translated from: https://towardsdatascience.com/data-scientist-should-know-software-engineering-best-practices-f964ec44cada
