Data Science or Computer Science

What is data science?

Well, if you have just woken up from a 10-year coma and have no idea what data science is, don’t worry, there’s still time. Many years ago, statisticians had some pretty good ideas for analysing data and getting insights from it, but they lacked the computational power to put them into practice, so their hands were tied. Then one day computers caught up with those guys and made all their dreams come true. All of a sudden, we not only had more data available than ever before, but we also had powerful machines to perform heavy calculations on this data, allowing statisticians to try out all these new algorithms. Data science is the hip daughter born from this marriage between statistics and computer science. In other words, it is the science of extracting useful patterns from data sets using computing power.

What is it used for?

One of the reasons data science is so popular nowadays is the number of possible applications that are emerging.

Marketing and sales

A typical use case for data science in marketing is product recommendation. When you check out a product on Amazon and they tell you there’s another product you might like, there is an algorithm behind that recommendation. It predicts that you will like the suggested product based on what other customers who viewed the same product actually bought.

Finance

The most common way that banks use data science methods is for credit risk analysis. Back in the day, when someone asked for a loan, the banker usually took a good look at their financial record to decide whether to grant it. Nowadays, there are sophisticated statistical models that are constantly updated and give a good estimate of the probability of default, making the whole process a lot faster and more reliable.
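
As a rough illustration of what "estimating a probability of default" can look like, the sketch below fits a tiny logistic regression by gradient descent. Everything in it is an assumption made for the example: the `loans` data is invented, the single feature (a debt-to-income ratio) is a stand-in for a full financial record, and real credit models are much more elaborate.

```python
import math

# Hypothetical past loans: (debt-to-income ratio, defaulted? 1 = yes, 0 = no).
loans = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1),
         (0.5, 0), (0.6, 1), (0.7, 1), (0.8, 1)]


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(data, learning_rate=0.5, epochs=5000):
    """Fit P(default) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradients of the average log-loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit_logistic(loans)
low_risk = sigmoid(w * 0.15 + b)   # applicant with a low ratio
high_risk = sigmoid(w * 0.75 + b)  # applicant with a high ratio
print(f"P(default) at ratio 0.15: {low_risk:.2f}")
print(f"P(default) at ratio 0.75: {high_risk:.2f}")
```

The model turns each applicant's numbers into a probability, which the bank can then compare against whatever risk threshold it is comfortable with.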

Healthcare

Healthcare is one of the most promising industries when it comes to data science. Connected wearables such as smartwatches generate a lot of data, including calories burned, miles walked and heart rate. One possible application is tracking variables that can help explain some diseases, and even reminding you to go see a doctor if you show a pattern that might indicate a health issue.

What questions does it answer?

We can split data science tasks into two main groups: supervised and unsupervised learning.

Supervised learning

Supervised learning comprises all tasks for which we have a target variable, that is, some feature in our data that we already know we want to predict. Examples include explaining house prices based on their characteristics (such as the number of rooms and floors) or predicting the likelihood that a customer will stop using our services.
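
Here is a small sketch of the house price example, assuming a single feature (number of rooms) and a straight-line relationship. The `houses` data is invented and deliberately tidy, and `fit_line` is just an illustrative name for ordinary least squares with one feature.

```python
# Hypothetical training data: (number of rooms, sale price in thousands).
# The price is the target variable we want to predict.
houses = [(2, 150), (3, 200), (4, 250), (5, 300)]


def fit_line(points):
    """Least-squares fit of price = slope * rooms + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sums of cross-deviations and squared deviations (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(houses)
predicted_price = slope * 6 + intercept  # a six-room house the model never saw
print(predicted_price)
```

The key point is that the known prices guide the fit: the model learns from examples where the answer is already available.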

Unsupervised learning

These are the tasks we turn to when we are not sure which question we are asking. A typical case is clustering, where we just want to find patterns in the data that are not necessarily related to one specific variable (customer segmentation, for instance).
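
For a feel of what clustering does, the sketch below runs a bare-bones k-means on a single made-up feature (annual spend per customer). The `spend` values, the choice of two segments and the starting centers are all assumptions for the example; nothing in the data says which customer belongs to which segment.

```python
# Hypothetical annual spend per customer; no target variable is given.
spend = [12, 15, 14, 10, 95, 102, 99, 110]


def k_means_1d(values, centers, iterations=10):
    """Plain k-means on one feature, starting from the given centers."""
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


centers, clusters = k_means_1d(spend, centers=[min(spend), max(spend)])
print(centers)
print(clusters)
```

The algorithm finds a low-spend and a high-spend group on its own, which is the kind of pattern a customer segmentation is after.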

Who does it?

Besides the knowledge required in statistics and computer science, data science also calls for business awareness: no matter how good your algorithms are, they will be useless if they do not apply to the business domain at hand. People who work with data usually fall into three categories, depending on which of those three areas of expertise they focus on most:

Data analyst

Sometimes also called a business analyst, this person knows how to talk to people who don’t work directly with data. They are usually in charge of translating business needs into data requirements (and data insights into business recommendations). They have an overall understanding of the main data science algorithms, and usually have really good data visualization skills.

Data engineer

This is the person who makes sure that the data is collected from all its sources and integrated almost seamlessly into the company’s tech environment, and that all the algorithms developed run well and fast. They almost always come from a tech background, and sometimes have to create dedicated tools to display the data processes, especially if these are to be shared with other stakeholders in the company.

Data scientist

As you can guess from the name, this person has a deeper understanding of the way most algorithms operate, and of which ones are best for each situation. They probably know more about statistics than the data analyst and the data engineer, but less about the ins and outs of the business or of industrialising the process. Some companies prefer to hire PhDs for this position, but that is not always the case.

Where is it going?

In the next few years, we will see much progress in many different domains. By using data, cities will be able to better manage their traffic, their energy consumption and even the allocation of their police units. With wearables, we’ll be able to exercise, eat and sleep better. And there might be many other possibilities that we haven’t even thought of.

However, we will also find out that not everything can be improved with data, and we will soon learn where this limit lies. There will always be an important random component in every human activity or natural phenomenon that no machine learning algorithm will ever capture, no matter how sophisticated it is.

This data-driven culture might also cause some important behavioural changes. People are starting to realize how much of their personal lives is being tracked by big companies and the government, and most do not seem to enjoy it. This might lead people to voluntarily downgrade their tech devices, use tools to prevent data collection, and even reduce their overall technology usage. Governments are already aware of these concerns, and regulation is getting stricter all over the world when it comes to people’s privacy. Let’s see in the years to come how this will shape society (the Black Mirror series offers interesting insights into these possibilities).

How to do it?

If you want to learn more about it, I recommend the MIT Press Essential Knowledge series book “Data Science”, by John D. Kelleher and Brendan Tierney. It is a very good introduction to the subject, without getting too technical, to help you see if data science is really for you.

Next in line is “Data Science for Business” by Foster Provost and Tom Fawcett. This one is more focused on business applications and it goes deeper into the details of the algorithms. It will give you a really good grasp of all the possibilities enabled by data-driven decision making.

Then, once you have the basics covered, it’s time to study for real: you will almost certainly need to learn to code (if you don’t know how already). The main languages you should focus on are SQL and either R or Python. The first one is used to query databases and extract the data you need, in the right shape. The other two are used for applying the algorithms and creating plots. R was created with a focus on statistics, whereas Python is a more general programming language. To start with, just choose one of the two to concentrate your efforts on and, if needed, learn the other one later on.
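
To show how the two fit together, here is a minimal sketch that uses SQL to pull data "in the right shape" and Python to analyse it. It relies on Python's built-in `sqlite3` module with an in-memory database; the `orders` table and its contents are hypothetical and only there so the example runs on its own.

```python
import sqlite3
from statistics import mean

# In-memory database with a hypothetical `orders` table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ana", 40.0), ("ben", 15.0), ("ben", 5.0), ("cleo", 90.0)],
)

# SQL shapes the data: one row per customer with their total spend.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

# Python then takes over for the analysis step.
totals = dict(rows)
average_total = mean(totals.values())
print(totals)
print(average_total)
```

In a real project the database would already exist and the analysis would go well beyond an average, but the division of labour stays the same.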

A good way to start practicing your skills is Kaggle.com, where you can play with toy datasets and take part in real competitions. It will help you put your knowledge to the test and also build a portfolio of your own. However, keep in mind that eventually you will need to work with real-life cases, which are a different beast.

Conclusion

Now that you know some of the data science lingo, you are able to go out there and do your own research. The amount of available resources is pretty much endless, and there’s new information coming out every day, so make sure you are always up to date on the new methods and possibilities.

Translated from: https://towardsdatascience.com/data-science-101-99e34bea86c