Data: The Starting Point of a Data Science Journey

What is Data Science?

Data science is a young field, and its definition varies from textbooks to newspapers to white papers. A common definition is that data science combines multiple tools, algorithms, and machine learning principles to discover hidden patterns in data. How is this different from statistics, which has existed and been used for years? The answer lies in the difference between explanation and prediction.

The data science process

The data science process is composed of seven main steps, and each of them matters for the accuracy of the model. Let's look at what each step contains.

Business understanding

Before we start a data science project, we need to understand the problem that we are trying to solve. In this step we have to answer the following questions:

- How many?
- Which category?
- Which group?
- Is this strange?
- Which option should be considered?

Based on the answers to these questions, we can decide which variable or variables should be predicted.

Data Mining

The next step is finding the right data. Data mining is the process of finding and collecting data from different sources. We need to answer the following questions:

- Which data is needed for the project?
- Where can I find that data?
- How can I obtain the data?
- What is the most effective way of storing and accessing the data?

If all the data is in one place, this process will be easy. Usually, this is not the case.

Data Cleaning

This is often the most complicated step, and it can take 50 to 80 percent of the time spent on a project. After the data is collected, we must clean it. The data might contain missing values, or the values within a single column might be inconsistent. That is why we need to clean and organize our data.
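
The two problems named above, missing values and inconsistent values within one column, can be sketched in a few lines of Python. This is only a minimal illustration: the records, the column names `city` and `age`, and the choice to fill gaps with the median are all assumptions made for the example, not something the original article prescribes.

```python
from statistics import median

# Hypothetical raw records; the column names and values are invented for illustration.
raw_rows = [
    {"city": " Skopje", "age": "34"},
    {"city": "skopje", "age": ""},      # missing age
    {"city": "SKOPJE ", "age": "29"},
    {"city": "Ohrid", "age": None},     # missing age
]


def clean(rows):
    """Normalize the text column and fill missing ages with the median age."""
    known_ages = [int(row["age"]) for row in rows if row["age"]]
    fallback_age = median(known_ages)
    cleaned = []
    for row in rows:
        cleaned.append({
            # Strip stray spaces and unify capitalization so equal cities compare equal.
            "city": row["city"].strip().title(),
            # Empty strings and None both count as missing.
            "age": int(row["age"]) if row["age"] else fallback_age,
        })
    return cleaned


cleaned_rows = clean(raw_rows)
```

Filling with the median is one of several reasonable choices; dropping incomplete rows or using a domain-specific default may suit a real project better.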

Data Exploration

After the data is cleaned, we try to find hidden patterns in it. This step includes extracting a subset of the data, then analyzing and visualizing that subset. After this, we have a clearer picture of what lies behind each data point.

Feature Engineering

In machine learning, a feature is an attribute of the phenomenon being observed. For example, if we are observing the results of a student, a possible feature might be the amount of sleep that the student gets. This step is divided into two sub-steps. The first one is feature selection. Here we can remove some features in order to reduce the dimensionality of the data, which would otherwise make the model more complex. The features that we want to remove are usually those that bring more noise than useful information. The second sub-step is feature construction, which means that we build a new feature based on the ones that we already have.
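
As a rough sketch of the two sub-steps, the Python below drops a feature that never varies and then builds a new feature from two existing ones. The student records, the column names, and the zero-variance rule are assumptions chosen for illustration; real projects often use other selection criteria.

```python
from statistics import pvariance

# Hypothetical student records; every column name here is invented for the example.
students = [
    {"sleep_hours": 7.0, "study_hours": 2.0, "school_id": 1},
    {"sleep_hours": 5.5, "study_hours": 4.0, "school_id": 1},
    {"sleep_hours": 8.0, "study_hours": 1.0, "school_id": 1},
]


def select_features(rows, min_variance=0.0):
    """Feature selection: keep only columns whose variance exceeds min_variance."""
    keep = [
        name for name in rows[0]
        if pvariance([row[name] for row in rows]) > min_variance
    ]
    return [{name: row[name] for name in keep} for row in rows]


def add_study_to_sleep_ratio(rows):
    """Feature construction: derive a new column from two existing ones."""
    return [
        {**row, "study_to_sleep": row["study_hours"] / row["sleep_hours"]}
        for row in rows
    ]


selected = select_features(students)            # school_id is constant, so it is dropped
engineered = add_study_to_sleep_ratio(selected)
```

A constant column carries no information for a model, which makes it the simplest possible candidate for removal.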

Predictive modeling

This is the step in which we finally build the model. Here we decide which model to use, based on the answers that we obtained in the first step. This is not an easy decision, and there is not always a single right answer. The model and its accuracy depend on the data: how big it is, what type it is, and what quality it has. After the model is trained, we must evaluate its accuracy and determine whether the model is successful.
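
To make "train, then evaluate the accuracy" concrete, here is a minimal sketch that uses a one-nearest-neighbour rule on a single feature. The model choice, the hours-of-sleep feature, and all the data points are assumptions for illustration only; the article does not name a specific model.

```python
def predict_1nn(train, x):
    """Return the label of the training point whose feature value is closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, test):
    """Share of held-out examples whose predicted label matches the true label."""
    correct = sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(test)


# Hypothetical (hours of sleep, exam outcome) pairs.
train_set = [(4.0, "fail"), (5.0, "fail"), (7.5, "pass"), (8.0, "pass")]
test_set = [(4.5, "fail"), (7.0, "pass"), (6.0, "pass"), (5.5, "fail")]

score = accuracy(train_set, test_set)
```

Evaluating on examples that were not used for training is what makes the accuracy figure a fair estimate of how the model behaves on new data.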

Data Visualization

After we have obtained information from the model, we need to visualize it in different ways so that everyone involved in the project can understand it.

Business understanding

Once everything is done, we return to the first step and check whether the model meets the initial requirements. If we came across new insights during the first iteration of the life cycle (and I am sure that we will), we can feed that knowledge into the next iteration to generate even more powerful insights and get better results for the project.

What is data?

Almost every step needs data: most of the seven steps in the previous part are data related. So we can assume that data plays a crucial role in a data science project. What is data? How is it defined? This might seem like an unimportant definition to look at, but it matters. Whenever we use the word "data," we refer to a collection of information in either an organized or unorganized format.

Basic types of data

Based on the definition in the previous part, there are two types of format:

- Structured (organized) data: data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation.
- Unstructured (unorganized) data: data that is in a free form, usually text or raw audio/signals, and that must be parsed further to become organized.
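
The step from unstructured to structured data can be sketched as parsing free text into rows and columns. The sentences, the field names, and the regular expression below are all invented for this example; real free-form text is rarely this regular.

```python
import re

# Hypothetical free-form notes (unstructured data).
notes = [
    "Ana slept 7.5 hours and scored 88",
    "Marko slept 6 hours and scored 73",
]

# One named group per column of the structured result.
pattern = re.compile(
    r"(?P<name>\w+) slept (?P<sleep_hours>\d+(?:\.\d+)?) hours and scored (?P<score>\d+)"
)


def parse(line):
    """Turn one free-text note into a row (a dict), or None if it does not match."""
    match = pattern.fullmatch(line)
    if match is None:
        return None
    return {
        "name": match["name"],
        "sleep_hours": float(match["sleep_hours"]),
        "score": int(match["score"]),
    }


rows = [parse(line) for line in notes]
```

Each resulting dict is one observation (a row), and each key is one characteristic (a column), which matches the definition of structured data given above.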

When we talk about data, the first thing that we need to answer is whether the data is quantitative or qualitative. When we talk about quantitative data, we usually think about a structured dataset. These two data types can be defined as follows:

- Quantitative data: data that can be described using numbers, and on which basic mathematical operations, including addition, are possible.
- Qualitative data: data that cannot be described using numbers and basic mathematics. This data is generally described using natural categories and language.

Quantitative data

Quantitative data can be:

- Discrete data: data that is counted. It can only take on certain values. Examples of discrete quantitative data include a dice roll, because it can only take on six values, and the number of customers in a coffee shop, because you cannot have a fractional number of people.
- Continuous data: data that is measured. It can take any value within a range, so the set of possible values is infinite.

The four levels of data

It is generally understood that a specific characteristic (feature/column) of structured data can be assigned to one of four levels of data. These levels are the following:

- The nominal level
- The ordinal level
- The interval level
- The ratio level

Let's go deeper into each level and explain each one of them.

The nominal level

This level contains data that is described by name or category, for example gender, name, or species. The data cannot be described using numbers, so it is qualitative, and we cannot perform mathematical operations such as addition or division on it. The operations that we can perform at this level are equality and set membership. Measures of center are also limited. A measure of center is a number that shows what the data tends toward, and it is sometimes called the balance point of the data. The usual measures are the mode, the median, and the mean. The median needs an ordering and the mean needs arithmetic, so neither makes sense at the nominal level; only the mode, the most frequent category, is still available because it relies on counting alone. In conclusion, this level is composed of categorical data, and we should treat it with care, since it might contain very useful insights.
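
The operations available at the nominal level, equality and set membership, plus simple frequency counts, can be sketched with Python's standard library. The species list is invented for illustration. Note that the most frequent category (the mode) needs only counting, not arithmetic on the values themselves.

```python
from collections import Counter

# Hypothetical nominal data: categories with no order and no arithmetic.
species = ["cat", "dog", "cat", "bird", "cat", "dog"]

# Equality: are two observations in the same category?
same_category = species[0] == species[2]

# Set membership: does a category occur in the data at all?
has_fish = "fish" in set(species)

# Counting occurrences gives the most frequent category.
counts = Counter(species)
most_common_species, frequency = counts.most_common(1)[0]
```

Adding or averaging these values would be meaningless, which is why the example never converts the categories to numbers.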

The ordinal level

The nominal level is not very flexible when it comes to mathematical operations. Data at the ordinal level provides a rank order, but we still cannot apply more complex mathematical operations, such as subtraction or addition, and get a meaningful result. For example, grades from 1 to 10 are ordinal data: if we add them, we will not get any useful information. Another example is a survey result. At this level, we have more freedom with mathematical operations than at the nominal level. The operations from the nominal level (equality and set membership) are inherited, and the additional operations that are allowed are ordering and comparison. At the ordinal level, the median is usually an appropriate way of defining the center of the data, but we can use the mode as well. The mean, however, is not meaningful, because addition and division are not allowed at this level.
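
A small sketch of ordering, the median, and the mode on ordinal data follows. The five-point satisfaction scale and the responses are assumptions for illustration. The example uses `median_low` from the `statistics` module so that the result is always one of the observed ranks and no averaging of ranks takes place.

```python
from statistics import median_low, mode

# Hypothetical ordered survey scale, from lowest to highest.
scale = ["very unhappy", "unhappy", "neutral", "happy", "very happy"]
rank = {label: position for position, label in enumerate(scale)}

responses = ["happy", "neutral", "very happy", "happy", "unhappy", "neutral", "happy"]

# Ordering and comparison are allowed at the ordinal level.
ordered = sorted(responses, key=lambda label: rank[label])

# The median is found by position in the ordering, not by arithmetic on the values.
median_label = scale[median_low([rank[label] for label in responses])]

# The mode is inherited from the nominal level.
mode_label = mode(responses)
```

The numeric ranks exist only to sort the labels; the distances between them carry no meaning, so the code never adds or subtracts them.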

The interval level

Now we reach a level where the data can be summarized by the mean and where we can use more complicated mathematical formulas. Data at the interval level supports subtraction between data points. For example, temperature measured in degrees Celsius or Fahrenheit belongs to the interval level. The operations from the lower levels (ordering, comparison, and so on) are inherited, and the additional operations that are allowed are addition and subtraction. For the measure of center, we can use the median, the mode, or the mean, and usually the most accurate description of the center is the arithmetic mean.

Let's look at an example. We are trying to find the measure of center using data that contains temperatures of a fridge in which vaccines are stored. The optimal temperature must be under 29 degrees. After finding the mean and the median, we see that both are close to 31, which is not acceptable for this dataset. This is the point where we need another measure: a measure of variation, the standard deviation. We can use it to see how spread out our data is. To find the standard deviation, we calculate the mean, subtract the mean from each point, square each difference, find the average of the squared differences, and take the square root. For n values x_1, ..., x_n with mean \bar{x}, the formula is:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

If we apply this formula to the temperature example, we obtain the standard deviation of our dataset. Based on this measure, we can see how far the temperature typically strays from the mean; for instance, it might drop to about the mean minus one standard deviation.
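
The standard deviation steps described above (mean, differences from the mean, squared differences, their average, square root) can be followed one by one in Python. The fridge readings below are invented so that the mean and median both come out at 31, as in the example; they are not data from the original article.

```python
from math import sqrt
from statistics import mean, median

# Hypothetical fridge temperature readings, chosen so the mean and median are 31.
fridge_temps = [31, 26, 33, 35, 29, 32, 31]


def population_std(values):
    """Square root of the average squared difference from the mean."""
    center = mean(values)
    squared_diffs = [(value - center) ** 2 for value in values]
    return sqrt(sum(squared_diffs) / len(values))


center = mean(fridge_temps)
middle = median(fridge_temps)
spread = population_std(fridge_temps)

# One standard deviation on either side of the mean.
low, high = center - spread, center + spread
```

With these readings the lower bound falls below the 29 degree limit while the mean does not, which is the kind of information a measure of center alone cannot give.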

The ratio level

The last level is called the ratio level. There are not many differences between the ratio and the interval level, and sometimes it is hard to tell which one applies. At the interval level, we do not have a natural starting point or a natural zero, but at the ratio level we do. The mathematical operations from the lower levels are inherited, and the additional ones are multiplication and division. For example, money in a bank account is classified at this level, because a bank account has a natural zero. As a measure of center, we can use the geometric mean, which is the n-th root of the product of all n values. Data at this level is generally expected to be non-negative, which is why this level is sometimes less preferred.
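
As a brief sketch of what the ratio level adds, the code below divides two values and computes a geometric mean as the n-th root of the product of n values. The account balances are invented for illustration, and the check for positive values is an assumption of this sketch, reflecting that the geometric mean is only defined here for positive data.

```python
from math import prod


def geometric_mean(values):
    """n-th root of the product of n positive values."""
    if any(value <= 0 for value in values):
        raise ValueError("geometric mean needs strictly positive values")
    return prod(values) ** (1 / len(values))


# Hypothetical account balances; zero would mean "no money at all" (a natural zero).
balances = [100.0, 400.0]

# Division is meaningful at the ratio level: one balance is four times the other.
times_larger = balances[1] / balances[0]

center = geometric_mean(balances)
```

Python 3.8 and later also ship `statistics.geometric_mean`, which can replace the hand-written helper.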

Conclusion

Data science can add value to any business; the important thing is to use the data well. Data science can also help us make better decisions based on measurable evidence, so data should always be available to us when making decisions. Using data science methodologies, we can research historical data, make comparisons with the competition, analyze the market, and, most importantly, make recommendations on how a product or service would perform best. These analyses, which are part of data science, provide deep knowledge and understanding of the market as well as its feedback on the product or service. It is estimated that about 2.5 billion gigabytes of data are generated daily. With this increase in the amount of data, finding what is important for the target group can be difficult. Every piece of data that a company collects from customers, whether it is social media likes, website visits, or email surveys, can be analyzed to understand customers more effectively. This means that services and products can be customized for certain groups. For example, finding correlations between age and income can help a company create new promotions or offers that were not previously available to some groups.

If you are interested in this topic, do not hesitate to contact me.


LinkedIn profile: https://www.linkedin.com/in/ceftimoska/


Translated from: https://towardsdatascience.com/data-the-starting-point-of-a-data-science-journey-f7880f9f0eb7
