How to Prepare a Machine Learning Dataset

What is Data?

Data refers to examples drawn from the problem domain that characterize the problem you want to solve; which data you choose depends on your objective. The following commonly used sites host open datasets on many topics, which you can use to build your own machine learning applications.

Kaggle: A well-organized platform where most learners end up spending a lot of time. It hosts a wide variety of datasets, and with the help of kernels (hosted notebooks) you can process the data on the platform without even downloading it.

UCI Machine Learning Repository: Maintains a large and diverse collection of datasets as a service to the machine learning community.

Data.gov: The US government's open data portal, with datasets published by many federal agencies (the Indian government's equivalent is data.gov.in). Data can range from government budgets to school performance scores.

CMU Libraries: High-quality datasets from various domains, an initiative of Carnegie Mellon University.

Google Dataset Search: Lets you find datasets wherever they are hosted, whether on a publisher's site, a digital library, or an author's web page.

After collecting the data, the first task is to transform it to meet the requirements of the individual machine learning algorithms. The most challenging part of any machine learning project is preparing the one thing that is unique to the project: the data used for modelling.

Data preparation is the transformation of raw data into a form that is more suitable for modelling, because "the quality of data is more important than using complicated algorithms". Whatever the modelling problem, turning raw data into something more informative and self-explanatory involves the same types of data preparation tasks.

In this article, we will walk you through how to apply data preparation techniques, using the Car Price Prediction dataset as an example.

The data preparation tasks are:

Data Cleaning
Feature Engineering
Data Transformation
Feature Extraction

Data Cleaning:

This is one of the hardest steps, because most real-world data contains incorrect values: misleading observations, data entry errors, rows that store wrong values, and many more. Cleaning the data well enough to create a reliable dataset requires domain expertise, which helps you identify and investigate abnormalities within attributes.

Needless to say, data cleaning is a time-consuming process, and you will spend an enormous amount of time improving the quality of the data. Common data cleaning operations include:

Detecting the number of null values or missing rows
Identifying duplicate rows within the data and removing them
Using domain knowledge, together with a statistical technique, to detect outliers
Identifying the distribution of each column and eliminating columns that have no variance

These cleaning operations were applied to the "car price prediction" dataset.
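The author's original snippet is not available in this copy of the post. As a minimal sketch of the four operations listed above, assuming a small hypothetical table whose column names (`km_driven`, `owner_flag`, and so on) are illustrative and not taken from the real dataset, the cleaning could look like this with pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the car price data; column names are illustrative.
cars = pd.DataFrame({
    "name": ["swift", "city", "city", "i20", "verna", "baleno"],
    "year": [2014, 2016, 2016, 2018, 2012, 2017],
    "km_driven": [45000, 30000, 30000, np.nan, 80000, 2500000],
    "fuel": ["Petrol", "Diesel", "Diesel", "Petrol", "Diesel", "Petrol"],
    "owner_flag": [1, 1, 1, 1, 1, 1],  # constant column, carries no information
    "price": [3.5, 6.0, 6.0, 5.5, 2.8, 5.9],
})

# 1. Count missing values per column.
missing_counts = cars.isna().sum()

# 2. Drop exact duplicate rows.
cars = cars.drop_duplicates()

# 3. Fill the missing mileage with the median (one of several possible choices).
cars["km_driven"] = cars["km_driven"].fillna(cars["km_driven"].median())

# 4. Keep only rows whose mileage lies within 1.5 * IQR of the quartiles.
q1, q3 = cars["km_driven"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = cars["km_driven"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cars = cars[in_range]

# 5. Drop columns with a single unique value (zero variance).
constant_cols = [c for c in cars.columns if cars[c].nunique() <= 1]
cars = cars.drop(columns=constant_cols).reset_index(drop=True)
```

The 1.5 * IQR rule is only one possible outlier test. Domain knowledge (for example, a plausible upper bound on mileage) should decide whether a flagged row is really an error.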

Feature Engineering:

Feature engineering is the process of creating new input variables from the available data. In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.

Engineering new features is highly specific to your data and data types. It is one of the most valuable things a data scientist can do to improve model performance, because it isolates and highlights key information, which helps the algorithms "focus" on what is important.

Some common feature engineering techniques are:

Binning
Log transformation
Feature split
Combining sparse classes

Feature engineering closely resembles data transformation, but because it requires subject-matter expertise (i.e. an understanding of the domain), it is treated as a separate preparation technique.

Feature engineering steps of this kind were applied to the dataset.
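The steps the author actually ran are not preserved in this copy. The sketch below illustrates the four techniques named earlier (binning, log transformation, feature split, combining sparse classes) on a hypothetical table; the column names, bin edges, and the rarity threshold are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; names and values are illustrative only.
cars = pd.DataFrame({
    "name": ["Maruti Swift", "Honda City", "Hyundai i20",
             "Tata Nano", "Fiat Punto", "Ford Figo"],
    "km_driven": [45000, 30000, 12000, 90000, 60000, 52000],
    "fuel": ["Petrol", "Diesel", "Petrol", "CNG", "LPG", "Diesel"],
})

# Binning: bucket the continuous mileage into ordered categories.
cars["km_bin"] = pd.cut(
    cars["km_driven"],
    bins=[0, 20000, 50000, np.inf],
    labels=["low", "medium", "high"],
)

# Log transformation: compress the long right tail of mileage.
cars["log_km"] = np.log1p(cars["km_driven"])

# Feature split: pull the manufacturer out of the free-text name.
cars["brand"] = cars["name"].str.split(" ").str[0]

# Combining sparse classes: fold fuel types seen fewer than twice into "Other".
fuel_counts = cars["fuel"].value_counts()
rare = fuel_counts[fuel_counts < 2].index
cars["fuel_grouped"] = cars["fuel"].where(~cars["fuel"].isin(rare), "Other")
```

Each new column is added alongside the originals, which matches the "process of addition" view of feature engineering described above.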

Data Transformation:

It is rare that collected data can be used as-is to make predictions. The data you have may not be in the right format, or may require transformations to make it more useful. Data transforms change the type or distribution of data variables.

Attributes in the data store information as numerical or categorical values, but most machine learning algorithms can process only numerical values, not strings or characters. You therefore have to encode each categorical variable numerically by creating dummy variables or performing one-hot encoding.
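As a minimal sketch of both encodings, assuming a hypothetical `fuel` column, pandas' `get_dummies` can produce either form:

```python
import pandas as pd

# Hypothetical rows; the "fuel" column stands in for any categorical attribute.
cars = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "CNG", "Petrol"],
    "km_driven": [45000, 30000, 90000, 12000],
})

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(cars, columns=["fuel"], dtype=int)

# Dummy encoding: drop the first level so the indicators are not redundant,
# which matters for linear models.
dummies = pd.get_dummies(cars, columns=["fuel"], drop_first=True, dtype=int)

print(one_hot.columns.tolist())
print(dummies.columns.tolist())
```

With `drop_first=True` the dropped category (here `CNG`) becomes the baseline, represented by all remaining indicators being zero.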

Once the categorical attributes are represented as numerical ones, your task is to check the scale of every feature and bring them all onto a common scale: floating-point numbers have far more resolution in the range 0 to 1 than across the full range of the data type, and many algorithms are sensitive to features with very different magnitudes. Scaling can be achieved by normalization (rescaling each feature to the range 0 to 1) or by standardization (rescaling each feature to zero mean and unit variance).
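The two scaling options can be sketched with scikit-learn's `MinMaxScaler` and `StandardScaler`. The feature matrix below is hypothetical (a year column and a mileage column) and is only meant to show the effect on columns with very different magnitudes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: [year, km_driven].
X = np.array([
    [2012.0, 80000.0],
    [2014.0, 45000.0],
    [2016.0, 30000.0],
    [2018.0, 5000.0],
])

# Normalization: (x - min) / (max - min), so each column spans [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, so each column has mean 0 and unit variance.
X_std = StandardScaler().fit_transform(X)
```

In a real project the scaler would normally be fitted on the training split only and then applied to the evaluation split, so that no information from the evaluation data leaks into training.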

Data transformation steps of this kind were applied to the data.

Please note that we have split the data into two sets: one for training the algorithm and another for evaluation. For further analysis, we will concentrate on the training data.
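The post does not state the split ratio. As a hedged illustration, assuming an 80/20 split and synthetic placeholder arrays, scikit-learn's `train_test_split` does this in one call:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus a target.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# Hold out 20% of the rows for evaluation; the ratio is an assumption.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```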

Feature Extraction:

Feature extraction is the process of selecting a subset of the existing features, or of reducing the dimensionality of the dataset by applying a dimensionality reduction algorithm. The number of input features in a dataset is referred to as the dimensionality of the data.

The problem is that the more dimensions there are (i.e. the more input variables), the more likely it is that the dataset represents a very sparse and unrepresentative sampling of that space. This is referred to as the curse of dimensionality.

Dimensionality can be reduced in two major ways:

Feature Selection: ranks the importance of the existing features in the dataset and discards the less important ones.
Feature Extraction: creates a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data.

The benefits of feature extraction and selection techniques are:

Reduces the risk of overfitting
Reduces training time
Can improve accuracy
Improves data visualization

A feature extraction technique was applied to the data.
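Which technique the author used is not preserved in this copy. The sketch below shows one example of each approach on synthetic data: univariate feature selection with `SelectKBest` and projection with `PCA`. Both are choices made for illustration, as are `k=2` and `n_components=2`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Synthetic target that depends only on columns 0 and 3.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Feature selection: keep the k features with the strongest univariate
# relationship to the target.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)  # indices of the retained columns

# Feature extraction: project the data onto its top 2 principal components.
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X)
```

Selection keeps original, interpretable columns, whereas the PCA components are linear combinations of all inputs and are chosen without looking at the target.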

After performing feature selection on the training dataset, you can train the ML model. Linear Regression is used here, without any advanced tuning of the algorithm. The resulting model is evaluated using a set of metrics; for a regression problem, performance is measured with the Mean Squared Error and the Mean Absolute Error.
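A minimal sketch of this final step, using synthetic data in place of the prepared car price features (the coefficients, noise level, and split ratio are assumptions for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training split, predict on the held-out split.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE averages squared errors; MAE averages absolute errors.
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"MSE: {mse:.3f}  MAE: {mae:.3f}")
```

MSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a few large misses dominate the error.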

Conclusion:

These are some of the basic steps to consider when solving an ML problem. When building an ML model, the most important and hardest part is cleaning and pre-processing the data. Ironically, applying the algorithm and predicting the output takes just a few lines of code and is the easiest part of the build.

Let us know if you like the blog, leave a comment with any queries or suggestions, and follow us on LinkedIn and Instagram. Your love and support inspire us to share what we learn in a much better way.

Translated from: https://medium.com/swlh/data-preparation-techniques-and-its-importance-in-machine-learning-4a9df5d258c0
