Understanding ML

This article is based on my entry in the DengAI competition on the DrivenData platform. I’ve managed to place in the top 0.2% (14/9069 as of 02 Jun 2020).

In this article, I assume that you’re already familiar with DengAI — EDA. You don’t have to read it to understand everything here, but it would be a lot easier if you do.

Why do we have to preprocess data?

When designing ML models, we have to remember that some of them are trained with gradient-based methods such as gradient descent. The catch is that gradient-based optimization performs better on normalized/scaled data. Let me show an example:

On the left side, we have a dataset that consists of two features, one of which has a much larger scale than the other. In both cases the gradient method works, but it takes far fewer steps to reach the optimum when the features lie on similar scales (right image).
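
The effect described above can be reproduced numerically. The following is a minimal sketch on synthetic data, not the data behind the original figure: the function name, the feature ranges, and the tolerance are all assumptions chosen for illustration. It runs plain gradient descent on a least-squares problem and counts the steps needed with raw and with standardized features.

```python
import numpy as np


def gradient_descent_steps(X, y, tol=1e-6, max_steps=100_000):
    """Run plain gradient descent on mean squared error.

    Returns the number of steps taken until the gradient norm drops
    below `tol`, or `max_steps` if that never happens.
    """
    n = len(y)
    # For a quadratic loss, a step size of 1 / (largest Hessian eigenvalue)
    # keeps the iteration stable.
    hessian = 2.0 / n * X.T @ X
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()
    w = np.zeros(X.shape[1])
    for step in range(1, max_steps + 1):
        grad = 2.0 / n * X.T @ (X @ w - y)
        if np.linalg.norm(grad) < tol:
            return step
        w -= lr * grad
    return max_steps


rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 200)    # feature on a small scale
x2 = rng.uniform(0, 20, 200)   # feature on a much larger scale
y = 3.0 * x1 + 0.1 * x2 + rng.normal(0, 0.01, 200)
y = y - y.mean()               # centered target, so no intercept is needed

X_raw = np.column_stack([x1, x2])
X_raw = X_raw - X_raw.mean(axis=0)     # centered, but scales still differ
X_std = X_raw / X_raw.std(axis=0)      # both features now have unit variance

steps_raw = gradient_descent_steps(X_raw, y)
steps_std = gradient_descent_steps(X_std, y)
print(f"raw features: {steps_raw} steps, standardized features: {steps_std} steps")
```

The exact step counts depend on the random data and the stopping tolerance, but the standardized version should need far fewer steps because the loss surface is no longer stretched along one axis.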

What is normalization and what is scaling?

Normalization

In the standard sense, normalization refers to the process of adjusting the range of a value distribution to fit roughly into <-1, 1> (it doesn’t have to be exactly -1 to 1, just within the same order of magnitude). Standard normalization, also known as standardization or z-score normalization, is done by subtracting the mean from each value in the set and dividing the result by the standard deviation.
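
As a small illustration of that definition, the sketch below computes the normalized values by hand and compares them with sklearn’s StandardScaler. The temperature readings are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up Kelvin readings, shaped as one feature column
temps = np.array([[296.5], [297.8], [298.4], [299.1], [300.2]])

# Subtract the mean, divide by the (population) standard deviation
manual = (temps - temps.mean(axis=0)) / temps.std(axis=0)

# StandardScaler performs the same computation per column
with_sklearn = StandardScaler().fit_transform(temps)

print(manual.ravel())
print(np.allclose(manual, with_sklearn))
```

After the transformation the column has a mean of 0 and a standard deviation of 1, so most values land within a few units of zero.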

Scaling

You may also see it called “min-max normalization”. Scaling is another value adjustment that fits values into a range, but this time the range is <0, 1>.
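
A minimal sketch of min-max scaling, again with made-up values: each value has the column minimum subtracted and is then divided by the column range. The result is compared with sklearn’s MinMaxScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up precipitation amounts in mm, shaped as one feature column
rain = np.array([[0.0], [12.5], [3.2], [48.0], [7.1]])

# (x - min) / (max - min) maps the smallest value to 0 and the largest to 1
manual = (rain - rain.min(axis=0)) / (rain.max(axis=0) - rain.min(axis=0))

with_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(rain)

print(manual.ravel())
print(np.allclose(manual, with_sklearn))
```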

Normalization or Scaling?

There are two types of operations you can perform on a feature: you can either normalize or scale its values. Which one you choose depends on the feature itself. If a feature has both positive and negative values and the sign is meaningful, you should normalize it. If negative values make no sense for a feature, you should scale it.

It’s not always black and white. Let’s consider a feature like temperature. Depending on which scale you choose (Kelvin or Celsius/Fahrenheit), there might be different interpretations of what that temperature means. The Kelvin scale is an absolute thermodynamic temperature scale (it starts at absolute zero and cannot go below that). On the other hand, we have the scales used in everyday life, where negative numbers are meaningful to us. When the temperature drops below 0 degrees Celsius, water freezes. The same goes for the Fahrenheit scale: its 0 degrees marks the freezing point of brine (a concentrated solution of salt in water). The straightforward choice would be to scale Kelvin values and normalize Celsius and Fahrenheit values. That does not always work. We can show it on DengAI’s dataset:

Some of the temperatures are on the Kelvin scale, and some on the Celsius scale. That’s not what is important here. If you look closely you should be able to group those temperatures by type:

temperature with absolute minimum value
temperature without absolute minimum value (can be negative)

An example of the first one is station_diur_temp_rng_c. This is the so-called diurnal temperature variation, which is the difference between the minimum and maximum temperature within some period. That value cannot be negative (because the difference between the minimum and maximum cannot be lower than 0). That’s where we should use scaling instead of normalization.

Another example is reanalysis_air_temp_k. It is the air temperature, and it is an important feature. We cannot define a minimum value that the temperature could reach. If we really wanted to, we could pick an arbitrary minimum temperature for each city that we should never go below, but that’s not what we want to do. In problems like ours, a feature like temperature might carry additional meaning for the model: the temperature value could have both positive and negative effects. In this case, it might be that temperatures below 298K have a favorable effect on the number of cases (fewer mosquitoes). That’s why we should use normalization for this one.
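
As a side check on the earlier remark that the Kelvin/Celsius distinction is not what matters here: both operations give the same result whichever of the two units a column is stored in, because converting between them only shifts the values. A minimal sketch with made-up readings (the helper names are hypothetical):

```python
import numpy as np

temps_k = np.array([297.7, 298.2, 298.8, 299.0, 299.5])  # made-up Kelvin readings
temps_c = temps_k - 273.15                                # the same readings in Celsius


def zscore(values):
    """Normalization: subtract the mean, divide by the standard deviation."""
    return (values - values.mean()) / values.std()


def minmax(values):
    """Scaling: map the minimum to 0 and the maximum to 1."""
    return (values - values.min()) / (values.max() - values.min())


print(np.allclose(zscore(temps_k), zscore(temps_c)))
print(np.allclose(minmax(temps_k), minmax(temps_c)))
```

So the choice between normalizing and scaling comes from what the feature means, not from the unit it happens to be recorded in.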

After checking the entire dataset, we can split our list of features into those to normalize, those to scale, and those to copy unchanged:

Normalized features

'reanalysis_air_temp_k'
'reanalysis_avg_temp_k'
'reanalysis_dew_point_temp_k'
'reanalysis_max_air_temp_k'
'reanalysis_min_air_temp_k'
'station_avg_temp_c'
'station_max_temp_c'
'station_min_temp_c'

Scaled features

'station_diur_temp_rng_c'
'reanalysis_tdtr_k'
'precipitation_amt_mm'
'reanalysis_precip_amt_kg_per_m2'
'reanalysis_relative_humidity_percent'
'reanalysis_sat_precip_amt_mm'
'reanalysis_specific_humidity_g_per_kg'
'station_precip_mm'
'year'
'weekofyear'

Copied features

'ndvi_ne'
'ndvi_nw'
'ndvi_se'
'ndvi_sw'

Why Copy?

If we look at the definition of the NDVI index, we can see there is no reason to scale or normalize those values. NDVI values are already in the <-1, 1> range. Sometimes we might want to copy values directly like that, especially when the original values are within the same order of magnitude as our normalized features. The range might be <0, 2> or <1, 4>, but it shouldn’t cause a problem for the model.

The code

Now we have to write some code to preprocess our data. We’re going to use StandardScaler and MinMaxScaler from the sklearn library.

As an input to our function, we expect to send 3 or 4 variables. When dealing with the training set we’re sending 3 variables:

training dataset (as a pandas DataFrame)
list of columns to normalize
list of columns to scale

When we’re processing training data, we have to define the dataset used for the scaling/normalization process. This dataset is used to compute values like the mean or standard deviation. Because at the point of processing the training dataset we don’t have any external datasets, we use the training dataset itself. We then normalize the selected columns using StandardScaler().

StandardScaler doesn’t require any parameters when initializing, but it requires a dataset to fit on. We could just pass new_data twice and it would work, but then we would need to write separate preprocessing for the test dataset.
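
The original listing is not reproduced here, so the following is only a minimal sketch of the normalization step as described. The names new_data and train_scale come from the text; norm_cols is a hypothetical name for the list of columns to normalize, and the values are made up (only the column names are from the DengAI dataset).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows; only the column names come from the DengAI dataset
train_scale = pd.DataFrame({
    "reanalysis_air_temp_k": [297.7, 298.2, 298.8, 299.0, 299.5],
    "station_avg_temp_c": [25.4, 26.7, 26.7, 27.5, 28.9],
})
new_data = train_scale.copy()  # for the training set both are the same data
norm_cols = ["reanalysis_air_temp_k", "station_avg_temp_c"]  # hypothetical name

# fit() learns the mean and standard deviation of each column from train_scale
scaler = StandardScaler().fit(train_scale[norm_cols])
# transform() applies those learned parameters to new_data
new_data[norm_cols] = scaler.transform(new_data[norm_cols])

print(scaler.mean_)
print(new_data)
```

Keeping fit() and transform() as separate calls is what makes it possible to fit on one dataset and transform another.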

Next, we’re doing the same thing but with MinMaxScaler().

This time we’re passing one parameter called feature_range to be sure that our scale is in range <0,1>. As in the previous example, we're passing the scaling dataset to fit to and transform selected columns.
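
A minimal sketch of this step, with made-up values and a hypothetical scale_cols list. Here the scaling dataset and the transformed data deliberately differ, to show that feature_range describes the data the scaler was fitted on: a value outside the fitted range ends up outside <0, 1>.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values; only the column name comes from the DengAI dataset
train_scale = pd.DataFrame({"station_precip_mm": [0.0, 10.0, 20.0, 40.0]})
new_data = pd.DataFrame({"station_precip_mm": [0.0, 20.0, 50.0]})
scale_cols = ["station_precip_mm"]  # hypothetical name

# fit() learns the minimum (0) and maximum (40) from train_scale
scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_scale[scale_cols])
new_data[scale_cols] = scaler.transform(new_data[scale_cols])

# 0 and 20 map to 0.0 and 0.5; 50 is above the fitted maximum and maps to 1.25
print(new_data)
```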

In the end, we’re returning the transformed new_data and, additionally, train_scale for further preprocessing. But wait a second! What further preprocessing? Remember that we’re dealing not only with the training dataset but also with the test dataset. We have to apply the same data processing to both of them so that the model gets the same kind of input. If we simply used preproc_data() in the same way for the test dataset, we would apply completely different normalization and scaling. The reason is that the parameters for normalization and scaling are computed by the .fit() method, which uses the dataset it is given to calculate the mean and other required values. If you fit on a test dataset that has a different range of values (there was a hot summer because of global warming, etc.), your value of 28C in the test dataset will be normalized with different parameters. Let me show you an example:

Training Dataset:

Testing Dataset:

Normalizing the testing dataset using the mean and SD from the training dataset gives us:

[0.11,0.11,0.91,1.71,0.91,0.11,1.71]

But if you use mean and SD from the test dataset you’ll end up with:

[−1.04,−1.04,0.17,1.37,0.17,−1.04,1.37]

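You might think that the second one describes the dataset better, but that’s true only when you look at the testing dataset in isolation. The model is trained on values normalized with the training statistics, so the testing data has to be transformed with those same statistics.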
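
The same effect can be checked with a few lines of NumPy. The numbers below are made up for illustration and are not the datasets from the example above; the point is only what happens to one and the same temperature, 28C, under the two sets of parameters.

```python
import numpy as np

# Made-up temperatures in Celsius, not the article's original data
train = np.array([22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0])
test = np.array([28.0, 28.0, 29.0, 30.0, 29.0, 28.0, 30.0])  # a hotter period

# Parameters learned from the training data only
train_mean, train_sd = train.mean(), train.std()

train_norm = (train - train_mean) / train_sd
test_with_train_stats = (test - train_mean) / train_sd
test_with_own_stats = (test - test.mean()) / test.std()

# 28C is the last training value and the first testing value
print("28C in training set:            ", train_norm[-1])
print("28C in test set, training stats:", test_with_train_stats[0])
print("28C in test set, test stats:    ", test_with_own_stats[0])
```

With the training parameters, 28C maps to the same number in both sets, so the model sees it as the same temperature. With the test set’s own parameters, 28C becomes the coolest value in the set and gets a negative number, which the model would read as a below-average temperature.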

That’s why, when building our model, we have to execute it like this:
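
The original listing is not reproduced here. Below is a minimal sketch reconstructed from the description in this section, not the author’s exact code: preproc_data, new_data, train_scale and feature_range are named in the text, while the parameter names norm_cols and scale_cols, the variable names, and all data values are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def preproc_data(data, norm_cols, scale_cols, train_scale=None):
    """Normalize `norm_cols` and scale `scale_cols` of `data`.

    The scalers are fitted on `train_scale`; when it is None (training set),
    `data` itself is used. Returns the transformed copy and the dataset
    that was used for fitting. All other columns are copied unchanged.
    """
    new_data = data.copy()  # leave the caller's DataFrame untouched
    if train_scale is None:
        train_scale = data
    if norm_cols:
        scaler = StandardScaler().fit(train_scale[norm_cols])
        new_data[norm_cols] = scaler.transform(new_data[norm_cols])
    if scale_cols:
        scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_scale[scale_cols])
        new_data[scale_cols] = scaler.transform(new_data[scale_cols])
    return new_data, train_scale


# Made-up rows; only the column names come from the DengAI dataset
train_df = pd.DataFrame({
    "station_avg_temp_c": [22.0, 24.0, 26.0, 28.0],
    "station_precip_mm": [0.0, 10.0, 20.0, 40.0],
    "ndvi_ne": [0.10, 0.15, 0.20, 0.25],
})
test_df = pd.DataFrame({
    "station_avg_temp_c": [28.0, 30.0],
    "station_precip_mm": [5.0, 20.0],
    "ndvi_ne": [0.12, 0.18],
})
norm_cols = ["station_avg_temp_c"]
scale_cols = ["station_precip_mm"]

# Training set: 3 arguments, the scalers are fitted on the training data itself
train_x, train_scale = preproc_data(train_df, norm_cols, scale_cols)
# Test set: 4 arguments, the training data is passed in again for fitting
test_x, _ = preproc_data(test_df, norm_cols, scale_cols, train_scale)

print(train_x)
print(test_x)
```

Because both calls fit on the same training data, 28C and 20 mm get identical transformed values in train_x and test_x, and the ndvi_ne column passes through untouched.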

Conclusion

We’ve just gone through a fairly standard normalization process for our dataset. It is important to understand the difference between normalization and scaling. Another thing that might be even more important is choosing which operation to apply to which feature (as in the example with the different temperature features). You should always try to understand your features, not just apply some hardcoded rules from the internet.

The last thing that I have to mention (and you’ve probably already thought about it) is the difference between the data ranges in the training and testing datasets. You know that normalization of the testing data should be done with the parameters (such as the mean and SD) computed from the training data, but shouldn’t we adjust the process to fit a different range? Let’s say the training dataset has a temperature range between 15C and 23C and the testing dataset has a range between 18C and 28C. Isn’t that a problem for our model? Actually, it isn’t :) Models don’t really care about small changes like that, because they approximate continuous functions (or distributions), and unless your range differs a lot (the data comes from a different distribution) you shouldn’t have any issues with it.

Originally published at https://erdem.pl.

Translated from: https://towardsdatascience.com/dengai-data-preprocessing-28fc541c9470
