Used Car Price Prediction Using Machine Learning

You can find all the Python scripts for this project on my GitHub page, along with the scripts used for data cleaning and data visualization in this study. The project is also deployed on Heroku using Django.

Content

Data Cleaning (identifying null values, filling missing values, and removing outliers)
Data Preprocessing (standardization or normalization)
ML Models: Linear Regression, Ridge Regression, Lasso, KNN, Random Forest Regressor, Bagging Regressor, AdaBoost Regressor, and XGBoost
Comparison of the performance of the models
Some insights from the data

Why is the price feature scaled by log transformation?

The regression model assumes that, for any fixed value of X, Y is normally distributed. In this problem, the target value (price) is not normally distributed; it is right-skewed.

To solve this problem, a log transformation is applied to the skewed target variable, and the inverse function is then applied to the predicted values to recover the actual predicted price.

For this reason, the model is evaluated with the RMSLE to measure the error, and the R² score is also calculated to evaluate the accuracy of the model.
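The round trip described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the prices are made up, the "model" is faked by adding a constant offset on the log scale, and the use of log1p/expm1 (rather than a plain log/exp pair) is an assumption.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, computed on the original price scale."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical prices, for illustration only.
prices = [1500.0, 7200.0, 15000.0, 42000.0]

# Forward transform: a model would be trained on log1p(price).
log_prices = [math.log1p(p) for p in prices]

# Stand-in for model output: predictions that are off by 0.1 on the log scale.
log_predictions = [lp + 0.1 for lp in log_prices]

# Inverse transform: expm1 undoes log1p and returns values in price units.
predicted_prices = [math.expm1(lp) for lp in log_predictions]

print(rmsle(prices, predicted_prices))
```

Because the error is measured on the log scale, a fixed relative miss costs about the same for a cheap car as for an expensive one.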

Some Key Concepts:

Learning Rate: The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. While using a low learning rate might be a good idea in terms of making sure that we do not miss any local minima, it could also mean that we'll take a long time to converge, especially if we get stuck on a plateau region.

n_estimators: This is the number of trees you want to build before taking the majority vote or the average of the predictions. A higher number of trees generally gives you better performance but makes your code slower.

R² Score: It is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 0% indicates that the model explains none of the variability of the response data around its mean, and 100% indicates that it explains all of it.
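For reference, the R² score is usually written as follows. The symbols are notation introduced here rather than taken from the study: y_i are the observed values, ŷ_i the model predictions, and ȳ the mean of the observed values.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A model that always predicts ȳ makes the numerator equal to the denominator, which gives R² = 0; perfect predictions make the numerator zero, which gives R² = 1.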

1. The Data:

The dataset used in this project was downloaded from Kaggle.

2. Data Cleaning:

The first step is to remove irrelevant features such as 'URL', 'region_url', 'vin', 'image_url', 'description', 'county', and 'state' from the dataset.

Next, check the missing values for each feature.
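These two steps might look like the following with pandas. This is a sketch on a tiny made-up frame, not the real Kaggle file; the column names are copied from the list above, and their exact spelling in the real dataset may differ.

```python
import pandas as pd

# Hypothetical miniature of the listings table; all values are made up.
df = pd.DataFrame({
    "URL": ["a", "b", "c"],
    "vin": ["x1", None, "x3"],
    "state": ["ca", "ny", "tx"],
    "year": [2012, None, 2018],
    "odometer": [85000.0, 120000.0, None],
    "price": [7500, 4200, 16000],
})

# Columns the text names as irrelevant; skip any that are absent.
irrelevant = ["URL", "region_url", "vin", "image_url", "description", "county", "state"]
df = df.drop(columns=[c for c in irrelevant if c in df.columns])

# Number of missing values in each remaining feature.
missing_per_feature = df.isnull().sum()
print(missing_per_feature)
```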

Showing missing values (Image By Panwar Abhash Anil)

The missing values are then filled with appropriate values using a suitable method.

To fill the missing values, the IterativeImputer method is used with different estimators, and the MSE of each estimator is calculated using cross_val_score:

Mean and Median
BayesianRidge estimator
DecisionTreeRegressor estimator
ExtraTreesRegressor estimator
KNeighborsRegressor estimator

MSE with Different Imputation Methods (Image By Panwar Abhash Anil)

From the above figure, we can conclude that the ExtraTreesRegressor estimator is the best choice for imputing the missing values.
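A comparison of this kind could be set up roughly as below. The data is synthetic, and several details are assumptions rather than the study's settings: the BayesianRidge model used to score each imputer, the 5-fold cross-validation, the 10% missing rate, and the small tree counts.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with about 10% of the entries knocked out at random.
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=120)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "BayesianRidge": IterativeImputer(estimator=BayesianRidge(), random_state=0),
    "DecisionTreeRegressor": IterativeImputer(
        estimator=DecisionTreeRegressor(random_state=0), max_iter=5, random_state=0),
    "ExtraTreesRegressor": IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
        max_iter=5, random_state=0),
    "KNeighborsRegressor": IterativeImputer(
        estimator=KNeighborsRegressor(n_neighbors=5), max_iter=5, random_state=0),
}

# Score each imputer by the cross-validated MSE of a model fitted after imputation.
mse = {}
for name, imputer in imputers.items():
    pipeline = make_pipeline(imputer, BayesianRidge())
    scores = cross_val_score(
        pipeline, X_missing, y, scoring="neg_mean_squared_error", cv=5)
    mse[name] = -scores.mean()

for name, value in sorted(mse.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.4f}")
```

Putting the imputer inside the pipeline means it is refitted on each training fold, so the held-out fold never influences the imputed values.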

Finally, after dealing with the missing values, there are zero null values left.

Outliers: The interquartile range (IQR) method is used to remove the outliers from the data.

From figure 1, prices whose log is below 6.55 or above 11.55 are outliers.
From figure 2, nothing can be concluded visually, so the IQR is calculated to find the outliers, i.e. odometer values below 6.55 or above 11.55 are outliers.
From figure 3, years below 1995 or above 2020 are outliers.
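The IQR rule behind such cut-offs can be sketched as follows. The sample values are made up, and the multiplier k = 1.5 is the conventional choice, assumed here because the text does not state which one was used.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical log-prices, with one very low and one very high value.
log_prices = [8.0, 8.5, 9.0, 9.2, 9.5, 10.0, 10.4, 2.0, 14.0]

low, high = iqr_bounds(log_prices)
kept = [v for v in log_prices if low <= v <= high]

print(low, high)
print(kept)
```

The same function can be applied to the odometer and year columns, each with its own fences.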

Finally, the shape of the dataset was (435849, 25) before processing and (374136, 18) after. In total, 61713 rows and 7 columns were removed.

3. Data Preprocessing:

Label Encoder: In our dataset, 12 features are categorical variables and 4 are numerical variables (price column excluded). To apply the ML models, these categorical variables need to be transformed into numerical variables. The LabelEncoder from the sklearn library is used for this.

Normalization: The dataset is not normally distributed, and all the features have different ranges. Without normalization, the ML model will tend to disregard the coefficients of features with small values, because their impact is so small compared to features with large values. To normalize the data, MinMaxScaler from the sklearn library is used.

Train the data: In this process, 90% of the data was used as training data and 10% was held out as test data.
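The three preprocessing steps can be combined as in the sketch below. The ten-row frame is made up, and two choices are assumptions rather than details from the study: the random_state value, and fitting the scaler on the training split only so that the test rows do not influence the scaling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical cleaned data; values are for illustration only.
df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "electric", "hybrid",
             "gas", "diesel", "gas", "gas", "diesel"],
    "year": [2005, 2012, 2018, 2019, 2015, 1999, 2010, 2016, 2008, 2013],
    "odometer": [150000, 98000, 30000, 12000, 60000,
                 210000, 120000, 45000, 135000, 80000],
    "log_price": [8.0, 9.2, 9.9, 10.4, 9.6, 7.5, 9.0, 9.8, 8.4, 9.3],
})

# Label encoding: map each category to an integer code.
df["fuel"] = LabelEncoder().fit_transform(df["fuel"])

X = df.drop(columns="log_price")
y = df["log_price"]

# 90% of the rows for training, 10% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Min-max scaling: every feature is mapped to the range [0, 1].
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```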

4. ML Models:

In this section, different machine learning algorithms are used to predict the price (the target variable).

This is a supervised learning problem, so the models are applied in the following order:

Linear Regression
Ridge Regression
Lasso Regression
K-Neighbors Regressor
Random Forest Regressor
Bagging Regressor
AdaBoost Regressor
XGBoost

1) Linear Regression:

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). In linear regression, the relationships are modelled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models.

Coefficients: The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.

A positive sign indicates that as the predictor variable increases, the response variable also increases.
A negative sign indicates that as the predictor variable increases, the response variable decreases.
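The sign rule can be checked on synthetic data. In this sketch the target is built to rise with one feature and fall with the other; the feature names and the generating coefficients are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, already-scaled features; the true effects are +2.0 and -1.5.
rng = np.random.RandomState(0)
year = rng.uniform(0, 1, 200)
odometer = rng.uniform(0, 1, 200)
log_price = 2.0 * year - 1.5 * odometer + rng.normal(scale=0.05, size=200)

X = np.column_stack([year, odometer])
model = LinearRegression().fit(X, log_price)

# Pair each fitted coefficient with its feature name.
coefficients = dict(zip(["year", "odometer"], model.coef_))
for name, value in coefficients.items():
    direction = "increases" if value > 0 else "decreases"
    print(f"{name}: {value:+.3f} (target {direction} as {name} increases)")
```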

Considering this figure, linear regression suggests that these five variables are the most important: year, cylinder, transmission, fuel, and odometer.

2) Ridge Regression:

Ridge regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value.

To find the best alpha value for ridge regression, AlphaSelection from the yellowbrick library was applied.

From the figure, the best value of alpha to fit the dataset is 20.336.

Note: The value of alpha is not constant; it varies from run to run.

Using this value of alpha, the Ridge regressor is fitted.
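The select-then-fit workflow can be sketched with scikit-learn alone. Note the swap: this uses RidgeCV to pick alpha by cross-validation instead of yellowbrick's AlphaSelection visualizer, and both the synthetic data and the alpha grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

# Synthetic regression data, for illustration only.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, -2.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

# Candidate regularization strengths on a log-spaced grid.
alphas = np.logspace(-3, 2, 50)

# RidgeCV evaluates every candidate and keeps the best one in alpha_.
search = RidgeCV(alphas=alphas).fit(X, y)
best_alpha = search.alpha_

# Refit a plain Ridge model with the selected alpha.
ridge = Ridge(alpha=best_alpha).fit(X, y)
print(best_alpha, ridge.coef_)
```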

Considering this figure, ridge regression suggests that these five variables are the most important: year, cylinder, transmission, fuel, and odometer.

The performance of ridge regression is almost the same as that of linear regression.

3) Lasso Regression:

Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters).

Why is lasso regression used?

The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes the regression coefficients for some variables to shrink toward zero.
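The sparsity effect is easy to see on synthetic data where only two of six features carry signal. The data and the value alpha=0.1 are assumptions chosen for the illustration, not settings from the study.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Six features, but only the first two influence the target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)

# Coefficients that the L1 penalty has driven exactly to zero.
n_zero = int(np.sum(lasso.coef_ == 0.0))
print(lasso.coef_)
print(f"{n_zero} of {lasso.coef_.size} coefficients are exactly zero")
```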

But for this dataset, there is no need for lasso regression, as there is not much difference in error.

4) KNeighbors Regressor: Regression based on k-nearest neighbors.

The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation.

From the above figure, KNN gives the least error for k=5. So the model is trained using n_neighbors=5 and metric='euclidean'.
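A search over k of this kind could look like the sketch below. The data is synthetic, and the range of k values and the use of MSE on a held-out split are assumptions, so the k it selects will not necessarily be 5.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic, already-scaled data for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(size=(300, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Fit one model per candidate k and record its test error.
errors = {}
for k in range(1, 11):
    knn = KNeighborsRegressor(n_neighbors=k, metric="euclidean")
    knn.fit(X_train, y_train)
    errors[k] = mean_squared_error(y_test, knn.predict(X_test))

best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])
```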

The performance of KNN is better: the error decreases and the accuracy increases.

5) Random Forest:

The random forest is an algorithm consisting of many decision trees; it can be used for classification and, as here, for regression. It uses bagging and feature randomness when building each tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

In our model, 180 decision trees are created with max_features=0.5.
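With scikit-learn, those two settings map onto RandomForestRegressor as sketched here. The data and feature names are made up, and random_state is an assumption added for repeatability.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which the first feature has the strongest effect.
feature_names = ["year", "odometer", "cylinders", "fuel"]
rng = np.random.RandomState(0)
X = rng.uniform(size=(300, 4))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# 180 trees; each split considers a random half of the features.
forest = RandomForestRegressor(
    n_estimators=180, max_features=0.5, random_state=0).fit(X, y)

# Impurity-based importances, the kind of values a bar plot would show.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, value in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.3f}")
```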

Performance of Random Forest (True value vs predicted value)

This simple bar plot illustrates that year is the most important feature of a car, followed by odometer and then the others.

The performance of the random forest is better, and the accuracy is increased by approx. 10%, which is good. Since the random forest uses bagging when building each tree, the Bagging Regressor is tried next.

6) Bagging Regressor:

A Bagging regressor is an ensemble meta-estimator that fits base regressors each on random subsets of the original dataset and then aggregates their predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

In our model, DecisionTreeRegressor with max_depth=20 is used as the base estimator, 50 decision trees are created, and the results are shown below.
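A configuration matching that description might be written as below. The data is synthetic and random_state is an assumption. The base estimator is passed positionally because the keyword for it has changed name between scikit-learn releases.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(size=(300, 4))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# 50 depth-limited trees, each fitted on a bootstrap sample of the rows.
bagging = BaggingRegressor(
    DecisionTreeRegressor(max_depth=20),
    n_estimators=50,
    random_state=0,
).fit(X, y)

# The final prediction is the average over the 50 trees.
print(bagging.predict(X[:3]))
```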

The performance of Random Forest is much better than that of the Bagging regressor.

The key difference between random forest and bagging: in random forests, only a subset of the features is selected at random at each split, and the best split feature from that subset is used to split each node in a tree, unlike in bagging, where all features are considered for splitting a node.

7) AdaBoost Regressor:

AdaBoost can be used to boost the performance of any machine learning algorithm. AdaBoost helps you combine multiple “weak classifiers” into a single “strong classifier”. Library used: AdaBoostRegressor.

This simple bar plot illustrates that year is the most important feature of a car, followed by odometer, then model, etc.

In our model, DecisionTreeRegressor with max_depth=24 is used as the base estimator, 200 trees are created, and the model is trained with a learning_rate of 0.6. The result is shown below.
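Those settings translate to scikit-learn roughly as follows. The data is synthetic and random_state is an assumption; as with bagging, the base estimator is passed positionally to stay independent of the keyword name used by a given scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 4))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Up to 200 boosted trees of depth 24, each contribution scaled by 0.6.
ada = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=24),
    n_estimators=200,
    learning_rate=0.6,
    random_state=0,
).fit(X, y)

# Boosting can stop early, so the fitted ensemble may hold fewer than 200 trees.
print(len(ada.estimators_), ada.predict(X[:3]))
```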

8) XGBoost: XGBoost stands for eXtreme Gradient Boosting

XGBoost is an ensemble learning method. It is an implementation of gradient boosted decision trees designed for speed and performance. The beauty of this powerful algorithm lies in its scalability, which drives fast learning through parallel and distributed computing and offers efficient memory usage.

This simple bar plot, in descending order of importance, illustrates which features of a car matter most.

XGBoost ranks odometer as an important feature, whereas the previous models ranked year as an important feature.

In this model, 200 decision trees with a max depth of 24 are created, and the model learns the parameters with a learning rate of 0.4.

5. Comparison of the performance of the models:

From the above figures, we can conclude that the XGBoost regressor, with 89.662% accuracy, performs better than the other models.

6. Some insights from the dataset:

1. From the pair plot, we can't conclude anything. There is no correlation between the variables.

2. From the distplot, we can conclude that the price initially increases rapidly, but after a particular point it starts decreasing.

3. From figure 1, we see that diesel cars have the highest price, followed by electric cars. Hybrid cars have the lowest price.

Bar Plot showing the price of each fuel type

4. From figure 2, we see that the car price for each fuel type also depends on the condition of the car.

Bar Plot between fuel and price with hue condition

5. From figure 3, we see that car prices increase each year after 1995, and from figure 4, the number of cars also increases each year; at some point, i.e. in 2012, the number of cars is nearly the same.

Graph showing how the price varies per year

6. From figure 5, we see that the price of a car also depends on its condition, and from figure 6, the price varies with both the condition and the size of the car.

Bar Plot showing the price respective of the condition of the car

7. From figures 7 and 8, we see that the price of a car also varies with its transmission. People are ready to buy cars with “other” transmission, and the price of cars with “manual” transmission is low.

8. Below there are similar graphs with the same insight but different features.

Conclusion:

By trying different ML models, we aimed to get a better result, i.e. less error with maximum accuracy. Our purpose was to predict the price of used cars from a dataset with 25 predictors and 509577 data entries.

Initially, data cleaning is performed to remove the null values and outliers from the dataset; then ML models are implemented to predict the price of cars.

Next, the features were explored in depth with the help of data visualization, and the relation between the features was examined.

From the table below, it can be concluded that XGBoost is the best model for predicting used car prices. XGBoost as a regression model gave the best MSLE and RMSLE values.

Translated from: https://towardsdatascience.com/used-car-price-prediction-using-machine-learning-e3be02d977b2