Scikit-learn is one of the best-known machine learning libraries in Python. It offers a range of classification, regression, and clustering algorithms, and its key strength, in my opinion, is seamless integration with NumPy, pandas, and SciPy.

Scikit-learn is so well designed that with a couple of lines of code we can compare the predictions of many different algorithms. I sometimes feel that this strength inadvertently works to its disadvantage: machine learning developers, especially those with relatively little experience, may apply an inappropriate algorithm without grasping its salient features and limitations.

In this article, I will discuss why we should not use the decision tree regression algorithm for predictions that involve extrapolating beyond the training data.

We have the iron, calcium, and protein content of peas from the day they are picked at the farm up to day 1142. Let us assume that iron and calcium content are easier and cheaper to measure than protein content.

We will use this data to train a DecisionTreeRegressor and then predict the protein content of new data points from their iron content, calcium content, and days passed.


Sample Data File


I think the data file is largely self-explanatory: each row shows the iron, calcium, and protein content of the peas for a given number of days since harvesting.

Step 1- We will import pandas, NumPy, matplotlib, and scikit-learn's DecisionTreeRegressor, which we are going to use for our analysis.

from sklearn.tree import DecisionTreeRegressor
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Step 2- Read the full sample data Excel file into a pandas DataFrame called “data”.

data= pd.read_excel("Peas Nutrient.xlsx")

In this article I will not cover preliminary data quality checks, such as blank values and outliers, or the approaches for correcting them. I am assuming that the data series contain no such discrepancies.


Step 3- We will split the full data set into two parts, viz. a training set and a testing set. As the names suggest, we will use the training set to train the decision tree regressor and then compare its protein predictions with the actual protein content in the testing set.

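In the code below, the records from day 1 to day 900 are sliced as training data and the records from day 901 to day 1142 as testing data.

# Positional slices, assuming one row per day in chronological order:
# rows 0-899 hold days 1-900 and rows 900-1141 hold days 901-1142.
# (A label-based data.loc[901:1142] would skip the row for day 901.)
Training_data = data.iloc[:900]
Test_data = data.iloc[900:1142]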
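
The difference between positional and label-based slicing matters for a chronological split like this one. The sketch below uses a tiny synthetic DataFrame (the `day` column and the 10-row size are made up for illustration and are not taken from the peas file) to show that `.iloc` slices by position and excludes the stop, while `.loc` slices by index label and includes both endpoints.

```python
import pandas as pd

# Synthetic stand-in: 10 rows, default integer index 0..9, "day" runs 1..10.
frame = pd.DataFrame({"day": range(1, 11)})

# Positional split: first 7 rows for training, the remaining rows for testing.
train = frame.iloc[:7]
test = frame.iloc[7:]

# Label-based slice: .loc uses index labels and includes both endpoints,
# so starting at label 8 silently leaves out the row with label 7 (day 8).
test_loc = frame.loc[8:9]

print(train["day"].tolist())
print(test["day"].tolist())
print(test_loc["day"].tolist())
```

With a default integer index, the row label is always one less than the day number, which is why a label-based start of 901 would begin at day 902 rather than day 901.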

Step 4- “Days passed”, “iron content”, and “calcium content” are the independent variables used for prediction. The predicted “protein content” is the dependent variable. By convention, the independent variables are denoted by “X” and the dependent variable by “y”.

In the code below, the “Protein Content” column is dropped from the DataFrame, and the remaining columns, i.e. the independent variables, are assigned to X_train. Similarly, all columns except “Protein Content” are dropped, and the result is assigned to y_train.

X_train = Training_data.drop(["Protein Content "], axis=1)
y_train = Training_data.drop(["Days Passed", "Iron Content ", "Calcium Content "], axis=1)

The same process is repeated in the code below for the testing data set, i.e. the values from day 901 to day 1142.


X_test = Test_data.drop(["Protein Content "], axis=1)
y_test = Test_data.drop(["Days Passed", "Iron Content ", "Calcium Content "], axis=1)
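
An equivalent way to separate the features from the target is to select the target column by name instead of dropping every other column. This is only a sketch on made-up values: the helper `split_features_target` is hypothetical, and the column names simply mirror the ones used above, trailing spaces included.

```python
import pandas as pd

# Column name as used in the article, trailing space included.
TARGET = "Protein Content "


def split_features_target(frame, target=TARGET):
    """Return (X, y): every column except the target, and the target as a Series."""
    X = frame.drop(columns=[target])
    y = frame[target]
    return X, y


# Made-up values purely for illustration, not the real peas measurements.
sample = pd.DataFrame({
    "Days Passed": [1, 2, 3],
    "Iron Content ": [1.5, 1.4, 1.3],
    "Calcium Content ": [25.0, 24.8, 24.5],
    "Protein Content ": [5.4, 5.3, 5.2],
})

X, y = split_features_target(sample)
print(list(X.columns))
print(y.shape)
```

Selecting by name keeps the code correct if more feature columns are added later, and it yields a one-dimensional target, which is the conventional shape for a single-output regressor.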

Step 5- The DecisionTreeRegressor model is trained on the training dataset. The score (the coefficient of determination, R²) is then checked to see how well the model fits this data.

tree_reg = DecisionTreeRegressor().fit(X_train, y_train)
print("The model training score is", tree_reg.score(X_train, y_train))

A perfect training score of 1.0 is a strong sign that the model has overfitted the training data.
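
The near-perfect training score is easy to reproduce on synthetic data. In this sketch (randomly generated noise around a linear trend, not the peas measurements), an unconstrained tree keeps splitting until almost every training point sits in its own leaf, so the training R² comes out at or extremely close to 1.0 even though part of the target is pure noise.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: one feature, a linear trend plus noise.
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

# No depth limit: the tree keeps splitting until its leaves are pure.
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

train_score = tree.score(X, y)
print("training R^2:", train_score)
print("leaves:", tree.get_n_leaves(), "for", len(X), "samples")
```

A leaf count close to the number of samples means the tree has memorised individual points, noise included, rather than learned the underlying trend.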


Step 5 (continued)- To address the overfitting caused by the unconstrained depth of the tree during training, we will constrain the maximum depth to 6.


tree_reg = DecisionTreeRegressor(max_depth=6).fit(X_train, y_train)
print("The model training score is", tree_reg.score(X_train, y_train))

This reduces the overfitting on the training data, and the model is ready to predict the protein content for the test data points.
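
Limiting the depth trades training fit for a simpler tree. The sketch below, again on synthetic data rather than the peas file, fits an unconstrained tree and a `max_depth=6` tree and compares their training score, depth, and leaf count. A lower training score on its own does not show that the model generalises; that still needs a check on held-out data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic data: a noisy sine curve over one feature.
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=500)

full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
shallow_tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X, y)

# A binary tree of depth 6 can have at most 2**6 = 64 leaves.
for name, model in [("unconstrained", full_tree), ("max_depth=6", shallow_tree)]:
    print(name,
          "| train R^2:", round(model.score(X, y), 3),
          "| depth:", model.get_depth(),
          "| leaves:", model.get_n_leaves())
```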


Step 6- In the code below, the “protein content” of the test data set, i.e. from day 901 to day 1142, is predicted from the respective “days passed”, “iron content”, and “calcium content” data.


y_pred_tree = tree_reg.predict(X_test)

Step 7- We will plot the protein content predicted by the decision tree regression model and compare it with the actual protein content in the test dataset from day 901 to day 1142.


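plt.plot(X_test["Days Passed"], y_test, label="Actual Data")
# np.rint rounds the predictions to the nearest integer for plotting
plt.plot(X_test["Days Passed"], np.rint(y_pred_tree), label="Predicted Data")
# Days are on the x-axis and protein content is on the y-axis
plt.xlabel("Days Passed")
plt.ylabel("Protein Content (in Grams)")
plt.legend(loc="best")
plt.show()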

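We can see that the decision tree regressor, which fits the training dataset quite well with a score of 0.93, fails badly at predicting the protein content on the test data: it predicts the same protein content of ~51.34 for all days. This happens because a regression tree can only return the average target value of the training samples in one of its leaves, so for inputs outside the range seen during training it cannot continue a trend.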

We should not use the decision tree regression model for predictions that involve extrapolating the data. This is just one example; the main takeaway for us machine learning practitioners is to consider the data, the prediction objective, and each algorithm's strengths and limitations before starting to model.
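
The flat prediction can be reproduced with a single synthetic feature. In the sketch below, a made-up downward linear trend stands in for nutrient decay (it is not the peas data), and a linear model is included only as a contrast; it is not part of the analysis above. Every test day is larger than any training day, so the tree sends all of them to the same outermost leaf.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target falls linearly with "days", plus a little noise.
rng = np.random.default_rng(42)
days = np.arange(1, 1143).reshape(-1, 1)
target = 60.0 - 0.01 * days[:, 0] + rng.normal(0, 0.05, size=len(days))

# Chronological split: train on days 1-900, test on days 901-1142.
X_train, y_train = days[:900], target[:900]
X_test, y_test = days[900:], target[900:]

tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_train, y_train)
line = LinearRegression().fit(X_train, y_train)

tree_pred = tree.predict(X_test)
line_pred = line.predict(X_test)

# All split thresholds lie inside the training range, so every test day
# follows the same path to the right-most leaf and gets one constant value.
print("distinct tree predictions:", np.unique(tree_pred).size)
print("tree R^2 on test:  ", tree.score(X_test, y_test))
print("linear R^2 on test:", line.score(X_test, y_test))
```

The linear model does better here only because the synthetic trend is linear by construction; the point is that the choice of algorithm should match how the target is expected to behave outside the training range.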

We can make similar mistakes when selecting the independent variables for supervised machine learning algorithms. In the article “How to identify the right independent variables for Machine Learning Supervised Algorithms?” I discuss a structured approach to identifying the appropriate independent variables for accurate predictions.
