Beginner's Guide: Exploratory Data Analysis in R

When I started on my journey to learn data science, I read through multiple articles that stressed the importance of understanding your data. It didn't make sense to me. I was naive enough to think that we are simply handed data, push it through an algorithm, and hand over the results.

Yes, I wasn't exactly the brightest. But I've learned my lesson, and today I want to impart what I picked up during the sleepless nights I spent trying to figure out my data. I am going to use the R language to demonstrate EDA.

WHY R?

Because it was built from the get-go with data science in mind. It's easy to pick up and get your hands dirty with, and it doesn't have a steep learning curve, *cough* Assembly *cough*.

Before I start: this article is a guide for people classified under the tag of 'Data Science infants.' I believe both Python and R are great languages, and what matters most is the story you tell from your data.

Why this dataset?

Well, it's where I think most aspiring data scientists would start. This dataset (the house-prices data we load below as train.csv) is a good place to warm up your engines and start thinking like a data scientist, while being novice-friendly enough to help you breeze through the exercise.

How do we approach this data?

  • Will this variable help us predict house prices?
  • Is there a correlation between these variables?
  • Univariate Analysis
  • Multivariate Analysis
  • A bit of Data Cleaning
  • Conclude by proving the relevance of our selected variables.

Best of luck on your journey to master Data Science!

Now, we start by importing packages; I'll explain why these packages are needed along the way…

easypackages::libraries("dplyr", "ggplot2", "tidyr", "corrplot", "corrr",
                        "magrittr", "e1071", "RColorBrewer", "viridis",
                        "gridExtra")   # gridExtra is needed for grid.arrange() later

options(scipen = 5)      # force R to not use scientific notation
dataset <- read.csv("train.csv")
str(dataset)

Here, in the above snippet, we use scipen to avoid scientific notation. We import our data and use the str() function to get the gist of the variables the dataset offers and their respective data types.

The variable SalePrice is the dependent variable around which we are going to base all our assumptions and hypotheses. So it's good to first understand more about this variable. For this, we'll use a histogram and fetch a frequency distribution to get a visual understanding of the variable. You'd notice there's another function, summary(), which is essentially used for the same purpose but without any form of visualization. With experience, you'll be able to understand and interpret this form of information better.

ggplot(dataset, aes(x=SalePrice)) +
  theme_bw() +
  geom_histogram(aes(y=..density..), color = 'black', fill = 'white', binwidth = 50000) +
  geom_density(alpha=.2, fill='blue') +
  labs(title = "Sales Price Density", x="Price", y="Density")

summary(dataset$SalePrice)

So it is pretty evident that you'll find many properties in the sub-$200,000 range. There are properties over $600,000, and we could try to understand why that is and what makes these homes so ridiculously expensive. That can be another fun exercise…

Which variables do you think are most influential when deciding a price for a house you are looking to buy?

Now that we have a basic idea about SalePrice, we will try to visualize this variable in terms of some other variables. Please note that it is very important to understand what type of variable you are working with. I would like you to refer to this amazing article, which covers the topic in more detail.

Moving on, we will be dealing with two kinds of variables.

  • Categorical Variable
  • Numeric Variable

Looking back at our dataset, we can discern between these variables. For starters, we run a coarse comb across the dataset and hand-pick some variables that have the highest chance of being relevant. Note that these are just assumptions, and we are exploring this dataset to verify them. The variables I selected are:

  • GrLivArea
  • TotalBsmtSF
  • YearBuilt
  • OverallQual

So which ones are quantitative and which ones are qualitative out of the lot? If you look closely at the OverallQual and YearBuilt variables, you will notice that they shouldn't be treated as quantitative. Year and quality are both categorical by the nature of this data; however, R doesn't know that. For that, we use the factor() function to convert a numerical variable to a categorical one so R can interpret the data better.

dataset$YearBuilt <- factor(dataset$YearBuilt)
dataset$OverallQual <- factor(dataset$OverallQual)

Now when we run str() on our dataset we will see both YearBuilt and OverallQual as factor variables.

We can now start plotting our variables.

Relationships are (NOT) so complicated

Taking YearBuilt as our first candidate, we start plotting.

ggplot(dataset, aes(y=SalePrice, x=YearBuilt, group=YearBuilt, fill=YearBuilt)) +
  theme_bw() +
  geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=1) +
  theme(legend.position="none") +
  scale_fill_viridis(discrete = TRUE) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Year Built vs. Sale Price", x="Year", y="Price")

Old houses sell for less compared to recently built ones. And as for OverallQual:

ggplot(dataset, aes(y=SalePrice, x=OverallQual, group=OverallQual, fill=OverallQual)) +
  geom_boxplot(alpha=0.3) +
  theme(legend.position="none") +
  scale_fill_viridis(discrete = TRUE, option="B") +
  labs(title = "Overall Quality vs. Sale Price", x="Quality", y="Price")

This was expected, since you'd naturally pay more for a house of better quality. You wouldn't want your foot to break through the floorboards, would you? Now that the qualitative variables are out of the way, we can focus on the numeric variables. The very first candidate we have here is GrLivArea.

ggplot(dataset, aes(x=SalePrice, y=GrLivArea)) +
  theme_bw() +
  geom_point(colour="Blue", alpha=0.3) +
  theme(legend.position='none') +
  labs(title = "General Living Area vs. Sale Price", x="Price", y="Area")

I would be lying if I said I didn't expect this. The very first instinct of a customer is to check the area of the rooms. And I think the result will be the same for TotalBsmtSF. Let's see…

ggplot(dataset, aes(x=SalePrice, y=TotalBsmtSF)) +
  theme_bw() +
  geom_point(colour="Blue", alpha=0.3) +
  theme(legend.position='none') +
  labs(title = "Total Basement Area vs. Sale Price", x="Price", y="Area")

So what can we say about our cherry-picked variables?

GrLivArea and TotalBsmtSF were both found to be in a linear relationship with SalePrice. As for the categorical variables, we can say with confidence that the two we picked are related to SalePrice.

But these are not the only variables, and there's more to this than meets the eye. So to tread through these many variables, we'll take help from a correlation matrix to see how the variables correlate and get a better insight.

Time for Correlation Plots

So what is Correlation?

Correlation is a measure of how well two variables are related to each other. There are positive as well as negative correlations.

If you want to read more on Correlation then take a look at this article. So let’s create a basic Correlation Matrix.

# Convert character columns to factors, then factors to numeric codes,
# so cor() can run across the whole dataset
M <- dataset %>% mutate_if(is.character, as.factor)
M <- M %>% mutate_if(is.factor, as.numeric)
M <- cor(M, use = "pairwise.complete.obs")   # pairwise so columns with NAs still get correlations
mat1 <- data.matrix(M)
print(M)

# plotting the correlation matrix
corrplot(M, method = "color", tl.col = 'black', is.corr=FALSE)

Please don't close this tab. I promise it gets better.

But worry not because now we’re going to get our hands dirty and make this plot interpretable and tidy.

M[lower.tri(M, diag=TRUE)] <- NA                 # remove self-correlations and duplicates
M[M == 1] <- NA
M <- as.data.frame(as.table(M))                  # turn into a 3-column table
M <- na.omit(M)                                  # remove the NA values from above
M <- subset(M, abs(Freq) > 0.5)                  # select significant values, in this case |r| > 0.5
M <- M[order(-abs(M$Freq)), ]                    # sort by highest correlation
mtx_corr <- reshape2::acast(M, Var1~Var2, value.var="Freq")     # turn M back into a matrix

corrplot(mtx_corr, is.corr=TRUE, tl.col="black", na.label=" ")  # plot correlations visually

Now, this looks much better and more readable.

Looking at our plot, we can see numerous other variables that are highly correlated with SalePrice. We pick these variables and then create a new dataframe that includes only them.

Now that we have our suspect variables, we can use a pair plot to visualize all of them in conjunction with each other.

newData <- data.frame(dataset$SalePrice, dataset$TotalBsmtSF,
                      dataset$GrLivArea, dataset$OverallQual,
                      dataset$YearBuilt, dataset$FullBath,
                      dataset$GarageCars)

pairs(newData[1:7],
      col = "blue",
      main = "Pairplot of our new set of variables")

While you're at it, clean your data

We should remove the variables which we are sure won't be of any use. Don't apply changes to the original dataset, though. Always create a new copy in case you remove something you shouldn't have.

clean_data <- dataset[, !grepl("^Bsmt", names(dataset))]    # remove Bsmt* variables

drops <- c("PoolQC", "PoolArea", "FullBath", "HalfBath")    # column names only, not clean_data$...
clean_data <- clean_data[, !(names(clean_data) %in% drops)] # the variables in 'drops' are removed

Univariate Analysis

Taking a look back at our old friend SalePrice, we see some extremely expensive houses. We haven't delved into why that is, although we do know that these extremely pricey houses don't follow the pattern the other house prices follow. The reason for such high prices could be justified, but for the sake of our analysis we may have to drop them. Such records are called outliers.

A simple way to understand outliers is to think of them as that one guy (or more) in your group who likes to eat noodles with a spoon instead of a fork.

So first, we catch these outliers and then remove them from our dataset if need be. Let’s start with the catching part.

# Univariate Analysis
clean_data$price_norm <- scale(clean_data$SalePrice)    # normalize the price variable
summary(clean_data$price_norm)

plot1 <- ggplot(clean_data, aes(x=factor(1), y=price_norm)) +
  theme_bw() +
  geom_boxplot(width = 0.4, fill = "blue", alpha = 0.2) +
  geom_jitter(width = 0.1, size = 1, aes(colour = "red")) +
  geom_hline(yintercept=6.5, linetype="dashed", color = "red") +
  theme(legend.position='none') +
  labs(title = "Hunt for Outliers", x=NULL, y="Normalized Price")

plot2 <- ggplot(clean_data, aes(x=price_norm)) +
  theme_bw() +
  geom_histogram(color = 'black', fill = 'blue', alpha = 0.2) +
  geom_vline(xintercept=6.5, linetype="dashed", color = "red") +
  geom_density(aes(y=0.4*..count..), colour="red", adjust=4) +
  labs(title = "", x="Price", y="Count")

grid.arrange(plot1, plot2, ncol=2)

The very first thing I did here was normalize SalePrice so that it's more interpretable and easier to zero in on the outliers. The normalized SalePrice has mean = 0 and SD = 1. Running a quick summary() on this new variable, price_norm, gives us this…

So now we know for sure that there ARE outliers present here. But do we really need to get rid of them? From the previous scatter plots, we can say that these outliers still follow the overall trend and don't need purging yet. Deciding what to do with outliers can be quite complex at times. You can read more on outliers here.

Bivariate Analysis

Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of a relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences. There are three types of bivariate analysis.

  • Numerical & Numerical数值与数值
  • Categorical & Categorical分类和分类
  • Numerical & Categorical数值和分类

The very first set of variables we will analyze here is SalePrice and GrLivArea. Both variables are numerical, so using a scatter plot is a good idea!

ggplot(clean_data, aes(y=SalePrice, x=GrLivArea)) +
  theme_bw() +
  geom_point(aes(color = SalePrice), alpha=1) +
  scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "General Living Area vs. Sale Price", y="Price", x="Area")

Immediately, we notice that 2 houses don't follow the linear trend and affect both our results and assumptions. These are our outliers. Since our future results are prone to being affected negatively by them, we will remove them.

clean_data <- clean_data[!(clean_data$GrLivArea > 4000),]   # remove outliers

ggplot(clean_data, aes(y=SalePrice, x=GrLivArea)) +
  theme_bw() +
  geom_point(aes(color = SalePrice), alpha=1) +
  scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "General Living Area vs. Sale Price [Outlier Removed]", y="Price", x="Area")

The outliers are removed and the x-scale is adjusted. The next set of variables we will analyze is SalePrice and TotalBsmtSF.

ggplot(clean_data, aes(y=SalePrice, x=TotalBsmtSF)) +
  theme_bw() +
  geom_point(aes(color = SalePrice), alpha=1) +
  scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "Total Basement Area vs. Sale Price", y="Price", x="Basement Area")

The observations here adhere to our assumptions and don’t need purging. If it ain’t broke, don’t fix it. I did mention that it is important to tread very carefully when working with outliers. You don’t get to remove them every time.

Time to dig a bit deeper

We based a ton of visualization around SalePrice and other important variables, but what if I said that's not enough? It's not, because there's more to dig out of this pit. There are 4 horsemen of Data Analysis which I believe people should remember.

  • Normality: When we talk about normality, what we mean is that the data should look like a normal distribution. This is important because a lot of statistical tests depend upon it (for example, t-statistics). First, we will check normality with just a single variable, SalePrice (it's usually better to start with a single variable). One shouldn't assume that univariate normality proves the existence of multivariate normality (which is comparatively more sought after), but it helps. Another thing to note is that in larger samples, i.e. more than 200 observations, normality is not such an issue. Still, a lot of problems can be avoided if we solve normality first, which is one of the reasons we are working on it (see the quick check sketched after this list).

  • Homoscedasticity: Homoscedasticity refers to the assumption that one or more dependent variables exhibit equal levels of variance across the range of predictor variables. If we want the error term to be the same across all values of the independent variable, then homoscedasticity must be checked.

  • Linearity: If you want to assess the linearity of your data, then I believe scatter plots should be the first choice. Scatter plots can quickly show a linear relationship (if it exists). In cases where the patterns are not linear, it is worthwhile to explore data transformations. However, we need not check for this again, since our previous plots have already shown a linear relationship.

  • Absence of correlated errors: When working with errors, if you notice a pattern where one error is correlated with another, then there's a relationship between the variables. For example, if a positive error systematically comes with a negative error elsewhere, that implies a relationship between the errors. This phenomenon is more evident with time-sensitive data. If you do find yourself working with such data, try to add a variable that can explain your observations.

I think we should start doing rather than saying

Starting with SalePrice. Do keep an eye on the overall distribution of our variable.

SalePrice开始。 请注意变量的总体分布。

plot3 <- ggplot(clean_data, aes(x=SalePrice)) +
  theme_bw() +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  geom_density(color="black", alpha=1, adjust = 5, lwd=1.2) +
  labs(title = "Sale Price Density", x="Price", y="Density")

plot4 <- ggplot(clean_data, aes(sample=SalePrice)) +
  theme_bw() +
  stat_qq(color="#69b3a2") +
  stat_qq_line(color="black", lwd=1, lty=2) +
  labs(title = "Probability Plot for SalePrice")

grid.arrange(plot3, plot4, ncol=2)

SalePrice is not normal! But we have another trick up our sleeve: the log transformation. One great thing about the log transformation is that it can tame skewed data and pull it much closer to a normal distribution. So now it's time to apply it to our variable.

clean_data$log_price <- log(clean_data$SalePrice)

plot5 <- ggplot(clean_data, aes(x=log_price)) +
  theme_bw() +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  geom_density(color="black", alpha=1, adjust = 5, lwd=1) +
  labs(title = "Sale Price Density [Log]", x="Price", y="Density")

plot6 <- ggplot(clean_data, aes(sample=log_price)) +
  theme_bw() +
  stat_qq(color="#69b3a2") +
  stat_qq_line(color="black", lwd=1, lty=2) +
  labs(title = "Probability Plot for SalePrice [Log]")

grid.arrange(plot5, plot6, ncol=2)

Now repeat the process with the rest of our variables.

We go with GrLivArea first

After Log Transformation
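The code for this step appeared only as images in the original post; here is a minimal sketch following the same pattern as the SalePrice step above. The grlive_log name is the one the homoscedasticity snippet further down expects; the plot7/plot8 names are my own:

clean_data$grlive_log <- log(clean_data$GrLivArea)   # GrLivArea is always positive, so log() is safe

plot7 <- ggplot(clean_data, aes(x=grlive_log)) +
  theme_bw() +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  geom_density(color="black", alpha=1, adjust = 5, lwd=1) +
  labs(title = "General Living Area Density [Log]", x="Area", y="Density")

plot8 <- ggplot(clean_data, aes(sample=grlive_log)) +
  theme_bw() +
  stat_qq(color="#69b3a2") +
  stat_qq_line(color="black", lwd=1, lty=2) +
  labs(title = "Probability Plot for GrLivArea [Log]")

grid.arrange(plot7, plot8, ncol=2)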

Now for TotalBsmtSF

Hold on! We've got something interesting here.

Looks like TotalBsmtSF has some zeroes, which doesn't play well with the log transformation, since log(0) is undefined (R returns -Inf). We'll have to do something about it. To apply a log transformation here, we'll create a binary variable that captures the effect of having or not having a basement. Then we'll log-transform all the non-zero observations, ignoring those with value zero. This way we can transform the data without losing the effect of having or not having a basement.

# The step where I create a new variable to dictate which rows to transform and which to ignore
clean_data <- transform(clean_data, cat_bsmt = ifelse(TotalBsmtSF > 0, 1, 0))

# Now we can do the log transformation, leaving the zero-basement rows at 0
clean_data <- transform(clean_data,
                        totalbsmt_log = ifelse(cat_bsmt == 1, log(TotalBsmtSF), 0))

plot13 <- ggplot(clean_data, aes(x=totalbsmt_log)) +
  theme_bw() +
  geom_density(fill="#ed557e", color="#e9ecef", alpha=0.5) +
  geom_density(color="black", alpha=1, adjust = 5, lwd=1) +
  labs(title = "Total Basement Area Density [transformed]", x="Area", y="Density")

plot14 <- ggplot(clean_data, aes(sample=totalbsmt_log)) +
  theme_bw() +
  stat_qq(color="#ed557e") +
  stat_qq_line(color="black", lwd=1, lty=2) +
  labs(title = "Probability Plot for TotalBsmtSF [transformed]")

grid.arrange(plot13, plot14, ncol=2)

We can still see the ignored data points on the chart but hey, I can trust you with this, right?

Homoscedasticity (wait, is my spelling correct?)

The best way to look for homoscedasticity is to try and visualize the variables using charts. A scatter plot should do the job. Notice the shape the data forms when plotted. It could look like an equal dispersion shaped like a cone, or it could very well look like a diamond, where a large number of data points are spread around the centre.

Starting with ‘SalePrice’ and ‘GrLivArea’…

ggplot(clean_data, aes(x=grlive_log, y=log_price)) +
  theme_bw() +
  geom_point(colour="#e34262", alpha=0.3) +
  theme(legend.position='none') +
  labs(title = "Homoscedasticity : Living Area vs. Sale Price", x="Area [Log]", y="Price [Log]")

We plotted 'SalePrice' and 'GrLivArea' before, so why does this plot look different? That's right: because of the log transformation.

If we go back to the previously plotted graphs of the same variables, it is evident that the data had a conical shape. After the log transformation, the conic shape is gone. Here we solved the homoscedasticity problem with just one transformation. Pretty powerful, eh?

Now let’s check ‘SalePrice’ with ‘TotalBsmtSF’.

ggplot(clean_data, aes(x=totalbsmt_log, y=log_price)) +
  theme_bw() +
  geom_point(colour="#e34262", alpha=0.3) +
  theme(legend.position='none') +
  labs(title = "Homoscedasticity : Total Basement Area vs. Sale Price", x="Area [Log]", y="Price [Log]")
Please take care of 0 for me :)

That's it, we've reached the end of our analysis. Now all that's left is to get the dummy variables and… you know the rest. :)
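If you're wondering what that last step could look like, here's a minimal sketch (my addition, not from the original) of one common base-R way to build dummy variables for a factor such as OverallQual:

# model.matrix() expands a factor into one 0/1 indicator column per level;
# the "- 1" drops the intercept so every level gets its own column
dummies <- model.matrix(~ OverallQual - 1, data = clean_data)
head(dummies)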

This work was possible thanks to Pedro Marcelino. I found his analysis of this dataset in Python and wanted to rewrite it in R. Give him some love!

Source: https://medium.com/@unkletam/beginners-guide-exploratory-data-analysis-in-r-47dac64d95fe
