EDA for Categorical and Ordinal Data
Statistics for Data Science and Machine Learning
Categorical variables are those whose possible values come from a set of options, which can be predefined or open-ended. An example is a person's gender. Ordinal variables are categorical variables whose options can be ordered by some rule, as in the Likert scale:
- Like
- Like Somewhat
- Neutral
- Dislike Somewhat
- Dislike
To keep the following examples simple, we will use a single data set: a group of students who have each passed or failed 2 distinct exams. The results are shown in the following R x C table (cell counts recovered from the figures used in the calculations below):

              Exam 2 pass   Exam 2 fail   Total
Exam 1 pass        20            10         30
Exam 1 fail         5            50         55
Total              25            60         85
Statisticians have developed specific techniques to analyze this kind of data. The most important are:
Measures of Agreement
Percent Agreement
Calculated as the number of cases in which the ratings fall in a given class, divided by the total number of ratings.
- The percent agreement for passing exam 2 is 25/(25+60) = 0.294, so 29.4%
- The percent agreement for passing exam 1 is 30/85 = 0.353, so 35.3%
- The percent agreement for passing exam 1 and not passing exam 2 is 10/85 = 0.118, so 11.8%
The problem with percent agreement is that it does not account for agreement that could have been obtained purely by chance.
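The proportions above can be reproduced with a few lines of Python. This is a minimal sketch: the cell counts are an assumption inferred from the figures quoted in the text (30/85, 25/85, 10/85), and the names `table` and `proportion` are illustrative only.

```python
# Cell counts inferred from the figures quoted in the text (assumption):
# rows = exam 1 (pass, fail), columns = exam 2 (pass, fail).
table = [[20, 10],
         [5, 50]]

total = sum(sum(row) for row in table)      # all students
pass_exam1 = sum(table[0])                  # row total for "pass exam 1"
pass_exam2 = table[0][0] + table[1][0]      # column total for "pass exam 2"


def proportion(count, n):
    """Share of the n cases that fall in a given class."""
    return count / n


p_pass1 = proportion(pass_exam1, total)
p_pass2 = proportion(pass_exam2, total)
p_pass1_fail2 = proportion(table[0][1], total)

print(f"pass exam 1: {p_pass1:.1%}")
print(f"pass exam 2: {p_pass2:.1%}")
print(f"pass exam 1, fail exam 2: {p_pass1_fail2:.1%}")
```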
Cohen's Kappa
To overcome the problems of percent agreement, we calculate Kappa as:
K = (P0 - Pe) / (1 - Pe)
where P0 is the observed agreement and Pe the expected (chance) agreement, calculated as:
Pe = (row 1 total x column 1 total) / N² + (row 2 total x column 2 total) / N²
In our example:
P0 = 70/85 = 0.82
Pe = 30 x 25 / 85² + 55 x 60 / 85² = 0.56
K = 0.26 / 0.44 = 0.59 (0.60 if the intermediate values are not rounded)
Kappa ranges from -1 to 1, where 0 means that the observed agreement equals the chance agreement, 1 means that all cases are in agreement, and -1 means that all cases are in disagreement.
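The Kappa calculation can be sketched in plain Python as follows. The cell counts are an assumption inferred from the totals used in the text (30, 55, 25, 60 and 70 agreeing cases out of 85), and `cohens_kappa` is an illustrative helper, not a library function.

```python
def cohens_kappa(table):
    """Return (p0, pe, kappa) for a square table of counts."""
    n = sum(sum(row) for row in table)
    size = len(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(size)) for j in range(size)]

    # Observed agreement: share of cases on the main diagonal.
    p0 = sum(table[i][i] for i in range(size)) / n
    # Expected agreement: product of matching marginals, summed over classes.
    pe = sum(row_totals[i] * col_totals[i] for i in range(size)) / n ** 2
    return p0, pe, (p0 - pe) / (1 - pe)


# Rows = exam 1 (pass, fail), columns = exam 2 (pass, fail); inferred counts.
table = [[20, 10],
         [5, 50]]

p0, pe, kappa = cohens_kappa(table)
print(f"P0 = {p0:.2f}, Pe = {pe:.2f}, K = {kappa:.2f}")
```

Working from unrounded values gives a Kappa slightly above the hand calculation, which divides the already rounded 0.26 by 0.44.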
The Chi-Squared Distribution
To do hypothesis testing with categorical variables, we need distributions suited to this kind of data. The most common is the chi-squared distribution, a continuous theoretical probability distribution.
This distribution has only one parameter, k, the degrees of freedom. As k approaches infinity, the chi-squared distribution becomes similar to the normal distribution.
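To get a feel for the parameter k, the sketch below simulates the distribution. It relies on a standard fact that is not stated in the text above: a chi-squared variable with k degrees of freedom is the sum of k squared standard normal draws, with mean k and variance 2k. The helper name `chi_squared_sample` and the choices of seed and sample size are arbitrary.

```python
import random
import statistics


def chi_squared_sample(k, rng):
    """One draw from a chi-squared distribution with k degrees of freedom."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(0)   # fixed seed so the run is repeatable
k = 5
draws = [chi_squared_sample(k, rng) for _ in range(20000)]

sample_mean = statistics.mean(draws)
sample_var = statistics.variance(draws)

# The sample mean should be close to k and the sample variance close to 2k.
print(f"k = {k}: mean = {sample_mean:.2f}, variance = {sample_var:.2f}")
```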
Chi-Squared Test
This test is used to check whether two categorical variables are independent. We will use the same example to explain how to calculate it.
First, we define the hypotheses that we want to test. In our case, we want to check whether passing exam 1 and passing exam 2 are independent, so:
- H0 = Passing exam 1 and passing exam 2 are independent.
- Ha = Passing exam 1 and passing exam 2 are dependent.
This test relies on the difference between the expected and observed values. To calculate the expected values (what you would expect to find if the two variables were independent), we use:
E = (row total x column total) / N
To simplify the calculations, we first compute the marginals. These are the sums per row and per column, which we already calculated in the second table of this post. The expected values are calculated as:
E(pass exam 1, pass exam 2) = 30 x 25 / 85 = 8.82
E(pass exam 1, fail exam 2) = 30 x 60 / 85 = 21.18
E(fail exam 1, pass exam 2) = 55 x 25 / 85 = 16.18
E(fail exam 1, fail exam 2) = 55 x 60 / 85 = 38.82
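The expected counts under independence can be computed directly from the marginals. This is a minimal sketch: the observed cell counts are an assumption inferred from the figures in the text, and `expected_counts` is an illustrative name.

```python
def expected_counts(table):
    """Expected cell counts if the row and column variables were independent."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Each cell: (row total x column total) / grand total.
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows = exam 1 (pass, fail), columns = exam 2 (pass, fail); inferred counts.
observed = [[20, 10],
            [5, 50]]

expected = expected_counts(observed)
for row in expected:
    print([round(value, 2) for value in row])
```

Note that the expected counts keep the same row and column totals as the observed table; only the way the cases are spread across the cells changes.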
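Now we have all we need to apply the chi-squared formula:
χ² = Σ (O - E)² / E
where O is the observed value and E the expected value of each cell. The sum symbol means that we have to evaluate the formula for every combination of our variables (4 in our case) and add up the results:
(20 - 8.82)² / 8.82 = 14.16
(10 - 21.18)² / 21.18 = 5.90
(5 - 16.18)² / 16.18 = 7.72
(50 - 38.82)² / 38.82 = 3.22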
The final value is the sum of the 4 terms. For the observed counts 20, 10, 5 and 50 it comes to about 31.0 (the figure of 26.96 originally given here does not match these counts). We now have to compare this result with the statistical tables, and for that we need the degrees of freedom. They are calculated as (number of rows - 1) x (number of columns - 1), so in our case we have 1 degree of freedom.
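The whole calculation can be sketched in plain Python as follows. The observed counts are an assumption inferred from the figures in the text, and the function names are illustrative. The p-value shortcut uses the complementary error function and is valid only for 1 degree of freedom; for other cases a statistical package would be the usual route.

```python
import math


def chi_squared_statistic(observed):
    """Sum of (O - E)^2 / E over every cell of a contingency table."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            statistic += (obs - exp) ** 2 / exp
    return statistic


def degrees_of_freedom(observed):
    """(number of rows - 1) x (number of columns - 1)."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


# Rows = exam 1 (pass, fail), columns = exam 2 (pass, fail); inferred counts.
observed = [[20, 10],
            [5, 50]]

chi2 = chi_squared_statistic(observed)
df = degrees_of_freedom(observed)

# With exactly 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-squared = {chi2:.2f}, df = {df}, p-value = {p_value:.2e}")
```

A very small p-value means the observed counts would be very unlikely if the two exams were independent.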
According to the chi-squared tables, which are easy to find by searching Google for a chi-squared table (statistical packages for any language should provide them as a function), the critical value for