What Is the Z-Test in Inferential Statistics and How Does It Work?

I Feel:

The more you analyze data, the more enlightened a data engineer you will become.

In data engineering, you will regularly need to establish whether a data sample drawn from a population is reliable enough to build a model around. For example, the data may come from an old archive that no longer represents how the modeled process behaves in production: behavior changes over time, and so does the process the model was built on.

If we build a new model around such stale sample data, we may end up with a faulty process and a model that is neither effective nor useful. To guard against this, we run an inferential statistical test to check that the data is reliable.

One such test is the normal deviate Z test. Before building a model around the sample, we use it to infer whether the sample comes from the population that truly represents the process behavior in production.

In part 1 of this series on inferential statistics, we covered the Chi-Square test, and I invite you to read that article as well. As promised, today we cover another technique used in hypothesis testing to establish the reliability of sample data: the normal deviate Z test, which the rest of this article explains in detail.

What Is the Normal Deviate Z Test and How Does It Work?

When we use the normal deviate Z test to establish the reliability of a large sample (a sample size of at least 30 is the usual rule of thumb), we compare the means of two distributions: for example, the sample data in our data science project and the production (population) data.

The Z test compares the sample mean with the population mean to determine whether the difference between them is significant. The Z test statistic is assumed to follow a normal distribution, and nuisance parameters such as the population standard deviation must be known for the test to be accurate.

How Does the Normal Deviate Z Test Work?

The Z test proceeds in the following steps.

Step 1: Establishing the Hypotheses:

Stating the hypotheses is the first thing a data engineer needs to do before performing any test in inferential statistics.

H0: The difference between the sample mean and the population mean is a statistical fluctuation.

H1: The difference between the sample mean and the population mean is significant; it is too large to be the result of statistical fluctuation.

Step 2: Calculating the Z Test Statistic

Prerequisites: To perform a Z test on normally distributed data, the following must hold:

  • Number of samples >= 30
  • The mean and standard deviation of the population should be known

Z-Test Statistic Formula:

The Z measure is calculated as:

Z = (M - μ) / SE

where M is the sample mean to be standardized, μ (mu) is the population mean, and SE is the standard error of the mean.

SE is calculated as:

SE = s / SQRT(n)

where s is the population standard deviation and n is the sample size.
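As a minimal sketch of the two formulas above, the following computes SE and Z with only the Python standard library. The function names and the numbers are hypothetical, chosen so the arithmetic is easy to check by hand; they are not taken from the article's data set.

```python
import math


def standard_error(pop_std: float, n: int) -> float:
    """Standard error of the mean: SE = s / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return pop_std / math.sqrt(n)


def z_statistic(sample_mean: float, pop_mean: float, pop_std: float, n: int) -> float:
    """Normal deviate: Z = (M - mu) / SE."""
    return (sample_mean - pop_mean) / standard_error(pop_std, n)


# Hypothetical inputs: s = 10, n = 100, M = 52, mu = 50.
se = standard_error(pop_std=10.0, n=100)                                # 10 / 10 = 1.0
z = z_statistic(sample_mean=52.0, pop_mean=50.0, pop_std=10.0, n=100)   # (52 - 50) / 1.0 = 2.0
print("SE:", se, "Z:", z)
```

Note that a larger n shrinks SE, so the same gap between M and μ produces a larger Z.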

The standard error is the standard deviation of the sampling distribution of the mean, that is, the distribution of sample means described by the Central Limit Theorem.

The formula above looks very similar to the Z-score calculation, because both standardize a value against a normal distribution. The difference is that a Z score standardizes a single observation using the standard deviation, whereas the normal deviate Z statistic standardizes a sample mean using the standard error.
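To make the distinction between a Z score and the Z test statistic concrete, here is a small sketch with hypothetical numbers (not from the article's data): the same value of 52 is unremarkable as a single observation but notable as the mean of 100 observations.

```python
import math

mu, s = 50.0, 10.0   # hypothetical population mean and standard deviation

# Z score: distance of ONE observation from the mean, in standard deviations.
x = 52.0
z_score = (x - mu) / s

# Z statistic: distance of a SAMPLE MEAN from the mean, in standard errors.
sample_mean, n = 52.0, 100
z_stat = (sample_mean - mu) / (s / math.sqrt(n))

print("z-score of one observation:", z_score)    # 0.2 standard deviations
print("Z statistic of the sample mean:", z_stat)  # 2.0 standard errors
```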

Step 3: Analyze the Z Value to Interpret the P-Value

Once we have the Z value, we calculate the p-value, on the basis of which we either reject or fail to reject the null hypothesis.
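The step from Z to a p-value can be sketched with the standard library alone. For a standard normal variable, the two-sided tail probability equals erfc(|z| / sqrt(2)); `scipy.stats.norm.sf(abs(z)) * 2` should give the same number. The helper names and the 0.05 threshold default are illustrative assumptions.

```python
import math


def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z, i.e. the two-tailed p-value."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare the p-value with the significance level alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


for z in (0.5, 1.96, 3.0):
    p = two_sided_p_value(z)
    print(f"z = {z}: p = {p:.4f} -> {decide(p)}")
```

A Z value near 1.96 sits right at the 0.05 boundary, which is why that value appears so often in two-tailed tests.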

Example Using Python & Jupyter Notebook:

Let's work through the steps above using a practical example.

Install the Anaconda Distribution:

Download the latest Anaconda distribution of Python for your operating system from the Anaconda website. It comes with Jupyter Notebook pre-installed, along with the required Python packages such as pandas and SciPy.

Once the installation is done, launch Jupyter Notebook and enter the following code (you can copy it from below) to get started.

Import the Required Packages:

Let's import the relevant Python packages as shown below and create a data frame by reading "pima-indians-diabetes.csv", sourced from Kaggle.

    import pandas as pd
    import numpy as np
    import scipy
    import matplotlib.pyplot as plt
    import scipy.stats as st
    import seaborn as sns

    # Read the CSV file into a pandas DataFrame
    df = pd.read_csv("pima-indians-diabetes.csv")

Let's view the first 20 rows of the sample data set by calling df.head(20).

    df.head(20)

Step 1: Let's Formulate Our Null and Alternate Hypotheses:

Null Hypothesis:

H0: The difference between the mean of the sample BP column (the Pres column in the data frame) and the population mean for BP is a statistical fluctuation.

Alternate Hypothesis:

H1: The difference between the mean of the sample BP column and the population mean is significant, and is not a case of mere statistical fluctuation.

Step 2: Let's Calculate the Z Statistic (Z Test):

Recall the Z statistic formula discussed earlier:

Z = (M - μ) / SE

where M is the sample mean to be standardized, μ (mu) is the population mean, and SE is the standard error of the mean.

Let's do this calculation in Jupyter Notebook:

Here is the code snippet to calculate the Z test:

    # Prerequisites: number of samples >= 30, and the population mean and
    # standard deviation are known.
    # Population diastolic blood pressure: mean 71.3, standard deviation 7.2
    # Source: http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_BiostatisticsBasics/BS704_BiostatisticsBasics3.html
    mu = 71.3    # population mean (μ)
    std = 7.2    # population standard deviation

    # M: mean of the blood pressure (Pres) column in the data frame
    MeanOfBpSample = np.average(df['Pres'])
    print("Mean Of BP Column", MeanOfBpSample)

    # Standard error: n is the number of rows, so use len(df).
    # df.size would count every cell (rows x columns) and understate SE.
    SE = std / np.sqrt(len(df))
    print("Standard Error: ", SE)

    # Z = (sample mean - population mean) / standard error
    Z_norm_deviate = (MeanOfBpSample - mu) / SE
    print("Normal Deviate Z Value: ", Z_norm_deviate)

With the 768 rows of this data set, the code prints output along the following lines (SE and Z rounded):

    Mean Of BP Column 69.10546875
    Standard Error:  0.2598
    Normal Deviate Z Value:  -8.4468

Now that we know the Z value, let's find the p-value.

Calculating the P-Value, Code Snippet:

    # Use the survival function (sf) of the standard normal distribution
    # from scipy.stats. Multiplying by 2 gives the two-sided p-value
    # for a two-tailed test.
    p_value = scipy.stats.norm.sf(abs(Z_norm_deviate)) * 2
    print('p value', p_value)

    if p_value > 0.05:
        print('Samples are likely drawn from the same distributions (fail to reject H0)')
    else:
        print('Samples are likely drawn from different distributions (reject H0)')

Running this snippet in Jupyter prints a p-value of roughly 3e-17, followed by the "reject H0" message.

Step 3: Analyze the Z Value to Interpret the P-Value

The p-value comes out to roughly 3e-17. Because this is far below the conventional significance level of 0.05, we conclude that the given sample did not come from the population distribution on which the process was built. The difference between the mean of the sample BP column and the population mean is significant, so we reject the null hypothesis H0 in favor of the alternate hypothesis H1.
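As a cross-check on the arithmetic, the sketch below recomputes the statistic from the summary numbers alone, without pandas. It assumes the CSV has 768 rows and 9 columns (the usual shape of the Pima Indians diabetes file; the article does not state it), and it contrasts the row count with the total cell count that pandas reports as `df.size`. Only the row count is the sample size n in SE = s / SQRT(n).

```python
import math

sample_mean = 69.10546875   # mean of the BP column, as printed by the notebook
mu, sigma = 71.3, 7.2       # population mean and standard deviation used in the article

n_rows = 768                # ASSUMPTION: number of rows, i.e. what len(df) would return
n_cells = n_rows * 9        # ASSUMPTION: 9 columns, i.e. what df.size would return

# Z with the correct sample size (number of observations).
z_rows = (sample_mean - mu) / (sigma / math.sqrt(n_rows))

# Z if the total number of cells is mistakenly used as n.
z_cells = (sample_mean - mu) / (sigma / math.sqrt(n_cells))

# Two-sided p-value for the row-based Z, via the complementary error function.
p_rows = math.erfc(abs(z_rows) / math.sqrt(2.0))

print("Z using row count: ", round(z_rows, 4))    # about -8.45
print("Z using cell count:", round(z_cells, 4))   # about -25.34
print("p-value using row count:", p_rows)
```

Either way the p-value is far below 0.05, so the decision to reject H0 does not change; what changes is the size of the reported Z value.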

Because the normal deviate Z test rejects the null hypothesis here, it is advisable not to build an ML model on this sample data.

Aspiring and working data engineers need a clear understanding of the p-value, since it is the basis of most statistical tests of data reliability. Let me quickly cover the basics here; a dedicated article on the p-value will look at it in more depth.

What Is the P-Value?

The p-value, or probability value, is the probability of obtaining a result at least as extreme as the one observed in the sample, given that the null hypothesis is true.

How to Calculate the P-Value in General?

  1. Frame your hypotheses
  2. Assume the null hypothesis to be true
  3. Calculate the z or t statistic for the observed sample
  4. From the z/t table, find the probability associated with the z or t value obtained above. You can also find the p-value with SciPy's built-in methods by passing in the z or t statistic calculated in step 3.
  5. This probability is the p-value
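The steps above can be sketched end to end for the Z case as a single function. This is an illustrative outline using only the standard library, not the article's own code; the function name `z_test`, the sample values, and the population parameters are all hypothetical.

```python
import math
from statistics import fmean


def z_test(sample, pop_mean, pop_std, alpha=0.05):
    """Two-sided one-sample Z test.

    Returns (z, p_value, reject_h0). H0: the sample mean differs from
    pop_mean only by statistical fluctuation.
    """
    n = len(sample)
    if n < 30:
        raise ValueError("the Z test expects at least 30 observations")
    se = pop_std / math.sqrt(n)                  # standard error of the mean
    z = (fmean(sample) - pop_mean) / se          # step 3: test statistic
    p = math.erfc(abs(z) / math.sqrt(2.0))       # step 4: two-sided p-value
    return z, p, p < alpha                       # step 5: compare with alpha


# Hypothetical sample of 40 readings with mean 51.
readings = [50.0] * 20 + [52.0] * 20
z, p, reject = z_test(readings, pop_mean=50.0, pop_std=4.0)
print(f"z = {z:.4f}, p = {p:.4f}, reject H0: {reject}")
```

Here the sample mean is only about 1.58 standard errors above the population mean, so the test fails to reject H0 at the 0.05 level.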

We will cover p-value calculation, interpretation, and use cases separately later on. You will also encounter it again as we cover the other hypothesis test types in our journey through inferential statistics.

What's Next?

Our next article, "Inferential Statistics: Hypothesis Testing using T-Test", covers the T-test in detail.

Before closing, here are some basics of the T-test.

What Is a T-Test?

A t-test is an inferential statistical test used to find out whether there is a significant difference between the means of two given groups, which may be related in certain features.

A t-test uses the t-statistic, the t-distribution, and the degrees of freedom to determine whether the difference between two sets of data is statistically significant.

Types of T-Test:

There are three types of t-test:

One-sample t-test:

Used to compare a sample mean with a known population mean or some other meaningful, fixed value

Independent samples t-test:

Used to compare two means from independent groups

Paired samples t-test:

  1. Used to compare two means that are repeated measures for the same participants; scores might be repeated across different measures or across time.
  2. Used also to compare paired samples, as in a two-treatment randomized block design.

Later articles will show how to perform the T-tests above, using examples and hands-on lab exercises.
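As a small preview of the one-sample case, here is a sketch of the t statistic computed with the standard library. It mirrors the Z formula but uses the standard deviation estimated from the sample (with n - 1 in the denominator) instead of a known population value. The function name and the data are hypothetical; for a p-value, `scipy.stats.ttest_1samp` can be used.

```python
import math
from statistics import fmean, stdev


def one_sample_t(sample, pop_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    s = stdev(sample)                             # sample standard deviation (n - 1)
    t = (fmean(sample) - pop_mean) / (s / math.sqrt(n))
    return t, n - 1


# Hypothetical measurements compared against a fixed value of 4.0.
t, dof = one_sample_t([4.0, 5.0, 6.0, 5.0, 5.0], pop_mean=4.0)
print(f"t = {t:.4f} with {dof} degrees of freedom")
```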

Refer to the graphic below, which shows a decision tree to help you choose the right kind of hypothesis test for a given problem statement.

How to Decide What Test to Use & When?

[Figure: decision tree for choosing a hypothesis test]

Summary:

Never rely on plain observation or assumption when you build a model on a given sample. Make sure you measure its distribution type and test the sample using statistical hypothesis testing to ensure that it is reliable. Descriptive and inferential statistical techniques are designed to help you make better decisions about data sampling before you model the data in machine learning.

Since data cleansing and EDA will fill a large part of your work life as a data scientist, it is imperative that you take responsibility for handling data with the utmost clarity and care, and test it for reliability. Your model is going to make some really critical business decisions, so your work will influence market dynamics in a big way.

I Feel:

Getting data interpretation wrong while building ML models can be costly. So don't build models just for the sake of building them; make sure each model is fed the right kind of data. The right data feeding habits will do wonders when your machine makes intelligent, precise, ML-based predictions and recommendations for your business. Everybody in the ecosystem benefits when the model building process is done right.

Translated from: https://medium.com/swlh/what-is-z-test-in-inferential-statistics-how-it-works-3dde6eae64e5
