DATA2002 lecture 01 02 03

Preface
Week 1
- 1.1 Data visualisation
- 1.2 Data collection
- 1.3 Controlled experiments
- 1.4 Chi-squared tests
Week 2
- 2.1 Goodness of fit tests
- 2.2 Measures of performance
- 2.3 Measures of risk
Week 3
- 3.1 Testing for homogeneity
- 3.2 Testing for independence
- 3.3 Testing in small samples
Summary

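Preface

This series of posts covers the key points from the lectures: data visualisation, data collection, chi-squared tests, goodness of fit tests, measures of performance, measures of risk, testing for homogeneity, testing for independence and testing in small samples. These are fairly basic topics, so make sure you understand them well. Some points are not covered in enough detail yet and will be filled in later; for now the focus is on the content for the final exam.

Week 1

1.1 Data visualisation

We need to get to know the Palmer penguins data set and visualise it.
I won't go into much detail here; run the basic code yourself in RStudio alongside the lecture.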

# install.packages("palmerpenguins")
library(palmerpenguins)

Learn more about the Palmer penguins data set.

help(penguins, package = "palmerpenguins")
# or more simply
?penguins

Take a quick look at the basic structure of the data set.

library(dplyr)
dplyr::glimpse(penguins) # glimpse the structure of the penguins data frame

Visualise the data set with the ggplot2 package.

library(ggplot2)
ggplot(data = penguins) +
  aes(x = species, fill = sex) +
  geom_bar(position = "fill") + # stack bars to 100% so proportions are comparable
  labs(x = "", y = "Proportion of penguins", fill = "Sex") +
  scale_y_continuous(labels = scales::percent_format()) +
  facet_grid(cols = vars(island), scales = "free_x", space = "free_x") +
  theme_linedraw(base_size = 22)

For more on this part, see the other posts that cover plotting with ggplot.

1.2 Data collection

Sample and population

A sample is part of a population.
A statistic can be computed from a sample, and used to estimate a parameter.
A statistic summarises what the researcher knows. A parameter is what the researcher wants to know.
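The difference between a parameter and a statistic can be sketched with a small simulation. This is only an illustration: the population below is simulated, and the mean of 4200 and standard deviation of 800 are made-up values, not figures from the lecture.

```r
set.seed(2002)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated body masses in grams
population <- rnorm(10000, mean = 4200, sd = 800)

# Parameter: the population mean (normally unknown to the researcher)
parameter <- mean(population)

# Take a simple random sample of 100 units without replacement
sampled <- sample(population, size = 100)

# Statistic: the sample mean, used as an estimate of the parameter
statistic <- mean(sampled)

c(parameter = parameter, statistic = statistic)
```

The two numbers should be close but not identical; the gap is the sampling error.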

Why use a sample rather than collecting data on the whole population?

Hard to observe the whole population
Not enough time
Not enough money
Not enough resources

Why sample?

Reduce the number of measurements
Save time, money and resources
Might be essential in destructive testing, where measuring an item destroys it

Definition of sampling

Sampling is the process of selecting a subset of observations from an entire population of interest so that characteristics of the subset (the sample) can be used to draw conclusions or make inferences about the entire population.

Bias

Bias is any factor that favours certain outcomes or responses, or influences an individual’s responses. Bias may be unintentional (accidental), or intentional (to achieve certain results).

Selection bias / sampling bias: the sample does not accurately represent the population. Example: Attendees at a Star Trek convention may report that their favorite genre is science fiction.
Non-response bias: Certain groups are under-represented because they elect not to participate. Example: a restaurant may give each table a “customer satisfaction” survey with their bill, but only some customers choose to fill it in.
Measurement or design bias: Bias factors in the sampling method influence the data obtained. Example: a respondent may answer questions in the way she thinks the questioner wants her to answer.

1.3 Controlled experiments

Randomised controlled double-blind trials

Why choose a randomised controlled double-blind trial?

Investigators obtain a representative sample of subjects.
Investigators randomly allocate the subjects into a treatment group and a control group. (When you see "randomly", think "independent".)
The control group is given a placebo, but neither the subjects nor the investigators know the identity of the 2 groups (double-blind).
Investigators compare the responses of the 2 groups.
The design is good because we expect the 2 groups to be similar, hence any difference in the responses is likely to be caused by the treatment.
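The random allocation step can be sketched in a few lines of R. The subject IDs and the group size of 10 are hypothetical, chosen only for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical subject IDs for 20 recruited subjects
subjects <- paste0("S", 1:20)

# Shuffle the group labels so each subject has the same chance of either group
allocation <- data.frame(
  subject = subjects,
  group = sample(rep(c("treatment", "control"), each = 10))
)

table(allocation$group)
```

Shuffling a fixed set of labels, rather than tossing a coin per subject, keeps the two groups the same size.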

Observational studies

Why observational studies are necessary

By necessity, many research questions require an observational study, rather than a controlled experiment.
Similarly, most educational research is based on observational studies.
The conclusions of observational studies require great care.

Why use observational studies, and how do they differ from controlled experiments?

A good randomised controlled experiment can establish causation, whereas an observational study can only establish association.
An observational study may suggest causation, but it can’t prove causation.

Misleading hidden confounders

Confounding occurs when the treatment group and control group differ by some third variable (other than the treatment) which influences the response that is studied.
Confounders can be hard to find, and can mislead about a cause and effect relationship.

Simpson's paradox

Sometimes there is a clear trend in individual groups of data that disappears or reverses when the groups are pooled together.
It occurs when relationships between percentages in subgroups are reversed when the subgroups are combined, because of a confounding or lurking variable.
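A small numerical sketch makes the reversal concrete. The counts below are illustrative and not taken from the lecture: treatment A has the higher success rate within each subgroup, yet treatment B looks better once the subgroups are pooled, because A is mostly used on the harder "large" subgroup.

```r
# Illustrative counts (not from the lecture): successes and totals by subgroup
results <- data.frame(
  treatment = c("A", "A", "B", "B"),
  subgroup  = c("small", "large", "small", "large"),
  success   = c(81, 192, 234, 55),
  total     = c(87, 263, 270, 80)
)

# Success rate within each subgroup: A is higher in both
results$rate <- results$success / results$total

# Pooled success rate per treatment: B is higher overall
pooled <- aggregate(cbind(success, total) ~ treatment, data = results, FUN = sum)
pooled$rate <- pooled$success / pooled$total

results
pooled
```

Here the subgroup is the lurking variable: it affects both which treatment is given and how likely success is.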

Baidu Baike: Simpson's paradox
Zhihu: Simpson's paradox

1.4 Chi-squared tests

Every hypothesis test goes through these three steps:

Understand the experiment clearly (this is the first and most important step), then set up the research question
– Set hypotheses: H0 vs H1
Compute the evidence
– Choose a test statistic T
– State the assumptions
– Select a significance level (α): common values are 5% and 1%
Draw a conclusion
– Calculate the p-value
– Reject the null hypothesis or do not reject it

Link: Hypothesis Testing in 3 steps

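Chi-squared tests

Chi-squared tests are usually carried out in one of two ways: the first is a goodness of fit test and the second is a test of independence.
The next section covers them in detail.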

Hypothesis

There are two kinds: the null hypothesis and the alternative hypothesis.
Null hypothesis: The statement against which you search for evidence is called the null hypothesis, and is denoted by H0. It is generally a “no difference” statement.

Alternative hypothesis: The statement you claim is called the alternative hypothesis, and is denoted by H1 (or sometimes you’ll see HA).

Assumptions

Each observation is generally assumed to have been chosen at random from a population.
We say that such random variables are iid (independently and identically distributed).
Each test we consider will have its own set of assumptions.

Test statistic

Formula:
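For the chi-squared tests covered in this section, the test statistic usually takes the form below. This is a sketch in standard notation rather than the lecture's own slide: $Y_i$ (observed count in category $i$), $e_i$ (expected count under $H_0$), $p_i$ (hypothesised probability of category $i$) and $k$ (number of categories) are symbols assumed here, and the $k - 1$ degrees of freedom apply when no parameters are estimated from the data.

```latex
T = \sum_{i=1}^{k} \frac{(Y_i - e_i)^2}{e_i},
\qquad e_i = n\,p_i,
\qquad T \sim \chi^2_{k-1} \ \text{approximately, under } H_0
```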

The observed test statistic, t0, is where we plug our observed data into the formula for the test statistic.
Large (positive or negative depending on H1) observed test statistic values are taken as evidence of poor agreement with H0.

Decision

An observed large positive or negative value of t0, and hence a small p-value, is taken as evidence of poor agreement with H0.
– If the p-value is small, then either H0 is true and the poor agreement is due to an unlikely event, or H0 is false. The smaller the p-value, the stronger the evidence against the null hypothesis.
– A large p-value does not mean that there is evidence that the null hypothesis is true.
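The whole chain from observed counts to a decision can be sketched for a simple chi-squared test. The die-roll counts and the 5% significance level below are hypothetical choices for illustration, not data from the lecture.

```r
# Hypothetical data: 60 rolls of a die, H0: all six faces are equally likely
observed <- c(8, 12, 9, 11, 6, 14)
n <- sum(observed)
expected <- n * rep(1 / 6, 6)          # 10 per face under H0

# Observed test statistic t0 and its p-value (upper tail of chi-squared, df = 5)
t0 <- sum((observed - expected)^2 / expected)
p_value <- pchisq(t0, df = length(observed) - 1, lower.tail = FALSE)

alpha <- 0.05                          # significance level chosen in advance
decision <- if (p_value < alpha) "reject H0" else "do not reject H0"

c(t0 = t0, p_value = p_value)
decision
```

With these counts the p-value is well above 0.05, so H0 is not rejected; as noted above, that is not evidence that H0 is true.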

Week 2

2.1 Goodness of fit tests

Two kinds of distributions
Goodness of fit tests deal with two kinds of distributions: discrete distributions and continuous distributions.

Discrete distribution: I can see 1 car, 2 cars or 3 cars, but not 1.23 cars or 3.4 cars. The binomial and Poisson distributions are discrete.
Continuous distribution: my weight is 134.56 jin, yours is 100.34 jin and his is 180.2 jin (1 jin = 0.5 kg), so decimal values can occur. The normal distribution is continuous.

Poisson distribution

A Poisson random variable counts the number of events occurring in a fixed interval (e.g. the number of events in a fixed period of time) when these events occur independently at some known average rate λ per unit time.
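For a Poisson random variable X with rate λ, the probability of seeing exactly x events is P(X = x) = e^(−λ) λ^x / x!. The sketch below checks that formula against R's built-in dpois(); the rate of 2 is a hypothetical value chosen for illustration.

```r
lambda <- 2          # hypothetical average rate: 2 events per interval
x <- 0:5             # numbers of events to evaluate

# Probability of exactly x events, from the formula and from dpois()
by_formula <- exp(-lambda) * lambda^x / factorial(x)
by_dpois   <- dpois(x, lambda = lambda)

data.frame(x, by_formula, by_dpois)
```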

Chi-squared tests for discrete distributions
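A goodness of fit test for a Poisson model might be sketched as follows. The frequencies are made up for illustration, and the approach shown (estimate λ by the sample mean, pool the upper classes until every expected count is at least 5, then lose one extra degree of freedom for the estimated parameter) is the usual textbook recipe, assumed here rather than copied from the lecture.

```r
# Hypothetical data: number of events seen in each of 100 intervals
x_values <- 0:4
counts   <- c(35, 37, 18, 7, 3)        # made-up frequencies of 0, 1, 2, 3, 4 events
n <- sum(counts)

# Estimate lambda with the sample mean
lambda_hat <- sum(x_values * counts) / n

# Group into classes 0, 1, 2 and "3 or more" so every expected count is at least 5
observed <- c(counts[1:3], sum(counts[4:5]))
p <- c(dpois(0:2, lambda_hat), ppois(2, lambda_hat, lower.tail = FALSE))
expected <- n * p

# Test statistic; df = classes - 1 - number of estimated parameters (here 1)
t0 <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1 - 1
p_value <- pchisq(t0, df = df, lower.tail = FALSE)

round(expected, 2)
c(t0 = t0, df = df, p_value = p_value)
```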

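2.2 Measures of performance

Types of errors (key point):

Memorise where each one sits in the table; the layout will almost certainly not change.

True positive = correctly identified (the test is positive and the condition is actually present)
False positive = incorrectly identified (the test is positive, but the condition is actually absent)
True negative = correctly rejected (the test is negative and the condition is actually absent)
False negative = incorrectly rejected (the test is negative, but the condition is actually present)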
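These four counts are the building blocks of the usual performance measures. The sketch below uses hypothetical counts; sensitivity, specificity and accuracy are the standard definitions, which these notes do not spell out, so treat the names as assumptions.

```r
# Hypothetical counts from a diagnostic test on 200 people
tp <- 45   # true positives: test positive, condition present
fn <- 5    # false negatives: test negative, condition present
fp <- 15   # false positives: test positive, condition absent
tn <- 135  # true negatives: test negative, condition absent

sensitivity <- tp / (tp + fn)               # share of actual positives detected
specificity <- tn / (tn + fp)               # share of actual negatives cleared
accuracy    <- (tp + tn) / (tp + fn + fp + tn)

c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```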

2.3 Measures of risk

Prospective and retrospective studies

A prospective study is based on subjects who are initially identified as disease-free and classified by presence or absence of a risk factor. (The study answers the question through a design set up in advance, so it is strongly goal-directed and causal in nature.)
A random sample from each group is followed in time (prospectively) until eventually classified by disease outcome.

Estimating population proportions

Relative risk

The relative risk is the ratio of the probability of having the disease in the group with the risk factor to the probability of having the disease in the group without the risk factor.
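As a minimal sketch, the relative risk can be computed from a 2×2 table of counts. The counts and the row and column labels below are hypothetical.

```r
# Hypothetical counts: rows = risk factor, columns = disease outcome
tab <- matrix(c(30, 70,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(risk_factor = c("present", "absent"),
                              disease = c("yes", "no")))

# Estimated probability of disease in each group
risk <- tab[, "yes"] / rowSums(tab)

# Relative risk: risk with the factor divided by risk without it
relative_risk <- unname(risk["present"] / risk["absent"])
relative_risk
```

A relative risk of 1 would mean the risk factor makes no difference to the probability of disease.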

Odds ratio

A common alternative to the relative risk is the odds ratio, denoted OR.
Odds are a ratio of probabilities. The odds are used as an alternative way of measuring the likelihood of an event occurring.
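In symbols, the odds of an event with probability p are p / (1 − p), and the odds ratio divides the odds in one group by the odds in the other. A short sketch with hypothetical counts:

```r
# Hypothetical counts: rows = risk factor, columns = disease outcome
tab <- matrix(c(30, 70,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(risk_factor = c("present", "absent"),
                              disease = c("yes", "no")))

# Probability of disease in each group, then the odds p / (1 - p)
risk <- tab[, "yes"] / rowSums(tab)
odds <- risk / (1 - risk)

# Odds ratio: odds with the risk factor divided by odds without it
odds_ratio <- unname(odds["present"] / odds["absent"])
odds_ratio
```

The same number comes from the cross-product of the table, (30 × 90) / (70 × 10).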

Standard errors and confidence intervals for odds ratios
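The usual large-sample approach works on the log scale: the standard error of log(OR) is the square root of the sum of the reciprocals of the four cell counts, and the interval is then transformed back with exp(). The sketch below assumes that standard method and uses hypothetical counts; the names n11 to n22 are labels chosen here.

```r
# Hypothetical 2x2 cell counts
n11 <- 30; n12 <- 70   # risk factor present: diseased, not diseased
n21 <- 10; n22 <- 90   # risk factor absent:  diseased, not diseased

odds_ratio <- (n11 * n22) / (n12 * n21)

# Standard error of log(OR)
se_log_or <- sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

# 95% confidence interval: build it for log(OR), then exponentiate
z <- qnorm(0.975)
ci_95 <- exp(log(odds_ratio) + c(-1, 1) * z * se_log_or)

c(odds_ratio = odds_ratio, lower = ci_95[1], upper = ci_95[2])
```

If the interval excludes 1, the data are inconsistent with equal odds in the two groups at the 5% level.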

Week 3

3.1 Testing for homogeneity

Chi-squared test of homogeneity

With our observed counts and expected counts in each cell, we can construct a chi-squared test for homogeneity,

The expected cell counts are
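In standard notation, the test statistic and the expected cell counts for an $r \times c$ table are usually written as below. This is a sketch, not the lecture's own slide: $Y_{ij}$ (observed count in row $i$, column $j$), $y_{i\bullet}$ (row total), $y_{\bullet j}$ (column total) and $n$ (grand total) are symbols assumed here.

```latex
T = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(Y_{ij} - e_{ij})^2}{e_{ij}},
\qquad
e_{ij} = \frac{y_{i\bullet}\, y_{\bullet j}}{n},
\qquad
T \sim \chi^2_{(r-1)(c-1)} \ \text{approximately, under } H_0
```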

Testing for homogeneity in general tables
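For a general table, chisq.test() does the whole calculation. A minimal sketch with hypothetical counts for 2 populations and 3 response categories:

```r
# Hypothetical counts: 2 populations (rows) by 3 response categories (columns)
tab <- matrix(c(20, 30, 50,
                30, 30, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              response = c("low", "medium", "high")))

test <- chisq.test(tab)

test$expected    # expected counts under homogeneity
test$statistic   # observed test statistic
test$parameter   # degrees of freedom: (2 - 1) * (3 - 1) = 2
test$p.value
```

Checking test$expected first is a quick way to confirm that every expected count is at least 5.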

3.2 Testing for independence

Testing for independence in 2×2 tables

Independence

Test statistic
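For a 2×2 table, the statistic compares each observed count with the count expected under independence (row total × column total / n). The sketch below computes it by hand on hypothetical counts and compares the result with chisq.test() with the continuity correction switched off.

```r
# Hypothetical 2x2 table of counts for two categorical variables
tab <- matrix(c(40, 10,
                20, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(var1 = c("yes", "no"), var2 = c("yes", "no")))
n <- sum(tab)

# Under independence, expected count = row total * column total / n
expected <- outer(rowSums(tab), colSums(tab)) / n

t0 <- sum((tab - expected)^2 / expected)
p_value <- pchisq(t0, df = 1, lower.tail = FALSE)   # df = (2 - 1) * (2 - 1)

# Same test via chisq.test(), without the continuity correction
builtin <- chisq.test(tab, correct = FALSE)
c(t0 = t0, builtin = unname(builtin$statistic), p_value = p_value)
```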

3.3 Testing in small samples

Fisher’s exact test

The χ² approximation for the test statistic is only reasonable when n is sufficiently large, i.e. we need the expected cell frequencies to all be 5 or more. If this is not the case, then we need to take care and maybe consider exact tests, i.e. calculating the exact p-value for the test statistic.
In R the function fisher.test() is available to carry out these calculations both for 2×2 tables and general contingency tables.
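A minimal sketch of fisher.test() on a small table follows. The counts are hypothetical, chosen so that every expected count is below 5; the default two-sided alternative is used.

```r
# Hypothetical small 2x2 table
tab <- matrix(c(3, 1,
                1, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(truth = c("A", "B"), guess = c("A", "B")))

# Expected counts under independence: all equal to 2, so below 5
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected

# Exact test (two-sided by default)
exact <- fisher.test(tab)
exact$p.value
```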

Yates’ chi-squared test
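Yates' correction subtracts 0.5 from each absolute difference between observed and expected counts before squaring, which makes the statistic smaller and the test more conservative. In R, chisq.test() applies it to 2×2 tables by default (correct = TRUE). The sketch below, on hypothetical counts, checks the by-hand value against that default.

```r
# Hypothetical 2x2 table of counts
tab <- matrix(c(12, 8,
                 6, 14),
              nrow = 2, byrow = TRUE)

expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Yates' correction: shrink each |observed - expected| by 0.5 before squaring
t_yates <- sum((abs(tab - expected) - 0.5)^2 / expected)
t_plain <- sum((tab - expected)^2 / expected)

# chisq.test() applies the correction to 2x2 tables by default
c(by_hand = t_yates,
  chisq_test = unname(chisq.test(tab)$statistic),
  uncorrected = t_plain)
```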

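Summary

Many points here are not summarised thoroughly enough. The final exam focuses on the later material, so most of my effort will go there, and these notes will be filled in gradually later on. If you think something needs improving, feel free to contact me. Thanks for your understanding.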
