Course notes for Duke University's Data Analysis and Statistical Inference

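anecdotal evidence: judging a whole population from an extreme individual case. For example, using “my uncle smokes three cigarettes a day and is in great health” to argue that “smoking does no harm to the body”.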

type of data: before processing data any further, think about what type it is: qualitative (ordered or unordered) or quantitative (continuous or discrete).

Correlation does not imply causation

An observational study lets us establish correlation (more advanced methods can sometimes also support causation).

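obs: pick one group of people who work out and one group who do not, then compare their energyLevel; this gives a correlation. But the difference in energyLevel is not necessarily caused by workOut; other uncontrolled factors (called confounding variables) may be responsible.
exp: randomly assign people from the population to two groups, have one group work out and the other not, then measure energyLevel. In this respect it resembles the “controlled variable” method.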
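
The difference between the two designs can be sketched with a small simulation. This is only an illustration under made-up assumptions: the confounder `sleeps_well`, the effect sizes, and the probabilities are all hypothetical, and working out is deliberately given no true effect on energy.

```python
import random
from statistics import mean

def simulate(n=20000, seed=0):
    """Compare an observational study with a randomized experiment.

    All numbers are hypothetical. Energy depends only on the confounder
    (sleep), never on working out.
    """
    rng = random.Random(seed)
    obs = {True: [], False: []}   # groups formed by people's own choice
    exp = {True: [], False: []}   # groups formed by random assignment
    for _ in range(n):
        sleeps_well = rng.random() < 0.5              # hypothetical confounder
        energy = (7.0 if sleeps_well else 4.0) + rng.gauss(0, 1)
        # Observational study: good sleepers choose to work out more often.
        works_out = rng.random() < (0.8 if sleeps_well else 0.2)
        obs[works_out].append(energy)
        # Experiment: a coin flip decides the group, independent of sleep.
        assigned = rng.random() < 0.5
        exp[assigned].append(energy)
    obs_gap = mean(obs[True]) - mean(obs[False])
    exp_gap = mean(exp[True]) - mean(exp[False])
    return obs_gap, exp_gap

obs_gap, exp_gap = simulate()
print(f"observational gap: {obs_gap:.2f}")
print(f"experimental gap:  {exp_gap:.2f}")
```

In the observational comparison the workout group looks more energetic purely because of the confounder, while random assignment balances the confounder across groups and the gap shrinks toward zero.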

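sampling bias
- convenience sample: only cases that are easy to reach are sampled
- non-response: only part of the randomly sampled people respond, so the respondents may no longer be representative
- voluntary response: the sample consists of people who choose to respond, so the result depends on who volunteers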

sampling methods
- simple random sample (SRS): each case is equally likely to be selected.
- stratified sample: divide the population into homogeneous strata, then randomly sample from within each stratum
- cluster sample: divide the population into clusters, randomly sample a few clusters, then sample all observations within these clusters
- multistage sample: like cluster sampling, but randomly sample observations within the chosen clusters (for example, when surveying a city, split it into districts; this avoids having to visit every district)
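
The four sampling methods above can be sketched with Python's standard `random` module. The function names, the `districts` data, and the sample sizes are hypothetical and chosen only for illustration; here the districts double as strata in one example and as clusters in the others.

```python
import random

def simple_random_sample(population, k, rng):
    # Every case is equally likely to be selected.
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    # Sample randomly from within every stratum.
    return {name: rng.sample(cases, k_per_stratum)
            for name, cases in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Pick a few clusters at random, then keep everyone in them.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def multistage_sample(clusters, n_clusters, k_per_cluster, rng):
    # Pick a few clusters at random, then sample within each chosen cluster.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], k_per_cluster)
            for name in chosen}

rng = random.Random(42)
# Hypothetical city: 6 districts with 50 residents each.
districts = {f"district_{i}": [f"d{i}_resident_{j}" for j in range(50)]
             for i in range(6)}
everyone = [p for residents in districts.values() for p in residents]

srs = simple_random_sample(everyone, 30, rng)
stratified = stratified_sample(districts, 5, rng)
clustered = cluster_sample(districts, 2, rng)
multistage = multistage_sample(districts, 2, 5, rng)
```

Note that the stratified sample touches every district, whereas the cluster and multistage samples only visit the two districts that were drawn.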

principles of experimental design
1. control: compare treatment of interest to a control group
2. randomize: randomly assign subjects to treatments
3. replicate: collect a sufficiently large sample, or replicate the entire study
4. block: block for variables known or suspected to affect the outcome

more on blocking
design an experiment investigating whether energy gels help you run faster
treatment: energy gel
control: no energy gel
block: energy gel might affect pro and amateur athletes differently
block for pro status:
1. divide the sample into pro and amateur
2. randomly assign pro and amateur athletes to treatment and control groups
3. pro and amateur athletes are equally represented in both groups
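
The blocking steps above can be sketched as follows. The helper `blocked_assignment`, the runner names, and the group sizes are hypothetical; the sketch simply shuffles within each block and splits each block in half.

```python
import random

def blocked_assignment(subjects, block_of, rng):
    """Randomly assign subjects to treatment/control separately within each block."""
    # Step 1: divide the sample by block (here: pro vs amateur).
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of[s], []).append(s)
    # Step 2: randomize within each block.
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for s in shuffled[:half]:
            assignment[s] = "treatment"   # gets the energy gel
        for s in shuffled[half:]:
            assignment[s] = "control"     # no energy gel
    return assignment

rng = random.Random(1)
# Hypothetical sample: 8 pro and 12 amateur runners.
block_of = {f"pro_{i}": "pro" for i in range(8)}
block_of.update({f"amateur_{i}": "amateur" for i in range(12)})
assignment = blocked_assignment(list(block_of), block_of, rng)

# Step 3: each block is split evenly between the two groups.
for status in ("pro", "amateur"):
    treated = sum(1 for s, g in assignment.items()
                  if block_of[s] == status and g == "treatment")
    print(status, "in treatment:", treated)
```

Because the split happens inside each block, pro status cannot end up unevenly distributed between treatment and control by chance.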

experimental terminology
1. placebo: fake treatment, often used as the control group for medical studies
2. placebo effect: showing change despite being on the placebo (subjects believe they are receiving the treatment; a psychological effect)
3. blinding: experimental units don’t know which group they’re in
4. double-blind: both the experimental units and the researchers don’t know the group assignment

random sampling and random assignment
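1. random sampling: in an observational study, subjects are randomly sampled from the population.
2. random assignment: in an experiment, subjects are randomly assigned to the treatment and control groups.
3. when a study uses both, random sampling happens first, then random assignment.
4. random assignment allows causal conclusions and random sampling allows generalization to the population, so only a study using both can be causal and generalizable.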

modality of a distribution
1. unimodal
2. bimodal
3. uniform
4. multimodal

robust statistics
center: median, not mean
spread: IQR, not SD or range
robust statistics are well suited to describing skewed data with extreme observations.
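
A quick way to see why the median and IQR count as robust is to add one extreme observation and watch which statistics move. The data below are made up for illustration, and `iqr` is a small helper built on `statistics.quantiles` (Python 3.8+, default quartile method).

```python
from statistics import mean, median, stdev, quantiles

def iqr(values):
    # Interquartile range: third quartile minus first quartile.
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

data = [2, 3, 3, 4, 4, 5, 5, 6, 7]   # hypothetical observations
with_outlier = data + [100]          # same data plus one extreme value

for name, stat in [("mean", mean), ("median", median),
                   ("SD", stdev), ("IQR", iqr)]:
    print(f"{name:>6}: {stat(data):8.2f} -> {stat(with_outlier):8.2f}")
```

The mean and SD shift substantially once the extreme value is added, while the median and IQR change only slightly.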

common transformations
1. (natural) log transformation: often applied when much of the data cluster near zero (relative to the larger values in the data set) and all observations are positive. For example, taking the log of right-skewed data gives data that are less skewed and have fewer extreme values.
2. square root
3. inverse
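
As a rough illustration of how these transformations reduce skew, the sketch below draws right-skewed positive data and compares the sample skewness before and after the log and square root transformations. The lognormal data, the sample size, and the `skewness` helper (the moment-based coefficient) are assumptions made for this example only.

```python
import math
import random
from statistics import mean, pstdev

def skewness(values):
    # Moment-based sample skewness: the mean of cubed z-scores.
    m = mean(values)
    s = pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])

rng = random.Random(0)
# Hypothetical right-skewed, strictly positive data.
raw = [rng.lognormvariate(0, 1) for _ in range(5000)]

skew_raw = skewness(raw)
skew_sqrt = skewness([math.sqrt(v) for v in raw])
skew_log = skewness([math.log(v) for v in raw])

print(f"raw:  {skew_raw:.2f}")
print(f"sqrt: {skew_sqrt:.2f}")
print(f"log:  {skew_log:.2f}")
```

For data like these the square root softens the right tail and the log removes most of the skew, which is why the log is the usual first choice when all observations are positive.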

goals of transformations
1. see the data structure differently
2. reduce skew to assist in modeling
3. straighten a nonlinear relationship in a scatterplot



