Probability Theory: Sections 1.1 and 1.2

  • 1.1 Populations, Samples, and Processes
    • Branches of Statistics
    • The Scope of Modern Statistics
    • Enumerative Versus Analytic Studies(枚举与分析研究)
    • Collecting Data
  • 1.2 Pictorial and Tabular Methods in Descriptive Statistics(描述统计学中的图形和表格方法)
    • Notation
    • Stem-and-Leaf displays(茎叶图)
    • Dotplots(点图)
    • Histograms(直方图)
    • Histogram Shapes(直方图的形状)
    • Qualitative Data(定性数据)
    • Multivariate data(多元数据)

1.1 Populations, Samples, and Processes

Engineers and scientists are constantly exposed to collections of facts, or data(数据), both in their professional capacities and in everyday activities.

An investigation will typically focus on a well-defined collection of objects constituting a population(总体) of interest

When desired information is available for all objects in the population, we have what is called a census(人口普查). In practice, a census is often impractical or infeasible(不切实际的或不可行的).

Instead, a subset of the population—a sample(样本)—is selected in some prescribed manner

A variable(变量) is any characteristic whose value may change from one object to another in the population. We shall initially denote variables by lowercase letters from the end of our alphabet(用字母表末尾的小写字母), e.g., x = brand of calculator owned by a student.

Data(数据) results from making observations either on a single variable or simultaneously on two or more variables(数据是对单个变量或同时对两个或多个变量进行观察的结果)

A univariate data set(单变量数据集) consists of observations on a single variable.

We have bivariate data(双变量数据) when observations are made on each of two variables.

numerical variable 数值变量 categorical variable 类别变量

Multivariate data(多变量数据) arises when observations are made on more than one variable (so bivariate is a special case of multivariate).

Branches of Statistics

An investigator who has collected data may wish simply to summarize and describe important features of the data. This entails using methods from descriptive statistics(描述性统计学)

Some of these methods are graphical(图形化的) in nature; the construction of histograms(直方图), boxplots(箱型图), and scatter plots(散点图) are primary examples. Other descriptive methods involve calculation of numerical summary measures(计算数值的汇总度量), such as means(平均值), standard deviations(标准差), and correlation coefficients(相关系数).

Having obtained a sample from a population, an investigator would frequently like to use sample information to draw some type of conclusion (make an inference of some sort) about the population. That is, the sample is a means to an end rather than an end in itself(样本是达到目的的一种手段,而不是目的本身). Techniques for generalizing from a sample to a population(从样本推广到总体) are gathered within the branch of our discipline called inferential statistics(推理统计学).

Mastery of probability(掌握概率) leads to a better understanding of how inferential procedures are developed and used(如何开发和使用推理程序), how statistical conclusions can be translated into everyday language and interpreted(如何将统计结论翻译成日常语言并加以解释), and when and where pitfalls can occur in applying the methods(在应用这些方法时何时何地会出现陷阱). Probability and statistics both deal with questions involving populations and samples, but do so in an “inverse manner(处理方式相反)” to one another. Probability is the foundation of statistical inference: it studies the properties of data under a given data-generating process, whereas statistical inference reasons backward from the observed data to the process that generated them.

Before we can understand what a particular sample can tell us about the population, we should first understand the uncertainty associated with taking a sample from a given population. This is why we study probability before statistics.

There are a number of problem situations in which we fit questions into the framework of inferential statistics by conceptualizing a population(概念化一个总体).

The Scope of Modern Statistics

These days statistical methodology is employed by investigators in virtually all disciplines, including such areas as
● molecular biology(分子生物学) (analysis of microarray data,微阵列数据分析)
● ecology(生态学) (describing quantitatively how individuals in various animal and plant populations are spatially distributed,定量地描述各种动植物种群中的个体在空间上如何分布)
● materials engineering(材料工程) (studying properties of various treatments to retard corrosion,研究各种缓蚀剂的性能)
● marketing (市场营销)(developing market surveys and strategies for marketing new products,制定新产品的市场调查和营销策略)
● public health (公共卫生)(identifying sources of diseases and ways to treat them,查明疾病来源和治疗方法)
● civil engineering (土木工程)(assessing the effects of stress on structural elements and the impacts of traffic flows on communities,评估应力对结构构件的影响和交通流量对社区的影响)

Enumerative Versus Analytic Studies(枚举与分析研究)

enumerative studies(枚举研究): Interest is focused on a finite, identifiable, unchanging collection of individuals or objects that make up a population(组成一个总体的有限的、可识别的、不变的个体或对象集合). A sampling frame(抽样框架), that is, a listing of the individuals or objects to be sampled(被抽样的个体或对象的清单), is either available to an investigator or else can be constructed.

analytic studies(分析研究):An analytic study is broadly(广义地) defined as one that is not enumerative in nature(性质上不是列举). Such studies are often carried out with the objective(目标) of improving a future product by taking action on a process of some sort (e.g., recalibrating equipment or adjusting the level of some input such as the amount of a catalyst 重新校准设备或调整一些输入的水平,如催化剂的数量). Data can often be obtained only on an existing process, one that may differ in important respects from the future process. There is thus no sampling frame listing the individuals or objects of interest.

Collecting Data

When data collection entails selecting individuals or objects from a frame, the simplest method for ensuring a representative selection is to take a simple random sample. This is one for which any particular subset of the specified size(特定大小的任何特定子集) (e.g., a sample of size 100) has the same chance of being selected.

stratified sampling(分层抽样):entails separating the population units into nonoverlapping groups and taking a sample from each one.
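
The two sampling schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame of 1000 numbered units, the strata named "A" and "B", and the helper names `simple_random_sample` and `stratified_sample` are all hypothetical and do not come from the notes.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(frame), n)

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of the requested size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(list(units), n_per_stratum[name])
            for name, units in strata.items()}

# Hypothetical sampling frame: units numbered 1..1000.
frame = range(1, 1001)
srs = simple_random_sample(frame, 100, seed=1)

# Hypothetical nonoverlapping groups that together cover the frame.
strata = {"A": range(1, 601), "B": range(601, 1001)}
strat = stratified_sample(strata, {"A": 60, "B": 40}, seed=1)

print(len(srs), {name: len(units) for name, units in strat.items()})
```

Fixing `seed` only makes the illustration reproducible; in a real study the selection would normally not be seeded this way.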

1.2 Pictorial and Tabular Methods in Descriptive Statistics(描述统计学中的图形和表格方法)

visual displays 可视化

Notation

The number of observations in a single sample, that is, the sample size(样本大小), will often be denoted by n. For example, for the sample of universities {Stanford, Iowa State, Wyoming, Rochester}, n = 4.

If two samples are simultaneously under consideration, either m and n or n1 and n2 can be used to denote the numbers of observations.

Given a data set(数据集) consisting of n observations on some variable x, the individual observations will be denoted by x1, x2, x3,…, xn

Stem-and-Leaf displays(茎叶图)

constructing a Stem-and-Leaf display

  1. Select one or more leading digits(前导数字) for the stem values(茎值). The trailing digits become the leaves.
  2. List possible stem values in a vertical column.
  3. Record the leaf for each observation beside the corresponding stem value.
  4. Indicate the units for stems and leaves(标明茎和叶的单位) some place in the display.

In general, a display based on between 5 and 20 stems is recommended.

When a display is created by hand, the leaves on each stem need not be ordered from smallest to largest.
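
The construction steps can be sketched as a small Python function. It is a minimal sketch that assumes non-negative integer data, with the tens (and higher) digits as the stem and the ones digit as the leaf; the function name `stem_and_leaf` and the data values are made up for illustration. Unlike a hand-drawn display, the sketch sorts the leaves, simply because it is cheap to do in code.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Return the lines of a stem-and-leaf display for non-negative integers."""
    stems = defaultdict(list)
    for x in data:
        stems[x // 10].append(x % 10)   # leading digits -> stem, last digit -> leaf
    lines = []
    # List every possible stem between the smallest and largest, even empty ones.
    for s in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in sorted(stems.get(s, [])))
        lines.append(f"{s:>3} | {leaves}")
    lines.append("Stem: tens digit   Leaf: ones digit")   # units of the display
    return lines

# Hypothetical observations.
data = [64, 67, 70, 72, 72, 75, 81, 83, 88, 90, 104]
lines = stem_and_leaf(data)
print("\n".join(lines))
```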

A stem-and-leaf display conveys information about the following aspects of the data:

● identification of a typical or representative value(对典型或代表性值的识别)
● extent of spread about the typical value(数据围绕典型值的分散程度)
● presence of any gaps in the data(数据中是否存在间隙)
● extent of symmetry in the distribution of values(数值分布的对称程度)
● number and locations of peaks(峰值的数量和位置)
● presence of any outliers—values far from the rest of the data(任何离群值的存在——距离其他数据很远)

Dotplots(点图)

A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values(当数据集相当小或有相对较少的不同数据值时). Each observation is represented by a dot above the corresponding location on a horizontal measurement scale.(每个观测值都用水平测量标度上相应位置上方的一个点表示) When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically
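
A text-only dotplot makes the stacking rule concrete. The sketch below assumes non-negative integer data so that each value maps to one column of the scale; the function name `text_dotplot` and the data are hypothetical.

```python
from collections import Counter

def text_dotplot(data):
    """Return the lines of a text dotplot: stacked '*' above a horizontal scale."""
    counts = Counter(data)
    lo, hi = min(counts), max(counts)
    width = len(str(hi)) + 1                       # column width per scale position
    lines = []
    for level in range(max(counts.values()), 0, -1):    # top row first
        row = "".join(("*" if counts[v] >= level else " ").rjust(width)
                      for v in range(lo, hi + 1))
        lines.append(row.rstrip())
    # The horizontal measurement scale goes on the last line.
    lines.append("".join(str(v).rjust(width) for v in range(lo, hi + 1)))
    return lines

# Hypothetical observations; 6 occurs three times, so three dots are stacked above it.
data = [3, 5, 5, 6, 6, 6, 8]
lines = text_dotplot(data)
print("\n".join(lines))
```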

Histograms(直方图)

A numerical variable is discrete if its set of possible values either is finite or else can be listed in an infinite sequence (one in which there is a first number, a second number, and so on)(一个数值变量的可能值集是有限的,或者可以在一个无限序列(其中有第一个数、第二个数,等等)中列出). A numerical variable is continuous if its possible values consist of an entire interval on the number line(数值变量的可能值由数轴上的整个区间组成)

A discrete variable(离散变量) x almost always results from counting(计数)

Continuous variables(连续变量) arise from making measurements(测量).

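The frequency(频率) of any particular x value is the number of times that value occurs in the data set. The relative frequency(相对频率) of a value is the fraction or proportion of times the value occurs:

relative frequency of a value = (number of times the value occurs) / (number of observations in the data set)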

Multiplying a relative frequency by 100 gives a percentage(将相对频率乘以100得到一个百分比)

The relative frequencies, or percentages, are usually of more interest than the frequencies themselves.

In theory, the relative frequencies should sum to 1(理论上,相对频率的总和应该是1), but in practice the sum may differ slightly from 1 because of rounding(由于四舍五入的关系,总和可能与1略有不同). A frequency distribution(频率分布) is a tabulation(表) of the frequencies and/or relative frequencies.
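
A frequency distribution for a small discrete data set can be tabulated as follows. This is a sketch only: the function name `frequency_distribution` and the ten observations are assumptions made for the example.

```python
from collections import Counter

def frequency_distribution(data):
    """Return rows of (value, frequency, relative frequency), sorted by value."""
    n = len(data)
    counts = Counter(data)
    return [(value, freq, freq / n) for value, freq in sorted(counts.items())]

# Hypothetical counts observed on n = 10 units.
data = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]
rows = frequency_distribution(data)

print("value  frequency  rel.freq  percent")
for value, freq, rel in rows:
    print(f"{value:>5}  {freq:>9}  {rel:>8.3f}  {100 * rel:>6.1f}%")

# Without rounding, the relative frequencies sum to 1 (up to floating-point error).
total = sum(rel for _, _, rel in rows)
```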

Constructing a Histogram for discrete data:
First, determine the frequency and relative frequency of each x value(确定每个x值的频率和相对频率).

Then mark possible x values on a horizontal scale(在水平刻度上标记可能的x值).

Above each value, draw a rectangle whose height is the relative frequency (or alternatively, the frequency) of that value(在每个值之上,画一个矩形,其高度是该值的相对频率(或频率))

The rectangles should have equal widths(这些矩形的宽度应该相等).

This construction ensures that the area of each rectangle is proportional(成正比) to the relative frequency of the value.

Constructing a histogram for continuous data(连续数据) (measurements) entails subdividing the measurement axis(细分测量轴) into a suitable number of class intervals(类间隔) or classes(类), such that each observation is contained in exactly one class(每个观察值都包含在一个类中).

One potential difficulty is that occasionally an observation lies on a class boundary and therefore does not fall in exactly one interval(有时观察值位于类边界上,因此不会恰好落在一个区间内), for example, 29.0.

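One way to deal with this problem is to use boundaries like 27.55, 28.05, …, 31.55. Adding a hundredths digit to the class boundaries prevents observations from falling on the resulting boundaries. Another approach is to place an observation that lies on a boundary in the interval to the right of the boundary.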
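
The second convention (a value on a boundary goes into the interval to its right) is easy to express with the standard-library `bisect` module. The boundaries and observations below are hypothetical, chosen so that 28.0 and 29.0 fall exactly on boundaries; `class_frequencies` is an illustrative helper name.

```python
from bisect import bisect_right

def class_frequencies(data, boundaries):
    """Count observations per class [b[i], b[i+1]); boundary values go to the right."""
    counts = [0] * (len(boundaries) - 1)
    for x in data:
        if not (boundaries[0] <= x < boundaries[-1]):
            raise ValueError(f"{x} lies outside the classes")
        # bisect_right gives the index of the first boundary strictly greater than x,
        # so subtracting 1 yields the class whose left endpoint is <= x.
        counts[bisect_right(boundaries, x) - 1] += 1
    return counts

# Hypothetical equal-width classes and observations.
boundaries = [27.5, 28.0, 28.5, 29.0, 29.5]
data = [27.6, 28.0, 28.4, 29.0, 29.2, 28.7]
counts = class_frequencies(data, boundaries)
print(counts)
```

With this convention every observation lands in exactly one class, at the cost of the last class excluding its right endpoint.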

Constructing a Histogram for continuous data:Equal class Widths(相等的类宽度)

Determine the frequency and relative frequency for each class.

Mark the class boundaries on a horizontal measurement axis.

Above each class interval,draw a rectangle whose height is the corresponding relative frequency (or frequency).

Equal-width classes may not be a sensible choice if there are some regions of the measurement scale that have a high concentration of data values and other parts where data is quite sparse(度量尺度的某些区域数据值高度集中,而其他部分数据相当稀疏).

If a large number of equal-width classes are used, many classes will have zero frequency(如果使用了大量的等宽类,那么许多类的频率将为零). A sound choice is to use a few wider intervals near extreme observations and narrower intervals in the region of high concentration(在极端观测附近使用较宽的间隔,而在高度集中的区域使用较窄的间隔).

Constructing a Histogram for continuous data: unequal class Widths (不等类宽度)

After determining frequencies and relative frequencies, calculate the height of each rectangle using the formula

rectangle height = (relative frequency of the class) / (class width)

The resulting rectangle heights(矩形高度) are usually called densities(密度), and the vertical scale(垂直比例) is the density scale(密度比例). This prescription(规定) will also work when class widths are equal.
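
The density scale can be checked numerically: when each height is the relative frequency divided by the class width, each rectangle's area equals its relative frequency, so the areas sum to 1. The sketch below uses hypothetical data and unequal class boundaries, and the helper name `density_histogram` is an assumption; boundary values are placed in the class to their right.

```python
def density_histogram(data, boundaries):
    """Return rows of (left, right, relative frequency, density) for each class."""
    n = len(data)
    rows = []
    for left, right in zip(boundaries, boundaries[1:]):
        freq = sum(1 for x in data if left <= x < right)
        rel = freq / n
        rows.append((left, right, rel, rel / (right - left)))   # height = rel / width
    return rows

# Hypothetical data: concentrated at small values, sparse at large ones,
# so narrower classes are used on the left and wider ones on the right.
data = [1, 2, 2, 3, 4, 4, 5, 7, 12, 18]
boundaries = [0, 2, 4, 6, 10, 20]
rows = density_histogram(data, boundaries)

for left, right, rel, dens in rows:
    print(f"[{left:>2}, {right:>2})  rel.freq = {rel:.2f}  density = {dens:.3f}")

# Area of each rectangle = density * width = relative frequency.
total_area = sum(dens * (right - left) for left, right, _, dens in rows)
```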

Histogram Shapes(直方图的形状)

A unimodal histogram(单峰直方图) is one that rises to a single peak and then declines

A bimodal histogram(双峰直方图) has two different peaks.

Bimodality(双模态) can occur when the data set consists of observations on two quite different kinds of individuals or objects(当数据集包含对两种完全不同的个体或对象的观察时).

Only if the two separate histograms are far apart relative to their spreads will bimodality occur in the histogram of combined data.

The number of peaks may well depend on the choice of class intervals, particularly with a small number of observations. The larger the number of classes, the more likely it is that bimodality or multimodality(多模态) will manifest(出现) itself.

A histogram is symmetric(对称的) if the left half is a mirror image of the right half. A unimodal(单峰的) histogram is positively skewed(正向倾斜) if the right or upper tail(右尾或上尾) is stretched out compared with the left or lower tail and negatively skewed(负向倾斜) if the stretching is to the left.

Qualitative Data(定性数据)

Both a frequency distribution and a histogram can be constructed when the data set is qualitative (categorical,分类的) in nature. In some cases, there will be a natural ordering(顺序) of classes—for example, freshmen, sophomores, juniors, seniors, graduate students—whereas in other cases the order will be arbitrary—for example, Catholic, Jewish, Protestant, and the like. With such categorical data, the intervals above which rectangles are constructed should have equal width.

Multivariate data(多元数据)

Multivariate data is generally rather difficult to describe visually(很难直观地描述). Several methods for doing so appear later in the book, notably scatterplots for bivariate numerical data(二元数值数据的散点)
