I. Problem Analysis

1.1 Problem Description

In this assignment you must choose one of the options presented below and submit a visual as well as your source code for peer grading. The details of how you solve the assignment are up to you, although your assignment must use matplotlib so that your peers can evaluate your work. The options differ in challenge level, but there are no grades associated with the challenge level you chose. However, your peers will be asked to ensure you at least met a minimum quality for a given technique in order to pass. Implement the technique fully (or exceed it!) and you should be able to earn full grades for the assignment.

Ferreira, N., Fisher, D., & Konig, A. C. (2014, April). Sample-oriented task-driven visualizations: allowing users to make better, more confident decisions.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 571-580). ACM. ([video](https://www.youtube.com/watch?v=BI7GAs-va-Q))

In this paper the authors describe the challenges users face when trying to make judgements about probabilistic data generated through samples. As an example, they look at a bar chart of four years of data (replicated below in Figure 1). Each year has a y-axis value, which is derived from a sample of a larger dataset. For instance, the first value might be the number of votes in a given district or riding for 1992, with the average being around 33,000. On top of this is plotted the 95% confidence interval for the mean (see the boxplot lectures for more information, and the yerr parameter of barcharts).

A challenge that users face is that, for a given y-axis value (e.g. 42,000), it is difficult to know which x-axis values are most likely to be representative, because the confidence levels overlap and their distributions are different (the lengths of the confidence interval bars are unequal). One of the solutions the authors propose for this problem (Figure 2c) is to allow users to indicate the y-axis value of interest (e.g. 42,000) and then draw a horizontal line and color bars based on this value. So bars might be colored red if they are definitely above this value (given the confidence interval), blue if they are definitely below this value, or white if they contain this value.

Easiest option: Implement the bar coloring as described above - a color scale with only three colors, (e.g. blue, white, and red). Assume the user provides the y axis value of interest as a parameter or variable.

Harder option: Implement the bar coloring as described in the paper, where the color of the bar is actually based on the amount of data covered (e.g. a gradient ranging from dark blue for the distribution being certainly below this y-axis, to white if the value is certainly contained, to dark red if the value is certainly not contained as the distribution is above the axis).

Even Harder option: Add interactivity to the above, which allows the user to click on the y axis to set the value of interest. The bar colors should change with respect to what value the user has selected.

Hardest option: Allow the user to interactively set a range of y values they are interested in, and recolor based on this (e.g. a y-axis band, see the paper for more details).
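
As a point of reference for the options above, the three-color rule of the easiest option can be sketched in a few lines. This is only an illustration: the function name `three_color` and the sample intervals are made up for the example and are not part of the assignment.

```python
def three_color(y, interval):
    """Pick a bar color from y and the bar's (low, high) confidence interval."""
    low, high = interval
    if y < low:
        return "red"    # the whole interval lies above y
    if y > high:
        return "blue"   # the whole interval lies below y
    return "white"      # the interval contains y


# Made-up confidence intervals, for illustration only
example_intervals = [(30000, 36000), (40000, 44000), (45000, 50000)]
example_colors = [three_color(42000, iv) for iv in example_intervals]
print(example_colors)  # ['blue', 'white', 'red']
```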

Note: The data given for this assignment is not the same as the data used in the article and as a result the visualizations may look a little different.

Use the following data for this assignment:

import pandas as pd
import numpy as np
from scipy import stats
%matplotlib notebook

np.random.seed(12345)

df = pd.DataFrame([np.random.normal(32000,200000,3650),
                   np.random.normal(43000,100000,3650),
                   np.random.normal(43500,140000,3650),
                   np.random.normal(48000,70000,3650)],
                  index=[1992,1993,1994,1995])

1.2 Analysis

This post works through the Even Harder option:

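The task provides four datasets, one for each year from 1992 to 1995. Given a particular y value, the goal is to judge which year it is most likely to represent. The samples are drawn from normal distributions (this could be checked with a kernel density plot, a Shapiro-Wilk test, or a Kolmogorov-Smirnov test, which is skipped here), and by the central limit theorem the mean of each year's sample is approximately normally distributed. First set $\alpha=0.05$, that is, a 95% confidence level, and compute the confidence interval of the mean for each year. Then draw a bar chart whose bar heights are the midpoints of the confidence intervals (which are also the sample means), and draw error bars spanning each interval. The error bars show the range in which each year's mean plausibly lies at the 95% confidence level.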
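
To make the interval computation concrete, here is a dependency-free sketch of the normal-approximation interval for a mean: mean ± z × (sample standard deviation / √n). The function name `mean_confidence_interval` is an assumption made for this example; it is meant to mirror what a call such as `stats.norm.interval(0.95, mean, stats.sem(sample))` computes, not to replace it.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    n = len(sample)
    m = mean(sample)
    # Standard error of the mean, using the n-1 sample standard deviation
    sem = stdev(sample) / sqrt(n)
    # Two-sided critical value, about 1.96 for a 95% confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * sem, m + z * sem


low, high = mean_confidence_interval([1, 2, 3, 4, 5])
print(low, high)
```

The midpoint of the returned interval is the sample mean, which is why the bar height and the centre of the error bar coincide.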

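The closer a value is to the centre of a confidence interval, the more likely it is taken to belong to that dataset. The expression $2(\bar{X}-Y)/(X_{max}-X_{min})$ is used as a heuristic score of the observed value $Y$ against each dataset, where $\bar{X}$ is the sample mean and $X_{max}$ and $X_{min}$ are the upper and lower bounds of its confidence interval. The score is fixed at 1 when $Y$ lies below the interval and at -1 when it lies above, and the value displayed as the probability is 1 minus the absolute value of the score.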
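
A small worked example may help with reading the score. The sketch below evaluates it for one made-up interval, (40000, 44000); the name `interval_score` is hypothetical and chosen for this illustration.

```python
def interval_score(y, interval):
    """Score in [-1, 1]: 1 if y is below the interval, -1 if above, linear inside."""
    low, high = interval
    if y < low:
        return 1.0
    if y > high:
        return -1.0
    center = (low + high) / 2
    return 2 * (center - y) / (high - low)


example_interval = (40000.0, 44000.0)  # made-up interval, centre at 42000
scores = {y: interval_score(y, example_interval)
          for y in (39000, 41000, 42000, 43000, 45000)}
for y, s in scores.items():
    # 1 - |score| is the value shown above each bar
    print(y, s, 1 - abs(s))
```

The score is 0 at the centre of the interval (displayed value 1) and reaches ±1 at the interval bounds (displayed value 0).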

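For the colormap, choose a diverging one: white in the middle with a gradient toward each end, so that the middle corresponds to a probability of 1 and the two ends to 0. The scores range from -1 to 1, so first normalize them with plt.Normalize(-1,1), then color each bar according to its score. Finally, polish the figure (for example, raise the data-ink ratio and remove chartjunk) to complete one frame. For more on using colormaps, see
[DS with Python] Matplotlib Basics (3): the cm module, colormaps, animation, and canvas interaction.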
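
The mapping from scores to colors can be tried in isolation. The following is a minimal sketch, assuming the 'coolwarm' colormap; any other diverging colormap could be substituted.

```python
import matplotlib.pyplot as plt
from matplotlib import cm

# Map scores in [-1, 1] onto the full range of a diverging colormap
norm = plt.Normalize(-1, 1)
cmap = plt.get_cmap("coolwarm")
sm = cm.ScalarMappable(norm=norm, cmap=cmap)

example_scores = [-1.0, 0.0, 1.0]
rgba = sm.to_rgba(example_scores)  # one RGBA row per score
for s, color in zip(example_scores, rgba):
    print(s, color.round(2))
```

A score of -1 lands at the blue end, 1 at the red end, and 0 at the near-neutral centre. The same `ScalarMappable` can also be passed to `plt.colorbar` so that the legend matches the bar colors.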

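Finally, create a mouse-click event: on each click, read the current y coordinate from event.ydata, treat it as the observed value, and pass it to the plotting logic described above to produce the new frame.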
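
The click wiring on its own can be sketched as follows. This is a stripped-down illustration that only moves a dashed line to the clicked height; the function name `make_clickable_line` and its default values are assumptions made for the example.

```python
import matplotlib.pyplot as plt


def make_clickable_line(y0=40000.0, ylim=(0, 60000)):
    """Create an axes with a dashed horizontal line that follows mouse clicks."""
    fig, ax = plt.subplots()
    ax.set_ylim(*ylim)
    line = ax.axhline(y0, color="gray", linestyle="--", linewidth=1)

    def onclick(event):
        # event.ydata is None for clicks outside the axes, so skip those
        if event.inaxes is not ax:
            return
        # set_ydata expects a sequence: one y value per end point of the line
        line.set_ydata([event.ydata, event.ydata])
        fig.canvas.draw_idle()

    # mpl_connect returns a connection id usable with mpl_disconnect
    cid = fig.canvas.mpl_connect("button_press_event", onclick)
    return fig, ax, line, cid
```

In a notebook with an interactive backend, calling `make_clickable_line()` is enough; in a plain script, follow it with `plt.show()`. Updating existing artists inside the handler, rather than drawing new ones, keeps the figure from accumulating stacked bars over repeated clicks.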

II. Code and Comments

2.1 Code and comments

# Use the following data for this assignment:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib notebook

np.random.seed(12345)

# The four datasets, one row per year
df = pd.DataFrame([np.random.normal(32000,200000,3650),
                   np.random.normal(43000,100000,3650),
                   np.random.normal(43500,140000,3650),
                   np.random.normal(48000,70000,3650)],
                  index=[1992,1993,1994,1995])

# 95% confidence interval of the mean for each year
intervals=[]
for idx in df.index:
    interval=stats.norm.interval(0.95,np.mean(df.loc[idx]),stats.sem(df.loc[idx]))
    intervals.append(interval)

# Years (the index of df) and the mean of each year
index=df.index.values
values=df.mean(axis=1).values

# yerr for the error bars: the interval bounds minus the mean, shape (2, 4)
err=np.array(intervals).T-values

# Default position of the dashed line: the mean of the four bar heights
y=np.mean(values)

# Create a new figure
fig,ax=plt.subplots()

# Pick a colormap; 'coolwarm' is used here, but any other diverging or custom colormap works
cmap=plt.get_cmap('coolwarm')

# Score in [-1, 1]: 1 (red) when y is below the whole interval,
# -1 (blue) when y is above it, and linear in between (0 at the centre)
def calculate_probability(y,interval):
    if y<interval[0]:
        return 1
    elif y>interval[1]:
        return -1
    return 2*((interval[1]+interval[0])/2-y)/(interval[1]-interval[0])

# List comprehension over the confidence intervals
probs=[calculate_probability(y,interval) for interval in intervals]

# ScalarMappable that maps scores in [-1, 1] onto the colormap
sm=cm.ScalarMappable(cmap=cmap,norm=plt.Normalize(-1,1))
sm.set_array([])

# Draw the bars
bars=ax.bar(range(len(values)),values,color=sm.to_rgba(probs))

# Draw the error bars
ax.errorbar(range(len(values)),values,yerr=abs(err),c='k',fmt=' ',capsize=15)

# Axes settings
ax.set_xticks(range(len(values)))
ax.set_xticklabels(index)
ax.set_ylabel('Values')
ax.set_xlabel('Year')
ax.set_ylim([0,60000])
ax.set_title('Assignment3')

# Horizontal colorbar
fig.colorbar(sm,ax=ax,orientation='horizontal')

# Hide two spines and fade the other two to reduce chartjunk
for loc in ['top','right']:
    ax.spines[loc].set_visible(False)
for loc in ['left','bottom']:
    ax.spines[loc].set_alpha(0.3)

# Add y as an extra tick on the y axis
yticks=ax.get_yticks()
ax.set_yticks(np.append(yticks,y))

# Dashed line at the observed value
h_line=ax.axhline(y,color='gray',linestyle='--',linewidth=1)

# Annotations: the current y, and the displayed value above each bar
text=ax.text(1.5,58000,'y={:5.0f}'.format(y),bbox={'fc':'w','ec':'k'},ha='center')
texts=[ax.text(bar.get_x()+bar.get_width()/2,bar.get_height()+10000,
               'prob={:.2f}'.format(1-abs(prob)),bbox={'fc':'w','ec':'k'},ha='center')
       for bar,prob in zip(bars,probs)]

# Click handler
def onclick(event):
    # Ignore clicks outside the main axes, where event.ydata is None
    if event.inaxes is not ax:
        return
    # Recompute the scores for the clicked y value
    probs=[calculate_probability(event.ydata,interval) for interval in intervals]
    # Recolor the existing bars instead of drawing new ones on top
    for bar,color in zip(bars,sm.to_rgba(probs)):
        bar.set_facecolor(color)
    # Move the dashed line (set_ydata expects a sequence)
    h_line.set_ydata([event.ydata,event.ydata])
    # Update the y ticks with the new value
    ax.set_yticks(np.append(yticks,event.ydata))
    # Update the annotations
    text.set_text('y={:5.0f}'.format(event.ydata))
    for t,prob in zip(texts,probs):
        t.set_text('prob={:.2f}'.format(1-abs(prob)))
    fig.canvas.draw_idle()

fig.canvas.mpl_connect('button_press_event', onclick)

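2.2 Results

Run the script to try the interaction. In Jupyter Notebook, run %matplotlib notebook first to enable interactive mode.

Initial view
The displayed probability is 0 whenever the observed value is too high or too low. At y=40603, 1994 has the highest probability.

Interactive views
When the observed value is too low, all probabilities are 0.

When the observed value is in the middle range, the probability for each dataset is shown above its bar.

When the observed value is too high, all probabilities are 0.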
