Table of Contents

  • 1. Problem Analysis
    • 1.1 Problem Description
    • 1.2 Analysis
  • 2. Code and Comments
    • 2.1 Code with Comments
    • 2.2 Plot Results

1. Problem Analysis

1.1 Problem Description

In this assignment you must choose one of the options presented below and submit a visual as well as your source code for peer grading. The details of how you solve the assignment are up to you, although your assignment must use matplotlib so that your peers can evaluate your work. The options differ in challenge level, but there are no grades associated with the challenge level you chose. However, your peers will be asked to ensure you at least met a minimum quality for a given technique in order to pass. Implement the technique fully (or exceed it!) and you should be able to earn full grades for the assignment.

Ferreira, N., Fisher, D., & Konig, A. C. (2014, April). Sample-oriented task-driven visualizations: allowing users to make better, more confident decisions.
      In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 571-580). ACM. ([video](https://www.youtube.com/watch?v=BI7GAs-va-Q))

In this paper the authors describe the challenges users face when trying to make judgements about probabilistic data generated through samples. As an example, they look at a bar chart of four years of data (replicated below in Figure 1). Each year has a y-axis value, which is derived from a sample of a larger dataset. For instance, the first value might be the number of votes in a given district or riding for 1992, with the average being around 33,000. On top of this is plotted the 95% confidence interval for the mean (see the boxplot lectures for more information, and the yerr parameter of barcharts).

A challenge that users face is that, for a given y-axis value (e.g. 42,000), it is difficult to know which x-axis values are most likely to be representative, because the confidence levels overlap and their distributions are different (the lengths of the confidence interval bars are unequal). One of the solutions the authors propose for this problem (Figure 2c) is to allow users to indicate the y-axis value of interest (e.g. 42,000) and then draw a horizontal line and color bars based on this value. So bars might be colored red if they are definitely above this value (given the confidence interval), blue if they are definitely below this value, or white if they contain this value.

Easiest option: Implement the bar coloring as described above - a color scale with only three colors, (e.g. blue, white, and red). Assume the user provides the y axis value of interest as a parameter or variable.

Harder option: Implement the bar coloring as described in the paper, where the color of the bar is actually based on the amount of data covered (e.g. a gradient ranging from dark blue for the distribution being certainly below this y-axis, to white if the value is certainly contained, to dark red if the value is certainly not contained as the distribution is above the axis).

Even Harder option: Add interactivity to the above, which allows the user to click on the y axis to set the value of interest. The bar colors should change with respect to what value the user has selected.

Hardest option: Allow the user to interactively set a range of y values they are interested in, and recolor based on this (e.g. a y-axis band, see the paper for more details).


Note: The data given for this assignment is not the same as the data used in the article and as a result the visualizations may look a little different.

Use the following data for this assignment:

import pandas as pd
import numpy as np
from scipy import stats
%matplotlib notebook

np.random.seed(12345)

df = pd.DataFrame([np.random.normal(32000,200000,3650),
                   np.random.normal(43000,100000,3650),
                   np.random.normal(43500,140000,3650),
                   np.random.normal(48000,70000,3650)],
                  index=[1992,1993,1994,1995])

1.2 Analysis

  • Analysis of the Even Harder option:

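  This problem provides datasets for the four years 1992 to 1995. Given a specific value, we want to judge which year it most likely belongs to. Relying on the central limit theorem, we treat each year's sample mean as approximately normally distributed (normality of the data could also be checked with a kernel density plot, the Shapiro-Wilk test, or the Kolmogorov-Smirnov test; this step is skipped here). We first set $\alpha = 0.05$, i.e. a 95% confidence level, and compute the confidence interval of the mean for each year. The bars are drawn at the center of each confidence interval (which is also the sample mean of that dataset), and the error bars are drawn from the interval's range, which gives, at the 95% confidence level, the range of values treated as plausibly belonging to that dataset.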
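The interval calculation described above can be sketched as follows, assuming the same normal approximation: the interval is centered on the sample mean, and its half-width is the normal quantile times the standard error. The helper name `mean_confidence_interval` and the synthetic sample are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, lower, upper) for a normal-approximation CI of the mean."""
    sample = np.asarray(sample, dtype=float)
    mean = sample.mean()
    sem = stats.sem(sample)                    # standard error of the mean (ddof=1)
    z = stats.norm.ppf(0.5 + confidence / 2)   # two-sided normal quantile
    half_width = z * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical sample, drawn with the same parameters as the 1993 row
rng = np.random.default_rng(0)
sample = rng.normal(43000, 100000, 3650)

mean, lower, upper = mean_confidence_interval(sample)
print(f"mean={mean:.0f}, 95% CI=({lower:.0f}, {upper:.0f})")
```

This should agree with `stats.norm.interval(0.95, loc=mean, scale=stats.sem(sample))`, which is the call used in the full script.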

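  Since a value closer to the center of the confidence interval is more likely to belong to that dataset, we use the score $2(\bar{X}-Y)/(X_{max}-X_{min})$, where $\bar{X}$ is the center of the interval, $Y$ is the observed value, and $X_{min}$, $X_{max}$ are the interval bounds, to design a function that computes the probability that the observed value belongs to each dataset. The score is clipped to $[-1, 1]$ outside the interval, and the probability shown on the plot is $1$ minus its absolute value.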
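To make the formula concrete, here is a small sketch of the score and the derived probability. The function names and the interval (30000, 40000) are made up for illustration. Note that this is a linear heuristic rather than a probability in the strict statistical sense.

```python
def signed_score(y, lower, upper):
    """Signed position of y relative to the interval center, clipped to [-1, 1]."""
    if y < lower:       # the interval lies entirely above y
        return 1.0
    if y > upper:       # the interval lies entirely below y
        return -1.0
    center = (lower + upper) / 2
    return 2 * (center - y) / (upper - lower)


def membership(y, lower, upper):
    """Displayed 'probability': 1 at the interval center, 0 at or beyond its ends."""
    return 1 - abs(signed_score(y, lower, upper))


# Hypothetical interval (30000, 40000), centered at 35000
for y in (28000, 32500, 35000, 40000, 45000):
    print(y, signed_score(y, 30000, 40000), membership(y, 30000, 40000))
```

A positive score means the bar sits above the chosen value and a negative score means it sits below, which is what lets a diverging colormap tell the two cases apart.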

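  For the colormap, pick a diverging one: white in the middle and shading toward a different color at each end, where the middle corresponds to a probability of 1 and the two ends to a probability of 0. The signed scores in the range -1 to 1 are first normalized with plt.Normalize(-1,1), and each bar is then colored according to the score computed for its dataset. Finally, polish the details of the figure (for example, raise the data-ink ratio and remove chart junk), and one frame is complete. For more on using colormaps, see
【DS with Python】Matplotlib入门(三):cm模块、colormap配色、animation动画与canvas交互设计。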
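As a rough illustration of the coloring step, the sketch below maps three hypothetical signed scores to RGBA colors through a `ScalarMappable` with a (-1, 1) normalization. The `scores` list is invented for the example.

```python
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import Normalize

# Map the range [-1, 1] onto the full width of a diverging colormap
norm = Normalize(vmin=-1, vmax=1)
sm = cm.ScalarMappable(norm=norm, cmap=plt.get_cmap("coolwarm"))

scores = [-1.0, 0.0, 1.0]      # hypothetical signed scores, one per bar
colors = sm.to_rgba(scores)    # array with one RGBA row per score

for score, rgba in zip(scores, colors):
    print(score, [round(float(c), 2) for c in rgba])
```

The resulting array can be passed directly as the `color` argument of `plt.bar`. One detail worth knowing: the midpoint of `coolwarm` is a light gray rather than pure white, so a colormap such as `bwr` may be a closer match if a truly white center is wanted.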

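  Finally, register a mouse-click event: on each click, read the current y coordinate from event.ydata, treat it as the observed value, and pass it to the plotting logic designed above to produce the new frame.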
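The click handling can be tried in isolation with a sketch like the one below. It uses the headless Agg backend and a synthetic `MouseEvent` purely so that it runs without a window; in a real session the event comes from an actual click, and the `matplotlib.use("Agg")` line and the simulated event would be dropped. The names `on_click` and `clicked` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove in an interactive session
import matplotlib.pyplot as plt
from matplotlib.backend_bases import MouseEvent

fig, ax = plt.subplots()
ax.set_ylim(0, 60000)
line = ax.axhline(20000, color="gray", linestyle="--")
clicked = []  # records every y value selected by a click


def on_click(event):
    # Ignore clicks outside the main axes (event.ydata would be None there)
    if event.inaxes is not ax:
        return
    clicked.append(event.ydata)
    line.set_ydata([event.ydata, event.ydata])  # move the dashed line
    fig.canvas.draw_idle()


fig.canvas.mpl_connect("button_press_event", on_click)

# Simulate one click at the center of the axes (display coordinates)
fig.canvas.draw()
x_px, y_px = ax.transAxes.transform((0.5, 0.5))
event = MouseEvent("button_press_event", fig.canvas, x_px, y_px, button=1)
fig.canvas.callbacks.process("button_press_event", event)

print(clicked)  # one value near the middle of the y range
```

Checking `event.inaxes` also filters out clicks that land on a colorbar, whose axes have their own coordinate system.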

2. Code and Comments

2.1 Code with Comments

# Use the following data for this assignment:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from matplotlib import cm

# Jupyter magic for interactive mode; remove when running as a plain script
%matplotlib notebook

np.random.seed(12345)

# The four datasets, one row per year
df = pd.DataFrame([np.random.normal(32000,200000,3650),
                   np.random.normal(43000,100000,3650),
                   np.random.normal(43500,140000,3650),
                   np.random.normal(48000,70000,3650)],
                  index=[1992,1993,1994,1995])

# Compute the 95% confidence interval of the mean for each year
intervals = []
for idx in df.index:
    interval = stats.norm.interval(0.95, np.mean(df.loc[idx]), stats.sem(df.loc[idx]))
    intervals.append(interval)

# Compute yerr (the confidence interval minus the mean) for the error bars
err_1992 = np.array(stats.norm.interval(0.95, np.mean(df.loc[1992]), stats.sem(df.loc[1992]))) - np.mean(df.loc[1992])
err_1993 = np.array(stats.norm.interval(0.95, np.mean(df.loc[1993]), stats.sem(df.loc[1993]))) - np.mean(df.loc[1993])
err_1994 = np.array(stats.norm.interval(0.95, np.mean(df.loc[1994]), stats.sem(df.loc[1994]))) - np.mean(df.loc[1994])
err_1995 = np.array(stats.norm.interval(0.95, np.mean(df.loc[1995]), stats.sem(df.loc[1995]))) - np.mean(df.loc[1995])
err = np.array([err_1992, err_1993, err_1994, err_1995]).T

## Alternative: subtract each year's mean from the intervals computed above
# idx_2 = 1992
# intervals_2 = []
# for interval in intervals:
#     interval_2 = np.array(interval) - np.mean(df.loc[idx_2])
#     intervals_2.append(interval_2)
#     idx_2 += 1
# err = np.array([intervals_2[0], intervals_2[1], intervals_2[2], intervals_2[3]]).T

# Extract the index (years) and the per-year means
index = df.T.describe().loc['mean', :].index.values
values = df.T.describe().loc['mean', :].values

# Default y for the dashed line: the mean of the four bar heights
y = np.mean(values)

# Create a new figure
plt.figure()

# Pick a colormap; 'coolwarm' is used here, but any other diverging
# colormap (or a custom one) would also work
cmap = plt.get_cmap('coolwarm')

# Signed score in [-1, 1]: 1 (red) when the interval lies entirely above y,
# -1 (blue) when it lies entirely below y, linear in between
def calculate_probability(y, interval):
    if y < interval[0]:
        return 1
    elif y > interval[1]:
        return -1
    return 2 * ((interval[1] + interval[0]) / 2 - y) / (interval[1] - interval[0])

# Score for each confidence interval
probs = [calculate_probability(y, interval) for interval in intervals]

# ScalarMappable that maps scores in [-1, 1] to colors
sm = cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(-1, 1))
sm.set_array([])

# Draw the bars
bars = plt.bar(range(len(values)), values, color=sm.to_rgba(probs))
ax = plt.gca()

# Draw the error bars
ax.errorbar(range(len(values)), values, yerr=abs(err), c='k', fmt=' ', capsize=15)

# Axis settings
plt.xticks(range(len(values)), index)
plt.ylabel('Values')
plt.xlabel('Year')
plt.ylim([0, 60000])
ax.set_title('Assignment3')

# Horizontal colorbar
plt.colorbar(sm, ax=ax, orientation='horizontal')

# Hide two spines and fade the other two to reduce chart junk
for loc in ['top', 'right']:
    ax.spines[loc].set_visible(False)
for loc in ['left', 'bottom']:
    ax.spines[loc].set_alpha(0.3)

# Add the dashed line's y value to the y ticks
yticks = ax.get_yticks()
new_yticks = np.append(yticks, y)
ax.set_yticks(new_yticks)

# Dashed line at the observed value
h_line = plt.axhline(y, color='gray', linestyle='--', linewidth=1)

# Annotate the observed value and each bar's probability
text = plt.text(1.5, 58000, 'y={:5.0f}'.format(y), bbox={'fc': 'w', 'ec': 'k'}, ha='center')
prob_texts = [
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 10000,
             'prob={:.2f}'.format(1 - abs(p)),
             bbox={'fc': 'w', 'ec': 'k'}, ha='center')
    for bar, p in zip(bars, probs)
]

# Click handler
def onclick(event):
    # Ignore clicks outside the bar chart (e.g. on the colorbar or the margins)
    if event.inaxes is not ax:
        return
    # Recompute the scores for the clicked y value
    probs = [calculate_probability(event.ydata, interval) for interval in intervals]
    # Recolor the existing bars instead of drawing new ones on top
    for bar, color in zip(bars, sm.to_rgba(probs)):
        bar.set_facecolor(color)
    # Move the dashed line to the new observed value
    h_line.set_ydata([event.ydata, event.ydata])
    # Update the y ticks
    ax.set_yticks(np.append(yticks, event.ydata))
    # Update the annotations
    text.set_text('y={:5.0f}'.format(event.ydata))
    for label, p in zip(prob_texts, probs):
        label.set_text('prob={:.2f}'.format(1 - abs(p)))
    plt.gcf().canvas.draw_idle()

plt.gcf().canvas.mpl_connect('button_press_event', onclick)

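2.2 Plot Results

Run the script to try the interaction yourself. In Jupyter Notebook, run %matplotlib notebook first to enable interactive mode.

  • Initial view
    The probability is 0 when the observed value is too high or too low. At y=40603, the probability of belonging to 1994 is the highest.

  • Interactive view
    When the observed value is too low, all probabilities are 0.

    When the observed value is moderate, the probability of belonging to each dataset:

    When the observed value is too high, all probabilities are 0.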
