20210319_Session 23_Heartbeat Detection_Task02_Exploratory Data Analysis
II. Exploratory Data Analysis (EDA)
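- II. Exploratory Data Analysis (EDA)
- Source
- 1.1 EDA and Its Goals
- 1.2 Loading the Data and Getting an Overview
- 1.2.1 Importing the data and viewing the head and tail of the dataset
- 1.2.2 Dataset shape, summary statistics, and data types
- 1.2.3 Outliers/missing values: distribution and visualization
- 1.2.4 Distribution of the target variable
- 1.2.4.1 Overall distribution
- 1.2.4.2 Skewness and kurtosis
- 1.2.4.3 Frequency of each target value
- References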
Source
Datawhale Session 23, Data Mining: Heartbeat Detection:
https://github.com/datawhalechina/team-learning-data-mining/tree/master/HeartbeatClassification
Authors: 鱼佬, 杜晓东, 张晋, 王皓月, 牧小熊, 姚昱君, 杨梦迪
Forum: http://datawhale.club/t/topic/1574
import missingno as msno
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
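1.1 EDA and Its Goals
EDA mainly consists of distribution analysis, summary-statistics analysis, and correlation analysis.
Goal:
- Understand the data thoroughly
  - External information: the real-world meaning of the data and prior knowledge (from a web search such as Baidu, or from domain experts)
  - Internal information: the properties of the data itself (statistics, charts)
Internal information includes:
- Single-feature analysis
  - Continuous data: statistics (distribution trend / normality / data transformation / runs test), visualization (line plot / box plot / scatter plot), type conversion (continuous to discrete)
  - Discrete data: statistics (value_counts() to inspect the value structure), visualization (pie chart / bar chart)
- Two-feature analysis
  - Continuous vs. continuous: statistics (covariance / correlation coefficient), visualization (scatter plot / line plot, the latter for time-related features)
  - Discrete vs. discrete: statistics (Kendall correlation coefficient / chi-square test), visualization (point-line plot of counts and frequencies)
- Multi-feature analysis
  - Several continuous features: scatter plots / analysis of variance
  - Mixed continuous and discrete features: bubble chart
  - Several discrete features: break down the counts and frequencies of each category, then plot
  - Heatmap: pairwise similarity among multiple features; the similarity measure can be any of the three correlation coefficients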
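The heatmap item above mentions three correlation coefficients without naming them. A minimal sketch, assuming they are the three methods that pandas `DataFrame.corr` supports (Pearson, Spearman, Kendall); the data is synthetic and the column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only (not the competition data).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "x": x,
    "y_linear": 2.0 * x + rng.normal(scale=0.5, size=500),  # noisy linear relation
    "y_monotone": np.exp(x),                                  # monotone but non-linear
})

# One pairwise correlation matrix per coefficient.
corr = {m: demo.corr(method=m) for m in ("pearson", "spearman", "kendall")}
for name, matrix in corr.items():
    print(name)
    print(matrix.round(3))
```

The rank-based coefficients (Spearman, Kendall) score the monotone pair as perfectly correlated while Pearson does not, which is one reason to compare them. Any of these matrices could be passed to `sns.heatmap` to draw the heatmap.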
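1.2 Loading the Data and Getting an Overview
1.2.1 Importing the data and viewing the head and tail of the dataset
data.head() and data.tail() show the first and last 5 rows of a dataset.
path = 'C:/Users/GJX/Desktop/Datawhale-学习/23期/心电图/data/'
Train_data = pd.read_csv(path + 'train.csv')
Test_data = pd.read_csv(path + 'testA.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([Train_data.head(), Train_data.tail()])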
          id  heartbeat_signals                                              label
0          0  0.9912297987616655,0.9435330436439665,0.7646772997256593,...  0.0
1          1  0.9714822034884503,0.9289687459588268,0.5729328050711678,...  0.0
2          2  1.0,0.9591487564065292,0.7013782792997189,...                  2.0
3          3  0.9757952826275774,0.9340884687738161,0.6596366611990001,...  0.0
4          4  0.0,0.055816398940721094,0.26129357194994196,...               2.0
99995  99995  1.0,0.677705342021188,0.22239242747868546,...                  0.0
99996  99996  0.9268571578157265,0.9063471198026871,0.6369932212888393,...  2.0
99997  99997  0.9258351628306013,0.5873839035878395,0.6332261741951388,...  3.0
99998  99998  1.0,0.9947621698382489,0.8297017704865509,...                  2.0
99999  99999  0.9259994004527861,0.916476635326053,0.4042900774399834,...   0.0
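1.2.2 Dataset shape, summary statistics, and data types
- Number of rows and columns
print("Training set size:", Train_data.shape)
print("Test set size:", Test_data.shape)
Training set size: (100000, 3)
Test set size: (20000, 2)
- Count, mean, standard deviation (std), min/max, quantiles, and other summary statistics, using the training set as an example:
Train_data.describe()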
id label
count 100000.000000 100000.000000
mean 49999.500000 0.856960
std 28867.657797 1.217084
min 0.000000 0.000000
25% 24999.750000 0.000000
50% 49999.500000 0.000000
75% 74999.250000 2.000000
max 99999.000000 3.000000
- Column data types, again using the training set as an example:
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
id 100000 non-null int64
heartbeat_signals 100000 non-null object
label 100000 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.3+ MB
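The info() output shows that heartbeat_signals has dtype object: each cell is a single comma-separated string, so describe() skips it. A minimal sketch of one way to expand it into numeric columns, using a made-up two-row frame with the same layout (the `s_0`, `s_1`, ... column names are hypothetical, not part of the dataset):

```python
import pandas as pd

# Toy frame with the same column layout as train.csv; the values are made up.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.99,0.94,0.76,0.61", "1.0,0.95,0.70,0.23"],
    "label": [0.0, 2.0],
})

# Split each string on commas into separate columns, then convert to float.
signals = toy["heartbeat_signals"].str.split(",", expand=True).astype(float)
signals.columns = [f"s_{i}" for i in range(signals.shape[1])]  # hypothetical names

# Reattach id and label so the numeric signal can be explored with describe().
wide = pd.concat([toy[["id", "label"]], signals], axis=1)
print(wide.dtypes)
print(wide)
```

On the full training set this produces one column per sample point, so it may be worth checking memory use before applying it to all 100000 rows.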
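1.2.3 Outliers/missing values: distribution and visualization
- Use .isna() or .isnull() to check whether each cell is missing, and .isnull().sum() to count the missing values in each column.
- Missing values can be removed with df.dropna(). Its main parameters are axis (default 0, which drops rows), how (the drop rule, either any or all), thresh (the minimum number of non-missing values; rows or columns with fewer are dropped), and subset (the labels along the other axis to check for missing values).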
Train_data.isnull().sum()
id 0
heartbeat_signals 0
label 0
dtype: int64
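The training set has no missing values, so the dropna() parameters described above have nothing to act on here. A small sketch on a made-up frame shows how how, thresh, and subset differ:

```python
import numpy as np
import pandas as pd

# Made-up frame with 3, 2, 1, and 0 non-missing values in rows 0..3.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, np.nan],
})

missing_per_column = df.isnull().sum()   # count of NaN in each column

any_dropped = df.dropna(how="any")       # drop rows containing any NaN
all_dropped = df.dropna(how="all")       # drop rows where every value is NaN
thresh_kept = df.dropna(thresh=2)        # keep rows with at least 2 non-missing values
subset_kept = df.dropna(subset=["a"])    # only look at column "a" when deciding

print(missing_per_column)
print(len(any_dropped), len(all_dropped), len(thresh_kept), len(subset_kept))
```

how and thresh are passed in separate calls because recent pandas versions do not accept both at once.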
- Visualizing missing values
# Visualize where values are missing
msno.matrix(Train_data.sample(100000))
msno.bar(Train_data.sample(1000))  # bar chart of non-missing counts per column
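1.2.4 Distribution of the target variable
1.2.4.1 Overall distribution
seaborn.distplot() draws a histogram together with a kernel density estimate (KDE) curve, and optionally a rug plot and a fitted distribution. It has been deprecated since seaborn 0.11 in favor of histplot() and displot().
- Function prototype
seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, ...)
- Key parameters
  - a: the data to plot; a Series, 1-D array, or list.
  - bins: controls the number of histogram bins.
  - hist: bool, whether to draw the histogram.
  - kde: bool, whether to draw a Gaussian kernel density estimate curve. The Gaussian kernel is $K(x, x_c) = \exp\left[-\|x - x_c\|^2 / (2\sigma^2)\right]$, where $x_c$ is the kernel center and $\sigma$ is the width parameter.
  - rug: bool, whether to draw a rug plot on the support axis.
  - fit: random variable object, optional. An object with a fit method that returns a tuple which can be passed to a pdf method as positional arguments, following a grid of values on which to evaluate the pdf.
- stats.probplot (Q-Q plot): computes the quantiles for a probability plot and optionally displays the plot.
- plt.subplots(): creates a figure and a grid of subplots, so each plot can be drawn at a specified position.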
from scipy import stats  # needed for stats.probplot
f, ax = plt.subplots(1, 2, figsize=(13.2, 5.6), dpi=100)
sns.distplot(Train_data['label'], bins=20, kde_kws={'color': 'r', 'lw': 1.5}, ax=ax[0])
stats.probplot(Train_data['label'], fit=True, plot=ax[1])
ax[0].set(xlabel='Heartbeat class', ylabel='Probability density', title='Heartbeat class distribution')
ax[1].set(xlabel='Theoretical quantiles', ylabel='Heartbeat class', title='Q-Q plot')
plt.grid(linestyle='--')  # add grid lines (applies to the current axes, ax[1])
plt.show()
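The curve that the kde option draws is an average of Gaussian kernels, one centered on each data point. A rough sketch of that idea in plain NumPy, on synthetic data and with a hand-picked width (seaborn chooses its bandwidth automatically, so this is not its exact implementation; `gaussian_kde_1d` is a hypothetical helper name):

```python
import numpy as np

def gaussian_kde_1d(samples, grid, sigma):
    """Average one normalized Gaussian bump per sample, evaluated on grid."""
    diffs = grid[:, None] - samples[None, :]        # shape (n_grid, n_samples)
    bumps = np.exp(-diffs**2 / (2.0 * sigma**2))    # Gaussian kernel for every pair
    bumps /= sigma * np.sqrt(2.0 * np.pi)           # scale so each bump integrates to 1
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(size=200)                      # synthetic data, illustration only
grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_1d(samples, grid, sigma=0.3) # sigma picked by hand

dx = grid[1] - grid[0]
area = density.sum() * dx                           # Riemann-sum check of the total mass
print(f"area under the estimated density: {area:.4f}")
```

A larger sigma gives a smoother curve and a smaller one follows the individual points more closely.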
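- The label distribution is non-normal; it is described here as a Johnson distribution. The plots below overlay a normal and a log-normal fit for comparison.
import scipy.stats as st
y = Train_data['label']  # target column used in the fits below
plt.figure(1); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)  # normal fit
plt.figure(2); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)  # log-normal fit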
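The fitted-curve plots can be complemented with numbers. A minimal sketch, assuming synthetic four-class labels with made-up class probabilities (not the real label counts): it fits a normal distribution with `scipy.stats.norm.fit` and runs the D'Agostino-Pearson normality test `scipy.stats.normaltest`, which is a technique swapped in here rather than something the original tutorial uses.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic 4-class labels; the class probabilities are made up for illustration.
y_demo = rng.choice([0.0, 1.0, 2.0, 3.0], size=10_000, p=[0.6, 0.1, 0.2, 0.1])

# Maximum-likelihood normal fit: returns the sample mean and standard deviation.
mu, sigma = st.norm.fit(y_demo)

# Normality test based on skewness and kurtosis; a small p-value rejects normality.
statistic, p_value = st.normaltest(y_demo)

print(f"fitted mean={mu:.3f}, std={sigma:.3f}")
print(f"normaltest statistic={statistic:.1f}, p-value={p_value:.3g}")
```

For a discrete label like this one the test rejects normality decisively, which agrees with what the Q-Q plot shows visually.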
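1.2.4.2 Skewness and kurtosis
- Skewness describes how asymmetric a distribution is; the larger its absolute value, the more skewed the data. A positive value indicates a longer right tail, a negative value a longer left tail.
- Kurtosis describes how peaked or flat a distribution is relative to the normal distribution. pandas reports excess kurtosis, so a normal distribution scores 0.
- Viewing skewness and kurtosis (skewness is returned first):
# numeric_only=True skips the string column heartbeat_signals (required in pandas 2.0+)
Train_data.skew(numeric_only=True), Train_data.kurt(numeric_only=True)
(id       0.000000
 label    0.871005
 dtype: float64,
 id      -1.200000
 label   -1.009573
 dtype: float64)
- Visualizing skewness and kurtosis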
f, ax = plt.subplots(1, 2, figsize=(13.5, 5.2), dpi=100)
# numeric_only=True skips the string column heartbeat_signals
f1 = sns.distplot(Train_data.kurt(numeric_only=True), color='g', axlabel='Kurtosis', kde_kws={"color": "r", "lw": 1.5}, ax=ax[0])
f2 = sns.distplot(Train_data.skew(numeric_only=True), color='blue', axlabel='Skewness', kde_kws={"color": "r", "lw": 1.5}, ax=ax[1])
ax[0].set(xlabel='Kurtosis', ylabel='Probability density', title='Kurtosis distribution')
ax[1].set(xlabel='Skewness', ylabel='Probability density', title='Skewness distribution')
f1.grid(linestyle='--')
f2.grid(linestyle='--')
plt.show()
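The id column's values of 0 and -1.2 are what an evenly spaced sequence should give, since a uniform distribution is symmetric and has excess kurtosis -1.2. A short sketch that reproduces this and cross-checks pandas against scipy on synthetic data (with bias=False, scipy applies the same sample-size correction as pandas):

```python
import numpy as np
import pandas as pd
from scipy import stats

# An evenly spaced id column like the one in the training set.
ids = pd.Series(np.arange(100_000), name="id")
print(ids.skew(), ids.kurt())   # close to 0 and -1.2

# Synthetic right-skewed sample, for illustration only.
rng = np.random.default_rng(0)
sample = pd.Series(rng.exponential(size=2_000))

pandas_skew = sample.skew()
scipy_skew = stats.skew(sample.to_numpy(), bias=False)
pandas_kurt = sample.kurt()
scipy_kurt = stats.kurtosis(sample.to_numpy(), fisher=True, bias=False)

print(f"skewness: pandas={pandas_skew:.4f}, scipy={scipy_skew:.4f}")
print(f"excess kurtosis: pandas={pandas_kurt:.4f}, scipy={scipy_kurt:.4f}")
```

Because id is just a row counter, its skewness and kurtosis carry no information about the signals; only the label values are meaningful here.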
1.2.4.3 Frequency of each target value
plt.hist(Train_data['label'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
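The histogram shows the frequencies only graphically. To read off exact counts, value_counts() (mentioned in section 1.1 for discrete data) is one option. A minimal sketch on a made-up label series; on the real data the same calls would be applied to Train_data['label']:

```python
import pandas as pd

# Made-up labels, for illustration only.
labels = pd.Series([0.0, 0.0, 2.0, 0.0, 3.0, 2.0, 1.0, 0.0], name="label")

counts = labels.value_counts().sort_index()                 # rows per class
shares = labels.value_counts(normalize=True).sort_index()   # fraction per class

print(pd.DataFrame({"count": counts, "share": shares}))
```

Sorting by index lists the classes in label order rather than by frequency, which makes class imbalance easier to compare across datasets.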
References
1. https://blog.csdn.net/AvenueCyy/article/details/104405747 Exploratory data analysis
2. http://joyfulpandas.datawhale.club/Content/ch7.html Pandas tutorial: missing data
3. https://cloud.tencent.com/developer/article/1512635 Seaborn series | histogram with distplot()
4. https://www.cntofu.com/book/172/docs/24.md seaborn documentation (Chinese translation)