II. Exploratory Data Analysis (EDA)


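  • II. Exploratory Data Analysis (EDA)
      • Source
    • 1.1 EDA and Its Goals
    • 1.2 Loading the Data and a First Overview
      • 1.2.1 Loading the Data and Viewing the First and Last Rows
      • 1.2.2 Shape, Summary Statistics, and Data Types
      • 1.2.3 Outlier and Missing-Value Distribution and Visualization
      • 1.2.4 Distribution of the Target
        • 1.2.4.1 Overall Distribution
        • 1.2.4.2 Skewness and Kurtosis
        • 1.2.4.3 Frequency of Each Target Value
    • References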

Source

Datawhale Session 23, Data Mining: Heartbeat Detection:
https://github.com/datawhalechina/team-learning-data-mining/tree/master/HeartbeatClassification
Authors: 鱼佬, 杜晓东, 张晋, 王皓月, 牧小熊, 姚昱君, 杨梦迪

Forum thread: http://datawhale.club/t/topic/1574

import missingno as msno
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats  # needed for stats.probplot in section 1.2.4.1

1.1 EDA and Its Goals

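  • EDA mainly consists of distribution analysis, summary-statistics analysis, and correlation analysis.

  • Goal:

    • Understand the data thoroughly

      • External information: what the data means in the real world, plus prior knowledge (from web searches or domain experts)
      • Internal information: the properties of the data itself (via statistics and charts)
  • Internal information includes:

    • Single-feature analysis

      • Continuous data: statistics (distribution shape / normality / data transformation / runs test), visualization (histogram / box plot / scatter plot), type conversion (binning continuous data into discrete data)
      • Discrete data: statistics (value_counts() to see how the values are distributed), visualization (pie chart / bar chart)
    • Two-feature analysis
      • Continuous vs. continuous: statistics (covariance / correlation coefficient), visualization (scatter plot / line plot, the latter for time-related features)
      • Discrete vs. discrete: statistics (Kendall correlation coefficient / chi-square test), visualization (point-and-line plots of counts and frequencies)
    • Multi-feature analysis
      • Several continuous features: scatter plots / analysis of variance
      • Mixed continuous and discrete features: bubble chart
      • Several discrete features: break down the counts and frequencies of each category, then plot
      • Heatmap: pairwise similarity between features; the similarity measure can be chosen from three correlation coefficients (typically Pearson, Spearman, and Kendall)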
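The pairwise comparison behind the heatmap item above can be sketched as follows. This is only an illustration on synthetic data, not the competition data: the frame toy and its columns x, y, z are made up here, and the Kendall option is assumed to have SciPy available, since pandas delegates to it.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): y depends linearly on x, z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "y": 2.0 * x + rng.normal(scale=0.5, size=500),
    "z": rng.normal(size=500),
})

# One pairwise correlation matrix per coefficient supported by DataFrame.corr.
corr = {m: toy.corr(method=m) for m in ("pearson", "spearman", "kendall")}

for name, matrix in corr.items():
    print(name)
    print(matrix.round(2))
```

Any of these matrices can then be passed to sns.heatmap(corr["pearson"], annot=True) to draw the heatmap itself.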

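1.2 Loading the Data and a First Overview

1.2.1 Loading the Data and Viewing the First and Last Rows

  • data.head() and data.tail() show the first and last 5 rows of the dataset
path = 'C:/Users/GJX/Desktop/Datawhale-学习/23期/心电图/data/'
Train_data = pd.read_csv(path+'train.csv')
Test_data= pd.read_csv(path+'testA.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([Train_data.head(), Train_data.tail()])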
id   heartbeat_signals   label
0   0   0.9912297987616655,0.9435330436439665,0.7646772997256593,0.6185708990212999,0.3796321642826237,0.19082233510621885,0.040237131594430715,0.02599520771717858,0.03170886048677242,0.06552357497104398,0.12553088566683082,0.14674736762087337,0.16765635354203254,0.19337353075154495,0.22613482558418235,0.2211427948707646,0.23606736350657742,0.2211427948707646,0.2211427948707646,0.21110661221417562,0.20858662883955462,0.19337353075154495,0.19592021355822875,0.1984624088145674,0.18570638844539308,0.19592020417425474,0.18314160533045887,0.19337353075154495,0.19082233510621885,0.20858662883955462,0.2211427948707646,0.2508391000672623,0.2606035735248363,0.27753397418529446,0.2942679945470305,0.3037438606924122,0.3364276621747203,0.3479233126336631,0.38410562561692113,0.3863371817756788,0.4084648300418338,0.4106590521686313,0.42592580507675887,0.42592580507675887,0.4291763526701312,0.4324195928902589,0.42809365036122277,0.42809365036122277,0.4128499421288813,0.41503751002775946,0.397443445265033...     0.0
1   1   0.9714822034884503,0.9289687459588268,0.5729328050711678,0.1784566262750076,0.1229615224365985,0.13236021729815928,0.09439236984499814,0.08957535516351411,0.030480606866741047,0.04049936195430977,0.02039178866617264,0.027965007816101565,0.03549867995241705,0.015320799491921583,0.045482770405584606,0.03549867995241705,0.03549867995241705,0.025445016021896714,0.030480606866741047,0.030480606866741047,0.030480606866741047,0.030480606866741047,0.025445016681959063,0.02039178866617264,0.015320799491921583,0.04049936195430977,0.025445016021896714,0.03549867995241705,0.06524602377453673,0.08957535516351411,0.099193354503267,0.14169808852070362,0.14169808852070362,0.14402314068245697,0.1463444518106101,0.1784566262750076,0.1920031630840051,0.1920031630840051,0.20542368372531994,0.2231256795470371,0.2275174563894053,0.21872051112824814,0.23189588627264138,0.20986959655861195,0.20319558825083509,0.1964905619662614,0.16478169390091,0.15559252550973682,0.15097589898583877,0.1463444518106101,0....     0.0
2   2   1.0,0.9591487564065292,0.7013782792997189,0.23177753487886463,0.0,0.08069805776387916,0.12837603937503544,0.18744837555079963,0.28082571505275855,0.3282610568488903,0.32046267784302424,0.3224162366449385,0.3243671537022964,0.3224162366449385,0.33214447844788714,0.3243671537022964,0.31654764084501213,0.32046267784302424,0.31654764084501213,0.3224162366449385,0.31850648737156023,0.3106550752297612,0.3145861223979474,0.31850648737156023,0.3282610568488903,0.33214447844788714,0.3340822933251979,0.34565477103910563,0.3723016231352367,0.3928985088227811,0.4003159884456723,0.41320550604017453,0.42598088340935963,0.4287038055313511,0.43142159812868963,0.41686715839221355,0.41320550604017453,0.40585416986519723,0.40585416986519723,0.3910381893598958,0.39475646524206204,0.38731027267857376,0.38357273107588824,0.3817003064705676,0.37888712165261346,0.3760684405425128,0.38544272867487156,0.38544272867487156,0.38544272867487156,0.38917543513362857,0.38917543513362857,0.38544272867487156,0.37982...     2.0
3   3   0.9757952826275774,0.9340884687738161,0.6596366611990001,0.2499208267606008,0.23711575621286213,0.28144491730834825,0.2499208267606008,0.2499208267606008,0.24139674778512604,0.2306703464848836,0.2241960118573745,0.228515464684134,0.23282202359265147,0.23497049640533801,0.22635735942834095,0.21551813249934026,0.21551813249934026,0.21769249154853543,0.20897528621177913,0.19579989661088618,0.19579989661088618,0.1913812125046129,0.18472772577951438,0.17804341197656084,0.16908255648094667,0.16458116075056775,0.16232518985680355,0.137273812861256,0.14414925943232418,0.137273812861256,0.11411609355032969,0.09058058082100903,0.06906525870002136,0.06424018598748543,0.0593989217214407,0.034945879760208715,0.005044393297922133,0.0,0.017578747467928164,0.04232541063433138,0.08582704715228397,0.12805528546314976,0.16908255648094667,0.2067877206238659,0.21660572165978373,0.22635735942834095,0.24566509183217283,0.2668193465709445,0.27310560988449545,0.26891781855808394,0.27101324264570303,0.28973...     0.0
4   4   0.0,0.055816398940721094,0.26129357194994196,0.35984696254197834,0.43314263962884686,0.45369772898632504,0.49900406742109477,0.5427959768500487,0.6169044962835193,0.6766958323316207,0.7378818829055369,0.7554731067740066,0.7728500804863674,0.7741753908028428,0.7861734357934297,0.7909617169519071,0.8102666498706994,0.8213265156641842,0.8419185526427072,0.8628786774259445,0.8739753377637767,0.8892311346121539,0.8993416699311979,0.9093765560123989,0.9133558892299791,0.9196760305196715,0.9196723283106785,0.9124058665196302,0.8974838054797518,0.892449880952715,0.8661915608066262,0.8401030990072569,0.8286366753686091,0.815342528433036,0.801922073839424,0.7731368441418528,0.773943748662296,0.7445842840013923,0.7361862004120101,0.7278475921473763,0.7322235659799561,0.7049354480540089,0.7095029330860465,0.7077633854547168,0.7108565486012726,0.7139415073241846,0.7089537499714106,0.7101403569941387,0.7032846702149288,0.7050294020852683,0.7000108231737793,0.6980378401196279,0.695195710827144,0....     2.0
99995   99995   1.0,0.677705342021188,0.22239242747868546,0.2571578307224994,0.20469042415279454,0.05466497618736314,0.026152286890497062,0.11818142707296006,0.24483757081121627,0.3289485158861968,0.3612662393002452,0.3692338152135465,0.3771576301414996,0.39064084657121523,0.38166605147504046,0.3861604278730283,0.40178139358056136,0.4128366039514504,0.41503751002775946,0.40842466093154955,0.4039992167307407,0.4216201541053901,0.4249002472095003,0.42817289964943017,0.4238077112517939,0.4346960170320757,0.44334791072667395,0.44334791072667395,0.44765446213849447,0.45836493196059325,0.4626269547531633,0.46899647358123736,0.46899647358123736,0.4742830236507332,0.4795502726166403,0.49002739679124524,0.4879380500650448,0.4921137220489549,0.5025003304125567,0.5230518999566728,0.5311912370235234,0.5352437500089933,0.543314814156555,0.5533405227389467,0.5593226750559254,0.5652801247011404,0.569238134528172,0.5771217550018541,0.5888669514585284,0.5849625007211562,0.5632970403005845,0.5573313791309694,0.5473...     0.0
99996   99996   0.9268571578157265,0.9063471198026871,0.6369932212888393,0.41503751002775946,0.37474480119929776,0.3825812845814957,0.35894293360916163,0.34135861850914284,0.3365254578264915,0.3170292884548231,0.29228262540259375,0.2864468160019497,0.2805873044059686,0.2687963811344823,0.24492145814626645,0.22588141240133217,0.208351475079585,0.18881932118654174,0.15444626964619396,0.12671905345249726,0.10035052363303557,0.0792889099483413,0.06961285122234513,0.05987145737563155,0.034231634226255754,0.018223757989263702,0.006100199547756366,0.0,0.008127880898497107,0.042169453668894775,0.0676698253788931,0.09463687430869291,0.12671905345249726,0.15444626481736634,0.18165061886731076,0.2310991107890251,0.27386144345060753,0.307181505689795,0.3349108108012508,0.3605309216239204,0.37002238421616246,0.3778844911406192,0.38881997353332987,0.39812772906374444,0.39812772906374444,0.39812772906374444,0.39812772906374444,0.3950318004560962,0.40583856004882424,0.41809081689340327,0.4196150504935423,0.430239...     2.0
99997   99997   0.9258351628306013,0.5873839035878395,0.6332261741951388,0.6323533645350808,0.6392827243034813,0.6142923239940205,0.5991551019747257,0.5176324324889339,0.4038033525475481,0.2531748788594435,0.013519441127279834,0.025638558505711485,0.03705234633905498,0.11669339776335014,0.11759388804025704,0.1552665234331609,0.2399151427891965,0.24175478159679248,0.21102295363616005,0.1891698778351716,0.20565849754093615,0.21451222943478682,0.2133064043203807,0.212085396542634,0.22091079609811587,0.21847999205830745,0.21744462158190533,0.2167611384978616,0.2098287268315808,0.20565849754093615,0.19919893110020848,0.2073944541586703,0.212085396542634,0.21380991021662776,0.2155287604547542,0.21380990689430834,0.23161798828669153,0.23069809802709923,0.25121190105757335,0.27337248192957914,0.2843963989758637,0.31710999788513283,0.3504182388524873,0.36157188616365815,0.37473102778466755,0.3877688916918828,0.4037961057981058,0.4160469383364878,0.4251865797948851,0.41453205902085904,0.40941055354379485,0....     3.0
99998   99998   1.0,0.9947621698382489,0.8297017704865509,0.45819277171637834,0.26416169623741237,0.24022845026183584,0.21376575735540573,0.18929103849637752,0.20381573166587716,0.21086610220048516,0.190781903999013,0.18315266786292242,0.17548283434154238,0.1719202391788071,0.14480739621989083,0.12712256152820978,0.12367351932025483,0.10805993008957318,0.08342031774735097,0.06573228274169252,0.05506601146864584,0.043824017158610444,0.030206105177531178,0.016458299017634,0.0007163387656620452,0.005716803856722555,0.004021298881707146,0.0017004301572025823,0.024590351432569123,0.05309127643975563,0.09306172700835041,0.12291571612873288,0.1645792079804964,0.19003666117847653,0.21505245699261524,0.2449353266164658,0.25104409561890884,0.26638881817117777,0.28509026834456125,0.29432487352018755,0.2857629833660126,0.28796241582668164,0.2989825195095478,0.29848023837086235,0.2918423132813349,0.2851729871398236,0.28326423170204107,0.2915479296274273,0.28855142223257646,0.28318472801293526,0.279873906737311...     2.0
99999   99999   0.9259994004527861,0.916476635326053,0.4042900774399834,0.0,0.2630344094167657,0.3854310437765884,0.3610665021846972,0.33270794046870034,0.33985000288462475,0.3504972538285509,0.34872815423107756,0.3469568826178959,0.3451834336490557,0.3416300161159092,0.3540289245985311,0.3540289245985311,0.3504972538285509,0.33985000288462475,0.3522641867306462,0.35579154003393343,0.34872815423107756,0.3469568826178959,0.3522641698902398,0.3575520046180837,0.3575520046180837,0.35579154003393343,0.35931032359378573,0.3645724256169181,0.3680698831948491,0.3663222142458158,0.3645724256169181,0.37851163648316444,0.3854310437765884,0.3880172853451348,0.3905988989929319,0.39574834125134917,0.40769264218340895,0.41616415828881326,0.41785251488589786,0.4279413389014036,0.4412842655939598,0.4479097361339566,0.4479097361339566,0.4446008200300382,0.44956138405850743,0.45450495010798597,0.44625624251214385,0.42458621984429695,0.39746071299791125,0.37851163648316444,0.3575520046180837,0.33270794046870034,0.31...     0.0

1.2.2 Shape, Summary Statistics, and Data Types

  • Number of rows and columns
print("Training set shape:", Train_data.shape)
print("Test set shape:", Test_data.shape)
Training set shape: (100000, 3)
Test set shape: (20000, 2)

  • Count, mean, standard deviation (std), min/max, and quartiles, using the training set as an example:
Train_data.describe()
id   label
count   100000.000000   100000.000000
mean    49999.500000    0.856960
std     28867.657797    1.217084
min     0.000000        0.000000
25%     24999.750000    0.000000
50%     49999.500000    0.000000
75%     74999.250000    2.000000
max     99999.000000    3.000000

  • Data types, again using the training set as an example
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
id                   100000 non-null int64
heartbeat_signals    100000 non-null object
label                100000 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.3+ MB
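The info() output shows heartbeat_signals stored as object, that is, one comma-separated string per row, which is why describe() only covers id and label. A minimal sketch of turning such strings into numeric columns is given below; the two rows and the column names s_0, s_1, ... are invented for illustration and are not taken from train.csv.

```python
import pandas as pd

# Two made-up rows that mimic the layout of train.csv (values are illustrative only).
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.99,0.94,0.76,0.0", "1.0,0.5,0.25,0.0"],
    "label": [0.0, 2.0],
})

# Split each string on commas and cast the pieces to float:
# the result has one numeric column per sample point of the signal.
signals = toy["heartbeat_signals"].str.split(",", expand=True).astype(float)
signals.columns = [f"s_{i}" for i in range(signals.shape[1])]  # hypothetical names

print(signals.dtypes.unique())
print(signals.describe().loc[["min", "max"]])
```

Once the signal is numeric, the same describe(), skew(), and kurt() calls used elsewhere in this section apply to it as well.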

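1.2.3 Outlier and Missing-Value Distribution and Visualization

  • .isna() or .isnull() shows whether each cell is missing
  • .isnull().sum() counts the missing values in each column
  • Missing values can be dropped with df.dropna()
    • Main parameters: axis (default 0, i.e. drop rows); how, the drop rule, which takes either any or all; thresh, the minimum number of non-missing values (rows or columns with fewer non-missing values than this are dropped); and subset, the labels to check for missing values.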
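Since the training set here has no missing values, the effect of the dropna() parameters is easier to see on a small frame. The sketch below uses a made-up DataFrame named toy; only the pandas calls themselves come from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values scattered across rows and columns.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
})

missing_per_column = toy.isnull().sum()   # number of NaN values in each column

drop_any = toy.dropna(how="any")          # drop rows containing at least one NaN
drop_all = toy.dropna(how="all")          # drop rows where every value is NaN
keep_two = toy.dropna(thresh=2)           # keep rows with at least 2 non-missing values
by_subset = toy.dropna(subset=["a"])      # only column "a" is checked for NaN
drop_cols = toy.dropna(axis=1, thresh=3)  # keep columns with at least 3 non-missing values

print(missing_per_column)
print(keep_two)
```

Note that thresh counts non-missing values, so a larger thresh is stricter.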

Train_data.isnull().sum()
id                   0
heartbeat_signals    0
label                0
dtype: int64

  • Visualizing missing values
# Visualize the missing values
msno.matrix(Train_data.sample(100000))


msno.bar(Train_data.sample(1000))  # bar chart to check for missing values


1.2.4 Distribution of the Target

1.2.4.1 Overall Distribution

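  • seaborn.distplot(): draws a histogram, optionally overlaid with a fitted distribution and a kernel density estimate

    • Function signature
    seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)
    
    • Key parameters
    • a: the data to examine; a Series, 1-D array, or list.
      bins: controls the number of histogram bins.
      hist: bool, whether to draw the histogram.
      kde: bool, whether to draw a Gaussian kernel density estimate curve. The Gaussian kernel is K(x, xc) = exp[-||x - xc||^2 / (2*σ)^2], where xc is the kernel center and σ is the width parameter.
      rug: bool, whether to draw a rugplot along the support axis.
      fit: random variable object, optional. An object with a fit method that returns a tuple of parameters; the tuple is passed as positional arguments to the object's pdf method, after a grid of values on which to evaluate the pdf.
  • stats.probplot (Q-Q plot): computes the quantiles for a probability plot and optionally draws the plot

  • plt.subplots() creates a figure together with a grid of subplots, so the layout and the position of each plot are specified directly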

f,ax = plt.subplots(1,2,figsize = (13.2,5.6),dpi = 100)
sns.distplot(Train_data['label'],bins = 20,kde_kws = {'color':'r','lw':1.5},ax = ax[0])
stats.probplot(Train_data['label'],fit = True,plot = ax[1])
ax[0].set(xlabel = 'Heartbeat class',ylabel = 'Probability density',title = 'Distribution of heartbeat classes')
ax[1].set(xlabel = 'Theoretical quantiles',ylabel = 'Heartbeat class',title = 'Q-Q plot')
plt.grid(linestyle = '--')     # add grid lines
plt.show()
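The Q-Q plot can also be read numerically: with fit=True, stats.probplot returns the correlation coefficient r of the least-squares line through the plotted points, and r close to 1 means the sample quantiles follow the normal quantiles. The sketch below compares a normal sample with a four-valued label; both samples are synthetic, and the class probabilities are hypothetical rather than the real class shares.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A genuinely normal sample, for reference.
normal_sample = rng.normal(size=1000)

# A discrete label with four classes; the probabilities are made up for illustration.
label_sample = rng.choice([0.0, 1.0, 2.0, 3.0], size=1000, p=[0.6, 0.1, 0.1, 0.2])

# Without a plot argument, probplot only computes the quantiles and the fitted line.
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm", fit=True)
(_, _), (_, _, r_label) = stats.probplot(label_sample, dist="norm", fit=True)

print(f"r for the normal sample: {r_normal:.4f}")
print(f"r for the discrete label: {r_label:.4f}")
```

A discrete target produces a staircase in the Q-Q plot, so its r falls below that of the normal sample.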

  • The class distribution is non-normal; it is closer to a Johnson distribution

import scipy.stats as st
y = Train_data['label']  # the target whose distribution is being fitted
plt.figure(1); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)  # normal fit
plt.figure(2); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)  # log-normal fit


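1.2.4.2 Skewness and Kurtosis

  • Skewness describes how asymmetric a distribution is; the larger its absolute value, the stronger the skew.
  • Kurtosis describes how peaked or flat a distribution is, i.e. whether the data are more sharply peaked or flatter than a normal distribution.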
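To get a feel for these two statistics, the following sketch evaluates them on data whose shape is known in advance. The series are synthetic; note that pandas reports excess kurtosis, so a normal distribution scores about 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Evenly spread integers, like a running id column: symmetric and flat.
ids = pd.Series(np.arange(100_000))

# Exponential data: a long right tail, hence positive skewness and kurtosis.
right_skewed = pd.Series(rng.exponential(size=10_000))

summary = pd.DataFrame({
    "skew": [ids.skew(), right_skewed.skew()],
    "kurt": [ids.kurt(), right_skewed.kurt()],
}, index=["ids", "right_skewed"])

print(summary.round(3))
```

The evenly spread series has skewness 0 and excess kurtosis of about -1.2, the signature of a uniform distribution.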

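  • Checking skewness and kurtosis
Train_data.skew(numeric_only=True), Train_data.kurt(numeric_only=True)  # skip the string column heartbeat_signals
(id       0.000000
 label    0.871005
 dtype: float64,
 id      -1.200000
 label   -1.009573
 dtype: float64)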

  • Visualizing kurtosis and skewness
f, ax = plt.subplots(1, 2, figsize = (13.5, 5.2), dpi = 100)
f1 = sns.distplot(Train_data.kurt(numeric_only=True), color = 'g', axlabel = 'Kurtosis',kde_kws={"color": "r", "lw": 1.5,},ax = ax[0])
f2 = sns.distplot(Train_data.skew(numeric_only=True), color = 'blue', axlabel = 'Skewness',kde_kws={"color": "r", "lw": 1.5,},ax = ax[1])
ax[0].set(xlabel = 'Kurtosis', ylabel = 'Probability density', title = 'Kurtosis distribution')
ax[1].set(xlabel = 'Skewness', ylabel = 'Probability density', title = 'Skewness distribution')
f1.grid(linestyle = '--')
f2.grid(linestyle = '--')
plt.show()


1.2.4.3 Frequency of Each Target Value

plt.hist(Train_data['label'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
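The histogram shows the shape, but the exact count per class is easier to read from value_counts(). A small sketch on a made-up label series follows (the eight values are illustrative, not drawn from train.csv); on the real data the same calls would be applied to Train_data['label'].

```python
import pandas as pd

# Hypothetical labels using the same four classes as the competition target.
labels = pd.Series([0.0, 0.0, 2.0, 0.0, 3.0, 2.0, 0.0, 1.0], name="label")

# Absolute counts per class, ordered by class value rather than by frequency.
counts = labels.value_counts().sort_index()

# Relative frequencies: the same counts divided by the number of rows.
shares = labels.value_counts(normalize=True).sort_index()

print(pd.DataFrame({"count": counts, "share": shares}))
```

Class shares that differ strongly point to an imbalanced target, which matters when choosing a validation split and an evaluation metric.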

References

1. https://blog.csdn.net/AvenueCyy/article/details/104405747 Exploratory data analysis
2. http://joyfulpandas.datawhale.club/Content/ch7.html Pandas tutorial: missing data
3. https://cloud.tencent.com/developer/article/1512635 Seaborn series | Histograms with distplot()
4. https://www.cntofu.com/book/172/docs/24.md seaborn documentation (Chinese translation)
