Data Analysis and Classification Prediction on the Kaggle Heart Disease Dataset: Statistical Learning Lab Report

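1. Experiment Preparation

The data come from Kaggle and comprise 14 variables (columns) and 303 samples. The variables are described in the table below.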

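Variable	Description	Values
target	Presence of heart disease (categorical)	0 = no, 1 = yes
age	Age (continuous)	[29, 77]
sex	Sex (categorical)	1 = male, 0 = female
cp	Chest pain type (categorical)	1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
trestbps	Resting blood pressure (continuous, mm Hg)	[94, 200]
chol	Cholesterol (continuous, mg/dl)	[126, 564]
fbs	Fasting blood sugar > 120 mg/dl (categorical)	1 = true, 0 = false
restecg	Resting electrocardiogram result (categorical)	0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes' criteria
thalach	Maximum heart rate (continuous)	[71, 202]
exang	Exercise-induced angina (categorical)	1 = yes, 0 = no
oldpeak	ST depression induced by exercise relative to rest (continuous)	[0, 6.2]
slope	Slope of the peak exercise ST segment (categorical)	1 = upsloping, 2 = flat, 3 = downsloping
ca	Number of major vessels (continuous)	[0, 3]
thal	Thalassemia, a blood disorder (categorical)	1 = normal, 2 = fixed defect, 3 = reversible defect

Note: the category codes in this table start at 1 for cp, slope and thal. The CSV file summarized in section 2 uses zero-based codes for some columns (cp and thal both range over 0-3) and its ca column reaches 4.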

'''
-*- coding: utf-8 -*-
@Author     : DouGang
@E-mail     : dorza@qq.com
@Software   : PyCharm, Python3.6
@Time       : 2021-07-24
'''

Import the required libraries

# Libraries for exploring the dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries for preprocessing the dataset
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# K-nearest neighbors and evaluation metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score,recall_score,f1_score
from sklearn.metrics import precision_recall_curve,roc_curve,average_precision_score,auc
# Decision tree
from sklearn.tree import DecisionTreeClassifier
# Random forest
from sklearn.ensemble import RandomForestClassifier
# Logistic regression
from sklearn.linear_model import LogisticRegression
# SGD classifier
from sklearn.linear_model import SGDClassifier

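2. Data Overview

plt.rcParams['font.sans-serif'] = ['SimHei']    # figure font (SimHei can render Chinese characters)

heart_df = pd.read_csv("./dataSet/heart.csv")
print(heart_df.shape)   # dimensions of the data
print(heart_df.head())  # first 5 rows
heart_df.info()         # column types and non-null counts (info() prints directly and returns None)
print(heart_df.describe())      # summary statistics
print(heart_df.isnull().sum())  # missing-value check
sns.heatmap(heart_df.isnull())  # visual check for missing values
plt.show()
sns.pairplot(heart_df,hue='target')  # pairwise relationships, colored by target
plt.show()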

(303, 14)
   age  sex  cp  trestbps  chol  fbs  ...  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1  ...      0      2.3      0   0     1       1
1   37    1   2       130   250    0  ...      0      3.5      0   0     2       1
2   41    0   1       130   204    0  ...      0      1.4      2   0     2       1
3   56    1   1       120   236    0  ...      0      0.8      2   0     2       1
4   57    0   0       120   354    0  ...      1      0.6      2   0     2       1
[5 rows x 14 columns]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  target    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

              age         sex          cp  ...          ca        thal      target
count  303.000000  303.000000  303.000000  ...  303.000000  303.000000  303.000000
mean    54.366337    0.683168    0.966997  ...    0.729373    2.313531    0.544554
std      9.082101    0.466011    1.032052  ...    1.022606    0.612277    0.498835
min     29.000000    0.000000    0.000000  ...    0.000000    0.000000    0.000000
25%     47.500000    0.000000    0.000000  ...    0.000000    2.000000    0.000000
50%     55.000000    1.000000    1.000000  ...    0.000000    2.000000    1.000000
75%     61.000000    1.000000    2.000000  ...    1.000000    3.000000    1.000000
max     77.000000    1.000000    3.000000  ...    4.000000    3.000000    1.000000

[8 rows x 14 columns]
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
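
describe() reports only the minimum and maximum of each column, so it does not show exactly which category codes occur. One way to check is to list the distinct values per categorical column. The sketch below is a minimal illustration: `demo_df` is a small hand-made stand-in for `heart_df` (its values are invented, not taken from heart.csv), and `distinct_codes` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical stand-in for heart_df; the values are made up for illustration.
demo_df = pd.DataFrame({
    "cp":    [0, 1, 2, 3, 0, 2],
    "slope": [0, 2, 2, 1, 1, 2],
    "thal":  [1, 2, 2, 3, 0, 2],
    "ca":    [0, 0, 4, 1, 2, 3],
})

def distinct_codes(df, columns):
    """Return {column: sorted list of distinct values} for the given columns."""
    return {col: sorted(df[col].unique().tolist()) for col in columns}

codes = distinct_codes(demo_df, ["cp", "slope", "thal", "ca"])
for col, values in codes.items():
    print(col, values)
```

Running the same helper on the real `heart_df` shows whether a given copy of the file uses 0-based or 1-based codes before any label mapping is written.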

3. Descriptive Information

# Plot the correlation matrix of the variables
plt.figure(figsize=(10,10))
sns.heatmap(heart_df.corr(),annot=True,fmt='.1f')
plt.show()

# Age distribution of the samples
age_counts = heart_df['age'].value_counts()
sns.barplot(x=age_counts.index,y=age_counts.values)
plt.xlabel('Age')
plt.ylabel('Age Counter')
plt.title('Age Analysis System')
plt.show()

# Minimum, maximum and mean of the age column
minage = min(heart_df.age)
maxage = max(heart_df.age)
meanage = round(heart_df.age.mean(),2)
print('Minimum age:',minage)
print('Maximum age:',maxage)
print('Mean age:',meanage)
# Convert the continuous age variable into a categorical age group.
# Assigning through .loc avoids pandas chained-assignment problems.
heart_df['age_states'] = ''
heart_df.loc[(heart_df['age']>=29)&(heart_df['age']<40),'age_states']='young ages'
heart_df.loc[(heart_df['age']>=40)&(heart_df['age']<55),'age_states']='middle ages'
heart_df.loc[(heart_df['age']>=55)&(heart_df['age']<=77),'age_states']='old ages'
# Number of samples in each age group
print(heart_df['age_states'].value_counts())
'''countplot arguments: x / y name the column drawn along the x or y axis;
order sets the order of the categories on that axis;
hue splits the bars by a second categorical column; hue_order sets the order of the hue levels
'''
sns.countplot(x='age_states',data=heart_df,order=['young ages','middle ages','old ages'])
plt.xlabel('Age Range')
plt.ylabel('Age Counts')
plt.title('Age State in Dataset')
plt.show()

Minimum age: 29
Maximum age: 77
Mean age: 54.37
old ages       159
middle ages    128
young ages      16
Name: age_states, dtype: int64

'''The figure shows that the number of samples grows with the age group: 16 young, 128 middle-aged and 159 older subjects.
'''
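
The three age groups can also be produced in a single pass with `pd.cut`. The sketch below is an illustration only: it uses a short made-up `ages` series rather than the real column, and the upper bin edge of 78 is an assumption chosen so that the maximum age of 77 falls inside the last left-closed bin.

```python
import pandas as pd

# Made-up ages covering the boundaries of each group.
ages = pd.Series([29, 39, 40, 54, 55, 77], name="age")

# Left-closed bins: [29, 40), [40, 55), [55, 78)
age_states = pd.cut(
    ages,
    bins=[29, 40, 55, 78],
    right=False,
    labels=["young ages", "middle ages", "old ages"],
)
print(age_states.value_counts().sort_index())
```

Because the result is an ordered categorical, `value_counts().sort_index()` lists the groups from youngest to oldest without a separate `order` argument.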
# Sample counts by sex: 0 = female, 1 = male
print(heart_df['sex'].value_counts())
sns.countplot(y='sex',data=heart_df)
plt.title('Sex Count in Dataset')
plt.show()

1    207
0     96
Name: sex, dtype: int64

# Columns: heart disease status (target); rows: sex
pd.crosstab(heart_df['sex'],heart_df['target'])
# Relationship between sex and heart disease: 0 = female, 1 = male
pd.crosstab(heart_df['sex'],heart_df['target']).plot(kind="bar",figsize=(12,8),color=['#1CA53B','#AA1111'])
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('sex(0=female, 1=male)')
plt.xticks(rotation=0)
plt.legend(["Haven't Disease","Have Disease"])
plt.ylabel('Frequency')
plt.show()
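
Because the dataset contains 207 men and 96 women, raw counts make the two groups hard to compare. Normalizing each row of the cross-tabulation gives the share of each target value within a sex. The sketch below uses a tiny invented frame `demo` (not the real data) to show the call.

```python
import pandas as pd

# Invented example: 4 women (3 with target=1) and 6 men (2 with target=1).
demo = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# normalize="index" divides every row by its row total.
rates = pd.crosstab(demo["sex"], demo["target"], normalize="index")
print(rates.round(2))
```

Each row of `rates` sums to 1, so the column for target = 1 reads directly as the within-group proportion.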

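# Heart disease prediction: sex and disease analysis
# Distribution of the target
fig,axes = plt.subplots(1,2,figsize=(10,5))
ax = heart_df.target.value_counts().plot(kind="bar",ax=axes[0])
ax.set_title("Disease distribution")
ax.set_xlabel("1: disease, 0: no disease")
# value_counts() sorts by frequency; target=1 is the majority class here (mean 0.54), so it comes first
heart_df.target.value_counts().plot(kind="pie",autopct="%.2f%%",labels=['disease','no disease'],ax=axes[1])
plt.show()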

# Distribution of sex and disease
ax1 = plt.subplot(121)
ax = sns.countplot(x="sex",hue='target',data=heart_df,ax=ax1)
ax.set_xlabel("0: female, 1: male")
# Map the codes to names first so the pie labels follow the data instead of a hard-coded order
sex_names = heart_df['sex'].map({0:'female',1:'male'})
ax2 = plt.subplot(222)
sex_names[heart_df['target'] == 0].value_counts().plot(kind="pie",autopct="%.2f%%",ax=ax2)
ax2.set_title("Sex ratio among subjects without disease")
ax2 = plt.subplot(224)
sex_names[heart_df['target'] == 1].value_counts().plot(kind="pie",autopct="%.2f%%",ax=ax2)
ax2.set_title("Sex ratio among subjects with disease")
plt.show()

fig,axes = plt.subplots(2,1,figsize=(20,10))
sns.countplot(x="age",hue="target",data=heart_df,ax=axes[0])
# Age groups (left-closed bins): [0, 45) young, [45, 60) middle-aged, [60, 100) older
age_type = pd.cut(heart_df.age,bins=[0,45,60,100],include_lowest=True,right=False,labels=['young','middle-aged','older'])
age_target_df = pd.concat([age_type,heart_df.target],axis=1)
sns.countplot(x="age",hue='target',data=age_target_df,ax=axes[1])
plt.show()

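# Look at the distribution of every feature at once
fig,axes = plt.subplots(7,2,figsize=(10,20))
for x in range(0,14):
    # histplot with kde=True stands in for the deprecated distplot (needs seaborn >= 0.11)
    sns.histplot(heart_df.iloc[:,x],kde=True,ax=axes.flat[x])
plt.tight_layout()
plt.show()

plt.figure(figsize=(8,5))
# age_states is a string column, so restrict the correlation to the numeric columns
sns.heatmap(heart_df.select_dtypes('number').corr(),cmap="Blues",annot=True)
plt.show()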
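
A heatmap with 14 variables is dense to read. When the interest is mainly in how each variable relates to `target`, the corresponding column of the correlation matrix can be extracted and sorted by absolute value. The sketch below runs on synthetic data (the columns `strong`, `weak` and `noise` are invented for illustration); with the real data the same expression would be applied to `heart_df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic columns with different degrees of association with target.
demo = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=n),  # follows target closely
    "weak":   target + rng.normal(0, 3.0, size=n),  # mostly noise
    "noise":  rng.normal(0, 1.0, size=n),           # unrelated to target
    "target": target,
})

# Pearson correlation of every numeric column with target, strongest first.
corr_with_target = (
    demo.select_dtypes("number")
        .corr()["target"]
        .drop("target")
        .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target.round(2))
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.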

4. Feature Preprocessing

# Data preprocessing
# Note: heart_df still carries the derived age_states column from section 3;
# it is one-hot encoded below together with the other string columns.
features = heart_df.drop(columns=['target'])
targets = heart_df['target']
# Turn the discrete columns from plain 0, 1, 2 codes into descriptive strings.
# replace() leaves any code that is missing from a mapping unchanged; the CSV summarized
# in section 2 contains such codes (for example cp = 0), which then stay numeric.
# sex
features['sex'] = features['sex'].replace({0:'female',1:'male'})
# cp
features['cp'] = features['cp'].replace({1:'typical',2:'atypical',3:'non-anginal',4:'asymptomatic'})
# fbs
features['fbs'] = features['fbs'].replace({1:'true',0:'false'})
# exang
features['exang'] = features['exang'].replace({1:'true',0:'false'})
# slope: one distinct label per code, following the variable table
features['slope'] = features['slope'].replace({1:'upsloping',2:'flat',3:'downsloping'})
# thal: 1 = normal, 2 = fixed defect, 3 = reversible defect, following the variable table
features['thal'] = features['thal'].replace({1:'normal',2:'fixed',3:'reversable'})
# restecg
# 0: normal, 1: ST-T wave abnormality, 2: possible left ventricular hypertrophy
features['restecg'] = features['restecg'].replace({0:'normal',1:'ST-T abnormal',2:'Left ventricular hypertrophy'})
# ca and thal: astype returns a new Series, so the result has to be assigned back
features['ca'] = features['ca'].astype("object")
features['thal'] = features['thal'].astype("object")
print(features.head())
# One-hot encode the object columns, then standardize every column
features = pd.get_dummies(features,dtype=float)
features_temp = StandardScaler().fit_transform(features)
# features_temp = StandardScaler().fit_transform(pd.get_dummies(features))
# No random_state is set, so the split (and the scores in section 5) changes from run to run
X_train,X_test,y_train,y_test = train_test_split(features_temp,targets,test_size=0.25)
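
The preprocessing above fits `StandardScaler` on all 303 rows before the train/test split, and section 5.1 passes the same scaled matrix to `cross_val_score`, so statistics from held-out rows influence the scaling. A common alternative, sketched below under stated assumptions, is to put the encoding and scaling inside a scikit-learn `Pipeline` so they are fitted on training rows only. This is not the author's method: the frame `demo`, its synthetic label `y`, the column split, and `random_state=0` are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the heart data (values are random, not real).
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "age": rng.integers(29, 78, size=n),
    "thalach": rng.integers(71, 203, size=n),
    "cp": rng.integers(0, 4, size=n),
    "sex": rng.integers(0, 2, size=n),
})
# Synthetic label loosely tied to thalach so there is something to learn.
y = (demo["thalach"] + rng.normal(0, 20, size=n) > 140).astype(int)

numeric_cols = ["age", "thalach"]
categorical_cols = ["cp", "sex"]

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("prep", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified split keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo, y, test_size=0.25, stratify=y, random_state=0
)
model.fit(X_tr, y_tr)          # scaler and encoder see training rows only
test_accuracy = model.score(X_te, y_te)
print("test accuracy: %.2f" % test_accuracy)
```

Passing such a pipeline to `cross_val_score` refits the preprocessing inside every fold, which is the main reason to prefer it over scaling once up front.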

5. Classification with Several Methods and Algorithm Evaluation

5.1 K-Nearest Neighbors Prediction

def plotting(estimator,y_test):
    # Draw the precision-recall curve and the ROC curve on the test set (reads the global X_test)
    fig,axes = plt.subplots(1,2,figsize=(10,5))
    y_predict_proba = estimator.predict_proba(X_test)
    precisions,recalls,thresholds = precision_recall_curve(y_test,y_predict_proba[:,1])
    # Recall on the x-axis and precision on the y-axis, matching the axis labels
    axes[0].plot(recalls,precisions)
    axes[0].set_title("Average precision: %.2f"%average_precision_score(y_test,y_predict_proba[:,1]))
    axes[0].set_xlabel("Recall")
    axes[0].set_ylabel("Precision")
    fpr,tpr,thresholds = roc_curve(y_test,y_predict_proba[:,1])
    axes[1].plot(fpr,tpr)
    axes[1].set_title("AUC: %.2f"%auc(fpr,tpr))
    axes[1].set_xlabel("FPR")
    axes[1].set_ylabel("TPR")
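
The quantities used in `plotting` are easiest to check on a case small enough to work out by hand. The sketch below uses four made-up scores: one of the two positives is ranked below a negative, so three of the four positive/negative pairs are ordered correctly (AUC 0.75), and the average precision is 0.5 × 1 + 0.5 × 2/3 ≈ 0.83. Note that `precision_recall_curve` returns precision first and recall second.

```python
import numpy as np
from sklearn.metrics import (auc, average_precision_score,
                             precision_recall_curve, roc_curve)

# Toy labels and predicted scores for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Return order is (precision, recall, thresholds).
precisions, recalls, pr_thresholds = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

ap = average_precision_score(y_true, y_score)
roc_auc = auc(fpr, tpr)
print("average precision: %.2f" % ap)  # 0.83
print("ROC AUC: %.2f" % roc_auc)       # 0.75
```

The precision and recall arrays are one element longer than the thresholds array because scikit-learn appends a final point with precision 1 and recall 0.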

# K-nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
# Mean accuracy over 5-fold cross-validation on the full dataset
scores = cross_val_score(knn,features_temp,targets,cv=5)
print("Accuracy:",scores.mean())
# The remaining metrics come from the single train/test split
knn.fit(X_train,y_train)
y_predict = knn.predict(X_test)
# Precision
print("Precision:",precision_score(y_test,y_predict))
# Recall
print("Recall:",recall_score(y_test,y_predict))
# F1-Score
print("F1 score:",f1_score(y_test,y_predict))
plotting(knn,y_test)
plt.show()

Accuracy: 0.7985245901639344
Precision: 0.8
Recall: 0.8421052631578947
F1 score: 0.8205128205128205
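
The value `n_neighbors=5` is fixed by hand above. One way to choose it from the data is a cross-validated grid search. The following is a sketch on synthetic data from `make_classification`, sized to resemble the heart dataset (303 rows, 13 features); the candidate list of k values and `random_state=0` are assumptions, and the scores it prints say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem (not the heart data).
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

# Scaling sits inside the pipeline so it is refitted in every CV fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
candidate_k = [1, 3, 5, 7, 9, 11, 15]
search = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": candidate_k},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("best k:", best_k)
print("mean CV accuracy: %.3f" % search.best_score_)
```

Odd values of k are used so that a two-class vote cannot tie.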

5.2 Decision Tree Evaluation

tree = DecisionTreeClassifier(max_depth=10)
tree.fit(X_train,y_train)
plotting(tree,y_test)
plt.show()

5.3 Random Forest Evaluation

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train,y_train)
plotting(rf,y_test)
plt.show()

5.4 Logistic Regression Evaluation

logic = LogisticRegression(tol=1e-10)
logic.fit(X_train,y_train)
plotting(logic,y_test)
plt.show()

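5.5 SGD Classifier Evaluation

# The logistic loss is needed for predict_proba. It is named "log_loss" in
# scikit-learn 1.1 and later; earlier versions spell it loss="log".
sgd = SGDClassifier(loss="log_loss")
sgd.fit(X_train,y_train)
plotting(sgd,y_test)
plt.show()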
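
Sections 5.2 to 5.5 draw curves for each model but print numeric scores only for K-nearest neighbors. To compare the models on equal terms, the same cross-validated metrics can be computed in one loop. The sketch below does this for four of the classifiers on synthetic data; the dataset, the `random_state` values, `max_iter=1000` and the dictionary keys are assumptions for illustration, and `SGDClassifier` can be added to `models` in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (not the heart data).
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}
scoring = ["accuracy", "precision", "recall", "f1"]

summary = {}
for name, estimator in models.items():
    # Scale inside the pipeline so each fold is standardized on its own training part.
    pipe = make_pipeline(StandardScaler(), estimator)
    cv = cross_validate(pipe, X_demo, y_demo, cv=5, scoring=scoring)
    summary[name] = {m: cv["test_" + m].mean() for m in scoring}

for name, metrics in summary.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

Using the same folds and metrics for every model avoids comparing a cross-validated accuracy for one model with single-split scores for another.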

5.6 Feature Importance Analysis

# Heart disease prediction: feature importances from the random forest
importances = pd.Series(data=rf.feature_importances_,index=features.columns).sort_values(ascending=False)
sns.barplot(y=importances.index,x=importances.values,orient='h')
plt.show()
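
`feature_importances_` is computed from the training data (impurity decrease inside the trees). A complementary check is permutation importance on held-out rows, which measures how much the test score drops when one column is shuffled. The sketch below is an assumption-laden illustration, not part of the original experiment: it uses synthetic data in which only the first two columns (`f0`, `f1`) carry signal, and the feature names and `random_state` values are invented.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the two informative features in columns 0 and 1.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
names = ["f%d" % i for i in range(X_demo.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column 10 times on the test set and record the mean accuracy drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
perm_importances = pd.Series(result.importances_mean, index=names).sort_values(ascending=False)
print(perm_importances.round(3))
```

With one-hot encoded inputs such as those in section 4, each dummy column is permuted separately, so the importance of a categorical variable is spread over its dummy columns.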
