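Predicting Diabetes in the Kaggle Pima Indians Dataset with Python Machine Learning Libraries

Project summary

This project uses Python to explore the relationship between several medical measurements and diabetes through visualization and descriptive analysis. For the inferential part it uses scikit-learn: the data is standardized, a logistic regression model predicts the test set, and the model is evaluated with a confusion matrix and accuracy.

Main conclusions:

Of the 768 people in the dataset, 268 have diabetes and 500 do not, a prevalence of 34.90%;
On average, diabetic patients have higher glucose concentration, diastolic blood pressure, skinfold thickness, serum insulin, body mass index and diabetes pedigree function than non-diabetic people. Patients are mostly between 27 and 47 years old, with 1 to 8 pregnancies;
The parameters most strongly correlated with diabetes are glucose, insulin, BMI and skin_thick;
With the logistic regression model, 124 of the 154 Pima Indian women in the test set were predicted correctly, an accuracy of 80.5%.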

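I. Dataset overview

The dataset comes from the US National Institute of Diabetes and Digestive and Kidney Diseases. Its purpose is to predict whether a patient has diabetes from existing diagnostic measurements. The dataset has limitations, most notably that every patient in it is a Pima Indian woman aged 21 or older.

Data source: https://www.kaggle.com/uciml/pima-indians-diabetes-database

The dataset consists of several medical predictor variables and one target variable, Outcome, for nine fields in total. The predictors include the number of pregnancies, BMI, insulin level, age and so on.

II. Project analysis

Analysis steps

1. Define the question

Can the available data be used to accurately predict whether a person has diabetes?

2. Understand the data

Take a quick look at the table. The columns are renamed first so the data is easier to understand and use.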

# Import the data-handling and plotting packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load the data file
df = pd.read_csv('diabete.csv')

# Preview the first rows
df.head()

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

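The new column names (original names in parentheses) are: preg_times (Pregnancies), glucose (Glucose), blood_pressure (BloodPressure), skin_thick (SkinThickness), insulin (Insulin), BMI (BMI), DPF (DiabetesPedigreeFunction), age (Age), outcome (Outcome).

2.1 Data overview and data types

The minimum of glucose, blood_pressure, skin_thick, insulin and BMI should not be 0. A zero in these columns can be read as a value that was never recorded, so these columns have missing data. Approach: 1. convert the zeros in these columns to NaN; 2. compute each column's mean separately for each outcome; 3. fill the missing values with those means.

All columns are numeric.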

# Rename the columns so they are easier to understand and use
df.rename(columns={'Pregnancies': 'preg_times', 'Glucose': 'glucose',
                   'BloodPressure': 'blood_pressure', 'SkinThickness': 'skin_thick',
                   'Insulin': 'insulin', 'DiabetesPedigreeFunction': 'DPF',
                   'Age': 'age', 'Outcome': 'outcome'}, inplace=True)
df.head()

	preg_times	glucose	blood_pressure	skin_thick	insulin	BMI	DPF	age	outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

print('Columns: %d, rows: %d' % (df.shape[1], df.shape[0]))

Columns: 9, rows: 768

df.describe()

	preg_times	glucose	blood_pressure	skin_thick	insulin	BMI	DPF	age	outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
preg_times        768 non-null int64
glucose           768 non-null int64
blood_pressure    768 non-null int64
skin_thick        768 non-null int64
insulin           768 non-null int64
BMI               768 non-null float64
DPF               768 non-null float64
age               768 non-null int64
outcome           768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

3. Data cleaning

3.1 Data preprocessing

Replace the zeros in glucose, blood_pressure, skin_thick, insulin and BMI with NaN.

# Data cleaning: glucose, blood_pressure, skin_thick, insulin and BMI
# contain invalid zeros; replace every 0 in these columns with NaN
df_copy = df.copy(deep=True)
df_copy[['glucose', 'blood_pressure', 'skin_thick', 'insulin', 'BMI']] = \
    df_copy[['glucose', 'blood_pressure', 'skin_thick', 'insulin', 'BMI']].replace(0, np.nan)
df_copy.isnull().sum()

preg_times          0
glucose             5
blood_pressure     35
skin_thick        227
insulin           374
BMI                11
DPF                 0
age                 0
outcome             0
dtype: int64

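Define a helper function that computes a column's mean for each outcome.

# Fill missing values
# Compute the mean of the given column for each outcome, ignoring NaN rows
def mea_byoutcome(index):
    temp = df_copy[df_copy[index].notnull()]
    temp = temp[[index, 'outcome']].groupby(['outcome'])[[index]].mean().reset_index()
    return temp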

Fill the missing glucose values.

# Fill missing glucose values
mea_byoutcome('glucose')

	outcome	glucose
0	0	110.643863
1	1	142.319549

df_copy.loc[(df_copy['outcome'] == 0) & (df_copy['glucose'].isnull()), 'glucose'] = mea_byoutcome('glucose')['glucose'][0]
df_copy.loc[(df_copy['outcome'] == 1) & (df_copy['glucose'].isnull()), 'glucose'] = mea_byoutcome('glucose')['glucose'][1]

Fill the missing blood_pressure values.

# Fill missing blood_pressure values
mea_byoutcome('blood_pressure')

	outcome	blood_pressure
0	0	70.877339
1	1	75.321429

df_copy.loc[(df_copy['outcome'] == 0) & (df_copy['blood_pressure'].isnull()), 'blood_pressure'] = mea_byoutcome('blood_pressure')['blood_pressure'][0]
df_copy.loc[(df_copy['outcome'] == 1) & (df_copy['blood_pressure'].isnull()), 'blood_pressure'] = mea_byoutcome('blood_pressure')['blood_pressure'][1]

Fill the missing skin_thick values.

# Fill missing skin_thick values
mea_byoutcome('skin_thick')

	outcome	skin_thick
0	0	27.235457
1	1	33.000000

df_copy.loc[(df_copy['outcome'] == 0) & (df_copy['skin_thick'].isnull()), 'skin_thick'] = mea_byoutcome('skin_thick')['skin_thick'][0]
df_copy.loc[(df_copy['outcome'] == 1) & (df_copy['skin_thick'].isnull()), 'skin_thick'] = mea_byoutcome('skin_thick')['skin_thick'][1]

Fill the missing insulin values.

# Fill missing insulin values
mea_byoutcome('insulin')

	outcome	insulin
0	0	130.287879
1	1	206.846154

df_copy.loc[(df_copy['outcome'] == 0) & (df_copy['insulin'].isnull()), 'insulin'] = mea_byoutcome('insulin')['insulin'][0]
df_copy.loc[(df_copy['outcome'] == 1) & (df_copy['insulin'].isnull()), 'insulin'] = mea_byoutcome('insulin')['insulin'][1]

Fill the missing BMI values.

# Fill missing BMI values
mea_byoutcome('BMI')

	outcome	BMI
0	0	30.859674
1	1	35.406767

df_copy.loc[(df_copy['outcome'] == 0) & (df_copy['BMI'].isnull()), 'BMI'] = mea_byoutcome('BMI')['BMI'][0]
df_copy.loc[(df_copy['outcome'] == 1) & (df_copy['BMI'].isnull()), 'BMI'] = mea_byoutcome('BMI')['BMI'][1]
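
The five column-by-column fills above all follow the same pattern, so they could probably be collapsed into one pass. The sketch below is an assumption about how that might look, using pandas groupby with transform('mean') on a small made-up frame (toy, cols and filled are illustrative names, and the values are not the real dataset).

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; 0 marks a value that was never recorded.
toy = pd.DataFrame({
    "glucose": [148, 0, 183, 89, 0, 137],
    "BMI": [33.6, 26.6, 0.0, 28.1, 30.0, 43.1],
    "outcome": [1, 0, 1, 0, 0, 1],
})
cols = ["glucose", "BMI"]

# Step 1: treat zeros in the measurement columns as missing.
toy[cols] = toy[cols].replace(0, np.nan)

# Step 2: per-outcome mean of each column, broadcast back to every row.
# The mean skips NaN, so only recorded values contribute.
group_means = toy.groupby("outcome")[cols].transform("mean")

# Step 3: fill each gap with the mean of that row's own outcome group.
filled = toy.copy()
filled[cols] = toy[cols].fillna(group_means)
print(filled)
```

Because the fill value depends on outcome, this kind of imputation uses the label itself, which is worth keeping in mind when reading the accuracy later on.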

After filling, check each column for missing values again: the dataset has none left.

df_copy.isnull().any()

preg_times        False
glucose           False
blood_pressure    False
skin_thick        False
insulin           False
BMI               False
DPF               False
age               False
outcome           False
dtype: bool

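A pair plot colored by outcome shows the distribution of each parameter and the pairwise relationships between parameters. This project does not break down the combined effects of parameters any further.

sns.pairplot(df_copy, hue = 'outcome'); # pair plot colored by outcome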

m:\Users\mok\Anaconda3\lib\site-packages\statsmodels\nonparametric\kde.py:487: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
m:\Users\mok\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2

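Feature selection

Features are selected by correlation coefficient. Every parameter is positively correlated with outcome, with coefficients ranging from about 0.17 to 0.50, so all of them are kept for prediction. The three most strongly correlated parameters are glucose, insulin and BMI, with coefficients of 0.50, 0.41 and 0.32.

It is also worth noting that glucose and insulin, and skin_thick and BMI, are fairly strongly positively correlated with each other.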

corrdf = df_copy.corr()
corrdf['outcome'].sort_values(ascending = False)

outcome           1.000000
glucose           0.495954
insulin           0.410918
BMI               0.315271
skin_thick        0.308094
age               0.238356
preg_times        0.221898
blood_pressure    0.175087
DPF               0.173844
Name: outcome, dtype: float64
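
For readers who want to see what a correlation coefficient like the ones above measures, here is a minimal sketch of Pearson's r, which is the default method of DataFrame.corr(). The helper name pearson_r and the toy values are assumptions for illustration and are not taken from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy values: higher glucose tends to go with outcome 1.
glucose = [85, 89, 137, 148, 183]
outcome = [0, 0, 1, 1, 1]
r = pearson_r(glucose, outcome)
print(round(r, 3))
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.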

plt.figure(figsize=(15, 5), facecolor='w')
sns.heatmap(corrdf, vmax=1, square=False, annot=True, linewidth=1, cmap=plt.cm.Blues);

Before standardizing, take one more look at the overall descriptive statistics.

df_copy.describe()

	preg_times	glucose	blood_pressure	skin_thick	insulin	BMI	DPF	age	outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.00000	768.000000	768.000000	768.000000
mean	3.845052	121.697358	72.428141	29.247042	157.003527	32.44642	0.471876	33.240885	0.348958
std	3.369578	30.462008	12.106044	8.923908	88.860914	6.87897	0.331329	11.760232	0.476951
min	0.000000	44.000000	24.000000	7.000000	14.000000	18.20000	0.078000	21.000000	0.000000
25%	1.000000	99.750000	64.000000	25.000000	121.500000	27.50000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	28.000000	130.287879	32.05000	0.372500	29.000000	0.000000
75%	6.000000	141.000000	80.000000	33.000000	206.846154	36.60000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.10000	2.420000	81.000000	1.000000

Descriptive statistics for the non-diabetic group.

df_copy[df_copy['outcome'] == 0].describe() # non-diabetic group

	preg_times	glucose	blood_pressure	skin_thick	insulin	BMI	DPF	age	outcome
count	500.000000	500.000000	500.000000	500.000000	500.000000	500.000000	500.000000	500.000000	500.0
mean	3.298000	110.643863	70.877339	27.235457	130.287879	30.859674	0.429734	31.190000	0.0
std	3.017185	24.702314	11.927450	8.516280	74.400559	6.501303	0.299085	11.667655	0.0
min	0.000000	44.000000	24.000000	7.000000	15.000000	18.200000	0.078000	21.000000	0.0
25%	1.000000	93.000000	63.500000	22.000000	95.000000	25.750000	0.229750	23.000000	0.0
50%	2.000000	107.500000	70.877339	27.235457	130.287879	30.400000	0.336000	27.000000	0.0
75%	5.000000	125.000000	78.000000	31.000000	130.287879	35.300000	0.561750	37.000000	0.0
max	13.000000	197.000000	122.000000	60.000000	744.000000	57.300000	2.329000	81.000000	0.0

Descriptive statistics for the diabetic group.

df_copy[df_copy['outcome'] == 1].describe() # diabetic group

	preg_times	glucose	blood_pressure	skin_thick	insulin	BMI	DPF	age	outcome
count	268.000000	268.000000	268.000000	268.000000	268.000000	268.000000	268.000000	268.000000	268.0
mean	4.865672	142.319549	75.321429	33.000000	206.846154	35.406767	0.550500	37.067164	1.0
std	3.741239	29.488132	11.925638	8.456099	92.237987	6.590161	0.372354	10.968254	0.0
min	0.000000	78.000000	30.000000	7.000000	14.000000	22.900000	0.088000	21.000000	1.0
25%	1.750000	119.000000	68.000000	30.000000	175.000000	30.900000	0.262500	28.000000	1.0
50%	4.000000	140.500000	75.321429	33.000000	206.846154	34.300000	0.449000	36.000000	1.0
75%	8.000000	167.000000	82.000000	36.000000	206.846154	38.775000	0.728000	44.000000	1.0
max	17.000000	199.000000	114.000000	99.000000	846.000000	67.100000	2.420000	70.000000	1.0

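3.3 Data standardization

To make the features comparable, the data is standardized with scikit-learn's StandardScaler, which rescales each column to zero mean and unit variance. The standardized data is shown in the table below.

# Standardize the feature columns
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
source_x = pd.DataFrame(sc_X.fit_transform(df_copy.drop(['outcome'], axis = 1)), columns = ['preg_times', 'glucose', 'blood_pressure', 'skin_thick', 'insulin','BMI','DPF','age'])

source_x.head()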

	preg_times	glucose	blood_pressure	skin_thick	insulin	BMI	DPF	age
0	0.639947	0.864020	-0.035389	0.645088	0.561272	0.167806	0.468492	1.425995
1	-0.844885	-1.205478	-0.531332	-0.027701	-0.300842	-0.850452	-0.365061	-0.190672
2	1.233880	2.013741	-0.696647	0.420825	0.561272	-1.330487	0.604397	-0.105584
3	-0.844885	-1.074081	-0.531332	-0.700491	-0.709475	-0.632253	-0.920763	-1.041549
4	-1.141852	0.502679	-2.680419	0.645088	0.123830	1.549727	5.484909	-0.020496
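
The first standardized value in the table can be checked by hand. The sketch below assumes the usual z-score formula, z = (x - mean) / std, and uses the preg_times mean and standard deviation printed by describe() earlier. One detail to watch: describe() reports the sample standard deviation (dividing by n - 1), while StandardScaler divides by n, so the value is converted first.

```python
from math import sqrt

n = 768                 # number of rows
mean_preg = 3.845052    # preg_times mean from describe()
sample_std = 3.369578   # preg_times std from describe(), computed with n - 1

# StandardScaler uses the population standard deviation (divides by n).
population_std = sample_std * sqrt((n - 1) / n)

# Row 0 has preg_times = 6.
z = (6 - mean_preg) / population_std
print(z)  # close to the 0.639947 shown for row 0 in the table
```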

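4. Build the model

First check how many people in the dataset have diabetes: of the 768 people, 268 have diabetes and 500 do not, a prevalence of 34.90%.

df.outcome.value_counts() # count diabetic and non-diabetic people in the dataset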

0    500
1    268
Name: outcome, dtype: int64

plt.figure(figsize = (5,5), facecolor='w')
df.outcome.value_counts().plot.bar()
plt.xticks(rotation=0);

print('Prevalence: %.2f %%' % (df.outcome.value_counts()[1] / df.shape[0] * 100))

Prevalence: 34.90 %

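4.1 Build the prediction model

Split the dataset into an 80% training set and a 20% test set. The training set has 614 rows and is used to fit the model. The test set has 154 rows and is used to check the model's predictions.

# Split features and labels into training and test sets
from sklearn.model_selection import train_test_split
source_y = df_copy.outcome
train_X, test_X, train_y, test_y = train_test_split(source_x, source_y, train_size = .8, random_state = 0)
print('Full dataset features:', source_x.shape, 'training features:', train_X.shape, 'test features:', test_X.shape)
print('Full dataset labels:', source_y.shape, 'training labels:', train_y.shape, 'test labels:', test_y.shape)

Full dataset features: (768, 8) training features: (614, 8) test features: (154, 8)
Full dataset labels: (768,) training labels: (614,) test labels: (154,)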

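4.2 Train the model

Logistic regression is used to train the model.

# Import the logistic regression estimator
from sklearn.linear_model import LogisticRegression

# Fit on the training set, then predict the test set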
log_mod = LogisticRegression()
log_mod.fit(train_X, train_y)
preds = log_mod.predict(test_X)

m:\Users\mok\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
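
One caveat on the workflow above: the scaler was fitted on all 768 rows before the split, so the test rows had some influence on the scaling. A common alternative, sketched here on synthetic data rather than the real dataset, is to split first and wrap StandardScaler and LogisticRegression in a pipeline so that the scaler only ever sees the training rows. The make_classification data, the stratify argument and the variable names are assumptions for illustration and are not part of the original analysis. Passing solver explicitly is also what silences the FutureWarning shown above on older scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the feature table (768 x 8).
X, y = make_classification(n_samples=768, n_features=8,
                           n_informative=4, random_state=0)

# Split first; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y)

# The pipeline fits the scaler on the training rows only, then the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to the test rows before predicting.
test_accuracy = pipe.score(X_test, y_test)
print(X_train.shape, X_test.shape, test_accuracy)
```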

5. Evaluate the model

The model is evaluated in two ways.

from sklearn.metrics import confusion_matrix, accuracy_score

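1. Confusion matrix

In scikit-learn's confusion matrix the rows are the actual classes and the columns are the predicted classes. Of the 154 Pima Indian women in the test set, 95 non-diabetic women were correctly identified as non-diabetic and 29 diabetic women were correctly identified as diabetic. 18 diabetic women were identified as non-diabetic, and 12 non-diabetic women were identified as diabetic.

# First evaluation method
confusion_matrix(test_y, preds) # confusion matrix

array([[95, 12],
       [18, 29]], dtype=int64)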

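2. Accuracy

Accuracy is the share of test-set cases for which the model correctly predicted whether the person has diabetes. The result is 80.52%.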

accuracy_score(test_y, preds)

0.8051948051948052
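
The accuracy can be recomputed directly from the four confusion matrix counts, and the same counts give a few other common metrics. This is a small sketch that assumes scikit-learn's layout (rows are actual classes, columns are predicted classes); the variable names are illustrative.

```python
# Counts copied from the confusion matrix output above.
tn, fp = 95, 12   # actual non-diabetic: predicted non-diabetic / diabetic
fn, tp = 18, 29   # actual diabetic:     predicted non-diabetic / diabetic

total = tn + fp + fn + tp

accuracy = (tp + tn) / total     # share of all predictions that are correct
precision = tp / (tp + fp)       # share of predicted diabetics who are diabetic
recall = tp / (tp + fn)          # share of actual diabetics who were found
specificity = tn / (tn + fp)     # share of actual non-diabetics who were cleared

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
# accuracy=0.805 precision=0.707 recall=0.617 specificity=0.888
```

Recall is noticeably lower than accuracy here: 18 of the 47 diabetic women in the test set were missed, which accuracy alone does not show.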

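III. Project summary

Descriptive analysis conclusions:

Of the 768 people, 268 have diabetes and 500 do not, a prevalence of 34.90%;
On average, diabetic patients have higher glucose concentration, diastolic blood pressure, skinfold thickness, serum insulin, body mass index and diabetes pedigree function than non-diabetic people. Patients are mostly between 27 and 47 years old, with 1 to 8 pregnancies.

Inferential analysis conclusions:

The parameters most strongly correlated with diabetes are glucose, insulin, BMI and skin_thick.
Of the 154 Pima Indian women in the test set, 124 were predicted correctly, an accuracy of 80.5%.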

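IV. Future improvements

The prediction accuracy this time is 80.52%. The following measures could improve it:

Data collection: add data for people under 21 and for men, and increase the number of people in each age group;
Feature engineering: improve the model's inputs, for example by extracting the combined effect of several parameters on outcome.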
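
As one possible way to capture combined effects, pairwise interaction terms can be generated with scikit-learn's PolynomialFeatures. This is only a sketch of the idea on three rows of glucose and BMI values taken from the head of the table, and it is an assumption that interaction terms would help on this dataset.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two features per row: glucose and BMI (first three rows of the dataset).
X = np.array([[148.0, 33.6],
              [85.0, 26.6],
              [183.0, 23.3]])

# interaction_only=True adds products of distinct features (glucose * BMI)
# but no squared terms; include_bias=False omits the constant column.
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
X_expanded = interactions.fit_transform(X)

# Columns are now: glucose, BMI, glucose * BMI.
print(X_expanded.shape)
```

The expanded matrix could then be standardized and passed to the same logistic regression, with the test-set accuracy compared against the 80.52% baseline.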