2022讯飞——糖尿病遗传风险检测挑战赛解决方案
目录:
- 0. 赛事背景
- 1. 读取数据
- 2. 数据探索及预处理
- 2.1 缺失值
- 2.2 分析字段类型
- 2.3 计算字段相关性
- 3. 特征工程
- 3.1 特征构造
- 4. 模型训练
- 4.1 LightGBM (0.96206)
- 4.2 随机森林(0.96324)
- 4.3 XGBoost (0.95981)
- 4.4 CatBoost(0.95854)
- 4.5 AdaBoost(0.96098)
- 4.6 集成模型(0.95971)
- 4.7 Stacking(0.96577)
- 4.8 归一化数据,pytorch神经网络
- 4.9 SVM
- 4.10 sklearn神经网络
- 5. 总结思考
0. 赛事背景
截至2022年,中国糖尿病患者近1.3亿。中国糖尿病患病原因受生活方式、老龄化、城市化、家族遗传等多种因素影响。同时,糖尿病患者趋向年轻化。
糖尿病可导致心血管、肾脏、脑血管并发症的发生。因此,准确诊断出患有糖尿病个体具有非常重要的临床意义。糖尿病早期遗传风险预测将有助于预防糖尿病的发生。
根据《中国2型糖尿病防治指南(2017年版)》,糖尿病的诊断标准是具有典型糖尿病症状(烦渴多饮、多尿、多食、不明原因的体重下降)且随机静脉血浆葡萄糖≥11.1mmol/L或空腹静脉血浆葡萄糖≥7.0mmol/L或口服葡萄糖耐量试验(OGTT)负荷后2h血浆葡萄糖≥11.1mmol/L。
在这次比赛中,您需要通过训练数据集构建糖尿病遗传风险预测模型,然后预测出测试数据集中个体是否患有糖尿病,和我们一起帮助糖尿病患者解决这“甜蜜的烦恼”。
训练集说明
训练集(比赛训练集.csv)一共有5070条数据,用于构建您的预测模型(您可能需要先进行数据分析)。数据的字段有编号、性别、出生年份、体重指数、糖尿病家族史、舒张压、口服耐糖量测试、胰岛素释放实验、肱三头肌皮褶厚度、患有糖尿病标识(最后一列),您也可以通过特征工程技术构建新的特征。测试集说明
测试集(比赛测试集.csv)一共有1000条数据,用于验证预测模型的性能。数据的字段有编号、性别、出生年份、体重指数、糖尿病家族史、舒张压、口服耐糖量测试、胰岛素释放实验、肱三头肌皮褶厚度。评估指标
对于提交的结果,系统会采用二分类任务中的F1-score指标进行评价,F1-score越大说明预测模型性能越好。
本质为二分类预测,不是很难,注意做好特征工程,提升数据质量。
比赛详情见官网。
1. 读取数据
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbmtrain_data = pd.read_csv('比赛训练集.csv',encoding='gbk')
test_data = pd.read_csv('比赛测试集.csv',encoding='gbk')
train_data.describe()
编号 | 性别 | 出生年份 | 体重指数 | 舒张压 | 口服耐糖量测试 | 胰岛素释放实验 | 肱三头肌皮褶厚度 | 患有糖尿病标识 | |
---|---|---|---|---|---|---|---|---|---|
count | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 4823.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 |
mean | 2535.500000 | 0.456805 | 1986.869231 | 37.986785 | 89.423595 | 5.612839 | 4.114321 | 6.994371 | 0.381854 |
std | 1463.727263 | 0.498180 | 8.919737 | 11.447095 | 9.266992 | 2.257649 | 8.726001 | 13.651442 | 0.485889 |
min | 1.000000 | 0.000000 | 1943.000000 | 0.000000 | 30.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 1268.250000 | 0.000000 | 1980.000000 | 28.400000 | 85.000000 | 4.314000 | 0.000000 | 0.000000 | 0.000000 |
50% | 2535.500000 | 0.000000 | 1987.000000 | 36.550000 | 89.000000 | 5.760000 | 0.000000 | 0.000000 | 0.000000 |
75% | 3802.750000 | 1.000000 | 1995.000000 | 47.600000 | 96.000000 | 7.193000 | 7.100000 | 4.120000 | 1.000000 |
max | 5070.000000 | 1.000000 | 2009.000000 | 65.900000 | 126.000000 | 10.839000 | 108.960000 | 45.000000 | 1.000000 |
患有糖尿病标识的均值为0.38,说明未患病:患病 ≈6:4\approx 6 : 4≈6:4,数据还算较为平衡。
test_data.describe()
编号 | 性别 | 出生年份 | 体重指数 | 舒张压 | 口服耐糖量测试 | 胰岛素释放实验 | 肱三头肌皮褶厚度 | |
---|---|---|---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 951.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 500.500000 | 0.481000 | 1986.386000 | 39.439000 | 89.638275 | 5.872314 | 4.102700 | 7.064240 |
std | 288.819436 | 0.499889 | 8.816163 | 11.284861 | 9.379124 | 1.930880 | 8.594005 | 13.900938 |
min | 1.000000 | 0.000000 | 1958.000000 | 0.000000 | 28.000000 | -1.000000 | 0.000000 | 0.000000 |
25% | 250.750000 | 0.000000 | 1979.000000 | 29.975000 | 85.000000 | 4.516000 | 0.000000 | 0.000000 |
50% | 500.500000 | 0.000000 | 1987.000000 | 38.900000 | 89.000000 | 5.851500 | 0.000000 | 0.000000 |
75% | 750.250000 | 1.000000 | 1994.000000 | 48.950000 | 96.000000 | 7.465000 | 7.202500 | 3.820000 |
max | 1000.000000 | 1.000000 | 2003.000000 | 60.000000 | 112.000000 | 10.613000 | 123.890000 | 44.900000 |
2. 数据探索及预处理
2.1 缺失值
统计每个字段的缺失比例,并进行填充。可以看到舒张压指标的缺失值较多,用字段均值将其填充。
print('训练集各字段缺失比例:')
print(train_data.isnull().mean(0))
print('\n测试集各字段缺失比例:')
print(test_data.isnull().mean(0))# 用均值填充缺失值
train_data['舒张压'] = train_data['舒张压'].fillna(train_data['舒张压'].mean())
test_data['舒张压'] = test_data['舒张压'].fillna(test_data['舒张压'].mean())
训练集各字段缺失比例:
编号 0.000000
性别 0.000000
出生年份 0.000000
体重指数 0.000000
糖尿病家族史 0.000000
舒张压 0.048718
口服耐糖量测试 0.000000
胰岛素释放实验 0.000000
肱三头肌皮褶厚度 0.000000
患有糖尿病标识 0.000000
dtype: float64测试集各字段缺失比例:
编号 0.000
性别 0.000
出生年份 0.000
体重指数 0.000
糖尿病家族史 0.000
舒张压 0.049
口服耐糖量测试 0.000
胰岛素释放实验 0.000
肱三头肌皮褶厚度 0.000
dtype: float64
2.2 分析字段类型
print(train_data.columns)train_data.describe()
Index(['编号', '性别', '出生年份', '体重指数', '糖尿病家族史', '舒张压', '口服耐糖量测试', '胰岛素释放实验','肱三头肌皮褶厚度', '患有糖尿病标识'],dtype='object')
编号 | 性别 | 出生年份 | 体重指数 | 舒张压 | 口服耐糖量测试 | 胰岛素释放实验 | 肱三头肌皮褶厚度 | 患有糖尿病标识 | |
---|---|---|---|---|---|---|---|---|---|
count | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 |
mean | 2535.500000 | 0.456805 | 1986.869231 | 37.986785 | 89.423595 | 5.612839 | 4.114321 | 6.994371 | 0.381854 |
std | 1463.727263 | 0.498180 | 8.919737 | 11.447095 | 9.038394 | 2.257649 | 8.726001 | 13.651442 | 0.485889 |
min | 1.000000 | 0.000000 | 1943.000000 | 0.000000 | 30.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 1268.250000 | 0.000000 | 1980.000000 | 28.400000 | 85.000000 | 4.314000 | 0.000000 | 0.000000 | 0.000000 |
50% | 2535.500000 | 0.000000 | 1987.000000 | 36.550000 | 89.000000 | 5.760000 | 0.000000 | 0.000000 | 0.000000 |
75% | 3802.750000 | 1.000000 | 1995.000000 | 47.600000 | 95.000000 | 7.193000 | 7.100000 | 4.120000 | 1.000000 |
max | 5070.000000 | 1.000000 | 2009.000000 | 65.900000 | 126.000000 | 10.839000 | 108.960000 | 45.000000 | 1.000000 |
编号与是否患病没关系,删除;
性别为类别变量,只有0,1,不再需要进行编码;
糖尿病家族病史为文本型变量,需要转化为数值变量;
其他均为数值型变量,可以暂时不变。
train_data = train_data.drop(['编号'], axis=1)
test_data = test_data.drop(['编号'], axis=1)
2.3 计算字段相关性
查看各字段之间的相关性,防止多重共线性。
Ref:
Python绘制相关性热力图
train_corr = train_data.drop('糖尿病家族史',axis=1).corr()import seaborn as snsplt.subplots(figsize=(9,9),dpi=80,facecolor='w') # 设置画布大小,分辨率,和底色
plt.rcParams['font.sans-serif'] = ['SimHei'] # 黑体
plt.rcParams['axes.unicode_minus'] = False # 解决无法显示符号的问题
sns.set(font='SimHei', font_scale=0.8) # 解决Seaborn中文显示问题#annot为热力图上显示数据;fmt='.2g'为数据保留两位有效数字,square呈现正方形,vmax最大值为1
fig=sns.heatmap(train_corr,annot=True, vmax=1, square=True, cmap="Blues", fmt='.2g')#保存图片
fig.get_figure().savefig('train_corr.png',bbox_inches='tight',transparent=True)
#bbox_inches让图片显示完整,transparent=True让图片背景透明
这里中文显示出了点问题,但是可以看到各特征变量与是否患病没有显著线性关系(但可能有非线性关系),各特征变量之间也不存在多重共线性,可以继续下一步操作。
3. 特征工程
这一步至关重要,主要是有两个目的:
- 特征构造:尝试构建有价值的新变量;
- 特征筛选:删除对因变量影响不大的冗余变量。
由于这里的特征也不是很多,就不做筛选了。
3.1 特征构造
可以用统计指标,已有知识、经验构造新的变量,具体到这个问题上可以有BMI指数、舒张压范围、年龄等。
特征构造方法:
- 特征的统计指标;
- 特征之间的四则运算;
- 交叉特征;
- 分解类别特征。如将三个颜色分解为“知道颜色”和“不知道颜色”。
- 特征分箱。将数值型特征变量按段划分,得到类别型特征。
- 重构特征。单位转换、整数部分与小数部分分离等。
- 根据已有经验构造新的特征变量,比如xx因子。
Ref:
[1] 深度了解特征工程
# 将出生年份换算成年龄
train_data['年龄']=2022-train_data['出生年份'] #换成年龄
test_data['年龄']=2022-test_data['出生年份']train_data = train_data.drop('出生年份', axis=1)
test_data = test_data.drop('出生年份', axis=1)
# 家族史转换, 方法一,label编码
from sklearn.preprocessing import OneHotEncoder, LabelEncoderdef FHOD(a):if a=='无记录':return 0elif a=='叔叔或者姑姑有一方患有糖尿病' or a=='叔叔或姑姑有一方患有糖尿病':return 1else:return 2train_data['糖尿病家族史'] = train_data['糖尿病家族史'].apply(FHOD)
test_data['糖尿病家族史'] = test_data['糖尿病家族史'].apply(FHOD)# history = train_data['糖尿病家族史']
# print(set(history))# history.loc[history=='叔叔或姑姑有一方患有糖尿病'] = '叔叔或者姑姑有一方患有糖尿病'# le = LabelEncoder()
# h = le.fit_transform(history)# 方法二,onehot 编码
# def onehot_transform(data):
# # 将家族史的文本型变量转换为onehot编码。# onehot = OneHotEncoder()
# data.loc[data['糖尿病家族史']=='叔叔或姑姑有一方患有糖尿病', '糖尿病家族史'] = '叔叔或者姑姑有一方患有糖尿病'# data_onehot = pd.DataFrame(onehot.fit_transform(data[['糖尿病家族史']]).toarray(),
# columns=onehot.get_feature_names(['糖尿病家族史']), dtype='int32')# return data_onehot# data_train_history = onehot_transform(train_data)
# data_test_history = onehot_transform(test_data)
def BMI(a):"""人体的成人体重指数正常值是在18.5-24之间低于18.5是体重指数过轻在24-27之间是体重超重27以上考虑是肥胖高于32了就是非常的肥胖。"""if a<18.5:return 0elif 18.5<=a<=24:return 1elif 24<a<=27:return 2elif 27<a<=32:return 3else:return 4train_data['BMI']=train_data['体重指数'].apply(BMI)
test_data['BMI']=test_data['体重指数'].apply(BMI)# 转换舒张压为类别型变量
def DBP(a):# 舒张压范围为60-90if a<60:return 0elif 60<=a<=90:return 1elif a>90:return 2else:return atrain_data['DBP'] = train_data['舒张压'].apply(DBP)
test_data['DBP'] = test_data['舒张压'].apply(DBP)X_train = train_data.drop('患有糖尿病标识', axis=1)
Y_train = train_data['患有糖尿病标识']
X_train['年龄'] = X_train['年龄'].astype(float)
X_test = test_datatrain_data.describe()
性别 | 体重指数 | 糖尿病家族史 | 舒张压 | 口服耐糖量测试 | 胰岛素释放实验 | 肱三头肌皮褶厚度 | 患有糖尿病标识 | 年龄 | BMI | DBP | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 | 5070.000000 |
mean | 0.456805 | 37.986785 | 0.601183 | 89.423595 | 5.612839 | 4.114321 | 6.994371 | 0.381854 | 35.130769 | 3.301972 | 1.394477 |
std | 0.498180 | 11.447095 | 0.764882 | 9.038394 | 2.257649 | 8.726001 | 13.651442 | 0.485889 | 8.919737 | 1.051700 | 0.510116 |
min | 0.000000 | 0.000000 | 0.000000 | 30.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 13.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 28.400000 | 0.000000 | 85.000000 | 4.314000 | 0.000000 | 0.000000 | 0.000000 | 27.000000 | 3.000000 | 1.000000 |
50% | 0.000000 | 36.550000 | 0.000000 | 89.000000 | 5.760000 | 0.000000 | 0.000000 | 0.000000 | 35.000000 | 4.000000 | 1.000000 |
75% | 1.000000 | 47.600000 | 1.000000 | 95.000000 | 7.193000 | 7.100000 | 4.120000 | 1.000000 | 42.000000 | 4.000000 | 2.000000 |
max | 1.000000 | 65.900000 | 2.000000 | 126.000000 | 10.839000 | 108.960000 | 45.000000 | 1.000000 | 79.000000 | 4.000000 | 2.000000 |
4. 模型训练
可用于分类问题的模型非常丰富,常见的如下图:
4.1 LightGBM (0.96206)
首先尝试构建LightGBM模型。
Ref:
[1] Lightgbm原理、参数详解及python实例
[2] 深入理解LightGBM
以下为lightgbm采用5折交叉训练的代码:
#使用Lightgbm方法训练数据集,使用5折交叉验证的方法获得5个测试集预测结果
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold, GridSearchCVdef select_by_lgb(train_data,train_label,test_data,random_state=1234, n_splits=5,metric='auc',num_round=10000,early_stopping_rounds=100):
# kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)fold=0result0 = []for train_idx, val_idx in kfold.split(train_data, train_label):random_state+=1train_x = train_data.loc[train_idx]train_y = train_label.loc[train_idx]test_x = train_data.loc[val_idx]test_y = train_label.loc[val_idx]clf = lightgbmtrain_matrix=clf.Dataset(train_x,label=train_y)test_matrix=clf.Dataset(test_x,label=test_y)params={'boosting_type': 'gbdt','objective': 'binary','learning_rate': 0.1,# 'max_depth': 7,# 'num_leaves': 10,'metric': metric,'seed': random_state,'silent': True,'nthread':-1 }model=clf.train(params,train_matrix,num_round,valid_sets=test_matrix,early_stopping_rounds=early_stopping_rounds)pre_y=model.predict(test_data)result0.append(pre_y)fold+=1pred_test = pd.DataFrame(result0).T# 将5次预测结果求平均值pred_test['average'] = pred_test.mean(axis=1)#因为竞赛需要你提交最后的预测判断,而模型给出的预测结果是概率,因此我们认为概率>0.5的即该患者有糖尿病,概率<=0.5的没有糖尿病pred_test['label'] = pred_test['average'].apply(lambda x:1 if x>0.5 else 0)## 导出结果result = pd.read_csv('提交示例.csv')result['label']=pred_test['label']return result
后面其他模型也需要进行k折交叉训练,这里定义一个k折交叉训练的函数,方便后续调用。
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import roc_auc_score, f1_scoredef SKFold(train_data,train_label,test_data, model, random_state=1234, n_splits=5,metric='auc',num_round=10000,early_stopping_rounds=100):# 采用分层K折交叉验证训练模型。kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)fold = 1pred_test = []for train_idx, val_idx in kfold.split(train_data, train_label):random_state+=1train_x = train_data.loc[train_idx]train_y = train_label.loc[train_idx]val_x = train_data.loc[val_idx]val_y = train_label.loc[val_idx]eval_set = (val_x, val_y)clf = modelmodel_trained = clf.fit(train_x, train_y)# model_trained = clf.fit(train_x,train_y,early_stopping_rounds=early_stopping_rounds, verbose=False)# model_trained = clf.fit(train_x, train_y, eval_set=eval_set, early_stopping_rounds=early_stopping_rounds)pre_y = model_trained.predict(test_data)pred_test.append(pre_y)auc_train = roc_auc_score(train_y, model_trained.predict(train_x))auc_val = roc_auc_score(val_y, model_trained.predict(val_x))f_score_train = f1_score(train_y, model_trained.predict(train_x))f_score_val = f1_score(val_y, model_trained.predict(val_x))print('Fold: %d, AUC_train: %.4f, AUC_val: %.4f, F1-score_train: %.4f, F1-score_val: %.4f'%(fold, auc_train, auc_val, f_score_train, f_score_val))fold += 1pred_test = pd.DataFrame(pred_test).T# 将5次预测结果求平均值pred_test['average'] = pred_test.mean(axis=1)#因为竞赛需要你提交最后的预测判断,而模型给出的预测结果是概率,因此我们认为概率>0.5的即该患者有糖尿病,概率<=0.5的没有糖尿病pred_test['label'] = pred_test['average'].apply(lambda x:1 if x>0.5 else 0)## 导出结果result=pd.read_csv('提交示例.csv')result['label']=pred_test['label']return result
由于比赛的测试集未公布,我们只能提交预测结果然后得到测试集上的分数,这里以表现较好的lightgbm模型作为baseline,若与lightgbm的预测结果相差较多则说明该模型表现不行。
def evaluate(result_LightGBM, result_others):# 以lightGBM的结果为基准,评估其他模型的表现。c = result_LightGBM['label'] - result_others['label']count = 0for i in c:if i != 0:count += 1print('与LightGBM预测不同的样本数: ', count)print(c[c!=0])return count
先用select_by_lgb
快速跑出一个baseline,然后用网格搜索得到最优参数,接着用最优参数组合在训练一遍模型,最后将结果提交。
random_state = 1234result_LightGBM = select_by_lgb(X_train, Y_train, X_test) #baseline
result_LightGBM.to_csv('result_lightGBM.csv',index=False)# 试试网格搜索最优参数
import lightgbm as lgb
params_test = {'max_depth': range(4, 10, 1),'num_leaves': range(10, 60, 10)
}skf = StratifiedKFold(n_splits=5) gsearch1 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.1, n_estimators=325, max_depth=8, bagging_fraction = 0.8,feature_fraction = 0.8), param_grid=params_test,scoring='roc_auc', cv=skf, n_jobs=-1)gsearch1.fit(X_train, Y_train)
print(gsearch1.best_params_)
print(gsearch1.best_score_)# 用最优参数再训练一遍
model_lgb = lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.1, n_estimators=200, num_leaves=10, silent=True,max_depth=7)result_SKFold_lgb = SKFold(X_train, Y_train, X_test, model_lgb, n_splits=5)
result_SKFold_lgb.to_csv('result_SKFold_lgb.csv',index=False)diff_lgb = evaluate(result_LightGBM, result_SKFold_lgb)
[LightGBM] [Warning] Unknown parameter: silent
[LightGBM] [Warning] Unknown parameter: silent
[LightGBM] [Info] Number of positive: 1549, number of negative: 2507
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000386 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1047
[LightGBM] [Info] Number of data points in the train set: 4056, number of used features: 10
[LightGBM] [Warning] Unknown parameter: silent
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.381903 -> initscore=-0.481477
[LightGBM] [Info] Start training from score -0.481477
[1] valid_0's auc: 0.979322
Training until validation scores don't improve for 100 rounds
[2] valid_0's auc: 0.980354
[3] valid_0's auc: 0.982351
[4] valid_0's auc: 0.981993Early stopping, best iteration is:
[40] valid_0's auc: 0.989851
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8
{'max_depth': 7, 'num_leaves': 10}
0.9902553653233458
Fold: 1, AUC_train: 0.9917, AUC_val: 0.9461, F1-score_train: 0.9909, F1-score_val: 0.9358
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8
Fold: 2, AUC_train: 0.9852, AUC_val: 0.9487, F1-score_train: 0.9837, F1-score_val: 0.9386
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8
Fold: 3, AUC_train: 0.9863, AUC_val: 0.9457, F1-score_train: 0.9853, F1-score_val: 0.9328
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8
Fold: 4, AUC_train: 0.9876, AUC_val: 0.9607, F1-score_train: 0.9860, F1-score_val: 0.9490
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8
Fold: 5, AUC_train: 0.9858, AUC_val: 0.9505, F1-score_train: 0.9841, F1-score_val: 0.9403
与LightGBM预测不同的样本数: 7
24 1
35 -1
43 1
76 1
434 1
442 1
501 -1
Name: label, dtype: int64
4.2 随机森林(0.96324)
Ref:
[1] Permutation Importance vs Random Forest Feature Importance (MDI)
from sklearn.ensemble import RandomForestClassifierforest = RandomForestClassifier(max_depth=5, random_state=1234)
forest.fit(X_train, Y_train)
pred_forest = forest.predict(X_test)result=pd.read_csv('提交示例.csv')
result['label']=pred_forest
result.to_csv('result_RandomForest.csv',index=False)feature_importance_forest = pd.Series(forest.feature_importances_, index=X_train.columns).sort_values(ascending=True)plt.figure(figsize=(10, 7), dpi=80)ax = feature_importance_forest.plot.barh()
ax.set_title("Random Forest Feature Importances (MDI)")
# ax.figure.tight_layout()
## 网格搜索最优参数组合params_test = {'max_depth': range(3, 20, 2),'n_estimators': range(100, 600, 100),'min_samples_leaf': [2, 4, 6]
}skf = StratifiedKFold(n_splits=5) gsearch2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=200, max_depth=5, random_state=random_state), param_grid=params_test, scoring='roc_auc', cv=skf, n_jobs=-1)gsearch2.fit(X_train, Y_train)
print(gsearch2.best_params_)
print(gsearch2.best_score_)
{'max_depth': 13, 'min_samples_leaf': 2, 'n_estimators': 400}
0.9927105570462718
model_forest = RandomForestClassifier(n_estimators=400, max_depth=13, random_state=random_state)
result_SKFold_forest = SKFold(X_train, Y_train, X_test, model_forest)
result_SKFold_forest.to_csv('result_skfold_RandomForest.csv',index=False)diff_skold_forest = evaluate(result_LightGBM, result_SKFold_forest)
Fold: 1, AUC_train: 0.9935, AUC_val: 0.9516, F1-score_train: 0.9935, F1-score_val: 0.9424
Fold: 2, AUC_train: 0.9910, AUC_val: 0.9564, F1-score_train: 0.9909, F1-score_val: 0.9468
Fold: 3, AUC_train: 0.9919, AUC_val: 0.9490, F1-score_train: 0.9919, F1-score_val: 0.9396
Fold: 4, AUC_train: 0.9919, AUC_val: 0.9647, F1-score_train: 0.9919, F1-score_val: 0.9551
Fold: 5, AUC_train: 0.9913, AUC_val: 0.9591, F1-score_train: 0.9912, F1-score_val: 0.9497
与LightGBM预测不同的样本数: 11
8 1
35 -1
47 -1
52 -1
60 1
64 1
76 1
78 1
85 1
618 -1
796 -1
Name: label, dtype: int64
4.3 XGBoost (0.95981)
Ref:
[1] XGBoost:在Python中使用XGBoost
[2] Python机器学习笔记:XgBoost算法
[3] python包xgboost安装和简单使用
[4] 深入理解XGBoost,优缺点分析,原理推导及工程实现
[5] XGBoost的原理、公式推导、Python实现和应用
[6] XGBoost官方文档
import xgboost as xgb
from xgboost import plot_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFoldimport warnings
warnings.filterwarnings('ignore')# 分层k折交叉检验
skf = StratifiedKFold(n_splits=5) result_xgb = []
fold = 1
for train_idx, val_idx in skf.split(X_train, Y_train):train_x = X_train.loc[train_idx]train_y = Y_train.loc[train_idx]val_x = X_train.loc[val_idx]val_y = Y_train.loc[val_idx]d_train = xgb.DMatrix(train_x, train_y)d_val = xgb.DMatrix(val_x, val_y)d_test = xgb.DMatrix(X_test)params = {'max_depth':5,'min_child_weight':1,'num_class':2,'eta': 0.1, #学习率'gamma': 0.1, #后剪枝参数,取值在[0, 1],越大越保守'seed': 1234,'alpha': 1, #L1正则项的惩罚系数'eval_metric': 'auc'}num_round = 500# # 方式一:采用sklearn接口,采用fit 和 predict# model_xgb = xgb.XGBClassifier()# model_xgb.fit(train_x, train_y, verbose=False) # pred_train = model_xgb.predict(train_x)# pred_val = model_xgb.predict(val_x)# pred_xgb = model_xgb.predict(X_test)# 方式二:采用xgboost原生接口,采用train和predict,方便调参model_xgb = xgb.train(params, d_train, num_round)pred_train = model_xgb.predict(d_train)pred_val = model_xgb.predict(d_val)pred_xgb = model_xgb.predict(d_test)auc_train = roc_auc_score(train_y, pred_train)auc_val = roc_auc_score(val_y, pred_val)f_score_train = f1_score(train_y, pred_train)f_score_val = f1_score(val_y, pred_val)print('Fold: %d, AUC_train: %.4f, AUC_val: %.4f, F1-score_train: %.4f, F1-score_val: %.4f'%(fold, auc_train, auc_val, f_score_train, f_score_val))result_xgb.append(pred_xgb)fold += 1result_xgb = pd.DataFrame(result_xgb).T
print('result_xgb.shape = ', result_xgb.shape)# 将5次预测结果求平均值
result_xgb['average'] = result_xgb.mean(axis=1)# 最终预测结果
result_xgb['label'] = result_xgb['average'].apply(lambda x:1 if x>0.5 else 0)# 特征重要性
plot_importance(model_xgb)
plt.show()# 导出结果
result = pd.read_csv('提交示例.csv')
result['label'] = result_xgb['label']
result.to_csv('result_XGBoost_StratifiedKFold.csv',index=False)diff_xgb = evaluate(result_LightGBM, result_xgb)
Fold: 1, AUC_train: 0.9935, AUC_val: 0.9463, F1-score_train: 0.9929, F1-score_val: 0.9349
Fold: 2, AUC_train: 0.9952, AUC_val: 0.9556, F1-score_train: 0.9948, F1-score_val: 0.9456
Fold: 3, AUC_train: 0.9952, AUC_val: 0.9518, F1-score_train: 0.9948, F1-score_val: 0.9415
Fold: 4, AUC_train: 0.9906, AUC_val: 0.9524, F1-score_train: 0.9893, F1-score_val: 0.9407
Fold: 5, AUC_train: 0.9924, AUC_val: 0.9573, F1-score_train: 0.9919, F1-score_val: 0.9482
result_xgb.shape = (1000, 5)与LightGBM预测不同的样本数: 6
21 -1
24 1
28 -1
35 -1
43 1
74 -1
Name: label, dtype: int64
4.4 CatBoost(0.95854)
CatbBoost 是GBDT算法框架的一种改进实现,其主要创新点有:
- 支持类别性变量。嵌入了自动将类别型特征处理为数值型特征的创新算法。
- 使用了组合类别特征,丰富特征维度。
- 采用排序提升的方法对抗训练集中的噪声点,从而避免梯度估计的偏差,进而解决预测偏移的问题。
- 采用了完全对称树作为基模型。
Ref:
[1] 深入理解CatBoost
[2] Catboost 一个超级简单实用的boost算法
import catboost as cb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold# 分层k折交叉检验
skf = StratifiedKFold(n_splits=5) categorical_features_index = np.where(X_train.dtypes != float)[0]
print(X_train.columns[categorical_features_index])result_cat = []
fold = 1
for train_idx, val_idx in skf.split(X_train, Y_train):train_x = X_train.loc[train_idx]train_y = Y_train.loc[train_idx]val_x = X_train.loc[val_idx]val_y = Y_train.loc[val_idx]model_catboost = cb.CatBoostClassifier(eval_metric='AUC', cat_features=categorical_features_index, depth=6, n_estimators=400, learning_rate=0.5, verbose=False)model_catboost.fit(train_x, train_y, eval_set=(val_x, val_y), plot=False)pred_train = model_catboost.predict(train_x)pred_val = model_catboost.predict(val_x)auc_train = roc_auc_score(train_y, pred_train)auc_val = roc_auc_score(val_y, pred_val)f_score_train = f1_score(train_y, pred_train)f_score_val = f1_score(val_y, pred_val)print('Fold: %d, AUC_train: %.4f, AUC_val: %.4f, F1-score_train: %.4f, F1-score_val: %.4f'%(fold, auc_train, auc_val, f_score_train, f_score_val))pred_catboost = model_catboost.predict(X_test)result_cat.append(pred_catboost)fold += 1result_cat = pd.DataFrame(result_cat).T
print('result_cat.shape = ', result_cat.shape)# 将5次预测结果求平均值
result_cat['average'] = result_cat.mean(axis=1)# 最终预测结果
result_cat['label'] = result_cat['average'].apply(lambda x:1 if x>0.5 else 0)# 导出结果
result = pd.read_csv('提交示例.csv')
result['label'] = result_cat['label']
result.to_csv('result_CatBoost_StratifiedKFold.csv',index=False)diff_catboost = evaluate(result_LightGBM, result_cat)feature_importance_catboost = model_catboost.feature_importances_plt.figure(figsize=(10,8), dpi=80)
plt.barh(col_names, feature_importance_catboost)
plt.show()
Index(['性别', '糖尿病家族史', 'BMI', 'DBP'], dtype='object')
Fold: 1, AUC_train: 0.9902, AUC_val: 0.9419, F1-score_train: 0.9887, F1-score_val: 0.9305
Fold: 2, AUC_train: 0.9775, AUC_val: 0.9608, F1-score_train: 0.9722, F1-score_val: 0.9510
Fold: 3, AUC_train: 0.9988, AUC_val: 0.9543, F1-score_train: 0.9984, F1-score_val: 0.9442
Fold: 4, AUC_train: 0.9590, AUC_val: 0.9448, F1-score_train: 0.9498, F1-score_val: 0.9298
Fold: 5, AUC_train: 0.9651, AUC_val: 0.9484, F1-score_train: 0.9578, F1-score_val: 0.9377
result_cat.shape = (1000, 5)
与LightGBM预测不同的样本数: 16
8 1
33 -1
35 -1
47 -1
52 -1
60 1
64 1
74 -1
76 1
83 -1
85 1
89 -1
166 -1
501 -1
618 -1
796 -1
Name: label, dtype: int64
4.5 AdaBoost(0.96098)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifierrandom_state = 1234
model_tree = DecisionTreeClassifier(max_depth=5, random_state=random_state)
model_adaboost = AdaBoostClassifier(base_estimator=model_tree, n_estimators=200, random_state=random_state)result_adaboost = SKFold(X_train, Y_train, X_test, model_adaboost)
result_adaboost.to_csv('result_AdaBoost.csv',index=False)# 评估
diff_skold_adaboost = evaluate(result_LightGBM, result_adaboost)# 特征重要性
feature_importance_adaboost = model_adaboost.feature_importances_plt.figure(figsize=(10,8), dpi=80)
plt.rc('font', size = 18)
plt.barh(col_names, feature_importance_adaboost)
plt.title('Feature importances computed by AdaBoost')
plt.show()
Fold: 1, AUC_train: 1.0000, AUC_val: 0.9458, F1-score_train: 1.0000, F1-score_val: 0.9347
Fold: 2, AUC_train: 1.0000, AUC_val: 0.9445, F1-score_train: 1.0000, F1-score_val: 0.9333
Fold: 3, AUC_train: 1.0000, AUC_val: 0.9380, F1-score_train: 1.0000, F1-score_val: 0.9263
Fold: 4, AUC_train: 1.0000, AUC_val: 0.9488, F1-score_train: 1.0000, F1-score_val: 0.9357
Fold: 5, AUC_train: 1.0000, AUC_val: 0.9536, F1-score_train: 1.0000, F1-score_val: 0.9432
与LightGBM预测不同的样本数: 12
35 -1
47 -1
50 -1
52 -1
60 1
64 1
76 1
85 1
92 -1
94 1
495 1
796 -1
Name: label, dtype: int64
4.6 集成模型(0.95971)
挑几个表现较好的模型进行集成。
%%time
skf = StratifiedKFold(n_splits=5) categorical_features_index = np.where(X_train.dtypes != float)[0]
print('类别型特征: ', X_train.columns[categorical_features_index])
cat_features = list(map(lambda x:int(x), categorical_features_index))
random_state = 1234fold = 1
for train_idx, val_idx in skf.split(X_train, Y_train):train_x = X_train.loc[train_idx]train_y = Y_train.loc[train_idx]val_x = X_train.loc[val_idx]val_y = Y_train.loc[val_idx]d_train = xgb.DMatrix(train_x, train_y)d_val = xgb.DMatrix(val_x, val_y)d_test = xgb.DMatrix(X_test)params_xgb = {'max_depth':5,'min_child_weight':1,'num_class':2,'eta': 0.1, #学习率'gamma': 0.1, #后剪枝参数,取值在[0, 1],越大越保守'seed': 1234,'alpha': 1, #L1正则项的惩罚系数'eval_metric': 'auc'}num_round = 500early_stopping_rounds = 100model_lightGBM = lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.1, n_estimators=200, num_leaves=10, silent=True,max_depth=7)model_lightGBM.fit(X_train, Y_train)model_forest = RandomForestClassifier(max_depth=13, n_estimators=400, random_state=1234)model_forest.fit(X_train, Y_train)model_tree = DecisionTreeClassifier(max_depth=5, random_state=random_state)model_adaboost = AdaBoostClassifier(base_estimator=model_tree, n_estimators=200, random_state=random_state)model_adaboost.fit(X_train, Y_train)model_xgb = xgb.train(params_xgb, d_train, num_round)model_catboost = cb.CatBoostClassifier(eval_metric='AUC', cat_features=categorical_features_index, depth=6, iterations=400, learning_rate=0.5, verbose=False)model_catboost.fit(train_x, train_y, eval_set=(val_x, val_y), plot=False)print('Fold: %d finished training. '%fold)fold += 1pred_lightGBM = model_lightGBM.predict(test_data)
# pred_lightGBM = list(map(lambda x: 1 if x>0.5 else 0, pred_lightGBM)) #调用lightGBM原生接口时使用
pred_forest = forest.predict(X_test)
pred_adaboost = model_adaboost.predict(X_test)
pred_xgb = model_xgb.predict(d_test)
pred_catboost = model_catboost.predict(X_test)
pred_all = pd.DataFrame({'lightGBM': pred_lightGBM,'RandomForest': pred_forest,'AdaBoost': pred_adaboost,'XGBoost': pred_xgb,'CatBoost': pred_catboost})pred_all['Average'] = pred_all.mean(axis=1)# 最终预测结果
pred_all['label'] = pred_all['Average'].apply(lambda x:1 if x>0.5 else 0)# 导出结果
result = pd.read_csv('提交示例.csv')
result['label'] = pred_all['label']
result.to_csv('result_Ensemble.csv',index=False)diff_ensemble = evaluate(result_LightGBM, result)
类别型特征: Index(['性别', '糖尿病家族史', 'BMI', 'DBP'], dtype='object')Custom logger is already specified. Specify more than one logger at same time is not thread safe.Fold: 1 finished training.
Fold: 2 finished training.
Fold: 3 finished training.
Fold: 4 finished training.
Fold: 5 finished training.
与LightGBM预测不同的样本数: 11
24 1
33 -1
35 -1
43 1
52 -1
60 1
64 1
76 1
78 1
85 1
796 -1
Name: label, dtype: int64
Wall time: 1min 31s
4.7 Stacking(0.96577)
Stacking的思想为在初始数据集上训练若干个基学习器,并将这几个基学习器的预测结果作为新的训练集,来训练一个新的学习器,并将其预测结果作为最终输出。
Stacking本质是一种层级结构,第一层有n个基学习器,每个基学习器进行k折交叉训练,把每一折的验证集(validation set)的预测结果输出并拼接在一起,把这n个模型的训练集预测结果作为新的训练集,将这n个模型的测试集预测结果拼接在一起作为新的测试集。
(图源见水印)
Ref:
[1] stacking模型融合
[2] Kaggle上分技巧——单模K折交叉验证训练+多模型融合
model_tree = DecisionTreeClassifier(max_depth=5, random_state=random_state)clfs = [lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.1, n_estimators=200, num_leaves=10, silent=True,max_depth=7),RandomForestClassifier(max_depth=13, n_estimators=400, random_state=1234),AdaBoostClassifier(base_estimator=model_tree, n_estimators=200, random_state=random_state),xgb.XGBClassifier(),cb.CatBoostClassifier(eval_metric='AUC', cat_features=categorical_features_index, depth=6, iterations=400, learning_rate=0.5, verbose=False)]data_train = np.zeros((X_train.shape[0], len(clfs)))
data_test = np.zeros((X_test.shape[0], len(clfs)))# 5折stacking
n_splits = 5
skf = StratifiedKFold(n_splits)# 第一层,训练各个个体学习器
for i, clf in enumerate(clfs):# 依次训练各个模型d_test = np.zeros((X_test.shape[0], n_splits)) #存放个体学习器在测试集上的预测输出for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, Y_train)):#5折交叉训练,第j折拿来预测并作为第二层模型的训练集,剩余部分拿来训练模型。train_x = X_train.loc[train_idx]train_y = Y_train.loc[train_idx]val_x = X_train.loc[val_idx]val_y = Y_train.loc[val_idx]clf.fit(train_x, train_y)pred_train = clf.predict(train_x)pred_val = clf.predict(val_x)data_train[val_idx, i] = pred_vald_test[:, fold] = clf.predict(X_test)auc_train = roc_auc_score(train_y, pred_train)auc_val = roc_auc_score(val_y, pred_val)f_score_train = f1_score(train_y, pred_train)f_score_val = f1_score(val_y, pred_val)print('Classifier:%d, Fold: %d, AUC_train: %.4f, AUC_val: %.4f, F1-score_train: %.4f, F1-score_val: %.4f'%(i+1,fold+1, auc_train, auc_val, f_score_train, f_score_val))#对于测试集,直接用这交叉验证训练的每个模型的预测值均值作为新的特征data_test[:, i] = d_test.mean(axis=1)data_train = pd.DataFrame(data_train)
data_test = pd.DataFrame(data_test)# 第二层改用高级点的模型,并进行5折交叉训练
# model_forest = RandomForestClassifier(max_depth=5, random_state=1234)
model_2 = lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.3, n_estimators=200, num_leaves=10, silent=True,max_depth=7)
result_stack = SKFold(data_train, Y_train, data_test, model_2)
result_stack.to_csv('result_stack.csv', index=False)diff_stack = evaluate(result_LightGBM, result_stack)
Classifier:1, Fold: 1, AUC_train: 0.9894, AUC_val: 0.9465, F1-score_train: 0.9880, F1-score_val: 0.9340
Classifier:1, Fold: 2, AUC_train: 0.9883, AUC_val: 0.9554, F1-score_train: 0.9870, F1-score_val: 0.9465
Classifier:1, Fold: 3, AUC_train: 0.9886, AUC_val: 0.9559, F1-score_train: 0.9873, F1-score_val: 0.9467
Classifier:1, Fold: 4, AUC_train: 0.9897, AUC_val: 0.9417, F1-score_train: 0.9883, F1-score_val: 0.9268
Classifier:1, Fold: 5, AUC_train: 0.9887, AUC_val: 0.9542, F1-score_train: 0.9879, F1-score_val: 0.9453
Classifier:2, Fold: 1, AUC_train: 0.9926, AUC_val: 0.9489, F1-score_train: 0.9925, F1-score_val: 0.9377
Classifier:2, Fold: 2, AUC_train: 0.9929, AUC_val: 0.9577, F1-score_train: 0.9928, F1-score_val: 0.9482
Classifier:2, Fold: 3, AUC_train: 0.9923, AUC_val: 0.9562, F1-score_train: 0.9922, F1-score_val: 0.9478
Classifier:2, Fold: 4, AUC_train: 0.9929, AUC_val: 0.9569, F1-score_train: 0.9928, F1-score_val: 0.9470
Classifier:2, Fold: 5, AUC_train: 0.9903, AUC_val: 0.9529, F1-score_train: 0.9902, F1-score_val: 0.9439
Classifier:3, Fold: 1, AUC_train: 1.0000, AUC_val: 0.9377, F1-score_train: 1.0000, F1-score_val: 0.9253
Classifier:3, Fold: 2, AUC_train: 1.0000, AUC_val: 0.9478, F1-score_train: 1.0000, F1-score_val: 0.9354
Classifier:3, Fold: 3, AUC_train: 1.0000, AUC_val: 0.9554, F1-score_train: 1.0000, F1-score_val: 0.9465
Classifier:3, Fold: 4, AUC_train: 1.0000, AUC_val: 0.9395, F1-score_train: 1.0000, F1-score_val: 0.9269
Classifier:3, Fold: 5, AUC_train: 1.0000, AUC_val: 0.9495, F1-score_train: 1.0000, F1-score_val: 0.9399
[00:31:58] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Classifier:4, Fold: 1, AUC_train: 0.9997, AUC_val: 0.9452, F1-score_train: 0.9997, F1-score_val: 0.9326
[00:31:59] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Classifier:4, Fold: 2, AUC_train: 0.9997, AUC_val: 0.9530, F1-score_train: 0.9997, F1-score_val: 0.9429
[00:31:59] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Classifier:4, Fold: 3, AUC_train: 1.0000, AUC_val: 0.9538, F1-score_train: 1.0000, F1-score_val: 0.9441
[00:31:59] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Classifier:4, Fold: 4, AUC_train: 0.9994, AUC_val: 0.9480, F1-score_train: 0.9994, F1-score_val: 0.9345
[00:32:00] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Classifier:4, Fold: 5, AUC_train: 0.9994, AUC_val: 0.9537, F1-score_train: 0.9994, F1-score_val: 0.9452
Classifier:5, Fold: 1, AUC_train: 0.9980, AUC_val: 0.9411, F1-score_train: 0.9977, F1-score_val: 0.9293
Classifier:5, Fold: 2, AUC_train: 0.9992, AUC_val: 0.9587, F1-score_train: 0.9990, F1-score_val: 0.9485
Classifier:5, Fold: 3, AUC_train: 0.9990, AUC_val: 0.9556, F1-score_train: 0.9987, F1-score_val: 0.9456
Classifier:5, Fold: 4, AUC_train: 0.9997, AUC_val: 0.9464, F1-score_train: 0.9997, F1-score_val: 0.9321
Classifier:5, Fold: 5, AUC_train: 0.9987, AUC_val: 0.9447, F1-score_train: 0.9987, F1-score_val: 0.9326
Fold: 1, AUC_train: 0.9582, AUC_val: 0.9455, F1-score_train: 0.9491, F1-score_val: 0.9337
Fold: 2, AUC_train: 0.9538, AUC_val: 0.9495, F1-score_train: 0.9449, F1-score_val: 0.9398
Fold: 3, AUC_train: 0.9569, AUC_val: 0.9471, F1-score_train: 0.9477, F1-score_val: 0.9361
Fold: 4, AUC_train: 0.9546, AUC_val: 0.9556, F1-score_train: 0.9451, F1-score_val: 0.9456
Fold: 5, AUC_train: 0.9544, AUC_val: 0.9518, F1-score_train: 0.9463, F1-score_val: 0.9416
与LightGBM预测不同的样本数: 14
0 -1
8 1
23 -1
28 -1
33 -1
35 -1
47 -1
52 -1
60 1
64 1
74 -1
89 -1
796 -1
851 -1
Name: label, dtype: int64
4.8 归一化数据,pytorch神经网络
先归一化数据,统一量纲。从本节开始,使用基于距离的模型,不再使用树模型。
注意pytorch做二元分类有以下几种实现方式:
- Linear 输出维度为1 + sigmoid + BCELoss。
- Linear 输出维度为1 + BCEWithLogitsLoss。不需要加sigmoid或softmax函数,BCEWithLogitsLoss自带sigmoid作为激活函数。
- Linear 输出维度为2 + 交叉熵(CrossEntropyLoss)。输出tensor的维度0对应第一个label(即0),维度1对应第二个label(即1)。注意使用交叉熵时,真实标签不能是onehot格式,必须为1维tensor,预测标签必须大于或等于2维,预测标签的每一个维度对应一个标签。
pytorch 中使用神经网络进行多分类时,网络的输出 prediction 是 one hot 格式,但计算 交叉熵损失函数时,loss = criterion(prediction, target) 的输入 target 不能是 one hot 格式,直接用数字来表示就行(4 表示 one hot 中的 0 0 0 1)。
所以,自己构建数据集,返回的 target 不需要是 one hot 格式。
Ref:
[1] Pytorch学习笔记(5)——交叉熵报错RuntimeError: 1D target tensor expected, multi-target not supported
[2] PyTorch二分类时BCELoss,CrossEntropyLoss,Sigmoid等的选择和使用
[3] Pytorch实现二分类器
[4] RuntimeError: multi-target not supported at
from sklearn.preprocessing import MinMaxScaler# 归一化
scaler = MinMaxScaler()
X_train2 = scaler.fit_transform(X_train)
X_test2 = scaler.fit_transform(X_test)
Y_train2 = Y_train.to_numpy()
print('X_train.shape = ', X_train.shape)
print('X_train2.shape = ', X_train2.shape)
print('Y_train2.shape = ', Y_train2.shape)
X_train.shape = (5070, 10)
X_train2.shape = (5070, 10)
Y_train2.shape = (5070,)
def Convert(x):# Conver the numeric values into categorical values.y = np.zeros((x.shape[0],))for i in range(len(x)):if x[i, 0] > x[i, 1]:y[i] = 0else:y[i] = 1return y
import torch
import torch.nn as nn
import torch.nn.functional as Fclass NET(nn.Module):def __init__(self, input_dim:int, hidden:int, out_dim:int, activation='relu', dropout=0.2):super(NET, self).__init__()self.input_dim = input_dimself.hidden = hiddenself.out_dim = out_dimself.activation = activationself.Dropout = dropout# 激活函数选择if self.activation == 'relu':mid_act = torch.nn.ReLU()elif self.activation == 'tanh':mid_act = torch.nn.Tanh()elif self.activation == 'sigmoid':mid_act = torch.nn.Sigmoid()elif self.activation == 'LeakyReLU':mid_act = torch.nn.LeakyReLU()elif self.activation == 'ELU':mid_act = torch.nn.ELU()elif self.activation == 'GELU':mid_act = torch.nn.GELU()self.model = nn.Sequential(nn.Linear(self.input_dim, self.hidden),mid_act,nn.Dropout(self.Dropout),nn.Linear(self.hidden, self.hidden),mid_act,nn.Dropout(self.Dropout),nn.Linear(hidden, self.out_dim))def forward(self, x):out = self.model(x)return outdef predict(self, x):# x = torch.tensor(x.to_numpy()) #针对datafram# x = x.to(torch.float32)x = torch.tensor(x).to(torch.float32) #针对ndarrayx = F.softmax(self.model(x))ans = []for t in x:if t[0] > t[1]:ans.append(0)else:ans.append(1)return np.array(ans)
import time
from torch.utils.data import DataLoader, TensorDataset
class NN_classifier():def __init__(self, model, crit, l_rate, batch_size, max_epochs, n_splits=5, verbose=True):super(NN_classifier, self).__init__()self.model = model # Neural network model, should be a nn.Module()self.l_rate = l_rateself.batch_size = batch_sizeself.max_epochs = max_epochsself.verbose = verboseself.n_splits = n_splits # the value of k in k-fold validationself.crit = crit # loss functionself.device = 'cpu'def fit(self, X_train, Y_train, X_test):skf = StratifiedKFold(n_splits=self.n_splits)fold = 1pred_test = []for train_idx, val_idx in skf.split(X_train, Y_train):train_x = X_train[train_idx, :]train_y = Y_train[train_idx]val_x = X_train[val_idx, :]val_y = Y_train[val_idx]train_data = TensorDataset(train_x, train_y)train_dataloader = DataLoader(dataset=train_data, batch_size=self.batch_size, shuffle=True)valid_data = TensorDataset(val_x, val_y)validation_dataloader = DataLoader(dataset=valid_data, batch_size=self.batch_size, shuffle=False)model = self.modeloptimizer = torch.optim.Adam(model.parameters(), lr=self.l_rate)scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.9) #动态学习率调整for epoch in range(self.max_epochs):start_time = time.time()loss_all = []#------------- Training -----------------model.train()for data in train_dataloader:x, y = datax = x.to(self.device)y = y.to(self.device)optimizer.zero_grad()out = model(x)loss = self.crit(out, y.long())loss.requires_grad_(True) loss.backward()optimizer.step()loss_all.append(loss.item())scheduler.step()end_time = time.time()cost_time = end_time - start_timetrain_loss = np.mean(np.array(loss_all))#------------- Validation -----------------model.eval()loss_all = []with torch.no_grad():for data in validation_dataloader:x, y = datax = x.to(self.device)y = y.to(self.device)output = model(x)loss = self.crit(output, y.long())loss_all.append(loss.item())validation_loss = np.mean(np.array(loss_all))if self.verbose and (epoch+1) % 100 ==0:print('Fold:{:d}, Epoch:{:d}, train_loss: {:.4f}, validation_loss: {:.4f}, cost_time: {:.2f}s'.format(fold, epoch+1, train_loss, validation_loss, cost_time))#------------- Prediction -----------------pred = Convert(model(X_test).detach().numpy())pred_test.append(pred)pred_train = Convert(model(train_x).detach().numpy())pred_val = Convert(model(val_x).detach().numpy())auc_train = roc_auc_score(train_y, pred_train)auc_val = roc_auc_score(val_y, pred_val)f_score_train = f1_score(train_y, pred_train)f_score_val = f1_score(val_y, pred_val)print('Fold: %d, AUC_train: %.4f, AUC_val: %.4f, F1-score_train: %.4f, F1-score_val: %.4f'%(fold, auc_train, auc_val, f_score_train, f_score_val))fold += 1pred_test = pd.DataFrame(pred_test).Tprint('pred_test.shape = ', pred_test.shape)# 将5次预测结果求平均值pred_test['average'] = pred_test.mean(axis=1)#因为竞赛需要你提交最后的预测判断,而模型给出的预测结果是概率,因此我们认为概率>0.5的即该患者有糖尿病,概率<=0.5的没有糖尿病pred_test['label'] = pred_test['average'].apply(lambda x:1 if x>0.5 else 0)## 导出结果result=pd.read_csv('提交示例.csv')result['label']=pred_test['label']return result
# k折交叉验证不断训练同一个模型,集成不同fold(即不同时刻)的模型的预测结果
hidden = 64
activation = 'relu'
# activation = 'tanh'
# crit = nn.MSELoss()
crit = nn.CrossEntropyLoss()
batch_size = 512*2
max_epochs = 500
l_rate = 1e-3
dropout = 0.1
n_splits = 5# Convert to tensor
X_train_tensor = torch.from_numpy(X_train2).to(torch.float32)
X_test_tensor = torch.from_numpy(X_test2).to(torch.float32)
Y_train_tensor = torch.from_numpy(Y_train2).to(torch.float32)model_NN = NET(X_train2.shape[1], hidden, out_dim=2, activation=activation)
classifier_NN = NN_classifier(model_NN, crit=crit, batch_size=batch_size, l_rate=l_rate, max_epochs=max_epochs, n_splits=n_splits)result_SKFold_NN = NN_classifier.fit(classifier_NN,X_train_tensor, Y_train_tensor, X_test_tensor,)
c = result_LightGBM['label'] - result_SKFold_NN['label']count = 0
for i in c:if i != 0:count += 1print('与LightGBM预测不同的样本数: ', count)
print(c[c!=0])
Fold:1, Epoch:100, train_loss: 0.3503, validation_loss: 0.3196, cost_time: 0.04s
Fold:1, Epoch:200, train_loss: 0.2785, validation_loss: 0.2505, cost_time: 0.04s
Fold:1, Epoch:300, train_loss: 0.2332, validation_loss: 0.2257, cost_time: 0.04s
Fold:1, Epoch:400, train_loss: 0.2098, validation_loss: 0.2130, cost_time: 0.04s
Fold:1, Epoch:500, train_loss: 0.2006, validation_loss: 0.2066, cost_time: 0.04s
Fold: 1, AUC_train: 0.9363, AUC_val: 0.9133, F1-score_train: 0.9254, F1-score_val: 0.8956
Fold:2, Epoch:100, train_loss: 0.1813, validation_loss: 0.1555, cost_time: 0.04s
Fold:2, Epoch:200, train_loss: 0.1671, validation_loss: 0.1449, cost_time: 0.04s
Fold:2, Epoch:300, train_loss: 0.1540, validation_loss: 0.1444, cost_time: 0.04s
Fold:2, Epoch:400, train_loss: 0.1513, validation_loss: 0.1384, cost_time: 0.04s
Fold:2, Epoch:500, train_loss: 0.1426, validation_loss: 0.1379, cost_time: 0.04s
Fold: 2, AUC_train: 0.9496, AUC_val: 0.9395, F1-score_train: 0.9401, F1-score_val: 0.9269
Fold:3, Epoch:100, train_loss: 0.1356, validation_loss: 0.1419, cost_time: 0.04s
Fold:3, Epoch:200, train_loss: 0.1219, validation_loss: 0.1384, cost_time: 0.04s
Fold:3, Epoch:300, train_loss: 0.1206, validation_loss: 0.1357, cost_time: 0.04s
Fold:3, Epoch:400, train_loss: 0.1152, validation_loss: 0.1400, cost_time: 0.04s
Fold:3, Epoch:500, train_loss: 0.1113, validation_loss: 0.1395, cost_time: 0.04s
Fold: 3, AUC_train: 0.9625, AUC_val: 0.9550, F1-score_train: 0.9535, F1-score_val: 0.9434
Fold:4, Epoch:100, train_loss: 0.1075, validation_loss: 0.1185, cost_time: 0.04s
Fold:4, Epoch:200, train_loss: 0.1155, validation_loss: 0.1250, cost_time: 0.04s
Fold:4, Epoch:300, train_loss: 0.1081, validation_loss: 0.1238, cost_time: 0.04s
Fold:4, Epoch:400, train_loss: 0.1056, validation_loss: 0.1283, cost_time: 0.04s
Fold:4, Epoch:500, train_loss: 0.0957, validation_loss: 0.1289, cost_time: 0.04s
Fold: 4, AUC_train: 0.9702, AUC_val: 0.9518, F1-score_train: 0.9629, F1-score_val: 0.9386
Fold:5, Epoch:100, train_loss: 0.1064, validation_loss: 0.0951, cost_time: 0.04s
Fold:5, Epoch:200, train_loss: 0.0983, validation_loss: 0.0978, cost_time: 0.04s
Fold:5, Epoch:300, train_loss: 0.1028, validation_loss: 0.1055, cost_time: 0.05s
Fold:5, Epoch:400, train_loss: 0.0954, validation_loss: 0.1065, cost_time: 0.04s
Fold:5, Epoch:500, train_loss: 0.0935, validation_loss: 0.1073, cost_time: 0.04s
Fold: 5, AUC_train: 0.9709, AUC_val: 0.9547, F1-score_train: 0.9647, F1-score_val: 0.9455
pred_test.shape = (1000, 5)
与LightGBM预测不同的样本数: 423
0 -1
2 -1
4 -1
8 1
16 -1..
985 -1
987 -1
994 -1
995 -1
999 -1
Name: label, Length: 423, dtype: int64
k折交叉验证不断训练同一个模型,虽然模型最终的表现结果还可以(F1-score_val上去了),但集成各fold(各时期)的模型的结果表现依然糟糕,与baseline —— lightGBM相差甚远,都不用提交就知道分数会很低(0.63左右)了。
说明这网络模型不行啊!
可能原因:
- 集成时被早期表现较差的模型所拖累;
- 模型本身对表格数据拟合能力不够;
- 模型过拟合;
# 不用k折交叉验证,一个模型用到底
hidden = 64
activation = 'tanh'
# activation = 'tanh'
crit = nn.CrossEntropyLoss()
batch_size = 128
max_epochs = 2000
l_rate = 5e-3
dropout = 0.1
n_splits = 5model_NN2 = NET(X_train2.shape[1], hidden, out_dim=2, activation=activation)
classifier_NN2 = NN(model_NN2, crit=crit, batch_size=batch_size, l_rate=l_rate, max_epochs=max_epochs)# result_SKFold_NN = SKFold(pd.DataFrame(X_train2), pd.DataFrame(Y_train2),
# pd.DataFrame(X_test2), classifier_NN2, n_splits=5)classifier_NN2.fit(X_train2, Y_train2)
result_NN = classifier_NN2.predict(X_test2)# c = result_LightGBM['label'] - result_SKFold_NN['label']
c = result_LightGBM['label'] - result_NNcount = 0
for i in c:if i != 0:count += 1print('与LightGBM预测不同的样本数: ', count)
print(c[c!=0])
Epoch:100, train_loss: 0.2839, cost_time: 0.09s
Epoch:200, train_loss: 0.2275, cost_time: 0.09s
Epoch:300, train_loss: 0.2072, cost_time: 0.09s
Epoch:400, train_loss: 0.2097, cost_time: 0.09s
Epoch:500, train_loss: 0.1854, cost_time: 0.09s
Epoch:600, train_loss: 0.1906, cost_time: 0.09s
Epoch:700, train_loss: 0.1772, cost_time: 0.09s
Epoch:800, train_loss: 0.1749, cost_time: 0.09s
Epoch:900, train_loss: 0.1726, cost_time: 0.08s
Epoch:1000, train_loss: 0.1707, cost_time: 0.08s
Epoch:1100, train_loss: 0.1692, cost_time: 0.09s
Epoch:1200, train_loss: 0.1575, cost_time: 0.09s
Epoch:1300, train_loss: 0.1615, cost_time: 0.09s
Epoch:1400, train_loss: 0.1548, cost_time: 0.09s
Epoch:1500, train_loss: 0.1611, cost_time: 0.09s
Epoch:1600, train_loss: 0.1617, cost_time: 0.09s
Epoch:1700, train_loss: 0.1535, cost_time: 0.09s
Epoch:1800, train_loss: 0.1650, cost_time: 0.09s
Epoch:1900, train_loss: 0.1609, cost_time: 0.08s
Epoch:2000, train_loss: 0.1631, cost_time: 0.09s
与LightGBM预测不同的样本数: 424
2 -1.0
4 -1.0
6 -1.0
7 -1.0
8 1.0...
986 1.0
988 -1.0
995 -1.0
997 -1.0
999 -1.0
Name: label, Length: 424, dtype: float64
上面这个结果说明这个神经网络模型学习遇到了瓶颈,很难再提升了。要么改模型,要么改训练方法(k折交叉训练重复训练同一个模型有提升,但提升有限,表现依然扑街),要么改数据。
4.9 SVM
表现不行。
Ref:
[1] Python3《机器学习实战》学习笔记(八):支持向量机原理篇之手撕线性SVM
from sklearn.svm import SVCmodel_SVM = SVC(C=10) #C越大,对误分类的惩罚越大。result_SKFold_SVM= SKFold(pd.DataFrame(X_train2), pd.DataFrame(Y_train2),pd.DataFrame(X_test2), model_SVM, n_splits=5)diff_SVM = evaluate(result_LightGBM, result_SKFold_SVM)
Fold: 1, AUC_train: 0.9001, AUC_val: 0.8614, F1-score_train: 0.8819, F1-score_val: 0.8311
Fold: 2, AUC_train: 0.8987, AUC_val: 0.8737, F1-score_train: 0.8797, F1-score_val: 0.8489
Fold: 3, AUC_train: 0.8942, AUC_val: 0.8916, F1-score_train: 0.8744, F1-score_val: 0.8711
Fold: 4, AUC_train: 0.8901, AUC_val: 0.8840, F1-score_train: 0.8688, F1-score_val: 0.8614
Fold: 5, AUC_train: 0.8945, AUC_val: 0.8836, F1-score_train: 0.8747, F1-score_val: 0.8602
与LightGBM预测不同的样本数: 271
2 -1
8 1
16 -1
23 -1
33 -1..
983 -1
985 -1
993 1
995 -1
999 -1
Name: label, Length: 271, dtype: int64
4.10 sklearn神经网络
from sklearn.neural_network import MLPClassifiermodel_MLP = MLPClassifier(hidden_layer_sizes=128, activation='relu')result_SKFold_MLP = SKFold(pd.DataFrame(X_train2), pd.DataFrame(Y_train2),pd.DataFrame(X_test2), model_MLP, n_splits=5)c = result_LightGBM['label'] - result_SKFold_MLP['label']count = 0
for i in c:if i != 0:count += 1print('与LightGBM预测不同的样本数: ', count)
print(c[c!=0])
Fold: 1, AUC_train: 0.8780, AUC_val: 0.8394, F1-score_train: 0.8550, F1-score_val: 0.8045
Fold: 2, AUC_train: 0.8789, AUC_val: 0.8746, F1-score_train: 0.8558, F1-score_val: 0.8512
Fold: 3, AUC_train: 0.8838, AUC_val: 0.8828, F1-score_train: 0.8617, F1-score_val: 0.8599
Fold: 4, AUC_train: 0.8869, AUC_val: 0.8959, F1-score_train: 0.8637, F1-score_val: 0.8747
Fold: 5, AUC_train: 0.8867, AUC_val: 0.8820, F1-score_train: 0.8650, F1-score_val: 0.8579
与LightGBM预测不同的样本数: 238
2 -1
8 1
16 -1
23 -1
33 -1..
979 -1
983 -1
993 1
995 -1
999 -1
Name: label, Length: 238, dtype: int64
神经网络在这个任务上的表现扑街了。。
5. 总结思考
这次比赛最好的分数为stacking的0.96577,再往上提升一点变得非常困难,继续提升一点分数需要耗费巨量时间和精力,投入产出比划不来,就没有继续去改进了。但从这次比赛也学到了许多,掌握了许多树模型的使用方法,以及特征工程的一点技巧。
在表格数据上,不得不说还是树模型表现更好,计算快,对算力的需求没有神经网络那么大,结果也非常棒。相反,神经网络在这个数据上表现明显不如树模型,或许是因为我采用的模型过于简单了。
透过现象看本质,这个比赛本质就是个简单的二分类问题,那么有没有一种可能,推荐系统里的DeepFM、DCN等网络模型也能用于这个问题呢?
另外,这个比赛我有点过于注重模型部分了,特征工程没有怎么去做。数据决定了你能达到的上限,模型只是帮你接近这个上限。
哈哈哈,有空再试了。欢迎各位大佬在评论区留言赐教,一起变得更强!
参考资料:
[1] Datawhale_如何打一个数据挖掘比赛V2.1
[2] 讯飞官方参考解析
[3] Kaggle上分技巧——单模K折交叉验证训练+多模型融合
2022讯飞——糖尿病遗传风险检测挑战赛解决方案相关推荐
- 讯飞-糖尿病遗传风险检测挑战赛
讯飞-糖尿病遗传风险检测挑战赛 前言 相关库 一.比赛报名 1.1 赛事任务 1.2 读取数据 训练数据集 测试数据集 1.3 数据集基本信息 数据集维度 数据表基本信息 二.比赛数据分析 2.1 缺 ...
- 糖尿病遗传风险检测挑战赛(Coggle 30 Days of ML)
本次跟着Coggle 30 Days of ML学习算法竞赛,而不是基于现成baseline来操作,预计重新熟悉并学习 Pandas.Numpy 处理数据 Sklearn.LightGBM 模型使用 ...
- 糖尿病遗传风险检测挑战赛竞赛学习
目录 一:报名比赛 二:比赛数据分析 三:逻辑回归尝试 五:特征筛选 六:高阶树模型 七:多折训练与集成 !!!Staking!!! 任务一:报名比赛 糖尿病遗传风险检测挑战赛是讯飞开放平台上的一个算 ...
- 糖尿病遗传风险检测挑战赛
糖尿病遗传风险检测挑战赛 加载数据 import pandas as pd import lightgbm import matplotlib.pyplot as plt %matplotlib in ...
- 数据挖掘竞赛——糖尿病遗传风险检测挑战赛进阶
本次比赛是一个数据挖掘赛,需要选手通过训练集数据构建模型,然后对验证集数据进行预测,预测结果进行提交. 本题的任务是构建一种模型,该模型能够根据患者的测试数据来预测这个患者是否患有糖尿病.这种类型的任 ...
- 糖尿病遗传风险检测挑战赛-Coggle 30 Days of ML
前言 跟着coggle的7月竞赛学习活动,记录一些个人学习的过程,本博客围绕着科大讯飞糖尿病遗传风险检测挑战赛展开. 比赛地址:http://challenge.xfyun.cn/topic/info ...
- 【Datawhale】【机器学习】糖尿病遗传风险检测挑战赛
糖尿病遗传风险检测挑战赛 赛题介绍 截至2022年,中国糖尿病患者近1.3亿.中国糖尿病患病原因受生活方式.老龄化.城市化.家族遗传等多种因素影响.同时,糖尿病患者趋向年轻化. 糖尿病可导致心血管.肾 ...
- 【Coggle 】糖尿病遗传风险检测挑战赛
文章目录 活动背景 一.赛题介绍 二.打卡任务 任务1:报名比赛 步骤1:报名比赛 步骤2:下载比赛数据(点击比赛页面的赛题数据) 步骤3:解压比赛数据,并使用pandas进行读取: 步骤4:查看训练 ...
- 竞赛打卡:糖尿病遗传风险检测挑战赛
竞赛打卡:糖尿病遗传风险检测挑战赛 本博客是竞赛学习打卡记录博客.学习地址Coggle七月学习打卡.就是记录在竞赛学习过程中的打卡内容(有可能不会按照内容全部进行打卡,时间可能不是很够
最新文章
- APPCAN MAS接口之SOAP
- 用matlab定义2020年元旦是星期三,如何用matlab实现某年某月某日是星期几的问题,...
- 腾讯成功入选LoRa联盟新一届董事会
- react 最佳入门_miaov-React 最佳入门
- vijos 1057 盖房子 dp 最大子正方形
- ROS笔记(13) 记录与回放数据
- 刘强东解读京东AI战略布局,沈南鹏高文杨强助阵谈AI机遇
- AcWing 904. 虫洞(SPFA or Djakarta or bellman判负环)
- The summary of Interview
- Acwing:最长回文子串
- 超图 三维 GIS 技术白皮书
- 浅谈视频编解码器的工作原理和应用领域
- python实现搜狗新闻挖掘(一)
- 数据挖掘简介及学习路径介绍
- ifconfig内容解析
- java应用程序接口批量访问_spring中使用mybatis实现批量插入的示例代码
- platEMO里多目标进化算法对应的参考文献
- 计算机学院的加油口号,各学院校运会加油口号
- 小米平板4能装Linux系统吗,还在考虑小米平板4要不要入手?看完这十个问题你就有答案了...
- 最全的Matlab经典教程免费下载
热门文章
- Altium Designer系列:添加泪滴
- python刷步数程序设计_乐心健康间接修改微信步数-Docker持久运行python脚本
- 谷歌浏览器 抖动 chrome 上下抖动 问题解决
- 网络原理练习题(含答案)
- vmware ThinApp 4.01 and Workstation 6.5.1 sn
- 花窗图案c语言设计,4款中式仿古花窗图案隔断设计图 中式传统古建筑园林木门花窗艺术文化图片...
- 机床测头应用一:仿形加工功能,降低废品率
- sudo 切换用户 普通用户
- CoreJava(第二章)Java基础
- linux运行pppd,linux pppd 连接成功 还是 没有网