Coggle 30 Days of ML (Spaceship Titanic)
Task 1: Sign up for the competition
Sign up for the competition
Read the training data
import numpy as np
import pandas as pd

train_data = pd.read_csv("./data/train.csv")
train_data.info()
Task 2: Competition data analysis
Step 1: Use pandas to complete the following analysis
How many rows do the training and test sets have?
What is the dtype of each column in the training set?
train_data.info()
How is the label distributed in the training set, and which feature is it most correlated with?
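A quick check of the label distribution (a sketch, assuming train_data has been loaded as above):

# Proportion of each class of the Transported label
print(train_data['Transported'].value_counts(normalize=True))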
Encode the categorical columns as numeric types, then use seaborn to display the relationships between columns.
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns

col = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']
ded = train_data.copy()
for i in col:
    # cast to str so NaN is treated as its own category
    encoder = LabelEncoder().fit(ded[i].astype(str))
    ded[i] = encoder.transform(ded[i].astype(str))
plt.figure(figsize=(10,10))
sns.heatmap(ded.corr(numeric_only=True), annot=True)
The heatmap shows that CryoSleep has the strongest correlation with Transported.
How are missing values distributed across the columns of the training set?
# Count missing values
fea_lst = []
num_lst = []
num_rate_lst = []
for i in list(train_data.columns):
    num = train_data[i].isnull().sum()
    num_rate = num / len(train_data)
    fea_lst.append(i)
    num_lst.append(num)
    num_rate_lst.append(num_rate)
null_train = pd.DataFrame({"column": fea_lst, "missing count": num_lst, "missing rate": num_rate_lst})
null_train
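The same summary can also be produced with pandas built-ins, equivalent to the loop above:

missing = train_data.isnull().sum().to_frame("missing count")
missing["missing rate"] = missing["missing count"] / len(train_data)
missing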
Step 2: Use seaborn or matplotlib to produce the following visualizations (a combined sketch follows the list)
Distribution of HomePlanet vs. Transported
Distribution of CryoSleep vs. Transported
Distribution of Cabin vs. Transported
Distribution of Destination vs. Transported
Distribution of Age vs. Transported
Distribution of VIP vs. Transported
Distribution of Name vs. Transported
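A minimal sketch for these plots, assuming train_data is loaded. Categorical columns work with countplot and hue; Age is continuous, so a histogram fits better. Cabin and Name have too many distinct values to plot directly and would first need grouping (e.g. Cabin by deck), which is omitted here.

import seaborn as sns
import matplotlib.pyplot as plt

for feat in ['HomePlanet', 'CryoSleep', 'Destination', 'VIP']:
    sns.countplot(data=train_data, x=feat, hue='Transported')
    plt.title(f'{feat} vs. Transported')
    plt.show()

# Age: compare the age distribution of the two label classes
sns.histplot(data=train_data, x='Age', hue='Transported', bins=40)
plt.show()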
Step 3: Based on the analysis above, what patterns do you find? For example, which kind of passenger is more likely to be Transported?
CryoSleep has the strongest correlation with Transported.
Task 3: Validation split and tree models
Step 1: Learn sklearn's data-splitting methods
from sklearn.model_selection import train_test_split
y = ded['Transported']
X = ded.drop(['Transported'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
Step 2: Import tree models from sklearn
Basic tree models include random forests and decision trees; advanced gradient-boosted tree models include XGBoost, LightGBM, and CatBoost.
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier  # classifier, since Transported is a binary label
Step 3: Fill missing values in the training and test sets (numeric columns with the column mean, categorical columns with the mode)
# Handle missing values
cat_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Name', 'Cabin']
num_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
test_data = pd.read_csv("./data/test.csv")
# Fill both sets with statistics computed on the training set
for col in num_cols:
    train_data[col].fillna(train_data[col].mean(), inplace=True)
    test_data[col].fillna(train_data[col].mean(), inplace=True)
for col in cat_cols:
    train_data[col].fillna(train_data[col].mode()[0], inplace=True)
    test_data[col].fillna(train_data[col].mode()[0], inplace=True)
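A quick sanity check that the fill worked:

# Expect 0 remaining missing values in the filled columns
print(train_data[num_cols + cat_cols].isnull().sum().sum())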
Task 4: Introduction to feature engineering
Step 1: Learn the basics of feature engineering
Feature engineering tutorial link
Step 2: Apply one-hot encoding and label encoding to the categorical columns
In machine learning, categorical variables usually need dedicated handling: model inputs generally must be numeric, and categorical variables carry no numeric meaning of their own, so a conversion step is required. Two methods are common: label encoding and one-hot encoding.
label encoding
# Label encoding
from sklearn.preprocessing import LabelEncoder
col_1 = ['HomePlanet', 'Destination', 'Name', 'Cabin']
train = train_data.copy()
for col in col_1:
    lab = LabelEncoder()
    train_data[col] = lab.fit_transform(train_data[col])

# The boolean columns can be converted to 0/1 directly
col_2 = ['CryoSleep', 'VIP', 'Transported']
for col in col_2:
    train_data[col] = train_data[col].astype(int)
one hot encoding
from sklearn.preprocessing import OneHotEncoder
train_temp = train.copy()
train_temp.drop("PassengerId", axis=1, inplace=True)
col_2 = ['HomePlanet', 'Destination', 'Name', 'Cabin']
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_temp[col_2])
after_onehot_features = enc.get_feature_names_out(col_2)
data_onehot = pd.DataFrame(enc.transform(train_temp[col_2]).toarray(), columns=after_onehot_features)
train_temp.drop(col_2, axis=1, inplace=True)
train_temp = train_temp.join(data_onehot)

col_2 = ['CryoSleep', 'VIP', 'Transported']
for col in col_2:
    train_temp[col] = train_temp[col].astype(int)
train_temp.head()
Step 3: Use a classification tree model with KFold to compare the validation accuracy of one-hot and label encoding.
A random forest with 5-fold cross-validation is used.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def train_cv(X, y):
    kfold = KFold(n_splits=5, shuffle=True, random_state=1234)
    scorelist = []
    # max_features is left at its default; oob_score adds an out-of-bag estimate
    rfc = RandomForestClassifier(oob_score=True, random_state=1, n_jobs=-1)
    for train_index, test_index in kfold.split(X, y):
        X_train = X.iloc[train_index]
        y_train = y.iloc[train_index]
        X_test = X.iloc[test_index]
        y_test = y.iloc[test_index]
        rfc = rfc.fit(X_train, y_train)
        score_r = rfc.score(X_test, y_test)
        scorelist.append(score_r)
    print(scorelist)
    print("mean {}".format(np.mean(scorelist)))
Label-encoding results:
[0.8033352501437608, 0.7883841288096607, 0.7952846463484762, 0.7658227848101266, 0.7813578826237054]
One-hot results:
[0.7866589994249569, 0.7849338700402531, 0.8039102932719954, 0.7796317606444189, 0.7871116225546605]
(PS: the categorical handling probably isn't great here; the fold-to-fold gap is a bit large.)
Task 5: Advanced feature engineering
Step 1: Apply target encoding to all categorical columns
Target encoding is a method for categorical features. Like one-hot or label encoding, it converts categories into numbers, but unlike those two methods it also uses the target to create the encoding, which makes it a supervised feature-engineering technique. Target encoding is any encoding that replaces a feature's categories with a number derived from the target; it is sometimes called mean encoding, and when applied to a binary target it is also known as bin counting. (Other names you may encounter include likelihood encoding, impact encoding, and leave-one-out encoding.)
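Before reaching for a library, the idea can be illustrated with plain pandas (a sketch; HomePlanet_te is a hypothetical column name, and the smoothing that real implementations add is omitted):

# Replace each HomePlanet category with the mean of the binary target
# observed for that category in the training data (mean encoding).
means = train_data.groupby('HomePlanet')['Transported'].mean()
train_data['HomePlanet_te'] = train_data['HomePlanet'].map(means)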
from category_encoders import TargetEncoder

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)
encoder = TargetEncoder(cols=cat_cols, handle_unknown='value', handle_missing='value').fit(X_train, y_train)
encoder_train = encoder.transform(X_train)
encoder_valid = encoder.transform(X_valid)
Step 2: Use the tree model's feature importances to select the top-10 features.
import matplotlib.pyplot as plt

importances = pd.DataFrame(data={
    'Attribute': encoder_train.columns,
    'Importance': rfc.feature_importances_
})
importances = importances.sort_values(by='Importance', ascending=False)
# Visualize
plt.bar(x=importances['Attribute'], height=importances['Importance'], color='#087E8B')
plt.title('Feature importances', size=20)
plt.xticks(rotation='vertical')
plt.show()
Step 3: Retrain and validate with the selected features, and compare model accuracy.
Using the 10 most important features from step 2, retrain and re-validate.
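A sketch of the selection step, reusing the importances table from step 2:

top10 = importances['Attribute'].head(10).tolist()   # importances is already sorted descending
X_top_train = encoder_train[top10]
X_top_valid = encoder_valid[top10]
rfc.fit(X_top_train, y_train)
print("validation score:", rfc.score(X_top_valid, y_valid))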
Before selection: score 0.759057
After selection: score 0.764232
Task 6: Advanced tree models
Step 1: Install LightGBM and learn its basic usage
Install LightGBM
# pip install lightgbm
import lightgbm as lgb
Basic LightGBM usage
# Template from a LightGBM tutorial: train_x, train_y, test, and features are
# assumed to have been prepared beforehand.
params = {
    'num_leaves': 60,            # strongly affects the result; larger is better, but too large overfits
    'min_data_in_leaf': 30,
    'objective': 'binary',       # objective function
    'max_depth': -1,
    'learning_rate': 0.03,
    'min_sum_hessian_in_leaf': 6,
    'boosting': 'gbdt',
    'feature_fraction': 0.9,     # fraction of features sampled per iteration
    'bagging_freq': 1,
    'bagging_fraction': 0.8,
    'bagging_seed': 11,
    'lambda_l1': 0.1,            # L1 regularization
    # 'lambda_l2': 0.001,        # L2 regularization
    'verbosity': -1,
    'nthread': -1,               # number of threads; -1 uses all of them
    'metric': {'binary_logloss', 'auc'},  # evaluation metrics
    'random_state': 2019,        # seed, so repeated runs give the same result
    # 'device': 'gpu'            # speeds up training with the GPU build of LightGBM
}

num_round = 1000  # illustrative value; the original snippet did not define it
folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], ))
test_pred_prob = np.zeros((test.shape[0], ))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])
    clf = lgb.train(params, trn_data, num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20, early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

# Binarize the averaged test-set probabilities
threshold = 0.5
result = [1 if pred > threshold else 0 for pred in test_pred_prob]
Step 2: Hold out 20% of the training set as a validation set and train with LightGBM. Does the accuracy improve?
Hold out 20% of the training set as a validation set
y = train_data["Transported"].astype(int)
X = train_data.drop(labels=['Transported'], axis=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state=22)
Train with LightGBM
lgb_train = lgb.Dataset(encoder_train, y_train, silent=True)
lgb_eval = lgb.Dataset(encoder_valid, y_valid, reference=lgb_train, silent=True)
gbm = lgb.train(params, lgb_train, num_boost_round=400,
                valid_sets=[lgb_train, lgb_eval], verbose_eval=100,
                early_stopping_rounds=200, categorical_feature=cat_cols)
# Predict on the test set
test_pre = gbm.predict(encoder_test, num_iteration=gbm.best_iteration)
Training results
Step 3: Submit the prediction file from step 2 to the competition and screenshot the score
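A sketch of turning the predicted probabilities into a submission file, assuming the sample_submission.csv shipped with the competition data:

sample = pd.read_csv('./data/sample_submission.csv')
sample['Transported'] = test_pre > 0.5   # probabilities -> True/False
sample.to_csv('submission.csv', index=False)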
Step 4: Try searching and tuning LightGBM's parameters
The parameter search below follows a tutorial found online.
params = {'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc',
          'n_jobs': -1, 'learning_rate': 0.1}

### Cross-validation (tuning)
print('Cross-validation')
max_auc = float('0')
best_params = {}

# Accuracy
print("Tuning step 1: improve accuracy")
for num_leaves in range(5, 100, 5):    # number of leaves; default 31; should be smaller than 2^max_depth
    for max_depth in range(3, 8, 1):   # max tree depth; default -1 (unlimited); a sensible cap reduces overfitting
        params['num_leaves'] = num_leaves
        params['max_depth'] = max_depth
        cv_results = lgb.cv(params, lgb_train, seed=1, nfold=2, metrics=['auc'],
                            early_stopping_rounds=10, verbose_eval=True)
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()  # best round count (not used further)
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['num_leaves'] = num_leaves
            best_params['max_depth'] = max_depth
if 'num_leaves' in best_params and 'max_depth' in best_params:
    params['num_leaves'] = best_params['num_leaves']
    params['max_depth'] = best_params['max_depth']

# Overfitting
print("Tuning step 2: reduce overfitting")
for max_bin in range(5, 256, 10):                 # max number of bins for bucketing feature values
    for min_data_in_leaf in range(1, 102, 10):    # minimum number of samples per leaf
        params['max_bin'] = max_bin
        params['min_data_in_leaf'] = min_data_in_leaf
        cv_results = lgb.cv(params, lgb_train, seed=1, nfold=2, metrics=['auc'],
                            early_stopping_rounds=10, verbose_eval=True)
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['max_bin'] = max_bin
            best_params['min_data_in_leaf'] = min_data_in_leaf
if 'max_bin' in best_params and 'min_data_in_leaf' in best_params:
    params['min_data_in_leaf'] = best_params['min_data_in_leaf']
    params['max_bin'] = best_params['max_bin']

print("Tuning step 3: reduce overfitting")
for feature_fraction in [0.6, 0.7, 0.8, 0.9, 1.0]:      # default 1.0; fraction of features used per iteration
    for bagging_fraction in [0.6, 0.7, 0.8, 0.9, 1.0]:  # default 1.0; fraction of data used per iteration; speeds training and fights overfitting
        for bagging_freq in range(0, 50, 5):
            params['feature_fraction'] = feature_fraction
            params['bagging_fraction'] = bagging_fraction
            params['bagging_freq'] = bagging_freq
            cv_results = lgb.cv(params, lgb_train, seed=1, nfold=2, metrics=['auc'],
                                early_stopping_rounds=10, verbose_eval=True)
            mean_auc = pd.Series(cv_results['auc-mean']).max()
            if mean_auc >= max_auc:
                max_auc = mean_auc
                best_params['feature_fraction'] = feature_fraction
                best_params['bagging_fraction'] = bagging_fraction
                best_params['bagging_freq'] = bagging_freq
if ('feature_fraction' in best_params and 'bagging_fraction' in best_params
        and 'bagging_freq' in best_params):
    params['feature_fraction'] = best_params['feature_fraction']
    params['bagging_fraction'] = best_params['bagging_fraction']
    params['bagging_freq'] = best_params['bagging_freq']

print("Tuning step 4: reduce overfitting")
for lambda_l1 in [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    for lambda_l2 in [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.4, 0.6, 0.7, 0.9, 1.0]:
        params['lambda_l1'] = lambda_l1
        params['lambda_l2'] = lambda_l2
        cv_results = lgb.cv(params, lgb_train, seed=1, nfold=2, metrics=['auc'],
                            early_stopping_rounds=10, verbose_eval=True)
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['lambda_l1'] = lambda_l1
            best_params['lambda_l2'] = lambda_l2
if 'lambda_l1' in best_params and 'lambda_l2' in best_params:
    params['lambda_l1'] = best_params['lambda_l1']
    params['lambda_l2'] = best_params['lambda_l2']

print("Tuning step 5: reduce overfitting, part 2")
for min_split_gain in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:  # minimum loss reduction required to split a leaf; default 0; larger is more conservative
    params['min_split_gain'] = min_split_gain
    cv_results = lgb.cv(params, lgb_train, seed=1, nfold=2, metrics=['auc'],
                        early_stopping_rounds=10, verbose_eval=True)
    mean_auc = pd.Series(cv_results['auc-mean']).max()
    if mean_auc >= max_auc:
        max_auc = mean_auc
        best_params['min_split_gain'] = min_split_gain
if 'min_split_gain' in best_params:
    params['min_split_gain'] = best_params['min_split_gain']
print(best_params)
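As an aside, a similar search can be written more compactly with sklearn's GridSearchCV over the LGBMClassifier wrapper (a sketch; the staged search above tunes parameter groups sequentially, while this searches one grid jointly):

from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    LGBMClassifier(objective='binary', learning_rate=0.1),
    param_grid={'num_leaves': [15, 31, 63], 'max_depth': [3, 5, 7]},
    scoring='roc_auc', cv=2)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)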
Step 5: Retrain the model with the parameters found in step 4, submit the newest prediction file to the competition, and screenshot the score
LightGBM is retrained with the parameters from the previous step.
The results are as follows:
Competition score screenshot:
Task 7: Multi-fold training and ensembling
Step 1: Split the data with KFold
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import f1_score

def culatescore(predict, real):
    f1 = f1_score(real, predict, average='macro')
    scores.append(f1)
    return scores

params = {
    'bagging_freq': 5,
    'lambda_l1': 0.5,
    'lambda_l2': 0.001,
    'min_split_gain': 0.0,
    'feature_fraction': 0.8,
    'objective': 'binary',
    'metric': ['auc', 'binary_logloss'],
    'num_leaves': 2 ** 5,
    'max_bin': 225,
    'max_depth': 7,
    'num_boost_round': 5000,
    'learning_rate': 0.05,
    'colsample_bytree': 0.8,   # fraction of features sampled per iteration
    'bagging_fraction': 0.8,   # fraction of data used per iteration
    'min_child_samples': 25,
    'n_jobs': -1,
    'silent': True,            # suppress per-iteration output
    'seed': 1000,
}

results = []           # F1 score of each CV fold
bigtestresults = []    # aggregated test-set predictions across folds
smalltestresults = []  # test-set predictions of the current fold; these lists are re-initialized on each run of this block
scores = []            # aggregated CV scores
# cat = ["CryoSleep", "county_name"]  # categorical features, passed as categorical_feature=cat

kf = KFold(n_splits=5, shuffle=True, random_state=123)
for i, (train_index, valid_index) in enumerate(kf.split(X, y)):
    print("fold", i + 1)
    x_train, y_train = X.iloc[train_index], y.iloc[train_index]
    x_valid, y_valid = X.iloc[valid_index], y.iloc[valid_index]
    lgb_train = lgb.Dataset(x_train, y_train, silent=True)
    lgb_eval = lgb.Dataset(x_valid, y_valid, reference=lgb_train, silent=True)
    # verbose_eval: print every N iterations; early_stopping_rounds: stop after N rounds without improvement.
    # categorical_feature: LightGBM handles categorical features natively; this parameter names them.
    # No one-hot expansion is needed: for each categorical feature it tries one-vs-others splits
    # over the category values to find the best split.
    # bagging_fraction together with bagging_freq speeds up training.
    gbm = lgb.train(params, lgb_train, num_boost_round=400,
                    valid_sets=[lgb_train, lgb_eval], verbose_eval=100,
                    early_stopping_rounds=200, categorical_feature=cat_cols)
    vaild_preds = gbm.predict(x_valid, num_iteration=gbm.best_iteration)
    # Predict on the test set
    test_pre = gbm.predict(encoder_test, num_iteration=gbm.best_iteration)

    threshold = 0.5        # decision threshold
    smalltestresults = []  # this fold's binarized test-set predictions
    # Binarize this fold's test-set predictions and collect them in bigtestresults
    for w in test_pre:
        temp = 1 if w > threshold else 0
        smalltestresults.append(temp)
    bigtestresults.append(smalltestresults)

    # Binarize this fold's validation predictions and evaluate the F1 score
    results = []
    for pred in vaild_preds:
        result = 1 if pred > threshold else 0
        results.append(result)
    c = culatescore(results, y_valid)
    print("score {}".format(c))

print('--- N-fold cross-validation score ---')
print(np.average(c))

# Turn the per-fold test predictions into a DataFrame and take the most frequent
# class per passenger as the final prediction (a majority vote across folds).
finalpres = pd.DataFrame(bigtestresults)
lss = []  # final predictions
for i in finalpres.columns:
    temp1 = finalpres.iloc[:, i].value_counts().index[0]
    lss.append(temp1)

sample = pd.read_csv('./data/sample_submission.csv')
mp_dict = {1: True, 0: False}
sample['Transported'] = lss
sample['Transported'] = sample['Transported'].map(mp_dict)
sample.to_csv('submission6.csv', index=False)
Step 2: Split the data with StratifiedKFold
The code is the same as in step 1 except for the splitter; StratifiedKFold additionally preserves the class ratio of the label in every fold:

from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
# ... the training loop, majority vote, and submission code are identical to step 1
Step 3: Use StratifiedKFold together with LightGBM to train and predict
Step 4: How many models were trained in step 3? Predict on the test set multiple times, submit the newest prediction file to the competition, and screenshot the score
With 5-fold training we obtain 5 models, and the test set is predicted once per fold model, i.e. 5 times. The competition score is below:
Step 5: Train 5 machine-learning models (SVM, LR, etc.) with cross-validation, ensemble them with stacking, submit the newest prediction file to the competition, and screenshot the score
from sklearn.metrics import f1_score
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

clf1 = LogisticRegression(random_state=2022, tol=1e-6, max_iter=10000)  # logistic regression
clf2 = DecisionTreeClassifier(random_state=2022)                        # decision tree
clf3 = SVC(probability=True, random_state=2022, tol=1e-6)               # SVM
clf4 = RandomForestClassifier(n_estimators=100, random_state=2022)      # random forest
clf5 = GradientBoostingClassifier(random_state=2022)                    # GBDT

sclf = StackingClassifier(
    estimators=[('lr', clf1), ('dvtree', clf2), ('svm', clf3), ('random', clf4), ('gdbt', clf5)],
    final_estimator=LogisticRegression()
)
sclf.fit(encoder_train, y_train)
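A sketch of evaluating the stacked model and writing a submission, assuming the target-encoded frames from task 5 (encoder_train, encoder_valid) and an encoder_test built the same way:

print("validation accuracy:", sclf.score(encoder_valid, y_valid))

sample = pd.read_csv('./data/sample_submission.csv')
sample['Transported'] = sclf.predict(encoder_test).astype(bool)
sample.to_csv('submission_stacking.csv', index=False)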