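Overfitting means making a hypothesis overly strict in order to keep it consistent with the training data. Avoiding overfitting is a core task in classifier design. The usual remedies are to increase the amount of data and to evaluate the classifier on a separate test set. In the competition the sample size is fixed and the target test set is predetermined, so my approach is: first pre-train on the features that are prone to overfitting, then feed the result into a second round of training. Concretely, a first model trained on the voucher-usage features labels the unlabelled samples, and those pseudo-labelled rows are added to the training set of a second model that uses the region, demographic, and transaction features. Reference code follows: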

# -*- coding: utf-8 -*-
import time

import pandas as pd
from sklearn import metrics
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier  # alternative model the author experimented with

# Features used only by the pre-training model (voucher usage).
PRETRAIN_FEATURES = ['voucher_num', 'use_nums', 'use_ratio', 'voucher_ratio',
                     'avg_discount', 'avg_txn_amt']
# Features used by the final model (region, demographics, transactions).
FINAL_FEATURES = ['pro_code', 'city_id', 'age', 'sex', 'account_age',
                  'txn_count', 'txn_amount_mean', 'txn_min_amount']
INT_COLUMNS = ['denomination', 'min_amount', 'pro_code', 'age', 'sex', 'account_age',
               'txn_count', 'use_nums', 'txn_min_amount', 'txn_amount_mean',
               'avg_discount', 'voucher_num', 'avg_txn_amt', 'batch_no', 'city_id']
FLOAT_COLUMNS = ['use_ratio', 'voucher_ratio']


def prepare(ds, province, citydata, extra_columns):
    """Attach province and city codes, keep the modelling columns, and fix dtypes."""
    ds = pd.merge(ds, province, how="left", on="province")
    ds = pd.merge(ds, citydata, how="left", on="city")
    df = ds[INT_COLUMNS + FLOAT_COLUMNS + ['voucher_no'] + extra_columns].copy()
    dtypes = {c: "int" for c in INT_COLUMNS}
    dtypes.update({c: "float" for c in FLOAT_COLUMNS})
    dtypes['voucher_no'] = "str"
    return df.astype(dtypes)


def pseudo_label(model, unlabel_df, model_features, keep_features):
    """Predict labels for the unlabelled rows; return keep_features plus a 'label' column."""
    X = preprocessing.scale(unlabel_df[model_features])  # standardize
    out = unlabel_df[keep_features].copy()
    out['label'] = model.predict(X)
    return out


def main():
    # Province-name-to-code mapping
    data = {
        "province": ['河北省', '山西省', '内蒙古自治区', '辽宁省', '吉林省', '黑龙江省', '江苏省',
                     '浙江省', '安徽省', '福建省', '江西省', '山东省', '河南省', '湖北省', '湖南省',
                     '广东省', '广西壮族自治区', '海南省', '四川省', '贵州省', '云南省', '西藏自治区',
                     '陕西省', '甘肃省', '青海省', '宁夏回族自治区', '新疆维吾尔自治区', '北京市',
                     '天津市', '上海市', '重庆市'],
        "pro_code": [13, 14, 15, 21, 22, 23, 32, 33, 34, 35, 36, 37, 41, 42, 43, 44, 45, 46,
                     51, 52, 53, 54, 61, 62, 63, 64, 65, 11, 12, 31, 50],
    }
    province = pd.DataFrame(data, columns=["province", "pro_code"])
    citydata = pd.read_csv(r"D:\city.csv")  # city mapping table (joined on "city", supplies "city_id")

    # Load the labelled data
    label_df = prepare(pd.read_csv(r"D:\label.csv"), province, citydata, ['label'])
    label_df['label'] = label_df['label'].astype("int")
    # Load the unlabelled data
    unlabel_df = prepare(pd.read_csv(r"D:\unlabel.csv"), province, citydata, ['phone'])
    unlabel_df['phone'] = unlabel_df['phone'].astype("str")

    # ---- Pre-training ----
    # Sample 80% for training; the remaining 20% form the validation set.
    # (The original drew both samples independently, so they could overlap.)
    print("All labelled samples: %d rows, %d columns" % label_df.shape)
    train_label_df = label_df.sample(frac=0.8)
    print("Training set: %d rows, %d columns" % train_label_df.shape)
    test_label_df = label_df.drop(train_label_df.index)
    print("Validation set: %d rows, %d columns" % test_label_df.shape)

    label_X = preprocessing.scale(train_label_df[PRETRAIN_FEATURES])  # standardize
    label_y = train_label_df['label']
    model = LogisticRegression()  # or DecisionTreeClassifier()
    model.fit(label_X, label_y)
    # Optional validation (disabled in the original):
    # val_X = preprocessing.scale(test_label_df[PRETRAIN_FEATURES])
    # print(metrics.f1_score(test_label_df['label'], model.predict(val_X)))

    # Label the unlabelled samples with the pre-trained model and add them to the training set.
    unlabel_X_new = pseudo_label(model, unlabel_df, PRETRAIN_FEATURES, FINAL_FEATURES)
    label_df = pd.concat([label_df, unlabel_X_new])
    # ---- End of pre-training ----

    # ---- Iterative training and prediction ----
    f1_score_old = 0.0
    f1_score = 0.3  # starting value, higher than the score of predicting every sample as 1
    first_round = True
    label_df_cons = label_df  # base training set, kept unchanged across rounds
    while (f1_score - f1_score_old) > 0.0001:  # iterate until F1 stops improving
        if not first_round:  # the first round only rules out sample-size effects
            f1_score_old = f1_score
        print("All samples: %d rows, %d columns" % label_df.shape)
        train_label_df = label_df  # train on everything
        test_label_df = label_df_cons.sample(frac=0.3)  # F1 is measured on rows also used for training
        print("Validation set: %d rows, %d columns" % test_label_df.shape)

        label_X = preprocessing.scale(train_label_df[FINAL_FEATURES])  # standardize
        label_y = train_label_df['label']
        model = LogisticRegression()  # or DecisionTreeClassifier()
        model.fit(label_X, label_y)

        if not first_round:  # the first round is not scored
            predicted_X = preprocessing.scale(test_label_df[FINAL_FEATURES])
            f1_score = metrics.f1_score(test_label_df['label'], model.predict(predicted_X))
            print(f1_score)
        first_round = False

        if f1_score_old < f1_score:
            # Re-label the unlabelled samples and rebuild the training set.
            unlabel_X_new = pseudo_label(model, unlabel_df, FINAL_FEATURES, FINAL_FEATURES)
            label_df = pd.concat([label_df_cons, unlabel_X_new])
        else:
            break

    # Iteration finished: write the result. Doing this after the loop also covers the
    # case where the loop condition ends the iteration, which the original skipped.
    unlabel_X = preprocessing.scale(unlabel_df[FINAL_FEATURES])
    outset = unlabel_df[['phone', 'voucher_no']].copy()
    outset['prob'] = model.predict_proba(unlabel_X)[:, 1]  # probability of the positive class
    outset = outset[outset['prob'] >= 0.55].assign(label=1)  # keep predicted positives only
    outset[['phone', 'voucher_no', 'label']].to_csv(r"D:\gd_delta.csv", index=False, header=False)
    # The author's commented-out self-check (threshold 0.57, scoring the model against
    # its own thresholded labels) recorded an F1 of 0.855946148093.


if __name__ == '__main__':
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    main()
    end = time.perf_counter()
    print('finish all in %s' % str(end - start))
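
The script above standardizes each data set separately and, inside the loop, measures F1 on rows that were also used for training. As a hedged illustration of the evaluation step mentioned at the start (judging a classifier on a test set), the sketch below uses synthetic data from `make_classification` as a stand-in for the competition files, a disjoint 80/20 split, and a `StandardScaler` fitted on the training rows only. The data, sizes, and variable names are assumptions for illustration and are not part of the original solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the labelled competition data (assumption).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=0)

# Disjoint 80/20 split: no row appears in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Fit the scaler on the training rows only, then reuse its statistics
# for the validation rows.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

train_f1 = f1_score(y_train, model.predict(scaler.transform(X_train)))
val_f1 = f1_score(y_val, model.predict(scaler.transform(X_val)))
print("train F1: %.3f, validation F1: %.3f" % (train_f1, val_f1))
```

A training score that is clearly higher than the validation score would be one sign of overfitting.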

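Regularization is another way to reduce overfitting. It adds a penalty term to the objective (cost) function being optimized; the L1 and L2 penalties are the most common. How it is applied depends on the parameters or the internal algorithm of the specific scikit-learn model.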
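
As a minimal sketch of what this looks like for the model used here: scikit-learn's `LogisticRegression` applies an L2 penalty by default, and its `C` parameter is the inverse of the regularization strength, so a smaller `C` shrinks the coefficients more. The synthetic data and the three `C` values below are assumptions chosen for illustration. In many scikit-learn versions an L1 penalty is selected through the `penalty` parameter together with a compatible solver such as `liblinear` or `saga`; check the documentation of the installed version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition features (assumption).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Default penalty is L2; smaller C means stronger regularization.
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print("C=%-6g  ||w||_2 = %.3f" % (C, coef_norms[C]))
```

In practice `C` would be tuned against a held-out set or with cross-validation rather than fixed by hand.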
