Machine-learning workflow: define the learning task -> mathematical modeling -> sample the training data -> feature analysis and extraction -> algorithm design and coding -> model training and tuning (performance evaluation and metrics) -> generalization assessment (resampling and remodeling).

Algorithm idea: apply a semi-supervised (self-training) strategy. First train a model on the labeled training set, then use that model to label the prediction set, merge the newly labeled samples back into the training set, and retrain; f1-score is used as the performance metric. Compared with the previous version, the main change is calling model.predict_proba() to get the positive-class probability and selecting positive samples with a custom threshold. The code is as follows:

# -*- coding: utf-8 -*-
import time

import pandas as pd
from sklearn import metrics, preprocessing
from sklearn.linear_model import LogisticRegression
# from sklearn.tree import DecisionTreeClassifier

FEATURES = ['pro_code', 'city_id', 'age', 'sex', 'account_age',
            'txn_count', 'txn_amount_mean', 'txn_min_amount']

INT_COLS = ['denomination', 'min_amount', 'pro_code', 'age', 'sex',
            'account_age', 'txn_count', 'use_nums', 'txn_min_amount',
            'txn_amount_mean', 'avg_discount', 'voucher_num',
            'avg_txn_amt', 'batch_no', 'city_id']
FLOAT_COLS = ['use_ratio', 'voucher_ratio']


def main():
    # Province-to-code mapping
    data = {"province": ['河北省', '山西省', '内蒙古自治区', '辽宁省', '吉林省',
                         '黑龙江省', '江苏省', '浙江省', '安徽省', '福建省',
                         '江西省', '山东省', '河南省', '湖北省', '湖南省',
                         '广东省', '广西壮族自治区', '海南省', '四川省', '贵州省',
                         '云南省', '西藏自治区', '陕西省', '甘肃省', '青海省',
                         '宁夏回族自治区', '新疆维吾尔自治区', '北京市', '天津市',
                         '上海市', '重庆市'],
            "pro_code": [13, 14, 15, 21, 22, 23, 32, 33, 34, 35, 36, 37,
                         41, 42, 43, 44, 45, 46, 51, 52, 53, 54, 61, 62,
                         63, 64, 65, 11, 12, 31, 50]}
    province = pd.DataFrame(data, columns=["province", "pro_code"])
    citydata = pd.read_csv(r"D:\city.csv")  # city-to-code mapping

    dtypes = {c: "int" for c in INT_COLS}
    dtypes.update({c: "float" for c in FLOAT_COLS})

    # Load the labeled data
    label_ds = pd.read_csv(r"D:\label.csv")
    label_ds = pd.merge(label_ds, province, how="left", on="province")
    label_ds = pd.merge(label_ds, citydata, how="left", on="city")
    label_df = label_ds[INT_COLS + FLOAT_COLS + ['voucher_no', 'label']].astype(
        dict(dtypes, voucher_no="str", label="int"))

    # Load the unlabeled data
    unlabel_ds = pd.read_csv(r"D:\unlabel.csv")
    unlabel_ds = pd.merge(unlabel_ds, province, how="left", on="province")
    unlabel_ds = pd.merge(unlabel_ds, citydata, how="left", on="city")
    unlabel_df = unlabel_ds[INT_COLS + FLOAT_COLS + ['phone', 'voucher_no']].astype(
        dict(dtypes, phone="str", voucher_no="str"))

    # Model training and prediction
    f1_score_old = 0.0
    f1_score = 0.3  # higher than the all-positives baseline
    flag = 1
    label_df_cons = label_df  # the original labeled set never changes

    while (f1_score - f1_score_old) > 0.0001:  # iterate until f1-score stops improving
        if flag == 0:  # skip on the first pass to rule out the effect of sample size
            f1_score_old = f1_score

        print("total samples:", label_df.shape[0], "rows,", label_df.shape[1], "columns")
        train_label_df = label_df  # train on the full set (could use .sample(frac=0.8))
        print("training set:", train_label_df.shape[0], "rows")
        test_label_df = label_df_cons.sample(frac=0.3)  # validate on a sample of the original labels
        print("validation set:", test_label_df.shape[0], "rows")

        # Train
        label_X = preprocessing.scale(train_label_df[FEATURES])  # standardize
        label_y = train_label_df['label']
        model = LogisticRegression()
        # model = DecisionTreeClassifier()  # alternative model for later iterations
        model.fit(label_X, label_y)

        if flag == 0:  # no scoring on the first pass
            expected = test_label_df['label']
            predicted_X = preprocessing.scale(test_label_df[FEATURES])
            predicted = model.predict(predicted_X)
            f1_score = metrics.f1_score(expected, predicted)
            print(f1_score)
        flag = 0

        if f1_score_old < f1_score:
            # Pseudo-label the unlabeled samples and add them to the training set
            unlabel_X_noScale = unlabel_df[FEATURES]
            unlabel_X = preprocessing.scale(unlabel_X_noScale)
            unlabel_y = model.predict(unlabel_X)
            out_y = pd.DataFrame(unlabel_y.reshape(-1, 1), columns=['label'])
            unlabel_X_new = unlabel_X_noScale.join(out_y, how='left')
            label_df = pd.concat([label_df_cons, unlabel_X_new])  # new training set
        else:
            # Converged: predict probabilities and write out the confident positives
            unlabel_info = unlabel_df[['phone', 'voucher_no']]
            unlabel_X = preprocessing.scale(unlabel_df[FEATURES])
            unlabel_y = model.predict_proba(unlabel_X)[:, 1]  # positive-class probability
            out_y = pd.DataFrame(unlabel_y, columns=['prob'])
            outset = unlabel_info.join(out_y, how='left')
            outset["label"] = (outset["prob"] >= 0.57).astype(int)  # custom threshold
            outset = outset[outset['label'] == 1][['phone', 'voucher_no', 'label']]
            outset.to_csv(r'D:\gd_delta.csv', index=False, header=None)
            # (an offline check that re-predicts on the thresholded pseudo-labels
            #  gave f1-score = 0.855946148093)
            break


if __name__ == '__main__':
    start = time.time()  # time.clock() was removed in Python 3.8
    main()
    end = time.time()
    print('finish all in %s' % str(end - start))
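The probability-threshold step above can be isolated in a minimal sketch: instead of model.predict() (a fixed 0.5 cutoff), take the positive-class probability from predict_proba() and apply a custom threshold. The two-blob synthetic data here is an illustrative assumption, not the competition data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real features: two Gaussian blobs
rng = np.random.RandomState(0)
X = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))]
y = np.r_[np.zeros(50), np.ones(50)]

model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)[:, 1]       # probability of the positive class
threshold = 0.57                           # same cutoff as in the article
labels = (proba >= threshold).astype(int)  # positive only above the threshold
print(labels.sum(), "samples selected as positive")
```

Raising the threshold above 0.5 trades recall for precision, which is why tuning it against the f1-score matters here.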

There are three directions for further improvement:

1) Use one model to label the prediction set and a different model for the iterative retraining;

2) Try extracting different features for modeling, and additionally discretize the feature values;

3) Pretrain on one subset of the features and train the final model on another subset, which can reduce overfitting.
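The first idea mirrors the DecisionTreeClassifier line commented out in the code: pseudo-label with one model family, retrain with another. A minimal sketch on synthetic data; the blob data, the 0.1/0.9 confidence band, and the model settings are illustrative assumptions, not from the original:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
# 60 labeled and 200 unlabeled samples from two Gaussian blobs
X_lab = np.r_[rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))]
y_lab = np.r_[np.zeros(30), np.ones(30)]
X_unlab = np.r_[rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))]

# Step 1: pseudo-label the unlabeled pool with logistic regression,
# keeping only confident predictions (confidence band is illustrative)
labeler = LogisticRegression().fit(X_lab, y_lab)
proba = labeler.predict_proba(X_unlab)[:, 1]
confident = (proba < 0.1) | (proba > 0.9)
X_pseudo = X_unlab[confident]
y_pseudo = (proba[confident] > 0.9).astype(int)

# Step 2: retrain a different model family on labeled + pseudo-labeled data
X_all = np.r_[X_lab, X_pseudo]
y_all = np.r_[y_lab, y_pseudo]
final_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_all, y_all)
print("pseudo-labeled:", confident.sum())
```

Using two model families means the second model's mistakes are less correlated with the pseudo-labeling errors of the first, which is the usual motivation for this variant of self-training.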

[Python Learning Series 17] Training a logistic regression model with scikit-learn (delta competition code 2)
