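Complete Steps for Modeling in Python

Contents

Imports
Variable type conversion
- Extracting object variables
- Converting percentage variables
- Standardization
Missing value handling
- Inspection
- numeric
Feature engineering
- Downsampling
- Regular expressions
- map function
- object
Encoding object variables
- One-hot encoding
- One-hot encoding (not sparse)
- label_encoder
- Comparison
Separating features and target
Merging data
Modeling
Decision tree visualization
Feature importance
Balancing samples
Model evaluation
- train test split
- Model evaluation
- ROC
- oob
Confusion matrix
- Normalized confusion matrix
Cost matrix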

Imports

import os.path
import csv
import re
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px  # also available as the standalone plotly_express package
import pydotplus
from IPython.display import Image
from sklearn import preprocessing, tree
from sklearn.impute import SimpleImputer  # successor of the removed sklearn.preprocessing.Imputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_curve, auc, roc_auc_score)

warnings.simplefilter("ignore")

Variable type conversion

Extracting object variables

cat_cols = [col for col in X.columns.values if X[col].dtype == 'O']

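Converting percentage variables

for col in data.columns:
    try:
        # Convert string columns such as '12.5%' to floats
        if data[col].dtype == 'O' and str(data[col].iloc[0]).endswith('%'):
            print(col)
            data[col] = data[col].apply(lambda x: float(x[:-1]))
    except (TypeError, ValueError):
        # Skip columns whose values cannot be parsed
        continue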
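The same conversion can also be written with pandas string methods. This is a minimal sketch on a made-up frame `demo` (its column names are assumptions for illustration), and it converts a column only when every value ends with '%':

```python
import pandas as pd

# Toy data, for illustration only
demo = pd.DataFrame({'rate': ['12.5%', '3%', '100%'],
                     'name': ['a', 'b', 'c']})

for col in demo.columns:
    # Skip numeric columns; .str only works on string-like data
    if pd.api.types.is_numeric_dtype(demo[col]):
        continue
    # Convert only when every value in the column ends with '%'
    if demo[col].str.endswith('%').all():
        demo[col] = demo[col].str.rstrip('%').astype(float)

print(demo)
```

Checking every value instead of only the first one avoids converting a column that merely starts with a percentage string.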

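Standardization

from sklearn import preprocessing

# Min-max scale every numeric column to the range [0, 1]
for col in data.columns:
    if data[col].dtype != 'O':
        data[col] = preprocessing.minmax_scale(data[col])

Or

# Standardize to zero mean and unit variance
x = preprocessing.scale(x)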
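When a separate test set is scaled as well, one common option is to fit the scaler on the training data only and reuse it for the test data. The sketch below assumes scikit-learn's `MinMaxScaler`; the arrays `train` and `test` are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data, for illustration only
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0 and max=10 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values outside the training range end up outside [0, 1], which is expected with this approach.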

Missing value handling

Inspection

missing_values_table(X)
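`missing_values_table` is not a pandas or scikit-learn function, and its definition is not shown here. A hypothetical version, assuming it reports the count and percentage of missing values per column, might look like this:

```python
import numpy as np
import pandas as pd


def missing_values_table(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    n_missing = df.isnull().sum()
    pct_missing = 100 * n_missing / len(df)
    table = pd.DataFrame({'n_missing': n_missing, 'pct_missing': pct_missing})
    # Keep only columns that have missing values, worst first
    table = table[table['n_missing'] > 0]
    return table.sort_values('pct_missing', ascending=False)


# Toy data, for illustration only
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                     'b': ['x', None, 'z', 'w'],
                     'c': [1, 2, 3, 4]})
missing_table = missing_values_table(demo)
print(missing_table)
```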

numeric

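num = X.drop(cat_cols, axis=1)
# Pick one fill strategy for the numeric columns
num = num.fillna(num.mean())              # fill with the column mean
# num = num.fillna(num.median())          # or with the column median
# num = num.fillna(num.mode().iloc[0])    # or with the column mode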

Feature engineering

Downsampling

sub_sample

def lower_sample_data(df, class_):
    '''Downsample the other classes to the size of the minority class class_.'''
    data0 = df[df['retention'] == class_]  # minority-class samples
    data1 = df[df['retention'] != class_]  # all remaining samples
    # Draw len(data0) random row positions from data1 (with replacement)
    index = np.random.randint(len(data1), size=(len(df) - len(data1)))
    lower_data1 = data1.iloc[list(index)]  # downsampled majority
    return pd.concat([lower_data1, data0])

data = lower_sample_data(data,'lost')
data['retention'].value_counts()
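`np.random.randint` draws positions with replacement, so the same majority row can be picked more than once. If duplicates are not wanted, `DataFrame.sample` is one alternative. A minimal sketch on made-up data (the frame `demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data: 3 'lost' rows and 10 'active' rows
demo = pd.DataFrame({'retention': ['lost'] * 3 + ['active'] * 10,
                     'value': range(13)})

minority = demo[demo['retention'] == 'lost']
majority = demo[demo['retention'] != 'lost']

# Draw as many majority rows as there are minority rows, without replacement
majority_down = majority.sample(n=len(minority), replace=False, random_state=0)
balanced = pd.concat([majority_down, minority])

print(balanced['retention'].value_counts())
```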

Regular expressions

import re

# Keep the first digit only
data['brand_version'] = data['brand'].apply(lambda x: re.findall(r'\d', x)[0] if re.findall(r'\d', x) else 'null')
data['brand_version'] = data['brand_version'].apply(lambda x: int(x) if x != 'null' else 'null')

# Classify by brand name ('小米' is Xiaomi, '红米' is Redmi)
data['brand_class'] = data['brand'].apply(lambda x: '小米' if '小米' in x else ('红米' if '红米' in x else 'others'))

# Keep English letters (lowercased) and spaces only
uncn = re.compile(r'[a-z ]')
data['brand_series'] = data['brand'].apply(lambda x: "".join(uncn.findall(x.lower())))

# Keep English letters and digits only
data['brand_detail'] = data['brand'].apply(lambda x: re.sub(r'[^a-zA-Z0-9]+', '', x))

map function

def price_map(x):
    # Map each price range to an ordinal band; anything else falls into band 7
    bands = {'0-600': 1, '600-1000': 2, '1000-1500': 3,
             '1500-2000': 4, '2000-3000': 5, '3000-4000': 6}
    return bands.get(x, 7)

data['price_band'] = data['price'].apply(price_map)

object

X = X.fillna('missing')

Encoding object variables

One-hot encoding
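A minimal sketch of sparse one-hot encoding, assuming scikit-learn's `OneHotEncoder` is the intended tool. The toy frame `cat_demo` is made up for illustration:

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data, for illustration only
cat_demo = pd.DataFrame({'brand': ['a', 'b', 'a', 'c'],
                         'city': ['x', 'y', 'y', 'x']})

# By default OneHotEncoder returns a SciPy sparse matrix
encoder = OneHotEncoder(handle_unknown='ignore')
onehot_sparse = encoder.fit_transform(cat_demo)

print(sparse.issparse(onehot_sparse))
print(onehot_sparse.shape)   # 4 rows, 3 + 2 indicator columns
print(encoder.categories_)   # categories found in each column
```

A sparse result stores only the non-zero entries, which helps when a column has many distinct categories.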

One-hot encoding (not sparse)
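The modeling sections use a frame called `x_onehot` whose construction is not shown. One plausible way to build a dense version, assuming `pandas.get_dummies`, is sketched below. All `_demo` names are made up for illustration:

```python
import pandas as pd

# Toy frame standing in for the real data
demo = pd.DataFrame({'price': [100.0, 250.0, 80.0],
                     'brand': ['a', 'b', 'a']})
cat_cols_demo = ['brand']

num_demo = demo.drop(cat_cols_demo, axis=1)
# get_dummies returns an ordinary (dense) DataFrame of 0/1 indicator columns
cat_onehot_demo = pd.get_dummies(demo[cat_cols_demo], dtype=int)
x_onehot_demo = pd.concat([num_demo, cat_onehot_demo], axis=1)

print(x_onehot_demo)
```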

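label_encoder

cat_labelcoder = X[cat_cols].copy()  # assumed: the object columns extracted earlier
le = preprocessing.LabelEncoder()
for col in cat_cols:
    # Cast to str so mixed-type columns can be encoded
    cat_labelcoder[col] = le.fit_transform(cat_labelcoder[col].astype('str'))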

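Comparison

In principle, encoding an object variable with label_encoder or with a one-hot encoder carries the same information. A label encoder, however, imposes an ordering on the categories, and a tree splits the encoded column as if it were a numeric variable. If the integer assignment differs between runs, the values that fall into the left and right subtrees of each split differ too, and the results become inconsistent.
As a rule, use label_encoder only for object variables with an inherent order, such as "bad, average, good, very good". In all other cases, use one-hot encoding.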
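One caveat for ordered categories: scikit-learn's `LabelEncoder` assigns codes in sorted (alphabetical) order, not in the order of meaning. The sketch below compares it with an explicit mapping. The series `ratings` and the dictionary `order` are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ratings = pd.Series(['bad', 'average', 'good', 'very good', 'good'])

# LabelEncoder sorts the classes, so 'average' gets 0 and 'bad' gets 1
le_codes = LabelEncoder().fit_transform(ratings)
print(le_codes.tolist())

# An explicit mapping keeps the intended order bad < average < good < very good
order = {'bad': 0, 'average': 1, 'good': 2, 'very good': 3}
ordinal_codes = ratings.map(order)
print(ordinal_codes.tolist())
```

When the order matters, an explicit mapping like this one keeps the codes aligned with it.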

Separating features and target

x = data.drop(['id', 'retention'], axis=1)
# Target: 1 if the user is 'lost', otherwise 0
y = data['retention'].apply(lambda v: 1 if v == 'lost' else 0)

Merging data

x_labelcoder = pd.concat([num,cat_labelcoder],axis=1)

Modeling

clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=10, bootstrap=True, random_state=0)
# Fit the model
clf.fit(x_onehot, y)

Decision tree visualization

clf = tree.DecisionTreeClassifier(min_samples_split=0.1, max_depth=int(np.log2(x_onehot.shape[1])), random_state=0, class_weight='balanced')
# Fit the model
clf.fit(x_onehot, y)
# extract single tree
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=list(x_onehot.columns),  # important!
                                # names must follow the ascending order of the class labels
                                class_names=[str(c) for c in data['tag'].unique()],
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
# Display in a Jupyter notebook (IPython)
Image(graph.create_png())

Feature importance

clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=10, bootstrap=True, random_state=0)
# Fit the model
clf.fit(x_onehot, y)
y_importances = clf.feature_importances_
x_importances = x_onehot.columns
# Sort features by importance and plot the top 10
df = pd.DataFrame({'x': x_importances, 'y': y_importances}).sort_values(by='y', ascending=False)
px.bar_polar(df[:10], r="y", theta="x", color="x", template='plotly_white',
             color_discrete_sequence=px.colors.sequential.Plotly3[-2::-1])

Balancing samples

class_weight='balanced'

clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=10, bootstrap=True, random_state=0, class_weight='balanced')
# Fit the model
clf.fit(x_onehot, y)
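To see what `class_weight='balanced'` amounts to, scikit-learn exposes `compute_class_weight`, which uses `n_samples / (n_classes * np.bincount(y))`. A small sketch on made-up labels `y_demo`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 8 samples of class 0 and 2 samples of class 1
y_demo = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_demo)

weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
# The same weights computed by hand
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The rarer class receives the larger weight, so errors on it count more during training.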

Model evaluation

train test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_onehot, y, test_size=0.33, random_state=42)

Model evaluation

from sklearn.metrics import classification_report
# Refit on the training split so the test rows stay unseen
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))

ROC

from sklearn.metrics import roc_curve, auc, roc_auc_score

# y_test: true labels; y_score: predicted probability of the positive class
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
# plt.plot(fpr, tpr) draws the curve; roc_auc only records the AUC value computed by auc()
plt.plot(fpr, tpr, lw=1, label='ROC(area = %0.2f)' % (roc_auc))
plt.xlabel("FPR (False Positive Rate)")
plt.ylabel("TPR (True Positive Rate)")
plt.title("Receiver Operating Characteristic, ROC(AUC = %0.2f)" % (roc_auc))
plt.show()

roc_auc_score(y_test, y_score)

oob

clf = RandomForestClassifier(n_estimators=100, criterion='gini',max_depth=x_onehot.shape[1],bootstrap=True,random_state=0,class_weight='balanced',oob_score=True)
# Fit the model
clf.fit(x_onehot, y)
# Accuracy estimated on the out-of-bag samples
clf.oob_score_

Confusion matrix

from sklearn.metrics import confusion_matrix

ax = sns.heatmap(confusion_matrix(y_test, y_predict), cmap='Blues', annot=True, fmt='g')
plt.title('confusion matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')

Normalized confusion matrix

# Divide every cell by the total number of samples
cm = confusion_matrix(y_test, y_predict)
cm_norm = np.around(cm / np.sum(cm), decimals=2)
ax = sns.heatmap(cm_norm, cmap='Blues', annot=True, fmt='g')
plt.title('confusion matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
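Recent scikit-learn versions (0.22 and later) can also normalize directly through the `normalize` argument of `confusion_matrix`. A short sketch on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, for illustration only
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1, 0]

# normalize='all': every cell is divided by the total number of samples
cm_all = confusion_matrix(y_true_demo, y_pred_demo, normalize='all')
# normalize='true': every row (true class) sums to 1
cm_rows = confusion_matrix(y_true_demo, y_pred_demo, normalize='true')

print(cm_all)
print(cm_rows)
```

`normalize='all'` matches the manual division by the grand total, while `normalize='true'` gives per-class rates that are often easier to read with imbalanced classes.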

Cost matrix

cm = confusion_matrix(y_test, y_predict)
# 0 = churned, 1 = active
TP = cm[1][1]
TN = cm[0][0]
FP = cm[0][1] * 5  # false positives weighted by a cost of 5
FN = cm[1][0] * 2  # false negatives weighted by a cost of 2
# Cost-weighted metrics
accuracy = round((TP + TN) / (TP + TN + FP + FN), 2)
recall = round(TP / (TP + FN), 2)
fscore = round(accuracy * recall / (accuracy + recall), 2)
cm_biz = np.vstack(([TN, FP], [FN, TP]))
cm_biz = pd.DataFrame(cm_biz)
ax = sns.heatmap(cm_biz, cmap='Blues', annot=True, fmt='g')
plt.title('cost matrix' + '\n' + 'accuracy= ' + str(accuracy) + '\n' + 'recall= ' + str(recall) + '\n' + 'f_score= ' + str(fscore) + '\n')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
