DataCastle Employee Attrition Prediction Competition: A Personal Summary
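Competition link
Task:
Given records of factors that influence employee attrition and of whether each employee has left, build a model that predicts which employees are likely to leave.
Data fields:
(1) Age: employee age;
(2) Label: whether the employee has left; 1 means left, 2 means still employed; this is the prediction target;
(3) BusinessTravel: business travel frequency; Non-Travel means no travel, Travel_Rarely means infrequent travel, Travel_Frequently means frequent travel;
(4) Department: the employee's department; Sales, Research & Development, or Human Resources;
(5) DistanceFromHome: distance between the company and home, from 1 to 29; 1 is nearest, 29 is farthest;
(6) Education: education level, from 1 to 5; 5 is the highest;
(7) EducationField: field of study; Life Sciences, Medical, Marketing, Technical Degree, Human Resources, or Other;
(8) EmployeeNumber: employee number;
(9) EnvironmentSatisfaction: satisfaction with the work environment, from 1 to 4; 1 is lowest, 4 is highest;
(10) Gender: employee gender; Male or Female;
(11) JobInvolvement: job involvement, from 1 to 4; 1 is lowest, 4 is highest;
(12) JobLevel: job level, from 1 to 5; 1 is lowest, 5 is highest;
(13) JobRole: job role; Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, or Human Resources;
(14) JobSatisfaction: job satisfaction, from 1 to 4; 1 is lowest, 4 is highest;
(15) MaritalStatus: marital status; Single, Married, or Divorced;
(16) MonthlyIncome: monthly income, between 1009 and 19999;
(17) NumCompaniesWorked: number of companies the employee has worked for;
(18) Over18: whether the employee is over 18;
(19) OverTime: whether the employee works overtime; Yes or No;
(20) PercentSalaryHike: percentage salary increase;
(21) PerformanceRating: performance rating;
(22) RelationshipSatisfaction: relationship satisfaction, from 1 to 4; 1 is lowest, 4 is highest;
(23) StandardHours: standard working hours;
(24) StockOptionLevel: stock option level;
(25) TotalWorkingYears: total years of work experience;
(26) TrainingTimesLastYear: amount of training in the previous year, from 0 to 6; 0 means no training, 6 means the most training;
(27) WorkLifeBalance: work-life balance, from 1 to 4; 1 is lowest, 4 is highest;
(28) YearsAtCompany: years at the current company;
(29) YearsInCurrentRole: years in the current role;
(30) YearsSinceLastPromotion: years since the last promotion;
(31) YearsWithCurrManager: years working with the current manager;
Scoring:
The metric is accuracy. The higher the accuracy, the better the model predicts both the employees who left and those who stayed.
Loading common libraries and data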
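As a small illustration of the metric, accuracy is the fraction of predictions that match the true labels. The sketch below uses made-up label values for five employees; the helper name `accuracy` and the sample arrays are assumptions for illustration, not part of the competition materials.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels for five employees (the label encoding does not matter here).
example_true = [1, 0, 0, 1, 0]
example_pred = [1, 0, 1, 1, 0]

# Four of the five predictions match, so the score is 4 / 5.
example_score = accuracy(example_true, example_pred)
print(example_score)
```

Because accuracy counts both classes equally per sample, a model that always predicts the more common class can still score well when the classes are imbalanced.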
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('train.csv')
test = pd.read_csv('test_noLabel.csv')
pd.set_option('display.max_columns', None)
Simple EDA
train.head()
test.head()
train.info()
print('-------------------')
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 32 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ID                        1100 non-null   int64
 1   Age                       1100 non-null   int64
 2   BusinessTravel            1100 non-null   object
 3   Department                1100 non-null   object
 4   DistanceFromHome          1100 non-null   int64
 5   Education                 1100 non-null   int64
 6   EducationField            1100 non-null   object
 7   EmployeeNumber            1100 non-null   int64
 8   EnvironmentSatisfaction   1100 non-null   int64
 9   Gender                    1100 non-null   object
 10  JobInvolvement            1100 non-null   int64
 11  JobLevel                  1100 non-null   int64
 12  JobRole                   1100 non-null   object
 13  JobSatisfaction           1100 non-null   int64
 14  MaritalStatus             1100 non-null   object
 15  MonthlyIncome             1100 non-null   int64
 16  NumCompaniesWorked        1100 non-null   int64
 17  Over18                    1100 non-null   object
 18  OverTime                  1100 non-null   object
 19  PercentSalaryHike         1100 non-null   int64
 20  PerformanceRating         1100 non-null   int64
 21  RelationshipSatisfaction  1100 non-null   int64
 22  StandardHours             1100 non-null   int64
 23  StockOptionLevel          1100 non-null   int64
 24  TotalWorkingYears         1100 non-null   int64
 25  TrainingTimesLastYear     1100 non-null   int64
 26  WorkLifeBalance           1100 non-null   int64
 27  YearsAtCompany            1100 non-null   int64
 28  YearsInCurrentRole        1100 non-null   int64
 29  YearsSinceLastPromotion   1100 non-null   int64
 30  YearsWithCurrManager      1100 non-null   int64
 31  Label                     1100 non-null   int64
dtypes: int64(24), object(8)
memory usage: 275.1+ KB
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 31 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ID                        350 non-null    int64
 1   Age                       350 non-null    int64
 2   BusinessTravel            350 non-null    object
 3   Department                350 non-null    object
 4   DistanceFromHome          350 non-null    int64
 5   Education                 350 non-null    int64
 6   EducationField            350 non-null    object
 7   EmployeeNumber            350 non-null    int64
 8   EnvironmentSatisfaction   350 non-null    int64
 9   Gender                    350 non-null    object
 10  JobInvolvement            350 non-null    int64
 11  JobLevel                  350 non-null    int64
 12  JobRole                   350 non-null    object
 13  JobSatisfaction           350 non-null    int64
 14  MaritalStatus             350 non-null    object
 15  MonthlyIncome             350 non-null    int64
 16  NumCompaniesWorked        350 non-null    int64
 17  Over18                    350 non-null    object
 18  OverTime                  350 non-null    object
 19  PercentSalaryHike         350 non-null    int64
 20  PerformanceRating         350 non-null    int64
 21  RelationshipSatisfaction  350 non-null    int64
 22  StandardHours             350 non-null    int64
 23  StockOptionLevel          350 non-null    int64
 24  TotalWorkingYears         350 non-null    int64
 25  TrainingTimesLastYear     350 non-null    int64
 26  WorkLifeBalance           350 non-null    int64
 27  YearsAtCompany            350 non-null    int64
 28  YearsInCurrentRole        350 non-null    int64
 29  YearsSinceLastPromotion   350 non-null    int64
 30  YearsWithCurrManager      350 non-null    int64
dtypes: int64(23), object(8)
memory usage: 84.9+ KB
As the output shows, the dataset is very simple and has no missing values, so no missing-value handling is needed.
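The same conclusion can also be checked programmatically rather than by reading the `info()` output. The following is a minimal sketch on a toy frame; the helper name `missing_report` and the sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the number of missing values per column, for columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0]

# Toy frame with two deliberately missing entries (not the competition data).
toy = pd.DataFrame({
    'Age': [30, 41, np.nan],
    'Gender': ['Male', None, 'Female'],
    'JobLevel': [1, 2, 3],
})

toy_missing = missing_report(toy)
print(toy_missing)
```

On a frame without missing values, such as the one described above, the returned Series would be empty.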
train.columns
Index(['ID', 'Age', 'BusinessTravel', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EmployeeNumber',
       'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel',
       'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome',
       'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager', 'Label'],
      dtype='object')
The data contains many categorical variables, so the next step looks at how they are distributed.
catfeatures = ['BusinessTravel', 'Department', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel','JobRole', 'JobSatisfaction', 'MaritalStatus', 'NumCompaniesWorked', 'Over18', 'OverTime','PerformanceRating', 'RelationshipSatisfaction', 'StandardHours','StockOptionLevel','TrainingTimesLastYear','WorkLifeBalance']
for feature in catfeatures:
    print(train[feature].value_counts())
    print('----------------')
Travel_Rarely 787
Travel_Frequently 205
Non-Travel 108
Name: BusinessTravel, dtype: int64
----------------
Research & Development 727
Sales 331
Human Resources 42
Name: Department, dtype: int64
----------------
3 431
4 301
2 206
1 126
5 36
Name: Education, dtype: int64
----------------
Life Sciences 462
Medical 337
Marketing 127
Technical Degree 92
Other 63
Human Resources 19
Name: EducationField, dtype: int64
----------------
4 338
3 337
1 215
2 210
Name: EnvironmentSatisfaction, dtype: int64
----------------
Male 653
Female 447
Name: Gender, dtype: int64
----------------
3 661
2 273
4 103
1 63
Name: JobInvolvement, dtype: int64
----------------
1 412
2 399
3 157
4 81
5 51
Name: JobLevel, dtype: int64
----------------
Sales Executive 247
Research Scientist 221
Laboratory Technician 205
Manufacturing Director 101
Healthcare Representative 100
Manager 80
Sales Representative 57
Research Director 56
Human Resources 33
Name: JobRole, dtype: int64
----------------
4 350
3 325
1 219
2 206
Name: JobSatisfaction, dtype: int64
----------------
Married 500
Single 362
Divorced 238
Name: MaritalStatus, dtype: int64
----------------
1 390
0 151
3 114
2 113
4 101
7 56
6 52
5 45
8 41
9 37
Name: NumCompaniesWorked, dtype: int64
----------------
Y 1100
Name: Over18, dtype: int64
----------------
No 794
Yes 306
Name: OverTime, dtype: int64
----------------
3 932
4 168
Name: PerformanceRating, dtype: int64
----------------
3 340
4 323
1 220
2 217
Name: RelationshipSatisfaction, dtype: int64
----------------
80 1100
Name: StandardHours, dtype: int64
----------------
0 473
1 446
2 122
3 59
Name: StockOptionLevel, dtype: int64
----------------
2 396
3 379
4 94
5 89
1 50
6 48
0 44
Name: TrainingTimesLastYear, dtype: int64
----------------
3 678
2 256
4 103
1 63
Name: WorkLifeBalance, dtype: int64
----------------
sns.set_style('whitegrid')
for feature in catfeatures:
    train[[feature, 'Label']].groupby([feature]).mean().plot.bar()
The value counts show that 'StandardHours' and 'Over18' each take a single value, so both columns are dropped.
train.drop(['StandardHours','Over18'],axis=1,inplace=True)
test.drop(['StandardHours','Over18'],axis=1,inplace=True)
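Single-valued columns like these can also be found automatically instead of by inspecting each value count. Below is a minimal sketch on a toy frame; the helper name `constant_columns` and the sample values are assumptions for illustration.

```python
import pandas as pd

def constant_columns(df):
    """List the columns that contain at most one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy frame mimicking the situation above (not the competition data).
toy = pd.DataFrame({
    'StandardHours': [80, 80, 80],
    'Over18': ['Y', 'Y', 'Y'],
    'OverTime': ['Yes', 'No', 'No'],
})

toy_constant = constant_columns(toy)
print(toy_constant)
```

The returned list could then be passed to `drop(..., axis=1)` for both the training and test frames.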
Checking correlations
columns = train.columns.drop('ID')
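# Restrict to numeric columns; recent pandas versions raise an error on object columns in corr()
correlation = train[columns].select_dtypes('number').corr()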
plt.figure(figsize=(15, 15))
sns.heatmap(correlation,square = True, annot=True, fmt='0.2f',vmax=0.8)
The correlation matrix shows strong collinearity between 'MonthlyIncome' and 'JobLevel', so 'MonthlyIncome', the one with the weaker correlation, is dropped.
train.drop(['MonthlyIncome'],axis=1,inplace=True)
test.drop(['MonthlyIncome'],axis=1,inplace=True)
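Strongly correlated pairs can also be listed directly rather than read off a heatmap. The sketch below runs on a toy frame; the helper name `correlated_pairs`, the 0.8 threshold, and the sample values are all assumptions for illustration, not values taken from the competition data.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, |r|) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes('number').corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Toy frame in which income rises almost linearly with job level.
toy = pd.DataFrame({
    'JobLevel': [1, 2, 3, 4, 5],
    'MonthlyIncome': [2000, 4100, 6050, 8200, 9900],
    'DistanceFromHome': [3, 1, 4, 1, 5],
})

toy_pairs = correlated_pairs(toy, threshold=0.8)
print(toy_pairs)
```

Which member of a correlated pair to drop remains a judgment call; one common choice is to keep the feature more strongly related to the target.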
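Feature engineering
A new feature, total satisfaction, can be built as the sum of the three satisfaction scores.
def fea_creat(df):
    # Total satisfaction: sum of job, environment, and relationship satisfaction
    df['Satisfaction'] = df['JobSatisfaction'] + df['EnvironmentSatisfaction'] + df['RelationshipSatisfaction']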
fea_creat(train)
fea_creat(test)
One-hot encode the categorical columns with get_dummies to make the later analysis easier.
train = pd.get_dummies(train)
test = pd.get_dummies(test)
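Encoding the training and test frames separately works here because every category appears in both (the shapes printed later match). When that is not guaranteed, the test columns can be aligned to the training columns. The following is a sketch under that assumption, meant for feature columns only (without the Label column); the helper name `aligned_dummies` and the toy data are hypothetical.

```python
import pandas as pd

def aligned_dummies(train_df, test_df):
    """One-hot encode both frames and force the test columns to match the training columns."""
    train_enc = pd.get_dummies(train_df)
    test_enc = pd.get_dummies(test_df)
    # Columns missing from the test set are added and filled with 0;
    # columns seen only in the test set are discarded.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc

# Toy frames: 'Human Resources' appears only in the training data.
toy_train = pd.DataFrame({'Age': [30, 41, 25],
                          'Department': ['Sales', 'Human Resources', 'Sales']})
toy_test = pd.DataFrame({'Age': [35, 28],
                         'Department': ['Sales', 'Sales']})

enc_train, enc_test = aligned_dummies(toy_train, toy_test)
print(list(enc_test.columns))
```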
EmployeeNumber is only an identifier, so it is dropped.
train.drop(['EmployeeNumber'],axis=1,inplace=True)
test.drop(['EmployeeNumber'],axis=1,inplace=True)
Normalize the data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_train = train.drop(['ID','Label'],axis=1)
df_test = test.drop('ID',axis=1)
train_scaled = scaler.fit_transform(df_train)
test_scaled = scaler.transform(df_test)
df_train.iloc[:,:] = train_scaled[:,:]
df_test.iloc[:,:] = test_scaled[:,:]
df_train = pd.concat([df_train,train['Label']],axis=1)
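An alternative to writing the scaled values back with `iloc` is to build new DataFrames from the scaler output, which avoids dtype conflicts with the original integer or boolean columns. This is a sketch on toy data, not the author's code; the helper name `scale_frames` and the sample values are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_frames(train_features, test_features):
    """Fit a MinMaxScaler on the training features and apply it to both frames."""
    scaler = MinMaxScaler()
    train_scaled = pd.DataFrame(scaler.fit_transform(train_features),
                                columns=train_features.columns,
                                index=train_features.index)
    # The test set reuses the ranges learned from the training set.
    test_scaled = pd.DataFrame(scaler.transform(test_features),
                               columns=test_features.columns,
                               index=test_features.index)
    return train_scaled, test_scaled

# Toy frames (not the competition data).
toy_train = pd.DataFrame({'Age': [20, 30, 40], 'JobLevel': [1, 3, 5]})
toy_test = pd.DataFrame({'Age': [25, 50], 'JobLevel': [2, 5]})

scaled_train, scaled_test = scale_frames(toy_train, toy_test)
print(scaled_test)
```

Since the scaler is fitted on the training data only, test values outside the training range fall outside [0, 1], as the Age of 50 does in this toy example.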
Data analysis and modeling
X_data = df_train.drop('Label',axis=1)
Y_data = df_train['Label']
X_test = df_test
print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
X train shape: (1100, 48)
X test shape: (350, 48)
# Cross-validate several models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

models = {
    'LR': LogisticRegression(solver='liblinear', penalty='l2', C=1),
    'SVM': SVC(C=1, gamma='auto'),
    'DT': DecisionTreeClassifier(),
    'RF': RandomForestClassifier(n_estimators=100),
    'AdaBoost': AdaBoostClassifier(n_estimators=100),
    'GBDT': GradientBoostingClassifier(n_estimators=100),
    'XGB': xgb.XGBClassifier(max_depth=10, subsample=0.7, colsample_bytree=0.75, n_estimators=100),
    'LGB': lgb.LGBMClassifier(num_leaves=120, n_estimators=100)
}

# 10-fold cross-validation; print the per-fold scores and the mean accuracy of each model
for k, clf in models.items():
    print("the model is {}".format(k))
    scores = cross_val_score(clf, X_data, Y_data, cv=10)
    print(scores)
    print("Mean accuracy is {}".format(np.mean(scores)))
    print("-" * 100)
the model is LR
[0.91891892 0.85585586 0.90909091 0.87272727 0.89090909 0.88181818
 0.85454545 0.88181818 0.8440367 0.86238532]
Mean accuracy is 0.877210588403249
----------------------------------------------------------------------------------------------------
the model is SVM
[0.83783784 0.83783784 0.83636364 0.83636364 0.83636364 0.83636364
 0.83636364 0.83636364 0.8440367 0.8440367 ]
Mean accuracy is 0.8381930888352906
----------------------------------------------------------------------------------------------------
the model is DT
[0.81081081 0.82882883 0.80909091 0.79090909 0.82727273 0.79090909
 0.76363636 0.74545455 0.78899083 0.74311927]
Mean accuracy is 0.7899022458655487
----------------------------------------------------------------------------------------------------
the model is RF
[0.88288288 0.84684685 0.87272727 0.87272727 0.86363636 0.85454545
 0.86363636 0.85454545 0.87155963 0.87155963]
Mean accuracy is 0.8654667177602958
----------------------------------------------------------------------------------------------------
the model is AdaBoost
[0.90990991 0.81981982 0.83636364 0.83636364 0.89090909 0.86363636
 0.86363636 0.82727273 0.86238532 0.89908257]
Mean accuracy is 0.8609379437819804
----------------------------------------------------------------------------------------------------
the model is GBDT
[0.88288288 0.87387387 0.85454545 0.85454545 0.9 0.86363636
 0.84545455 0.85454545 0.86238532 0.83486239]
Mean accuracy is 0.8626731735906048
----------------------------------------------------------------------------------------------------
the model is XGB
[0.89189189 0.85585586 0.86363636 0.86363636 0.9 0.84545455
 0.84545455 0.86363636 0.83486239 0.86238532]
Mean accuracy is 0.8626813635987949
----------------------------------------------------------------------------------------------------
the model is LGB
[0.88288288 0.85585586 0.86363636 0.80909091 0.88181818 0.84545455
 0.84545455 0.86363636 0.8440367 0.85321101]
Mean accuracy is 0.8545077354251667
----------------------------------------------------------------------------------------------------
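The simple logistic regression model performs best, with a mean accuracy of about 0.877, so it is used directly to make the predictions and write the submission file.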
clf = LogisticRegression(solver='liblinear', penalty='l2', C=1)
clf.fit(X_data, Y_data)
result = clf.predict(X_test)
file = pd.DataFrame()
file['ID'] = test.ID
file['Label'] = result
file.to_csv('sub.csv',index=False)
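When reading cross-validated accuracies such as those above, it can help to compare them against a majority-class baseline, because accuracy alone can look high on imbalanced labels. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data; the helper name `majority_baseline_accuracy`, the 20% positive rate, and the random features are assumptions for illustration, since the label distribution is not shown in this write-up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

def majority_baseline_accuracy(X, y, cv=10):
    """Mean cross-validated accuracy of always predicting the most frequent class."""
    baseline = DummyClassifier(strategy='most_frequent')
    return float(np.mean(cross_val_score(baseline, X, y, cv=cv)))

# Synthetic data: 200 samples, 20% in the minority class (made-up numbers).
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y = np.array([1] * 40 + [0] * 160)

toy_baseline = majority_baseline_accuracy(toy_X, toy_y, cv=10)
print(toy_baseline)
```

A model whose fold scores sit at the baseline and barely vary, as the SVM scores above do, may be predicting a single class for every sample; checking its predictions directly would confirm or rule that out.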