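In this article, we build a machine learning model that predicts which customers are likely to churn, so that the customers at higher risk can be targeted and retained. We use a neural network for this. An artificial neural network (ANN) is a machine learning model inspired by how the human brain works. Recent successes of ANN models in image recognition, speech recognition, and robotics have demonstrated their predictive power and usefulness across industries. You have probably heard the term "deep learning": it refers to ANN models with a large number of layers between the input and output layers.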


As before, we use the Kaggle dataset WA_Fn-UseC_-Telco-Customer-Churn.csv, and we build the neural network with keras.

Load the packages

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_curve, auc
%matplotlib inline

Load the data

df = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head(3)

   customerID  gender  SeniorCitizen  Partner  Dependents  tenure  PhoneService  MultipleLines     InternetService  OnlineSecurity  ...  DeviceProtection  TechSupport  StreamingTV  StreamingMovies  Contract        PaperlessBilling  PaymentMethod     MonthlyCharges  TotalCharges  Churn
0  7590-VHVEG  Female  0              Yes      No          1       No            No phone service  DSL              No              ...  No                No           No           No               Month-to-month  Yes               Electronic check  29.85           29.85         No
1  5575-GNVDE  Male    0              No       No          34      Yes           No                DSL              Yes             ...  Yes               No           No           No               One year        No                Mailed check      56.95           1889.5        No
2  3668-QPYBK  Male    0              No       No          2       Yes           No                DSL              Yes             ...  No                No           No           No               Month-to-month  Yes               Mailed check      53.85           108.15        Yes

3 rows × 21 columns

df.shape
(7043, 21)

Data Analysis & Preparation

Encoding target var: Churn

df['Churn'] = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)
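
The lambda above sends every value other than the exact string 'Yes' to 0, including typos or stray whitespace. As a hedged alternative, the sketch below uses an explicit mapping on a made-up toy column (the malformed 'yes ' entry is invented for illustration and is not claimed to occur in the Telco data), so that unexpected labels surface as NaN instead of being silently counted as non-churners.

```python
import pandas as pd

# Toy stand-in for the Churn column; 'yes ' is a deliberately malformed value.
churn = pd.Series(['Yes', 'No', 'No', 'yes ', 'Yes'])

# The lambda from the article: anything that is not exactly 'Yes' becomes 0.
lambda_encoded = churn.apply(lambda x: 1 if x == 'Yes' else 0)

# An explicit mapping leaves unknown labels as NaN so they can be inspected.
map_encoded = churn.map({'Yes': 1, 'No': 0})
n_unexpected = int(map_encoded.isna().sum())

print(lambda_encoded.tolist())
print('unexpected labels:', n_unexpected)
```

If the count of unexpected labels is zero, the two encodings agree and the lambda version is safe to use.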

Handle Missing Values in TotalCharges

df['TotalCharges'] = df['TotalCharges'].replace(' ', np.nan).astype(float)
df = df.dropna()
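
The two lines above turn blank TotalCharges strings into NaN, cast the column to float, and drop the incomplete rows (the row count falls from 7043 to 7032 in the outputs shown here). A roughly equivalent sketch on a small invented frame uses pd.to_numeric with errors='coerce', which also converts any other unparseable string to NaN; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame: one blank string stands in for a missing TotalCharges entry.
toy = pd.DataFrame({'TotalCharges': ['29.85', ' ', '1889.5', '108.15']})

# errors='coerce' converts anything that cannot be parsed as a number to NaN.
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')
n_missing = int(toy['TotalCharges'].isna().sum())

# Drop only the rows where this particular column is missing.
cleaned = toy.dropna(subset=['TotalCharges'])

print('missing:', n_missing, 'rows kept:', len(cleaned))
```

Counting the NaN values before dropping them makes it easy to confirm how many rows are lost.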

Create Continuous Vars

df[['tenure', 'MonthlyCharges', 'TotalCharges']].describe()
       tenure       MonthlyCharges  TotalCharges
count  7032.000000  7032.000000     7032.000000
mean   32.421786    64.798208       2283.300441
std    24.545260    30.085974       2266.771362
min    1.000000     18.250000       18.800000
25%    9.000000     35.587500       401.450000
50%    29.000000    70.350000       1397.475000
75%    55.000000    89.862500       3794.737500
max    72.000000    118.750000      8684.800000

Normalize the variable

df['MonthlyCharges'] = np.log(df['MonthlyCharges'])
df['MonthlyCharges'] = (df['MonthlyCharges'] - df['MonthlyCharges'].mean())/df['MonthlyCharges'].std()

df['TotalCharges'] = np.log(df['TotalCharges'])
df['TotalCharges'] = (df['TotalCharges'] - df['TotalCharges'].mean())/df['TotalCharges'].std()

df['tenure'] = (df['tenure'] - df['tenure'].mean())/df['tenure'].std()

df[['tenure', 'MonthlyCharges', 'TotalCharges']].describe()
       tenure         MonthlyCharges  TotalCharges
count  7.032000e+03   7.032000e+03    7.032000e+03
mean   -1.028756e-16  4.688495e-14    7.150708e-15
std    1.000000e+00   1.000000e+00    1.000000e+00
min    -1.280157e+00  -1.882268e+00   -2.579056e+00
25%    -9.542285e-01  -7.583727e-01   -6.080585e-01
50%    -1.394072e-01  3.885103e-01    1.950521e-01
75%    9.198605e-01   8.004829e-01    8.382338e-01
max    1.612459e+00   1.269576e+00    1.371323e+00

continuous_vars = list(df.describe().columns)
['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']
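
The normalization above applies a log transform and then a z-score, using the mean and standard deviation of the full dataset. The sketch below wraps the same two steps in a hypothetical helper, log_standardize (not part of the original code), and shows one possible variation: computing the statistics on a training sample and reusing them on held-out values. The charge values are invented for illustration.

```python
import numpy as np

def log_standardize(values, mean=None, std=None):
    """Log-transform, then z-score. Returns (scaled, mean, std).

    When mean and std are omitted they are estimated from `values`;
    pass them explicitly to reuse statistics from another sample.
    """
    logged = np.log(np.asarray(values, dtype=float))
    if mean is None:
        mean = logged.mean()
    if std is None:
        std = logged.std(ddof=1)  # ddof=1 matches pandas' Series.std()
    return (logged - mean) / std, mean, std

# Illustrative monthly charges, not taken from the dataset.
train_charges = [18.25, 35.59, 70.35, 89.86, 118.75]
test_charges = [29.85, 56.95]

train_scaled, mu, sigma = log_standardize(train_charges)
test_scaled, _, _ = log_standardize(test_charges, mean=mu, std=sigma)

print(train_scaled.round(4))
print(test_scaled.round(4))
```

Reusing the training statistics keeps information from the test rows out of the preprocessing step. The article normalizes before splitting, which is simpler and usually makes little difference on a dataset of this size.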

One-Hot Encoding

for col in list(df.columns):
    print(col, df[col].nunique())

customerID 7032
gender 2
SeniorCitizen 2
Partner 2
Dependents 2
tenure 72
PhoneService 2
MultipleLines 3
InternetService 3
OnlineSecurity 3
OnlineBackup 3
DeviceProtection 3
TechSupport 3
StreamingTV 3
StreamingMovies 3
Contract 3
PaperlessBilling 2
PaymentMethod 4
MonthlyCharges 1584
TotalCharges 6530
Churn 2

df.groupby('gender').count()['customerID'].plot(kind='bar', color='skyblue', grid=True, figsize=(8,6), title='Gender')
plt.show()

df.groupby('InternetService').count()['customerID'].plot(kind='bar', color='skyblue', grid=True, figsize=(8,6), title='Internet Service')
plt.show()

df.groupby('PaymentMethod').count()['customerID'].plot(kind='bar', color='skyblue', grid=True, figsize=(8,6), title='Payment Method')
plt.show()

dummy_cols = []

sample_set = df[['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']].copy(deep=True)

for col in list(df.columns):
    # One-hot encode every remaining column with fewer than 5 distinct values
    if col not in ['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn'] and df[col].nunique() < 5:
        # dtype=int keeps the dummies as 0/1; pandas 2.0 and later default to booleans
        dummy_vars = pd.get_dummies(df[col], dtype=int)
        dummy_vars.columns = [col+str(x) for x in dummy_vars.columns]
        sample_set = pd.concat([sample_set, dummy_vars], axis=1)

sample_set.head()

   tenure     MonthlyCharges  TotalCharges  Churn  genderFemale  genderMale  SeniorCitizen0  SeniorCitizen1  PartnerNo  PartnerYes  ...  StreamingMoviesYes  ContractMonth-to-month  ContractOne year  ContractTwo year  PaperlessBillingNo  PaperlessBillingYes  PaymentMethodBank transfer (automatic)  PaymentMethodCredit card (automatic)  PaymentMethodElectronic check  PaymentMethodMailed check
0  -1.280157  -1.054244       -2.281382     0      1             0           1               0               0          1           ...  0                   1                       0                 0                 0                   1                    0                                       0                                     1                              0
1  0.064298   0.032896        0.389269      0      0             1           1               0               1          0           ...  0                   0                       1                 0                 1                   0                    0                                       0                                     0                              1
2  -1.239416  -0.061298       -1.452520     1      0             1           1               0               1          0           ...  0                   1                       0                 0                 0                   1                    0                                       0                                     0                              1
3  0.512450   -0.467578       0.372439      0      0             1           1               0               1          0           ...  0                   0                       1                 0                 1                   0                    1                                       0                                     0                              0
4  -1.239416  0.396862        -1.234860     1      1             0           1               0               1          0           ...  0                   1                       0                 0                 0                   1                    0                                       0                                     1                              0

5 rows × 47 columns

list(sample_set.columns)
['tenure','MonthlyCharges','TotalCharges','Churn','genderFemale','genderMale','SeniorCitizen0','SeniorCitizen1',
 'PartnerNo','PartnerYes','DependentsNo','DependentsYes','PhoneServiceNo','PhoneServiceYes',
 'MultipleLinesNo','MultipleLinesNo phone service','MultipleLinesYes',
 'InternetServiceDSL','InternetServiceFiber optic','InternetServiceNo',
 'OnlineSecurityNo','OnlineSecurityNo internet service','OnlineSecurityYes',
 'OnlineBackupNo','OnlineBackupNo internet service','OnlineBackupYes',
 'DeviceProtectionNo','DeviceProtectionNo internet service','DeviceProtectionYes',
 'TechSupportNo','TechSupportNo internet service','TechSupportYes',
 'StreamingTVNo','StreamingTVNo internet service','StreamingTVYes',
 'StreamingMoviesNo','StreamingMoviesNo internet service','StreamingMoviesYes',
 'ContractMonth-to-month','ContractOne year','ContractTwo year',
 'PaperlessBillingNo','PaperlessBillingYes',
 'PaymentMethodBank transfer (automatic)','PaymentMethodCredit card (automatic)',
 'PaymentMethodElectronic check','PaymentMethodMailed check']
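
The encoding loop creates one 0/1 column per category level and names it by concatenating the column name and the level. As a sketch of the same idea on a small invented frame, pd.get_dummies can encode several columns in one call; prefix_sep='' reproduces the naming used here, and drop_first=True is an optional variation (not used in the article) that removes one redundant level per variable.

```python
import pandas as pd

# Toy frame with two categorical columns; the values are illustrative.
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year', 'One year'],
    'Partner': ['Yes', 'No', 'No', 'Yes'],
})

# One column per level, named <column><level> as in the article's loop.
full = pd.get_dummies(toy, columns=['Contract', 'Partner'],
                      prefix_sep='', dtype=int)

# Variation: drop the first level of each variable to avoid redundancy.
reduced = pd.get_dummies(toy, columns=['Contract', 'Partner'],
                         prefix_sep='', dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Keeping every level, as the article does, is harmless for a neural network. Dropping one level matters more for linear models, where the redundant column makes the coefficients non-unique.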

Train & Test Sets

target_var = 'Churn'
features = [x for x in list(sample_set.columns) if x != target_var]
model = Sequential()
model.add(Dense(16, input_dim=len(features), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))


model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
X_train, X_test, y_train, y_test = train_test_split(sample_set[features], sample_set[target_var], test_size=0.3)
model.fit(X_train, y_train, epochs=50, batch_size=100)
Epoch 1/50
4922/4922 [==============================] - 0s 73us/step - loss: 0.6871 - accuracy: 0.5638
Epoch 2/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.5409 - accuracy: 0.7314
Epoch 3/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.5034 - accuracy: 0.7322
Epoch 4/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4717 - accuracy: 0.7452
Epoch 5/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4404 - accuracy: 0.7926
Epoch 6/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4225 - accuracy: 0.8037
Epoch 7/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4150 - accuracy: 0.8066
Epoch 8/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4113 - accuracy: 0.8070
Epoch 9/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.4083 - accuracy: 0.8098
Epoch 10/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4063 - accuracy: 0.8090
Epoch 11/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4052 - accuracy: 0.8111
Epoch 12/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4037 - accuracy: 0.8090
Epoch 13/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4030 - accuracy: 0.8119
Epoch 14/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4021 - accuracy: 0.8127
Epoch 15/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4014 - accuracy: 0.8108
Epoch 16/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4009 - accuracy: 0.8104
Epoch 17/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4003 - accuracy: 0.8125
Epoch 18/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4002 - accuracy: 0.8147
Epoch 19/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3987 - accuracy: 0.8133
Epoch 20/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3982 - accuracy: 0.8139
Epoch 21/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3979 - accuracy: 0.8155
Epoch 22/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3976 - accuracy: 0.8137
Epoch 23/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3974 - accuracy: 0.8139
Epoch 24/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3971 - accuracy: 0.8129
Epoch 25/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3969 - accuracy: 0.8143
Epoch 26/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3970 - accuracy: 0.8135
Epoch 27/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3963 - accuracy: 0.8123
Epoch 28/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3959 - accuracy: 0.8141
Epoch 29/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3952 - accuracy: 0.8149
Epoch 30/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3948 - accuracy: 0.8153
Epoch 31/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3954 - accuracy: 0.8153
Epoch 32/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3948 - accuracy: 0.8163
Epoch 33/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3944 - accuracy: 0.8159
Epoch 34/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3940 - accuracy: 0.8169
Epoch 35/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3941 - accuracy: 0.8178
Epoch 36/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3938 - accuracy: 0.8161
Epoch 37/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3936 - accuracy: 0.8151
Epoch 38/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3929 - accuracy: 0.8147
Epoch 39/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3927 - accuracy: 0.8169
Epoch 40/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3930 - accuracy: 0.8155
Epoch 41/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3922 - accuracy: 0.8169
Epoch 42/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3925 - accuracy: 0.8178
Epoch 43/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3921 - accuracy: 0.8155
Epoch 44/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3915 - accuracy: 0.8182
Epoch 45/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3911 - accuracy: 0.8163
Epoch 46/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3912 - accuracy: 0.8159
Epoch 47/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3909 - accuracy: 0.8178
Epoch 48/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3909 - accuracy: 0.8169
Epoch 49/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3910 - accuracy: 0.8174
Epoch 50/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3901 - accuracy: 0.8190

<keras.callbacks.callbacks.History at 0x7f9437d2e990>
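
To make the 16-8-1 architecture concrete, the following NumPy sketch counts the trainable parameters and runs one forward pass. It assumes the 46 input features left after removing Churn from the 47 columns of sample_set. The weights are random placeholders, not the trained Keras weights, so the probabilities it prints are meaningless as predictions.

```python
import numpy as np

n_features = 46  # 47 columns in sample_set minus the Churn target
layer_sizes = [n_features, 16, 8, 1]
layer_pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))

# Each Dense layer has an (inputs x units) weight matrix plus one bias per unit.
n_params = sum(n_in * n_out + n_out for n_in, n_out in layer_pairs)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=pair) for pair in layer_pairs]
biases = [np.zeros(n_out) for _, n_out in layer_pairs]

def forward(x):
    """ReLU on the two hidden layers, sigmoid on the single output unit."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ w + b)
    return sigmoid(h @ weights[-1] + biases[-1])

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

batch = rng.normal(size=(5, n_features))  # five fake customers
probs = forward(batch)

print('trainable parameters:', n_params)
print('output shape:', probs.shape)
```

A model that always predicts 0.5 has a binary cross-entropy of ln 2, about 0.693, which is close to the loss of 0.6871 reported in the first epoch.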

Accuracy, Precision, Recall

in_sample_preds = [round(x[0]) for x in model.predict(X_train)]
out_sample_preds = [round(x[0]) for x in model.predict(X_test)]

print('In-Sample Accuracy: %0.4f' % accuracy_score(y_train, in_sample_preds))
print('Out-of-Sample Accuracy: %0.4f' % accuracy_score(y_test, out_sample_preds))

print('\n')

print('In-Sample Precision: %0.4f' % precision_score(y_train, in_sample_preds))
print('Out-of-Sample Precision: %0.4f' % precision_score(y_test, out_sample_preds))

print('\n')

print('In-Sample Recall: %0.4f' % recall_score(y_train, in_sample_preds))
print('Out-of-Sample Recall: %0.4f' % recall_score(y_test, out_sample_preds))

In-Sample Accuracy: 0.8171
Out-of-Sample Accuracy: 0.7991

In-Sample Precision: 0.6946
Out-of-Sample Precision: 0.6440

In-Sample Recall: 0.5660
Out-of-Sample Recall: 0.5154
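
Rounding the predicted probability amounts to a decision threshold of 0.5, and at that threshold the out-of-sample recall of 0.5154 means roughly half of the actual churners are flagged. The pure-Python sketch below computes accuracy, precision, and recall from confusion-matrix counts on made-up labels and scores, and shows how lowering the threshold trades precision for recall. The data and the helper names are illustrative, not from the article.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, and recall after thresholding the scores."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up churn labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3, 0.35]

default_metrics = classification_metrics(y_true, y_prob, threshold=0.5)
lowered_metrics = classification_metrics(y_true, y_prob, threshold=0.3)

print('threshold 0.5:', default_metrics)
print('threshold 0.3:', lowered_metrics)
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall from 0.5 to 1.0 while precision stays at 2/3. On real data the extra recall usually costs some precision, which is worth weighing when the goal is to reach as many at-risk customers as possible.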


in_sample_preds = [x[0] for x in model.predict(X_train)]
out_sample_preds = [x[0] for x in model.predict(X_test)]
in_sample_fpr, in_sample_tpr, in_sample_thresholds = roc_curve(y_train, in_sample_preds)
out_sample_fpr, out_sample_tpr, out_sample_thresholds = roc_curve(y_test, out_sample_preds)
in_sample_roc_auc = auc(in_sample_fpr, in_sample_tpr)
out_sample_roc_auc = auc(out_sample_fpr, out_sample_tpr)

print('In-Sample AUC: %0.4f' % in_sample_roc_auc)
print('Out-Sample AUC: %0.4f' % out_sample_roc_auc)
In-Sample AUC: 0.8691
Out-Sample AUC: 0.8314
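
One way to read these AUC values: the area under the ROC curve equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as one half. The sketch below computes that quantity directly in pure Python on made-up scores. The helper pairwise_auc is hypothetical and compares every positive-negative pair, so it is only meant for small illustrations.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up churn labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3, 0.35]

toy_auc = pairwise_auc(y_true, scores)
print('pairwise AUC: %0.4f' % toy_auc)
```

Under this reading, the out-of-sample AUC of 0.8314 says the model ranks a churner above a non-churner in about 83% of such pairs, regardless of the decision threshold.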

plt.figure(figsize=(10,7))

# Label each curve with its own AUC
plt.plot(out_sample_fpr, out_sample_tpr, color='darkorange', label='Out-Sample ROC curve (area = %0.4f)' % out_sample_roc_auc)
plt.plot(in_sample_fpr, in_sample_tpr, color='navy', label='In-Sample ROC curve (area = %0.4f)' % in_sample_roc_auc)
# Diagonal reference line for a random classifier
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")

plt.show()



