In this homework, we use logistic regression to predict the probability of default using income and balance on the Default data set. We will also estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.


  1. (a) Fit a multiple logistic regression model that uses income and balance to predict the probability of default, using only the observations


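import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import warnings

# Load the Default data set. The original post does not show this step;
# the file name and index column below are assumptions.
test = pd.read_csv('Default.csv', index_col=0)
test.head()

  default student      balance        income
1      No      No   729.526495  44361.625074
2      No     Yes   817.180407  12106.134700
3      No      No  1073.549164  31767.138947
4      No      No   529.250605  35704.493935
5      No      No   785.655883  38463.495879

# Encode 'No' as 0 and anything else ('Yes') as 1
def fun(x):
    if 'No' in x:
        return 0
    else:
        return 1

test['default'] = test.apply(lambda x: fun(x['default']), axis=1)

# Predictors and response (not shown in the original post; inferred from the task)
X = test[['income', 'balance']]
y = test['default']

# Build a LogisticRegression model (default parameters are fine) and fit it
model = LogisticRegression()
model.fit(X, y)

# Classify as default when the posterior probability exceeds 0.5
a = model.predict_proba(X)
result = []
for i in range(len(a)):
    if a[i][1] > 0.5:
        result.append(1)
    else:
        result.append(0)

# Print the error rate, here on the same observations used to fit the model.
# Weighted recall equals accuracy, so 1 - recall is the misclassification rate.
print('Error: %.4f' % (1 - metrics.recall_score(y, result, average='weighted')))
Error: 0.0336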
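The error above is computed as one minus the weighted recall, which may look unusual. A minimal sketch, using made-up toy labels rather than the Default data, suggests why this matches the ordinary misclassification rate: weighting each class's recall by its share of the samples gives back the overall accuracy.

```python
import numpy as np
from sklearn import metrics

# Toy labels, invented purely for illustration (not from the Default data)
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])

# Three ways to compute the same error rate
err_recall = 1 - metrics.recall_score(y_true, y_pred, average='weighted')
err_accuracy = 1 - metrics.accuracy_score(y_true, y_pred)
err_manual = np.mean(y_true != y_pred)  # fraction of misclassified labels

print(err_recall, err_accuracy, err_manual)
```

Using `accuracy_score` or the manual fraction is arguably easier to read, but the numbers should agree.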

(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:


i. Split the sample set into a training set and a validation set.


from sklearn.model_selection import train_test_split
# Use train_test_split to hold out 20% of the data; the other 80% is the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2020)
# Split the held-out 20% again: 90% of it becomes the validation set, 10% the test set
X_validation, X_test, y_validation, y_test = train_test_split(X_test, y_test, test_size=0.1, random_state=2020)
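To make the effect of the two-stage split concrete, here is a small sketch that applies the same two calls to placeholder arrays with 10,000 rows (the size of the Default data set) and reports how many rows land in each part. The arrays `X_demo` and `y_demo` are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 10000                              # number of rows in the Default data set
X_demo = np.arange(n).reshape(-1, 1)   # placeholder features
y_demo = np.zeros(n, dtype=int)        # placeholder labels

# First split: 80% training, 20% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2020)
# Second split of the held-out part: 90% validation, 10% test
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.1, random_state=2020)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)
```

So the validation set is about 18% of the data rather than the full 20%, and the final 2% is never used in this exercise.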

ii. Fit a multiple logistic regression model using only the training observations.


from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# Fit on the training observations only
model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
                   penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

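iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.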


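# Posterior probabilities for the validation set
a = model.predict_proba(X_validation)
result = []
for i in range(len(a)):
    # Column 1 holds P(default = 1); classify as default above 0.5
    if a[i][1] > 0.5:
        result.append(1)
    else:
        result.append(0)
print('Error: %.4f' % (1 - metrics.recall_score(y_validation, result, average='weighted')))
Error: 0.0361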
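The thresholding loop can also be written as a single NumPy comparison. The sketch below uses synthetic data from `make_classification` as a stand-in for the Default data (an assumption made only so the example runs on its own) and checks that the loop and the vectorized form agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-feature data, standing in for income and balance
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
proba = model.predict_proba(X_val)   # column 1 is P(class = 1)

# Loop form, as in the text
loop_result = [1 if p[1] > 0.5 else 0 for p in proba]
# Vectorized form: compare the whole column at once
vec_result = (proba[:, 1] > 0.5).astype(int)

# Validation error: fraction of misclassified validation observations
val_error = np.mean(vec_result != y_val)
print('Error: %.4f' % val_error)
```

The vectorized version avoids the explicit `result` list and makes the 0.5 threshold easy to change.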

(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Note: random_state is fixed at 2020 in every iteration, so the three splits
# are identical and the three errors below coincide. Varying the seed per
# iteration (for example random_state=2020 + i) would give different splits.
for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2020)
    X_validation, X_test, y_validation, y_test = train_test_split(X_test, y_test, test_size=0.1, random_state=2020)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    a = model.predict_proba(X_validation)
    result = []
    for j in range(len(a)):
        if a[j][1] > 0.5:
            result.append(1)
        else:
            result.append(0)
    print('Error: %.4f' % (1 - metrics.recall_score(y_validation, result, average='weighted')))
Error: 0.0361
Error: 0.0361
Error: 0.0361
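The three error values above are identical because every iteration reuses `random_state=2020`, so the same split is drawn each time. A minimal sketch of how the seed could be varied is given below. It runs on synthetic, imbalanced data from `make_classification` (an assumption, used only to keep the example self-contained), so its numbers say nothing about the Default data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a rare positive class
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=0)

errors = []
for seed in (1, 2, 3):
    # A different seed per iteration gives a different split
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    pred = (model.predict_proba(X_val)[:, 1] > 0.5).astype(int)
    errors.append(np.mean(pred != y_val))

print(['%.4f' % e for e in errors])
```

With distinct splits the validation errors generally differ a little from run to run, which is the variability the exercise asks you to comment on.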

(d) Now consider a logistic regression model that predicts the probability of default using
income, balance, and a dummy variable for student. Estimate the test error for this model using
the validation set approach. Comment on whether or not including a dummy variable for student
leads to a reduction in the test error rate.


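from sklearn import metrics

# Encode the student column as a 0/1 dummy variable
def fun(x):
    if 'No' in x:
        return 0
    else:
        return 1

test['student'] = test.apply(lambda x: fun(x['student']), axis=1)

# Add the student dummy to the predictors (this step is not shown in the original post)
X = test[['income', 'balance', 'student']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2020)
X_validation, X_test, y_validation, y_test = train_test_split(X_test, y_test, test_size=0.1, random_state=2020)
model = LogisticRegression()
model.fit(X_train, y_train)
a = model.predict_proba(X_validation)
result = []
for i in range(len(a)):
    if a[i][1] > 0.5:
        result.append(1)
    else:
        result.append(0)
print('Error: %.4f' % (1 - metrics.recall_score(y_validation, result, average='weighted')))
Error: 0.0361

The reported validation error, 0.0361, is identical to the one obtained in (b), so on this split including the student dummy variable does not reduce the error rate.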
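The dummy encoding can also be done without a helper function. The sketch below rebuilds the five rows shown in the data preview (the `default` column is omitted) and converts `student` with a boolean comparison; the DataFrame name `df` is hypothetical.

```python
import pandas as pd

# The first five rows of the data preview, reconstructed by hand
df = pd.DataFrame({
    'student': ['No', 'Yes', 'No', 'No', 'No'],
    'balance': [729.526495, 817.180407, 1073.549164, 529.250605, 785.655883],
    'income': [44361.625074, 12106.134700, 31767.138947, 35704.493935, 38463.495879],
})

# True/False from the comparison, then cast to 1/0
df['student'] = (df['student'] == 'Yes').astype(int)
print(df['student'].tolist())  # [0, 1, 0, 0, 0]
```

Comparing against 'Yes' directly is stricter than testing whether 'No' appears in the string, which may help if the column ever contains unexpected values.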



