1. Multi-class classification

This exercise uses logistic regression and neural networks to recognize handwritten digits (0 through 9). The first part uses logistic regression for one-vs-all classification.

1.1 Dataset

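The dataset ex3data1.mat contains 5000 training examples of handwritten digits. Each example is a 20 * 20 pixel grayscale image. Every pixel is stored as a floating-point number giving the grayscale intensity at that position, and the 20 * 20 grid is unrolled into a 400-dimensional vector, so each row of the matrix X is one training example.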
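As a small illustration of the layout described above, the sketch below unrolls synthetic 20 * 20 images into 400-dimensional rows and stacks them into a matrix. The names `images` and `X_demo` and the random pixel values are made up for this example; they are not taken from ex3data1.mat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five synthetic 20 x 20 "images" with random grayscale values
# (assumption: stand-ins for real digit images)
images = rng.random((5, 20, 20))

# Unroll each image into a 400-dimensional row; each row is one example
X_demo = images.reshape(5, 400)

# Reshaping a row back to 20 x 20 recovers the original pixel grid
restored = X_demo[2].reshape(20, 20)
print(X_demo.shape)
```

The same reshape is what the plotting code relies on to turn a row of X back into an image.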


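import numpy as np
import scipy.io as scio

# Load the training set; the .mat file holds the matrix X and the label vector y
data = scio.loadmat('E:/2018/ML/work/machine-learning-ex3/ex3/ex3data1.mat')
data1 = data.get('X')   # one unrolled image per row
label = data.get('y')   # one label per row of X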

1.2 Visualizing the data


import random
import matplotlib.pyplot as plt

def plot_an_image(x):
    # random.randint includes both endpoints, so the upper bound is the last valid row index
    pick_one = random.randint(0, x.shape[0] - 1)
    image = x[pick_one, :]
    fig, ax = plt.subplots(figsize=(1, 1))
    ax.matshow(image.reshape((20, 20)), cmap='gray_r')
    print('this should be {}'.format(label[pick_one]))
    plt.show()



def plot_100_image(x):
    # Pick 100 row indices at random (np.random.choice samples with replacement by default)
    sample_idx = np.random.choice(np.arange(x.shape[0]), 100)
    sample_images = x[sample_idx, :]
    fig, ax_array = plt.subplots(nrows=10, ncols=10, sharey=True, sharex=True, figsize=(8, 8))
    for row in range(10):
        for col in range(10):
            ax_array[row, col].matshow(sample_images[10 * row + col].reshape((20, 20)), cmap='gray_r')
    plt.show()





1.3 Vectorizing regularized Logistic Regression

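A one-vs-all logistic regression model is used to build the multi-class classifier. Since there are 10 classes, 10 separate logistic regression classifiers are needed. To make training efficient, the code is vectorized rather than written with loops.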

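The regularized cost function is defined as:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h_\theta(x^{(i)}) - (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Note that the bias term, theta0, is not included in the regularization term.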


# Define the sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def costFunction(theta, x, y, Lambda):
    m = np.shape(x)[0]
    thetaReg = theta[1:]   # skip theta0: the bias term is not regularized
    y = np.ravel(y)        # flatten so labels of shape (m,) or (m, 1) both work
    hypothesis = sigmoid(np.dot(x, theta))
    cost = y * np.log(hypothesis) + (1 - y) * np.log(1 - hypothesis)
    reg = np.sum(thetaReg * thetaReg) * Lambda / (2 * m)
    costAll = np.mean(-cost) + reg
    return costAll

Taking the partial derivatives of the regularized logistic regression cost function gives the gradient:
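For reference, a standard way to write these partial derivatives is given here, with $h_\theta(x) = 1/(1+e^{-\theta^T x})$ and theta0 left unregularized as noted earlier. The index conventions ($i$ over the $m$ examples, $j$ over the parameters) are assumptions made for the notation.

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad (j \ge 1)$$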

def Gradient(theta, x, y, Lambda):
    m = np.shape(x)[0]
    thetaReg = theta[1:]
    y = np.ravel(y)        # flatten so labels of shape (m,) or (m, 1) both work
    hypothesis = sigmoid(np.dot(x, theta))
    loss = hypothesis - y
    cost_1 = np.dot(loss, x) / m
    # Prepend 0 so the bias term theta0 receives no regularization gradient
    reg = np.concatenate([np.array([0]), (Lambda / m) * thetaReg])
    gradient = cost_1 + reg
    return gradient
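One way to gain confidence in a hand-written gradient is to compare it against a finite-difference approximation of the cost. The sketch below does this on small synthetic data. The helper `numerical_grad`, the compact `cost` and `grad` functions, and the random inputs are all assumptions introduced for this check; they mirror the formulas above but are not the author's code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, x, y, lam):
    # Regularized logistic regression cost; theta[0] is not regularized
    m = x.shape[0]
    h = sigmoid(x @ theta)
    data_term = np.mean(-(y * np.log(h) + (1 - y) * np.log(1 - h)))
    return data_term + lam / (2 * m) * np.sum(theta[1:] ** 2)

def grad(theta, x, y, lam):
    # Analytic gradient of the cost above
    m = x.shape[0]
    h = sigmoid(x @ theta)
    g = (h - y) @ x / m
    g[1:] += lam / m * theta[1:]
    return g

def numerical_grad(f, theta, eps=1e-6):
    # Central finite differences, one coordinate at a time (hypothetical helper)
    num = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        num[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return num

rng = np.random.default_rng(0)
x = np.insert(rng.normal(size=(20, 3)), 0, 1, axis=1)   # 20 examples, bias column first
y = rng.integers(0, 2, size=20).astype(float)           # random binary labels
theta = rng.normal(size=4)

analytic = grad(theta, x, y, 1.0)
numeric = numerical_grad(lambda t: cost(t, x, y, 1.0), theta)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

If the two gradients disagree by much more than the finite-difference error, the usual suspects are a regularized bias term or a missing 1/m factor.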

1.4 One-vs-all Classification

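The handwritten digit dataset has 10 classes, and the code must be able to recognize any of them. It returns the parameters of all classifiers as a K * (N + 1) matrix, here 10 * 401, where each row holds the parameters of one classifier.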

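When training classifier k (k = 1, ..., K), the labels are transformed first: examples of class k are marked as the positive class (y = 1) and all other examples as the negative class (y = 0). minimize() is then used to fit the parameters of the k-th classifier. Looping over k = 1, ..., K yields the full parameter matrix.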
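The label transformation described above can be sketched on a toy label vector. The values in `y_demo` and the choice K = 3 are assumptions made to keep the output small.

```python
import numpy as np

# Toy labels using the 1..K convention (assumption: K = 3 for brevity)
y_demo = np.array([1, 3, 2, 3, 1])
K = 3

# Row k - 1 holds the binary targets for classifier k:
# 1 where the label equals k, 0 everywhere else
binary_targets = np.array([(y_demo == k).astype(int) for k in range(1, K + 1)])
print(binary_targets)
```

Each column of `binary_targets` contains exactly one 1, because every example belongs to exactly one class.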

from scipy.optimize import minimize

def oneVsAll(x, y, Lambda, k):
    all_theta = np.zeros((k, np.shape(x)[1]))
    for i in range(k):
        theta = np.zeros(np.shape(x)[1])
        # Binary targets for classifier i + 1: 1 where the label equals i + 1, else 0
        y_i = (np.ravel(y) == i + 1).astype(int)
        ret = minimize(fun=costFunction, x0=theta, args=(x, y_i, Lambda), method='TNC', jac=Gradient, options={'disp': True})
        all_theta[i, :] = ret.x
    return all_theta


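def predictOneVsAll(x, all_theta):
    thetaT = np.transpose(all_theta)
    probMat = sigmoid(np.dot(x, thetaT))   # one column of probabilities per classifier
    maxProb = np.argmax(probMat, axis=1)   # 0-based index of the most probable class
    label = maxProb + 1                    # labels run from 1 to K
    return label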
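To make the prediction step concrete, here is a minimal sketch on a hand-written probability matrix. The numbers in `prob_demo` are hypothetical and only show how the 0-based argmax index is shifted to the 1-based labels.

```python
import numpy as np

# Hypothetical class probabilities for 3 examples and 4 classes
prob_demo = np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.05, 0.05, 0.1, 0.8],
])

# argmax gives a 0-based column index; adding 1 maps it to labels 1..K
pred_demo = np.argmax(prob_demo, axis=1) + 1
print(pred_demo)  # [2 1 4]
```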

The loaded data, data1, needs a column of ones added for the bias term:

dataFix = np.insert(data1, 0, 1, axis=1)
thetaAll = oneVsAll(dataFix, label, 1, 10)
pred = predictOneVsAll(dataFix, thetaAll)
accuracy = np.mean(pred == np.ravel(label))
print('accuracy = {0}%'.format(accuracy * 100))


2. Neural Networks

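The previous part used multi-class logistic regression to recognize handwritten digits, but logistic regression is only a linear classifier and cannot form more complex hypotheses. A neural network can represent more complex, non-linear models. This part runs the feedforward propagation algorithm using weights that have already been trained.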

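The network has an input layer, a hidden layer, and an output layer. The input layer has 400 input units plus a bias unit, the hidden layer has 25 units plus a bias unit, and the output layer has 10 output units (one for each of the 10 digit classes).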
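The layer sizes above fix the shapes of the two weight matrices: 25 * 401 for the hidden layer and 10 * 26 for the output layer. The sketch below checks these shapes with a feedforward pass over synthetic inputs and random weights. The names `x_demo`, `theta1_demo`, and `theta2_demo` and all values are assumptions for illustration, not the trained weights.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): 5 examples and random weights
# with the layer sizes described in the text
x_demo = rng.random((5, 400))
theta1_demo = rng.normal(scale=0.1, size=(25, 401))   # 25 hidden units, 400 inputs + bias
theta2_demo = rng.normal(scale=0.1, size=(10, 26))    # 10 outputs, 25 hidden units + bias

a1 = np.insert(x_demo, 0, 1, axis=1)    # (5, 401): inputs with bias column
a2 = sigmoid(a1 @ theta1_demo.T)        # (5, 25): hidden activations
a2 = np.insert(a2, 0, 1, axis=1)        # (5, 26): hidden activations with bias column
a3 = sigmoid(a2 @ theta2_demo.T)        # (5, 10): one score per class
print(a1.shape, a2.shape, a3.shape)
```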

parameters = scio.loadmat('E:/2018/ML/work/machine-learning-ex3/ex3/ex3weights.mat')
theta1 = parameters.get('Theta1')
theta2 = parameters.get('Theta2')

# Input layer: add the bias unit to every example
data_Neur = np.insert(data1, 0, 1, axis=1)
# Hidden layer activations
hidden_layer = sigmoid(np.dot(data_Neur, theta1.T))
# Add the hidden layer's bias unit (z_2 holds activations plus bias, not pre-activation values)
z_2 = np.insert(hidden_layer, 0, 1, axis=1)
output_layer = sigmoid(np.dot(z_2, theta2.T))
# argmax is 0-based while the labels are 1-based
max_prob = np.argmax(output_layer, axis=1)
out = max_prob + 1
accuracy = np.mean(out == np.ravel(label))
print('accuracy = {0}%'.format(accuracy * 100))


