Introduction to the MNIST Dataset

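The MNIST dataset comes from the National Institute of Standards and Technology (NIST) in the United States. The training set consists of digits handwritten by 250 different people, 50% of whom were high school students and 50% of whom were employees of the Census Bureau. The test set contains handwritten digits in the same proportion, with the guarantee that the writers of the test set and the writers of the training set are disjoint.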

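The MNIST dataset contains 70,000 images in total: 60,000 in the training set and 10,000 in the test set. Each image is a $28\times 28$ picture of a handwritten digit from $0$ to $9$. Every image is white on black: the black background is represented by 0, and the white strokes are represented by floating-point numbers between 0 and 1, where a value closer to 1 means a whiter pixel.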
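To make the 0-to-1 convention concrete, here is a minimal sketch. It assumes the raw pixels are stored as 8-bit integers in the range 0 to 255 and are scaled by 1/255 on loading; the tiny `raw` array is made up for illustration and is not real MNIST data.

```python
import numpy as np

# Hypothetical 2x3 patch of raw 8-bit pixel values (not real MNIST data).
raw = np.array([[0, 88, 255],
                [0, 186, 79]], dtype=np.uint8)

# Scale to floats in [0, 1]: 0 stays black, 255 becomes 1.0 (pure white).
scaled = raw.astype(np.float32) / 255.0

print(scaled.min(), scaled.max())
```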

Flattening the $28\times 28$ image matrix into a $1\times 784$ vector does not change how the data is understood:

$[0,0,0,0.345,0.728,0.310,0.402,0,0,0,\cdots,0,0,0]$
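The flattening step can be sketched with NumPy as follows. The `image` array is a stand-in filled by hand (the four nonzero values are copied from the example vector), so treat it as an illustration rather than an actual sample.

```python
import numpy as np

# Stand-in for one image: a 28x28 matrix of zeros with a few bright pixels.
image = np.zeros((28, 28), dtype=np.float32)
image[0, 3:7] = [0.345, 0.728, 0.310, 0.402]

# Row-major flattening: row 0 comes first, then row 1, and so on.
vector = image.reshape(1, 784)

# The operation is lossless: reshaping back recovers the original matrix.
restored = vector.reshape(28, 28)

print(vector.shape)
print(np.array_equal(image, restored))
```

Because no information is lost, the flattened form is only a change of layout; the spatial neighborhood of each pixel is simply no longer explicit.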

The label of each image is given as a one-dimensional array in one-hot encoding:

$[0,0,0,0,0,1,0,0,0,0]$

Each element represents the probability that the image shows the corresponding digit; this label vector clearly represents the digit $5$.
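As a small illustration of the encoding, the sketch below builds the one-hot vector for the digit 5 and decodes it again. The helper name `to_one_hot` is hypothetical and introduced only for this example.

```python
import numpy as np

def to_one_hot(digit, num_classes=10):
    """Return a length-`num_classes` vector with a 1 at index `digit`."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[digit] = 1.0
    return vec

label = to_one_hot(5)
print(label)

# Decoding: the index of the largest entry is the digit.
decoded = int(np.argmax(label))
print(decoded)
```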

The MNIST dataset can be downloaded from http://yann.lecun.com/exdb/mnist/ and consists of 4 parts:

Training set images: train-images-idx3-ubyte.gz (9.45 MB, 60,000 samples).
Training set labels: train-labels-idx1-ubyte.gz (28.2 KB, 60,000 labels).
Test set images: t10k-images-idx3-ubyte.gz (1.57 MB, 10,000 samples).
Test set labels: t10k-labels-idx1-ubyte.gz (4.43 KB, labels for 10,000 samples).
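For readers who want to open these archives directly, here is a minimal sketch of a reader. It assumes the IDX layout used by these files: big-endian 32-bit header fields, with magic number 2051 for image files and 2049 for label files, followed by unsigned bytes. The function names are hypothetical.

```python
import gzip
import struct

import numpy as np

def read_idx_images(path):
    """Read a gzip-compressed IDX3 image file into an (n, rows, cols) uint8 array."""
    with gzip.open(path, "rb") as f:
        # Header: magic number, image count, row count, column count.
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:
            raise ValueError("unexpected magic number %d" % magic)
        data = np.frombuffer(f.read(n * rows * cols), dtype=np.uint8)
    return data.reshape(n, rows, cols)

def read_idx_labels(path):
    """Read a gzip-compressed IDX1 label file into an (n,) uint8 array."""
    with gzip.open(path, "rb") as f:
        # Header: magic number, label count.
        magic, n = struct.unpack(">II", f.read(8))
        if magic != 2049:
            raise ValueError("unexpected magic number %d" % magic)
        return np.frombuffer(f.read(n), dtype=np.uint8)
```

With the archives in place, `read_idx_images("MNIST_data/train-images-idx3-ubyte.gz")` would be expected to return an array of shape `(60000, 28, 28)`, with raw pixel values in the range 0 to 255 rather than 0 to 1.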

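Importing the Dataset with `TensorFlow`

Here, Jupyter Notebook is used to run the programs that work with the MNIST dataset.

Code Implementation

Read the dataset with TensorFlow.
Note: when I ran the code below myself, it was supposed to download the files automatically, but the download stalled. Downloading the four files in advance and placing them directly in the target directory (here, MNIST_data/) also works.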

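# This tutorial module is part of TensorFlow 1.x.
import tensorflow.examples.tutorials.mnist.input_data as input_data

# Read the four archives from MNIST_data/ (downloading them if they are missing).
# With one_hot=False, each label is returned as an integer class index.
mnist = input_data.read_data_sets("MNIST_data/", one_hot=False)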
'''
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
'''
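If the automatic download stalls, it can help to check which archives are already in place before loading. The following is a minimal sketch using only the standard library; the helper `missing_mnist_files` is hypothetical, and the directory name `MNIST_data/` is taken from the call above.

```python
import os

# The four archive names listed on the MNIST download page.
MNIST_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]

def missing_mnist_files(data_dir):
    """Return the archive names that are not yet present in `data_dir`."""
    return [name for name in MNIST_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]

missing = missing_mnist_files("MNIST_data/")
if missing:
    print("Still need to place these files in MNIST_data/:", missing)
else:
    print("All four archives found.")
```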

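Print some information about the MNIST dataset. Note that the loader holds out 5,000 of the 60,000 training images as a validation set, which is why the training count shown is 55,000.

print("Type of the MNIST dataset object: %s" % type(mnist))
print("Number of training examples: %d" % mnist.train.num_examples)
print("Number of validation examples: %d" % mnist.validation.num_examples)
print("Number of test examples: %d" % mnist.test.num_examples)
'''
Type of the MNIST dataset object: <class 'tensorflow.contrib.learn.python.learn.datasets.base.Datasets'>
Number of training examples: 55000
Number of validation examples: 5000
Number of test examples: 10000
'''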
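The counts above add up to the 60,000 images of the original training file. As a rough sketch of how such a split can be produced, the code below slices a stand-in index array; that the first 5,000 samples (rather than some other subset) form the validation set is an assumption made for illustration.

```python
import numpy as np

NUM_ORIGINAL_TRAIN = 60000   # size of the original training file
VALIDATION_SIZE = 5000       # held-out count, matching the printed output

# Use indices as stand-ins for the images so the sketch stays lightweight.
indices = np.arange(NUM_ORIGINAL_TRAIN)

# Assumption: the first block is held out for validation,
# and the remainder is used for training.
validation_idx = indices[:VALIDATION_SIZE]
train_idx = indices[VALIDATION_SIZE:]

print(len(train_idx), len(validation_idx))
```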

Load all of the data as arrays for convenient use later.

train_img = mnist.train.images
train_label = mnist.train.labels
test_img = mnist.test.images
test_label = mnist.test.labels

# Confirm that every split is exposed as a NumPy array.
print("Type of training is %s" % (type(train_img)))
print("Type of trainlabel is %s" % (type(train_label)))
print("Type of testing is %s" % (type(test_img)))
print("Type of testlabel is %s" % (type(test_label)))
'''
Type of training is <class 'numpy.ndarray'>
Type of trainlabel is <class 'numpy.ndarray'>
Type of testing is <class 'numpy.ndarray'>
Type of testlabel is <class 'numpy.ndarray'>
'''

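The first 10 images of the MNIST training set can be displayed as follows:

import numpy as np
import matplotlib.pyplot as plt

for i in range(10):
    # Restore the flattened 784-element row to a 28x28 image.
    img = np.reshape(train_img[i, :], (28, 28))
    # Labels are integer class indices because one_hot=False was used when loading.
    label = train_label[i]
    plt.matshow(img, cmap=plt.get_cmap('gray'))
    plt.title("Label: %d" % label)
    plt.show()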

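As we can see, the data is read starting from the beginning of the MNIST array, but the digit labels do not start from 0; they are random and unordered.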
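One quick way to confirm that labels are unordered is to test whether they are sorted and to count how often each digit occurs. The sketch below uses a made-up `labels` array as a placeholder; in the notebook one would pass `train_label[:10]` instead.

```python
import numpy as np

# Hypothetical labels for ten samples (made up; not read from the dataset).
labels = np.array([7, 3, 4, 6, 1, 8, 1, 0, 9, 8])

# A sorted dataset would have non-decreasing labels; this one does not.
is_sorted = bool(np.all(np.diff(labels) >= 0))

# Count how many times each digit 0-9 occurs among the samples.
counts = np.bincount(labels, minlength=10)

print(is_sorted)
print(counts)
```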

to be continued…