python中idx是什么意思_使用Python解析MNIST数据集（IDX文件格式）

前言

最近在学习Keras，要使用到LeCun大神的MNIST手写数字数据集，直接从官网上下载了4个压缩包：

MNIST数据集

解压后发现里面每个压缩包里有一个idx-ubyte文件，没有图片文件在里面。回去仔细看了一下官网后发现原来这是IDX文件格式，是一种用来存储向量与多维度矩阵的文件格式。

IDX文件格式

官网上的介绍如下:

THE IDX FILE FORMAT

the IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.

The basic format is

magic number

size in dimension 0

size in dimension 1

size in dimension 2

.....

size in dimension N

data

The magic number is an integer (MSB first). The first 2 bytes are always 0.

The third byte codes the type of the data:

0x08: unsigned byte

0x09: signed byte

0x0B: short (2 bytes)

0x0C: int (4 bytes)

0x0D: float (4 bytes)

0x0E: double (8 bytes)

The 4-th byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices....

The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).

The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.

解析脚本

根据以上解析规则，我使用了Python里的struct模块对文件进行读写（如果不熟悉struct模块的可以看我的另一篇博客文章《Python中对字节流/二进制流的操作:struct模块简易使用教程》）。IDX文件的解析通用接口如下：

# 解析idx1格式

def decode_idx1_ubyte(idx1_ubyte_file):

"""

解析idx1文件的通用函数

:param idx1_ubyte_file: idx1文件路径

:return: np.array类型对象

"""

return data

def decode_idx3_ubyte(idx3_ubyte_file):

"""

解析idx3文件的通用函数

:param idx3_ubyte_file: idx3文件路径

:return: np.array类型对象

"""

return data

针对MNIST数据集的解析脚本如下

# encoding: utf-8

"""

@author: monitor1379

@contact: yy4f5da2@hotmail.com

@site: www.monitor1379.com

@version: 1.0

@license: Apache Licence

@file: mnist_decoder.py

@time: 2016/8/16 20:03

对MNIST手写数字数据文件转换为bmp图片文件格式。

数据集下载地址为http://yann.lecun.com/exdb/mnist。

相关格式转换见官网以及代码注释。

========================

关于IDX文件格式的解析规则：

========================

THE IDX FILE FORMAT

the IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.

The basic format is

magic number

size in dimension 0

size in dimension 1

size in dimension 2

.....

size in dimension N

data

The magic number is an integer (MSB first). The first 2 bytes are always 0.

The third byte codes the type of the data:

0x08: unsigned byte

0x09: signed byte

0x0B: short (2 bytes)

0x0C: int (4 bytes)

0x0D: float (4 bytes)

0x0E: double (8 bytes)

The 4-th byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices....

The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).

The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.

"""

import numpy as np

import struct

import matplotlib.pyplot as plt

# 训练集文件

train_images_idx3_ubyte_file = '../../data/mnist/bin/train-images.idx3-ubyte'

# 训练集标签文件

train_labels_idx1_ubyte_file = '../../data/mnist/bin/train-labels.idx1-ubyte'

# 测试集文件

test_images_idx3_ubyte_file = '../../data/mnist/bin/t10k-images.idx3-ubyte'

# 测试集标签文件

test_labels_idx1_ubyte_file = '../../data/mnist/bin/t10k-labels.idx1-ubyte'

def decode_idx3_ubyte(idx3_ubyte_file):

"""

解析idx3文件的通用函数

:param idx3_ubyte_file: idx3文件路径

:return: 数据集

"""

# 读取二进制数据

bin_data = open(idx3_ubyte_file, 'rb').read()

# 解析文件头信息，依次为魔数、图片数量、每张图片高、每张图片宽

offset = 0

fmt_header = '>iiii'

magic_number, num_images, num_rows, num_cols = struct.unpack_from(fmt_header, bin_data, offset)

print '魔数:%d, 图片数量: %d张, 图片大小: %d*%d' % (magic_number, num_images, num_rows, num_cols)

# 解析数据集

image_size = num_rows * num_cols

offset += struct.calcsize(fmt_header)

fmt_image = '>' + str(image_size) + 'B'

images = np.empty((num_images, num_rows, num_cols))

for i in range(num_images):

if (i + 1) % 10000 == 0:

print '已解析 %d' % (i + 1) + '张'

images[i] = np.array(struct.unpack_from(fmt_image, bin_data, offset)).reshape((num_rows, num_cols))

offset += struct.calcsize(fmt_image)

return images

def decode_idx1_ubyte(idx1_ubyte_file):

"""

解析idx1文件的通用函数

:param idx1_ubyte_file: idx1文件路径

:return: 数据集

"""

# 读取二进制数据

bin_data = open(idx1_ubyte_file, 'rb').read()

# 解析文件头信息，依次为魔数和标签数

offset = 0

fmt_header = '>ii'

magic_number, num_images = struct.unpack_from(fmt_header, bin_data, offset)

print '魔数:%d, 图片数量: %d张' % (magic_number, num_images)

# 解析数据集

offset += struct.calcsize(fmt_header)

fmt_image = '>B'

labels = np.empty(num_images)

for i in range(num_images):

if (i + 1) % 10000 == 0:

print '已解析 %d' % (i + 1) + '张'

labels[i] = struct.unpack_from(fmt_image, bin_data, offset)[0]

offset += struct.calcsize(fmt_image)

return labels

def load_train_images(idx_ubyte_file=train_images_idx3_ubyte_file):

"""

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):

[offset] [type] [value] [description]

0000 32 bit integer 0x00000803(2051) magic number

0004 32 bit integer 60000 number of images

0008 32 bit integer 28 number of rows

0012 32 bit integer 28 number of columns

0016 unsigned byte ?? pixel

0017 unsigned byte ?? pixel

........

xxxx unsigned byte ?? pixel

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

:param idx_ubyte_file: idx文件路径

:return: n*row*col维np.array对象，n为图片数量

"""

return decode_idx3_ubyte(idx_ubyte_file)

def load_train_labels(idx_ubyte_file=train_labels_idx1_ubyte_file):

"""

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):

[offset] [type] [value] [description]

0000 32 bit integer 0x00000801(2049) magic number (MSB first)

0004 32 bit integer 60000 number of items

0008 unsigned byte ?? label

0009 unsigned byte ?? label

........

xxxx unsigned byte ?? label

The labels values are 0 to 9.

:param idx_ubyte_file: idx文件路径

:return: n*1维np.array对象，n为图片数量

"""

return decode_idx1_ubyte(idx_ubyte_file)

def load_test_images(idx_ubyte_file=test_images_idx3_ubyte_file):

"""

TEST SET IMAGE FILE (t10k-images-idx3-ubyte):

[offset] [type] [value] [description]

0000 32 bit integer 0x00000803(2051) magic number

0004 32 bit integer 10000 number of images

0008 32 bit integer 28 number of rows

0012 32 bit integer 28 number of columns

0016 unsigned byte ?? pixel

0017 unsigned byte ?? pixel

........

xxxx unsigned byte ?? pixel

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

:param idx_ubyte_file: idx文件路径

:return: n*row*col维np.array对象，n为图片数量

"""

return decode_idx3_ubyte(idx_ubyte_file)

def load_test_labels(idx_ubyte_file=test_labels_idx1_ubyte_file):

"""

TEST SET LABEL FILE (t10k-labels-idx1-ubyte):

[offset] [type] [value] [description]

0000 32 bit integer 0x00000801(2049) magic number (MSB first)

0004 32 bit integer 10000 number of items

0008 unsigned byte ?? label

0009 unsigned byte ?? label

........

xxxx unsigned byte ?? label

The labels values are 0 to 9.

:param idx_ubyte_file: idx文件路径

:return: n*1维np.array对象，n为图片数量

"""

return decode_idx1_ubyte(idx_ubyte_file)

def run():

train_images = load_train_images()

train_labels = load_train_labels()

# test_images = load_test_images()

# test_labels = load_test_labels()

# 查看前十个数据及其标签以读取是否正确

for i in range(10):

print train_labels[i]

plt.imshow(train_images[i], cmap='gray')

plt.show()

print 'done'

if __name__ == '__main__':

run()

python中idx是什么意思_使用Python解析MNIST数据集（IDX文件格式）相关推荐

python中numpy数组的合并_基于Python中numpy数组的合并实例讲解
基于Python中numpy数组的合并实例讲解 Python中numpy数组的合并有很多方法,如 - np.append() - np.concatenate() - np.stack() - np. ...
python中for语句的使用_对Python中for复合语句的使用示例讲解
当Python中用到双重for循环设计的时候我一般会使用循环的嵌套,但是在Python中其实还存在另一种技巧--for复合语句. 简单写一个小程序,用于延时循环嵌套功能如下: #!/usr/bin/p ...
python中mod是什么意思_【python中，mod_python到底做了些什么呢？】mod python 教程
python 编程小白 ,不会用doctest 请大神指教怎么用!! >>> >>> def is_between(v, lower, higher): ... ...
python中二进制和文本不同_关于Python字符编码与二进制不得不说的一些事
二进制核心思想: 冯诺依曼 + 图灵机电如何表示状态,才能稳定? 计算机开始设计的时候并不是考虑简单,而是考虑能自动完成任务与结果的可靠性, 简单始终是建立再稳定.可靠基础上经过尝试10进制,但 ...
python中sample是什么意思_基于Python中random.sample()的替代方案
python中random.sample()方法可以随机地从指定列表中提取出N个不同的元素,但在实践中发现,当N的值比较大的时候,该方法执行速度很慢,如: numpy random模块中的choice ...
python中的class怎么用_对python 中class与变量的使用方法详解
python中的变量定义是很灵活的,很容易搞混淆,特别是对于class的变量的定义,如何定义使用类里的变量是我们维护代码和保证代码稳定性的关键. #!/usr/bin/python #encoding ...
python中矩阵的表示方法_关于Python表示矩阵的方法详解
这篇文章主要介绍了Python表示矩阵的方法,结合具体实例形式分析了Python表示矩阵的方法与相关操作注意事项,需要的朋友可以参考下本文实例讲述了Python表示矩阵的方法.分享给大家供大家参考, ...
python中list作为函数参数_在python中list作函数形参,防止被实参修改的实现方法
0.摘要我们将一个list传入函数后,函数内部对实参修改后,形参也会随之改变.本文将主要介绍这种错误的现象.原因和解决方法. 1.代码示例 def fun(inner_lst): inner_lst ...
python中使用函数编程的意义_总结Python编程中函数的使用要点
为何使用函数最大化代码的重用和最小化代码冗余流程的分解编写函数 >>def语句在Python中创建一个函数是通过def关键字进行的,def语句将创建一个函数对象并将其赋值给一个变量 ...
python中matrix是什么意思_初识Python
初识Python 跟学习所有的编程语言一样,首先得了解这门语言的编程风格和最基础的语法.下面就让我们一起来了解一下Python的编程风格. 1.逻辑行与物理行在Python中有逻辑行和物理行这个概念 ...

python中idx是什么意思_使用Python解析MNIST数据集（IDX文件格式）

python中idx是什么意思_使用Python解析MNIST数据集（IDX文件格式）相关推荐

最新文章

热门文章