h5py official documentation

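Introduction

HDF (Hierarchical Data Format) is a file format, together with its supporting libraries, designed for storing and processing large volumes of scientific data. It was originally developed at the National Center for Supercomputing Applications (NCSA) in the United States and is now maintained by the non-profit HDF Group. HDF is supported on many commercial and non-commercial software platforms, including MATLAB, Java, Python, R, and Julia, and Spark support is now available as well. There are two versions, HDF4 and HDF5, of which HDF5 is the one in common use today. Python has several tools for working with HDF5 data; the most widely used are h5py and PyTables.

An HDF5 file is a container for two kinds of data objects, datasets and groups, and it is used much like a standard Python file object. The File instance is itself a group, named /, and serves as the entry point for traversing the file.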

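dataset: comparable to a NumPy array. Every dataset has a name, a shape, and a type (dtype), and supports slicing;
group: comparable to a dictionary. It is a folder-like container that can hold datasets or other groups; the keys are the names of its members, and the values are the member objects themselves (groups or datasets).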
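The two analogies above (a group as a dictionary, a dataset as a NumPy array) can be tried out with a minimal sketch like the one below. The file name, group name, and dataset name are made up for illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file path in a temporary directory, used only for this demo.
demo_path = os.path.join(tempfile.mkdtemp(), "concepts_demo.h5")

with h5py.File(demo_path, "w") as f:
    root_name = f.name                    # the File object is the root group
    grp = f.create_group("measurements")  # a group behaves like a dictionary
    dset = grp.create_dataset("temps", data=np.arange(6).reshape(2, 3))

    member_names = list(grp.keys())       # keys are the member names
    dset_name = dset.name                 # full path of the dataset in the file
    dset_shape = dset.shape               # datasets expose shape and dtype
    first_row = dset[0, :]                # NumPy-style slicing returns an array

print(root_name, member_names, dset_name, dset_shape, first_row)
```

Indexing a group with a member name (for example grp["temps"]) returns the member object, which is the dictionary-style lookup the text describes.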

Basic h5py operations

Installing h5py

conda install h5py

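Creating an h5py file

import h5py
# Option 1: open the file and close it yourself when done
f = h5py.File("myh5py1.h5", "w")
f.close()
# Option 2: use a context manager, which closes the file automatically
with h5py.File('myh5py2.h5', 'w') as f:
    pass  # work with f here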

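Parameters:

First argument: the file name, either a byte string or a unicode string;
Second argument: mode

mode	Description
r	Read-only; the file must exist (the default since h5py 3.0)
r+	Read/write; the file must exist
w	Create the file; an existing file is overwritten
w- / x	Create the file; fail if it already exists
a	Read/write if the file exists, otherwise create a new file (the default before h5py 3.0)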
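To see how the modes in the table behave, here is a small sketch that runs through them in turn. The path is a made-up file in a temporary directory, and catching OSError is a deliberately broad choice that should cover the errors h5py raises here across versions.

```python
import os
import tempfile

import h5py

# Hypothetical file path in a temporary directory, used only for this demo.
mode_path = os.path.join(tempfile.mkdtemp(), "mode_demo.h5")

# "r" requires the file to exist, so opening a missing file fails.
try:
    h5py.File(mode_path, "r")
    read_missing_failed = False
except OSError:
    read_missing_failed = True

# "w" creates the file (and would overwrite an existing one).
with h5py.File(mode_path, "w") as f:
    f.create_dataset("x", data=[1, 2, 3])

# "x" (the same as "w-") refuses to replace an existing file.
try:
    h5py.File(mode_path, "x")
    exclusive_failed = False
except OSError:
    exclusive_failed = True

# "a" opens the existing file for reading and writing and keeps its contents.
with h5py.File(mode_path, "a") as f:
    kept = "x" in f
    f.create_dataset("y", data=[4, 5])

# "r" opens the file read-only.
with h5py.File(mode_path, "r") as f:
    names = sorted(f.keys())

print(read_missing_failed, exclusive_failed, kept, names)
```

Passing the mode explicitly, as done here, avoids depending on the default, which differs between h5py versions.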

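Creating a dataset

Function: h5py.File.create_dataset(self, name, shape=None, dtype=None, data=None, **kwds)

Option 1: create an empty dataset, then assign to it

To create an empty dataset you only need to give its name and shape. The dtype defaults to np.float32 and the fill value defaults to 0; the fill value can be changed with the fillvalue keyword argument.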

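import h5py
import numpy as np
f1 = h5py.File("myh5py1.h5", "w")
d = f1.create_dataset("dset1", (100,), 'i')  # 100 32-bit integers, initially 0
d[...] = np.arange(100)  # overwrite the whole dataset
f1.close()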
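The defaults mentioned above (float32 dtype, fill value 0) and the fillvalue keyword can be checked with a short sketch. The dataset names and the temporary file path are made up for the demo.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file path in a temporary directory, used only for this demo.
fill_path = os.path.join(tempfile.mkdtemp(), "fill_demo.h5")

with h5py.File(fill_path, "w") as f:
    # Only name and shape are given, so the default dtype applies.
    plain = f.create_dataset("plain", (4,))
    # Same shape, but with an explicit integer dtype and a custom fill value.
    sevens = f.create_dataset("sevens", (4,), dtype="i4", fillvalue=7)

    plain_dtype = plain.dtype      # default dtype
    plain_values = plain[:]        # elements never written read back as 0
    sevens_values = sevens[:]      # elements never written read back as 7

print(plain_dtype, plain_values, sevens_values)
```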

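Option 2: create a dataset directly from data

To create a non-empty dataset you only need to give the name and the data. The shape and dtype are taken from data automatically, although you can also specify the storage type explicitly to save space (single-precision floats take half the space of double-precision floats).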

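f2 = h5py.File("myh5py2.h5", "w")
# 1. Assign an array to a new name
f2["dset2"] = np.arange(100)
# 2. Call create_dataset with data (the name must not already exist, hence dset3)
arr = np.arange(100)
dset = f2.create_dataset('dset3', data=arr)
for key in f2.keys():
    print(f2[key].name)
    print(f2[key][()])  # [()] reads the whole dataset; .value was removed in h5py 3.0
f2.close()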
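The remark about saving space with an explicit storage type can be illustrated roughly as follows. This is a sketch with made-up dataset names and a temporary file; the byte counts are computed from the element size and count rather than measured on disk.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file path in a temporary directory, used only for this demo.
dtype_path = os.path.join(tempfile.mkdtemp(), "dtype_demo.h5")

data = np.linspace(0.0, 1.0, 1000)  # NumPy creates float64 values here

with h5py.File(dtype_path, "w") as f:
    # shape and dtype are inferred from the data
    d64 = f.create_dataset("as_float64", data=data)
    # same data, stored as single precision instead
    d32 = f.create_dataset("as_float32", data=data, dtype="f4")

    inferred_dtype = d64.dtype
    inferred_shape = d64.shape
    bytes64 = d64.dtype.itemsize * d64.size  # raw data size in bytes
    bytes32 = d32.dtype.itemsize * d32.size

print(inferred_dtype, inferred_shape, bytes64, bytes32)
```

Storing as float32 halves the raw data size at the cost of precision, so it only makes sense when single precision is enough for the data at hand.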


Creating a group

import h5py
import numpy as np

f = h5py.File("myh5py.h5", "w")

# Create groups bar1 and bar2 and dataset dset
g1 = f.create_group("bar1")
g2 = f.create_group("bar2")
d = f.create_dataset("dset", data=np.arange(10))

# Inside bar1, create group car1 and dataset dset1
c1 = g1.create_group("car1")
d1 = g1.create_dataset("dset1", data=np.arange(10))

# Inside bar2, create group car2 and dataset dset2
c2 = g2.create_group("car2")
d2 = g2.create_dataset("dset2", data=np.arange(10))

# Groups and datasets under the root
print(".............")
for key in f.keys():
    print(f[key].name)

# Groups and datasets under bar1
print(".............")
for key in g1.keys():
    print(g1[key].name)

# Groups and datasets under bar2
print(".............")
for key in g2.keys():
    print(g2[key].name)

# See what car1 and car2 contain (both are empty)
print(".............")
print(c1.keys())
print(c2.keys())

f.close()
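Listing each group by hand works for a small file. For a deeper hierarchy, one option is to address members by path and to walk the whole tree with the visit method. The sketch below assumes a layout similar to the one above, written to a made-up file in a temporary directory.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file path in a temporary directory, used only for this demo.
tree_path = os.path.join(tempfile.mkdtemp(), "tree_demo.h5")

with h5py.File(tree_path, "w") as f:
    # A path-style name creates the intermediate group bar1 automatically.
    f.create_group("bar1/car1")
    f["bar1"].create_dataset("dset1", data=np.arange(10))
    f.create_dataset("dset", data=np.arange(10))

    # Members can be looked up by path, relative to the group being indexed.
    nested_is_group = isinstance(f["bar1/car1"], h5py.Group)
    nested_is_dataset = isinstance(f["bar1/dset1"], h5py.Dataset)

    # visit() calls the function once per member, passing its relative name.
    all_names = []
    f.visit(all_names.append)

print(nested_is_group, nested_is_dataset, sorted(all_names))
```

visit stops early if the callback returns anything other than None; list.append always returns None, so every member is collected here.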

Reading and writing HDF5 files in batches

# -*- coding: utf-8 -*-
import h5py
import numpy as np

# Write data in batches
def save_h5(h5f, data, target):
    shape_list = list(data.shape)
    if target not in h5f:
        # First batch: create a dataset that can grow without limit along axis 0
        shape_list[0] = None
        h5f.create_dataset(target, data=data, maxshape=tuple(shape_list), chunks=True)
        return
    # Later batches: enlarge the dataset, then write the new rows at the end
    dataset = h5f[target]
    len_old = dataset.shape[0]
    len_new = len_old + data.shape[0]
    shape_list[0] = len_new
    dataset.resize(tuple(shape_list))
    dataset[len_old:len_new] = data

# Read data in batches
def getDataFromH5py(fileName, target, start, length):
    with h5py.File(fileName, 'r') as h5f:
        if target not in h5f:
            res = []
        elif start + length >= h5f[target].shape[0]:
            # The requested range runs past the end: return what is left
            res = h5f[target][start:h5f[target].shape[0]]
        else:
            # Slice the dataset directly so only this range is read from disk
            res = h5f[target][start:start + length]
    return res

# Usage
file_name = './data.h5'
with h5py.File(file_name, 'w') as h5f:
    features = np.arange(100)
    for _ in range(5):  # append the same 100 values five times
        save_h5(h5f, data=np.array(features), target='mnist_features')

for i in range(10):
    d = getDataFromH5py('./data.h5', 'mnist_features', i * 5, 5)  # read 5 items per batch
    print(d)

Output:

[0 1 2 3 4]
[5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]
[25 26 27 28 29]
[30 31 32 33 34]
[35 36 37 38 39]
[40 41 42 43 44]
[45 46 47 48 49]
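The batch-writing approach above relies on a resizable dataset. A stripped-down sketch of just that mechanism might look like this; the dataset name and the temporary file are made up for the demo.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file path in a temporary directory, used only for this demo.
grow_path = os.path.join(tempfile.mkdtemp(), "grow_demo.h5")

with h5py.File(grow_path, "w") as f:
    # maxshape=(None,) makes axis 0 unlimited; resizable datasets are chunked.
    d = f.create_dataset("log", data=np.arange(3), maxshape=(None,), chunks=True)
    shape_before = d.shape

    d.resize((5,))       # grow from 3 to 5 elements
    d[3:5] = [30, 40]    # fill the newly added tail
    shape_after = d.shape
    max_shape = d.maxshape

with h5py.File(grow_path, "r") as f:
    tail = f["log"][2:5]  # read only the requested slice

print(shape_before, shape_after, max_shape, tail)
```

Without maxshape the dataset has a fixed size, and calling resize on it raises an error.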

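As a special case, when the data written in batches are images:

# Write image data and the matching labels to one h5py file, batch by batch
import cv2
import h5py
import numpy as np
from keras.utils import img_to_array  # assumed import (missing in the original); the path depends on the Keras version

# label (the labels, LABEL_DIMS values per image) and path (the image file prefix)
# are not shown in the original and must be defined before this code runs.

IMAGE_DIMS = (80, 60, 3)  # image dimensions (height, width, channels)
LABEL_DIMS = 3  # label dimension
img_total = 40442  # total number of images
img_batch = 100  # images read and written per batch
img_num = img_total // img_batch + 1  # number of batches to process
img_res = img_total - img_batch * (img_num - 1)  # images in the last batch

for img_i in range(img_num):  # img_i is the batch index
    if img_i == 0:
        h5f = h5py.File("./dataset/train_data.h5", "w")  # build File object
        x = h5f.create_dataset("x_train", (img_batch, IMAGE_DIMS[0], IMAGE_DIMS[1], IMAGE_DIMS[2]),
                               maxshape=(None, IMAGE_DIMS[0], IMAGE_DIMS[1], IMAGE_DIMS[2]),
                               dtype=np.float32)  # build x_train dataset
        y = h5f.create_dataset("y_train", (img_batch, LABEL_DIMS),
                               maxshape=(None, LABEL_DIMS), dtype=np.int32)  # build y_train dataset
    else:
        h5f = h5py.File("./dataset/train_data.h5", "a")  # append mode
        x = h5f["x_train"]
        y = h5f["y_train"]

    # Load the labels and images of this batch
    ytem = label[img_i * img_batch:(img_i + 1) * img_batch]
    image = []
    for i in range(img_i * img_batch, (img_i + 1) * img_batch):
        if i >= img_total:
            break
        img = cv2.imread(path + str(i) + '.jpg')
        img = cv2.resize(img, (IMAGE_DIMS[1], IMAGE_DIMS[0]))  # cv2 expects (width, height)
        img = img_to_array(img)
        image.append(img)
    image = np.array(image, dtype="float") / 255.0  # scale pixel values to [0, 1]

    if img_i != img_num - 1:
        # Full batch: grow both datasets by img_batch rows, then write
        x.resize([img_i * img_batch + img_batch, IMAGE_DIMS[0], IMAGE_DIMS[1], IMAGE_DIMS[2]])
        y.resize([img_i * img_batch + img_batch, LABEL_DIMS])
        x[img_i * img_batch:img_i * img_batch + img_batch] = image
        y[img_i * img_batch:img_i * img_batch + img_batch] = ytem  # write to the dataset
        print('batch {} has been processed'.format(img_i))
    else:
        # Last batch: only img_res images remain
        x.resize([img_i * img_batch + img_res, IMAGE_DIMS[0], IMAGE_DIMS[1], IMAGE_DIMS[2]])
        y.resize([img_i * img_batch + img_res, LABEL_DIMS])
        x[img_i * img_batch:img_i * img_batch + img_res] = image
        y[img_i * img_batch:img_i * img_batch + img_res] = ytem
        print('batch {} has been processed'.format(img_i))
    h5f.close()  # close file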
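Once such a file exists, the images and labels can be read back batch by batch, for example to feed a training loop. The generator below is only a sketch: iter_batches is a hypothetical helper, not part of h5py, and the small random arrays stand in for real images so that the example runs without any image files.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_batches(h5_path, batch_size, x_name="x_train", y_name="y_train"):
    """Yield (images, labels) pairs of at most batch_size rows each.

    Hypothetical helper for illustration; the dataset names are assumptions
    that match the names used in the writing code above.
    """
    with h5py.File(h5_path, "r") as f:
        x, y = f[x_name], f[y_name]
        n = x.shape[0]
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            # Slicing reads only this batch from disk into NumPy arrays.
            yield x[start:stop], y[start:stop]


# Build a tiny stand-in file: 5 fake "images" of shape (8, 6, 3) with 3 labels each.
batch_path = os.path.join(tempfile.mkdtemp(), "train_demo.h5")
rng = np.random.default_rng(0)
with h5py.File(batch_path, "w") as f:
    f.create_dataset("x_train", data=rng.random((5, 8, 6, 3), dtype=np.float32))
    f.create_dataset("y_train", data=np.arange(15, dtype=np.int32).reshape(5, 3))

batch_shapes = [(xb.shape, yb.shape) for xb, yb in iter_batches(batch_path, batch_size=2)]
print(batch_shapes)
```

The file stays open only while the generator is being consumed, and the last batch is simply shorter when the total is not a multiple of the batch size.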


