Using MIT-BIH (4): Batch-Reading MIT-BIH with WFDB

Preface

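When I first got this dataset, I didn't even know how to open it and look at what was inside. After some searching I found the native reading method described in MIT-BIH Introduction (3), which is direct but crude.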

Later on, while studying further, I discovered how handy the wfdb library is!

It is the native Python Waveform Database (WFDB) package: a library of tools for reading, writing, and processing WFDB signals and annotations.

1. Run it all in one go!

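I recommend keeping your test project layout the same as mine so the copied code does not run into path problems. The code expects the dataset under ../input/mit-bih-arrhythmia-database-1.0.0/, relative to the directory you run it from.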

All you need is a single test.py: copy the code into it, put the dataset in place, and run it!

# Import required packages
from collections import Counter
import os
import pickle

from IPython.display import display
import numpy as np
import pandas as pd
from tqdm import tqdm
import wfdb


# Read the signals, peaks, and labels, then save them to mit-bih-database.pkl
def read_dataset(data_path):
    """
    Before running this data-preparation code, first download the raw data from
    https://www.physionet.org/content/mitdb/1.0.0/ and put it in data_path.
    """
    # read data and labels
    all_data = []
    all_peak = []
    all_label = []
    filenames = pd.read_csv(os.path.join(data_path, 'RECORDS'), header=None)
    filenames = filenames.iloc[:, 0].values  # take the values of the first column
    print(filenames)  # print the record names
    for filename in tqdm(filenames):
        record_path = os.path.join(data_path, '{0}'.format(filename))
        # read data (channel 0 only) and flatten the (n, 1) column to 1-D
        dat = wfdb.rdrecord(record_path, channels=[0])
        all_data.append(dat.p_signal[:, 0])
        # read labels
        atr = wfdb.rdann(record_path, 'atr')
        all_peak.append(atr.sample)
        all_label.append(np.array(atr.symbol))
    all_data = np.array(all_data)  # list -> ndarray, shape (records, samples)
    # Peaks and labels cannot form a regular matrix because every record has a
    # different number of annotations, so store them as object arrays.
    all_peak = np.array(all_peak, dtype=object)
    all_label = np.array(all_label, dtype=object)
    res = {'data': all_data, 'peak': all_peak, 'label': all_label}  # dict holding data and labels
    display(res)
    with open(os.path.join(data_path, 'mit-bih-database.pkl'), 'wb') as fout:
        pickle.dump(res, fout)


def preprocess_dataset(data_path):
    # read pkl
    with open(os.path.join(data_path, 'mit-bih-database.pkl'), 'rb') as fin:
        res = pickle.load(fin)
    # unpack data
    all_data = res['data']
    all_peak = res['peak']
    all_label = res['label']
    print(all_data.shape)
    # flatten the per-record labels into one list and count each symbol
    label_type = []
    for i in range(all_peak.shape[0]):
        label_type.extend(all_label[i].tolist())
    print(Counter(label_type))


data_path = '../input/mit-bih-arrhythmia-database-1.0.0/'
read_dataset(data_path)
preprocess_dataset(data_path)

Take a look at the console output. Very satisfying!

# Record numbers
[100 101 102 103 104 105 106 107 108 109 111 112 113 114 115 116 117 118 119 121 122 123 124 200 201 202 203 205 207 208 209 210 212 213 214 215 217 219 220 221 222 223 228 230 231 232 233 234]
# Contents of res
{'data': array([[-0.145, -0.145, -0.145, ..., -0.675, -0.765, -1.28 ],
       [-0.345, -0.345, -0.345, ..., -0.295, -0.29 ,  0.   ],
       [-0.2  , -0.2  , -0.2  , ..., -0.17 , -0.195,  0.   ],
       ...,
       [-0.245, -0.245, -0.245, ..., -0.07 , -0.075,  0.   ],
       [-0.095, -0.095, -0.095, ..., -0.495, -0.49 ,  0.   ],
       [-0.08 , -0.08 , -0.08 , ..., -0.395, -0.38 ,  0.   ]]),
 'peak': array([array([    18,     77,    370, ..., 649484, 649734, 649991]),
       array([     7,     83,    396, ..., 649004, 649372, 649751]),
       array([    68,    136,    410, ..., 649244, 649553, 649852]),
       ...
       array([    76,    491,    737, ..., 648845, 649097, 649366]),
       array([     3,     42,    320, ..., 649518, 649730, 649946]),
       array([    52,    135,    366, ..., 649292, 649536, 649772])],
      dtype=object),
 'label': array([array(['+', 'N', 'N', ..., 'N', 'N', 'N'], dtype='<U1'),
       array(['+', 'N', 'N', ..., 'N', 'N', 'N'], dtype='<U1'),
       ...
       array(['+', 'V', 'N', ..., 'N', 'N', 'N'], dtype='<U1'),
       array(['+', 'N', 'N', ..., 'N', 'N', 'N'], dtype='<U1')],
      dtype=object)}
# 48 records, each with 650,000 samples
(48, 650000)
# Count of each annotation symbol, 23 kinds in total (some, such as '+', mark events rather than beats)
Counter({'N': 75052, 'L': 8075, 'R': 7259, 'V': 7130, '/': 7028, 'A': 2546, '+': 1291, 'f': 982, 'F': 803, '~': 616, '!': 472, '"': 437, 'j': 229, 'x': 193, 'a': 150, '|': 132, 'E': 106, 'J': 83, 'Q': 33, 'e': 16, '[': 6, ']': 6, 'S': 2})

2. Code walkthrough

First, read the RECORDS file to get every record number, then loop over those numbers to read each record's data.
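As a small illustration of this first step, here is a minimal sketch that reads RECORDS with the standard library only. The helper name `read_record_names` is my own, not part of wfdb, and the demo writes a throwaway RECORDS file into a temporary folder instead of touching the real dataset.

```python
from pathlib import Path
import tempfile


def read_record_names(data_path):
    """Return the record names listed in data_path/RECORDS, one per line."""
    records_file = Path(data_path) / "RECORDS"
    with open(records_file, encoding="utf-8") as fin:
        # Strip newlines and skip blank lines.
        return [line.strip() for line in fin if line.strip()]


# Demo: a temporary directory stands in for the dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "RECORDS").write_text("100\n101\n102\n", encoding="utf-8")
    names = read_record_names(tmp)
    print(names)
```

Keeping the names as strings means they can be joined straight onto the data path, with no formatting step.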

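Next, read the .dat and .atr files. Record 100 contains two leads, MLII and V5, and channels=[0] selects the first of them (the leads and their order are not identical in every record). rdrecord returns a Record object; printing its __dict__ as below shows the contents.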
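Because the lead in channel 0 can differ between records, it may be worth looking the channel up by name instead of hard-coding 0. The sketch below is only an illustration: `channel_for_lead` is a hypothetical helper of mine, and the lead lists in the demo are made-up inputs in the shape of a record's `sig_name` list.

```python
def channel_for_lead(sig_names, preferred=("MLII",)):
    """Return the index of the first preferred lead found in sig_names, or None."""
    for lead in preferred:
        if lead in sig_names:
            return sig_names.index(lead)
    return None


# Illustrative lead lists (not read from any file).
print(channel_for_lead(["MLII", "V5"]))
print(channel_for_lead(["V5", "V2"]))
print(channel_for_lead(["V5", "MLII"], preferred=("MLII", "V5")))
```

In practice the `sig_name` list would come from the record itself, for example from a record read without the channels argument, and the returned index would then be passed as channels=[index].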

# Load the WFDB record
dat = wfdb.rdrecord('../input/mit-bih-arrhythmia-database-1.0.0/100', channels=[0])
print(dat.__dict__)
# Output (abridged; a full read of record 100 has sig_len 650000)
{'record_name': '100', 'n_sig': 1, 'fs': 360, 'counter_freq': None, 'base_counter': None, 'sig_len': 600, 'base_time': None, 'base_date': None, 'comments': ['69 M 1085 1629 x1', 'Aldomet, Inderal'], 'sig_name': ['MLII'], 'p_signal': array([[-0.145],...,[-0.295],[-0.265],[-0.255]]), 'd_signal': None, 'e_p_signal': None, 'e_d_signal': None, 'file_name': ['100.dat'], 'fmt': ['212'], 'samps_per_frame': [1], 'skew': [None], 'byte_offset': [None], 'adc_gain': [200.0], 'baseline': [1024], 'units': ['mV'], 'adc_res': [11], 'adc_zero': [1024], 'init_value': [995], 'checksum': [54316], 'block_size': [0]}

# Read the WFDB annotation file record_name.extension and return an Annotation object.
atr = wfdb.rdann('../input/mit-bih-arrhythmia-database-1.0.0/100', 'atr')
print(atr.__dict__)
# Output (abridged)
{'record_name': '100', 'extension': 'atr', 'sample': array([  18,   77,  370,  662,  946, 1231, 1515]), 'symbol': ['+', 'N', 'N', 'N', 'N', 'N', 'N'], 'subtype': array([0, 0, 0, 0, 0, 0, 0]), 'chan': array([0, 0, 0, 0, 0, 0, 0]), 'num': array([0, 0, 0, 0, 0, 0, 0]), 'aux_note': ['(N\x00', '', '', '', '', '', ''], 'fs': 360, 'label_store': None, 'description': None, 'custom_labels': None, 'contained_labels': None, 'ann_len': 7}
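To make a few of the printed fields less abstract, the sketch below redoes two small calculations by hand, using only numbers that appear in the output above. It assumes the usual WFDB convention that a physical value equals (digital value - baseline) / adc_gain; the function names are mine and are not wfdb API.

```python
def digital_to_physical(digital_value, baseline, adc_gain):
    """Convert a raw ADC value to physical units: (digital - baseline) / gain."""
    return (digital_value - baseline) / adc_gain


def sample_to_seconds(sample_index, fs):
    """Convert a sample index to elapsed time in seconds."""
    return sample_index / fs


# init_value, baseline, and adc_gain as printed for record 100.
first_mv = digital_to_physical(995, baseline=1024, adc_gain=200.0)
print(first_mv)  # -0.145, the same as the first entry of p_signal

# Annotation sample indices as printed for record 100, with fs = 360 Hz.
samples = [18, 77, 370, 662, 946, 1231, 1515]
times = [round(sample_to_seconds(s, fs=360), 3) for s in samples]
print(times)
```

So atr.sample holds positions in samples, not seconds; dividing by fs turns them into times.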

We pull out the three things we need, dat.p_signal, atr.sample, and atr.symbol, and save them to mit-bih-database.pkl so that later preprocessing can load them directly.
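The save-and-reload idea can be tried without the dataset. The following is a minimal sketch with toy values (made up, much shorter than real records) written to a temporary folder; it shows that a dict whose per-record lists have different lengths survives a pickle round trip unchanged.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real data: two "records" with different annotation counts.
res = {
    "data": [[-0.145, -0.145, -0.12], [-0.345, -0.345, -0.3]],
    "peak": [[18, 77], [7, 83, 396]],
    "label": [["+", "N"], ["+", "N", "N"]],
}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "mit-bih-database.pkl")
    with open(pkl_path, "wb") as fout:
        pickle.dump(res, fout)
    with open(pkl_path, "rb") as fin:
        loaded = pickle.load(fin)

print(loaded == res)
print([len(p) for p in loaded["peak"]])
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.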

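The preprocessing part of the code shows how to open the pkl file, how to get the data out of it, and how to count the number of each kind of heartbeat.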
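The counting step can also be written with `itertools.chain`, and it is easy to set aside symbols that are not beats. This is a sketch on toy labels; the `NON_BEAT` set is my own assumption based on the PhysioNet annotation codes ('+' rhythm change, '~' signal quality change, '|' isolated QRS-like artifact, '"' comment), so check it against the official table before relying on it.

```python
from collections import Counter
from itertools import chain

# Toy per-record label lists (made up for illustration).
all_label = [["+", "N", "N", "V"], ["+", "N", "A", "~", "N"]]

# Flatten every record's labels into one stream and count each symbol.
counts = Counter(chain.from_iterable(all_label))
print(counts.most_common())

# Assumption: these symbols annotate events, not individual beats.
NON_BEAT = {"+", "~", "|", '"'}
beat_counts = Counter({s: n for s, n in counts.items() if s not in NON_BEAT})
print(beat_counts)
```

Filtering like this keeps event markers from showing up as extra classes when the counts are later used as class labels.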

Closing remarks

And with that, reading the MIT-BIH database is done! Doesn't that feel refreshing?

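wfdb can also correct R-peak positions. Space is limited, so I won't go into it here; if you need it, have a look at the library's documentation and try out its other functions.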

Phew, that wasn't easy. Let's keep at it together!