torch中Dataset的构造与解读

Dataset的构造

要自定义自己的数据集，首先需要继承Dataset(torch.utils.data.Dataset)类.

继承Dataset类之后，必须重写三个方法:__init__(), __getitem__(), __len__()

class ModelNet40(Dataset):def __init__(self, xxx):...def __getitem__(self, item):...def __len()__(self):...

解读

单看上面的构造结构与三个需要重写的方法可能会一头雾水。我们详细分析其作用：

__init__的作用
__init__的作用与所有构造函数都一样，初始化一个类的实例。定义类的实际属性，如点云数据集中的unseen, guassian_noise等，是True还是False，取出所有数据存储为成员变量等等。
__getitem__的作用
__getitem__的作用是，根据item的值取出数据。 item实际上就是索引值，会由Dataloader自动从0一直递增到__len__中取出的值。
__len__的作用
__len__的作用是，相当于返回整体数据data的shape[0]，即给item的递增指定一个范围。

例子

class ModelNet40(Dataset):def __init__(self, num_points, partition='train', gaussian_noise=False, unseen=False, factor=4):self.data, self.label = load_data(partition)self.num_points = num_pointsself.partition = partitionself.gaussian_noise = gaussian_noiseself.unseen = unseenself.label = self.label.squeeze()self.factor = factorif self.unseen:######## simulate testing on first 20 categories while training on last 20 categoriesif self.partition == 'test':self.data = self.data[self.label>=20]self.label = self.label[self.label>=20]elif self.partition == 'train':self.data = self.data[self.label<20]self.label = self.label[self.label<20]def __getitem__(self, item):pointcloud = self.data[item][:self.num_points]          # 核心代码，就是用item取出的数据if self.gaussian_noise:pointcloud = jitter_pointcloud(pointcloud)if self.partition != 'train':np.random.seed(item)anglex = np.random.uniform() * np.pi / self.factorangley = np.random.uniform() * np.pi / self.factoranglez = np.random.uniform() * np.pi / self.factorcosx = np.cos(anglex)cosy = np.cos(angley)cosz = np.cos(anglez)sinx = np.sin(anglex)siny = np.sin(angley)sinz = np.sin(anglez)Rx = np.array([[1, 0, 0],[0, cosx, -sinx],[0, sinx, cosx]])Ry = np.array([[cosy, 0, siny],[0, 1, 0],[-siny, 0, cosy]])Rz = np.array([[cosz, -sinz, 0],[sinz, cosz, 0],[0, 0, 1]])R_ab = Rx.dot(Ry).dot(Rz)R_ba = R_ab.Ttranslation_ab = np.array([np.random.uniform(-0.5, 0.5), np.random.uniform(-0.5, 0.5),np.random.uniform(-0.5, 0.5)])translation_ba = -R_ba.dot(translation_ab)pointcloud1 = pointcloud.Trotation_ab = Rotation.from_euler('zyx', [anglez, angley, anglex])pointcloud2 = rotation_ab.apply(pointcloud1.T).T + np.expand_dims(translation_ab, axis=1)euler_ab = np.asarray([anglez, angley, anglex])euler_ba = -euler_ab[::-1]pointcloud1 = np.random.permutation(pointcloud1.T).Tpointcloud2 = np.random.permutation(pointcloud2.T).Tprint(item)print(pointcloud1.shape)return pointcloud1.astype('float32'), pointcloud2.astype('float32'), R_ab.astype('float32'), \translation_ab.astype('float32'), R_ba.astype('float32'), translation_ba.astype('float32'), \euler_ab.astype('float32'), euler_ba.astype('float32')def __len__(self):return self.data.shape[0]       # 给item一个范围

进一步理解其执行逻辑

if __name__ == '__main__':dataset1 = ModelNet40(num_points=1024, partition='train', gaussian_noise=True)dataloader = DataLoader(dataset1, batch_size=64, shuffle=False)count = 0for src_pointcloud, tgt_pointcloud, Rotation, translation, _, _, _, _ in dataloader:print(src_pointcloud.shape)count += 1print(count)

首先需要说明的是，在ModelNet40中，getitem中会打印item的当前值。

如果执行这段代码，在shuffle=False的情况下，其结果为：

item从0一直增加到__len__返回的那个值-1，也就是data的第一维（姑且称为batch维）。

在getitem中取出的pointcloud的shape为(3, 1024)，只有两个axis.

而最后输出的count，也就是main函数中整个for循环执行的次数，会是__len__() / batch_size.

比如len是9480，即self.data的shape为(9480, 2048, 3)，那么item就会从0一直增加到9479. 在batch_size为64的情况下，for循环一共执行（即count为） 9480/64 = 148.125，那么最终会执行149次。也就是说，每次for循环实质上调用了getitem方法64次，最后在第一维上stack，使之shape变为(64, 3, 1024).