Data preparation for a camera-only mmdet3d baseline: processing Waymo dataset v1.3.1

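Running a camera-only baseline (multi-camera mode) on Waymo takes several steps:

Convert the dataset to KITTI format
Modify the dataloader code
Modify the model config
Modify the model's targets and loss
Modify the eval pipeline code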

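The Waymo dataset tutorial on the official mmdet3d site is too brief: its output only works for PointPillars, and it targets an older version of the dataset. For a beginner like me that is very unfriendly. Based on the mmdet tutorial (hereafter "the tutorial"), this post summarizes the actual workflow and explains how to modify the mmdet3d code so that DETR3D can take its first step toward handling Waymo.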

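In fact, writing the processing by hand would be faster than studying and modifying this code, but as a beginner who wants to get familiar with the framework, I read through it anyway.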

Environment setup

Update: for environment setup, just use install_mmdet3d_rc2.sh from "Environment Setup" (环境配置).

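Waymo dataset v1.3.1 and v1.3.2 have a companion waymo-open-dataset project on GitHub. It contains a tutorial and the protobuf definition of the data frame, and implements some dataset extraction utilities on top of TensorFlow.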

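To run the Waymo dataset with the mmdetection(3d) framework, you need that project to extract the Waymo data and convert it to KITTI format, so that PointPillars in mmdet3d can run on the dataset directly. (DETR3D needs more work.)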

Environment setup is quite painful: the Waymo pip package has bugs. I recommend installing version 1.4.7.

(conda virtual env) user@ubuntu: pip install waymo-open-dataset-tf-2-6-0==1.4.7

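Pick the tf version according to the cudatoolkit in your environment; I use CUDA 11.4.
One problem with this version is that the file py_camera_model_ops.py is missing under waymo_open_dataset/camera/ops. Find the file on GitHub and copy it into the installed directory, which is usually under anaconda3/envs/[your env name]/lib/python3.7/site-packages.
For mmdet, just follow the official instructions. In a fresh environment, I recommend installing waymo first and the mmdet family of libraries afterwards.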

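Some pitfalls:
1. Installing 1.4.8 hits a build failure in one of the dependencies; the install fails and the environment is left with package conflicts as well.
2. Using pip's --user option installs packages under ~/.local.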
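To check whether the missing file mentioned above is actually absent from an installed package, a small standard-library check can help. This is only a sketch: the helper name find_package_file is mine, and the relative path is the one quoted in this post.

```python
import importlib.util
from pathlib import Path


def find_package_file(package, relative_path):
    """Return (path, exists) for a file inside an installed package.

    Returns None when the package cannot be found or is not a package.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    # A regular package has one search location: its own directory.
    package_dir = Path(list(spec.submodule_search_locations)[0])
    target = package_dir / relative_path
    return target, target.is_file()


if __name__ == "__main__":
    result = find_package_file("waymo_open_dataset",
                               "camera/ops/py_camera_model_ops.py")
    if result is None:
        print("waymo_open_dataset is not installed in this environment")
    else:
        path, exists = result
        print(f"{path}: {'present' if exists else 'missing, copy it here'}")
```

The printed path also shows whether the package landed in the conda environment or under ~/.local.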

dataset quick glance

waymo_format

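The data can be downloaded from the official Waymo website. After organizing it as the tutorial describes, the whole Waymo dataset is laid out as follows: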
mmdetection3d
├── mmdet3d
├── tools
├── configs
├── data
│ ├── waymo
│ │ ├── waymo_format
│ │ │ ├── training
│ │ │ ├── validation
│ │ │ ├── testing
│ │ │ ├── gt.bin
│ │ ├── kitti_format
│ │ │ ├── ImageSets

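Each folder, such as training, contains a number of .tfrecord files that store data frames in protobuf format. A file typically holds 100 to 200 frames; each frame contains 5 camera images, the labels of some gt boxes, and lidar information such as range_image. See dataset.proto in the project for details. As I understand it, gt.bin holds the point cloud information of the gt boxes; mmdet regenerates it, so it can be ignored.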
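To see how many frames a given .tfrecord holds without importing TensorFlow, the record framing can be walked directly. The sketch below assumes an uncompressed file (the converter opens them with compression_type='') and the standard TFRecord layout: an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. The CRCs are skipped, not verified, and the function name is my own.

```python
import os
import struct


def count_tfrecord_records(path):
    """Count records in an uncompressed TFRecord file via its length headers."""
    size = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64 little-endian payload length
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header")
            (length,) = struct.unpack("<Q", header)
            # Skip the length CRC (4 bytes), the payload, and the payload CRC (4 bytes).
            f.seek(4 + length + 4, 1)
            if f.tell() > size:
                raise ValueError("truncated record")
            count += 1
    return count
```

Calling count_tfrecord_records on one segment file gives the number of frames in it, which is a quick way to check the "100 to 200 frames" estimate on your own download.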

ImageSets holds the data frame indices for train/test/val. The tutorial says to download them from its link, but you can also generate them yourself, as described later.

kitti_format

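Following the tutorial, tools/create_data.py extracts the information from the .tfrecord files and stores it in KITTI format under kitti_format/, with the following file structure: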
│ │ ├── kitti_format
│ │ │ ├── ImageSets
│ │ │ ├── training
│ │ │ │ ├── calib
│ │ │ │ ├── image_0
│ │ │ │ ├── image_1
│ │ │ │ ├── image_2
│ │ │ │ ├── image_3
│ │ │ │ ├── image_4
│ │ │ │ ├── label_0
│ │ │ │ ├── label_1
│ │ │ │ ├── label_2
│ │ │ │ ├── label_3
│ │ │ │ ├── label_4
│ │ │ │ ├── label_all
│ │ │ │ ├── pose
│ │ │ │ ├── velodyne
│ │ │ ├── testing
│ │ │ │ ├── (the same as training)
│ │ │ ├── waymo_gt_database
│ │ │ ├── waymo_infos_trainval.pkl
│ │ │ ├── waymo_infos_train.pkl
│ │ │ ├── waymo_infos_val.pkl
│ │ │ ├── waymo_infos_test.pkl
│ │ │ ├── waymo_dbinfos_train.pkl
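Overall, create_data does three things:

1. It extracts the tfrecord information to generate the training and testing folders, putting different kinds of information into different subfolders. Validation data also goes into training, which matches the KITTI layout. For example, calib stores the camera intrinsics and extrinsics of each frame as txt files. Files are named ABBBCCC.txt, where A=[0,1,2] stands for [train,val,test], BBB is the index of the tfrecord file, and CCC is the index of the data frame within that tfrecord. A single calib folder therefore holds more than 100,000 .txt files.
2. Based on ImageSets, it pulls the information for each frame ABBBCCC out of the first step's output and aggregates it into train/test/val/trainval.pkl. A pkl is a binary-serialized list of dicts of the form [dict(), dict(), ...], where each dict holds all the information for one frame, such as labels and image_path.
3. From the lidar point clouds, it generates waymo_gt_database. This folder contains many .bin files, each holding the points inside one gt box. dbinfos_train.pkl stores the label of each gt_box and the path to its .bin point cloud file.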
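The ABBBCCC naming rule above is easy to encode and decode. The helpers below are a minimal sketch with names of my own choosing; they assume exactly one digit for the split and three digits each for the tfrecord index and the frame index.

```python
SPLIT_NAMES = {0: "train", 1: "val", 2: "test"}


def make_frame_id(prefix, file_idx, frame_idx):
    """Build the 7-digit ABBBCCC id from split prefix, tfrecord index, frame index."""
    return f"{prefix}{str(file_idx).zfill(3)}{str(frame_idx).zfill(3)}"


def parse_frame_id(frame_id):
    """Split an id such as '1049185' or '1049185.png' into (split, file_idx, frame_idx)."""
    stem = frame_id.split(".")[0]
    if len(stem) != 7 or not stem.isdigit():
        raise ValueError(f"expected 7 digits, got {frame_id!r}")
    prefix = int(stem[0])
    if prefix not in SPLIT_NAMES:
        raise ValueError(f"unknown split prefix in {frame_id!r}")
    return SPLIT_NAMES[prefix], int(stem[1:4]), int(stem[4:7])
```

For example, parse_frame_id("1049185.png") returns ("val", 49, 185): validation split, tfrecord number 49, frame number 185.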

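Some details:
The tfrecords take a bit over 1 TB; after conversion the data takes over 3 TB.
If frame ABBBCCC of image_i contains no gt box, label_i has no corresponding .txt file.
velodyne stores the point clouds.
Each line of label_0/ABBBCCC.txt stores the parameters of one gt box. In mmdet's original code the format is: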

# type + truncated  occluded  alpha?  2D bbox[l,t,r,b]  3D bbox[h,w,l,x,y,z,rot]
line = my_type + ' {} {} {} {} {} {} {} {} {} {} {} {} {} {}\n'.format(...)

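The .txt files in ImageSets are simply the sets of indices obtained by naming every frame with the ABBBCCC scheme, so you can produce them yourself with os.listdir.
The database is used for data augmentation in lidar baselines: it adds objects to a scene that were not originally there. A camera-only baseline does not use it, so that part of the code can be commented out when running.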
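Reading one line of a label file back is straightforward given the field order described above: a type name followed by 14 numbers. The parser below is a sketch with a function name of my own. It keeps the 2D box as four raw values, and it returns any trailing fields (label_all appears to carry an extra column) untouched in "extra".

```python
def parse_label_line(line):
    """Parse one KITTI-style label line into a dict of named fields."""
    parts = line.split()
    if len(parts) < 15:
        raise ValueError(f"expected at least 15 fields, got {len(parts)}")
    values = [float(v) for v in parts[1:15]]
    return {
        "type": parts[0],
        "truncated": values[0],
        "occluded": int(values[1]),
        "alpha": values[2],
        "bbox": values[3:7],          # 2D box, four values
        "dimensions": values[7:10],   # h, w, l
        "location": values[10:13],    # x, y, z
        "rotation_y": values[13],
        "extra": parts[15:],          # anything after the 15 standard fields
    }
```

This makes it easy to spot-check a few converted files before building the .pkl infos.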

Overview of the functions and how they relate

tools/create_data.py

def waymo_data_prep(...):
    # from tfrecord extracting
    splits = ['training', 'validation', 'testing']
    converter = waymo.Waymo2KITTI(...)
    converter.convert()
    # Note: after the step above finishes, the ImageSets files must be
    # generated by hand or downloaded.
    # Generate waymo infos
    kitti.create_waymo_info_file(
        out_dir, info_prefix, max_sweeps=max_sweeps, workers=workers)
    # gt database
    GTDatabaseCreater(
        'WaymoDataset',
        out_dir,
        info_prefix,
        f'{out_dir}/{info_prefix}_infos_train.pkl',
        relative_path=False,
        with_mask=False,
        num_worker=workers).create()

tools/data_converter/waymo_converter.py

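This file contains the Waymo2KITTI converter. Below are my unorganized notes on its functions; the key point is the conversion of type names. Note that lidar_list in the code seems to actually be a list of cameras, so I think the name is a mistake. Many label fields are also not extracted here, so this part will need changes if they are needed later.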

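class Waymo2KITTI:
    def __init__(self, ...):
        # There are five "lidars", still with no back view.
        # But these look like camera names to me??
        self.lidar_list = ['_FRONT', '_FRONT_RIGHT', '_FRONT_LEFT',
                           '_SIDE_RIGHT', '_SIDE_LEFT']
        # There are 5 labels: ['UNKNOWN', 'VEHICLE', 'PEDESTRIAN', 'SIGN', 'CYCLIST'],
        # mapped to KITTI as ['DontCare', 'Car', 'Pedestrian', 'Sign', 'Cyclist'].
        # Capitalization and label names probably matter; get them right in the config.
        self.tfrecord_pathnames = sorted(glob(join(self.load_dir, '*.tfrecord')))
        # The velodyne folder holds point clouds, not velocity. Velodyne is a lidar brand.
        self.image_save_dir = f'{self.save_dir}/image_'
        self.prefix = prefix  # 0/1/2, i.e. train/val/test

    def create_folder(self):
        # The test split has one fewer directory: no label dir.

    def convert(self):
        # What is len(self)? The class implements __len__, which returns
        # the number of tfrecord files.
        mmcv.track_parallel_progress(self.convert_one, range(len(self)),
                                     self.workers)
        # I do not know much about parallel programming; maybe ask someone?

    def convert_one(self, file_idx):
        # Converts the information in the file_idx-th tfrecord.
        # A class wrapped by tf; you can use it to look inside a tfrecord.
        dataset = tf.data.TFRecordDataset(pathname, compression_type='')
        for frame_idx, data in enumerate(dataset):
            # Each record is parsed into a Frame proto first.
            frame = dataset_pb2.Frame()
            frame.ParseFromString(bytearray(data.numpy()))
            self.save_image(frame, file_idx, frame_idx)
            # save_calib -> cart_to_homo
            # save_lidar -> convert_range_image_to_point_cloud
            # save_pose
            # save_label, only for training

    def save_image(self, frame, file_idx, frame_idx):
        for img in frame.images:
            img_path = f'{self.image_save_dir}{str(img.name - 1)}/' + \
                f'{self.prefix}{str(file_idx).zfill(3)}' + \
                f'{str(frame_idx).zfill(3)}.png'
            img = mmcv.imfrombytes(img.image)
            mmcv.imwrite(img, img_path)
        # This shows the naming rule after conversion to KITTI format.
        # Example: 1049185.png is 1-049-185.png. 1 is the prefix
        # (0 for train, 1 for val, 2 for test), 049 is file_idx (which tfrecord),
        # and 185 is frame_idx from convert_one. So a tfrecord never has more
        # frames than fit in 3 digits?
        # The leading path is image_0 and so on; img.name only takes values 1 to 5.

    def save_lidar(self, frame, file_idx, frame_idx):
        # Point clouds go under velodyne/, all as .bin files.
        # Skipped for now; I will look at the lidar data definitions later.

    def save_label(self, frame, file_idx, frame_idx):
        # A waymo -> kitti coordinate transform is needed, effectively from
        # lidar coordinates to camera coordinates. The box center has to
        # become the bottom-face center, and the rotation changes too.
        # Pay close attention to the int-to-string label mapping:
        #   enum Type {
        #     TYPE_UNKNOWN = 0;
        #     TYPE_VEHICLE = 1;
        #     TYPE_PEDESTRIAN = 2;
        #     TYPE_SIGN = 3;
        #     TYPE_CYCLIST = 4;
        #   }
        # This produces label_0..label_4 and label_all; check how they differ.
        # label_all follows the same naming rule as before.
        # First the 2D boxes are extracted by enumerating projected_lidar_labels,
        # i.e. the boxes projected onto each camera's image. Only the cam id,
        # object id and the top-left and bottom-right corners of the 2D box are
        # extracted. Does the camera label carry more fields, such as difficulty?
        # A lot of the label content seems unused, and the code also helpfully
        # drops signs for me...
        # I would love to print a whole frame to see all the data, but it is
        # several hundred MB.
        # type + truncated  occluded  alpha?  2D bbox[l,t,r,b]  3D bbox[h,w,l,x,y,z,rot]
        line = my_type + ' {} {} {} {} {} {} {} {} {} {} {} {} {} {}\n'.format(...)

    def save_pose(self, ...):
        pose = np.array(frame.pose.transform).reshape(4, 4)
        # The frame vehicle pose defines the coordinate system which
        # the 3D laser labels are defined in.
        # Looks like the transform from lidar coordinates to global coordinates?
        # (Ask Xuanyao?)

    def save_calib(self, ...):
        T_cam_to_vehicle = np.array(camera.extrinsic.transform).reshape(4, 4)  # cam to ego
        T_vehicle_to_cam = np.linalg.inv(T_cam_to_vehicle)
        # waymo front camera to kitti reference camera
        T_front_cam_to_ref = np.array([[0.0, -1.0, 0.0], [0.0, 0.0, -1.0],
                                       [1.0, 0.0, 0.0]])
        # Tr_velo_to_cam seems to be the ego-to-camera transform in the KITTI
        # camera convention; the extrinsic has to be inverted like this before
        # detr3d can project points into the image.
        Tr_velo_to_cam = self.cart_to_homo(T_front_cam_to_ref) @ T_vehicle_to_cam
        camera_calib = camera.intrinsic  # the intrinsics get reordered
        # Not sure what R0_rect is for here (in KITTI it is the rectifying
        # rotation of the reference camera).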
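To convince myself what Tr_velo_to_cam does, here is a small NumPy sketch of the same composition. It is not the converter's code: the toy extrinsic (a camera 1.5 m forward and 2 m up, with no rotation) is made up for illustration, and the axis conventions in the comments are my reading of the Waymo and KITTI camera frames.

```python
import numpy as np

# Axis permutation from the converter notes: Waymo camera axes
# (x forward, y left, z up) to KITTI reference camera axes
# (x right, y down, z forward).
T_FRONT_CAM_TO_REF = np.array([[0.0, -1.0, 0.0],
                               [0.0, 0.0, -1.0],
                               [1.0, 0.0, 0.0]])


def cart_to_homo(mat):
    """Embed a 3x3 or 3x4 matrix into a 4x4 homogeneous matrix."""
    ret = np.eye(4)
    rows, cols = mat.shape
    ret[:rows, :cols] = mat
    return ret


def velo_to_cam(T_cam_to_vehicle):
    """Vehicle-frame point -> KITTI-style camera-frame point."""
    T_vehicle_to_cam = np.linalg.inv(T_cam_to_vehicle)
    return cart_to_homo(T_FRONT_CAM_TO_REF) @ T_vehicle_to_cam


# Toy extrinsic (assumption): pure translation, no rotation.
T_cam_to_vehicle = np.eye(4)
T_cam_to_vehicle[:3, 3] = [1.5, 0.0, 2.0]

Tr = velo_to_cam(T_cam_to_vehicle)
point_vehicle = np.array([11.5, 2.0, 1.0, 1.0])  # homogeneous point in the vehicle frame
point_cam = Tr @ point_vehicle
print(point_cam)  # x right, y down, z forward
```

With this toy setup the point sits 10 m in front of the camera, 2 m to its left and 1 m below it, so the result has z = 10 (depth), x = -2 (left of center) and y = 1 (below center).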

A function that generates ImageSets automatically

I wrote this myself; it goes into create_data.py.

import os
import os.path as osp

def create_ImageSets_img_ids(root_dir):
    save_dir = osp.join(root_dir, 'ImageSets/')
    if not osp.exists(save_dir):
        os.mkdir(save_dir)
    # calib/ has one .txt per frame, so its file names enumerate all frame ids.
    load_01 = osp.join(root_dir, 'training/calib')
    load_2 = osp.join(root_dir, 'testing/calib')
    RawNames = os.listdir(load_01) + os.listdir(load_2)
    split = [[], [], []]  # train, val, test
    for name in RawNames:
        if name.endswith('.txt'):
            idx = name.replace('.txt', '\n')
            # The first digit of ABBBCCC is the split prefix.
            split[int(idx[0])].append(idx)
    for i in range(3):
        split[i].sort()
    outputs = {
        'train.txt': split[0],
        'val.txt': split[1],
        'trainval.txt': split[0] + split[1],
        'test.txt': split[2],
    }
    for fname, ids in outputs.items():
        with open(osp.join(save_dir, fname), 'w') as f:
            f.writelines(ids)

tools/data_converter/kitti_converter.py

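This file generates the .pkl files. get_waymo_image_info reads the information for each frame from the file system, and _calculate_num_points_in_gt counts how many lidar points fall inside each gt_box, for later use.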

def create_waymo_info_file(...):
    imageset_folder = Path(data_path) / 'ImageSets'
    train_img_ids = _read_imageset_file(str(imageset_folder / 'train.txt'))
    # So this split file could also be obtained directly with ls, just slowly.
    # Simply generate it yourself.
    waymo_infos_train = get_waymo_image_info(  # gathers the scattered files into the pkl,
        data_path,                             # including img and lidar paths and labels
        training=True,
        velodyne=True,
        calib=True,
        pose=True,
        image_ids=train_img_ids,
        relative_path=relative_path,
        max_sweeps=max_sweeps)
    _calculate_num_points_in_gt(
        data_path,
        waymo_infos_train,
        relative_path,
        num_features=6,
        remove_outside=False)
    # _calculate_num_points_in_gt adds one more field to info[0~num_frame][annos]:
    # annos['num_points_in_gt'], shape [M], the number of lidar points in each box.
    filename = save_path / f'{pkl_prefix}_infos_train.pkl'
    print(f'Waymo info train file is saved to {filename}')
    mmcv.dump(waymo_infos_train, filename)
    # I wondered whether this is text or binary: with a .pkl extension
    # it is a binary pickle.
    # The above is train; val and test are similar. The val step additionally
    # produces trainval, i.e. the train and val infos concatenated.

tools/data_converter/kitti_data_utils.py

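This file contains the function get_waymo_image_info. The main thing to look at is the conversion from the on-disk format to the pkl format, which is rather tedious. This is a second conversion, and the .pkl loses some information compared with kitti_format/training. For example, it only stores the path of the front camera; to use the other cameras, the downstream dataloader has to be modified, as discussed later. The code below is pasted for my own reference.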

def get_waymo_image_info(path, training=True, label_info=True, velodyne=False,
                         calib=False, pose=False, image_ids=7481,
                         extend_matrix=True, num_worker=8, relative_path=True,
                         with_imageshape=True, max_sweeps=5):
    # Note that only 8 workers are started here. Is the GPU used at all?
    # The pkl can be inspected with mmcv.load.
    # One pkl frame holds points, image[path, idx, shape],
    # point_cloud[path, num_feat], calib, and annos[num_gt gt labels].
    with futures.ThreadPoolExecutor(num_worker) as executor:
        image_infos = executor.map(map_func, image_ids)
    # Parallelism done with a thread pool like this should not involve the GPU.
    # map_func processes a single image_id; image_infos should be a list of dicts.

A single info is built as follows:
    pc_info = dict(
        velodyne: path/to/0-000-000.bin
        num_feat: 6)  # x, y, z, refl, ?, timestamp
    image_info = dict(
        image_path: path of image_0, a single camera only
        image_shape: read the image at image_path to get its shape, so only img_0's shape)
    label_info = dict(label_path: path of label_all)
    calib_info = dict(
        Pi: cam_intrinsic_i (i=0 to 4)     [4x4]
        R0_rect: R0_rect                   [4x4]
        Tr_velo_to_cam: Tr_velo_to_cam_0   [3x4])
    # Note that only cam0's extrinsic is read here. Are the other cameras not needed?
    # Indeed the other cameras' extrinsics are missing, but transfusion reads them additionally:
    # https://github.com/XuyangBai/TransFusion/blob/master/mmdet3d/datasets/waymo_dataset.py#L144
    pose = np.loadtxt(pose_path)
    annotations = get_label_anno()
    annotations['cam_id'] = annotations.pop('score')  # score simply becomes cam_id? Looks like a leftover of the KITTI setting.
    add_difficulty_to_annos(annotations)  # Difficulty is presumably computed by KITTI's rules, which ignores Waymo's entirely...
    max_sweep = 5
    # sweeps stores the data of the previous 5 velodyne frames.
    prev_info = dict(velodyne_path, timestamp, pose)
    sweeps = list of prev_info
    # What are sweeps for? Presumably so that history can be used when needed.
    points = the point cloud read from xxx.bin, directly with np.fromfile,
             so the file holds N*num_feat floats
    info = dict(
        timestamp: the timestamp of the first point in points
        image: image_infos
        point_cloud: pc_info
        calib: calib_info
        pose: pose
        annos: annotations
        sweeps: sweeps)
    return info

def get_label_anno():
    content = [line.strip().split(' ') for line in lines]  # [[], [], []]
    num_objects = len([x[0] for x in content if x[0] != 'DontCare'])
    num_gt = len(annotations['name'])
    index = list(range(num_objects)) + [-1] * (num_gt - num_objects)
    # annotations = dict(name, truncated, occluded, alpha, bbox, dim, loc, rot, score),
    # e.g. name = [car, car, car, ped, ped, cyc, car, ...]: everything read from
    # label_all/xxx.txt, each entry being a list over objects.
    # In addition:
    # index: I did not understand why the trailing indices are set to -1 while
    #        the leading ones are just range(). But DontCare is already filtered
    #        out in the waymo converter, so it makes no difference either way.
    # group_ids: range(num_gt)
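A quick way to check what actually ended up in an infos file is to load it and summarize a few frames. The sketch below uses plain pickle from the standard library instead of mmcv.load, assuming the file is an ordinary pickle of a list of dicts with the 'image' and 'annos' keys described in the notes above. The function name is my own, and unpickling a real infos file will also need NumPy installed, since the infos contain arrays.

```python
import pickle


def summarize_infos(pkl_path, max_frames=3):
    """Return (number of frames, per-frame summaries) for an infos .pkl file."""
    with open(pkl_path, "rb") as f:
        infos = pickle.load(f)
    summary = []
    for info in infos[:max_frames]:
        annos = info.get("annos") or {}  # frames without labels may have no annos
        summary.append({
            "keys": sorted(info.keys()),
            "image_path": info.get("image", {}).get("image_path"),
            "num_gt": len(annos.get("name", [])),
        })
    return len(infos), summary
```

For instance, summarize_infos("data/waymo/kitti_format/waymo_infos_val.pkl") would show at a glance that only one image path per frame is stored.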

tools/data_converter/create_gt_database.py

Not very important.


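class GTDatabaseCreater():
    # It does not seem to use Waymo's official gt.bin; it only uses what was just converted.
    def create_groundtruth_database():
        # It uses the dataloader and pipeline classes directly. I still have not
        # figured these out and will need to look at them properly tomorrow.
        # I am not sure how the pkl keys are turned into dataset keys.
        # The annotation loading step has bbox_3d and label_3d, but anno_info
        # does not call them bbox, so I do not know how they are converted.
        # It only reads points and annos, and generates dbinfos_train.pkl and gt_database/.
        # It seems to use only the information in infos_train.pkl.
        # Note that each object gets its own bin, named image_idx_TYPE_i.bin,
        # where i means the i-th object in image image_idx and TYPE is the class.
        # The content is the points inside the gt_box.
        # Other box information such as name, the bin path, image_idx, bbox3d and
        # 'num_points_in_gt' goes into db_info.
        # The db_info entries are collected into a list and saved to dbinfos_train.pkl.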
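The per-object file naming described in these notes can be written out in a couple of lines. This is only an illustration of the image_idx_TYPE_i.bin pattern with a helper name of my own, not the database creator's code.

```python
def gt_database_filenames(image_idx, names):
    """One .bin file name per object: <image_idx>_<TYPE>_<i>.bin, i = object index in the frame."""
    return [f"{image_idx}_{name}_{i}.bin" for i, name in enumerate(names)]


print(gt_database_filenames(1049185, ["Car", "Pedestrian", "Car"]))
```

Note that i counts objects within the frame, not within the class, so the second Car above gets index 2.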

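Modifying the dataloader

With the official create_data, the .pkl misses some information; for example, it only stores the labels of image_0 and not those of image_1. This has to be handled by modifying the dataloader. I copied TransFusion's code directly; some of the key differences are: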
transfusion waymo_dataset_line139
transfusion waymo_dataset_line146
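For how to use this dataloader, see the official mmdet3d tutorial on custom datasets.
Pitfall: the dataset must be registered in mmdet3d; at first I accidentally registered it in mmdet.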
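Why the registration target matters can be shown with a toy registry. The ToyRegistry class below is a stand-in I wrote for illustration, not mmcv's actual Registry, and the two registry objects and the CustomWaymoDataset class are hypothetical; the real lookup rules also differ between mmdet3d versions.

```python
class ToyRegistry:
    """A tiny stand-in for a framework registry (not mmcv's actual class)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self):
        def decorator(cls):
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def build(self, cfg):
        cfg = dict(cfg)
        type_name = cfg.pop("type")
        if type_name not in self._modules:
            raise KeyError(f"{type_name} is not in the {self.name} registry")
        return self._modules[type_name](**cfg)


MMDET_DATASETS = ToyRegistry("mmdet dataset")
MMDET3D_DATASETS = ToyRegistry("mmdet3d dataset")


@MMDET_DATASETS.register_module()  # registered in the wrong registry
class CustomWaymoDataset:
    def __init__(self, data_root):
        self.data_root = data_root


cfg = dict(type="CustomWaymoDataset", data_root="data/waymo/kitti_format")
try:
    MMDET3D_DATASETS.build(cfg)
    lookup_error = None
except KeyError as err:
    lookup_error = str(err)
print(lookup_error)
```

The config names the class only by its type string, so a builder that looks in a different registry than the one the class was added to cannot find it.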

Modifying the config

This was also adapted with TransFusion as a reference.

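Modifying the target

Targets differ between datasets. If a detector's implementation has no target set up for Waymo, it needs to be modified. For example, the detector no longer needs to predict velocity, so the target dimension shrinks by 2.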
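As a concrete picture of "target dim minus 2", the sketch below drops two trailing velocity entries from a box target. The 10-value layout in the comment is an assumption for illustration (a nuScenes-style encoding with vx, vy last); check the actual ordering in the detector you are modifying.

```python
def drop_velocity(target, velocity_dims=2):
    """Return the box target without its trailing velocity entries."""
    if len(target) <= velocity_dims:
        raise ValueError("target is too short to contain velocity entries")
    return list(target[:-velocity_dims])


# Hypothetical 10-dim target: (cx, cy, w, l, cz, h, sin(yaw), cos(yaw), vx, vy)
target_with_velocity = [1.0, 2.0, 1.9, 4.5, 0.5, 1.6, 0.0, 1.0, 3.2, -0.1]
target_waymo = drop_velocity(target_with_velocity)
print(len(target_with_velocity), "->", len(target_waymo))
```

Any per-dimension settings tied to the target length, such as loss weights per code dimension, would presumably need the same two entries removed.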

The eval pipeline is still being debugged. I will summarize the eval flow and how to make a submission later.

Waymo dataset processing results

/waymo_format$ du -h
192G    ./validation
759G    ./training
27G     ./testing
977G    .

/kitti_format$ du -h
3.2M    ./ImageSets
1.1G    ./training/label_all
781M    ./training/calib
767G    ./training/velodyne
781M    ./training/pose
781M    ./training/timestamp
855M    ./training/label_0
581M    ./training/label_1
498M    ./training/label_2
504M    ./training/label_3
408M    ./training/label_4
568G    ./training/image_0
579G    ./training/image_1
592G    ./training/image_2
392G    ./training/image_3
400G    ./training/image_4
3.3T    ./training
63M     ./testing/calib
584K    ./testing/velodyne
63M     ./testing/pose
63M     ./testing/timestamp
47G     ./testing/image_0
47G     ./testing/image_1
46G     ./testing/image_2
32G     ./testing/image_3
31G     ./testing/image_4
200G    ./testing
81G     ./waymo_gt_database
3.6T    .