Problems running SlowFast on multiple GPUs

Problem description:

Running the following on server 57:

python tools/run_net.py --cfg configs/Kinetics/X3D_XS.yaml NUM_GPUS 4 TRAIN.BATCH_SIZE 8 SOLVER.BASE_LR 0.0125 DATA.PATH_TO_DATA_DIR /mnt/data/geguojing/Heart_data/annotations20201112

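This problem is still unresolved: server 57 is constantly occupied by other jobs, so I cannot debug on it.

The error that appears on server 185 is:

I compared my own code against the following test code: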

import pdb

import torch
import utils.distributed as dist
from torch.utils.data._utils.collate import default_collate
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data.sampler import RandomSampler


def manu_collate(batch):
    # Stack the image tensors, but keep the bounding boxes as a plain list.
    inputs, bbox = zip(*batch)
    inputs = default_collate(inputs)
    collate_bbox = [i[0] for i in bbox]
    return inputs, collate_bbox


class AVA_DATA(torch.utils.data.Dataset):
    # Dummy map-style dataset: 16 constant images, each tagged with its index.
    def __init__(self):
        print('heihei')

    def __getitem__(self, index):
        return torch.ones(3, 224, 224) * index, [index]

    def __len__(self):
        return 16


class AVA_MODEL(torch.nn.Module):
    def __init__(self):
        super(AVA_MODEL, self).__init__()
        self.conv1 = torch.nn.Conv2d(3, 64, 3, 1)
        # Replaced with the gathered list in train() before the first forward pass.
        self.tmp = None
        # for i in range(20):
        #     self.tmp[i] = 0

    def forward(self, x, bbox, tmp):
        # print(x.device.index)
        self.tmp.append(1)
        if x.device.index == 0:
            print(self.tmp)
        # tmp.append(1)
        # print(tmp)
        # if 1:
        #     print('***start***')
        #     print(self.tmp)
        #     for j in bbox:
        #         self.tmp[j] = j
        #     print('***end***')
        #     print(bbox, self.tmp)
        # print(len(bbox))
        res = self.conv1(x)
        return res


def run(local_rank, num_proc, func, init_method, shard_id, num_shards, backend):
    # Initialize the process group.
    world_size = num_proc * num_shards
    rank = shard_id * num_proc + local_rank
    try:
        torch.distributed.init_process_group(
            backend=backend,
            init_method=init_method,
            world_size=world_size,
            rank=rank,
        )
    except Exception as e:
        raise e
    torch.cuda.set_device(local_rank)
    func()


def launch_job(func, init_method='tcp://localhost:9999', daemon=False, dis=True):
    """
    Run 'func' on one or more GPUs.
    Args:
        func (function): job to run on GPU(s).
        init_method (str): initialization method to launch the job with
            multiple devices.
        daemon (bool): the spawned processes' daemon flag. If set to True,
            daemonic processes will be created.
        dis (bool): if True, spawn distributed workers; otherwise call
            'func' directly in this process.
    """
    if dis:
        # Hard-coded for this test: 4 processes, shard 0 of 1, NCCL backend.
        torch.multiprocessing.spawn(
            run,
            nprocs=4,
            args=(4, func, init_method, 0, 1, 'nccl'),
            daemon=daemon,
        )
    else:
        func()


def train():
    model = AVA_MODEL()
    cur_device = torch.cuda.current_device()
    model = model.cuda(device=cur_device)
    # model = torch.nn.DataParallel(model).cuda()
    model = torch.nn.parallel.DistributedDataParallel(
        module=model, device_ids=[cur_device], output_device=cur_device
    )
    dataset = AVA_DATA()
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=1,
        shuffle=False,
        sampler=DistributedSampler(dataset),
        # sampler=None,
        num_workers=8,
        pin_memory=True,
        drop_last=True,
        collate_fn=manu_collate,
        # worker_init_fn=None,
    )
    tmp = []
    model.train()
    for i in range(200000000):
        shuffle_dataset(dataloader, i)
        for cur_iter, (inputs, bbox) in enumerate(dataloader):
            # print(i, cur_iter)
            # inputs = inputs.cuda(non_blocking=True)
            # res = model(inputs, bbox, tmp)
            for j in bbox:
                tmp.append(j)
        # Collect the per-rank index lists on every process, then stop after one pass.
        dist.synchronize()
        tmp = dist.all_gather_unaligned(tmp)
        if dist.get_rank() == 0:
            print(cur_iter, tmp)
        break
    model.module.tmp = tmp
    for i in range(2):
        for cur_iter, (inputs, bbox) in enumerate(dataloader):
            # print(i, cur_iter)
            inputs = inputs.cuda(non_blocking=True)
            res = model(inputs, bbox, tmp)
            # tmp.append(1)


def shuffle_dataset(loader, cur_epoch):
    """
    Shuffles the data.
    Args:
        loader (loader): data loader to perform shuffle.
        cur_epoch (int): number of the current epoch.
    """
    # sampler = (
    #     loader.batch_sampler.sampler
    #     if isinstance(loader.batch_sampler, ShortCycleBatchSampler)
    #     else loader.sampler
    # )
    # assert isinstance(
    #     sampler, (RandomSampler, DistributedSampler)
    # ), "Sampler type '{}' not supported".format(type(sampler))
    # RandomSampler handles shuffling automatically
    sampler = loader.sampler
    if isinstance(sampler, DistributedSampler):
        # DistributedSampler shuffles data based on epoch
        sampler.set_epoch(cur_epoch)


if __name__ == "__main__":
    tmp = dict()
    pdb.set_trace()  # pdb breakpoint used while debugging
    for i in range(20):
        tmp[i] = 0
    pdb.set_trace()  # pdb breakpoint used while debugging
    launch_job(train)
    # train(tmp)

I found that init_method was never defined in my program.

So I gave init_method a value: tcp://localhost:9999
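A malformed init_method string is easy to miss, for example when a full-width colon slips in between the host and the port. The following is a minimal sketch of a sanity check that could run before the value is handed to the process-group setup. The helper name `parse_tcp_init_method` is hypothetical and not part of SlowFast or PyTorch; it relies only on the Python standard library and assumes the address has the form tcp://host:port.

```python
from urllib.parse import urlsplit


def parse_tcp_init_method(init_method):
    """Split 'tcp://host:port' into (host, port); raise ValueError if malformed."""
    try:
        parts = urlsplit(init_method)
        port = parts.port  # raises ValueError when the port is not a valid number
    except ValueError as err:
        raise ValueError(f"malformed init_method {init_method!r}: {err}") from err
    if parts.scheme != "tcp" or not parts.hostname or port is None:
        raise ValueError(f"expected tcp://host:port, got {init_method!r}")
    return parts.hostname, port


if __name__ == "__main__":
    # Hypothetical usage: validate the address before launching workers.
    host, port = parse_tcp_init_method("tcp://localhost:9999")
    print(host, port)
```

Failing early in the parent process gives a readable error message, instead of a failure or hang inside each spawned worker.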

Next, I checked each function definition for similarities to and differences from the test code:

cfg.SHARD_ID is 0

cfg.NUM_SHARDS is 1

cfg.DIST_BACKEND is nccl
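The shard settings matter because the `run` function in the test code derives the world size and the global rank of each process from them. Below is a minimal sketch of that arithmetic, using the same two formulas as `run`. The helper name `compute_rank` and its range checks are assumptions added for illustration only.

```python
def compute_rank(local_rank, num_proc, shard_id=0, num_shards=1):
    """Return (world_size, rank) using the same formulas as run()."""
    if not 0 <= local_rank < num_proc:
        raise ValueError("local_rank must be in [0, num_proc)")
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    world_size = num_proc * num_shards
    rank = shard_id * num_proc + local_rank
    return world_size, rank


if __name__ == "__main__":
    # One shard with 4 GPUs: the global rank equals the local rank.
    for local_rank in range(4):
        print(compute_rank(local_rank, num_proc=4))
```

With a shard ID of 0 and a single shard, the world size is just the number of GPUs on the machine, so a mismatch between these values and the number of spawned processes is one thing worth ruling out.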

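I do not know whether changing this setting alone would fix the nccl problem on server 57.

I then stepped into run and added print statements.

In summary: it started working, and I do not know which change fixed it.

Removing pdb made it work.

It is better to debug with print output instead.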
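As one way to do print-based debugging across several processes, here is a minimal sketch of a rank-tagged print helper. The names `format_rank_message` and `rank_print` are hypothetical. In the real script the rank would come from something like `dist.get_rank()`; here it is passed in explicitly so that the sketch runs without any GPU.

```python
import sys


def format_rank_message(rank, world_size, *parts):
    """Prefix a message with the process rank so interleaved output stays readable."""
    text = " ".join(str(p) for p in parts)
    return f"[rank {rank}/{world_size}] {text}"


def rank_print(rank, world_size, *parts, stream=None):
    """Print and flush immediately so output is not stuck in a buffer if a worker hangs."""
    stream = sys.stdout if stream is None else stream
    print(format_rank_message(rank, world_size, *parts), file=stream, flush=True)


if __name__ == "__main__":
    # Hypothetical usage at the top of run().
    rank_print(0, 4, "entered run, backend =", "nccl")
```

Unlike an interactive breakpoint, a flushed print does not wait for keyboard input, so it works the same way in every spawned process.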
