1. Overview

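PySlowFast is a video-understanding codebase recently open-sourced by Facebook. It contains implementations of several strong papers, including SlowFast, X3D and I3D. The project is hosted on GitHub. I have recently been using it for video action recognition, so this post covers some of the issues that come up when applying PySlowFast, and mainly walks through how the code runs. After reading the relevant papers, I found that X3D is one of the best models in the project in terms of both accuracy and speed, so I chose it for action recognition.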

2. A walkthrough of PySlowFast

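The following reads through the PySlowFast framework alongside its code. Because my time on the project was limited, I only studied the training part; I have not read the testing and demo parts in detail.
I stepped through the code with the debug mode in VSCode. This makes it easy to follow the execution path, run arbitrary commands at any point, and inspect the attributes of any variable.
Once the project is installed, it can be run with the following command: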

python tools/run_net.py --cfg {config file}

Since I mainly train the X3D network, I replace the argument after --cfg with configs/Kinetics/X3D_M.yaml.
Below is my X3D_M.yaml file:

TRAIN:
  ENABLE: True # default True
  DATASET: kinetics
  BATCH_SIZE: 8
  #BATCH_SIZE: 16
  EVAL_PERIOD: 1
  CHECKPOINT_PERIOD: 1
  AUTO_RESUME: False
  CHECKPOINT_FILE_PATH: checkpoints/AVA_RWF_pretrain/checkpoint_epoch_00004.pyth
  #CHECKPOINT_FILE_PATH: model/x3d_m.pyth
  CHECKPOINT_EPOCH_RESET: True
X3D:
  WIDTH_FACTOR: 2.0
  DEPTH_FACTOR: 2.2
  BOTTLENECK_FACTOR: 2.25
  DIM_C5: 2048
  DIM_C1: 12
TEST:
  ENABLE: False
  DATASET: kinetics
  BATCH_SIZE: 64
  # CHECKPOINT_FILE_PATH: 'x3d_m.pyth' # 76.21% top1 30-view accuracy to download from the model zoo (optional).
  NUM_SPATIAL_CROPS: 1
  #NUM_SPATIAL_CROPS: 3
  CHECKPOINT_FILE_PATH: checkpoints/AVA_RWF_pretrain/checkpoint_epoch_00004.pyth
DATA:
  PATH_TO_DATA_DIR:
  NUM_FRAMES: 16
  SAMPLING_RATE: 5
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  # TEST_CROP_SIZE: 224 # use if TEST.NUM_SPATIAL_CROPS: 1
  TEST_CROP_SIZE: 256 # use if TEST.NUM_SPATIAL_CROPS: 3
  INPUT_CHANNEL_NUM: [3]
  DECODING_BACKEND: pyav
  #DECODING_BACKEND: torchvision
RESNET:
  ZERO_INIT_FINAL_BN: True
  TRANS_FUNC: x3d_transform
  STRIDE_1X1: False
BN:
  USE_PRECISE_STATS: True
  NUM_BATCHES_PRECISE: 200
  WEIGHT_DECAY: 0.0
SOLVER:
  BASE_LR: 0.00002 # 1 machine
  BASE_LR_SCALE_NUM_SHARDS: True
  LR_POLICY: cosine
  MAX_EPOCH: 10
  WEIGHT_DECAY: 5e-5
  #WARMUP_EPOCHS: 35.0
  #WARMUP_START_LR: 0.01
  #OPTIMIZING_METHOD: sgd
  OPTIMIZING_METHOD: adam
MODEL:
  NUM_CLASSES: 2
  ARCH: x3d
  MODEL_NAME: X3D
  LOSS_FUNC: cross_entropy
  DROPOUT_RATE: 0.5
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 1
RNG_SEED: 0
OUTPUT_DIR: .
DEMO:
  ENABLE: True
  LABEL_FILE_PATH: # Path to json file providing class_name - id mapping.
  INPUT_VIDEO:
  OUTPUT_FILE: # Path to output video file to write results to.
  # Leave an empty string if you would like to display results to a window.
  THREAD_ENABLE: False # Run video reader/writer in the background with multi-threading.
  NUM_VIS_INSTANCES: 1 # Number of CPU(s)/processes use to run video visualizer.
  NUM_CLIPS_SKIP: 0 # Number of clips to skip prediction/visualization
  # (mostly to smoothen/improve display quality with webcam input).

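The config file holds most of the settings needed to run the code, including hyperparameters, data paths and checkpoint paths.
The most important setting is what the run is for: training, testing or a demo. These correspond to the ENABLE option under TRAIN, TEST and DEMO respectively. An item set to True is run and an item set to False is skipped. If several items are set to True, they are executed one after another. For example, if ENABLE is True under both TEST and DEMO, testing runs first and the demo is produced afterwards.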
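
The ordering rule described above can be pictured as a small dispatcher. This is a minimal sketch rather than the actual contents of tools/run_net.py: the plain-dict config and the `enabled_jobs` helper are hypothetical stand-ins, and the sketch only models the TRAIN, TEST, DEMO order described in the text.

```python
# Hypothetical illustration of "run every enabled job in a fixed order".
# The real project uses a config object; a plain dict stands in for it here.
JOB_ORDER = ("TRAIN", "TEST", "DEMO")


def enabled_jobs(cfg):
    """Return the names of the enabled jobs in execution order."""
    return [name for name in JOB_ORDER if cfg.get(name, {}).get("ENABLE", False)]


cfg = {
    "TRAIN": {"ENABLE": False},
    "TEST": {"ENABLE": True},
    "DEMO": {"ENABLE": True},
}
print(enabled_jobs(cfg))  # ['TEST', 'DEMO']
```

A section that is missing from the config is treated as disabled in this sketch.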

After the command is issued, the program executes:

launch_job(cfg=cfg, init_method=args.init_method, func=train)

The program then jumps into the train function in tools/train_net.py, which contains everything needed to train the model.

def train(cfg):
    du.init_distributed_training(cfg)
    # Set random seed from configs.
    np.random.seed(cfg.RNG_SEED)
    torch.manual_seed(cfg.RNG_SEED)

    # Setup logging format.
    logging.setup_logging(cfg.OUTPUT_DIR)

    # Init multigrid.
    multigrid = None
    if cfg.MULTIGRID.LONG_CYCLE or cfg.MULTIGRID.SHORT_CYCLE:
        multigrid = MultigridSchedule()
        cfg = multigrid.init_multigrid(cfg)
        if cfg.MULTIGRID.LONG_CYCLE:
            cfg, _ = multigrid.update_long_cycle(cfg, cur_epoch=0)
    # Print config.
    logger.info("Train with config:")
    logger.info(pprint.pformat(cfg))

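This part of the code does the initialization before training: setting up distributed training, fixing the random seeds, setting up logging, and so on. Multigrid is disabled by default in the config file, so I did not look into what it does.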

model = build_model(cfg)

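This line clearly creates the model from the config file, so it is worth looking closely at how the model is built.
build_model takes cfg as its argument, so any other model can be built with this same single line.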

def build_model(cfg, gpu_id=None):
    # ... (omitted)
    name = cfg.MODEL.MODEL_NAME
    model = MODEL_REGISTRY.get(name)(cfg)
    # ... (omitted)

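The core of this function is simple: it reads the model name from the config file, looks up the constructor that was registered in advance under that name, and calls it to build the model.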
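
The lookup-by-name idea can be sketched with a tiny registry. This is only an illustration of the pattern: the `Registry` class below is a hypothetical stand-in written for this example, the toy `X3D` class takes a plain dict rather than the real config object, and the real project's registry utility may behave differently.

```python
class Registry:
    """Tiny name -> constructor registry (illustrative stand-in)."""

    def __init__(self):
        self._items = {}

    def register(self):
        """Return a decorator that stores a class under its own name."""
        def decorator(obj):
            self._items[obj.__name__] = obj
            return obj
        return decorator

    def get(self, name):
        if name not in self._items:
            raise KeyError(f"'{name}' is not registered")
        return self._items[name]


MODEL_REGISTRY = Registry()


@MODEL_REGISTRY.register()
class X3D:
    """Toy model class; only records the number of classes."""

    def __init__(self, cfg):
        self.num_classes = cfg["NUM_CLASSES"]


def build_model(cfg):
    """Look up the constructor by name, then call it with the config."""
    return MODEL_REGISTRY.get(cfg["MODEL_NAME"])(cfg)


model = build_model({"MODEL_NAME": "X3D", "NUM_CLASSES": 2})
print(type(model).__name__, model.num_classes)  # X3D 2
```

The benefit of this design is that adding a new model only requires registering a new class; build_model itself never changes.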
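Let us step into that constructor and see how the model is built.
The next file we enter is slowfast/models/video_model_builder.py. It contains the constructors for many models, including SlowFast, ResNet, X3D and MViT. The project supports other models as well, such as I3D, but I could not find a dedicated class for it here, so it is probably declared in some other way.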

2.1 Building the X3D model

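Looking at the X3D class in video_model_builder.py, all of its main components are initialized in __init__, and forward simply runs each module in turn. Reading __init__ is therefore enough to get a good picture of how the model is built.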

class X3D(nn.Module):
    def __init__(self, cfg):
        super(X3D, self).__init__()
        self.norm_module = get_norm(cfg)
        self.enable_detection = cfg.DETECTION.ENABLE
        self.num_pathways = 1

        exp_stage = 2.0
        self.dim_c1 = cfg.X3D.DIM_C1

        self.dim_res2 = (
            round_width(self.dim_c1, exp_stage, divisor=8)
            if cfg.X3D.SCALE_RES2
            else self.dim_c1
        )
        self.dim_res3 = round_width(self.dim_res2, exp_stage, divisor=8)
        self.dim_res4 = round_width(self.dim_res3, exp_stage, divisor=8)
        self.dim_res5 = round_width(self.dim_res4, exp_stage, divisor=8)

        self.block_basis = [
            # blocks, c, stride
            [1, self.dim_res2, 2],
            [2, self.dim_res3, 2],
            [5, self.dim_res4, 2],
            [3, self.dim_res5, 2],
        ]
        self._construct_network(cfg)
        init_helper.init_weights(
            self, cfg.MODEL.FC_INIT_STD, cfg.RESNET.ZERO_INIT_FINAL_BN
        )

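The initializer mostly defines variables. These are the hyperparameters of the blocks that make up the model, such as strides and channel counts. Once they are defined, the code calls _construct_network to build the model and init_helper.init_weights to initialize the parameters, so the first of these two is the one to read.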

def _construct_network(self, cfg):
    # ... (omitted)
    self.s1 = stem_helper.VideoModelStem(
        dim_in=cfg.DATA.INPUT_CHANNEL_NUM,
        dim_out=[dim_res1],
        kernel=[temp_kernel[0][0] + [3, 3]],
        stride=[[1, 2, 2]],
        padding=[[temp_kernel[0][0][0] // 2, 1, 1]],
        norm_module=self.norm_module,
        stem_func_name="x3d_stem",
    )

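The function again begins by declaring the variables needed to build the model. It then builds the first module, self.s1 = stem_helper.VideoModelStem(). Since I train the X3D-M model, the architecture table in the paper can be used as a reference for the construction process.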

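Going by the order of the layers, stem_helper.VideoModelStem() builds the conv1 module of the paper's architecture table, although some of its parameters differ from the paper.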

class VideoModelStem(nn.Module):
    def __init__(
        self,
        dim_in,
        dim_out,
        kernel,
        stride,
        padding,
        inplace_relu=True,
        eps=1e-5,
        bn_mmt=0.1,
        norm_module=nn.BatchNorm3d,
        stem_func_name="basic_stem",
    ):
        # ... (omitted)
        self._construct_stem(dim_in, dim_out, norm_module, stem_func_name)

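This function follows the same pattern as above: it declares some variables and then calls _construct_stem to build the module. After reading this far, the style of the project is clear: define a set of variables, then pass them to a function that creates the module.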

def _construct_stem(self, dim_in, dim_out, norm_module, stem_func_name):
    trans_func = get_stem_func(stem_func_name)

    for pathway in range(len(dim_in)):
        stem = trans_func(
            dim_in[pathway],
            dim_out[pathway],
            self.kernel[pathway],
            self.stride[pathway],
            self.padding[pathway],
            self.inplace_relu,
            self.eps,
            self.bn_mmt,
            norm_module,
        )
        self.add_module("pathway{}_stem".format(pathway), stem)

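To make the module reusable, get_stem_func() is used to look up which stem class to call (here X3DStem), and that class is then instantiated.
The construction runs inside a loop over pathways. For X3D the loop body executes only once. The loop exists for SlowFast, whose model has two forward pathways that take inputs of different dimensions.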

class X3DStem(nn.Module):
    def __init__(
        self,
        dim_in,
        dim_out,
        kernel,
        stride,
        padding,
        inplace_relu=True,
        eps=1e-5,
        bn_mmt=0.1,
        norm_module=nn.BatchNorm3d,
    ):
        super(X3DStem, self).__init__()
        self.kernel = kernel
        self.stride = stride
        self.padding = padding
        self.inplace_relu = inplace_relu
        self.eps = eps
        self.bn_mmt = bn_mmt
        # Construct the stem layer.
        self._construct_stem(dim_in, dim_out, norm_module)

    def _construct_stem(self, dim_in, dim_out, norm_module):
        self.conv_xy = nn.Conv3d(
            dim_in,
            dim_out,
            kernel_size=(1, self.kernel[1], self.kernel[2]),
            stride=(1, self.stride[1], self.stride[2]),
            padding=(0, self.padding[1], self.padding[2]),
            bias=False,
        )
        self.conv = nn.Conv3d(
            dim_out,
            dim_out,
            kernel_size=(self.kernel[0], 1, 1),
            stride=(self.stride[0], 1, 1),
            padding=(self.padding[0], 0, 0),
            bias=False,
            groups=dim_out,
        )
        self.bn = norm_module(
            num_features=dim_out, eps=self.eps, momentum=self.bn_mmt
        )
        self.relu = nn.ReLU(self.inplace_relu)

    def forward(self, x):
        x = self.conv_xy(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        return x

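Inside X3DStem every step is clear: the module first runs conv_xy, then conv, and finally the usual batch norm and ReLU.
conv_xy is the convolution over the spatial dimensions. From the arguments passed in, its kernel is 1x3x3, its stride is 1x2x2, it has 3 input channels and 24 output channels.
conv is the convolution over the temporal dimension. Its kernel is 5x1x1, its stride is 1x1x1, and both its input and output have 24 channels (it is channel-wise, since groups=dim_out). In the paper the kernel at this point should be 3x1x1, which is where the code and the paper differ.
With this, X3DStem is complete. It corresponds to the conv1 module in the paper's architecture table.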
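
To see what the stem does to the size of a clip, the standard convolution output-size formula can be applied along each axis. The sketch below uses plain Python only; the helper names `conv_out` and `conv3d_out` are made up for this example, and the 16x224x224 input is the training clip size implied by the config shown earlier.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3d_out(shape, kernel, stride, padding):
    """Apply conv_out to each of the (T, H, W) axes."""
    return tuple(
        conv_out(s, k, st, p)
        for s, k, st, p in zip(shape, kernel, stride, padding)
    )


clip = (16, 224, 224)  # T, H, W of one training clip
# conv_xy: 1x3x3 kernel, stride 1x2x2, padding 0x1x1
after_xy = conv3d_out(clip, (1, 3, 3), (1, 2, 2), (0, 1, 1))
# conv: 5x1x1 kernel, stride 1x1x1, padding 2x0x0
after_t = conv3d_out(after_xy, (5, 1, 1), (1, 1, 1), (2, 0, 0))
print(after_xy, after_t)  # (16, 112, 112) (16, 112, 112)
```

The stem halves the spatial resolution and leaves the number of frames unchanged.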

Back in _construct_network of the X3D class, the first module self.s1 is now built. The code then enters a for loop that creates the remaining modules of the model.

for stage, block in enumerate(self.block_basis):
    s = resnet_helper.ResStage(
        dim_in=[dim_in],
        dim_out=[dim_out],
        dim_inner=[dim_inner],
        temp_kernel_sizes=temp_kernel[1],
        stride=[block[2]],
        num_blocks=[n_rep],
        num_groups=[dim_inner]
        if cfg.X3D.CHANNELWISE_3x3x3
        else [num_groups],
        num_block_temp_kernel=[n_rep],
        nonlocal_inds=cfg.NONLOCAL.LOCATION[0],
        nonlocal_group=cfg.NONLOCAL.GROUP[0],
        nonlocal_pool=cfg.NONLOCAL.POOL[0],
        instantiation=cfg.NONLOCAL.INSTANTIATION,
        trans_func_name=cfg.RESNET.TRANS_FUNC,
        stride_1x1=cfg.RESNET.STRIDE_1X1,
        norm_module=self.norm_module,
        dilation=cfg.RESNET.SPATIAL_DILATIONS[stage],
        drop_connect_rate=cfg.MODEL.DROPCONNECT_RATE
        * (stage + 2)
        / (len(self.block_basis) + 1),
    )

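There are too many arguments here to take in at a glance, so we go straight into the module to see how it is built. For convenience I will not list all of the code.
The call enters the ResStage class in resnet_helper.py, whose initializer calls self._construct to create the module: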

for pathway in range(self.num_pathways):
    for i in range(self.num_blocks[pathway]):
        # Retrieve the transformation function.
        trans_func = get_trans_func(trans_func_name)
        # Construct the block.
        res_block = ResBlock(
            dim_in[pathway] if i == 0 else dim_out[pathway],
            dim_out[pathway],
            self.temp_kernel_sizes[pathway][i],
            stride[pathway] if i == 0 else 1,
            trans_func,
            dim_inner[pathway],
            num_groups[pathway],
            stride_1x1=stride_1x1,
            inplace_relu=inplace_relu,
            dilation=dilation[pathway],
            norm_module=norm_module,
            block_idx=i,
            drop_connect_rate=self._drop_connect_rate,
        )

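The above is the core of _construct. It contains two nested loops. The first iterates over pathways: as mentioned earlier, SlowFast has two pathways here and X3D has only one. The second iterates over the number of repeated blocks. According to the architecture table in the paper, the res2 to res5 stages repeat 3, 5, 11 and 7 times respectively. This is the first stage, so the loop runs 3 times.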

class ResBlock(nn.Module):
    # Excerpt; the full signature and surrounding code are omitted.
    def _construct():
        self.branch1 = nn.Conv3d(
            dim_in,
            dim_out,
            kernel_size=1,
            stride=[1, stride, stride],
            padding=0,
            bias=False,
            dilation=1,
        )
        self.branch1_bn = norm_module(
            num_features=dim_out, eps=self._eps, momentum=self._bn_mmt
        )
        self.branch2 = trans_func(
            dim_in,
            dim_out,
            temp_kernel_size,
            stride,
            dim_inner,
            num_groups,
            stride_1x1=stride_1x1,
            inplace_relu=inplace_relu,
            dilation=dilation,
            norm_module=norm_module,
            block_idx=block_idx,
        )
        self.relu = nn.ReLU(self._inplace_relu)

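Inside _construct of ResBlock, the main definitions are branch1 and branch2. branch1 is a Conv3d with a 1x1x1 kernel, 24 input and 24 output channels, and a stride of 1x2x2, so it is clearly a downsampling shortcut that halves the H and W of its input.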
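
Within a stage, only the first block changes the stride (and possibly the channel count), so only that block needs a downsampling shortcut. The sketch below assumes the usual ResNet rule that a projection shortcut is created only when the channel count or the stride changes; that condition is not shown in the excerpt above, so treat it as an assumption. Both function names are made up for this example.

```python
def needs_projection(dim_in, dim_out, stride):
    """Assumed rule: a 1x1x1 shortcut conv is needed when shapes would not match."""
    return dim_in != dim_out or stride != 1


def stage_shortcuts(dim_in, dim_out, stride, num_blocks):
    """For each block in a stage, report whether it needs a projection."""
    flags = []
    for i in range(num_blocks):
        # Same per-block arguments as in the ResStage loop shown earlier.
        block_in = dim_in if i == 0 else dim_out
        block_stride = stride if i == 0 else 1
        flags.append(needs_projection(block_in, dim_out, block_stride))
    return flags


print(stage_shortcuts(24, 24, 2, 3))  # [True, False, False]
```

For res2 of X3D-M the channel count stays at 24, so it is the stride of 2 alone that triggers the shortcut in the first block.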
branch2 is created by X3DTransform.

class X3DTransform():
    # Excerpt; the full signature and surrounding code are omitted.
    def _construct():
        self.a = nn.Conv3d(
            dim_in,
            dim_inner,
            kernel_size=[1, 1, 1],
            stride=[1, str1x1, str1x1],
            padding=[0, 0, 0],
            bias=False,
        )
        self.a_bn = norm_module(
            num_features=dim_inner, eps=self._eps, momentum=self._bn_mmt
        )
        self.a_relu = nn.ReLU(inplace=self._inplace_relu)

        # Tx3x3, BN, ReLU.
        self.b = nn.Conv3d(
            dim_inner,
            dim_inner,
            [self.temp_kernel_size, 3, 3],
            stride=[1, str3x3, str3x3],
            padding=[int(self.temp_kernel_size // 2), dilation, dilation],
            groups=num_groups,
            bias=False,
            dilation=[1, dilation, dilation],
        )
        self.b_bn = norm_module(
            num_features=dim_inner, eps=self._eps, momentum=self._bn_mmt
        )

        # Apply SE attention or not
        use_se = True if (self._block_idx + 1) % 2 else False
        if self._se_ratio > 0.0 and use_se:
            self.se = SE(dim_inner, self._se_ratio)

        if self._swish_inner:
            self.b_relu = Swish()
        else:
            self.b_relu = nn.ReLU(inplace=self._inplace_relu)

        # 1x1x1, BN.
        self.c = nn.Conv3d(
            dim_inner,
            dim_out,
            kernel_size=[1, 1, 1],
            stride=[1, 1, 1],
            padding=[0, 0, 0],
            bias=False,
        )
        self.c_bn = norm_module(
            num_features=dim_out, eps=self._eps, momentum=self._bn_mmt
        )
        self.c_bn.transform_final_bn = True

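This module is fairly large, so we go through it piece by piece. Each piece can be matched against the architecture table in the paper.
self.a: 24 input channels, 54 output channels, kernel 1x1x1, stride 1x1x1. It only changes the number of channels.
self.b: 54 input channels, 54 output channels, kernel 3x3x3, stride 1x2x2, which shrinks the spatial size. Note that this block is created 3 times in the loop, and in every repetition except the first the stride is 1x1x1.
Next comes the SE (squeeze-and-excitation) module, which is a channel attention module. From the code, it is added only to every other block, and only when the SE ratio is greater than 0.
After that the Swish activation is used. Swish is defined as x * sigmoid(x), so it combines the input with a sigmoid gate, and it is reported to perform better than ReLU.
self.c: 54 input channels, 24 output channels, kernel 1x1x1, stride 1x1x1. It only changes the number of channels.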
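
Two details of the excerpt are easy to check in isolation: which blocks of a stage receive an SE module, and what Swish computes. The sketch below uses plain Python; `uses_se` simply mirrors the `(block_idx + 1) % 2` expression from the code above, and `swish` is a scalar version written for illustration, not the project's own Swish module.

```python
import math


def uses_se(block_idx):
    """Mirror of the rule `(block_idx + 1) % 2` from the excerpt above."""
    return (block_idx + 1) % 2 == 1


def swish(x):
    """Swish activation for a scalar: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


print([i for i in range(5) if uses_se(i)])  # [0, 2, 4]
print(swish(0.0), round(swish(1.0), 4))  # 0.0 0.7311
```

So in a five-block stage, SE would be attached to blocks 0, 2 and 4, provided the SE ratio is positive.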

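This completes X3DTransform. The class has three key arguments: dim_in is the number of input channels, dim_out the number of output channels, and dim_inner the number of intermediate channels. With it, ResBlock is complete as well.
Back in ResStage, the code is inside the loop that creates ResBlock 3 times; together these form the res2 stage of the paper's architecture table.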

We can now return to the X3D class, where the code is inside this loop:

for stage, block in enumerate(self.block_basis):
    dim_out = round_width(block[1], w_mul)
    dim_inner = int(cfg.X3D.BOTTLENECK_FACTOR * dim_out)

    n_rep = self._round_repeats(block[0], d_mul)
    prefix = "s{}".format(stage + 2)  # start w res2 to follow convention

    s = resnet_helper.ResStage(
        dim_in=[dim_in],
        dim_out=[dim_out],
        dim_inner=[dim_inner],
        temp_kernel_sizes=temp_kernel[1],
        stride=[block[2]],
        num_blocks=[n_rep],
        num_groups=[dim_inner]
        if cfg.X3D.CHANNELWISE_3x3x3
        else [num_groups],
        num_block_temp_kernel=[n_rep],
        nonlocal_inds=cfg.NONLOCAL.LOCATION[0],
        nonlocal_group=cfg.NONLOCAL.GROUP[0],
        nonlocal_pool=cfg.NONLOCAL.POOL[0],
        instantiation=cfg.NONLOCAL.INSTANTIATION,
        trans_func_name=cfg.RESNET.TRANS_FUNC,
        stride_1x1=cfg.RESNET.STRIDE_1X1,
        norm_module=self.norm_module,
        dilation=cfg.RESNET.SPATIAL_DILATIONS[stage],
        drop_connect_rate=cfg.MODEL.DROPCONNECT_RATE
        * (stage + 2)
        / (len(self.block_basis) + 1),
    )
    dim_in = dim_out
    self.add_module(prefix, s)

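self.block_basis has length 4, and its entries create res2 to res5 of the paper's architecture table.
In res2, the input channels dim_in and output channels dim_out are both 24, the intermediate channels dim_inner are 54, and num_blocks is 3.
In res3, there are 24 input channels, 48 output channels, 108 intermediate channels, and 5 repeats.
In res4, there are 48 input channels, 96 output channels, 216 intermediate channels, and 11 repeats.
In res5, there are 96 input channels, 192 output channels, 432 intermediate channels, and 7 repeats.
All of these parameters match the architecture table in the paper.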
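
The channel counts and repeat counts above can be reproduced from the four X3D values in the config file. The sketch below is a plain-Python approximation: `round_width` and `round_repeats` are re-implemented from memory for illustration and may differ in detail from the project's helpers, and it assumes cfg.X3D.SCALE_RES2 is False so that res2 starts from DIM_C1.

```python
import math


def round_width(width, multiplier, min_depth=8, divisor=8):
    """Scale a channel count and round it to a multiple of `divisor`."""
    if not multiplier:
        return width
    width *= multiplier
    new_width = max(min_depth, int(width + divisor / 2) // divisor * divisor)
    if new_width < 0.9 * width:
        new_width += divisor
    return int(new_width)


def round_repeats(repeats, multiplier):
    """Scale a block count and round up."""
    return int(math.ceil(multiplier * repeats))


# Values taken from the X3D section of the config file.
WIDTH_FACTOR, DEPTH_FACTOR, BOTTLENECK_FACTOR, DIM_C1 = 2.0, 2.2, 2.25, 12

# Base widths double from stage to stage, starting from DIM_C1.
base_widths = [DIM_C1]
for _ in range(3):
    base_widths.append(round_width(base_widths[-1], 2.0))
base_repeats = [1, 2, 5, 3]  # first column of block_basis

dim_in = round_width(DIM_C1, WIDTH_FACTOR)  # stem output channels
stages = []
for width, repeats in zip(base_widths, base_repeats):
    dim_out = round_width(width, WIDTH_FACTOR)
    dim_inner = int(BOTTLENECK_FACTOR * dim_out)
    n_rep = round_repeats(repeats, DEPTH_FACTOR)
    stages.append((dim_in, dim_out, dim_inner, n_rep))
    dim_in = dim_out

for name, stage in zip(["res2", "res3", "res4", "res5"], stages):
    print(name, stage)
# res2 (24, 24, 54, 3)
# res3 (24, 48, 108, 5)
# res4 (48, 96, 216, 11)
# res5 (96, 192, 432, 7)
```

Each tuple is (dim_in, dim_out, dim_inner, repeats), which matches the numbers listed above.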

After the ResStage modules are built, the code finally constructs X3DHead. Let us step into it and see how it is created.

class X3DHead(nn.Module):
    def _construct_head(self, dim_in, dim_inner, dim_out, norm_module):
        self.conv_5 = nn.Conv3d(
            dim_in,
            dim_inner,
            kernel_size=(1, 1, 1),
            stride=(1, 1, 1),
            padding=(0, 0, 0),
            bias=False,
        )
        self.conv_5_bn = norm_module(
            num_features=dim_inner, eps=self.eps, momentum=self.bn_mmt
        )
        self.conv_5_relu = nn.ReLU(self.inplace_relu)

        if self.pool_size is None:
            self.avg_pool = nn.AdaptiveAvgPool3d((1, 1, 1))
        else:
            self.avg_pool = nn.AvgPool3d(self.pool_size, stride=1)

        self.lin_5 = nn.Conv3d(
            dim_inner,
            dim_out,
            kernel_size=(1, 1, 1),
            stride=(1, 1, 1),
            padding=(0, 0, 0),
            bias=False,
        )
        if self.bn_lin5_on:
            self.lin_5_bn = norm_module(
                num_features=dim_out, eps=self.eps, momentum=self.bn_mmt
            )
        self.lin_5_relu = nn.ReLU(self.inplace_relu)

        if self.dropout_rate > 0.0:
            self.dropout = nn.Dropout(self.dropout_rate)
        # Perform FC in a fully convolutional manner. The FC layer will be
        # initialized with a different std comparing to convolutional layers.
        self.projection = nn.Linear(dim_out, self.num_classes, bias=True)

        # Softmax for evaluation and testing.
        if self.act_func == "softmax":
            self.act = nn.Softmax(dim=4)
        elif self.act_func == "sigmoid":
            self.act = nn.Sigmoid()
        else:
            raise NotImplementedError(
                "{} is not supported as an activation"
                "function.".format(self.act_func)
            )

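First, self.conv_5 is created. It has 192 input channels and 432 output channels, and corresponds to conv5 in the paper's architecture table.
Next comes an average pooling layer with a 16x7x7 kernel. The output of conv5 has size 16x7x7 (T x H x W), so this is a global pooling layer.
Then self.lin_5 is created. Although it is declared as a Conv3d, the feature map has size 1 after the average pooling, so it acts as a fully connected layer with 432 inputs and 2048 outputs.
Finally, self.projection maps the 2048-dimensional feature vector to a vector whose length is the number of classes. This completes the model.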
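
The feature-map sizes quoted above can be traced through the whole network with simple arithmetic. The sketch below is a plain-Python bookkeeping exercise, assuming a 16x224x224 training clip and the channel counts stated in this post; the `halve` helper and the `trace` list are made up for this example.

```python
def halve(size):
    """Spatial size after a stride-2 conv with a 3x3 kernel and padding 1."""
    return (size + 2 * 1 - 3) // 2 + 1


frames, size = 16, 224
# Each row is (module, channels, T, H=W).
trace = [("input", 3, frames, size)]
size = halve(size)
trace.append(("conv1", 24, frames, size))
for name, channels in [("res2", 24), ("res3", 48), ("res4", 96), ("res5", 192)]:
    size = halve(size)  # the first block of every stage has spatial stride 2
    trace.append((name, channels, frames, size))
trace.append(("conv5", 432, frames, size))
trace.append(("pool5", 432, 1, 1))  # global average pooling
trace.append(("lin_5", 2048, 1, 1))
trace.append(("projection", 2, 1, 1))  # NUM_CLASSES: 2 in the config above

for row in trace:
    print(row)
```

The spatial size goes 224, 112, 56, 28, 14, 7 while the temporal length stays at 16 until the pooling layer, which is why a 16x7x7 pooling kernel is global.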

2.2 Data input

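With the model built, the next thing to look at is the format of the input data. The input for action recognition is of course a video stream, but knowing how long the input clip is and what preprocessing is applied during training helps in understanding the model more deeply.
So what we need to read is how the dataset is built and how the data is loaded. The dataloader is declared in the train function of train_net.py: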

train_loader = loader.construct_loader(cfg, "train")

Inside this function, the dataset is built first, and then the dataloader is built and returned:

def construct_loader(cfg, split, is_precise_bn=False):
    # ... (omitted)
    dataset = build_dataset(dataset_name, cfg, split)
    # ... (omitted)

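After the dataset is created there is a fair amount of code that sets options for the dataloader. With the default settings, however, collate_fn and worker_init_fn are both None, so the dataloader is created in the plain way. The core of data loading therefore lies in how the dataset is built and read.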

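Stepping into build_dataset shows that the dataset is the Kinetics class in kinetics.py. Its __init__ is long but simple: it only reads the paths and labels of the video files and stores them in lists. The core code that reads a video is in __getitem__, so we leave this function for now and come back to it from the training loop.
Back in train_net, it is easy to see that train_epoch is the training routine: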

train_epoch(
    [train_loader, train_loader2],
    model,
    optimizer,
    scaler,
    train_meter,
    cur_epoch,
    cfg,
    writer,
)

In train_epoch, the for loop over the train loader is where the data is read, so we can step into __getitem__ from there.

    for cur_iter, (inputs, labels, _, meta) in enumerate(train_loader[0]):

def __getitem__(self, index):
    short_cycle_idx = None
    # When short cycle is used, input index is a tuple.
    if isinstance(index, tuple):
        index, short_cycle_idx = index

    if self.mode in ["train", "val"]:
        # -1 indicates random sampling.
        temporal_sample_index = -1
        spatial_sample_index = -1
        min_scale = self.cfg.DATA.TRAIN_JITTER_SCALES[0]
        max_scale = self.cfg.DATA.TRAIN_JITTER_SCALES[1]
        crop_size = self.cfg.DATA.TRAIN_CROP_SIZE
        if short_cycle_idx in [0, 1]:
            crop_size = int(
                round(
                    self.cfg.MULTIGRID.SHORT_CYCLE_FACTORS[short_cycle_idx]
                    * self.cfg.MULTIGRID.DEFAULT_S
                )
            )
        if self.cfg.MULTIGRID.DEFAULT_S > 0:
            # Decreasing the scale is equivalent to using a larger "span"
            # in a sampling grid.
            min_scale = int(
                round(
                    float(min_scale)
                    * crop_size
                    / self.cfg.MULTIGRID.DEFAULT_S
                )
            )
    elif self.mode in ["test"]:
        temporal_sample_index = (
            self._spatial_temporal_idx[index]
            // self.cfg.TEST.NUM_SPATIAL_CROPS
        )
        # spatial_sample_index is in [0, 1, 2]. Corresponding to left,
        # center, or right if width is larger than height, and top, middle,
        # or bottom if height is larger than width.
        spatial_sample_index = (
            (
                self._spatial_temporal_idx[index]
                % self.cfg.TEST.NUM_SPATIAL_CROPS
            )
            if self.cfg.TEST.NUM_SPATIAL_CROPS > 1
            else 1
        )
        min_scale, max_scale, crop_size = (
            [self.cfg.DATA.TEST_CROP_SIZE] * 3
            if self.cfg.TEST.NUM_SPATIAL_CROPS > 1
            else [self.cfg.DATA.TRAIN_JITTER_SCALES[0]] * 2
            + [self.cfg.DATA.TEST_CROP_SIZE]
        )
        # The testing is deterministic and no jitter should be performed.
        # min_scale, max_scale, and crop_size are expect to be the same.
        assert len({min_scale, max_scale}) == 1
    else:
        raise NotImplementedError(
            "Does not support {} mode".format(self.mode)
        )
    sampling_rate = utils.get_random_sampling_rate(
        self.cfg.MULTIGRID.LONG_CYCLE_SAMPLING_RATE,
        self.cfg.DATA.SAMPLING_RATE,
    )

The function starts by declaring some variables from the config file. Their meaning becomes clear as we read further.

for i_try in range(self._num_retries):
    video_container = None
    try:
        video_container = container.get_video_container(
            self._path_to_videos[index],
            self.cfg.DATA_LOADER.ENABLE_MULTI_THREAD_DECODE,
            self.cfg.DATA.DECODING_BACKEND,
        )
    except Exception as e:
        logger.info(
            "Failed to load video from {} with error {}".format(
                self._path_to_videos[index], e
            )
        )

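The code then enters a for loop that runs up to 10 times. The loop guards against failures when reading a video: if the video is read successfully the data is returned immediately, and if reading fails more than 10 times an error is raised.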
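
The retry behaviour can be sketched generically. This is a simplified illustration and not the dataset's actual logic: `load_with_retries` and `flaky_loader` are hypothetical names, and the sketch retries the same path each time, which the real code may not do.

```python
def load_with_retries(loader, path, num_retries=10):
    """Call `loader(path)` until it succeeds or the retry budget is spent."""
    last_error = None
    for attempt in range(num_retries):
        try:
            return loader(path), attempt + 1
        except Exception as err:  # a real loader would catch narrower errors
            last_error = err
    raise RuntimeError(
        f"Failed to load {path} after {num_retries} tries"
    ) from last_error


calls = {"n": 0}


def flaky_loader(path):
    """Hypothetical loader that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("decode error")
    return f"container:{path}"


result, attempts = load_with_retries(flaky_loader, "video.mp4")
print(result, attempts)  # container:video.mp4 3
```

Returning from inside the loop on the first success keeps the common case cheap; the error is raised only once every attempt has failed.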
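get_video_container reads the video with PyAV (the decoding backend selected in the config) and returns a video_container object. All later operations work on this object.
The next step is the decode function: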

frames = decoder.decode(
    video_container,
    sampling_rate,
    self.cfg.DATA.NUM_FRAMES,
    temporal_sample_index,
    self.cfg.TEST.NUM_ENSEMBLE_VIEWS,
    video_meta=self._video_meta[index],
    target_fps=self.cfg.DATA.TARGET_FPS,
    backend=self.cfg.DATA.DECODING_BACKEND,
    max_spatial_scale=min_scale,
    use_offset=self.cfg.DATA.USE_OFFSET_SAMPLING,
)

This function is important, so we need to step in and see what it does.
decode first calls pyav_decode:

frames, fps, decode_all_video = pyav_decode(
    container,
    sampling_rate,
    num_frames,
    clip_idx,
    num_clips,
    target_fps,
    use_offset=use_offset,
)

def pyav_decode(
    container,
    sampling_rate,
    num_frames,
    clip_idx,
    num_clips=10,
    target_fps=30,
    use_offset=False,
):
    fps = float(container.streams.video[0].average_rate)
    frames_length = container.streams.video[0].frames
    duration = container.streams.video[0].duration

    if duration is None:
        # If failed to fetch the decoding information, decode the entire video.
        decode_all_video = True
        video_start_pts, video_end_pts = 0, math.inf
    else:
        # Perform selective decoding.
        decode_all_video = False
        start_idx, end_idx = get_start_end_idx(
            frames_length,
            sampling_rate * num_frames / target_fps * fps,
            clip_idx,
            num_clips,
            use_offset=use_offset,
        )
        timebase = duration / frames_length
        video_start_pts = int(start_idx * timebase)
        video_end_pts = int(end_idx * timebase)

    frames = None
    # If video stream was found, fetch video frames from the video.
    if container.streams.video:
        video_frames, max_pts = pyav_decode_stream(
            container,
            video_start_pts,
            video_end_pts,
            container.streams.video[0],
            {"video": 0},
        )
        container.close()

        frames = [frame.to_rgb().to_ndarray() for frame in video_frames]
        frames = torch.as_tensor(np.stack(frames))
    return frames, fps, decode_all_video

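The function has two important steps. First, get_start_end_idx determines the start and end indices of the clip. The input video may be very long, so the start and end of the clip have to be fixed. With the settings above, the clip spans 80 frames (NUM_FRAMES 16 times SAMPLING_RATE 5, when the video's fps equals the target fps), and during training the clip is taken from a random position in the video. Second, pyav_decode_stream uses the start and end indices to cut the clip out of the video and produce the frames variable, which is normally a sequence of 80 frames.
Back in decode, one more function, temporal_sampling, runs before the result is returned. It is simple: it takes roughly one frame out of every 5 from frames, producing a sequence of 16 frames, and returns it.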
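
The two temporal steps, choosing an 80-frame window and then picking 16 frames from it, can be sketched in plain Python. This is a simplified approximation written for illustration: `get_clip_span` and `sample_indices` are hypothetical names, the 300-frame video length is an arbitrary example, and the sketch assumes evenly spaced sampling between the first and last frame of the window with truncation to integer indices.

```python
import random


def get_clip_span(num_video_frames, clip_size, rng):
    """Pick a random [start, end] window of `clip_size` frames."""
    delta = max(num_video_frames - clip_size, 0)
    start = rng.uniform(0, delta)
    return start, start + clip_size - 1


def sample_indices(start, end, num_samples, num_video_frames):
    """Evenly spaced frame indices from start to end, clamped to the video."""
    indices = []
    for i in range(num_samples):
        position = start + (end - start) * i / (num_samples - 1)
        indices.append(min(max(int(position), 0), num_video_frames - 1))
    return indices


NUM_FRAMES, SAMPLING_RATE = 16, 5
fps = target_fps = 30.0
# Same quantity as sampling_rate * num_frames / target_fps * fps.
clip_size = SAMPLING_RATE * NUM_FRAMES * fps / target_fps

# Random window inside a 300-frame video, then 16 indices inside that window.
start, end = get_clip_span(300, clip_size, random.Random(0))
indices = sample_indices(start, end, NUM_FRAMES, 300)

# Deterministic case: the window is exactly frames 0..79.
fixed = sample_indices(0, 79, NUM_FRAMES, 80)
print(clip_size, fixed)
```

Under these assumptions the 16 indices run from the first to the last frame of the window, so the spacing is slightly more than 5 frames on average rather than exactly 5.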
decode has now finished, and we return to __getitem__ of the Kinetics class. The frames variable now has the right length, but one more function, utils.spatial_sampling, still has to run.

def spatial_sampling(
    frames,
    spatial_idx=-1,
    min_scale=256,
    max_scale=320,
    crop_size=224,
    random_horizontal_flip=True,
    inverse_uniform_sampling=False,
    aspect_ratio=None,
    scale=None,
    motion_shift=False,
):
    assert spatial_idx in [-1, 0, 1, 2]
    if spatial_idx == -1:
        if aspect_ratio is None and scale is None:
            #frames = F.interpolate(frames,crop_size)
            frames, _ = transform.random_short_side_scale_jitter(
                images=frames,
                min_size=min_scale,
                max_size=max_scale,
                #min_size=224,
                #max_size=224,
                inverse_uniform_sampling=inverse_uniform_sampling,
            )
            frames, _ = transform.random_crop(frames, crop_size)
        else:
            transform_func = (
                transform.random_resized_crop_with_shift
                if motion_shift
                else transform.random_resized_crop
            )
            frames = transform_func(
                images=frames,
                target_height=crop_size,
                target_width=crop_size,
                scale=scale,
                ratio=aspect_ratio,
            )
        if random_horizontal_flip:
            frames, _ = transform.horizontal_flip(0.5, frames)
    else:
        # The testing is deterministic and no jitter should be performed.
        # min_scale, max_scale, and crop_size are expect to be the same.
        assert len({min_scale, max_scale}) == 1
        frames, _ = transform.random_short_side_scale_jitter(
            frames, min_scale, max_scale
        )
        frames, _ = transform.uniform_crop(frames, crop_size, spatial_idx)
    return frames

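This function is mainly for data augmentation: it randomly crops the original video. Here is how it works.
It first calls random_short_side_scale_jitter, which rescales the video, keeping the aspect ratio, so that its short side has a random length between min_size and max_size. The defaults for min_size and max_size here are 256 and 320. The result is then randomly cropped to 224x224 and, with probability 0.5, flipped horizontally.
The frames finally returned have size 3x16x224x224. At this point data processing is complete, and the returned frames can be fed to the model.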
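
The size bookkeeping of the scale-then-crop augmentation can be sketched without touching any pixels. The helpers below are hypothetical and written for this example only; they compute the resized frame size and the crop position, and the 360x640 source resolution is an arbitrary assumption.

```python
import math
import random


def random_short_side_scale(height, width, min_size, max_size, rng):
    """Return (new_height, new_width) with the short side at a random size."""
    size = int(round(rng.uniform(min_size, max_size)))
    if width < height:
        return int(math.floor(height / width * size)), size
    return size, int(math.floor(width / height * size))


def random_crop_box(height, width, crop_size, rng):
    """Return (top, left) of a random crop_size x crop_size window."""
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    return top, left


rng = random.Random(0)
new_h, new_w = random_short_side_scale(360, 640, 256, 320, rng)
top, left = random_crop_box(new_h, new_w, 224, rng)
clip_shape = (3, 16, 224, 224)  # C, T, H, W after cropping
print((new_h, new_w), (top, left), clip_shape)
```

Because the short side is always at least 256, a 224x224 window always fits inside the resized frame, and the same scale and crop position are applied to all 16 frames of a clip.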

3. Summary

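The X3D paper is publicly available, and there are already many posts explaining it, so I will not repeat them. In short, the paper uses coordinate descent, starting from a baseline, to search over how each expansion factor affects the accuracy and speed of the model, and ends up with models that are accurate and cheap to compute.

I used this model for fight detection, training on several public datasets. Accuracy reaches above 90% and the model is fast, so it should be usable as is. In my tests, ignoring preprocessing time, inference on one clip takes about 0.02 seconds on a P40 GPU. The measurement was rough and may not be accurate, but it should be the right order of magnitude.

To understand the paper better I also read the code in some depth, mainly how the model is built and how preprocessing works. I read it using the debug mode in VSCode, which makes it easy to follow the execution path and inspect variables; reading the source alone would have been hard to digest. There are still many parts I have not had time to study in detail, including the testing part and the demo part. I will add them later if time allows.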
