




2.1 Gait Recognition

2.2 Deep Learning on an Unordered Set

3 GAITSET (the proposed method)

3.1 Problem Formulation

3.2 Set Pooling

3.3 Horizontal Pyramid Mapping

3.4 Multilayer Global Pipeline

3.5 Loss Functions and Training Strategy

3.6 Training and Test

3.7 Post Feature Dimension Reduction


4.1 Datasets

4.2 Parameter Setting

4.3 Brief Introduction of Compared Methods

4.4 Main Results

4.4.1 CASIA-B

4.4.2 OU-MVLP

4.5 Ablation Experiments and Model Studies

4.5.1 Ablation experiments

4.5.2 Training strategies

4.6 Feature Dimension Reduction

4.7 Practicality



To portray a gait, existing gait recognition methods utilize either a gait template which makes it difficult to preserve temporal information, or a gait sequence that maintains unnecessary sequential constraints and thus loses the flexibility of gait recognition.
We present a novel perspective that utilizes gait as a deep set: a set of gait frames is integrated by a global-local fused deep network, inspired by the way our left and right hemispheres process information, to learn information that can be used in identification.
Our method is immune to frame permutations and can naturally integrate frames from different videos that have been acquired under different scenarios, such as diverse viewing angles, different clothes, or different item-carrying conditions.


Gait recognition is easily affected by exterior factors such as the subject’s walking speed, clothing, and item-carrying condition, as well as the camera’s viewpoint and frame rate.
Although various existing gait templates encode information as abundantly as possible, the compression process omits significant features such as temporal information and fine-grained spatial information.
Sequence-based methods preserve more temporal information but suffer a significant degradation when an input contains discontinuous frames or has a frame rate different from that of the training dataset.
To solve these problems, we present a novel perspective that regards gait as a set of gait silhouettes.
From this perspective, we propose an end-to-end deep learning model called GaitSet that extracts features from a gait frame set to identify gaits.
First, a CNN is used to extract frame-level features from each silhouette independently (local information).
Second, an operation called Set Pooling is used to aggregate frame-level features into a single set-level feature (global information).
Set Pooling preserves spatial and temporal information better than a gait template.
Third, a structure called Horizontal Pyramid Mapping (HPM) is applied to project the set-level feature into a more discriminative space to obtain a final deep set representation.


2.1 Gait Recognition

Gait recognition can be broadly categorized into template-based and sequence-based approaches.
The goal of template generation is to compress gait information into a single image.
In the template matching procedure, template-based methods first extract the gait representation from a template image using machine learning approaches such as canonical correlation analysis (CCA) [16], linear discriminant analysis (LDA) [1], [17] and deep learning [18].
Then, they measure the similarity between pairs of representations using Euclidean distance or other metric learning approaches.
Finally, they assign a label to the template based on the measured distance using a classifier, e.g., SVM or nearest neighbor classifier.
In the second category, the video-based approaches directly take a sequence of silhouettes as an input.

2.2 Deep Learning on an Unordered Set

The initial goal for using unordered sets was to address point cloud tasks in the computer vision domain [28] based on PointNet.
Using an unordered set, PointNet can avoid the noise introduced by quantization and the extension of data, leading to a high prediction performance.

3 GAITSET (the proposed method)

3.1 Problem Formulation

All silhouettes in one or more sequences of a given person can be regarded as a set of n silhouettes $X_i = \{x_i^j \mid j = 1, 2, \ldots, n\}$.
$f_i = H(G(F(x_i^1), F(x_i^2), \ldots, F(x_i^n)))$
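
As a rough illustration of the formulation above, the sketch below composes three toy stand-ins for F, G and H and checks that shuffling the input set leaves the output unchanged. The functions `frame_feature`, `set_pool` and `project`, and the 3 × 3 silhouettes, are hypothetical placeholders chosen for this note, not the networks or data used in GaitSet.

```python
import random

def frame_feature(silhouette):
    """Toy stand-in for F: row-wise foreground counts of a binary silhouette."""
    return [sum(row) for row in silhouette]

def set_pool(features):
    """Toy stand-in for G: element-wise max over the set dimension."""
    return [max(values) for values in zip(*features)]

def project(feature):
    """Toy stand-in for H: a fixed scaling in place of a learned mapping."""
    return [0.5 * v for v in feature]

def gait_set_feature(silhouettes):
    """f = H(G(F(x^1), ..., F(x^n))) for one subject's silhouette set."""
    return project(set_pool([frame_feature(x) for x in silhouettes]))

# Three 3x3 binary "silhouettes" (made-up data).
frames = [
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 1], [1, 0, 1]],
    [[1, 1, 0], [0, 1, 0], [0, 1, 0]],
]
shuffled = frames[:]
random.Random(0).shuffle(shuffled)

print(gait_set_feature(frames))
# The order of the frames does not matter because G is symmetric.
print(gait_set_feature(frames) == gait_set_feature(shuffled))
```

Because F is applied to each frame separately and G is a symmetric function, any reordering of the set gives the same feature.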

3.2 Set Pooling

The goal of Set Pooling (SP) is to condense a set of gait information.
Note that there are two constraints when performing an SP operation.
To meet the invariant constraint requirement in Eq. 2, one rational strategy of SP is to use statistical functions on the set dimension.
The statistical functions considered are max(·), mean(·) and median(·).
G(·) = max(·) + mean(·) + median(·) (3)
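G(·) = 1_1C(cat(max(·), mean(·), median(·))) (4)
Here cat denotes concatenation along the channel dimension and 1_1C denotes a 1 × 1 convolutional layer.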
We included two attention strategies in our work.
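
The statistical variants of SP described above can be sketched in a few lines. This is a minimal illustration on plain feature vectors, assuming the set dimension is the outer list; the function name `set_pooling` and the mode labels are made up for this note, and the 1 × 1 convolution of Eq. (4) and the attention variants are not covered.

```python
from statistics import fmean, median

def set_pooling(frame_features, mode="combined"):
    """Collapse n frame-level feature vectors into one set-level vector."""
    stats = {
        "max": max,
        "mean": fmean,
        "median": median,
        # Eq. (3): element-wise sum of the three statistics.
        "combined": lambda v: max(v) + fmean(v) + median(v),
    }
    pool = stats[mode]
    # zip(*...) groups the values of each feature position across the set.
    return [pool(list(values)) for values in zip(*frame_features)]

# Three frames, two feature positions each (made-up numbers).
features = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
for mode in ("max", "mean", "median", "combined"):
    print(mode, set_pooling(features, mode))
```

Each statistic is applied along the set dimension only, so the function accepts any number of frames and ignores their order, which is what the two constraints on SP require.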

3.3 Horizontal Pyramid Mapping

Horizontal pyramid pooling (HPP) relies on cropping and resizing the images to a uniform size based on pedestrian size, while the discriminative parts vary from image to image.
We improve HPP to adapt it to the gait recognition task: instead of applying a 1 × 1 convolutional layer after the pooling, we use independent fully connected (FC) layers for each pooled feature to map it into the discriminative space.
[Figure: The structure of horizontal pyramid mapping]
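
A small sketch may help make the strip structure concrete. It assumes five scales with 1, 2, 4, 8 and 16 horizontal strips (which gives the 31 strips that appear in the dimension count in Section 4.6) and pools each strip with global max plus global mean; both choices are my reading of the method rather than something stated in these notes. All sizes and the random weights are toy values.

```python
import random

def split_into_strips(feature_map, num_scales=5):
    """Split a C x H x W map along the height into 1, 2, 4, ... strips per scale."""
    height = len(feature_map[0])
    strips = []
    for scale in range(num_scales):
        parts = 2 ** scale
        step = height // parts
        for t in range(parts):
            strips.append([ch[t * step:(t + 1) * step] for ch in feature_map])
    return strips

def pool_strip(strip):
    """Global max + mean pooling per channel, giving a vector of length C."""
    pooled = []
    for channel in strip:
        values = [v for row in channel for v in row]
        pooled.append(max(values) + sum(values) / len(values))
    return pooled

def fully_connected(vector, weights):
    """Plain matrix-vector product; weights has shape out_dim x C."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

rng = random.Random(0)
C, H, W, OUT = 2, 16, 4, 3  # hypothetical toy sizes
feature_map = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]

strips = split_into_strips(feature_map)
# One independent weight matrix per strip, instead of a shared 1 x 1 convolution.
fc_weights = [[[rng.gauss(0, 1) for _ in range(C)] for _ in range(OUT)] for _ in strips]
outputs = [fully_connected(pool_strip(s), w) for s, w in zip(strips, fc_weights)]

print(len(strips), len(outputs[0]))
```

Giving every strip its own weights lets each body region and scale learn its own mapping, which is the change from HPP that the text describes.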

3.4 Multilayer Global Pipeline

Thus, pixels in the feature maps of a shallow layer pay more attention to local and fine-grained information while those in deeper layers focus more on global and coarse-grained information.
To collect different-level set information, we propose a multilayer global pipeline (MGP).
The main pipeline is similar to that of human cognition, which focuses intuitively on a person’s profile, whereas the MGP can preserve more details of a person’s walking movements.

3.5 Loss Functions and Training Strategy

In the field of identification, two loss functions are widely used: cross entropy loss and triplet loss.
Cross entropy loss measures the gap between a predictive distribution and the corresponding true distribution.
Triplet loss aims to pull semantically similar points close to each other while pushing semantically different points away from each other.
In this study, to improve the learning ability, we combined the cross entropy loss with the triplet loss.
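
To make the two losses concrete, here is a minimal sketch for a single sample and a single triplet. The margin of 0.2, the balancing factor `weight`, and the simple weighted sum in `combined_loss` are assumptions for illustration; the notes do not say how the two terms are combined or scheduled during training.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def combined_loss(logits, target, anchor, positive, negative, weight=1.0):
    # `weight` is a hypothetical balancing factor, not a value from the paper.
    return triplet_loss(anchor, positive, negative) + weight * cross_entropy(logits, target)

anchor, positive, negative = [0.0, 0.0], [0.3, 0.4], [3.0, 4.0]
print(triplet_loss(anchor, positive, negative))  # negative is far away, hinge is inactive
print(triplet_loss(anchor, negative, positive))  # roles swapped, loss is positive
print(combined_loss([2.0, 0.5, 0.1], 0, anchor, positive, negative))
```

The triplet term only looks at distances between features, while the cross entropy term needs class logits, so the two supervise the network in complementary ways.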

3.6 Training and Test

We calculate the rank-1 recognition accuracy, which means the percentage of the correct subjects ranked first, based on nearest Euclidean distance.
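
The rank-1 metric can be sketched as a nearest-neighbor lookup. The gallery and probe features below are made-up two-dimensional vectors, and `rank1_accuracy` is a hypothetical helper name; a real evaluation would use the learned GaitSet features.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank1_accuracy(probe, gallery):
    """probe and gallery are lists of (subject_id, feature_vector) pairs."""
    hits = 0
    for probe_id, probe_feat in probe:
        # The gallery entry with the smallest distance is ranked first.
        nearest_id, _ = min(gallery, key=lambda item: euclidean(probe_feat, item[1]))
        hits += nearest_id == probe_id
    return 100.0 * hits / len(probe)

gallery = [("A", [0.0, 0.0]), ("B", [5.0, 5.0]), ("C", [10.0, 0.0])]
probe = [
    ("A", [0.5, 0.2]),
    ("B", [4.0, 6.0]),
    ("C", [3.0, 1.0]),  # closer to A than to C, so this probe is a miss
    ("B", [5.5, 4.5]),
]
print(rank1_accuracy(probe, gallery))
```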

3.7 Post Feature Dimension Reduction

We propose a post feature dimension reduction module, a post-trained linear projection that reduces the dimension of the output feature while maintaining a competitive recognition accuracy.


We report the results of comprehensive experiments conducted to evaluate the performance of the proposed GaitSet.

4.1 Datasets

Based on the sizes of the training sets, we name these three kinds of division small-sample training (ST), medium-sample training (MT) and large-sample training (LT). 

4.2 Parameter Setting


We adopted the Adam optimizer [45] for training our GaitSet network.

4.3 Brief Introduction of Compared Methods

View-invariant Discriminative Projection (ViDP) [2] uses a unitary linear projection to project the templates into a latent space to learn a view-invariant representation.
Correlated Motion Co-Clustering (CMCC) [46] first uses motion co-clustering to partition the most related parts of gaits from different views into the same group, and then applies canonical correlation analysis (CCA) on each group to maximize the correlation between gait information across views.

4.4 Main Results

4.4.1 CASIA-B

Therefore, both parallel and vertical perspectives lose some portion of the gait information, while views such as 36° and 144° achieve a better balance between these two extremes.

1) Because our model regards the input as a set, the number of samples (frames) available for training the convolutional network in the main pipeline is dozens of times higher than the number of samples used to train the template- or video-based models.
2) The sample sets used in the training phase are composed of frames selected randomly from the sequences in the training set, and each sequence can generate multiple different sample sets; thus, any units related to set feature learning (such as MGP and HPM) can also be trained well.
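
The second point can be illustrated with a short sampling sketch: drawing frames at random from one sequence yields many different training sets. The set size of 30, the 80-frame sequence, and the fallback to sampling with replacement for short sequences are assumptions made for this example, not settings taken from these notes.

```python
import random

def sample_frame_set(sequence, set_size, rng):
    """Draw an unordered set of frames from one sequence.

    Sampling with replacement is used only when the sequence is shorter
    than the requested set size (an assumption for this sketch).
    """
    if len(sequence) >= set_size:
        return rng.sample(sequence, set_size)
    return rng.choices(sequence, k=set_size)

rng = random.Random(0)
sequence = [f"frame_{i:03d}" for i in range(80)]  # stand-ins for silhouettes
sets = [sample_frame_set(sequence, 30, rng) for _ in range(3)]
for s in sets:
    print(sorted(s)[:4])
```

Since the model is insensitive to frame order, every such draw is a valid training sample for the same subject.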
Our model achieves satisfactory performance on the BG subset. On the CL dataset, the recognition performances are somewhat less satisfactory, although our model still exceeds the best performance reported so far [7] by over 15%.

4.4.2 OU-MVLP

The results show that our method generalizes well to a dataset with a large scale and wide view variation.

4.5 Ablation Experiments and Model Studies

In this section, we report ablation experiments and model studies on CASIA-B, to examine the effectiveness of regarding gait as a set with set pooling, MGP, HPM, and different training strategies with different loss combinations.

4.5.1 Ablation experiments

There might be two main reasons for this improvement: 1) our SP extracts the set-level feature from a high-level feature map where the temporal information is well preserved and the spatial information has been sufficiently processed; and 2) as mentioned in Sec. 4.4, regarding gait as a set enlarges the volume of training data.
SP with pixel-wise attention achieves the highest accuracy on the NM and BG subsets, and SP with max(·) obtains the highest accuracy on the CL subset.
that set-level features extracted from different layers of the main pipeline
HPM obtains better performance with more scales.
It can be seen that using independent weights increases the accuracy by more than 7% on each subset.

4.5.2 Training strategies

However, only the pretraining model that combines the two losses reaches the highest 96.1% rank-1 accuracy.
Dropout layers are essential for robust training with cross-entropy loss in this case.
Batch normalization improves all the training strategies.

4.6 Feature Dimension Reduction

The testing feature after concatenating all the HPM outputs has 256 × 31 × 2 = 15,872 dimensions in a standard framework.
However, there is still a negative impact on the performance if the HPM output dimensions are too low (down to 32) or too high (up to 1024).
By decreasing the HPM output dimensions, we can compress the final feature dimension from 15,872 to one quarter of that.
After the model has been well trained, a new fully connected layer is applied to the 15,872-dimensional feature to reduce it to a lower dimension.
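
The dimension arithmetic and the post-trained projection can be checked with a short sketch. It reads the factor 31 as the total number of HPM strips over five scales and the factor 2 as the main pipeline plus MGP; that reading, the target size of 8, and the random weights are assumptions for illustration, not trained values.

```python
import random

NUM_SCALES = 5     # assumption: strips per scale are 1, 2, 4, 8, 16
NUM_PIPELINES = 2  # assumption: main pipeline + MGP

def final_dim(hpm_out_dim):
    """Length of the concatenated test feature for a given per-strip output size."""
    strips = sum(2 ** s for s in range(NUM_SCALES))  # 31
    return hpm_out_dim * strips * NUM_PIPELINES

for d in (32, 64, 128, 256, 1024):
    print(d, final_dim(d))

def linear_projection(feature, weights, bias):
    """Post-trained FC layer: y = W x + b, with W of shape out_dim x in_dim."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weights, bias)]

rng = random.Random(0)
in_dim, out_dim = final_dim(256), 8  # out_dim is a hypothetical toy target size
feature = [rng.random() for _ in range(in_dim)]
weights = [[rng.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(out_dim)]
reduced = linear_projection(feature, weights, [0.0] * out_dim)
print(len(feature), "->", len(reduced))
```

Under these assumptions, a per-strip output size of 64 gives exactly one quarter of the 15,872 dimensions, while the extra FC layer can reach any target size without retraining the backbone.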

4.7 Practicality

In real forensic identification scenarios, cases occur in which no continuous sequence of a subject’s gait is available, only some fitful and sporadic silhouettes.
It can also be observed that 1) the accuracy rises monotonically as the number of silhouettes increases, and 2) the accuracy is close to the best performance when the samples contain more than 25 silhouettes.
We simulate these scenarios by constructing each silhouette sample selected from two sequences that have the same walking condition but different views.
Including multiple views in the input set allows the model to gather both parallel and vertical information, resulting in performance improvements.
We simulate such different conditions by forming an input set using silhouettes from two sequences with the same view but different walking conditions and conduct experiments under the constraint of different numbers of silhouettes.
Containing large yet complementary noises and information, the combination of silhouettes from BG and CL helps the model improve the accuracy.


The proposed GaitSet approach extracts both spatial and temporal information more effectively and efficiently than do the existing methods, which regard gait as either a template or a sequence.
In addition, since the set assumption could fit various other biometric identification tasks including person re-identification and video-based face recognition, the structure of GaitSet can be applied to these tasks with few minor changes in the future.
