Paper reading: Part-based Graph Convolutional Network for Action Recognition

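Contents

Paper reading: Part-based Graph Convolutional Network for Action Recognition
- Graph and skeleton:
- Traditional action recognition from S-videos:
- Two kinds of information used by the model in this paper:
- Main contributions of the paper:
- Convolution formula for a single (unpartitioned) graph:
  - k-neighborhood
  - 1-neighborhood
- Part-based Graph
  - Definition of the graph partition:
  - Two parts (b):
  - Four parts (c) (recommended):
  - Six parts (d):
  - Connections between subgraphs:
- Part-based Graph Convolutions
  - Neighborhoods:
  - Convolution:
    - Subgraph convolution:
    - Aggregation of subgraph convolution results:
- Spatio-temporal Part-based Graph Convolutions
  - Convolution steps
  - Neighborhood definition
  - Label assignment
  - The full set of convolution formulas
    - Spatial convolution on subgraphs
    - Aggregation of the subgraph spatial convolutions
    - Temporal convolution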

Graph and skeleton:

Human skeleton is intuitively represented as a sparse graph with joints as nodes and natural connections between them as edges.

nodes: joints
edges: natural connections between joints

Traditional action recognition from S-videos:

The whole skeleton is treated as a single graph.
3D joint coordinates are used at each vertex.

Two kinds of information used by the model in this paper:

Geometric features: such as relative joint coordinates
Motion features: such as temporal displacements

Main contributions of the paper:

Formulation of a general part-based graph convolutional network (PB-GCN).
Use of geometric and motion features in place of 3D joint locations at each vertex.

That is, geometric information (relative joint coordinates) and motion information (temporal displacements) are used.
Exceeding the state-of-the-art on challenging benchmark datasets NTURGB+D and HDM05.

Convolution formula for a single (unpartitioned) graph:

k-neighborhood

$$
Y(v_i) = \sum_{v_j \in N_k(v_i)} W(L(v_j)) X(v_j)
$$

$W(\cdot)$: a filter weight vector of size $L$, indexed by the label assigned to neighbor $v_j$ in the $k$-neighborhood $N_k(v_i)$
$X(v_j)$: the input feature at $v_j$
$Y(v_i)$: the convolved output feature at root vertex $v_i$

1-neighborhood

Writing the neighborhood $N_k(v_i)$ in terms of the adjacency matrix $A$ and reducing the neighborhood size from $k$ to 1 gives:
$$
Y(v_i) = \sum_j A^{norm}(i, j) W(L(v_j)) X(v_j)
$$

where $D(i,i) = \sum_j A(i,j)$ is the degree matrix and $A^{norm} = D^{-1/2} A D^{-1/2}$.
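To make the 1-neighborhood formula concrete, here is a minimal NumPy sketch. It assumes a single label for all neighbors (so one weight matrix `W` is shared), and it adds self-loops to the toy adjacency matrix so that each vertex is in its own neighborhood. The function names and the 3-joint chain are illustrative only and do not come from the paper.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^{-1/2} A D^{-1/2}, where D(i, i) = sum_j A(i, j)."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    nz = deg > 0                      # avoid dividing by zero for isolated vertices
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def graph_conv_1hop(X, A, W):
    """Y(v_i) = sum_j A_norm(i, j) W X(v_j), with one weight shared by all labels.

    X: (V, C) input features, A: (V, V) adjacency, W: (C, C_out) weights.
    """
    return normalized_adjacency(A) @ X @ W

# Toy chain of 3 joints: 0 - 1 - 2, with self-loops (assumption).
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
X = np.eye(3)   # one-hot input features, C = 3
W = np.eye(3)   # identity weights, so Y equals the normalized adjacency
Y = graph_conv_1hop(X, A, W)
print(Y)
```

With identity features and weights, the output is just $A^{norm}$, which makes it easy to check the normalization by hand.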

Part-based Graph

In general, a part-based graph can be constructed as a combination of subgraphs where each subgraph has certain properties that define it.

Definition of the graph partition:

We consider scenarios in which the partitions can share vertices or have edges connecting them.

That is, a graph is divided into subgraphs, and different subgraphs may share vertices or be connected by edges.
$$
G = \bigcup_{p \in \{1,\dots,n\}} P_p \mid P_p = (V_p, \varepsilon_p)
$$

$P_p$ is the partition (or subgraph) $p$ of the graph $G$

Two parts (b):

Axial skeleton
Appendicular skeleton

Four parts (c) (recommended):

head
hands
torso
legs

We consider left and right parts of hands and legs together in order to be agnostic to laterality [31] (handedness / footedness) of the human when performing an action.

That is, the effect of laterality is removed (waving with the left hand and waving with the right hand are both waving).

Six parts (d):

we divide the upper and lower components of appendicular skeleton into left and right (shown in Figure 1(d)), resulting in six parts

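Connections between subgraphs:

Subgraphs can be connected in two ways: through shared vertices or through connecting edges. Shared vertices are used here.

To cover all natural connections between joints in skeleton graph, we include an overlap of at least one joint between two adjacent parts.

That is, any two adjacent subgraphs share at least one node.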
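The overlap requirement can be checked mechanically. The sketch below uses a hypothetical 11-joint skeleton; the joint indices, edge list and part membership are made up for illustration and are not the paper's actual assignment. It verifies that every skeleton edge falls inside at least one part and lists the joints shared by each pair of parts.

```python
from itertools import combinations

# Hypothetical joints: 0 head, 1 neck, 2 hip, 3 l_elbow, 4 l_hand,
# 5 r_elbow, 6 r_hand, 7 l_knee, 8 l_foot, 9 r_knee, 10 r_foot
EDGES = [(0, 1), (1, 2),
         (1, 3), (3, 4), (1, 5), (5, 6),
         (2, 7), (7, 8), (2, 9), (9, 10)]

# Four parts; left and right limbs are grouped together, and adjacent
# parts overlap in the neck (1) or the hip (2).
PARTS = {
    "head":  {0, 1},
    "hands": {1, 3, 4, 5, 6},
    "torso": {1, 2},
    "legs":  {2, 7, 8, 9, 10},
}

def edges_in_part(edges, vertices):
    """Edges whose two endpoints both belong to the given vertex set."""
    return [(a, b) for a, b in edges if a in vertices and b in vertices]

def uncovered_edges(edges, parts):
    """Skeleton edges that are not contained in any part."""
    covered = set()
    for vertices in parts.values():
        covered.update(edges_in_part(edges, vertices))
    return [e for e in edges if e not in covered]

def shared_joints(parts):
    """Map each pair of overlapping parts to the joints they share."""
    return {(p, q): parts[p] & parts[q]
            for p, q in combinations(parts, 2) if parts[p] & parts[q]}

print(uncovered_edges(EDGES, PARTS))
print(shared_joints(PARTS))
```

If a shared joint were removed from one of the parts, the edge crossing that boundary would show up in `uncovered_edges`, which is exactly the situation the overlap is meant to prevent.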

Part-based Graph Convolutions

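Unlike the single-graph convolution formula above (Eq. 2), the graph has a new convolution formula once it is partitioned into subgraphs.

Several concepts also need to be redefined.

Neighborhoods:

Spatial neighborhood: the 1-neighborhood within a single frame, i.e. at a given time (Figure 3(a)).
Temporal neighborhood: the positions of a single node at different times (Figure 3(a)).
Spatio-temporal neighborhood: the union of the spatial and temporal neighborhoods (Figure 3(b)).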

Convolution:

graph convolutions over a part identifies the properties of that subgraph and an aggregation across subgraphs learns the relations between them.

For a part-based graph, convolutions for each part are performed separately and the results are combined using an aggregation function $F_{agg}$.

That is, convolution is first performed within each subgraph (over its 1-neighborhood), and the aggregation function $F_{agg}$ then captures the relations between the subgraphs.

The formulas are as follows:

Subgraph convolution:

$$
Y_p(v_i) = \sum_{v_j \in N_{kp}(v_i)} W_p(L_p(v_j)) X_p(v_j), \quad p \in \{1,\dots,n\}
$$

$W_p$ can be shared across parts or kept separate, while only the neighbors of $v_i$ in that part ($N_{kp}(v_i)$) are considered.

Aggregation of subgraph convolution results:

Edge-sharing form:
$$
Y(v_i) = F_{agg}(Y_{p_1}(v_i), Y_{p_2}(v_j)) \mid (v_i, v_j) \in \varepsilon(p_1, p_2),\ (p_1, p_2) \in \{1,\dots,n\} \times \{1,\dots,n\}
$$
Vertex-sharing form:
$$
Y(v_i) = F_{agg}(Y_{p_1}(v_i), Y_{p_2}(v_i)) \mid (p_1, p_2) \in \{1,\dots,n\} \times \{1,\dots,n\}
$$
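As a rough illustration of the vertex-sharing form, the sketch below scatters per-part outputs back onto the full vertex set and combines them at shared vertices. The notes above do not fix a concrete $F_{agg}$, so a weighted sum is assumed here; the function name, the `weights` argument and the toy data are all hypothetical.

```python
import numpy as np

def aggregate_parts(part_outputs, part_vertices, num_vertices, weights=None):
    """Vertex-sharing F_agg, assumed here to be a weighted sum.

    part_outputs[p]: (|V_p|, C) array whose rows follow part_vertices[p].
    part_vertices[p]: list of distinct vertex indices belonging to part p.
    Returns Y of shape (num_vertices, C).
    """
    n = len(part_outputs)
    weights = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    C = part_outputs[0].shape[1]
    Y = np.zeros((num_vertices, C))
    for w, Y_p, idx in zip(weights, part_outputs, part_vertices):
        # A vertex shared by several parts receives one term per part.
        Y[idx] += w * Y_p
    return Y

# Two parts on a 3-vertex chain, overlapping at vertex 1.
part_vertices = [[0, 1], [1, 2]]
part_outputs = [np.ones((2, 1)), 2 * np.ones((2, 1))]
Y = aggregate_parts(part_outputs, part_vertices, num_vertices=3)
print(Y.ravel())
```

Vertex 1 belongs to both parts, so it is the only vertex whose output mixes information from the two subgraphs.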

Spatio-temporal Part-based Graph Convolutions

Convolution steps

The S-videos are represented as spatio-temporal graphs.

That is, an S-video is essentially a spatio-temporal graph.

we spatially convolve each partition independently for each frame, aggregate them at each frame and perform temporal convolution on the temporal dimension of the aggregated graph.

That is, there are roughly two stages, or three steps in more detail:

Spatial convolution:
- Subgraph convolution: spatially convolve each partition independently for each frame
- Aggregation of subgraph convolution results: aggregate the results of the partition convolutions at each frame
Temporal convolution:
- Temporal convolution of the aggregated result: temporal convolution on the temporal dimension of the aggregated graph.

Neighborhood definition

For each vertex, we use 1-neighborhood ($k = 1$) for spatial dimension ($N_1$) as the skeleton graph is not very large and a $\tau$-neighborhood ($k = \tau$) for the temporal dimension ($N_\tau$); $N_\tau$ is not part-specific.

The spatial and temporal neighborhoods are defined by:
$$
N_{1p}(v_i) = \{ v_j \mid d(v_i, v_j) \le 1,\ v_i, v_j \in V_p \}
$$

$$
N_\tau(v_{it_a}) = \{ v_{it_b} \mid d(v_{it_a}, v_{it_b}) \le \lfloor \tfrac{\tau}{2} \rfloor \}
$$

Label assignment

For ordering vertices in the receptive fields (or neighborhoods), we use a single label spatially ($L_S : V \to \{0\}$) to weigh vertices in $N_{1p}$ of each vertex equally and $\tau$ labels temporally ($L_T : V \to \{0,\dots,\tau-1\}$) to weigh vertices across frames in $N_\tau$ differently.

That is, for a root node, all vertices in its spatial neighborhood receive the same label (0), while vertices in its temporal neighborhood receive different labels.

The formulas are as follows:
$$
L_S(v_{jt}) = \{ 0 \mid v_{jt} \in N_{1p}(v_{it}) \}
$$

$$
L_T(v_{it_b}) = \{ ((t_b - t_a) + \lfloor \tfrac{\tau}{2} \rfloor) \mid v_{it_b} \in N_\tau(v_{it_a}) \}
$$
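The neighborhood and label definitions can be sketched in a few lines of plain Python. This reads the temporal bound as $\lfloor \tau/2 \rfloor$, which for odd $\tau$ gives integer labels in $\{0,\dots,\tau-1\}$. Clipping the temporal window at the ends of the video is an assumption of this sketch, and the function names and the toy chain are illustrative only.

```python
def spatial_neighborhood(i, part_vertices, edges):
    """N_1p(v_i): vertices of the part within distance 1 of v_i (v_i included)."""
    if i not in part_vertices:
        raise ValueError("v_i must belong to the part")
    nbrs = {i}
    for a, b in edges:
        # Only edges with both endpoints inside the part count.
        if a in part_vertices and b in part_vertices:
            if a == i:
                nbrs.add(b)
            elif b == i:
                nbrs.add(a)
    return nbrs

def temporal_neighborhood(t_a, tau, num_frames):
    """Frames t_b with |t_b - t_a| <= floor(tau / 2), clipped to the video (assumption)."""
    half = tau // 2
    return [t for t in range(t_a - half, t_a + half + 1) if 0 <= t < num_frames]

def spatial_label(j):
    """L_S: every spatial neighbor gets the single label 0."""
    return 0

def temporal_label(t_a, t_b, tau):
    """L_T = (t_b - t_a) + floor(tau / 2)."""
    return (t_b - t_a) + tau // 2

# Toy chain 0 - 1 - 2 - 3 with a part that excludes vertex 0.
edges = [(0, 1), (1, 2), (2, 3)]
part = {1, 2, 3}
print(spatial_neighborhood(1, part, edges))            # vertex 0 is outside the part
frames = temporal_neighborhood(t_a=5, tau=9, num_frames=100)
print([temporal_label(5, t_b, 9) for t_b in frames])   # one distinct label per frame
```

The spatial neighborhood of vertex 1 drops vertex 0 because it lies outside the part, while the temporal neighborhood ignores parts entirely, matching the remark that $N_\tau$ is not part-specific.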

The full set of convolution formulas

Spatial convolution on subgraphs

$$
Z_p(v_{jt}) = W_p(L_S(v_{jt})) X_p(v_{jt})
$$

$W_p \in \mathbb{R}^{C' \times C \times 1 \times 1}$: part-specific channel transform kernel (pointwise operation)
$L_S$ is the same for each part, but $N_{1p}$ is part-specific
$Z_p$: output from applying $W_p$ on input features $X_p$ at each vertex

$$
Y_p(v_{it}) = \sum_{v_{jt} \in N_{1p}(v_{it})} A_p(i, j) Z_p(v_{jt}) \mid p \in \{1,\dots,4\}
$$

$A_p$: normalized adjacency matrix for part $p$
$W_T \in \mathbb{R}^{C' \times C' \times \tau \times 1}$: temporal convolution kernel
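The two spatial equations map naturally onto tensor contractions. The sketch below is one possible NumPy rendering, assuming features are stored as a `(C, T, V)` array and treating the $C' \times C \times 1 \times 1$ kernel as a plain `(C_out, C)` matrix. The function name, the array layout and the toy values are assumptions, not the paper's implementation.

```python
import numpy as np

def part_spatial_conv(X, W_p, A_p, part_idx):
    """Spatial convolution for one part.

    X: (C, T, V) features of the whole skeleton.
    W_p: (C_out, C) pointwise kernel (singleton axes of C' x C x 1 x 1 dropped).
    A_p: (|V_p|, |V_p|) normalized adjacency of the part.
    part_idx: list of joint indices forming V_p.
    Returns Y_p of shape (C_out, T, |V_p|).
    """
    X_p = X[:, :, part_idx]                    # restrict to the part's vertices
    Z_p = np.einsum("oc,ctv->otv", W_p, X_p)   # Z_p = W_p X_p at every vertex and frame
    Y_p = np.einsum("ij,otj->oti", A_p, Z_p)   # Y_p(v_i) = sum_j A_p(i, j) Z_p(v_j)
    return Y_p

X = np.arange(24, dtype=float).reshape(2, 3, 4)   # C = 2, T = 3, V = 4
part_idx = [1, 2]                                 # a part made of two connected joints
A_p = np.full((2, 2), 0.5)                        # normalized adjacency with self-loops
W_p = np.eye(2)                                   # identity channel transform
Y_p = part_spatial_conv(X, W_p, A_p, part_idx)
print(Y_p.shape)
```

With an identity `W_p` and this `A_p`, each output vertex is simply the mean of the two joints in the part, which gives an easy sanity check.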

Aggregation of the subgraph spatial convolutions

$$
Y_S(v_{it}) = F_{agg}(\{Y_1(v_{it}),\dots,Y_n(v_{it})\})
$$

$Y_S$: output obtained after aggregating all partition graphs at one frame

Temporal convolution

$$
Y_T(v_{it_a}) = \sum_{v_{it_b} \in N_\tau(v_{it_a})} W_T(L_T(v_{it_b})) Y_S(v_{it_b})
$$


$Y_T$: output after applying temporal convolution on the $Y_S$ outputs of $\tau$ frames
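The temporal step can be sketched as a sum over the $\tau$ temporal labels, each with its own slice of $W_T$. The code below assumes an odd $\tau$, a `(C, T, V)` layout for $Y_S$, a `(C_out, C, tau)` kernel (the trailing singleton axis dropped), and zero padding for frames outside the video. All of these choices, along with the function name, are assumptions made for illustration.

```python
import numpy as np

def temporal_conv(Y_S, W_T):
    """Y_T(v_{i,t_a}) = sum over t_b in N_tau(t_a) of W_T[L_T(t_b)] Y_S(v_{i,t_b}).

    Y_S: (C, T, V) aggregated spatial output.
    W_T: (C_out, C, tau) temporal kernel with odd tau.
    Frames outside the video are treated as zeros (assumption).
    """
    C_out, C, tau = W_T.shape
    half = tau // 2
    T, V = Y_S.shape[1], Y_S.shape[2]
    padded = np.pad(Y_S, ((0, 0), (half, half), (0, 0)))   # zero-pad the time axis
    Y_T = np.zeros((C_out, T, V))
    for label in range(tau):   # label = (t_b - t_a) + floor(tau / 2)
        # padded[:, t_a + label] holds Y_S at frame t_b = t_a + label - half
        window = padded[:, label:label + T]
        Y_T += np.einsum("oc,ctv->otv", W_T[:, :, label], window)
    return Y_T

# One channel, one joint, five frames; an all-ones kernel gives a moving sum.
Y_S = np.arange(1.0, 6.0).reshape(1, 5, 1)
W_T = np.ones((1, 1, 3))
Y_T = temporal_conv(Y_S, W_T)
print(Y_T[0, :, 0])
```

Because each vertex is convolved only with its own copies in neighboring frames, the joint axis `V` passes through untouched, consistent with the temporal neighborhood being independent of the parts.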