连续消除算法-SEA（Successive Elimination Algorithm）

Abstract

该块匹配算法（block matching algorithm BMA）用于运动估计。核心思想就是通过连续不断的消除搜索窗口中的块位置，降低计算量，同时获得较为准确的运动向量，准确度极其接近全搜索算法。它不同于其他递增型的搜索算法，也就是从一点出发，搜索周边块，该性质会导致其陷入局部最优。

Principles of the Algorithm

块匹配算法（BMA）的目的就是在参考帧的搜索窗口中找到一个与当前块的最佳匹配。接下来会讲一个不等关系式，用于限制在搜索窗口中搜索过程，但仍然保留了运动向量的最佳选项。

假设一个像素块，其尺寸为N x N，搜索窗口的尺寸为 (2M + 1) x (2M + 1)。f(i, j, t) 代表像素帧 t 中坐标为 (i, j) 的像素值。平均绝对差 (Mean absolute difference MAD) 用于衡量两个块之间的匹配程度。假设参考帧为 t - 1。我们能得到一个基本的不等式：

(1) ∣ ∣ f ( i , j , t ) ∣ − ∣ f ( i − x , j − y , t − 1 ) ∣ ∣ ≤ ∣ f ( i , j , t ) − f ( i − x , j − y , t − 1 ) ∣ . ||f(i, j, t)| - |f(i - x, j - y, t - 1)|| \leq |f(i, j, t) - f(i - x, j - y, t - 1)|. \tag{1} ∣∣f(i,j,t)∣−∣f(i−x,j−y,t−1)∣∣≤∣f(i,j,t)−f(i−x,j−y,t−1)∣.(1)

该不等式与以下两个不等式等效：

(2) ∣ f ( i , j , t ) ∣ − ∣ f ( i − x , j − y , t − 1 ) ∣ ≤ ∣ f ( i , j , t ) − f ( i − x , j − y , t − 1 ) ∣ . |f(i, j, t)| - |f(i - x, j - y, t - 1)| \leq |f(i, j, t) - f(i - x, j - y, t - 1)|.\tag{2} ∣f(i,j,t)∣−∣f(i−x,j−y,t−1)∣≤∣f(i,j,t)−f(i−x,j−y,t−1)∣.(2)

(3) ∣ f ( i − x , j − y , t − 1 ) ∣ − ∣ f ( i , j , t ) ∣ ≤ ∣ f ( i , j , t ) − f ( i − x , j − y , t − 1 ) ∣ . |f(i - x, j - y, t - 1)| - |f(i, j, t)| \leq |f(i, j, t) - f(i - x, j - y, t - 1)|. \tag{3} ∣f(i−x,j−y,t−1)∣−∣f(i,j,t)∣≤∣f(i,j,t)−f(i−x,j−y,t−1)∣.(3)

在这里x、y表示的是运动向量并且其取值范围为 [-M, M]。然后将（2）和（3）式遍历块中所有像素，并求和得到：

(4) ∑ i = 1 N ∑ j = 1 N ∣ f ( i , j , t ) ∣ − ∑ i = 1 N ∑ j = 1 N ∣ f ( i − x , j − y , t − 1 ) ∣ ≤ ∑ i = 1 N ∑ j = 1 N ∣ f ( i , j , t ) − f ( i − x , j − y , t − 1 ) ∣ . \sum_{i = 1}^{N}\sum_{j = 1}^{N}|f(i, j, t)| - \sum_{i = 1}^{N}\sum_{j = 1}^{N}|f(i - x, j - y, t - 1)| \leq \sum_{i = 1}^{N}\sum_{j = 1}^{N}|f(i, j, t) - f(i - x, j - y, t - 1)|. \tag{4} i=1∑Nj=1∑N∣f(i,j,t)∣−i=1∑Nj=1∑N∣f(i−x,j−y,t−1)∣≤i=1∑Nj=1∑N∣f(i,j,t)−f(i−x,j−y,t−1)∣.(4)

(5) ∑ i = 1 N ∑ j = 1 N ∣ f ( i − x , j − y , t − 1 ) ∣ − ∑ i = 1 N ∑ j = 1 N ∣ f ( i , j , t ) ∣ ≤ ∑ i = 1 N ∑ j = 1 N ∣ f ( i , j , t ) − f ( i − x , j − y , t − 1 ) ∣ . \sum_{i = 1}^{N}\sum_{j = 1}^{N}|f(i - x, j - y, t - 1)| - \sum_{i = 1}^{N}\sum_{j = 1}^{N}|f(i, j, t)| \leq \sum_{i = 1}^{N}\sum_{j = 1}^{N}|f(i, j, t) - f(i - x, j - y, t - 1)|. \tag{5} i=1∑Nj=1∑N∣f(i−x,j−y,t−1)∣−i=1∑Nj=1∑N∣f(i,j,t)∣≤i=1∑Nj=1∑N∣f(i,j,t)−f(i−x,j−y,t−1)∣.(5)

将其用特殊符号表示，得到下式，分别于（4）、（5）两式对应。

(6) R − M ( x , y ) ≤ M A D ( x , y ) . R - M(x, y) \leq MAD(x, y). \tag{6} R−M(x,y)≤MAD(x,y).(6)

(7) M ( x , y ) − R ≤ M A D ( x , y ) . M(x, y) - R \leq MAD(x, y). \tag{7} M(x,y)−R≤MAD(x,y).(7)

假设在块匹配的最开始，我们得到一个起始运动向量 (m, n)，对应MAD(m, n)。显然，为获取更佳的匹配坐标，我们需要去关注运动向量 (x, y)，其满足下式：

(8) M A D ( x , y ) ≤ M A D ( m , n ) . MAD(x, y) \leq MAD(m, n). \tag{8} MAD(x,y)≤MAD(m,n).(8)

进而得到：

(9) ∣ R − M ( x , y ) ∣ ≤ M A D ( m , n ) . |R - M(x, y)| \leq MAD(m, n). \tag{9} ∣R−M(x,y)∣≤MAD(m,n).(9)

式（9）就是用于SEA算法的不等式。具体详细的计算只会对满足这个不等式的块执行，其数量显然小于搜索窗中所有块的数量。因此，在此基础上，就可以得到一个降低搜索匹配过程复杂度，当并没有将最佳选项排除在外的块匹配算法。但是，该算法的效率也就由不等式中的求和运算速度和起始运动向量决定。

应用实例

该算法在x265编码器中有具体的实现。以下是其代码。

case X265_SEA:{// Successive Elimination Algorithm// omv -> current search origin or starting point//mvmin 和 mvmax 表示运动向量的范围//merange 表示搜索范围--搜索窗口的大小const int16_t minX = X265_MAX(omv.x - (int16_t)merange, mvmin.x);  //这里计算出了运动向量的变动范围const int16_t minY = X265_MAX(omv.y - (int16_t)merange, mvmin.y);const int16_t maxX = X265_MIN(omv.x + (int16_t)merange, mvmax.x);const int16_t maxY = X265_MIN(omv.y + (int16_t)merange, mvmax.y);const uint16_t *p_cost_mvx = m_cost_mvx - qmvp.x;const uint16_t *p_cost_mvy = m_cost_mvy - qmvp.y;int16_t* meScratchBuffer = NULL;int scratchSize = merange * 2 + 4;if (scratchSize){meScratchBuffer = X265_MALLOC(int16_t, scratchSize);memset(meScratchBuffer, 0, sizeof(int16_t)* scratchSize);}/* SEA is fastest in multiples of 4 */int meRangeWidth = (maxX - minX + 3) & ~3;int w = 0, h = 0;                    // Width and height of the PUALIGN_VAR_32(pixel, zero[64 * FENC_STRIDE]) = { 0 }; // ALIGN_VAR_32(int, encDC[4]);uint16_t *fpelCostMvX = m_fpelMvCosts[-qmvp.x & 3] + (-qmvp.x >> 2);sizesFromPartition(partEnum, &w, &h);int deltaX = (w <= 8) ? (w) : (w >> 1); //若源矩阵可分离则为width/2，否则为widthint deltaY = (h <= 8) ? (h) : (h >> 1); //同上 height/* Check if very small rectangular blocks which cannot be sub-divided anymore */bool smallRectPartition = partEnum == LUMA_4x4 || partEnum == LUMA_16x12 ||partEnum == LUMA_12x16 || partEnum == LUMA_16x4 || partEnum == LUMA_4x16;/* Check if vertical partition */bool verticalRect = partEnum == LUMA_32x64 || partEnum == LUMA_16x32 || partEnum == LUMA_8x16 ||partEnum == LUMA_4x8;/* Check if horizontal partition */bool horizontalRect = partEnum == LUMA_64x32 || partEnum == LUMA_32x16 || partEnum == LUMA_16x8 ||partEnum == LUMA_8x4;/* Check if assymetric vertical partition */bool assymetricVertical = partEnum == LUMA_12x16 || partEnum == LUMA_4x16 || partEnum == LUMA_24x32 ||partEnum == LUMA_8x32 || partEnum == LUMA_48x64 || partEnum == LUMA_16x64;/* Check if assymetric horizontal partition */bool assymetricHorizontal = partEnum == LUMA_16x12 || partEnum == LUMA_16x4 || partEnum == LUMA_32x24 ||partEnum == LUMA_32x8 || partEnum == LUMA_64x48 || partEnum == LUMA_64x16;int tempPartEnum = 0;/* If a vertical rectangular partition, it is horizontally split into two, for ads_x2() */if (verticalRect)tempPartEnum = partitionFromSizes(w, h >> 1);/* If a horizontal rectangular partition, it is vertically split into two, for ads_x2() */else if (horizontalRect)tempPartEnum = partitionFromSizes(w >> 1, h);/* We have integral planes introduced to account for assymetric partitions.* Hence all assymetric partitions except those which cannot be split into legal sizes,* are split into four for ads_x4() */else if (assymetricVertical || assymetricHorizontal)tempPartEnum = smallRectPartition ? partEnum : partitionFromSizes(w >> 1, h >> 1);/* General case: Square partitions. All partitions with width > 8 are split into four* for ads_x4(), for 4x4 and 8x8 we do ads_x1() */elsetempPartEnum = (w <= 8) ? partEnum : partitionFromSizes(w >> 1, h >> 1);/* Successive elimination by comparing DC before a full SAD,* because sum(abs(diff)) >= abs(diff(sum)). */ primitives.pu[tempPartEnum].sad_x4(zero,fenc,fenc + deltaX,fenc + deltaY * FENC_STRIDE,fenc + deltaX + deltaY * FENC_STRIDE,FENC_STRIDE,encDC);  //fenc为64x64+8的数组，所有stride为64(FENC_STRIDE)//encDC表示当前块的绝对值求和/* Assigning appropriate integral plane */uint32_t *sumsBase = NULL;switch (deltaX){case 32: if (deltaY % 24 == 0)sumsBase = integral[1];else if (deltaY == 8)sumsBase = integral[2];elsesumsBase = integral[0];break;case 24: sumsBase = integral[3];break;case 16: if (deltaY % 12 == 0)sumsBase = integral[5];else if (deltaY == 4)sumsBase = integral[6];elsesumsBase = integral[4];break;case 12: sumsBase = integral[7];break;case 8: if (deltaY == 32)sumsBase = integral[8];elsesumsBase = integral[9];break;case 4: if (deltaY == 16)sumsBase = integral[10];elsesumsBase = integral[11];break;default: sumsBase = integral[11]; //搜索窗口所有尺寸块的绝对值求和break;}if (partEnum == LUMA_64x64 || partEnum == LUMA_32x32 || partEnum == LUMA_16x16 ||partEnum == LUMA_32x64 || partEnum == LUMA_16x32 || partEnum == LUMA_8x16 ||partEnum == LUMA_4x8 || partEnum == LUMA_12x16 || partEnum == LUMA_4x16 ||partEnum == LUMA_24x32 || partEnum == LUMA_8x32 || partEnum == LUMA_48x64 ||partEnum == LUMA_16x64)deltaY *= (int)stride;if (verticalRect)encDC[1] = encDC[2];if (horizontalRect)deltaY = deltaX;/* ADS and SAD */MV tmv;for (tmv.y = minY; tmv.y <= maxY; tmv.y++){int i, xn;int ycost = p_cost_mvy[tmv.y] << 2;if (bcost <= ycost)continue;bcost -= ycost;/* ADS_4 for 16x16, 32x32, 64x64, 24x32, 32x24, 48x64, 64x48, 32x8, 8x32, 64x16, 16x64 partitions* ADS_1 for 4x4, 8x8, 16x4, 4x16, 16x12, 12x16 partitions* ADS_2 for all other rectangular partitions */xn = ads(encDC,sumsBase + minX + tmv.y * stride,deltaY,fpelCostMvX + minX,meScratchBuffer,meRangeWidth,bcost); //bcost表示参考帧和当前帧中对应块的残差//fpelCostMvX表示向量中x元素所需的比特数for (i = 0; i < xn - 2; i += 3)COST_MV_X3_ABS(minX + meScratchBuffer[i], tmv.y,minX + meScratchBuffer[i + 1], tmv.y,minX + meScratchBuffer[i + 2], tmv.y); //fenc和fref的ABS + 向量x元素所需要的比特数 比较更新bcostbcost += ycost; for (; i < xn; i++)COST_MV(minX + meScratchBuffer[i], tmv.y); //fenc和fref的ABS + 向量x y元素所需要的比特数 比较更新bcost}if (meScratchBuffer)x265_free(meScratchBuffer);break;}

x265的实现还是比较复杂，这里就不具体展开了，但其核心代码还是能体现出SEA的思想的。该算法的其他内容可以参考下面这个论文。

[1] W. LI and E. Salari. Successive Elimination Algorithm for Motion Estimation. IEEE TRANSACRTIONS ON IMAGE PROCESSING.