Parallel Acceleration in Practice: Bilateral Filter
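An earlier post analyzed parallel acceleration of the 2-D median filter.
The 2-D median filter is control-intensive (it is dominated by sorting), so SSE gave little speedup there.
This time the target is the bilateral filter, which is compute-intensive.
The goal is to optimize and analyze the running time of a bilateral filter with a 5*5 kernel.
Four terms are used below: main region, full image, main loop, and complete (initialization + main loop).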
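1. With a 5*5 kernel (radius 2, width 2*2+1), the band along the four image borders can no longer be ignored.
So the running time of main-region filtering (the border band is copied through unfiltered) is compared with that of full-image filtering below.
2. The fast algorithm also uses lookup tables, so the filter is stateful and needs an initialization step.
So the main-loop time is compared with the time including initialization below.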
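For reference, the weighting that the code later in this post computes can be written out as below. This is read off the lookup-table and main-loop code rather than stated by the post itself; the symbols $\sigma_s$ (for `sigmaSp`), $\sigma_v$ (for `sigmaVal`) and $\Omega_r(p)$ (the disk of radius $r$ around pixel $p$) are names introduced here for illustration.

```latex
\hat{I}(p) =
  \frac{I(p) + \sum_{q \in \Omega_r(p),\, q \neq p} w_s(p,q)\, w_v(p,q)\, I(q)}
       {1 + \sum_{q \in \Omega_r(p),\, q \neq p} w_s(p,q)\, w_v(p,q)}

w_s(p,q) = \exp\!\left(-\frac{\lVert p - q \rVert^2}{2\sigma_s^2}\right),
\qquad
w_v(p,q) = \exp\!\left(-\frac{\bigl(I(q) - I(p)\bigr)^2}{2\sigma_v^2}\right)
```

The center pixel enters with weight 1, which is why the code adds `val0` to the numerator and `1.f` to the denominator. Both weights depend only on the radius and the two sigmas, so they can be tabulated once; in the code the value table is indexed by the absolute difference clamped to `BF_BUF_LEN - 1`, and the last entry is set to 0.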
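Summary first:
For the same algorithm version, floating point turned out faster than integer, and SSE on the floating-point version fell well short of a 4x speedup (5.198 ms to 4.778 ms in the full-image main loop, about 1.09x).
The current explanation is that the compiler already optimizes the scalar floating-point code automatically, speeding it up by 2 to 3 times, which leaves it slightly faster than the integer version.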
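| Version | Running time |
|---|---|
| Old integer, main region, main loop | 8.766 ms |
| New integer, full image, main loop | 6.324 ms |
| OpenCV float, full image, complete | 5.713 ms |
| New float, full image, complete | 5.301 ms |
| SSE float, full image, main loop | 4.778 ms |
| omp float, full image, main loop | 1.527 ms |
| SSE+omp float, full image, main loop | 1.355 ms |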
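The speedups implied by these timings can be worked out directly. The sketch below is only arithmetic on numbers reported in this post; it uses the 5.198 ms float full-image main-loop time from the detailed tables as the baseline, and the function names are made up for illustration.

```c
#include <stdio.h>

/* Speedup of an optimized variant relative to a baseline, both in ms. */
double speedup(double baseline_ms, double optimized_ms)
{
    return baseline_ms / optimized_ms;
}

/* Print the speedups over the scalar float full-image main loop.
   All timings are copied from this post's tables. */
void print_speedups(void)
{
    const double base = 5.198;  /* float, full image, main loop */
    printf("SSE     : %.2fx\n", speedup(base, 4.778));
    printf("omp     : %.2fx\n", speedup(base, 1.527));
    printf("SSE+omp : %.2fx\n", speedup(base, 1.355));
}
```

With these numbers SSE alone gives roughly 1.09x, omp roughly 3.4x, and the two combined roughly 3.8x.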
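Parallel optimization analysis
1. Integer bilateral filter
| Version | Running time |
|---|---|
| Old integer, main region, main loop | 8.766 ms |
| New integer, full image, main loop | 6.324 ms |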
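2. Integer and floating-point bilateral filters
| Version | Running time |
|---|---|
| New integer, full image, main loop | 6.324 ms |
| New float, full image, complete | 5.301 ms |
| OpenCV float, full image, complete | 5.713 ms |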
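3. Floating-point SSE bilateral filter: main region vs. full image
| Version | Region | Main loop | Complete |
|---|---|---|---|
| Float | Main region | 5.002 ms | 5.078 ms |
| Float | Full image | 5.198 ms | 5.301 ms |
| Float SSE | Main region | 4.658 ms | 4.764 ms |
| Float SSE | Full image | 4.778 ms | 5.076 ms |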
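4. Floating-point SSE and omp bilateral filter
| Version | Region | Main loop |
|---|---|---|
| Float | Main region | 5.002 ms |
| Float | Full image | 5.198 ms |
| Float SSE | Main region | 4.658 ms |
| Float SSE | Full image | 4.778 ms |
| Float omp | Main region | 1.434 ms |
| Float omp | Full image | 1.527 ms |
| Float SSE+omp | Main region | 1.285 ms |
| Float SSE+omp | Full image | 1.355 ms |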
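I. Refactored the integer-optimized bilateral filter
The previous version filtered over a rectangular neighborhood; the new one uses a circular neighborhood, which cuts the computation nearly in half.
Old integer bilateral filter, main region, main loop: 8.766 ms
New integer bilateral filter, full image, main loop: 6.324 ms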
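The "nearly half" figure can be checked by counting taps. The sketch below counts the lattice points in the square window and in the disk, using the same inclusion test as the post's code; the function names are illustrative.

```c
/* Number of taps in the full (2r+1) x (2r+1) square window. */
int square_taps(int radius)
{
    int d = 2 * radius + 1;
    return d * d;
}

/* Number of taps with i*i + j*j <= r*r. For these small integers this is
   the same test as the post's sqrtf(i*i + j*j) <= radius. */
int disk_taps(int radius)
{
    int i, j, count = 0;
    for (i = -radius; i <= radius; i++)
        for (j = -radius; j <= radius; j++)
            if (i * i + j * j <= radius * radius)
                count++;
    return count;
}
```

For radius 2 the disk keeps 13 of the 25 taps, or 12 of 24 neighbors once the center is excluded, which matches the claim. For radius 5 it keeps 81 of 121.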
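II. Designed a floating-point bilateral filter
Float bilateral filter, main region, main loop: 5.002 ms
Float bilateral filter, main region, complete: 5.078 ms
Float bilateral filter, full image, main loop: 5.198 ms
Float bilateral filter, full image, complete: 5.301 ms
OpenCV float bilateral filter, full image, complete: 5.713 ms
Full-image filtering takes about 0.2 ms longer than main-region filtering.
Algorithm initialization takes about 0.1 ms.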
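III. Added an SSE-accelerated bilateral filter
Float SSE bilateral filter, main region, main loop: 4.658 ms
Float SSE bilateral filter, main region, complete: 4.764 ms
Float SSE bilateral filter, full image, main loop: 4.778 ms
Float SSE bilateral filter, full image, complete: 5.076 ms
Full-image filtering takes about 0.1 ms longer than main-region filtering.
Algorithm initialization takes about 0.2 ms (0.106 ms for the main region and 0.298 ms for the full image by the figures above).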
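The SSE main loop in the code below reduces four weights and four weighted pixel values with two `_mm_hadd_ps` calls, so that lane 0 ends up holding the weight sum and lane 1 the weighted value sum. The following is a portable scalar model of that reduction, not the intrinsic itself; `hadd4` and `reduce_pair` are hypothetical helper names.

```c
/* Scalar model of _mm_hadd_ps(a, b), which returns
   {a0+a1, a2+a3, b0+b1, b2+b3}. A temporary is used so that
   out may alias a or b. */
void hadd4(const float a[4], const float b[4], float out[4])
{
    float t[4];
    int i;
    t[0] = a[0] + a[1];
    t[1] = a[2] + a[3];
    t[2] = b[0] + b[1];
    t[3] = b[2] + b[3];
    for (i = 0; i < 4; i++)
        out[i] = t[i];
}

/* Reduce four weights w and four weighted values wv the way the SSE
   loop does: after the second hadd, lane 0 is sum(w) and lane 1 is
   sum(wv). */
void reduce_pair(const float w[4], const float wv[4], float *wsum, float *sum)
{
    float s[4];
    hadd4(w, wv, s);  /* {w0+w1, w2+w3, wv0+wv1, wv2+wv3} */
    hadd4(s, s, s);   /* {sum(w), sum(wv), sum(w), sum(wv)} */
    *wsum = s[0];
    *sum = s[1];
}
```

Only the multiplies and this reduction are vectorized in the post's loop. The neighbor pixels and the value weights are still gathered one element at a time with `_mm_set_ps`, which may also contribute to the modest SSE gain, in addition to the compiler effect the summary mentions.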
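IV. Added an omp-accelerated bilateral filter
Float omp bilateral filter, main region, main loop: 1.434 ms
Float omp bilateral filter, full image, main loop: 1.527 ms
Full-image filtering takes about 0.1 ms longer than main-region filtering.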
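The omp variants in the code below differ from the serial ones mainly in how the row pointers are obtained: instead of advancing `psrc` and `pdst` at the end of each iteration, each iteration derives them from the row index. A minimal sketch of that pattern follows, with a trivial halving operation standing in for the filter kernel; all names here are illustrative.

```c
#include <stddef.h>

/* Placeholder per-pixel work standing in for the filter kernel. */
static unsigned short half(unsigned short v)
{
    return (unsigned short)(v / 2);
}

/* Serial form: the row pointers are advanced at the end of each
   iteration, so iteration i depends on iteration i - 1. */
void rows_serial(const unsigned short *src, unsigned short *dst,
                 int width, int height, int step)
{
    const unsigned short *ps = src;
    unsigned short *pd = dst;
    int i, j;
    for (i = 0; i < height; i++) {
        for (j = 0; j < width; j++)
            pd[j] = half(ps[j]);
        ps += step;
        pd += step;
    }
}

/* Parallel form: each iteration computes its own row pointers from i,
   so rows can be handed to different threads. Without OpenMP enabled
   the pragma is ignored and the loop simply runs serially. */
void rows_omp(const unsigned short *src, unsigned short *dst,
              int width, int height, int step)
{
    int i;
#pragma omp parallel for
    for (i = 0; i < height; i++) {
        const unsigned short *ps = src + (size_t)i * step;
        unsigned short *pd = dst + (size_t)i * step;
        int j;
        for (j = 0; j < width; j++)
            pd[j] = half(ps[j]);
    }
}
```

Declaring the per-row variables inside the loop body keeps them private to each thread, which is also what the post's omp functions do.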
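V. Added an SSE+omp-accelerated bilateral filter
Float SSE+omp bilateral filter, main region, main loop: 1.285 ms
Float SSE+omp bilateral filter, full image, main loop: 1.355 ms
Full-image filtering takes about 0.1 ms longer than main-region filtering.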
Detailed timings for each variant:
- opencv full 5.713ms
- mainBody mainLoop 5.002ms
- mainBody 5.078ms
- Full mainLoop 5.198ms
- Full 5.301ms
- mainBody sse mainLoop 4.658ms
- mainBody sse 4.764ms
- Full sse mainLoop 4.778ms
- Full sse 5.076ms
- mainBody omp mainLoop 1.434ms
- Full omp mainLoop 1.527ms
- mainBody sse_omp mainLoop 1.285ms
- Full sse_omp mainLoop 1.355ms
- Int Full mainLoop 6.324ms
- INT old version mainBody mainLoop 8.766ms
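The code is as follows.
Macro definitions
The integer bilateral filter risks overflow in its sums, so the filter diameter is limited to 11 (radius 5).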
// Headers needed by the code below (MSVC; _mm_hadd_ps requires SSE3).
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <malloc.h>
#include <pmmintrin.h>
// The type aliases S32, U16, U32, F32 and the helper macros ABS and MIN
// are used below but are not shown in this post.

#define MALLOC malloc
#define FREE(p) if (p != NULL) { free(p); p = NULL;}
#define ALIGN16 __declspec(align(16))
#define ALIGN_MALLOC16(n) _aligned_malloc(n, 16)
#define ALIGN_MALLOC32(n) _aligned_malloc(n, 32)
#define ALIGN_MALLOC64(n) _aligned_malloc(n, 64)
#define ALIGN_MALLOC128(n) _aligned_malloc(n, 128)
#define ALIGN_FREE(p) if (p != NULL) { _aligned_free(p); p = NULL;}
#define BF_INT_BITS (10)
#define BF_INT_SCALE (1 << BF_INT_BITS)
#define BF_INT_SHIFT ((S32)(((BF_INT_BITS + 1) << 1) - 31 + 16) + 7) // 7 more bits [> 11*11]
#define BF_INT_BITS2 ((S32)((BF_INT_BITS << 1) - BF_INT_SHIFT))
#define BF_INT_SCALE2 (1 << BF_INT_BITS2)
#define BF_BUF_LEN (1024)
#define EDGEPRES_R_MAX (5)
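The overflow limit can be checked with a little arithmetic on these macros. The sketch below mirrors `BF_INT_SHIFT` and `BF_INT_BITS2` and bounds the integer numerator for a given number of neighbor taps. It assumes full-range 16-bit pixels and reads the comment "7 more bits" as headroom for up to 128 taps; both are my interpretation, not something the post states, and the function names are illustrative.

```c
enum { INT_BITS = 10 };  /* mirrors BF_INT_BITS */

/* Mirrors BF_INT_SHIFT: (11 << 1) - 31 + 16 + 7 = 14. */
int bf_int_shift(void)
{
    return ((INT_BITS + 1) << 1) - 31 + 16 + 7;
}

/* Mirrors BF_INT_BITS2: 20 - 14 = 6 fractional bits kept per weight. */
int bf_int_bits2(void)
{
    return (INT_BITS << 1) - bf_int_shift();
}

/* Upper bound on the numerator sum + (val0 << BF_INT_BITS2) for a given
   number of neighbor taps, assuming every pixel is 65535 and every
   combined weight is at its maximum. */
long long worst_case_sum(int taps)
{
    long long w_max = (1LL << (2 * INT_BITS)) >> bf_int_shift();  /* 64 */
    long long v_max = 65535;
    return taps * v_max * w_max + (v_max << bf_int_bits2());
}
```

Even a full 11*11 square (120 neighbors plus the center) stays below the signed 32-bit limit under these assumptions, which is consistent with capping the radius at 5.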
Class definitions
typedef class edgePresFiltMain
{
public:
    edgePresFiltMain(S32 width, S32 height, S32 radius, F32 sigmaVal, F32 sigmaSp);
    ~edgePresFiltMain();
    void edgePreserve_mainBody(U16 *src, U16 *dst);
    void edgePreserve_mainBody_omp(U16 *src, U16 *dst);
    void edgePreserve_mainBody_sse(U16 *src, U16 *dst);
    void edgePreserve_mainBody_sse_omp(U16 *src, U16 *dst);
    void *hdl;
} edgePresFiltMain_;

typedef class edgePresFilt
{
public:
    edgePresFilt(S32 width, S32 height, S32 radius, F32 sigmaVal, F32 sigmaSp);
    ~edgePresFilt();
    void edgePreserve(U16 *src, U16 *dst);
    void edgePreserve_omp(U16 *src, U16 *dst);
    void edgePreserve_sse(U16 *src, U16 *dst);
    void edgePreserve_sse_omp(U16 *src, U16 *dst);
    void *hdl;
} edgePresFilt_;

typedef class edgePresFiltInt
{
public:
    edgePresFiltInt(S32 width, S32 height, S32 radius, F32 sigmaVal, F32 sigmaSp);
    ~edgePresFiltInt();
    void edgePreserve(U16 *src, U16 *dst);
    void edgePreserve_omp(U16 *src, U16 *dst); // declared only; no definition appears in this post
    void *hdl;
} edgePresFiltInt_;
Plain C function declarations
// no smoothing on the border
void edgePreserve_mainBody(U16 *src, U16 *dst, S32 width, S32 height,
    S32 radius, F32 sigmaVal, F32 sigmaSp);
// smoothing on the border
void edgePreserve(U16 *src, U16 *dst, S32 width, S32 height,
    S32 radius, F32 sigmaVal, F32 sigmaSp);
// no smoothing on the border
void edgePreserve_mainBody_sse(U16 *src, U16 *dst, S32 width, S32 height,
    S32 radius, F32 sigmaVal, F32 sigmaSp);
// smoothing on the border
void edgePreserve_sse(U16 *src, U16 *dst, S32 width, S32 height,
    S32 radius, F32 sigmaVal, F32 sigmaSp);
// smoothing on the border
void edgePreserveInt(U16 *src, U16 *dst, S32 width, S32 height,
    S32 radius, F32 sigmaVal, F32 sigmaSp);
Border handling
static void borderReflect(U16 *src, S32 width, S32 height, U16 *dst, S32 radius)
{
    S32 i = 0;
    S32 j = 0;
    S32 itmp1 = radius - 1;
    S32 itmp2 = -1 - radius;
    S32 width2 = width + (radius << 1);
    U16 *psrc = src;
    U16 *pdst = dst + width2 * radius;

    // Copy each source row into the padded buffer and mirror its left and right edges.
    for (i = 0; i < height; i++)
    {
        for (j = 0; j < radius; j++)
        {
            pdst[j] = psrc[itmp1 - j];
        }
        memcpy(pdst + radius, psrc, sizeof(U16) * width);
        psrc += width;
        pdst += width2;
        for (j = -1; j >= -radius; j--)
        {
            pdst[j] = psrc[itmp2 - j];
        }
    }

    // Mirror the last rows into the bottom padding.
    psrc = pdst - width2;
    for (i = 0; i < radius; i++)
    {
        memcpy(pdst, psrc, sizeof(U16) * width2);
        psrc -= width2;
        pdst += width2;
    }

    // Mirror the first rows into the top padding.
    psrc = dst + width2 * radius;
    pdst = dst + width2 * (radius - 1);
    for (i = 0; i < radius; i++)
    {
        memcpy(pdst, psrc, sizeof(U16) * width2);
        psrc += width2;
        pdst -= width2;
    }
}

// Copy the unfiltered border band (top rows, bottom rows, left and right columns) from src to dst.
static void borderCopy(U16 *src, U16 *dst, S32 width, S32 height, S32 radius)
{
    S32 i = 0;
    S32 j = 0;
    S32 xend = height - (radius << 1);
    U16 *psrc = src;
    U16 *pdst = dst;

    memcpy(pdst, psrc, sizeof(U16) * (width * radius - radius));
    psrc += radius * width;
    pdst += radius * width;
    for (i = 0; i < xend; i++)
    {
        memcpy(pdst - radius, psrc - radius, sizeof(U16) * (radius << 1));
        psrc += width;
        pdst += width;
    }
    memcpy(pdst - radius, psrc - radius, sizeof(U16) * (width * radius + radius));
}
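The mirroring that borderReflect applies to each row can be checked on a single row. The sketch below is a simplified one-row version written for illustration (`reflect_row` is not part of the post's code); it repeats the edge sample, as the indexing in borderReflect does.

```c
/* Pad one row of `width` samples with `radius` mirrored samples per side,
   repeating the edge sample (..., s1, s0 | s0, s1, ...).
   dst must hold width + 2 * radius samples; requires radius <= width. */
void reflect_row(const unsigned short *src, int width, int radius,
                 unsigned short *dst)
{
    int j;
    for (j = 0; j < radius; j++) {
        dst[j] = src[radius - 1 - j];                  /* left padding  */
        dst[radius + width + j] = src[width - 1 - j];  /* right padding */
    }
    for (j = 0; j < width; j++)
        dst[radius + j] = src[j];                      /* original row  */
}
```

For the row 1 2 3 4 5 and radius 2 this gives 2 1 1 2 3 4 5 5 4.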

// Build the lookup tables: neighbor offsets, spatial weights, and value weights.
static void edgePreserve_LUT(S32 radius, S32 width, F32 sigmaVal, F32 sigmaSp,
    S32 *spOfs, F32 *spWt, F32 *valWt)
{
    S32 i = 0;
    S32 j = 0;
    S32 k = 0;
    F32 sigmaValCoeff = -0.5f / (sigmaVal * sigmaVal);
    F32 sigmaSpCoeff = -0.5f / (sigmaSp * sigmaSp);

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                if ((i == 0) && (j == 0))
                {
                    continue;
                }
                spOfs[k] = i * width + j;
                spWt[k] = expf((i * i + j * j) * sigmaSpCoeff);
                k++;
            }
        }
    }
    for (i = 0; i < BF_BUF_LEN - 1; i++)
    {
        valWt[i] = expf(i * i * sigmaValCoeff);
    }
    valWt[i] = 0.f;
}

// Smooth main body of src to the dst img
// src img is bigger than dst img
static void edgePreserve_mainBody_process(U16 *src, S32 srcWidth, S32 srcHeight, S32 srcStep,
    U16 *dst, S32 dstStep, S32 radius, S32 *spOfs, F32 *spWt, F32 *valWt, S32 maxk)
{
    const S32 xEnd = srcWidth - radius;
    const S32 dstHeight = srcHeight - (radius << 1);
    S32 i = 0;
    S32 j = 0;
    S32 k = 0;
    U16 val0 = 0;
    U16 val = 0;
    S32 tmp = 0;
    F32 w = 0.f;
    F32 sum = 0.f;
    F32 wsum = 0.f;
    U16 *psrc = src + radius * srcStep;
    U16 *pdst = dst;

    for (i = 0; i < dstHeight; i++)
    {
        for (j = radius; j < xEnd; j++)
        {
            val0 = psrc[j];
            sum = 0.f;
            wsum = 0.f;
            for (k = 0; k < maxk; k++)
            {
                val = psrc[j + spOfs[k]];
                tmp = val - val0;
                tmp = ABS(tmp);
                tmp = MIN(tmp, BF_BUF_LEN - 1);
                w = spWt[k] * valWt[tmp];
                sum += val * w;
                wsum += w;
            }
            // The center pixel contributes with weight 1.
            pdst[j - radius] = (U16)((sum + val0) / (wsum + 1.f));
        }
        psrc += srcStep;
        pdst += dstStep;
    }
}

// Smooth main body of src to the dst img
// src img is bigger than dst img
static void edgePreserve_mainBody_omp_process(U16 *src, S32 srcWidth, S32 srcHeight, S32 srcStep,
    U16 *dst, S32 dstStep, S32 radius, S32 *spOfs, F32 *spWt, F32 *valWt, S32 maxk)
{
    const S32 xEnd = srcWidth - radius;
    const S32 dstHeight = srcHeight - (radius << 1);
    S32 i = 0;

#pragma omp parallel for
    for (i = 0; i < dstHeight; i++)
    {
        // Per-row state is declared inside the loop so each thread gets its own copy.
        S32 j = 0;
        S32 k = 0;
        U16 val0 = 0;
        U16 val = 0;
        S32 tmp = 0;
        F32 w = 0.f;
        F32 sum = 0.f;
        F32 wsum = 0.f;
        U16 *psrc = src + (radius + i) * srcStep;
        U16 *pdst = dst + i * dstStep;

        for (j = radius; j < xEnd; j++)
        {
            val0 = psrc[j];
            sum = 0.f;
            wsum = 0.f;
            for (k = 0; k < maxk; k++)
            {
                val = psrc[j + spOfs[k]];
                tmp = val - val0;
                tmp = ABS(tmp);
                tmp = MIN(tmp, BF_BUF_LEN - 1);
                w = spWt[k] * valWt[tmp];
                sum += val * w;
                wsum += w;
            }
            pdst[j - radius] = (U16)((sum + val0) / (wsum + 1.f));
        }
    }
}

// Smooth main body of src to the dst img
// src img is bigger than dst img
static void edgePreserve_mainBody_sse_process(U16 *src, S32 srcWidth, S32 srcHeight, S32 srcStep,
    U16 *dst, S32 dstStep, S32 radius, S32 *spOfs, F32 *spWt, F32 *valWt, S32 maxk)
{
    const S32 xEnd = srcWidth - radius;
    const S32 dstHeight = srcHeight - (radius << 1);
    const U32 ALIGN16 bufSignMask[] = { 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff };
    const F32 ALIGN16 bufLutLen[] = { BF_BUF_LEN - 1, BF_BUF_LEN - 1, BF_BUF_LEN - 1, BF_BUF_LEN - 1 };
    S32 ALIGN16 buf[4];
    F32 ALIGN16 bufSum[4];
    S32 i = 0;
    S32 j = 0;
    S32 k = 0;
    F32 val0 = 0.f;
    F32 sum = 0.f;
    F32 wsum = 0.f;
    U16 *psrc = src + radius * srcStep;
    U16 *pdst = dst;
    __m128 _val;
    __m128 _val0;
    __m128 _idx;
    __m128 _psum;
    __m128 _sw;
    __m128 _cw;
    __m128 _w;
    const __m128 _signMask = _mm_load_ps((const float*)bufSignMask);
    const __m128 _lutLen = _mm_load_ps((const float*)bufLutLen);

    for (i = 0; i < dstHeight; i++)
    {
        for (j = radius; j < xEnd; j++)
        {
            val0 = psrc[j] * 1.0f;
            _psum = _mm_setzero_ps();
            _val0 = _mm_set1_ps(val0);
            // maxk is a multiple of 4 for a disk kernel without its center, so no tail loop is needed.
            for (k = 0; k <= maxk - 4; k += 4)
            {
                _val = _mm_set_ps(psrc[j + spOfs[k + 3]], psrc[j + spOfs[k + 2]],
                    psrc[j + spOfs[k + 1]], psrc[j + spOfs[k]]);
                // _sw = _mm_loadu_ps(spWt + k);
                _sw = _mm_load_ps(spWt + k);
                // |val - val0| via sign-bit mask, clamped to the last LUT index
                _idx = _mm_and_ps(_signMask, _mm_sub_ps(_val, _val0));
                _mm_store_si128((__m128i*)buf, _mm_cvtps_epi32(_mm_min_ps(_idx, _lutLen)));
                _cw = _mm_set_ps(valWt[buf[3]], valWt[buf[2]], valWt[buf[1]], valWt[buf[0]]);
                _w = _mm_mul_ps(_cw, _sw);
                _val = _mm_mul_ps(_w, _val);
                // after two hadds: lane 0 = sum of weights, lane 1 = sum of weighted values
                _sw = _mm_hadd_ps(_w, _val);
                _sw = _mm_hadd_ps(_sw, _sw);
                _psum = _mm_add_ps(_sw, _psum);
            }
            _mm_storel_pi((__m64*)bufSum, _psum);
            sum = bufSum[1] + val0;
            wsum = bufSum[0] + 1.f;
            pdst[j - radius] = (U16)(sum / wsum);
        }
        psrc += srcStep;
        pdst += dstStep;
    }
}

// Smooth main body of src to the dst img
// src img is bigger than dst img
static void edgePreserve_mainBody_sse_omp_process(U16 *src, S32 srcWidth, S32 srcHeight, S32 srcStep,
    U16 *dst, S32 dstStep, S32 radius, S32 *spOfs, F32 *spWt, F32 *valWt, S32 maxk)
{
    const S32 xEnd = srcWidth - radius;
    const S32 dstHeight = srcHeight - (radius << 1);
    const U32 ALIGN16 bufSignMask[] = { 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff };
    const F32 ALIGN16 bufLutLen[] = { BF_BUF_LEN - 1, BF_BUF_LEN - 1, BF_BUF_LEN - 1, BF_BUF_LEN - 1 };
    const __m128 _signMask = _mm_load_ps((const float*)bufSignMask);
    const __m128 _lutLen = _mm_load_ps((const float*)bufLutLen);
    S32 i = 0;

#pragma omp parallel for
    for (i = 0; i < dstHeight; i++)
    {
        S32 ALIGN16 buf[4];
        F32 ALIGN16 bufSum[4];
        S32 j = 0;
        S32 k = 0;
        F32 val0 = 0.f;
        F32 sum = 0.f;
        F32 wsum = 0.f;
        U16 *psrc = src + (radius + i) * srcStep;
        U16 *pdst = dst + i * dstStep;
        __m128 _val;
        __m128 _val0;
        __m128 _idx;
        __m128 _psum;
        __m128 _sw;
        __m128 _cw;
        __m128 _w;

        for (j = radius; j < xEnd; j++)
        {
            val0 = psrc[j] * 1.0f;
            _psum = _mm_setzero_ps();
            _val0 = _mm_set1_ps(val0);
            for (k = 0; k <= maxk - 4; k += 4)
            {
                _val = _mm_set_ps(psrc[j + spOfs[k + 3]], psrc[j + spOfs[k + 2]],
                    psrc[j + spOfs[k + 1]], psrc[j + spOfs[k]]);
                // _sw = _mm_loadu_ps(spWt + k);
                _sw = _mm_load_ps(spWt + k);
                _idx = _mm_and_ps(_signMask, _mm_sub_ps(_val, _val0));
                _mm_store_si128((__m128i*)buf, _mm_cvtps_epi32(_mm_min_ps(_idx, _lutLen)));
                _cw = _mm_set_ps(valWt[buf[3]], valWt[buf[2]], valWt[buf[1]], valWt[buf[0]]);
                _w = _mm_mul_ps(_cw, _sw);
                _val = _mm_mul_ps(_w, _val);
                _sw = _mm_hadd_ps(_w, _val);
                _sw = _mm_hadd_ps(_sw, _sw);
                _psum = _mm_add_ps(_sw, _psum);
            }
            _mm_storel_pi((__m64*)bufSum, _psum);
            sum = bufSum[1] + val0;
            wsum = bufSum[0] + 1.f;
            pdst[j - radius] = (U16)(sum / wsum);
        }
    }
}

// no smoothing on the border
void edgePreserve_mainBody(U16 *src, U16 *dst, S32 width, S32 height,
    S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1; // exclude the center
    S32 *spOfs = NULL;
    F32 *spWt = NULL;
    F32 *valWt = NULL;

    // Count the taps inside the disk of the given radius.
    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }
    spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    edgePreserve_LUT(radius, width, sigmaVal, sigmaSp, spOfs, spWt, valWt);
    borderCopy(src, dst, width, height, radius);
    edgePreserve_mainBody_process(src, width, height, width,
        dst + width * radius + radius, width,
        radius, spOfs, spWt, valWt, maxk);
    ALIGN_FREE(spWt);
    FREE(spOfs);
    FREE(valWt);
}

// smoothing on the border
void edgePreserve(U16 *src, U16 *dst, S32 width, S32 height,
    S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1; // exclude the center
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    S32 *spOfs = NULL;
    F32 *spWt = NULL;
    F32 *valWt = NULL;
    U16 *buf = (U16 *)MALLOC(sizeof(U16) * width2 * height2);

    borderReflect(src, width, height, buf, radius);
    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }
    spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    edgePreserve_LUT(radius, width2, sigmaVal, sigmaSp, spOfs, spWt, valWt);
    edgePreserve_mainBody_process(buf, width2, height2, width2,
        dst, width, radius, spOfs, spWt, valWt, maxk);
    ALIGN_FREE(spWt);
    FREE(spOfs);
    FREE(valWt);
    FREE(buf);
}

// no smoothing on the border
void edgePreserve_mainBody_sse(U16 *src, U16 *dst, S32 width, S32 height,
    S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1; // exclude the center
    S32 *spOfs = NULL;
    F32 *spWt = NULL;
    F32 *valWt = NULL;

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }
    spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    edgePreserve_LUT(radius, width, sigmaVal, sigmaSp, spOfs, spWt, valWt);
    borderCopy(src, dst, width, height, radius);
    // edgePreserve_noBorder_sse_mainloop(src, dst, width, height, radius, spOfs, spWt, valWt, maxk);
    edgePreserve_mainBody_sse_process(src, width, height, width,
        dst + width * radius + radius, width,
        radius, spOfs, spWt, valWt, maxk);
    ALIGN_FREE(spWt);
    FREE(spOfs);
    FREE(valWt);
}

// smoothing on the border
void edgePreserve_sse(U16 *src, U16 *dst, S32 width, S32 height,
    S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1; // exclude the center
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    S32 *spOfs = NULL;
    F32 *spWt = NULL;
    F32 *valWt = NULL;
    U16 *buf = (U16 *)MALLOC(sizeof(U16) * width2 * height2);

    borderReflect(src, width, height, buf, radius);
    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }
    spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    edgePreserve_LUT(radius, width2, sigmaVal, sigmaSp, spOfs, spWt, valWt);
    // edgePreserve_sse_mainloop(buf, dst, width, height, radius, spOfs, spWt, valWt, maxk);
    edgePreserve_mainBody_sse_process(buf, width2, height2, width2,
        dst, width, radius, spOfs, spWt, valWt, maxk);
    ALIGN_FREE(spWt);
    FREE(spOfs);
    FREE(valWt);
    FREE(buf);
}

typedef struct EDGE_PRES_FILT_HDL
{
    S32 maxk;
    S32 width;
    S32 height;
    S32 radius;
    S32 *spOfs;
    F32 *spWt;
    F32 *valWt;
    U16 *buf;
} EDGE_PRES_FILT_HDL_;

edgePresFiltMain::edgePresFiltMain(S32 width, S32 height, S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1; // exclude the center
    EDGE_PRES_FILT_HDL *pHdl = NULL;

    pHdl = new EDGE_PRES_FILT_HDL;
    hdl = pHdl;
    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }
    pHdl->maxk = maxk;
    pHdl->width = width;
    pHdl->height = height;
    pHdl->radius = radius;
    pHdl->spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    pHdl->spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    pHdl->valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    edgePreserve_LUT(radius, width, sigmaVal, sigmaSp, pHdl->spOfs, pHdl->spWt, pHdl->valWt);
}

edgePresFiltMain::~edgePresFiltMain()
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    ALIGN_FREE(pHdl->spWt);
    FREE(pHdl->spOfs);
    FREE(pHdl->valWt);
    delete pHdl; // delete through the typed pointer; deleting a void* is undefined behavior
}

void edgePresFiltMain::edgePreserve_mainBody(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    borderCopy(src, dst, width, height, radius);
    edgePreserve_mainBody_process(src, width, height, width,
        dst + width * radius + radius, width,
        radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFiltMain::edgePreserve_mainBody_omp(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    borderCopy(src, dst, width, height, radius);
    edgePreserve_mainBody_omp_process(src, width, height, width,
        dst + width * radius + radius, width,
        radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFiltMain::edgePreserve_mainBody_sse(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    borderCopy(src, dst, width, height, radius);
    edgePreserve_mainBody_sse_process(src, width, height, width,
        dst + width * radius + radius, width,
        radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFiltMain::edgePreserve_mainBody_sse_omp(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    borderCopy(src, dst, width, height, radius);
    edgePreserve_mainBody_sse_omp_process(src, width, height, width,
        dst + width * radius + radius, width,
        radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

edgePresFilt::edgePresFilt(S32 width, S32 height, S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1; // exclude the center
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    EDGE_PRES_FILT_HDL *pHdl = NULL;

    pHdl = new EDGE_PRES_FILT_HDL;
    hdl = pHdl;
    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }
    pHdl->maxk = maxk;
    pHdl->width = width;
    pHdl->height = height;
    pHdl->radius = radius;
    pHdl->spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    pHdl->spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    pHdl->valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    pHdl->buf = (U16 *)MALLOC(sizeof(U16) * width2 * height2);
    edgePreserve_LUT(radius, width2, sigmaVal, sigmaSp, pHdl->spOfs, pHdl->spWt, pHdl->valWt);
}

edgePresFilt::~edgePresFilt()
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    ALIGN_FREE(pHdl->spWt);
    FREE(pHdl->spOfs);
    FREE(pHdl->valWt);
    FREE(pHdl->buf);
    delete pHdl; // delete through the typed pointer; deleting a void* is undefined behavior
}

void edgePresFilt::edgePreserve(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    borderReflect(src, width, height, pHdl->buf, radius);
    edgePreserve_mainBody_process(pHdl->buf, width2, height2, width2,
        dst, width, radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFilt::edgePreserve_omp(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    borderReflect(src, width, height, pHdl->buf, radius);
    edgePreserve_mainBody_omp_process(pHdl->buf, width2, height2, width2,
        dst, width, radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFilt::edgePreserve_sse(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    borderReflect(src, width, height, pHdl->buf, radius);
    edgePreserve_mainBody_sse_process(pHdl->buf, width2, height2, width2,
        dst, width, radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFilt::edgePreserve_sse_omp(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    borderReflect(src, width, height, pHdl->buf, radius);
    edgePreserve_mainBody_sse_omp_process(pHdl->buf, width2, height2, width2,
        dst, width, radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

// Integer version: weights are stored as fixed-point values scaled by BF_INT_SCALE.
static void edgePreserveInt_LUT(S32 radius, S32 width, F32 sigmaVal, F32 sigmaSp,
    S32 *spOfs, S32 *spWt, S32 *valWt)
{
    S32 i = 0;
    S32 j = 0;
    S32 k = 0;
    F32 sigmaValCoeff = -0.5f / (sigmaVal * sigmaVal);
    F32 sigmaSpCoeff = -0.5f / (sigmaSp * sigmaSp);

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                if ((i == 0) && (j == 0))
                {
                    continue;
                }
                spOfs[k] = i * width + j;
                spWt[k] = (S32)(expf((i * i + j * j) * sigmaSpCoeff) * BF_INT_SCALE);
                k++;
            }
        }
    }
    for (i = 0; i < BF_BUF_LEN - 1; i++)
    {
        valWt[i] = (S32)(expf(i * i * sigmaValCoeff) * BF_INT_SCALE);
    }
    valWt[i] = 0;
}

// Smooth main body of src to the dst img
// src img is bigger than dst img
static void edgePreserveInt_mainBody_process(U16 *src, S32 srcWidth, S32 srcHeight, S32 srcStep,
    U16 *dst, S32 dstStep, S32 radius, S32 *spOfs, S32 *spWt, S32 *valWt, S32 maxk)
{
    const S32 xEnd = srcWidth - radius;
    const S32 dstHeight = srcHeight - (radius << 1);
    S32 i = 0;
    S32 j = 0;
    S32 k = 0;
    U16 val0 = 0;
    U16 val = 0;
    S32 tmp = 0;
    S32 w = 0;
    S32 sum = 0;
    S32 wsum = 0;
    U16 *psrc = src + radius * srcStep;
    U16 *pdst = dst;

    for (i = 0; i < dstHeight; i++)
    {
        for (j = radius; j < xEnd; j++)
        {
            val0 = psrc[j];
            sum = 0;
            wsum = 0;
            for (k = 0; k < maxk; k++)
            {
                val = psrc[j + spOfs[k]];
                tmp = val - val0;
                tmp = ABS(tmp);
                tmp = MIN(tmp, BF_BUF_LEN - 1);
                w = spWt[k] * valWt[tmp];
                w >>= BF_INT_SHIFT; // drop precision so the 32-bit sums cannot overflow
                wsum += w;
                sum += (val * w);
            }
            pdst[j - radius] = (U16)((sum + (val0 << BF_INT_BITS2)) / (wsum + BF_INT_SCALE2));
        }
        psrc += srcStep;
        pdst += dstStep;
    }
}

// smoothing on the border
void edgePreserveInt(U16 *src, U16 *dst, S32 width, S32 height,
    S32 radius_, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1; // exclude the center
    // The original wrote GP_MIN / GP_EDGEPRES_R_MAX; the macros shown in this post are MIN / EDGEPRES_R_MAX.
    S32 radius = MIN(radius_, EDGEPRES_R_MAX);
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    S32 *spOfs = NULL;
    S32 *spWt = NULL;
    S32 *valWt = NULL;
    U16 *buf = (U16 *)MALLOC(sizeof(U16) * width2 * height2);

    borderReflect(src, width, height, buf, radius);
    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }
    spWt = (S32 *)ALIGN_MALLOC16(sizeof(S32) * maxk);
    spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    valWt = (S32 *)MALLOC(sizeof(S32) * BF_BUF_LEN);
    edgePreserveInt_LUT(radius, width2, sigmaVal, sigmaSp, spOfs, spWt, valWt);
    edgePreserveInt_mainBody_process(buf, width2, height2, width2,
        dst, width, radius, spOfs, spWt, valWt, maxk);
    ALIGN_FREE(spWt);
    FREE(spOfs);
    FREE(valWt);
    FREE(buf);
}

typedef struct EDGE_PRES_FILT_INT_HDL
{
    S32 maxk;
    S32 width;
    S32 height;
    S32 radius;
    S32 *spOfs;
    S32 *spWt;
    S32 *valWt;
    U16 *buf;
} EDGE_PRES_FILT_INT_HDL_;

edgePresFiltInt::edgePresFiltInt(S32 width, S32 height, S32 radius_, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1; // exclude the center
    // The original wrote GP_MIN / GP_EDGEPRES_R_MAX; the macros shown in this post are MIN / EDGEPRES_R_MAX.
    S32 radius = MIN(radius_, EDGEPRES_R_MAX);
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    EDGE_PRES_FILT_INT_HDL *pHdl = NULL;

    pHdl = new EDGE_PRES_FILT_INT_HDL;
    hdl = pHdl;
    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }
    pHdl->maxk = maxk;
    pHdl->width = width;
    pHdl->height = height;
    pHdl->radius = radius;
    pHdl->spWt = (S32 *)ALIGN_MALLOC16(sizeof(S32) * maxk);
    pHdl->spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    pHdl->valWt = (S32 *)MALLOC(sizeof(S32) * BF_BUF_LEN);
    pHdl->buf = (U16 *)MALLOC(sizeof(U16) * width2 * height2);
    edgePreserveInt_LUT(radius, width2, sigmaVal, sigmaSp, pHdl->spOfs, pHdl->spWt, pHdl->valWt);
}

edgePresFiltInt::~edgePresFiltInt()
{
    EDGE_PRES_FILT_INT_HDL *pHdl = (EDGE_PRES_FILT_INT_HDL *)hdl;
    ALIGN_FREE(pHdl->spWt);
    FREE(pHdl->spOfs);
    FREE(pHdl->valWt);
    FREE(pHdl->buf);
    delete pHdl; // delete through the typed pointer; deleting a void* is undefined behavior
}

void edgePresFiltInt::edgePreserve(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_INT_HDL *pHdl = (EDGE_PRES_FILT_INT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    borderReflect(src, width, height, pHdl->buf, radius);
    edgePreserveInt_mainBody_process(pHdl->buf, width2, height2, width2,
        dst, width, radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}