[Kaggle] Reading notes on an image denoising solution write-up

The original write-up is here:
* Image Processing + Machine Learning in R: Denoising Dirty Documents Tutorial Series
Stand on the shoulders of giants. I read through it to pick up some experience.

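Photoshop has a Curves command: the horizontal axis is the input value range and the vertical axis is the output value range. In the simplest case, the denoising here can be seen as working out how to generate that curve (and you quickly find that tuning it by hand is really hard! (･ิϖ･ิ)). In more complex cases, of course, the input can include more than the original pixel value (neighboring pixels, for example).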

I should still take notes, otherwise I just skim the article and absorb nothing.

Part 1: Least squares regression

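Linear regression, really?! A bit of a shock. X is the value from the dirty image (one per pixel) and Y is the value from the clean image, so the fit yields just one weight and one intercept term. It feels of little use: isn't that the same as thresholding the raw value directly? The only benefit is probably that choosing the threshold is easier at predict time?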

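  # This concatenation statement is written incorrectly
  # dat = rbind(dat, cbind(y, x, x2))
  # At the final predict step; note that outliers were filtered out during training too.
  y[y < 0] = 0
  y[y > 1] = 1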
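To make the "one weight plus one intercept" point concrete, here is a minimal sketch in Python (the tutorial itself uses R). The pixel values are made up for illustration, and `fit_line` and `predict_clipped` are hypothetical helper names, not anything from the tutorial. It fits y = w * x + b by ordinary least squares and clips predictions to [0, 1], mirroring the two clipping lines above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def predict_clipped(xs, w, b):
    """Apply the fitted line, then clip to the valid pixel range [0, 1]."""
    return [min(1.0, max(0.0, w * x + b)) for x in xs]


# Made-up pixel intensities (0 = black, 1 = white), for illustration only.
dirty = [0.1, 0.3, 0.5, 0.7, 0.9]
clean = [0.0, 0.0, 0.5, 1.0, 1.0]

w, b = fit_line(dirty, clean)
cleaned = predict_clipped([0.0, 0.5, 1.0], w, b)
```

With a single feature the model can only stretch and shift the intensity scale, which is why it behaves much like a soft threshold on the raw value.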

Part 2: Image thresholding & gradient boosting machines

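k-means groups the pixels into 3 clusters: white, noise, and text. The midpoint between the noise and text clusters is then used as the boundary.
Then both the raw value and the binary value produced by the k-means thresholding are fed to gbm to learn from.
It does feel like you need a "baseline" value, otherwise the raw value X alone carries far too little information.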
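The thresholding step can be sketched as follows, in Python rather than the tutorial's R. This is a hand-rolled 1-D k-means under my own assumptions: the sample pixels, the initial centers, and the name `kmeans_1d` are all made up for illustration, and intensities run from 0 (black) to 1 (white).

```python
def kmeans_1d(values, centers, iterations=20):
    """Plain 1-D k-means: assign each value to its nearest center, then re-average."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


# Made-up pixels: dark text, mid-gray noise, near-white paper.
pixels = [0.05, 0.10, 0.08, 0.55, 0.60, 0.50, 0.95, 0.98, 0.92, 0.97]

text_c, noise_c, white_c = kmeans_1d(pixels, centers=[0.0, 0.5, 1.0])
threshold = (text_c + noise_c) / 2  # midpoint between the text and noise clusters
binary = [0 if p < threshold else 1 for p in pixels]

# Each training row pairs the raw value with its thresholded value.
features = list(zip(pixels, binary))
```

The `features` rows are the kind of input that would then go to a boosted-tree model; the model itself is left out here.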

Part 3: Adaptive thresholding

Coffee cup stains are dark too, so they are hard to separate from the text. The tutorial uses the Image and thresh functions from library("EBImage").
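To see why a local threshold helps with stains, here is a small Python sketch. It is a hand-rolled local-mean threshold, not EBImage's implementation; the function name, `radius`, `offset`, and the tiny image are my own assumptions. A pixel counts as text only if it is clearly darker than the average of its neighborhood.

```python
def adaptive_threshold(image, radius=1, offset=0.05):
    """Mark a pixel as text (1) if it is darker than its local mean minus offset."""
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Window is truncated at the image border.
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            if image[r][c] < local_mean - offset:
                mask[r][c] = 1
    return mask


# Made-up patch: a uniformly dark stain (0.4) with one text pixel (0.1) inside it.
stain = [
    [0.4, 0.4, 0.4],
    [0.4, 0.1, 0.4],
    [0.4, 0.4, 0.4],
]

local_mask = adaptive_threshold(stain)
global_mask = [[1 if p < 0.5 else 0 for p in row] for row in stain]
```

The global cutoff at 0.5 flags the whole stain, while the local version flags only the center pixel.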

Part 4: Canny edge detection & morphology

Parts 3 and 4 use some image detection packages; nothing very interesting.

Part 5: Median filter function & background removal

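The median filter (probably very common in image processing?) takes the median value of a block of the image; the "effect" is that you recover the image's background. This one is actually kind of interesting. How do you do the filtering? For a 5*5 window, you take the per-pixel median (not the mean) across 25 copies of the image, one for each x offset from 1 to 5 and each y offset from 1 to 5. The edges will have some problems, of course.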
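A minimal Python sketch of the idea, assuming a 5*5 window truncated at the border (just one way to handle the edge problem). The image, the `tolerance` value, and both function names are made up for illustration; the tutorial's R code works with shifted copies of the whole image instead of looping per pixel.

```python
from statistics import median


def median_filter(image, radius=2):
    """Replace each pixel by the median of its (2*radius+1)^2 window.

    Windows are truncated at the image border.
    """
    height, width = len(image), len(image[0])
    background = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            background[r][c] = median(window)
    return background


def remove_background(image, background, tolerance=0.1):
    """Keep pixels clearly darker than the background; paint the rest white."""
    return [
        [pix if pix < bg - tolerance else 1.0 for pix, bg in zip(img_row, bg_row)]
        for img_row, bg_row in zip(image, background)
    ]


# Made-up image: gray paper (0.8) with a single dark text pixel in the middle.
image = [[0.8] * 5 for _ in range(5)]
image[2][2] = 0.1

background = median_filter(image, radius=2)
cleaned = remove_background(image, background)
```

Because text strokes are thin, they rarely make up half of a window, so the median ignores them and returns the paper or stain level.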

Part 6: Nearby pixels & brute force machine learning

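Take the background-removed image plus the intermediate data from the median filter (that is, the 25 pixel values around each pixel) and throw it all at xgboost. Let machine learning brute-force it, with no image processing domain knowledge needed at all. ML is great (╬▔ ω▔).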
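Building the "pixels around each pixel" features might look like the Python sketch below. Clamping indices at the border is my own choice for handling edges, and `neighborhood_features` is a hypothetical name. A 3*3 window is used to keep the example small; `radius=2` would give the 25 values mentioned above.

```python
def neighborhood_features(image, radius=1):
    """Return one feature row per pixel: its (2*radius+1)^2 neighborhood, row-major.

    Out-of-range indices are clamped to the nearest border pixel.
    """
    height, width = len(image), len(image[0])
    rows = []
    for r in range(height):
        for c in range(width):
            row = []
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr = min(height - 1, max(0, r + dr))
                    cc = min(width - 1, max(0, c + dc))
                    row.append(image[rr][cc])
            rows.append(row)
    return rows


# Made-up 3x3 "image" with distinct values so the layout is easy to check.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

rows = neighborhood_features(image, radius=1)
```

Each row, paired with the matching clean pixel as the target, is one training example for the boosted-tree model.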

Part 7: Stacking

Too many models to run at once; looks like the author's machine is about as weak as mine. Divide and conquer.

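If the sub-models perform about the same, you can just average their predictions. If one model is much better than the rest, it seems best to simply use that one. That was my own experience last time.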

Part 8: Feature engineering (gaps between lines of text)

One very intuitive feature is the white gaps between lines of text.
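One plausible way to turn that into a feature, sketched in Python: compute the mean brightness of each pixel row and flag rows that are almost entirely white. This is my own guess at the idea, not necessarily how the tutorial builds it; the `white_level` cutoff, the function names, and the tiny image are made up.

```python
def row_brightness(image):
    """Mean intensity of each pixel row (0 = black, 1 = white)."""
    return [sum(row) / len(row) for row in image]


def gap_rows(image, white_level=0.9):
    """Flag rows that are bright enough to be a gap between lines of text."""
    return [mean >= white_level for mean in row_brightness(image)]


# Made-up image: a text row sandwiched between two blank rows.
image = [
    [1.0, 1.0, 1.0, 1.0],
    [0.2, 1.0, 0.1, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

means = row_brightness(image)
is_gap = gap_rows(image)
```

A dark pixel sitting in a gap row is then more likely to be a stain than text, which is the kind of signal a model can pick up from this feature.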

Part 9: Exploiting leakage

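Exploiting information "leakage", meaning the use of information you would not know at predict time. (Here specifically: there are really only 8 backgrounds, so you can train on each one separately. Nothing guarantees that the test set has the same backgrounds, but in this simple case it happens to.) This usually improves results. It's a bit tricky, but it really works. Especially in competitions, being able to spot leakage is a kind of nose for data.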

Part 10: Convolutional neural networks

For images, convolution still looks like the heavy weapon. The code wasn't posted, sadly.

Part 11: Deep neural networks

As I see it, aren't 10 and 11 both deep learning?

Part 12: Final ensemble

It covers a key point about bagging:

if each model has statistically independent errors, and each model performs with similar accuracy, then the average prediction across the 4 models will have half the RMSE score of the individual models
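The "half the RMSE" claim follows from the variance of a mean: if the 4 error terms are independent with zero mean and standard deviation sigma, their average has variance sigma^2 / 4, so its RMSE is sigma / 2. Below is a quick Python simulation as a sanity check. The Gaussian errors, `sigma = 0.1`, and the sample size are my own assumptions; real model errors are usually correlated, so the gain in practice is smaller.

```python
import random
from math import sqrt


def rmse(errors):
    """Root mean squared error of a list of per-pixel errors."""
    return sqrt(sum(e * e for e in errors) / len(errors))


rng = random.Random(0)  # fixed seed so the run is repeatable
n_pixels = 20000
sigma = 0.1

# Four models with independent, zero-mean errors of equal size (an idealized case).
model_errors = [[rng.gauss(0.0, sigma) for _ in range(n_pixels)] for _ in range(4)]

single_rmse = rmse(model_errors[0])
averaged_errors = [sum(col) / 4 for col in zip(*model_errors)]
ensemble_rmse = rmse(averaged_errors)
ratio = ensemble_rmse / single_rmse  # should come out close to 0.5
```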

The blog post on Kaggle is incomplete; you still have to go to the author's own site to read the rest.
I therefore chose the following combination of models:

deep learning – thresholding based features
deep learning – edge based features
deep learning – median based features
images with backgrounds removed using information leakage
xgboost – wide selection of features
convolutional neural network – using raw images without background removal pre-processing
convolutional neural network – using images with backgrounds removed using information leakage
deep convolutional neural network – using raw images without background removal pre-processing
deep convolutional neural network – using images with backgrounds removed using information leakage

Summary

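Image background denoising can be done with ML too; that opened my mind a bit.
Domain knowledge still matters a lot, but brute-force ML on its own works reasonably well too, so don't get discouraged if you aren't chasing a top rank. For image processing, neural networks are still the better choice.
Information leakage: be sensitive to the data.
Model ensembles: standard equipment on Kaggle.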