Successfully trained Tesseract to recognize Slender Gold script (瘦金体), or at least 24 of its characters :)

--Aside--
Very good training files from the Early Modern OCR Project (eMOP). This appears to be the work of PRIMALabs.
TesseractTraining
Testing with Tesseract

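There are plenty of resources online. Here are a few whose steps are explicit and clearly explained:
Sample training for Tesseract 3.02.02 with the jTessBoxEditor tool to improve CAPTCHA recognition (利用jTessBoxEditor工具进行Tesseract3.02.02样本训练,提高验证码识别率)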

  • Covers merging the sample images into a multi-page TIFF file.
  • Note: DO NOT MIX FONTS IN AN IMAGE FILE (in a single .tr file, to be precise).
  • Includes the console output of the intermediate steps, which is very useful for comparing against the output of your own training run.
  • Uses shapeclustering (some tutorials skip it, apparently without problems?).
  • Ends with a list of steps:
    1. Merge the images
    2. Generate the box file
    tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
    3. Correct the box file
    4. Generate font_properties
    echo fontyp 0 0 0 0 0 >font_properties
    5. Generate the training file
    tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
    6. Generate the character set file
    unicharset_extractor langyp.fontyp.exp0.box
    7. Generate the shape file
    shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
    8. Generate the clustered character feature files
    mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
    9. Generate the character normalization feature file
    cntraining langyp.fontyp.exp0.tr
    10. Rename the output files
    rename normproto fontyp.normproto
    rename inttemp fontyp.inttemp
    rename pffmtable fontyp.pffmtable
    rename unicharset fontyp.unicharset
    rename shapetable fontyp.shapetable
    11. Combine the training files to produce fontyp.traineddata
    combine_tessdata fontyp.
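
The steps above lend themselves to scripting. What follows is a minimal Python sketch, not part of the quoted tutorial, that only formats the command lines from steps 2 to 11 so they can be reviewed or pasted into a Windows command prompt; it does not run anything. The function name `build_training_commands` and its parameters are hypothetical. The commands and flags are copied from the list, and steps 1 and 3 (merging images and correcting the box file) stay manual.

```python
def build_training_commands(lang="langyp", font="fontyp", base_lang="eng"):
    """Return the command lines from steps 2-11 as strings.

    Hypothetical helper: it only formats the commands quoted in the list
    above; it does not execute them.
    """
    stem = f"{lang}.{font}.exp0"  # e.g. langyp.fontyp.exp0
    tr = f"{stem}.tr"
    cmds = [
        # Step 2: generate the box file.
        f"tesseract {stem}.tif {stem} -l {base_lang} -psm 7 batch.nochop makebox",
        # Step 3 is manual: correct the box file (e.g. in jTessBoxEditor).
        # Step 4: one line per font: <font> <italic> <bold> <fixed> <serif> <fraktur>.
        f"echo {font} 0 0 0 0 0 >font_properties",
        # Step 5: generate the .tr training file from the corrected boxes.
        f"tesseract {stem}.tif {stem} -l {base_lang} -psm 7 nobatch box.train",
        # Step 6: extract the character set.
        f"unicharset_extractor {stem}.box",
        # Step 7: build the shape table.
        f"shapeclustering -F font_properties -U unicharset -O {lang}.unicharset {tr}",
        # Step 8: clustered character features.
        f"mftraining -F font_properties -U unicharset -O {lang}.unicharset {tr}",
        # Step 9: character normalization features.
        f"cntraining {tr}",
    ]
    # Step 10: give every output file the common prefix used by the list.
    for name in ("normproto", "inttemp", "pffmtable", "unicharset", "shapetable"):
        cmds.append(f"rename {name} {font}.{name}")
    # Step 11: combine everything that starts with the prefix.
    cmds.append(f"combine_tessdata {font}.")
    return cmds


if __name__ == "__main__":
    for cmd in build_training_commands():
        print(cmd)
```

The prefix used in steps 10 and 11 follows the quoted list (`fontyp`). Whatever prefix is chosen there becomes the name of the resulting `.traineddata` file, so it has to be the same in both steps.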

My training results

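  • Step 5 above is the step that really matters. The official site says it does not matter which language is used, but in my tests the default (English) seemed to produce more errors.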
    tesseract.exe chi.slimqjs.exp0.tif chi.slimqjs.exp0 nobatch box.train
    Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
    Page 1
    row xheight=86, but median xheight = 0.5
    row xheight=77, but median xheight = 0.5
    row xheight=68, but median xheight = 0.5
    row xheight=77, but median xheight = 0.5
    FAIL!
    APPLY_BOXES: boxfile line 1/里 ((73,490),(185,582)): FAILURE! Couldn’t find a matching blob
    FAIL!
    APPLY_BOXES: boxfile line 2/凤 ((226,480),(355,600)): FAILURE! Couldn’t find a matching blob
    FAIL!
    APPLY_BOXES: boxfile line 5/露 ((677,478),(803,603)): FAILURE! Couldn’t find a matching blob
    FAIL!
    APPLY_BOXES: boxfile line 6/魁 ((828,480),(951,592)): FAILURE! Couldn’t find a matching blob
    APPLY_BOXES:
    Boxes read from boxfile: 24
    Boxes failed resegmentation: 4
    APPLY_BOXES: Unlabelled word at :Bounding box=(54,610)->(974,611)
    Found 20 good blobs.
    Leaving 12 unlabelled blobs in 0 words.
    1 remaining unlabelled words deleted.
    Generated training data for 18 words
  • In the end I chose chi_sim for training.
    tesseract.exe chi.slimqjs.exp0.tif chi.slimqjs.exp0 -l chi_sim nobatch box.train
    Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
    Page 1
    row xheight=29.5, but median xheight = 58.1667
    row xheight=460, but median xheight = 58.1667
    row xheight=117.5, but median xheight = 58.1667
    row xheight=117.5, but median xheight = 58.1667
    row xheight=121, but median xheight = 58.1667
    row xheight=121, but median xheight = 58.1667
    row xheight=28.6667, but median xheight = 58.1667
    row xheight=44.6667, but median xheight = 58.1667
    row xheight=81.3333, but median xheight = 58.1667
    row xheight=29, but median xheight = 58.1667
    APPLY_BOXES:
    Boxes read from boxfile: 24
    APPLY_BOXES: Unlabelled word at :Bounding box=(-611,961)->(0,1020)
    APPLY_BOXES: Unlabelled word at :Bounding box=(-610,54)->(-609,974)
    APPLY_BOXES: Unlabelled word at :Bounding box=(398,183)->(459,250)
    APPLY_BOXES: Unlabelled word at :Bounding box=(371,184)->(400,262)
    APPLY_BOXES: Unlabelled word at :Bounding box=(-611,0)->(0,58)
    Found 24 good blobs.
    Leaving 3 unlabelled blobs in 0 words.
    5 remaining unlabelled words deleted.

  • shapeclustering.exe reports many "Bad properties" errors

  • The later results all look similar
    shapeclustering.exe -F font_properties -U unicharset chi.slimqjs.exp0.tr
    Reading chi.slimqjs.exp0.tr …
    Bad properties for index 3, char 里: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 4, char 凤: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 5, char 起: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 6, char 云: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 7, char 露: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 8, char 魁: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 9, char 霎: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 10, char 髓: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 11, char 绿: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 12, char 堆: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 13, char 和: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 14, char 采: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 15, char 时: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 16, char 点: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 17, char 尘: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 18, char 轻: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 19, char 烟: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 20, char 取: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 21, char 滋: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 22, char 将: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 23, char 埃: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 24, char 动: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 25, char 捣: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 26, char 枝: 0,255 0,255 0,0 0,0 0,0
    Building master shape table
    Computing shape distances…
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances…
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances…
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances…
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances…
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances… 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    Stopped with 0 merged, min dist 0.348285
    Master shape_table:Number of shapes = 24 max unichars = 1 number with multiple unichars = 0

  • mftraining.exe shows the same "Bad properties" errors, plus two warnings
    mftraining.exe -F font_properties -U unicharset -O chi.unicharset chi.slimqjs.exp0.tr
    Read shape table shapetable of 24 shapes
    Reading chi.slimqjs.exp0.tr …
    Bad properties for index 3, char 里: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 4, char 凤: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 5, char 起: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 6, char 云: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 7, char 露: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 8, char 魁: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 9, char 霎: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 10, char 髓: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 11, char 绿: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 12, char 堆: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 13, char 和: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 14, char 采: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 15, char 时: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 16, char 点: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 17, char 尘: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 18, char 轻: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 19, char 烟: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 20, char 取: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 21, char 滋: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 22, char 将: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 23, char 埃: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 24, char 动: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 25, char 捣: 0,255 0,255 0,0 0,0 0,0
    Bad properties for index 26, char 枝: 0,255 0,255 0,0 0,0 0,0
    Warning: no protos/configs for Joined in CreateIntTemplates()
    Warning: no protos/configs for |Broken|0|1 in CreateIntTemplates()
    Done!

  • Finally, success!!
    combine_tessdata.exe chi.
    Combining tessdata files
    TessdataManager combined tesseract data files.
    Offset for type 0 (chi.config ) is -1
    Offset for type 1 (chi.unicharset ) is 140
    Offset for type 2 (chi.unicharambigs ) is -1
    Offset for type 3 (chi.inttemp ) is 1661
    Offset for type 4 (chi.pffmtable ) is 200388
    Offset for type 5 (chi.normproto ) is 200668
    Offset for type 6 (chi.punc-dawg ) is -1
    Offset for type 7 (chi.word-dawg ) is -1
    Offset for type 8 (chi.number-dawg ) is -1
    Offset for type 9 (chi.freq-dawg ) is -1
    Offset for type 10 (chi.fixed-length-dawgs ) is -1
    Offset for type 11 (chi.cube-unicharset ) is -1
    Offset for type 12 (chi.cube-word-dawg ) is -1
    Offset for type 13 (chi.shapetable ) is 203778
    Offset for type 14 (chi.bigram-dawg ) is -1
    Offset for type 15 (chi.unambig-dawg ) is -1
    Offset for type 16 (chi.params-model ) is -1
    Output chi.traineddata created successfully.
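
The combine_tessdata listing above is easier to check when reduced to the components that actually made it into the file. The following is a small Python sketch under the assumption that an offset of -1 marks a component that was not included, which matches the output shown: only the files produced by the earlier steps have real offsets. The function name `included_components` is hypothetical, and the sample text is a subset of the log above.

```python
import re

# Matches lines such as "Offset for type 1 (chi.unicharset ) is 140".
OFFSET_RE = re.compile(r"Offset for type\s+(\d+)\s+\((\S+)\s*\)\s+is\s+(-?\d+)")


def included_components(log_text):
    """Return the component names whose offset is not -1, in log order."""
    found = []
    for line in log_text.splitlines():
        match = OFFSET_RE.search(line)
        if match and int(match.group(3)) != -1:
            found.append(match.group(2))
    return found


# A few lines copied from the combine_tessdata output above.
sample = """\
Offset for type 0 (chi.config ) is -1
Offset for type 1 (chi.unicharset ) is 140
Offset for type 3 (chi.inttemp ) is 1661
Offset for type 4 (chi.pffmtable ) is 200388
Offset for type 5 (chi.normproto ) is 200668
Offset for type 13 (chi.shapetable ) is 203778
"""

print(included_components(sample))
# ['chi.unicharset', 'chi.inttemp', 'chi.pffmtable', 'chi.normproto', 'chi.shapetable']
```

These five names are the same five files renamed in step 10 of the quoted tutorial, which is a quick way to confirm that nothing was left out before combining.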

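Training Tesseract to recognize Chinese fonts (Tesseract训练中文字体识别)

  • If there are multiple image files, multiple box files, and so on, they need to be merged into a single character set, and the same goes for the training files in the later steps.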
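
As a rough illustration of that merge, the Python sketch below formats the commands for several image/box pairs: all box files go to one unicharset_extractor call, and all .tr files are passed together to the later tools. The function name `build_multi_sample_commands` is hypothetical, and the `exp1` file is an assumed second sample (only `chi.slimqjs.exp0` exists in the run described above). The sketch only builds strings and runs nothing.

```python
def build_multi_sample_commands(lang, font, count):
    """Format the merge-related commands for `count` samples named
    <lang>.<font>.exp0 ... <lang>.<font>.exp<count-1> (hypothetical helper)."""
    stems = [f"{lang}.{font}.exp{i}" for i in range(count)]
    boxes = " ".join(f"{stem}.box" for stem in stems)
    trs = " ".join(f"{stem}.tr" for stem in stems)
    return [
        # One unicharset built from every box file.
        f"unicharset_extractor {boxes}",
        # Every .tr file is handed to the clustering and training tools together.
        f"shapeclustering -F font_properties -U unicharset {trs}",
        f"mftraining -F font_properties -U unicharset -O {lang}.unicharset {trs}",
        f"cntraining {trs}",
    ]


if __name__ == "__main__":
    for cmd in build_multi_sample_commands("chi", "slimqjs", 2):
        print(cmd)
```

Each sample still needs its own box.train run first to produce its .tr file; only the steps from unicharset_extractor onward take the combined file lists.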

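Tesseract-OCR: recognizing Chinese, with a worked example of training a font library (Tesseract-OCR识别中文与训练字库实例)

  • It also runs into recognition errors at step 5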

Simple usage and training of Tesseract-OCR (Tesseract-OCR的简单使用与训练)

  • Nothing special here

Training Tesseract OCR for a New Font and Input Set on Mac

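  • Has nothing whatsoever to do with the Mac!!
  • The most useful part is that it mentions a font-identification website:
  • identifont
  • If the font file is installed, you can use TIFF/Box Generator directly to generate the image and the box file (verified)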

Testing with Tesseract

How to prepare training files for Tesseract OCR and improve characters recognition?

The following three are all good:
Training Tesseract for labels, receipts and such

A Guide on OCR with tesseract 3.03

Adding New Fonts to Tesseract 3 OCR Engine
