Trained Tesseract on 瘦金体 successfully!!
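Successfully trained Tesseract to recognize 瘦金体 (Slender Gold script), or at least 24 of its characters :)
-- An aside --
Very good training material from the Early Modern OCR Project (eMOP). It is probably the work of PRIMALabs.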
TesseractTraining
Testing with Tesseract
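There are plenty of resources online. Here are a few whose steps are well defined and clearly explained:
Training Tesseract 3.02.02 on samples with jTessBoxEditor to improve CAPTCHA recognition (利用jTessBoxEditor工具进行Tesseract3.02.02样本训练,提高验证码识别率)
- Covers merging the sample images into a multi-page TIFF file.
- Note: DO NOT MIX FONTS IN AN IMAGE FILE (in a single .tr file, to be precise).
- Includes the console output of the intermediate steps, which is very useful for comparing against your own training output.
- Uses shapeclustering (some tutorials skip it, apparently without problems?).
- Ends with a list of steps:
1. Merge the images
2. Generate the box file
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
3. Correct the box file
4. Create font_properties
echo fontyp 0 0 0 0 0 >font_properties
5. Generate the training file
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
6. Generate the character set file
unicharset_extractor langyp.fontyp.exp0.box
7. Generate the shape file
shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
8. Generate the clustered character feature file
mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
9. Generate the character normalization feature file
cntraining langyp.fontyp.exp0.tr
10. Rename the output files
rename normproto fontyp.normproto
rename inttemp fontyp.inttemp
rename pffmtable fontyp.pffmtable
rename unicharset fontyp.unicharset
rename shapetable fontyp.shapetable
11. Merge the training files to produce fontyp.traineddata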
combine_tessdata fontyp.
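The command sequence above can be sketched as a small dry-run generator, which makes it easier to swap in your own language and font names before typing anything. This is only a sketch under assumptions: `training_commands` and `font_properties_line` are hypothetical helper names, nothing is executed, and steps 1 and 3 (merging images, correcting the box file) stay manual. The commands themselves are copied from the list, including `rename`, which is a Windows shell builtin.

```python
from typing import List


def font_properties_line(font: str = "fontyp") -> str:
    """Step 4: the single line written to font_properties for one font."""
    return f"{font} 0 0 0 0 0"


def training_commands(lang: str = "langyp", font: str = "fontyp",
                      ocr_lang: str = "eng") -> List[List[str]]:
    """Return steps 2 and 5-11 of the list as argument vectors (dry run only)."""
    base = f"{lang}.{font}.exp0"
    tif, box, tr = f"{base}.tif", f"{base}.box", f"{base}.tr"
    # Flags shared by shapeclustering and mftraining in the listed steps.
    common = ["-F", "font_properties", "-U", "unicharset",
              "-O", f"{lang}.unicharset", tr]
    cmds = [
        ["tesseract", tif, base, "-l", ocr_lang, "-psm", "7",
         "batch.nochop", "makebox"],                      # step 2
        ["tesseract", tif, base, "-l", ocr_lang, "-psm", "7",
         "nobatch", "box.train"],                         # step 5
        ["unicharset_extractor", box],                    # step 6
        ["shapeclustering"] + common,                     # step 7
        ["mftraining"] + common,                          # step 8
        ["cntraining", tr],                               # step 9
    ]
    # Step 10: give each output file the prefix expected by combine_tessdata.
    for name in ("normproto", "inttemp", "pffmtable", "unicharset", "shapetable"):
        cmds.append(["rename", name, f"{font}.{name}"])
    cmds.append(["combine_tessdata", f"{font}."])         # step 11
    return cmds


if __name__ == "__main__":
    print("echo " + font_properties_line() + " >font_properties")
    for cmd in training_commands():
        print(" ".join(cmd))
```

Printing the commands rather than running them keeps the sketch independent of the local Tesseract installation; each line can be pasted into a console and its output compared with the tutorial's.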
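My training results
- Step 5 above is the one that really matters. The official documentation says it does not matter which language you use, but in my tests the default (English) seemed to produce more errors?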
tesseract.exe chi.slimqjs.exp0.tif chi.slimqjs.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
row xheight=86, but median xheight = 0.5
row xheight=77, but median xheight = 0.5
row xheight=68, but median xheight = 0.5
row xheight=77, but median xheight = 0.5
FAIL!
APPLY_BOXES: boxfile line 1/里 ((73,490),(185,582)): FAILURE! Couldn’t find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2/凤 ((226,480),(355,600)): FAILURE! Couldn’t find a matching blob
FAIL!
APPLY_BOXES: boxfile line 5/露 ((677,478),(803,603)): FAILURE! Couldn’t find a matching blob
FAIL!
APPLY_BOXES: boxfile line 6/魁 ((828,480),(951,592)): FAILURE! Couldn’t find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 24
Boxes failed resegmentation: 4
APPLY_BOXES: Unlabelled word at :Bounding box=(54,610)->(974,611)
Found 20 good blobs.
Leaving 12 unlabelled blobs in 0 words.
1 remaining unlabelled words deleted.
Generated training data for 18 words
In the end I chose to train with chi_sim.
tesseract.exe chi.slimqjs.exp0.tif chi.slimqjs.exp0 -l chi_sim nobatch box.train
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
row xheight=29.5, but median xheight = 58.1667
row xheight=460, but median xheight = 58.1667
row xheight=117.5, but median xheight = 58.1667
row xheight=117.5, but median xheight = 58.1667
row xheight=121, but median xheight = 58.1667
row xheight=121, but median xheight = 58.1667
row xheight=28.6667, but median xheight = 58.1667
row xheight=44.6667, but median xheight = 58.1667
row xheight=81.3333, but median xheight = 58.1667
row xheight=29, but median xheight = 58.1667
APPLY_BOXES:
Boxes read from boxfile: 24
APPLY_BOXES: Unlabelled word at :Bounding box=(-611,961)->(0,1020)
APPLY_BOXES: Unlabelled word at :Bounding box=(-610,54)->(-609,974)
APPLY_BOXES: Unlabelled word at :Bounding box=(398,183)->(459,250)
APPLY_BOXES: Unlabelled word at :Bounding box=(371,184)->(400,262)
APPLY_BOXES: Unlabelled word at :Bounding box=(-611,0)->(0,58)
Found 24 good blobs.
Leaving 3 unlabelled blobs in 0 words.
5 remaining unlabelled words deleted.
shapeclustering.exe reported many Bad properties errors
The later steps gave similar results.
shapeclustering.exe -F font_properties -U unicharset chi.slimqjs.exp0.tr
Reading chi.slimqjs.exp0.tr …
Bad properties for index 3, char 里: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char 凤: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 5, char 起: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 6, char 云: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 7, char 露: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 8, char 魁: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 9, char 霎: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 10, char 髓: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 11, char 绿: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 12, char 堆: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 13, char 和: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 14, char 采: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 15, char 时: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 16, char 点: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 17, char 尘: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 18, char 轻: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 19, char 烟: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 20, char 取: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 21, char 滋: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 22, char 将: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 23, char 埃: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 24, char 动: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 25, char 捣: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 26, char 枝: 0,255 0,255 0,0 0,0 0,0
Building master shape table
Computing shape distances…
Stopped with 0 merged, min dist 999.000000
Computing shape distances…
Stopped with 0 merged, min dist 999.000000
Computing shape distances…
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances…
Stopped with 0 merged, min dist 999.000000
Computing shape distances…
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Stopped with 0 merged, min dist 0.348285
Master shape_table:Number of shapes = 24 max unichars = 1 number with multiple unichars = 0
mftraining.exe reported the same Bad properties errors, plus two warnings
mftraining.exe -F font_properties -U unicharset -O chi.unicharset chi.slimqjs.exp0.tr
Read shape table shapetable of 24 shapes
Reading chi.slimqjs.exp0.tr …
Bad properties for index 3, char 里: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char 凤: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 5, char 起: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 6, char 云: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 7, char 露: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 8, char 魁: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 9, char 霎: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 10, char 髓: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 11, char 绿: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 12, char 堆: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 13, char 和: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 14, char 采: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 15, char 时: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 16, char 点: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 17, char 尘: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 18, char 轻: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 19, char 烟: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 20, char 取: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 21, char 滋: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 22, char 将: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 23, char 埃: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 24, char 动: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 25, char 捣: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 26, char 枝: 0,255 0,255 0,0 0,0 0,0
Warning: no protos/configs for Joined in CreateIntTemplates()
Warning: no protos/configs for |Broken|0|1 in CreateIntTemplates()
Done!
Finally, success!!
combine_tessdata.exe chi.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 (chi.config ) is -1
Offset for type 1 (chi.unicharset ) is 140
Offset for type 2 (chi.unicharambigs ) is -1
Offset for type 3 (chi.inttemp ) is 1661
Offset for type 4 (chi.pffmtable ) is 200388
Offset for type 5 (chi.normproto ) is 200668
Offset for type 6 (chi.punc-dawg ) is -1
Offset for type 7 (chi.word-dawg ) is -1
Offset for type 8 (chi.number-dawg ) is -1
Offset for type 9 (chi.freq-dawg ) is -1
Offset for type 10 (chi.fixed-length-dawgs ) is -1
Offset for type 11 (chi.cube-unicharset ) is -1
Offset for type 12 (chi.cube-word-dawg ) is -1
Offset for type 13 (chi.shapetable ) is 203778
Offset for type 14 (chi.bigram-dawg ) is -1
Offset for type 15 (chi.unambig-dawg ) is -1
Offset for type 16 (chi.params-model ) is -1
Output chi.traineddata created successfully.
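In the combine_tessdata output above, an offset of -1 marks a component that was not found and is therefore absent from chi.traineddata, as far as I understand the tool. The following sketch lists which components made it in. The names `OFFSET_RE`, `parse_offsets`, `included` and `missing` are hypothetical, and the sample is a subset of the log lines shown above.

```python
import re
from typing import Dict, List

# Matches lines such as "Offset for type 1 (chi.unicharset ) is 140".
OFFSET_RE = re.compile(r"Offset for type\s+\d+\s+\((\S+)\s*\)\s+is\s+(-?\d+)")

SAMPLE_LOG = """\
Offset for type 0 (chi.config ) is -1
Offset for type 1 (chi.unicharset ) is 140
Offset for type 3 (chi.inttemp ) is 1661
Offset for type 4 (chi.pffmtable ) is 200388
Offset for type 5 (chi.normproto ) is 200668
Offset for type 13 (chi.shapetable ) is 203778
"""


def parse_offsets(log: str) -> Dict[str, int]:
    """Map each component file name to the offset reported for it."""
    return {m.group(1): int(m.group(2)) for m in OFFSET_RE.finditer(log)}


def included(offsets: Dict[str, int]) -> List[str]:
    """Components with a non-negative offset, in log order."""
    return [name for name, off in offsets.items() if off >= 0]


def missing(offsets: Dict[str, int]) -> List[str]:
    """Components reported with offset -1."""
    return [name for name, off in offsets.items() if off < 0]


if __name__ == "__main__":
    offsets = parse_offsets(SAMPLE_LOG)
    print("included:", included(offsets))
    print("missing: ", missing(offsets))
```

For the run above, the included components are exactly the five files renamed before combining: unicharset, inttemp, pffmtable, normproto and shapetable.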
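Training Tesseract to recognize Chinese fonts (Tesseract训练中文字体识别)
- If you have multiple image files, multiple box files and so on, they must be merged into a single character set, and the same goes for the training files in the later steps.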
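One way to do that merging is to pass all the box files to unicharset_extractor in a single call, and all the .tr files to the later training tools in a single call. A minimal sketch of the resulting commands follows; `merged_commands` is a hypothetical helper, nothing is executed, and `chi.slimqjs.exp1` is an assumed second sample set that does not exist in this post.

```python
from typing import List, Sequence


def merged_commands(lang: str, bases: Sequence[str]) -> List[List[str]]:
    """Build one command per tool, each covering every sample set in `bases`."""
    boxes = [f"{b}.box" for b in bases]
    trs = [f"{b}.tr" for b in bases]
    flags = ["-F", "font_properties", "-U", "unicharset"]
    return [
        # A single unicharset built from all box files.
        ["unicharset_extractor", *boxes],
        # The training tools each receive every .tr file at once.
        ["shapeclustering", *flags, *trs],
        ["mftraining", *flags, "-O", f"{lang}.unicharset", *trs],
        ["cntraining", *trs],
    ]


if __name__ == "__main__":
    # exp1 is hypothetical; only exp0 was used in the training run above.
    for cmd in merged_commands("chi", ["chi.slimqjs.exp0", "chi.slimqjs.exp1"]):
        print(" ".join(cmd))
```

Each sample set still needs its own box.train run first, since that is what produces the individual .tr files.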
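Tesseract-OCR: recognizing Chinese and a worked font-training example (Tesseract-OCR识别中文与训练字库实例)
- Also ran into recognition errors at step 5.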
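Step-5 errors of this kind show up in the box.train console output as `APPLY_BOXES: ... FAILURE!` lines, like the ones in my English-language run. A small sketch for pulling the failed characters out of a saved log, so you know which boxes to revisit, might look like this. `FAIL_RE` and `failed_boxes` are hypothetical names, and the sample lines are taken from the log earlier in this post.

```python
import re
from typing import List, Tuple

# Matches e.g. "APPLY_BOXES: boxfile line 1/里 ((73,490),(185,582)): FAILURE! ..."
FAIL_RE = re.compile(
    r"APPLY_BOXES: boxfile line (\d+)/(\S+) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE!"
)

SAMPLE_LOG = """\
FAIL!
APPLY_BOXES: boxfile line 1/里 ((73,490),(185,582)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2/凤 ((226,480),(355,600)): FAILURE! Couldn't find a matching blob
Found 20 good blobs.
"""


def failed_boxes(log: str) -> List[Tuple[int, str, Tuple[int, ...]]]:
    """Return (box file line, character, box coordinates) for each failure."""
    results = []
    for m in FAIL_RE.finditer(log):
        coords = tuple(int(m.group(i)) for i in range(3, 7))
        results.append((int(m.group(1)), m.group(2), coords))
    return results


if __name__ == "__main__":
    for line_no, char, coords in failed_boxes(SAMPLE_LOG):
        print(f"box file line {line_no}: {char} {coords}")
```

The coordinates are returned in the order they appear in the log line, so they can be compared directly with the corresponding entry in the box file.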
Basic usage and training of Tesseract-OCR (Tesseract-OCR的简单使用与训练)
- Nothing out of the ordinary.
Training Tesseract OCR for a New Font and Input Set on Mac
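- Has nothing whatsoever to do with Mac!!
- The most useful part is its mention of font identification websites:
- identifont
- If the font file is installed, you can generate the image and box files directly with the TIFF/Box Generator (via)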
Testing with Tesseract
How to prepare training files for Tesseract OCR and improve characters recognition?
The following three are all good:
Training Tesseract for labels, receipts and such
A Guide on OCR with tesseract 3.03
Adding New Fonts to Tesseract 3 OCR Engine