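Getting started with Tesseract training: notes on one training run with 50 simple captcha images

Installing tesseract and the various packages is skipped here, and basic Python knowledge is assumed.

A Java environment is needed to run the training tool jTessBoxEditor. Installing the JDK and the training helper tools is not discussed here.

I work on Ubuntu 18.04. The training tool runs in a Windows virtual machine with Java installed.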

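1. Get the captcha images

Anjuke's captchas turned out to be fairly simple, so I borrow the Anjuke captcha endpoint to download them. The code is for reference only and the endpoint may change in the future; the point here is the method.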

import os
import requests

# num = the number of captchas you want to fetch
def gen_captcha_from_url(num):
    # Anjuke, e.g. https://login.anjuke.com/general/captcha?timestamp=15580192349965467
    if not os.path.exists('code_pic/'):
        os.mkdir('code_pic/')
    pic = []
    for i in range(0, num):
        # get_random_string is my own helper and is not shown in this post
        fake_timestamp = get_random_string(8, 8, 0, 0)
        url = 'https://login.anjuke.com/general/captcha?timestamp=' + fake_timestamp
        print(url)
        response = requests.get(url)
        filename = 'code_pic/' + str(i) + '.jpg'
        pic.append(filename)
        with open(filename, 'wb') as f:
            f.write(response.content)
    return pic
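The get_random_string helper called above is not included in the post, so its exact behavior is unknown. A minimal stand-in, assuming all it needs to do is produce a string of random digits for the timestamp parameter, might look like this. The name random_digits and the default length of 17 (the length of the sample timestamp in the URL comment) are assumptions for illustration only.

```python
import random
import string


def random_digits(length=17):
    """Hypothetical replacement for get_random_string: a string of random digits."""
    return ''.join(random.choice(string.digits) for _ in range(length))


# Build a request URL the same way the download function does
fake_timestamp = random_digits()
url = 'https://login.anjuke.com/general/captcha?timestamp=' + fake_timestamp
print(url)
```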

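2. Binarize the images

The downloaded captcha images are in color, so they need to be binarized into black-and-white images that tesseract can recognize more easily.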


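from PIL import Image

def two_value(pic):
    img = Image.open(pic)
    # Mode "L" is a grayscale image: each pixel uses 8 bits, 0 is black,
    # 255 is white, and the values in between are shades of gray.
    gray = img.convert('L')
    gray.save(pic)
    # Custom gray threshold: values below it become black (0),
    # values at or above it become white (1).
    threshold = 200
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    # Binarize the image
    photo = gray.point(table, '1')
    photo.save(pic)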
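To make the threshold rule concrete, here is a small sketch that builds the same 256-entry lookup table as two_value and applies it to a handful of gray values by hand, without Pillow. The function name build_threshold_table and the sample values are made up for illustration.

```python
def build_threshold_table(threshold=200):
    """Return a 256-entry table: 0 (black) below the threshold, 1 (white) otherwise."""
    return [0 if level < threshold else 1 for level in range(256)]


table = build_threshold_table(200)

# Gray levels of a few imaginary pixels, from dark to bright
sample_pixels = [0, 120, 199, 200, 255]
binary = [table[p] for p in sample_pixels]
print(binary)  # [0, 0, 0, 1, 1]
```

With a threshold of 200 only very bright pixels stay white, which suits dark text on a light background. A captcha with light text would need a different threshold or an inverted table.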

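3. First recognize the images by hand and record the manual result (optional, it only makes computing the recognition rate easier). The code below reads the expected answer from the first four characters of each file name.

4. Convert the images to TIFF files and call tesseract to recognize them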

import os
import pytesseract
from PIL import Image

# Judge by whether the recognition result equals the image name
def check_right(folder):
    pic_list = os.listdir(folder)
    print(pic_list)
    right = 0
    all_pic = 0
    for pic in pic_list:
        # if not '.tiff' in pic:
        #    continue
        all_pic += 1
        real = pic[:4]
        filename = os.path.join(folder, pic)
        im = Image.open(filename)
        filename = filename.replace('.jpg', '.tiff')
        im.save(filename)  # or 'test.tif'
        # strip() drops any trailing whitespace left in the OCR output
        result = pytesseract.image_to_string(
            filename, lang='eng',
            config='--psm 7 --oem 3 -c tessedit_char_whitelist=qwertyuiopasdfghjklzxcvbnm').strip()
        print(filename + '=' + result, end='')
        if real == result:
            print('    yes', end='')
            right += 1
        print('\n')
    print('Success rate: {right}/{all}={last}'.format(right=right, all=all_pic, last=float(right/all_pic)))
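The scoring rule used above (the first four characters of the file name are the expected answer) can be tried out without running tesseract at all. The sketch below is only an illustration: recognition_rate is a hypothetical helper and the sample dictionary is invented data, not real OCR output.

```python
def recognition_rate(results):
    """results maps a file name such as 'abcd.jpg' to the raw OCR text for it."""
    total = len(results)
    if total == 0:
        return 0.0
    # Compare the 4-letter answer in the file name with the cleaned OCR text
    right = sum(1 for name, text in results.items() if name[:4] == text.strip())
    return right / total


# Invented sample: two correct results (with stray whitespace) and one wrong one
sample = {'abcd.jpg': 'abcd\n', 'qwer.jpg': 'qver', 'zxcv.jpg': ' zxcv\x0c'}
print(recognition_rate(sample))
```

Stripping whitespace before comparing matters because OCR output can carry a trailing newline or form feed that would otherwise turn a correct result into a mismatch.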

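5. Use jTessBoxEditor to merge the N TIFF images above into a single tif

Open jTessBoxEditor and click Tools -> Merge TIFF. Select all of the TIFF images above, then enter a name for the merged tif.

Note: the name matters, and with a wrong name the file will not be recognized. For example: myeng.normal.exp0.tif

myeng is the language you are training. Pick a name different from the existing languages (eng, chi_sim, and so on) so that they are not affected.

normal is the name of a font for this language. You can use any name you can remember, for example trumpsb.

The trailing exp0.tif is the convention.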
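The naming rule above boils down to language.font.expN plus an extension. As a small sketch, a hypothetical helper (training_basename is not part of tesseract or jTessBoxEditor) can build the shared base name so that the tif, box, and later tr files cannot drift apart:

```python
def training_basename(lang, font, num=0):
    """Build the '<lang>.<font>.exp<num>' base name shared by the training files."""
    for part in (lang, font):
        # A dot or space inside a part would break the three-field naming scheme
        if not part or '.' in part or ' ' in part:
            raise ValueError('invalid name part: {!r}'.format(part))
    return '{}.{}.exp{}'.format(lang, font, num)


base = training_basename('myeng', 'normal')
print(base + '.tif')  # myeng.normal.exp0.tif
print(base + '.box')  # myeng.normal.exp0.box
```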

6. Run the tesseract command line on the tif file to generate the box file

tesseract myeng.normal.exp0.tif myeng.normal.exp0 -l eng --psm 7 batch.nochop makebox

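A myeng.normal.exp0.box file is generated in the current directory.

Note: be sure to generate the box file first, and only then open the merged tif with jTessBoxEditor.

If this step or any other step went wrong, the table on the left side of the tool will be empty, which means the box file for this tif was not found.

The two files must have identical names except for the extension.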
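It can help to look inside the box file before opening it in the editor. Each line describes one character as the character itself followed by left, bottom, right, and top pixel coordinates and a page number, with coordinates measured from the bottom-left corner of the image. The sketch below parses one such line; parse_box_line is a hypothetical helper and the sample line is invented, not taken from a real box file.

```python
from collections import namedtuple

Box = namedtuple('Box', 'char left bottom right top page')


def parse_box_line(line):
    """Parse one box-file line: '<char> <left> <bottom> <right> <top> <page>'."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError('expected 6 fields, got {}: {!r}'.format(len(parts), line))
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


box = parse_box_line('a 12 5 30 28 0')
# Width and height of the character's bounding box
print(box.right - box.left, box.top - box.bottom)  # 18 23
```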

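7. Use jTessBoxEditor to proofread by hand: enter the correct values in the table and save

This is where most of the effort goes. For every image, check the coordinates, width, height, and content of every letter precisely, and do not forget to click Save.

Where a character was missed, click Insert; where there is an extra one, click Delete.

FBI warning: correct every single character. Do not skip a character just because it was recognized correctly. The coordinates, width, and height of every character must be very precise to get a decent recognition rate. In my own test on the same 50 images, correcting only the images that were recognized wrongly gave a final recognition rate of just 0.18. After I corrected every character, the recognition rate reached 0.56.

Note: once everything is corrected, use Save As at the top left of the tool and overwrite the existing box file.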

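8. Create the font properties file font_properties

In font_properties, the font name is followed by five 0s, for example: normal 0 0 0 0 0. The five values are 0/1 flags for italic, bold, fixed, serif, and fraktur. Save the file without a file name extension.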
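Since font_properties is a one-line text file, it can also be written from a script. The sketch below assumes the font name normal used throughout this post and the flag order italic, bold, fixed, serif, fraktur. Both function names are hypothetical helpers.

```python
def font_properties_line(font, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Format one font_properties entry: '<font> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return '{} {}'.format(font, ' '.join(str(int(bool(f))) for f in flags))


def write_font_properties(path, font):
    """Write a font_properties file (note: no extension) with all flags off."""
    with open(path, 'w') as f:
        f.write(font_properties_line(font) + '\n')


line = font_properties_line('normal')
print(line)  # normal 0 0 0 0 0
# write_font_properties('font_properties', 'normal')
```

The font name in this file has to match the middle part of the training file name (normal in myeng.normal.exp0).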

9. Generate the training file from the command line

tesseract myeng.normal.exp0.tif myeng.normal.exp0 -l eng --psm 7 nobatch box.train

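A .tr file (here myeng.normal.exp0.tr) is now generated in the directory.

10. Generate the character set file for your training from the command line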

unicharset_extractor myeng.normal.exp0.box
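(I am recognizing English here, but the captchas contain only lowercase letters, so uppercase letters were left out. That is why the console reports that the uppercase forms are not in the character set.)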

Extracting unicharset from box file myeng.normal.exp0.box
Other case A of a is not in unicharset
Other case Q of q is not in unicharset
Other case I of i is not in unicharset
Other case V of v is not in unicharset
Other case B of b is not in unicharset
Other case E of e is not in unicharset
Other case O of o is not in unicharset
Other case F of f is not in unicharset
Other case C of c is not in unicharset
Other case D of d is not in unicharset
Other case T of t is not in unicharset
Other case L of l is not in unicharset
Other case M of m is not in unicharset
Other case H of h is not in unicharset
Other case P of p is not in unicharset
Other case Y of y is not in unicharset
Other case J of j is not in unicharset
Other case N of n is not in unicharset
Other case R of r is not in unicharset
Wrote unicharset file unicharset

11. Generate the shape file

shapeclustering -F font_properties -U unicharset -O myeng.unicharset myeng.normal.exp0.tr

The console prints

....

Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Distance = 0.007812: Stopped with 1 merged, min dist 0.048780
Master shape_table:Number of shapes = 26 max unichars = 2 number with multiple unichars = 1

12. Generate the clustered character feature file

mftraining -F font_properties -U unicharset -O myeng.unicharset myeng.normal.exp0.tr

The console prints

Read shape table shapetable of 26 shapes
Reading myeng.normal.exp0.tr ...
Warning: no protos/configs for sh0023 in CreateIntTemplates()
Warning: no protos/configs for sh0024 in CreateIntTemplates()
Warning: no protos/configs for sh0025 in CreateIntTemplates()
Done!

13. Generate the character normalization feature file

Run cntraining myeng.normal.exp0.tr
which prints the following

Reading myeng.normal.exp0.tr ...
Clustering ...

Writing normproto ...
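Steps 9 to 13 are five commands run in a fixed order, so they are easy to script. The following is a rough sketch rather than a tested pipeline: it only rebuilds the exact commands shown in this post from a base name, and the two function names are hypothetical. run_training assumes the tesseract training tools are on PATH and that the tif, the box file, and font_properties sit in the current directory.

```python
import subprocess


def training_commands(base):
    """Return the commands of steps 9-13 for a base name like 'myeng.normal.exp0'."""
    lang = base.split('.')[0]  # 'myeng'
    tr_file = base + '.tr'
    cluster_args = ['-F', 'font_properties', '-U', 'unicharset',
                    '-O', lang + '.unicharset', tr_file]
    return [
        ['tesseract', base + '.tif', base, '-l', 'eng', '--psm', '7',
         'nobatch', 'box.train'],                # step 9: writes the .tr file
        ['unicharset_extractor', base + '.box'],  # step 10: writes unicharset
        ['shapeclustering'] + cluster_args,       # step 11: writes shapetable
        ['mftraining'] + cluster_args,            # step 12: writes inttemp, pffmtable
        ['cntraining', tr_file],                  # step 13: writes normproto
    ]


def run_training(base):
    """Run each command in order and stop at the first failure."""
    for cmd in training_commands(base):
        print(' '.join(cmd))
        subprocess.run(cmd, check=True)


for cmd in training_commands('myeng.normal.exp0'):
    print(' '.join(cmd))
```

Using check=True makes the script stop as soon as one tool fails, which is safer than continuing with stale output files from an earlier run.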

14. Rename the files generated in the steps above

mv inttemp normal.inttemp

mv pffmtable normal.pffmtable
mv normproto normal.normproto
mv unicharset normal.unicharset
mv shapetable normal.shapetable
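The five mv commands all add the same prefix, so the renaming can also be done in one loop. This is a minimal sketch with a hypothetical helper name, add_prefix. It skips files that do not exist so that a missing output is visible in the return value instead of raising halfway through.

```python
import os

# The five files produced by the previous steps
TRAINING_OUTPUTS = ('inttemp', 'pffmtable', 'normproto', 'unicharset', 'shapetable')


def add_prefix(directory, prefix):
    """Rename each training output to '<prefix>.<name>' and return the new names."""
    renamed = []
    for name in TRAINING_OUTPUTS:
        src = os.path.join(directory, name)
        if os.path.exists(src):
            new_name = prefix + '.' + name
            os.rename(src, os.path.join(directory, new_name))
            renamed.append(new_name)
    return renamed


# Usage, in the directory that holds the training outputs:
# print(add_prefix('.', 'normal'))
```

If the returned list has fewer than five entries, one of the earlier steps did not produce its file and is worth rerunning before combining.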

15. Combine the training files

combine_tessdata normal.

The console prints the following:

Combining tessdata files
Output normal.traineddata created successfully.
Version string:4.1.0-rc2-34-gb2fc3
1:unicharset:size=1702, offset=192
3:inttemp:size=172723, offset=1894
4:pffmtable:size=236, offset=174617
5:normproto:size=3422, offset=174853
13:shapetable:size=484, offset=178275
23:version:size=19, offset=178759
In the output, entries 1, 3, 4, 5, and 13 must be present with values; that is what indicates the command succeeded.
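This check can be automated by scanning the combine_tessdata output for the five required entries. The sketch below assumes the output has the same line format as the listing above. missing_components is a hypothetical helper, and the sample text is copied from that listing.

```python
import re

# Entry numbers and names that must appear with a non-zero size
REQUIRED = {1: 'unicharset', 3: 'inttemp', 4: 'pffmtable',
            5: 'normproto', 13: 'shapetable'}

ENTRY = re.compile(r'\s*(\d+):([\w-]+):size=(\d+), offset=(\d+)')


def missing_components(output):
    """Return the names of required entries that are absent or have size 0."""
    sizes = {}
    for line in output.splitlines():
        match = ENTRY.match(line)
        if match:
            sizes[int(match.group(1))] = int(match.group(3))
    return [name for index, name in sorted(REQUIRED.items())
            if sizes.get(index, 0) <= 0]


sample_output = """Combining tessdata files
Output normal.traineddata created successfully.
Version string:4.1.0-rc2-34-gb2fc3
1:unicharset:size=1702, offset=192
3:inttemp:size=172723, offset=1894
4:pffmtable:size=236, offset=174617
5:normproto:size=3422, offset=174853
13:shapetable:size=484, offset=178275
23:version:size=19, offset=178759
"""
print(missing_components(sample_output))  # []
```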

16. Copy the trained file into tesseract's tessdata directory

sudo cp normal.traineddata /usr/share/tesseract-ocr/4.00/tessdata/

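17. Run a new round of tests with the training result

In the code given above, change lang='eng' to the name of your own trained data, normal. When using the tesseract command instead, simply set the language to normal.

result = pytesseract.image_to_string(
    filename, lang='normal', config='--psm 7 --oem 3 -c tessedit_char_whitelist=qwertyuiopasdfghjklzxcvbnm')

Depending on how well the training went and on the quality of the samples, your own trained data may perform far below tesseract's built-in data. This is normal, and training on more samples should help.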