Building your own CAPTCHA training set with pytesseract + tensorflow

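The pytesseract module, combined with the tesseract-ocr software, can recognize most CAPTCHAs. Tesseract can also be run on data you have trained yourself for CAPTCHA recognition; for details see this blogger's post: https://blog.csdn.net/Jayj1997/article/details/102882379
I tried that approach, and it is a lot of trouble.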

With pytesseract, the recognition rate on this kind of CAPTCHA is only about 75%, which is really not satisfactory.
Example CAPTCHA:
Test code:

import pytesseract
from PIL import Image
img = Image.open('./3tst_15819267383940756.png', mode='r')
result = pytesseract.image_to_string(image=img, )
checkcode = result.replace(' ', '').lower()  # remove spaces and convert uppercase to lowercase
print(checkcode)

Recognition result:

$Process finished with exit code 0

It gets even a CAPTCHA this simple wrong, so it really cannot be relied on.

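Then one day, while browsing GitHub, I found a real gem at https://github.com/nickliqian/cnn_captcha: it uses tensorflow to implement a convolutional neural network that recognizes character-based image CAPTCHAs.
git clone https://github.com/nickliqian/cnn_captcha
After cloning it to my machine with the command above, it took quite a while before the tests finally passed.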

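However, its training data is generated with the captcha module, and those CAPTCHA images differ greatly in size and shape from the ones I deal with at work.

So how do you build a data set for training on this kind of CAPTCHA?

The most important question is how to obtain the labels. Training needs at least several thousand images, and labeling them by hand one by one is not an option. But there is still pytesseract, even if its recognition rate is mediocre.
The plan:
First download the images from the website, naming each file in the format: 4-character-label_timestamp.png
The 4-character label is recognized by pytesseract. If the recognized text is not exactly 4 characters long, or contains any character other than digits and letters, the image is moved to a separate directory.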

import gevent
from gevent import monkey
monkey.patch_all(select=False)
import requests
import shutil
import time
import os
from PIL import Image
import pytesseract
import re

root_dir = './imgs'
error_dir = './imgs_error'
image_suffix = 'png'
imgurl = 'http://1.1.1.1:7001/cas/captcha.htm'  # change this to the URL that serves the CAPTCHA
headers = {
    'context-type': 'text/xml; charset=UTF-8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    # The next header is truncated in the source; only its tail survives:
    # ...service=http%3A%2F%2F10.128.85.13%3A7001%2Fworkshop%2Fbase%2Fjt%2Fjsp%2Findex_JT.jsp
}

if not os.path.exists(root_dir):
    os.makedirs(root_dir)
if not os.path.exists(error_dir):
    os.makedirs(error_dir)

CNT = 0


def get_img():
    """Download one CAPTCHA, label it with pytesseract, and file it by result."""
    global CNT
    checkcodecontent = requests.get(url=imgurl, headers=headers)
    timec = str(time.time()).replace(".", "")
    text = 'aaaa'  # temporary label until OCR has run
    filename = os.path.join(root_dir, "{}_{}.{}".format(text, timec, image_suffix))
    with open(filename, 'wb') as f:
        f.write(checkcodecontent.content)
    # Close the image before moving or renaming the file
    with Image.open(filename, mode='r') as img:
        result = pytesseract.image_to_string(image=img)
    # Remove spaces and surrounding whitespace, convert uppercase to lowercase
    checkcode = result.replace(' ', '').strip().lower()
    CNT = CNT + 1
    destname = os.path.join(error_dir, "{}_{}.{}".format(checkcode, timec, image_suffix))
    if len(checkcode) != 4:
        print('CAPTCHA {} is not 4 characters long, moving it to the error directory'.format(checkcode))
        shutil.move(filename, destname)
        return
    if re.search(r'\W+', checkcode):
        print('Label {} contains special characters (not digits or letters), moving it to the error directory'.format(checkcode))
        shutil.move(filename, destname)
        return
    # TODO: checkcode should consist only of the characters a-z, 1-9, A-Z
    filename_new = os.path.join(root_dir, "{}_{}.{}".format(checkcode, timec, image_suffix))
    os.rename(filename, filename_new)
    print('Downloaded CAPTCHA image #{}'.format(CNT))


def modefy_name():
    """Delete files in root_dir whose label prefix is not a clean 4-character label."""
    filenames = os.listdir(root_dir)
    for filename in filenames:
        prefix = re.split('_', filename)[0]
        if len(prefix) != 4:
            print('Prefix of file {} is not 4 characters long, deleting it'.format(filename))
            os.remove(os.path.join(root_dir, filename))
        elif re.search(r'\W+', prefix):
            print('Prefix of file {} contains special characters (not digits or letters), deleting it'.format(filename))
            os.remove(os.path.join(root_dir, filename))


if __name__ == '__main__':
    start = time.time()
    for j in range(200):
        # Fetch 30 images concurrently per batch
        g_list = list()
        for i in range(30):
            g = gevent.spawn(get_img)
            g_list.append(g)
        gevent.joinall(g_list)
        print('Total elapsed: %.5f seconds' % float(time.time() - start))
        modefy_name()
        time.sleep(50)
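The script leaves a note that the label should be restricted to letters and digits. One thing to watch is that `\W` does not match an underscore, so a label such as `a_b1` would pass both checks and then be split incorrectly on `_` later. The following is a minimal sketch of a stricter check, assuming labels are exactly four lowercase alphanumeric characters; `is_valid_label` and `label_from_filename` are hypothetical helper names, not part of the original script.

```python
import re

# Assumption: a label is exactly four characters from 0-9 and a-z,
# matching the lowercase conversion applied to the OCR result.
LABEL_RE = re.compile(r'[0-9a-z]{4}')


def is_valid_label(text):
    """Return True only for a 4-character lowercase alphanumeric label."""
    return LABEL_RE.fullmatch(text) is not None


def label_from_filename(filename):
    """Return the label prefix of a name like '3tst_15819267383940756.png'."""
    return filename.split('_', 1)[0]


if __name__ == '__main__':
    for candidate in ['3tst', 'a_b1', 'ab$1', '3tstx']:
        print(candidate, is_valid_label(candidate))
```

Because `fullmatch` has to consume the whole string, a trailing newline or an extra character also makes the label invalid.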

Download the CAPTCHA images into the imgs directory (at least 5000 of them), copy them to cnn_captcha\sample\origin,
then run python verify_and_split_data.py
Output:

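>>> Start verifying directory: [sample/origin/]
Start verifying the original image set
Original set contains 7375 images
==== The following 0 images are abnormal ====
No abnormal images found (7375 images in total)
========end
Start splitting the original image set into a test set (5%) and a training set (95%)
Assigned 7375 images to the training and test sets; 0 abnormal images were left in the original directory
Test set size: 368
Training set size: 7007
>>> Start verifying directory: [sample/new_train/]
Start verifying the original image set
Original set contains 0 images
==== The following 0 images are abnormal ====
No abnormal images found (0 images in total)
========end
Start splitting the original image set into a test set (5%) and a training set (95%)
Assigned 0 images to the training and test sets; 0 abnormal images were left in the original directory
Test set size: 0
Training set size: 0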
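The 5% / 95% split reported above can be reproduced with a few lines. This is only a sketch under the assumption of a plain shuffle with the test count rounded down; `split_samples` and its `seed` parameter are hypothetical, and verify_and_split_data.py may do this differently.

```python
import random


def split_samples(filenames, test_ratio=0.05, seed=0):
    """Shuffle the file names and return (train, test) lists."""
    names = list(filenames)
    random.Random(seed).shuffle(names)
    test_count = int(len(names) * test_ratio)  # rounds down
    return names[test_count:], names[:test_count]


if __name__ == '__main__':
    # Placeholder names standing in for the 7375 downloaded images
    samples = ['img_{}.png'.format(i) for i in range(7375)]
    train, test = split_samples(samples)
    print(len(train), len(test))
```

With 7375 images and the count rounded down, this gives 368 test images and 7007 training images, the same sizes as in the log.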

Start training:
python train_model.py
Training takes roughly a few hours.
Training output:

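[Training set] character accuracy 1.00000 image accuracy 1.00000 >>> loss 0.0029035145
[Validation set] character accuracy 0.94750 image accuracy 0.83000 >>> loss 0.0029035145
Training step 6970 >>>
[Training set] character accuracy 0.99414 image accuracy 0.99219 >>> loss 0.0040454771
[Validation set] character accuracy 0.95500 image accuracy 0.89000 >>> loss 0.0040454771
Training step 6980 >>>
[Training set] character accuracy 0.99805 image accuracy 0.99219 >>> loss 0.0039159926
[Validation set] character accuracy 0.95500 image accuracy 0.87000 >>> loss 0.0039159926
Training step 6990 >>>
[Training set] character accuracy 1.00000 image accuracy 1.00000 >>> loss 0.0033209689
[Validation set] character accuracy 0.96500 image accuracy 0.88000 >>> loss 0.0033209689
Training step 7000 >>>
[Training set] character accuracy 0.99219 image accuracy 0.99219 >>> loss 0.0044688974
[Validation set] character accuracy 0.96500 image accuracy 0.89000 >>> loss 0.0044688974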
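The log reports two figures per step: character accuracy and image accuracy. A small sketch of how the two can be computed, assuming character accuracy counts matching positions and image accuracy counts images whose whole label matches. `captcha_accuracy` is a hypothetical helper for illustration, not code taken from cnn_captcha, and the sample labels are made up.

```python
def captcha_accuracy(predictions, labels):
    """Return (character_accuracy, image_accuracy) for paired label strings."""
    total_chars = 0
    correct_chars = 0
    correct_images = 0
    for pred, label in zip(predictions, labels):
        total_chars += len(label)
        # Count positions where predicted and true characters agree
        correct_chars += sum(p == t for p, t in zip(pred, label))
        if pred == label:
            correct_images += 1
    return correct_chars / total_chars, correct_images / len(labels)


if __name__ == '__main__':
    preds = ['3tst', 'ab12', 'x9k2', 'm4pq']
    truth = ['3tst', 'ab1z', 'x9k2', 'n4pq']
    print(captcha_accuracy(preds, truth))  # (0.875, 0.5)
```

An image is counted as correct only when all four characters are right, so image accuracy is always the lower figure. If character errors were independent, a character accuracy of 0.965 would give roughly 0.965^4 ≈ 0.867 per image, which is in line with the 0.88 to 0.89 shown for the validation set.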

Start the API service:
python webserver_recognize_api.py

Test whether the CAPTCHA is recognized correctly:
Test image:

python recognize_local.py

API response: {"speed_time(ms)": 76, "time": "1584877709842", "value": "3tst"}
【2020-02-17 18:31:27】 elapsed: 124ms prediction: 3tst
============== end ==============
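If you call the API from your own code, the JSON body shown above can be parsed directly. A minimal sketch, assuming the response always has the three keys seen in this output; `extract_prediction` is a hypothetical helper name.

```python
import json

# Response body copied from the output above
raw = '{"speed_time(ms)": 76, "time": "1584877709842", "value": "3tst"}'


def extract_prediction(body):
    """Return (predicted_text, server_time_ms) from the API's JSON body."""
    data = json.loads(body)
    return data["value"], data["speed_time(ms)"]


if __name__ == '__main__':
    print(extract_prediction(raw))  # ('3tst', 76)
```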

Perfect!

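Appendix: notes on the tensorflow setup:
1) First clone the code from https://github.com/nickliqian/cnn_captcha to your machine.
Many thanks to Nick Li for providing the code.
2) Create the directories model, logs, sample/origin, sample/train, sample/test,
sample/api, sample/local, sample/new_train, sample/online
3) Put the CAPTCHAs to be trained into sample/origin, named in the format 4-character-label_timestamp.png;
you need at least 5000 images. Then run python verify_and_split_data.py
4) Start training with python train_model.py
If it fails with an error at this line:
saver.restore(sess, self.model_save_dir)
you can comment that line out first and re-enable it once the training set is working.
5) After training completes, the API must be started before you can run the verification.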
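The directories from step 2 can be created in one go rather than by hand. A minimal sketch, assuming it is run with the cloned cnn_captcha directory as `base_dir`; `create_project_dirs` is a hypothetical helper, and the demo writes into a temporary directory so it does not touch a real checkout.

```python
import os
import tempfile

# Directory names taken from step 2 above
REQUIRED_DIRS = [
    'model', 'logs',
    'sample/origin', 'sample/train', 'sample/test',
    'sample/api', 'sample/local', 'sample/new_train', 'sample/online',
]


def create_project_dirs(base_dir='.'):
    """Create every required directory under base_dir; existing ones are kept."""
    created = []
    for rel in REQUIRED_DIRS:
        path = os.path.join(base_dir, *rel.split('/'))
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_project_dirs(tmp):
            print(os.path.relpath(path, tmp))
```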
