Offline OCR Image Text Recognition in Python (Part 4): Saving txt Files to a Specified Path

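Since the second upgrade, the JSON configuration file has supported saving the OCR result txt files to a specified folder. However, the folder to be recognized may contain several subfolders, and different subfolders may contain image files with the same name. The original approach put every txt file directly into the single folder named in the JSON file, so when images in different folders shared a name, one txt file overwrote another. The chance is small, but fellow developers reported that it really did happen, so an improvement was needed. The improved approach recreates the directory structure of the source image folder under the txt save path specified in the JSON file and stores each txt file at the matching location. This avoids losing recognition results to overwrites caused by duplicate file names inside a large folder. The upgraded code is as follows: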

#!/home/super/miniconda3/bin/python
#encoding=utf-8
#author: superchao1982, 50903556@qq.com
#help text (raw string so that the backslashes in the Windows path are kept literally)
strhelp=r'''
img2txt is a program that gets ocr texts from image or pdf files!
default threshold is 0.1;
default langpath is '/home/langdata' for linux and 'C:\ocr\langdata' for win;
default remove char is '| _^~`&';
default path for storing the ocr texts is the same directory as the images;
default settings above can be changed in the file 'config.json', which is stored in langpath;
contents in config.json look like:
{
    "threshold": 0.1,
    "batchsize": 2,
    "workernum": 4,
    "maximgsize": 1000,
    "allowlist": "",
    "langpath": "/home/langdata",
    "removechar": "| _^~`&",
    "txtpath": ""
}
------------------------------------
e.g.
./img2txt.py img1.jpg img2.jpg #follow by one or more image files
./img2txt.py ./img1 ./img home/usr/Document/img #follow by one or more directories containing image files
./img2txt.py --help #output the help info
./img2txt.py --config #generate the default config.json file in the langpath
------------------------------------
'''
import sys
import json
import os
import pdf2image
import numpy as np

#------------------default parameters----------------------
threshold=0.1        #(default = 0.1) confidence threshold
batchsize=2            # (default = 1) - batch_size>1 will make EasyOCR faster but use more memory
workernum=4            # (default = 0) - Number of threads used in the dataloader
maximgsize=1000        #(default = 1000) - Max image width & height when using pdf
allowlist=''        # (string) - Force EasyOCR to recognize only subset of characters
removechar='| _^~`&'#invalid characters to remove from the result
txtpath=''            #where the same-named txt files are stored after OCR: empty means the image's own directory, '.' means a relative directory, anything else is an absolute directory
#set the default language pack path according to the system
if sys.platform.lower().startswith('linux'):
    langpath='/home/langdata'
elif sys.platform.lower().startswith('win'):
    langpath=r'C:\ocr\langdata'#raw string: '\o' and '\l' are not valid escape sequences
else:
    print('Error: Unknown System!')
    sys.exit()
#configuration dictionary
config={
    "threshold": threshold,
    "batchsize": batchsize,
    "workernum": workernum,
    "maximgsize": maximgsize,
    "allowlist": allowlist,
    "langpath": langpath,
    "removechar": removechar,
    "txtpath": txtpath
}

#------------------command-line argument handling----------------------
#check the command-line arguments first; validating them before the ocr package is loaded avoids wasting time on an error that would only show up at processing time
for i in range(1,len(sys.argv)):#command-line arguments: argv[0] is the executable itself
    if sys.argv[i] in ['-h', '--help']:
        print(strhelp)
        sys.exit()
    elif sys.argv[i] in ['-c', '--config']:
        #save the dictionary to a file
        try:
            with open(os.path.join(langpath,'config.json'), 'w') as jsonfile:
                json.dump(config, jsonfile, ensure_ascii=False,indent=4)
            print('Generating config.json success! ---> ', os.path.join(langpath,'config.json'))
        except(Exception) as e:
            print('\tSaving config file config.json Error: ', e)#print the exception
        sys.exit()
    else:
        #check in advance that the image file or directory is valid, to avoid wasting time loading the easyocr model
        if not os.path.exists(sys.argv[i]):
            print(sys.argv[i], ' is invalid, please input the correct file or directory path!')
            sys.exit()
#check that the langpath is valid
if not os.path.exists(langpath):
    print('Error: Invalid langpath! Check the path again!')
    sys.exit()
#if the config file config.json exists, use it
configfile=os.path.join(langpath,'config.json')
if os.path.exists(configfile):
    try:
        with open(configfile, 'r') as jsonfile:
            configdict=json.load(jsonfile)
        threshold=configdict['threshold']
        batchsize=configdict['batchsize']
        workernum=configdict['workernum']
        maximgsize=configdict['maximgsize']
        langpath=configdict['langpath']
        allowlist=configdict['allowlist']
        removechar=configdict['removechar']
        txtpath=configdict['txtpath']
        print('using the config in ', configfile)
    except(Exception) as e:
        print('\tReading config file ', configfile ,' Error: ', e)#print the exception
        print('\tCheck the json file, or remove the config.json file to use the default configs!')
        sys.exit()
else:
    configdict=config#no config.json: show the defaults (configdict would otherwise be undefined below)
    print('\tusing the default config in ', langpath)
print(configdict)
#if the txt save path given in config.json does not exist, create it
if len(txtpath)>0 and not os.path.exists(txtpath):
    print('txtpath in config.json does not exist, generating ', txtpath, '!\n')
    try:
        os.makedirs(txtpath)#creates missing parent directories and raises on failure, unlike os.system('mkdir ...')
    except(Exception) as e:
        print('\tMaking txt directory Error: ', e)#print the exception
        print('\tPlease input a legal txtpath in the config.json file and try again!\n')
        sys.exit()

#------------------start OCR----------------------
import easyocr
ocrreader=easyocr.Reader(['ch_sim', 'en'], model_storage_directory=langpath)#Linux: r'/home/langdata', Windows: r'C:\ocr\langdata'
for ind in range(1,len(sys.argv)):#command-line arguments: argv[0] is the executable itself
    argpath=sys.argv[ind]
    #if it is a file...
    if os.path.isfile(argpath):
        paper=''
        #get the file extension
        filext=os.path.splitext(argpath)[-1]
        if filext.upper() not in ['.JPG','.JPEG','.PNG','.BMP','.PDF']:#compare after converting to upper case
            print('\t', argpath, ' is not a valid image format (jpg/jpeg/png/bmp/pdf)!')
            continue
        if filext.upper() in['.PDF']:#pdf document
            images=pdf2image.convert_from_path(argpath)#convert the pdf document into a sequence of images
            for i in range(len(images)):
                #if the image is too large, shrink it to the size limit to avoid running out of memory
                ratio=max(images[i].width, images[i].height)/maximgsize
                if ratio>1:
                    images[i]=images[i].resize((round(images[i].width/ratio),round(images[i].height/ratio)))
                result = ocrreader.readtext(np.asarray(images[i]),batch_size=batchsize,workers=workernum)
                for w in result:
                    if w[2]>threshold:#keep only results above the confidence threshold
                        paper = paper+w[1]
        else:
            result = ocrreader.readtext(argpath,batch_size=batchsize,workers=workernum)
            for w in result:
                if w[2]>threshold:#keep only results above the confidence threshold
                    paper = paper+w[1]
        #print(paper)
        for item in removechar:
            paper=paper.replace(item, '')
        paper=paper.replace('\r', '')
        paper=paper.replace('\n', '')
        #record the recognition result of the current file as a txt file with the same name
        if(len(txtpath)>0):#a txt directory has been set
            basename=os.path.splitext(os.path.basename(argpath))[0]+'.txt'#txt file with the same name as the source file (file name only, image extension dropped)
            txtfilename=os.path.join(txtpath, basename)
        else:
            txtfilename=os.path.splitext(argpath)[0]+'.txt'#txt file with the same name as the source file (including directory)
        print('saving file ---> ', txtfilename)#name of the saved file
        try:
            with open(txtfilename, 'w') as txtfile:
                txtfile.write(paper)
        except(Exception) as e:
            print('\t', txtfilename, ' Saving txt File Error: ', e)#print the exception
            continue
    #if it is a folder...
    if os.path.isdir(argpath):
        for root, _, filenames in os.walk(argpath):
            for imgfile in filenames:
                paper=''
                filext=os.path.splitext(imgfile)[-1]#file extension
                if filext.upper() not in ['.JPG','.JPEG','.PNG','.BMP','.PDF']:
                    print('\t', imgfile, ' does not have a valid image extension, skipping this file!')
                    continue
                imgfilepath=os.path.join(root, imgfile)#full path of the file
                if filext.upper() in['.PDF']:#pdf document
                    images=pdf2image.convert_from_path(imgfilepath)#convert the pdf document into a sequence of images
                    for i in range(len(images)):
                        #if the image is too large, shrink it to the size limit to avoid running out of memory
                        ratio=max(images[i].width, images[i].height)/maximgsize
                        if ratio>1:
                            images[i]=images[i].resize((round(images[i].width/ratio),round(images[i].height/ratio)))
                        result = ocrreader.readtext(np.asarray(images[i]),batch_size=batchsize,workers=workernum)
                        for w in result:
                            if w[2]>threshold:#keep only results above the confidence threshold
                                paper = paper+w[1]
                else:
                    result = ocrreader.readtext(imgfilepath,batch_size=batchsize,workers=workernum)
                    for w in result:
                        if w[2]>threshold:#keep only results above the confidence threshold
                            paper = paper+w[1]
                #print(paper)
                for item in removechar:
                    paper=paper.replace(item, '')
                paper=paper.replace('\r', '')
                paper=paper.replace('\n', '')
                #record the recognition result of the current file as a txt file with the same name
                basename=os.path.splitext(imgfile)[0]+'.txt'#txt file with the same name as the source file (without directory)
                if(len(txtpath)>0):#a txt directory has been set
                    #the original approach put all txt files directly into the one specified folder, so same-named images in different folders overwrote each other's txt files
                    #txtfilename=os.path.join(txtpath, basename)#join to get the full path of the txt file
                    #the approach below recreates the directory structure of the source images under the specified folder and stores the txt files there
                    relativeimgpath=os.path.relpath(imgfilepath, argpath)#path of the image relative to argpath; str.lstrip(argpath) would strip a set of characters rather than the prefix
                    newtxtpath=os.path.join(txtpath,relativeimgpath)#txt path + relative image path (still ends with the image file name and extension)
                    basedir=os.path.dirname(newtxtpath)#drop the image file name and extension
                    if not os.path.exists(basedir):#the new directory may not exist yet
                        try:
                            os.makedirs(basedir)#create the directory, including missing parent directories
                        except(Exception) as e:
                            print('\tMaking txt directory Error: ', e)#print the exception
                            print('\tTxt file will be stored in the image file directory!')
                            basedir=root#fall back to the image's own directory
                    txtfilename=os.path.join(basedir, basename)#directory + txt file name
                else:
                    txtfilename=os.path.join(root, basename)#image directory + txt file name
                print('saving file ---> ', txtfilename)#name of the saved file
                try:
                    with open(txtfilename, 'w') as txtfile:
                        txtfile.write(paper)
                except(Exception) as e:
                    print('\t', txtfilename, ' Saving txt File Error: ', e)#print the exception
                    continue
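
The directory-mirroring step is the core of this upgrade, so it may help to look at it in isolation. The following is a minimal sketch, not part of the script above: the helper name `mirrored_txt_path` and its parameters `img_path`, `src_root` and `txt_root` are hypothetical, and it assumes `txt_root` is a non-empty path. It computes the image's path relative to the scanned folder with `os.path.relpath` and rebuilds that path under the txt folder. The last two lines show why `str.lstrip` is not suitable for removing a path prefix: it strips a set of characters, not a leading substring.

```python
import os


def mirrored_txt_path(img_path, src_root, txt_root):
    """Map an image under src_root to a .txt path at the same relative
    location under txt_root, creating the target directory if needed.
    (Hypothetical helper for illustration; assumes txt_root is non-empty.)"""
    # relpath removes src_root as a path prefix, not as a character set
    rel = os.path.relpath(img_path, src_root)
    # swap the image extension for .txt, keeping the subfolders
    rel_txt = os.path.splitext(rel)[0] + '.txt'
    target = os.path.join(txt_root, rel_txt)
    # create the mirrored subfolder chain; no error if it already exists
    os.makedirs(os.path.dirname(target), exist_ok=True)
    return target


# str.lstrip treats its argument as a set of characters, so it can eat
# into the file name itself:
print('img/image1.jpg'.lstrip('img/'))            # prints: age1.jpg
print(os.path.relpath('img/image1.jpg', 'img'))   # prints: image1.jpg
```

With this mapping, two images such as `imgs/a/scan.jpg` and `imgs/b/scan.jpg` end up as `a/scan.txt` and `b/scan.txt` under the txt folder, so neither result overwrites the other.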
