ocr tesseract

The code is available on GitHub

该代码可在 GitHub上获得

Let’s say you want to extract some information from paper prescriptions (to perform some trend analysis afterwards, or to simply check if the prescription is valid against a database; the exact reasons may vary and are not too important here). You could type them into a computer database manually and call it a day. But it gets tedious if the number of those to type in becomes large, and the time is precious. This is when Optical Character Recognition (OCR) techniques may come in handy.

假设您要从纸质处方中提取一些信息(以便在以后进行一些趋势分析,或者只是根据数据库检查该处方是否有效;确切的原因可能有所不同,在这里并不太重要)。 您可以将它们手动输入到计算机数据库中,并称之为一天。 但是,如果要键入的内容数量增加,这将变得很乏味,并且时间很宝贵。 这就是光学字符识别(OCR)技术可能派上用场的时候。

OCR is an umbrella term encompassing a range of different technologies that detect, extract and recognise text from images. There exist a multitude of approaches for OCR from very simple ones, that can only recognise clear text in a specific font, to particularly advanced, that employ machine learning and can work with both printed and handwritten texts in multiple languages. We are going to focus our attention on two specific ones here.

OCR是一个笼统的术语,涵盖了从图像中检测,提取和识别文本的一系列不同技术。 OCR存在许多方法,从非常简单的方法到只能识别特定字体的清晰文本,再到特别高级的方法,它们采用了机器学习,并且可以同时处理多种语言的印刷文本和手写文本。 在这里,我们将注意力集中在两个特定的方面。

我们有什么选择? (What are our options?)

In this article we are going with two specific OCR tools: Tesseract OCR and Google Cloud’s Vision API

在本文中,我们将使用两种特定的OCR工具: Tesseract OCR和Google Cloud的Vision API

Tesseract OCR is an offline tool, which provides some options it can be run with. The one that makes the most difference in the example problems we have here is page segmentation mode. In many cases, one might resort to run it in auto-mode, but it’s always useful to think about what the potential layouts of the documents might be and hence provide some hints to Tesseract, so it can yield more accurate results. For simplicity, I will not cover all of the modes available, but experiment with three, that, at least intuitively, might fit the layouts of the documents we are working with here:

Tesseract OCR是一个离线工具,提供了一些可以运行的选项。 在这里的示例问题中,最大的区别是页面分割模式。 在许多情况下,可能会选择以自动模式运行它,但是思考文档的潜在布局可能总是很有用的,因此可以为Tesseract提供一些提示,以便可以产生更准确的结果。 为简单起见,我不会介绍所有可用的模式,而是尝试三种模式,至少在直观上,这些模式可能适合我们在此使用的文档的布局:

  • Single Block: the system assumes that there is a single block of text on the image单个块:系统假定图像上有单个文本块
  • Sparse Text: the system tries to extract as much text as possible, disregarding its location and order稀疏文本:系统尝试提取尽可能多的文本,而不考虑其位置和顺序
  • Single Column: the system assumes that there’s a single column of lines of texts单列:系统假设文本行只有一列

Google Vision, on the other hand, does not provide as much control over its configuration as Tesseract. However, its defaults are very effective in general. There are two distinct OCR models that are worth experimenting with:

另一方面,Google Vision对其配置的控制能力不及Tesseract。 但是,其默认设置通常非常有效。 有两种不同的OCR模型值得尝试:

  • Text Detection model: detects and recognises all text on a provided image文本检测模型:检测并识别提供的图像上的所有文本
  • Document Text Detection model: practically performs the same task, but is tuned to better suit dense texts and documents文档文本检测模型:实际上执行相同的任务,但已调整为更好地适合密集的文本和文档

进入第一个例子 (Into the first example)

N.B. the full recognised text results are not provided here, for the sake of keeping the article readable. However, the reader is encouraged to play around with the notebook provided and see full results for themselves.

注意:为了使文章易于阅读,此处未提供完整的识别文本结果。 但是,建议读者阅读提供 的笔记本电脑 ,自己看看完整的结果。

Starting with our first example: a scanned prescription for amoxicillin (kind of antibiotic). We will first look at how the methods described above handle this form as is.

从我们的第一个例子开始:阿莫西林(一种抗生素)的扫描处方。 我们将首先看一下上述方法如何按原样处理这种形式。

A couple of observations to make here:

这里有几个观察结果:

  • Both Single Block and Single Column Tesseract modes detect some spurious boxes as text (we may assume that the text is not dense enough for these modes)单块和单列Tesseract模式都将一些虚假框检测为文本(我们可能认为这些模式的文本密度不足)
  • Both Google models detect the text on the watermarks, that is not relevant to our task两种Google模型都检测到水印上的文本,这与我们的任务无关

I will not be talking about the accuracy of the recognition yet, since there’s a way to improve the detection first!

我不会谈论识别的准确性,因为有一种方法可以首先改善检测效果!

图像预处理有帮助吗? (Does image pre-processing help?)

There’s, of course, a plethora of ways that images can be pre-processed before analysis, including but not limited to different ways of filtering and de-noising, lightning adjustments, sharpening, you name it. But here we don’t have to do anything too convoluted to remove the background. We can perform thresholding, an operation that turns an image into a binary one by setting the pixels to black if their intensity values are above a certain threshold, and to white otherwise. Conveniently, the OpenCV library provides a method to threshold the image. Instead of picking and hard-coding some specific threshold Otsu method can be used, which chooses the best threshold, based on the distribution of pixel intensities. The results after thresholding already look more promising. Let’s investigate what we see here.

当然,在分析之前,有很多方法可以对图像进行预处理,包括但不限于过滤和降噪,闪电调整,锐化等各种方法,您可以自己命名。 但是在这里,我们不必做任何复杂的事情即可删除背景。 我们可以执行阈值处理,通过将像素的强度值高于某个阈值设置为黑色,将像素设置为黑色,否则将图像设置为白色,从而将图像转换为二进制图像。 方便地,OpenCV库提供了一种对图像进行阈值处理的方法。 代替选择和硬编码某些特定阈值,可以使用Otsu方法 ,该方法根据像素强度的分布选择最佳阈值。 阈值化后的结果看起来更有希望。 让我们调查一下我们在这里看到的内容。

  • Tesseract’s results haven’t changed much (since it was not affected by the background anyway).Tesseract的结果没有太大变化(因为它始终不受背景影响)。
  • However, Google’s results have definitely improved: no more unwanted elements are recognised.但是,Google的结果肯定得到了改善:不再识别出任何多余的元素。
  • Tesseract’s Sparse Text mode still stands superior to the other two, detecting the layout correctly, and recognising most of the text without mistakes. There are some occasional extra characters inserted: for example “i 50 Stanhope Street”, where ‘i’ is not a real character, but part of the box to the left of the text.Tesseract的“稀疏文本”模式仍然优于其他两个模式,可以正确检测布局并正确识别大多数文本。 偶尔会插入一些额外的字符:例如“ i Stanhope Street 50”,其中“ i”不是真实字符,而是文本左侧方框的一部分。
  • Google Document Text Detection model does not seem to perform well on this one, combining texts into a single paragraph when they clearly do not belong together. The possible reason here is that the text is not dense enough for that particular model.Google文档文本检测模型在此模型上似乎效果不佳,当文本明显不属于同一文本时,将它们合并为一个段落。 此处可能的原因是该文本对于该特定模型而言不够密集。
  • We can also see one of the weak sides of Google Vision: it sometimes fails to recognise short strings or lone characters standing on their own (Google Text Detection model failed to detect that single ‘5’ in the middle of the page). But overall, when the text was detected, it was indeed recognised correctly.我们还可以看到Google Vision的弱点之一:它有时无法识别短字符串或单独的字符(Google文字检测模型无法在页面中间检测到单个“ 5”)。 但是总的来说,当检测到文本时,确实可以正确识别它。

So, as you can see, there’s no one-size-fits-all model. In this particular case, after the experiment, I would go with either Tesseract Sparse Text mode or Google Text Detection model (leaning more towards the former, as the 5 and other single-digit numbers that may come in the following documents, that Google misses, might be quite important for the task), with the thresholding as the preprocessing step.

因此,如您所见,没有一种适合所有人的模型。 在这种特殊情况下,经过实验后,我将使用Tesseract稀疏文本模式或Google文本检测模型(更倾向于前者,因为以下文档可能会包含5个和其他个位数的数字,但Google会忽略掉,对于该任务可能非常重要),并以阈值作为预处理步骤。

那不同的布局呢 (What about a different layout)

To illustrate that the nature of a document might influence the choice of a model, let’s go through another example. Here we have a receipt from a hotel, that was submitted to claim reimbursement.

为了说明文档的性质可能会影响模型的选择,让我们来看另一个示例。 在这里,我们有一家酒店的收据,已据要求索偿。

Original source: https://www.imperial.ac.uk/staff-travel-and-expenses/whilst-you-are-travelling/obtaining-receipts-to-evidence-expenditure/receipt-examples/
原始来源: https : //www.imperial.ac.uk/staff-travel-and-expenses/whilst-you-are-travelling/obtaining-receipts-to-evidence-expenditure/receipt-examples/
  • All three modes of Tesseract did a good job detecting the text on the image. However, Single Block and Column modes kept the lines of the text intact, which is convenient if you wants to match label-value pairs in the analysis afterwards.Tesseract的所有三种模式都很好地检测了图像上的文字。 但是,“单块和列”模式可以使文本的行保持完整,如果以后要在分析中匹配“标签-值”对,这将非常方便。
  • Tesseract’s recognition, on the other hand, is far from perfect (e.g. “MTE: Jin, 26,16 TIE: 03:4” instead of “DATE: Jun, 26, 16 TIME: 09:47”). My guess would be that it is due to a narrow font and the image quality not being the sharpest. Some pre-processing steps to improve edges of the letters might yield better results, but this is just a speculation, since this was not tested.另一方面,Tesseract的认可远非完美(例如,“ MTE:Jin,26,16 TIE:03:4”,而不是“ DATE:Jun,26,16 TIME:09:47”)。 我的猜测是这是由于字体较窄且图像质量不是最清晰。 一些改善字母边缘的预处理步骤可能会产生更好的结果,但这只是一种推测,因为未经测试。
  • In comparison, Google Vision recognises all the text with almost no errors, including the handwritten “Housing” (which Tesseract recognised as “‘stflrg”)相比之下,Google Vision几乎可以识别所有文本,包括手写的“外壳”(Tesseract将该文本识别为“'st fl rg”)
  • The segmentation by the Text Detection model was again better in comparison to the Document Text Detection model, even though it breaks up some of the lines of text与文档文本检测模型相比,“文本检测”模型的分割还是更好的,尽管它分割了一些文本行

At this point, I’d like to add a note on matching label-value pairs. Tesseract’s Single Column mode conveniently provides lines of the receipt, so the label-value pairs can be parsed trivially. However, if this was not the case (and it isn’t for the results of Google Vision), the pairs can still be matched if needed by using the coordinates of the detected text, which are provided by both Tesseract and Google.

在这一点上,我想对匹配的标签值对添加一条注释。 Tesseract的“单列”模式可方便地提供收据行,因此可以轻松解析标签-值对。 但是,如果不是这种情况(不是Google Vision的结果),则仍可以根据需要使用Tesseract和Google所提供的检测到的文本的坐标来匹配这些对。

In conclusion, even though Tesseract Single Column provides a nice layout, I would choose Google Vision for this task, as it provides much better accuracy of recognition (including the handwritten part, which is important for this particular example).

总之,即使Tesseract Single Column提供了一个不错的布局,我还是会选择Google Vision来完成此任务,因为它提供了更好的识别精度(包括手写部分,这对于该特定示例很重要)。

带回家的消息 (Take-home message)

In this short blog post, we’ve compared how Tesseract OCR and Google Vision perform on two specific tasks, but this is by no means an exhaustive overview. When employing OCR, you need to consider the nature of the images you are working with, and experiment with different OCR models to find one that is the most suitable for their application. Additionally, the real-world data might not be as clean as the examples above, therefore some pre-processing steps might be required for recognition to be successful. And finally, even though this step is out of the scope of this post, it is worth mentioning, that after the recognition some post-processing might be required, such as removal of the extra characters, or spell-checking (it depends on what the further analysis is).

在这篇简短的博客文章中,我们比较了Tesseract OCR和Google Vision在两个特定任务上的执行情况,但这绝不是详尽的概述。 使用OCR时,您需要考虑所使用图像的性质,并尝试使用不同的OCR模型以找到最适合其应用的图像。 此外,现实世界中的数据可能不如上面的示例那么干净,因此可能需要一些预处理步骤才能使识别成功。 最后,即使此步骤不在本文讨论范围之内,但值得一提的是,在识别之后,可能需要进行一些后期处理,例如删除多余的字符或进行拼写检查(取决于哪些内容)。进一步的分析是)。

The notebook with the code producing the results for the examples in this post is available. The readers are highly encouraged to give it a try and experiment with their sample data.

提供了带有生成本文中示例结果的代码的笔记本 。 强烈建议读者尝试并尝试使用其样本数据。

翻译自: https://medium.com/fuzzylabsai/the-battle-of-the-ocr-engines-tesseract-vs-google-902c302e71a1

ocr tesseract


http://www.taodudu.cc/news/show-1873988.html

相关文章:

  • 游戏行业数据类丛书_理论丛书:高维数据101
  • tesseract box_使用Qt Box Editor在自定义数据集上训练Tesseract
  • 人脸检测用什么模型_人脸检测模型:使用哪个以及为什么使用?
  • 不洗袜子的高文博_那个孩子在夏天中旬用高袜子大笑?
  • word2vec字向量_Anything2Vec:将Reddit映射到向量空间
  • ai人工智能伪原创_AI伪科学与科学种族主义
  • ai人工智能操控什么意思_为什么要建立AI分散式自治组织(AI DAO)
  • 机器学习cnn如何改变权值_五个机器学习悖论将改变您对数据的思考方式
  • DeepStyle(第2部分):时尚GAN
  • 肉体之爱的解释圣经_可解释的AI的解释
  • 机器学习 神经网络 神经元_神经网络如何学习?
  • 人工智能ai应用高管指南_理解人工智能指南
  • 机器学习 决策树 监督_监督机器学习-决策树分类器简介
  • ai人工智能数据处理分析_建立数据平台以实现分析和AI驱动的创新
  • 极限学习机和支持向量机_极限学习机的发展
  • 人工智能时代的危机_AI信任危机:如何前进
  • 不平衡数据集_我们的不平衡数据集
  • 建筑业建筑业大数据行业现状_建筑—第4部分
  • 线性分类模型和向量矩阵求导_自然语言处理中向量空间模型的矩阵设计
  • 离散数学期末复习概念_复习第1部分中的基本概念
  • 熵 机器学习_理解熵:机器学习的金标准
  • heroku_如何通过5个步骤在Heroku上部署机器学习UI
  • detr 历史解析代码_视觉/ DETR变压器
  • 人工神经网络方法学习步长_人工神经网络-一种直观的方法第1部分
  • 机器学习 声音 分角色_机器学习对儿童电视节目角色的痴迷
  • 遗传算法是一种进化算法_一种算法的少量更改可以减少种族主义的借贷
  • 无监督模型 训练过程_监督使用训练模型
  • 端到端车道线检测_弱监督对象检测-端到端培训管道
  • feynman1999_AI Feynman 2.0:从数据中学习回归方程
  • canny edge_Canny Edge检测器简介

ocr tesseract_OCR引擎之战— Tesseract与Google Vision相关推荐

  1. 基于OCR识别引擎的识别表格文字并将结果以Excel电子表格的形式原样导出的Android客户端代码

    基于OCR识别引擎的识别表格文字并将结果以Excel电子表格的形式原样导出的Android客户端代码 界面截图 实现思路 对表格图片进行灰度化和二值化处理 对图像进行倾斜矫正 进行表格线提取 进行表格 ...

  2. Google Vision API

    Google Cloud Console (https://cloud.google.com/vision/docs/setup?hl=zh_CN) 1.创建项目 2.启用结算功能 3.启用 API ...

  3. 【OCR识别验证码】--基于tesseract

    目录 1.环境准备(windows) 2.实现目的: 3.代码实现 5.二值化处理 6.评价 1.环境准备(windows) 打开cmd(命令符窗口)输入以下命令:  pip install pyte ...

  4. 手游引擎之战再现新挑战者,OGEngine来了

    GameRes报道 / 在刚刚召开的移动游戏大会上,Unity公司和触控科技分别发布了Unity的2D引擎及Cocos的3D引擎,一时激起了圈内的对原本不太受关注的游戏引擎的强烈兴趣. 本来,在游戏产 ...

  5. Tesseract-OCR引擎 入门

    OCR(Optical Character Recognition):光学字符识别,是指对图片文件中的文字进行分析识别,获取的过程. Tesseract:开源的OCR识别引擎,初期Tesseract引 ...

  6. 使用Python,OpenCV进行Tesseract-OCR绑定及识别

    使用Python,OpenCV进行Tesseract-OCR绑定及识别 1. 效果图 2. 安装Tesseract+Python"绑定"及识别 3. 源码 参考 上一篇博客介绍了W ...

  7. C++开源代码项目汇总

    Google的C++开源代码项目 v8  -  V8 JavaScript Engine V8 是 Google 的开源 JavaScript 引擎. V8 采用 C++ 编写,可在谷歌浏览器(来自 ...

  8. 一些C++的开源项目和C++库以及修炼C++的方法

     Google的C++开源代码项目 v8  -  V8 JavaScript Engine V8 是 Google 的开源 JavaScript 引擎. V8 采用 C++ 编写,可在谷歌浏览器( ...

  9. imagemagick图片识别技术数据抓取(转自:http://michael-roshen.iteye.com/blog/1982817)

    安装: sudo apt-get install imagemagick ImageMagick是一套功能强大.稳定而且开源的工具集和开发包,可以用来读.写和处理超过89种基本格式的图片文件,包括流行 ...

  10. Tesseract-OCR 安装、中文识别与训练字库

    简介 OCR(Optical Character Recognition):光学字符识别,是指电子设备(例如扫描仪或数码相机)检查纸上打印的字符,通过检测暗.亮的模式确定其形状,然后用字符识别方法将形 ...

最新文章

  1. LaTex排版技巧:[15]公式太长如何换行
  2. KVM虚拟化环境搭建
  3. 如果使用SD-WAN为客户提供高价值,应该部署哪些安全功能?
  4. C#-Home / 详解Asp.Net Sql数据库连接字符串
  5. C段 192.168.1.15/28与192.168.1.16/28的区别
  6. [转]谨以此文献给才毕业2--5年的朋友
  7. Win7下硬盘安装Ubuntu10.10双系统
  8. mysql not exists很慢_查询速度优化用not EXISTS 代替 not in
  9. 滴滴医护专车新增南京 共上线6城覆盖1.8万医护
  10. Boot.ini无解
  11. java清除缓存池_Java 缓存池(使用Map实现)
  12. Javascript代码编写的逻辑冗余
  13. 外包开发app系统软件价格表:价格一般多少呢
  14. Synchronized 可重入性粒度测试
  15. linux/debian安装wps以及缺失字体,亲测可用
  16. 图像处理/计算机视觉/python环境下如何用滤波器、算法恢复图片,对图片去污【附代码】
  17. 计算机毕业设计-基于微信小程序高校学生课堂扫码考勤签到系统-校园考勤打卡签到小程序
  18. [chatGPT]六问ChatGPT:当AI“成精”
  19. 孝心无价-毕淑敏[转]
  20. ElasticSearch查询实现全字段搜索

热门文章

  1. android_handler(三)
  2. 2016年第3本:启示录----打造用户喜爱的产品
  3. 指向API的函数指针定义方法
  4. c++内存中字节对齐问题详解 【转载】
  5. BT 与 Magnet 的下载方式及原理
  6. ugmented reality(AR) equipment
  7. Effective C++ Notebook
  8. Atitit 中间件之道 attilax著 1. 第1章 中间件产生背景及分布式计算环境 2 2. 中间件分类 2 2.1. 通讯,消息,数据存储中间件 3 3. 第3章 COM相关技术 3 4.
  9. atitit 2017年学业计划 v5 r818.xlsx
  10. Atitit 手机图片备份解决方案attilax总结