QUANTRIUM GUIDES

Today, extracting information from scanned documents such as letters, write-ups, and invoices has become an integral part of many business processes. To accomplish this task, you need to set up OCR software that extracts the information from these scanned documents or PDFs.

Here we will take you through the process of building and installing Tesseract 4.x on your Ubuntu 18.04 machine. There are two ways to install Tesseract 4.x:

One is to install the Tesseract 4.0.0 beta version, which is easy to install and takes only a couple of commands.

Alternatively, you can install Tesseract 4.1.1, the latest stable release of Tesseract at the time of writing. In this post, we will show you how to install each of them on your Ubuntu 18.04 machine.

If you are not familiar with build tools and building from GitHub repositories, then installing Tesseract 4.0.0 beta is the better way for you. However, if you are experienced in building and installing applications from GitHub repositories, you can skip the next section and jump directly to the section on installing Tesseract 4.1.1.

Installing Tesseract 4.0.0 beta

Installing the Tesseract 4.0.0 beta version is quite simple and can be done using the following apt commands:

$ sudo apt install tesseract-ocr
$ sudo apt install libtesseract-dev

Once you have run these two commands, check whether you have successfully installed tesseract by running the following command:

$ tesseract --version

After running this command, you should see something like this:

tesseract 4.0.0-beta.1
 leptonica-1.75.3

You should see this or something along those lines if your installation was successful. If it is not installed properly, you will get some errors, which means you have to check your operating system and its version. These commands work only on Ubuntu 18.04 or higher.
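If you want to run this check from a script rather than read the output by eye, the version line can be parsed. The following is only a minimal sketch: it assumes the first line of the output has the form "tesseract <version>" as shown above, and the helper name parse_major is hypothetical.

```shell
# parse_major (hypothetical helper): read `tesseract --version` output on stdin
# and print the major version number found on the first line.
parse_major() {
  sed -n '1s/^tesseract \([0-9][0-9]*\)\..*/\1/p'
}

if command -v tesseract >/dev/null 2>&1; then
  # 2>&1 in case the build prints the version banner on stderr.
  major=$(tesseract --version 2>&1 | parse_major)
  echo "Found Tesseract major version: ${major:-unknown}"
else
  echo "tesseract is not on PATH; re-check the apt install step"
fi
```

A major version of 4 is what this guide expects for both installation methods.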

Once your tesseract installation is successful, you can run the following command to check which languages are supported by your installed version of tesseract:

$ tesseract --list-langs

You can expect the following output:

List of available languages (2):
eng
osd

Here eng means it can recognize the English language, and osd means it can detect orientation and script.

Congratulations! You have successfully installed Tesseract 4.0.0 beta on your system and it is ready to use.

Installing Tesseract 4.1.1 on Ubuntu 18.04

In this section, we take you through the steps to build and install tesseract 4.1.1 from Tesseract's GitHub repository, https://github.com/tesseract-ocr/tesseract.

Before you start building tesseract 4.1.1 from source, you need to install a few dependencies. First, you have to install the leptonica library. It is a pedagogically-oriented open source library containing software that is broadly useful for image processing and image analysis applications. To know more about leptonica, refer to Leptonica's website:

http://www.leptonica.org/

To install leptonica, use the following command:

$ sudo apt-get install -y libleptonica-dev

Tesseract's documentation gives a further list of all the dependencies that tesseract requires.

From this list, you will most likely be missing the following dependencies:

automake
pkg-config
pango-devel
cairo-devel
icu-devel

Your Ubuntu system comes with gcc, which offers C++11 support, so that requirement is already met. You can use the following commands to install the above dependencies:

$ sudo apt-get update -y
$ sudo apt-get install automake
$ sudo apt-get install -y pkg-config
$ sudo apt-get install -y libsdl-pango-dev
$ sudo apt-get install -y libicu-dev
$ sudo apt-get install -y libcairo2-dev
$ sudo apt-get install bc

The last package, bc, is an extra dependency that is required to get tesseract 4 running on your machine.
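Before moving on to the build, it can help to confirm that the command-line tools among these dependencies are actually on your PATH. This is a rough sketch only: the helper name missing_tools is hypothetical, and it checks executables, not the -dev libraries, which have no command of their own.

```shell
# missing_tools (hypothetical helper): print each named command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Command-line tools mentioned in the dependency steps of this guide.
missing=$(missing_tools gcc automake pkg-config bc)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "All build tools found"
fi
```

If anything is reported as missing, rerun the matching apt-get install command before starting the build.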

Now you have to clone the tesseract repository. Hey, but stop right there! First, go to the tesseract repository on GitHub (https://github.com/tesseract-ocr/tesseract).

Then open the file named VERSION. You will see 5.0.0-alpha written there, which means that the tesseract version installed by using the makefile in this repository will be 5.0.0-alpha. This is not the stable release of tesseract; the stable release is 4.1.1 at the time of writing this post.

To find the link for downloading the latest stable release of tesseract, look in the right sidebar of the repository page. There you will find a section titled "Releases", and within it you will see the 4.1.1 Release.

[Figure: Tesseract GitHub repository]

Click on the 4.1.1 Release link. There you will find an Assets section with Source code (zip) and Source code (tar.gz). Copy the link of either one and download it using the following command:

$ wget https://github.com/tesseract-ocr/tesseract/archive/4.1.1.zip

You can download either the zip or the tar.gz file. Here I have downloaded the zip file. You can unzip the file to your current directory using the unzip command:

$ unzip 4.1.1.zip

Once the unzip operation completes, a folder titled tesseract-4.1.1 will have been created. Get into this directory using the cd command.

$ cd tesseract-4.1.1

If you list the files in this folder, you should see something like this:

abseil          CONTRIBUTING.md     java         tessdata
appveyor.yml    cppan.yml           LICENSE      tesseract.pc.cmake
AUTHORS         doc                 m4           tesseract.pc.in
autogen.sh      docker-compose.yml  Makefile.am  test
ChangeLog       Dockerfile          README.md    unittest
cmake           googletest          snap         VERSION
CMakeLists.txt  INSTALL             src
configure.ac    INSTALL.GIT.md      sw.cpp

Now you are ready to install tesseract. The different ways to do so on various operating systems are described at this link: https://github.com/tesseract-ocr/tesseract/blob/master/INSTALL.GIT.md

We are going to use the autotools method (LINUX/UNIX, msys…) to do so.

You need to run the following commands from the tesseract-4.1.1 directory to install tesseract:

$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
$ make training
$ sudo make training-install

To check that tesseract has been installed successfully, run the following command:

$ tesseract --version

You should see output something like this:

tesseract 4.1.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

If the output is not the same as the above or you get some error, go back and check where you went wrong, or follow the steps again one by one.

The Folder tessdata

The tessdata folder in the tesseract directory is where tesseract looks for the language data that it needs to perform OCR on the input document.

For tesseract to work, you need at least one language. For the English language you need a data file titled 'eng.traineddata'. You will also need another file titled 'osd.traineddata', which is used for orientation detection and is also required in the tessdata folder.

Unfortunately, these files are not installed in this folder by default when we run the make command. You need to download them separately into this folder. You can check the content of the tessdata folder by using the ls command:

$ cd tessdata
$ ls

You will see output similar to the following:

configs            eng.user-words  Makefile.am  pdf.ttf
eng.user-patterns  Makefile        Makefile.in  tessconfigs

As you can see, both eng.traineddata and osd.traineddata are missing. Now download eng.traineddata and osd.traineddata from the tessdata repository (https://github.com/tesseract-ocr/tessdata).

You can download them to your local system and then upload them to the tessdata folder, or you can download them directly using the wget command. Use the raw file URLs, because the blob page URLs return an HTML page rather than the data file:

$ wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata

Once you have successfully downloaded these files, you need to set your TESSDATA_PREFIX environment variable to the location of your tessdata directory. Use the export command to set the variable, replacing the path below with the location of the tessdata folder on your machine:

$ export TESSDATA_PREFIX=/content/tesseract-4.1.1/tessdata
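A quick sanity check of the tessdata folder can save some head-scratching, because a download that fetched a web page instead of the model file still produces a file with the right name. The following is a minimal sketch; the helper name check_tessdata is hypothetical, and the HTML test is a heuristic, not something tesseract itself provides.

```shell
# check_tessdata (hypothetical helper): verify that eng and osd model files
# exist in a tessdata directory and do not look like saved HTML pages.
#   $1 = tessdata directory (defaults to $TESSDATA_PREFIX, then the current directory)
check_tessdata() {
  dir=${1:-${TESSDATA_PREFIX:-.}}
  status=0
  for lang in eng osd; do
    file="$dir/$lang.traineddata"
    if [ ! -s "$file" ]; then
      echo "missing or empty: $file"
      status=1
    elif head -c 512 "$file" | LC_ALL=C grep -qi '<html'; then
      echo "looks like an HTML page, not model data: $file"
      status=1
    else
      echo "ok: $file"
    fi
  done
  return $status
}

# Usage: check_tessdata "$TESSDATA_PREFIX"
```

The function returns a non-zero status if either file is missing or suspicious, so it can also be used as a guard in a setup script.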

Now you can list the languages in your tesseract using the following command:

$ tesseract --list-langs

You should see the following output:

List of available languages (2):
eng
osd

If you want to use other languages, you can download them to the tessdata folder and start using them.

Using Tesseract from Terminal

Tesseract has various wrappers, for example a Python wrapper named pytesseract. These wrappers help you access tesseract from various programming languages. Here, we will be using tesseract through the command line.

To perform OCR on an image, run the following command in the terminal with the path of the image file on which you want to perform OCR:

$ tesseract <path_of_image> stdout

In the above command, path_of_image is the location of the image that you want to test tesseract with, and stdout tells tesseract to print the recognized text to the terminal instead of writing it to a file. Once you run it, you should get the recognized text right in the command line.

In my case the output was pardit, which was the text present in my image. So I was able to successfully use tesseract to extract text out of my image file.
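If you call tesseract this way often, a small wrapper that checks the path first gives a clearer error than tesseract's own when the file name is mistyped. This is a sketch under stated assumptions: the function name ocr_image is hypothetical, and it relies on tesseract's -l option to choose the language, defaulting to eng.

```shell
# ocr_image (hypothetical helper): OCR one image and print the text to the terminal.
#   $1 = path of the image, $2 = language code (defaults to eng)
ocr_image() {
  image=$1
  lang=${2:-eng}
  if [ ! -f "$image" ]; then
    echo "no such image: $image" >&2
    return 1
  fi
  # -l selects which <lang>.traineddata file tesseract loads.
  tesseract "$image" stdout -l "$lang"
}
```

For example, ocr_image /path/to/scan.png prints the English OCR result, where the path is a placeholder for your own image.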

Saving Tesseract Output to a File

If you want to save the output of tesseract to a text file, you can use the following command:

$ tesseract <path_of_image> output

Here, output is the base name of the output file. Tesseract appends the .txt extension itself, so the text will be stored in output.txt in your present working directory.

Running Tesseract on Multiple Files

Sometimes we want to extract text out of multiple images or documents. To accomplish this, you can give Tesseract a text file as input that contains the absolute paths of all the images you want to perform OCR on, one path per line.

For example, let's say you have two photos called handwritten_photo_1.png and handwritten_photo_2.png, with some text in them, in the /usr/share/ directory. Let's create a file named input.txt with the following content:

/usr/share/handwritten_photo_1.png
/usr/share/handwritten_photo_2.png
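Typing such a list by hand gets tedious beyond a handful of images. As a rough sketch, the list can be generated with find; the helper name make_image_list is hypothetical, and it assumes all the images are .png files sitting directly in one directory.

```shell
# make_image_list (hypothetical helper): write the absolute paths of all .png
# files directly inside a directory to a list file, one path per line, sorted.
#   $1 = directory containing the images, $2 = list file to create
make_image_list() {
  dir=$(cd "$1" && pwd) || return 1
  find "$dir" -maxdepth 1 -type f -name '*.png' | sort > "$2"
}
```

Running make_image_list /usr/share input.txt would produce a list like the one above, along with any other .png files found in that directory.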

Suppose you want to store the contents of these two handwritten photos in a text file, say output.txt. Because tesseract appends the .txt extension to the output name itself, you have to run the following command:

$ tesseract input.txt output

output.txt will have the OCR contents of both handwritten_photo_1.png and handwritten_photo_2.png, in that order. When you open and view the content of output.txt, you will see that the extracted lines are preceded by some symbol like this:

[Figure: Tesseract output of an input text file with 5 lines of image locations]

So in this case, Viral Calic is the prediction for the first image, CY am the king of the world is the prediction for the second image, Com and Serr is the prediction for the third image, and so on.
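The symbol between the predictions is most likely the form feed character, which Tesseract 4 writes after each page by default (its page_separator setting). That is an assumption here, since the screenshot is not reproduced. Under that assumption, the following sketch splits the combined file back into one labelled chunk per image; the helper name split_pages is hypothetical.

```shell
# split_pages (hypothetical helper): print each form-feed-separated chunk of a
# tesseract text output file under a numbered header.
#   $1 = text file produced by tesseract
split_pages() {
  awk 'BEGIN { RS = "\f" }
       NF {
         sub(/^\n+/, ""); sub(/\n+$/, "")   # trim blank lines around the text
         n++
         printf "[image %d]\n%s\n", n, $0
       }' "$1"
}
```

Calling split_pages output.txt then prints the text of each image under its own header, in the same order as the paths in input.txt, skipping chunks that contain no text.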

You can explore the usage of tesseract further at the following two links:

I hope you were able to follow the guide and to install and use Tesseract on your Ubuntu 18.04 machine.

Original article: https://medium.com/quantrium-tech/installing-tesseract-4-on-ubuntu-18-04-b6fcd0cbd78f

