Extracting information from a digital copy of an invoice can be tricky. Various commercial tools can perform this task. However, for a number of reasons, many people prefer to solve this problem with open source libraries.

I came across a problem of this kind a few days ago and wanted to share the approach I used to solve it. The libraries I used were pdf2image (for converting PDF to images), OpenCV (for image pre-processing) and PyTesseract (for OCR), all driven from Python.

Converting PDF to Image

pdf2image is a Python library that converts a PDF into a sequence of PIL Image objects using the pdftoppm utility. It can be installed with pip using the following command.

pip install pdf2image


Note: pdf2image relies on Poppler, a PDF rendering library based on the xpdf-3.0 code base, and will not work without it. Refer to the Poppler download and installation instructions for your platform before continuing.

After installation, any PDF can be converted to images with pdf2image's convert_from_path function.
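A minimal sketch of this conversion, assuming Poppler is installed and on the PATH. The helper names `page_image_path` and `pdf_to_images`, the output naming scheme and the 300 DPI default are illustrative choices, not part of pdf2image.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir="."):
    """Build an output name such as invoice_page_1.jpg (naming scheme is an assumption)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page_{page_number}.jpg"


def pdf_to_images(pdf_path, out_dir=".", dpi=300):
    """Render every page of the PDF, save each as a JPEG and return the saved paths."""
    # Imported lazily so the naming helper works even without pdf2image/Poppler installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL Image objects
    saved = []
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "JPEG")
        saved.append(target)
    return saved
```

Calling `pdf_to_images("invoice.pdf", out_dir="pages")` would write one JPEG per page into an existing `pages` folder; the file name `invoice.pdf` is only a placeholder.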


After converting the PDF to images, the next step is to highlight the regions of the images from which we want to extract information.


Note: Before marking regions, make sure you have preprocessed the image to improve its quality (resolution of at least 300 DPI, skew corrected, sharpness and brightness adjusted, thresholding applied, etc.).
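As a rough sketch of part of that preprocessing, the snippet below upscales a low-resolution scan towards 300 DPI and binarizes it with Otsu thresholding. It does not cover skew, sharpness or brightness correction. The function names and the idea of passing the scan's current DPI as a parameter are assumptions made for illustration.

```python
def upscale_factor(current_dpi, target_dpi=300):
    """Return how much to enlarge a scan so its effective resolution reaches target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    # Never shrink: images already at or above the target are left at their size.
    return max(1.0, target_dpi / current_dpi)


def preprocess(image_path, current_dpi=300):
    """Grayscale, upscale if needed, and binarize an invoice image (illustrative only)."""
    import cv2  # imported lazily; requires opencv-python

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    factor = upscale_factor(current_dpi)
    if factor > 1.0:
        gray = cv2.resize(gray, None, fx=factor, fy=factor,
                          interpolation=cv2.INTER_CUBIC)

    # Otsu's method picks a global threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```

Which steps actually help depends on the scan, so it is worth comparing OCR output with and without each one.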


Marking Regions of Image for Information Extraction

In this step we mark the regions of the image from which we want to extract data. After marking those regions with rectangles, we crop them one by one from the original image before feeding them to the OCR engine.

At this point most of us would ask: why mark regions in an image before doing OCR instead of running OCR on the whole page directly?

The simple answer is that you can.

The catch is that documents sometimes contain hidden line breaks or page breaks. If such a document is passed directly to the OCR engine, related data gets split apart, because the engine recognizes those breaks.


With this approach we can get the most accurate results for a given document. In our case, we will use exactly this approach to extract information from an invoice.

The below code can be used for marking the regions of interest in the image and getting their respective co-ordinates.


# use this command to install OpenCV
# pip install opencv-python
# use this command to install PIL
# pip install Pillow
import cv2
from PIL import Image

def mark_region(image_path):
    im = cv2.imread(image_path)

    gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (9,9), 0)
    thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 11, 30)

    # Dilate to combine adjacent text contours
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9,9))
    dilate = cv2.dilate(thresh, kernel, iterations=4)

    # Find contours, highlight text areas, and extract ROIs
    # (findContours returns 2 or 3 values depending on the OpenCV version)
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]

    line_items_coordinates = []
    for c in cnts:
        area = cv2.contourArea(c)
        x, y, w, h = cv2.boundingRect(c)

        if y >= 600 and x <= 1000:
            if area > 10000:
                # cv2.rectangle draws on im in place
                cv2.rectangle(im, (x,y), (2200, y+h), color=(255,0,255), thickness=3)
                line_items_coordinates.append([(x,y), (2200, y+h)])

        if y >= 2400 and x <= 2000:
            cv2.rectangle(im, (x,y), (2200, y+h), color=(255,0,255), thickness=3)
            line_items_coordinates.append([(x,y), (2200, y+h)])

    # return im so the result is defined even when no region matched
    return im, line_items_coordinates
Original Image (Source: Abbyy OCR Tool Sample Invoice Image)
Regions of Interest marked in Image (Source: Abbyy OCR Tool Sample Invoice Image)
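The pixel thresholds in mark_region are specific to this sample invoice and its resolution. As a hedged illustration, the same filter can be written as a small predicate. The name `keep_region` and the labels "body" and "footer" are my own reading of the conditions, not the author's terms. Unlike the original loop, this version reports a region once even when both conditions hold.

```python
def keep_region(x, y, area, min_area=10000):
    """Return True if a contour's bounding box passes the filter used in mark_region.

    x, y are the top-left corner of the bounding box in pixels and area is the
    contour area. The numeric limits are copied from the article's code and
    would need retuning for another layout or resolution.
    """
    in_body = y >= 600 and x <= 1000 and area > min_area    # large blocks below the header
    in_footer = y >= 2400 and x <= 2000                     # anything near the page bottom
    return in_body or in_footer


# Example: a large block starting at (100, 700) is kept, a header block at y=100 is not.
print(keep_region(100, 700, 20000), keep_region(100, 100, 20000))
```

Pulling the conditions into one function makes it easier to adjust the limits when the invoice template changes.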

Applying OCR to the Image

Once we have marked the regions of interest (along with the respective coordinates) we can simply crop the original image for the particular region and pass it through pytesseract to get the results.


For those who are new to Python and OCR, pytesseract can sound intimidating. According to its official website:

Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.


Also, if you want to experiment with the configuration parameters of pytesseract, I would recommend reading the Tesseract documentation on its command-line options first.
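For orientation, the `config` argument of pytesseract is a string of Tesseract command-line options. The sketch below assembles such a string. The helper name `tesseract_config` is hypothetical, while `--psm`, `--oem` and `tessedit_char_whitelist` are standard Tesseract options.

```python
def tesseract_config(psm=6, oem=None, whitelist=None):
    """Assemble a Tesseract option string for pytesseract's config argument.

    psm: page segmentation mode (6 assumes a single uniform block of text).
    oem: OCR engine mode (3 lets Tesseract choose based on what is available).
    whitelist: optional string of characters the engine is allowed to output.
    """
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# A digits-only configuration, e.g. for a cropped invoice number field.
print(tesseract_config(psm=7, whitelist="0123456789"))
```

The resulting string can be passed as `config=tesseract_config(...)` to `pytesseract.image_to_string`. Whether a whitelist improves results depends on the Tesseract version and the field being read.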


The following code can be used to perform this task.


import cv2
import matplotlib.pyplot as plt
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Users\Akash.Chauhan1\AppData\Local\Tesseract-OCR\tesseract.exe'

# load the original image
image = cv2.imread('Original_Image.jpg')

# get co-ordinates to crop the image
# (line_items_coordinates is the list returned by mark_region)
c = line_items_coordinates[1]

# cropping image img = image[y0:y1, x0:x1]
img = image[c[0][1]:c[1][1], c[0][0]:c[1][0]]

plt.figure(figsize=(10,10))
plt.imshow(img)

# convert the image to black and white for better OCR
ret, thresh1 = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY)

# pytesseract image to string to get results
text = str(pytesseract.image_to_string(thresh1, config='--psm 6'))
Cropped Image-1 from Original Image (Source: Abbyy OCR Tool Sample Invoice Image)

Output from OCR:


Payment:
Mr. John Doe
Green Street 15, Office 4
1234 Vermut
New Caledonia
Cropped Image-2 from Original Image (Source: Abbyy OCR Tool Sample Invoice Image)

Output from OCR



As you can see, the OCR output for these cropped regions matches the text in the images exactly.
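The code above processes a single region. A possible way to repeat the crop-and-recognize step for every marked region is sketched below, assuming the image is a NumPy array as returned by cv2.imread and the regions use the [(x0, y0), (x1, y1)] layout produced by mark_region. The helper names `region_slices` and `ocr_regions` are hypothetical.

```python
def region_slices(region):
    """Convert [(x0, y0), (x1, y1)] into (row_slice, col_slice) for array indexing."""
    (x0, y0), (x1, y1) = region
    # Rows are indexed by y and columns by x, hence the swapped order.
    return slice(y0, y1), slice(x0, x1)


def ocr_regions(image, regions, psm=6):
    """Crop each region from an image array and return the recognized text per region."""
    import pytesseract  # imported lazily; needs pytesseract and a Tesseract install

    texts = []
    for region in regions:
        rows, cols = region_slices(region)
        crop = image[rows, cols]
        texts.append(pytesseract.image_to_string(crop, config=f"--psm {psm}"))
    return texts
```

With the names from the article, `ocr_regions(image, line_items_coordinates)` would return one string per rectangle. Thresholding each crop first, as in the author's code, may still improve the results.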


So this was all about how you can develop a solution for extracting data from a complex document such as an invoice.


OCR has many applications in document intelligence. Using pytesseract, one can extract almost all of the data from a document regardless of its format (whether it is a scanned document, a PDF or a simple JPEG image).

Also, since it is open source, the overall solution is flexible and inexpensive.



