
Introduction

This post has its roots in an interesting knowledge-extraction project. The first step was to extract the text from PDF documents. The company I work for is built on the Google platform, so naturally I wanted to use the OCR of the Vision API, but I couldn't find an easy way to use the API to extract text. Hence this post.

The notebook for this post is available on GitHub.


Google Vision API

Google released the API so that individuals, companies, and researchers can use its functionality.


Google Cloud's Vision API offers powerful pre-trained machine learning models through REST and RPC APIs. Label images and quickly classify them into millions of predefined categories. You can detect objects and faces, read printed or handwritten text, and build useful metadata into your image catalog. (source: API Vision)

The part of the API that interests us in this post is the OCR.


OCR

Optical Character Recognition, or OCR, is a technology that detects and recognizes characters inside an image. Typically, Convolutional Neural Networks (CNNs) are trained on a very large dataset of characters and digits in different fonts and colors. You can picture a small window sliding over each pixel or group of pixels to detect characters or partial characters, spaces, shapes, lines, etc.
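The sliding-window picture can be sketched in a few lines. This is only an illustration of the idea, not how the Vision API works internally: the tiny binary `image`, the window size, and the helper name `sliding_windows` are all made up for the example, and a simple "is the window full of ink" test stands in for a trained classifier.

```python
def sliding_windows(image, size, stride=1):
    """Yield (row, col, window) for every size x size window of a 2D list."""
    n_rows, n_cols = len(image), len(image[0])
    for r in range(0, n_rows - size + 1, stride):
        for c in range(0, n_cols - size + 1, stride):
            window = [row[c:c + size] for row in image[r:r + size]]
            yield r, c, window


# A hypothetical 4x5 binary "image": 1 is ink, 0 is background.
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

# Keep the positions whose 2x2 window is completely filled with ink.
# A real OCR model would pass each window to a trained classifier instead.
ink_positions = [
    (r, c) for r, c, window in sliding_windows(image, size=2)
    if all(all(pixel == 1 for pixel in row) for row in window)
]
print(ink_positions)  # [(1, 1)]
```

Only the window whose top-left corner is at row 1, column 1 is entirely ink, so that is the single position reported.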

Service Account

A service account is a special type of Google account intended to represent a non-human user that needs to authenticate and be authorized to access data in Google APIs. (source: IAM google cloud)


In practice, you can think of it as an RSA key (a cryptographic key used for secure communication between machines over the internet) with which you can connect to Google services (API, GCS, IAM…). It usually takes the form of a JSON file.
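To make the JSON file less abstract, here is a minimal sketch that parses such a file and checks a few of the fields a service-account key normally contains. The helper name `summarize_service_account` and the sample values are hypothetical, and the key shown is a placeholder, not a usable credential.

```python
import json

# Some of the fields a service-account key file normally contains.
EXPECTED_FIELDS = ("type", "project_id", "private_key", "client_email")


def summarize_service_account(raw_json):
    """Check the expected fields and return a summary without the secret."""
    info = json.loads(raw_json)
    missing = [field for field in EXPECTED_FIELDS if field not in info]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if info["type"] != "service_account":
        raise ValueError("not a service-account key file")
    # Never print or log the private key itself.
    return {"project_id": info["project_id"], "client_email": info["client_email"]}


# Made-up content, for illustration only (not a usable key).
fake_key_file = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "ocr-bot@my-project.iam.gserviceaccount.com",
})
summary = summarize_service_account(fake_key_file)
print(summary)
```

A quick check like this can catch the common mistake of pointing the code at the wrong JSON file before any API call is made.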

Notebook

Here, I will show you the functions needed to call the API and automatically extract the text from an image.


Libraries to install:


    !pip install google-cloud
    !pip install google-cloud-storage
    !pip install google-cloud-pubsub
    !pip install google-cloud-vision
    !pip install pdf2image
    !pip install google-api-python-client
    !pip install google-auth

The libraries used:


    from pdf2image import convert_from_bytes
    import glob
    from tqdm import tqdm
    import base64
    import json
    import os
    from io import BytesIO
    import numpy as np
    import io
    from PIL import Image
    from google.cloud import pubsub_v1
    from google.cloud import vision
    from google.oauth2 import service_account
    import googleapiclient.discovery

    # to see a progress bar
    tqdm().pandas()

The API's OCR accepts the PDF, TIFF, and JPEG formats. In this post we will convert the PDF into JPEG so that we can concatenate several pages into one picture. There are two ways of using the JPEGs:

First, you could convert your PDFs into JPEG files and save them in another folder:


    from pdf2image import convert_from_path

    # Folders where the pdf files are and where you want to save the results
    NAME_INPUT_FOLDER = "PDF FOLDER NAME"
    NAME_OUTPUT_FOLDER = "RESULT TEXTS FOLDER"

    list_pdf = glob.glob(NAME_INPUT_FOLDER + "/*.pdf")  # store the names of the pdf files

    # Loop over all the files
    for i in list_pdf:
        # convert the pdf into jpeg
        pages = convert_from_path(i, 500)
        for page in tqdm(enumerate(pages)):
            # save each page as jpeg,
            # keeping the name of the document and adding an increment
            page[1].save(NAME_OUTPUT_FOLDER + "/" + i.split('/')[-1].split('.')[0] + '_' + str(page[0]) + '.jpg', 'JPEG')

At this point, you can send your JPEG files to the API. But you can do better: skip saving the JPEG files and keep the images in memory to call the API directly.

Setup Credentials

Before going deeper, we need to configure the credentials of the Vision API. You'll see, it's very simple:

    SCOPES = ['']
    SERVICE_ACCOUNT_FILE = "PUT the PATH of YOUR SERVICE ACCOUNT JSON FILE HERE"

    # Configure the google credentials
    credentials = service_account.Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE, scopes=SCOPES)

Picture manipulations

This part needs more code because we also concatenate 10 pages of a document to create one “big picture” and feed it to the API. One call instead of 10 is cheaper, because you pay for every request you send to the API.
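The batching arithmetic can be sketched on its own, before any image handling. The helper name `page_batches` is hypothetical; it only computes which page indices go into each request, under the pricing assumption made in this post (one charge per request).

```python
def page_batches(nb_pages, step=10):
    """Return (start, stop) index pairs covering nb_pages in groups of step."""
    return [(start, min(start + step, nb_pages)) for start in range(0, nb_pages, step)]


batches = page_batches(23)
print(batches)       # [(0, 10), (10, 20), (20, 23)]
print(len(batches))  # 3 requests instead of 23
```

For a 23-page document, stacking 10 pages per picture means 3 requests, where a page-by-page approach would send 23.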

Let’s go:


    def pil_grid(images, max_horiz=np.iinfo(int).max):
        '''Function to paste the images into one grid image kept in memory'''
        n_images = len(images)
        n_horiz = min(n_images, max_horiz)
        # number of rows, rounded up so that a partial last row still fits
        n_vert = -(-n_images // n_horiz)
        h_sizes, v_sizes = [0] * n_horiz, [0] * n_vert
        for i, im in enumerate(images):
            h, v = i % n_horiz, i // n_horiz
            h_sizes[h] = max(h_sizes[h], im.size[0])
            v_sizes[v] = max(v_sizes[v], im.size[1])
        h_sizes, v_sizes = np.cumsum([0] + h_sizes), np.cumsum([0] + v_sizes)
        im_grid = Image.new('RGB', (int(h_sizes[-1]), int(v_sizes[-1])), color='white')
        for i, im in enumerate(images):
            im_grid.paste(im, (int(h_sizes[i % n_horiz]), int(v_sizes[i // n_horiz])))
        return im_grid


    def concat_file_ocr(path, cred=credentials):
        '''Function to concat 10 pages of the document and feed them to the OCR
        @param path: (str) path of the pdf
        @param cred: google credentials (service account)
        '''
        imgs = convert_from_bytes(open(path, 'rb').read(), fmt="jpeg")
        nb_pages = len(imgs)
        nb_remaining_pages = nb_pages
        ocr_step = 10
        current_ocr_page_nb = 0
        text = []
        while nb_remaining_pages > 0:
            if nb_remaining_pages > ocr_step:
                ocr_range = range(current_ocr_page_nb, ocr_step + current_ocr_page_nb)
                nb_remaining_pages -= ocr_step
                current_ocr_page_nb += ocr_step
            else:
                ocr_range = range(current_ocr_page_nb, current_ocr_page_nb + nb_remaining_pages)
                nb_remaining_pages = 0
            # call ocr with range
            im_grid = pil_grid(imgs[ocr_range.start:ocr_range.stop], 1)
            # write the stacked picture into an in-memory buffer instead of a file
            temp = BytesIO()
            im_grid.save(temp, format='jpeg')
            text.append(detect_text_document(temp.getvalue(), cred))
        np.savetxt(NAME_OUTPUT_FOLDER + "/" + path.split('/')[-1].split('.')[0] + '.txt', text, fmt="%s")

With these two functions, you can load a PDF file, convert it into bytes, create a “big picture”, and feed it to the function detect_text_document() (details below).
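To see where each page ends up inside the “big picture”, here is a small sketch of the layout arithmetic alone, without PIL or NumPy. The helper name `grid_layout` and the page sizes are hypothetical; the logic mirrors the cumulative-sum idea used by pil_grid, with the row count rounded up.

```python
from itertools import accumulate


def grid_layout(sizes, max_horiz):
    """Compute paste offsets and canvas size for images laid out in a grid.

    sizes is a list of (width, height) pairs, filled row by row.
    """
    n_images = len(sizes)
    n_horiz = min(n_images, max_horiz)
    n_vert = -(-n_images // n_horiz)  # ceiling division
    col_widths = [0] * n_horiz
    row_heights = [0] * n_vert
    for i, (width, height) in enumerate(sizes):
        col_widths[i % n_horiz] = max(col_widths[i % n_horiz], width)
        row_heights[i // n_horiz] = max(row_heights[i // n_horiz], height)
    # Cumulative sums give the left edge of each column and the top edge of each row.
    x_edges = [0] + list(accumulate(col_widths))
    y_edges = [0] + list(accumulate(row_heights))
    offsets = [(x_edges[i % n_horiz], y_edges[i // n_horiz]) for i in range(n_images)]
    return offsets, (x_edges[-1], y_edges[-1])


# Three hypothetical pages stacked in one column, as pil_grid(..., 1) does.
offsets, canvas = grid_layout([(100, 200), (120, 150), (100, 200)], max_horiz=1)
print(offsets)  # [(0, 0), (0, 200), (0, 350)]
print(canvas)   # (120, 550)
```

With a single column, the canvas is as wide as the widest page and as tall as the sum of the page heights, which is why the pages of a document end up stacked vertically.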

The function detect_text_document takes as input the content of the picture and the credentials (the information from your service account).


    def detect_text_document(content, credentials):
        """Function to call the Vision API and return the text detected inside the image
        @param content: (bytes) image in bytes
        @param credentials: credentials of the service account to call the API
        @return: the text detected inside the picture
        """
        # vision.types and vision.enums exist in google-cloud-vision releases before 2.0
        client = vision.ImageAnnotatorClient(credentials=credentials)
        # with io.open(path, 'rb') as image_file:
        #     content = image_file.read()
        # load the image in bytes
        image = vision.types.Image(content=content)
        # call the OCR and keep text annotation
        response = client.text_detection(image=image)
        # The actual response for the first page of the input file.
        breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
        paragraphs = []
        lines = []
        # extract text by block of detection
        for page in response.full_text_annotation.pages:
            for block in page.blocks:
                for paragraph in block.paragraphs:
                    para = ""
                    line = ""
                    for word in paragraph.words:
                        for symbol in word.symbols:
                            line += symbol.text
                            if symbol.property.detected_break.type == breaks.SPACE:
                                line += ' '
                            if symbol.property.detected_break.type == breaks.EOL_SURE_SPACE:
                                line += ' '
                                lines.append(line)
                                para += line
                                line = ''
                            if symbol.property.detected_break.type == breaks.LINE_BREAK:
                                lines.append(line)
                                para += line
                                line = ''
                    paragraphs.append(para)
        return "\n".join(paragraphs)

The output is the text extracted from the image. The goal of this function is to assemble the detected symbols into words, paragraphs, and finally the whole document.
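The way break types drive the reassembly can be tried without calling the API. In this sketch, plain `(character, break_after)` tuples stand in for the symbols returned by the OCR, and the string constants stand in for the API's break types; the helper name `join_symbols` is hypothetical.

```python
SPACE, EOL_SURE_SPACE, LINE_BREAK = "SPACE", "EOL_SURE_SPACE", "LINE_BREAK"


def join_symbols(symbols):
    """Rebuild one paragraph from (character, break_after) pairs."""
    paragraph, line = "", ""
    for char, break_after in symbols:
        line += char
        if break_after == SPACE:
            line += " "
        elif break_after == EOL_SURE_SPACE:
            # End of a wrapped line: add a space, then flush the line.
            line += " "
            paragraph += line
            line = ""
        elif break_after == LINE_BREAK:
            # Last line of the paragraph: flush without a trailing space.
            paragraph += line
            line = ""
    return paragraph + line


symbols = [
    ("H", None), ("i", SPACE),
    ("a", None), ("l", None), ("l", EOL_SURE_SPACE),
    ("o", None), ("k", LINE_BREAK),
]
print(join_symbols(symbols))  # Hi all ok
```

Wrapped lines are glued back together with a space, so a paragraph that spans several lines in the picture comes out as one line of text.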

How to use it?

You can use this block of functions like this:


    for doc_pdf in tqdm(list_pdf):
        # call the function which converts into jpeg, stacks 10 images,
        # calls the API and saves the output into a txt file
        concat_file_ocr(doc_pdf)

The input is just the path obtained with the glob function. The credentials were created in the Setup Credentials part. This loop takes each PDF in the input folder, calls the API with the JPEG obtained by converting the PDF, and saves a text file containing the detected text.
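The output file name is derived from the PDF path. The expression used in this post, `path.split('/')[-1].split('.')[0]`, keeps only what comes before the first dot, so a file named `report.v2.pdf` would be saved as `report.txt`. A possible alternative, sketched here with the standard `os.path` module and a hypothetical helper name `output_txt_path`, strips only the final extension:

```python
import os


def output_txt_path(pdf_path, output_folder):
    """Build '<output_folder>/<pdf name without its last extension>.txt'."""
    stem, _extension = os.path.splitext(os.path.basename(pdf_path))
    return os.path.join(output_folder, stem + ".txt")


# Hypothetical folder and file names, for illustration only.
result = output_txt_path(os.path.join("pdfs", "report.v2.pdf"), "texts")
print(result)
```

Using `os.path` also avoids hard-coding `/` as the path separator, which matters if the notebook is run on Windows.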

Conclusion

You have reached the end of this tutorial on how to use the Vision API and automatically generate text files containing the detected text. You now know how to configure credentials with your service account and how to convert a PDF into JPEG files (one JPEG per page). Is that all? No, I have some bonuses for you (see below).

Bonus 1: Use the API per page

The previous functions call the API on a concatenation of pages. But we can also call the API once per page of the PDF document. The function below sends a request to the API for each page of the PDF, converted to JPEG.

    def call_ocr_save_txt(path, cred=credentials):
        '''Function to feed the OCR with each page of the pdf converted into jpeg
        @param path: (str) path of the pdf
        @param cred: google credentials
        '''
        pages = convert_from_bytes(open(path, 'rb').read(), fmt="jpeg")
        text = []
        # run on each page of the pdf
        for page in pages:
            # cast the jpeg into bytes
            temp = BytesIO()
            page.save(temp, format='jpeg')
            # save the result of the OCR inside the variable text
            text.append(detect_text_document(temp.getvalue(), cred))
        # save the result into a txt file named after the pdf
        np.savetxt(NAME_OUTPUT_FOLDER + "/" + path.split('/')[-1].split('.')[0] + '.txt', text, fmt="%s")

It's very easy to use: just call this function with the path of a PDF and the credentials, like this:

    if per_page:  # set to True if you want to call the API per page
        # call the Vision API for each page of the pdf
        for i in tqdm(list_pdf):
            # open the pdf, convert it into PIL images in jpeg format and run the OCR
            call_ocr_save_txt(i, cred=credentials)

Bonus 2: Use the multiprocessing library

Just for fun, you can call the API from several worker processes with the multiprocessing library (the Global Interpreter Lock (GIL) restricts Python threads, which is why the code uses a pool of processes). Here is the code:

    import multiprocessing as mp

    if multi_proc:
        nb_threads = mp.cpu_count()  # return the number of CPUs
        print(f"The number of available CPU is {nb_threads}")
        # if you want to use the API without stacking the pages
        if per_page:
            # create a pool with the specified number of worker processes
            pool = mp.Pool(processes=nb_threads)
            # map the function over the list, one part for each worker
            result = pool.map(call_ocr_save_txt, list_pdf)
        if per_document:
            pool = mp.Pool(processes=nb_threads)
            result = pool.map(concat_file_ocr, list_pdf)
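Since each OCR request mostly waits for the network, a thread pool is a lighter alternative that may be worth trying. This is a swapped-in technique, not the code of this post: it uses the standard `concurrent.futures` module, and `fake_ocr` is a hypothetical stand-in that sleeps instead of calling the API.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_ocr(path):
    """Stand-in for an OCR call: wait briefly, then return a result."""
    time.sleep(0.05)  # simulates waiting for the network
    return f"text of {path}"


paths = [f"doc_{n}.pdf" for n in range(8)]

# Threads are enough for I/O-bound work: the GIL is released while waiting.
with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map returns the results in the same order as the inputs
    results = list(executor.map(fake_ocr, paths))

print(results[0])  # text of doc_0.pdf
```

In a real run, `fake_ocr` would be replaced by a function such as call_ocr_save_txt, and `max_workers` would be chosen with the API quota in mind.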



