Tesseract是一个开源的OCR(Optical Character Recognition,光学字符识别)引擎,可以识别多种格式的图像文件并将其转换成文本,目前已支持60多种语言(包括中文)。 Tesseract最初由HP公司开发,后来由Google接盘填坑。

git网址https://github.com/tesseract-ocr/tesseract

其源码调用栈:

main[api/tesseractmain.cpp] ->
TessBaseAPI::ProcessPages[api/baseapi.cpp] ->
TessBaseAPI::ProcessPage[api/baseapi.cpp] ->
TessBaseAPI::Recognize[api/baseapi.cpp] ->
TessBaseAPI::FindLines[api/baseapi.cpp] ->
TessBaseAPI::Threshold[api/baseapi.cpp] ->
ImageThresholder::ThresholdToPix[ccmain/thresholder.cpp] ->
ImageThresholder::OtsuThresholdRectToPix [ccmain/thresholder.cpp]
/*** Recognizes all the pages in the named file, as a multi-page tiff or* list of filenames, or single image, and gets the appropriate kind of text* according to parameters: tessedit_create_boxfile,* tessedit_make_boxes_from_boxes, tessedit_write_unlv, tessedit_create_hocr.* Calls ProcessPage on each page in the input file, which may be a* multi-page tiff, single-page other file format, or a plain text list of* images to read. If tessedit_page_number is non-negative, processing begins* at that page of a multi-page tiff file, or filelist.* The text is returned in text_out. Returns false on error.* If non-zero timeout_millisec terminates processing after the timeout on* a single page.* If non-NULL and non-empty, and some page fails for some reason,* the page is reprocessed with the retry_config config file. Useful* for interactively debugging a bad page.*/
bool TessBaseAPI::ProcessPages(const char* filename,const char* retry_config, int timeout_millisec,STRING* text_out) {int page = tesseract_->tessedit_page_number;if (page < 0)page = 0;FILE* fp = fopen(filename, "rb");if (fp == NULL) {tprintf(_("Image file %s cannot be opened!\n"), filename);return false;}// Find the number of pages if a tiff file, or zero otherwise.int npages = CountTiffPages(fp);fclose(fp);if (tesseract_->tessedit_create_hocr) {*text_out ="<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n""<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n""    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n""<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" ""lang=\"en\">\n <head>\n  <title></title>\n""  <meta http-equiv=\"Content-Type\" content=\"text/html; ""charset=utf-8\" />\n""  <meta name='ocr-system' content='tesseract " VERSION "' />\n""  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par"" ocr_line ocrx_word'/>\n"" </head>\n <body>\n";} else {*text_out = "";}bool success = true;Pix *pix;if (npages > 0) {for (; page < npages && (pix = pixReadTiff(filename, page)) != NULL;++page) {if ((page >= 0) && (npages > 1))tprintf(_("Page %d of %d\n"), page + 1, npages);char page_str[kMaxIntSize];snprintf(page_str, kMaxIntSize - 1, "%d", page);SetVariable("applybox_page", page_str);success &= ProcessPage(pix, page, filename, retry_config,timeout_millisec, text_out);pixDestroy(&pix);if (tesseract_->tessedit_page_number >= 0 || npages == 1) {break;}}} else {// The file is not a tiff file, so use the general pixRead function.pix = pixRead(filename);if (pix != NULL) {success &= ProcessPage(pix, 0, filename, retry_config,timeout_millisec, text_out);pixDestroy(&pix);} else {// The file is not an image file, so try it as a list of filenames.FILE* fimg = fopen(filename, "rb");if (fimg == NULL) {tprintf(_("File %s cannot be opened!\n"), filename);return false;}tprintf(_("Reading %s as a list of filenames...\n"), filename);char pagename[MAX_PATH];// Skip to the requested page number.for (int i = 0; i < page &&fgets(pagename, sizeof(pagename), fimg) != NULL;++i);while (fgets(pagename, sizeof(pagename), fimg) != NULL) {chomp_string(pagename);pix = pixRead(pagename);if (pix == NULL) {tprintf(_("Image file %s cannot be read!\n"), pagename);fclose(fimg);return false;}tprintf(_("Page %d : %s\n"), page, pagename);success &= ProcessPage(pix, page, pagename, retry_config,timeout_millisec, text_out);pixDestroy(&pix);++page;}fclose(fimg);}}if (tesseract_->tessedit_create_hocr)*text_out += " </body>\n</html>\n";return success;
}
/*** Recognizes a single page for ProcessPages, appending the text to text_out.* The pix is the image processed - filename and page_index are metadata* used by side-effect processes, such as reading a box file or formatting* as hOCR.* If non-zero timeout_millisec terminates processing after the timeout.* If non-NULL and non-empty, and some page fails for some reason,* the page is reprocessed with the retry_config config file. Useful* for interactively debugging a bad page.* The text is returned in text_out. Returns false on error.*/
bool TessBaseAPI::ProcessPage(Pix* pix, int page_index, const char* filename,const char* retry_config, int timeout_millisec,STRING* text_out) {SetInputName(filename);SetImage(pix);bool failed = false;if (timeout_millisec > 0) {// Running with a timeout.ETEXT_DESC monitor;monitor.cancel = NULL;monitor.cancel_this = NULL;monitor.set_deadline_msecs(timeout_millisec);// Now run the main recognition.failed = Recognize(&monitor) < 0;} else if (tesseract_->tessedit_pageseg_mode == PSM_OSD_ONLY ||tesseract_->tessedit_pageseg_mode == PSM_AUTO_ONLY) {// Disabled character recognition.PageIterator* it = AnalyseLayout();if (it == NULL) {failed = true;} else {delete it;return true;}} else {// Normal layout and character recognition with no timeout.failed = Recognize(NULL) < 0;}if (tesseract_->tessedit_write_images) {Pix* page_pix = GetThresholdedImage();pixWrite("tessinput.tif", page_pix, IFF_TIFF_G4);}if (failed && retry_config != NULL && retry_config[0] != '\0') {// Save current config variables before switching modes.FILE* fp = fopen(kOldVarsFile, "wb");PrintVariables(fp);fclose(fp);// Switch to alternate mode for retry.ReadConfigFile(retry_config);SetImage(pix);Recognize(NULL);// Restore saved config variables.ReadConfigFile(kOldVarsFile);}// Get text only if successful.if (!failed) {char* text;if (tesseract_->tessedit_create_boxfile ||tesseract_->tessedit_make_boxes_from_boxes) {text = GetBoxText(page_index);} else if (tesseract_->tessedit_write_unlv) {text = GetUNLVText();} else if (tesseract_->tessedit_create_hocr) {text = GetHOCRText(page_index);} else {text = GetUTF8Text();}*text_out += text;delete [] text;return true;}return false;
}
/*** Recognize the tesseract global image and return the result as Tesseract* internal structures.*/
int TessBaseAPI::Recognize(ETEXT_DESC* monitor) {if (tesseract_ == NULL)return -1;if (FindLines() != 0)return -1;if (page_res_ != NULL)delete page_res_;if (block_list_->empty()) {page_res_ = new PAGE_RES(block_list_, &tesseract_->prev_word_best_choice_);return 0; // Empty page.}tesseract_->SetBlackAndWhitelist();recognition_done_ = true;if (tesseract_->tessedit_resegment_from_line_boxes)page_res_ = tesseract_->ApplyBoxes(*input_file_, true, block_list_);else if (tesseract_->tessedit_resegment_from_boxes)page_res_ = tesseract_->ApplyBoxes(*input_file_, false, block_list_);elsepage_res_ = new PAGE_RES(block_list_, &tesseract_->prev_word_best_choice_);if (tesseract_->tessedit_make_boxes_from_boxes) {tesseract_->CorrectClassifyWords(page_res_);return 0;}if (truth_cb_ != NULL) {tesseract_->wordrec_run_blamer.set_value(true);truth_cb_->Run(tesseract_->getDict().getUnicharset(),image_height_, page_res_);}int result = 0;if (tesseract_->interactive_display_mode) {#ifndef GRAPHICS_DISABLEDtesseract_->pgeditor_main(rect_width_, rect_height_, page_res_);#endif  // GRAPHICS_DISABLED// The page_res is invalid after an interactive session, so cleanup// in a way that lets us continue to the next page without crashing.delete page_res_;page_res_ = NULL;return -1;} else if (tesseract_->tessedit_train_from_boxes) {tesseract_->ApplyBoxTraining(*output_file_, page_res_);} else if (tesseract_->tessedit_ambigs_training) {FILE *training_output_file = tesseract_->init_recog_training(*input_file_);// OCR the page segmented into words by tesseract.tesseract_->recog_training_segmented(*input_file_, page_res_, monitor, training_output_file);fclose(training_output_file);} else {// Now run the main recognition.if (tesseract_->recog_all_words(page_res_, monitor, NULL, NULL, 0)) {DetectParagraphs(true);} else {result = -1;}}return result;
}
/** Find lines from the image making the BLOCK_LIST. */
int TessBaseAPI::FindLines() {if (thresholder_ == NULL || thresholder_->IsEmpty()) {tprintf("Please call SetImage before attempting recognition.");return -1;}if (recognition_done_)ClearResults();if (!block_list_->empty()) {return 0;}if (tesseract_ == NULL) {tesseract_ = new Tesseract;tesseract_->InitAdaptiveClassifier(false);}if (tesseract_->pix_binary() == NULL)Threshold(tesseract_->mutable_pix_binary());if (tesseract_->ImageWidth() > MAX_INT16 ||tesseract_->ImageHeight() > MAX_INT16) {tprintf("Image too large: (%d, %d)\n",tesseract_->ImageWidth(), tesseract_->ImageHeight());return -1;}tesseract_->PrepareForPageseg();if (tesseract_->textord_equation_detect) {if (equ_detect_ == NULL && datapath_ != NULL) {equ_detect_ = new EquationDetect(datapath_->string(), NULL);}tesseract_->SetEquationDetect(equ_detect_);}Tesseract* osd_tess = osd_tesseract_;OSResults osr;if (PSM_OSD_ENABLED(tesseract_->tessedit_pageseg_mode) && osd_tess == NULL) {if (strcmp(language_->string(), "osd") == 0) {osd_tess = tesseract_;} else {osd_tesseract_ = new Tesseract;if (osd_tesseract_->init_tesseract(datapath_->string(), NULL, "osd", OEM_TESSERACT_ONLY,NULL, 0, NULL, NULL, false) == 0) {osd_tess = osd_tesseract_;osd_tesseract_->set_source_resolution(thresholder_->GetSourceYResolution());} else {tprintf("Warning: Auto orientation and script detection requested,"" but osd language failed to load\n");delete osd_tesseract_;osd_tesseract_ = NULL;}}}if (tesseract_->SegmentPage(input_file_, block_list_, osd_tess, &osr) < 0)return -1;// If Devanagari is being recognized, we use different images for page seg// and for OCR.tesseract_->PrepareForTessOCR(block_list_, osd_tess, &osr);return 0;
}
/*** Run the thresholder to make the thresholded image, returned in pix,* which must not be NULL. *pix must be initialized to NULL, or point* to an existing pixDestroyable Pix.* The usual argument to Threshold is Tesseract::mutable_pix_binary().*/
void TessBaseAPI::Threshold(Pix** pix) {ASSERT_HOST(pix != NULL);if (!thresholder_->IsBinary()) {tesseract_->set_pix_grey(thresholder_->GetPixRectGrey());}if (*pix != NULL)pixDestroy(pix);// Zero resolution messes up the algorithms, so make sure it is credible.int y_res = thresholder_->GetScaledYResolution();if (y_res < kMinCredibleResolution || y_res > kMaxCredibleResolution) {// Use the minimum default resolution, as it is safer to under-estimate// than over-estimate resolution.thresholder_->SetSourceYResolution(kMinCredibleResolution);}thresholder_->ThresholdToPix(pix);thresholder_->GetImageSizes(&rect_left_, &rect_top_,&rect_width_, &rect_height_,&image_width_, &image_height_);// Set the internal resolution that is used for layout parameters from the// estimated resolution, rather than the image resolution, which may be// fabricated, but we will use the image resolution, if there is one, to// report output point sizes.int estimated_res = ClipToRange(thresholder_->GetScaledEstimatedResolution(),kMinCredibleResolution,kMaxCredibleResolution);if (estimated_res != thresholder_->GetScaledEstimatedResolution()) {tprintf("Estimated resolution %d out of range! Corrected to %d\n",thresholder_->GetScaledEstimatedResolution(), estimated_res);}tesseract_->set_source_resolution(estimated_res);
}
// Threshold the source image as efficiently as possible to the output Pix.
// Creates a Pix and sets pix to point to the resulting pointer.
// Caller must use pixDestroy to free the created Pix.
void ImageThresholder::ThresholdToPix(Pix** pix) {if (pix_ != NULL) {if (image_bytespp_ == 0) {// We have a binary image, so it just has to be cloned.*pix = GetPixRect();} else {if (image_bytespp_ == 4) {// Color data can just be passed direct.const uinT32* data = pixGetData(pix_);OtsuThresholdRectToPix(reinterpret_cast<const uinT8*>(data),image_bytespp_, image_bytespl_, pix);} else {// Convert 8-bit to IMAGE and then pass its// buffer to the raw interface to complete the conversion.IMAGE temp_image;temp_image.FromPix(pix_);OtsuThresholdRectToPix(temp_image.get_buffer(),image_bytespp_,COMPUTE_IMAGE_XDIM(temp_image.get_xsize(),temp_image.get_bpp()),pix);}}return;}if (image_bytespp_ > 0) {// Threshold grey or color.OtsuThresholdRectToPix(image_data_, image_bytespp_, image_bytespl_, pix);} else {RawRectToPix(pix);}
}
// Otsu threshold the rectangle, taking everything except the image buffer
// pointer from the class, to the output Pix.
void ImageThresholder::OtsuThresholdRectToPix(const unsigned char* imagedata,int bytes_per_pixel,int bytes_per_line,Pix** pix) const {int* thresholds;int* hi_values;OtsuThreshold(imagedata, bytes_per_pixel, bytes_per_line,rect_left_, rect_top_, rect_width_, rect_height_,&thresholds, &hi_values);// Threshold the image to the given IMAGE.ThresholdRectToPix(imagedata, bytes_per_pixel, bytes_per_line,thresholds, hi_values, pix);delete [] thresholds;delete [] hi_values;
}

同时在输入图像处理时是以坐标参数为基准了,所以支持处理指定区域。

即:api.SetRectangle(0, 0, x, y);

最后说一下安装版的cmd调用方法:

tesseract.exe input result  -l chi_sim

input:输入图片

result;输出文本

-l 字体库资源

chi_sim  中文大陆

字体库资源在tessdata文件夹下:

输入图片:

输出图片:

一般都是用jTessBoxEditor进行矫正错误并训练,ps:需要java环境,还是比较麻烦的。

最终生成的也是字体库资源,即在tessdata文件夹下的*.traineddata。如果普通业务需求的话建议从网上或者其他途径获取比较适合自己使用的库资源。

当然如果你的图像文字存在一些规律性或者对输出要把控的很严谨的话还是建议自己训练吧。

Tesseract-ocr实现图像文本识别相关推荐

  1. Python 图像文本识别 EasyOCR

    目录 1. 安装依赖 2. 下载模型文件 3. 编写代码 4. 运行脚本 命令行模式 参考文献 一个Python的图像文本识别工具:easyocr. 这个模块支持70多种语言的即用型OCR,包括中文, ...

  2. 基于CRNN+CTC的改进图像文本识别算法

    上一次介绍了基于改进EAST(An Efficient and Accurate Scene Text Detector)算法的文本定位算法这次我来介绍基于卷积循环神经网络CRNN (Convolut ...

  3. python制作文字识别_用Python轻松进行图像文本识别

    用Python轻松进行图像文本识别 作者:梅朵 微信公众号:实用办公编程技能 微信号:Excel-Python 最近,办公室的同事小李在整理一份报告,很多材料的电子版找不到了,都是纸质版的,纸质版上的 ...

  4. CRNN-基于序列的(端到端)图像文本识别

    文章目录 一.前言 二.网络架构 2.1 特征序列提取 2.2. 序列标注 2.3. 转录 2.3.1 标签序列的概率 2.3.2 无字典转录 2.3.3 基于词典的转录 2.4. 网络训练 4. 总 ...

  5. Android百度文字识别bitmap,Android集成Tesseract OCR实现图片文字识别

    最近项目需要做图片上的文字识别,在网上找了很久,这方面的知识挺多的,但是很杂.将最近学习到的东西整理一下,仅供参考. 1.Tesseract OCR 介绍我就不说了,自行百度,或者访问:https:/ ...

  6. Focusing Attention Network(FAN)自然图像文本识别 学习笔记

    Focusing Attention: Towards Accurate Text Recognition in Natural Images Author: Zhanzhan Cheng,Fan B ...

  7. 个人永久性免费-Excel催化剂功能第86波-人工智能之图像OCR文本识别全覆盖

    在上一年中,Excel催化剂已经送上一波人工智能系列功能,鉴于部分高端用户的需求,再次给予实现了复杂的图像OCR识别,包含几乎所有日常场景,让公司个人手头的图像非结构化数据瞬间变为可进行结构化处理分析 ...

  8. 文本识别新王者CharNet:卷积字符网络

    本文解读了ICCV 2019 论文 Convolutional Character Networks,该算法CharNet在多个文本检测识别数据集取得了SOTA,代码已开源,地址在本文留言区. htt ...

  9. Opencv多语言自然场景文本识别系统(源码&教程)

    1. 研究背景 人类在自然场景中可以快速定位并识别看到的文字信息,但是想要计算机做到和人类一样是比较困难的.开发人员一直想要让机器也能识别图像中的文字信息.当然,对于自然场景来说,图像中的信息复杂甚至 ...

最新文章

  1. 大学计算机专业全英文论文,5计算机专业 外文文献 英文文献 外文翻译 jsp应用框架 中英对照 大学学位论文.doc...
  2. 组会PPT20200910《大工HPT放电结果错误剖析》
  3. 励志:读书七年,为了挣钱,我竭尽全力
  4. 2018-3-28Linux系统管理(16)计算机网络基础
  5. 在Java中使用FileChannel和ByteBuffer对文件进行读写
  6. Linux-lamp
  7. 如何跨越线程调用窗体控件?(3)
  8. Android、Java超详细的身份证验证工具
  9. PFC离散元程序及其在工程领域中的应用(新手必看)
  10. 龙芯2F入手,点滴记录!
  11. 百度360争推1TB永久网盘
  12. 股票预测和股票分析就用FineBI!
  13. 法国语言学校c1,法国留学的语言要求是什么?
  14. 华为P30和IphoneX性能比较
  15. 苹果要给高通多少钱?瑞银分析师预计最多支付60亿美元
  16. 打印100以内的质数
  17. Android仿新浪微博弹出界面动画,Android仿新浪微博启动界面或登陆界面(1)
  18. 微信小程序轮播中的current_手把手教你美化微信小程序中的轮播效果
  19. python跨年表白神器--你值得拥有!
  20. 20210326TCPandUDP

热门文章

  1. 45 SD配置-销售凭证设置-分配订单类型/状态参数文件
  2. 34 FI配置-财务会计-应收账款和应付账款-定义应付汇票的备选统驭科目
  3. 4 SAP权限PFCG操作手册
  4. 单片机 串口 波特率 c语言 9600 11.0592mhz t1,C51语言编程:单片机与 PC 串口通信程序...
  5. define定义的函数如何引用_「C与指针心得」25.预处理器-宏函数
  6. RabbitMQ集群并处理失败
  7. opencv空间色彩转换
  8. MySQL快速生成大量测试数据1000万
  9. 暴露的全局方法_面试刷题36:线程池的原理和使用方法?
  10. layui + echarts股票K线图(含案例、代码、截图)(转载篇)