Training a Table Recognition Model with PP-OCR

Reference: the official Baidu documentation

1. Set up the PP-OCR environment

See the PP-OCR environment setup tutorial.

2. Prepare the training data

Download the data from the dataset download page, or fetch it from the command line:

curl -o ./PubTabNet.tar.gz https://dax-cdn.cdn.appdomain.cloud/dax-pubtabnet/2.0.0/pubtabnet.tar.gz

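After downloading, extract the archive and split the annotations into a training set and a validation set with the following script: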

"""
Split PubTabNet_2.0.0.jsonl into two files:
PubTabNet_2.0.0_train.jsonl and PubTabNet_2.0.0_val.jsonl.
Images in the test folder have no annotations.
"""
import jsonlines

if __name__ == "__main__":
    # First pass: copy the records marked as "train"
    with jsonlines.open("PubTabNet_2.0.0.jsonl", "r") as f:
        with jsonlines.open("PubTabNet_2.0.0_train.jsonl", "w") as train_f:
            for data in f:
                if data['split'] == "train":
                    train_f.write(data)
    # Second pass: copy the records marked as "val"
    with jsonlines.open("PubTabNet_2.0.0.jsonl", "r") as f:
        with jsonlines.open("PubTabNet_2.0.0_val.jsonl", "w") as val_f:
            for data in f:
                if data['split'] == "val":
                    val_f.write(data)

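The script above reads the large annotation file twice. As an alternative, here is a minimal single-pass sketch that uses only the standard library. The function name split_jsonl and its arguments are my own and not part of PaddleOCR or PubTabNet; the only thing taken from the dataset is the 'split' field used above.

```python
import json
import os


def split_jsonl(src, out_paths):
    """Route each record of a JSON Lines file to an output file by its 'split' field.

    out_paths maps a split name (e.g. "train") to an output path.
    Returns the number of records written per split.
    """
    counts = {name: 0 for name in out_paths}
    handles = {name: open(path, "w", encoding="utf-8") for name, path in out_paths.items()}
    try:
        with open(src, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                name = record.get("split")
                # Records of any other split (e.g. "test") are skipped
                if name in handles:
                    handles[name].write(json.dumps(record, ensure_ascii=False) + "\n")
                    counts[name] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts


if __name__ == "__main__":
    source = "PubTabNet_2.0.0.jsonl"
    if os.path.exists(source):
        print(split_jsonl(source, {
            "train": "PubTabNet_2.0.0_train.jsonl",
            "val": "PubTabNet_2.0.0_val.jsonl",
        }))
```

The returned counts give a quick sanity check that both output files actually received records.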
3. Train the model

# Single machine, single GPU
python3 tools/train.py -c configs/table/table_mv3.yml
# Single machine, multiple GPUs; use the --gpus argument to set the GPU IDs
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/table/table_mv3.yml

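To start from previously trained weights, use the following command. It passes Global.checkpoints, which in PaddleOCR resumes training from a saved checkpoint; to load pretrained weights only, set Global.pretrained_model instead: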

CUDA_VISIBLE_DEVICES=5 nohup python3 tools/train.py -c configs/table/table_mv3.yml -o Global.checkpoints="./output/table_mv3/best_accuracy"
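The -o option overrides a value from the YAML file using a dotted key such as Global.checkpoints. The sketch below only illustrates that idea on a plain nested dict. It is not PaddleOCR's implementation, and apply_overrides is a hypothetical helper; in particular, values are kept as plain strings here.

```python
def apply_overrides(config, overrides):
    """Apply 'Section.key=value' style overrides to a nested dict, in place."""
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        keys = dotted_key.split(".")
        node = config
        # Walk down to the parent of the final key, creating dicts as needed
        for key in keys[:-1]:
            if not isinstance(node.get(key), dict):
                node[key] = {}
            node = node[key]
        node[keys[-1]] = value
    return config


if __name__ == "__main__":
    cfg = {"Global": {"epoch_num": 400, "checkpoints": None}}
    apply_overrides(cfg, ["Global.checkpoints=./output/table_mv3/best_accuracy"])
    print(cfg["Global"]["checkpoints"])
```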

My table_mv3.yml is as follows:

Global:
  use_gpu: true
  epoch_num: 400
  log_smooth_window: 20
  print_batch_step: 5
  save_model_dir: ./output/table_mv3/
  save_epoch_step: 3
  # evaluation is run every 400 iterations after the 0th iteration
  eval_batch_step: [0, 400]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/table/table.jpg
  # for data or label process
  character_dict_path: ppocr/utils/dict/table_structure_dict.txt
  character_type: en
  max_text_length: 100
  max_elem_length: 800
  max_cell_num: 500
  infer_mode: False
  process_total_num: 0
  process_cut_num: 0

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  clip_norm: 5.0
  lr:
    learning_rate: 0.001
  regularizer:
    name: 'L2'
    factor: 0.00000

Architecture:
  model_type: table
  algorithm: TableAttn
  Backbone:
    name: MobileNetV3
    scale: 1.0
    model_name: large
  Head:
    name: TableAttentionHead
    hidden_size: 256
    l2_decay: 0.00001
    loc_type: 2
    max_text_length: 100
    max_elem_length: 800
    max_cell_num: 500

Loss:
  name: TableAttentionLoss
  structure_weight: 100.0
  loc_weight: 10000.0

PostProcess:
  name: TableLabelDecode

Metric:
  name: TableMetric
  main_indicator: acc

Train:
  dataset:
    name: PubTabDataSet
    data_dir: /home/work/data/guopei/pubtabnet/train/
    label_file_path: /home/work/data/guopei/pubtabnet/PubTabNet_2.0.0_train.jsonl
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - ResizeTableImage:
          max_len: 488
      - TableLabelEncode:
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - PaddingTableImage:
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'structure', 'bbox_list', 'sp_tokens', 'bbox_list_mask']
  loader:
    shuffle: True
    batch_size_per_card: 48
    drop_last: True
    num_workers: 16

Eval:
  dataset:
    name: PubTabDataSet
    data_dir: /home/work/data/guopei/pubtabnet/val/
    label_file_path: /home/work/data/guopei/pubtabnet/PubTabNet_2.0.0_val.jsonl
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - ResizeTableImage:
          max_len: 488
      - TableLabelEncode:
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - PaddingTableImage:
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'structure', 'bbox_list', 'sp_tokens', 'bbox_list_mask']
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 16
    num_workers: 8
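Before launching a long training run, it can help to confirm that data_dir and label_file_path in the config agree with each other. The following is a minimal sketch, assuming each line of the label file is a JSON object with a "filename" field (as in the PubTabNet records used in this post). find_missing_images is a hypothetical helper, not a PaddleOCR function.

```python
import json
import os


def find_missing_images(data_dir, label_file_path, limit=None):
    """Return filenames listed in the label file that are absent from data_dir.

    limit optionally restricts the check to the first N lines of the label file.
    """
    missing = []
    with open(label_file_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            line = line.strip()
            if not line:
                continue
            filename = json.loads(line)["filename"]
            if not os.path.isfile(os.path.join(data_dir, filename)):
                missing.append(filename)
    return missing


if __name__ == "__main__":
    # Paths copied from the Train section of the config above
    data_dir = "/home/work/data/guopei/pubtabnet/train/"
    label_file = "/home/work/data/guopei/pubtabnet/PubTabNet_2.0.0_train.jsonl"
    if os.path.exists(label_file):
        print(len(find_missing_images(data_dir, label_file, limit=1000)), "images missing")
```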

After the steps above, the table recognition model starts training, as shown in the figure below:

4. Convert the trained model to an inference model

CUDA_VISIBLE_DEVICES=6 python tools/export_model.py -c configs/table/table_mv3.yml -o Global.pretrained_model=/home/work/guopei/workspace/OCR/table_recog/paddle/PaddleOCR/output/table_mv3/best_accuracy Global.load_static_weights=False Global.save_inference_dir=./table_infer

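This command converts output/table_mv3/best_accuracy.pdparams into an inference model and saves it in the ./table_infer folder, as shown in the figure below:

5. Test the trained table recognition model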

CUDA_VISIBLE_DEVICES=5 python3 predict_system.py --det_model_dir=inference/ch_ppocr_mobile_v2.0_det_infer --rec_model_dir=inference/ch_ppocr_mobile_v2.0_rec_infer --table_model_dir=/home/work/guopei/workspace/OCR/table_recog/paddle/PaddleOCR/table_infer --image_dir=../doc/table/1.png --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --rec_char_type=ch --output=../output/table --vis_font_path=../doc/fonts/simfang.ttf

The test result is as follows:

6. Evaluate the table structure recognition metric

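First, following the official documentation, build the ground-truth file gt.json. I picked 500 tables from the PubTabNet validation set for testing.
The code that generates it is below. Note that it saves the file as test.json, which is the name passed to --gt_path in the evaluation command: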

import json
import os

import jsonlines


def data_process(data):
    img_name = data["filename"]
    img_path = os.path.join("/home/work/data/guopei/pubtabnet/val", img_name)
    # Wrap the structure tokens so that they form a complete HTML document
    html = data['html']["structure"]['tokens']
    html = ["<html>", "<body>", "<table>"] + html + ["</table>", "</body>", "</html>"]
    tokens = []
    bboxes = []
    for cell in data['html']["cells"]:
        # Skip empty cells and cells without a bounding box
        if len(cell['tokens']) == 0 or "bbox" not in cell.keys():
            continue
        tokens.append(cell['tokens'])
        bboxes.append(cell['bbox'])
    label = [html, bboxes, tokens]
    return img_path, label


if __name__ == "__main__":
    datas = {}
    idx = 0
    with jsonlines.open("PubTabNet_2.0.0_val.jsonl", "r") as f:
        for data in f:
            idx += 1
            # Keep only the first 500 tables
            if idx > 500:
                break
            img_path, label = data_process(data)
            datas[img_path] = label
    with open("test.json", "w") as out_f:
        json.dump(datas, out_f, indent=2, ensure_ascii=False)
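Each entry of the generated file maps an image path to a label of the form [html, bboxes, tokens], with one bounding box per kept cell. As a quick sanity check before running the evaluation, here is a minimal sketch that verifies this layout. summarize_gt is a hypothetical helper written for this post, not part of PaddleOCR.

```python
import json
import os


def summarize_gt(gt):
    """Check the [html, bboxes, tokens] layout and return (num_images, num_cells)."""
    num_cells = 0
    for img_path, label in gt.items():
        if len(label) != 3:
            raise ValueError(f"{img_path}: expected [html, bboxes, tokens]")
        html, bboxes, tokens = label
        # Every kept cell contributes exactly one bbox and one token list
        if len(bboxes) != len(tokens):
            raise ValueError(f"{img_path}: {len(bboxes)} bboxes but {len(tokens)} token lists")
        num_cells += len(bboxes)
    return len(gt), num_cells


if __name__ == "__main__":
    if os.path.exists("test.json"):
        with open("test.json", "r", encoding="utf-8") as f:
            print(summarize_gt(json.load(f)))
```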

The evaluation command is as follows:

CUDA_VISIBLE_DEVICES=2 python3 table/eval_table.py --det_model_dir=inference/ch_ppocr_mobile_v2.0_det_infer --rec_model_dir=inference/ch_ppocr_mobile_v2.0_rec_infer --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer --image_dir='' --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --rec_char_type=ch --det_limit_side_len=736 --det_limit_type=min --gt_path=/home/work/data/guopei/pubtabnet/test.json

I modified the code slightly so that the reported metric is the TEDS of the table structure only. The results are as follows:
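For reference, TEDS scores two tables as 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|), where EditDist is a tree edit distance between the two HTML trees and |T| is the number of nodes in a tree. The sketch below is not TEDS. It is a simplified stand-in that applies the same normalization to a plain Levenshtein distance over flat structure-token lists, just to show how the score behaves. The function names are my own.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def token_similarity(pred, gt):
    """1 - distance / max length, mirroring the TEDS normalization on flat sequences."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pred, gt) / longest


if __name__ == "__main__":
    gt = ["<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]
    pred = ["<tr>", "<td>", "</td>", "</tr>"]  # one cell missing
    print(round(token_similarity(pred, gt), 4))
```

A real TEDS evaluation compares trees rather than flat sequences, so scores from this sketch are not comparable with published TEDS numbers.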

