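A document image is essentially what we usually call a scanned copy, and it differs considerably from an ordinary photo. A photo is rich in visual elements, whereas a document generally consists only of text, tables, and images, plus some additional information. Document images also tend to have complex backgrounds: the same document captured by different devices can carry very different noise. Differences between capture devices also lead to widely varying sampling rates, so the captured images differ in size. There is one more case: for the same document, one copy may carry a single stamp while another carries three.
Scanning may add black specks to the image, and photographing may even change the background color. Applying the methods described above directly would lead to large detection errors.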
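To summarize, document images have few element types (text, tables, images, and additional information such as stamps), complex backgrounds, device-dependent noise, inconsistent sizes, and additional elements that can vary between copies of the same document.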
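Since the conventional methods do not work for us, let's design something different. Fortunately, the files pass through OCR while they are being processed, so an interface can provide the coordinates of every character, the size of the image, and the category and coordinates of key document elements (stamps, images, QR codes).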
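To make the later comparisons concrete, here is a minimal sketch of how the OCR output could be held in memory. The class names (`TextItem`, `Element`, `PageInfo`), the field names, and the `(x_min, y_min, x_max, y_max)` box layout are all assumptions for illustration; the real interface may return something different. Dividing coordinates by the image size is one simple way to cope with images of different sizes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class TextItem:
    """One recognized piece of text and where it sits on the page (hypothetical)."""
    text: str
    box: Box


@dataclass
class Element:
    """A detected key element such as a stamp, image, or QR code (hypothetical)."""
    kind: str
    box: Box


@dataclass
class PageInfo:
    """Everything the OCR interface is assumed to return for one image."""
    width: int
    height: int
    texts: List[TextItem] = field(default_factory=list)
    elements: List[Element] = field(default_factory=list)


def normalize_box(box: Box, width: int, height: int) -> Box:
    """Scale pixel coordinates into [0, 1] so differently sized images are comparable."""
    x0, y0, x1, y1 = box
    return (x0 / width, y0 / height, x1 / width, y1 / height)


page = PageInfo(width=400, height=200,
                texts=[TextItem("Invoice", (100, 50, 300, 150))],
                elements=[Element("stamp", (300, 100, 400, 200))])
print(normalize_box(page.texts[0].box, page.width, page.height))
```

With normalized boxes, a scan and a phone photo of the same page can be compared on the same scale, provided both show the full page without strong perspective distortion.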
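With this information, we can compare the text coordinates to judge how similar the "skeletons" of two document images are, and then use the edit distance between the texts to judge how similar their content is. The additional information can be compared by category and by specific content.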
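The content comparison can be sketched with a plain Levenshtein edit distance. The normalization into a 0 to 1 score (dividing by the longer length) and the function names are assumptions for illustration, not necessarily how the original system computes its score.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def text_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))
print(text_similarity("abcd", "abcf"))
```

Because OCR output from a noisy image usually contains a few misread characters, a score slightly below 1.0 can still indicate the same document; the threshold would need to be tuned on real data.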
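[Flowchart: document image → OCR → text → edit distance; document image → OCR → coordinates → area ratio; document image → detection → objects → IoU]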
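The IoU branch for detected elements (stamps, images, QR codes) can be sketched as follows. The greedy one-to-one matching, the 0.5 threshold, and the function names are assumptions for illustration; boxes are taken to be `(x_min, y_min, x_max, y_max)` in coordinates already normalized by image size.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
Labeled = Tuple[str, Box]                # (category, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def element_similarity(elems_a: List[Labeled], elems_b: List[Labeled],
                       threshold: float = 0.5) -> float:
    """Fraction of elements that find a same-category partner with IoU >= threshold."""
    if not elems_a and not elems_b:
        return 1.0
    unmatched = list(elems_b)
    matched = 0
    for kind_a, box_a in elems_a:
        best_idx, best_score = -1, threshold
        for idx, (kind_b, box_b) in enumerate(unmatched):
            if kind_a != kind_b:
                continue
            score = iou(box_a, box_b)
            if score >= best_score:
                best_idx, best_score = idx, score
        if best_idx >= 0:
            unmatched.pop(best_idx)  # each element may be matched only once
            matched += 1
    return matched / max(len(elems_a), len(elems_b))


one_stamp = [("stamp", (0.0, 0.0, 1.0, 1.0))]
three_stamps = [("stamp", (0.0, 0.0, 1.0, 1.0)),
                ("stamp", (2.0, 2.0, 3.0, 3.0)),
                ("stamp", (4.0, 4.0, 5.0, 5.0))]
print(element_similarity(one_stamp, three_stamps))
```

Dividing by the larger element count means that a copy with one stamp and a copy with three stamps score low on this branch even when their text is identical, which is the case described at the start of this section.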
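Comparing the results of existing methods with those of this combined method shows that the combined method is better suited to document images. However, it has a heavy dependency: it relies on the OCR component and the corresponding detection tool, which has a significant impact on performance.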