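Abstract: This article introduces a program that extracts table content from PDF files. It first shows how the program is used, then walks through the design and implementation details of the code.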


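0. Requirements

  • The PDFs contain a large number of tables, and only tables of specified types need to be extracted. These tables are identified mainly by keywords in the table header and in the table body.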

1. Sample PDFs

  • Sample PDF downloads: Sample 1, Sample 2, Sample 3

2. Extraction rules

Extraction rules are specified in an Excel file, as in the following example:
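Judging from the keys that the code in section 6 reads, the rules from the workbook end up in a dictionary with five lists. The sketch below shows one plausible shape for that dictionary and how the two header rules combine. The patterns and keywords, and the helper name `title_matches`, are invented for illustration and are not taken from Demo.xlsx.

```python
import re

# Hypothetical rule set: the keys mirror those read by the Extractor class,
# the values are made-up examples.
rules = {
    'in-header': [r'balance sheet', r'income statement'],  # regexes, title line must match at least one
    'not-in-header': [r'parent company'],                  # regexes, title line must match none
    'in-table': ['total assets'],                          # substrings, table must contain all
    'not-in-table': ['unaudited'],                         # substrings, table must contain none
    'in-page': [r'auditor'],                               # regexes, page text must match all
}


def title_matches(line, rules):
    """Return True if a title line passes the in-header / not-in-header rules."""
    hit = any(re.search(p, line) for p in rules['in-header'])
    blocked = any(re.search(p, line) for p in rules['not-in-header'])
    return hit and not blocked


print(title_matches('Consolidated balance sheet', rules))
print(title_matches('balance sheet of the parent company', rules))
```

The first call passes because one `in-header` pattern matches and no `not-in-header` pattern does; the second is rejected by the `parent company` exclusion.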

3. Sample extraction results

The extracted results are saved to an Excel file, as shown below:

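4. Usage

  • First prepare the Demo.xlsx file (download) and download the PDFparser.exe program (download), and place both in the same directory. Then put the PDF files into a folder of any name (xxx here), place the xxx folder in that same directory, and double-click the program to run it.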
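How PDFparser.exe locates the PDF files is not shown in this article. As a rough sketch, assuming it simply scans every sub-folder next to the executable, the discovery step could look like the following. The function name `find_pdfs` and the demo file names are hypothetical.

```python
import os
import tempfile


def find_pdfs(base_dir):
    """Collect the paths of all PDF files inside the sub-folders of base_dir."""
    found = []
    for entry in os.listdir(base_dir):
        folder = os.path.join(base_dir, entry)
        if not os.path.isdir(folder):
            continue  # skip Demo.xlsx, the executable and other top-level files
        for root, _dirs, files in os.walk(folder):
            for name in files:
                if name.lower().endswith('.pdf'):
                    found.append(os.path.join(root, name))
    return sorted(found)


# Small demonstration on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, 'xxx'))
    for name in ('a.pdf', 'b.PDF', 'notes.txt'):
        open(os.path.join(base, 'xxx', name), 'w').close()
    open(os.path.join(base, 'Demo.xlsx'), 'w').close()
    demo = [os.path.basename(p) for p in find_pdfs(base)]

print(demo)
```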

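5. Code overview

  • The program uses the pdfplumber module to parse PDFs and obtain their tables and text.
  • The program uses the xlwt and xlrd modules to write and read Excel files.
  • The program combines multiple processes with multiple threads to speed up processing.
  • The program uses the re module for Python regular expressions.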
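The scheduling code itself is not included in this article. One common way to combine processes and threads with the standard library is sketched below: the file list is split into batches, each batch goes to a worker process, and each process handles its batch with a thread pool. All names here (`parse_one`, `chunk`, `process_batch`, `run_all`) are hypothetical, and `parse_one` is only a placeholder for the real per-file work.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def parse_one(path):
    """Placeholder for the per-file work, e.g. Extractor(path, rules).run()."""
    return os.path.basename(path)


def chunk(items, n):
    """Split items into at most n order-preserving slices of near-equal size."""
    size, extra = divmod(len(items), n)
    out, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            out.append(items[start:end])
        start = end
    return out


def process_batch(paths, threads=4):
    """Handle one batch with a thread pool; meant to run inside a worker process."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(parse_one, paths))


def run_all(paths, processes=2, threads=4):
    """Fan batches out to worker processes, each of which fans out to threads."""
    results = []
    with ProcessPoolExecutor(max_workers=processes) as pool:
        futures = [pool.submit(process_batch, batch, threads)
                   for batch in chunk(paths, processes)]
        for future in futures:
            results.extend(future.result())
    return results
```

On Windows, where the packaged .exe runs, `run_all` should be called from under an `if __name__ == '__main__':` guard, because worker processes re-import the main module.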

6. Code details

  • PDF parsing

    
    # 该类用来实现PDF表格和文字内容的提取
    class Extractor(object):def __init__(self, file_path, rules):''':param file_path:PDF file path:param rules: extract rules'''self.file_path = file_pathself.rules = rules
    
    <span class="token comment"># 加载PDF文件</span>
    <span class="token keyword">def</span> <span class="token function">parse_pages</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">try</span><span class="token punctuation">:</span>pages <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>pdf <span class="token operator">=</span> pdfplumber<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span>self<span class="token punctuation">.</span>file_path<span class="token punctuation">)</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'parse file:{}   page num:{}'</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>basename<span class="token punctuation">(</span>self<span class="token punctuation">.</span>file_path<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pdf<span class="token punctuation">.</span>pages<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token keyword">for</span> index<span class="token punctuation">,</span> page <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>pdf<span class="token punctuation">.</span>pages<span class="token punctuation">)</span><span class="token punctuation">:</span>tables <span class="token operator">=</span> page<span class="token punctuation">.</span>extract_tables<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token 
keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>tables<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">1</span><span class="token punctuation">:</span><span class="token keyword">continue</span>pages<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token punctuation">{<!-- --></span><span class="token string">'text'</span><span class="token punctuation">:</span> page<span class="token punctuation">.</span>extract_text<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">'tables'</span><span class="token punctuation">:</span> tables<span class="token punctuation">,</span> <span class="token string">'page'</span><span class="token punctuation">:</span> index <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token keyword">return</span> pages<span class="token keyword">except</span> Exception <span class="token keyword">as</span> e<span class="token punctuation">:</span><span class="token keyword">print</span><span class="token punctuation">(</span>e<span class="token punctuation">)</span><span class="token keyword">return</span> <span class="token boolean">None</span><span class="token comment"># 提取特定类型表头的表格,规则有rules参数指定</span>
    <span class="token keyword">def</span> <span class="token function">extract_table_with_specific_header</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> pages<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> pages <span class="token keyword">is</span> <span class="token boolean">None</span> <span class="token operator">or</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">1</span><span class="token punctuation">:</span><span class="token comment"># print('no-page...')</span><span class="token keyword">return</span>target_tables <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token comment"># 遍历所有页面</span><span class="token keyword">for</span> index<span class="token punctuation">,</span> page <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span><span class="token punctuation">:</span>text <span class="token operator">=</span> page<span class="token punctuation">[</span><span class="token string">'text'</span><span class="token punctuation">]</span>tables <span class="token operator">=</span> page<span class="token punctuation">[</span><span class="token string">'tables'</span><span class="token punctuation">]</span>page_id <span class="token operator">=</span> page<span class="token punctuation">[</span><span class="token string">'page'</span><span class="token punctuation">]</span> <span class="token operator">-</span> <span class="token number">1</span>lines <span class="token operator">=</span> re<span class="token punctuation">.</span>split<span class="token punctuation">(</span>r<span class="token 
string">'\n+'</span><span class="token punctuation">,</span> text<span class="token punctuation">)</span><span class="token comment"># 遍历当前页面的所有行</span><span class="token keyword">for</span> ind<span class="token punctuation">,</span> line <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>lines<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token comment"># 判定表头符合规则的表格</span><span class="token keyword">if</span> <span class="token builtin">sum</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1</span> <span class="token keyword">for</span> rule <span class="token keyword">in</span> self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">'in-header'</span><span class="token punctuation">]</span> <span class="token keyword">if</span> re<span class="token punctuation">.</span>search<span class="token punctuation">(</span>rule<span class="token punctuation">,</span> line<span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">&gt;</span> <span class="token number">0</span> <span class="token operator">and</span> \<span class="token builtin">sum</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1</span> <span class="token keyword">for</span> rule <span class="token keyword">in</span> self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">'not-in-header'</span><span class="token punctuation">]</span> <span class="token keyword">if</span> re<span class="token punctuation">.</span>search<span class="token punctuation">(</span>rule<span class="token punctuation">,</span> line<span class="token punctuation">)</span><span 
class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">:</span><span class="token keyword">if</span> ind <span class="token operator">&gt;=</span> <span class="token builtin">len</span><span class="token punctuation">(</span>lines<span class="token punctuation">)</span> <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">:</span><span class="token keyword">break</span>cnt <span class="token operator">=</span> ind <span class="token operator">+</span> <span class="token number">1</span><span class="token builtin">next</span> <span class="token operator">=</span> lines<span class="token punctuation">[</span>ind <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token keyword">if</span> ind <span class="token operator">&lt;</span> <span class="token builtin">len</span><span class="token punctuation">(</span>lines<span class="token punctuation">)</span> <span class="token operator">-</span> <span class="token number">2</span> <span class="token operator">and</span> re<span class="token punctuation">.</span>search<span class="token punctuation">(</span>r<span class="token string">'单位[::]'</span><span class="token punctuation">,</span> <span class="token builtin">next</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token builtin">next</span> <span class="token operator">=</span> lines<span class="token punctuation">[</span>ind <span class="token operator">+</span> <span class="token number">2</span><span class="token punctuation">]</span>cnt <span class="token operator">+=</span> <span class="token number">1</span><span class="token keyword">for</span> ti<span class="token punctuation">,</span> table <span class="token keyword">in</span> <span class="token 
builtin">enumerate</span><span class="token punctuation">(</span>tables<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> <span class="token operator">not</span> table<span class="token punctuation">:</span><span class="token keyword">continue</span>first <span class="token operator">=</span> <span class="token string">''</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">[</span>word <span class="token keyword">for</span> word <span class="token keyword">in</span> table<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token keyword">if</span> word <span class="token keyword">is</span> <span class="token operator">not</span> <span class="token boolean">None</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token comment"># 表格是完整的情况</span><span class="token keyword">if</span> first <span class="token operator">==</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">'\s+'</span><span class="token punctuation">,</span> <span class="token string">''</span><span class="token punctuation">,</span> <span class="token builtin">next</span><span class="token punctuation">)</span><span class="token punctuation">:</span>tables<span class="token punctuation">[</span>ti<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token boolean">False</span><span class="token keyword">if</span> index <span class="token operator">+</span> <span class="token number">1</span> <span class="token operator">&lt;</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span> <span class="token operator">and</span> <span class="token builtin">len</span><span 
class="token punctuation">(</span>pages<span class="token punctuation">[</span>index <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'tables'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">&gt;</span> <span class="token number">0</span><span class="token punctuation">:</span>table_next <span class="token operator">=</span> pages<span class="token punctuation">[</span>index <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'tables'</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>fi <span class="token operator">=</span> <span class="token string">''</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">[</span>item <span class="token keyword">for</span> item <span class="token keyword">in</span> table_next<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token keyword">if</span> item <span class="token keyword">is</span> <span class="token operator">not</span> <span class="token boolean">None</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token keyword">if</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">'\s+'</span><span class="token punctuation">,</span> <span class="token string">''</span><span class="token punctuation">,</span> fi<span class="token punctuation">)</span> <span class="token keyword">in</span> re<span class="token punctuation">.</span>sub<span 
class="token punctuation">(</span>r<span class="token string">'\s+'</span><span class="token punctuation">,</span> <span class="token string">''</span><span class="token punctuation">,</span> pages<span class="token punctuation">[</span>index <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'text'</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>table <span class="token operator">+=</span> table_nexttarget_tables<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token punctuation">{<!-- --></span><span class="token string">'page'</span><span class="token punctuation">:</span> page_id <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">,</span> <span class="token string">'method'</span><span class="token punctuation">:</span> <span class="token string">'exact'</span><span class="token punctuation">,</span> <span class="token string">'table'</span><span class="token punctuation">:</span> table<span class="token punctuation">,</span><span class="token string">'table-id'</span><span class="token punctuation">:</span> <span class="token builtin">str</span><span class="token punctuation">(</span>page_id <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token builtin">str</span><span class="token punctuation">(</span>ti <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token comment"># 
表格可能不完整的情况</span><span class="token keyword">elif</span> first <span class="token keyword">in</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">'\s+'</span><span class="token punctuation">,</span> <span class="token string">''</span><span class="token punctuation">,</span> <span class="token builtin">next</span><span class="token punctuation">)</span><span class="token punctuation">:</span>tables<span class="token punctuation">[</span>ti<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token boolean">False</span><span class="token keyword">if</span> index <span class="token operator">+</span> <span class="token number">1</span> <span class="token operator">&lt;</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span> <span class="token operator">and</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">[</span>index <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'tables'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">&gt;</span> <span class="token number">0</span><span class="token punctuation">:</span>table_next <span class="token operator">=</span> pages<span class="token punctuation">[</span>index <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'tables'</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>fi <span class="token operator">=</span> <span class="token 
string">''</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">[</span>item <span class="token keyword">for</span> item <span class="token keyword">in</span> table_next<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token keyword">if</span> item <span class="token keyword">is</span> <span class="token operator">not</span> <span class="token boolean">None</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token keyword">if</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">'\s+'</span><span class="token punctuation">,</span> <span class="token string">''</span><span class="token punctuation">,</span> fi<span class="token punctuation">)</span> <span class="token keyword">in</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">'\s+'</span><span class="token punctuation">,</span> <span class="token string">''</span><span class="token punctuation">,</span> pages<span class="token punctuation">[</span>index <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'text'</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>table <span class="token operator">+=</span> table_nexttarget_tables<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token punctuation">{<!-- --></span><span class="token string">'page'</span><span class="token punctuation">:</span> page_id <span class="token 
operator">+</span> <span class="token number">1</span><span class="token punctuation">,</span> <span class="token string">'method'</span><span class="token punctuation">:</span> <span class="token string">'guess'</span><span class="token punctuation">,</span> <span class="token string">'table'</span><span class="token punctuation">:</span> table<span class="token punctuation">,</span><span class="token string">'table-id'</span><span class="token punctuation">:</span> <span class="token builtin">str</span><span class="token punctuation">(</span>page_id <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token string">'-'</span> <span class="token operator">+</span> <span class="token builtin">str</span><span class="token punctuation">(</span>ti <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token keyword">return</span> target_tables<span class="token comment"># 提取表格中存在指定类型信息的表格,规则由参数rules指定</span>
    <span class="token keyword">def</span> <span class="token function">extract_table_with_specific_info</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> pages<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> pages <span class="token keyword">is</span> <span class="token boolean">None</span> <span class="token operator">or</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">1</span><span class="token punctuation">:</span><span class="token comment"># print('no-page...')</span><span class="token keyword">return</span>target_tables <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token keyword">for</span> index<span class="token punctuation">,</span> page <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span><span class="token punctuation">:</span>tables <span class="token operator">=</span> page<span class="token punctuation">[</span><span class="token string">'tables'</span><span class="token punctuation">]</span>page_id <span class="token operator">=</span> page<span class="token punctuation">[</span><span class="token string">'page'</span><span class="token punctuation">]</span> <span class="token operator">-</span> <span class="token number">1</span><span class="token keyword">for</span> ti<span class="token punctuation">,</span> table <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>tables<span class="token punctuation">)</span><span class="token punctuation">:</span>st <span class="token operator">=</span> <span class="token 
builtin">str</span><span class="token punctuation">(</span>table<span class="token punctuation">)</span><span class="token keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">'in-table'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token builtin">sum</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1</span> <span class="token keyword">for</span> in_tab <span class="token keyword">in</span> self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">'in-table'</span><span class="token punctuation">]</span> <span class="token keyword">if</span> in_tab <span class="token keyword">in</span> st<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">'not-in-table'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token builtin">sum</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1</span> <span class="token keyword">for</span> not_tab <span class="token keyword">in</span> self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">'not-in-table'</span><span class="token punctuation">]</span> <span class="token keyword">if</span> <span class="token operator">not</span> not_tab <span class="token 
keyword">in</span> st<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>target_tables<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token punctuation">{<!-- --></span><span class="token string">'page'</span><span class="token punctuation">:</span> page_id <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">,</span> <span class="token string">'method'</span><span class="token punctuation">:</span> <span class="token string">'content-in-table'</span><span class="token punctuation">,</span> <span class="token string">'table'</span><span class="token punctuation">:</span> table<span class="token punctuation">,</span><span class="token string">'table-id'</span><span class="token punctuation">:</span> <span class="token builtin">str</span><span class="token punctuation">(</span>page_id <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token builtin">str</span><span class="token punctuation">(</span>ti <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token keyword">return</span> target_tables<span class="token comment"># 提取存在指定关键词的页面,关键词有rules指定</span>
    <span class="token keyword">def</span> <span class="token function">extract_specific_page</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> pages<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> pages <span class="token keyword">is</span> <span class="token boolean">None</span> <span class="token operator">or</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">1</span><span class="token punctuation">:</span><span class="token comment"># print('no-page...')</span><span class="token keyword">return</span>target_pages <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token keyword">for</span> index<span class="token punctuation">,</span> page <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span><span class="token punctuation">:</span>text <span class="token operator">=</span> page<span class="token punctuation">[</span><span class="token string">'text'</span><span class="token punctuation">]</span>page_id <span class="token operator">=</span> page<span class="token punctuation">[</span><span class="token string">'page'</span><span class="token punctuation">]</span><span class="token keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">'in-page'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token builtin">sum</span><span class="token punctuation">(</span><span class="token 
punctuation">[</span><span class="token number">1</span> <span class="token keyword">for</span> rule <span class="token keyword">in</span> self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">'in-page'</span><span class="token punctuation">]</span> <span class="token keyword">if</span> re<span class="token punctuation">.</span>search<span class="token punctuation">(</span>rule<span class="token punctuation">,</span> text<span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>target_pages<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token punctuation">{<!-- --></span><span class="token string">'page'</span><span class="token punctuation">:</span> page_id<span class="token punctuation">,</span> <span class="token string">'text'</span><span class="token punctuation">:</span> text<span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token keyword">return</span> target_pages<span class="token comment"># 执行以上所有过程,返回提取结果</span>
        def run(self):
            pages = self.parse_pages()
            if pages is None or len(pages) < 1:
                print('parse pdf error:', os.path.basename(self.file_path))
                return
            try:
                target_1 = self.extract_table_with_specific_header(pages)
                target_2 = self.extract_table_with_specific_info(pages)
                target_3 = self.extract_specific_page(pages)
                tables = []
                s = []
                for table in target_1:
                    if table['table-id'] not in s:
                        s.append(table['table-id'])
                        tables.append(table)
                for table in target_2:
                    if table['table-id'] not in s:
                        s.append(table['table-id'])
                        tables.append(table)
                return tables, target_3
            except Exception as e:
                print(e)
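
The `run` method above merges the tables found by the two extraction passes and skips any `table-id` it has already seen. A minimal sketch of that merge step in isolation, assuming each table is a dict with a `table-id` key as in the code above; the helper name `merge_unique_tables` and the sample data are hypothetical, and a set is swapped in for the list `s` so that membership checks stay cheap:

```python
def merge_unique_tables(*table_lists):
    """Merge lists of table dicts, keeping the first table seen for each 'table-id'."""
    seen = set()
    merged = []
    for tables in table_lists:
        for table in tables:
            table_id = table['table-id']
            if table_id not in seen:
                seen.add(table_id)
                merged.append(table)
    return merged


# Hypothetical sample data, shaped like the dicts used in the article's code.
by_header = [{'table-id': 1, 'method': 'exact'}, {'table-id': 2, 'method': 'exact'}]
by_content = [{'table-id': 2, 'method': 'guess'}, {'table-id': 3, 'method': 'guess'}]
merged = merge_unique_tables(by_header, by_content)
print([t['table-id'] for t in merged])
```

Because earlier lists win, a table matched by both passes keeps the entry from the first pass, which is the same precedence the original loop gives to `target_1`.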
    
  • Loading the Excel rules and caching results

    # This class loads the Excel rules, walks the folder to collect PDF file paths, and caches results
    class Util():
        def __init__(self, folder, out, demo):
            self.folder = folder
            self.out = out
            self.demo = demo

        # Load the Demo file and read the rules
        def load_demo(self):
            print('load demo Excel:', os.path.basename(self.demo))
            book = xlrd.open_workbook(self.demo)
            sheet = book.sheet_by_index(0)
            in_header = sheet.col_values(0, 2, sheet.nrows)
            not_in_header = sheet.col_values(1, 2, sheet.nrows)
            in_table = sheet.col_values(2, 2, sheet.nrows)
            not_in_table = sheet.col_values(3, 2, sheet.nrows)
            in_page = sheet.col_values(4, 2, sheet.nrows)
            rules = {'in-header': in_header, 'not-in-header': not_in_header, 'in-table': in_table,
                     'not-in-table': not_in_table, 'in-page': in_page}
            # Drop cells that are empty or contain only whitespace
            for k, v in rules.items():
                rules[k] = [i for i in v if not re.sub(r'\s+', '', i) == '']
            return rules

        # Load the PDF files by walking the folder tree
        def load_folder(self):
            print('load folder:', self.folder)
            paths = []
            for dirpath, dirnames, filenames in os.walk(self.folder):
                for file in filenames:
                    path = os.path.join(dirpath, file)
                    if os.path.isfile(path) and (os.path.splitext(path)[1] == '.pdf' or os.path.splitext(path)[1] == '.PDF'):
                        paths.append(path)
            return paths

        # Cache the results
        def save_tmp(self, info, name, code, year):
            print('save tmp file:', name)
            if not os.path.isdir('tmp'):
                os.mkdir('tmp')
            tables = info[0]
            pages = info[1]
            # Excel style
            style = xlwt.XFStyle()
            # background color
            pattern = xlwt.Pattern()
            pattern.pattern = xlwt.Pattern.SOLID_PATTERN
            pattern.pattern_fore_colour = 5
            style.pattern = pattern
            # border
            borders = xlwt.Borders()
            borders.left = xlwt.Borders.THICK
            borders.right = xlwt.Borders.THICK
            borders.top = xlwt.Borders.THICK
            borders.bottom = xlwt.Borders.THICK
            style.borders = borders
            # font
            font = xlwt.Font()
            font.name = 'Times New Roman'
            font.bold = True
            font.underline = False
            font.italic = False
            style.font = font
            # sheet style-2
            style2 = xlwt.XFStyle()
            # background color
            pattern = xlwt.Pattern()
            pattern.pattern = xlwt.Pattern.SOLID_PATTERN
            pattern.pattern_fore_colour = 22
            style2.pattern = pattern
            # border
            borders = xlwt.Borders()
            borders.left = xlwt.Borders.THICK
            borders.right = xlwt.Borders.THICK
            borders.top = xlwt.Borders.THICK
            borders.bottom = xlwt.Borders.THICK
            style2.borders = borders
            # font
            font = xlwt.Font()
            font.name = 'Times New Roman'
            font.bold = True
            font.underline = False
            font.italic = False
            style2.font = font
            # sheet style-3
            style3 = xlwt.XFStyle()
            # background color
            pattern2 = xlwt.Pattern()
            pattern2.pattern = xlwt.Pattern.SOLID_PATTERN
            pattern2.pattern_fore_colour = 3
            style3.pattern = pattern2
            # border
            borders = xlwt.Borders()
            borders.left = xlwt.Borders.THICK
            borders.right = xlwt.Borders.THICK
            borders.top = xlwt.Borders.THICK
            borders.bottom = xlwt.Borders.THICK
            style3.borders = borders
            # font
            font = xlwt.Font()
            font.name = 'Times New Roman'
            font.bold = True
            font.underline = False
            font.italic = False
            style3.font = font
            # Write the data to Excel
            book = xlwt.Workbook()
            sheet1 = book.add_sheet('tables')
            sheet2 = book.add_sheet('pages')
            for ind, page in enumerate(pages):
                page_num = page['page']
                text = page['text']
                sheet2.write(ind, 0, name)
                sheet2.write(ind, 1, code)
                sheet2.write(ind, 2, year)
                sheet2.write(ind, 3, page_num)
                sheet2.write(ind, 4, 'search page')
                sheet2.write(ind, 5, text, style)
            # save table
            i = 0
            for ti, table in enumerate(tables):
                page = table['page']
                method = table['method']
                table_content = table['table']
                if method == 'exact':
                    sty = style
                elif method == 'guess':
                    sty = style2
                else:
                    sty = style3
                for index, row in enumerate(table_content):
                    sheet1.write(i, 0, name)
                    sheet1.write(i, 1, code)
                    sheet1.write(i, 2, year)
                    sheet1.write(i, 3, page)
                    sheet1.write(i, 4, method)
                    for ind, one in enumerate(row):
                        sheet1.write(i, 5 + ind, one if one is not None else '', sty)
                    i += 1
                # Leave one empty row between consecutive tables
                i += 1
            book.save('tmp\\' + name + '.tmp.xls')
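
`load_folder` accepts only the exact extensions `.pdf` and `.PDF`, and `load_demo` drops rule cells that contain nothing but whitespace. A minimal sketch of both ideas using only the standard library: the helper names `find_pdfs` and `drop_blank_rules` are hypothetical, the extension check is made case-insensitive (so `.Pdf` is also accepted, which differs from the original), and non-string cells are skipped on the assumption that rules are always text.

```python
import os
import tempfile


def find_pdfs(folder):
    """Recursively collect paths whose extension is .pdf in any letter case."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        for filename in filenames:
            if os.path.splitext(filename)[1].lower() == '.pdf':
                paths.append(os.path.join(dirpath, filename))
    return sorted(paths)


def drop_blank_rules(values):
    """Keep only rule cells that are strings with some non-whitespace text."""
    return [v for v in values if isinstance(v, str) and v.strip()]


# Build a throwaway folder tree to show which files are picked up.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'sub'))
    for rel in ('a.pdf', 'b.PDF', os.path.join('sub', 'c.Pdf'), 'notes.txt'):
        open(os.path.join(root, rel), 'w').close()
    found_names = [os.path.basename(p) for p in find_pdfs(root)]

print(found_names)
print(drop_blank_rules(['Total', '   ', '', 'Revenue\n']))
```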
    
  • The function that performs the extraction

    # Full workflow for a single file, from loading the file to caching the result.
    # To run the program single-threaded, call this function directly from the main function.
    def run(rules, file, util):
        extractor = Extractor(file, rules)
        info = extractor.run()
        # The file name is expected to contain a 6-digit code and an 8-digit date
        code = re.findall(r'\d{6}', os.path.basename(file))[0]
        year = re.findall(r'\d{8}', os.path.basename(file))[0]
        if info is None or len(info[0]) < 1:
            with open('noResult.txt', 'a', encoding='utf-8') as fp:
                fp.write(file + '\n')
        else:
            util.save_tmp(info, os.path.basename(file), code, year)
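
The `run` function above takes the first 6-digit and first 8-digit runs from the file name, so it raises `IndexError` when either is missing, and `\d{6}` can match the first six digits of the 8-digit date if the date comes first. A minimal sketch of a more defensive variant, assuming the code and date are separated by non-digit characters; the helper name `parse_file_name` and the sample file names are hypothetical:

```python
import re

# Match a run of exactly six (or eight) digits that is not part of a longer run.
CODE_RE = re.compile(r'(?<!\d)\d{6}(?!\d)')
DATE_RE = re.compile(r'(?<!\d)\d{8}(?!\d)')


def parse_file_name(file_name):
    """Return (code, date) found in file_name, using None for a missing part."""
    code = CODE_RE.search(file_name)
    date = DATE_RE.search(file_name)
    return (code.group() if code else None, date.group() if date else None)


print(parse_file_name('600000_20190330_report.pdf'))
print(parse_file_name('20190330_600000.pdf'))
print(parse_file_name('report.pdf'))
```

A caller can then decide whether a `None` value should skip the file or be logged alongside the entries in `noResult.txt`.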
  • Multiprocessing + multithreading

    # -------------------
    # The two functions below enable a multithreading + multiprocessing mode to speed up execution.
    # For CPU-bound work, more processes help, up to the number of CPU cores on the machine.
    # -----------------
    # Multithreading: starts one thread per entry in files, but all of them run on a single CPU core
    # multiple threads
    def batch_processor(func, rules, files, util):
        thread_pool = []
        for index, file in enumerate(files):
            th = threading.Thread(target=func, args=(rules, file, util))
            # print('running thread:', th.name)
            th.start()
            thread_pool.append(th)
        for th in thread_pool:
            # print('waiting for thread:', th.name)
            th.join()

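    # Multiprocessing: starts 4 processes, each of which runs multiple threads. Use as many processes
    # as the CPU has cores; a typical machine is dual-core with four hardware threads, in which case
    # 4 processes keep the CPU fully busy and give the best throughput.
    # multiple processors
    def multi_processor_run(func, sub_func, files, rules, util):
        pool = multiprocessing.Pool(processes=4)
        cnt = 0
        batch_size = 5
        # Hand the files to the pool in batches of batch_size
        while cnt < len(files):
            rear = cnt + batch_size
            if rear > len(files):
                rear = len(files)
            batch = files[cnt + 0:rear]
            pool.apply_async(func, (sub_func, rules, batch, util))
            cnt += batch_size
        pool.close()
        pool.join()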
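
The batching logic in `multi_processor_run` and the thread-per-file approach in `batch_processor` can be sketched on their own. This is a minimal illustration rather than the author's implementation: the helper names `make_batches` and `run_batch` are hypothetical, and `concurrent.futures.ThreadPoolExecutor` is swapped in for the hand-rolled thread list so that the number of threads is bounded and worker exceptions are re-raised in the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run_batch(func, batch, max_workers=4):
    """Apply func to every item of batch on a bounded thread pool, keeping order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() forces evaluation, so an exception in any worker surfaces here.
        return list(executor.map(func, batch))


batches = make_batches(list(range(12)), 5)
print(batches)
print(run_batch(lambda x: x * 2, batches[0]))
```

With `pool.apply_async`, an exception raised inside a worker stays hidden unless `.get()` is called on the returned result object, so keeping those result objects and calling `.get()` after `pool.join()` is one way to surface failures.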

  • Consolidating and saving the results

    # This function merges all temporary Excel result files cached in the local tmp folder into a single Excel file
    # re-format result
    def re_format(sheet_size):
        print('re-format file...')
        files = os.listdir('tmp')
        paths = []
        new_book = xlwt.Workbook()
        for file in files:
            if os.path.isfile(os.path.join('tmp', file)) and '.tmp.xls' in file:
                paths.append(os.path.join('tmp', file))
        style = xlwt.XFStyle()
        # background color
        pattern = xlwt.Pattern()
        pattern.pattern = xlwt.Pattern.SOLID_PATTERN
        pattern.pattern_fore_colour = 5
        style.pattern = pattern
        # border
        borders = xlwt.Borders()
        borders.left = xlwt.Borders.THICK
        borders.right = xlwt.Borders.THICK
        borders.top = xlwt.Borders.THICK
        borders.bottom = xlwt.Borders.THICK
        style.borders = borders
        # font
        font = xlwt.Font()
        font.name = 'Times New Roman'
        font.bold = True
        font.underline = False
        font.italic = False
        style.font = font
        # sheet style-2
        style2 = xlwt.XFStyle()
        # background color
        pattern = xlwt.Pattern()
        pattern.pattern = xlwt.Pattern.SOLID_PATTERN
        pattern.pattern_fore_colour = 22
        style2.pattern = pattern
        # border
        borders = xlwt.Borders()
        borders.left = xlwt.Borders.THICK
        borders.right = xlwt.Borders.THICK
        borders.top = xlwt.Borders.THICK
        borders.bottom = xlwt.Borders.THICK
        style2.borders = borders
        # font
        font = xlwt.Font()
    font<span class="token punctuation">.</span>name <span class="token operator">=</span> <span class="token string">'Times New Roman'</span>
    font<span class="token punctuation">.</span>bold <span class="token operator">=</span> <span class="token boolean">True</span>
    font<span class="token punctuation">.</span>underline <span class="token operator">=</span> <span class="token boolean">False</span>
    font<span class="token punctuation">.</span>italic <span class="token operator">=</span> <span class="token boolean">False</span>
    style2<span class="token punctuation">.</span>font <span class="token operator">=</span> font<span class="token comment"># sheet style-3</span>
    style3 <span class="token operator">=</span> xlwt<span class="token punctuation">.</span>XFStyle<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token comment"># background color</span>
    pattern2 <span class="token operator">=</span> xlwt<span class="token punctuation">.</span>Pattern<span class="token punctuation">(</span><span class="token punctuation">)</span>
    pattern2<span class="token punctuation">.</span>pattern <span class="token operator">=</span> xlwt<span class="token punctuation">.</span>Pattern<span class="token punctuation">.</span>SOLID_PATTERN
    pattern2<span class="token punctuation">.</span>pattern_fore_colour <span class="token operator">=</span> <span class="token number">3</span>
    style3<span class="token punctuation">.</span>pattern <span class="token operator">=</span> pattern2
    <span class="token comment"># border</span>
    borders <span class="token operator">=</span> xlwt<span class="token punctuation">.</span>Borders<span class="token punctuation">(</span><span class="token punctuation">)</span>
    borders<span class="token punctuation">.</span>left <span class="token operator">=</span> xlwt<span class="token punctuation">.</span>Borders<span class="token punctuation">.</span>THICK
    borders<span class="token punctuation">.</span>right <span class="token operator">=</span> xlwt<span class="token punctuation">.</span>Borders<span class="token punctuation">.</span>THICK
    borders<span class="token punctuation">.</span>top <span class="token operator">=</span> xlwt<span class="token punctuation">.</span>Borders<span class="token punctuation">.</span>THICK
    borders<span class="token punctuation">.</span>bottom <span class="token operator">=</span> xlwt<span class="token punctuation">.</span>Borders<span class="token punctuation">.</span>THICK
    style3<span class="token punctuation">.</span>borders <span class="token operator">=</span> borders
    <span class="token comment"># font</span>
    font <span class="token operator">=</span> xlwt<span class="token punctuation">.</span>Font<span class="token punctuation">(</span><span class="token punctuation">)</span>
    font<span class="token punctuation">.</span>name <span class="token operator">=</span> <span class="token string">'Times New Roman'</span>
    font<span class="token punctuation">.</span>bold <span class="token operator">=</span> <span class="token boolean">True</span>
    font<span class="token punctuation">.</span>underline <span class="token operator">=</span> <span class="token boolean">False</span>
    font<span class="token punctuation">.</span>italic <span class="token operator">=</span> <span class="token boolean">False</span>
    style3<span class="token punctuation">.</span>font <span class="token operator">=</span> fonttab_cnt <span class="token operator">=</span> <span class="token number">1</span>
    page_cnt = 1
    tab_rows = 0
    page_rows = 0
    sheet2 = new_book.add_sheet('pages-' + str(0))
    sheet1 = new_book.add_sheet('tables-' + str(0))
    sheet1.write(0, 0, 'File')
    sheet1.write(0, 1, 'Code')
    sheet1.write(0, 2, 'Date')
    sheet1.write(0, 3, 'Page')
    sheet1.write(0, 4, 'Method')
    sheet2.write(0, 0, 'File')
    sheet2.write(0, 1, 'Code')
    sheet2.write(0, 2, 'Date')
    sheet2.write(0, 3, 'Page')
    sheet2.write(0, 4, 'Method')

    for index, file in enumerate(paths):
        book = xlrd.open_workbook(file)
        sheet = book.sheet_by_name('tables')
        sheet_pages = book.sheet_by_name('pages')
        tab_rows += sheet.nrows
        page_rows += sheet_pages.nrows
        # copy the 'tables' sheet; column 4 ('Method') selects the cell style
        for row in range(sheet.nrows):
            if len(sheet.row_values(row)) < 5:
                sty1 = None
            elif sheet.row_values(row)[4] == 'exact':
                sty1 = style
            elif sheet.row_values(row)[4] == 'guess':
                sty1 = style2
            elif sheet.row_values(row)[4] == 'content-in-table':
                sty1 = style3
            else:
                sty1 = None
            for col, val in enumerate(sheet.row_values(row)):
                if col > 4:
                    if sty1 is not None:
                        sheet1.write(tab_cnt, col, val, sty1)
                    else:
                        sheet1.write(tab_cnt, col, val)
                else:
                    sheet1.write(tab_cnt, col, val)
            tab_cnt += 1
        # skip one row so that each file's block is followed by a blank row
        tab_cnt += 1
        # copy the 'pages' sheet in the same way
        for row in range(sheet_pages.nrows):
            if len(sheet_pages.row_values(row)) < 5:
                sty2 = None
            elif sheet_pages.row_values(row)[4] == 'exact':
                sty2 = style
            elif sheet_pages.row_values(row)[4] == 'guess':
                sty2 = style2
            elif sheet_pages.row_values(row)[4] == 'content-in-table':
                sty2 = style3
            else:
                sty2 = None
            for col, val in enumerate(sheet_pages.row_values(row)):
                if col > 4:
                    if sty2 is not None:
                        sheet2.write(page_cnt, col, val, sty2)
                    else:
                        sheet2.write(page_cnt, col, val)
                else:
                    sheet2.write(page_cnt, col, val)
            page_cnt += 1
        page_cnt += 1
        # start a new sheet once the running row count reaches sheet_size
        if tab_rows >= sheet_size:
            tab_rows = 0
            tab_cnt = 1
            sheet1 = new_book.add_sheet('tables-' + str(index))
            sheet1.write(0, 0, 'File')
            sheet1.write(0, 1, 'Code')
            sheet1.write(0, 2, 'Date')
            sheet1.write(0, 3, 'Page')
            sheet1.write(0, 4, 'Method')
        if page_rows >= sheet_size:
            page_rows = 0
            page_cnt = 1
            sheet2 = new_book.add_sheet('pages-' + str(index))
            sheet2.write(0, 0, 'File')
            sheet2.write(0, 1, 'Code')
            sheet2.write(0, 2, 'Date')
            sheet2.write(0, 3, 'Page')
            sheet2.write(0, 4, 'Method')
    new_book.save('tables.xls')
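
The merge code above repeats the same style selection and sheet rollover logic for both sheets. The following is a minimal sketch of how those two decisions could be isolated and checked without Excel. The names `pick_style` and `chunk_rows` are hypothetical helpers, not part of the original program, and plain strings stand in for the xlwt style objects. Note also that `chunk_rows` splits strictly by row count, whereas the original code only starts a new sheet after a whole file has pushed the running total past `sheet_size`.

```python
# Hypothetical helpers for illustration; they do not appear in ExtractTables.py.


def pick_style(row_values, styles):
    """Return the style for a row based on its 'Method' column (index 4), else None."""
    if len(row_values) < 5:
        return None
    return styles.get(row_values[4])


def chunk_rows(rows, sheet_size):
    """Split rows into consecutive chunks of at most sheet_size rows each."""
    if sheet_size <= 0:
        raise ValueError('sheet_size must be positive')
    return [rows[i:i + sheet_size] for i in range(0, len(rows), sheet_size)]


# Strings stand in for the xlwt.XFStyle objects (style, style2, style3).
styles = {'exact': 'style', 'guess': 'style2', 'content-in-table': 'style3'}
rows = [
    ['a.pdf', '001', '2019', 1, 'exact', 'x'],
    ['a.pdf', '001', '2019', 2, 'guess', 'y'],
    ['b.pdf', '002', '2019', 1, 'other', 'z'],
    ['short'],
]

for chunk_index, chunk in enumerate(chunk_rows(rows, 3)):
    for row in chunk:
        print('tables-' + str(chunk_index), row[0], pick_style(row, styles))
```

A dictionary lookup replaces the `if`/`elif` chain, so adding a new match method only needs a new dictionary entry.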
    
  • Entry point

    # Program entry point: main function
    if __name__ == '__main__':
        # This call is needed so that the multiprocessing module is set up correctly when the
        # program is packaged as an exe on Windows. It is unnecessary when the code runs in
        # the Python interpreter, but harmless to keep.
        multiprocessing.freeze_support()
        # Parameters the program needs
        # paras
        base_dir = r'./'  # working directory: the directory this program is in
        out_path = base_dir + r'\result.xls'  # output file name
        demo = base_dir + r'\Demo.xlsx'  # Demo file name
        # Create noResult.txt, which records the names of PDF files that yield no result
        with open('noResult.txt', 'w', encoding='utf-8') as fp:
            fp.write(str(datetime.now()) + '\n')
        # Initialize the Util class
        util = Util(base_dir + '\\test', out_path, demo)
        rules = util.load_demo()
        folder = util.load_folder()
        # Run in multi-process mode; to run in single-thread mode only, call run here instead
        multi_processor_run(batch_processor, run, folder, rules, util)
        # Save the results: 5000 means each sheet holds at most 5000 rows; beyond that a new sheet is created
        # save...
        re_format(5000)
        # Remove the temporary files. While the program runs they are stored in the tmp folder
        # of the current directory, one Excel file per PDF. re_format merges them into a single
        # Excel file. To keep the per-PDF results, comment out the line below.
        shutil.rmtree('tmp')
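
The temporary-file lifecycle described in the comments above (one result file per PDF in `tmp`, merged into a single output, then removed) can be sketched as follows. This is an illustration under assumptions, not the program's own code: `merge_and_cleanup` and its `keep_tmp` flag are hypothetical, and plain text files stand in for the per-PDF Excel files so that the sketch runs with the standard library alone.

```python
import os
import shutil
import tempfile


def merge_and_cleanup(tmp_dir, out_path, keep_tmp=False):
    """Concatenate every file in tmp_dir into out_path, then delete tmp_dir unless keep_tmp."""
    names = sorted(os.listdir(tmp_dir))
    with open(out_path, 'w', encoding='utf-8') as out:
        for name in names:
            with open(os.path.join(tmp_dir, name), encoding='utf-8') as fp:
                out.write(fp.read())
    if not keep_tmp:
        # Same cleanup call as in the entry point above
        shutil.rmtree(tmp_dir)
    return names


# Demo: two stand-in result files, one per (imaginary) PDF
work = tempfile.mkdtemp()
tmp_dir = os.path.join(work, 'tmp')
os.makedirs(tmp_dir)
for name, text in [('a.txt', 'row-a\n'), ('b.txt', 'row-b\n')]:
    with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as fp:
        fp.write(text)

out_path = os.path.join(work, 'result.txt')
merged = merge_and_cleanup(tmp_dir, out_path)
print(merged, os.path.exists(tmp_dir))
```

Passing `keep_tmp=True` plays the role of commenting out the `shutil.rmtree('tmp')` line: the merged output is still written, and the per-file results stay on disk for inspection.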
    

7.Full project source code

  • https://github.com/yooongchun/PDFParser/blob/master/PDFTable/ExtractTables.py
