1. Project Overview

The project uses libraries that are standard for web scraping, including BeautifulSoup4, requests, and re. The step that writes results into Word files relies on the python-docx library, so it is worth getting familiar with it first. The crawler can search the law database either by title or by full text, and the output folder can be specified by the user. The project itself is fairly simple, though cleaning the scraped data takes some work, which makes it a good exercise for anyone new to web scraping.
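
As a quick illustration of the core idea before the full script in the next section, here is a minimal sketch (not part of the project code; the function name save_page_as_docx and its parameters are made up for illustration): requests downloads a page, BeautifulSoup extracts its text, and python-docx writes that text into a .docx file inside a caller-chosen folder.

import os
import requests
from bs4 import BeautifulSoup
from docx import Document

def save_page_as_docx(url, out_dir, filename):
    # Fetch the page and extract its visible text
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.content, 'lxml')
    text = soup.get_text(separator='\n', strip=True)
    # Create the target folder if it does not exist yet
    os.makedirs(out_dir, exist_ok=True)
    # Write the text into a Word document inside that folder
    doc = Document()
    doc.add_paragraph(text)
    doc.save(os.path.join(out_dir, filename + '.docx'))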

2. Code

import time
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Pt, Cm
from lxml import etree
from docx import Document
from docx.oxml.ns import qn
import requests
import random
import urllib.parse
from bs4 import BeautifulSoup
import lxml
import string
from concurrent.futures import ThreadPoolExecutor
import threading
import re
from docx.oxml import OxmlElement
from docx.oxml.ns import qn


def set_cell_border(cell, **kwargs):
    # Set the cell's top/bottom/left/right border attributes (line style / color / width)
    """Set cell's border
    Usage:
    set_cell_border(
        cell,
        top={"sz": 12, "val": "single", "color": "#FF0000", "space": "0"},
        bottom={"sz": 12, "color": "#00FF00", "val": "single"},
        left={"sz": 24, "val": "dashed", "shadow": "true"},
        right={"sz": 12, "val": "dashed"},
    )
    """
    tc = cell._tc
    tcPr = tc.get_or_add_tcPr()
    # check for tag existence, if none found, then create one
    tcBorders = tcPr.first_child_found_in("w:tcBorders")
    if tcBorders is None:
        tcBorders = OxmlElement('w:tcBorders')
        tcPr.append(tcBorders)
    # loop over all available tags
    for edge in ('left', 'top', 'right', 'bottom', 'insideH', 'insideV'):
        edge_data = kwargs.get(edge)
        if edge_data:
            tag = 'w:{}'.format(edge)
            # check for tag existence, if none found, then create one
            element = tcBorders.find(qn(tag))
            if element is None:
                element = OxmlElement(tag)
                tcBorders.append(element)
            # looks like order of attributes is important
            for key in ["sz", "val", "color", "space", "shadow"]:
                if key in edge_data:
                    element.set(qn('w:{}'.format(key)), str(edge_data[key]))


def get_headers():
    # Build request headers with a randomly chosen User-Agent
    user_agents = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    ]
    user_agent = random.choice(user_agents)
    headers = {'User-Agent': user_agent}
    return headers


def get_catalogue_data(url):
    # Fetch the search-result (catalogue) page and return it as a BeautifulSoup object
    try:
        data = requests.get(url=url, headers=get_headers())
        soup = BeautifulSoup(data.content, 'lxml', from_encoding='utf-8')
        return soup
    except:
        time.sleep(5)
        return get_catalogue_data(url)


def get_related_nums(url):
    # Get the number of matching records reported on the result page
    soup = get_catalogue_data(url)
    info = soup.select('body > div.main.pageWidth > div.mainCon.clearself > div > div > div.searTit')
    re_nums = int(re.findall(r'(\d+)', str(info))[0])
    if re_nums == 0:
        return False
    else:
        return re_nums


def get_catalogue_pages(url):
    # Get the number of result pages in the catalogue
    soup = get_catalogue_data(url)
    last_page = re.findall(r'(\d+)', str(soup.select('#pagecount')))
    if last_page:
        return last_page
    else:
        if get_related_nums(url):
            print('为你查到有%s个相关数据' % get_related_nums(url))
            return last_page
        else:
            print('无相关记录')
            return False


def get_title_urls(url_index_0, pages, q):
    # Collect the detail-page URL of every title in the search results
    all_url = []
    if pages:
        for page in range(1, int(pages[0]) + 1):
            url_index = url_index_0.format(page, q)
            errors_url = []
            try:
                data = requests.get(url=url_index, headers=get_headers())
                # soup = BeautifulSoup(data.content, 'lxml', from_encoding='utf-8')
                pattern = '<a href="(.*)" target="_blank">'
                urls = re.findall(pattern, data.text)
                for url in urls:
                    url = 'http://search.chinalaw.gov.cn/' + url
                    all_url.append(url)
            except:
                time.sleep(5)
                return get_title_urls(url_index_0, pages, q)
    else:
        url_index = url_index_0.format('', q)
        try:
            data = requests.get(url=url_index, headers=get_headers())
            pattern = '<a href="(.*)" target="_blank">'
            urls = re.findall(pattern, data.text)
            # print(data.text)
            for url in urls:
                url = 'http://search.chinalaw.gov.cn/' + url
                all_url.append(url)
        except:
            time.sleep(5)
            return get_title_urls(url_index_0, pages, q)
    return all_url


def get_title_pages(title_url):
    # Get the number of pages that one detail entry spans
    try:
        data = requests.get(title_url, headers=get_headers())
        soup = BeautifulSoup(data.content, 'lxml', from_encoding='utf-8')
        pages = int(re.findall(r'(\d+)', str(soup.select('#pagecount')))[0])
        return pages
    except:
        return 1


def get_title_infos(title_url, pages):
    # Extract the title, the attribute table and all body paragraphs of a detail entry
    attr = {}
    title = {}
    paragraphs = []
    for page in range(1, pages + 1):
        time.sleep(1)
        try:
            title_url0 = title_url + '&PageIndex=' + str(page)
            data = requests.get(title_url0, headers=get_headers())
            html = etree.HTML(data.content.decode('utf-8'))
            paragraphs_1 = html.xpath('string(/html/body/div[2]/div[2]/div/div/div[2]/div[2])')
            paragraphs0 = []
            for pa in paragraphs_1:
                if pa == '':
                    pass
                else:
                    paragraphs0.append(pa)
            paragraphs_2 = ''.join(paragraphs0)
            pa2 = paragraphs_2.split('\r\n')
            pa3 = [x.strip() for x in pa2 if x.strip() != '']
            if page == 1:
                # The first page also carries the title and the attribute table
                title0 = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/div[1]')
                title.update({'title': title0[0].text})
                attr_gov = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/table[1]/tr[1]/td[1]')
                attr_gov_a = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/table[1]/tr[1]/td[2]')
                attr.update({attr_gov[0].text: attr_gov_a[0].text})
                attr_pdata = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/table[1]/tr[2]/td[1]')
                attr_pdata_a = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/table[1]/tr[2]/td[2]')
                attr.update({attr_pdata[0].text: attr_pdata_a[0].text})
                attr_ddata = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/table[1]/tr[2]/td[3]')
                attr_ddata_a = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/table[1]/tr[2]/td[4]')
                attr.update({attr_ddata[0].text: attr_ddata_a[0].text})
                attr_tok = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/table[1]/tr[3]/td[1]')
                attr_tok_a = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/table[1]/tr[3]/td[2]')
                attr.update({attr_tok[0].text: attr_tok_a[0].text})
                attr_cl = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/table[1]/tr[3]/td[3]')
                attr_cl_a = html.xpath('/html/body/div[2]/div[2]/div/div/div[2]/table[1]/tr[3]/td[4]')
                attr.update({attr_cl[0].text: attr_cl_a[0].text})
                paragraphs += pa3
            else:
                paragraphs += pa3
        except:
            time.sleep(5)
            return get_title_infos(title_url, pages)
    # print(paragraphs)
    return title, attr, paragraphs


def wt(title, attr, paragraphs, q, mkd):
    # Write one entry into a Word document
    mkdir(mkd + '\\关于__{}__的相关信息'.format(q))
    document = Document()
    document.styles['Normal'].font.name = u'微软雅黑'  # base font of the document ('Normal' is the default style)
    document.styles['Normal'].element.rPr.rFonts.set(qn('w:eastAsia'), u'微软雅黑')
    # Write the title
    title_p = document.add_paragraph()
    run1 = title_p.add_run(title['title'])
    run1.font.name = '微软雅黑'  # western font
    run1.element.rPr.rFonts.set(qn('w:eastAsia'), u'微软雅黑')  # East Asian font
    title_p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    run1.font.size = Pt(13.5)
    # Write the attribute table (issuing authority, dates, legal effect, category)
    table = document.add_table(rows=4, cols=4, style='Table Grid')
    table_run1 = table.cell(0, 0).paragraphs[0].add_run('公布机关:')
    table_run1.font.size = Pt(9)
    table.cell(0, 0).width = Cm(2.12)
    table_run2 = table.cell(0, 1).paragraphs[0].add_run(attr['公布机关:'])
    table_run2.font.size = Pt(9)
    table.cell(0, 1).width = Cm(10.58)
    table_run3 = table.cell(1, 0).paragraphs[0].add_run('公布日期:')
    table_run3.font.size = Pt(9)
    table.cell(1, 0).width = Cm(2.12)
    table_run4 = table.cell(1, 1).paragraphs[0].add_run(attr['公布日期:'])
    table_run4.font.size = Pt(9)
    table.cell(1, 1).width = Cm(10.58)
    table_run5 = table.cell(1, 2).paragraphs[0].add_run('施行日期:')
    table_run5.font.size = Pt(9)
    table.cell(1, 2).width = Cm(2.12)
    table_run6 = table.cell(1, 3).paragraphs[0].add_run(attr['施行日期:'])
    table_run6.font.size = Pt(9)
    table_run7 = table.cell(3, 0).paragraphs[0].add_run('效力:')
    table_run7.font.size = Pt(9)
    table.cell(3, 0).width = Cm(2.12)
    table_run8 = table.cell(3, 1).paragraphs[0].add_run(attr['效力:'])
    table_run8.font.size = Pt(9)
    table.cell(3, 1).width = Cm(10.58)
    table_run9 = table.cell(3, 2).paragraphs[0].add_run('门类:')
    table_run9.font.size = Pt(9)
    table.cell(3, 2).width = Cm(2.12)
    table_run10 = table.cell(3, 3).paragraphs[0].add_run(attr['门类:'])
    table_run10.font.size = Pt(9)
    # Hide the grid by painting every cell border white
    for r in range(0, 4):
        for c in range(0, 4):
            set_cell_border(
                table.cell(r, c),
                top={"color": "#FFFFFF"},
                bottom={"color": "#FFFFFF"},
                left={"color": "#FFFFFF"},
                right={"color": "#FFFFFF"},
            )
    # Write the body paragraphs
    p1 = document.add_paragraph()
    run_1 = p1.add_run(paragraphs[0])
    run_1.font.size = Pt(10.5)
    p1.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p2 = document.add_paragraph()
    run_2 = p2.add_run(paragraphs[1])
    run_2.font.size = Pt(12.5)
    p2.alignment = WD_ALIGN_PARAGRAPH.CENTER
    for p in paragraphs[2:]:
        p1 = document.add_paragraph()
        run_1 = p1.add_run(p)
        run_1.font.size = Pt(10.5)
        p1.paragraph_format.first_line_indent = Cm(0.74)
    # Save into the folder created above, under the user-specified directory
    document.save(mkd + '\\关于__{}__的相关信息\\{}.docx'.format(q, title['title']))


def mkdir(path):
    # Create the output directory if it does not exist yet
    import os
    # path: directory path to create (including the new folder's name)
    path = path.strip()        # strip leading/trailing whitespace
    path = path.rstrip('\\')   # strip a trailing backslash
    isExists = os.path.exists(path)
    # print("当前程序所在位置为:" + os.getcwd())
    if not isExists:
        os.makedirs(path)
        print(path + '创建成功')
        return True
    else:
        return False


def main():
    mkd = str(input('请输入文件夹创建目录[如C:\\Users\\14983\\Documents]:'))
    choose = int(input('按标题检索[1]   按正文检索[2]:'))
    url_0 = ''
    url_index_0 = ''
    if choose == 1:
        # Search by title
        url_0 = 'http://search.chinalaw.gov.cn/SearchLawTitle?effectLevel=&SiteID=124&PageIndex=&Sort=PublishTime&Query='
        url_index_0 = 'http://search.chinalaw.gov.cn/SearchLawTitle?effectLevel=&SiteID=124&PageIndex={}&Sort=PublishTime&Query={}'
    elif choose == 2:
        # Search by full text
        url_0 = 'http://search.chinalaw.gov.cn/SearchLaw?effectLevel=&SiteID=124&PageIndex=&Sort=PublishTime&Query='
        url_index_0 = 'http://search.chinalaw.gov.cn/SearchLaw?effectLevel=&SiteID=124&PageIndex={}&Sort=PublishTime&Query={}'
    q = str(input('请输入查询关键字:'))
    url_1 = url_0 + q
    pages = get_catalogue_pages(url_1)
    # print(pages)
    nums = get_related_nums(url_1)
    if nums:
        print('全库有{}个相关数据'.format(str(nums)))
        title_urls = get_title_urls(url_index_0, pages, q)
        for title_url in title_urls:
            try:
                title_pages = get_title_pages(title_url)
                title, attr, paragraphs = get_title_infos(title_url, title_pages)
                wt(title, attr, paragraphs, q, mkd)
                print('{}-----------完成!'.format(title['title']))
                time.sleep(1)
            except:
                print('{}-----------失败!'.format(title['title']))
    else:
        print('全库没有相关数据')
        return main()


print('该程序版权归崔sir(qq:1498381083 晚竹田)所有,未经许可不得用于商业用途。')
main()
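
One design note: every network call above retries by catching any exception, sleeping five seconds, and calling the same function again, which can recurse without bound if a page keeps failing. A bounded retry loop is a common alternative; the sketch below is only illustrative and not part of the original script (the helper name fetch_with_retry and its parameters are made up).

import time
import requests

def fetch_with_retry(url, headers=None, max_retries=3, delay=5):
    # Try the request up to max_retries times, sleeping between attempts,
    # and give up with None instead of recursing forever.
    for _ in range(max_retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            time.sleep(delay)
    return None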

3. Screenshots of the Program Running




