Scraping Administrative Division Codes with Python and Selenium and Storing Them in MySQL

import time

import pymysql
from selenium import webdriver
from selenium.webdriver.common.by import By


class GovementSpider(object):
    def __init__(self):
        self.browser = webdriver.Chrome()
        self.one_url = 'http://www.mca.gov.cn/article/sj/xzqh/2019/'
        # Database connection and cursor
        self.db = pymysql.connect(
            host='localhost', user='root', password='123456',
            database='govdb', charset='utf8'
        )
        self.cursor = self.db.cursor()
        # Three lists that are later passed to executemany()
        self.province_list = []
        self.city_list = []
        self.county_list = []

    # Load the index page and extract the second-level page link (a fake link)
    def get_false_url(self):
        self.browser.get(self.one_url)
        # Find the link nodes whose title contains "代码" (code)
        td_list = self.browser.find_elements(
            By.XPATH, '//td[@class="arlisttd"]/a[contains(@title,"代码")]'
        )
        if td_list:
            # Keep the element itself, because it has to be click()ed
            two_url_element = td_list[0]
            # Incremental crawl: compare the link with the version table
            two_url = two_url_element.get_attribute('href')
            sel = 'select * from version where link=%s'
            self.cursor.execute(sel, [two_url])
            result = self.cursor.fetchall()
            if result:
                print('Data is already up to date, nothing to crawl')
            else:
                # Click the link; the page opens in a new window
                two_url_element.click()
                time.sleep(5)
                # Switch the driver to the newly opened window
                all_handles = self.browser.window_handles
                self.browser.switch_to.window(all_handles[1])
                # Scrape the data
                self.get_data()
                # When finished, record two_url in the version table
                ins = 'insert into version values(%s)'
                self.cursor.execute(ins, [two_url])
                self.db.commit()

    # Extract the administrative division codes from the second-level page
    def get_data(self):
        # Base XPath: one <tr> per division
        tr_list = self.browser.find_elements(By.XPATH, '//tr[@height="19"]')
        for tr in tr_list:
            code = tr.find_element(By.XPATH, './td[2]').text.strip()
            name = tr.find_element(By.XPATH, './td[3]').text.strip()
            print(name, code)
            # Decide the level and append to the matching table's list.
            # Table columns:
            # province: p_name p_code
            # city    : c_name c_code c_father_code
            # county  : x_name x_code x_father_code
            if code[-4:] == '0000':
                self.province_list.append([name, code])
                # The four municipalities are also stored in the city table
                if name in ['北京市', '天津市', '上海市', '重庆市']:
                    city = [name, code, code]
                    self.city_list.append(city)
            elif code[-2:] == '00':
                city = [name, code, code[:2] + '0000']
                self.city_list.append(city)
            else:
                # Districts of the four municipalities: parent is xx0000
                if code[:2] in ['11', '12', '31', '50']:
                    county = [name, code, code[:2] + '0000']
                # Ordinary counties and districts: parent is xxxx00
                else:
                    county = [name, code, code[:4] + '00']
                self.county_list.append(county)
        # Same indentation as the for loop: insert everything in one go
        # after all rows have been scraped
        self.insert_mysql()

    def insert_mysql(self):
        # When updating, the old table records must be deleted first
        del_province = 'delete from province'
        del_city = 'delete from city'
        del_county = 'delete from county'
        self.cursor.execute(del_province)
        self.cursor.execute(del_city)
        self.cursor.execute(del_county)
        # Insert the new data
        ins_province = 'insert into province values(%s,%s)'
        ins_city = 'insert into city values(%s,%s,%s)'
        ins_county = 'insert into county values(%s,%s,%s)'
        self.cursor.executemany(ins_province, self.province_list)
        self.cursor.executemany(ins_city, self.city_list)
        self.cursor.executemany(ins_county, self.county_list)
        self.db.commit()
        print('Scraping finished, data stored in the database')

    def main(self):
        self.get_false_url()
        # Close the database connection once all data has been processed
        self.cursor.close()
        self.db.close()
        # Close the browser
        self.browser.quit()


if __name__ == '__main__':
    spider = GovementSpider()
    spider.main()
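The level rule inside get_data() can be checked without a browser or a database by isolating it in a pure function. The following is a minimal sketch under that assumption; the function name classify and the two constants are hypothetical and do not appear in the original code. It returns the (table, row) pairs that get_data() would append for a single name/code pair.

```python
# Hypothetical helper names; the branching mirrors the rule used in get_data().
MUNICIPALITY_NAMES = {'北京市', '天津市', '上海市', '重庆市'}
MUNICIPALITY_PREFIXES = {'11', '12', '31', '50'}


def classify(name, code):
    """Return a list of (table_name, row) pairs for one division."""
    rows = []
    if code[-4:] == '0000':
        # Province level; municipalities are also stored as their own city
        rows.append(('province', [name, code]))
        if name in MUNICIPALITY_NAMES:
            rows.append(('city', [name, code, code]))
    elif code[-2:] == '00':
        # City level: parent is the province code xx0000
        rows.append(('city', [name, code, code[:2] + '0000']))
    elif code[:2] in MUNICIPALITY_PREFIXES:
        # District of a municipality: parent is xx0000
        rows.append(('county', [name, code, code[:2] + '0000']))
    else:
        # Ordinary county or district: parent is the city code xxxx00
        rows.append(('county', [name, code, code[:4] + '00']))
    return rows


if __name__ == '__main__':
    for sample in [('北京市', '110000'), ('石家庄市', '130100'),
                   ('东城区', '110101'), ('长安区', '130102')]:
        print(sample, classify(*sample))
```

Keeping the rule separate from the Selenium calls makes it possible to test the parent-code logic on a handful of known codes before running a full crawl.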
