Scraping schools and their majors from CHSI (学信网) with a Python crawler

I have been learning Python for a while, so I wrote a beginner-level crawler to try it out. Without further ado, here is the source code.

import requests
import bs4
import pymysql
import json
from xpinyin import Pinyin

host = "localhost"
user = "root"
password = "123456"
database = "test"
# Open the database connection (PyMySQL 1.0 and later require keyword arguments)
db = pymysql.connect(host=host, user=user, password=password, database=database)
# Create a cursor object with cursor()
cursor = db.cursor()
# Drop the school and major tables if they already exist
cursor.execute("drop table if exists school")
cursor.execute("drop table if exists major")
# Fetch the first results page
res = requests.get('https://gaokao.chsi.com.cn/sch/search.do?start=0')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
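# Read the total number of pages from the pager
page_num = soup.select('.ch-page li')[7].string
th = soup.select('th')
th_herd = []
# Create the converter from Chinese characters to pinyin
pin = Pinyin()
field = ""
for item in th:
    # Each table header becomes a varchar column named after its pinyin
    field += ("," + pin.get_pinyin(item.get_text(), '_') + " varchar(100)")
    th_herd.append(pin.get_pinyin(item.get_text(), '_'))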
# CREATE TABLE statements for the school and major tables
sql = """CREATE TABLE school (id int NOT NULL AUTO_INCREMENT """+field+""",PRIMARY KEY ( id ))"""
sql1 = """CREATE TABLE major (s_id int , name varchar(100))"""
# Execute the statements
cursor.execute(sql)
cursor.execute(sql1)
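# Join the column names into a comma-separated string
delimiter = ','
field = delimiter.join(th_herd)
# j is the index of the current results page; m is the running school id
j = 0
m = 1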
while j < int(page_num):
    # Print the offset of the page being fetched (20 schools per page)
    print(j * 20)
    res = requests.get('https://gaokao.chsi.com.cn/sch/search.do?start=%s' % (j * 20))
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    tr = soup.select("table tr")
    a = []   # school rows collected from this page
    a1 = []  # (school id, major name) rows collected from this page
    # Skip the header row at index 0
    for row in tr[1:]:
        td = row.select('td')
        # Join the <span> labels in the sixth column into one string
        b = ''
        for x in td[5].select('span'):
            b = b + "    " + str(x.get_text())
        c = (1 if td[6].string is None else 0)
        a.append((td[0].get_text().strip(), td[1].get_text().strip(), td[2].get_text().strip(),
                  td[3].get_text().strip(), td[4].get_text().strip(), b, c, td[7].get_text().strip()))
        # Fetch the school's majors
        ## Look the school up by name
        res1 = requests.get('https://gaokao.chsi.com.cn/zyk/pub/myd/specAppraisalTop.action',
                            params={'yxmc': td[0].get_text().strip()})
        res1.raise_for_status()
        soup1 = bs4.BeautifulSoup(res1.text, 'html.parser')
        url = soup1.select('.check_detail')
        ## Follow at most the first two "view all majors" links
        for x in url[:2]:
            res1 = requests.get('https://gaokao.chsi.com.cn/%s' % (x['href']))
            res1.raise_for_status()
            soup1 = bs4.BeautifulSoup(res1.text, 'html.parser')
            # Skip the first .first_td cell, as the original loop did
            for cell in soup1.select(".first_td")[1:]:
                a1.append((m, cell.get_text().strip()))
        m = m + 1
    # Parameterized INSERT statements, so quotes in scraped text cannot break the SQL;
    # the school statement takes eight values, matching the tuple built above
    sql = "INSERT INTO school(" + field + ") VALUES (" + ",".join(["%s"] * 8) + ")"
    sql1 = "INSERT INTO major(s_id,name) VALUES (%s, %s)"
    j = j + 1
    try:
        # Execute the inserts for this page
        cursor.executemany(sql, a)
        cursor.executemany(sql1, a1)
        # Commit the transaction
        db.commit()
    except Exception:
        # Roll back on error, then stop
        db.rollback()
        raise
db.close()
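
The script builds the school table's column names directly from scraped header text and splices them into the CREATE TABLE statement. Identifiers cannot be passed as query parameters, so one option is to normalize them first. The following is only a minimal sketch under that assumption: to_column_name and unique_column_names are hypothetical helpers, not part of the original script, and the sample headers are made up for illustration.

```python
import re


def to_column_name(raw, max_len=64):
    """Turn arbitrary header text (e.g. pinyin output) into a conservative column name."""
    # Replace every run of characters outside [0-9A-Za-z_] with a single underscore.
    name = re.sub(r"[^0-9A-Za-z_]+", "_", raw).strip("_").lower()
    # Avoid an empty name or one that starts with a digit.
    if not name or name[0].isdigit():
        name = "col_" + name
    # max_len defaults to 64, MySQL's identifier length limit.
    return name[:max_len]


def unique_column_names(raw_headers):
    """Sanitize each header and add a numeric suffix to repeated names."""
    seen = {}
    result = []
    for raw in raw_headers:
        base = to_column_name(raw)
        count = seen.get(base, 0)
        seen[base] = count + 1
        result.append(base if count == 0 else "%s_%d" % (base, count + 1))
    return result


# Made-up sample headers, not taken from the site.
headers = ["yuan_xiao_ming_cheng", "985", "yuan xiao (te-xing)", "985"]
print(unique_column_names(headers))
```

In the script, the pinyin string returned by pin.get_pinyin(...) could be passed through such a helper before it is appended to field and th_herd.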

If you see things differently, feel free to discuss in the comments. To get the database file from a completed run, click here.
