Python Web Scraping (5): Hands-on [2. Scraping the Chuangke Lab Site (requests + bs4)]

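Goal: scrape the lecture information from the Chuangke Lab website.

Output table: lecture title, speaker, affiliation, lecture time, lecture content, speaker bio

Tools: requests + bs4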

Check the robots protocol first:

http://127.0.0.1/lab/robots.txt

(Chuangke Lab is a site I wrote myself, so it has no anti-scraping measures.)
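
The robots.txt check can also be done in code. Below is a minimal sketch using the standard library's urllib.robotparser. The SAMPLE_ROBOTS rules and the is_allowed helper are made up for illustration, since the real file's contents are not shown in this post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, for illustration only.
# The real file lives at http://127.0.0.1/lab/robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /lab/admin/
"""


def is_allowed(robots_text, url, user_agent="*"):
    """Return True if the rules in robots_text let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS, "http://127.0.0.1/lab/lecture"))      # True
print(is_allowed(SAMPLE_ROBOTS, "http://127.0.0.1/lab/admin/users"))  # False
```

Against the live site, the parser can load the file itself: call parser.set_url('http://127.0.0.1/lab/robots.txt') and then parser.read() instead of parser.parse().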

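Inspecting the page at http://127.0.0.1/lab/lecture shows that each lecture title sits inside an a tag.

Clicking a lecture title opens that lecture's detail page, and the link to it is also carried by the a tag (its href attribute).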
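
To make the observation concrete, here is a small sketch that pulls (title, link) pairs out of a tags. SAMPLE_HTML is invented markup standing in for the lecture list page, and extract_links is a hypothetical helper name. The real page structure may differ.

```python
import bs4

# Hypothetical markup standing in for the lecture list page.
SAMPLE_HTML = """
<ul>
  <li><a href="/lab/lectureContent/17">Intro to Web Scraping</a></li>
  <li><a href="/lab/lectureContent/18">Data Cleaning Basics</a></li>
  <li><a>Anchor without a link</a></li>
</ul>
"""


def extract_links(html):
    """Return a (title, href) pair for every <a> tag that has an href."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags that carry no href attribute
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


for title, href in extract_links(SAMPLE_HTML):
    print(title, href)
```

Filtering on href=True is a design choice: a plain find_all('a') also returns anchors without a link, and reading their href would raise a KeyError.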

Code:

import requests
import bs4

# Fetch the page
url = 'http://127.0.0.1/lab/lecture'
r = requests.get(url)
r.encoding = r.apparent_encoding
html = r.text

# Parse the page and collect the lecture titles
soup = bs4.BeautifulSoup(html, 'html.parser')
titleList = soup.find_all('a')
lecture = []
for i in titleList:
    lecture.append(i.string)
lecture

Result: (output screenshot not preserved)

# Take one lecture's detail link and fetch that page
content_url = 'http://127.0.0.1/lab/lectureContent/17'
req = requests.get(content_url)
req.encoding = req.apparent_encoding
content = req.text
soup_new = bs4.BeautifulSoup(content, 'html.parser')
soup_new.section.contents

Result: (output screenshot not preserved)
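
One thing worth knowing about .contents: it returns every child node, including the whitespace between tags, which is why the full code further down indexes the odd positions (1, 3, 5, ...). The sketch below shows this on invented markup. SAMPLE_DETAIL and the names in it are hypothetical, not the real page.

```python
import bs4

# Hypothetical detail-page fragment; the real section has more fields.
SAMPLE_DETAIL = """<section>
<p>Speaker: Zhang San</p>
<p>Affiliation: Example University</p>
</section>"""

soup = bs4.BeautifulSoup(SAMPLE_DETAIL, "html.parser")
children = soup.section.contents

# The newline between tags is a text node, so tags land on odd indices.
print(len(children))       # 5
print(repr(children[0]))   # '\n'
print(children[1].string)  # Speaker: Zhang San

# Keeping only Tag objects removes the need for odd-index arithmetic.
tags_only = [c for c in children if isinstance(c, bs4.element.Tag)]
print([t.string for t in tags_only])
```

Filtering to tags first makes the indices independent of how the HTML happens to be indented.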

Full code:

import requests
import bs4

# Fetch the page
url = 'http://127.0.0.1/lab/lecture'
r = requests.get(url)
r.encoding = r.apparent_encoding
html = r.text

# Parse the page and find the lecture links
soup = bs4.BeautifulSoup(html, 'html.parser')
aList = soup.find_all('a')
lecture = []
for i in aList:
    # The lecture title is i.string (the first version did lecture.append(i.string))
    # Build the detail-page link from the href and fetch that page
    content_url = 'http://127.0.0.1' + i.attrs['href']
    req = requests.get(content_url)
    req.encoding = req.apparent_encoding
    content = req.text
    # Parse the page and extract the details (speaker, affiliation, ...)
    soup_new = bs4.BeautifulSoup(content, 'html.parser')
    # Walk the child nodes of the section tag
    j = soup_new.section.contents
    lecture.append([i.string, j[1].string, j[3].string, j[5].string, j[7].string, j[9].string, j[13].string])

# Write the result out as a table
import pandas as pd

# Convert the list to a DataFrame first, then call .to_csv
table = pd.DataFrame(data=lecture, columns=["Lecture", "Speaker", "Affiliation", "Time", "Location", "Summary", "Speaker bio"])
table.to_csv('D:/1.csv', index=False)

Result: (output screenshot not preserved)
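
The full code builds each detail URL by concatenating the host with the href, which works as long as every href is root-relative. As a hedged alternative (not what the code above does), urllib.parse.urljoin from the standard library also handles page-relative and absolute hrefs. BASE_URL and absolute_link are names chosen for this sketch.

```python
from urllib.parse import urljoin

# The lecture list page; relative links are resolved against it.
BASE_URL = "http://127.0.0.1/lab/lecture"


def absolute_link(href):
    """Resolve an href found on the list page to an absolute URL."""
    return urljoin(BASE_URL, href)


print(absolute_link("/lab/lectureContent/17"))                  # root-relative
print(absolute_link("lectureContent/17"))                       # page-relative
print(absolute_link("http://127.0.0.1/lab/lectureContent/17"))  # already absolute
```

All three calls print http://127.0.0.1/lab/lectureContent/17.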

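Improvements:

The scraper works, but a few problems remain:

1. The content summary and the speaker bio were not captured

My guess is that these spans contain a <br/>. When a tag has more than one child, .string returns None.

Fix: cut the text out with split

str(str(j[9]).split('>')[2]).split('<')[0]

This converts the whole <span>...</span> to a string, then splits twice to isolate the content.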
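
The sketch below reproduces the problem and the split fix on an invented span, and shows stripped_strings as an alternative that does not depend on the exact tag layout. SAMPLE_SPAN is hypothetical and assumes the label, a <br/>, and the text sit directly inside the span. The real markup may differ.

```python
import bs4

# Hypothetical span; the real page's markup may differ.
SAMPLE_SPAN = "<span>Summary:<br/>This talk covers requests and bs4.</span>"

span = bs4.BeautifulSoup(SAMPLE_SPAN, "html.parser").span

# The span has three children (text, <br/>, text), so .string is None.
print(span.string)

# The split-based fix from this post: cut between the 2nd '>' and the next '<'.
by_split = str(str(span).split('>')[2]).split('<')[0]
print(by_split)

# Alternative: iterate over the text nodes directly.
parts = list(span.stripped_strings)
print(parts[-1])
```

Both approaches give the same text here. The split version breaks if the span gains another nested tag, while stripped_strings simply yields one more item.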

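2. The speaker, affiliation, and time fields carry a redundant label in front of the value

Fix: split on the full-width colon and keep the part after it

j[1].string.split('：')[1]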
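
Two edge cases are worth keeping in mind with split('：')[1]: it raises IndexError when the text has no colon, and it drops everything after a second colon. A slightly more defensive sketch uses str.partition, which splits only at the first separator. strip_label and the sample strings are made up for illustration.

```python
def strip_label(text, sep="："):
    """Return the part of text after the first sep, or text itself if sep is absent."""
    _label, found, value = text.partition(sep)
    return value.strip() if found else text.strip()


print(strip_label("报告人：张三"))                # 张三
print(strip_label("报告时间：2020-01-01 14：00"))  # 2020-01-01 14：00
print(strip_label("no label here"))              # no label here
```

For comparison, "报告时间：2020-01-01 14：00".split('：')[1] would return only "2020-01-01 14".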

3. Code cleanup

Split the code into functions so that later projects can reuse it (optional).
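
Since this step is optional, here is only a rough sketch of what the split into functions could look like. All function names (fetch_html, parse_lecture_links, parse_detail, crawl) are my own choices for illustration. parse_detail returns the raw text of each tag under section rather than using the index-based extraction from this post, so treat it as a starting point rather than a drop-in replacement.

```python
import bs4
import requests

BASE = "http://127.0.0.1"


def fetch_html(url):
    """Download a page and decode it with the detected encoding."""
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r.text


def parse_lecture_links(html):
    """Return (title, href) pairs for every <a> that has an href."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


def parse_detail(html):
    """Return the text of each tag directly under <section>."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [tag.get_text(" ", strip=True)
            for tag in soup.section.find_all(recursive=False)]


def crawl(list_url, fetch=fetch_html):
    """Build one row per lecture: the title followed by its detail fields."""
    rows = []
    for title, href in parse_lecture_links(fetch(list_url)):
        rows.append([title] + parse_detail(fetch(BASE + href)))
    return rows
```

Calling crawl('http://127.0.0.1/lab/lecture') would run it against the site. Passing the fetch function as a parameter lets the parsing be tried on saved HTML without any network access.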

Final code:

import requests
import bs4

# Fetch the page
url = 'http://127.0.0.1/lab/lecture'
r = requests.get(url)
r.encoding = r.apparent_encoding
html = r.text

# Parse the page and find the lecture links
soup = bs4.BeautifulSoup(html, 'html.parser')
aList = soup.find_all('a')
lecture = []
for i in aList:
    # The lecture title is i.string
    # Build the detail-page link from the href and fetch that page
    content_url = 'http://127.0.0.1' + i.attrs['href']
    req = requests.get(content_url)
    req.encoding = req.apparent_encoding
    content = req.text
    # Parse the page and extract the details (speaker, affiliation, ...)
    soup_new = bs4.BeautifulSoup(content, 'html.parser')
    # Walk the child nodes of the section tag
    j = soup_new.section.contents
    # The summary and the speaker bio are awkward to extract:
    # convert the tag to a string, split on '>', then split on '<'
    m = str(str(j[9]).split('>')[2]).split('<')[0]
    n = str(str(j[13]).split('>')[2]).split('<')[0]
    # Drop the label before the full-width colon in the other fields
    lecture.append([i.string, j[1].string.split('：')[1], j[3].string.split('：')[1], j[5].string.split('：')[1], j[7].string.split('：')[1], m, n])

# Write the result out as a table
import pandas as pd

table = pd.DataFrame(data=lecture, columns=["Lecture", "Speaker", "Affiliation", "Time", "Location", "Summary", "Speaker bio"])
table.to_csv('D:/1.csv', index=False)

Result: (output screenshot not preserved)
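
If pandas is not available, the table can also be written with the standard library's csv module. This is a swapped-in alternative to the to_csv call above, not the method used in this post. write_rows, the sample row, and the temporary output path are all made up for illustration. The utf-8-sig encoding adds a byte order mark, which helps Excel on Windows display Chinese text correctly.

```python
import csv
import os
import tempfile

COLUMNS = ["Lecture", "Speaker", "Affiliation", "Time",
           "Location", "Summary", "Speaker bio"]


def write_rows(path, rows, columns=COLUMNS):
    """Write a header line and the scraped rows to a CSV file."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)


# Hypothetical row, written to a temporary directory instead of D:/1.csv
demo_path = os.path.join(tempfile.mkdtemp(), "lectures.csv")
write_rows(demo_path, [["示例讲座", "张三"]], columns=["Lecture", "Speaker"])
print(demo_path)
```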
