Python Is a Delight: Scraping Gushiwen (古诗文网)

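I have been learning Python recently, and after getting through the basic syntax I wanted to try building something, so I decided to scrape author information from Gushiwen (古诗文网).
Preparation:
Scraping target: information about the Tang dynasty authors listed on Gushiwen
Target analysis: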

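The first-level page is a list of all Tang dynasty authors. Clicking a name jumps to that author's detail page, so on this page we need to do two things: extract the detail-page URL of every author entry, and follow the next-page link automatically.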
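The list-page logic described here (collect detail URLs, then follow the next-page link until there is none) can be sketched without touching the network. This is a minimal Python 3 sketch, not the article's code: `FAKE_PAGES`, `parse_list_page`, `collect_author_urls` and the `authorv_*.aspx` paths are hypothetical stand-ins for the real parsed pages. Only `urljoin` from the standard library is real API.

```python
from urllib.parse import urljoin

START_URL = 'https://so.gushiwen.org/authors/Default.aspx?p=1'

# Hypothetical stand-in for parsed list pages:
# page URL -> (relative author hrefs, relative next-page href or None).
FAKE_PAGES = {
    'https://so.gushiwen.org/authors/Default.aspx?p=1':
        (['/authorv_a.aspx', '/authorv_b.aspx'], '/authors/Default.aspx?p=2'),
    'https://so.gushiwen.org/authors/Default.aspx?p=2':
        (['/authorv_c.aspx'], None),
}


def parse_list_page(url):
    """Pretend to download and parse one list page (assumption: canned data)."""
    return FAKE_PAGES[url]


def collect_author_urls(start_url, parse=parse_list_page):
    """Walk the list pages and return absolute author detail URLs in order."""
    url = start_url
    seen = set()          # guards against a next-page link that loops back
    author_urls = []
    while url and url not in seen:
        seen.add(url)
        hrefs, next_href = parse(url)
        # Relative hrefs are resolved against the page they were found on.
        author_urls.extend(urljoin(url, href) for href in hrefs)
        url = urljoin(url, next_href) if next_href else None
    return author_urls


if __name__ == '__main__':
    for author_url in collect_author_urls(START_URL):
        print(author_url)
```

Stopping when the next-page link is missing, and refusing to revisit a page already seen, keeps the loop from running forever on the last page.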

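The author detail page mainly contains the author's name, a short description of the author, and stories from the author's life. On this page the job is to parse these fields and save them to a local file.
Full code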

#!/usr/bin/env python2.7
import codecs
import json

import requests
from bs4 import BeautifulSoup

DOWNLOAD_URL = 'https://so.gushiwen.org/authors/Default.aspx?p=1&c=%E5%94%90%E4%BB%A3'


def download_page(url):
    # Send a browser-like User-Agent so the request looks like a normal visit.
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
    data = requests.get(url, headers=headers).content
    return data


def parseHtml(html):
    # List page: return the author detail URLs and the next-page URL
    # (None when there is no next-page link, which ends the crawl).
    soup = BeautifulSoup(html, features="html.parser")
    nextLink = soup.find('a', attrs={'class': 'amore'})
    href = nextLink.get('href') if nextLink else None
    nextPage = 'https://so.gushiwen.org' + href if href else None
    leftDiv = soup.find('div', attrs={'class': 'main3'}).find('div', attrs={'class': 'left'})
    data = []
    for item in leftDiv.find_all('div', attrs={'class': 'sonspic'}):
        data.append('https://so.gushiwen.org' + item.find('a')['href'])
    return data, nextPage


def buildJson(keys, values):
    # Pair field names with values and serialize one author record.
    dictionary = dict(zip(keys, values))
    return json.dumps(dictionary)


def parseAuthorHtml(html):
    # Detail page: return the name, the description, and the URL of the
    # AJAX endpoint that serves the author's stories (None if absent).
    soup = BeautifulSoup(html, features="html.parser")
    left = soup.find('div', attrs={'class': 'main3'}).find('div', attrs={'class': 'left'})
    cont = left.find('div', attrs={'class': 'sonspic'}).find('div', attrs={'class': 'cont'})
    author = cont.find('b').getText()
    desc = cont.find('p').getText()
    sons = left.find('div', attrs={'class': 'sons'})
    yishiUrl = None
    if sons and sons.get('id'):
        yishiUrl = 'https://so.gushiwen.org/authors/ajaxziliao.aspx?id=' + sons['id'].replace('fanyi', '')
    return author, desc, yishiUrl


def parseAuthorMore(html):
    # Stories fragment: join the text of every paragraph in the content div.
    soup = BeautifulSoup(html, features='html.parser')
    yishi = []
    cont = soup.find('div', attrs={'class': 'contyishang'})
    if cont:
        for p in cont.find_all('p'):
            yishi.append(p.getText())
    return ''.join(yishi)


def main():
    url = DOWNLOAD_URL
    keys = ['author', 'desc', 'story']
    with codecs.open('author.json', 'w', encoding='utf-8') as fp:
        fp.write('[')
        first = True
        while url:
            html = download_page(url)
            data, url = parseHtml(html)
            for item in data:
                authorHtml = download_page(item)
                author, desc, yishiUrl = parseAuthorHtml(authorHtml)
                if yishiUrl:
                    yishiHtml = download_page(yishiUrl)
                    yishi = parseAuthorMore(yishiHtml)
                else:
                    yishi = ""
                # Write the separator before every record except the first,
                # so the file stays a valid JSON array.
                if not first:
                    fp.write(',')
                fp.write(buildJson(keys, [author, desc, yishi]))
                first = False
        fp.write(']')


if __name__ == '__main__':
    main()

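Overall approach: request -> parse data -> store
Requests are made with the requests library, disguised as ordinary browser requests through the User-Agent header. The downloaded data is handed to BeautifulSoup for parsing, which mostly means locating the relevant tags in the HTML and reading their attribute values and text content, then assembling the detail-page URL of each target author. The script then visits each detail page, extracts the data, and stores it locally.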
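For the storage step, one alternative to writing the JSON array piece by piece is to collect the records and serialize them in a single call. The following is a minimal Python 3 sketch under that assumption: `write_authors` and the sample record are hypothetical, while `json.dump` and its `ensure_ascii` and `indent` parameters are standard library API. Passing `ensure_ascii=False` keeps Chinese text readable in the output file instead of escaping it as `\uXXXX` sequences.

```python
import io
import json


def write_authors(fp, records):
    """Serialize all author records as one JSON array to a text file object."""
    json.dump(list(records), fp, ensure_ascii=False, indent=2)


# Hypothetical sample record using the article's three keys; the values are placeholders.
SAMPLE_RECORDS = [
    {'author': '李白', 'desc': 'placeholder description', 'story': ''},
]

if __name__ == '__main__':
    # Write to an in-memory buffer here; a real run would use
    # open('author.json', 'w', encoding='utf-8') instead.
    buffer = io.StringIO()
    write_authors(buffer, SAMPLE_RECORDS)
    print(buffer.getvalue())
```

The trade-off is memory: every record is held until the end, which is fine for a list of authors but less suitable for a very large crawl.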
