A Python Crawler for Douban Book Information

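To practice with the BeautifulSoup and requests libraries, I wrote a simple Douban crawler in Python 3.3. It prints the scraped information to the console and also writes it to a file.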
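As a warm-up, here is a minimal sketch of the BeautifulSoup pattern the crawler relies on: selecting elements with CSS selectors and reading their text. The HTML is made up to imitate a Douban list page (the real markup may differ), `extract_items` is a hypothetical helper, and the built-in `html.parser` is used so that lxml is not required. Looking fields up inside each `.subject-item` keeps a title paired with its own rating even when a book has no rating.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates one Douban list page; the real markup may differ.
SAMPLE_HTML = """
<ul class="subject-list">
  <li class="subject-item">
    <div class="info">
      <h2><a href="#">Book A</a></h2>
      <div class="pub">Author A / Publisher A / 2015-1 / 39.00元</div>
      <div><span class="rating_nums">8.9</span><span class="pl">(1234人评价)</span></div>
    </div>
  </li>
  <li class="subject-item">
    <div class="info">
      <h2><a href="#">Book B</a></h2>
      <div class="pub">Author B / Publisher B / 2016-5 / 25.00元</div>
      <div><span class="pl">(少于10人评价)</span></div>
    </div>
  </li>
</ul>
"""


def extract_items(html):
    """Return one dict per .subject-item, looking fields up inside each item."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    books = []
    for item in soup.select(".subject-list > .subject-item"):
        title = item.select_one(".info > h2 > a")
        pub = item.select_one(".info > .pub")
        score = item.select_one(".rating_nums")  # None when the book has no rating
        books.append({
            "name": title.get_text(strip=True) if title else "",
            "detail": pub.get_text(strip=True) if pub else "",
            "score": score.get_text(strip=True) if score else "",
        })
    return books


if __name__ == "__main__":
    for book in extract_items(SAMPLE_HTML):
        print(book)
```

The second sample book has no `.rating_nums` span, so its score comes back as an empty string instead of borrowing the next book's rating.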

Source code:

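# coding=utf-8
'''my words
    Based on Python 3. Required libraries: requests and BeautifulSoup (bs4), with lxml as the parser.
    This is a very basic crawler: it uses no crawler framework, only requests, BeautifulSoup, and re.
    It scrapes book information for every Douban tag: author, publisher, Douban rating, number of ratings, publication date, and so on.
    The scraped information is not guaranteed to be correct; some fields may be wrong.
    The results could also be stored in a database; here they are printed to the console
    and saved to a plain-text (.txt) file.
'''

import requests
from bs4 import BeautifulSoup
import re
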
# Fetch Douban's tag index page and return the URL of every tag page
def provide_url():
    # Request the Douban tag index page with an HTTP GET
    responds = requests.get("https://book.douban.com/tag/?icn=index-nav")
    # html holds the body of the response
    html = responds.text
    # Parse the page
    soup = BeautifulSoup(html, "lxml")
    # Select the <a> tags that carry the tag names, so the links can be built from them
    book_table = soup.select("#content > div > .article > div > div > .tagCol > tbody > tr > td > a")
    # List that collects every tag-page URL
    book_url_list = []
    for book in book_table:
        book_url_list.append('https://book.douban.com/tag/' + str(book.string))
    return book_url_list

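# Extract the number of ratings
def get_person(person):
    person = person.get_text().split()[0]
    # Pull the digits out of text such as "(1234人评价)"; return "" if there are none
    numbers = re.findall(r'[0-9]+', person)
    return numbers[0] if numbers else ""
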
# Price extraction when detail is split on four '/' separators
def get_rmb_price1(detail):
    price = detail.get_text().split('/',4)[-1].split()
    if re.match("USD", price[0]):
        price = float(price[1]) * 6  # rough conversion to CNY at a fixed rate of 6
    elif re.match("CNY", price[0]):
        price = price[1]
    elif re.match(r"\$", price[0]):
        # "$29.99" keeps the amount in the same token; "$ 29.99" puts it in the next one
        price = float(price[0][1:] or price[1]) * 6
    else:
        price = price[0]
    return price

# Price extraction when detail is split on three '/' separators
def get_rmb_price2(detail):
    price = detail.get_text().split('/',3)[-1].split()
    if re.match("USD", price[0]):
        price = float(price[1]) * 6  # rough conversion to CNY at a fixed rate of 6
    elif re.match("CNY", price[0]):
        price = price[1]
    elif re.match(r"\$", price[0]):
        price = float(price[0][1:] or price[1]) * 6
    else:
        price = price[0]
    return price

# Print one record to the console for testing
def test_print(name,author,intepretor,publish,time,price,score,person):
    print('name: ',name)
    print('author:', author)
    print('intepretor: ',intepretor)
    print('publish: ',publish)
    print('time: ',time)
    print('price: ',price)
    print('score: ',score)
    print('person: ',person)


# Parse one tag page and extract the required fields
def get_url_content(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, "lxml")
    tag = url.split("?")[0].split("/")[-1]  # the tag name, i.e. the string after 'tag/' in the URL
    titles = soup.select(".subject-list > .subject-item > .info > h2 > a")  # <a> tags holding the book titles
    details = soup.select(".subject-list > .subject-item > .info > .pub")  # <div> tags holding author, publisher, etc.
    scores = soup.select(".subject-list > .subject-item > .info > div > .rating_nums")  # <span> tags holding the rating
    persons = soup.select(".subject-list > .subject-item > .info > div > .pl")  # <span> tags holding the number of ratings

    print("*******************Books tagged %s**********************" % tag)

    # Open the output file in append mode; change the path to suit your own machine
    with open("C:/Users/lenovo/Desktop/book_info.txt", 'a', encoding='utf-8') as file:
        file.write("*******************Books tagged %s**********************\n" % tag)

        # zip groups the matching elements into tuples for the loop below.
        # Caveat: a book without a rating may have no .rating_nums span,
        # in which case the four lists drift out of step.
        for title, detail, score, person in zip(titles, details, scores, persons):
            try:
                fields = detail.get_text().split('/')
                name = title.get_text().split()[0]  # title
                author = fields[0].split()[0]  # author
                if len(fields) >= 5:  # author / translator / publisher / date / price
                    intepretor = fields[1].strip()  # translator
                    publish = fields[2].strip()  # publisher
                    time = fields[3].split()[0].split('-')[0]  # publication year only
                    price = get_rmb_price1(detail)  # price
                elif len(fields) == 4:  # author / publisher / date / price, no translator
                    intepretor = ""
                    publish = fields[1].strip()
                    time = fields[2].split()[0].split('-')[0]
                    price = get_rmb_price2(detail)
                else:
                    continue  # layout not recognized
                score = score.get_text()  # rating
                person = get_person(person)  # number of ratings
            except (IndexError, TypeError, ValueError):
                continue  # skip records that fit neither layout

            # Print to the console for testing
            test_print(name, author, intepretor, publish, time, price, score, person)
            # Write the record to the txt file
            file.write('name: %s author: %s intepretor: %s publish: %s time: %s '
                       'price: %s score: %s person: %s \n'
                       % (name, author, intepretor, publish, time, price, score, person))

        file.write('\n')


# Program entry point
if __name__ == '__main__':
    #url = "https://book.douban.com/tag/程序"
    book_url_list = provide_url()  # list of the URLs of every Douban tag page
    for url in book_url_list:
        get_url_content(url)  # parse the content of each URL
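The two price helpers and the two parsing branches in the listing differ only in how many '/'-separated fields the detail string has. One possible consolidation is sketched below, using only the standard library. The names `parse_price`, `parse_detail`, and `USD_TO_CNY` are hypothetical, the sample strings are made up, and the fixed rate of 6 is simply the multiplier the listing already uses, not a live exchange rate.

```python
import re

USD_TO_CNY = 6.0  # fixed multiplier borrowed from the listing; not a live rate


def parse_price(text):
    """Return the price in CNY as a float, or None if no number is found."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    amount = float(match.group())
    # Crude currency check: any "USD" or "$" marker is treated as US dollars.
    if "USD" in text.upper() or "$" in text:
        amount *= USD_TO_CNY
    return amount


def parse_detail(detail_text):
    """Split 'author / [translator /] publisher / date / price' into a dict."""
    fields = [f.strip() for f in detail_text.split("/")]
    if len(fields) >= 5:
        author, translator, publisher, date, price = fields[:5]
    elif len(fields) == 4:
        author, publisher, date, price = fields
        translator = ""
    else:
        return None  # layout not recognized
    return {
        "author": author,
        "translator": translator,
        "publisher": publisher,
        "year": date.split("-")[0],  # keep the year only
        "price": parse_price(price),
    }


if __name__ == "__main__":
    print(parse_detail("Author A / Translator B / Press C / 2012-8 / 45.00元"))
    print(parse_detail("Author D / Press E / 2010 / USD 29.99"))
```

Records with several authors or translators contain extra '/' separators and would still be misread by this sketch, just as in the listing.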

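The docstring mentions that the records could be stored in a database instead of a text file. A minimal sketch with the standard-library `sqlite3` module might look like the following. The table name `book`, the helper `save_books`, and the sample record are assumptions for illustration; the column names reuse the field names from the listing.

```python
import sqlite3


def save_books(conn, books):
    """Create the table if needed and insert one row per book dict."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS book ("
        "name TEXT, author TEXT, intepretor TEXT, publish TEXT, "
        "time TEXT, price TEXT, score TEXT, person TEXT)"
    )
    # Named placeholders let each dict supply its own values safely.
    conn.executemany(
        "INSERT INTO book VALUES "
        "(:name, :author, :intepretor, :publish, :time, :price, :score, :person)",
        books,
    )
    conn.commit()


if __name__ == "__main__":
    # Made-up record, only to exercise the function.
    sample = [{
        "name": "Book A", "author": "Author A", "intepretor": "",
        "publish": "Publisher A", "time": "2015", "price": "39.00",
        "score": "8.9", "person": "1234",
    }]
    conn = sqlite3.connect(":memory:")  # use a file path to persist the data
    save_books(conn, sample)
    print(conn.execute("SELECT name, score FROM book").fetchall())
    conn.close()
```

In the crawler, the loop in `get_url_content` could collect one such dict per book and pass the list to the helper once per page.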
Screenshot of the result: (image not included)

Reposted from: https://www.cnblogs.com/jeavenwong/p/8442833.html
