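A quick summary:

Today's scraping was a bit demoralizing, but I still learned quite a lot.

The lesson: learn to search Baidu and Google for answers and think things through yourself. Don't rely on technical QQ groups; most of them are just idle chatter, and very few people there will actually help you.

The key point: today I learned how to save scraped data to .txt and .xlsx files, that is, plain text files and Excel spreadsheets, and I got more practice with regular expressions in the re module.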
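
The script in this post only writes the .xlsx file, so here is a minimal sketch of the .txt side. The function name `save_rows_to_txt`, the sample rows, and the output file name are made up for illustration; the tab-separated layout is just one reasonable choice, not something taken from the original project.

```python
import os
import tempfile


def save_rows_to_txt(rows, path):
    """Write each row as one tab-separated line of UTF-8 text."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write("\t".join(str(field) for field in row) + "\n")


# Hypothetical sample rows: (title, author, rating)
sample_rows = [("Book A", "Author A", "8.5"), ("Book B", "Author B", "7.9")]

# Write into the system temp directory so the demo leaves the project untouched
txt_path = os.path.join(tempfile.gettempdir(), "douban_books_demo.txt")
save_rows_to_txt(sample_rows, txt_path)

# Read the file back to check what was stored
with open(txt_path, encoding="utf-8") as f:
    saved_lines = f.read().splitlines()
print(saved_lines)
```

Opening the file with an explicit `encoding="utf-8"` matters here, because Chinese titles would otherwise depend on the platform's default encoding.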

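Result screenshot first:

This time I scraped the Chinese e-books on Douban Read, sorted by popularity.
Now the code. I like to post the code first so that you can paste it into a project and run it before reading on.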

# encoding=utf8
import re
import threading

import requests
from openpyxl import Workbook

num0 = 1  # current worksheet row; row 1 holds the header, so data starts at row 2

# Category URLs, all sorted by "hot"
url0 = 'https://read.douban.com/kind/0?sort=hot&promotion_only=False&min_price=None&max_price=None&works_type=None'  # original writing
url1 = 'https://read.douban.com/kind/1?sort=hot&promotion_only=False&min_price=None&max_price=None&works_type=None'  # Chinese e-books
url2 = 'https://read.douban.com/kind/300?sort=hot&promotion_only=False&min_price=None&max_price=None&works_type=None'  # English e-books

headers = {
    'Cookie': 'bid=enXFLYclZz4; ap=1; _gat=1; _ga=GA1.3.1076520291.1498377550; _gid=GA1.3.2095776114.1498377550; _pk_id.100001.a7dd=34cd84f06a063e70.1498377550.1.1498378103.1498377550.; _pk_ses.100001.a7dd=*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Create the workbook and write the header row
wb = Workbook()
ws = wb.active
ws.title = '豆瓣阅读中文电子书'  # sheet title: "Douban Read Chinese e-books"
ws.cell(row=1, column=1).value = '书籍名称'  # book title
ws.cell(row=1, column=2).value = '作者'  # author
ws.cell(row=1, column=3).value = '评分'  # rating
ws.cell(row=1, column=4).value = '评价人数'  # number of ratings
ws.cell(row=1, column=5).value = '是否免费'  # free or not
ws.cell(row=1, column=6).value = '简述'  # short description


def getHotBookUrl(ID):
    # The first page has no start parameter; later pages advance by 20 books:
    # page 2 -> start=20, page 3 -> start=40, ...
    if ID == 1:
        url = 'https://read.douban.com/kind/1?sort=hot&promotion_only=False&min_price=None&max_price=None&works_type=None'
    else:
        url = 'https://read.douban.com/kind/1?start=%s&sort=hot&promotion_only=False&min_price=0&max_price=0&works_type=None' % str((ID - 1) * 20)
    print('url ' + url + ' pageIndex ' + str(ID))
    # Pass the headers by keyword: the second positional argument of
    # requests.get is params, not headers
    response = requests.get(url, headers=headers)
    return response.text


def parseHotBook(html):
    # Book title
    regPageName = r'<div class="title">.*?">(.*?)</a>'
    reg_pagename = re.compile(regPageName)
    titlelist = re.findall(reg_pagename, html)

    # Author (Douban user, Jianshu)
    regAuthor = r'<span class="meta-item">.*?"author-item".*?">(.*?)</a>'
    regAuthorMain = r'<a class="author-item".*?">(.*?)</a>'
    reg_author = re.compile(regAuthor)
    reg_main_author = re.compile(regAuthorMain)
    authorother = re.findall(reg_author, html)
    authormain = re.findall(reg_main_author, html)
    # authormain.remove('%s' % authorother)  # a translator may be present
    print(authormain)

    # Rating
    regCommend = r'<span class="rating-average">(.*?)</span>'
    reg_commend = re.compile(regCommend)
    commend = re.findall(reg_commend, html)

    # Number of people who rated
    regPerson = r'<a.*?class="ratings-link".*?<span>(.*?)</span>'
    reg_person = re.compile(regPerson)
    person = re.findall(reg_person, html)

    # Free or not
    regFree = r'<span class="price-tag ">(.*?)</span>'
    reg_free = re.compile(regFree)
    free = re.findall(reg_free, html)

    # Short description: capture up to the first newline
    # regDes = r'desc-brief">(.*?)\n<a.*?</div>'
    regDes = r'desc-brief">(.*?)\n'
    reg_des = re.compile(regDes)
    desc = re.findall(reg_des, html)
    print(desc)

    # One tuple per book; zip stops at the shortest list
    ver_info = list(zip(titlelist, authormain, commend, person, free, desc))
    return ver_info


def write():
    global num0
    print('开始爬取内容')  # "start scraping"
    ID = 1
    nums = 0
    while ID < 78:
        html = getHotBookUrl(ID)
        ver_infos = parseHotBook(html)
        ID = ID + 1
        for ver_info in ver_infos:
            num0 = num0 + 1
            ws.cell(row=num0, column=1).value = ver_info[0]
            ws.cell(row=num0, column=2).value = ver_info[1]
            ws.cell(row=num0, column=3).value = ver_info[2]
            ws.cell(row=num0, column=4).value = ver_info[3]
            ws.cell(row=num0, column=5).value = ver_info[4]
            ver = ver_info[5]
            ws.cell(row=num0, column=6).value = r"%s" % ver
            print(ver_info[5])
        nums += 1
        print('爬取成功 ' + str(nums))  # "scraped successfully"
    wb.save('豆瓣阅读中文电子书' + '.xlsx')


def start():
    print('start init ....   ')
    # Pass the function itself; target=write() would run it right here
    # in the main thread and hand its return value (None) to Thread
    t1 = threading.Thread(target=write)
    t1.start()


start()

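That's all the code. I'm starting to get the hang of the re module. The weak spot is the author field: some books list both an original author and a translator, and I don't check for that. My Python 3 fundamentals aren't very solid, and a lot of this is code I just picked up and used directly. Still, I feel that learning this way is more efficient than just reading through a basics tutorial.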
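
One possible way to handle the author/translator problem is to cut the page into one chunk per book first, and only then run the field regexes inside each chunk, so that an extra name stays with its own book. The sketch below is not the original script: the `<li class="item">` wrapper and the sample markup are hypothetical and much simpler than the real Douban page, and treating the first listed name as the author is an assumption.

```python
import re

# Hypothetical, simplified markup; the real page structure may differ.
sample_html = (
    '<li class="item"><div class="title"><a href="/b/1">Book A</a></div>'
    '<a class="author-item" href="/a/1">Author A</a>'
    '<a class="author-item" href="/a/2">Translator A</a></li>'
    '<li class="item"><div class="title"><a href="/b/2">Book B</a></div>'
    '<a class="author-item" href="/a/3">Author B</a></li>'
)

item_re = re.compile(r'<li class="item">(.*?)</li>', re.S)
title_re = re.compile(r'<div class="title">.*?">(.*?)</a>', re.S)
author_re = re.compile(r'<a class="author-item".*?">(.*?)</a>', re.S)


def parse_books(html):
    """Return one dict per book, keeping extra names with their own book."""
    books = []
    for block in item_re.findall(html):
        title = title_re.search(block)
        names = author_re.findall(block)
        books.append({
            "title": title.group(1) if title else "",
            # Assumption: the first listed name is the author
            "author": names[0] if names else "",
            # Any further names (for example a translator) are kept separately
            "others": names[1:],
        })
    return books


books = parse_books(sample_html)
for book in books:
    print(book)
```

Because each book is parsed on its own, a book with two names no longer shifts the author column for every book after it.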

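What kept me stuck was the short-description part. I really didn't know how to match whitespace with a regex, and the \n was a trap. Sometimes you don't need to match every character either. In the end a second-year high school student helped me out, which honestly left me with a sense of crisis.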
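
For anyone stuck on the same thing, here is a small sketch of how newlines and whitespace behave in the re module. The HTML fragment is invented for the demo; only the `desc-brief` pattern comes from the script above.

```python
import re

# Hypothetical fragment: a two-line description followed by a "more" link
html = '<div class="desc-brief">A short\n description.\n<a href="#">more</a></div>'

# By default "." does not match "\n", so this capture ends at the first newline
first_line = re.findall(r'desc-brief">(.*?)\n', html)

# With re.S (re.DOTALL) "." also matches "\n", so the capture can span lines;
# \s* swallows the whitespace in front of the link
full_text = re.findall(r'desc-brief">(.*?)\s*<a', html, re.S)

# \s matches spaces, tabs and newlines, which makes it handy for cleanup
collapsed = re.sub(r'\s+', ' ', full_text[0])

print(first_line)
print(full_text)
print(collapsed)
```

The pattern in the script takes the first approach, so it only ever keeps the first line of a description.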

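Another thing worth learning:

ver_info = list(zip(titlelist, authormain, commend, person, free, desc))

zip pairs up the elements at the same position in each list into tuples, so ver_info is a list with one tuple per book. I think that's how it works, at least. Note that zip stops at the shortest list.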
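
A tiny sketch of what zip does with lists of different lengths, using made-up data. `itertools.zip_longest` is shown only as a contrast; it is not used in the script above.

```python
from itertools import zip_longest

titles = ["Book A", "Book B", "Book C"]
ratings = ["8.5", "7.9"]  # hypothetical: one rating is missing

# zip stops at the shortest input, so "Book C" is silently dropped
rows = list(zip(titles, ratings))

# zip_longest keeps every title and pads the missing values instead
padded = list(zip_longest(titles, ratings, fillvalue=""))

print(rows)
print(padded)
```

Neither variant knows which book the missing value belonged to: padding always happens at the end, so if a field is missing in the middle of a page, the rows after it are still misaligned.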

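ws.cell(row=num0, column=6).value = r"%s" % ver
This is handy too: %s with the % operator formats a value as a string. The r prefix makes no difference here, because the format string contains no backslashes.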
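
A short sketch of %-formatting with invented values, including the one case where it can surprise you:

```python
rating = 8.5

# "%s" converts the value with str(), so these two give the same result
formatted = "%s" % rating
same_as_str = str(rating)

# Several values are passed as a tuple
line = "%s by %s" % ("Book A", "Author A")

# Wrapping a single value in a one-element tuple is the safe form:
# a bare tuple on the right-hand side is unpacked as the argument list
safe = "%s" % (("a", "b"),)

print(formatted, same_as_str, line, safe)
```

In the script the value is always a string returned by re.findall, so the plain `"%s" % ver` form works and is equivalent to `ver` itself.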

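Summary:

I need to learn BeautifulSoup (bs4) and XPath, and also multithreaded scraping, because a single thread is rather slow. I haven't learned MongoDB yet; for now, being able to write a .csv feels great. Later on I want to look into data cleaning.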
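
As a starting point for the multithreading item, here is a minimal sketch using the standard library's `concurrent.futures.ThreadPoolExecutor`. The `fetch_page` function is a stand-in that returns fake text instead of downloading anything, and `fetch_all` and `max_workers=4` are arbitrary choices made for the demo; in a real crawler `fetch_page` would call something like `requests.get`.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page):
    """Stand-in for a real download; returns fake text, no network access."""
    return "html for page %d" % page


def fetch_all(pages, max_workers=4):
    """Fetch several pages concurrently and return the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands fetch_page to the worker threads and yields the
        # results in the same order as the input pages
        return list(pool.map(fetch_page, pages))


results = fetch_all(range(1, 6))
for text in results:
    print(text)
```

Keeping the worker count small is a deliberate choice here: firing many requests at once is a quick way to get blocked by the site.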
