Scraping Basic Information for All Novels on Qidian (起点小说网)

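This is my first blog post, so I'm testing the waters. It's a crawler, you know the drill: about 50,000 records in three hours. Features:

Multithreading
Automatic retry when a request fails
Links that fail repeatedly are saved so they can be crawled again later
Configurable amount of data to collect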

Code

Import the required packages

# -*- coding: utf-8 -*-
import time
import datetime
import threadpool
from bs4 import BeautifulSoup
import csv
import requests
from urllib.parse import urlencode

Save each novel's URL to urls.txt:

def load(i, count=0):
    # Collect the novel URLs listed on page i; retry up to 5 times on failure
    url = "https://www.qidian.com/all?page=" + str(i)
    try:
        print("Collecting page: {}".format(url))
        page = requests.get(url)
        page.encoding = "utf-8"
        soup = BeautifulSoup(page.text, 'lxml')
        elem = soup.select(".book-mid-info h4 a")  # select the link to each novel
        urls = []
        for j in range(0, 20):  # each listing page is expected to hold 20 novels
            urls.append('https:' + elem[j].get('href'))
        if len(urls) != 20:
            raise Exception("page {} did not yield 20 URLs".format(i))
        with open('urls.txt', 'a', encoding='utf-8') as f:  # append to the file
            for cont in urls:
                f.write(str(cont) + '\n')
    except Exception as e:
        if count < 5:
            load(i, count + 1)
        else:
            print(str(e))
            # Record the page that kept failing (i is an int, so convert it)
            with open('urllist.txt', 'a', encoding='utf-8') as fp:
                fp.write(url + ' ' + str(i) + '\n')

def loadurl(start, end, thrednum):
    links = list(range(start, end + 1))  # user-defined page range
    # Start collecting novel URLs
    print(len(links))
    try:
        pool = threadpool.ThreadPool(thrednum)  # thread pool
        # Named reqs so the requests module is not shadowed
        reqs = threadpool.makeRequests(load, links)
        for req in reqs:
            pool.putRequest(req)
        pool.wait()
    except KeyboardInterrupt:
        print('Stopped manually')
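The retry logic in load calls itself recursively with an incremented counter. The same idea can be written as a loop. The helper below is only a sketch: with_retries and its parameters are hypothetical names that are not part of the original code, and it assumes the wrapped function signals failure by raising an exception.

```python
import time

def with_retries(func, arg, max_retries=5, delay=0.0):
    """Call func(arg), retrying up to max_retries more times if it raises.

    Returns (True, result) on success, or (False, last_exception) once
    every attempt has failed.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return True, func(arg)
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                time.sleep(delay)  # wait before the next attempt
    return False, last_error
```

With max_retries=5 this makes at most six attempts, the same number as the count < 5 check above. A caller can write the failing argument to a file when the first element of the returned tuple is False.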

Initialize qidian.csv. Run this only once, because it opens the file in write mode and overwrites anything already in it:

def init():
    row = ['book_name', 'author', 'words_count', 'click_count', 'books_count', 'score', 'j_user_count', 'crawl_time', 'id']
    # Columns: title, author, word count, click count, number of works, rating, number of raters, crawl time, id
    with open("qidian.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, dialect="excel")
        writer.writerow(row)
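Because init overwrites the file, calling it a second time discards everything collected so far. One possible guard is sketched below: it writes the header only when the file is missing or empty. The name init_once and its path parameter are assumptions made for illustration, not part of the original code.

```python
import csv
import os

HEADER = ['book_name', 'author', 'words_count', 'click_count',
          'books_count', 'score', 'j_user_count', 'crawl_time', 'id']

def init_once(path="qidian.csv"):
    """Write the header row only if the file is missing or empty.

    Returns True if the header was written, False if the file was left alone.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return False  # existing data: do not overwrite
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, dialect="excel").writerow(HEADER)
    return True
```

This makes it safe to restart the crawler after an interruption without losing the rows already written.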

Read urls.txt, convert each novel page into a record, and store it in qidian.csv:

def work(url, count=0):
    try:
        # The request is inside the try block so that network errors are retried too
        page = requests.get(url)
        page.encoding = "utf-8"
        soup = BeautifulSoup(page.text, 'lxml')
        # Select elements
        book_name = soup.select(".book-info h1 em")[0].text
        author = soup.select(".writer")[0].text
        words_count = soup.select(".book-info p em")[0].text
        click_count = soup.select(".book-info p em")[1].text
        books_count = soup.select(".work-state li em")[0].text
        book_id = url.replace("https://book.qidian.com/info/", "")
        crawl_time = get_unix_time()
        print(url)
        # These page selectors are disabled; the rating is read from the AJAX endpoint below instead
        # score = soup.select("#score1")[0].text + '.' + soup.select("#score2")[0].text
        # j_user_count = soup.select("#j_userCount span")[0].text
        data = {
            '_csrfToken': 'QpbsVhyc5zc0h21NiEweIrLMu2tFOM1RsgfZtWSS',
            'bookId': book_id,
            'pageSize': 15
        }
        other_url = 'https://book.qidian.com/ajax/comment/index?' + urlencode(data)
        page = requests.get(other_url)
        page.encoding = "utf-8"
        cont = page.json()  # parse the JSON response; eval() on remote text is unsafe
        score = cont.get('data').get('rate')
        j_user_count = cont.get('data').get('userCount')
        # Write: append the record
        row = [book_name, author, words_count, click_count, books_count, score, j_user_count, crawl_time, book_id]
        with open("qidian.csv", "a", encoding="utf-8", newline='') as f:
            writer = csv.writer(f, dialect="excel")
            writer.writerow(row)
        with open("doneurl.txt", "a", newline='', encoding='utf-8') as fe:
            fe.write(url + '\n')
    except Exception:
        if count < 5:
            print('error: failed to get elements, retry count: ' + str(count))
            time.sleep(2)
            work(url, count + 1)
        else:
            with open("error_url.txt", "a", encoding='utf-8') as fe:
                fe.write(url + '\n')
            print('error: failed to get elements, URL written to file')
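Several worker threads append to qidian.csv at the same time, so rows could in principle interleave. A minimal sketch of one way to serialize the writes with a lock follows. The names append_row and _csv_lock are hypothetical, and this is not what the original code does.

```python
import csv
import threading

# One lock shared by every worker thread in the process
_csv_lock = threading.Lock()

def append_row(path, row):
    """Append a single row to a CSV file while holding the shared lock."""
    with _csv_lock:
        with open(path, "a", encoding="utf-8", newline="") as f:
            csv.writer(f, dialect="excel").writerow(row)
```

In work, the block that opens qidian.csv could then become a single call such as append_row("qidian.csv", row).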

Other helper functions and the crawler entry point

# Timestamp
def get_unix_time():  # get the current Unix timestamp
    dtime = datetime.datetime.now()
    ans_time = int(time.mktime(dtime.timetuple()))
    return ans_time

# Crawler entry point
def spider(start=1, end=2500, thrednum=10):
    # Collect each novel's URL and store it in urls.txt
    loadurl(start, end, thrednum)
    # Read the URLs into a list
    with open('urls.txt', 'r', encoding='utf-8') as f:
        links = [line.strip() for line in f if line.strip()]
    # Start collecting one record per URL
    init()
    try:
        pool = threadpool.ThreadPool(thrednum)  # thread pool
        # Named reqs so the requests module is not shadowed
        reqs = threadpool.makeRequests(work, links)
        for req in reqs:
            pool.putRequest(req)
        pool.wait()
    except KeyboardInterrupt:
        print('Stopped manually')
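The code above relies on the third-party threadpool package. For readers who would rather stay within the standard library, the same fan-out pattern can be sketched with concurrent.futures. The helper name run_in_pool is an assumption for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def run_in_pool(func, items, thread_count=10):
    """Run func on every item using a pool of worker threads.

    Results come back in the same order as the input items.
    """
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(func, items))
```

For example, run_in_pool(work, links, thrednum) would play the role of the makeRequests / putRequest / wait sequence in spider.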

Start the crawler

spider(1,2500,20)
This crawls pages 1 through 2500 with 20 threads. At 20 novels per page, that is up to 50,000 records in total.
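The feature list mentions crawling repeatedly failed links again, and work stores such links in error_url.txt, but the post does not show the step that feeds them back in. The sketch below is one way to do it. The function retry_failed is hypothetical, and it assumes the worker appends any URL that still fails to the same file.

```python
import os

def retry_failed(worker, error_path="error_url.txt"):
    """Re-run worker on every URL recorded in error_path.

    Returns the number of distinct URLs that were retried.
    """
    if not os.path.exists(error_path):
        return 0
    with open(error_path, "r", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    urls = list(dict.fromkeys(urls))  # drop duplicates, keep order
    # Empty the file first; the worker re-adds anything that fails again
    open(error_path, "w", encoding="utf-8").close()
    for url in urls:
        worker(url)
    return len(urls)
```

Calling retry_failed(work) after spider has finished would give each failed URL another round of attempts.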

A note from the author

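This is my first blog post, so there are bound to be places where my explanations fall short. Corrections and advice are welcome.
If you find a problem in my code or something that could be improved, please get in touch; I would be very grateful.
My QQ: 289672494 (checked regularly)
I hope we can all improve together.
To collect other fields, just modify the relevant part of the code.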

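Note: I will abide by all laws and regulations and will not steal website data or interfere with any site's operation. This article is provided only as a reference for fellow bloggers and visitors. A clean and safe network environment is everyone's responsibility.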
