Yousuu (优书网) is a third-party novel review site that veteran web-novel readers use heavily.
We start by crawling the Yousuu "书库" (bookstore) section.
Paging through the bookstore listing gives us the metadata for every book.
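The book data sits as a JSON blob inside the first <script> tag of each listing page rather than in the visible HTML, which is why the code below parses the script text instead of the page markup. Here is a minimal sketch of that idea, assuming the pre-update page layout that the full code also relies on (the slice offsets 25 and -122 cut off the JavaScript before and after the JSON payload):

import json
import math

import requests
from lxml import etree

BASE = "http://www.yousuu.com/bookstore/?channel&classId&tag&countWord&status&update&sort&page="
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}

resp = requests.get(BASE + "1", headers=HEADERS)             # page 1 of the bookstore listing
resp.encoding = "UTF-8"
script = etree.HTML(resp.text).xpath('//script/text()')[0]   # first <script> holds the page state
store = json.loads(script[25:-122]).get('Bookstore')         # slice off the JS wrapper, parse the JSON
print(store.get('total'), math.ceil(store.get('total') / 20))  # total books and page count (20 per page)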

def get_url():
    url = "http://www.yousuu.com/bookstore/?channel&classId&tag&countWord&status&update&sort&page="
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    html = requests.get(url + "1", headers=headers)
    html.encoding = "UTF-8"
    js_info = xpathnode(html)
    js_info = js_info.get('Bookstore')
    account_info = js_info.get('total')              # total number of books
    pages = math.ceil(float(account_info / 20))      # 20 books per page, round up
    url = [url + str(i + 1) for i in range(pages)]   # list of page URLs waiting to be crawled
    return pages, url


def xpathnode(html):                                 # return the embedded JSON as a dict
    tree = etree.HTML(html.text)
    node = tree.xpath('//script/text()')             # the first <script> tag holds the page state
    info = node[0][25:-122]                          # slice off the JavaScript around the JSON payload
    js_info = json.loads(info)
    return js_info


def crawl():                                         # the core routine
    pages, url_combine = get_url()
    conn = conn_sql()
    create_tab(conn)
    cursor = conn.cursor()
    flag = 0
    for url in url_combine:                          # page turning
        flag = flag + 1
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
        html = requests.get(url, headers=headers)
        html.encoding = "UTF-8"
        book_js_info = xpathnode(html)
        book_js_info = book_js_info.get('Bookstore')
        book_js_info = book_js_info.get('books')
        print('rate of progress:' + str(round(flag * 100 / pages, 2)) + '%')
        for i in range(20):                          # scan the 20 books on this page
            try:
                book = book_js_info[i]
                dt = {'bookId': book.get('bookId'), 'title': book.get('title'), 'score': book.get('score'),
                      'scorerCount': book.get('scorerCount'), 'author': book.get('author'),
                      'countWord': str(round(book.get('countWord') / 10000, 2)),
                      'tags': str(book.get('tags')).translate(str.maketrans('', '', '\'')),
                      'updateAt': book.get('updateAt')[:10]}
                store_to_sql(dt, conn, cursor)
            except:
                print('error')
    cursor.close()
    conn.close()

Store the results in a SQL Server database (change the connection settings to match your own database):

def store_to_sql(dt, connect, cursor):               # insert the record if it is not already there
    tbname = '[' + record + ']'
    ls = [(k, v) for k, v in dt.items() if k is not None]
    sentence = 'IF NOT EXISTS ( SELECT * FROM ' + tbname + ' WHERE bookId = ' + str(ls[0][1]) + ') ' + \
               'INSERT INTO %s (' % tbname + ','.join([i[0] for i in ls]) + \
               ') VALUES (' + ','.join(repr(i[1]) for i in ls) + ');'
    cursor.execute(sentence)
    connect.commit()
    return ""


def create_tab(conn):                                # create the table (if it does not exist)
    cursor = conn.cursor()
    sentence = 'if not exists (select * from sysobjects where id = object_id(' + record + ') ' \
               'and OBJECTPROPERTY(id, \'IsUserTable\') = 1) ' \
               'CREATE TABLE "' + record + '" ' \
               '(NUM int IDENTITY(1,1), ' \
               'bookId INT NOT NULL, ' \
               'title VARCHAR(100), ' \
               'score VARCHAR(100), ' \
               'scorerCount float, ' \
               'author VARCHAR(50), ' \
               'countWord float, ' \
               'tags VARCHAR(100), ' \
               'updateAt date)'
    cursor.execute(sentence)
    conn.commit()
    cursor.close()


def conn_sql():                                      # change these connection parameters for your own server
    server = "127.0.0.1"
    user = "sa"
    password = "123456"
    conn = pymssql.connect(server, user, password, "novel")
    return conn
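Note that the INSERT above is assembled by string concatenation, so a title containing a quote will break it and the statement is open to SQL injection. As a hedged alternative (a sketch, not part of the original code), pymssql's parameter substitution (%s placeholders plus a tuple of values) can handle the quoting; the table name still has to be concatenated because identifiers cannot be parameterized:

def store_to_sql_safe(dt, connect, cursor):
    # Parameterized variant of store_to_sql: same table and columns,
    # but the values are passed as parameters instead of being inlined.
    tbname = '[' + record + ']'                      # uses the module-level `record` table name, as above
    cols = ','.join(dt.keys())
    placeholders = ','.join(['%s'] * len(dt))
    sentence = ('IF NOT EXISTS (SELECT * FROM ' + tbname + ' WHERE bookId = %s) '
                'INSERT INTO ' + tbname + ' (' + cols + ') VALUES (' + placeholders + ')')
    cursor.execute(sentence, (dt['bookId'],) + tuple(dt.values()))
    connect.commit()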

Crawl results:

Note: after a site update, Yousuu now rate-limits crawlers, so add a sleep() call to put an artificial delay between successive page requests.
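A minimal sketch of such a throttled fetch loop (the helper name fetch_with_delay and the 1-2 second range are my own choices, not from the original code):

import random
import time

import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}

def fetch_with_delay(urls, low=1.0, high=2.0):
    """Yield each page response, sleeping a random 1-2 s between requests."""
    for url in urls:
        html = requests.get(url, headers=HEADERS)
        html.encoding = "UTF-8"
        yield html
        time.sleep(random.uniform(low, high))   # artificial gap so requests are not back to back

In crawl(), the per-page requests.get(...) call would then be replaced by iterating over fetch_with_delay(url_combine).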

The full code:

# -*- coding: gbk -*-
import json
import requests
import csv
import numpy
import math
from lxml import etree
import pymssql
import datetime

# Crawl the data and store it into the SQL Server database
record = str(datetime.date.today()).translate(str.maketrans('', '', '-'))  # today's date (digits only), used as the table name
def get_url():
    url = "http://www.yousuu.com/bookstore/?channel&classId&tag&countWord&status&update&sort&page="
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    html = requests.get(url + "1", headers=headers)
    html.encoding = "UTF-8"
    js_info = xpathnode(html)
    js_info = js_info.get('Bookstore')
    account_info = js_info.get('total')              # total number of books
    pages = math.ceil(float(account_info / 20))      # 20 books per page, round up
    url = [url + str(i + 1) for i in range(pages)]   # list of page URLs waiting to be crawled
    return pages, url


def xpathnode(html):                                 # return the embedded JSON as a dict
    tree = etree.HTML(html.text)
    node = tree.xpath('//script/text()')             # the first <script> tag holds the page state
    info = node[0][25:-122]                          # slice off the JavaScript around the JSON payload
    js_info = json.loads(info)
    return js_info


def crawl():                                         # the core routine
    pages, url_combine = get_url()
    conn = conn_sql()
    create_tab(conn)
    cursor = conn.cursor()
    flag = 0
    for url in url_combine:                          # page turning
        flag = flag + 1
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
        html = requests.get(url, headers=headers)
        html.encoding = "UTF-8"
        book_js_info = xpathnode(html)
        book_js_info = book_js_info.get('Bookstore')
        book_js_info = book_js_info.get('books')
        print('rate of progress:' + str(round(flag * 100 / pages, 2)) + '%')
        for i in range(20):                          # scan the 20 books on this page
            try:
                book = book_js_info[i]
                dt = {'bookId': book.get('bookId'), 'title': book.get('title'), 'score': book.get('score'),
                      'scorerCount': book.get('scorerCount'), 'author': book.get('author'),
                      'countWord': str(round(book.get('countWord') / 10000, 2)),
                      'tags': str(book.get('tags')).translate(str.maketrans('', '', '\'')),
                      'updateAt': book.get('updateAt')[:10]}
                store_to_sql(dt, conn, cursor)
            except:
                print('error')
    cursor.close()
    conn.close()


def store_to_sql(dt, connect, cursor):               # insert the record if it is not already there
    tbname = '[' + record + ']'
    ls = [(k, v) for k, v in dt.items() if k is not None]
    sentence = 'IF NOT EXISTS ( SELECT * FROM ' + tbname + ' WHERE bookId = ' + str(ls[0][1]) + ') ' + \
               'INSERT INTO %s (' % tbname + ','.join([i[0] for i in ls]) + \
               ') VALUES (' + ','.join(repr(i[1]) for i in ls) + ');'
    cursor.execute(sentence)
    connect.commit()
    return ""


def create_tab(conn):                                # create the table (if it does not exist)
    cursor = conn.cursor()
    sentence = 'if not exists (select * from sysobjects where id = object_id(' + record + ') ' \
               'and OBJECTPROPERTY(id, \'IsUserTable\') = 1) ' \
               'CREATE TABLE "' + record + '" ' \
               '(NUM int IDENTITY(1,1), ' \
               'bookId INT NOT NULL, ' \
               'title VARCHAR(100), ' \
               'score VARCHAR(100), ' \
               'scorerCount float, ' \
               'author VARCHAR(50), ' \
               'countWord float, ' \
               'tags VARCHAR(100), ' \
               'updateAt date)'
    cursor.execute(sentence)
    conn.commit()
    cursor.close()


def conn_sql():                                      # change these connection parameters for your own server
    server = "127.0.0.1"
    user = "sa"
    password = "123456"
    conn = pymssql.connect(server, user, password, "novel")
    return conn


if __name__ == '__main__':
    crawl()
