Yousuu (优书网) is a third-party novel review site that veteran web-novel readers use heavily.
We start by crawling the Yousuu "书库" (bookstore) section.
Paging through the bookstore listing gives us the metadata for every book.
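The book data sits as a JSON blob inside the first <script> tag of each listing page rather than in the visible HTML, which is why the code below parses the script text instead of the page markup. Here is a minimal sketch of that idea, assuming the pre-update page layout that the full code also relies on (the slice offsets 25 and -122 cut off the JavaScript before and after the JSON payload):

import json
import math

import requests
from lxml import etree

BASE = "http://www.yousuu.com/bookstore/?channel&classId&tag&countWord&status&update&sort&page="
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}

resp = requests.get(BASE + "1", headers=HEADERS)             # page 1 of the bookstore listing
resp.encoding = "UTF-8"
script = etree.HTML(resp.text).xpath('//script/text()')[0]   # first <script> holds the page state
store = json.loads(script[25:-122]).get('Bookstore')         # slice off the JS wrapper, parse the JSON
print(store.get('total'), math.ceil(store.get('total') / 20))  # total books and page count (20 per page)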

def get_url():
    url = "http://www.yousuu.com/bookstore/?channel&classId&tag&countWord&status&update&sort&page="
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    html = requests.get(url + "1", headers=headers)
    html.encoding = "UTF-8"
    js_info = xpathnode(html)
    js_info = js_info.get('Bookstore')
    account_info = js_info.get('total')              # total number of books
    pages = math.ceil(float(account_info / 20))      # 20 books per page, round up
    url = [url + str(i + 1) for i in range(pages)]   # list of page URLs waiting to be crawled
    return pages, url


def xpathnode(html):                                 # return the embedded JSON as a dict
    tree = etree.HTML(html.text)
    node = tree.xpath('//script/text()')             # the first <script> tag holds the page state
    info = node[0][25:-122]                          # slice off the JavaScript around the JSON payload
    js_info = json.loads(info)
    return js_info


def crawl():                                         # the core routine
    pages, url_combine = get_url()
    conn = conn_sql()
    create_tab(conn)
    cursor = conn.cursor()
    flag = 0
    for url in url_combine:                          # page turning
        flag = flag + 1
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
        html = requests.get(url, headers=headers)
        html.encoding = "UTF-8"
        book_js_info = xpathnode(html)
        book_js_info = book_js_info.get('Bookstore')
        book_js_info = book_js_info.get('books')
        print('rate of progress:' + str(round(flag * 100 / pages, 2)) + '%')
        for i in range(20):                          # scan the 20 books on this page
            try:
                book = book_js_info[i]
                dt = {'bookId': book.get('bookId'), 'title': book.get('title'), 'score': book.get('score'),
                      'scorerCount': book.get('scorerCount'), 'author': book.get('author'),
                      'countWord': str(round(book.get('countWord') / 10000, 2)),
                      'tags': str(book.get('tags')).translate(str.maketrans('', '', '\'')),
                      'updateAt': book.get('updateAt')[:10]}
                store_to_sql(dt, conn, cursor)
            except:
                print('error')
    cursor.close()
    conn.close()

Store the results in a SQL Server database (change the connection settings to match your own database):

def store_to_sql(dt, connect, cursor):               # insert the record if it is not already there
    tbname = '[' + record + ']'
    ls = [(k, v) for k, v in dt.items() if k is not None]
    sentence = 'IF NOT EXISTS ( SELECT * FROM ' + tbname + ' WHERE bookId = ' + str(ls[0][1]) + ') ' + \
               'INSERT INTO %s (' % tbname + ','.join([i[0] for i in ls]) + \
               ') VALUES (' + ','.join(repr(i[1]) for i in ls) + ');'
    cursor.execute(sentence)
    connect.commit()
    return ""


def create_tab(conn):                                # create the table (if it does not exist)
    cursor = conn.cursor()
    sentence = 'if not exists (select * from sysobjects where id = object_id(' + record + ') ' \
               'and OBJECTPROPERTY(id, \'IsUserTable\') = 1) ' \
               'CREATE TABLE "' + record + '" ' \
               '(NUM int IDENTITY(1,1), ' \
               'bookId INT NOT NULL, ' \
               'title VARCHAR(100), ' \
               'score VARCHAR(100), ' \
               'scorerCount float, ' \
               'author VARCHAR(50), ' \
               'countWord float, ' \
               'tags VARCHAR(100), ' \
               'updateAt date)'
    cursor.execute(sentence)
    conn.commit()
    cursor.close()


def conn_sql():                                      # change these connection parameters for your own server
    server = "127.0.0.1"
    user = "sa"
    password = "123456"
    conn = pymssql.connect(server, user, password, "novel")
    return conn
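Note that the INSERT above is assembled by string concatenation, so a title containing a quote will break it and the statement is open to SQL injection. As a hedged alternative (a sketch, not part of the original code), pymssql's parameter substitution (%s placeholders plus a tuple of values) can handle the quoting; the table name still has to be concatenated because identifiers cannot be parameterized:

def store_to_sql_safe(dt, connect, cursor):
    # Parameterized variant of store_to_sql: same table and columns,
    # but the values are passed as parameters instead of being inlined.
    tbname = '[' + record + ']'                      # uses the module-level `record` table name, as above
    cols = ','.join(dt.keys())
    placeholders = ','.join(['%s'] * len(dt))
    sentence = ('IF NOT EXISTS (SELECT * FROM ' + tbname + ' WHERE bookId = %s) '
                'INSERT INTO ' + tbname + ' (' + cols + ') VALUES (' + placeholders + ')')
    cursor.execute(sentence, (dt['bookId'],) + tuple(dt.values()))
    connect.commit()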

Crawl results:

Note: after a site update, Yousuu now rate-limits crawlers, so add a sleep() call to put an artificial delay between successive page requests.
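A minimal sketch of such a throttled fetch loop (the helper name fetch_with_delay and the 1-2 second range are my own choices, not from the original code):

import random
import time

import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}

def fetch_with_delay(urls, low=1.0, high=2.0):
    """Yield each page response, sleeping a random 1-2 s between requests."""
    for url in urls:
        html = requests.get(url, headers=HEADERS)
        html.encoding = "UTF-8"
        yield html
        time.sleep(random.uniform(low, high))   # artificial gap so requests are not back to back

In crawl(), the per-page requests.get(...) call would then be replaced by iterating over fetch_with_delay(url_combine).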

The full code:

# -*- coding: gbk -*-
import json
import requests
import csv
import numpy
import math
from lxml import etree
import pymssql
import datetime

# Crawl the data and store it into the SQL Server database
record = str(datetime.date.today()).translate(str.maketrans('', '', '-'))  # today's date (digits only), used as the table name
def get_url():
    url = "http://www.yousuu.com/bookstore/?channel&classId&tag&countWord&status&update&sort&page="
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    html = requests.get(url + "1", headers=headers)
    html.encoding = "UTF-8"
    js_info = xpathnode(html)
    js_info = js_info.get('Bookstore')
    account_info = js_info.get('total')              # total number of books
    pages = math.ceil(float(account_info / 20))      # 20 books per page, round up
    url = [url + str(i + 1) for i in range(pages)]   # list of page URLs waiting to be crawled
    return pages, url


def xpathnode(html):                                 # return the embedded JSON as a dict
    tree = etree.HTML(html.text)
    node = tree.xpath('//script/text()')             # the first <script> tag holds the page state
    info = node[0][25:-122]                          # slice off the JavaScript around the JSON payload
    js_info = json.loads(info)
    return js_info


def crawl():                                         # the core routine
    pages, url_combine = get_url()
    conn = conn_sql()
    create_tab(conn)
    cursor = conn.cursor()
    flag = 0
    for url in url_combine:                          # page turning
        flag = flag + 1
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
        html = requests.get(url, headers=headers)
        html.encoding = "UTF-8"
        book_js_info = xpathnode(html)
        book_js_info = book_js_info.get('Bookstore')
        book_js_info = book_js_info.get('books')
        print('rate of progress:' + str(round(flag * 100 / pages, 2)) + '%')
        for i in range(20):                          # scan the 20 books on this page
            try:
                book = book_js_info[i]
                dt = {'bookId': book.get('bookId'), 'title': book.get('title'), 'score': book.get('score'),
                      'scorerCount': book.get('scorerCount'), 'author': book.get('author'),
                      'countWord': str(round(book.get('countWord') / 10000, 2)),
                      'tags': str(book.get('tags')).translate(str.maketrans('', '', '\'')),
                      'updateAt': book.get('updateAt')[:10]}
                store_to_sql(dt, conn, cursor)
            except:
                print('error')
    cursor.close()
    conn.close()


def store_to_sql(dt, connect, cursor):               # insert the record if it is not already there
    tbname = '[' + record + ']'
    ls = [(k, v) for k, v in dt.items() if k is not None]
    sentence = 'IF NOT EXISTS ( SELECT * FROM ' + tbname + ' WHERE bookId = ' + str(ls[0][1]) + ') ' + \
               'INSERT INTO %s (' % tbname + ','.join([i[0] for i in ls]) + \
               ') VALUES (' + ','.join(repr(i[1]) for i in ls) + ');'
    cursor.execute(sentence)
    connect.commit()
    return ""


def create_tab(conn):                                # create the table (if it does not exist)
    cursor = conn.cursor()
    sentence = 'if not exists (select * from sysobjects where id = object_id(' + record + ') ' \
               'and OBJECTPROPERTY(id, \'IsUserTable\') = 1) ' \
               'CREATE TABLE "' + record + '" ' \
               '(NUM int IDENTITY(1,1), ' \
               'bookId INT NOT NULL, ' \
               'title VARCHAR(100), ' \
               'score VARCHAR(100), ' \
               'scorerCount float, ' \
               'author VARCHAR(50), ' \
               'countWord float, ' \
               'tags VARCHAR(100), ' \
               'updateAt date)'
    cursor.execute(sentence)
    conn.commit()
    cursor.close()


def conn_sql():                                      # change these connection parameters for your own server
    server = "127.0.0.1"
    user = "sa"
    password = "123456"
    conn = pymssql.connect(server, user, password, "novel")
    return conn


if __name__ == '__main__':
    crawl()
