A crawler that scrapes a web novel page by page

Requirements:

Scrape novels from certain websites, one page at a time.

Every page has a "next" button. Collect the href of each next button, and the novel can be crawled page by page.
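The page-by-page idea can be sketched as a small loop that is independent of any particular site. This is a minimal Python 3 sketch, not the code from this post: `crawl_pages`, `fetch_next` and `max_pages` are hypothetical names introduced here, and the `seen` set and page cap are added safeguards in case a next link points back to an earlier page.

```python
def crawl_pages(start_url, fetch_next, max_pages=1000):
    """Follow 'next' links from start_url and return the URLs visited, in order.

    fetch_next(url) is a caller-supplied (hypothetical) function that downloads
    the page, saves whatever it needs, and returns the next page's URL,
    or None when the page has no next button.
    """
    visited = []
    seen = set()
    url = start_url
    # Stop on the last page, on a repeated URL, or when the page cap is reached.
    while url is not None and url not in seen and len(visited) < max_pages:
        seen.add(url)
        visited.append(url)
        url = fetch_next(url)
    return visited


# Usage with an in-memory stand-in for a site: each key links to its value.
fake_site = {"/p1.html": "/p2.html", "/p2.html": "/p3.html", "/p3.html": None}
print(crawl_pages("/p1.html", fake_site.get))
```

In a real crawler, `fetch_next` would download the page, write the text to disk, and return the absolute URL built from the next button's href.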

Pages are parsed with BeautifulSoup.
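As an illustration of the parsing step, here is a minimal Python 3 sketch that runs BeautifulSoup over a made-up page. The sample HTML, the function name `parse_page` and the `www.example.com` URL are assumptions for this example; the selectors (`div` with id `view2`, `li` with class `next`) are the ones this post's code relies on, and a real site may use different markup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up sample page; the element names mirror the selectors used in this post.
SAMPLE_HTML = """
<html><head><title>Chapter 1 - Some Site</title></head>
<body>
<div id="view2">Once upon a time.</div>
<ul><li class="next"><a href="/htm/2.html">next</a></li></ul>
</body></html>
"""


def parse_page(html, page_url):
    """Return (title, text, next_url); next_url is None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    # Keep only the part of the <title> before the first '-'.
    title = soup.title.string.split("-")[0].strip()
    text = soup.find("div", id="view2").get_text()
    # find() returns None when there is no matching element.
    link = soup.find("li", class_="next")
    if link is None or link.a is None:
        return title, text, None
    # urljoin resolves a relative href against the current page's URL.
    return title, text, urljoin(page_url, link.a["href"])


print(parse_page(SAMPLE_HTML, "http://www.example.com/htm/1.html"))
```

Using `find` rather than `find_all(...)[0]` means a missing next button gives `None` instead of an `IndexError`, which makes the last page easy to detect.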

from bs4 import BeautifulSoup
import urllib2
import socket
import time
import sys

# http://www.vc.com/htm/2016/12/24/t02/367246.html
host_name = 'http://www.vc.com'


def html_process(html_file, url):
    '''use bs to get the title, content and next link from html_file'''
    global host_name
    # soup = BeautifulSoup(open(html_file), "html_parser")
    soup = BeautifulSoup(html_file, "html.parser")

    # append everything to one text file
    text = '/dev/shm/novel.txt'
    file = open(text, 'a')
    file.write('######################################')
    file.write('\r\n' + url + '\r\n')

    # get title
    title_ret = soup.title.string.split('-')[0].strip()
    file.write('\r\n@# ' + title_ret + '\r\n')

    # get content
    file.write(soup.find("div", id='view2').get_text() + '\r\n')
    file.close()

    # get next href; find() returns None when there is no next button,
    # whereas find_all(...)[0] would raise IndexError
    link = soup.find("li", class_="next")
    if link is None or link.a is None:
        print 'next link is None'
        sys.exit(0)
    next_href = host_name + link.a['href']
    return next_href


def html_get(url):
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0"
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, headers=headers)
    try:
        page = urllib2.urlopen(req, timeout=20).read()
        return page
    except urllib2.URLError as e:
        print "error while loading " + url
        sys.exit(1)
    except socket.timeout:
        # do retry (note: retries without any limit)
        return html_get(url)


def test(url):
    while None != url:
        html_file = html_get(url)
        if None == html_file:
            print 'ERROR OF READING ', url
            sys.exit(1)
        url = html_process(html_file, url)
        # wait between requests
        time.sleep(5)


if __name__ == '__main__':
    # Python 2 only: make utf-8 the default encoding so unicode text can be written
    reload(sys)
    sys.setdefaultencoding("utf-8")
    # start up url
    test("http://www.vc.com/htm/2013/11/2/t02/316551.html")
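The retry in html_get calls itself again on every timeout, with no upper bound. One possible alternative is a bounded retry loop. The following is a hedged Python 3 sketch (where urllib2 became urllib.request), not a drop-in replacement for the code above: `fetch_with_retry`, `opener`, `retries` and `delay` are hypothetical names, and the stand-in opener exists only so the example runs without network access.

```python
import io
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, opener=urllib.request.urlopen, retries=3, delay=1.0):
    """Return the response body as bytes, or None after `retries` timeouts."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    for attempt in range(retries):
        try:
            with opener(req, timeout=20) as resp:
                return resp.read()
        except urllib.error.URLError as e:
            # A timeout while connecting is wrapped in URLError;
            # any other URL or HTTP error is passed on to the caller.
            if not isinstance(e.reason, socket.timeout):
                raise
        except socket.timeout:
            # A timeout while reading the body is raised directly.
            pass
        if attempt < retries - 1:
            time.sleep(delay)
    return None


# Usage with a stand-in opener that times out twice, then succeeds.
calls = []


def flaky_opener(req, timeout):
    calls.append(req.full_url)
    if len(calls) < 3:
        raise socket.timeout("simulated timeout")
    return io.BytesIO(b"<html>ok</html>")


print(fetch_with_retry("http://www.example.com/1.html", opener=flaky_opener, delay=0))
```

Returning None after the last attempt lets the caller decide whether to stop the crawl or skip the page.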

Reposted from: https://www.cnblogs.com/shaivas/p/6218227.html
