Python fun tutorial: a Python crawler that scrapes a joke website and has you laughing along the way

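import os
import re
import sys
import threading
import time

import requests
import threadpool
from lxml import etree


class ScrapDemo():
    next_page_url = ""   # URL of the next list page
    page_num = 1         # current page number
    detail_url_list = 0  # number of detail-page URLs on the current page
    deepth = 0           # crawl depth (list pages visited so far)
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36"}
    fileNum = 0          # index of the next file to write for the current page
    lock = threading.Lock()  # guards fileNum, which the worker threads share

    def __init__(self, url):
        self.scrapyIndex(url)

    def threadIndex(self, urllist):  # start the thread pool
        if len(urllist) == 0:
            print("Please provide the URLs to crawl")
            return False
        ScrapDemo.detail_url_list = len(urllist)
        pool = threadpool.ThreadPool(len(urllist))
        # Named reqs so that the requests module is not shadowed
        reqs = threadpool.makeRequests(self.detailScray, urllist)
        for req in reqs:
            pool.putRequest(req)
            time.sleep(0.5)
        pool.wait()

    def detailScray(self, url):  # fetch the HTML of a detail page
        if not url == "":
            url = 'http://xiaohua.zol.com.cn/{}'.format(url)
            res = requests.get(url, headers=ScrapDemo.headers)
            html = res.text
            # element=etree.HTML(html)
            # divEle=element.xpath("//div[@class='article-text']")[0] # Element div
            self.downloadText(html)

    def downloadText(self, ele):  # extract the joke text and save it as a txt file
        # The tag names in the two patterns below were lost when this page was
        # extracted; <div class="article-text"> is inferred from the XPath
        # commented out in detailScray, and <p> from the original comment.
        clist = re.findall('<div class="article-text">(.*?)</div>', ele, re.S)
        for index in range(len(clist)):
            # Regex: strip carriage returns, tabs and <p> tags
            clist[index] = re.sub(r'(\r|\t|<p>|</p>)+', '', clist[index])
        content = "".join(clist)
        # print(content)
        basedir = os.path.dirname(os.path.abspath(__file__))
        filePath = os.path.join(basedir, 'file_txt')
        os.makedirs(filePath, exist_ok=True)  # create the output folder if missing
        with ScrapDemo.lock:  # claim a file number so two threads never share one
            num = ScrapDemo.fileNum
            ScrapDemo.fileNum = ScrapDemo.fileNum + 1
            filename = "xiaohua{0}-{1}.txt".format(ScrapDemo.deepth, str(num))
            is_last = num == (ScrapDemo.detail_url_list - 1)
        file = os.path.join(filePath, filename)
        try:
            with open(file, "w", encoding="utf-8") as f:
                f.write(content)
            print(num + 1)
            if is_last:  # last detail page of this list page: move on
                print(ScrapDemo.next_page_url)
                print(ScrapDemo.deepth)
                if not ScrapDemo.next_page_url == "":
                    self.scrapyIndex(ScrapDemo.next_page_url)
        except Exception as e:
            print("Error:%s" % str(e))

    def scrapyIndex(self, url):  # crawl one list page
        if not url == "":
            ScrapDemo.fileNum = 0
            ScrapDemo.deepth = ScrapDemo.deepth + 1
            print("Starting to crawl page {0}".format(ScrapDemo.page_num))
            res = requests.get(url, headers=ScrapDemo.headers)
            html = res.text
            element = etree.HTML(html)
            a_urllist = element.xpath("//a[@class='all-read']/@href")  # every "read full text" link on this page
            next_page = element.xpath("//a[@class='page-next']/@href")  # URL of the next page
            # Only index next_page once we know it is not empty
            if next_page:
                ScrapDemo.next_page_url = 'http://xiaohua.zol.com.cn/{}'.format(next_page[0])
            else:
                ScrapDemo.next_page_url = ""
            if next_page and ScrapDemo.next_page_url != url:
                ScrapDemo.page_num = ScrapDemo.page_num + 1
                self.threadIndex(a_urllist[:])
            else:
                print('Download finished, stopped at page {}'.format(ScrapDemo.page_num))
                sys.exit()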
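
The cleaning step inside downloadText can be tried on its own, without touching the network. The sketch below is only an illustration: the helper name extract_joke_text and the sample HTML are made up for this example, and the `<div class="article-text">` pattern is an assumption taken from the commented-out XPath in the listing, since the original tag names did not survive on this page.

```python
import re

# Assumed markup: the joke body sits in <div class="article-text">...</div>.
ARTICLE_RE = re.compile(r'<div class="article-text">(.*?)</div>', re.S)
# Carriage returns, tabs and <p> tags are treated as noise.
NOISE_RE = re.compile(r'(\r|\t|<p>|</p>)+')


def extract_joke_text(html):
    """Join the text of every article-text block, with the noise removed."""
    blocks = ARTICLE_RE.findall(html)
    return "".join(NOISE_RE.sub("", block) for block in blocks)


# Hypothetical sample page, used only to exercise the function.
sample = ('<html><div class="article-text"><p>\tWhy did the crawler blush?</p>'
          '\r<p>It saw the page source.</p></div></html>')
print(extract_joke_text(sample))
```

Keeping this step as a plain function makes it easy to check the regular expressions against a saved page before running the full crawler.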
