Using lxml's XPath in a crawler to extract the title, URL, view count, date, and joke content from jokeji (笑话集, www.jokeji.cn) pages

Enough talk; here is the source code.

from urllib import request, parse
from urllib import error
from lxml import etree
import string

# Browser-like User-Agent sent with every request.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}


def jokeji(url, beginPage, endPage):
    # Note: range() excludes endPage, so the last page fetched is endPage - 1.
    for page in range(beginPage, endPage):
        pn = page
        fullurl = url + "me_page=" + str(pn)
        req = request.Request(fullurl, headers=HEADERS)
        try:
            response = request.urlopen(req)
            resHtml = response.read()
            # The site is served as GBK; undecodable bytes are dropped.
            resHtml = resHtml.decode("gbk", 'ignore')
            html = etree.HTML(resHtml)
            # Each joke entry on the list page sits in its own table.
            results = html.xpath('//table[@width="646"]')
            for site in results:
                # Title
                title = site.xpath('.//td/a')[0].text
                # View count
                view = site.xpath('.//td')[2].text
                # Date
                date = site.xpath('.//td/span')[0].text
                # URL (relative to the site root)
                jokeurl = site.xpath('.//td[2]/a[@class="main_14"]/@href')[0]
                newjokeurl = "http://www.jokeji.cn/" + jokeurl
                # Percent-encode only the non-ASCII characters in the URL.
                newjokeurl = parse.quote(newjokeurl, safe=string.printable)
                # print(newjokeurl)
                print("Title: %s, url: %s, views: %s, date: %s" % (title, jokeurl, view, date))
                # Joke content
                requrl = request.Request(newjokeurl, headers=HEADERS)
                try:
                    jokresponse = request.urlopen(requrl)
                    jokresHtml = jokresponse.read()
                    jokresHtml = jokresHtml.decode("gbk")
                    html = etree.HTML(jokresHtml)
                    result = html.xpath('//div[@class="left_up"]//font[@face="Verdana"]//text()')
                    for i in result:
                        print(i)
                except Exception as e:
                    # Report the failure and move on to the next joke.
                    print("Failed to fetch joke content:", e)
        except error.URLError as e:
            print(e)


if __name__ == "__main__":
    # Proxy address from the original post; replace it with a working proxy,
    # or remove these four lines to connect directly.
    proxy = {"http": "118.31.220.3:8080"}
    proxy_support = request.ProxyHandler(proxy)
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)
    beginPage = int(input("Enter the start page: "))
    endPage = int(input("Enter the end page: "))
    url = "http://www.jokeji.cn/hot.asp?"
    jokeji(url, beginPage, endPage)
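
The XPath expressions in the source above can be tried offline, without hitting the site. The sketch below runs the same expressions against a small inline HTML sample. The markup in SAMPLE_HTML and the helper name extract_rows are made up for illustration; the sample is only loosely modelled on what the expressions imply, and the real jokeji.cn page structure may differ. It also shows why the per-entry queries start with `.//`: a path starting with `//` is evaluated from the document root even when called on a sub-element.

```python
from lxml import etree

# Hypothetical markup, loosely modelled on the XPath expressions used in
# the crawler; the real jokeji.cn list page may be structured differently.
SAMPLE_HTML = """
<html><body>
<table width="646">
  <tr>
    <td>1</td>
    <td><a class="main_14" href="/jokehtml/a/1.htm">First joke</a></td>
    <td>1024</td>
    <td><span>2019-01-01</span></td>
  </tr>
</table>
<table width="646">
  <tr>
    <td>2</td>
    <td><a class="main_14" href="/jokehtml/b/2.htm">Second joke</a></td>
    <td>2048</td>
    <td><span>2019-01-02</span></td>
  </tr>
</table>
</body></html>
"""


def extract_rows(html_text):
    """Return one dict per entry table (illustrative helper, not from the post)."""
    root = etree.HTML(html_text)
    rows = []
    for table in root.xpath('//table[@width="646"]'):
        rows.append({
            # Relative paths (leading ".") search only inside this table.
            "title": table.xpath('.//td/a')[0].text,
            "views": table.xpath('.//td')[2].text,
            "date": table.xpath('.//td/span')[0].text,
            "href": table.xpath('.//td[2]/a[@class="main_14"]/@href')[0],
        })
    return rows


def count_cells(html_text):
    """Compare a relative and an absolute query issued from the first table."""
    first = etree.HTML(html_text).xpath('//table[@width="646"]')[0]
    relative = len(first.xpath('.//td'))  # cells inside this table only
    absolute = len(first.xpath('//td'))   # cells in the whole document
    return relative, absolute


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
    print(count_cells(SAMPLE_HTML))
```

Dropping the leading dot in the loop would make every iteration return the first entry's title, since `//td/a` always matches from the top of the document.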
