Python crawler: scraping text and images together by combining XPath with re

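Once again the target is my hometown's tourism site: http://www.patour.cn/site/pananzxw/tcgl/index.html. Let's scrape the pictures of these local specialties together with their descriptions!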
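As a rough offline illustration of the idea (XPath to locate image tags, a regular expression to pull out text), here is a minimal sketch. It is written for Python 3 and runs on a small inline sample rather than the live site. Several things are assumptions: the `SAMPLE` markup and the `/upload/a.jpg` path are made up, the `<p>(.*?)</p>` pattern is a guess for demonstration only, and the standard library's `xml.etree.ElementTree` stands in for lxml so that nothing needs to be installed. ElementTree supports only a subset of XPath and needs well-formed markup, so the `src` attribute is read with `.get()` instead of `/@src`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a detail page; the real markup may differ.
SAMPLE = """<html><body>
<div class="news_text">
<p><img src="/upload/a.jpg" /></p>
<p>A local specialty.<br /></p>
</div>
</body></html>"""


def image_sources(html):
    """XPath step: return the src of every <img> under div.news_text/p."""
    root = ET.fromstring(html)
    images = root.findall(".//div[@class='news_text']/p/img")
    return [img.get("src") for img in images]


def paragraphs(html):
    """re step: return the text of each <p> with inner tags stripped."""
    # Assumed pattern, chosen for this sample only.
    raw = re.findall(r"<p>(.*?)</p>", html, re.S)
    cleaned = [re.sub(r"<[^>]+>", "", chunk).strip() for chunk in raw]
    # Drop paragraphs that held only an image and therefore no text.
    return [text for text in cleaned if text]


print(image_sources(SAMPLE))
print(paragraphs(SAMPLE))
```

On real pages, which are rarely well-formed XML, `lxml.etree.HTML` is the more forgiving parser.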

Source code:

# -*- coding:utf-8 -*-
# Python 2 script: urllib2 and the print statement are not available in Python 3.
import urllib2
import re
from lxml import etree

# Browser-like User-Agent sent with every request.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"}


class Spider:
    def __init__(self):
        pass

    def loadPage(self):
        # Download the source of the index page.
        url = 'http://www.patour.cn/site/pananzxw/tcgl/index.html'
        request = urllib2.Request(url, headers=HEADERS)
        response = urllib2.urlopen(request)
        html = response.read()
        self.getfullUrl(html)
        #print html

    def getfullUrl(self, html):
        # Use XPath to pull out the links to the detail pages.
        content = etree.HTML(html)
        link_list = content.xpath('//div[@class="box_con"]/a[@class="mtit"]/@href')
        #print link_list
        for item in link_list:
            full_url = "http://www.patour.cn" + str(item)
            #print full_url
            self.loadlittlePage(full_url)

    def loadlittlePage(self, url):
        # Download the source of one detail page.
        request = urllib2.Request(url, headers=HEADERS)
        html_little = urllib2.urlopen(request).read()
        #print html_little
        self.getImageUrl(html_little)
        self.getWenzi(html_little)

    def getImageUrl(self, html):
        # Use XPath to find the image URLs.
        content = etree.HTML(html)
        link_list = content.xpath('//div[@class="news_text"]/p/img/@src')
        for item in link_list:
            fullImage_url = "http://www.patour.cn" + str(item)
            #print fullImage_url
            self.loadImage(fullImage_url)  # download the image

    def getWenzi(self, html):
        # Extract the text ("wenzi") with a regular expression.
        # NOTE: the HTML tags that surrounded the capture group were lost when
        # this post was extracted; only the group itself survives.
        pattern = re.compile('(.*?)', re.S)
        content_list = pattern.findall(html)
        for content in content_list:
            #print content
            # NOTE: the two substrings stripped here were also lost in extraction.
            content = content.replace("", "").replace("", "")
            self.loadWenzi(content)

    def loadWenzi(self, content):
        # Append the extracted text to a file.
        with open("wenzi.txt", "a") as f:
            f.write(content)

    def loadImage(self, link):
        # Download one image and save it in the working directory.
        request = urllib2.Request(link, headers=HEADERS)
        image = urllib2.urlopen(request).read()
        filename = link[-15:]  # last 15 characters of the URL as the file name
        with open(filename, 'wb') as f:
            f.write(image)
        print 'Download succeeded!'


if __name__ == "__main__":
    techanspider = Spider()
    techanspider.loadPage()
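The script builds absolute URLs by string concatenation and names each image after the last 15 characters of its link. One possible alternative, sketched here for Python 3, is to resolve links with `urljoin` and take the file name from the URL path. The helper names `absolute_url` and `filename_from_url` and the example paths are hypothetical; only the base URL comes from the post.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# Index page from the post; links found on it are resolved against this URL.
BASE = "http://www.patour.cn/site/pananzxw/tcgl/index.html"


def absolute_url(href, base=BASE):
    """Resolve a relative or root-relative href against the page it came from."""
    return urljoin(base, href)


def filename_from_url(url):
    """Use the last path segment as the file name, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


# Hypothetical hrefs, for illustration only.
print(absolute_url("/upload/pic/abc.jpg"))
print(absolute_url("detail.html"))
print(filename_from_url("http://www.patour.cn/upload/pic/abc.jpg?x=1"))
```

Unlike `link[-15:]`, this does not depend on every file name having the same length, and unlike plain concatenation it also handles hrefs that are relative to the current directory or already absolute.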

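Result: the extracted text is appended to wenzi.txt, and each image is saved in the working directory under the last 15 characters of its URL.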
