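Python crawler (14): scraping Taobao MM profiles and photos (Part 1)

Python crawler (14): scraping Taobao MM profiles and photos (Part 2)

Python crawler (14): scraping Taobao MM profiles and photos (Part 3) (Windows version)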

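I came across Python programs online that fetch Taobao MM photos, and couldn't resist working through one myself.

It turns out the Taobao pages have changed since then: simply downloading one of those programs is not enough, because it no longer runs.

So here I rewrite the code with a new approach to fetch the Taobao MM photos.

The idea is to visit the main list page and get the home page URL of every model on the current page. On each model's home page, grab her profile information and avatar and save them, and also get the address of her personal album page.

On the album page, get the number of albums, then the name of each album and how many photos it holds, and finally enter each album to collect the address of every photo in it.

1. Getting the entry URLs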

#!/usr/bin/python
#coding=utf-8
__author__ = 'Jimy_fengqi'

from selenium import webdriver
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

class TaoBaoSpider:
    def __init__(self):
        # Starting page number
        self.page=1
        # Name of the folder used for storage
        self.dirName='Jimy_fengqi'
        # Create the webdriver shared by all methods
        self.driver = webdriver.PhantomJS()

    # Entry point: load each list page in turn
    def getContent(self,maxPage):
        for index in range(1,maxPage+1):
            # Prints "Now on page N"
            print '当前是第%d页' % index
            self.getMMurl(index)
        self.driver.quit()

    # Load the list page and find the entry to each MM's personal home page
    def getMMurl(self,index):
        url="https://mm.taobao.com/json/request_top_list.htm?page="+str(index)
        # Load the page
        self.driver.get(url)
        # Find every personal home page link on the current page, matched via XPath
        items=self.driver.find_elements_by_xpath('//div[@class="list-item"]/div[1]/div[1]/p/a')
        mmUrls=[]
        for item in items:
            # Rewrite the captured URL so it points at the model_info page
            MMurl= item.get_attribute('href').replace("model_card","model_info")
            mmUrls.append(MMurl)
            print MMurl

spider=TaoBaoSpider()
spider.getContent(1)

The output is:

当前是第1页
https://mm.taobao.com/self/model_info.htm?user_id=687471686
https://mm.taobao.com/self/model_info.htm?user_id=405095521
https://mm.taobao.com/self/model_info.htm?user_id=631300490
https://mm.taobao.com/self/model_info.htm?user_id=414457129
https://mm.taobao.com/self/model_info.htm?user_id=141234233
https://mm.taobao.com/self/model_info.htm?user_id=96614110
https://mm.taobao.com/self/model_info.htm?user_id=37448401
https://mm.taobao.com/self/model_info.htm?user_id=74386764
https://mm.taobao.com/self/model_info.htm?user_id=523216808
https://mm.taobao.com/self/model_info.htm?user_id=46599595
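
The URL handling above can be tried without a browser. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used in this article. The helper names to_model_info_url and extract_user_id are hypothetical, and the model_card URL in the example is an assumption obtained by reversing the replace() call, not a link taken from the live site.

```python
from urllib.parse import urlparse, parse_qs


def to_model_info_url(card_url):
    """Rewrite a model_card link into the matching model_info link."""
    return card_url.replace("model_card", "model_info")


def extract_user_id(profile_url):
    """Return the user_id query parameter as a string, or None if absent."""
    query = parse_qs(urlparse(profile_url).query)
    values = query.get("user_id")
    return values[0] if values else None


if __name__ == "__main__":
    # Assumed shape of the link found on the list page.
    card = "https://mm.taobao.com/self/model_card.htm?user_id=687471686"
    info = to_model_info_url(card)
    print(info)
    print(extract_user_id(info))
```

Keeping the user_id around can be handy later, for example as a folder name that is unique even when two models share a display name.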

2. Loading an MM's personal home page

We simply keep building on the function above. The code is as follows:

#!/usr/bin/python
#coding=utf-8
__author__ = 'Jimy_fengqi'

import urllib2,re,os,datetime,sys,time
from selenium import webdriver
from bs4 import BeautifulSoup as BS
reload(sys)
sys.setdefaultencoding('utf-8')

class TaoBaoSpider:
    def __init__(self):
        # Starting page number
        self.page=1
        # Name of the folder used for storage
        self.dirName='Jimy_fengqi'
        # Create two webdrivers so the list page and the detail pages
        # do not compete for one browser
        self.driver = webdriver.PhantomJS()
        self.driver_detail=webdriver.PhantomJS()

    # Entry point: load each list page in turn
    def getContent(self,maxPage):
        for index in range(1,maxPage+1):
            # Prints "Now on page N"
            print '当前是第%d页' % index
            self.getMMurl(index)
        self.driver.quit()

    # Load the list page and find the entry to each MM's personal home page
    def getMMurl(self,index):
        url="https://mm.taobao.com/json/request_top_list.htm?page="+str(index)
        # Load the page
        self.driver.get(url)
        # Find every personal home page link on the current page
        items=self.driver.find_elements_by_xpath('//div[@class="list-item"]/div[1]/div[1]/p/a')
        mmUrls=[]
        for item in items:
            # Rewrite the captured URL so it points at the model_info page
            MMurl= item.get_attribute('href').replace("model_card","model_info")
            mmUrls.append(MMurl)
            print MMurl
            self.getMMdetail(MMurl)

    # Load the personal page (this version still uses self.driver)
    def getMMdetail(self,mmUrl):
        self.driver.get(mmUrl)
        print self.driver.current_url

spider=TaoBaoSpider()
spider.getContent(1)

The logic looks reasonable enough, but here is the problem. Running the code prints the following:

当前是第1页
https://mm.taobao.com/self/model_info.htm?user_id=687471686
https://mm.taobao.com/self/model_info.htm?user_id=687471686
Traceback (most recent call last):
  File "12.get_taobao_pic.py", line 49, in <module>
    spider.getContent(1)
  File "12.get_taobao_pic.py", line 27, in getContent
    self.getMMurl(index)
  File "12.get_taobao_pic.py", line 39, in getMMurl
    MMurl= item.get_attribute('href').replace("model_card","model_info")
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 141, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 494, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 192, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: {"errorMessage":"Element does not exist in cache","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:46191","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"GET","url":"/attribute/href","urlParsed":{"anchor":"","query":"","file":"href","directory":"/attribute/","path":"/attribute/href","relative":"/attribute/href","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/attribute/href","queryKey":{},"chunks":["attribute","href"]},"urlOriginal":"/session/810afa50-0ab8-11e7-bd22-c72625d9ecf4/element/:wdc:1489717323088/attribute/href"}}
Screenshot: available via screen

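What does this mean?

The cause is this: items holds references to elements on the list page that self.driver loaded. When getMMdetail uses self.driver to open another page, the same browser navigates away from the list page. On the next loop iteration those element references no longer point at anything, so item.get_attribute('href') raises StaleElementReferenceException. That is why the first URL is processed and the second one fails.

We can use a second webdriver (self.driver_detail, already created in __init__) to visit the new pages.

That is: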

    def getMMdetail(self,mmUrl):
        self.driver_detail.get(mmUrl)
        print self.driver_detail.current_url

With this change the pages load normally.
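
A second driver is not the only way around the stale references. As an alternative sketch (not the approach this article takes), every href can be copied into a plain string while the list page is still loaded; strings stay valid after the browser navigates away, so one driver would be enough. The helper name collect_profile_urls and the FakeLink class are hypothetical. FakeLink only stands in for a Selenium element so the helper can be exercised offline, and the card URL in the example is assumed.

```python
def collect_profile_urls(link_elements):
    """Read every href into a plain string before any further navigation."""
    urls = []
    for element in link_elements:
        href = element.get_attribute("href")
        if href:  # skip anchors that carry no href
            urls.append(href.replace("model_card", "model_info"))
    return urls


class FakeLink(object):
    """Offline stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


if __name__ == "__main__":
    links = [
        FakeLink("https://mm.taobao.com/self/model_card.htm?user_id=687471686"),
        FakeLink(None),
    ]
    for url in collect_profile_urls(links):
        # In the spider this is where self.getMMdetail(url) would be called.
        print(url)
```

Inside getMMurl the loop would then run over collect_profile_urls(items) instead of over items, so no element is touched after the first call to getMMdetail.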

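3. Getting the profile information

Now that the personal home page loads, the next step is to pull some profile information about the MM from it.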

    def getMMdetail(self,mmUrl):
        # Load the personal page
        self.driver_detail.get(mmUrl)
        self.my_print(0,self.driver_detail.current_url)
        # Get the MM's name
        name=self.driver_detail.find_element_by_xpath('//div[@class="mm-p-model-info-left-top"]/dl/dd/a').text
        # Prints "Found an MM named <name> at <url>, crawling..."
        self.my_print(1,u'发现一位MM 名字叫%s 坐标%s 正在爬取。。。' % (name,mmUrl))
        # Get the address of the MM's avatar
        mmicon=self.driver_detail.find_element_by_xpath('//div[@class="mm-p-model-info-left-top"]/dl/dt/a/img').get_attribute('src')
        self.my_print(0, mmicon)
        # Get the profile details, one list item per line
        base_msg=self.driver_detail.find_elements_by_xpath('//div[@class="mm-p-info mm-p-base-info"]/ul/li')
        brief=''
        for item in base_msg:
            brief+=item.text+'\n'

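This function captures the MM's name, her avatar address, and her profile text. The my_print helper it calls is defined in the full listing in section 4.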
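
The loop that assembles the profile text can also be written as a small standalone helper. This is only a sketch: build_brief and FakeItem are hypothetical names, and unlike the loop above it trims whitespace and skips empty list items, which is an assumption about what is wanted rather than something the page requires.

```python
def build_brief(items):
    """Join the text of each profile list item, one item per line."""
    lines = [item.text.strip() for item in items]
    return "".join(line + "\n" for line in lines if line)


class FakeItem(object):
    """Offline stand-in for a Selenium element exposing .text (illustration only)."""

    def __init__(self, text):
        self.text = text


if __name__ == "__main__":
    sample = [FakeItem(" first line "), FakeItem(""), FakeItem("second line")]
    print(build_brief(sample))
```

In getMMdetail the call would be brief = build_brief(base_msg).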

4. Saving the information

With the information collected, we can now save what was captured.

The complete code is as follows:

#!/usr/bin/python
#coding=utf-8
__author__ = 'Jimy_fengqi'

import re,os,datetime,sys,time,urllib
from selenium import webdriver
from bs4 import BeautifulSoup as BS
reload(sys)
sys.setdefaultencoding('utf-8')

class TaoBaoSpider:
    def __init__(self):
        # Starting page number
        self.page=1
        # Name of the folder used for storage
        self.dirName='Jimy_fengqi'
        # Create two webdrivers so the list page and the detail pages
        # do not compete for one browser
        self.driver = webdriver.PhantomJS()
        self.driver_detail=webdriver.PhantomJS()

    # Custom print helper: prints only when is_print is truthy
    def my_print(self,is_print,content):
        if is_print:
            print content
        else:
            return

    # Entry point: load each list page in turn
    def getContent(self,maxPage):
        for index in range(1,maxPage+1):
            # Prints "Now on page N"
            self.my_print(1,u'当前是第%d页' % index)
            self.getMMurl(index)
        self.driver.quit()
        # Also shut down the second browser so no PhantomJS process is left behind
        self.driver_detail.quit()

    # Load the list page and find the entry to each MM's personal home page
    def getMMurl(self,index):
        url="https://mm.taobao.com/json/request_top_list.htm?page="+str(index)
        # Load the page
        self.driver.get(url)
        # Find every personal home page link on the current page
        items=self.driver.find_elements_by_xpath('//div[@class="list-item"]/div[1]/div[1]/p/a')
        mmUrls=[]
        for item in items:
            # Rewrite the captured URL so it points at the model_info page
            MMurl= item.get_attribute('href').replace("model_card","model_info")
            mmUrls.append(MMurl)
            #print MMurl
            self.getMMdetail(MMurl)

    # Load the personal page and extract the details
    def getMMdetail(self,mmUrl):
        self.driver_detail.get(mmUrl)
        self.my_print(0,self.driver_detail.current_url)
        # Get the MM's name
        name=self.driver_detail.find_element_by_xpath('//div[@class="mm-p-model-info-left-top"]/dl/dd/a').text
        # Prints "Found an MM named <name> at <url>, crawling..."
        self.my_print(1,u'发现一位MM 名字叫%s 坐标%s 正在爬取。。。' % (name,mmUrl))
        # Get the address of the MM's avatar
        mmicon=self.driver_detail.find_element_by_xpath('//div[@class="mm-p-model-info-left-top"]/dl/dt/a/img').get_attribute('src')
        self.my_print(0, mmicon)
        # Get the profile details, one list item per line
        base_msg=self.driver_detail.find_elements_by_xpath('//div[@class="mm-p-info mm-p-base-info"]/ul/li')
        brief=''
        for item in base_msg:
            brief+=item.text+'\n'
        path=self.saveBriefInfo(name,mmicon,str(brief),mmUrl)

    def saveBriefInfo(self,name,mmicon,brief,mmUrl):
        path=self.dirName+'/'+name
        path=path.strip()
        # Create the directory
        if not os.path.exists(path):
            os.makedirs(path)
        # Save the avatar
        iconpath=path+'/'+name+'.jpg'
        urllib.urlretrieve(mmicon, iconpath)
        # Save the profile text
        fileName=path+'/'+name+'.txt'
        with open(fileName,'w+') as f:
            # Prints "Saving <name>'s profile to <path>"
            self.my_print(1,u'正在保存%s的个人信息到%s'%(name,path))
            f.write(brief.encode('utf-8'))
            # Label written to the file: "Personal home page address: "
            mmLocation=u"个人主页地址为：" + mmUrl
            f.write(mmLocation)
        return path

if __name__ == '__main__':
    print '''
*****************************************
**    Welcome to Spider for TaobaoMM   **
**      Created on 2017-3-17           **
**      @author: Jimy_fengqi           **
*****************************************'''
    spider=TaoBaoSpider()
    spider.getContent(1)

Output of the run:

*****************************************
**    Welcome to Spider for TaobaoMM   **
**      Created on 2017-3-15           **
**      @author: Jimy_fengqi           **
*****************************************
当前是第1页
发现一位MM 名字叫田媛媛 坐标https://mm.taobao.com/self/model_info.htm?user_id=687471686 正在爬取。。。
正在保存田媛媛的个人信息到Jimy_fengqi/田媛媛
发现一位MM 名字叫v悦悦 坐标https://mm.taobao.com/self/model_info.htm?user_id=405095521 正在爬取。。。
正在保存v悦悦的个人信息到Jimy_fengqi/v悦悦
发现一位MM 名字叫崔辰辰 坐标https://mm.taobao.com/self/model_info.htm?user_id=631300490 正在爬取。。。
正在保存崔辰辰的个人信息到Jimy_fengqi/崔辰辰
发现一位MM 名字叫大猫儿 坐标https://mm.taobao.com/self/model_info.htm?user_id=414457129 正在爬取。。。
正在保存大猫儿的个人信息到Jimy_fengqi/大猫儿
发现一位MM 名字叫金甜甜 坐标https://mm.taobao.com/self/model_info.htm?user_id=141234233 正在爬取。。。
正在保存金甜甜的个人信息到Jimy_fengqi/金甜甜
发现一位MM 名字叫紫轩 坐标https://mm.taobao.com/self/model_info.htm?user_id=96614110 正在爬取。。。
正在保存紫轩的个人信息到Jimy_fengqi/紫轩
发现一位MM 名字叫谢婷婷 坐标https://mm.taobao.com/self/model_info.htm?user_id=37448401 正在爬取。。。
正在保存谢婷婷的个人信息到Jimy_fengqi/谢婷婷
发现一位MM 名字叫夏晨洁 坐标https://mm.taobao.com/self/model_info.htm?user_id=74386764 正在爬取。。。
正在保存夏晨洁的个人信息到Jimy_fengqi/夏晨洁
发现一位MM 名字叫Cherry 坐标https://mm.taobao.com/self/model_info.htm?user_id=523216808 正在爬取。。。
正在保存Cherry的个人信息到Jimy_fengqi/Cherry
发现一位MM 名字叫雪倩nika 坐标https://mm.taobao.com/self/model_info.htm?user_id=46599595 正在爬取。。。
正在保存雪倩nika的个人信息到Jimy_fengqi/雪倩nika
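
The saving step can be tried on its own, without a browser or a network connection. The following is a minimal sketch for Python 3, not the code used above: safe_name and save_brief are hypothetical helpers, the avatar download is left out, and replacing characters that are not allowed in file names is an added precaution that the original listing does not take.

```python
import io
import os
import re
import tempfile


def safe_name(name):
    """Strip whitespace and replace characters that are unsafe in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name.strip())
    return cleaned or "unnamed"


def save_brief(base_dir, name, brief, profile_url):
    """Write the profile text and the home page URL to <base_dir>/<name>/<name>.txt."""
    folder = os.path.join(base_dir, safe_name(name))
    if not os.path.isdir(folder):
        os.makedirs(folder)
    file_path = os.path.join(folder, safe_name(name) + ".txt")
    # Open in text mode with an explicit encoding instead of encoding by hand.
    with io.open(file_path, "w", encoding="utf-8") as f:
        f.write(brief)
        # Same label the article's code writes ("Personal home page address: ").
        f.write(u"个人主页地址为：" + profile_url)
    return file_path


if __name__ == "__main__":
    base = tempfile.mkdtemp()  # throwaway folder for the demonstration
    print(save_brief(base, "Cherry", u"sample profile line\n",
                     "https://mm.taobao.com/self/model_info.htm?user_id=523216808"))
```

Opening the file with an explicit encoding avoids relying on sys.setdefaultencoding, which does not exist in Python 3.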