Scraping Hangzhou rental listings from 5i5j (我爱我家) with XPath

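Analyzing the listing page

The page is loaded with a GET request. The next step is to find the module that contains the information we need.

The page data we need is in the Doc module (the document request shown in the browser's developer tools). Find this module, analyze its request, and reproduce that request with requests.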

import requests
# This import is the next-stage function, defined later in the post
from arearenthouselistpage_5i5j import get5i5jhtml_str

# Cookie copied from a browser session; it is shared by the first-page and later-page requests
ziru_cookie = (
    '_Jo0OQK=394A86405E728B2E2C333E6CA8A18ACC0BB3A7A9FE20C59CA7E06E6E453EBF603D36AEB18A30FA137670A679327C2C727BFB98D739C1FF2EFE4CE4D9489BE35B0B79FC0DF34BBE505AF02631C467319B15B02631C467319B15B869297F6895F5D91GJ1Z1Jg==; '
    'PHPSESSID=h706jmshf4gl3oo07fodgfj7ur; '
    'yfx_c_g_u_id_10000001=_ck19071814495010631508206604565; '
    'yfx_mr_f_n_10000001=baidu%3A%3Amarket_type_cpc%3A%3A%3A%3Abaidu_ppc%3A%3A%25e7%25a7%259f%25e6%2588%25bf%3A%3A%3A%3A%25E5%2587%25BA%25E7%25A7%259F%25E5%258D%2595%25E9%2597%25B4%3A%3Awww.baidu.com%3A%3A93455795538%3A%3A%3A%3A%25E7%25A7%259F%25E6%2588%25BF-%25E9%2580%259A%25E7%2594%25A8%25E8%25AF%258D%3A%3A%25E6%2588%25BF%25E5%25B1%258B%25E5%2587%25BA%25E7%25A7%259F%3A%3A75%3A%3Apmf_from_adv%3A%3Ahz.5i5j.com%2Fzufang%2F; '
    '_ga=GA1.2.2016664089.1563432591; '
    '_gid=GA1.2.1340751855.1563432591; '
    'domain=hz; '
    'baidu_OCPC_pc=9b0365a45d1dd407fcb10009db737436e9719374b2a29e9da9edf86537231394a%3A2%3A%7Bi%3A0%3Bs%3A13%3A%22baidu_OCPC_pc%22%3Bi%3A1%3Bs%3A178%3A%22%22https%3A%5C%2F%5C%2Fhz.5i5j.com%5C%2F%3Fpmf_group%3Dbaidu%26pmf_medium%3Dppzq%26pmf_plan%3D%25E5%25B7%25A6%25E4%25BE%25A7%25E6%25A0%2587%25E9%25A2%2598%26pmf_unit%3D%25E6%25A0%2587%25E9%25A2%2598%26pmf_keyword%3D%25E6%25A0%2587%25E9%25A2%2598%26pmf_account%3D170%22%22%3B%7D; '
    'yfx_f_l_v_t_10000001=f_t_1563432589785__r_t_1563432589785__v_t_1563437789447__r_c_0; '
    'yfx_mr_n_10000001=baidu%3A%3Amarket_type_ppzq%3A%3A%3A%3Abaidu_ppc%3A%3A%25e6%2588%2591%25e7%2588%25b1%25e6%2588%2591%25e5%25ae%25b6%3A%3A%3A%3A%25E6%25A0%2587%25E9%25A2%2598%3A%3Asp0.baidu.com%3A%3A%3A%3A%3A%3A%25E5%25B7%25A6%25E4%25BE%25A7%25E6%25A0%2587%25E9%25A2%2598%3A%3A%25E6%25A0%2587%25E9%25A2%2598%3A%3A170%3A%3Apmf_from_adv%3A%3Ahz.5i5j.com%2F; '
    'yfx_key_10000001=%25e6%2588%2591%25e7%2588%25b1%25e6%2588%2591%25e5%25ae%25b6; '
    'isClose=yes; '
    'Hm_lvt_94ed3d23572054a86ed341d64b267ec6=1563432591,1563437790; '
    'Hm_lpvt_94ed3d23572054a86ed341d64b267ec6=1563437845'
)
ziru_user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'

# Written as a function so the crawler can run in multiple threads, one per district
def rentwhere(threadName, location, area):
    # URL of the first page; the first page has no page marker in its path
    ziru_request_url = 'https://' + area + '.5i5j.com/zufang/' + location + '/'
    # Mimic the request headers the browser sends
    ziru_headers_not_page_change = {
        'Cookie': ziru_cookie,
        'Referer': 'https://hz.5i5j.com/?pmf_group=baidu&pmf_medium=ppzq&pmf_plan=%E5%B7%A6%E4%BE%A7%E6%A0%87%E9%A2%98&pmf_unit=%E6%A0%87%E9%A2%98&pmf_keyword=%E6%A0%87%E9%A2%98&pmf_account=170',
        'User-Agent': ziru_user_agent,
        # The original hard-coded 'hz.5i5j.com'; Host has to match the requested city subdomain
        'Host': area + '.5i5j.com',
    }
    # Send the request and take the text of the page
    html_str = requests.get(url=ziru_request_url, headers=ziru_headers_not_page_change).text
    # Pass the page on to the next-stage function.
    # When you are just starting to build the crawler, print html_str here instead.
    get5i5jhtml_str(html_str, x='', location=location, area=area)
    # Walk through the remaining pages of this district
    for x in range(100):
        index1 = str(x + 2)
        # Build the URL for page 2 and later
        ziru_request_url_more = 'https://' + area + '.5i5j.com/zufang/' + location + '/n' + index1 + '/'
        # Build the request headers for page 2 and later (no Referer)
        ziru_headers_not_page_change_more = {
            'Cookie': ziru_cookie,
            'User-Agent': ziru_user_agent,
            'Host': area + '.5i5j.com',
        }
        # Send the request and get the listing page
        html_str = requests.get(url=ziru_request_url_more, headers=ziru_headers_not_page_change_more).text
        # Pass it on
        get5i5jhtml_str(html_str, 'n' + index1 + '/', location, area)

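I don't recommend writing it exactly the way I have here; this is what the code looked like after testing. The first time you write it, print the HTML text of the first listing page when you fetch it, so that you can use that text to test the next-stage function. Otherwise every test run sends another request, and the site can easily detect the crawler.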
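
One way to follow that advice is to save the first response to disk and reuse it while developing the parser. The sketch below is only an illustration: `load_or_fetch` and the cache path are hypothetical names that are not part of the original code, and the real fetch would be the `requests.get(...).text` call shown earlier.

```python
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], str]) -> str:
    """Return the cached HTML if it exists; otherwise call fetch() once and save the result."""
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    html_str = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(html_str, encoding="utf-8")
    return html_str


# Hypothetical usage while testing the next-stage function:
# html_str = load_or_fetch(
#     Path("cache/shangchengqu_page1.html"),
#     lambda: requests.get(url=ziru_request_url, headers=ziru_headers_not_page_change).text,
# )
```

With this in place, only the first run touches the site; later runs read the saved file.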

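Finding the detail-page URLs in the listing page

The path information we need can be found here.
Note that this path is taken from the Doc response. Do not simply right-click and inspect the element, because what the inspector shows is not necessarily the format in which the backend sent the data to the frontend.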

# Import the parser that provides XPath
from lxml import etree
import requests
from infopage_5i5j import getinfo
import time
import random

# XPath that selects the detail-page links
item_href = '//div[@class="list-con-box"]//h3/a/@href'

# Cookie copied from a browser session, used for the detail-page requests
detail_cookie = (
    '_Jo0OQK=394A86405E728B2E2C333E6CA8A18ACC0BB3A7A9FE20C59CA7E06E6E453EBF603D36AEB18A30FA137670A679327C2C727BFB98D739C1FF2EFE4CE4D9489BE35B0B79FC0DF34BBE505AF02631C467319B15B02631C467319B15B869297F6895F5D91GJ1Z1Jg==; '
    'PHPSESSID=h706jmshf4gl3oo07fodgfj7ur; '
    'yfx_c_g_u_id_10000001=_ck19071814495010631508206604565; '
    'yfx_mr_f_n_10000001=baidu%3A%3Amarket_type_cpc%3A%3A%3A%3Abaidu_ppc%3A%3A%25e7%25a7%259f%25e6%2588%25bf%3A%3A%3A%3A%25E5%2587%25BA%25E7%25A7%259F%25E5%258D%2595%25E9%2597%25B4%3A%3Awww.baidu.com%3A%3A93455795538%3A%3A%3A%3A%25E7%25A7%259F%25E6%2588%25BF-%25E9%2580%259A%25E7%2594%25A8%25E8%25AF%258D%3A%3A%25E6%2588%25BF%25E5%25B1%258B%25E5%2587%25BA%25E7%25A7%259F%3A%3A75%3A%3Apmf_from_adv%3A%3Ahz.5i5j.com%2Fzufang%2F; '
    '_ga=GA1.2.2016664089.1563432591; '
    '_gid=GA1.2.1340751855.1563432591; '
    'domain=hz; '
    'baidu_OCPC_pc=9b0365a45d1dd407fcb10009db737436e9719374b2a29e9da9edf86537231394a%3A2%3A%7Bi%3A0%3Bs%3A13%3A%22baidu_OCPC_pc%22%3Bi%3A1%3Bs%3A178%3A%22%22https%3A%5C%2F%5C%2Fhz.5i5j.com%5C%2F%3Fpmf_group%3Dbaidu%26pmf_medium%3Dppzq%26pmf_plan%3D%25E5%25B7%25A6%25E4%25BE%25A7%25E6%25A0%2587%25E9%25A2%2598%26pmf_unit%3D%25E6%25A0%2587%25E9%25A2%2598%26pmf_keyword%3D%25E6%25A0%2587%25E9%25A2%2598%26pmf_account%3D170%22%22%3B%7D; '
    'yfx_f_l_v_t_10000001=f_t_1563432589785__r_t_1563432589785__v_t_1563437789447__r_c_0; '
    'yfx_mr_n_10000001=baidu%3A%3Amarket_type_ppzq%3A%3A%3A%3Abaidu_ppc%3A%3A%25e6%2588%2591%25e7%2588%25b1%25e6%2588%2591%25e5%25ae%25b6%3A%3A%3A%3A%25E6%25A0%2587%25E9%25A2%2598%3A%3Asp0.baidu.com%3A%3A%3A%3A%3A%3A%25E5%25B7%25A6%25E4%25BE%25A7%25E6%25A0%2587%25E9%25A2%2598%3A%3A%25E6%25A0%2587%25E9%25A2%2598%3A%3A170%3A%3Apmf_from_adv%3A%3Ahz.5i5j.com%2F; '
    'yfx_key_10000001=%25e6%2588%2591%25e7%2588%25b1%25e6%2588%2591%25e5%25ae%25b6; '
    'isClose=yes; '
    'Hm_lvt_94ed3d23572054a86ed341d64b267ec6=1563432591,1563437790; '
    'ershoufang_BROWSES=31129068; '
    'zufang_BROWSES=90198704; '
    'Hm_lpvt_94ed3d23572054a86ed341d64b267ec6=1563438731'
)

def get5i5jhtml_str(html_str, x, location, area):
    # Fixed leading part of every detail-page URL
    basic_url = 'https://' + area + '.5i5j.com'
    # Pick a random delay of 1 to 3 seconds and wait
    delay = random.uniform(1, 3)
    time.sleep(delay)
    # Parse the listing-page text into an HTML tree
    html = etree.HTML(html_str)
    # Find the detail-page URLs in the listing page
    html_href = html.xpath(item_href)
    # Show which district's thread is running
    print('change page' + location)
    for each in html_href:
        # Build the forged request headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
            'Host': area + '.5i5j.com',
            'Referer': 'https://' + area + '.5i5j.com/zufang/' + location + '/' + x,
            'Cookie': detail_cookie,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        }
        # Request the detail page
        html_info = requests.get(url=basic_url + each, headers=headers).text
        # Pass it on to the next stage
        getinfo(html_info, location, area)

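The detail-page URLs extracted here are missing a fixed leading part (the scheme and host), so we have to add it back. The main job of this function is to take the listing-page HTML, fetch the HTML of each detail page, and pass it on.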
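
As a small illustration of adding the fixed part back, the standard library's `urljoin` can do the concatenation and also copes with an href that is already absolute. This is a sketch rather than the original approach (the original uses plain string concatenation); the helper name and the sample href are made up for the example.

```python
from urllib.parse import urljoin


def absolute_detail_urls(area, hrefs):
    """Turn relative detail-page hrefs into full URLs for the given city subdomain."""
    base = "https://" + area + ".5i5j.com"
    return [urljoin(base, href) for href in hrefs]


# Hypothetical href, shaped like a relative link extracted from a listing page
print(absolute_detail_urls("hz", ["/zufang/90198704.html"]))
```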

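Analyzing the detail page and extracting the information we want

Since the previous function fetched the whole HTML, we can look for the information we want directly in the Elements panel.

The information we want is inside p tags and li tags. Because a MongoDB document has to be a dictionary, don't forget to collect the keys along with the values.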

from lxml import etree
# I also save a CSV file so the results are easier to look at
import csv
# Library for talking to MongoDB
import pymongo

# Merge two lists of equal length into one dict
def listtodict(keylist, valuelist):
    result = {}
    if len(keylist) == len(valuelist):
        for i in range(len(keylist)):
            result[keylist[i]] = valuelist[i]
        return result
    else:
        return 'two list not match'

def getinfo(html_str, location, area):
    # Parse the HTML text passed in from the previous stage
    html = etree.HTML(html_str)
    # Key names from the p tags
    html_block1_info_key = html.xpath('//div[@class="housesty"]//p[@class="cjname"]/text()')
    # Holds every key name we collect
    headers = []
    # Connect to the database before the loops
    client = pymongo.MongoClient('localhost', 27017)
    # One database per city
    db = client[area + '_5i5j']
    # Add the p-tag key names to the list
    for each in html_block1_info_key:
        headers.append(each)
    # Values from the p tags; the page uses two different class names for them
    html_block1_info_value1 = html.xpath('//div[@class="housesty"]//p[@class="jlinfo"]/text()')
    html_block1_info_value2 = html.xpath('//div[@class="housesty"]//p[@class="jlinfo font18"]/text()')
    # Values from the li tags
    html_block2_info_value1 = html.xpath('//div[@class="zushous"]/ul/li/a/text()')
    html_block2_info_value2 = html.xpath('//div[@class="zushous"]/ul/li/text()')
    # Key names from the li tags
    html_block2_info_key = html.xpath('//div[@class="zushous"]/ul/li/span/text()')
    # Add the li-tag key names to the list
    for each1 in html_block2_info_key:
        headers.append(each1)
    # Reorder the key names so they line up with the order of the values
    current_headers = headers[2]
    headers[2] = headers[1]
    headers[1] = current_headers
    # Rent (yuan per month)
    headers[0] = '租金(元/月）'
    # Holds the values
    values = []
    # Open the CSV file for appending
    with open('csv/5i5j_python_' + location + '.csv', 'a+', newline='') as csvfile:
        writer = csv.writer(csvfile)
        # XPath returns lxml string objects; convert each to a plain str, keeping the original order
        for block in (html_block1_info_value1, html_block1_info_value2,
                      html_block2_info_value1, html_block2_info_value2):
            for text in block:
                values.append(str(text))
        # Combine the two lists into one dict
        dict1 = listtodict(keylist=headers, valuelist=values)
        # Print the data we got
        print(dict1)
        # Write the values to the CSV file
        writer.writerow(values)
        # One collection per district
        collection = db[location + '_5i5j']
        try:
            # Insert into the collection; the try keeps one bad record from stopping the run
            collection.insert_one(dict1)
        except Exception as e:
            print(e)
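
The key/value pairing step can also be written with `zip`. The sketch below is an alternative under one assumption that differs from the code above: on a length mismatch it raises an error instead of returning a string, so a malformed record cannot be passed on by accident. `pair_fields` and the sample field names are hypothetical.

```python
def pair_fields(keys, values):
    """Pair scraped key names with their values; refuse lists of different length."""
    if len(keys) != len(values):
        raise ValueError(
            "got %d keys but %d values" % (len(keys), len(values))
        )
    return dict(zip(keys, values))


# Made-up sample data shaped like one detail page's fields
record = pair_fields(["rent", "layout"], ["3500", "2室1厅"])
print(record)
```

A caller can catch `ValueError`, log the page, and skip it, which keeps half-parsed pages out of the database.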

Using a separate thread for each district

import threading
from mainpage_5i5j import rentwhere
# Thread for Shangcheng District
t1=threading.Thread(target=rentwhere,args=('thread_1','shangchengqu','hz'))
# Thread for Xiacheng District
t2=threading.Thread(target=rentwhere,args=('thread_2','xiachengqu','hz'))
# Thread for Binjiang District
t3=threading.Thread(target=rentwhere,args=('thread_3','binjiangqu','hz'))
# Thread for Xiaoshan District
t4=threading.Thread(target=rentwhere,args=('thread_4','xiaoshanqu','hz'))
# Thread for Xihu District
t5=threading.Thread(target=rentwhere,args=('thread_5','xihuqu','hz'))
# Thread for Gongshu District
t6=threading.Thread(target=rentwhere,args=('thread_6','gongshuqu','hz'))
# Thread for Yuhang District
t7=threading.Thread(target=rentwhere,args=('thread_7','yuhangqu','hz'))
# Thread for Jianggan District
t8=threading.Thread(target=rentwhere,args=('thread_8','jiangganqu','hz'))
# Thread for Qiantang New Area
t9=threading.Thread(target=rentwhere,args=('thread_9','qiantangxinqu','hz'))
# Start the threads
t1.start()
t2.start()
t3.start()
t4.start()
t5.start()
t6.start()
t7.start()
t8.start()
t9.start()
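
The nine threads can also be created from a list of district names and joined at the end, so the main program knows when every district has finished. This is a sketch, not the original script: `run_districts` is a hypothetical helper, and `fake_rentwhere` stands in for the real `rentwhere` so the example runs without touching the site.

```python
import threading

# District slugs taken from the thread list above
DISTRICTS = [
    'shangchengqu', 'xiachengqu', 'binjiangqu', 'xiaoshanqu', 'xihuqu',
    'gongshuqu', 'yuhangqu', 'jiangganqu', 'qiantangxinqu',
]


def run_districts(worker, area='hz'):
    """Start one thread per district, then wait for all of them to finish."""
    threads = []
    for number, location in enumerate(DISTRICTS, start=1):
        thread = threading.Thread(
            target=worker, args=('thread_%d' % number, location, area)
        )
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()


finished = []
finished_lock = threading.Lock()


def fake_rentwhere(threadName, location, area):
    # Stand-in for rentwhere: only records that it was called
    with finished_lock:
        finished.append((threadName, location, area))


run_districts(fake_rentwhere)
print(len(finished))
```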

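Really, I could run each district's function in its own process and use threads for the different pages within a district. But I haven't used processes much, and the data hasn't been backed up yet, so I haven't written that version.
This code also takes the city as a parameter. Unfortunately, the detail pages for Beijing are structured differently from those for Hangzhou, so it did not work for other cities. That could be handled by adding a case statement that checks which city is being scraped, but I haven't done that yet either. So the code is not actually complete.
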
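
One possible shape for the per-city check is a lookup table that maps each city code to its own detail-page parser. Everything in this sketch is hypothetical: the parser below is a placeholder that only reports the page length, and no Beijing parser is provided because the post does not describe that page's structure.

```python
def parse_hz_detail(html_str):
    # Placeholder for the Hangzhou parsing logic shown earlier
    return {'area': 'hz', 'chars': len(html_str)}


# City code -> parser; other cities would be added once their pages are analyzed
DETAIL_PARSERS = {
    'hz': parse_hz_detail,
}


def parse_detail(area, html_str):
    """Pick the parser for this city, failing loudly for cities not handled yet."""
    parser = DETAIL_PARSERS.get(area)
    if parser is None:
        raise NotImplementedError('no detail-page parser for city: ' + area)
    return parser(html_str)


print(parse_detail('hz', '<html></html>'))
```

Failing loudly for an unknown city makes it obvious which pages still need their own XPath expressions.
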
Results

Viewing the data in MongoDB

The CSV file