Problems encountered while scraping rental listings from 5i5j (我爱我家), with code

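I have been practicing web scraping lately and ran into problems on a few sites. This post summarizes the ones I hit on 5i5j (我爱我家).

In my earlier exercises XPath was the method I used most, and this time is no exception.

All right, let's start scraping.

Step 1: find the entry URL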

https://bj.5i5j.com/zufang/

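We crawl the listings district by district, which adds a little difficulty.

Problem 1: the very first request often fails to load the page.

Either we get a blank page back, or the request does not get through at all.

Yet sometimes the same request succeeds. It looks random, which is annoying.

Problem 2

Once a page does load, some fields come back empty or cannot be found during extraction.

This can be solved with a small empty-check helper that returns the first match, or an empty string when the XPath query finds nothing.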
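As a minimal sketch of that idea (the name `first_or_default` and its `default` parameter are hypothetical; the post's own version is the `iskong` method in the full code), the helper only needs to guard the index access, since an XPath query returns an empty list when the node is missing:

```python
def first_or_default(values, default=""):
    """Return the first item of a list-like XPath result, or `default` if it is empty."""
    return values[0] if values else default


# A field that was found, and one that was missing from the page
found = first_or_default(["Chaoyang", "extra"])
missing = first_or_default([], "n/a")
print(found, missing)
```

Returning a default instead of raising keeps one missing field from aborting the rest of the listings on the page.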

So how did we deal with the first problem? We wrote a function that retries the request: one attempt may fail, but what about a second or a third?
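The retry idea can be sketched in isolation as follows. This is an illustration under assumptions, not the exact method from the post: `fetch_with_retry`, its `attempts` and `delay` parameters, and the `fake_fetch` stub are hypothetical names, and the `re.S` flag is an addition. The `<title>` check and the redirect pattern are the ones the post's code uses to tell a real page from a blocked one.

```python
import re
import time

# A blocked response carries the real address inside a <script> redirect.
REDIRECT_RE = re.compile(r"<script>.*?href='(.*?)';", re.S)


def fetch_with_retry(fetch, url, attempts=4, delay=0.0):
    """Call fetch(url) up to `attempts` times until the HTML contains a <title>.

    Returns the page HTML, or None if every attempt failed.
    """
    for _ in range(attempts):
        html = fetch(url)
        if "<title>" in html:
            return html
        match = REDIRECT_RE.search(html)
        if match:
            url = match.group(1)  # follow the redirect on the next attempt
        time.sleep(delay)
    return None


# Stub standing in for a real HTTP call: blank page, redirect page, real page.
responses = iter([
    "",
    "<script>window.location.href='https://example.com/real';</script>",
    "<html><title>ok</title></html>",
])
visited = []


def fake_fetch(url):
    visited.append(url)
    return next(responses)


page = fetch_with_retry(fake_fetch, "https://example.com/start")
print(visited)
```

Passing the fetch function in as an argument keeps the retry logic testable without touching the network; in the real crawler it would be the method that calls `requests.get`.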

Here is the code:

import re
import threading
import time
from queue import Empty, Queue

import requests
from lxml import etree

# Serializes writes to the output file across crawler threads
write_lock = threading.Lock()


class Wawj(threading.Thread):
    def __init__(self, name, url_queue):
        super().__init__(name=name)
        # Shared queue of page URLs to crawl
        self.url_queue = url_queue

    def iskong(self, temp_list):
        """Return the first item of an XPath result, or '' if it is empty."""
        if len(temp_list) > 0:
            return temp_list[0]
        return ''

    def run(self):
        while True:
            # Stop when the queue is drained. get_nowait() avoids the race
            # between empty() and get() when several threads share the queue.
            try:
                url = self.url_queue.get_nowait()
            except Empty:
                break
            print(self.name, 'took task:', url)
            self.get_content(url=url)

    def get_content(self, url):
        # Give the page four more chances after the first request
        content = self.request_url(url=url)
        times = 4
        pattern = re.compile(r'<script>.*?href=\'(.*?)\';')
        while times > 0 and '<title>' not in content:
            # A blocked response carries the real address in a <script> redirect;
            # if there is none, simply request the same URL again.
            href = pattern.findall(content)
            content = self.request_url(href[0] if href else url)
            times -= 1
            print(times)
        self.get_data(content)
        return content

    def request_url(self, url):
        # The cookie was copied from a browser session and will have expired;
        # replace it with your own.
        headers = {
            'Cookie': (
                'PHPSESSID=qhu7e2m8q2qe88r85slgensh92; '
                '_ga=GA1.2.1363814418.1557303006; '
                '_gid=GA1.2.1393847228.1557303006; '
                'yfx_c_g_u_id_10000001=_ck19050816100612577584201493913; '
                'Hm_lvt_94ed3d23572054a86ed341d64b267ec6=1557303007; '
                '_Jo0OQK=3CD62F3642CABD4011F22C4A9159FB3A22425BFB3D220139C0679F361E7C08A30EA36E358C8BA464C45028D459BDDFCFBE183E96C00A59DFD84EB50F436F358AF15C57212F12283777C840763663251ADEB840763663251ADEB11EC4D4E6CBC2C3E9C0616B90758FB61GJ1Z1ew==; '
                'domain=bj; '
                'yfx_f_l_v_t_10000001=f_t_1557303006247__r_t_1557303006247__v_t_1557317859010__r_c_0; '
                'Hm_lpvt_94ed3d23572054a86ed341d64b267ec6=' + str(time.time())
            ),
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        }
        try:
            return requests.get(url, headers=headers, timeout=10).content.decode('utf-8')
        except (requests.RequestException, UnicodeDecodeError) as exc:
            # Treat a failed request like a blank page so the caller retries
            print('request failed:', url, exc)
            return ''

    def get_data(self, response):
        try:
            tree = etree.HTML(response)
            total_list = tree.xpath('//div[@class="list-con-box"]/ul[@class="pList"]/li')
            for num, total in enumerate(total_list, start=1):
                # Image URL
                picHerf = total.xpath('.//img[@class="lazy"]/@src | .//img[@class="lazy"]/@data-src')
                picHerf = self.iskong(picHerf)
                # Title
                title = total.xpath('.//div[@class="listCon"]/h3/a/text()')
                title = self.iskong(title)
                # House info
                info = total.xpath('.//div[@class="listX"]/p[position()<2]/text()')
                info = self.iskong(info).replace(" ", "")
                # Address
                addr = total.xpath('.//div[@class="listX"]/p/a/text()')
                addr = self.iskong(addr)
                # Viewing activity
                dynamic = total.xpath('.//div[@class="listX"]/p[last()]/text()')
                dynamic = dynamic[1] if len(dynamic) > 1 else 'no information'
                # Tags
                lable = total.xpath('.//div[@class="listTag"]/span/text()')
                lable = ' '.join(lable) if lable else 'no tags'
                # Price
                price = total.xpath('.//div[@class="jia"]/p//text()')
                price = self.iskong(price)
                # Rental type
                Rentway = total.xpath('.//div[@class="jia"]/p/text()')
                Rentway = Rentway[1] if len(Rentway) > 1 else 'no rental information'
                # Build a fresh dict per listing; a module-level dict would be
                # shared (and overwritten) by all threads.
                house_data = {
                    'image_url': picHerf,
                    'title': title,
                    'house_info': info,
                    'address': addr,
                    'view_activity': dynamic,
                    'tags': lable,
                    'price': price,
                    'rent_type': Rentway,
                }
                list_data = [{"num": num, "house_data": house_data}]
                with write_lock:
                    with open('5i5j.txt', 'a', encoding='utf-8') as f:
                        f.write(str(list_data) + '\n')
                        f.write('==================' + '\n')
        except Exception as exc:
            # Report the failure instead of silently swallowing it
            print('failed to parse page:', exc)


if __name__ == '__main__':
    # Task start time
    start_time = time.time()
    # 1. Fill the task queue
    url_queue = Queue()
    # Each entry is (district slug, total number of pages)
    href_list = [
        ('chaoyangqu', 227), ('haidianqu', 145), ('dongchengqu', 40),
        ('xichengqu', 83), ('fengtaiqu', 104), ('shijingshanqu', 23),
        ('tongzhouqu', 68), ('changpingqu', 65), ('daxingqu', 44),
        ('yizhuang', 9), ('shunyiqu', 30), ('fangshanqu', 59),
        ('mentougou', 1),
        # ('pinggu', 1), ('huairou',), ('miyun',), ('yanqing',)  # districts with no listings
    ]
    for i in href_list:
        for j in range(1, i[1] + 1):
            url = 'https://bj.5i5j.com/zufang/{}/n{}/'.format(i[0], j)
            url_queue.put(url)
    # 2. Create the crawler threads
    craw1_name = ['c1', 'c2', 'c3']
    craw1_tread = []
    for name in craw1_name:
        crawl = Wawj(name, url_queue)
        crawl.start()
        craw1_tread.append(crawl)
    # join() blocks the main thread until every worker thread has finished
    for thread in craw1_tread:
        thread.join()
    # Task end time
    end_time = time.time()
    # Elapsed time in seconds
    print(end_time - start_time)

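Here the URLs are built by string formatting. Listing every district slug up front saves one request, because we do not have to fetch an index page just to discover the district links. Recording the exact page count for each district also lets us use a bounded loop rather than an open-ended one that might never terminate.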
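The URL construction can be sketched on its own like this. The helper name `build_urls` and the `BASE` constant are hypothetical; the URL pattern and the two sample districts with their page counts are taken from the code above.

```python
from queue import Queue

BASE = "https://bj.5i5j.com/zufang/{}/n{}/"


def build_urls(districts):
    """Yield one listing-page URL per (district slug, page number)."""
    for slug, pages in districts:
        # range() is bounded by the known page count, so the loop always ends
        for page in range(1, pages + 1):
            yield BASE.format(slug, page)


# Two districts from the post: 9 pages and 1 page
districts = [("yizhuang", 9), ("mentougou", 1)]
url_queue = Queue()
for url in build_urls(districts):
    url_queue.put(url)
print(url_queue.qsize())
```

The page counts go stale as listings change, so they would need to be refreshed before each crawl.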
