Scraping eLong international hotel data with a Python crawler

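My project needed data, and after a long, fruitless search online I wrote my own crawler. I have not written many programs, so to keep things simple I did not use any functions. I ran into quite a few problems along the way. The first was that the crawler could only fetch the first ten entries. Searching online, I read that dynamic pages can be scraped with selenium, so I installed selenium and phantomjs following a book and online tutorials. That did not solve the problem. I then tried going through an anonymous IP, which also failed. Next I added code that simulates scrolling down the page, and that retrieved all 30 entries on the first page. Then I looked for a way to simulate turning pages, found a few lines of code, and they worked. With that I could scrape all the New York hotels for a given day. After that I wanted to scrape 30 consecutive days, and this part took me a long time to figure out.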

Here is a screenshot of eLong's date picker.

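At first I thought I could simulate clicks on it, but after trying for a while I could not get that to work.
Then I found that the date can be typed in directly, so I simulate the typing instead. I have to give credit to how powerful selenium is.
The source code is below; I will not say much more.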

# -*- coding: utf-8 -*-
# Python 2 script (print statement, str.encode on write).
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Proxy settings from the anonymous-IP attempt (left disabled):
# service_args=['--proxy=127.0.0.1:9150','--proxy-type=socks5',]

# Check-in dates to scrape; one output file is written per date.
datelist = [
    '2017-4-4', '2017-4-5', '2017-4-6', '2017-4-7', '2017-4-8',
    '2017-4-9', '2017-4-10', '2017-4-11', '2017-4-12', '2017-4-13',
    '2017-4-14', '2017-4-15', '2017-4-16', '2017-4-17', '2017-4-18',
    '2017-4-19', '2017-4-20', '2017-4-21', '2017-4-22', '2017-4-23',
    '2017-4-24', '2017-4-25', '2017-4-26', '2017-4-27', '2017-4-28',
]

driver = webdriver.PhantomJS(executable_path=r'C:\Users\cimdy\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Anaconda2 (32-bit)\phantomjs')
driver.get('http://ihotel.elong.com/region_178293/')
time.sleep(2)

for date in datelist:
    # Type the check-in date straight into the date box.
    driver.find_element_by_xpath("//input[@id='inDate']").send_keys(date)
    time.sleep(5)
    page = 0
    hotels_inf = []
    while page <= 5:
        # Scroll to the bottom repeatedly so the lazily loaded hotels appear.
        times = 10
        for i in range(times + 1):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(5)
        pageSource = driver.page_source
        html_content = BeautifulSoup(pageSource, 'lxml')
        hotels = html_content.findAll('div', {'class': 'h_item clearfix'})
        print len(hotels)
        for hotel in hotels:
            single_hotel_inf = []
            # Only entries that carry the hotel attributes are kept.
            if 'hotelname' in hotel.attrs:
                single_hotel_inf.append(hotel.attrs['id'])
                single_hotel_inf.append(hotel.attrs['commentcount'])
                single_hotel_inf.append(hotel.attrs['hotelprice'])
                single_hotel_inf.append(hotel.attrs['hotelname'])
                hotels_inf.append(single_hotel_inf)
        # selenium xpath: find the <a> tag whose text contains "下一页" (next page) and click it
        driver.find_element_by_xpath("//a[contains(text(),'下一页')]").click()
        page = page + 1
        time.sleep(2)  # wait 2 seconds so the page finishes loading before reading its HTML
    # Save this date's hotels, one hotel per line, attributes separated by spaces.
    with open(date + ".txt", "w") as f:
        for hotel_inf in hotels_inf:
            for hotel_attr in hotel_inf:
                print hotel_attr
                f.write(hotel_attr.encode('utf8') + ' ')
            f.write('\n')
    # Reload the listing page before entering the next date.
    driver.get('http://ihotel.elong.com/region_178293/')
    time.sleep(2)

driver.close()
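
The script scrolls a fixed 11 times per page and waits 5 seconds after each scroll. One possible alternative, shown here only as a sketch and not as what the original script does, is to keep scrolling until the page height stops growing. The function name `scroll_until_stable` and its `pause` and `max_rounds` parameters are made up for this illustration; the only Selenium call it relies on is `driver.execute_script`.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    `driver` is any object with Selenium's execute_script(script) method.
    Returns the number of scrolls performed (hypothetical helper).
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded items time to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was appended, so assume the page is complete
        last_height = new_height
    return rounds
```

If the page loads slowly, too short a `pause` can make the height look stable before the new hotels arrive, so the value would need tuning against the real site.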

The output of a run is shown below.

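Many other details live in the program itself, such as saving the files automatically, and building a list of date strings and iterating over it.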
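
A hand-typed list of date strings is easy to get wrong, because Python silently joins two adjacent string literals when the comma between them is missing. As a minimal sketch, the list could instead be generated with the standard `datetime` module. The helper name `build_date_list` is hypothetical; the unpadded `2017-4-4` format is copied from the dates the script types into the date box.

```python
from datetime import date, timedelta


def build_date_list(start, days):
    """Return `days` consecutive dates from `start` as 'YYYY-M-D' strings."""
    dates = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        # No zero padding, to match the '2017-4-4' style used by the script.
        dates.append("{0}-{1}-{2}".format(d.year, d.month, d.day))
    return dates


# Same range as the hand-written list: 4 April 2017 through 28 April 2017.
datelist = build_date_list(date(2017, 4, 4), 25)
```

Changing the second argument to 30 would cover the 30 consecutive days mentioned earlier, and month boundaries are handled by `timedelta`.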
