Analysis

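On Qunar's train ticket search page, the user fills in the departure station, the destination station, the departure date and so on, then clicks the search button.
The page then fetches the query results via Ajax and displays them.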

Here we use Selenium + PhantomJS to simulate this process.

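1. Load the train ticket search page with Selenium and locate the three input boxes that need to be filled in, plus the search button that submits the query.
2. Simulate filling in the three input boxes and clicking the search button.
3. Get the fully rendered HTML source from the browser object, parse it, and extract the train number and related information.
4. Save the extracted data to the database.
5. If there are multiple pages of results, simulate a click on the next page and go back to step 3 (implemented by calling the HTML parsing method recursively).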
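
The loop formed by steps 3 to 5 can be tried out without a browser. The following is only a minimal sketch: `fake_pages` is a hypothetical stand-in for the rendered result pages, and `parse_page` is an illustrative name, not a method of the spider. It shows the recursive shape and the stop condition (stop once the current page is the last one, or when there are no pages at all).

```python
# Hypothetical stand-in for the rendered result pages: each inner list
# holds the train numbers shown on one page.
fake_pages = [["K7761", "K219"], ["K7725", "K599"], ["G6731"]]


def parse_page(pages, current_page, collected):
    """Collect the rows of the current page, then recurse into the next one."""
    page_count = len(pages)
    if page_count == 0:
        return collected  # no result pages at all

    # Steps 3 and 4: "parse" the current page and "save" its rows.
    collected.extend(pages[current_page - 1])

    # Step 5: move on only while a later page exists.
    if current_page < page_count:
        return parse_page(pages, current_page + 1, collected)
    return collected


trains = parse_page(fake_pages, current_page=1, collected=[])
print(trains)
```

Comparing `current_page` with `page_count` directly keeps the single-page and empty-result cases from recursing.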

Source code

#!/usr/bin/env python3
# -*- coding:utf-8 -*-

import time

import pymysql
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


class TrainTicketSpider(object):
    """
    Crawl train ticket information from Qunar using the Selenium library
    and the PhantomJS browser.
    Only same-day one-way ticket queries are implemented.
    """

    def __init__(self):
        # Selenium's PhantomJS support is deprecated (see the warning in the run output)
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        )
        self.browser = webdriver.PhantomJS(desired_capabilities=dcap)
        self.connection = pymysql.connect(
            host='localhost',
            user='root',
            password='123456',
            db='mydb',
            charset='utf8',  # must be 'utf8', not 'utf-8'
            cursorclass=pymysql.cursors.DictCursor
        )

    def crawl(self, url):
        self.browser.set_page_load_timeout(30)
        self.browser.get(url)
        self.browser.save_screenshot('1.png')
        # Departure station input box
        fromStation = self.browser.find_element(By.NAME, 'fromStation')
        # Destination station input box
        toStation = self.browser.find_element(By.NAME, 'toStation')
        # Departure date input box
        date = self.browser.find_element(By.NAME, 'date')
        # Search button
        btn_search = self.browser.find_element(By.NAME, 'stsSearch')

        fromStation.clear()
        fromStation.send_keys(input('Enter the departure station>> '))
        fromStation.click()
        toStation.clear()
        toStation.send_keys(input('Enter the destination station>> '))
        toStation.click()
        date.clear()
        date.send_keys(input('Enter the departure date (format: 2000-01-22)>> '))
        date.click()
        btn_search.click()
        # Give the Ajax request time to finish and the results time to render
        time.sleep(3)
        self.browser.save_screenshot('2.png')
        self.parse(current_page=1)

    def parse(self, current_page):
        html = self.browser.page_source
        HTML = etree.HTML(html)
        # Number of pages
        pages = HTML.xpath('//a[@data-pager]/text()')
        page_count = len(pages)
        # Elements for all trains on the current page
        li_list = HTML.xpath('//ul[@class="tbody"]/li')
        for li in li_list:
            # Train number / type
            TRAIN_NUM = li.xpath('.//h3/text()')[0].strip('\n ')
            print('Fetching train >>>', TRAIN_NUM)
            # Departure station / arrival station
            start_station = li.xpath('.//div[@class="td col2"][1]/p[@class="start"]/span/text()')[0]
            end_station = li.xpath('.//div[@class="td col2"][1]/p[@class="end"]/span/text()')[0]
            STATION = (start_station + '-' + end_station).strip('\n ')
            # Departure time / arrival time
            start_time = li.xpath('.//div[@class="td col2"][2]/time[@class="startime"]/text()')[0]
            end_time = li.xpath('.//div[@class="td col2"][2]/time[@class="endtime daytime"]/text()')[0]
            TIME = (start_time + '-' + end_time).strip('\n ')
            # Travel time
            DURATION = li.xpath('.//time[@class="duration"]/text()')[0].strip('\n ')
            # Reference fares
            prices = []
            ticket_types = li.xpath('.//p[@class="ticketed"]/text()')    # list of ticket types
            ticket_prices = li.xpath('.//span[@class="price"]/text()')   # list of ticket prices
            for type_price in zip(ticket_types, ticket_prices):
                price = '{type} {price}￥'.format(type=type_price[0], price=type_price[1])
                prices.append(price)
            # Remaining tickets
            nums = []
            ticket_ps = li.xpath('.//div[@class="td col4"]//p')
            for ticket_p in ticket_ps:
                ticket_num = ticket_p.xpath('./text()')
                if not ticket_num:
                    ticket_num = ticket_p.xpath('./span/text()')
                nums.append(ticket_num)
            # Pair each fare with its remaining-ticket count
            PRICE_NUMS = ''
            for i in zip(prices, nums):
                price_num = "{}{}".format(i[0], i[1][0])
                PRICE_NUMS = PRICE_NUMS + price_num + ' ,'
            PRICE_NUMS = PRICE_NUMS.strip(',')
            # Save to the MySQL database
            self.save(TRAIN_NUM, STATION, TIME, DURATION, PRICE_NUMS)

        # If there is a next page, click its pager link and keep crawling.
        # The explicit comparison also stops when there are no pager links
        # (page_count == 0); the original "page_count -= current_page" test
        # was truthy (-1) in that case.
        if current_page < page_count:
            current_page += 1
            a_next = self.browser.find_element(
                By.XPATH, '//a[@data-pager={page}]'.format(page=current_page))
            a_next.click()
            time.sleep(3)
            # Call the parsing method recursively
            self.parse(current_page)
        else:
            print('Crawl finished')
            # Close the database connection and the browser
            self.connection.close()
            self.browser.quit()

    def save(self, train_num, station, time_range, duration, price_nums):
        """
        Save one record to the MySQL database.

        create table qunaer(
            id int not null primary key auto_increment,
            train_num varchar(10) not null,
            station varchar(30) not null,
            time varchar(30) not null,
            duration varchar(50) not null,
            price_nums varchar(80) not null
        );
        """
        with self.connection.cursor() as cursor:
            sql = ('INSERT INTO qunaer(train_num,station,time,duration,price_nums) '
                   'VALUES (%s,%s,%s,%s,%s)')
            cursor.execute(sql, (train_num, station, time_range, duration, price_nums))
        self.connection.commit()


if __name__ == '__main__':
    url = 'https://train.qunar.com/'
    spider = TrainTicketSpider()
    spider.crawl(url)
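
The least obvious part of `parse` is how ticket type, price and remaining count are paired by position. Below is a minimal standalone sketch of that pairing. The sample values are made up for illustration and are not data from the site, and `build_price_nums` is a hypothetical helper name. It uses `str.join` rather than the concatenate-then-strip approach in the spider, so there is no trailing separator to clean up.

```python
def build_price_nums(ticket_types, ticket_prices, ticket_nums):
    """Join ticket type, price and remaining count position by position."""
    parts = []
    # zip stops at the shortest list, so a missing price or count
    # silently drops the unmatched trailing entries.
    for ticket_type, price, num in zip(ticket_types, ticket_prices, ticket_nums):
        parts.append("{} {}￥{}".format(ticket_type, price, num))
    return " ,".join(parts)


# Illustrative sample values only (assumptions, not scraped data).
types = ["hard seat", "hard sleeper"]
prices = ["43.5", "89.5"]
nums = ["12 left", "sold out"]
result = build_price_nums(types, prices, nums)
print(result)
```

Because the pairing is purely positional, it is only correct when the page lists types, prices and counts in the same order for every train.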

Run output

/usr/bin/python3.5 /home/brandon/PythonProjects/MySpider/2_动态数据采集/selenium与phantomjs/小案例/去哪儿网机票查询爬虫.py
/usr/local/lib/python3.5/dist-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
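Enter the departure station>> 北京
Enter the destination station>> 石家庄
Enter the departure date (format: 2000-01-22)>> 2015-01-15
Fetching train >>> K7761
Fetching train >>> K219
Fetching train >>> K7725
Fetching train >>> K599
Fetching train >>> K7705
Fetching train >>> G6731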
...
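
The UserWarning in the output above recommends headless Chrome or Firefox in place of PhantomJS. The following is a minimal sketch of what the browser setup might look like with headless Chrome, assuming a Selenium release that accepts `options=` and a local Chrome with a matching driver. `make_headless_browser` and `wait_for_results` are hypothetical helper names, and the XPath is the result-list expression already used in `parse`. The imports sit inside the functions so that the snippet can be loaded even where Selenium is not installed.

```python
def make_headless_browser(user_agent):
    """Create a headless Chrome browser (assumes Chrome and its driver are installed)."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("user-agent={}".format(user_agent))
    return webdriver.Chrome(options=options)


def wait_for_results(browser, timeout=10):
    """Wait until a result row is present instead of sleeping a fixed 3 seconds."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.XPATH, '//ul[@class="tbody"]/li')
    return WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located(locator)
    )
```

An explicit wait returns as soon as the first result row appears and raises a timeout error if none does, which is usually easier to debug than a fixed `time.sleep(3)`.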

