Scraping Ctrip Hotel Data (New)

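Introduction: Because Ctrip's web pages keep changing and its anti-scraping measures keep getting stronger, many existing Ctrip scrapers can no longer retrieve any data.
Core idea of this article: obtain Ctrip hotel data by replacing the value of the cookies.
The article has four main parts:

headers
data
JSON parsing
Full code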

Preface

Environment: Python 3.6 + requests
The code also includes some file-writing operations.

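1. headers

A scraper has to imitate a browser when it makes requests, so the headers are essential. They are easy to find by inspecting the request in the browser.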

headers = {
    "Connection": "keep-alive",
    "Cookie": cookies,
    "origin": "https://hotels.ctrip.com",
    "Host": "hotels.ctrip.com",
    "referer": "https://hotels.ctrip.com/hotel/qamdo575",
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
}

The most important field is the cookies. Without them, validation fails and the response contains no data. The cookies must also come from a logged-in session.
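Since an empty or missing cookie leads to empty data, it can help to fail early instead of sending a request that is bound to come back empty. The following is a minimal sketch, not part of the original script: the helper names `build_headers` and `headers_from_env` and the environment variable `CTRIP_COOKIE` are assumptions introduced for illustration. The header values themselves are the ones shown above.

```python
import os

# Hypothetical environment variable name; not part of the original article.
COOKIE_ENV_VAR = "CTRIP_COOKIE"


def build_headers(cookie_string):
    """Return request headers carrying the given logged-in Cookie string."""
    cookie_string = cookie_string.strip()
    if not cookie_string:
        # Without cookies the endpoint returns empty data, so fail early.
        raise ValueError("cookie string is empty; copy it from a logged-in browser session")
    return {
        "Connection": "keep-alive",
        "Cookie": cookie_string,
        "origin": "https://hotels.ctrip.com",
        "Host": "hotels.ctrip.com",
        "referer": "https://hotels.ctrip.com/hotel/qamdo575",
        "user-agent": (
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36"
        ),
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
    }


def headers_from_env():
    """Read the cookie string from the environment and build the headers."""
    return build_headers(os.environ.get(COOKIE_ENV_VAR, ""))
```

Keeping the cookie outside the source file also makes it easier to swap in a fresh value when the old one expires.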

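2. The data parameter

Because the data is fetched through a data endpoint rather than by parsing HTML, the matching data parameter has to be assembled correctly to get an accurate response. The fields we need can be found by inspecting the request in the browser, in the Headers view of that request.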

data = {
    "StartTime": "2020-10-09",
    "DepTime": "2020-10-10",  # check-out date; must be later than StartTime
    "RoomGuestCount": "1,1,0",
    "cityId": 575,
    "cityPY": "qamdo",
    "cityCode": "0895",
    "page": page,
}
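The form fields can also be built by a small function, so that the page number and the dates are passed in and an inverted date range is caught before the request is sent. This is only a sketch: `build_payload` is a hypothetical helper name, and the city values are the ones used in this article.

```python
from datetime import date


def build_payload(page, check_in, check_out):
    """Build the form data for one result page of the hotel list."""
    if check_out <= check_in:
        raise ValueError("check_out must be later than check_in")
    if page < 1:
        # The full script in this article iterates over pages starting at 1.
        raise ValueError("page numbers start at 1")
    return {
        "StartTime": check_in.isoformat(),   # e.g. "2020-10-09"
        "DepTime": check_out.isoformat(),
        "RoomGuestCount": "1,1,0",
        "cityId": 575,
        "cityPY": "qamdo",
        "cityCode": "0895",
        "page": page,
    }


payload = build_payload(1, date(2020, 10, 9), date(2020, 10, 10))
```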

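3. JSON parsing

Once the correct data endpoint has been found, use the requests library to send a GET or POST request (POST in this case) with the headers and data built earlier, and read the JSON that comes back.
Individual attributes such as the link, score and address can then be read from the decoded JSON by key.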

html = requests.post(url, headers=headers, data=data)
hotel_list = html.json()["hotelPositionJSON"]
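Reading the attributes by key can be sketched as a small function that works on the already decoded JSON. The field names are the ones used by the full script in this article. The function name `extract_hotels` and the sample dictionary are made up for illustration and do not represent a real response. A missing `hotelPositionJSON` key is treated as an empty list, which is one way to handle the empty data returned when the cookies are rejected.

```python
FIELDS = ["id", "name", "score", "url", "address", "star",
          "stardesc", "lat", "lon", "dpcount", "dpscore"]


def extract_hotels(payload):
    """Turn the decoded JSON response into a list of flat dicts."""
    records = []
    for item in payload.get("hotelPositionJSON") or []:
        # Missing keys become empty strings instead of raising KeyError.
        record = {field: item.get(field, "") for field in FIELDS}
        if record["star"] == "":
            # Same placeholder the full script uses for a missing star rating.
            record["star"] = "NaN"
        records.append(record)
    return records


# Made-up sample data, only to show the shape the function expects.
sample = {"hotelPositionJSON": [
    {"id": "1", "name": "Demo Hotel", "star": "", "score": "4.5"},
]}
hotels = extract_hotels(sample)
```

With a live response, the call would be `extract_hotels(html.json())`.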

4. Full code

# coding=utf8
import csv
import random
import time

import numpy as np
import pandas as pd
import requests
from tqdm import tqdm

# Pandas display options
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.width', 1000)

url = "https://hotels.ctrip.com/Domestic/Tool/AjaxHotelList.aspx"
filename = "F:\\aaa\\changdu.csv"

# Quick check that the endpoint responds (prints the response object)
print(requests.post(url))


def Scrap_hotel_lists():
    # Cookie string copied from a logged-in browser session (elided here)
    cookies = '''......'''
    headers = {
        "Connection": "keep-alive",
        "Cookie": cookies,
        "origin": "https://hotels.ctrip.com",
        "Host": "hotels.ctrip.com",
        "referer": "https://hotels.ctrip.com/hotel/qamdo575",
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
    }

    # One list per output column
    hotel_id = []
    name = []
    hotel_url = []
    address = []
    score = []
    star = []
    stardesc = []
    lat = []
    lon = []
    dpcount = []
    dpscore = []

    for page in tqdm(range(1, 13), desc='In progress', ncols=10):
        data = {
            "StartTime": "2020-10-09",
            "DepTime": "2020-10-10",  # check-out date; must be later than StartTime
            "RoomGuestCount": "1,1,0",
            "cityId": 575,
            "cityPY": "qamdo",
            "cityCode": "0895",
            "page": page,
        }
        html = requests.post(url, headers=headers, data=data)
        hotel_list = html.json()["hotelPositionJSON"]
        for item in hotel_list:
            print(item)
            hotel_id.append(item['id'])
            name.append(item['name'])
            hotel_url.append(item['url'])
            address.append(item['address'])
            score.append(item['score'])
            stardesc.append(item['stardesc'])
            lat.append(item['lat'])
            lon.append(item['lon'])
            dpcount.append(item['dpcount'])
            dpscore.append(item['dpscore'])
            # Use a placeholder when the star rating is missing
            if item['star'] == '':
                star.append('NaN')
            else:
                star.append(item['star'])
        # Pause 3 to 5 seconds between pages to keep the request rate low
        time.sleep(random.randint(3, 5))

    # Stack the columns into rows and put the header row on top
    hotel_array = np.array((hotel_id, name, score, hotel_url, address, star, stardesc, lat, lon, dpcount, dpscore)).T
    list_header = ['id', 'name', 'score', 'url', 'address', 'star', 'stardesc', 'lat', 'lon', 'dpcount', 'dpscore']
    array_header = np.array(list_header)
    hotellists = np.vstack((array_header, hotel_array))
    with open(filename, 'w', encoding="utf-8-sig", newline="") as f:
        csvwriter = csv.writer(f, dialect='excel')
        csvwriter.writerows(hotellists)


if __name__ == "__main__":
    Scrap_hotel_lists()
    # Read with utf-8-sig so the BOM written above is not kept in the first column name
    df = pd.read_csv(filename, encoding='utf-8-sig')
    print(df)
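The file-writing step can also be done without NumPy. The sketch below is an alternative, not the method used in the script above: it uses `csv.DictWriter` from the standard library to write one dictionary per hotel, so the header row and column order come from a single field list. The function names `write_hotels_csv` and `read_hotels_csv` are hypothetical. The `utf-8-sig` encoding matches the one used when the script writes the file.

```python
import csv

FIELDS = ["id", "name", "score", "url", "address", "star",
          "stardesc", "lat", "lon", "dpcount", "dpscore"]


def write_hotels_csv(path, records):
    """Write a list of hotel dicts to a CSV file with a header row."""
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        # Keys not listed in FIELDS are ignored; missing keys are written as "".
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)


def read_hotels_csv(path):
    """Read the CSV back as a list of dicts, dropping the BOM on the way in."""
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        return list(csv.DictReader(f))
```

Because every value stays in its own dictionary, this avoids the implicit conversion of all columns to strings that happens when mixed lists are stacked into one NumPy array.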

Note: the Ctrip website is redesigned frequently. This program is for learning purposes only.
