1、抓取全部国家

1)开始的url

start_url = 'http://www.mafengwo.cn/mdd/'

2)请求部分代码

        proxies = get_proxies_requests()random_header = get_header()add_header = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8','Accept-Encoding': 'gzip, deflate','Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8','Cache-Control': 'max-age=0','Connection': 'keep-alive','Host': 'www.mafengwo.cn','Upgrade-Insecure-Requests': '1',}last_header = dict()last_header.update(random_header)last_header.update(add_header)html = requests.get(url, headers=last_header, proxies=proxies, timeout=10)

3)提取数据代码

                content_list = list()html = etree.HTML(html.content)# 大洲列表continent_lists = html.xpath("//div[@class='bd']/dl[@class='item']")for continent in continent_lists:continent_name = continent.xpath("./dt[@class='sub-title']/text()")[0]# 大洲下面的国家country_lists = continent.xpath(".//dd[@class='clearfix']//a")for country in country_lists:country_dict = dict()country_chinese_name = country.xpath("./text()")[0]country_english_name = country.xpath("./span/text()")[0]country_part_url = country.xpath("./@href")[0]country_url = 'http://www.mafengwo.cn' + country_part_urlcountry_id = country_part_url.split('.')[0].split('/')[-1]country_start_popular_cities_url = part_popular_cities_url.format(country_id)country_dict['continent'] = continent_namecountry_dict['country_chinese_name'] = country_chinese_namecountry_dict['country_english_name'] = country_english_namecountry_dict['country_url'] = country_urlcountry_dict['country_id'] = country_idcountry_dict['country_start_popular_cities_url'] = country_start_popular_cities_urlcontent_list.append(country_dict)

4)结果

抓取了201个国家。

2、抓取全部的目的地

马蜂窝数据接口:

pagedata_destinationlist_url = 'http://www.mafengwo.cn/mdd/base/list/pagedata_citylist'

self.pagedata_destinationlist_url = pagedata_destinationlist_url

1)遍历国家id,获取全部的目的地

        last_lists = list()file = open('./results/mafengwo_country_info.json', 'r', encoding='utf-8')movie_lists = file.readlines()movie_lists = list(set(movie_lists))last_lists.extend(movie_lists)last_lists = list(set(last_lists))print('文件共' + str(len(last_lists)) + '个对象......')i = 0# print()# crawling_data = {"country_english_name": "China", "country_start_popular_cities_url": "http://www.mafengwo.cn/mdd/citylist/21536.html", "country_url": "http://www.mafengwo.cn/travel-scenic-spot/mafengwo/21536.html", "continent": "亚洲", "country_chinese_name": "中国 ", "country_id": "21536"}# print(type(crawling_data))# exit(0)# self.url_queue.put((crawling_data, i))for crawling_data in last_lists[:]:crawling_data = json.loads(crawling_data)self.url_queue.put((crawling_data, i))i += 1

2)发送请求部分

country_id = crawling_data['country_id']html_is_kong = 0page_num = 0for page in range(1, self.total_page):page_num += 1data = {"mddid": country_id,"page": str(page),}try:html = self._get_url_content_post(data)except:html = ''flag = 0if html is not '':source_code_html = html.content.decode()dic = json.loads(source_code_html)try:page_info = dic['page']except:page_info = ''if '后一页' not in page_info and '末页' not in page_info:flag = 1else:html_is_kong += 1if html_is_kong == 30:break# print('flag = {}'.format(flag))self.html_queue.put((html, i, crawling_data, flag, page_num))if flag:# 推出循环条件break

3)请求体部分,post请求

        proxies = get_proxies_requests()random_header = get_header()add_header = {'Accept': 'application/json, text/javascript, */*; q=0.01','Accept-Encoding': 'gzip, deflate','Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8','Connection': 'keep-alive','Content-Length': '18','Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',# 'Cookie': 'PHPSESSID=012jlosl7nd28cdlbtarggbop6; mfw_uuid=5bd7d7e9-e53d-89ac-2889-f0d77d529a39; oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A3546%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-10-30+12%3A02%3A49%22%3B%7D; uva=s%3A78%3A%22a%3A3%3A%7Bs%3A2%3A%22lt%22%3Bi%3A1540872169%3Bs%3A10%3A%22last_refer%22%3Bs%3A6%3A%22direct%22%3Bs%3A5%3A%22rhost%22%3Bs%3A0%3A%22%22%3B%7D%22%3B; __mfwurd=a%3A3%3A%7Bs%3A6%3A%22f_time%22%3Bi%3A1540872169%3Bs%3A9%3A%22f_rdomain%22%3Bs%3A0%3A%22%22%3Bs%3A6%3A%22f_host%22%3Bs%3A3%3A%22www%22%3B%7D; __mfwuuid=5bd7d7e9-e53d-89ac-2889-f0d77d529a39; UM_distinctid=166c323745f60-084bd0cc3f686d-39614807-1fa400-166c32374604cf; ifu_class_num=1; ad_widget_footer_20181101preview_0_other=1%2C1vuh0fz; _r=baidu; _rp=a%3A2%3A%7Bs%3A1%3A%22p%22%3Bs%3A18%3A%22www.baidu.com%2Flink%22%3Bs%3A1%3A%22t%22%3Bi%3A1541044035%3B%7D; __mfwlv=1541063796; __mfwvn=5; CNZZDATA30065558=cnzz_eid%3D598255974-1540867004-%26ntime%3D1541063301; __mfwlt=1541067393','Host': 'www.mafengwo.cn','Origin': 'http://www.mafengwo.cn',# 'Referer': 'http://www.mafengwo.cn/mdd/citylist/21536.html',# 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36','X-Requested-With': 'XMLHttpRequest',}last_header = dict()last_header.update(random_header)last_header.update(add_header)sleep(random.uniform(2, 5))html = requests.post(self.pagedata_destinationlist_url, headers=last_header, proxies=proxies, timeout=10, data=data)

4)数据提取

                html = html.content.decode()html_dict = json.loads(html)html_content = html_dict['list']html_content = etree.HTML(html_content)try:li_lists = html_content.xpath("//li[@class='item ']")except:li_lists = list()# 获取城市信息last_lists = list()for destination_info in li_lists:destination_dict = dict()destination_chinese_name = ''.join(destination_info.xpath(".//div[@class='title']/text()")).strip()destination_english_name = destination_info.xpath("./div[@class='img']//div[@class='title']/p/text()")[0] if len(destination_info.xpath("./div[@class='img']//div[@class='title']/p/text()")) > 0 else ''destination_id = destination_info.xpath("./div[@class='img']/a/@data-id")[0]destination_url = destination_info.xpath("./div[@class='img']/a/@href")[0]destination_url = 'http://www.mafengwo.cn' + destination_urldestination_big_head_img_url = destination_info.xpath("./div[@class='img']/a/img/@data-original")[0]if '?' in destination_big_head_img_url:destination_big_head_img_url = destination_big_head_img_url.split('?')[0]destination_num_people_have_been = destination_info.xpath(".//div[@class='nums']/b/text()")[0]destination_summary = destination_info.xpath(".//div[@class='detail']/text()")[0].strip()destination_top_lists = destination_info.xpath(".//dl[@class='caption']/dd/a")if len(destination_top_lists) == 3:top_one_con = dict()top_two_con = dict()top_three_con = dict()# 第一top_one_con['attractions_name'] = destination_top_lists[0].xpath("./@title")[0]top_one_con['attractions_id'] = destination_top_lists[0].xpath("./@data-id")[0]top_one_con['attractions_url'] = 'http://www.mafengwo.cn' + destination_top_lists[0].xpath("./@href")[0]# print(top_one_con)# 第二top_two_con['attractions_name'] = destination_top_lists[1].xpath("./@title")[0]top_two_con['attractions_id'] = destination_top_lists[1].xpath("./@data-id")[0]top_two_con['attractions_url'] = 'http://www.mafengwo.cn' + destination_top_lists[1].xpath("./@href")[0]# 第三top_three_con['attractions_name'] = destination_top_lists[2].xpath("./@title")[0]top_three_con['attractions_id'] = destination_top_lists[2].xpath("./@data-id")[0]top_three_con['attractions_url'] = 'http://www.mafengwo.cn' + destination_top_lists[2].xpath("./@href")[0]top = {'top_1': top_one_con,'top_2': top_two_con,'top_3': top_three_con,}elif len(destination_top_lists) == 2:top_one_con = dict()top_two_con = dict()top_three_con = dict()# 第一top_one_con['attractions_name'] = destination_top_lists[0].xpath("./@title")[0]top_one_con['attractions_id'] = destination_top_lists[0].xpath("./@data-id")[0]top_one_con['attractions_url'] = 'http://www.mafengwo.cn' + destination_top_lists[0].xpath("./@href")[0]# print(top_one_con)# 第二top_two_con['attractions_name'] = destination_top_lists[1].xpath("./@title")[0]top_two_con['attractions_id'] = destination_top_lists[1].xpath("./@data-id")[0]top_two_con['attractions_url'] = 'http://www.mafengwo.cn' + destination_top_lists[1].xpath("./@href")[0]# 第三top_three_con['attractions_name'] = ''top_three_con['attractions_id'] = ''top_three_con['attractions_url'] = ''top = {'top_1': top_one_con,'top_2': top_two_con,'top_3': top_three_con,}elif len(destination_top_lists) == 1:top_one_con = dict()top_two_con = dict()top_three_con = dict()# 第一top_one_con['attractions_name'] = destination_top_lists[0].xpath("./@title")[0]top_one_con['attractions_id'] = destination_top_lists[0].xpath("./@data-id")[0]top_one_con['attractions_url'] = 'http://www.mafengwo.cn' + destination_top_lists[0].xpath("./@href")[0]# print(top_one_con)# 第二top_two_con['attractions_name'] = ''top_two_con['attractions_id'] = ''top_two_con['attractions_url'] = ''# 第三top_three_con['attractions_name'] = ''top_three_con['attractions_id'] = ''top_three_con['attractions_url'] = ''top = {'top_1': top_one_con,'top_2': top_two_con,'top_3': top_three_con,}else:top_one_con = dict()top_two_con = dict()top_three_con = dict()# 第一top_one_con['attractions_name'] = ''top_one_con['attractions_id'] = ''top_one_con['attractions_url'] = ''# print(top_one_con)# 第二top_two_con['attractions_name'] = ''top_two_con['attractions_id'] = ''top_two_con['attractions_url'] = ''# 第三top_three_con['attractions_name'] = ''top_three_con['attractions_id'] = ''top_three_con['attractions_url'] = ''top = {'top_1': top_one_con,'top_2': top_two_con,'top_3': top_three_con,}destination_top = topdestination_dict['destination_chinese_name'] = destination_chinese_namedestination_dict['destination_english_name'] = destination_english_namedestination_dict['destination_id'] = destination_iddestination_dict['destination_url'] = destination_urldestination_dict['destination_big_head_img_url'] = destination_big_head_img_urldestination_dict['destination_num_people_have_been'] = destination_num_people_have_beendestination_dict['destination_summary'] = destination_summarydestination_dict['destination_top'] = destination_topdestination_dict.update(crawling_data)last_lists.append(destination_dict)try:if len(last_lists) > 0:last_lists = json.dumps(last_lists)last_lists = json.loads(last_lists)# print()print('第 {} 个国家, 第 {} 页, 共有 {} 条数据...,国家城市链接:  {}'.format(i, page_num, len(last_lists), last_lists[0]['country_start_popular_cities_url']))# print('第 {} 个国家, 第 {} 页, 共有 {} 条数据...,国家城市链接: '.format(i, page_num, len(last_lists)))self.content_list_queue.put((last_lists, i))

5)结果

抓取了一万多目的地。。。

Python-马蜂窝全站抓取。。。今天先写抓取国家和城市,下次有时间在写景点、自由行攻略和游记。。。相关推荐

  1. Python之pandas:对pandas中dataframe数据中的索引输出、修改、重命名等详细攻略

    Python之pandas:对pandas中dataframe数据中的索引输出.修改.重命名等详细攻略 目录 对pandas中dataframe数据中的索引输出.修改.重命名等详细攻略 知识点学习 输 ...

  2. Python之pandas:pandas的get_dummies函数简介(将分类变量转为哑变量)及其使用方法之详细攻略

    Python之pandas:pandas的get_dummies函数简介(将分类变量转为哑变量)及其使用方法之详细攻略 目录 pandas的get_dummies函数简介 pandas.get_dum ...

  3. Python:列表、集合等交集、并集、差集、非集简介及其代码实现之详细攻略

    Python:列表.集合等交集.并集.差集.非集简介及其代码实现之详细攻略 目录 列表.集合等交集.并集.差集.非集简介及其代码实现之详细攻略 1.交集:A & B & C & ...

  4. Python:更改默认启动的python程序及其对应的安装包路径(更改pip的默认安装包的路径)图文教程之详细攻略

    Python:更改默认启动的python程序及其对应的安装包路径(更改pip的默认安装包的路径)图文教程之详细攻略 目录 更改默认启动的python程序及其对应的安装包路径 1.先找到默认安装包路径

  5. Python语言学习之pandas:DataFrame二维表的简介、常用函数、常用案例之详细攻略

    Python语言学习之pandas:DataFrame二维表的简介.常用函数.常用案例之详细攻略 目录 DataFrame的简介 DataFrame的常用案例 1.写入和读取excel表格文件

  6. Dataset之Handwritten Digits:Handwritten Digits(手写数字图片识别)数据集简介、安装、使用方法之详细攻略

    Dataset之Handwritten Digits:Handwritten Digits(手写数字图片识别)数据集简介.安装.使用方法之详细攻略 目录 Handwritten Digits数据集的简 ...

  7. python calu_Python的Woe:Woe库功能的介绍、安装和使用,之,woewoe,库函数,简介,方法,详细,攻略...

    Python之woe:woe库函数的简介.安装.使用方法之详细攻略 目录 woe库函数的简介 woe转换工具主要用于信用评级的记分卡模型.在woe中,对变量进行分箱的原理类似于二叉决策树,其中决定如何 ...

  8. caffe 训练手写字体HWDB1.1trn_gnt.zip HWDB1.1tst_gnt.zip并测试完整超详尽攻略

    工作中你老子要用到物体识别,也有用到OCR识别,因为深度摄像头对玻璃和透明物体拿不到深度距离.所以用这种办法来配合获取识别物体坐标. 首先,我需要强调的是,印刷体的汉字OCR识别你直接调用接口AT两家 ...

  9. python中单行注释采用的符号是什么_Python注释符号使用说明(多行注释和单行注释),用法,详解,攻略...

    Python 语言允许在任何地方插入空字符或注释,但不能插入到标识符和字符串中间. Python 源代码的注释有两种形式,分别是单行注释和多行注释: -单行注释-------------------- ...

最新文章

  1. 解构 StyleCLIP:文本驱动、按需设计,媲美人类 P 图师
  2. 基于Pytorch再次解析AlexNet现代卷积神经网络
  3. java cursor_使用3.0驱动程序的Java中的Tailable Cursor示例?
  4. WPF 使用DrawingVisual绘制高性能曲线图
  5. java发送文件_java 模拟http发送文件和参数
  6. python compile
  7. Unity3D 多平台_预编译相关宏定义
  8. 人生真是圆的,从BASIC开始的程序人生,又回到了BASIC,难道。。。。。
  9. ios保存录制好的视频 图片 到相簿
  10. Fiddler工具中显示服务器IP地址的方法
  11. 数据库连接池为啥要用 ThreadLocal?不用会怎么样?
  12. Android中删除照片操作
  13. AnyTXT Searcher中文版-比Google Desktop都牛连everything软件作者都拍手称赞的文本搜索工具
  14. 2020王道操作系统,数据结构,计算机网络,计算机组成原理PDF大合集+使用经验
  15. iOS 审核 苹果登陆失败
  16. 享元模式(羽量级模式、蝇量级模式Flyweight,对象结构型模式)
  17. 下载或移除 Chrome 主题背景
  18. c语言(练习for循环、字母大写转小写)
  19. 3D动画概述暨骨骼动画实现
  20. vue中使用vue-infinite-scroll

热门文章

  1. Windows系统中用户变量与系统变量的区别
  2. java多态编程,java范例多态
  3. PPT演讲技巧及一点点原则(转)
  4. 学习streamlit-4
  5. Vue 实现简单的二维码扫描功能
  6. 【canny边缘检测】canny边缘检测原理及代码详解
  7. 人脸识别之人脸对比 【C#】
  8. loam源码解析1 : scanRegistration(一)
  9. 中国旅游的十大趋势2
  10. ispm13:项目管理各个过程的工具和技术(重点知识)