Python: A Data Crawler for the Dianping Website

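Goal:
Scrape hotel information for a given area from Dianping (hotel name, average price, number of reviewers, tags, and so on), write it to a txt file, and import it into a database.
Modules used: urllib, urllib2, re, BeautifulSoup (the script targets Python 2; urllib2 and the print statement exist only there)
Rough steps:
(1) Obtain the URL of the first listing page and the corresponding headers;
(2) Parse the page, extract the information, and write it to the txt file. Then try to get the URL of the next page: if one is found, update the URL and repeat (2); if not, stop and go to (3);
(3) Import the data in the resulting txt file into MySQL in one go.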
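Steps (1) and (2) can be sketched as follows. This is a minimal Python 3 sketch, not the script from this post: it swaps urllib2 for urllib.request and BeautifulSoup for the standard-library html.parser so it has no third-party dependencies. The names build_request, fetch, NextLinkFinder, find_next_url and crawl are hypothetical, and the assumption that the next-page link is an <a> tag with class "next" is taken from the script in this post.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def build_request(url, headers):
    """Step (1): attach the headers to the request for the given URL."""
    return Request(url, headers=headers)


def fetch(url, headers):
    """Download one page and return its HTML as text (needs network access)."""
    with urlopen(build_request(url, headers), timeout=10) as page:
        return page.read().decode("utf-8", errors="replace")


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> whose class list contains 'next'."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "next" in classes and self.href is None:
            self.href = attrs.get("href")


def find_next_url(html, current_url):
    """Return the absolute URL of the next page, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.href is None:
        return None
    return urljoin(current_url, finder.href)


def crawl(start_url, get_page, handle_page, max_pages=5):
    """Step (2): process pages until no next link is found or max_pages is hit."""
    url, visited = start_url, []
    while url is not None and len(visited) < max_pages:
        html = get_page(url)
        handle_page(html)          # e.g. parse the hotels and append them to the txt file
        visited.append(url)
        url = find_next_url(html, url)
    return visited
```

A real run would pass something like `lambda u: fetch(u, headers)` as get_page. Keeping the downloader as a parameter lets the loop be exercised against canned HTML without touching the network, and max_pages caps the crawl in the same way the five-page limit does in the script.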

#-*-coding:utf-8-*-
'''
created by zwg in 2016-10-15
'''
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import urllib
import urllib2
import re
import copy
from bs4 import BeautifulSoup
import MySQLdb


class get_data:
    def get_html(self, url, headers):
        # Build an opener that sends every supplied header with each request.
        opener = urllib2.build_opener()
        opener.addheaders = headers.items()
        urllib2.install_opener(opener)
        self.url = url
        page = opener.open(url)
        self.html = page.read()
        self.soup = BeautifulSoup(self.html, 'lxml', from_encoding='utf-8')
        self.opener = opener

    def get_nextpage(self):
        # Follow the "next" link; return False when there is no further page.
        basic_url = 'http://www.dianping.com'
        next_url = self.soup.find_all('a', class_='next')
        if not next_url:
            return False
        new_url = next_url[0]['href']
        self.url = basic_url + new_url
        self.html = self.opener.open(self.url).read()
        self.soup = BeautifulSoup(self.html, 'lxml', from_encoding='utf-8')
        return True

    def get_one_data(self):
        # Extract one tuple per hotel from the current page.
        hotel_li = self.soup.find_all('li', class_='hotel-block')
        pattern1 = re.compile('''"title":"(.+)"''')
        info = []
        for i in hotel_li:
            s = i['data-hippo']
            s1 = pattern1.findall(s)[0]
            name = s1  # hotel name
            p_class = i.find_all('p', class_='hotel-tags')[0]
            p11 = p_class.find_all('span')
            comment = ''  # comma-separated tags
            for j in p11:
                comment = comment + j.string + ','
            comment = comment[0:len(comment) - 1]
            p_price = i.find_all('strong')[0]
            price = p_price.string  # hotel price
            price = price.replace(' ', '')  # str.replace returns a new string
            if price == '\n':
                price = 'None'
            p_people = i.find_all('a', class_='comments')[0]
            number = p_people.string  # number of reviewers
            number = number.replace('(', '')
            number = number.replace(')', '')
            number = number.replace(' ', '')
            if not number.isdigit():
                number = 'None'
            p = i.find_all('p', class_='place')
            place = str(p[0].a.string)
            info.append((name, price, place, number, comment))
            print '%-20s%-10s%-10s%-5s%s' % (name, price, place, number, comment)
        self.info = info

    def write_to_txt(self, file1):
        # One tab-separated line per hotel.
        for i in self.info:
            a, b, c, d, e = i
            s = ('%s\t%s\t%s\t%s\t%s\n') % (a, b, c, d, e)
            file1.write(s)

    def get_all_data(self, file1):
        # Scrape at most 5 pages, stopping early when there is no next page.
        for i in xrange(5):
            self.get_one_data()
            self.write_to_txt(file1)
            if not self.get_nextpage():
                break


url = 'http://www.dianping.com/guangzhou/hotel/p1'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 ''(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
G=get_data()
G.get_html(url,headers)
file1 = file('dazhong.txt', 'a+')
conn=MySQLdb.connect('127.0.0.1','root','1234','school',)
cursor=conn.cursor()
conn.set_character_set('utf8')
cursor.execute('SET NAMES utf8;')
cursor.execute('SET CHARACTER SET utf8;')
cursor.execute('SET character_set_connection=utf8;')
G.get_all_data(file1)
file1.close()  # flush the scraped rows to disk before MySQL reads the file
sql = "load data local infile 'D:/Python/web_crawler/dazhong.txt' " \
      "into table hotel_info fields terminated by '\t'"
cursor.execute(sql)
conn.commit()
cursor.close()
conn.close()
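
The field clean-up done inside get_one_data and write_to_txt can also be isolated into small helpers that are easy to check on their own. The following is a hedged Python 3 sketch, not the author's code: extract_title, clean_count and to_tsv_row are hypothetical names, and the sample data-hippo string used to illustrate the regular expression is an assumed format. The pattern here is non-greedy (`.+?`), so it stops at the first closing quote instead of running on to the last quote in the attribute.

```python
import re

# Non-greedy capture: stop at the first closing quote after "title":"
TITLE_RE = re.compile(r'"title":"(.+?)"')


def extract_title(data_hippo):
    """Pull the hotel name out of a data-hippo style attribute string."""
    match = TITLE_RE.search(data_hippo)
    return match.group(1) if match else None


def clean_count(text):
    """Turn a string such as '(1234)' into '1234'; anything else becomes 'None'."""
    digits = re.sub(r"[()\s]", "", text or "")
    return digits if digits.isdigit() else "None"


def to_tsv_row(name, price, place, number, tags):
    """Join the fields with tabs, replacing stray tabs/newlines inside a field."""
    fields = [name, price, place, number, ",".join(tags)]
    cleaned = [re.sub(r"[\t\r\n]+", " ", str(f)).strip() for f in fields]
    return "\t".join(cleaned) + "\n"
```

Replacing tabs and newlines inside a field matters because the txt file is later loaded with fields terminated by a tab: a stray tab in a hotel name would shift every following column.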

The implementation runs as expected. Done!
