Web Crawling for Beginners, Case 3: Scraping Merchant Information from Dianping

pyspider: http://demo.pyspider.org/

CSS selectors: http://www.w3school.com.cn/cssref/css_selectors.asp

Beautiful Soup: http://beautifulsoup.readthedocs.io/zh_CN/latest/

Regular expressions: http://www.cnblogs.com/deerchao/archive/2006/08/24/zhengzhe30fengzhongjiaocheng.html

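Goals of this post:

http://www.dianping.com/search/keyword/3/0_%E4%B8%80%E9%B8%A3%E7%9C%9F%E9%B2%9C%E5%A5%B6%E5%90%A7

1. Scrape the information for every 一鸣真鲜奶吧 (Yiming fresh milk bar) shop

2. Scrape all reviews for each shop

3. Save the scraped content to a database (not shown here)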
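Goal 3 is not covered by the code in this post. As a rough sketch of how it could be done, the snippet below upserts one scraped record by its `itemid` through pymongo's `replace_one`. The helper names `save_record` and `mongo_collection`, the class `InMemoryCollection`, the database and collection names `dianping` and `shops`, and the sample records are all assumptions made for illustration. The in-memory class is only a stand-in so the example runs without a MongoDB server. The sketch is written for Python 3.

```python
import copy


class InMemoryCollection(object):
    """Hypothetical stand-in that mimics the single pymongo method used below."""

    def __init__(self):
        self.docs = {}

    def replace_one(self, filter, replacement, upsert=False):
        key = filter["itemid"]
        # Replace an existing document, or insert a new one when upsert is set.
        if key in self.docs or upsert:
            self.docs[key] = copy.deepcopy(replacement)


def save_record(collection, record):
    """Insert the record, or overwrite the stored one with the same itemid."""
    collection.replace_one({"itemid": record["itemid"]}, record, upsert=True)


def mongo_collection(uri="mongodb://localhost:27017", db="dianping", name="shops"):
    """Return a real pymongo collection (assumed names; needs a running server)."""
    from pymongo import MongoClient  # imported lazily so the demo runs without it
    return MongoClient(uri)[db][name]


# Made-up sample records, shaped like the dicts a crawler callback might build.
shops = InMemoryCollection()
save_record(shops, {"itemid": 1, "name": "shop A", "level": 4.5})
save_record(shops, {"itemid": 1, "name": "shop A", "level": 4.0})  # same id: overwritten
save_record(shops, {"itemid": 2, "name": "shop B", "level": 3.5})
print(len(shops.docs))
```

With a server available, `save_record(mongo_collection(), record)` could be called wherever a crawler callback finishes building a record. Upserting on `itemid` keeps a re-crawl from creating duplicate documents.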

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2016-06-07 07:40:58
# Project: dazhongdianping
# Written for Python 2: the string handling below relies on byte strings.

from pyspider.libs.base_handler import *
from bs4 import BeautifulSoup
from pymongo import MongoClient
import base64
import re

# Module-level counters used as running IDs.
id = 0      # review ID for pages that already show every review
count = 0   # review ID for the paginated "all reviews" pages
number = 0  # shop ID


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.dianping.com/search/keyword/3/0_%E4%B8%80%E9%B8%A3%E7%9C%9F%E9%B2%9C%E5%A5%B6%E5%90%A7', callback=self.local_page)

    @config(age=2 * 24 * 60)
    def local_page(self, response):
        # save_local is not defined anywhere in this post, so the call is commented out.
        # self.save_local('remark', response.url, response.doc)
        for each in response.doc('DIV.pic>A').items():
            self.crawl(each.attr.href, callback=self.index_page)
        # Next page of search results
        for each in response.doc('A.next').items():
            self.crawl(each.attr.href, callback=self.local_page)

    @config(age=3 * 24 * 60)
    def index_page(self, response):
        global number
        # Shop information
        for each in response.doc('DIV#basic-info').items():
            number += 1
            info = {}
            tmp = BeautifulSoup(str(each))
            name = tmp.find('h1', class_='shop-name')
            # Shop ID
            info['itemid'] = number
            # Shop name
            found = re.findall(r'<h1 class="shop-name">[\s]+(.*)', str(name))
            info['name'] = found[0] if found else '-'
            # Branch name
            found = re.findall(r'<a class="branch J-branch">(.*)<i class="icon i-arrow"></i></a>', str(name))
            info['branch'] = found[0] if found else '-'
            # Brief info items
            info['basic_info'] = []
            basic_info = tmp.find("div", class_="brief-info")
            if basic_info:
                # Star rating, taken from the class name (e.g. suffix 45 becomes 4.5)
                star = basic_info.span.get('class')[1]
                info['level'] = int(re.findall(r'mid-str(.*)', str(star))[0]) * 1.0 / 10
                print(info['level'])
                for td in basic_info.find_all('span', class_="item"):
                    info['basic_info'].append(td.string.encode('utf-8'))
            else:
                info['level'] = '-'
            # District
            region = tmp.find('span', itemprop='locality region')
            # Street address
            address = tmp.find('span', class_='item', itemprop="street-address")
            if region:
                info['region'] = region.string.encode('utf-8')
            else:
                info['region'] = '-'
            if address:
                info['address'] = address.string.encode('utf-8').strip()
            else:
                info['address'] = '-'
            # Telephone
            tel = tmp.find('p', class_="expand-info tel")
            if tel:
                info['telephone'] = tel.find('span', class_='item').string.encode('utf-8')
            else:
                info['telephone'] = '-'
            # Storing `info` (goal 3) is not shown in this post.
        if response.doc('P.comment-all>A'):
            # Follow the "more reviews" link
            for each in response.doc('P.comment-all>A').items():
                self.crawl(each.attr.href, callback=self.detail_page_all)
        else:
            # The current page already shows every review
            self.crawl(response.url, callback=self.detail_page)

    @config(age=4 * 24 * 60)
    def detail_page(self, response):
        global id
        each = BeautifulSoup(str(response.doc))
        # Get the reviews
        tmp = each.find_all('li', class_="comment-item")
        for tr in tmp:
            res = {}
            id += 1
            # Review ID
            res['itemid'] = id
            # User name
            if tr.find('p', class_='user-info'):
                res['user'] = tr.find('p', class_='user-info').a.string.encode('utf-8')
            else:
                res['user'] = '-'
            res['comment'] = {}
            # Review time
            date = tr.find('div', class_='misc-info')
            res['time'] = date.find('span', class_='time').string.encode('utf-8')
            # Shop information
            info = tr.find('p', class_='shop-info')
            # Overall rating given by the reviewer
            star = info.span.get('class')[1]
            res['level'] = int(re.findall(r'sml-str(.*)', str(star))[0]) * 1.0 / 10
            # Taste, environment and service scores ("label：value")
            for thing in info.find_all('span', class_='item'):
                thing = thing.string.encode('utf-8').split('：')
                res['comment'][thing[0]] = thing[1]
            # Average price per person
            if info.find('span', class_='average'):
                res['price'] = info.find('span', class_='average').string.encode('utf-8').split('：')[1]
            else:
                res['price'] = '-'
            # Review text: prefer the expanded version, fall back to the short one
            content = tr.find('div', class_='info J-info-all Hide')
            if content:
                res['content'] = content.p.string.encode('utf-8')
            elif tr.find('div', class_='info J-info-short'):
                res['content'] = tr.find('div', class_='info J-info-short').p.string.encode('utf-8').strip()
            else:
                res['content'] = '-'
            # Storing `res` (goal 3) is not shown in this post.

    @config(age=4 * 24 * 60)
    def detail_page_all(self, response):
        global count
        # Get all reviews
        for each in response.doc('DIV.comment-list').items():
            each = BeautifulSoup(str(each))
            for tr in each.find_all('li'):
                res = {}
                count += 1
                # Review ID
                res['itemid'] = count
                # Star rating; list items without one are skipped
                star = tr.find('div', class_='content')
                if star:
                    rank = star.span.get('class')[1]
                    res['level'] = int(re.findall(r'irr-star(.*)', str(rank))[0]) * 1.0 / 10
                else:
                    continue
                # Review time
                date = tr.find('div', class_='misc-info')
                res['time'] = date.find('span', class_='time').string.encode('utf-8')
                # User name
                name = tr.find('div', class_='pic')
                if name:
                    res['user'] = name.find('p', class_='name').string.encode('utf-8')
                else:
                    res['user'] = '-'
                # Taste, environment and service scores
                res['comment'] = {}
                page = tr.find('div', class_='comment-rst')
                if page:
                    info = re.findall('class="rst">(.*)<em class="col-exp">(.*)</em></span>', str(page))
                    for td in info:
                        res['comment'][td[0]] = td[1].strip('(').strip(')')
                # Whether this is a group-buy review
                group = tr.find('div', class_='comment-txt')
                if group and group.find('a', target='blank'):
                    res['shopping_group'] = group.find('a', target='blank').string.encode('utf-8')
                else:
                    res['shopping_group'] = '-'
                # Average price per person
                price = tr.find('span', class_='comm-per')
                if price:
                    res['price'] = price.string.encode('utf-8')
                else:
                    res['price'] = '-'
                # Brief review text
                if tr.find('div', class_='J_brief-cont'):
                    brief = str(tr.find('div', class_='J_brief-cont'))
                    res['content'] = re.findall(r'<div class="J_brief-cont">([\w\W]*)</div>', brief)[0].strip()
                else:
                    res['content'] = '-'
                # Storing `res` (goal 3) is not shown in this post.
        # Next page of reviews
        for each in response.doc('A.NextPage').items():
            self.crawl(each.attr.href, callback=self.detail_page_all)
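The handler above reads star ratings from a CSS class name rather than from visible text. It strips a prefix (`mid-str`, `sml-str`, or `irr-star`) and divides the remaining number by 10. Below is a small sketch of that step in isolation, written for Python 3. It assumes class values of the form `mid-str45`; the sample values and the helper name `rating_from_class` are illustrative and not taken from the site.

```python
import re

# The three prefixes matched by the handler's regular expressions.
RATING_CLASS = re.compile(r"^(?:mid-str|sml-str|irr-star)(\d+)$")


def rating_from_class(css_class):
    """Convert e.g. 'mid-str45' to 4.5; return None if the class does not match."""
    match = RATING_CLASS.match(css_class)
    if match is None:
        return None
    return int(match.group(1)) / 10.0


print(rating_from_class("mid-str45"))
print(rating_from_class("irr-star30"))
print(rating_from_class("item"))
```

Returning `None` for an unexpected class avoids the `IndexError` that `re.findall(...)[0]` raises when nothing matches, for example after a page layout change.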

Reposted from: https://www.cnblogs.com/jingyuewutong/p/5569108.html
