Python crawler: Dianping (大众点评) user information

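1. The first part is the same as before. As for where the user links come from: the merchant reviews scraped in the previous installment contain a link to each user's profile page. I saved those links to a CSV file, which serves as the link library for this crawl.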

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
import json
import random

# Full path to chromedriver, including the file name
CHROME_DRIVER = 'C:\\Users\\Administrator\\Desktop\\chromedriver_win32\\chromedriver.exe'

# Create a Chrome options object (optional)
#opt = webdriver.ChromeOptions()
#opt.add_argument('blink-settings=imagesEnabled=false')  # do not load images, which speeds things up

# Note: recent Selenium 4 releases no longer accept executable_path;
# there the driver path is passed through a Service object instead.
driver = webdriver.Chrome(executable_path=CHROME_DRIVER)  # ,options=opt

# Read the user list
users = pd.read_csv('users.csv')
href = ['https://www.dianping.com/member/' + str(i) for i in users['user_id']]
# Convert to review-page links
comment_href = [i + '/reviews' for i in href]

# After the site opens, log in to Dianping manually, then run the statements below
driver.get('http://www.dianping.com/')

2. Get the cookies

cookies = driver.get_cookies()
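
The cookies returned by driver.get_cookies() are a list of plain dictionaries, so they can be written to disk and reused in a later session instead of logging in again. The following is a minimal sketch of that idea, not part of the original script: the helper names save_cookies and load_cookies, the file name, and the example cookie are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_cookies(cookies, path):
    """Write a list of cookie dicts (as returned by driver.get_cookies()) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f, ensure_ascii=False, indent=2)


def load_cookies(path):
    """Read the list of cookie dicts back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up cookie standing in for the real output of driver.get_cookies()
example_cookies = [{"name": "session", "value": "abc123", "domain": ".dianping.com"}]

# Hypothetical file location, chosen only for this demo
cookie_path = os.path.join(tempfile.gettempdir(), "dianping_cookies.json")
save_cookies(example_cookies, cookie_path)
restored = load_cookies(cookie_path)
print(restored == example_cookies)
```

In a real run, each restored dictionary would be passed to driver.add_cookie() after the site has been opened once.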

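3. The crawler itself. This time the main program loops over individual reviews. Because the crawl is not very large, this approach produces fairly clean data. It also runs much faster than the previous version, so time.sleep has to be added to slow it down and keep the IP from being blocked. If you have several computers, you can run them in parallel and double the speed!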
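
As a rough illustration of running several machines in parallel, the link list can be cut into one slice per machine, with each machine pausing for a random interval between requests. This is only a sketch under assumptions: the helper names split_for_machines and polite_delay, the delay bounds, and the sample member ids are invented for the example.

```python
import random


def split_for_machines(links, n_machines):
    """Split links into n_machines contiguous slices of nearly equal size."""
    size, extra = divmod(len(links), n_machines)
    chunks, start = [], 0
    for m in range(n_machines):
        # The first `extra` machines take one additional link each
        end = start + size + (1 if m < extra else 0)
        chunks.append(links[start:end])
        start = end
    return chunks


def polite_delay(low=1.0, high=3.0):
    """Return a random pause length in seconds, between low and high."""
    return random.uniform(low, high)


# Made-up member ids, used only to have something to split
links = ["https://www.dianping.com/member/%d" % i for i in range(10)]
chunks = split_for_machines(links, 3)
print([len(c) for c in chunks])  # [4, 3, 3]
```

Each machine would then loop over its own slice and call time.sleep(polite_delay()) between page loads.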

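A quick tip: many people inspect a page element and then copy its XPath directly. That easily causes problems, because the copied path relies on numeric indexes such as div[2]. On another page with a different number of elements, the same node may become div[3], so the loop is likely to raise an error. It is better to install an XPath browser plugin and use it to find a path that selects by attribute, such as div[@class='txt']. Paths like this rarely break, because page designers generally do not give different kinds of content the same attributes.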
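
The difference between the two kinds of path can be seen on two toy pages. The sketch below uses Python's standard xml.etree.ElementTree module, which supports a small XPath subset, rather than Selenium; the two page snippets and the function names are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# Two made-up pages: page B has one extra <div> before the target element
PAGE_A = "<body><div class='nav'>menu</div><div class='txt'>review text</div></body>"
PAGE_B = ("<body><div class='nav'>menu</div><div class='ad'>banner</div>"
          "<div class='txt'>review text</div></body>")


def by_position(page):
    """Select the second <div> child, as a copied index-based XPath would."""
    node = ET.fromstring(page).find("./div[2]")
    return node.text if node is not None else None


def by_attribute(page):
    """Select the <div> whose class attribute is 'txt'."""
    node = ET.fromstring(page).find("./div[@class='txt']")
    return node.text if node is not None else None


for name, page in (("A", PAGE_A), ("B", PAGE_B)):
    print(name, by_position(page), by_attribute(page))
```

On page A both paths find the review text. On page B the index-based path lands on the banner, while the attribute-based path still finds the review.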

driver.add_cookie(cookies[0])
data_list = []
comment_list = []
#list(range(len(href)))
for i in list(range(0, 998)):
    # Wrap get() in a try block so that, if the IP is unexpectedly blocked,
    # the data collected so far is saved first
    try:
        driver.get(href[i])
    except Exception:
        print('IP blocked: run again in two hours')
        user_info = pd.concat(data_list, axis=0)
        user_info.to_csv('user_info_00000.csv', index=False, encoding='GBK')
        comment = pd.concat(comment_list, axis=0)
        comment.to_csv('comment_00000.csv', index=False, encoding='GBK')
        break

    # Pause for about an hour after every 100 users;
    # otherwise Dianping blocks the IP for 24 hours
    if (i + 1) % 100 == 0:
        print('waiting...............')
        time.sleep(3500)

    # Following, followers, interactions, registration time, contribution, region, gender, VIP
    guanzhu = driver.find_element(By.XPATH, '/html/body/div[2]/div[2]/div/div[1]/div[1]/div/div[1]/ul/li[1]/a/strong').text
    fensi = driver.find_element(By.XPATH, '/html/body/div[2]/div[2]/div/div[1]/div[1]/div/div[1]/ul/li[2]/a/strong').text
    hudong = driver.find_element(By.XPATH, '/html/body/div[2]/div[2]/div/div[1]/div[1]/div/div[1]/ul/li[3]/strong').text
    register_time = driver.find_element(By.XPATH, '/html/body/div[2]/div[2]/div/div[1]/div[1]/div/div[2]/p[3]').text
    contribution = driver.find_element(By.XPATH, '//*[@id="J_col_exp"]').text

    # Not every user has these fields; fall back to a default value when one is missing
    try:
        region = driver.find_element(By.XPATH, '/html/body/div[2]/div[1]/div/div/div/div[2]/div[2]/span[2]').text
    except Exception:
        region = 'unknown'
    try:
        gender = driver.find_element(By.XPATH, '/html/body/div[2]/div[1]/div/div/div/div[2]/div[2]/span[2]/i').get_attribute('class')
    except Exception:
        gender = 'unknown'
    try:
        driver.find_element(By.XPATH, "//div[@class='txt']/div[@class='tit']/div[@class='vip']/a/i[@class='icon-vip']")
        vip = 1
    except Exception:
        vip = 0

    # Assemble a one-row data frame
    x = pd.DataFrame({'user_id': users['user_id'][i],
                      'guanzhu': guanzhu,
                      'fensi': fensi,
                      'hudong': hudong,
                      'register_time': register_time,
                      'contribution': contribution,
                      'region': region,
                      'gender': gender,
                      'vip': vip}, index=[0])
    data_list.append(x)
    print(str(i) + 'info')
    time.sleep(random.randrange(0, 2))

    # Fetch the user's reviews
    try:
        driver.get(comment_href[i])
    except Exception:
        print('IP blocked: run again in two hours')
        user_info = pd.concat(data_list, axis=0)
        user_info.to_csv('user_info_00000.csv', index=False, encoding='GBK')
        comment = pd.concat(comment_list, axis=0)
        comment.to_csv('comment_00000.csv', index=False, encoding='GBK')
        break

    # Pause for a random 1 or 2 seconds
    time.sleep(random.randrange(1, 3))

    # Loop over review pages
    for j in list(range(10)):
        # Reset the counter so a value left over from an earlier page is never reused
        p = 0
        # Build the review XPath and the time XPath
        for k in list(range(1, 16)):
            comment_xpath = "//div[@id='J_review']/div[@class='pic-txt']/ul/li[" + str(k) + "]/div[@class='txt J_rptlist']/div[@class='txt-c']/div[@class='mode-tc comm-entry']"
            time_xpath = "//div[@id='J_review']/div[@class='pic-txt']/ul/li[" + str(k) + "]/div[@class='txt J_rptlist']/div[@class='txt-c']/div[@class='mode-tc info']/span[@class='col-exp']"
            try:
                u_time = driver.find_element(By.XPATH, time_xpath).text.strip('发表于').strip('更新于')
                u_time = pd.to_datetime(u_time, format='%y-%m-%d')
                if u_time >= pd.to_datetime('18-9-27', format='%y-%m-%d'):
                    comment = driver.find_element(By.XPATH, comment_xpath).text
                    x = pd.DataFrame({'href': href[i],
                                      'u_time': u_time,
                                      'comment': comment}, index=[0])
                    comment_list.append(x)
                    p = k
                else:
                    break
            except Exception:
                break
        if p == 15:
            # Pagination: go to the next page only if all 15 reviews on this page were kept
            try:
                driver.find_element(By.LINK_TEXT, '下一页').click()
            except Exception:
                break
        else:
            break
# Convert the list data to data frames and write them to files
# !!! Remember to change the file names
user_info = pd.concat(data_list,axis = 0)
user_info.to_csv('user_info.csv',index = False,encoding = 'GBK')
comment = pd.concat(comment_list,axis = 0)
comment.to_csv('comment.csv',index = False,encoding = 'GBK')
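
The review loop above keeps a review only if its date is on or after 27 September 2018, after removing the "发表于" (posted on) or "更新于" (updated on) prefix. Note that str.strip treats its argument as a set of characters rather than a literal prefix. The sketch below shows one way to do the same check with explicit prefix removal, using only the standard library. The function names and sample strings are assumptions inferred from the strip calls and the '%y-%m-%d' format in the script, not confirmed page output.

```python
from datetime import datetime

CUTOFF = datetime(2018, 9, 27)     # same cutoff date as in the crawler loop
PREFIXES = ("发表于", "更新于")      # "posted on" / "updated on"


def parse_review_date(raw):
    """Remove a known prefix, if present, and parse a yy-mm-dd date."""
    text = raw.strip()
    for prefix in PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
    return datetime.strptime(text.strip(), "%y-%m-%d")


def is_recent(raw):
    """True if the review date is on or after the cutoff."""
    return parse_review_date(raw) >= CUTOFF


# Assumed sample strings, for illustration only
for sample in ("发表于18-10-01", "更新于18-09-26", "发表于18-9-27"):
    print(sample, is_recent(sample))
```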
