Scraping JD.com Mechanical Keyboard Review Counts with Scrapy and Plotting the Results

Introduction

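I recently wanted to learn more about mechanical keyboards, so I used scrapy to scrape the mechanical keyboard listings on JD.com (京东),
then used Python to chart the review counts by shop name.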

Analysis

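Before writing the spider, we need to work out how JD serves its mechanical keyboard search results.
1. Go to JD and search for 机械键盘 (mechanical keyboard)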

# Page URL
https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&wq=机械键盘&pvid=fdac35af19ef4c7bbe23defb205b1b59

2. View the page source

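The source shows that only 30 items are rendered by default. When you scroll past the 30th item in the browser, the page automatically loads the next 30 items via ajax.

The browser's developer tools show the URL that the page loads asynchronously via ajax: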

# Second 30 items
https://search.jd.com/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=2&s=27&scrolling=y&log_id=1517196404.59517&tpl=1_M&show_items=3378484,6218105,3204859,2629440,3491212,2991278,1832316,4103095,5028795,2694404,3034311,1543721098,3606368,1792545,4911552,10494209225,2818591,2155852,1882111,3491218,584773,2942614,4285176,4873773,4106737,3204891,1495945,5259880,12039586866,3093295

Note:
The URL contains "page=2".
The show_items value in the URL is the list of "data-sku" values of the first 30 items in the page source.

Once the ajax request has loaded the second 30 items, the page is complete.
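The way the ajax URL is assembled from the data-sku values can be sketched in plain Python. This is only an illustration: build_ajax_url is a hypothetical helper, and it includes just a subset of the query parameters seen in the captured URL. Whether the server also needs the others (qrst, rt, stop, vt, s, log_id, tpl) is something we have not verified.

```python
from urllib.parse import urlencode

AJAX_ENDPOINT = "https://search.jd.com/s_new.php"


def build_ajax_url(keyword, page, skus):
    """Build a URL for the second half of a result page (hypothetical helper).

    `page` is the even page number and `skus` are the data-sku values
    of the 30 items already shown in the first half of the page.
    """
    params = {
        "keyword": keyword,
        "enc": "utf-8",
        "wq": keyword,
        "page": page,
        "scrolling": "y",
        "show_items": ",".join(skus),
    }
    # safe="," keeps the comma-separated SKU list unescaped in the URL.
    return AJAX_ENDPOINT + "?" + urlencode(params, safe=",")


url = build_ajax_url("机械键盘", 2, ["3378484", "6218105", "3204859"])
print(url)
```

The keyword is percent-encoded as UTF-8 by urlencode, which matches the enc=utf-8 parameter in the URL.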

3. Analyze pagination
Click through to page 2 and look at the URLs:

# Page 2, first 30 items
https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=3&s=57&click=0
# Page 2, second 30 items
https://search.jd.com/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=4&s=84&scrolling=y&log_id=1517225828.64245&tpl=1_M&show_items=14689611523,1365181,3890366,3086129,5455802,4237668,3931658,3491228,1654797409,2361918,5442762,4237678,5225170,4960228,4237662,3931616,3491188,5009394,10151123711,4838698,4911578,1543721097,3093301,4838762,1836476,5910288,1135833,4277018,5028785,1324969

Click through to page 3 and look at the URLs:

# Page 3, first 30 items
https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=5&s=110&click=0
# Page 3, second 30 items
https://search.jd.com/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=6&s=137&scrolling=y&log_id=1517225931.50937&tpl=1_M&show_items=5965870,3093297,14758401114,4825074,1247140,4911566,3634890,3212216,2329142,5155156,5225170,1812788,613970,5391428,1836460,1771658520,1308971,2512327,15428123588,2512333,3176567,6039820,10048750474,3093303,3724961,338871,10235508261,2144773,1939376,1543721095

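From the above we can see that the page parameter of the main request takes odd values (3 for page 2, 5 for page 3), while the page parameter of the ajax request for the second 30 items takes even values (2, 4, 6).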
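The odd/even pattern can be written down as two small functions. This is a sketch with hypothetical names. Note that the captured URL for the first page carries no page parameter, so treating it as page=1 is an assumption.

```python
def first_half_page(n):
    """page value of the main request for visible page n (odd)."""
    return 2 * n - 1


def second_half_page(n):
    """page value of the ajax request for visible page n (even)."""
    return 2 * n


for n in range(1, 4):
    print(n, first_half_page(n), second_half_page(n))
```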

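This gives us the crawl plan: scrape the first 30 items of the current page, collect their data-sku values, then simulate the ajax request to load the second 30 items. Once the current page is fully scraped, move to the next page in the same way and continue until the last page.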

Implementation

1. Define the item

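vim items.py
# -*- coding: utf-8 -*-
import scrapy
# Scrapy 2.2+; older versions import these from scrapy.loader.processors
from itemloaders.processors import MapCompose, TakeFirst


# Convert the review count from a string to a float and expand the 万 (10,000) unit
# so that later analysis is easier
def filter_comment(x):
    value = x.strip('+')
    if value.endswith('万'):
        return float(value[:-1]) * 10000
    return float(value)


class KeyboardItem(scrapy.Item):
    # Shop name
    shopname = scrapy.Field(input_processor=MapCompose(str.strip), output_processor=TakeFirst())
    # Product name
    band = scrapy.Field(output_processor=TakeFirst())
    # Price
    price = scrapy.Field(output_processor=TakeFirst())
    # Review count
    comment = scrapy.Field(input_processor=MapCompose(filter_comment), output_processor=TakeFirst())

Where:
filter_comment converts the review count from a string to a float and expands the 万 unit, which simplifies later analysis. Some review counts are given in units of 万 (10,000), such as 1.5万.
MapCompose(str.strip) strips leading and trailing whitespace.
output_processor=TakeFirst() takes the first value of the field; without it, shopname, price, band and comment would all be lists.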
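To see what the two processors do to a field without running scrapy, here is a rough plain-Python (Python 3) imitation. map_compose and take_first are simplified stand-ins written for this illustration, not the library's implementation, and the raw values are made-up examples in the shape the XPath extraction returns.

```python
def filter_comment(text):
    """Turn '1.5万+' into 15000.0 and '9900+' into 9900.0."""
    value = text.strip('+')
    if value.endswith('万'):
        return float(value[:-1]) * 10000
    return float(value)


def map_compose(*functions):
    """Simplified stand-in for MapCompose: apply each function to every value."""
    def process(values):
        for func in functions:
            values = [func(v) for v in values]
            # Like MapCompose, drop values for which a function returned None.
            values = [v for v in values if v is not None]
        return values
    return process


def take_first(values):
    """Simplified stand-in for TakeFirst: first value that is not None or ''."""
    for value in values:
        if value is not None and value != '':
            return value
    return None


raw_comment = ['1.5万+']              # extracted values always arrive as a list
raw_shopname = ['  罗技G官方旗舰店  ']

comment = take_first(map_compose(filter_comment)(raw_comment))
shopname = take_first(map_compose(str.strip)(raw_shopname))
print(comment, shopname)
```

The input processor runs on every extracted value, and the output processor then collapses the list into a single scalar.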

Without this processing, the generated JSON file would look like this:

[
{"comment": ["12万+"], "band": ["新盟游戏", "机械键盘"], "price": ["129.00"], "shopname": ["罗技G官方旗舰店"]},
......
]

After processing:

[
{"comment": 120000.0, "band": "新盟游戏", "price": "129.00", "shopname": "罗技G官方旗舰店"},
......
]

This format is easier to process with Python's pandas.

Spider implementation

1. Write the spider

vim keyboard.py
# -*- coding: utf-8 -*-
# Search JD for mechanical keyboards
import scrapy
from jingdong.items import KeyboardItem
from scrapy.loader import ItemLoader


class KeyboardSpider(scrapy.Spider):
    name = 'keyboard'
    allowed_domains = ['jd.com']
    #start_urls = ['https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&wq=机械键盘&pvid=361c7116408b4a10b5e769e3fd25bbbf']
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"}

    def start_requests(self):
        # Overridden so that the headers are sent with the first request
        yield scrapy.Request(url='https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&wq=机械键盘&pvid=361c7116408b4a10b5e769e3fd25bbbf', meta={'pagenum': 1}, headers=self.headers, callback=self.parse_first30)

    def parse_items(self, response):
        # Build one item per product entry; shared by both callbacks
        for eachitem in response.xpath('//li[@class="gl-item"]'):
            load = ItemLoader(item=KeyboardItem(), selector=eachitem)
            info = load.nested_xpath('div')
            info.add_xpath('shopname', 'div[@class="p-shop"]/span/a/@title')
            info.add_xpath('band', 'div[@class="p-name p-name-type-2"]/a/em/text()')
            info.add_xpath('price', 'div[@class="p-price"]/strong/i/text()')
            info.add_xpath('comment', 'div[@class="p-commit"]/strong/a/text()')
            yield load.load_item()

    def parse_first30(self, response):
        # Scrape the first 30 items
        pagenum = response.meta['pagenum']
        self.logger.info('Mechanical keyboards page %d, first 30 items', pagenum)
        # Get the SKUs of the first 30 items
        skulist = response.xpath('//li[@class="gl-item"]/@data-sku').extract()
        if not skulist:
            # No items on this page: we are past the last page, so stop
            return
        yield from self.parse_items(response)
        skustring = ','.join(skulist)
        # The second 30 items use the even page number
        pagenum_more = pagenum * 2
        baseurl = 'https://search.jd.com/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&s=28&scrolling=y&log_id=1517052655.49883&tpl=1_M&'
        # ajax URL that loads the second 30 items
        ajaxurl = baseurl + 'page=' + str(pagenum_more) + '&show_items=' + skustring
        yield scrapy.Request(ajaxurl, meta={'pagenum': pagenum}, headers=self.headers, callback=self.parse_next30)

    def parse_next30(self, response):
        # Scrape the second 30 items
        pagenum = response.meta['pagenum']
        self.logger.info('Mechanical keyboards page %d, second 30 items', pagenum)
        yield from self.parse_items(response)
        pagenum = pagenum + 1
        # Actual page value of the next page
        nextreal_num = pagenum * 2 - 1
        # Next page URL
        next_page = 'https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&s=56&click=0&page=' + str(nextreal_num)
        yield scrapy.Request(next_page, meta={'pagenum': pagenum}, headers=self.headers, callback=self.parse_first30)

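Note: the number n of the page being visited is passed along through meta. For example:

For page 1, pagenum=1, and only the first 30 items are shown.
pagenum_more = pagenum*2 = 2 is the page value in the ajax URL for the second 30 items.
For page 2, nextreal_num = pagenum*2-1 = 3 is the page value in the next page's URL.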

2. Run

scrapy crawl keyboard -o keyboard.json
[
{"comment": 120000.0, "band": "新盟游戏", "price": "129.00"},
{},
{},
{"comment": 15000.0, "band": "罗技（Logitech）G610 Cherry轴全尺寸背光", "price": "599.00", "shopname": "罗技G官方旗舰店"},
{"comment": 9900.0, "band": "ikbc c104 樱桃轴", "price": "389.00", "shopname": "ikbc京东自营旗舰店"},
{"comment": 11000.0, "band": "美商海盗船（USCorsair）Gaming系列 K70 LUX RGB 幻彩背光", "price": "1299.00", "shopname": "美商海盗船京东自营旗舰店"},
{"comment": 34000.0, "band": "达尔优（dareu）108键", "price": "199.00", "shopname": "达尔优京东自营旗舰店"},
{"comment": 74000.0, "band": "雷柏（Rapoo） V700S合金版 混光", "price": "189.00", "shopname": "雷柏京东自营官方旗舰店"},
{"comment": 8100.0, "band": "罗技（Logitech）G610 Cherry轴全尺寸背光", "price": "599.00", "shopname": "罗技G官方旗舰店"},
{"comment": 26000.0, "band": "雷蛇（Razer）BlackWidow X 黑寡妇蜘蛛X幻彩版 悬浮式游戏", "price": "799.00", "shopname": "雷蛇RAZER京东自营旗舰店"},
{"comment": 74000.0, "band": "雷柏（Rapoo） V500PRO 混光", "price": "169.00", "shopname": "雷柏京东自营官方旗舰店"},
{"comment": 150000.0, "band": "前行者游戏背光发光牧马人", "price": "65.00", "shopname": "敏涛数码专营店"},
{"comment": 11000.0, "band": "樱桃（Cherry）MX-BOARD 2.0 G80-3800 游戏办", "price": "389.00"},
{"comment": 12000.0, "band": "美商海盗船（USCorsair）STRAFE 惩戒者 ", "price": "699.00", "shopname": "美商海盗船京东自营旗舰店"},
{"comment": 6700.0, "band": "罗技（Logitech）G413", "price": "449.00", "shopname": "罗技G官方旗舰店"},
{"comment": 120000.0, "band": "新盟游戏", "price": "89.00", "shopname": "敏涛数码专营店"},
{"comment": 26000.0, "band": "雷蛇（Razer）BlackWidow X 黑寡妇蜘蛛X 竞技版87键 悬浮式游戏", "price": "299.00", "shopname": "雷蛇RAZER京东自营旗舰店"},
{"comment": 110000.0, "band": "达尔优（dareu）108键", "price": "199.00", "shopname": "达尔优京东自营旗舰店"},
{"comment": 61000.0, "band": "狼蛛（AULA）F2008混光跑马 ", "price": "129.00", "shopname": "狼蛛外设京东自营官方旗舰店"},
.......
]
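The output above contains empty records ({}) and records without a shopname. Since the later analysis groups by shop name, it can help to drop such records first. Below is a minimal sketch using only the standard library. keep_complete and REQUIRED are hypothetical names, and the sample is a short excerpt of the output above.

```python
import json

sample = '''[
  {"comment": 120000.0, "band": "新盟游戏", "price": "129.00"},
  {},
  {"comment": 15000.0, "band": "罗技（Logitech）G610 Cherry轴全尺寸背光", "price": "599.00", "shopname": "罗技G官方旗舰店"}
]'''

# Fields that the analysis step needs (an assumption for this sketch).
REQUIRED = ("shopname", "comment")


def keep_complete(records, required=REQUIRED):
    """Return only the records that contain every required field."""
    return [r for r in records if all(key in r for key in required)]


records = json.loads(sample)
clean = keep_complete(records)
print(len(records), "records,", len(clean), "complete")
```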

Scientific computing

With the data scraped by scrapy, we use Python's scientific computing stack to analyze
the review counts per shop name and plot them.

vim keyboard_analyse.py
#!/home/yanggd/miniconda2/envs/science/bin/python
# -*- coding: utf-8 -*-
import io
import json

import matplotlib.pyplot as plt
from pandas import DataFrame

filename = 'keyboard.json'

# Build a DataFrame from the JSON file
with io.open(filename, encoding='utf-8') as f:
    pop_data = json.load(f)
df = DataFrame(pop_data)

# Mean review count per shop; records without a shopname are left out by groupby
group = df.groupby('shopname')['comment'].mean()
#print(group)

# Font settings so that the Chinese labels render
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = ['simhei']
plt.rcParams['axes.unicode_minus'] = False
# Bar chart
group.plot(kind='bar')
plt.xlabel(u"店铺名")  # shop name
plt.ylabel(u"评论量")  # review count
plt.show()

# Run
python keyboard_analyse.py
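What the groupby and mean step computes can be cross-checked without pandas. The sketch below uses only the standard library. mean_comments_by_shop is a hypothetical helper, and the records are a few rows taken from the sample output shown earlier.

```python
from collections import defaultdict

records = [
    {"comment": 15000.0, "shopname": "罗技G官方旗舰店"},
    {"comment": 8100.0, "shopname": "罗技G官方旗舰店"},
    {"comment": 6700.0, "shopname": "罗技G官方旗舰店"},
    {"comment": 74000.0, "shopname": "雷柏京东自营官方旗舰店"},
    {"comment": 74000.0, "shopname": "雷柏京东自营官方旗舰店"},
]


def mean_comments_by_shop(rows):
    """Average the review counts of all rows that share a shop name."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        # Skip incomplete records, which have nothing to group or average.
        if "shopname" not in row or "comment" not in row:
            continue
        totals[row["shopname"]] += row["comment"]
        counts[row["shopname"]] += 1
    return {shop: totals[shop] / counts[shop] for shop in totals}


means = mean_comments_by_shop(records)
for shop, mean in means.items():
    print(shop, round(mean, 1))
```

Each bar in the chart corresponds to one entry of this dictionary: a shop name and the average review count of its listed products.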