Scraping a User's Xiami Music Playlist with Python

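Writing a crawler in Python is very easy, and there is now a mature crawling framework, Scrapy. In this post we use Scrapy to scrape the songs in our own Xiami playlist.
From this post you will learn:

Basic spider design
Simulated login
Keeping the login state
XPath

(well, a small taste of each). This post assumes you have already installed Scrapy by following the official documentation (or its Chinese translation) and have tried the tutorial example.
The first step is to create the project: the command scrapy startproject xiami creates a Scrapy project named xiami in the current directory.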

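Basic spider design

Starting from the requirements, first decide what we want to scrape. To keep things simple, we will scrape the page title, the user's name, and the song names in the playlist. The order of the sections below follows the Scrapy official documentation.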

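items

Start by editing items.py. Items are containers for the scraped data; they are Scrapy's own data structure and, like dictionaries, store data as key-value pairs. Once items are defined, we can use Scrapy's own mechanisms to save the data to various databases or local files. Later we will save the scraped playlist to a local JSON file.

Open items.py; the default code is: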

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
# ...

import scrapy


class XiamiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

Add the field definitions:

class XiamiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # page title
    name = scrapy.Field()  # user name
    song = scrapy.Field()  # song name from the playlist

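In effect we have only created an empty dictionary and declared which keys it may hold. Next we need to define a spider that scrapes the data and hands it to the item.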
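
To make the "dictionary with declared keys" idea concrete, here is a minimal sketch that uses only the standard library. DeclaredFieldsItem and XiamiItemSketch are hypothetical stand-ins, not Scrapy's actual Item class; they only imitate the one behaviour that matters here, namely that assigning to a key that was never declared fails with a KeyError, as it does on a Scrapy item.

```python
class DeclaredFieldsItem(dict):
    """Hypothetical stand-in for scrapy.Item: a dict that only accepts declared keys."""

    fields = ()  # subclasses list their allowed keys here

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        super().__setitem__(key, value)


class XiamiItemSketch(DeclaredFieldsItem):
    # Same three fields as XiamiItem in items.py
    fields = ("title", "name", "song")


item = XiamiItemSketch()
item["title"] = "xiami_music"
item["song"] = "Some Song"  # placeholder value, not real scraped data
print(dict(item))

try:
    item["artist"] = "nobody"  # "artist" was never declared, so it is rejected
except KeyError as exc:
    print("rejected:", exc)
```

Declaring the fields up front means a typo in a key name fails loudly in the spider instead of silently producing an extra column in the output.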

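spider

The project's spiders folder initially contains only an empty __init__.py file. This folder holds your custom spider modules. Any .py file created in it is a spider module, although at this point it does nothing yet. Create a Python file named xiami_spider.py; this is the spider we will use to scrape the Xiami playlist. Then define its basic elements: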

from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule


class XiamiSpider(CrawlSpider):
    name = "xiaoxia"  # spider name: "little shrimp"
    allowed_domains = ["xiami.com"]
    start_urls = ["http://www.xiami.com"]
    account_number = '9839****8@qq.com'  # replace with your account
    password = '123456'  # replace with your password

    # start_requests is overridden, so the default implementation that creates
    # a Request for each URL in start_urls is no longer used
    def start_requests(self):
        return [Request("https://login.xiami.com/member/login",
                        meta={'cookiejar': 1},
                        callback=self.post_login)]

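This new class inherits from CrawlSpider rather than the plain Spider. CrawlSpider is a subclass of Spider, so it offers more functionality and saves us a lot of code.
name is the spider's name; it is needed later when running the program.
allowed_domains limits the area the spider may crawl. Without restricting it to pages on the Xiami site, a spider that keeps following new links could end up scraping lots of unrelated content from unrelated sites.
start_urls holds the initial URLs. For a spider, the crawl cycle goes roughly like this:
- The default start_requests() function is called. It initializes Requests from the initial URLs and sets a callback. When a request has been downloaded, a response is generated and passed to that callback as its argument. The spider's initial requests are obtained by calling start_requests(), which reads the URLs in start_urls and generates a Request for each, with parse as the callback.
- The callback analyzes the returned (web page) content and returns Item objects, Requests, or an iterable containing both. Returned Requests are then processed by Scrapy, which downloads the corresponding content and calls the callback set on them (it may be the same function).
- Inside the callback you can use Selectors (or BeautifulSoup, lxml, or any parser you like) to analyze the page content and build items from the extracted data.
- Finally, the items returned by the spider are stored in a database (by some Item Pipeline) or written to a file using Feed exports.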
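
The cycle above can be sketched without Scrapy at all. The following is only an illustration under stated assumptions: FAKE_PAGES is a hypothetical in-memory "web", and parse, crawl, and is_allowed are made-up names; none of this is how Scrapy is implemented internally. It shows initial requests, a callback that yields both items and new requests, and a domain filter in the spirit of allowed_domains.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (page text, links found on that page)
FAKE_PAGES = {
    "http://www.xiami.com": (
        "home",
        ["http://www.xiami.com/space/lib-song", "http://example.org/ad"],
    ),
    "http://www.xiami.com/space/lib-song": ("songs", []),
    "http://example.org/ad": ("ad", []),
}
ALLOWED_DOMAINS = ["xiami.com"]


def is_allowed(url):
    """Accept a listed domain and its subdomains, reject everything else."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def parse(url, body, links):
    """Callback: yield one item for the page, then one request per link."""
    yield {"url": url, "body": body}   # an "item"
    for link in links:
        yield ("request", link)        # a follow-up "request"


def crawl(start_urls):
    queue = deque(start_urls)          # initial requests come from start_urls
    seen = set(start_urls)
    items = []
    while queue:
        url = queue.popleft()
        body, links = FAKE_PAGES[url]  # stands in for downloading the page
        for result in parse(url, body, links):
            if isinstance(result, dict):
                items.append(result)   # items go on to storage
            else:
                _, link = result
                if is_allowed(link) and link not in seen:
                    seen.add(link)
                    queue.append(link)
    return items


items = crawl(["http://www.xiami.com"])
print([item["url"] for item in items])
```

The link to example.org is discovered but never queued, which is the effect the domain restriction is meant to have.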

The complete xiami_spider.py:

# coding=utf-8
from sys import argv

from scrapy.selector import Selector
from scrapy.http import Request, FormRequest, HtmlResponse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from xiami.items import XiamiItem


class XiamiSpider(CrawlSpider):
    print(argv)
    name = "xiaoxia"  # spider name: "little shrimp"
    allowed_domains = ["xiami.com"]
    start_urls = ["http://www.xiami.com"]
    account_number = '9839****8@qq.com'  # replace with your account
    password = '123456'  # replace with your password
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
        "Referer": "https://login.xiami.com/member/login?spm=a1z1s.6843761.226669498.1.2iL1jx"
    }

    # Only follow links to the song library pages and parse them with parse_page
    rules = (
        Rule(LinkExtractor(allow=('/space/lib-song',)), callback='parse_page', follow=True),
    )

    # start_requests is overridden, so the default implementation that creates
    # a Request for each URL in start_urls is no longer used
    def start_requests(self):
        return [Request("https://login.xiami.com/member/login",
                        meta={'cookiejar': 1},
                        callback=self.post_login)]

    # Submit the login form with a FormRequest
    def post_login(self, response):
        print('Preparing login')
        # Extract the _xiamitoken field from the login page; the form cannot be
        # submitted successfully without it
        _xiamitoken = Selector(response).xpath('//input[@name="_xiamitoken"]/@value').extract_first()
        print('Verification token: %s' % _xiamitoken)
        # FormRequest.from_response is a Scrapy helper for posting forms.
        # Once the login response arrives, after_login is called.
        return [FormRequest.from_response(response,
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=self.headers,
                                          formdata={
                                              'source': 'index_nav',
                                              '_xiamitoken': _xiamitoken,
                                              'email': self.account_number,
                                              'password': self.password
                                          },
                                          callback=self.after_login,
                                          dont_filter=True)]

    def after_login(self, response):
        print('after login=======')
        for url in self.start_urls:
            # Create the Request, passing the same cookiejar along
            yield Request(url, meta={'cookiejar': response.meta['cookiejar']})

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                # Overridden here: the cookiejar is added to the request meta
                r.meta.update(rule=n, link_text=link.text, cookiejar=response.meta['cookiejar'])
                yield rule.process_request(r)

    def parse_page(self, response):
        mysong_list = Selector(response)
        songs = mysong_list.xpath('//td[@class="song_name"]/a/@title').extract()
        if songs:  # avoid an IndexError when the page lists no songs
            print(songs[0])
        for song in songs:
            item = XiamiItem()
            item['title'] = 'xiami_music'
            item['name'] = self.account_number
            item['song'] = song
            yield item
        # nexturl = mysong_list.xpath('//a[@class="p_redirect_l"]/@href').extract_first()
        # yield self.make_requests_from_url(nexturl)
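
The two XPath expressions used by the spider can be tried out offline. The sketch below is a stand-in that uses the standard library's xml.etree.ElementTree instead of Scrapy's Selector; ElementTree supports only a subset of XPath and cannot select an attribute with /@name, so the attribute is read with .get() after the element is found. LOGIN_PAGE and SONG_PAGE are hypothetical, heavily simplified fragments, not the real Xiami markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-ins for the login page and the song-list page
LOGIN_PAGE = """
<html><body>
  <form>
    <input name="email" value="" />
    <input name="_xiamitoken" value="abc123" />
  </form>
</body></html>
""".strip()

SONG_PAGE = """
<html><body><table>
  <tr><td class="song_name"><a title="Song A" href="/song/1">Song A</a></td></tr>
  <tr><td class="song_name"><a title="Song B" href="/song/2">Song B</a></td></tr>
  <tr><td class="artist"><a title="Not a song" href="/artist/9">x</a></td></tr>
</table></body></html>
""".strip()

login_root = ET.fromstring(LOGIN_PAGE)
# Spider equivalent: //input[@name="_xiamitoken"]/@value
token_input = login_root.find(".//input[@name='_xiamitoken']")
token = token_input.get("value")

song_root = ET.fromstring(SONG_PAGE)
# Spider equivalent: //td[@class="song_name"]/a/@title
songs = [a.get("title") for a in song_root.findall(".//td[@class='song_name']/a")]

print(token)
print(songs)
```

The predicate in square brackets is what keeps the link inside the "artist" cell out of the result: only td elements whose class attribute equals song_name are matched.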

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs
from scrapy.exceptions import DropItem


class XiamiPipeline(object):
    def __init__(self):
        self.song_seen = set()
        self.file = codecs.open('xiamisongs.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        """
        Called for every item by each item pipeline component. It must return
        an Item (or any subclass) object or raise a DropItem exception; dropped
        items are not processed by later pipelines.
        :param item:
        :param spider:
        :return:
        """
        # Filter out items with missing data
        # if True:
        #   return item
        # else:
        #   raise DropItem('reason')
        if spider.name == 'xiaoxia':
            if item['song'] in self.song_seen:
                raise DropItem('Duplicate song found: %s' % item['song'])
            else:
                self.song_seen.add(item['song'])
                # Save to a JSON Lines file (optional)
                line = json.dumps(dict(item), ensure_ascii=False) + '\n'
                self.file.write(line)
        # Always pass the item on so that other spiders' items are not lost
        return item

    def close_spider(self, spider):
        print('spider close')
        self.file.close()
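
The core of the pipeline, dropping repeated songs and writing one JSON object per line, can be exercised on its own. This is a minimal sketch with the standard library only; DedupJsonLinesWriter is a hypothetical helper, it returns False where the real pipeline raises DropItem, and the sample records are made up for the demonstration.

```python
import io
import json


class DedupJsonLinesWriter:
    """Hypothetical helper: skip repeated songs, write the rest as JSON Lines."""

    def __init__(self, stream):
        self.stream = stream
        self.song_seen = set()

    def process(self, item):
        """Return True if the item was written, False if it was a duplicate."""
        if item["song"] in self.song_seen:
            return False
        self.song_seen.add(item["song"])
        # ensure_ascii=False keeps Chinese song titles readable in the output
        self.stream.write(json.dumps(item, ensure_ascii=False) + "\n")
        return True


buffer = io.StringIO()  # stands in for the xiamisongs.jl file
writer = DedupJsonLinesWriter(buffer)
results = [
    writer.process({"title": "xiami_music", "name": "user", "song": song})
    for song in ["晴天", "Song B", "晴天"]  # made-up sample data
]
print(results)
print(buffer.getvalue())
```

Keeping the seen set on the instance means duplicates are detected across the whole crawl, not just within one page.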

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for xiami project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'xiami'

SPIDER_MODULES = ['xiami.spiders']
NEWSPIDER_MODULE = 'xiami.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'xiami (+http://www.xiami.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.25
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'xiami.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'xiami.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'xiami.pipelines.XiamiPipeline': 300,  # 0-1000, determines the running order
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

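#!/usr/bin/python
# coding=utf-8
# Script that starts the spider
from scrapy.cmdline import execute

# execute('scrapy crawl xiaoxia'.split())
# Note: XiamiPipeline also writes to xiamisongs.jl, so with -o both write to
# the same file; use the line above or a different file name to avoid that
execute('scrapy crawl xiaoxia -o xiamisongs.jl'.split())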
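
Once a crawl has produced xiamisongs.jl, the file can be read back one line at a time, since each line is an independent JSON object. The sketch below assumes the three-field record layout used in this post; load_songs is a hypothetical helper, and the demo writes made-up records to a temporary file so that it runs without a real crawl.

```python
import json
import os
import tempfile


def load_songs(path):
    """Read a JSON Lines file and return the list of song names."""
    songs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                songs.append(json.loads(line)["song"])
    return songs


# Made-up sample records in the same shape as the scraped items
sample = [
    {"title": "xiami_music", "name": "user", "song": "Song A"},
    {"title": "xiami_music", "name": "user", "song": "Song B"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "xiamisongs.jl")
    with open(path, "w", encoding="utf-8") as handle:
        for record in sample:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    loaded = load_songs(path)

print(loaded)
```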

This is a draft; I will revise and polish it next week.