一步一步学网络爬虫（从python到scrapy）

大概花了一个星期的时间，学习了一下网络爬虫的知识，现在使用scrapy能爬一些基本的网页，图片，解决网页编码兼容问题，基础的模拟登陆。对于有些模拟登陆，由于其提交的表单要经过js进行处理后提交；更难的其网页也是经js渲染的，要学会一步步去分析，没有太多的去深入，但我会提到基本的分析方法。
参考文章：
1、http://www.runoob.com/ 一个很好的语言语法入门学习的网站，我主要用其学习了python的语法。
2、http://blog.csdn.net/column/details/why-bug.html 此博客讲了一些网络爬虫的基础知识，包括http,url等，而且一步步讲解了实现爬虫的整个过程。
3、http://doc.scrapy.org/en/latest/intro/tutorial.html scrapy框架的学习教程，从安装讲到应用到常见问题，是个不可多得的参考手册，至少过一遍，对于想深入研究的同学，一定要多看几遍。
4、http://blog.csdn.net/u012150179/article/details/34486677 对于中文输出与保存，实现多网页的爬取，做了实现。
5、http://www.jianshu.com/p/b7f41df6202d
http://www.jianshu.com/p/36a39ea71bfd
对于怎么实现模拟登陆做了较好的解释和实现，当然由于技术的不断更新和动态变化，网站的反爬虫的技术也在不断更新，具体情况，应具体分析。

下面正式进入学习：
环境：ubuntu14.04
一、python
1、python的下载和安装：https://www.python.org/downloads/ 在链接中找到自己需要的版本，记得在研究中基本不用version>3.0的版本，然而有为了支持一些新的功能，基本上version>2.70 and version<3.0是一个比较合适的选择。由于ubuntu14.04的底层有些使用python实现的，所以都带了python,(python2.74的版本或者其它）如果需要不同的版本可在不删除原有版本的基础上下载新版本，并修改软链接即可。ln -s python pythonx.xx中间若有问题，请自行百度解决。
2、python的基础知识学习。熟悉一下基本的语法，重点关注列表，元组，字典，函数和类。其它的若有问题，再返回去学习吧，学习链接在参考中已给出，练习一下，一两天就差不多能搞定了。

二、网络爬虫的基础知识
1、网络爬虫的定义、浏览网页的过程、URI和URL的概念和举例、URL的理解和举例。
2、正则表达式
自己练习一下，如果记不住了看看下面的表。

三、scrapy
1、scrapy的安装
http://doc.scrapy.org/en/latest/intro/install.html 根据你自己应用的平台进行选择。比较简单，不做过多的解释。
2、一个scrapy例子
http://doc.scrapy.org/en/latest/intro/tutorial.html 有几点要注意一下：一是知道如何去调试，二是xpath()和css()，还有要学会使用firebox和firebug分析网页源码和表单提交情况，看到前面，我们基本能实现单网页的爬取。
3、讲讲scrapy框架
对scrapy的框架和运行有一个具体的思路之后我们才能更好的了解爬虫的整个情况，尤其是出了问题之后的调试
http://blog.csdn.net/u012150179/article/details/34441655
http://scrapy-chs.readthedocs.org/zh_CN/latest/_images/scrapy_architecture.png

4、scrapy自动多网页爬取
总结起来实现有三种方式:第一种，先prase眼前的url,得到各个item后，在获得新的url,调用prase_item进行解析，在prase_item中又callback parse function 。第二种，也就是例子（scrapy的一个例子）中后部分提到的：

def parse(self, response):for href in response.css("ul.directory.dir-col > li > a::attr('href')"):url = response.urljoin(href.extract())yield scrapy.Request(url, callback=self.parse_dir_contents)

prase只做url的获取，具体的解析工作交给prase_dir_contents去做。第三种、就是根据具体的爬虫去实现了,Generic Spiders（例子中我们用的是Simple Spiders）中用的比较多的有CrawlSpider,这里我们介绍这个的实现

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractorclass MySpider(CrawlSpider):name = 'example.com'allowed_domains = ['example.com']start_urls = ['http://www.example.com']rules = (# Extract links matching 'category.php' (but not matching 'subsection.php')# and follow links from them (since no callback means follow=True by default). followRule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),# Extract links matching 'item.php' and parse them with the spider's method parse_item. processRule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),)def parse_item(self, response):.......

这便是其的Link Extractors机制。
5、中文输出与存储编码
http://blog.csdn.net/u012150179/article/details/34450547
存储：在 item.py中编码：item[‘title’] = [t.encode(‘utf-8’) for t in title] ，在pipe.py中解码：

    import json  import codecs  class W3SchoolPipeline(object):  def __init__(self):  self.file = codecs.open('w3school_data_utf8.json', 'wb', encoding='utf-8')  def process_item(self, item, spider):  line = json.dumps(dict(item)) + '\n'  # print line  self.file.write(line.decode("unicode_escape"))  return item

输出：

for t in title:  print t.encode('utf-8')

6、图片或者文件的下载
http://doc.scrapy.org/en/latest/topics/media-pipeline.html
这里，主要提供一个抓取妹子图片的例子。注意怎么实现文件名的替换与改写。

class ImagedownPipeline(object):def process_item(self, item, spider):if 'image_url' in item:#images = []dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)#            if not os.path.exists(dir_path):
#                os.makedirs(dir_path)for image_url in item['image_url']:image_page = item['image_page']dir_path = '%s/%s' %(dir_path, image_page)if not os.path.exists(dir_path):os.makedirs(dir_path)us = image_url.split('/')[3:]image_file_name = '_'.join(us)file_path = '%s/%s' % (dir_path, image_file_name)#images.append(file_path)if os.path.exists(file_path):continuewith open(file_path, 'wb') as handle:response = requests.get(image_url, stream=True)for block in response.iter_content():#if not block:#   breakhandle.write(block)#item['images'] = imagesreturn item

this kind of process would make more efficiency

class MyImagesPipeline(ImagesPipeline):def item_completed(self, results, item, info):image_paths = [x['path'] for ok, x in results if ok]
#        print settings.IMAGES_STORE + image_paths[0]if not image_paths:raise DropItem("Item contains no images")for img_url in item['image_urls']:us = img_url.split('/')[3:]newname = '_'.join(us)dir_path = settings.IMAGES_STORE+"full/"+item['image_page']+"/"
#        print dir_pathif not os.path.exists(dir_path):os.makedirs(dir_path)os.rename(settings.IMAGES_STORE + image_paths[0], dir_path + newname)return item

7、scrapy模拟登陆
此处不做过多的解释，给出两个例子。里面分析的很详细
http://www.jianshu.com/p/b7f41df6202d
http://www.jianshu.com/p/36a39ea71bfd
还有，有时需要模拟Request的headers，这在http://doc.scrapy.org/en/latest/topics/request-response.html 中有很好的说明，至于怎么找到对应的headers，可通过firebug或者其它。
8、处理js渲染过的网页
……
9、验证码的识别
…….
10、分布式爬虫的实现
…….

一步一步学网络爬虫（从python到scrapy）相关推荐

为什么要学网络爬虫？我来告诉你！
在数据量爆发式增长的互联网时代,网站与用户的沟通本质上是数据的交换:搜索引擎从数据库中提取搜索结果,将其展现在用户面前:电商将产品的描述.价格展现在网站上,以供买家选择心仪的产品:社交媒体在用户生态圈 ...
网络爬虫介绍||为什么学网络爬虫
网络爬虫介绍在大数据时代,信息的采集是一项重要的工作,而互联网中的数据是海量的,如果单纯靠人力进行信息采集,不仅低效繁琐,搜集的成本也会提高.如何自动高效地获取互联网中我们感兴趣的信息并为我们所用是 ...
Python网络爬虫之Python基本命令
往期内容 1.教你如何编写第一个简单的爬虫 2.Python编程无师自通–函数 3.在Windows平台上如何安装Python 本节主要介绍Python的一些基础语法.如果你已经学会使用Python, ...
精通python网络爬虫-精通Python网络爬虫：核心技术、框架与项目实战 PDF
给大家带来的一篇关于Python爬虫相关的电子书资源,介绍了关于Python.Python网络爬虫.Python核心技术.Python框架.Python项目实战方面的内容,本书是由机械工业出版社出版, ...
精通python网络爬虫-精通python网络爬虫
广告关闭腾讯云11.11云上盛惠 ,精选热门产品助力上云,云服务器首年88元起,买的越多返的越多,最高返5000元! 作者:韦玮转载请注明出处随着大数据时代的到来,人们对数据资源的需求越来越多, ...
精通python网络爬虫-精通Python网络爬虫：核心技术、框架与项目实战
-- 目录 -- 前言第一篇理论基础篇第1章什么是网络爬虫 1.1 初识网络爬虫 1.2 为什么要学网络爬虫 1.3 网络爬虫的组成 1.4 网络爬虫的类型 1.5 爬虫扩展--聚焦爬虫 1. ...
网络爬虫，python和数据分析学习--part1
# -- coding: utf-8 -- """ Created on Tue Oct 10 08:38:20 2017 本段程序为科大王澎老师<网络爬虫,pyt ...
网络爬虫，python和数据分析学习--part2
Created on Tue Oct 10 10:47:31 2017 本段程序为科大王澎老师<网络爬虫,python和数据分析>中P15,针对spyder3做了微调主要任务:实现了自动 ...
没有网络怎么学网络爬虫之BeautifulSoup爬取html表格存入Excel表格
学习网络爬虫当然是不可能一点网都没有的,我们前期需要网络打开自己需要的网页,获取网页上的源码保存在本地文件,就可以不用网络了,我家里的网络很差,我就是这样操作的,例如上节爬取智联招聘网步骤,下次访问的 ...

一步一步学网络爬虫（从python到scrapy）

一步一步学网络爬虫（从python到scrapy）相关推荐

最新文章

热门文章