Scrapy Beginner's Self-Study Notes

  1. Setting up the Scrapy environment
    Install Scrapy:
    pip install scrapy
    Install pywin32 (needed on Windows):
    D:\>pip install pywin32
    Collecting pywin32
    Using cached pywin32-223-cp35-cp35m-win32.whl
    Installing collected packages: pywin32
    Successfully installed pywin32-223
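    You can double-check the installation with:
    D:\>scrapy version
    which should print the installed Scrapy version (1.4.0 in the sessions recorded later in these notes).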
  2. Creating a Scrapy project
    2.1. Create the project
    D:\tmp>scrapy startproject tutorial
    New Scrapy project 'tutorial', using template directory 'D:\ProgramFiles\Python35\lib\site-packages\scrapy\templates\project', created in:
    D:\tmp\tutorial

You can start your first spider with:
cd tutorial
scrapy genspider example example.com

D:\tmp\tutorial>tree /F
Folder PATH listing for volume NewDisk
Volume serial number is CC68-7CC0
D:.
│   scrapy.cfg              # deploy configuration file
│
└─tutorial                  # project's module, you'll import your code from here
    │   items.py            # project items definition file
    │   middlewares.py      # project middlewares file
    │   pipelines.py        # project pipelines file
    │   settings.py         # project settings file
    │   __init__.py
    │
    ├─spiders               # a directory where you'll later put your spiders
    │   │   __init__.py
    │   │
    │   └─__pycache__
    └─__pycache__

2.2. Add a spider
Add a quotes_spider.py file under the spiders directory:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # unique name; must not be repeated within the same project

    def start_requests(self):  # must return an iterable of Request objects
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
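As a side note (a standard Scrapy shortcut, not used in the original notes): instead of overriding start_requests(), a spider can declare a start_urls class attribute and Scrapy generates the initial requests itself, using parse() as the default callback:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # identical to the parse() above: save each page to a local HTML file
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)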

2.3. Run
Run: scrapy crawl quotes
Result: the spider downloads both pages and saves them as quotes-1.html and quotes-2.html in the directory the command was run from.

  3. What happens behind the scenes?
    3.1. Debugging
    At the command line, run:
    scrapy shell 'http://quotes.toscrape.com/page/1/'

Output:
D:>scrapy shell http://quotes.toscrape.com/page/1/
2018-04-06 09:55:59 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2018-04-06 09:55:59 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_
2018-04-06 09:55:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-06 09:56:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-06 09:56:00 [scrapy.core.engine] INFO: Spider opened
2018-04-06 09:56:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000000002AE83C8>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x00000000054A4550>
[s] spider <DefaultSpider 'default' at 0x6682e10>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:

In [2]: request
Out[2]: <GET http://quotes.toscrape.com/page/1/>
In [3]: response
Out[3]: <200 http://quotes.toscrape.com/page/1/>

3.2. XPath
XPath syntax
Open http://quotes.toscrape.com/page/1/ in Chrome, right-click the first quote -> Inspect.

The XPath obtained this way (an absolute path):
/html/body/div/div[2]/div[1]/div[1]/span[1]
Run the commands:
In [7]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]')
Out[7]: []

In [8]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]').extract()
Out[8]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

In [9]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').extract()
Out[9]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
(Note: the markup was stripped from the outputs shown above. In [7] actually returns a list of Selector objects and In [8] returns the whole <span> element as a string; only In [9], which asks for /text(), returns just the quote text.)
Using a relative path
Analyze this fragment of the page:

Its content is:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
From this we get a relative XPath:
//div[@class="quote"]/span[@itemprop="text"]/text()

In [10]:
response.xpath('//div[@class="quote"]/span[@itemprop="text"]/text()').extract()
Out[10]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
'“Try not to become a man of success. Rather become a man of value.”',
'“It is better to be hated for what you are than to be loved for what you are not.”',
"“I have not failed. I've just found 10,000 ways that won't work.”",
"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
'“A day without sunshine is like, you know, night.”']
Note that we got 10 items here: with a relative path, the XPath expression matches every quote block on the page, so there are 10 results.
If you only want the first one, you can do this:
In [11]:
response.xpath('//div[@class="quote"]/span[@itemprop="text"]/text()').extract_first()
Out[11]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
Or:
In [12]: results = response.xpath('//div[@class="quote"]/span[@itemprop="text"]/text()').extract()

In [13]: results[0]
Out[13]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

3.3. Another approach: CSS

The full CSS path obtained from Chrome:
body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span.text

In [14]: response.css('body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span.text').extract()
Out[14]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
But .extract() here returns the whole matched <span> element rather than the bare text, so modify the selector to ask for the text node:
body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span::text

In [17]: response.css('body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span::text').extract()
Out[17]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'by ',
'\n ',
'\n ']
Now we get the content we need, but with 3 extra entries. Why? Because span::text takes the text nodes of every <span> inside the selected quote block, and the second <span> (the one holding "by Albert Einstein (about)") contributes its own stray text nodes ('by ' and the surrounding whitespace).
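If you only want the quote text, a narrower selector that targets the span with class text avoids the extra entries altogether (my own variant; in effect it is the CSS equivalent of the itemprop XPath used earlier):

response.css('div.quote > span.text::text').extract_first()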
To take just the first entry:
In [18]: response.css('body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span::text').extract_first()
Out[18]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

Relative CSS selectors:
In [41]: response.css("div.quote").extract_first()
Out[41]: '<div class="quote" ...> … </div>'  (the full HTML of the first quote block: the quote text, "by Albert Einstein (about)", and its tags — change, deep-thoughts, thinking, world)

In [42]: response.css("div.quote>span").extract_first()
Out[42]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [43]: response.css("div.quote>span::text").extract_first()
Out[43]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

3.4. CSS+XPATH

In [46]: response.css("div.quote").xpath('//span/text()').extract_first()
Out[46]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
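One caveat worth noting (my addition, based on Scrapy's documented behaviour for relative XPaths): inside a chained .xpath() call, an expression that begins with // is evaluated against the whole document, not just the div.quote selection; the line above returns the expected text only because the first matching span in the document happens to be the quote text. To keep the query relative to the selection, start it with a dot:

response.css("div.quote").xpath('.//span/text()').extract_first()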

Extracting quote and author:
In [26]: for quote in response.css('div.quote'):
    ...:     text = quote.css("span.text::text").extract_first()
    ...:     author = quote.css("small.author::text").extract_first()
    ...:     print(text, author)
    ...:
    ...:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.” J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” Jane Austen
“Imperfection is beauty, madness is genius and it’s better to be absolutely ridiculous than absolutely boring.” Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.” Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.” André Gide
“I have not failed. I’ve just found 10,000 ways that won’t work.” Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it’s in hot water.” Eleanor Roosevelt
“A day without sunshine is like, you know, night.” Steve Martin

3.5. Extracting data in the spider
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
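If you want to collect the yielded items in a file instead of reading them off the log, the crawl command's -o option (a standard Scrapy feature, not used in the original notes) writes them out as a feed, e.g.:

scrapy crawl quotes -o quotes.json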

The log output contains a section like this, which is the scraped content:
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss'}
2018-04-06 13:47:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams'}
2018-04-06 13:47:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel'}
2018-04-06 13:47:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche'}
2018-04-06 13:47:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain'}
2018-04-06 13:47:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders'}

  4. Setting up Eclipse + PyDev + Scrapy

Step 1: Create the project with Scrapy, as described in the section "Creating a Scrapy project".
Step 2: In Eclipse, create a new folder named Pydev_Scrapy.

Then drag the Scrapy project created in Step 1 into the Pydev_Scrapy folder.

Step 3: Create a main.py file under the Pydev_Scrapy\tutorial\tutorial directory.

main.py:
import scrapy.cmdline

if __name__ == '__main__':
    scrapy.cmdline.execute(argv=['scrapy', 'crawl', 'quotes'])
Step 4: Run it.

This makes debugging much more convenient.

  5. Crawling images with Eclipse + PyDev + Scrapy
    5.1. Create the project

  1. Create a new crawl_img folder.

  2. From the command line, go into the crawl_img folder and run:
    D:\workspace\WebScrapting\crawl_img>scrapy startproject crawimgs

  3. Create a main.py file under the WebScrapting\crawl_img\crawimgs\crawimgs\ directory:

import scrapy.cmdline

if __name__ == "__main__":
    scrapy.cmdline.execute(argv="scrapy crawl crawimgs".split())

5.2. Page analysis
URL: https://blog.csdn.net/Y2c8YpZC15p/article/details/79562929
The XPath Chrome gives for one image: //*[@id="js_content"]/p[7]/img
For another image, the XPath is: //*[@id="js_content"]/p[8]/img
So the XPath covering all of them is: //*[@id="js_content"]/p/img

Debug it in the shell:
scrapy shell https://blog.csdn.net/Y2c8YpZC15p/article/details/79562929

In [10]: img_xpath = '//*[@id="js_content"]/p/img/@src'

In [11]: response.xpath(img_xpath).extract()
Out[11]:
['https://img-blog.csdnimg.cn/img_convert/055d1f744fe59ac7fb003d5d9351777f.png;wx_lazy=1',
'https://img-blog.csdnimg.cn/img_convert/c6da8bcf22436ddfeb1b0f9932900fa9.png;wxfrom=5&wx_lazy=1',
'https://img-blog.csdnimg.cn/img_convert/883cc5c6bbe2a366b4650d684a4e2218.png',
'https://img-blog.csdnimg.cn/img_convert/775cf31ac82b802682d4adbaa298751b.png',
'https://img-blog.csdnimg.cn/img_convert/6bdfe067edf654442aeee2fad2371f17.png',
'https://img-blog.csdnimg.cn/img_convert/28f39a41b725525d46be67f8005e808a.png',
'https://img-blog.csdnimg.cn/img_convert/b969055e65360b7b1664745b9e8e08f2.png',
'https://img-blog.csdnimg.cn/img_convert/4a3d527bb2e1e888f29ae1fa0fd81733.png',
'https://img-blog.csdnimg.cn/img_convert/780d50945c6a708dfce0d7e3cab90131.png',
'https://img-blog.csdnimg.cn/img_convert/f6be87973dbc35d78ae9432c09f8095d.png',
'https://img-blog.csdnimg.cn/img_convert/9d5a6c46a10973966afb7d385a12aa8d.png',
'https://img-blog.csdnimg.cn/img_convert/8e933fb03ddc38aec9f0e3223eded96b.png']
We got the image links we wanted.
5.3. Saving the images
To save an image we first open its URL, then download the bytes and write them to a local file. Use the urllib library to open the image link:
In [12]: img0 = urllib.request.urlopen(response.xpath(img_xpath).extract()[0])

In [13]: type(img0)
Out[13]: http.client.HTTPResponse
This opens the image's URL. Running dir(img0) shows that it has a read() method; use help to see how read() works:
In [54]: help(img0.read)
Help on method read in module http.client:

read(amt=None) method of http.client.HTTPResponse instance
    Read and return up to n bytes.

    If the argument is omitted, None, or negative, reads and
    returns all data until EOF.

    If the argument is positive, and the underlying raw stream is
    not 'interactive', multiple raw reads may be issued to satisfy
    the byte count (unless EOF is reached first).  But for
    interactive raw streams (as well as sockets and pipes), at most
    one raw read will be issued, and a short result does not imply
    that EOF is imminent.

    Returns an empty bytes object on EOF.

    Returns None if the underlying raw stream was open in non-blocking
    mode and no data is available at the moment.

So a plain read() call with no argument reads the whole image; all that is left is to write the bytes we read to a local file.
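As a minimal standalone check (my own sketch, using the first link from the list above), the download-and-save step looks like this:

import urllib.request

url = 'https://img-blog.csdnimg.cn/img_convert/055d1f744fe59ac7fb003d5d9351777f.png;wx_lazy=1'
data = urllib.request.urlopen(url).read()   # read the whole image as bytes
with open('test.png', 'wb') as f:           # write the bytes to a local file
    f.write(data)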
In the spider, open each image link with urlopen inside parse():
def parse(self, response):
    img_xpath = '//*[@id="js_content"]/p/img/@src'
    imgs = response.xpath(img_xpath).extract()

    index = 1
    for img in imgs:
        if img is not None:
            result = urllib.request.urlopen(img)

Save the image file locally:

def save_image(self, response, fname):
    if response is not None:
        with open(fname, 'wb') as wf:
            wf.write(response.read())
        print("save image %s done!" % fname)
Create a new imgs_spider.py file under the WebScrapting\crawl_img\crawimgs\crawimgs\spiders directory:

# encoding=utf-8

import scrapy
import urllib.request

class Imgs_Spider(scrapy.Spider):
    name = 'crawimgs'

    def start_requests(self):
        urls = ["https://blog.csdn.net/Y2c8YpZC15p/article/details/79562929"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def save_image(self, response, fname):
        if response is not None:
            with open(fname, 'wb') as wf:
                wf.write(response.read())
            print("save image %s done!" % fname)

    def parse(self, response):
        img_xpath = '//*[@id="js_content"]/p/img/@src'
        imgs = response.xpath(img_xpath).extract()
        index = 1
        for img in imgs:
            if img is not None:
                result = urllib.request.urlopen(img)
                self.save_image(result, "imgs/img_{0}.png".format(index))
                index += 1

Create an imgs directory under WebScrapting\crawl_img\crawimgs\crawimgs to hold the downloaded images.
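If you prefer not to create the directory by hand, a small addition (my own tweak, not in the original code) at the start of parse() creates it when it is missing:

import os

os.makedirs("imgs", exist_ok=True)  # create the output directory if it does not already exist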
5.4. Results

  6. User-Agent
    Crawl the images on http://www.meizitu.com/a/5582.html.

D:\> scrapy shell http://www.meizitu.com/a/5582.html

In [1]: xpath = '//*[@id="picture"]/p/img[1]'

In [2]: img = response.xpath('//*[@id="picture"]/p/img[1]/@src').extract()

In [3]: img[0]
Out[3]: 'http://mm.chinasareview.com/wp-content/uploads/2017a/07/18/01.jpg'
In [20]: import urllib
In [22]: urllib.request.urlopen(img[0])

HTTPError                                 Traceback (most recent call last)
<ipython-input-22-...> in <module>()
----> 1 urllib.request.urlopen(img[0])

c:\python35\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    161     else:
    162         opener = _opener
--> 163     return opener.open(url, data, timeout)
    164
    165 def install_opener(opener):

c:\python35\lib\urllib\request.py in open(self, fullurl, data, timeout)
    470         for processor in self.process_response.get(protocol, []):
    471             meth = getattr(processor, meth_name)
--> 472             response = meth(req, response)
    473
    474         return response

c:\python35\lib\urllib\request.py in http_response(self, request, response)
    580         if not (200 <= code < 300):
    581             response = self.parent.error(
--> 582                 'http', request, response, code, msg, hdrs)
    583
    584         return response

c:\python35\lib\urllib\request.py in error(self, proto, *args)
    508         if http_err:
    509             args = (dict, 'default', 'http_error_default') + orig_args
--> 510             return self._call_chain(*args)
    511
    512 # XXX probably also want an abstract factory that knows when it makes

c:\python35\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    442         for handler in handlers:
    443             func = getattr(handler, meth_name)
--> 444             result = func(*args)
    445             if result is not None:
    446                 return result

c:\python35\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    588 class HTTPDefaultErrorHandler(BaseHandler):
    589     def http_error_default(self, req, fp, code, msg, hdrs):
--> 590         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    591
    592 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

urlopen raised an error. From the message:
HTTPError: HTTP Error 403: Forbidden
we can see that the site refuses direct image downloads: it has anti-scraping measures in place. In this case we need to send a browser-like User-Agent header when fetching the image.

In [40]: user_agent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"

In [41]: req = urllib.request.Request(img[0], headers={'User-Agent': user_agent})

In [42]: resp = urllib.request.urlopen(req)

In [43]: with open('fuck.png', 'wb') as wf:
    ...:     wf.write(resp.read())
    ...:
That saves the image!

Complete code:
import scrapy
import urllib.request

class UserAgent_Spider(scrapy.Spider):
    name = 'useragent_test'

    def start_requests(self):
        urls = ['http://www.meizitu.com/a/5582.html']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        headers = {}
        user_agent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
        headers['User-Agent'] = user_agent
        xpath = '//*[@id="picture"]/p/img/@src'
        imgs = response.xpath(xpath).extract()
        for img in imgs:
            req = urllib.request.Request(img, headers=headers)
            resp = urllib.request.urlopen(req)
            fname = "images/" + img.split('/')[-1]
            with open(fname, 'wb') as wf:
                wf.write(resp.read())
            print('save', fname, 'done!')
        print('all done!!!')
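As an aside (my own suggestion, not part of the original notes), the browser-like header can also be sent by Scrapy itself instead of urllib: either globally in settings.py,

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'

or per request,

yield scrapy.Request(url=url, callback=self.parse, headers={'User-Agent': user_agent})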

The images that were crawled:

  7. Scheme
    Consider this piece of code:

class JdSpider(CrawlSpider):
    name = "JDSpider"
    redis_key = "JDSpider:start_urls"
    start_urls = ["http://book.jd.com/booktop/0-0-0.html?category=1713-0-0-0-10001-1#comfort"]

    def parse(self, response):
        item = JdspiderItem()
        selector = Selector(response)
        Books = selector.xpath('/html/body/div[8]/div[2]/div[3]/div/ul/li')
        for each in Books:
            num = each.xpath('div[@class="p-num"]/text()').extract()
            bookName = each.xpath('div[@class="p-detail"]/a/text()').extract()
            author = each.xpath('div[@class="p-detail"]/dl[1]/dd/a[1]/text()').extract()
            press = each.xpath('div[@class="p-detail"]/dl[2]/dd/a/text()').extract()

            temphref = each.xpath('div[@class="p-detail"]/a/@href').extract()
            temphref = str(temphref)
            BookID = str(re.search('com/(.*?)\.html', temphref).group(1))
            json_url = 'http://p.3.cn/prices/mgets?skuIds=J_' + BookID
            r = requests.get(json_url).text
            data = json.loads(r)[0]
            price = data['m']
            PreferentialPrice = data['p']
            item['number'] = num
            item['bookName'] = bookName
            item['author'] = author
            item['press'] = press
            item['BookID'] = BookID
            item['price'] = price
            item['PreferentialPrice'] = PreferentialPrice
            yield item

        nextLink = selector.xpath('/html/body/div[8]/div[2]/div[4]/div/div/span/a[7]/@href').extract()
        if nextLink:
            nextLink = nextLink[0]
            print('type of nextLink: ', type(nextLink))
            print(nextLink)
            yield Request(nextLink, callback=self.parse)
    

Running it produces an error:
2018-04-07 14:37:00 [scrapy.core.scraper] ERROR: Spider error processing <GET http://book.jd.com/booktop/0-0-0.html?category=1713-0-0-0-10001-1#comfort> (referer: None)
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "C:\Python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\workspace\Python\WebScrapping\Spider-master\JDSpider\JDSpider\spiders\JD_Spider.py", line 52, in parse
    yield Request(nextLink,callback=self.parse)
  File "C:\Python35\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "C:\Python35\lib\site-packages\scrapy\http\request\__init__.py", line 62, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: //book.jd.com/booktop/1713-0-0-0-10001-2.html#comfort
The error is raised here:
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: //book.jd.com/booktop/1713-0-0-0-10001-2.html#comfort
The URL passed to Request is missing the http scheme. Change the line to:
nextLink = "http:" + nextLink[0]
and the problem goes away.
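A more general fix (my own suggestion, not from the original notes) is to let Scrapy resolve the link against the page's URL with response.urljoin(), which also handles scheme-relative and relative links:

nextLink = response.urljoin(nextLink[0])   # '//book.jd.com/...' -> 'http://book.jd.com/...'
yield Request(nextLink, callback=self.parse)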
Some background on schemes:
The general URI syntax is
scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
Examples of popular schemes include http(s), ftp, mailto, file, data, and irc. There are also terms like about or about:blank we are somewhat familiar with.

                    hierarchical part
      ┌─────────────────────┴───────────────────────┐
                    authority                  path
      ┌────────────────┴───────────────┐  ┌─────┴────┐
abc://username:password@example.com:123/path/data?key=value&key2=value2#fragid1
└┬┘   └───────┬───────┘ └────┬────┘ └┬┘           └─────────┬─────────┘ └──┬──┘
scheme user information     host    port                  query         fragment

urn:example:mammal:monotreme:echidna
└┬┘ └──────────────┬───────────────┘
scheme            path
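To see why the URL in the error above has no scheme, a quick check with the standard library (my own illustration) shows that urlparse reports an empty scheme for a protocol-relative URL:

from urllib.parse import urlparse

print(urlparse("//book.jd.com/booktop/1713-0-0-0-10001-2.html#comfort").scheme)       # '' (no scheme)
print(urlparse("http://book.jd.com/booktop/1713-0-0-0-10001-2.html#comfort").scheme)  # 'http'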

  8. Cookie
    8.1. How to view cookies in a browser
    8.1.1. Sogou browser

What we need are the values in the name and value columns. Select them all and copy:

Contents:
lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068 1523084377 .jianshu.com / Session 50
Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068 1523084377 .jianshu.com / 2019-04-07T06:59:37.000Z 49
_m7e_session 153d9f2e9183c0cf7b13519a686bf697 .www.jianshu.com / 2018-04-07T13:01:06.367Z 44 ? ?
default_font font2 .www.jianshu.com / Session 17
locale zh-CN .www.jianshu.com / Session 11
read_mode day .www.jianshu.com / Session 12
sensorsdata2015jssdkcross %7B%22distinct_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22%24device_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22http%3A%2F%2Fcn.bing.com%2Fsearch%3Fq%3Dscrapy%2Bcookie%26go%3D%25E6%2590%259C%25E7%25B4%25A2%26qs%3Dn%26form%3DQBRE%26sp%3D-1%26pq%3Dscrapy%2Bcookie%26sc%3D8-13%26sk%3D%26cvid%3DCABB19BFE0154CD7BB72105B6C402A66%22%2C%22%24latest_referrer_host%22%3A%22cn.bing.com%22%2C%22%24latest_search_keyword%22%3A%22scrapy%20cookie%22%2C%22%24latest_utm_source%22%3A%22weixin_timeline%22%2C%22%24latest_utm_medium%22%3A%22reader_share%22%2C%22%24latest_utm_campaign%22%3A%22haruki%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%7D .jianshu.com / 2218-02-18T06:59:37.000Z 878
signin_redirect https%3A%2F%2Fwww.jianshu.com%2Fp%2F887af1ab4200 .www.jianshu.com / Session 63
We need a bit of Python to convert the name and value columns into the dict that Scrapy expects:
Code:

# -*- coding: utf-8 -*-

class transCookie:
    def __init__(self, cookie):
        self.cookie = {}
        cookies = cookie.split('\n')
        for cook in cookies:
            if cook.strip() != '':
                cook = cook.split()
                if len(cook) >= 2:
                    key, value = cook[0:2]
                    self.cookie[key] = value

    def stringToDict(self):
        return self.cookie

if __name__ == "__main__":
    cookie = '''
lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068 1523084377 .jianshu.com / Session 50
Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068 1523084377 .jianshu.com / 2019-04-07T06:59:37.000Z 49
_m7e_session 153d9f2e9183c0cf7b13519a686bf697 .www.jianshu.com / 2018-04-07T13:01:06.367Z 44 ? ?
default_font font2 .www.jianshu.com / Session 17
locale zh-CN .www.jianshu.com / Session 11
read_mode day .www.jianshu.com / Session 12
sensorsdata2015jssdkcross %7B%22distinct_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22%24device_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22http%3A%2F%2Fcn.bing.com%2Fsearch%3Fq%3Dscrapy%2Bcookie%26go%3D%25E6%2590%259C%25E7%25B4%25A2%26qs%3Dn%26form%3DQBRE%26sp%3D-1%26pq%3Dscrapy%2Bcookie%26sc%3D8-13%26sk%3D%26cvid%3DCABB19BFE0154CD7BB72105B6C402A66%22%2C%22%24latest_referrer_host%22%3A%22cn.bing.com%22%2C%22%24latest_search_keyword%22%3A%22scrapy%20cookie%22%2C%22%24latest_utm_source%22%3A%22weixin_timeline%22%2C%22%24latest_utm_medium%22%3A%22reader_share%22%2C%22%24latest_utm_campaign%22%3A%22haruki%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%7D .jianshu.com / 2218-02-18T06:59:37.000Z 878
signin_redirect https%3A%2F%2Fwww.jianshu.com%2Fp%2F887af1ab4200 .www.jianshu.com / Session 63
    '''
    trans = transCookie(cookie)
    print(trans.stringToDict())

The resulting cookie dict is:
{'sensorsdata2015jssdkcross': '%7B%22distinct_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22%24device_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22http%3A%2F%2Fcn.bing.com%2Fsearch%3Fq%3Dscrapy%2Bcookie%26go%3D%25E6%2590%259C%25E7%25B4%25A2%26qs%3Dn%26form%3DQBRE%26sp%3D-1%26pq%3Dscrapy%2Bcookie%26sc%3D8-13%26sk%3D%26cvid%3DCABB19BFE0154CD7BB72105B6C402A66%22%2C%22%24latest_referrer_host%22%3A%22cn.bing.com%22%2C%22%24latest_search_keyword%22%3A%22scrapy%20cookie%22%2C%22%24latest_utm_source%22%3A%22weixin_timeline%22%2C%22%24latest_utm_medium%22%3A%22reader_share%22%2C%22%24latest_utm_campaign%22%3A%22haruki%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%7D', 'signin_redirect': 'https%3A%2F%2Fwww.jianshu.com%2Fp%2F887af1ab4200', 'read_mode': 'day', 'Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068': '1523084377', '_m7e_session': '153d9f2e9183c0cf7b13519a686bf697', 'locale': 'zh-CN', 'lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068': '1523084377', 'default_font': 'font2'}
8.1.2. Chrome browser

8.2. Accessing JD with a cookie
Use the cookie to access the user's account information; here we scrape the user name.
# encoding=utf-8

import scrapy

class Cookie_Spider(scrapy.Spider):

    name = "cookie"

    def start_requests(self):
        urls = ["https://i.jd.com/user/info"]
        cookie = {}  # your cookie
        headers = {
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'
        }
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=headers, cookies=cookie)

    def parse(self, response):
        username_label = response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[1]/span').extract()
        username = response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[1]/div/div/strong').extract()
        print(username_label, ':', username)
