Scrapy Beginner's Self-Study Notes

  1. Setting up the Scrapy environment
    Install Scrapy:
    pip install scrapy
    Install pywin32 (needed on Windows):
    D:\>pip install pywin32
    Collecting pywin32
    Using cached pywin32-223-cp35-cp35m-win32.whl
    Installing collected packages: pywin32
    Successfully installed pywin32-223
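    You can double-check the installation with:
    D:\>scrapy version
    which should print the installed Scrapy version (1.4.0 in the sessions recorded later in these notes).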
  2. Creating a Scrapy project
    2.1. Create the project
    D:\tmp>scrapy startproject tutorial
    New Scrapy project 'tutorial', using template directory 'D:\ProgramFiles\Python35\lib\site-packages\scrapy\templates\project', created in:
    D:\tmp\tutorial

You can start your first spider with:
cd tutorial
scrapy genspider example example.com

D:\tmp\tutorial>tree /F
Folder PATH listing for volume NewDisk
Volume serial number is CC68-7CC0
D:.
│   scrapy.cfg              # deploy configuration file
│
└─tutorial                  # project's module, you'll import your code from here
    │   items.py            # project items definition file
    │   middlewares.py      # project middlewares file
    │   pipelines.py        # project pipelines file
    │   settings.py         # project settings file
    │   __init__.py
    │
    ├─spiders               # a directory where you'll later put your spiders
    │   │   __init__.py
    │   │
    │   └─__pycache__
    └─__pycache__

2.2. Add a spider
Add a quotes_spider.py file under the spiders directory:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # unique name; must not be repeated within the same project

    def start_requests(self):  # must return an iterable of Request objects
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
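As a side note (a standard Scrapy shortcut, not used in the original notes): instead of overriding start_requests(), a spider can declare a start_urls class attribute and Scrapy generates the initial requests itself, using parse() as the default callback:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # identical to the parse() above: save each page to a local HTML file
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)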

2.3. Run
Run: scrapy crawl quotes
Result: the spider downloads both pages and saves them as quotes-1.html and quotes-2.html in the directory the command was run from.

  3. What happens behind the scenes?
    3.1. Debugging
    At the command line, run:
    scrapy shell 'http://quotes.toscrape.com/page/1/'

Output:
D:>scrapy shell http://quotes.toscrape.com/page/1/
2018-04-06 09:55:59 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2018-04-06 09:55:59 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_
2018-04-06 09:55:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-06 09:56:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-06 09:56:00 [scrapy.core.engine] INFO: Spider opened
2018-04-06 09:56:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000000002AE83C8>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x00000000054A4550>
[s] spider <DefaultSpider 'default' at 0x6682e10>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:

In [2]: request
Out[2]: <GET http://quotes.toscrape.com/page/1/>
In [3]: response
Out[3]: <200 http://quotes.toscrape.com/page/1/>

3.2. XPath
XPath syntax
Open http://quotes.toscrape.com/page/1/ in Chrome, right-click the first quote -> Inspect.

The XPath obtained this way (an absolute path):
/html/body/div/div[2]/div[1]/div[1]/span[1]
Run the commands:
In [7]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]')
Out[7]: []

In [8]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]').extract()
Out[8]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

In [9]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').extract()
Out[9]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
(Note: the markup was stripped from the outputs shown above. In [7] actually returns a list of Selector objects and In [8] returns the whole <span> element as a string; only In [9], which asks for /text(), returns just the quote text.)
Using a relative path
Analyze this fragment of the page:

Its content is:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
From this we get a relative XPath:
//div[@class="quote"]/span[@itemprop="text"]/text()

In [10]:
response.xpath('//div[@class="quote"]/span[@itemprop="text"]/text()').extract()
Out[10]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
'“Try not to become a man of success. Rather become a man of value.”',
'“It is better to be hated for what you are than to be loved for what you are not.”',
"“I have not failed. I've just found 10,000 ways that won't work.”",
"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
'“A day without sunshine is like, you know, night.”']
Note that we got 10 items here: with a relative path, the XPath expression matches every quote block on the page, so there are 10 results.
If you only want the first one, you can do this:
In [11]:
response.xpath('//div[@class="quote"]/span[@itemprop="text"]/text()').extract_first()
Out[11]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
Or:
In [12]: results = response.xpath('//div[@class="quote"]/span[@itemprop="text"]/text()').extract()

In [13]: results[0]
Out[13]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

3.3. Another approach: CSS

The full CSS path obtained from Chrome:
body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span.text

In [14]: response.css('body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span.text').extract()
Out[14]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
But .extract() here returns the whole matched <span> element rather than the bare text, so modify the selector to ask for the text node:
body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span::text

In [17]: response.css('body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span::text').extract()
Out[17]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'by ',
'\n ',
'\n ']
Now we get the content we need, but with 3 extra entries. Why? Because span::text takes the text nodes of every <span> inside the selected quote block, and the second <span> (the one holding "by Albert Einstein (about)") contributes its own stray text nodes ('by ' and the surrounding whitespace).
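If you only want the quote text, a narrower selector that targets the span with class text avoids the extra entries altogether (my own variant; in effect it is the CSS equivalent of the itemprop XPath used earlier):

response.css('div.quote > span.text::text').extract_first()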
To take just the first entry:
In [18]: response.css('body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span::text').extract_first()
Out[18]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

Relative CSS selectors:
In [41]: response.css("div.quote").extract_first()
Out[41]: '<div class="quote" ...> … </div>'  (the full HTML of the first quote block: the quote text, "by Albert Einstein (about)", and its tags — change, deep-thoughts, thinking, world)

In [42]: response.css("div.quote>span").extract_first()
Out[42]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [43]: response.css("div.quote>span::text").extract_first()
Out[43]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

3.4. CSS+XPATH

In [46]: response.css("div.quote").xpath('//span/text()').extract_first()
Out[46]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
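One caveat worth noting (my addition, based on Scrapy's documented behaviour for relative XPaths): inside a chained .xpath() call, an expression that begins with // is evaluated against the whole document, not just the div.quote selection; the line above returns the expected text only because the first matching span in the document happens to be the quote text. To keep the query relative to the selection, start it with a dot:

response.css("div.quote").xpath('.//span/text()').extract_first()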

Extracting quote and author:
In [26]: for quote in response.css('div.quote'):
    ...:     text = quote.css("span.text::text").extract_first()
    ...:     author = quote.css("small.author::text").extract_first()
    ...:     print(text, author)
    ...:
    ...:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.” J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” Jane Austen
“Imperfection is beauty, madness is genius and it’s better to be absolutely ridiculous than absolutely boring.” Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.” Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.” André Gide
“I have not failed. I’ve just found 10,000 ways that won’t work.” Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it’s in hot water.” Eleanor Roosevelt
“A day without sunshine is like, you know, night.” Steve Martin

3.5. Extracting data in the spider
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
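If you want to collect the yielded items in a file instead of reading them off the log, the crawl command's -o option (a standard Scrapy feature, not used in the original notes) writes them out as a feed, e.g.:

scrapy crawl quotes -o quotes.json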

The log output contains a section like this, which is the scraped content:
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss'}
2018-04-06 13:47:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams'}
2018-04-06 13:47:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel'}
2018-04-06 13:47:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche'}
2018-04-06 13:47:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain'}
2018-04-06 13:47:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders'}

  4. Setting up Eclipse + PyDev + Scrapy

Step 1: Create the project with Scrapy, as described in the section "Creating a Scrapy project".
Step 2: In Eclipse, create a new folder named Pydev_Scrapy.

Then drag the Scrapy project created in Step 1 into the Pydev_Scrapy folder.

Step 3: Create a main.py file under the Pydev_Scrapy\tutorial\tutorial directory.

main.py:
import scrapy.cmdline

if __name__ == '__main__':
    scrapy.cmdline.execute(argv=['scrapy', 'crawl', 'quotes'])
Step 4: Run it.

This makes debugging much more convenient.

  5. Crawling images with Eclipse + PyDev + Scrapy
    5.1. Create the project

  1. Create a new crawl_img folder.

  2. From the command line, go into the crawl_img folder and run:
    D:\workspace\WebScrapting\crawl_img>scrapy startproject crawimgs

  3. Create a main.py file under the WebScrapting\crawl_img\crawimgs\crawimgs\ directory:

import scrapy.cmdline

if __name__ == "__main__":
    scrapy.cmdline.execute(argv="scrapy crawl crawimgs".split())

5.2. Page analysis
URL: https://blog.csdn.net/Y2c8YpZC15p/article/details/79562929
The XPath Chrome gives for one image: //*[@id="js_content"]/p[7]/img
For another image, the XPath is: //*[@id="js_content"]/p[8]/img
So the XPath covering all of them is: //*[@id="js_content"]/p/img

Debug it in the shell:
scrapy shell https://blog.csdn.net/Y2c8YpZC15p/article/details/79562929

In [10]: img_xpath = '//*[@id="js_content"]/p/img/@src'

In [11]: response.xpath(img_xpath).extract()
Out[11]:
['https://img-blog.csdnimg.cn/img_convert/055d1f744fe59ac7fb003d5d9351777f.png;wx_lazy=1',
'https://img-blog.csdnimg.cn/img_convert/c6da8bcf22436ddfeb1b0f9932900fa9.png;wxfrom=5&wx_lazy=1',
'https://img-blog.csdnimg.cn/img_convert/883cc5c6bbe2a366b4650d684a4e2218.png',
'https://img-blog.csdnimg.cn/img_convert/775cf31ac82b802682d4adbaa298751b.png',
'https://img-blog.csdnimg.cn/img_convert/6bdfe067edf654442aeee2fad2371f17.png',
'https://img-blog.csdnimg.cn/img_convert/28f39a41b725525d46be67f8005e808a.png',
'https://img-blog.csdnimg.cn/img_convert/b969055e65360b7b1664745b9e8e08f2.png',
'https://img-blog.csdnimg.cn/img_convert/4a3d527bb2e1e888f29ae1fa0fd81733.png',
'https://img-blog.csdnimg.cn/img_convert/780d50945c6a708dfce0d7e3cab90131.png',
'https://img-blog.csdnimg.cn/img_convert/f6be87973dbc35d78ae9432c09f8095d.png',
'https://img-blog.csdnimg.cn/img_convert/9d5a6c46a10973966afb7d385a12aa8d.png',
'https://img-blog.csdnimg.cn/img_convert/8e933fb03ddc38aec9f0e3223eded96b.png']
We got the image links we wanted.
5.3. Saving the images
To save an image we first open its URL, then download the bytes and write them to a local file. Use the urllib library to open the image link:
In [12]: img0 = urllib.request.urlopen(response.xpath(img_xpath).extract()[0])

In [13]: type(img0)
Out[13]: http.client.HTTPResponse
This opens the image's URL. Running dir(img0) shows that it has a read() method; use help to see how read() works:
In [54]: help(img0.read)
Help on method read in module http.client:

read(amt=None) method of http.client.HTTPResponse instance
    Read and return up to n bytes.

    If the argument is omitted, None, or negative, reads and
    returns all data until EOF.

    If the argument is positive, and the underlying raw stream is
    not 'interactive', multiple raw reads may be issued to satisfy
    the byte count (unless EOF is reached first).  But for
    interactive raw streams (as well as sockets and pipes), at most
    one raw read will be issued, and a short result does not imply
    that EOF is imminent.

    Returns an empty bytes object on EOF.

    Returns None if the underlying raw stream was open in non-blocking
    mode and no data is available at the moment.

So a plain read() call with no argument reads the whole image; all that is left is to write the bytes we read to a local file.
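As a minimal standalone check (my own sketch, using the first link from the list above), the download-and-save step looks like this:

import urllib.request

url = 'https://img-blog.csdnimg.cn/img_convert/055d1f744fe59ac7fb003d5d9351777f.png;wx_lazy=1'
data = urllib.request.urlopen(url).read()   # read the whole image as bytes
with open('test.png', 'wb') as f:           # write the bytes to a local file
    f.write(data)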
In the spider, open each image link with urlopen inside parse():
def parse(self, response):
    img_xpath = '//*[@id="js_content"]/p/img/@src'
    imgs = response.xpath(img_xpath).extract()

    index = 1
    for img in imgs:
        if img is not None:
            result = urllib.request.urlopen(img)

Save the image file locally:

def save_image(self, response, fname):
    if response is not None:
        with open(fname, 'wb') as wf:
            wf.write(response.read())
        print("save image %s done!" % fname)
Create a new imgs_spider.py file under the WebScrapting\crawl_img\crawimgs\crawimgs\spiders directory:

# encoding=utf-8

import scrapy
import urllib.request

class Imgs_Spider(scrapy.Spider):
    name = 'crawimgs'

    def start_requests(self):
        urls = ["https://blog.csdn.net/Y2c8YpZC15p/article/details/79562929"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def save_image(self, response, fname):
        if response is not None:
            with open(fname, 'wb') as wf:
                wf.write(response.read())
            print("save image %s done!" % fname)

    def parse(self, response):
        img_xpath = '//*[@id="js_content"]/p/img/@src'
        imgs = response.xpath(img_xpath).extract()
        index = 1
        for img in imgs:
            if img is not None:
                result = urllib.request.urlopen(img)
                self.save_image(result, "imgs/img_{0}.png".format(index))
                index += 1

Create an imgs directory under WebScrapting\crawl_img\crawimgs\crawimgs to hold the downloaded images.
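If you prefer not to create the directory by hand, a small addition (my own tweak, not in the original code) at the start of parse() creates it when it is missing:

import os

os.makedirs("imgs", exist_ok=True)  # create the output directory if it does not already exist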
5.4. Results

  6. User-Agent
    Crawl the images on http://www.meizitu.com/a/5582.html.

D:\> scrapy shell http://www.meizitu.com/a/5582.html

In [1]: xpath = '//*[@id="picture"]/p/img[1]'

In [2]: img = response.xpath('//*[@id="picture"]/p/img[1]/@src').extract()

In [3]: img[0]
Out[3]: 'http://mm.chinasareview.com/wp-content/uploads/2017a/07/18/01.jpg'
In [20]: import urllib
In [22]: urllib.request.urlopen(img[0])

HTTPError                                 Traceback (most recent call last)
<ipython-input-22-...> in <module>()
----> 1 urllib.request.urlopen(img[0])

c:\python35\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    161     else:
    162         opener = _opener
--> 163     return opener.open(url, data, timeout)
    164
    165 def install_opener(opener):

c:\python35\lib\urllib\request.py in open(self, fullurl, data, timeout)
    470         for processor in self.process_response.get(protocol, []):
    471             meth = getattr(processor, meth_name)
--> 472             response = meth(req, response)
    473
    474         return response

c:\python35\lib\urllib\request.py in http_response(self, request, response)
    580         if not (200 <= code < 300):
    581             response = self.parent.error(
--> 582                 'http', request, response, code, msg, hdrs)
    583
    584         return response

c:\python35\lib\urllib\request.py in error(self, proto, *args)
    508         if http_err:
    509             args = (dict, 'default', 'http_error_default') + orig_args
--> 510             return self._call_chain(*args)
    511
    512 # XXX probably also want an abstract factory that knows when it makes

c:\python35\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    442         for handler in handlers:
    443             func = getattr(handler, meth_name)
--> 444             result = func(*args)
    445             if result is not None:
    446                 return result

c:\python35\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    588 class HTTPDefaultErrorHandler(BaseHandler):
    589     def http_error_default(self, req, fp, code, msg, hdrs):
--> 590         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    591
    592 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

urlopen raised an error. From the message:
HTTPError: HTTP Error 403: Forbidden
we can see that the site refuses direct image downloads: it has anti-scraping measures in place. In this case we need to send a browser-like User-Agent header when fetching the image.

In [40]: user_agent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"

In [41]: req = urllib.request.Request(img[0], headers={'User-Agent': user_agent})

In [42]: resp = urllib.request.urlopen(req)

In [43]: with open('fuck.png', 'wb') as wf:
    ...:     wf.write(resp.read())
    ...:
That saves the image!

Complete code:
import scrapy
import urllib.request

class UserAgent_Spider(scrapy.Spider):
    name = 'useragent_test'

    def start_requests(self):
        urls = ['http://www.meizitu.com/a/5582.html']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        headers = {}
        user_agent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
        headers['User-Agent'] = user_agent
        xpath = '//*[@id="picture"]/p/img/@src'
        imgs = response.xpath(xpath).extract()
        for img in imgs:
            req = urllib.request.Request(img, headers=headers)
            resp = urllib.request.urlopen(req)
            fname = "images/" + img.split('/')[-1]
            with open(fname, 'wb') as wf:
                wf.write(resp.read())
            print('save', fname, 'done!')
        print('all done!!!')
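As an aside (my own suggestion, not part of the original notes), the browser-like header can also be sent by Scrapy itself instead of urllib: either globally in settings.py,

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'

or per request,

yield scrapy.Request(url=url, callback=self.parse, headers={'User-Agent': user_agent})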

The images that were crawled:

  7. Scheme
    Consider this piece of code:

class JdSpider(CrawlSpider):
    name = "JDSpider"
    redis_key = "JDSpider:start_urls"
    start_urls = ["http://book.jd.com/booktop/0-0-0.html?category=1713-0-0-0-10001-1#comfort"]

    def parse(self, response):
        item = JdspiderItem()
        selector = Selector(response)
        Books = selector.xpath('/html/body/div[8]/div[2]/div[3]/div/ul/li')
        for each in Books:
            num = each.xpath('div[@class="p-num"]/text()').extract()
            bookName = each.xpath('div[@class="p-detail"]/a/text()').extract()
            author = each.xpath('div[@class="p-detail"]/dl[1]/dd/a[1]/text()').extract()
            press = each.xpath('div[@class="p-detail"]/dl[2]/dd/a/text()').extract()

            temphref = each.xpath('div[@class="p-detail"]/a/@href').extract()
            temphref = str(temphref)
            BookID = str(re.search('com/(.*?)\.html', temphref).group(1))
            json_url = 'http://p.3.cn/prices/mgets?skuIds=J_' + BookID
            r = requests.get(json_url).text
            data = json.loads(r)[0]
            price = data['m']
            PreferentialPrice = data['p']
            item['number'] = num
            item['bookName'] = bookName
            item['author'] = author
            item['press'] = press
            item['BookID'] = BookID
            item['price'] = price
            item['PreferentialPrice'] = PreferentialPrice
            yield item

        nextLink = selector.xpath('/html/body/div[8]/div[2]/div[4]/div/div/span/a[7]/@href').extract()
        if nextLink:
            nextLink = nextLink[0]
            print('type of nextLink: ', type(nextLink))
            print(nextLink)
            yield Request(nextLink, callback=self.parse)
    

Running it produces an error:
2018-04-07 14:37:00 [scrapy.core.scraper] ERROR: Spider error processing <GET http://book.jd.com/booktop/0-0-0.html?category=1713-0-0-0-10001-1#comfort> (referer: None)
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "C:\Python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\workspace\Python\WebScrapping\Spider-master\JDSpider\JDSpider\spiders\JD_Spider.py", line 52, in parse
    yield Request(nextLink,callback=self.parse)
  File "C:\Python35\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "C:\Python35\lib\site-packages\scrapy\http\request\__init__.py", line 62, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: //book.jd.com/booktop/1713-0-0-0-10001-2.html#comfort
The error is raised here:
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: //book.jd.com/booktop/1713-0-0-0-10001-2.html#comfort
The URL passed to Request is missing the http scheme. Change the line to:
nextLink = "http:" + nextLink[0]
and the problem goes away.
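A more general fix (my own suggestion, not from the original notes) is to let Scrapy resolve the link against the page's URL with response.urljoin(), which also handles scheme-relative and relative links:

nextLink = response.urljoin(nextLink[0])   # '//book.jd.com/...' -> 'http://book.jd.com/...'
yield Request(nextLink, callback=self.parse)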
Some background on schemes:
The general URI syntax is
scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
Examples of popular schemes include http(s), ftp, mailto, file, data, and irc. There are also terms like about or about:blank we are somewhat familiar with.

                    hierarchical part
      ┌─────────────────────┴───────────────────────┐
                    authority                  path
      ┌────────────────┴───────────────┐  ┌─────┴────┐
abc://username:password@example.com:123/path/data?key=value&key2=value2#fragid1
└┬┘   └───────┬───────┘ └────┬────┘ └┬┘           └─────────┬─────────┘ └──┬──┘
scheme user information     host    port                  query         fragment

urn:example:mammal:monotreme:echidna
└┬┘ └──────────────┬───────────────┘
scheme            path
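To see why the URL in the error above has no scheme, a quick check with the standard library (my own illustration) shows that urlparse reports an empty scheme for a protocol-relative URL:

from urllib.parse import urlparse

print(urlparse("//book.jd.com/booktop/1713-0-0-0-10001-2.html#comfort").scheme)       # '' (no scheme)
print(urlparse("http://book.jd.com/booktop/1713-0-0-0-10001-2.html#comfort").scheme)  # 'http'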

  8. Cookie
    8.1. How to view cookies in a browser
    8.1.1. Sogou browser

What we need are the values in the name and value columns. Select them all and copy:

Contents:
lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068 1523084377 .jianshu.com / Session 50
Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068 1523084377 .jianshu.com / 2019-04-07T06:59:37.000Z 49
_m7e_session 153d9f2e9183c0cf7b13519a686bf697 .www.jianshu.com / 2018-04-07T13:01:06.367Z 44 ? ?
default_font font2 .www.jianshu.com / Session 17
locale zh-CN .www.jianshu.com / Session 11
read_mode day .www.jianshu.com / Session 12
sensorsdata2015jssdkcross %7B%22distinct_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22%24device_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22http%3A%2F%2Fcn.bing.com%2Fsearch%3Fq%3Dscrapy%2Bcookie%26go%3D%25E6%2590%259C%25E7%25B4%25A2%26qs%3Dn%26form%3DQBRE%26sp%3D-1%26pq%3Dscrapy%2Bcookie%26sc%3D8-13%26sk%3D%26cvid%3DCABB19BFE0154CD7BB72105B6C402A66%22%2C%22%24latest_referrer_host%22%3A%22cn.bing.com%22%2C%22%24latest_search_keyword%22%3A%22scrapy%20cookie%22%2C%22%24latest_utm_source%22%3A%22weixin_timeline%22%2C%22%24latest_utm_medium%22%3A%22reader_share%22%2C%22%24latest_utm_campaign%22%3A%22haruki%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%7D .jianshu.com / 2218-02-18T06:59:37.000Z 878
signin_redirect https%3A%2F%2Fwww.jianshu.com%2Fp%2F887af1ab4200 .www.jianshu.com / Session 63
We need a bit of Python to convert the name and value columns into the dict that Scrapy expects:
Code:

# -*- coding: utf-8 -*-

class transCookie:
    def __init__(self, cookie):
        self.cookie = {}
        cookies = cookie.split('\n')
        for cook in cookies:
            if cook.strip() != '':
                cook = cook.split()
                if len(cook) >= 2:
                    key, value = cook[0:2]
                    self.cookie[key] = value

    def stringToDict(self):
        return self.cookie

if __name__ == "__main__":
    cookie = '''
lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068 1523084377 .jianshu.com / Session 50
Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068 1523084377 .jianshu.com / 2019-04-07T06:59:37.000Z 49
_m7e_session 153d9f2e9183c0cf7b13519a686bf697 .www.jianshu.com / 2018-04-07T13:01:06.367Z 44 ? ?
default_font font2 .www.jianshu.com / Session 17
locale zh-CN .www.jianshu.com / Session 11
read_mode day .www.jianshu.com / Session 12
sensorsdata2015jssdkcross %7B%22distinct_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22%24device_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22http%3A%2F%2Fcn.bing.com%2Fsearch%3Fq%3Dscrapy%2Bcookie%26go%3D%25E6%2590%259C%25E7%25B4%25A2%26qs%3Dn%26form%3DQBRE%26sp%3D-1%26pq%3Dscrapy%2Bcookie%26sc%3D8-13%26sk%3D%26cvid%3DCABB19BFE0154CD7BB72105B6C402A66%22%2C%22%24latest_referrer_host%22%3A%22cn.bing.com%22%2C%22%24latest_search_keyword%22%3A%22scrapy%20cookie%22%2C%22%24latest_utm_source%22%3A%22weixin_timeline%22%2C%22%24latest_utm_medium%22%3A%22reader_share%22%2C%22%24latest_utm_campaign%22%3A%22haruki%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%7D .jianshu.com / 2218-02-18T06:59:37.000Z 878
signin_redirect https%3A%2F%2Fwww.jianshu.com%2Fp%2F887af1ab4200 .www.jianshu.com / Session 63
    '''
    trans = transCookie(cookie)
    print(trans.stringToDict())

The resulting cookie dict is:
{'sensorsdata2015jssdkcross': '%7B%22distinct_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22%24device_id%22%3A%22160bc41d913137-020239e6130da5-2609281c-2073600-160bc41d917366%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22http%3A%2F%2Fcn.bing.com%2Fsearch%3Fq%3Dscrapy%2Bcookie%26go%3D%25E6%2590%259C%25E7%25B4%25A2%26qs%3Dn%26form%3DQBRE%26sp%3D-1%26pq%3Dscrapy%2Bcookie%26sc%3D8-13%26sk%3D%26cvid%3DCABB19BFE0154CD7BB72105B6C402A66%22%2C%22%24latest_referrer_host%22%3A%22cn.bing.com%22%2C%22%24latest_search_keyword%22%3A%22scrapy%20cookie%22%2C%22%24latest_utm_source%22%3A%22weixin_timeline%22%2C%22%24latest_utm_medium%22%3A%22reader_share%22%2C%22%24latest_utm_campaign%22%3A%22haruki%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%7D', 'signin_redirect': 'https%3A%2F%2Fwww.jianshu.com%2Fp%2F887af1ab4200', 'read_mode': 'day', 'Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068': '1523084377', '_m7e_session': '153d9f2e9183c0cf7b13519a686bf697', 'locale': 'zh-CN', 'lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068': '1523084377', 'default_font': 'font2'}
8.1.2. Chrome browser

8.2. Accessing JD with a cookie
Use the cookie to access the user's account information; here we scrape the user name.
# encoding=utf-8

import scrapy

class Cookie_Spider(scrapy.Spider):

    name = "cookie"

    def start_requests(self):
        urls = ["https://i.jd.com/user/info"]
        cookie = {}  # your cookie
        headers = {
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'
        }
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=headers, cookies=cookie)

    def parse(self, response):
        username_label = response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[1]/span').extract()
        username = response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[1]/div/div/strong').extract()
        print(username_label, ':', username)
