Scrapy_LinkExtractor

Contents

Extracting links with LinkExtractor
Describing extraction rules for LinkExtractor
LinkExtractor constructor parameters

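Extracting links with LinkExtractor

There are two ways to extract links from a page: Selector and LinkExtractor.

Links are just another kind of data in the page, so they can be extracted with the same techniques used for any other data. When only a few links are needed, or the extraction rule is simple, a Selector is enough.

Scrapy also provides LinkExtractor, a class dedicated to link extraction. It is the more convenient choice when many links are needed or the extraction rules are complex.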

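Describing extraction rules for LinkExtractor

Import LinkExtractor from the scrapy.linkextractors module:
from scrapy.linkextractors import LinkExtractor

Create a LinkExtractor object, describing the extraction rules with one or more constructor arguments (see the parameter descriptions further down):
le = LinkExtractor(...)  # constructor arguments go here

Call the object's extract_links method with a Response object. The method applies the rules given at construction time to the page held by the Response and returns a list in which each element is a Link object, that is, one extracted link:
links = le.extract_links(response)

Use links[index] to get a Link object. Its url attribute is the absolute URL of the linked page, so there is no need to call response.urljoin:
url = links[index].url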

LinkExtractor constructor parameters

To have something concrete to experiment with, first create two HTML pages that each contain several links.

<!--example1.html-->
<!DOCTYPE html>
<html>
<head><title>LinkExtractor</title>
</head>
<body>
  <div id="top">
    <p>Some internal links</p>
    <a class="internal" href="/intro/install.html">Installation guide</a>
    <a class="internal" href="/intro/tutorial.html">Tutorial</a>
    <a class="internal" href="../examples.html">Examples</a>
  </div>
  <div id="bottom">
    <p>Some external links</p>
    <a href="http://stackoverflow.com/tags/scrapy/info">StackOverflow</a>
    <a href="https://github.com/scrapy/scrapy">Fork on Github</a>
  </div>
</body>
</html>

<!--example2.html-->
<!DOCTYPE html>
<html>
<head>
  <title>LinkExtractor</title>
  <script type="text/javascript" src="/js/app1.js"></script>
  <script type="text/javascript" src="/js/app2.js"></script>
</head>
<body>
  <a href="/home.html">Home</a>
  <a href="javascript:goToPage('/doc.html');return false">Docs</a>
  <a href="javascript:goToPage('/example.html');return false">Examples</a>
</body>
</html>

Build two Response objects from the two HTML files above:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Read the raw bytes of the two example pages
html1 = open('example1.html', 'rb').read()
html2 = open('example2.html', 'rb').read()

respons1 = HtmlResponse(url='http://example1.com', body=html1, encoding='utf-8')
respons2 = HtmlResponse(url='http://example2.com', body=html2, encoding='utf-8')

allow

Accepts a regular expression or a list of regular expressions and extracts the links whose absolute URL matches. If the parameter is empty (the default), all links are extracted.
Example: extract the links in example1.html whose path starts with /intro

pattern = r'/intro/.+\.html$'
le = LinkExtractor(allow=pattern)
links = le.extract_links(respons1)
print([link.url for link in links])
# >>> ['http://example1.com/intro/install.html', 'http://example1.com/intro/tutorial.html']
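
To see why this pattern selects only the two /intro pages, the matching can be imitated with the standard library alone: resolve each href against the page URL, then search the resulting absolute URL with the regular expression. This is a rough sketch of the idea, not Scrapy's actual implementation, and the hrefs list is copied by hand from example1.html.

```python
import re
from urllib.parse import urljoin

base_url = 'http://example1.com'
# The href values that appear in example1.html
hrefs = [
    '/intro/install.html',
    '/intro/tutorial.html',
    '../examples.html',
    'http://stackoverflow.com/tags/scrapy/info',
    'https://github.com/scrapy/scrapy',
]

pattern = re.compile(r'/intro/.+\.html$')

# The pattern is tested against absolute URLs, so relative hrefs are resolved first
absolute_urls = [urljoin(base_url, href) for href in hrefs]
allowed = [url for url in absolute_urls if pattern.search(url)]
print(allowed)
```

Because the pattern is searched for rather than anchored at the start, it does not need to spell out the scheme and host.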

deny
Accepts a regular expression or a list of regular expressions. It is the opposite of allow: links whose absolute URL matches are excluded.
Example: extract all external links in example1.html (that is, exclude the internal ones)

from urllib.parse import urlparse
pattern = '^' + urlparse(respons1.url).geturl()
print(pattern)  # ^http://example1.com
le = LinkExtractor(deny=pattern)
links = le.extract_links(respons1)
print([link.url for link in links])
# >>> ['http://stackoverflow.com/tags/scrapy/info', 'https://github.com/scrapy/scrapy']
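
One detail worth noting about the pattern above: the dot in example1.com is a regex metacharacter, so the pattern is slightly looser than it looks. A small sketch, using only the re module and a made-up lookalike URL, shows the difference that re.escape makes. Whether this matters depends on the sites being crawled.

```python
import re

base_url = 'http://example1.com'       # same value as respons1.url in the example above

loose = '^' + base_url                 # '.' matches any character
strict = '^' + re.escape(base_url)     # '.' matches only a literal dot

lookalike = 'http://example1xcom/page.html'   # hypothetical URL, for illustration only
internal = 'http://example1.com/intro/install.html'

loose_hit = bool(re.search(loose, lookalike))     # the unescaped dot matches 'x'
strict_hit = bool(re.search(strict, lookalike))   # the escaped dot does not
strict_internal = bool(re.search(strict, internal))
print(loose_hit, strict_hit, strict_internal)
```

The escaped string can be passed to deny in the same way as the original pattern.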

allow_domains
Accepts a domain name or a list of domain names and extracts only the links that point to those domains.
Example: extract all links in example1.html that point to github.com or stackoverflow.com

domains = ['github.com', 'stackoverflow.com']
le = LinkExtractor(allow_domains=domains)
links = le.extract_links(respons1)
print([link.url for link in links])
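
Domain filtering compares the host part of each URL with the given names. The sketch below imitates that check with the standard library; is_from_any_domain is a hypothetical helper written for this illustration, and the assumption that subdomains count as part of a domain reflects my understanding of Scrapy's behaviour rather than anything stated in this article.

```python
from urllib.parse import urlparse

def is_from_any_domain(url, domains):
    """Hypothetical helper: True if the URL's host is one of the domains or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return any(host == d.lower() or host.endswith('.' + d.lower()) for d in domains)

domains = ['github.com', 'stackoverflow.com']
# Absolute URLs of the links in example1.html
urls = [
    'http://example1.com/intro/install.html',
    'http://example1.com/intro/tutorial.html',
    'http://example1.com/examples.html',
    'http://stackoverflow.com/tags/scrapy/info',
    'https://github.com/scrapy/scrapy',
]

selected = [u for u in urls if is_from_any_domain(u, domains)]
print(selected)
```

A plain substring test would be wrong here: it would accept a host such as notgithub.com, which the suffix check with a leading dot rejects.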

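deny_domains
Accepts a domain name or a list of domain names. It is the opposite of allow_domains: links that point to those domains are excluded.
Example: extract the links in example1.html except those pointing to github.com

le = LinkExtractor(deny_domains='github.com')
links = le.extract_links(respons1)
print([link.url for link in links])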

restrict_xpaths
Accepts an XPath expression or a list of XPath expressions and extracts only the links inside the regions they select.
Example: extract the links under the <div id="top"> element in example1.html

le = LinkExtractor(restrict_xpaths='//div[@id="top"]')
links = le.extract_links(respons1)
print([link.url for link in links])

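restrict_css
Accepts a CSS selector or a list of CSS selectors and extracts only the links inside the regions they select.
Example: extract the links under the <div id="bottom"> element in example1.html

le = LinkExtractor(restrict_css='div#bottom')
links = le.extract_links(respons1)
print([link.url for link in links])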

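tags
Accepts a tag name (a string) or a list of tag names and extracts links from the given tags. The default is ('a', 'area').
attrs
Accepts an attribute name (a string) or a list of attribute names and extracts links from the given attributes. The default is ('href',).
Example: extract the links to the JavaScript files referenced by example2.html

le = LinkExtractor(tags='script', attrs='src')
links = le.extract_links(respons2)
print([link.url for link in links])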
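
What the two parameters select can be illustrated without Scrapy. The sketch below uses the standard library's html.parser to collect the values of chosen attributes on chosen tags; AttrCollector is a hypothetical helper written for this illustration and says nothing about how Scrapy is implemented internally.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AttrCollector(HTMLParser):
    """Hypothetical helper: records the values of chosen attributes on chosen tags."""

    def __init__(self, tags, attrs):
        super().__init__()
        self.tags = set(tags)
        self.attrs = set(attrs)
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag
        if tag in self.tags:
            for name, value in attrs:
                if name in self.attrs and value:
                    self.values.append(value)

# Shortened inline version of example2.html
html = '''<head>
<script type="text/javascript" src="/js/app1.js"></script>
<script type="text/javascript" src="/js/app2.js"></script>
</head>
<body><a href="/home.html">Home</a></body>'''

collector = AttrCollector(tags=['script'], attrs=['src'])
collector.feed(html)
script_urls = [urljoin('http://example2.com', value) for value in collector.values]
print(script_urls)
```

With tags=['a'] and attrs=['href'] the same helper would pick up /home.html instead, which mirrors the default behaviour described above.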

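process_value
Accepts a callback of the form func(value). When it is given, LinkExtractor calls it on every extracted value (for example, the href of an a element). The callback should normally return a string (the processed result), or None to discard the link.
Example: in example2.html, the href attribute of some a elements is a piece of JavaScript that contains the real URL of the linked page. These values need to be processed to recover all the real links in example2.html.

import re

def process(value):
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    # If the pattern matches, return the URL inside it; otherwise return the value unchanged
    if m:
        value = m.group(1)
    return value

le = LinkExtractor(process_value=process)
links = le.extract_links(respons2)
print([link.url for link in links])
# >>> ['http://example2.com/home.html', 'http://example2.com/doc.html', 'http://example2.com/example.html']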
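
The callback is an ordinary function, so it can be tried on the raw href values before it is handed to LinkExtractor. The sketch below does that with the three hrefs from example2.html; only_js_links is a hypothetical variant added here to show the effect of returning None.

```python
import re

def process(value):
    """Same callback as in the example above."""
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        value = m.group(1)
    return value

def only_js_links(value):
    """Hypothetical variant: keep the JavaScript links and discard everything else."""
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    return m.group(1) if m else None

# The href values of the three a elements in example2.html
hrefs = [
    "/home.html",
    "javascript:goToPage('/doc.html');return false",
    "javascript:goToPage('/example.html');return false",
]

processed = [process(h) for h in hrefs]
js_only = [only_js_links(h) for h in hrefs]
print(processed)
print(js_only)
```

The values returned by the callback are still relative paths; turning them into absolute URLs is left to LinkExtractor, as the output of the example above shows.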

Every LinkExtractor constructor parameter has a default value. If the constructor is called without any arguments, all links in the page are extracted.
