Sending a request with scrapy shell

wljdeMacBook-Pro:~ wlj$ scrapy shell "http://www.bjrbj.gov.cn/mzhd/detail_29974.htm"
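Once the shell is open you do not need to restart it to request another page: the fetch() shortcut (listed in the shell's own startup help, see the demo below) re-downloads a URL and rebinds the local response object. A minimal sketch, assuming the page is reachable and returns 200:

>>> fetch("http://www.bjrbj.gov.cn/mzhd/detail_29974.htm")   # re-download and rebind `response`
>>> response.status   # expected status on success
200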

The response object

response.body
response.text
response.url
>>> response.url
'http://www.bjrbj.gov.cn/mzhd/detail_29974.htm'
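The distinction between these attributes matters when decoding: response.body holds the raw bytes of the page, response.text is the body decoded to a str using the response's declared encoding, and response.url is the final URL after any redirects. A quick check in the shell:

>>> type(response.body)    # raw bytes, useful for binary content
<class 'bytes'>
>>> type(response.text)    # decoded text, useful for parsing
<class 'str'>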

Import LinkExtractor to match links across the whole HTML document

>>> from scrapy.linkextractors import LinkExtractor
>>> response.xpath('//div[@class="xx_neirong"]/h1/text()').extract()[0] 

'北京社保开户流程是怎么个流程'
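Note that .extract()[0] raises an IndexError when the XPath matches nothing; .extract_first() is the safer variant, returning None on no match. A sketch against the same page (the second selector uses a deliberately nonexistent class name for illustration):

>>> response.xpath('//div[@class="xx_neirong"]/h1/text()').extract_first()
'北京社保开户流程是怎么个流程'
>>> response.xpath('//div[@class="no-such-class"]/h1/text()').extract_first()   # no match: returns None
>>>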



Demo: extracting pagination links from the Tencent careers page
wljdeMacBook-Pro:Desktop wlj$ scrapy shell "http://hr.tencent.com/position.php?"
2018-06-21 21:12:40 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-06-21 21:12:40 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.5 (default, Apr 25 2018, 14:23:58) - [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-06-21 21:12:40 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-21 21:12:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-21 21:12:40 [scrapy.core.engine] INFO: Spider opened
2018-06-21 21:12:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://hr.tencent.com/position.php> from <GET http://hr.tencent.com/position.php>
2018-06-21 21:12:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hr.tencent.com/position.php> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x107617c18>
[s]   item       {}
[s]   request    <GET http://hr.tencent.com/position.php>
[s]   response   <200 https://hr.tencent.com/position.php>
[s]   settings   <scrapy.settings.Settings object at 0x10840e748>
[s]   spider     <DefaultSpider 'default' at 0x1086c6ba8>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> response.url
'https://hr.tencent.com/position.php'
>>> from scrapy.linkextractors import LinkExtractor
>>> link_list=LinkExtractor(allow=("start=\d+"))
>>> link_list.extract_links(response)
[Link(url='https://hr.tencent.com/position.php?&start=10#a', text='2', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=20#a', text='3', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=30#a', text='4', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=40#a', text='5', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=50#a', text='6', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=60#a', text='7', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=70#a', text='...', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=3800#a', text='381', fragment='', nofollow=False)]
>>>
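In a real project, a LinkExtractor is usually not called by hand as above but wired into a CrawlSpider rule, which follows every matched link automatically. A minimal sketch using the same pagination pattern; the spider name, item fields, and callback body are assumptions for illustration, not part of the original session:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TencentSpider(CrawlSpider):
    name = "tencent"
    allowed_domains = ["hr.tencent.com"]
    start_urls = ["https://hr.tencent.com/position.php"]

    rules = (
        # Follow every pagination link matching the pattern used above and
        # pass each downloaded page to parse_page.
        Rule(LinkExtractor(allow=(r"start=\d+",)), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Hypothetical extraction: yield the URL and <title> of each page.
        yield {
            "url": response.url,
            "title": response.xpath("//title/text()").extract_first(),
        }

Run it with `scrapy crawl tencent` inside a Scrapy project; follow=True tells the spider to keep extracting links from pages reached through the rule, so all 381 listing pages get visited.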


Reposted from: https://www.cnblogs.com/wanglinjie/p/9211013.html
