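Preface

This post uses Scrapy's persistent storage through item pipelines. Below are the spider, the pipeline file, and the relevant settings.

Source code

The target is the itcast teacher listing at http://www.itcast.cn/channel/teacher.shtml#ajavaee

Spider file (test1)

This part parses the page content.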


import scrapy


class Test1Spider(scrapy.Spider):
    name = 'test1'
    # allowed_domains = ['https://www.baidu.com/']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#ajavaee']

    def parse(self, response):
        # Each <li> holds one teacher entry
        li_list = response.xpath('/html/body/div[10]/div/div[2]/ul/li')
        for li in li_list:
            item = {}
            item["name"] = li.xpath(".//h2/text()").extract_first()
            first = li.xpath(".//p/span[1]/text()").extract_first()
            second = li.xpath(".//p/span[2]/text()").extract_first()
            # Join both spans when the second one exists
            if second is not None:
                item["title"] = (first or "") + second
            else:
                item["title"] = first
            yield item
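
The title-joining logic in parse can be tried offline. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors, so it runs without a network connection. The SAMPLE markup and the parse_teachers helper are hypothetical; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the teacher list on the real page.
SAMPLE = """
<ul>
  <li><h2>Teacher A</h2><p><span>Senior lecturer</span><span>, Java EE</span></p></li>
  <li><h2>Teacher B</h2><p><span>Lecturer</span></p></li>
</ul>
"""


def parse_teachers(markup):
    """Mirror the spider: join span[1] and span[2] when both exist."""
    items = []
    for li in ET.fromstring(markup.strip()).findall("./li"):
        first = li.findtext(".//p/span[1]")
        second = li.findtext(".//p/span[2]")  # None when there is no second span
        title = first if second is None else (first or "") + second
        items.append({"name": li.findtext(".//h2"), "title": title})
    return items


teachers = parse_teachers(SAMPLE)
for teacher in teachers:
    print(teacher)
```

The second entry has only one span, so its title falls back to the first span alone, which is the case the spider's if/else branch handles.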

pipelines.py

This part persists the parsed items, first to a text file and then to a MySQL database.

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql


# Store items in a file
class TestprojectPipeline:
    fp = None

    def open_spider(self, spider):
        print("Spider started......")
        self.fp = open('./shizi.text', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        author = item['name']
        content = item['title']
        self.fp.write(author + ':' + content + '\n')
        # Return the item so the next pipeline receives it
        return item

    def close_spider(self, spider):
        print("Spider finished!")
        self.fp.close()


# Store items in a database
class mysqlPileLine(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='zpx', db='pydata', charset='utf8')
        # Create the cursor once so close_spider can always close it
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # Parameterized query: the driver escapes the values
            self.cursor.execute('insert into shizi values (%s, %s)',
                                (item["name"], item["title"]))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
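
The database pipeline's open/insert/commit/rollback cycle can be exercised without a MySQL server. The sketch below swaps in the standard library's sqlite3 with an in-memory database; it is not the pymysql code above. The class name SqlitePipelineSketch and the two-column shizi schema are assumptions, since the original does not show the table definition. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3


class SqlitePipelineSketch:
    """Hypothetical stand-in for the MySQL pipeline, backed by in-memory SQLite."""

    def open_spider(self, spider=None):
        self.conn = sqlite3.connect(":memory:")
        # Assumed schema: two text columns matching the item's keys
        self.conn.execute("CREATE TABLE shizi (name TEXT, title TEXT)")

    def process_item(self, item, spider=None):
        try:
            # Placeholders keep quotes in the data from breaking the statement
            self.conn.execute(
                "INSERT INTO shizi VALUES (?, ?)",
                (item["name"], item["title"]),
            )
            self.conn.commit()
        except sqlite3.Error as exc:
            print(exc)
            self.conn.rollback()
        return item

    def close_spider(self, spider=None):
        self.conn.close()


pipeline = SqlitePipelineSketch()
pipeline.open_spider()
pipeline.process_item({"name": 'Teacher "A"', "title": "Senior lecturer"})
rows = pipeline.conn.execute("SELECT name, title FROM shizi").fetchall()
pipeline.close_spider()
print(rows)
```

The sample name contains double quotes on purpose: building the statement with string formatting would produce invalid SQL for such a value, while a parameterized query stores it unchanged.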

settings.py

# Scrapy settings for testproject project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'testproject'

SPIDER_MODULES = ['testproject.spiders']
NEWSPIDER_MODULE = 'testproject.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.17 (KHTML, like Gecko) Version/12.0.1  Safari/605.1.17"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Log level to display
LOG_LEVEL = 'WARNING'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'testproject.middlewares.TestprojectSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'testproject.middlewares.TestprojectDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Priorities of the two persistence pipelines: the lower the number, the higher the priority
ITEM_PIPELINES = {
    'testproject.pipelines.TestprojectPipeline': 300,
    'testproject.pipelines.mysqlPileLine': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
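
The numbers in ITEM_PIPELINES decide the order in which each item passes through the pipelines. The sketch below is a simplified model of that ordering, not Scrapy's own implementation; pipeline_order is a hypothetical helper, and the dictionary is deliberately written in reverse to show that only the numbers matter.

```python
# Same entries as in settings.py, listed in reverse order on purpose.
ITEM_PIPELINES = {
    "testproject.pipelines.mysqlPileLine": 301,
    "testproject.pipelines.TestprojectPipeline": 300,
}


def pipeline_order(setting):
    """Return pipeline paths sorted by value; lower values run first."""
    return [path for path, _ in sorted(setting.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
print(order)
```

With these values the file pipeline (300) sees each item before the database pipeline (301), which is why each process_item must return the item for the next stage.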

Results

