Scraping the Qiumeimei (糗妹妹) jokes front page with the Python Scrapy framework

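Disclaimer: this article is for learning web scraping only. Do not use it commercially or to attack the site maliciously. The author reserves all rights of interpretation.
This article stores the scraped jokes locally in two ways: as a txt file and as a JSON file.
The txt file is the simpler of the two: build the dicts and run a single command. The JSON file takes a bit more work, and the code is commented in detail to make it easier to follow.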

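# -*- coding: utf-8 -*-
# texts.py
import scrapy

# Import the item class defined in items.py
from first.items import FirstItem


class TextsSpider(scrapy.Spider):
    # Spider name; `scrapy list` prints the names of all spiders in the project
    name = 'texts'
    # allowed_domains restricts the crawl to the listed domains. Requests to any
    # other domain are filtered out, so resources hosted on another server
    # (images, for example) would not be fetched. It is usually left commented out.
    # Entries should be bare domains, not URLs.
    # allowed_domains = ['www.qiumeimei.com']
    # First URL the spider requests
    start_urls = ['http://www.qiumeimei.com/text']

    # The parsing logic goes here
    def parse(self, response):
        div_list = response.xpath('//div[@class="home_main_wrap"]/div[@class="panel clearfix"]')
        contents = []  # can be exported to a file later: csv, json
        for div in div_list:
            author = div.xpath('./div[@class="top clearfix"]/h2/a/text()').extract_first()
            content = div.xpath('./div[@class="main"]/p/text()').extract()
            # extract() unwraps the selector objects and returns a list of strings.
            # If the first query only yields a non-breaking space, fall back to
            # the nested div/p path.
            if content == ['\xa0']:
                content = div.xpath('./div[@class="main"]/div/p/text()').extract()
            content = "".join(content)
            # Commented out: local storage without separate modules
            # dict1 = {
            #     "author": author,
            #     "content": content
            # }
            # contents.append(dict1)

            # Local storage with several modules working together:
            # instantiate the item class
            items = FirstItem()
            items["author"] = author
            items["content"] = content
            # print(items)
            # yield hands the item to Scrapy, which passes it on to the item
            # pipeline; no return is needed
            yield items
        # Commented out: local storage without separate modules
        # return contents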
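
The fallback-and-join step in parse can be tried in isolation, without running a crawl. The sketch below assumes the two lists are what the two XPath queries returned; normalize_content is a hypothetical helper name and is not part of the original spider.

```python
def normalize_content(primary, fallback):
    """Return the joke text as a single string.

    primary  -- strings from ./div[@class="main"]/p/text()
    fallback -- strings from ./div[@class="main"]/div/p/text()
    """
    # A lone non-breaking space means the first query found no real text
    parts = fallback if primary == ['\xa0'] else primary
    return "".join(parts)


# The first query found text: it is joined as-is
print(normalize_content(['line one, ', 'line two'], []))
# The first query found only '\xa0': the nested path is used instead
print(normalize_content(['\xa0'], ['nested text']))
```

Unlike the spider, which runs the second query only when needed, this helper takes both lists up front to keep the example free of Scrapy dependencies.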

The code above is the main program.

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
# items.py
import scrapy


# Holds the data parsed from the page
class FirstItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

Next comes the pipeline file, which mainly shows how to generate the JSON file.

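# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# pipelines.py
import codecs
import json
import os


# Pipelines handle persistent storage: txt, json, csv, mysql, redis
class FirstPipeline(object):
    f = None

    # Called when the spider starts
    def open_spider(self, spider):
        # Open the file
        # self.f = open("qmm.txt", "w", encoding="utf_8")
        # codecs.open wraps a binary-mode file and handles encoding and decoding.
        # With the built-in text-mode open(), the end-relative seek in
        # close_spider does not work.
        self.f = codecs.open("qmm.json", "w", encoding="utf_8")
        # Start the list (the outer braces make the file valid JSON)
        self.f.write('{"list":[')

    # Called for every scraped item
    def process_item(self, item, spider):
        # print("writing...")
        author = item["author"]
        content = item["content"]
        # Write the data; this line stores it directly as a txt file
        # self.f.write(author + ":" + "\n" + content + "\n\n\n")
        # To store JSON, first convert the item object into a dict
        res = dict(item)
        # json.dumps escapes non-ASCII text by default; pass ensure_ascii=False
        # to output the Chinese characters themselves.
        # A dict cannot be written to the file directly, so serialize it to a
        # string and write it as one element of the list.
        line = json.dumps(res, ensure_ascii=False)
        self.f.write(line + "," + "\n")
        return item

    # Called when the spider closes
    def close_spider(self, spider):
        # SEEK_END moves the cursor to the end of the file, then back 2 bytes
        self.f.seek(-2, os.SEEK_END)
        # Drop everything after the cursor: the trailing comma and the newline
        self.f.truncate()
        # Close the list and the enclosing object
        self.f.write("]}")
        self.f.close()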
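
The seek-and-truncate step in close_spider exists only to remove the trailing comma. A minimal alternative sketch, assuming plain dicts in place of Scrapy items: write the separator before every item except the first, so nothing needs trimming and a crawl with zero items still produces a parseable file. JsonArrayWriter is a hypothetical class name used for illustration, and the sample records are made up. The list is wrapped in an object so the file parses as JSON.

```python
import json
import os
import tempfile


class JsonArrayWriter:
    """Writes dicts into {"list": [...]} without a trailing comma."""

    def __init__(self, path):
        self.f = open(path, "w", encoding="utf-8")
        self.first = True
        self.f.write('{"list":[')

    def write_item(self, item):
        # The separator goes before every item except the first one
        if not self.first:
            self.f.write(",\n")
        self.first = False
        # ensure_ascii=False keeps Chinese text readable in the file
        self.f.write(json.dumps(item, ensure_ascii=False))

    def close(self):
        self.f.write("]}")
        self.f.close()


# Made-up sample records standing in for dict(item)
path = os.path.join(tempfile.mkdtemp(), "qmm.json")
writer = JsonArrayWriter(path)
writer.write_item({"author": "小明", "content": "第一条"})
writer.write_item({"author": "小红", "content": "第二条"})
writer.close()

with open(path, encoding="utf-8") as fh:
    data = json.load(fh)
print(len(data["list"]))
```

In a pipeline, the constructor, write_item and close would map onto open_spider, process_item (called with dict(item)) and close_spider respectively.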

The last piece is the settings file. It mainly sets a spoofed User-Agent and turns off robots.txt compliance; the key part is the code below.

# Item pipelines
ITEM_PIPELINES = {
    'first.pipelines.FirstPipeline': 300,
}

# Note:
# When the JSON file is saved with `scrapy crawl texts -o qiumeimei.json --nolog`,
# non-ASCII characters are written as \uXXXX escape sequences by default.
# With the setting below, the file is written as UTF-8 instead.
FEED_EXPORT_ENCODING = 'UTF8'  # same as: scrapy crawl texts -o qiumeimei.json -s FEED_EXPORT_ENCODING=UTF8 --nolog
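
The User-Agent and robots.txt settings mentioned earlier are not shown in the excerpt. A minimal sketch of what those lines in settings.py might look like: USER_AGENT and ROBOTSTXT_OBEY are standard Scrapy settings, while the User-Agent string here is only a placeholder and not the value the author used.

```python
# settings.py (fragment)

# Placeholder browser-like User-Agent (an assumption); substitute a real browser string
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Do not fetch or honor robots.txt before crawling
ROBOTSTXT_OBEY = False
```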
