Scrapy in Practice: Scraping Zimuku (字幕库)

1. First, create the Scrapy project

Create the project:
scrapy startproject zimuku

Create the spider:
cd zimuku
scrapy genspider zimu zimuku.cn

Scrapy has now generated the project skeleton and the template files for us.
Let's look at the main ones in turn:

zimu.py
# -*- coding: utf-8 -*-
import scrapy


class ZimuSpider(scrapy.Spider):
    name = 'zimu'
    allowed_domains = ['zimuku.cn']
    start_urls = ['http://zimuku.cn/']

    def parse(self, response):
        pass

items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZimukuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ZimukuPipeline(object):
    def process_item(self, item, spider):
        return item

These are the three most important files; I won't list the others here.
Next, let's analyze the web page.

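2. Page analysis

Our goal is to save the content highlighted in the red box of the screenshot. Scrapy ships with its own selector tools, so we use XPath to match the content.
The XPath expression that extracts the content in the red box is: /html/body/div[2]/div/div/div[2]/table/tbody/tr[1]/td[1]/a/b/text()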
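To make the XPath above more concrete, here is a minimal sketch that walks the same path over a small hand-written sample page. Everything in `SAMPLE_PAGE` (the nesting, the links, the titles) is made up for illustration and is not the real zimuku.cn markup. The sketch uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors. ElementTree supports only a subset of XPath, so the path is written relative to the root `<html>` element and the trailing `text()` step is replaced by reading `.text`.

```python
import xml.etree.ElementTree as ET

# Hand-written sample page (an assumption, not the real site markup).
SAMPLE_PAGE = """
<html><body>
  <div>header</div>
  <div><div><div>
    <div>sidebar</div>
    <div>
      <table><tbody>
        <tr><td><a href="/detail/1.html"><b>Example Subtitle A</b></a></td></tr>
        <tr><td><a href="/detail/2.html"><b>Example Subtitle B</b></a></td></tr>
      </tbody></table>
    </div>
  </div></div></div>
</body></html>
"""

# Same steps as /html/body/div[2]/div/div/div[2]/table/tbody,
# written relative to the <html> root element.
TABLE_BODY = "body/div[2]/div/div/div[2]/table/tbody"


def first_title(page):
    """Return the text of the first row's title, or None if nothing matches."""
    root = ET.fromstring(page.strip())
    node = root.find(TABLE_BODY + "/tr[1]/td[1]/a/b")
    return node.text if node is not None else None


def all_titles(page):
    """Return the title text of every row by dropping the [1] on tr."""
    root = ET.fromstring(page.strip())
    return [b.text for b in root.findall(TABLE_BODY + "/tr/td[1]/a/b")]
```

Dropping the `[1]` after `tr` selects every row instead of only the first one. Also note that browsers often insert `tbody` themselves when rendering a table, so if the expression matches nothing inside Scrapy, trying it without the `/tbody` step is a reasonable first check.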

3. Write the code
(1) First, edit items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZimukuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # the content we want to scrape
    text = scrapy.Field()

(2) Edit zimu.py

# -*- coding: utf-8 -*-
import scrapy
# import the item class defined in items.py
from zimuku.items import ZimukuItem


class ZimuSpider(scrapy.Spider):
    name = 'zimu'
    allowed_domains = ['zimuku.cn']
    start_urls = ['http://zimuku.cn/']

    def parse(self, response):
        '''
        :param response: the downloaded page to parse
        :return: the scraped item
        '''
        # extract_first() returns the matched text itself rather than a selector object
        name = response.xpath("/html/body/div[2]/div/div/div[2]/table/tbody/tr[1]/td[1]/a/b/text()").extract_first()
        item = ZimukuItem()
        item['text'] = name
        yield item

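(3) Edit pipelines.py (this is what processes the scraped content)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ZimukuPipeline(object):
    def process_item(self, item, spider):
        # append the scraped text to a local file, one entry per line
        with open("F:\\python\\1.txt", 'a') as fp:
            fp.write(str(item['text']) + '\n')
        print(item['text'])
        # return the item so that any later pipeline still receives it
        return item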
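The pipeline above reopens the output file for every item. As a possible variation, the sketch below opens the file once when the crawl starts and closes it when the crawl ends, using the optional `open_spider` and `close_spider` hooks that Scrapy calls on item pipelines. The class name `TextFilePipeline`, the `path` parameter and the default file name are hypothetical, and the `encoding` argument assumes Python 3.

```python
class TextFilePipeline:
    """Hypothetical pipeline that keeps one file handle open per crawl."""

    def __init__(self, path="subtitles.txt"):
        # 'path' is an illustrative parameter, not part of the original project.
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.fp = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.fp is not None:
            self.fp.close()
            self.fp = None

    def process_item(self, item, spider):
        # Write one scraped entry per line, then hand the item on unchanged.
        self.fp.write(str(item["text"]) + "\n")
        return item
```

A pipeline is an ordinary Python class, so it can be exercised without running a crawl: call `open_spider(None)`, pass a plain dict such as `{"text": "..."}` to `process_item`, then call `close_spider(None)`. Writing UTF-8 explicitly avoids depending on the platform's default encoding when the scraped titles contain Chinese characters.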

(4) settings.py

# Tell Scrapy which pipeline handles the scraped results
ITEM_PIPELINES = {
    'zimuku.pipelines.ZimukuPipeline': 300,
}
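The number in `ITEM_PIPELINES` is a priority: when several pipelines are enabled, Scrapy passes each item through them in ascending order of this value, and values are conventionally chosen in the 0 to 1000 range. The sketch below only illustrates that ordering rule with plain Python. It is not Scrapy's internal code, and `zimuku.pipelines.DedupPipeline` is a hypothetical second pipeline that does not exist in this project.

```python
# Illustrative settings: the second entry is a hypothetical pipeline.
ITEM_PIPELINES = {
    "zimuku.pipelines.ZimukuPipeline": 300,
    "zimuku.pipelines.DedupPipeline": 100,  # hypothetical, would run first
}


def execution_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]
```

With the values above, the hypothetical deduplication step would see each item before `ZimukuPipeline` writes it to disk.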

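(5) Run

# without log output
scrapy crawl zimu --nolog
# with log output
scrapy crawl zimu

It is best to leave logging on. Otherwise some errors go unnoticed, and when something goes wrong you won't know where the problem is.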
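Between full logging and `--nolog` there is a middle ground: Scrapy's command line also accepts `--loglevel` (short form `-L`), which can hide routine messages while still showing warnings and errors. The sketch below builds the command as an argument list and optionally runs it with `subprocess`. The helper names `build_crawl_command` and `run_crawl` are hypothetical, and `run_crawl` assumes that `scrapy` is on the PATH and that `project_dir` is the folder containing `scrapy.cfg`.

```python
import subprocess


def build_crawl_command(spider, log_level=None, nolog=False):
    """Build the argument list for 'scrapy crawl <spider>'."""
    cmd = ["scrapy", "crawl", spider]
    if nolog:
        cmd.append("--nolog")  # suppress all log output
    elif log_level:
        cmd += ["--loglevel", log_level]  # e.g. WARNING or ERROR
    return cmd


def run_crawl(spider, project_dir=".", **options):
    """Run the crawl inside the project directory and return the exit code."""
    result = subprocess.run(build_crawl_command(spider, **options), cwd=project_dir)
    return result.returncode
```

For example, `build_crawl_command("zimu", log_level="WARNING")` corresponds to typing `scrapy crawl zimu --loglevel WARNING` in the project directory.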

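The crawl runs successfully. Now let's check whether the file was saved.

OK! It finally works.
This Scrapy walkthrough was done one step at a time, so it should be a useful reference for anyone who is still a beginner.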
