MOOC Course Information Scraping

Date: 2019-10-12

I. Task and Goals

1. Target website:
```
http://www.imooc.com/course/list/
```

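2. Use the Scrapy crawling framework.

Fields to scrape: course title, course URL, course image URL, course description, and number of learners; finally, download the course images.
Output format: JSON
Coverage: scrape every course on the site.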

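II. Environment Setup and Installation

Python version: Python 3.7.4 (downloaded from the official website)
Operating system: Windows 10

Installing the Scrapy framework

Open cmd and run: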

pip install scrapy

If the installation fails, try upgrading pip first:

python -m pip install -U pip
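
To confirm that the installation succeeded before going further, a quick check from Python can help. This is a minimal sketch that uses only the standard library; `is_installed` is a helper name introduced here for illustration and is not part of Scrapy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None


# 'scrapy' should report True once `pip install scrapy` has succeeded.
print('scrapy installed:', is_installed('scrapy'))
```

If this prints False, the pip command above probably ran against a different Python installation than the one you are using.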

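III. Setting Up the Crawler Project

Note: Scrapy projects are created and launched from the command line.

Step 1: Search for cmd on your computer to open a command prompt, then run:

cd <the folder where you want to keep the crawler>
For example: cd C:\Users\zhong\Desktop\scrapy\mooc

Step 2: Create the Scrapy project. In cmd, run: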

scrapy startproject mooc

Here mooc is the project name.

Switch into the mooc directory:

cd mooc

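Step 3: Create our first spider:

scrapy genspider moocscrapy imooc.com

genspider takes the site's domain as its second argument; passing the full URL would put that URL into allowed_domains. The start URL http://www.imooc.com/course/list/ is set in the spider file later.

Open the project folder and run tree /f in cmd to print the directory tree. The main entries are:

scrapy.cfg: the project's configuration file
mooc/: the project's Python module; the code goes here.
mooc/items.py: the project's item definitions.
mooc/pipelines.py: the project's pipelines.
mooc/settings.py: the project's settings.
mooc/spiders/: the directory that holds the spider code.

That is the file layout of our crawler project.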

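IV. Designing the Spider and Scraping the Data

An editor is recommended for writing the code (PyCharm is a good choice).

Notes on the generated spider:

- name: identifies the spider. It must be unique; two spiders cannot share the same name.
- start_urls: the list of URLs the spider starts crawling from. The first pages downloaded are therefore among these; subsequent URLs are extracted from the data those pages return.
- parse(): a method of the spider. Each start URL's downloaded Response object is passed to it as its only argument. The method parses the response data, extracts data (producing items), and generates Request objects for further URLs to follow.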
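
To make the relationship between start_urls, parse(), items, and follow-up requests concrete, here is a toy model of the crawl loop written with only the standard library. It is not how Scrapy is implemented: `FAKE_PAGES`, `ToySpider`, and `run` are hypothetical names, the "download" is a dictionary lookup, and this parse() also receives the URL, whereas Scrapy's parse() receives only the response.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (titles on the page, next page URL or None).
FAKE_PAGES = {
    'http://example.com/list?page=1': (['course A', 'course B'],
                                       'http://example.com/list?page=2'),
    'http://example.com/list?page=2': (['course C'], None),
}


class ToySpider:
    name = 'toy'  # unique spider name
    start_urls = ['http://example.com/list?page=1']

    def parse(self, url, response):
        titles, next_url = response
        for title in titles:
            # Extracted data (an "item")
            yield {'title': title, 'source': url}
        if next_url is not None:
            # A follow-up URL that should be crawled next
            yield ('request', next_url)


def run(spider):
    """Fetch each queued URL, pass the response to parse(), and sort its output."""
    queue = deque(spider.start_urls)
    scraped = []
    while queue:
        url = queue.popleft()
        response = FAKE_PAGES[url]  # stands in for the download step
        for result in spider.parse(url, response):
            if isinstance(result, tuple) and result[0] == 'request':
                queue.append(result[1])
            else:
                scraped.append(result)
    return scraped


items = run(ToySpider())
print([item['title'] for item in items])
```

The point of the sketch is that parse() is a generator: everything it yields is either data to keep or another page to visit, and the framework decides what to do with each.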

Step 1: Create a container for the scraped data. Open items.py:

    import scrapy


    class MoocItem(scrapy.Item):
        # Course title
        title = scrapy.Field()
        # Course URL
        url = scrapy.Field()
        # Course title image URL
        image_url = scrapy.Field()
        # Course description
        introduction = scrapy.Field()
        # Number of learners
        student = scrapy.Field()
        # Local path of the downloaded image
        image_path = scrapy.Field()

Step 2: Write the moocscrapy.py file (under mooc/spiders/):

    # -*- coding: utf-8 -*-
    import scrapy
    from mooc.items import MoocItem


    class MoocscrapySpider(scrapy.Spider):
        name = 'moocscrapy'
        # allowed_domains takes bare domain names, not URLs
        allowed_domains = ['imooc.com']
        start_urls = ['http://www.imooc.com/course/list/']

        def parse(self, response):
            # Each course card is an <a> inside a course-card-container div
            for box in response.xpath('//div[@class="course-card-container"]/a[@target="_blank"]'):
                # Create a fresh item per course so yielded items do not share state
                item = MoocItem()
                # Course URL
                item['url'] = 'http://www.imooc.com' + box.xpath('.//@href').extract()[0]
                # Course title
                item['title'] = box.xpath('.//div[@class="course-card-content"]/h3[@class="course-card-name"]//text()').extract()[0]
                # Course title image URL
                item['image_url'] = "http:" + box.xpath('.//img[@class="course-banner lazy"]/@data-original').extract()[0]
                # Number of learners
                item['student'] = box.xpath('.//span[2]/text()').extract()[0]
                # Course description
                item['introduction'] = box.xpath('.//p/text()').extract()[0].strip()
                # Hand the item over to the pipelines
                yield item

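Note: XPath is used here to extract information from the page.

In parse(), the response argument holds the downloaded page; we then use XPath expressions to locate the information we need.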
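
The nested lookups in the spider (select each course card first, then query relative to that card) can be tried without running a crawl. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in for Scrapy's selectors. The `SAMPLE` markup is a hypothetical, simplified imitation of the course cards, not the site's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the course cards used by the spider.
SAMPLE = """
<html><body>
<div class="course-card-container">
  <a target="_blank" href="/learn/1">
    <div class="course-card-content">
      <h3 class="course-card-name">Intro to Python</h3>
      <p> Learn the basics. </p>
    </div>
  </a>
</div>
<div class="course-card-container">
  <a target="_blank" href="/learn/2">
    <div class="course-card-content">
      <h3 class="course-card-name">Intro to Scrapy</h3>
      <p> Build a crawler. </p>
    </div>
  </a>
</div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
courses = []
# Outer query: one <a> element per course card.
for box in root.findall('.//div[@class="course-card-container"]/a[@target="_blank"]'):
    # Inner queries start with "." so they search only inside this card.
    courses.append({
        'url': 'http://www.imooc.com' + box.get('href'),
        'title': box.find('.//h3[@class="course-card-name"]').text,
        'introduction': box.find('.//p').text.strip(),
    })

for course in courses:
    print(course)
```

In the spider itself the same expressions go through `response.xpath(...)`, which accepts full XPath 1.0, including `/text()` and `/@href` steps that ElementTree does not support.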
After completing the steps above, we can run the spider to check for errors. From the command line, enter the project folder and run:

scrapy crawl moocscrapy

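If there are no errors, the scraped items are printed in the console log.

Step 3: Save the scraped data

At this point we are halfway there. Next we need to write the item data to a file for storage.

Open pipelines.py, which handles data processing and storage.

Next comes the data processing itself.

For convenience, we write the scraped data to a file named data.json.

The code is as follows: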

    import json


    class MoocPipeline(object):
        # Called when the spider is opened
        def open_spider(self, spider):
            # Open the output file
            self.file = open('data.json', 'w', encoding='utf-8')

        # Called for every item the spider yields
        def process_item(self, item, spider):
            # Serialize the item as one JSON object per line
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            # Write it to the file
            self.file.write(line)
            # Return the item so later pipelines can process it
            return item

        # Called when the spider is closed
        def close_spider(self, spider):
            # Close the file so buffered data is flushed to disk
            self.file.close()
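
One consequence of this pipeline is worth spelling out: it writes one JSON object per line, so data.json is not a single JSON document and has to be read back line by line. The sketch below demonstrates this with the standard library only; the two sample records are made up, and the file is written to a temporary directory rather than the project folder.

```python
import json
import os
import tempfile

# Two sample records in the shape the pipeline writes (values are made up).
records = [
    {'title': 'Python 入门', 'student': '1000'},
    {'title': 'Scrapy basics', 'student': '250'},
]

path = os.path.join(tempfile.mkdtemp(), 'data.json')

# Write one JSON object per line, as process_item() does.
# ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
with open(path, 'w', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Read the file back line by line. Calling json.load() on the whole file
# would fail because it holds several JSON documents, not one.
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), 'records loaded')
```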

To use a pipeline, it must first be registered.

Open settings.py, the crawler's configuration file,

and add:

    ITEM_PIPELINES = {
        'mooc.pipelines.MoocPipeline': 1,
    }

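The code above registers the pipeline: mooc.pipelines.MoocPipeline is the class being registered, and the 1 on the right is its priority. Priorities conventionally range from 0 to 1000, and pipelines with lower values run first.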
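
The ordering rule can be illustrated with plain Python. In this sketch the second entry, `HypotheticalCleanupPipeline`, does not exist in the project and is added only to show how two priorities compare.

```python
# The first entry is a hypothetical pipeline added only to show the ordering.
ITEM_PIPELINES = {
    'mooc.pipelines.HypotheticalCleanupPipeline': 300,
    'mooc.pipelines.MoocPipeline': 1,
}

# Sort the class paths by their priority number, lowest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Each item would pass through MoocPipeline first (priority 1) and then through the hypothetical pipeline (priority 300), regardless of the order in which the entries are written in the dictionary.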

With that, our most basic crawl is complete.

Now run the spider again; the scraped courses are written to data.json:

scrapy crawl moocscrapy

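Step 4: Following URLs

Browsing the site shows 29 pages of data in total. Analyzing the URLs shows that page N is reached by appending ?page=N to the list URL, so after parsing each page the spider requests the next one: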

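        # self.n is the page just parsed (starting at 1); stop after page 29
        if self.n < 29:
            self.n = self.n + 1
            newurl = 'http://www.imooc.com/course/list/?page=' + str(self.n)
            yield scrapy.Request(newurl, callback=self.parse)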
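
The page arithmetic is easy to get wrong by one, so it can be checked in isolation. This is a minimal sketch assuming the 29 pages observed above and that the plain list URL counts as page 1; `next_page_url` and `LAST_PAGE` are names introduced here for illustration.

```python
BASE_URL = 'http://www.imooc.com/course/list/'
LAST_PAGE = 29  # page count observed above; an assumption that may change over time


def next_page_url(current_page, last_page=LAST_PAGE):
    """Return the URL of the page after current_page, or None on the last page."""
    if current_page >= last_page:
        return None
    return BASE_URL + '?page=' + str(current_page + 1)


# Walk from page 1 (the start URL) and collect every follow-up URL.
page = 1
urls = []
while True:
    url = next_page_url(page)
    if url is None:
        break
    urls.append(url)
    page += 1

print(len(urls), 'follow-up pages, from', urls[0], 'to', urls[-1])
```

Starting from page 1, covering all 29 pages takes 28 follow-up requests, for pages 2 through 29. A stop condition that ends one page earlier would silently skip the last page.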

Run the spider again and check the results.

The full source code follows.

mooc/spiders/moocscrapy.py

    # -*- coding: utf-8 -*-
    import scrapy
    from mooc.items import MoocItem


    class MoocscrapySpider(scrapy.Spider):
        name = 'moocscrapy'
        allowed_domains = ['imooc.com']
        start_urls = ['http://www.imooc.com/course/list/']
        # Number of the page currently being parsed
        n = 1

        def parse(self, response):
            # Each course card is an <a> inside a course-card-container div
            for box in response.xpath('//div[@class="course-card-container"]/a[@target="_blank"]'):
                # Create a fresh item per course so yielded items do not share state
                item = MoocItem()
                # Course URL
                item['url'] = 'http://www.imooc.com' + box.xpath('.//@href').extract()[0]
                # Course title
                item['title'] = box.xpath('.//div[@class="course-card-content"]/h3[@class="course-card-name"]//text()').extract()[0]
                # Course title image URL
                item['image_url'] = "http:" + box.xpath('.//img[@class="course-banner lazy"]/@data-original').extract()[0]
                # Number of learners
                item['student'] = box.xpath('.//span[2]/text()').extract()[0]
                # Course description
                item['introduction'] = box.xpath('.//p/text()').extract()[0].strip()
                # Hand the item over to the pipelines
                yield item

            # Follow the next page until all 29 pages have been requested
            if self.n < 29:
                self.n = self.n + 1
                newurl = 'http://www.imooc.com/course/list/?page=' + str(self.n)
                yield scrapy.Request(newurl, callback=self.parse)

mooc/items.py

    import scrapy


    class MoocItem(scrapy.Item):
        # Course title
        title = scrapy.Field()
        # Course URL
        url = scrapy.Field()
        # Course title image URL
        image_url = scrapy.Field()
        # Course description
        introduction = scrapy.Field()
        # Number of learners
        student = scrapy.Field()
        # Local path of the downloaded image
        image_path = scrapy.Field()

mooc/middlewares.py

    # -*- coding: utf-8 -*-

    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/spider-middleware.html

    from scrapy import signals
    import random


    class UserAgentDownloadMiddleWare(object):
        USER_AGENTS = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/44.0.2403.155 Safari/537.36',
            'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; pl-PL; rv:1.0.1) Gecko/20021111 Chimera/0.6',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418.8 (KHTML, like Gecko, Safari) Cheshire/1.0.UNOFFICIAL',
            'Mozilla/5.0 (X11; U; Linux i686; nl; rv:1.8.1b2) Gecko/20060821 BonEcho/2.0b2 (Debian-1.99+2.0b2+dfsg-1)',
        ]

        def process_request(self, request, spider):
            # Pick a random User-Agent for every outgoing request
            user_agent = random.choice(self.USER_AGENTS)
            request.headers['User-Agent'] = user_agent


    class MoocSpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.

            # Should return None or raise an exception.
            return None

        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.

            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i

        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.

            # Should return either None or an iterable of Request, dict
            # or Item objects.
            pass

        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn't have a response associated.

            # Must return only requests (not items).
            for r in start_requests:
                yield r

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)


    class MoocDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.

            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None

        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.

            # Must either;
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response

        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.

            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)

mooc/pipelines.py

    import json


    class MoocPipeline(object):
        # Called when the spider is opened
        def open_spider(self, spider):
            # Open the output file
            self.file = open('data.json', 'w', encoding='utf-8')

        # Called for every item the spider yields
        def process_item(self, item, spider):
            # Serialize the item as one JSON object per line
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            # Write it to the file
            self.file.write(line)
            # Return the item so later pipelines can process it
            return item

        # Called when the spider is closed
        def close_spider(self, spider):
            # Close the file so buffered data is flushed to disk
            self.file.close()

mooc/settings.py

    # -*- coding: utf-8 -*-

    # Scrapy settings for mooc project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'mooc'

    SPIDER_MODULES = ['mooc.spiders']
    NEWSPIDER_MODULE = 'mooc.spiders'

    ITEM_PIPELINES = {
        'mooc.pipelines.MoocPipeline': 1,
    }

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'mooc (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    DOWNLOADER_MIDDLEWARES = {
        'mooc.middlewares.UserAgentDownloadMiddleWare': 543,
    }

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}

    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'mooc.middlewares.MoocSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'mooc.middlewares.MoocDownloaderMiddleware': 543,
    #}

    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    #    'mooc.pipelines.MoocPipeline': 300,
    #}

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
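
The DOWNLOADER_MIDDLEWARES entry in these settings enables the UserAgentDownloadMiddleWare class from middlewares.py, which the walkthrough does not otherwise explain. The sketch below shows what its process_request() does to a request, using a stand-in object instead of a real Scrapy request. `FakeRequest` is hypothetical, and the User-Agent list is shortened to two of the entries from the listing.

```python
import random


class FakeRequest:
    """Hypothetical stand-in for a Scrapy request: it only carries a headers dict."""

    def __init__(self):
        self.headers = {}


class UserAgentDownloadMiddleWare:
    # Shortened list; the full list is in middlewares.py.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a randomly chosen value.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)


middleware = UserAgentDownloadMiddleWare()
request = FakeRequest()
middleware.process_request(request, spider=None)
print(request.headers['User-Agent'])
```

Because the method returns None, Scrapy continues processing the request with the modified header, so successive requests may present different browser identities.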
