[Part 3] Python Web Scraping in Practice: Fetching the Article List from a CSDN Personal Homepage

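This tutorial uses the Scrapy framework to build a web crawler that scrapes the article list from a CSDN personal homepage. It follows the pagination automatically and saves the collected data to a JSON file (one JSON object per line). The code is intended for reference, study, and discussion only. Before you start, make sure Python and Scrapy are correctly installed on your machine; if they are not, look up an installation guide first and then come back to this article.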

1. Create the Scrapy project

2. Create a spider

3. Directory layout

4. Code

4.1 items.py

4.2 settings.py

4.3 csdn.py

4.4 pipelines.py

5. Run the spider

6. Crawl results

1. Create the Scrapy project

scrapy startproject csdnSpider

2. Create a spider

# Enter the project directory
cd csdnSpider
# Create the spider
scrapy genspider csdn csdn.net

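3. Directory layout

csdnSpider
│  scrapy.cfg  # Basic Scrapy configuration
│
└─csdnSpider
    │  items.py  # Defines the crawler's data model
    │  middlewares.py  # Defines the project's middlewares
    │  pipelines.py  # Pipeline file; processes the data returned by the spider
    │  settings.py  # Crawler settings, including pipeline priorities (the lower the value, the higher the priority)
    │  __init__.py
    │
    ├─spiders
    │  │  csdn.py  # The custom spider
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-37.pyc
    │
    └─__pycache__
            settings.cpython-37.pyc
            __init__.cpython-37.pyc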

4. Code

Tip: files in the Scrapy project that were left unchanged are not shown here.

4.1 items.py

import scrapy


class CsdnspiderItem(scrapy.Item):
    # ID
    id = scrapy.Field()
    # Type
    type = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Creation time
    createTime = scrapy.Field()
    # View count
    views = scrapy.Field()
    # Article URL
    url = scrapy.Field()

4.2 settings.py

BOT_NAME = 'csdnSpider'

SPIDER_MODULES = ['csdnSpider.spiders']
NEWSPIDER_MODULE = 'csdnSpider.spiders'

# User-Agent header sent with each request
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Pipeline priorities; lower values run first
ITEM_PIPELINES = {
    'csdnSpider.pipelines.CsdnspiderPipeline': 300,
}
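To make the priority comment concrete, here is a minimal plain-Python sketch of how an ascending sort on the ITEM_PIPELINES values decides the order. It does not use Scrapy itself, and HypotheticalCleanupPipeline is an invented name added only for illustration; the project above registers a single pipeline.

```python
# Illustrative registry. Only CsdnspiderPipeline exists in this project;
# HypotheticalCleanupPipeline is a made-up entry used to show the ordering.
ITEM_PIPELINES = {
    'csdnSpider.pipelines.CsdnspiderPipeline': 300,
    'csdnSpider.pipelines.HypotheticalCleanupPipeline': 100,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
for path in order:
    print(path)
```

With these values the hypothetical cleanup pipeline (100) would handle each item before CsdnspiderPipeline (300) writes it to disk.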

4.3 csdn.py

import math
import re
from urllib.parse import urlparse

import scrapy
from scrapy import Request

from csdnSpider.items import CsdnspiderItem


class CsdnSpider(scrapy.Spider):
    # Spider name
    name = 'csdn'
    # Domains the spider is allowed to crawl
    allowed_domains = ['csdn.net']
    # Blog homepage URL
    url = 'https://blog.csdn.net/qq_19309473'
    # Start crawling from this page
    start_urls = [url]

    # Initializer
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Current page number
        self.page = 1
        # Number of records collected so far
        self.count = 0

    # Parser
    def parse(self, response):
        post_list = response.xpath('//*[@id="articleMeList-blog"]/div[2]/div')
        # Get the total number of posts
        blog_str = response.xpath('//*[@id="container-header-blog"]/span/text()').get()
        total_str = re.findall(r"博客\((.+?)\)", blog_str)[0]
        total = int(total_str)
        for post in post_list:
            # Increment the record count
            self.count += 1
            # Build a fresh item for each post so earlier items are not overwritten
            item = CsdnspiderItem()
            item['id'] = self.count
            item['type'] = post.xpath('./h4//span/text()').get()
            item['title'] = post.xpath('./h4/a/text()')[1].extract().strip()
            item['createTime'] = post.xpath('.//span[@class="date"]/text()').get()
            item['views'] = post.xpath('.//span[@class="read-num"]/text()').get()
            item['url'] = post.xpath('./h4/a/@href').get()
            yield item
        # Move on to the next page
        self.page += 1
        # Split the request URL into its parts
        parts = urlparse(response.request.url)
        # Scheme
        protocol = parts.scheme
        # Domain
        domain = parts.netloc
        # Author
        home = parts.path.split('/')[1]
        # Next page URL
        next_url = "{}://{}/{}/article/list/{}".format(protocol, domain, home, self.page)
        # Maximum page number, at 40 posts per page
        maxPage = math.ceil(total / 40)
        if self.page <= maxPage:
            yield Request(url=next_url, callback=self.parse, dont_filter=False)
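The pagination logic in the spider can be checked without running a crawl. The sketch below pulls it out into three standalone helpers; the function names (parse_total, max_page, next_page_url) and the PAGE_SIZE constant are assumptions introduced here for illustration and are not part of the project. The page size of 40 and the /article/list/N URL pattern are taken from the spider code above.

```python
import math
import re
from urllib.parse import urlparse

PAGE_SIZE = 40  # posts per list page, the value the spider above assumes


def parse_total(header_text):
    """Extract the post count from header text such as '博客(85)'."""
    match = re.search(r"博客\((\d+)\)", header_text or "")
    return int(match.group(1)) if match else 0


def max_page(total, page_size=PAGE_SIZE):
    """Number of list pages needed to show `total` posts."""
    return math.ceil(total / page_size)


def next_page_url(request_url, page):
    """Build the list-page URL for `page` from any URL under the author's blog."""
    parts = urlparse(request_url)
    # The first path segment is the author's username
    author = parts.path.split("/")[1]
    return "{}://{}/{}/article/list/{}".format(parts.scheme, parts.netloc, author, page)


total = parse_total("博客(85)")
print(total, max_page(total))
print(next_page_url("https://blog.csdn.net/qq_19309473", 2))
```

Returning 0 when the header text is missing or does not match is a design choice of this sketch: it makes max_page return 0 so no further pages are requested, whereas the spider above would raise an error in that case.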

4.4 pipelines.py

from scrapy.exporters import JsonLinesItemExporter


class CsdnspiderPipeline:
    # Initializer
    def __init__(self):
        # Create and open blog.json (written as one JSON object per line)
        self.fp = open('blog.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    # Item handling
    def process_item(self, item, spider):
        # Write the item
        self.exporter.export_item(item)
        return item

    # Shutdown; Scrapy passes the spider as an argument
    def close_spider(self, spider):
        # Close the file stream
        self.fp.close()
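Because the pipeline uses a JSON Lines exporter, blog.json holds one JSON object per line rather than a single JSON array. The following sketch mimics that layout with the standard json module only, and shows how such a file could be read back. The helper names and the sample records are placeholders invented for illustration; they are not real crawl output.

```python
import io
import json


def write_json_lines(items, fp):
    """Write each item as one JSON object per line, keeping non-ASCII text readable."""
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_json_lines(fp):
    """Parse a JSON Lines stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


# Placeholder records using some of the field names from items.py
sample = [
    {"id": 1, "title": "示例文章一", "views": "100"},
    {"id": 2, "title": "示例文章二", "views": "42"},
]

buffer = io.StringIO()  # stands in for the blog.json file
write_json_lines(sample, buffer)
buffer.seek(0)
loaded = read_json_lines(buffer)
print(loaded)
```

To read the real file, the same read_json_lines helper could be applied to open('blog.json', encoding='utf-8'); calling json.load on the whole file would fail once it contains more than one record.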

5. Run the spider

scrapy crawl csdn

6. Crawl results
