[Crawler notes] A simple dynamic-page web scraper in R (rvest package): an example

1. Goal of the scraper

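Python is generally considered the stronger tool for web scraping, but dynamically loaded pages and sites that require login can be awkward to handle in Python, and for some ordinary scraping jobs R is the more convenient choice.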

Taking https://www.thepaper.cn/ as the example, this post scrapes the news on the home page (title, content, time), mainly with the rvest package.

When I first started learning, I followed the Bilibili video 《20分钟入门基于R语言的网络爬虫》, but the code threw errors when I ran it, so I modified the original code.

2. Code walkthrough and error fixes

library(rvest)
library(stringr)

url <- "https://www.thepaper.cn/"
web <- read_html(url)                # read the HTML page
news <- web %>% html_nodes('h2 a')   # headline links: <a> inside <h2>
title <- news %>% html_text()        # text of each headline
link <- news %>% html_attrs()        # attributes of each headline link

# Keep the first attribute of each link (expected to be the relative URL)
link1 <- character(length(link))
for (i in seq_along(link)) {
  link1[i] <- link[[i]][1]
}
link2 <- paste("https://www.thepaper.cn/", link1, sep = "")

## Get the full text of each news item
news_content <- character(length(link2))
for (i in seq_along(link2)) {
  web <- read_html(link2[i])
  if (length(html_nodes(web, 'div.news_txt')) == 1) {
    # text article
    news_content[i] <- html_text(html_nodes(web, 'div.news_txt'))
  } else {
    # video article
    news_content[i] <- trimws(str_replace_all(html_text(html_nodes(web, 'div.video_txt_l p')), "[\r\n]", ""))
  }
}

## Get the time of each news item
news_date <- character(length(link2))
for (i in seq_along(link2)) {
  web <- read_html(link2[i])
  if (length(html_nodes(web, 'div.news_txt')) == 1) {
    news_date[i] <- trimws(str_replace_all(html_text(html_nodes(web, "div p")[2]), "[\r\n]", ""))
  } else {
    news_date[i] <- trimws(str_replace_all(html_text(html_nodes(web, 'div.video_txt_l span')), "[\r\n]", ""))
  }
}

# Characters 1-10 hold the date, characters 12-16 hold the time
date <- substring(news_date, 1, 10)
time <- substring(news_date, 12, 16)

news_01 <- data.frame(title, date, time, url = link2, news_content)
save(news_01, file = "news_information.Rdata")
write.csv(news_01, file = "news_information.csv")

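Note1:

Press F12 to open the developer tools and click on a headline; the panel shows that the headline's node is an a element under an h2, as in the screenshot below: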
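To try the 'h2 a' selector without touching the live site, here is a minimal sketch on a hand-written snippet. The markup and the href values are made up for illustration and are not copied from the real page; minimal_html() is rvest's helper for building a small test document.

```r
library(rvest)

# Hypothetical markup standing in for the real home page
page <- minimal_html('
  <h2><a href="newsDetail_forward_1">First headline</a></h2>
  <h2><a href="newsDetail_forward_2">Second headline</a></h2>
  <p><a href="about">Not a headline</a></p>
')

headlines <- page %>% html_nodes("h2 a")   # only <a> elements inside an <h2>
titles <- headlines %>% html_text()        # the visible text of each match

print(length(headlines))
print(titles)
```

The link inside the p element is skipped because 'h2 a' only matches a elements that have an h2 ancestor.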

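Note2:

The original video used the code below to get the text of each news item, but it failed with the error "replacement has length zero", as shown here: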

> for(i in 1:length(link2))
+ {
+     news_content[i] <- read_html(link2[i]) %>% html_nodes('div.news_txt') %>% html_text()
+ }
Error in news_content[i] <- read_html(link2[i]) %>% html_nodes("div.news_txt") %>%  : replacement has length zero

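The cause is that some pages are video-only, so the 'div.news_txt' selector matches nothing and the result has length zero. The code in this post therefore checks the page type first, and then extracts the text separately for text news and video news.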
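Another way to guard against the empty match, sketched here under my own assumptions: wrap the extraction in a small helper that always returns exactly one string, or NA when the selector finds nothing. The helper name text_or_na and the two test pages are hypothetical, not from the original video or from the site.

```r
library(rvest)

# Hypothetical helper: text of all nodes matching `css`, joined into one
# string, or NA when nothing matches, so the result always has length one.
text_or_na <- function(page, css) {
  nodes <- html_nodes(page, css)
  if (length(nodes) == 0) {
    return(NA_character_)
  }
  paste(html_text(nodes), collapse = " ")
}

# Made-up pages imitating a text article and a video article
text_page  <- minimal_html('<div class="news_txt">Body of a text article</div>')
video_page <- minimal_html('<div class="video_txt_l"><p>Video caption</p></div>')

from_text  <- text_or_na(text_page, "div.news_txt")
from_video <- text_or_na(video_page, "div.news_txt")   # NA, not a zero-length vector

print(from_text)
print(from_video)
```

Because the helper never returns a zero-length vector, an assignment such as news_content[i] <- text_or_na(web, 'div.news_txt') cannot hit the "replacement has length zero" error; the NA entries can be filled in from the video selector afterwards.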

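Note3:

The original video has code to get the time of each news item, but the extracted content...... (this innocent newbie guesses the site may have been upgraded as an anti-scraping measure!!!)

So this innocent newbie went looking for the right nodes again, still in the developer tools. On text pages the time sits under the "div p" selector, and comes out as a string with line breaks and spaces; on video pages it sits under "div.video_txt_l span", and comes out with a little tail attached (the news source).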
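A small sketch of the cleanup for both cases. The raw strings are invented for illustration, and the "YYYY-MM-DD HH:MM" layout is an assumption inferred from the substring positions used in the code above, not something confirmed against the site. str_extract() is used here to cut off the trailing source; the main code does not do this.

```r
library(stringr)

# Hypothetical raw strings shaped like the two cases described above
raw_text  <- "\n            2021-03-05 10:30\n            "
raw_video <- "2021-03-05 10:30 Source: Some Outlet"

# Same cleanup as in the main code: drop \r and \n, then trim the ends
clean <- function(x) trimws(str_replace_all(x, "[\r\n]", ""))

# Assumed timestamp layout: YYYY-MM-DD HH:MM
stamp <- "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}"

text_time  <- str_extract(clean(raw_text), stamp)
video_time <- str_extract(clean(raw_video), stamp)   # trailing source removed

news_day  <- substring(text_time, 1, 10)    # date part
news_hour <- substring(text_time, 12, 16)   # time part

print(c(text_time, video_time, news_day, news_hour))
```

If the site uses a different timestamp layout, the pattern in stamp and the substring positions would need to change together.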

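Note4:

Most important of all: there are Warnings, but they don't matter, because I can't fix them anyway, and I'm happy the results came out. My first scraper ever, and it worked!

Finally, many thanks to the Bilibili uploader; I hope I haven't infringed on anyone's rights. That's all!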

3. Odd bits of knowledge gained

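pacman::p_load(XML,rvest,jiebaR,dplyr,stringr): loads several packages in one call, covering both install and library;

read_html(): reads an HTML document (web page);

html_nodes(): selects the elements (nodes) of the document that match a selector;

html_text(): extracts the text inside a tag;

html_attrs(): extracts the attribute names and their values;

trimws(): removes leading and trailing whitespace from a string;

str_replace_all(): replaces every match of a pattern; in this post it removes carriage returns \r and newlines \n;

is.character(): tests whether an object is of character type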
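One thing worth checking about html_attrs(): it returns every attribute of a node, so taking element [1], as the main code does, only gives the link when href happens to come first. A minimal sketch on made-up markup, where a class attribute is placed before href on purpose; html_attr() (singular) picks an attribute by name instead.

```r
library(rvest)

# Hypothetical link whose first attribute is class, not href
page <- minimal_html(
  '<h2><a class="title" href="newsDetail_forward_1">Headline</a></h2>'
)
node <- page %>% html_nodes("h2 a")

attrs <- html_attrs(node)     # list with one named character vector per node
print(attrs[[1]])             # shows both class and href

href <- html_attr(node, "href")                      # select the attribute by name
full_url <- paste0("https://www.thepaper.cn/", href)
print(full_url)
```

Whether the real pages put other attributes before href is something to verify in the developer tools; if they do, html_attr(news, "href") would be the safer way to build link1.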