A Data Analysis of the Trending Weibo Post in Which Teacher Chen Picks a Fight with Sister Zhiling

Original link: http://www.tbk.ren/article/256.html?source=csdn

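Last night, for reasons unknown, Teacher Chen suddenly lashed out at everyone's goddess, Sister Zhiling, on Weibo. Netizens flocked to watch, and the story about Wang Feng's ex-wife and drug use, which had been all the rage for the past few days, dropped out of sight at once. Poor Wang Feng had finally made the headlines, only to be yanked right back off.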

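Of course, once the post blew up, Teacher Chen deleted it and left only this one. All of the analysis below is based on the comments on this remaining post.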

Right, that is the intro done, so let's get straight to the point: how can a tech geek use R to dig into the data behind this trending event?

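The first step, of course, is to collect the data. I run a website that has passed the review of the Sina Weibo open platform, so I can fetch data through the Weibo API. The API does not return much, though: only 40 pages of comments, at roughly 50 comments per page, which comes to about 2,000 comments. For someone like Teacher Chen, with followers in the tens of millions, that is a drop in the bucket, but it is enough for a simple analysis.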

library(XML);     # loaded in the original script; not used below
library(RCurl);
library(RJSONIO);

page <- 1;
times <- 3;       # run number, used only in the output file names
GoOn <- TRUE;
sleepTime <- 1;   # seconds to wait between requests

# Return NA for a missing field, so that c() does not silently drop it
# and shift the remaining values into the wrong columns
orNA <- function(x) if (is.null(x)) NA else x

# write.csv does not create the output directory by itself
dir.create("data", showWarnings = FALSE);

while (GoOn) {
  url <- paste(
    "https://api.weibo.com/2/comments/show.json?",
    "id=4001968182199220&",
    "page=", page, "&",
    "access_token=<the approved token goes here; it is private, so it is not shown>",
    sep = ""
  );
  print(url)

  commentJSONString <- getURL(
    url,
    .opts = list(ssl.verifypeer = FALSE)  # skips SSL certificate verification
  );
  commentJSON <- fromJSON(commentJSONString);

  len <- length(commentJSON$comments)
  print(len)

  if (len == 0) {
    # Empty page: wait one second longer before each retry,
    # and give up once the wait exceeds 10 seconds
    print("Need to take a short break")
    sleepTime <- sleepTime + 1;
    if (sleepTime > 10) {
      GoOn <- FALSE;
    }
  } else {
    # One row per comment: user attributes followed by the comment text
    result <- data.frame(
      id = NA, gender = NA,
      followers_count = NA, friends_count = NA, pagefriends_count = NA,
      statuses_count = NA, favourites_count = NA, created_at = NA,
      verified = NA, verified_type = NA, verified_reason = NA, verified_trade = NA,
      lang = NA, urank = NA, screen_name = NA, name = NA,
      location = NA, description = NA, text = NA
    );

    for (i in seq_len(len)) {
      user <- commentJSON$comments[[i]]$user
      result[i, ] <- c(
        orNA(user$idstr),
        orNA(user$gender),
        orNA(user$followers_count),
        orNA(user$friends_count),
        orNA(user$pagefriends_count),
        orNA(user$statuses_count),
        orNA(user$favourites_count),
        orNA(user$created_at),
        orNA(user$verified),
        orNA(user$verified_type),
        orNA(user$verified_reason),
        orNA(user$verified_trade),
        orNA(user$lang),
        orNA(user$urank),
        orNA(user$screen_name),
        orNA(user$name),
        orNA(user$location),
        orNA(user$description),
        orNA(commentJSON$comments[[i]]$text)
      )
    }

    # write.csv always writes a header row; it ignores col.names with a warning,
    # so that argument is dropped here
    write.csv(
      result, row.names = FALSE, fileEncoding = "UTF-8",
      file = paste("data/result_", times, "_", page, ".txt", sep = "")
    );

    page <- page + 1;
  }

  Sys.sleep(sleepTime);
}
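Because the access token is private, one option is to keep it out of the script entirely. The following is only a sketch of that idea: the helper name `buildCommentsUrl` and the environment variable name `WEIBO_ACCESS_TOKEN` are assumptions made for illustration, not part of the original script; the endpoint and post id are the ones used above.

```r
# Hypothetical helper: build the request URL for one page of comments.
# The token is read from an environment variable (the variable name is an assumption)
# instead of being written into the source file.
buildCommentsUrl <- function(page, token = Sys.getenv("WEIBO_ACCESS_TOKEN")) {
  paste(
    "https://api.weibo.com/2/comments/show.json?",
    "id=4001968182199220&",
    "page=", page, "&",
    "access_token=", token,
    sep = ""
  )
}

# Example call with a dummy token
print(buildCommentsUrl(2, token = "abc"))
```

Inside the loop, `url <- buildCommentsUrl(page)` would then take the place of the inline `paste(...)` call.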

Once the scraping is done, we have a set of attributes for each commenting user, plus the text of each comment.
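The analysis code further down works on a data frame called `d`, but the step that builds it from the per-page files is not shown in the post. A minimal sketch of how that could be done, assuming the files sit in the `data` directory and follow the `result_<times>_<page>.txt` naming used by the scraper (the helper name `readComments` is hypothetical):

```r
# Hypothetical helper: read every per-page file written by the scraper
# and stack the pages into a single data frame.
readComments <- function(dir = "data") {
  files <- list.files(dir, pattern = "^result_.*\\.txt$", full.names = TRUE)
  pages <- lapply(
    files, read.csv,
    fileEncoding = "UTF-8", stringsAsFactors = FALSE
  )
  # rbind of an empty list gives NULL, so a missing directory yields d = NULL
  do.call(rbind, pages)
}

d <- readComments("data")
```

Reading the files back with `read.csv` also turns columns such as `urank` and `followers_count` into numbers again, since the scraper stores every value as text.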

Since these are comments, the first thing to do is, naturally, a word cloud.

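library(tm)
library(Rwordseg)
library(wordcloud)

# d is assumed to hold the scraped comments (all pages combined into one data frame)

# Install the celebrity-name dictionary (.scel file) so that names are segmented as single words
installDict('明星【官方推荐】.scel', '明星')

contentCorpus <- Corpus(VectorSource(na.omit(d$text)))
contentCorpus <- tm_map(contentCorpus, stripWhitespace)
# Segment each comment into space-separated words
contentCorpus <- tm_map(contentCorpus, content_transformer(segmentCN), returnType='tm')

# Workaround for tm's tokenization bug with Chinese text: split on whitespace only
tokenizer <- function(x) {
  unlist(strsplit(x$content, '[[:space:]]+'))
}

tdm <- TermDocumentMatrix(
  contentCorpus,
  control = list(
    wordLengths = c(1, Inf),
    tokenize = tokenizer
  )
)

# Convert to a plain term-by-document matrix
tdm <- as.matrix(tdm)

v <- sort(rowSums(tdm), decreasing = TRUE)
# Stored as wordFreq rather than d, so the comment data frame d is not overwritten
wordFreq <- data.frame(word = names(v), freq = v)
wordFreq <- head(wordFreq, 300)   # keep the 300 most frequent words

wordcloud(
  wordFreq$word,
  wordFreq$freq,
  min.freq = 2,
  random.order = FALSE,
  colors = rainbow(nrow(wordFreq))
)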

Running this code produces the following word cloud:

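As you can see, netizens' unanimous verdict on Teacher Chen's unprovoked attack is "你, 的, 不, 是" ("you", "of", "not", "is"); in other words, the most frequent words are just everyday function words. Having seen the comments, Teacher Chen also replied to the netizens' concern: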

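Well, Teacher Chen has heard what everyone thinks. Like it or not, Teacher Chen is still the Teacher Chen of old, and still the Teacher Chen with over twenty million followers.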
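Going back to the word cloud for a moment: since it is dominated by function words, one common remedy is to drop stop words from the frequency vector before plotting. The sketch below uses a toy frequency vector with made-up counts and a deliberately tiny stop-word list; both are assumptions for illustration, and `v` stands in for the vector produced by `sort(rowSums(tdm), decreasing = TRUE)`.

```r
# Toy stand-in for v (named vector of term frequencies); the counts are made up.
v <- c("你" = 120, "的" = 110, "不" = 90, "是" = 85, "女神" = 40, "志玲" = 25)

# Hypothetical stop-word list; a real one would be much longer.
stopWords <- c("你", "的", "不", "是")

# Keep only the terms that are not stop words.
vFiltered <- v[!(names(v) %in% stopWords)]
print(vFiltered)
```

The filtered vector can then be turned into the word/frequency data frame and passed to `wordcloud` in the same way as before.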

Next, let's look at who these netizens are.

genderTable <- prop.table(table(d$gender))

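Women account for a full 65% of the commenters, which is quite a surprise. Is it because Sister Zhiling appeals to men and women alike, or because Teacher Chen is as charming as ever and still has a crowd of female photography-enthusiast fans...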

Next, the regional distribution of the netizens:

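# Column "1" is not created in the code shown; presumably it holds the province part of location.
# Rows whose value is "其他" ("other") are excluded before computing the shares.
locationTable <- prop.table(table(d[, "1"][d[, "1"]!="其他"]))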

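No surprise here: the Guangdong folks ("港东银") top the list, as you would expect from a region so close to where Teacher Chen got his start.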
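The column `"1"` used for the regional breakdown is not created anywhere in the code shown. One plausible way to get a province column, assuming the `location` field has the form "province city" separated by a space, is sketched below with toy values (not real data):

```r
# Toy stand-in for d$location; values follow the assumed "province city" layout.
location <- c("广东 广州", "北京 朝阳区", "其他", "广东 深圳", NA)

# Keep the part before the first space, i.e. the assumed province.
province <- sub(" .*$", "", location)

# Drop missing values and the catch-all "其他" ("other") before computing shares.
province <- province[!is.na(province) & province != "其他"]
locationTable <- sort(prop.table(table(province)), decreasing = TRUE)
print(locationTable)
```

Sorting the table in decreasing order puts the region with the largest share first, which is convenient for reading off the top of the list.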

Next, let's look at how many of the users are verified (the "V" badge).

verifiedTable <- prop.table(table(d$verified))

Teacher Chen clearly has pulling power: 1.5% of the users who commented on the post are verified.

Finally, the users' Weibo levels:

hist(as.numeric(d$urank), main = "User level", xlab = "User level", freq = FALSE, ylab = "Share")

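As we can see, the user levels are close to a normal distribution, which suggests that few paid posters ("water army" accounts) joined this topic. Teacher Chen really is the real deal: when you are out there making a name for yourself, you do not need to bring a water army.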
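"Close to normal" is judged by eye from the histogram. A rough way to check it is a Q-Q plot together with a Shapiro-Wilk test; the sketch below runs on simulated levels (an assumption, not the real `d$urank`). Because levels are whole numbers with many ties, the test can reject normality even when the histogram looks bell-shaped, so the Q-Q plot is the more forgiving of the two.

```r
# Toy stand-in for d$urank so the snippet runs on its own (simulated, not real data).
set.seed(42)
urank <- pmax(0, round(rnorm(2000, mean = 13, sd = 4)))

# Q-Q plot: points close to the line indicate an approximately normal shape.
qqnorm(urank, main = "Normal Q-Q plot of user level")
qqline(urank)

# Shapiro-Wilk test (accepts 3 to 5000 values); a small p-value argues against normality.
normalityTest <- shapiro.test(urank)
print(normalityTest$p.value)
```

Neither check says anything about paid posters directly; they only describe the shape of the level distribution.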

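PS: My own Weibo level is only 14. I can't be bothered to follow many people, so I'm stuck at level 14, which is why it is perfectly normal for levels 12 to 14 to be so common.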