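Analyzing the scripts of the Avengers films to see which words each hero says most

Compiled by: Mika
Produced by the CDA Data Analysis Research Institute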

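This week Avengers: Endgame opened in China and set a new box-office record for imported films there: its cumulative gross passed 2.2 billion yuan in 4 days, making it the fastest film ever to reach 2 billion yuan in mainland China.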

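As the 22nd film in the Marvel Cinematic Universe, and the payoff of more than ten years of planning, it has won over many viewers, who say its 3-hour story is gripping throughout, with no dull stretch, and full of laughter and tears.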

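Many fans said: "Resonance, emotion, reluctance to let go, and nostalgia: every feeling held back for so long is finally released in this moment." Many others wrote: "Thank you, Marvel, for giving my youth the most perfect ending."

As a casual Marvel fan, C-jun also looked back at the earlier Marvel films. For this article we picked the scripts of three representative films in which the heroes' appearances overlap the most, The Avengers, Avengers: Age of Ultron, and Captain America: Civil War, and analyzed the words each hero likes to say most.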

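Analysis results

First, the conclusions:

Iron Man: I am Iron Man

As a founding hero of the Marvel Cinematic Universe, Iron Man is firmly at center stage. In these films, the word he says most is "Jarvis", the name of his AI butler.

In The Avengers, Captain America asks him: take off the suit, and what are you? Iron Man answers: genius, billionaire, playboy, philanthropist. From Iron Man in 2008, which launched the Marvel universe, to Avengers: Endgame in 2019, he grows step by step from a playboy into a leader of the Avengers. All that is left to say is: Iron Man, I love you 3000.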

Captain America: the Iron Man and Cap bond

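Captain America is not only a core member of the Avengers but also a source of inspiration for fans. He has no high-tech gear like Iron Man's (apart from his shield) and none of Thor's divine bloodline, yet he still serves as the Avengers' captain and leader, which says enough about his strength.

Captain America especially likes calling other heroes by name. In fact, we found that the one he calls most is Iron Man, followed by Sam and Strucker. Of the 5 words he says most, 4 are names; the fifth is "suit", which presumably refers to Iron Man's armor. A clear sign of the bond between the two.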

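Thor: the far-sighted man of action

Setting aside the Thor films (Thor: Ragnarok and Thor: The Dark World), the words Thor uses most, compared with the other heroes, suggest that he is action-oriented and also far-sighted.

Thor also says Loki's name a great deal. In addition, he focuses on the tangible objects that drive the story forward, such as Loki's scepter, the Tesseract, and the Mind Stone, which often show up among his words.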

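Loki: the power-chasing god of mischief

Loki craves power, and the words he says most are ones like "power" and "change". Although Thor mentions Loki more than any other Avenger does, Loki mentions his brother Thor far less often; his attention is clearly elsewhere, on power.

Loki is no upright figure like Captain America, yet he is one of the most beloved characters in the MCU, with countless fans. The reason is simple: he is relatable. He has no grand ideal of saving the world and cares only about his own interests, and his roguish charm, cunning, scheming, and trickery are things no other superhero has.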

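Spider-Man: still just a kid

As the undisputed cute one of the Avengers, young Spider-Man uses rather childlike words. The ones he says most are fillers such as "hey", "uh", and "um".

Spider-Man starts out with big ambitions, wanting to shoulder the job of saving the world, then gives up after finding that his abilities fall short, and finally, starting from "with great power comes great responsibility", comes to understand himself again and see where he stands. That is his growth into a superhero, and it is also what makes him endearing.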

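Black Panther: Wakanda forever

As the character develops, Black Panther's lines change too. This superhero first appears in Captain America: Civil War and has grown considerably by the time of his later solo film, Black Panther. The words he says most are mainly "father", "friend", and "king": truly a monarch devoted to his country.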

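Hawkeye and Hulk: everyone loves Black Widow

The name that both Hawkeye and the Hulk mention most is Black Widow.

Hawkeye and Black Widow are the only Avengers without superpowers. Both started out as agents, with similar early lives and similar feelings about them, so the two are very close and often look out for each other in battle. As for the Hulk, Black Widow is the only person who can bring him back to normal when he rages, so her importance to him goes without saying.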

Vision and Scarlet Witch: soulmates

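Vision and Scarlet Witch have a lot in common. What they talk about is very similar, and both especially tend toward topics of "fear" and "worry".

Sentiment analysis shows that Vision has the most lines with negative sentiment. This may be because Vision, as a super artificial intelligence, can see "images" that other heroes cannot, and worry about the future may be troubling him.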

The visualization process

Data visualization expert Matt Winn used R's ggplot2 package to visualize the analysis results.

First, the R packages we need:

library(dplyr)
library(grid)
library(gridExtra)
library(ggplot2)
library(reshape2)
library(cowplot)
library(jpeg)
library(extrafont)

Clear everything from the R workspace

rm(list = ls())

The folder that contains all the images

dir_images <- "C:\\Users\\Matt\\Documents\\R\\Avengers"
setwd(dir_images)

Set the font

windowsFonts(Franklin = windowsFont("Franklin Gothic Demi"))

Short names for the heroes

character_names <- c("black_panther","black_widow","bucky","captain_america","falcon","hawkeye","hulk","iron_man","loki","nick_fury","rhodey","scarlet_witch","spiderman","thor","ultron","vision")
image_filenames <- paste0(character_names, ".jpg")

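A function that reads the image file for a hero, given its file name

read_image <- function(filename) {
  # hero short name derived from the file name (not used further here)
  char_name <- gsub(pattern = "\\.jpg$", "", filename)
  img <- jpeg::readJPEG(filename)
  return(img)
}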

Read all the images into a list

all_images <- lapply(image_filenames, read_image)

Assign names to the image list so that it can be indexed by character

names(all_images) <- character_names
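
The naming step is what makes lookups like `all_images[["vision"]]` possible. Here is a minimal base-R sketch of the same pattern; it does not read any files, `toupper()` is only a stand-in for the image reader, and every `*_demo` name is made up for illustration.

```r
# Made-up short names; toupper() stands in for jpeg::readJPEG()
names_demo <- c("thor", "loki")
files_demo <- paste0(names_demo, ".jpg")

# lapply() returns an unnamed list, in the same order as its input
loaded_demo <- lapply(files_demo, toupper)

# Naming the list lets us index it by character instead of by position
names(loaded_demo) <- names_demo
print(loaded_demo[["loki"]])
```

Because `lapply()` preserves input order, assigning the names afterwards pairs each result with the right character.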

Here is a simple example that uses these names

# clear the plot window
grid.newpage()
# draw to the plot window
grid.draw(rasterGrob(all_images[["vision"]]))

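Get the text data

The text data used here was collected by computer scientist Elle O'Brien, who applied text-mining analysis to the movie scripts. The code below assumes the data is already loaded as a data frame named word_data with the columns Speaker, word, and amount.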

Fix the capitalization of proper nouns

capitalize <- Vectorize(function(string) {
  # upper-case the first letter only
  substr(string, 1, 1) <- toupper(substr(string, 1, 1))
  return(string)
})

proper_noun_list <- c("clint", "hydra", "steve", "tony", "sam", "stark", "strucker", "nat", "natasha", "hulk", "tesseract", "vision", "loki", "avengers", "rogers", "cap", "hill")

# Run the capitalization function
word_data <- word_data %>%
  mutate(word = ifelse(word %in% proper_noun_list, capitalize(word), word)) %>%
  mutate(word = ifelse(word == "jarvis", "JARVIS", word))
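
To check how the capitalization step behaves, it can be tried on a tiny vector. The sketch below uses base R only, and `capitalize_demo`, `words_demo`, and `proper_demo` are made-up names and values for illustration, not part of the original data.

```r
# Upper-case the first letter; substr<- and toupper() are both vectorized
capitalize_demo <- function(string) {
  substr(string, 1, 1) <- toupper(substr(string, 1, 1))
  string
}

words_demo  <- c("tony", "suit", "jarvis")   # made-up sample words
proper_demo <- c("tony", "stark")            # made-up proper-noun list

# Capitalize only the words found in the proper-noun list
fixed_demo <- ifelse(words_demo %in% proper_demo,
                     capitalize_demo(words_demo), words_demo)
# Special case: write the acronym in full capitals
fixed_demo <- ifelse(fixed_demo == "jarvis", "JARVIS", fixed_demo)
print(fixed_demo)
```

Ordinary words such as "suit" pass through unchanged, which is why the proper-noun list is needed at all.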

Note that the short character names defined earlier do not match the format of the character names in the text data frame

unique(word_data$Speaker)
##  [1] "Black Panther"   "Black Widow"     "Bucky"
##  [4] "Captain America" "Falcon"          "Hawkeye"
##  [7] "Hulk"            "Iron Man"        "Loki"
## [10] "Nick Fury"       "Rhodey"          "Scarlet Witch"
## [13] "Spiderman"       "Thor"            "Ultron"
## [16] "Vision"

Create a lookup table that converts file names to character names

character_labeler <- c(
  `black_panther` = "Black Panther",
  `black_widow` = "Black Widow",
  `bucky` = "Bucky",
  `captain_america` = "Captain America",
  `falcon` = "Falcon",
  `hawkeye` = "Hawkeye",
  `hulk` = "Hulk",
  `iron_man` = "Iron Man",
  `loki` = "Loki",
  `nick_fury` = "Nick Fury",
  `rhodey` = "Rhodey",
  `scarlet_witch` = "Scarlet Witch",
  `spiderman` = "Spiderman",
  `thor` = "Thor",
  `ultron` = "Ultron",
  `vision` = "Vision"
)

Two different versions of each character name

One is for display and the other is for indexing.

convert_pretty_to_simple <- Vectorize(function(pretty_name) {
  # pretty_name = "Vision"
  simple_name <- names(character_labeler)[character_labeler == pretty_name]
  # simple_name <- as.vector(simple_name)
  return(simple_name)
})
# convert_pretty_to_simple(c("Vision", "Thor"))

# just for fun, the inverse of that function
convert_simple_to_pretty <- function(simple_name) {
  # simple_name = "vision"
  pretty_name <- character_labeler[simple_name] %>% as.vector()
  return(pretty_name)
}

# example
convert_simple_to_pretty(c("vision", "black_panther"))
## [1] "Vision"        "Black Panther"
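
Both conversions rely on a named character vector acting as a small dictionary that can be read in either direction. Here is a minimal base-R sketch with a toy two-entry table; `labeler_demo` is a made-up stand-in for `character_labeler`, and `match()` is a swapped-in alternative to the `Vectorize()` approach above, not the author's method.

```r
# Toy lookup table: names are the short form, values are the display form
labeler_demo <- c(vision = "Vision", black_panther = "Black Panther")

# simple -> pretty: index by name, then drop the names
pretty_demo <- unname(labeler_demo[c("vision", "black_panther")])

# pretty -> simple: find each value's position, then read off the names
simple_demo <- names(labeler_demo)[match(c("Black Panther", "Vision"), labeler_demo)]

print(pretty_demo)
print(simple_demo)
```

Since `match()` already works element by element on a vector, this variant does not need `Vectorize()`, and it returns a plain unnamed character vector.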

Add the short character names to the text data frame

word_data$character <- convert_pretty_to_simple(word_data$Speaker)

Assign a main color to each character

character_palette <- c(
  `black_panther` = "#51473E",
  `black_widow` = "#89B9CD",
  `bucky` = "#6F7279",
  `captain_america` = "#475D6A",
  `falcon` = "#863C43",
  `hawkeye` = "#84707F",
  `hulk` = "#5F5F3F",
  `iron_man` = "#9C2728",
  `loki` = "#3D5C25",
  `nick_fury` = "#838E86",
  `rhodey` = "#38454E",
  `scarlet_witch` = "#620E1B",
  `spiderman` = "#A23A37",
  `thor` = "#323D41",
  `ultron` = "#64727D",
  `vision` = "#81414F"
)

Make a horizontal bar chart

avengers_bar_plot <- word_data %>%
  group_by(Speaker) %>%
  top_n(5, amount) %>%
  ungroup() %>%
  mutate(word = reorder(word, amount)) %>%
  ggplot(aes(x = word, y = amount, fill = character)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_manual(values = character_palette) +
  scale_y_continuous(name = "Log Odds of Word", breaks = c(0, 1, 2)) +
  theme(
    text = element_text(family = "Franklin"),
    # axis.title.x = element_text(size = rel(1.5)),
    panel.grid = element_line(colour = NULL),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", colour = "white")
  ) +
  # theme(strip.text.x = element_text(size = rel(1.5))) +
  xlab("") +
  coord_flip() +
  facet_wrap(~Speaker, scales = "free_y")
avengers_bar_plot

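That already looks good, but it is not enough. I want to put each character's picture into the plot, showing the image only within the bar area and cutting it off at the end of each bar. To do this, we make the bar itself transparent, then draw a white bar from the bar's end point out to the edge of the plot to cover the rest of the image.

In the data frame, we now pad each value with a remainder that brings it up to the overall maximum. When a value and its remainder are stacked, every pair adds up to the same total, so all the stacked bars have the same length.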

max_amount <- max(word_data$amount)
word_data$remainder <- (max_amount - word_data$amount) + 0.2
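
A quick worked example makes the padding concrete. The values below are made up for illustration and are not taken from the real data; the `0.2` is the same extra margin used in the code above.

```r
# Made-up log-odds values for three words
amount_demo <- c(2.0, 1.2, 0.5)
padding_demo <- 0.2

# Each remainder fills the gap up to the largest value, plus the margin
remainder_demo <- (max(amount_demo) - amount_demo) + padding_demo

# Stacking value + remainder gives the same total for every word
total_demo <- amount_demo + remainder_demo
print(remainder_demo)
print(total_demo)
```

Every total equals the maximum plus the margin, so the white "remainder" pieces always end at the same place regardless of how long the visible bar is.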

Keep only each character's top 5 words, ranked by amount

word_data_top5 <- word_data %>%
  group_by(character) %>%
  arrange(desc(amount)) %>%
  slice(1:5) %>%
  ungroup()

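Convert the value "amount" and the remainder "remainder" to long format

This ensures that each word has two entries for its character: one is the actual value, and the other, the remainder, runs from the end of that value out to the shared maximum.

It collapses the value ("amount") and the remainder ("remainder") into a single column named "variable" that says which of the two it is, while another column, "value", holds the number for each.

word_data_top5_m <- melt(word_data_top5, measure.vars = c("amount", "remainder"))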

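Now we put these bar pieces into an ordered factor, in the reverse of the order produced by the melt. Otherwise "amount" and "remainder" would appear in the plot in the opposite order.

word_data_top5_m$variable2 <- factor(word_data_top5_m$variable,
                                     levels = rev(levels(word_data_top5_m$variable)))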
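
To see what the long format and the reversed factor look like, here is a small base-R sketch that builds the long table by hand for two made-up words. It deliberately avoids reshape2 so it runs anywhere; the `*_demo` names and numbers are assumptions for illustration, and the hand-built table only mirrors the column layout that `melt()` produces.

```r
# Made-up wide data: one row per word
wide_demo <- data.frame(word = c("Jarvis", "suit"),
                        amount = c(2.0, 1.5),
                        remainder = c(0.2, 0.7))

# Long format by hand: one row per (word, variable) pair
long_demo <- data.frame(
  word = rep(wide_demo$word, times = 2),
  variable = factor(rep(c("amount", "remainder"), each = nrow(wide_demo)),
                    levels = c("amount", "remainder")),
  value = c(wide_demo$amount, wide_demo$remainder)
)

# Same data, but with the level order reversed
long_demo$variable2 <- factor(long_demo$variable,
                              levels = rev(levels(long_demo$variable)))

print(levels(long_demo$variable))
print(levels(long_demo$variable2))
```

Only the level order changes; the values in each row stay the same. ggplot2 uses that level order to decide how the two pieces of each bar are stacked.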

Plot only the top 5 words for each character

Note the form of the character name: the short name is "black_panther", not "Black Panther".

plot_char <- function(character_name) {
  # example: character_name = "black_panther"

  # plot details that we might want to fiddle with
  # thickness of lines between bars
  bar_outline_size <- 0.5
  # transparency of lines between bars
  bar_outline_alpha <- 0.25

  # The function takes the simple character name,
  # but here, we convert it to the pretty name,
  # because we'll want to use that on the plot.
  pretty_character_name <- convert_simple_to_pretty(character_name)

  # Get the image for this character,
  # from the list of all images.
  temp_image <- all_images[character_name]

  # Make a data frame for only this character
  temp_data <- word_data_top5_m %>%
    dplyr::filter(character == character_name) %>%
    mutate(character = character_name)

  # order the words by frequency
  # First, make an ordered vector of the most common words
  # for this character
  ordered_words <- temp_data %>%
    mutate(word = as.character(word)) %>%
    dplyr::filter(variable == "amount") %>%
    arrange(value) %>%
    `[[`(., "word")

  # order the words in a factor,
  # so that they plot in this order,
  # rather than alphabetical order
  temp_data$word = factor(temp_data$word, levels = ordered_words)

  # Get the max value,
  # so that the image scales out to the end of the longest bar
  max_value <- max(temp_data$value)

  # "amount" bars are transparent so the image shows through;
  # "remainder" bars are white so they mask the rest of the image.
  # (The original named the second entry `value`, which matches no level;
  # recent ggplot2 versions would then fill the "amount" bars with grey.)
  fill_colors <- c(`remainder` = "white", `amount` = "transparent")

  # Make a grid object out of the character's image
  character_image <- rasterGrob(all_images[[character_name]],
                                width = unit(1, "npc"), height = unit(1, "npc"))

  # make the plot for this character
  output_plot <- ggplot(temp_data) +
    aes(x = word, y = value, fill = variable2) +
    # add image
    # draw it completely bottom to top (x),
    # and completely from left to the maximum log-odds value (y)
    # note that x and y are flipped here,
    # in prep for the coord_flip()
    annotation_custom(character_image,
                      xmin = -Inf, xmax = Inf, ymin = 0, ymax = max_value) +
    geom_bar(stat = "identity", color = alpha("white", bar_outline_alpha),
             size = bar_outline_size, width = 1) +
    scale_fill_manual(values = fill_colors) +
    theme_classic() +
    coord_flip(expand = FALSE) +
    # use a facet strip,
    # to serve as a title, but with color
    facet_grid(. ~ character, labeller = labeller(character = character_labeler)) +
    # figure out color swatch for the facet strip fill
    # using character name to index the color palette
    # color = NA means there's no outline color.
    theme(strip.background = element_rect(fill = character_palette[character_name],
                                          color = NA)) +
    # other theme elements
    theme(strip.text.x = element_text(size = rel(1.15), color = "white"),
          text = element_text(family = "Franklin"),
          legend.position = "none",
          panel.grid = element_blank(),
          axis.text.x = element_text(size = rel(0.8))) +
    # omit the axis title for the individual plot,
    # because we'll have one for the entire ensemble
    theme(axis.title = element_blank())

  return(output_plot)
}

The x-axis title, shared by all characters

plot_x_axis_text <- paste("Tendency to use this word more than other characters do",
                          "(units of log odds ratio)", sep = "\n")

See how the function works for a single character

sample_plot <- plot_char("black_panther") +
  theme(axis.title = element_text()) +
  # x lab is still declared as y lab
  # because of coord_flip()
  ylab(plot_x_axis_text)
sample_plot

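Why do we use the log odds ratio for the x-axis?

Because odds ratios grow very quickly as values get larger (the math is omitted here), taking the log of the odds ratio keeps the range of values shown on screen manageable.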
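
Since the formula is omitted, here is a rough sketch of one common definition: a plain, unweighted log odds ratio comparing how often one character says a word with how often everyone else does. This is an assumption for illustration and may differ from how the `amount` column was actually computed; all counts are made up.

```r
# Made-up counts: uses of one word / total words spoken
count_char <- 30; total_char <- 1000   # this character
count_rest <- 20; total_rest <- 9000   # all other characters

# Odds = p / (1 - p), where p is the share of words that are this word
p_char <- count_char / total_char
p_rest <- count_rest / total_rest
odds_char <- p_char / (1 - p_char)
odds_rest <- p_rest / (1 - p_rest)

odds_ratio <- odds_char / odds_rest
log_odds_ratio <- log(odds_ratio)
print(c(odds_ratio = odds_ratio, log_odds_ratio = log_odds_ratio))

# The log compresses the scale: equal ratios become equal steps
ratios_demo <- c(1, 10, 100, 1000)
print(log(ratios_demo))
```

A value above 0 means the character uses the word more than the others do. In the second part, the odds ratio grows a thousandfold while its log grows in equal steps, which is the compression described above.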

If you want to turn log odds into a probability on a linear scale, this function does it:

logit2prob <- function(logit) {
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}

So the horizontal axis would become:

logit2prob(seq(0, 2.5, 0.5))
## [1] 0.5000000 0.6224593 0.7310586 0.8175745 0.8807971 0.9241418

Note that the differences between consecutive terms in the sequence shrink steadily:

diff(logit2prob(seq(0, 2.5, 0.5)))
## [1] 0.12245933 0.10859925 0.08651590 0.06322260 0.04334474

OK, now we have one plot done...

Let's apply the function to the list of all characters and collect the results in a single list object.

all_plots <- lapply(character_names, plot_char)

Extract the axis title from a plot

get_axis_grob <- function(plot_to_pick, which_axis) {
  # plot_to_pick <- sample_plot
  tmp <- ggplot_gtable(ggplot_build(plot_to_pick))
  # tmp$grobs
  # find the grob that looks like
  # it would be the x axis
  axis_x_index <- which(sapply(tmp$grobs, function(x) {
    # for all the grobs,
    # return the index of the one
    # where you can find the text
    # "axis.title.x" or "axis.title.y"
    # based on input argument `which_axis`
    grepl(paste0("axis.title.", which_axis), x)
  }))
  axis_grob <- tmp$grobs[[axis_x_index]]
  return(axis_grob)
}

Extract the axis title grobs

px_axis_x <- get_axis_grob(sample_plot, "x")
px_axis_y <- get_axis_grob(sample_plot, "y")

Here is how to use the extracted axis title:

grid.newpage()
grid.draw(px_axis_x)

# grid.draw(px_axis_y)

Arrange all these plots into one object

big_plot <- arrangeGrob(grobs = all_plots)

Take the big collection of plots and put the x-axis title underneath.

big_plot_w_x_axis_title <- arrangeGrob(big_plot, px_axis_x, heights = c(10, 1))
grid.newpage()
grid.draw(big_plot_w_x_axis_title)

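Because the words differ in length, the plots take up slightly different amounts of space on the page, which looks a bit messy.

Normally we would use facet_grid() or facet_wrap() to keep everything neat and aligned while plotting, but that no longer works in this project, because each plot has its own custom background image.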

Use cowplot instead of arrangeGrob so that the plots' axes align vertically

big_plot_aligned <- cowplot::plot_grid(plotlist = all_plots, align = 'v', nrow = 4)

As before, add the x-axis title below the aligned plots

big_plot_w_x_axis_title_aligned <- arrangeGrob(big_plot_aligned, px_axis_x, heights = c(10, 1))

The final product

grid.newpage()
grid.draw(big_plot_w_x_axis_title_aligned)

Finally, save it

ggsave(filename = "Avengers_Word_Usage.png", plot = big_plot_w_x_axis_title_aligned, width = 12, height = 6.3)