Scraping web source code with Python

2020 sent more bad news as Black Panther star Chadwick Boseman passed away aged 43. The response from the public and the industry has been enormous: the family’s announcement is now the most liked tweet ever, stars have paid tribute, and fans are creating art.

Black Panther and the rest of the Marvel Cinematic Universe (MCU) made a mark in popular culture. It’s heart-warming to think that these imaginary superheroes also make a positive impact on real people’s lives.

Apart from the standalone, critically acclaimed Black Panther film in 2018, King T’Challa was also an Avenger who made notable appearances in Captain America: Civil War, Infinity War, and Endgame. In this blog post, I adapt methods from my previous Survivor article to look at the dynamics between Avengers characters.

How did the dynamics between Avengers characters change over the films? I will be looking at who mentions who in dialogue, but also at non-verbal interactions between characters. They are action movies after all!

I will use Infinity War (the first Avengers movie I saw!) as an example to walk through the process. Then I move on to look at how relationships evolved over five films: The Avengers (2012), Age of Ultron (2015), Civil War (2016), Infinity War (2018) and Endgame (2019). The steps are:

  1. Find and scrape data
  2. Explore and prep data
  3. Extract entities
  4. Plot and analyse network graphs
  5. Draw insights

I will be leveraging some of my previous code and working in a Jupyter notebook. Key packages include BeautifulSoup for parsing, spaCy for entity extraction, and networkx for visualising. As always, any comments on the thinking, coding and analysis are welcome!

Want to skip ahead? Here is the final code. And awayyyyy we go!

1. Find and scrape data

The MCU movies are obviously hugely popular, so it wasn’t difficult to find resources online. With Google’s help, I found two sources: 1. PDFs on Script Slug, and 2. wiki pages on Fandom.

There are good Python libraries that can extract text from PDFs (some detailed in this article by Mate Pocs). However, I’ve found it easier to work with text on a web page, especially if it doesn’t involve a trade-off with reliability. Plus, at the time of writing, Script Slug seems to be missing some key MCU scripts!

The webpage looks like this, with Chrome’s HTML inspector on the right:

Fandom.com is full of useful information, including full, community-contributed scripts.

The content of the script appears to sit neatly in one section (or one ‘div’). With the help of BeautifulSoup, it didn’t take much work to extract the text.
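The original snippet is not reproduced here, so the following is only a minimal sketch of the idea: parse the page and pull the text out of the one div that holds the script. The class name `mw-parser-output` and the helper name `extract_script_text` are assumptions for illustration; check the real class in the HTML inspector. The sample HTML is made up so that the example runs without a network call.

```python
from bs4 import BeautifulSoup


def extract_script_text(html, div_class="mw-parser-output"):
    """Return the text inside the first <div> carrying the given class.

    `div_class` is an assumed value; inspect the page to confirm it.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=div_class)
    if container is None:
        raise ValueError(f"no <div class='{div_class}'> found")
    # One entry per text node, trimmed and joined with newlines.
    return container.get_text(separator="\n", strip=True)


# Made-up stand-in for a downloaded page. In practice the HTML would come
# from something like requests.get(url).text.
sample_html = """
<html><body>
  <div class="mw-parser-output">
    <p>TONY STARK: We need a plan.</p>
    <p>[Thanos lands on Titan.]</p>
  </div>
</body></html>
"""

script_text = extract_script_text(sample_html)
print(script_text)
```

Keeping the parsing in its own function means it can be checked against a saved copy of the page before any live requests are made.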

2. Explore and prep data

Now I’ve got the data. Suppressing the urge to rush straight in, I took a bit of time to examine it and do some sense checks.

I’ve learnt the hard way that it’s important not to skip this step. Data exploration provides understanding and often necessitates changes, corrections… even the need to use a different data source. It’s better to find out sooner rather than later!

In this case, it led me to compile a lookup dictionary for characters (e.g. mapping Iron Man to Tony Stark). It also affirmed that my scraping steps worked as intended and that there’s no need to update the assumptions I’ve made (yet).
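As a rough sketch of what such a lookup dictionary and a simple sense check could look like, the code below maps name variants to one label and counts spoken lines per character. The alias table, the `SPEAKER: line` format and the sample text are assumptions for illustration, not the author's actual data.

```python
import re
from collections import Counter

# Hypothetical alias table: every name variant maps to one canonical label.
ALIASES = {
    "tony": "Tony Stark",
    "tony stark": "Tony Stark",
    "iron man": "Tony Stark",
    "steve": "Steve Rogers",
    "steve rogers": "Steve Rogers",
    "cap": "Steve Rogers",
    "captain america": "Steve Rogers",
}

# Assumed format: a speaker name, a colon, then the spoken line.
SPEAKER_LINE = re.compile(r"^([A-Za-z' .-]+):\s*(.+)$")


def canonical(name):
    """Map a raw speaker name to its canonical label (title-case fallback)."""
    cleaned = name.strip().lower()
    return ALIASES.get(cleaned, name.strip().title())


def count_spoken_lines(script_text):
    """Count spoken lines per canonical character; other lines are ignored."""
    counts = Counter()
    for raw in script_text.splitlines():
        match = SPEAKER_LINE.match(raw.strip())
        if match:
            counts[canonical(match.group(1))] += 1
    return counts


sample_script = "\n".join([
    "TONY STARK: We need a plan.",
    "Iron Man: Suit up.",
    "Cap: On your left.",
    "[Thanos lands on Titan.]",
    "THANOS: Fun isn't something one considers.",
])

line_counts = count_spoken_lines(sample_script)
print(line_counts.most_common())
```

A count like this is a quick way to spot characters that appear under several names, or lines that the pattern fails to pick up.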

3. Extract entities

Here comes (the start of) the fun bit! Taking my approach with Survivor as a starting point, I am going to start by extracting who talks about who.
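To make "who talks about who" concrete, here is a simplified sketch that turns dialogue into directed (speaker, mentioned) edges. Note the swap: instead of spaCy's named entity recognition, it matches names against a small alias table with a regular expression, which keeps the example self-contained. The alias table and the dialogue are made up.

```python
import re
from collections import Counter

# Hypothetical alias table; a real one would come out of data exploration.
ALIASES = {
    "tony stark": "Tony Stark",
    "tony": "Tony Stark",
    "stark": "Tony Stark",
    "thanos": "Thanos",
    "strange": "Stephen Strange",
}


def build_alias_pattern(aliases):
    """Compile one case-insensitive pattern, longest alias first."""
    names = sorted(aliases, key=len, reverse=True)
    joined = "|".join(re.escape(name) for name in names)
    return re.compile(r"\b(" + joined + r")\b", re.IGNORECASE)


def mention_edges(dialogue, aliases):
    """Count directed edges (speaker, mentioned), at most once per line."""
    pattern = build_alias_pattern(aliases)
    edges = Counter()
    for speaker, line in dialogue:
        mentioned = {aliases[m.lower()] for m in pattern.findall(line)}
        mentioned.discard(speaker)  # ignore characters naming themselves
        for target in mentioned:
            edges[(speaker, target)] += 1
    return edges


dialogue = [
    ("Tony Stark", "Where is Thanos? Ask Strange."),
    ("Thanos", "Stark. You have my respect, Tony Stark."),
    ("Peter Parker", "Mr. Stark, I don't feel so good."),
]

edges = mention_edges(dialogue, ALIASES)
for (speaker, target), count in sorted(edges.items()):
    print(f"{speaker} -> {target}: {count}")
```

Plain string matching has obvious limits (the alias "strange" would also match the adjective), which is one reason to prefer a proper entity extractor on real scripts.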

A reminder of how Hulk met the sorcerers (a 30-second clip).

4. Plot and analyse network graphs

Borrowing again from my previous work, I quickly got some graphs together showing who mentioned who during Infinity War (here’s the code snippet).

Dialogue only

The graph on the right is a numerical measure of how ‘central’ the characters are in the network shown on the left. The more a character mentions others, or gets mentioned, the more central they are.
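The article does not say which centrality measure was used. Degree centrality is one plausible reading of "mentions others, or gets mentioned", so the sketch below uses networkx's `degree_centrality` on a small directed graph. The edge list is made up.

```python
import networkx as nx

# Made-up directed mentions: (speaker, character mentioned).
mentions = [
    ("Tony Stark", "Thanos"),
    ("Thanos", "Tony Stark"),
    ("Peter Parker", "Tony Stark"),
    ("Tony Stark", "Peter Parker"),
    ("Thor", "Thanos"),
]

graph = nx.DiGraph()
graph.add_edges_from(mentions)

# For a directed graph this is (in-degree + out-degree) / (n - 1),
# so mentioning and being mentioned both raise the score.
centrality = nx.degree_centrality(graph)
ranking = sorted(centrality, key=centrality.get, reverse=True)

for name in ranking:
    print(f"{name}: {centrality[name]:.2f}")
```

Because in-degree and out-degree are added together, the score of a node in a directed graph can exceed 1.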

The data suggests that Tony Stark is the most ‘important’ man. After all, he is the one who started it all in Iron Man (2008), the debut MCU movie. Thanos, the lead antagonist, is also important. Hulk (Bruce Banner), Gamora, Thor… no big surprises there.

However, I am shocked that Steve Rogers (Captain America) is all the way down in 10th, ranking behind Peter Parker (Spider-Man), Quill (Star-Lord), and even Rocket the raccoon. That doesn’t really seem to reflect the truth!

Remember the code snippet from step 2, explore and prep data? I looked at the number of lines and was surprised that Cap had a meagre 25 ‘lines’ compared to Tony’s 128 and Quill’s 85. It dawned on me that it’s not fair to include only spoken lines: Cap is not as talkative as his colleagues (cough, Tony) but he takes lots of actions.

So I decided to draw a separate graph with entities extracted from descriptions of actions. Unlike dialogue, these interactions have no direction. So if Loki and Thanos fight, the interaction is (Loki, Thanos) and not Loki -> Thanos or Thanos -> Loki. Here’s the code snippet.
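One way to represent such undirected interactions is to sort each pair so that (Loki, Thanos) and (Thanos, Loki) become the same key. The sketch below assumes the entities in each action description have already been extracted. The sample data is made up.

```python
from collections import Counter
from itertools import combinations


def interaction_edges(action_entities):
    """Count undirected pairs of characters appearing in the same action."""
    edges = Counter()
    for entities in action_entities:
        # Sorting the unique names gives each pair a single canonical order.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Made-up entities found in three action descriptions.
actions = [
    ["Loki", "Thanos"],
    ["Thanos", "Loki", "Hulk"],
    ["Hulk"],  # a single character yields no pair
]

pairs = interaction_edges(actions)
for (a, b), count in sorted(pairs.items()):
    print(f"({a}, {b}): {count}")
```

In networkx terms, these pairs would go into an `nx.Graph` rather than the `nx.DiGraph` used for dialogue.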

Non-verbal actions only

Tony, Thanos and Hulk are still leading, but Cap is now a close fourth, which makes much more sense! It’s interesting to see that Wanda (Scarlet Witch) also shot right up from 13th to 5th place, probably because she played a key role in defending Vision and the Mind Stone (spoiler: not a happy ending).

What if we combine dialogue and actions for an overview?

Combined! Dialogue x interactions

Tony maintains his substantial lead as the ‘centre’ of the movie, followed by Hulk, Thanos and Captain America. This is a much more sensible result.

I deliberately added a weighting mechanism in this code snippet, so that one mention has the same importance as one interaction. Note that this means the heights of the yellow and blue graphs can be compared to each other, but not to the green graph.
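The author's weighting code is not shown, so the following is only one possible reading of "one mention has the same importance as one interaction": every mention and every interaction adds the same amount to both characters involved. The function name, the weight parameters and the sample counts are all assumptions.

```python
from collections import Counter


def combined_strength(mention_edges, interaction_edges,
                      mention_weight=1.0, interaction_weight=1.0):
    """Sum weighted mentions and interactions per character.

    With equal weights, one mention counts exactly as much as one interaction.
    """
    strength = Counter()
    for (speaker, target), count in mention_edges.items():
        strength[speaker] += mention_weight * count
        strength[target] += mention_weight * count
    for (a, b), count in interaction_edges.items():
        strength[a] += interaction_weight * count
        strength[b] += interaction_weight * count
    return strength


# Made-up counts: directed mentions and undirected interactions.
sample_mentions = {("Tony Stark", "Thanos"): 3, ("Thanos", "Tony Stark"): 1}
sample_interactions = {("Steve Rogers", "Thanos"): 4, ("Thanos", "Tony Stark"): 2}

strength = combined_strength(sample_mentions, sample_interactions)
for name, score in strength.most_common():
    print(f"{name}: {score}")
```

In this toy data, Steve Rogers has no dialogue edges at all but still scores through his interactions, which is the effect the combined view is after.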

5. Draw insights

Et voilà: I’ve cleaned up the code, worked through kinks like different HTML formatting, and run five movies through the analysis.

I know, it’s a lot to take in. There are so many ways to look at this: individual characters, groups forming, relationships (anyone remember the Hulk-Black Widow romance subplot?)… we’re only limited by our imagination.

Here are my main observations:

  • Tony Stark is the main man, period. He is by far the most important character in 3 of the 5 movies (the exceptions are The Avengers and Civil War, which is technically a Captain America movie anyway).

Tony Stark, the person who had everything to lose and still sacrificed himself (fine, you caught me, Iron Man is my personal favourite)

  • There are more and more named characters as the franchise progresses. The network graphs are very legible for The Avengers and Age of Ultron. It gets a bit intense for Civil War, and by the time we’re at Endgame it’s hard to read.
  • With the exception of Loki in The Avengers, the main antagonist does not get the limelight. Ultron, the AI gone wrong, trails in 9th place. Thanos places third in Infinity War, but drops to 7th in Endgame, just behind Rhodey.
  • Clint Barton (Hawkeye) is not the studio’s favourite… let’s do a count:

The Avengers: he trails along with Black Widow and Hulk

Age of Ultron: behind every Avenger except Rhodey

Civil War: only above Vision (Thor and Loki were mentioned and not actually in the movie)

Infinity War: they didn’t even bother to include him!

Endgame: in a meagre 15th spot despite getting the opening scene

We feel ya, Clint! Hang in there.

To sum it up…

This was a fun extension applying what I learnt analysing Survivor dynamics. I also got to cover different technical ground like web scraping. Some further ideas:

  • Single character views: inspect how people are connected, and how these connections evolved (this article by Juan De Dios Santos includes interesting single character views from a different angle)
  • Other metrics of importance: screentime, eigenvector centrality (who’s important based on the importance of the people they’re connected to), etc.
  • Trailers: is importance in trailers indicative of importance in the actual movie? What about their share of area in promotion shots and posters?
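To illustrate the eigenvector centrality idea from the list above, here is a small power-iteration sketch in plain Python on a made-up undirected network. It is not from the original analysis. Each round, a character's score becomes their own score plus the sum of their neighbours' scores, followed by normalisation. Adding the node's own score is a common trick to keep the iteration from oscillating. networkx also provides `eigenvector_centrality` for real use.

```python
import math


def eigenvector_centrality(neighbours, iterations=200, tolerance=1e-10):
    """Approximate eigenvector centrality for an undirected graph.

    `neighbours` maps each node to a list of adjacent nodes and must be
    symmetric (if B is listed under A, A is listed under B).
    """
    scores = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Own score plus neighbours' scores (power iteration on A + I).
        updated = {
            node: scores[node] + sum(scores[n] for n in neighbours[node])
            for node in neighbours
        }
        norm = math.sqrt(sum(value * value for value in updated.values()))
        updated = {node: value / norm for node, value in updated.items()}
        change = sum(abs(updated[node] - scores[node]) for node in neighbours)
        scores = updated
        if change < tolerance:
            break
    return scores


# Made-up network: a triangle (Tony, Steve, Thor) plus Clint, who is
# connected only to Tony.
network = {
    "Tony": ["Steve", "Thor", "Clint"],
    "Steve": ["Tony", "Thor"],
    "Thor": ["Tony", "Steve"],
    "Clint": ["Tony"],
}

scores = eigenvector_centrality(network)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.3f}")
```

Clint ends up last even though he is linked to the best-connected character, because that single link is all he has.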

I enjoyed this a lot. This time there was less working it out (what is an edge?) and more problem solving (how can I fairly combine two graphs?). What are your thoughts on the Avengers network?

Translated from: https://medium.com/swlh/avengers-web-scraping-entity-extraction-and-network-graphs-in-python-ea6dc323eb7d
