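TED (Technology, Entertainment, Design)
A nonprofit organization that brings together experts from the fields of technology, entertainment, and design.
Slogan: "Ideas worth spreading"
Every year around February or March, it invites outstanding people to distill their work and research into short, powerful talks (usually under 18 minutes), which are then uploaded to the TED website for anyone to watch for free.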

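Contents

I. Data

1. Data overview

2. Data quality check

3. Data processing (dates)

II. Descriptive analysis

(1) Proportion analysis: share of single-speaker talks and of talks no longer than 18 minutes

(2) Ranking analysis

1. Most viewed talks

2. Views of the top talks by speaker

3. Most discussed talks

4. Views and comments: correlation and ratio analysis

5. Talk duration and views: correlation

(3) Time analysis

1. By month

2. By day of the week

3. By year

4. Year-month heatmap

(4) Speaker occupations

(5) Number of languages and views: correlation

(6) Theme analysis

(7) Word count analysis

(8) Ratings analysis

(9) Network analysis: relationships between TED talks based on related-talk recommendations

(10) Word cloud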


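I. Data

Two data sources:

ted_main.csv: information on all TED Talks recordings uploaded to the official website TED.com before September 21, 2017.

transcripts.csv: the transcript text of each talk.

1. Data overview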

from IPython.core.interactiveshell import InteractiveShell
# Show the output of every expression in a cell, not only the last one
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
ted=pd.read_csv('C:/Users/ZJDCUser/Desktop/比赛实战/kaggle/TED talks/ted_main.csv')
# List the columns of the dataset
ted.columns
# Reorder the columns
ted = ted[['name', 'title', 'description', 'main_speaker', 'speaker_occupation', 'num_speaker', 'duration', 'event', 'film_date', 'published_date', 'comments', 'tags', 'languages', 'ratings', 'related_talks', 'url', 'views']]

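17 columns in total.

Feature Description
name Official name of the talk (main speaker + title)
title Title of the talk
description Short description of the talk's content
main_speaker Main speaker
speaker_occupation Occupation of the main speaker
num_speaker Number of speakers
duration Length of the talk, in seconds
event TED / TEDx event at which the talk was given
film_date Date the talk was filmed (Unix timestamp)
published_date Date the talk was published (Unix timestamp)
comments Number of comments
tags Theme tags associated with the talk
languages Number of languages in which the talk is available
ratings A list of dictionaries, each holding one rating of the talk (e.g. Inspiring, Fascinating, Jaw-dropping)
related_talks A list of dictionaries, each describing a recommended talk to watch next
url URL of the talk
views Number of views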

2. Data quality check

ted.info()
ted.shape
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   comments            2550 non-null   int64
 1   description         2550 non-null   object
 2   duration            2550 non-null   int64
 3   event               2550 non-null   object
 4   film_date           2550 non-null   int64
 5   languages           2550 non-null   int64
 6   main_speaker        2550 non-null   object
 7   name                2550 non-null   object
 8   num_speaker         2550 non-null   int64
 9   published_date      2550 non-null   int64
 10  ratings             2550 non-null   object
 11  related_talks       2550 non-null   object
 12  speaker_occupation  2544 non-null   object
 13  tags                2550 non-null   object
 14  title               2550 non-null   object
 15  url                 2550 non-null   object
 16  views               2550 non-null   int64
dtypes: int64(7), object(10)
(2550, 17)

speaker_occupation has 6 missing values, which do not matter for this analysis.

ted.head()
ted.isnull().any()  # only one column has missing values
ted[ted["speaker_occupation"].isnull()]   # show the six rows with a missing occupation
name title description main_speaker speaker_occupation num_speaker duration event film_date published_date comments tags languages ratings related_talks url views
0 Ken Robinson: Do schools kill creativity? Do schools kill creativity? Sir Ken Robinson makes an entertaining and pro... Ken Robinson Author/educator 1 1164 TED2006 25-02-2006 27-06-2006 4553 ['children', 'creativity', 'culture', 'dance',... 60 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/ken_robinson_says_sc... 47227110
1 Al Gore: Averting the climate crisis Averting the climate crisis With the same humor and humanity he exuded in ... Al Gore Climate advocate 1 977 TED2006 25-02-2006 27-06-2006 265 ['alternative energy', 'cars', 'climate change... 43 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/al_gore_on_averting_... 3200520
2 David Pogue: Simplicity sells Simplicity sells New York Times columnist David Pogue takes aim... David Pogue Technology columnist 1 1286 TED2006 24-02-2006 27-06-2006 124 ['computers', 'entertainment', 'interface desi... 26 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i... [{'id': 1725, 'hero': 'https://pe.tedcdn.com/i... https://www.ted.com/talks/david_pogue_says_sim... 1636292
3 Majora Carter: Greening the ghetto Greening the ghetto In an emotionally charged talk, MacArthur-winn... Majora Carter Activist for environmental justice 1 1116 TED2006 26-02-2006 27-06-2006 200 ['MacArthur grant', 'activism', 'business', 'c... 35 [{'id': 3, 'name': 'Courageous', 'count': 760}... [{'id': 1041, 'hero': 'https://pe.tedcdn.com/i... https://www.ted.com/talks/majora_carter_s_tale... 1697550
4 Hans Rosling: The best stats you've ever seen The best stats you've ever seen You've never seen data presented like this. Wi... Hans Rosling Global health expert; data visionary 1 1190 TED2006 22-02-2006 28-06-2006 593 ['Africa', 'Asia', 'Google', 'demo', 'economic... 48 [{'id': 9, 'name': 'Ingenious', 'count': 3202}... [{'id': 2056, 'hero': 'https://pe.tedcdn.com/i... https://www.ted.com/talks/hans_rosling_shows_t... 12005869
name                  False
title                 False
description           False
main_speaker          False
speaker_occupation     True
num_speaker           False
duration              False
event                 False
film_date             False
published_date        False
comments              False
tags                  False
languages             False
ratings               False
related_talks         False
url                   False
views                 False
dtype: bool

3. Data processing (dates)

# In the raw dataset, film_date and published_date are stored as Unix timestamps.
# We use the datetime library to convert them into a readable date format.
import datetime
ted['film_date'] = ted['film_date'].apply(lambda x: datetime.datetime.fromtimestamp(int(x)).strftime('%d-%m-%Y'))
ted['published_date'] = ted['published_date'].apply(lambda x: datetime.datetime.fromtimestamp(int(x)).strftime('%d-%m-%Y'))
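
One caveat with fromtimestamp: without a tz argument it converts to the local time zone of the machine running the notebook, so a timestamp at midnight UTC can land on the previous calendar day in some zones. Below is a minimal sketch of a zone-independent conversion, assuming the timestamps are meant to be read as UTC (the source does not say). The helper name to_date_string and the sample timestamp are illustrative.

```python
from datetime import datetime, timezone

def to_date_string(unix_ts):
    """Format a Unix timestamp as DD-MM-YYYY, interpreted in UTC."""
    return datetime.fromtimestamp(int(unix_ts), tz=timezone.utc).strftime('%d-%m-%Y')

# Illustrative timestamp: midnight UTC on 25 February 2006
print(to_date_string(1140825600))
```

With this helper, ted['film_date'].apply(to_date_string) would take the place of the lambda.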

II. Descriptive analysis

(1) Proportion analysis: share of single-speaker talks and of talks no longer than 18 minutes

ted.describe()
print("Share of talks given by a single speaker: {}%".format(round(sum(ted["num_speaker"]==1)*100/len(ted),1)))
# str.format() fills the {} placeholder with the computed value
print("Share of talks no longer than 18 minutes: {}%".format(round(sum(ted["duration"]<=18*60)*100/len(ted),1)))
       num_speaker     duration     comments    languages         views
count  2550.000000  2550.000000  2550.000000  2550.000000  2.550000e+03
mean      1.028235   826.510196   191.562353    27.326275  1.698297e+06
std       0.207705   374.009138   282.315223     9.563452  2.498479e+06
min       1.000000   135.000000     2.000000     0.000000  5.044300e+04
25%       1.000000   577.000000    63.000000    23.000000  7.557928e+05
50%       1.000000   848.000000   118.000000    28.000000  1.124524e+06
75%       1.000000  1046.750000   221.750000    33.000000  1.700760e+06
max       5.000000  5256.000000  6404.000000    72.000000  4.722711e+07
Share of talks given by a single speaker: 97.7%
Share of talks no longer than 18 minutes: 79.1%

The results show that most talks are given by a single speaker, and 79.1% run 18 minutes or less. The mean number of comments is about 191.6 and the mean number of views is about 1.7 million. Talks are offered in many languages, up to 72.

(2) Ranking analysis

1. Most viewed talks

# Sort by views and keep the top 15 rows
pop_talks=ted[["title","main_speaker","views","film_date"]].sort_values("views",ascending=False)[:15]
pop_talks
ted.views.describe() # overall distribution of views
count    2.550000e+03
mean     1.698297e+06
std      2.498479e+06
min      5.044300e+04
25%      7.557928e+05
50%      1.124524e+06
75%      1.700760e+06
max      4.722711e+07
Name: views, dtype: float64

The mean number of views is about 1.7 million and the median about 1.12 million, which shows that TED talks are very widely watched.

plt.figure(figsize=(10,6))
sns.displot(ted["views"])
sns.displot(ted[ted.views<0.4e7]["views"])

93.5% of talks have fewer than 4 million views, so plotting the distribution for only those talks gives a clearer picture of how views are distributed.
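
A share like this can be checked directly, because the mean of a boolean Series is the fraction of True values. Below is a minimal sketch on a made-up Series; the numbers are placeholders standing in for ted["views"], not TED data.

```python
import pandas as pd

# Made-up view counts standing in for ted["views"]
views = pd.Series([500_000, 1_200_000, 3_900_000, 4_000_000, 47_000_000])

# Comparing a Series to a scalar gives booleans; their mean is the share of True
share_below = (views < 4_000_000).mean()
print(f"{share_below:.1%} of these talks have fewer than 4 million views")
```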

2. Views of the top talks by speaker

# Take the first three letters of main_speaker as a new column, abbr
pop_talks["abbr"]=pop_talks["main_speaker"].apply(lambda x:x[:3])
pop_talks.head()
sns.set_style("white")  # set_style selects the plot theme
# seaborn has five preset themes: darkgrid, whitegrid, dark, white, ticks (default: darkgrid)
plt.figure(figsize=(10,6)) # figsize: width and height of the figure, in inches
sns.barplot(x="abbr",y="views",data=pop_talks)
title main_speaker views film_date abbr
0 Do schools kill creativity? Ken Robinson 47227110 25-02-2006 Ken
1346 Your body language may shape who you are Amy Cuddy 43155405 26-06-2012 Amy
677 How great leaders inspire action Simon Sinek 34309432 17-09-2009 Sim
837 The power of vulnerability Brené Brown 31168150 06-06-2010 Bre
452 10 things you didn't know about orgasm Mary Roach 22270883 06-02-2009 Mar

<AxesSubplot:xlabel='abbr', ylabel='views'>

3. Most discussed talks

sns.displot(ted.comments)
ted.comments.describe()

count    2550.000000
mean      191.562353
std       282.315223
min         2.000000
25%        63.000000
50%       118.000000
75%       221.750000
max      6404.000000
Name: comments, dtype: float64

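The TED community is highly engaged in discussing the talks, and the standard deviation of the comment counts is large: about 282 against a mean of about 192.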

4. Views and comments: correlation and ratio analysis

sns.jointplot(x='views',y='comments',data=ted)
ted[["views","comments"]].corr()

views comments
views 1.000000 0.530939
comments 0.530939 1.000000

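The correlation coefficient between views and comments is slightly above 0.5 (0.53), indicating a moderate positive correlation between the two quantities.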
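
For reference, DataFrame.corr() computes the Pearson coefficient by default: the covariance of the two variables divided by the product of their standard deviations. Below is a minimal pure-Python sketch of that formula. The function name pearson and the sample numbers are illustrative, not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data where y rises with x but not perfectly
print(pearson([1, 2, 3], [1, 3, 2]))
```

The normalisation factors cancel between numerator and denominator, which is why the sketch can skip dividing by the sample size.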

# Add a new column, dis_quo: comments per view
ted["dis_quo"]=ted.comments/ted.views
# Sort by the comments-to-views ratio, highest first
ted[["title","main_speaker","views","comments","dis_quo","film_date"]].sort_values("dis_quo",ascending=False)
title main_speaker views comments dis_quo film_date
744 The case for same-sex marriage Diane J. Savino 292395 649 0.002220 02-12-2009
803 E-voting without fraud David Bismark 543551 834 0.001534 14-07-2010
96 Militant atheism Richard Dawkins 4374792 6404 0.001464 02-02-2002
694 Inside a school for suicide bombers Sharmeen Obaid-Chinoy 1057238 1502 0.001421 10-02-2010
954 Taking imagination seriously Janet Echelman 1832930 2492 0.001360 03-03-2011
... ... ... ... ... ... ...
2494 A simple new blood test that can catch cancer ... Jimmy Lin 1005506 7 0.000007 24-04-2017
2528 How your pictures can help reclaim lost history Chance Coughenour 539207 3 0.000006 08-06-2016
2542 Living sculptures that stand for history's truths Sethembile Msezane 542088 3 0.000006 27-08-2017
2501 The stories behind The New Yorker's iconic covers Françoise Mouly 839040 3 0.000004 08-03-2017
2534 What it feels like to see Earth from space Benjamin Grant 646174 2 0.000003 07-04-2017

2550 rows × 6 columns

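Most of the ten talks with the highest comments-to-views ratio carry tags such as "religion", "science", or "politics", which suggests these topics spark more heated discussion than topics like "culture" or "art".

The talk with the highest ratio is The case for same-sex marriage, which addresses marriage-equality legislation in the United States. Same-sex marriage tends to be linked to a range of contentious topics such as religion, politics, and human rights.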

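5. Talk duration and views: correlation

# Relationship between talk duration and views
# duration is in seconds, so the 25-minute cutoff is 25*60
sns.jointplot(x = 'duration', y='views', data = ted[ted['duration'] < 25*60])
plt.xlabel('Duration')
plt.ylabel('Views')
plt.show()
# The Pearson coefficient is 0.076, so duration and views are essentially uncorrelated. Whether a talk is popular seems to depend on its content rather than its length.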

(3) Time analysis

1. By month

month_order=["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
day_order=["Mon","Tue","Wed","Thu","Fri","Sat","Sun"]
ted["month"]=ted['film_date'].apply(lambda x:month_order[int(x.split('-')[1])-1])
ted.month.head()
month_ted=pd.DataFrame(ted.month.value_counts()).reset_index()
month_ted.columns=['month','talks']
sns.barplot(x='month',y='talks',data=month_ted,order=month_order)
0    Feb
1    Feb
2    Feb
3    Feb
4    Feb
Name: month, dtype: object

<AxesSubplot:xlabel='month', ylabel='talks'>

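February is the most popular month for talks, while August and January are the least popular. February's popularity is largely due to the official TED conference held that month.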

2. By day of the week

# Use datetime's strftime method to get the weekday name of a date
import datetime
def getday(x):
    day,month,year=(int(i) for i in x.split('-'))
    answer=datetime.date(year,month,day).strftime('%A')
    return answer[:3]

# Alternative: use datetime's weekday method to get the index of the day within the week,
# then use it to look up the name in day_order
def getday2(x):
    day,month,year=(int(i) for i in x.split('-'))
    answer=datetime.date(year,month,day).weekday()
    return day_order[answer]

# Add a new column, 'day'
ted['day']=ted['film_date'].apply(getday)
day_ted=pd.DataFrame(ted['day'].value_counts()).reset_index()
day_ted.columns=['day','talks']
sns.barplot(x='day',y='talks',data=day_ted,order=day_order)
<AxesSubplot:xlabel='day', ylabel='talks'>

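Wednesday and Thursday are the most popular days for talks, while Sunday has the fewest. In other words, more talks take place mid-week, and weekends are for rest.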

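3. By year

# '1998' is included because the year-month table in the heatmap section shows talks filmed that year
year_order=['1972','1983','1984','1990','1991','1994','1998','2001','2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017']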
ted['year']=ted['film_date'].apply(lambda x:x.split('-')[2])
year_ted=pd.DataFrame(ted['year'].value_counts().reset_index())
year_ted.columns=['year','talks']
plt.figure(figsize=(20,10))
sns.pointplot(x='year',y='talks',data=year_ted,order=year_order)
<AxesSubplot:xlabel='year', ylabel='talks'>

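  • As expected, the number of TED talk videos rises over the years;
  • From 2008 to 2009 the number of videos jumps sharply, roughly tripling;
  • After 2009 the number of videos per year settles at around 250.

The number of talks surges after 2004, possibly because TED-organized events gained recognition and came to be held in many places. There is a brief dip in 2013-2016. The sharp drop from 2016 to 2017 may be because earlier events were too frequent and the invited speakers, mostly experts in their fields, do not produce new results that quickly. Note also that the dataset only covers talks uploaded before September 21, 2017, so 2017 is incomplete.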

4. Year-month heatmap

# Overwrite the month column with the two-digit month number
ted['month']=ted['film_date'].apply(lambda x:x.split('-')[1])
ted_year_month=ted.groupby(['month','year'])['name'].nunique()
ted_year_month=ted_year_month.unstack()
ted_year_month
year 1972 1983 1984 1990 1991 1994 1998 2001 2002 2003 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
month
01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.0 NaN 3.0 6.0 2.0 7.0 3.0 2.0 4.0 3.0
02 NaN NaN 1.0 NaN NaN 1.0 6.0 4.0 24.0 31.0 ... 47.0 83.0 70.0 2.0 42.0 75.0 4.0 1.0 83.0 10.0
03 NaN NaN NaN 1.0 NaN NaN NaN NaN 3.0 2.0 ... 10.0 5.0 4.0 76.0 38.0 13.0 90.0 78.0 3.0 8.0
04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 1.0 27.0 10.0 33.0 15.0 5.0 3.0 11.0 68.0
05 1.0 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN ... 8.0 2.0 6.0 17.0 18.0 15.0 4.0 35.0 16.0 NaN
06 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 2.0 6.0 11.0 5.0 82.0 70.0 12.0 14.0 39.0 1.0
07 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN ... 5.0 65.0 59.0 70.0 5.0 7.0 9.0 1.0 2.0 4.0
08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 1.0 2.0 2.0 1.0 5.0 4.0 10.0 1.0 4.0
09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.0 7.0 7.0 9.0 9.0 14.0 20.0 20.0 15.0 NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 2.0 21.0 19.0 12.0 12.0 28.0 57.0 18.0 38.0 NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 40.0 18.0 46.0 15.0 12.0 23.0 46.0 29.0 NaN
12 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... 8.0 1.0 41.0 15.0 10.0 9.0 6.0 11.0 5.0 NaN

12 rows × 24 columns

plt.figure(figsize=(20,15),dpi=80)
sns.heatmap(ted_year_month,annot=False,fmt='d',cmap='Reds')
plt.title('Number of talks by year and month')

(4) Speaker occupations

# Bar chart of the number of talks per speaker occupation, to see which occupations appear most often
ted_filter=ted.groupby('speaker_occupation')['name'].nunique().sort_values(ascending=False)
ted_filter.head(30)
plt.figure(figsize=(20,10))
ted_filter[ted_filter.values>10].plot.bar(color='red',edgecolor='k',width=0.85,alpha=0.85)
plt.xlabel("Speaker occupation")
plt.ylabel('Number of talks')
plt.title('Most common speaker occupations')
speaker_occupation
Writer                                  45
Artist                                  34
Designer                                34
Journalist                              33
Entrepreneur                            31
Architect                               30
Inventor                                27
Psychologist                            26
Photographer                            25
Filmmaker                               21
Educator                                20
Author                                  20
Neuroscientist                          20
Economist                               20
Roboticist                              16
Philosopher                             16
Biologist                               15
Physicist                               14
Musician                                11
Marine biologist                        11
Technologist                            10
Global health expert; data visionary    10
Activist                                10
Graphic designer                         9
Oceanographer                            9
Behavioral economist                     9
Poet                                     9
Philanthropist                           9
Astronomer                               9
Historian                                9
Name: name, dtype: int64
# Box plot of views for talks by speakers of different occupations
ted_filter2=ted.groupby('speaker_occupation')['name'].apply(lambda x:x.sort_values(ascending=False))
ted_filter2.head(10)
plt.figure(figsize=(20,12))
sns.boxplot(y='views',x='speaker_occupation',data=ted[ted['speaker_occupation'].isin(ted_filter.head(20).index)])
plt.ylabel('Views')
plt.title('Views by speaker occupation')
speaker_occupation
Chairman of the Cordoba Initiative           326     Feisal Abdul Rauf: Lose your ego, find your co...
Child protection leader, activist, author    2525    Tara Winkler: Why we need to end the era of or...
Robotics engineer                            2547    Radhika Nagpal: What intelligent machines can ...
Space physicist                              2260    James Green: 3 moons and a planet that could h...
3D printer                                   1823    Avi Reichental: What's next in 3D printing
3D printing entrepreneur                     1121    Lisa Harouni: A primer on 3D printing
9/11 mothers                                 926     Aicha el-Wafi + Phyllis Rodriguez: The mothers...
A capella ensemble                           449     Naturally 7: A full-band beatbox
AI expert                                    2451    Stuart Russell: 3 principles for creating safe...
                                             2526    Noriko Arai: Can a robot pass a university ent...
Name: name, dtype: object

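(5) Number of languages and views: correlation

sns.jointplot(x='languages',y='views',data=ted,color='green')
# The Pearson coefficient is 0.38, indicating a moderate correlation between the number of languages offered and views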

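(6) Theme analysis

Although TED started out focused on Technology, Entertainment, and Design, the themes it covers today are far more diverse.

The column holding theme information is tags, which lists several tags for each talk as a list. We first clean this column to extract the tags.

# Convert the string representation of a list into an actual list
import ast
ted['tags'] = ted['tags'].apply(lambda x: ast.literal_eval(x))
# The tags of the first few talks
ted['tags'].head()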
0    [children, creativity, culture, dance, educati...
1    [alternative energy, cars, climate change, cul...
2    [computers, entertainment, interface design, m...
3    [MacArthur grant, activism, business, cities, ...
4    [Africa, Asia, Google, demo, economics, global...
Name: tags, dtype: object
# Split each talk's tags into one row per tag
s = ted.apply(lambda x: pd.Series(x['tags']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'theme'
s.head()
# Join the split tags back onto the original dataset
theme_df = ted.drop('tags', axis = 1).join(s)
theme_df.head(10)
0      children
0    creativity
0       culture
0         dance
0     education
Name: theme, dtype: object
name title description main_speaker speaker_occupation num_speaker duration event film_date published_date ... languages ratings related_talks url views dis_quo month day year theme
0 Ken Robinson: Do schools kill creativity? Do schools kill creativity? Sir Ken Robinson makes an entertaining and pro... Ken Robinson Author/educator 1 1164 TED2006 25-02-2006 27-06-2006 ... 60 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/ken_robinson_says_sc... 47227110 0.000096 02 Sat 2006 children
0 Ken Robinson: Do schools kill creativity? Do schools kill creativity? Sir Ken Robinson makes an entertaining and pro... Ken Robinson Author/educator 1 1164 TED2006 25-02-2006 27-06-2006 ... 60 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/ken_robinson_says_sc... 47227110 0.000096 02 Sat 2006 creativity
0 Ken Robinson: Do schools kill creativity? Do schools kill creativity? Sir Ken Robinson makes an entertaining and pro... Ken Robinson Author/educator 1 1164 TED2006 25-02-2006 27-06-2006 ... 60 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/ken_robinson_says_sc... 47227110 0.000096 02 Sat 2006 culture
0 Ken Robinson: Do schools kill creativity? Do schools kill creativity? Sir Ken Robinson makes an entertaining and pro... Ken Robinson Author/educator 1 1164 TED2006 25-02-2006 27-06-2006 ... 60 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/ken_robinson_says_sc... 47227110 0.000096 02 Sat 2006 dance
0 Ken Robinson: Do schools kill creativity? Do schools kill creativity? Sir Ken Robinson makes an entertaining and pro... Ken Robinson Author/educator 1 1164 TED2006 25-02-2006 27-06-2006 ... 60 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/ken_robinson_says_sc... 47227110 0.000096 02 Sat 2006 education
0 Ken Robinson: Do schools kill creativity? Do schools kill creativity? Sir Ken Robinson makes an entertaining and pro... Ken Robinson Author/educator 1 1164 TED2006 25-02-2006 27-06-2006 ... 60 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/ken_robinson_says_sc... 47227110 0.000096 02 Sat 2006 parenting
0 Ken Robinson: Do schools kill creativity? Do schools kill creativity? Sir Ken Robinson makes an entertaining and pro... Ken Robinson Author/educator 1 1164 TED2006 25-02-2006 27-06-2006 ... 60 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/ken_robinson_says_sc... 47227110 0.000096 02 Sat 2006 teaching
1 Al Gore: Averting the climate crisis Averting the climate crisis With the same humor and humanity he exuded in ... Al Gore Climate advocate 1 977 TED2006 25-02-2006 27-06-2006 ... 43 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/al_gore_on_averting_... 3200520 0.000083 02 Sat 2006 alternative energy
1 Al Gore: Averting the climate crisis Averting the climate crisis With the same humor and humanity he exuded in ... Al Gore Climate advocate 1 977 TED2006 25-02-2006 27-06-2006 ... 43 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/al_gore_on_averting_... 3200520 0.000083 02 Sat 2006 cars
1 Al Gore: Averting the climate crisis Averting the climate crisis With the same humor and humanity he exuded in ... Al Gore Climate advocate 1 977 TED2006 25-02-2006 27-06-2006 ... 43 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/al_gore_on_averting_... 3200520 0.000083 02 Sat 2006 climate change
# Number of distinct tags
print("Number of tags: {}".format(len(theme_df['theme'].value_counts())))
theme_df['theme'].value_counts()
theme_df['theme'].value_counts().head(10)
pop_themes=theme_df['theme'].value_counts().head(10)
pop_themes
theme_df['theme']
Number of tags: 416
technology       727
science          567
global issues    501
culture          486
TEDx             450
                ...
evil               2
funny              1
testing            1
cloud              1
skateboarding      1
Name: theme, Length: 416, dtype: int64
0             children
0           creativity
0              culture
0                dance
0            education
             ...
2549              play
2549     public spaces
2549           society
2549          software
2549    urban planning
Name: theme, Length: 19154, dtype: object

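"Technology" is the most popular theme, followed by "science", "global issues", and "culture". "Entertainment" and "design" also make the top 10, ranking eighth and sixth respectively.

How have these popular themes evolved in recent years? Leaving out the unrelated TEDx tag, we take the top seven themes (technology, science, global issues, culture, design, business, entertainment) and look at their shares over 2009-2017 to see how popular topics have shifted.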
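
The source does not show the code for this step. One possible way to compute yearly theme shares, sketched on a tiny made-up frame: explode the tag lists to one row per (talk, tag) pair, keep the chosen themes, cross-tabulate by year, and normalize each year so its shares sum to 1. The frame talks and its values are illustrative stand-ins for ted, not real data.

```python
import pandas as pd

# Made-up miniature stand-in for the ted DataFrame (year and parsed tags only)
talks = pd.DataFrame({
    "year": ["2009", "2009", "2010", "2010"],
    "tags": [["technology", "science"], ["culture"],
             ["technology"], ["technology", "culture"]],
})

# Themes to track (the article uses seven; three keep the example short)
top_themes = ["technology", "science", "culture"]

# One row per (talk, tag) pair, restricted to the tracked themes
exploded = talks.explode("tags").rename(columns={"tags": "theme"})
exploded = exploded[exploded["theme"].isin(top_themes)]

# Count talks per year and theme, then divide each row by its total
counts = pd.crosstab(exploded["year"], exploded["theme"])
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(2))
```

Each row of shares can then be drawn as a stacked bar or line chart to compare themes across years.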

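(7) Word count analysis

trans=pd.read_csv('C:/Users/ZJDCUser/Desktop/比赛实战/kaggle/TED talks/transcripts.csv')
trans.head()
trans.describe()
# ted_main.csv has 2550 talks but transcripts.csv has only 2467 rows, so 83 talks have no transcript; this does not affect the text analysis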
transcript url
0 Good morning. How are you?(Laughter)It's been ... https://www.ted.com/talks/ken_robinson_says_sc...
1 Thank you so much, Chris. And it's truly a gre... https://www.ted.com/talks/al_gore_on_averting_...
2 (Music: "The Sound of Silence," Simon & Garfun... https://www.ted.com/talks/david_pogue_says_sim...
3 If you're here today — and I'm very happy that... https://www.ted.com/talks/majora_carter_s_tale...
4 About 10 years ago, I took on the task to teac... https://www.ted.com/talks/hans_rosling_shows_t...

transcript url
count 2467 2467
unique 2464 2464
top I'm going to tell you a little bit about my TE... https://www.ted.com/talks/rob_reid_the_8_billi...
freq 2 2
# Merge the transcripts with the main talk data into a single DataFrame
df=pd.merge(left=ted,right=trans,how='left',left_on='url',right_on='url')
df.head()
name title description main_speaker speaker_occupation num_speaker duration event film_date published_date ... languages ratings related_talks url views dis_quo month day year transcript
0 Ken Robinson: Do schools kill creativity? Do schools kill creativity? Sir Ken Robinson makes an entertaining and pro... Ken Robinson Author/educator 1 1164 TED2006 25-02-2006 27-06-2006 ... 60 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/ken_robinson_says_sc... 47227110 0.000096 02 Sat 2006 Good morning. How are you?(Laughter)It's been ...
1 Al Gore: Averting the climate crisis Averting the climate crisis With the same humor and humanity he exuded in ... Al Gore Climate advocate 1 977 TED2006 25-02-2006 27-06-2006 ... 43 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... https://www.ted.com/talks/al_gore_on_averting_... 3200520 0.000083 02 Sat 2006 Thank you so much, Chris. And it's truly a gre...
2 David Pogue: Simplicity sells Simplicity sells New York Times columnist David Pogue takes aim... David Pogue Technology columnist 1 1286 TED2006 24-02-2006 27-06-2006 ... 26 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i... [{'id': 1725, 'hero': 'https://pe.tedcdn.com/i... https://www.ted.com/talks/david_pogue_says_sim... 1636292 0.000076 02 Fri 2006 (Music: "The Sound of Silence," Simon & Garfun...
3 Majora Carter: Greening the ghetto Greening the ghetto In an emotionally charged talk, MacArthur-winn... Majora Carter Activist for environmental justice 1 1116 TED2006 26-02-2006 27-06-2006 ... 35 [{'id': 3, 'name': 'Courageous', 'count': 760}... [{'id': 1041, 'hero': 'https://pe.tedcdn.com/i... https://www.ted.com/talks/majora_carter_s_tale... 1697550 0.000118 02 Sun 2006 If you're here today — and I'm very happy that...
4 Hans Rosling: The best stats you've ever seen The best stats you've ever seen You've never seen data presented like this. Wi... Hans Rosling Global health expert; data visionary 1 1190 TED2006 22-02-2006 28-06-2006 ... 48 [{'id': 9, 'name': 'Ingenious', 'count': 3202}... [{'id': 2056, 'hero': 'https://pe.tedcdn.com/i... https://www.ted.com/talks/hans_rosling_shows_t... 12005869 0.000049 02 Wed 2006 About 10 years ago, I took on the task to teac...

5 rows × 22 columns

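# Fill missing transcripts with an empty string
df['transcript']=df['transcript'].fillna('')
# Compute the word count of each transcript
df['wc']=df['transcript'].apply(lambda x:len(x.split()))
# Word count statistics
df['wc'].describe()
df['wc'].head()
# On average a talk contains 1972 words, and the longest contains 9044. The standard deviation is large, at 1009 words.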
count    2553.000000
mean     1971.550725
std      1009.494329
min         0.000000
25%      1235.000000
50%      1983.000000
75%      2681.000000
max      9044.000000
Name: wc, dtype: float64

0    3066
1    2089
2    3253
3    3015
4    3121
Name: wc, dtype: int64
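# Speaking rate. duration is in seconds, so despite the column name this is words per second; multiply by 60 for words per minute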
df['wpm']=df['wc']/df['duration']
df['wpm'].describe()
count    2553.000000
mean        2.369129
std         0.660589
min         0.000000
25%         2.184486
50%         2.483636
75%         2.749744
max         4.122748
Name: wpm, dtype: float64
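
Since duration is in seconds, the values above read as a per-second rate: a mean of about 2.37 words per second corresponds to roughly 142 words per minute. Below is a small sketch of the conversion; the function name words_per_minute is hypothetical.

```python
def words_per_minute(word_count, duration_seconds):
    """Speaking rate in words per minute from a word count and a duration in seconds."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return word_count / (duration_seconds / 60)

# First talk in the tables above: 3066 words in 1164 seconds
rate = words_per_minute(3066, 1164)
print(round(rate, 1))
```

On the DataFrame, the equivalent would be df['wc'] / (df['duration'] / 60).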

(8) Ratings analysis

# Convert the ratings column from a string into a list of dictionaries
ted['ratings'] = ted['ratings'].apply(lambda x: ast.literal_eval(x))
# Example of a ratings value
ted.iloc[0]['ratings']
[{'id': 7, 'name': 'Funny', 'count': 19645},{'id': 1, 'name': 'Beautiful', 'count': 4573},{'id': 9, 'name': 'Ingenious', 'count': 6073},{'id': 3, 'name': 'Courageous', 'count': 3253},{'id': 11, 'name': 'Longwinded', 'count': 387},{'id': 2, 'name': 'Confusing', 'count': 242},{'id': 8, 'name': 'Informative', 'count': 7346},{'id': 22, 'name': 'Fascinating', 'count': 10581},{'id': 21, 'name': 'Unconvincing', 'count': 300},{'id': 24, 'name': 'Persuasive', 'count': 10704},{'id': 23, 'name': 'Jaw-dropping', 'count': 4439},{'id': 25, 'name': 'OK', 'count': 1174},{'id': 26, 'name': 'Obnoxious', 'count': 209},{'id': 10, 'name': 'Inspiring', 'count': 24924}]
rating_words = []
for i in range(len(ted)):
    l = ted.iloc[i]['ratings']
    for d in l:
        if d['name'] not in rating_words:
            rating_words.append(d['name'])
rating_words
# Every talk has the same 14 rating words, and viewers pick the ones that fit by clicking. The better a talk matches a rating, the higher its count for that rating.
['Funny','Beautiful','Ingenious','Courageous','Longwinded','Confusing','Informative','Fascinating','Unconvincing','Persuasive','Jaw-dropping','OK','Obnoxious','Inspiring']
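
Each ratings list can also be flattened into a name-to-count mapping, which makes it easy to find a talk's dominant rating. Below is a minimal sketch using a shortened copy of the example value above; the helper name rating_counts is hypothetical.

```python
import ast

def rating_counts(raw):
    """Parse a stringified ratings list into a {name: count} dictionary."""
    return {entry['name']: entry['count'] for entry in ast.literal_eval(raw)}

# Shortened copy of the first talk's ratings, as a string like the one stored in the CSV
raw = ("[{'id': 7, 'name': 'Funny', 'count': 19645}, "
       "{'id': 8, 'name': 'Informative', 'count': 7346}, "
       "{'id': 10, 'name': 'Inspiring', 'count': 24924}]")

counts = rating_counts(raw)
# The rating with the highest count
top = max(counts, key=counts.get)
print(top, counts[top])
```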

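(9) Network analysis: relationships between TED talks based on related-talk recommendations

Every TED talk comes with a list of related talks recommended to watch next. To untangle the relationships between talks, we use network analysis.

In the graph, each talk is a node, and an edge is added between two talks whenever one recommends the other.

Here we use the networkx library to draw the graph.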
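
The source gives no code for this step. As a rough sketch of the construction described above, the edge list can be built in plain Python before handing it to a graph library. The assumption here, not confirmed by the excerpt, is that each related-talk dictionary carries a title key identifying the recommended talk. The talk titles below are placeholders.

```python
# Placeholder data: talk title -> list of related-talk dictionaries
# (the 'title' key is an assumption about the related_talks format)
related = {
    "Talk A": [{"id": 2, "title": "Talk B"}, {"id": 3, "title": "Talk C"}],
    "Talk B": [{"id": 1, "title": "Talk A"}],
    "Talk C": [],
}

def build_edges(related_by_title):
    """Return the set of undirected edges between talks and their recommendations."""
    edges = set()
    for title, recommendations in related_by_title.items():
        for rec in recommendations:
            # Sort each pair so A-B and B-A collapse into one undirected edge
            edges.add(tuple(sorted((title, rec["title"]))))
    return edges

edges = build_edges(related)
print(sorted(edges))
```

The resulting pairs can be loaded into a networkx.Graph with add_edges_from for layout and drawing.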

(10) Word cloud

%pip install wordcloud
# Join all transcripts into a single corpus
corpus = ' '.join(trans['transcript'])
corpus = corpus.replace('.', '. ')
# Build the word cloud
from wordcloud import WordCloud, STOPWORDS
tedwordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', width=2400,height=2000).generate(corpus)
# Draw the word cloud
plt.figure(figsize = (10,10))
plt.imshow(tedwordcloud)
plt.axis('off')
plt.show()
# The most common words include one, now, people, know, think, and see. TED appears to put a lot of emphasis on observing, understanding, and thinking about people and knowledge.
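
To check which words dominate without reading them off the image, the frequencies can be counted directly. Below is a minimal standard-library sketch. The tiny corpus and stop-word set are placeholders, not the transcripts or the wordcloud package's STOPWORDS.

```python
import re
from collections import Counter

def top_words(text, stopwords, n=3):
    """Return the n most frequent words in text, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Placeholder corpus and stop words, for illustration only
sample = "People think. People know what people see, and people think again."
placeholder_stopwords = {"what", "and", "again"}

print(top_words(sample, placeholder_stopwords, n=2))
```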

import os
from PIL import Image
# Load an image to use as a mask, so the cloud is drawn in its shape
mask=np.array(Image.open("C:/Users/ZJDCUser/Desktop/比赛实战/kaggle/TED talks/ted.png"))
ted_tedwordcloud = WordCloud(stopwords=STOPWORDS, background_color='white',mask=mask,mode='RGB',width=2400,height=2000).generate(corpus)
ted_tedwordcloud
# Draw the word cloud
plt.figure(figsize = (10,10))
plt.imshow(ted_tedwordcloud)
plt.axis('off')
plt.show()
(-0.5, 220.5, 168.5, -0.5)
