点这传送kaggle原作者
点这传送数据源&比赛

变量构造

belongs_to_collection

for i, e in enumerate(train['belongs_to_collection'][:5]):print(i, e)

0 [{'id': 313576, 'name': 'Hot Tub Time Machine Collection', 'poster_path': '/iEhb00TGPucF0b4joM1ieyY026U.jpg', 'backdrop_path': '/noeTVcgpBiD48fDjFVic1Vz7ope.jpg'}]
1 [{'id': 107674, 'name': 'The Princess Diaries Collection', 'poster_path': '/wt5AMbxPTS4Kfjx7Fgm149qPfZl.jpg', 'backdrop_path': '/zSEtYD77pKRJlUPx34BJgUG9v1c.jpg'}]
2 {}
3 {}
4 {}

enumerate:在可循环对象的每个元素前面加个序号

train['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0).value_counts()

0    2396
1     604
Name: belongs_to_collection, dtype: int64

DataFrame.apply(func, axis=0)对列（对行则axis=1）使用func
lambda x:x+1 if 2==1 else 0 #if 条件为真的时候返回if前面内容，否则返回0

此列中的2396个值为空，604包含有关系列的信息。假设只有集合名称才有用。另一个可能有用的特征是是否属于系列。

train['collection_name'] = train['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
train['has_collection'] = train['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)test['collection_name'] = test['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
test['has_collection'] = test['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)train = train.drop(['belongs_to_collection'], axis=1)
test = test.drop(['belongs_to_collection'], axis=1)

genres

for i, e in enumerate(train['genres'][:5]):print(i, e)
0 [{'id': 35, 'name': 'Comedy'}]
1 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]
2 [{'id': 18, 'name': 'Drama'}]
3 [{'id': 53, 'name': 'Thriller'}, {'id': 18, 'name': 'Drama'}]
4 [{'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}

print('Number of genres in films')
train['genres'].apply(lambda x: len(x) if x != {} else 0).value_counts()
Number of genres in films
2    972
3    900
1    593
4    393
5    111
6     21
0      7
7      3
Name: genres, dtype: int64

类型列包含电影所属类型的命名和ID。大多数电影有2-3种类型，可能有5-6种类型。我认为0和7是异常值。让我们提取类型！我将在电影中创建一个包含所有类型的列，并为每个类型创建单独的列。

但首先让我们来看看这些类型。

list_of_genres = list(train['genres'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)

[[‘Comedy’], [‘Comedy’, ‘Drama’, ‘Family’, ‘Romance’], [‘Drama’],…]

plt.figure(figsize = (12, 8))
text = ' '.join([i for j in list_of_genres for i in j])#把list_of_genres里面的每一个元素拿出来形成一个长list再用空格分隔
wordcloud = WordCloud(max_font_size=None, background_color='white', collocations=False,width=1200, height=1000).generate(text)#collocations=False, #避免重复单词 max_font_size=100,  # 字体最大值
plt.imshow(wordcloud)
plt.title('Top genres')
plt.axis("off")
plt.show()

剧情，喜剧和惊悚片比较流行。

Counter([i for j in list_of_genres for i in j]).most_common()

Counter是一个无序的容器类型，以字典的键值对形式存储，其中元素作为key，其计数作为value。
Counter.most_common([n])返回一个TopN列表。如果n没有被指定，则返回所有元素。当多个元素计数值相同时，排列是无确定顺序的。

[('Drama', 1531), ('Comedy', 1028), ('Thriller', 789), ('Action', 741), ('Romance', 571), ('Crime', 469), ('Adventure', 439), ('Horror', 301), ('Science Fiction', 290), ('Family', 260), ('Fantasy', 232), ('Mystery', 225), ('Animation', 141), ('History', 132), ('Music', 100), ('War', 100), ('Documentary', 87), ('Western', 43), ('Foreign', 31), ('TV Movie', 1)]

为前15种类型创建单独的列。

train['num_genres'] = train['genres'].apply(lambda x: len(x) if x != {} else 0)
train['all_genres'] = train['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_genres = [m[0] for m in Counter([i for j in list_of_genres for i in j]).most_common(15)]
for g in top_genres:train['genre_' + g] = train['all_genres'].apply(lambda x: 1 if g in x else 0)test['num_genres'] = test['genres'].apply(lambda x: len(x) if x != {} else 0)
test['all_genres'] = test['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_genres:test['genre_' + g] = test['all_genres'].apply(lambda x: 1 if g in x else 0)train = train.drop(['genres'], axis=1)
test = test.drop(['genres'], axis=1)

production_companies&production_countries&Spoken languages&Keywords

制片公司、制片国、语言、关键词的处理类似类型的处理，保留参与公司（国家、语言种类、关键词）数目，为前30（或其他数字）单独生成变量。

for i, e in enumerate(train['production_companies'][:5]):print(i, e)
0 [{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]
1 [{'name': 'Walt Disney Pictures', 'id': 2}]
2 [{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]
3 {}
4 {}
print('Number of production companies in films')
train['production_companies'].apply(lambda x: len(x) if x != {} else 0).value_counts()

Number of production companies in films
1     775
2     734
3     582
4     312
5     166
0     156
6     118
7      62
8      42
9      29
11      7
10      7
12      3
16      2
15      2
14      1
13      1
17      1
Name: production_companies, dtype: int64

Most of films have 1-2 production companies, cometimes 3-4. But there are films with 10+ companies! Let’s have a look at some of them.

train[train['production_companies'].apply(lambda x: len(x) if x != {} else 0) > 11]

list_of_companies = list(train['production_companies'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
Counter([i for j in list_of_companies for i in j]).most_common(30)
[('Warner Bros.', 202), ('Universal Pictures', 188), ('Paramount Pictures', 161), ('Twentieth Century Fox Film Corporation', 138), ('Columbia Pictures', 91), ('Metro-Goldwyn-Mayer (MGM)', 84), ('New Line Cinema', 75), ('Touchstone Pictures', 63), ('Walt Disney Pictures', 62), ('Columbia Pictures Corporation', 61), ('TriStar Pictures', 53), ('Relativity Media', 48), ('Canal+', 46), ('United Artists', 44), ('Miramax Films', 40), ('Village Roadshow Pictures', 36), ('Regency Enterprises', 31), ('BBC Films', 30), ('Dune Entertainment', 30), ('Working Title Films', 30), ('Fox Searchlight Pictures', 29), ('StudioCanal', 28), ('Lionsgate', 28), ('DreamWorks SKG', 27), ('Fox 2000 Pictures', 25), ('Summit Entertainment', 24), ('Hollywood Pictures', 24), ('Orion Pictures', 24), ('Amblin Entertainment', 23), ('Dimension Films', 23)]

For now I’m not sure what to do with this data. I’ll simply create binary columns for top-30 films. Maybe later I’ll have a better idea.

train['num_companies'] = train['production_companies'].apply(lambda x: len(x) if x != {} else 0)
train['all_production_companies'] = train['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_companies = [m[0] for m in Counter([i for j in list_of_companies for i in j]).most_common(30)]
for g in top_companies:train['production_company_' + g] = train['all_production_companies'].apply(lambda x: 1 if g in x else 0)test['num_companies'] = test['production_companies'].apply(lambda x: len(x) if x != {} else 0)
test['all_production_companies'] = test['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_companies:test['production_company_' + g] = test['all_production_companies'].apply(lambda x: 1 if g in x else 0)train = train.drop(['production_companies', 'all_production_companies'], axis=1)
test = test.drop(['production_companies', 'all_production_companies'], axis=1)

for i, e in enumerate(train['production_countries'][:5]):print(i, e)
0 [{'iso_3166_1': 'US', 'name': 'United States of America'}]
1 [{'iso_3166_1': 'US', 'name': 'United States of America'}]
2 [{'iso_3166_1': 'US', 'name': 'United States of America'}]
3 [{'iso_3166_1': 'IN', 'name': 'India'}]
4 [{'iso_3166_1': 'KR', 'name': 'South Korea'}]
print('Number of production countries in films')
train['production_countries'].apply(lambda x: len(x) if x != {} else 0).value_counts()

Number of production countries in films
1    2222
2     525
3     116
4      57
0      55
5      21
6       3
8       1
Name: production_countries, dtype: int64

Normally films are produced by a single country, but there are cases when companies from several countries worked together.

list_of_countries = list(train['production_countries'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
Counter([i for j in list_of_countries for i in j]).most_common(25)
[('United States of America', 2282), ('United Kingdom', 380), ('France', 222), ('Germany', 167), ('Canada', 120), ('India', 81), ('Italy', 64), ('Japan', 61), ('Australia', 61), ('Russia', 58), ('Spain', 54), ('China', 42), ('Hong Kong', 42), ('Ireland', 23), ('Belgium', 23), ('South Korea', 22), ('Mexico', 19), ('Sweden', 18), ('New Zealand', 17), ('Netherlands', 15), ('Czech Republic', 14), ('Denmark', 13), ('Brazil', 12), ('Luxembourg', 10), ('South Africa', 10)]
train['num_countries'] = train['production_countries'].apply(lambda x: len(x) if x != {} else 0)
train['all_countries'] = train['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_countries = [m[0] for m in Counter([i for j in list_of_countries for i in j]).most_common(25)]
for g in top_countries:train['production_country_' + g] = train['all_countries'].apply(lambda x: 1 if g in x else 0)test['num_countries'] = test['production_countries'].apply(lambda x: len(x) if x != {} else 0)
test['all_countries'] = test['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_countries:test['production_country_' + g] = test['all_countries'].apply(lambda x: 1 if g in x else 0)train = train.drop(['production_countries', 'all_countries'], axis=1)
test = test.drop(['production_countries', 'all_countries'], axis=1)

Spoken languages

for i, e in enumerate(train['spoken_languages'][:5]):print(i, e)
0 [{'iso_639_1': 'en', 'name': 'English'}]
1 [{'iso_639_1': 'en', 'name': 'English'}]

kaggle TMDB 票房预测 -step by step解读代码相关推荐

Kaggle TMDB 票房预测挑战赛
Kagle 为我们提供了 7000 多部过去影片的数据,通过这些数据尝试预测全球票房总收入. 提供的数据包括演员.制片组.情节关键字.预算.海报.上映日期.语言.制作公司和国家. 1. loading ...
Cheat Engine Step 8详细解读
Cheat Engine Step 8详细解读一:如何理解四级指针它由第一个指针指向第二个指针,再由第二个指针指向第三个指针,由第三个指针指向第四个指针,最终指向健康值的真正地址.即 [动态地址3 ...
kaggle竞赛--房价预测详细解读
## Kaggle竞赛 -- 房价预测 (House Prices) #### 完整代码见[kaggle kernel](https://www.kaggle.com/massquantity/all ...
Step by Step演示如何训练Pytorch版的EfficientDet
向AI转型的程序员都关注了这个号???????????? 机器学习AI算法工程公众号:datayx Paper:https://arxiv.org/abs/1911.09070 Base Git ...
《动手深度学习》4.10. 实战Kaggle比赛：预测房价
4.10. 实战Kaggle比赛:预测房价本节内容预览数据下载和缓存数据集访问和读取数据集使用pandas读入并处理数据数据预处理处理缺失值&对数值类数据标准化处理离散值-on ...
04.10. 实战Kaggle比赛：预测房价
4.10. 实战Kaggle比赛:预测房价详细介绍数据预处理.模型设计和超参数选择. 通过亲身实践,你将获得一手经验,这些经验将有益数据科学家的职业成长. import hashlib import ...
GAN Step By Step -- Step4 CGAN
GAN Step By Step 心血来潮 GSBS,顾名思义,我希望我自己能够一步一步的学习GAN.GAN 又名生成对抗网络,是最近几年很热门的一种无监督算法,他能生成出非常逼真的照片,图像甚至视 ...
实战Kaggle比赛：预测房价
文章目录实战Kaggle比赛:预测房价 1 - 下载和缓存数据集 2 - 访问和读取数据集 3 - 数据预处理 4 - 训练 5 - K折交叉验证 6 - 模型选择 7 - 提交你的Kaggle预测 ...
基于Python机器学习算法的电影推荐系统以及票房预测系统
电影数据分析目录电影数据分析 1 一..实验概述 1 1.1 实验标 1 1.2 .实验完成情况 1 二..电影特征的可视化分析 2 电影票房预测 9 2.1 Data Augmentation ...

kaggle TMDB 票房预测 -step by step解读代码

变量构造

belongs_to_collection

genres

production_companies&production_countries&Spoken languages&Keywords

kaggle TMDB 票房预测 -step by step解读代码相关推荐

最新文章

热门文章