Kaggle provides data on more than 7,000 past films; using this data, we try to predict worldwide box-office revenue.

The data provided includes cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.

1. Loading the relevant modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
plt.rcParams['font.sans-serif'] = ['SimHei']  # font setup kept from the original (needed for CJK labels)
plt.rcParams['axes.unicode_minus'] = False
plt.style.use('ggplot')
from collections import Counter
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from nltk.corpus import stopwords
pd.set_option('display.max_columns', None)  # the dataset has many columns; show them all
import os
print(os.listdir("./input"))  # list all files in the input directory
import warnings
warnings.filterwarnings('ignore')  # suppress warnings

EDA

  • Load the training and test data
train = pd.read_csv("./input/train.csv")
test = pd.read_csv("./input/test.csv")
print(train.shape, test.shape)  # check the size of each dataset
print(set(train.columns).difference(set(test.columns)))

  • Comparing dataset sizes, the test set is relatively large compared to our training set, so we must be careful not to overfit
  • The test set has no revenue column; that is exactly the box-office figure we need to predict
  • First add a revenue column to the test set, then concatenate the training and test sets so that all subsequent feature engineering is applied to both consistently
test["revenue"] = np.nan
all_movies = pd.concat([train, test])
all_movies.head()

all_movies.columns.tolist()

Column meanings

  • movie_id: TMDB movie ID
  • title: movie title
  • cast: list of cast members
  • director: director
  • budget: budget (USD)
  • genres: list of genres the movie belongs to
  • homepage: URL of the movie's homepage
  • original_language: original language of the movie
  • original_title: original title
  • overview: plot summary
  • popularity: relative page-view count on The Movie Database
  • production_companies: production companies
  • production_countries: production countries
  • release_date: release date
  • revenue: box-office revenue (the prediction target)
  • runtime: movie length
  • spoken_languages: spoken languages
  • status: release status
  • tagline: the movie's tagline
  • crew: production crew
all_movies.shape

(7398, 23)

# Replace null values in belongs_to_collection with ""
def clean_belongs_to_collection(x):
    if x is np.nan:
        return ""
    x = x[1:-1]
    return eval(x)['name']  # extract the collection name

all_movies["collection"] = all_movies["belongs_to_collection"].apply(clean_belongs_to_collection)
print(all_movies["collection"].isnull().sum())
print(all_movies["collection"].head())

def get_genres(x):
    if x is np.nan:
        return ""
    genres = list()
    for genre_dict in eval(x):
        genres.append(genre_dict['name'])
    return ",".join(genres)

all_movies["genres_list"] = all_movies["genres"].apply(get_genres)
print(all_movies["genres_list"].isnull().sum())
all_movies["genres_list"].head()

# Count the number of movies in each genre
genres = list()
for i in range(all_movies.shape[0]):
    genres += all_movies.iloc[i]["genres_list"].split(",")
genres = Counter(genres)
genres.most_common(25)
# Output
[('Drama', 3676), ('Comedy', 2605), ('Thriller', 1869), ('Action', 1735),
 ('Romance', 1435), ('Adventure', 1116), ('Crime', 1084), ('Science Fiction', 744),
 ('Horror', 735), ('Family', 675), ('Fantasy', 628), ('Mystery', 550),
 ('Animation', 382), ('History', 295), ('Music', 267), ('War', 243),
 ('Documentary', 221), ('Western', 117), ('Foreign', 84), ('', 23), ('TV Movie', 1)]

Drawing a word cloud

from pyecharts.charts import Page, WordCloud
from pyecharts.globals import SymbolType
from pyecharts import options as opts
from pyecharts.globals import CurrentConfig, NotebookType
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB
# note: this WordCloud (pyecharts) shadows the one imported earlier from the wordcloud package
attr = []
value = []
for i in genres.most_common(25):
    attr.append(i[0])
    value.append(i[1])
words = genres.most_common()

def wordcloud_diamond() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Word cloud of movie genres"))
    )
    return c

wordcloud_diamond().load_javascript()
# under JupyterLab the JavaScript must be loaded first; plain notebooks can skip this
wordcloud_diamond().render_notebook()

for genre, count in genres.most_common(25)[:-2]:  # skip the empty string and 'TV Movie'
    # create a separate indicator column for each genre
    all_movies[genre] = all_movies["genres_list"].apply(lambda x: 1 if genre in x else 0)
all_movies = all_movies.drop(["genres_list"], axis=1)
all_movies.isnull().sum()

all_movies["has_webpage"] = all_movies["homepage"].apply(lambda x: 0 if x is np.nan else 1)
all_movies = all_movies.drop(["belongs_to_collection", "genres","homepage", "imdb_id", "poster_path"], axis=1)
all_movies["has_webpage"].value_counts()

  • 0: no homepage
  • 1: has a homepage
sns.catplot(x='has_webpage', y='revenue', data=all_movies)
plt.title('Homepage vs. revenue')
plt.show()

  • As the plot shows, movies with a homepage tend to earn more at the box office; the quick check below puts numbers on this
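A minimal sketch comparing mean revenue per group (the name known_revenue is only for illustration; only training rows carry a revenue value):

# Compare mean revenue for movies with and without a homepage
known_revenue = all_movies[all_movies["revenue"].notnull()]
print(known_revenue.groupby("has_webpage")["revenue"].mean())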
# One-hot encoding categorical features makes distance computations more meaningful
dummy_orig_langs = pd.get_dummies(all_movies["original_language"], prefix="original_lang")
all_movies = pd.concat([all_movies, dummy_orig_langs], axis=1)
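Why this helps KNN: integer codes would impose an arbitrary ordering on languages, while one-hot vectors make every pair of distinct languages equally far apart. A toy sketch (values invented for illustration):

import numpy as np

# Integer codes en=0, fr=1, zh=2 would make |en-zh| = 2 but |en-fr| = 1: a meaningless ordering.
# With one-hot vectors every distinct pair is the same distance apart:
en, fr, zh = np.array([1, 0, 0]), np.array([0, 1, 0]), np.array([0, 0, 1])
print(np.linalg.norm(en - fr), np.linalg.norm(en - zh))  # both sqrt(2): all pairs equidistant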
# Requires the nltk library; load the English stop words
stop_words = set(stopwords.words('english'))
text_cols = ["overview", "tagline"]
ovw_words = list()
tag_words = list()
for i in range(all_movies.shape[0]):
    try:
        ovw_words += all_movies.iloc[i]["overview"].replace(",", "").replace(".", "").lower().split()
        tag_words += all_movies.iloc[i]["tagline"].replace(",", "").replace(".", "").lower().split()
    except AttributeError as e:
        continue
ovw_words = Counter([w for w in ovw_words if len(w) > 4 and w not in stop_words])
tag_words = Counter([w for w in tag_words if len(w) > 4 and w not in stop_words])
print(ovw_words.most_common(10))
print(tag_words.most_common(10))
words = ovw_words.most_common()

def wordcloud_diamond() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Word cloud of movie overviews"))
    )
    return c

wordcloud_diamond().load_javascript()
wordcloud_diamond().render_notebook()

words = tag_words.most_common()

def wordcloud_diamond() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Word cloud of movie taglines"))
    )
    return c

wordcloud_diamond().load_javascript()
wordcloud_diamond().render_notebook()

# Create an indicator column for each of the 100 most common overview and tagline words
for word, _ in ovw_words.most_common(100):
    col = "overview_" + word
    all_movies[col] = all_movies["overview"].apply(lambda x: 0 if x is np.nan else 1 if word in x.lower() else 0)
for word, _ in tag_words.most_common(100):
    col = "tagline_" + word
    all_movies[col] = all_movies["tagline"].apply(lambda x: 0 if x is np.nan else 1 if word in x.lower() else 0)
all_movies.shape

ovw_high_var_list = list()
tag_high_var_list = list()
train_idx = train.shape[0]
train = all_movies.iloc[:train_idx]
ovw_cols = [col for col in list(train.columns) if "overview_" in col]
tag_cols = [col for col in list(train.columns) if "tagline_" in col]
# For each overview/tagline word column, compute the variance of revenue
# among the movies that contain that word
for col in ovw_cols:
    ovw_high_var_list.append((col, train[train[col] == 1]["revenue"].var()))
for col in tag_cols:
    tag_high_var_list.append((col, train[train[col] == 1]["revenue"].var()))
# reverse=True sorts by revenue variance in descending order
ovw_high_var_list = sorted(ovw_high_var_list, key=lambda x: x[1], reverse=True)
tag_high_var_list = sorted(tag_high_var_list, key=lambda x: x[1], reverse=True)
# keep the 50 highest-variance columns of each kind and drop the rest
ovw_drop_cols = [x[0] for x in ovw_high_var_list[50:]]
tag_drop_cols = [x[0] for x in tag_high_var_list[50:]]
all_movies = all_movies.drop((ovw_drop_cols + tag_drop_cols), axis=1)
all_movies.shape
  • Check the number of missing values
all_movies.isnull().sum()

Handling missing data

all_movies.loc[all_movies['title'].isnull(), 'title'] = all_movies.loc[all_movies['title'].isnull(), 'original_title']
all_movies['status'].fillna("Released", inplace=True)
# Fill in runtimes using information from https://www.imdb.com
all_movies.loc[all_movies['title'] == 'Happy Weekend', 'runtime'] = 81
all_movies.loc[all_movies['title'] == 'Miesten välisiä keskusteluja', 'runtime'] = 90
all_movies.loc[all_movies['title'] == 'Nunca en horas de clase', 'runtime'] = 100
all_movies.loc[all_movies['title'] == 'Pancho, el perro millonario', 'runtime'] = 91
all_movies.loc[all_movies['title'] == 'La caliente niña Julietta', 'runtime'] = 93
all_movies.loc[all_movies['title'] == 'Королёв', 'runtime'] = 130
# The release date of "Jails, Hospitals & Hip-Hop" is May 2000
all_movies.loc[all_movies['release_date'].isnull(), 'release_date'] = '5/1/00'
  • Format the release date
releaseDate = pd.to_datetime(all_movies['release_date'])
all_movies['release_dayofweek'] = releaseDate.dt.dayofweek
all_movies['release_quarter'] = releaseDate.dt.quarter
all_movies["release_month"] = all_movies["release_date"].apply(lambda x: x.split("/")[0])
all_movies["release_month"].value_counts()

  • Extract the features that contain multiple values
# This function was used earlier to get the genres
def get_list_of_values(x, key):
    if x is np.nan:
        return ""
    vals = list()
    for val in eval(x):
        vals.append(val[key])
    return ",".join(vals)

def find_most_common(col, n):
    values = list()
    for i in range(all_movies.shape[0]):
        values += all_movies.iloc[i][col].split(",")
    return Counter(values).most_common(n)

def one_hot_encode_most_common(new_col, list_col, cmn_lst):
    for name, cnt in cmn_lst:
        all_movies[new_col + "_" + name] = all_movies[list_col].apply(lambda x: 1 if name in x else 0)
    return None

# production companies
all_movies["companies_list"] = all_movies["production_companies"].apply(get_list_of_values, args=('name',))
most_cmn_comps = find_most_common("companies_list", 10)
one_hot_encode_most_common("production_companies", "companies_list", most_cmn_comps)

# production countries
all_movies["countries_list"] = all_movies["production_countries"].apply(get_list_of_values, args=('iso_3166_1',))
most_cmn_countries = find_most_common("countries_list", 25)
one_hot_encode_most_common("production_countries", "countries_list", most_cmn_countries)

# spoken languages
all_movies["spoken_lang_list"] = all_movies["spoken_languages"].apply(get_list_of_values, args=('iso_639_1',))
most_cmn_langs = find_most_common("spoken_lang_list", 25)
one_hot_encode_most_common("spoken_languages", "spoken_lang_list", most_cmn_langs)

# Keywords
all_movies["keywords_list"] = all_movies["Keywords"].apply(get_list_of_values, args=('name',))
most_cmn_kywds = find_most_common("keywords_list", 25)
one_hot_encode_most_common("Keywords", "keywords_list", most_cmn_kywds)

# cast: fill missing values with three placeholder entries (the original used a
# single dict with repeated keys, which collapses to one entry), then take the
# gender of the first three cast members
all_movies.loc[all_movies['cast'].isnull(), 'cast'] = "[{'gender': ''}, {'gender': ''}, {'gender': ''}]"
all_movies['cast_gender_0'] = all_movies['cast'].apply(lambda x: np.nan if len(eval(x)) < 1 else eval(x)[0]['gender'])
all_movies['cast_gender_1'] = all_movies['cast'].apply(lambda x: np.nan if len(eval(x)) < 2 else eval(x)[1]['gender'])
all_movies['cast_gender_2'] = all_movies['cast'].apply(lambda x: np.nan if len(eval(x)) < 3 else eval(x)[2]['gender'])
all_movies.shape
# Output: (7398, 220)
  • Drop columns that are no longer useful
all_movies = all_movies.drop(["release_date", "production_companies", "production_countries","spoken_languages", "Keywords", "cast", "crew", "overview", "tagline","companies_list", "countries_list", "spoken_lang_list","keywords_list"], axis=1)
# gender has more than two possible values, so one-hot encode it
# pd.get_dummies() converts the categorical codes into integer indicator columns:
dummy_genders_0 = pd.get_dummies(all_movies["cast_gender_0"], prefix="first_cast_gender")
all_movies = pd.concat([all_movies, dummy_genders_0], axis=1)
dummy_genders_1 = pd.get_dummies(all_movies["cast_gender_1"], prefix="scnd_cast_gender_1")
all_movies = pd.concat([all_movies, dummy_genders_1], axis=1)
dummy_genders_2 = pd.get_dummies(all_movies["cast_gender_2"], prefix="thrd_cast_gender_2")
all_movies = pd.concat([all_movies, dummy_genders_2], axis=1)
all_movies = all_movies.drop(["cast_gender_0", "cast_gender_1", "cast_gender_2"], axis=1)
  • Check whether any column other than revenue still contains nulls
all_movies[[col for col in all_movies.columns.tolist() if col != "revenue"]].isnull().sum().sum()

dummy_months = pd.get_dummies(all_movies["release_month"], prefix="month")
all_movies = pd.concat([all_movies, dummy_months], axis=1)
all_movies = all_movies.drop(['original_language','original_title','status','title','collection'], axis=1)
num_movies = all_movies.select_dtypes(include=['float64'])
num_movies = pd.concat([num_movies, all_movies[["budget"]]], axis=1)
num_movies.describe()


Before using KNN, we need to normalize budget, runtime, and popularity.

# MinMaxScaler() could be used for this normalization as well
def normalize_col(df, col):
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return None

normalize_col(all_movies, "popularity")
normalize_col(all_movies, "runtime")
normalize_col(all_movies, "budget")
all_movies[["popularity_norm", "runtime_norm", "budget_norm"]].describe()

train = all_movies.iloc[:train_idx]
test = all_movies.iloc[train_idx:]
all_movies.columns.tolist()

Data Exploration

plt.figure(figsize=(20, 10))
plt.subplot(1, 3, 1)
plt.hist(all_movies['runtime'].fillna(0) / 60, bins=40)
plt.title('Distribution of movie runtimes (hours)')
plt.subplot(1, 3, 2)
plt.scatter(all_movies['runtime'].fillna(0), all_movies['revenue'])
plt.title('runtime vs revenue')
plt.subplot(1, 3, 3)
plt.scatter(all_movies['runtime'].fillna(0), all_movies['popularity'])
plt.title('runtime vs popularity')

  • Runtime and revenue show a positive relationship; runtime and popularity also trend positive, though less clearly. The correlation check below quantifies both
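A minimal sketch to quantify those trends (the name known is only for illustration; only rows with a known revenue are used):

# Pearson correlation of runtime with revenue and popularity
known = all_movies[all_movies["revenue"].notnull()]
print(known[["runtime", "revenue", "popularity"]].corr()["runtime"])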
plt.figure(figsize=(15, 8))
sns.countplot(all_movies['release_dayofweek'])
plt.title('Releases by day of week (0 = Monday, 6 = Sunday)')
plt.show()

  • Surprisingly, the most common release day is not on the weekend but on Thursday
plt.figure(figsize=(15, 8))
sns.countplot(all_movies['release_month'])
plt.title('Number of releases per month')
plt.show()

plt.figure(figsize=(15, 8))
month_pivot = train.pivot_table(index="release_month", values="revenue", aggfunc=np.mean)
sns.barplot(x=month_pivot.index, y=month_pivot.revenue)
plt.show()
  • Mean revenue by release month
  • June revenue is clearly higher; when building the model we can make June its own feature
plt.figure(figsize=(15, 8))
month_pivot = train.pivot_table(index="release_quarter", values="revenue", aggfunc=np.mean)
sns.barplot(x=month_pivot.index, y=month_pivot.revenue)
plt.title('Mean revenue by release quarter')
plt.show()


Summer and winter seem to favor box-office revenue, perhaps because most families gather during those periods. I will flag the summer and winter months as separate features.

all_movies["summer"] = all_movies["release_month"].apply(lambda x: 1 if x in ['5','6','7'] else 0)
all_movies["winter"] = all_movies["release_month"].apply(lambda x: 1 if x in ['11', '12'] else 0)
train_idx = train.shape[0]
train = all_movies.iloc[:train_idx]
test = all_movies.iloc[train_idx:]
corr = all_movies.corr()
corr['revenue'].sort_values(ascending=False).head(20)

corr['revenue'].sort_values(ascending=False).tail(20)

Modeling

I use a KNN regression model for the predictions.
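All experiments below score predictions with the square root of sklearn's mean_squared_log_error, i.e. RMSLE. A tiny worked example with invented numbers:

import numpy as np
from sklearn.metrics import mean_squared_log_error

# RMSLE = sqrt(mean((log(1 + y_true) - log(1 + y_pred))^2));
# the log keeps blockbuster-sized absolute errors from dominating the score
y_true = np.array([1_000_000, 50_000_000])
y_pred = np.array([2_000_000, 40_000_000])
print(np.sqrt(mean_squared_log_error(y_true, y_pred)))  # about 0.51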

Setting up the feature sets

simple_features = ['popularity_norm', 'runtime_norm', 'budget_norm']
other_features = ['has_webpage']
overview_features = [col for col in train.columns.tolist() if "overview" in col]
tagline_features = [col for col in train.columns.tolist() if "tagline" in col]
company_features = [col for col in train.columns.tolist() if "production_companies" in col]
country_features = [col for col in train.columns.tolist() if "production_countries" in col]
spoken_lang_features = [col for col in train.columns.tolist() if "spoken_languages" in col]
keyword_features = [col for col in train.columns.tolist() if "Keywords_" in col]
cast_gender_features = [col for col in train.columns.tolist() if "cast_gender_" in col]
month_features = [col for col in train.columns.tolist() if "month_" in col]
season_features = ['summer', 'winter']
june = ['month_6']
all_features = [col for col in train.columns.tolist() if col not in ["revenue", "id", "release_month", "budget", "popularity", "runtime"]]
genre_features = ['Drama','Comedy','Thriller','Action','Romance','Adventure','Crime','Science Fiction','Horror','Family','Fantasy','Mystery','Animation','History','Music','War','Documentary','Western','Foreign']

Evaluating each feature set and choosing K

from sklearn.metrics import mean_squared_log_error

feature_sets = [simple_features, other_features, overview_features, tagline_features,
                company_features, country_features, spoken_lang_features, keyword_features,
                cast_gender_features, month_features, season_features, june,
                all_features, genre_features]
feature_set_strings = ["simple_features", "other_features", "overview_features", "tagline_features",
                       "company_features", "country_features", "spoken_lang_features", "keyword_features",
                       "cast_gender_features", "month_features", "season_features", "june",
                       "all_features", "genre_features"]
y = train["revenue"]
k_s = [3, 5, 7, 9, 11, 13, 15, 17, 19]
model_results = list()
for feature_set, set_string in zip(feature_sets, feature_set_strings):
    print(set_string)
    features = list(set(feature_set + simple_features))
    X = train[features]
    kf = KFold(n_splits=5, random_state=319, shuffle=True)
    feature_set_errors = list()
    for k in k_s:
        i = 0
        print(str(k) + " neighbors")
        for train_idx, test_idx in kf.split(X):
            i += 1
            model = KNeighborsRegressor(n_neighbors=k)
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            error = np.sqrt(mean_squared_log_error(y_test, predictions))
            feature_set_errors.append(error)
            print("Fold " + str(i) + ": " + str(round(error, 4)))
        print(set_string + " " + str(k) + " neighbors mean: " + str(round(np.mean(feature_set_errors), 4)))
        model_results.append([(set_string + "_" + str(k)), round(np.mean(feature_set_errors), 4)])
model_results = sorted(model_results, key=lambda x: x[1])
model_results[:30]
# Output
[['company_features_3', 2.5645], ['company_features_5', 2.5741], ['company_features_7', 2.5809],
 ['company_features_9', 2.5858], ['company_features_11', 2.5909], ['company_features_13', 2.5959],
 ['company_features_15', 2.6005], ['company_features_17', 2.6054], ['company_features_19', 2.6102],
 ['simple_features_3', 2.64], ['simple_features_5', 2.6413], ['simple_features_7', 2.6427],
 ['other_features_3', 2.6452], ['simple_features_9', 2.6497], ['other_features_5', 2.6509],
 ['other_features_7', 2.6542], ['simple_features_11', 2.657], ['other_features_9', 2.6611],
 ['simple_features_13', 2.6629], ['june_5', 2.6629], ['june_7', 2.664], ['june_3', 2.6646],
 ['june_9', 2.6675], ['simple_features_15', 2.6683], ['other_features_11', 2.6695],
 ['june_11', 2.6725], ['simple_features_17', 2.6729], ['other_features_13', 2.675],
 ['country_features_3', 2.6757], ['simple_features_19', 2.6768]]
  • company_features, simple_features, other_features, and country_features give the lowest errors
  • The june feature does best with 5-9 neighbors; more than 9 is a poor fit
all_features = simple_features + other_features + company_features + country_features + june
my_hunch = simple_features + company_features + other_features + june
feature_sets = [simple_features, other_features, company_features, country_features, june, all_features, my_hunch]
feature_set_strings = ['simple_features', 'other_features', 'company_features', 'country_features', 'june', 'all_features', 'my_hunch']
y = train["revenue"]
k_s = [k for k in range(1, 10)]
model_results = list()
for feature_set, set_string in zip(feature_sets, feature_set_strings):
    print(set_string)
    features = list(set(feature_set + simple_features))
    X = train[features]
    kf = KFold(n_splits=5, random_state=319, shuffle=True)
    feature_set_errors = list()
    for k in k_s:
        i = 0
        print(str(k) + " neighbors")
        for train_idx, test_idx in kf.split(X):
            i += 1
            model = KNeighborsRegressor(n_neighbors=k)
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            error = np.sqrt(mean_squared_log_error(y_test, predictions))
            feature_set_errors.append(error)
            print("Fold " + str(i) + ": " + str(round(error, 4)))
        print(set_string + " " + str(k) + " neighbors mean: " + str(round(np.mean(feature_set_errors), 4)))
        model_results.append([(set_string + "_" + str(k)), round(np.mean(feature_set_errors), 4)])
model_results = sorted(model_results, key=lambda x: x[1])
model_results[:25]
# Output
[['company_features_9', 2.6601], ['company_features_8', 2.6675], ['my_hunch_9', 2.6729],
 ['company_features_7', 2.6764], ['my_hunch_8', 2.6781], ['my_hunch_7', 2.6853],
 ['company_features_6', 2.6901], ['my_hunch_6', 2.6967], ['company_features_5', 2.7084],
 ['simple_features_9', 2.7093], ['other_features_9', 2.7113], ['simple_features_8', 2.7141],
 ['my_hunch_5', 2.7141], ['other_features_8', 2.715], ['simple_features_7', 2.7212],
 ['other_features_7', 2.7217], ['june_9', 2.7284], ['other_features_6', 2.7319],
 ['simple_features_6', 2.7338], ['june_8', 2.7347], ['company_features_4', 2.7395],
 ['my_hunch_4', 2.7413], ['june_7', 2.7436], ['other_features_5', 2.7469],
 ['all_features_9', 2.7493]]

These results look off, so let's rerun the search over a wider range of k values.

all_features = simple_features + other_features + company_features + country_features + june
my_hunch = simple_features + company_features + june + other_features
feature_sets = [simple_features, other_features, company_features, country_features, june, all_features, my_hunch]
feature_set_strings = ['simple_features', 'other_features', 'company_features', 'country_features', 'june', 'all_features', 'my_hunch']
y = train["revenue"]
k_s = [k for k in range(1, 20)]
model_results = list()
for feature_set, set_string in zip(feature_sets, feature_set_strings):
    features = list(set(feature_set + simple_features))
    X = train[features]
    kf = KFold(n_splits=5, random_state=319, shuffle=True)
    feature_set_errors = list()
    for k in k_s:
        for train_idx, test_idx in kf.split(X):
            model = KNeighborsRegressor(n_neighbors=k)
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            error = np.sqrt(mean_squared_log_error(y_test, predictions))
            feature_set_errors.append(error)
        model_results.append([(set_string + "_" + str(k)), round(np.mean(feature_set_errors), 4)])
model_results = sorted(model_results, key=lambda x: x[1])
model_results[:25]
# Output
[['company_features_17', 2.6426], ['company_features_18', 2.6426], ['company_features_16', 2.6428],
 ['company_features_19', 2.6429], ['company_features_15', 2.6432], ['company_features_14', 2.6443],
 ['company_features_13', 2.6457], ['company_features_12', 2.6478], ['company_features_11', 2.6509],
 ['company_features_10', 2.6548], ['company_features_9', 2.6601], ['my_hunch_14', 2.6636],
 ['my_hunch_15', 2.6636], ['my_hunch_16', 2.6639], ['my_hunch_13', 2.6642], ['my_hunch_17', 2.6646],
 ['my_hunch_12', 2.6655], ['my_hunch_18', 2.6655], ['my_hunch_19', 2.6664], ['my_hunch_11', 2.6669],
 ['company_features_8', 2.6675], ['my_hunch_10', 2.669], ['my_hunch_9', 2.6729],
 ['company_features_7', 2.6764], ['my_hunch_8', 2.6781]]

From these initial results, K = 17 with company_features scores best, though we still need to verify that it is actually a good fit.

features = (company_features + simple_features)
knn = KNeighborsRegressor(n_neighbors=17)
knn.fit(train[features], train['revenue'])
predictions17 = knn.predict(test[features])

Validating model accuracy

1. AdaBoost regression

from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
features = (company_features + simple_features)
# Hold out part of the training data for validation. The split parameters below
# are assumed; the original text does not show this step.
train_x, test_x, train_y, test_y = train_test_split(train[features], train['revenue'], test_size=0.25, random_state=319)
regressor = AdaBoostRegressor(n_estimators=3)
regressor.fit(train_x, train_y)
pred_y = regressor.predict(test_x)
# note: this is MSLE (no square root), so the values are not on the same scale as the RMSLE above
mse = mean_squared_log_error(test_y, pred_y)
print("AdaBoost error = ", round(mse, 2))

AdaBoost error = 9.93

2. KNN regression

With 17 neighbors

knn_regressor = KNeighborsRegressor(n_neighbors=17)
knn_regressor.fit(train_x, train_y)
pred_y = knn_regressor.predict(test_x)
mse = mean_squared_log_error(test_y, pred_y)
print("KNN 误差 = ",round(mse, 2))

KNN error = 7.49

3. KNN regression

With 3 neighbors

knn_regressor = KNeighborsRegressor(n_neighbors=3)
knn_regressor.fit(train_x, train_y)
pred_y = knn_regressor.predict(test_x)
mse = mean_squared_log_error(test_y, pred_y)
print("KNN 误差 = ",round(mse, 2))

KNN error = 7.32

  • The KNN model with 3 neighbors has the lowest error, so we take its predictions as the final result
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(train[features], train['revenue'])
predictions3 = knn.predict(test[features])
submission_df = {"id": test['id'], "revenue": predictions3}
submission3 = pd.DataFrame(submission_df)
submission3.to_csv("submission.csv", index=False)

Final score
