Kaggle provides data on more than 7,000 past films; using this data, we try to predict worldwide box-office revenue.

The data provided includes cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.

1. Loading the relevant modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
plt.rcParams['font.sans-serif'] = ['SimHei']  # font setup kept from the original (needed for CJK labels)
plt.rcParams['axes.unicode_minus'] = False
plt.style.use('ggplot')
from collections import Counter
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from nltk.corpus import stopwords
pd.set_option('display.max_columns', None)  # the dataset has many columns; show them all
import os
print(os.listdir("./input"))  # list all files in the input directory
import warnings
warnings.filterwarnings('ignore')  # suppress warnings

EDA

  • Load the training and test data
train = pd.read_csv("./input/train.csv")
test = pd.read_csv("./input/test.csv")
print(train.shape, test.shape)  # check the size of each dataset
print(set(train.columns).difference(set(test.columns)))

  • Comparing dataset sizes, the test set is relatively large compared to our training set, so we must be careful not to overfit
  • The test set has no revenue column; that is exactly the box-office figure we need to predict
  • First add a revenue column to the test set, then concatenate the training and test sets so that all subsequent feature engineering is applied to both consistently
test["revenue"] = np.nan
all_movies = pd.concat([train, test])
all_movies.head()

all_movies.columns.tolist()

Column meanings

  • movie_id: TMDB movie ID
  • title: movie title
  • cast: list of cast members
  • director: director
  • budget: budget (USD)
  • genres: list of genres the movie belongs to
  • homepage: URL of the movie's homepage
  • original_language: original language of the movie
  • original_title: original title
  • overview: plot summary
  • popularity: relative page-view count on The Movie Database
  • production_companies: production companies
  • production_countries: production countries
  • release_date: release date
  • revenue: box-office revenue (the prediction target)
  • runtime: movie length
  • spoken_languages: spoken languages
  • status: release status
  • tagline: the movie's tagline
  • crew: production crew
all_movies.shape

(7398, 23)

# Replace null values in belongs_to_collection with ""
def clean_belongs_to_collection(x):
    if x is np.nan:
        return ""
    x = x[1:-1]
    return eval(x)['name']  # extract the collection name

all_movies["collection"] = all_movies["belongs_to_collection"].apply(clean_belongs_to_collection)
print(all_movies["collection"].isnull().sum())
print(all_movies["collection"].head())

def get_genres(x):
    if x is np.nan:
        return ""
    genres = list()
    for genre_dict in eval(x):
        genres.append(genre_dict['name'])
    return ",".join(genres)

all_movies["genres_list"] = all_movies["genres"].apply(get_genres)
print(all_movies["genres_list"].isnull().sum())
all_movies["genres_list"].head()

# Count the number of movies in each genre
genres = list()
for i in range(all_movies.shape[0]):
    genres += all_movies.iloc[i]["genres_list"].split(",")
genres = Counter(genres)
genres.most_common(25)
# Output
[('Drama', 3676), ('Comedy', 2605), ('Thriller', 1869), ('Action', 1735),
 ('Romance', 1435), ('Adventure', 1116), ('Crime', 1084), ('Science Fiction', 744),
 ('Horror', 735), ('Family', 675), ('Fantasy', 628), ('Mystery', 550),
 ('Animation', 382), ('History', 295), ('Music', 267), ('War', 243),
 ('Documentary', 221), ('Western', 117), ('Foreign', 84), ('', 23), ('TV Movie', 1)]

Drawing a word cloud

from pyecharts.charts import Page, WordCloud
from pyecharts.globals import SymbolType
from pyecharts import options as opts
from pyecharts.globals import CurrentConfig, NotebookType
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB
# note: this WordCloud (pyecharts) shadows the one imported earlier from the wordcloud package
attr = []
value = []
for i in genres.most_common(25):
    attr.append(i[0])
    value.append(i[1])
words = genres.most_common()

def wordcloud_diamond() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Word cloud of movie genres"))
    )
    return c

wordcloud_diamond().load_javascript()
# under JupyterLab the JavaScript must be loaded first; plain notebooks can skip this
wordcloud_diamond().render_notebook()

for genre, count in genres.most_common(25)[:-2]:  # skip the empty string and 'TV Movie'
    # create a separate indicator column for each genre
    all_movies[genre] = all_movies["genres_list"].apply(lambda x: 1 if genre in x else 0)
all_movies = all_movies.drop(["genres_list"], axis=1)
all_movies.isnull().sum()

all_movies["has_webpage"] = all_movies["homepage"].apply(lambda x: 0 if x is np.nan else 1)
all_movies = all_movies.drop(["belongs_to_collection", "genres","homepage", "imdb_id", "poster_path"], axis=1)
all_movies["has_webpage"].value_counts()

  • 0: no homepage
  • 1: has a homepage
sns.catplot(x='has_webpage', y='revenue', data=all_movies)
plt.title('Homepage vs. revenue')
plt.show()

  • As the plot shows, movies with a homepage tend to earn more at the box office; the quick check below puts numbers on this
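A minimal sketch comparing mean revenue per group (the name known_revenue is only for illustration; only training rows carry a revenue value):

# Compare mean revenue for movies with and without a homepage
known_revenue = all_movies[all_movies["revenue"].notnull()]
print(known_revenue.groupby("has_webpage")["revenue"].mean())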
# One-hot encoding categorical features makes distance computations more meaningful
dummy_orig_langs = pd.get_dummies(all_movies["original_language"], prefix="original_lang")
all_movies = pd.concat([all_movies, dummy_orig_langs], axis=1)
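Why this helps KNN: integer codes would impose an arbitrary ordering on languages, while one-hot vectors make every pair of distinct languages equally far apart. A toy sketch (values invented for illustration):

import numpy as np

# Integer codes en=0, fr=1, zh=2 would make |en-zh| = 2 but |en-fr| = 1: a meaningless ordering.
# With one-hot vectors every distinct pair is the same distance apart:
en, fr, zh = np.array([1, 0, 0]), np.array([0, 1, 0]), np.array([0, 0, 1])
print(np.linalg.norm(en - fr), np.linalg.norm(en - zh))  # both sqrt(2): all pairs equidistant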
# Requires the nltk library; load the English stop words
stop_words = set(stopwords.words('english'))
text_cols = ["overview", "tagline"]
ovw_words = list()
tag_words = list()
for i in range(all_movies.shape[0]):
    try:
        ovw_words += all_movies.iloc[i]["overview"].replace(",", "").replace(".", "").lower().split()
        tag_words += all_movies.iloc[i]["tagline"].replace(",", "").replace(".", "").lower().split()
    except AttributeError as e:
        continue
ovw_words = Counter([w for w in ovw_words if len(w) > 4 and w not in stop_words])
tag_words = Counter([w for w in tag_words if len(w) > 4 and w not in stop_words])
print(ovw_words.most_common(10))
print(tag_words.most_common(10))
words = ovw_words.most_common()

def wordcloud_diamond() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Word cloud of movie overviews"))
    )
    return c

wordcloud_diamond().load_javascript()
wordcloud_diamond().render_notebook()

words = tag_words.most_common()

def wordcloud_diamond() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Word cloud of movie taglines"))
    )
    return c

wordcloud_diamond().load_javascript()
wordcloud_diamond().render_notebook()

# Create an indicator column for each of the 100 most common overview and tagline words
for word, _ in ovw_words.most_common(100):
    col = "overview_" + word
    all_movies[col] = all_movies["overview"].apply(lambda x: 0 if x is np.nan else 1 if word in x.lower() else 0)
for word, _ in tag_words.most_common(100):
    col = "tagline_" + word
    all_movies[col] = all_movies["tagline"].apply(lambda x: 0 if x is np.nan else 1 if word in x.lower() else 0)
all_movies.shape

ovw_high_var_list = list()
tag_high_var_list = list()
train_idx = train.shape[0]
train = all_movies.iloc[:train_idx]
ovw_cols = [col for col in list(train.columns) if "overview_" in col]
tag_cols = [col for col in list(train.columns) if "tagline_" in col]
# For each overview/tagline word column, compute the variance of revenue
# among the movies that contain that word
for col in ovw_cols:
    ovw_high_var_list.append((col, train[train[col] == 1]["revenue"].var()))
for col in tag_cols:
    tag_high_var_list.append((col, train[train[col] == 1]["revenue"].var()))
# reverse=True sorts by revenue variance in descending order
ovw_high_var_list = sorted(ovw_high_var_list, key=lambda x: x[1], reverse=True)
tag_high_var_list = sorted(tag_high_var_list, key=lambda x: x[1], reverse=True)
# keep the 50 highest-variance columns of each kind and drop the rest
ovw_drop_cols = [x[0] for x in ovw_high_var_list[50:]]
tag_drop_cols = [x[0] for x in tag_high_var_list[50:]]
all_movies = all_movies.drop((ovw_drop_cols + tag_drop_cols), axis=1)
all_movies.shape
  • Check the number of missing values
all_movies.isnull().sum()

Handling missing data

all_movies.loc[all_movies['title'].isnull(), 'title'] = all_movies.loc[all_movies['title'].isnull(), 'original_title']
all_movies['status'].fillna("Released", inplace=True)
# Fill in runtimes using information from https://www.imdb.com
all_movies.loc[all_movies['title'] == 'Happy Weekend', 'runtime'] = 81
all_movies.loc[all_movies['title'] == 'Miesten välisiä keskusteluja', 'runtime'] = 90
all_movies.loc[all_movies['title'] == 'Nunca en horas de clase', 'runtime'] = 100
all_movies.loc[all_movies['title'] == 'Pancho, el perro millonario', 'runtime'] = 91
all_movies.loc[all_movies['title'] == 'La caliente niña Julietta', 'runtime'] = 93
all_movies.loc[all_movies['title'] == 'Королёв', 'runtime'] = 130
# The release date of "Jails, Hospitals & Hip-Hop" is May 2000
all_movies.loc[all_movies['release_date'].isnull(), 'release_date'] = '5/1/00'
  • Format the release date
releaseDate = pd.to_datetime(all_movies['release_date'])
all_movies['release_dayofweek'] = releaseDate.dt.dayofweek
all_movies['release_quarter'] = releaseDate.dt.quarter
all_movies["release_month"] = all_movies["release_date"].apply(lambda x: x.split("/")[0])
all_movies["release_month"].value_counts()

  • Extract the features that contain multiple values
# This function was used earlier to get the genres
def get_list_of_values(x, key):
    if x is np.nan:
        return ""
    vals = list()
    for val in eval(x):
        vals.append(val[key])
    return ",".join(vals)

def find_most_common(col, n):
    values = list()
    for i in range(all_movies.shape[0]):
        values += all_movies.iloc[i][col].split(",")
    return Counter(values).most_common(n)

def one_hot_encode_most_common(new_col, list_col, cmn_lst):
    for name, cnt in cmn_lst:
        all_movies[new_col + "_" + name] = all_movies[list_col].apply(lambda x: 1 if name in x else 0)
    return None

# production companies
all_movies["companies_list"] = all_movies["production_companies"].apply(get_list_of_values, args=('name',))
most_cmn_comps = find_most_common("companies_list", 10)
one_hot_encode_most_common("production_companies", "companies_list", most_cmn_comps)

# production countries
all_movies["countries_list"] = all_movies["production_countries"].apply(get_list_of_values, args=('iso_3166_1',))
most_cmn_countries = find_most_common("countries_list", 25)
one_hot_encode_most_common("production_countries", "countries_list", most_cmn_countries)

# spoken languages
all_movies["spoken_lang_list"] = all_movies["spoken_languages"].apply(get_list_of_values, args=('iso_639_1',))
most_cmn_langs = find_most_common("spoken_lang_list", 25)
one_hot_encode_most_common("spoken_languages", "spoken_lang_list", most_cmn_langs)

# Keywords
all_movies["keywords_list"] = all_movies["Keywords"].apply(get_list_of_values, args=('name',))
most_cmn_kywds = find_most_common("keywords_list", 25)
one_hot_encode_most_common("Keywords", "keywords_list", most_cmn_kywds)

# cast: fill missing values with three placeholder entries (the original used a
# single dict with repeated keys, which collapses to one entry), then take the
# gender of the first three cast members
all_movies.loc[all_movies['cast'].isnull(), 'cast'] = "[{'gender': ''}, {'gender': ''}, {'gender': ''}]"
all_movies['cast_gender_0'] = all_movies['cast'].apply(lambda x: np.nan if len(eval(x)) < 1 else eval(x)[0]['gender'])
all_movies['cast_gender_1'] = all_movies['cast'].apply(lambda x: np.nan if len(eval(x)) < 2 else eval(x)[1]['gender'])
all_movies['cast_gender_2'] = all_movies['cast'].apply(lambda x: np.nan if len(eval(x)) < 3 else eval(x)[2]['gender'])
all_movies.shape
# Output: (7398, 220)
  • Drop columns that are no longer useful
all_movies = all_movies.drop(["release_date", "production_companies", "production_countries","spoken_languages", "Keywords", "cast", "crew", "overview", "tagline","companies_list", "countries_list", "spoken_lang_list","keywords_list"], axis=1)
# gender has more than two possible values, so one-hot encode it
# pd.get_dummies() converts the categorical codes into integer indicator columns:
dummy_genders_0 = pd.get_dummies(all_movies["cast_gender_0"], prefix="first_cast_gender")
all_movies = pd.concat([all_movies, dummy_genders_0], axis=1)
dummy_genders_1 = pd.get_dummies(all_movies["cast_gender_1"], prefix="scnd_cast_gender_1")
all_movies = pd.concat([all_movies, dummy_genders_1], axis=1)
dummy_genders_2 = pd.get_dummies(all_movies["cast_gender_2"], prefix="thrd_cast_gender_2")
all_movies = pd.concat([all_movies, dummy_genders_2], axis=1)
all_movies = all_movies.drop(["cast_gender_0", "cast_gender_1", "cast_gender_2"], axis=1)
  • Check whether any column other than revenue still contains nulls
all_movies[[col for col in all_movies.columns.tolist() if col != "revenue"]].isnull().sum().sum()

dummy_months = pd.get_dummies(all_movies["release_month"], prefix="month")
all_movies = pd.concat([all_movies, dummy_months], axis=1)
all_movies = all_movies.drop(['original_language','original_title','status','title','collection'], axis=1)
num_movies = all_movies.select_dtypes(include=['float64'])
num_movies = pd.concat([num_movies, all_movies[["budget"]]], axis=1)
num_movies.describe()


Before using KNN, we need to normalize budget, runtime, and popularity.

# MinMaxScaler() could be used for this normalization as well
def normalize_col(df, col):
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return None

normalize_col(all_movies, "popularity")
normalize_col(all_movies, "runtime")
normalize_col(all_movies, "budget")
all_movies[["popularity_norm", "runtime_norm", "budget_norm"]].describe()

train = all_movies.iloc[:train_idx]
test = all_movies.iloc[train_idx:]
all_movies.columns.tolist()

Data Exploration

plt.figure(figsize=(20, 10))
plt.subplot(1, 3, 1)
plt.hist(all_movies['runtime'].fillna(0) / 60, bins=40)
plt.title('Distribution of movie runtimes (hours)')
plt.subplot(1, 3, 2)
plt.scatter(all_movies['runtime'].fillna(0), all_movies['revenue'])
plt.title('runtime vs revenue')
plt.subplot(1, 3, 3)
plt.scatter(all_movies['runtime'].fillna(0), all_movies['popularity'])
plt.title('runtime vs popularity')

  • Runtime and revenue show a positive relationship; runtime and popularity also trend positive, though less clearly. The correlation check below quantifies both
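A minimal sketch to quantify those trends (the name known is only for illustration; only rows with a known revenue are used):

# Pearson correlation of runtime with revenue and popularity
known = all_movies[all_movies["revenue"].notnull()]
print(known[["runtime", "revenue", "popularity"]].corr()["runtime"])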
plt.figure(figsize=(15, 8))
sns.countplot(all_movies['release_dayofweek'])
plt.title('Releases by day of week (0 = Monday, 6 = Sunday)')
plt.show()

  • Surprisingly, the most common release day is not on the weekend but on Thursday
plt.figure(figsize=(15, 8))
sns.countplot(all_movies['release_month'])
plt.title('Number of releases per month')
plt.show()

plt.figure(figsize=(15, 8))
month_pivot = train.pivot_table(index="release_month", values="revenue", aggfunc=np.mean)
sns.barplot(x=month_pivot.index, y=month_pivot.revenue)
plt.show()
  • Mean revenue by release month
  • June revenue is clearly higher; when building the model we can make June its own feature
plt.figure(figsize=(15, 8))
month_pivot = train.pivot_table(index="release_quarter", values="revenue", aggfunc=np.mean)
sns.barplot(x=month_pivot.index, y=month_pivot.revenue)
plt.title('Mean revenue by release quarter')
plt.show()


Summer and winter seem to favor box-office revenue, perhaps because most families gather during those periods. I will flag the summer and winter months as separate features.

all_movies["summer"] = all_movies["release_month"].apply(lambda x: 1 if x in ['5','6','7'] else 0)
all_movies["winter"] = all_movies["release_month"].apply(lambda x: 1 if x in ['11', '12'] else 0)
train_idx = train.shape[0]
train = all_movies.iloc[:train_idx]
test = all_movies.iloc[train_idx:]
corr = all_movies.corr()
corr['revenue'].sort_values(ascending=False).head(20)

corr['revenue'].sort_values(ascending=False).tail(20)

Modeling

I use a KNN regression model for the predictions.
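All experiments below score predictions with the square root of sklearn's mean_squared_log_error, i.e. RMSLE. A tiny worked example with invented numbers:

import numpy as np
from sklearn.metrics import mean_squared_log_error

# RMSLE = sqrt(mean((log(1 + y_true) - log(1 + y_pred))^2));
# the log keeps blockbuster-sized absolute errors from dominating the score
y_true = np.array([1_000_000, 50_000_000])
y_pred = np.array([2_000_000, 40_000_000])
print(np.sqrt(mean_squared_log_error(y_true, y_pred)))  # about 0.51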

Setting up the feature sets

simple_features = ['popularity_norm', 'runtime_norm', 'budget_norm']
other_features = ['has_webpage']
overview_features = [col for col in train.columns.tolist() if "overview" in col]
tagline_features = [col for col in train.columns.tolist() if "tagline" in col]
company_features = [col for col in train.columns.tolist() if "production_companies" in col]
country_features = [col for col in train.columns.tolist() if "production_countries" in col]
spoken_lang_features = [col for col in train.columns.tolist() if "spoken_languages" in col]
keyword_features = [col for col in train.columns.tolist() if "Keywords_" in col]
cast_gender_features = [col for col in train.columns.tolist() if "cast_gender_" in col]
month_features = [col for col in train.columns.tolist() if "month_" in col]
season_features = ['summer', 'winter']
june = ['month_6']
all_features = [col for col in train.columns.tolist() if col not in ["revenue", "id", "release_month", "budget", "popularity", "runtime"]]
genre_features = ['Drama','Comedy','Thriller','Action','Romance','Adventure','Crime','Science Fiction','Horror','Family','Fantasy','Mystery','Animation','History','Music','War','Documentary','Western','Foreign']

Evaluating each feature set and choosing K

from sklearn.metrics import mean_squared_log_error

feature_sets = [simple_features, other_features, overview_features, tagline_features,
                company_features, country_features, spoken_lang_features, keyword_features,
                cast_gender_features, month_features, season_features, june,
                all_features, genre_features]
feature_set_strings = ["simple_features", "other_features", "overview_features", "tagline_features",
                       "company_features", "country_features", "spoken_lang_features", "keyword_features",
                       "cast_gender_features", "month_features", "season_features", "june",
                       "all_features", "genre_features"]
y = train["revenue"]
k_s = [3, 5, 7, 9, 11, 13, 15, 17, 19]
model_results = list()
for feature_set, set_string in zip(feature_sets, feature_set_strings):
    print(set_string)
    features = list(set(feature_set + simple_features))
    X = train[features]
    kf = KFold(n_splits=5, random_state=319, shuffle=True)
    feature_set_errors = list()
    for k in k_s:
        i = 0
        print(str(k) + " neighbors")
        for train_idx, test_idx in kf.split(X):
            i += 1
            model = KNeighborsRegressor(n_neighbors=k)
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            error = np.sqrt(mean_squared_log_error(y_test, predictions))
            feature_set_errors.append(error)
            print("Fold " + str(i) + ": " + str(round(error, 4)))
        print(set_string + " " + str(k) + " neighbors mean: " + str(round(np.mean(feature_set_errors), 4)))
        model_results.append([(set_string + "_" + str(k)), round(np.mean(feature_set_errors), 4)])
model_results = sorted(model_results, key=lambda x: x[1])
model_results[:30]
# Output
[['company_features_3', 2.5645], ['company_features_5', 2.5741], ['company_features_7', 2.5809],
 ['company_features_9', 2.5858], ['company_features_11', 2.5909], ['company_features_13', 2.5959],
 ['company_features_15', 2.6005], ['company_features_17', 2.6054], ['company_features_19', 2.6102],
 ['simple_features_3', 2.64], ['simple_features_5', 2.6413], ['simple_features_7', 2.6427],
 ['other_features_3', 2.6452], ['simple_features_9', 2.6497], ['other_features_5', 2.6509],
 ['other_features_7', 2.6542], ['simple_features_11', 2.657], ['other_features_9', 2.6611],
 ['simple_features_13', 2.6629], ['june_5', 2.6629], ['june_7', 2.664], ['june_3', 2.6646],
 ['june_9', 2.6675], ['simple_features_15', 2.6683], ['other_features_11', 2.6695],
 ['june_11', 2.6725], ['simple_features_17', 2.6729], ['other_features_13', 2.675],
 ['country_features_3', 2.6757], ['simple_features_19', 2.6768]]
  • company_features, simple_features, other_features, and country_features give the lowest errors
  • The june feature does best with 5-9 neighbors; more than 9 is a poor fit
all_features = simple_features + other_features + company_features + country_features + june
my_hunch = simple_features + company_features + other_features + june
feature_sets = [simple_features, other_features, company_features, country_features, june, all_features, my_hunch]
feature_set_strings = ['simple_features', 'other_features', 'company_features', 'country_features', 'june', 'all_features', 'my_hunch']
y = train["revenue"]
k_s = [k for k in range(1, 10)]
model_results = list()
for feature_set, set_string in zip(feature_sets, feature_set_strings):
    print(set_string)
    features = list(set(feature_set + simple_features))
    X = train[features]
    kf = KFold(n_splits=5, random_state=319, shuffle=True)
    feature_set_errors = list()
    for k in k_s:
        i = 0
        print(str(k) + " neighbors")
        for train_idx, test_idx in kf.split(X):
            i += 1
            model = KNeighborsRegressor(n_neighbors=k)
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            error = np.sqrt(mean_squared_log_error(y_test, predictions))
            feature_set_errors.append(error)
            print("Fold " + str(i) + ": " + str(round(error, 4)))
        print(set_string + " " + str(k) + " neighbors mean: " + str(round(np.mean(feature_set_errors), 4)))
        model_results.append([(set_string + "_" + str(k)), round(np.mean(feature_set_errors), 4)])
model_results = sorted(model_results, key=lambda x: x[1])
model_results[:25]
# Output
[['company_features_9', 2.6601], ['company_features_8', 2.6675], ['my_hunch_9', 2.6729],
 ['company_features_7', 2.6764], ['my_hunch_8', 2.6781], ['my_hunch_7', 2.6853],
 ['company_features_6', 2.6901], ['my_hunch_6', 2.6967], ['company_features_5', 2.7084],
 ['simple_features_9', 2.7093], ['other_features_9', 2.7113], ['simple_features_8', 2.7141],
 ['my_hunch_5', 2.7141], ['other_features_8', 2.715], ['simple_features_7', 2.7212],
 ['other_features_7', 2.7217], ['june_9', 2.7284], ['other_features_6', 2.7319],
 ['simple_features_6', 2.7338], ['june_8', 2.7347], ['company_features_4', 2.7395],
 ['my_hunch_4', 2.7413], ['june_7', 2.7436], ['other_features_5', 2.7469],
 ['all_features_9', 2.7493]]

These results look off, so let's rerun the search over a wider range of k values.

all_features = simple_features + other_features + company_features + country_features + june
my_hunch = simple_features + company_features + june + other_features
feature_sets = [simple_features, other_features, company_features, country_features, june, all_features, my_hunch]
feature_set_strings = ['simple_features', 'other_features', 'company_features', 'country_features', 'june', 'all_features', 'my_hunch']
y = train["revenue"]
k_s = [k for k in range(1, 20)]
model_results = list()
for feature_set, set_string in zip(feature_sets, feature_set_strings):
    features = list(set(feature_set + simple_features))
    X = train[features]
    kf = KFold(n_splits=5, random_state=319, shuffle=True)
    feature_set_errors = list()
    for k in k_s:
        for train_idx, test_idx in kf.split(X):
            model = KNeighborsRegressor(n_neighbors=k)
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            error = np.sqrt(mean_squared_log_error(y_test, predictions))
            feature_set_errors.append(error)
        model_results.append([(set_string + "_" + str(k)), round(np.mean(feature_set_errors), 4)])
model_results = sorted(model_results, key=lambda x: x[1])
model_results[:25]
# Output
[['company_features_17', 2.6426], ['company_features_18', 2.6426], ['company_features_16', 2.6428],
 ['company_features_19', 2.6429], ['company_features_15', 2.6432], ['company_features_14', 2.6443],
 ['company_features_13', 2.6457], ['company_features_12', 2.6478], ['company_features_11', 2.6509],
 ['company_features_10', 2.6548], ['company_features_9', 2.6601], ['my_hunch_14', 2.6636],
 ['my_hunch_15', 2.6636], ['my_hunch_16', 2.6639], ['my_hunch_13', 2.6642], ['my_hunch_17', 2.6646],
 ['my_hunch_12', 2.6655], ['my_hunch_18', 2.6655], ['my_hunch_19', 2.6664], ['my_hunch_11', 2.6669],
 ['company_features_8', 2.6675], ['my_hunch_10', 2.669], ['my_hunch_9', 2.6729],
 ['company_features_7', 2.6764], ['my_hunch_8', 2.6781]]

From these initial results, K = 17 with company_features scores best, though we still need to verify that it is actually a good fit.

features = (company_features + simple_features)
knn = KNeighborsRegressor(n_neighbors=17)
knn.fit(train[features], train['revenue'])
predictions17 = knn.predict(test[features])

Validating model accuracy

1. AdaBoost regression

from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
features = (company_features + simple_features)
# Hold out part of the training data for validation. The split parameters below
# are assumed; the original text does not show this step.
train_x, test_x, train_y, test_y = train_test_split(train[features], train['revenue'], test_size=0.25, random_state=319)
regressor = AdaBoostRegressor(n_estimators=3)
regressor.fit(train_x, train_y)
pred_y = regressor.predict(test_x)
# note: this is MSLE (no square root), so the values are not on the same scale as the RMSLE above
mse = mean_squared_log_error(test_y, pred_y)
print("AdaBoost error = ", round(mse, 2))

AdaBoost error = 9.93

2. KNN regression

With 17 neighbors

knn_regressor = KNeighborsRegressor(n_neighbors=17)
knn_regressor.fit(train_x, train_y)
pred_y = knn_regressor.predict(test_x)
mse = mean_squared_log_error(test_y, pred_y)
print("KNN 误差 = ",round(mse, 2))

KNN error = 7.49

3. KNN regression

With 3 neighbors

knn_regressor = KNeighborsRegressor(n_neighbors=3)
knn_regressor.fit(train_x, train_y)
pred_y = knn_regressor.predict(test_x)
mse = mean_squared_log_error(test_y, pred_y)
print("KNN 误差 = ",round(mse, 2))

KNN error = 7.32

  • The KNN model with 3 neighbors has the lowest error, so we take its predictions as the final result
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(train[features], train['revenue'])
predictions3 = knn.predict(test[features])
submission_df = {"id": test['id'], "revenue": predictions3}
submission3 = pd.DataFrame(submission_df)
submission3.to_csv("submission.csv", index=False)

Final score
