Kaggle TMDB Box Office Prediction Challenge
Kaggle provides data on more than 7,000 past films, and the task is to use these data to predict each film's worldwide box-office revenue.
The data include cast, crew, plot keywords, budget, poster, release date, language, production companies, and countries.
1. Loading relevant modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from collections import Counter
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from nltk.corpus import stopwords  # requires the NLTK stopwords corpus (nltk.download('stopwords'))
import os
import warnings

plt.rcParams['font.sans-serif'] = ['SimHei']   # allow CJK characters in matplotlib figures
plt.rcParams['axes.unicode_minus'] = False
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)     # the dataset has many columns; show all of them
print(os.listdir("./input"))                   # list the files in the input directory
warnings.filterwarnings('ignore')              # suppress warnings
EDA
- Load the training and test data
train = pd.read_csv("./input/train.csv")
test = pd.read_csv("./input/test.csv")
print(train.shape, test.shape) # check the size of each dataset
print(set(train.columns).difference(set(test.columns)))
- Looking at the sizes, the test set is relatively large compared with the training set, so we must be careful not to overfit
- The test set has no revenue column — that is the box-office figure we need to predict
- First add a revenue column to the test set, then concatenate the training and test sets so that the cleaning and feature engineering below can be applied to both at once
test["revenue"] = np.nan
all_movies = pd.concat([train, test])
all_movies.head()
all_movies.columns.tolist()
Column meanings
- movie_id: TMDB movie identifier
- title: movie title
- cast: list of cast members
- director: director
- budget: budget (USD)
- genres: list of genres the movie belongs to
- homepage: URL of the movie's homepage
- original_language: original language of the movie
- original_title: original title of the movie
- overview: plot summary
- popularity: relative page-view count on The Movie Database
- production_companies: production companies
- production_countries: production countries
- release_date: release date
- revenue: box-office revenue (the prediction target)
- runtime: running time
- spoken_languages: spoken languages
- status: release status
- tagline: the movie's tagline
- crew: list of crew members
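Several of these columns (genres, belongs_to_collection, production_companies, production_countries, spoken_languages, Keywords, cast, crew) are stored as strings that look like Python lists of dicts, which is why the cleaning code below parses them with eval. A minimal sketch of parsing one such cell (the cell contents here are made up for illustration):
sample_genres = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"  # illustrative cell value
parsed = eval(sample_genres)                # a list of dicts
print(",".join(d['name'] for d in parsed))  # "Comedy,Drama" — the same format as genres_list below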
all_movies.shape
(7398, 23)
# replace missing values in belongs_to_collection with "" and extract the collection name
def clean_belongs_to_collection(x):
    if x is np.nan:
        return ""
    x = x[1:-1]             # strip the outer list brackets
    return eval(x)['name']  # extract the collection name

all_movies["collection"] = all_movies["belongs_to_collection"].apply(clean_belongs_to_collection)
print(all_movies["collection"].isnull().sum())
print(all_movies["collection"].head())
def get_genres(x):
    if x is np.nan:
        return ""
    genres = list()
    for genre_dict in eval(x):
        genres.append(genre_dict['name'])
    return ",".join(genres)

all_movies["genres_list"] = all_movies["genres"].apply(get_genres)
print(all_movies["genres_list"].isnull().sum())
all_movies["genres_list"].head()
# count the number of movies in each genre
genres = list()
for i in range(all_movies.shape[0]):
    genres += all_movies.iloc[i]["genres_list"].split(",")
genres = Counter(genres)
genres.most_common(25)
# Output
[('Drama', 3676),('Comedy', 2605),('Thriller', 1869),('Action', 1735),('Romance', 1435),('Adventure', 1116),('Crime', 1084),('Science Fiction', 744),('Horror', 735),('Family', 675),('Fantasy', 628),('Mystery', 550),('Animation', 382),('History', 295),('Music', 267),('War', 243),('Documentary', 221),('Western', 117),('Foreign', 84),('', 23),('TV Movie', 1)]
Plotting word clouds
from pyecharts.charts import Page, WordCloud
from pyecharts.globals import SymbolType
from pyecharts import options as opts
from pyecharts.globals import CurrentConfig, NotebookType
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB
attr = []
value = []
for i in genres.most_common(25):
    attr.append(i[0])
    value.append(i[1])
words = genres.most_common()
def wordcloud_diamond() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Movie genre word cloud"))
    )
    return c
wordcloud_diamond().load_javascript()
# in JupyterLab the JavaScript assets must be loaded first; classic Notebook does not need this
wordcloud_diamond().render_notebook()
# create a separate indicator column for each genre (skip the last two entries: '' and 'TV Movie')
for genre, count in genres.most_common(25)[:-2]:
    all_movies[genre] = all_movies["genres_list"].apply(lambda x: 1 if genre in x else 0)
all_movies = all_movies.drop(["genres_list"], axis=1)
all_movies.isnull().sum()
all_movies["has_webpage"] = all_movies["homepage"].apply(lambda x: 0 if x is np.nan else 1)
all_movies = all_movies.drop(["belongs_to_collection", "genres","homepage", "imdb_id", "poster_path"], axis=1)
all_movies["has_webpage"].value_counts()
- 0: no homepage
- 1: has a homepage
sns.catplot(x='has_webpage', y='revenue', data=all_movies)
plt.title('Homepage vs. revenue')
plt.show()
- Movies with a homepage tend to earn noticeably higher revenue (a quick numeric check follows below)
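A small sanity check of that claim, simply comparing mean revenue with and without a homepage (a sketch; the test rows have NaN revenue, which mean() ignores):
all_movies.groupby("has_webpage")["revenue"].mean()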
# one-hot encoding categorical features makes the distance computations used by KNN more meaningful
dummy_orig_langs = pd.get_dummies(all_movies["original_language"], prefix="original_lang")
all_movies = pd.concat([all_movies,dummy_orig_langs], axis=1)
# load the English stop-word list from NLTK
stop_words = set(stopwords.words('english'))
text_cols = ["overview", "tagline"]
ovw_words = list()
tag_words = list()
for i in range(all_movies.shape[0]):
    try:
        ovw_words += all_movies.iloc[i]["overview"].replace(",", "").replace(".", "").lower().split()
        tag_words += all_movies.iloc[i]["tagline"].replace(",", "").replace(".", "").lower().split()
    except AttributeError:
        continue  # overview or tagline is NaN for this row
ovw_words = Counter([w for w in ovw_words if len(w) > 4 and w not in stop_words])
tag_words = Counter([w for w in tag_words if len(w) > 4 and w not in stop_words])
print(ovw_words.most_common(10))
print(tag_words.most_common(10))
words = ovw_words.most_common()
def wordcloud_diamond() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Movie overview word cloud"))
    )
    return c
wordcloud_diamond().load_javascript()
wordcloud_diamond().render_notebook()
words = tag_words.most_common()
def wordcloud_diamond() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Movie tagline word cloud"))
    )
    return c
wordcloud_diamond().load_javascript()
wordcloud_diamond().render_notebook()
# create indicator columns for the most common overview and tagline words
for word, _ in ovw_words.most_common(100):
    col = "overview_" + word
    all_movies[col] = all_movies["overview"].apply(lambda x: 0 if x is np.nan else 1 if word in x.lower() else 0)
for word, _ in tag_words.most_common(100):
    col = "tagline_" + word
    all_movies[col] = all_movies["tagline"].apply(lambda x: 0 if x is np.nan else 1 if word in x.lower() else 0)
all_movies.shape
ovw_high_var_list = list()
tag_high_var_list = list()
train_idx = train.shape[0]
train = all_movies.iloc[:train_idx]
ovw_cols = [col for col in list(train.columns) if "overview_" in col]
tag_cols = [col for col in list(train.columns) if "tagline_" in col]
# compute the variance of revenue among the movies that contain each overview/tagline word
for col in ovw_cols:
    ovw_high_var_list.append((col, train[train[col] == 1]["revenue"].var()))
for col in tag_cols:
    tag_high_var_list.append((col, train[train[col] == 1]["revenue"].var()))
# reverse=True sorts by revenue variance in descending order
ovw_high_var_list = sorted(ovw_high_var_list, key=lambda x: x[1], reverse=True)
tag_high_var_list = sorted(tag_high_var_list, key=lambda x: x[1], reverse=True)
ovw_drop_cols = [x[0] for x in ovw_high_var_list[50:]]  # keep only the 50 highest-variance word columns
tag_drop_cols = [x[0] for x in tag_high_var_list[50:]]
all_movies = all_movies.drop((ovw_drop_cols + tag_drop_cols), axis=1)
all_movies.shape
- Check the number of missing values
all_movies.isnull().sum()
Handling missing data
all_movies.loc[all_movies['title'].isnull(), 'title'] = all_movies.loc[all_movies['title'].isnull(), 'original_title']
all_movies['status'].fillna("Released", inplace=True)
# fill in the missing runtimes using information from https://www.imdb.com
all_movies.loc[all_movies['title']=='Happy Weekend', 'runtime'] = 81
all_movies.loc[all_movies['title']=='Miesten välisiä keskusteluja', 'runtime'] = 90
all_movies.loc[all_movies['title']=='Nunca en horas de clase', 'runtime'] = 100
all_movies.loc[all_movies['title']=='Pancho, el perro millonario', 'runtime'] = 91
all_movies.loc[all_movies['title']=='La caliente niña Julietta', 'runtime'] = 93
all_movies.loc[all_movies['title']=='Королёв', 'runtime'] = 130
# the release date of Jails, Hospitals & Hip-Hop is May 2000
all_movies.loc[all_movies['release_date'].isnull(), 'release_date'] = '5/1/00'
- Parse the release date
releaseDate = pd.to_datetime(all_movies['release_date'])
all_movies['release_dayofweek'] = releaseDate.dt.dayofweek
all_movies['release_quarter'] = releaseDate.dt.quarter
all_movies["release_month"] = all_movies["release_date"].apply(lambda x: x.split("/")[0])
all_movies["release_month"].value_counts()
- Extract features from the columns that contain lists of values
# the same parsing pattern used for genres above
def get_list_of_values(x, key):
    if x is np.nan:
        return ""
    vals = list()
    for val in eval(x):
        vals.append(val[key])
    return ",".join(vals)

def find_most_common(col, n):
    values = list()
    for i in range(all_movies.shape[0]):
        values += all_movies.iloc[i][col].split(",")
    return Counter(values).most_common(n)

def one_hot_encode_most_common(new_col, list_col, cmn_lst):
    for name, cnt in cmn_lst:
        all_movies[new_col + "_" + name] = all_movies[list_col].apply(lambda x: 1 if name in x else 0)
    return None

# production companies
all_movies["companies_list"] = all_movies["production_companies"].apply(get_list_of_values, args=('name',))
most_cmn_comps = find_most_common("companies_list", 10)
one_hot_encode_most_common("production_companies", "companies_list", most_cmn_comps)# production countries
all_movies["countries_list"] = all_movies["production_countries"].apply(get_list_of_values, args=('iso_3166_1',))
most_cmn_countries = find_most_common("countries_list", 25)
one_hot_encode_most_common("production_countries", "countries_list", most_cmn_countries)# spoken languages
all_movies["spoken_lang_list"] = all_movies["spoken_languages"].apply(get_list_of_values, args=('iso_639_1',))
most_cmn_langs = find_most_common("spoken_lang_list", 25)
one_hot_encode_most_common("spoken_languages", "spoken_lang_list", most_cmn_langs)# Keywords
all_movies["keywords_list"] = all_movies["Keywords"].apply(get_list_of_values, args=('name',))
most_cmn_kywds = find_most_common("keywords_list", 25)
one_hot_encode_most_common("Keywords", "keywords_list", most_cmn_kywds)# cast
all_movies.loc[all_movies['cast'].isnull(), 'cast'] = "[{'gender': ''}, {'gender': ''}, {'gender': ''}]"  # placeholder: three cast entries with empty gender
all_movies['cast_gender_0'] = all_movies['cast'].apply(lambda x: np.nan if len(eval(x)) < 1 else eval(x)[0]['gender'])
all_movies['cast_gender_1'] = all_movies['cast'].apply(lambda x: np.nan if len(eval(x)) < 2 else eval(x)[1]['gender'])
all_movies['cast_gender_2'] = all_movies['cast'].apply(lambda x: np.nan if len(eval(x)) < 3 else eval(x)[2]['gender'])
all_movies.shape
# Output: (7398, 220)
- Drop columns that are no longer needed
all_movies = all_movies.drop(["release_date", "production_companies", "production_countries","spoken_languages", "Keywords", "cast", "crew", "overview", "tagline","companies_list", "countries_list", "spoken_lang_list","keywords_list"], axis=1)
# cast gender has more than two possible values, so one-hot encode it
# pd.get_dummies() converts the categorical codes into integer indicator columns:
dummy_genders_0 = pd.get_dummies(all_movies["cast_gender_0"], prefix="first_cast_gender")
all_movies = pd.concat([all_movies, dummy_genders_0], axis=1)
dummy_genders_1 = pd.get_dummies(all_movies["cast_gender_1"], prefix="scnd_cast_gender_1")
all_movies = pd.concat([all_movies, dummy_genders_1], axis=1)
dummy_genders_2 = pd.get_dummies(all_movies["cast_gender_2"], prefix="thrd_cast_gender_2")
all_movies = pd.concat([all_movies, dummy_genders_2], axis=1)
all_movies = all_movies.drop(["cast_gender_0", "cast_gender_1", "cast_gender_2"], axis=1)
- Check whether any column other than revenue still has missing values
all_movies[[col for col in all_movies.columns.tolist() if col != "revenue"]].isnull().sum().sum()
dummy_months = pd.get_dummies(all_movies["release_month"], prefix="month")
all_movies = pd.concat([all_movies, dummy_months], axis=1)
all_movies = all_movies.drop(['original_language','original_title','status','title','collection'], axis=1)
num_movies = all_movies.select_dtypes(include=['float64'])
num_movies = pd.concat([num_movies, all_movies[["budget"]]], axis=1)
num_movies.describe()
KNN is distance-based, so budget, runtime and popularity need to be normalized before modeling; otherwise budget (in dollars) would dominate the distances.
# MinMaxScaler() could be used for this instead (a sketch follows after this block)
def normalize_col(df, col):
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return None

normalize_col(all_movies, "popularity")
normalize_col(all_movies, "runtime")
normalize_col(all_movies, "budget")
all_movies[["popularity_norm", "runtime_norm", "budget_norm"]].describe()
train = all_movies.iloc[:train_idx]
test = all_movies.iloc[train_idx:]
all_movies.columns.tolist()
Data Exploration
plt.figure(figsize=(20, 10))
plt.subplot(1, 3, 1)
plt.hist(all_movies['runtime'].fillna(0) / 60, bins=40)
plt.title('Runtime distribution (hours)')
plt.subplot(1, 3, 2)
plt.scatter(all_movies['runtime'].fillna(0), all_movies['revenue'])
plt.title('runtime vs revenue');
plt.subplot(1, 3, 3)
plt.scatter(all_movies['runtime'].fillna(0), all_movies['popularity'])
plt.title('runtime vs popularity')
- Runtime and revenue show a positive relationship; runtime and popularity also appear somewhat positively related, though less clearly (a quick correlation check follows below)
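The same relationships expressed as numbers, just the pairwise correlations (a sketch; revenue is NaN for the test rows, and those pairs are dropped automatically):
all_movies[['runtime', 'revenue', 'popularity']].corr()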
plt.figure(figsize=(15, 8))
sns.countplot(all_movies['release_dayofweek'])
plt.title('Releases by day of week (0 = Monday, 6 = Sunday)')
plt.show()
- Somewhat surprisingly, the most common release day is not on the weekend itself but Friday (day 4)
plt.figure(figsize=(15, 8))
sns.countplot(all_movies['release_month'])
plt.title('Number of releases per month')
plt.show()
plt.figure(figsize=(15, 8))
month_pivot = train.pivot_table(index="release_month", values="revenue", aggfunc=np.mean)
sns.barplot(x=month_pivot.index, y=month_pivot.revenue)
plt.show()
- Mean revenue by release month
- June releases earn noticeably more, so June can later be used as a feature of its own when building the model
plt.figure(figsize=(15, 8))
month_pivot = train.pivot_table(index="release_quarter", values="revenue", aggfunc=np.mean)
sns.barplot(x=month_pivot.index, y=month_pivot.revenue)
plt.title('Mean revenue by release quarter')
plt.show()
Summer and winter seem to be better for revenue, perhaps because families tend to get together during those periods, so the release months are grouped into summer and winter flags.
all_movies["summer"] = all_movies["release_month"].apply(lambda x: 1 if x in ['5','6','7'] else 0)
all_movies["winter"] = all_movies["release_month"].apply(lambda x: 1 if x in ['11', '12'] else 0)
train_idx = train.shape[0]
train = all_movies.iloc[:train_idx]
test = all_movies.iloc[train_idx:]
corr = all_movies.corr()
corr['revenue'].sort_values(ascending=False).head(20)
corr['revenue'].sort_values(ascending=False).tail(20)
Modeling
A KNN regression model is used for the predictions.
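For context, KNeighborsRegressor predicts a target as the (by default uniformly weighted) mean of the targets of the k nearest training points, which is why the normalized features above matter. A minimal sketch on toy data (the numbers are made up):
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

X_toy = np.array([[0.1], [0.2], [0.8], [0.9]])   # one normalized feature
y_toy = np.array([10.0, 20.0, 80.0, 90.0])       # toy revenues
toy_knn = KNeighborsRegressor(n_neighbors=2).fit(X_toy, y_toy)
print(toy_knn.predict([[0.15]]))                 # mean of the 2 nearest targets -> [15.]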
Setting up the feature sets
simple_features = ['popularity_norm', 'runtime_norm', 'budget_norm']
other_features = ['has_webpage']
overview_features = [col for col in train.columns.tolist() if "overview" in col]
tagline_features = [col for col in train.columns.tolist() if "tagline" in col]
company_features = [col for col in train.columns.tolist() if "production_companies" in col]
country_features = [col for col in train.columns.tolist() if "production_countries" in col]
spoken_lang_features = [col for col in train.columns.tolist() if "spoken_languages" in col]
keyword_features = [col for col in train.columns.tolist() if "Keywords_" in col]
cast_gender_features = [col for col in train.columns.tolist() if "cast_gender_" in col]
month_features = [col for col in train.columns.tolist() if "month_" in col]
season_features = ['summer', 'winter']
june = ['month_6']
all_features = [col for col in train.columns.tolist() if col not in ["revenue", "id", "release_month", "budget", "popularity", "runtime"]]
genre_features = ['Drama','Comedy','Thriller','Action','Romance','Adventure','Crime','Science Fiction','Horror','Family','Fantasy','Mystery','Animation','History','Music','War','Documentary','Western','Foreign']
Evaluate each feature set and choose the value of K
from sklearn.metrics import mean_squared_log_error
feature_sets = [simple_features, other_features, overview_features, tagline_features,
                company_features, country_features, spoken_lang_features, keyword_features,
                cast_gender_features, month_features, season_features, june,
                all_features, genre_features]
feature_set_strings = ["simple_features", "other_features", "overview_features", "tagline_features",
                       "company_features", "country_features", "spoken_lang_features", "keyword_features",
                       "cast_gender_features", "month_features", "season_features", "june",
                       "all_features", "genre_features"]
y = train["revenue"]
k_s = [3,5,7,9,11,13,15, 17, 19]
model_results = list()
for feature_set, set_string in zip(feature_sets, feature_set_strings):
    print(set_string)
    features = list(set(feature_set + simple_features))
    X = train[features]
    kf = KFold(n_splits=5, random_state=319, shuffle=True)
    feature_set_errors = list()
    for k in k_s:
        i = 0
        print(str(k) + " neighbors")
        for train_idx, test_idx in kf.split(X):
            i += 1
            model = KNeighborsRegressor(n_neighbors=k)
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            error = np.sqrt(mean_squared_log_error(y_test, predictions))
            feature_set_errors.append(error)
            print("Fold " + str(i) + ": " + str(round(error, 4)))
        print(set_string + " " + str(k) + " neighbors mean: " + str(round(np.mean(feature_set_errors), 4)))
        model_results.append([(set_string + "_" + str(k)), round(np.mean(feature_set_errors), 4)])
model_results = sorted(model_results, key=lambda x: x[1])
model_results[:30]
# Output
[['company_features_3', 2.5645],['company_features_5', 2.5741],['company_features_7', 2.5809],['company_features_9', 2.5858],['company_features_11', 2.5909],['company_features_13', 2.5959],['company_features_15', 2.6005],['company_features_17', 2.6054],['company_features_19', 2.6102],['simple_features_3', 2.64],['simple_features_5', 2.6413],['simple_features_7', 2.6427],['other_features_3', 2.6452],['simple_features_9', 2.6497],['other_features_5', 2.6509],['other_features_7', 2.6542],['simple_features_11', 2.657],['other_features_9', 2.6611],['simple_features_13', 2.6629],['june_5', 2.6629],['june_7', 2.664],['june_3', 2.6646],['june_9', 2.6675],['simple_features_15', 2.6683],['other_features_11', 2.6695],['june_11', 2.6725],['simple_features_17', 2.6729],['other_features_13', 2.675],['country_features_3', 2.6757],['simple_features_19', 2.6768]]
- company_features, simple_features, other_features and country_features give the lowest errors
- the june feature works best with 5–9 neighbors; more than 9 is not suitable
all_features = simple_features + other_features + company_features + country_features + june
my_hunch = simple_features + company_features + other_features + june
feature_sets = [simple_features,other_features,company_features,country_features,june, all_features, my_hunch]
feature_set_strings = ['simple_features', 'other_features', 'company_features', 'country_features', 'june', 'all_features', 'my_hunch']
y = train["revenue"]
k_s = [k for k in range(1,10)]
model_results = list()
for feature_set, set_string in zip(feature_sets, feature_set_strings):
    print(set_string)
    features = list(set(feature_set + simple_features))
    X = train[features]
    kf = KFold(n_splits=5, random_state=319, shuffle=True)
    feature_set_errors = list()
    for k in k_s:
        i = 0
        print(str(k) + " neighbors")
        for train_idx, test_idx in kf.split(X):
            i += 1
            model = KNeighborsRegressor(n_neighbors=k)
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            error = np.sqrt(mean_squared_log_error(y_test, predictions))
            feature_set_errors.append(error)
            print("Fold " + str(i) + ": " + str(round(error, 4)))
        print(set_string + " " + str(k) + " neighbors mean: " + str(round(np.mean(feature_set_errors), 4)))
        model_results.append([(set_string + "_" + str(k)), round(np.mean(feature_set_errors), 4)])
model_results = sorted(model_results, key=lambda x: x[1])
model_results[:25]
# Output
[['company_features_9', 2.6601],['company_features_8', 2.6675],['my_hunch_9', 2.6729],['company_features_7', 2.6764],['my_hunch_8', 2.6781],['my_hunch_7', 2.6853],['company_features_6', 2.6901],['my_hunch_6', 2.6967],['company_features_5', 2.7084],['simple_features_9', 2.7093],['other_features_9', 2.7113],['simple_features_8', 2.7141],['my_hunch_5', 2.7141],['other_features_8', 2.715],['simple_features_7', 2.7212],['other_features_7', 2.7217],['june_9', 2.7284],['other_features_6', 2.7319],['simple_features_6', 2.7338],['june_8', 2.7347],['company_features_4', 2.7395],['my_hunch_4', 2.7413],['june_7', 2.7436],['other_features_5', 2.7469],['all_features_9', 2.7493]]
These results look off — the search above was too limited, so rerun the same feature sets over a wider range of K values.
all_features = simple_features + other_features + company_features + country_features + june
my_hunch = simple_features + company_features + june + other_features
feature_sets = [simple_features,other_features,company_features,country_features,june, all_features, my_hunch]
feature_set_strings = ['simple_features', 'other_features', 'company_features', 'country_features', 'june', 'all_features', 'my_hunch']
y = train["revenue"]
k_s = [k for k in range(1,20)]
model_results = list()
for feature_set, set_string in zip(feature_sets, feature_set_strings):
    features = list(set(feature_set + simple_features))
    X = train[features]
    kf = KFold(n_splits=5, random_state=319, shuffle=True)
    feature_set_errors = list()
    for k in k_s:
        for train_idx, test_idx in kf.split(X):
            model = KNeighborsRegressor(n_neighbors=k)
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            error = np.sqrt(mean_squared_log_error(y_test, predictions))
            feature_set_errors.append(error)
        model_results.append([(set_string + "_" + str(k)), round(np.mean(feature_set_errors), 4)])
model_results = sorted(model_results, key=lambda x: x[1])
model_results[:25]
# Output
[['company_features_17', 2.6426],['company_features_18', 2.6426],['company_features_16', 2.6428],['company_features_19', 2.6429],['company_features_15', 2.6432],['company_features_14', 2.6443],['company_features_13', 2.6457],['company_features_12', 2.6478],['company_features_11', 2.6509],['company_features_10', 2.6548],['company_features_9', 2.6601],['my_hunch_14', 2.6636],['my_hunch_15', 2.6636],['my_hunch_16', 2.6639],['my_hunch_13', 2.6642],['my_hunch_17', 2.6646],['my_hunch_12', 2.6655],['my_hunch_18', 2.6655],['my_hunch_19', 2.6664],['my_hunch_11', 2.6669],['company_features_8', 2.6675],['my_hunch_10', 2.669],['my_hunch_9', 2.6729],['company_features_7', 2.6764],['my_hunch_8', 2.6781]]
From these results, company_features with K = 17 looks like a good choice.
features = (company_features + simple_features)
knn = KNeighborsRegressor(n_neighbors=17)
knn.fit(train[features], train['revenue'])
predictions17 = knn.predict(test[features])
Validating model accuracy
1. AdaBoost regression model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
features = (company_features + simple_features)
# hold out part of the labelled training data for validation
# (the split ratio and random_state are not given in the original, so the values below are assumptions)
train_x, test_x, train_y, test_y = train_test_split(train[features], train['revenue'], test_size=0.25, random_state=319)
regressor = AdaBoostRegressor(n_estimators=3)
regressor.fit(train_x, train_y)
pred_y = regressor.predict(test_x)
mse = mean_squared_log_error(test_y, pred_y)  # mean squared log error (not the RMSLE used in the CV above)
print("AdaBoost error = ", round(mse, 2))
AdaBoost error =  9.93
2. KNN regression model
With 17 neighbors
knn_regressor = KNeighborsRegressor(n_neighbors=17)
knn_regressor.fit(train_x, train_y)
pred_y = knn_regressor.predict(test_x)
mse = mean_squared_log_error(test_y, pred_y)
print("KNN 误差 = ",round(mse, 2))
KNN error =  7.49
3. KNN regression model
With 3 neighbors
knn_regressor = KNeighborsRegressor(n_neighbors=3)
knn_regressor.fit(train_x, train_y)
pred_y = knn_regressor.predict(test_x)
mse = mean_squared_log_error(test_y, pred_y)
print("KNN 误差 = ",round(mse, 2))
KNN error =  7.32
- The KNN model with 3 neighbors gives the lowest error, so its predictions are used as the final result
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(train[features], train['revenue'])
predictions3 = knn.predict(test[features])
submission_df = {"id": test['id'], "revenue": predictions3}
submission3 = pd.DataFrame(submission_df)
submission3.to_csv("submission.csv", index=False)
Final score