TMDB 5000 Movie Dataset

Original text:

Data Source Transfer Summary

We (Kaggle) have removed the original version of this dataset per a DMCA takedown request from IMDB. In order to minimize the impact, we're replacing it with a similar set of films and data fields from The Movie Database (TMDb) in accordance with their terms of use. The bad news is that kernels built on the old dataset will most likely no longer work.

The good news is that:

You can port your existing kernels over with a bit of editing. This kernel offers functions and examples for doing so. You can also find a general introduction to the new format here.
The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.
Actors and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot checked, it didn't line up with either the credits order or IMDB's stars order.
The revenues appear to be more current. For example, IMDB's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.
Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, this IMDB entry has basically no accurate information at all. It lists Star Wars Episode VII as a documentary.

Data Source Transfer Details

Several of the new columns contain JSON. You can save a bit of time by porting the load data functions from this kernel.
Even simple fields like runtime may not be consistent across versions. For example, the previous dataset shows the duration of Avatar's extended cut, while TMDb shows the running time of the original version.
There's now a separate file containing the full credits for both the cast and crew.
All fields are filled out by users so don't expect them to agree on keywords, genres, ratings, or the like.
Your existing kernels will continue to render normally until they are re-run.
If you are curious about how this dataset was prepared, the code to access TMDb's API is posted here.
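
Since the JSON columns are the main porting hurdle, here is a minimal sketch of decoding them with only the Python standard library. It is not the loader from the kernel mentioned above. The two-row sample, the choice of `genres` as the JSON column, and the `id`/`name` keys inside each JSON object are assumptions made for illustration, as are the helper names `load_rows` and `names`.

```python
import csv
import io
import json

# Hypothetical two-row sample in the shape described above (not real data).
# Inside a quoted CSV field, a doubled quote ("") stands for one literal quote.
SAMPLE_CSV = '''id,original_title,genres
1,Example Film,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}]"
2,Another Film,[]
'''

# Assumed list of JSON-encoded columns; extend it to match the real file.
JSON_COLUMNS = ("genres",)


def load_rows(handle, json_columns=JSON_COLUMNS):
    """Read CSV rows and decode the JSON-encoded columns into Python lists."""
    rows = []
    for row in csv.DictReader(handle):
        for column in json_columns:
            raw = row.get(column)
            # Treat a missing or empty cell as an empty list.
            row[column] = json.loads(raw) if raw else []
        rows.append(row)
    return rows


def names(entries):
    """Collect the 'name' value from each decoded JSON object."""
    return [entry["name"] for entry in entries]


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["original_title"], names(row["genres"]))
```

To try this on the downloaded data, pass an open file handle to `load_rows` instead of the in-memory sample and list the JSON columns you actually find in the header.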

New columns:

homepage
id
original_title
overview
popularity
production_companies
production_countries
release_date
spoken_languages
status
tagline
vote_average

Lost columns:

actor_1_facebook_likes
actor_2_facebook_likes
actor_3_facebook_likes
aspect_ratio
cast_total_facebook_likes
color
content_rating
director_facebook_likes
facenumber_in_poster
movie_facebook_likes
movie_imdb_link
num_critic_for_reviews
num_user_for_reviews

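Translation:

TMDB 5000 Movie Dataset

Metadata on about 5,000 movies from TMDb

What can we say about the success of a movie before it is released? Have certain companies (Pixar?) found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question matters more than ever to the film industry. Film fans may have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

This is a good place to start digging into those questions, with data on the plot, cast, crew, budget, and revenue of several thousand films.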

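You can download the dataset from the official site; I have also shared a copy on Baidu Netdisk. Follow my official account and reply "2020101705" to get the download link.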