Dataset description

The dataset contains metadata and descriptions for titles available on Netflix.
The columns are as follows:
show_id type title director cast country date_added release_year rating duration listed_in description

Introduction

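This dataset covers the TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.
In 2018, Flixable published a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies on the service has dropped by more than 2,000 titles over the same period. It will be interesting to see what other insights can be drawn from the same dataset.
Combining this dataset with external datasets, such as IMDB ratings or Rotten Tomatoes, could also yield many interesting findings.
The dataset has since been updated, so newer data through 2021 can now be processed as well.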
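As a rough illustration of the integration idea above, the sketch below joins a toy Netflix frame with a toy ratings frame. Both frames, the `imdb_rating` column, and the choice of join keys (`title` plus `release_year`) are assumptions made for this example; a real external dataset may use different column names and need fuzzier title matching.

```python
import pandas as pd

# Toy stand-ins: every value here is made up for illustration.
netflix = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C"],
    "release_year": [2019, 2020, 2018],
})
ratings = pd.DataFrame({
    "title": ["Title A", "Title C"],
    "release_year": [2019, 2018],
    "imdb_rating": [7.1, 6.4],  # hypothetical column name
})

# A left join keeps every Netflix title; titles without a match get NaN.
merged = netflix.merge(ratings, on=["title", "release_year"], how="left")
print(merged)
```

Joining on the year as well as the title reduces false matches between remakes that share a name.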

EDA with Plotly

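import re

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

data = pd.read_csv('/data/netflix_titles.csv')  # load the file
data.head()  # preview the first rows

data.info()  # column dtypes and non-null counts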

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB
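The non-null counts above can be turned into a missing-value summary. The sketch below hard-codes the counts from the `info()` output so that it runs without the CSV; on the real frame, `data.isna().mean()` should give the same proportions directly.

```python
import pandas as pd

total_rows = 7787  # RangeIndex size reported by data.info()

# Non-null counts copied from the info() output, incomplete columns only.
non_null = pd.Series({
    "director": 5398,
    "cast": 7069,
    "country": 7280,
    "date_added": 7777,
    "rating": 7780,
})

missing = total_rows - non_null
missing_pct = (missing / total_rows * 100).round(2)

summary = (
    pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
    .sort_values("missing", ascending=False)
)
print(summary)
```

The `director` column is missing for roughly 31% of rows, which matters later because the recommender drops every row with a missing value.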

Movies vs TV Shows

This step compares how much of the catalogue is movies and how much is TV shows.

colors = px.colors.qualitative.D3
vs_count = data.type.value_counts()
fig = px.pie(values=vs_count.values, names=vs_count.index, title='Movies vs TV Shows',
             color_discrete_sequence=colors)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

Content added over the years

This step plots the growth in content over time as a curve.

# strip stray whitespace before parsing; unparseable dates become NaT
data['year_added'] = pd.to_datetime(data['date_added'].str.strip(), errors='coerce').dt.year
d1 = data[data["type"] == "TV Show"].year_added.value_counts()
d2 = data[data["type"] == "Movie"].year_added.value_counts()
d1.sort_index(inplace=True)
d2.sort_index(inplace=True)
t1 = go.Scatter(x=d1.index, y=d1.values, mode='lines+markers', name='TV Show')
t2 = go.Scatter(x=d2.index, y=d2.values, mode='lines+markers', name='Movie')
layout = go.Layout(hovermode='closest', title='Content added over the years',
                   xaxis=dict(title='Year'), yaxis=dict(title='Content added'))
fig = go.Figure(data=[t1, t2], layout=layout)
fig.show()

Content share of countries

Look at how many titles each producing country accounts for, limited to the 25 most frequent values.

country_count = data['country'].value_counts()[:25]
fig = px.pie(values=country_count.values, names=country_count.index,
             title='Content share of countries', color_discrete_sequence=colors)
fig.show()
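One caveat: `country` can hold several comma-separated countries, so `value_counts()` on the raw column counts each combination as its own category. The sketch below shows one possible way to count individual countries instead. The toy rows are made up, and treating every listed country as an equal co-producer is an assumption.

```python
import pandas as pd

# Made-up rows that mimic the comma-separated format of the country column.
toy = pd.DataFrame({
    "country": [
        "United States",
        "India",
        "United States, India",
        None,
        "United Kingdom, United States",
    ]
})

per_country = (
    toy["country"]
    .dropna()            # ignore rows without a country
    .str.split(",")      # one list of countries per title
    .explode()           # one row per (title, country) pair
    .str.strip()         # remove the space left after each comma
    .value_counts()
)
print(per_country)
```

With this approach a co-production counts once for each country involved, so the totals add up to more than the number of titles.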

Distribution of movie duration

Look at the distribution of movie running times.

movie_duration = data[data.type == 'Movie'].duration
movie_duration = movie_duration.apply(lambda x : float(x.replace(' min','')))
t1 = go.Histogram(x=movie_duration, xbins=dict(size=0.5), marker=dict(color=colors))
layout = go.Layout(title='Distribution of movie duration', xaxis=dict(title='Minutes'))
fig = go.Figure(data = [t1], layout = layout)
fig.show()

Distribution of show duration

Count the number of seasons per TV show.

show_duration = data[data.type == 'TV Show'].duration
show_duration = show_duration.apply(lambda x : float(re.sub(' Seasons?','',x)))
t2 = go.Histogram(x=show_duration, xbins=dict(size=0.5), marker=dict(color=colors))
layout2 = go.Layout(title='Distribution of show duration', xaxis=dict(title='Seasons'))
fig = go.Figure(data = [t2], layout = layout2)
fig.show()
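The two parsing steps above (removing ` min` for movies and ` Season`/` Seasons` for shows) can also be written as a single numeric extraction. This is a sketch on made-up rows, and it assumes every `duration` value starts with an integer, which matches the two formats seen in this section.

```python
import pandas as pd

# Made-up rows covering the two duration formats used in this section.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "1 Season", "4 Seasons", "123 min"],
})

# Pull out the leading number regardless of the unit that follows it.
toy["duration_value"] = (
    toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

movie_minutes = toy.loc[toy["type"] == "Movie", "duration_value"]
show_seasons = toy.loc[toy["type"] == "TV Show", "duration_value"]
print(movie_minutes.tolist(), show_seasons.tolist())
```

The unit still depends on `type` (minutes for movies, seasons for shows), so the two groups should be plotted separately, as the histograms above do.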

Content-based recommender

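A content-based recommender suggests items that are similar to a given item. The system uses item metadata (director, description, cast, and so on) to make its recommendations. The underlying idea is that if a person likes a particular item, they will probably also like items similar to it. To make such recommendations, it draws on the metadata of items the user has engaged with in the past. A good example is YouTube, which suggests new videos you might watch based on your history.

My own understanding is that the recommendations are driven by tags.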
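To make the "similar tags" idea concrete, here is a minimal sketch of cosine similarity over bags of words, written in plain Python. The function name `cosine_similarity_bow` and the three toy items are made up for illustration; the recommender in this post applies the same measure through scikit-learn instead.

```python
import math
from collections import Counter


def cosine_similarity_bow(text_a, text_b):
    """Cosine similarity between two whitespace-tokenised bags of words."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over the words the two texts share.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Made-up items described by a few tags each.
items = {
    "Item A": "action comedy police miami",
    "Item B": "action comedy police chicago",
    "Item C": "romance drama paris",
}

scores = {
    name: cosine_similarity_bow(items["Item A"], tags)
    for name, tags in items.items()
    if name != "Item A"
}
print(scores)
```

Items that share most of their tags score close to 1, and items with no tags in common score 0, so ranking by this score puts the most similar items first.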

Drop unnecessary columns

data.drop(['show_id','type','date_added','release_year','rating','duration'], axis=1, inplace=True)
data.head()

Drop rows with missing values

import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

data.isna().sum().sort_values(ascending=False)  # missing values per column
data.dropna(inplace=True)
library = data.copy()  # untouched copy, used later to display the results
library.reset_index(inplace=True, drop=True)

nltk.download('stopwords', quiet=True)  # fetch the stopword list if it is missing
english_stopwords = stopwords.words('english')  # English stopwords
stemmer = SnowballStemmer('english')  # stemming algorithm
# regex used to strip mentions, URLs and any non-alphanumeric characters
regex = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

def preprocess(content, stem=False):
    # note: `stem` is never read, so tokens are always stemmed
    content = re.sub(regex, ' ', str(content).lower()).strip()
    tokens = []
    for token in content.split():
        if token not in english_stopwords:
            tokens.append(stemmer.stem(token))
    return " ".join(tokens)

# free-text columns: clean, then split into a list of tokens
data.description = data.description.apply(lambda x: preprocess(x).split(" "))
data.listed_in = data.listed_in.apply(lambda x: preprocess(x).split(" "))

# multi-valued columns: split on commas and remove inner spaces so each
# name becomes one token. Assigning the column back (rather than writing
# to the rows yielded by iterrows) guarantees the change is stored.
for column in ['director', 'cast', 'country']:
    data[column] = data[column].apply(
        lambda x: [item.replace(" ", "") for item in x.lower().split(",")])

data.set_index('title', inplace=True)
data.head()

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

columns = list(data.columns)
# join the tokens of every column into one string per title
data['bagofwords'] = data[columns].apply(
    lambda row: ' '.join(' '.join(row[column]) for column in columns), axis=1)
data.drop(columns=columns, inplace=True)
data.head()

count = CountVectorizer()
count_matrix = count.fit_transform(data['bagofwords'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)

def recommender(title):
    index = library[library['title'] == str(title)].index[0]
    # similarity scores for this title, in descending order
    similar_indexes = pd.Series(cosine_sim[index]).sort_values(ascending=False)
    # positions of the 5 most similar titles (the first entry is the title itself)
    top5 = list(similar_indexes.iloc[1:6].index)
    recommended_movies = library.iloc[top5]
    return recommended_movies

Running the recommender

recommender('Bad Boys')

recommender('Indiana Jones and the Kingdom of the Crystal Skull')
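A lookup like the ones above fails with an `IndexError` when the title is not in `library`, for example because its row was removed by `dropna`. The sketch below shows one possible defensive variant on a toy similarity matrix. The name `recommend_titles`, the toy frame, and the scores are all made up for illustration, and the function assumes the library has a default 0-based index aligned with the matrix rows.

```python
import numpy as np
import pandas as pd

# Made-up library and similarity matrix (symmetric, 1.0 on the diagonal).
library_toy = pd.DataFrame({"title": ["Item A", "Item B", "Item C", "Item D"]})
sim_toy = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
])


def recommend_titles(title, library, sim, n=5):
    """Return up to n titles most similar to `title`, or [] if it is unknown."""
    matches = library.index[library["title"] == title]
    if len(matches) == 0:
        return []  # unknown title: nothing to recommend
    idx = matches[0]
    # Drop the query item by label instead of assuming it sorts first.
    scores = pd.Series(sim[idx]).drop(idx)
    top = scores.sort_values(ascending=False).head(n).index
    return library.loc[top, "title"].tolist()


print(recommend_titles("Item A", library_toy, sim_toy, n=2))
print(recommend_titles("Not in library", library_toy, sim_toy))
```

Dropping the query row explicitly avoids relying on it being ranked first, which may not hold when another title has an identical bag of words.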

Recommender system case study: Netflix movie recommender system