Recently I needed collaborative filtering at work to compute similarities, so I followed the steps in https://blog.csdn.net/weixin_43849063/article/details/111500236 to try out the surprise library (everything marked "# hands-on code" below comes from that post) and also studied part of the surprise source code. The main purpose of this post is to record that process. It is my first time reading through a library's source, so corrections are welcome if anything is wrong.

1. Data source

The dataset used is ml-100k, a staple dataset for recommender systems.
The specific files involved here are u.data and u.item.
u.data holds users' ratings of movies, in the format user|movie|rating|timestamp, with 100K records in total.

u.item holds movie metadata, in the format id|movie title|release date|website|..., with 1682 records in total.

# hands-on code
from surprise import Dataset, Reader

file_path = r"F:\ml-100k\u.data"
reader = Reader(line_format='user item rating timestamp', sep='\t')
movie_data = Dataset.load_from_file(file_path=file_path, reader=reader)

item_file_path = r"F:\ml-100k\u.item"
id2name = {}
with open(item_file_path, 'r', encoding='ISO-8859-1') as f:
    for line in f:
        line = line.split('|')
        id2name[line[0]] = line[1]

2. Splitting the dataset into train and test

2.1 train_test_split

# hands-on code
from surprise.model_selection import train_test_split

train, test = train_test_split(movie_data, test_size=.2, random_state=1)

Let's see how train_test_split splits the dataset in the source code.
Package location: …\Python\Python37\site-packages\surprise\model_selection\split.py

# source code
def train_test_split(data, test_size=.2, train_size=None, random_state=None,
                     shuffle=True):
    ss = ShuffleSplit(n_splits=1, test_size=test_size, train_size=train_size,
                      random_state=random_state, shuffle=shuffle)
    return next(ss.split(data))

As we can see, train_test_split instantiates the ShuffleSplit class as ss, so let's dig into ShuffleSplit next.

2.2 ShuffleSplit

# source code
class ShuffleSplit():
    """A basic cross-validation iterator with random trainsets and testsets.

    Contrary to other cross-validation strategies, random splits do not
    guarantee that all folds will be different, although this is still very
    likely for sizeable datasets.

    See an example in the :ref:`User Guide <use_cross_validation_iterators>`.

    Args:
        n_splits(int): The number of folds.
        test_size(float or int ``None``): If float, it represents the
            proportion of ratings to include in the testset. If int,
            represents the absolute number of ratings in the testset. If
            ``None``, the value is set to the complement of the trainset size.
            Default is ``.2``.
        train_size(float or int or ``None``): If float, it represents the
            proportion of ratings to include in the trainset. If int,
            represents the absolute number of ratings in the trainset. If
            ``None``, the value is set to the complement of the testset size.
            Default is ``None``.
        random_state(int, RandomState instance from numpy, or ``None``):
            Determines the RNG that will be used for determining the folds. If
            int, ``random_state`` will be used as a seed for a new RNG. This is
            useful to get the same splits over multiple calls to ``split()``.
            If RandomState instance, this same instance is used as RNG. If
            ``None``, the current RNG from numpy is used. ``random_state`` is
            only used if ``shuffle`` is ``True``.  Default is ``None``.
        shuffle(bool): Whether to shuffle the ratings in the ``data`` parameter
            of the ``split()`` method. Shuffling is not done in-place. Setting
            this to `False` defeats the purpose of this iterator, but it's
            useful for the implementation of :func:`train_test_split`. Default
            is ``True``.
    """

    def __init__(self, n_splits=5, test_size=.2, train_size=None,
                 random_state=None, shuffle=True):

        if n_splits <= 0:
            raise ValueError('n_splits = {0} should be strictly greater than '
                             '0.'.format(n_splits))
        if test_size is not None and test_size <= 0:
            raise ValueError('test_size={0} should be strictly greater than '
                             '0'.format(test_size))
        if train_size is not None and train_size <= 0:
            raise ValueError('train_size={0} should be strictly greater than '
                             '0'.format(train_size))

        self.n_splits = n_splits
        self.test_size = test_size
        self.train_size = train_size
        self.random_state = random_state
        self.shuffle = shuffle

    def validate_train_test_sizes(self, test_size, train_size, n_ratings):

        if test_size is not None and test_size >= n_ratings:
            raise ValueError('test_size={0} should be less than the number of '
                             'ratings {1}'.format(test_size, n_ratings))
        if train_size is not None and train_size >= n_ratings:
            raise ValueError('train_size={0} should be less than the number of'
                             ' ratings {1}'.format(train_size, n_ratings))

        if np.asarray(test_size).dtype.kind == 'f':
            test_size = ceil(test_size * n_ratings)

        if train_size is None:
            train_size = n_ratings - test_size
        elif np.asarray(train_size).dtype.kind == 'f':
            train_size = floor(train_size * n_ratings)

        if test_size is None:
            test_size = n_ratings - train_size

        if train_size + test_size > n_ratings:
            raise ValueError('The sum of train_size and test_size ({0}) '
                             'should be smaller than the number of '
                             'ratings {1}.'.format(train_size + test_size,
                                                   n_ratings))

        return int(train_size), int(test_size)

    def split(self, data):
        """Generator function to iterate over trainsets and testsets.

        Args:
            data(:obj:`Dataset<surprise.dataset.Dataset>`): The data containing
                ratings that will be divided into trainsets and testsets.

        Yields:
            tuple of (trainset, testset)
        """

        test_size, train_size = self.validate_train_test_sizes(
            self.test_size, self.train_size, len(data.raw_ratings))
        rng = get_rng(self.random_state)

        for _ in range(self.n_splits):

            if self.shuffle:
                permutation = rng.permutation(len(data.raw_ratings))
            else:
                permutation = np.arange(len(data.raw_ratings))

            raw_trainset = [data.raw_ratings[i] for i in
                            permutation[:test_size]]
            raw_testset = [data.raw_ratings[i] for i in
                           permutation[test_size:(test_size + train_size)]]

            trainset = data.construct_trainset(raw_trainset)
            testset = data.construct_testset(raw_testset)

            yield trainset, testset

2.2.1 init

ShuffleSplit's __init__ only checks that n_splits, test_size, and train_size are valid and initializes the attributes, so we can skip it.

Next, the ss object built from ShuffleSplit calls the split method, and split in turn calls validate_train_test_sizes.

2.2.2 validate_train_test_sizes

validate_train_test_sizes takes self.test_size, self.train_size, and len(data.raw_ratings) as inputs. The first two are the test and train proportions, and the third is the number of rows in u.data, i.e. 100K. Its job is to convert test_size and train_size from the proportions 0.2 and 0.8 into the absolute counts 20000 and 80000.
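A minimal sketch (my own, not surprise code) of that conversion, assuming test_size=0.2 and train_size=None as in the call above:

# minimal sketch: how 0.2 / None become absolute rating counts
from math import ceil

n_ratings = 100000                       # len(data.raw_ratings), the rows of u.data
test_size, train_size = 0.2, None

test_size = ceil(test_size * n_ratings)  # float proportion -> 20000 ratings
if train_size is None:
    train_size = n_ratings - test_size   # complement of the testset -> 80000 ratings

print(train_size, test_size)             # 80000 20000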

2.2.3 split

Back in split, get_rng(self.random_state) and permutation are used to shuffle the ratings so that the train/test split is not taken in a fixed order; raw_trainset and raw_testset are then built from the permuted indices.
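A small sketch (my own, not surprise code) of how a permutation of indices turns into the two raw splits:

# minimal sketch: a shuffled index permutation drives the split
import numpy as np

raw_ratings = list(range(10))            # stand-in for data.raw_ratings
rng = np.random.RandomState(1)           # roughly what get_rng(1) gives back
permutation = rng.permutation(len(raw_ratings))

n_train, n_test = 8, 2
raw_trainset = [raw_ratings[i] for i in permutation[:n_train]]
raw_testset = [raw_ratings[i] for i in permutation[n_train:n_train + n_test]]
print(raw_trainset, raw_testset)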

2.2.4 construct_trainset

The split raw_trainset and raw_testset are then handed to the construct_trainset and construct_testset methods of the Dataset class. Let's look at construct_trainset first.
Package location: …\Python\Python37\site-packages\surprise\dataset.py

# source code
class Dataset:
    """Base class for loading datasets.

    Note that you should never instantiate the :class:`Dataset` class directly
    (same goes for its derived classes), but instead use one of the three
    available methods for loading datasets."""

    # ...code omitted...

    def construct_trainset(self, raw_trainset):

        raw2inner_id_users = {}
        raw2inner_id_items = {}

        current_u_index = 0
        current_i_index = 0

        ur = defaultdict(list)
        ir = defaultdict(list)

        # user raw id, item raw id, translated rating, time stamp
        for urid, irid, r, timestamp in raw_trainset:
            try:
                uid = raw2inner_id_users[urid]
            except KeyError:
                uid = current_u_index
                raw2inner_id_users[urid] = current_u_index
                current_u_index += 1
            try:
                iid = raw2inner_id_items[irid]
            except KeyError:
                iid = current_i_index
                raw2inner_id_items[irid] = current_i_index
                current_i_index += 1

            ur[uid].append((iid, r))
            ir[iid].append((uid, r))

        n_users = len(ur)  # number of users
        n_items = len(ir)  # number of items
        n_ratings = len(raw_trainset)

        trainset = Trainset(ur,
                            ir,
                            n_users,
                            n_items,
                            n_ratings,
                            self.reader.rating_scale,
                            raw2inner_id_users,
                            raw2inner_id_items)

        return trainset

Taking our own raw_trainset as an example (only the first 50 records of the real raw_trainset were kept), let's see what the output looks like.

(screenshots of the resulting trainset.ur and trainset.ir omitted)

As you can see, in trainset.ur, users 17 and 34 gave item 17 ratings of 2.0 and 4.0 respectively. In trainset.ir, item 17 was likewise rated 2.0 and 4.0 by users 17 and 34, so the two structures match one to one.
That is how construct_trainset processes raw_trainset.
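To make the raw-to-inner mapping concrete, here is a small sketch (my own, mimicking the logic above) on three made-up raw ratings:

# minimal sketch: building ur / ir and the raw-to-inner id maps
from collections import defaultdict

raw_trainset = [('196', '242', 3.0, None),   # (raw user id, raw item id, rating, timestamp)
                ('186', '302', 3.0, None),
                ('196', '302', 4.0, None)]

raw2inner_id_users, raw2inner_id_items = {}, {}
ur, ir = defaultdict(list), defaultdict(list)
for urid, irid, r, _ in raw_trainset:
    # inner ids are assigned in order of first appearance
    uid = raw2inner_id_users.setdefault(urid, len(raw2inner_id_users))
    iid = raw2inner_id_items.setdefault(irid, len(raw2inner_id_items))
    ur[uid].append((iid, r))
    ir[iid].append((uid, r))

print(dict(ur))  # {0: [(0, 3.0), (1, 4.0)], 1: [(1, 3.0)]}
print(dict(ir))  # {0: [(0, 3.0)], 1: [(1, 3.0), (0, 4.0)]}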

P.S. Using defaultdict
ur and ir are built with defaultdict:
from collections import defaultdict
ur = defaultdict(list)
ir = defaultdict(list)
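A one-line illustration (my own) of why this is convenient:

# minimal sketch: defaultdict(list) creates the empty list on first access
from collections import defaultdict

ur = defaultdict(list)
ur[0].append((5, 3.0))   # no KeyError even though key 0 did not exist yet
print(ur)                # defaultdict(<class 'list'>, {0: [(5, 3.0)]})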

2.2.5 construct_testset

Now let's see how construct_testset processes raw_testset.

# source code
class Dataset:
    """Base class for loading datasets.

    Note that you should never instantiate the :class:`Dataset` class directly
    (same goes for its derived classes), but instead use one of the three
    available methods for loading datasets."""

    # ...part of the code omitted...

    def construct_testset(self, raw_testset):

        return [(ruid, riid, r_ui_trans)
                for (ruid, riid, r_ui_trans, _) in raw_testset]

P.S. Using yield and next
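Recall that split is a generator (it ends with yield) and that train_test_split simply takes its first value via next(ss.split(data)). A toy sketch (my own) of that pattern:

# minimal sketch: a generator that yields one (trainset, testset) pair, consumed with next()
def fake_split():
    for _ in range(1):
        yield ('trainset', 'testset')    # execution pauses here after handing back one tuple

train_part, test_part = next(fake_split())  # next() pulls exactly one yielded pair
print(train_part, test_part)                # trainset testset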

As shown above, the returned trainset and testset go through different processing logic and are not identical.
At this point the dataset has been fully split by train_test_split.

3. Defining the collaborative filtering method and training

# hands-on code
from surprise import KNNBasic

sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBasic(sim_options=sim_options)
algo.fit(train)

The name option in sim_options selects the similarity measure. The user_based option selects between user-based and item-based collaborative filtering.

As the code below shows, KNNBasic inherits from SymmetricAlgo, which in turn inherits from AlgoBase.

3.1 AlgoBase

Package location: …\Python\Python37\site-packages\surprise\prediction_algorithms\algo_base.py

# source code
class AlgoBase(object):
    """Abstract class where is defined the basic behavior of a prediction
    algorithm.

    Keyword Args:
        baseline_options(dict, optional): If the algorithm needs to compute a
            baseline estimate, the ``baseline_options`` parameter is used to
            configure how they are computed. See
            :ref:`baseline_estimates_configuration` for usage.
    """

    def __init__(self, **kwargs):

        self.bsl_options = kwargs.get('bsl_options', {})
        self.sim_options = kwargs.get('sim_options', {})
        if 'user_based' not in self.sim_options:
            self.sim_options['user_based'] = True

    def fit(self, trainset):
        """Train an algorithm on a given training set.

        This method is called by every derived class as the first basic step
        for training an algorithm. It basically just initializes some internal
        structures and set the self.trainset attribute.

        Args:
            trainset(:obj:`Trainset <surprise.Trainset>`) : A training
                set, as returned by the :meth:`folds
                <surprise.dataset.Dataset.folds>` method.

        Returns:
            self
        """

        self.trainset = trainset

        # (re) Initialise baselines
        self.bu = self.bi = None

        return self

    # ...part of the code omitted...

Looking at AlgoBase first: at initialization, since we passed sim_options = {'name': 'pearson_baseline', 'user_based': False}, item-based collaborative filtering is used. bu and bi are both initialized to None.

3.2 SymmetricAlgo

Package location: …\Python\Python37\site-packages\surprise\prediction_algorithms\knns.py

# source code
class SymmetricAlgo(AlgoBase):
    """This is an abstract class aimed to ease the use of symmetric algorithms.

    A symmetric algorithm is an algorithm that can can be based on users or on
    items indifferently, e.g. all the algorithms in this module.

    When the algo is user-based x denotes a user and y an item. Else, it's
    reversed.
    """

    def __init__(self, sim_options={}, verbose=True, **kwargs):

        AlgoBase.__init__(self, sim_options=sim_options, **kwargs)
        self.verbose = verbose

    def fit(self, trainset):

        AlgoBase.fit(self, trainset)

        ub = self.sim_options['user_based']
        self.n_x = self.trainset.n_users if ub else self.trainset.n_items
        self.n_y = self.trainset.n_items if ub else self.trainset.n_users
        self.xr = self.trainset.ur if ub else self.trainset.ir
        self.yr = self.trainset.ir if ub else self.trainset.ur

        return self

    # ...part of the code omitted...

Next, SymmetricAlgo: since ub = False, we get
n_x = trainset.n_items
n_y = trainset.n_users
xr = trainset.ir
yr = trainset.ur
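A quick way to check this mapping after fitting (my own sketch; the attributes are the ones set in SymmetricAlgo.fit above, and cosine is used here only to keep the check lightweight):

# minimal sketch: with user_based=False, x denotes items and y denotes users
algo_check = KNNBasic(sim_options={'name': 'cosine', 'user_based': False})
algo_check.fit(train)
assert algo_check.n_x == train.n_items and algo_check.n_y == train.n_users
assert algo_check.xr is train.ir and algo_check.yr is train.ur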

3.3 KNNBasic

Package location: …\Python\Python37\site-packages\surprise\prediction_algorithms\knns.py

# source code
class KNNBasic(SymmetricAlgo):
    """A basic collaborative filtering algorithm.

    The prediction :math:`\\hat{r}_{ui}` is set as:

    .. math::
        \\hat{r}_{ui} = \\frac{
        \\sum\\limits_{v \\in N^k_i(u)} \\text{sim}(u, v) \\cdot r_{vi}}
        {\\sum\\limits_{v \\in N^k_i(u)} \\text{sim}(u, v)}

    or

    .. math::
        \\hat{r}_{ui} = \\frac{
        \\sum\\limits_{j \\in N^k_u(i)} \\text{sim}(i, j) \\cdot r_{uj}}
        {\\sum\\limits_{j \\in N^k_u(i)} \\text{sim}(i, j)}

    depending on the ``user_based`` field of the ``sim_options`` parameter.

    Args:
        k(int): The (max) number of neighbors to take into account for
            aggregation (see :ref:`this note <actual_k_note>`). Default is
            ``40``.
        min_k(int): The minimum number of neighbors to take into account for
            aggregation. If there are not enough neighbors, the prediction is
            set to the global mean of all ratings. Default is ``1``.
        sim_options(dict): A dictionary of options for the similarity
            measure. See :ref:`similarity_measures_configuration` for accepted
            options.
        verbose(bool): Whether to print trace messages of bias estimation,
            similarity, etc.  Default is True.
    """

    def __init__(self, k=40, min_k=1, sim_options={}, verbose=True, **kwargs):

        SymmetricAlgo.__init__(self, sim_options=sim_options, verbose=verbose,
                               **kwargs)
        self.k = k
        self.min_k = min_k

    def fit(self, trainset):

        SymmetricAlgo.fit(self, trainset)
        self.sim = self.compute_similarities()

        return self

    # ...part of the code omitted...

Finally, KNNBasic: k=40 and min_k=1. According to the docstring, k and min_k are the upper and lower bounds on the number of neighbors used in the aggregation; if there are not enough neighbors, the prediction falls back to the global mean of all ratings.
In fit, the similarity matrix is built by AlgoBase's compute_similarities, so let's look at that function closely.

3.3.1 compute_similarities

# source code
class AlgoBase(object):

    # ...part of the code omitted...

    def compute_similarities(self):
        """Build the similarity matrix.

        The way the similarity matrix is computed depends on the
        ``sim_options`` parameter passed at the creation of the algorithm (see
        :ref:`similarity_measures_configuration`).

        This method is only relevant for algorithms using a similarity measure,
        such as the :ref:`k-NN algorithms <pred_package_knn_inpired>`.

        Returns:
            The similarity matrix.
        """

        construction_func = {'cosine': sims.cosine,
                             'msd': sims.msd,
                             'pearson': sims.pearson,
                             'pearson_baseline': sims.pearson_baseline}

        if self.sim_options['user_based']:
            n_x, yr = self.trainset.n_users, self.trainset.ir
        else:
            n_x, yr = self.trainset.n_items, self.trainset.ur

        min_support = self.sim_options.get('min_support', 1)

        args = [n_x, yr, min_support]

        name = self.sim_options.get('name', 'msd').lower()
        if name == 'pearson_baseline':
            shrinkage = self.sim_options.get('shrinkage', 100)
            bu, bi = self.compute_baselines()
            if self.sim_options['user_based']:
                bx, by = bu, bi
            else:
                bx, by = bi, bu

            args += [self.trainset.global_mean, bx, by, shrinkage]

        try:
            if getattr(self, 'verbose', False):
                print('Computing the {0} similarity matrix...'.format(name))
            sim = construction_func[name](*args)
            if getattr(self, 'verbose', False):
                print('Done computing similarity matrix.')
            return sim
        except KeyError:
            raise NameError('Wrong sim name ' + name + '. Allowed values ' +
                            'are ' + ', '.join(construction_func.keys()) + '.')

There are four ways to measure similarity: cosine, msd, pearson, and pearson_baseline. Reading the source shows exactly how each of them is computed.
Package location: …\Python\Python37\site-packages\surprise\similarities.pyx

To verify the computations, I also created a small file with the data below and computed similarities on it:
1 1 3 891717742
1 3 3 881250949
2 1 3 869840916
2 2 1 878887116
2 3 2 880606923
3 1 1 886397596
3 2 4 884182806
4 1 2 881171488
4 2 5 891628467
4 3 4 891219467
5 1 1 880194817
5 2 3 886324817
5 3 3 883603013

3.3.1.1 cosine

Formula

For two users u and v:

    cosine_sim(u, v) = \frac{\sum_{i \in I_{uv}} r_{ui} \cdot r_{vi}}{\sqrt{\sum_{i \in I_{uv}} r_{ui}^2} \cdot \sqrt{\sum_{i \in I_{uv}} r_{vi}^2}}

For two items i and j:

    cosine_sim(i, j) = \frac{\sum_{u \in U_{ij}} r_{ui} \cdot r_{uj}}{\sqrt{\sum_{u \in U_{ij}} r_{ui}^2} \cdot \sqrt{\sum_{u \in U_{ij}} r_{uj}^2}}

where r_ui and r_uj are user u's ratings of items i and j, U_ij is the set of users who rated both items i and j, and I_uv is the set of items rated by both users u and v.

The similarity matrix obtained by calling the built-in interface:

(screenshot of algo.sim omitted)

Note that algo.yr holds only 11 records here, while the file has 13; the other 2 records went into the test set.
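For reference, a sketch (my own) of how the matrix above can be produced; 'small_ratings.txt' is a hypothetical file containing the 13 lines listed earlier, assumed to be tab-separated:

# minimal sketch: item-based cosine similarity on the small file
from surprise import Dataset, KNNBasic, Reader
from surprise.model_selection import train_test_split

reader_small = Reader(line_format='user item rating timestamp', sep='\t')
data_small = Dataset.load_from_file('small_ratings.txt', reader=reader_small)
train_small, test_small = train_test_split(data_small, test_size=.2, random_state=1)

algo_small = KNNBasic(sim_options={'name': 'cosine', 'user_based': False})
algo_small.fit(train_small)
print(algo_small.sim)   # the item-item cosine similarity matrix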

The similarity matrix obtained from the source code:

# source code (Python rewrite, adapted from https://www.cnblogs.com/bjwu/p/9448043.html)
prods = np.zeros((algo.n_x, algo.n_x), np.double)
freq = np.zeros((algo.n_x, algo.n_x), np.int)
sqi = np.zeros((algo.n_x, algo.n_x), np.double)
sqj = np.zeros((algo.n_x, algo.n_x), np.double)
sim = np.zeros((algo.n_x, algo.n_x), np.double)

def cosine(n_x, yr, min_support):
    min_sprt = 1
    for y, y_ratings in six.iteritems(yr):
        # xi and xj denote items i and j
        # the loops below build the numerator and denominator of the cosine formula above
        for xi, ri in y_ratings:
            for xj, rj in y_ratings:
                freq[xi, xj] += 1
                prods[xi, xj] += ri * rj
                sqi[xi, xj] += ri**2
                sqj[xi, xj] += rj**2

    print('freq', '\n', freq)
    print('prods', '\n', prods)
    print('sqi', '\n', sqi)
    print('sqj', '\n', sqj)
    # freq: the diagonal counts how many times each item was rated; the off-diagonal
    #       entries count how many users rated both items. It is used to check whether
    #       two items were co-rated at least min_sprt times; if not, their similarity is 0
    # prods: the numerator of the formula
    # sqi, sqj: the two factors of the denominator

    # apply the cosine formula
    for xi in range(n_x):
        sim[xi, xi] = 1
        for xj in range(xi + 1, n_x):
            if freq[xi, xj] < min_sprt:
                sim[xi, xj] = 0
            else:
                denum = np.sqrt(sqi[xi, xj] * sqj[xi, xj])
                sim[xi, xj] = prods[xi, xj] / denum
            sim[xj, xi] = sim[xi, xj]
    return sim

The similarity matrix computed this way matches the one obtained by calling the built-in interface directly.

Note that when computing the similarity between two users, only the ratings on items rated by both users are taken into account. Likewise, when computing the similarity between two items, only the ratings from users who rated both items are taken into account.

3.3.1.2 msd
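msd (mean squared difference) follows the same co-rated-only convention; surprise defines the similarity as 1 / (msd(u, v) + 1), where msd is the mean squared difference of the co-rated ratings. A minimal sketch (my own, not the library source) for the item-based case:

# minimal sketch: msd similarity between two items, over users who rated both
import numpy as np

ri = np.array([3.0, 3.0, 1.0])   # ratings of item i by the users who rated both items
rj = np.array([3.0, 2.0, 4.0])   # ratings of item j by the same users

msd = np.mean((ri - rj) ** 2)    # mean squared difference over co-rated users
sim = 1.0 / (msd + 1)            # surprise maps msd to a similarity in (0, 1]
print(sim)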

3.3.1.3 pearson

    pearson_sim(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \mu_u) \cdot (r_{vi} - \mu_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \mu_u)^2} \cdot \sqrt{\sum_{i \in I_{uv}} (r_{vi} - \mu_v)^2}}

    pearson_sim(i, j) = \frac{\sum_{u \in U_{ij}} (r_{ui} - \mu_i) \cdot (r_{uj} - \mu_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \mu_i)^2} \cdot \sqrt{\sum_{u \in U_{ij}} (r_{uj} - \mu_j)^2}}

pearson is computed in the same way as cosine, except that the means (μ_u, μ_v for users; μ_i, μ_j for items, taken over the co-rated set) are subtracted first.
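A small numeric sketch (my own, not the library source) of the centering; the means are taken only over the co-rated users, as with cosine:

# minimal sketch: pearson between two items over the users who rated both
import numpy as np

ri = np.array([3.0, 3.0, 1.0])                # ratings of item i by the common users
rj = np.array([3.0, 2.0, 4.0])                # ratings of item j by the same users

ri_c, rj_c = ri - ri.mean(), rj - rj.mean()   # subtract each item's mean over the co-rated set
sim = np.dot(ri_c, rj_c) / (np.linalg.norm(ri_c) * np.linalg.norm(rj_c))
print(sim)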

P.S.
According to the analysis in https://www.cnblogs.com/bjwu/p/9448043.html, when computing item-item pearson_sim, the mean that is subtracted should arguably be the user's mean rather than the item's. The purpose of subtracting a mean is to remove rating bias, and it is users who habitually rate high or low; an item's score is the aggregate of many users' ratings and is generally not biased toward high or low values.
So that article argues the following form makes more sense for item-item pearson_sim, i.e. subtracting the user mean μ_u instead of the item means:

    sim(i, j) = \frac{\sum_{u \in U_{ij}} (r_{ui} - \mu_u) \cdot (r_{uj} - \mu_u)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \mu_u)^2} \cdot \sqrt{\sum_{u \in U_{ij}} (r_{uj} - \mu_u)^2}}

Note that the mean here only considers the set U_ij of users who rated both items i and j; users whose ratings do not involve both i and j should not be included in the mean.

3.3.1.4 pearson_baseline

    pearson_baseline_shrunk_sim(i, j) = \frac{|U_{ij}| - 1}{|U_{ij}| - 1 + \text{shrinkage}} \cdot \hat{\rho}_{ij}

where \hat{\rho}_{ij} is a pearson-style correlation in which the ratings are centered by their baseline estimates b_{ui} = \mu + b_u + b_i instead of by plain means.

Principle: see https://blog.csdn.net/qq_38574975/article/details/108310204

Basic idea: computing pearson_baseline similarity involves bu (how much each user tends to rate above or below everyone else) and bi (how much each item tends to be rated above or below other items). There are two ways to compute them: baseline_als and baseline_sgd.

baseline_als (alternating least squares)
# hands-on code
# define the CF method; user_based: False means item-based
sim_options = {'name': 'pearson_baseline', 'user_based': False}
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,  # regularization parameter
               'reg_i': 5}   # regularization parameter
algo = KNNBasic(sim_options=sim_options, bsl_options=bsl_options)
algo.fit(train)

# source code
n_epochs = 5
reg_u = 12
reg_i = 5

bu = np.zeros(train.n_users)  # n_users = 5
bi = np.zeros(train.n_items)  # n_items = 3
global_mean = train.global_mean

for dummy in range(n_epochs):
    for i in train.all_items():
        dev_i = 0
        for (u, r) in train.ir[i]:
            dev_i += r - global_mean - bu[u]
        bi[i] = dev_i / (reg_i + len(train.ir[i]))

    for u in train.all_users():
        dev_u = 0
        for (i, r) in train.ur[u]:
            dev_u += r - global_mean - bi[i]
        bu[u] = dev_u / (reg_u + len(train.ur[u]))


As can be seen, the bu and bi computed this way match those obtained by calling the interface directly.

Source analysis
Outer loop: for dummy in range(n_epochs)
While computing bu and bi:
bi is computed first; then bu, which uses the freshly computed bi; then bi again, which in turn uses the new bu; and so on, alternating for n_epochs rounds. This alternation is why ALS is called "alternating least squares".
Second loop: for i in train.all_items()
iterates over the items (and, in the second block, for u in train.all_users() iterates over the users).
Third loop: for (u, r) in train.ir[i]
accumulates dev_i, i.e. the numerator of the update

    b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu - b_u)}{\lambda_2 + |R(i)|}

In bi[i] = dev_i / (reg_i + len(train.ir[i])), reg_i is the regularization term λ2 in the denominator above, and len(train.ir[i]) is |R(i)|, the number of users who rated the item.

baseline_sgd (stochastic gradient descent)
# hands-on code
# define the CF method; user_based: False means item-based
sim_options = {'name': 'pearson_baseline', 'user_based': False}
bsl_options = {'method': 'sgd',
               'n_epochs': 20,
               'reg': 0.02,  # regularization parameter
               'lr': 0.005}  # learning rate
algo = KNNBasic(sim_options=sim_options, bsl_options=bsl_options)
algo.fit(train)

# source code
n_epochs = 20
reg = 0.02   # regularization parameter
lr = 0.005   # learning rate

bu = np.zeros(train.n_users)  # n_users = 5
bi = np.zeros(train.n_items)  # n_items = 3
global_mean = train.global_mean

for dummy in range(n_epochs):
    for u, i, r in train.all_ratings():
        err = (r - (global_mean + bu[u] + bi[i]))
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
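For reference (my own note, not from the post): each inner step above is a stochastic gradient step on the regularized squared error of the baseline model,

    \min_{b_*} \sum_{r_{ui} \in R_{train}} \big(r_{ui} - (\mu + b_u + b_i)\big)^2 + \lambda \big(b_u^2 + b_i^2\big)

which yields the updates b_u ← b_u + γ·(e_ui − λ·b_u) and b_i ← b_i + γ·(e_ui − λ·b_i), with e_ui = r_ui − (μ + b_u + b_i), γ = lr and λ = reg in the code.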


Again, the bu and bi computed this way match those obtained by calling the interface directly.

Source analysis: omitted.

P.S.
1. For the shrinkage term in the formula, I only found that it acts like a regularization parameter; I could not find guidance on how to set it. The default value is 100.
2. The penalty term mentioned in https://www.cnblogs.com/bjwu/p/9448043.html does not appear in the source code, but the idea is worth keeping in mind.

3.3.1.5 Comparison of the four similarity measures

See https://www.cnblogs.com/bjwu/p/9448043.html

4. Getting the top-k neighbors and their similarities

# hands-on code
def get_k_nearest(inner_id, k_nearest):
    '''
    :return: the k nearest neighbors
    '''
    if algo.sim_options['user_based']:  # user-based or item-based
        all_instances = algo.trainset.all_users
    else:
        all_instances = algo.trainset.all_items
    # the item-item (or user-user) similarities are stored in algo.sim;
    # sim[i][j] is the similarity between i and j
    others = [(x, algo.sim[inner_id][x]) for x in all_instances() if x != inner_id]
    # sort by similarity, from high to low
    sorted_others = sorted(others, key=lambda x: x[1], reverse=True)
    # take the top k neighbors
    return sorted_others[:k_nearest]
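A usage sketch (my own; raw item id '1' and k_nearest=3 are arbitrary, and id2name comes from section 1):

# minimal sketch: the 3 items most similar to the movie with raw id '1'
inner_id = algo.trainset.to_inner_iid('1')
for iid, sim in get_k_nearest(inner_id, k_nearest=3):
    print(id2name[algo.trainset.to_raw_iid(iid)], sim)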

5. Making recommendations

# hands-on code
def recommend(id, k_nearest, n_items, user_based=False):
    '''
    default adapt item-based CF
    :param id: original user id
    :param k_nearest:
    :param n_items: top_n recommended finally
    :param user_based: item-based or user-based
    :return: top_n items liked by user most likely
    '''
    # user-based recommendation
    if user_based:
        recommend_dict = {}
        # convert the raw user id to an inner id
        inner_id = algo.trainset.to_inner_uid(ruid=id)
        # which movies has this user already seen
        cur_user_like_and_rating = algo.trainset.ur[inner_id]
        cur_user_like_item = [ele[0] for ele in cur_user_like_and_rating]
        # the k_nearest neighbors of this user
        user_neighbors = get_k_nearest(inner_id, k_nearest)
        for neighbor, similarity in user_neighbors:
            neighbor_user_like_and_rating = algo.trainset.ur[neighbor]
            for item, rating in neighbor_user_like_and_rating:
                # skip movies the user has already seen
                if item in cur_user_like_item:
                    continue
                else:
                    # otherwise, score the movie as the weighted sum of the
                    # k_nearest neighbors' ratings, weighted by the
                    # user-neighbor similarity
                    recommend_dict.setdefault(item, 0)
                    recommend_dict[item] += similarity * rating
        # sort and return the top n recommendations
        sorted_recommend_dict = sorted(recommend_dict.items(), key=lambda x: x[1], reverse=True)
        selected_item_list = [ele[0] for ele in sorted_recommend_dict[:n_items]]
        return selected_item_list
    else:
        recommend_dict = {}
        # convert the raw user id to an inner id
        inner_id = algo.trainset.to_inner_uid(ruid=id)
        # which movies has this user already seen
        cur_user_like_and_rating = algo.trainset.ur[inner_id]
        cur_user_like_item = [ele[0] for ele in cur_user_like_and_rating]
        for item, rating in cur_user_like_and_rating:
            # the k_nearest neighbors of each movie the user has seen
            item_neighbors = get_k_nearest(item, k_nearest)
            for similar_item, similarity in item_neighbors:
                # skip neighbors the user has already seen
                if similar_item in cur_user_like_item:
                    continue
                # otherwise, score the neighbor as the sum over the user's seen
                # movies of rating * similarity(seen movie, neighbor)
                else:
                    recommend_dict.setdefault(similar_item, 0)
                    recommend_dict[similar_item] += rating * similarity
        # sort and return the top n recommendations
        sorted_recommend_dict = sorted(recommend_dict.items(), key=lambda x: x[1], reverse=True)
        selected_item_list = [ele[0] for ele in sorted_recommend_dict[:n_items]]
        return selected_item_list
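A usage sketch (my own; the user id and parameter values are arbitrary):

# minimal sketch: top-10 item-based recommendations for the user with raw id '1'
rec_inner_ids = recommend(id='1', k_nearest=20, n_items=10)
print([id2name[algo.trainset.to_raw_iid(iid)] for iid in rec_inner_ids])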

6. Evaluation

6.1 Getting the inner_id of the movies each test user watched

# hands-on code (buggy)
def get_test_user_like(test):
    res = {}
    for user, item, rating in test:
        # to_inner_iid converts the raw item_id into an inner_id
        res[user] = res.get(user, []).append(algo.trainset.to_inner_iid(item))
    return res

The code above comes from the post linked at the top, but it has a problem: res.get(user, []).append(algo.trainset.to_inner_iid(item)) calls append on a list, which returns None, so every value in the returned dict ends up as None.

# hands-on code (revised; there may well be a neater way)
def get_test_user_like(test):
    res = {}
    for user, item, rating in test:
        res[user] = []
    for user, item, rating in test:
        # to_inner_iid converts the raw item_id into an inner_id
        res[user].append(algo.trainset.to_inner_iid(item))
    return res
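A more compact variant (my own sketch, reusing the defaultdict pattern from section 2.2.4) that builds the same mapping in one pass:

# minimal sketch: the same mapping with defaultdict
from collections import defaultdict

def get_test_user_like(test):
    res = defaultdict(list)
    for user, item, rating in test:
        res[user].append(algo.trainset.to_inner_iid(item))
    return res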

6.2 Computing recall and precision on the test set

# hands-on code
def recall_and_precision(test, k_nearest, n_items):
    '''
    :param test: [(user, item, rating)] user-id  item-id is ruid & riid
    :param k_nearest:
    :param n_items:
    :return:
    '''
    hit = 0
    precison = 0
    recall = 0
    # the movies each test user has watched, with ratings
    test_user_like = get_test_user_like(test)
    for user, item, rating in test:
        # movies recommended to this test user
        recommend_like = recommend(id=user, k_nearest=k_nearest, n_items=n_items)
        # movies the test user actually watched
        true_like = test_user_like[user]
        # number of hits
        hit += len(set(recommend_like) & set(true_like))
        precison += len(recommend_like)
        recall += len(true_like)
    return hit / float(recall), hit / float(precison)

6.3 Computing coverage on the test set

# hands-on code
def coverage(test, k_nearest, n_items):
    '''
    coverage: number of recommended items / total number of items
    :param test:
    :param k_nearest:
    :param n_items:
    :return:
    '''
    test_user_like = get_test_user_like(test)
    all_items = set()
    recommend_items = set()
    for user, item, rating in test:
        recommend_like = recommend(id=user, k_nearest=k_nearest, n_items=n_items)
        true_like = test_user_like[user]
        # all items
        for ele in true_like:
            all_items.add(ele)
        # recommended items
        for ele in recommend_like:
            recommend_items.add(ele)
    return len(recommend_items) / len(all_items)
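A usage sketch (my own; k_nearest=20 and n_items=10 are arbitrary choices):

# minimal sketch: running the evaluation helpers on the held-out testset
recall, precision = recall_and_precision(test, k_nearest=20, n_items=10)
cov = coverage(test, k_nearest=20, n_items=10)
print('recall={:.4f}, precision={:.4f}, coverage={:.4f}'.format(recall, precision, cov))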

7. References

1. https://blog.csdn.net/Dawei_01/article/details/79847686 (comparison of the four similarity measures)
2. https://blog.csdn.net/weixin_43849063/article/details/111500236 (the hands-on tutorial)
3. https://www.cnblogs.com/bjwu/p/9448043.html (comparison of the four similarity measures)
4. https://blog.csdn.net/qq_38574975/article/details/108310204 (how the baseline estimates are computed)
