{

"cells": [

{

"cell_type": "markdown",

"metadata": {},

"source": [

"# 进阶作业"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"问题描述\n",

"一、数据:Million Song Dataset(MSD) \n",

"\n",

"https://labrosa.ee.columbia.edu/millionsong/ \n",

"\n",

"作业使用的数据集是公开音乐数据集 Million Song Dataset(MSD) , 它 包 含 来 自 SecondHandSongs dataset 、 musiXmatch dataset、Last.fm dataset、Taste Profile subset、 thisismyjam-to-MSD mapping、tagtraum genre annotations 和 Top MAGD dataset 七个知名音乐社区的数据。 \n",

"原始数据集包括: \n",

"1. train_triplets.txt:三元组数据(用户、歌曲、播放次数) \n",

"2. track_metadata.db:每个歌曲的元数据 \n",

"由于原始数据太大,作业用的数据集只是其中的子集(播放次数最多的800个用户、播放次数最多的800首歌曲。 \n",

"数据预处理过程请见1.DataProcessing.ipynb文件(不必运行该程序,运行该程序需要原始数据。可以通过看代码理解数据提取过程),最后得到的数据文件为:triplet_dataset_sub.csv(37000+条记录)\n",

"解题提示\n",

"1. 由于这个数据集中并没有用户对物品的显式打分,需要将播放次数转换为分数。 \n",

"2. 在协同过滤中,计算用户之间的相似度或物品之间的相似度时,一种方式用播放次数/比例作为用户/物品的特征表示,同课件。 \n",

"另一种可选的表示是只要用户播放过歌曲就表示为1,否则为0(二值化),这样物品之间的相似度为播放两个歌曲的用户交集除以播放两个歌曲的用户并集: \n",

"。 \n",

"类似的,两个用户之间的相似度可用两个用户播放歌曲的交集除以两个用户播放歌曲的并集表示。 \n",

"批改标准\n",

"将triplet_dataset_sub.csv中的数据用train_test_split分成80%数据做训练,剩下20%数据做测试。 \n",

"1. 实现基于用户的协同过滤; (20分) \n",

"2. 实现基于物品的协同过滤; (20分) \n",

"3. 实现基于模型(矩阵分解)的协同过滤。(30分) \n",

"4. 对每种推荐算法的推荐结果,用Top10个推荐歌曲的准确率和召回率评价推荐系统的性能。(30分) "

]

},
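{

"cell_type": "markdown",

"metadata": {},

"source": [

"A minimal sketch of the binarized (Jaccard) similarity from the hint, assuming the inverted table `item_users` built later in this notebook (item index -> set of indices of users who played the item); the implementations below use mean-centered cosine similarity instead.\n",

"```python\n",

"# Sketch only: binarized item similarity from the hint (Jaccard index).\n",

"def jaccard_item_similarity(iid1, iid2, item_users):\n",

"    u1, u2 = item_users[iid1], item_users[iid2]\n",

"    inter = len(u1 & u2)  # users who played both songs\n",

"    union = len(u1 | u2)  # users who played at least one of the two\n",

"    return inter / union if union > 0 else 0.0\n",

"```"

]

},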

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## 协同过滤数据准备"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 导入工具包"

]

},

{

"cell_type": "code",

"execution_count": 2,

"metadata": {},

"outputs": [],

"source": [

"#导入工具包\n",

"import pandas as pd\n",

"import numpy as np\n",

"\n",

"#字典,用于建立用户和物品的索引\n",

"from collections import defaultdict\n",

"\n",

"#稀疏矩阵,存储打分表\n",

"import scipy.io as sio\n",

"import scipy.sparse as ss\n",

"\n",

"#数据到文件存储\n",

"import _pickle as cPickle\n",

"import pdb\n",

"from sklearn.model_selection import train_test_split"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 读取训练数据"

]

},

{

"cell_type": "code",

"execution_count": 22,

"metadata": {},

"outputs": [],

"source": [

"dpath = './data/'\n",

"\n",

"df_triplet_dataset = pd.read_csv(dpath +\"triplet_dataset_sub.csv\")"

]

},

{

"cell_type": "code",

"execution_count": 23,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user\n",

"

song\n",

"

play_count\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

0\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOCKSGZ12A58A7CA4B\n",

"

1\n",

"

\n",

"

\n",

"

1\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOCVTLJ12A6310F0FD\n",

"

1\n",

"

\n",

"

\n",

"

2\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SODLLYS12A8C13A96B\n",

"

3\n",

"

\n",

"

\n",

"

3\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOEGIYH12A6D4FC0E3\n",

"

1\n",

"

\n",

"

\n",

"

4\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOFRQTD12A81C233C0\n",

"

2\n",

"

\n",

"

\n",

"

5\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOHEMBB12A6701E907\n",

"

1\n",

"

\n",

"

\n",

"

6\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOHJOLH12A6310DFE5\n",

"

1\n",

"

\n",

"

\n",

"

7\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOIZLKI12A6D4F7B61\n",

"

1\n",

"

\n",

"

\n",

"

8\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOJGSIO12A8C141DBF\n",

"

1\n",

"

\n",

"

\n",

"

9\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOKEYJQ12A6D4F6132\n",

"

1\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user song play_count\n",

"0 4e11f45d732f4861772b2906f81a7d384552ad12 SOCKSGZ12A58A7CA4B 1\n",

"1 4e11f45d732f4861772b2906f81a7d384552ad12 SOCVTLJ12A6310F0FD 1\n",

"2 4e11f45d732f4861772b2906f81a7d384552ad12 SODLLYS12A8C13A96B 3\n",

"3 4e11f45d732f4861772b2906f81a7d384552ad12 SOEGIYH12A6D4FC0E3 1\n",

"4 4e11f45d732f4861772b2906f81a7d384552ad12 SOFRQTD12A81C233C0 2\n",

"5 4e11f45d732f4861772b2906f81a7d384552ad12 SOHEMBB12A6701E907 1\n",

"6 4e11f45d732f4861772b2906f81a7d384552ad12 SOHJOLH12A6310DFE5 1\n",

"7 4e11f45d732f4861772b2906f81a7d384552ad12 SOIZLKI12A6D4F7B61 1\n",

"8 4e11f45d732f4861772b2906f81a7d384552ad12 SOJGSIO12A8C141DBF 1\n",

"9 4e11f45d732f4861772b2906f81a7d384552ad12 SOKEYJQ12A6D4F6132 1"

]

},

"execution_count": 23,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"df_triplet_dataset.head(10)"

]

},

{

"cell_type": "code",

"execution_count": 24,

"metadata": {

"scrolled": true

},

"outputs": [

{

"name": "stdout",

"output_type": "stream",

"text": [

"\n"

]

},

{

"data": {

"text/plain": [

"user\n",

"0025bfe6248070545d23721083acd3f60451da4f 52\n",

"002b63a7e2247de6d62bc62f253474edc7dd044c 686\n",

"003a5e3285141b1a54edbc51fbfa1cc922023aae 655\n",

"0084ecc8a2b3b0a371b968ded8b92d3d8525fd64 33\n",

"0091e0326c4c034cc04be6454742912845740a1f 230\n",

"00a55d1ba6f63109c208dbd80570520d5d80f563 29\n",

"00d7dede8a10a03ea0b2d4a08449a9776d414923 178\n",

"00fa0c8162aa95341f4da9defede8aae0675d3cc 600\n",

"010a2a11d5013b81195a4b2c5b4ef8996a60f4a9 32\n",

"01a957fd771b0e80ef7843684217cad5939b4add 212\n",

"02471b044e6bdff7c01e1ea2791214268ba5aaf4 362\n",

"0252acdea2a493da2704c23eebaeaa155b18b7d0 933\n",

"0298a31b3535c3a3f972bc0d342cfc207c3cd8a6 248\n",

"03699fa50d944261dd0fe6eb6c4b58cbb44bdae5 13\n",

"04043d5716a2359f49f62d13c7c3d3e72b28f520 212\n",

"0421d096b0c80ec287d436e6f535f44b711b58ef 1017\n",

"0469bfcd3e17b383cbf1a362af8844a44339998d 671\n",

"04ba112cdf196358a56d6bdbf453bb0b2eb50b1c 1468\n",

"05be6a588cc454d1e400a358c562807aeec8c054 223\n",

"0618bf6227486c545a548a649d32cee247dd198d 269\n",

"062eef2a03b53d2b10f5018135e3361659c6a3bf 208\n",

"074a2197ff72db9f7e44606dfd33208dcdf29f06 70\n",

"083a2a59603a605275107c00812a811526c2a0af 1081\n",

"093cb74eb3c517c5179ae24caf0ebec51b24d2a2 296\n",

"0a4c3c6999c74af7d8a44e96b44bf64e513c0f8b 643\n",

"0a66acea5854a05a1514cd259124433b4190534a 97\n",

"0aae6978a0342cf0c356aa8a28ec6516df684025 311\n",

"0ab9d9f7925520801fffa8b63287d799cfe9a5a4 400\n",

"0ad0283d63a5c591a104142f8a2f5bbd779389b0 98\n",

"0adc1da3be9d2c9fa26d21268713fe4030402781 232\n",

" ... \n",

"f6ae5e682750e815c1709ca99138d03b039839d6 1508\n",

"f6d78516f331c684ee611e07effcb796e94ae456 666\n",

"f799c4ea9030eea12c078db1c1fcd5fa956e786b 561\n",

"f8181f9b3d85fa4ac04c66bc9f84f0ad2a18a777 787\n",

"f84fb3d29bb05bb9dec96684215c763ccbbc67a9 572\n",

"f8544ba8ff908f44d61f5d9d17c213423c1fc782 15\n",

"f986a1b01b2a75109baa39d637537b5124c111ab 48\n",

"f99a25251dfd3c44b629c3658bf6c0d0a7a3d0ce 709\n",

"f9cf7849592621b46a793e0f283de8ab48b3d5f8 65\n",

"f9edc8907be695518817082a224aa43beca7d994 302\n",

"fa5d9eddc010bc3fc71f8a42db15e5dd4f1c18a3 832\n",

"faf0beb5d7ff9d39244b0713de08304c3691f71b 401\n",

"fb644c3f2a83114325dc67b97df0bce60b5ac9a1 641\n",

"fba8ff1f9dd32aa35f3e13960a008fec773e2903 350\n",

"fbd1b7d1bf19158773820cb45639362347979926 529\n",

"fc05f377863a77d7784b02de2cc06cdecb85968b 239\n",

"fc77d71ecc8a4c7f4a0402fbe9118973124391fe 387\n",

"fcbc6bdcec1f293d1d03bbf3c64f613e59acbfe0 726\n",

"fd1ebc6caa7ad07c84677ba6bada683077bf0f15 304\n",

"fe2d77de7e57f3b3eedcf473545110b13ca03426 919\n",

"fe53f4bc06e09b02015312e1d0ea48b208cd490e 278\n",

"fe67eae6791418a5a85125145609f518f01efe48 393\n",

"fe8b98246d279f71f7cb0d493cdedce2bbc30aae 378\n",

"fe9a05c03c29da973743a83b80d1660748077432 109\n",

"fef771ab021c200187a419f5e55311390f850a50 186\n",

"ff124a0cd09e26b78b2b7d3a1de83512ba9978c8 383\n",

"ff7429bd2788349b026cf8f8a7a4b3f3971a310a 303\n",

"ffa96cd6cc641b38a946e8444d261435d615b2dc 87\n",

"ffe2ec5b72cddb8537ad7f0ac191624f8ae2c8dc 16\n",

"ffe5ad43c24d81878621185e164043a6e49b2fe4 422\n",

"Name: play_count, Length: 790, dtype: int64"

]

},

"execution_count": 24,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"df_count = df_triplet_dataset.groupby(by=['user'])['play_count'].sum()\n",

"print(type(df_count))\n",

"df_count"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 获取每个用户每首歌曲的播放百分比 作为用户对于该首歌曲的评分"

]

},

{

"cell_type": "code",

"execution_count": 25,

"metadata": {},

"outputs": [],

"source": [

"def playPercent(x, y):\n",

" return (x / df_count[y]) * 100"

]

},

{

"cell_type": "code",

"execution_count": 26,

"metadata": {},

"outputs": [],

"source": [

"df_triplet_dataset['play_count'] = df_triplet_dataset.apply(lambda row: playPercent(row['play_count'], row['user']), axis=1)"

]

},

{

"cell_type": "code",

"execution_count": 27,

"metadata": {},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user\n",

"

song\n",

"

play_count\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

0\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOCKSGZ12A58A7CA4B\n",

"

0.386100\n",

"

\n",

"

\n",

"

1\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOCVTLJ12A6310F0FD\n",

"

0.386100\n",

"

\n",

"

\n",

"

2\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SODLLYS12A8C13A96B\n",

"

1.158301\n",

"

\n",

"

\n",

"

3\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOEGIYH12A6D4FC0E3\n",

"

0.386100\n",

"

\n",

"

\n",

"

4\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOFRQTD12A81C233C0\n",

"

0.772201\n",

"

\n",

"

\n",

"

5\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOHEMBB12A6701E907\n",

"

0.386100\n",

"

\n",

"

\n",

"

6\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOHJOLH12A6310DFE5\n",

"

0.386100\n",

"

\n",

"

\n",

"

7\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOIZLKI12A6D4F7B61\n",

"

0.386100\n",

"

\n",

"

\n",

"

8\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOJGSIO12A8C141DBF\n",

"

0.386100\n",

"

\n",

"

\n",

"

9\n",

"

4e11f45d732f4861772b2906f81a7d384552ad12\n",

"

SOKEYJQ12A6D4F6132\n",

"

0.386100\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user song play_count\n",

"0 4e11f45d732f4861772b2906f81a7d384552ad12 SOCKSGZ12A58A7CA4B 0.386100\n",

"1 4e11f45d732f4861772b2906f81a7d384552ad12 SOCVTLJ12A6310F0FD 0.386100\n",

"2 4e11f45d732f4861772b2906f81a7d384552ad12 SODLLYS12A8C13A96B 1.158301\n",

"3 4e11f45d732f4861772b2906f81a7d384552ad12 SOEGIYH12A6D4FC0E3 0.386100\n",

"4 4e11f45d732f4861772b2906f81a7d384552ad12 SOFRQTD12A81C233C0 0.772201\n",

"5 4e11f45d732f4861772b2906f81a7d384552ad12 SOHEMBB12A6701E907 0.386100\n",

"6 4e11f45d732f4861772b2906f81a7d384552ad12 SOHJOLH12A6310DFE5 0.386100\n",

"7 4e11f45d732f4861772b2906f81a7d384552ad12 SOIZLKI12A6D4F7B61 0.386100\n",

"8 4e11f45d732f4861772b2906f81a7d384552ad12 SOJGSIO12A8C141DBF 0.386100\n",

"9 4e11f45d732f4861772b2906f81a7d384552ad12 SOKEYJQ12A6D4F6132 0.386100"

]

},

"execution_count": 27,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"df_triplet_dataset.head(10)"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"只存在播放次数不存在评分,这里简单的将播放次数作为用户的评分"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 获取校验集和测试集"

]

},

{

"cell_type": "code",

"execution_count": 28,

"metadata": {},

"outputs": [],

"source": [

"x_train = df_triplet_dataset.iloc[:, 0:2]\n",

"y_train = df_triplet_dataset.iloc[:, 2:]"

]

},

{

"cell_type": "code",

"execution_count": 29,

"metadata": {

"scrolled": true

},

"outputs": [

{

"name": "stderr",

"output_type": "stream",

"text": [

"D:\\programfiles\\anaconda3\\lib\\site-packages\\sklearn\\model_selection\\_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.\n",

" FutureWarning)\n"

]

}

],

"source": [

"X_train, X_test, Y_train, Y_test = train_test_split(x_train, y_train, train_size = 0.8,random_state = 0)"

]

},

{

"cell_type": "code",

"execution_count": 30,

"metadata": {},

"outputs": [],

"source": [

"train = pd.concat([X_train, Y_train], axis = 1, ignore_index=False)\n",

"test = pd.concat([X_test, Y_test], axis = 1, ignore_index=False)"

]

},

{

"cell_type": "code",

"execution_count": 31,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user\n",

"

song\n",

"

play_count\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

7971\n",

"

7bdfc45af7e15511d150e2acb798cd5e4788abf5\n",

"

SOXBCZH12A67ADAD77\n",

"

1.529637\n",

"

\n",

"

\n",

"

31459\n",

"

c405c586f6d7aadbbadfcba5393b543fd99372ff\n",

"

SOXFYTY127E9433E7D\n",

"

0.574713\n",

"

\n",

"

\n",

"

14683\n",

"

625d0167edbc5df88e9fbebe3fcdd6b121a316bb\n",

"

SONOYIB12A81C1F88C\n",

"

0.167504\n",

"

\n",

"

\n",

"

10005\n",

"

20ad98ab543da9ec41c6ac3b6354c5ab3ca6bc5e\n",

"

SOIMCDE12A6D4F8383\n",

"

0.233100\n",

"

\n",

"

\n",

"

10371\n",

"

d331a8bf7d0ca9cb37e375496e6075603f6fb44a\n",

"

SONYKOW12AB01849C9\n",

"

4.145078\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user song \\\n",

"7971 7bdfc45af7e15511d150e2acb798cd5e4788abf5 SOXBCZH12A67ADAD77 \n",

"31459 c405c586f6d7aadbbadfcba5393b543fd99372ff SOXFYTY127E9433E7D \n",

"14683 625d0167edbc5df88e9fbebe3fcdd6b121a316bb SONOYIB12A81C1F88C \n",

"10005 20ad98ab543da9ec41c6ac3b6354c5ab3ca6bc5e SOIMCDE12A6D4F8383 \n",

"10371 d331a8bf7d0ca9cb37e375496e6075603f6fb44a SONYKOW12AB01849C9 \n",

"\n",

" play_count \n",

"7971 1.529637 \n",

"31459 0.574713 \n",

"14683 0.167504 \n",

"10005 0.233100 \n",

"10371 4.145078 "

]

},

"execution_count": 31,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"train.head()"

]

},

{

"cell_type": "code",

"execution_count": 32,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user\n",

"

song\n",

"

play_count\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

26019\n",

"

3325fe1d8da7b13dd42004ede8011ce3d7cd205d\n",

"

SOURVJI12A58A7F353\n",

"

3.637413\n",

"

\n",

"

\n",

"

33943\n",

"

e82b3380f770c78f8f067f464941057c798eaca2\n",

"

SOKNWRZ12A8C13BF62\n",

"

5.205479\n",

"

\n",

"

\n",

"

15356\n",

"

bdfca47d03157d26f1404075172128a6f8a3d39e\n",

"

SOMNGMO12A6702187E\n",

"

0.790514\n",

"

\n",

"

\n",

"

5948\n",

"

7ffc14a55b6256c9fa73fc5c5761d210deb7f738\n",

"

SOGTQNI12AB0184A5C\n",

"

0.079428\n",

"

\n",

"

\n",

"

1466\n",

"

083a2a59603a605275107c00812a811526c2a0af\n",

"

SOXZOMB12AB017DA15\n",

"

0.370028\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user song \\\n",

"26019 3325fe1d8da7b13dd42004ede8011ce3d7cd205d SOURVJI12A58A7F353 \n",

"33943 e82b3380f770c78f8f067f464941057c798eaca2 SOKNWRZ12A8C13BF62 \n",

"15356 bdfca47d03157d26f1404075172128a6f8a3d39e SOMNGMO12A6702187E \n",

"5948 7ffc14a55b6256c9fa73fc5c5761d210deb7f738 SOGTQNI12AB0184A5C \n",

"1466 083a2a59603a605275107c00812a811526c2a0af SOXZOMB12AB017DA15 \n",

"\n",

" play_count \n",

"26019 3.637413 \n",

"33943 5.205479 \n",

"15356 0.790514 \n",

"5948 0.079428 \n",

"1466 0.370028 "

]

},

"execution_count": 32,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"test.head()"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 将数据保存 以便后续处理"

]

},

{

"cell_type": "code",

"execution_count": 33,

"metadata": {},

"outputs": [],

"source": [

"train.rename(columns={'user':'user_id', 'song':'item_id', 'play_count':'rating'}, inplace = True)\n",

"train.to_csv(path_or_buf = dpath + 'train.csv', index=False)\n",

"test.rename(columns={'user':'user_id', 'song':'item_id', 'play_count':'rating'}, inplace = True)\n",

"test.to_csv(path_or_buf = dpath + 'test.csv', index=False)"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 协同过滤数据准备\n",

"对训练数据,\n",

"1.建立用户和物品索引,方便用下标访问打分表\n",

"2.建立倒排表,加速查询访问\n",

"3.打分表按用户、物品索引保存为稀疏矩阵"

]

},

{

"cell_type": "code",

"execution_count": 6,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user_id\n",

"

item_id\n",

"

rating\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

0\n",

"

7bdfc45af7e15511d150e2acb798cd5e4788abf5\n",

"

SOXBCZH12A67ADAD77\n",

"

1.529637\n",

"

\n",

"

\n",

"

1\n",

"

c405c586f6d7aadbbadfcba5393b543fd99372ff\n",

"

SOXFYTY127E9433E7D\n",

"

0.574713\n",

"

\n",

"

\n",

"

2\n",

"

625d0167edbc5df88e9fbebe3fcdd6b121a316bb\n",

"

SONOYIB12A81C1F88C\n",

"

0.167504\n",

"

\n",

"

\n",

"

3\n",

"

20ad98ab543da9ec41c6ac3b6354c5ab3ca6bc5e\n",

"

SOIMCDE12A6D4F8383\n",

"

0.233100\n",

"

\n",

"

\n",

"

4\n",

"

d331a8bf7d0ca9cb37e375496e6075603f6fb44a\n",

"

SONYKOW12AB01849C9\n",

"

4.145078\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user_id item_id rating\n",

"0 7bdfc45af7e15511d150e2acb798cd5e4788abf5 SOXBCZH12A67ADAD77 1.529637\n",

"1 c405c586f6d7aadbbadfcba5393b543fd99372ff SOXFYTY127E9433E7D 0.574713\n",

"2 625d0167edbc5df88e9fbebe3fcdd6b121a316bb SONOYIB12A81C1F88C 0.167504\n",

"3 20ad98ab543da9ec41c6ac3b6354c5ab3ca6bc5e SOIMCDE12A6D4F8383 0.233100\n",

"4 d331a8bf7d0ca9cb37e375496e6075603f6fb44a SONYKOW12AB01849C9 4.145078"

]

},

"execution_count": 6,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"#读取训练数据\n",

"dpath = './data/'\n",

"df_triplet = pd.read_csv(dpath +\"train.csv\")\n",

"df_triplet.head()"

]

},

{

"cell_type": "code",

"execution_count": 7,

"metadata": {},

"outputs": [],

"source": [

"#统计总的用户数目和物品数目\n",

"unique_users = df_triplet['user_id'].unique()\n",

"unique_items = df_triplet['item_id'].unique()\n",

"\n",

"n_users = unique_users.shape[0]\n",

"n_items = unique_items.shape[0]"

]

},

{

"cell_type": "code",

"execution_count": 8,

"metadata": {},

"outputs": [],

"source": [

"#建立用户和物品的索引表\n",

"#本数据集中user_id和item_id都已经是索引了,可以减1,将从1开始编码变成从0开始的编码\n",

"#下面的代码更通用,可对任意编码的用户和物品重新索引\n",

"users_index = dict()\n",

"items_index = dict()\n",

"\n",

"for j, u in enumerate(unique_users):\n",

" users_index[u] = j\n",

" \n",

"#重新编码活动索引字典 \n",

"for j, i in enumerate(unique_items):\n",

" items_index[i] = j\n",

" \n",

"#保存用户索引表\n",

"cPickle.dump(users_index, open(\"users_index.pkl\", 'wb'))\n",

"#保存活动索引表\n",

"cPickle.dump(items_index, open(\"items_index.pkl\", 'wb'))"

]

},

{

"cell_type": "code",

"execution_count": 9,

"metadata": {},

"outputs": [],

"source": [

"#倒排表\n",

"#统计每个用户打过分的电影 / 每个电影被哪些用户打过分\n",

"user_items = defaultdict(set)\n",

"item_users = defaultdict(set)\n",

"\n",

"#用户-物品关系矩阵R, 稀疏矩阵,记录用户对每个电影的打分\n",

"user_item_scores = ss.dok_matrix((n_users, n_items))\n",

"\n",

"#扫描训练数据\n",

"for line in df_triplet.index: #对每条记录\n",

" cur_user_index = users_index [df_triplet.iloc[line]['user_id']]\n",

" cur_item_index = items_index [df_triplet.iloc[line]['item_id']]\n",

" \n",

" #倒排表\n",

" user_items[cur_user_index].add(cur_item_index) #该用户对这个电影进行了打分\n",

" item_users[cur_item_index].add(cur_user_index) #该电影被该用户打分\n",

" \n",

" user_item_scores[cur_user_index, cur_item_index] = df_triplet.iloc[line]['rating']\n",

"\n",

"\n",

"##保存倒排表\n",

"#每个用户打分的电影\n",

"cPickle.dump(user_items, open(\"user_items.pkl\", 'wb'))\n",

"##对每个电影打过分的用户\n",

"cPickle.dump(item_users, open(\"item_users.pkl\", 'wb'))\n",

"\n",

"#保存打分矩阵,在UserCF和ItemCF中用到\n",

"sio.mmwrite(\"user_item_scores\", user_item_scores)"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## 基于物品的协同过滤"

]

},

{

"cell_type": "code",

"execution_count": 10,

"metadata": {},

"outputs": [],

"source": [

"import pandas as pd\n",

"import numpy as np\n",

"\n",

"#load数据(用户和物品索引,以及倒排表)\n",

"import _pickle as cPickle\n",

"\n",

"#稀疏矩阵,打分表\n",

"import scipy.io as sio\n",

"import os\n",

"\n",

"#距离\n",

"import scipy.spatial.distance as ssd"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 读入数据"

]

},

{

"cell_type": "code",

"execution_count": 11,

"metadata": {},

"outputs": [],

"source": [

"#用户和item的索引\n",

"users_index = cPickle.load(open(\"users_index.pkl\", 'rb'))\n",

"items_index = cPickle.load(open(\"items_index.pkl\", 'rb'))\n",

"\n",

"n_users = len(users_index)\n",

"n_items = len(items_index)\n",

" \n",

"#倒排表\n",

"##每个用户打过分的电影\n",

"user_items = cPickle.load(open(\"user_items.pkl\", 'rb'))\n",

"##对每个电影打过分的事用户\n",

"item_users = cPickle.load(open(\"item_users.pkl\", 'rb'))\n",

"\n",

"#用户-物品关系矩阵R\n",

"user_item_scores = sio.mmread(\"user_item_scores\")#.todense()\n",

"user_item_scores = user_item_scores.tocsr()"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 计算每个用户的平均打分"

]

},

{

"cell_type": "code",

"execution_count": 12,

"metadata": {},

"outputs": [],

"source": [

"users_mu = np.zeros(n_users)\n",

"for u in range(n_users): \n",

" n_user_items = 0\n",

" r_acc = 0.0\n",

" \n",

" for i in user_items[u]: #用户打过分的item\n",

" r_acc += user_item_scores[u,i]\n",

" n_user_items += 1\n",

" \n",

" users_mu[u] = r_acc/n_user_items"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 计算两个item之间的相似度"

]

},

{

"cell_type": "code",

"execution_count": 13,

"metadata": {},

"outputs": [],

"source": [

"def item_similarity(iid1, iid2):\n",

" su={} #有效item(两个用户均有打分的item)的集合\n",

" for user in item_users[iid1]: #对iid1所有打过分的用户\n",

" if user in item_users[iid2]: #如果该用户对iid2也打过分\n",

" su[user]=1 #该用户为一个有效user\n",

" \n",

" n=len(su) #有效item数,有效item为即对uid对Item打过分,uid2也对Item打过分\n",

" if (n==0): #没有共同打过分的item,相似度设为0?\n",

" similarity=0 \n",

" return similarity \n",

" \n",

" #iid1的有效打分(减去用户的平均打分)\n",

" s1=np.array([user_item_scores[user,iid1]-users_mu[user] for user in su])\n",

" \n",

" #iid2的有效打分(减去用户的平均打分)\n",

" s2=np.array([user_item_scores[user,iid2]-users_mu[user] for user in su]) \n",

" \n",

" similarity = 1 - ssd.cosine(s1, s2) \n",

" if( np.isnan(similarity) ): #分母为0(s1或s2中所有元素为0)\n",

" similarity = 0.0\n",

" return similarity "

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 预计算好所有item之间的相似性\n",

"对item比较少、Item比较固定的系统适用"

]

},

{

"cell_type": "code",

"execution_count": 14,

"metadata": {

"scrolled": true

},

"outputs": [

{

"name": "stdout",

"output_type": "stream",

"text": [

"i=0 \n"

]

},

{

"name": "stderr",

"output_type": "stream",

"text": [

"D:\\programfiles\\anaconda3\\lib\\site-packages\\scipy\\spatial\\distance.py:702: RuntimeWarning: invalid value encountered in double_scalars\n",

" dist = 1.0 - uv / np.sqrt(uu * vv)\n"

]

},

{

"name": "stdout",

"output_type": "stream",

"text": [

"i=100 \n",

"i=200 \n",

"i=300 \n",

"i=400 \n",

"i=500 \n",

"i=600 \n",

"i=700 \n"

]

}

],

"source": [

"items_similarity_matrix = np.matrix(np.zeros(shape=(n_items, n_items)), float)\n",

"\n",

"for i in range(n_items):\n",

" items_similarity_matrix[i,i] = 1.0\n",

" \n",

" #打印进度条\n",

" if(i % 100 == 0):\n",

" print (\"i=%d \" % (i) )\n",

"\n",

" for j in range(i+1,n_items): #items by user \n",

" items_similarity_matrix[j,i] = item_similarity(i, j)\n",

" items_similarity_matrix[i,j] = items_similarity_matrix[j,i]\n",

"\n",

"cPickle.dump(items_similarity_matrix, open(\"items_similarity.pkl\", 'wb')) "

]

},

{

"cell_type": "code",

"execution_count": 15,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/plain": [

"(800, 800)"

]

},

"execution_count": 15,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"items_similarity_matrix.shape"

]

},

{

"cell_type": "code",

"execution_count": 16,

"metadata": {},

"outputs": [],

"source": [

"def items_similarity(n_items ):\n",

" items_similarity_matrix = np.matrix(np.zeros(shape=(n_items, n_items)), float)\n",

"\n",

" for i in range(n_items):\n",

" items_similarity_matrix[i,i] = 1.0\n",

" \n",

" #打印进度条\n",

" if(i % 100 == 0):\n",

" print (\"i=:%d \" % (i) )\n",

"\n",

" for j in range(i+1,n_items): #items by user \n",

" items_similarity_matrix[j,i] = item_similarity(i, j)\n",

" items_similarity_matrix[i,j] = items_similarity_matrix[j,i]\n",

"\n",

" cPickle.dump(items_similarity_matrix, open(\"items_similarity.pkl\", 'wb'))\n",

" return items_similarity_matrix"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 测试"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 预测用户对item的预测打分1\n",

"利用用户打过分的item与所有item的相似度计算预测打分 RMSE为3.X, 很多预测打分为负值"

]

},

{

"cell_type": "code",

"execution_count": 17,

"metadata": {},

"outputs": [],

"source": [

"### 预测用户对item的打分\n",

"def Item_CF_pred1(uid, iid): \n",

" sim_accumulate=0.0 \n",

" rat_acc=0.0 \n",

" \n",

" for item_id in user_items[uid]: #对用户打过分的item\n",

" #计算当前用户打过分item与其他item之间的相似度\n",

" #sim = item_similarity(item_id, iid)\n",

" sim = items_similarity_matrix[item_id, iid]\n",

" \n",

" #由于相似性可能为负,而用户打过分的item又不多(与iid不相似的item占多数)预测打分为负\n",

" if sim != 0: \n",

" rat_acc += sim * (user_item_scores[uid, item_id]) #用户user对item i的打分\n",

" sim_accumulate += np.abs(sim) \n",

" \n",

" if sim_accumulate != 0: #no same user rated,return average rates of the data \n",

" score = rat_acc/sim_accumulate\n",

" else: #no items the same user rated,return average rates of the user \n",

" score = users_mu[uid]\n",

" \n",

" if score <0:\n",

" score = 0.0\n",

" \n",

" return score"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 预测用户对item的预测打分2\n",

"利用用户打过分的item中,与item最相似的n_Knns最物品计算预测打分 RMSE为1.13 PR和覆盖度性能最好"

]

},

{

"cell_type": "code",

"execution_count": 18,

"metadata": {},

"outputs": [],

"source": [

"### 预测用户对item的打分, 取该用户n_Knns最相似的物品\n",

"def Item_CF_pred2(uid, iid, n_Knns): \n",

" sim_accumulate=0.0 \n",

" rat_acc=0.0 \n",

" n_nn_items = 0\n",

" \n",

" try:\n",

" #相似度排序\n",

" cur_items_similarity = np.array(items_similarity_matrix[iid,:])\n",

" cur_items_similarity = cur_items_similarity.flatten()\n",

" sort_index = sorted(((e,i) for i,e in enumerate(list(cur_items_similarity))), reverse=True)\n",

" except Exception:\n",

" pdb.set_trace()\n",

" print(\"iid\",iid)\n",

" \n",

" for i in range(0,len(sort_index)):\n",

" cur_item_index = sort_index[i][1]\n",

" \n",

" if n_nn_items >= n_Knns: #相似的items已经足够多(>n_Knns)\n",

" break;\n",

" \n",

" if cur_item_index in user_items[uid]: #对用户打过分的item\n",

" #计算当前用户打过分item与其他item之间的相似度\n",

" #sim = item_similarity(cur_item_index, iid)\n",

" sim = items_similarity_matrix[iid, cur_item_index]\n",

" \n",

" if sim != 0: \n",

" rat_acc += sim * (user_item_scores[uid, cur_item_index]) #用户user对item i的打分\n",

" sim_accumulate += np.abs(sim) \n",

" \n",

" n_nn_items += 1\n",

" \n",

" if sim_accumulate != 0: \n",

" score = rat_acc/sim_accumulate\n",

" else: #no similar items,return average rates of the user \n",

" score = users_mu[uid]\n",

" \n",

" if score <0:\n",

" score = 0.0\n",

" \n",

" return score"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 预测用户对item的预测打分3\n",

"与item最相似的n_Knns最物品计算预测打分 RMSE为1.0"

]

},

{

"cell_type": "code",

"execution_count": 19,

"metadata": {},

"outputs": [],

"source": [

"### 预测用户对item的打分, 取所有item中n_Knns最相似的物品\n",

"def Item_CF_pred3(uid, iid, n_Knns): \n",

" sim_accumulate=0.0 \n",

" rat_acc=0.0 \n",

" \n",

" #相似度排序\n",

" cur_items_similarity = np.array(items_similarity_matrix[iid,:])\n",

" cur_items_similarity = cur_items_similarity.flatten()\n",

" sort_index = sorted(((e,i) for i,e in enumerate(list(cur_items_similarity))), reverse=True)[0: n_Knns]\n",

" \n",

" for i in range(0,len(sort_index)):\n",

" cur_item_index = sort_index[i][1]\n",

" \n",

" if cur_item_index in user_items[uid]: #对用户打过分的item\n",

" #计算当前用户打过分item与其他item之间的相似度\n",

" #sim = item_similarity(cur_item_index, iid)\n",

" sim = items_similarity_matrix[iid, cur_item_index]\n",

" \n",

" if sim != 0: \n",

" rat_acc += sim * (user_item_scores[uid, cur_item_index]) #用户user对item i的打分\n",

" sim_accumulate += np.abs(sim) \n",

" \n",

" if sim_accumulate != 0: \n",

" score = rat_acc/sim_accumulate\n",

" else: #no items the same user rated,return average rates of the user \n",

" score = users_mu[uid]\n",

" \n",

" if score <0:\n",

" score = 0.0\n",

" \n",

" return score"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 对给定用户,推荐物品/计算打分\n",

"不同的推荐算法,只是预测打分函数不同, user_items_scores[i] = User_CF_pred(cur_user_id, i) #预测打分\n",

"如User_CF_pred, Item_CF_pred, svd_CF_pred,... 甚至基于内容的推荐也是一样。"

]

},
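{

"cell_type": "markdown",

"metadata": {},

"source": [

"A sketch of that pluggable design (a hypothetical refactoring, not the function used below): any scorer with signature (user_index, item_index) -> float can be passed in.\n",

"```python\n",

"# Hypothetical helper: recommend with an arbitrary scoring function.\n",

"def recommend_with(user, pred_fn):\n",

"    cur_user_id = users_index[user]\n",

"    scores = {i: pred_fn(cur_user_id, i)\n",

"              for i in range(n_items) if i not in user_items[cur_user_id]}\n",

"    # (item_index, score) pairs, best first\n",

"    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)\n",

"\n",

"# e.g. recommend_with(user, User_CF_pred)\n",

"#      recommend_with(user, lambda u, i: Item_CF_pred2(u, i, 10))\n",

"```"

]

},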

{

"cell_type": "code",

"execution_count": 20,

"metadata": {},

"outputs": [],

"source": [

"#user:用户\n",

"#返回推荐items及其打分(DataFrame)\n",

"def recommend(user):\n",

" cur_user_id = users_index[user]\n",

" \n",

" #训练集中该用户打过分的item\n",

" cur_user_items = user_items[cur_user_id]\n",

"\n",

" #该用户对所有item的打分\n",

" user_items_scores = np.zeros(n_items)\n",

"\n",

" #预测打分\n",

" for i in range(n_items): # all items \n",

" if i not in cur_user_items: #训练集中没打过分\n",

" \n",

" \n",

" user_items_scores[i] = Item_CF_pred2(cur_user_id, i, 10) #预测打分\n",

" \n",

" #推荐\n",

" #Sort the indices of user_item_scores based upon their value,Also maintain the corresponding score\n",

" sort_index = sorted(((e,i) for i,e in enumerate(list(user_items_scores))), reverse=True)\n",

" \n",

" #Create a dataframe from the following\n",

" columns = ['item_id', 'score']\n",

" df = pd.DataFrame(columns=columns)\n",

" \n",

" #Fill the dataframe with top 20 (n_rec_items) item based recommendations\n",

" #sort_index = sort_index[0:n_rec_items]\n",

" #Fill the dataframe with all items based recommendations\n",

" for i in range(0,len(sort_index)):\n",

" cur_item_index = sort_index[i][1] \n",

" cur_item = list (items_index.keys()) [list (items_index.values()).index (cur_item_index)]\n",

" \n",

" if ~np.isnan(sort_index[i][0]) and cur_item_index not in cur_user_items:\n",

" df.loc[len(df)]=[cur_item, sort_index[i][0]]\n",

" \n",

" return df"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 读取测试数据"

]

},

{

"cell_type": "code",

"execution_count": 24,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user_id\n",

"

item_id\n",

"

rating\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

0\n",

"

3325fe1d8da7b13dd42004ede8011ce3d7cd205d\n",

"

SOURVJI12A58A7F353\n",

"

3.637413\n",

"

\n",

"

\n",

"

1\n",

"

e82b3380f770c78f8f067f464941057c798eaca2\n",

"

SOKNWRZ12A8C13BF62\n",

"

5.205479\n",

"

\n",

"

\n",

"

2\n",

"

bdfca47d03157d26f1404075172128a6f8a3d39e\n",

"

SOMNGMO12A6702187E\n",

"

0.790514\n",

"

\n",

"

\n",

"

3\n",

"

7ffc14a55b6256c9fa73fc5c5761d210deb7f738\n",

"

SOGTQNI12AB0184A5C\n",

"

0.079428\n",

"

\n",

"

\n",

"

4\n",

"

083a2a59603a605275107c00812a811526c2a0af\n",

"

SOXZOMB12AB017DA15\n",

"

0.370028\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user_id item_id rating\n",

"0 3325fe1d8da7b13dd42004ede8011ce3d7cd205d SOURVJI12A58A7F353 3.637413\n",

"1 e82b3380f770c78f8f067f464941057c798eaca2 SOKNWRZ12A8C13BF62 5.205479\n",

"2 bdfca47d03157d26f1404075172128a6f8a3d39e SOMNGMO12A6702187E 0.790514\n",

"3 7ffc14a55b6256c9fa73fc5c5761d210deb7f738 SOGTQNI12AB0184A5C 0.079428\n",

"4 083a2a59603a605275107c00812a811526c2a0af SOXZOMB12AB017DA15 0.370028"

]

},

"execution_count": 24,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"#读取测试数据\n",

"dpath = './data/'\n",

"df_triplet_test = pd.read_csv(dpath +\"test.csv\")\n",

"df_triplet_test.head()"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 测试,并计算评价指标"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"推荐歌曲书为10"

]

},
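{

"cell_type": "markdown",

"metadata": {},

"source": [

"The metrics computed below, with hits counted over all test users that also appear in the training set and $N$ recommendations per such user (here $N = 10$):\n",

"$\\text{Precision} = \\frac{\\#\\text{hits}}{N \\cdot \\#\\text{users}}$, $\\text{Recall} = \\frac{\\#\\text{hits}}{\\#\\text{test records}}$, $\\text{Coverage} = \\frac{|\\text{distinct recommended items}|}{\\#\\text{items}}$, $\\text{RMSE} = \\sqrt{\\frac{1}{\\#\\text{test records}} \\sum_{(u,i)} (\\hat{r}_{ui} - r_{ui})^2}$."

]

},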

{

"cell_type": "code",

"execution_count": 25,

"metadata": {

"scrolled": true

},

"outputs": [

{

"name": "stdout",

"output_type": "stream",

"text": [

"67c5b5b1982902d15badd8ce0c18b3278ec4bfc0 is a new user.\n",

"\n",

"62420be0fd0df5ab0eb4cba35a4bc7cb3e3b506a is a new user.\n",

"\n",

"3ab78e39bddeaeb789edad041fff03050077417c is a new user.\n",

"\n"

]

}

],

"source": [

"#统计总的用户\n",

"unique_users_test = df_triplet_test['user_id'].unique()\n",

"\n",

"#为每个用户推荐的item的数目\n",

"n_rec_items = 10\n",

"\n",

"#性能评价参数初始化,用户计算Percison和Recall\n",

"n_hits = 0\n",

"n_total_rec_items = 0\n",

"n_test_items = 0\n",

"\n",

"#所有被推荐商品的集合(对不同用户),用于计算覆盖度\n",

"all_rec_items = set()\n",

"\n",

"#残差平方和,用与计算RMSE\n",

"rss_test = 0.0\n",

"\n",

"#对每个测试用户\n",

"for user in unique_users_test:\n",

" #测试集中该用户打过分的电影(用于计算评价指标的真实值)\n",

" if user not in users_index: #user在训练集中没有出现过,新用户不能用协同过滤\n",

" print(str(user) + ' is a new user.\\n')\n",

" continue\n",

" \n",

" user_records_test= df_triplet_test[df_triplet_test.user_id == user]\n",

" \n",

" #对每个测试用户,计算该用户对训练集中未出现过的商品的打分,并基于该打分进行推荐(top n_rec_items)\n",

" #返回结果为DataFrame\n",

" \n",

" rec_items = recommend(user)\n",

" \n",

" for i in range(n_rec_items):\n",

" item = rec_items.iloc[i]['item_id']\n",

" \n",

" if item in user_records_test['item_id'].values:\n",

" n_hits += 1\n",

" all_rec_items.add(item)\n",

" \n",

" #计算rmse\n",

" for i in range(user_records_test.shape[0]):\n",

" item = user_records_test.iloc[i]['item_id']\n",

" score = user_records_test.iloc[i]['rating']\n",

" \n",

" df1 = rec_items[rec_items.item_id == item]\n",

" if(df1.shape[0] == 0): #item在训练集中没有出现过,新item不能被协同过滤推荐\n",

" print(str(item) + ' is a new item.\\n')\n",

" continue\n",

" pred_score = df1['score'].values[0]\n",

" rss_test += (pred_score - score)**2 #残差平方和\n",

" \n",

" #推荐的item总数\n",

" n_total_rec_items += n_rec_items\n",

" \n",

" #真实item的总数\n",

" n_test_items += user_records_test.shape[0]\n",

"\n",

"#Precision & Recall\n",

"precision = n_hits / (1.0*n_total_rec_items)\n",

"recall = n_hits / (1.0*n_test_items)\n",

"\n",

"#覆盖度:推荐商品占总需要推荐商品的比例\n",

"coverage = len(all_rec_items) / (1.0* n_items)\n",

"\n",

"#打分的均方误差\n",

"rmse=np.sqrt(rss_test / df_triplet_test.shape[0]) "

]

},

{

"cell_type": "code",

"execution_count": 26,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/plain": [

"0.017427385892116183"

]

},

"execution_count": 26,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"precision"

]

},

{

"cell_type": "code",

"execution_count": 27,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/plain": [

"0.01679776029862685"

]

},

"execution_count": 27,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"recall"

]

},

{

"cell_type": "code",

"execution_count": 28,

"metadata": {},

"outputs": [

{

"data": {

"text/plain": [

"0.99625"

]

},

"execution_count": 28,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"coverage"

]

},

{

"cell_type": "code",

"execution_count": 29,

"metadata": {},

"outputs": [

{

"data": {

"text/plain": [

"5.104169274441924"

]

},

"execution_count": 29,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"rmse"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 基于用户的协同过滤"

]

},

{

"cell_type": "code",

"execution_count": 30,

"metadata": {},

"outputs": [],

"source": [

"import pandas as pd\n",

"import numpy as np\n",

"\n",

"#load数据(用户和物品索引,以及倒排表)\n",

"import _pickle as cPickle\n",

"\n",

"#稀疏矩阵,打分表\n",

"import scipy.io as sio\n",

"import os\n",

"\n",

"#距离\n",

"import scipy.spatial.distance as ssd"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 读入训练数据"

]

},

{

"cell_type": "code",

"execution_count": 31,

"metadata": {},

"outputs": [],

"source": [

"#用户和item的索引\n",

"users_index = cPickle.load(open(\"users_index.pkl\", 'rb'))\n",

"items_index = cPickle.load(open(\"items_index.pkl\", 'rb'))\n",

"\n",

"n_users = len(users_index)\n",

"n_items = len(items_index)\n",

" \n",

"#倒排表\n",

"##每个用户打过分的电影\n",

"user_items = cPickle.load(open(\"user_items.pkl\", 'rb'))\n",

"##对每个电影打过分的事用户\n",

"item_users = cPickle.load(open(\"item_users.pkl\", 'rb'))\n",

"\n",

"#用户-物品关系矩阵R\n",

"user_item_scores = sio.mmread(\"user_item_scores\")#.todense()\n",

"user_item_scores = user_item_scores.tocsr()"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 计算每个用户的平均打分"

]

},

{

"cell_type": "code",

"execution_count": 32,

"metadata": {},

"outputs": [],

"source": [

"users_mu = np.zeros(n_users)\n",

"for u in range(n_users): \n",

" n_user_items = 0\n",

" r_acc = 0.0\n",

" \n",

" for i in user_items[u]: #用户打过分的item\n",

" r_acc += user_item_scores[u,i]\n",

" n_user_items += 1\n",

" \n",

" users_mu[u] = r_acc/n_user_items"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 计算两个用户之间的相似度"

]

},

{

"cell_type": "code",

"execution_count": 33,

"metadata": {},

"outputs": [],

"source": [

"def user_similarity(uid1, uid2 ):\n",

" si={} #有效item(两个用户均有打分的item)的集合\n",

" for item in user_items[uid1]: #uid1所有打过分的Item1\n",

" if item in user_items[uid2]: #如果uid2也对该Item打过分\n",

" si[item]=1 #item为一个有效item\n",

" \n",

" n=len(si) #有效item数,有效item为即对uid对Item打过分,uid2也对Item打过分\n",

" if (n==0): #没有共同打过分的item,相似度设为0?\n",

" similarity=0.0 \n",

" return similarity \n",

" \n",

" #用户uid1的有效打分(减去该用户的平均打分)\n",

" s1=np.array([user_item_scores[uid1,item]-users_mu[uid1] for item in si]) \n",

" \n",

" #用户uid2的有效打分(减去该用户的平均打分)\n",

" s2=np.array([user_item_scores[uid2,item]-users_mu[uid2] for item in si]) \n",

" \n",

" similarity = 1 - ssd.cosine(s1, s2) \n",

" \n",

" if np.isnan(similarity): #s1或s2的l2模为0(全部等于该用户的平均打分)\n",

" similarity = 0.0\n",

" return similarity "

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 预计算好所有用户之间的相似性\n",

"对用户比较少、用户比较固定的的系统适用"

]

},

{

"cell_type": "code",

"execution_count": 34,

"metadata": {

"scrolled": true

},

"outputs": [

{

"name": "stdout",

"output_type": "stream",

"text": [

"ui=0 \n",

"ui=100 \n",

"ui=200 \n",

"ui=300 \n",

"ui=400 \n",

"ui=500 \n",

"ui=600 \n",

"ui=700 \n"

]

}

],

"source": [

"users_similarity_matrix = np.matrix(np.zeros(shape=(n_users, n_users)), float)\n",

"\n",

"for ui in range(n_users):\n",

" users_similarity_matrix[ui,ui] = 1.0\n",

" \n",

" #打印进度条\n",

" if(ui % 100 == 0):\n",

" print (\"ui=%d \" % (ui))\n",

"\n",

" for uj in range(ui+1,n_users): \n",

" users_similarity_matrix[uj,ui] = user_similarity(ui, uj)\n",

" users_similarity_matrix[ui,uj] = users_similarity_matrix[uj,ui]\n",

"\n",

"cPickle.dump(users_similarity_matrix, open(\"users_similarity.pkl\", 'wb')) "

]

},

{

"cell_type": "code",

"execution_count": 35,

"metadata": {},

"outputs": [],

"source": [

"def users_similarity(n_users ):\n",

" users_similarity_matrix = np.matrix(np.zeros(shape=(n_users, n_users)), float)\n",

"\n",

" for ui in range(n_users):\n",

" users_similarity_matrix[ui,ui] = 1.0\n",

" \n",

" #打印进度条\n",

" if(ui % 100 == 0):\n",

" print (\"ui=:%d \" % (ui))\n",

"\n",

" for uj in range(ui+1,n_users): \n",

" users_similarity_matrix[uj,ui] = user_similarity(ui, uj)\n",

" users_similarity_matrix[ui,uj] = users_similarity_matrix[uj,ui]\n",

"\n",

" cPickle.dump(users_similarity_matrix, open(\"users_similarity.pkl\", 'wb')) \n",

" return users_similarity_matrix"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 测试"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 预测用户对item的打分"

]

},

{

"cell_type": "code",

"execution_count": 36,

"metadata": {},

"outputs": [],

"source": [

"### 预测用户对item的打分\n",

"def User_CF_pred(uid, iid): \n",

" sim_accumulate=0.0 \n",

" rat_acc=0.0 \n",

" for user_id in item_users[iid]: #对item iid打过分的所有用户\n",

" #计算当前用户与给item i打过分的用户之间的相似度\n",

" #sim = user_similarity(user_id, uid)\n",

" sim = users_similarity_matrix[user_id,uid]\n",

" \n",

" if sim != 0: \n",

" rat_acc += sim * (user_item_scores[user_id,iid] - users_mu[user_id]) #用户user对item i的打分\n",

" sim_accumulate += np.abs(sim) \n",

" \n",

" if sim_accumulate != 0: \n",

" score = users_mu[uid] + rat_acc/sim_accumulate\n",

" else: #no similar users,return average rates of the user \n",

" score = users_mu[uid]\n",

" \n",

" return score"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"#### 对给定用户,推荐物品/计算打分\n",

"不同的推荐算法,只是预测打分函数不同, user_items_scores[i] = User_CF_pred(cur_user_id, i) #预测打分\n",

"如User_CF_pred, Item_CF_pred, svd_CF_pred,... 甚至基于内容的推荐也是一样"

]

},

{

"cell_type": "code",

"execution_count": 37,

"metadata": {},

"outputs": [],

"source": [

"#user:用户\n",

"#返回推荐items及其打分(DataFrame)\n",

"def recommend(user):\n",

" cur_user_id = users_index[user]\n",

" \n",

" #训练集中该用户打过分的item\n",

" cur_user_items = user_items[cur_user_id]\n",

"\n",

" #该用户对所有item的打分\n",

" user_items_scores = np.zeros(n_items)\n",

"\n",

" #预测打分\n",

" for i in range(n_items): # all items \n",

" if i not in cur_user_items: #训练集中没打过分\n",

" user_items_scores[i] = User_CF_pred(cur_user_id, i) #预测打分\n",

" \n",

" #推荐\n",

" #Sort the indices of user_item_scores based upon their value,Also maintain the corresponding score\n",

" sort_index = sorted(((e,i) for i,e in enumerate(list(user_items_scores))), reverse=True)\n",

" \n",

" #Create a dataframe from the following\n",

" columns = ['item_id', 'score']\n",

" df = pd.DataFrame(columns=columns)\n",

" \n",

" #Fill the dataframe with top 20 (n_rec_items) item based recommendations\n",

" #sort_index = sort_index[0:n_rec_items]\n",

" #Fill the dataframe with all items based recommendations\n",

" for i in range(0,len(sort_index)):\n",

" cur_item_index = sort_index[i][1] \n",

" cur_item = list (items_index.keys()) [list (items_index.values()).index (cur_item_index)]\n",

" \n",

" if ~np.isnan(sort_index[i][0]) and cur_item_index not in cur_user_items:\n",

" df.loc[len(df)]=[cur_item, sort_index[i][0]]\n",

" \n",

" return df"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 读取测试数据"

]

},

{

"cell_type": "code",

"execution_count": 38,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user_id\n",

"

item_id\n",

"

rating\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

0\n",

"

3325fe1d8da7b13dd42004ede8011ce3d7cd205d\n",

"

SOURVJI12A58A7F353\n",

"

3.637413\n",

"

\n",

"

\n",

"

1\n",

"

e82b3380f770c78f8f067f464941057c798eaca2\n",

"

SOKNWRZ12A8C13BF62\n",

"

5.205479\n",

"

\n",

"

\n",

"

2\n",

"

bdfca47d03157d26f1404075172128a6f8a3d39e\n",

"

SOMNGMO12A6702187E\n",

"

0.790514\n",

"

\n",

"

\n",

"

3\n",

"

7ffc14a55b6256c9fa73fc5c5761d210deb7f738\n",

"

SOGTQNI12AB0184A5C\n",

"

0.079428\n",

"

\n",

"

\n",

"

4\n",

"

083a2a59603a605275107c00812a811526c2a0af\n",

"

SOXZOMB12AB017DA15\n",

"

0.370028\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user_id item_id rating\n",

"0 3325fe1d8da7b13dd42004ede8011ce3d7cd205d SOURVJI12A58A7F353 3.637413\n",

"1 e82b3380f770c78f8f067f464941057c798eaca2 SOKNWRZ12A8C13BF62 5.205479\n",

"2 bdfca47d03157d26f1404075172128a6f8a3d39e SOMNGMO12A6702187E 0.790514\n",

"3 7ffc14a55b6256c9fa73fc5c5761d210deb7f738 SOGTQNI12AB0184A5C 0.079428\n",

"4 083a2a59603a605275107c00812a811526c2a0af SOXZOMB12AB017DA15 0.370028"

]

},

"execution_count": 38,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"#读取测试数据\n",

"dpath = './data/'\n",

"df_triplet_test = pd.read_csv(dpath +\"test.csv\")\n",

"df_triplet_test.head()"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 测试,并计算评价指标"

]

},

{

"cell_type": "code",

"execution_count": 39,

"metadata": {

"scrolled": true

},

"outputs": [

{

"name": "stdout",

"output_type": "stream",

"text": [

"67c5b5b1982902d15badd8ce0c18b3278ec4bfc0 is a new user.\n",

"\n",

"62420be0fd0df5ab0eb4cba35a4bc7cb3e3b506a is a new user.\n",

"\n",

"3ab78e39bddeaeb789edad041fff03050077417c is a new user.\n",

"\n"

]

}

],

"source": [

"#统计总的用户\n",

"unique_users_test = df_triplet_test['user_id'].unique()\n",

"\n",

"#为每个用户推荐的item的数目\n",

"n_rec_items = 10\n",

"\n",

"#性能评价参数初始化,用户计算Percison和Recall\n",

"n_hits = 0\n",

"n_total_rec_items = 0\n",

"n_test_items = 0\n",

"\n",

"#所有被推荐商品的集合(对不同用户),用于计算覆盖度\n",

"all_rec_items = set()\n",

"\n",

"#残差平方和,用与计算RMSE\n",

"rss_test = 0.0\n",

"\n",

"#对每个测试用户\n",

"for user in unique_users_test:\n",

" #测试集中该用户打过分的电影(用于计算评价指标的真实值)\n",

" if user not in users_index: #user在训练集中没有出现过,新用户不能用协同过滤\n",

" print(str(user) + ' is a new user.\\n')\n",

" continue\n",

" \n",

" user_records_test= df_triplet_test[df_triplet_test.user_id == user]\n",

" \n",

" #对每个测试用户,计算该用户对训练集中未出现过的商品的打分,并基于该打分进行推荐(top n_rec_items)\n",

" #返回结果为DataFrame\n",

" rec_items = recommend(user)\n",

" \n",

" for i in range(n_rec_items):\n",

" item = rec_items.iloc[i]['item_id']\n",

" \n",

" if item in user_records_test['item_id'].values:\n",

" n_hits += 1\n",

" all_rec_items.add(item)\n",

" \n",

" #计算rmse\n",

" for i in range(user_records_test.shape[0]):\n",

" item = user_records_test.iloc[i]['item_id']\n",

" score = user_records_test.iloc[i]['rating']\n",

" \n",

" df1 = rec_items[rec_items.item_id == item]\n",

" if(df1.shape[0] == 0): #item在训练集中没有出现过,新item不能被协同过滤推荐\n",

" print(str(item) + ' is a new item.\\n')\n",

" continue\n",

" pred_score = df1['score'].values[0]\n",

" rss_test += (pred_score - score)**2 #残差平方和\n",

" \n",

" #推荐的item总数\n",

" n_total_rec_items += n_rec_items\n",

" \n",

" #真实item的总数\n",

" n_test_items += user_records_test.shape[0]\n",

"\n",

"#Precision & Recall\n",

"precision = n_hits / (1.0*n_total_rec_items)\n",

"recall = n_hits / (1.0*n_test_items)\n",

"\n",

"#覆盖度:推荐商品占总需要推荐商品的比例\n",

"coverage = len(all_rec_items) / (1.0* n_items)\n",

"\n",

"#打分的均方误差\n",

"rmse=np.sqrt(rss_test / df_triplet_test.shape[0]) "

]

},

{

"cell_type": "code",

"execution_count": 40,

"metadata": {},

"outputs": [

{

"data": {

"text/plain": [

"0.011341632088520055"

]

},

"execution_count": 40,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"precision"

]

},

{

"cell_type": "code",

"execution_count": 41,

"metadata": {},

"outputs": [

{

"data": {

"text/plain": [

"0.010931875749900012"

]

},

"execution_count": 41,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"recall"

]

},

{

"cell_type": "code",

"execution_count": 42,

"metadata": {},

"outputs": [

{

"data": {

"text/plain": [

"0.3525"

]

},

"execution_count": 42,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"coverage"

]

},

{

"cell_type": "code",

"execution_count": 43,

"metadata": {},

"outputs": [

{

"data": {

"text/plain": [

"4.95095498958581"

]

},

"execution_count": 43,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"rmse"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## 基于svd的协同过滤"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 导入工具包"

]

},

{

"cell_type": "code",

"execution_count": 252,

"metadata": {},

"outputs": [],

"source": [

"import pandas as pd\n",

"import numpy as np\n",

"\n",

"#load数据(用户和物品索引,以及倒排表)\n",

"import _pickle as cPickle\n",

"import json \n",

"\n",

"from numpy.random import random\n",

"import math"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 读入数据"

]

},

{

"cell_type": "code",

"execution_count": 253,

"metadata": {},

"outputs": [],

"source": [

"#用户和item的索引\n",

"users_index = cPickle.load(open(\"users_index.pkl\", 'rb'))\n",

"items_index = cPickle.load(open(\"items_index.pkl\", 'rb'))\n",

"\n",

"n_users = len(users_index)\n",

"n_items = len(items_index)\n",

" \n",

"#用户-物品关系矩阵R\n",

"#scores = sio.mmread(\"scores\").todense()\n",

" \n",

"#倒排表\n",

"##每个用户打过分的电影\n",

"user_items = cPickle.load(open(\"user_items.pkl\", 'rb'))\n",

"##对每个电影打过分的事用户\n",

"item_users = cPickle.load(open(\"item_users.pkl\", 'rb'))"

]

},

{

"cell_type": "code",

"execution_count": 254,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user_id\n",

"

item_id\n",

"

rating\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

0\n",

"

7bdfc45af7e15511d150e2acb798cd5e4788abf5\n",

"

SOXBCZH12A67ADAD77\n",

"

1.529637\n",

"

\n",

"

\n",

"

1\n",

"

c405c586f6d7aadbbadfcba5393b543fd99372ff\n",

"

SOXFYTY127E9433E7D\n",

"

0.574713\n",

"

\n",

"

\n",

"

2\n",

"

625d0167edbc5df88e9fbebe3fcdd6b121a316bb\n",

"

SONOYIB12A81C1F88C\n",

"

0.167504\n",

"

\n",

"

\n",

"

3\n",

"

20ad98ab543da9ec41c6ac3b6354c5ab3ca6bc5e\n",

"

SOIMCDE12A6D4F8383\n",

"

0.233100\n",

"

\n",

"

\n",

"

4\n",

"

d331a8bf7d0ca9cb37e375496e6075603f6fb44a\n",

"

SONYKOW12AB01849C9\n",

"

4.145078\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user_id item_id rating\n",

"0 7bdfc45af7e15511d150e2acb798cd5e4788abf5 SOXBCZH12A67ADAD77 1.529637\n",

"1 c405c586f6d7aadbbadfcba5393b543fd99372ff SOXFYTY127E9433E7D 0.574713\n",

"2 625d0167edbc5df88e9fbebe3fcdd6b121a316bb SONOYIB12A81C1F88C 0.167504\n",

"3 20ad98ab543da9ec41c6ac3b6354c5ab3ca6bc5e SOIMCDE12A6D4F8383 0.233100\n",

"4 d331a8bf7d0ca9cb37e375496e6075603f6fb44a SONYKOW12AB01849C9 4.145078"

]

},

"execution_count": 254,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"#读取训练数据\n",

"dpath = './data/'\n",

"df_triplet = pd.read_csv(dpath +\"train.csv\")\n",

"df_triplet.head()"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"由于数据都是小数,在svd分解的时候会出现负无穷大的情况,这里将所有值转换成大于1的值"

]

},

{

"cell_type": "code",

"execution_count": 255,

"metadata": {},

"outputs": [],

"source": [

"def transfer(x):\n",

" if x<0.1:\n",

" return 1\n",

" if x > 0.9:\n",

" return 9\n",

" return x * 10"

]

},

{

"cell_type": "code",

"execution_count": 256,

"metadata": {},

"outputs": [],

"source": [

"df_triplet['rating'] = df_triplet.apply(lambda row: transfer(row['rating']), axis=1)"

]

},

{

"cell_type": "code",

"execution_count": 257,

"metadata": {},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user_id\n",

"

item_id\n",

"

rating\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

0\n",

"

7bdfc45af7e15511d150e2acb798cd5e4788abf5\n",

"

SOXBCZH12A67ADAD77\n",

"

9.000000\n",

"

\n",

"

\n",

"

1\n",

"

c405c586f6d7aadbbadfcba5393b543fd99372ff\n",

"

SOXFYTY127E9433E7D\n",

"

5.747126\n",

"

\n",

"

\n",

"

2\n",

"

625d0167edbc5df88e9fbebe3fcdd6b121a316bb\n",

"

SONOYIB12A81C1F88C\n",

"

1.675042\n",

"

\n",

"

\n",

"

3\n",

"

20ad98ab543da9ec41c6ac3b6354c5ab3ca6bc5e\n",

"

SOIMCDE12A6D4F8383\n",

"

2.331002\n",

"

\n",

"

\n",

"

4\n",

"

d331a8bf7d0ca9cb37e375496e6075603f6fb44a\n",

"

SONYKOW12AB01849C9\n",

"

9.000000\n",

"

\n",

"

\n",

"

5\n",

"

e58e9da7ba717f1bc6079626d5513530dd8f9b97\n",

"

SOOEEPE12A8AE459A4\n",

"

9.000000\n",

"

\n",

"

\n",

"

6\n",

"

2a3d80f37b92fa113d4f4b0785d797153cca5f63\n",

"

SOWCKVR12A8C142411\n",

"

9.000000\n",

"

\n",

"

\n",

"

7\n",

"

4cb4632e48cd8960dc113eae340adc402a0413cf\n",

"

SOVSGXX12A58A7F991\n",

"

9.000000\n",

"

\n",

"

\n",

"

8\n",

"

625d0167edbc5df88e9fbebe3fcdd6b121a316bb\n",

"

SOKLRPJ12A8C13C3FE\n",

"

9.000000\n",

"

\n",

"

\n",

"

9\n",

"

c3eee61e081ea89785c4fa4a4a0f29f9f5eb5829\n",

"

SOQWZAB12AB017C6F7\n",

"

5.347594\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user_id item_id rating\n",

"0 7bdfc45af7e15511d150e2acb798cd5e4788abf5 SOXBCZH12A67ADAD77 9.000000\n",

"1 c405c586f6d7aadbbadfcba5393b543fd99372ff SOXFYTY127E9433E7D 5.747126\n",

"2 625d0167edbc5df88e9fbebe3fcdd6b121a316bb SONOYIB12A81C1F88C 1.675042\n",

"3 20ad98ab543da9ec41c6ac3b6354c5ab3ca6bc5e SOIMCDE12A6D4F8383 2.331002\n",

"4 d331a8bf7d0ca9cb37e375496e6075603f6fb44a SONYKOW12AB01849C9 9.000000\n",

"5 e58e9da7ba717f1bc6079626d5513530dd8f9b97 SOOEEPE12A8AE459A4 9.000000\n",

"6 2a3d80f37b92fa113d4f4b0785d797153cca5f63 SOWCKVR12A8C142411 9.000000\n",

"7 4cb4632e48cd8960dc113eae340adc402a0413cf SOVSGXX12A58A7F991 9.000000\n",

"8 625d0167edbc5df88e9fbebe3fcdd6b121a316bb SOKLRPJ12A8C13C3FE 9.000000\n",

"9 c3eee61e081ea89785c4fa4a4a0f29f9f5eb5829 SOQWZAB12AB017C6F7 5.347594"

]

},

"execution_count": 257,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"df_triplet.head(10)"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 初始化模型参数"

]

},

{

"cell_type": "code",

"execution_count": 258,

"metadata": {},

"outputs": [],

"source": [

"#隐含变量的维数\n",

"K = 40\n",

"\n",

"#item和用户的偏置项\n",

"bi = np.zeros((n_items,1)) \n",

"bu = np.zeros((n_users,1)) \n",

"\n",

"#item和用户的隐含向量\n",

"qi = np.zeros((n_items,K)) \n",

"pu = np.zeros((n_users,K)) \n",

"\n",

"for uid in range(n_users): #对每个用户\n",

" pu[uid] = np.reshape(random((K,1))/10*(np.sqrt(K)),K)\n",

" \n",

"for iid in range(n_items): #对每个item\n",

" qi[iid] = np.reshape(random((K,1))/10*(np.sqrt(K)),K)\n",

"\n",

"#所有用户的平均打分\n",

"mu = df_triplet['rating'].mean() #average rating"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 根据当前参数,预测用户uid对Item(i_id)的打分"

]

},

{

"cell_type": "code",

"execution_count": 259,

"metadata": {},

"outputs": [],

"source": [

"def svd_pred(uid, iid): \n",

" \n",

" score = mu + bi[iid] + bu[uid] + np.sum(qi[iid]* pu[uid]) \n",

"\n",

" if math.isnan(score):\n",

" pdb.set_trace()\n",

" \n",

" #将打分范围控制在1-5之间\n",

" if score>9: \n",

" score = 9 \n",

" elif score<1: \n",

" score = 1 \n",

" \n",

" return score "

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 模型训练"

]

},
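{

"cell_type": "markdown",

"metadata": {},

"source": [

"For each training sample, with prediction residual $e_{ui} = r_{ui} - \\hat{r}_{ui}$, the stochastic gradient descent updates used below are\n",

"$b_u \\leftarrow b_u + \\gamma (e_{ui} - \\lambda b_u)$, $b_i \\leftarrow b_i + \\gamma (e_{ui} - \\lambda b_i)$, $q_i \\leftarrow q_i + \\gamma (e_{ui} p_u - \\lambda q_i)$, $p_u \\leftarrow p_u + \\gamma (e_{ui} q_i - \\lambda p_u)$\n",

"(the $p_u$ update uses the pre-update $q_i$), with learning rate $\\gamma$ decayed by a factor of 0.93 after each epoch and regularization coefficient $\\lambda$."

]

},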

{

"cell_type": "code",

"execution_count": 260,

"metadata": {

"scrolled": false

},

"outputs": [

{

"name": "stdout",

"output_type": "stream",

"text": [

"The 0-th step is running\n",

"the rmse of this step on train data is [2.70822462]\n",

"The 1-th step is running\n",

"the rmse of this step on train data is [2.05752694]\n",

"The 2-th step is running\n",

"the rmse of this step on train data is [1.5947205]\n",

"The 3-th step is running\n",

"the rmse of this step on train data is [1.31375827]\n",

"The 4-th step is running\n",

"the rmse of this step on train data is [1.16493074]\n",

"The 5-th step is running\n",

"the rmse of this step on train data is [1.07976222]\n",

"The 6-th step is running\n",

"the rmse of this step on train data is [1.02374809]\n",

"The 7-th step is running\n",

"the rmse of this step on train data is [0.989143]\n",

"The 8-th step is running\n",

"the rmse of this step on train data is [0.95947045]\n",

"The 9-th step is running\n",

"the rmse of this step on train data is [0.93794159]\n",

"The 10-th step is running\n",

"the rmse of this step on train data is [0.92409384]\n",

"The 11-th step is running\n",

"the rmse of this step on train data is [0.91129497]\n",

"The 12-th step is running\n",

"the rmse of this step on train data is [0.89974171]\n",

"The 13-th step is running\n",

"the rmse of this step on train data is [0.8910503]\n",

"The 14-th step is running\n",

"the rmse of this step on train data is [0.88163238]\n",

"The 15-th step is running\n",

"the rmse of this step on train data is [0.87544381]\n",

"The 16-th step is running\n",

"the rmse of this step on train data is [0.86850341]\n",

"The 17-th step is running\n",

"the rmse of this step on train data is [0.86426275]\n",

"The 18-th step is running\n",

"the rmse of this step on train data is [0.85997099]\n",

"The 19-th step is running\n",

"the rmse of this step on train data is [0.85454328]\n"

]

}

],

"source": [

"#gamma:为学习率\n",

"#Lambda:正则参数\n",

"#steps:迭代次数\n",

"\n",

"steps=20\n",

"gamma=0.04\n",

"Lambda=0.15\n",

"\n",

"#总的打分记录数目\n",

"n_records = df_triplet.shape[0]\n",

"\n",

"for step in range(steps): \n",

" print ('The ' + str(step) + '-th step is running' )\n",

" rmse_sum=0.0 \n",

" \n",

" #将训练样本打散顺序\n",

" kk = np.random.permutation(n_records) \n",

" for j in range(n_records): \n",

" #每次一个训练样本\n",

" line = kk[j] \n",

" \n",

" uid = users_index [df_triplet.iloc[line]['user_id']]\n",

" iid = items_index [df_triplet.iloc[line]['item_id']]\n",

" \n",

" rating = df_triplet.iloc[line]['rating']\n",

" \n",

" #预测残差\n",

" eui = rating - svd_pred(uid, iid) \n",

" \n",

" #残差平方和 \n",

" rmse_sum += eui**2 \n",

" \n",

" #随机梯度下降,更新\n",

" bu[uid] += gamma * (eui - Lambda * bu[uid]) \n",

" bi[iid] += gamma * (eui - Lambda * bi[iid]) \n",

" \n",

" temp = qi[iid] \n",

" qi[iid] += gamma * (eui* pu[uid]- Lambda*qi[iid] ) \n",

" pu[uid] += gamma * (eui* temp - Lambda*pu[uid]) \n",

" \n",

" #学习率递减\n",

" gamma=gamma*0.93 \n",

" print (\"the rmse of this step on train data is \",np.sqrt(rmse_sum/n_records)) "

]

},
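{

"cell_type": "markdown",

"metadata": {},

"source": [

"With the prediction residual $e_{ui} = r_{ui} - \\hat{r}_{ui}$, each SGD step above performs the updates\n",

"$$b_u \\leftarrow b_u + \\gamma (e_{ui} - \\lambda b_u), \\qquad b_i \\leftarrow b_i + \\gamma (e_{ui} - \\lambda b_i),$$\n",

"$$q_i \\leftarrow q_i + \\gamma (e_{ui} p_u - \\lambda q_i), \\qquad p_u \\leftarrow p_u + \\gamma (e_{ui} q_i - \\lambda p_u),$$\n",

"i.e. gradient descent on the squared error with $L_2$ regularization; the learning rate $\\gamma$ is multiplied by 0.93 after each epoch."

]

},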

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 保存模型"

]

},

{

"cell_type": "code",

"execution_count": 261,

"metadata": {},

"outputs": [],

"source": [

"# A method for saving object data to JSON file\n",

"def save_json(filepath):\n",

" dict_ = {}\n",

" dict_['mu'] = mu\n",

" dict_['K'] = K\n",

" \n",

" dict_['bi'] = bi.tolist()\n",

" dict_['bu'] = bu.tolist()\n",

" \n",

" dict_['qi'] = qi.tolist()\n",

" dict_['pu'] = pu.tolist()\n",

"\n",

" # Creat json and save to file\n",

" json_txt = json.dumps(dict_)\n",

" with open(filepath, 'w') as file:\n",

" file.write(json_txt)"

]

},

{

"cell_type": "code",

"execution_count": 262,

"metadata": {},

"outputs": [],

"source": [

"# A method for loading data from JSON file\n",

"def load_json(filepath):\n",

" with open(filepath, 'r') as file:\n",

" dict_ = json.load(file)\n",

"\n",

" mu = dict_['mu']\n",

" K = dict_['K']\n",

"\n",

" bi = np.asarray(dict_['bi'])\n",

" bu = np.asarray(dict_['bu'])\n",

" \n",

" qi = np.asarray(dict_['qi'])\n",

" pu = np.asarray(dict_['pu'])"

]

},

{

"cell_type": "code",

"execution_count": 263,

"metadata": {},

"outputs": [],

"source": [

"save_json('svd_model.json')\n",

"load_json('svd_model.json')"

]

},
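{

"cell_type": "markdown",

"metadata": {},

"source": [

"JSON requires the `tolist()` / `np.asarray` round-trip above. A more compact alternative is NumPy's own binary format; a minimal sketch (the `.npz` filename is hypothetical):\n",

"```python\n",

"import numpy as np\n",

"\n",

"# save all parameters into one binary archive\n",

"np.savez('svd_model.npz', mu=mu, K=K, bi=bi, bu=bu, qi=qi, pu=pu)\n",

"\n",

"# reload them; scalars come back as 0-d arrays, so cast explicitly\n",

"params = np.load('svd_model.npz')\n",

"bi, bu, qi, pu = params['bi'], params['bu'], params['qi'], params['pu']\n",

"mu, K = float(params['mu']), int(params['K'])\n",

"```"

]

},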

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 测试"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 对给定用户,推荐物品/计算打分"

]

},

{

"cell_type": "code",

"execution_count": 275,

"metadata": {},

"outputs": [],

"source": [

"#user:用户\n",

"#返回推荐items及其打分(DataFrame)\n",

"def svd_CF_recommend(user):\n",

" cur_user_id = users_index[user]\n",

" \n",

" #训练集中该用户打过分的item\n",

" cur_user_items = user_items[cur_user_id]\n",

"\n",

" #该用户对所有item的打分\n",

" user_items_scores = np.zeros(n_items)\n",

"\n",

" #预测打分\n",

" for i in range(n_items): # all items \n",

" if i not in cur_user_items: #训练集中没打过分\n",

" user_items_scores[i] = svd_pred(cur_user_id, i) #预测打分\n",

" \n",

" \n",

" #推荐\n",

" #Sort the indices of user_item_scores based upon their value,Also maintain the corresponding score\n",

" sort_index = sorted(((e,i) for i,e in enumerate(list(user_items_scores))), reverse=True)\n",

" \n",

" #Create a dataframe from the following\n",

" columns = ['item_id', 'score']\n",

" df = pd.DataFrame(columns=columns)\n",

" \n",

" #Fill the dataframe with top 20 (n_rec_items) item based recommendations\n",

" #sort_index = sort_index[0:n_rec_items]\n",

" #Fill the dataframe with all items based recommendations\n",

" for i in range(0,len(sort_index)):\n",

" cur_item_index = sort_index[i][1] \n",

" cur_item = list (items_index.keys()) [list (items_index.values()).index (cur_item_index)]\n",

" \n",

" if ~np.isnan(sort_index[i][0]) and cur_item_index not in cur_user_items:\n",

" df.loc[len(df)]=[cur_item, sort_index[i][0]]\n",

" \n",

" \n",

" return df"

]

},
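{

"cell_type": "markdown",

"metadata": {},

"source": [

"The per-item loop in `svd_CF_recommend` calls `svd_pred` once per item. As a sketch (assuming the same global `mu`, `bu`, `bi`, `qi`, `pu` arrays), the scoring step can instead be done in one vectorized expression, which is considerably faster even at 800 items:\n",

"```python\n",

"import numpy as np\n",

"\n",

"def svd_pred_all(uid):\n",

"    # predict this user's score for every item at once:\n",

"    # mu + b_i + b_u + Q p_u, clipped to the rating range [1, 9]\n",

"    scores = mu + bi.ravel() + bu[uid] + qi @ pu[uid]\n",

"    return np.clip(scores, 1, 9)\n",

"```"

]

},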

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 读取测试数据"

]

},

{

"cell_type": "code",

"execution_count": 276,

"metadata": {

"scrolled": true

},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user_id\n",

"

item_id\n",

"

rating\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

0\n",

"

3325fe1d8da7b13dd42004ede8011ce3d7cd205d\n",

"

SOURVJI12A58A7F353\n",

"

3.637413\n",

"

\n",

"

\n",

"

1\n",

"

e82b3380f770c78f8f067f464941057c798eaca2\n",

"

SOKNWRZ12A8C13BF62\n",

"

5.205479\n",

"

\n",

"

\n",

"

2\n",

"

bdfca47d03157d26f1404075172128a6f8a3d39e\n",

"

SOMNGMO12A6702187E\n",

"

0.790514\n",

"

\n",

"

\n",

"

3\n",

"

7ffc14a55b6256c9fa73fc5c5761d210deb7f738\n",

"

SOGTQNI12AB0184A5C\n",

"

0.079428\n",

"

\n",

"

\n",

"

4\n",

"

083a2a59603a605275107c00812a811526c2a0af\n",

"

SOXZOMB12AB017DA15\n",

"

0.370028\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user_id item_id rating\n",

"0 3325fe1d8da7b13dd42004ede8011ce3d7cd205d SOURVJI12A58A7F353 3.637413\n",

"1 e82b3380f770c78f8f067f464941057c798eaca2 SOKNWRZ12A8C13BF62 5.205479\n",

"2 bdfca47d03157d26f1404075172128a6f8a3d39e SOMNGMO12A6702187E 0.790514\n",

"3 7ffc14a55b6256c9fa73fc5c5761d210deb7f738 SOGTQNI12AB0184A5C 0.079428\n",

"4 083a2a59603a605275107c00812a811526c2a0af SOXZOMB12AB017DA15 0.370028"

]

},

"execution_count": 276,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"#读取测试数据\n",

"dpath = './data/'\n",

"df_triplet_test = pd.read_csv(dpath +\"test.csv\")\n",

"df_triplet_test.head()"

]

},

{

"cell_type": "code",

"execution_count": 277,

"metadata": {},

"outputs": [],

"source": [

"df_triplet_test['rating'] = df_triplet_test.apply(lambda row: transfer(row['rating']), axis=1)"

]

},

{

"cell_type": "code",

"execution_count": 278,

"metadata": {},

"outputs": [

{

"data": {

"text/html": [

"

\n",

"

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

"

" \n",

"

\n",

"

\n",

"

user_id\n",

"

item_id\n",

"

rating\n",

"

\n",

"

\n",

"

\n",

"

\n",

"

0\n",

"

3325fe1d8da7b13dd42004ede8011ce3d7cd205d\n",

"

SOURVJI12A58A7F353\n",

"

9.000000\n",

"

\n",

"

\n",

"

1\n",

"

e82b3380f770c78f8f067f464941057c798eaca2\n",

"

SOKNWRZ12A8C13BF62\n",

"

9.000000\n",

"

\n",

"

\n",

"

2\n",

"

bdfca47d03157d26f1404075172128a6f8a3d39e\n",

"

SOMNGMO12A6702187E\n",

"

7.905138\n",

"

\n",

"

\n",

"

3\n",

"

7ffc14a55b6256c9fa73fc5c5761d210deb7f738\n",

"

SOGTQNI12AB0184A5C\n",

"

1.000000\n",

"

\n",

"

\n",

"

4\n",

"

083a2a59603a605275107c00812a811526c2a0af\n",

"

SOXZOMB12AB017DA15\n",

"

3.700278\n",

"

\n",

"

\n",

"

\n",

"

"

],

"text/plain": [

" user_id item_id rating\n",

"0 3325fe1d8da7b13dd42004ede8011ce3d7cd205d SOURVJI12A58A7F353 9.000000\n",

"1 e82b3380f770c78f8f067f464941057c798eaca2 SOKNWRZ12A8C13BF62 9.000000\n",

"2 bdfca47d03157d26f1404075172128a6f8a3d39e SOMNGMO12A6702187E 7.905138\n",

"3 7ffc14a55b6256c9fa73fc5c5761d210deb7f738 SOGTQNI12AB0184A5C 1.000000\n",

"4 083a2a59603a605275107c00812a811526c2a0af SOXZOMB12AB017DA15 3.700278"

]

},

"execution_count": 278,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"df_triplet_test.head()"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 测试,并计算评价指标\n",

"PR、覆盖度、RMSE"

]

},

{

"cell_type": "code",

"execution_count": 279,

"metadata": {},

"outputs": [

{

"name": "stdout",

"output_type": "stream",

"text": [

"67c5b5b1982902d15badd8ce0c18b3278ec4bfc0 is a new user.\n",

"\n",

"62420be0fd0df5ab0eb4cba35a4bc7cb3e3b506a is a new user.\n",

"\n",

"3ab78e39bddeaeb789edad041fff03050077417c is a new user.\n",

"\n"

]

}

],

"source": [

"#统计总的用户\n",

"unique_users_test = df_triplet_test['user_id'].unique()\n",

"\n",

"#为每个用户推荐的item的数目\n",

"n_rec_items = 10\n",

"\n",

"#性能评价参数初始化,用户计算Percison和Recall\n",

"n_hits = 0\n",

"n_total_rec_items = 0\n",

"n_test_items = 0\n",

"\n",

"#所有被推荐商品的集合(对不同用户),用于计算覆盖度\n",

"all_rec_items = set()\n",

"\n",

"#残差平方和,用与计算RMSE\n",

"rss_test = 0.0\n",

"\n",

"#对每个测试用户\n",

"for user in unique_users_test:\n",

" #测试集中该用户打过分的电影(用于计算评价指标的真实值)\n",

" if user not in users_index: #user在训练集中没有出现过,新用户不能用协同过滤\n",

" print(str(user) + ' is a new user.\\n')\n",

" continue\n",

" \n",

" user_records_test= df_triplet_test[df_triplet_test.user_id == user]\n",

" \n",

" #对每个测试用户,计算该用户对训练集中未出现过的商品的打分,并基于该打分进行推荐(top n_rec_items)\n",

" #返回结果为DataFrame\n",

" \n",

" rec_items = svd_CF_recommend(user)\n",

" for i in range(n_rec_items):\n",

" item = rec_items.iloc[i]['item_id']\n",

" \n",

" if item in user_records_test['item_id'].values:\n",

" n_hits += 1\n",

" all_rec_items.add(item)\n",

" \n",

" #计算rmse\n",

" for i in range(user_records_test.shape[0]):\n",

" item = user_records_test.iloc[i]['item_id']\n",

" score = user_records_test.iloc[i]['rating']\n",

" \n",

" df1 = rec_items[rec_items.item_id == item]\n",

" if(df1.shape[0] == 0): #item不在推荐列表中,可能是新item在训练集中没有出现过,或者该用户已经打过分新item不能被协同过滤推荐\n",

" print(str(item) + ' is a new item or user ' + str(user) +' already rated it.\\n')\n",

" continue\n",

" pred_score = df1['score'].values[0]\n",

" rss_test += (pred_score - score)**2 #残差平方和\n",

" \n",

" #推荐的item总数\n",

" n_total_rec_items += n_rec_items\n",

" \n",

" #真实item的总数\n",

" n_test_items += user_records_test.shape[0]\n",

"\n",

"#Precision & Recall\n",

"precision = n_hits / (1.0*n_total_rec_items)\n",

"recall = n_hits / (1.0*n_test_items)\n",

"\n",

"#覆盖度:推荐商品占总需要推荐商品的比例\n",

"coverage = len(all_rec_items) / (1.0* n_items)\n",

"\n",

"#打分的均方误差\n",

"rmse=np.sqrt(rss_test / df_triplet_test.shape[0]) "

]

},
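{

"cell_type": "markdown",

"metadata": {},

"source": [

"For reference, with $R(u)$ the Top-$N$ list recommended to user $u$, $T(u)$ the user's test items, and $I$ the full item set, the metrics computed above are\n",

"$$\\text{Precision} = \\frac{\\sum_u |R(u) \\cap T(u)|}{\\sum_u |R(u)|}, \\qquad \\text{Recall} = \\frac{\\sum_u |R(u) \\cap T(u)|}{\\sum_u |T(u)|}, \\qquad \\text{Coverage} = \\frac{|\\bigcup_u R(u)|}{|I|}.$$"

]

},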

{

"cell_type": "code",

"execution_count": 280,

"metadata": {},

"outputs": [

{

"data": {

"text/plain": [

"0.031673582295988933"

]

},

"execution_count": 280,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"precision"

]

},

{

"cell_type": "code",

"execution_count": 281,

"metadata": {},

"outputs": [

{

"data": {

"text/plain": [

"0.030529262764964673"

]

},

"execution_count": 281,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"recall"

]

},

{

"cell_type": "code",

"execution_count": 282,

"metadata": {},

"outputs": [

{

"data": {

"text/plain": [

"0.82125"

]

},

"execution_count": 282,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"coverage"

]

},

{

"cell_type": "code",

"execution_count": 283,

"metadata": {},

"outputs": [

{

"data": {

"text/plain": [

"2.595602697148551"

]

},

"execution_count": 283,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"rmse"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### 总结"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"基于矩阵分解的协同过滤在准确度,召回率,覆盖率上效果比基于人物和物品的协同过滤效果要好,均方误差也要低。"

]

}

],

"metadata": {

"kernelspec": {

"display_name": "Python 3",

"language": "python",

"name": "python3"

},

"language_info": {

"codemirror_mode": {

"name": "ipython",

"version": 3

},

"file_extension": ".py",

"mimetype": "text/x-python",

"name": "python",

"nbconvert_exporter": "python",

"pygments_lexer": "ipython3",

"version": "3.7.3"

}

},

"nbformat": 4,

"nbformat_minor": 2

}
