Original article: http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/

One day, a bunch of friends, who happened to be big Family Guy fans, decided to put together a site to rank and share their thoughts on the show. Soon thereafter they had a Rails site up and running, and all was well, and other fans joined in hordes. A web 2.0 success! Then one day they realized that they could no longer track everyone's ratings, their user-base was too large, and so it occurred to one of the developers: "Wouldn't it be cool if we could use the collective knowledge of our whole community to recommend and rank episodes for each user individually?"

Sounds familiar, right? In fact, recommendation systems are a billion-dollar industry, and growing. In academic jargon this problem is known as Collaborative Filtering, and a lot of ink has been spilled on the matter. Netflix, for one, announced a 1 million dollar competition last year for a system that beats their algorithm by 10%. It goes without saying that a lot of different systems have been proposed and explored in theory and practice. However, one of the most successful and widely used approaches to this day also happens to be one of the simplest: Singular Value Decomposition (SVD), also affectionately referred to in the literature as LSI (Latent Semantic Indexing), dimensionality reduction, or projection.

Linear Algebra Refresher

SVD methods are a direct consequence of a theorem in linear algebra:

Any MxN matrix A whose number of rows M is greater than or equal to its number of columns N can be written as the product of an MxM column-orthogonal matrix U, an MxN diagonal matrix W with positive or zero elements (the singular values), and the transpose of an NxN orthogonal matrix V.
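Written out as a formula, the decomposition in the theorem is:

$$ A_{M \times N} = U_{M \times M} \; W_{M \times N} \; V_{N \times N}^{T} $$

where U and V are orthogonal and W is zero everywhere except its diagonal, which holds the singular values.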

More intuitively, assume that we have a matrix where every column represents a user and every row represents a product (or a Family Guy season, in our case). Thus, with N users and M products, we are looking at an MxN matrix. The theorem simply states that we can decompose such a matrix into three components: (MxM) call it U, (MxN) call it S, and (NxN) call it V. More importantly, we can use this decomposition to approximate the original MxN matrix: by keeping only the first k singular values (the largest diagonal entries of S) and the corresponding columns of U and V, we effectively obtain a compressed representation of the data. So why do we care? (Mathies click here, we'll wait.)
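Concretely, the rank-k approximation keeps only the first k columns of U and V and the leading k singular values:

$$ A \;\approx\; A_k \;=\; U_k \, S_k \, V_k^{T} $$

where $U_k$ is $M \times k$, $S_k$ is the leading $k \times k$ diagonal block of S, and $V_k$ is $N \times k$.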

Machine Learning & Information Retrieval

One of the most fundamental, and fun properties of Machine Learning is its close correlation to the concept of data compression - if we can identify significant concepts (clusters of users, for example) then we can represent a large dataset with fewer bits. However, this logic also works in reverse! If we can represent our data with fewer bits (compress our data), then we have identified 'significant' concepts! I bet you see where we're headed - SVD's allow us to compress a large matrix by approximating it in a smaller-dimensional space.

SVD's found wide application in the field of Information Retrieval (IR) where this process is often referred to as Latent Semantic Indexing (LSI). In these applications the columns of the matrix are the documents, and the rows are the individual words. Running SVD allows us to collapse this matrix into a smaller-dimensional space where highly correlated items (for example, words that often occur together) are captured as a single feature. Essentially, we are discarding the noise, and keeping the signal. In practice, the IR guys usually collapse their ginormous matrices to 100, 200, or 300 dimensions (from original 10000+) and then perform similarity calculations. In case you're curious, this same method has also found many uses in image compression and computer vision applications.

Dimensionality Reduction

Back to our Family Guy developers. For the sake of brevity, we will use a very simple example with only 4 users and 6 seasons (the Season x User ratings matrix is the one you will see in the code below). Cranking this matrix through the SVD yields three components: matrix U (6x6), matrix S (6x4), and matrix V (4x4). Now we will collapse this data from a (6x4) space into a 2-dimensional one. To do this, we simply take the first two columns of U and V, and the leading 2x2 block of S. The end result:
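In terms of shapes, the rank-2 result is:

$$ A_{6 \times 4} \;\approx\; U_{6 \times 2} \, S_{2 \times 2} \, V_{4 \times 2}^{T} $$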

Now, because we are working with a 2-dimensional space, we can plot our results (below). We can treat the first column of U as x and the second column as y - these points are the seasons. The same process is repeated for matrix V - those points are the users.

Do you see what happened? Because we are working with a small example, it's hard to call two users a 'cluster', but you will nonetheless notice that Ben and Fred are located very close to each other - now compare their respective ratings in our original matrix. Very cool, huh? The same pattern recurs for Seasons 5 and 6. Our dimensionality reduction technique effectively captured the fact that Ben and Fred seem to have similar taste - we're halfway there!

Finding Similar Users

Next, Bob joins the site and shares with us a few of his season ratings ([5,5,0,0,0,5] for seasons 1-6) - it's our goal to give him a recommendation based on this data. Intuitively, we want to find users similar to Bob, thus if we can 'embed' Bob into our 2-Dimensional space and look where he is located, we will be able to answer this question. To do this, we perform the following calculation:
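The projection used here is the standard LSI 'fold-in' (this is exactly what the bobEmbed line in the code below computes):

$$ \text{Bob}_{2D} \;=\; \text{Bob} \cdot U_2 \cdot S_2^{-1} $$

which maps Bob's 6-element rating vector onto the same two dimensions as the existing users.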

The formula above is the general way to project a new user into our space - I won't motivate the math behind it, but if you're interested, check the document I referenced in the Linear Algebra Refresher section. The important result is that we now have x and y coordinates for Bob. Let's add them to our earlier graph:

The green triangle represents Bob. It's not immediately evident which user is closest, but if we extend Bob's vector from the origin (the green line), we can see that Ben's and Fred's vectors point in very similar directions. A common way to judge the similarity of two vectors is to look at the angle between them: cosine similarity. From our graph we can intuitively tell that the angle between Ben and Bob is smaller than the one between Fred and Bob. To confirm this, let's iterate over all users and compute their cosine similarity to Bob. Furthermore, let's discard anyone whose similarity is less than 0.90 (outside of the shaded region). We get: Ben (0.987), Fred (0.955). Hence, we conclude that Ben and Bob have the most similar tastes, though Fred is pretty close too!
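For reference, the cosine similarity between two vectors a and b - the quantity the similarity loop in the code below computes for Bob against each user - is:

$$ \cos(\theta) \;=\; \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} $$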

What happens now is up to you. Here is one very simple strategy: find the most similar user, take the items that user has rated but the new user has not, and return them in decreasing order of that user's ratings. Thus, Ben rated every season except 4, and Bob rated seasons 1, 2 and 6. We take the set difference ([1,2,3,5,6] - [1,2,6] = [3,5]), which gives the seasons Ben rated but Bob hasn't seen, and return them in decreasing order of Ben's ratings: Season 5 (5 stars), Season 3 (3 stars).
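As a minimal Ruby sketch of that set difference (the arrays here are hypothetical stand-ins; the full version working off the ratings matrix is in the code below):

ben_rated = [1, 2, 3, 5, 6]        # seasons Ben has rated (everything but 4)
bob_rated = [1, 2, 6]              # seasons Bob has rated so far
unseen    = ben_rated - bob_rated  # => [3, 5], returned sorted by Ben's ratings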

Will you just give me the code already?

For the brave ones who made it this far, below is the equivalent of what we just did on paper... in Ruby. First, install the linalg library, and you're ready to roll:

require 'linalg'

users = { 1 => "Ben", 2 => "Tom", 3 => "John", 4 => "Fred" }

m = Linalg::DMatrix[
      # Ben, Tom, John, Fred
      [5,5,0,5], # season 1
      [5,0,3,4], # season 2
      [3,4,0,3], # season 3
      [0,0,5,3], # season 4
      [5,4,4,5], # season 5
      [5,4,5,5]  # season 6
    ]

# Compute the SVD decomposition
u, s, vt = m.singular_value_decomposition
vt = vt.transpose

# Take the rank-2 approximation of the matrix
#   - Take the first and second columns of u  (6x2)
#   - Take the first and second columns of vt (4x2)
#   - Take the first two singular values      (2x2)
u2 = Linalg::DMatrix.join_columns [u.column(0), u.column(1)]
v2 = Linalg::DMatrix.join_columns [vt.column(0), vt.column(1)]
eig2 = Linalg::DMatrix.columns [s.column(0).to_a.flatten[0,2], s.column(1).to_a.flatten[0,2]]

# Here comes Bob, our new user
bob = Linalg::DMatrix[[5,5,0,0,0,5]]
bobEmbed = bob * u2 * eig2.inverse

# Compute the cosine similarity between Bob and every other user in our 2-D space
user_sim, count = {}, 1
v2.rows.each { |x|
  user_sim[count] = (bobEmbed.transpose.dot(x.transpose)) / (x.norm * bobEmbed.norm)
  count += 1
}

# Remove all users who fall below the 0.90 cosine similarity cutoff and sort by similarity
similar_users = user_sim.delete_if { |k, sim| sim < 0.9 }.sort { |a, b| b[1] <=> a[1] }
similar_users.each { |u| printf "%s (ID: %d, Similarity: %0.3f)\n", users[u[0]], u[0], u[1] }

# We'll use a simple strategy in this case:
#   1) Select the most similar user
#   2) Compare that user's rated items against Bob's and keep the items Bob has not yet rated
#   3) Return those items in decreasing order of the similar user's ratings
similarUsersItems = m.column(similar_users[0][0] - 1).transpose.to_a.flatten
myItems = bob.transpose.to_a.flatten

not_seen_yet = {}
myItems.each_index { |i|
  not_seen_yet[i+1] = similarUsersItems[i] if myItems[i] == 0 and similarUsersItems[i] != 0
}

printf "\n%s recommends:\n", users[similar_users[0][0]]
not_seen_yet.sort { |a, b| b[1] <=> a[1] }.each { |item|
  printf "\tSeason %d .. I gave it a rating of %d\n", item[0], item[1]
}
print "We've seen all the same seasons, bugger!" if not_seen_yet.size == 0

svd-recommender-gsl.rb - Ruby/GSL version, courtesy of Joshua Bassett

Running our algorithm produces:

Ben (ID: 1, Similarity: 0.987)
Fred (ID: 4, Similarity: 0.955)

Ben recommends:
	Season 5 .. I gave it a rating of 5
	Season 3 .. I gave it a rating of 3

That's it! A 50-line SVD recommendation / collaborative filtering system for a Rails app, with the help of some simple linear algebra.

Reposted from: https://www.cnblogs.com/taylorwesley/archive/2013/04/20/3031984.html
