Deal with relational data using libFM with blocks
Original post: https://thierrysilbermann.wordpress.com/2015/09/17/deal-with-relational-data-using-libfm-with-blocks/
An answer for this question: [Example] Files for Block Structure
There is a quick explanation in the README doc here: libFM 1.42 Manual, in case you don't want to read this whole blog post.
I'll reuse the toy dataset from this previous blog post. Look at it to understand what the features mean.
train.libfm
5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
and test.libfm
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1
And I'll merge them into one file, which makes the whole process easier:
dataset.libfm
5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1
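Each of these lines uses libFM's sparse format: a target value followed by index:value pairs. As a quick illustration, here is a minimal parser for one such line (my own helper, not part of libFM):

```python
def parse_libfm_line(line):
    # 'target idx:val idx:val ...' -> (target, {idx: val})
    target, *pairs = line.split()
    features = {int(p.split(':')[0]): float(p.split(':')[1]) for p in pairs}
    return float(target), features

print(parse_libfm_line('5 0:1 2:1 6:1 9:12.5'))
# (5.0, {0: 1.0, 2: 1.0, 6: 1.0, 9: 12.5})
```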
Now suppose we want to use the block structure. We first need these five files:
- rel_user.libfm (features 0-1 and 6-8 are user features)
0 0:1 6:1
0 1:1 8:1
In fact you can avoid having the feature ids broken up like that (0-1, 6-8) by recompressing them, so 0-1 -> 0-1 and 6-8 -> 2-4:
0 0:1 2:1
0 1:1 4:1
- rel_product.libfm (features 2-5 and 9 are product features). Same thing, we can compress from:
0 2:1 9:12.5
0 3:1 9:20
0 4:1 9:78
0 5:1
to
0 0:1 4:12.5
0 1:1 4:20
0 2:1 4:78
0 3:1
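This recompression is just a dense renumbering of the feature ids that actually occur in the block. A sketch of how it could be done (my own helper, assuming the usual index:value tokens):

```python
def compress_features(lines):
    # Collect every feature id used in the block, then renumber them
    # densely in sorted order, so ids become 0..n-1
    ids = sorted({int(pair.split(':')[0])
                  for line in lines for pair in line.split()[1:]})
    mapping = {old: new for new, old in enumerate(ids)}
    out = []
    for line in lines:
        target, *pairs = line.split()
        new_pairs = ['{0}:{1}'.format(mapping[int(p.split(":")[0])],
                                      p.split(':', 1)[1])
                     for p in pairs]
        out.append(' '.join([target] + new_pairs))
    return out

print(compress_features(['0 2:1 9:12.5', '0 3:1 9:20', '0 4:1 9:78', '0 5:1']))
# ['0 0:1 4:12.5', '0 1:1 4:20', '0 2:1 4:78', '0 3:1']
```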
- rel_user.train (which is now the mapping: the first three lines are all 0, meaning the first three training rows use row 0 of rel_user.libfm | /!\ we are using 0-based indexing)
0
0
0
1
1
- rel_product.train (the mapping for the product block)
0
1
2
0
1
- y.train, which contains only the ratings
5
5
4
1
1
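To build these mapping files programmatically, the core operation is deduplicating the block rows and recording, for each original line, which unique row it points to. A sketch on the toy data above (hypothetical helper; the user part of each line is assumed already recompressed to ids 0-4):

```python
def build_block(rows):
    # Deduplicate rows in first-seen order and record, for each input
    # row, the index of its unique representative
    seen = {}
    uniques, index = [], []
    for row in rows:
        if row not in seen:
            seen[row] = len(seen)
            uniques.append(row)
        index.append(seen[row])
    return uniques, index

# user part of each line of dataset.libfm (train rows first, then test rows)
user_rows = ['0:1 2:1', '0:1 2:1', '0:1 2:1', '1:1 4:1', '1:1 4:1',
             '1:1 4:1', '0:1 2:1']
uniques, index = build_block(user_rows)
print(uniques)    # rows of rel_user.libfm (write each with a dummy 0 target)
print(index[:5])  # rel_user.train -> [0, 0, 0, 1, 1]
print(index[5:])  # rel_user.test  -> [1, 0]
```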
Almost done…
Now you need to create the .x and .xt files for the user block and the product block. For this you need the convert and transpose tools that end up in bin/ after you compile libFM.
./bin/convert --ifile rel_user.libfm --ofilex rel_user.x --ofiley rel_user.y
You are forced to use the --ofiley flag even though rel_user.y will never be used; you can delete it every time.
and then
./bin/transpose --ifile rel_user.x --ofile rel_user.xt
Now do the same thing for the product block. For the test set, since we merged the train and test datasets at the beginning, we only need to generate rel_user.test, rel_product.test and y.test.
At this point, you will have a lot of files: rel_user.train, rel_user.test, rel_user.x, rel_user.xt, rel_product.train, rel_product.test, rel_product.x, rel_product.xt, y.train, y.test
And run the whole thing:
./bin/libFM -task r -train y.train -test y.test --relation rel_user,rel_product -out output
It’s a bit overkill for this problem but I hope you get the point.
Now a real example
For this example, I’ll use the ml-1m.zip MovieLens dataset that you can get from here (1 million ratings)
ratings.dat (sample) / Format: UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
movies.dat (sample) / Format: MovieID::Title::Genres
1::Toy Story (1995)::Animation|Children’s|Comedy
2::Jumanji (1995)::Adventure|Children’s|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
users.dat (sample) / Format: UserID::Gender::Age::Occupation::Zip-code
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
I'll create three different models:
- The easiest libFM file to train, without blocks, using the features UserID and MovieID.
- A regular libFM file to train, without blocks, using the features UserID, MovieID, Gender, Age, Occupation and Genre (of the movie).
- libFM files to train with blocks, using the same features: UserID, MovieID, Gender, Age, Occupation and Genre (of the movie).
Model 1 and 2 can be created using the following code:
# -*- coding: utf-8 -*-
__author__ = 'Silbermann Thierry'
__license__ = 'WTFPL'

import numpy as np
import pandas as pd

def create_libfm(w_filename, model_lvl=1):
    # Load the data
    data_ratings = pd.read_csv('ratings.dat', delimiter='::', engine='python',
                               names=['UserID', 'MovieID', 'Ratings', 'Timestamp'])
    data_movies = pd.read_csv('movies.dat', delimiter='::', engine='python',
                              names=['MovieID', 'Name', 'Genre_list'])
    data_users = pd.read_csv('users.dat', delimiter='::', engine='python',
                             names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'])

    # Merge everything into one frame so that user and movie attributes
    # are aligned with each rating row
    data = data_ratings.merge(data_users, on='UserID', how='left') \
                       .merge(data_movies, on='MovieID', how='left')
    ratings = data['Ratings']
    print('Data loaded')

    # Map every feature value to a global libFM column index
    offset_array = [0]
    dict_array = []
    feat = ['UserID', 'MovieID']
    if model_lvl > 1:
        # NB: 'Genre_list' is kept as one categorical value
        # (the full 'A|B|C' string)
        feat.extend(['Gender', 'Age', 'Occupation', 'Genre_list'])
    for feature_name in feat:
        uniq = np.unique(data[feature_name])
        offset_array.append(len(uniq) + offset_array[-1])
        dict_array.append({key: value + offset_array[-2]
                           for value, key in enumerate(uniq)})
    print('Mapping done')

    # Write the libFM file: '<rating> <col>:1 <col>:1 ...'
    with open(w_filename, 'w') as w:
        for i in range(data.shape[0]):
            s = '{0}'.format(ratings[i])
            for index_feat, feature_name in enumerate(feat):
                value = data[feature_name][i]
                if value in dict_array[index_feat]:
                    s += ' {0}:1'.format(dict_array[index_feat][value])
            w.write(s + '\n')

if __name__ == '__main__':
    create_libfm('model1.libfm', 1)
    create_libfm('model2.libfm', 2)
So you end up with the files model1.libfm and model2.libfm. You just need to split each of them in two to create a training and a test set file, which I'll call train_m1.libfm and test_m1.libfm (and likewise for model 2: train_m2.libfm, test_m2.libfm).
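If you don't already have a splitting script, a minimal sketch (random 80/20 split; the filenames follow the text above):

```python
import random

def split_libfm(filename, train_name, test_name, test_ratio=0.2, seed=42):
    # Send each rating line to either the train or the test file
    rng = random.Random(seed)
    with open(filename) as src, \
            open(train_name, 'w') as train, open(test_name, 'w') as test:
        for line in src:
            (test if rng.random() < test_ratio else train).write(line)

# e.g. split_libfm('model1.libfm', 'train_m1.libfm', 'test_m1.libfm')
```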
Then you just run libFM like this:
./libFM -train train_m1.libfm -test test_m1.libfm -task r -iter 20 -method mcmc -dim '1,1,8' -out output_m1
But I guess you already know how to do those.
Now the interesting part, using blocks.
[TODO]
Reposted at: https://www.cnblogs.com/zhizhan/p/5099751.html