Deal with relational data using libFM with blocks
Original post: https://thierrysilbermann.wordpress.com/2015/09/17/deal-with-relational-data-using-libfm-with-blocks/
An answer for this question: [Example] Files for Block Structure
There is a quick explanation in the README doc here: libFM 1.42 Manual, in case you don't want to read this whole blog post.
I'll reuse the toy dataset from this previous blog post. Look at it to understand what the features mean.
train.libfm
5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
and test.libfm
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1
And I'll merge them into one file, which makes the whole process easier:
dataset.libfm
5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1
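Each of these lines uses libFM's sparse format: a target value followed by index:value pairs. As a quick illustration, here is a minimal parser for one such line (my own helper, not part of libFM):

```python
def parse_libfm_line(line):
    # 'target idx:val idx:val ...' -> (target, {idx: val})
    target, *pairs = line.split()
    features = {int(p.split(':')[0]): float(p.split(':')[1]) for p in pairs}
    return float(target), features

print(parse_libfm_line('5 0:1 2:1 6:1 9:12.5'))
# (5.0, {0: 1.0, 2: 1.0, 6: 1.0, 9: 12.5})
```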
Now suppose we want to use the block structure. We first need these five files:
- rel_user.libfm (features 0-1 and 6-8 are user features)
0 0:1 6:1
0 1:1 8:1
In fact you can avoid having the feature ids broken up like that (0-1, 6-8) by recompressing them, so 0-1 -> 0-1 and 6-8 -> 2-4:
0 0:1 2:1
0 1:1 4:1
- rel_product.libfm (features 2-5 and 9 are product features). Same thing, we can compress from:
0 2:1 9:12.5
0 3:1 9:20
0 4:1 9:78
0 5:1
to
0 0:1 4:12.5
0 1:1 4:20
0 2:1 4:78
0 3:1
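This recompression is just a dense renumbering of the feature ids that actually occur in the block. A sketch of how it could be done (my own helper, assuming the usual index:value tokens):

```python
def compress_features(lines):
    # Collect every feature id used in the block, then renumber them
    # densely in sorted order, so ids become 0..n-1
    ids = sorted({int(pair.split(':')[0])
                  for line in lines for pair in line.split()[1:]})
    mapping = {old: new for new, old in enumerate(ids)}
    out = []
    for line in lines:
        target, *pairs = line.split()
        new_pairs = ['{0}:{1}'.format(mapping[int(p.split(":")[0])],
                                      p.split(':', 1)[1])
                     for p in pairs]
        out.append(' '.join([target] + new_pairs))
    return out

print(compress_features(['0 2:1 9:12.5', '0 3:1 9:20', '0 4:1 9:78', '0 5:1']))
# ['0 0:1 4:12.5', '0 1:1 4:20', '0 2:1 4:78', '0 3:1']
```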
- rel_user.train (which is now the mapping: the first three lines are all 0, meaning the first three training rows use row 0 of rel_user.libfm | /!\ we are using 0-based indexing)
0
0
0
1
1
- rel_product.train (the mapping for the product block)
0
1
2
0
1
- y.train, which contains only the ratings
5
5
4
1
1
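To build these mapping files programmatically, the core operation is deduplicating the block rows and recording, for each original line, which unique row it points to. A sketch on the toy data above (hypothetical helper; the user part of each line is assumed already recompressed to ids 0-4):

```python
def build_block(rows):
    # Deduplicate rows in first-seen order and record, for each input
    # row, the index of its unique representative
    seen = {}
    uniques, index = [], []
    for row in rows:
        if row not in seen:
            seen[row] = len(seen)
            uniques.append(row)
        index.append(seen[row])
    return uniques, index

# user part of each line of dataset.libfm (train rows first, then test rows)
user_rows = ['0:1 2:1', '0:1 2:1', '0:1 2:1', '1:1 4:1', '1:1 4:1',
             '1:1 4:1', '0:1 2:1']
uniques, index = build_block(user_rows)
print(uniques)    # rows of rel_user.libfm (write each with a dummy 0 target)
print(index[:5])  # rel_user.train -> [0, 0, 0, 1, 1]
print(index[5:])  # rel_user.test  -> [1, 0]
```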
Almost done…
Now you need to create the .x and .xt files for the user block and the product block. For this you need the convert and transpose tools that end up in bin/ after you compile libFM.
./bin/convert --ifile rel_user.libfm --ofilex rel_user.x --ofiley rel_user.y
You are forced to use the --ofiley flag even though rel_user.y will never be used; you can delete it every time.
and then
./bin/transpose --ifile rel_user.x --ofile rel_user.xt
Now do the same thing for the product block. For the test set, since we merged the train and test datasets at the beginning, we only need to generate rel_user.test, rel_product.test and y.test.
At this point, you will have a lot of files: rel_user.train, rel_user.test, rel_user.x, rel_user.xt, rel_product.train, rel_product.test, rel_product.x, rel_product.xt, y.train, y.test
And run the whole thing:
./bin/libFM -task r -train y.train -test y.test --relation rel_user,rel_product -out output
It’s a bit overkill for this problem but I hope you get the point.
Now a real example
For this example, I’ll use the ml-1m.zip MovieLens dataset that you can get from here (1 million ratings)
ratings.dat (sample) / Format: UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
movies.dat (sample) / Format: MovieID::Title::Genres
1::Toy Story (1995)::Animation|Children’s|Comedy
2::Jumanji (1995)::Adventure|Children’s|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
users.dat (sample) / Format: UserID::Gender::Age::Occupation::Zip-code
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
I'll create three different models:
- The easiest libFM file to train, without blocks, using the features UserID and MovieID.
- A regular libFM file to train, without blocks, using the features UserID, MovieID, Gender, Age, Occupation and Genre (of the movie).
- libFM files to train with blocks, using the same features: UserID, MovieID, Gender, Age, Occupation and Genre (of the movie).
Model 1 and 2 can be created using the following code:
# -*- coding: utf-8 -*-
__author__ = 'Silbermann Thierry'
__license__ = 'WTFPL'

import numpy as np
import pandas as pd

def create_libfm(w_filename, model_lvl=1):
    # Load the data
    data_ratings = pd.read_csv('ratings.dat', delimiter='::', engine='python',
                               names=['UserID', 'MovieID', 'Ratings', 'Timestamp'])
    data_movies = pd.read_csv('movies.dat', delimiter='::', engine='python',
                              names=['MovieID', 'Name', 'Genre_list'])
    data_users = pd.read_csv('users.dat', delimiter='::', engine='python',
                             names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'])

    # Merge everything into one frame so that user and movie attributes
    # are aligned with each rating row
    data = data_ratings.merge(data_users, on='UserID', how='left') \
                       .merge(data_movies, on='MovieID', how='left')
    ratings = data['Ratings']
    print('Data loaded')

    # Map every feature value to a global libFM column index
    offset_array = [0]
    dict_array = []
    feat = ['UserID', 'MovieID']
    if model_lvl > 1:
        # NB: 'Genre_list' is kept as one categorical value
        # (the full 'A|B|C' string)
        feat.extend(['Gender', 'Age', 'Occupation', 'Genre_list'])
    for feature_name in feat:
        uniq = np.unique(data[feature_name])
        offset_array.append(len(uniq) + offset_array[-1])
        dict_array.append({key: value + offset_array[-2]
                           for value, key in enumerate(uniq)})
    print('Mapping done')

    # Write the libFM file: '<rating> <col>:1 <col>:1 ...'
    with open(w_filename, 'w') as w:
        for i in range(data.shape[0]):
            s = '{0}'.format(ratings[i])
            for index_feat, feature_name in enumerate(feat):
                value = data[feature_name][i]
                if value in dict_array[index_feat]:
                    s += ' {0}:1'.format(dict_array[index_feat][value])
            w.write(s + '\n')

if __name__ == '__main__':
    create_libfm('model1.libfm', 1)
    create_libfm('model2.libfm', 2)
So you end up with the files model1.libfm and model2.libfm. You just need to split each of them in two to create a training and a test set file, which I'll call train_m1.libfm and test_m1.libfm (and likewise for model 2: train_m2.libfm, test_m2.libfm).
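If you don't already have a splitting script, a minimal sketch (random 80/20 split; the filenames follow the text above):

```python
import random

def split_libfm(filename, train_name, test_name, test_ratio=0.2, seed=42):
    # Send each rating line to either the train or the test file
    rng = random.Random(seed)
    with open(filename) as src, \
            open(train_name, 'w') as train, open(test_name, 'w') as test:
        for line in src:
            (test if rng.random() < test_ratio else train).write(line)

# e.g. split_libfm('model1.libfm', 'train_m1.libfm', 'test_m1.libfm')
```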
Then you just run libFM like this:
./libFM -train train_m1.libfm -test test_m1.libfm -task r -iter 20 -method mcmc -dim '1,1,8' -out output_m1
But I guess you already know how to do those.
Now the interesting part, using blocks.
[TODO]
Reposted at: https://www.cnblogs.com/zhizhan/p/5099751.html