1. Please Note

 This article builds on:

  Event Recommendation Engine Challenge step-by-step walkthrough, Part 1

  Event Recommendation Engine Challenge step-by-step walkthrough, Part 2

  Event Recommendation Engine Challenge step-by-step walkthrough, Part 3

  Event Recommendation Engine Challenge step-by-step walkthrough, Part 4

 Readers should go through those four articles before reading this one.

2. Activity / Event Popularity Data

 Since this step uses the event_attendees.csv.gz file, let's first take a look at it:

import pandas as pd
#read the gzip-compressed attendance file and preview the first few rows
df_events_attendees = pd.read_csv('event_attendees.csv.gz', compression='gzip')
df_events_attendees.head()

 Sample output (the file records, for each event, the users who responded yes, maybe, invited, and no):

 1) Variable explanations

  nevents: the total number of events appearing in train.csv and test.csv; here the value is 13418

  self.eventPopularity: a sparse matrix of shape (nevents, 1). For each event it stores the number of 'yes' responses minus the number of 'no' responses: the file is processed line by line, the event's index is looked up, the count of space-separated entries in the yes column minus the count in the no column is recorded, and the column is then normalized (a minimal sketch of this computation is given after the loading example below).

import pandas as pd
import scipy.io as sio
#load the saved (nevents, 1) popularity matrix and view it as a DataFrame
eventPopularity = sio.mmread('EA_eventPopularity').todense()
pd.DataFrame(eventPopularity)

  Sample output:
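  To make the yes-minus-no computation concrete, here is a minimal standalone sketch (not part of the original post's code). It assumes the columns of event_attendees.csv.gz are event, yes, maybe, invited, no, with each response column holding a space-separated list of user ids, matching how the class code below indexes cols[1] and cols[4]:

import gzip

#Minimal sketch: popularity[event_id] = (#yes responses) - (#no responses)
popularity = {}
with gzip.open('event_attendees.csv.gz', 'rt') as f:
    f.readline()#skip the header line
    for line in f:
        cols = line.strip().split(',')
        event_id = cols[0]
        yes_users = cols[1].split() if cols[1] else []#space-separated user ids
        no_users = cols[4].split() if cols[4] else []
        popularity[event_id] = len(yes_users) - len(no_users)

#L1-normalize over all events, mirroring normalize(..., norm='l1', axis=0) in the full code
total = sum(abs(v) for v in popularity.values()) or 1
popularity = {e: v / total for e, v in popularity.items()}

  Note that the full code below keeps the simpler len(cols[1].split(' ')) form, which counts an empty field as one element; the sketch above treats an empty field as zero.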

  The complete code for Step 5:

from collections import defaultdict
import locale, pycountry
import scipy.sparse as ss
import scipy.io as sio
import itertools
#import cPickle
#In Python 3, cPickle has been replaced by _pickle
import _pickle as cPickle
import scipy.spatial.distance as ssd
import datetime
from sklearn.preprocessing import normalize
import gzip
import numpy as np
import hashlib

#Handle the association data between users and events
class ProgramEntities:
    """
    We only care about the users and events that appear in train and test,
    so we focus on this association data. By our statistics, train and test
    contain 3391 users and 13418 events in total.
    """
    def __init__(self):
        #Count the unique users and events in the train/test data
        uniqueUsers = set()#uniqueUsers holds all users: 3391 in total
        uniqueEvents = set()#uniqueEvents holds all events: 13418 in total
        eventsForUser = defaultdict(set)#maps each user to the events he/she acted on
        usersForEvent = defaultdict(set)#maps each event to the users who acted on it
        for filename in ['train.csv', 'test.csv']:
            f = open(filename)
            f.readline()#skip the header line
            for line in f:
                cols = line.strip().split(',')
                uniqueUsers.add( cols[0] )
                uniqueEvents.add( cols[1] )
                eventsForUser[cols[0]].add( cols[1] )
                usersForEvent[cols[1]].add( cols[0] )
            f.close()
        self.userEventScores = ss.dok_matrix( ( len(uniqueUsers), len(uniqueEvents) ) )
        self.userIndex = dict()
        self.eventIndex = dict()
        for i, u in enumerate(uniqueUsers):
            self.userIndex[u] = i
        for i, e in enumerate(uniqueEvents):
            self.eventIndex[e] = i
        ftrain = open('train.csv')
        ftrain.readline()
        for line in ftrain:
            cols = line.strip().split(',')
            i = self.userIndex[ cols[0] ]
            j = self.eventIndex[ cols[1] ]
            self.userEventScores[i, j] = int( cols[4] ) - int( cols[5] )
        ftrain.close()
        sio.mmwrite('PE_userEventScores', self.userEventScores)
        #To avoid unnecessary computation, we collect all associated user pairs and event pairs
        #An associated user pair is two users who both acted on at least one common event
        #An associated event pair is two events that at least one common user acted on
        self.uniqueUserPairs = set()
        self.uniqueEventPairs = set()
        for event in uniqueEvents:
            users = usersForEvent[event]
            if len(users) > 2:
                self.uniqueUserPairs.update( itertools.combinations(users, 2) )
        for user in uniqueUsers:
            events = eventsForUser[user]
            if len(events) > 2:
                self.uniqueEventPairs.update( itertools.combinations(events, 2) )
        #print(self.userIndex)
        cPickle.dump( self.userIndex, open('PE_userIndex.pkl', 'wb') )
        cPickle.dump( self.eventIndex, open('PE_eventIndex.pkl', 'wb') )

#Data cleaning class
class DataCleaner:
    def __init__(self):
        #Helpers that convert string fields to numeric values
        #Load locales
        self.localeIdMap = defaultdict(int)
        for i, l in enumerate(locale.locale_alias.keys()):
            self.localeIdMap[l] = i + 1
        #Load countries
        self.countryIdMap = defaultdict(int)
        ctryIdx = defaultdict(int)
        for i, c in enumerate(pycountry.countries):
            self.countryIdMap[c.name.lower()] = i + 1
            if c.name.lower() == 'usa':
                ctryIdx['US'] = i
            if c.name.lower() == 'canada':
                ctryIdx['CA'] = i
        for cc in ctryIdx.keys():
            for s in pycountry.subdivisions.get(country_code=cc):
                self.countryIdMap[s.name.lower()] = ctryIdx[cc] + 1
        self.genderIdMap = defaultdict(int, {'male':1, 'female':2})

    #Handle locale
    def getLocaleId(self, locstr):
        #localeIdMap is a defaultdict(int), so an unknown locale returns the default value 0
        return self.localeIdMap[ locstr.lower() ]

    #Handle birthyear
    def getBirthYearInt(self, birthYear):
        try:
            return 0 if birthYear == 'None' else int(birthYear)
        except:
            return 0

    #Handle gender
    def getGenderId(self, genderStr):
        return self.genderIdMap[genderStr]

    #Handle joinedAt
    def getJoinedYearMonth(self, dateString):
        dttm = datetime.datetime.strptime(dateString, "%Y-%m-%dT%H:%M:%S.%fZ")
        return "".join( [str(dttm.year), str(dttm.month)] )

    #Handle location
    def getCountryId(self, location):
        if (isinstance(location, str)) and len(location.strip()) > 0 and location.rfind('  ') > -1:
            return self.countryIdMap[ location[location.rindex('  ') + 2: ].lower() ]
        else:
            return 0

    #Handle timezone
    def getTimezoneInt(self, timezone):
        try:
            return int(timezone)
        except:
            return 0

    def getFeatureHash(self, value):
        if len(value.strip()) == 0:
            return -1
        else:
            #return int( hashlib.sha224(value).hexdigest()[0:4], 16) raises the following error in Python 3:
            #TypeError: Unicode-objects must be encoded before hashing
            return int( hashlib.sha224(value.encode('utf-8')).hexdigest()[0:4], 16)#Python 3 requires encoding first

    def getFloatValue(self, value):
        if len(value.strip()) == 0:
            return 0.0
        else:
            return float(value)

#User-user similarity matrix
class Users:
    """
    Build the user/user similarity matrix
    """
    def __init__(self, programEntities, sim=ssd.correlation):
        #spatial.distance.correlation(u, v) computes the correlation distance between vectors u and v
        cleaner = DataCleaner()
        nusers = len(programEntities.userIndex.keys())#3391
        #print(nusers)
        fin = open('users.csv')
        colnames = fin.readline().strip().split(',')#7 feature columns
        self.userMatrix = ss.dok_matrix( (nusers, len(colnames)-1) )#build a sparse matrix
        for line in fin:
            cols = line.strip().split(',')
            #"Only consider users that appear in train.csv" -- this comment is from the original author,
            #but note that userIndex actually contains all users from both train and test
            if cols[0] in programEntities.userIndex:
                i = programEntities.userIndex[ cols[0] ]#look up the user's index
                self.userMatrix[i, 0] = cleaner.getLocaleId( cols[1] )#locale
                self.userMatrix[i, 1] = cleaner.getBirthYearInt( cols[2] )#birthyear, missing values filled with 0
                self.userMatrix[i, 2] = cleaner.getGenderId( cols[3] )#gender
                self.userMatrix[i, 3] = cleaner.getJoinedYearMonth( cols[4] )#joinedAt
                self.userMatrix[i, 4] = cleaner.getCountryId( cols[5] )#location
                self.userMatrix[i, 5] = cleaner.getTimezoneInt( cols[6] )#timezone
        fin.close()
        #normalize the matrix
        self.userMatrix = normalize(self.userMatrix, norm='l1', axis=0, copy=False)
        sio.mmwrite('US_userMatrix', self.userMatrix)
        #compute the user similarity matrix, used later
        self.userSimMatrix = ss.dok_matrix( (nusers, nusers) )#(3391, 3391)
        for i in range(0, nusers):
            self.userSimMatrix[i, i] = 1.0
        for u1, u2 in programEntities.uniqueUserPairs:
            i = programEntities.userIndex[u1]
            j = programEntities.userIndex[u2]
            if (i, j) not in self.userSimMatrix:
                #print(self.userMatrix.getrow(i).todense()) e.g. [[0.00028123, 0.00029847, 0.00043592, 0.00035208, 0, 0.00032346]]
                #print(self.userMatrix.getrow(j).todense()) e.g. [[0.00028123, 0.00029742, 0.00043592, 0.00035208, 0, -0.00032346]]
                usim = sim(self.userMatrix.getrow(i).todense(), self.userMatrix.getrow(j).todense())
                self.userSimMatrix[i, j] = usim
                self.userSimMatrix[j, i] = usim
        sio.mmwrite('US_userSimMatrix', self.userSimMatrix)

#Mining the user's social relations
class UserFriends:
    """
    Find a user's friends. The intuition is simple:
    1) If you have more friends, you may be more outgoing and more likely to attend events
    2) If your friends attend an event, you may follow along and attend it too
    """
    def __init__(self, programEntities):
        nusers = len(programEntities.userIndex.keys())#3391
        self.numFriends = np.zeros( (nusers) )#array([0., 0., ..., 0.]), number of friends of each user
        self.userFriends = ss.dok_matrix( (nusers, nusers) )
        fin = gzip.open('user_friends.csv.gz')
        print( 'Header In User_friends.csv.gz:', fin.readline() )
        ln = 0
        #Read user_friends.csv.gz line by line
        #Only users (first column) that appear in userIndex are of interest
        #Get that user's index and number of friends
        #For each friend that also appears in userIndex, look up the friend's index and fetch the friend's
        #reactions to every event from userEventScores
        #score is the friend's average score over all events
        #The userFriends matrix records this score between a user and each friend
        #For example, user 851286067 (index 1750) appears in test.csv and has 2151 friends in user_friends.csv.gz,
        #so its friend ratio is 2151 / sumNumFriends = 2151 / 3731377.0 = 0.0005764627910822198
        for line in fin:
            if ln % 200 == 0:
                print( 'Loading line:', ln )
            cols = line.decode().strip().split(',')
            user = cols[0]
            if user in programEntities.userIndex:
                friends = cols[1].split(' ')#this user's friend list
                i = programEntities.userIndex[user]
                self.numFriends[i] = len(friends)
                for friend in friends:
                    if friend in programEntities.userIndex:
                        j = programEntities.userIndex[friend]
                        #the objective of this score is to infer the degree to
                        #and direction in which this friend will influence the
                        #user's decision, so we sum the user/event score for
                        #this user across all training events
                        eventsForUser = programEntities.userEventScores.getrow(j).todense()#the friend's reaction to every event: 0, 1, or -1
                        #print(eventsForUser.sum(), np.shape(eventsForUser)[1])
                        #score is the friend's average over the 13418 events
                        score = eventsForUser.sum() / np.shape(eventsForUser)[1]#np.shape(eventsForUser)[1] = 13418
                        #print(score)
                        self.userFriends[i, j] += score
                        self.userFriends[j, i] += score
            ln += 1
        fin.close()
        #normalize the array
        sumNumFriends = self.numFriends.sum(axis=0)#sum of friend counts over all users
        #print(sumNumFriends)
        self.numFriends = self.numFriends / sumNumFriends#each user's share of the total friend count
        sio.mmwrite('UF_numFriends', np.matrix(self.numFriends) )
        self.userFriends = normalize(self.userFriends, norm='l1', axis=0, copy=False)
        sio.mmwrite('UF_userFriends', self.userFriends)

#Build event-event similarity data
class Events:
    """
    Build event-event similarity. Note there are two kinds of similarity here:
    1) similarity derived from user-event behavior, as in collaborative filtering
    2) similarity derived from the event's own content (event metadata)
    """
    def __init__(self, programEntities, psim=ssd.correlation, csim=ssd.cosine):
        cleaner = DataCleaner()
        fin = gzip.open('events.csv.gz')
        fin.readline()#skip header
        nevents = len(programEntities.eventIndex)
        print(nevents)#13418
        self.eventPropMatrix = ss.dok_matrix( (nevents, 7) )
        self.eventContMatrix = ss.dok_matrix( (nevents, 100) )
        ln = 0
        for line in fin:
            #if ln > 10:
            #    break
            cols = line.decode().strip().split(',')
            eventId = cols[0]
            if eventId in programEntities.eventIndex:
                i = programEntities.eventIndex[eventId]
                self.eventPropMatrix[i, 0] = cleaner.getJoinedYearMonth( cols[2] )#start_time
                self.eventPropMatrix[i, 1] = cleaner.getFeatureHash( cols[3] )#city
                self.eventPropMatrix[i, 2] = cleaner.getFeatureHash( cols[4] )#state
                self.eventPropMatrix[i, 3] = cleaner.getFeatureHash( cols[5] )#zip
                self.eventPropMatrix[i, 4] = cleaner.getFeatureHash( cols[6] )#country
                self.eventPropMatrix[i, 5] = cleaner.getFloatValue( cols[7] )#lat
                self.eventPropMatrix[i, 6] = cleaner.getFloatValue( cols[8] )#lon
                for j in range(9, 109):
                    self.eventContMatrix[i, j-9] = cols[j]
            ln += 1
        fin.close()
        self.eventPropMatrix = normalize(self.eventPropMatrix, norm='l1', axis=0, copy=False)
        sio.mmwrite('EV_eventPropMatrix', self.eventPropMatrix)
        self.eventContMatrix = normalize(self.eventContMatrix, norm='l1', axis=0, copy=False)
        sio.mmwrite('EV_eventContMatrix', self.eventContMatrix)
        #calculate similarity between event pairs based on the two matrices
        self.eventPropSim = ss.dok_matrix( (nevents, nevents) )
        self.eventContSim = ss.dok_matrix( (nevents, nevents) )
        for e1, e2 in programEntities.uniqueEventPairs:
            i = programEntities.eventIndex[e1]
            j = programEntities.eventIndex[e2]
            if not ((i, j) in self.eventPropSim):
                epsim = psim( self.eventPropMatrix.getrow(i).todense(), self.eventPropMatrix.getrow(j).todense() )
                if np.isnan(epsim):
                    epsim = 0
                self.eventPropSim[i, j] = epsim
                self.eventPropSim[j, i] = epsim
            if not ((i, j) in self.eventContSim):
                #If one of the two vectors is all zeros, cosine returns nan, e.g.:
                #  a = np.array([0, 1, 1, 1, 0, 0, 0, 1, 0, 0])
                #  b = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
                #  scipy.spatial.distance.cosine(a, b)
                #  -> RuntimeWarning: invalid value encountered in double_scalars
                ecsim = csim( self.eventContMatrix.getrow(i).todense(), self.eventContMatrix.getrow(j).todense() )
                if np.isnan(ecsim):
                    ecsim = 0
                self.eventContSim[i, j] = ecsim
                self.eventContSim[j, i] = ecsim
        sio.mmwrite('EV_eventPropSim', self.eventPropSim)
        sio.mmwrite('EV_eventContSim', self.eventContSim)

#Event popularity
class EventAttendees:
    """
    Count, for each event, how many users attended and how many did not,
    as preparation for the event popularity feature.
    """
    def __init__(self, programEntities):
        nevents = len(programEntities.eventIndex)#13418
        self.eventPopularity = ss.dok_matrix( (nevents, 1) )
        f = gzip.open('event_attendees.csv.gz')
        f.readline()#skip header
        for line in f:
            cols = line.decode().strip().split(',')
            eventId = cols[0]
            if eventId in programEntities.eventIndex:
                i = programEntities.eventIndex[eventId]
                self.eventPopularity[i, 0] = len(cols[1].split(' ')) - len(cols[4].split(' '))#yes count minus no count
        f.close()
        self.eventPopularity = normalize( self.eventPopularity, norm='l1', axis=0, copy=False )
        sio.mmwrite('EA_eventPopularity', self.eventPopularity)

def data_prepare():
    """
    Compute and save all the data as matrices (or other formats)
    for later feature extraction and modeling.
    """
    print('Step 1: collecting user and event statistics...')
    pe = ProgramEntities()
    print('Step 1 done...\n')
    print('Step 2: computing user similarity and saving it as a matrix...')
    Users(pe)
    print('Step 2 done...\n')
    print('Step 3: computing user social relations and saving them...')
    UserFriends(pe)
    print('Step 3 done...\n')
    print('Step 4: computing event similarity and saving it as a matrix...')
    Events(pe)
    print('Step 4 done...\n')
    print('Step 5: computing event popularity...')
    EventAttendees(pe)
    print('Step 5 done...\n')

#Run the data preparation
data_prepare()
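
 Once data_prepare() finishes, the intermediate results sit on disk as .mtx and .pkl files (the .mtx extension is added by mmwrite). As a quick sanity check, assuming the script above has been run in the current directory, you can load a couple of them back and inspect their shapes:

import scipy.io as sio
import _pickle as cPickle

userEventScores = sio.mmread('PE_userEventScores')#(n_users, n_events) user/event score matrix
eventPopularity = sio.mmread('EA_eventPopularity')#(n_events, 1) normalized popularity
userIndex = cPickle.load( open('PE_userIndex.pkl', 'rb') )#user id -> row index

print(userEventScores.shape, eventPopularity.shape, len(userIndex))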

 

 This completes the data preprocessing and saving stage.

 Next we move on to feature construction: Event Recommendation Engine Challenge step-by-step walkthrough, Part 6.

Reposted from: https://www.cnblogs.com/always-fight/p/10505454.html
