Conversational Recommender System paper notes: recommender system + dialogue system

Conversational Recommender System

Yueming Sun, Yi Zhang

abstract

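Existing solutions, whether single-round ad hoc search engines or traditional multi-round dialogue systems, share one problem: they consider only the user's input in the current session and ignore the user's long-term preferences.

Recommender systems are known to raise the sales conversion rate substantially. They infer user preferences from past purchase behavior and optimize business-oriented metrics such as conversion rate and expected revenue.

In this paper the authors propose a unified deep reinforcement learning framework that merges recommendation and dialogue in order to build a personalized conversational recommendation agent.

The user's conversation history is represented as a semi-structured user query with facet-value pairs.

The query is generated and updated by a belief tracker, which analyzes each utterance at every step of the conversation.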

A deep policy network is trained to decide which action to take at each step (i.e. asking for the value of a facet or making a recommendation).

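The agent collects the user's preferences by asking questions and makes a recommendation once it has enough information.

The model uses the user's past ratings together with the query collected in the current session to predict ratings and generate recommendations.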

introduction

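Dialogue systems fall into three main categories: chit-chat, information chat, and task-oriented chat.

Current work on task-oriented systems focuses mainly on natural language processing and semantic rich search solutions.

A task-oriented system has three main modules: natural language understanding (NLU), dialogue management (DM), and a natural language generation module.

1. NLU: analyzes each user utterance, tracks the dialogue history, and keeps updating the user's intention. It focuses on extracting item metadata.

we train a deep belief tracker to analyze a user’s current utterance based on context and extract the facet values of the targeted item from the user utterance.

Its output is used to update the user's intention and is passed to the DM and the recommender system as a set of facet-value pairs describing the target.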

2. DM: decides which action to take in the current state.

we train a deep policy network that decides which machine action to take at each turn given the current user query and long term user preferences learned by the recommender system.

When the information is insufficient, the agent asks a question; when it is sufficient, the agent recommends a list of items.

3. Natural language generation module: generates the response.

2. Related work
3. conversational recommendation with reinforcement learning
- 3.1 overview

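The framework consists of a belief tracker, a recommender system, and a policy network.

The recommender is called to produce a list of items only when the policy network chooses the recommend action.

Three aspects matter in the recommendation process:

How to understand the user's intention correctly
How to make sequential decisions and take the appropriate action
How to make personalized recommendations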
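The interaction between the three components can be sketched as a simple loop. This is only an illustration of the control flow described above, not the paper's implementation: `run_dialogue`, the callables passed to it, the `"ask:<facet>"` action format, and the default `max_turns` are all hypothetical names and values chosen for the sketch.

```python
from typing import Callable, List, Tuple


def run_dialogue(
    belief_tracker: Callable[[List[str]], List[float]],
    policy: Callable[[List[float]], str],
    recommender: Callable[[List[float]], List[str]],
    get_user_reply: Callable[[str], str],
    first_utterance: str,
    max_turns: int = 7,  # assumed turn limit, for illustration only
) -> Tuple[List[str], int]:
    """Ask facet questions until the policy chooses to recommend."""
    utterances = [first_utterance]
    for turn in range(1, max_turns + 1):
        state = belief_tracker(utterances)   # dialogue state s_t
        action = policy(state)               # "ask:<facet>" or "recommend"
        if action == "recommend":
            # The recommender is invoked only for the recommend action.
            return recommender(state), turn
        utterances.append(get_user_reply(action))
    return [], max_turns                     # no recommendation was made


# Toy stand-ins for the three trained components and the user.
def toy_tracker(utts: List[str]) -> List[float]:
    return [float(len(utts))]


def toy_policy(state: List[float]) -> str:
    return "recommend" if state[0] >= 3 else "ask:city"


def toy_recommender(state: List[float]) -> List[str]:
    return ["item_a", "item_b"]


def toy_user(action: str) -> str:
    return "Phoenix"


items, turns = run_dialogue(
    toy_tracker, toy_policy, toy_recommender, toy_user, "I want food"
)
```

With these toy stubs the agent asks two questions and recommends on the third turn.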

3.2 belief tracker

We introduce a Belief Tracker module similar to [5] to extract facet-value pairs from user utterances during the conversation

These facet-value pairs also serve as the agent's state.

At each time step $t$, given a user utterance $e_t$, the input vector $z_t$ (an n-gram representation) is obtained by $z_t = ngram(e_t)$.

Then $z_t$ is fed into an LSTM and encoded into a vector $h_t$, which is passed to a softmax activation layer to obtain a probability distribution over the $j$ values of a particular facet.

$h_t = LSTM(z_1, z_2, ..., z_t)$

$f_i = softmax(h_t)$

Suppose there are $l$ different facets; their distributions are concatenated to form the state the agent faces in the current dialogue:

At each round, all the $f_i$ are concatenated to each other to form the agent's current belief of the dialogue state in the current session. If there are $l$ facets, then:

$s_t = f_1 \oplus f_2 ... \oplus f_l$

$f_i$ is the learned vector representation for the facet $i$, $i \leq l$
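A minimal sketch of the last two steps (softmax per facet, then concatenation) may help. The LSTM is left out: `facet_scores` holds hypothetical raw scores standing in for each facet's LSTM output, and `build_state` and the facet names are made up for this example.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one facet's value scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_state(
    facet_scores: Dict[str, List[float]], facet_order: List[str]
) -> List[float]:
    """Concatenate the per-facet distributions f_1 ... f_l into s_t."""
    state: List[float] = []
    for facet in facet_order:
        state.extend(softmax(facet_scores[facet]))
    return state


# Hypothetical raw scores; in the paper these would come from one LSTM per facet.
facet_scores = {"category": [2.0, 0.0, 0.0], "price": [0.0, 0.0]}
state = build_state(facet_scores, ["category", "price"])
```

Here `state` has one entry per facet value (3 + 2 = 5). The `price` scores are all equal, so its distribution is uniform; treating a facet the user has not mentioned this way is only one plausible reading, not something these notes confirm.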

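Open question: at each time step $t$, is the input one complete user utterance, which yields the probability distribution for a particular facet, or is it a single word of the utterance?

If the distribution of some facet has not yet been obtained at the current step, how should it be computed?

The paper trains one LSTM per facet and computes that facet's probability distribution.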

3.3 recommender system

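The recommender is trained on $s_t$, $u_m$ (user information), and $i_n$ (item information).

It uses a factorization machine (FM) because an FM can combine different kinds of features. I am not very familiar with FM.

Suppose the dataset has $M$ users and $N$ items. The user and item vectors are 1-hot encoded.

The output $y_{m,n}$ can be viewed as a rating score for the explicit feedback or a 0-1 scalar for the implicit feedback.

Rating prediction is trained by minimizing the $L_2$ loss between the true rating score and the predicted rating score, using stochastic gradient descent.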
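For readers unfamiliar with FM, here is a sketch of the standard second-order factorization machine prediction, $y = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle x_i x_j$. It is the textbook formulation, not code from the paper. The assumption is that the input `x` is the concatenation of the 1-hot user vector, the 1-hot item vector, and $s_t$; the parameter values below are arbitrary.

```python
from typing import List


def fm_predict(
    x: List[float], w0: float, w: List[float], V: List[List[float]]
) -> float:
    """Second-order FM: bias + linear term + pairwise factorized interactions."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        # Identity: sum_{i<j} v_if v_jf x_i x_j
        #   = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def l2_loss(y_true: float, y_pred: float) -> float:
    """Squared error between the true and predicted rating."""
    return (y_true - y_pred) ** 2


# Arbitrary toy parameters: 3 features, 2 latent factors.
x = [1.0, 1.0, 0.0]
w0, w = 0.1, [0.2, 0.3, 0.4]
V = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
y_pred = fm_predict(x, w0, w, V)
loss = l2_loss(2.0, y_pred)
```

Only features 0 and 1 are active, so the prediction is 0.1 + (0.2 + 0.3) + ⟨v_0, v_1⟩ = 1.6. In training, SGD would adjust `w0`, `w`, and `V` to reduce `loss`.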

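I did not fully understand how the recommendation step itself works.

I did follow the later part: select the $\mu$ most probable combinations (I am not sure what these are; I did not follow their earlier introduction), use their facet values to retrieve the matching items from the dataset to get a candidate set, then rerank the candidates by the rating score computed with the trained model.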
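One possible reading of this retrieve-then-rerank step is sketched below. It rests on a guess these notes do not confirm: a "combination" is taken to be one value per facet, scored by the product of the belief tracker's per-facet probabilities. `top_combinations`, `recommend`, and all the data are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List


def top_combinations(
    beliefs: Dict[str, Dict[str, float]], mu: int
) -> List[Dict[str, str]]:
    """Return the mu facet-value combinations with the highest joint probability."""
    facets = sorted(beliefs)
    scored = []
    for values in product(*(beliefs[f].items() for f in facets)):
        prob = 1.0
        for _, p in values:
            prob *= p  # assumed: facets are treated as independent
        scored.append((prob, {f: v for f, (v, _) in zip(facets, values)}))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [combo for _, combo in scored[:mu]]


def recommend(
    beliefs: Dict[str, Dict[str, float]],
    items: Dict[str, Dict[str, str]],
    score: Callable[[str], float],
    mu: int,
) -> List[str]:
    """Retrieve items matching a top combination, then rerank by predicted rating."""
    combos = top_combinations(beliefs, mu)
    candidates = [name for name, facets in items.items() if facets in combos]
    return sorted(candidates, key=score, reverse=True)


beliefs = {
    "city": {"Phoenix": 0.9, "Tempe": 0.1},
    "price": {"cheap": 0.6, "pricey": 0.4},
}
items = {
    "a": {"city": "Phoenix", "price": "cheap"},
    "b": {"city": "Phoenix", "price": "pricey"},
    "c": {"city": "Tempe", "price": "cheap"},
}
ratings = {"a": 3.5, "b": 4.2, "c": 5.0}  # stand-in for the trained model's scores
ranked = recommend(beliefs, items, ratings.get, mu=2)
```

With `mu=2` only the two Phoenix combinations are kept, so item `c` is never retrieved even though its predicted rating is the highest.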

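I did not follow the description of the combinations, and it is unclear how they are selected. They seem to come from the state produced by the belief tracker, but the selection rule is not clear to me.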

3.4 Deep policy network

The dialogue state is used to select the action.

The paper uses a policy gradient method, which can learn a policy directly, without consulting the value functions [22]. I am not familiar with this method.

The reinforcement learning has the basic components of state S,action A,reward R and policy π(a|s).

State: $s_t = f_1 \oplus f_2 ... \oplus f_l$

**Action:** there are two main kinds. 1. Request the value of a facet.

This kind is further divided into $l$ actions, one per facet: {$a_1, a_2, ..., a_l$}

2. Make a personalized recommendation $a_{rec}$: call the recommender system described above.

Reward: We model the recommendation reward in different ways,which will be introduced in section 4.3.

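Policy: what the model tries to learn.

$\pi(a_t|s_t)$ is the score, or the probability, of taking action $a_t$ in state $s_t$.

The network uses two fully connected layers with a ReLU activation between them. The output is passed to a softmax layer.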
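The forward pass of such a network can be sketched in plain Python. Layer sizes, weight values, and helper names are assumptions for illustration. The output has one probability per action, i.e. $l$ facet requests plus the recommend action.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def linear(W: Matrix, b: List[float], x: List[float]) -> List[float]:
    """Fully connected layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def softmax(v: List[float]) -> List[float]:
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def policy_forward(
    state: List[float], W1: Matrix, b1: List[float], W2: Matrix, b2: List[float]
) -> List[float]:
    """state -> FC -> ReLU -> FC -> softmax, giving pi(a | s_t)."""
    hidden = relu(linear(W1, b1, state))
    return softmax(linear(W2, b2, hidden))


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Random weights are used here only to show the shapes involved.
rng = random.Random(0)
state = [0.7, 0.2, 0.1, 0.5, 0.5]          # s_t with 5 entries (2 facets)
W1, b1 = random_matrix(4, 5, rng), [0.0] * 4
W2, b2 = random_matrix(3, 4, rng), [0.0] * 3  # 2 facet requests + a_rec
action_probs = policy_forward(state, W1, b1, W2, b2)
```

The agent would then sample or take the argmax over `action_probs` to choose between asking about a facet and recommending.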

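The network's goal is to maximize the reward obtained from the start state:

Gradient: computed with the algorithm of [28]:

I do not know how to compute this gradient.

The paper mentions that $\theta$ cannot be initialized randomly; otherwise training becomes very difficult.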
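The objective and gradient formulas are not reproduced in these notes. For orientation, the textbook REINFORCE form of a policy gradient is given below. It is an assumption that the paper's formulas match it; the discount factor $\gamma$, horizon $T$, and learning rate $\eta$ are generic symbols, not values from the paper.

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$$

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k=t}^{T} \gamma^{k} r_k\right]$$

$$\theta \leftarrow \theta + \eta \, \nabla_\theta J(\theta)$$

In words: sample whole dialogues with the current policy, then increase the log-probability of each action in proportion to the discounted reward collected from that step onward.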

4. Experimental setup

The paper runs both an offline experiment with simulated users and an online experiment with real users.

4.1 Dataset

4.2 User Simulation

4.3 Recommendation Rewards

4.4 User Utterance Collection

4.5 Baselines

4.6 Evaluation Methodology

4.7 Model Training

4.1 Dataset

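The dataset is the Yelp challenge recommendation dataset, which covers restaurants and food: https://www.yelp.com/dataset/challenge

Entries with fewer than five reviews are removed. There are five facets in total. The details are as follows: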

4.2 User Simulation

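The environment for reinforcement learning in a dialogue system is hard to define in advance because it depends on feedback from real users. If the parameters are poorly initialized, the final performance may also be poor.

To overcome this, a simulated user is used for pre-training. (This approach appeared in earlier papers, so it is probably fairly mature by now.)

The goal of the simulated user is to chat with the conversational system to find the target item.

The simulated user's procedure: it first tells the agent the facet values of the target item; when the agent gives a recommendation, it looks for the target item in the list.

A user has three kinds of behavior: 1. answer the agent's questions in natural language; 2. look for the target item in the list given by the agent (succeeding or failing); 3. leave the dialogue (because the dialogue is too long, the target is not in the recommended list, or the target's score is too low).

During training the user gives rewards. $r_q$: a negative reward when the user quits the dialogue.

$r_p$: a positive reward when the user successfully finds the target in the recommendation.

$r_c$: a small negative reward for every dialogue turn, to keep the dialogue from running too long.

The paper does not state after how many turns at most the user quits.

Algorithm 1 illustrates this process.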
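How the three rewards add up over one episode can be sketched as follows. This is not Algorithm 1 from the paper; `episode_return`, the outcome labels, and the numeric reward values are hypothetical, chosen only to show the bookkeeping.

```python
def episode_return(
    n_turns: int, outcome: str, r_p: float, r_q: float, r_c: float
) -> float:
    """Total reward for one dialogue: a per-turn cost plus the final outcome reward."""
    total = n_turns * r_c          # small negative reward for every turn
    if outcome == "success":
        total += r_p               # user found the target in the list
    elif outcome == "quit":
        total += r_q               # user left the dialogue
    else:
        raise ValueError(f"unknown outcome: {outcome}")
    return total


# Illustrative values only; the actual settings are not given in these notes.
R_P, R_Q, R_C = 30.0, -10.0, -1.0
short_success = episode_return(3, "success", R_P, R_Q, R_C)
long_success = episode_return(6, "success", R_P, R_Q, R_C)
early_quit = episode_return(2, "quit", R_P, R_Q, R_C)
```

A success after three turns earns more than a success after six, which is the pressure toward short dialogues that $r_c$ is meant to create.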

4.3 Recommendation Rewards

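Users inspect a recommended list in many different ways, so the paper gives several versions of the success reward $r_p$. $C$ is the maximum reward, obtained when the target is ranked first in the recommendation. $K$ is the threshold on how many list items the user inspects.

Linear $r_p$: $r_p = \frac{C (K - \tau + 1)}{K}$

Here $\tau$ is the rank of the target when $\tau < K$; a larger rank means failure.

NDCG $r_p$: When computing the NDCG, we use a binary relevance score. So $r_p = C \cdot NDCG@K$

Cascade $r_p$: based on the cascade model [4]

Each time, the user continues to the next page of recommendations with probability $p_r$ and stops with probability $1 - p_r$. The probability of viewing the next page decays exponentially (with $\alpha_1$), and so does the reward obtained (with $\alpha_2$). Both decay factors lie between 0 and 1.

$r_p = C \cdot \alpha_2^{\rho - 1}$ ($\rho \leq [K/k]$), with $k = 3$ here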
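The three reward variants can be compared with a short sketch. Several details are assumptions: a rank of exactly $K$ is counted as success, a failure returns `None` (the quit reward would apply instead), NDCG with a single relevant item reduces to $1/\log_2(\tau + 1)$, and the page index $\rho$ is taken to be $\lceil \tau / k \rceil$. The values of `C`, `K`, and `alpha_2` are illustrative.

```python
import math
from typing import Optional


def linear_reward(tau: int, C: float, K: int) -> Optional[float]:
    """r_p = C * (K - tau + 1) / K for a target at rank tau (1-based)."""
    if tau > K:
        return None  # target ranked beyond the threshold: failure
    return C * (K - tau + 1) / K


def ndcg_reward(tau: int, C: float, K: int) -> Optional[float]:
    """r_p = C * NDCG@K with binary relevance and one relevant item."""
    if tau > K:
        return None
    return C / math.log2(tau + 1)  # ideal DCG is 1 when the target is at rank 1


def cascade_reward(
    tau: int, C: float, K: int, alpha_2: float, k: int = 3
) -> Optional[float]:
    """r_p = C * alpha_2^(rho - 1), where rho is the page holding the target."""
    if tau > K:
        return None
    rho = math.ceil(tau / k)  # assumed: k items per page
    return C * alpha_2 ** (rho - 1)


C, K = 40.0, 30
rewards_at_rank_4 = (
    linear_reward(4, C, K),
    ndcg_reward(4, C, K),
    cascade_reward(4, C, K, alpha_2=0.5),
)
```

At rank 4 the linear reward is still 36.0, while the NDCG and cascade rewards have already dropped to roughly 17.2 and 20.0, which shows how much faster the two nonlinear variants decay.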

4.4 User Utterance Collection

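Because the Yelp dataset contains no dialogue utterances, the authors manually collected 385 dialogues and then used dialogue templates to generate 875721 dialogues corresponding to the user-item pairs in the Yelp dataset.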

4.5 Baselines

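The paper uses a Maximum Entropy rule-based method as the baseline.

The entropy of every unknown facet is computed and the facet with the largest entropy is asked about, until no item matches the current belief of the dialogue, all facets are known, or the number of turns exceeds a threshold (the paper does not state the value at that point; the turn limit appears to be set to 7).

MaxEnt Full: the variant that asks about all the facets

MaxEnt @K: the variants that ask about exactly K facets.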
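The facet-selection rule of this baseline can be sketched as below. The notes do not say over which distribution the entropy is computed; this sketch assumes it is the empirical distribution of each facet's values among the items still matching the dialogue. `facet_entropy`, `next_facet`, and the toy items are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def facet_entropy(items: List[Dict[str, str]], facet: str) -> float:
    """Shannon entropy (in bits) of a facet's values over the candidate items."""
    counts = Counter(item[facet] for item in items)
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def next_facet(
    items: List[Dict[str, str]], unknown_facets: List[str]
) -> Optional[str]:
    """Pick the unknown facet with maximum entropy, or None if it is time to stop."""
    if not items or not unknown_facets:
        return None  # no matching item left, or every facet is already known
    return max(unknown_facets, key=lambda f: facet_entropy(items, f))


candidates = [
    {"city": "Phoenix", "price": "cheap"},
    {"city": "Phoenix", "price": "pricey"},
    {"city": "Phoenix", "price": "cheap"},
    {"city": "Tempe", "price": "pricey"},
]
facet_to_ask = next_facet(candidates, ["city", "price"])
```

Here `price` splits the candidates evenly (entropy 1 bit) while `city` is skewed (about 0.81 bits), so the baseline asks about `price` first.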

4.6 Evaluation Methodology

Average Reward

Success Rate

Average of Turns

Wrong Quit Rate: because of belief tracker errors, the target is not in the recommended candidate list

Low Rank Rate
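These metrics are averages and rates over a set of dialogues. The sketch below computes them from per-episode records. The record fields and outcome labels (`"success"`, `"wrong_quit"`, `"low_rank"`) are hypothetical, and reading Low Rank Rate as "the target was ranked too low" is an assumption, since these notes do not define it.

```python
from typing import Dict, List, Union

Episode = Dict[str, Union[str, float, int]]


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate the five evaluation metrics over a list of finished dialogues."""
    n = len(episodes)

    def rate(label: str) -> float:
        return sum(1 for e in episodes if e["outcome"] == label) / n

    return {
        "average_reward": sum(e["reward"] for e in episodes) / n,
        "success_rate": rate("success"),
        "average_turns": sum(e["turns"] for e in episodes) / n,
        "wrong_quit_rate": rate("wrong_quit"),
        "low_rank_rate": rate("low_rank"),
    }


# Made-up episode log, for illustration only.
episodes = [
    {"reward": 27.0, "turns": 3, "outcome": "success"},
    {"reward": -12.0, "turns": 2, "outcome": "wrong_quit"},
    {"reward": 25.0, "turns": 5, "outcome": "success"},
    {"reward": -8.0, "turns": 4, "outcome": "low_rank"},
]
metrics = summarize(episodes)
```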

4.7 Model Training

This section covers model training and the various parameter settings.

5. experiment result

5.1 Offline Experiments
5.2 Online User Study

5.1 Offline Experiments

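The table above uses the linear $r_p$. CRM reaches a higher success rate in fewer turns than the baselines. The paper attributes this to the imperfect belief tracker: the more often the belief tracker is called, the more likely an error becomes, so fewer interactions with it help.

The paper also tried leaving out the state vector $s_t$. It gives no results and only reports that the state vector helps candidate selection for recommendation but does not seem to improve the FM model. I do not quite understand the second half (we find that $s_t$ contributes to the candidate selection step in recommendation, however it does not seem to boost the FM model.)

Changing how the reward $r_p$ is computed, the table below reports the results for NDCG $r_p$ and Cascade $r_p$.

The model trained with the linear reward performs better than the other two. The paper says this is because those two rewards decrease nonlinearly as the rank drops.

Effect of belief tracker accuracy on the model:

The figure above shows that belief tracker accuracy affects the robustness of the reinforcement learning model and is positively correlated with average reward and success rate. Its effect on the number of dialogue turns is not very noticeable.

Effect of the Maximum Success Reward C and the Recommendation List Stop Threshold, that is, of the environment, on the model: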
5.2 Online User Study

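The paper describes the ideal users for online evaluation: The ideal users would be those yelp user who have actually visited a number of restaurants and would like to chat with our agent to inform her current interest of a target

This condition is very hard to satisfy, so the following method is used instead:

A batch of target restaurants is sampled from the test set, each with a user id, a restaurant id, and the facets of this restaurant. The list of restaurants that user has visited is then retrieved from the training set by user id. This list serves as history: the workers taking part in the evaluation are asked to learn the user's preferences from it and to play the role of that user. Each worker is then given the facet values of the target restaurant, but not the restaurant itself, and picks from the list recommended by the agent the restaurants they believe are the target (up to three). If a pick is correct, the session is a success; otherwise it is a failure.

The evaluation results are below: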
