Elena Ricciardelli, Debmalya Biswas

Abstract. We present a Reinforcement Learning (RL) model for self-improving chatbots, specifically targeting FAQ-type chatbots. The model is not aimed at building a dialog system from scratch, but at leveraging data from user conversations to improve chatbot performance. At the core of our approach is a score model, which is trained to score chatbot utterance-response tuples based on user feedback. The scores predicted by this model are used as rewards for the RL agent. Policy learning takes place offline, thanks to a user simulator which is fed with utterances from the FAQ database. Policy learning is implemented using a Deep Q-Network (DQN) agent with epsilon-greedy exploration, tailored to effectively include fallback answers for out-of-scope questions. The potential of our approach is shown on a small case extracted from an enterprise chatbot, where the success rate increases from an initial 50% to 75% in 20–30 training epochs.

The published version of the paper is available in the proceedings of the 4th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), Montreal, 2019: http://rldm.org/papers/abstracts.pdf

1 Introduction

The majority of dialog agents in an enterprise setting are domain specific, consisting of a Natural Language Understanding (NLU) unit trained to recognize the user’s goal in a supervised manner. However, collecting a good training set for a production system is a time-consuming and cumbersome process. Chatbots covering a wide range of intents often face poor performance due to intent overlap and confusion. Furthermore, it is difficult to autonomously retrain a chatbot taking into account the user feedback from live usage or the testing phase. Self-improving chatbots are challenging to achieve, primarily because of the difficulty in choosing and prioritizing metrics for chatbot performance evaluation. Ideally, one wants a dialog agent that is capable of learning from the user’s experience and improving autonomously.

In this work, we present a reinforcement learning approach for self-improving chatbots, specifically targeting FAQ-type chatbots. The core of such chatbots is an intent recognition NLU, which is trained with hard-coded examples of question variations. When no intent is matched with a confidence level above 30%, the chatbot returns a fallback answer. Otherwise, the NLU engine returns the matched response along with the corresponding confidence level.
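
As an illustration, the following is a minimal sketch of the answer-selection logic just described. The `nlu_parse` helper and the `responses` mapping are hypothetical stand-ins for whatever NLU engine and answer store are used; only the 30% fallback threshold is taken from the text.

```python
# Minimal sketch of the FAQ chatbot's answer selection with a fallback threshold.
# `nlu_parse` is a hypothetical stand-in for the NLU engine (e.g. Rasa); it is
# assumed to return the best-matching intent name and its confidence level.

FALLBACK_THRESHOLD = 0.30  # from the text: fall back below 30% confidence
FALLBACK_ANSWER = "Sorry, I did not understand your question."  # placeholder wording


def answer(utterance, nlu_parse, responses):
    """Return (response, confidence) for a user utterance.

    `responses` maps intent names to canned FAQ answers.
    """
    intent, confidence = nlu_parse(utterance)
    if intent is None or confidence < FALLBACK_THRESHOLD:
        return FALLBACK_ANSWER, confidence
    return responses[intent], confidence
```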

Several research papers [2, 3, 7, 8] have shown the effectiveness of a RL approach in developing dialog systems. Critical to this approach is the choice of a good reward model. A typical reward model is the implementation of a penalty term for each dialog turn. However, such rewards only apply to task-completion chatbots, where the purpose of the agent is to satisfy the user’s request in the shortest time; they are not suitable for FAQ-type chatbots, where the chatbot is expected to provide a good answer in one turn. The user’s feedback can also be used as a reward model in an online reinforcement learning setting. However, applying RL on live conversations can be challenging, and it may incur a significant cost in case of RL failure.

A better approach for deployed systems is to perform the RL training offline and then update the NLU policy once satisfactory levels of performance have been reached.

2 Reinforcement Learning Model

The RL model architecture is illustrated in Figure 1. The various components of the model are: the NLU unit, which is used to initially train the RL agent in a warm-up phase; the user simulator, which randomly extracts user utterances from the database of user experiences; the score model, trained on user conversations with feedback; and the RL agent, based on a Deep Q-Network (DQN).

2.1 Dialog System

We apply the reinforcement learning approach to an FAQ-type chatbot. At the core of the chatbot, there is an intent recognition NLU, which is trained with hard-coded examples of question variations. An intent is defined as the user’s intention, which is formulated through the utterance. For this work, we have chosen the open-source NLU from Rasa, using the TensorFlow pipeline. However, the RL approach is independent of the NLU chosen, and for systems in production it can easily be extended to NLU engines such as IBM Watson or Amazon LEX.

2.2 Real User Conversations

For our work, we used user feedback obtained during the development of an actual internal chatbot. The scope of the chatbot was to answer employee queries related to office building facilities, HR policies and benefits, etc.

All the 10 users participating in the test phase were informed that their feedback would be used to improve the chatbot performance. The testers provided a (binary) feedback after each conversation turn, thus rating the utterance-response tuples. The historical data thus contains quadruples of the form (utterance, response, NLU confidence level, feedback). By removing non-valid conversations (i.e. those lacking feedback or with invalid feedback), we end up with 550 user conversations, triggering about 120 intents. Although we have tested the score model on all the conversations, the RL model has been applied only to a subsample of 155 conversations, triggering the top 5 intents. On this subset, the user’s satisfaction is 50%.
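
As a sketch of how such logs could be prepared, the snippet below assumes the quadruples are stored in a CSV with hypothetical file and column names (utterance, response, intent, confidence, feedback); the filtering of invalid feedback and the top-5-intent subsample follow the description above.

```python
import pandas as pd

# Hypothetical log format: one row per conversation turn with columns
# (utterance, response, intent, confidence, feedback), feedback in {0, 1}.
logs = pd.read_csv("user_conversations.csv")

# Keep only turns with a valid binary feedback (missing/invalid feedback dropped).
valid = logs[logs["feedback"].isin([0, 1])]

# Restrict to the conversations triggering the top-5 intents, as in the RL experiment.
top5 = valid["intent"].value_counts().nlargest(5).index
subset = valid[valid["intent"].isin(top5)]

print(len(valid), "valid turns;", len(subset), "turns on the top-5 intents")
print("user satisfaction on the subset:", subset["feedback"].mean())
```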

Table 1: Example of a conversation from the database, as well as the scores provided by the model and by the user

2.3 Reward Function: the Score Model

Evaluating chatbot performance is a long-standing issue in computational linguistics. Automatic metrics borrowed from machine translation (e.g. [6]) do not perform well on short sentences (e.g. [4]), such as the chatbot utterance-response tuples. On the other hand, human rating of chatbots is by now the de facto standard for evaluating the success of a chatbot, although such ratings are often difficult and expensive to gather.

To evaluate the correctness of chatbot responses, we propose a new approach which makes use of the user conversation logs, gathered during the development and testing phases of the chatbot. Each user had been asked to provide a binary feedback (positive/negative) at each chatbot turn.

In order to use the user feedback in an offline reinforcement learning setting, we have developed a score model, capable of modeling the binary feedback for unseen utterance-response tuples. In a supervised fashion, the score model learns how to project the vector representations of utterance and response into a linearly transformed space, such that similar vector representations give a high score. As for the vector representation of sentences, we compute sentence embeddings through the Universal Sentence Encoder [1], available through TensorFlow Hub. To train the model, the optimization is done on a squared error loss (between model prediction and human feedback) with L2 regularization. To evaluate the model, the predicted scores are converted into a binary outcome and compared with the targets (the user feedback). For those utterances having a recognized intent with both a positive feedback and an NLU confidence level close to 1, we perform data augmentation, assigning a low score to the combination of the utterance and the fallback intent.
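
Since the exact architecture of the score model is not given in the text, the following is one plausible sketch of it under the stated assumptions: utterance and response are embedded with the Universal Sentence Encoder from TensorFlow Hub, projected through a shared learned linear map, and their similarity is mapped to a score in [0, 1], trained with a squared-error loss and L2 regularization on the projection weights. The layer sizes and regularization strength are illustrative assumptions.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Sentence embeddings from the Universal Sentence Encoder on TensorFlow Hub [1].
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")


def embed(sentences):
    """Return 512-dimensional sentence embeddings for a list of strings."""
    return use(sentences).numpy()


# Score model (illustrative): a shared linear projection of the two embeddings,
# with the score given by a sigmoid of their dot product in the projected space.
EMB_DIM, PROJ_DIM = 512, 64                      # PROJ_DIM is an assumption
l2 = tf.keras.regularizers.l2(1e-4)              # L2 regularization (strength assumed)

u_in = tf.keras.Input(shape=(EMB_DIM,), name="utterance_embedding")
r_in = tf.keras.Input(shape=(EMB_DIM,), name="response_embedding")
project = tf.keras.layers.Dense(PROJ_DIM, use_bias=False, kernel_regularizer=l2)
similarity = tf.keras.layers.Dot(axes=1)([project(u_in), project(r_in)])
score = tf.keras.layers.Dense(1, activation="sigmoid")(similarity)

score_model = tf.keras.Model(inputs=[u_in, r_in], outputs=score)
score_model.compile(optimizer="adam", loss="mse")  # squared error vs. binary feedback

# Training (X_u, X_r: embedded utterances/responses; y: 0/1 user feedback):
# score_model.fit([X_u, X_r], y, epochs=50, validation_split=0.2)
```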

A similar approach for chatbot evaluation has been suggested by [4]. The authors model the scores by using a labelled set of conversations, collected through crowdsourcing, that also includes model- and human-generated responses.

Our approach differs from the above authors in that it just requires a labelled set of utterance-response tuples, which are relatively straightforward to gather during the chatbot development and user testing phases.

2.4 Policy Learning with DQN

To learn the policy, the RL agent uses a Q-learning algorithm with a DQN architecture [5]. In DQN, a neural network is trained to approximate the state-action function Q(s_t, a_t, θ), which represents the quality of an action a_t taken in state s_t, where θ are the trainable parameters. As for the DQN network, we have followed the approach proposed by [3], using a fully-connected network fed by an experience replay pool buffer that contains the one-hot representation of utterance and response and the corresponding reward. A one-hot representation is possible in this case, as we have a finite set of possible values for utterances (given by the number of real users’ questions in the logs) and responses (equal to the number of intents used in our test case, 5). In a warm-up phase, the DQN is trained on the NLU, using the NLU confidence level as the reward. The DQN training set is augmented whenever a state-action pair has a confidence above a threshold, by assigning zero weight to the given state and all the other available actions. Thus, at the start of the RL training, the agent performs similarly to the NLU unit.
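
A minimal sketch of the Q-network and the warm-up step is given below, under assumptions about the exact state/action sizes and the hidden-layer width (the text does not specify them). The warm-up targets are a hypothetical pre-computed table of NLU confidence levels, with the zeroing of the other actions above the threshold implemented as one reading of the augmentation described above.

```python
import numpy as np
import tensorflow as tf

# One-hot state space: one entry per distinct logged user utterance.
# Action space: the 5 intents of the test case plus the fallback
# "no intent detected" action. Both sizes are illustrative assumptions.
N_UTTERANCES, N_ACTIONS = 155, 6


def build_dqn(n_states, n_actions, hidden=64):
    """Fully connected Q-network: one-hot utterance -> one Q-value per action."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden, activation="relu", input_shape=(n_states,)),
        tf.keras.layers.Dense(n_actions, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


dqn = build_dqn(N_UTTERANCES, N_ACTIONS)


def warm_up(dqn, nlu_confidences, threshold=0.3):
    """Warm-up phase: fit Q(s, a) to the NLU confidence levels.

    `nlu_confidences` is a hypothetical (N_UTTERANCES, N_ACTIONS) table of NLU
    confidence levels. When the best confidence for a state is above `threshold`,
    the targets of all other actions for that state are set to zero (one reading
    of the training-set augmentation described in the text).
    """
    targets = nlu_confidences.copy()
    for s in range(targets.shape[0]):
        best = targets[s].argmax()
        if targets[s, best] > threshold:
            targets[s, np.arange(targets.shape[1]) != best] = 0.0
    states = np.eye(targets.shape[0], dtype=np.float32)
    dqn.fit(states, targets, epochs=10, verbose=0)
```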

During RL training, we use an ε-greedy exploration strategy, where random actions are explored with probability ε. We use a time-varying ε which facilitates exploration at the beginning of the training, with ε_t0 = 0.2 at the first epoch and ε_t = 0.05 during the last epoch. To speed up learning when picking random actions, we also force a higher probability of getting the “No intent detected” action, as several questions are actually out of the chatbot’s scope but are erroneously matched to a wrong intent by the NLU. During an epoch we simulate a batch of conversations of size n-episodes (ranging from 10 to 30 in our experiments) and fill an experience replay buffer with the tuples (s_t, a_t, r_t). The buffer has a fixed size, and it is flushed for the first time when the agent’s performance increases above a specified threshold. In those episodes where the state-action tuple gets a reward greater than 50%, we perform data augmentation by assigning zero reward to the assignment of any other action to the current state.
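
The following sketch illustrates one training epoch as described above: a linearly decaying ε, random exploration biased towards the fallback action, the score model used as the reward, the reward-based data augmentation, and a one-turn Q-update (since each FAQ episode is a single turn, the Q-target reduces to the immediate reward). The simulator and reward interfaces are hypothetical, and the fallback bias value is an assumption.

```python
import random
import numpy as np


def epsilon(epoch, n_epochs, eps_start=0.2, eps_end=0.05):
    """Linearly decaying exploration rate: 0.2 at the first epoch, 0.05 at the last."""
    return eps_start + (eps_end - eps_start) * epoch / max(n_epochs - 1, 1)


def random_action(n_actions, fallback_action, p_fallback=0.3):
    """Random exploration biased towards the fallback ("no intent detected") action.

    The text only says the fallback is sampled with higher probability;
    the 0.3 bias is an assumption.
    """
    if random.random() < p_fallback:
        return fallback_action
    return random.randrange(n_actions)


def run_epoch(dqn, sample_utterance, reward_fn, replay, epoch, n_epochs,
              n_episodes=10, n_actions=6, fallback_action=5):
    """Simulate `n_episodes` one-turn conversations and fill the replay buffer.

    `sample_utterance` (the user simulator) and `reward_fn` (the score model)
    are hypothetical interfaces; the state is the one-hot utterance vector.
    """
    eps = epsilon(epoch, n_epochs)
    for _ in range(n_episodes):
        state = sample_utterance()
        if random.random() < eps:
            action = random_action(n_actions, fallback_action)
        else:
            action = int(np.argmax(dqn.predict(state[None, :], verbose=0)[0]))
        reward = reward_fn(state, action)
        replay.append((state, action, reward))
        if reward > 0.5:                          # data augmentation, as in the text
            for other in range(n_actions):
                if other != action:
                    replay.append((state, other, 0.0))


def train_from_replay(dqn, replay):
    """One-turn episodes: the Q-target is simply the immediate reward."""
    states = np.stack([s for s, _, _ in replay])
    targets = dqn.predict(states, verbose=0)
    for i, (_, a, r) in enumerate(replay):
        targets[i, a] = r
    dqn.fit(states, targets, epochs=1, verbose=0)
```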

Figure 2: Performance of the score model. Left-hand panel: cross-validated test set accuracy with 95% confidence intervals for different sub-samples having different numbers of intents. The horizontal red line indicates the performance for the entire sample. Right-hand panel: ROC curves for the different subsamples.

3 Model Evaluation

3.1 Score Model Evaluation

To evaluate the model, we select subsets of conversations triggering the top N intents, with N between 5 and 50. The results of the score model are summarized in Figure 2, showing the cross-validated (5-fold CV) accuracy on the test set and the ROC curve as a function of the number of intents.
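
As an illustration, a sketch of how the cross-validated accuracy and AUC could be computed for one intent subset using scikit-learn is given below. `fit_and_predict` is a hypothetical helper that trains the score model on a training fold and returns predicted scores for the test fold, and the 0.5 threshold for the binary outcome is an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score


def cross_validate_score_model(y, fit_and_predict, n_splits=5, threshold=0.5):
    """5-fold cross-validated accuracy and AUC of the score model on one subset.

    `y` holds the binary user feedback for each utterance-response tuple, and
    `fit_and_predict(train_idx, test_idx)` is a hypothetical helper that trains
    the score model on the training fold and returns scores for the test fold.
    """
    accuracies, aucs = [], []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(np.zeros(len(y)), y):
        scores = np.asarray(fit_and_predict(train_idx, test_idx))
        predictions = (scores >= threshold).astype(int)   # binary outcome
        accuracies.append(accuracy_score(y[test_idx], predictions))
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(accuracies)), float(np.mean(aucs))
```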

For the whole sample of conversations, we obtain a cross-validated accuracy of 75% and an AUC of 0.84.

However, by selecting only those conversations triggering the top 5 intents, thus including more examples per intent, we obtain an accuracy of 86% and an AUC of 0.94. For the RL model evaluation, we have therefore focused on the 5-intent subset, which ensures that we have the most reliable rewards.

3.2 Reinforcement Learning Model Evaluation

The learning curve for the RL training is shown in Figure 3. In the left-hand panel, we compare the RL training with the reward model to a test done with a direct reward (in an interactive way), showing that the score model gives performance similar to the reference case, where the reward is known. Large fluctuations in the average score are due to the limited batch size (n-episodes = 10) and a relatively large ε. We also show the success rate on a test set of 20 conversations, extracted from the full sample, where a “golden response” is manually provided for all the utterances.

The agent success rate increases from an initial 50% to 75% in only 30 epochs, showing the potential of this approach.

In the right-hand panel, we show the results using n-episodes = 30, which gives similar performance but with a smoother learning curve.

Figure 3: Learning curves showing the DQN agent’s average score (continuous black line) per training epoch and the success rate (purple shaded area), based on a labelled test set of 20 conversations. Left-hand panel: learning curves for direct RL with an interactive reward (black line) and with the reward model (blue dotted line), using 10 episodes per epoch. Right-hand panel: learning curves for the reward model, using 30 episodes per epoch.

4 Conclusions

In this work, we have shown the potential of a reinforcement learning approach in improving the performance of FAQ-type chatbots, based on the feedback from a user testing phase. To achieve this, we have developed a score model, which is able to predict the user’s satisfaction on utterance-response tuples, and implemented a DQN reinforcement learning model, using the score model predictions as rewards. We have evaluated the model on a small, but real, test case, demonstrating promising results. Further training on more epochs and with more data, as well as extensive tests on the model hyper-parameters, is in progress. The value of our approach is in providing a practical tool to improve large-scale chatbots (with a large set of diverse intents) in an automated fashion based on user feedback.

Finally, we note that although the reinforcement learning model presented in this work is suitable for FAQ-type chatbots, it can be generalised to include the sequential nature of conversation by incorporating a more complex score model.

Original article: https://medium.com/analytics-vidhya/self-improving-chatbots-based-on-reinforcement-learning-75cca62debce
