Combining Reinforcement Learning and Rule-based Method to Manipulate Objects in Clutter

Table of Contents

  • **Combining Reinforcement Learning and Rule-based Method to Manipulate Objects in Clutter**
    • **Abstract**
    • **Introduction**
    • **Pushing and Grasping**
      • **Pushing**
      • **Grasping**
    • **Experiment**
      • **Grasp algorithm verification**
      • **Reinforcement Learning Training**
      • **Clutter clearing**
  • **Conclusion**

Abstract

To reduce the complexity of strategy learning, we propose a framework for robots to pick up the objects in clutter on table based on deep reinforcement learning and rule-based method.

Deep reinforcement learning + rule-based method

To manipulate the objects on table, we mainly divide the robot actions into two categories: one is pushing that uses the reinforcement learning method, while the other one is grasping that is inferred by image morphological processing.

Pushing: reinforcement learning. Grasping: image morphological processing.

The pushing action can separate the stacking objects, create a robust grasp point for the following grasp.

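The pushing action separates stacked objects and creates a suitable grasp point for the subsequent grasp.

With reinforcement learning, a reward is returned once a push produces a suitable grasp point.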

Taking images as input, our framework can keep a high grasp rate with low computational complexity, which makes it achieve clutter clearing quickly.

With images as input, the proposed framework keeps a high grasp rate at low computational cost, which makes clutter clearing faster.

Introduction

Especially for grasping, few positive samples and diverse objects lead to the fact that hundreds of hour for collecting data is inescapable.

For grasping in particular, positive samples are scarce and objects are diverse, so hundreds of hours of data collection are unavoidable.

This kind of problem is hard to define manually and doesn’t require a very precise solution, hence it is suitable for reinforcement learning to deal with this problem.

This kind of problem is hard to define by hand and does not need a very precise solution, so it is well suited to reinforcement learning.

Compared with their work, we try to employ the reinforcement learning network with continuous output to remedy this issue.

Compared with their work, we try a reinforcement learning network with continuous output to address this issue.

We find that the grasp algorithm based on supervised learning is mostly trained on Cornell Grasping Dataset or Jacquard Dataset, whose depth image is strikingly different from the depth image in simulation because of different shooting angles.

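Supervised grasp algorithms are mostly trained on the Cornell Grasping Dataset or the Jacquard Dataset. Because of the different camera angles, the depth images in those datasets differ markedly from the depth images in simulation.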

We make use of the twin delayed deep deterministic policy gradient [6] to train our policy that determines where to start pushing and pushing direction according to current image.

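TD3 (twin delayed deep deterministic policy gradient) is used to train the policy. Note that this is an actor-critic method with continuous actions, not a DQN variant.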

The grasp detecting is processed with rule-based method mainly based on the recognition of minimum bounding convex hull and minimum bounding rectangle of connected regions.

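Grasp detection uses a rule-based method, mainly based on recognizing the minimum bounding convex hull and the minimum bounding rectangle of connected regions.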

The grasp detecting algorithm will calculate out whether it is graspable, the grasp center and the grasp orientation.

The grasp detection algorithm computes whether a region is graspable, the grasp center, and the grasp orientation.

Yuan et al. learn the nonprehensile rearrangement based on deep Q-learning [1], pushing an object to the predefined goal pose in an environment with obstacles.

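Yuan et al. learn non-prehensile rearrangement with deep Q-learning [1], pushing an object to a predefined goal pose in an environment with obstacles.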

Nair et al. utilize variational auto-encoder to encode the input image [15], calculate the reward based on the Euclidean distance of encoded vector, and verify this algorithm in the experiment of reaching and pushing.

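Nair et al. encode the input image with a variational auto-encoder [15], compute the reward from the Euclidean distance between encoded vectors, and verify the algorithm in reaching and pushing experiments.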

The large-scale exploration space and delayed reward makes it hard to get training data of high quality, and thus lots of time is needed to collect data.

Large exploration space + delayed reward → high-quality training data is hard to obtain → a lot of time is needed to collect data.

In [20] they achieve pixel-wise grasp rectangle detection by using the fully convolutional network like U-net to predict rectangle for every pixel. Without fully connected layers, their network is significantly smaller than other networks.

Detection is done with a fully convolutional network. Because it has no fully connected layers, the network is much smaller.

In the face of cleaning clustered objects that needs to combine pushing and grasping, we are inspired by the algorithm that maps the image to the high-level actions instead of continuous actions of low level based on the mapping relation between image and workspace [9] [22].

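For clearing cluttered objects, which requires combining pushing and grasping, we are inspired by algorithms that map the image to high-level actions, using the mapping between image and workspace, rather than to low-level continuous actions.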

Pushing and Grasping

Pushing

We employ the Twined Delayed DDPG to learn the policy, which consists of one policy network, double critic networks and their own target networks.

The policy function is:
$$a_t = \pi_{\phi}(s_t)$$
The loss of each critic network $Q_{\theta_i}$ is:
$$loss = \left(R(s_t, s_{t+1}) + \gamma \max_{i=1,2} Q_{\theta_i'}(s_{t+1}, a') - Q_{\theta_i}(s_t, a_t)\right)^2$$
Here $\theta_i'$ are the target critic parameters and $a'$ is the action given by the target policy at $s_{t+1}$. Note that the original TD3 algorithm takes the minimum, not the maximum, over the two target critics.
The policy update is:
$$\nabla_{\phi} J(\phi) = \nabla_{a} Q_{\theta_1}(s, a)\big|_{a = \pi_{\phi}(s)} \nabla_{\phi} \pi_{\phi}(s)$$
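
As a rough illustration of the critic loss above, the sketch below computes the bootstrapped target and the squared error for a single transition in plain Python. The function names and the numbers are made up for illustration and are not from the paper. Because the notes write max while the original TD3 algorithm uses min over the two target critics, the combining function is left as a parameter.

```python
def critic_target(reward, gamma, q1_next, q2_next, combine=min):
    """Bootstrapped target: r + gamma * combine(Q1'(s', a'), Q2'(s', a')).

    combine=min is the clipped double-Q rule of the original TD3;
    pass combine=max to follow the formula as written in these notes.
    """
    return reward + gamma * combine(q1_next, q2_next)


def critic_loss(q_pred, target):
    """Squared TD error for one critic on one transition."""
    return (target - q_pred) ** 2


# Illustrative values only (not taken from the paper).
y_min = critic_target(reward=0.5, gamma=0.5, q1_next=0.8, q2_next=0.6)
y_max = critic_target(reward=0.5, gamma=0.5, q1_next=0.8, q2_next=0.6, combine=max)
loss = critic_loss(q_pred=0.7, target=y_min)
print(y_min, y_max, loss)
```

In a real implementation the same target would be shared by both critics and averaged over a mini-batch.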
In this work, only depth image is used as the state that is captured by the camera over the table. The pixel plane is parallel to table surface so that pixel coordinate and table planimetric position are linearly proportional.

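The depth image captured by the camera is the only state input to the network.

The image plane is parallel to the table surface, so pixel coordinates and table positions are linearly proportional.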
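
Because the mapping is linear, converting a pixel to a table position is a single scale and offset. A minimal sketch, assuming a square image that covers the table exactly and a table-centered origin; the 0.8 m table length is a hypothetical value, and 84 is the input size listed in the experiment settings.

```python
def pixel_to_table(u, v, image_size=84, table_length=0.8):
    """Map pixel (u, v) to table coordinates (x, y) in metres.

    Assumptions (not from the paper): the image covers the whole square
    table, the origin is the table centre, and table_length is hypothetical.
    """
    scale = table_length / image_size          # metres per pixel
    x = (u + 0.5 - image_size / 2) * scale     # +0.5 targets the pixel centre
    y = (v + 0.5 - image_size / 2) * scale
    return x, y


print(pixel_to_table(0, 0))
print(pixel_to_table(83, 83))
```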

The policy network outputs action with four dimensions (a1, a2, a3, a4) and each dimension is limited to (−1, 1). They present x and y coordinate of the table surface, which side pushing to and the pushing angle, respectively.

Specifically, (a1, a2) decides the position where to start pushing, and (a3, a4) decides the pushing orientation.

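The action has four dimensions. The first two are table coordinates that give where the push starts.

The third dimension selects which side to push toward and the fourth is the push angle, so the last two together determine the push direction.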

To avoid pushing objects out of table, we limit the length of area that can start push to 0.6 times the length of table surface.

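To avoid pushing objects off the table, the area in which a push may start is limited to 0.6 times the length of the table surface.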
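
The sketch below shows one plausible way to decode the four-dimensional action into a start point and a push direction. The paper excerpt does not give the exact encoding, so the reading used here is an assumption: the sign of a3 selects the half-plane, and a4 is scaled to an angle in (−90°, 90°) within it. The table length is also hypothetical.

```python
import math


def decode_push_action(action, table_length=0.8, start_scale=0.6):
    """Decode (a1, a2, a3, a4), each in (-1, 1), into a push command.

    Assumed encoding (not confirmed by the paper):
      - (a1, a2): start point inside a centred square whose side is
        start_scale * table_length;
      - a3: its sign picks which side to push toward;
      - a4: scaled to an angle in (-pi/2, pi/2) within that side.
    """
    a1, a2, a3, a4 = action
    half_extent = 0.5 * table_length * start_scale
    start = (a1 * half_extent, a2 * half_extent)
    theta = a4 * math.pi / 2
    if a3 < 0:                       # flip to the opposite half-plane
        theta += math.pi
    direction = (math.cos(theta), math.sin(theta))
    return start, direction


start, direction = decode_push_action((1.0, -1.0, 0.3, 0.0))
print(start, direction)
```

Splitting the direction into a discrete side and a bounded angle avoids the many-to-one angle representation that the notes describe as hard for reinforcement learning to master.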

Although cosine-sine encoder is widely used in supervised learning [20] to represent the angle at the circumference, we found it hard to master the many-to-one mapping for reinforcement learning in the absence of direct oversight of the target.

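The cosine-sine encoding is widely used in supervised learning, but in reinforcement learning, where the target is not directly supervised, the many-to-one mapping is hard to learn.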

The robot end-effector reaches the position that is 30cm over the pushing start point decided by (a1, a2). The robot end-effector moves straight down until it contacts with objects or it is 1.5cm above the table surface. The robot end-effector pushes a constant distance in a given orientation decided by (a3, a4).

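The end-effector first moves to 30 cm above the push start point given by (a1, a2), then moves straight down until it touches an object or is 1.5 cm above the table surface, and finally pushes a constant distance in the direction given by (a3, a4).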
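
The three steps of the push primitive can be written as a short list of waypoints. This is only a sketch: heights are measured from the table surface, the 0.10 m push distance is a hypothetical value (the notes only say it is constant), and `contact_height` stands in for whatever contact detection the simulator provides.

```python
def push_waypoints(start_xy, direction, contact_height=None,
                   hover=0.30, min_clearance=0.015, push_distance=0.10):
    """Return the (x, y, z) waypoints of one push, z measured from the table.

    hover and min_clearance follow the notes (30 cm and 1.5 cm);
    push_distance and contact_height are hypothetical parameters.
    """
    x, y = start_xy
    dx, dy = direction
    # Stop descending at first contact, but never below the clearance.
    if contact_height is None:
        z_push = min_clearance
    else:
        z_push = max(min_clearance, contact_height)
    return [
        (x, y, hover),                                            # 1. hover above start
        (x, y, z_push),                                           # 2. move straight down
        (x + push_distance * dx, y + push_distance * dy, z_push), # 3. push
    ]


for waypoint in push_waypoints((0.0, 0.0), (1.0, 0.0)):
    print(waypoint)
```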

If a grasp can be performed after the push action, the reward R (st, st+1) = 1.

If the push action results in enough change of the clustered object positions which can be judged by calculating the difference between depth images before and after pushing, the reward R (st, st+1) = 0.5.

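If a grasp becomes possible after the push, the reward is 1. If the push only changes the object positions enough, judged from the difference between the depth images before and after the push, the reward is 0.5.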
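
A minimal sketch of this reward, assuming depth images given as nested lists. The two thresholds and the reward of 0 for an ineffective push are assumptions, since the excerpt does not state them.

```python
def push_reward(depth_before, depth_after, grasp_available,
                pixel_delta=0.01, changed_fraction=0.05):
    """Reward for one push.

    1.0 if a grasp is detected after the push, 0.5 if enough pixels changed,
    otherwise 0.0. pixel_delta, changed_fraction and the 0.0 fallback are
    hypothetical choices, not values from the paper.
    """
    if grasp_available:
        return 1.0
    total = changed = 0
    for row_before, row_after in zip(depth_before, depth_after):
        for before, after in zip(row_before, row_after):
            total += 1
            if abs(after - before) > pixel_delta:
                changed += 1
    if total and changed / total >= changed_fraction:
        return 0.5
    return 0.0


before = [[0.0] * 4 for _ in range(4)]
after = [row[:] for row in before]
after[1][2] = 0.2                      # one of 16 pixels changed
print(push_reward(before, after, grasp_available=False))
```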

Both the policy network and critic network have the same convolutional layers to extract image feature.

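The policy network and the critic networks use the same convolutional layers.

The SELU activation works better, and batch normalization is used to keep gradient backpropagation balanced.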

Grasping

Grasp rectangle g:
$$g = \{x, y, \theta, h, w\}$$
(x, y) is the grasp center, θ is the gripper angle, w is the gripper width, and h is the gripper thickness.

We start by making a binary image to separate the objects in the picture from the background based on depth image. Due to the ideal simulation environment, the pixel intensity of objects in an image is always greater than that of desktop background. It is simple to make binary processing with a fixed threshold. Then, we detect a grasp configuration for every connected region in binary image and make up a grasp list. Every element in this grasp list is a grasp configuration (x, y, θ, w).
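
The first two steps, fixed-threshold binarization and connected-region extraction, can be sketched in plain Python as below. The 4-connectivity and the centroid as a stand-in grasp center are assumptions. The orientation and width, which the paper derives from the minimum bounding rectangle, are not computed here; in practice a library such as OpenCV (`cv2.connectedComponents`, `cv2.convexHull`, `cv2.minAreaRect`) would typically do this work.

```python
from collections import deque


def connected_regions(depth, threshold):
    """Label 4-connected regions of pixels whose value exceeds threshold.

    Returns a list of regions, each a list of (row, col) pixels.
    """
    height, width = len(depth), len(depth[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if seen[r][c] or depth[r][c] <= threshold:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:                       # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and depth[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions


def centroid(pixels):
    """Centroid (x, y) of a region, used here as a stand-in grasp centre."""
    n = len(pixels)
    return (sum(x for _, x in pixels) / n, sum(y for y, _ in pixels) / n)


depth = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
regions = connected_regions(depth, threshold=0.5)
print([centroid(region) for region in regions])
```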

The original method targets a single object. When there are many objects, the gripper cannot reach in between them.

So each grasp has to be checked for validity, where τ is a hyperparameter:
$$isvalid = \begin{cases} \text{True}, & I(\text{center point}) < I(\text{end point}) - \tau \\ \text{False}, & \text{otherwise} \end{cases}$$
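
The validity check translates directly into a comparison of depth values. The sketch below follows the inequality exactly as written in the notes. Whether "smaller" means closer to or farther from the camera depends on the depth convention, which the excerpt does not settle, and checking both gripper end points is an added assumption.

```python
def is_valid_grasp(depth, center, end_points, tau):
    """Apply I(center) < I(end) - tau to every gripper end point.

    depth is indexed as depth[row][col]; center and end_points are
    (row, col) tuples. Requiring the test to hold for all end points
    is an assumption, since the notes state it for a single end point.
    """
    cy, cx = center
    return all(depth[cy][cx] < depth[ey][ex] - tau for ey, ex in end_points)


row = [[0.5, 0.2, 0.5]]
print(is_valid_grasp(row, center=(0, 1), end_points=[(0, 0), (0, 2)], tau=0.1))
```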

Experiment

After a grasp or a push, the robot arm is reset to a position out of camera field. Then, the camera capture an image for the next detection.

After a grasp or a push, the arm moves out of the camera's field of view so that the depth camera can capture the next image.

We perform the experiment in a simulation environment called MuJoCo. The module is built with a toolkit called robosuite [24], which contains a modularized design of APIs for building new environments.

MuJoCo is the simulator; robosuite is used to build the environment.

| Item | Value |
| --- | --- |
| Input | 84 × 84 pixels |
| Termination conditions | 1. All objects on the table have been taken away; 2. the push action has been performed 15 times |
| CPU | Intel Core i7-8700 |
| GPU | NVIDIA 2080Ti |
| Optimizer | Adam |
| Learning rate | 0.0003 |
| Batch size | 128 |
| Target network update delay | 0.01 |
| Noise | Gaussian noise (not applied to a3) |

Grasp algorithm verification

Therefore, we reset the environment if no grasp is detected in the condition of multiple objects.

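If no grasp is detected, the environment is reset.

The original algorithm works very well on single objects but struggles with multiple objects.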

The main reason for grasp failure presently is that two objects next to each other are recognized as one object, and the grasp center is on where they connect.

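Cause of grasp failure: two objects are so close together that they are recognized as one object.

Solution: once the two objects are separated, the grasp succeeds.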

Reinforcement Learning Training

Therefore, we evaluate pushing performance for 50 episodes after 300 episodes of training.

Pushing performance is evaluated over 50 episodes after 300 episodes of training.

And we can see that the discount factor γ has a great impact on pushing performance.

The discount factor γ has a significant effect on pushing performance.

In this paper γ is around 0.5.

In this kind of task, the pushing action should have a positive immediate effect to help grasp, and the relationship between two pushes is little. Therefore, small γ reduces the consideration of future state and have greater performance.

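Why γ should be small: this task depends mostly on immediate feedback, so a more short-sighted agent does better. Consecutive pushes are only weakly related, so long-term return matters little.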
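
A small worked example makes the effect of γ concrete. With γ = 0.5 a reward three steps ahead is weighted by 0.5³ = 0.125, while with γ = 0.99 it keeps almost its full weight. The reward sequence below is invented for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a sequence of rewards."""
    return sum(reward * gamma ** k for k, reward in enumerate(rewards))


# Illustrative episode: an effective push, an ineffective push, then a push
# that enables a grasp.
rewards = [0.5, 0.0, 1.0]
short_sighted = discounted_return(rewards, gamma=0.5)
far_sighted = discounted_return(rewards, gamma=0.99)
print(short_sighted, far_sighted)
```

With the small γ, most of the return comes from the immediate reward, which matches the argument that each push should help grasping right away.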

From the perspective of input type, depth image is the main factor affecting performance.

Depth images give better performance than RGB images and are the main factor.

The effect is faster convergence.

Clutter clearing

The objects are placed manually and packed tightly against each other.

The main reason for unsuccessful clearing is that double objects stay in a corner, and the robot can’t divide them by pushing because of the limitation of push working range.

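Cause of unsuccessful clearing: because the push working range is limited, two adjacent objects can end up in a corner where the robot cannot separate them.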

Computation time per step:

  • Computing the push action from the image: 1 ms
  • Detecting the best grasp: 5 ms

Conclusion

In the future, we will try to further improve the grasp rate, transfer this framework to real Baxter robot, and test it with more objects of different shapes.

Future work: improve the grasp rate, transfer the framework to a real Baxter robot, and test it with more objects of different shapes.
