Combining Reinforcement Learning and Rule-based Method to Manipulate Objects in Clutter


    • **Abstract**
    • **Introduction**
    • **Pushing and Grasping**
      • **Pushing**
      • **Grasping**
    • **Experiment**
      • **Grasp algorithm verification**
      • **Reinforcement Learning Training**
      • **Clutter clearing**
    • **Conclusion**


To reduce the complexity of strategy learning, we propose a framework, based on deep reinforcement learning and a rule-based method, for robots to pick up objects in clutter on a table.


To manipulate the objects on the table, we divide the robot actions into two categories: pushing, which is learned with reinforcement learning, and grasping, which is inferred by morphological image processing.


The pushing action can separate stacked objects and create a robust grasp point for the subsequent grasp.



Taking images as input, our framework maintains a high grasp rate at low computational cost, which allows it to clear clutter quickly.



Especially for grasping, the scarcity of positive samples and the diversity of objects mean that hundreds of hours of data collection are unavoidable.


This kind of problem is hard to define manually and does not require a very precise solution, which makes it well suited to reinforcement learning.


Compared with their work, we employ a reinforcement learning network with continuous output to remedy this issue.


We find that grasp algorithms based on supervised learning are mostly trained on the Cornell Grasping Dataset or the Jacquard Dataset, whose depth images differ strikingly from the depth images in simulation because of different shooting angles.


We use the twin delayed deep deterministic policy gradient (TD3) [6] to train our policy, which determines where to start pushing and the pushing direction according to the current image.


Grasp detection is handled by a rule-based method built mainly on recognizing the minimum bounding convex hull and the minimum bounding rectangle of connected regions.


The grasp detection algorithm computes whether an object is graspable, the grasp center, and the grasp orientation.




Yuan et al. learn nonprehensile rearrangement based on deep Q-learning [1], pushing an object to a predefined goal pose in an environment with obstacles.


Nair et al. use a variational auto-encoder to encode the input image [15], calculate the reward from the Euclidean distance between encoded vectors, and verify the algorithm in reaching and pushing experiments.


The large exploration space and delayed rewards make it hard to obtain high-quality training data, so a lot of time is needed to collect data.

In [20], pixel-wise grasp rectangle detection is achieved by using a fully convolutional network similar to U-Net to predict a rectangle for every pixel. Without fully connected layers, the network is significantly smaller than other networks.


For clearing cluttered objects, which requires combining pushing and grasping, we are inspired by algorithms that map the image to high-level actions, rather than to low-level continuous actions, based on the mapping between image and workspace [9] [22].


Pushing and Grasping


We employ Twin Delayed DDPG (TD3) to learn the policy. It consists of one policy network, two critic networks, and their respective target networks.

The policy network outputs the action:

$$a_{t} = \pi_{\phi}(s_{t})$$

The loss function of the critic network is:

$$loss = \left(R(s_{t},s_{t+1})+\gamma\max\limits_{i=1,2}Q_{\theta_{i}'}(s_{t+1},a')-Q_{\theta_{i}}(s_{t},a_{t})\right)^2$$

The policy is updated with the gradient:

$$\nabla_{\phi}J(\phi)=\nabla_{a}Q_{\theta_{1}}(s_{t},a)\big|_{a = \pi_{\phi}(s_{t})}\nabla_{\phi}\pi_{\phi}(s_{t})$$

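The critic update can be illustrated for a single transition. This is a minimal sketch, not the authors' implementation: the function name, the `done` flag, and the scalar inputs are assumptions made for illustration. Note also that the standard TD3 formulation takes the minimum over the two target critics (clipped double Q-learning), and the sketch follows that standard form, whereas the loss above is written with a max.

```python
def td3_critic_loss(reward, gamma, q_targets_next, q_current, done=False):
    """Squared TD error of one critic for a single transition.

    reward         : R(s_t, s_{t+1})
    gamma          : discount factor
    q_targets_next : pair of target-critic values Q'(s_{t+1}, a')
    q_current      : Q_theta_i(s_t, a_t) of the critic being trained
    done           : assumption: no bootstrapping on terminal transitions
    """
    # Standard TD3 uses the smaller of the two target estimates.
    bootstrap = 0.0 if done else gamma * min(q_targets_next)
    target = reward + bootstrap
    return (target - q_current) ** 2


if __name__ == "__main__":
    # Illustrative numbers only.
    print(td3_critic_loss(0.5, 0.5, (1.0, 2.0), 0.5))
```

In a full implementation the same target would be shared by both critics and averaged over a batch.
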
In this work, only the depth image, captured by a camera above the table, is used as the state. The image plane is parallel to the table surface, so pixel coordinates and planar positions on the table are linearly proportional.
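Because the mapping is linear, converting a pixel to a table position is a single scale and offset. The sketch below assumes that the 84 × 84 image covers exactly a square table and that the origin is at the table center; the function name and the table length of 0.8 m are hypothetical values chosen for illustration.

```python
def pixel_to_table(u, v, image_size=84, table_length=0.8):
    """Map pixel (u, v) to table coordinates (x, y) in meters.

    Assumptions: the image covers exactly the square table, and the
    origin is at the table center. table_length is a made-up value.
    """
    scale = table_length / image_size        # meters per pixel
    x = (u + 0.5 - image_size / 2) * scale   # +0.5 targets the pixel center
    y = (v + 0.5 - image_size / 2) * scale
    return x, y


if __name__ == "__main__":
    print(pixel_to_table(0, 0))
    print(pixel_to_table(83, 83))
```
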



The policy network outputs a four-dimensional action (a1, a2, a3, a4), with each dimension limited to (−1, 1). The dimensions represent the x and y coordinates on the table surface, the side to push toward, and the pushing angle, respectively.

Specifically, (a1, a2) determines the position where the push starts, and (a3, a4) determines the pushing orientation.



To avoid pushing objects off the table, we limit the length of the area in which a push can start to 0.6 times the length of the table surface.
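The restriction amounts to scaling the normalized action before converting it to a table position. A minimal sketch, assuming a square table with the origin at its center; the function name and the table length of 0.8 m are hypothetical, while the ratio 0.6 comes from the text.

```python
def push_start_from_action(a1, a2, table_length=0.8, start_ratio=0.6):
    """Convert (a1, a2) in (-1, 1) to a push start point in meters.

    Assumptions: square table, origin at its center, made-up table_length.
    The allowed start area has side length start_ratio * table_length.
    """
    half_range = 0.5 * start_ratio * table_length
    return a1 * half_range, a2 * half_range


if __name__ == "__main__":
    print(push_start_from_action(1.0, -1.0))
```

With these example values, the extreme actions map to points 0.24 m from the center, leaving a margin to the table edge at 0.4 m.
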


Although the cosine-sine encoding is widely used in supervised learning [20] to represent an angle on the circle, we found that this many-to-one mapping is hard to learn with reinforcement learning, where there is no direct supervision of the target.


First, the robot end-effector moves to a position 30 cm above the push start point determined by (a1, a2). It then moves straight down until it contacts an object or is 1.5 cm above the table surface. Finally, it pushes a constant distance in the orientation determined by (a3, a4).
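The push primitive can be sketched as three waypoints. This is an illustration under stated assumptions: the push angle is taken as an already decoded input (the decoding of (a3, a4) is not specified here), the push distance of 0.10 m is a hypothetical value, and the early stop on contact during the descent is omitted. The heights of 0.30 m and 0.015 m come from the text.

```python
import math


def push_waypoints(start_xy, angle, table_z=0.0, hover=0.30,
                   clearance=0.015, push_distance=0.10):
    """Return the three end-effector waypoints (x, y, z) of one push.

    Assumptions: angle is in radians in the table plane, push_distance
    is made up, and contact detection during the descent is ignored.
    """
    x, y = start_xy
    above = (x, y, table_z + hover)          # 30 cm above the start point
    lowered = (x, y, table_z + clearance)    # 1.5 cm above the table
    end = (x + push_distance * math.cos(angle),
           y + push_distance * math.sin(angle),
           table_z + clearance)              # straight push at constant height
    return [above, lowered, end]


if __name__ == "__main__":
    for waypoint in push_waypoints((0.1, 0.2), 0.0):
        print(waypoint)
```
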


If a grasp can be performed after the push action, the reward is R(s_t, s_{t+1}) = 1.

If the push action changes the positions of the cluttered objects sufficiently, as judged by the difference between the depth images before and after pushing, the reward is R(s_t, s_{t+1}) = 0.5.
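The two reward cases can be written as a small function. This is a sketch with several assumptions: the per-pixel change threshold, the minimum number of changed pixels, and the reward of 0 in all other cases are not given in the text and are chosen only for illustration. Depth images are plain nested lists here.

```python
def push_reward(grasp_available, depth_before, depth_after,
                change_threshold=0.01, min_changed_pixels=20):
    """Reward for one push, following the two cases in the text.

    Assumptions: both thresholds are made-up values, and the reward
    is 0.0 when neither case applies.
    """
    if grasp_available:
        return 1.0
    # Count pixels whose depth changed noticeably between the two images.
    changed = sum(
        1
        for row_before, row_after in zip(depth_before, depth_after)
        for before, after in zip(row_before, row_after)
        if abs(after - before) > change_threshold
    )
    if changed >= min_changed_pixels:
        return 0.5
    return 0.0


if __name__ == "__main__":
    flat = [[0.0] * 10 for _ in range(10)]
    moved = [[0.05] * 10 for _ in range(10)]
    print(push_reward(False, flat, moved))
```
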


The policy network and the critic networks use the same convolutional layers to extract image features.




The grasp rectangle g is defined as:

$$g = \{x, y, \theta, h, w\}$$

We start by creating a binary image from the depth image to separate the objects from the background. Because the simulation environment is ideal, the pixel intensity of objects is always greater than that of the tabletop background, so binarization with a fixed threshold is straightforward. Then, we detect a grasp configuration for every connected region in the binary image and build a grasp list. Every element of this list is a grasp configuration (x, y, θ, w).
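The first two steps, fixed-threshold binarization and splitting the result into connected regions, can be sketched in plain Python. This is an illustration only: the function names are hypothetical, 4-connectivity is an assumption, and the region centroid stands in for the grasp center. Fitting the minimum bounding rectangle to obtain θ and w is not shown.

```python
from collections import deque


def binarize(depth, threshold):
    """Objects have greater intensity than the background, so 1 marks an object."""
    return [[1 if value > threshold else 0 for value in row] for row in depth]


def connected_regions(binary):
    """Return regions as lists of (row, col) pixels, using 4-connectivity."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited object pixel.
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                cr, cc = queue.popleft()
                region.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and binary[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append(region)
    return regions


def region_center(region):
    """Centroid of a region, used here as a stand-in for the grasp center."""
    count = len(region)
    return (sum(p[0] for p in region) / count, sum(p[1] for p in region) / count)


if __name__ == "__main__":
    depth = [[0, 0, 0, 0, 0],
             [0, 5, 5, 0, 0],
             [0, 5, 5, 0, 7],
             [0, 0, 0, 0, 7]]
    for region in connected_regions(binarize(depth, 1)):
        print(len(region), region_center(region))
```
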


$$isvalid=\begin{cases}True, & I(center\text{-}point)<I(end\text{-}point)-\tau \\ False, & \text{otherwise} \end{cases}$$
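The validity check compares image intensities at the grasp center and at the end points. The sketch below follows the inequality exactly as written above; requiring it to hold for every end point, the (row, col) indexing, and the function name are assumptions for illustration.

```python
def is_valid_grasp(image, center_point, end_points, tau):
    """Apply I(center-point) < I(end-point) - tau to each end point.

    Assumptions: points are (row, col) indices into a nested-list image,
    and the grasp is valid only if the inequality holds for all end points.
    """
    center_intensity = image[center_point[0]][center_point[1]]
    return all(center_intensity < image[r][c] - tau for r, c in end_points)


if __name__ == "__main__":
    image = [[5, 5, 5],
             [5, 1, 5]]
    print(is_valid_grasp(image, (1, 1), [(1, 0), (1, 2)], tau=2))
```
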


After a grasp or a push, the robot arm is reset to a position outside the camera's field of view. Then, the camera captures an image for the next detection.


We perform the experiment in the MuJoCo simulation environment. The environment is built with the robosuite toolkit [24], which provides modular APIs for building new environments.


| Item | Value |
| --- | --- |
| Input | 84 × 84 pixels |
| Termination Conditions | 1. all the objects on the table have been taken away; 2. the push action has been performed 15 times |
| CPU | Intel Core i7-8700 |
| Optimizer | Adam |
| Learning Rate | 0.0003 |
| Batch Size | 128 |
| Target Network Update Delay | 0.01 |
| Noise | Gaussian noise (not applied to a3) |
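The noise setting in the table can be sketched as follows: Gaussian noise is added to every action dimension except a3, and the result is clipped back to the action range. The standard deviation of 0.1 and the function name are assumptions, since the table does not give the noise scale.

```python
import random


def add_exploration_noise(action, sigma=0.1, rng=random):
    """Add Gaussian noise to (a1, a2, a4) and leave a3 untouched.

    Assumption: sigma is a made-up value. Results are clipped to [-1, 1].
    """
    noisy = []
    for index, value in enumerate(action):
        if index == 2:
            noisy.append(value)  # a3 receives no noise
        else:
            perturbed = value + rng.gauss(0.0, sigma)
            noisy.append(max(-1.0, min(1.0, perturbed)))
    return noisy


if __name__ == "__main__":
    random.seed(0)
    print(add_exploration_noise([0.2, -0.5, 1.0, 0.9]))
```
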

Grasp algorithm verification

Therefore, we reset the environment if no grasp is detected when multiple objects are present.



Currently, the main cause of grasp failure is that two adjacent objects are recognized as one object, so the grasp center falls where they touch.



Reinforcement Learning Training

Therefore, we evaluate pushing performance for 50 episodes after 300 episodes of training.


We can see that the discount factor γ has a great impact on pushing performance.



In this kind of task, a push should have an immediate positive effect that helps grasping, and successive pushes are only weakly related. Therefore, a small γ reduces the weight given to future states and yields better performance.
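The effect of γ can be seen from the discounted return, in which a reward received k steps later is weighted by γ^k. The sketch below uses illustrative reward sequences and γ values that are not taken from the experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a reward sequence."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total


if __name__ == "__main__":
    delayed = [0.0, 0.0, 1.0]    # reward arrives two pushes later
    immediate = [1.0, 0.0, 0.0]  # reward arrives right away
    for gamma in (0.9, 0.3):
        print(gamma, discounted_return(delayed, gamma),
              discounted_return(immediate, gamma))
```

With the smaller γ, the delayed reward contributes far less than the immediate one, so the policy is driven mainly by whether the current push enables a grasp.
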


In terms of input type, the depth image is the main factor affecting performance.


The effect is that convergence is accelerated.

Clutter clearing


The main reason for unsuccessful clearing is that two objects remain in a corner, and the robot cannot separate them by pushing because of the limited push working range.


| Step | Time |
| --- | --- |
| Compute the push action from the image | 1 ms |
| Detect the best grasp | 5 ms |


In the future, we will try to further improve the grasp rate, transfer this framework to a real Baxter robot, and test it with more objects of different shapes.


