In-Hand Manipulation

Improving exploration in In-Hand Manipulation environments in Reinforcement Learning

One of the greatest challenges in Reinforcement Learning (RL) is achieving sufficient exploration. In theory, Q-Learning can explore an environment at random until it converges on an optimal policy. With Deep Q-Learning, however, exploration is not only expensive; if a problem is complex enough, no feasible amount of exploration will bring an RL algorithm to an optimal solution. One method of assisting exploration is to use a suboptimal expert to reach higher-reward states earlier in training. When training with an off-policy RL algorithm such as SAC or TD3, we can sample actions not only from the policy, as the algorithm does by default, but also from our controller. This can speed up exploration in environments with large state spaces and lets us arrive at optimal policies in much less time.
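
Concretely, this amounts to a small change in the data-collection loop: with some probability the environment is stepped with the expert's action rather than the policy's, and the transition is stored in the replay buffer either way, which is exactly the setting off-policy algorithms are built for. The sketch below is illustrative only; `policy`, `expert_controller`, `replay_buffer`, and `env` are hypothetical stand-ins (a classic Gym-style `step` API is assumed), and the fixed mixing probability could also be annealed over training.

```python
import numpy as np

def collect_step(env, obs, policy, expert_controller, replay_buffer,
                 expert_prob=0.3):
    """With probability `expert_prob`, act with the suboptimal expert;
    otherwise act with the current policy. Either way, the transition is
    stored in the replay buffer, so off-policy learning proceeds as usual."""
    if np.random.rand() < expert_prob:
        action = expert_controller(obs)     # suboptimal analytical expert
    else:
        action = policy.sample_action(obs)  # normal exploratory policy action

    next_obs, reward, done, info = env.step(action)
    replay_buffer.add(obs, action, reward, next_obs, done)
    return next_obs, done
```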

Before testing this method on In-Hand Manipulation, it was first tested on a more basic reaching task. The reaching problem consists of a 2D, 3-link robotic arm tasked with bringing its end-effector to a randomly chosen goal position within the arm's reach. Under a sparse reward function, this problem can become extremely difficult for RL to solve, because the algorithm may explore for a huge number of episodes without ever seeing a state that returns a positive reward. An analytical controller can be written by solving for the torques at each joint needed to produce an end-effector force in the direction of the goal. When this controller is used to sample actions during training, we converge on an optimal policy within a few million timesteps, which the baseline algorithm is not capable of at all.

An animation of the 3-link arm environment
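
One standard way to realize such a controller, sketched below, is the Jacobian-transpose method: compute a virtual end-effector force pointing toward the goal and map it to joint torques via tau = J^T f. This is a sketch under stated assumptions, not necessarily the exact controller used here; the link lengths, gain, and torque limit are illustrative placeholders.

```python
import numpy as np

LINK_LENGTHS = np.array([1.0, 1.0, 1.0])  # placeholder link lengths

def forward_kinematics(q, lengths=LINK_LENGTHS):
    """End-effector (x, y) position of the planar 3-link arm."""
    angles = np.cumsum(q)  # q1, q1+q2, q1+q2+q3
    return np.array([np.sum(lengths * np.cos(angles)),
                     np.sum(lengths * np.sin(angles))])

def jacobian(q, lengths=LINK_LENGTHS):
    """2x3 Jacobian of the end-effector position w.r.t. the joint angles."""
    angles = np.cumsum(q)
    J = np.zeros((2, 3))
    for i in range(3):
        # Joint i moves every link from i onward.
        J[0, i] = -np.sum(lengths[i:] * np.sin(angles[i:]))
        J[1, i] = np.sum(lengths[i:] * np.cos(angles[i:]))
    return J

def reaching_expert(q, goal, gain=5.0, torque_limit=1.0):
    """Joint torques that push the end-effector toward the goal."""
    force = gain * (goal - forward_kinematics(q))  # desired end-effector force
    tau = jacobian(q).T @ force                    # map force to joint torques
    return np.clip(tau, -torque_limit, torque_limit)
```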

For the purpose of this experiment, finger gaiting is defined as the repetitive process of detaching fingers and repositioning them around an object so that the object can be continuously rotated in the hand. In the field of In-Hand Manipulation, finger gaiting is an extremely difficult task that can take tens of millions of timesteps to learn, if it can be learned at all. Finger gaiting can be split into two sub-actions: turning and switching. Switching is the act of detaching a finger from the object and reattaching it in a new location; turning is the act of reorienting the fingertips in order to rotate the object. As discussed above, we can design an analytical controller to be sampled during training. With this controller in place, training on finger gaiting with complex-shaped objects such as cubes improves over the baseline algorithms. We hope to publish our work on this later this year.
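
The analytical finger-gaiting controller itself is not described in detail here, but the turn/switch decomposition suggests one natural structure: a small state machine that keeps turning until some finger needs to regrasp, relocates that finger, and then resumes turning. The sketch below is purely illustrative; the helper functions are hypothetical placeholders (returning dummy values), not the actual primitives used in this work.

```python
import numpy as np

NUM_FINGERS = 3
ACTION_DIM = 12  # placeholder action dimension

def finger_needs_regrasp(obs, finger):
    """Placeholder: decide whether `finger` has reached its workspace limit."""
    return False

def turn_action(obs):
    """Placeholder: fingertip motions that rotate the object in place."""
    return np.zeros(ACTION_DIM)

def switch_action(obs, finger):
    """Placeholder: lift `finger`, move it to a new grasp point, reattach.
    Returns (action, finished_switching)."""
    return np.zeros(ACTION_DIM), True

class FingerGaitingExpert:
    """Alternates between a turning phase and a switching phase."""

    def __init__(self):
        self.switching_finger = None  # finger currently being relocated

    def __call__(self, obs):
        # Turning phase: rotate the object until some finger must regrasp.
        if self.switching_finger is None:
            for f in range(NUM_FINGERS):
                if finger_needs_regrasp(obs, f):
                    self.switching_finger = f
                    break
            else:
                return turn_action(obs)
        # Switching phase: relocate the detached finger, then resume turning.
        action, finished = switch_action(obs, self.switching_finger)
        if finished:
            self.switching_finger = None
        return action
```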