# **Introduction**

### **Goal of the paper**

* The goal of this paper is to use an RGB-D image to find the best pose for grasping an object with a parallel-plate gripper.
* The algorithm also aims to provide an open-loop method for manipulating the object using vision data.

### **Previous Research**

* Even state-of-the-art grasp detection algorithms fail under real-world conditions and cannot run in real time.
* A full grasp is described by a 7D representation. In practice, a 5D representation is usually predicted and then projected back into 7D space.
* Previous methods found the 7D pose representation directly from the vision data alone.
* Compared to older computer vision techniques such as sliding-window classifiers, deep learning methods are more robust to occlusion, rotation, and scaling.
* Grasp-point detection achieved high accuracy (> 92%) but was useful only for grasping cloths or towels.

### **Method**

* Grasp detection is generally a computer vision problem.
* The algorithm in the paper uses computer vision to find the grasp as a 5D representation. The 5D representation is cheaper to compute, which makes real-time use feasible.
* General grasp planning algorithms can be divided into three distinct sequential phases:
    1. Grasp detection
    1. Trajectory planning
    1. Grasp execution
* One of the most important tasks in a grasping algorithm is to find the best place to grasp and to map the vision data to coordinates that can be used for manipulation.
* The method uses three neural networks:
    1. A 50-layer deep residual network (ResNet-50) that extracts features from the RGB image. This network is pretrained on the ImageNet dataset.
    1. A second neural network that extracts features from the depth image.
    1. A third network that takes the outputs of the first two networks and gives the final grasp configuration as its output.
* The robot grasp configuration is given as a function of x, y, w, h and theta, where (x, y) is the centre of the grasp rectangle, w and h are its width and height, and theta is its angle.
* Because the networks are very deep (more than 20 layers), residual layers are used. They improve the loss surface of the network and reduce the vanishing-gradient problem.
* The paper gives two types of network for grasp detection:
    1. Uni-Modal Grasp Predictor
        * This model uses only a 2D RGB image. It extracts features from the input image and uses them to predict the best pose.
        * A linear SVM is used as the final classifier to select the best pose for the object.
    1. Multi-Modal Grasp Predictor
        * This model uses both the colour and the depth information in the RGB-D image to predict the grasp.
        * The RGB-D image is decomposed into an RGB image and a depth image.
        * Both images are passed through their networks, and the outputs are then combined and fed to a shallow CNN.
        * The output of the shallow CNN is the best grasp for the object.

### **Experiments and Results**

* The experiments are done on the Cornell Grasp dataset.
* Almost no preprocessing is applied to the images apart from resizing.
* The results of the algorithm are compared to uni-modal methods that use only RGB images.
* A predicted grasp is counted as correct if its angle is within 30 degrees of the ground-truth angle and its Jaccard similarity with the ground-truth rectangle is more than 25%.

### **Conclusion**

* This paper shows that deep convolutional neural networks can be used to predict the grasping pose for an object.
* Another major observation is that deep residual layers extract the features of the object to be grasped from the image more effectively.
* The new model was able to run at real-time speeds.
* The model gave state-of-the-art results on the Cornell Grasp dataset.

----

### **Open research questions**

* Applying transfer learning to deploy the model on real robots.
* Testing the model in industrial environments on objects of different sizes and shapes.
* Formulating the grasping problem as a regression problem.
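
### **Illustrative sketch: grasp rectangle metric**

The validation rule described under Experiments and Results (angle difference below 30 degrees, Jaccard similarity above 25%) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's code: the names `Grasp`, `corners`, `jaccard` and `is_correct` are hypothetical, theta is assumed to be in radians, and the angle difference is assumed to be taken modulo 180 degrees because a grasp rectangle looks the same after a half turn. The overlap of the two rotated rectangles is computed with Sutherland-Hodgman polygon clipping, which is a choice made here for self-containment.

```python
import math
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float]


class Grasp(NamedTuple):
    """5D grasp rectangle: centre (x, y), size (w, h), angle theta in radians."""
    x: float
    y: float
    w: float
    h: float
    theta: float


def corners(g: Grasp) -> List[Point]:
    """Return the four rectangle corners in counter-clockwise order."""
    c, s = math.cos(g.theta), math.sin(g.theta)
    half = [(-g.w / 2, -g.h / 2), (g.w / 2, -g.h / 2),
            (g.w / 2, g.h / 2), (-g.w / 2, g.h / 2)]
    return [(g.x + dx * c - dy * s, g.y + dx * s + dy * c) for dx, dy in half]


def area(poly: List[Point]) -> float:
    """Absolute polygon area via the shoelace formula."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def _side(a: Point, b: Point, p: Point) -> float:
    """Positive if p lies to the left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def _intersect(p: Point, q: Point, sp: float, sq: float) -> Point:
    """Point where segment p-q crosses the clip edge, from the two side values."""
    t = sp / (sp - sq)
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))


def clip(subject: List[Point], clipper: List[Point]) -> List[Point]:
    """Sutherland-Hodgman clipping of one convex polygon by another (both CCW)."""
    output = subject
    for i in range(len(clipper)):
        if not output:
            break  # nothing left to clip: the polygons do not overlap
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inp, output = output, []
        for j in range(len(inp)):
            p, q = inp[j - 1], inp[j]
            sp, sq = _side(a, b, p), _side(a, b, q)
            if sq >= 0:
                if sp < 0:
                    output.append(_intersect(p, q, sp, sq))
                output.append(q)
            elif sp >= 0:
                output.append(_intersect(p, q, sp, sq))
    return output


def jaccard(g1: Grasp, g2: Grasp) -> float:
    """Intersection over union of two grasp rectangles."""
    p1, p2 = corners(g1), corners(g2)
    inter = clip(p1, p2)
    inter_area = area(inter) if len(inter) >= 3 else 0.0
    union = area(p1) + area(p2) - inter_area
    return inter_area / union if union > 0 else 0.0


def is_correct(pred: Grasp, truth: Grasp,
               max_angle_deg: float = 30.0, min_jaccard: float = 0.25) -> bool:
    """Apply both thresholds; the angle is compared modulo 180 degrees (assumption)."""
    diff = abs(pred.theta - truth.theta) % math.pi
    diff = min(diff, math.pi - diff)
    return math.degrees(diff) < max_angle_deg and jaccard(pred, truth) > min_jaccard
```

For example, with a ground-truth grasp `Grasp(100, 100, 60, 20, 0.0)`, a prediction shifted 30 pixels along x overlaps half of the rectangle, which gives a Jaccard similarity of 1/3 and passes the check, while the same rectangle rotated by 45 degrees fails on the angle threshold.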