### **Goal of the paper**
* The goal of this paper is to use an RGB-D image to find the best pose for grasping an object with a parallel plate gripper.
* A second goal is to provide an open-loop method for manipulating the object using vision data.
### **Previous Research**
* Even state-of-the-art grasp detection algorithms fail under real-world conditions and cannot run in real time.
* Grasping requires a 7D grasp representation. Usually, though, a 5D grasp representation is used and then projected back into 7D space.
* Previous methods found the 7D pose representation directly from the vision data alone.
* Compared to older computer vision techniques such as sliding-window classifiers, deep learning methods are more robust to occlusion, rotation and scaling.
* Grasp point detection gave high accuracy (> 92%) but was only useful for grasping cloth or towels.
* Grasp detection is generally a computer vision problem.
* The algorithm in the paper uses computer vision to find the grasp as a 5D representation, which is less computationally intensive and can be used in real time.
* General grasp planning algorithms can be divided into three distinct sequential phases:
1. Grasp detection
1. Trajectory planning
1. Grasp execution
* One of the main tasks in grasping algorithms is to find the best place to grasp and to map the vision data to coordinates that can be used for manipulation.
* The method makes use of three neural networks:
1. A 50-layer deep residual network (ResNet-50) that extracts features from the RGB image. This network is pretrained on the ImageNet dataset.
1. Another neural network that extracts features from the depth image.
1. A third network that takes the outputs of the first two and gives the final grasp configuration.
* The robot grasp configuration is given as a function of x, y, w, h and theta, where (x, y) is the centre of the grasp rectangle, w and h are its width and height, and theta is its orientation.
* Since very deep networks are used (more than 20 layers), residual layers are used; they help improve the loss surface of the network and reduce the vanishing gradient problem.
* The paper gives two types of network for grasp detection:
1. Uni-Modal Grasp Predictor
* This uses only a 2D RGB image: features are extracted from the input image and then used to predict the best pose.
* A linear SVM is used as the final classifier to pick the best pose for the object.
1. Multi-Modal Grasp Predictor
* This model uses both the RGB and the depth information of the RGB-D image to extract the grasp.
* The RGB-D image is decomposed into an RGB image and a depth image.
* Both images are passed through their networks and the outputs are combined and fed to a shallow CNN.
* The output of the shallow CNN is the best grasp for the object.
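
The 5D grasp representation (x, y, w, h, theta) described above can be turned into the four corners of the grasp rectangle with a rotation about the centre. The sketch below is only an illustration: the function name and the angle convention (theta measured in degrees from the image x-axis, applied to the w side) are assumptions, not something the paper's summary specifies.

```python
import math


def grasp_rectangle_corners(x, y, w, h, theta_deg):
    """Return the four corners of a grasp rectangle as (x, y) tuples.

    (x, y) is the centre, w and h are the side lengths and theta_deg is the
    rotation of the w side relative to the x-axis (assumed convention).
    """
    theta = math.radians(theta_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    # Start from the axis-aligned rectangle around the origin, rotate each
    # corner by theta, then shift it to the centre (x, y).
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append((x + dx * cos_t - dy * sin_t,
                        y + dx * sin_t + dy * cos_t))
    return corners


flat = grasp_rectangle_corners(x=100.0, y=50.0, w=40.0, h=20.0, theta_deg=0.0)
upright = grasp_rectangle_corners(x=100.0, y=50.0, w=40.0, h=20.0, theta_deg=90.0)
```

Corners in this form are what an overlap-based comparison between a predicted and a labelled grasp would operate on.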
### **Experiments and Results**
* The experiments are done on the Cornell Grasp dataset.
* Minimal preprocessing is done on the images, apart from resizing.
* The results of the algorithm given by this paper are compared to unimodal methods that use only RGB images.
* A predicted grasp is counted as correct if its angle is within 30 degrees of the ground-truth grasp angle and the Jaccard similarity between the predicted and ground-truth rectangles is more than 25%.
* This paper shows that deep convolutional neural networks can be used to predict the grasping pose for an object.
* Another major observation is that the deep residual layers help in better extraction of the features of the grasp object from the image.
* The new model was able to run at real-time speeds.
* The model gave state-of-the-art results on the Cornell Grasp dataset.
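
The validation rule above (angle within 30 degrees, Jaccard similarity above 25%) can be sketched in plain Python. This is a hedged illustration, not the authors' evaluation code: rectangles are passed as lists of corner points, the intersection is computed with Sutherland-Hodgman clipping of convex polygons, and treating angles as equivalent modulo 180 degrees is an assumption made here. All function names are hypothetical.

```python
def signed_area(poly):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def ccw(poly):
    """Return the polygon with counter-clockwise vertex order."""
    return list(poly) if signed_area(poly) > 0 else list(poly)[::-1]


def clip_convex(subject, clip):
    """Intersection of two convex polygons (clip must be counter-clockwise)."""
    def inside(p, a, b):
        # True if p lies on or to the left of the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersection(p, q, a, b):
        # Point where segment p-q crosses the infinite line through a and b.
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / (dx1 * dy2 - dy1 * dx2)
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        current, output = output, []
        for j in range(len(current)):
            p, q = current[j], current[(j + 1) % len(current)]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersection(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersection(p, q, a, b))
        if not output:
            break  # the polygons do not overlap
    return output


def jaccard(rect_a, rect_b):
    """Intersection over union of two convex polygons."""
    inter = abs(signed_area(clip_convex(ccw(rect_a), ccw(rect_b))))
    union = abs(signed_area(rect_a)) + abs(signed_area(rect_b)) - inter
    return inter / union if union > 0 else 0.0


def is_correct_grasp(pred, pred_angle, truth, truth_angle,
                     max_angle_diff=30.0, min_jaccard=0.25):
    """Rectangle metric: small angle difference and enough overlap."""
    diff = abs(pred_angle - truth_angle) % 180.0
    diff = min(diff, 180.0 - diff)  # assumption: angles wrap at 180 degrees
    return diff < max_angle_diff and jaccard(pred, truth) > min_jaccard


square_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
square_b = [(1, 0), (3, 0), (3, 2), (1, 2)]
far_away = [(5, 5), (6, 5), (6, 6), (5, 6)]
overlap = jaccard(square_a, square_b)
```

In the toy data, `square_a` and `square_b` share half of their area, so the pair passes the overlap threshold, while `far_away` does not overlap `square_a` at all.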
### **Open research questions**
* Applying transfer learning concepts to try the model on real robots.
* Trying the model in industrial environments on objects of different sizes and shapes.
* Formulating the grasping problem as a regression problem.
Progressive GAN, high-resolution generator
1. **Goal of the paper**
1. Generate very high quality images by progressively increasing the size of the generator and discriminator.
1. Improved training and stability of GANs.
1. New metric for evaluating GAN results.
1. A high-quality version of the CelebA dataset (CelebA-HQ).
1. **Previous Research**
1. Generative methods help to produce new samples from high-dimensional data distributions such as images.
1. The common approaches for generative methods are:
1. Autoregressive models: produce sharp images but are slow to evaluate, e.g. PixelCNN.
1. Variational autoencoders: easy to train but produce blurry images.
1. Generative adversarial networks (GANs): produce sharp images at small resolutions but are highly unstable to train.
1. **Basic GAN architecture**
1. A GAN consists of two major parts:
1. _Generator_: creates a sample image from a latent code that looks very close to the training images.
1. _Discriminator_: trained to assess how close the sample image looks to the training images.
1. Many methods are used to measure the overlap between the training and the generated distributions, such as Jensen-Shannon divergence, least-squares divergence and Wasserstein distance.
1. Generating at larger resolutions is difficult because it becomes easier to tell the generated images apart from the training images, which amplifies the gradient problem. Larger resolutions also require more memory, which can cause problems.
1. A mechanism is also proposed to stop the generator from participating in the escalation that leads to mode collapse.
1. **Progressive growing of GANs**
1. Training starts from low-resolution images and adds extra layers to the networks at each step of the training process.
1. Training on lower-resolution images is more stable because they carry much less class information; as the resolution increases, smaller details and features are added to the image.
1. This leads to a smooth increase in image quality instead of the network learning a lot of detail in one single step.
1. **Mini-batch separation**
1. GANs tend to capture only a small subset of the variation found in the training images.
1. "Minibatch discrimination" computes feature statistics not only for each individual image but also across the minibatch of images.
1. Higher-resolution images can be generated robustly and efficiently.
1. The quality of the generated images is improved.
1. Training time is reduced for comparable output quality and resolution.
* Gradient Problem: At higher resolutions it becomes easier to tell the differences between the training and the generated images. This is referred to as the gradient problem.
* Mode Collapse: The generator is incapable of creating a large variety of samples and gets stuck.
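
One way to let the discriminator notice the lack of variety described above is to append a batch-level statistic to every sample's features. The sketch below is a simplified, framework-free illustration of that idea on flat feature vectors: it computes the standard deviation of each feature across the minibatch, averages those values into one number and appends it to every vector. The function name and the use of plain lists instead of image feature maps are assumptions made for the example.

```python
import math


def minibatch_stddev_feature(batch):
    """Append the average per-feature standard deviation to every sample.

    batch is a list of equal-length feature vectors, one per image.
    """
    n = len(batch)
    dim = len(batch[0])
    stds = []
    for k in range(dim):
        mean = sum(vec[k] for vec in batch) / n
        var = sum((vec[k] - mean) ** 2 for vec in batch) / n
        stds.append(math.sqrt(var))
    stat = sum(stds) / dim  # one number summarising variation in the batch
    return [list(vec) + [stat] for vec in batch]


collapsed = minibatch_stddev_feature([[1.0, 2.0], [1.0, 2.0]])
varied = minibatch_stddev_feature([[0.0, 0.0], [2.0, 4.0]])
```

A batch of identical samples gets an extra feature of zero, so a collapsed generator becomes easy for the discriminator to detect.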
## **Open research questions**
1. Improved methods for truly photorealistic image generation.
1. Improved semantic sensibility and improved understanding of the dataset.
First published: 2017/10/24

Abstract: Recent research has revealed that the output of Deep Neural Networks (DNN)
can be easily altered by adding relatively small perturbations to the input
vector. In this paper, we analyze an attack in an extremely limited scenario
where only one pixel can be modified. For that we propose a novel method for
generating one-pixel adversarial perturbations based on differential evolution.
It requires less adversarial information and can fool more types of networks.
The results show that 70.97% of the natural images can be perturbed to at least
one target class by modifying just one pixel with 97.47% confidence on average.
Thus, the proposed attack explores a different take on adversarial machine
learning in an extreme limited scenario, showing that current DNNs are also
vulnerable to such low dimension attacks.
One pixel attack, adversarial examples, differential evolution, targeted and non-targeted attack
1. **Introduction**
1. Deep learning methods are better than traditional image processing techniques in most cases in the computer vision domain.
1. "Adversarial examples" are specifically modified images with imperceptible perturbations that are misclassified by the network.
1. **Goals of the paper**
1. Most older techniques make excessive modifications to the images, which may become perceptible to the human eye. The authors of the paper suggest a method to create adversarial examples by changing only one, three or five pixels of the image.
1. Generating examples under constrained conditions can help in _getting insights about the decision boundaries_ in the higher dimensional space.
1. **Previous Work**
1. Methods to create adversarial examples:
1. Gradient-based algorithms using backpropagation for obtaining gradient information
1. "fast gradient sign" algorithm
1. Greedy perturbation searching method
1. Jacobian matrix to build "Adversarial Saliency Map"
1. Understanding and visualizing the decision boundaries of the DNN input space.
1. Concept of "universal perturbations": a perturbation that, when added to any natural image, can generate adversarial samples with high effectiveness.
1. **Advantages of the new types of attack**
1. _Effectiveness_: One-pixel modification with success rates ranging from 60% to 75%.
1. _Semi-black-box attack_: Requires only black-box feedback (probability labels); no gradient or network architecture information is required.
1. _Flexibility_: Can generalize between different types of network architectures.
1. Finding the adversarial example is formulated as an optimization problem with constraints.
1. _Differential evolution_
1. _"Differential evolution"_, a general kind of evolutionary algorithm, is used to solve multimodal optimization problems.
1. Does not make use of gradient information.
1. Advantages of DE for generating adversarial images:
1. _Higher probability of finding the global optima_
1. _Requires less information from the target system_
1. _Simplicity_: Independent of the classifier
1. **Results**
1. The CIFAR-10 dataset was used with three types of network architecture: All Convolutional Network, Network in Network and VGG16. 500 random images were selected to create the perturbations and run both _targeted_ and _non-targeted_ attacks.
1. Adversarial examples were created with only one pixel change in some cases and with 3 and 5 pixel changes in other cases.
1. The attack was generalized over different architectures.
1. Some specific target-pair classes are more vulnerable to attack compared to the others.
1. Some classes are very difficult to perturb to other classes and some cannot be changed at all.
1. Robustness of the class against attack can be broken by using higher dimensional perturbations.
1. Few pixels are enough to fool different types of networks.
1. The properties of the targeted perturbation depend on its decision boundary.
1. The assumption that small additive perturbations on the values of many dimensions accumulate and cause a huge change in the output might not be necessary to explain why natural images are sensitive to small perturbations.
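
The differential-evolution search described above can be sketched on a toy problem. Everything concrete here is an assumption for illustration: `toy_confidence` is a hypothetical stand-in for a real classifier (a weighted sum followed by a sigmoid on a 4x4 grayscale image), a candidate is encoded as `(x, y, value)`, and the population size, number of generations and scale factor `f` are arbitrary choices. Only the model's output confidence is queried, never its gradient.

```python
import math
import random

SIZE = 4  # toy image is SIZE x SIZE, grayscale, values in [0, 1]


def toy_confidence(image):
    """Hypothetical black-box model: confidence assigned to the true class."""
    score = sum(0.1 * (1 + x + y) * image[y][x]
                for y in range(SIZE) for x in range(SIZE))
    return 1.0 / (1.0 + math.exp(-(score - 6.0)))


def apply_pixel(image, candidate):
    """Copy the image and overwrite one pixel with candidate (x, y, value)."""
    x, y, value = candidate
    perturbed = [row[:] for row in image]
    perturbed[int(round(y))][int(round(x))] = value
    return perturbed


def clip(candidate):
    """Keep the pixel position inside the image and the value in [0, 1]."""
    x, y, value = candidate
    return (min(max(x, 0.0), SIZE - 1),
            min(max(y, 0.0), SIZE - 1),
            min(max(value, 0.0), 1.0))


def one_pixel_attack(image, pop_size=20, generations=30, f=0.5, seed=0):
    """Search for the single-pixel change that lowers the confidence most."""
    rng = random.Random(seed)

    def fitness(candidate):
        return toy_confidence(apply_pixel(image, candidate))

    population = [(rng.uniform(0, SIZE - 1), rng.uniform(0, SIZE - 1), rng.random())
                  for _ in range(pop_size)]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three randomly chosen population members.
            r1, r2, r3 = rng.sample(population, 3)
            child = clip(tuple(a + f * (b - c) for a, b, c in zip(r1, r2, r3)))
            # Selection: the child replaces its parent only if it is better.
            if fitness(child) < fitness(population[i]):
                population[i] = child
    return min(population, key=fitness)


image = [[1.0] * SIZE for _ in range(SIZE)]
clean_confidence = toy_confidence(image)
best = one_pixel_attack(image)
adversarial = apply_pixel(image, best)
adversarial_confidence = toy_confidence(adversarial)
```

A targeted variant would instead maximise the confidence of the chosen target class; the loop itself stays the same.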
## **Notes**
* Location of data points near the decision boundaries might affect the robustness against perturbations.
* If the boundary shape is wide enough it is possible to have natural images far away from the boundary such that it is hard to craft adversarial images from it.
* If the boundary shape is mostly long and thin with natural images close to the border, it is easy to craft adversarial images from them but hard to craft adversarial images to them.
* The data points are moved in small steps and the change in the class probabilities is observed.
## **Open research questions**
1. What is the effect of a larger set of initial candidate solutions (training images) on finding the adversarial image?
1. Can better adversarial examples be generated by running more iterations of differential evolution?
1. Why do imbalances occur when creating perturbations?
First published: 2015/03/12

Abstract: Despite significant recent advances in the field of face recognition,
implementing face verification and recognition efficiently at scale presents
serious challenges to current approaches. In this paper we present a system,
called FaceNet, that directly learns a mapping from face images to a compact
Euclidean space where distances directly correspond to a measure of face
similarity. Once this space has been produced, tasks such as face recognition,
verification and clustering can be easily implemented using standard techniques
with FaceNet embeddings as feature vectors.
Our method uses a deep convolutional network trained to directly optimize the
embedding itself, rather than an intermediate bottleneck layer as in previous
deep learning approaches. To train, we use triplets of roughly aligned matching
/ non-matching face patches generated using a novel online triplet mining
method. The benefit of our approach is much greater representational
efficiency: we achieve state-of-the-art face recognition performance using only
128-bytes per face.
On the widely used Labeled Faces in the Wild (LFW) dataset, our system
achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves
95.12%. Our system cuts the error rate in comparison to the best published
result by 30% on both datasets.
We also introduce the concept of harmonic embeddings, and a harmonic triplet
loss, which describe different versions of face embeddings (produced by
different networks) that are compatible to each other and allow for direct
comparison between each other.
Triplet loss, face embedding, harmonic embedding
**Goal of the paper**
A unified system is given for face verification, recognition and clustering.
Each face is represented by a 128-float feature vector (embedding) in Euclidean space that is invariant to pose and illumination.
* Face verification: Images of the same person give feature vectors with a very small L2 distance between them.
* Face recognition: Face recognition becomes a clustering task in the embedding space.
* Previous uses of deep learning relied on a bottleneck layer to represent a face as an embedding with 1000s of dimensions.
* Some other techniques use PCA to reduce the dimensionality of the embedding for comparison.
* This method makes use of an Inception-style CNN to get an embedding of each face.
* The face thumbnails are tight crops of the face area, with only scaling and translation applied.
Triplet loss makes use of two matching face thumbnails and a non-matching thumbnail. The loss function tries to reduce the distance between the matching pair while increasing the separation between the non-matching pair of images.
* Triplets are selected such that the samples are hard positives or hard negatives.
* Selecting the hardest negatives can lead to bad local minima early in training and, in a few cases, to a collapsed model.
* Using semi-hard negatives helps to improve the convergence speed while also getting closer to the global minimum.
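
The triplet loss and the semi-hard condition described above can be written down in a few lines. This is a minimal sketch under stated assumptions: embeddings are plain tuples of floats, distances are squared L2 distances, the margin value of 0.2 is an assumption for the example, and "semi-hard" is taken to mean a negative that is farther from the anchor than the positive but still inside the margin.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin, 0.0)


def is_semi_hard(anchor, positive, negative, margin=0.2):
    """Negative is farther than the positive, but still within the margin."""
    d_pos = squared_l2(anchor, positive)
    d_neg = squared_l2(anchor, negative)
    return d_pos < d_neg < d_pos + margin


anchor = (0.0, 0.0)
positive = (0.1, 0.0)
easy_negative = (1.0, 0.0)       # already far away: contributes no loss
semi_hard_negative = (0.3, 0.0)  # farther than the positive, inside the margin
hard_negative = (0.05, 0.0)      # closer to the anchor than the positive
```

Easy negatives give zero loss and therefore no training signal, which is why the choice of triplets matters so much for convergence.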
**Deep Convolutional Network**
* Training is done using SGD (stochastic gradient descent) with backpropagation and AdaGrad.
* The training is done on two networks:
- Zeiler&Fergus architecture with a model depth of 22 and 140 million parameters.
- GoogLeNet-style Inception model with 6.6 to 7.5 million parameters.
* The following cases are studied:
- Quality of the JPEG image: The validation rate of the model improves with the JPEG quality up to a certain threshold.
- Embedding dimensionality: The validation rate is highest around 128 to 256 dimensions and starts to decrease at 512 dimensions.
- Number of images in the training data set.
**Results: classification accuracy**
- LFW (Labeled Faces in the Wild) dataset: 98.87% ± 0.15 (with a fixed center crop; the 99.63% quoted in the abstract uses extra face alignment)
- YouTube Faces DB: 95.12% ± 0.39
On clustering tasks the model was able to work on a wide variety of face images and is invariant to pose, lighting and also age.
* The model can be extended further to improve the overall accuracy.
* Training networks to run on smaller systems like mobile phones.
* There is need for improving the training efficiency.
* Harmonic embeddings are a set of embeddings that come from different models but are compatible with each other. This makes future upgrades and transitions to a newer model easier.
* To make the embeddings of different models compatible, a harmonic triplet loss is used, and the generated triplets must be compatible with each other.
## Open research questions
* Better understanding of the error cases.
* Making the model more compact for embedded and mobile use cases.
* Methods to reduce the training times.
First published: 2013/12/21

Abstract: Deep neural networks are highly expressive models that have recently achieved
state of the art performance on speech and visual recognition tasks. While
their expressiveness is the reason they succeed, it also causes them to learn
uninterpretable solutions that could have counter-intuitive properties. In this
paper we report two such properties.
First, we find that there is no distinction between individual high level
units and random linear combinations of high level units, according to various
methods of unit analysis. It suggests that it is the space, rather than the
individual units, that contains of the semantic information in the high layers
of neural networks.
Second, we find that deep neural networks learn input-output mappings that
are fairly discontinuous to a significant extend. We can cause the network to
misclassify an image by applying a certain imperceptible perturbation, which is
found by maximizing the network's prediction error. In addition, the specific
nature of these perturbations is not a random artifact of learning: the same
perturbation can cause a different network, that was trained on a different
subset of the dataset, to misclassify the same input.
Adversarial examples, perturbations
* The paper explains two properties of neural networks that cause them to misclassify images and make it difficult to get a solid understanding of the network.
1. Theoretical understanding of the individual high-level units of a network and of combinations of these units or layers.
2. Understanding the continuity of the input-output mapping and the stability of the output with respect to the input.
* Experiments are performed on different networks and architectures:
1. MNIST dataset - Autoencoder, Fully Connected net
2. ImageNet - “AlexNet”
3. 10M YouTube images - “QuocNet”
##### Understanding individual units of the Network
* Previous work used individual images to maximize the activation value of each feature unit. A similar experiment was done by the authors on the MNIST data set.
* The results are interpreted as follows:
1. A random direction vector (v) gives rise to semantic properties that are as interpretable as those of an individual unit.
2. Each feature unit is able to generate invariance on a particular subset of input distribution.
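
The comparison between an individual unit and a random direction can be sketched as a ranking problem: for a given direction in feature space, find the inputs whose feature vectors project most strongly onto it. The snippet below is only an illustration; the feature values are made up, and `top_inputs` is a hypothetical helper, not code from the paper.

```python
import random


def top_inputs(features, direction, k=2):
    """Indices of the k inputs whose features project most strongly onto direction."""
    scores = [sum(f * d for f, d in zip(feat, direction)) for feat in features]
    return sorted(range(len(features)), key=lambda i: scores[i], reverse=True)[:k]


# Made-up activations of a 3-unit layer for four inputs.
features = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.5],
    [0.0, 0.2, 0.9],
]

unit_direction = [1.0, 0.0, 0.0]  # a single unit: one basis vector
rng = random.Random(0)
random_direction = [rng.gauss(0.0, 1.0) for _ in range(3)]

by_unit = top_inputs(features, unit_direction)
by_random = top_inputs(features, random_direction)
```

In the actual experiment, the inputs retrieved for each direction are inspected by eye to see whether they share a semantic property.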
##### Blind spots in the neural network
* Output layers are highly non-linear and are able to give a nonlinear generalization over the input space.
* It is possible for the output layers to give non-significant probabilities to regions of the input space that contain no training examples in their vicinity, i.e. it is possible to obtain a probability for different viewpoints of an object without training on them.
* The smoothness assumption that underlies many kernel methods does not hold for deep neural networks, so smooth decision boundaries can't be assumed.
* Using optimization techniques, small changes to the image can lead to very large deviations in the output.
* __“Adversarial examples”__ represent pockets or holes in the input space which are difficult to find by simply moving around the input images.
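
To make the "small change, large deviation" point concrete, the sketch below swaps the deep network for a toy linear classifier, where the smallest L2 perturbation that crosses the decision boundary has a closed form. This is a deliberate simplification and not the paper's procedure: the weights, bias, input and the `overshoot` parameter are all assumptions, and the constraint that pixel values stay in a valid range is ignored.

```python
def linear_score(w, b, x):
    """Score of a linear classifier; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def minimal_flip(w, b, x, overshoot=0.01):
    """Smallest L2 perturbation that moves x just across the boundary w.x + b = 0."""
    score = linear_score(w, b, x)
    norm_sq = sum(wi * wi for wi in w)
    # Step along -w (or +w) by slightly more than the distance to the boundary.
    scale = -(1.0 + overshoot) * score / norm_sq
    return [scale * wi for wi in w]


w = [1.0] * 100   # hypothetical weights over 100 input dimensions
b = -49.0
x = [0.5] * 100   # classified as positive: score is 1.0

r = minimal_flip(w, b, x)
x_adv = [xi + ri for xi, ri in zip(x, r)]
largest_change = max(abs(ri) for ri in r)
```

Each of the 100 inputs changes by only about 0.01, yet the predicted class flips, because the tiny changes all push the score in the same direction.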
##### Experimental Results
* Adversarial examples that are indistinguishable from the actual image can be created for all networks.
1. Cross model generalization : Adversarial images created for one network can affect the other networks also.
2. Cross training generalization
* Neural networks have counter-intuitive properties with respect to the working of the individual units and discontinuities.
* The occurrence of adversarial examples and their properties.
* Feeding adversarial examples during the model training can improve the generalization of the model.
* The adversarial examples on the higher layers are more effective than those of input and lower layers.
* Adversarial examples affect models trained with different hyper parameters.
* According to the tests conducted, autoencoders are more resilient to adversarial examples.
* Deep learning networks that are trained purely by supervised learning are unstable to a few particular types of perturbations. Small perturbations added to the input lead to large perturbations at the output of the last layers.
### Open research questions
* Comparing the effects of adversarial examples on lower layers to that of the higher layers.
* Dependence of the adversarial attacks on the training data set of the model.
* Why the adversarial examples generalize across different hyperparameters or training sets.
* How often do adversarial examples occur?