[link]
Main purpose:

* This work proposes a software-based resolution augmentation method which is more agile and simpler to implement than hardware engineering solutions.
* The paper examines three deep learning single-image super-resolution techniques on pCLE images.
* A video-registration-based method is proposed to estimate ground-truth HR pCLE images (this can be considered the main objective of the paper).

Highlights:

* The paper emphasises that this is the first work to address the image resolution problem in pCLE image acquisitions.
* The paper introduces useful information on how pCLE devices work.
* Strong related work.
* Clear story.
* Comprehensive evaluation.

Main Idea:

* Use video-registration-based techniques to estimate the HR images (a real ground-truth HR image is not available).
* Simulate LR images from the estimated HR images with the help of a Voronoi diagram and Delaunay-based linear interpolation.
* Train an exemplar-based SR model (EBSR, a DL-based approach) to learn the mapping between simulated LR and estimated HR images.

Methodology Details:

* To estimate the HR images, a video-registration-based mosaicking technique (by the same authors, MIA 2006) is used, which fuses a collection of input images by averaging the temporal information.
* Since mosaicking generates a single large field-of-view mosaic image from the LR images, the mosaic-to-image diffeomorphic spatial transformation that results from the mosaicking process is used to propagate and crop the fused information from the mosaic back into each input LR image space.
* At this point, the authors observe that the misalignment between the input LR images (used in the video-registration-based mosaicking technique) and the estimated HR images causes training problems for the EBSR model. So they treat the estimated HR images as realistic and choose to simulate LR images from them!
* Simulated LR images are obtained using the Voronoi diagram (averaging over each Voronoi cell on the HR image) plus additive noise on the estimated HR images (a rough sketch of this simulation step appears at the end of this summary).
* Finally, they build two experimental datasets, 1) LR_org and HR and 2) LR_synth and HR, and train three CNN SR models on these two datasets.
* They train FSRCNN, EDSR, and SRGAN.
* The networks are trained using L1 + SSIM loss functions.

Experiment Notes:

* SSIM and GCF are used to quantitatively assess the performance of the models.
* A composite score is also used to take SSIM and GCF into account jointly.
* In the ideal case, when the models are trained and tested on simulated LR and HR images, the quantitative results are convincing.
* "From this experiment, it is possible to conclude that the proposed solution is capable of performing SR reconstruction when the models are trained on synthetic data with no domain gap at test time."
* When the models are trained and tested on original LR and estimated HR images, the performance is not reasonable.
* When the models are trained on simulated LR images and tested on original LR images, the results become better compared to the previous case.
* For a solid conclusion, a MOS study was carried out; the models are trained on simulated LR images.
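Below is a minimal sketch of the Voronoi-style LR simulation idea, assuming hypothetical inputs `hr` (the estimated HR image) and `fibre_xy` (fibre-centre coordinates). It illustrates the averaging-plus-noise step only; it is not the authors' implementation, which also uses Delaunay-based interpolation and pCLE-specific noise.

```python
import numpy as np
from scipy.spatial import cKDTree

def simulate_lr(hr, fibre_xy, noise_sigma=0.01, rng=None):
    """Average the HR image over the Voronoi cell of each fibre centre,
    then add Gaussian noise, as a rough stand-in for pCLE sampling."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = hr.shape
    yy, xx = np.mgrid[0:h, 0:w]
    pixels = np.column_stack([xx.ravel(), yy.ravel()])

    # Nearest fibre centre per pixel == membership in that centre's Voronoi cell.
    _, cell = cKDTree(fibre_xy).query(pixels)

    # Mean HR intensity inside each Voronoi cell.
    sums = np.bincount(cell, weights=hr.ravel(), minlength=len(fibre_xy))
    counts = np.bincount(cell, minlength=len(fibre_xy))
    cell_means = sums / np.maximum(counts, 1)

    # Paint each cell mean back onto its pixels and add noise.
    lr = cell_means[cell].reshape(h, w)
    return lr + rng.normal(0.0, noise_sigma, size=lr.shape)

# Toy usage with random data:
hr = np.random.rand(128, 128)
fibre_xy = np.random.rand(500, 2) * 128
lr = simulate_lr(hr, fibre_xy)
```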
[link]
This paper presents a variety of issues related to the evaluation of image generative models. Specifically, the authors provide evidence that evaluations of generative models based on the popular Parzen windows estimator or on a visual fidelity (qualitative) measure both present serious flaws.

The Parzen windows approach to generative model evaluation works by taking a finite set of samples generated from a given model and using those as the centroids of a Parzen windows Gaussian mixture. The constructed Parzen windows mixture is then used to compute a log-likelihood score on a set of test examples (a sketch of this estimator appears at the end of this summary).

Some of the key observations made in this paper are:

1. A simple, k-means based approach can obtain better Parzen windows performance than using the original training samples for a given dataset, even though these are samples from the true distribution!
2. Even for the fairly low-dimensional space of 6x6 image patches, a Parzen windows estimator would require an extremely large number of samples to come close to the true log-likelihood performance of a model.
3. Visual fidelity is a bad predictor of true log-likelihood performance, as it is possible to:
   * obtain great visual fidelity and arbitrarily low log-likelihood, with a Parzen windows model made of Gaussians with very small variance;
   * obtain bad visual fidelity and high log-likelihood by taking a model with high log-likelihood and mixing it with a white noise model, putting as much as 99% of the mixing probability on the white noise model (which would thus produce bad samples 99% of the time).
4. Measuring overfitting of a model by taking samples from the model and making sure their training set nearest neighbors are different is ineffective, since it is actually trivial to generate samples that are each visually almost identical to a training example, yet each have a large Euclidean distance to their corresponding (visually similar) training example.
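For concreteness, here is a minimal sketch of the criticised evaluation protocol: fit an isotropic Gaussian Parzen window on model samples and score held-out points with it. The variable names and bandwidth are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(samples, test, sigma):
    """Mean log-likelihood of `test` under a Gaussian Parzen window
    centred on each row of `samples` with bandwidth `sigma`."""
    n, d = samples.shape
    # Pairwise squared distances (test x samples) without a huge intermediate.
    sq_dist = ((test ** 2).sum(1)[:, None]
               + (samples ** 2).sum(1)[None, :]
               - 2.0 * test @ samples.T)
    log_kernel = -0.5 * sq_dist / sigma ** 2
    log_norm = -np.log(n) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float(np.mean(logsumexp(log_kernel, axis=1) + log_norm))

# Toy usage: even with samples from the true distribution, the score depends
# heavily on the bandwidth and on the number of samples (the paper's point 2).
rng = np.random.default_rng(0)
samples = rng.normal(size=(5000, 36))   # e.g. flattened 6x6 patches
test = rng.normal(size=(500, 36))
print(parzen_log_likelihood(samples, test, sigma=1.0))
```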
[link]
TLDR; The authors propose a web navigation task where an agent must find a target page containing a search query (typically a few sentences) by navigating a web graph with restrictions on memory, path length and number of explorable nodes. They train feedforward and recurrent neural networks and evaluate their performance against that of human volunteers.

#### Key Points

- Datasets: WikiNav-[NUM_ALLOWED_HOPS]: WikiNav-4 (6k train), WikiNav-8 (890k train), WikiNav-16 (12M train). The authors evaluate various query lengths for all data sets.
- Vector representation of pages: BoW of pre-trained word2vec embeddings.
- State-dependent action space: all possible outgoing links on the current page. At each step, the agent can peek at the neighboring nodes and see their full content (a toy sketch of this peek-and-score step appears at the end of this summary).
- During training, a single correct path is fed to the agent. Beam search is used to make predictions.
- NeuAgent-FF uses a single tanh layer. NeuAgent-Rec uses an LSTM.
- Human performance is typically worse than that of the neural agents.

#### Notes/Questions

- Is it reasonable to allow the agents to "peek" at neighboring pages? Humans can make decisions based on the hyperlink context. In practice, peeking at each page may not be feasible if there are many links on the page.
- I'm not sure if I buy the claim that this task requires Natural Language Understanding. Agents are just matching query word vectors against pages, which is no indication of NLU. An indication of NLU would be if the query were posed in a question format, which is typically short. But here, the authors use several sentences as queries, and longer queries lead to better results, suggesting that the agents don't actually have any understanding of language. They just match text.
- The authors say that NeuAgent-Rec performed consistently better for high hop lengths, but I don't see that in the data.
- The training method seems a bit strange to me because the agent is fed only one correct path, but in reality there are a large number of correct paths and target pages. It may be more sensible to train the agent with all possible target pages and paths to answer a query.
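As a toy illustration of the peek-and-score step (not the authors' NeuAgent, which is a trained neural scorer used with beam search), here is a greedy navigator that follows the neighbour whose BoW embedding is closest to the query; `graph` and `page_vec` are assumed inputs.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def greedy_navigate(start, query_vec, graph, page_vec, max_hops=8):
    """graph: dict page -> list of linked pages; page_vec: dict page -> BoW vector."""
    path = [start]
    current = start
    for _ in range(max_hops):
        neighbours = graph.get(current, [])
        if not neighbours:
            break
        # "Peek" at every outgoing link and score its content against the query.
        scores = [cosine(page_vec[n], query_vec) for n in neighbours]
        current = neighbours[int(np.argmax(scores))]
        path.append(current)
    return path
```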
[link]
Zhang et al. propose CROWN, a method for certifying adversarial robustness based on bounding activation functions using linear functions. Informally, the main result can be stated as follows: if the activation functions used in a deep neural network can be bounded above and below by linear functions (the activation function may also be segmented first), the network output can also be bounded by linear functions. These linear functions can be computed explicitly, as stated in the paper. Then, given an input example $x$ and a set of allowed perturbations, usually constrained to a $L_p$ norm, these bounds can be used to obtain a lower bound on the robustness of networks. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
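As a sketch of the kind of element-wise linear relaxation this relies on (shown here for ReLU only; CROWN handles general activation functions):

```latex
% For a pre-activation z with known bounds l <= z <= u and l < 0 < u,
% ReLU(z) = max(0, z) can be sandwiched between two linear functions:
\[
\alpha z \;\le\; \max(0, z) \;\le\; \frac{u\,(z - l)}{u - l},
\qquad \alpha \in [0, 1].
\]
% Propagating such bounds layer by layer yields explicit linear lower and upper
% bounds on the network output over the whole perturbation set around x.
```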
[link]
Facebook has [released a series of papers](https://research.facebook.com/blog/learning-to-segment/) for object segmentation and detection. This paper is the first in that series.

This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)):

1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm.
2. A CNN classifier is applied to each of the proposals.

The current paper improves step 1, i.e., region/object proposals. Most object proposal approaches fall into three categories:

* Objectness scoring
* Seed segmentation
* Superpixel merging

The current method is different from these three. It shares similarities with [Faster R-CNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN. The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object.

## Model and Training

Both mask and score predictions are achieved with a single convolutional network with multiple outputs. All the convolutional layers except the last few are from the pretrained VGG-A model. Each training sample is a triplet of an RGB input patch, the binary mask corresponding to the input patch, and a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints:

* the patch contains an object roughly centered in the input patch
* the object is fully contained in the patch and in a given scale range

Note that the network must output a mask for a single object at the center even when multiple objects are present. Figure 1 in the paper shows the architecture and the sampling used for training. The model is then jointly trained for segmentation and objectness; negative samples are not used for segmentation (a toy sketch of this two-head structure and joint loss appears at the end of this summary).

## Inference

During full-image inference, the model is applied densely at multiple locations and scales. This can be done efficiently since all computations are convolutional, as in a fully convolutional network (FCN). This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation.
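A toy PyTorch sketch of the shared-trunk / two-head structure and the joint loss (the mask term is only applied to positives). The trunk here is a small stand-in for the pretrained VGG-A features, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalNet(nn.Module):
    def __init__(self, mask_size=56):
        super().__init__()
        self.mask_size = mask_size
        # Small stand-in trunk; the paper reuses pretrained VGG-A conv layers.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.mask_head = nn.Sequential(
            nn.Conv2d(256, 64, 1), nn.ReLU(), nn.Flatten(),
            nn.LazyLinear(mask_size * mask_size))                     # mask logits
        self.score_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 1)) # objectness logit

    def forward(self, x):
        f = self.trunk(x)
        mask = self.mask_head(f).view(-1, 1, self.mask_size, self.mask_size)
        return mask, self.score_head(f)

def joint_loss(mask_logits, score_logits, mask_gt, label):
    """label is 1.0 for positive patches; negatives contribute no mask loss."""
    score_loss = F.binary_cross_entropy_with_logits(score_logits.squeeze(1), label)
    per_pixel = F.binary_cross_entropy_with_logits(mask_logits, mask_gt, reduction='none')
    mask_loss = (per_pixel.mean(dim=(1, 2, 3)) * label).mean()
    return score_loss + mask_loss
```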
[link]
This paper proposes a framework where an agent learns to navigate a 2D maze-like environment (XWORLD) from (templated) natural language commands, in the process simultaneously learning visual representations, the syntax and semantics of language, and navigation actions. The task is essentially VQA + navigation; at every step the agent gets either a question about the environment or a navigation command, and the output is either a navigation action or an answer.

Key contributions:

- Grounding and recognition are tied together as two versions of the same problem. In grounding, given an image feature map and a label (word), the problem is to find regions of the image corresponding to the word's semantics (an attention map); in recognition, given an image feature map and attention, the problem is to assign a word label. Thus word embeddings (for grounding) and softmax layer weights (for recognition) are tied together. This enables transferring concepts learnt during recognition to navigation (a toy sketch of this weight tying appears at the end of this summary).
- Further, recognition is modulated by question intent. For example, given an attention map that highlights an agent's west, should it be recognized as 'west', 'apple' or 'red' (location, object or attribute)? It depends on what the question asks. Thus, a GRU encoding of the question produces an embedding mask that modulates recognition. The equivalent when grounding is that word embeddings are passed through fully-connected layers.
- Compositionality in language is exploited by performing grounding and recognition sequentially, (softly) attending to parts of a sentence and grounding them in the image. The resulting attention map is selectively combined with attention from previous timesteps for the final decision.

## Weaknesses / Notes

Although the environment is super simple, it's a neat framework and it is useful that the target is specified in natural language (unlike prior/concurrent work, e.g. Zhu et al., ICRA17). The model gets to see a top-down centred view of the entire environment at all times, which is a little weird.
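A toy PyTorch sketch of the weight tying between grounding and recognition described above; dimensions and module names are illustrative, and the question-dependent modulation is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundRecognize(nn.Module):
    def __init__(self, vocab_size=100, dim=64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)   # shared parameters

    def ground(self, feat_map, word_id):
        """feat_map: (B, dim, H, W) -> spatial attention map for the word."""
        w = self.word_emb(word_id)                      # (B, dim)
        scores = torch.einsum('bdhw,bd->bhw', feat_map, w)
        return F.softmax(scores.flatten(1), dim=1).view_as(scores)

    def recognize(self, feat_map, attn):
        """Pool features with attention, score against all word embeddings."""
        pooled = torch.einsum('bdhw,bhw->bd', feat_map, attn)
        return pooled @ self.word_emb.weight.t()        # logits over the vocabulary
```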
[link]
* They describe a variation of convolutions that has a differently structured receptive field.
* They argue that their variation works better for dense prediction, i.e. for predicting values for every pixel in an image (e.g. coloring, segmentation, upscaling).

### How

* One can imagine the input to a convolutional layer as a 3d grid. Each cell is a "pixel" generated by a filter.
* Normal convolutions compute their output per cell as a weighted sum of the input cells in a dense area, i.e. all input cells are right next to each other.
* In dilated convolutions, the cells are not right next to each other. E.g. 2-dilated convolutions skip 1 cell between each input cell, 3-dilated convolutions skip 2 cells, etc. (similar to striding).
* Normal convolutions are simply 1-dilated convolutions (skipping 0 cells).
* One can use a 1-dilated convolution and then a 2-dilated convolution. The receptive field of the second convolution will then be 7x7 instead of the usual 5x5, due to the spacing.
* Doubling the dilation factor per layer (1, 2, 4, 8, ...) leads to an exponential increase in the receptive field size, while every cell in the receptive field still takes part in the computation of at least one convolution (a small sketch at the end of this summary illustrates this growth).
* They had problems with badly performing networks, which they fixed using an identity initialization for the weights. (Sounds like just using residual connections would have been easier.)

*Receptive fields of a 1-dilated convolution (1st image), followed by a 2-dilated conv. (2nd image), followed by a 4-dilated conv. (3rd image). The blue color indicates the receptive field size (notice the exponential increase in size). Stronger blue colors mean that the value has been used in more different convolutions.*

### Results

* They took a VGG net, removed the pooling layers and replaced the convolutions with dilated ones (weights can be kept).
* They then used the network to segment images.
* Their results were significantly better than previous methods.
* They also added another network with more dilated convolutions in front of the VGG one, again improving the results.

*Their performance on a segmentation task compared to two competing methods. They only used VGG16 without pooling layers and with convolutions replaced by dilated convolutions.*
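A small PyTorch sketch of the exponential receptive-field growth from stacking 3x3 convolutions with dilations 1, 2 and 4; channel counts and input size are illustrative.

```python
import torch
import torch.nn as nn

layers, rf, in_ch = [], 1, 3
for d in (1, 2, 4):
    # padding = dilation keeps the spatial size unchanged for a 3x3 kernel.
    layers += [nn.Conv2d(in_ch, 16, kernel_size=3, dilation=d, padding=d), nn.ReLU()]
    in_ch = 16
    rf += 2 * d                       # each 3x3 layer widens the field by 2*d
    print(f"after dilation {d}: receptive field = {rf}x{rf}")   # 3, 7, 15

net = nn.Sequential(*layers)
x = torch.randn(1, 3, 64, 64)
print(net(x).shape)                   # resolution preserved: (1, 16, 64, 64)
```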
[link]
This paper is about convolutional neural networks for computer vision. It was the first breakthrough in the ImageNet classification challenge (LSVRC-2010, 1000 classes). ReLU was a key aspect, which was not so often used before. The paper also used dropout in the last two layers.

## Training details

* Momentum of 0.9
* Learning rate of $\varepsilon$ (initialized at 0.01)
* Weight decay of $0.0005 \cdot \varepsilon$
* Batch size of 128
* The training took 5 to 6 days on two NVIDIA GTX 580 3GB GPUs.

A minimal optimiser sketch with these settings appears at the end of this summary.

## See also

* [Stanford presentation](http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf)
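As a minimal sketch, the listed optimiser settings translate to the following in PyTorch; the model here is a small placeholder, not the original two-GPU AlexNet.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for AlexNet.
model = nn.Sequential(nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Dropout(0.5), nn.Linear(96, 1000))

# SGD with momentum 0.9, initial learning rate 0.01 and weight decay 0.0005.
# PyTorch's decay term is effectively scaled by the learning rate, roughly
# matching the paper's "0.0005 * epsilon" formulation; batches of 128 images
# would be fed per step.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
```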
[link]
This paper presents a method to extract motion (dynamic) and skeleton / camera-view (static) representations from the video of a person represented as a 2D joint skeleton. This decomposition allows transferring the motion to different skeletons (retargeting) and more, and it does so by utilizing deep neural networks.

https://i.imgur.com/J5jBzcs.png

The architecture consists of motion and skeleton / camera-view encoders that decompose an input sequence of 2D joint positions into latent spaces, and a decoder that reconstructs a sequence from such components. The motion vector varies in length, while the skeleton and camera-view representations are fixed.

https://i.imgur.com/QaDksg1.png

This is achieved by the nature of the network design. Specifically, the motion encoder uses 1D convolutions with strides, so the output dimensions are proportionally related to the input. On the other hand, the static encoder uses global average pooling in the final layer to produce a fixed-size latent representation:

https://i.imgur.com/Cf7TVKA.png

A more detailed design of the encoders and decoder is shown below:

https://i.imgur.com/cpaveFm.png

**Dataset**. Adobe Mixamo is used to obtain sequences of poses of different 3D characters. It allows creating multiple samples where different characters (with different skeleton structures) perform the same motions. These 3D video clips are then projected into 2D by selecting arbitrary view angles and distances to the object. Thus, we can easily create multiple pairs of 2D image sequences of characters (same or different) performing various actions (same or different) from various views.

**Loss functions** used for training (refer to the paper for the detailed formulas):

- *Cross Reconstruction Loss*: a sum of two other losses. The first one is the reconstruction loss, where the network tries to reconstruct the original input. The second one is the cross reconstruction loss, where the network tries to reconstruct the sequence in which a different character performs the exact same action as the input. It is best shown in the figure below (a toy sketch of this loss also appears at the end of this summary):

https://i.imgur.com/ewZOAox.png

- *Triplet Loss*: aims to bring the latent codes of similar motions closer together, while pushing apart the ones that are different. It takes two triplets, where each contains two samples that share the same (or a very similar) motion and one with a different motion. The same concept is applied to the static latent space.
- *Foot velocity loss*: helps to remove the foot-skating phenomenon, since hands and feet exhibit larger errors than the other keypoints.

https://i.imgur.com/DclJEde.png

where $V_{global}$ and $V_{joint_n}$ extract the global and local ($n$th joint) velocities from the reconstructed output $\hat{p}_{ij}$, respectively, and map them back to image units, and $V_{orig_n}$ returns the original global velocity of the $n$th joint from the ground truth $p_{ij}$.

**Normalization**:

- subtract the root position from all joint locations in every frame
- subtract the mean joint position and divide by the standard deviation (averaged over the entire dataset)
- the per-frame global velocity is not touched

**Data Augmentation** applied during training:

- temporal clipping during the batch creation process
- scaling, equivalent to using a different camera distance to the object
- flipping symmetrical joints
- dropping joints to simulate the behavior of a real keypoint detector, as such detectors often miss some joints
- adding real video data to the training and using a reprojection loss in case no labels are given

**Results and Evaluation** (to be continued) ...

While the summary has become too long to be called a summary, it is worth mentioning that several applications are possible with this approach:

- performance cloning: make any 2D skeleton repeat particular motions
- motion retrieval: search for videos that contain a particular target motion
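A toy sketch of the cross-reconstruction idea, assuming trained callables `enc_m` (motion encoder), `enc_s` (static encoder) and `dec` (decoder), and paired Mixamo clips `x_a1` / `x_b1` of characters A and B performing the same motion; the paper's actual formulation differs in its details.

```python
import torch
import torch.nn.functional as F

def cross_reconstruction_loss(enc_m, enc_s, dec, x_a1, x_b1):
    """x_a1, x_b1: (batch, frames, 2 * num_joints) clips sharing the same motion."""
    m_a1, s_a = enc_m(x_a1), enc_s(x_a1)
    m_b1, s_b = enc_m(x_b1), enc_s(x_b1)
    # Plain reconstruction: motion and static code from the same clip.
    recon = F.mse_loss(dec(m_a1, s_a), x_a1) + F.mse_loss(dec(m_b1, s_b), x_b1)
    # Cross reconstruction: motion from one character, static code from the other,
    # compared against the clip of the other character doing that motion.
    cross = F.mse_loss(dec(m_a1, s_b), x_b1) + F.mse_loss(dec(m_b1, s_a), x_a1)
    return recon + cross
```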
[link]
## Introduction

* Introduces techniques to learn word vectors from large text datasets.
* Can be used to find similar words (semantically, syntactically, etc.).
* [Link to the paper](http://arxiv.org/pdf/1301.3781.pdf)
* [Link to open source implementation](https://code.google.com/archive/p/word2vec/)

## Model Architecture

* Computational complexity is defined in terms of the number of parameters accessed during model training.
* Proportional to $E*T*Q$
  * *E* - number of training epochs
  * *T* - number of words in the training set
  * *Q* - depends on the model

### Feedforward Neural Net Language Model (NNLM)

* Probabilistic model with input, projection, hidden and output layers.
* The input layer encodes the N previous words using 1-of-V encoding (V is the vocabulary size).
* The input layer is projected to a projection layer P with dimensionality $N*D$.
* The hidden layer (of size *H*) computes the probability distribution over all words.
* Complexity per training example: $Q = N*D + N*D*H + H*V$
* *Q* can be reduced by using hierarchical softmax and a Huffman binary tree (for storing the vocabulary).

### Recurrent Neural Net Language Model (RNNLM)

* Similar to the NNLM minus the projection layer.
* Complexity per training example: $Q = H*H + H*V$
* Hierarchical softmax and the Huffman tree can be used here as well.

## Log-Linear Models

* The nonlinear hidden layer causes most of the complexity.
* NNLMs can be successfully trained in two steps:
  * Learn continuous word vectors using simple models.
  * Train an N-gram NNLM over the word vectors.

### Continuous Bag-of-Words Model

* Similar to the feedforward NNLM.
* No nonlinear hidden layer.
* The projection layer is shared for all words, and the order of words does not influence the projection.
* A log-linear classifier uses a window of words to predict the middle word.
* $Q = N*D + D*\log_2 V$

### Continuous Skip-gram Model

* Similar to the Continuous Bag-of-Words model but uses the middle word of the window to predict the remaining words in the window.
* Distant words are given less weight by sampling fewer distant words.
* $Q = C*(D + D*\log_2 V)$ where *C* is the maximum distance of a word from the middle word.
* Given a *C* and the training data, a random *R* is chosen in the range *1 to C*.
* For each training word, *R* words from the history (previous words) and *R* words from the future (next words) are marked as target outputs and the model is trained.

## Results

* Skip-gram beats all other models on semantic accuracy tasks (e.g. relating Athens with Greece).
* The Continuous Bag-of-Words model outperforms the other models on syntactic accuracy tasks (e.g. relating great with greater), with skip-gram just behind in performance.
* The skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) on the Microsoft Research Sentence Completion Challenge.
* The model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"); a small gensim sketch of this appears below.
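A quick gensim sketch contrasting the two architectures (`sg=1` selects skip-gram, `sg=0` CBOW) and the algebraic analogy; the toy corpus is only for illustration, and meaningful results require a large training corpus.

```python
from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "land"],
             ["the", "queen", "rules", "the", "land"],
             ["the", "man", "walks"],
             ["the", "woman", "walks"]]

# Train both architectures on the toy corpus.
skipgram = Word2Vec(sentences, vector_size=50, window=5, sg=1, min_count=1)
cbow     = Word2Vec(sentences, vector_size=50, window=5, sg=0, min_count=1)

# Algebraic analogy: Vector("king") - Vector("man") + Vector("woman") ~ "queen"
print(skipgram.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```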