[link]
Modelling a distributed system as a replicated state machine provides the illusion that the distributed system is really just a single machine. At the core of the replicated state machine approach is a replicated log that is kept consistent by a consensus algorithm. Traditionally, consensus has been synonymous with Paxos: Paxos is taught in schools, and most consensus algorithm implementations are based on it. However, Paxos has two main disadvantages:

1. It is hard to understand. Single-decree Paxos is nuanced, and composing single-decree Paxos into multi-Paxos is confusing.
2. It is hard to implement efficiently. Multi-Paxos is not well described in the literature, and the algorithm is difficult to implement efficiently without modification.

This paper presents the Raft consensus algorithm. Raft provides the same performance and safety as multi-Paxos, but it is designed to be much easier to understand.

#### Basics

Every node in a Raft cluster is in one of three states: leader, follower, or candidate. The leader receives requests from users and forwards them to followers. Followers are completely passive and receive messages from leaders. Candidates perform leader elections in an attempt to become leader. In normal operation, there is a single leader, and every other node is a follower.

Raft proceeds in a series of increasingly numbered terms. Each term consists of a leader election followed by (potentially) normal operation, and at most one leader is elected per term. Each node participates in monotonically increasing terms, and every message a node sends is annotated with its current term. If a leader receives a message from a later term, it immediately becomes a follower. Nodes ignore messages annotated with older terms. Raft uses two RPCs: RequestVote (for leader election) and AppendEntries (for replication and heartbeats).

#### Leader Election

Leaders periodically send heartbeats (AppendEntries RPCs without any entries) to followers. As long as a follower continues to receive heartbeats, it remains a follower. If a follower does not receive a heartbeat within a certain amount of time, it begins a leader election: it increments its term, enters the candidate state, votes for itself, and sends RequestVote RPCs in parallel to all other nodes (see the election sketch below). One of three things happens:

1. It wins. Nodes issue a single vote per term on a first-come, first-served basis. If a candidate receives votes from a majority of the nodes, it becomes leader.
2. It hears from another leader. If a candidate receives a message from another leader with a term at least as large as its own, it becomes a follower.
3. It times out. A split vote can occur so that nobody becomes leader in a particular term. If this happens, the candidate times out after a certain amount of time and begins another election in the next term.

#### Log Replication

During normal operation, the leader receives a request from a client, appends it to its log annotated with the current term, and issues AppendEntries RPCs to all nodes in parallel. An entry is considered committed once it is replicated to a majority of the nodes, and once a log entry is committed, all previous log entries are also committed. Once a log entry is committed, the leader can apply it and respond to the user; moreover, a committed entry is guaranteed to eventually execute at all available nodes. The leader keeps track of the index of the largest committed entry and sends it to all other nodes so that they can also apply log entries.
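To make the election mechanics concrete, here is a minimal, hedged sketch of the follower-to-candidate transition in Python. The class and callback names (`Node`, `request_vote`) and the timeout range are illustrative assumptions, not the paper's API.

```python
import random
import time

class Node:
    def __init__(self, node_id, peers):
        self.id, self.peers = node_id, peers
        self.term, self.state, self.voted_for = 0, "follower", None
        self.last_heartbeat = time.monotonic()
        # Randomized election timeouts make split votes unlikely.
        self.timeout = random.uniform(0.15, 0.30)

    def maybe_start_election(self, request_vote):
        """request_vote(peer, term, candidate_id) -> bool (vote granted?)."""
        if time.monotonic() - self.last_heartbeat < self.timeout:
            return  # heartbeats are arriving; stay a follower
        self.term += 1                    # move to a new, higher term
        self.state = "candidate"
        self.voted_for = self.id          # vote for ourselves
        votes = 1 + sum(request_vote(p, self.term, self.id) for p in self.peers)
        if votes > (len(self.peers) + 1) // 2:
            self.state = "leader"         # won a majority for this term
```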
Raft satisfies a powerful log matching invariant:

1. "If two entries in different logs have the same index and term, then they store the same command."
2. "If two entries in different logs have the same index and term, then the logs are identical in all preceding entries."

Property 1 is ensured by the facts that a single leader is elected for any given term, that a leader only creates a single log entry per index, and that once a log entry is created, it never changes index. Property 2 is ensured by a runtime check: when a leader sends an AppendEntries RPC for a particular index, it also sends its log entry for the previous index. The follower only applies the AppendEntries RPC if its log agrees on that previous entry. Inductively, this guarantees property 2. Followers may have missing or extraneous log entries. When this happens, the leader identifies the longest prefix on which the two logs agree, then sends the rest of its log, and the follower overwrites its own log to match the leader's (see the consistency-check sketch below).

#### Safety

The protocol described so far is unsafe: if a new leader is elected, it can accidentally force followers to overwrite committed values with uncommitted values. Thus, we must ensure that leaders contain all committed entries. Other consensus algorithms ensure this by shipping committed values to newly elected leaders. Raft takes an alternative approach and guarantees that if a leader is elected, it already has every committed entry. To ensure this, Raft restricts which nodes can be elected: a follower rejects a RequestVote RPC if the requesting candidate's log is not as up-to-date as its own. One log is at least as up-to-date as another if its last entry has a higher term, or has the same term and the log is at least as long. Since a candidate must receive a majority of votes, and committed values have been replicated to a majority of nodes, any candidate must contact a node holding all committed values during its election, which prevents it from being elected if it is missing committed log entries. To prevent another subtle bug, leaders also do not directly commit entries from previous terms; they only commit entries from their own term, which indirectly commits earlier entries from previous terms.

#### Cluster Membership Changes

A Raft cluster cannot be instantaneously switched from one configuration to another. For example, consider a cluster moving from 3 to 5 nodes: it is possible for two nodes to be elected leader for the same term, which leads to a safety violation. Instead, the cluster transitions to a joint consensus phase where decisions require a majority of both the old and the new configuration. Once a majority of nodes accept the new configuration, the cluster can transition to it.
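The consistency check behind the log matching invariant fits in a few lines. Below is a hedged sketch of how a follower might handle AppendEntries; the function name and the log representation (a list of `(term, command)` tuples) are assumptions for illustration.

```python
def handle_append_entries(log, prev_index, prev_term, entries,
                          leader_term, current_term):
    """Return (success, log). `prev_index` is the index just before `entries`."""
    if leader_term < current_term:
        return False, log          # ignore messages from older terms
    # Consistency check: the follower must agree with the leader on the
    # entry preceding the new ones (same index and same term).
    if prev_index >= 0:
        if prev_index >= len(log) or log[prev_index][0] != prev_term:
            return False, log      # leader will retry with a smaller prev_index
    # Overwrite any conflicting suffix with the leader's entries; by
    # induction this re-establishes the log matching invariant.
    log = log[:prev_index + 1] + entries
    return True, log
```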
[link]
#### This nice paper looks amazing at first sight since it brings a mixture of:

- Fancy models
- A state-of-the-art training procedure (considering the 32-GPU distributed training effort which takes 21 days to get the best result)
- A significant metric improvement (single model: perplexity reduced from 51.3 to 30; ensemble model: from 41.0 to 23.7)
- A benchmark on a somewhat industry-scale dataset (vocabulary of 793,471 words, 0.8B words of training data) rather than a purely research-scale one.

#### However, I also want to add some criticism:

- As [1] mentions, perplexity is a somewhat confusing metric: a big perplexity reduction may not reflect a comparable real-world improvement, and can have a kind of "exaggerating" effect (a short computation below illustrates this).
- This paper only provides the language model improvement. However, LMs are usually embedded in a complex usage scenario, such as speech recognition or machine translation. It would be more insightful if the LMs in this paper were also evaluated as part of some end-to-end product. Since the authors work on the Google Brain team, this is not too stringent a requirement.
- As far as I know, the dataset used by this paper comes from news stories [2], which are more formal than spoken language. In real applications we usually face less formal data (such as search engine queries and speech recognition). It remains an open question how the best model in this paper would perform in a more realistic scenario. Again, for the Google Brain team this should not be a big obstacle: the model could be integrated into an existing system by replacing or complementing the existing LMs.

Although I posted some personal criticism, I do still appreciate this nice paper and recommend it as a "must-read" for NLP and related folks, since it provides a unifying, comprehensive, survey-style perspective that helps one grasp the latest state-of-the-art language modelling technology efficiently.

References:
- [1] http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf
- [2] http://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/41880.pdf
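To make the "exaggerating" effect concrete, here is a minimal sketch of how perplexity relates to per-word cross-entropy; the numbers are illustrative back-of-the-envelope values, not taken from the paper's tables.

```python
import math

def perplexity(neg_log_likelihoods):
    """Perplexity = exp(mean negative log-likelihood per word)."""
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

# Because of the exponential, a modest drop in cross-entropy (nats/word)
# shows up as a large-looking drop in perplexity:
print(math.exp(3.94))   # ~51.3
print(math.exp(3.40))   # ~30.0  -- a ~0.54 nat/word improvement reads as "51.3 -> 30"
```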
[link]
* In this paper the authors define a special loss function (DeePSiM), mostly for autoencoders.
* Usually one would use an MSE / euclidean distance as the loss function for an autoencoder. But that loss function basically always leads to blurry reconstructed images.
* They add two new ingredients to the loss function, which results in significantly sharper looking images.

### How

* Their loss function has three components:
  * Euclidean distance in image space (i.e. pixel distance between reconstructed image and original image, as usually used in autoencoders).
  * Euclidean distance in feature space. Another pretrained neural net (e.g. VGG, AlexNet, ...) is used to extract features from the original and the reconstructed image. Then the euclidean distance between both vectors is measured.
  * Adversarial loss, as usually used in GANs (generative adversarial networks). The autoencoder is here treated as the GAN generator. Then a second network, the GAN discriminator, is introduced. They are trained in the typical GAN fashion. The loss component for DeePSiM is the loss of the discriminator, i.e. when reconstructing an image, the autoencoder learns to reconstruct it in a way that lets the discriminator believe that the image is real.
* Using the loss in feature space alone would not be enough, as that tends to lead to over-pronounced high-frequency components in the image (i.e. too strong edges, corners, other artefacts).
* To decrease these high-frequency components, a "natural image prior" is usually used. Other papers define some function by hand. This paper uses the adversarial loss for that (i.e. learns a good prior).
* Instead of training a full autoencoder (encoder + decoder) it is also possible to only train a decoder and feed features - e.g. extracted via AlexNet - into the decoder.

### Results

* Using the DeePSiM loss with a normal autoencoder results in sharp reconstructed images.
* Using the DeePSiM loss with a VAE to generate ILSVRC-2012 images results in sharp images, which are locally sound but globally don't make sense. A simple euclidean distance loss results in blurry images.
* Using the DeePSiM loss when feeding only image space features (extracted via AlexNet) into the decoder leads to high quality reconstructions. Features from early layers will lead to more exact reconstructions.
* One can again feed extracted features into the network, but then take the reconstructed image, extract features of that image and feed them back into the network. When using DeePSiM, even after several iterations of that process the images still remain semantically similar, while their exact appearance changes (e.g. a dog's fur color might change, counts of visible objects change).

*Images generated with a VAE using DeePSiM loss.*

*Images reconstructed from features fed into the network. Different AlexNet layers (conv5 - fc8) were used to generate the features. Earlier layers allow more exact reconstruction.*

*First, images are reconstructed from features (AlexNet, layers conv5 - fc8 as columns). Then, features of the reconstructed images are fed back into the network. That is repeated up to 8 times (rows). Images stay semantically similar, but their appearance changes.*

--------------------

### Rough chapter-wise notes

* (1) Introduction
  * Using an MSE of euclidean distances for image generation (e.g. in autoencoders) often results in blurry images.
  * They suggest a better loss function that cares about the existence of features, but not as much about their exact translation, rotation or other local statistics.
  * Their loss function is based on distances in suitable feature spaces.
  * They use ConvNets to generate those feature spaces, as these networks are sensitive towards important changes (e.g. edges) and insensitive towards unimportant changes (e.g. translation).
  * However, naively using the ConvNet features does not yield good results, because the networks tend to project very different images onto the same feature vectors (i.e. they are contractive). That leads to artefacts in the generated images.
  * Instead, they combine the feature-based loss with GANs (adversarial loss). The adversarial loss decreases the negative effects of the feature loss ("natural image prior").
* (3) Model
  * A typical choice for the loss function in image generation tasks (e.g. when using autoencoders) would be squared euclidean/L2 loss or L1 loss.
  * They suggest a new class of losses called "DeePSiM".
  * We have a generator `G`, a discriminator `D`, a feature space creator `C` (takes an image, outputs a feature space for that image), one (or more) input images `x` and one (or more) target images `y`. Input and target image can be identical.
  * The total DeePSiM loss is a weighted sum of three components (see the sketch after these notes):
    * Feature loss: squared euclidean distance between the feature spaces of (1) the input after being fed through G and (2) the target image, i.e. `||C(G(x))-C(y)||^2_2`.
    * Adversarial loss: a discriminator is introduced to estimate the "fakeness" of images generated by the generator. The losses for D and G are the standard GAN losses.
    * Pixel space loss: classic squared euclidean distance (as commonly used in autoencoders). They found that this loss stabilized their adversarial training.
  * The feature loss alone would create high-frequency artefacts in the generated image, which is why a second loss ("natural image prior") is needed. The adversarial loss fulfills that role.
  * Architectures
    * Generator (G):
      * They define different ones based on the task.
      * They all use up-convolutions, which they implement by stacking two layers: (1) a linear upsampling layer, then (2) a normal convolutional layer.
      * They use leaky ReLUs (alpha=0.3).
    * Comparators (C):
      * They use variations of AlexNet and Exemplar-CNN.
      * They extract the features from different layers, depending on the experiment.
    * Discriminator (D):
      * 5 convolutions (with some striding; 7x7 then 5x5, afterwards 3x3), into average pooling, then dropout, then 2x linear, then 2-way softmax.
  * Training details
    * They use Adam with learning rate 0.0002 and the usual momentum values (0.9 and 0.999).
    * They temporarily stop the discriminator training when it gets too good.
    * Batch size was 64.
    * 500k to 1000k batches per training.
* (4) Experiments
  * Autoencoder
    * Simple autoencoder with an 8x8x8 code layer between encoder and decoder (so actually more values than in the input image?!).
    * Encoder has a few convolutions, decoder a few up-convolutions (linear upsampling + convolution).
    * They train on STL-10 (96x96) and take random 64x64 crops.
    * Using AlexNet for C tends to break small structural details, using Exemplar-CNN breaks color details.
    * The autoencoder with their loss tends to produce less blurry images than the common L2- and L1-based losses.
    * Training an SVM on the 8x8x8 hidden layer performs significantly better with their loss than with L2/L1. That indicates potential for unsupervised learning.
  * Variational Autoencoder
    * They replace part of the standard VAE loss with their DeePSiM loss (keeping the KL divergence term).
    * Everything else is just like in a standard VAE.
    * Samples generated by a VAE with the normal loss function look very blurry. Samples generated with their loss function look crisp and have locally sound statistics, but still (globally) don't really make any sense.
  * Inverting AlexNet
    * Assume the following variables:
      * I: an image
      * ConvNet: a convolutional network
      * F: the features extracted by a ConvNet, i.e. ConvNet(I) (features in all layers, not just the last one)
    * Then you can invert the representation of a network in two ways:
      * (1) An inversion that takes an F and returns roughly the I that resulted in F (it's *not* key here that ConvNet(reconstructed I) returns the same F again).
      * (2) An inversion that takes an F and projects it to *some* I so that ConvNet(I) returns roughly the same F again.
    * Similar to the autoencoder case, they define a decoder, but no encoder.
    * They feed into the decoder a feature representation of an image. The features are extracted using AlexNet (they try the features from different layers).
    * The decoder has to reconstruct the original image (i.e. inversion scenario 1). They use their DeePSiM loss during the training.
    * The images can be reconstructed quite well from the last convolutional layer in AlexNet. Choosing the later fully connected layers results in more errors (specifically in the case of the very last layer).
    * They also try their luck with inversion scenario (2), but didn't succeed (as their loss function does not care about diversity).
    * They iteratively encode and decode the same image multiple times (probably means: image -> features via AlexNet -> decode -> reconstructed image -> features via AlexNet -> decode -> ...). They observe that the image does not get "destroyed", but rather changes semantically, e.g. three apples might turn into one after several steps.
    * They interpolate between images. The interpolations are smooth.
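As referenced above, here is a minimal sketch of the three-part DeePSiM loss, assuming PyTorch and pre-built networks `G` (generator), `C` (a frozen comparator such as an AlexNet feature extractor) and `D` (discriminator). The weights `w_*` and the exact GAN loss formulation are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def deepsim_generator_loss(G, D, C, x, y, w_feat=1.0, w_adv=0.01, w_pix=1.0):
    recon = G(x)
    # (1) Feature loss: squared euclidean distance in comparator space,
    #     i.e. ||C(G(x)) - C(y)||^2.
    loss_feat = F.mse_loss(C(recon), C(y))
    # (2) Adversarial loss: the generator tries to make D label the
    #     reconstruction as "real".
    logits = D(recon)
    loss_adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    # (3) Pixel-space loss: classic squared euclidean distance; the paper
    #     notes this stabilizes the adversarial training.
    loss_pix = F.mse_loss(recon, y)
    return w_feat * loss_feat + w_adv * loss_adv + w_pix * loss_pix
```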
[link]
The paper empirically investigates the effect of "catastrophic forgetting" (**CF**), i.e. the inability of a model to perform a task it was previously trained on after being retrained to perform a second task. An illuminating example is what happens in ML systems with convex objectives: regardless of the initialization (i.e. of what was learnt by doing the first task), the training of the second task will always end in the global minimum, thus totally "forgetting" the first one. Neuroscientific evidence (and common sense) suggests that the outcome of the experiment is deeply influenced by the similarity of the tasks involved. Three cases are considered: (i) the two tasks are *functionally identical but input is presented in a different format*, (ii) the *tasks are similar*, and (iii) the *tasks are dissimilar*. Relevant examples may be, respectively: (i) performing the same image classification task starting from two different image representations such as RGB or HSL, (ii) performing semantically similar image classification tasks such as classifying two similar animals, and (iii) performing a text classification followed by an image classification. The problem is investigated by an empirical study covering two training methods ("SGD" and "dropout") combined with 4 activation functions (logistic sigmoid, ReLU, LWTA, maxout). A random search is carried out over these choices. From a practitioner's point of view, it is interesting to note that the dropout probability has been set to 0.5 in the hidden units and 0.2 in the visible layer, since these are reasonably well-known parameter values.

## Why the paper is important

It is apparently the first to provide a systematic empirical analysis of CF, and it establishes a framework and baselines to face the problem.

## Key conclusions, takeaways and modelling remarks

* Dropout helps in preventing CF.
* Dropout seems to increase the optimal model size with respect to the model without dropout.
* The choice of activation function has a less consistent effect than the dropout / no-dropout choice; the dissimilar-task experiment provides a notable exception.
* The previous hypothesis that the LWTA activation is particularly resistant to CF is rejected (even if it performs best on the new task in the dissimilar task pair, its behaviour is inconsistent).
* The choice of activation function should always be cross-validated.
* If computational resources are insufficient for cross-validation, the combination of dropout + maxout activation function is recommended.
[link]
TLDR; The authors propose a Contextual LSTM (CLSTM) model that appends a context vector to the input words when making predictions (a minimal sketch of this conditioning follows after the notes below). The authors evaluate the model on Language Modeling, next sentence selection and next topic prediction tasks, beating standard LSTM baselines.

#### Key Points

- The topic vector comes from an internal classifier system, i.e. it is supervised. Topics could also be estimated using unsupervised techniques.
- The topic can be calculated based on the previous words of the current sentence (SentSegTopic), all words of the previous sentence (PrevSegTopic), or the current paragraph (ParaSegTopic). The best CLSTM uses all of them.
- English Wikipedia Dataset: 1400M words train, 177M words validation, 178M words test. 129k vocab.
- When the current segment topic is present, the topic of the previous sentence doesn't matter.
- The authors couldn't compare to other models that incorporate topics because those don't scale to large-scale datasets.
- LSTMs are a long chain and the authors don't reset the hidden state between sentence boundaries. So a sentence has implicit access to the previous sentence's information, but explicitly modeling the topic still makes a difference.

#### Notes/Thoughts

- Increasing the number of hidden units seems to have a *much* larger impact on performance than adding topic information: the simple word-based LSTM model with more hidden units significantly outperforms the complex CLSTM model. This makes me question the practical usefulness of this model.
- IMO the comparisons are somewhat unfair, because by using an external classifier to obtain topic labels you are bringing in external data that the baseline models didn't have access to.
- What about using other unsupervised sentence embeddings as context vectors, e.g. seq2seq autoencoders or PV?
- If the LSTM were perfect at modeling long-range dependencies, then we wouldn't need to feed extra topic vectors. What about residual connections?
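As referenced above, here is a hedged sketch of the core CLSTM idea as I understand it: concatenate a context/topic vector to each word embedding before the LSTM. It assumes PyTorch; the layer sizes and class name are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class CLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, topic_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Input at each step = word embedding concatenated with topic vector.
        self.lstm = nn.LSTM(embed_dim + topic_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, words, topic):
        # words: (batch, seq) token ids; topic: (batch, topic_dim)
        emb = self.embed(words)
        # Repeat the topic vector at every time step of the sequence.
        topic_seq = topic.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.lstm(torch.cat([emb, topic_seq], dim=-1))
        return self.out(h)   # next-word logits at each position
```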
[link]
TLDR; The authors propose two different architectures to improve the performance of character-level RNNs. In the first architecture ("mixed") the authors condition the model on the state of a word-level RNN. In the second architecture ("cond") they condition the output classifier on character n-grams. The authors show that the proposed architectures outperform plain character-level RNNs in terms of entropy in bits per character (a short illustration of this metric follows below).

#### Key Points

- Plain character-level RNNs need a huge hidden representation in order to model long-term dependencies, but word-level RNNs can't generalize to new vocabulary and may require a huge output vocabulary.
- Model 1: Jointly train a word-level and a char-level RNN. Interpolate the losses of the two models.
- Model 2: Condition the softmax on the n-gram preceding the character, "relieving" the network of memorizing some of the sequence.
- Training: constant learning rate, reduced after each epoch in which validation accuracy decreases.
- The n-gram model can be applied to arbitrary data, not just characters. The authors evaluate on binary data.

#### Notes / Questions

- In the comparison table the authors don't show the number of parameters for the models. They compare models with the same number of hidden units, but their proposed architectures need extra parameters and computation. Unfair comparison?
- People typically use LSTMs/GRUs for language modeling. Of course the proposed techniques can be applied to LSTM/GRU networks, but the experimental results may look very different. Do these architectures result in any benefit when using LSTM/GRU units on character data?
- Entropy in bits per character seems like somewhat of a strange evaluation metric. I don't really know what to make of it, and no intuitive explanations are given.
- One argument the authors make in the paper is that character-level models can be applied to arbitrary input data (different languages, binary data, code, etc.). But their mixed model is clearly very language-specific: it can't be applied to arbitrary data, and many languages don't have clear word boundaries. Similarly, n-grams may be prohibitively expensive depending on what kind of data we're working with.
- The n-gram-conditioned model isn't clearly explained; I *think* I understand what it does, but I'm not quite sure. No intuitive explanations of what any of the models are learning are given.
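As referenced in the TLDR, "entropy in bits per character" is just the average negative base-2 log-probability the model assigns to each observed character (cross-entropy in base 2). A tiny sketch with made-up probabilities:

```python
import math

def bits_per_character(char_probs):
    """char_probs: the model's probability for each observed character."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

print(bits_per_character([0.5, 0.25, 0.9, 0.1]))  # ~1.62 bits/char
```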
[link]
Deeper networks should never have a higher **training** error than shallower ones: in the worst case, the extra layers could "simply" learn identities. It seems this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity shortcut connections which skip some layers; the network then only has to learn the **residuals** (a minimal residual block sketch follows at the end).

Advantages:

* Learning the identity becomes learning 0, which is simpler.
* Loss of information flow in the forward pass is not a problem anymore.
* No vanishing / exploding gradient.
* Identities don't have parameters to be learned.

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
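As referenced above, here is a minimal residual block sketch (assuming PyTorch). The skip-connection placement follows the common pattern; details such as batch norm ordering vary between ResNet variants.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The identity skip: the stacked layers only model the residual F(x),
        # so the output is F(x) + x. Learning the identity reduces to F(x) = 0.
        return self.relu(self.body(x) + x)
```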
[link]
#### Introduction

* Open-domain Question Answering (Open QA) - efficiently querying a large-scale knowledge base (KB) using natural language.
* Two main approaches:
  * Information Retrieval
    * Transform the question (in natural language) into a valid query (in terms of the KB) to get a broad set of candidate answers.
    * Perform fine-grained detection on the candidate answers.
  * Semantic Parsing
    * Interpret the correct meaning of the question and convert it into an exact query.
    * Limitations: human intervention is needed to create the lexicon, grammar, and schema.
* This work builds upon previous work where an embedding model learns low-dimensional vector representations of words and symbols.
* [Link](https://arxiv.org/abs/1406.3676) to the paper.

#### Task Definition

* Input - a training set of questions (paired with answers) and a KB providing a structure among the answers.
* Answers are entities in the KB and questions are strings with one identified KB entity.
* The paper uses FREEBASE as the KB.
* Datasets:
  * WebQuestions - built using FREEBASE, the Google Suggest API, and Mechanical Turk.
  * FREEBASE triplets transformed into questions.
  * Clue Web extractions dataset with entities linked to FREEBASE triplets.
  * Dataset of paraphrased questions using WIKIANSWERS.

#### Embedding Questions and Answers

* The model learns low-dimensional vector embeddings of words appearing in questions and of entities and relation types of FREEBASE, such that questions and their answers are represented close to each other in the joint embedding space.
* A scoring function $S(q, a)$, where $q$ is a question and $a$ is an answer, generates a high score if $a$ answers $q$ (a small scoring sketch follows after the results).
* $S(q, a) = f(q)^{T} g(a)$
* $f(q)$ maps the question to the embedding space: $f(q) = W \phi(q)$.
  * $W$ is a matrix of dimension $K \times N$.
  * $K$ - dimension of the embedding space (a hyperparameter).
  * $N$ - total number of words/entities/relation types.
  * $\phi(q)$ - sparse vector encoding the number of times each word appears in $q$.
* Similarly, $g(a) = W \psi(a)$ maps the answer to the embedding space, where $\psi(a)$ gives the answer representation, as discussed below.

#### Possible Representations of Candidate Answers

* Answer represented as a **single entity** from FREEBASE; $\psi(a)$ is then a one-of-N encoded vector.
* Answer represented as a **path** from question entity to answer entity. The paper considers only one- or two-hop paths, resulting in 3-of-N or 4-of-N encoded vectors (middle entities are not recorded).
* **Subgraph representation**, encoding both the path and the entire subgraph of entities connected to the answer entity. Two embedding representations are used to differentiate between entities in the path and entities in the subgraph.
* The subgraph approach is based on the hypothesis that including more information about the answers would improve results.

#### Training and Loss Function

* Minimize a margin-based ranking loss to learn the matrix $W$.
* Stochastic Gradient Descent, multi-threaded with Hogwild.

#### Multitask Training of Embeddings

* To account for the large number of synthetically generated questions, the paper also multi-tasks the training of the model with paraphrase prediction.
* Scoring function $S_{prp}(q_1, q_2) = f(q_1)^{T} f(q_2)$, where $f$ uses the same weight matrix $W$ as before.
* A high score is assigned if $q_1$ and $q_2$ belong to the same paraphrase cluster.
* Additionally, the model multitasks the task of mapping embeddings of FREEBASE entities (mids) to the corresponding words.

#### Inference

* For each question, a candidate set is generated.
* The answer (from the candidate set) with the highest score is reported as the correct answer.
* Candidate set generation strategies:
  * $C_1$ - all KB triplets containing the KB entity from the question form the candidate set. Answers are limited to 1-hop paths.
  * $C_2$ - rank all relation types, keep the top 10, and add only those 2-hop candidates where the selected relations appear in the path.

#### Results

* The $C_2$ strategy outperforms the $C_1$ approach, supporting the hypothesis that a richer representation for answers can store more information.
* The proposed approach outperforms the baseline methods, but is outperformed by an ensemble of the proposed approach with a semantic parsing via paraphrasing model.
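As referenced above, here is a hedged numpy sketch of the scoring function $S(q, a) = f(q)^{T} g(a)$ with sparse count encodings $\phi(q)$ and $\psi(a)$; the tiny dimensions and index values are illustrative only.

```python
import numpy as np

N, K = 10, 4                     # N words/entities/relations, K-dim embeddings
W = np.random.randn(K, N) * 0.1  # the single learned parameter matrix

def encode(indices, n=N):
    """Sparse count vector: how often each word/entity/relation appears."""
    v = np.zeros(n)
    for i in indices:
        v[i] += 1
    return v

def score(question_idx, answer_idx):
    f_q = W @ encode(question_idx)   # f(q) = W phi(q)
    g_a = W @ encode(answer_idx)     # g(a) = W psi(a)
    return f_q @ g_a                 # S(q, a) = f(q)^T g(a)

print(score([0, 3, 5], [7]))         # higher score = better candidate answer
```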
[link]
This paper proposes a neural architecture that allows gradients to be backpropagated through a procedure that runs for a variable, adaptive number of iterations. These "iterations" could, for instance, be the number of times computation is passed through the same recurrent layer (connected to the same input) before producing an output, which is the case considered in this paper. This is essentially achieved by pooling the recurrent states and respective outputs computed at each iteration. The pooling mechanism is essentially the same as that used in the really cool Neural Stack architecture of Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman and Phil Blunsom \cite{conf/nips/GrefenstetteHSB15}. It relies on the introduction of halting units: sigmoidal units computed at each iteration that give a soft weight on whether the computation should stop at the current iteration. Crucially, the paper introduces a new ponder cost $P(x)$, which is a regularization cost that penalizes what is meant to be a smooth upper bound on the number of iterations $N(t)$ (more on that below).

The paper presents experiments on RNNs applied to sequences where, at each time step $t$ in the sequence (not to be confused with what I'm calling computation iterations, which are indexed by $n$), the RNN can produce a variable number $N(t)$ of intermediate states and outputs. These are the states and outputs that are pooled, to produce a single recurrent state and output for the time step $t$. During each of the $N(t)$ iterations at time step $t$, the intermediate states are connected to the same time-step-$t$ input. After the $N(t)$ iterations, the RNN pools the $N(t)$ intermediate states and outputs, and then moves to the next time step $t+1$ (a small pooling sketch follows below). To mark the transitions between time steps, an extra binary input is appended, which is 1 only for the first intermediate computation iteration. Results are presented on a variety of synthetic problems and a character prediction problem.
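As referenced above, here is a hedged sketch of one adaptive time step: run the recurrent cell up to `max_steps` times on the same input, halting once the cumulative halting weights approach 1, then pool the intermediate states by their weights. The epsilon threshold, remainder handling and names are assumptions; this omits the ponder cost $P(x)$.

```python
import math

def act_step(cell, halt_unit, x, state, max_steps=10, eps=0.01):
    states, weights, remainder = [], [], 1.0
    for n in range(max_steps):
        state = cell(x, state)        # same input x at every iteration
        p = halt_unit(state)          # sigmoidal halting unit in (0, 1)
        if sum(weights) + p > 1.0 - eps or n == max_steps - 1:
            weights.append(remainder) # the last step takes the leftover mass
            states.append(state)
            break
        weights.append(p)
        states.append(state)
        remainder -= p
    # Pool: the weighted mean of intermediate states is the step's state.
    return sum(w * s for w, s in zip(weights, states))

# Toy usage with scalar state and made-up cell/halting functions:
cell = lambda x, s: math.tanh(x + 0.5 * s)
halt = lambda s: 1.0 / (1.0 + math.exp(-s))   # sigmoid
print(act_step(cell, halt, x=0.3, state=0.0))
```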
[link]
* They suggest a new stochastic optimization method, similar to the existing SGD, Adagrad or RMSProp.
  * Stochastic optimization methods have to find parameters that minimize/maximize a stochastic function.
  * A function is stochastic (non-deterministic) if the same set of parameters can generate different results. E.g. the loss of different mini-batches can differ, even when the parameters remain unchanged. Even for the same mini-batch the results can change due to e.g. dropout.
  * Their method tends to converge faster to optimal parameters than the existing competitors.
  * Their method can deal with non-stationary distributions (similar to e.g. SGD, Adadelta, RMSProp).
  * Their method can deal with very sparse or noisy gradients (similar to e.g. Adagrad).

### How

* Basic principle
  * Standard SGD just updates the parameters based on `parameters = parameters - learningRate * gradient`.
  * Adam operates similarly, but adds more "cleverness" to the rule.
  * It assumes that the gradient values have means and variances and tries to estimate these values.
  * Recall here that the function to optimize is stochastic, so there is some randomness in the gradients.
  * The mean is also called "the first moment".
  * The variance is also called "the second (raw) moment".
  * An update rule very similar to SGD would then be `parameters = parameters - learningRate * means`.
  * They instead use the update rule `parameters = parameters - learningRate * means/sqrt(variances)`.
  * They call `means/sqrt(variances)` a "signal to noise ratio".
  * Basically, if the variance of a specific parameter's gradient is high, it is pretty unclear how it should be changed, so we choose a small step size in the update rule via `learningRate * mean/sqrt(highValue)`.
  * If the variance is low, it is easier to predict how far to "move", so we choose a larger step size via `learningRate * mean/sqrt(lowValue)`.
* Exponential moving averages
  * In order to approximate the mean and variance values you could simply save the last `T` gradients and then average the values.
  * That however is a pretty bad idea, because it can lead to high memory demands (e.g. for millions of parameters in CNNs).
  * A simple average also has the disadvantage that it would completely ignore all gradients before `T` and weight all of the last `T` gradients identically. In reality, you might want to give more weight to the last couple of gradients.
  * Instead, they use an exponential moving average, which fixes both problems and simply updates the average at every timestep via the formula `avg = alpha * avg + (1 - alpha) * newValue`.
  * Let the gradient at timestep (batch) `t` be `g`, then we can approximate the mean and variance values using:
    * `mean = beta1 * mean + (1 - beta1) * g`
    * `variance = beta2 * variance + (1 - beta2) * g^2`
  * `beta1` and `beta2` are hyperparameters of the algorithm. Good values for them seem to be `beta1=0.9` and `beta2=0.999`.
  * At the start of the algorithm, `mean` and `variance` are initialized to zero-vectors.
* Bias correction
  * Initializing the `mean` and `variance` vectors to zero is an easy and logical step, but has the disadvantage that bias is introduced.
  * E.g. at the first timestep, the mean of the gradient would be `mean = beta1 * 0 + (1 - beta1) * g`, with `beta1=0.9` then: `mean = 0.1 * g`. So `0.1g`, not `g`. Both the mean and the variance are biased (towards 0).
  * This seems pretty harmless, but it can be shown that it lowers the convergence speed of the algorithm by quite a bit.
  * So to fix this, they perform bias corrections of the mean and the variance:
    * `correctedMean = mean / (1 - beta1^t)` (where `t` is the timestep).
    * `correctedVariance = variance / (1 - beta2^t)`.
    * Both formulas are applied at every timestep after the exponential moving averages (they do not influence the next timestep).
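Putting the pieces above together, here is a compact numpy sketch of one Adam update (a sketch under the summary's notation; `eps` is the usual small constant to avoid division by zero, and the learning rate is the paper's suggested default):

```python
import numpy as np

def adam_update(params, grad, mean, variance, t,
                lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    mean = beta1 * mean + (1 - beta1) * grad             # 1st moment estimate
    variance = beta2 * variance + (1 - beta2) * grad**2  # 2nd raw moment
    m_hat = mean / (1 - beta1**t)                        # bias corrections
    v_hat = variance / (1 - beta2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, mean, variance

# Usage: t starts at 1; mean and variance start as zero vectors.
p, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
p, m, v = adam_update(p, np.array([0.1, -0.2, 0.3]), m, v, t=1)
```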