## Introduction

* Introduces techniques to learn word vectors from large text datasets.
* The vectors can be used to find similar words (semantically, syntactically, etc.).
* [Link to the paper](http://arxiv.org/pdf/1301.3781.pdf)
* [Link to open source implementation](https://code.google.com/archive/p/word2vec/)

## Model Architecture

* Computational complexity is defined in terms of the number of parameters accessed during model training.
* Proportional to $E*T*Q$
    * *E* - Number of training epochs
    * *T* - Number of words in the training set
    * *Q* - Depends on the model

### Feedforward Neural Net Language Model (NNLM)

* Probabilistic model with input, projection, hidden and output layers.
* Input layer encodes the *N* previous words using 1-of-*V* encoding (*V* is the vocabulary size).
* Input layer is projected to a projection layer *P* with dimensionality *N\*D*.
* Hidden layer (of size *H*) is used to compute the probability distribution over all words in the vocabulary.
* Complexity per training example: $Q = N*D + N*D*H + H*V$
* *Q* can be reduced by using hierarchical softmax with the vocabulary stored as a Huffman binary tree, so that about $\log_2 V$ outputs are evaluated instead of *V*.

### Recurrent Neural Net Language Model (RNNLM)

* Similar to NNLM but without the projection layer.
* Complexity per training example: $Q = H*H + H*V$
* Hierarchical softmax and a Huffman tree can be used here as well.

## Log-Linear Models

* The nonlinear hidden layer causes most of the complexity.
* NNLMs can be successfully trained in two steps:
    * Learn continuous word vectors using simple models.
    * Train an N-gram NNLM on top of the word vectors.

### Continuous Bag-of-Words Model

* Similar to feedforward NNLM.
* No nonlinear hidden layer.
* Projection layer is shared for all words, and the order of words does not influence the projection.
* Log-linear classifier uses a window of words to predict the middle word.
* $Q = N*D + D*\log_2 V$

### Continuous Skip-gram Model

* Similar to Continuous Bag-of-Words, but uses the middle word of the window to predict the remaining words in the window.
* Distant words are given less weight by sampling them less often as training targets.
* $Q = C*(D + D*\log_2 V)$ where *C* is the maximum distance of a word from the middle word.
* Given *C* and the training data, a random number *R* is chosen in the range 1 to *C* for each training word.
* For each training word, *R* words from the history (previous words) and *R* words from the future (next words) are marked as target outputs and the model is trained on them.

## Results

* Skip-gram beats all other models on semantic accuracy tasks (e.g., relating Athens with Greece).
* Continuous Bag-of-Words Model outperforms other models on syntactic accuracy tasks (e.g., relating great with greater), with skip-gram just behind in performance.
* Skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) on the Microsoft Research Sentence Completion Challenge.
* Model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"), whose result lies closest to Vector("Queen").
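
The vector arithmetic in the last result can be illustrated with a small sketch. The 2-D vectors below are hand-picked toy values (not trained embeddings), and the helper names `cosine` and `analogy` are hypothetical; the sketch assumes the answer is taken as the nearest remaining word by cosine similarity.

```python
import math

# Toy 2-D vectors, hand-picked for illustration only (NOT trained embeddings).
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" via vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("man", "king", "woman", toy_vectors)
print(result)
```

With real embeddings the same search would run over the whole vocabulary; the toy dictionary only keeps the example small enough to check by hand.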
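
The random window size *R* described in the Continuous Skip-gram Model section can be sketched as follows. The function name `skipgram_pairs` and the sample sentence are hypothetical, and the sketch assumes *R* is drawn uniformly from 1 to *C*. Under that assumption a word at distance *d* from the middle word is used only when *R* is at least *d*, that is with probability $(C - d + 1)/C$, which is how distant words end up with less weight.

```python
import random


def skipgram_pairs(tokens, max_distance, rng):
    """Return (middle_word, target_word) pairs using a randomly shrunk window."""
    pairs = []
    for i, middle in enumerate(tokens):
        # R is drawn uniformly from 1..C (an assumption of this sketch).
        r = rng.randint(1, max_distance)
        start = max(0, i - r)
        stop = min(len(tokens), i + r + 1)
        for j in range(start, stop):
            if j != i:  # the middle word is never its own target
                pairs.append((middle, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, max_distance=3, rng=random.Random(0))
for middle, target in pairs[:5]:
    print(middle, "->", target)
```

Passing a seeded `random.Random` instance keeps the sampled windows reproducible between runs.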
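
The per-example complexity formulas from the Model Architecture and Log-Linear Models sections can also be compared numerically. This is a rough sketch: the function names are hypothetical, and the sizes chosen for *N*, *D*, *H*, *V* and *C* are illustrative assumptions rather than settings reported in the notes. The NNLM and RNNLM figures use the full-softmax formulas, without hierarchical softmax.

```python
import math


def q_nnlm(N, D, H, V):
    """Feedforward NNLM: Q = N*D + N*D*H + H*V."""
    return N * D + N * D * H + H * V


def q_rnnlm(H, V):
    """Recurrent NNLM: Q = H*H + H*V."""
    return H * H + H * V


def q_cbow(N, D, V):
    """Continuous Bag-of-Words: Q = N*D + D*log2(V)."""
    return N * D + D * math.log2(V)


def q_skipgram(C, D, V):
    """Continuous Skip-gram: Q = C*(D + D*log2(V))."""
    return C * (D + D * math.log2(V))


# Hypothetical sizes, chosen only to show the orders of magnitude involved.
N, D, H, V, C = 10, 500, 500, 1_000_000, 10

costs = {
    "NNLM": q_nnlm(N, D, H, V),
    "RNNLM": q_rnnlm(H, V),
    "CBOW": q_cbow(N, D, V),
    "Skip-gram": q_skipgram(C, D, V),
}
for name, q in costs.items():
    print(f"{name:10s} Q = {q:,.0f}")
```

For sizes like these, the $H*V$ term dominates the two neural models, while the log-linear models only pay a $\log_2 V$ factor at the output.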