
#### Introduction

* Method to visualize high-dimensional data points in 2/3 dimensional space.
* Data visualization techniques like Chernoff faces and graph approaches just provide a representation, not an interpretation.
* Dimensionality reduction techniques fail to retain both the local and the global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold.
* The t-SNE approach converts the data into a matrix of pairwise similarities and visualizes this matrix.
* Based on SNE (Stochastic Neighbor Embedding).
* [Link to paper](http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)

#### SNE

* Given a set of datapoints $x_1, x_2, ..., x_n$, $p_{j|i}$ is the probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. The calculation of $\sigma_i$ is described later.
* Similarly, define $q_{j|i}$ as the conditional probability corresponding to the low-dimensional representations $y_i$ and $y_j$ (of $x_i$ and $x_j$). The variance of the Gaussian in this case is set to $1/\sqrt{2}$.
* The argument is that if $y_i$ and $y_j$ capture the similarity between $x_i$ and $x_j$, then $p_{j|i}$ and $q_{j|i}$ should be equal. So the objective function to be minimized is the Kullback-Leibler (KL) divergence between $P_i$ and $Q_i$, where $P_i$ ($Q_i$) represents the conditional probability distribution given $x_i$ ($y_i$).
* Since KL divergence is not symmetric, the objective function focuses on retaining the local structure.
* The user specifies a value called perplexity (a measure of the effective number of neighbors). A binary search is performed to find the $\sigma_i$ that produces a $P_i$ with the perplexity specified by the user.
* Gradient descent (with momentum) is used to minimize the objective function, and Gaussian noise is added in the early stages to perform simulated annealing.
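The perplexity-based binary search for $\sigma_i$ can be sketched as follows (a minimal NumPy sketch; the function names and search bounds are illustrative, not from the paper):

```python
import numpy as np

def conditional_probs(dists_i, sigma):
    """p_{j|i} for one point, given squared distances to all other points."""
    p = np.exp(-dists_i / (2 * sigma ** 2))
    return p / p.sum()

def find_sigma(dists_i, target_perplexity, tol=1e-4, max_iter=50):
    """Binary-search sigma_i so that the perplexity of P_i matches the target."""
    lo, hi = 1e-10, 1e4  # assumed search bounds
    sigma = (lo + hi) / 2
    for _ in range(max_iter):
        sigma = (lo + hi) / 2
        p = conditional_probs(dists_i, sigma)
        # Perplexity = 2^H(P_i), with H the Shannon entropy in bits.
        entropy = -np.sum(p * np.log2(p + 1e-12))
        perplexity = 2 ** entropy
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma  # too many effective neighbors -> shrink sigma
        else:
            lo = sigma
    return sigma
```

Perplexity grows monotonically with $\sigma_i$, which is what makes a plain bisection sufficient here.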
#### t-SNE (t-Distributed SNE)

##### Symmetric SNE

* A single KL divergence between P (the joint probability distribution in the high-dimensional space) and Q (the joint probability distribution in the low-dimensional space) is minimized.
* Symmetric because $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$.
* More robust to outliers and has a simpler gradient expression.

##### Crowding Problem

* When we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate nearby datapoints from moderately distant datapoints, and gaps cannot form between natural clusters.
* One way around this problem is UNI-SNE, but optimization of its cost function is difficult.

##### t-SNE

* Instead of a Gaussian, use a heavy-tailed distribution (like the Student-t distribution) to convert distances into probability scores in the low-dimensional space. This way, a moderate distance in the high-dimensional space can be modeled by a larger distance in the low-dimensional space.
* The Student-t distribution is an infinite mixture of Gaussians, and the density of a point under this distribution can be computed very quickly.
* The cost function is easy to optimize.

##### Optimization Tricks

###### Early Compression

* At the start of optimization, force the datapoints (in the low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another.
* Implemented as an L2 penalty term proportional to the sum of squared distances of the datapoints from the origin.

###### Early Exaggeration

* Scale up all the $p_{ij}$'s so that large $q_{ij}$'s are obtained, with the effect that the natural clusters in the data form tight, widely separated clusters, as a lot of empty space is created in the low-dimensional space.

##### t-SNE on Large Datasets

* Space and time complexity is quadratic in the number of datapoints, so t-SNE is infeasible to apply to large datasets.
* Select a random subset of points (called landmark points) to display.
* For each landmark point, define a random walk starting at that landmark point and terminating at any other landmark point.
* $p_{ij}$ is defined as the fraction of random walks starting at $x_i$ and finishing at $x_j$ (both landmark points). This way, $p_{ij}$ is not sensitive to "short-circuits" in the graph (due to noisy datapoints).

#### Advantages of t-SNE

* The Gaussian kernel employed by t-SNE (in the high-dimensional space) defines a soft border between the local and global structure of the data.
* Both nearby and distant pairs of datapoints get equal importance in modeling the low-dimensional coordinates.
* The local neighborhood size of each datapoint is determined on the basis of the local density of the data.
* The random walk version of t-SNE takes care of the "short-circuit" problem.

#### Limitations of t-SNE

* It is unclear how t-SNE would perform on general **dimensionality reduction** to more than 3 dimensions. For such higher (than 3) dimensions, a Student-t distribution with more degrees of freedom should be more appropriate.
* t-SNE reduces the dimensionality of the data mainly based on local properties, which means it would fail on data with an intrinsically high-dimensional structure (**curse of dimensionality**).
* The cost function for t-SNE is not convex, requiring several optimization parameters to be chosen, which affect the constructed solution.
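The low-dimensional Student-t similarities and the resulting gradient of KL(P||Q) can be sketched as follows (a NumPy sketch; the function name and step size in the usage note are illustrative, and the gradient formula is the standard t-SNE one with a single degree of freedom):

```python
import numpy as np

def tsne_grad(P, Y):
    """Gradient of KL(P||Q) w.r.t. the low-dimensional points Y.

    P: (n, n) symmetric joint probabilities from the high-dimensional space.
    Y: (n, d) current low-dimensional embedding.
    """
    # Pairwise squared distances in the embedding.
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    # Student-t kernel with one degree of freedom (heavy tails).
    num = 1.0 / (1.0 + sq)
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) / (1 + ||y_i - y_j||^2)
    PQ = (P - Q) * num
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    return grad, Q
```

A gradient-descent loop would then repeat `Y -= lr * tsne_grad(P, Y)[0]`, with momentum and the early compression/exaggeration tricks layered on top.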
### Introduction

* *Curriculum learning*: when training machine learning models, start with easier subtasks and gradually increase the difficulty level of the tasks.
* The motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum-like strategy.
* [Link](http://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf) to the paper.

### Contributions of the Paper

* Explore cases that show that curriculum learning benefits machine learning.
* Offer hypotheses about when and why it happens.
* Explore the relation of curriculum learning to other machine learning approaches.

### Experiments with Convex Criteria

* Train a perceptron where some of the input data is irrelevant (not predictive of the target class).
* Difficulty can be defined in terms of the number of irrelevant samples or the margin from the separating hyperplane.
* The curriculum learning model outperforms the no-curriculum approach.
* Surprisingly, when difficulty is defined in terms of the number of irrelevant examples, the anti-curriculum strategy also outperforms the no-curriculum strategy.

### Experiments on Shape Recognition with Datasets Having Different Variability in Shapes

* Standard (target) dataset: images of rectangles, ellipses, and triangles.
* Easy dataset: images of squares, circles, and equilateral triangles.
* Start performing gradient descent on the easy dataset and switch to the target dataset at a particular epoch (called the *switch epoch*).
* For no-curriculum learning, the first epoch is the *switch epoch*.
* As the *switch epoch* increases, the classification error comes down, with the best performance when the *switch epoch* is half the total number of epochs.
* The paper does not report results for higher values of the *switch epoch*.

### Experiments on Language Modeling

* The standard dataset is the set of all possible windows of text of size 5 from Wikipedia where all words in the window appear among the 20,000 most frequent words.
* The easy dataset considers only those windows where all words appear among the 5,000 most frequent words in the vocabulary.
* Each word from the vocabulary is embedded into a *d*-dimensional feature space using a matrix **W** (to be learnt).
* The model predicts the score of the next word, given a window of words.
* The expected value of a ranking loss function is minimized to learn **W**.
* The curriculum learning based model overtakes the other model soon after switching to the target vocabulary, indicating that the curriculum-based model quickly learns new words.

### Curriculum as a Continuation Method

* Continuation methods start with a smoothed objective function and gradually move to a less smoothed function.
* Useful when the objective function is non-convex.
* Consider a family of cost functions $C_\lambda(\theta)$ such that $C_0(\theta)$ can be easily optimized and $C_1(\theta)$ is the actual objective function.
* Start with $C_0(\theta)$ and increase $\lambda$, keeping $\theta$ at a local minimum of $C_\lambda(\theta)$.
* The idea is to move $\theta$ towards a dominant (if not global) minimum of $C_1(\theta)$.
* Curriculum learning can be seen as a sequence of training criteria, starting with an easy-to-optimize objective and moving all the way to the actual objective.
* The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting any one training example at any step).

### Advantages of Curriculum Learning

* Faster training in the online setting, as the learner does not try to learn difficult examples when it is not ready.
* Guides training towards better local minima in parameter space, which is specifically useful for non-convex methods.

### Relation to Other Machine Learning Approaches

* **Unsupervised pre-processing**: both have a regularizing effect and lower the generalization error for the same training error.
* **Active learning**: the learner benefits most from examples that are close to its frontier of knowledge, being neither too hard nor too easy.
* **Boosting algorithms**: difficult examples are gradually emphasised, similar to a curriculum that starts with a focus on easier examples, though the training criterion does not change.
* **Transfer learning** and **lifelong learning**: initial tasks are used to guide the optimisation problem.

### Criticism

* Curriculum learning is not well understood, making it difficult to define a curriculum.
* In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse.
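The *switch epoch* setup from the shape-recognition experiment can be sketched as a training loop that swaps datasets at a given epoch (a minimal sketch; the function signature and the `update` callback are illustrative, not from the paper):

```python
def train_with_curriculum(update, easy_data, target_data, n_epochs, switch_epoch):
    """Run `update` on each example of the easy dataset until `switch_epoch`,
    then on the target dataset; switch_epoch = 0 reproduces the no-curriculum
    baseline, which trains on the target dataset from the first epoch."""
    for epoch in range(n_epochs):
        data = easy_data if epoch < switch_epoch else target_data
        for example in data:
            update(example)
```

The paper's best result corresponds to `switch_epoch = n_epochs // 2`; an anti-curriculum run would simply swap the roles of the two datasets.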
## Introduction

* A neural network with a recurrent attention model over a large external memory.
* A continuous form of Memory Networks, but with end-to-end training, so it can be applied to more domains.
* An extension of RNNSearch that can perform multiple hops (computational steps) over the memory per output symbol.
* [Link to the paper](http://arxiv.org/pdf/1503.08895v5.pdf).
* [Link to the implementation](https://github.com/facebook/MemNN).

## Approach

* The model takes as input $x_1, ..., x_n$ (to store in memory) and a query $q$, and outputs an answer $a$.

### Single Layer

* The input set ($x_i$) is embedded in a D-dimensional space, using an embedding matrix $A$, to obtain the memory vectors ($m_i$).
* The query is also embedded, using a matrix $B$, to obtain the internal state $u$.
* Compute the match between each memory $m_i$ and $u$ in the embedding space, followed by a softmax operation, to obtain a probability vector $p$ over the inputs.
* Each $x_i$ maps to an output vector $c_i$ (using an embedding matrix $C$).
* The output $o$ is the sum of the transformed inputs $c_i$, weighted by $p_i$.
* The sum of the output vector $o$ and the embedding vector $u$ is passed through a weight matrix $W$, followed by a softmax, to produce the predicted answer.
* $A$, $B$, $C$ and $W$ are learnt by minimizing a cross-entropy loss.

### Multiple Layers

* For layers above the first, the input is $u^{k+1} = u^k + o^k$.
* Each layer has its own $A^k$ and $C^k$, with constraints.
* At the final layer, the output is $\hat{a} = \text{softmax}(W(o^K + u^K))$.

### Constraints on Embedding Matrices

* Adjacent
    * The output embedding for one layer is the input embedding for the next, i.e. $A^{k+1} = C^k$.
    * $W^T = C^K$
    * $B = A^1$
* Layer-wise (RNN-like)
    * The same input and output embeddings are used across layers, i.e. $A^1 = A^2 = ... = A^K$ and $C^1 = C^2 = ... = C^K$.
    * A linear mapping $H$ is added to the update of $u$ between hops: $u^{k+1} = Hu^k + o^k$.
    * $H$ is also learnt.
* Think of this as a traditional RNN with two outputs:
    * Internal output: used for memory consideration.
    * External output: the predicted result.
* $u$ becomes the hidden state.
* $p$ is an internal output which, combined with $C$, is used to update the hidden state.

## Related Architectures

* RNN: memory is stored as the state of the network and is unusable over long temporal contexts.
* LSTM: protects the network state using local memory cells, but still fails over longer temporal contexts.
* Memory Networks: use a global memory.
* Bidirectional RNN: uses a small neural network with a sophisticated gated architecture (attention model) to find useful hidden states but, unlike MemNN, performs only a single pass over the memory.

## Sentence Representation for the Question Answering Task

* Bag-of-words representation
    * Input sentences and questions are embedded as bags of words.
    * Cannot capture the order of the words.
* Position Encoding
    * Takes into account the order of words.
* Temporal Encoding
    * Temporal information is encoded by a matrix $T_A$, and the memory vectors are modified as $m_i = \sum_j Ax_{ij} + T_A(i)$.
* Random Noise
    * Dummy memories (empty memories) are added at training time to regularize $T_A$.
* Linear Start (LS) training
    * Removes the softmax layers at the start of training and re-inserts them when the validation loss stops decreasing.

## Observations

* The best MemN2N models are close to supervised models in performance.
* Position encoding improves over the bag-of-words approach.
* Linear start helps to avoid local minima.
* Random noise gives a small yet consistent boost in performance.
* More computational hops lead to improved performance.
* For the language modelling task, some hops concentrate on recent words while other hops have a broader attention span over all memory locations.
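The single-layer read described above can be sketched in NumPy (a sketch; the variable names and the one-hot input representation are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def single_layer_forward(A, B, C, W, X, q):
    """One memory hop of an end-to-end memory network.

    X: (n, V) bag-of-words input items; q: (V,) query.
    A, B, C: (V, D) embedding matrices; W: (D, V) output matrix.
    """
    m = X @ A            # memory vectors m_i
    c = X @ C            # output vectors c_i
    u = q @ B            # internal state from the query
    p = softmax(m @ u)   # attention probabilities over memories
    o = p @ c            # weighted sum of output vectors
    return softmax((o + u) @ W)  # distribution over answer vocabulary
```

Stacking layers would feed `u + o` back in as the next hop's internal state, per the multiple-layers section.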
### Introduction

* Knowledge Bases (KBs) are effective tools for Question Answering (QA) but are often too restrictive (due to a fixed schema) and too sparse (due to limitations of Information Extraction (IE) systems).
* The paper proposes Key-Value Memory Networks, a neural network architecture based on [Memory Networks](https://gist.github.com/shagunsodhani/c7a03a47b3d709e7c592fa7011b0f33e) that can leverage both KBs and raw data for QA.
* The paper also introduces MOVIEQA, a new QA dataset that can be answered by a perfect KB, by Wikipedia pages, and by an imperfect KB obtained using IE techniques, thereby allowing a comparison between systems using any of the three sources.
* [Link to the paper](https://arxiv.org/abs/1606.03126).

### Related Work

* TRECQA and WIKIQA are two benchmarks where systems need to select the sentence containing the correct answer, given a question and a list of candidate sentences.
* These datasets are small and make it difficult to compare systems using different sources.
* The best results on these benchmarks are reported by CNNs and RNNs with attention mechanisms.

### Key-Value Memory Networks

* An extension of the [Memory Networks model](https://gist.github.com/shagunsodhani/c7a03a47b3d709e7c592fa7011b0f33e).
* Generalises the way context is stored in memory.
* Comprises a memory made of slots in the form of vector pairs $(k_1, v_1), ..., (k_m, v_m)$ to encode long-term and short-term context.

#### Reading the Memory

* **Key Hashing**: the question *x* is used to pre-select a subset of the array, $(k_{h_1}, v_{h_1}), ..., (k_{h_N}, v_{h_N})$, where the key shares at least one word with *x* and the frequency of those words is less than 1000.
* **Key Addressing**: each candidate memory is assigned a relevance probability:
    * $p_{h_i} = \text{softmax}(A\phi_X(x) \cdot A\phi_K(k_{h_i}))$
    * $\phi$ is a feature map of dimension *D* and *A* is a *d x D* matrix.
* **Value Reading**: the values of the memories are read by taking their weighted sum using the addressing probabilities, and a vector *o* is returned.
    * $o = \sum_i p_{h_i} A\phi_V(v_{h_i})$
* The memory access process is conducted by a "controller" neural network using $q = A\phi_X(x)$ as the query.
* The query is updated as $q_2 = R_1(q + o)$.
* The addressing and reading steps are repeated, using a new matrix $R_j$ on each hop, to retrieve more pertinent information in subsequent accesses.
* After a fixed number of hops, $H$, the resulting state of the controller is used to compute a final prediction:
    * $a = \text{argmax}_i(\text{softmax}(q_{H+1}^T B\phi_Y(y_i)))$, where $y_i$ are the possible candidate outputs and $B$ is a *d x D* matrix.
* The network is trained end-to-end using a cross-entropy loss, backpropagation and stochastic gradient descent.
* End-to-End Memory Networks can be viewed as a special case of Key-Value Memory Networks, obtained by setting the key and the value to be the same for all memories.

#### Variants of Key-Value Memories

* $\phi_X$ and $\phi_Y$, the feature maps corresponding to the query and the answer, are fixed as bag-of-words representations.

##### KB Triple

* Triples of the form "subject relation object" can be represented in Key-Value Memory Networks with the subject and relation as the key and the object as the value.
* In standard Memory Networks, the whole triple would have to be embedded in the same memory slot.
* The reversed relations "object is_related_to subject" can also be stored.

##### Sentence Level

* A document can be split into sentences, with each sentence encoded as a bag of words in both the key and the value of a memory slot.

##### Window Level

* Split the document into windows of W words and represent each window as a bag of words.
* The window becomes the key and the central word becomes the value.

##### Window + Centre Encoding

* Instead of mixing the window centre with the rest of the words, double the size of the dictionary and encode the centre of the window (the value) using the second dictionary.
##### Window + Title

* Since the title of a document could contain useful information, the word window can be encoded as the key and the document title as the value.
* The key can be augmented with features like "_window_" and "_title_" to distinguish between the two cases.

### MOVIEQA Benchmark

#### Knowledge Representation

* Doc: raw documents (from Wikipedia) related to movies.
* KB: a graph-based KB made of entities and relations.
* IE: a KB created by performing Information Extraction on Wikipedia.
* The QA pairs should be answerable from both the raw documents and the KB, so that the three approaches can be compared and the gap between the three solutions can be closed.
* The dataset has more than 100,000 QA pairs, making it much larger than most existing datasets.

### Experiments

#### MOVIEQA

##### Systems Compared

* [Bordes et al's QA system](TBD)
* [Supervised Embeddings](TBD) (without KB)
* [Memory Networks](TBD)
* Key-Value Memory Networks

##### Observations

* Key-Value Memory Networks outperform all other methods on all data sources.
* KB > Doc > IE.
* The best memory representation for directly reading documents uses "Window Level + Centre Encoding + Title".

##### KB vs Synthetic Document Analysis

* Given the KB triples, construct synthetic "Wikipedia" articles using templates, conjunctions and coreferences to determine the causes of the gap in performance between using the KB and using documents.
* The loss on one-template sentences is due to the difficulty of extracting the subject, relation and object from the artificial documents.
* Using multiple templates does not deteriorate performance much, but conjunctions and coreferences cause a dip in performance.

#### WIKIQA

* Given a question, select the sentence (from a Wikipedia document) that best answers the question.
* Key-Value Memory Networks outperform all other solutions, though only marginally better than the LDC and attentive models based on CNNs and RNNs.
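The multi-hop key addressing / value reading loop described above can be sketched in NumPy (a sketch that assumes the memories are already embedded; names and shapes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kv_memnn_read(q, keys, values, R_list):
    """Key addressing + value reading, repeated once per matrix in R_list.

    q: (d,) embedded query A.phi_X(x).
    keys, values: (N, d) embedded memories A.phi_K(k), A.phi_V(v).
    R_list: one (d, d) controller-update matrix per hop.
    """
    for R in R_list:
        p = softmax(keys @ q)  # key addressing: relevance probabilities
        o = p @ values         # value reading: weighted sum of value vectors
        q = R @ (q + o)        # controller state for the next hop
    return q
```

The final state would then be scored against embedded candidate answers $B\phi_Y(y_i)$ via a dot product and softmax, as in the prediction equation above.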
## Problem Statement

* Evaluate whether LSTMs can express and learn to evaluate short, simple programs (linear time, constant memory) in the sequence-to-sequence framework.
* [Link to paper](http://arxiv.org/pdf/1410.4615v3.pdf)

## Approach

* Formulate the program evaluation task as a sequence-to-sequence learning problem using RNNs.
* Train on short programs that can be evaluated in linear time and constant memory, since an RNN can perform only a single pass over the data and its memory is limited.
* Two parameters control the difficulty of a program:
    * `length`: the number of digits in the integers that appear in the program.
    * `nesting`: the number of times operations can be combined with each other.
* The LSTM reads the input program one character at a time and produces the output one character at a time.

### Additional Learning Tasks

* **Addition task**: given two numbers, the model learns to add them. This task is the baseline for comparing performance on the other tasks.
* **Memorization task**: given a random number, the model memorizes it and outputs it. The following techniques enhance the accuracy of the model:
    * **Input reversing**: reversing the order of the input, while keeping the output fixed, introduces many short-term dependencies that help the LSTM learn.
    * **Input doubling**: presenting the same input to the network twice enhances performance, as the model gets to look at the input twice.

## Curriculum Learning

Gradually increase the difficulty of the programs fed to the system.

* **No curriculum (baseline)**: programs of fixed `length` and `nesting` are fed to the system.
* **Naive curriculum**: start with `length` = 1 and `nesting` = 1 and keep increasing the values iteratively.
* **Mixed strategy**: randomly choose `length` and `nesting` to generate a mix of easy and difficult examples.
* **Combined strategy**: each training example is obtained by either the naive curriculum strategy or the mixed strategy.

## Network Architecture

* 2 layers, unrolled for 50 steps.
* 400 cells per layer.
* Parameters initialized uniformly in [-0.08, 0.08].
* Minibatch size of 100.
* Norm of the gradient clipped to be less than 5.
* Start with a learning rate of 0.5, decayed by 0.8 after reaching a target accuracy of 95%.

## Observations

The teacher forcing technique is used for computing accuracy, i.e. when predicting the $i^{th}$ digit, the correct first $i-1$ digits of the output are provided as input to the LSTM.

The general trend is (combined, mixed) > (naive, baseline). In certain cases of program evaluation, the baseline performs better than the naive curriculum strategy. Intuitively, the model uses all its memory to store patterns for inputs of a given size. When a larger input is provided, the model has to restructure its memory patterns to learn the output for this new class of inputs. This process of memory restructuring may be causing the degraded performance of the naive strategy. The combined strategy interleaves the naive and mixed strategies and hence reduces the need to restructure the memory patterns.

While LSTMs can learn to map the character-level representation of simple programs to their correct output, the idea cannot extend to arbitrary programs due to the runtime limitations of conventional RNNs and LSTMs. Moreover, while the curriculum is essential, the optimal curriculum strategy needs to be understood further.
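The `length` and `nesting` difficulty parameters can be sketched as a recursive program sampler (a simplified sketch: the operation set `{+, *}` is an assumption, since the paper's programs also include constructs such as if-statements, loops and variable assignment):

```python
import random

def sample_program(length, nesting, rng):
    """Build an arithmetic expression: operands have `length` digits and
    operations are combined `nesting` times."""
    if nesting == 0:
        # A `length`-digit operand.
        return str(rng.randrange(10 ** (length - 1), 10 ** length))
    op = rng.choice("+*")
    left = sample_program(length, nesting - 1, rng)
    right = sample_program(length, nesting - 1, rng)
    return f"({left}{op}{right})"
```

A training pair would then be `(program, str(eval(program)))`, and the curriculum strategies differ only in how `length` and `nesting` are chosen over the course of training.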
#### Introduction

* The GraphLab abstraction exposes an asynchronous, dynamic, graph-parallel computation model in the shared-memory setting.
* This paper extends the abstraction to the distributed setting.
* [Link](http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf) to the paper.

#### Characteristics of MLDM (Machine Learning and Data Mining)

* Graph-structured computation
    * Sometimes computation requires modeling dependencies between data.
    * e.g. modeling dependencies between similar users in a recommendation use case.
* Asynchronous iterative computation
    * In many cases, asynchronous procedures outperform synchronous ones.
    * e.g. linear systems, belief propagation, stochastic optimization, etc.
* Dynamic computation
    * Iterative computation converges asymmetrically.
    * Convergence can be accelerated by dynamic scheduling.
    * e.g. do not update parameters that have already converged.
* Serializability
    * Ensuring that all parallel executions have an equivalent serial execution is desirable for both correctness and faster convergence.

#### GraphLab Abstraction

#### Data Graph

* Store the program state as a directed graph.
* **G = (V, E, D)**, where D is the user-defined data (model parameters, algorithm state, statistical data, etc.).
* The graph data **D** is mutable, but the structure of the graph **(V, E)** is immutable.

#### Update Function

* A stateless procedure that modifies the data within the scope of a vertex and schedules the future execution of the *update* function on other vertices.
* The **scope** of a vertex ($S_v$): the data corresponding to the vertex, its edges and its adjacent vertices.
* **update**: $f(v, S_v) \rightarrow (S_v, T)$, where T is the set of vertices on which the *update* function is scheduled to be invoked.
* The scheduling of computation is decoupled from the movement of data, and no message passing is required between vertices.

#### Execution Model

* The input to the model is G and T, the initial set of vertices to be updated.
* During each step, a vertex is extracted from T and updated, and a set of vertices is added to T (for future computation).
* Vertices in T can be executed in any order, with the only constraint that all vertices be eventually executed.

#### Sync Operation

* The sync operation runs in the background to maintain global aggregates concurrently.
* These global values are read by *update* functions and written by the sync operation.

#### Consistency Models

* Full consistency
    * Full read/write access within the *scope*.
    * The scopes of concurrently updating vertices cannot overlap.
* Edge consistency
    * Read/write access on the vertex and its adjacent edges, but only read access to adjacent vertices.
    * Slightly overlapping scopes.
* Vertex consistency
    * Write access to the vertex and read access to adjacent edges and vertices.
    * All vertices can run the update function simultaneously.

### Distributed Data Graph

* A two-phase partitioning process load-balances the graph over an arbitrary cluster size.
* In the first phase, the graph is partitioned into k parts (k >> number of machines).
* Each part, called an **atom**, is a file of graph-generating commands.
* An atom also stores information about **ghosts** (the set of vertices and edges adjacent to the partition boundary).
* The atom index file contains the connectivity structure and file locations of the k atoms as a meta-graph.
* In the second phase, this meta-graph is partitioned over the physical machines.

### Distributed GraphLab Engines

#### Chromatic Engine

* A vertex coloring (no adjacent vertices share the same color) is constructed to serialize the parallel execution of dependent tasks (in our case, vertices in the graph).
* For the edge consistency model, execute all vertices of the same color before moving to the next color, and run the sync operation between color steps.
* Changes to ghost vertices and edges are communicated asynchronously as they are made.
* Vertex consistency is trivial: assign the same color to all vertices.
* For full consistency, construct a second-order vertex coloring (no vertex shares the same color as any of its distance-two neighbors).

#### Distributed Locking Engine

* Associate a readers-writer lock with each vertex.
* Each machine can update only its local vertices.
* Optimisations:
    * The ghosting system uses caching to eliminate waits on remote, unchanged data.
    * Lock requests and synchronization are pipelined to hide network latency.
        * Each machine maintains a pipeline of vertices for which locks have been requested but not yet granted.
        * A vertex is executed once lock acquisition and data synchronization are complete.
        * Non-blocking readers-writer locks, which work through callback functions, are used.

### Fault Tolerance

* Distributed checkpointing via two modes:
    * Synchronous checkpointing
        * Suspends computation to save all data modified since the last checkpoint.
    * Asynchronous checkpointing, based on the Chandy-Lamport snapshot algorithm
        * The snapshot step becomes an *update* function in the GraphLab abstraction.
        * Better than synchronous checkpointing.

### System Design

* One instance of GraphLab runs on each machine.
* These processes are symmetric and communicate via RPC.
* The first process additionally acts as the master and computes the placement of atoms based on the atom index.
* Each process maintains a local scheduler (for its vertices) and a cache to access remote data.
* A distributed consensus algorithm decides when all the schedulers are empty.

### Observations

* The biggest strength of the paper is its extensive experiments.
* GraphLab benefits from the use of background asynchronous communication and pipelined locking, but its communication layer is not as efficient as MPI's.
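The chromatic engine's color-step execution can be sketched as follows (a single-process sketch with an assumed greedy coloring; the real engine distributes each color step across machines):

```python
def greedy_coloring(adj):
    """Assign each vertex the smallest color not used by its neighbors."""
    colors = {}
    for v in sorted(adj):
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors

def chromatic_execute(adj, update):
    """Run `update` on every vertex, one color class at a time. Vertices of
    the same color share no edge, so under the edge consistency model each
    color step could run in parallel; the sync operation would run between
    color steps."""
    colors = greedy_coloring(adj)
    for c in sorted(set(colors.values())):
        for v in sorted(adj):
            if colors[v] == c:
                update(v)
```

For full consistency the same loop would use a second-order coloring, and for vertex consistency a single color suffices.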
## Introduction

* In machine learning, accuracy tends to increase with the number of training examples and the number of model parameters.
* For large data, training becomes slow even on a GPU (due to increased CPU-GPU data transfer).
* Solution: distributed training and inference - DistBelief.
* [Link to paper](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf)

## DistBelief

* A framework for parallel and distributed computation in neural networks.
* Exploits both model parallelism and data parallelism.
* Two methods implemented over DistBelief: *Downpour SGD* and *Sandblaster L-BFGS*.

## Previous Work in This Domain

* Distributed gradient computation using relaxed synchronization requirements and delayed gradient descent.
* Lock-less asynchronous stochastic gradient descent in the case of sparse gradients.
* Deep learning
    * Most focus has been on training small models on a single machine.
    * Train multiple small models on a GPU farm and combine their predictions.
    * An alternate approach is to modify standard deep networks to make them more parallelizable.
* Distributed computational tools
    * MapReduce: not suited for iterative computations.
    * GraphLab: cannot exploit the computing efficiencies available in deep networks.

## Model Parallelism

* The user specifies the computation at each node in each layer of the neural network and the messages to be exchanged between different nodes.
* Model parallelism is supported using multithreading (within a machine) and message passing (across machines).
* The model can be partitioned across machines, with DistBelief taking care of all communication/synchronization tasks.
* There are diminishing returns in terms of speed gain if too many machines/partitions are used, as partitioning only helps as long as the network overhead is small.

## Data Parallelism

* Distributed training across multiple model instances (or replicas) to optimize a single objective function.
## Distributed Optimization Algorithms

### Downpour SGD

* Asynchronous stochastic gradient descent.
* Divide the training data into subsets and run a copy of the model on each subset.
* Centralized sharded parameter server
    * Keeps the current state of all the parameters of the model.
    * Used by the model instances to share parameters.
* An asynchronous approach, as both the model replicas and the parameter server shards run independently.
* Workflow
    * A model replica requests an updated copy of the model parameters from the parameter server.
    * It processes mini-batches and sends the gradients to the server.
    * The server updates the parameters.
* Communication can be limited by limiting the rate at which parameters are requested and updates are pushed.
* Robust to machine failure, since multiple instances are running.
* Relaxing the consistency requirements works very well in practice.
* Warm starting: start training with a single replica and add the other replicas later.

### Sandblaster L-BFGS

* Sandblaster is the distributed batch optimization framework.
* Provides a distributed implementation of L-BFGS.
* A single "coordinator" process interacts with the replicas and the parameter server to coordinate the batch optimization process.
* Workflow
    * The coordinator issues commands like dot products to the parameter server shards.
    * The shards execute the operation, store the results locally and maintain additional information like a history cache.
    * Parameters are fetched at the beginning of each batch and the gradients are sent every few completed cycles.
    * This saves the overhead of sending all parameters and gradients to a single server.
* Robust to machine failure, just like Downpour SGD.
* The coordinator uses a load-balancing scheme similar to the "backup tasks" in MapReduce.

## Observations

Downpour SGD (with the Adagrad adaptive learning rate) outperforms Downpour SGD (with a fixed learning rate) and Sandblaster L-BFGS. Moreover, Adagrad can be easily implemented locally within each parameter shard.
It is surprising that Adagrad works so well with Downpour SGD as it was not originally designed to be used with asynchronous SGD. The paper conjectures that "*Adagrad automatically stabilizes volatile parameters in the face of the flurry of asynchronous updates, and naturally adjusts learning rates to the demands of different layers in the deep network.*" ## Beyond DistBelief - TensorFlow * In November 2015, [Google open sourced TensorFlow](http://googleresearch.blogspot.in/2015/11/tensorflow-googles-latest-machine_9.html) as its second-generation machine learning system. DistBelief had limitations like tight coupling with Google's infrastructure and was difficult to configure. TensorFlow takes care of these issues and is twice as fast as DistBelief. Google also published a [white paper](http://download.tensorflow.org/paper/whitepaper2015.pdf) on the topic and the implementation can be found [here](http://www.tensorflow.org/). 
## Introduction * Introduces techniques to learn word vectors from large text datasets. * Can be used to find similar words (semantically, syntactically, etc). * [Link to the paper](http://arxiv.org/pdf/1301.3781.pdf) * [Link to open source implementation](https://code.google.com/archive/p/word2vec/) ## Model Architecture * Computational complexity defined in terms of the number of parameters accessed during model training. * Proportional to $E*T*Q$ * *E* - Number of training epochs * *T* - Number of words in the training set * *Q* - depends on the model ### Feedforward Neural Net Language Model (NNLM) * Probabilistic model with input, projection, hidden and output layers. * Input layer encodes N previous words using 1-of-V encoding (V is the vocabulary size). * Input layer is projected to a projection layer P with dimensionality *N\*D* * Hidden layer (of size *H*) computes the probability distribution over all words. * Complexity per training example $Q = N*D + N*D*H + H*V$ * Can reduce *Q* by using hierarchical softmax and a Huffman binary tree (for storing the vocabulary). ### Recurrent Neural Net Language Model (RNNLM) * Similar to NNLM minus the projection layer. * Complexity per training example $Q = H*H + H*V$ * Hierarchical softmax and Huffman trees can be used here as well. ## Log-Linear Models * The nonlinear hidden layer causes most of the complexity. * NNLMs can be successfully trained in two steps: * Learn continuous word vectors using simple models. * N-gram NNLM trained over the word vectors. ### Continuous Bag-of-Words Model * Similar to feedforward NNLM. * No nonlinear hidden layer. * Projection layer shared for all words and the order of words does not influence the projection. * Log-linear classifier uses a window of words to predict the middle word. * $Q = N*D + D*\log_2V$ ### Continuous Skip-gram Model * Similar to Continuous Bag-of-Words but uses the middle word of the window to predict the remaining words in the window. * Distant words are given less weight by sampling fewer distant words. 
* $Q = C*(D + D*\log_2 V)$ where *C* is the max distance of the word from the middle word. * Given a *C* and training data, a random *R* is chosen in the range *1 to C*. * For each training word, *R* words from history (previous words) and *R* words from future (next words) are marked as target outputs and the model is trained. ## Results * Skip-gram beats all other models for semantic accuracy tasks (eg relating Athens with Greece). * Continuous Bag-of-Words Model outperforms other models for syntactic accuracy tasks (eg relating great with greater) - with Skip-gram just behind in performance. * Skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) for the Microsoft Research Sentence Completion Challenge. * Model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"). 
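The vector-arithmetic result above can be illustrated with a toy sketch. The 2-d vectors below are hand-made for illustration (one "gender" dimension, one "royalty" dimension); real word2vec embeddings are learned, high-dimensional, and only approximately satisfy such relations.

```python
import numpy as np

# Hypothetical toy vectors: dimension 0 ~ "gender", dimension 1 ~ "royalty".
vectors = {
    "king":   np.array([ 1.0, 1.0]),
    "queen":  np.array([-1.0, 1.0]),
    "man":    np.array([ 1.0, 0.0]),
    "woman":  np.array([-1.0, 0.0]),
    "prince": np.array([ 1.0, 0.8]),
    "apple":  np.array([ 0.0, -1.0]),
}

def analogy(a, b, c, vectors):
    """Return the word closest (by cosine similarity) to vec(a) - vec(b) + vec(c)."""
    target = vectors[a] - vectors[b] + vectors[c]
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    # exclude the query words themselves from the candidates
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("king", "man", "woman", vectors))  # → queen
```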
## Introduction to elastic net * Regularization and variable selection method. * Sparse representation. * Exhibits grouping effect. * Particularly useful when the number of predictors (*p*) >> number of observations (*n*). * LARS-EN algorithm to compute the elastic net regularization path. ## Lasso * Least square method with an L1-penalty on the regression coefficients. * Does continuous shrinkage and automatic variable selection. ### Limitations * If *p >> n*, lasso can select at most *n* variables. * In the case of a group of variables exhibiting high pairwise correlation, lasso doesn't care about which variable is selected. * If *n > p* and there is a high correlation between predictors, ridge regression outperforms lasso. ## Naive elastic net * Least square method. * Penalty on the regression coefficients is a convex combination of the lasso and ridge penalties. * *penalty = (1−α)\*|β|<sub>1</sub> + α\*β<sup>2</sup>* where *β* refers to the coefficient vector. * *α = 0* => lasso penalty * *α = 1* => ridge penalty * Naive elastic net can be solved by transforming it to lasso on augmented data. * Can be viewed as ridge-type shrinkage followed by lasso-type thresholding. ### Limitations * The two-stage procedure incurs a double amount of shrinkage and introduces extra bias without reducing variance. ## Bridge Regression * Generalization of lasso and ridge regression. * Can not produce sparse solutions. ## Elastic net * Rescaled naive elastic net coefficients to undo the shrinkage. * Retains the good properties of the naive elastic net. ## Justification for scaling * Elastic net becomes minimax optimal. * Scaling reverses the shrinkage control introduced by ridge regression. ## LARS-EN * Based on LARS (used to solve lasso). * Elastic net can be transformed to lasso on augmented data, so pieces of the LARS algorithm can be reused. * Uses sparseness to save on computation. ## Conclusion Elastic net outperforms lasso. 
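The augmented-data reduction from naive elastic net to lasso can be checked numerically. This sketch uses arbitrary random data and penalty weights and verifies that the two objectives agree for any coefficient vector, which is why minimizing the augmented lasso solves the naive elastic net.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)          # an arbitrary coefficient vector
lam1, lam2 = 0.7, 0.3              # lasso and ridge penalty weights (illustrative)

# Naive elastic net objective: ||y - X b||^2 + lam2 ||b||^2 + lam1 |b|_1
naive_en = (np.sum((y - X @ beta) ** 2)
            + lam2 * np.sum(beta ** 2)
            + lam1 * np.sum(np.abs(beta)))

# Augmented-data transformation: lasso on
#   X* = (1+lam2)^(-1/2) [X; sqrt(lam2) I],  y* = [y; 0],
# with penalty gamma = lam1 / sqrt(1+lam2) and beta* = sqrt(1+lam2) * beta.
scale = np.sqrt(1 + lam2)
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / scale
y_aug = np.concatenate([y, np.zeros(p)])
beta_star = scale * beta
gamma = lam1 / scale
lasso_aug = (np.sum((y_aug - X_aug @ beta_star) ** 2)
             + gamma * np.sum(np.abs(beta_star)))

# The two objectives agree for every beta, so minimizing one minimizes the other.
print(np.isclose(naive_en, lasso_aug))  # → True
```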
# TAO * Geographically distributed, read-optimized, graph data store. * Favors availability and efficiency over consistency. * Developed by and used within Facebook (social graph). * Link to [paper](https://cs.uwaterloo.ca/~brecht/courses/854Emerging2014/readings/datastore/taofacebookdistributeddatastoreatc2013.pdf). ## Before TAO * Facebook's servers directly accessed MySQL to read/write the social graph. * Memcache used as a look-aside cache. * Had several issues: * **Inefficient edge lists** - A key-value store is not a good design for storing a list of edges. * **Distributed control logic** - In a look-aside cache architecture, the control logic runs on the clients, which increases the number of failure modes. * **Expensive read-after-write consistency** - Facebook used asynchronous master-slave replication for MySQL, which introduced a time lag before the latest data would reflect in the local replicas. ## TAO Data Model * **Objects** * Typed nodes (type is denoted by `otype`) * Identified by 64-bit integers. * Contain data in the form of key-value pairs. * Model users and repeatable actions (eg `comments`). * **Associations** * Typed directed edges between objects (type is denoted by `atype`) * Identified by source object `id1`, `atype` and destination object `id2`. * Contain data in the form of key-value pairs. * Contain a 32-bit time field. * Model actions that happen at most once or record a state transition (eg `like`) * Often the `inverse association` is also meaningful (eg `like` and `liked by`). ## Query API * Support to create, retrieve, update and delete objects and associations. * Support to get all associations (`assoc_get`) or their count (`assoc_count`) based on starting node, time, index and limit parameters. ## TAO Architecture ### Storage Layer * Objects and associations stored in MySQL. * TAO API mapped to SQL queries. * Data divided into logical shards. * Objects are bound to their shard for their lifetime (`shard_id` is embedded in `id`). 
* Associations stored on the shard of their `id1` (for faster association queries). ### Caching Layer * Consists of multiple cache servers (which together form a `tier`). * An in-memory LRU cache stores objects, association lists, and association counts. * A write operation on an association list with an inverse involves writing 2 shards (for `id1` and `id2`). * The client sends the query to the cache layer, which issues an inverse write query to the shard of `id2` and, once that is completed, a write query to the shard of `id1`. * A write failure leads to hanging associations, which are repaired by an asynchronous job. ### Leaders and Followers * A single, large tier is prone to hot spots, and the number of all-to-all connections grows quadratically. * Cache split into 2 levels - one **leader tier** and multiple **follower tiers**. * Clients communicate only with the followers. * In the case of a read miss/write, followers forward the request to the leader, which connects to the storage layer. * Eventual consistency is maintained by sending cache maintenance messages from leaders to followers. * An object update in the leader results in an `invalidation message` to the followers. * The leader sends a `refill message` to notify followers about an association write. * Leaders also serialize concurrent writes and mediate thundering herds. ## Scaling Geographically * Since the workload is read intensive, read misses are serviced locally at the expense of data freshness. * In the multi-region configuration, there are master-slave regions for each shard and each region has its own followers, leader, and database. * The database in the local region is a replica of the database in the master region. * In the case of a read miss, the leader always queries the region's database (irrespective of it being the master or a slave database). * In the case of a write, the leader in the local region forwards the request to the database in the master region. ## Optimisations ### Caching Server * RAM is partitioned into `arenas` to extend the lifetime of important data types. 
* For small, fixed-size items (eg association counts), a direct 8-way associative cache is maintained to avoid the use of pointers. * Each `atype` is mapped to a 16-bit value to reduce the memory footprint. ### Cache Sharding and Hot Spots * Load is balanced among followers through `shard cloning` (reads to a shard are served by multiple followers in a tier). * The response to a query includes the object's access rate and version number. If the access rate is too high, the object is cached by the client itself. Next time the query comes, the data is omitted from the reply if it has not changed since the previous version. ### High Degree Objects * In the case of `assoc_count`, the edge direction is chosen on the basis of which node (source or destination) has the lower degree (to optimize reading the association list). * For an `assoc_get` query, only those associations are searched where time > the object's creation time. ## Failure Detection and Handling * Aggressive network timeouts to detect (potentially) failed nodes. ### Database Failure * In the case of a master failure, one of the slaves takes over automatically. * In the case of a slave failure, cache misses are redirected to the TAO leader in the region hosting the database master. ### Leader Failure * When a leader cache server fails, followers route read misses directly to the database and writes to a replacement leader (chosen randomly from the leader tier). ### Refill and Invalidation Failures * Refill and invalidation messages are sent asynchronously. * If the follower is not available, the message is stored on the leader's disk. * These messages will be lost in case of a leader failure. * To maintain consistency, all the shards mapping to a failed leader are invalidated. ### Follower Failure * Each TAO client is configured with a primary and a backup follower tier. * In normal mode, requests are made to the primary tier and, in the case of its failure, requests go to the backup tier. 
* Read-after-write consistency may be violated when failing over between different tiers (a read reaches the failover target before the writer's `refill` or `invalidate` message does). 
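As a minimal sketch of the look-aside caching pattern described in the "Before TAO" section, with plain dicts standing in for memcache and MySQL (the class and method names are made up for illustration): the control logic lives in the client, which is exactly the property TAO's caching layer was built to move server-side.

```python
class LookAsideCache:
    """Minimal look-aside (cache-aside) read path: the *client* owns the
    control logic, checking the cache first and filling it from the
    database on a miss."""

    def __init__(self, db):
        self.db = db        # backing store: a dict standing in for MySQL
        self.cache = {}     # stand-in for memcache

    def get(self, key):
        if key in self.cache:           # cache hit
            return self.cache[key]
        value = self.db[key]            # miss: client reads the database...
        self.cache[key] = value         # ...and fills the cache itself
        return value

    def put(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)       # invalidate rather than update

db = {"user:1": {"name": "alice"}}
store = LookAsideCache(db)
store.get("user:1")                     # miss -> fills the cache
store.get("user:1")                     # hit
store.put("user:1", {"name": "bob"})    # write invalidates the cached entry
```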
The [Batch Normalization paper](http://arxiv.org/pdf/1502.03167.pdf) describes a method to address the various issues related to training of Deep Neural Networks. It makes normalization a part of the architecture itself and reports significant improvements in terms of the number of iterations required to train the network. ## Issues With Training Deep Neural Networks ### Internal Covariate Shift Covariate shift refers to the change in the input distribution to a learning system. In the case of deep networks, the input to each layer is affected by the parameters in all the preceding layers. So even small changes to the network get amplified down the network. This leads to change in the input distribution to internal layers of the deep network and is known as internal covariate shift. It is well established that networks converge faster if the inputs have been whitened (ie zero mean, unit variance) and are uncorrelated, and internal covariate shift leads to just the opposite. ### Vanishing Gradient Saturating nonlinearities (like tanh or sigmoid) can not be used for deep networks as they tend to get stuck in the saturation region as the network grows deeper. Some ways around this are to use: * Nonlinearities like ReLU which do not saturate * Smaller learning rates * Careful initializations ## Normalization Let us say that the layer we want to normalize has *d* dimensions **x** $= (x_1, ... x_d)$. Then, we can normalize the $k^{th}$ dimension as follows: $$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$ We also need to scale and shift the normalized values, otherwise just normalizing a layer would limit the layer in terms of what it can represent. For example, if we normalize the inputs to a sigmoid function, then the output would be bound to the linear region only. So the normalized input $\hat{x}^{(k)}$ is transformed to: $$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$ where $\gamma$ and $\beta$ are parameters to be learned. 
Moreover, just like we use mini-batches in Stochastic Gradient Descent (SGD), we can use mini-batches with normalization to estimate the mean and variance for each activation. The transformation from $x$ to $y$ as described above is called the **Batch Normalizing Transform**. This BN transform is differentiable and ensures that as the model is training, the layers can learn on input distributions that exhibit less internal covariate shift, which accelerates the training. At training time, a subset of activations is specified and the BN transform is applied to all of them. During test time, the normalization is done using the population statistics instead of mini-batch statistics to ensure that the output deterministically depends on the input. ## Batch Normalized Convolutional Networks Let us say that $x = g(Wu+b)$ is the operation performed by the layer where $W$ and $b$ are the parameters to be learned, $g$ is a nonlinearity and $u$ is the input from the previous layer. The BN transform is added just before the nonlinearity, by normalizing $x = Wu+b$. An alternative would have been to normalize $u$ itself, but constraining just the first and the second moment would not eliminate the covariate shift from $u$. When normalizing $Wu+b$, we can ignore the $b$ term as it would be canceled during the normalization step (*b*'s role is subsumed by $\beta$) and we have $z = g(BN(Wu))$ For convolutional layers, normalization should follow the convolution property as well - ie different elements of the same feature map, at different locations, are normalized in the same way. So all the activations in a mini-batch are jointly normalized over all the locations, and the parameters (*γ* and *β*) are learnt per feature map instead of per activation. ## Advantages Of Batch Normalization 1. Reduces internal covariate shift. 2. Reduces the dependence of gradients on the scale of the parameters or their initial values. 3. 
Regularizes the model and reduces the need for dropout, photometric distortions, local response normalization and other regularization techniques. 4. Allows use of saturating nonlinearities and higher learning rates. Batch Normalization was applied to models trained for MNIST and to the Inception Network for ImageNet. All the above-mentioned advantages were validated in the experiments. Interestingly, Batch Normalization with sigmoid achieved an accuracy of 69.8% (the overall best, using any nonlinearity, was 74.8%) while the Inception model (sigmoid nonlinearity), without Batch Normalization, performed no better than a random guess. ## Future Work While the BN Transform does enhance the overall deep network training task, its precise effect on gradient propagation is still not well understood. A future extension of Batch Normalization would be in the domain of Recurrent Neural Networks, where internal covariate shift and vanishing gradients are more severe. It also remains to be explored whether it can help with domain adaptation by easily generalizing to new data distributions. 
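The batch-normalizing transform described above can be sketched in numpy as follows. This is a minimal sketch: training mode uses mini-batch statistics and tracks running estimates of the population statistics used at test time; γ and β would be learned by backpropagation, and the `momentum` value is an arbitrary choice for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.9, eps=1e-5):
    """Batch-normalizing transform for a (batch, features) activation matrix."""
    if training:
        mean = x.mean(axis=0)       # mini-batch statistics
        var = x.var(axis=0)
        # track estimates of the population statistics for use at test time
        running_mean[:] = momentum * running_mean + (1 - momentum) * mean
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mean, var = running_mean, running_var   # deterministic at test time
    x_hat = (x - mean) / np.sqrt(var + eps)     # normalize...
    return gamma * x_hat + beta                 # ...then scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)
rm, rv = np.zeros(4), np.ones(4)
y = batch_norm(x, gamma, beta, rm, rv, training=True)
# with gamma=1, beta=0 the output has (approximately) zero mean and unit variance
```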
The [paper](http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf) presents Twitter's logging infrastructure, how it evolved from application-specific logging to a unified logging infrastructure and how session sequences are used as a common-case optimization for a large class of queries. ## Messaging Infrastructure Twitter uses **Scribe** as its messaging infrastructure. A Scribe daemon runs on every production server and sends log data to a cluster of dedicated aggregators in the same data center. Scribe itself uses **Zookeeper** to discover the hostname of the aggregator: each aggregator registers itself with Zookeeper and the Scribe daemon consults Zookeeper to find a live aggregator to which it can send the data. Co-located with the aggregators is the staging Hadoop cluster, which merges the per-category streams from all the server daemons and writes the compressed results to HDFS. These logs are then moved into the main Hadoop data warehouse and are deposited in per-category, per-hour directories (eg /logs/category/YYYY/MM/DD/HH). Within each directory, the messages are bundled in a small number of large files and are partially ordered by time. Twitter uses **Thrift** as its data serialization framework, as it supports nested structures and was already being used elsewhere within Twitter. A system called **Elephant Bird** is used to generate Hadoop record readers and writers for arbitrary Thrift messages. Production jobs are written in **Pig (Latin)** and scheduled using **Oink**. ## Application-Specific Logging Initially, all applications defined their own custom formats for logging messages. While this made it easy to develop application logging, it had many downsides as well: * Inconsistent naming conventions: eg uid vs userId vs user_Id * Inconsistent semantics associated with each category name, causing a resource discovery problem. * Inconsistent formats of log messages. All these issues make it difficult to reconstruct user session activity. 
## Client Events This is an effort within Twitter to develop a unified logging framework to get rid of all the issues discussed previously. A hierarchical, six-level schema is imposed on all the events (as described in the table below).

| Component | Description | Example |
| --- | --- | --- |
| client | client application | web, iPhone, android |
| page | page or functional grouping | home, profile, who_to_follow |
| section | tab or stream on a page | home, mentions, retweets, searches, suggestions |
| component | component object or objects | search_box, tweet |
| element | UI element within the component | button, avatar |
| action | actual user or application action | impression, click, hover |

**Table 1: Hierarchical decomposition of client event names.**

For example, the event `web:home:mentions:stream:avatar:profile_click` is logged whenever there is a profile-image click on the avatar of a tweet in the mentions timeline for a user on twitter.com (read from right to left). The alternate design was a tree-based model for logging client events. That model allowed for an arbitrarily deep event namespace with as fine-grained logging as required, but the client events model was chosen to make the top-level aggregate queries easier. A client event is a Thrift structure that contains the components given in the table below.

| Field | Description |
| --- | --- |
| event initiator | {client, server} × {user, app} |
| event_name | event name |
| user_id | user id |
| session_id | session id |
| ip | user's IP address |
| timestamp | timestamp |
| event_details | event details |

**Table 2: Definition of a client event.**

The logging infrastructure is unified in two senses: * All log messages share a common format with clear semantics. * All log messages are stored in a single place. ## Session Sequences A session sequence is a sequence of symbols *S = {s<sub>0</sub>, s<sub>1</sub>, s<sub>2</sub>...s<sub>n</sub>}* such that each symbol is drawn from a finite alphabet *Σ*. A bijective mapping is defined between *Σ* and the universe of event names. 
Each symbol in *Σ* is represented by a valid Unicode code point (frequent events are assigned shorter code points) and each session sequence becomes a valid Unicode string. Once all logs have been imported to the main database, a histogram of event counts is created and is used to map event names to Unicode code points. The counts and samples of each event type are stored in a known location in HDFS. Session sequences are reconstructed from the raw client event logs via a *group-by* on *user_id* and *session_id*. Session sequences are materialized because it is difficult to work with raw client event logs for the following reasons: * A lot of brute-force scans. * Large group-by operations needed to reconstruct user sessions. #### Alternate Designs Considered * Reorganize complete Thrift messages by reconstructing user sessions - this solves the second problem but not the first. * Use a columnar storage format - this addresses the first issue, but it only reduces the time taken by the mappers and not the number of mappers itself. The materialized session sequences are much smaller than the raw client event logs (around 50 times smaller) and address both issues. ## Client Event Catalog To enhance the accessibility of the client event logs, an automatically generated event data catalog is used along with a browsing interface to allow users to browse, search and access sample entries for the various client events. (These sample entries are the same entries that were mentioned in the previous section.) The catalog is rebuilt every day and is always up to date. ## Applications Client event logs and session sequences are used in the following applications: * Summary Statistics - session sequences are used to compute various statistics about sessions. * Event Counting - used to understand how users take advantage of a particular feature. * Funnel Analytics - used to focus on user attention in a multi-step process like the sign-up process. * User Modeling - used to identify "interesting" user behavior. 
N-gram models (from the NLP domain) can be extended to measure how important temporal signals are by modeling user behavior on the basis of the last *n* actions. The paper also mentions the possibility of extracting "activity collocations" based on the notion of collocations. ## Possible Extensions Session sequences are limited in the sense that they capture only the event name and exclude other details. The solution adopted by Twitter is to use a generic indexing infrastructure that integrates with Hadoop at the level of InputFormats. The indexes reside with the data, making it easier to reindex the data. An alternative would have been to use **Trojan layouts**, which embed indexes in the HDFS block header, but this means that indexing would require the data to be rewritten. Another possible extension would be to leverage more analogies from the field of Natural Language Processing. This would include the use of automatic grammar induction techniques to learn hierarchical decompositions of user activity. Another area of exploration is around leveraging advanced visualization techniques for exploring sessions and mapping interesting behavioral patterns into distinct visual patterns that can be easily recognized. 
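The session-sequence construction described above can be sketched roughly as below. The toy log records and the choice of the private-use code-point range `0xE000` are assumptions for illustration; the paper only requires a bijection between event names and valid code points, with frequent events getting shorter encodings.

```python
from collections import Counter
from itertools import groupby

# Toy client-event log: (user_id, session_id, timestamp, event_name).
log = [
    (1, "s1", 1, "web:home:mentions:stream:avatar:profile_click"),
    (1, "s1", 2, "web:home:mentions:stream:tweet:click"),
    (2, "s2", 1, "web:home:mentions:stream:tweet:click"),
    (2, "s2", 2, "web:home:mentions:stream:tweet:click"),
]

# Histogram of event counts; the most frequent events get the smallest
# (shortest-to-encode) code points, here starting at an arbitrary
# private-use base.
PRIVATE_BASE = 0xE000
counts = Counter(name for _, _, _, name in log)
symbol = {name: chr(PRIVATE_BASE + i)
          for i, (name, _) in enumerate(counts.most_common())}

# Reconstruct session sequences: group by (user_id, session_id), order by time,
# and concatenate the symbols into a Unicode string per session.
log.sort(key=lambda e: (e[0], e[1], e[2]))
sessions = {
    key: "".join(symbol[name] for _, _, _, name in events)
    for key, events in groupby(log, key=lambda e: (e[0], e[1]))
}
```

In the real pipeline the group-by runs as a Pig job over HDFS; the point of the sketch is just the frequency-ordered code-point assignment and the per-session string.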
The [paper](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks. ### 1. Learning = Representation + Evaluation + Optimization All machine learning algorithms have three components: * **Representation** for a learner is the set of classifiers/functions that can possibly be learnt. This set is called the *hypothesis space*. If a function is not in the hypothesis space, it can not be learnt. * **Evaluation** function tells how good the machine learning model is. * **Optimization** is the method to search for the most optimal learning model. ### 2. It's Generalization That Counts The fundamental goal of machine learning is to generalize beyond the training set. The data used to evaluate the model must be kept separate from the data used to learn the model. When we use generalization as a goal, we do not have access to a function that we can optimize. So we have to use training error as a proxy for test error. ### 3. Data Alone Is Not Enough Since our ultimate goal is generalization (see point 2), there is no such thing as **"enough"** data. Some knowledge beyond the data is needed to generalize beyond the data. Another way to put it is: "No learner can beat random guessing over all possible functions." But instead of hard-coding assumptions, learners should allow assumptions to be explicitly stated, varied and incorporated automatically into the model. ### 4. Overfitting Has Many Faces One way to interpret overfitting is to break down generalization error into two components: bias and variance. **Bias** is the tendency of the learner to constantly learn the same wrong thing (in the image, a high bias would mean more distance from the centre). **Variance** is the tendency to learn random things irrespective of the signal (in the image, a high variance would mean more scattered points). 
![Bias Variance Diagram](https://dl.dropboxusercontent.com/u/56860240/APaperAWeek/BiasVarianceDiagram.png) A more powerful learner (one that can learn many models) need not be better than a less powerful one, as it can have high variance. While noise is not the only reason for overfitting, it can indeed aggravate the problem. Some tools against overfitting are **cross-validation**, **regularization**, **statistical significance testing**, etc. ### 5. Intuition Fails In High Dimensions Generalizing correctly becomes exponentially harder as the dimensionality (number of features) becomes large. Machine learning algorithms depend on similarity-based reasoning, which breaks down in high dimensions as a fixed-size training set covers only a small fraction of the large input space. Moreover, our intuitions from three-dimensional space often do not apply to higher-dimensional spaces. So the **curse of dimensionality** may outweigh the benefits of having more features. Though, in most cases, learners benefit from the **blessing of non-uniformity** as data points are concentrated in lower-dimensional manifolds. Learners can implicitly take advantage of this lower effective dimension or use dimensionality reduction techniques. ### 6. Theoretical Guarantees Are Not What They Seem A common type of bound when dealing with machine learning algorithms relates to the number of samples needed to ensure good generalization. But these bounds are very loose in nature. Moreover, such a bound says that given a large enough training dataset, our learner would, with high probability, either return a good hypothesis or fail to find a consistent hypothesis. It does not tell us anything about how to select a good hypothesis space. Another common type of bound is the asymptotic bound, which says "given infinite data, the learner is guaranteed to output the correct classifier". But in practice we never have infinite data, and data alone is not enough (see point 3). 
So theoretical guarantees should be used to understand and drive algorithm design, not as the only criteria to select an algorithm. ### 7. Feature Engineering Is The Key Machine learning is an iterative process where we train the learner, analyze the results, modify the learner/data and repeat. Feature engineering is a crucial step in this pipeline. Having the right kind of features (independent features that correlate well with the class) makes learning easier. But feature engineering is also difficult because it requires domain-specific knowledge which extends beyond just the data at hand (see point 3). ### 8. More Data Beats A Clever Algorithm As a rule of thumb, a dumb algorithm with lots of data beats a clever algorithm with a modest amount of data. But more data means more scalability issues. Fixed-size learners (parametric learners) can take advantage of data only up to a point, beyond which adding more data does not improve the results. Variable-size learners (non-parametric learners) can, in theory, learn any function given a sufficient amount of data. Of course, even non-parametric learners are bound by the limitations of memory and computational power. ### 9. Learn Many Models, Not Just One In the early days of machine learning, the model/learner to be trained was predetermined and the focus was on tuning it for optimal performance. Then the focus shifted to trying many variants of different learners. Now the focus is on combining the various variants of different algorithms to generate the most optimal results. Such model ensembling techniques include *bagging*, *boosting* and *stacking*. ### 10. Simplicity Does Not Imply Accuracy Though Occam's razor suggests that machine learning models should be kept simple, there is no necessary connection between the number of parameters of a model and its tendency to overfit. 
The complexity of a model can be related to the size of the hypothesis space, as smaller spaces allow hypotheses to be represented by smaller, simpler codes. But there is another side to this picture: a learner with a larger hypothesis space that tries fewer hypotheses is less likely to overfit than one that tries more hypotheses from a smaller space. So hypothesis space size is just a rough guide towards accuracy. Domingos concludes in his [other paper](http://homes.cs.washington.edu/~pedrod/papers/dmkd99.pdf) that "simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy." ### 11. Representable Does Not Imply Learnable Just because a function can be represented does not mean that the function can actually be learnt. Restrictions imposed by data, time and memory limit the functions that can actually be learnt in a feasible manner. For example, decision tree learners can not learn trees with more leaves than the number of training data points. The right question to ask is "whether a function can be learnt" and not "whether a function can be represented". ### 12. Correlation Does Not Imply Causation Correlation may hint towards a possible cause-and-effect relationship, but that needs to be investigated and validated. On the face of it, correlation can not be taken as proof of causation. 
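The ensembling idea from point 9 can be sketched with bagging over decision stumps. This toy implementation (a hand-rolled stump learner on 1-d data) is an illustration of the technique, not code from the paper: each model sees a bootstrap resample, and predictions are combined by majority vote.

```python
import numpy as np

def fit_stump(X, y):
    """Best 1-d decision stump (threshold + sign) by training error."""
    best = (np.inf, 0.0, 1)
    for t in np.unique(X):
        for sign in (1, -1):
            pred = np.where(X > t, sign, -sign)
            err = np.mean(pred != y)
            if err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda Z: np.where(Z > t, sign, -sign)

def bagging(X, y, n_models=25, rng=None):
    """Bagging: train each stump on a bootstrap resample, predict by majority vote."""
    rng = rng or np.random.default_rng(0)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        models.append(fit_stump(X[idx], y[idx]))
    return lambda Z: np.sign(sum(m(Z) for m in models))

# Toy 1-d data: label +1 above 0, -1 below.
rng = np.random.default_rng(1)
X = rng.normal(size=200)
y = np.where(X > 0, 1, -1)
ensemble = bagging(X, y, rng=rng)
accuracy = np.mean(ensemble(X) == y)
```

An odd number of models avoids tied votes; boosting and stacking would combine the base learners differently (reweighting examples and learning a combiner model, respectively).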
[Hive](http://infolab.stanford.edu/~ragho/hiveicde2010.pdf) is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-like query language called HiveQL. These queries are compiled into MapReduce jobs that are executed on Hadoop. While Hive uses Hadoop for the execution of queries, it reduces the effort that goes into writing and maintaining MapReduce jobs. # Data Model Hive supports database concepts like tables, columns, rows and partitions. Both primitive (integer, float, string) and complex data types (map, list, struct) are supported. Moreover, these types can be composed to support structures of arbitrary complexity. Tables are serialized/deserialized using default serializers/deserializers. Any new data format and type can be supported by implementing the SerDe and ObjectInspector Java interfaces. # Query Language The Hive query language (HiveQL) consists of a subset of SQL along with some extensions. The language is very SQL-like and supports features like subqueries, joins, cartesian products, group by, aggregation, describe and more. MapReduce programs can also be used in Hive queries. A sample query using MapReduce would look like this: ``` FROM ( MAP inputdata USING 'python mapper.py' AS (word, count) FROM inputtable CLUSTER BY word ) REDUCE word, count USING 'python reduce.py'; ``` This query uses `mapper.py` to transform `inputdata` into `(word, count)` pairs, distributes the data to the reducers by hashing on the `word` column (given by `CLUSTER`) and aggregates with `reduce.py`. Notice that Hive allows the order of `FROM` and `SELECT/MAP/REDUCE` to be changed within a subquery. This allows insertion of different transformation results into different tables/partitions/HDFS/local directories as part of the same query and reduces the number of scans on the input data. ### Limitations * Only equi-joins are supported. * Data can not be inserted into existing tables/partitions and all inserts overwrite the data. 
* `INSERT INTO`, `UPDATE` and `DELETE` are not supported, which makes it easier to handle reader and writer concurrency.

# Data Storage

While a table is the logical data unit in Hive, the data is actually stored in HDFS directories. A **table** is stored as a directory in HDFS, a **partition** of a table as a subdirectory within that directory, and a **bucket** as a file within the table/partition directory. Partitions can be created either when creating a table or by using an `INSERT`/`ALTER` statement. The partitioning information is used to prune data when running queries. For example, suppose we create a partition for `day='monday'` using the query

```
ALTER TABLE dummytable ADD PARTITION (day='monday')
```

and then run the query

```
SELECT * FROM dummytable WHERE day='monday'
```

Suppose the data in `dummytable` is stored in the `/user/hive/data/dummytable` directory. The query will only scan the subdirectory `/user/hive/data/dummytable/day=monday` within the `dummytable` directory. Buckets, the files within the leaf directory of a table or partition, can similarly be used to prune data when the user runs a `SAMPLE` query. Any data already stored in HDFS can be queried using the `EXTERNAL TABLE` clause by specifying its location with the `LOCATION` clause. When an external table is dropped, only its metadata is deleted, not the data itself.

# Serialization/Deserialization

Hive implements `LazySerDe` as the default SerDe. It deserializes rows into internal objects lazily, so that the cost of deserializing a column is incurred only when it is needed. Hive also provides a `RegexSerDe`, which allows the use of regular expressions to parse columns out of a row. Various file formats like `TextInputFormat`, `SequenceFileInputFormat` and `RCFileInputFormat` are supported as well, and other formats can be implemented and specified in the query itself.
For example:

```
CREATE TABLE dummytable(key INT, value STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
```

# System Architecture and Components

### Components

* **Metastore**: Stores the system catalog and metadata about tables, columns and partitions.
* **Driver**: Manages the life cycle of a HiveQL statement as it moves through Hive.
* **Query Compiler**: Compiles a query into a directed acyclic graph (DAG) of MapReduce tasks.
* **Execution Engine**: Executes the tasks produced by the compiler in proper dependency order.
* **Hive Server**: Provides a Thrift interface and a JDBC/ODBC server.
* **Client components**: CLI, web UI, and JDBC/ODBC driver.
* **Extensibility interfaces**: Interfaces for SerDe, ObjectInspector, UDF (User-Defined Function) and UDAF (User-Defined Aggregate Function).

### Life cycle of a query

A query is submitted via the CLI, web UI or any other interface. It goes to the compiler and undergoes parse, type-check and semantic analysis phases, using metadata from the Metastore. The compiler generates a logical plan, which is optimized by a rule-based optimizer, and an optimized plan in the form of a DAG of MapReduce and HDFS tasks is produced. The execution engine then executes these tasks in the correct order using Hadoop.

### Metastore

The Metastore stores all information about tables, their partitions, schemas, columns and their types, etc. It runs on a traditional RDBMS (so that the latency of metadata queries is very small) and uses an open-source ORM layer called DataNucleus. The Metastore is backed up regularly. To make sure that the system scales with the number of queries, no metadata queries are made from the mappers/reducers of a job; any metadata needed by a mapper or reducer is passed through XML plan files generated by the compiler.

### Query Compiler

The Hive query compiler works similarly to traditional database compilers.
* Antlr is used to generate the abstract syntax tree (AST) of the query.
* A logical plan is created using information from the Metastore. An intermediate representation called the query block (QB) tree is used when transforming the AST into an operator DAG; nested queries define the parent-child relationships in the QB tree.
* The optimization logic consists of a chain of transformations such that the output of one transformation is the input of the next. Each transformation comprises a **walk** over the operator DAG. Each visited **node** in the DAG is tested against different **rules**; if a rule is satisfied, its corresponding **processor** is invoked. The **Dispatcher** maintains a mapping of rules to their processors and performs rule matching, while the **GraphWalker** manages the overall traversal.
* The logical plan generated in the previous step is split into multiple MapReduce and HDFS tasks. Nodes in the plan correspond to physical operators and edges represent the flow of data between operators.

### Optimizations

* **Column Pruning**: Only the columns needed in query processing are projected.
* **Predicate Pushdown**: Predicates are pushed down to the scan so that rows are filtered as early as possible.
* **Partition Pruning**: Predicates on partitioned columns are used to prune out files of partitions that do not satisfy the predicate.
* **Map-Side Joins**: If some tables involved in a join are very small, they are replicated in all the mappers and the join is performed in the map phase itself.
* **Join Reordering**: Larger tables are streamed, rather than materialized in memory, in the reducers to reduce memory requirements.

Some optimizations are not enabled by default but can be activated by setting certain flags. These include:

* Repartitioning data to handle skew in `GROUP BY` processing. This is achieved by performing the `GROUP BY` in two MapReduce stages: in the first, data is distributed randomly to the reducers and partial aggregations are computed.
  In the second stage, these partial aggregations are distributed on the `GROUP BY` columns to the appropriate reducers.
* Hash-based partial aggregation in the mappers, which reduces the data sent from the mappers to the reducers and, in turn, the time spent sorting and merging that data.

### Execution Engine

The execution engine executes the tasks in the order of their dependencies. A MapReduce task first serializes its part of the plan into a plan.xml file. This file is then added to the job cache, and mappers and reducers are spawned to execute the relevant sections of the operator DAG. The final results are stored in a temporary location and then moved to the final destination (in the case of, say, an insert query).

# Future Work

The paper mentions the following areas for improvement:

* HiveQL should subsume SQL syntax.
* Implementing a cost-based optimizer and using adaptive optimization techniques.
* Columnar storage to improve scan performance.
* Enhancing the JDBC/ODBC drivers for Hive to integrate with other BI tools.
* Multi-query optimization and generic n-way joins in a single MapReduce job.
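For intuition about the `MAP ... USING` / `REDUCE ... USING` extension shown in the Query Language section: the referenced scripts are ordinary programs that read tab-separated rows on stdin and emit tab-separated rows on stdout. A hypothetical minimal pair (a sketch, not from the paper) could be structured as:

```python
def map_words(line):
    # Corresponds to 'python mapper.py': emit a (word, 1) pair for every
    # word in the input row, matching the AS (word, count) clause.
    for word in line.split():
        yield word, 1

def reduce_counts(pairs):
    # Corresponds to 'python reduce.py': CLUSTER BY word guarantees that
    # all pairs for a given word reach the same reducer, so a plain dict
    # is enough to produce the final counts.
    totals = {}
    for word, count in pairs:
        totals[word] = totals.get(word, 0) + int(count)
    return totals
```

In a real deployment each function would be wrapped in a small script that loops over `sys.stdin` and prints tab-separated output.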
## Introduction

A variable $x$ is said to obey a power law if it is drawn from a probability density function (pdf) of the form $p(x) = Cx^{-\alpha}$, where $C$ is called the **normalization constant** and $\alpha$ is called the **scaling parameter** or exponent. Often, the power law applies only for values greater than some minimum value of $x$, called $x_\text{min}$. The paper describes various statistical techniques to test whether a given distribution follows a power law or not.

Power-law distributions come in both continuous and discrete flavors, with the discrete case being more involved than the continuous one. So, discrete power-law behavior is often approximated by continuous power-law behavior for the sake of convenience. One reliable approximation is to assume that the discrete values of $x$ are generated from a continuous power law and then rounded to the nearest integer. Sometimes the complementary cumulative distribution function (CCDF) $P(x) = \Pr(X \ge x)$ is considered instead of the pdf.

## Fitting power laws to empirical data

A power-law distribution appears as a straight line on a log-log plot, and the slope can be estimated using least-squares linear regression. But a good straight-line fit does not guarantee that the data follows a power-law distribution. Moreover, the assumption of independent, Gaussian noise, which is a prerequisite for linear regression, does not hold in this case.

### Estimating the scaling parameter

Assuming that we know the value of $x_\text{min}$, the value of $\alpha$ can be obtained by the *method of maximum likelihood*. The maximum likelihood estimator (MLE) for the continuous case is

$$\hat{\alpha} = 1 + n\left[\sum_{i=1}^{n} \ln \frac{x_i}{x_\text{min}}\right]^{-1}$$

and that for the discrete case is

$$\hat{\alpha} \simeq 1 + n\left[\sum_{i=1}^{n} \ln \frac{x_i}{x_\text{min} - \frac{1}{2}}\right]^{-1}$$

where the $x_i$ are the $n$ observations with $x_i \ge x_\text{min}$. The equation for the discrete case is only an approximation, as there is no exact closed-form MLE for the discrete case.
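As an illustration, a minimal sketch of the continuous-case estimator (a hypothetical helper, not code from the paper):

```python
import math

def alpha_mle(data, xmin):
    """Continuous-case MLE for the power-law scaling parameter.

    Implements alpha-hat = 1 + n / sum(ln(x_i / xmin)) over the
    n observations with x >= xmin.
    """
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)
```

For instance, sampling from a power law via inverse-transform sampling ($x = x_\text{min}(1-u)^{-1/(\alpha-1)}$ for uniform $u$) and feeding the sample to `alpha_mle` recovers the exponent up to the $O(1/\sqrt{n})$ statistical error.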
The MLE method outperforms several linear-regression-based approaches: line fitting on the log-log plot, line fitting after logarithmic binning (done to reduce fluctuations in the tail of the distribution), line fitting to the CDF with constant-size bins, and line fitting to the CDF without any bins. For any finite sample size $n$ and any choice of $x_\text{min}$ there is a bias present, but it decays as $O(1/n)$ and can be ignored for $n \ge 50$.

### Estimating $x_\text{min}$

If we choose a value of $x_\text{min}$ lower than the true value, we will get a biased estimate of $\alpha$, as we will be fitting a power law to a non-power-law region as well. If we choose a value larger than the true one, we will be discarding legitimate data points (leading to larger statistical errors). Of the two, overestimating $x_\text{min}$ is the more acceptable error.

One approach is to plot the PDF or CDF on a log-log plot and mark the point beyond which the distribution becomes roughly straight, or to plot $\alpha$ as a function of $x_\text{min}$ and mark the point beyond which the value appears relatively stable. But these approaches are not objective, since "roughly straight" and "relatively stable" are not quantified. The approach proposed by Clauset et al. [Clauset, A., M. Young, and K. S. Gleditsch, 2007, Journal of Conflict Resolution 51, 58] is as follows: choose the value of $x_\text{min}$ that makes the probability distribution of the measured data and the best-fit power-law model as similar as possible. Similarity between the distributions can be measured using the Kolmogorov-Smirnov (KS) statistic, defined as

$$D = \max_{x \ge x_\text{min}} |S(x) - P(x)|$$

where $S(x)$ is the CDF of the data with values greater than or equal to $x_\text{min}$ and $P(x)$ is the CDF of the power-law model that best fits the data in the region $x \ge x_\text{min}$.
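A rough sketch of this selection procedure for the continuous case (hypothetical helper names; the paper's full method also covers discrete data): for every candidate $x_\text{min}$, fit $\alpha$ by MLE on the tail, compute the KS distance between the empirical tail CDF and the fitted power-law CDF $P(x) = 1 - (x/x_\text{min})^{1-\alpha}$, and keep the candidate with the smallest distance.

```python
import math

def ks_distance(tail, xmin, alpha):
    # D = max |S(x) - P(x)| over x >= xmin, where S is the empirical
    # CDF of the tail and P(x) = 1 - (x/xmin)^(1 - alpha).
    tail = sorted(tail)
    n = len(tail)
    return max(abs((i + 1) / n - (1.0 - (x / xmin) ** (1.0 - alpha)))
               for i, x in enumerate(tail))

def fit_xmin(data, min_tail=10):
    """Pick the xmin (and its MLE alpha) minimizing the KS distance."""
    best = None
    for xmin in sorted(set(data)):
        tail = [x for x in data if x >= xmin]
        if len(tail) < min_tail:  # too few points left to fit reliably
            break
        # Continuous MLE for alpha on the current tail.
        alpha = 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)
        d = ks_distance(tail, xmin, alpha)
        if best is None or d < best[0]:
            best = (d, xmin, alpha)
    return best[1], best[2]
```

On data with a non-power-law body and a power-law tail, the minimizer lands near the point where the power-law behavior begins.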
This method works well for both continuous and discrete data and is recommended for the general case. Other measures, like a weighted KS or the Kuiper statistic, can also be used in place of the KS statistic.

## Testing the power-law hypothesis

MLE and the approaches above do not tell us whether a power law is a plausible fit to the given data; all they do is find the best-fit values of $x_\text{min}$ and $\alpha$ *assuming* the data comes from a power-law distribution. A basic approach is to calculate the values of $x_\text{min}$ and $\alpha$, use them to hypothesize a power-law distribution from which the data is drawn, and then check the validity of this hypothesis using goodness-of-fit tests.

### Goodness-of-fit tests

A large number of synthetic data sets is generated from the hypothesized power-law distribution. Each of these data sets is then fitted to its own power-law model individually, and the KS statistic is calculated for each. The *p*-value is defined as the fraction of synthetic data sets whose distance (KS statistic) is greater than the distance for the given data set. A large value of *p* (close to 1) means that the difference between the given data and the hypothesized model could be due to statistical fluctuations alone, while a small value of *p* (close to 0) means that the model is not a plausible fit to the data.

### Data set generation

The generated data sets need to have a distribution similar to the given data below $x_\text{min}$ and follow the fitted power law above $x_\text{min}$. Suppose the given data has $n_\text{tail}$ observations with $x \ge x_\text{min}$ and $n$ observations in total. With probability $n_\text{tail}/n$, a random number $x_i$ is generated from the hypothesized power-law distribution; with probability $1 - n_\text{tail}/n$, a number is picked uniformly at random from the given data points with $x < x_\text{min}$.
This way, the generated data set of $n$ elements is expected to follow the power law for $x \ge x_\text{min}$ and the same distribution as the given data below $x_\text{min}$. If we want the *p*-values to be accurate to within about $\epsilon$ of the true value, then we should generate at least $\frac{1}{4}\epsilon^{-2}$ synthetic data sets. The power law is ruled out if $p \le 0.1$. A large *p*-value does not mean that the power law is the correct distribution for the data: there can be other distributions that fit the data equally well or even better. Moreover, for small values of $n$, it is possible that the empirical distribution will follow a power law closely, and hence that the *p*-value will be large, even when the power law is the wrong model for the data.

## Alternative distributions

The *p*-value test can only be used to reject the power-law hypothesis, not to accept it. So even if the *p*-value is greater than 0.1, we can only say that the power-law hypothesis is not rejected; it could still be the case that some other distribution fits the data equally well or even better. To examine this possibility, we can calculate a *p*-value for a fit to an alternative distribution and compare it with the *p*-value for the power law. If the *p*-value for the power law is high and the *p*-value for the other distribution is low, we can say that the data is more likely to be drawn from the power-law distribution (though we still cannot be sure that it is **definitely** drawn from a power law).

### Likelihood Ratio Test

This test directly compares two candidate distributions to see which is a better fit for the given data. The idea is to compute the likelihood of the given data under the two competing distributions; the one with the higher likelihood is taken to be the better fit. Equivalently, the ratio of the two likelihoods, or the logarithm $R$ of that ratio, can be used. If $R$ is close to zero, it could fall on either side of zero purely due to statistical fluctuations.
So the value of $R$ needs to be sufficiently far from zero. To check this, Vuong's method [Vuong, Q. H., 1989, Econometrica 57, 307] is used, which gives a *p*-value telling whether the conclusion drawn from the sign of $R$ is statistically significant. If this *p*-value is small ($p < 0.1$), the result is significant; otherwise, the result is not taken to be reliable and the test does not favor either distribution. Besides the likelihood ratio test, other approaches like minimum description length (MDL) or cross-validation can also be used.
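Putting the goodness-of-fit recipe together, here is a simplified sketch (hypothetical function names; continuous case only, and unlike the full procedure each synthetic set is re-fitted for $\alpha$ alone, with $x_\text{min}$ held fixed):

```python
import math
import random

def _alpha_mle(tail, xmin):
    # Continuous MLE: alpha-hat = 1 + n / sum(ln(x_i / xmin)).
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

def _ks(tail, xmin, alpha):
    # KS distance between the empirical tail CDF and the fitted model
    # CDF P(x) = 1 - (x/xmin)^(1 - alpha).
    tail = sorted(tail)
    n = len(tail)
    return max(abs((i + 1) / n - (1.0 - (x / xmin) ** (1.0 - alpha)))
               for i, x in enumerate(tail))

def gof_pvalue(data, xmin, n_sets=100, seed=0):
    """Fraction of synthetic data sets fitting worse than the real data."""
    rng = random.Random(seed)
    tail = [x for x in data if x >= xmin]
    alpha = _alpha_mle(tail, xmin)
    d_obs = _ks(tail, xmin, alpha)
    p_tail = len(tail) / len(data)
    exceed = 0
    for _ in range(n_sets):
        # Semi-parametric bootstrap: with probability n_tail / n draw from
        # the fitted power law (inverse-transform sampling); draws that
        # would come from the body below xmin are skipped, since they do
        # not enter the KS comparison on the tail.
        synth = [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
                 for _ in range(len(data)) if rng.random() < p_tail]
        d = _ks(synth, xmin, _alpha_mle(synth, xmin))
        if d > d_obs:
            exceed += 1
    return exceed / n_sets
```

Under this sketch, data genuinely drawn from a power law yields *p*-values spread across $[0, 1]$, while data with a clearly non-power-law tail (e.g., exponential) is pushed towards $p \approx 0$.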