* In machine learning, accuracy tends to increase with an increase in the number of training examples and number of model parameters.
* For large data, training becomes slow on even GPU (due to increase CPU-GPU data transfer).
* Solution: Distributed training and inference - DistBelief
* [Link to paper](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf)
* Framework for parallel and distributed computation in neural networks.
* Exploits both model parallelism and data parallelism.
* Two methods implemented over DistBelief - *Downpour SGD* and *Sandblaster L-BFGS*
## Previous work in this domain
* Distributed Gradient Computation using relaxed synchronization requirements and delayed gradient descent.
* Lock-less asynchronous stochastic gradient descent in case of sparse gradient.
* Deep Learning
* Most focus has been on training small models on a single machine.
* Train multiple, small models on a GPU farm and combine their predictions.
* Alternate approach is to modify standard deep networks to make them more parallelizable.
* Distributed computational tools
* MapReduce - Not suited for iterative computations
* GraphLab - Can not exploit computing efficiencies available in deep networks.
## Model Parallelism
* User specifies the computation at each node in each layer of the neural network and the messages to be exchanged between different nodes.
* Model parallelism supported using multithreading (within a machine) and message passing (across machines).
* Model can be partitioned across machines with DistBelief taking care of all communication/synchronization tasks.
* Diminishing returns in terms of speed gain if too many machines/partitions are used as partitioning benefits as long as network overhead is small.
## Data Parallelism
* Distributed training across multiple model instances (or replicas) to optimize a single function.
## Distributed Optimization Algorithms
### Downpour SGD
* Asynchronous stochastic gradient descent.
* Divide training data into subsets and run a copy of model on each subset.
* Centralized sharded parameter server
* Keeps the current state of all the parameters of the model.
* Used by model instances to share parameters.
* Asynchronous approach as both model replicas and parameter server shards run independently.
* Model replica requests an updated copy of model parameters from parameter server.
* Processes mini batches and sends the gradient to the server.
* Server updates the parameters.
* Communication can be limited by limiting the rate at which parameters are requested and updates are pushed.
* Robust to machine failure since multiple instances are running.
* Relaxing consistency requirements works very well in practice.
* Warm starting - Start training with a single replica and add other replicas later.
### Sandblaster L-BFGS
* Sandblaster is the distributed batch Optimization framework.
* Provides distributed implementation of L-BFGS.
* A single 'coordinator' process interacts with replicas and the parameter server to coordinate the batch optimization process.
* Coordinator issues commands like dot product to the parameter server shards.
* Shards execute the operation, store the results locally and maintain additional information like history cache.
* Parameters are fetched at the beginning of each batch and the gradients are sent every few completed cycles.
* Saves overhead of sending all parameters and gradients to a single server.
* Robust to machine failure just like Downpour SGD.
* Coordinator uses a load balancing scheme similar to "backup tasks" in MapReduce.
Downpour SGD (with Adagrad adaptive learning rate) outperforms Downpour SGD (with fixed learning rate) and Sandblaster L-BFGS. Moreover, Adagrad can be easily implemented locally within each parameter shard. It is surprising that Adagrad works so well with Downpour SGD as it was not originally designed to be used with asynchronous SGD. The paper conjectures that "*Adagrad automatically stabilizes volatile parameters in the face of the flurry of asynchronous updates, and naturally adjusts learning rates to the demands of different layers in the deep network.*"
## Beyond DistBelief - TensorFlow
* In November 2015, [Google open sourced TensorFlow](http://googleresearch.blogspot.in/2015/11/tensorflow-googles-latest-machine_9.html) as its second-generation machine learning system. DistBelief had limitations like tight coupling with Google's infrastructure and was difficult to configure. TensorFlow takes care of these issues and is twice fast as DistBelief.Google also published a [white paper](http://download.tensorflow.org/paper/whitepaper2015.pdf) on the topic and the implementation can be found [here](http://www.tensorflow.org/).
This paper is about Convolutional Neural Networks for Computer Vision. It was the first break-through in the ImageNet classification challenge (LSVRC-2010, 1000 classes).
ReLU was a key aspect which was not so often used before. The paper also used Dropout in the last two layers.
## Training details
* Momentum of 0.9
* Learning rate of $\varepsilon$ (initialized at 0.01)
* Weight decay of $0.0005 \cdot \varepsilon$.
* Batch size of 128
* The training took 5 to 6 days on two NVIDIA GTX 580 3GB GPUs.
## See also
* [Stanford presentation](http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf)