[link]
We want to find two matrices $W$ and $H$ such that $V = WH$. Often the goal is to uncover underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m \times n$ matrix and we want the inner dimension of the factorization to be $r$:

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (the rows) are associated with three movies (the columns) by a rating value.

$$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$

We can decompose this into two matrices with $r = 1$. First, let's do this without any nonnegativity constraint, using a truncated SVD that keeps only the largest singular value:

$$ W = \left[\begin{array}{c} 0.656 \\\\ 0.652 \\\\ 0.379 \end{array}\right], H = \left[\begin{array}{c c c} 6.64 & 6.26 & 3.20 \end{array}\right] $$

We can also decompose this into two matrices with $r = 1$ subject to the constraints $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$.)

$$ W = \left[\begin{array}{c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \end{array}\right] $$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error.

$$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\\ 4.33 & 4.08 & 2.09 \\\\ 2.52 & 2.37 & 1.21 \end{array}\right] $$

If both yield the same reconstruction error, then why is a nonnegativity constraint useful? We can see above that it is easy to observe patterns in both factorizations, such as similar customers and similar movies. `TODO: motivate why NMF is better`

#### Paper Contribution

This paper discusses two approaches for iteratively producing nonnegative $W$ and $H$ from random initial matrices.
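One such scheme (the paper's multiplicative update rule for the squared-error cost) can be sketched in NumPy and checked against the $r = 1$ example above; starting from random nonnegative matrices it reaches the same reconstruction error as the truncated SVD. The seed and iteration count are arbitrary choices.

```python
import numpy as np

V = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [2., 1., 5.]])

# Rank-1 truncated SVD: keep only the largest singular triple.
U, s, Vt = np.linalg.svd(V)
W_svd, H_svd = U[:, :1], s[0] * Vt[:1, :]
if W_svd[0, 0] < 0:                  # fix the SVD sign ambiguity
    W_svd, H_svd = -W_svd, -H_svd

# Rank-1 NMF via multiplicative updates from a random nonnegative start.
rng = np.random.default_rng(0)
W = rng.random((3, 1))
H = rng.random((1, 3))
for _ in range(1000):
    H *= (W.T @ V) / (W.T @ W @ H)   # update H with W fixed
    W *= (V @ H.T) / (W @ H @ H.T)   # update W with H fixed

err_svd = np.sum((V - W_svd @ H_svd) ** 2)
err_nmf = np.sum((V - W @ H) ** 2)
print(np.round(W_svd @ H_svd, 2))    # both reconstructions agree
```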
The paper discusses a multiplicative update rule in which the elements of $W$ and $H$ are iteratively rescaled such that the error is never increased. The multiplicative approach is discussed in contrast to an additive, gradient-descent-based approach in which small corrections are iteratively applied. The additive rule reduces to the multiplicative one by setting the learning rate $\eta$ to a specific elementwise ratio, e.g. $\eta_{a\mu} = H_{a\mu} / (W^TWH)_{a\mu}$ for the update of $H$.

### Still a draft
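This reduction can be checked numerically. A minimal sketch with the squared-error objective; the matrix sizes and seed are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((4, 5))
W = rng.random((4, 2))
H = rng.random((2, 5))

# Additive gradient step on H with the specific elementwise learning rate
# eta = H / (W^T W H).
eta = H / (W.T @ W @ H)
H_additive = H + eta * (W.T @ V - W.T @ W @ H)

# Multiplicative update rule for H.
H_multiplicative = H * (W.T @ V) / (W.T @ W @ H)

print(np.allclose(H_additive, H_multiplicative))  # the two rules coincide
```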
[link]
The paper introduces nonnegative matrix factorization, a technique used in fields such as chemometrics. The problem formulation is this: $$ \underset{W,H}{\text{argmin}} ~ d(X, WH) \quad \text{s.t.} ~ W_{ij}, H_{ij} \ge 0 $$ Where:

- $X \in \mathbb{R}^{n \times m}$ is a matrix of data, for example, $n$ samples of $m$ features. Each element of $X$ is nonnegative, as are the elements of $W$ and $H$.
- $W \in \mathbb{R}^{n \times k}$ represents how each of the $n$ samples belongs to each of the $k$ "clusters".
- $H \in \mathbb{R}^{k \times m}$ describes each of the $k$ clusters in terms of the $m$ variables.
- $d$ is some cost function, for example, sum of squared differences.

The nonnegativity constraint means the clusters (represented by the rows of $H$) are described in terms of which features are present. This may make interpretation easier in some instances, but it makes the optimization problem more difficult. The paper mentions two loss functions, the sum of squared errors: $$ d(X,WH) = \sum_{ij} \left( X_{ij} - (WH)_{ij} \right)^2 $$ and a measure similar to an unnormalized Kullback-Leibler divergence: $$ d(X,WH) = \sum_{ij} \left( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \right) $$ For each of these objectives, multiplicative update rules are given. For squared error: $$ H_{ij} \leftarrow H_{ij} \frac{(W^TX)_{ij}}{(W^TWH)_{ij}} ~~~ W_{ij} \leftarrow W_{ij} \frac{(XH^T)_{ij}}{(WHH^T)_{ij}} $$ And for the divergence: $$ H_{ij} \leftarrow H_{ij} \frac{\sum_a W_{ai} X_{aj} / (WH)_{aj}} { \sum_b W_{bi}} ~~~~ W_{ij} \leftarrow W_{ij} \frac{\sum_a H_{ja} X_{ia} / (WH)_{ia}} { \sum_b H_{jb}} $$ These rules are applied alternately: fix $W$ and update $H$, then fix $H$ and update $W$. These multiplicative updates are essentially a diagonally rescaled gradient descent. The authors then prove that these update rules do not increase the objective. Later authors have pointed out that not increasing the cost does not imply convergence; e.g. the parameters could stop updating without having reached a minimum. However, a trivial fix to the multiplicative update rules (ensuring no division by zero, by keeping zero elements slightly positive) alleviates these problems.
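A minimal sketch of the stabilized alternating updates for the squared-error cost; the floor `eps`, the data, and the sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 4))    # illustrative nonnegative data
W = rng.random((6, 2))
H = rng.random((2, 4))
eps = 1e-12               # floor keeping every entry strictly positive

costs = []
for _ in range(100):
    # Multiplicative least-squares updates, clamped away from exact zero
    # so subsequent divisions can never hit 0/0.
    H = np.maximum(H * (W.T @ V) / (W.T @ W @ H), eps)
    W = np.maximum(W * (V @ H.T) / (W @ H @ H.T), eps)
    costs.append(np.sum((V - W @ H) ** 2))
```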
[link]
Algorithms for Non-Negative Matrix Factorization – ShortScience Summary: This paper analyzes two separate iterative methods for calculating NMF decompositions: one that minimizes least-squares error and one that minimizes a generalized KL-divergence. The cost functions given for the two methods are as follows. Least-squares error: $$ \|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2 $$ Divergence: $$ D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right) $$ The authors found that the multiplicative update rules given in Theorem 1 and Theorem 2 of the paper, applied at each iteration of the corresponding optimization algorithm, result in a good compromise between run speed and ease of implementation. The authors go on to prove that the updates do not increase the cost, arguing convergence to at least a locally optimal solution for both of the above choices of cost function.
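The two cost functions can be written directly in NumPy; the example matrices are purely hypothetical, and the divergence form assumes strictly positive entries:

```python
import numpy as np

def squared_error(A, B):
    # ||A - B||^2 = sum_ij (A_ij - B_ij)^2
    return np.sum((A - B) ** 2)

def generalized_kl(A, B):
    # D(A || B) = sum_ij (A_ij log(A_ij / B_ij) - A_ij + B_ij);
    # reduces to the ordinary KL divergence when A and B each sum to 1.
    return np.sum(A * np.log(A / B) - A + B)

# Hypothetical matrices purely for illustration.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.5, 2.0], [2.5, 4.0]])
print(squared_error(A, B))   # 0.5
print(generalized_kl(A, A))  # 0.0 exactly when A == B
```

Like the paper's divergence, `generalized_kl` is not symmetric: `generalized_kl(A, B)` and `generalized_kl(B, A)` generally differ.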
[link]
Two different multiplicative algorithms for NMF are proposed. Both of them can be seen as rescaled gradient descent, choosing an optimal rescaling factor for the updates of $W$ and $H$ in each iteration, i.e. updates of the form $W \leftarrow \alpha W$ and $H \leftarrow \beta H$ (elementwise). The authors propose the two algorithms based on two definitions of the cost function used to quantify the quality of the approximation. The first is constructed from the Euclidean distance $\|A - B\|^2$; the second from the divergence $D(A \| B)$. The new algorithms can then be regarded as minimizing $\|V - WH\|^2$ with respect to $W$ and $H$, and minimizing $D(V \| WH)$ with respect to $W$ and $H$, by gradient descent. For each iteration, the authors design update rules (Theorem 1 and Theorem 2) to obtain the new $W$ and $H$, and they provide proofs showing that the updates of Theorem 1 and Theorem 2 never increase the corresponding cost.
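The divergence (Theorem 2) iteration can be sketched as follows; the helper `kl_div`, the data, and the sizes are illustrative assumptions, and the recorded costs should be nonincreasing as the theorem claims:

```python
import numpy as np

def kl_div(V, WH):
    # Generalized KL divergence D(V || WH); assumes strictly positive entries.
    return np.sum(V * np.log(V / WH) - V + WH)

rng = np.random.default_rng(0)
V = rng.random((6, 5)) + 0.1   # illustrative strictly positive data
W = rng.random((6, 2))
H = rng.random((2, 5))

costs = []
for _ in range(50):
    # Fix W, update H; then fix H, update W (divergence update rules).
    H *= (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
    W *= ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]
    costs.append(kl_div(V, W @ H))
```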
[link]
From this paper, my understanding is that NMF performs unsupervised learning by decomposing the data matrix into two nonnegative submatrices. The final result is a compressed representation of the original data matrix. "Each data vector $v$ can be generated by a linear combination of the columns of $W$, weighted by the components of $h$. Therefore $W$ can be regarded as containing a basis that is optimized for the linear approximation of the data in $V$. Since relatively few basis vectors are used to represent many data vectors, good approximation can only be achieved if the basis vectors discover structure that is latent in the data." I regard this method as generating new features for the dataset; it could be used for dimension reduction like PCA, but with nonnegative results. To achieve the factorization, they propose two cost functions: one is just the ordinary Euclidean distance between the original data matrix and the product of the two factor matrices; the other uses the KL-divergence toward the same goal, the main difference being that the latter is not symmetric. Despite the different cost functions, the update formulas are similar. The idea, in my understanding, comes from ordinary gradient descent. They connect the updating rule $H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu}\left[(W^TV)_{a\mu} - (W^TWH)_{a\mu}\right]$ with gradient descent, where $\eta_{a\mu}$ acts like a learning rate and the bracketed term is the negative gradient of the cost function with respect to $H$. Finally, they utilize the concept of an auxiliary function and the updating rule $h^{t+1} = \arg\min_h G(h, h^t)$ to guarantee that the cost function $F$ is nonincreasing.
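The dimension-reduction reading can be sketched numerically: data that truly lies in a small nonnegative basis is summarized by a few nonnegative coefficients per sample. The synthetic data, sizes, and iteration count here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data that exactly admits a 2-dimensional nonnegative basis.
W_true = rng.random((6, 2))
H_true = rng.random((2, 4))
V = W_true @ H_true

# Factor V with k = 2 using the squared-error multiplicative updates.
W = rng.random((6, 2))
H = rng.random((2, 4))
initial_cost = np.sum((V - W @ H) ** 2)
for _ in range(2000):
    H *= (W.T @ V) / (W.T @ W @ H)
    W *= (V @ H.T) / (W @ H @ H.T)
final_cost = np.sum((V - W @ H) ** 2)

# Each 4-feature sample (a row of V) is now summarized by 2 nonnegative
# coefficients (a row of W) mixing the 2 basis rows of H: dimension
# reduction in the spirit of PCA, but with nonnegative parts.
```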
[link]
NMF aims to find two matrices $W$ and $H$ such that $V = WH$. There are two cost functions: the least-squares error $\|V - WH\|^2$ and the divergence $D(V \| WH)$. However, when we try to optimize these functions, they are convex in $W$ only or in $H$ only; they are not convex in both variables together. To address this, the authors derive multiplicative (as opposed to additive) update rules under which $\|V - WH\|^2$ and $D(V \| WH)$ are nonincreasing. The additive form of the divergence update for $H$ is: $$H_{\alpha \mu}\leftarrow H_{\alpha \mu} + \eta_{\alpha \mu}\bigg[\sum\limits_{i} W_{i\alpha}\frac{V_{i\mu}}{(WH)_{i\mu}} - \sum\limits_{i} W_{i\alpha}\bigg]$$
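Choosing $\eta_{\alpha\mu} = H_{\alpha\mu} / \sum_i W_{i\alpha}$ turns this additive rule into the multiplicative divergence update, which can be verified numerically; the example matrices and seed are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.random((4, 5))
W = rng.random((4, 2))
H = rng.random((2, 5))

num = W.T @ (V / (W @ H))     # num[a, mu] = sum_i W_ia * V_imu / (WH)_imu
den = W.sum(axis=0)[:, None]  # den[a]     = sum_i W_ia

# Additive rule with the specific learning rate eta = H / den ...
eta = H / den
H_additive = H + eta * (num - den)

# ... reproduces the multiplicative divergence update.
H_multiplicative = H * num / den

print(np.allclose(H_additive, H_multiplicative))
```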