ShortScience.org - Making Science Accessible!


Algorithms for Non-negative Matrix Factorization
Lee, Daniel D. and Seung, H. Sebastian
Neural Information Processing Systems Conference - 2000 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 7 years ago

We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (the rows) are associated with three movies (the columns) by a rating value.

$$
V = \left[\begin{array}{c c c}
5 & 4 & 1  \\\\
4 & 5 & 1 \\\\
2 & 1 & 5
\end{array}\right]
$$


We can decompose this into two matrices with $r = 1$. First let's do this without any non-negativity constraint, using a truncated SVD that keeps only the largest singular value:


$$
W = \left[\begin{array}{c c c}
-0.656 \\\\
-0.652 \\\\
 -0.379
\end{array}\right],
H = \left[\begin{array}{c c c}
-6.64 & -6.26 & -3.20\\\\
\end{array}\right]
$$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and  $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$
W = \left[\begin{array}{c c c}
0.388 \\\\
0.386 \\\\
0.224
\end{array}\right],
H = \left[\begin{array}{c c c}
11.22 & 10.57 & 5.41  \\\\
\end{array}\right]
$$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. 

$$
V \approx WH = \left[\begin{array}{c c c}
4.36 & 4.11 & 2.10 \\\\
4.33 & 4.08 & 2.09 \\\\
2.52 & 2.37 & 1.21
\end{array}\right]
$$
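The claim that both factorizations give the same reconstruction can be checked numerically. The following is a minimal sketch, assuming NumPy. The summary does not say how its non-negative factors were computed, so here the non-negative pair is simply obtained by flipping the signs of the rank-1 SVD factors and rescaling $W$ to sum to 1 (the variable names are illustrative, not from the paper):

```python
import numpy as np

V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])

# Rank-1 truncated SVD: keep only the largest singular value.
U, s, Vt = np.linalg.svd(V)
W_svd = U[:, :1]            # shape (3, 1)
H_svd = s[0] * Vt[:1, :]    # shape (1, 3)

# Singular vectors are only defined up to a joint sign flip. For this
# all-positive V the leading pair has a single sign, so taking absolute
# values and rescaling W to sum to 1 keeps the product W @ H unchanged.
scale = np.abs(W_svd).sum()
W_nn = np.abs(W_svd) / scale
H_nn = np.abs(H_svd) * scale

recon_svd = W_svd @ H_svd
recon_nn = W_nn @ H_nn
error = np.linalg.norm(V - recon_svd)   # Frobenius reconstruction error

print(np.round(recon_nn, 2))
print("same reconstruction:", np.allclose(recon_svd, recon_nn))
```

This only works so cleanly because $r = 1$ and every entry of $V$ is positive; for larger $r$ the SVD factors generally contain mixed signs and cannot be made non-negative by a sign flip.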


If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`



#### Paper Contribution 

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. 

The multiplicative approach is discussed in contrast to an additive gradient descent based approach where small corrections are iteratively applied. The multiplicative approach can be reduced to this by using a per-element learning rate $\eta_{a\mu} = H_{a\mu} / (W^T W H)_{a\mu}$, i.e. the ratio of the current element of $H$ to the corresponding element of $W^T W H$.
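For the squared-error objective $\|V - WH\|^2$, the multiplicative rules from the paper are $H \leftarrow H \odot \frac{W^T V}{W^T W H}$ and $W \leftarrow W \odot \frac{V H^T}{W H H^T}$, with element-wise product and division. A minimal NumPy sketch of these updates follows. The function name `nmf_multiplicative`, the fixed iteration count, the seed, and the small `eps` guarding against division by zero are assumptions for illustration and do not come from the paper:

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=500, eps=1e-9, seed=0):
    """Sketch of multiplicative-update NMF for the squared-error objective.

    Returns non-negative W (m x r), H (r x n) and the Frobenius error
    recorded after every iteration.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Random non-negative initial matrices.
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Scale each element by a non-negative ratio, so signs never flip.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])
W, H, errors = nmf_multiplicative(V, r=1)
print(np.round(W @ H, 2))
```

Because every update multiplies by a non-negative factor, $W$ and $H$ stay non-negative without any explicit projection step, which is the practical appeal over plain additive gradient descent.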



### Still a draft


Collaborative Filtering for Implicit Feedback Datasets
Hu, Yifan and Koren, Yehuda and Volinsky, Chris
International Conference on Data Mining - 2008 via Local Bibsonomy
Keywords: collaborativfiltering, alternaterootsquare

[link] Summary by Martin Thoma 4 years ago

This paper is about a recommendation system approach using collaborative filtering (CF) on implicit feedback datasets.

The core of it is the minimization problem

$$\min_{x_*, y_*} \sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$

with

* $\lambda \in [0, \infty[$ is a hyperparameter which defines how strongly the model is regularized
* $u$ denoting a user, $x_*$ are all user factors $x_u$ combined
* $i$ denoting an item, $y_*$ are all item factors $y_i$ combined
* $x_u \in \mathbb{R}^n$ is the latent user factor (embedding); $n$ is another hyperparameter. $n=50$ seems to be a reasonable choice.
* $y_i \in \mathbb{R}^n$ is the latent item factor (embedding)
* $r_{ui}$ defines the "intensity"; higher values mean user $u$ interacted more with item $i$
* $p_{ui} = \begin{cases}1 & \text{if } r_{ui} >0\\0 &\text{otherwise}\end{cases}$
* $c_{ui} := 1 + \alpha r_{ui}$ where $\alpha \in [0, \infty[$ is a hyperparameter; $\alpha =40$ seems to be reasonable
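To make the definitions concrete, here is a small NumPy sketch that builds the preference matrix $p_{ui}$ and confidence matrix $c_{ui}$ from a dense interaction matrix and evaluates the objective above. The function names, the dense representation and the default `lam=0.1` are assumptions for illustration (real implicit-feedback data would be sparse):

```python
import numpy as np

def preference_and_confidence(R, alpha=40.0):
    """p_ui = 1 if r_ui > 0 else 0, and c_ui = 1 + alpha * r_ui."""
    P = (R > 0).astype(float)
    C = 1.0 + alpha * R
    return P, C

def implicit_loss(R, X, Y, lam=0.1, alpha=40.0):
    """Confidence-weighted squared error plus L2 regularization.

    R: (users x items) interaction intensities r_ui
    X: (users x n) user factors, Y: (items x n) item factors
    """
    P, C = preference_and_confidence(R, alpha)
    residual = P - X @ Y.T
    reg = lam * ((X ** 2).sum() + (Y ** 2).sum())
    return float((C * residual ** 2).sum() + reg)

R = np.array([[0.0, 2.0],
              [1.0, 0.0]])
P, C = preference_and_confidence(R)
print(P)  # binary preferences
print(C)  # confidences, 1 for unobserved pairs
```

Note that the sum runs over all user-item pairs, including the unobserved ones, which enter with preference 0 and the minimum confidence of 1.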

In contrast, the standard matrix factorization optimization function looks like this ([example](https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture25-mf.pdf)):

$$\min_{x_*, y_*} \sum_{(u, i, r_{ui}) \in \mathcal{R}} {(r_{ui} - x_u^T y_i)}^2  + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$

where

* $\mathcal{R}$ is the set of all ratings $(u, i, r_{ui})$ - user $u$ has rated item $i$ with value $r_{ui} \in \mathbb{R}$

They use alternating least squares (ALS) to train this model.

The prediction then is the dot product between the user factor and all item factors ([source](https://github.com/benfred/implicit/blob/master/implicit/recommender_base.pyx#L157-L176)).
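With the item factors fixed, the objective is quadratic in each $x_u$ and has the closed-form minimizer $x_u = (Y^T C^u Y + \lambda I)^{-1} Y^T C^u p(u)$, where $C^u$ is the diagonal matrix of confidences for user $u$; the item update is symmetric. The following is a dense, unoptimized sketch of that alternation plus the dot-product prediction. It is not the code of the linked `implicit` library, and all function names and default values here are hypothetical:

```python
import numpy as np

def loss(R, X, Y, lam, alpha):
    """Objective from the summary: weighted squared error + regularizer."""
    C = 1.0 + alpha * R
    P = (R > 0).astype(float)
    return float((C * (P - X @ Y.T) ** 2).sum()
                 + lam * ((X ** 2).sum() + (Y ** 2).sum()))

def als_step(R, fixed, lam, alpha):
    """Solve for the factors of each row of R, holding `fixed` constant."""
    n = fixed.shape[1]
    out = np.zeros((R.shape[0], n))
    for u in range(R.shape[0]):
        c_u = 1.0 + alpha * R[u]            # confidences for this row
        p_u = (R[u] > 0).astype(float)      # binary preferences
        A = fixed.T @ (c_u[:, None] * fixed) + lam * np.eye(n)
        b = fixed.T @ (c_u * p_u)
        out[u] = np.linalg.solve(A, b)
    return out

def fit_implicit_als(R, n_factors=2, lam=0.1, alpha=40.0, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    Y = 0.01 * rng.standard_normal((R.shape[1], n_factors))
    losses = []
    for _ in range(n_iter):
        X = als_step(R, Y, lam, alpha)      # update users, items fixed
        Y = als_step(R.T, X, lam, alpha)    # update items, users fixed
        losses.append(loss(R, X, Y, lam, alpha))
    return X, Y, losses

def recommend(u, X, Y, R, k=2):
    """Score all items by dot product and skip items already seen."""
    scores = Y @ X[u]
    scores[R[u] > 0] = -np.inf
    return np.argsort(-scores)[:k]

R = np.array([[5.0, 3.0, 0.0, 0.0],
              [4.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 4.0],
              [0.0, 0.0, 5.0, 1.0]])
X, Y, losses = fit_implicit_als(R)
print(recommend(0, X, Y, R))
```

Each half-step exactly minimizes the objective over one set of factors, so the recorded loss cannot increase from one sweep to the next.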


Generative adversarial networks uncover epidermal regulators and predict single cell perturbations
Arsham Ghahramani and Fiona M Watt and Nicholas M Luscombe
bioRxiv: The preprint server for biology - 2018 via Local CrossRef
Keywords:

[link] Summary by David Stutz 5 years ago

Lee et al. propose a variant of adversarial training where a generator is trained simultaneously to generate adversarial perturbations. This approach follows the idea that it is possible to “learn” how to generate adversarial perturbations (as in [1]). In this case, the authors use the gradient of the classifier with respect to the input as a hint for the generator. Both generator and classifier are then trained in an adversarial setting (analogously to generative adversarial networks), see the paper for details.

[1] Omid Poursaeed, Isay Katsman, Bicheng Gao, Serge Belongie. Generative Adversarial Perturbations. ArXiv, abs/1712.02328, 2017.


Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis
Li, Chuan and Wand, Michael
Conference on Computer Vision and Pattern Recognition - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Alexander Jung 6 years ago

* They describe a method that applies the style of a source image to a target image.
  * Example: Let a normal photo look like a van Gogh painting.
  * Example: Let a normal car look more like a specific luxury car.
* Their method builds upon the well known artistic style paper and uses a new MRF prior.
* The prior leads to locally more plausible patterns (e.g. fewer artifacts).

### How
* They reuse the content loss from the artistic style paper.
  * The content loss is calculated by feeding the source and target image through a network (here: VGG19) and then computing the squared Euclidean distance between the activations of one or more hidden layers.
  * They use layer `relu4_2` for the distance measurement.
* They replace the original style loss with an MRF based style loss.
  * Step 1: Extract from the source image `k x k` sized overlapping patches.
  * Step 2: Perform step (1) analogously for the target image.
  * Step 3: Feed the source image patches through a pretrained network (here: VGG19) and select the representations `r_s` from specific hidden layers (here: `relu3_1`, `relu4_1`).
  * Step 4: Perform step (3) analogously for the target image. (Result: `r_t`)
  * Step 5: For each patch of `r_s` find the best matching patch in `r_t` (based on normalized cross correlation).
  * Step 6: Calculate the sum of squared errors (based on Euclidean distances) of each patch in `r_s` and its best match (according to step 5).
* They add a regularizer loss.
  * The loss encourages smooth transitions in the synthesized image (i.e. few edges, corners).
  * It is based on the raw pixel values of the last synthesized image.
  * For each pixel in the synthesized image, they calculate the squared x-gradient and the squared y-gradient and then add both.
  * They use the sum of all those values as their loss (i.e. `regularizer loss = <sum over all pixels> x-gradient^2 + y-gradient^2`).
* Their whole optimization problem is then roughly `image = argmin_image MRF-style-loss + alpha1 * content-loss + alpha2 * regularizer-loss`.
* In practice, they start their synthesis with a low resolution image and then progressively increase the resolution (each time performing some iterations of optimization).
* In practice, they sample patches from the style image under several different rotations and scalings.
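Steps 5 and 6 and the regularizer can be sketched in a few lines of NumPy. This is only an illustration of the matching and loss arithmetic: it operates on plain `(height, width, channels)` arrays standing in for the VGG19 feature maps `r_s` and `r_t`, it uses a brute-force search, and the function names are hypothetical rather than taken from the paper:

```python
import numpy as np

def extract_patches(feat, k):
    """All overlapping k x k patches of an (H, W, C) array, one per row."""
    h, w, _ = feat.shape
    return np.stack([feat[i:i + k, j:j + k].ravel()
                     for i in range(h - k + 1)
                     for j in range(w - k + 1)])

def mrf_style_loss(r_s, r_t, k=3, eps=1e-8):
    """Sum of squared distances between each r_s patch and its best r_t match."""
    ps = extract_patches(r_s, k)
    pt = extract_patches(r_t, k)
    # Normalized cross-correlation: dot product of unit-length patches.
    ps_n = ps / (np.linalg.norm(ps, axis=1, keepdims=True) + eps)
    pt_n = pt / (np.linalg.norm(pt, axis=1, keepdims=True) + eps)
    best = np.argmax(ps_n @ pt_n.T, axis=1)   # step 5: best match per patch
    return float(((ps - pt[best]) ** 2).sum())  # step 6: squared errors

def smoothness_loss(img):
    """Sum over pixels of squared x-gradient plus squared y-gradient."""
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    return float((dx ** 2).sum() + (dy ** 2).sum())

rng = np.random.default_rng(0)
r_s = rng.standard_normal((6, 6, 4))
r_t = rng.standard_normal((6, 6, 4))
print(mrf_style_loss(r_s, r_t), smoothness_loss(r_s))
```

In the actual method these losses are differentiated with respect to the synthesized image and combined with the content loss; the sketch only evaluates them.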

### Results
* In comparison to the original artistic style paper:
  * Fewer artifacts.
  * Their method tends to preserve style better, but content worse.
  * Can handle photorealistic style transfer better, so long as the images are similar enough. If no good matches between patches can be found, their method performs worse.

![Non-photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples.png?raw=true "Non-photorealistic example images")

*Non-photorealistic example images. Their method vs. the one from the original artistic style paper.*

![Photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples_real.png?raw=true "Photorealistic example images")

*Photorealistic example images. Their method vs. the one from the original artistic style paper.*


Prediction gradients for feature extraction and analysis from convolutional neural networks
Lo, Henry Z. and Cohen, Joseph Paul and Ding, Wei
Conference on Automatic Face and Gesture Recognition - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

The prediction gradient is just $\frac{\partial \mathbf{y}}{\partial w}$ where $\mathbf{y}$ is the output before the loss function.
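As a rough illustration of what $\frac{\partial \mathbf{y}}{\partial w}$ looks like, the sketch below estimates it with central finite differences for a toy two-output linear model. The model, the helper name `prediction_gradient` and the use of finite differences are assumptions made here for illustration; the summary does not describe how the paper computes the gradient for a convolutional network.

```python
import numpy as np

def prediction_gradient(predict, w, eps=1e-6):
    """Finite-difference Jacobian dy/dw with shape (len(y), len(w))."""
    w = np.asarray(w, dtype=float)
    y0 = np.asarray(predict(w))
    jac = np.zeros((y0.size, w.size))
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        # Central difference in the j-th weight, for all outputs at once.
        jac[:, j] = (predict(w + step) - predict(w - step)) / (2 * eps)
    return jac

x = np.array([1.0, -2.0, 0.5])   # a fixed input

def predict(w):
    # Toy model: w holds a flattened 2x3 weight matrix, y = W x (no loss).
    return w.reshape(2, 3) @ x

w = np.arange(6, dtype=float)
J = prediction_gradient(predict, w)
print(np.round(J, 3))
```

For this linear model each output depends only on its own row of weights, so row $k$ of the Jacobian contains the input $x$ in the columns belonging to output $k$ and zeros elsewhere.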