Perceptrons - an introduction to computational geometry on ShortScience.org

Perceptrons - an introduction to computational geometry
Minsky, Marvin and Papert, Seymour
MIT Press - 1987 via Local Bibsonomy
Keywords: dblp

Summaries/Notes 1

[link] Summary by Joseph Paul Cohen 8 years ago

### Perceptron Classification

The perceptron computes the following function for some weight vector $\vec{w}$ and bias scalar $b$. Given an input $\vec{x}$, it produces a binary prediction.

$$ f(x) = \left\{ \begin{matrix} 
1 & \text{if } (\vec{w} \cdot \vec{x} + b > 0)  \\
-1 & \text{otherwise}  \\
\end{matrix}\right. $$
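
To make the decision rule concrete, here is a minimal sketch in plain Python. The function name `predict` and the example weights are illustrative assumptions, not something taken from the book or the summary.

```python
from typing import Sequence


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return 1 if w . x + b > 0, otherwise -1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score > 0 else -1


# Hypothetical weights and bias, chosen only for illustration.
w = [2.0, -1.0]
b = -0.5

print(predict(w, b, [1.0, 0.5]))  # score = 2.0 - 0.5 - 0.5 = 1.0, prints 1
print(predict(w, b, [0.0, 1.0]))  # score = -1.0 - 0.5 = -1.5, prints -1
```

Note that a score of exactly 0 falls into the "otherwise" branch and is labeled -1.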

### Perceptron Learning

The values $w$ and $b$ for this function are learned from the sample data by minimizing the misclassification error of the predictions. Our sample data is in the form $(x_i,y_i)$ where $y_i$ is the correct label (1 or -1). If the output $f(x_i)$ is equal to $y_i$, then the product $-y_i f(x_i)$ will be -1. If it is incorrect, the product will be 1. So we can take the $\max$ of 0 and this product, which is 0 for a correct prediction and 1 for an incorrect one, to measure how bad $w$ and $b$ are on that example. $J_i(w,b)$ is the error for that one example. We can average these over all $N$ samples to get the error over the whole data set.

$$J_i(w,b) = \max(0,-y_i f(x_i))$$

$$J(w,b) = \frac{1}{N} \displaystyle\sum_{i=1}^N \max(0,-y_i f(x_i))$$
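
As a small worked example of these two formulas, the sketch below evaluates $J_i$ and $J$ on four made-up samples. The helper names (`predict`, `example_error`, `mean_error`) and the data are assumptions introduced for illustration. With the thresholded $f$, each $J_i$ is either 0 or 1, so $J$ is simply the fraction of misclassified samples.

```python
def predict(w, b, x):
    """Thresholded perceptron output: 1 if w . x + b > 0, otherwise -1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score > 0 else -1


def example_error(w, b, x, y):
    """J_i(w, b) = max(0, -y_i * f(x_i)): 0 if correct, 1 if incorrect."""
    return max(0, -y * predict(w, b, x))


def mean_error(w, b, samples):
    """J(w, b): average of J_i over all (x_i, y_i) pairs."""
    return sum(example_error(w, b, x, y) for x, y in samples) / len(samples)


# Hypothetical weights and labeled samples, for illustration only.
w, b = [1.0, 1.0], 0.0
samples = [
    ([1.0, 1.0], 1),     # predicted 1, correct
    ([-1.0, -1.0], -1),  # predicted -1, correct
    ([1.0, -2.0], 1),    # predicted -1, incorrect
    ([2.0, 0.5], -1),    # predicted 1, incorrect
]

print(mean_error(w, b, samples))  # 2 of 4 samples are wrong, prints 0.5
```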

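To apply gradient descent to this problem we calculate the gradient of $J_i(w,b)$ with respect to each $w_j \in w$ so we know how to adjust it to minimize $J_i(w,b)$. The thresholded $f$ is piecewise constant, so for the gradient we replace $f(x_i)$ with the linear score $\vec{w} \cdot \vec{x}_i + b$ inside the $\max$. Because of the $max$, the gradient is piecewise: it is zero when the example is on the correct side of the boundary and nonzero otherwise.

$$ \frac{\partial J_i}{\partial w_j}= 
\left\{ \begin{matrix} 
0 & \text{if } y_i(\vec{w} \cdot \vec{x}_i + b) > 0  \\
-y_i x_{ij} & \text{otherwise}  \\
\end{matrix}\right. $$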

This gradient $\frac{\partial J_i}{\partial w_j}$ is then used to adjust $w_j$. Subtracting $\frac{\partial J_i}{\partial w_j}$ from $w_j$ adjusts the output of $f(x_i)$ such that the error $J_i(w,b)$ is reduced. Generally, subtracting the full gradient will not result in the minimal error, so only a fraction $\lambda$ of the gradient (the learning rate) is subtracted: $w_j \leftarrow w_j - \lambda \frac{\partial J_i}{\partial w_j}$. A value of $\lambda = 0.05$ is common, but the right choice is still a point of debate and is generally set by experience.
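
The update described above can be sketched as a small training loop. This is a minimal illustration under stated assumptions, not the book's algorithm verbatim: it uses the standard perceptron update, in which the gradient for a misclassified example is taken on the linear score $\vec{w} \cdot \vec{x}_i + b$ and equals $-y_i x_{ij}$ for $w_j$ and $-y_i$ for $b$, so subtracting $\lambda$ times the gradient adds $\lambda y_i x_{ij}$ to $w_j$. The function name `train_perceptron`, the `epochs` cap, and the toy data are all hypothetical.

```python
def train_perceptron(samples, lr=0.05, epochs=100):
    """Learn (w, b) from (x, y) pairs with labels y in {1, -1}.

    Assumes the standard perceptron update: for a misclassified example
    the gradient is -y * x_j for w_j and -y for b.
    """
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score <= 0:  # wrong side of (or exactly on) the boundary
                # w_j <- w_j - lr * (-y * x_j); b <- b - lr * (-y)
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly, stop early
            break
    return w, b


# Hypothetical, linearly separable toy data.
samples = [
    ([2.0, 1.0], 1),
    ([1.0, 3.0], 1),
    ([-1.0, -2.0], -1),
    ([-2.0, -0.5], -1),
]

w, b = train_perceptron(samples)
for x, y in samples:
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    print(x, y, 1 if score > 0 else -1)
```

On data that is not linearly separable the loop never reaches zero mistakes, which is why the sketch caps the number of passes with `epochs`.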
