Perceptrons - an introduction to computational geometry
Minsky, Marvin and Papert, Seymour (1987)
Paper summary: joecohen

### Perceptron Classification
The perceptron is a function of an input vector $\vec{x}$, parameterized by a weight vector $\vec{w}$ and a scalar bias $b$. Given an input $\vec{x}$, it produces a binary prediction.
$$ f(x) = \left\{ \begin{matrix}
1 & \text{if } (\vec{w} \cdot \vec{x} + b > 0) \\
-1 & \text{otherwise} \\
\end{matrix}\right. $$
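As a minimal sketch of the decision rule above, here it is in plain Python. The function name `predict` and the example weights are made up for illustration only; they are not from the book or the summary.

```python
from typing import Sequence


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return 1 if w . x + b > 0, otherwise -1."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if score > 0 else -1


# Hypothetical weights and inputs, chosen only to show both branches.
w = [2.0, -1.0]
b = -0.5
print(predict(w, b, [1.0, 0.5]))  # score = 2.0 - 0.5 - 0.5 = 1.0, which is > 0
print(predict(w, b, [0.0, 1.0]))  # score = -1.0 - 0.5 = -1.5, which is not > 0
```

Note that a score of exactly 0 falls into the "otherwise" branch, so it is predicted as -1.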
### Perceptron Learning
The values $w$ and $b$ are learned from sample data by minimizing the misclassification error of the predictions. The sample data has the form $(x\_i,y\_i)$, where $y\_i$ is the correct label (1 or -1). If the output $f(x\_i)$ equals $y\_i$, the product $-y\_i f(x\_i)$ is -1; if the prediction is incorrect, the product is 1. Taking the $\max$ of 0 and this product therefore gives 0 for a correct prediction and 1 for an incorrect one. $J_i(w,b)$ is the error for a single example, and averaging over all $N$ samples gives the total error $J(w,b)$, which measures how bad $w$ and $b$ are.
$$J_i(w,b) = \max(0,-y\_i f(x\_i))$$
$$J(w,b) = \frac{1}{N} \displaystyle\sum\_{i=1}^N \max(0,-y\_i f(x\_i))$$
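To make the two formulas concrete, here is a small sketch that evaluates them on a toy data set. The helper names (`sample_error`, `mean_error`) and the four samples are assumptions made up for this example. Since $f$ only outputs 1 or -1, each $J_i$ is either 0 or 1, so $J$ is simply the fraction of misclassified samples.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (input vector x_i, label y_i in {1, -1})


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Perceptron output f(x): 1 if w . x + b > 0, otherwise -1."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if score > 0 else -1


def sample_error(w: Sequence[float], b: float, x: Sequence[float], y: int) -> int:
    """J_i = max(0, -y_i * f(x_i)): 0 if correct, 1 if incorrect."""
    return max(0, -y * predict(w, b, x))


def mean_error(w: Sequence[float], b: float, samples: List[Sample]) -> float:
    """J = (1/N) * sum of J_i over all samples."""
    return sum(sample_error(w, b, x, y) for x, y in samples) / len(samples)


# Hypothetical weights and data: the 2nd and 3rd samples are misclassified.
w, b = [2.0, -1.0], -0.5
samples = [([1.0, 0.5], 1), ([0.0, 1.0], 1), ([1.0, 0.0], -1), ([0.0, 2.0], -1)]
print(mean_error(w, b, samples))  # 2 of 4 samples are wrong
```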
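To apply gradient descent to this problem, we calculate the gradient of $J_i(w,b)$ with respect to each $w\_j \in w$ so that we know how to adjust it to reduce $J_i(w,b)$. Because $f$ is a step function, its gradient is zero almost everywhere, so the gradient is taken with the raw score $\vec{w} \cdot \vec{x}\_i + b$ in place of $f(x\_i)$, i.e. of $\max(0, -y\_i(\vec{w} \cdot \vec{x}\_i + b))$. Because of the $\max$, this gradient is piecewise: it is zero for a correctly classified example and nonzero otherwise.
$$ \frac{\partial J_i}{\partial w_j}=
\left\{ \begin{matrix}
0 & \text{if } y\_i(\vec{w} \cdot \vec{x}\_i + b) > 0 \\
-y\_ix\_{ij} & \text{otherwise} \\
\end{matrix}\right. $$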
This gradient $\frac{\partial J_i}{\partial w_j}$ is then used to adjust $w_j$: subtracting it from $w_j$ changes the output of $f(x_i)$ so that the error $J_i(w,b)$ is reduced. Subtracting the full gradient can overshoot and generally does not reach the minimal error, so only a fraction $\lambda$ of the gradient (the learning rate) is subtracted, i.e. $w_j \leftarrow w_j - \lambda \frac{\partial J_i}{\partial w_j}$. A value of $0.05$ is typical, but the choice of $\lambda$ is still a point of debate and is generally set by experience.
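Putting the pieces together, a training loop might look like the sketch below. It assumes the standard perceptron update, where a misclassified sample moves each weight by $+\lambda y_i x_{ij}$ (that is, subtracting $\lambda$ times a gradient of $-y_i x_{ij}$). The bias update, the zero initialization, the epoch limit, and the toy data are all assumptions added for the example; the summary itself only discusses the weights.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (input vector x_i, label y_i in {1, -1})


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Raw score w . x + b, before thresholding."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def train(samples: List[Sample], lam: float = 0.05, epochs: int = 100):
    """Per-sample gradient descent with learning rate lam (hypothetical helper)."""
    w = [0.0] * len(samples[0][0])  # assumption: start from all-zero weights
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            # Treat score == 0 as an error so the zero-initialized model updates.
            if y * score(w, b, x) <= 0:
                w = [w_j + lam * y * x_j for w_j, x_j in zip(w, x)]
                b += lam * y  # assumption: bias updated like a weight on input 1
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly, stop early
            break
    return w, b


# Hypothetical, linearly separable toy data.
samples = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w, b = train(samples)
print(all(y * score(w, b, x) > 0 for x, y in samples))
```

On data that is not linearly separable this loop never reaches zero mistakes and simply stops after `epochs` passes.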