### Perceptron Classification

The perceptron computes the following function for some weight vector $\vec{w}$ and bias scalar $b$. Given an input $\vec{x}$, it produces a binary prediction.

$$ f(x) = \left\{ \begin{matrix} 1 & \text{if } \vec{w} \cdot \vec{x} + b > 0 \\ -1 & \text{otherwise} \\ \end{matrix}\right. $$

### Perceptron Learning

The values $w$ and $b$ for this function are learned from the sample data by minimizing the misclassification error of the predictions. Our sample data is in the form $(x\_i,y\_i)$, where $y\_i$ is the correct label (1 or -1). If the output $f(x\_i)$ is equal to $y\_i$, then the product $y\_i f(x\_i)$ is 1, because both factors are 1 or both are -1. If the prediction is incorrect, the product is -1. So we can take the $max$ of 0 and the negated product, which is 0 for a correct prediction and 1 for an incorrect one, and sum over the samples to measure how bad $w$ and $b$ are. $J_i(w,b)$ is the error for one example; averaging these gives the error over all $N$ samples.

$$J_i(w,b) = max(0,-y\_i f(x\_i))$$

$$J(w,b) = \frac{1}{N} \displaystyle\sum\_{i=1}^N max(0,-y\_i f(x\_i))$$

To apply Gradient Descent to this problem, we calculate the gradient of $J_i(w,b)$ with respect to each $w\_j \in w$ so that we know how to adjust it to minimize $J_i(w,b)$. The thresholded output of $f$ is flat almost everywhere, so the derivative is taken through its linear part $\vec{w} \cdot \vec{x}\_i + b$. Because of the $max$, the gradient is piecewise: it is zero when the example is classified correctly and nonzero otherwise.

$$ \frac{\partial J_i}{\partial w_j}= \left\{ \begin{matrix} 0 & \text{if } y\_i(\vec{w} \cdot \vec{x}\_i + b) > 0 \\ -y\_ix\_{ij} & \text{otherwise} \\ \end{matrix}\right. $$

This gradient $\frac{\partial J_i}{\partial w_j}$ is then used to adjust $w_j$. Subtracting $\frac{\partial J_i}{\partial w_j}$ from $w_j$ moves the output of $f(x_i)$ in the direction that reduces the error $J_i(w,b)$. Generally, subtracting the full gradient will not result in the minimal error, so only a fraction $\lambda$ of the gradient is subtracted: $w_j \leftarrow w_j - \lambda \frac{\partial J_i}{\partial w_j}$. The rate $\lambda$ is normally set around $0.05$, but its value is still a point of debate and is generally chosen by experience.
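The classification rule and the gradient step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation. It assumes labels in $\{1, -1\}$, an update applied only to misclassified samples (where the gradient with respect to $w_j$ is $-y_i x_{ij}$), an analogous update of $-y_i$ for the bias, and the rate of 0.05 mentioned in the text. The function names (`predict`, `mean_error`, `train`), the `epochs` limit, and the toy AND data set are all hypothetical and chosen only for the example.

```python
def predict(w, b, x):
    """Return 1 if w . x + b > 0, otherwise -1."""
    activation = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if activation > 0 else -1


def mean_error(w, b, samples):
    """Average of max(0, -y * f(x)), i.e. the fraction of misclassified samples."""
    return sum(max(0, -y * predict(w, b, x)) for x, y in samples) / len(samples)


def train(samples, rate=0.05, epochs=100):
    """Fit w and b with per-sample gradient steps on misclassified examples."""
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if predict(w, b, x) != y:
                # For a misclassified sample the gradient is -y * x (and -y for b),
                # so subtracting rate * gradient adds rate * y * x.
                w = [wj + rate * y * xj for wj, xj in zip(w, x)]
                b += rate * y
                mistakes += 1
        if mistakes == 0:
            # Every sample is classified correctly; nothing left to adjust.
            break
    return w, b


# Hypothetical toy data: logical AND with labels in {-1, 1}.
samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train(samples)
print("weights:", w, "bias:", b, "error:", mean_error(w, b, samples))
```

Because the AND data is linearly separable, the loop stops once a full pass produces no mistakes. On data that is not separable it would simply run for the full `epochs` limit, which is why that cap is included in the sketch.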