Summary by Chris Murray 2 years ago
This paper describes a class of algorithms for classification or regression in the on-line setting. That is, the data is a bunch of pairs $(X_t,Y_t)$ (where X may be a vector), and these data items arrive in some order: the algorithm must predict each $\hat{Y}_t$ using only the $X_t$ and previously seen pairs. In the regression setting, each mis-prediction has a loss that is like $(Y_t - \hat{Y}_t)^2$, and in the classification setting $Y_t$ is always 0 or 1 and the loss is $| Y_t - \hat{Y}_t |$.
Roughly, the algorithm makes linear predictions using some internal weight vector $(\hat{y} = w * X)$, and does a gradient-descent like weight update. However, it tries to keep the q-norm (q can be any number) of the weight vector "small", preventing the weights themselves from becoming too large. The algorithm is actually simple, and the weight update takes advantage of link functions, which the author defines. The majority of the paper is focused on deriving loss bounds, showing that the loss incurred by this algorithm isn't much worse than that incurred by the best weight vector, chosen in hindsight. Typical readers will be interested in the first few pages, as the latter part of the paper is mainly technical proofs.

more
less