A fundamental paper by Adam Berger and his colleagues at IBM Watson Research Center presents a thorough guide to applying maximum entropy (ME) modelling in NLP. The authors follow the principle of maximum entropy, show the optimality of their procedure, and establish its duality with maximum log-likelihood estimation. Importantly, the paper proposes a method for automatic feature selection, perhaps the most critical step in the entire approach. Empirical results from Candide, IBM's automatic machine-translation system, demonstrate the capabilities of ME models. Both the originality and the wide range of possible applications made the paper stand out and led to well-deserved accolades.

Many practical NLP problems impose constraints, in the form of data or explicit statements, while requiring minimal assumptions about the underlying distributions. Laplace advocated that in this scenario the best strategy is to treat everything as equally as possible while remaining consistent with the observed facts. The authors follow Laplace's philosophy and use entropy as a measure of the uniformity of a distribution. They invoke the principle of maximum entropy to choose among candidate models, which simply states that the optimal model is the one with the largest entropy among those satisfying the constraints.

For a given set of features f they require the model's expected value of each feature to match its empirical value. The empirical expectation under the empirical distribution p~(x,y) is p~(f) = sum[p~(x,y)*f(x,y)], while the model's expectation is p(f) = sum[p~(x)*p(y|x)*f(x,y)]. The objective is the conditional entropy H(p) = -sum[p~(x)*p(y|x)*log(p(y|x))], and the optimal distribution is p* = argmax(H(p)) subject to p(f_i) = p~(f_i). This constrained optimization usually requires a numerical solution. Therefore, the authors employ Lagrange multipliers, L(p, lambda) = H(p) + sum[lambda_i * (p(f_i) - p~(f_i))], which yields the exponential form p_lambda(y|x) = 1/Z(x) * exp(sum[lambda_i * f_i(x,y)]) and the dual function L(lambda) = -sum[p~(x)*log(Z(x))] + sum[lambda_i * p~(f_i)], where Z(x) is a normalizing constant.
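The two expectations and the exponential-form model can be sketched on a toy problem. Everything below — the sample, the single binary feature, the names — is my own minimal illustration, not from the paper:

```python
import math
from collections import Counter

# Toy (x, y) sample: context x, label y (illustrative data, not from the paper).
data = [("a", 0), ("a", 0), ("a", 1), ("b", 1), ("b", 1), ("b", 1)]
labels = (0, 1)

# One binary feature: fires when the label is 1.
def f(x, y):
    return 1.0 if y == 1 else 0.0

n = len(data)
p_xy = Counter(data)               # empirical p~(x, y), as counts
p_x = Counter(x for x, _ in data)  # empirical p~(x), as counts

# Empirical expectation p~(f) = sum_{x,y} p~(x,y) * f(x,y)
p_tilde_f = sum(c / n * f(x, y) for (x, y), c in p_xy.items())

def p_lambda(y, x, lam):
    """Exponential-form conditional model p_lambda(y|x) = exp(lam*f(x,y)) / Z(x)."""
    z = sum(math.exp(lam * f(x, yy)) for yy in labels)
    return math.exp(lam * f(x, y)) / z

# Model expectation p(f) = sum_x p~(x) sum_y p_lambda(y|x) * f(x,y)
def model_f(lam):
    return sum(c / n * sum(p_lambda(y, x, lam) * f(x, y) for y in labels)
               for x, c in p_x.items())

# At lam = 0 the model is uniform over labels, so p(f) = 0.5, while the data
# give p~(f) = 4/6; fitting the ME model means closing this gap.
print(p_tilde_f)     # ≈ 0.667
print(model_f(0.0))  # 0.5
```

The gap between the two printed values is exactly what the constraint p(f_i) = p~(f_i) eliminates once the multipliers are fitted.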
The unconstrained dual problem then becomes lambda* = argmax(L(lambda)). Interestingly, L(lambda) is exactly the log-likelihood of the data sample, which implies that MLE and ME are dual and ensures the optimality of ME, since it converges to the best fit of the data. To estimate the model parameters, the authors propose an iterative scaling method, which updates each lambda_i by solving sum[p~(x)*p(y|x)*f_i(x,y)*exp(delta_i * f#(x,y))] = p~(f_i) for the increment delta_i, where f#(x,y) = sum_i[f_i(x,y)].

The choice of features f is apparently the most critical step in the procedure. The authors propose a simple greedy approach to feature selection: iteratively add the candidate feature whose inclusion yields the largest gain in the model's log-likelihood, and continue expanding the feature set until the cross-validation results converge.

To evaluate the ME approach, the authors applied it to context-sensitive word translation, sentence segmentation, and word reordering. These applications demonstrate the efficacy of ME techniques for context-sensitive modelling, and the paper undoubtedly deserves the attention it gained. For example, in word reordering the model outperforms a simple baseline by 10%. However, the paper would benefit from a deeper analysis of feature selection. The proposed method is computationally expensive, as it iteratively calls an optimization subroutine and involves a cross-validation step. Moreover, the authors skip the discussion of overfitting and of the choice of the initial feature pool. They imply that features have to be designed manually, which makes the approach impractical for large, diverse problems. Therefore, further developments in dimensionality reduction and feature extraction could benefit ME.
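The iterative scaling update admits a simple sketch in the special case of a single binary feature, where f#(x,y) = f(x,y) and the update equation has a closed-form solution delta = log(p~(f) / p(f)). The toy data, the feature, and the names below are my own illustration under that simplifying assumption, not the paper's general algorithm:

```python
import math
from collections import Counter

# Toy (x, y) sample and a single binary feature (illustrative, not from the paper).
data = [("a", 0), ("a", 0), ("a", 1), ("b", 1), ("b", 1), ("b", 1)]
labels = (0, 1)

def f(x, y):
    return 1.0 if y == 1 else 0.0

n = len(data)
p_x = Counter(x for x, _ in data)                # empirical p~(x), as counts
p_tilde_f = sum(f(x, y) for x, y in data) / n    # empirical expectation p~(f)

def p_cond(y, x, lam):
    """p_lambda(y|x) = exp(lam*f(x,y)) / Z(x)."""
    z = sum(math.exp(lam * f(x, yy)) for yy in labels)
    return math.exp(lam * f(x, y)) / z

def model_f(lam):
    """Model expectation p(f) = sum_x p~(x) sum_y p(y|x) f(x,y)."""
    return sum(c / n * sum(p_cond(y, x, lam) * f(x, y) for y in labels)
               for x, c in p_x.items())

# Iterative scaling: with one binary feature, f#(x,y) = f(x,y), so solving
#   sum_x p~(x) sum_y p(y|x) f(x,y) exp(delta * f#(x,y)) = p~(f)
# for delta gives delta = log(p~(f) / p(f)) at the current lambda.
lam = 0.0
for _ in range(200):
    lam += math.log(p_tilde_f / model_f(lam))

# After convergence the constraint p(f) = p~(f) holds.
print(abs(model_f(lam) - p_tilde_f) < 1e-9)  # True
```

On this toy problem the iteration converges to lam = log(2), since p(f) equals the logistic function of lam and the constraint forces p(f) = 2/3; in the general multi-feature case the update is solved numerically for each delta_i rather than in closed form.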