The paper proposes a new method for building large-scale supervised reading comprehension datasets, and develops attention-based deep neural networks that can answer complex questions about real documents.
Obtaining supervised natural language comprehension data at large scale is difficult. On the other hand, reading comprehension methods built on synthetic data fail when they face real data. This work addresses the lack of real supervised reading comprehension data. In addition, the authors build novel deep learning models for reading comprehension by incorporating an attention mechanism into recurrent neural networks. The attention mechanism allows a model to focus on the parts of a document that it believes will help it answer a question.
In the first part, two machine reading corpora are created from CNN and Daily Mail articles and their corresponding bullet-point summaries. These bullet points are abstractive: they paraphrase important parts of the article rather than copying sentences from the text. Each bullet point is turned into Cloze-type questions by replacing one entity at a time with a placeholder, for example, “producer X will not press charges against ent212, his lawyer says.” All entities are replaced by entity markers using a coreference system, and the markers are permuted for each data point so that models cannot exploit world knowledge or co-occurrence effects and must rely on the context document.
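The construction above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name make_cloze_examples, the token-level matching, and the assumption that an upstream coreference/entity step has already produced the set of entity tokens are all hypothetical.

```python
import random


def make_cloze_examples(document, bullet, entities, seed=0):
    """Build one (context, query, answer) triple per entity in the bullet.

    document, bullet: lists of tokens.
    entities: set of tokens that an upstream coreference/entity step
              (assumed, not shown) has identified as entities.
    """
    rng = random.Random(seed)
    doc_tokens = set(document)
    names = sorted(entities)
    examples = []
    # One question per distinct entity of the bullet that also occurs in the document.
    for target in dict.fromkeys(bullet):
        if target not in entities or target not in doc_tokens:
            continue
        # Fresh random marker assignment per data point, so a marker
        # carries no meaning from one example to the next.
        ids = list(range(len(names)))
        rng.shuffle(ids)
        marker = {name: f"ent{i}" for name, i in zip(names, ids)}

        context = [marker.get(tok, tok) for tok in document]
        query = ["X" if tok == target else marker.get(tok, tok) for tok in bullet]
        examples.append((context, query, marker[target]))
    return examples


# Toy data with made-up names.
document = "Alice argued with producer Bob , Acme said".split()
bullet = "Bob will not press charges against Alice".split()
examples = make_cloze_examples(document, bullet, {"Alice", "Bob", "Acme"})
```

Each triple holds the anonymised document, the query with a single placeholder X, and the marker that X stands for.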
In the second part, for the reading comprehension task, they use 2 simple baseline models (A), 2 symbolic matching models (B), and 4 recurrent neural network models (C):
A1) Majority Baseline: It picks the most frequently observed entity in the context document.
A2) Exclusive Majority: It picks the most frequently observed entity in the context document which is not observed in the query.
B1) Frame-Semantic Parsing: This method parses each sentence to find "who did what to whom", using a state-of-the-art frame-semantic parser on the anonymized data points.
B2) Word Distance Benchmark: It aligns the placeholder of the Cloze-form question with each candidate entity in the context document and measures the distance between the question and the context around the aligned entity. The score is the sum of the distances from every word in the query to its nearest aligned word in the document.
C1) Deep LSTM Reader (2-layer LSTM)
This model feeds the [document | query] pair, separated by a delimiter, into the network as a single long sequence, one word at a time. The LSTM has skip connections from each input to every hidden layer and from every hidden layer to the output.
C2) Attentive Reader (bi-directional LSTM with attention)
This model employs an attention mechanism to overcome the bottleneck of a fixed-width hidden vector. First, it encodes the document and the query using separate bi-directional single-layer LSTMs. The query encoding is obtained by concatenating the final forward and backward outputs. The document encoding is a weighted sum of the per-token output vectors (each obtained by concatenating the forward and backward outputs at that token). The weights can be interpreted as the degree to which the network attends to a particular token in the document. Finally, the model produces a joint representation through a non-linear combination of the document and query encodings.
C3) Uniform Reader (bi-directional LSTM)
This is the Attentive Reader without the attention mechanism. It is included to isolate the effect of attention on the results.
C4) Impatient Reader (bi-directional LSTM with attention per query token)
This model is similar to the Attentive Reader, except that the attention weights are recomputed for each query token. The intuition is that, for each token, the model finds which part of the context document is most relevant. The model accumulates information from the document as each query token is read, and finally outputs a joint document-query representation through a non-linear combination of the document and query encodings.
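The attention step shared by the Attentive and Impatient Readers can be sketched as below. This is a simplified illustration under assumptions: the scoring function (a tanh layer followed by a dot product), the parameter names W_y, W_u and w_s, and the random placeholder values are not taken from the summary above, and no LSTM is actually run.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(doc_outputs, query_enc, W_y, W_u, w_s):
    """Return (weights, r): one attention weight per document token and
    the weighted sum r of the token output vectors."""
    query_part = matvec(W_u, query_enc)
    scores = []
    for y in doc_outputs:
        # Score each token from its own output vector and the query encoding.
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_y, y), query_part)]
        scores.append(sum(w * h for w, h in zip(w_s, hidden)))
    weights = softmax(scores)
    dim = len(doc_outputs[0])
    r = [sum(wt * y[j] for wt, y in zip(weights, doc_outputs)) for j in range(dim)]
    return weights, r


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random placeholder values standing in for learned parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


doc_outputs = rand_matrix(5, 4)   # 5 document tokens, 4-dim outputs (stand-ins for LSTM outputs)
query_enc = rand_matrix(1, 4)[0]  # stand-in for the query encoding
W_y, W_u, w_s = rand_matrix(3, 4), rand_matrix(3, 4), rand_matrix(1, 3)[0]
weights, r = attend(doc_outputs, query_enc, W_y, W_u, w_s)
```

In these terms, the Impatient Reader would repeat the attention step once per query token, while equal weights for every token correspond to the Uniform Reader.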
As expected, the Attentive and Impatient Readers outperform all other models, which shows the benefit of the attention mechanism. The Uniform Reader supports this conclusion: without attention, its accuracy falls to roughly the level of the frequency baselines. The accuracies on the two datasets (CNN / Daily Mail) are Maximum Frequency (A1): 33.2 / 25.5, Exclusive Frequency (A2): 39.3 / 32.8, Frame-semantic model: 40.2 / 35.5, Word distance model: 50.9 / 55.5, Deep LSTM Reader: 57.0 / 62.2, Uniform Reader: 39.4 / 34.4, Attentive Reader: 63.0 / 69.0, Impatient Reader: 63.8 / 68.0.
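For reference, the two frequency baselines (A1 and A2) amount to simple counting. The sketch below assumes tokenised, anonymised input in which entity markers look like ent followed by digits; the function name and that marker pattern are illustrative assumptions.

```python
import re
from collections import Counter

ENTITY = re.compile(r"ent\d+")  # assumed marker format, e.g. "ent212"


def majority_entity(context, query, exclusive=False):
    """Pick the most frequent entity marker in the context.

    With exclusive=True, entities that also appear in the query are skipped.
    Returns None if no candidate remains.
    """
    counts = Counter(tok for tok in context if ENTITY.fullmatch(tok))
    if exclusive:
        for tok in set(query):
            counts.pop(tok, None)
    return counts.most_common(1)[0][0] if counts else None


context = "ent1 met ent2 and ent1 left ent3 with ent2 and ent1".split()
query = "X met ent1".split()
a1 = majority_entity(context, query)                  # most frequent entity overall
a2 = majority_entity(context, query, exclusive=True)  # most frequent entity not in the query
```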
The fundamental question the paper sets out to answer is whether deep learning can be realized with prediction models other than neural networks. The authors propose deep forest (gcForest), a realization of deep learning using random forests. The idea is simple and was inspired by representation learning in deep neural networks, which mostly relies on layer-by-layer processing of raw features.
https://i.imgur.com/Wh6xAbP.png
Importance: Deep neural networks (DNNs) have several drawbacks. They need a lot of data to train. They have many hyper-parameters to tune. Moreover, not everyone has access to the GPUs needed to build and train them. Training a DNN is more an art than a scientific or engineering task. Finally, theoretical analysis of DNNs is extremely difficult. The aim of the paper is to propose a model that addresses these issues while achieving performance competitive with deep neural networks.
Model: The proposed model consists of two parts. The first part is a deep forest ensemble with a cascade structure, similar to the layer-by-layer architecture of a DNN. Each level is an ensemble of random forests; to encourage diversity, a combination of completely-random tree forests and typical random forests is used (the number of trees in each forest is a hyper-parameter). Each forest produces an estimated class distribution, obtained by k-fold cross-validation, which forms a class vector. The class vectors are then concatenated with the original feature vector to form the input to the next level of the cascade. The second part is multi-grained scanning for representation learning, where spatial and sequential relationships are captured by a sliding-window scan (with various window sizes) over the raw features, similar to the convolutional and recurrent layers in DNNs. The extracted windows are then passed to a completely-random tree forest and a typical random forest to generate transformed features. When the transformed feature vectors are too long to be accommodated, feature sampling can be performed.
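The bookkeeping of the two parts can be sketched without training any forests. The helper names and the numbers used below (400 raw features, window size 100, 3 classes) are illustrative assumptions; the class vectors are made-up values standing in for forest outputs.

```python
def sliding_windows(features, window):
    """Multi-grained scanning: every contiguous window of the raw feature vector."""
    return [features[i:i + window] for i in range(len(features) - window + 1)]


def transformed_length(n_features, window, n_classes, n_forests=2):
    """Length of the transformed vector when each of n_forests forests emits
    one class vector (n_classes values) per window."""
    n_windows = n_features - window + 1
    return n_windows * n_classes * n_forests


def next_level_input(original, class_vectors):
    """Cascade step: original features followed by each forest's class vector."""
    out = list(original)
    for class_vector in class_vectors:
        out.extend(class_vector)
    return out


windows = sliding_windows(list(range(400)), window=100)
length = transformed_length(n_features=400, window=100, n_classes=3)
# Two forests, two classes: made-up class distributions for one sample.
level_input = next_level_input([0.1, 0.2], [[0.7, 0.3], [0.6, 0.4]])
```

With these numbers the scan yields 301 windows and a 1806-dimensional transformed vector, which illustrates why feature sampling may be needed for long inputs.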
Benefits: gcForest has far fewer hyper-parameters than deep neural networks. The number of cascade levels is determined adaptively, so the model complexity is set automatically: if growing a new level does not improve performance, the growth of the cascade terminates. Its performance is quite robust to hyper-parameter settings, so in most cases, and across data from different domains, it achieves excellent performance with the default settings. The authors report that gcForest is highly competitive with deep neural networks while its training cost is lower.
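The adaptive stopping rule can be sketched as a simple loop. This is a minimal sketch under assumptions: evaluate_level is a hypothetical callback that trains level k and returns the validation accuracy of the cascade with k levels, growth stops at the first level that fails to improve, and max_levels is a safety cap added for the example.

```python
def grow_cascade(evaluate_level, max_levels=20, min_gain=0.0):
    """Add levels while validation accuracy improves by more than min_gain.

    Returns (number of levels kept, best validation accuracy).
    """
    best = float("-inf")
    kept = 0
    for level in range(1, max_levels + 1):
        accuracy = evaluate_level(level)
        if accuracy <= best + min_gain:
            break  # the new level did not help, so it is discarded
        best, kept = accuracy, level
    return kept, best


# Made-up validation accuracies for levels 1 to 5.
accuracies = [0.90, 0.93, 0.95, 0.95, 0.97]
levels, best_accuracy = grow_cascade(lambda k: accuracies[k - 1], max_levels=5)
```

In this toy run the fourth level brings no gain, so the cascade keeps three levels even though a later level would have scored higher; a more patient rule would be a reasonable variation.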
Experimental results: The authors compared gcForest and DNNs by fixing one architecture for gcForest and testing various architectures for the DNNs, while keeping some DNN hyper-parameters fixed, such as the activation function, loss function, and dropout rate. They used MNIST (digit image recognition), ORL (face recognition), GTZAN (music classification), sEMG (hand movement recognition), IMDB (movie review sentiment analysis), and some low-dimensional datasets. gcForest achieved the best results in these experiments, sometimes by a significant margin.
My Opinions: The main goal of the paper is interesting. However, one concern is how much effort the authors put into finding the best neural network for each experiment, given that they themselves note that finding a good configuration is an art rather than scientific work. For instance, they could have used deep recurrent layers instead of an MLP for the sentiment analysis dataset, which is typically a better option for this task. They reported the time cost of the method for only one experiment, not all of them. More importantly, the CIFAR-10 result in the supplementary materials shows a big gap between the best deep learning result and the gcForest result, although the authors argue that gcForest can be tuned to get a better result. gcForest was also compared to non-deep-learning methods such as random forest and SVM, against which it showed superior results. It would have been good to have a time cost comparison for them as well. In my view, the paper is a good starting point for answering the original question; however, the proposed method and the experimental results are not convincing enough.
Github link: https://github.com/kingfengji/gcForest