Convolutional Neural Networks for Sentence Classification on ShortScience.org

arxiv.org
scholar.google.com

Convolutional Neural Networks for Sentence Classification
Kim, Yoon
arXiv e-Print archive - 2014 via Local Bibsonomy
Keywords: dblp

Summaries/Notes 2

[link] Summary by Shagun Sodhani 7 years ago

#### Introduction

* The paper demonstrates how simple CNNs, built on top of word embeddings, can be used for sentence classification tasks.
* [Link to the paper](https://arxiv.org/abs/1408.5882)
* [Implementation](https://github.com/shagunsodhani/CNN-Sentence-Classifier)

#### Architecture

* Pad input sentences so that they are of the same length.
* Map words in the padded sentence using word embeddings (which may be either initialized as zero vectors or initialized as word2vec embeddings) to obtain a matrix corresponding to the sentence.
* Apply convolution layer with multiple filter widths and feature maps.
* Apply max-over-time pooling operation over the feature map.
* Concatenate the pooling results from different layers and feed to a fully-connected layer with softmax activation.
* Softmax outputs probabilistic distribution over the labels.
* Use dropout for regularisation.

#### Hyperparameters

* RELU activation for convolution layers
* Filter window of 3, 4, 5 with 100 feature maps each.
* Dropout - 0.5
* Gradient clipping at 3
* Batch size - 50
* Adadelta update rule.

#### Variants

* CNN-rand
    * Randomly initialized word vectors.
* CNN-static
    * Uses pre-trained vectors from word2vec and does not update the word vectors.
* CNN-non-static
    * Same as CNN-static but updates word vectors during training.
* CNN-multichannel
    * Uses two set of word vectors (channels).
    * One set is updated and other is not updated.

#### Datasets

* Sentiment analysis datasets for Movie Reviews, Customer Reviews etc.
* Classification data for questions.
* Maximum number of classes for any dataset - 6

#### Strengths

* Good results on benchmarks despite being a simple architecture.
* Word vectors obtained by non-static channel have more meaningful representation. 

#### Weakness

* Small data with few labels.
* Results are not very detailed or exhaustive.

Your comment: