[link]
Seeks to tackle database completion using MacCartney's natural logic. The approach does not require explicit alignment between premise and query, and allows imprecise inferences at an associated cost learned from data. It casts the transformation from query to supporting premise as a unified search problem, where each step may carry a cost reflecting confidence in that step. The system accepts unstructured text as input to the database, without a need to specify a schema or domain for the text.

## Summary

Represents MacCartney's inference model as a finite state machine that can be collapsed into three states. Represents the acquisition of new premises in the knowledge base as search over this FSA, with nodes representing candidate facts and edges representing mutations of those facts with associated costs. The confidence of a path is then computed using the cost vector and an associated feature vector (representing the distance between the endpoints of two transitions). The model was tested on the FraCaS entailment corpus as well as a corpus of OpenIE extractions. (A small sketch of such a cost-weighted search appears below.)

## Future Work

The search process does not have full access to the parse tree, so it struggles with issues of alignment.
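A minimal sketch of the cost-weighted search idea described above, assuming a hypothetical `mutations(fact)` generator that yields candidate edits together with their feature vectors, and a learned non-negative cost vector; this is an illustration of uniform-cost search over fact mutations, not the authors' implementation.

```python
import heapq
import numpy as np

def path_confidence(total_cost):
    """Map an accumulated path cost to a confidence in (0, 1]."""
    return float(np.exp(-total_cost))

def search(query, premises, mutations, cost_vector, max_cost=10.0):
    """Uniform-cost search from a query fact toward any known premise.

    `mutations(fact)` is assumed to yield (new_fact, feature_vector) pairs,
    where feature_vector counts the natural-logic edits applied; the edge
    cost is the dot product with the learned, non-negative cost_vector.
    """
    frontier = [(0.0, query)]
    best_cost = {query: 0.0}
    while frontier:
        cost, fact = heapq.heappop(frontier)
        if fact in premises:                    # found a supporting premise
            return fact, path_confidence(cost)
        if cost > max_cost or cost > best_cost.get(fact, float("inf")):
            continue                            # stale or too-expensive entry
        for new_fact, features in mutations(fact):
            new_cost = cost + float(np.dot(cost_vector, features))
            if new_cost < best_cost.get(new_fact, float("inf")):
                best_cost[new_fact] = new_cost
                heapq.heappush(frontier, (new_cost, new_fact))
    return None, 0.0
```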
[link]
#### Introduction

* Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods.

#### Global Matrix Factorization Methods

* Decompose large matrices into low-rank approximations.
* e.g., Latent Semantic Analysis (LSA)

##### Limitations

* Poor performance on the word analogy task.
* Frequent words contribute disproportionately to the similarity measure.

#### Shallow, Local Context-Based Window Methods

* Learn word representations using adjacent words.
* e.g., Continuous bag-of-words (CBOW) model and skip-gram model.

##### Limitations

* Since they do not operate directly on the global co-occurrence counts, they cannot utilise the statistics of the corpus effectively.

#### GloVe Model

* To capture the relationship between words $i$ and $j$, word vector models should use ratios of co-occurrence probabilities (with probe words $k$) instead of the raw probabilities themselves.
* In the most general form:
    * $F(w_{i}, w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$
* We want $F$ to encode this information in the vector space (which has a linear structure), so we restrict $F$ to depend on the difference of $w_{i}$ and $w_{j}$:
    * $F(w_{i} - w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$
* Since the right-hand side is a scalar and the arguments on the left are vectors, we take the dot product of the arguments:
    * $F((w_{i} - w_{j})^{T}\tilde{w}_{k}) = P_{ik}/P_{jk}$
* $F$ should be invariant to the order of the word pair $i$ and $j$, which eventually gives:
    * $F(w_{i}^{T}\tilde{w}_{k}) = P_{ik}$
* With further simplifications and optimisations (refer to the paper), we get the cost function:
    * $J = \sum_{i, j = 1}^{V} f(X_{ij})\left[w_{i}^{T}\tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log(X_{ij})\right]^{2}$
* $f$ is a weighting function:
    * $f(x) = \min((x/x_{\max})^{\alpha}, 1)$
    * Typical values are $x_{\max} = 100$ and $\alpha = 3/4$.
* $b$ are the bias terms.
* A small code sketch of this weighted objective appears after the Results section below.

##### Complexity

* Depends on the number of non-zero elements in the input co-occurrence matrix.
* Upper bounded by the square of the vocabulary size.
* Since the complexity of shallow window-based approaches scales with $|C|$ (the size of the corpus), a tighter bound is needed for a meaningful comparison.
* By modelling the number of co-occurrences of a word pair as a power-law function of its frequency rank, the complexity can be shown to be proportional to $|C|^{0.8}$.

#### Evaluation

##### Tasks

* Word Analogies
    * "a is to b as c is to ___?"
    * Both semantic and syntactic pairs.
    * Find the word $d$ whose vector is closest to $w_{b} - w_{a} + w_{c}$ (using cosine similarity).
* Word Similarity
* Named Entity Recognition

##### Datasets

* Wikipedia dumps - 2010 and 2014
* Gigaword5
* Combination of Gigaword5 and Wikipedia 2014
* CommonCrawl
* The 400,000 most frequent words are considered from each corpus.

##### Hyperparameters

* Size of the context window.
* Whether to distinguish left context from right context.
* Word pairs that are $d$ words apart contribute $1/d$ to the total co-occurrence count.
* $x_{\max} = 100$
* $\alpha = 3/4$
* AdaGrad updates.

##### Models Compared With

* Singular Value Decomposition
* Continuous Bag-Of-Words
* Skip-Gram

##### Results

* GloVe outperforms all other models significantly.
* Diminishing returns for vectors larger than 200 dimensions.
* Small, asymmetric context windows (context window only to the left) work better for syntactic tasks.
* Long, symmetric context windows (context window on both sides) work better for semantic tasks.
* The syntactic task benefited from a larger corpus, though the semantic task performed better with Wikipedia than with Gigaword5, probably due to the comprehensiveness of Wikipedia and the slightly outdated nature of Gigaword5.
* Word2vec's performance decreases if the number of negative samples increases beyond about 10.
* For the same corpus, vocabulary, and window size, GloVe consistently achieves better results, faster.
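A minimal sketch of the weighted least-squares objective above, assuming a precomputed co-occurrence matrix `X` and summing only over its non-zero entries; this is illustrative, not the reference implementation.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x) = min((x / x_max)^alpha, 1)."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_cost(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares cost J over non-zero co-occurrence counts.

    W, W_tilde: (V, K) word and context embeddings
    b, b_tilde: (V,) bias terms
    X: (V, V) co-occurrence count matrix
    """
    i, j = np.nonzero(X)                       # sum only over observed pairs
    preds = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    residual = preds - np.log(X[i, j])
    return float(np.sum(weight(X[i, j]) * residual ** 2))

# Tiny usage example with random data.
V, K = 50, 10
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(V, V)).astype(float)
W, W_tilde = rng.normal(0, 0.1, (V, K)), rng.normal(0, 0.1, (V, K))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_cost(W, W_tilde, b, b_tilde, X))
```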
[link]
#### Introduction

* Open-domain Question Answering (Open QA): efficiently querying a large-scale knowledge base (KB) using natural language.
* Two main approaches:
    * Information Retrieval
        * Transform the question (in natural language) into a valid query (in terms of the KB) to get a broad set of candidate answers.
        * Perform fine-grained detection on the candidate answers.
    * Semantic Parsing
        * Interpret the correct meaning of the question and convert it into an exact query.
    * Limitations:
        * Human intervention is needed to create the lexicon, grammar, and schema.
* This work builds upon previous work where an embedding model learns low-dimensional vector representations of words and symbols.
* [Link](https://arxiv.org/abs/1406.3676) to the paper.

#### Task Definition

* Input: a training set of questions (paired with answers) and a KB providing a structure among the answers.
* Answers are entities in the KB and questions are strings with one identified KB entity.
* The paper uses FREEBASE as the KB.
* Datasets:
    * WebQuestions - built using FREEBASE, the Google Suggest API, and Mechanical Turk.
    * FREEBASE triplets transformed into questions.
    * ClueWeb extractions dataset with entities linked to FREEBASE triplets.
    * Dataset of paraphrased questions built using WIKIANSWERS.

#### Embedding Questions and Answers

* The model learns low-dimensional vector embeddings of words appearing in questions and of entities and relation types of FREEBASE, such that questions and their answers are represented close to each other in the joint embedding space.
* A scoring function $S(q, a)$, where $q$ is a question and $a$ is an answer, generates a high score if $a$ answers $q$:
    * $S(q, a) = f(q)^{T} g(a)$
* $f(q)$ maps the question to the embedding space:
    * $f(q) = W \phi(q)$
    * $W$ is a matrix of dimension $K \times N$.
        * $K$ - dimension of the embedding space (a hyperparameter).
        * $N$ - total number of words/entities/relation types.
    * $\phi(q)$ - sparse vector encoding the number of times each word appears in $q$.
* Similarly, $g(a) = W \psi(a)$ maps the answer to the embedding space.
    * $\psi(a)$ gives the answer representation, as discussed below.
* A minimal code sketch of this scoring setup appears after the Results section below.

#### Possible Representations of Candidate Answers

* Answer represented as a **single entity** from FREEBASE: $\psi(a)$ is a one-of-N encoded vector.
* Answer represented as a **path** from the question entity to the answer entity. The paper considers only one- or two-hop paths, resulting in 3-of-N or 4-of-N encoded vectors (middle entities are not recorded).
* **Subgraph representation**, which encodes both the path and the entire subgraph of entities connected to the answer entity. Two embedding representations are used to differentiate between entities in the path and entities in the subgraph.
* The subgraph approach is based on the hypothesis that including more information about the answers would improve results.

#### Training and Loss Function

* A margin-based ranking loss is minimised to learn the matrix $W$.
* Stochastic Gradient Descent, multi-threaded with Hogwild.

#### Multitask Training of Embeddings

* To account for the large number of synthetically generated questions, the paper also multi-tasks the training of the model with paraphrase prediction.
* Scoring function $S_{prp}(q_1, q_2) = f(q_1)^{T} f(q_2)$, where $f$ uses the same weight matrix $W$ as before.
* A high score is assigned if $q_1$ and $q_2$ belong to the same paraphrase cluster.
* Additionally, the model multi-tasks the mapping of embeddings of FREEBASE entities (mids) to the corresponding words.

#### Inference

* For each question, a candidate set is generated.
* The answer (from the candidate set) with the highest score is reported as the correct answer.
* Candidate set generation strategies:
    * $C_1$ - all KB triplets containing the KB entity from the question form the candidate set. Answers are limited to 1-hop paths.
    * $C_2$ - rank all relation types, keep the top 10 types, and add only those 2-hop candidates where the selected relations appear in the path.

#### Results

* The $C_2$ strategy outperforms the $C_1$ approach, supporting the hypothesis that a richer representation for answers can store more information.
* The proposed approach outperforms the baseline methods but is itself outperformed by an ensemble of the proposed approach with a semantic-parsing-via-paraphrasing model.
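A minimal sketch of the embedding scoring function described above, with a bag-of-indices featurizer standing in for both $\phi$ and $\psi$ (the actual answer featurizers are richer, as described in the representation section); it illustrates $S(q, a) = f(q)^{T} g(a)$ and a margin-based ranking loss, not the authors' exact implementation.

```python
import numpy as np

K, N = 64, 10000                     # embedding dim, vocabulary of words + KB symbols
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(K, N))  # shared embedding matrix for words and KB symbols

def phi(indices, size=N):
    """Sparse bag-of-features vector: counts of word/symbol indices."""
    v = np.zeros(size)
    np.add.at(v, indices, 1.0)
    return v

def score(question_idx, answer_idx):
    """S(q, a) = f(q)^T g(a) with f(q) = W phi(q) and g(a) = W psi(a)."""
    f_q = W @ phi(question_idx)
    g_a = W @ phi(answer_idx)
    return float(f_q @ g_a)

def ranking_loss(question_idx, pos_answer_idx, neg_answer_idx, margin=0.1):
    """Margin-based ranking loss for one (question, correct, incorrect) triple."""
    return max(0.0, margin - score(question_idx, pos_answer_idx)
                          + score(question_idx, neg_answer_idx))

# Usage with toy index lists (hypothetical word / KB-symbol ids).
q = [12, 845, 33]                    # question word ids
a_pos, a_neg = [5001, 7210], [6003, 9100]
print(score(q, a_pos), ranking_loss(q, a_pos, a_neg))
```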
[link]
TLDR; The authors propose a novel encoder-decoder neural network architecture. The encoder RNN encodes a sequence into a fixed-length vector representation and the decoder generates a new variable-length sequence based on this representation. The authors also introduce a new cell type (now called the GRU) to be used with this architecture. The model is evaluated on a statistical machine translation task, where its scores are fed as an additional feature to a log-linear model, leading to improved BLEU scores. The authors also find that the model learns syntactically and semantically meaningful representations of both words and phrases.

#### Key Points

- New encoder-decoder architecture, seq2seq. The decoder is conditioned on the thought vector.
- The architecture can be used for both scoring and generation.
- New hidden unit type, now called the GRU; a simplified LSTM (see the sketch below).
- The whole translation pipeline could be replaced with this architecture, but this paper doesn't do that.
- 15k vocabulary (covering 93% of the dataset), 100d embeddings, 500 maxout units in the final affine layer, batch size of 64, Adagrad, 384M words, 3 days of training time.
- The architecture is trained without frequency information, so we expect it to capture linguistic rather than statistical information.
- Visualizations of both word embeddings and thought vectors.

#### Questions/Notes

- Why not just use LSTM units?
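A minimal sketch of a single GRU cell step in a common modern formulation (gate conventions vary slightly across papers); parameter shapes are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU update: reset gate r, update gate z, candidate state h_tilde.

    x: (input_dim,) current input; h_prev: (hidden_dim,) previous hidden state.
    params: weight matrices W_* (hidden, input), U_* (hidden, hidden),
    and bias vectors b_* (hidden,).
    """
    r = sigmoid(params["W_r"] @ x + params["U_r"] @ h_prev + params["b_r"])
    z = sigmoid(params["W_z"] @ x + params["U_z"] @ h_prev + params["b_z"])
    h_tilde = np.tanh(params["W_h"] @ x + params["U_h"] @ (r * h_prev) + params["b_h"])
    # The update gate interpolates between the old state and the candidate state.
    return (1.0 - z) * h_prev + z * h_tilde

# Toy usage with random parameters.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
shapes = {
    "W_r": (hidden_dim, input_dim), "U_r": (hidden_dim, hidden_dim), "b_r": (hidden_dim,),
    "W_z": (hidden_dim, input_dim), "U_z": (hidden_dim, hidden_dim), "b_z": (hidden_dim,),
    "W_h": (hidden_dim, input_dim), "U_h": (hidden_dim, hidden_dim), "b_h": (hidden_dim,),
}
params = {name: rng.normal(0, 0.1, shape) for name, shape in shapes.items()}
h = gru_step(rng.normal(size=input_dim), np.zeros(hidden_dim), params)
print(h.shape)
```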