[link]
#### Introduction * Automated Theorem Proving (ATP)  Attempting to prove mathematical theorems automatically. * Bottlenecks in ATP: * **Autoformalization**  Semantic or formal parsing of informal proofs. * **Automated Reasoning**  Reasoning about already formalised proofs. * Paper evaluates the effectiveness of neural sequence models for premise selection (related to automated reasoning) without using hand engineered features. * [Link to the paper](https://arxiv.org/abs/1606.04442) #### Premise Selection * Given a large set of premises P, an ATP system A with given resource limits, and a new conjecture C, predict those premises from P that will most likely lead to an automatically constructed proof of C by A #### Dataset * Mizar Mathematical Library (MML) used as the formal corpus. * The premise length varies from 5 to 84299 characters and over 60% if the words occur fewer than 10 times (rare word problem). #### Approach * The model predicts the probability that a given axiom is useful for proving a given conjecture. * Conjecture and axiom sequences are separately embedded into fixed length real vectors, then concatenated and passed to a third network with few fully connected layers and logistic loss. * The two embedding networks and the joint predictor path are trained jointly. ##### Stage 1: Characterlevel Models * Treat premises on character level using an 80dimensional one hot encoding vector. * Architectures for embedding: * pure recurrent LSTM and GRU Network * CNN (with max pooling) * Recurrentconvolutional network that shortens the sequence using convolutional layer before feeding it to LSTM. ##### Stage 2: Wordlevel Models * MML dataset contains both implicit and explicit definitions. * To avoid manually encoding the implicit definitions, the entire statement defining an identifier is embedded and the definition embeddings are used as word level embeddings. * This is better than recursively expanding and embedding the word definition as the definition chains can be very deep. * Once word level embeddings are obtained, the architecture from Characterlevel models can be reused. #### Experiments ##### Metrics * For each conjecture, the model ranks the possible premises. * Primary metric is the number of conjectures proved from topk premises. * Average Max Relative Rank (AMMR) is more sophisticated measure based on the motivation that conjectures are easier to prove if all their dependencies occur earlier in ranking. * Since it is very costly to rank all axioms for a conjecture, an approximation is made and a fixed number of random false dependencies are used for evaluating AMMR. ##### Network Training * Asynchronous distributed stochastic gradient descent using Adam optimizer. * Clipped vs Nonclipped Gradients. * Max Sequence length of 2048 for characterlevel sequences and 500 for wordlevel sequences. ##### Results * Best Selection Pipeline  Stage 1 characterlevel CNN which produces wordlevel embeddings for the next stage. * Best models use simple CNNs followed by max pooling and twostage definitionbased defCNN outperforms naive wordCNN (where word embeddings are learnt in a single pass).
Your comment:
