Welcome to ShortScience.org! |
[link]
This work improves the performance of a segmentation network by utilizing unlabelled data. They use a discriminator (they call EN) to distinguish between annotated and unannotated examples. They then train the segmentation generator (they call SN) based on what will fool the discriminator. https://i.imgur.com/7CfKnh5.png Three training phases are shown above This work is really great. They are using the segmentation to condition the discriminator which will learn to point out flaws when applying the segmentation to the unlabelled examples. Then these flaws in the segmentation are corrected by using the gradients from the discriminator to adjust the segmentation. In contrast with other semi-supervised approaches which learn a latent space for all samples, labelled and unlabelled, and then uses this space to learn a classifier or segmentation; this approach looks for the boundaries of the space only. The unlabelled examples are used to bias the representation learned by the segmentation network to conform to the distribution represented by all observed examples. Read this paper for more: https://arxiv.org/abs/1611.08408 Poster: https://i.imgur.com/eR5jgwn.png |
[link]
This paper focuses on the application of deep learning to the docking problem within rational drug design. The overall objective of drug design or discovery is to build predictive models of how well a candidate compound (or "ligand") will bind with a target protein, to help inform the decision of what compounds are promising enough to be worth testing in a wet lab. Protein binding prediction is important because many small-molecule drugs, which are designed to be small enough to get through cell membranes, act by binding to a specific protein within a disease pathway, and thus blocking that protein's mechanism. The formulation of the docking problem, as best I understand it, is: 1. A "docking program," which is generally some model based on physical and chemical interactions, takes in a (ligand, target protein) pair, searches over a space of ways the ligand could orient itself within the binding pocket of the protein (which way is it facing, where is it twisted, where does it interact with the protein, etc), and ranks them according to plausibility 2. A scoring function takes in the binding poses (otherwise known as binding modes) ranked the highest, and tries to predict the affinity strength of the resulting bond, or the binary of whether a bond is "active". The goal of this paper was to interpose modern machine learning into the second step, as alternative scoring functions to be applied after the pose generation . Given the complex data structure that is a highly-ranked binding pose, the hope was that deep learning would facilitate learning from such a complicated raw data structure, rather than requiring hand-summarized features. They also tested a similar model structure on the problem of predicting whether a highly ranked binding pose was actually the empirically correct one, as determined by some epsilon ball around the spatial coordinates of the true binding pose. Both of these were binary tasks, which I understand to be 1. Does this ranked binding pose in this protein have sufficiently high binding affinity to be "active"? This is known as the "virtual screening" task, because it's the relevant task if you want to screen compounds in silico, or virtually, before doing wet lab testing. 2. Is this ranked binding pose the one that would actually be empirically observed? This is known as the "binding mode prediction" task The goal of this second task was to better understand biases the researchers suspected existed in the underlying dataset, which I'll explain later in this post. The researchers used a graph convolution architecture. At a (very) high level, graph convolution works in a way similar to normal convolution - in that it captures hierarchies of local patterns, in ways that gradually expand to have visibility over larger areas of the input data. The distinction is that normal convolution defines kernels over a fixed set of nearby spatial coordinates, in a context where direction (the pixel on top vs the pixel on bottom, etc) is meaningful, because photos have meaningful direction and orientation. By contrast, in a graph, there is no "up" or "down", and a given node doesn't have a fixed number of neighbors (whereas a fixed pixel in 2D space does), so neighbor-summarization kernels have to be defined in ways that allow you to aggregate information from 1) an arbitrary number of neighbors, in 2) a manner that is agnostic to orientation. Graph convolutions are useful in this problem because both the summary of the ligand itself, and the summary of the interaction of the posed ligand with the protein, can be summarized in terms of graphs of chemical bonds or interaction sites. Using this as an architectural foundation, the authors test both solo versions and ensembles of networks: https://i.imgur.com/Oc2LACW.png 1. "L" - A network that uses graph convolution to summarize the ligand itself, with no reference to the protein it's being tested for binding affinity with 2. "LP" - A network that uses graph convolution on the interaction points between the ligand and protein under the binding pose currently being scored or predicted 3. "R" - A simple network that takes into account the rank assigned to the binding pose by the original docking program (generally used in combination with one of the above). The authors came to a few interesting findings by trying different combinations of the above model modules. First, they found evidence supporting an earlier claim that, in the dataset being used for training, there was a bias in the positive and negative samples chosen such that you could predict activity of a ligand/protein binding using *ligand information alone.* This shouldn't be possible if we were sampling in an unbiased way over possible ligand/protein pairs, since even ligands that are quite effective with one protein will fail to bind with another, and it shouldn't be informationally possible to distinguish the two cases without protein information. Furthermore, a random forest on hand-designed features was able to perform comparably to deep learning, suggesting that only simple features are necessary to perform the task on this (bias and thus over-simplified) Specifically, they found that L+LP models did no better than models of L alone on the virtual screening task. However, the binding mode prediction task offered an interesting contrast, in that, on this task, it's impossible to predict the output from ligand information alone, because by construction each ligand will have some set of binding modes that are not the empirically correct one, and one that is, and you can't distinguish between these based on ligand information alone, without looking at the actual protein relationship under consideration. In this case, the LP network did quite well, suggesting that deep learning is able to learn from ligand-protein interactions when it's incentivized to do so. Interestingly, the authors were only able to improve on the baseline model by incorporating the rank output by the original docking program, which you can think of an ensemble of sorts between the docking program and the machine learning model. Overall, the authors' takeaways from this paper were that (1) we need to be more careful about constructing datasets, so as to not leak information through biases, and (2) that graph convolutional models are able to perform well, but (3) seem to be capturing different things than physics-based models, since ensembling the two together provides marginal value. |
[link]
This paper from 2016 introduced a new k-mer based method to estimate isoform abundance from RNA-Seq data called kallisto. The method provided a significant improvement in speed and memory usage compared to the previously used methods while yielding similar accuracy. In fact, kallisto is able to quantify expression in a matter of minutes instead of hours. The standard (previous) methods for quantifying expression rely on mapping, i.e. on the alignment of a transcriptome sequenced reads to a genome of reference. Reads are assigned to a position in the genome and the gene or isoform expression values are derived by counting the number of reads overlapping the features of interest. The idea behind kallisto is to rely on a pseudoalignment which does not attempt to identify the positions of the reads in the transcripts, only the potential transcripts of origin. Thus, it avoids doing an alignment of each read to a reference genome. In fact, kallisto only uses the transcriptome sequences (not the whole genome) in its first step which is the generation of the kallisto index. Kallisto builds a colored de Bruijn graph (T-DBG) from all the k-mers found in the transcriptome. Each node of the graph corresponds to a k-mer (a short sequence of k nucleotides) and retains the information about the transcripts in which they can be found in the form of a color. Linear stretches having the same coloring in the graph correspond to transcripts. Once the T-DBG is built, kallisto stores a hash table mapping each k-mer to its transcript(s) of origin along with the position within the transcript(s). This step is done only once and is dependent on a provided annotation file (containing the sequences of all the transcripts in the transcriptome). Then for a given sequenced sample, kallisto decomposes each read into its k-mers and uses those k-mers to find a path covering in the T-DBG. This path covering of the transcriptome graph, where a path corresponds to a transcript, generates k-compatibility classes for each k-mer, i.e. sets of potential transcripts of origin on the nodes. The potential transcripts of origin for a read can be obtained using the intersection of its k-mers k-compatibility classes. To make the pseudoalignment faster, kallisto removes redundant k-mers since neighboring k-mers often belong to the same transcripts. Figure1, from the paper, summarizes these different steps. https://i.imgur.com/eNH2kuO.png **Figure1**. Overview of kallisto. The input consists of a reference transcriptome and reads from an RNA-seq experiment. (a) An example of a read (in black) and three overlapping transcripts with exonic regions as shown. (b) An index is constructed by creating the transcriptome de Bruijn Graph (T-DBG) where nodes (v1, v2, v3, ... ) are k-mers, each transcript corresponds to a colored path as shown and the path cover of the transcriptome induces a k-compatibility class for each k-mer. (c) Conceptually, the k-mers of a read are hashed (black nodes) to find the k-compatibility class of a read. (d) Skipping (black dashed lines) uses the information stored in the T-DBG to skip k-mers that are redundant because they have the same k-compatibility class. (e) The k-compatibility class of the read is determined by taking the intersection of the k-compatibility classes of its constituent k-mers.[From Bray et al. Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology, 2016.] Then, kallisto optimizes the following RNA-Seq likelihood function using the expectation-maximization (EM) algorithm. $$L(\alpha) \propto \prod_{f \in F} \sum_{t \in T} y_{f,t} \frac{\alpha_t}{l_t} = \prod_{e \in E}\left( \sum_{t \in e} \frac{\alpha_t}{l_t} \right )^{c_e}$$ In this function, $F$ is the set of fragments (or reads), $T$ is the set of transcripts, $l_t$ is the (effective) length of transcript $t$ and **y**$_{f,t}$ is a compatibility matrix defined as 1 if fragment $f$ is compatible with $t$ and 0 otherwise. The parameters $α_t$ are the probabilities of selecting reads from a transcript $t$. These $α_t$ are the parameters of interest since they represent the isoforms abundances or relative expressions. To make things faster, the compatibility matrix is collapsed (factorized) into equivalence classes. An equivalent class consists of all the reads compatible with the same subsets of transcripts. The EM algorithm is applied to equivalence classes (not to reads). Each $α_t$ will be optimized to maximise the likelihood of transcript abundances given observations of the equivalence classes. The speed of the method makes it possible to evaluate the uncertainty of the abundance estimates for each RNA-Seq sample using a bootstrap technique. For a given sample containing $N$ reads, a bootstrap sample is generated from the sampling of $N$ counts from a multinomial distribution over the equivalence classes derived from the original sample. The EM algorithm is applied on those sampled equivalence class counts to estimate transcript abundances. The bootstrap information is then used in downstream analyses such as determining which genes are differentially expressed. Practically, we can illustrate the different steps involved in kallisto using a small example. Starting from a tiny genome with 3 transcripts, assume that the RNA-Seq experiment produced 4 reads as depicted in the image below. https://i.imgur.com/5JDpQO8.png The first step is to build the T-DBG graph and the kallisto index. All transcript sequences are decomposed into k-mers (here k=5) to construct the colored de Bruijn graph. Not all nodes are represented in the following drawing. The idea is that each different transcript will lead to a different path in the graph. The strand is not taken into account, kallisto is strand-agnostic. https://i.imgur.com/4oW72z0.png Once the index is built, the four reads of the sequenced sample can be analysed. They are decomposed into k-mers (k=5 here too) and the pre-built index is used to determine the k-compatibility class of each k-mer. Then, the k-compatibility class of each read is computed. For example, for read 1, the intersection of all the k-compatibility classes of its k-mers suggests that it might come from transcript 1 or transcript 2. https://i.imgur.com/woektCH.png This is done for the four reads enabling the construction of the compatibility matrix **y**$_{f,t}$ which is part of the RNA-Seq likelihood function. In this equation, the $α_t$ are the parameters that we want to estimate. https://i.imgur.com/Hp5QJvH.png The EM algorithm being too slow to be applied on millions of reads, the compatibility matrix **y**$_{f,t}$ is factorized into equivalence classes and a count is computed for each class (how many reads are represented by this equivalence class). The EM algorithm uses this collapsed information to maximize the new equivalent RNA-Seq likelihood function and optimize the $α_t$. https://i.imgur.com/qzsEq8A.png The EM algorithm stops when for every transcript $t$, $α_tN$ > 0.01 changes less than 1%, where $N$ is the total number of reads. |
[link]
Ongoing declines in production of the world's fisheries may have serious ecological and socioeconomic consequences. As a result, a number of international efforts have sought to improve management and prevent overexploitation, while helping to maintain biodiversity and a sustainable food supply. Although these initiatives have received broad acceptance, the extent to which corrective measures have been implemented and are effective remains largely unknown. We used a survey approach, validated with empirical data, and enquiries to over 13,000 fisheries experts (of which 1,188 responded) to assess the current effectiveness of fisheries management regimes worldwide; for each of those regimes, we also calculated the probable sustainability of reported catches to determine how management affects fisheries sustainability. Our survey shows that 7% of all coastal states undergo rigorous scientific assessment for the generation of management policies, 1.4% also have a participatory and transparent processes to convert scientific recommendations into policy, and 0.95% also provide for robust mechanisms to ensure the compliance with regulations; none is also free of the effects of excess fishing capacity, subsidies, or access to foreign fishing. A comparison of fisheries management attributes with the sustainability of reported fisheries catches indicated that the conversion of scientific advice into policy, through a participatory and transparent process, is at the core of achieving fisheries sustainability, regardless of other attributes of the fisheries. Our results illustrate the great vulnerability of the world's fisheries and the urgent need to meet well-identified guidelines for sustainable management; they also provide a baseline against which future changes can be quantified. Author Summary Top Global fisheries are in crisis: marine fisheries provide 15% of the animal protein consumed by humans, yet 80% of the world's fish stocks are either fully exploited, overexploited or have collapsed. Several international initiatives have sought to improve the management of marine fisheries, hoping to reduce the deleterious ecological and socioeconomic consequence of the crisis. Unfortunately, the extent to which countries are improving their management and whether such intervention ensures the sustainability of the fisheries remain unknown. Here, we surveyed 1,188 fisheries experts from every coastal country in the world for information about the effectiveness with which fisheries are being managed, and related those results to an index of the probable sustainability of reported catches. We show that the management of fisheries worldwide is lagging far behind international guidelines recommended to minimize the effects of overexploitation. Only a handful of countries have a robust scientific basis for management recommendations, and transparent and participatory processes to convert those recommendations into policy while also ensuring compliance with regulations. Our study also shows that the conversion of scientific advice into policy, through a participatory and transparent process, is at the core of achieving fisheries sustainability, regardless of other attributes of the fisheries. These results illustrate the benefits of participatory, transparent, and science-based management while highlighting the great vulnerability of the world's fisheries services. The data for each country can be viewed at http://as01.ucis.dal.ca/ramweb/surveys/fishery_assessment . |
[link]
Brendel et al. propose a decision-based black-box attacks against (deep convolutional) neural networks. Specifically, the so-called Boundary Attack starts with a random adversarial example (i.e. random noise that is not classified as the image to be attacked) and randomly perturbs this initialization to move closer to the target image while remaining misclassified. In pseudo code, the algorithm is described in Algorithm 1. Key component is the proposal distribution $P$ used to guide the adversarial perturbation in each step. In practice, they use a maximum-entropy distribution (e.g. uniform) with a couple of constraints: the perturbed sample is a valid image; the perturbation has a specified relative size, i.e. $\|\eta^k\|_2 = \delta d(o, \tilde{o}^{k-1})$; and the perturbation reduces the distance to the target image $o$: $d(o, \tilde{o}^{k-1}) – d(o,\tilde{o}^{k-1} + \eta^k)=\epsilon d(o, \tilde{o}^{k-1})$. This is approximated by sampling from a standard Gaussian, clipping and rescaling and projecting the perturbation onto the $\epsilon$-sphere around the image. In experiments, they show that this attack is competitive to white-box attacks and can attack real-world systems. https://i.imgur.com/BmzhiFP.png Algorithm 1: Minimal pseudo code version of the boundary attack. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |