This paper is a bit provocative (especially in the light of the recent DeepMind MuZero paper), and poses some interesting questions about the value of model-based planning. I'm not sure I agree with the overall argument it's making, but I think the experience of reading it made me hone my intuitions around why and when model-based planning should be useful.
The overall argument of the paper is: rather than learning a dynamics model of the environment and then using that model to plan and learn a value/policy function from, we could instead just keep a large replay buffer of actual past transitions, and use that in lieu of model-sampled transitions to further update our reward estimators without having to acquire more actual experience. In this paper's framing, the central value of having a learned model is this ability to update our policy without needing more actual experience, and it argues that actual real transitions from the environment are more reliable and less likely to diverge than transitions from a learned parametric model. It basically sees a big buffer of transitions as an empirical environment model that it can sample from, in a roughly equivalent way to being able to sample transitions from a learnt model.
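To make the "buffer as empirical model" framing concrete, here is a minimal sketch of extra value updates drawn from stored real transitions instead of model-generated ones. The `ReplayBuffer` class, the tabular `q_update` rule, and the two-step toy chain are all illustrative assumptions, not the paper's Rainbow setup.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of real (state, action, reward, next_state) transitions."""

    def __init__(self, capacity, seed=0):
        self.transitions = deque(maxlen=capacity)  # oldest entries are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.transitions.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling with replacement from stored real experience.
        return [self.rng.choice(self.transitions) for _ in range(batch_size)]


def q_update(q, batch, actions, alpha=0.5, gamma=0.9):
    """Tabular Q-learning update on a batch of sampled transitions (illustrative)."""
    for state, action, reward, next_state in batch:
        best_next = max(q.get((next_state, a), 0.0) for a in actions)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q


# Two real transitions from a toy chain 0 -> 1 -> 2, with reward on the last step.
buffer = ReplayBuffer(capacity=100)
buffer.add(0, "right", 0.0, 1)
buffer.add(1, "right", 1.0, 2)

# Keep updating from the buffer without collecting any new experience.
q_values = {}
for _ in range(200):
    q_update(q_values, buffer.sample(4), actions=["right"])
```

Note that the buffer can only return transitions it has actually stored; nothing here answers "what happens if I take action `a` in state `s`" for an unseen pair.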
An obvious counter-argument to this is the value of models in being able to simulate particular arbitrary trajectories (for example, potential actions you could take from your current point, as is needed for Monte Carlo Tree Search). Simply keeping around a big stock of historical transitions doesn't serve the use case of being able to get a probable next state *for a particular transition*, both because we might not have that state in our data, and because we don't have any way, just given a replay buffer, of knowing that an available state comes after an action if we haven't seen that exact combination before. (And, even if we had, we'd have to have some indexing/lookup mechanism atop the data). I didn't feel like the paper's response to this was all that convincing. It basically just argues that planning with model transitions can theoretically diverge (though acknowledges it empirically often doesn't), and that it's dangerous to update off of "fictional" modeled transitions that aren't grounded in real data. While it's obviously definitionally true that model transitions are in some sense fictional, that's just the basic trade-off of how modeling works: some ability to extrapolate, but a realization that there's a risk you extrapolate poorly.
The paper's empirical contribution to its argument was to argue that in a low-data setting, model-free RL (in the form of the "everything but the kitchen sink" Rainbow RL algorithm) with experience replay can outperform a model-based SimPLe system on Atari. This strikes me as fairly weak support for the paper's overall claim, especially since historically Atari has been difficult to learn good models of when they're learnt in actual-observation pixel space. Nonetheless, I think this push against the utility of model-based learning is a useful thing to consider if you do think models are useful, because it will help clarify the reasons why you think that's the case.
Arguably, the central achievement of the deep learning era is multi-layer neural networks' ability to learn useful intermediate feature representations using a supervised learning signal. In a supervised task, it's easy to define what makes a feature representation useful: the fact that it's easier for a subsequent layer to use to make the final class prediction. When we want to learn features in an unsupervised way, things get a bit trickier. There's the obvious problem of what kinds of problem structures and architectures work to extract representations at all. But there's also a deeper problem: when we ask for a good feature representation, outside of the context of any given task, what are we asking for? Are there some inherent aspects of a representation that can be analyzed without ground truth labels to tell you whether the representations you've learned are good or not?
The notion of "disentangled" features is one answer to that question: it suggests that a representation is good when the underlying "factors of variation" (things that are independently variable in the underlying generative process of the data) are captured in independent dimensions of the feature representation. That is, if your representation is a ten-dimensional vector, and it just so happens that there are ten independent factors along which datapoints differ (color, shape, rotation, etc), you'd ideally want each dimension to correspond to each factor.
This criterion has an elegance to it, and it's previously been shown useful in predicting when the representations learned by a model will be useful in predicting the values of the factors of variation. This paper goes one step further, and tests the value of disentangled representations for solving a visual reasoning task that involves the factors of variation, but doesn't just involve predicting them. In particular, the authors use learned representations to solve a task patterned on a human IQ test, where some factors stay fixed across a row in a grid, and some vary, and the model needs to generate the image that "fits the pattern".
To test the value of disentanglement, they looked at a few canonical metrics of disentanglement, including scores that represent "how many factors are captured in each dimension" and "how many dimensions is a factor spread across". They measured the correlation of these metrics with task performance, and compared that with the correlation between simple autoencoder reconstruction error and performance.
They found that at early stages of training on top of the representations, the disentanglement metrics were more predictive of performance than reconstruction accuracy. This distinction went away as the model learning on top of the representations had more time to train. It makes reasonable sense that you'd mostly see value for disentangled features in a low-data regime, since after long enough the fine-tuning network can learn its own features regardless. But, this paper does appear to contribute to evidence that disentangled features are predictive of task performance, at least when that task directly involves manipulation of specific, known, underlying factors of variation.
Summary: An odd thing about machine learning these days is how far you can get in a line of research while only ever testing your method on image classification and image datasets in general. This leads one occasionally to wonder whether a given phenomenon or advance is a discovery of the field generally, or whether it's just a fact about the informatics and learning dynamics inherent in image data.
This paper, part of a set of recent papers released by Facebook centering around the Lottery Ticket Hypothesis, exists in the noble tradition of "let's try <thing> on some non-image datasets, and see if it still works". This can feel a bit silly in the cases where the ideas or approaches do transfer, but I think it's still an important impulse for the field to have, lest we become too captured by ImageNet and its various descendants.
This paper tests the Lottery Ticket Hypothesis - the idea that there is a small subset of weights in a trained network whose lucky initializations promoted learning, such that if you reset those weights to their initializations and train only them you get comparable or near-comparable performance to the full network - on reinforcement learning and NLP datasets. In particular, within RL, they tested on both simple continuous control (where the observation state is a vector of meaningful numbers) and Atari from pixels (where the observation is a full image). In NLP, they trained on language modeling and translation, with an LSTM and a Transformer respectively. (Prior work had found that Transformers didn't exhibit lottery-ticket-like phenomena, but this paper found a circumstance where they appear to.)
Some high level interesting results:
- So as to not bury the lede: by and large, "winning" tickets retrained at their original initializations outperform random initializations of the same size and configuration on both NLP and Reinforcement Learning problems
- There is wide variability in how much pruning in general (a necessary prerequisite operation) impacts reinforcement learning. On some games, pruning at all crashes performance, on others, it actually improves it. This leads to some inherent variability in results
- One thing that prior researchers in this area have found is that pruning weights all at once at the end of training tends to crash performance for complex models, and that in order to find pruned models that have Lottery Ticket-esque high-performing properties, you need to do "iterative pruning". This works by training a model for a period, then pruning some proportion of weights, then training again from the beginning, and then pruning again, and so on, until you prune down to the full percentage you want to prune. The idea is that this lets the model adapt gradually to a drop in weights, and "train around" the capacity reduction, rather than it just happening all at once. In this paper, the authors find that this is strongly necessary for Lottery Tickets to be found for either Transformers or many RL problems. On a surface level, this makes sense, since Reinforcement Learning is a notoriously tricky and non-stationary learning problem, and Transformers are complex models with lots of parameters, and so dramatically reducing parameters can handicap the model. A weird wrinkle, though, is that the authors find that lottery tickets found without iterative pruning actually perform worse than "random tickets" (i.e. initialized subnetworks with random topology and weights). This is pretty interesting, since it implies that the topology and weights you get if you prune all at once are actually counterproductive to learning. I don't have a good intuition as to why, but would love to hear if anyone does.
- For the Transformer specifically, there was an interesting divergence in the impact of weight pruning between the weights of the embedding matrix and the weights of the rest of the network machinery. If you include embeddings in the set of weights being pruned, there's essentially no difference in performance between random and winning tickets, whereas if you exclude them, winning tickets exhibit the more typical pattern of outperforming random ones. This implies that whatever phenomenon makes winning tickets better is more strongly (or perhaps only) present in weights for feature calculation on top of embeddings, and not very present for the embeddings themselves.
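The iterative pruning loop described in the bullets above can be sketched as follows. This is a toy illustration under stated assumptions: weights are a flat list, `toy_train` is a hypothetical stand-in for a real training run, and pruning is by smallest magnitude among the surviving weights; the paper's actual schedules and models are not reproduced here.

```python
def prune_lowest(weights, mask, fraction):
    """Deactivate the `fraction` of still-active weights with the smallest magnitude."""
    active = [i for i, m in enumerate(mask) if m]
    n_prune = int(len(active) * fraction)
    to_prune = sorted(active, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = 0
    return new_mask


def find_winning_ticket(init_weights, train_fn, rounds, fraction_per_round):
    """Alternate training and pruning, rewinding survivors to their initial values."""
    mask = [1] * len(init_weights)
    for _ in range(rounds):
        # Train again from the beginning: surviving weights restart at initialization.
        start = [w * m for w, m in zip(init_weights, mask)]
        trained = train_fn(start, mask)
        mask = prune_lowest(trained, mask, fraction_per_round)
    return mask


def toy_train(weights, mask):
    # Hypothetical stand-in for a training loop: doubles every active weight.
    return [2.0 * w * m for w, m in zip(weights, mask)]


init = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8]
ticket_mask = find_winning_ticket(init, toy_train, rounds=2, fraction_per_round=0.5)
# The "winning ticket" is the mask together with the original initial values.
ticket_init = [w * m for w, m in zip(init, ticket_mask)]
```

One-shot pruning corresponds to `rounds=1` with a large `fraction_per_round`; the iterative version reaches the same sparsity in smaller steps.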
Bayesian Neural Networks (BNN): intrinsic importance model based on weight uncertainty; variational inference can approximate posterior distributions using Monte Carlo sampling for gradient estimation; a BNN acts like an ensemble method in that it reduces the prediction variance, but uses only 2x the number of parameters.
The idea is to use the BNN's uncertainty to guide gradient descent so that it does not update the important weights when learning new tasks.
## Bayes by Backprop (BBB):
Here $q(w|\theta)$ is our approximation of the posterior $p(w|x)$; $q$ is usually a Gaussian with diagonal covariance. We can optimize this via the ELBO:
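For reference, the objective usually written for Bayes by Backprop is given below in its standard form. The symbols $\mathcal{D}$ (the training data), $p(w)$ (the prior over weights), and $N$ (the number of Monte Carlo samples) are notation assumed here, since the notes do not name them. Minimizing this quantity is equivalent to maximizing the ELBO:

$$
\theta^{*} = \arg\min_{\theta}\; \mathrm{KL}\big[\,q(w|\theta)\,\|\,p(w)\,\big] \;-\; \mathbb{E}_{q(w|\theta)}\big[\log p(\mathcal{D}|w)\big]
$$

In practice the expectation is approximated with samples $w^{(i)} \sim q(w|\theta)$:

$$
\mathcal{L}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \Big[\log q(w^{(i)}|\theta) - \log p(w^{(i)}) - \log p(\mathcal{D}|w^{(i)})\Big]
$$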
## Uncertainty-guided CL with BNN (UCB):
In UCB, regularization is performed through the learning rate: the learning rate of each parameter, and hence the size of its gradient update, becomes a function of that parameter's importance. The authors set the importance to be inversely proportional to the standard deviation $\sigma$ of $q(w|\theta)$.
Simply put, the more confident the posterior is about a certain weight, the less that weight is updated.
You can also use the importance for weight pruning (a sort of hard version of the first idea).
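A minimal sketch of the uncertainty-scaled update, assuming the simplest reading of the description above: importance is exactly $1/\sigma$ and the per-parameter learning rate is the base rate divided by importance. The function names and the pruning rule are illustrative assumptions, not the authors' code.

```python
def ucb_step(mu, sigma, grad_mu, base_lr):
    """One update of the posterior means with uncertainty-scaled learning rates."""
    new_mu = []
    for m, s, g in zip(mu, sigma, grad_mu):
        importance = 1.0 / s        # small sigma => confident => important
        lr = base_lr / importance   # important weights take smaller steps
        new_mu.append(m - lr * g)
    return new_mu


def prune_least_important(mu, sigma, keep):
    """Hard variant: zero out all but the `keep` most important weights."""
    order = sorted(range(len(mu)), key=lambda i: 1.0 / sigma[i], reverse=True)
    kept = set(order[:keep])
    return [m if i in kept else 0.0 for i, m in enumerate(mu)]


mu = [1.0, 1.0]
sigma = [0.01, 1.0]   # first weight is confident, second is uncertain
updated = ucb_step(mu, sigma, grad_mu=[1.0, 1.0], base_lr=0.1)
pruned = prune_least_important(mu, sigma, keep=1)
```

With identical gradients, the confident weight barely moves while the uncertain one takes the full step.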
In my view, the Lottery Ticket Hypothesis is one of the weirder and more mysterious phenomena of the last few years of Machine Learning. We've known for a while that we can take trained networks and prune them down to a small fraction of their weights (keeping those weights with the highest magnitudes) and maintain test performance using only those learned weights. That seemed somewhat surprising, in that there were a lot of weights that weren't actually necessary for encoding the learned function, but, the thinking went, possibly having many times more weights than that was helpful for training, even if not necessary once a model is trained. The authors of the original Lottery Ticket paper came to the surprising realization that they could take the weights that survived pruning in the final network, reset them (and only them) to the values they had at initialization, train just that subnetwork, and perform almost as well as the final pruned model that had all weights active during training. And performance using the specific weights and their particular initialization values is much higher than training a comparable topology of weights with random initial values.
This paper out of Facebook AI adds another fascinating experiment to the pile of odd evidence around lottery tickets: they test whether lottery tickets transfer *between datasets*, and they find that they often do (at least when the dataset on which the lottery ticket is found is more complex (in terms of size, input complexity, or number of classes) than the dataset the ticket is being transferred to). Even more interestingly, they find that for sufficiently simple datasets, the "ticket" initialization pattern learned on a more complex dataset actually does *better* than ones learned on the simple dataset itself. They also find that tickets by and large transfer between SGD and Adam, so whatever kind of inductive bias or value they provide is general across optimizers in addition to at least partially general across datasets.
I find this result fun to think about through a few frames. The first is to remember that figuring out heuristics for initializing networks (as a function of their topology) was an important step in getting them to train at all, so while this result may at first seem strange and arcane, in that context it feels less surprising that there are still-better initialization heuristics out there, possibly with some kind of interesting theoretical justification to them, that humans simply haven't been clever enough to formalize yet, and have only discovered empirically through methods like this.
This result is also interesting in terms of transfer: we've known for a while that the representations learned on more complex datasets can convey general information back to smaller ones, but it's less easy to think about what information is conveyed by the topology and connectivity of a network. This paper suggests that the information is there, and has prompted me to think more about the slightly mind-bending question of how training models could lead to information compressed in this form, and how this information could be better understood.
VQ-VAE is a Variational AutoEncoder that uses as its information bottleneck a discrete set of codes, rather than a continuous vector. That is: the encoder creates a downsampled spatial representation of the image, where each grid cell of the downsampled image is represented by a vector. But, before that vector is passed to the decoder, it's discretized, by (effectively) clustering the vectors the network has historically seen, and substituting each vector with the center of the cluster it's closest to. This has the effect of reducing the capacity of your information bottleneck, but without just pushing your encoded representation closer to an uninformed prior. (If you're wondering how the gradient survives this very much not continuous operation, the answer is: we just pretend that operation didn't exist, and imagine that the encoder produced the cluster-center "codebook" vector that the decoder sees).
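A minimal sketch of the discretization step, assuming a small fixed codebook and squared Euclidean distance; in the real model the codebook is learned and the vectors come from a convolutional encoder. The function names and toy values are illustrative only.

```python
def quantize(vector, codebook):
    """Return (index, code) for the codebook entry nearest to `vector`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))
    return index, codebook[index]


def straight_through(grad_wrt_code):
    # "Pretend the operation didn't exist": the gradient arriving at the
    # quantized code is copied unchanged onto the encoder output.
    return list(grad_wrt_code)


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]

# A 2x2 grid of encoder output vectors, one per downsampled cell.
latents = [
    [[0.1, -0.2], [3.6, 4.1]],
    [[0.9, 1.2], [1.4, 0.8]],
]

# The discrete grid of code indices is what the prior model is later trained on.
code_grid = [[quantize(v, codebook)[0] for v in row] for row in latents]
```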
The part of the model that got a (small) upgrade in this paper is the prior distribution model that's learned on top of these latent representations. The goal of this prior is to be able to just sample images, unprompted, from the distribution of latent codes. Once we have a trained decoder, if we give it a grid of such codes, it can produce an image. But these codes aren't one-per-image, but rather a grid of many codes representing features in different parts of the image. In order to generate a set of codes corresponding to a reasonable image, we can either generate them all at once, or else (as this paper does) use an autoregressive approach, where some parts of the code grid are generated, and then subsequent ones conditioned on those. In the original version of the paper, the autoregressive model used was a PixelCNN (don't have the space to fully explain that here, but, at a high level: a model that uses convolutions over previously generated regions to generate a new region). In this paper, the authors took inspiration from the huge rise of self-attention in recent years, and swapped that operation in for the convolutions. Self-attention has the nice benefit that you can easily have a global receptive range (each region being generated can see all other regions) which you'd otherwise need multiple layers of convolutions to accomplish.
In addition, the authors add another layer of granularity: generating both a 32x32 and a 64x64 grid, and using both to generate the decoded reconstruction. They argue that this allows one representation to focus on more global details, and the other on more precise ones.
The final result is the ability to generate quite realistic looking images, that at least are being claimed to be more diverse than those generated by GANs (examples above). I'm always a bit cautious of claims of better performance in the image-generation area, because it's all squinting at pixels and making up somewhat-reasonable but still arbitrary metrics. That said, it seems interesting and useful to be aware of the current relative capabilities of two of the main forms of generative modeling, and so I'd recommend this paper on that front, even if it's hard for me personally to confidently assess the improvements on prior art.
When talking about modern machine learning, particularly on images, it can feel like deep neural networks are a world unto themselves when it comes to complexity. On one hand, there are straightforward things like hand-designed features and linear classifiers, and then on the other, there are these deep, heavily-interacting networks that dazzle us with their performance but seem almost unavoidably difficult to hold in our heads or interpret. This paper, from ICLR 2019 earlier this year, investigates another point along this trade-off curve of complexity: a model that uses deep layers of convolutions, but limits the receptive field of those convolutions so that each feature is calculated using only a small spatial area of the image.
This approach, termed BagNet, essentially predicts class logits off of a small area of the image, without using information from anywhere else. Then, to aggregate the local predictions, a few simple and linear steps are performed: the predictions from each spatial area are averaged together into one vector containing the "aggregate information" for each class, and then that class information vector is passed into a linear (non-interacting!) model to predict final class probabilities.
This is quite nice for interpretability, because you can directly identify the areas of the image that contributed evidence to the prediction, and you can know that the impact of those areas wasn't in fact amplified by feature values elsewhere, because there are no interaction effects outside of these small regions.
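A minimal sketch of the aggregation step described above, assuming each patch has already been turned into a small feature vector by the local network (not shown). The matrix `W`, bias `b`, and the toy features are made up for illustration. Because every step after the patch features is linear, averaging features and then applying the classifier gives the same logits as averaging per-patch logits, which is what lets each patch's contribution be read off directly.

```python
import math


def average(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def linear(W, b, x):
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Three image patches, each summarized by two local features (toy values).
patch_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[2.0, 0.0], [0.0, 1.0]]  # one row per class
b = [0.0, 0.0]

# Route 1: average the patch features, then apply the linear classifier.
logits = linear(W, b, average(patch_features))
probs = softmax(logits)

# Route 2: per-patch class evidence, averaged afterwards. Same result.
per_patch_logits = [linear(W, b, f) for f in patch_features]
logits_from_patches = average(per_patch_logits)
```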
Now, it's fairly obvious that you're not going to get any state-of-the-art results off of this: the entire point is to handicap a network in ways believed to make it more interpretable. So the interesting question is instead what degree of performance loss comes from such a (fairly drastic) limitation of model capacity and receptive field? And the answer of the paper is: less than you might think. (Or, at least, less than *they* think you think). If you only use features calculated from 33x33 pixel chunks of ImageNet images, and aggregate their evidence together in a purely linear way, you can get to 87.6% top-5 accuracy on ImageNet, which is about where we were with AlexNet in 2012.
The authors also do some comparisons of their network to more common neural networks, to try to argue that even fully nonlinear neural nets don't use spatial information very much in their predictions. One way they did this was by masking different areas of the image, and comparing the effect of masking each individually to the effect of masking all areas together. In a purely linear model like BagNet, where the effects of different areas are just aggregated together, these would sum together perfectly, and the performance loss of all areas at once would be equal to the sum of each individually. To measure "effective spatial linearity" of each network, they measured the correlation between the sum of the individual effects and the joint effect. For VGG, they found a correlation of 0.75 here (compared to 1.0 for BagNet), which they use to argue that VGG doesn't use very much spatial information. I found this result hard to really get a grounding on, since I don't have a good intuitive grasp for what differences in this correlation value would mean. Is a difference of 0.25 a small difference, or a dramatic one?
That aside, I found this paper interesting, and I'm quite pleased it was written. On one hand, you can say: well, obviously, we've done a lot of work in 7 years to build ResNet and DenseNet and whatnot, so of course if you apply those more advanced architectures, even on a small region of image space, you'll get good performance. That said, I still think this is an interesting finding, because it helps us understand how much of the added value in recent research requires a high (and uninterpretable) interaction complexity, and what proportion of the overall performance can be achieved with a simpler-to-understand model. Machine learning is used in a lot of settings, and it always practically exists on a trade-off curve, where performance is important, but it's often worth trading off performance to do better on other considerations, and this paper does a good job of illustrating that trade-off curve more fully.
First published: 2019/11/19
Abstract: Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
The successes of deep learning on complex strategic games like Chess and Go have been largely driven by the ability to do tree search: that is, simulating sequences of actions in the environment, and then training policy and value functions to more speedily approximate the results that more exhaustive search reveals. However, this relies on having a good simulator that can predict the next state of the world, given your action. In some games, with straightforward rules, this is easy to explicitly code, but in many RL tasks like Atari, and in many contexts in the real world, having a good model of how the world responds to your actions is in fact a major part of the difficulty of RL.
A response to this within the literature has been systems that learn models of the world from trajectories, and then use those models to do this kind of simulated planning. Historically these have been done by designing models that predict the next observation, given past observations and a passed-in action. This lets you "roll out" observations from actions in a way similar to how a simulator could. However, in high-dimensional observation spaces it takes a lot of model capacity to accurately model the full observation, and many parts of a given observation space will often be irrelevant.
To address this difficulty, the MuZero architecture uses an approach from Value Prediction Networks, and learns an internal model that can predict transitions between abstract states (which don't need to match the actual observation state of the world) and then predict a policy, value, and next-step reward from the abstract state. So, we can plan in latent space, by simulating transitions from state to state through actions, and the training signal for that space representation and transition model comes from being able to accurately predict the reward, the empirical future value at a state (discovered through Monte Carlo rollouts) and the policy action that the rollout search would have taken at that point. If two observations are identical in terms of their implications for these quantities, the transition model doesn't need to differentiate them, making it more straightforward to learn. (Apologies for the long caption in above screenshot; I feel like it's quite useful to gain intuition, especially if you're less recently familiar with the MCTS deep learning architectures DeepMind typically uses)
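To make the "plan in latent space" idea concrete, here is a toy sketch of unrolling a learned model through a sequence of actions. The three functions stand in for the learned representation, dynamics, and prediction networks; their bodies are hand-written placeholders (assumptions for illustration), whereas in MuZero they are neural networks trained so that the outputs match observed rewards, search policies, and empirical returns. The root prediction before the first action is omitted for brevity.

```python
def represent(observation):
    # Observation -> abstract state. The toy version keeps only a summary,
    # so different observations can share one abstract state.
    return sum(observation)


def dynamics(state, action):
    # (abstract state, action) -> (predicted reward, next abstract state)
    next_state = state + action
    reward = 1 if next_state % 3 == 0 else 0
    return reward, next_state


def predict(state):
    # abstract state -> (policy over two actions, value estimate)
    policy = [0.5, 0.5]
    value = float(state)
    return policy, value


def unroll(observation, actions):
    """Apply the model iteratively, never reconstructing an observation."""
    state = represent(observation)
    outputs = []
    for action in actions:
        reward, state = dynamics(state, action)
        policy, value = predict(state)
        outputs.append((reward, policy, value))
    return outputs


rollout_a = unroll([1, 1], actions=[1, 1])
rollout_b = unroll([2, 0], actions=[1, 1])  # different pixels, same abstract state
```

The two observations differ, but because they imply the same rewards, policies, and values, the model has no reason to tell them apart.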
The most impressive empirical aspect of this paper is the fact that it claims (from what I can tell credibly) to be able to perform as well as planning algorithms with access to a real simulator in games like Chess and Go, and as well as model-free models in games like Atari where MFRL has typically been the state of the art (because world models have been difficult to learn). I feel like I've read a lot recently that suggests to me that the distinction between model-free and model-based RL is becoming increasingly blurred, and I'm really curious to see how that trajectory evolves in future.
Recently, DeepMind released a new paper showing strong performance on board game tasks using a mechanism similar to the Value Prediction Network one in this paper, which inspired me to go back and get a grounding in this earlier work.
A goal of this paper is to design a model-based RL approach that can scale to complex environment spaces, but can still be used to run simulations and do explicit planning. Traditional model-based RL has worked by learning a dynamics model of the environment (predicting the next observation state given the current one and an action) and then using that model of the world to learn values and plan with. In addition to the advantages of explicit planning, a hope is that model-based systems generalize better to new environments, because they predict one-step changes in local dynamics in a way that can be more easily separated from long-term dynamics or reward patterns.
However, a downside of MBRL is that it can be hard to train, especially when your observation space is high-dimensional, and learning a straight model of your environment will lead to you learning details that aren't actually important for planning or creating policies.
The synthesis proposed by this paper is the Value Prediction Network. Rather than predicting observed state at the next step, it learns a transition model in latent space, and then learns to predict next-step reward and future value from that latent space vector. Because it learns to encode latent-space state from observations, and also learns a transition model from one latent state to another, the model can be used for planning, by simulating multiple transitions between latent states. However, unlike a normal dynamics model, whose training signal comes from a loss against observational prediction, the signal for training both latent → reward/value/discount predictions, and latent → latent transitions comes from using this pipeline to predict reward values. This means that if an aspect of the environment isn't useful for predicting reward, it won't generally be encoded into latent state, meaning you don't waste model capacity predicting irrelevant detail.
Once this model exists, it can be used for generating a policy through a tree-search planning approach: simulating future trajectories and aggregating the predicted reward along those trajectories, and then taking the highest-value one.
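A simplified sketch of that planning step: a depth-limited search that simulates latent transitions, sums predicted rewards, and bootstraps from the predicted value at the leaves. The toy `transition`, `reward`, and `value` functions are hand-written assumptions standing in for the learned modules, and the plain max backup used here is a simplification rather than the paper's exact backup rule.

```python
ACTIONS = [-1, 1]


def transition(state, action):
    # Stand-in for the learned latent transition model.
    return state + action


def reward(state, action):
    # Stand-in for the learned reward predictor: reward for reaching state 2.
    return 1.0 if state + action == 2 else 0.0


def value(state):
    # Stand-in for the learned value predictor (uninformative here).
    return 0.0


def q_plan(state, action, depth, gamma=0.9):
    """Estimated return of taking `action`, looking `depth` steps ahead."""
    next_state = transition(state, action)
    r = reward(state, action)
    if depth == 1:
        return r + gamma * value(next_state)
    best_next = max(q_plan(next_state, a, depth - 1, gamma) for a in ACTIONS)
    return r + gamma * best_next


def best_action(state, depth, gamma=0.9):
    return max(ACTIONS, key=lambda a: q_plan(state, a, depth, gamma))


shallow = q_plan(0, 1, depth=1)   # cannot see the reward two steps away
deep = q_plan(0, 1, depth=2)      # finds it by simulating one more step
choice = best_action(0, depth=2)
```

This is also where the difference from a plain value network shows up: with `depth=1` the planner falls back on the value estimate alone, while deeper search uses the transition model.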
The authors find that their model is able to do better than both model-free and model-based methods on the tasks they tested on. In particular, they find that it has many of the benefits of a model that predicts full observations, but that the Value Prediction Network learns more quickly, and is more robust to stochastic environments where there's an inherent ceiling on how well a next-step observation prediction can work.
My main question coming into this paper was: how is this different from simply a value estimator like those used in DQN or A2C? My impression is that the difference comes from this model's ability to do explicit state simulation in latent space, and then predict a value off of the *latent* state, whereas a value network predicts value from observational state.
Object segmentation methods often produce imprecise results, because objects do not always correspond to homogeneous regions. This paper therefore segments images and videos into regions that are homogeneous in color and texture, using a method called JSEG. The assumptions about the input are:
* Image contains homogeneous color and texture regions
* Color is quantized
* There are distinct colors in neighboring regions
* Existing image segmentation methods need to estimate texture model parameters, and that estimation usually requires a homogeneous region to produce good results.
* Segmentation techniques based on motion exist, but they are unreliable on noisy data, the affine transformation is inadequate for close-up motion, and they produce errors in the presence of occlusion.
* The method consists of two stages: color quantization and spatial image segmentation.
* Colors are quantized into a small number of classes that can distinguish regions, weighting pixels individually and using the Lloyd algorithm.
* The quantized colors are assigned labels. These labels form a class-map, which also defines the composition of textures.
* Class-maps are labeled with three symbols: \*, + and o.
* The symbols indicate where segmentation boundaries should be drawn. For example, a class-map whose left half contains only + and whose other half contains a uniform mix of \* and o can be segmented into two regions: one of + symbols and one of \* and o symbols.
* The variance of the class-map is computed, and the value J is derived from the variance within each class and the variance between different classes.
* J is small when the color classes are uniformly distributed over the image and large otherwise.
* The definition of J therefore characterizes how the class labels are distributed and indicates where segmentation boundaries could be drawn.
* For a segmented image, J is computed within each region and averaged; minimizing this mean J is the criterion for segmenting an image into a given number of regions.
* In a good segmentation, the mean J is small, because each segmented region contains a small number of uniformly distributed color classes.
* The spatial segmentation algorithm has several stages: computing J values in each region, growing regions from seeds, and merging regions once the scale has exceeded the threshold. These are described as follows:
* Local J values are used because they indicate whether an area lies in the interior of a region or near a region boundary.
* Windows are used to detect regions at different sizes: large windows locate texture boundaries, and small windows locate color or intensity edges.
* Multiple window sizes are used; the smallest window is circular with a diameter of 9 pixels.
* To grow regions from seeds, the seeds are determined first: the mean and standard deviation of the local J values are computed, a threshold is set to the mean plus the standard deviation multiplied by a preset value, and a candidate is fixed as a seed once its area is larger than a predetermined size for the current window scale.
* Seeds are then grown by: removing holes (empty areas) from the fixed seeds; averaging local J values in the unsegmented region, where a pixel adjacent to only one seed is assigned to that seed's region; computing J values at a smaller scale; averaging local J values again in the remaining unsegmented region; and growing the regions at the smallest scale.
* Regions are merged based on color similarity. Because colors have been quantized into histogram bins, the distance between two regions' histograms is computed as a Euclidean distance in the CIE LUV color space.
* To merge regions, the distances are listed and the pairs with the smallest distance are joined. A new feature vector is then computed, and the process of merging and recomputing feature vectors repeats until the maximum threshold is reached.
* JSEG can be applied to video by using object motion as an indirect constraint for tracking and segmentation. It assumes that the video has been split into shots and that each shot is a continuous scene.
* The video is decomposed in the spatiotemporal domain, and consecutive frames are grouped to be segmented together.
* In this paper, 1000 frames are grouped and their colors quantized to generate class-maps.
* Regions in different frames whose color textures are close to each other are counted as one object.
* After seeds are fixed in the frames, tracking proceeds by assigning initial seeds, treating overlapping seeds as one, iteratively checking for overlapping seeds, and assigning a time duration to each object.
* To reduce the number of false merges, a tracking value of J is computed from the mean and standard deviation between two frames. It is small when the region is static and large otherwise.
* The running time of JSEG video segmentation is equal to that of applying image segmentation to the grouped frames.
* Overall, the parameters to adjust when using JSEG are the color quantization threshold, the number of image scales, and, for video segmentation, the object duration.
* In video segmentation, frames are typically grouped in sets of 10 to 15.
* The paper provides a new unsupervised method, JSEG, that segments objects in images and videos through color quantization and spatial segmentation.
* A criterion for evaluating the quality of an image segmentation is proposed.
* The final segmentation is obtained by dividing the image into regions based on seed areas from the J-image.
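The J value from the notes above can be sketched in a few lines. This assumes the usual JSEG definition, J = (S_T - S_W) / S_W, where S_T is the total scatter of all pixel positions around their overall mean and S_W is the sum of each class's scatter around its own mean position. The function name and the two toy class-maps are illustrative, and the sketch does not guard against S_W being zero.

```python
def j_value(class_map):
    """Compute J = (S_T - S_W) / S_W for a 2D grid of class labels."""
    points_by_class = {}
    for y, row in enumerate(class_map):
        for x, label in enumerate(row):
            points_by_class.setdefault(label, []).append((x, y))

    def scatter(points):
        # Sum of squared distances from the points to their mean position.
        mean_x = sum(x for x, _ in points) / len(points)
        mean_y = sum(y for _, y in points) / len(points)
        return sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in points)

    all_points = [p for pts in points_by_class.values() for p in pts]
    s_total = scatter(all_points)
    s_within = sum(scatter(pts) for pts in points_by_class.values())
    return (s_total - s_within) / s_within


# Two spatially separated classes: a boundary runs down the middle.
split_map = [["+", "+", "o", "o"] for _ in range(4)]

# The same two classes, uniformly interleaved: no boundary to find.
mixed_map = [["+" if (x + y) % 2 == 0 else "o" for x in range(4)] for y in range(4)]

j_split = j_value(split_map)
j_mixed = j_value(mixed_map)
```

The separated map gives a clearly larger J than the interleaved one, matching the rule that J is small for uniformly distributed classes and large otherwise.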