The Magenta group at Google is a consistent source of really interesting problems for machine learning to solve, in the vein of creative generation of art and music, as well as mathematically creative ways to solve those problems. In this paper, they tackle a new problem with some interesting model-structural implications: generating Bach chorales composed of polyphonic, multi-instrument arrangements. On one level, this is similar to music generation problems that have been studied before, in that generating a musically coherent sequence requires learning both local and larger-scale structure between time steps in the music sequence. However, an additional element here is that multiple instruments' notes depend on one another at a given time step, so, in addition to generating time steps conditional on one another, you ideally want to model certain notes in a given harmony conditional on the other notes already present there.

Understanding the specifics of the approach was one of those scenarios where the mathematical arguments were somewhat opaque, but the actual mechanical description of the model gave a lot of clarity. I find this frequently the case with machine learning, where there's a strange set of dual incentives between the engineering impulse toward designing an effective system, and the academic need to connect the approach to a more theoretical mathematical foundation.

The approach taken here has a lot in common with the autoregressive model structures used in PixelCNN or WaveNet. These are all based, theoretically speaking, on the autoregressive factorization of joint probability distributions: a joint distribution can be sampled from by sampling first from the prior over the first variable (or pixel, or wave value), then the second conditional on the first, then the third conditional on the first two, and so on.
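To make that factorization concrete, here is a minimal sketch of sample-one-variable-at-a-time generation. The `model` callable and its interface are my own assumptions for illustration (the paper's model is a convolutional network, not this toy): it takes the prefix sampled so far and returns a categorical distribution over the next value.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_autoregressive(model, length, n_values):
    """Draw a sequence via the chain rule: p(x) = prod_t p(x_t | x_<t).

    `model(prefix)` is assumed to return a probability vector over the
    next value given everything sampled so far.
    """
    seq = []
    for _ in range(length):
        probs = model(seq)                    # p(x_t | x_0 .. x_{t-1})
        seq.append(int(rng.choice(n_values, p=probs)))
    return seq

# Toy stand-in model: uniform over 4 pitches regardless of prefix.
toy_model = lambda prefix: np.full(4, 0.25)
sample = sample_autoregressive(toy_model, length=8, n_values=4)
```

The point is only the structure of the loop: each draw conditions on all earlier draws, which is the property PixelCNN and WaveNet approximate with finite receptive fields.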
In practice, autoregressive models don't necessarily condition on the *entire* previous input when generating a conditional distribution for a new point (for example, because they use a convolutional structure whose receptive field doesn't reach back through the entire previous sequence), but they are based on that idea. A unique aspect of this model is that, instead of defining one specific conditional dependence relationship (where value J is conditioned on values J-5 through J-1, or some such), the authors argue that they instead learn conditional relationships over any possible autoregressive ordering of both time steps and instrument IDs.

This is a bit of a strange idea that, like I mentioned, is simplified by going through the mechanics. The model works in a way strikingly similar to recent large-scale language modeling: for each sample, some random subset of the tokens is masked, and the model is asked to predict the masked values given the unmasked ones. An interesting nuance in this case is that the values to be masked are randomly sampled across both time step and instrument, such that in some cases you'll have a prior time step but no other instruments at your time step, in other cases other instruments at your time step but no prior time steps to work from, and so on. The model needs to flexibly use various kinds of local context to predict the notes that are masked. (As an aside, in addition to the actual values, the network is given the 0/1 mask itself, so it can better distinguish between "0 because no information" and "0 because in the actual data sample there wasn't a pitch here".) The paper refers to these unmasked points as "context points".

An interesting capacity this gives the model, and which the authors use as their sampling technique, is to create songs that are hybrids of existing chorales: randomly keep some chunks, drop out others, and use the model to interpolate through the missing bits.
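The masking scheme described above can be sketched as follows. This is a simplified assumption of the setup, not the paper's actual preprocessing code: I represent a chorale as an (instruments × time) grid of pitch indices, mask a random subset of cells jointly across both axes, and stack the 0/1 mask as an extra input channel so the network can tell "zero because masked" apart from "zero in the data".

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_example(pianoroll):
    """pianoroll: (instruments, time) integer array of pitch indices.

    Returns (model_input, target) where model_input has two channels:
    the masked-out values and the mask itself.
    """
    # 1 = visible context point, 0 = masked; sampled independently per
    # cell, so a cell can lose its time neighbors, its instrument
    # neighbors, or both.
    mask = rng.integers(0, 2, size=pianoroll.shape)
    visible = pianoroll * mask              # masked cells zeroed out
    model_input = np.stack([visible, mask], axis=0)
    target = pianoroll                      # loss taken where mask == 0
    return model_input, target

chorale = rng.integers(0, 48, size=(4, 16))   # 4 voices, 16 time steps
model_input, target = make_training_example(chorale)
```

Because the mask is drawn fresh for every training example, the model is implicitly trained on every possible "ordering" of which (instrument, time) cells are known and which must be predicted.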