Brain MRI segmentation using an adversarial training approach
55 T1-weighted brain MR images (35 adults and 20 elderly subjects) with their respective label maps.
1. The authors suggest an adversarial loss in addition to the traditional loss.
2. The authors compare 2 Generator (Segmentor) models - Fully convolutional and dilated networks.
Using dilated conv layers allows for a larger receptive field with fewer trainable weights (compared to the FCN option).
However, the authors claim the adversarial loss contributes more when applied to the FCN model.
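A minimal PyTorch-style sketch of such a combined objective (function names and the weight `lam` are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def segmentor_loss(seg_logits, labels, disc_logits_fake, lam=0.1):
    """Traditional segmentation loss plus an adversarial term.

    seg_logits:       (N, C, H, W) raw segmentor outputs
    labels:           (N, H, W) integer ground-truth label map
    disc_logits_fake: discriminator logits for the predicted label maps
    lam:              weight of the adversarial term (illustrative)
    """
    ce = F.cross_entropy(seg_logits, labels)
    # The segmentor is rewarded when the discriminator is fooled
    # into labelling its predictions as real.
    adv = F.binary_cross_entropy_with_logits(
        disc_logits_fake, torch.ones_like(disc_logits_fake))
    return ce + lam * adv
```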
Label map (semantic segmentation) to realistic image using GANs.
1. Coarse-to-fine generator
2. Multi-scale discriminator
3. Robust adversarial learning objective function
G1 - Global generator
G2 - Local enhancer
Both consist of:
1. convolutional front-end
2. set of residual blocks
3. transposed convolutional back-end

A semantic label map is passed sequentially through the 3 components.
Training proceeds in 3 stages:
1. Train standalone global generator
2. Freeze global generator weights, train local enhancer
3. Fine-tune all weights together
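A rough PyTorch sketch of how stages 2 and 3 differ (assuming `g1` and `g2` are the global generator and local enhancer modules; the learning rate is illustrative):

```python
import itertools
import torch

def make_stage_optimizer(g1, g2, stage, lr=2e-4):
    """Stage 2 freezes the global generator and trains only the
    local enhancer; stage 3 fine-tunes both together."""
    freeze_g1 = (stage == 2)
    for p in g1.parameters():
        p.requires_grad = not freeze_g1
    params = g2.parameters() if freeze_g1 else itertools.chain(
        g1.parameters(), g2.parameters())
    return torch.optim.Adam(params, lr=lr)
```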
Multi-scale discriminator
To capture global context while still operating at high resolution, several discriminators are applied at different image scales.
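A minimal sketch of the idea (not the exact pix2pixHD architecture): the same discriminator design is instantiated several times and fed progressively downsampled copies of the input.

```python
import torch.nn as nn

class MultiScaleDiscriminator(nn.Module):
    """Runs one discriminator per image scale (sketch)."""
    def __init__(self, make_disc, num_scales=3):
        super().__init__()
        self.discs = nn.ModuleList(make_disc() for _ in range(num_scales))
        self.down = nn.AvgPool2d(3, stride=2, padding=1)

    def forward(self, x):
        outs = []
        for disc in self.discs:
            outs.append(disc(x))
            x = self.down(x)  # halve the resolution for the next scale
        return outs
```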
Robust adversarial learning objective function
* Compare original and generated images in feature space at different scales.
* This is done to ensure more abstract resemblance, not just pixel-space resemblance.
* The discriminator itself is used as the feature extractor.
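A minimal sketch of such a feature-matching term, assuming the intermediate discriminator activations are already collected into lists:

```python
import torch.nn.functional as F

def feature_matching_loss(feats_real, feats_fake):
    """Sum of L1 distances between discriminator features of real
    and generated images, over layers/scales (sketch)."""
    loss = 0.0
    for fr, ff in zip(feats_real, feats_fake):
        loss = loss + F.l1_loss(ff, fr.detach())  # don't backprop into the "real" branch
    return loss
```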
1. New GAN training methodology - progressively growing from low resolution to high resolution, adding layers to the model during training.
2. When a new layer is introduced during training, it is gradually faded in using a coefficient (see the sketch below).
3. Increasing the variation of generated images by feeding a minibatch standard-deviation statistic into the discriminator.
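The fade-in (point 2) amounts to a simple linear blend; a minimal sketch, where `alpha` is ramped from 0 to 1 while training at the new resolution:

```python
def faded_output(old_path, new_path, alpha):
    """Blend a newly added high-res block into the network.

    old_path: output of the previous lower-res pathway, upsampled
    new_path: output of the freshly added layers
    alpha:    fade-in coefficient in [0, 1]
    """
    return (1.0 - alpha) * old_path + alpha * new_path
```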
Refine synthetically simulated images to look real
* Generative adversarial networks
1. **Refiner** - an FCN that refines a simulated image into a realistic-looking image
2. **Adversarial + Self regularization loss**
* **Adversarial loss** term = a CNN that classifies whether the image is refined or real
* **Self regularization** term = L1 distance between the refiner's output and the simulated input. The distance can be measured either in pixel space or in feature space (to preserve gaze direction, for example).
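A minimal PyTorch-style sketch of the refiner's objective with a pixel-space self-regularization term (the weight `lam` is illustrative):

```python
import torch
import torch.nn.functional as F

def refiner_loss(refined, simulated, disc_logits_refined, lam=0.5):
    """Adversarial + self-regularization loss for the refiner (sketch)."""
    # Push the discriminator to label refined images as real.
    adv = F.binary_cross_entropy_with_logits(
        disc_logits_refined, torch.ones_like(disc_logits_refined))
    # Stay close to the simulated input (pixel-space L1 variant).
    self_reg = F.l1_loss(refined, simulated)
    return adv + lam * self_reg
```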
* grayscale eye images
* depth sensor hand images
1. **Local adversarial loss** - the discriminator is applied to image patches, producing multiple "realness" metrics
2. **Discriminator with history** - the discriminator is also trained on refined images from earlier iterations, to prevent the refiner from drifting back to outputs the discriminator has already forgotten (see the buffer sketch below)
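A minimal sketch of such a history buffer (capacity and sampling policy are illustrative): part of each discriminator mini-batch is drawn from previously refined images.

```python
import random

class RefinedImageHistory:
    """Fixed-size pool of past refined images (sketch)."""
    def __init__(self, capacity=512):
        self.capacity = capacity
        self.pool = []

    def sample_and_update(self, new_images, k):
        """Return up to k historical images and store k new ones."""
        old = random.sample(self.pool, k) if len(self.pool) >= k else list(self.pool)
        for img in new_images[:k]:
            if len(self.pool) < self.capacity:
                self.pool.append(img)
            else:
                # Overwrite a random old entry once the pool is full.
                self.pool[random.randrange(self.capacity)] = img
        return old
```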
First published: 2017/08/03 (10 months ago) Abstract: MR-only radiotherapy treatment planning requires accurate MR-to-CT synthesis.
Current deep learning methods for MR-to-CT synthesis depend on pairwise aligned
MR and CT training images of the same patient. However, misalignment between
paired images could lead to errors in synthesized CT images. To overcome this,
we propose to train a generative adversarial network (GAN) with unpaired MR and
CT images. A GAN consisting of two synthesis convolutional neural networks
(CNNs) and two discriminator CNNs was trained with cycle consistency to
transform 2D brain MR image slices into 2D brain CT image slices and vice
versa. Brain MR and CT images of 24 patients were analyzed. A quantitative
evaluation showed that the model was able to synthesize CT images that closely
approximate reference CT images, and was able to outperform a GAN model trained
with paired MR and CT images.
Convert MR scans to CT scans.
Unpaired brain CT/MR images.
The dataset contains both CT and MR scans of the same patients taken on the same day.
The volumes are aligned using mutual information but still contain some minor local misalignments.
Train the following models:
1. Syn_ct: CNN: MR -> CT
2. Syn_mr: CNN: CT -> MR
3. Dis_ct: classify real and synthetic CT images (result of Syn_ct)
4. Dis_mr: classify real and synthetic MR images (Syn_mr(Syn_ct(MR image)) or Syn_mr(CT image))
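Besides the two usual GAN losses, a cycle-consistency term ties the two synthesis networks together; a minimal sketch (the weight `lam` is illustrative):

```python
import torch.nn.functional as F

def cycle_loss(mr, ct, syn_ct, syn_mr, lam=10.0):
    """Cycle-consistency term for the unpaired setup (sketch).

    syn_ct: network mapping MR -> CT
    syn_mr: network mapping CT -> MR
    """
    mr_rec = syn_mr(syn_ct(mr))  # MR -> CT -> MR
    ct_rec = syn_ct(syn_mr(ct))  # CT -> MR -> CT
    return lam * (F.l1_loss(mr_rec, mr) + F.l1_loss(ct_rec, ct))
```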
First published: 2017/07/17 (11 months ago) Abstract: We present a real-time method for synthesizing highly complex human motions
using a novel LSTM network training regime we call the auto-conditioned LSTM
(acLSTM). Recently, researchers have attempted to synthesize new motion by
using autoregressive techniques, but existing methods tend to freeze or diverge
after a couple of seconds due to an accumulation of errors that are fed back
into the network. Furthermore, such methods have only been shown to be reliable
for relatively simple human motions, such as walking or running. In contrast,
our approach can synthesize arbitrary motions with highly complex styles,
including dances or martial arts in addition to locomotion. The acLSTM is able
to accomplish this by explicitly accommodating for autoregressive noise
accumulation during training. Furthermore, the structure of the acLSTM is
modular and compatible with any other recurrent network architecture, and is
usable for tasks other than motion. Our work is the first to our knowledge that
demonstrates the ability to generate over 18,000 continuous frames (300
seconds) of new complex human motion w.r.t. different styles.
auto-conditioned LSTM - an LSTM that, during training, is fed the ground-truth input at only a fraction of the timesteps and its own previous output at the rest (a little bit similar to keyframes).
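A minimal sketch of the training-time rollout (the ground-truth/self-feed lengths are illustrative; `cell` is assumed to be a callable `(input, state) -> (output, state)`):

```python
def ac_lstm_rollout(cell, frames, state=None, gt_len=5, self_len=5):
    """Alternate between feeding ground-truth frames and the
    network's own outputs, so the model learns to recover from
    its accumulated errors (sketch)."""
    outputs, prev = [], frames[0]
    for t in range(len(frames) - 1):
        in_gt_phase = (t % (gt_len + self_len)) < gt_len
        inp = frames[t] if in_gt_phase else prev  # self-feed otherwise
        prev, state = cell(inp, state)
        outputs.append(prev)
    return outputs
```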
Video prediction with human subjects
Instead of the common approach of predicting directly in pixel space, use explicit knowledge of the human motion space to predict the future of the video.
1. VAE to model the possible future movements of humans in the pose space
2. Conditional GAN - use the pose information to predict the video in pixel space.
Predict human motion from static image
1. 2d pose sequence generator
2. convert 2d pose to 3d skeleton
3-step training strategy
1. Train a human 2d pose extractor using video annotated with 2d joint positions
2. 3d skeleton extractor: project mocap data to 2d and use it as ground truth for training the 2d->3d skeleton converter
3. Full network training
1. Penn Action - Annotated human pose in sports image sequences: bench_press, jumping_jacks, pull_ups...
2. MPII - human action videos with annotated single frame
3. Human3.6M - video, depth and mocap. Actions include: sitting, purchasing, waiting
On the following tasks:
1. 2D pose forecasting
2. 3D pose recovery
First published: 2017/07/04 (11 months ago) Abstract: This work make the first attempt to generate articulated human motion
sequence from a single image. On the one hand, we utilize paired inputs
including human skeleton information as motion embedding and a single human
image as appearance reference, to generate novel motion frames, based on the
conditional GAN infrastructure. On the other hand, a triplet loss is employed
to pursue appearance-smoothness between consecutive frames. As the proposed
framework is capable of jointly exploiting the image appearance space and
articulated/kinematic motion space, it generates realistic articulated motion
sequence, in contrast to most previous video generation methods which yield
blurred motion effects. We test our model on two human action datasets
including KTH and Human3.6M, and the proposed framework generates very
promising results on both datasets.
Video generation of human motion given:
1. Single appearance reference image
2. Skeleton motion sequence
* KTH - grayscale human actions
* Human3.6M - color multiview human actions
The authors try both Stack GAN and Siamese GAN.
The latter provides better results.
Isn't using a full sequence of human skeleton motion considered more than a "hint"?
Given an unconstrained image, estimate:
1. 3d pose of human skeleton
2. 3d body mesh
1. full body mesh extraction from image
2. improvement of state of the art
1. Leeds Sports
Consider the problem both bottom-up and top-down.
1. Bottom-up: the DeepCut CNN model estimates 2d joint positions in the image.
2. Top-down: a Skinned Multi-Person Linear model (SMPL) is fitted and projected onto the 2d joint positions and the image (see the reprojection sketch below).
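At the core of the top-down fit is a weighted 2D reprojection error; a minimal NumPy sketch (the real objective also includes pose, shape and interpenetration priors; `cam_proj` is an assumed camera projection):

```python
import numpy as np

def joint_reprojection_error(joints3d, cam_proj, joints2d, conf):
    """Weighted distance between projected model joints and the
    DeepCut 2D detections (sketch).

    joints3d: (J, 3) joint positions of the fitted body model
    cam_proj: function mapping (J, 3) -> (J, 2)
    joints2d: (J, 2) detected 2D joint positions
    conf:     (J,) detection confidences used as weights
    """
    diff = cam_proj(joints3d) - joints2d
    return np.sum(conf * np.sum(diff ** 2, axis=1))
```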
Text to image
* Images are more photo-realistic and higher resolution than those of previous methods
* Stacked generative model
2 stage process:
1. Text-to-image: generates a low-resolution image with primitive shape and color.
2. Low-to-high-res: using the low-res image and the text, generates a high-res image, adding details and sharpening the edges (sketched below).
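A minimal sketch of the two-stage pipeline (module interfaces are assumed, not the paper's exact API):

```python
def two_stage_generate(stage1_gen, stage2_gen, text_emb, z):
    """Stage-I: text + noise -> low-res image (primitive shape/color).
    Stage-II: low-res image + text -> high-res image (detail, edges)."""
    low_res = stage1_gen(text_emb, z)
    high_res = stage2_gen(low_res, text_emb)
    return low_res, high_res
```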
* CUB - Birds
* Oxford-102 - Flowers
* Is it possible the resulting images are replicas of images in the original dataset? To what extent does the model "hallucinate" new images?
Given a video of robot motion, predict future frames of the motion.
1. The authors assembled a new dataset of 59,000 robot interactions involving pushing motions.
2. Human3.6M - video, depth and mocap. Actions include: sitting, purchasing, waiting...
* Use LSTMs to "remember" previous frames.
* Predict 10 transformations of the previous frame (each approach represents the transformations differently).
* Predict a mask that determines which transformation is applied to each pixel (see the compositing sketch after the model list).
The authors suggest 3 models based on this approach:
1. Dynamic Neural Advection
2. Convolutional Dynamic Neural Advection
3. Spatial Transformer Predictors
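The three variants differ in how the transformations are parameterized, but share the masked compositing step; a minimal PyTorch-style sketch (ignoring details such as the static-background channel):

```python
def composite_predictions(transformed, masks):
    """Combine per-transformation predictions into one frame (sketch).

    transformed: list of M tensors (N, C, H, W), the previous frame
                 warped by each of the M predicted transformations
    masks:       (N, M, H, W), softmax-normalized over M, deciding
                 which transformation explains each pixel
    """
    out = 0.0
    for m, warped in enumerate(transformed):
        out = out + masks[:, m:m + 1] * warped  # broadcast over channels
    return out
```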
Predict frames of a video using 3 newly proposed and complementary methods:
1. Multi-scale CNN
2. Adversarial training
3. Image gradient difference loss
Generator:
* Input: several frames of video from the dataset
* Output: the next frame of the video
Discriminator:
* Input: the original frames and the last frame
* Output: whether the last frame is from the dataset or generated
Problem: edges of moving objects are still blurry.
Solution: image gradient difference loss (sketched below).
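A minimal PyTorch-style sketch of the gradient difference loss for (N, C, H, W) frames (the exponent `alpha` defaulting to 1 is an illustrative choice):

```python
import torch

def gradient_difference_loss(pred, target, alpha=1):
    """Penalize differences between spatial gradients of the predicted
    and target frames, sharpening edges that L2 alone leaves blurry."""
    dy_p = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs()
    dy_t = (target[:, :, 1:, :] - target[:, :, :-1, :]).abs()
    dx_p = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs()
    dx_t = (target[:, :, :, 1:] - target[:, :, :, :-1]).abs()
    return ((dy_p - dy_t).abs() ** alpha).mean() + \
           ((dx_p - dx_t).abs() ** alpha).mean()
```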