#### Motivation:

+ When sampling a clinical time series, missing values are ubiquitous because of factors such as the frequency of medical events (for example, when a blood test is performed).
+ Missing values can be very informative about the label (*informative missingness*).
+ The goal of the paper is to propose a deep learning model that **exploits the missingness patterns** to enhance its performance.

#### Time series notation:

Multivariate time series with $D$ variables of length $T$:

+ ${\bf X} = ({\bf x}_1, {\bf x}_2, \ldots, {\bf x}_T)^T \in \mathbb{R}^{T \times D}$.
+ ${\bf x}_t \in \mathbb{R}^{D}$ is the $t$th measurement of all variables.
+ $x_t^d$ is the $d$th component of ${\bf x}_t$.

Missing value information is incorporated using *masking* and *time interval* concepts.

+ Masking: says which of the entries are missing values.
    + Masking vector ${\bf m}_t \in \{0, 1\}^D$, $m_t^d = 1$ if $x_t^d$ exists and $m_t^d = 0$ if $x_t^d$ is missing.
+ Time interval: temporal pattern of 'non-missing' observations. Represented by timestamps $s_t$ and time intervals $\delta_t$ (the time elapsed since each variable's last observation).

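
The masking and time interval definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name `masking_and_intervals` and the use of `None` for a missing entry are assumptions made here, and the recurrence for $\delta_t^d$ (reset to the gap between timestamps after an observed step, keep accumulating after a missing one) is one reading of "time since last observation". The input values are those of the two-variable example in this summary, with one row per variable.

```python
def masking_and_intervals(x, s):
    """Compute masking and time interval matrices.

    x: list of D rows (one per variable), each with T entries;
       None marks a missing value (an assumption of this sketch).
    s: list of T timestamps.
    Returns (M, Delta), both laid out as D rows of T entries.
    """
    masks, deltas = [], []
    for row in x:
        m = [0 if v is None else 1 for v in row]
        d = [0.0] * len(row)  # the first time step always gets interval 0
        for t in range(1, len(row)):
            gap = s[t] - s[t - 1]
            # If the previous step was observed, the interval restarts at the gap;
            # otherwise it keeps accumulating on top of the previous interval.
            d[t] = gap if m[t - 1] == 1 else gap + d[t - 1]
        masks.append(m)
        deltas.append(d)
    return masks, deltas


NA = None
X = [[47, 49, NA, 40, NA, 43, 55],
     [NA, 15, 14, NA, NA, NA, 15]]
s = [0, 0.1, 0.6, 1.6, 2.2, 2.5, 3.1]

M, Delta = masking_and_intervals(X, s)
```

Up to floating-point rounding, `M` and `Delta` reproduce the masking matrix ${\bf M}$ and time interval matrix ${\bf \Delta}$ of the example.
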
Example: ${\bf X}$: input time series with 2 variables (one row per variable, one column per time step),

$$ {\bf X} = \begin{pmatrix} 47 & 49 & NA & 40 & NA & 43 & 55 \\ NA & 15 & 14 & NA & NA & NA & 15 \end{pmatrix} $$

with timestamps

$$ {\bf s} = \begin{pmatrix} 0 & 0.1 & 0.6 & 1.6 & 2.2 & 2.5 & 3.1 \end{pmatrix} $$

The masking vectors ${\bf m}_t$ and time intervals ${\delta}_t$ for each variable are computed and stacked, forming the masking matrix ${\bf M}$ and time interval matrix ${\bf \Delta}$:

$$ {\bf M} = \begin{pmatrix} 1 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 0 & 1 \end{pmatrix} $$

$$ {\bf \Delta} = \begin{pmatrix} 0 & 0.1 & 0.5 & 1.5 & 0.6 & 0.9 & 0.6 \\ 0 & 0.1 & 0.5 & 1.0 & 1.6 & 1.9 & 2.5 \end{pmatrix} $$

#### Proposed Architecture:

+ GRU (Gated Recurrent Units) with "trainable" decays:
    + Input decay: a missing variable decays from its last observed value toward its empirical mean as time passes, instead of simply being filled with the last value. The decay of each input variable is treated independently.
    + Hidden state decay: attempts to capture richer information from missing patterns. In this case the hidden state of the network at the previous time step is decayed.

#### Dataset:

+ MIMIC III v1.4: https://mimic.physionet.org/
    + Input events, Output events, Lab events, Prescription events
+ PhysioNet Challenge 2012: https://physionet.org/challenge/2012/

| | MIMIC III | PhysioNet 2012 |
|---|---|---|
| Number of samples ($N$) | 19714 | 4000 |
| Number of variables ($D$) | 99 | 33 |
| Mean number of time steps | 35.89 | 68.91 |
| Maximum number of time steps | 150 | 155 |
| Mean of variable missing rate | 0.9621 | 0.8225 |

#### Experiments and Results:

**Methodology**

+ Baselines:
    + Logistic Regression, SVM, Random Forest (PhysioNet sampled every 1h, MIMIC sampled every 2h). Forward / back-filling imputation. The masking vector is concatenated to the input to tell the models which inputs are imputed.
    + LSTM with mean imputation.
+ Variations of the proposed GRU model:
    + GRU-mean: impute the average of the training set.
    + GRU-forward: impute the last value.
    + GRU-simple: masking vectors and time intervals are inputs. There is no imputation.
    + GRU-D: proposed model.
+ Batch normalization and dropout (p = 0.5) applied to the regression layer.
+ Inputs normalized to have a mean of 0 and a standard deviation of 1.
+ Parameter optimization: early stopping on the validation set.

**Results**

Mortality Prediction (results in terms of AUC):

+ The proposed GRU-D outperforms other models on both datasets:
    + AUC = 0.8527 $\pm$ 0.003 for MIMIC-III and 0.8424 $\pm$ 0.012 for PhysioNet
+ Random Forest and SVM are the best non-RNN baselines.
+ GRU-simple was the best RNN variant.

Multitask Prediction (results in terms of AUC):

+ PhysioNet: mortality, length of stay < 3 days, surgery, cardiac condition.
+ MIMIC III: 20 diagnostic categories.
+ The proposed GRU-D outperforms other baseline models.

#### Positive Aspects:

+ Instead of performing simple mean imputation or using indicator functions, the paper exploits missing values and missing patterns in a novel way.
+ The paper performs lengthy comparisons against baselines.

#### Caveats:

+ Clinical mortality datasets usually have very high imbalance between classes. In such cases, AUC alone is not the best metric to evaluate. It would have been interesting to see the results in terms of precision/recall.
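
To make the caveat concrete, here is a small sketch of how a high AUC can coexist with poor precision when positives are rare. The scores are hypothetical and hand-built for illustration (1% positives); they are not taken from the paper or from either dataset, and the helper names `roc_auc` and `precision_recall` are introduced here only for this example.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Quadratic in the number of samples,
    which is fine for a toy example.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def precision_recall(pos_scores, neg_scores, threshold):
    """Precision and recall when predicting positive for score >= threshold.

    Assumes at least one sample is predicted positive.
    """
    tp = sum(p >= threshold for p in pos_scores)
    fp = sum(n >= threshold for n in neg_scores)
    fn = len(pos_scores) - tp
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical scores: 990 negatives spread over [0, 0.989], 10 positives scored high.
neg_scores = [i / 1000 for i in range(990)]
pos_scores = [0.9505] * 10

auc = roc_auc(pos_scores, neg_scores)
precision, recall = precision_recall(pos_scores, neg_scores, threshold=0.9505)
print(f"AUC={auc:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On this toy data the AUC is about 0.96, yet at the threshold that recovers every positive only about one in five flagged cases is a true positive. This is the kind of gap that precision/recall figures would expose.
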
