#### Goal + Use unsupervised deep learning to obtain a low-dimensional representation of a patient from EHR data. + A better representation will facilitate clinical prediction tasks. #### Architecture: + Patient EHR is obtained from the Hospital Data Warehouse: + demographic info + ICD-9 codes + medication, labs + clinical notes: free text + Use stacked denoising autoencoders (SDA) to obtain an abstract representation of the patient with lower dimensionality. ![Framework](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Miotto2016_framework.png?raw=true "Deep Patient Framework") #### Dataset: + Data Warehouse from Mount Sinai Hospital in NY. + All patient records that had a diagnosed disease (ICD-9 code) between 1980 and 2014 - approximately 1.2 million patients with 88.9 records/patient - were initially selected. + 1980-2013: training, 2014: test. *Data Cleaning*: + Diseases diagnosed in fewer than 10 patients in the training dataset were eliminated. + Diseases that could not be diagnosed through EHR labels were eliminated. Related to social behavior (HIV), fortuitous events (injuries, poisoning) or unspecific ('other cancers'). The final list contains 78 diseases. *Final version of the dataset (raw patient representation)*: + Training: 704,587 patients (to obtain deep features post SDA). + Validation: 5,000 patients (for the evaluation of the predictive model for diseases). + Test: 76,214 patients (for the evaluation of the predictive model for diseases). + 41072 columns - demographic info, ICD-9, medication, lab test, free text (LDA topic modeling dimension 300) + Very high dimensional but very sparse representation #### Results: *Stacked Denoisinig Autoencoders for low-dimensional patient representation*: + 3 layers of denoising autoencoders. + Each layer has 500 neurons. Patient is now represented by a dense vector of 500 features. + Inputs are normalized to lie in the [0, 1] interval. + Inputs in each of the layers have added noise at a ratio of 5% noise (masking noise corruption - value of these features is set to '0'). + Sigmoid activation function. *Classifiers for disease prediction*: + Random forest classifiers with 100 trees trained for each of the 78 diseases. *Baseline for comparison*: + PCA with 100 components, k-means with 500 clusters, GMM with 200 mixes and ICA with 100 components. (see Discussion) + RawFeat: original patient EHR features: sparse vector with 41072 features (~ 1% of non-zero entries). + Threshold to rank as "positive": 0.6 *Aggregate performance in predicting diseases*: ![Aggregate performance](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Miotto2016_results1.png?raw=true "Aggregate performance") + Comment: This result of F-Score = 0.181 implies a precision of 0.102 (let us assume a recall in the order of 80%), which means that with each correct diagnosis, the Deep Patient generates approximately 9 false alarms. *Performance for some particular diseases*: ![Disease results](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Miotto2016_results2.png?raw=true "Results for some diseases") #### Discussion: + DeepPatient *does not* use lab results in model building. Only the *frequency* at which the analysis is performed is taken into account. + Future enhancements: + Describe a patient with a temporal sequence of vectors s instead of summarizing all data in one vector. + Add other categories of EHR data, such as insurance details, family history and social behaviors. + Use PCA as a pre-processing step before SDA? + Caveat: the comparisons does not seem to be fair. If the autoencoder has dimension 500, the other baselines should also have dimension 500.