--- title: "planningML User Guide" author: "Xinying Fang" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{planningML User Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction Advances in automated document classification has led to identifying massive numbers of clinical concepts from handwritten clinical notes. These high dimensional clinical concepts can serve as highly informative predictors in building classification algorithms for identifying patients with different clinical conditions, commonly referred to as patient phenotyping. However, from a planning perspective, it is critical to ensure that enough data is available for the phenotyping algorithm to obtain a desired classification performance. This challenge in sample size planning is further exacerbated by the high dimensionality of the feature space and the inherent imbalance of the response class. Currently available sample size planning methods can be categorized into: (i) model-based approaches that predict the sample size required for achieving a desired accuracy using a linear machine learning classifier and (ii) learning curve-based approaches that fit an inverse power law curve to pilot data to extrapolate performance. We develop model-based approaches for imbalanced data with correlated features, deriving sample size formulas for performance metrics that are sensitive to class imbalance such as Area Under the receiver operating characteristic Curve (AUC) and Matthews Correlation Coefficient (MCC). This is done using a two-step approach where we first perform feature selection using the innovated High Criticism thresholding method, then determine the sample size by optimizing the two performance metrics. Further, we develop software in the form of an R package named ‘planningML’ and an R Shiny app to facilitate the convenient implementation of the developed model-based approaches and learning curve approaches for imbalanced data. We apply our methods to the problem of phenotyping rare outcomes using the MIMIC-III electronic health record database. We show that our developed methods which relate training data size and performance on AUC and MCC, can predict the true or observed performance from linear ML classifiers such as LASSO and SVM at different training data sizes. Therefore, in high-dimensional classification analysis with imbalanced data and correlated features, our approach can efficiently and accurately determine the sample size needed for machine-learning based classification. ## Step 1: High dimentional feature selection \item We consider the two-class classification problem with the high dimensional covariate vector $x \sim N(\mu,\Sigma)$ when $x \in C_1$ and $x \sim N(-\mu,\Sigma)$ when $x \in C_2$. \item Using LDA classify$x \in C_1$ when $2 \boldsymbol {x^T\Sigma^{-1}\mu>k}$ where $k>log(\frac{1-p_1}{p_1})$ \item Given the high-dimensionality of the feature space we employ a feature selection procedure to eliminate $(p-m)$ redundant covariates, hence making $\Sigma$ non-singular. \item Only the remaining $m$ features are included in the linear classifier. \item HCT method employs Higher Criticism Thresholding2 approach to select m important features out of the p total features \item iHCT method improves the HCT method in a transformed coordinate system. ## Step 2: Computation of sample size dependent performance metrics Once obtaining the $m$ important features, performance accuracy metrics that are sensitive to imbalanced class datasets are derived under both the DS and HCT method \item DS method Define $\theta=(\delta,\beta,\lambda,p,k)$ where $\delta$ denotes the minimum effect size, $m$ is the total number of important features, $p$ is the total number of features, $\beta$ is the power of the test, $\alpha$ is the level of the test, $\lambda$ is the maximum eigenvalue of the population correlation matrix and $$ \begin{align} \text{ AUC(n)} & = \int_{\kappa=-\infty}^{\infty} TPR(n)(\kappa)\text{d} (1 - TNR(n)(\kappa))d\kappa \nonumber\\ MCC &=\sqrt{PPV \times TPR \times NPV \times TNR}\nonumber\\ & - \sqrt{(1-PPV) \times (1-TPR) \times (1-TNR) \times (1-NPV)} \end{align} \\ $$ where $$ \begin{align*} TPR(n) &= E_w [P(w'x > \kappa| w,x \in C_1]\\ &\approx \Phi\bigg( \frac{\delta m (1- \beta) - \kappa}{\sigma \sqrt{\rho} \sqrt{m (1- \beta) + (p-m) \alpha }}\bigg)\nonumber\\ TNR(n) &= E_w [P(w'x < \kappa| w,x \in C_2]\\ &\approx \Phi\bigg( \frac{\kappa + \delta m (1- \beta)}{\sigma \sqrt{\rho} \sqrt{m (1- \beta) + (p-m) \alpha }}\bigg) \end{align*} $$ $$ \begin{align*} \text{PPV(n)} \approx \frac{\pi_1 \times \Phi\bigg( \frac{\delta m (1- \beta) - \kappa}{\sigma \sqrt{\rho} \sqrt{m (1- \beta) + (p-m) \alpha }}\bigg)}{\pi_1 \times \Phi\bigg( \frac{\delta m (1- \beta) - \kappa}{\sigma \sqrt{\rho} \sqrt{m (1- \beta) + (p-m) \alpha }}\bigg) + \pi_2 \times \bigg\{ 1- \Phi\bigg( \frac{\kappa + \delta m (1- \beta)}{\sigma \sqrt{\rho} \sqrt{m (1- \beta) + (p-m) \alpha }}\bigg) \bigg\}} \end{align*} $$ $$ \begin{align*} \text{NPV(n)} &= \approx \frac{\pi_2 \times \Phi\bigg( \frac{\delta m (1- \beta) + \kappa}{\sigma \sqrt{\rho} \sqrt{m (1- \beta) + (p-m) \alpha }}\bigg) }{\pi_2 \times \Phi\bigg( \frac{\delta m (1- \beta) + \kappa}{\sigma \sqrt{\rho} \sqrt{m (1- \beta) + (p-m) \alpha }}\bigg) + \pi_1 \times \Phi\bigg( \frac{\delta m (1- \beta) - \kappa}{\sigma \sqrt{\rho} \sqrt{m (1- \beta) + (p-m) \alpha }}\bigg) } \end{align*} $$ \item HCT method ## Example: Sample size determination for identifying patients with Depression in MIMIC-III database Clinical notes were extracted from the MIMIC-III database which contains de-identified clinical data of over 53,000 hospital admissions for adult patients to the intensive care units (ICU) at the Beth Israel Deaconess Medical Center from 2001 to 2012. This project uses a dataset of 833 patient discharge summaries restricted to frequently re-admitted patients (>3 in a single year), labeled with 15 clinical patient phenotypes believed to be associated with risk of recurrent readmission by domain experts. The example was focused on building a classifier for identifying patients with ‘Depression’, which had a prevalence of 29%. Clinical notes were transformed into Unified Medical Language System (UMLS) Concepts using MetaMap Lite. Each note represented as a vector of 10,109 Concept Unique Identifiers (CUIs). A pilot dataset comprising 135 samples was used to determine the optimal sample size. ```{r} ## load dataset pilot.data = readRDS(system.file("extdata", "pilotdata.rds", package = "planningML")) dim(pilot.data) ``` ```{r} x = pilot.data[,-ncol(pilot.data)] y = pilot.data$DEPRESSION ``` ```{r} head(x) ``` ```{r} y ``` ### Feature selection based on iHCT method ```{r} library(planningML) ``` ```{r eval=FALSE, include=TRUE} features = featureselection(x = x, y = y) ``` ```{r include=FALSE} features = readRDS(system.file("extdata", "features.rds", package = "planningML")) ``` ```{r} summary(features) ``` ### Sample size determination ```{r eval=FALSE, include=TRUE} output = samplesize(features=features, method="HCT", m=c(5,10,length(features$features)), effectsize=NULL, class.prob = NULL, totalnum_features = NULL, threshold=0.1, metric="MCC") ``` ```{r include=FALSE} output = readRDS(system.file("extdata", "output.rds", package = "planningML")) ``` ```{r} head(output$outtable) ``` The minimum sample sizes needed under different number of selected important features are: ```{r} summary(output) ``` A plot demonstrating the relationship between sample size the performance measurement metrics are demonstrated: ```{r} plot(output) ``` #### Skip the featureselection step for iHCT method For the iHCT method, if we have historical information and want to skip the featureselection step, then we can supply the number of important features (m), effect size (sample mean difference), prevalence of the event (class.prob), and total number of features (totalnum_features) and obtain an estimation of the optimal sample size. ```{r} effect_size = readRDS(system.file("extdata", "effectsize.rds", package = "planningML")) effect_size ``` ```{r warning=FALSE} output2 = samplesize(features = NULL, method="HCT", m=200, effectsize=effect_size, class.prob = 0.5, totalnum_features = 5000, threshold=0.1, metric="MCC") ``` ```{r} summary(output2) ``` ```{r} plot(output2) ``` ## Learning curve approximation method for imbalanced data The learning curve approximation method is demonstrated with the 2016 National Hospital Ambulatory Medical Care Survey (NHAMCS) data. The NHAMCS data examined factors associated with opioid prescriptions in the emergency department. The outcome variable is receipt of opioid prescription, and the primary variable of interest was the type of visit (dental/non-dental) with other features including age, gender, race/ethnicity, region, payer, day of the visit, and pain level. The input NHAMCS dataset has been pre-processed for convenient analysis. We will use the part of the NHAMCS dataset as the pilot data to build the learning curve. ```{r} pilotSet = readRDS(system.file("extdata", "pilotSet.rds", package = "planningML")) pilotY = readRDS(system.file("extdata", "pilotY.rds", package = "planningML")) ``` ```{r} dim(pilotSet) ``` ```{r} table(pilotY) ``` The pilot dataset has a prevalence of 0.105 ### Obtain fitted data to build the learning curve ```{r eval=FALSE, include=TRUE} nhamcs_rf_auc <- learningcurve_data(pilotSet, pilotY, method="rf", batchsize = 100, nfold=5, nrepeat=10, class.prob = 0.105, metric="AUC") ``` ```{r include=FALSE} nhamcs_rf_auc = readRDS(system.file("extdata", "nhamcs_rf_auc.rds", package = "planningML")) ``` ```{r} nhamcs_rf_auc ``` ### Fit learning curve With the learning curve dataset, we can fit the learning curve: ```{r warning=FALSE} lc_fit <- fit_learningcurve(nhamcs_rf_auc, testX=seq(10, 1500, 5), target=0.8) ``` ```{r} plot(lc_fit) ```