addressing to-dos, adding rmd files, updating vignettes

istallworthy · May 24, 2024 · 3fefc87 · 3fefc87
1 parent e55ba6a
commit 3fefc87
Show file tree

Hide file tree

Showing 20 changed files with 3,333 additions and 584 deletions.
diff --git a/.Rprofile b/.Rprofile
@@ -0,0 +1,13 @@
+# ~/.Rprofile
+if (interactive()) {
+  View <- function(x, title) {
+    if (is.data.frame(x) || is.matrix(x) || is.list(x)) {
+      utils::View(x, title)  # This should open in RStudio's viewer
+    } else {
+      print("Object is not viewable in RStudio Viewer")
+    }
+  }
+}
+
+
+
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -34,8 +34,8 @@ Imports:
     stats,
     survey,
     tinytable,
-    utils,
-    WeightIt
+    WeightIt,
+    utils
 Suggests: 
     CBPS,
     dbarts,

diff --git a/ExampleDataPrep.Rmd b/ExampleDataPrep.Rmd
@@ -0,0 +1,329 @@
+---
+title: "Data Preparation for using devMSMs with Longitudinal Data"
+output: html_document
+date: "2024-05-17"
+---
+The following notebook provides a guide to assist the user in preparingn their data for use with *devMSMs*.  
+
+Please review the accompanying manuscript for a full conceptual and practical introduction to MSMs in the context of developmental data. Please also see the vignettes on the *devMSMs* website for step-by-step guidance on the use of this code: https://istallworthy.github.io/devMSMs/index.html. 
+
+Headings denote accompanying website sections and steps. We suggest using the interactive outline tool (located above the Console) for ease of navigation.
+
+We advise users implement the appropriate data prepeartion steps in accordance with your starting data, with the goal of assigning to `data` one of the following wide data formats (see Figure 1) for use in the package:  
+
+* a single data frame of data in wide format with no missing data  
+
+* a mids object (output from mice::mice()) of data imputed in wide format  
+
+* a list of data imputed in wide format as data frames.  
+
+
+Users have several options for reading in data. They can begin this workflow with the following options:  
+
+* long data with missingness can can be formatted and converted to wide data (P1a) for imputation (P2)
+
+* wide with missingness can be formatted (P1b) before imputing (P2)
+
+* data already imputed in wide format can be read in as a list (P2). 
+
+* data with no missingness (P3)
+
+## Installation
+You will always need to install the *devMSMs helper* functions from Github (*devMSMsHelpers*).
+
+```{r}
+install.packages("devtools", quiet = TRUE)
+library(devtools)
+
+install_github("istallworthy/devMSMsHelpers", quiet = TRUE)
+library(devMSMsHelpers)
+
+install_github("istallworthy/devMSMs", quiet = TRUE)
+library(devMSMs, include.only = "initMSM")
+```
+
+
+# Specify Core Inputs
+
+```{r}
+# required 
+
+exposure = c("ESETA1.6", "ESETA1.15", "ESETA1.24", "ESETA1.35", "ESETA1.58") # wide format
+
+# outcome
+
+outcome <- "StrDif_Tot.58"
+
+
+# required
+
+ti_conf =  c("state", "BioDadInHH2", "PmAge2", "PmBlac2", "TcBlac2", "PmMrSt2", "PmEd2", "KFASTScr",
+             "RMomAgeU", "RHealth", "HomeOwnd", "SWghtLB", "SurpPreg", "SmokTotl", "DrnkFreq",
+             "peri_health", "caregiv_health", "gov_assist")
+
+tv_conf = c("SAAmylase.6","SAAmylase.15", "SAAmylase.24", 
+            "MDI.6", "MDI.15",                                            
+            "RHasSO.6", "RHasSO.15", "RHasSO.24","RHasSO.35", "RHasSO.58",                                         
+            "WndNbrhood.6","WndNbrhood.24", "WndNbrhood.35", "WndNbrhood.58",                                       
+            "IBRAttn.6", "IBRAttn.15", "IBRAttn.24",                                   
+            "B18Raw.6", "B18Raw.15", "B18Raw.24", "B18Raw.58",                                           
+            "HOMEETA1.6", "HOMEETA1.15", "HOMEETA1.24", "HOMEETA1.35", "HOMEETA1.58",                               
+            "InRatioCor.6", "InRatioCor.15", "InRatioCor.24", "InRatioCor.35", "InRatioCor.58",                         
+            "CORTB.6", "CORTB.15", "CORTB.24",                                                                  
+            "EARS_TJo.24", "EARS_TJo.35",                                        
+            "LESMnPos.24", "LESMnPos.35",                                  
+            "LESMnNeg.24", "LESMnNeg.35",       
+            "StrDif_Tot.35", "StrDif_Tot.58",    
+            "fscore.35", "fscore.58")
+
+home_dir = '/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/isa/'
+
+```
+
+
+# D1. Starting with a Single Long Data Frame
+Users beginning with a single data frame in long format (with or without missingness).
+
+## D1a. Format Long Data
+Users have the option of formatting their long data if their data columns are not yet labeled correctly using the `formatLongData()` helper function. 
+
+Read in long data and specify any factors and integers
+```{r}
+# add path to your long data file here if you want to begin with long data
+
+data_long <- read.csv("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/FLP_long_missing_unformatted.csv", 
+                      header = TRUE)
+
+# list factor variables in long format here
+
+factor_confounders <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",       
+                        "PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
+                        "RHasSO") 
+
+
+# list any integer variables in long format here
+
+integer_confounders <- c("KFASTScr", "PmEd2", "RMomAgeU", "SWghtLB", 
+                         "peri_health", "caregiv_health" , 
+                         "gov_assist", "B18Raw", "EARS_TJo", "MDI")
+```
+
+Format your long data
+```{r}
+data_long_f <- formatLongData(data = data_long, 
+                              exposure = exposure, 
+                              outcome = outcome, 
+                              sep = "\\.",
+                              time_var = "Tage", # list original time variable here if it's not "WAVE"
+                              id_var = "S_ID", # list original id variable here if it's not "ID"
+                              missing = -9999, # list missing value here
+                              factor_confounders = factor_confounders, # list factor variables in long format here
+                              integer_confounders = integer_confounders, # list any integer variables in long format here
+                              home_dir = home_dir, save.out = TRUE) 
+head(data_long_f)
+```
+
+Make sure to inspect your data at each step of the way to ensure formatting looks correct. 
+
+
+## D1b. Tranform Formatted Long Data to Wide
+Users with correctly formatted data in long format have the option of using the following code to transform their data into wide format, to proceed to using the package (if there is no missing data) or imputing (with < 20% missing data MAR). They may see some warnings about *NA* values.    
+
+```{r}
+# identify time-varying confounders 
+
+sep <- "\\." # time delimiter
+v <- sapply(strsplit(tv_conf[!grepl("\\:", tv_conf)], sep), head, 1)
+v <- c(v[!duplicated(v)], sapply(strsplit(exposure[1], sep), head, 1))
+
+
+# reshape data 
+
+library(stats)
+data_wide_f <- stats::reshape(data = data_long_f, 
+                            idvar = "ID", #list ID variable in your dataset
+                            v.names = v, 
+                            timevar = "WAVE", # list time variable in your long dataset
+                            times = c(6, 15, 24, 35, 58), # list all time points in your dataset
+                            direction = "wide")
+
+data_wide_f <- data_wide_f[, colSums(is.na(data_wide_f)) < nrow(data_wide_f)]
+
+head(data_wide_f)
+```
+
+
+
+# D2. Startingn with a Single Wide Data Frame
+Alternatively, we could start with a single data frame of wide data (with or without missingness).  
+
+## D2a. Format Wide Data
+Users can draw on the `formatWideData()` helper function to format their wide data.   
+
+Read in long data and specify any factors and integers
+```{r}
+
+# add path to your wide, formatted data file here 
+
+data_wide <- read.csv("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/Tutorial paper/merged_tutorial_filtered.csv", 
+                      header = TRUE)
+
+
+# list factor variables in wide format here
+
+factor_confounders <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",       
+                        "PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
+                        "RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58") 
+
+
+# list any integer variables in wide format here
+
+integer_confounders = c("KFASTScr", "PmEd2", "RMomAgeU", "SWghtLB", "peri_health", "caregiv_health" , 
+                        "gov_assist", "B18Raw.6", "B18Raw.15", "B18Raw.24", "B18Raw.58", 
+                        "EARS_TJo.24", "EARS_TJo.35", "MDI.6", "MDI.15") 
+```
+
+Format wide data
+```{r}
+data_wide_f <- formatWideData(data = data_wide, 
+                              exposure = exposure,
+                              outcome = outcome, 
+                              sep = "\\.",
+                              id_var = "ID", # list original id variable here if it's not "ID"
+                              missing = NA, # list missing value here
+                              factor_confounders = factor_confounders,
+                              integer_confounders = integer_confounders,
+                              home_dir = home_dir, save.out = TRUE) 
+
+head(data_wide_f)
+```
+
+
+# D3. Starting with Formatted Wide Data with Missingness
+Most data collected from humans will have some degree of missing data.  
+
+## D3a. Multiply Impute Formatted, Wide Data Frame Using Mice
+Users have the option of imputing their wide, formatted data using the *mice* package, for use with *devMSMs*.    
+
+Specify imputation parameters
+```{r}
+# optional seed for reproducibility 
+
+s <- NA 
+s <- 1234 # empirical example
+
+
+# optional; number of imputations (default is 5)
+
+m <- NA
+m <- 5 # empirical example
+
+
+# optional; provide an imputation method pmm, midastouch, sample, cart , rf (default)
+
+method <- NA
+method <- "rf" # empirical example
+
+
+# optional maximum iterations for imputation (default is 5)
+ 
+maxit <- NA
+maxit <- 5 # empirical example
+```
+
+Impute data
+```{r}
+set.seed(s)
+
+imputed_data <- imputeData(data = data_wide_f, 
+                           exposure = exposure, 
+                           outcome = outcome, 
+                           sep = "\\.",
+                           m = m, 
+                           method = method, 
+                           maxit = maxit, 
+                           para_proc = TRUE, 
+                           seed = s, 
+                           read_imps_from_file = FALSE, 
+                           home_dir = home_dir, 
+                           save.out = TRUE)
+```
+
+To use the package with imputed data
+```{r}
+
+data <- imputed_data
+
+
+# optional: extract first imputed dataset for inspection
+
+library(mice)
+summary(mice::complete(data, 1))
+```
+
+
+Alternatively, users could read in a saved mids object for use with *devMSMS*.    
+```{r}
+# place your .rds file in your home directory and change the name of file here
+
+imputed_data <- readRDS("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/FLP_wide_imputed_mids.rds") 
+
+
+data <- imputed_data
+
+# optional: extract first imputed dataset for inspection
+
+library(mice)
+summary(mice::complete(imputed_data, 1))
+anyNA(data) # make sure this is false, meaning no NAs
+```
+
+
+## D3b. Read in a List of Wide Imputed Data Saved Locally 
+Alternatively, if a user has imputed datasets already created using a program other than *mice*, they can read in, as a list, files saved locally as .csv files (labeled “1”:m) in a single folder. 
+
+```{r}
+# change this to match your local folder
+
+folder <- "/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/imputations/"
+
+
+# make sure pattern matches suffix of your data
+
+files <- list.files(folder, full.names = TRUE, pattern = "\\.csv") 
+
+```
+
+To use the package with a list of imputed data from above
+```{r}
+data <- lapply(files, function(file) {
+  imp_data <- read.csv(file)
+  imp_data
+})
+
+
+# check for character variables
+
+any(as.logical(unlist(lapply(data, function(x) { # if this is TRUE, run next lines
+  any(sapply(x, class) == "character") }))))
+names(data[[1]])[sapply(data[[1]], class) == "character"] # find names of any character variables
+
+# run this for each variable that needs char -> factor
+
+data <- lapply(data, function(x){
+  x[, "state"] <- factor(x[, "state"], labels = c(1, 0)) 
+})
+
+
+# make factors; list factor variables here
+
+factor_covars <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",       
+                   "PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
+                   "RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58")
+
+data <- lapply(data, function(x) {
+  x[, factor_covars] <- as.data.frame(lapply(x[, factor_covars], as.factor))
+  x })
+```
+
+Having read in their data, users should now have have assigned to `data` wide, complete data as: a data frame, mice object, or list of imputed data frames for use with the *devMSMs* functions.