Skip to content

Commit

Permalink
addressing to-dos, adding rmd files, updating vignettes
Browse files Browse the repository at this point in the history
  • Loading branch information
istallworthy committed May 24, 2024
1 parent e55ba6a commit 3fefc87
Show file tree
Hide file tree
Showing 20 changed files with 3,333 additions and 584 deletions.
13 changes: 13 additions & 0 deletions .Rprofile
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# ~/.Rprofile
if (interactive()) {
View <- function(x, title) {
if (is.data.frame(x) || is.matrix(x) || is.list(x)) {
utils::View(x, title) # This should open in RStudio's viewer
} else {
print("Object is not viewable in RStudio Viewer")
}
}
}



4 changes: 2 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -34,8 +34,8 @@ Imports:
stats,
survey,
tinytable,
utils,
WeightIt
WeightIt,
utils
Suggests:
CBPS,
dbarts,
Expand Down
329 changes: 329 additions & 0 deletions ExampleDataPrep.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,329 @@
---
title: "Data Preparation for using devMSMs with Longitudinal Data"
output: html_document
date: "2024-05-17"
---
The following notebook provides a guide to assist the user in preparingn their data for use with *devMSMs*.

Please review the accompanying manuscript for a full conceptual and practical introduction to MSMs in the context of developmental data. Please also see the vignettes on the *devMSMs* website for step-by-step guidance on the use of this code: https://istallworthy.github.io/devMSMs/index.html.

Headings denote accompanying website sections and steps. We suggest using the interactive outline tool (located above the Console) for ease of navigation.

We advise users implement the appropriate data prepeartion steps in accordance with your starting data, with the goal of assigning to `data` one of the following wide data formats (see Figure 1) for use in the package:

* a single data frame of data in wide format with no missing data

* a mids object (output from mice::mice()) of data imputed in wide format

* a list of data imputed in wide format as data frames.


Users have several options for reading in data. They can begin this workflow with the following options:

* long data with missingness can can be formatted and converted to wide data (P1a) for imputation (P2)

* wide with missingness can be formatted (P1b) before imputing (P2)

* data already imputed in wide format can be read in as a list (P2).

* data with no missingness (P3)

## Installation
You will always need to install the *devMSMs helper* functions from Github (*devMSMsHelpers*).

```{r}
install.packages("devtools", quiet = TRUE)
library(devtools)
install_github("istallworthy/devMSMsHelpers", quiet = TRUE)
library(devMSMsHelpers)
install_github("istallworthy/devMSMs", quiet = TRUE)
library(devMSMs, include.only = "initMSM")
```


# Specify Core Inputs

```{r}
# required
exposure = c("ESETA1.6", "ESETA1.15", "ESETA1.24", "ESETA1.35", "ESETA1.58") # wide format
# outcome
outcome <- "StrDif_Tot.58"
# required
ti_conf = c("state", "BioDadInHH2", "PmAge2", "PmBlac2", "TcBlac2", "PmMrSt2", "PmEd2", "KFASTScr",
"RMomAgeU", "RHealth", "HomeOwnd", "SWghtLB", "SurpPreg", "SmokTotl", "DrnkFreq",
"peri_health", "caregiv_health", "gov_assist")
tv_conf = c("SAAmylase.6","SAAmylase.15", "SAAmylase.24",
"MDI.6", "MDI.15",
"RHasSO.6", "RHasSO.15", "RHasSO.24","RHasSO.35", "RHasSO.58",
"WndNbrhood.6","WndNbrhood.24", "WndNbrhood.35", "WndNbrhood.58",
"IBRAttn.6", "IBRAttn.15", "IBRAttn.24",
"B18Raw.6", "B18Raw.15", "B18Raw.24", "B18Raw.58",
"HOMEETA1.6", "HOMEETA1.15", "HOMEETA1.24", "HOMEETA1.35", "HOMEETA1.58",
"InRatioCor.6", "InRatioCor.15", "InRatioCor.24", "InRatioCor.35", "InRatioCor.58",
"CORTB.6", "CORTB.15", "CORTB.24",
"EARS_TJo.24", "EARS_TJo.35",
"LESMnPos.24", "LESMnPos.35",
"LESMnNeg.24", "LESMnNeg.35",
"StrDif_Tot.35", "StrDif_Tot.58",
"fscore.35", "fscore.58")
home_dir = '/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/isa/'
```


# D1. Starting with a Single Long Data Frame
Users beginning with a single data frame in long format (with or without missingness).

## D1a. Format Long Data
Users have the option of formatting their long data if their data columns are not yet labeled correctly using the `formatLongData()` helper function.

Read in long data and specify any factors and integers
```{r}
# add path to your long data file here if you want to begin with long data
data_long <- read.csv("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/FLP_long_missing_unformatted.csv",
header = TRUE)
# list factor variables in long format here
factor_confounders <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",
"PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
"RHasSO")
# list any integer variables in long format here
integer_confounders <- c("KFASTScr", "PmEd2", "RMomAgeU", "SWghtLB",
"peri_health", "caregiv_health" ,
"gov_assist", "B18Raw", "EARS_TJo", "MDI")
```

Format your long data
```{r}
data_long_f <- formatLongData(data = data_long,
exposure = exposure,
outcome = outcome,
sep = "\\.",
time_var = "Tage", # list original time variable here if it's not "WAVE"
id_var = "S_ID", # list original id variable here if it's not "ID"
missing = -9999, # list missing value here
factor_confounders = factor_confounders, # list factor variables in long format here
integer_confounders = integer_confounders, # list any integer variables in long format here
home_dir = home_dir, save.out = TRUE)
head(data_long_f)
```

Make sure to inspect your data at each step of the way to ensure formatting looks correct.


## D1b. Tranform Formatted Long Data to Wide
Users with correctly formatted data in long format have the option of using the following code to transform their data into wide format, to proceed to using the package (if there is no missing data) or imputing (with < 20% missing data MAR). They may see some warnings about *NA* values.

```{r}
# identify time-varying confounders
sep <- "\\." # time delimiter
v <- sapply(strsplit(tv_conf[!grepl("\\:", tv_conf)], sep), head, 1)
v <- c(v[!duplicated(v)], sapply(strsplit(exposure[1], sep), head, 1))
# reshape data
library(stats)
data_wide_f <- stats::reshape(data = data_long_f,
idvar = "ID", #list ID variable in your dataset
v.names = v,
timevar = "WAVE", # list time variable in your long dataset
times = c(6, 15, 24, 35, 58), # list all time points in your dataset
direction = "wide")
data_wide_f <- data_wide_f[, colSums(is.na(data_wide_f)) < nrow(data_wide_f)]
head(data_wide_f)
```



# D2. Startingn with a Single Wide Data Frame
Alternatively, we could start with a single data frame of wide data (with or without missingness).

## D2a. Format Wide Data
Users can draw on the `formatWideData()` helper function to format their wide data.

Read in long data and specify any factors and integers
```{r}
# add path to your wide, formatted data file here
data_wide <- read.csv("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/Tutorial paper/merged_tutorial_filtered.csv",
header = TRUE)
# list factor variables in wide format here
factor_confounders <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",
"PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
"RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58")
# list any integer variables in wide format here
integer_confounders = c("KFASTScr", "PmEd2", "RMomAgeU", "SWghtLB", "peri_health", "caregiv_health" ,
"gov_assist", "B18Raw.6", "B18Raw.15", "B18Raw.24", "B18Raw.58",
"EARS_TJo.24", "EARS_TJo.35", "MDI.6", "MDI.15")
```

Format wide data
```{r}
data_wide_f <- formatWideData(data = data_wide,
exposure = exposure,
outcome = outcome,
sep = "\\.",
id_var = "ID", # list original id variable here if it's not "ID"
missing = NA, # list missing value here
factor_confounders = factor_confounders,
integer_confounders = integer_confounders,
home_dir = home_dir, save.out = TRUE)
head(data_wide_f)
```


# D3. Starting with Formatted Wide Data with Missingness
Most data collected from humans will have some degree of missing data.

## D3a. Multiply Impute Formatted, Wide Data Frame Using Mice
Users have the option of imputing their wide, formatted data using the *mice* package, for use with *devMSMs*.

Specify imputation parameters
```{r}
# optional seed for reproducibility
s <- NA
s <- 1234 # empirical example
# optional; number of imputations (default is 5)
m <- NA
m <- 5 # empirical example
# optional; provide an imputation method pmm, midastouch, sample, cart , rf (default)
method <- NA
method <- "rf" # empirical example
# optional maximum iterations for imputation (default is 5)
maxit <- NA
maxit <- 5 # empirical example
```

Impute data
```{r}
set.seed(s)
imputed_data <- imputeData(data = data_wide_f,
exposure = exposure,
outcome = outcome,
sep = "\\.",
m = m,
method = method,
maxit = maxit,
para_proc = TRUE,
seed = s,
read_imps_from_file = FALSE,
home_dir = home_dir,
save.out = TRUE)
```

To use the package with imputed data
```{r}
data <- imputed_data
# optional: extract first imputed dataset for inspection
library(mice)
summary(mice::complete(data, 1))
```


Alternatively, users could read in a saved mids object for use with *devMSMS*.
```{r}
# place your .rds file in your home directory and change the name of file here
imputed_data <- readRDS("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/FLP_wide_imputed_mids.rds")
data <- imputed_data
# optional: extract first imputed dataset for inspection
library(mice)
summary(mice::complete(imputed_data, 1))
anyNA(data) # make sure this is false, meaning no NAs
```


## D3b. Read in a List of Wide Imputed Data Saved Locally
Alternatively, if a user has imputed datasets already created using a program other than *mice*, they can read in, as a list, files saved locally as .csv files (labeled “1”:m) in a single folder.

```{r}
# change this to match your local folder
folder <- "/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/imputations/"
# make sure pattern matches suffix of your data
files <- list.files(folder, full.names = TRUE, pattern = "\\.csv")
```

To use the package with a list of imputed data from above
```{r}
data <- lapply(files, function(file) {
imp_data <- read.csv(file)
imp_data
})
# check for character variables
any(as.logical(unlist(lapply(data, function(x) { # if this is TRUE, run next lines
any(sapply(x, class) == "character") }))))
names(data[[1]])[sapply(data[[1]], class) == "character"] # find names of any character variables
# run this for each variable that needs char -> factor
data <- lapply(data, function(x){
x[, "state"] <- factor(x[, "state"], labels = c(1, 0))
})
# make factors; list factor variables here
factor_covars <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",
"PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
"RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58")
data <- lapply(data, function(x) {
x[, factor_covars] <- as.data.frame(lapply(x[, factor_covars], as.factor))
x })
```

Having read in their data, users should now have have assigned to `data` wide, complete data as: a data frame, mice object, or list of imputed data frames for use with the *devMSMs* functions.
Loading

0 comments on commit 3fefc87

Please sign in to comment.