Premier Analysis

Sean Browning, Scott Lee, Karen Wong

Deep models using EHR data.

Data

The required data is contained in a private submodule which can be recursively cloned when you pull down the repository (if you have access):

git clone --recursive [email protected]:cdcai/premier_analysis.git

Alternatively, if you've already cloned the repository and didn't pull the submodule, you can update it from the project root:

git submodule update --init --recursive

Data/Model flow

1. python/feature_extraction.py

The main pre-processing for all EHR data.

Input: Several EHR views in parquet format

Output:

A pandas data frame with separate columns of tokenized features by source table aggregated to either days, hours, or seconds from index date.
A feature dictionary containing the abbreviated feature names and their descriptions.

Steps:

Join all visit-level data across EHR views
Discretize continuous values by quantile (labs, vitals)
Abbreviate categorical values to tokens (billing data, qualitative lab result)
Aggregate all features to day, hour, or seconds from medical record index date

2. python/features_to_integers.py

Additional aggregation and conversion of dataframe of string tokens by time step to integer coded nested list.

Input: Pandas dataframe from feature_extraction.py

Output:

Nested list-of-lists with all features encoded into integer representation
Dictionary of tokens and their encoded values
dictionary of visits and their LOS and outcome indicators

Steps:

Join all feature columns together into single string aggregated by timestep in visit
Use sklearn CountVectorizer to encode tokens to integer representation (minimum document frequency: 5)
Save list-of-lists
Compute LOS, outcomes for each visit, save to dictionary

3. python/model_prep.py

Trim sequences to the appropriate lookback period according to which COVID visit is of interest.

Per the working definition, this is the first visit to occur. Also labels the sequences according to the outcome of interest.

Input:

List-of-lists from features_to_integers
Dictionary of visits and their LOS + outcome indicators from features_to_integers

Output:

A Tuple, (trimmed list-of-lists, list of labels)

Steps:

Using dictionary and global settings, compute appropriate lookback period (tp.find_cutpoints)
Trim sequences to the appropriate length determined, combine with labels to form tuple

4. Keras modelling / Hyperparameter tuning

Either python/model.py or python/hp_tune.py

Splits data into train, validation, and test generators and converts to RaggedTensor representation. This is then fed into the keras model as defined in the script and the metrics evaluated.

Name		Name	Last commit message	Last commit date
Latest commit History 586 Commits
R		R
data @ 9864942		data @ 9864942
notebooks		notebooks
papers		papers
python		python
renv		renv
targets		targets
.Rprofile		.Rprofile
.gitignore		.gitignore
.gitmodules		.gitmodules
.renvignore		.renvignore
README.md		README.md
importances.xlsx		importances.xlsx
lgr_coefs.xlsx		lgr_coefs.xlsx
renv.lock		renv.lock
requirements.txt		requirements.txt
run_all_models.sh		run_all_models.sh
run_models.sh		run_models.sh
run_preprocessing.sh		run_preprocessing.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Premier Analysis

Sean Browning, Scott Lee, Karen Wong

Data

Data/Model flow

1. python/feature_extraction.py

2. python/features_to_integers.py

3. python/model_prep.py

4. Keras modelling / Hyperparameter tuning

About

Releases

Packages

Contributors 6

Languages

cdcai/premier_analysis

Folders and files

Latest commit

History

Repository files navigation

Premier Analysis

Sean Browning, Scott Lee, Karen Wong

Data

Data/Model flow

1. python/feature_extraction.py

2. python/features_to_integers.py

3. python/model_prep.py

4. Keras modelling / Hyperparameter tuning

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 6

Languages

Packages