-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First pass at parameterizing script inputs used in tp53_analysis.sh #98
Changes from 6 commits
f2e9f2f
e1dafc9
30f237e
c2c4546
1c7e6c6
1162ed3
efba673
aabe59b
b026a75
cf7d774
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,20 +31,36 @@ | |
help='folder location of classifier file') | ||
parser.add_argument('-u', '--copy_number', action='store_true', | ||
help='Supplement Y matrix with copy number events') | ||
|
||
parser.add_argument('-x', '--x_matrix', default=None, | ||
help='Filename of features to use in model') | ||
parser.add_argument( '--filename_mut', default=None, | ||
help='Filename of sample/gene mutations to use in model') | ||
parser.add_argument( '--filename_mut_burden', default=None, | ||
help='Filename of sample mutation burden to use in model') | ||
parser.add_argument( '--filename_sample', default=None, | ||
help='Filename of patient/samples to use in model') | ||
parser.add_argument( '--filename_copy_loss', default=None, | ||
help='Filename of copy number loss') | ||
parser.add_argument( '--filename_copy_gain', default=None, | ||
help='Filename of copy number gain') | ||
parser.add_argument( '--filename_cancer_gene_classification', default=None, | ||
help='Filename of cancer gene classification table') | ||
|
||
args = parser.parse_args() | ||
|
||
# Load command arguments | ||
classifier_file = os.path.join(args.classifier, 'classifier_summary.txt') | ||
copy_number = args.copy_number | ||
|
||
# Load Constants | ||
rnaseq_file = os.path.join('data', 'pancan_rnaseq_freeze.tsv') | ||
mut_file = os.path.join('data', 'pancan_mutation_freeze.tsv') | ||
sample_freeze_file = os.path.join('data', 'sample_freeze.tsv') | ||
cancer_gene_file = os.path.join('data', 'vogelstein_cancergenes.tsv') | ||
copy_loss_file = os.path.join('data', 'copy_number_loss_status.tsv') | ||
copy_gain_file = os.path.join('data', 'copy_number_gain_status.tsv') | ||
mutation_burden_file = os.path.join('data', 'mutation_burden_freeze.tsv') | ||
rnaseq_file = args.x_matrix or os.path.join('data', 'pancan_rnaseq_freeze.tsv') | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I like this addition 👍 |
||
mut_file = args.filename_mut or os.path.join('data', 'pancan_mutation_freeze.tsv') | ||
sample_freeze_file = args.filename_sample or os.path.join('data', 'sample_freeze.tsv') | ||
cancer_gene_file = args.filename_cancer_gene_classification or os.path.join('data', 'vogelstein_cancergenes.tsv') | ||
copy_loss_file = args.filename_copy_loss or os.path.join('data', 'copy_number_loss_status.tsv') | ||
copy_gain_file = args.filename_copy_gain or os.path.join('data', 'copy_number_gain_status.tsv') | ||
mutation_burden_file = args.filename_mut_burden or os.path.join('data', 'mutation_burden_freeze.tsv') | ||
|
||
# Generate filenames to save output plots | ||
output_base_file = os.path.dirname(classifier_file) | ||
|
@@ -65,7 +81,11 @@ | |
if line[0] == 'Diseases:': | ||
diseases = line[1:] | ||
if line[0] == 'Coefficients:': | ||
coef_df = pd.read_table(line[1]) | ||
coef_df = line[1] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The way the summary file was constructed (e.g. here) implies that the coefficients for the specific classifier (of which to apply the weights using) was already generated. I guess I am confused about the specific scenario where the classifier file doesn't exist, but the summary file does. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If e.g. I use a job staging submission process on a cluster, the Job that writes On a subsequent job, that makes use of the classifier, it may stage files in at |
||
# If file DNE (e.g. jobs executed with file staging), fallback to local copy | ||
if not os.path.exists(coef_df): | ||
coef_df = os.path.join(args.classifier, 'classifier_coefficients.tsv') | ||
coef_df = pd.read_table(coef_df) | ||
|
||
# Subset matrix that indicates mutation status (y) | ||
common_genes = set(mutation_df.columns).intersection(genes) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,13 +12,57 @@ | |
# Output: | ||
# Two figures summarizing copy burden across TCGA Pan Can samples | ||
|
||
library(optparse) | ||
library(ggplot2) | ||
|
||
# parse options | ||
option_list = list( | ||
make_option( | ||
c("--version"), | ||
action = "store_true", | ||
default = FALSE, | ||
help = "Print version and exit" | ||
), | ||
make_option( | ||
c("--alt_folder"), | ||
action = "store", | ||
default = NA, | ||
type = 'character', | ||
help = "Classifier base folder" | ||
), | ||
make_option( | ||
c("--junctions_with_mutations"), | ||
action = "store", | ||
default = NA, | ||
type = 'character', | ||
help = "junctions_with_mutations.csv.gz" | ||
) | ||
) | ||
|
||
opt <-parse_args(OptionParser(option_list = option_list)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add space between |
||
|
||
if (opt$version){ | ||
# print version and exit | ||
cat(paste("PanCancer version", toString(packageVersion("pancancer"))), "\n") | ||
quit() | ||
} | ||
|
||
|
||
# Set File Names | ||
base_file <- file.path("classifiers", "TP53") | ||
if ( !is.na(opt$alt_folder) ){ | ||
base_file <- opt$alt_folder | ||
} else { | ||
base_file <- file.path("classifiers", "TP53") | ||
} | ||
burden_file <- file.path(base_file, "tables", "copy_burden_predictions.tsv") | ||
snaptron_file <- file.path("scripts", "snaptron", | ||
"junctions_with_mutations.csv.gz") | ||
if ( !is.na(opt$alt_folder) ){ | ||
snaptron_file <- opt$junctions_with_mutations | ||
} else { | ||
snaptron_file <- file.path("scripts", "snaptron", | ||
"junctions_with_mutations.csv.gz") | ||
} | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. remove extra space |
||
|
||
frac_alt_plot <- file.path(base_file, "figures", "fraction_altered_plot.pdf") | ||
violin_plot <- file.path(base_file, "figures", "seg_altered_violin_plot.pdf") | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,11 +22,13 @@ | |
parser = argparse.ArgumentParser() | ||
parser.add_argument('-c', '--classifier_folder', | ||
help='string of the location of classifier data') | ||
parser.add_argument( '--filename_burden', default=None, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. remove extra space |
||
help='Filename of burden') | ||
args = parser.parse_args() | ||
|
||
# Load command arguments | ||
pred_fild = os.path.join(args.classifier_folder, 'classifier_decisions.tsv') | ||
burden_file = os.path.join('data', 'seg_based_scores.tsv') | ||
burden_file = args.filename_burden or os.path.join('data', 'seg_based_scores.tsv') | ||
out_file = os.path.join(os.path.dirname(pred_fild), 'tables', | ||
'copy_burden_predictions.tsv') | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,6 +33,14 @@ | |
help='string of the genes to extract or genelist file') | ||
parser.add_argument('-c', '--copy_number', action='store_true', | ||
help='optional flag to include copy number info in map') | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. remove new line |
||
parser.add_argument( '--filename_copy_loss', default=None, | ||
help='Filename of copy number loss') | ||
parser.add_argument( '--filename_copy_gain', default=None, | ||
help='Filename of copy number gain') | ||
parser.add_argument( '--filename_raw_mut', default=None, | ||
help='Filename of raw mut MAF') | ||
|
||
args = parser.parse_args() | ||
|
||
scores = args.scores | ||
|
@@ -53,7 +61,7 @@ | |
os.makedirs(out_dir) | ||
|
||
out_file = os.path.join(out_dir, 'mutation_classification_scores.tsv') | ||
raw_mut_file = os.path.join('data', 'raw', 'mc3.v0.2.8.PUBLIC.maf') | ||
raw_mut_file = args.filename_raw_mut or os.path.join('data', 'raw', 'mc3.v0.2.8.PUBLIC.maf') | ||
|
||
pred_df = pd.read_table(prediction_file, index_col=0) | ||
mut_df = pd.read_table(raw_mut_file) | ||
|
@@ -71,8 +79,8 @@ | |
.fillna('Wild-Type') | ||
if copy_number: | ||
# Load Copy Number info | ||
copy_loss_file = os.path.join('data', 'copy_number_loss_status.tsv') | ||
copy_gain_file = os.path.join('data', 'copy_number_gain_status.tsv') | ||
copy_loss_file = args.filename_copy_loss or os.path.join('data', 'copy_number_loss_status.tsv') | ||
copy_gain_file = args.filename_copy_gain or os.path.join('data', 'copy_number_gain_status.tsv') | ||
|
||
copy_loss_df = pd.read_table(copy_loss_file, index_col=0) | ||
copy_gain_df = pd.read_table(copy_gain_file, index_col=0) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,6 +55,8 @@ | |
import pandas as pd | ||
import csv | ||
import argparse | ||
import matplotlib | ||
matplotlib.use('agg') | ||
import matplotlib.pyplot as plt | ||
import seaborn as sns | ||
|
||
|
@@ -69,6 +71,10 @@ | |
from tcga_util import get_args, get_threshold_metrics, integrate_copy_number | ||
from tcga_util import shuffle_columns | ||
|
||
RASOPATHY_GENES = set(['BRAF', 'CBL', 'HRAS', 'KRAS', 'MAP2K1', 'MAP2K2', 'NF1', | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add a comment above this pointing to how these genes were defined? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
'NRAS', 'PTPN11', 'RAF1', 'SHOC2', 'SOS1', 'SPRED1', 'RIT1']) | ||
|
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. remove new line |
||
# Load command arguments | ||
args = get_args() | ||
|
||
|
@@ -140,14 +146,16 @@ | |
'{}_summary.tsv'.format(alt_gene_base)) | ||
|
||
# Load Datasets | ||
x_as_raw = args.x_as_raw | ||
if x_matrix == 'raw': | ||
expr_file = os.path.join('data', 'pancan_rnaseq_freeze.tsv.gz') | ||
x_as_raw = True | ||
else: | ||
expr_file = x_matrix | ||
|
||
mut_file = os.path.join('data', 'pancan_mutation_freeze.tsv.gz') | ||
mut_burden_file = os.path.join('data', 'mutation_burden_freeze.tsv') | ||
sample_freeze_file = os.path.join('data', 'sample_freeze.tsv') | ||
mut_file = args.filename_mut or os.path.join('data', 'pancan_mutation_freeze.tsv.gz') | ||
mut_burden_file = args.filename_mut_burden or os.path.join('data', 'mutation_burden_freeze.tsv') | ||
sample_freeze_file = args.filename_sample or os.path.join('data', 'sample_freeze.tsv') | ||
|
||
rnaseq_full_df = pd.read_table(expr_file, index_col=0) | ||
mutation_df = pd.read_table(mut_file, index_col=0) | ||
|
@@ -156,7 +164,7 @@ | |
|
||
# Construct data for classifier | ||
common_genes = set(mutation_df.columns).intersection(genes) | ||
if x_matrix == 'raw': | ||
if x_as_raw: | ||
common_genes = list(common_genes.intersection(rnaseq_full_df.columns)) | ||
else: | ||
common_genes = list(common_genes) | ||
|
@@ -169,27 +177,31 @@ | |
'are {}'.format(missing_genes), category=Warning) | ||
|
||
if drop: | ||
if x_matrix == 'raw': | ||
if x_as_raw: | ||
rnaseq_full_df.drop(common_genes, axis=1, inplace=True) | ||
|
||
if drop_rasopathy: | ||
rasopathy_genes = set(['BRAF', 'CBL', 'HRAS', 'KRAS', 'MAP2K1', 'MAP2K2', | ||
'NF1', 'NRAS', 'PTPN11', 'RAF1', 'SHOC2', 'SOS1', | ||
'SPRED1', 'RIT1']) | ||
rasopathy_drop = list(rasopathy_genes.intersection(rnaseq_full_df.columns)) | ||
rnaseq_full_df.drop(rasopathy_drop, axis=1, inplace=True) | ||
# Drop rasopathy genes as defined by default | ||
drop_x_genes = RASOPATHY_GENES | ||
else: | ||
drop_x_genes = set() | ||
if args.drop_x_genes: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Needs updating. see later comment |
||
drop_x_genes = drop_x_genes.union( set( map( lambda x: x.strip(), args.drop_x_genes.split(',') ) ) ) | ||
if drop_x_genes: | ||
x_drop = list(drop_x_genes.intersection(rnaseq_full_df.columns)) | ||
rnaseq_full_df.drop(x_drop, axis=1, inplace=True) | ||
|
||
# Incorporate copy number for gene activation/inactivation | ||
if copy_number: | ||
# Load copy number matrices | ||
copy_loss_file = os.path.join('data', 'copy_number_loss_status.tsv.gz') | ||
copy_loss_file = args.filename_copy_loss or os.path.join('data', 'copy_number_loss_status.tsv.gz') | ||
copy_loss_df = pd.read_table(copy_loss_file, index_col=0) | ||
|
||
copy_gain_file = os.path.join('data', 'copy_number_gain_status.tsv.gz') | ||
copy_gain_file = args.filename_copy_gain or os.path.join('data', 'copy_number_gain_status.tsv.gz') | ||
copy_gain_df = pd.read_table(copy_gain_file, index_col=0) | ||
|
||
# Load cancer gene classification table | ||
vogel_file = os.path.join('data', 'vogelstein_cancergenes.tsv') | ||
vogel_file = args.filename_cancer_gene_classification or os.path.join('data', 'vogelstein_cancergenes.tsv') | ||
cancer_genes = pd.read_table(vogel_file) | ||
|
||
y = integrate_copy_number(y=y, cancer_genes_df=cancer_genes, | ||
|
@@ -243,7 +255,7 @@ | |
x_df = rnaseq_df.loc[y_df.index, :] | ||
|
||
# Subset x matrix to MAD genes and scale | ||
if x_matrix == 'raw': | ||
if x_as_raw: | ||
med_dev = pd.DataFrame(mad(x_df), index=x_df.columns) | ||
mad_genes = med_dev.sort_values(by=0, ascending=False)\ | ||
.iloc[0:num_features_kept].index.tolist() | ||
|
@@ -664,7 +676,7 @@ | |
|
||
# Process alternative x matrix | ||
x_alt_df = rnaseq_alt_df.loc[y_alt_df.index, :] | ||
if x_matrix == 'raw': | ||
if x_as_raw: | ||
x_alt_df = x_alt_df.loc[:, mad_genes] | ||
|
||
x_alt_df_update = pd.DataFrame(fitted_scaler.transform(x_alt_df), | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,11 +11,49 @@ | |
# wildtype samples. Also a fishers exact test comparing enrichment of 200 bp | ||
# exon 4 truncation in mutated samples versus wildtype | ||
|
||
library(optparse) | ||
library(RColorBrewer) | ||
|
||
# parse options | ||
option_list = list( | ||
make_option( | ||
c("--version"), | ||
action = "store_true", | ||
default = FALSE, | ||
help = "Print version and exit" | ||
), | ||
make_option( | ||
c("--snaptron_file"), | ||
action = "store", | ||
default = NA, | ||
type = 'character', | ||
help = "SNAPTRON junctions with mutations file" | ||
), | ||
make_option( | ||
c("--output_folder"), | ||
action = "store", | ||
default = '.', | ||
type = 'character', | ||
help = "Folder to store outputs" | ||
) | ||
) | ||
|
||
opt <-parse_args(OptionParser(option_list = option_list)) | ||
|
||
if (opt$version){ | ||
# print version and exit | ||
cat(paste("PanCancer version", toString(packageVersion("pancancer"))), "\n") | ||
quit() | ||
} | ||
|
||
set.seed(123) | ||
|
||
snaptron_file <- "tp53_junctions_with_mutations.csv.gz" | ||
if ( !is.na(opt$snaptron_file) ){ | ||
snaptron_file <- opt$snaptron_file | ||
} else { | ||
snaptron_file <- "tp53_junctions_with_mutations.csv.gz" | ||
} | ||
|
||
junc_df <- readr::read_csv(snaptron_file) | ||
junc_df <- junc_df[, -1] | ||
junc_df <- junc_df[!duplicated(junc_df), ] | ||
|
@@ -92,7 +130,7 @@ plot_exon_exon_junc <- function(exon_df, plot_range = x_range, row_add = 1.85) { | |
} | ||
|
||
# Plot of c.375G>T mutations | ||
pdf("exon-exon_junctions_GtoT.pdf", height = 5, width = 5) | ||
pdf(file.path(opt$output_folder, "exon-exon_junctions_GtoT.pdf"), height = 5, width = 5) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you set the file name before inputing into |
||
plot_exon_exon_junc(exon_df = silent_mutations) | ||
text(x = 7676325, y = 1.9, "Probability", font = 3, cex = 0.5) | ||
text(x = 7675237, y = 2, "Exon 5", font = 3, cex = 0.5) | ||
|
@@ -106,7 +144,7 @@ segments(x = 7675237, x0 = 7675237, y = -45, y0 = 1, col = "black", | |
dev.off() | ||
|
||
# Plots of WT samples | ||
pdf("exon-exon_junctions_wt_random.pdf", height = 5, width = 5) | ||
pdf(file.path(opt$output_folder, "exon-exon_junctions_wt_random.pdf"), height = 5, width = 5) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. here too |
||
plot_exon_exon_junc(wt_junc) | ||
text(x = 7676325, y = 1.9, "Probability", font = 3, cex = 0.5) | ||
text(x = 7675237, y = 2, "Exon 5", font = 3, cex = 0.5) | ||
|
@@ -120,7 +158,7 @@ segments(x = 7675237, x0 = 7675237, y = -45, y0 = 1, col = "black", | |
dev.off() | ||
|
||
# Plot of c.375G>A mutations | ||
pdf("exon-exon_junctions_GtoA.pdf", height = 5, width = 5) | ||
pdf(file.path(opt$output_folder, "exon-exon_junctions_GtoA.pdf"), height = 5, width = 5) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. here too |
||
plot_exon_exon_junc(exon_df = silent_mut_a, row_add = 2.5) | ||
text(x = 7676325, y = 1.9, "Probability", font = 3, cex = 0.5) | ||
text(x = 7675237, y = 2, "Exon 5", font = 3, cex = 0.5) | ||
|
@@ -149,7 +187,7 @@ total_silent <- length(unique(c(silent_mutations$tcga_id, | |
silent_mut_a$tcga_id))) | ||
|
||
f_test <- data.frame(c(silent_trunc, total_silent), c(wt_trunc, total_wt)) | ||
sink("fishers_exon4_truncation_enrichment_in_c375G-T_mutations.txt") | ||
sink(file.path(opt$output_folder, "fishers_exon4_truncation_enrichment_in_c375G-T_mutations.txt")) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. here too. Maybe we can use a sequence of lines in the beginning of this file that defines all output names |
||
print(f_test) | ||
fisher.test(f_test) | ||
sink() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove leading space character in lines 37 - 48? (between
(
and'
)