Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pass at parameterizing script inputs used in tp53_analysis.sh #98

Closed
wants to merge 10 commits into from
36 changes: 28 additions & 8 deletions scripts/apply_weights.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,20 +31,36 @@
help='folder location of classifier file')
parser.add_argument('-u', '--copy_number', action='store_true',
help='Supplement Y matrix with copy number events')

parser.add_argument('-x', '--x_matrix', default=None,
help='Filename of features to use in model')
parser.add_argument( '--filename_mut', default=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove leading space character in lines 37 - 48? (between ( and ')

help='Filename of sample/gene mutations to use in model')
parser.add_argument( '--filename_mut_burden', default=None,
help='Filename of sample mutation burden to use in model')
parser.add_argument( '--filename_sample', default=None,
help='Filename of patient/samples to use in model')
parser.add_argument( '--filename_copy_loss', default=None,
help='Filename of copy number loss')
parser.add_argument( '--filename_copy_gain', default=None,
help='Filename of copy number gain')
parser.add_argument( '--filename_cancer_gene_classification', default=None,
help='Filename of cancer gene classification table')

args = parser.parse_args()

# Load command arguments
classifier_file = os.path.join(args.classifier, 'classifier_summary.txt')
copy_number = args.copy_number

# Load Constants
rnaseq_file = os.path.join('data', 'pancan_rnaseq_freeze.tsv')
mut_file = os.path.join('data', 'pancan_mutation_freeze.tsv')
sample_freeze_file = os.path.join('data', 'sample_freeze.tsv')
cancer_gene_file = os.path.join('data', 'vogelstein_cancergenes.tsv')
copy_loss_file = os.path.join('data', 'copy_number_loss_status.tsv')
copy_gain_file = os.path.join('data', 'copy_number_gain_status.tsv')
mutation_burden_file = os.path.join('data', 'mutation_burden_freeze.tsv')
rnaseq_file = args.x_matrix or os.path.join('data', 'pancan_rnaseq_freeze.tsv')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this addition 👍

mut_file = args.filename_mut or os.path.join('data', 'pancan_mutation_freeze.tsv')
sample_freeze_file = args.filename_sample or os.path.join('data', 'sample_freeze.tsv')
cancer_gene_file = args.filename_cancer_gene_classification or os.path.join('data', 'vogelstein_cancergenes.tsv')
copy_loss_file = args.filename_copy_loss or os.path.join('data', 'copy_number_loss_status.tsv')
copy_gain_file = args.filename_copy_gain or os.path.join('data', 'copy_number_gain_status.tsv')
mutation_burden_file = args.filename_mut_burden or os.path.join('data', 'mutation_burden_freeze.tsv')

# Generate filenames to save output plots
output_base_file = os.path.dirname(classifier_file)
Expand All @@ -65,7 +81,11 @@
if line[0] == 'Diseases:':
diseases = line[1:]
if line[0] == 'Coefficients:':
coef_df = pd.read_table(line[1])
coef_df = line[1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the summary file was constructed (e.g. here) implies that the coefficients for the specific classifier (of which to apply the weights using) was already generated.

I guess I am confused about the specific scenario where the classifier file doesn't exist, but the summary file does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If e.g. I use a job staging submission process on a cluster, the Job that writes classifier_summary.txt might do so in a job/node specific directory /a/b/c/, which would set the value for Coefficients to /a/b/c/classifier_coefficients.tsv in the classifier summary file. The written directories and files, can then be staged back out to a persistent directory, maybe /x/y/z/.

On a subsequent job, that makes use of the classifier, it may stage files in at /q/r/s/ (or use /x/y/z/ directly), so /q/r/s/classifier_summary.txt (loaded by user provided file path) has a Coefficients value that points to /a/b/c/classifier_coefficients.tsv, but /a/b/c does not exist, the file is actually at /q/r/s/classifier_coefficients.tsv.

# If file DNE (e.g. jobs executed with file staging), fallback to local copy
if not os.path.exists(coef_df):
coef_df = os.path.join(args.classifier, 'classifier_coefficients.tsv')
coef_df = pd.read_table(coef_df)

# Subset matrix that indicates mutation status (y)
common_genes = set(mutation_df.columns).intersection(genes)
Expand Down
50 changes: 47 additions & 3 deletions scripts/copy_burden_figures.R
Original file line number Diff line number Diff line change
Expand Up @@ -12,13 +12,57 @@
# Output:
# Two figures summarizing copy burden across TCGA Pan Can samples

library(optparse)
library(ggplot2)

# parse options
option_list = list(
make_option(
c("--version"),
action = "store_true",
default = FALSE,
help = "Print version and exit"
),
make_option(
c("--alt_folder"),
action = "store",
default = NA,
type = 'character',
help = "Classifier base folder"
),
make_option(
c("--junctions_with_mutations"),
action = "store",
default = NA,
type = 'character',
help = "junctions_with_mutations.csv.gz"
)
)

opt <-parse_args(OptionParser(option_list = option_list))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add space between <- and parse_args(?


if (opt$version){
# print version and exit
cat(paste("PanCancer version", toString(packageVersion("pancancer"))), "\n")
quit()
}


# Set File Names
base_file <- file.path("classifiers", "TP53")
if ( !is.na(opt$alt_folder) ){
base_file <- opt$alt_folder
} else {
base_file <- file.path("classifiers", "TP53")
}
burden_file <- file.path(base_file, "tables", "copy_burden_predictions.tsv")
snaptron_file <- file.path("scripts", "snaptron",
"junctions_with_mutations.csv.gz")
if ( !is.na(opt$alt_folder) ){
snaptron_file <- opt$junctions_with_mutations
} else {
snaptron_file <- file.path("scripts", "snaptron",
"junctions_with_mutations.csv.gz")
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra space


frac_alt_plot <- file.path(base_file, "figures", "fraction_altered_plot.pdf")
violin_plot <- file.path(base_file, "figures", "seg_altered_violin_plot.pdf")

Expand Down
4 changes: 3 additions & 1 deletion scripts/copy_burden_merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,11 +22,13 @@
parser = argparse.ArgumentParser()
parser.add_argument('-c', '--classifier_folder',
help='string of the location of classifier data')
parser.add_argument( '--filename_burden', default=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra space

help='Filename of burden')
args = parser.parse_args()

# Load command arguments
pred_fild = os.path.join(args.classifier_folder, 'classifier_decisions.tsv')
burden_file = os.path.join('data', 'seg_based_scores.tsv')
burden_file = args.filename_burden or os.path.join('data', 'seg_based_scores.tsv')
out_file = os.path.join(os.path.dirname(pred_fild), 'tables',
'copy_burden_predictions.tsv')

Expand Down
14 changes: 11 additions & 3 deletions scripts/map_mutation_class.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,14 @@
help='string of the genes to extract or genelist file')
parser.add_argument('-c', '--copy_number', action='store_true',
help='optional flag to include copy number info in map')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove new line

parser.add_argument( '--filename_copy_loss', default=None,
help='Filename of copy number loss')
parser.add_argument( '--filename_copy_gain', default=None,
help='Filename of copy number gain')
parser.add_argument( '--filename_raw_mut', default=None,
help='Filename of raw mut MAF')

args = parser.parse_args()

scores = args.scores
Expand All @@ -53,7 +61,7 @@
os.makedirs(out_dir)

out_file = os.path.join(out_dir, 'mutation_classification_scores.tsv')
raw_mut_file = os.path.join('data', 'raw', 'mc3.v0.2.8.PUBLIC.maf')
raw_mut_file = args.filename_raw_mut or os.path.join('data', 'raw', 'mc3.v0.2.8.PUBLIC.maf')

pred_df = pd.read_table(prediction_file, index_col=0)
mut_df = pd.read_table(raw_mut_file)
Expand All @@ -71,8 +79,8 @@
.fillna('Wild-Type')
if copy_number:
# Load Copy Number info
copy_loss_file = os.path.join('data', 'copy_number_loss_status.tsv')
copy_gain_file = os.path.join('data', 'copy_number_gain_status.tsv')
copy_loss_file = args.filename_copy_loss or os.path.join('data', 'copy_number_loss_status.tsv')
copy_gain_file = args.filename_copy_gain or os.path.join('data', 'copy_number_gain_status.tsv')

copy_loss_df = pd.read_table(copy_loss_file, index_col=0)
copy_gain_df = pd.read_table(copy_gain_file, index_col=0)
Expand Down
42 changes: 27 additions & 15 deletions scripts/pancancer_classifier.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,8 @@
import pandas as pd
import csv
import argparse
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import seaborn as sns

Expand All @@ -69,6 +71,10 @@
from tcga_util import get_args, get_threshold_metrics, integrate_copy_number
from tcga_util import shuffle_columns

RASOPATHY_GENES = set(['BRAF', 'CBL', 'HRAS', 'KRAS', 'MAP2K1', 'MAP2K2', 'NF1',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment above this pointing to how these genes were defined?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'NRAS', 'PTPN11', 'RAF1', 'SHOC2', 'SOS1', 'SPRED1', 'RIT1'])


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove new line

# Load command arguments
args = get_args()

Expand Down Expand Up @@ -140,14 +146,16 @@
'{}_summary.tsv'.format(alt_gene_base))

# Load Datasets
x_as_raw = args.x_as_raw
if x_matrix == 'raw':
expr_file = os.path.join('data', 'pancan_rnaseq_freeze.tsv.gz')
x_as_raw = True
else:
expr_file = x_matrix

mut_file = os.path.join('data', 'pancan_mutation_freeze.tsv.gz')
mut_burden_file = os.path.join('data', 'mutation_burden_freeze.tsv')
sample_freeze_file = os.path.join('data', 'sample_freeze.tsv')
mut_file = args.filename_mut or os.path.join('data', 'pancan_mutation_freeze.tsv.gz')
mut_burden_file = args.filename_mut_burden or os.path.join('data', 'mutation_burden_freeze.tsv')
sample_freeze_file = args.filename_sample or os.path.join('data', 'sample_freeze.tsv')

rnaseq_full_df = pd.read_table(expr_file, index_col=0)
mutation_df = pd.read_table(mut_file, index_col=0)
Expand All @@ -156,7 +164,7 @@

# Construct data for classifier
common_genes = set(mutation_df.columns).intersection(genes)
if x_matrix == 'raw':
if x_as_raw:
common_genes = list(common_genes.intersection(rnaseq_full_df.columns))
else:
common_genes = list(common_genes)
Expand All @@ -169,27 +177,31 @@
'are {}'.format(missing_genes), category=Warning)

if drop:
if x_matrix == 'raw':
if x_as_raw:
rnaseq_full_df.drop(common_genes, axis=1, inplace=True)

if drop_rasopathy:
rasopathy_genes = set(['BRAF', 'CBL', 'HRAS', 'KRAS', 'MAP2K1', 'MAP2K2',
'NF1', 'NRAS', 'PTPN11', 'RAF1', 'SHOC2', 'SOS1',
'SPRED1', 'RIT1'])
rasopathy_drop = list(rasopathy_genes.intersection(rnaseq_full_df.columns))
rnaseq_full_df.drop(rasopathy_drop, axis=1, inplace=True)
# Drop rasopathy genes as defined by default
drop_x_genes = RASOPATHY_GENES
else:
drop_x_genes = set()
if args.drop_x_genes:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs updating. see later comment

drop_x_genes = drop_x_genes.union( set( map( lambda x: x.strip(), args.drop_x_genes.split(',') ) ) )
if drop_x_genes:
x_drop = list(drop_x_genes.intersection(rnaseq_full_df.columns))
rnaseq_full_df.drop(x_drop, axis=1, inplace=True)

# Incorporate copy number for gene activation/inactivation
if copy_number:
# Load copy number matrices
copy_loss_file = os.path.join('data', 'copy_number_loss_status.tsv.gz')
copy_loss_file = args.filename_copy_loss or os.path.join('data', 'copy_number_loss_status.tsv.gz')
copy_loss_df = pd.read_table(copy_loss_file, index_col=0)

copy_gain_file = os.path.join('data', 'copy_number_gain_status.tsv.gz')
copy_gain_file = args.filename_copy_gain or os.path.join('data', 'copy_number_gain_status.tsv.gz')
copy_gain_df = pd.read_table(copy_gain_file, index_col=0)

# Load cancer gene classification table
vogel_file = os.path.join('data', 'vogelstein_cancergenes.tsv')
vogel_file = args.filename_cancer_gene_classification or os.path.join('data', 'vogelstein_cancergenes.tsv')
cancer_genes = pd.read_table(vogel_file)

y = integrate_copy_number(y=y, cancer_genes_df=cancer_genes,
Expand Down Expand Up @@ -243,7 +255,7 @@
x_df = rnaseq_df.loc[y_df.index, :]

# Subset x matrix to MAD genes and scale
if x_matrix == 'raw':
if x_as_raw:
med_dev = pd.DataFrame(mad(x_df), index=x_df.columns)
mad_genes = med_dev.sort_values(by=0, ascending=False)\
.iloc[0:num_features_kept].index.tolist()
Expand Down Expand Up @@ -664,7 +676,7 @@

# Process alternative x matrix
x_alt_df = rnaseq_alt_df.loc[y_alt_df.index, :]
if x_matrix == 'raw':
if x_as_raw:
x_alt_df = x_alt_df.loc[:, mad_genes]

x_alt_df_update = pd.DataFrame(fitted_scaler.transform(x_alt_df),
Expand Down
48 changes: 43 additions & 5 deletions scripts/snaptron/investigate_silent_junctions.R
Original file line number Diff line number Diff line change
Expand Up @@ -11,11 +11,49 @@
# wildtype samples. Also a fishers exact test comparing enrichment of 200 bp
# exon 4 truncation in mutated samples versus wildtype

library(optparse)
library(RColorBrewer)

# parse options
option_list = list(
make_option(
c("--version"),
action = "store_true",
default = FALSE,
help = "Print version and exit"
),
make_option(
c("--snaptron_file"),
action = "store",
default = NA,
type = 'character',
help = "SNAPTRON junctions with mutations file"
),
make_option(
c("--output_folder"),
action = "store",
default = '.',
type = 'character',
help = "Folder to store outputs"
)
)

opt <-parse_args(OptionParser(option_list = option_list))

if (opt$version){
# print version and exit
cat(paste("PanCancer version", toString(packageVersion("pancancer"))), "\n")
quit()
}

set.seed(123)

snaptron_file <- "tp53_junctions_with_mutations.csv.gz"
if ( !is.na(opt$snaptron_file) ){
snaptron_file <- opt$snaptron_file
} else {
snaptron_file <- "tp53_junctions_with_mutations.csv.gz"
}

junc_df <- readr::read_csv(snaptron_file)
junc_df <- junc_df[, -1]
junc_df <- junc_df[!duplicated(junc_df), ]
Expand Down Expand Up @@ -92,7 +130,7 @@ plot_exon_exon_junc <- function(exon_df, plot_range = x_range, row_add = 1.85) {
}

# Plot of c.375G>T mutations
pdf("exon-exon_junctions_GtoT.pdf", height = 5, width = 5)
pdf(file.path(opt$output_folder, "exon-exon_junctions_GtoT.pdf"), height = 5, width = 5)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you set the file name before inputing into pdf() call?

plot_exon_exon_junc(exon_df = silent_mutations)
text(x = 7676325, y = 1.9, "Probability", font = 3, cex = 0.5)
text(x = 7675237, y = 2, "Exon 5", font = 3, cex = 0.5)
Expand All @@ -106,7 +144,7 @@ segments(x = 7675237, x0 = 7675237, y = -45, y0 = 1, col = "black",
dev.off()

# Plots of WT samples
pdf("exon-exon_junctions_wt_random.pdf", height = 5, width = 5)
pdf(file.path(opt$output_folder, "exon-exon_junctions_wt_random.pdf"), height = 5, width = 5)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too

plot_exon_exon_junc(wt_junc)
text(x = 7676325, y = 1.9, "Probability", font = 3, cex = 0.5)
text(x = 7675237, y = 2, "Exon 5", font = 3, cex = 0.5)
Expand All @@ -120,7 +158,7 @@ segments(x = 7675237, x0 = 7675237, y = -45, y0 = 1, col = "black",
dev.off()

# Plot of c.375G>A mutations
pdf("exon-exon_junctions_GtoA.pdf", height = 5, width = 5)
pdf(file.path(opt$output_folder, "exon-exon_junctions_GtoA.pdf"), height = 5, width = 5)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too

plot_exon_exon_junc(exon_df = silent_mut_a, row_add = 2.5)
text(x = 7676325, y = 1.9, "Probability", font = 3, cex = 0.5)
text(x = 7675237, y = 2, "Exon 5", font = 3, cex = 0.5)
Expand Down Expand Up @@ -149,7 +187,7 @@ total_silent <- length(unique(c(silent_mutations$tcga_id,
silent_mut_a$tcga_id)))

f_test <- data.frame(c(silent_trunc, total_silent), c(wt_trunc, total_wt))
sink("fishers_exon4_truncation_enrichment_in_c375G-T_mutations.txt")
sink(file.path(opt$output_folder, "fishers_exon4_truncation_enrichment_in_c375G-T_mutations.txt"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too. Maybe we can use a sequence of lines in the beginning of this file that defines all output names

print(f_test)
fisher.test(f_test)
sink()
Loading