Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bedshift seems to produce invalid regions #49

Open
nleroy917 opened this issue Jun 7, 2023 · 1 comment
Open

Bedshift seems to produce invalid regions #49

nleroy917 opened this issue Jun 7, 2023 · 1 comment

Comments

@nleroy917
Copy link
Member

I noticed using bedshift recently that the algorithm seems to produce regions that are invalid. Specifically, it shifted my bedfile to create a region with a start that occurred after the end (i.e. chr1 1400 900). bedtools was throwing errors because of this.

I saw this happen after I did repeated rounds of shifting on a single bedfile so I'm not sure if this is because of that.

To create the shifted bed files, this is what I ran (I can provide filesm GitHub doesn't let you upload bedfiles):

# %%
import os
from tqdm import tqdm
from bedshift import Bedshift

# %%
# params
ADD_RATE = 0.01
ADD_MEAN = 320.0
ADD_STDEV = 20.0
SHIFT_RATE = 0.01
SHIFT_MEAN = -10.0
SHIFT_STDEV = 120.0
CUT_RATE = 0.01
MERGE_RATE = 0.01
DROP_RATE = 0.03
N_SHIFTS = 100
OUT_DIR = "shifted"

# %%
# create bedshifter object on original bed file
bedshifter = Bedshift("pbmcs.bed", "hg38.chrom.sizes")

for shift_num in tqdm(range(N_SHIFTS), total=N_SHIFTS):
    file_name = f"shifted_{shift_num}.bed"
    file_path = os.path.join(OUT_DIR, file_name)

    bedshifter.all_perturbations(
        addrate=ADD_RATE,  # the rate (as a proportion of the total number of regions) to add regions
        addmean=ADD_MEAN,  # the mean length of added regions
        addstdev=ADD_STDEV,  # the standard deviation of the length of added regions
        shiftrate=SHIFT_RATE,  #  the rate to shift regions (both the start and end are shifted by the same amount)
        shiftmean=SHIFT_MEAN,  #  the mean shift distance
        shiftstdev=SHIFT_STDEV,  #  the standard deviation of the shift distance
        cutrate=CUT_RATE,  #  the rate to cut regions into two separate regions
        mergerate=MERGE_RATE,  #  the rate to merge two regions into one
        droprate=DROP_RATE,  # the rate to drop/remove regions
        seed=42,
    )

    bedshifter.to_bed(file_path)

    # start at the file we just created
    bedshifter = Bedshift(file_path, "hg38.chrom.sizes")

Then I analyzed overlaps with bedtools with (one directory up):

#!/bin/bash

mkdir -p results/
rm results/intersect.csv
touch results/intersect.csv

original_file="data/pbmcs.bed"

# add baseline to results
total=$(wc -l < "$original_file")
n_overlaps=$(bedtools intersect -wa -a "$original_file" -b "$original_file" | wc -l)
percentage=$(echo "scale=2; $n_overlaps / $total" | bc)

echo "pbmcs.bed,$percentage" >> results/intersect.csv

for file in data/shifted_clean/*.bed; do
    filename=$(basename "$file")

    # compute num intersections and store
    total=$(wc -l < "$file")
    n_overlaps=$(bedtools intersect -u -a "$file" -b "$original_file" | wc -l)
    percentage=$(echo "scale=2; $n_overlaps / $total" | bc)
    
    echo "Processing $file"
    echo "$filename,$percentage" >> results/intersect.csv

done
@nsheff
Copy link
Member

nsheff commented Jun 7, 2023

if you're providing "hg38.chrom.sizes" then it shouldn't go outside of that coordinate system...

so probably something isn't getting that setting correctly, maybe in the Python API...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants