Filtering out host (human) genome beforehand? #96

Open
mhmism opened this issue Jul 24, 2023 · 3 comments

Comments

@mhmism

mhmism commented Jul 24, 2023

Thank you for this wonderful tool!

Should we filter out host (human) sequences before running the pipeline, e.g. using KneadData or fastp? Does the same apply to the PhiX genome?

I tried Hecatomb on one of my DNA shotgun metagenomics datasets and found a large difference in the output with and without prior host DNA removal. Specifically, the number (and diversity) of viral sequences retrieved was much higher when I did not remove the host (human) DNA before running Hecatomb. The other issue is that a large proportion of sequences was assigned to RNA viruses, including ones that I should not normally see in my dataset, such as Human immunodeficiency virus. These RNA viruses were found both with and without prior host DNA removal, but their number was significantly higher when the host DNA had not been removed. This makes me think that host DNA is being misclassified as viral in my dataset.
Also, I am not sure whether I should expect to find any RNA viruses at all, given that my dataset is shotgun DNA metagenomics.

For more context, my dataset is bulk shotgun metagenomics (i.e. not viral-enriched).

Thank you in advance!

@beardymcjohnface
Collaborator

Hi,
You shouldn't need to perform host removal, as this step is performed by Hecatomb, but there will be a difference due to the way Hecatomb prepares the references for filtering. Viral-like sequences in the host are masked to avoid removing real viral sequences that happen to be similar, but this means some host sequences get past the filter and need to be removed later. Hecatomb currently doesn't remove PhiX, but I think this will change in the next version. I'm interested in hearing what your preference would be re: filtering, as we've had this conversation several times about what approach would be best.
I wouldn't expect to find many RNA viruses in a DNA metagenome, but you might still have hits to known RNA viruses if they share homology with DNA viruses in your sample.
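
To make the masking idea above concrete, here is a minimal sketch of replacing viral-like regions of a host sequence with `N` so that reads matching those regions are no longer removed by the host filter. This is not Hecatomb's actual code; the function name `mask_intervals` and the interval list are hypothetical, and the intervals are assumed to come from some earlier alignment of viral sequences against the host reference.

```python
def mask_intervals(seq: str, intervals: list[tuple[int, int]]) -> str:
    """Return seq with every half-open [start, end) interval replaced by 'N'.

    Hypothetical helper for illustration only: 'intervals' stands in for
    host regions that look viral and should not be used for host filtering.
    """
    masked = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


# Toy example: mask positions 2, 3 and 4 of a 10-base "host" sequence.
host = "ACGTACGTAC"
print(mask_intervals(host, [(2, 5)]))
```

The trade-off discussed in this thread follows directly: a read drawn from a masked host region no longer matches the host reference, so it survives filtering and can later be assigned to a virus.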

@mhmism
Author

mhmism commented Jul 25, 2023

Thanks for your response. It would be great to have PhiX removal in the next version of Hecatomb; I am looking forward to it.
Regarding the filtering process, unfortunately there is no easy answer. Based on what I saw in my toy dataset, I think many of the host DNA reads were wrongly classified as RNA viruses (this is suggested by the large proportion of RNA viruses retrieved from a DNA dataset, which is unexpected behaviour). This may be a problem with short-read datasets in general. On the other hand, you may also lose some DNA viruses if you filter beforehand. If you would like to be more conservative and avoid false positives as much as possible, then removing host DNA beforehand might be needed. However, reaching a firmer conclusion would require benchmarking on synthetic datasets that mix microbial (including viral) and host short reads.

In addition, you may wish to include an option to search only the DNA viral catalogue, only the RNA viral catalogue, or both. That way, the search could better suit the type of dataset being investigated.
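
As a rough sketch of the kind of benchmark suggested above: with a synthetic dataset, the true source of every read is known, so the reads called viral can be scored directly. The function name, the label strings (`"virus"`, `"host"`, `"bacteria"`) and the input structures below are all assumptions for illustration, not anything produced by Hecatomb.

```python
from collections import Counter


def summarize_viral_calls(truth: dict[str, str], called_viral: set[str]) -> dict:
    """Score a set of reads called viral against known read origins.

    truth maps read ID -> true source label (hypothetical labels such as
    "virus", "host", "bacteria"); called_viral holds the read IDs that a
    pipeline run classified as viral.
    """
    true_pos = sum(1 for read in called_viral if truth.get(read) == "virus")
    # Count false positives by where the read really came from.
    false_pos_by_source = Counter(
        truth.get(read, "unknown")
        for read in called_viral
        if truth.get(read) != "virus"
    )
    n_viral = sum(1 for source in truth.values() if source == "virus")
    return {
        "true_positives": true_pos,
        "false_positives_by_source": dict(false_pos_by_source),
        "precision": true_pos / len(called_viral) if called_viral else 0.0,
        "sensitivity": true_pos / n_viral if n_viral else 0.0,
    }


# Toy example: two host reads are wrongly called viral, one viral read is missed.
truth = {"r1": "virus", "r2": "virus", "r3": "host", "r4": "host", "r5": "bacteria"}
print(summarize_viral_calls(truth, {"r1", "r3", "r4"}))
```

Running the same summary on results with and without prior host removal would show whether the extra viral calls are mostly host-derived false positives (lower precision) or genuine viral reads lost by pre-filtering (lower sensitivity).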

I am curious to know your thoughts!

@beardymcjohnface
Collaborator

Yes, I agree 100%. This misclassification of host DNA as RNA viruses is very typical. I like the idea of switching off searching for RNA viruses; I'll have to think of the best way to implement it as we want to do the same thing for phages.
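
Until such an option exists, a similar effect could be approximated after the fact by filtering the classified hits. The sketch below is only an assumption about how that might look: the `family` column name and the small family-to-genome-type table are hypothetical stand-ins, not Hecatomb's output format or database.

```python
# Hypothetical lookup for illustration; a real one would need to cover
# every viral family in the reference database.
GENOME_TYPE = {
    "Herpesviridae": "DNA",
    "Adenoviridae": "DNA",
    "Papillomaviridae": "DNA",
    "Retroviridae": "RNA",
    "Coronaviridae": "RNA",
    "Picornaviridae": "RNA",
}


def keep_hits(rows: list[dict], wanted: str = "DNA", family_key: str = "family") -> list[dict]:
    """Keep hits whose viral family has the wanted genome type.

    Rows with a family missing from GENOME_TYPE are kept, so nothing is
    dropped just because the lookup table is incomplete.
    """
    kept = []
    for row in rows:
        genome = GENOME_TYPE.get(row.get(family_key, ""))
        if genome is None or genome == wanted:
            kept.append(row)
    return kept


# Toy example with made-up read IDs.
hits = [
    {"read": "r1", "family": "Herpesviridae"},
    {"read": "r2", "family": "Retroviridae"},
    {"read": "r3", "family": "Unclassified"},
]
print([row["read"] for row in keep_hits(hits)])
```

Keeping hits from unknown families is a deliberate, conservative choice here; dropping them instead would be stricter against false positives. The same pattern would work for switching phage hits on or off, given a suitable lookup.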
