Question about filtering features and groups #88

Open
pig-raffles opened this issue Jan 21, 2024 · 7 comments

@pig-raffles

Hi,

Thanks for creating this excellent package for analysing PICRUSt2 output data. I have been looking for something like this for a while now.

I have run the analysis main pipeline (ggpicrust2()) with a couple of different data sets and have run into some error messages that I have some questions about.

The first question is about filtering the feature data. I generally get the following warning when running the pipeline:

"In MicrobiomeStat::linda(abundance, LinDA_metadata_df, formula = "~Group_group_nonsense_", :
Some features have less than 3 nonzero values!
They have virtually no statistical power. You may consider filtering them in the analysis!"

Do you have any advice on how best to filter the feature data? Is it just a case of opening the abundance file (TSV format) and editing it to remove these features/functions? For a data set of 10 individuals split into two treatment groups, would I filter out functions that have nonzero values for only 3 or fewer individuals out of the 10?

My second question is about the number of groups I want to compare. The PICRUSt2 output data I wish to use currently covers 4 different groups, and I would like to filter this down to pairwise comparisons between groups. Would the simplest way of doing this be, again, to edit the abundance file, filtering out any individuals not in the pairwise comparison I wish to make? Would I also need to alter the metadata file?

Finally, I also get the warning:

"In cbind(sample = colnames(sub_relative_abundance_mat), group = Group, :
number of rows of result is not a multiple of vector length (arg 2)"

What could be causing this?

Thank you for your time and any help you can offer,

Best wishes,

Alan

@cafferychen777
Owner

Dear Alan,

Thank you for reaching out and for using ggpicrust2 to analyze your PICRUSt2 output data. I appreciate your detailed questions.

Regarding the warnings you encountered, I'd like to assure you that these are typical in the analysis process and generally do not have a significant impact on the overall results.

For the first warning about feature data filtering, it's common in bioinformatics pipelines to encounter features with few nonzero values. While these features have limited statistical power, their presence is a normal part of diverse datasets and doesn't necessarily compromise the analysis. If you wish to filter them, doing so directly in the abundance file (TSV format) is a standard approach. However, it's not always necessary unless they significantly skew your results or you have specific reasons for stringent data curation.
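For anyone who would rather not edit the TSV by hand, here is a minimal base R sketch of that filtering step. It assumes the abundance table has features as rows and samples as columns; the toy values and the `min_nonzero` name are illustrative only. The threshold follows the wording of the warning ("less than 3 nonzero values"), so features with 3 or more nonzero samples are kept.

```r
# Toy abundance table (hypothetical values): rows are features, columns are samples.
abundance <- data.frame(
  S1 = c(10, 0, 5, 0),
  S2 = c(12, 0, 0, 0),
  S3 = c(8, 3, 7, 0),
  S4 = c(0, 0, 6, 1),
  row.names = c("K00001", "K00002", "K00003", "K00004")
)

# Count the samples in which each feature has a nonzero value.
nonzero_counts <- rowSums(abundance > 0)

# Keep features that are nonzero in at least `min_nonzero` samples.
min_nonzero <- 3
filtered <- abundance[nonzero_counts >= min_nonzero, , drop = FALSE]

print(filtered)
```

Whether a stricter cutoff makes sense (for example, requiring nonzero values in a minimum number of samples per group) depends on the study design.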

For the second point regarding group comparisons, editing the abundance file for pairwise comparisons is indeed a straightforward method. It allows you to focus on specific groups of interest. Remember to adjust the metadata file accordingly to ensure consistency between your data and metadata.
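The same subsetting can be done in R instead of editing the files. This is only a sketch: the column names `sample_name` and `Group`, the group labels `A` to `D`, and the toy values are assumptions, not taken from the attached data. The point is that the metadata is filtered first and the abundance columns are then selected by the remaining sample names, so the two stay in the same order.

```r
# Hypothetical metadata: 8 samples across 4 groups.
metadata <- data.frame(
  sample_name = paste0("S", 1:8),
  Group = rep(c("A", "B", "C", "D"), each = 2),
  stringsAsFactors = FALSE
)

# Hypothetical abundance table: 2 features (rows) by 8 samples (columns).
abundance <- as.data.frame(matrix(
  1:16, nrow = 2,
  dimnames = list(c("K00001", "K00002"), paste0("S", 1:8))
))

# Keep only the two groups in the pairwise comparison of interest.
keep_groups <- c("A", "C")
metadata_sub <- metadata[metadata$Group %in% keep_groups, , drop = FALSE]

# Select the matching sample columns, in the same order as the metadata rows.
abundance_sub <- abundance[, metadata_sub$sample_name, drop = FALSE]

# Sanity check: samples line up one to one.
stopifnot(identical(colnames(abundance_sub), metadata_sub$sample_name))
```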

Lastly, the warning about the row number not being a multiple of vector length often arises due to mismatches in data dimensions or when combining datasets with different lengths. It's a common warning in data processing and, in most cases, doesn't critically affect the analysis outcome.
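As an illustration, base R's `cbind()` emits this warning when the vectors being combined have lengths that do not divide evenly, for example when the number of sample columns differs from the number of group labels. The sketch below reproduces the warning with toy data and shows one way to look for samples present in one table but not the other. The names `sample_name` and `Group` are assumptions, and this is one plausible cause rather than a confirmed diagnosis for this dataset.

```r
# Hypothetical inputs: 4 samples in the abundance table, only 3 in the metadata.
abundance_samples <- c("S1", "S2", "S3", "S4")
metadata <- data.frame(
  sample_name = c("S1", "S2", "S3"),
  Group = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# Reproduce the warning: a length-4 vector combined with a length-3 vector.
warning_text <- tryCatch(
  {
    cbind(sample = abundance_samples, group = metadata$Group)
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
print(warning_text)

# Diagnose the mismatch between the two sample lists.
missing_in_metadata  <- setdiff(abundance_samples, metadata$sample_name)
missing_in_abundance <- setdiff(metadata$sample_name, abundance_samples)
shared <- intersect(abundance_samples, metadata$sample_name)
```

If either `missing_in_*` vector is non-empty, restricting both tables to `shared` before running the pipeline should make the dimensions agree.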

In summary, these warnings are part of routine data analysis and do not necessarily indicate a major problem with your analysis or data. Feel free to proceed with your analysis, keeping these points in mind.

Best wishes in your research, and don't hesitate to reach out if you have further questions.

Kind regards,

Chen YANG

@pig-raffles
Author

pig-raffles commented Mar 1, 2024

Hi Chen,

Thanks for your help. The suggestions you gave worked and now the analysis runs.

Do you have any recommendations for a DA method suitable for smaller data sets (<10 individuals)?

Best wishes,

Alan

@cafferychen777
Owner

Hi Alan,

I'm glad to hear the suggestions worked and you were able to run the analysis successfully.

For smaller microbiome datasets with fewer than 10 individuals, DESeq2 could be a good differential abundance method to try. It uses shrinkage estimation for dispersion and fold change to improve results for experiments with small numbers of replicates. This helps avoid the high variability or false positives sometimes seen with small sample sizes.

Other options are meta-analysis methods like Fisher's method, which combines P values across studies to gain power. But with very limited samples per group (<5), all methods will struggle. Adding more biological replicates per group is best if feasible.
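For reference, Fisher's method combines k independent P values through the statistic -2 * sum(log(p)), which follows a chi-squared distribution with 2k degrees of freedom under the null hypothesis. A minimal base R sketch with made-up P values:

```r
# Hypothetical P values for the same feature from three independent studies.
p_values <- c(0.04, 0.10, 0.30)

# Fisher's combined statistic and its degrees of freedom.
fisher_stat <- -2 * sum(log(p_values))
fisher_df <- 2 * length(p_values)

# Combined P value from the upper tail of the chi-squared distribution.
combined_p <- pchisq(fisher_stat, df = fisher_df, lower.tail = FALSE)

print(c(statistic = fisher_stat, df = fisher_df, p = combined_p))
```

The method assumes the tests being combined are independent, so it applies to separate studies rather than to repeated analyses of the same samples.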

Let me know if you have any other questions!

Best,
Chen

@pig-raffles
Author

Sorry, one further question.

When using DESeq2, I get the following error message.

"Error in if (num_significant_biomarkers == 0) { :
missing value where TRUE/FALSE needed"

As I understand it, this refers to NAs being present in the dataset. What is causing the NAs and how would I best remove them?

Thanks in advance,

Alan
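A note on the error itself: in R, "missing value where TRUE/FALSE needed" means the condition passed to `if` evaluated to `NA`. The sketch below shows one way that can happen. It assumes, without having checked the package source, that `num_significant_biomarkers` is derived from a vector of adjusted P values; DESeq2 is documented to report `NA` adjusted P values for some features (for example, those removed by independent filtering or flagged as count outliers). The vector and variable names here are illustrative.

```r
# Hypothetical adjusted P values, one of which is NA.
p_adjust <- c(0.01, NA, 0.20, 0.03)

# Counting significant features without handling NA yields NA ...
num_significant <- sum(p_adjust < 0.05)
print(is.na(num_significant))

# ... and `if (num_significant == 0)` would then fail with
# "missing value where TRUE/FALSE needed".

# Dropping NA values from the count gives a usable number.
num_significant_safe <- sum(p_adjust < 0.05, na.rm = TRUE)
if (num_significant_safe == 0) {
  message("No significant features")
} else {
  message(num_significant_safe, " significant feature(s)")
}
```

It is also worth checking the input directly, for instance with `anyNA()` on the abundance table, to rule out missing values in the data.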

@cafferychen777
Owner

Hi Alan,

Thank you for your interest in the ggpicrust2 package and for your thoughtful questions.

Regarding the error you encountered with DESeq2 and the missing values, it would be very helpful if you could share your dataset with me. Having access to the actual data you are working with would allow me to investigate the source of the NAs and determine the best approach for handling them. I would be happy to take a look and provide more specific guidance on preprocessing your data to avoid this error.

Feel free to send over your abundance and metadata files, or a representative subset of your data. I will do my best to reproduce the issue and suggest a solution. You can attach the files here on GitHub or send them to my email at [email protected].

Please let me know if you have any other questions! I appreciate you taking the time to report this error and am committed to helping you resolve it.

Best regards,
Chen

@pig-raffles
Author

Hi Chen,

Sorry for the delay. Please find the metadata file (SW_FW_ANT_Tilapia_metadata.txt) and abundance file (SW_FW_Ant_KO_pred_metagenome_unstrat.txt) attached.

Best wishes,

Alan

SW_FW_Ant_KO_pred_metagenome_unstrat.txt

SW_FW_ANT_Tilapia_metadata.txt

@pig-raffles
Author

Sorry Chen, did you get a chance to look at the files?
