chmccarthy/ATOLRootStudy

Improving Orthologous Signal and Model Fit in Datasets Addressing the Root of the Animal Phylogeny

This repository contains datasets, scripts and supplementary material from: Charley GP McCarthy, Peter O Mulhair, Karen Siu-Ting, Christopher J Creevey and Mary J O’Connell (2023). Improving Orthologous Signal and Model Fit in Datasets Addressing the Root of the Animal Phylogeny. Molecular Biology and Evolution, 40(1) msac276, https://doi.org/10.1093/molbev/msac276.

1_Datasets

This folder contains data and results files associated with the primary datasets reanalyzed in this study, arranged into tarballs: the dataset from Chang et al. (2015), two datasets (Datasets 10 and 20) from Whelan et al. (2015), the full dataset from Simion et al. (2017) and the "Metazoa_Choano_RCFV_strict" dataset from Whelan et al. (2017). Due to size constraints, the tarball for Simion et al. (2017) has been split into three files with the suffixes partaa, partab and partac. To join them together, run the following command in bash:

cat 4_Simion2017.tar.gz.parta* > 4_Simion2017.tar.gz

and extract the tarball as usual.
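For users without a bash shell, the same join-and-extract step can be sketched in Python. This is a minimal sketch, not part of the repository: the function names `join_parts` and `join_and_extract` are hypothetical, and it assumes the three part files sit together in one folder and sort correctly by name.

```python
import shutil
import tarfile
from pathlib import Path


def join_parts(part_paths, joined_path):
    """Concatenate split files, in sorted name order, into one file."""
    with open(joined_path, "wb") as out:
        for part in sorted(part_paths):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(joined_path)


def join_and_extract(folder=".", stem="4_Simion2017.tar.gz", dest="."):
    """Rebuild the tarball from its .parta* pieces, then unpack it into dest."""
    parts = list(Path(folder).glob(stem + ".parta*"))
    if not parts:
        raise FileNotFoundError(f"no {stem}.parta* files found in {folder}")
    joined = join_parts(parts, Path(folder) / stem)
    with tarfile.open(joined, "r:gz") as tar:
        tar.extractall(dest)
    return joined
```

Calling `join_and_extract("1_Datasets")` would write `4_Simion2017.tar.gz` next to the parts and unpack it into the current directory; the folder argument here is only an example.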

Each tarball (1_Chang2015.tar.gz etc.) has an identical layout described below.

1_Alignments

This subfolder contains multiple sequence alignments for all component orthogroups in a dataset, aligned under three different methods: MUSCLE, MAFFT and PRANK. Each method corresponds to a different sub-subfolder, i.e. 1_MUSCLE etc. The "best-fit" alignment for each orthogroup was selected using a combination of MetAl and norMD; the selected alignments can be found in the sub-subfolder 4_Selected. All alignments are in FASTA format.
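Since every alignment is plain FASTA, it can be loaded without extra dependencies. The following is a minimal sketch and not one of the repository's scripts: `read_fasta` and `alignment_length` are hypothetical helper names, and the sketch assumes that the sequence identifier is the first whitespace-delimited word of each header line.

```python
def read_fasta(path):
    """Return {sequence_id: aligned_sequence} for one FASTA alignment."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep only the first word of the header as the identifier.
                name = (line[1:].split() or [""])[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def alignment_length(records):
    """Return the shared column count, or raise if rows differ in length."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"rows differ in length: {sorted(lengths)}")
    return lengths.pop()
```

Checking `alignment_length` on each file is a quick way to confirm that an extracted alignment is intact before further use.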

2_ClanCheck

This subfolder contains six sub-subfolders related to clan_check analysis of each animal tree of life dataset: orthogroups and corresponding ML trees for all orthogroups in an original dataset (1_All_OGs, 2_All_Treefiles), orthogroups and trees which were capable of recovering >2 user-defined animal or outgroup clans (3_Passing_OGs, 4_Passing_Treefiles), and those which could not (5_Failing_OGs, 6_Failing_Treefiles). Each orthogroup sub-subfolder contains alignments in FASTA format; each tree sub-subfolder contains trees in Newick format.
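To illustrate what "recovering a clan" means for a Newick tree, here is a simplified sketch. It is not the clan_check program and makes no claim to match its behaviour: it treats a clan as recovered when the clan's taxa present in the tree form one side of a bipartition of the unrooted tree. The function names are hypothetical, and the parser assumes unquoted taxon names without parentheses, commas, colons or semicolons.

```python
import re


def clades(newick):
    """Return (all_leaves, leaf sets under every internal node) of a Newick tree."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, found, leaves = [], [], set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in ",;" and prev != ")":
            # Text right after ")" is a support value or branch length: skip it.
            name = tok.split(":")[0].strip()
            if name:
                leaves.add(name)
                if stack:
                    stack[-1].add(name)
        prev = tok
    return leaves, found


def recovers_clan(newick, clan):
    """True if the clan's taxa present in the tree form one side of a split."""
    leaves, found = clades(newick)
    present = set(clan) & leaves
    if len(present) < 2:  # assumption: too few taxa present to test the clan
        return False
    return any(c == present or leaves - c == present for c in found)


def count_recovered_clans(newick, clans):
    """Count how many named clans (name -> set of taxa) the tree recovers."""
    return sum(recovers_clan(newick, taxa) for taxa in clans.values())
```

Under this sketch, the passing criterion described above would correspond to `count_recovered_clans(...) > 2` for an orthogroup's tree, given the user-defined clans.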

3_Phylogeny

This subfolder contains two sub-subfolders: 1_Matrix and 2_Tree. The matrix folder contains the concatenated data matrix of all orthogroups passing the clan_check filter described above and the corresponding partition file. The tree folder contains the posterior consensus tree generated from PhyloBayes-MPI, alongside .bplist and .bpdiff files from bpcomp and the .tracecomp file from tracecomp.
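The relationship between a concatenated matrix and its partition file can be sketched as follows. This is an illustrative sketch only, not the script used to build the matrices in this repository: `concatenate` is a hypothetical name, taxa missing from an orthogroup are assumed to be padded with gap characters, and partitions are returned as 1-based inclusive column ranges rather than written in any particular tool's file format.

```python
def concatenate(alignments, missing="-"):
    """Join per-orthogroup alignments into one matrix with partition ranges.

    alignments maps orthogroup name -> {taxon: aligned sequence}.
    Returns (matrix, partitions), where matrix maps taxon -> concatenated
    sequence and partitions is a list of (name, start, end) tuples.
    """
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for name, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{name}: rows differ in length or are absent")
        length = lengths.pop()
        for taxon in taxa:
            # Pad taxa that lack this orthogroup with the missing-data symbol.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((name, start, start + length - 1))
        start += length
    matrix = {taxon: "".join(chunks) for taxon, chunks in pieces.items()}
    return matrix, partitions
```

Each `(name, start, end)` tuple marks the columns that one orthogroup occupies in the concatenated matrix, which is the information a partition file records.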

4_Analysis

This subfolder contains two sub-subfolders related to analysis of each original or filtered dataset. 1_PPA contains .ppred files for posterior predictive analysis results of CAT-GTR+G4 model fit of filtered data matrices, as run under PhyloBayes-MPI. 2_RCFV contains a results folder for BaCoCa analysis of the original dataset which was used to determine distribution of RCFV values across passing and failing orthogroups.
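For readers unfamiliar with RCFV (relative composition frequency variability), the following sketch shows one common formulation: the absolute deviation of each taxon's amino acid frequencies from the mean frequency across taxa, summed over all amino acids and taxa and divided by the number of taxa. It is an assumption-laden illustration, not BaCoCa's code; in particular, counting only the 20 standard amino acids and averaging per-taxon frequencies are choices made here, and BaCoCa may treat gaps and ambiguity codes differently.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequencies(seq):
    """Relative frequency of each standard amino acid, ignoring other symbols."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {a: (counts[a] / total if total else 0.0) for a in AMINO_ACIDS}


def rcfv(alignment):
    """RCFV of an alignment given as {taxon: aligned sequence}."""
    freqs = [frequencies(seq) for seq in alignment.values()]
    n = len(freqs)
    if n == 0:
        raise ValueError("alignment has no sequences")
    total = 0.0
    for a in AMINO_ACIDS:
        mean = sum(f[a] for f in freqs) / n
        total += sum(abs(f[a] - mean) for f in freqs)
    return total / n
```

An alignment whose taxa all share the same composition gives a value of 0, and the value grows as compositions diverge between taxa.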

2_OutgroupReanalysis

This folder contains two subfolders relating to reanalysis of filtered datasets with specific outgroups excluded.

1_HoloChoano

This subfolder contains two sub-subfolders corresponding to reanalysis of Whelan2015_D10_filtered and Whelan2015_D20_filtered with fungal outgroups removed (1_Whelan2015_D10_FilteredHolo etc.). Each of these folders contains two subfolders: 1_Matrix contains the filtered data matrix with outgroups removed and the associated partition file, and 2_Tree contains the posterior consensus tree generated from PhyloBayes-MPI, alongside .bplist and .bpdiff files from bpcomp and the .tracecomp file from tracecomp.

2_Choano

This subfolder contains four sub-subfolders corresponding to reanalysis of the filtered Chang2015, Whelan2015_D10, Whelan2015_D20 and Simion2017 datasets with fungal+holozoan outgroups removed (1_Chang2015_FilteredChoano etc.). Each of these folders contains two subfolders: 1_Matrix contains the filtered data matrix with outgroups removed and the associated partition file, and 2_Tree contains the posterior consensus tree generated from PhyloBayes-MPI, alongside .bplist and .bpdiff files from bpcomp and the .tracecomp file from tracecomp.

3_AdditionalPPAs

This folder contains additional data used to facilitate comparisons of posterior predictive analysis results between original and filtered datasets. 1_Feuda2017 contains PPA results for the original Chang2015 and Whelan2015_D20 datasets as generated by Feuda et al. (2017), sourced from https://bitbucket.org/bzxdp/feuda_et_al_2017/src/master/. 2_OwnReanalysis contains PPA results generated by this study for the original Whelan2015_D10, Simion2017 and Whelan2017_MCRS datasets.

4_QMatrices

This folder contains, for each dataset (1_Chang2015 etc.), estimated amino acid substitution matrices (Q) for both the original and filtered versions of that dataset, as generated using the QMaker approach in IQ-TREE (Minh et al., 2021). These matrices are named Q.Original and Q.Filtered, respectively.

5_Scripts

This folder contains Python and R scripts used to generate results and/or figures for the manuscript.

6_Spreadsheets

This folder contains the .xlsx files corresponding to supplementary spreadsheets S1 and S2 for the manuscript.

7_Figures

This folder contains the figures for both the main and supplementary sections of the manuscript in PDF format.

Bibliography

See associated manuscript linked at the top of the README and supplementary information for references to data/analysis deposited here.

About

Data and scripts for McCarthy et al. (2022).
