simulated_data

A repository for code that generates various simulated data.

Subfolders found here concern:

Related efforts by others

"What is a good and easy way to select 5-10 Mb of neutral DNA sequences in the human genome? Would selecting random intergenic regions (say >10Kb away from genes) be enough? Has someone done something similar recently in a paper I could cite and use the same loci?"
https://twitter.com/vsbuffalo/status/1646212322833334272
"I had to do this recently — I took all exonic + phastcons + UTRS, merged them, and then add 200bp of buffer on both ends (all using bedtools). You could do this and even select out random regions. I did some sensitivity analysis and comparison to the CADD tracks and seemed good."
"Also (and perhaps this is being too paranoid) but I merged the refseq and ensembl tracks. They differ slightly in their percent of basepairs that annotated as coding, so I took the union."
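
The masking recipe in the replies above (merge the annotated intervals, pad them by 200 bp, then draw random regions from what is left) can be sketched in plain Python. This is only an illustrative stand-in for the bedtools workflow the author mentions, not their actual commands. It assumes a single chromosome, 0-based half-open coordinates as in BED, and hypothetical function names and example intervals.

```python
import random

def merge_with_buffer(intervals, buffer_bp=200):
    """Pad each (start, end) interval by buffer_bp on both sides, then merge overlaps."""
    padded = sorted((max(0, s - buffer_bp), e + buffer_bp) for s, e in intervals)
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def complement(masked, chrom_length):
    """Return the unmasked gaps, given sorted, merged masked intervals."""
    gaps = []
    cursor = 0
    for start, end in masked:
        start, end = min(start, chrom_length), min(end, chrom_length)
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps

def sample_windows(gaps, window_bp, n, seed=0):
    """Draw n random windows of window_bp from gaps that are large enough.

    Windows may overlap each other; deduplicate afterwards if that matters.
    """
    rng = random.Random(seed)
    eligible = [(s, e) for s, e in gaps if e - s >= window_bp]
    if not eligible:
        return []
    windows = []
    for _ in range(n):
        s, e = rng.choice(eligible)
        offset = rng.randint(s, e - window_bp)
        windows.append((offset, offset + window_bp))
    return windows

# Hypothetical union of exon, conserved-element, and UTR intervals on one chromosome.
annotated = [(1000, 2000), (2100, 2500), (5000, 5600)]
masked = merge_with_buffer(annotated, buffer_bp=200)  # [(800, 2700), (4800, 5800)]
neutral = complement(masked, chrom_length=10_000)
print(sample_windows(neutral, window_bp=500, n=3))
```

Taking the union of two annotation sources, as the second reply suggests, amounts to concatenating both interval lists before calling `merge_with_buffer`.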

"Go and grab 130G of long-read mock microbial community data from PromethION and 36G from MinION over here, if you fancy: https://github.com/LomanLab/mockcommunity … #UKGS18 - could be useful for bioinformatics pipeline validation and method development!"

https://twitter.com/Hasindu2008/status/1628569325895585793

"Squigulator r10 branch https://github.com/hasindu2008/squigulator/tree/r10 can simulate r10.4.1 signals. Also f5c r10 branch https://github.com/hasindu2008/f5c/tree/r10 can do resquiggle and eventalign for R10.4.1. Note: still work in progress and improvements are on the way. Thanks, @nanopore for providing the pore-model."

"To test different approaches for assembling genomes, I needed data with known microbial content. Only long reads were available, but I needed to test the algorithm on short paired-end reads. This script was written to create short reads from long reads."
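
The quoted script itself is not reproduced here. As a rough sketch of the general idea, one way to derive paired-end short reads from a long read is to cut it into fixed-length fragments and take the forward start and the reverse-complemented end of each fragment. The function names, read length, fragment length, and step below are all assumptions chosen for illustration, and base qualities are not simulated.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def long_read_to_pairs(seq, read_len=150, fragment_len=400, step=200):
    """Yield (read1, read2) pairs cut from successive fragments of one long read.

    read1 is the first read_len bases of the fragment (forward strand);
    read2 is the reverse complement of the last read_len bases.
    """
    if read_len > fragment_len:
        raise ValueError("read_len must not exceed fragment_len")
    for start in range(0, len(seq) - fragment_len + 1, step):
        fragment = seq[start:start + fragment_len]
        yield fragment[:read_len], reverse_complement(fragment[-read_len:])

# Tiny toy example with hypothetical parameters.
for r1, r2 in long_read_to_pairs("AAAAACCCCCGG", read_len=3, fragment_len=6, step=6):
    print(r1, r2)
```

A smaller `step` gives higher coverage from the same long read, at the cost of strongly correlated pairs.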

"Our single-cell and spatial omics simulator scDesign3 is now online: https://nature.com/articles/s41587-023-01772-1 scDesign3 has two functionalities: (1) synthetic data simulation and (2) real data interpretation and modification 1/"

Only tangentially related

https://x.com/RobAboukhalil/status/1808602458698232258 July 2024

"When writing bioinformatics tools, I often need small datasets to test edge cases or invalid file formats, e.g. files that are truncated, unsorted, have extraneous whitespace, etc. I started compiling examples here: https://github.com/omgenomics/bio-data-zoo, contributions are welcome!"

From the Bio Data Zoo repository description:
> "This repo contains example data in various genomics file formats. It is intended for bioinformatics tool developers to make testing software easier. It includes examples of valid file formats, edge cases, and invalid formats."
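
For cases the zoo does not cover, malformed variants can also be derived from a small valid file. The following is a minimal sketch, unrelated to how Bio Data Zoo builds its examples. The toy FASTQ content and the helper names are made up for illustration.

```python
# A tiny, valid two-record FASTQ used as the starting point (toy data).
VALID_FASTQ = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"

def truncate_mid_record(text, keep_fraction=0.6):
    """Cut the text at a fraction of its length, usually mid-record."""
    return text[:int(len(text) * keep_fraction)]

def add_trailing_whitespace(text):
    """Append a space and a tab to the end of every line."""
    return "".join(line + " \t\n" for line in text.splitlines())

def use_crlf(text):
    """Switch Unix line endings to Windows-style CRLF."""
    return text.replace("\n", "\r\n")

variants = {
    "truncated.fastq": truncate_mid_record(VALID_FASTQ),
    "whitespace.fastq": add_trailing_whitespace(VALID_FASTQ),
    "crlf.fastq": use_crlf(VALID_FASTQ),
}
for name, content in variants.items():
    print(name, repr(content))
```

Writing each variant to disk and feeding it to a parser under test then shows whether the tool fails cleanly or silently accepts the bad input.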