Skip to content

Tutorial

John Lees edited this page Oct 8, 2018 · 6 revisions

Here I show how to apply seer to 603 pneumococcal genomes isolated from the nasopharynx of US (MA) children (see Croucher et. al. Nat Gen 6(45) 656--663 doi: 10.1038/ng.2625). 224 were resistant or intermediately resistant to penicillin.

  1. Count the k-mers
  2. Estimate population structure
  3. Run seer
  4. Downstream analysis

We start with genomes assembled by velvet (in assemblies) and the list of which were resistant and susceptible to penicillin (penicillin.pheno)

Count the k-mers

paste <(ls assemblies | cut -d "." -f 1) <(ls assemblies/*.fa) > fsm_files.txt
fsm-lite -v -l fsm_files.txt -t tmp_idx -s 10 -S 593 > fsm_kmers.txt
split -d -n l/16 fsm_kmers.txt fsm_out
rm fsm_kmers.txt
gzip fsm_out*

These first steps count the k-mers from the assemblies and split the output into files to facilitate parallelisation later. We use k-mers to test for association as these allow us to look for both gene and SNP variation simultaneously, and they don't rely on a single reference genome.

The parameters -s and -S should be set to around 1% and 99% of the sample size respectively.

This step takes about 4 hours. If you are running into issues with disk space you should pipe the fsm-lite command directly to gzip, and leave out the split command.

Estimate population structure

Make sure you are using v1.1.3 or later. The order of samples in ls assemblies/*.fa and cut -f 1 penicillin.pheno should be the same. Otherwise in the final command give a file with sample names in the order of the former command.

mash sketch -o reference assemblies/*.fa
mash dist reference.msh reference.msh > mash_distances.txt
perl ~/seer/scripts/mash2matrix.pl mash_distances.txt > all_distances.csv
perl ~/seer/scripts/R_mds.pl -d all_distances.csv -p penicillin.pheno -o projection
  1. Create a minhash of all the assemblies
  2. Calculate the pairwise distance between all samples
  3. Convert the mash output into a distance matrix
  4. Project the 603x603 distance matrix into 3 dimensions

Bacteria are clonal, and so a simple test of association will bring up all variants in a lineage associated with the phenotype as well as the causal variants. By assessing the similarity between samples we can approximate this population structure, and control for it in seer.

seer

seer -k fsm_out00.gz -p penicillin.pheno --struct projection --threads 8 --maf 0.05 > seer00.txt

This runs the association. It uses the counted kmers -k, the phenotype -p and the population structure calculated above --struct. We also filter out low frequency kmers, and multithread on the vm for speed.

This command can be run in a reasonable amount of time (<20 minutes) to get the output from one of the kmer files. Using less -S, have a look in the output file to see what seer generates as output.

This can then be filtered to just those kmers positively associated with resistance

filter_seer -k seer00.txt --pos_beta > seer00_filtered.txt

Run in parallel on all 12 million k-mers

parallel --results answers -j 8 seer -k fsm_out{}.gz -p penicillin.pheno --struct projection --maf 0.05 ::: 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
for i in $(find . -name stdout); do filter_seer -k $i --pos_beta | sed '1d' >> seer_filtered.txt; done

This runs all the seer jobs in parallel and then filters them. This takes about two hours.

Downstream analysis

~/seer/scripts/hits_to_fastq.pl -k seer_filtered.txt -b 10e-8 > seer_filtered.fastq
bowtie2 -p 8 --reorder -q -U seer_filtered.fastq --ignore-quals -D 24 -R 3 -N 0 -L 7 -i S,1,0.50 -x Streptococcus_pneumoniae_ATCC_700669_v1 -S seer_filtered.sam
~/seer/scripts/mapping_to_phandango.pl -s seer_filtered.sam -k seer_filtered.txt -m 20 > phandango.plot
  1. Converts significant kmers to a fastq for mapping
  2. Maps significant kmers to a reference genome
  3. Visualise as a Manhattan plot in phandango

Note the message from bowtie2 -> if kmers are unmapped another reference can easily be mapped to without repeating any of the long steps above. Therefore the pan-genome has been tested, not just the core

On the local machine

Copy the gff and plot file locally and drag onto http://jameshadfield.github.io/phandango/

The four biggest peaks respectively are pbp2x, orf15 (in the ICE), pbp1a, pbp2b All three penicillin binding proteins!

Bonus question for fans of the pneumococcus: why does the ICE appear to be associated with penicillin resistance?