RADdata

Lindsay Clark edited this page Sep 27, 2019 · 17 revisions

"RADdata" is an S3 class for storing all data and parameters pertaining to a GBS or RAD-seq dataset.

Slots available using the $ operator

Slots created by the RADdata function

  • alleleDepth: Read depth for each allele in each taxon. Stored as an integer matrix with taxa in rows and alleles in columns. Missing data should be recorded as zero (no reads) rather than NA.
  • alleles2loc: An integer vector with one value for each column of alleleDepth. The number indicates the identity of the locus to which the allele belongs. A locus can have any number of alleles assigned to it (including zero).
  • locTable: A data frame, where locus names are row names. There must be at least as many rows as the highest value of alleles2loc; each number in alleles2loc corresponds to a row index in locTable. No columns are required, although if provided a column named "Chr" will be used for indicating chromosome identities and a column named "Pos" will be used for indicating physical position.
  • possiblePloidies: A list, where each item in the list is an integer vector. Each vector indicates an inheritance pattern that SNPs in the dataset might obey. 2 indicates diploid, 4 indicates autotetraploid, c(2, 2) indicates allotetraploid, etc.
  • alleleNucleotides: A character vector with one value for each column of alleleDepth, indicating the DNA sequence for that allele. Typically only the sequence at variable sites is provided, although providing every site in the region spanned by the variable sites will enable better investigation of mutations in CDS later. The attribute "Variable_sites_only" indicates whether only sequence at variable sites is provided.
  • locDepth: A matrix with taxa in rows and loci in columns, with read depth summed across all alleles for each locus. Column names are locus numbers rather than locus names. See GetLocDepth for retrieving the same matrix but with locus names as column names.
  • depthRatio: A numeric matrix with taxa in rows and alleles in columns. Calculated as alleleDepth / locDepth. Used by other polyRAD functions for rough estimation of genotypes and allele frequency.
  • antiAlleleDepth: An integer matrix with taxa in rows and alleles in columns. For each allele, the number of reads from the locus that do NOT belong to that allele. Calculated as locDepth - alleleDepth. Used for likelihood estimations by AddGenotypeLikelihood.
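
The relationships among alleleDepth, alleles2loc, locDepth, depthRatio, and antiAlleleDepth described above can be sketched in base R. This is only an illustration on made-up data, not the package's own code; the taxon and allele names, the read counts, and the helper variable expanded are all assumptions.

```r
# Toy data (hypothetical): 3 taxa, 2 loci, 5 alleles.
# Locus 1 has two alleles and locus 2 has three.
alleleDepth <- matrix(as.integer(c(10, 0, 4,    # loc1_A
                                    5, 0, 4,    # loc1_G
                                    3, 0, 2,    # loc2_AC
                                    3, 0, 0,    # loc2_AT
                                    6, 0, 6)),  # loc2_GC
                      nrow = 3, ncol = 5,
                      dimnames = list(c("taxonA", "blank1", "taxonB"),
                                      c("loc1_A", "loc1_G",
                                        "loc2_AC", "loc2_AT", "loc2_GC")))
alleles2loc <- c(1L, 1L, 2L, 2L, 2L)

# locDepth: read depth summed across all alleles of each locus.
# Column names are locus numbers ("1", "2"), not locus names.
locDepth <- t(rowsum(t(alleleDepth), alleles2loc))

# Repeat each locus column once per allele so the shapes match alleleDepth.
expanded <- locDepth[, as.character(alleles2loc), drop = FALSE]
dimnames(expanded) <- dimnames(alleleDepth)

# depthRatio = alleleDepth / locDepth.
# In this sketch, taxa with zero reads at a locus give NaN (0 / 0).
depthRatio <- alleleDepth / expanded

# antiAlleleDepth = locDepth - alleleDepth: reads at the locus
# that do NOT belong to the allele.
antiAlleleDepth <- expanded - alleleDepth
```

For example, taxonA has 15 reads at locus 1, so allele loc1_A has a depth ratio of 10/15 and an antiAlleleDepth of 5.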

Slots added by other functions

  • alleleFreq: Allele frequencies. This is a vector of values ranging from zero to one, with one value per allele. Added by AlleleFreqHWE and AlleleFreqMapping. This vector additionally has an attribute called "type" that indicates what parameters were used for estimating allele frequency; this can be "individual frequency", "posterior prob", or "depth ratio".
  • depthSamplingPermutations: A numeric matrix with taxa in rows and alleles in columns. It is calculated as log(locDepth choose alleleDepth), i.e. the log of the binomial coefficient. This is used as a coefficient for likelihood estimations done by AddGenotypeLikelihood.
  • genotypeLikelihood: Genotype likelihoods, i.e. the probability of the observed read count distribution for each allele and taxon, given each possible ploidy and genotype. It is formatted as a list of arrays. There is one array in the list for each possible ploidy, ignoring differences between auto and allopolyploidy. For each array, the first dimension represents allele copy number ranging from zero to the ploidy, the second dimension is taxa, and the third dimension is alleles. Added by AddGenotypeLikelihood.
  • priorProb: Prior probabilities of genotypes, i.e. expected genotype frequencies in the population. This is formatted as a list, with one list item per possible ploidy, counting differences between auto and allopolyploid inheritance modes. For AddGenotypePriorProb_Mapping2Parents and AddGenotypePriorProb_HWE: Each list item is a matrix, with allele copy number (from zero to the total ploidy) in rows, and alleles in columns. Each value is the probability of sampling an individual with that allele copy number from the population. For AddGenotypePriorProb_ByTaxa: Each list item is an array, with allele copy number in the first dimension, taxa in the second dimension, and alleles in the third dimension. Each value is the probability of sampling an individual with that allele copy number from the population local to the taxon.
  • priorProbPloidies: A list in the same format as possiblePloidies, and the same length as priorProb. Each item in the list is a vector indicating the inheritance mode for the corresponding matrix in priorProb. Added by AddGenotypePriorProb_Mapping2Parents, AddGenotypePriorProb_HWE, and AddGenotypePriorProb_ByTaxa.
  • ploidyLikelihood: Likelihoods estimated for each inheritance mode. Added by AddPloidyLikelihood. Likely to be removed from the package.
  • ploidyChiSq: Chi-squared values estimated for each inheritance mode and allele, stored in a matrix with inheritance mode in rows (same order as priorProb) and alleles in columns. Low values indicate that genotype likelihoods and prior probabilities are a good match. Added by AddPloidyChiSq.
  • ploidyChiSqP: P-values derived from ploidyChiSq, in a matrix of the same dimensions. Added by AddPloidyChiSq.
  • priorTimesLikelihood: A list of arrays, with one list element for each element of priorProb (each inheritance mode), and array dimensions identical to genotypeLikelihood. Genotype priors multiplied by genotype likelihoods. Added by AddPriorTimesLikelihood. May be eliminated in the future.
  • posteriorProb: Genotype posterior probabilities. A list of arrays, with one list element for each element of priorProb (each inheritance mode), and array dimensions identical to genotypeLikelihood, with allele copy number in the first dimension, taxa in the second dimension, and alleles in the third dimension. Values should range from zero to one. Added by AddGenotypePosteriorProb.
  • alleleFreqByTaxa: Estimated allele frequencies for the local population to which each taxon belongs. Matrix with taxa in rows and alleles in columns, and values ranging from zero to one. Added by AddAlleleFreqByTaxa.
  • PCA: A matrix of principal component analysis scores with taxa in rows and PC axes in columns. Added by AddPCA.
  • alleleLinkages: A list with one item per allele in the dataset. Each item is a list of two vectors, with allele numbers in the first and correlation coefficients in the second, listing alleles that can be used for predicting the genotype at a given allele. Added by AddAlleleLinkages.
  • priorProbLD: A list of arrays in the same dimensions as posteriorProb; there is one array for each possible ploidy, and arrays have allele copy number in the first dimension, taxa in the second dimension, and alleles in the third dimension. These are prior genotype probabilities based on linked loci only. Added by AddGenotypePriorProb_LD.
  • likelyGeno_donor and likelyGeno_recurrent: Matrices formatted like the output of GetLikelyGen; these have alleles in columns, and possible ploidies in rows, ignoring differences between auto and allopolyploid types. Rows are named by total ploidy. Numbers in the matrix indicate the likely allele copy number. These slots are added by AddGenotypePriorProb_Mapping2Parents to indicate the likely genotypes of the two parents, after correction based on progeny allele frequencies.
  • donorPloidies and recurrentPloidies: Lists in the same format as possiblePloidies indicating the parent ploidy corresponding to each ploidy listed in priorProbPloidies. Added by AddGenotypePriorProb_Mapping2Parents.
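
To make the array layouts above concrete, here is a minimal base R sketch of how depthSamplingPermutations, genotypeLikelihood, priorProb, priorTimesLikelihood, and posteriorProb relate for a single diploid inheritance mode. Everything in it is an assumption for illustration: the toy read counts, the allele frequencies, the placeholder readProb values, and the plain binomial likelihood are not polyRAD's model, and normalizing over allele copy number is the standard Bayesian step rather than a confirmed description of AddGenotypePosteriorProb. In a RADdata object each of these would be one element of a list.

```r
ploidy  <- 2L
copyNum <- 0:ploidy
taxa    <- c("taxonA", "taxonB")
alleles <- c("loc1_A", "loc1_G")

alleleDepth <- matrix(c(10L, 4L, 5L, 4L), nrow = 2,
                      dimnames = list(taxa, alleles))
locDepth    <- matrix(c(15L, 8L, 15L, 8L), nrow = 2,
                      dimnames = list(taxa, alleles))

# log(locDepth choose alleleDepth), with taxa in rows and alleles in columns.
depthSamplingPermutations <- matrix(lchoose(locDepth, alleleDepth),
                                    nrow = nrow(alleleDepth),
                                    dimnames = dimnames(alleleDepth))

# Placeholder likelihood: probability of sampling a read of the allele
# given 0, 1, or 2 copies (hypothetical values, NOT polyRAD's model).
readProb <- c(0.01, 0.5, 0.99)
genotypeLikelihood <- array(NA_real_,
                            dim = c(ploidy + 1L, length(taxa), length(alleles)),
                            dimnames = list(as.character(copyNum), taxa, alleles))
for (tx in taxa) {
  for (al in alleles) {
    genotypeLikelihood[, tx, al] <-
      dbinom(alleleDepth[tx, al], locDepth[tx, al], readProb)
  }
}

# Population-level prior: copy number in rows, alleles in columns.
# Binomial (Hardy-Weinberg) expectations from assumed allele frequencies.
alleleFreq <- c(loc1_A = 0.6, loc1_G = 0.4)
priorProb <- sapply(alleleFreq, function(p) dbinom(copyNum, ploidy, p))
rownames(priorProb) <- as.character(copyNum)

# priorTimesLikelihood: same dimensions as genotypeLikelihood.
priorTimesLikelihood <- genotypeLikelihood
for (tx in taxa) {
  priorTimesLikelihood[, tx, ] <- priorProb * genotypeLikelihood[, tx, ]
}

# posteriorProb: normalize over copy number for each taxon and allele,
# so values range from zero to one.
totals <- apply(priorTimesLikelihood, c(2, 3), sum)
posteriorProb <- sweep(priorTimesLikelihood, c(2, 3), totals, "/")
```

With these toy numbers, posteriorProb[, "taxonA", "loc1_A"] puts most of its weight on one copy of the allele, which matches the intermediate depth ratio of 10/15.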

Attributes available using the attr function

Attributes created by the RADdata function

  • "taxa": A character vector listing all taxa names, in the same order as the rows of alleleDepth. Can be retrieved using GetTaxa.
  • "nTaxa": An integer indicating the number of taxa in the dataset. Retrieved with the nTaxa function.
  • "nLoci": An integer indicating the number of loci in locTable.
  • "contamRate": A number ranging from zero to one (although in practice probably less than 0.01) indicating the expected sample cross-contamination rate. Can later be edited using SetContamRate or retrieved using GetContamRate.
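
As a rough illustration of the difference between slots (reached with $) and attributes (reached with attr), the sketch below builds a stand-in object in base R. It is not the real constructor: the class name toyRADdata, the toy values, and the use of structure() are assumptions, and with polyRAD itself the RADdata function and accessors such as GetTaxa, GetContamRate, and SetContamRate would normally be used instead.

```r
# Hypothetical stand-in for a RADdata object: a list of slots
# plus attributes, with an S3 class label.
alleleDepth <- matrix(c(10L, 0L, 5L, 0L), nrow = 2,
                      dimnames = list(c("taxonA", "blank1"),
                                      c("loc1_A", "loc1_G")))

toy <- structure(
  list(alleleDepth = alleleDepth,
       alleles2loc = c(1L, 1L),
       locTable    = data.frame(Chr = 1L, Pos = 1000L, row.names = "loc1")),
  taxa       = rownames(alleleDepth),  # same order as rows of alleleDepth
  nTaxa      = nrow(alleleDepth),
  nLoci      = 1L,
  contamRate = 0.001,
  class      = "toyRADdata"            # illustrative name, not "RADdata"
)

# Slots are list elements, so they are available with $.
toy$alleleDepth["taxonA", "loc1_A"]

# Attributes are read (and, in base R, edited) with attr.
attr(toy, "taxa")
attr(toy, "nTaxa")
attr(toy, "contamRate") <- 0.005
```

Keeping dataset-wide values such as the contamination rate in attributes leaves the list itself for the per-allele and per-taxon matrices.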

Attributes added by other functions

  • "donorParent": A character string indicating the name of the taxon that is the donor parent, if the dataset represents a mapping population. Added by SetDonorParent, retrieved by GetDonorParent.
  • "recurrentParent": A character string indicating the name of the taxon that is the recurrent parent, if the dataset represents a mapping population. Added by SetRecurrentParent, retrieved by GetRecurrentParent. If no backcrossing took place, it does not matter which parent is listed as "donor" or "recurrent".
  • "blankTaxa": A character vector indicating names of any taxa that were blanks, i.e. barcoded reactions to which no genomic DNA was added during library creation. These can be useful for estimating the contamination rate, and additionally many polyRAD functions will exclude from calculations taxa that are listed as blanks. Added by SetBlankTaxa and retrieved by GetBlankTaxa.
  • "alleleFreqType": A character string indicating how allele frequencies were estimated. "mapping" and "HWE" are current possible values. Added by AlleleFreqHWE and AlleleFreqMapping.
  • "priorType": How the prior genotype probabilities were calculated. AddGenotypePriorProb_Mapping2Parents and AddGenotypePriorProb_HWE record "population" for this attribute to indicate that probabilities were estimated on a population basis. AddGenotypePriorProb_ByTaxa records "taxon" for this attribute to indicate that probabilities were estimated on an individual basis.