Composite of Multiple Signals: tests for selection in meiotically recombinant populations¶
About CMS 2.0¶
Composite of Multiple Signals (CMS) refers to a family of tests applied to population genetic datasets in order to (i) identify genomic regions that may have been subject to strong recent positive selection (a ‘sweep’) and (ii) narrow signals of selection within such regions, yielding tractable lists of candidate variants for experimental scrutiny. In both cases, CMS requires (a) phased variation data for several populations, along with (b) the identity of the ancestral allele for a majority of the sites listed. It was developed with humans in mind (e.g., the 1000 Genomes Project) but could in principle be applied to any diploid species with data in VCF or TPED format.
In its current instantiation (‘CMS 2.0’), it includes scripts to (i) calculate a variety of selection metrics for each population, (ii) model the demographic history of the dataset using an exploratory approach, (iii) generate probability distributions for each selection metric from data simulated from demographic models, (iv) generate composite scores and (v) visualize signals of selection in the UCSC Genome Browser.
Background¶
The method used in CMS is described in greater detail in the following papers:
A Composite of Multiple Signals distinguishes causal variants in regions of positive selection Sharon R. Grossman, Ilya Shlyakhter, Elinor K. Karlsson, Elizabeth H. Byrne, Shannon Morales, Gabriel Frieden, Elizabeth Hostetter, Elaine Angelino, Manuel Garber, Or Zuk, Eric S. Lander, Stephen F. Schaffner, and Pardis C. Sabeti Science 12 February 2010: 327 (5967), 883-886. Published online 7 January 2010 [DOI:10.1126/science.1183863]
Identifying recent adaptations in large-scale genomic data Grossman SR, Andersen KG, Shlyakhter I, Tabrizi S, Winnicki S, Yen A, Park DJ, Griesemer D, Karlsson EK, Wong SH, Cabili M, Adegbola RA, Bamezai RN, Hill AV, Vannberg FO, Rinn JL; 1000 Genomes Project, Lander ES, Schaffner SF, Sabeti PC. Cell 14 February 2013: 152 (4), 703-713. [DOI:10.1016/j.cell.2013.01.035]
Coalescent simulations¶
CMS uses simulated population genetic data for a variety of purposes. For the purpose of flexibility, this pipeline is optimized for use with cosi 2, but it would theoretically be straightforward to substitute e.g. Hudson’s ms.
Cosi2: an efficient simulator of exact and approximate coalescent with selection. Shlyakhter I, Sabeti PC, Schaffner SF. Bioinformatics 1 December 2014: 30 (23), 3427-9.Published online 22 August 2014 [DOI:10.1093/bioinformatics/btu562]
Installation¶
System dependencies¶
To be described in greater detail...
Manual Installation¶
Step 1: Install Conda
To use conda, you need to install the Conda package manager, which is most easily obtained via the Miniconda Python distribution. Miniconda can be installed to your home directory without admin privileges. On Broad Institute systems, you can make use of the “.anaconda3-4.0.0” dotkit.
Step 2: Configure Conda
Software used by the cms project is distributed through the bioconda channel for the conda package manager. It is necessary to add this channel to the conda config:
conda config --add channels bioconda
Step 3: Make a conda environment and install cms
It is recommended to install cms into its own conda environment. This ensures its dependencies do not interfere with other conda packages installed on your system. A new conda environment can be created with the following command, which will also install the relevant cms dependencies. It is recommended to use the Python 3 version of the environment file:
conda env create -f=conda-environment_py3.yml -n cms-env
Step 4: Activate the cms environment
In order to use cms, you will need to activate its conda environment:
source activate cms-env
Command line tools¶
scans.py¶
This script contains command-line utilities for calculating EHH-based scans for positive selection in genomes, including EHH, iHS, and XP-EHH.
usage: scans.py subcommand
- Sub-commands:
- selscan_file_conversion
Process a bgzipped-VCF (such as those included in the Phase 3 1000 Genomes release) into a gzip-compressed tped file of the sort expected by selscan.
usage: scans.py selscan_file_conversion [-h] [--startBp STARTBP] [--endBp ENDBP] [--ploidy PLOIDY] [--considerMultiAllelic] [--rescaleGeneticDistance] [--includeLowQualAncestral] [--codingFunctionClassFile CODINGFUNCTIONCLASSFILE] [--sampleMembershipFile SAMPLEMEMBERSHIPFILE] [--filterPops FILTERPOPS [FILTERPOPS ...]] [--filterSuperPops FILTERSUPERPOPS [FILTERSUPERPOPS ...]] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inputVCF genMap outPrefix outLocation chromosomeNum
- Positional arguments:
  - inputVCF: Input VCF file
  - genMap: Genetic recombination map TSV file with four columns: Chromosome, Position (bp), Rate (cM/Mb), Map (cM)
  - outPrefix: Output file prefix
  - outLocation: Output location
  - chromosomeNum: Chromosome number
- Options:
  - --startBp: Coordinate in bp of start position (default: 0)
  - --endBp: Coordinate in bp of end position
  - --ploidy: Number of chromosomes expected for each genotype (default: 2)
  - --considerMultiAllelic: Include multi-allelic variants in the output as separate records (default: False)
  - --rescaleGeneticDistance: Rescale genetic distance to be out of 100.0 cM (default: False)
  - --includeLowQualAncestral: Include variants where the ancestral information is low-quality, as indicated by lower-case x for AA=x in the VCF info column (default: False)
  - --codingFunctionClassFile: A Python class file containing a function used to code each genotype as ‘1’ or ‘0’: coding_function(current_value, reference_allele, alternate_allele, ancestral_allele)
  - --sampleMembershipFile: The call sample file containing four columns: sample, pop, super_pop, gender
  - --filterPops: Populations to include in the calculation (e.g. “FIN”)
  - --filterSuperPops: Super populations to include in the calculation (e.g. “EUR”)
  - --loglevel: Verbosity of output (default: DEBUG). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  - --version, -V: Show program’s version number and exit
  - --tmpDir: Base directory for temp files (default: /tmp)
  - --tmpDirKeep: Keep the tmpDir if an exception occurs while running. The default is to delete all temp files at the end, even if there is a failure (default: False)
- selscan_ehh
Perform selscan’s calculation of EHH.
usage: scans.py selscan_ehh [-h] [--gapScale GAPSCALE] [--maf MAF] [--threads THREADS] [--window WINDOW] [--cutoff CUTOFF] [--maxExtend MAXEXTEND] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inputTped outFile locusID
- Positional arguments:
  - inputTped: Input tped file
  - outFile: Output filepath
  - locusID: The locus ID
- Options:
  - --gapScale: Gap scale parameter in bp. If a gap is encountered between two SNPs > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: 20000)
  - --maf: Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core SNP (default: 0.05)
  - --threads: The number of threads to spawn during the calculation. Partitions loci across threads (default: 1)
  - --window: When calculating EHH, the length of the window in bp in each direction from the query locus (default: 100000)
  - --cutoff: The EHH decay cutoff (default: 0.05)
  - --maxExtend: The maximum distance an EHH decay curve is allowed to extend from the core. Set <= 0 for no restriction (default: 1000000)
  - --loglevel: Verbosity of output (default: DEBUG). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  - --version, -V: Show program’s version number and exit
  - --tmpDir: Base directory for temp files (default: /tmp)
  - --tmpDirKeep: Keep the tmpDir if an exception occurs while running. The default is to delete all temp files at the end, even if there is a failure (default: False)
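To make the quantity behind selscan_ehh concrete, here is a minimal sketch of extended haplotype homozygosity (EHH) on a toy set of phased haplotypes. It is an illustration of the standard definition (the probability that two randomly chosen chromosomes carrying the core allele are identical across an interval), not selscan's implementation; the function name `ehh` and the string encoding of haplotypes are assumptions made for this example.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, left, right):
    """EHH over sites [left, right] among haplotypes carrying core_allele.

    haplotypes: list of equal-length strings of '0'/'1' (toy encoding, assumed).
    Returns the fraction of carrier pairs that are identical over the interval.
    """
    carriers = [h[left:right + 1] for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0  # no pairs to compare
    counts = Counter(carriers)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Toy data: six phased chromosomes, five sites, core site at index 2.
haps = ["01100", "01100", "01101", "11100", "00010", "00011"]
for left, right in [(2, 2), (1, 3), (0, 3), (0, 4)]:
    print(left, right, ehh(haps, 2, "1", left, right))
```

EHH starts at 1 at the core and decays as the interval widens; the --cutoff option above corresponds to the value at which such a decay curve is considered finished.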
- selscan_ihs
Perform selscan’s calculation of iHS.
usage: scans.py selscan_ihs [-h] [--gapScale GAPSCALE] [--maf MAF] [--threads THREADS] [--skipLowFreq] [--dontWriteLeftRightiHH] [--truncOk] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inputTped outFile
- Positional arguments:
  - inputTped: Input tped file
  - outFile: Output filepath
- Options:
  - --gapScale: Gap scale parameter in bp. If a gap is encountered between two SNPs > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: 20000)
  - --maf: Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core SNP (default: 0.05)
  - --threads: The number of threads to spawn during the calculation. Partitions loci across threads (default: 1)
  - --skipLowFreq: Do not include low frequency variants in the construction of haplotypes (default: False)
  - --dontWriteLeftRightiHH: When writing out iHS, do not write out the constituent left and right ancestral and derived iHH scores for each locus (default: False)
  - --truncOk: If an EHH decay reaches the end of a sequence before reaching the cutoff, integrate the curve anyway. Normal function is to disregard the score for that core (default: False)
  - --loglevel: Verbosity of output (default: DEBUG). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  - --version, -V: Show program’s version number and exit
  - --tmpDir: Base directory for temp files (default: /tmp)
  - --tmpDirKeep: Keep the tmpDir if an exception occurs while running. The default is to delete all temp files at the end, even if there is a failure (default: False)
- selscan_nsl
Perform selscan’s calculation of nSL.
usage: scans.py selscan_nsl [-h] [--gapScale GAPSCALE] [--maf MAF] [--threads THREADS] [--truncOk] [--maxExtendNsl MAXEXTENDNSL] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inputTped outFile
- Positional arguments:
  - inputTped: Input tped file
  - outFile: Output filepath
- Options:
  - --gapScale: Gap scale parameter in bp. If a gap is encountered between two SNPs > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: 20000)
  - --maf: Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core SNP (default: 0.05)
  - --threads: The number of threads to spawn during the calculation. Partitions loci across threads (default: 1)
  - --truncOk: If an EHH decay reaches the end of a sequence before reaching the cutoff, integrate the curve anyway. Normal function is to disregard the score for that core (default: False)
  - --maxExtendNsl: The maximum distance an nSL haplotype is allowed to extend from the core. Set <= 0 for no restriction (default: 100)
  - --loglevel: Verbosity of output (default: DEBUG). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  - --version, -V: Show program’s version number and exit
  - --tmpDir: Base directory for temp files (default: /tmp)
  - --tmpDirKeep: Keep the tmpDir if an exception occurs while running. The default is to delete all temp files at the end, even if there is a failure (default: False)
- selscan_xpehh
Perform selscan’s calculation of XPEHH.
usage: scans.py selscan_xpehh [-h] [--gapScale GAPSCALE] [--maf MAF] [--threads THREADS] [--truncOk] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inputTped outFile inputRefTped
- Positional arguments:
  - inputTped: Input tped file
  - outFile: Output filepath
  - inputRefTped: Input tped for the reference population to which the first is compared
- Options:
  - --gapScale: Gap scale parameter in bp. If a gap is encountered between two SNPs > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: 20000)
  - --maf: Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core SNP (default: 0.05)
  - --threads: The number of threads to spawn during the calculation. Partitions loci across threads (default: 1)
  - --truncOk: If an EHH decay reaches the end of a sequence before reaching the cutoff, integrate the curve anyway. Normal function is to disregard the score for that core (default: False)
  - --loglevel: Verbosity of output (default: DEBUG). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  - --version, -V: Show program’s version number and exit
  - --tmpDir: Base directory for temp files (default: /tmp)
  - --tmpDirKeep: Keep the tmpDir if an exception occurs while running. The default is to delete all temp files at the end, even if there is a failure (default: False)
- selscan_norm_nsl
Normalize Selscan’s nSL output
usage: scans.py selscan_norm_nsl [-h] [--bins BINS] [--critPercent CRITPERCENT] [--critValue CRITVALUE] [--minSNPs MINSNPS] [--qbins QBINS] [--winSize WINSIZE] [--bpWin] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inputFiles [inputFiles ...]
- Positional arguments:
  - inputFiles: A list of files delimited by whitespace for joint normalization.
    - Expected format for iHS/nSL files (no header): <locus name> <physical pos> <freq> <ihh1/sL1> <ihh2/sL0> <ihs/nsl>
    - Expected format for XP-EHH files (one line header): <locus name> <physical pos> <genetic pos> <freq1> <ihh1> <freq2> <ihh2> <xpehh>
- Options:
  - --bins: The number of frequency bins in [0,1] for score normalization (default: 100)
  - --critPercent: Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP. Not used by default (default: -1.0)
  - --critValue: Set the critical value such that a SNP with |iHS| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al. (default: 2.0)
  - --minSNPs: Only consider a bp window if it has at least this many SNPs (default: 10)
  - --qbins: Outlying windows are binned by number of sites within each window. This is the number of quantile bins to use (default: 20)
  - --winSize: The non-overlapping window size for calculating the percentage of extreme SNPs (default: 100000)
  - --bpWin: If set, will use windows of a constant bp size with varying number of SNPs (default: False)
  - --loglevel: Verbosity of output (default: DEBUG). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  - --version, -V: Show program’s version number and exit
  - --tmpDir: Base directory for temp files (default: /tmp)
  - --tmpDirKeep: Keep the tmpDir if an exception occurs while running. The default is to delete all temp files at the end, even if there is a failure (default: False)
- selscan_norm_ihs
Normalize Selscan’s iHS output
usage: scans.py selscan_norm_ihs [-h] [--bins BINS] [--critPercent CRITPERCENT] [--critValue CRITVALUE] [--minSNPs MINSNPS] [--qbins QBINS] [--winSize WINSIZE] [--bpWin] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inputFiles [inputFiles ...]
- Positional arguments:
  - inputFiles: A list of files delimited by whitespace for joint normalization.
    - Expected format for iHS/nSL files (no header): <locus name> <physical pos> <freq> <ihh1/sL1> <ihh2/sL0> <ihs/nsl>
    - Expected format for XP-EHH files (one line header): <locus name> <physical pos> <genetic pos> <freq1> <ihh1> <freq2> <ihh2> <xpehh>
- Options:
  - --bins: The number of frequency bins in [0,1] for score normalization (default: 100)
  - --critPercent: Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP. Not used by default (default: -1.0)
  - --critValue: Set the critical value such that a SNP with |iHS| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al. (default: 2.0)
  - --minSNPs: Only consider a bp window if it has at least this many SNPs (default: 10)
  - --qbins: Outlying windows are binned by number of sites within each window. This is the number of quantile bins to use (default: 20)
  - --winSize: The non-overlapping window size for calculating the percentage of extreme SNPs (default: 100000)
  - --bpWin: If set, will use windows of a constant bp size with varying number of SNPs (default: False)
  - --loglevel: Verbosity of output (default: DEBUG). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  - --version, -V: Show program’s version number and exit
  - --tmpDir: Base directory for temp files (default: /tmp)
  - --tmpDirKeep: Keep the tmpDir if an exception occurs while running. The default is to delete all temp files at the end, even if there is a failure (default: False)
- selscan_norm_xpehh
Normalize Selscan’s XPEHH output
usage: scans.py selscan_norm_xpehh [-h] [--bins BINS] [--critPercent CRITPERCENT] [--critValue CRITVALUE] [--minSNPs MINSNPS] [--qbins QBINS] [--winSize WINSIZE] [--bpWin] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inputFiles [inputFiles ...]
- Positional arguments:
  - inputFiles: A list of files delimited by whitespace for joint normalization.
    - Expected format for iHS/nSL files (no header): <locus name> <physical pos> <freq> <ihh1/sL1> <ihh2/sL0> <ihs/nsl>
    - Expected format for XP-EHH files (one line header): <locus name> <physical pos> <genetic pos> <freq1> <ihh1> <freq2> <ihh2> <xpehh>
- Options:
  - --bins: The number of frequency bins in [0,1] for score normalization (default: 100)
  - --critPercent: Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP. Not used by default (default: -1.0)
  - --critValue: Set the critical value such that a SNP with |iHS| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al. (default: 2.0)
  - --minSNPs: Only consider a bp window if it has at least this many SNPs (default: 10)
  - --qbins: Outlying windows are binned by number of sites within each window. This is the number of quantile bins to use (default: 20)
  - --winSize: The non-overlapping window size for calculating the percentage of extreme SNPs (default: 100000)
  - --bpWin: If set, will use windows of a constant bp size with varying number of SNPs (default: False)
  - --loglevel: Verbosity of output (default: DEBUG). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  - --version, -V: Show program’s version number and exit
  - --tmpDir: Base directory for temp files (default: /tmp)
  - --tmpDirKeep: Keep the tmpDir if an exception occurs while running. The default is to delete all temp files at the end, even if there is a failure (default: False)
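The normalization commands above standardize scores within allele-frequency bins (the --bins option) and flag SNPs whose normalized score exceeds a critical value (the --critValue option). The following is a rough sketch of that idea only. The function names, the (freq, score) input layout and the use of the population standard deviation are assumptions made for illustration, not a description of what selscan's normalization does internally.

```python
from statistics import mean, pstdev


def normalize_by_frequency_bin(records, n_bins=100):
    """Standardize scores within frequency bins.

    records: list of (freq, score) pairs with freq in [0, 1].
    Returns z-scores in input order; bins with zero spread map to 0.0.
    """
    def bin_of(freq):
        return min(int(freq * n_bins), n_bins - 1)

    by_bin = {}
    for freq, score in records:
        by_bin.setdefault(bin_of(freq), []).append(score)
    stats = {b: (mean(v), pstdev(v)) for b, v in by_bin.items()}

    normalized = []
    for freq, score in records:
        mu, sd = stats[bin_of(freq)]
        normalized.append((score - mu) / sd if sd > 0 else 0.0)
    return normalized


def fraction_extreme(z_scores, crit_value=2.0):
    """Fraction of scores with absolute value above crit_value."""
    if not z_scores:
        return 0.0
    return sum(abs(z) > crit_value for z in z_scores) / len(z_scores)


# Toy unnormalized scores, two frequency bins for readability.
toy = [(0.1, 1.0), (0.2, 3.0), (0.7, -2.0), (0.9, 2.0)]
z = normalize_by_frequency_bin(toy, n_bins=2)
print(z, fraction_extreme(z))
```

Binning by frequency matters because the expected unnormalized score depends on the frequency of the core allele, so a single genome-wide mean and variance would not be comparable across sites.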
- store_selscan_results_in_db
Aggregate results from selscan into a SQLite database via a helper JSON metadata file.
usage: scans.py store_selscan_results_in_db [-h] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inputFile outFile
- Positional arguments:
  - inputFile: Input *.metadata.json file
  - outFile: Output SQLite filepath
- Options:
  - --loglevel: Verbosity of output (default: INFO). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  - --version, -V: Show program’s version number and exit
  - --tmpDir: Base directory for temp files (default: /tmp)
  - --tmpDirKeep: Keep the tmpDir if an exception occurs while running. The default is to delete all temp files at the end, even if there is a failure (default: False)
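As a rough illustration of how the scans.py sub-commands might be chained from a driver script, the sketch below assembles argument lists using only the flags and positional arguments documented above. All file names are hypothetical, and the name of the tped file written by selscan_file_conversion is an assumption. Positional arguments are placed before the options because --filterPops accepts several values and would otherwise absorb the positionals that follow it.

```python
import shlex


def scans_command(subcommand, positionals, **options):
    """Build an argv list for scans.py.

    Options set to True become bare flags; False or None options are omitted.
    """
    argv = ["scans.py", subcommand]
    argv.extend(str(p) for p in positionals)
    for name, value in options.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


# Hypothetical inputs: a chromosome 2 VCF, a genetic map and a sample table.
convert = scans_command(
    "selscan_file_conversion",
    ["chr2.vcf.gz", "genmap_chr2.tsv", "chr2_FIN", "out", 2],
    filterPops="FIN",
    sampleMembershipFile="samples.tsv",
)
# The tped path below is assumed, not taken from the documentation.
ihs = scans_command(
    "selscan_ihs",
    ["out/chr2_FIN.tped.gz", "out/chr2_FIN.ihs"],
    maf=0.05, threads=4, truncOk=True, skipLowFreq=False,
)
norm = scans_command("selscan_norm_ihs", ["out/chr2_FIN.ihs"], bins=100)

for cmd in (convert, ihs, norm):
    print(shlex.join(cmd))  # pass cmd to subprocess.run(...) to execute
```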
cms_modeller.py¶
This script contains command-line utilities for exploratory fitting of demographic models to population genetic data.
usage: cms_modeller.py [-h] {target_stats,bootstrap,point,grid,optimize} ...
- Sub-commands:
- target_stats
perform per-site(/per-site-pair) calculations of population summary statistics for model target values
usage: cms_modeller.py target_stats [-h] [--freqs] [--ld] [--fst] inputTpeds recomFile regions out
- Positional arguments:
  - inputTpeds: comma-delimited list of input tped files (only one file per pop being modelled; must run chroms separately or concatenate)
  - recomFile: recombination map
  - regions: tab-separated file with putative neutral regions
  - out: outfile prefix
- Options:
  - --freqs: calculate summary statistics from within-population allele frequencies (default: False)
  - --ld: calculate summary statistics from within-population linkage disequilibrium (default: False)
  - --fst: calculate summary statistics from population comparison using allele frequencies (default: False)
- bootstrap
perform bootstrap estimates of population summary statistics in order to finalize model target values
usage: cms_modeller.py bootstrap [-h] [--in_freqs IN_FREQS] [--in_ld IN_LD] [--in_fst IN_FST] nBootstrapReps out
- Positional arguments:
  - nBootstrapReps: number of bootstraps to perform in order to estimate standard error of the dataset (should converge for reasonably small n)
  - out: outfile prefix
- Options:
  - --in_freqs: comma-delimited list of infiles with per-site calculations for population. One file per population; for bootstrap estimates of genome-wide values, first concatenate per-chrom files
  - --in_ld: comma-delimited list of infiles with per-site-pair calculations for population. One file per population; for bootstrap estimates of genome-wide values, first concatenate per-chrom files
  - --in_fst: comma-delimited list of infiles with per-site calculations for population pair. One file per population pair; for bootstrap estimates of genome-wide values, first concatenate per-chrom files
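The bootstrap step estimates the standard error of each summary statistic by resampling. A minimal sketch of that general technique follows, assuming the per-site values have already been loaded into a list; the function name and the choice of the mean as the statistic are illustrative and not taken from cms_modeller.py.

```python
import random
from statistics import mean, stdev


def bootstrap_standard_error(values, n_reps, statistic=mean, seed=None):
    """Standard error of `statistic`, estimated from n_reps resamples.

    Each replicate draws len(values) items with replacement.
    n_reps must be at least 2 for a standard deviation to be defined.
    """
    rng = random.Random(seed)
    n = len(values)
    estimates = [statistic(rng.choices(values, k=n)) for _ in range(n_reps)]
    return stdev(estimates)


# Toy per-site values (e.g. a heterozygosity-like statistic); purely illustrative.
per_site = [0.10, 0.32, 0.25, 0.41, 0.08, 0.19, 0.27, 0.36, 0.14, 0.22]
print(bootstrap_standard_error(per_site, n_reps=200, seed=1))
```

The remark in the help text that the estimate "should converge for reasonably small n" can be checked by rerunning with increasing `n_reps` and watching the result stabilize.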
- point
run simulations of a point in parameter space
usage: cms_modeller.py point [-h] [--cosiBuild COSIBUILD] [--dropSings DROPSINGS] [--genmapRandomRegions] [--stopAfterMinutes STOPAFTERMINUTES] [--calcError CALCERROR] [--targetvalsFile TARGETVALSFILE] [--plotStats] inputParamFile nCoalescentReps outputDir
- Positional arguments:
  - inputParamFile: file with model specifications for input
  - nCoalescentReps: num reps
  - outputDir: location to write cosi output
- Options:
  - --cosiBuild: which version of cosi to run? (*automate installation) (default: /Users/vitti/Desktop/COSI_DEBUG_TEST/cosi-2.0/coalescent)
  - --dropSings: randomly thin global singletons from output dataset (i.e., to model ascertainment bias)
  - --genmapRandomRegions: cosi option to sub-sample genetic map randomly from input (default: False)
  - --stopAfterMinutes: cosi option to terminate simulations
  - --calcError: file specifying dimensions of error function to use. If unspecified, defaults to all. First line = stats, second line = pops
  - --targetvalsFile: targetvalsfile for model
  - --plotStats: visualize goodness-of-fit to model targets (default: False)
- grid
run grid search
usage: cms_modeller.py grid [-h] [--cosiBuild COSIBUILD] [--dropSings DROPSINGS] [--genmapRandomRegions] [--stopAfterMinutes STOPAFTERMINUTES] [--calcError CALCERROR] inputParamFile nCoalescentReps outputDir grid_inputdimensionsfile
- Positional arguments:
  - inputParamFile: file with model specifications for input
  - nCoalescentReps: num reps
  - outputDir: location to write cosi output
  - grid_inputdimensionsfile: file with specifications of grid search. Each parameter to vary is indicated: KEY INDEX [VALUES]
- Options:
  - --cosiBuild: which version of cosi to run? (*automate installation) (default: /Users/vitti/Desktop/COSI_DEBUG_TEST/cosi-2.0/coalescent)
  - --dropSings: randomly thin global singletons from output dataset (i.e., to model ascertainment bias)
  - --genmapRandomRegions: cosi option to sub-sample genetic map randomly from input (default: False)
  - --stopAfterMinutes: cosi option to terminate simulations
  - --calcError: file specifying dimensions of error function to use. If unspecified, defaults to all. First line = stats, second line = pops
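Conceptually, a grid search evaluates the model error at every combination of the listed parameter values and keeps the best one. The sketch below shows that enumeration in isolation. The parameter names (`Ne`, `mig`, `split`) and the toy error function are hypothetical; in the real pipeline the error would presumably come from running coalescent simulations and comparing them against the target statistics, which is not reproduced here.

```python
from itertools import product


def grid_points(base_params, dimensions):
    """Yield one parameter dict per combination of the varied values."""
    keys = list(dimensions)
    for combo in product(*(dimensions[k] for k in keys)):
        point = dict(base_params)
        point.update(zip(keys, combo))
        yield point


def grid_search(base_params, dimensions, error_fn):
    """Return the grid point with the smallest error."""
    return min(grid_points(base_params, dimensions), key=error_fn)


# Hypothetical model parameters and a stand-in error function.
base = {"Ne": 10000, "mig": 0.0, "split": 2000}
dims = {"Ne": [5000, 10000, 20000], "mig": [0.0, 1e-4]}


def toy_error(params):
    return abs(params["Ne"] - 12000) / 12000 + abs(params["mig"] - 1e-4) * 1e3


best = grid_search(base, dims, toy_error)
print(best)
```

The number of evaluations is the product of the list lengths, so each added dimension multiplies the simulation cost.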
- optimize
run optimization algorithm to fit model parameters
usage: cms_modeller.py optimize [-h] [--cosiBuild COSIBUILD] [--dropSings DROPSINGS] [--genmapRandomRegions] [--stopAfterMinutes STOPAFTERMINUTES] [--calcError CALCERROR] [--stepSize STEPSIZE] [--method METHOD] inputParamFile nCoalescentReps outputDir optimize_inputdimensionsfile
- Positional arguments:
  - inputParamFile: file with model specifications for input
  - nCoalescentReps: num reps
  - outputDir: location to write cosi output
  - optimize_inputdimensionsfile: file with specifications of optimization. Each parameter to vary is indicated: KEY INDEX
- Options:
  - --cosiBuild: which version of cosi to run? (*automate installation) (default: /Users/vitti/Desktop/COSI_DEBUG_TEST/cosi-2.0/coalescent)
  - --dropSings: randomly thin global singletons from output dataset (i.e., to model ascertainment bias)
  - --genmapRandomRegions: cosi option to sub-sample genetic map randomly from input (default: False)
  - --stopAfterMinutes: cosi option to terminate simulations
  - --calcError: file specifying dimensions of error function to use. If unspecified, defaults to all. First line = stats, second line = pops
  - --stepSize: scaled step size (i.e. whole range = 1)
  - --method: algorithm to pass to scipy.optimize (default: SLSQP)
likes_from_model.py¶
This script contains command-line utilities for generating probability distributions for component scores from pre-specified demographic model(s).
usage: likes_from_model.py [-h] {run_neut_sims,get_sel_trajs,run_sel_sims,scores_from_sims,likes_from_scores} ...
- Sub-commands:
- run_neut_sims
run neutral simulations
usage: likes_from_model.py run_neut_sims [-h] [--cosiBuild COSIBUILD] [--dropSings DROPSINGS] [--genmapRandomRegions] n inputParamFile outputDir
- Positional arguments:
  - n: num replicates to run
  - inputParamFile: file with model specifications for input
  - outputDir: location to write cosi output
- Options:
  - --cosiBuild: which version of cosi to run? (*automate installation) (default: /Users/vitti/Desktop/COSI_DEBUG_TEST/cosi-2.0/coalescent)
  - --dropSings: randomly thin global singletons from output dataset to model ascertainment bias
  - --genmapRandomRegions: cosi option to sub-sample genetic map randomly from input (default: False)
- get_sel_trajs
run forward simulations of selection trajectories and perform rejection sampling to populate selscenarios by final allele frequency before running coalescent simulations for entire sample
usage: likes_from_model.py get_sel_trajs [-h] [--cosiBuild COSIBUILD] [--dropSings DROPSINGS] [--genmapRandomRegions] [--freqRange FREQRANGE] [--nBins NBINS] nSimsPerBin maxSteps inputParamFile outputDir
- Positional arguments:
  - nSimsPerBin: number of selection trajectories to generate per allele frequency bin
  - maxSteps: number of attempts to generate a selection trajectory before re-sampling selection coefficient and start time
  - inputParamFile: file with model specifications for input
  - outputDir: location to write cosi output
- Options:
  - --cosiBuild: which version of cosi to run? (*automate installation) (default: /Users/vitti/Desktop/COSI_DEBUG_TEST/cosi-2.0/coalescent)
  - --dropSings: randomly thin global singletons from output dataset to model ascertainment bias
  - --genmapRandomRegions: cosi option to sub-sample genetic map randomly from input (default: False)
  - --freqRange: range of final selected allele frequencies to simulate, e.g. .05-.95 (default: .05-.95)
  - --nBins: number of frequency bins (default: 9)
- run_sel_sims
run sel. simulations
usage: likes_from_model.py run_sel_sims [-h] [--cosiBuild COSIBUILD] [--dropSings DROPSINGS] [--genmapRandomRegions] [--freqRange FREQRANGE] [--nBins NBINS] n trajDir inputParamFile outputDir
- Positional arguments:
  - n: num replicates to run per sel scenario
  - trajDir: location of simulated trajectories (i.e. outputDir from get_sel_trajs)
  - inputParamFile: file with model specifications for input
  - outputDir: location to write cosi output
- Options:
  - --cosiBuild: which version of cosi to run? (*automate installation) (default: /Users/vitti/Desktop/COSI_DEBUG_TEST/cosi-2.0/coalescent)
  - --dropSings: randomly thin global singletons from output dataset to model ascertainment bias
  - --genmapRandomRegions: cosi option to sub-sample genetic map randomly from input (default: False)
  - --freqRange: range of final selected allele frequencies to simulate, e.g. .05-.95 (default: .05-.95)
  - --nBins: number of frequency bins (default: 9)
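The --freqRange and --nBins options partition the final selected-allele frequency into bins, and get_sel_trajs keeps only trajectories whose final frequency lands in the bin being populated, giving up after maxSteps attempts. The sketch below illustrates that rejection-sampling loop under two stated assumptions: that the bins are equal-width (the documentation does not say how they are spaced), and that a caller-supplied function `simulate_final_freq` stands in for the forward simulation.

```python
def frequency_bins(freq_range=".05-.95", n_bins=9):
    """Split a 'lo-hi' frequency range into n_bins equal-width (lo, hi) bins."""
    lo_text, hi_text = freq_range.split("-")
    lo, hi = float(lo_text), float(hi_text)
    width = (hi - lo) / n_bins
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n_bins)]


def sample_until_in_bin(simulate_final_freq, target_bin, max_steps):
    """Rejection sampling: retry until the final frequency falls in target_bin.

    Returns the accepted frequency, or None after max_steps failed attempts
    (at which point a caller could re-sample selection coefficient and start time).
    """
    lo, hi = target_bin
    for _ in range(max_steps):
        freq = simulate_final_freq()
        if lo <= freq < hi:
            return freq
    return None


bins = frequency_bins()
# Stand-in for a forward simulation: replay a fixed list of final frequencies.
outcomes = iter([0.90, 0.50, 0.12])
accepted = sample_until_in_bin(lambda: next(outcomes), bins[0], max_steps=5)
print(len(bins), accepted)
```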
- scores_from_sims
get scores from simulations
usage: likes_from_model.py scores_from_sims [-h] [--inputTped INPUTTPED] [--inputIhs INPUTIHS] [--inputdelIhh INPUTDELIHH] [--inputXpehh INPUTXPEHH] [--ihs] [--delIhh] [--xpehh XPEHH] [--fst_deldaf FST_DELDAF] [--normalizeIhs NORMALIZEIHS] [--normalizeDelIhh NORMALIZEDELIHH] [--normalizeXpehh NORMALIZEXPEHH] outputFilename
- Positional arguments:
  - outputFilename: where to write scorefile
- Options:
  - --inputTped: tped from which to calculate score
  - --inputIhs: iHS from which to calculate delihh
  - --inputdelIhh: delIhh from which to calculate norm
  - --inputXpehh: XP-EHH from which to calculate norm
  - --ihs: Undocumented (default: False)
  - --delIhh: Undocumented (default: False)
  - --xpehh: inputTped for altpop
  - --fst_deldaf: inputTped for altpop
  - --normalizeIhs: filename for parameters to normalize to; if not given, the file is normalized to its own global distribution
  - --normalizeDelIhh: filename for parameters to normalize to; if not given, the file is normalized to its own global distribution
  - --normalizeXpehh: filename for parameters to normalize to; if not given, the file is normalized to its own global distribution
- likes_from_scores
get component score probability distributions from scores
usage: likes_from_model.py likes_from_scores [-h] [--thinToSize] [--ihs] [--delihh] [--xp] [--deldaf] [--fst] [--freqRange FREQRANGE] [--nBins NBINS] neutFile selFile selPos nLikesBins outPrefix
- Positional arguments:
  - neutFile: file with scores for neutral scenarios (normalized if necessary)
  - selFile: file with scores for selected scenarios (normalized if necessary)
  - selPos: position of causal variant
  - nLikesBins: number of bins to use for histogram to approximate probability density function
  - outPrefix: save file as
- Options:
  - --thinToSize: subsample from simulated SNPs, since nSel << nLinked < nNeut (default: False)
  - --ihs: Undocumented (default: False)
  - --delihh: Undocumented (default: False)
  - --xp: Undocumented (default: False)
  - --deldaf: Undocumented (default: False)
  - --fst: Undocumented (default: False)
  - --freqRange: range of final selected allele frequencies to simulate, e.g. .05-.95 (default: .05-.95)
  - --nBins: number of frequency bins (default: 9)
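The nLikesBins argument controls a histogram that approximates the probability density of each component score. A minimal sketch of a histogram density estimate is given below. The function name, the explicit `lo`/`hi` range and the toy score lists are assumptions for illustration; the actual file formats and binning used by likes_from_scores are not shown.

```python
def histogram_density(scores, n_bins, lo, hi):
    """Approximate a probability density on [lo, hi] with n_bins equal bins.

    Scores outside [lo, hi] are ignored. Densities integrate to 1 over the
    range whenever at least one score falls inside it.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        if lo <= s <= hi:
            counts[min(int((s - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    return [c / (total * width) if total else 0.0 for c in counts]


# Toy scores for causal and neutral simulated SNPs (illustrative values only).
causal_scores = [1.8, 2.4, 2.9, 3.1, 2.2, 1.5]
neutral_scores = [-0.4, 0.1, 0.3, -1.2, 0.8, 1.1, -0.2, 0.0]

p_causal = histogram_density(causal_scores, n_bins=4, lo=-2.0, hi=4.0)
p_neutral = histogram_density(neutral_scores, n_bins=4, lo=-2.0, hi=4.0)
print(p_causal)
print(p_neutral)
```

Using the same bin edges for the selected and neutral distributions lets the two densities be compared bin by bin when scores are later combined.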
composite.py¶
This script contains command-line utilities for combining component statistics – i.e., the final step of the CMS 2.0 pipeline.
usage: composite.py [-h] {poppair,outgroups,bayesian_gw,bayesian_region,ml_region} ...
- Sub-commands:
- poppair
collate all component statistics for a given population pair (as a prerequisite to more sophisticated group comparisons)
usage: composite.py poppair [-h] [--xp_reverse_pops] [--deldaf_reverse_pops] in_ihs_file in_delihh_file in_xp_file in_fst_deldaf_file outfile
- Positional arguments:
  - in_ihs_file: file with normalized iHS values for putative selpop
  - in_delihh_file: file with normalized delIhh values for putative selpop
  - in_xp_file: file with normalized XP-EHH values
  - in_fst_deldaf_file: file with Fst, delDaf values for poppair
  - outfile: file to write with collated scores
- Options:
  - --xp_reverse_pops: include if the putative selpop for outcome is the altpop in XPEHH, and vice versa (default: False)
  - --deldaf_reverse_pops: include if the putative selpop for outcome is the altpop in delDAF, and vice versa (default: False)
- outgroups
combine scores from comparisons of a putative selected pop to 2+ outgroups.
usage: composite.py outgroups [-h] infiles likesfile outfile
- Positional arguments:
  - infiles: comma-delimited set of pop-pair comparisons
  - likesfile: text file where probability distributions are specified for component scores
  - outfile: file to write with finalized scores
- bayesian_gw
default algorithm and weighting, genome-wide
usage: composite.py bayesian_gw [-h] inputparamfile
- Positional arguments:
  - inputparamfile: file with specifications for input
- bayesian_region
default algorithm and weighting, within-region
usage: composite.py bayesian_region [-h] chrom startBp endBp selPop altPops demModel
- Positional arguments:
  - chrom: chromosome containing region
  - startBp: start location of region in basepairs
  - endBp: end location of region in basepairs
  - selPop: Undocumented
  - altPops: comma-delimited
  - demModel: Undocumented
- ml_region
machine learning algorithm (within-region)
usage: composite.py ml_region [-h] chrom startBp endBp selPop altPops demModel
- Positional arguments:
chrom chromosome containing region
startBp start location of region in basepairs
endBp end location of region in basepairs
selPop Undocumented
altPops comma-delimited
demModel Undocumented
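As an illustration of the poppair usage line above, the following sketch assembles an argument list that could be passed to subprocess.run. The flags and the order of positional arguments are taken from the usage text; the input and output file names and the use of a plain `python` launcher are hypothetical placeholders, not names defined by CMS.

```python
import shlex


def build_poppair_command(in_ihs_file, in_delihh_file, in_xp_file,
                          in_fst_deldaf_file, outfile,
                          xp_reverse_pops=False, deldaf_reverse_pops=False):
    """Assemble an argument list matching the `composite.py poppair` usage line."""
    # Assumption: composite.py is in the working directory and run via `python`.
    cmd = ["python", "composite.py", "poppair"]
    # Optional flags are only added when the putative selpop is the altpop
    # in the corresponding two-population statistic.
    if xp_reverse_pops:
        cmd.append("--xp_reverse_pops")
    if deldaf_reverse_pops:
        cmd.append("--deldaf_reverse_pops")
    # Positional arguments, in the order given by the usage line.
    cmd += [in_ihs_file, in_delihh_file, in_xp_file, in_fst_deldaf_file, outfile]
    return cmd


# Hypothetical file names, for illustration only.
example_cmd = build_poppair_command(
    "popA.ihs.norm", "popA.delihh.norm", "popA_popB.xpehh.norm",
    "popA_popB.fst_deldaf", "popA_popB.pair",
    xp_reverse_pops=True,
)
print(shlex.join(example_cmd))
```

The returned list can be handed to subprocess.run(example_cmd, check=True) once the input files exist.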
Sample workflow¶
CMS provides a computational framework to explore signals of natural selection in diploid organisms. This section describes how to do so at an abstract level.
Preliminary considerations¶
CMS is a computational and statistical framework for exploring the evolution of populations within a species at a genomic level. To that end, the user must first provide a dataset containing genotype calls for individuals in at least one putative selected population and at least one putative ‘outgroup.’ CMS 2.0 is designed to be flexible with respect to the number and configuration of input populations – that is, given input of however many populations, the user can easily calculate CMS scores for any configuration of these populations. Nonetheless, CMS still relies on the user to define these populations appropriately.
- To determine or confirm appropriate population groupings, identify outliers, etc., we recommend that users first characterize their dataset using methods such as model-based clustering (e.g. STRUCTURE), principal components analysis, or phylogenetic methods (see e.g. SNPRelate).
- Each population should be randomly thinned to the same number of individuals, none of whom should be related within the past few generations.
- Larger samples are generally preferable (e.g. 50-100+ diploid individuals per population). However, depending on such factors as the landscape of recombination in the species, the quality/density of genotype data, and the extent of neutral genetic divergence between represented populations, it may be possible to leverage smaller datasets. As CMS necessitates the generation of a demographic model for the given dataset, the user is advised to use their model to generate simulated data with which to perform power estimations.
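The thinning step recommended above can be sketched as follows. This is a minimal illustration, not part of CMS: the function name, the dictionary layout, and the sample identifiers are assumptions, and it presumes that related individuals have already been removed.

```python
import random


def thin_populations(samples_by_pop, seed=0):
    """Randomly subsample every population down to the smallest population's size.

    samples_by_pop maps a population label to a list of individual IDs.
    """
    rng = random.Random(seed)  # fixed seed so the thinning is reproducible
    target_size = min(len(ids) for ids in samples_by_pop.values())
    return {
        pop: sorted(rng.sample(ids, target_size))
        for pop, ids in samples_by_pop.items()
    }


# Hypothetical population labels and individual IDs.
samples = {
    "popA": [f"A{i}" for i in range(60)],
    "popB": [f"B{i}" for i in range(85)],
    "popC": [f"C{i}" for i in range(52)],
}
thinned = thin_populations(samples, seed=42)
print({pop: len(ids) for pop, ids in thinned.items()})
```

Fixing the seed keeps the retained individuals stable between runs, which helps when component scores are recalculated later.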
Data formatting¶
CMS requires the user to provide population genetic (i.e., within-species) diversity data, including genotype phase and allele polarity.
- If your dataset is unphased, you can preprocess it using a program like Beagle or PLINK.
- The identity of the ancestral allele at each site is typically determined by comparison to outgroups at orthologous sites. Inferred ancestral sequence is available for a number of species from sources such as the Ensembl FTP site. You can use VCFtools to populate the “AA=” section of your VCF’s INFO field.
- In most cases, the user will want to provide a genetic recombination map. If this is unavailable, CMS will assume uniform recombination rates when calculating haplotype scores. Human recombination maps are available from the HapMap Project.
- CMS works with TPED datafiles, and includes support to convert from VCF using the command line tool scans.py.
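To make the uniform-rate fallback concrete, the sketch below converts physical positions to genetic positions under a single constant recombination rate. The rate of 1 cM per Mb is an assumption chosen for illustration (a rough genome-wide figure often quoted for humans); the value CMS itself uses is not stated here, and the function name is hypothetical.

```python
def uniform_genetic_position(bp_position, rate_cm_per_mb=1.0):
    """Genetic position in centimorgans under a constant recombination rate."""
    # Assumed rate; replace with a species-appropriate value.
    return bp_position * rate_cm_per_mb / 1_000_000


# Hypothetical SNP positions in basepairs.
positions_bp = [500_000, 1_250_000, 3_000_000]
positions_cm = [uniform_genetic_position(bp) for bp in positions_bp]
print(positions_cm)
```

With a real recombination map, the same conversion would instead interpolate between mapped markers, so haplotype scores in recombination hotspots and coldspots can differ noticeably from the uniform approximation.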
Demographic modeling¶
CMS combines several semi-independent component tests for selection in a Bayesian or Machine Learning framework. In the former case, a demographic model for the species in question is critical in order to furnish posterior distributions of scores for the component tests under the alternative hypotheses of neutrality and selection. Put another way: a demographic model is a (conjectural) descriptive historical account of our dataset, including population sizes and migration rates across time, that can be used to generate simulated data that ‘looks like’ our original dataset. We then simulate many scenarios of selection and calculate the distributions of component scores for adaptive, linked, and neutral variants. These distributions form the basis of our Bayesian classifier. We can also circumvent the need to define posterior score distributions by using simulated data as training data for Machine Learning implementations of CMS.
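The role of the simulated score distributions can be illustrated with a small naive-Bayes-style sketch, loosely in the style of the composite score described in Grossman et al. 2010. It is not the composite.py implementation: the function names, the histogram layout, the toy probabilities, and the prior are all assumptions, and linked and neutral variants are collapsed into a single "unselected" class for brevity.

```python
import bisect


def bin_index(score, bin_edges):
    """Index of the histogram bin containing score, clamped to the outermost bins."""
    i = bisect.bisect_right(bin_edges, score) - 1
    return min(max(i, 0), len(bin_edges) - 2)


def composite_score(scores, distributions, prior_selected=0.5):
    """Multiply per-test posterior probabilities that a variant is selected.

    distributions maps a test name to (bin_edges, p_selected, p_unselected),
    where the two probability lists hold per-bin frequencies from simulations.
    """
    product = 1.0
    for test_name, score in scores.items():
        edges, p_sel, p_unsel = distributions[test_name]
        i = bin_index(score, edges)
        numerator = p_sel[i] * prior_selected
        denominator = numerator + p_unsel[i] * (1.0 - prior_selected)
        if denominator == 0:
            continue  # bin never observed in simulations: treat test as uninformative
        product *= numerator / denominator
    return product


# Toy two-bin distributions; real ones would come from simulated data.
toy_distributions = {
    "ihs": ([-4.0, 0.0, 4.0], [0.2, 0.8], [0.5, 0.5]),
    "xpehh": ([-4.0, 0.0, 4.0], [0.1, 0.9], [0.6, 0.4]),
}
high = composite_score({"ihs": 2.0, "xpehh": 1.0}, toy_distributions)
low = composite_score({"ihs": -2.0, "xpehh": -1.0}, toy_distributions)
print(high, low)
```

Treating the tests as independent is what makes the product form possible; correlated component scores would overstate the evidence, which is one reason the component tests are described as only semi-independent.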
Our modeling framework is designed to accommodate an arbitrary number of populations in a tree of arbitrary complexity; as such, it is designed to be exploratory, allowing users to iteratively perform optimizations while visualizing the effect on model goodness-of-fit. For rigorous demographic inference (i.e., in the case of a model with known topology and tractably few parameters), users may consider programs such as dadi or diCal.
Following Schaffner et al. 2005, our framework calculates a range of population summary statistics as target values, and defines error as the root-mean-square discrepancy between target and simulated values. These summary statistics are calculated by bootstrap estimate from user-specified putative neutral regions. For human populations, the Neutral Regions Explorer is a useful resource.
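The error definition can be written out in a few lines. This is a minimal sketch under simplifying assumptions: the statistic names and values are invented for illustration, and each statistic is weighted equally, whereas a real fitting procedure may scale each discrepancy (for example by the bootstrap variance of the target).

```python
import math


def rms_discrepancy(target, simulated):
    """Root-mean-square difference between target and simulated summary statistics.

    Both arguments map a statistic name to its value and must share the same keys.
    """
    if set(target) != set(simulated):
        raise ValueError("target and simulated must contain the same statistics")
    squared = [(simulated[name] - target[name]) ** 2 for name in target]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical statistic names and values.
target_stats = {"stat_a": 1.0, "stat_b": 2.0}
simulated_stats = {"stat_a": 4.0, "stat_b": 6.0}
error = rms_discrepancy(target_stats, simulated_stats)
print(error)
```

Because statistics on different scales contribute unequally to an unweighted sum, some form of normalization is usually worth considering before comparing candidate models.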
The user must specify tree topology and ranges for parameter values. These can be added and removed as desired through the script params.py. After target values have been estimated and model topology defined, the user can iteratively search through subsets of parameter-space using cms_modeller.py with a masterfile specifying search input.
Calculating selection statistics¶
CMS packages a number of previously described population genetic tests for recent positive selection. Haplotype scores are calculated using selscan.
Combining scores¶
CMS 2.0 allows users to define CMS scores flexibly with respect to (i) the number and identity of putative selected/neutral populations, (ii) the assumed demographic model, (iii) the input component scores, and (iv) the method of score combination. In each case the user should justify their choices and consider how robust a putative signal of selection is to variation or arbitrariness in these factors.
Identifying regions¶
CMS is motivated by the need to resolve signals of selection – that is, to identify genetic variants that confer adaptive phenotypes. Because selective events can alter patterns of population genetic diversity across large genomic regions, we take a two-step approach to this goal: we first identify putative selected regions (using CMS, another framework, prior knowledge, etc.), and then examine each region with CMS to identify a tractable list of candidate variants for further scrutiny.
Localizing signals¶
Once regions are defined, we can reapply our composite framework in order to thin our list of candidate variants for further scrutiny and prioritize those sites that have the strongest evidence of selection (or other compelling evidence, e.g. overlap with known or predicted functional elements).