3. Command line tools

3.1. scans.py

This script contains command-line utilities for calculating EHH-based scans for positive selection in genomes, including EHH, iHS, and XP-EHH.

usage: scans.py subcommand
Sub-commands:
selscan_file_conversion

Process a bgzipped-VCF (such as those included in the Phase 3 1000 Genomes release) into a gzip-compressed tped file of the sort expected by selscan.

usage: scans.py selscan_file_conversion [-h] [--startBp STARTBP]
                                        [--endBp ENDBP] [--ploidy PLOIDY]
                                        [--considerMultiAllelic]
                                        [--rescaleGeneticDistance]
                                        [--includeLowQualAncestral]
                                        [--codingFunctionClassFile CODINGFUNCTIONCLASSFILE]
                                        [--sampleMembershipFile SAMPLEMEMBERSHIPFILE]
                                        [--filterPops FILTERPOPS [FILTERPOPS ...]]
                                        [--filterSuperPops FILTERSUPERPOPS [FILTERSUPERPOPS ...]]
                                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                        [--version] [--tmpDir TMPDIR]
                                        [--tmpDirKeep]
                                        inputVCF genMap outPrefix outLocation
                                        chromosomeNum
Positional arguments:
inputVCF Input VCF file
genMap Genetic recombination map tsv file with four columns: (Chromosome, Position(bp), Rate(cM/Mb), Map(cM))
outPrefix Output file prefix
outLocation Output location
chromosomeNum Chromosome number.
Options:
--startBp=1 Coordinate in bp of start position. (default: %(default)s).
--endBp Coordinate in bp of end position.
--ploidy=2 Number of chromosomes expected for each genotype. (default: %(default)s).
--considerMultiAllelic=False
 Include multi-allelic variants in the output as separate records
--rescaleGeneticDistance=False
 Genetic distance is rescaled to be out of 100.0 cM
--includeLowQualAncestral=False
 Include variants where the ancestral information is low-quality (as indicated by lower-case x for AA=x in the VCF info column) (default: %(default)s).
--codingFunctionClassFile
 A python class file containing a function used to code each genotype as ‘1’ and ‘0’. coding_function(current_value, reference_allele, alternate_allele, ancestral_allele)
--sampleMembershipFile
 The call sample file containing four columns: sample, pop, super_pop, gender
--filterPops Populations to include in the calculation (ex. “FIN”)
--filterSuperPops
 Super populations to include in the calculation (ex. “EUR”)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: %(default)s]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
selscan_ehh

Perform selscan’s calculation of EHH.

usage: scans.py selscan_ehh [-h] [--gapScale GAPSCALE] [--maf MAF]
                            [--threads THREADS] [--window WINDOW]
                            [--cutoff CUTOFF] [--maxExtend MAXEXTEND]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                            inputTped outFile locusID
Positional arguments:
inputTped Input tped file
outFile Output filepath
locusID The locus ID
Options:
--gapScale=20000
 Gap scale parameter in bp. If a gap is encountered between two snps > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: %(default)s).
--maf=0.05 Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core snp. (default: %(default)s).
--threads=1 The number of threads to spawn during the calculation. Partitions loci across threads. (default: %(default)s).
--window=100000
 When calculating EHH, this is the length of the window in bp in each direction from the query locus (default: %(default)s).
--cutoff=0.05 The EHH decay cutoff (default: %(default)s).
--maxExtend=1000000
 The maximum distance an EHH decay curve is allowed to extend from the core. Set <= 0 for no restriction. (default: %(default)s).
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: %(default)s]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
selscan_ihs

Perform selscan’s calculation of iHS.

usage: scans.py selscan_ihs [-h] [--gapScale GAPSCALE] [--maf MAF]
                            [--threads THREADS] [--skipLowFreq]
                            [--dontWriteLeftRightiHH] [--truncOk]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                            inputTped outFile
Positional arguments:
inputTped Input tped file
outFile Output filepath
Options:
--gapScale=20000
 Gap scale parameter in bp. If a gap is encountered between two snps > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: %(default)s).
--maf=0.05 Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core snp. (default: %(default)s).
--threads=1 The number of threads to spawn during the calculation. Partitions loci across threads. (default: %(default)s).
--skipLowFreq=False
  Do not include low frequency variants in the construction of haplotypes (default: %(default)s).
--dontWriteLeftRightiHH=False
  When writing out iHS, do not write out the constituent left and right ancestral and derived iHH scores for each locus.(default: %(default)s).
--truncOk=False
 If an EHH decay reaches the end of a sequence before reaching the cutoff, integrate the curve anyway. Normal function is to disregard the score for that core. (default: %(default)s).
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: %(default)s]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
selscan_nsl

Perform selscan’s calculation of nSL.

usage: scans.py selscan_nsl [-h] [--gapScale GAPSCALE] [--maf MAF]
                            [--threads THREADS] [--truncOk]
                            [--maxExtendNsl MAXEXTENDNSL]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                            inputTped outFile
Positional arguments:
inputTped Input tped file
outFile Output filepath
Options:
--gapScale=20000
 Gap scale parameter in bp. If a gap is encountered between two snps > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: %(default)s).
--maf=0.05 Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core snp. (default: %(default)s).
--threads=1 The number of threads to spawn during the calculation. Partitions loci across threads. (default: %(default)s).
--truncOk=False
 If an EHH decay reaches the end of a sequence before reaching the cutoff, integrate the curve anyway. Normal function is to disregard the score for that core. (default: %(default)s).
--maxExtendNsl=100
 The maximum distance an nSL haplotype is allowed to extend from the core. Set <= 0 for no restriction. (default: %(default)s).
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: %(default)s]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
selscan_xpehh

Perform selscan’s calculation of XPEHH.

usage: scans.py selscan_xpehh [-h] [--gapScale GAPSCALE] [--maf MAF]
                              [--threads THREADS] [--truncOk]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                              inputTped outFile inputRefTped
Positional arguments:
inputTped Input tped file
outFile Output filepath
inputRefTped Input tped for the reference population to which the first is compared
Options:
--gapScale=20000
 Gap scale parameter in bp. If a gap is encountered between two snps > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: %(default)s).
--maf=0.05 Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core snp. (default: %(default)s).
--threads=1 The number of threads to spawn during the calculation. Partitions loci across threads. (default: %(default)s).
--truncOk=False
 If an EHH decay reaches the end of a sequence before reaching the cutoff, integrate the curve anyway. Normal function is to disregard the score for that core. (default: %(default)s).
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: %(default)s]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
selscan_norm_nsl

Undocumented

Normalize Selscan’s nSL output

usage: scans.py selscan_norm_nsl [-h] [--bins BINS]
                                 [--critPercent CRITPERCENT]
                                 [--critValue CRITVALUE] [--minSNPs MINSNPS]
                                 [--qbins QBINS] [--winSize WINSIZE] [--bpWin]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                                 inputFiles [inputFiles ...]
Positional arguments:
inputFiles A list of files delimited by whitespace for joint normalization. Expected format for iHS/nSL files (no header): <locus name> <physical pos> <freq> <ihh1/sL1> <ihh2/sL0> <ihs/nsl> Expected format for XP-EHH files (one line header): <locus name> <physical pos> <genetic pos> <freq1> <ihh1> <freq2> <ihh2> <xpehh>
Options:
--bins=100 The number of frequency bins in [0,1] for score normalization (default: %(default)s)
--critPercent=-1.0
 Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP. Not used by default (default: %(default)s)
--critValue=2.0
 Set the critical value such that a SNP with |iHS| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al. (default: %(default)s)
--minSNPs=10 Only consider a bp window if it has at least this many SNPs (default: %(default)s)
--qbins=20 Outlying windows are binned by number of sites within each window. This is the number of quantile bins to use. (default: %(default)s)
--winSize=100000
 GThe non-overlapping window size for calculating the percentage of extreme SNPs (default: %(default)s)
--bpWin=False If set, will use windows of a constant bp size with varying number of SNPs
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: %(default)s]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
selscan_norm_ihs

Undocumented

Normalize Selscan’s iHS output

usage: scans.py selscan_norm_ihs [-h] [--bins BINS]
                                 [--critPercent CRITPERCENT]
                                 [--critValue CRITVALUE] [--minSNPs MINSNPS]
                                 [--qbins QBINS] [--winSize WINSIZE] [--bpWin]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                                 inputFiles [inputFiles ...]
Positional arguments:
inputFiles A list of files delimited by whitespace for joint normalization. Expected format for iHS/nSL files (no header): <locus name> <physical pos> <freq> <ihh1/sL1> <ihh2/sL0> <ihs/nsl> Expected format for XP-EHH files (one line header): <locus name> <physical pos> <genetic pos> <freq1> <ihh1> <freq2> <ihh2> <xpehh>
Options:
--bins=100 The number of frequency bins in [0,1] for score normalization (default: %(default)s)
--critPercent=-1.0
 Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP. Not used by default (default: %(default)s)
--critValue=2.0
 Set the critical value such that a SNP with |iHS| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al. (default: %(default)s)
--minSNPs=10 Only consider a bp window if it has at least this many SNPs (default: %(default)s)
--qbins=20 Outlying windows are binned by number of sites within each window. This is the number of quantile bins to use. (default: %(default)s)
--winSize=100000
 GThe non-overlapping window size for calculating the percentage of extreme SNPs (default: %(default)s)
--bpWin=False If set, will use windows of a constant bp size with varying number of SNPs
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: %(default)s]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
selscan_norm_xpehh

Undocumented

Normalize Selscan’s XPEHH output

usage: scans.py selscan_norm_xpehh [-h] [--bins BINS]
                                   [--critPercent CRITPERCENT]
                                   [--critValue CRITVALUE] [--minSNPs MINSNPS]
                                   [--qbins QBINS] [--winSize WINSIZE]
                                   [--bpWin]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmpDir TMPDIR]
                                   [--tmpDirKeep]
                                   inputFiles [inputFiles ...]
Positional arguments:
inputFiles A list of files delimited by whitespace for joint normalization. Expected format for iHS/nSL files (no header): <locus name> <physical pos> <freq> <ihh1/sL1> <ihh2/sL0> <ihs/nsl> Expected format for XP-EHH files (one line header): <locus name> <physical pos> <genetic pos> <freq1> <ihh1> <freq2> <ihh2> <xpehh>
Options:
--bins=100 The number of frequency bins in [0,1] for score normalization (default: %(default)s)
--critPercent=-1.0
 Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP. Not used by default (default: %(default)s)
--critValue=2.0
 Set the critical value such that a SNP with |iHS| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al. (default: %(default)s)
--minSNPs=10 Only consider a bp window if it has at least this many SNPs (default: %(default)s)
--qbins=20 Outlying windows are binned by number of sites within each window. This is the number of quantile bins to use. (default: %(default)s)
--winSize=100000
 GThe non-overlapping window size for calculating the percentage of extreme SNPs (default: %(default)s)
--bpWin=False If set, will use windows of a constant bp size with varying number of SNPs
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: %(default)s]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
store_selscan_results_in_db

Aggregate results from selscan in to a SQLite database via helper JSON metadata file.

usage: scans.py store_selscan_results_in_db [-h]
                                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                            [--version] [--tmpDir TMPDIR]
                                            [--tmpDirKeep]
                                            inputFile outFile
Positional arguments:
inputFile Input *.metadata.json file
outFile Output SQLite filepath
Options:
--loglevel=INFO
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: %(default)s]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

3.2. cms_modeller.py

This script contains command-line utilities for exploratory fitting of demographic models to population genetic data.

usage: cms_modeller.py [-h] {target_stats,bootstrap,point,grid,optimize} ...
Sub-commands:
target_stats

Perform per-site(/per-site-pair) calculations of population summary statistics for model target values.

usage: cms_modeller.py target_stats [-h] [--freqs] [--ld] [--fst]
                                    [--modelpath MODELPATH] [--printOnly]
                                    inputTpeds recomFile regions out
Positional arguments:
inputTpeds comma-delimited list of unzipped input tped files (only one file per pop being modelled; must run chroms separately or concatenate)
recomFile file defining recombination map for input
regions tab-separated file with putative neutral regions
out outfile prefix
Options:
--freqs=False calculate summary statistics from within-population allele frequencies
--ld=False calculate summary statistics from within-population linkage disequilibrium
--fst=False calculate summary statistics from population comparison using allele frequencies
--modelpath=cms/model/
 path to model directory containing executables
--printOnly=False
 print rather than execute pipeline commands
bootstrap

Perform bootstrap estimates of population summary statistics from per-site(/per-site-pair) calculations in order to finalize model target values.

usage: cms_modeller.py bootstrap [-h] [--in_freqs IN_FREQS]
                                 [--nFreqHistBins NFREQHISTBINS]
                                 [--in_ld IN_LD]
                                 [--mafcutoffdprime MAFCUTOFFDPRIME]
                                 [--nphysdisthist NPHYSDISTHIST]
                                 [--in_fst IN_FST]
                                 [--ngendisthist NGENDISTHIST]
                                 nBootstrapReps out
Positional arguments:
nBootstrapReps number of bootstraps to perform in order to estimate standard error of the dataset (should converge for reasonably small n)
out outfile prefix
Options:
--in_freqs comma-delimited list of infiles with per-site calculations for population. One file per population – for bootstrap estimates of genome-wide values, should first concatenate per-chrom files
--nFreqHistBins=6
 number of bins for site frequency spectrum and p(der|freq)
--in_ld comma-delimited list of infiles with per-site-pair calculations for population. One file per population – for bootstrap estimates of genome-wide values, should first concatenate per-chrom files
--mafcutoffdprime=0.2
 for D’ calculations, only use sites with MAF > mafcutoffdprime
--nphysdisthist=14
 nbins for r2 LD calculations
--in_fst comma-delimited list of infiles with per-site calculations for population pair. One file per population-pair – for bootstrap estimates of genome-wide values, should first concatenate per-chrom files
--ngendisthist=17
 nbins for D’ LD calculations
point

Run simulates of a point in parameter-space.

usage: cms_modeller.py point [-h] [--cosiBuild COSIBUILD]
                             [--dropSings DROPSINGS] [--genmapRandomRegions]
                             [--stopAfterMinutes STOPAFTERMINUTES]
                             [--calcError CALCERROR]
                             [--targetvalsFile TARGETVALSFILE] [--plotStats]
                             [--printOnly]
                             inputParamFile nCoalescentReps outputDir
Positional arguments:
inputParamFile file with model specifications for input
nCoalescentReps
 number of coalescent replicates to run per point in parameter-space
outputDir location in which to write cosi output
Options:
--cosiBuild=coalescent
 which version of cosi to run?
--dropSings randomly thin global singletons from output dataset (i.e., to model ascertainment bias)
--genmapRandomRegions=False
 cosi option to sub-sample genetic map randomly from input
--stopAfterMinutes
 cosi option to terminate simulations
--calcError file specifying dimensions of error function to use. if unspecified, defaults to all. first line = stats, second line = pops
--targetvalsFile
 file containing target values for model
--plotStats=False
 visualize goodness-of-fit to model targets
--printOnly=False
 print rather than execute pipeline commands
grid

Perform grid search: for specified parameters and intervals, define points in parameter-space to sample and compare.

usage: cms_modeller.py grid [-h] [--cosiBuild COSIBUILD]
                            [--dropSings DROPSINGS] [--genmapRandomRegions]
                            [--stopAfterMinutes STOPAFTERMINUTES]
                            [--calcError CALCERROR]
                            inputParamFile nCoalescentReps outputDir
                            grid_inputdimensionsfile
Positional arguments:
inputParamFile file with model specifications for input
nCoalescentReps
 number of coalescent replicates to run per point in parameter-space
outputDir location in which to write cosi output
grid_inputdimensionsfile
 file with specifications of grid search. each parameter to vary is indicated: KEY INDEX [VALUES]
Options:
--cosiBuild=coalescent
 which version of cosi to run?
--dropSings randomly thin global singletons from output dataset (i.e., to model ascertainment bias)
--genmapRandomRegions=False
 cosi option to sub-sample genetic map randomly from input
--stopAfterMinutes
 cosi option to terminate simulations
--calcError file specifying dimensions of error function to use. if unspecified, defaults to all. first line = stats, second line = pops
optimize

Perform optimization algorithm (scipy.optimize) to fit model parameters robustly.

usage: cms_modeller.py optimize [-h] [--cosiBuild COSIBUILD]
                                [--dropSings DROPSINGS]
                                [--genmapRandomRegions]
                                [--stopAfterMinutes STOPAFTERMINUTES]
                                [--calcError CALCERROR] [--stepSize STEPSIZE]
                                [--method METHOD]
                                inputParamFile nCoalescentReps outputDir
                                optimize_inputdimensionsfile
Positional arguments:
inputParamFile file with model specifications for input
nCoalescentReps
 number of coalescent replicates to run per point in parameter-space
outputDir location in which to write cosi output
optimize_inputdimensionsfile
 file with specifications of optimization. each parameter to vary is indicated: KEY INDEX
Options:
--cosiBuild=coalescent
 which version of cosi to run?
--dropSings randomly thin global singletons from output dataset (i.e., to model ascertainment bias)
--genmapRandomRegions=False
 cosi option to sub-sample genetic map randomly from input
--stopAfterMinutes
 cosi option to terminate simulations
--calcError file specifying dimensions of error function to use. if unspecified, defaults to all. first line = stats, second line = pops
--stepSize scaled step size (i.e. whole range = 1)
--method=SLSQP algorithm to pass to scipy.optimize

3.3. likes_from_model.py

This script contains command-line utilities for generating probability distributions for component scores from pre-specified demographic model(s).

usage: likes_from_model.py [-h]
                           {generate_sel_bins,get_sel_traj,run_neut_sim,run_sel_sim,run_repscores,get_neut_norm_params,norm_from_neut_params,likes_from_scores,plot_likes_vs_scores,write_master_likes}
                           ...
Sub-commands:
generate_sel_bins

Pre-processing step: generate directories with parameter files divided by selDAF, according to specified bins

usage: likes_from_model.py generate_sel_bins [-h] [--freqRange FREQRANGE]
                                             [--nBins NBINS]
                                             outputDir
Positional arguments:
outputDir location to write cosi output
Options:
--freqRange=.05-.95
 range of final selected allele frequencies to simulate, e.g. .05-.95
--nBins=9 number of frequency bins
get_sel_traj

Run forward simulations of selection trajectories to populate selscenarios by final allele frequency before running coalescent simulations for entire sample.

usage: likes_from_model.py get_sel_traj [-h] [--maxAttempts MAXATTEMPTS]
                                        [--cosiBuild COSIBUILD]
                                        [--freqRange FREQRANGE]
                                        [--nBins NBINS]
                                        traj_outputname inputParamFile
Positional arguments:
traj_outputname
 write to file
inputParamFile file with model specifications for input
Options:
--maxAttempts=100
 maximum number of attempts to generate a trajectory before re-sampling selection coefficient / start time
--cosiBuild=coalescent
 which version of cosi to run
--freqRange=.05-.95
 range of final selected allele frequencies to simulate, e.g. .05-.95
--nBins=9 number of frequency bins
run_neut_sim

run neutral simulations

usage: likes_from_model.py run_neut_sim [-h] [--checkOverwrite]
                                        [--dropSings DROPSINGS]
                                        [--cosiBuild COSIBUILD]
                                        writeBase inputParamFile
Positional arguments:
writeBase write prefix
inputParamFile file with model specifications for input
Options:
--checkOverwrite=True
 Undocumented
--dropSings=0.25
 randomly thin global singletons from output dataset to model ascertainment bias
--cosiBuild=coalescent
 which version of cosi to run
run_sel_sim

run sims with selection

usage: likes_from_model.py run_sel_sim [-h] [--checkOverwrite]
                                       [--dropSings DROPSINGS]
                                       [--cosiBuild COSIBUILD]
                                       traj_infilename writeBase
                                       inputParamFile
Positional arguments:
traj_infilename
 selection trajectory
writeBase write prefix
inputParamFile file with model specifications for input
Options:
--checkOverwrite=True
 Undocumented
--dropSings=0.25
 randomly thin global singletons from output dataset to model ascertainment bias
--cosiBuild=coalescent
 which version of cosi to run
run_repscores

run composite score calculations for simulated tped

usage: likes_from_model.py run_repscores [-h] [--simRecomFile SIMRECOMFILE]
                                         [--checkOverwrite] [--cmsdir CMSDIR]
                                         score_basedir inputTpedFile
Positional arguments:
score_basedir parent directory in which to generate/populate folders for each composite score
inputTpedFile TPED file with simulate data for putative selected population
Options:
--simRecomFile=/n/home08/jvitti/params/test_recom.recom
 location of input recom file
--checkOverwrite=True
 Undocumented
--cmsdir=/n/home08/jvitti/cms/cms/
 location of CMS scripts (TEMPORARY; conda will obviate)
get_neut_norm_params

jointly normalize scores from neutral replicates to get parameters with which to normalize all replicates

usage: likes_from_model.py get_neut_norm_params [-h] [--checkOverwrite]
                                                [--cmsdir CMSDIR]
                                                [--edge EDGE]
                                                [--chromlen CHROMLEN]
                                                [--score SCORE]
                                                [--simpop SIMPOP]
                                                [--altpop ALTPOP]
                                                [--nrep NREP]
                                                modeldir
Positional arguments:
modeldir location of component score folders for demographic scenario
Options:
--checkOverwrite=True
 Undocumented
--cmsdir=/n/home08/jvitti/cms/cms/
 location of CMS scripts (TEMPORARY; conda will obviate)
--edge=250000 use interior of replicates; define per-end bp. (e.g. 1.5Mb -> 1Mb: 250000)
--chromlen=1500000
 per bp (1.5mb = 1500000)
--score=ihs Undocumented
--simpop=1 Undocumented
--altpop=2 Undocumented
--nrep=100 Undocumented
norm_from_neut_params

normalize component scores according to neutral distribution

usage: likes_from_model.py norm_from_neut_params [-h] [--selbin SELBIN]
                                                 [--scenario_dir SCENARIO_DIR]
                                                 [--checkOverwrite]
                                                 [--cmsdir CMSDIR]
                                                 [--score SCORE]
                                                 [--simpop SIMPOP]
                                                 [--altpop ALTPOP]
                                                 [--nrep NREP]
                                                 modeldir
Positional arguments:
modeldir location of component score folders for demographic scenario
Options:
--selbin e.g. 0.10 – if excluded, normalize neutral simulates
--scenario_dir=neut
 neut/selsim dir
--checkOverwrite=True
 Undocumented
--cmsdir=/n/home08/jvitti/cms/cms/
 location of CMS scripts (TEMPORARY; conda will obviate)
--score=ihs Undocumented
--simpop=1 Undocumented
--altpop=2 Undocumented
--nrep=100 Undocumented
likes_from_scores

Collate scores from simulated data in order to generate component test probability distributions.

usage: likes_from_model.py likes_from_scores [-h] [--nrep_neut NREP_NEUT]
                                             [--nrep_sel NREP_SEL]
                                             [--binMerge] [--foldDist]
                                             [--save_suffix SAVE_SUFFIX]
                                             [--save_dir SAVE_DIR]
                                             [--save_likelihoods] [--ihs]
                                             [--delihh] [--nsl] [--xpehh]
                                             [--deldaf] [--fst] [--edge EDGE]
                                             [--chromlen CHROMLEN]
                                             [--freqRange FREQRANGE]
                                             [--nBins NBINS]
                                             modeldir
Positional arguments:
modeldir location of component score folders for demographic scenario
Options:
--nrep_neut=1000
 number of neutral replicates
--nrep_sel=500 number of replicates per selection scenario bin
--binMerge=False
 if included, generate higher-order groupings of selscenario frequency bins
--foldDist=False
 if included, take absolute value of score values (fold distribution over y-axis)
--save_suffix optional string to add to filenames (e.g. to avoid overwrite)
--save_dir if included, specify alternate outpot location
--save_likelihoods=False
 in addition to plotting, save data
--ihs=False visualize likelihoods for iHS
--delihh=False visualize likelihoods for delIHH
--nsl=False visualize likelihoods for nSL
--xpehh=False visualize likelihoods for XP-EHH
--deldaf=False visualize likelihoods for delDAF
--fst=False visualize likelihoods for Fst
--edge=250000 use interior of replicates; define per-end bp. (e.g. 1.5Mb -> 1Mb: 250000)
--chromlen=1500000
 per bp (1.5mb = 1500000)
--freqRange=.05-.95
 range of final selected allele frequencies to simulate, e.g. .05-.95
--nBins=9 number of frequency bins
plot_likes_vs_scores

useful QC step; visualize the function that converts score values into likelihood contributions towards composite scores

usage: likes_from_model.py plot_likes_vs_scores [-h]
                                                [--default_nregion_snps DEFAULT_NREGION_SNPS]
                                                inputLikesPrefix
Positional arguments:
inputLikesPrefix
 filename minus causal/linked/neutral.txt
Options:
--default_nregion_snps=5000
 presumptive number of SNPs in region (determines prior distributions for within-region CMS calculations)
write_master_likes

specify and bundle together input distributions to pass to CMS. This allows the user to specify models, frequency of SNPs used to generated p(score|sel), etc.

usage: likes_from_model.py write_master_likes [-h]
                                              [--likes_masterDir LIKES_MASTERDIR]
                                              [--model MODEL]
                                              [--likes_inDir LIKES_INDIR]
                                              [--score SCORE]
                                              [--simpop SIMPOP]
                                              [--altpop ALTPOP] [--nrep NREP]
Options:
--likes_masterDir=/n/regal/sabeti_lab/jvitti/
 location of likelihood tables, defined
--model=nulldefault
 ...
--likes_inDir=/n/regal/sabeti_lab/jvitti/
 ...
--score=ihs Undocumented
--simpop=1 Undocumented
--altpop=2 Undocumented
--nrep=100 Undocumented

3.4. composite.py

This script contains command-line utilities for manipulating and combining component statistics

usage: composite.py [-h]
                    {freqscores,delihh_from_ihs,ihh_from_xp,xp_from_ihh,composite_sims,composite_emp,normsims_genomewide,normemp_genomewide,hapviz,ml_region,ucsc_viz}
                    ...
Sub-commands:
freqscores

Calculate allele frequency-based scores (Fst and delDAF) for a pair of populations.

usage: composite.py freqscores [-h] [--modelpath MODELPATH]
                               inTped1 inTped2 recomFile outfile
Positional arguments:
inTped1 input tped 1
inTped2 input tped 2
recomFile input recombination file
outfile file to write
Options:
--modelpath=cms/cms/model/
 path to model directory containing executables
delihh_from_ihs

Calculate delIHH values from iHS output files.

usage: composite.py delihh_from_ihs [-h] readfile writefile
Positional arguments:
readfile input ihs file
writefile delihh file to write
ihh_from_xp

extract per-pop iHH values from XP-EHH and write to individual files to facilitate arbitrary population comparisons

usage: composite.py ihh_from_xp [-h] inXpehh outIhh takePop
Positional arguments:
inXpehh input xpehh file
outIhh write to file
takePop write for first (1) or second (2) pop in XP-EHH file?
xp_from_ihh

Calculate XP-EHH based on two per-pop iHH files.

usage: composite.py xp_from_ihh [-h] [--cmsdir CMSDIR] [--printOnly]
                                inIhh1 inIhh2 outfilename
Positional arguments:
inIhh1 input ihh file 1
inIhh2 input ihh file 2
outfilename write to file
Options:
--cmsdir=/idi/sabeti-scratch/jvitti/cms/cms/
 TEMPORARY, will become redundant with conda packaging
--printOnly=False
 print rather than execute pipeline commands
composite_sims

calculate composite scores for simulations

usage: composite.py composite_sims [-h] [--regional_cms]
                                   [--scenariopop SCENARIOPOP]
                                   [--cmsdir CMSDIR] [--printOnly]
                                   [--nrep_sel NREP_SEL]
                                   [--nrep_neut NREP_NEUT]
                                   [--likes_masterDir LIKES_MASTERDIR]
                                   [--likes_nonSel LIKES_NONSEL]
                                   [--likes_freqSuffix LIKES_FREQSUFFIX]
                                   [--cutoffline CUTOFFLINE]
                                   [--includeline INCLUDELINE]
                                   [--writedir WRITEDIR]
                                   [--runSuffix RUNSUFFIX] [--simpop SIMPOP]
                                   [--model MODEL] [--checkOverwrite]
Options:
--regional_cms=False
 calculate within-region CMS rather than genome-wide CMS
--scenariopop sel directory
--cmsdir=/idi/sabeti-scratch/jvitti/cms/cms/
 TEMPORARY, will become redundant with conda packaging
--printOnly=False
 print rather than execute pipeline commands
--nrep_sel=500 Undocumented
--nrep_neut=1000
 Undocumented
--likes_masterDir=/n/regal/sabeti_lab/jvitti/clear-synth/sims_reeval/likes_masters/
 location of likelihood tables, defined
--likes_nonSel=vsNeut
 do we use completely neutral, or linked neutral SNPs for our non-causal distributions? by default, uses strict neutral (CMSgw)
--likes_freqSuffix=allFreqs
 for causal SNPs, include suffix to specify which selbins to include
--cutoffline=250000 1250000 0 1
 specify bounds to include/exclude in calculations, along with MAF filter and likes decomposition
--includeline=0 0 0 0 0 0
 specify (0:yes; 1:no) which scores to include: iHS, ...
--writedir= specify relative path
--runSuffix add a suffix to .cms file (corresponds to runparamfile)
--simpop=1 simulated population
--model=nulldefault
 Undocumented
--checkOverwrite=False
 Undocumented
composite_emp

calculate composite scores for empirical data

usage: composite.py composite_emp [-h] [--score_basedir SCORE_BASEDIR]
                                  [--regional_cms_chrom REGIONAL_CMS_CHROM]
                                  [--composite_writedir COMPOSITE_WRITEDIR]
                                  [--cmsdir CMSDIR] [--printOnly]
                                  [--likes_masterDir LIKES_MASTERDIR]
                                  [--likes_nonSel LIKES_NONSEL]
                                  [--likes_freqSuffix LIKES_FREQSUFFIX]
                                  [--cutoffline CUTOFFLINE]
                                  [--includeline INCLUDELINE]
                                  [--writedir WRITEDIR]
                                  [--runSuffix RUNSUFFIX] [--simpop SIMPOP]
                                  [--emppop EMPPOP] [--model MODEL]
                                  [--checkOverwrite]
Options:
--score_basedir=/n/regal/sabeti_lab/jvitti/clear-synth/1kg_scores/
 Undocumented
--regional_cms_chrom
 if included, calculate within-region CMS (rather than CMS_gw) for specified bounds at this chromosome
--composite_writedir=
 write output to
--cmsdir=/idi/sabeti-scratch/jvitti/cms/cms/
 TEMPORARY, will become redundant with conda packaging
--printOnly=False
 print rather than execute pipeline commands
--likes_masterDir=/n/regal/sabeti_lab/jvitti/clear-synth/sims_reeval/likes_masters/
 location of likelihood tables, defined
--likes_nonSel=vsNeut
 do we use completely neutral, or linked neutral SNPs for our non-causal distributions? by default, uses strict neutral (CMSgw)
--likes_freqSuffix=allFreqs
 for causal SNPs, include suffix to specify which selbins to include
--cutoffline=250000 1250000 0 1
 specify bounds to include/exclude in calculations, along with MAF filter and likes decomposition
--includeline=0 0 0 0 0 0
 specify (0:yes; 1:no) which scores to include: iHS, ...
--writedir= specify relative path
--runSuffix add a suffix to .cms file (corresponds to runparamfile)
--simpop=1 simulated population
--emppop=YRI empirical population
--model=nulldefault
 Undocumented
--checkOverwrite=False
 Undocumented
normsims_genomewide

normalize simulated composite scores to neutral

usage: composite.py normsims_genomewide [-h] [--nrep_sel NREP_SEL]
                                        [--nrep_neut NREP_NEUT]
                                        [--writedir WRITEDIR]
                                        [--runSuffix RUNSUFFIX]
                                        [--simpop SIMPOP] [--model MODEL]
                                        [--checkOverwrite]
Options:
--nrep_sel=500 Undocumented
--nrep_neut=1000
 Undocumented
--writedir= specify relative path
--runSuffix add a suffix to .cms file (corresponds to runparamfile)
--simpop=1 simulated population
--model=nulldefault
 Undocumented
--checkOverwrite=False
 Undocumented
normemp_genomewide

normalize CMS scores to genome-wide

usage: composite.py normemp_genomewide [-h] [--score_basedir SCORE_BASEDIR]
                                       [--cmsdir CMSDIR] [--printOnly]
                                       [--writedir WRITEDIR]
                                       [--runSuffix RUNSUFFIX]
                                       [--emppop EMPPOP]
Options:
--score_basedir=/n/regal/sabeti_lab/jvitti/clear-synth/1kg_scores/
 Undocumented
--cmsdir=/idi/sabeti-scratch/jvitti/cms/cms/
 TEMPORARY, will become redundant with conda packaging
--printOnly=False
 print rather than execute pipeline commands
--writedir= specify relative path
--runSuffix add a suffix to .cms file (corresponds to runparamfile)
--emppop=YRI empirical population
hapviz

Visualize haplotypes for region

usage: composite.py hapviz [-h] [--startpos STARTPOS] [--endpos ENDPOS]
                           [--corepos COREPOS] [--title TITLE]
                           [--annotate ANNOTATE] [--maf MAF] [--dpi DPI]
                           inputfile out
Positional arguments:
inputfile input tped
out save image as file
Options:
--startpos define physical bounds of region
--endpos define physical bounds of region
--corepos=-1 partition haplotypes based on allele status at this position
--title title to give to plot
--annotate tab-delimited file where each line gives <chr.pos> <annotation>
--maf filter on minor allele frequency (e.g. .01, .05)
--dpi=300 image resolution
ml_region

machine learning algorithm (within-region)

usage: composite.py ml_region [-h]
ucsc_viz

Generate trackfiles of CMS scores for visualization in the UCSC genome browser.

usage: composite.py ucsc_viz [-h] [--posIndex POSINDEX]
                             [--scoreIndex SCOREINDEX] [--strip_header]
                             infile_prefix outfile
Positional arguments:
infile_prefix prefix of file containing scores to be reformatted (e.g. ‘score_chr’ for files named scores_chr#.txt)
outfile file to write
Options:
--posIndex=1 index for column of datafile containing physical position (zero-indexed)
--scoreIndex=-2
 index for column of datafile containing score (zero-indexed)
--strip_header=False
 if input files include header line