3. Command line tools¶

3.1. scans.py¶

This script contains command-line utilities for calculating EHH-based scans for positive selection in genomes, including EHH, iHS, and XP-EHH.

usage: scans.py subcommand

Sub-commands:

selscan_file_conversion

Process a bgzipped-VCF (such as those included in the Phase 3 1000 Genomes release) into a gzip-compressed tped file of the sort expected by selscan.

usage: scans.py selscan_file_conversion [-h] [--startBp STARTBP]
                                        [--endBp ENDBP] [--ploidy PLOIDY]
                                        [--considerMultiAllelic]
                                        [--rescaleGeneticDistance]
                                        [--includeLowQualAncestral]
                                        [--codingFunctionClassFile CODINGFUNCTIONCLASSFILE]
                                        [--sampleMembershipFile SAMPLEMEMBERSHIPFILE]
                                        [--filterPops FILTERPOPS [FILTERPOPS ...]]
                                        [--filterSuperPops FILTERSUPERPOPS [FILTERSUPERPOPS ...]]
                                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                        [--version] [--tmpDir TMPDIR]
                                        [--tmpDirKeep]
                                        inputVCF genMap outPrefix outLocation
                                        chromosomeNum

Positional arguments:

`inputVCF`	Input VCF file
`genMap`	Genetic recombination map tsv file with four columns: (Chromosome, Position(bp), Rate(cM/Mb), Map(cM))
`outPrefix`	Output file prefix
`outLocation`	Output location
`chromosomeNum`	Chromosome number.

Options:

`--startBp=1`	Coordinate in bp of start position. (default: %(default)s).
`--endBp`	Coordinate in bp of end position.
`--ploidy=2`	Number of chromosomes expected for each genotype. (default: %(default)s).
`--considerMultiAllelic=False`
	Include multi-allelic variants in the output as separate records
`--rescaleGeneticDistance=False`
	Genetic distance is rescaled to be out of 100.0 cM
`--includeLowQualAncestral=False`
	Include variants where the ancestral information is low-quality (as indicated by lower-case x for AA=x in the VCF info column) (default: %(default)s).
`--codingFunctionClassFile`
	A python class file containing a function used to code each genotype as ‘1’ and ‘0’. coding_function(current_value, reference_allele, alternate_allele, ancestral_allele)
`--sampleMembershipFile`
	The call sample file containing four columns: sample, pop, super_pop, gender
`--filterPops`	Populations to include in the calculation (ex. “FIN”)
`--filterSuperPops`
	Super populations to include in the calculation (ex. “EUR”)
`--loglevel=DEBUG`
	Verboseness of output. [default: %(default)s] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmpDir=/tmp`	Base directory for temp files. [default: %(default)s]
`--tmpDirKeep=False`
	Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

selscan_ehh

Perform selscan’s calculation of EHH.

usage: scans.py selscan_ehh [-h] [--gapScale GAPSCALE] [--maf MAF]
                            [--threads THREADS] [--window WINDOW]
                            [--cutoff CUTOFF] [--maxExtend MAXEXTEND]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                            inputTped outFile locusID

Positional arguments:

`inputTped`	Input tped file
`outFile`	Output filepath
`locusID`	The locus ID

Options:

`--gapScale=20000`
	Gap scale parameter in bp. If a gap is encountered between two snps > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: %(default)s).
`--maf=0.05`	Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core snp. (default: %(default)s).
`--threads=1`	The number of threads to spawn during the calculation. Partitions loci across threads. (default: %(default)s).
`--window=100000`
	When calculating EHH, this is the length of the window in bp in each direction from the query locus (default: %(default)s).
`--cutoff=0.05`	The EHH decay cutoff (default: %(default)s).
`--maxExtend=1000000`
	The maximum distance an EHH decay curve is allowed to extend from the core. Set <= 0 for no restriction. (default: %(default)s).
`--loglevel=DEBUG`
	Verboseness of output. [default: %(default)s] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmpDir=/tmp`	Base directory for temp files. [default: %(default)s]
`--tmpDirKeep=False`
	Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

selscan_ihs

Perform selscan’s calculation of iHS.

usage: scans.py selscan_ihs [-h] [--gapScale GAPSCALE] [--maf MAF]
                            [--threads THREADS] [--skipLowFreq]
                            [--dontWriteLeftRightiHH] [--truncOk]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                            inputTped outFile

Positional arguments:

`inputTped`	Input tped file
`outFile`	Output filepath

Options:

`--gapScale=20000`
	Gap scale parameter in bp. If a gap is encountered between two snps > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: %(default)s).
`--maf=0.05`	Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core snp. (default: %(default)s).
`--threads=1`	The number of threads to spawn during the calculation. Partitions loci across threads. (default: %(default)s).
`--skipLowFreq=False`
	Do not include low frequency variants in the construction of haplotypes (default: %(default)s).
`--dontWriteLeftRightiHH=False`
	When writing out iHS, do not write out the constituent left and right ancestral and derived iHH scores for each locus.(default: %(default)s).
`--truncOk=False`
	If an EHH decay reaches the end of a sequence before reaching the cutoff, integrate the curve anyway. Normal function is to disregard the score for that core. (default: %(default)s).
`--loglevel=DEBUG`
	Verboseness of output. [default: %(default)s] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmpDir=/tmp`	Base directory for temp files. [default: %(default)s]
`--tmpDirKeep=False`
	Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

selscan_nsl

Perform selscan’s calculation of nSL.

usage: scans.py selscan_nsl [-h] [--gapScale GAPSCALE] [--maf MAF]
                            [--threads THREADS] [--truncOk]
                            [--maxExtendNsl MAXEXTENDNSL]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                            inputTped outFile

Positional arguments:

`inputTped`	Input tped file
`outFile`	Output filepath

Options:

`--gapScale=20000`
	Gap scale parameter in bp. If a gap is encountered between two snps > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: %(default)s).
`--maf=0.05`	Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core snp. (default: %(default)s).
`--threads=1`	The number of threads to spawn during the calculation. Partitions loci across threads. (default: %(default)s).
`--truncOk=False`
	If an EHH decay reaches the end of a sequence before reaching the cutoff, integrate the curve anyway. Normal function is to disregard the score for that core. (default: %(default)s).
`--maxExtendNsl=100`
	The maximum distance an nSL haplotype is allowed to extend from the core. Set <= 0 for no restriction. (default: %(default)s).
`--loglevel=DEBUG`
	Verboseness of output. [default: %(default)s] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmpDir=/tmp`	Base directory for temp files. [default: %(default)s]
`--tmpDirKeep=False`
	Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

selscan_xpehh

Perform selscan’s calculation of XPEHH.

usage: scans.py selscan_xpehh [-h] [--gapScale GAPSCALE] [--maf MAF]
                              [--threads THREADS] [--truncOk]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                              inputTped outFile inputRefTped

Positional arguments:

`inputTped`	Input tped file
`outFile`	Output filepath
`inputRefTped`	Input tped for the reference population to which the first is compared

Options:

`--gapScale=20000`
	Gap scale parameter in bp. If a gap is encountered between two snps > GAP_SCALE and < MAX_GAP, then the genetic distance is scaled by GAP_SCALE/GA (default: %(default)s).
`--maf=0.05`	Minor allele frequency. If a site has a MAF below this value, the program will not use it as a core snp. (default: %(default)s).
`--threads=1`	The number of threads to spawn during the calculation. Partitions loci across threads. (default: %(default)s).
`--truncOk=False`
	If an EHH decay reaches the end of a sequence before reaching the cutoff, integrate the curve anyway. Normal function is to disregard the score for that core. (default: %(default)s).
`--loglevel=DEBUG`
	Verboseness of output. [default: %(default)s] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmpDir=/tmp`	Base directory for temp files. [default: %(default)s]
`--tmpDirKeep=False`
	Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

selscan_norm_nsl

Undocumented

Normalize Selscan’s nSL output

usage: scans.py selscan_norm_nsl [-h] [--bins BINS]
                                 [--critPercent CRITPERCENT]
                                 [--critValue CRITVALUE] [--minSNPs MINSNPS]
                                 [--qbins QBINS] [--winSize WINSIZE] [--bpWin]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                                 inputFiles [inputFiles ...]

Positional arguments:

inputFiles

A list of files delimited by whitespace for joint normalization. Expected format for iHS/nSL files (no header): <locus name> <physical pos> <freq> <ihh1/sL1> <ihh2/sL0> <ihs/nsl> Expected format for XP-EHH files (one line header): <locus name> <physical pos> <genetic pos> <freq1> <ihh1> <freq2> <ihh2> <xpehh>

Options:

`--bins=100`	The number of frequency bins in [0,1] for score normalization (default: %(default)s)
`--critPercent=-1.0`
	Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP. Not used by default (default: %(default)s)
`--critValue=2.0`
	Set the critical value such that a SNP with \|iHS\| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al. (default: %(default)s)
`--minSNPs=10`	Only consider a bp window if it has at least this many SNPs (default: %(default)s)
`--qbins=20`	Outlying windows are binned by number of sites within each window. This is the number of quantile bins to use. (default: %(default)s)
`--winSize=100000`
	GThe non-overlapping window size for calculating the percentage of extreme SNPs (default: %(default)s)
`--bpWin=False`	If set, will use windows of a constant bp size with varying number of SNPs
`--loglevel=DEBUG`
	Verboseness of output. [default: %(default)s] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmpDir=/tmp`	Base directory for temp files. [default: %(default)s]
`--tmpDirKeep=False`
	Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

selscan_norm_ihs

Undocumented

Normalize Selscan’s iHS output

usage: scans.py selscan_norm_ihs [-h] [--bins BINS]
                                 [--critPercent CRITPERCENT]
                                 [--critValue CRITVALUE] [--minSNPs MINSNPS]
                                 [--qbins QBINS] [--winSize WINSIZE] [--bpWin]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version] [--tmpDir TMPDIR] [--tmpDirKeep]
                                 inputFiles [inputFiles ...]

Positional arguments:

inputFiles

A list of files delimited by whitespace for joint normalization. Expected format for iHS/nSL files (no header): <locus name> <physical pos> <freq> <ihh1/sL1> <ihh2/sL0> <ihs/nsl> Expected format for XP-EHH files (one line header): <locus name> <physical pos> <genetic pos> <freq1> <ihh1> <freq2> <ihh2> <xpehh>

Options:

`--bins=100`	The number of frequency bins in [0,1] for score normalization (default: %(default)s)
`--critPercent=-1.0`
	Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP. Not used by default (default: %(default)s)
`--critValue=2.0`
	Set the critical value such that a SNP with \|iHS\| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al. (default: %(default)s)
`--minSNPs=10`	Only consider a bp window if it has at least this many SNPs (default: %(default)s)
`--qbins=20`	Outlying windows are binned by number of sites within each window. This is the number of quantile bins to use. (default: %(default)s)
`--winSize=100000`
	GThe non-overlapping window size for calculating the percentage of extreme SNPs (default: %(default)s)
`--bpWin=False`	If set, will use windows of a constant bp size with varying number of SNPs
`--loglevel=DEBUG`
	Verboseness of output. [default: %(default)s] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmpDir=/tmp`	Base directory for temp files. [default: %(default)s]
`--tmpDirKeep=False`
	Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

selscan_norm_xpehh

Undocumented

Normalize Selscan’s XPEHH output

usage: scans.py selscan_norm_xpehh [-h] [--bins BINS]
                                   [--critPercent CRITPERCENT]
                                   [--critValue CRITVALUE] [--minSNPs MINSNPS]
                                   [--qbins QBINS] [--winSize WINSIZE]
                                   [--bpWin]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmpDir TMPDIR]
                                   [--tmpDirKeep]
                                   inputFiles [inputFiles ...]

Positional arguments:

inputFiles

A list of files delimited by whitespace for joint normalization. Expected format for iHS/nSL files (no header): <locus name> <physical pos> <freq> <ihh1/sL1> <ihh2/sL0> <ihs/nsl> Expected format for XP-EHH files (one line header): <locus name> <physical pos> <genetic pos> <freq1> <ihh1> <freq2> <ihh2> <xpehh>

Options:

`--bins=100`	The number of frequency bins in [0,1] for score normalization (default: %(default)s)
`--critPercent=-1.0`
	Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP. Not used by default (default: %(default)s)
`--critValue=2.0`
	Set the critical value such that a SNP with \|iHS\| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al. (default: %(default)s)
`--minSNPs=10`	Only consider a bp window if it has at least this many SNPs (default: %(default)s)
`--qbins=20`	Outlying windows are binned by number of sites within each window. This is the number of quantile bins to use. (default: %(default)s)
`--winSize=100000`
	GThe non-overlapping window size for calculating the percentage of extreme SNPs (default: %(default)s)
`--bpWin=False`	If set, will use windows of a constant bp size with varying number of SNPs
`--loglevel=DEBUG`
	Verboseness of output. [default: %(default)s] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmpDir=/tmp`	Base directory for temp files. [default: %(default)s]
`--tmpDirKeep=False`
	Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

store_selscan_results_in_db

Aggregate results from selscan in to a SQLite database via helper JSON metadata file.

usage: scans.py store_selscan_results_in_db [-h]
                                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                            [--version] [--tmpDir TMPDIR]
                                            [--tmpDirKeep]
                                            inputFile outFile

Positional arguments:

`inputFile`	Input *.metadata.json file
`outFile`	Output SQLite filepath

Options:

`--loglevel=INFO`
	Verboseness of output. [default: %(default)s] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmpDir=/tmp`	Base directory for temp files. [default: %(default)s]
`--tmpDirKeep=False`
	Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

3.2. cms_modeller.py¶

This script contains command-line utilities for exploratory fitting of demographic models to population genetic data.

usage: cms_modeller.py [-h] {target_stats,bootstrap,point,grid,optimize} ...

Sub-commands:

target_stats

Perform per-site(/per-site-pair) calculations of population summary statistics for model target values.

usage: cms_modeller.py target_stats [-h] [--freqs] [--ld] [--fst]
                                    [--modelpath MODELPATH] [--printOnly]
                                    inputTpeds recomFile regions out

Positional arguments:

`inputTpeds`	comma-delimited list of unzipped input tped files (only one file per pop being modelled; must run chroms separately or concatenate)
`recomFile`	file defining recombination map for input
`regions`	tab-separated file with putative neutral regions
`out`	outfile prefix

Options:

`--freqs=False`	calculate summary statistics from within-population allele frequencies
`--ld=False`	calculate summary statistics from within-population linkage disequilibrium
`--fst=False`	calculate summary statistics from population comparison using allele frequencies
`--modelpath=cms/model/`
	path to model directory containing executables
`--printOnly=False`
	print rather than execute pipeline commands

bootstrap

Perform bootstrap estimates of population summary statistics from per-site(/per-site-pair) calculations in order to finalize model target values.

usage: cms_modeller.py bootstrap [-h] [--in_freqs IN_FREQS]
                                 [--nFreqHistBins NFREQHISTBINS]
                                 [--in_ld IN_LD]
                                 [--mafcutoffdprime MAFCUTOFFDPRIME]
                                 [--nphysdisthist NPHYSDISTHIST]
                                 [--in_fst IN_FST]
                                 [--ngendisthist NGENDISTHIST]
                                 nBootstrapReps out

Positional arguments:

`nBootstrapReps`	number of bootstraps to perform in order to estimate standard error of the dataset (should converge for reasonably small n)
`out`	outfile prefix

Options:

`--in_freqs`	comma-delimited list of infiles with per-site calculations for population. One file per population – for bootstrap estimates of genome-wide values, should first concatenate per-chrom files
`--nFreqHistBins=6`
	number of bins for site frequency spectrum and p(der\|freq)
`--in_ld`	comma-delimited list of infiles with per-site-pair calculations for population. One file per population – for bootstrap estimates of genome-wide values, should first concatenate per-chrom files
`--mafcutoffdprime=0.2`
	for D’ calculations, only use sites with MAF > mafcutoffdprime
`--nphysdisthist=14`
	nbins for r2 LD calculations
`--in_fst`	comma-delimited list of infiles with per-site calculations for population pair. One file per population-pair – for bootstrap estimates of genome-wide values, should first concatenate per-chrom files
`--ngendisthist=17`
	nbins for D’ LD calculations

point

Run simulates of a point in parameter-space.

usage: cms_modeller.py point [-h] [--cosiBuild COSIBUILD]
                             [--dropSings DROPSINGS] [--genmapRandomRegions]
                             [--stopAfterMinutes STOPAFTERMINUTES]
                             [--calcError CALCERROR]
                             [--targetvalsFile TARGETVALSFILE] [--plotStats]
                             [--printOnly]
                             inputParamFile nCoalescentReps outputDir

Positional arguments:

`inputParamFile`	file with model specifications for input
`nCoalescentReps`
	number of coalescent replicates to run per point in parameter-space
`outputDir`	location in which to write cosi output

Options:

`--cosiBuild=coalescent`
	which version of cosi to run?
`--dropSings`	randomly thin global singletons from output dataset (i.e., to model ascertainment bias)
`--genmapRandomRegions=False`
	cosi option to sub-sample genetic map randomly from input
`--stopAfterMinutes`
	cosi option to terminate simulations
`--calcError`	file specifying dimensions of error function to use. if unspecified, defaults to all. first line = stats, second line = pops
`--targetvalsFile`
	file containing target values for model
`--plotStats=False`
	visualize goodness-of-fit to model targets
`--printOnly=False`
	print rather than execute pipeline commands

grid

Perform grid search: for specified parameters and intervals, define points in parameter-space to sample and compare.

usage: cms_modeller.py grid [-h] [--cosiBuild COSIBUILD]
                            [--dropSings DROPSINGS] [--genmapRandomRegions]
                            [--stopAfterMinutes STOPAFTERMINUTES]
                            [--calcError CALCERROR]
                            inputParamFile nCoalescentReps outputDir
                            grid_inputdimensionsfile

Positional arguments:

`inputParamFile`	file with model specifications for input
`nCoalescentReps`
	number of coalescent replicates to run per point in parameter-space
`outputDir`	location in which to write cosi output
`grid_inputdimensionsfile`
	file with specifications of grid search. each parameter to vary is indicated: KEY INDEX [VALUES]

Options:

`--cosiBuild=coalescent`
	which version of cosi to run?
`--dropSings`	randomly thin global singletons from output dataset (i.e., to model ascertainment bias)
`--genmapRandomRegions=False`
	cosi option to sub-sample genetic map randomly from input
`--stopAfterMinutes`
	cosi option to terminate simulations
`--calcError`	file specifying dimensions of error function to use. if unspecified, defaults to all. first line = stats, second line = pops

optimize

Perform optimization algorithm (scipy.optimize) to fit model parameters robustly.

usage: cms_modeller.py optimize [-h] [--cosiBuild COSIBUILD]
                                [--dropSings DROPSINGS]
                                [--genmapRandomRegions]
                                [--stopAfterMinutes STOPAFTERMINUTES]
                                [--calcError CALCERROR] [--stepSize STEPSIZE]
                                [--method METHOD]
                                inputParamFile nCoalescentReps outputDir
                                optimize_inputdimensionsfile

Positional arguments:

`inputParamFile`	file with model specifications for input
`nCoalescentReps`
	number of coalescent replicates to run per point in parameter-space
`outputDir`	location in which to write cosi output
`optimize_inputdimensionsfile`
	file with specifications of optimization. each parameter to vary is indicated: KEY INDEX

Options:

`--cosiBuild=coalescent`
	which version of cosi to run?
`--dropSings`	randomly thin global singletons from output dataset (i.e., to model ascertainment bias)
`--genmapRandomRegions=False`
	cosi option to sub-sample genetic map randomly from input
`--stopAfterMinutes`
	cosi option to terminate simulations
`--calcError`	file specifying dimensions of error function to use. if unspecified, defaults to all. first line = stats, second line = pops
`--stepSize`	scaled step size (i.e. whole range = 1)
`--method=SLSQP`	algorithm to pass to scipy.optimize

3.3. likes_from_model.py¶

This script contains command-line utilities for generating probability distributions for component scores from pre-specified demographic model(s).

usage: likes_from_model.py [-h]
                           {generate_sel_bins,get_sel_traj,run_neut_sim,run_sel_sim,run_repscores,get_neut_norm_params,norm_from_neut_params,likes_from_scores,plot_likes_vs_scores,write_master_likes}
                           ...

Sub-commands:

generate_sel_bins

Pre-processing step: generate directories with parameter files divided by selDAF, according to specified bins

usage: likes_from_model.py generate_sel_bins [-h] [--freqRange FREQRANGE]
                                             [--nBins NBINS]
                                             outputDir

Positional arguments:

outputDir

location to write cosi output

Options:

`--freqRange=.05-.95`
	range of final selected allele frequencies to simulate, e.g. .05-.95
`--nBins=9`	number of frequency bins

get_sel_traj

Run forward simulations of selection trajectories to populate selscenarios by final allele frequency before running coalescent simulations for entire sample.

usage: likes_from_model.py get_sel_traj [-h] [--maxAttempts MAXATTEMPTS]
                                        [--cosiBuild COSIBUILD]
                                        [--freqRange FREQRANGE]
                                        [--nBins NBINS]
                                        traj_outputname inputParamFile

Positional arguments:

`traj_outputname`
	write to file
`inputParamFile`	file with model specifications for input

Options:

`--maxAttempts=100`
	maximum number of attempts to generate a trajectory before re-sampling selection coefficient / start time
`--cosiBuild=coalescent`
	which version of cosi to run
`--freqRange=.05-.95`
	range of final selected allele frequencies to simulate, e.g. .05-.95
`--nBins=9`	number of frequency bins

run_neut_sim

run neutral simulations

usage: likes_from_model.py run_neut_sim [-h] [--checkOverwrite]
                                        [--dropSings DROPSINGS]
                                        [--cosiBuild COSIBUILD]
                                        writeBase inputParamFile

Positional arguments:

`writeBase`	write prefix
`inputParamFile`	file with model specifications for input

Options:

`--checkOverwrite=True`
	Undocumented
`--dropSings=0.25`
	randomly thin global singletons from output dataset to model ascertainment bias
`--cosiBuild=coalescent`
	which version of cosi to run

run_sel_sim

run sims with selection

usage: likes_from_model.py run_sel_sim [-h] [--checkOverwrite]
                                       [--dropSings DROPSINGS]
                                       [--cosiBuild COSIBUILD]
                                       traj_infilename writeBase
                                       inputParamFile

Positional arguments:

`traj_infilename`
	selection trajectory
`writeBase`	write prefix
`inputParamFile`	file with model specifications for input

Options:

`--checkOverwrite=True`
	Undocumented
`--dropSings=0.25`
	randomly thin global singletons from output dataset to model ascertainment bias
`--cosiBuild=coalescent`
	which version of cosi to run

run_repscores

run composite score calculations for simulated tped

usage: likes_from_model.py run_repscores [-h] [--simRecomFile SIMRECOMFILE]
                                         [--checkOverwrite] [--cmsdir CMSDIR]
                                         score_basedir inputTpedFile

Positional arguments:

`score_basedir`	parent directory in which to generate/populate folders for each composite score
`inputTpedFile`	TPED file with simulate data for putative selected population

Options:

`--simRecomFile=/n/home08/jvitti/params/test_recom.recom`
	location of input recom file
`--checkOverwrite=True`
	Undocumented
`--cmsdir=/n/home08/jvitti/cms/cms/`
	location of CMS scripts (TEMPORARY; conda will obviate)

get_neut_norm_params

jointly normalize scores from neutral replicates to get parameters with which to normalize all replicates

usage: likes_from_model.py get_neut_norm_params [-h] [--checkOverwrite]
                                                [--cmsdir CMSDIR]
                                                [--edge EDGE]
                                                [--chromlen CHROMLEN]
                                                [--score SCORE]
                                                [--simpop SIMPOP]
                                                [--altpop ALTPOP]
                                                [--nrep NREP]
                                                modeldir

Positional arguments:

modeldir

location of component score folders for demographic scenario

Options:

`--checkOverwrite=True`
	Undocumented
`--cmsdir=/n/home08/jvitti/cms/cms/`
	location of CMS scripts (TEMPORARY; conda will obviate)
`--edge=250000`	use interior of replicates; define per-end bp. (e.g. 1.5Mb -> 1Mb: 250000)
`--chromlen=1500000`
	per bp (1.5mb = 1500000)
`--score=ihs`	Undocumented
`--simpop=1`	Undocumented
`--altpop=2`	Undocumented
`--nrep=100`	Undocumented

norm_from_neut_params

normalize component scores according to neutral distribution

usage: likes_from_model.py norm_from_neut_params [-h] [--selbin SELBIN]
                                                 [--scenario_dir SCENARIO_DIR]
                                                 [--checkOverwrite]
                                                 [--cmsdir CMSDIR]
                                                 [--score SCORE]
                                                 [--simpop SIMPOP]
                                                 [--altpop ALTPOP]
                                                 [--nrep NREP]
                                                 modeldir

Positional arguments:

modeldir

location of component score folders for demographic scenario

Options:

`--selbin`	e.g. 0.10 – if excluded, normalize neutral simulates
`--scenario_dir=neut`
	neut/selsim dir
`--checkOverwrite=True`
	Undocumented
`--cmsdir=/n/home08/jvitti/cms/cms/`
	location of CMS scripts (TEMPORARY; conda will obviate)
`--score=ihs`	Undocumented
`--simpop=1`	Undocumented
`--altpop=2`	Undocumented
`--nrep=100`	Undocumented

likes_from_scores

Collate scores from simulated data in order to generate component test probability distributions.

usage: likes_from_model.py likes_from_scores [-h] [--nrep_neut NREP_NEUT]
                                             [--nrep_sel NREP_SEL]
                                             [--binMerge] [--foldDist]
                                             [--save_suffix SAVE_SUFFIX]
                                             [--save_dir SAVE_DIR]
                                             [--save_likelihoods] [--ihs]
                                             [--delihh] [--nsl] [--xpehh]
                                             [--deldaf] [--fst] [--edge EDGE]
                                             [--chromlen CHROMLEN]
                                             [--freqRange FREQRANGE]
                                             [--nBins NBINS]
                                             modeldir

Positional arguments:

modeldir

location of component score folders for demographic scenario

Options:

`--nrep_neut=1000`
	number of neutral replicates
`--nrep_sel=500`	number of replicates per selection scenario bin
`--binMerge=False`
	if included, generate higher-order groupings of selscenario frequency bins
`--foldDist=False`
	if included, take absolute value of score values (fold distribution over y-axis)
`--save_suffix`	optional string to add to filenames (e.g. to avoid overwrite)
`--save_dir`	if included, specify alternate outpot location
`--save_likelihoods=False`
	in addition to plotting, save data
`--ihs=False`	visualize likelihoods for iHS
`--delihh=False`	visualize likelihoods for delIHH
`--nsl=False`	visualize likelihoods for nSL
`--xpehh=False`	visualize likelihoods for XP-EHH
`--deldaf=False`	visualize likelihoods for delDAF
`--fst=False`	visualize likelihoods for Fst
`--edge=250000`	use interior of replicates; define per-end bp. (e.g. 1.5Mb -> 1Mb: 250000)
`--chromlen=1500000`
	per bp (1.5mb = 1500000)
`--freqRange=.05-.95`
	range of final selected allele frequencies to simulate, e.g. .05-.95
`--nBins=9`	number of frequency bins

plot_likes_vs_scores

useful QC step; visualize the function that converts score values into likelihood contributions towards composite scores

usage: likes_from_model.py plot_likes_vs_scores [-h]
                                                [--default_nregion_snps DEFAULT_NREGION_SNPS]
                                                inputLikesPrefix

Positional arguments:

`inputLikesPrefix`
	filename minus causal/linked/neutral.txt

Options:

`--default_nregion_snps=5000`
	presumptive number of SNPs in region (determines prior distributions for within-region CMS calculations)

write_master_likes

specify and bundle together input distributions to pass to CMS. This allows the user to specify models, frequency of SNPs used to generated p(score|sel), etc.

usage: likes_from_model.py write_master_likes [-h]
                                              [--likes_masterDir LIKES_MASTERDIR]
                                              [--model MODEL]
                                              [--likes_inDir LIKES_INDIR]
                                              [--score SCORE]
                                              [--simpop SIMPOP]
                                              [--altpop ALTPOP] [--nrep NREP]

Options:

`--likes_masterDir=/n/regal/sabeti_lab/jvitti/`
	location of likelihood tables, defined
`--model=nulldefault`
	...
`--likes_inDir=/n/regal/sabeti_lab/jvitti/`
	...
`--score=ihs`	Undocumented
`--simpop=1`	Undocumented
`--altpop=2`	Undocumented
`--nrep=100`	Undocumented

3.4. composite.py¶

This script contains command-line utilities for manipulating and combining component statistics

usage: composite.py [-h]
                    {freqscores,delihh_from_ihs,ihh_from_xp,xp_from_ihh,composite_sims,composite_emp,normsims_genomewide,normemp_genomewide,hapviz,ml_region,ucsc_viz}
                    ...

Sub-commands:

freqscores

Calculate allele frequency-based scores (Fst and delDAF) for a pair of populations.

usage: composite.py freqscores [-h] [--modelpath MODELPATH]
                               inTped1 inTped2 recomFile outfile

Positional arguments:

`inTped1`	input tped 1
`inTped2`	input tped 2
`recomFile`	input recombination file
`outfile`	file to write

Options:

`--modelpath=cms/cms/model/`
	path to model directory containing executables

delihh_from_ihs

Calculate delIHH values from iHS output files.

usage: composite.py delihh_from_ihs [-h] readfile writefile

Positional arguments:

`readfile`	input ihs file
`writefile`	delihh file to write

ihh_from_xp

extract per-pop iHH values from XP-EHH and write to individual files to facilitate arbitrary population comparisons

usage: composite.py ihh_from_xp [-h] inXpehh outIhh takePop

Positional arguments:

`inXpehh`	input xpehh file
`outIhh`	write to file
`takePop`	write for first (1) or second (2) pop in XP-EHH file?

xp_from_ihh

Calculate XP-EHH based on two per-pop iHH files.

usage: composite.py xp_from_ihh [-h] [--cmsdir CMSDIR] [--printOnly]
                                inIhh1 inIhh2 outfilename

Positional arguments:

`inIhh1`	input ihh file 1
`inIhh2`	input ihh file 2
`outfilename`	write to file

Options:

`--cmsdir=/idi/sabeti-scratch/jvitti/cms/cms/`
	TEMPORARY, will become redundant with conda packaging
`--printOnly=False`
	print rather than execute pipeline commands

composite_sims

calculate composite scores for simulations

usage: composite.py composite_sims [-h] [--regional_cms]
                                   [--scenariopop SCENARIOPOP]
                                   [--cmsdir CMSDIR] [--printOnly]
                                   [--nrep_sel NREP_SEL]
                                   [--nrep_neut NREP_NEUT]
                                   [--likes_masterDir LIKES_MASTERDIR]
                                   [--likes_nonSel LIKES_NONSEL]
                                   [--likes_freqSuffix LIKES_FREQSUFFIX]
                                   [--cutoffline CUTOFFLINE]
                                   [--includeline INCLUDELINE]
                                   [--writedir WRITEDIR]
                                   [--runSuffix RUNSUFFIX] [--simpop SIMPOP]
                                   [--model MODEL] [--checkOverwrite]

Options:

`--regional_cms=False`
	calculate within-region CMS rather than genome-wide CMS
`--scenariopop`	sel directory
`--cmsdir=/idi/sabeti-scratch/jvitti/cms/cms/`
	TEMPORARY, will become redundant with conda packaging
`--printOnly=False`
	print rather than execute pipeline commands
`--nrep_sel=500`	Undocumented
`--nrep_neut=1000`
	Undocumented
`--likes_masterDir=/n/regal/sabeti_lab/jvitti/clear-synth/sims_reeval/likes_masters/`
	location of likelihood tables, defined
`--likes_nonSel=vsNeut`
	do we use completely neutral, or linked neutral SNPs for our non-causal distributions? by default, uses strict neutral (CMSgw)
`--likes_freqSuffix=allFreqs`
	for causal SNPs, include suffix to specify which selbins to include
`--cutoffline=250000 1250000 0 1`
	specify bounds to include/exclude in calculations, along with MAF filter and likes decomposition
`--includeline=0 0 0 0 0 0`
	specify (0:yes; 1:no) which scores to include: iHS, ...
`--writedir=`	specify relative path
`--runSuffix`	add a suffix to .cms file (corresponds to runparamfile)
`--simpop=1`	simulated population
`--model=nulldefault`
	Undocumented
`--checkOverwrite=False`
	Undocumented

composite_emp

calculate composite scores for empirical data

usage: composite.py composite_emp [-h] [--score_basedir SCORE_BASEDIR]
                                  [--regional_cms_chrom REGIONAL_CMS_CHROM]
                                  [--composite_writedir COMPOSITE_WRITEDIR]
                                  [--cmsdir CMSDIR] [--printOnly]
                                  [--likes_masterDir LIKES_MASTERDIR]
                                  [--likes_nonSel LIKES_NONSEL]
                                  [--likes_freqSuffix LIKES_FREQSUFFIX]
                                  [--cutoffline CUTOFFLINE]
                                  [--includeline INCLUDELINE]
                                  [--writedir WRITEDIR]
                                  [--runSuffix RUNSUFFIX] [--simpop SIMPOP]
                                  [--emppop EMPPOP] [--model MODEL]
                                  [--checkOverwrite]

Options:

`--score_basedir=/n/regal/sabeti_lab/jvitti/clear-synth/1kg_scores/`
	Undocumented
`--regional_cms_chrom`
	if included, calculate within-region CMS (rather than CMS_gw) for specified bounds at this chromosome
`--composite_writedir=`
	write output to
`--cmsdir=/idi/sabeti-scratch/jvitti/cms/cms/`
	TEMPORARY, will become redundant with conda packaging
`--printOnly=False`
	print rather than execute pipeline commands
`--likes_masterDir=/n/regal/sabeti_lab/jvitti/clear-synth/sims_reeval/likes_masters/`
	location of likelihood tables, defined
`--likes_nonSel=vsNeut`
	do we use completely neutral, or linked neutral SNPs for our non-causal distributions? by default, uses strict neutral (CMSgw)
`--likes_freqSuffix=allFreqs`
	for causal SNPs, include suffix to specify which selbins to include
`--cutoffline=250000 1250000 0 1`
	specify bounds to include/exclude in calculations, along with MAF filter and likes decomposition
`--includeline=0 0 0 0 0 0`
	specify (0:yes; 1:no) which scores to include: iHS, ...
`--writedir=`	specify relative path
`--runSuffix`	add a suffix to .cms file (corresponds to runparamfile)
`--simpop=1`	simulated population
`--emppop=YRI`	empirical population
`--model=nulldefault`
	Undocumented
`--checkOverwrite=False`
	Undocumented

normsims_genomewide

normalize simulated composite scores to neutral

usage: composite.py normsims_genomewide [-h] [--nrep_sel NREP_SEL]
                                        [--nrep_neut NREP_NEUT]
                                        [--writedir WRITEDIR]
                                        [--runSuffix RUNSUFFIX]
                                        [--simpop SIMPOP] [--model MODEL]
                                        [--checkOverwrite]

Options:

`--nrep_sel=500`	Undocumented
`--nrep_neut=1000`
	Undocumented
`--writedir=`	specify relative path
`--runSuffix`	add a suffix to .cms file (corresponds to runparamfile)
`--simpop=1`	simulated population
`--model=nulldefault`
	Undocumented
`--checkOverwrite=False`
	Undocumented

normemp_genomewide

normalize CMS scores to genome-wide

usage: composite.py normemp_genomewide [-h] [--score_basedir SCORE_BASEDIR]
                                       [--cmsdir CMSDIR] [--printOnly]
                                       [--writedir WRITEDIR]
                                       [--runSuffix RUNSUFFIX]
                                       [--emppop EMPPOP]

Options:

`--score_basedir=/n/regal/sabeti_lab/jvitti/clear-synth/1kg_scores/`
	Undocumented
`--cmsdir=/idi/sabeti-scratch/jvitti/cms/cms/`
	TEMPORARY, will become redundant with conda packaging
`--printOnly=False`
	print rather than execute pipeline commands
`--writedir=`	specify relative path
`--runSuffix`	add a suffix to .cms file (corresponds to runparamfile)
`--emppop=YRI`	empirical population

hapviz

Visualize haplotypes for region

usage: composite.py hapviz [-h] [--startpos STARTPOS] [--endpos ENDPOS]
                           [--corepos COREPOS] [--title TITLE]
                           [--annotate ANNOTATE] [--maf MAF] [--dpi DPI]
                           inputfile out

Positional arguments:

`inputfile`	input tped
`out`	save image as file

Options:

`--startpos`	define physical bounds of region
`--endpos`	define physical bounds of region
`--corepos=-1`	partition haplotypes based on allele status at this position
`--title`	title to give to plot
`--annotate`	tab-delimited file where each line gives <chr.pos> <annotation>
`--maf`	filter on minor allele frequency (e.g. .01, .05)
`--dpi=300`	image resolution

ml_region

machine learning algorithm (within-region)

usage: composite.py ml_region [-h]

ucsc_viz

Generate trackfiles of CMS scores for visualization in the UCSC genome browser.

usage: composite.py ucsc_viz [-h] [--posIndex POSINDEX]
                             [--scoreIndex SCOREINDEX] [--strip_header]
                             infile_prefix outfile

Positional arguments:

`infile_prefix`	prefix of file containing scores to be reformatted (e.g. ‘score_chr’ for files named scores_chr#.txt)
`outfile`	file to write

Options:

`--posIndex=1`	index for column of datafile containing physical position (zero-indexed)
`--scoreIndex=-2`
	index for column of datafile containing score (zero-indexed)
`--strip_header=False`
	if input files include header line