Parallel PopGen Package
|
Important links:
Welcome to the manual for the Parallel PopGen Package (3P)! This is a library of CUDA C/C++ APIs for performing population genetics simulations and analyses on the GPU. Using the 3P APIs, complex, powerful, and fast simulations and analyses can be written in otherwise standard C++ - requiring no knowledge of parallel/GPU programming from the end-user. Below is a brief description of the information found in this manual, organized in the tabs above:
3P currently contains 3 APIs, each in its own namespace:
The relevant header files are in the folder include/
The implementations of these header files are in the source/ folder. All outside libraries used in the Parallel PopGen Package (beyond standard CUDA and C/C++ libs) are stored in the folder outside_libraries/ and are automatically included. The examples in the examples/ folder show how to compile a project which using a 3P API (also covered in Parallel PopGen Package v0.3 README).
For faster GO_Fish simulations:
= 0
. (Pro Tip - Advanced users) In implementation file shared.cuh, there are several variables that can be tuned:
typedef r123::Philox4x32_R<10> P;
: <10> can be set as low as <8> for extra-speed.#define RNG_MEAN_BOUNDARY_NORM 6
: RNG_MEAN_BOUNDARY_NORM
can be set between 1-33#define RNG_N_BOUNDARY_POIS_BINOM 100
: RNG_N_BOUNDARY_POIS_BINOM
can be set between 1-200n
, for the distribution is greater than RNG_N_BOUNDARY_POIS_BINOM, the poisson approximation is used; less than RNG_N_BOUNDARY_POIS_BINOM, the binomial distribution is used. The binomial calculation gets numerically unstable for large n
>> 200.Use the embedded link for all bug reporting, feature requests, to do, and discussion. Below is a description of the two kinds of errors to encounter during API use:
function name: description
. error # s file s line # generation # population #
, where #
= number and s
= string.__DEBUG__
(shared.cuh line 23) to true - the program will run slower, but the error will be guaranteed to be reported with the correct file/line number/etc ... __DEBUG__
set to true ensures that the CPU and GPU are always in-sync where normally the CPU is allowed to race ahead of the GPU and thus by the time an error from the GPU is reported, the CPU may be executing a different piece of code. If problem is rectified, make sure to turn __DEBUG__
back to false to speed up program execution. A reported generation or population of -1 means that generation or population was not relevant to the error. = true
) and the population size is too big, so that the program fails in function initialize_mse in go_fish_impl.cuh before the arrays in struct mutations are allocated, try rescaling the simulation as shown in δaδi-to-GOFish Parameter Conversion - either temporarily just for generation 0 with a couple of burn-in generations after all the parameters are reset (selection too!) or for the whole simulation. If there are too many mutations, simply lower the number of independent sites in the simulation using GO_Fish::allele_trajectories::sim_constants::num_sites.GO_Fish simulations are discrete time/discrete frequency simulations. That said, the parameters of such a simulation can be translated into the continuous time/continuous frequency paradigm of δaδi. In δaδi, the main parameter is Nref
- the reference population size of the initial population at time 0. Time, theta, selection, migration, and the population sizes of later generations/other populations are all in reference to Nref
. The list below covers the population genetics functions/variable in a GO_Fish simulation, and how to match that to the corresponding δaδi parameter. Example2-DaDi gives a specific instance of an analogous δaδi and GO_Fish simulation with population scaling.
Inbreeding functions return a float parameter, F
, between 0 (outbred) and 1 (inbred) inclusive - see Simulation Models: Inbreeding, Mutation, and Dominance Group for more detail. In δaδi, populations are outbred, F = 0
.
Dominance functions return a float parameter, h
, between -inf and +inf - see Simulation Models: Inbreeding, Mutation, and Dominance Group for more detail. hGOFish
=
hδaδi
.
Selection functions return a selection coefficient, which is a float greater than or equal to -1 (lethal) - see Simulation Models: Selection Group for more detail. In δaδi the selection parameter is the effective selection coefficient, gamma
=
2Ns
, where the selection on alleles AA
, Aa
, aa
are 1
, 1+2hs
, 1+2s
and N
is the number of individuals. Thus the equivalent GO_Fish selection coefficient is sGOFish
=
2*sδaδi
. Sim_Model, like δaδi, has constructors in Simulation Models: Selection Group that allow effective selection coefficients to be input into the simulation (e.g. Sim_Model::selection_constant). For an outbred population, F = 0
, gammaGOFish
=
2NsGOFish
=
2*2Nsδaδi
=
2*gammaδaδi
. Effective selection coefficients are invariant to population scaling - see Demography.
Demography functions return a number of individuals, an integer between 0 and max 232-1 inclusive - see Simulation Models: Demography Group for more detail. Like in δaδi, population sizes in simulations can be scaled - doing so requires re-scaling migration rates, the number of generations, and the mutation rate. If taking samples of the populations, it is best to choose a reference population size for the initial population, Nref_GOFish
, such that the size(s) of the sampled population(s) are >> than the sample size, so that using the binomial is an appropriate choice of sampling distribution.
Migration functions return a float, m
, between 0 and 1 inclusive - see Simulation Models: Migration Group for more detail. Both δaδi and GO_Fish use a conservative model of migration. In δaδi, the migration rate is scaled by the population size, so mGOFish
=
mδaδi
), where /
(2Nref_GOFishNref_GOFish
is the number of initial individuals in the reference population of the GO_Fish simulation. This automatically scales migration rates for different reference population sizes.
The number of generations in the simulation is a constant integer between 0 and max 232-1 inclusive - use GO_Fish::allele_trajectories::sim_constants::num_generations to set the number of generations for a simulation. In δaδi, time T
is measured in units of 2Nref
, so the corresponding number of generations in GO_Fish would be T*2Nref_GOFish
. Thus, the number of generations in a simulation can be scaled by the population size. Similarly in δaδi Integration.py, one can set the variable dadi.Integration.set_timescale_factor
=
#
where #
=
1
) so that a time step in the integration routine is equal to a discrete generation in the simulation. /
(num_generations
Mutation functions return a per-site, per-generation mutation rate, which a float parameter greater than 0 - see Simulation Models: Inbreeding, Mutation, and Dominance Group for more detail. If the simulation population sizes (Nref_GOFish
) have been scaled relative to the actual population sizes N
, then, to maintain the same polymorphism level (theta), set muref
= muactual*N
In δaδi, /Nref_GOFish
.thetaref
, defined as 4Nrefmu
, is often (not always) defined to be 1 for the diffusion calculations, then later fit to the data. GO_Fish does not actually use theta as a relevant parameter in the simulations, but the fitted δaδi mutation rates should be equivalent to the pre-scaled GO_Fish muactual
.
The number of sites in the simulation is a constant float between 0 and +inf - use GO_Fish::allele_trajectories::sim_constants::num_sites to set the number of independent sites for a simulation. While δaδi doesn't have this parameter, grid_size is somewhat analogous in that the more sites there are in a GO_Fish simulation, then the less noise, the more precise the output of the simulation will be - similarly, the finer the grid size in δaδi, then the more accurate, the more precise the integration. Other than being parameters of precision in their respective tools, they are otherwise quite different in intent.