Matthieu Foll
September, 2008
matthieu.foll@zoo.unibe.ch

This program BayeScan identifies candidate loci under natural selection, from either dominant (AFLP) or codominant markers (SNP, microsatellites). The method is described in Foll, M. and O. E. Gaggiotti, 2008 Genetics 180(2).

Interpretation of the results: Selection gives a posterior probability that each locus is subject to selection. This probability can NOT be compared directly to a p-value like in the Fdist software by Mark Beaumont. In Bayesian statistics, decision on model choice should be performed using the so-called “Bayes factors”. Given a model selection problem in which we have to choose between two models M1 and M2 (neutral or selected), on the basis of a data vector N, the Bayes factor BF is given by BF=P(M2|N)/P(M1|N) * P(M1)/P(M2). Because here we set P(M1)=P(M2)=1/2, the Bayes factor is simply given by the ratio of posterior model probabilities P(M2|N)/P(M1|N). The Bayes factor provides a scale of evidence in favor of one model versus another. For example, BF=2 indicates that the data favor model M2 over model M1 at odds of two to one. The next table gives Jeffreys' scale of evidence for Bayes factors:

P(α≠0)

Bayes Factor (BF)

log10(BF)

Jeffreys' interpretation

0.50

0.76

1

3

0

0.5

Barely worth mentioning

0.76

0.91

3

10

0.5

1

Substantial

0.91

0.97

10

32

1

1.5

Strong

0.97

0.99

32

100

1.5

2

Very strong

0.99

1.00

100

2

Decisive

 

As a result, a Bayes factor of 0.5 corresponding to a posterior probability of 0.76, is already considered as being a “substantial” evidence for selection, although a corresponding p-value in Fdist would be considered as a very weak signal of selection.

1.    Main

Note that in Selection, Log10(BF) is set to 1000 when p=1 (should be infinity).

The program needs one input file for the multilocus allele (or band for AFLPs) counts (see input file description section). By default the file is named alleles.txt and located in the directory of the program. You can browse for your file with the button on the right of the line. Note that UNICODE files are not recognized (be careful when exporting from Excel).

There is no limitation for the number of loci, populations and alleles, but your computer’s memory.

An R converter script is also provided to easily convert data from the matrix format of AFLPDAT (Ehrich 2006 Molecular Ecology Notes 6: 603-604).

You can change the prefix of output files. By default it is “output” and the files will be named: output.sel, outpur_fst.txt and output_Verif.txt and optionally: output_AccRte.txt, output_popNN.txt and output _ancestral.txt.

When ready, click on the Start button. The program opens a text file that summarizes the parameters you chose and the input data so you can check that everything is correct. A bar indicates the progress of the process. If there are pilot runs (present by default), there are two steps. The estimated remaining time is indicated at the left of the bar.

 You can stop the process at any moment. Output files are valid and contain what has been done up to the time you stopped the process.

You can plot results with the Plot button, note however that this is done automatically at the end of a run. All the results are automatically written in the file output_res.txt at this time.

Most of the results you will need are in the “output_fst.txt” file created by the main program automatically at the end of the calculation. You can also save it manually afterward in the plot program when opening the output.sel file (file > save results as...).

2.    Options

The sample size is the number of iterations the program will write out and use for estimation. The thinning interval is the number of iterations between two samples. This reduces the autocorrelation of the data generated from a Markov chain. The total number of iterations is the sample size time the thinning interval. Proposal distributions have to be adjusted in order to have acceptance rates between 0.25 and 0.45. These values are tuned on the basis of short pilot runs: by default we run 5000 iterations, and for each parameter, the proposal is adjusted in order to reduce or increase the acceptance rate. We make by default 10 such pilot runs before starting the sampling, which also play the role of burn-in period. At the same time, we can choose the proposal distribution for the reversible jump. The best choice would be to take the full conditional distribution of alpha. Because we don't know this distribution, we use the pilot runs to have an estimation of the mean and variance for all alpha’s under the saturated model (which contains all the alpha’s parameters). Then we use a normal proposal with these means and variances, which is generally close to the full conditional distribution.

If we choose no pilots runs, you can fix the parameters of all proposal distributions. Otherwise, values given are just starting points and are modified by the pilot runs. We suggest to always using pilot runs.

Finally, you can tell the program to write out some additional files:

·       The allele frequencies of the ancestral population (can be a very big file) and also of each local population for dominant markers.

·       The results of the pilot runs.

·       The evolution of the acceptance rates for all parameters.

·       A file to check that the input files are read correctly and the parameters of the run.

 

3.    Plot

When you click on the Plot button, a new window is opened and the main results are drawn. .sel files are associated with the plot program so you can just double click on these files to plot results. You can also launch the external plot program; in this case, no file is opened by default and you can open the output. When you open a file, the number of populations is asked. You can also drag and drop on the plot program icon the output.sel file. Allele frequency file can also be opened. In fact, you can open any file where data are written in columns.

There are two tabs:

·         The details tab allows drawing different plots (trace, mean, histogram, density) of all parameters. It gives statistics (mean, variance, HPDI…) and you can add burn in or remove data at the end (“burn out”). When opening an allele frequency file, you can choose the locus and allele to plot. The mode estimation is based on a Gaussian kernel.

·         The summary tab plots estimated fst against the log10 of the Bayes factors. Each point represents a single locus, and you can make appear their label in different ways using the options on the top of the screen. You can draw a vertical bar corresponding to a given threshold value. It is also possible to visualize a specific loci on the plot by typing its number in the space provided in the upper-left corner of the graphical interface and then pressing the add button.

 

The reload button is used when you plot a file where the calculation is still in progress or if you modify the burn in / burn out and want to see the results in the summary tab.

You can zoom on a part of the graph by drawing a square (with left mouse button pressed) from top-left corner to bottom-right corner. You can restore the original zoom by drawing the square from the bottom-write to the top-left corner.

You can move in the graph by clicking on the middle mouse button and dragging it. The right button also opens a popup menu in which you can modify the plot properties (color, background, grid, point and/line).

The summary graph can be saved as a bitmap image. A R script producing the same graph but in better quality (for printing/publishing purpose) is provided.

4.    Input files

Dominant markers:

The input file format does not require the full matrix of individuals and loci, but only, for each locus in each population, the number of individuals that has a band. An example is given in the file “alleles_aflp.inf”. You must indicate the number I of loci with “[loci]=I”, and of populations J with “[populations]=J”. Then for each population j from 1 to J, you must start with “[pop]=j” and write for each locus i from 1 to I, a line of the form: “i      indiv      bands”, with “indiv” being the number of individuals observed at locus i and “bands” the corresponding number of individuals that have a band.

Codominant markers:

An example is given in the file “alleles_msats.inf”. The program recognizes keywords in […]. You have to indicate the number of loci and populations before the populations. For each population, there is one line per locus numbered from 1 to the number of loci. Then there is the number of alleles measured for this population at this locus (50 individuals make 100 alleles for diploids) and the number of possible alleles found at this locus (for all populations). After, there is the corresponding allele count. This part must sum to the number of alleles measured. You can write any comments between sections. A file converter from GENPOP input file is given (just drag and drop your GENPOP file on “convert_genpop.exe”).

5.    Output files

In this part, we assume that “output” is the prefix chosen in the main page.

Most of the results you will need are in the “output_fst.txt” file created by the plot program automatically at the end of the calculation. You can also save it manually in the plot program (file > save results as...). For each locus, it gives the posterior probability to be under selection, the estimated alpha coefficient (indicates the strength and direction of selection), and the estimated fst coefficient.

The main output file is named output.sel and can be open by the plot program. Each column contains the evolution of a parameter in this order:

Iteration, sigma˛, loglikelihood, Fis, a, Fst1…FstJ, α0...αI, p1...pI, Fst1...FstI. Note that Fis and a are only useful for dominant markers.

The file output_Verif.txt helps to verify that the input file is read correctly.

The file output_prop.txt gives the results of the pilot runs.

The file output_AccRte.txt gives the evolution of the acceptance rate for all parameters during the process.

The file output_ancestral.txt contains the allele frequencies for the ancestral population, and the files output _popXX.txt for the local populations in the case of dominant markers.