This is the README file for the program MDIV.

Disclaimer and copyright

MDIV is copyrighted (c) by Rasmus Nielsen 2001. Any injury or loss due to the use of this software is not the responsibility of the author. This software is provided "as is" without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.

Infile Format

The infile format is the standard phylip format, but with the number of sequences from each population added at the end of the file. For example, if you have 14 sequences from one population and 9 sequences from another population, you would put the 14 sequences at the top of the file and the remaining 9 sequences at the end of the file. In addition, you would add a new line with the numbers '14 9' at the very end of the file. The program can then figure out which sequences are sampled from which population. The name of the input file should be 'infile.txt'. An example file should be distributed with this README file. Remember that the sequences must be compatible with the infinite sites model; if they are not, the program will terminate with an error message.

Running the program

When you run the program, you will first be asked which model to use. The infinite sites model assumes that at most one mutation occurs at each site. The HKY model makes no such assumption, but takes into account the possibility of multiple hits, differences in the nucleotide frequencies, and the presence of a transition/transversion bias. It is faster to run the program under the infinite sites model, but the HKY model may provide a more accurate description of DNA sequence evolution. Under the HKY model, a uniform (0, 100) prior is assumed for kappa, the parameter relating to the transition/transversion bias.

After choosing the model, you are prompted for five numbers.

(1) A seed for the random number generator (a positive integer).

(2) The length of the Markov chain (a positive integer).
A sufficient length is, in many cases, 5,000,000 cycles. However, for large data sets you may need more. I STRONGLY recommend that you run multiple chains with different random seeds. If the outcomes from all chains are essentially identical, this suggests that you have run enough cycles. However, if the results vary depending on the seed, you definitely need to run longer chains (more cycles).

(3) The burn-in time. To avoid dependence on initial conditions, it is useful not to sample data from the first cycles. A reasonable burn-in time seems to be 10% of the total number of cycles. At present I have not implemented a method for automating the choice of the number of cycles and the burn-in time. I would very much like to do this at some point in the future.

(4) The maximum value for M, Mmax. M is the scaled migration rate [2*(effective pop. size)*(migration rate)]. The program assumes a uniform prior for M, between 0 and Mmax. You need to specify Mmax. A reasonable value in most cases is Mmax = 10. However, if your estimate of M is close to Mmax, you should run the program again using a larger value of Mmax. Setting Mmax = 0 corresponds to a model with no migration between the populations (just divergence).

(5) The maximum value for T, Tmax. T is the scaled divergence time [(divergence time)/(2*(effective population size))]. A uniform prior, between 0 and Tmax, is also assumed for T. Choose the value of Tmax depending on your prior beliefs. Can you exclude very large values of T? Then it might be reasonable to set Tmax = 5 or 10. However, if you don't feel you can exclude very large values of T a priori, you might want to choose a larger value of Tmax. Notice that the program might converge very slowly if you have very little information regarding T in the data and choose a large value of Tmax.

Estimation for smaller data sets should take no more than 15-20 minutes, but it may take many hours for larger data sets.
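
As an illustration of the five numbers described above, here is a minimal Python sketch that assembles them for several replicate chains that differ only in their seed. It does not call MDIV and is not part of the program. The names RunSettings, make_settings, replicate_settings and near_upper_bound are hypothetical, the defaults simply restate the suggestions in this README (5,000,000 cycles, 10% burn-in, Mmax = 10, Tmax = 10), and the 10% "close to Mmax" threshold is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSettings:
    """The five numbers the program prompts for (field names are hypothetical)."""
    seed: int
    chain_length: int
    burn_in: int
    m_max: float
    t_max: float


def make_settings(seed, chain_length=5_000_000, burn_in_fraction=0.10,
                  m_max=10.0, t_max=10.0):
    """Build one set of answers, using the README's suggested values as defaults."""
    if seed <= 0 or chain_length <= 0:
        raise ValueError("seed and chain length must be positive integers")
    if not 0.0 <= burn_in_fraction < 1.0:
        raise ValueError("burn-in must be a fraction of the chain in [0, 1)")
    # Assumption: Tmax must be positive; the README only documents Mmax = 0
    # (no migration) as a special case.
    if m_max < 0 or t_max <= 0:
        raise ValueError("Mmax must be >= 0 and Tmax must be > 0")
    burn_in = round(chain_length * burn_in_fraction)
    return RunSettings(seed, chain_length, burn_in, m_max, t_max)


def replicate_settings(seeds, **kwargs):
    """Same settings for every chain, differing only in the random seed."""
    return [make_settings(seed, **kwargs) for seed in seeds]


def near_upper_bound(estimate, upper, tolerance=0.1):
    """True if an estimate lies within tolerance * upper of the prior bound.

    The 10% default is an arbitrary illustrative threshold, not an MDIV rule.
    """
    return estimate >= (1.0 - tolerance) * upper


if __name__ == "__main__":
    for run in replicate_settings([101, 202, 303]):
        print(run)
    # An estimate of M this close to Mmax suggests rerunning with a larger Mmax.
    print(near_upper_bound(9.6, 10.0))
```

If the estimates from the replicate chains disagree, lengthen the chains. If near_upper_bound flags the estimate of M, rerun with a larger Mmax, as recommended under (4).
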
If you have more than, say, 50 sequences, you should probably run it on a fast computer overnight.

Interpreting the output

The output of the program is three distributions: first the posterior distribution of theta (4*(effective pop. size)*(mutation rate)), then the posterior distribution of M (2*(effective pop. size)*(migration rate)), and finally the posterior distribution of T ((divergence time)/(2*(effective population size))). The mode of each distribution provides an estimate of the parameter (an estimate based on maximum posterior probability). A Bayesian confidence interval, called a credibility interval, for each parameter can also be obtained from the posterior distributions (see Nielsen and Wakeley 2001). Finally, the program also prints out an estimate of the expected time to the most recent common ancestor (TMRCA) of all the sequences given the data.

If you need help running the program or have comments, contact me at rn28@cornell.edu.

When publishing results based on the program, please cite:

Nielsen, R. and J. W. Wakeley. 2001. Distinguishing Migration from Isolation: an MCMC Approach. Genetics 158: 885-896.

Rasmus Nielsen, 2001
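
As an illustration of the "Interpreting the output" section, the following Python sketch shows one way a mode and a credibility interval could be read off a discretized posterior distribution. It is a sketch under stated assumptions, not the method used by MDIV or by Nielsen and Wakeley (2001). The layout of the printed distributions is not described in this README, so the sketch assumes two equal-length lists: parameter values on an evenly spaced grid and the corresponding (possibly unnormalized) posterior densities. The interval is equal-tailed, which is only one common choice. The function names are hypothetical.

```python
def posterior_mode(values, density):
    """Return the grid value with the highest posterior density."""
    best = max(range(len(density)), key=density.__getitem__)
    return values[best]


def credible_interval(values, density, level=0.95):
    """Equal-tailed credibility interval from a discretized posterior.

    Assumes an evenly spaced grid, so each density value can be treated as
    proportional to the probability mass at its grid point.
    """
    total = sum(density)
    lower_tail = (1.0 - level) / 2.0
    upper_tail = 1.0 - lower_tail
    cumulative = 0.0
    low = high = None
    for value, d in zip(values, density):
        cumulative += d / total
        if low is None and cumulative >= lower_tail:
            low = value
        if cumulative >= upper_tail:
            high = value
            break
    if high is None:
        # Guard against rounding error keeping the running sum just below 1.
        high = values[-1]
    return low, high


if __name__ == "__main__":
    # Made-up numbers for illustration only, not MDIV output.
    grid = [0.0, 1.0, 2.0, 3.0, 4.0]
    dens = [1.0, 2.0, 4.0, 2.0, 1.0]
    print(posterior_mode(grid, dens))
    print(credible_interval(grid, dens, level=0.5))
```

If the mode or an interval endpoint sits at the upper end of the grid, the prior bound (Mmax or Tmax) is probably too small and the run should be repeated with a larger value.
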