This is the README file for the program MDIV.

 Disclaimer and copyright
	
	MDIV is copyrighted (c) by Rasmus Nielsen 2001.  Any injury or
	loss due to the use of this software is not the responsibility
	of the author.  This software is provided "as is" without any
	express or implied warranties, including, without limitation,
	the implied warranties of merchantability and fitness for a
	particular purpose.

Infile Format

The infile format is standard PHYLIP format, with one extra line at the end of the file
giving the number of sequences from each population. For example, if you have 14
sequences from one population and 9 sequences from another population, you would put
the 14 sequences at the top of the file and the remaining 9 sequences after them.
In addition, you would add a new line with the numbers '14 9' at the end of the file.
The program can then figure out which sequences were sampled from which population.
The name of the input file should be 'infile.txt'.  An example file should be
distributed with this README file.  Remember that the sequences must be compatible
with the infinite sites model; if they are not, the program will terminate with an
error message.
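
As a rough illustration of the layout described above, the following Python sketch
builds the text of an infile from two lists of sequences.  It is not part of MDIV.
The function name format_infile, the sample names, and the exact PHYLIP details (a
header line with the number of sequences and sites, names padded to 10 characters) are
assumptions; compare the result with the example file distributed with the program.

```python
def format_infile(pop1, pop2):
    """Return infile text for two populations.

    pop1 and pop2 are lists of (name, sequence) pairs.  The sequences of
    pop1 come first, then those of pop2, then a final line holding the
    two sample sizes.
    """
    records = list(pop1) + list(pop2)
    if not records:
        raise ValueError("no sequences given")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("all sequences must have the same length")

    # Assumed PHYLIP header: number of sequences, then number of sites.
    lines = [f"{len(records)} {length}"]
    for name, seq in records:
        # Assumed layout: name truncated/padded to 10 characters, then the sequence.
        lines.append(f"{name[:10]:<10}{seq}")
    # Extra line required by the program: sample size of each population.
    lines.append(f"{len(pop1)} {len(pop2)}")
    return "\n".join(lines) + "\n"


# Hypothetical toy data: 2 sequences from one population, 1 from the other.
example = format_infile([("a1", "ACGT"), ("a2", "ACGA")], [("b1", "ACGT")])
print(example, end="")
```

To create the input file, the returned text would be written to 'infile.txt'.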

Running the program

When you run the program, you are first asked which model to use.  The infinite sites
model assumes that at most one mutation occurs at each site.  The HKY model makes no such
assumption, but takes into account the possibility of multiple hits, differences in the
nucleotide frequencies, and the presence of a transition/transversion bias.  It is faster
to run the program under the infinite sites model, but the HKY model may provide a more
accurate description of DNA sequence evolution.  Under the HKY model, a uniform (0, 100)
prior is assumed for kappa, the parameter relating to the transition/transversion bias.
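
Since the infinite sites model allows at most one mutation per site, no site in a
compatible data set can show more than two different bases.  The Python sketch below
checks this necessary condition before a run.  It is not part of MDIV, the function
name is hypothetical, and the program's own compatibility check may be stricter than
this one.

```python
def sites_with_more_than_two_states(sequences):
    """Return the 0-based positions at which more than two distinct
    characters occur across the aligned sequences."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    bad_sites = []
    for pos in range(length):
        # Every character counts as a state here, including gaps or
        # ambiguity codes (an assumption of this sketch).
        states = {seq[pos].upper() for seq in sequences}
        if len(states) > 2:
            bad_sites.append(pos)
    return bad_sites


# Hypothetical toy alignment: the last site shows T, A and C.
print(sites_with_more_than_two_states(["ACGT", "ACGA", "ACGC"]))
```

An empty list means this particular condition is satisfied; it does not by itself
guarantee that the program will accept the data under the infinite sites model.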

After choosing the model, you are prompted for five numbers.  
(1) A seed for the random number generator (a positive integer).
(2) The length of the Markov chain (a positive integer).  A sufficient length is, in
	many cases, 5,000,000 cycles.  However, for large data sets you may need more.
	I STRONGLY recommend that you run multiple chains with different random seeds.  If
	the outcomes of all chains are identical, this suggests that you have run
	enough cycles.  However, if the results vary depending on the seed, you definitely
	need to run longer chains (more cycles).
(3) The burn-in time.  To avoid dependence on initial conditions, it is useful not to
	sample data from the first cycles.  A reasonable burn-in time seems to be 10% of
	the total number of cycles.  At present I have not implemented a method for
	automating the choice of the number of cycles and the burn-in time.  I would very
	much like to do this at some point in the future.
(4) The maximum value for M, Mmax.  M is the scaled migration rate
	[2*(effective pop. size)*(migration rate)].  The program assumes a uniform prior
	for M, between 0 and Mmax.  You need to specify Mmax.  A reasonable value in most
	cases is
	Mmax = 10.  However, if your estimate of M is close to Mmax, you should run the
	program again using a larger value of Mmax.  Setting Mmax = 0 corresponds to a
	model with no migration between the populations (just divergence).
(5) The maximum value for T, Tmax.  T is the scaled divergence time
	[(divergence time)/(2*(effective population size))].  A uniform prior is also
	assumed for T.  Choose the value of Tmax depending on your prior beliefs.  Can you
	exclude very large values of T?  Then it might be reasonable to set Tmax = 5
	or 10.  However, if you don't feel you can exclude very large values of T a priori,
	you might want to choose a larger value of Tmax.  Notice that the program might
	converge very slowly if you have very little information regarding T in the data
	and choose a large value of Tmax.
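
The recommendations above (several chains with different seeds, a burn-in of 10% of
the chain length, Mmax = 10) can be collected in a small planning sketch.  This Python
code is not part of MDIV and does not run the program; it only lists the five numbers
to enter at the prompts for each chain.  The function name, the example seeds, and the
default Tmax = 10 are assumptions made for illustration.

```python
def plan_runs(seeds, chain_length=5_000_000, m_max=10.0, t_max=10.0,
              burn_in_percent=10):
    """Return one dictionary of prompt answers per random seed."""
    if chain_length <= 0:
        raise ValueError("chain length must be a positive integer")
    if any(seed <= 0 for seed in seeds):
        raise ValueError("seeds must be positive integers")
    if len(set(seeds)) != len(seeds):
        raise ValueError("use a different seed for every chain")

    # Integer arithmetic keeps the burn-in an exact number of cycles.
    burn_in = chain_length * burn_in_percent // 100
    return [
        {"seed": seed, "chain_length": chain_length, "burn_in": burn_in,
         "m_max": m_max, "t_max": t_max}
        for seed in seeds
    ]


# Three chains that differ only in their (hypothetical) seeds.
for run in plan_runs([11, 23, 47]):
    print(run["seed"], run["chain_length"], run["burn_in"],
          run["m_max"], run["t_max"])
```

If the estimates from such chains disagree, the chain length should be increased and
the chains run again.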

Estimation for smaller data sets should take no more than 15-20 minutes, but it may take
many hours for larger data sets. If you have more than, say, 50 sequences, you should
probably run it on a fast computer overnight.

Interpreting the output

The output of the program is three distributions. First the posterior distribution of
theta (4*(effective pop. size)*(mutation rate)), then the posterior distribution of
M (2*(effective pop. size)*(migration rate)) and finally the posterior distribution
of T ((divergence time)/(2*(effective population size))).
The mode of each distribution provides an estimate of the parameter (an estimate based
on maximum posterior probability). A Bayesian confidence interval, called a credibility
interval, for each parameter can also be obtained from the posterior distributions
(see Nielsen and Wakeley 2001). Finally the program also prints out an
estimate of the expected time to the most recent common ancestor (TMRCA) of all the
sequences given the data.
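
The following Python sketch shows one way to summarize a posterior distribution given
as values on a grid, and to convert the scaled parameters back to unscaled quantities
using the definitions above.  It is not part of MDIV and does not read the program's
output format; the function names are hypothetical.  The interval is the range of the
highest-density grid points that together hold the requested probability mass, which
is only one of several possible credibility intervals and may differ from the one used
in Nielsen and Wakeley (2001).  The conversion assumes a known mutation rate on the
same scale as theta (per generation, for the whole sequence).

```python
def summarize_posterior(values, densities, mass=0.95):
    """Return (mode, (lower, upper)) for a posterior given on a grid."""
    if not values or len(values) != len(densities):
        raise ValueError("values and densities must be non-empty and equal in length")
    total = sum(densities)
    if total <= 0:
        raise ValueError("densities must sum to a positive number")
    probs = [d / total for d in densities]

    # Mode: the grid value with the highest posterior probability.
    mode_index = max(range(len(values)), key=lambda i: probs[i])

    # Add grid points from most to least probable until 'mass' is covered.
    order = sorted(range(len(values)), key=lambda i: probs[i], reverse=True)
    included, covered = [], 0.0
    for i in order:
        included.append(i)
        covered += probs[i]
        if covered >= mass:
            break
    return values[mode_index], (values[min(included)], values[max(included)])


def unscale(theta, m_scaled, t_scaled, mutation_rate):
    """Invert theta = 4*N*mu, M = 2*N*m and T = t/(2*N)."""
    pop_size = theta / (4.0 * mutation_rate)      # effective population size N
    migration_rate = m_scaled / (2.0 * pop_size)  # m
    divergence_time = t_scaled * 2.0 * pop_size   # t, in generations
    return pop_size, migration_rate, divergence_time


# Hypothetical toy posterior for T on a five-point grid.
mode, interval = summarize_posterior([0.5, 1.0, 1.5, 2.0, 2.5],
                                     [1, 4, 10, 4, 1], mass=0.8)
print(mode, interval)
```

For a posterior with more than one peak, the reported range may include low-density
values between the peaks, so it is worth inspecting the distribution directly as well.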

If you need help running the program or have comments, contact me at rn28@cornell.edu.

When publishing results based on the program, please cite Nielsen, R. and J. W. Wakeley.
2001. Distinguishing Migration from Isolation: an MCMC Approach. Genetics 158: 885-896.

Rasmus Nielsen, 2001