(c) Copyright 1986-1995 by Joseph Felsenstein and the University of Washington.
Permission is granted to copy this document provided that no fee is charged for
it and that this copyright notice is not removed.

Input File Format
----- ---- ------

     I have tried to adhere to a rather stereotyped input and output format.
For the parsimony, compatibility and maximum likelihood programs, excluding the
distance matrix methods, the simplest version of the input file looks something
like this:

   6   13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC

The first line of the input file contains the number of species and the
number of characters, in free format, separated by blanks (not by
commas).  The information for each species follows, starting with a
ten-character species name (which can include punctuation marks and blanks),
and continuing with the characters for that species.  In the
discrete-character, DNA and protein sequence programs the characters are each a
single letter or digit, sometimes separated by blanks.  In
the continuous-characters programs they are real numbers with decimal points,
separated by blanks:

Latimeria  2.03  3.457  100.2  0.0  -3.7
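
As an illustration only (this is not code from the programs themselves), the
layout just described can be sketched in Python.  The function names below are
hypothetical, and the sketch assumes that each species line has already been
read into a string.

```python
NAME_WIDTH = 10  # a species name fills exactly the first ten columns


def parse_header(line):
    """Return (number of species, number of characters) from the first line."""
    fields = line.split()  # free format: any run of blanks separates the numbers
    return int(fields[0]), int(fields[1])


def split_species_line(line):
    """Split a species line into its ten-character name and the data after it."""
    return line[:NAME_WIDTH], line[NAME_WIDTH:]


def discrete_states(data):
    """Single-letter or single-digit characters, with any blanks dropped."""
    return [c for c in data if not c.isspace()]


def continuous_values(data):
    """Real numbers separated by blanks."""
    return [float(token) for token in data.split()]
```

Note that the name is taken by column position and not by splitting on blanks,
since a name such as "B. virgini" contains a blank of its own.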

The conventions about continuing the data beyond one line per species
differ between the molecular sequence programs and the others.  The
molecular sequence programs can take the data in "aligned" or "interleaved"
format, with some lines giving the first part of each of the sequences, then
lines giving the next part of each, and so on.  Thus the sequences might look
like this:

   6   39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC

Note that in these sequences we have a blank every ten sites to make them
easier to read: any such blanks are allowed.  The blank line which separates
the two groups of lines (the ones containing sites 1-20 and the ones containing
sites 21-39) may or may not be present, but if it is, it should be a line of
zero length and not contain any extra blank characters (this is because of a
limitation of the current versions of the programs).  It is important that the
number of sites in each group be the same for all species (i.e., it will not be
possible to run the programs successfully if the first species line contains 20
bases, but the first line for the second species contains 21 bases).
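
The interleaved layout can be illustrated with a short Python sketch.  This is
only an assumption about how such a file might be read, not the programs' own
code; the name read_interleaved is hypothetical.  Unlike the programs as
described above, the sketch also tolerates separator lines that contain
blanks.

```python
NAME_WIDTH = 10  # a species name fills exactly the first ten columns


def read_interleaved(text):
    """Return (names, sequences) from interleaved-format text."""
    lines = text.split("\n")
    nspecies, nsites = (int(field) for field in lines[0].split()[:2])
    # Skip separator lines.  This sketch also skips lines holding only
    # blanks, which is more lenient than the programs are described to be.
    body = [line for line in lines[1:] if line.strip() != ""]
    if len(body) % nspecies != 0:
        raise ValueError("number of data lines is not a multiple of the species count")
    names = []
    sequences = [""] * nspecies
    for i, line in enumerate(body):
        species = i % nspecies
        if i < nspecies:                   # only the first group carries the names
            names.append(line[:NAME_WIDTH])
            line = line[NAME_WIDTH:]
        sites = "".join(line.split())      # blanks within the data are ignored
        if species == 0:
            group_length = len(sites)      # every species must match this
        elif len(sites) != group_length:
            raise ValueError(
                "species %d has %d sites in a group where the first has %d"
                % (species + 1, len(sites), group_length))
        sequences[species] += sites
    for name, seq in zip(names, sequences):
        if len(seq) != nsites:
            raise ValueError("%r has %d sites, expected %d" % (name, len(seq), nsites))
    return names, sequences
```

Checking the group lengths as each line is read means that a species line with
one base too many is reported where it occurs, and not as a mismatch in the
total at the end.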

     Alternatively, an option can be selected to take the data in "sequential"
format, with all of the data for the first species, then all of the characters
for the next species, and so on.  This is also the way that the discrete
characters programs and the gene frequencies and quantitative characters
programs want to read the data.  They do not allow the "interleaved" format.

     In the sequential format, the character data can run on to a new line at
any time (except in a species name or, in the case of the continuous character
and distance matrix programs, in the middle of a real number).  Thus it is
legal to have:

Archaeopt 001100
1101

or even:

Archaeopt
0011001101

though note that the FULL ten characters of the species name MUST then be
present: in the above case there must be a blank after the "t".  In all cases
it is possible to put internal blanks between any of the character values, so
that

Archaeopt 0011001101 0111011100

is allowed.
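
The rules for the sequential format can be sketched in Python as follows.
This is a minimal illustration for single-letter or single-digit characters
only, not the programs' own reading routine, and the name read_sequential is
hypothetical.  It assumes that the first line gives the number of species and
the number of characters.

```python
NAME_WIDTH = 10  # a species name fills exactly the first ten columns


def read_sequential(text):
    """Return a dict mapping each ten-character name to its string of states."""
    lines = iter(text.split("\n"))
    nspecies, nchars = (int(field) for field in next(lines).split()[:2])
    data = {}
    for _ in range(nspecies):
        line = next(lines, None)
        if line is None or len(line) < NAME_WIDTH:
            raise ValueError("species name must fill all ten columns")
        name = line[:NAME_WIDTH]
        states = "".join(line[NAME_WIDTH:].split())  # internal blanks are dropped
        while len(states) < nchars:                  # data may continue on new lines
            line = next(lines, None)
            if line is None:
                raise ValueError("file ended before %r was complete" % name)
            states += "".join(line.split())
        if len(states) > nchars:
            raise ValueError("%r has more than %d characters" % (name, nchars))
        data[name] = states
    return data
```

The sketch rejects a name line shorter than ten columns, which is one way of
enforcing the requirement that the blank after the "t" be present.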

If you make an error in the input file, the programs will often detect that
they have been fed an illegal character or illegal numerical value and issue an
error message such as "BAD CHARACTER STATE:", often printing out the bad value,
and sometimes the number of the species and character in which it occurred.
The program will then stop shortly after.  One of the things which can lead to
a bad value is the omission of something earlier in the file, or the insertion
of something superfluous, either of which causes the reading of the file to
get out of synchronization.  The program then starts reading things it didn't
expect, and concludes that they are in error.  So if you see this error
message, you may also want to look for the earlier problem that may have led
to this.
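
A check of this kind can be run on a file before giving it to the programs.
The Python sketch below is an assumption about how one might do so, with a
hypothetical function name; the set of legal states is passed in by the
caller, since it differs from program to program and is not specified here.

```python
def find_bad_states(sequences, legal):
    """Return (species number, character number, value) for each illegal state.

    Species and characters are numbered from 1, in file order.
    """
    problems = []
    for species, states in enumerate(sequences, start=1):
        for position, state in enumerate(states, start=1):
            if state not in legal:
                problems.append((species, position, state))
    return problems
```

If the first problem reported lies far into the file, it is worth inspecting
the lines before it for a missing or superfluous item, for the reason given
above.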

     The other major variation on the input data format is the options
information.  Many options are selected using the menu, but a few are selected
by including extra information in the input file.  Some options are described
below.