(c) Copyright  1986-1993  by  Joseph  Felsenstein  and  by  the  University  of
Washington.  Written by Joseph Felsenstein.  Permission is granted to copy this
document provided that no fee is charged for it and that this copyright  notice
is not removed.


                      INTERLEAVED AND SEQUENTIAL FORMATS

     The sequences can continue over multiple lines;  when  this  is  done  the
sequences  must  be  either  in  "interleaved" format, similar to the output of
alignment programs, or "sequential" format.  These are described  in  the  main
document  file.  In sequential format all of one sequence is given, possibly on
multiple lines, before the next starts.  In interleaved format the  first  part
of  the  file  should  contain  the  first  part of each of the sequences, then
possibly a line containing nothing but a carriage-return  character,  then  the
second part of each sequence, and so on.  Only the first parts of the sequences
should be preceded by names.  Here is a  hypothetical  example  of  interleaved
format:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

while in sequential format the same sequences would be:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA

Note, of course, that a portion of a sequence like this:

   300   AAGCGTGAAC GTTGTACTAA TRCAG

is perfectly legal, assuming that the species name  has  gone  before,  and  is
filled  out  to  full  length  by  blanks.  The above digits and blanks will be
ignored, the sequence being taken as starting at the first base symbol (in this
case  an  A).  This should enable you to use output from many multiple-sequence
alignment programs with only minimal editing.
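
     The two layouts described above can be read with a few lines of code.  The
following Python sketch is an illustration only and is not the code used by the
programs themselves.  The function names (strip_filler, read_header,
read_sequential, read_interleaved) are hypothetical, and the sketch assumes
that the first line holds only the two counts, that names occupy exactly 10
characters, and that every interleaved group has one line per species.

```python
import re

NAME_WIDTH = 10  # the species name fills the first 10 characters


def strip_filler(chunk):
    """Drop blanks and digits, which are ignored within sequence data."""
    return re.sub(r"[\s0-9]", "", chunk)


def read_header(line):
    """Return (number of species, number of sites) from the first line."""
    n_species, n_sites = line.split()[:2]
    return int(n_species), int(n_sites)


def read_sequential(text):
    """Sequential format: all of one sequence before the next starts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = read_header(lines[0])
    records, pos = [], 1
    for _ in range(n_species):
        name = lines[pos][:NAME_WIDTH].strip()
        seq = strip_filler(lines[pos][NAME_WIDTH:])
        pos += 1
        while len(seq) < n_sites:  # continuation lines carry no name
            seq += strip_filler(lines[pos])
            pos += 1
        records.append((name, seq))
    return records


def read_interleaved(text):
    """Interleaved format: only the first group of lines carries names."""
    # Assumption of this sketch: lines that are empty or hold only blanks
    # are separators between groups and can simply be skipped.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = read_header(lines[0])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_species:
            names.append(line[:NAME_WIDTH].strip())
            seqs.append("")
            line = line[NAME_WIDTH:]
        seqs[i % n_species] += strip_filler(line)
    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
    return list(zip(names, seqs))
```

     Given the same data in either layout, both functions return the same list
of (name, sequence) pairs, with the digits and blanks already stripped out.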

     In interleaved format the present versions of the programs may sometimes
have difficulties with the blank lines between groups of lines.  If so, you
might want to retype those lines, making sure that they contain only a
carriage-return and no blank characters, or you may have to eliminate them
altogether.  The symptom of this problem is that the programs complain that the
sequences are not properly aligned when you can find no other cause for the
complaint.
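
     Instead of retyping the separator lines by hand, the repair can be done
mechanically.  The Python sketch below is one possible way to do it; the
function name fix_blank_lines and its drop parameter are hypothetical and are
not part of the package.  It empties every line that contains only blank
characters, or removes such lines entirely when drop is true.

```python
def fix_blank_lines(text, drop=False):
    """Empty (or remove) every line that contains only blank characters."""
    fixed = []
    for line in text.splitlines():
        if line.strip() == "":
            if drop:
                continue  # eliminate the separator line altogether
            line = ""     # keep the line, but with no blank characters on it
        fixed.append(line)
    # Assumption of this sketch: lines are written back with "\n" endings.
    return "\n".join(fixed) + "\n"
```

     The default keeps the separators, which matches the first remedy suggested
above; calling the function with drop=True corresponds to eliminating them.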


                      INPUT FOR THE DNA SEQUENCE PROGRAMS

     The input format for the DNA sequence programs is standard: the data  have
A's,  G's, C's and T's (or U's).  The first line of the input file contains the
number of species and the number of sites.  As with the other programs, options
information  may  follow  this.   Following  this, each species starts on a new
line.  The first 10 characters of that line are the species name.   There  then
follows  the  base  sequence  of  that species, each character being one of the
letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period
was  also  previously allowed but it is no longer allowed, because it sometimes
is used in different senses in other programs).  Blanks will be ignored, and so
will  numerical  digits.   This  allows GENBANK and EMBL sequence entries to be
read with minimum editing.
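
     The rules for the base sequence can be expressed as a short check.  The
Python sketch below is an illustration under stated assumptions, not the
programs' own input routine: the names ALLOWED and clean_bases are
hypothetical.  It skips blanks and digits, accepts the symbols listed above in
either case, and rejects anything else, including the period.

```python
# The legal symbols listed above; the period is deliberately absent.
ALLOWED = set("ABCDGHKMNORSTUVWXY?-")


def clean_bases(raw):
    """Return the base symbols in raw, ignoring blanks and digits."""
    bases = []
    for ch in raw:
        if ch.isspace() or ch in "0123456789":
            continue  # blanks and numerical digits are ignored
        symbol = ch.upper()  # lower-case input is treated as upper case
        if symbol not in ALLOWED:
            raise ValueError(f"illegal base symbol {ch!r}")
        bases.append(symbol)
    return "".join(bases)
```

     Applied to a numbered line copied from a database entry, the function
returns only the bases, so the position numbers and spacing need not be edited
out first.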

     These characters can be either upper or lower case.  The programs convert
all input characters to upper case and treat them as such.  The characters
constitute the IUPAC (IUB) nucleic acid code plus some slight extensions.  They
enable input of nucleic acid sequences taking full account of any ambiguities
in the sequence.

          Symbol   Meaning
          ------   -------
            A       Adenine
            G       Guanine
            C       Cytosine
            T       Thymine
            U       Uracil
            Y       pYrimidine  (C or T)
            R       puRine      (A or G)
            W       "Weak"      (A or T)
            S       "Strong"    (C or G)
            K       "Keto"      (T or G)
            M       "aMino"     (C or A)
            B       not A       (C or G or T)
            D       not C       (A or G or T)
            H       not G       (A or C or T)
            V       not T       (A or C or G)
          X,N,?     unknown     (A or C or G or T)
            O       deletion
            -       deletion
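
     The table above can be written down as a lookup from each symbol to the
set of bases it may stand for.  The Python sketch below is an illustration
only; the names BASE_SETS and possible_bases are hypothetical.  Two choices in
it are assumptions and do not come from the table: U is kept as its own base
and not merged with T, and a deletion is represented by the single marker "-".

```python
# Each symbol maps to the bases it may represent, following the table.
BASE_SETS = {
    "A": "A", "G": "G", "C": "C", "T": "T",
    "U": "U",                                   # assumption: U kept distinct
    "Y": "CT", "R": "AG", "W": "AT",
    "S": "CG", "K": "TG", "M": "CA",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
    "O": "-", "-": "-",                         # assumption: "-" marks a deletion
}


def possible_bases(symbol):
    """Return the set of bases a symbol may stand for (either case)."""
    return frozenset(BASE_SETS[symbol.upper()])
```

     For example, possible_bases("r") gives the set containing A and G, and the
three "unknown" symbols X, N and ? all give the same four-base set.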