(c) Copyright 1986-1993 by Joseph Felsenstein and by the University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed. INTERLEAVED AND SEQUENTIAL FORMATS The sequences can continue over multiple lines; when this is done the sequences must be either in "interleaved" format, similar to the output of alignment programs, or "sequential" format. These are described in the main document file. In sequential format all of one sequence is given, possibly on multiple lines, before the next starts. In interleaved format the first part of the file should contain the first part of each of the sequences, then possibly a line containing nothing but a carriage-return character, then the second part of each sequence, and so on. Only the first parts of the sequences should be preceded by names. Here is a hypothetical example of interleaved format: 5 42 Turkey AAGCTNGGGC ATTTCAGGGT Salmo gairAAGCCTTGGC AGTGCAGGGT H. SapiensACCGGTTGGC CGTTCAGGGT Chimp AAACCCTTGC CGTTACGCTT Gorilla AAACCCTTGC CGGTACGCTT GAGCCCGGGC AATACAGGGT AT GAGCCGTGGC CGGGCACGGT AT ACAGGTTGGC CGTTCAGGGT AA AAACCGAGGC CGGGACACTC AT AAACCATTGC CGGTACGCTT AA while in sequential format the same sequences would be: 5 42 Turkey AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT Salmo gairAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT H. SapiensACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA Chimp AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT Gorilla AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA Note, of course, that a portion of a sequence like this: 300 AAGCGTGAAC GTTGTACTAA TRCAG is perfectly legal, assuming that the species name has gone before, and is filled out to full length by blanks. The above digits and blanks will be ignored, the sequence being taken as starting at the first base symbol (in this case an A). This should enable you to use output from many multiple-sequence alignment programs with only minimal editing. In interleaved format the present versions of the programs may sometimes have difficulties with the blank lines between groups of lines, and if so you might want to retype those lines, making sure that they have only a carriage- return and no blank characters on them, or you may perhaps have to eliminate them. The symptoms of this problem are that the programs complain that the sequences are not properly aligned, and you can find no other cause for this complaint. INPUT FOR THE DNA SEQUENCE PROGRAMS The input format for the DNA sequence programs is standard: the data have A's, G's, C's and T's (or U's). The first line of the input file contains the number of species and the number of sites. As with the other programs, options information may follow this. Following this, each species starts on a new line. The first 10 characters of that line are the species name. There then follows the base sequence of that species, each character being one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period was also previously allowed but it is no longer allowed, because it sometimes is used in different senses in other programs). Blanks will be ignored, and so will numerical digits. This allows GENBANK and EMBL sequence entries to be read with minimum editing. These characters can be either upper or lower case. The algorithms convert all input characters to upper case (which is how they are treated). The characters constitute the IUPAC (IUB) nucleic acid code plus some slight extensions. They enable input of nucleic acid sequences taking full account of any ambiguities in the sequence. Symbol Meaning ------ ------- A Adenine G Guanine C Cytosine T Thymine U Uracil Y pYrimidine (C or T) R puRine (A or G) W "Weak" (A or T) S "Strong" (C or G) K "Keto" (T or G) M "aMino" (C or A) B not A (C or G or T) D not C (A or G or T) H not G (A or C or T) V not T (A or C or G) X,N,? unknown (A or C or G or T) O deletion - deletion