

Class section

The fourth section is optional. It is used for a combined analysis of heterogeneous data sets (e.g., loops and stems of an RNA molecule, protein-coding genes with three codon positions, or concatenated genes with different evolutionary patterns). You can safely skip this section if you plan to study only DNA sequences or only RNA helices (i.e., no "." in the pairing mask) with a single appropriate nucleotide or base-pair substitution model.

The aim of this section is to assign each nucleotide or pair to a class. Each class is expected to have a different pattern of evolution. The section consists of a sequence of integers, one per nucleotide, each giving the class of the corresponding nucleotide. For instance, the class section of a protein-coding gene may look like:
...2 3 1 2 3 1 2 3 1 2 3 1 2 ...
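A class section of this kind could be generated with a short script instead of being typed by hand. The following is a minimal sketch and not part of PHASE: the function names codon_position_classes and format_class_section and the first_position parameter are hypothetical, and the sketch assumes a single coding region whose reading frame is uninterrupted along the alignment.

```python
def codon_position_classes(length, first_position=1):
    """Return one class label (1, 2 or 3) per nucleotide.

    first_position is the codon position (1-3) of the first nucleotide
    of the alignment. Hypothetical helper, not a PHASE function.
    """
    if first_position not in (1, 2, 3):
        raise ValueError("first_position must be 1, 2 or 3")
    return [(first_position - 1 + i) % 3 + 1 for i in range(length)]


def format_class_section(labels):
    """Join the labels with single spaces, one label per nucleotide."""
    return " ".join(str(label) for label in labels)
```

For example, format_class_section(codon_position_classes(5, first_position=2)) yields the labels 2 3 1 2 3.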
When the data file contains a class section, programs in the PHASE package expect it to comply with the following set of rules:

* class labels are separated by a space
* classes are labelled from 1 to K, where K is the number of distinct classes
* the number of labels equals the length of the sequences
* when used in conjunction with a base-paired structure, the two components of a paired site are in the same class.
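
The rules above can be checked before running PHASE. The sketch below is one possible way to do so and is not taken from the PHASE source: the names pair_partners and check_class_section are hypothetical, and it assumes the pairing mask uses only "(" and ")" for paired sites and "." for unpaired sites. Labels are split on any whitespace, which is slightly more permissive than the single-space rule.

```python
def pair_partners(mask):
    """Map each paired site index to the index of its partner.

    Assumes "(" opens a pair, ")" closes one and "." is unpaired.
    """
    stack = []
    partner = {}
    for i, symbol in enumerate(mask):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at site {i + 1}")
            j = stack.pop()
            partner[i] = j
            partner[j] = i
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at site {i + 1}")
    if stack:
        raise ValueError(f"unmatched '(' at site {stack[-1] + 1}")
    return partner


def check_class_section(text, sequence_length, mask=None):
    """Return the labels as integers, or raise ValueError if a rule is broken."""
    labels = [int(token) for token in text.split()]

    # The number of labels equals the length of the sequences.
    if len(labels) != sequence_length:
        raise ValueError(f"expected {sequence_length} labels, found {len(labels)}")

    # Classes are labelled from 1 to K, K being the number of distinct classes.
    k = len(set(labels))
    if set(labels) != set(range(1, k + 1)):
        raise ValueError(f"labels must be exactly the integers 1 to {k}")

    # The two components of a paired site are in the same class.
    if mask is not None:
        if len(mask) != sequence_length:
            raise ValueError("pairing mask and sequences differ in length")
        for i, j in pair_partners(mask).items():
            if labels[i] != labels[j]:
                raise ValueError(f"paired sites {i + 1} and {j + 1} are in different classes")
    return labels
```

Calling check_class_section("2 2 1 2 2", 5, "((.))") returns the five labels, whereas "2 2 1 1 2" with the same mask is rejected because the second and fourth sites are paired but sit in different classes.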

Since PHASE is specifically designed for the analysis of RNA sequences with secondary structure, the most common use of the class section is the separation of unpaired and base-paired sites into two distinct classes. To spare you the tedious task of writing such a class section by hand, the code MIXED can replace the code RNA; PHASE then builds the class section from the provided pairing mask (e.g., (((.())))..) implies 2 2 2 1 2 2 2 2 2 1 1 2). When the code MIXED is used, the class section is not compulsory: unpaired sites are automatically assigned to class 1 and paired sites to class 2.
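
The automatic assignment described above amounts to a one-to-one mapping from mask symbols to class labels. The sketch below illustrates that mapping only; it is not PHASE's implementation, the name classes_from_mask is hypothetical, and it assumes the mask contains only "(", ")" and ".". It does not check that the brackets are balanced.

```python
def classes_from_mask(mask):
    """Assign class 1 to unpaired sites (".") and class 2 to paired sites."""
    labels = []
    for position, symbol in enumerate(mask, start=1):
        if symbol == ".":
            labels.append(1)
        elif symbol in "()":
            labels.append(2)
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at site {position}")
    return labels


# The mask used as an example in the text.
mask = "(((.())))..)"
print(" ".join(str(label) for label in classes_from_mask(mask)))
# prints: 2 2 2 1 2 2 2 2 2 1 1 2
```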

Classes are usually used to determine which model of sequence evolution PHASE applies to each nucleotide. During the phylogenetic inference, each class in the data file is handled by its own substitution model. The models are defined later, in the model section of the control file. Note, however, that if you use the MIXED type with the automatic assignment (i.e., without a class section), the first and second models you declare must be, respectively, a nucleotide substitution model and a base-pair substitution model. We will return to this point later on.


Gowri-Shankar Vivek 2003-04-24