ClustalV - Multiple sequence alignment by Des Higgins

ClustalV is an order-independent multiple sequence alignment algorithm.
It first determines which sequences are similar to each other, and aligns
those sequences first.  This helps reduce alignment bias based on the
early alignment of poor sequences.


Be warned that all sequences are returned in upper case, and all
extended information under Get info.. will be lost.  For this reason,
the results are placed in a new GDE window.

The following is the help file that Higgins provides. 

---

This is the on-line help file for CLUSTAL V.   

It should be named or defined as:   clustalv_hlp

>>HELP<< 1            General help for CLUSTAL V 
CLUSTAL V is a general purpose multiple alignment program for DNA or proteins.

SEQUENCE INPUT:  all sequences must be in 1 file, one after another.  3 formats
are automatically recognised: NBRF/PIR, EMBL/SWISSPROT or Pearson (Fasta).  
All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
except "-" which is used to indicate a GAP.  Upper or lower case is allowed.
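The character rule above can be sketched as follows.  This is only an
illustration of the rule, not CLUSTAL V's own code; the function name
clean_sequence is hypothetical.

```python
def clean_sequence(raw: str) -> str:
    """Keep ASCII letters and '-' (gap); drop spaces, digits, punctuation."""
    return "".join(
        ch for ch in raw
        if ch == "-" or (ch.isascii() and ch.isalpha())
    )


# Case is left untouched because upper or lower case is allowed on input.
print(clean_sequence("acg t-1 0 ac*g"))  # prints acgt-acg
```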


To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to 
INPUT them; go to menu item 2 to do the multiple alignment.


PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments.  Use this to
add a new sequence to an old alignment.  GAPS in the old alignments are 
indicated using the "-" character.   PROFILES can be input as PIR format files.


PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
in PIR format with "-" characters to indicate gaps) OR after a multiple 
alignment while the alignment is still in memory.
>>HELP<< 2     Help for multiple alignments

If you have already loaded sequences, use menu item 1 to do the complete
multiple alignment.  You will be prompted for 2 output files: 1 for the 
alignment itself; another to store a dendrogram that describes the similarity
of the sequences to each other.

Multiple alignments are carried out in 3 stages (automatically done from menu
item 1 ... multiple alignments NOW):

1) all sequences are compared to each other (pairwise alignments);

2) a dendrogram (like a phylogenetic tree) is constructed, describing the
approximate groupings of the sequences by similarity (stored in a file).

3) the final multiple alignment is carried out, using the dendrogram as a guide.


PAIRWISE ALIGNMENT parameters control the speed/sensitivity of the initial
alignments.

MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.


You can skip the first stages (pairwise alignments; dendrogram) by using an
old dendrogram file (menu item 3); or you can just produce the dendrogram
with no final multiple alignment (menu item 2).


OUTPUT FORMAT: Menu item 6 (format options) allows you to choose between 4 
different alignment formats (CLUSTAL, GCG, NBRF/PIR and PHYLIP).  


>>HELP<< 3     Help for pairwise alignment parameters

A similarity score is calculated between every pair of sequences and these are
used to construct the dendrogram which guides the final multiple alignment.

These similarity scores are calculated from fast, approximate, global align-
ments, which are controlled by 4 parameters.   2 techniques are used to make
these alignments very fast: 1) only exactly matching fragments (k-tuples) are
considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
are used.


K-TUPLE SIZE:  This is the size of exactly matching fragment that is used. 
INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
For longer sequences (e.g. > 300 residues) you may need to increase the default.


GAP PENALTY:   This is a penalty for each gap in the fast alignments.  It has
little effect on the speed or sensitivity.  


TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
dot-matrix plot) is calculated.  Only the best ones (with most matches) are
used in the alignment.  This parameter specifies how many.  Decrease for speed;
increase for sensitivity.


DIAGONAL WINDOW:  This is the number of diagonals around each of the 'best' 
diagonals that will be used.  Decrease for speed; increase for sensitivity.
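
The roles of K-TUPLE SIZE, TOP DIAGONALS and DIAGONAL WINDOW can be sketched
as below.  This is a simplified illustration under our own assumptions (a
diagonal is numbered i - j, and ties between diagonals are broken arbitrarily);
it is not the program's actual implementation, and the function names are
hypothetical.

```python
from collections import Counter, defaultdict


def diagonal_counts(seq_a: str, seq_b: str, ktup: int) -> Counter:
    """Count exact k-tuple matches on each diagonal (i - j) of the dot-matrix."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - ktup + 1):
        positions[seq_b[j:j + ktup]].append(j)
    counts = Counter()
    for i in range(len(seq_a) - ktup + 1):
        for j in positions.get(seq_a[i:i + ktup], ()):
            counts[i - j] += 1
    return counts


def diagonals_to_search(counts: Counter, top_diags: int, window: int) -> set:
    """Take the top_diags best diagonals and widen each by +/- window."""
    keep = set()
    for diag, _ in counts.most_common(top_diags):
        keep.update(range(diag - window, diag + window + 1))
    return keep


counts = diagonal_counts("ACGTACGT", "TTACGTAC", ktup=2)
print(counts.most_common(1))                                   # [(-2, 5)]
print(sorted(diagonals_to_search(counts, top_diags=1, window=1)))  # [-3, -2, -1]
```

A larger ktup gives fewer, more specific matches (faster, less sensitive);
larger top_diags and window values widen the region that is searched.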


SCORING METHOD = PERCENTAGE or ABSOLUTE:   This controls whether the similarity
scores are calculated as raw alignment scores (number of k-tuple matches minus a
gap penalty for every gap) (ABSOLUTE) or as the alignment score divided by the
length of the shorter sequence (PERCENTAGE).
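
As a worked illustration of the two scoring methods, following the wording
above literally (the function and argument names are hypothetical, and any
further scaling the program may apply, e.g. to a 0-100 range, is not stated
here):

```python
def similarity_score(ktuple_matches: int, gaps: int, gap_penalty: int,
                     len_a: int, len_b: int, percentage: bool = True) -> float:
    """ABSOLUTE: matches minus a penalty per gap.
    PERCENTAGE: that score divided by the length of the shorter sequence."""
    absolute = ktuple_matches - gap_penalty * gaps
    if not percentage:
        return float(absolute)
    return absolute / min(len_a, len_b)


# 40 matches, 2 gaps at penalty 3, sequences of length 50 and 80:
print(similarity_score(40, 2, 3, 50, 80, percentage=False))  # 34.0
print(similarity_score(40, 2, 3, 50, 80))                    # 0.68
```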


>>HELP<< 4     Help for multiple alignment parameters
These parameters control the final multiple alignment.  There are 2 gap penalty
parameters and 1 for whether transitions (A <--> G or C <--> T) are weighted in
DNA alignments.  The default weight matrix for protein alignments is a PAM250
matrix, converted to distances.

GAP PENALTY (FIXED): 	This is a penalty for opening up a gap.   Decrease it
and you will encourage gaps of all sizes.  TERMINAL GAPS are penalised (same as
internal ones).  BEWARE:  if you make this too small (+/- 5 or so), the program
will prefer to align each sequence opposite a long gap.

GAP PENALTY (VARYING):  This penalty is incurred for every item in a gap.  This
penalises long gaps more.  Increase this and gaps will get shorter.   BEWARE: 
if you make this too small (+/- 5 or so), the program will prefer to align each
sequence opposite a long gap.
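
A minimal sketch of how the two penalties combine, assuming they simply add
(one FIXED charge per gap plus one VARYING charge per gap position).  The
function name is hypothetical; the default values of 10 are the defaults
quoted in the weight matrix help.

```python
def gap_cost(length: int, fixed: int = 10, varying: int = 10) -> int:
    """Cost of one gap of the given length: fixed + varying * length."""
    if length <= 0:
        return 0
    return fixed + varying * length


# One gap of length 3 versus three separate gaps of length 1:
print(gap_cost(3))       # 40
print(3 * gap_cost(1))   # 60
```

Under this assumption the FIXED penalty discourages many short gaps, while the
VARYING penalty discourages long ones.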

TRANSITIONS = WEIGHTED or UNWEIGHTED:  With UNWEIGHTED transitions identical 
bases in a DNA alignment have a DISTANCE of 0; different ones have a distance 
of 10.  If transitions are WEIGHTED then A vs G and C vs T will have a distance
of 5 (less distant than A vs C,T or C vs A,G).  
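
The DNA distances described above can be written out as a small sketch (the
function name is hypothetical; only A, C, G and T are handled):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def base_distance(x: str, y: str, weighted: bool = True) -> int:
    """Distance 0 for identical bases, 10 for different ones, and 5 for
    transitions (A<->G, C<->T) when transitions are WEIGHTED."""
    x, y = x.upper(), y.upper()
    if x == y:
        return 0
    is_transition = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return 5 if (weighted and is_transition) else 10


print(base_distance("A", "G"))                  # 5
print(base_distance("A", "G", weighted=False))  # 10
print(base_distance("A", "C"))                  # 10
```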
>>HELP<< 5     Help for output format options.
Four output formats are offered.  You can choose more than one (or all four if
you wish).  NBRF/PIR format is ESPECIALLY USEFUL.  Alignments that are written
in this format can be used again as input (for calculating phylogenetic trees;
profile alignments; general input).

CLUSTAL format output is a self explanatory alignment format.  It shows the
sequences aligned in blocks.

GCG output can be used by any of the GCG programs that can work on multiple
alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN).  It is the same as the GCG
.msf format files (multiple sequence file); new in version 7 of GCG.

PHYLIP format output can be used for input to the PHYLIP package of Joe 
Felsenstein.  This is an extremely widely used package for doing every 
imaginable form of phylogenetic analysis (MUCH more than the modest
introduction offered by this program).

NBRF/PIR:  this is the same as the standard PIR format with ONE ADDITION.  Gap
characters "-" are used to indicate the positions of gaps in the multiple 
alignment.   These files can be re-used as input in any part of clustal that
allows sequences (or alignments or profiles) to be read in.  
>>HELP<< 6     Help for profile alignments

By PROFILE ALIGNMENT, we mean the alignment of two old alignments.  One of the
alignments can be a single sequence.  

The profiles should be in PIR format (one of the 4 output formats produced by 
this program).   This is the same as standard NBRF/PIR format, with 1 addition:
gap characters are indicated by "-".   

The alignment method produces a global, optimal alignment using an amino acid
weight matrix (PAM250 is default) and 2 gap penalty parameters.

Profile alignments allow you to store alignments of your favourite sequences (as
long as they are in PIR format) and add new sequences to them in small bunches 
at a time.  One of the 2 profiles can simply be a single sequence.


>>HELP<< 7     Help for phylogenetic trees
Before calculating a tree, you must have an alignment in memory.  This can be
read in from an NBRF/PIR format file, or it can be the result of a full
multiple alignment that is still in memory.

The method used is the NJ (Neighbour Joining) method of Saitou and Nei.  First
you calculate distances (percent divergence) between all pairs of sequences from
a multiple alignment; second you apply the NJ method to the distance matrix.

EXCLUDE POSITIONS WITH GAPS?  If you choose this option, any alignment positions
where ANY of the sequences have a gap will be ignored.  This guarantees that
the distances will be 'metric'.  Also, it means that 'like' will be compared to
'like' in all distances.  The disadvantage is that you may throw away much of
the data if there are many gaps.
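
The effect of excluding gapped positions can be sketched as follows.  This is
an illustration only: the function name is hypothetical, and skipping positions
where either sequence of the pair has a gap (when the option is off) is our
assumption, not something stated in this help text.

```python
def percent_divergence(aligned, i, j, toss_gaps=False):
    """Percent of compared positions at which sequences i and j differ.

    aligned is a list of equal-length strings with '-' marking gaps.
    With toss_gaps=True, any column with a gap in ANY sequence is ignored.
    """
    columns = list(zip(*aligned))
    if toss_gaps:
        columns = [col for col in columns if "-" not in col]
    compared = differing = 0
    for col in columns:
        a, b = col[i], col[j]
        if a == "-" or b == "-":
            continue  # assumption: gaps within this pair are not compared
        compared += 1
        if a != b:
            differing += 1
    return 100.0 * differing / compared if compared else 0.0


aligned = ["ACGT-A", "ACCTTA", "A-GTTA"]
print(percent_divergence(aligned, 0, 1))                  # 20.0
print(percent_divergence(aligned, 0, 1, toss_gaps=True))  # 25.0
```

With the option on, every pair is measured over the same 4 columns; with it
off, sequences 1 and 2 are compared over 5 columns.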

CORRECT FOR MULTIPLE SUBSTITUTIONS?  For small divergence (say <10%) this
option makes little difference.  For greater divergence, this option corrects
for the fact that observed distances underestimate actual evolutionary dist-
ances.  This is because, as sequences diverge, more than one substitution will
happen at many sites.  However, you only see one difference when you look at the
present day sequences.  Therefore, this option has the effect of stretching
branch lengths in trees (especially long branches).  The corrections used here
(for DNA or proteins) are both due to Motoo Kimura.
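
The formulas are not given in this help text.  As an illustration of the
stretching effect, the sketch below uses one widely quoted form of Kimura's
protein correction, K = -ln(1 - p - 0.2*p*p), where p is the observed fraction
of differing sites.  Treat the choice of formula as an assumption about what
the program applies; the function name is hypothetical.

```python
import math


def kimura_protein_distance(p: float) -> float:
    """Corrected distance from an observed fraction p of differing sites."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0:
        raise ValueError("divergence too large for this correction")
    return -math.log(inner)


for p in (0.05, 0.25, 0.50):
    print(p, round(kimura_protein_distance(p), 4))
```

For small p the corrected value is close to p; for larger p it grows well
beyond it, which is why long branches are stretched the most.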

To calculate a tree, use option 4 (DRAW TREE NOW).  This gives an UNROOTED
tree and all branch lengths.  The root of the tree can only be inferred by
using an outgroup (a sequence that you are certain branches at the outside
of the tree .... certain on biological grounds) OR if you assume a degree
of constancy in the 'molecular clock', you can place the root along the
longest branch.

BOOTSTRAPPING is a method for deriving confidence values for the groupings in
a tree (first adapted for trees by Joe Felsenstein).   It involves making N
random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000);
drawing N trees (1 from each sample) and counting how many times each grouping
from the original tree occurs in the sample trees.   For a group to be consid-
ered significant at the 5% level (p <= 0.05) it should occur in at least 95% of
the sample trees. You must supply a seed number for the random number generator.
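
The resampling step can be sketched as below: alignment columns are drawn at
random with replacement (the usual bootstrap convention), seeded so that a run
can be repeated.  Tree building is left out, and all names are hypothetical;
this is not the program's own code.

```python
import random


def bootstrap_alignments(aligned, n_samples, seed):
    """Yield n_samples alignments, each made by resampling columns
    (sites) of the original alignment with replacement."""
    rng = random.Random(seed)
    length = len(aligned[0])
    for _ in range(n_samples):
        cols = [rng.randrange(length) for _ in range(length)]
        yield ["".join(seq[c] for c in cols) for seq in aligned]


def is_significant(times_found: int, n_samples: int) -> bool:
    """A grouping must occur in at least 95% of the sample trees."""
    return 100 * times_found >= 95 * n_samples


samples = list(bootstrap_alignments(["ACGT", "ACGA"], n_samples=3, seed=1))
print(len(samples))
print(is_significant(950, 1000), is_significant(949, 1000))  # True False
```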
>>HELP<< 8     Help for choosing protein weight matrix
For protein alignments, you use a weight matrix to determine the similarity of
non-identical amino acids.  For example, Tyr aligned with Phe is usually judged 
to be 'better' than Tyr aligned with Pro.  


There are three 'in-built' weight matrices offered: 


1) PAM 100 and 2) PAM 250    These are from the work of M. Dayhoff and are often
simply called Dayhoff matrices.   The pam 250 matrix is the most commonly used
and is the default in most protein comparison packages.   It is claimed that
a pam 100 matrix is more sensitive in many cases, so we have included it
here.


3) Identity matrix.   This matrix just scores identical residues.


You can also input your own matrix.  If so then be careful:  1) follow the 
instructions on format below; 2) watch the gap penalty parameters (the default
values may not be appropriate).   Conservative substitutions will not be 
indicated in alignments.

The values in a new weight matrix must be integers and the scores should be
similarities.  You can use negative as well as positive values if you wish.


INPUT FORMAT  The lower triangle of a 20x20 matrix of values is read in, in free
format, row by row.  The diagonal must be included.   Using the 1 letter code,
the order of amino acids in the matrix is:   CSTPAGNDEQHRKMILVFYW.   Separate
the values by spaces (not commas).   You can put the values on as many lines
as you like as long as they are in the right order.
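
To check a matrix file before using it, the input format can be read with a
short sketch like the one below.  It follows the description above (210
integers, lower triangle with diagonal, free format); the function name is
hypothetical and this is not the program's own reader.

```python
ORDER = "CSTPAGNDEQHRKMILVFYW"


def parse_lower_triangle(text: str) -> dict:
    """Read the lower triangle (diagonal included) of a 20x20 integer
    matrix, row by row, and return a symmetric lookup keyed by residue pair."""
    values = [int(tok) for tok in text.split()]  # non-integers raise ValueError
    n = len(ORDER)
    expected = n * (n + 1) // 2                  # 210 values
    if len(values) != expected:
        raise ValueError(f"expected {expected} integers, got {len(values)}")
    matrix = {}
    it = iter(values)
    for row in range(n):
        for col in range(row + 1):
            score = next(it)
            matrix[ORDER[row], ORDER[col]] = score
            matrix[ORDER[col], ORDER[row]] = score
    return matrix


# An identity-style matrix: 1 on the diagonal, 0 elsewhere.
identity_text = "\n".join(
    " ".join("1" if c == r else "0" for c in range(r + 1)) for r in range(20)
)
identity = parse_lower_triangle(identity_text)
print(identity["Y", "Y"], identity["Y", "F"])  # 1 0
```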


GAP PENALTIES  The default gap penalty parameters work fine with a PAM 250
matrix.  The range of PAM 250 values is 0 to 25 (when rescaled to be positive)
and the default gap penalties are 10 each.   Very approximately, the best gap
penalty settings are 2/5 the maximum weight matrix score.   
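
The rule of thumb above, written out as arithmetic (a rough starting point
only, as the text says; the function name is hypothetical):

```python
def suggested_gap_penalty(max_matrix_score: int) -> int:
    """About 2/5 of the largest score in the weight matrix."""
    return round(2 * max_matrix_score / 5)


# PAM 250 rescaled to 0..25 gives the default penalty of 10.
print(suggested_gap_penalty(25))  # 10
```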
>>HELP<< 9     Help for command line parameters
                DATA (sequences)

/INFILE=file.ext                             :input sequences.
/PROFILE1=file.ext  and  /PROFILE2=file.ext  :profiles (old alignment).

                VERBS (do things)

/HELP  or /CHECK    :list the command line params.
/ALIGN              :do full multiple alignment 
/TREE               :calculate NJ tree.
/BOOTSTRAP(=n)      :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).

                PARAMETERS (set things)

***Pairwise alignments:***
/KTUP=n      :word size                  /TOPDIAGS=n  :number of best diags.
/WINDOW=n    :window around best diags.  /PAIRGAP=n   :gap penalty

***Multiple alignments:***
/FIXEDGAP=n  :fixed length gap pen.      /FLOATGAP=n  :variable length gap pen.
/MATRIX=     :PAM100 or ID or file name. /TYPE=p or d :type is prot. or DNA
/OUTPUT=     :GCG or PHYLIP or PIR.      /TRANSIT     :transitions not weighted.

***Trees:***                             /SEED      :seed number for bootstraps.
/KIMURA      :use Kimura's correction.   /TOSSGAPS  :ignore positions with gaps.
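
As an illustration of combining the options above, the sketch below only
assembles an argument list; it does not run anything.  The executable name
"clustalv", the file name, and the order of the options are assumptions, and
how the "/" syntax is passed through a particular shell is not covered here.

```python
def build_command(infile, output="PHYLIP", ktup=None, program="clustalv"):
    """Assemble a command line for a full multiple alignment.

    program and infile are placeholders; the option names are taken from
    the list above.
    """
    args = [program, f"/INFILE={infile}", "/ALIGN", f"/OUTPUT={output}"]
    if ktup is not None:
        args.append(f"/KTUP={ktup}")
    return args


print(" ".join(build_command("seqs.pir", output="PIR", ktup=2)))
# clustalv /INFILE=seqs.pir /ALIGN /OUTPUT=PIR /KTUP=2
```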