ClustalV - Multiple sequence alignment by Des Higgins

ClustalV is an order-independent multiple sequence alignment algorithm. It first determines which sequences are similar to each other, and aligns those sequences first. This helps reduce the alignment bias introduced by aligning poorly matching sequences early.

Be warned that all sequences are returned in upper case, and all extended information under Get info.. will be lost. For this reason, the results are placed in a new GDE window.

The following is the help file that Higgins provides.

---

This is the on-line help file for CLUSTAL V. It should be named or defined as: clustalv_hlp

>>HELP<< 1 General help for CLUSTAL V

CLUSTAL V is a general purpose multiple alignment program for DNA or proteins.

SEQUENCE INPUT: all sequences must be in 1 file, one after another. 3 formats are automatically recognised: NBRF/PIR, EMBL/SWISSPROT or Pearson (Fasta). All non-alphabetic characters (spaces, digits, punctuation marks) are ignored except "-", which is used to indicate a GAP. Upper or lower case is allowed.

To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to INPUT them; go to menu item 2 to do the multiple alignment.

PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to add a new sequence to an old alignment. GAPS in the old alignments are indicated using the "-" character. PROFILES can be input as PIR format files.

PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (input in PIR format with "-" characters to indicate gaps) OR after a multiple alignment while the alignment is still in memory.

>>HELP<< 2 Help for multiple alignments

If you have already loaded sequences, use menu item 1 to do the complete multiple alignment. You will be prompted for 2 output files: 1 for the alignment itself; another to store a dendrogram that describes the similarity of the sequences to each other.
Multiple alignments are carried out in 3 stages (automatically done from menu item 1 ... multiple alignments NOW):

1) all sequences are compared to each other (pairwise alignments);

2) a dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity (stored in a file);

3) the final multiple alignment is carried out, using the dendrogram as a guide.

PAIRWISE ALIGNMENT parameters control the speed/sensitivity of the initial alignments.

MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.

You can skip the first stages (pairwise alignments; dendrogram) by using an old dendrogram file (menu item 3); or you can just produce the dendrogram with no final multiple alignment (menu item 2).

OUTPUT FORMAT: Menu item 6 (format options) allows you to choose between 4 different alignment formats (CLUSTAL, GCG, NBRF/PIR and PHYLIP).

>>HELP<< 3 Help for pairwise alignment parameters

A similarity score is calculated between every pair of sequences and these are used to construct the dendrogram which guides the final multiple alignment. These similarity scores are calculated from fast, approximate, global alignments, which are controlled by 4 parameters. 2 techniques are used to make these alignments very fast: 1) only exactly matching fragments (k-tuples) are considered; 2) only the 'best' diagonals (the ones with most k-tuple matches) are used.

K-TUPLE SIZE: This is the size of exactly matching fragment that is used. INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity. For longer sequences (e.g. > 300 residues) you may need to increase the default.

GAP PENALTY: This is a penalty for each gap in the fast alignments. It has little effect on the speed or sensitivity.

TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary dot-matrix plot) is calculated. Only the best ones (with most matches) are used in the alignment.
This parameter specifies how many. Decrease for speed; increase for sensitivity.

DIAGONAL WINDOW: This is the number of diagonals around each of the 'best' diagonals that will be used. Decrease for speed; increase for sensitivity.

SCORING METHOD = PERCENTAGE or ABSOLUTE: This controls whether the similarity scores are calculated as raw alignment scores (number of k-tuple matches minus a gap penalty for every gap) (ABSOLUTE) or as the alignment score divided by the length of the shorter sequence (PERCENTAGE).

>>HELP<< 4 Help for multiple alignment parameters

These parameters control the final multiple alignment. There are 2 gap penalty parameters and 1 for whether transitions (A <--> G or C <--> T) are weighted in DNA alignments. The default weight matrix for protein alignments is a PAM250 matrix, converted to distances.

GAP PENALTY (FIXED): This is a penalty for opening up a gap. Decrease it and you will encourage gaps of all sizes. TERMINAL GAPS are penalised (same as internal ones). BEWARE: if you make this too small (+/- 5 or so), the program will prefer to align each sequence opposite a long gap.

GAP PENALTY (VARYING): This penalty is incurred for every item in a gap. This penalises long gaps more. Increase this and gaps will get shorter. BEWARE: if you make this too small (+/- 5 or so), the program will prefer to align each sequence opposite a long gap.

TRANSITIONS = WEIGHTED or UNWEIGHTED: With UNWEIGHTED transitions, identical bases in a DNA alignment have a DISTANCE of 0; different ones have a distance of 10. If transitions are WEIGHTED then A vs G and C vs T will have a distance of 5 (less distant than A vs C,T or C vs A,G).

>>HELP<< 5 Help for output format options.

Four output formats are offered. You can choose more than one (or all four if you wish).

NBRF/PIR format is ESPECIALLY USEFUL. Alignments that are written in this format can be used again as input (for calculating phylogenetic trees; profile alignments; general input).
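As a rough illustration of what a re-usable alignment file looks like, the sketch below writes gapped sequences in an NBRF/PIR-style layout. The header code (`>P1;` for protein), the free-text title line and the `*` terminator follow the usual NBRF/PIR convention and are not spelled out in this help text, so treat them as assumptions. The helper `write_pir` and the sample data are hypothetical.

```python
def write_pir(alignment, seq_type="P1", width=60):
    """Return NBRF/PIR-style text for (name, gapped_sequence) pairs.

    Assumed layout (not taken from the help text): a '>P1;name' header,
    one title line, the sequence in blocks, and '*' as terminator.
    Gaps are kept as '-' so the file can be read back in as an alignment.
    """
    lines = []
    for name, seq in alignment:
        lines.append(f">{seq_type};{name}")
        lines.append(f"{name} (aligned)")  # free-text title line
        seq = seq.upper()
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
        lines.append("*")
    return "\n".join(lines) + "\n"


print(write_pir([("seq1", "MKV-LAA"), ("seq2", "MKVTLA-")]))
```

Every sequence in one file should have the same gapped length; the sketch does not check this.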
CLUSTAL format output is a self-explanatory alignment format. It shows the sequences aligned in blocks.

GCG output can be used by any of the GCG programs that can work on multiple alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG .msf format files (multiple sequence file); new in version 7 of GCG.

PHYLIP format output can be used for input to the PHYLIP package of Joe Felsenstein. This is an extremely widely used package for doing every imaginable form of phylogenetic analysis (MUCH more than the modest introduction offered by this program).

NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap characters "-" are used to indicate the positions of gaps in the multiple alignment. These files can be re-used as input in any part of clustal that allows sequences (or alignments or profiles) to be read in.

>>HELP<< 6 Help for profile alignments

By PROFILE ALIGNMENT, we mean the alignment of two old alignments. One of the alignments can be a single sequence. The profiles should be in PIR format (one of the 4 output formats produced by this program). This is the same as standard NBRF/PIR format, with 1 addition: gap characters are indicated by "-".

The alignment method produces a global, optimal alignment using an amino acid weight matrix (PAM250 is default) and 2 gap penalty parameters.

Profile alignments allow you to store alignments of your favourite sequences (as long as they are in PIR format) and add new sequences to them in small bunches at a time. One of the 2 profiles can simply be a single sequence.

>>HELP<< 7 Help for phylogenetic trees

Before calculating a tree, you must have an alignment in memory. This can be input in NBRF/PIR format, or you should have just carried out a full multiple alignment so that the alignment is still in memory.

The method used is the NJ (Neighbour Joining) method of Saitou and Nei.
First you calculate distances (percent divergence) between all pairs of sequences from a multiple alignment; second you apply the NJ method to the distance matrix.

EXCLUDE POSITIONS WITH GAPS? If you choose this option, any alignment positions where ANY of the sequences have a gap will be ignored. This guarantees that the distances will be 'metric'. Also, it means that 'like' will be compared to 'like' in all distances. The disadvantage is that you may throw away much of the data if there are many gaps.

CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this option makes little difference. For greater divergence, this option corrects for the fact that observed distances underestimate actual evolutionary distances. This is because, as sequences diverge, more than one substitution will happen at many sites. However, you only see one difference when you look at the present day sequences. Therefore, this option has the effect of stretching branch lengths in trees (especially long branches). The corrections used here (for DNA or proteins) are both due to Motoo Kimura.

To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED tree and all branch lengths. The root of the tree can only be inferred by using an outgroup (a sequence that you are certain, on biological grounds, branches at the outside of the tree) OR, if you assume a degree of constancy in the 'molecular clock', you can place the root along the longest branch.

BOOTSTRAPPING is a method for deriving confidence values for the groupings in a tree (first adapted for trees by Joe Felsenstein). It involves making N random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000); drawing N trees (1 from each sample) and counting how many times each grouping from the original tree occurs in the sample trees. For a group to be considered significant at the 5% level (p <= 0.05) it should occur in at least 95% of the sample trees.
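The site-resampling step of bootstrapping can be sketched as follows. This is a minimal illustration under stated assumptions, not CLUSTAL V's own code: `bootstrap_columns` and `is_significant` are hypothetical helpers, the alignment is assumed to be a list of equal-length gapped strings, and Python's `random.Random` stands in for whatever generator the program uses. Building a tree from each sample and counting groupings are left out.

```python
import random


def bootstrap_columns(alignment, n_samples, seed):
    """Yield n_samples alignments made by sampling sites with replacement.

    Whole columns are resampled, so every sequence in a sample keeps
    the same length and stays aligned with the others.
    """
    rng = random.Random(seed)  # the seed makes the samples reproducible
    length = len(alignment[0])
    for _ in range(n_samples):
        cols = [rng.randrange(length) for _ in range(length)]
        yield ["".join(seq[c] for c in cols) for seq in alignment]


def is_significant(count, n_samples, percent=95):
    """True if a grouping occurs in at least `percent` % of the sample trees."""
    return count * 100 >= percent * n_samples


samples = list(bootstrap_columns(["ACGT", "AC-T", "TCGA"], n_samples=3, seed=111))
print(samples[0])
```

With N = 1000 samples, a grouping would need to appear in at least 950 of the sample trees to pass the 95% threshold.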
You must supply a seed number for the random number generator.

>>HELP<< 8 Help for choosing protein weight matrix

For protein alignments, you use a weight matrix to determine the similarity of non-identical amino acids. For example, Tyr aligned with Phe is usually judged to be 'better' than Tyr aligned with Pro.

There are three 'in-built' weight matrices offered:

1) PAM 100 and 2) PAM 250. These are from the work of M. Dayhoff and are often simply called Dayhoff matrices. The PAM 250 matrix is the most commonly used and is the default in most protein comparison packages. It is claimed that a PAM 100 matrix is more sensitive in many cases, so we have included it here.

3) Identity matrix. This matrix just scores identical residues.

You can also input your own matrix. If so then be careful: 1) follow the instructions on format below; 2) watch the gap penalty parameters (the default values may not be appropriate). Conservative substitutions will not be indicated in alignments. The values in a new weight matrix must be integers and the scores should be similarities. You can use negative as well as positive values if you wish.

INPUT FORMAT: The lower triangle of a 20x20 matrix of values is read in, in free format, row by row. The diagonal must be included. Using the 1 letter code, the order of amino acids in the matrix is: CSTPAGNDEQHRKMILVFYW. Separate the values by spaces (not commas). You can put the values on as many lines as you like as long as they are in the right order.

GAP PENALTIES: The default gap penalty parameters work fine with a PAM 250 matrix. The range of PAM 250 values is 0 to 25 (when rescaled to be positive) and the default gap penalties are 10 each. Very approximately, the best gap penalty settings are 2/5 the maximum weight matrix score.

>>HELP<< 9 Help for command line parameters

DATA (sequences)

/INFILE=file.ext                          :input sequences.
/PROFILE1=file.ext and /PROFILE2=file.ext :profiles (old alignment).
VERBS (do things)

/HELP or /CHECK  :list the command line params.
/ALIGN           :do full multiple alignment
/TREE            :calculate NJ tree.
/BOOTSTRAP(=n)   :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).

PARAMETERS (set things)

***Pairwise alignments:***
/KTUP=n          :word size
/TOPDIAGS=n      :number of best diags.
/WINDOW=n        :window around best diags.
/PAIRGAP=n       :gap penalty

***Multiple alignments:***
/FIXEDGAP=n      :fixed length gap pen.
/FLOATGAP=n      :variable length gap pen.
/MATRIX=         :PAM100 or ID or file name.
/TYPE=p or d     :type is prot. or DNA
/OUTPUT=         :GCG or PHYLIP or PIR.
/TRANSIT         :transitions not weighted.

***Trees:***
/SEED            :seed number for bootstraps.
/KIMURA          :use Kimura's correction.
/TOSSGAPS        :ignore positions with gaps.
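As a rough illustration of how these switches combine, the sketch below assembles a command line as a Python argument list. The executable name `clustalv`, the helper `build_command` and the file name `globins.pir` are assumptions made for the example; only the `/SWITCH` names come from the list above.

```python
def build_command(infile, verb="ALIGN", program="clustalv", **params):
    """Build an argument list from one data file, one verb and optional parameters.

    `program` is an assumed executable name. Parameters given as True
    become bare switches (e.g. /KIMURA); False or None are skipped;
    anything else becomes /NAME=value.
    """
    args = [program, f"/INFILE={infile}", f"/{verb.upper()}"]
    for name, value in params.items():
        switch = f"/{name.upper()}"
        if value is True:
            args.append(switch)
        elif value is not False and value is not None:
            args.append(f"{switch}={value}")
    return args


cmd = build_command("globins.pir", verb="align", ktup=2, output="PHYLIP")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` if an executable by that name is installed; the sketch only builds the arguments and does not check that a switch is valid for the chosen verb.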