FASTA.DOC                                        Release 1.6


                        COPYRIGHT NOTICE

Copyright 1988, 1991, 1992 by William R. Pearson and the
University of Virginia.  All rights reserved. The FASTA program
and documentation may not be sold or incorporated into a
commercial product, in whole or in part, without written consent
of William R. Pearson and the University of Virginia.  For
further information regarding permission for use or reproduction,
please contact: William R. Wilkerson, Assistant Provost for
Research, University of Virginia, P.O. Box 9025, Charlottesville,
VA 22906-9025, (804) 924-6853


The FASTA program package

Introduction

     This documentation describes version 1.6c of the FASTA
program package (see W. R. Pearson and D. J. Lipman (1988),
"Improved Tools for Biological Sequence Analysis", PNAS 85:2444-
2448, and W. R.  Pearson (1990) "Rapid and Sensitive Sequence
Comparison with FASTP and FASTA" Methods in Enzymology 183:63-
98).  Version 1.6 is the first release for the IBM-PC and
Macintosh since version 1.4 (version 1.5 was distributed only via
ftp to unix machines).  Version 1.6 has a large number of
improvements over versions 1.4 and 1.5, including the ability to
search libraries in several different formats in the same run,
more robust algorithms for aligning sequences along a band, and
additional, rigorous (but slow) programs for sequence searching,
statistical analysis, and local sequence alignment.  Several new
options have also been added.  Programs that are new
with version 1.6 are highlighted in italics.


Although there are a large number of programs in this package,
they belong to three groups:


    Library search programs: FASTA, TFASTA, SSEARCH

    Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN

    Statistical significance: RDF2, RELATE, RSS


In addition, there are several programs for other sequence
analysis tasks:


    ALIGN - global alignment of two sequences (no limit on gaps).

    EXTRACTP, SINDEX - programs to index (SINDEX) and extract (EXTRACTP)
    sequences from a protein sequence database.

    EXTRACTN - a program to extract sequences from the GenBank floppy disk
    format database.


In addition, I have included several programs for protein
sequence analysis, including a Kyte-Doolittle hydropathicity
plotting program (GREASE, TGREASE), and a secondary structure
prediction package (GARNIER).

     The FASTA sequence comparison programs on this disk are
improved versions of the FASTP program, originally described in
Science (Lipman and Pearson, (1985) Science 227:1435-1441).  We
have made several improvements.  First, the library search
programs use a more sensitive method for the initial comparison
of two sequences which allows the scores of several similar
regions to be combined.  As a result, the results of a library
search are now given with three scores, initn (the new initial
score which may include several similar regions), init1 (the old
fastp initial score from the best initial region), and opt (the
old fastp optimized score allowing gaps in a 32 residue wide
band).

     These programs have also been modified to become "universal"
(hence FAST-A, for FAST-All, as opposed to FAST-P (protein) or
FAST-N (nucleotides)); by changing the environment variable
SMATRIX, the programs can be used to search protein sequences,
DNA sequences, or whatever you like.  By default, FASTA, LFASTA,
and the RDF programs automatically recognize protein and DNA
sequences.  Sequences are first read as amino acids, and then
converted to nucleotides if the sequence is greater than 85%
A,C,G,T (the '-n' option can be used to indicate DNA sequences).
TFASTA compares protein sequences to a translated DNA sequence.
Alternative scoring matrices can also be used.  In addition to
the PAM250 matrix for proteins, matrices based on simple
identities or the genetic code can also be used for sequence
comparisons or evaluation of significance.  Several different
protein sequence matrices have been included; instructions for
constructing your own scoring matrix are included in the file
FORMAT.DOC.
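
     The automatic recognition of DNA sequences described above
can be sketched in a few lines.  This is only an illustration of
the 85% rule, not the code used by the programs; the function
name looks_like_dna and the decision to count only alphabetic
characters are assumptions made for the example.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Return True if more than `threshold` of the residues are A, C, G or T.

    Hypothetical helper: only alphabetic characters are counted, so
    spaces, digits and punctuation do not affect the fraction.
    """
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for ch in residues if ch in "ACGT")
    return acgt / len(residues) > threshold


if __name__ == "__main__":
    print(looks_like_dna("ACGTACGTAC"))   # all residues are A, C, G or T
    print(looks_like_dna("MKVLAAGIVG"))   # mostly other amino-acid codes
```

A sequence that is classified incorrectly by such a rule is the
case the '-n' option is meant to handle.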


The remainder of this document is divided into three sections:
(1) a brief history of the changes to the FASTA package; (2) a
guide to installing the programs and databases; (3) a guide to
using the FASTA programs. The programs are very easy to use, so
if you are using them on a machine that is administered by
someone else, you may want to skip to section (3) to learn how to
use the programs, and then read section (1) to look at some of
the more recent changes.  If you are installing the programs on
your own machine, you will need to read section (2) carefully.




1.  Revision History

1.1.  Changes with version 1.6

     FASTA version 1.6 uses a new method for calculating optimal
scores in a band (the optimization or last step in the FASTA
algorithm). In addition, it uses a linear-space method for
calculating the actual alignments.  The FASTA package also
includes four new programs:

SSEARCH   a program to search a sequence database using the
          rigorous Smith-Waterman algorithm (this program is
          about 100-fold slower than FASTA with ktup=2 for
          proteins).

RSS       a version of RDF2 that uses a rigorous Smith-Waterman
          calculation to score similarities

LALIGN    A rigorous local sequence alignment program that will
          display the N-best local alignments (N=10 by default).

PLALIGN   a version of lalign that plots the local alignments.

     The LALIGN/PLALIGN programs incorporate the "sim" algorithm
described by Huang and Miller (1991) Adv. Appl. Math. 12:337-357.
The SSEARCH and RSS programs incorporate algorithms described by
Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
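
     To give a feeling for why the rigorous programs are slow,
here is a minimal sketch of the Smith-Waterman local similarity
score.  It is a textbook version, not the Huang, Hardison, and
Miller code used by SSEARCH and RSS.  The simple match/mismatch
scores and the linear gap penalty are assumptions made for the
example; the programs themselves use a scoring matrix such as
PAM250.  Every residue of one sequence is compared with every
residue of the other, so the time grows with the product of the
two lengths.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (score only, no alignment).

    Assumed scoring: +2 for a match, -1 for a mismatch, -2 per gap
    position.  Only two rows of the matrix are kept in memory.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may start anywhere, so scores never drop below 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```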

     LFASTA and PLFASTA now calculate a different number of local
similarities; they now behave more like LALIGN/PLALIGN.  Since
local alignments of identical sequences produce "mirror-image"
alignments, lalign and lfasta consider only one-half of the
potential alignments between sequences from identical file names.
Thus

    lfasta mchu.aa mchu.aa

displays only two alignments; with earlier versions of the
program, it would have displayed five, including the identity
alignment.  PLFASTA does display five alignments; when two
identical filenames are given, it draws the identity alignment,
calculates the two unique local alignments, draws them, and draws
their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
filenames, rather than the actual sequences, to determine whether
sequences are identical; you can "trick" the programs into
behaving the old way by putting the same sequence in two
different files.

1.2.  Changes with version 1.5

     FASTA version 1.5 includes a number of substantial revisions
to improve the performance and sensitivity of the program.  It is
now possible to tell the program to optimize all of the initn
scores greater than a threshold.  The threshold is set at the
same value as the old FASTA cutoff score (approximately 0.5
standard deviations above the mean for average length sequences).
For highest sensitivity, you can use the -c 1 option to set the
threshold to 1.  (This will slow the search down about 5-fold).
Alternatively, you can tell FASTA to sort the results by the
init1, rather than the initn, score by using the -1 option.
FASTA -1 ...  will report the results the way the older FASTP
program did.  A comparison of the performance of FASTA in this,
its slowest mode, with the standard FASTA and the Smith-Waterman
algorithm has been published in Genomics (1991) 11:635-650.

     A new method has been provided for selecting libraries. In
the past, one could enter the name of a sequence file to be
searched or a single letter that would specify a library from the
list included in the $FASTLIBS file. Now, you can specify a set
of library files with a string of letters preceded by a '%'.
Thus, if the FASTLIBS file has the lines:


    Genbank 70 primates$1P/seqlib/gbpri.seq
    Genbank 70 rodents$1R/seqlib/gbrod.seq
    Genbank 70 other mammals$1M/seqlib/gbmam.seq
    Genbank 70 vertebrates $1B/seqlib/gbvrt.seq


Then the string: "%PRMB" would tell FASTA to search the four
libraries listed above.  The %PRMB string can be entered either
on the command line or when the program asks for a filename or
library letter.
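
     The way a '%' string selects several libraries can be
sketched as follows.  This is an illustration only, not the
program's own code; the function names parse_fastlibs and
select_libraries are hypothetical, and the sketch assumes that
each line has the form description$, type digit, selection
letter, file name, as in the example above.

```python
def parse_fastlibs(lines):
    """Split each FASTLIBS line into description, type, letter and file name."""
    entries = []
    for line in lines:
        description, sep, rest = line.strip().partition("$")
        if not sep or len(rest) < 3:
            continue  # skip blank or malformed lines
        entries.append({
            "description": description.strip(),
            "is_dna": rest[0] == "1",   # 0 = protein, 1 = DNA
            "letter": rest[1],
            "filename": rest[2:],
        })
    return entries


def select_libraries(selection, entries, is_dna):
    """Return the file names chosen by a string such as '%PRMB'."""
    if not selection.startswith("%"):
        raise ValueError("selection must begin with '%'")
    chosen = []
    for letter in selection[1:]:
        matches = [e["filename"] for e in entries
                   if e["letter"] == letter and e["is_dna"] == is_dna]
        if not matches:
            raise KeyError("no library with letter %r" % letter)
        chosen.append(matches[0])
    return chosen


if __name__ == "__main__":
    fastlibs = [
        "Genbank 70 primates$1P/seqlib/gbpri.seq",
        "Genbank 70 rodents$1R/seqlib/gbrod.seq",
        "Genbank 70 other mammals$1M/seqlib/gbmam.seq",
        "Genbank 70 vertebrates $1B/seqlib/gbvrt.seq",
    ]
    for name in select_libraries("%PRMB", parse_fastlibs(fastlibs), is_dna=True):
        print(name)
```

The is_dna argument is there because the same letter may be used
once for a protein library and once for a DNA library.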

     FASTA1.5 also provides additional flexibility for specifying
the number of results and alignments to be displayed with the -Q
(quiet) option.  The -b number option allows you to specify the
number of sequence scores to show when the search is finished.
Thus


    FASTA -b 100 ...


tells the program to display the top 100 sequence scores. In the
past, if you displayed 100 scores (in -Q mode), you would also
have to store 100 alignments. The -d option allows you to limit the
number of alignments shown.  FASTA -b 100 -d 20 would show 100
scores and 20 alignments.

     The old CUTOFF parameter is no longer used.  The program
stores the best 2000 (IBM-PC, MAC) or 6000 (Unix, VMS) scores and
then throws out the lowest 25%.  It then stores the next 500
(1500) scores that are better than the threshold determined when
the first scores were discarded, and repeats the process as the
library is scanned.  As a result, the best 1500 - 2000 (4500 -
6000) scores are saved.  The old cut-off parameter was also used
to set the joining threshold for the calculation of the initn
score from initial regions.  This joining threshold can now be
set with the -g option or with the GAPCUT parameter.
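
     The score-saving scheme described above can be sketched as
follows.  This is a simplified illustration, not the program's
own code: the function name scan_scores is hypothetical, and the
sketch assumes that the new threshold is the lowest score kept
after each discard.

```python
def scan_scores(scores, capacity=2000):
    """Keep roughly the best `capacity` scores seen while scanning a library.

    When the buffer fills, the lowest 25% are thrown out and the lowest
    surviving score becomes the threshold that later scores must exceed.
    """
    kept = []
    threshold = None
    for score in scores:
        if threshold is not None and score <= threshold:
            continue  # not better than the current threshold
        kept.append(score)
        if len(kept) == capacity:
            kept.sort(reverse=True)
            del kept[capacity - capacity // 4:]   # discard the lowest 25%
            threshold = kept[-1]
    return sorted(kept, reverse=True)


if __name__ == "__main__":
    best = scan_scores(range(10000), capacity=2000)
    print(len(best), best[0], best[-1])
```

Because a score is rejected only when at least 75% of a full
buffer already scores as well or better, the buffer always holds
the best scores seen so far, and its size stays between 75% and
100% of the capacity.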

     Finally, FASTA can provide a complete list of all of the
sequences and scores calculated to a file with the -r (results)
option.  FASTA -r results.out ... creates a file with a list of
scores for every sequence in the library.  The list is not
sorted, and only includes those scores calculated during the
initial scan of the library (the optimized score is not
calculated unless the -o option is used).

2.  Installing the FASTA package

2.1.  Installing the programs

2.1.1.  IBM-PC/DOS version

     For the IBM-PC/DOS version, the FASTA source code disk
contains the complete source code to all of the programs on the
other disks.  The programs were compiled with Borland's Turbo
'C++', using Borland's MAKE utility.  The graphics programs
(PLFASTA, TGREASE) use the graphics device drivers supplied with
the Turbo 'C' V2.0 package.  Also included are the documentation
files PROGRAMS.DOC and FORMAT.DOC.  You do not need any of the
files on the source code disk to run the programs.  The files on
this disk are identical to the UNIX and VMS versions that run on
larger machines.  Also included is the code to compile
ALIGN0.EXE.  ALIGN0 is the same as ALIGN, but does not penalize
for end-gaps.

     If you have the DOS or Macintosh version of the FASTA
package, to install the programs you should:

(1)  Make a new directory (folder) for the FASTA programs.  This
     need not be the same as the directory for your sequence
     databases.

(2)  Copy the files from the FASTA source disk to the new
     directory.

(3)  (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your
     PATH command to include the FASTA directory and (b) add the
     line:

         set FASTLIBS=c:\yourfastadirectory\fastgbs

     On the Macintosh, you may need to edit the "environment"
     file and change the line that reads:

         FASTLIBS=fastgbs




     to indicate the full directory path for the fastgbs file,
     for example:

         FASTLIBS=Q105:FASTA:fastgbs


(4)  Finally, you will need to edit the fastgbs file.  This is
     usually the most confusing part of the installation.  An
     example of this file is shown below; to customize this file
     for your machine, you will need to change the file names
     from those provided in the fastgbs file to ones that reflect
     the directory names and file names you use on your machine.
     This is explained in more detail below.  In addition, some
     entries in the fastgbs file refer to other files of file
     names.  These files of file names (as opposed to actual
     database files) may also need to be edited.

2.1.2.  Unix version

     The FASTA distribution comes with several makefiles that
can be used to compile the FASTA programs.  Over the years, as
AT&T Unix System V and BSD Unix have converged, these files have
become very similar.  To begin with, I recommend using the
standard Makefile.  There are two values in the makefile that
should be checked against the values used on your system.  The
first is the HZ value, which is the frequency in ticks per second
used by the times() system call; this value can usually be found
by running:

    grep HZ /usr/include/sys/*

The second is the choice of function used to return random
numbers.  If you have a rand48() function that returns a 32-bit
random number, use it and use the lines:

    NRAND=nrand48
    RANFLG= -DRAND32

If not, you will need to use the rand() function call and
determine whether it returns a 16-bit or a 32-bit value.  These
functions are used by RDF2 and RSS.  If you have problems
compiling the programs, you may want to examine the makefile.unx
and makefile.sun files, to look for differences.  I have tried to
use very standard unix functions in these programs, and they have
been successfully compiled, with very small changes to the
Makefile, on Suns (SunOS 4.1), IBM RS/6000s (AIX), and MIPS
machines (under the BSD environment).

2.2.  Installing the libraries

2.2.1.  The NBRF protein sequence library

     The FASTA program package does not include any protein or
DNA sequence libraries.  You can obtain the PIR protein sequence
database from:

    National  Biomedical Research Foundation
    Georgetown  University  Medical  Center
    3900 Reservoir Rd, N.W.
    Washington, D.C. 20007

In addition, this database is available via anonymous ftp from
the host "ftp.bchs.uh.edu". It is available in two formats, VMS
and CODATA format.  The "VMS" format (library type 5 below) can
be searched much faster, can be easily reformatted for use by the
"BLAST" rapid searching program, and is compatible with the
Genetics Computer Group package of programs.  The CODATA format
is used by the EUGENE/MBIR computing package from Baylor (library
type 2).

     (DOS/Macintosh users) The SINDEX and EXTRACTP programs now
allow you to index a file in one subdirectory, and then move the
library without having to remake the index.  When you type:
SINDEX @prot.nam, two index files are created: PROT.IXX and
PROT.INX.  PROT.IXX is a binary file that cannot be edited; it
contains the offsets into the library files for each of the
sequence entries.  PROT.INX looks exactly like the original
PROT.NAM file, and can be edited.  However, you cannot change the
order of the library files in PROT.INX.  What you can do is
change the first line, which indicates the directory where the
library files can be found.  The index in PROT.IXX might tell
EXTRACTP to find the entry LCBO at offset 123,456 in the PROT.3
file.  If you changed the PROT.3 line in PROT.INX to PROT.4, LCBO
would not be extracted properly.  However, if you decide to move
your library files from disk /usr/tmp to disk /usr/lib, you can
edit PROT.INX to reflect this change.

     EXTRACTP has also been updated to use the new indexing
scheme.  To extract sequences from a multi-file library that you
made with SINDEX @prot.nam, type: EXTRACTP @prot.nam, or set the
environment variable AABANK=@prot.nam.  Then enter the protein
sequence identifiers as before.  Remember, if you move the
library into a different directory, you will need to copy both
the *.IXX and *.INX files to use EXTRACTP.  You can test EXTRACTP
by trying to extract the PIR sequences LCBO, HBHU, or CCHU.  If
you do not get an error message, the sequences were successfully
extracted.  They are automatically saved to a file with the name
"sequence.aa".  So "LCBO" would be found in "lcbo.aa". When you
need to extract a sequence from the NEW.LIB library, you will
have to set AABANK=new.lib.

2.2.2.  The GENBANK DNA sequence library

     FASTA, TFASTA, and EXTRACTN search and extract sequences
from the GENBANK DNA sequence library in its compressed, floppy
disk format.  This library is available from:




    GENBANK
    c/o Intelligenetics
    700 E. El Camino Real
    Mountain View, CA  94040
    (415) 962-7300

(The GBANN program was used to extract DNA sequence annotations.
Unfortunately, GBANN has not been updated since release 63.0 of
GENBANK, when some changes in the annotation files were made, and
it no longer works.)

     The GenBank DNA sequence library is also available via
anonymous FTP from genbank.bio.net.

2.2.3.  The EMBL CD-ROM libraries

     The European Molecular Biology Laboratory (EMBL) is
distributing a CD-ROM that contains both the complete EMBL DNA
sequence database (which should be essentially identical to the
GenBank DNA sequence database) and the SWISS-PROT protein
sequence database. SWISS-PROT is derived from the NBRF Protein
sequence database with additions from the EMBL DNA sequence
database.  This CD-ROM is a "best-buy," since it provides both
DNA and protein sequence libraries.  It is available from:


    EMBL Data Library
    Meyerhofstr. 1
    D-6900 Heidelberg
    Germany
    +49 6221 387258
    Email: SOFTWARE@EMBL-Heidelberg.DE


     In addition, the SWISS-PROT protein sequence database is
available via anonymous FTP from the hosts genbank.bio.net and
ncbi.nlm.nih.gov.

2.3.  Finding the libraries: FASTLIBS

     FASTA and TFASTA use the environment variable FASTLIBS to
find the protein and DNA sequence libraries.  The FASTLIBS
variable contains the name of a file that has the actual
filenames of the libraries.  The FASTGBS file is an example of
a file that can be referred to by FASTLIBS. To use the FASTGBS
file, type:

    setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX)
    or
    FASTLIBS=/usr/lib/fasta/fastgbs; export FASTLIBS (SysV UNIX)

Then edit the FASTGBS file to indicate where the protein and DNA
sequence libraries can be found.  If you have a hard disk and
your protein sequence library is kept in the file
/usr/lib/seq/aabank.lib and your Genbank DNA sequence library is
kept in the directory /usr/lib/genbank, then fastgbs might
contain:

    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
    GB Primate$1P@/usr/lib/genbank/gpri.nam
    GB Rodent$1R@/usr/lib/genbank/grod.nam
    GB Mammal$1M@/usr/lib/genbank/gmammal.nam
    ^   1    ^^^^       4                   ^     ^
              23                             (5)

The first line of this file says that there is a copy of the NBRF
protein sequence database (which is a protein database) in the
file /usr/lib/seq/aabank.lib, and that it can be selected by
typing "P" on the command line or when the database menu is
presented.

     Note that there are 4 or 5 fields in the lines in fastgbs.
The first field is the description of the library which will be
displayed by FASTA; it ends with a '$'.  The second field (1
character), is a 0 if the library is a protein library and 1 if
it is a DNA library.  The third field (1 character) is the
character to be typed to select the library.

     The fourth field is the name of the library file.  In the
example above, the /usr/lib/seq/aabank.lib file contains the
entire protein sequence library.  However, the DNA library file
names are preceded by a '@', because these files (gpri.nam,
grod.nam, gmammal.nam) do not contain the sequences; instead they
contain the names of the files which contain the sequences.  This
is done because the GENBANK DNA database is broken down into a
large number of smaller files.  In order to search the entire
primate database, you must search more than a dozen files.

     In addition, an optional fifth field can be used to specify
the format of the library file.  Alternatively, you can specify
the library format in a file of file names (a file preceded by an
'@').  This field must be separated from the file name by a space
character (' ').  In the example above, the aabank.lib file is in
Pearson/FASTA format, the swiss.seq file is in PIR/VMS format
(from the EMBL CD-ROM), and the DNA sequences are in compressed
GenBank format.  No file type number is included for the Genbank
files, because it is included in the file of filenames (see
below).  Currently, FASTA can read the following formats:

    0 Pearson/FASTA (>SEQID - comment/sequence)
    1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
    2 NBRF CODATA (ENTRY/SEQUENCE)
    3 EMBL/SWISS-PROT (ID/DE/SQ)
    4 Intelligenetics (;comment/SEQID/sequence)
    5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
    9 Compressed Genbank Floppy format

(In the near future, I hope to support the BLAST formats.) In
particular, this version will work with the EMBL and PIR VMS
formats that are distributed on the EMBL CD-ROM. The latter
format (PIR VMS) is much faster to search than EMBL format.  If a
library format is not specified, for example, because you are
just comparing two sequences, Pearson/FASTA (format 0) is used by
default.  To change this default, you may set the LIBTYPE
environment variable to a number.  For example,

    setenv LIBTYPE 1

would cause the program to use the GenBank LOCUS format by
default for libraries (or the second sequence file), but the
Pearson/FASTA format would still be used for the query sequence.
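
     The fields of a fastgbs line, including the optional format
number and the LIBTYPE default, can be pulled apart as in the
following sketch.  It illustrates the layout described above and
is not the program's own parser.  The function name
parse_fastgbs_line is hypothetical, and the sketch assumes that
LIBTYPE applies to any entry without a format number, and that
the format of a '@' entry is left to the file of file names.

```python
import os

# Library format numbers listed in this document.
FORMATS = {
    0: "Pearson/FASTA",
    1: "Uncompressed Genbank",
    2: "NBRF CODATA",
    3: "EMBL/SWISS-PROT",
    4: "Intelligenetics",
    5: "NBRF/PIR VMS",
    9: "Compressed Genbank Floppy",
}


def parse_fastgbs_line(line, environ=os.environ):
    """Return the four or five fields of one fastgbs line as a dict."""
    description, sep, rest = line.strip().partition("$")
    if not sep or len(rest) < 3:
        raise ValueError("not a fastgbs entry: %r" % line)
    is_dna = rest[0] == "1"            # field 2: 0 = protein, 1 = DNA
    letter = rest[1]                   # field 3: selection character
    name, _, fmt = rest[2:].partition(" ")
    is_file_list = name.startswith("@")
    if is_file_list:
        name = name[1:]                # a file of file names
    if fmt.strip():
        library_format = int(fmt)      # field 5 given explicitly
    elif is_file_list:
        library_format = None          # decided by the file of file names
    else:
        library_format = int(environ.get("LIBTYPE", 0))
    return {
        "description": description,
        "is_dna": is_dna,
        "letter": letter,
        "filename": name,
        "is_file_list": is_file_list,
        "format": library_format,
    }


if __name__ == "__main__":
    entry = parse_fastgbs_line("SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5")
    print(entry["filename"], FORMATS[entry["format"]])
```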

     You can specify a group of library files by putting a '@'
symbol before a file that contains a list of file names to be
searched.  For example, if @gpri.nam is in the fastgbs file, the
file "gpri.nam" might contain the lines:

    </usr/lib/genbank
    >glocus.idx
    gpri1.seq
    gpri2.seq
    gpri12.seq

In this case, the line beginning with a '<' indicates the
directory the files will be found in.  The line beginning with a
'>' indicates the index file; this is only used for the GENBANK
compressed DNA database.  The remaining lines name the actual
sequence files.  So the first sequence file to be searched would
be:

    /usr/lib/genbank/gpri1.seq

The notation "<PIRNAQ:" might be used under the VAX/VMS operating
system. Under UNIX, the trailing '/' is left off, so the library
directory might be written as "</usr/seqlib".  In addition, when
using the floppy disk version of GENBANK, annotation files are
also required. These files (*.ano) should be placed in the same
directory as the *.seq files.

     With version 1.4 of the FASTA package, the FASTA and TFASTA
programs can search a library composed of different files in
different sequence formats.  For example, you may wish to search
the Genbank files (which are in compressed floppy format) and the
EMBL DNA sequence database on CD-ROM.  To do this, you simply
list the names and filetypes of the files to be searched in a
file of filenames.  For example, to search the mammalian portion
of Genbank, the unannotated portion of Genbank, and the
unannotated portion of the EMBL library, you could use the file:




    </usr/lib/DNA
    >glocus.idx
    gpri1.seq 9
    gpri2.seq 9
    ...
    gpri9.seq 9
    #  (this '#' causes the program to display the size of the library)
    grod1.seq 9
    ...
    gmam1.seq 9
    ...
    guna1.seq 9
    ...
    unanno.seq 5
    #


    You do not need to include library format numbers if you
    only use the Pearson/FASTA version of the PIR protein
    sequence library and the Genbank DNA database on floppy
    disks.  If no library type is specified, the program
    assumes that type 0 is being used (unless you have set
    LIBTYPE).  However, if the program sees an index file line
    (e.g. ">glocus.idx"), it assumes that the files are in
    Genbank floppy disk format (type 9).
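
     The rules for a file of file names can be summarized in a
short sketch.  This is an illustration rather than the program's
own code.  The function name read_file_of_names is hypothetical;
the sketch skips '#' lines instead of printing a library summary,
and it assumes that a '/' is added after a Unix directory but not
after a name ending in ':' (as in "<B:" or "<PIRNAQ:").

```python
def read_file_of_names(lines, default_format=0):
    """Return (index_file, [(path, format), ...]) for a file of file names."""
    directory = ""
    index_file = None
    pending = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # '#' requests a library summary
        if line.startswith("<"):
            directory = line[1:]          # directory holding the files
        elif line.startswith(">"):
            index_file = line[1:]         # GENBANK compressed index file
        else:
            fields = line.split()
            fmt = int(fields[1]) if len(fields) > 1 else None
            pending.append((fields[0], fmt))
    # An index line implies Genbank floppy disk format (type 9).
    implied = 9 if index_file is not None else default_format
    files = []
    for name, fmt in pending:
        if directory and not directory.endswith((":", "/", "\\")):
            path = directory + "/" + name
        else:
            path = directory + name
        files.append((path, implied if fmt is None else fmt))
    return index_file, files


if __name__ == "__main__":
    index, files = read_file_of_names([
        "</usr/lib/genbank", ">glocus.idx", "gpri1.seq", "gpri2.seq"])
    print(index)
    for path, fmt in files:
        print(path, fmt)
```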


     Although FASTA works best when the libraries are saved on a
hard disk, this is not required.  If you do not have a hard disk,
you could refer to the protein database files by making a file
"prot.nam" with the lines:

    <B:
    prot.0
    prot.1
    ...
    prot.6
    #       (print library summary)
    new.0
    ...

The FASTA program would then look for the files on the B: drive,
and when it did not find them, it would allow you to replace the
diskette in the drive.


     Test the setup by running FASTA.  Enter the sequence file
'MUSPLFM.AA' when the program requests it (this file is included
with the programs).  The program should then ask you to select a
protein sequence library.  Alternatively, if you run the TFASTA
program and use the MUSPLFM.AA query sequence, the program should
show you a selection of DNA sequence libraries.  Once the fastgbs
file has been set up correctly, you can set FASTLIBS=fastgbs in
your AUTOEXEC.BAT file, and you will not need to remember where
the libraries are kept or how they are named.

     The EXTRACTN program extracts DNA sequences or annotations
from the GENBANK DNA sequence library in the compressed floppy
disk format. To tell EXTRACTN where to find the DNA sequence
library and index files, set the environment variable GBLIB.

    setenv GBLIB /usr/lib/genbank


     FASTA and TFASTA must open a large number of files when
searching and reporting the results of a GENBANK floppy disk
format library search.  You may have problems with the large
number of files under DOS on IBM-PC's (Unix and VMS users will
not have these problems).  If you are going to search the GENBANK
floppy disk format DNA sequence library under DOS, you should add
the line:

    FILES=16

to your CONFIG.SYS file.  (Typically this is already done for
programs like Windows or WordPerfect.)




3.  Using the FASTA Package

3.1.  Overview

     The FASTA sequence comparison programs all require similar
information: the name of a query sequence file, a library file,
and the ktup parameter.  All of the programs can accept arguments
on the command line, or they will prompt for the file names and
ktup value.

To use FASTA, simply type:

    FASTA
    and you will be prompted for :
         the name of the test sequence file
         the name of the library file
         and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)

             ktup of 2 is about 5 times faster than ktup = 1.
             For  a  200  aa sequence against a 10,000,000 aa
             library, the program takes  about  30  min  with
             ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286
             IBM-PC.


The program can also be run by typing

    FASTA test.aa /lib/bigfile.lib ktup (1 or 2)


Included with the package are the test files, MUSPLFM.AA,
LCBO.AA, MCHU.AA and BOVPRL.SEQ.  To make certain that
everything is working, you can try:

    fasta musplfm.aa lcbo.aa
    and
    tfasta musplfm.aa bovprl.seq

To test the local similarity programs LFASTA and PLFASTA, try:

    lfasta mchu.aa mchu.aa
    and
    plfasta mchu.aa mchu.aa (use this only on an IBM-PC with graphics
    or on a Tektronix terminal under UNIX or VMS)

MCHU (calmodulin) has four duplicated calcium binding sites that
are clearly detected by LFASTA.  For a more complicated example,
try MWRTC1.aa, myosin heavy chain.

3.2.  Sequence files

     The FASTA programs know about three kinds of sequence files
(four under VMS): (1) Plain sequence files, which can only be
used as query sequences or for LFASTA, RDF2, and ALIGN.  (2)
Standard library files.  These are the same as plain sequence
files, except that each sequence is preceded by a comment line
with a '>' in the first column.  (3) Distributed sequence
libraries.  This is a broad class that includes the NBRF/PIR VMS
and blocked ascii formats, Genbank flat-file format, EMBL
flat-file format, and Intelligenetics format.  All of the files
that you create should be of type (1) or (2).  Type (2) files
(ones with a '>' comment line) can be used as query or library
sequence files by all of the programs.

     I have included several sample test files, *.AA.  The first
line may begin with a '>'  or ';' followed by a comment.  The
text after ';' in other lines will  be  ignored.   Spaces  and
tabs  (and anything else that  is  not  an amino-acid code) are
ignored.

     Library files should have the form:

    >Sequence name and identifier
    A F A S Y T .... actual sequence.
    F S S       .... second line of sequence.
    >Next sequence name and identifier

This is the form of the PROT.* files supplied with the floppy disk
version of the PIR protein sequence library. You can also build
your own library by concatenating several sequence files.  Just
be sure that each sequence is preceded by a line beginning with a
'>' with a sequence name.

     The test file should not have lines longer than 120
characters, and sequences entered with word processors should use
a document mode, with normal carriage returns at the end of
lines.
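
     The library file layout described above is simple enough to
read with a few lines of code.  The following sketch is an
illustration, not the reader used by the programs.  The function
name read_library is hypothetical, and the sketch assumes that
every letter is a residue code and that everything else (spaces,
tabs, digits, punctuation, and text after a ';') is ignored.

```python
def read_library(lines):
    """Return a list of (name, sequence) pairs from a '>'-style library.

    A file with no '>' line (a plain sequence file) yields one
    record with an empty name.
    """
    records = []
    name = None
    residues = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None or residues:
                records.append((name or "", "".join(residues)))
            name = line[1:].strip()
            residues = []
        else:
            text = line.split(";", 1)[0]          # text after ';' is ignored
            residues.extend(ch.upper() for ch in text if ch.isalpha())
    if name is not None or residues:
        records.append((name or "", "".join(residues)))
    return records


if __name__ == "__main__":
    example = [
        ">SEQ1 first test sequence",
        "A F A S Y T",
        "F S S ; comment on this line",
        ">SEQ2 second test sequence",
        "mkv\tl",
    ]
    for name, sequence in read_library(example):
        print(name, sequence)
```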

Program Summary

3.3.  Sequence search programs

FASTA     universal sequence comparison. Defaults to comparing
          protein sequences; if the sequences are > 85% A+C+G+T
          or the -n option is used, a DNA sequence is assumed.

TFASTA    Search DNA library for a protein sequence by
          translating the DNA sequence to protein in all six
          frames (three forward frames with the -3 command line
          option). TFASTA with ktup=2 is about as fast as a DNA
          FASTA with ktup=4, and is substantially more sensitive.
          (also reads the GENBANK library)

SSEARCH   Universal sequence comparison using the Smith-Waterman
          algorithm ( T. F. Smith and M. S. Waterman (1981) J.
          Mol. Biol. 147:195-197).  This program uses code
          developed by Huang and Miller (X. Huang, R. C.
          Hardison, W. Miller (1990) CABIOS 6:373-381) for
          calculating the local similarity score and code from
          the ALIGN program (see below) for calculating the local
          alignment.  SSEARCH is about 100-times slower than
          FASTA with ktup=2 (for proteins).  It should never be
          used to search an entire protein sequence library, but
          can be used to search several hundred sequences.

ALIGN     optimal global alignment of two sequences with no
          short-cuts.  This program is a slightly modified
          version of one taken from E.  Myers and W. Miller. The
          algorithm is described in E. Myers and W.  Miller,
          "Optimal Alignments in Linear Space" (CABIOS (1988)
          4:11-17).
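
     The six-frame translation that TFASTA applies to each DNA
library sequence can be sketched as follows.  This is an
illustration, not the code used by TFASTA.  The function names
are hypothetical; the sketch assumes the standard genetic code,
writes '*' for a stop codon and 'X' for any codon containing a
base other than A, C, G, or T, and drops incomplete codons at the
end of a frame.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built in TCAG order.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Reverse complement; unknown bases become 'N'."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(dna.upper()))


def translate(dna):
    """Translate complete codons, starting at the first base."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna, forward_only=False):
    """Three forward translations, then three of the reverse complement."""
    strands = [dna] if forward_only else [dna, reverse_complement(dna)]
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


if __name__ == "__main__":
    for frame in six_frames("ATGGCCATTGTAATGGGCCGCTGA"):
        print(frame)
```

Setting forward_only corresponds to the three forward frames
selected by the -3 option.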

3.4.  Local similarity programs

LFASTA    local similarity searches showing local alignments.
          The algorithm used to calculate the local alignment in
          a band has been improved (Chao, Pearson, and Miller,
          submitted).

PLFASTA   local similarity searches with plot output (on the IBM,
          this program requires that the environment variable
          BGIDIR be set).

PCLFASTA  (unix only) local similarity searches with plot output
          using pic commands.

LALIGN    Calculates the N-best local alignments using a rigorous
          algorithm.  (N=10 by default.)  The algorithm was
          developed by Huang and Miller (X. Huang and W. Miller
          (1991) Adv. Appl. Math. 12:337-357) and is a
          linear-space version of an algorithm described by M. S.
          Waterman and M. Eggert (J. Mol. Biol. (1987) 197:723-
          728).  Like SSEARCH, LALIGN is rigorous, but also very
          slow.

PLALIGN   A version of LALIGN that plots its output to a screen
          or to a Tektronix terminal emulator.
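
     The rigorous programs (SSEARCH, LALIGN) are built around
the Smith-Waterman local similarity score.  The sketch below
computes only that score (not the alignment, and not the N-best
alignments) in linear space.  It is not the Huang and Miller
code: the match and mismatch values are hypothetical stand-ins
for a scoring matrix, and the gap penalties follow the -12
(first residue) and -4 (each additional residue) values given
for the SMATRIX files.

```python
# Illustrative sketch only; match/mismatch values are assumptions
# standing in for a real scoring matrix such as PAM250.

def smith_waterman_score(a, b, match=5, mismatch=-4,
                         gap_first=-12, gap_extend=-4):
    """Best local similarity score between sequences a and b."""
    neg_inf = float("-inf")
    n_cols = len(b) + 1
    h_prev = [0] * n_cols        # best scores for the previous row
    f_prev = [neg_inf] * n_cols  # previous row, ending in a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * n_cols
        f_cur = [neg_inf] * n_cols
        e = neg_inf              # current row, ending in a gap in a
        for j in range(1, n_cols):
            # open a new gap or extend an existing one
            e = max(h_cur[j - 1] + gap_first, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_first, f_prev[j] + gap_extend)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = h_prev[j - 1] + pair
            # a local alignment may restart anywhere, so never below 0
            h_cur[j] = max(0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

The work grows with the product of the two sequence lengths,
which is why these programs are so much slower than FASTA.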

3.5.  Statistical Significance

RDF2      improved version of RDF program with all three scoring
          methods (now includes local, or window, shuffle
          routine)

RSS       A version of RDF2 that uses the rigorous Smith-Waterman
          calculation used by SSEARCH.  RSS should provide a more
          rigorous test of the statistical significance of a
          similarity score.

RELATE    significance program described by Dayhoff (Atlas of
          Protein Sequence and Structure, Vol. 5, Supplement 3).
          Each chunk of 25 residues in one sequence is compared
          to every 25-residue fragment of the second sequence.
          Sequences that are genuinely related will have a large
          number of scores more than 3 standard deviations
          above the mean score of all of the comparisons.
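
     The RELATE comparison described above can be sketched as
follows.  This is a simplified illustration, not the RELATE
program: it assumes that the chunks of the first sequence do not
overlap, and it counts identities where RELATE would use a
scoring matrix.

```python
# Illustrative sketch only; identity counting stands in for the
# scoring matrix, and non-overlapping chunks are an assumption.
import statistics


def relate_sketch(seq1, seq2, window=25, sd_cutoff=3.0):
    """Return (number of high scores, number of comparisons)."""
    scores = []
    for start1 in range(0, len(seq1) - window + 1, window):
        chunk = seq1[start1:start1 + window]
        for start2 in range(len(seq2) - window + 1):
            fragment = seq2[start2:start2 + window]
            scores.append(sum(x == y for x, y in zip(chunk, fragment)))
    if len(scores) < 2:
        return 0, len(scores)
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return 0, len(scores)  # all scores equal: nothing stands out
    high = sum(1 for s in scores if (s - mean) / sd > sd_cutoff)
    return high, len(scores)
```

A large first value relative to the number of comparisons
suggests that the two sequences share related segments.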

3.6.  Other analysis programs

AACOMP    calculate the amino acid composition and molecular
          weight of a sequence.

BESTSCOR  calculate the best self-comparison score.

GREASE    Kyte-Doolittle hydropathicity profile

TGREASE   graphic plot of Kyte-Doolittle profile

FROMGB    convert from GenBank LOCUS format (also used by the
          IBI-Pustell programs) to Pearson/FASTA format.

GARNIER   A secondary structure prediction program using the
          method of Garnier, Osguthorpe, and Robson, J. Mol.
          Biol., (1978) 120:97-120.
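
     As an illustration of the kind of profile that GREASE
reports, the sketch below averages Kyte-Doolittle hydropathy
values over a sliding window.  It is not the GREASE code; the
window length of 7 is an assumption chosen for the example.

```python
# Illustrative sketch only; the window length is an assumption.

# Kyte-Doolittle hydropathy values for the 20 amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Mean hydropathy of each window of residues along seq."""
    values = [KYTE_DOOLITTLE[residue] for residue in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Positive averages mark hydrophobic stretches and negative
averages mark hydrophilic ones.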

3.7.  Searching for keywords

FINDP     (DOS, Macintosh only) Searches the protein sequence
          library title lines (or the aabank.nam file created by
          SINDEX) for a list of key words.  For example:

              FINDP aabank.nam trypsin

          will search the file of title lines and report all
          lines with the word "trypsin" in them.  You can search
          for several words at once, by putting several words on
          the line.  Normally, FINDP (and FINDN) ignore upper and
          lower case.  If you would like to search for a specific
          case, e.g. Trypsin but not chymotrypsin, use the -l
          option:

              FINDP aabank.nam -l Trypsin


FINDN     Searches the GENBANK *.ano annotation files for words.
          FINDN can search a specific file, or a list of
          annotation files.  For example, if the file GPRIA.NAM
          contains the lines:

              gpri1.ano
              gpri2.ano
              gpri3.ano
              ...

          then

              FINDN @gpria.nam trypsin

          would search all of the files.  FINDN also uses "-l" to
          preserve upper/lower case distinctions.
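
     The keyword matching done by FINDP and FINDN can be
pictured with the sketch below.  It is not the programs' code:
it assumes that a line is reported when it contains any one of
the words, and the case_sensitive argument plays the role of the
-l option.

```python
# Illustrative sketch only; "any word matches" is an assumption.

def find_lines(lines, words, case_sensitive=False):
    """Return the lines that contain at least one of the words."""
    hits = []
    for line in lines:
        haystack = line if case_sensitive else line.lower()
        for word in words:
            needle = word if case_sensitive else word.lower()
            if needle in haystack:
                hits.append(line)
                break
    return hits
```

With the default setting, a search for "trypsin" also reports
title lines that contain "chymotrypsin" or "Trypsin".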

3.8.  Options

     These programs have a number of output options, which are
set by the environment variables LINLEN, SHOWALL, and MARKX.
Alternatively, these values can be controlled by command line
options.  The number of sequence residues per output line is
adjustable by setting the environment variable LINLEN, or the
command line option -w.  LINLEN is normally 60; to change it, set
LINLEN=80 before running the program or add -w 80 to the command
line.  LINLEN can be set up to 200.  SHOWALL (-a) determines
whether all, or just a portion, of the aligned sequences are
displayed.  Previously, FASTP would show the entire length of
both sequences in an alignment while FASTN would only show the
portions of the two sequences that overlapped.  Now the default
is to show only the overlap between the two sequences; to show
complete sequences, set SHOWALL=1, or use the -a option on the
command line.
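
     The way the output line length is chosen can be sketched as
follows.  This is an illustration rather than the programs' own
logic: the function name is made up, and clamping out-of-range
values to the 1-200 range is an assumption.  A command line
value takes precedence over the environment variable.

```python
# Illustrative sketch only; clamping to 1..200 is an assumption.
import os


def resolve_line_length(cli_width=None, environ=None,
                        default=60, maximum=200):
    """Pick the line length from -w, then LINLEN, then the default."""
    environ = os.environ if environ is None else environ
    if cli_width is not None:
        value = int(cli_width)          # -w on the command line
    elif environ.get("LINLEN"):
        value = int(environ["LINLEN"])  # LINLEN environment variable
    else:
        value = default                 # normally 60
    return max(1, min(value, maximum))
```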

     The differences between the two aligned sequences can be
highlighted in three different ways by changing the environment
variable MARKX or the -m option.  Normally (MARKX=0) the program
uses ':' to denote identities and '.' to denote conservative
replacements.  If MARKX=1, the program will not mark identities;
instead conservative replacements are denoted by an 'x' and
non-conservative substitutions by an 'X'.  If MARKX=2, the
residues in the second sequence are only shown if they are
different from the first.  Thus the three options are:


    MARKX=0 (default)       MARKX=1        MARKX=2

            MWRTCGPPYT     MWRTCGPPYT     MWRTCGPPYT
            ::..:: :::       xx  X        ..KS..Y...
            MWKSCGYPYT     MWKSCGYPYT
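
     The three marking styles can be reproduced with the sketch
below.  It is only an illustration: the CONSERVATIVE set is a
hypothetical stand-in that covers just the two replacements in
the example above, whereas the programs decide what is
conservative from the scoring matrix.

```python
# Illustrative sketch only; CONSERVATIVE is a hypothetical stand-in
# for a decision the programs make from the scoring matrix.

CONSERVATIVE = {frozenset("RK"), frozenset("TS")}


def mark_alignment(seq1, seq2, markx=0):
    """Return the marker line (MARKX=0, 1) or second line (MARKX=2)."""
    out = []
    for a, b in zip(seq1, seq2):
        conservative = frozenset((a, b)) in CONSERVATIVE
        if markx == 0:
            out.append(":" if a == b else "." if conservative else " ")
        elif markx == 1:
            out.append(" " if a == b else "x" if conservative else "X")
        else:
            # MARKX=2: show the second sequence only where it differs
            out.append("." if a == b else b)
    return "".join(out)
```

For the pair MWRTCGPPYT and MWKSCGYPYT this gives the three
middle or lower lines shown in the table above.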


3.9.  Command line options

     It is now possible to specify  several options on the
command line, instead of using environment variables.  The
command line options are preceded by a dash; the following
options are available:

-a        same as SHOWALL=1

-b #      number of sequence scores to be shown on output

-c #      threshold score for optimization (OPTCUT).  Set "-c 1"
          and "-o" to optimize every sequence in a database.
          (This slows the program down about 5-fold.)

-d #      number of alignments to be reported by default. (Used
          in conjunction with -Q).

-f        use the identical match score from the scoring matrix
          in the scan for initial regions (default for protein)
          (PAMFACT=1)

-g #      Threshold for joining init1 segments to build an initn
          score (GAPCUT).

-k        use constant score in scan for initial regions (like
          old fastp, fastn, default for DNA) (PAMFACT=0)

-l file   location of library menu file (FASTLIBS)

-m #      MARKX = # (0, 1, 2)

-n        Force the query sequence to be treated as a DNA
          sequence.  This is particularly useful for query
          sequences that contain a large number of ambiguous
          residues, e.g. transcription factor binding sites.

-o        optimize all scores greater than OPTCUT.  If '-c' is
          not specified, OPTCUT will be calculated from the
          length of the sequence and the ktup setting, as the old
          CUTOFF value used to be.

-Q        quiet - does not prompt for any input.  Writes scores
          and alignments to the terminal or standard output file.

-r file   save a results summary line for every sequence in the
          sequence library.  The summary line includes the
          sequence identifier, superfamily number (if available)
          position in the library, and the similarity scores
          calculated.  This option can be used to evaluate the
          sensitivity and selectivity of different search
          strategies (see W. R. Pearson (1991) Genomics 11:635-
          650.)

-s file   SMATRIX is read from file.  Several SMATRIX files are
          provided with the standard distribution.  For protein
          sequences: codaa.mat - based on minimum mutation
          matrix; idnaa.mat - identity matrix; idpaa.mat -
          identity matrix for mismatches, but identical matches
          weighted according to the PAM250 matrix; pam250.mat -
          the PAM250 matrix developed by Dayhoff et al. (Atlas of
          Protein Sequence and Structure, vol. 5, suppl. 3,
          1978); pam120.mat - a PAM120 matrix.  The SMATRIX also
          specifies the penalties for the first residue in a gap
          and additional residues in a gap; FASTA, the other
          alignment programs, and the SMATRIX files use -12 and
          -4.  Currently, to change the -12, -4 gap penalties, the
          SMATRIX file must be edited.

-v        (LINEVAL) values used for line styles in plfasta

-w #      line length (width) = number (<200)

-x        specifies offsets for the beginning of the query and
          library sequence.  For example, if you are comparing
          upstream regions for two genes, and the first sequence
          contains 500 nt of upstream sequence while the second
          contains 300 nt of upstream sequence, you might try:

              fasta -x "-500 -300" seq1.nt seq2.nt

          If the -x option is not used, FASTA assumes numbering
          starts with 1.  This option will not work properly with
          the translated library sequence with tfasta.  (You
          should double check to be certain the negative
          numbering works properly.)

-1        sort output by init1 score (as FASTP used to do).

-3        (TFASTA only) translate only three forward frames


For example:

    fasta -w 80 -a seq1.aa seq2.aa

would compare the sequence in seq1.aa to that in seq2.aa and
display the results with 80 residues on an output line, showing
all of the residues in both sequences.  Be sure to enter the
options before entering the file names, or just enter the options
on the command line, and the program will prompt for the file
names.

     Not all of these options are appropriate for all of the
programs.  The options above are used by FASTA and TFASTA.
RELATE uses the -s option, ALIGN uses the -w, -m, and -s options,
and the RDF2 programs use -c, -f, -k, and -s.
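
     The rules above (options before file names, and only some
options for some programs) can be summarized in a small sketch.
The table and the build_command helper are hypothetical and are
not part of the package; listing only "rdf2" for the RDF2
programs is an assumption.

```python
# Illustrative sketch only; build_command is a hypothetical helper.

ALL_OPTIONS = {
    "-a", "-b", "-c", "-d", "-f", "-g", "-k", "-l", "-m", "-n",
    "-o", "-Q", "-r", "-s", "-v", "-w", "-x", "-1",
}
ACCEPTED = {
    "fasta": ALL_OPTIONS,
    "tfasta": ALL_OPTIONS | {"-3"},  # -3 is TFASTA only
    "relate": {"-s"},
    "align": {"-w", "-m", "-s"},
    "rdf2": {"-c", "-f", "-k", "-s"},
}


def build_command(program, options, files):
    """Build an argument list with the options before the file names.

    options is a list of (flag, value) pairs; value is None for
    flags that take no argument.
    """
    accepted = ACCEPTED[program]
    argv = [program]
    for flag, value in options:
        if flag not in accepted:
            raise ValueError(f"{program} does not use {flag}")
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    return argv + list(files)
```

For example, build_command("fasta", [("-w", 80), ("-a", None)],
["seq1.aa", "seq2.aa"]) yields the command line shown above.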

4.  Environment variable summary

     Environment variables allow you to set search parameters
that will be used frequently when you run a program; for example,
if you prefer to use the PAM120 scoring matrix, you might "set
SMATRIX=120."  Command line parameters, if used, always override
environment variable settings. The following environment
variables are used by these programs:

AABANK    the file name of the default sequence library.

FASTLIBS  the location of the file which contains the list of
          library files to be searched.

GAPCUT    threshold used for joining init1 regions in the second
          step of FASTA.  Normally set based on sequence length
          and ktup.

GBLIB     the directory where the EXTRACTN files and glocus.idx
          are found.

LIBTYPE   used to specify the format of the library sequence for
          FASTA and TFASTA.

LINLEN    output line length - can go up to 200

LINEVAL   used by plfasta to determine the relationship between
          line style and similarity score (-v).  This should be a
          string of three numbers, e.g.  "200 100 50"

MARKX     symbol for denoting matches, mismatches. Note that this
          symbol is only used across the optimized local region;
          sequences that are outside this region are not marked.

OPTCUT    Set the threshold to be used for optimization in a band
          around the best initial region.  Normally the OPTCUT
          value is calculated from the length of the sequence and
          the ktup value (for a 200 residue sequence, it is about
          28).  If OPTCUT=1, every sequence in the database will
          be optimized.  This is the most sensitive option.

PAMFACT   This version of fasta uses a more sensitive method for
          identifying initial regions. Instead of using a
          constant factor (fact) for each match in a ktup, it
          uses the scoring matrix (PAM) scores.  While this works
          well for protein sequences, it has not been as
          carefully tested for DNA sequences, so by default, this
          modification is used for proteins but not for DNA.  The
          -f 1 option forces this option on. -f 0 forces it off.
          Setting the PAMFACT environment variable to 1 forces
          the option on; PAMFACT=0 turns it off.

SHOWALL   on output, show the complete sequence instead of just
          the overlap of the two aligned sequences.

SMATRIX   alternative scoring matrix file.

TEKPLOT   (IBM-PC only; Unix and VMS versions generate Tektronix
          graphics by default) Generate Tektronix output.
          Normally, PLFASTA and TGREASE plot graphs using the
          Turbo C graphics library.  Unfortunately, these plots
          often cannot be printed out without special programs.
          (I have used GRAFPLUS, from Jewell Technologies, (206)
          937-1081, $50, successfully.)  However, if you set
          TEKPLOT=1, Tektronix graphics commands will be used.
          Tektronix commands can be used together with the
          PLOTDEV program, available from Microplot Systems, 1897
          Red Fern Dr., Columbus, OH, 43229, (614) 882-4786, for
          $40, which also allows you to print out graphics on the
          screen.


As always, please inform me of bugs as soon as possible.

William R. Pearson
Department of Biochemistry
Box 440, Jordan Hall
U. of Virginia
Charlottesville, VA

wrp@virginia.EDU
wrp@virginia.BITNET

