FASTA.DOC Release 1.6

COPYRIGHT NOTICE

Copyright 1988, 1991, 1992 by William R. Pearson and the University of Virginia. All rights reserved. The FASTA program and documentation may not be sold or incorporated into a commercial product, in whole or in part, without written consent of William R. Pearson and the University of Virginia. For further information regarding permission for use or reproduction, please contact:

    William R. Wilkerson, Assistant Provost for Research,
    University of Virginia, P.O. Box 9025,
    Charlottesville, VA 22906-9025, (804) 924-6853

The FASTA program package

Introduction

This documentation describes version 1.6c of the FASTA program package (see W. R. Pearson and D. J. Lipman (1988), "Improved Tools for Biological Sequence Analysis", PNAS 85:2444-2448, and W. R. Pearson (1990) "Rapid and Sensitive Sequence Comparison with FASTP and FASTA" Methods in Enzymology 183:63-98). Version 1.6 is the first release for the IBM-PC and Macintosh since version 1.4 (version 1.5 was distributed only via ftp to unix machines). Version 1.6 has a large number of improvements over versions 1.4 and 1.5, including the ability to search libraries in several different formats in the same run, more robust algorithms for aligning sequences along a band, and additional, rigorous (but slow) programs for sequence searching, statistical analysis, and local sequence alignment. Several new options are also included. Programs that are new with version 1.6 are highlighted in italics.

Although there are a large number of programs in this package, they belong to three groups:

    Library search programs:   FASTA, TFASTA, SSEARCH
    Local homology programs:   LFASTA, PLFASTA, LALIGN, PLALIGN
    Statistical significance:  RDF2, RELATE, RSS

In addition, there are several programs for other sequence analysis tasks:

    ALIGN - global alignment of two sequences (no limit on gaps).
    EXTRACTP, SINDEX - programs to index (SINDEX) and extract sequences from a protein sequence database.
    EXTRACTN - a program to extract sequences from the GenBank floppy disk format data base.

In addition, I have included several programs for protein sequence analysis, including a Kyte-Doolittle hydropathicity plotting program (GREASE, TGREASE), and a secondary structure prediction package (GARNIER).

The FASTA sequence comparison programs on this disk are improved versions of the FASTP program, originally described in Science (Lipman and Pearson, (1985) Science 227:1435-1441). We have made several improvements. First, the library search programs use a more sensitive method for the initial comparison of two sequences which allows the scores of several similar regions to be combined. Consequently, a library search now reports three scores: initn (the new initial score, which may include several similar regions), init1 (the old fastp initial score from the best initial region), and opt (the old fastp optimized score allowing gaps in a 32 residue wide band).

These programs have also been modified to become "universal" (hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or FAST-N (nucleotides)); by changing the environment variable SMATRIX, the programs can be used to search protein sequences, DNA sequences, or whatever you like. By default, FASTA, LFASTA, and the RDF programs automatically recognize protein and DNA sequences. Sequences are first read as amino acids, and then converted to nucleotides if the sequence is greater than 85% A,C,G,T (the '-n' option can be used to indicate DNA sequences). TFASTA compares protein sequences to a translated DNA sequence.

Alternative scoring matrices can also be used. In addition to the PAM250 matrix for proteins, matrices based on simple identities or the genetic code can also be used for sequence comparisons or evaluation of significance.
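The automatic recognition rule described above (a sequence that is more than 85% A, C, G, T is treated as DNA) can be illustrated with a short Python sketch. This is not the FASTA source code: the function name looks_like_dna is hypothetical, and the choice to count only alphabetic characters in the denominator is an assumption made for the illustration.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Return True if more than `threshold` of the letters are A, C, G or T.

    Assumption for this sketch: only alphabetic characters are counted,
    and the comparison is case-insensitive.
    """
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        return False  # nothing to classify
    acgt = sum(1 for ch in letters if ch in "ACGT")
    return acgt / len(letters) > threshold


if __name__ == "__main__":
    for seq in ("ACGTACGTTTGACCA", "MWRTCGPPYT"):
        kind = "DNA" if looks_like_dna(seq) else "protein"
        print(seq, "->", kind)
```

A short protein that happens to be rich in A, C, G and T, or a DNA query with many ambiguity codes, would be misclassified by a rule of this kind, which is why the '-n' option exists.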
Several different protein sequence matrices have been included; instructions for constructing your own scoring matrix are included in the file FORMAT.DOC.

The remainder of this document is divided into three sections: (1) a brief history of the changes to the FASTA package; (2) a guide to installing the programs and databases; (3) a guide to using the FASTA programs. The programs are very easy to use, so if you are using them on a machine that is administered by someone else, you may want to skip to section (3) to learn how to use the programs, and then read section (1) to look at some of the more recent changes. If you are installing the programs on your own machine, you will need to read section (2) carefully.

1. Revision History

1.1. Changes with version 1.6

FASTA version 1.6 uses a new method for calculating optimal scores in a band (the optimization or last step in the FASTA algorithm). In addition, it uses a linear-space method for calculating the actual alignments. The FASTA package also includes four new programs:

    SSEARCH   a program to search a sequence database using the rigorous Smith-Waterman algorithm (this program is about 100-fold slower than FASTA with ktup=2 for proteins).

    RSS       a version of RDF2 that uses a rigorous Smith-Waterman calculation to score similarities.

    LALIGN    a rigorous local sequence alignment program that will display the N-best local alignments (N=10 by default).

    PLALIGN   a version of LALIGN that plots the local alignments.

The LALIGN/PLALIGN programs incorporate the "sim" algorithm described by Huang and Miller (1991) Adv. Appl. Math. 12:337-357. The SSEARCH and RSS programs incorporate algorithms described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.

LFASTA and PLFASTA now calculate a different number of local similarities; they now behave more like LALIGN/PLALIGN.
Since local alignments of identical sequences produce "mirror-image" alignments, lalign and lfasta consider only one-half of the potential alignments between sequences from identical file names. Thus

    lfasta mchu.aa mchu.aa

displays only two alignments; earlier versions of the program would have displayed five, including the identity alignment. PLFASTA does display five alignments; when two identical filenames are given, it draws the identity alignment, calculates the two unique local alignments, draws them, and draws their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the filenames, rather than the actual sequences, to determine whether sequences are identical; you can "trick" the programs into behaving the old way by putting the same sequence in two different files.

1.2. Changes with version 1.5

FASTA version 1.5 includes a number of substantial revisions to improve the performance and sensitivity of the program. It is now possible to tell the program to optimize all of the initn scores greater than a threshold. The threshold is set at the same value as the old FASTA cutoff score (approximately 0.5 standard deviations above the mean for average length sequences). For highest sensitivity, you can use the -c 1 option to set the threshold to 1. (This will slow the search down about 5-fold.)

Alternatively, you can tell FASTA to sort the results by the init1, rather than the initn, score by using the -1 option.

    FASTA -1 ...

will report the results the way the older FASTP program did. A comparison of the performance of FASTA in this, its slowest mode, with the standard FASTA and the Smith-Waterman algorithm has been published in Genomics (1991) 11:635-650.

A new method has been provided for selecting libraries. In the past, one could enter the name of a sequence file to be searched or a single letter that would specify a library from the list included in the $FASTLIBS file.
Now, you can specify a set of library files with a string of letters preceded by a '%'. Thus, if the FASTLIBS file has the lines:

    Genbank 70 primates$1P/seqlib/gbpri.seq
    Genbank 70 rodents$1R/seqlib/gbrod.seq
    Genbank 70 other mammals$1M/seqlib/gbmam.seq
    Genbank 70 vertebrates $1B/seqlib/gbvrt.seq

then the string "%PRMB" would tell FASTA to search the four libraries listed above. The %PRMB string can be entered either on the command line or when the program asks for a filename or library letter.

FASTA 1.5 also provides additional flexibility for specifying the number of results and alignments to be displayed with the -Q (quiet) option. The -b number option allows you to specify the number of sequence scores to show when the search is finished. Thus

    FASTA -b 100 ...

tells the program to display the top 100 sequence scores. In the past, if you displayed 100 scores (in -Q mode), you would also have to store 100 alignments. The -d option allows you to limit the number of alignments shown.

    FASTA -b 100 -d 20

would show 100 scores and 20 alignments.

The old CUTOFF parameter is no longer used. The program stores the best 2000 (IBM-PC, MAC) or 6000 (Unix, VMS) scores and then throws out the lowest 25%. It then stores the next 500 (1500) scores that are better than the threshold determined when the first scores were discarded, and repeats the process as the library is scanned. As a result, the best 1500 - 2000 (4500 - 6000) scores are saved.

The old cut-off parameter was also used to set the joining threshold for the calculation of the initn score from initial regions. This joining threshold can now be set with the -g option or with the GAPCUT parameter.

Finally, FASTA can write a complete list of all of the sequences and scores calculated to a file with the -r (results) option.

    FASTA -r results.out ...

creates a file with a list of scores for every sequence in the library.
The list is not sorted, and only includes those scores calculated during the initial scan of the library (the optimized score is not calculated unless the -o option is used).

2. Installing the FASTA package

2.1. Installing the programs

2.1.1. IBM-PC/DOS version

For the IBM-PC/DOS version, the FASTA source code disk contains the complete source code to all of the programs on the other disks. The programs were compiled with Borland's Turbo 'C++', using Borland's MAKE utility. The graphics programs (PLFASTA, TGREASE) use the graphics device drivers supplied with the Turbo 'C' V2.0 package. Also included are the documentation files PROGRAMS.DOC and FORMAT.DOC. You do not need any of the files on the source code disk to run the programs. The files on this disk are identical to the UNIX and VMS versions that run on larger machines. Also included is the code to compile ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not penalize for end-gaps.

If you have the DOS or Macintosh version of the FASTA package, to install the programs you should:

(1) Make a new directory (folder) for the FASTA programs. This need not be the same as the directory for your sequence databases.

(2) Copy the files from the FASTA source disk to the new directory.

(3) (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your PATH command to include the FASTA directory and (b) add the line:

        set FASTLIBS=c:\yourfastadirectory\fastgbs

    On the Macintosh, you may need to edit the "environment" file and change the line that reads:

        FASTLIBS=fastgbs

    to indicate the full directory path for the fastgbs file, for example:

        FASTLIBS=Q105:FASTA:fastgbs

(4) Finally, you will need to edit the fastgbs file. This is usually the most confusing part of the installation.
    An example of this file is shown below; to customize this file for your machine, you will need to change the file names from those provided in the fastgbs file to ones that reflect the directory names and file names you use on your machine. This is explained in more detail below. In addition, some entries in the fastgbs file refer to other files of file names. These files of file names (as opposed to actual database files) may also need to be edited.

2.1.2. Unix version

The FASTA distribution comes with several makefiles that can be used to compile the FASTA programs. Over the years, as ATT Unix System 5 and BSD unix have converged, these files have become very similar. To begin with, I recommend using the standard Makefile. There are two things in the makefile that should be checked against your system. The first is the HZ value, which is the frequency in ticks per second used by the times() system call; this value can usually be found by running:

    grep HZ /usr/include/sys/*

The second is the set of functions available to return random numbers. If you have a rand48() function that returns a 32-bit random number, use it and use the lines:

    NRAND=nrand48
    RANFLG= -DRAND32

If not, you will need to use the rand() function call and determine whether it returns a 16-bit or a 32-bit value. These functions are used by RDF2 and RSS.

If you have problems compiling the programs, you may want to examine the makefile.unx and makefile.sun files, to look for differences. I have tried to use very standard unix functions in these programs, and they have been successfully compiled, with very small changes to the Makefile, on Suns (Sun OS 4.1), IBM RS/6000s (AIX), and MIPS machines (under the BSD environment).

2.2. Installing the libraries

2.2.1. The NBRF protein sequence library

The FASTA program package does not include any protein or DNA sequence libraries.
You can obtain the PIR protein sequence database from:

    National Biomedical Research Foundation
    Georgetown University Medical Center
    3900 Reservoir Rd, N.W.
    Washington, D.C. 20007

In addition, this database is available via anonymous ftp from the host "ftp.bchs.uh.edu". It is available in two formats, VMS and CODATA format. The "VMS" format (library type 5 below) can be searched much faster, can be easily reformatted for use by the "BLAST" rapid searching program, and is compatible with the Genetics Computer Group package of programs. The CODATA format is used by the EUGENE/MBIR computing package from Baylor (library type 2).

(DOS/Macintosh users) The SINDEX and EXTRACTP programs now allow you to index a file in one subdirectory, and then move the library without having to remake the index. When you type: SINDEX @prot.nam, two index files are created: PROT.IXX and PROT.INX. PROT.IXX is a binary file that cannot be edited; it contains the offsets into the library files for each of the sequence entries. PROT.INX looks exactly like the original PROT.NAM file, and can be edited. However, you cannot change the order of the library files in PROT.INX. What you can do is change the first line, which indicates the directory where the library files can be found. The index in PROT.IXX might tell EXTRACTP to find the entry LCBO at offset 123,456 in the PROT.3 file. If you changed the PROT.3 line in PROT.INX to PROT.4, LCBO would not be extracted properly. However, if you decide to move your library files from disk /usr/tmp to disk /usr/lib, you can edit PROT.INX to reflect this change.

EXTRACTP has also been updated to use the new indexing scheme. To extract sequences from a multi-file library that you made with SINDEX @prot.nam, type: EXTRACTP @prot.nam, or set the environment variable AABANK=@prot.nam. Then enter the protein sequence identifiers as before.
Remember, if you move the library into a different directory, you will need to copy both the *.IXX and *.INX files to use EXTRACTP.

You can test EXTRACTP by trying to extract the PIR sequences LCBO, HBHU, or CCHU. If you do not get an error message, the sequences were successfully extracted. They are automatically saved to a file with the name "sequence.aa". So "LCBO" would be found in "lcbo.aa". When you need to extract a sequence from the NEW.LIB library, you will have to set AABANK=new.lib.

2.2.2. The GENBANK DNA sequence library

FASTA, TFASTA, and EXTRACTN search and extract sequences from the GENBANK DNA sequence library in its compressed, floppy disk format. This library is available from:

    GENBANK
    c/o Intelligenetics
    700 E. El Camino Real
    Mountain View, CA 94040
    (415) 962-7300

(The GBANN program used to extract DNA sequence annotations. Unfortunately, GBANN has not been updated since release 63.0 of GENBANK, when some changes in the annotation files were made. GBANN no longer works.)

The GenBank DNA sequence library is also available via anonymous FTP from genbank.bio.net.

2.2.3. The EMBL CD-ROM libraries

The European Molecular Biology Laboratory (EMBL) is distributing a CD-ROM that contains both the complete EMBL DNA sequence database (which should be essentially identical to the GenBank DNA sequence database) and the SWISS-PROT protein sequence database. SWISS-PROT is derived from the NBRF Protein sequence database with additions from the EMBL DNA sequence database. This CD-ROM is a "best-buy," since it provides both DNA and protein sequence libraries. It is available from:

    EMBL Data Library
    Meyerhofstr. 1
    D-6900 Heidelberg
    Germany
    +49 6221 387258
    Email: SOFTWARE@EMBL-Heidelberg.DE

In addition, the SWISS-PROT protein sequence database is available via anonymous FTP from the hosts genbank.bio.net and ncbi.nlm.nih.gov.

2.3.
Finding the libraries: FASTLIBS

FASTA and TFASTA use the environment variable FASTLIBS to find the protein and DNA sequence libraries. The FASTLIBS variable contains the name of a file that has the actual filenames of the libraries. The FASTGBS file is an example of a file that can be referred to by FASTLIBS. To use the FASTGBS file, type:

    setenv FASTLIBS /usr/lib/fasta/fastgbs                  (BSD UNIX)

or

    FASTLIBS=/usr/lib/fasta/fastgbs; export FASTLIBS        (SysV UNIX)

Then edit the FASTGBS file to indicate where the protein and DNA sequence libraries can be found. If you have a hard disk and your protein sequence library is kept in the file /usr/lib/seq/aabank.lib and your Genbank DNA sequence library is kept in the directory /usr/lib/genbank, then fastgbs might contain:

    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
    GB Primate$1P@/usr/lib/genbank/gpri.nam
    GB Rodent$1R@/usr/lib/genbank/grod.nam
    GB Mammal$1M@/usr/lib/genbank/gmammal.nam
    ^   1    ^^^^             4             ^ ^
              23                             (5)

The first line of this file says that there is a copy of the NBRF protein sequence database (which is a protein database) in the file /usr/lib/seq/aabank.lib, and that it can be selected by typing "P" on the command line or when the database menu is presented.

Note that there are 4 or 5 fields in the lines in fastgbs. The first field is the description of the library which will be displayed by FASTA; it ends with a '$'. The second field (1 character) is a 0 if the library is a protein library and a 1 if it is a DNA library. The third field (1 character) is the character to be typed to select the library. The fourth field is the name of the library file. In the example above, the /usr/lib/seq/aabank.lib file contains the entire protein sequence library. However, the DNA library file names are preceded by a '@', because these files (gpri.nam, grod.nam, gmammal.nam) do not contain the sequences; instead they contain the names of the files which contain the sequences.
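The field layout just described can be made concrete with a small Python sketch that splits one fastgbs line into its parts. This is only an illustration of the documented layout, not code from the FASTA package; the names LibraryEntry and parse_fastlibs_line are hypothetical, and the handling of the optional trailing format number (which the example lines above show after a space) is an assumption about how strictly the file is laid out.

```python
from typing import NamedTuple, Optional


class LibraryEntry(NamedTuple):
    description: str          # field 1: text shown in the library menu
    is_dna: bool              # field 2: '0' = protein, '1' = DNA
    key: str                  # field 3: letter typed to select the library
    path: str                 # field 4: library file (without any leading '@')
    is_file_of_names: bool    # True if field 4 started with '@'
    fmt: Optional[int]        # optional field 5: library format number


def parse_fastlibs_line(line):
    """Split one line of a FASTLIBS (fastgbs) file into its fields."""
    description, dollar, rest = line.rstrip("\n").partition("$")
    if not dollar or len(rest) < 3:
        raise ValueError("expected 'description$<type><key><file> [format]'")
    type_flag, key, remainder = rest[0], rest[1], rest[2:]
    if type_flag not in ("0", "1"):
        raise ValueError("library type must be 0 (protein) or 1 (DNA)")
    parts = remainder.split()
    if not parts:
        raise ValueError("missing library file name")
    path = parts[0]
    fmt = int(parts[1]) if len(parts) > 1 else None
    is_file_of_names = path.startswith("@")
    if is_file_of_names:
        path = path[1:]
    return LibraryEntry(description, type_flag == "1", key, path,
                        is_file_of_names, fmt)


if __name__ == "__main__":
    for text in ("SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5",
                 "GB Primate$1P@/usr/lib/genbank/gpri.nam"):
        print(parse_fastlibs_line(text))
```

Because the description ends at the first '$', it may itself contain spaces and digits ("SWISS PROT 10"), which is why the sketch splits on '$' before looking at the single-character fields.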
Files of file names are used because the GENBANK DNA database is broken down into a large number of smaller files. In order to search the entire primate database, you must search more than a dozen files.

In addition, an optional fifth field can be used to specify the format of the library file. Alternatively, you can specify the library format in a file of file names (a file preceded by an '@'). This field must be separated from the file name by a space character (' '). In the example above, the aabank.lib file is in Pearson/FASTA format and the swiss.seq file is in PIR/VMS format (from the EMBL CD-ROM), while the DNA sequences are in compressed GenBank format. No file type number is included for the Genbank files, because it is included in the file of filenames (see below). Currently, FASTA can read the following formats:

    0   Pearson/FASTA (>SEQID - comment/sequence)
    1   Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
    2   NBRF CODATA (ENTRY/SEQUENCE)
    3   EMBL/SWISS-PROT (ID/DE/SQ)
    4   Intelligenetics (;comment/SEQID/sequence)
    5   NBRF/PIR VMS (>P1;SEQID/comment/sequence)
    9   Compressed Genbank Floppy format

(In the near future, I hope to support the BLAST formats.) In particular, this version will work with the EMBL and PIR VMS formats that are distributed on the EMBL CD-ROM. The latter format (PIR VMS) is much faster to search than EMBL format.

If a library format is not specified, for example, because you are just comparing two sequences, Pearson/FASTA (format 0) is used by default. To change this default, you may set the LIBTYPE environment variable to a number. For example,

    setenv LIBTYPE 1

would cause the program to use the GenBank LOCUS format by default for libraries (or the second sequence file), but the Pearson/FASTA format would still be used for the query sequence.

You can specify a group of library files by putting a '@' symbol before a file that contains a list of file names to be searched.
For example, if @gpri.nam is in the fastgbs file, the file "gpri.nam" might contain the lines:

    </usr/lib/genbank
    >glocus.idx
    gpri1.seq
    gpri2.seq
    gpri12.seq

In this case, the line beginning with a '<' indicates the directory in which the files will be found. The line beginning with a '>' indicates the index file; this is only used for the GENBANK compressed DNA database. The remaining lines name the actual sequence files. So the first sequence file to be searched would be:

    /usr/lib/genbank/gpri1.seq

The notation " [...]

    >glocus.idx
    gpri1.seq 9
    gpri2.seq 9
    ...
    gpri9.seq 9 #   (this '#' causes the program to display the size of the library)
    grod1.seq 9
    ...
    gmam1.seq 9
    ...
    guna1.seq 9
    ...
    unanno.seq 5 #

You do not need to include library format numbers if you only use the Pearson/FASTA version of the PIR protein sequence library and the Genbank DNA database on floppy disks. If no library type is specified, the program assumes that type 0 is being used (unless you have set LIBTYPE). However, if the program sees an index file line (e.g. ">glocus.idx"), it assumes that the files are in Genbank floppy disk format (type 9).

Although FASTA works best when the libraries are saved on a hard disk, this is not required. If you do not have a hard disk, you could refer to the protein database files by making a file "prot.nam" with the lines: [...]

[...] '>' in the first column.

(3) distributed sequence libraries (this is a broad class that includes the NBRF/PIR VMS and blocked ascii formats, Genbank flat-file format, EMBL flat-file format, and Intelligenetics format).

All of the files that you create should be of type (1) or (2). Type (2) files (ones with a '>' in the first column) can be used as query or library sequence files by all of the programs. I have included several sample test files, *.AA. The first line may begin with a '>' or ';' followed by a comment. The text after ';' in other lines will be ignored. Spaces and tabs (and anything else that is not an amino-acid code) are ignored.
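The reading rules stated above for a sequence file (an optional first line beginning with '>' or ';' that holds a comment, text after ';' ignored on other lines, and anything that is not a residue code skipped) can be sketched in Python as follows. This is an illustration only, not the parser used by the FASTA programs; the function name read_query_sequence is hypothetical, and the set of accepted residue letters (the 20 standard amino acids plus B, Z and X) is an assumption.

```python
# Assumed residue alphabet for this sketch: 20 standard amino acids + B, Z, X.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBZX")


def read_query_sequence(text):
    """Return (title, sequence) from the text of a single-sequence file."""
    lines = text.splitlines()
    title = ""
    # The first line may be a comment line starting with '>' or ';'.
    if lines and lines[0][:1] in (">", ";"):
        title = lines[0][1:].strip()
        lines = lines[1:]
    residues = []
    for line in lines:
        line = line.split(";", 1)[0]   # drop text after ';'
        # Keep only residue codes; spaces, tabs, digits, etc. are skipped.
        residues.extend(ch for ch in line.upper() if ch in AMINO_ACIDS)
    return title, "".join(residues)


if __name__ == "__main__":
    sample = ">TEST - sample\nA F A S Y T ; trailing comment\nF S S 12\n"
    print(read_query_sequence(sample))
```

A DNA query would need a nucleotide alphabet instead; the sketch covers only the protein case that the text describes.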
Library files should have the form:

    >Sequence name and identifier
    A F A S Y T .... actual sequence.
    F S S       .... second line of sequence.
    >Next sequence name and identifier

This is the form of the PROT.* files supplied with the floppy disk version of the PIR protein sequence library. You can also build your own library by concatenating several sequence files. Just be sure that each sequence is preceded by a line beginning with a '>' with a sequence name.

The test file should not have lines longer than 120 characters, and sequences entered with word processors should use a document mode, with normal carriage returns at the end of lines.

Program Summary

3.3. Sequence search programs

    FASTA     universal sequence comparison. Defaults to comparing protein sequences; if the sequences are > 85% A+C+G+T or the -n option is used, a DNA sequence is assumed.

    TFASTA    Search DNA library for a protein sequence by translating the DNA sequence to protein in all six frames (three forward frames with the -3 command line option). TFASTA with ktup=2 is about as fast as a DNA FASTA with ktup=4, and is substantially more sensitive. (also reads the GENBANK library)

    SSEARCH   Universal sequence comparison using the Smith-Waterman algorithm (T. F. Smith and M. S. Waterman (1981) J. Mol. Biol. 147:195-197). This program uses code developed by Huang and Miller (X. Huang, R. C. Hardison, W. Miller (1990) CABIOS 6:373-381) for calculating the local similarity score and code from the ALIGN program (see below) for calculating the local alignment. SSEARCH is about 100-times slower than FASTA with ktup=2 (for proteins). It should never be used to search an entire protein sequence library, but can be used to search several hundred sequences.

    ALIGN     optimal global alignment of two sequences with no short-cuts. This program is a slightly modified version of one taken from E. Myers and W. Miller. The algorithm is described in E. Myers and W.
Miller, "Optimal Alignments in Linear Space" (CABIOS (1988) 4:11-17).

3.4. Local similarity programs

    LFASTA    local similarity searches showing local alignments. The algorithm used to calculate the local alignment in a band has been improved (Chao, Pearson, and Miller, submitted).

    PLFASTA   local similarity searches with plot output (on the IBM, this program requires that the environment variable BGIDIR be set).

    PCLFASTA  (unix only) local similarity searches with plot output using pic commands.

    LALIGN    Calculates the N-best local alignments using a rigorous algorithm. (N=10 by default.) The algorithm was developed by Huang and Miller (X. Huang and W. Miller (1991) Adv. Appl. Math. 12:337-357), which is a linear-space version of an algorithm described by M. S. Waterman and M. Eggert (J. Mol. Biol. 197:723-728). Like SSEARCH, LALIGN is rigorous, but also very slow.

    PLALIGN   A version of LALIGN that plots its output to a screen or to a Tektronix terminal emulator.

3.5. Statistical Significance

    RDF2      improved version of RDF program with all three scoring methods (now includes local, or window, shuffle routine)

    RSS       A version of RDF2 that uses the rigorous Smith-Waterman calculation used by SSEARCH. RSS should provide a more rigorous test of the statistical significance of a similarity score.

    RELATE    significance program described by Dayhoff (Atlas of Protein Sequence and Structure, Vol. 5, Supplement 3). Each chunk of 25 residues in one sequence is compared to every 25 residue fragment of the second sequence. Sequences which are genuinely related will have a large number of scores greater than 3 standard deviations above the mean score of all of the comparisons.

3.6. Other analysis programs

    AACOMP    calculate the amino acid composition and molecular weight of a sequence.

    BESTSCOR  calculate the best self-comparison score.
    GREASE    Kyte-Doolittle hydropathicity profile

    TGREASE   graphic plot of Kyte-Doolittle profile

    FROMGB    convert from GenBank LOCUS format (also used by the IBI-Pustell programs) to Pearson/FASTA format.

    GARNIER   A secondary structure prediction program using the method of Garnier, Osguthorpe, and Robson, J. Mol. Biol., (1978) 120:97-120.

3.7. Searching for keywords

    FINDP     (DOS, Macintosh only) Searches the protein sequence library title lines (or the aabank.nam file created by SINDEX) for a list of key words. For example:

                  FINDP aabank.nam trypsin

              will search the file of title lines and report all lines with the word "trypsin" in them. You can search for several words at once, by putting several words on the line. Normally, FINDP (and FINDN) ignore upper and lower case. If you would like to search for a specific case, e.g. Trypsin but not chymotrypsin, use the -l option:

                  FINDP aabank.nam -l Trypsin

    FINDN     Searches the GENBANK *.ano annotation files for words. FINDN can search a specific file, or a list of annotation files. For example, if the file GPRIA.NAM contains the lines:

                  gpri1.ano
                  gpri2.ano
                  gpri3.ano
                  ...

              then

                  FINDN @gpria.nam trypsin

              would search all of the files. FINDN also uses "-l" to preserve upper/lower case distinctions.

3.8. Options

These programs have a number of output options, which are invoked by the environment variables LINLEN, SHOWALL, and MARKX. Alternatively, these values can be controlled by command line options.

The number of sequence residues per output line is now adjustable by setting the environment variable LINLEN, or the command line option -w. LINLEN is normally 60; to change it, set LINLEN=80 before running the program or add -w 80 to the command line. LINLEN can be set up to 200.

SHOWALL (-a) determines whether all, or just a portion, of the aligned sequences are displayed.
Previously, FASTP would show the entire length of both sequences in an alignment while FASTN would only show the portions of the two sequences that overlapped. Now the default is to show only the overlap between the two sequences; to show complete sequences, set SHOWALL=1, or use the -a option on the command line.

The differences between the two aligned sequences can be highlighted in three different ways by changing the environment variable MARKX or the -m option. Normally (MARKX=0) the program uses ':' to denote identities and '.' to denote conservative replacements. If MARKX=1, the program will not mark identities; instead conservative replacements are denoted by a 'x' and non-conservative substitutions by a 'X'. If MARKX=2, the residues in the second sequence are only shown if they are different from the first. Thus the three options are:

    MARKX=0 (default)     MARKX=1           MARKX=2
    MWRTCGPPYT            MWRTCGPPYT        MWRTCGPPYT
    ::..:: :::              xx  X           ..KS..Y...
    MWKSCGYPYT            MWKSCGYPYT

3.9. Command line options

It is now possible to specify several options on the command line, instead of using environment variables. The command line options are preceded by a dash; the following options are available:

    -a        same as SHOWALL=1

    -b        number of sequence scores to be shown on output

    -c #      threshold score for optimization (OPTCUT). Set "-c 1" and "-o" to optimize every sequence in a database. (This slows the program down about 5-fold.)

    -d #      number of alignments to be reported by default. (Used in conjunction with -Q.)

    -f        identical match score from scoring matrix in the scan for initial regions. (default for protein) (PAMFACT=1)

    -g #      Threshold for joining init1 segments to build an initn score (GAPCUT).

    -k        use constant score in scan for initial regions (like old fastp, fastn, default for DNA) (PAMFACT=0)

    -l file   location of library menu file (FASTLIBS)

    -m #      MARKX = # (0, 1, 2)

    -n        Force the query sequence to be treated as a DNA sequence.
              This is particularly useful for query sequences that contain a large number of ambiguous residues, e.g. transcription factor binding sites.

    -o        optimize all scores greater than OPTCUT. If '-c' is not specified, OPTCUT will be calculated from the length of the sequence and the ktup setting, as the old CUTOFF value used to be.

    -Q        quiet - does not prompt for any input. Writes scores and alignments to the terminal or standard output file.

    -r file   save a results summary line for every sequence in the sequence library. The summary line includes the sequence identifier, superfamily number (if available), position in the library, and the similarity scores calculated. This option can be used to evaluate the sensitivity and selectivity of different search strategies (see W. R. Pearson (1991) Genomics 11:635-650).

    -s file   SMATRIX is read from file. Several SMATRIX files are provided with the standard distribution. For protein sequences:

                  codaa.mat   - based on minimum mutation matrix;
                  idnaa.mat   - identity matrix;
                  idpaa.mat   - identity matrix for mismatches, but identical matches weighted according to the PAM250 matrix;
                  pam250.mat  - the PAM250 matrix developed by Dayhoff et al (Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, 1978);
                  pam120.mat  - a PAM120 matrix.

              The SMATRIX also specifies the penalties for the first residue in a gap and additional residues in a gap; FASTA, the other alignment programs, and the SMATRIX files use -12 and -4. Currently, to change the -12, -4 gap penalties, the SMATRIX file must be edited.

    -v        (LINEVAL) values used for line styles in plfasta

    -w #      line length (width) = number (<200)

    -x        specifies offsets for the beginning of the query and library sequence.
          For example, if you are comparing upstream regions for two
          genes, and the first sequence contains 500 nt of upstream
          sequence while the second contains 300 nt of upstream
          sequence, you might try:

               fasta -x "-500 -300" seq1.nt seq2.nt

          If the -x option is not used, FASTA assumes numbering
          starts with 1. This option will not work properly with the
          translated library sequence with tfasta. (You should double
          check to be certain the negative numbering works properly.)

-1        sort output by init1 score (as FASTP used to do).

-3        (TFASTA only) translate only three forward frames

For example:

     fasta -w 80 -a seq1.aa seq2.aa

would compare the sequence in seq1.aa to that in seq2.aa and display
the results with 80 residues on an output line, showing all of the
residues in both sequences. Be sure to enter the options before
entering the file names, or just enter the options on the command
line, and the program will prompt for the file names.

Not all of these options are appropriate for all of the programs. The
options above are used by FASTA and TFASTA. RELATE uses the -s
option, ALIGN uses the -w, -m, and -s options, and the RDF2 programs
use -c, -f, -k, and -s.

4. Environment variable summary

Environment variables allow you to set search parameters that will be
used frequently when you run a program; for example, if you prefer to
use the PAM120 scoring matrix, you might "set SMATRIX=120".
Command line parameters, if used, always override environment
variable settings. The following environment variables are used by
this program:

AABANK    the file name of the default sequence library.

FASTLIBS  the location of the file which contains the list of
          library files to be searched.

GAPCUT    threshold used for joining init1 regions in the second
          step of FASTA. Normally set based on sequence length and
          ktup.

GBLIB     the directory where the EXTRACTN files and glocus.idx are
          found.

LIBTYPE   used to specify the format of the library sequence for
          FASTA and TFASTA.

LINLEN    output line length - can go up to 200

LINEVAL   used by plfasta to determine the relationship between line
          style and similarity score (-v). This should be a string
          of three numbers, e.g. "200 100 50"

MARKX     symbol for denoting matches, mismatches. Note that this
          symbol is only used across the optimized local region;
          sequences that are outside this region are not marked.

OPTCUT    sets the threshold to be used for optimization in a band
          around the best initial region. Normally the OPTCUT value
          is calculated from the length of the sequence and the ktup
          value (for a 200 residue sequence, it is about 28). If
          OPTCUT=1, every sequence in the database will be
          optimized. This is the most sensitive option.

PAMFACT   This version of fasta uses a more sensitive method for
          identifying initial regions. Instead of using a constant
          factor (fact) for each match in a ktup, it uses the
          scoring matrix (PAM) scores. While this works well for
          protein sequences, it has not been as carefully tested for
          DNA sequences, so by default, this modification is used
          for proteins but not for DNA. The -f 1 option forces this
          option on; -f 0 forces it off. Setting the PAMFACT
          environment variable to 1 forces the option on; PAMFACT=0
          turns it off.

SHOWALL   on output, show the complete sequence instead of just the
          overlap of the two aligned sequences.

SMATRIX   alternative scoring matrix file.

TEKPLOT   (IBM-PC only; Unix and VMS versions generate Tektronix
          graphics by default) Generate Tektronix output. Normally,
          PLFASTA and TGREASE plot graphs using the Turbo C graphics
          library. Unfortunately, these plots often cannot be
          printed out without special programs. (I have used
          GRAFPLUS, from Jewell Technologies, (206) 937-1081, $50,
          successfully.) However, if you set TEKPLOT=1, Tektronix
          graphics commands will be used. Tektronix commands can be
          used together with the PLOTDEV program, available from
          Microplot Systems, 1897 Red Fern Dr.
          Columbus, OH, 43229, (614) 882-4786, for $40, which also
          allows you to print out graphics on the screen.

As always, please inform me of bugs as soon as possible.

William R. Pearson
Department of Biochemistry
Box 440, Jordan Hall
U. of Virginia
Charlottesville, VA

wrp@virginia.EDU
wrp@virginia.BITNET