BLAST(1) USER COMMANDS BLAST(1)
Sun Release 4.1 Last change: 7 July 1993

NAME
blastp, blastn, blastx, tblastn - rapid sequence database query programs using the BLAST algorithm

SYNOPSIS
blastp aadb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [M=subfile] [Y=#] [Z=#] [K=#] [L=#] [H=#] [V=#] [B=#] [-sort_by...]

blastn ntdb ntquery [E=#] [S=#] [W=#] [X=#] [M=#] [N=#] [Y=#] [Z=#] [K=#] [L=#] [H=#] [V=#] [B=#] [[top][bottom]] [-sort_by...]

blastx aadb ntquery [E=#] [S=#] [W=#] [T=#] [X=#] [M=subfile] [Y=#] [Z=#] [C=#] [K=#] [L=#] [V=#] [B=#] [[top][bottom]] [-sort_by...]

tblastn ntdb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [M=subfile] [Y=#] [Z=#] [C=#] [K=#] [L=#] [H=#] [V=#] [B=#] [[top][bottom]] [-sort_by...]

DESCRIPTION
BLAST (Basic Local Alignment Search Tool) is the heuristic search algorithm employed by the programs blastp, blastn, blastx, and tblastn. The four programs are used for the following purposes:

blastp
     to compare an amino acid query sequence vs. a protein sequence database;

blastn
     to compare a nucleotide query sequence vs. a nucleotide sequence database;

blastx
     to compare a nucleotide query sequence translated in all reading frames vs. a protein sequence database;

tblastn
     to compare a protein query sequence vs. a nucleotide sequence database dynamically translated in all reading frames.

Whenever a nucleotide query sequence or nucleotide database is involved, both strands (or all 6 reading frames) are searched by default. The "top" and "bottom" options may be used to restrict a search to the specified strand. (If both options are specified, both strands will be searched).

The unit of BLAST algorithm output is the High-scoring Segment Pair (HSP), where the two segments in the pair are runs of contiguous residues of equal but arbitrary length, and the score of the two aligned segments meets or exceeds a positive-valued cutoff.
A set of zero or more HSPs is thus defined by two sequences, an alignment scoring scheme, and a cutoff score. In the programmatic implementations of the algorithm described here, the cutoff score has been parameterized and an HSP consists of one segment from the query sequence and one segment from a database sequence.

A Maximal-scoring Segment Pair (MSP) is defined by two sequences and a scoring scheme and is the highest-scoring of all possible segment pairs on all diagonals. Karlin and Altschul (1990) statistics are applicable to determining the statistical significance of MSP scores. For the programs described here, Karlin-Altschul statistics have been extrapolated to the assessment of HSP scores, for which the approximation is often good when the scores are predicted to be at least marginally significant.

Depending on how low the cutoff score is set and on the parameters regulating the sensitivity of a BLAST sequence comparison, there may be a non-zero probability, which may or may not be significant, that the heuristic method does not detect one or more HSPs of which the MSP is a member. At the user's discretion, search speed can be sacrificed in exchange for greater sensitivity, and vice versa.

PARAMETERS
Parameters are modified using a name=value syntax, e.g., E=0.05 or S=100.

E is interpreted as the expected number of MSPs that will satisfy the cutoff score under the random sequence model. The value of E approximates the expected number of HSPs that will be found purely by chance during the course of the entire database search. The default value for E is 10, and the permitted range for this real-valued parameter is 0 < E <= 1000.

S is the cutoff score for reporting HSPs. Higher scores correspond to increasing statistical significance, a lower probability, or a reduced expected frequency of occurrence by chance. Any positive-scoring alignments which the programs find but which score below S go unreported.
Unless S is explicitly set on the command line, its default value is calculated from the value of E.

The values for E and S are interconvertible; the conversion depends on the following factors: the length and residue composition of the query sequence; the length of the database; a fixed, hypothetical residue composition for the database; and the scoring scheme employed. The scoring scheme used by blastp, blastx, and tblastn is a substitution matrix; the scoring scheme used by blastn is a positive reward score for matching residues and a negative penalty score for mismatching residues.

When both of the parameters E and S are specified on the command line, the one resulting in the higher (more restrictive) cutoff score will be used. When neither of these parameters is specified on the command line, the default value for E is used to calculate the cutoff score.

For a given value of E (e.g., the default value of 10), a given query sequence, and a single scoring scheme, the calculated value of the cutoff score S will be different when searching databases of different lengths. To normalize the statistics reported when databases of different lengths are searched, the parameter Z (see below) may be set to a constant value for all database searches.

S takes on only integral values in the present implementations of the BLAST algorithm. When the cutoff score is set implicitly via E, S is rounded to the least integral value required to satisfy E. Since the rounding procedure can decrease the effective value of E, the calculated value for S is used to back-calculate the effective value for E. For example, if the user specifies E = 50 on the command line, a cutoff score that is rounded up by 0.9 units to the smallest satisfying integer might correspond to an expected number of HSPs of only 43.
In this case, the value displayed for E at the end of the program's report will be 43, not 50.

When at least one HSP is found involving any given database sequence, the programs blastp and tblastn search the database sequence a second time for HSPs that satisfy a lower cutoff score, S2. In essence, the second-pass search gives these programs the opportunity to report any low-significance HSPs they may have found that might be of interest within the context of finding one or more higher-scoring (perhaps statistically significant) HSPs. Poisson statistics may indicate that the lower-scoring (higher-probability) HSPs are statistically significant when their frequencies of occurrence are considered.

In a relationship similar to that between the parameters E and S, S2 can be set explicitly on the command line or it will be calculated from the setting of E2. Whereas S is related to E by the size of the database and the length of the query sequence, S2 is related to E2 by the lengths of a pair of hypothetical protein sequences of 300 residues each. In other words, E2 approximates the number of HSPs one would expect to find when comparing two protein sequences of length 300, one having the composition of the query sequence and the other having the hypothetical residue composition of the database. If a second-pass search is not desired, setting E2 to zero (0) turns this feature off. If S2 happens to be equal to or greater than the primary cutoff score, a second-pass search is likewise not performed.

The user should be forewarned that with typical parameter values, the probability that the BLAST algorithm will detect an alignment is expected to decrease as the score of the alignment decreases. Consequently, low-scoring HSPs looked for in the second-pass search have a lesser chance individually of being found than the original HSP.
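The conversion between an expectation parameter (E or E2) and its integral cutoff score (S or S2) can be sketched as follows. This is a minimal illustration, not the programs' own code: it assumes the standard Karlin-Altschul relation E = K * m * n * exp(-Lambda * S), and the values chosen for K, Lambda, the sequence lengths, and E2 are hypothetical.

```python
import math


def cutoff_from_expect(expect, k, lam, query_len, db_len):
    """Least integral score S whose expected count does not exceed `expect`.

    Assumes E = K * m * n * exp(-Lambda * S); solves for S and rounds up.
    """
    raw = math.log(k * query_len * db_len / expect) / lam
    return max(1, math.ceil(raw))


def expect_from_cutoff(score, k, lam, query_len, db_len):
    """Back-calculate the effective expectation for an integral cutoff score."""
    return k * query_len * db_len * math.exp(-lam * score)


# Hypothetical Karlin-Altschul parameters and lengths, for illustration only.
K, LAMBDA = 0.1, 0.3
QUERY_LEN, DB_LEN = 250, 10_000_000

# Primary cutoff: requested E = 50 is rounded up to an integral S,
# which lowers the effective E that would be reported.
s = cutoff_from_expect(50.0, K, LAMBDA, QUERY_LEN, DB_LEN)
effective_e = expect_from_cutoff(s, K, LAMBDA, QUERY_LEN, DB_LEN)

# Second-pass cutoff: E2 is tied to two hypothetical sequences of 300 residues.
hypothetical_e2 = 0.15
s2 = cutoff_from_expect(hypothetical_e2, K, LAMBDA, 300, 300)

print(f"S = {s}, effective E = {effective_e:.1f}, S2 = {s2}")
```

With these illustrative inputs the rounded cutoff yields an effective E below the requested 50, and S2 falls below S, so a second-pass search would be meaningful.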
With a fixed scoring scheme, the probability of missing an alignment can be decreased by: lowering the neighborhood word-score threshold, T, while keeping the word size, W, constant; lowering both W and T appropriately (see Altschul et al., 1990); and/or raising the word-hit-extension drop-off score X (described below).

W is the word size for finding initial word hits against the database sequences. Each word hit is extended in both directions along the corresponding diagonal of an imaginary 2-dimensional matrix until the cumulative segment score drops off by at least the quantity X. The default value for W is 3 amino acids for blastp, blastx, and tblastn, and 12 nucleotides for blastn. The value of W used by blastn should not be changed, as the logic of the program source code has not been validated for use with values other than the default (particularly smaller values). For the other programs, which perform sequence comparisons at the level of individual amino acids, W should generally be restricted to values less than 5 or else the value for T should be specified disproportionately larger to avoid consuming vast quantities of memory for the neighborhood word list (see below).

T is the word score threshold for generating neighborhood words of length W from the query sequence, prior to scanning the database (blastp, blastx, and tblastn only). Words which have an aggregate score (through summation of the individual residue substitution scores) of at least T when aligned with words from the query sequence are included in the neighborhood list. Raising the value of T increases the likelihood of completely missing HSPs, but can decrease the search time and memory requirements of the programs by decreasing the size of the neighborhood list. One of the key features of the BLAST algorithm is the user-selectable trade-off in sensitivity for speed.
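The construction of the neighborhood word list from W and T can be sketched as follows. This is only an illustration of the idea, not the programs' own code: the scoring function is a hypothetical toy scheme (+4 for identical residues, -1 otherwise) standing in for a real substitution matrix, and all function names are invented for the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical substitution score: +4 if identical, -1 otherwise."""
    return 4 if a == b else -1


def neighborhood(word, threshold, score=toy_score, alphabet=AMINO_ACIDS):
    """All words of len(word) whose aggregate score against `word` is >= threshold."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(q, c) for q, c in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


def neighborhood_list(query, w, threshold):
    """Map each neighborhood word to the query offsets of the words it neighbors."""
    table = {}
    for offset in range(len(query) - w + 1):
        for neighbor in neighborhood(query[offset:offset + w], threshold):
            table.setdefault(neighbor, []).append(offset)
    return table


# Lower T admits more neighbors (more sensitivity, more memory);
# higher T shrinks the list (faster, but more HSPs may be missed).
words_t7 = neighborhood("MKV", 7)    # exact word plus all single-residue changes
words_t12 = neighborhood("MKV", 12)  # only the exact word reaches 12
print(len(words_t7), len(words_t12))
```

The exhaustive enumeration visits every one of the 20**W candidate words, which also illustrates why large values of W are costly unless T is raised.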
A generally suitable value for T is calculated at run-time, using the residue composition and length of the query sequence and the substitution matrix employed. The neighborhood word-score threshold is set using an ad hoc equation that is a function of Lambda and H. Lambda is the number of nats of information gained per unit increase in score of an alignment (approximately 0.69315 times the number of bits per unit score). H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990), or the expected information available per position in an alignment to distinguish it from chance. Occasionally it may be necessary to manually set the neighborhood word-score threshold via the command line, for which 13 may be a good value to try, but this choice is highly dependent on the substitution matrix and word-length, W, used.

The supplied PAM120 amino acid substitution matrix, with a scale of natural log(2)/2, yields values for Lambda that are expected to be close to 0.5 bit per unit score for query sequences of typical residue compositions. Under these conditions, an increase in an alignment score by 2 units is expected to increase the informativeness of the alignment by 2 times 0.5 = 1 bit, corresponding to an increase in the statistical significance by a factor of 2. The supplied PAM250 matrix was produced to a scale of natural log(2)/3, suggesting that an increase in alignment score by 3 units will be required to increase statistical significance by a factor of 2. The significance of an alignment score is indeterminate without specific knowledge of the particular substitution matrix employed, including its scale, and preferably in the context of knowing the actual values determined for the Karlin-Altschul parameters Lambda and K.
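The arithmetic relating Lambda, bits, and significance can be checked with a short sketch. It assumes, for illustration only, that Lambda equals the nominal scale of each matrix exactly (ln(2)/2 for PAM120, ln(2)/3 for PAM250); the value actually computed by the programs depends on the query composition. The function names are invented for the example.

```python
import math

LN2 = math.log(2.0)  # about 0.69315 nats per bit


def bits_per_score_unit(lam):
    """Convert Lambda (nats per unit score) to bits per unit score."""
    return lam / LN2


def significance_factor(delta_score, lam):
    """Factor by which significance changes for a score increase of delta_score."""
    return math.exp(lam * delta_score)


# Assumed values: Lambda taken equal to each matrix's nominal scale.
PAM120_LAMBDA = LN2 / 2
PAM250_LAMBDA = LN2 / 3

print(bits_per_score_unit(PAM120_LAMBDA))     # about 0.5 bit per unit score
print(significance_factor(2, PAM120_LAMBDA))  # about 2: two units = one bit
print(significance_factor(3, PAM250_LAMBDA))  # about 2: three units = one bit
```

The same conversion shows why a raw score is uninterpretable without the scale: a score of 30 is about 15 bits under the first assumption but only about 10 bits under the second.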
X is a positive integer representing the maximum permissible drop-off of the cumulative segment score during word-hit extension. Raising X may decrease the chance that the BLAST algorithm overlooks an HSP, but it may significantly increase the search time, as well. If computation time is of little concern, X might be increased several points from its default value, but often only a very marginal increase in sensitivity should be expected.

For blastp, blastx, and tblastn, the default value of X is calculated to be the minimum integral score representing at least 10 bits of information, or a reduction in the statistical significance of the alignment by a factor of 2 to the power of 10 (or about 1,000). For blastn, the default value of X is the minimum integral score that represents at least 20 bits of information, or a reduction in the statistical significance of the alignment by a factor of 2 to the power of 20 (or about one million).

The command line parameters K and L can be used to set desired values for the Karlin-Altschul statistics' K and Lambda parameters, respectively. Users should generally avoid setting these parameters, though, unless the full ramifications of doing so are understood. For example, the value of the H statistic reported at the end of each program's output is a function of Lambda; and the default value for the neighborhood word-score threshold parameter T is in turn a function of H.

OPTIONS
Except where noted, the BLAST programs accept all of the following options:

-overlap
     Ordinarily, using a greedy algorithm, the programs discard and do not report HSPs they find that are overlapped or spanned by one another on either the query sequence or the database sequence (or both). If two HSPs are found to overlap, the one having the higher score is retained and the other is discarded. This option turns off the overlap removal feature entirely.
-overlap2
     For two HSPs to be deemed overlapping, both the query and the database segments from one HSP must span the corresponding segments in the other HSP. (The default is that either one or both segments must span). This option may be useful and result in additional HSPs being reported when the query or database sequences contain internal repeats. On the negative side of using this option, the Poisson statistics reported may be inaccurate, since the repeats may not be independent events but the software may count them as such.

-filter filteroption
     This activates query sequence filtering to mask regions based on a potentially wide variety of criteria. The usual intent of filtering is to mask regions of the query sequence that may be characteristic of regions in several unrelated protein families and, thus, are not alone diagnostic for family membership. For instance, acidic, basic, or proline-rich regions that would otherwise produce overwhelming amounts of uninteresting BLAST program output are often masked. The BLAST programs currently know how to properly invoke SEG and XNU filters, but these two filter programs must have been independently installed; they are not included in the BLAST software distribution. SEG (Wootton and Federhen, 1993) masks low-compositional complexity regions, while XNU (Claverie and States, 1993) generally masks regions containing short-periodicity internal repeats. The BLAST programs can also pipe the filtered output from one program into another. For instance XNU+SEG or SEG+XNU can be specified to have both programs filter the query sequence in succession.

-echofilter
     This causes the filtered query sequence to be displayed in the output. Any masked letters are indicated with X's.
-consistency
     This turns off both the determination of the number of HSPs likely to be consistent with each other in a gapped alignment and an adjustment made to the Poisson statistics to account for the consistency criterion.

-codoninfo codoninfofile
     This blastx option is used to specify a file containing codon usage or codon bias information to be used in concert with a traditional substitution matrix to score alignments. The codoninfofile must have a .cdi extension to its name, but the .cdi extension should be omitted from the command line option. Information in the file must be expressed in units that coincide with the scale of the substitution matrix. The substitution matrix employed must have a .cdi extension on its name, as well. A few such substitution matrices are provided in the BLAST software distribution.

-gapdecayrate rate
     This parameter is the common ratio of the terms in a geometric progression. A Poisson probability for N segments is weighted by the reciprocal of the Nth term in this progression; and the first term in the progression has a value of (1-rate). The default value for the rate is 0.5, such that the probability of one segment is discounted by a factor of 2, the Poisson probability of 2 segments is discounted by a factor of 4, and so on. The rate essentially defines a gap penalty imposed between each segment, and the default penalty is equivalent to 1 bit of information. The suggestion to use a geometric progression to normalize probabilities for the multiple numbers of segments considered in BLAST output was made by Phil Green (Washington University, St. Louis, MO).

SORT OPTIONS
The default sort order for reporting database sequences is by increasing Poisson probability (P-value).
The following sort options are available:

-sort_by_pvalue
     Sort from most statistically significant (lowest Poisson P-value) to least statistically significant (highest Poisson P-value), the default sort order.

-sort_by_count
     Sort from highest to lowest by the number of HSPs found for each database sequence.

-sort_by_highscore
     Sort from highest to lowest by the score of the highest scoring HSP for each database sequence.

-sort_by_totalscore
     Sort from the highest to the lowest by the sum total score of all HSPs for each database sequence.

SCORING SCHEMES
The default scoring matrix used by blastp, blastx, and tblastn is the BLOSUM62 matrix (Henikoff and Henikoff, 1992). With blastp, blastx, and tblastn, the M option can be used to select an alternate substitution matrix file (e.g., one of the PAM matrices described below).

Several PAM (point accepted mutations per 100 residues) scoring matrices are provided in the BLAST software distribution, including PAM40, PAM120, and PAM250. Of the PAM scoring matrices, the PAM120 substitution matrix is recommended for general protein similarity searches if only one is to be used (Altschul, 1991). The pam(1) program can be used to produce PAM matrices of any desired generation from 2 to 511. Each matrix is most sensitive at finding similarities at its particular PAM distance. For more thorough searches, particularly when the mutational distance between potential homologs is unknown and the significance of their similarity may be only marginal, Altschul (1991, 1992) recommends performing at least three searches, one each with the PAM40, PAM120 and PAM250 matrices.

When multiple PAM matrices are tried with the same query sequence, additional degrees of freedom for optimizing alignments are available, which reduce the alignments' statistical significance.
Since PAM matrices are not entirely independent of one another, statistical significance is reduced by a factor that is less than the actual number of matrices which are tried. This factor approaches a limit of 4.6 (just over 2 bits of information) when all PAM matrices are employed; however, the potential loss of signal above background from using a suboptimal matrix is typically much greater (Altschul, 1992). When the matrices tried are generated from a different mutational model, or if multiple mutational models are employed, greater independence may be obtained for the matrices, reducing statistical significance by a factor that may be as high as the actual number of models or matrices that are tried.

In blastn, M is the score for a single-letter match; N is the score for a single-letter mismatch. M and N must be positive and negative integers, respectively. It is not the absolute magnitudes of M and N, but rather their relative magnitudes (or absolute value of their ratio), that determines the number of nucleic acid PAMs (point accepted mutations per 100 residues) for which they are the most sensitive at finding homologs. Higher ratios of M:N correspond to increasing nucleic acid PAMs. The default values for M and N, with a ratio of 1.25, correspond to about 47 nucleic acid PAMs, or about 58 amino acid PAMs (States et al., 1991). If the ratio M:N is 1, this corresponds to 30 nucleic acid PAMs or 38 amino acid PAMs.

At higher than about 40 nucleic acid PAMs, or 50 amino acid PAMs, better sensitivity at detecting similarities between coding regions is expected from performing comparisons at the amino acid level, using conceptually translated nucleotide sequences (re: blastx and tblastn).
Independent of the values chosen for M and N, the fixed wordlength W=12 used by blastn restricts it to finding homologs that share at a minimum a 12-mer stretch of 100% identity. Under the random sequence model, stretches of 12 consecutive matching residues are unlikely to occur merely by chance for distant homologs. Thus, blastn in its present implementation is poorly suited to finding distant homologs.

For blastn, it should be easy to see how multiplying both M and N by any factor, no matter how large, yields proportionally larger alignment scores with unchanged statistical significance. This scale independence of the statistical significance has its analog in the consideration of the substitution matrices used by the other BLAST programs. Multiplying all elements in a substitution matrix by an arbitrary factor will alter alignment scores proportionally, but will not alter their statistical significance (if numerical precision is maintained). For this reason, it is insufficient for the interpretation of raw alignment scores by Karlin-Altschul statistics to report merely that a matrix for a particular PAM distance was used, without also mentioning the scale of the matrix (see above).

Regardless of the scoring scheme employed, two stringent criteria must be met in order to be able to calculate the Karlin-Altschul parameters Lambda and K. First, given the residue composition for the query sequence and the residue composition assumed for the database, the expected substitution score for any two randomly selected residues (one from the query sequence and one from the database) must be negative. Second, given the residue compositions and scoring scheme, a positive score must be possible.
For instance, the match reward in blastn must have a positive value; and the substitution matrix used by blastp, blastx and tblastn must contain at least one positive-valued substitution score for residues having non-zero frequencies in both the query sequence and in the assumed composition for the database.

Given the assumption made by blastn that the 4 nucleotides A, C, G and T are represented at equal frequencies in the database, the expected score for any two randomly selected residues must be negative. This precludes searching with some value combinations for M and N, in particular those combinations where the magnitude of the ratio M:N is greater than or equal to 3.

SEQUENCE LENGTH AND STATISTICAL SIGNIFICANCE
For the purpose of calculating significance levels, Y is the effective length of the query sequence and Z is the effective length of the database, both measured in residues. The default values for these parameters are the actual lengths of the query sequence and database, respectively. Larger values signify more degrees of freedom for aligning the sequences and reduced statistical significance for an alignment of any given score.

GENETIC CODES
C is a non-negative integer that determines the genetic code that will be used by blastx to translate the query sequence, or by tblastn to translate the database sequences. The default genetic code (C=0) corresponds to the so-called Standard or Universal genetic code. To obtain a listing of the nine available genetic codes and their associated numerical identifiers, invoke either blastx or tblastn with the command line parameter C=list.
The current list of genetic codes and their associated values for parameter C are:

     0    Standard or Universal
     1    Vertebrate Mitochondrial
     2    Yeast Mitochondrial
     3    Mold Mitochondrial and Mycoplasma
     4    Invertebrate Mitochondrial
     5    Ciliate Macronuclear
     6    Protozoan Mitochondrial
     7    Plant Mitochondrial
     8    Echinodermate Mitochondrial

EXPECT VALUES
The Expect value reported for each HSP is the number of times an HSP of equal or greater score is expected to occur by chance alone during the course of the entire database search. The total length of the database figures into these estimates, with larger databases yielding proportionately higher Expect values.

POISSON STATISTICS
The occurrence of two or more HSPs involving the query sequence and the same database sequence is modeled as a Poisson process. An important result of applying Poisson statistics is that an HSP having a low score and high Expect value (low statistical significance) may be discovered to be statistically significant when it appears in the context of one or more additional matches of equal or higher score against the same database sequence.

The Poisson P-value for any given HSP is a function of its expected frequency of occurrence and the number of HSPs observed against the same database sequence with scores at least as high. The Poisson P-value for a group of HSP events is the probability that at least as many HSPs would occur by chance alone, each with a score at least as high as the lowest-scoring member of the group.

HSPs which appear on opposite strands of a nucleotide query or database sequence are considered to be independent, distinguishable events, and are counted separately.
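The Poisson P-value for a group of HSPs, as defined above, can be sketched as an upper-tail Poisson probability. This is a minimal illustration under stated assumptions, not the programs' own computation: it takes the Expect value of the lowest-scoring member as the Poisson mean and ignores the consistency adjustment. The second function shows only the weighting factor described for the -gapdecayrate option. Function names are invented for the example.

```python
import math


def poisson_p_value(expect, count):
    """Probability of at least `count` chance events when `expect` are expected."""
    if expect <= 0.0:
        raise ValueError("expect must be positive")
    if count <= 0:
        return 1.0
    if count <= expect:
        # The lower tail is not close to 1 here, so its complement is accurate.
        term = math.exp(-expect)
        below = 0.0
        for k in range(count):
            below += term
            term *= expect / (k + 1)
        return 1.0 - below
    # Otherwise sum the (small) upper tail directly to avoid cancellation.
    term = math.exp(count * math.log(expect) - expect - math.lgamma(count + 1))
    total = 0.0
    k = count
    while term > 0.0 and term > total * 1e-17:
        total += term
        k += 1
        term *= expect / k
    return min(total, 1.0)


def gap_decay_factor(n_segments, rate=0.5):
    """Reciprocal of the Nth term of the progression (1-rate) * rate**(N-1)."""
    return 1.0 / ((1.0 - rate) * rate ** (n_segments - 1))


# One HSP with Expect 0.5 is unremarkable; two such HSPs against the
# same database sequence are considerably less likely by chance.
print(poisson_p_value(0.5, 1))  # about 0.39
print(poisson_p_value(0.5, 2))  # about 0.09
print(gap_decay_factor(1), gap_decay_factor(2))  # 2.0 4.0
```

The example illustrates the point made above: an HSP that is not significant on its own can become significant when a second HSP of equal or higher score is found against the same database sequence.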
P-VALUES, ALIGNMENT SCORES, AND INFORMATION
The Expect and P-values reported for HSPs are dependent on numerous factors including: the scoring scheme employed, the residue composition of the query sequence, an assumed residue composition for a typical database sequence, the length of the query sequence, and the total length of the database. HSP scores from different program runs are appropriate for comparison even if the databases searched are of different lengths, as long as the other relevant factors described here do not vary. For example, alignment scores from searches with the default BLOSUM62 matrix must not be compared with scores obtained with the PAM120 matrix; and scores of alignments produced using two versions of the same matrix, each having a different scale (see above), cannot be meaningfully compared without conversion to the same scale.

Some isolation from the many variables involved in assessing the statistical significance of HSPs can be obtained by observing the information content reported (in bits) for the alignments. While the information content of an HSP may change when different scoring schemes are used (e.g., with different PAM matrices), the number of bits reported for an HSP will at least be independent of the scales to which the matrices were generated. (In practice, this statement is not quite true, because the substitution scores used by these programs are integers that were obtained from high-precision floating point or Real number values by rounding to nearest integral values, with a consequent loss of precision). In efforts to communicate the statistical significance of an alignment, the alignment score itself is meaningless unless the specific substitution matrix that was employed (including its scale) is also specified, either explicitly or implicitly.
In isolation, the information content reported for an alignment is a much more meaningful statistic than the score, and is independent of the query and database length parameters used in estimating the statistical significance (P-values) of alignments.

REGULATING OUTPUT
The output is organized into three independently regulated sections: a histogram of word-hit extension scores; one-line descriptions of the database sequences that yielded one or more HSPs; and the high-scoring segment pairs themselves. Each section of the output can be selectively suppressed by setting the parameters H, V, and B to 0 (zero).

Parameter H regulates the display of a histogram of the scores of the highest-scoring hit extensions for each database sequence. If H is assigned a non-zero value on the command line, the histogram will be displayed (except for the blastx program, which never displays a histogram but retains the H parameter for command-line compatibility with the other programs). The default value for H is 0 (no histogram).

Parameter V is the maximum number of database sequences for which one-line descriptions will be reported. The default value for V is 500. A warning message is prominently displayed at the end of the one-line descriptions section when HSPs are found in more than V sequences. When V is zero, no one-line descriptions are reported and no warning is given. Negative values for V are undefined and disallowed.

As an example of how V can be used advantageously, if a high value for E is desired to virtually assure in all cases that at least one HSP will be found, selecting a small value for V will ensure that the output will not be too voluminous; only the most statistically significant matches will be reported.

Parameter B regulates the display of the high-scoring segment pairs.
For positive values, B is the maximum number of database sequences for which high-scoring segment pairs will be reported. This may be much smaller than the actual number of high-scoring segment pairs reported, since any given database sequence may yield several HSPs. The default value for B is 250. Negative values for B are undefined and disallowed.

ENVIRONMENT VARIABLES
The environment variables BLASTDB and BLASTMAT may be set by the user to override the default directories in which these programs try to find database files and substitution matrix files, respectively. The default pathnames are /usr/ncbi/blast/db/ and /usr/ncbi/blast/matrix/. It is essential that these pathnames end with a slash (/) character.

SUPPORT UTILITIES
Databases to be searched by these programs must first be processed either by the program setdb for protein sequence databases (re: blastp and blastx) or the program pressdb for nucleotide sequence databases (re: blastn and tblastn). The input database format is FASTA/Pearson.

Point accepted mutation (PAM) matrices of various generations can be produced automatically with the pam program. The output can be saved in a file whose name is then specified in the M=filename option of a blastp, blastx, or tblastn query.

BUGS
blastn uses a large value for the wordlength, W, and does no neighboring on these words. Consequently, the program is suitable for finding nearly identical sequences rapidly, but not distantly related ones. To identify weak amino acid similarities encoded by nucleic acid, use blastx or tblastn.

In blastp, blastx, and tblastn, ad hoc equations have not been implemented yet for calculating appropriate default values for T when W has a value other than 3 or 4.

When nucleotide sequence databases are processed into searchable form by the pressdb program, IUPAC ambiguity letters are replaced by an appropriate random selection from the list A, C, G and T.
(For example, an R would be replaced on the average half of the time by an A and half of the time by a G). If the original database in FASTA format is not available to blastn and tblastn, then the original locations of the ambiguity codes cannot be determined by the programs and the alignments and alignment scores may be in error with respect to the original sequences.

tblastn uses only one genetic code to translate the entire nucleotide sequence database, although the particular genetic code employed is selectable via the parameter C.

blastn, blastx, and tblastn treat U and T residues in nucleotide sequences as the same residue (i.e., they match).

With two exceptions, any letter in the query sequence which is not a member of the relevant IUPAC amino acid or nucleotide code is stripped and does not contribute to the sequence coordinate numbers reported by the programs. The exceptions are asterisks (*) and hyphens (-) in amino acid sequences, which are interpreted as translation stops and gap characters, respectively. In protein sequence databases that are processed into searchable form by the setdb program, non-IUPAC letters, including any punctuation but excluding asterisks and hyphens, are also stripped. The pressdb program does not strip non-IUPAC codes, but treats them similarly to Ns.

blastn does not incorporate the concept of a partial- or half-match, such as when a purine in one sequence is juxtaposed with a purine from the other. For two residues to match at all, they both must be members of the set A, C, G and T (or U).

When calculating the Poisson statistics, some HSPs may be incompatible with each other in the same segmented alignment and yet they are (incorrectly) counted as independent events. This even happens (albeit at reduced frequency) when a simple mid-point consistency rule is invoked.
HSPs on opposite strands or in reading frames on opposite strands are counted separately, however.

The user may note that the nucleotide composition of a blastn query sequence is irrelevant to the resulting Karlin-Altschul parameters, Lambda and K. This is due to the equi-probable 0.25/0.25/0.25/0.25 A/C/G/T residue distribution assumed for a typical database sequence. The values of the Karlin-Altschul parameters are still affected by the scoring scheme employed (parameters M and N). Furthermore, blastn may be compiled with a non-uniform residue composition for the database, in which case the query composition does become relevant and will impact the values of the Karlin-Altschul parameters that are calculated.

The observed high scores reported by blastx for each reading frame at the end of the output may not coincide with the highest scores observed in the HSP alignments that are actually displayed. This is because an HSP found in one frame may be eliminated from the output by an overlapping, higher-scoring HSP found in another reading frame on the same strand.

SEE ALSO

blast3(1).

COPYRIGHT

This work is in the public domain.

REFERENCES

Altschul, Stephen F. (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-65.

Altschul, S. F. (1993). A protein alignment scoring system sensitive at all evolutionary distances. J. Mol. Evol. 36:290-300.

Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.

Claverie, J.-M. and D. J. States (1993). Computers in Chemistry, in press.

Wootton, J. C. and S. Federhen (1993). Computers in Chemistry, in press.

Gish, W. and D.
J. States (1993). Identification of protein coding regions by database similarity search. Nature Genetics 3:266-72.

Henikoff, Steven and Jorja G. Henikoff (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:10915-19.

Karlin, Samuel and Stephen F. Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-68.

Karlin, Samuel and Stephen F. Altschul (1993). Proc. Natl. Acad. Sci. USA, submitted.

States, D. J., W. Gish and S. F. Altschul (1991). Improved sensitivity of nucleic acid database similarity searches using application specific scoring matrices. Methods: A companion to Methods in Enzymology 3:66-70.