BLAST(1)                 USER COMMANDS                   BLAST(1)


NAME
     blastp, blastn, blastx, tblastn -  rapid  sequence  database
     query programs using the BLAST algorithm

SYNOPSIS
     blastp aadb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#]
          [M=subfile] [Y=#] [Z=#] [K=#] [L=#] [H=#] [V=#] [B=#] [-sort_by...]

     blastn ntdb ntquery [E=#] [S=#] [W=#] [X=#] [M=#] [N=#] [Y=#] [Z=#]
               [K=#] [L=#] [H=#] [V=#] [B=#] [[top][bottom]] [-sort_by...]

     blastx aadb ntquery [E=#] [S=#] [W=#] [T=#] [X=#] [M=subfile]
                         [Y=#] [Z=#] [C=#] [K=#] [L=#] [V=#] [B=#]
                         [[top][bottom]] [-sort_by...]

     tblastn ntdb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#]
                         [M=subfile] [Y=#] [Z=#] [C=#] [K=#] [L=#]
                         [H=#] [V=#] [B=#] [[top][bottom]] [-sort_by...]

DESCRIPTION
     BLAST (Basic Local Alignment Search Tool) is  the  heuristic
     search  algorithm  employed  by the programs blastp, blastn,
     blastx, and tblastn.  The four programs  are  used  for  the
     following purposes:

     blastp
          to compare an amino acid query sequence vs.  a  protein
          sequence database;

     blastn
          to compare a nucleotide query sequence vs. a nucleotide
          sequence database;

     blastx
          to compare a nucleotide query  sequence  translated  in
          all reading frames vs. a protein sequence database;

     tblastn
          to compare a protein query sequence  vs.  a  nucleotide
          sequence database dynamically translated in all reading
          frames.

     Whenever a nucleotide query sequence or nucleotide  database
     is  involved,  both  strands  (or  all 6 reading frames) are
     searched by default.  The "top" and "bottom" options may be
     used to restrict a search to the specified strand.  (If both
     options are specified, both strands are searched.)
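     As a rough illustration of what "both strands (or all 6
     reading frames)" means, the sketch below enumerates three
     frame offsets on each strand of a nucleotide sequence.  The
     function names, the boolean arguments standing in for the
     "top" and "bottom" options, and the +1..+3 / -1..-3 frame
     labels are assumptions of this example, not part of the
     BLAST programs.

```python
# Hypothetical helper names; only plain A/C/G/T input is handled.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the bottom strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def reading_frames(seq, top=False, bottom=False):
    """Map a frame label to the list of codons read in that frame.

    Mirrors the text: with neither option (or both) given, both
    strands are used; otherwise only the requested strand is used.
    """
    if not top and not bottom:
        top = bottom = True
    strands = []
    if top:
        strands.append((+1, seq))
    if bottom:
        strands.append((-1, reverse_complement(seq)))
    frames = {}
    for sign, strand in strands:
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[sign * (offset + 1)] = codons
    return frames


if __name__ == "__main__":
    for label, codons in sorted(reading_frames("ATGGCCTAA").items()):
        print(label, codons)
```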

     The unit of BLAST algorithm output is the High-scoring Segment
     Pair (HSP): two segments of equal but arbitrary length, each
     a run of contiguous residues, such that the score of the two
     aligned segments meets or exceeds a


Sun Release 4.1     Last change: 7 July 1993                    1


     positive-valued cutoff.  A set of zero or more HSPs is  thus
     defined by two sequences, an alignment scoring scheme, and a
     cutoff score.  In the programmatic  implementations  of  the
     algorithm   described   here,  the  cutoff  score  has  been
     parameterized and an HSP consists of one  segment  from  the
     query sequence and one segment from a database sequence.

     A Maximal-scoring Segment  Pair  (MSP)  is  defined  by  two
     sequences and a scoring scheme and is the highest-scoring of
     all possible segment pairs on  all  diagonals.   Karlin  and
     Altschul (1990) statistics are applicable to determining the
     statistical significance of MSP scores.   For  the  programs
     described  here, Karlin-Altschul statistics have been extra-
     polated to the assessment  of  HSP  scores,  for  which  the
     approximation is often good when the scores are predicted to
     be at least marginally significant.
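     The MSP definition can be made concrete with a brute-force
     search over every diagonal.  The sketch below is not the
     BLAST heuristic; it is an exhaustive scan for two short
     nucleotide strings, using hypothetical match and mismatch
     scores of +5 and -4.

```python
def best_segment_on_diagonal(query, subject, q0, s0, match=5, mismatch=-4):
    """Scan one diagonal and return (score, q_start, s_start, length)."""
    best = (0, q0, s0, 0)
    running, start, k = 0, 0, 0
    while q0 + k < len(query) and s0 + k < len(subject):
        if running <= 0:
            # A non-positive prefix can never help; restart here.
            running, start = 0, k
        running += match if query[q0 + k] == subject[s0 + k] else mismatch
        if running > best[0]:
            best = (running, q0 + start, s0 + start, k - start + 1)
        k += 1
    return best


def maximal_segment_pair(query, subject, match=5, mismatch=-4):
    """Highest-scoring ungapped segment pair over all diagonals."""
    diagonals = [(0, s) for s in range(len(subject))]
    diagonals += [(q, 0) for q in range(1, len(query))]
    return max(best_segment_on_diagonal(query, subject, q, s, match, mismatch)
               for q, s in diagonals)


if __name__ == "__main__":
    print(maximal_segment_pair("CCACGTGG", "TTACGTAA"))
```

     Any segment pair returned this way is also an HSP for every
     cutoff score that does not exceed its score.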

     Depending on how low the cutoff score  is  set  and  on  the
     parameters  regulating  the  sensitivity of a BLAST sequence
     comparison, there may be a non-zero probability,  which  may
     or  may  not  be significant, that the heuristic method does
     not detect one or more HSPs of which the MSP  is  a  member.
     At  the user's discretion, search speed can be sacrificed in
     exchange for greater sensitivity, and vice versa.

PARAMETERS
     Parameters are modified using a name=value syntax, e.g.,
     E=0.05 or S=100.

     E is interpreted as the expected number of MSPs that will
     satisfy the cutoff score under the random sequence model.
     The value of E approximates the expected number of HSPs that
     will be found purely by chance during the course of the
     entire database search.  The default value for E is 10, and
     the permitted range for this real-valued parameter is
     0 < E <= 1000.

     S is the cutoff score for  reporting  HSPs.   Higher  scores
     correspond  to  increasing statistical significance, a lower
     probability, or a reduced expected frequency  of  occurrence
     by  chance.   Any positive-scoring alignments which the pro-
     grams find but which score below S go unreported.  Unless  S
     is  explicitly set on the command line, its default value is
     calculated from the value of E.

     The values for E and S are interconvertible, a process which
     is  dependent on the following factors: the length and resi-
     due composition of the query sequence;  the  length  of  the
     database;  a fixed, hypothetical residue composition for the
     database; and the  scoring  scheme  employed.   The  scoring
     scheme used by blastp, blastx, and tblastn is a substitution
     matrix; the scoring scheme used  by  blastn  is  a  positive
     reward  score  for  matching residues and a negative penalty
     score for mismatching residues.

     When both of the parameters E and S are specified on the
     command line, the one resulting in the higher (more
     restrictive) cutoff score will be used.  When neither of these
     parameters is specified on the command line, the default
     value for E is used to calculate the cutoff score.

     For a given value of E (e.g., the default value of 10), a
     given  query sequence, and a single scoring scheme, the cal-
     culated value of the cutoff score S will be  different  when
     searching  databases of different lengths.  To normalize the
     statistics reported when databases of different lengths  are
     searched,  the  parameter Z (see below) may be set to a con-
     stant value for all database searches.

     S takes on only integral values in the  present  implementa-
     tions  of the BLAST algorithm.  When the cutoff score is set
     implicitly via E, S is rounded to the least  integral  value
     required  to  satisfy  E.   Since the rounding procedure can
     decrease the effective value of E, the calculated value  for
     S  is used to back-calculate the effective value for E.  For
     example, if the user specifies E = 50 on the command line, a
     cutoff score that is rounded up by 0.9 units to the smallest
     satisfying integer might correspond to an expected number of
     HSPs of only 43.  In this case, the value displayed for E at
     the end of the program's report will be 43, not 50.
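     The rounding and back-calculation described above can be
     sketched with the Karlin-Altschul relation
     E = K * m * n * exp(-Lambda * S), where m and n are the query
     and database lengths.  The values of K, Lambda, and the two
     lengths below are hypothetical, so the numbers produced will
     not match the 50 -> 43 example in the text.

```python
import math


def cutoff_from_expect(e, k, lam, m, n):
    """Smallest integer S whose expected HSP count does not exceed e."""
    return math.ceil(math.log(k * m * n / e) / lam)


def expect_from_cutoff(s, k, lam, m, n):
    """Expected number of chance HSPs scoring at least s."""
    return k * m * n * math.exp(-lam * s)


# Hypothetical parameter values, for illustration only.
K, LAMBDA = 0.1, 0.3
QUERY_LEN, DB_LEN = 250, 10_000_000

if __name__ == "__main__":
    s = cutoff_from_expect(50, K, LAMBDA, QUERY_LEN, DB_LEN)
    effective_e = expect_from_cutoff(s, K, LAMBDA, QUERY_LEN, DB_LEN)
    print("S =", s, "effective E =", round(effective_e, 1))
```

     Because S is rounded up, the effective E is never larger than
     the requested E, and a cutoff one unit lower would exceed it.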

     When at least one HSP is found involving any given  database
     sequence,  the  programs blastp and tblastn search the data-
     base sequence a second time for HSPs that  satisfy  a  lower
     cutoff  score, S2.  In essence, the second-pass search gives
     these  programs  the  opportunity   to   report   any   low-
     significance  HSPs  they  may  have  found  that might be of
     interest within the context of finding one or  more  higher-
     scoring  (perhaps  statistically significant) HSPs.  Poisson
     statistics may  indicate  that  the  lower-scoring  (higher-
     probability)  HSPs  are statistically significant when their
     frequencies of occurrence are considered.

     In a relationship similar to that between the  parameters  E
     and  S,  S2  can be set explicitly on the command line or it
     will be calculated from the setting of  E2.   Whereas  S  is
     related  to  E by the size of the database and the length of
     the query sequence, S2 is related to E2 by the lengths of  a
     pair of hypothetical protein sequences of 300 residues each.
     In other words, E2 approximates the number of HSPs one would
     expect  to  find  when  comparing  two  protein sequences of
     length 300, one having the composition of the query sequence
     and the other having the hypothetical residue composition of
     the database.  If  a  second-pass  search  is  not  desired,
     setting E2 to zero (0) turns this feature off.  Likewise, if
     S2 happens to be equal to or greater than the primary cutoff
     score, a second-pass search is not performed.
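     A minimal sketch of how S2 might be derived from E2, assuming
     the same Karlin-Altschul relation used for E and S but with
     two sequences of length 300 in place of the query and
     database lengths.  The function name and the example values
     of K, Lambda, and E2 are hypothetical.

```python
import math


def second_pass_cutoff(e2, primary_s, k, lam, length=300):
    """Return S2, or None when no second-pass search would be made."""
    if e2 <= 0:
        return None  # E2=0 turns the second pass off
    s2 = math.ceil(math.log(k * length * length / e2) / lam)
    if s2 >= primary_s:
        return None  # S2 is not lower than the primary cutoff
    return s2


if __name__ == "__main__":
    print(second_pass_cutoff(0.15, primary_s=52, k=0.1, lam=0.3))
```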

     The user should be forewarned that  with  typical  parameter
     values,  the  probability the BLAST algorithm will detect an
     alignment is expected to decrease as the score of the align-
     ment  decreases.   Consequently, low-scoring HSPs looked for
     in the second-pass search have a lesser chance  individually
     of being found than the original HSP.

     With a fixed scoring scheme, the probability of  missing  an
     alignment  can  be  decreased  by: lowering the neighborhood
     word-score threshold, T, while keeping  the  word  size,  W,
     constant; lowering both W and T appropriately (see Altschul
     et al., 1990); and/or raising the word-hit-extension drop-off
     score X (described below).

     W is the word size for finding initial word hits against the
     database  sequences.   Each  word  hit  is  extended in both
     directions along the corresponding diagonal of an  imaginary
     2-dimensional  matrix  until  the  cumulative  segment score
     drops off by at least the quantity X.  The default value for
     W  is  3 amino acids for blastp, blastx, and tblastn, and 12
     nucleotides for blastn.  The  value  of  W  used  by  blastn
     should  not  be  changed, as the logic of the program source
     code has not been validated for use with values  other  than
     the  default  (particularly  smaller values).  For the other
     programs, which perform sequence comparisons at the level of
     individual  amino acids, W should generally be restricted to
     values less than 5 or else the value for T should be  speci-
     fied disproportionately larger to avoid consuming vast quan-
     tities of memory for the neighborhood word list (see below).
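     The extension step can be sketched in one direction as
     follows; extension in the other direction is symmetric.
     This is an illustration only, with hypothetical match and
     mismatch scores of +5 and -4, and it makes no claim about how
     the programs implement the loop.

```python
def extend_right(query, subject, q, s, x, match=5, mismatch=-4):
    """Extend along a diagonal from (q, s) until the score drops by x.

    Returns (best_score, length_of_best_extension).
    """
    best = running = 0
    best_len = k = 0
    while q + k < len(query) and s + k < len(subject):
        running += match if query[q + k] == subject[s + k] else mismatch
        k += 1
        if running > best:
            best, best_len = running, k
        elif best - running >= x:
            break  # cumulative score has dropped off by at least X
    return best, best_len


if __name__ == "__main__":
    query, subject = "ACGTTTACGT", "ACGTAAACGT"
    print(extend_right(query, subject, 0, 0, x=8))
    print(extend_right(query, subject, 0, 0, x=10))
```

     In this toy case the two mismatches cost 8 units, so the
     smaller X stops before the second run of matches and the
     larger X continues through it.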

     T is the word score threshold  for  generating  neighborhood
     words of length W from the query sequence, prior to scanning
     the database (blastp,  blastx,  and  tblastn  only).   Words
     which  have  an  aggregate  score  (through summation of the
     individual residue substitution scores) of at least  T  when
     aligned  with  words from the query sequence are included in
     the neighborhood list.  Raising the value of T increases the
     likelihood  of completely missing HSPs, but can decrease the
     search time and  memory  requirements  of  the  programs  by
     decreasing  the  size  of the neighborhood list.  One of the
     key features of the BLAST algorithm is  the  user-selectable
     trade-off in sensitivity for speed.
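     The effect of T on the size of the neighborhood list can be
     seen in a toy setting.  The four-letter alphabet and the
     scoring function below are invented for this sketch and do
     not correspond to any supplied substitution matrix.

```python
from itertools import product

ALPHABET = "ACDE"  # toy subset of the amino acid alphabet


def toy_score(a, b):
    """Hypothetical substitution scores, for illustration only."""
    if a == b:
        return 5
    if {a, b} == {"D", "E"}:
        return 2
    return -2


def neighborhood(query, w, t):
    """Map each word scoring at least t to its query positions."""
    words = {}
    for pos in range(len(query) - w + 1):
        qword = query[pos:pos + w]
        for cand in product(ALPHABET, repeat=w):
            total = sum(toy_score(a, b) for a, b in zip(qword, cand))
            if total >= t:
                words.setdefault("".join(cand), []).append(pos)
    return words


if __name__ == "__main__":
    for t in (9, 12, 15):
        print(t, sorted(neighborhood("ADE", 3, t)))
```

     The candidate list grows as the alphabet size raised to the
     power W, which is why large W with a low T is costly in
     memory.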

     A generally suitable value for T is calculated at run-time,
     using the residue composition and length of the query
     sequence and the substitution matrix employed.  The
     neighborhood word-score threshold is set using an ad hoc
     equation that is a function of Lambda and H.  Lambda is the
     number of
     nats  of information gained per unit increase in score of an
     alignment (approximately 0.69315 times the  number  of  bits
     per unit score).  H is the relative entropy of the target
     and background residue frequencies (Karlin and Altschul,
     1990), or the expected information available per position in
     an alignment to distinguish it from chance.  Occasionally it
     may be necessary to manually set the neighborhood word-score
     threshold via the command line, for which 13 may be a good
     value to try, but this choice is highly dependent on the
     substitution matrix and word length, W, used.

     The supplied PAM120 amino acid substitution matrix, with a
     scale of natural log(2)/2, yields values for Lambda that are
     expected to be close to 0.5 bit per unit score for query
     sequences of typical residue compositions.  Under these
     conditions, an increase in an alignment score by 2 units is
     expected to increase the informativeness of the alignment by
     2 times 0.5 = 1 bit, corresponding to an increase in the
     statistical significance by a factor of 2.  The supplied
     PAM250 matrix was produced to a scale of natural log(2)/3,
     suggesting that an increase in alignment score by 3 units
     will be required to increase statistical significance by a
     factor of 2.  The significance of an alignment score is
     indeterminate without specific knowledge of the particular
     substitution matrix employed, including its scale, and
     preferably in the context of knowing the actual values
     determined for the Karlin-Altschul parameters Lambda and K.
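     The arithmetic in this paragraph can be checked with a short
     sketch.  It assumes Lambda is exactly equal to the nominal
     matrix scale, which the text says is only approximately true
     for typical query compositions; the function names are
     hypothetical.

```python
import math

LN2 = math.log(2)


def score_increase_to_bits(score_units, lam):
    """Information, in bits, gained from a score increase (lam in nats)."""
    return lam * score_units / LN2


def significance_factor(score_units, lam):
    """Factor by which statistical significance changes."""
    return 2 ** score_increase_to_bits(score_units, lam)


if __name__ == "__main__":
    print(score_increase_to_bits(2, LN2 / 2))  # PAM120-like scale
    print(score_increase_to_bits(3, LN2 / 3))  # PAM250-like scale
```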

     X is a positive integer representing the maximum permissible
     drop-off  of  the  cumulative  segment score during word-hit
     extension.  Raising X may decrease the chance that the BLAST
     algorithm   overlooks  an  HSP,  but  it  may  significantly
     increase the search time, as well.  If computation  time  is
     of  little concern, X might be increased several points from
     its default value, but often only a very  marginal  increase
     in sensitivity should be expected.

     For blastp, blastx, and tblastn, the default value of  X  is
     calculated  to be the minimum integral score representing at
     least 10 bits of information, or a reduction in the statist-
     ical  significance  of the alignment by a factor of 2 to the
     power of 10 (or about 1,000).  For blastn, the default value
     of  X is the minimum integral score that represents at least
     20 bits of information, or a reduction  in  the  statistical
     significance  of the alignment by a factor of 2 to the power
     of 20 (or about one million).
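     A sketch of how such a default could be derived from Lambda:
     the smallest integer score worth at least the stated number
     of bits.  The Lambda values used here are hypothetical, and a
     real implementation would need to guard against
     floating-point error when the quotient is an exact integer.

```python
import math


def default_dropoff(lam, bits):
    """Smallest integer score representing at least `bits` bits.

    lam is Lambda in nats per unit score.
    """
    return math.ceil(bits * math.log(2) / lam)


if __name__ == "__main__":
    print(default_dropoff(0.32, 10))  # protein-style comparison
    print(default_dropoff(0.19, 20))  # blastn-style comparison
```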

     The command line parameters K and L can be used to set
     desired values for the Karlin-Altschul statistics' K and
     Lambda parameters, respectively.  Users should generally
     avoid setting these parameters, though, unless the full
     ramifications of doing so are understood.  For example, the
     value  of  the  H  statistic  reported  at  the  end of each
     program's output is a function of Lambda;  and  the  default
     value  for the neighborhood word-score threshold parameter T
     is in turn a function of H.

OPTIONS
     Except where noted, the BLAST programs  accept  all  of  the
     following options:

     -overlap
             Ordinarily, using a greedy algorithm,  the  programs
             discard  and  do  not report HSPs they find that are
             overlapped or spanned by one another on  either  the
             query  sequence  or the database sequence (or both).
             If two HSPs are found to overlap, the one having the
             higher score is retained and the other is discarded.
             This option turns off the overlap removal feature
             entirely.

     -overlap2
             For two HSPs to  be  deemed  overlapping,  both  the
             query  and  the  database segments from one HSP must
             span the corresponding segments in  the  other  HSP.
             (The  default  is  that  either one or both segments
             must span).  This option may be useful and result in
             additional  HSPs  being  reported  when the query or
             database sequences contain internal repeats.  On the
             negative  side  of  using  this  option, the Poisson
             statistics reported may  be  inaccurate,  since  the
             repeats  may  not  be  independent  events  but  the
             software may count them as such.
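     The difference between the default rule and -overlap2 can be
     sketched as below.  This is one reading of the text, not the
     programs' code: "span" is taken to mean that one segment
     fully contains the other, HSPs are plain dictionaries with
     hypothetical field names, and the greedy pass simply visits
     HSPs from highest to lowest score.

```python
def spans(outer, inner):
    """True if interval outer = (start, end) fully contains inner."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def one_spans_other(a, b, require_both):
    on_query = spans(a["query"], b["query"])
    on_db = spans(a["db"], b["db"])
    return (on_query and on_db) if require_both else (on_query or on_db)


def overlapping(a, b, require_both=False):
    """require_both=True models -overlap2; False models the default."""
    return (one_spans_other(a, b, require_both)
            or one_spans_other(b, a, require_both))


def remove_overlaps(hsps, require_both=False):
    """Greedy pass: keep an HSP only if no kept HSP overlaps it."""
    kept = []
    for hsp in sorted(hsps, key=lambda h: h["score"], reverse=True):
        if not any(overlapping(k, hsp, require_both) for k in kept):
            kept.append(hsp)
    return kept


# A repeat: the same query region matches two database regions.
HSPS = [
    {"score": 60, "query": (10, 50), "db": (100, 140)},
    {"score": 40, "query": (20, 40), "db": (300, 320)},
]

if __name__ == "__main__":
    print(len(remove_overlaps(HSPS)))
    print(len(remove_overlaps(HSPS, require_both=True)))
```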

     -filter filteroption
             This activates  query  sequence  filtering  to  mask
             regions  based on a potentially wide variety of cri-
             teria.  The usual intent of  filtering  is  to  mask
             regions  of  the  query sequence that may be charac-
             teristic of regions  in  several  unrelated  protein
             families  and,  thus,  are  not alone diagnostic for
             family membership.  For instance, acidic, basic,  or
             proline-rich  regions  are  often  masked that would
             otherwise  produce  overwhelming  amounts  of  unin-
             teresting  BLAST program output.  The BLAST programs
             currently know how to properly invoke  SEG  and  XNU
             filters,  but  these  two  filter programs must have
             been independently installed; they are not  included
             in the BLAST software distribution.  SEG (Wootton
             and Federhen, 1993) masks regions of low
             compositional complexity, while XNU (Claverie and
             States, 1993) generally masks regions containing
             short-periodicity internal repeats.  The BLAST
             programs can also pipe
             the filtered output from one program  into  another.
             For instance, XNU+SEG or SEG+XNU can be specified to
             have both programs filter the query sequence in
             succession.

     -echofilter
             This  causes  the  filtered  query  sequence  to  be
             displayed  in  the  output.   Any masked letters are
             indicated with X's.

     -consistency
             This turns off both the determination of the  number
             of HSPs likely to be consistent with each other in a
             gapped alignment and an adjustment made to the Pois-
             son  statistics  to account for the consistency cri-
             terion.

     -codoninfo codoninfofile
             This blastx option is used to specify a file
             containing codon usage or codon bias information to be
             used in concert with a traditional substitution
             matrix to score alignments.  The codoninfofile must
             have a .cdi extension to its name, but the .cdi
             extension should be omitted from the command line
             option.  Information in the file must be expressed
             in units that coincide with the scale of the
             substitution matrix.  The substitution matrix employed
             must have a .cdi extension on its name, as well.  A
             few such substitution matrices are provided in the
             BLAST software distribution.

     -gapdecayrate rate
             This parameter is the common ratio of the terms in a
             geometric progression.  A Poisson probability for N
             segments is weighted by the reciprocal of the Nth
             term in this progression; and the first term in the
             progression has a value of (1-rate).  The default
             value for the rate is 0.5, such that the probability
             of one segment is discounted by a factor of 2, the
             Poisson probability of 2 segments is discounted by a
             factor of 4, and so on.  The rate essentially
             defines a gap penalty imposed between each segment,
             and the default penalty is equivalent to 1 bit of
             information.  The suggestion to use a geometric
             progression to normalize probabilities for the
             multiple numbers of segments considered in BLAST
             output was made by Phil Green (Washington
             University, St. Louis, MO).
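     The weights implied by the geometric progression can be
     tabulated directly.  In this sketch the Nth term is
     (1-rate) * rate^(N-1), as described above; the function name
     is hypothetical.

```python
import math


def gap_decay_weight(n_segments, rate=0.5):
    """Reciprocal of the Nth term of the geometric progression."""
    term = (1 - rate) * rate ** (n_segments - 1)
    return 1 / term


def penalty_bits(rate=0.5):
    """Per-segment penalty expressed in bits."""
    return math.log2(1 / rate)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, gap_decay_weight(n))
    print("bits per additional segment:", penalty_bits())
```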

SORT OPTIONS
     The default sort order for reporting database  sequences  is
     by  increasing Poisson probability (P-value).  The following
     sort options are available:

     -sort_by_pvalue     Sort from most statistically significant
                         (lowest  Poisson  P-value) to least sta-
                         tistically significant (highest  Poisson
                         P-value), the default sort order.

     -sort_by_count      Sort  from  highest  to  lowest  by  the
                         number  of  HSPs found for each database
                         sequence.

     -sort_by_highscore  Sort from highest to lowest by the score
                         of  the  highest  scoring  HSP  for each
                         database sequence.

     -sort_by_totalscore Sort from the highest to the  lowest  by
                         the sum total score of all HSPs for each
                         database sequence.

SCORING SCHEMES
     The default scoring  matrix  used  by  blastp,  blastx,  and
     tblastn  is  the  BLOSUM62  matrix  (Henikoff  and Henikoff,
     1992).  With blastp, blastx, and tblastn, the M  option  can
     be  used  to  select  an  alternate substitution matrix file
     (_e._g., one of the PAM matrices described below).

     Several PAM (point  accepted  mutations  per  100  residues)
     scoring  matrices are provided in the BLAST software distri-
     bution, including the PAM40, PAM120, and PAM250. Of the  PAM
     scoring  matrices,  the PAM120 substitution matrix is recom-
     mended for general protein similarity searches if  only  one
     is  to  be used (Altschul, 1991).  The pam(1) program can be
     used to produce PAM matrices of any desired generation  from
     2 to 511.  Each matrix is most sensitive at finding similar-
     ities at its particular PAM  distance.   For  more  thorough
     searches,  particularly when the mutational distance between
     potential homologs is unknown and the significance of  their
     similarity  may  be  only  marginal,  Altschul  (1991, 1992)
     recommends performing at least three searches, one each with
     the PAM40, PAM120 and PAM250 matrices.

     When multiple PAM matrices are tried  with  the  same  query
     sequence,  additional  degrees  of  freedom  for  optimizing
     alignments are available, which reduce the alignments'  sta-
     tistical  significance.  Since PAM matrices are not entirely
     independent of  one  another,  statistical  significance  is
     reduced  by  a factor that is less than the actual number of
     matrices which are tried.  This factor approaches a limit of
     4.6  (just over 2 bits of information) when all PAM matrices
     are employed; however, the potential loss  of  signal  above
     background  from using a suboptimal matrix is typically much
     greater (Altschul, 1992).  When the matrices tried are  gen-
     erated  from  a  different  mutational model, or if multiple
     mutational models are employed, greater independence may  be
     obtained for the matrices, reducing statistical significance
     by a factor that may be as high  as  the  actual  number  of
     models or matrices that are tried.

     In blastn, M is the score for a single-letter match; N is
     the score for a single-letter mismatch.  M and N must be
     positive and negative integers, respectively.  It is not the
     absolute magnitudes of M and N, but rather their relative
     magnitudes (the absolute value of their ratio), that
     determine the number of nucleic acid PAMs (point accepted
     mutations per 100 residues) for which they are the most
     sensitive at finding homologs.  Higher ratios of M:N
     correspond to increasing nucleic acid PAMs.  The default
     values for M and N, with a ratio of 1.25, correspond to about
     47 nucleic acid PAMs, or about 58 amino acid PAMs (States
     et al., 1991).  If the ratio M:N is 1, this corresponds to 30
     nucleic acid PAMs or 38 amino acid PAMs.

     At higher than about 40 nucleic acid PAMs, or 50 amino  acid
     PAMs,  better  sensitivity at detecting similarities between
     coding regions is expected from  performing  comparisons  at
     the amino acid level, using conceptually translated
     nucleotide sequences (see blastx and tblastn).

     Independent of the values chosen for  M  and  N,  the  fixed
     wordlength W=12 used by blastn restricts it to finding homo-
     logs that share at a minimum a 12-mer stretch of 100%  iden-
     tity.  Under the random sequence model, stretches of 12 con-
     secutive matching residues are unlikely to occur  merely  by
     chance  for  distant  homologs.  Thus, blastn in its present
     implementation is poorly suited to finding distant homologs.

     For blastn, it should be easy to see how multiplying both  M
     and N by any factor, no matter how large, yields proportion-
     ally larger alignment scores with unchanged statistical sig-
     nificance.   This scale independence of the statistical sig-
     nificance has its analog in the consideration of the substi-
     tution matrices used by the other BLAST programs.  Multiply-
     ing all elements in a substitution matrix  by  an  arbitrary
     factor  will alter alignment scores proportionally, but will
     not alter their statistical significance (if numerical  pre-
     cision  is maintained).  For this reason, it is insufficient
     for the interpretation of raw alignment  scores  by  Karlin-
     Altschul  statistics  to  report  merely that a matrix for a
     particular PAM distance was used,  without  also  mentioning
     the scale of the matrix (see above).

     Regardless of the scoring scheme employed, two stringent
     criteria must be met in order to be able to calculate the
     Karlin-Altschul parameters Lambda and K.  First, given the
     residue composition for the query sequence and the residue
     composition assumed for the database, the expected
     substitution  score  for  any two randomly selected residues
     (one from the query sequence and one from the database) must
     be  negative.   Second,  given  the residue compositions and
     scoring scheme, a positive  score  must  be  possible.   For
     instance,  the  match  reward in blastn must have a positive
     value; and the substitution matrix used by blastp, blastx,
     and tblastn must contain at least one positive-valued
     substitution score for residues having non-zero frequencies in
     both  the  query sequence and in the assumed composition for
     the database.

     Given the assumption made by blastn that the  4  nucleotides
     A,  C,  G  and T are represented at equal frequencies in the
     database, the expected score for any two randomly selected
     residues  must  be  negative.  This precludes searching with
     some value combinations for M and  N,  in  particular  those
     combinations where the magnitude of the ratio M:N is greater
     than or equal to 3.
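     The ratio limit follows from the equal-frequency assumption:
     two random nucleotides match with probability 1/4, so the
     expected score is M/4 + 3N/4, which is negative only when
     M is less than 3 times the magnitude of N.  A short check,
     with hypothetical function names:

```python
def expected_score(m, n):
    """Expected score for two random nucleotides at equal frequencies."""
    return 0.25 * m + 0.75 * n


def scores_usable(m, n):
    """True if M and N satisfy both criteria described in the text."""
    return m > 0 and n < 0 and expected_score(m, n) < 0


if __name__ == "__main__":
    for m, n in [(5, -4), (2, -1), (3, -1), (4, -1)]:
        print(m, n, expected_score(m, n), scores_usable(m, n))
```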

SEQUENCE LENGTH AND STATISTICAL SIGNIFICANCE
     For the purpose of calculating significance levels, Y is the
     effective  length  of the query sequence and Z is the effec-
     tive length of the database, both measured in residues.  The
     default  values  for these parameters are the actual lengths
     of the query sequence and  database,  respectively.   Larger
     values  signify  more  degrees  of  freedom for aligning the
     sequences and reduced statistical significance for an align-
     ment of any given score.
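     The role of Y and Z can be sketched with the Karlin-Altschul
     relation E = K * Y * Z * exp(-Lambda * S).  The values of K,
     Lambda, the score, and the lengths are hypothetical, and the
     keyword arguments y and z merely stand in for the command
     line parameters.

```python
import math


def expect_value(score, k, lam, query_len, db_len, y=None, z=None):
    """Expected chance HSPs; y and z override the actual lengths."""
    eff_query = query_len if y is None else y
    eff_db = db_len if z is None else z
    return k * eff_query * eff_db * math.exp(-lam * score)


if __name__ == "__main__":
    small = expect_value(60, 0.1, 0.3, 250, 10_000_000)
    large = expect_value(60, 0.1, 0.3, 250, 20_000_000)
    print(small, large)  # doubling the database doubles the value
    fixed = [expect_value(60, 0.1, 0.3, 250, n, z=50_000_000)
             for n in (10_000_000, 20_000_000)]
    print(fixed)  # a constant Z gives the same value for both
```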

GENETIC CODES
     C is a non-negative integer that determines the genetic code
     that will be used by blastx (tblastn) to translate the query
     sequence (database sequences).   The  default  genetic  code
     (C=0)  corresponds  to  the  so-called Standard or Universal
     genetic code.  To obtain a listing  of  the  nine  available
     genetic  codes  and  their associated numerical identifiers,
     invoke either blastx or tblastn with the command line param-
     eter C=list.

     The current list  of  genetic  codes  and  their  associated
     values for parameter C are:

     0 Standard or Universal

     1 Vertebrate Mitochondrial

     2 Yeast Mitochondrial

     3 Mold Mitochondrial and Mycoplasma

     4 Invertebrate Mitochondrial

     5 Ciliate Macronuclear

     6 Protozoan Mitochondrial

     7 Plant Mitochondrial

     8 Echinodermate Mitochondrial

EXPECT VALUES
     The Expect value reported for each  HSP  is  the  number  of
     times  an HSP of equal or greater score is expected to occur
     by chance alone during the course  of  the  entire  database
     search.  The total length of the database figures into these
     estimates, with larger  databases  yielding  proportionately
     higher Expect values.

POISSON STATISTICS
     The occurrence of two  or  more  HSPs  involving  the  query
     sequence  and  the  same  database  sequence is modeled as a
     Poisson process.  An important result  of  applying  Poisson
     statistics is that an HSP having a low score and high Expect
     value (low statistical significance) may be discovered to be
     statistically  significant when it appears in the context of
     one or more additional matches  of  equal  or  higher  score
     against the same database sequence.

     The Poisson P-value for any given HSP is a function  of  its
     expected  frequency  of  occurrence  and  the number of HSPs
     observed against the same database sequence with  scores  at
     least  as  high.   The  Poisson  P-value  for a group of HSP
     events is the probability that at least as many  HSPs  would
     occur by chance alone, each with a score at least as high as
     the lowest-scoring member of the group.  HSPs  which  appear
     on  opposite  strands  of  a  nucleotide  query  or database
     sequence are considered to be  independent,  distinguishable
     events, and are counted separately.
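     A bare Poisson tail probability illustrates the idea.  The
     sketch takes the expected number of chance HSPs at the score
     of the lowest-scoring member of a group and returns the
     probability of seeing at least that many.  It deliberately
     omits the consistency and gap-decay adjustments described
     under OPTIONS, so it should not be expected to reproduce the
     programs' reported P-values.

```python
import math


def poisson_p_value(n_hsps, expect):
    """Probability of at least n_hsps events when the mean is expect."""
    below = sum(math.exp(-expect) * expect ** k / math.factorial(k)
                for k in range(n_hsps))
    return 1 - below


if __name__ == "__main__":
    # One HSP with a high Expect value versus two such HSPs
    # against the same database sequence.
    print(poisson_p_value(1, 0.5))
    print(poisson_p_value(2, 0.5))
```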

P-VALUES, ALIGNMENT SCORES, AND INFORMATION
     The Expect and P-values reported for HSPs are  dependent  on
     numerous factors including: the scoring scheme employed, the
     residue composition of the query sequence, an assumed  resi-
     due  composition for a typical database sequence, the length
     of the query sequence, and the total length of the database.
     HSP  scores  from different program runs are appropriate for
     comparison even if the databases searched are  of  different
     lengths,  as  long  as  the other relevant factors described
     here do  not  vary.   For  example,  alignment  scores  from
     searches  with  the default BLOSUM62 matrix must not be com-
     pared with scores  obtained  with  the  PAM120  matrix;  and
     scores of alignments produced using two versions of the same
     matrix that differ in scale (see above) cannot be meaningfully
     compared without conversion to the same scale.

     Some isolation from the many variables involved in assessing
     the  statistical  significance  of  HSPs  can be obtained by
     observing the information content reported (in bits) for the
     alignments.   While  the  information  content of an HSP may
     change when different scoring schemes are used  (e.g.,  with
     different  PAM matrices), the number of bits reported for an
     HSP will at least be independent of the scales to which  the
     matrices were generated.  (In practice, this statement is not
     quite true, because the substitution scores used by these
     programs are integers obtained by rounding high-precision
     floating-point values to the nearest integer, with a
     consequent loss of precision.)  When communicating the
     statistical significance of an alignment, the alignment score
     itself is meaningless unless the specific substitution matrix
     that was employed (including its scale) is also specified,
     either
     explicitly or implicitly.   In  isolation,  the  information
     content  reported for an alignment is a much more meaningful
     statistic than the score, and is independent  of  the  query
     and  database  length parameters used in estimating the sta-
     tistical significance (P-values) of alignments.
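     The scale independence of the bit value can be illustrated
     with a short sketch.  It assumes the convention
     bits = Lambda * S / ln 2 (some implementations also subtract
     ln K); the scores and Lambda values below are hypothetical
     and chosen only to show the effect of doubling a matrix's
     scale.

```python
import math

def bit_score(raw_score: float, lam: float) -> float:
    """Convert a raw alignment score to bits.

    Assumption: bits = Lambda * S / ln 2, where Lambda is the
    Karlin-Altschul scale parameter of the scoring scheme.
    """
    return lam * raw_score / math.log(2)

# Hypothetical alignment scored with a matrix, and again with the
# same matrix at twice the scale: the raw score doubles while
# Lambda halves, so the information content is unchanged.
bits_original = bit_score(60, 0.32)
bits_rescaled = bit_score(120, 0.16)
```

     The two raw scores (60 and 120) are not comparable, but the
     bit values agree, ignoring the rounding effects noted above.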

REGULATING OUTPUT
     The output is organized into three  independently  regulated
     sections: a histogram of word-hit extension scores; one-line
     descriptions of the database sequences that yielded  one  or
     more  HSPs;  and  the high-scoring segment pairs themselves.
     Each section of the output can be selectively suppressed  by
     setting the parameters H, V, and B to 0 (zero).

     Parameter H regulates the display  of  a  histogram  of  the
     scores  of the highest-scoring hit extensions for each data-
     base sequence.  If H is assigned a  non-zero  value  on  the
     command  line,  the  histogram will be displayed (except for
     the blastx program, which never  displays  a  histogram  but
     retains  the H parameter for command-line compatibility with
     the other programs).  The default value for H is 0 (no  his-
     togram).

     Parameter V is the maximum number of database sequences  for
     which  one-line  descriptions will be reported.  The default
     value for V  is  500.   A  warning  message  is  prominently
     displayed  at  the  end of the one-line descriptions section
     when HSPs are found in more than V  sequences.   When  V  is
     zero,  no  one-line descriptions are reported and no warning
     is given.  Negative values for V are  undefined  and  disal-
     lowed.

     As an example of how V can be used advantageously: if a high
     value for E is desired to virtually assure that at least one
     HSP will be found in all cases, selecting a small value for V
     will keep the output from becoming too voluminous; only the
     most statistically significant matches will be reported.

     Parameter B regulates the display of the  high-scoring  seg-
     ment pairs.  For positive values, B is the maximum number of
     database sequences for which high-scoring segment pairs will
     be  reported.   This  may  be  much  smaller than the actual
     number of high-scoring segment  pairs  reported,  since  any
     given database sequence may yield several HSPs.  The default
     value for B is 250.  Negative values for B are undefined and
     disallowed.
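     The effect of V and B on the output can be sketched as
     follows.  This is only an illustration of the limits
     described above: the data structure, the function name, and
     the ordering by P-value are assumptions, not the programs'
     internals.

```python
def select_output(hits, v=500, b=250):
    """Apply the V and B limits to a list of database hits.

    hits: list of (sequence_id, best_p_value, hsps) tuples; this
    structure is hypothetical.  Returns the one-line description
    ids, whether a "more than V sequences" warning applies, and the
    (sequence_id, hsps) pairs whose HSPs would be shown.
    """
    if v < 0 or b < 0:
        raise ValueError("negative values for V and B are disallowed")
    # Assumption: most significant (lowest P-value) sequences first.
    ranked = sorted(hits, key=lambda hit: hit[1])
    descriptions = [seq_id for seq_id, _, _ in ranked[:v]]
    warn = v > 0 and len(ranked) > v   # no warning when V is zero
    alignments = [(seq_id, hsps) for seq_id, _, hsps in ranked[:b]]
    return descriptions, warn, alignments

hits = [
    ("seqA", 1e-30, ["hsp1", "hsp2"]),
    ("seqB", 0.5, ["hsp3"]),
    ("seqC", 1e-5, ["hsp4", "hsp5", "hsp6"]),
]
descriptions, warn, alignments = select_output(hits, v=2, b=1)
```

     Here B=1 limits the HSP section to one database sequence,
     yet two HSPs are reported, because that sequence yielded two
     of them.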

ENVIRONMENT VARIABLES
     The environment variables BLASTDB and BLASTMAT may be set by
     the  user to override the default directories in which these
     programs try to find database files and substitution  matrix
     files, respectively.  The default pathnames are
     /usr/ncbi/blast/db/ and /usr/ncbi/blast/matrix/.  It is
     essential that these pathnames end with a slash (/) character.

SUPPORT UTILITIES
     Databases to be searched by these programs must first be
     processed by the program setdb for protein sequence databases
     (used by blastp and blastx) or by the program pressdb for
     nucleotide sequence databases (used by blastn and tblastn).
     The input database format is FASTA/Pearson.

     Point accepted mutation (PAM) matrices  of  various  genera-
     tions  can  be  produced automatically with the pam program.
     The output can be saved in a file whose name is then  speci-
     fied  in  the  M=filename  option  of  a  blastp, blastx, or
     tblastn query.

BUGS
     blastn uses a large value for the wordlength, W, and does no
     neighboring  on  these  words.  Consequently, the program is
     suitable for finding nearly identical sequences rapidly, but
     not  distantly  related  ones.   To identify weak amino acid
     similarities encoded by nucleic acid, use blastx or tblastn.

     In blastp, blastx, and tblastn, ad hoc equations have not
     been implemented yet for calculating appropriate default
     values for T when W has a value other than 3 or 4.

     When nucleotide sequence databases are processed into
     searchable form by the pressdb program, IUPAC ambiguity
     letters are replaced by an appropriate random selection from
     the list A, C, G and T.  (For example, an R would be replaced
     on average half of the time by an A and half of the time by
     a G.)  If the original database in FASTA format is not
     available to blastn and tblastn, then the original locations
     of the ambiguity codes cannot be determined by the programs,
     and the alignments and alignment scores may be in error with
     respect to the original sequences.
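     The replacement of ambiguity letters can be sketched as
     below.  The table holds the standard IUPAC nucleotide
     ambiguity codes; the function name and the use of a uniform
     random choice among the permitted bases are assumptions, and
     this is not the pressdb source.

```python
import random

# Standard IUPAC nucleotide ambiguity codes and the bases they allow.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def resolve_ambiguity(seq, rng=None):
    """Replace each ambiguity letter by a random permitted base.

    Assumption: each permitted base is equally likely, so an R
    becomes A or G with probability 1/2 each.
    """
    if rng is None:
        rng = random.Random()
    resolved = []
    for letter in seq.upper():
        if letter in IUPAC_AMBIGUITY:
            resolved.append(rng.choice(IUPAC_AMBIGUITY[letter]))
        else:
            resolved.append(letter)   # A, C, G, T pass through
    return "".join(resolved)

original = "ACGTRYN"
stored = resolve_ambiguity(original, random.Random(0))
```

     Once only the resolved sequence is kept, the positions of
     the R, Y and N letters cannot be recovered from it, which is
     why the original FASTA file is needed.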

     tblastn uses only one genetic code to translate  the  entire
     nucleotide   sequence   database,  although  the  particular
     genetic code employed is selectable via the parameter C.

     blastn, blastx, and tblastn treat U and T residues in
     nucleotide sequences as the same residue (i.e., they match).

     With two exceptions, any letter in the query sequence  which
     is  not a member of the relevant IUPAC amino acid or nucleo-
     tide code  is  stripped  and  does  not  contribute  to  the
     sequence  coordinate  numbers reported by the programs.  The
     exceptions are asterisks (*) and hyphens (-) in  amino  acid
     sequences,  which  are  interpreted as translation stops and
     gap characters, respectively.  In protein sequence databases
     that  are  processed  into searchable form by the setdb pro-
     gram,  non-IUPAC  letters,  including  any  punctuation  but
     excluding  asterisks  and  hyphens,  are also stripped.  The
     pressdb program does not strip non-IUPAC codes, but treats
     them similarly to Ns.
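     The stripping of an amino acid query can be sketched as
     follows.  The alphabet used here (the 20 standard residues
     plus B, Z and X) and the conversion to upper case are
     assumptions about what the programs accept; the function
     name is hypothetical.

```python
# Assumed IUPAC amino acid alphabet: 20 standard residues plus the
# ambiguity codes B, Z and X.
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTVWXYZ")
# Kept as translation stop and gap characters, respectively.
KEPT_SYMBOLS = {"*", "-"}

def clean_protein_query(raw):
    """Drop every character that is neither an assumed IUPAC amino
    acid code nor one of the two excepted symbols."""
    return "".join(
        ch for ch in raw.upper()
        if ch in AMINO_ACID_CODES or ch in KEPT_SYMBOLS
    )

cleaned = clean_protein_query("MK T1L*-J")
```

     The space, the digit and the letter J are removed, so
     reported coordinates refer to positions in the cleaned
     sequence rather than in the raw input.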

     blastn does not incorporate the concept  of  a  partial-  or
     half-match,  such as when a purine in one sequence is juxta-
     posed with a purine from the other.   For  two  residues  to
     match  at  all, they both must be members of the set A, C, G
     and T (or U).
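     The matching rule stated above, together with the
     equivalence of U and T, can be written as a small predicate.
     This is a sketch with a hypothetical function name, not the
     blastn comparison code.

```python
UNAMBIGUOUS = {"A", "C", "G", "T"}

def blastn_match(a, b):
    """Return True only if both residues are unambiguous and equal.

    U is treated as T.  Ambiguity codes never match, not even
    themselves, and there is no partial credit for purine/purine or
    pyrimidine/pyrimidine pairs.
    """
    a = a.upper().replace("U", "T")
    b = b.upper().replace("U", "T")
    return a in UNAMBIGUOUS and b in UNAMBIGUOUS and a == b

same_base = blastn_match("A", "a")
u_vs_t = blastn_match("U", "T")
purines = blastn_match("A", "G")
ambiguous = blastn_match("R", "R")
```

     The first two comparisons are matches; the purine pair A/G
     and the ambiguity pair R/R are both plain mismatches.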

     When calculating the Poisson statistics, some  HSPs  may  be
     incompatible with each other in the same segmented alignment
     and  yet  they  are  (incorrectly)  counted  as  independent
     events.   This  even  happens  (albeit at reduced frequency)
     when a simple mid-point consistency rule is  invoked.   HSPs
     on opposite strands or in reading frames on opposite strands
     _a_r_e counted separately, however.

     The user may note that the nucleotide composition of a
     blastn query sequence is irrelevant to the resulting
     Karlin-Altschul parameters, Lambda and K.  This is due to the
     equi-probable 0.25/0.25/0.25/0.25 A/C/G/T residue distribution
     assumed for a typical database sequence.  The values of
     the Karlin-Altschul parameters are still affected by the
     scoring scheme employed (parameters M and N).  Furthermore,
     blastn may be compiled with a non-uniform residue composition
     for the database, in which case the query composition
     does become relevant and will impact the values of the
     Karlin-Altschul parameters that are calculated.
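     Why the query composition drops out can be checked
     numerically.  Lambda is the positive solution of
     sum over i,j of q(i) * p(j) * exp(Lambda * s(i,j)) = 1,
     where q and p are the query and database residue
     frequencies.  The sketch below solves this by bisection; the
     function name is hypothetical, and the values M=5 and N=-4
     are assumed here purely for illustration.

```python
import math

def karlin_lambda(query_freqs, db_freqs, match, mismatch):
    """Positive root of sum_ij q_i * p_j * exp(lam * s_ij) = 1.

    Requires a negative expected score and at least one positive
    score, as Karlin-Altschul statistics assume.
    """
    def f(lam):
        total = 0.0
        for a, q in query_freqs.items():
            for b, p in db_freqs.items():
                score = match if a == b else mismatch
                total += q * p * math.exp(lam * score)
        return total - 1.0

    lo, hi = 1e-6, 1.0          # f(lo) < 0 when the expected score is negative
    while f(hi) <= 0.0:         # widen until the root is bracketed
        hi *= 2.0
    for _ in range(200):        # plain bisection
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10}

lam_uniform_query = karlin_lambda(uniform, uniform, match=5, mismatch=-4)
lam_skewed_query = karlin_lambda(skewed, uniform, match=5, mismatch=-4)
lam_skewed_db = karlin_lambda(skewed, skewed, match=5, mismatch=-4)
```

     With a uniform database composition, every query residue
     contributes the same inner sum, so the first two values
     agree; with a non-uniform database composition (third call)
     Lambda changes.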



     The observed high scores reported by blastx for each reading
     frame  at  the  end  of the output may not coincide with the
     highest scores observed in the HSP alignments that are actu-
     ally  displayed.   This is because an HSP found in one frame
     may  be  eliminated  from  the  output  by  an  overlapping,
     higher-scoring  HSP  found  in  another reading frame on the
     same strand.

SEE ALSO
     blast3(1).

COPYRIGHT
     This work is in the public domain.

REFERENCES
     Altschul, Stephen F. (1991).  Amino acid substitution
     matrices from an information theoretic perspective.  J. Mol.
     Biol. 219:555-65.

     Altschul, S. F. (1993).  A protein alignment scoring system
     sensitive at all evolutionary distances.  J. Mol. Evol.
     36:290-300.

     Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W.
     Myers, and David J. Lipman (1990).  Basic local alignment
     search tool.  J. Mol. Biol. 215:403-10.

     Claverie, J.-M. and D. J. States (1993).  Computers in
     Chemistry, in press.

     Wootton, J. C. and S. Federhen (1993).  Computers in
     Chemistry, in press.

     Gish, W. and D. J. States (1993).  Identification of protein
     coding regions by database similarity search.  Nature
     Genetics 3:266-72.

     Henikoff, Steven and Jorja G. Henikoff (1992).  Amino acid
     substitution matrices from protein blocks.  Proc. Natl. Acad.
     Sci. USA 89:10915-19.

     Karlin, Samuel and Stephen F. Altschul (1990).  Methods for
     assessing the statistical significance of molecular sequence
     features by using general scoring schemes.  Proc. Natl. Acad.
     Sci. USA 87:2264-68.

     Karlin, Samuel and Stephen F. Altschul (1993).  Proc. Natl.
     Acad. Sci. USA, submitted.

     States, D. J., W. Gish and S. F. Altschul (1991).  Improved
     sensitivity of nucleic acid database similarity searches
     using application specific scoring matrices.  Methods: A
     companion to Methods in Enzymology 3:66-70.


Sun Release 4.1     Last change: 7 July 1993                   16