GDE2.2 rev1 1 Genetic Data Environment version 2.2 Table of Contents Introduction 2 What's New for this Release 2 System Requirements 2 Note to Motif users 2 Installing the GDE 3 Using the GDE 3 Data Types 7 Menu Functions File menu 7 Edit menu 9 DNA/RNA menu 9 External Functions 9 Bug reports/extensions 12 Acknowledgments 12 Appendix A, File Formats 13 Appendix B, Adding Functions 16 Appendix C, External functions 19 .c.Introduction The Genetic Data Environment is part of a growing set of programs for manipulating and analyzing "genetic" data. It differs in design from other analysis programs in that it is intended to be an expandable and customizable system, while still being easy to use. There are a tremendous number of publicly available programs for sequence analysis. Many of these programs have found their way into commercial packages which incorporate them into integrated, easy to use systems. The goal of the GDE is to minimize the amount of effort required to integrate sequence analysis functions into a common environment. The GDE takes care of the user interface issues, and allows the programmer to concentrate on the analysis itself. Existing programs can be tied into the GDE in a matter of hours (or minutes) as apposed to days or weeks. Programs may be written in any language, and still seamlessly be incorporated into the GDE. These programs are, and will continue to be, available at no charge. It is the hope that this system will grow in functionality as more and more people see the benefits of a modular analysis environment. Users are encouraged to make modifications to the system, and forward all changes and additions to Steven Smith at smith@bioimage.millipore.com. .c.What's New for this Release GDE 2.2 represents a maintainence release. Several small bugs have been fixed, as well as new editing features and user interface elements. Also, I have tried to update all of the contributed external programs to their latest release. Updated programs include: Phylip Treetool LoopTool Readseq Blast Fasta Improved versions of printing, and translate are included as well. As for new editing features, a useful "yanking" feature has been added by Scott Ferguson from Exxon Research, and the capability to export the colormap for a seqeunce (see appendicies A/C). Among the bugs fixed in this release are: Selection mask problems when exporting to Genbank (fixed in 2.1) Memory leaks (fixed in 2.1) Correct handling of circular sequences More liberal interpretation of Genbank formatted files. (not column dependent) .c.System Requirements GDE 2.2 currently runs on the Sun family of workstations. This includes the Sun3 and Sun4 (Sparcstation) systems. It was written in XView, and runs on Suns using OpenWindows 3.0 or MIT's X Windows. It runs in both monochrome, and color, and can be run remotely on any system capable of running X Windows Release 4. You should have at least 15 meg of free disk space available. The binay release for SparcStations was compiled under SunOS 4.1.2 and Openwindows 3.0. We are also supporting a DECStation version of GDE. This is running under XView 3.0/X11R5. We encourage interested people to port the programs to their favorite Unix platform. There are informal ports to the SGI line of unix machines. .c.Note to Motif users GDE2.2 can be run using different window managers. The most common alternative to olwm is the Motif window manager (mwm). The only problem in using another window manager is that the status line is not displayed. We have added a "Message panel" as an option under "File- >Properties" which displays all of the information contained on the status line. People using other window managers may also prefer using xterm, and xedit as default terminals and file editors. This can be accomplished by replacing all occurrences of 'shelltool' and 'textedit' with 'xterm -e' and 'xedit' in the $GDE_HELP_DIR/.GDEmenus file. .c.Installing the GDE Instructions for the source code release are included in the README.install file. The binary installations consist of creating a GDE directory, such as /usr/local/GDE, and un-taring the installation tarfile into the directory. If you are installing the GDE for your own use, then you can simply make a GDE subdirectory. There is no need to be superuser (root) to do the installation in your own directory. For example: demo% mkdir /usr/local/GDE demo% cp GDE2.2.tar /usr/local/GDE demo% cd /usr/local/GDE demo% tar -xf GDE2.2.tar After this, each new user will need to add two lines to their .cshrc file so that they can find the gde programs and files. demo% cat >> ~/.cshrc set path = ($path /usr/local/GDE/bin) setenv GDE_HELP_DIR /usr/local/GDE/help/ ^D You may wish to make a copy of the .GDEmenus file from the help directory into your home directory. This is only necessary if you wish to modify your menus. Copy the demo files from /usr/local/GDE/demo into your local directory, and you are now ready to use the GDE. FastA and Blast need to have the properly formatted databases installed in the $GDE_HELP_DIR under the directories FASTA/PIR, FASTA/GENBANK, BLAST/pir BLAST/genbank. For FASTA, simply copy a version of PIR and Genbank into the proper directory. Alternately, the PIR and GENBANK files can be symbolic links to copies of Genbank held elsewhere on your system. You may need to look at the .GDEmenus file in $GDE_HELP_DIR to verify that you are using the same divisions for these databases. Blast installation involves converting PIR and GENBANK to a temporary FASTA format (using pir2fasta and gb2fasta) and then using pressdb for nucleic acid, and setdb for amino acid to reformat the databases again into blast format. The .GDEmenus file is currently set up to search with blast using the following databases: pir, genpept, genupdate, and genbank. If you wish to divide these into subdivisions, then the .GDEmenus file will have to be edited. The most up to date release of blast can be obtained via anonymous ftp to ncbi.nlm.nih.gov. The most recent release of FASTA can be obtained via anonymous ftp to uvaarpa.virginia.edu. It is strongly recommended that you retrieve these copies, and become familiar with their setup. .c.Using the GDE It is assumed that the user is familiar with the Unix, and OpenWindows/Xwindows environments. It is also assumed that people running standard MIT X- Windows will be using the OpenLook window manager (olwm). Other window managers work with varied success. If you are not certain as to how your system is set up, please contact your systems administrator. Once the window system has started, and a terminal window (xterm, shelltool etc.) you can start up the GDE by typing: demo% gde tRNAs This should load the sample data set tRNAs into GDE, and the following window should appear: This is the sequence alignment editor. It consists of a color alignment display, a set of command menus, horizontal and vertical scroll bars to navigate the alignment, a list of short sequence names (usually the LOCUS of a Genbank entry), and a status line. The cursor is located in the upper left corner. Using the Mouse The mouse follow OpenLook standards for operation. The functions for each button are: The left mouse button is used for placing the cursor, selecting sequences by their short names, scrolling/paging, performing split screens, and resizing. The right button is used for pop up menus, and scrollbar menus. The middle button is used for extending a text selection. Cursor Movement The cursor can be moved using the arrow keys, or by clicking the mouse within a sequence. The cursors position is displayed on the status line in both sequence position and alignment column number. The right hand side of the status line shows the left and right column positions of the currently active display. Scrolling is controlled by the scrollbar elevator. By clicking (left mouse button) on one of the elevator arrows, the screen will scroll one character in that direction. By dragging the elevator center, the screen can be moved directly to any location. By clicking directly to one side of the elevator, the screen will scroll one full screen in that direction. And by clicking on the scrollbar anchor, the elevator will move to that anchor. Scrollbars also have menus associated with them giving other scroll options. Use the right mouse button to activate the menu. Selecting Sequences Sequence selection is necessary before most functions can be performed. Selecting sequences is accomplished by clicking or dragging (left button) over the short name associated with the sequence(s). The name of the sequence should become highlighted on the release of the mouse button. By holding down the shift key, you can toggle the selection on or off for any set of sequences. By clicking just to the right of any sequence short name, you will deselect all of them. Selecting Text Selecting text is accomplished in much the same way as selecting entire sequences. In the editing window, you can drag the mouse pointer over a rectangular region the select a block of text. By using the shift key (or the middle mouse button) you can adjust the selection to include other sequences, or other columns of text. If groups are enabled, GDE will automatically select all sequences in a group if any one sequence in a group is selected (See Sequence Editing). Sequence Protection All sequences can be individually protected against accidental modification. This is accomplished by selecting the set of sequences that you are interested in editing, and choosing the "Set protections" menu item under the File menu. Your choices are: Unambiguous modification Changing/adding/deleting regular characters Ambiguous changes Changing ambiguous codes ('N', 'X'...) Alignment modifications Changing alignment gaps ('-', '~') Sequence Editing Sequences can be edited by simply typing to insert, and using the delete or backspace key to delete characters. Sequences must have the proper protections set to allow the type of modifications that you are attempting. The default protection level only allows modification to the alignment, but not to the sequences themselves. The Sun function keys, cut, copy and paste are used to edit selected text. Text selections work in rectangular (possibly disjointed) regions. You can cut or copy a block of sequence text, and paste it to a new cursor location using these three keys. Sequence Yanking: Yanking referes to the "pulling" of a base to fill a gapped position like beads on an abacus. Place the cursor over a gap character, and type k to yank the character from the left into the current position. Type l to pull the character from the right. Repeat counts are honored ("20 l" will yank 20 characters from the right). Repeat Counts By typing a numeric value before an editing function you can insert, delete or move a number of characters at a time. The current repeat count is displayed on the status line, and can be cleared by clicking the left mouse button in the alignment window. In order to insert twenty gaps into a sequence, one would type "20-". In order to move down five sequences, one would type "5¯". This works with all sequence types, however the meta (diamond) key must be held down when the cursor is in a text or mask sequence. This is because numbers are valid characters in these sequences, and would otherwise be confused with repeat counts. Split Screen Split screen editing allows the viewing one region while editing another. This is very useful for aligning "downstream" regions by editing "upstream". The alignment window can be split horizontally into two or more windows into the alignment. These windows scroll independently of each other both horizontally and vertically. The short names displayed to the left of the alignment correspond to the window that was last scrolled or edited. Care should be taken in any modifications done in this mode so that edits are performed on the correct sequence. To avoid confusion during split screen operations, the vertical scroll bars may be locked so that all windows scroll together. In order to split a window into two views, grab (left button) the left or right anchor (small rectangle) at either end of the horizontal scrollbar and drag to the middle of the window. This should split the window into two views. To join two views, place the mouse pointer on the horizontal scroll bar use the menu (right button) . The views are NOT two copies of the alignment. Changes in one window are reflected in the other. Users should not be confused by this fact. Sequence Grouping Sequences can be grouped for editing functions. This is very helpful when trying to adjust several sub alignments. When grouped, all sequences within a group will be affected by editing in any member of the group. All sequences within a group must have protections set to allow modification before any one will be modified. In order to group sequences, select the names of the sequences that should fall within a group, and select Group under the Edit menu. A number will be placed at the left of the sequence representing its assigned group number. To any sequence or sequences, the user selects those sequences and uses the Ungroup command under the Edit menu. Special keys There are also a few special function keys used in the GDE. Some functions have meta key equivalences so that they can be called from the keyboard, instead of by the menu system. The "meta" key is a standard property of X windows, and may be remapped to a different key symbol for different keyboards. For example, meta on Sun workstations is represented with a à, where on a Macintosh running MacX it might be the "apple" key. The operation of the key is the same as the control or shift key, it is held down while pressing the second key in the sequence. Cut text, copy text and paste text are mapped to the Openlook equivalent keys (L10, L6, and L8 on Sun keyboards). Other meta keys are defined in the .GDEmenus file, and may be changed to suit your preferences. .c.Data Types The GDE supports several data types. The data types supported in 2.2 are DNA, RNA, protein (single letter codes), mask sequence, and text. DNA and RNA Nucleic acid sequences are tightly type cast, and can contain any IUPAC code (ACGTUM RSVWYHKDBN) as well as two alignment gap characters ('~' and '-'). Some keys are remapped to fit IUPAC codes. For example, 'X' is mapped to 'N'. All nonstandard characters get mapped to the alignment gap '-'. Upper and lower case are both supported, and the T/U characters are mapped based on whether you are working with DNA or RNA. The color coding for DNA and RNA is identical. The color for ambiguous characters, and for alignment gaps is grey. Amino Acid Sequence Amino acid sequences are loosely type cast, and can contain any valid ASCII character. The results of analysis on nonstandard characters is not guaranteed. The color for nonstandard amino acid characters, and for alignment gaps is grey. Text Sequence Any valid ASCII printable character can be entered into a text sequence. Care should be taken with using space characters, as these will only be saved properly in Genbank format, and not in flat file format. The characters @#% and " should be avoided as well, as these can confuse the reading of flat files if saved in that format. Mask Sequence Mask sequence is identical to text sequence with the following exceptions. Mask sequence can have the ability (function dependent) of masking out positions in an alignment for analysis. If a mask sequence is selected along with some other sequence(s) for an analysis function that permits masking, then all columns that contain a '0' in the mask sequence will be ignored by the function. The mask itself would not be passed to the analysis function either. Some functions allow masking, some do not. Refer to the instruction page for each function to see whether or not it supports sequence masking. Color Masks Color masks give color to a sequence on a position by position basis. Individual sequences can have color masks attached to them, or one color mask can be used for an entire alignment. Color masks are generated externally by some analysis functions, and are then passed back to the GDE. The file format for a colormask is described in Appendix A. .c.Menu Functions: File menu The GDE has several built-in menu functions under the File and Edit menus. These functions are unique in that they are part of the primary display editor, and are not described in the .GDEmenus file. Open... Selecting this will bring up the open file dialog box. Users can scroll through a list of files in the current directory, move up and down the directory tree, and open any individual data file. The sequence data in that file is loaded into the current editing window below any existing sequences. The open command will open any Genbank formatted file, or a GDE flat file. Save as... This function will save the entire alignment to a specified file in either Genbank or flat file format. The file will be saved in the local directory unless a relative or absolute path is specified. Properties... Properties controls the display settings. Those settings include character size, color type, and insert direction. The screen can also be inverted, vertical scroll lock and keyboard clicks (tactile feedback) can be turned on or off. Vertical scrollbar lock will cause all split views to scroll together in the vertical direction. Protections... This will display, and then set the default protections for all selected sequences. If two or more of the sequences differ in their current protection settings, a warning message will appear in the protection dialog box. The protections currently available are alignment gap protection, ambiguous character protection, unambiguous character protection, and translation protection. Get info... This option allows the viewing and setting of attributes associated with each individual sequence. These attributes include short name, full name, description, author, comments, and the sequence type. The attributes loosely correspond to fields in a Genbank entry. Comments can be included for each sequence in the comments field. .c2. Edit menu Select All Selects all sequences. This is helpful when you have several dozen sequences. Select by name... Select all sequences containing a given string in their short names field. No wild cards are allowed, and only selecting is allowed, not de-selecting. The search is started when the Return key is pressed, and multiple searches can be accumulated. Press Done when finished. Cut/Copy/Paste sequences Cut copy and paste are primarily useful for reordering sequences, and for making duplicate copies of a given sequence. They do not pass information to other programs. This capability will be implemented in a later release. Cut and copy will place the selected sequences on an internal clipboard. They can then be pasted back into the top of editing window (default) or under the last selected sequence. Group/Ungroup Assign a group number to the selected sequences. Edit operations in any one sequence within the group will be propagated to all within the group. Sequence protections from one group are also imposed upon all other sequence in the given group. If a given operation is illegal in one sequence in a group (i.e. alignment modification) then it will not work in any of the sequences in that group. Ungroup will remove the selected sequences from a given group. Compress Compress will remove gap characters from the selected sequences. The user has the option of removing all gaps, or simply all columns containing nothing but gaps. This is useful for minimizing the length of a subalignment. Reverse Sequence Reverses the selected sequences. Alignment gaps are reversed as well. The selected sequences will remain aligned after reversal. .c2. DNA/RNA menu Complement Sequence Converts DNA/RNA into its complement strand (keeping full IUPAC ambiguity). This function has no effect on text, protein, or mask sequence. Note that this function does not produce the reverse strand of DNA but merely converts A<->T and G<- >C. If the reverse strand is needed, remember to Complement and Reverse the sequence (Edit menu). .c.External Functions See appendix C for a full description of functions supported in GDE 2.2 All external functions are described in the configuration file .GDEmenus. Here is a brief description of some of the basic functions included. File menu New Sequence Create a new sequence. Prompts for sequence type, and short name. Import foreign format Export foreign format Load and save sequences using Readseq by Don Gilbert (see Appendix C). Save Selection Save the currently selected sequences in a specified file. Pretty print Print using the sequence formatter supplied by Readseq. Print Selection Print the selected sequences to the chosen printer. This function supports the Unix command enscript as well as lpr. The .GDEmenus file may need to be modified to add the names of local printers to the printer list. Edit menu sort... Sort the selected sequences by a primary and secondary key. Pass the new order to a new GDE window. Extract Extract the selected sequences into a new window. DNA/RNA Menu Translate Translate the selected sequences from DNA/RNA to Amino acid. The user can specify the desired reading frame, and the minimum open reading frame (stop codon to stop codon) to translate. The user can also choose between single letter code and triple letter codes. There is also an option to allow each ORF to to be entered as a seperate sequence. Dot plot Display a dotplot identity matrix for the selected sequence(s). If only one sequence is selected, then the dotplot is a self comparison. If two or more sequences are selected, then the first two sequences are compared. Clustal Align Align the selected sequences using the clustalv algorithm by Des Higgins. (See Appendix C) Find All Search and highlight the selected sequences for a given substring. A specified percent of mismatching can also be allowed. Variable Positions The selected sequences are scored column by column for conservation. The result is returned as a grey scale alignment color mask. This can be useful in selecting PCR primers. Sequence Consensus Return the consensus for the selected sequences. This can either be a majority consensus, or an ambiguity consensus using IUPAC coding. Distance Matrix Calculate a distance matrix for the selected sequences. (See Appendix C) MFOLD Fold the selected sequences using MFOLD by Michael Zuker. The resulting structure is returned as a nested bracket ('[]') representation of the secondary structure.(See appendix C.) Draw Secondary Struct Draw the selected sequence using the proposed secondary structure. Both the secondary structure prediction, and the RNA sequence should be selected before calling this function. The drawing program is LoopTool. (See Appendix C) Highlight Helix Show all violations to a proposed RNA secondary structure. The secondary structure represented must be selected, as well as the aligned sequences to be tested. The selected sequences will then be colored according to whether or not they support the proposed 2¡ structure. Standard Watson/Crick paring will be colored dark blue, G-U paring will be colored light blue, mismatches will be colored gold, and pairng to gaps will be red. Blastn/BlastX Search the selected sequence (select only one) against a given database with the BLAST searching tool written by Altschul, Gish, Miller, Myers, and Lipman. Blastn searches DNA against DNA databases, blastx searches DNA against AA databases by translating the sequence in all six reading frames. (See Appendix C) FastA Search the selected sequence (select only one) against a given database using the FASTA similarity search program written by Pearson and Lipman. (See Appendix C) Protein Menu Clustal Align Align the selected amino acid sequences using the clustal algorithm. (See Appendix C) Blastp, Tblastn, Blast3 Search the selected sequence (select only one) against a given database with the BLAST searching tool written by Altschul, Gish, Miller, Myers, and Lipman. Blastp searches AA against AA databases, tblastn searches AA against DNA databases by translating the database in all six reading frames. Blast3 finds three way alignments that are could not be found with only pairwise comparisons. (See Appendix C) Sequence Management Menu Assemble contigs Assemble the selected sequences into contigs using the program CAP (Contig Assemble Program) written by Xiaoqiu Huang. The resulting sequences are returned to the current GDE window, and they are grouped into contigs. The user can then sort the sequences by group, and offset to produce an ordered list of the contigs. (See Appendix C) Strategy view Pass out the selected sequences to StratView. This program will display contigs in a greatly reduced line drawing. This is very useful for large contigs. Restriction sites Search the selected sequences for the restriction enzymes specified in the given enzyme file. The restriction sites are then colored by enzyme. Phylogeny menu DeSoete Tree fit Calculate a phylogenetic tree using a least squares fitting algorithm on a distance matrix calculated from the selected sequences. The results can then be passed on to treetool for display and manipulation. (See Appendix C) Phylip 3.5 Pass the selected data to on of the treeing programs in Phylip, written by Joe Felsenstein. The chosen phylip program is started in it's own window, with the selected sequences already loaded. (See Appendix C) Citation of work We ask that any published work using any of the external functions in GDE cite the appropriate authors. Please see Appendix C for references. .c.Bug reports/extensions Any bug reports, request for enhancement, and useful extensions to the GDE should be forwarded by electronic mail to: smith@bioimage.millipore.com Please include as much detail as possible in bug reports so that the bug can be reproduced. Correspondence should be addressed to: Steven Smith Millipore Imaging Systems 777 E. Eisenhower Pkwy Ann Arbor, MI 48108 .c.Acknowledgments I would like to thank the following people for their input and assistance and code used in the development of the GDE: Carl Woese, Gary Olsen and Mike Maciukenas at University of Illinois Dept of Microbiology, Ross Overbeek at Argonne National Laboratories,Walter Gilbert, Patrick Gillevet, Chunwei Wang, Susan Russo and Erik Bunce at the Harvard Genome Laboratory. I would also like to personally thank the following people for their permission to include their software with this release of GDE. Tim Littlejohn Scott Ferguson Brian Fristensky Des Higgins David Lipman and the group at NCBI William Pearson Don Gilbert Xiaoqui Huang Joe Felsenstein Michael Zuker Geert DeSoete Many thanks to all the people who have directly and indirectly helped with the ongoing support of GDE. It is only by the generosity of these people that GDE has been successful. .c.Appendix A, File Formats The currently supported file formats include GDE data files, Genbank formatted files (with type extensions), a generic flat file format, and a color mask file. GDE format GDE format is a tagged field format used for storing all available information about a sequence. The format matches very closely the GDE internal structures for sequence data. The format consists of text records starting and ending with braces ('{}'). Between the open and close braces are several tagged field lines specifying different pieces of information about a given sequence. The tag values can be wrapped with double quote characters ('""') as needed. If quotes are not used, the first whitespace delimited string is taken as the value. The allowable fields are: { name "Short name for sequence" longname "Long (more descriptive) name for sequence" sequence-ID "Unique ID number" creation-date "mm/dd/yy hh:mm:ss" direction [-1|1] strandedness [1|2] type [DNA|RNA||PROTEIN|TEXT|MASK] offset (-999999,999999) group-ID (0,999) creator "Author's name" descrip "Verbose description" comments "Lines of comments that can be fairly arbitrary text about a sequence. Return characters are allowed, but no internal double quotes or brace characters. Remember to close with a double quote" sequence "gctagctagctagctagctcttagctgtagtcgtagctgatgc tagct gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattgc" } Any fields that are not specified are assumed to be the default values. Offsets can be negative as well as positive. Genbank entries written out in this format will have all (") converted to ('), and all ({}) converted to ([]) to avoid confusion in the parser. Leading and trailing gaps are removed prior to writing each sequence. This format is deliberately verbose in order to be simple to duplicate. Genbank format: GDE can read a concatenated list of Genbank entries, and extract certain fields from such files. The default method for storing nucleic acid, amino acid, masking sequences or text is in Genbank format. The following fields are recognized: LOCUS: Short name for this sequence (Maximum of 32 characters)  DEFINITION: Definition of sequence (Maximum of 80 characters) ORGANISM: Full name of organism (Maximum of 80 characters) AUTHORS: Authors of this sequence (Maximum of 80 characters) ACCESSION: ID Number for this sequence (Maximum of 80 characters) ORIGIN: Beginning of sequence data  // End of sequence data  All other lines are retained as comments. The LOCUS line also specifies what type of sequence follows. The form of this line is: LOCUS name size bp type date where name is the Genbank Locus name, size is total base count, type is one of DNA, RNA, PROTEIN, MASK, or TEXT and date is of the form dd-MON- yyyy. In this way, the standard Genbank format is extended to store all text, mask and protein data. The Genbank character set has also been extended in order to support these other data types. Valid characters are: DNA/RNA: Full IUPAC coding as well as '-' and '~' characters for alignment gaps Protein: All valid single letter codes plus '-' and '~'. Other ASCII characters may be inserted, however external functions may be confused by such characters. Mask: All legal printable ASCII characters. If used as a selection mask, all columns containing a '0' will be removed from any analysis. Text: All valid ASCII characters. Here is a valid Genbank entry for two E.coli tRNA's: LOCUS ECOTRNT4 76 bp RNA 28-JAN-1991 DEFINITION E. coli (T4 infected) vulnerable tRNA (A). ORGANISM Escherichia coli AUTHORS Amitsur,M., Levitz,R. and Kaufmann,G. FEATURES From To/Span Description tRNA 1 76 vulnerable tRNA(A) BASE COUNT ? ORIGIN 1 GGGUCGUUAG CUCAGUUGGU AGAGCAGUUG ACUUUUAAUC AAUUGGNCGC AGGUUCGAAU 61 CCUGCACGAC CCACCA // LOCUS ECOTRQ1 75 bp RNA 28-JAN-1991 DEFINITION E.coli Gln-tRNA-1. ORGANISM Escherichia coli AUTHORS Yaniv,M. and Folk,W.R. SOURCE -REFERENCE [1] JOURNAL J. Biol. Chem. 250, 3243-3253 (1975) FEATURES From To/Span Description tRNA 1 75 Gln-tRNA-1 (NAR: 0510) refnumbr 1 1 sequence not numbered in [1] BASE COUNT ? ORIGIN 1 UGGGGUAUCG CCAAGCGGUA AGGCACCGGU UUUUGAUACC GGCAUUCCCU GGUUCGAAUC 61 CAGGUACCCC AGCCA // Flat file format: This is a simplified format for importing sequence data, and passing it out to analysis functions. Very little information is actually retained in this format, and should be used carefully so as not to lose attribute information. It is defined as follow: type_character short_name sequence_data sequence_data sequence_data ... The type character is # for DNA/RNA, % for protein sequence, @ for mask sequence, and " for text. The short name is the same as the LOCUS line in Genbank. This is followed by lines of sequence, each ending with a return character.These lines are read until the next type character is encountered, or until the end of the file is reached. Care should be taken in using this format with text as space characters are stripped automatically. As of release 2.0, flat file format allows for an optional offset to be specified in parentheses after the sequence name. An offset represents how many leading gap characters should be placed before the start of a sequence. If this offset does not exist, then it is defined to be 0. Here is a sample flat file for two Ecoli tRNA's: #ECOTRNT4 GGGUCGUUAGCUCAGUUGGUAGAGCAGUUGACUUUUAAUCAAUUGGNCGCAG GUUCGAAU CCUGCACGACCCACCA #ECOTRQ1 UGGGGUAUCGCCAAGCGGUAAGGCACCGGUUUUUGAUACCGGCAUUCCCUGG UUCGAAUC CAGGUACCCCAGCCA Color mask: The format for a color mask has been kept simple to make implementation of color functions easy. The format optionally defines which sequence to color, whether or not to color alignment gaps in the existing sequence, and how long the following mask will be. It is then followed by a list of decimal color codes (range 0 to 15) for each position in the sequence. There are four keywords used in the color mask file. Those keywords are: name:short name If short name matches a currently loaded sequence, then impose this color mask on that sequence. If this line is omitted, then color all sequences this color, and the color mask is expected to start at the leftmost column on the screen. length:length The following list in length long nodash: Skip over dash characters when imposing this color mask on the named sequence. This allows an unaligned color mask to be placed over aligned sequence. start: Begin reading the color mask on the next line. Here is a sample color mask file: name:test_sequence length:10 nodash: start: 3 3 3 6 5 3 3 3 2 7 The colors in the default color lookup table are: 0 White 8 Black 1 Yellow 9 Grey 1 2 Violet 10 Grey 2 3 Red 11 Grey 3 4 Aqua 12 Grey 4 5 Lime Green 13 Grey 5 6 Blue 14 Grey 6 7 Purple 15 White .c.Appendix B, Adding Functions The GDE uses a menu description language to define what external programs it can call, and what parameters and data to pass to each function. This language allows users to customize their own environment to suite individual needs. The following is how the GDE handles external programs when selected from a menu: Each step in this process is described in a file .GDEmenus in the user's current or home directory. The language used in this file describes three phases to an external function call. The first phase describes the menu item as it will appear, and the Unix command line that is actually run when it is selected. The second phase describes how to prompt for the parameters needed by the function. The third phase describes what data needs to be passed as input to the external function, and what data (if any) needs to be read back from its output. The form of the language is a simple keyword/value list delimited by the colon (:) character. The language retains old values until new ones are set. For example, setting the menu name is done once for all items in that menu, and is only reset when the next menu is reached. The keywords for phase one are: menu:menu name Name of current menu item:item name Name of current menu item itemmeta:meta_key Meta key equivalence (quick keys) itemhelp:help_file Help file (either full path, or in GDE_HELP_DIR) itemmethod:Unix command The item method command is a bit more involved, it is the Unix command that will actually run the external program intended. It is one line long, and can be up to 256 characters in length. It can have embedded variable names (starting with a '$') that will be replaced with appropriate values later on. It can consist of multiple Unix commands separated by semi-colons (;), and may contain shell scripts and background processes as well as simple command names. Examples will be given later. The keywords for phase two are: arg:argument_variable_name Name of this variable. It will appear in the itemmethod: line with a dollar sign ($) in front of it. argtype:slider,chooser,choice_menu or text The type of graphic object representing this argument. arglabel:descriptive label A short description of what this argument represents argmin:minimum_value (integer) Used for sliders. argmax:maximum_value (integer) Used for sliders. argvalue:default_value (integer) It is the numeric value associated with sliders or the default choice in choosers, choice_menus, and choice_lists (the first choice is 0, the second is 1 etc.) argtext:default value Used for text fields. argchoice:displayed value:passed value Used for choosers and choice_menus. The first value is displayed on screen, and the second value is passed to the itemmethod line. The keywords for phase three are as follows: in:input_file GDE will replace this name with a randomly generated temporary file name. It will then write the selected data out to this file. informat:file_format Write data to this file for input to this function. Currently support values are Genbank, and flat. inmask: This data can be controlled by a selection mask. insave: Do not remove this file after running the external function. This is useful for functions put in the background. out:output_file GDE will replace this name with a randomly generated temporary file name. It is up to the external function to fill this file with any results that might be read back into the GDE. outformat:file_format The data in the output file will be in this format. Currently support values are colormask, Genbank, and flat. outsave: Do not remove this file after reading. This is useful for background tasks. outoverwrite: Overwrite existing sequences in the current GDE window. Currently supported with "gde" format only. Here is a sample dialog box, and it's entry in the .GDEmenus file: Using the default parameters given in the dialog box, the executed Unix command line would be: (tr '[a-z]' '[A-Z]' < .gde_001 >.gde_001.tmp ; mv .gde_001.tmp CAPS ; gde CAPS -Wx medium ; rm .gde_001 ) & where .gde_001 is the name of the temporary file generated by the GDE which contains the selected sequences in flat file format. Since the GDE runs this command in the background ('&' at the end) it is necessary to specify the insave: line, and to remove all temporary files manually. There is no output file specific because the data is not loaded back into the current GDE window, but rather a new GDE window is opened on the file. A simpler command that reloads the data after conversion might be: item:All caps itemmethod:tr '[a-z]' '[A-Z]' OUTPUT in:INPUT informat:flat out:OUTPUT outformat:flat In this example, no arguments are specified, and so no dialog box will appear. The command is not run in the background, so the GDE can clean up after itself automatically. The converted sequence is automatically loaded back into the current GDE window. In general, the easiest type of program to integrate into the GDE is a program completely driven from a Unix command line. Interactive programs can be tied in (MFOLD for example), however shell scripts must be used to drive the parameter entry for these programs. Programs of the form: program_name -a1 argument1 -a2 arguement2 -f inputfile -er errorfile > outputfile can be specified in the .GDEmenus file directly. As this is the general form of most one Unix commands, these tend to be simpler to implement under the GDE. As functions grow in complexity, they may begin to need a user interface of their own. In these cases, the command line calling arguments are still necessary in order to allow the GDE to hand them the appropriate data, and possible retrieve results after some external manipulation. .c.Appendix C, External functions ClustalV - Cluster multiple sequence alignment Author: Des Higgins. Reference: Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL V: improved software for multiple sequence alignment. ms. submitted to CABIOS Parameters: k-tuple pairwise search Word size for pairwise comparisons Window size Smaller values give faster alignments, larger values are more sensitive. Transitions weighted Can weight transitions twice as high as transversions (DNA only). Fixed gap penalty Gap insertion penalty, lower value, more gaps Floating gap penalty Gap extension penalty, lower value, longer gaps Comments: ClustalV is a directed multiple sequence alignment algorithm that aligns a set of sequences based on their level of similarity. It first uses a Lipman Peasron pairwise similarity scoring to find "clusters" of similar sequences, and pre- aligns those sequences. It then adds other sequences to the alignment in the order of their similarity so as to produce the cleanest alignment. Warning: ClustalV only uses unambiguous character codes. It will also convert all sequences to upper case in the process of aligning. Clustal does not pass back comments, author etc. Be sure to keep copies of your sequences if you do not wish to lose this information. MFOLD - RNA secondary prediction Author: Michael Zuker Reference: M. Zuker On Finding All Suboptimal Foldings of an RNA Molecule. Science, 244, 48-52, (1989) J. A. Jaeger, D. H. Turner and M. Zuker Improved Predictions of Secondary Structures for RNA. Proc. Natl. Acad. Sci. USA, BIOCHEMISTRY, 86, 7706-7710, (1989) J. A. Jaeger, D. H. Turner and M. Zuker Predicting Optimal and Suboptimal Secondary Structure for RNA. in "Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences", R. F. Doolittle ed. Methods in Enzymology, 183, 281-306 (1989) Parameters: Linear/circular RNA fold ct File to save results Comments: MFOLD passes it's output to a program Zuk_to_gen that translates the secondary structure prediction to a nested bracket ([]) notation. This notation can then be used in the Highlight Helix, and Draw Secondary structure (LoopTool) functions. MFOLD currently does not support much in the way of additional parameters. We hope to have all additional parameters available soon. Blast - Basic Local Alignment Search Tool Reference: Karlin, Samuel and Stephen F. Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes, Proc. Natl. Acad. Sci. USA 87:2264-2268. Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool, J. Mol. Biol. 215:403-410. Altschul, Stephen F. (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-565. Parameters: Which Database Which nucleic or amino acid database to search. Word Size Length of initial hit. after locating a match of this length, alignment extension is attempted. Blastn Match score Score for matches in secondary alignment extension Mismatch score Score for mismatches in secondary alignment extension Blastx, tblastn, blastp, blast3 Substitution Matrix PAM120 or PAM250 Comments: The report is loaded into a text editor. This should be saved as a new file as the default file is removed after execution. The latest version of blast can be obtained via anonymous ftp to ncbi.nlm.nih.gov. FastA - Similarity search Reference: W. R. Pearson and D. J. Lipman (1988), "Improved Tools for Biological Sequence Analysis", PNAS 85:2444-2448 W. R. Pearson (1990) "Rapid and Sensitive Sequence Comparison with FASTP and FASTA" Methods in Enzymology 183:63-98 Parameters: Database Which database to search Number of alignments to report SMATRIX Which similarity matrix to use Comments: The FastA package includes several additional programs for pairwise alignment. We have only included a bare bones link to FastA. We hope to include a more complete setup for the actual 2.2 release. Assemble Contigs - CAP Contig Assembly Program Author - Xiaoqiu Huang Department of Computer Science Michigan Technological University Houghton, MI 49931 E-mail: huang@cs.mtu.edu Minor modifications for I/O by S. Smith Reference - "A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps" (submitted to Genomics, 1991) Parameters: Minimum overlap Number of bases required for overlap Percent match within overlap Percentage match required in the overlap region before merge is alowwed. Comments: CAP returns the aligned sequences to the current editor window. The sequences are placed into contigs by setting the groupid. Cap does not change the order of the sequences, and so the results should be sorted by group and offset (see sort under the Edit menu). Lsadt - Least squares additive tree analysis Author: Geert De Soete, 'C' implementation by Mike Maciukenas University of Illinois Reference:LSADT, 1983 Psychometrika, 1984 Quality and Quantity Parameters: Distance correction to use in distance matrix calculations (see count below). What should be used for initial parameters estimates Random number seed Display method (See TreeTool below) Comments: The program has been rewritten in 'C' and will be included with the rRNA Database phylogenetic package being written at the University of Illinois Department of Microbiology. Count is a short program to calculate a distance matrix from a sequence alignment (see below). Count - Distance matrix calculator Author: Steven Smith Parameters: Correction method Currently Jukes-Cantor or none Include dashed columns Match upper case to lower Comments: Passes back a distance matrix in a format readable by LSADT. Treetool - Tree drawing/manipulation Author: Michael Maciukenas, University of Illinois Comments: See included documentation for TreeTool usage. Readseq - format conversion program Author: Don Gilbert Parameters: Many, but can easily be run in interactive mdoe. Comments: Readseq is a very useful program for format conversion. The latest versionsupports over a dozen different file formats, as well as formating capabilities for publication. GDE makes of Readseq for importing and exporting seqeuences as well as a filtering tool to some external functions. Lsadt - Least squares additive tree analysis Author: Geert De Soete, 'C' implementation by Mike Maciukenas University of Illinois Reference:LSADT, 1983 Psychometrika, 1984 Quality and Quantity Parameters: Distance correction to use in distance matrix calculations (see count below). What should be used for initial parameters estimates Random number seed Display method (See TreeTool below) Comments: The program has been rewritten in 'C' and will be included with the rRNA Database phylogenetic package being written at the University of Illinois Department of Microbiology. Count is a short program to calculate a distance matrix from a sequence alignment (see below). Count - Distance matrix calculator Author: Steven Smith Parameters: Correction method Currently Jukes-Cantor or none Include dashed columns Match upper case to lower Comments: Passes back a distance matrix in a format readable by LSADT. Copyright Notice The Genetic Data Environment (GDE) software and documentation are not in the public domain. Portions of this code are owned and copyrighted by the The Board of Trustees of the University of Illinois and by Steven Smith. External functions used by GDE are the proporty of, their respective authors. This release of the GDE program and documentation may not be sold, or incorporated into a commercial product, in whole or in part without the expressed written consent of the University of Illinois and of its author, Steven Smith. All interested parties may redistribute the GDE as long as all copies are accompanied by this documentation, and all copyright notices remain intact. Parties interested in redistribution must do so on a non-profit basis, charging only for cost of media. Modifications to the GDE core editor should be forwarded to the author Steven Smith. External programs used by the GDE are copyright by, and are the property of their respective authors unless otherwise stated. While all attempts have been made to insure the integrity of these programs: Disclaimer THE UNIVERSITY OF ILLINOIS, HARVARD UNIVERSITY AND THE AUTHOR, STEVEN SMITH GIVE NO WARRANTIES, EXPRESSED OR IMPLIED FOR THE SOFTWARE AND DOCUMENTATION PROVIDED, INCLUDING, BUT NOT LIMITED TO WARRANTY OF MERCHANTABILITY AND WARRANTY OF FITNESS FOR A PARTICULAR PURPOSE. User understands the software is a research tool for which no warranties as to capabilities or accuracy are made, and user accepts the software "as is." User assumes the entire risk as to the results and performance of the software and documentation. The above parties cannot be held liable for any direct, indirect, consequential or incidental damages with respect to any claim by user or any third party on account of, or arising from the use of software and associated materials. This disclaimer covers both the GDE core editor and all external programs used by the GDE.   Required field