UP      arb.hlp
UP      glossary.hlp
UP      save.hlp

SUB     arb_edit.hlp
SUB     ale.hlp

TITLE           GDE Interface and Editor

DESCRIPTION     Starts the GDE Editor designed by Steven Smith. See the next chapter of this text for the original help text.

                As GDE originally used its own built-in database, it had to be slightly modified to run under ARB. So

                **** READ THE WARNINGS/BUGS CAREFULLY ****

WARNINGS        As soon as you start GDE, it creates a copy of the selected sequences. That means that you may change the sequences with either GDE or ARB, but not both. Therefore, once you have started GDE, do nothing but sequence editing in GDE until you quit GDE.

                To really save sequences to disk, you have to send the sequence changes to ARB and then use ARB to save the ARB database.

BUGS            Many functions do not work correctly, especially

                        - deleting,
                        - moving,
                        - duplicating,
                        - creating,
                        - importing

                species.

********* Part of the Original GDE HELPTEXT ******************

SECTION Introduction

        The Genetic Data Environment is part of a growing set of programs for manipulating and analyzing "genetic" data. It differs in design from other analysis programs in that it is intended to be an expandable and customizable system, while still being easy to use.

        There are a tremendous number of publicly available programs for sequence analysis. Many of these programs have found their way into commercial packages which incorporate them into integrated, easy to use systems. The goal of the GDE is to minimize the amount of effort required to integrate sequence analysis functions into a common environment. The GDE takes care of the user interface issues, and allows the programmer to concentrate on the analysis itself. Existing programs can be tied into the GDE in a matter of hours (or minutes) as opposed to days or weeks. Programs may be written in any language, and still be incorporated seamlessly into the GDE.

        These programs are, and will continue to be, available at no charge.
        It is the hope that this system will grow in functionality as more and more people see the benefits of a modular analysis environment. Users are encouraged to make modifications to the system, and forward all changes and additions to Steven Smith at smith@bioimage.millipore.com.

SECTION What's New for this Release

        GDE 2.2 represents a maintenance release. Several small bugs have been fixed, and new editing features and user interface elements have been added. Also, I have tried to update all of the contributed external programs to their latest release. Updated programs include:

                - Phylip
                - Treetool
                - LoopTool
                - Readseq
                - Blast
                - Fasta

        Improved versions of printing and translate are included as well.

        As for new editing features, a useful "yanking" feature has been added by Scott Ferguson from Exxon Research, along with the capability to export the colormap for a sequence (see appendices A/C).

        Among the bugs fixed in this release are:

                - Selection mask problems when exporting to Genbank (fixed in 2.1)
                - Memory leaks (fixed in 2.1)
                - Correct handling of circular sequences
                - More liberal interpretation of Genbank formatted files (not column dependent)

SECTION System Requirements

        GDE 2.2 currently runs on the Sun family of workstations. This includes the Sun3 and Sun4 (Sparcstation) systems. It was written in XView, and runs on Suns using OpenWindows 3.0 or MIT's X Windows. It runs in both monochrome and color, and can be run remotely on any system capable of running X Windows Release 4. You should have at least 15 megabytes of free disk space available.

        The binary release for SparcStations was compiled under SunOS 4.1.2 and OpenWindows 3.0.

        We are also supporting a DECStation version of GDE. This is running under XView 3.0/X11R5. We encourage interested people to port the programs to their favorite Unix platform. There are informal ports to the SGI line of Unix machines.

SECTION Note to Motif users

        GDE 2.2 can be run using different window managers.
        The most common alternative to olwm is the Motif window manager (mwm). The only problem in using another window manager is that the status line is not displayed. We have added a "Message panel" as an option under "File->Properties" which displays all of the information contained on the status line.

        People using other window managers may also prefer xterm and xedit as the default terminal and file editor. This can be accomplished by replacing all occurrences of 'shelltool' and 'textedit' with 'xterm -e' and 'xedit' in the $GDE_HELP_DIR/.GDEmenus file.

        FastA and Blast need to have the properly formatted databases installed in the $GDE_HELP_DIR under the directories FASTA/PIR, FASTA/GENBANK, BLAST/pir, and BLAST/genbank. For FASTA, simply copy a version of PIR and Genbank into the proper directory. Alternatively, the PIR and GENBANK files can be symbolic links to copies of Genbank held elsewhere on your system. You may need to look at the .GDEmenus file in $GDE_HELP_DIR to verify that you are using the same divisions for these databases.

        Blast installation involves converting PIR and GENBANK to a temporary FASTA format (using pir2fasta and gb2fasta) and then using pressdb for nucleic acid, and setdb for amino acid, to reformat the databases again into blast format. The .GDEmenus file is currently set up to search with blast using the following databases: pir, genpept, genupdate, and genbank. If you wish to divide these into subdivisions, then the .GDEmenus file will have to be edited.

        The most up to date release of blast can be obtained via anonymous ftp to ncbi.nlm.nih.gov. The most recent release of FASTA can be obtained via anonymous ftp to uvaarpa.virginia.edu. It is strongly recommended that you retrieve these copies, and become familiar with their setup.

SECTION Using the GDE

        It is assumed that the user is familiar with the Unix and OpenWindows/X Windows environments.
        It is also assumed that people running standard MIT X Windows will be using the OpenLook window manager (olwm). Other window managers work with varied success. If you are not certain as to how your system is set up, please contact your systems administrator.

        The GDE uses a menu description language to define what external programs it can call, and what parameters and data to pass to each function. This language allows users to customize their own environment to suit individual needs.

        The following is how the GDE handles external programs when selected from a menu:

        Each step in this process is described in a file .GDEmenus in the user's current or home directory. The language used in this file describes three phases to an external function call. The first phase describes the menu item as it will appear, and the Unix command line that is actually run when it is selected. The second phase describes how to prompt for the parameters needed by the function. The third phase describes what data needs to be passed as input to the external function, and what data (if any) needs to be read back from its output.

        The form of the language is a simple keyword/value list delimited by the colon (:) character. The language retains old values until new ones are set. For example, setting the menu name is done once for all items in that menu, and is only reset when the next menu is reached.

        The keywords for phase one are:

                menu:menu name
                        Name of current menu

                item:item name
                        Name of current menu item

                itemmeta:meta_key
                        Meta key equivalence (quick keys)

                itemhelp:help_file
                        Help file (either full path, or in GDE_HELP_DIR)

                itemmethod:Unix command

        The itemmethod command is a bit more involved. It is the Unix command that actually runs the intended external program. It is one line long, and can be up to 256 characters in length. It can have embedded variable names (starting with a '$') that will be replaced with appropriate values later on.
        It can consist of multiple Unix commands separated by semicolons (;), and may contain shell scripts and background processes as well as simple command names. Examples will be given later.

        The keywords for phase two are:

                arg:argument_variable_name
                        Name of this variable. It will appear in the itemmethod: line with a dollar sign ($) in front of it.

                argtype:slider,chooser,choice_menu or text
                        The type of graphic object representing this argument.

                arglabel:descriptive label
                        A short description of what this argument represents.

                argmin:minimum_value (integer)
                        Used for sliders.

                argmax:maximum_value (integer)
                        Used for sliders.

                argvalue:default_value (integer)
                        The numeric value associated with sliders, or the default choice in choosers, choice_menus, and choice_lists (the first choice is 0, the second is 1, etc.)

                argtext:default value
                        Used for text fields.

                argchoice:displayed value:passed value
                        Used for choosers and choice_menus. The first value is displayed on screen, and the second value is passed to the itemmethod line.

        The keywords for phase three are as follows:

                in:input_file
                        GDE will replace this name with a randomly generated temporary file name. It will then write the selected data out to this file.

                informat:file_format
                        Write data to this file for input to this function. Currently supported values are Genbank and flat.

                inmask:
                        This data can be controlled by a selection mask.

                insave:
                        Do not remove this file after running the external function. This is useful for functions put in the background.

                out:output_file
                        GDE will replace this name with a randomly generated temporary file name. It is up to the external function to fill this file with any results that might be read back into the GDE.

                outformat:file_format
                        The data in the output file will be in this format. Currently supported values are colormask, Genbank, and flat.

                outsave:
                        Do not remove this file after reading. This is useful for background tasks.
                outoverwrite:
                        Overwrite existing sequences in the current GDE window. Currently supported with "gde" format only.

        Here is a sample dialog box, and its entry in the .GDEmenus file:

        Using the default parameters given in the dialog box, the executed Unix command line would be:

                (tr '[a-z]' '[A-Z]' < .gde_001 >.gde_001.tmp ; mv .gde_001.tmp CAPS ; gde CAPS -Wx medium ; rm .gde_001 ) &

        where .gde_001 is the name of the temporary file generated by the GDE which contains the selected sequences in flat file format. Since the GDE runs this command in the background ('&' at the end) it is necessary to specify the insave: line, and to remove all temporary files manually. There is no output file specified because the data is not loaded back into the current GDE window, but rather a new GDE window is opened on the file.

        A simpler command that reloads the data after conversion might be:

                item: All caps
                itemmethod: tr '[a-z]' '[A-Z]' < INPUT > OUTPUT
                in: INPUT
                informat: flat
                out: OUTPUT
                outformat: flat

        In this example, no arguments are specified, and so no dialog box will appear. The command is not run in the background, so the GDE can clean up after itself automatically. The converted sequence is automatically loaded back into the current GDE window.

        In general, the easiest type of program to integrate into the GDE is a program completely driven from a Unix command line. Interactive programs can be tied in (MFOLD for example); however, shell scripts must be used to drive the parameter entry for these programs. Programs of the form:

                program_name -a1 argument1 -a2 argument2 -f inputfile -er errorfile > outputfile

        can be specified in the .GDEmenus file directly. As this is the general form of most Unix commands, these tend to be simpler to implement under the GDE.

        As functions grow in complexity, they may begin to need a user interface of their own.
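
        The three phases of the menu description language can be illustrated with a small Python sketch. This is not GDE's own parser. It is a minimal illustration written under several assumptions: the menu entries named "Example", "Head" and "Keep ends" are hypothetical, the function names parse_menus, default_value and build_command are invented for this sketch, and the temporary file names are passed in by the caller rather than generated randomly. It shows how keyword/value lines could be collected per item (with the menu name retained until it is reset), and how the '$' variables and the in:/out: placeholder names could be substituted into the itemmethod line using the default values.

```python
import re

# Hypothetical menu description, following the keyword:value form described above.
SAMPLE = """\
menu: Example
item: Head
itemmethod: head -n $COUNT INPUT > OUTPUT
arg: COUNT
argtype: slider
arglabel: Lines to keep
argmin: 1
argmax: 100
argvalue: 10
in: INPUT
informat: flat
out: OUTPUT
outformat: flat
item: Keep ends
itemmethod: $WHICH -n 5 INPUT > OUTPUT
arg: WHICH
argtype: chooser
arglabel: Which end
argchoice: First lines:head
argchoice: Last lines:tail
argvalue: 1
in: INPUT
informat: flat
out: OUTPUT
outformat: flat
"""


def parse_menus(text):
    """Collect keyword:value lines into one dictionary per menu item."""
    items = []
    menu = None   # retained until the next menu: line resets it
    item = None   # item currently being filled in
    arg = None    # argument currently being filled in
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "menu":
            menu = value
        elif key == "item":
            item = {"menu": menu, "item": value, "args": [], "in": None, "out": None}
            items.append(item)
            arg = None
        elif item is None:
            continue
        elif key == "arg":
            arg = {"arg": value, "choices": []}
            item["args"].append(arg)
        elif key == "argchoice" and arg is not None:
            shown, _, passed = value.partition(":")
            arg["choices"].append((shown.strip(), passed.strip()))
        elif key.startswith("arg") and arg is not None:
            arg[key] = value
        else:
            item[key] = value
    return items


def default_value(arg):
    """Return the string passed to itemmethod when the user accepts the defaults."""
    kind = arg.get("argtype", "text")
    if kind == "slider":
        return arg.get("argvalue", "0")
    if kind == "text":
        return arg.get("argtext", "")
    # chooser / choice_menu: argvalue is a zero-based index into the choices
    index = int(arg.get("argvalue", "0"))
    return arg["choices"][index][1]


def build_command(item, in_name, out_name):
    """Substitute $variables and the in:/out: placeholder names into itemmethod."""
    command = item["itemmethod"]
    # Longest variable names first, so that $AB is not clobbered by $A.
    for arg in sorted(item["args"], key=lambda a: len(a["arg"]), reverse=True):
        command = command.replace("$" + arg["arg"], default_value(arg))
    for placeholder, name in ((item.get("in"), in_name), (item.get("out"), out_name)):
        if placeholder:
            pattern = r"\b" + re.escape(placeholder) + r"\b"
            command = re.sub(pattern, lambda match, name=name: name, command)
    return command


if __name__ == "__main__":
    for entry in parse_menus(SAMPLE):
        print(entry["menu"], "/", entry["item"], ":",
              build_command(entry, ".gde_001", ".gde_002"))
```

        Both hypothetical items end up under the same menu because the menu: value is kept until a new one is set. The file names .gde_001 and .gde_002 stand in for the temporary names the GDE would generate itself.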
        When a function has a user interface of its own, the command line calling arguments are still necessary in order to allow the GDE to hand it the appropriate data, and possibly retrieve results after some external manipulation.

SECTION Appendix C, External functions

        ClustalV - Cluster multiple sequence alignment

        Author: Des Higgins.

        Reference:
                Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL V: improved software for multiple sequence alignment. ms. submitted to CABIOS

        Parameters:

                k-tuple pairwise search
                        Word size for pairwise comparisons

                Window size
                        Smaller values give faster alignments, larger values are more sensitive.

                Transitions weighted
                        Can weight transitions twice as high as transversions (DNA only).

                Fixed gap penalty
                        Gap insertion penalty; lower value, more gaps

                Floating gap penalty
                        Gap extension penalty; lower value, longer gaps

        Comments:
                ClustalV is a directed multiple sequence alignment algorithm that aligns a set of sequences based on their level of similarity. It first uses a Lipman-Pearson pairwise similarity scoring to find "clusters" of similar sequences, and pre-aligns those sequences. It then adds other sequences to the alignment in the order of their similarity so as to produce the cleanest alignment.

        Warning:
                ClustalV only uses unambiguous character codes. It will also convert all sequences to upper case in the process of aligning. Clustal does not pass back comments, author etc. Be sure to keep copies of your sequences if you do not wish to lose this information.

        MFOLD - RNA secondary structure prediction

        Author: Michael Zuker

        Reference:
                M. Zuker On Finding All Suboptimal Foldings of an RNA Molecule. Science, 244, 48-52, (1989)

                J. A. Jaeger, D. H. Turner and M. Zuker Improved Predictions of Secondary Structures for RNA. Proc. Natl. Acad. Sci. USA, BIOCHEMISTRY, 86, 7706-7710, (1989)

                J. A. Jaeger, D. H. Turner and M. Zuker Predicting Optimal and Suboptimal Secondary Structure for RNA. in "Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences", R. F.
                Doolittle ed. Methods in Enzymology, 183, 281-306 (1989)

        Parameters:

                Linear/circular RNA fold
                ct File to save results

        Comments:
                MFOLD passes its output to a program Zuk_to_gen that translates the secondary structure prediction to a nested bracket ([]) notation. This notation can then be used in the Highlight Helix and Draw Secondary structure (LoopTool) functions.

                MFOLD currently does not support much in the way of additional parameters. We hope to have all additional parameters available soon.

        Blast - Basic Local Alignment Search Tool

        Reference:
                Karlin, Samuel and Stephen F. Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes, Proc. Natl. Acad. Sci. USA 87:2264-2268.

                Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool, J. Mol. Biol. 215:403-410.

                Altschul, Stephen F. (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-565.

        Parameters:

                Which Database
                        Which nucleic or amino acid database to search.

                Word Size
                        Length of initial hit. After locating a match of this length, alignment extension is attempted.

                Blastn:

                        Match score
                                Score for matches in secondary alignment extension

                        Mismatch score
                                Score for mismatches in secondary alignment extension

                Blastx, tblastn, blastp, blast3:

                        Substitution Matrix
                                PAM120 or PAM250

        Comments:
                The report is loaded into a text editor. It should be saved as a new file, as the default file is removed after execution. The latest version of blast can be obtained via anonymous ftp to ncbi.nlm.nih.gov.

        FastA - Similarity search

        Reference:
                W. R. Pearson and D. J. Lipman (1988), "Improved Tools for Biological Sequence Analysis", PNAS 85:2444-2448

                W. R.
                Pearson (1990) "Rapid and Sensitive Sequence Comparison with FASTP and FASTA" Methods in Enzymology 183:63-98

        Parameters:

                Database
                        Which database to search

                Number of alignments to report

                SMATRIX
                        Which similarity matrix to use

        Comments:
                The FastA package includes several additional programs for pairwise alignment. We have only included a bare bones link to FastA. We hope to include a more complete setup for the actual 2.2 release.

        Assemble Contigs - CAP Contig Assembly Program

        Author:
                Xiaoqiu Huang
                Department of Computer Science
                Michigan Technological University
                Houghton, MI 49931
                E-mail: huang@cs.mtu.edu

                Minor modifications for I/O by S. Smith

        Reference:
                "A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps" (submitted to Genomics, 1991)

        Parameters:

                Minimum overlap
                        Number of bases required for overlap

                Percent match within overlap
                        Percentage match required in the overlap region before merge is allowed.

        Comments:
                CAP returns the aligned sequences to the current editor window. The sequences are placed into contigs by setting the groupid. CAP does not change the order of the sequences, and so the results should be sorted by group and offset (see sort under the Edit menu).

        Lsadt - Least squares additive tree analysis

        Author: Geert De Soete, 'C' implementation by Mike Maciukenas, University of Illinois

        Reference:
                LSADT, 1983 Psychometrika, 1984, Quality and Quantity

        Parameters:

                Distance correction to use in distance matrix calculations (see count below).
                What should be used for initial parameter estimates.
                Random number seed.
                Display method (see TreeTool below).

        Comments:
                The program has been rewritten in 'C' and will be included with the rRNA Database phylogenetic package being written at the University of Illinois Department of Microbiology. Count is a short program to calculate a distance matrix from a sequence alignment (see below).
        Count - Distance matrix calculator

        Author: Steven Smith

        Parameters:

                Correction method
                        Currently Jukes-Cantor or none

                Include dashed columns

                Match upper case to lower

        Comments:
                Passes back a distance matrix in a format readable by LSADT.

        Treetool - Tree drawing/manipulation

        Author: Michael Maciukenas, University of Illinois

        Comments:
                See included documentation for TreeTool usage.

        Readseq - format conversion program

        Author: Don Gilbert

        Parameters:
                Many, but can easily be run in interactive mode.

        Comments:
                Readseq is a very useful program for format conversion. The latest version supports over a dozen different file formats, as well as formatting capabilities for publication. GDE makes use of Readseq for importing and exporting sequences, as well as a filtering tool for some external functions.

SECTION Copyright Notice

        The Genetic Data Environment (GDE) software and documentation are not in the public domain. Portions of this code are owned and copyrighted by the Board of Trustees of the University of Illinois and by Steven Smith. External functions used by GDE are the property of their respective authors. This release of the GDE program and documentation may not be sold, or incorporated into a commercial product, in whole or in part without the expressed written consent of the University of Illinois and of its author, Steven Smith.

        All interested parties may redistribute the GDE as long as all copies are accompanied by this documentation, and all copyright notices remain intact. Parties interested in redistribution must do so on a non-profit basis, charging only for cost of media. Modifications to the GDE core editor should be forwarded to the author Steven Smith. External programs used by the GDE are copyrighted by, and are the property of, their respective authors unless otherwise stated.
        While all attempts have been made to ensure the integrity of these programs:

SECTION Disclaimer

        THE UNIVERSITY OF ILLINOIS, HARVARD UNIVERSITY AND THE AUTHOR, STEVEN SMITH GIVE NO WARRANTIES, EXPRESSED OR IMPLIED FOR THE SOFTWARE AND DOCUMENTATION PROVIDED, INCLUDING, BUT NOT LIMITED TO WARRANTY OF MERCHANTABILITY AND WARRANTY OF FITNESS FOR A PARTICULAR PURPOSE. User understands the software is a research tool for which no warranties as to capabilities or accuracy are made, and user accepts the software "as is." User assumes the entire risk as to the results and performance of the software and documentation. The above parties cannot be held liable for any direct, indirect, consequential or incidental damages with respect to any claim by user or any third party on account of, or arising from the use of software and associated materials. This disclaimer covers both the GDE core editor and all external programs used by the GDE.