HOW TO USE THIS MANUAL

The SEQSEE manual (Version 1.2) you are now reading is composed of 15 parts. The first seven sections are intended to serve as a general introduction and describe such aspects as system requirements, installation procedures and menu operations. Sections VIII and IX offer more detailed descriptions of how to run SEQSEE, with section IX offering a fully documented tutorial to assist first-time users in developing an understanding of SEQSEE. Section X of this manual offers explicit examples of input and output for each of the SEQSEE menu options. Sections XI, XII and XIII describe SEQSEE file structures, libraries, databases, help facilities and general UNIX commands. Section XV offers a "Q & A" tutorial, suggestions for troubleshooting and other potentially helpful pieces of advice. A list of recommended readings and references is also included at the end of this manual along with two appendices. Appendix 1 is a copy of the SEQSEE control file (seqsee.parms) while Appendix 2 provides a detailed explanation of the "STATS" output.

We DO NOT recommend that users read through this manual cover-to-cover. Instead, we suggest that all first-time users make an effort to read through the first 30 or 40 pages of this document, including the tutorial (although sections III and IV may be ignored). Those who wish to learn more about the program or who are having difficulty in understanding the I/O operations for any one of the SEQSEE functions are invited to browse through sections X, XI and XII. If a user wishes to modify any one of the databases or to change the default parameters in the SEQSEE control file, we suggest that the user carefully read section XIII and/or Appendix 1. If a user wishes to learn more about the actual operation of any one of the algorithms, this may be done by following up on the recommended readings or by directing inquiries to bionmr@biochem.ualberta.ca.

I.
INTRODUCTION

The past few years have seen an explosion in the use of computers in molecular biology. This is in no small part due to the vast quantity of raw protein and nucleic acid sequence data that has been generated in the last decade. For example, since 1980 the number of sequences in the PIR databank alone has increased from fewer than 2000 to almost 50,000 separate entries today. Without the aid of a computer it would simply be impossible to attempt to analyze or categorize this huge reservoir of biological information or to manage the rapid influx of new sequence data which is being generated every single day. Fortunately, through the establishment of publicly funded databanks such as GENBANK, SWISS-PROT, NBRF-PIR and EMBL, most of us have been spared the headache of keeping up with this information explosion. Now, much of this sequence data is readily available, in computer readable format, to any scientist who wishes to subscribe to it.

While the development of these centralized databases has certainly helped in the rapid dissemination of sequence information, it has not necessarily solved the problem of its rapid "assimilation". As a consequence, a great deal of privately funded effort has been directed towards the development of software (and hardware) which would permit molecular biologists to quickly compare, analyze and otherwise dissect sequence data in a useful or informative manner. Programs or programming suites such as Intelligenetics' IG suite, Wisconsin's GCG suite, IBI's MacVector, and others are now widely available for this purpose and are typically designed to run on computers of all sizes and shapes including IBM PCs, Macintoshes, VAXes and SUN workstations. Many of these larger and more costly programs permit flexible sequence manipulations such as database searching, aligning, comparing and matching -- all at the touch of a button.

It is as a result of the widespread implementation of these software packages (in conjunction with their accompanying databases) that a number of extremely important and very useful "discoveries" have been made. These include, just to mention a few, the identification of a number of new and important receptor families, the identification of numerous repetitive or recurrent folding "modules", the identification of various oncogenic products and the establishment of evolutionary relatedness between hundreds of previously unidentified or poorly understood protein products (see Doolittle, 1990 and references therein).

The success that molecular biologists have had at performing simple, yet important "computer experiments" has induced others, including X-ray crystallographers, NMR spectroscopists, protein engineers and evolutionary biologists to begin using or adapting these same software packages to help answer specific questions of their own. In particular, it is now becoming increasingly common to see some of the more advanced sequence analysis programs (involving multiple sequence alignments and advanced structure prediction algorithms) being used to predict the tertiary structure of previously uncharacterized proteins (Schulz, 1988). Likewise, the push to develop more efficient methods for identifying the potential function or active site location of newly isolated proteins is leading to the development of methods which are, in effect, redefining the meaning of "homology" or "sequence similarity" (Gribskov et al., 1987). In addition, new databases containing secondary structural information, phi and psi angles, torsion angle restraints, NMR chemical shifts, sequence motifs and the like, are continually being added to the current software arsenal to permit even more diverse inquiries and analyses.

Quite clearly the development of sequence analysis packages is entering into a phase of very rapid expansion with many new and unforeseen applications being proposed for an increasingly diverse and substantially larger investigative population. Of course with this rapid expansion come the usual problems of limited availability, restricted usability and increased costs of most sequence analysis software products. As a result, a rather problematic "software stratification" is developing in the field with the most powerful (and most useful) packages becoming more and more expensive while the freely available, unintegrated "shareware" products are becoming less and less useful.

In response both to this program diversification and this software stratification we have endeavored to develop a publicly available software package (called SEQSEE - SEQuence SEEker) which offers the program diversity of the expensive packages at the "cost" of the freely available shareware products. Specifically, SEQSEE is a multi-purpose menu-driven suite of programs designed to provide a fully integrated, state-of-the-art package for the analysis and display of protein sequences and protein databases. It has been designed with considerable flexibility in mind so as to permit the addition of new features and new algorithms when they are developed or as they are reported in the literature. It contains many of the features available in some of the most comprehensive commercially available programs such as rapid database searching, flexible pattern matching and multiple sequence alignment. It also contains a large number of structural analysis and prediction programs which have been enhanced through the incorporation of several unique databases. In this regard, SEQSEE has been developed expressly from the point of view of those protein chemists who are interested in questions pertaining to both structure and function.
As a result, we believe SEQSEE offers a number of important enhancements and many unique advantages over what is typically found in other commercially available software packages.

SEQSEE has already been used in the analyses of fibroma/myxoma viral products (Upton et al., 1992, 1993), cystic fibrosis gene products, fish anti-freeze proteins (Sonnichsen et al., 1993) and a variety of growth factors. In many respects SEQSEE has performed beyond our expectations and it is as a consequence of its consistent (and sometimes unexpected) success that we believe it should be made freely available to all members of the scientific community. We hope you will find SEQSEE as useful in your work as we have found it in ours.

II. BEFORE YOU BEGIN

1) Please make sure you have read the USER NOTICE.

2) Try to read the first 30 pages of the manual (including the tutorial) to gain some familiarity with the package and its overall operation and design.

3) If you have just received SEQSEE (either by tape or anonymous FTP) and want to install it on your computer system, please proceed to Sections III and IV to learn more about the system requirements and installation procedures. On the other hand, if SEQSEE has already been installed on your system, ask your system administrator where the "seqsee" directory is located. On a SUN workstation this will typically be "/home/local/seqsee". You must have full access to this directory in order to run the program.

4) If SEQSEE is to be operated in a UNIX environment, try to learn a little about the "vi" editor. Knowing how to scroll through a file, how to quit and how to make rudimentary editing changes with this editor will help out a lot. A short summary of "vi" commands with some brief explanations can be found in Section XIII.

5) It is recommended that SEQSEE be run in a window which is at least 75 to 80 columns wide.
Narrower windows may lead to some character strings vanishing off the right edge of the screen or being wrapped around to produce difficult-to-read output.

6) For its complete and proper operation, SEQSEE requires a control file. The SEQSEE control file contains all of the program's default parameters and file pathways. When SEQSEE is first started, it will typically check to see if the file "seqsee.parms" (the control file) exists in the current directory. If it is not there, SEQSEE then checks a pre-designated location for a file with that name. Note that if you wish to make changes to the default parameters, you must have a copy of "seqsee.parms" in your current directory. The control file may be edited either before running SEQSEE or while you are in SEQSEE using the "File Viewer" menu option. A complete explanation of the SEQSEE control file is provided in Appendix 1, located at the end of this manual.

7) To kill any current or unwanted operation in SEQSEE, simply type "^c". This will immediately halt the job and return you to the main menu. (The notation "^" or "CNTL" is often used to designate the control key on the computer keyboard).

III. PROGRAM COMPATIBILITY & COMPUTER REQUIREMENTS

The current version of SEQSEE (version 1.2) is configured to run on most kinds of UNIX workstations. SEQSEE is written in the C programming language (consistent with both ANSI and standard C) and is compatible with the UNIX BSD operating system implemented on many current SUN 3, SUN 4 and SUN Sparcstation computers, the IRIX operating system available on most Silicon Graphics workstations and the MACH operating system found on all NeXT workstations. Portability to other computers (IBM, VAX, Macintosh) is possible although this has not yet been implemented. Considerable effort has been made to make the programs and I/O operations as machine-independent as possible.
Consequently SEQSEE does not offer any machine-specific graphics capability or machine-dependent windowing capacity. These enhancements (using X windows) may appear in later versions of the program.

Taken together, the various programs and subroutines in SEQSEE amount to over 10,000 lines of source code. If the accompanying databases, libraries, manuals, installation routines and compiled versions of the main programs are included, the whole SEQSEE suite occupies some 7 megabytes of disk space. The NBRF Protein Information Resource (PIR version 34.0) and the SWISS-PROT protein sequence database (version 23.0) take up an additional 115 megabytes of disk space.

Currently the only major sequence databanks compatible with SEQSEE are the NBRF-PIR database (both PIR and Intelligenetics format) and the SWISS-PROT database (both SWISS-PROT and Intelligenetics format). While we acknowledge that there are some differences between the SWISS-PROT and the PIR databanks, this discrepancy should not be of any concern to most users. Future versions of the program are expected to be compatible with all four major databases (PIR, SWISS-PROT, EMBL and GENBANK) in a number of different formats (Standard, IG, GCG etc.).

Computers running the full suite of SEQSEE programs require at least 4 megabytes of RAM (although 8-16 MB is recommended) and at least 150 megabytes of additional hard disk space to accommodate both the programs and the relevant databases. It is also important to note that SEQSEE requires at least 16 MB of "swap-space" when running on most UNIX-based machines.

Additional copies or updates of SEQSEE and its accompanying databases may be obtained through our website at:

    http://www.bionmr.ualberta.ca/bds/software

IV. INSTALLATION

If you have received SEQSEE from our anonymous FTP site it is in a compressed format and therefore must be "uncompressed" and "untarred" before it can be compiled and installed.
Versions of SEQSEE received on magnetic tape are in a regular format and need not be unpacked. Magnetic tapes may be read using one of the following commands:

    1) SGI computers:   tar -xvf /dev/tape
    2) SUN computers:   tar -xvf /dev/rst0

"Tarring" the tape will read all of the files contained on the tape directly into a directory called "seqsee" within your current directory. Reading the entire tape will typically take about 5-10 minutes.

With your copy of SEQSEE you will find a total of more than 30 files and directories containing all of the required routines, databases and libraries needed to run the SEQSEE suite. A complete listing (using the UNIX "ls" command) of these files should look something like this:

    COPYRIGHT     alexis/       init.c        sb_align/
    Makefile      browse/       install/      seqed/
    README        calc.c        lib/          seqhelp/
    VERSION       databases/    libc/         seqret/
    a_cfas/       docs/         main.c        seqsearch/
    a_gor/        dotplot/      moment/       seqsee.h
    a_homol/      fast_align/   mult_align/   seqsee.parms
    a_membrane/   fleqsee/      nw_align/     sequences/
    a_moment/     hsearch/      psearch/      stats/
    a_motif/      hydro/        refscan/

In the third column of this list, you will notice the directory called "install/". This particular directory contains an installation script (or macro) written as a csh program as well as several other programs of note. The installation script is:

    install

Before you install SEQSEE throughout your whole system you should install it only for yourself. The installation script allows you to do either. If you install it only for yourself, it will allow you to experiment with the program and to investigate how well it works in your own computer environment. We INSIST that you do this before deciding if SEQSEE should be placed on your full system. SEQSEE should only be installed system wide when a decision has been made to make SEQSEE available to all system users.

The other 4 files in the "install" directory include:

    1) README:       A copy of the instructions you are now reading
    2) seqdb.fnames: Contains pathnames for sequence database
    3) refdb.fnames: Contains pathnames for references database
    4) xparms:       Csh program - Called by install to build "seqsee.parms"

**********************

Before beginning the installation process it is important to note that SEQSEE can use four types of sequence databases. These are described below:

1) SWISS-PROT: This is publicly available from a number of anonymous FTP sites (for example: ncbi.nlm.nih.gov).

2) SWISS-PROT_IG Intelligenetics format: This is available through the Intelligenetics Corporation only. It contains the same information as the standard SWISS-PROT database, but in a different format.

3) NBRF-PIR: This is publicly available from a number of anonymous FTP sites (for example: ftp.bchs.uh.edu).

4) PIR_IG Intelligenetics format: This is available through the Intelligenetics Corporation only. It contains the same information as the standard PIR database, but in a different format.

If you wish to run SEQSEE you must get one of the above databases either through an anonymous FTP site (addresses given above) or through the appropriate database vendor (Intelligenetics or NBRF). Typically, the standard NBRF-PIR and SWISS-PROT databases have both the sequences and the references located together in the same file(s). The Intelligenetics versions of the PIR and SWISS-PROT databases actually have the sequence data located in a separate file (or set of files) which is distinct from the reference and bibliographic data. These format differences affect the way that you install SEQSEE, so please take note. Remember that during the installation process (see below), you must set up SEQSEE with one of the above databases -- and only one of the above databases.

***********************

You need to be in the install directory (this one) to run the installation script.
The installation script does the following things:

    1) Asks where to put the library files that SEQSEE uses then copies all of the library files to that place.
    2) Asks where to put the executable files
    3) Asks if you have set up the 'seqdb.fnames' and 'refdb.fnames' files.
    4) Sets up the seqsee.parms file (default parameters file)
    5) Sets up the io.lib.c file, which is needed to compile SEQSEE
    6) Sets the default editor
    7) Sets the compiler and compiler options
    8) Compiles the SEQSEE programs
    9) Moves the SEQSEE programs to where you said the executables should go

Please look through the README file in the "install" directory for additional information on SEQSEE installation.

***************************************************************************

SEQSEE INSTALLATION FOR A SINGLE USER:

You do NOT need to be root to do this. This example assumes that a single user is installing SEQSEE from within their own account. For illustration purposes let's say the user's name is "fido" and their home directory is "/somemachine/home/fido".

(1) At the prompt:

    >> Give the full path name of where the SEQSEE Library files should
       be installed. These include the help files, the tables,
       the documentation (manual), and the enclosed database files.
       Press for the default (/usr/local/lib/seqsee).

Give the name of a directory in the current user's account. For example, /somemachine/home/fido/lib/seqsee

(2) At the prompt:

    >> Where should SEQSEE executables exist on the system?
       Enter the directory or press for the default (/usr/local/bin).

Give the name of the current user's bin directory. For example, /somemachine/home/fido/bin

(3) The other questions are self-explanatory.

When the installation program has completed, the user will be able to run the program by just typing:

    seqsee

from within their bin directory (or if the PATH is set up correctly, the user can type "seqsee" from any directory).

***************************************************************************

SEQSEE INSTALLATION SYSTEM WIDE:

You MUST be root to do this. The installation is identical to the above, but instead of giving paths into the user's account, you will be installing SEQSEE where it is accessible to everyone.

(1) At the prompt:

    >> Give the full path name of where the SEQSEE Library files should
       be installed. These include the help files, the tables,
       the documentation (manual), and the enclosed database files.
       Press for the default (/usr/local/lib/seqsee).

Give the name of where the library files should be located. This must be accessible to everyone. The default should work.

(2) At the prompt:

    >> Where should SEQSEE executables exist on the system?
       Enter the directory or press for the default (/usr/local/bin).

Give the name of where the executables should reside so that everyone can run them. This should be a directory which is in everyone's PATH.

(3) The other questions are self-explanatory.

When the installation program has completed, any user will be able to run the program by just typing:

    seqsee

from any location.

***************************************************************************

SEQSEE is organized such that each module is an independent entity, distinct from the "main driver". The purpose of the main driver is, simply, to call the appropriate program and to display or save the results. Each module in SEQSEE is written in standard C code (although we cannot guarantee that differences will not exist between some compilers). SEQSEE should be easily portable to almost any UNIX machine. If you are porting SEQSEE to a different (i.e. non-UNIX) system, you will have to make changes to the driver and the "unix.lib.c" file in the "libc" directory. This should not prove to be too difficult as most systems have comparable command structures.

V.
SEQSEE -- GENERAL DESCRIPTION

The following is a brief description of the general functions that have been implemented in the current version (Version 1.2) of SEQSEE:

1) SEQUENCE ENTRY & EDITING - New sequences and sequence files may be created, entered and/or edited using a computer-directed protocol found in SEQED. The program permits the flexible entry and storage of sequence information in both upper and lower case -- with or without spacing.

2) STRUCTURAL ANALYSIS - Sequences may be analyzed statistically or predictively for the extent and location of secondary structure, active-site motifs, sequence signatures, membrane spanning regions, flexibility, hydrophobicity, hydrophobic moments and many other features. The structural analysis routines are designed to help the user in determining important aspects of structure and function in those cases where very little is known about the protein of interest.

3) SEQUENCE COMPARISON & ALIGNMENT - Sequences may be compared against a database, against themselves or alternatively individual sequences may be aligned in a pair-wise or multi-layered fashion depending on the choice of program or program parameters. Sequence alignments are marked explicitly to distinguish between exact matches and similar matches. Consensus sequences are generated for all multiply-aligned sequences. Choices of scoring matrices and gap penalties are possible. Sequence alignment and comparison are two excellent methods for discerning protein function and evolutionary relatedness.

4) FLEXIBLE PATTERN MATCHING - Pattern matching to a database (PIR, SWISS-PROT, SEQBANK, etc.) or to individual sequences may be done using a flexible query language (for exact matches) or a homology-based matching protocol (using a scoring matrix). Flexible pattern matching is ideal for the identification and location of suspected sequence motifs.

5) SEQUENCE LOCATION, RETRIEVAL & SCANNING - Sequences, names of sequences, accession numbers and bibliographic information may be scanned, retrieved or precisely located in either the PIR or SWISS-PROT databases (or in other user-specified databases) using a number of browsing, database scanning or pattern matching programs. These routines are ideal for interactive identification and retrieval of database sequences.

VI. THE SEQSEE SUITE

This section provides a more detailed description of the functions and subroutines currently available in SEQSEE. Note that each subroutine description is presented in the same order that it appears in the main SEQSEE menu. As well as providing a more complete description of the SEQSEE functions, we hope this section will also be of some interest to those wishing to understand the character of sequence analysis in general. (Note that program names appear in upper case letters).

1) HELP

HELP contains an abridged version of the SEQSEE manual for online consultation. A menu is provided with a selection of various topics and accompanying descriptions. Online help does not offer the same detailed information as the hardcopy SEQSEE manual, hence detailed inquiries should be directed to the manual.

2) ENTER / EDIT A SEQUENCE

The program known as SEQED is used for the entry and editing of new (or old) sequence files. The program first queries the user as to whether he or she wishes to:

    1) Enter a new sequence.
    2) Edit an old sequence.

If one chooses to enter a new sequence the program queries the user for the name of the sequence file (sequence filename), the name of the sequence (sequence name) and finally, the actual sequence (using the standard single letter amino acid code). Sequences may be entered using either lower case letters, upper case letters or an arbitrary combination of both. In other words, sequence entry is case independent.
The program also ignores blank characters so sequence entries may have as many blank spaces as desired. A "sequence ruler" is presented at the top of each sequence file entry line to permit quick identification of residue positions as they are typed. After each group of 50 characters has been entered, the user is expected to press the return key so that a new sequence ruler can appear. Upon completion of the sequence entry, the user must enter the '$' character to indicate to the computer that the typing process has finished.

3) RETRIEVE SEQUENCE FROM DATABASE

The program SEQRET is designed to allow the user to retrieve complete sequences or groups of sequences from the PIR database using either the PIR accession number or protein name (or portion thereof). Thus one may seek and select only a single sequence for a specific purpose, or entire protein families to create special user-specified databases. The sequences may be saved and/or edited for further analysis (as in the preparation of files for multiple sequence alignments). All sequences are saved in a SEQFILE format and, therefore, are ready to be analyzed by any of the other SEQSEE functions.

4) SEQUENCE STATISTICS

The STATS program carries out a simple statistical analysis of any given protein sequence. It calculates and displays the molecular weight, the amino acid composition, average hydropathy (Kyte and Doolittle, 1982), total charge, predicted iso-electric point, expected quantity of exposed and interior surface area (Chothia, 1976; Richards, 1977; Miller et al., 1987), expected packing volume (Richards, 1977; Janin, 1979), predicted specific volume (Zamayatnin, 1972), aggregation potential (Fisher, 1964), estimated solvation free energy of folding (Chiche et al., 1990) and a host of other values that may be of structural or statistical interest (See Appendix 2). Note that STATS can only be used on sequence files in the SEQFILE format.
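
As an illustration only, the sequence entry conventions described for SEQED (case independence, ignored blanks, '$' as terminator) and two of the simpler STATS quantities (amino acid composition and average Kyte-Doolittle hydropathy) can be sketched in a few lines of Python. This is not the SEQSEE source, which is written in C; the function names below are hypothetical, and the handling of unrecognized characters is an assumption.

```python
from collections import Counter

# Kyte-Doolittle (1982) hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def clean_sequence(raw):
    """Normalize typed input: stop at '$', drop blanks, fold to upper case."""
    body = raw.split("$", 1)[0]          # everything after '$' is ignored
    seq = "".join(body.split()).upper()  # remove all whitespace
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:  # assumption: reject anything outside the one-letter code
        raise ValueError("unrecognized residues: " + "".join(sorted(unknown)))
    return seq


def composition(seq):
    """Fraction of the sequence made up by each residue type present."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}


def average_hydropathy(seq):
    """Mean Kyte-Doolittle hydropathy over the whole sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    seq = clean_sequence("ala kDe $ ignored")
    print(seq)
    print(composition(seq))
    print(round(average_hydropathy(seq), 3))
```

The remaining STATS values (iso-electric point, surface areas, packing volume and so on) depend on parameter tables that are not reproduced in this section, so they are left out of the sketch.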

5) STRUCTURE PREDICTION

ALEXIS is a comprehensive structural analysis program which has been developed expressly for the SEQSEE software suite. ALEXIS performs calculations on the extent and location of potential membrane spanning regions, the identification of short sequence folding motifs, the prediction of the protein folding class (Chou and Zhang, 1992) and the prediction of secondary structure using the cumulative results of five different and well-tested methods. Detailed descriptions of the techniques and their respective enhancements are given below:

a) MEMBRANE SPANNING REGIONS

This calculation uses the central point maxima technique first described by Klein et al. (1985). This has been shown to be the most accurate method for membrane spanning identification through independent tests performed by Fasman & Gilbert (1990). The method uses a linear discriminant model to test the probability that any given sequence is membrane spanning. The hydrophobicity scale (and hence the discriminant equation) has been adopted specifically for the Kyte-Doolittle parameters. Some modifications have been introduced to this scale to permit better discrimination of the membrane spanning regions. The program is designed to determine, first, if there are membrane spanning regions and, second, where they are located.

b) CHOU-FASMAN SECONDARY STRUCTURE PREDICTION

This procedure predicts the secondary structure for any given protein sequence through a modified Chou and Fasman (1974, 1978) algorithm. The Chou-Fasman algorithm is based on statistically observed propensities of all 20 amino acids to occur in various protein secondary structures. Despite its widespread use and general popularity, it is a technique not without its shortcomings. In an attempt to improve both its accuracy and its general utility, a number of modifications to the original algorithm have been made.
Some of these changes include the adoption of the simplified rules of Williams et al. (1987) and the use of updated Chou-Fasman parameters as derived from SEQBANK. With these new modifications, this technique can predict secondary structures with a 59.8% level of accuracy. A random three-state prediction, on the other hand, is expected to be only 33.6% correct (based on the disposition of secondary structures in SEQBANK).

c) HYDROPHOBIC MOMENT SECONDARY STRUCTURE PREDICTION

This procedure determines the secondary structure for any given protein sequence on the basis of hydrophobic periodicities. It has its origins with the Fourier analysis of hydrophobicity profiles as first proposed by Eisenberg et al. (1984). In contrast to the statistical techniques of Chou and Fasman, it is an approach that is based on well established physico-chemical principles. According to Eisenberg, stretches of residues with hydrophobic periodicities in the range of 90 to 120 degrees (corresponding to a hydrophobic residue every three to four residues) are typically found in alpha-helices, while stretches of amino acids with hydrophobic periodicities of 160 to 180 degrees (corresponding to alternating hydrophobic and hydrophilic residues) are typically in beta strands. By introducing a number of modifications to Eisenberg's original proposal, including the use of optimized hydrophobicity parameters and the introduction of Chou-Fasman conformational probabilities, the level of prediction accuracy can reach 64.5% (This value was calculated using the structural assignments available in SEQBANK).

d) GARNIER, OSGUTHORPE, ROBSON SECONDARY STRUCTURE PREDICTION

Commonly called the GOR method (after the three authors' initials) this procedure predicts the secondary structure on the basis of parameters obtained through information theory. It is based on a series of proposals originally put forward by these investigators in the 1970's (Garnier et al., 1978).
It is very much a statistical technique, not unlike the Chou-Fasman approach, except that it takes into account the positional preferences of amino acids within helices, beta-strands and coils. Despite its high level of parameterization, the procedure is extremely fast (when computerized) and is consistently rated among the most accurate of known methods. With recent modifications in place, including some degree of re-parameterization of the previously published values found in Gibrat et al. (1987), the method attains a 64.6% level of accuracy (This value was calculated using the structural assignments available in SEQBANK).

e) HOMOLOGY-BASED SECONDARY STRUCTURE PREDICTION

This procedure determines the secondary structure for any given protein sequence by searching for short stretches of homologous sequences and comparing them to known protein structures. It is based on a number of related proposals simultaneously offered by several authors in 1986 (Nishikawa and Ooi, 1986; Sweet, 1986 and Levin et al., 1986). The most recent implementation of this procedure, as described by Levin and Garnier (1988), has been adopted for use in SEQSEE. In this version, SEQBANK is used as the database of known structures from which sequence homologies are sought. This method is the most accurate secondary structure prediction scheme presently known. For proteins sharing greater than 25% sequence similarity with any protein in SEQBANK, the method approaches a level of accuracy of 87%. For proteins possessing no significant homology, the prediction is 66.0% correct. SEQSEE uses a specially optimized amino acid exchange matrix in order to achieve these high scores.

f) MOTIF-BASED SECONDARY STRUCTURE PREDICTION

This procedure predicts secondary structure based on primary sequence patterns contained in the files SEQMOTIF1 and SEQMOTIF2.
It is an extension of the methods first proposed by Rooman and Wodak (1988, 1991) for identifying and incorporating well established sequence/structure patterns in secondary structure prediction schemes. The procedure, as currently implemented, makes structural predictions (on average) for less than 20% of the residues in any given sequence. However, for those regions that are predicted, the confidence level is often very high (> 80%).

g) CONSENSUS SECONDARY STRUCTURE PREDICTION

This procedure determines a consensus secondary structure based on the cumulative scores of the five methods described above. The residue-specific scores for each method are weighted according to that method's expected prediction accuracy. The homology-based technique has the strongest weighting and the Chou-Fasman technique has the weakest. The consensus method is generally found to improve overall prediction accuracy by one to two percent and, furthermore, it can greatly simplify the interpretation of the other six predictions. We recommend that the consensus prediction be used when only a single answer or a single method is desired.

6) SEQSITE PATTERN SEARCH

The SEQSITE procedure allows the user to search any given sequence for active sites, binding sites, signature sequences, sequence motifs, phosphorylation sites and potential antigenic sites. A library of more than 1000 signature sequence patterns, 50 phosphorylation sites and 20 generalized antigenic regions can be scanned when this function is invoked. All sites are identified by residue location, matched template pattern and at least one current reference. This type of "function search" is extremely useful for determining the properties and features of newly sequenced or poorly characterized proteins.

7) FLEXIBILITY

The program named FLEQSEE predicts the flexibility and mobility of various regions in a protein based on sequence information alone.
Flexibility is calculated on the basis of the Karplus algorithm (Karplus and Schulz, 1985). This procedure determines main-chain mobility by using smoothed averages of X-ray thermal B factors taken from approximately 30 highly resolved structures. In SEQSEE, flexibility may be used to determine the position and length of coil regions by locating all "significant" maxima (those maxima which exceed a minimum threshold) in the flexibility plot. Flexibility plots may also be used to identify surface-seeking elements or to locate strongly antigenic regions of any given sequence.

8) HYDROPHOBIC MOMENT

MOMENT calculates the hydrophobic moment of a sequence using the Cornette et al. (1987) scale of hydrophobicity and the Fourier analysis technique of Eisenberg et al. (1984). Calculations are performed over a set "sequence window" of predefined length using a range of values specific to helical periodicities (90 to 120 degrees), exterior beta strand periodicities (160 to 180 degrees) and interior beta strand periodicities (0 degrees). The values for helix and beta strand may be compared with one another and with a minimum cutoff value (usually around 5) to identify amphipathic helices or beta strands. This method has some utility in identifying potential T-cell epitopes (amphipathic helices) and other biologically important structures.

9) HYDROPHOBICITY

HYDRO calculates the smoothed hydrophobicity (over a window of predefined length) of any given sequence using a choice of several hydrophobicity scales. The operator may choose (using the control file) from the Eisenberg consensus scale (Eisenberg et al., 1984), the Kyte-Doolittle scale (Kyte and Doolittle, 1982), the Cornette scale (Cornette et al., 1987) or the Parker-HPLC scale (Parker et al., 1986). The Hopp-Woods antigenicity scale (Hopp and Woods, 1981) is also available for antigenicity determination.
Hydrophobicity plots may additionally be used to locate membrane spanning regions in some types of proteins (hydrophobic regions of 20 or more residues). A choice of both "raw" and "scaled" values is offered. 10) FAST ALIGNMENT SEARCH FAST_ALIGN is a k-tuple based fast alignment algorithm based loosely on the speed-up protocols incorporated in Lipman and Pearson's FASTA (1988) and Altschul et al.'s BLAST (1990). First, a table of homologous 3-tuples is generated for the query sequence using a modified scoring matrix. Second, a look-up table of these 3-tuples and their respective location is prepared from the query sequence. Third, a look-up table is prepared of 3-tuples for each sequence in the database. The two look-up tables (one from the query and the second from the database) are then compared and matches are identified. The result is a one-dimensional "spectrogram" of homologies characterized by low level noise (poor matches) and the occasional sharp peak (a string of matches). Database sequences with sufficiently high peaks are then pulled out and rigorously aligned using the Needleman-Wunsch program to determine the significance of the alignment. The program is capable of searching the complete PIR database and then ordering and aligning 50 homologous matches of a 100 residue query sequence in less than 90 seconds. This is an extremely powerful technique to accomplish quick inquiries regarding protein relatedness and identification. FAST_ALIGN may be used to align sequences against the PIR, SWISS-PROT, SEQBANK or a user-specified database with a SEQFILE format. Several choices of scoring matrices are possible and these include: the Unity matrix, the Dayhoff PAM 250 matrix (Dayhoff et al., 1983), the McLachlan matrix (McLachlan, 1971) and the RBO matrix (unpublished). The RBO matrix is the default scoring matrix. 
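The look-up table comparison described for FAST_ALIGN can be illustrated with a short Python sketch. This is not the FAST_ALIGN source code. It assumes exact 3-tuple matches only (FAST_ALIGN itself uses homologous 3-tuples scored with a modified matrix), and the function names build_lookup, diagonal_histogram and best_peak are invented here for illustration. The histogram of match offsets plays the role of the one-dimensional "spectrogram": a sharp peak at one offset indicates a string of matches along a single diagonal.

```python
from collections import Counter, defaultdict


def build_lookup(seq, k=3):
    """Map every k-tuple in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_histogram(query, target, k=3):
    """Count shared k-tuples per diagonal (offset = target pos - query pos).

    Assumption: only identical k-tuples are counted. A real implementation
    would also accept similar ("homologous") tuples via a scoring matrix.
    """
    query_table = build_lookup(query.upper(), k)
    target_table = build_lookup(target.upper(), k)
    hist = Counter()
    for word, query_positions in query_table.items():
        for t_pos in target_table.get(word, ()):
            for q_pos in query_positions:
                hist[t_pos - q_pos] += 1
    return hist


def best_peak(hist):
    """Return (offset, count) of the tallest peak, or (None, 0) if empty."""
    if not hist:
        return None, 0
    return max(hist.items(), key=lambda item: item[1])


if __name__ == "__main__":
    hist = diagonal_histogram("ACDEFGHIK", "XXACDEFGHIKYY")
    print(best_peak(hist))
```

In a database search of this kind, only the sequences whose tallest peak exceeds some cut-off would be passed on to the slower, rigorous alignment step.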
11) EXHAUSTIVE ALIGNMENT SEARCH NW_ALIGN is a program which carries out an exhaustive pair-wise alignment of any given query sequence to all other sequences in a given database. Only those sequences with scores above a certain user-defined threshold are retained. The algorithm used for this procedure is based on the Needleman-Wunsch (1970) approach for pair-wise alignment. This dynamic programming method is guaranteed to find the optimal alignment between any two sequences for any given scoring matrix and gap penalty. Alignments can either be done against the PIR database, SWISS-PROT, SEQBANK or a user defined database in the SEQFILE format. If alignments are done against SEQBANK, knowledge of the secondary structure is included to determine the location and length of gaps (Lesk et al., 1986). A choice of scoring matrices and gap penalties is available. The scoring matrices include: the Unity matrix, the Dayhoff PAM 250 matrix (Dayhoff et al., 1983), the McLachlan matrix (McLachlan, 1971) and the RBO matrix (unpublished). The RBO matrix is the default scoring matrix. Scores are rigorously calculated on the basis of comparisons to randomized sequence alignments as suggested by Dayhoff et al. (1983). The program is extremely time consuming with a query sequence of 100 residues typically taking 4 hours to complete on a SUN Sparcstation. However, the improvement in overall alignment accuracy and the possibility of identifying very remote and previously unidentified relationships may well be worth the wait. NW_ALIGN also incorporates another program called SB_ALIGN which is capable of performing structure-based alignments using the approach of Lesk et al. (1986). SB_ALIGN is only called when conducting alignments against the SEQBANK database. 
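The dynamic programming idea behind the Needleman-Wunsch approach can be sketched in a few lines of Python. This is an illustrative sketch only, not the NW_ALIGN code. It assumes a Unity-style score (1 for identical residues, 0 otherwise) and a single linear gap penalty, whereas NW_ALIGN offers several scoring matrices and gap penalties and assesses significance against randomized sequences. The function name and its parameters are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # pair residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal path.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Because every cell of the score table is filled in, the running time grows with the product of the two sequence lengths, which is why an exhaustive search of a whole database is so much slower than the fast alignment search.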
If the user wishes to place an exhaustive alignment run into the background (to prevent the computer from being tied up for long periods of time), this can be done as follows:

1) Press the "control" and "z" keys simultaneously to temporarily stop the job.
2) Type "bg" and press the "return" key to restart the program in the background.

The results can be viewed at any time by re-opening the SEQSEE window and inspecting the *.tmp files that are automatically created and updated during the alignment run.

12) ALIGN 2 OR MORE SEQUENCES

The program MULT_ALIGN uses a modification of the pair-wise Needleman-Wunsch protocol to align two or more protein sequences. The method is closely related to the progressive alignment procedure first described by Barton and Sternberg (1987), which permits rapid and accurate multiple alignments for up to several hundred proteins. A consensus sequence is also generated for each pair-wise or multiple alignment. A choice of scoring matrices and gap penalties is available. Sequences which are to be aligned must be in the SEQFILE format, either in the form of databases (for multiple alignments) or singly (for pair-wise alignments). The procedure for aligning more than two sequences (like the fast alignment search described in section 10) is fundamentally heuristic in nature and so it cannot be proven that the resulting alignments are mathematically optimal.

13) PATTERN SEARCH

This procedure can search the SEQBANK, SWISS-PROT, or PIR databases or, alternatively, a sequence of your own choosing to find exact pattern matches according to the following rules (note that sequence patterns are case INDEPENDENT):

     a) X         Match exact residue specified, where X = any amino acid
     b) !X        Match any residue EXCEPT X
     c) *         Wild card character -- matches any amino acid
     d) [XYZ]     "OR" braces -- match X "or" Y "or" Z
     e) X&Y       "AND" character -- match X "and" Y no matter what the separation
     f) X{2,8}Y   "Range" braces -- match X and Y if separation is between 2 and 8 residues. Range braces allow a range of wild card characters, i.e. {2,8} = 2 to 8 "*"
     g) $**X      "Termination" character -- match X if located 2 residues from N terminus. Termination characters are used to mark either the beginning (N terminus) or end (C terminus) of a sequence

Pattern Search (PSEARCH) is constructed to allow the user to enter several patterns at once, either on a single line (using the "&" feature) or on separate lines. Patterns appearing on separate lines are treated as "independent" patterns (meaning they don't have to appear in the same protein sequence) while patterns joined by "&" characters are viewed as "dependent" patterns (meaning they do have to appear in the same protein sequence). Some examples of sequence pattern searches are given below:

     AA***K
          Find all occurrences of 2 alanines together followed by any 3 residues followed by a single lysine.

     AA!P!P!PK
          Find all occurrences of 2 alanines together followed by any 3 residues (as long as they're NOT prolines) followed by a single lysine. (ie. look for AA***K except AAP**K, AA*P*K, AA**PK, AAP*PK, AA*PPK, AAPP*K, AAPPPK)

     [AG][AG]*[KR]
          Find all occurrences of 2 alanines or 2 glycines or any combination of the two followed by any residue followed by a lysine or an arginine. (ie. look for AA*K, AG*K, GA*K, GG*K, AA*R, AG*R, GA*R and GG*R)

     AA*K&I**R
          Find all occurrences of 2 alanines together followed by any amino acid followed by a single lysine, AND if that pattern is found, then find all occurrences of a single isoleucine followed by any two amino acids followed by a single arginine IN THE SAME PROTEIN SEQUENCE. (ie. look for AA*K and I**R within a sequence)

     AA{2,5}[KR]
          Find all occurrences of 2 alanines together followed by at least 2 but no more than 5 amino acids (any type) followed by either a lysine or an arginine. (ie. look for AA**[KR], AA***[KR], AA****[KR] and AA*****[KR])

     ${3,5}M
          Find all occurrences of methionine that are between 3 and 5 residues from the N terminus. (ie. look for $***M, $****M and $*****M)

Of course any combination of the above queries could be used in a PSEARCH pattern search. Other examples of PSEARCH queries may be found by browsing through the SEQSITE database.

14) HOMOLOGY SEARCH

The HSEARCH program searches either the PIR database, SWISS-PROT, SEQBANK or a compatible user-defined database to find the "nearest" or most homologous matches to any given sequence. Homologies are determined according to any one of four user-selectable scoring matrices (described earlier) with the default being the RBO scoring matrix. Gap penalties are not yet incorporated into the homology search routine. The homology search is a useful complement to other pattern search routines, especially when attempting to locate distantly related or difficult-to-identify sequence motifs.

15) DOT PLOT

DOTPLOT is an extremely flexible program developed to produce character representations of standard dot plots (Lipman and Pearson, 1985). The low resolution of most character-defined screens prevents the incorporation of a useful graphic representation of dot plot results and hence a character representation with a user-defined "threshold" has been incorporated to overcome this problem. DOTPLOT may be used to compare a sequence with itself (to identify internal repeats), with another sequence (for pair-wise alignments), with a SEQFILE compatible database or with the PIR or SWISS-PROT databases (for medium speed alignments). By using DOTPLOT in conjunction with a database it is possible to look for homologies between any shared regions in a group of sequences. Such an option has proven to be quite useful in identifying previously unrecognized motifs or unexpected similarities in a number of proteins.
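For readers unfamiliar with dot plots, the idea of a character representation with a threshold can be sketched in Python as follows. This is a minimal illustration and not the DOTPLOT program itself. It assumes simple identity scoring over a sliding window, and the function name dot_plot and its window and threshold parameters are hypothetical. DOTPLOT's actual scoring and threshold settings are controlled through SEQSEE and may differ.

```python
def dot_plot(seq_a, seq_b, window=3, threshold=3):
    """Return a list of text rows: rows follow seq_a, columns follow seq_b.

    A '*' is printed wherever at least `threshold` of the `window`
    residues starting at that position are identical; '.' otherwise.
    """
    a, b = seq_a.upper(), seq_b.upper()
    rows = []
    for i in range(len(a) - window + 1):
        row = []
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            row.append("*" if matches >= threshold else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    # Comparing a sequence with itself: internal repeats appear as
    # diagonal runs of '*' lying off the main diagonal.
    for line in dot_plot("GPCKGPCK", "GPCKGPCK", window=3, threshold=3):
        print(line)
```

Raising the threshold suppresses background noise at the cost of missing weaker similarities, while lowering it does the opposite.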
16) PROTEIN DATABASE REFERENCE SEARCH

The program REFSCAN is designed to allow the user to locate and retrieve specific sequence references from the PIR or SWISS-PROT databases using either the accession number, the name (or portion thereof) or a bibliographic/functional reference. This feature allows the user to quickly access important information about many newly sequenced proteins pertaining to their function, structure or relationship with other proteins in the database.

17) FILE VIEWER

BROWSE permits the user to edit or view a variety of database files while still in the SEQSEE environment. Abbreviated versions of the SWISS-PROT and PIR databases (which provide sequence name, source and accession code only) may be viewed directly with this command. Likewise, the complete SEQBANK database may be displayed and scrolled through at leisure. BROWSE also permits the user to interactively edit the SEQSEE control file (seqsee.parms). This allows the user to customize SEQSEE program parameters in almost any manner desired. Standard UNIX commands may be used for scrolling through or locating particular character strings in any of the files.

0) EXIT SEQSEE

Closes all current files and returns the user to the general operating system. The program may be restarted by typing "seqsee". If the program crashes or hangs up for any reason, simply type "^c" (i.e. press the "control" and "c" keys simultaneously). This will stop all processes and return the user to the main menu.

VII. GETTING STARTED

Although it is possible to run SEQSEE simply by signing on and typing seqsee (regardless of what directory you are in), we recommend that all regular SEQSEE users do the following:

1) Create your own directory for SEQSEE and make this your current directory (ie. make the directory by typing "mkdir seqsee" and then type "cd seqsee" to get into this directory). Having your own SEQSEE directory will help you better organize your input files and results.
2) Copy the control file "seqsee.parms" into your SEQSEE directory. Your system administrator should be able to tell you where it can be found. The command for this might typically be:

     cp /usr/local/seqsee/seqsee.parms .

Please note the period at the end of this command -- it stands for "current directory" (i.e. the directory you are already in). Having a copy of the "seqsee.parms" file will permit you to change almost any of the default parameters. This allows you to "customize" SEQSEE to suit your own special needs.

3) If you already have sequence files somewhere in your computer, copy them into your SEQSEE directory. Try to ensure that these files are in the proper SEQFILE format. Of course you can always use SEQSEE to create or retrieve your own sequence files, which will automatically conform to the SEQFILE format.

4) Although there are no absolute windowing requirements for SEQSEE, we do recommend that your screen or window be at least 80 characters wide and at least 25 lines in length. (The 25 line window length is actually the upper limit for VT100 or Mac/PC terminal emulators). Choosing a window this size (or larger) will permit easy viewing of your output, help files and menus. If your terminal or terminal emulator permits, we also recommend having more than one window on the screen since this can make file manipulations and the viewing of intermediate results much more convenient.
If you have done all of the above operations and are satisfied with your current status, simply type:

     seqsee

The following menu should appear (see below):

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************

     *** Preliminaries ***               *** Alignments ***
 1) Help                             10) Fast Alignment Search
 2) Enter/Edit a Sequence            11) Exhaustive Alignment Search
 3) Retrieve Sequence from Database  12) Align 2 or more sequences

     *** Structural Analysis ***         *** Scanning ***
 4) Sequence Statistics              13) Pattern Search
 5) Structure Prediction             14) Homology Search
 6) SEQSITE Pattern Search           15) Dot Plot
 7) Flexibility                      16) Database Reference Search
 8) Hydrophobic Moment               17) File Viewer
 9) Hydrophobicity                    0) EXIT SEQSEE

Enter the number of the desired function >>

To continue with the program, simply type in any one of the above menu numbers and press "return". SEQSEE will automatically prompt the user for a variety of input or output names (such as input filenames, output filenames, accession numbers etc.). It is important to note that SEQSEE requires protein sequences for input in 10 of its 18 functions, hence it is important to have at least one sequence file (a SEQFILE prepared using either menu item #2 or #3) stored in a known location. This way you only have to type in the sequence filename -- and not the whole sequence -- each time you make a function call. SEQSEE has been specifically designed to be a self-guiding interactive tool, so it is hoped that the program's prompts will lead the uninitiated user through the program without much difficulty or confusion.
For those wishing a more complete introduction to the SEQSEE suite, we recommend a careful study of the tutorial presented in the next section.

VIII. TUTORIAL

"SO YOU THINK YOU'VE FOUND SOMETHING"

Let us suppose that you and a collaborator have succeeded in isolating a small protein from Bacillus subtilis which appears to act as an oxidizing co-factor for certain cellular processes. After many weeks of amino acid analysis and peptide sequencing, your collaborator provides you with the N-terminal sequence of the first 60 amino acids of this new protein. You are requested to find out anything you can about this partial sequence, and to report to your colleague as soon as possible. Sounds like a job for SEQSEE!

Let's demonstrate how you might go about analyzing this sequence using just a few of the options available in SEQSEE. In this example we will first show how a new sequence is entered. Then we will demonstrate how the sequence can be analyzed statistically. We will also show how to check this sequence for sequence motifs and how to compare (and align) the query sequence against the PIR database. Finally we will demonstrate how to search SEQBANK to locate those proteins which might be evolutionarily related to the query sequence. So here it goes...
1) Sign on to a computer

2) Type "seqsee" (the following menu should appear)

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************

     *** Preliminaries ***               *** Alignments ***
 1) Help                             10) Fast Alignment Search
 2) Enter/Edit a Sequence            11) Exhaustive Alignment Search
 3) Retrieve Sequence from Database  12) Align 2 or more sequences

     *** Structural Analysis ***         *** Scanning ***
 4) Sequence Statistics              13) Pattern Search
 5) Structure Prediction             14) Homology Search
 6) SEQSITE Pattern Search           15) Dot Plot
 7) Flexibility                      16) Database Reference Search
 8) Hydrophobic Moment               17) File Viewer
 9) Hydrophobicity                    0) EXIT SEQSEE

Enter the number of the desired function >>

3) Type "2" (and press "return") so that you can input your new sequence. This puts you into the program SEQED. In this program you are first required to choose an option for entering or editing your sequence. Since we wish to enter a new sequence we will select option "1". Second, you are asked to provide a name for your sequence. In this case we'll call it "bacillus_redoxase". The sequence name is required for record keeping purposes only.

>> 2

What would you like to do?
1) Enter new sequence.
2) Edit old sequence
0) Exit
Enter a number (then press return).
>> 1

Seqed (Version 1.2)
Enter name for sequence. Use underscores instead of blanks to separate words. (eg. thioredoxin_human)
>> bacillus_redoxase

Enter each amino acid (one letter code). You may enter up to 50 amino acids on one line. Press return to get a new prompt line. When you are done enter $ and press return.
         1         2         3         4         5
12345678901234567890123456789012345678901234567890
         |         |         |         |         |
msdklihitddsfdtdvikadgailvdfwaewcgpckmiapildeladey

         1         2         3         4         5
12345678901234567890123456789012345678901234567890
         |         |         |         |         |
qgkltvakln$

4) After typing in the above 60 residues, type "$" and then press "return". The newly prepared sequence will then be "echoed" to the screen in precisely the same format in which it will be stored. This is done so that you may inspect your sequence for errors and make any required corrections. Note that you have been placed in the "vi" editor, so it is essential to have some rudimentary knowledge of how this editor works. Changes to the sequence can be in either upper or lower case -- it is not necessary to keep all entries or corrections in upper case.

>Title: bacillus_redoxase
MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEY
QGKLTVAKLN
~
~
~
~
~
~

5) If no corrections are necessary, type ":q" to exit the editor. If you do make corrections, type ":wq" and this will save the corrections you have made. After exiting the "vi" editor, you are asked whether the file should be saved. Upon replying (usually with a 'y') you are then asked to provide a name for the sequence file (we'll call it redoxase.seq). After responding you are immediately returned to the main SEQSEE menu. Note that you have now created a SEQFILE called "redoxase.seq" containing the amino acid sequence of "bacillus redoxase".

:q
Save this file? (Y/N)
>> y
Enter sequence filename.
>> redoxase.seq

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************

     *** Preliminaries ***               *** Alignments ***
 1) Help                             10) Fast Alignment Search
 2) Enter/Edit a Sequence            11) Exhaustive Alignment Search
 3) Retrieve Sequence from Database  12) Align 2 or more sequences

     *** Structural Analysis ***         *** Scanning ***
 4) Sequence Statistics              13) Pattern Search
 5) Structure Prediction             14) Homology Search
 6) SEQSITE Pattern Search           15) Dot Plot
 7) Flexibility                      16) Database Reference Search
 8) Hydrophobic Moment               17) File Viewer
 9) Hydrophobicity                    0) EXIT SEQSEE

Enter the number of the desired function >>

6) Type in "4" to choose the Sequence Statistics option. This function performs a quick and useful statistical analysis of the sequence to help you identify any peculiar trends in the sequence that might not be obvious on first inspection.

>> 4

Sequence Statistics (Version 1.2)
Your amino acid sequence is now required:
1) Read sequence from an input file
2) Sequence to be entered via keyboard
3) I do not have my sequence ready
Enter a number (then press return).
>>

7) Type in "1", since you already have a UNIX file (a SEQFILE called "redoxase.seq") containing your sequence. You will then be queried for the name of this sequence file. Enter it and press "return" as usual.

>> 1
Enter input sequence filename.
>> redoxase.seq

8) The statistical analysis will automatically be written to the screen in the format shown below. You may scroll through the file and make any changes you wish.
**************************************************************
Program......: stats (version 1.2)
Description..: Statistical Analysis of a Sequence
Date.........: Tue Feb 2 09:57:14 1993
Sequence Name: bacillus_redoxase

  1 MSDKLIHITD DSFDTDVIKA DGAILVDFWA EWCGPCKMIA PILDELADEY
 51 QGKLTVAKLN
**************************************************************
Molecular Weight......: 6684.86
Amino acids...........: 60
Mean residue weight...: 111.41

*** Amino Acid Composition ***

Amino   Freq      Freq        E(Freq)     Weight      E(weight)
Acid    (total)   (percent)   (percent)   (percent)   (percent)
A       6         10.00       <8.84>      6.40        <5.73>
C       2         3.33        <2.09>      3.09        <1.97>
D       9         15.00       <5.89>      15.54       <6.17>
E       3         5.00        <5.90>      5.81        <6.94>
F       2         3.33        <3.70>      4.42        <4.96>
G       3         5.00        <8.29>      2.57        <4.31>
H       1         1.67        <2.12>      2.06        <2.64>
I       6         10.00       <5.40>      10.18       <5.57>
K       5         8.33        <6.22>      9.01        <7.27>
L       6         10.00       <7.93>      10.18       <8.18>
M       2         3.33        <1.97>      3.94        <2.35>
N       1         1.67        <4.59>      1.71        <4.78>
P       2         3.33        <4.51>      2.91        <3.99>
Q       1         1.67        <3.75>      1.92        <4.37>
R       0         0.00        <4.21>      0.00        <6.00>
S       2         3.33        <6.59>      2.61        <5.23>
T       3         5.00        <5.96>      4.55        <5.49>
V       3         5.00        <7.12>      4.46        <6.43>
W       2         3.33        <1.37>      5.59        <2.32>
Y       1         1.67        <3.56>      2.45        <5.30>

Note: E(x) are expected values based on average amino acid content of soluble proteins.
**************************************************************
Hydrophobicity Parameters: /canopus/rbo/seqsee/lib/kyte.parms

Average Hydrophobicity (ah)...................: 0.78
   Notes: ah = -2.67 --> Average Protein
          ah > 0.10 --> Hydrophobic Protein
          ah < -6.00 --> Hydrophilic Protein
Ratio of Hydrophilicity to Hydrophobicity (rh): 0.95
   Notes: rh = 1.22 --> Average Protein
          rh > 1.90 --> Non-folding Protein
          rh < 0.85 --> Insoluble Protein
Percentage of Hydrophobic residues............: 56.67
   Notes: Average percentage is 52.44
          Hydrophobic Amino Acids are ACFGHILMVWY
Percentage of Hydrophilic residues............: 43.33
   Notes: Average percentage is 47.56
          Hydrophilic Amino Acids are DEKNPQRST
Ratio of %Hydrophilic to %Hydrophobic.........: 0.76
   Notes: rhp = 0.91 --> Average Protein
          rhp > 1.43 --> Non-folding Protein
          rhp < 0.77 --> Insoluble Protein
**************************************************************
Number of Basic amino acids: 5
Number of Acidic amino acids: 12
Estimated pI for protein....: 4.60

pH:       3     4     5     6     7     8     9     10     11
Charge:   7.1   3.7   -2.6  -5.0  -5.9  -7.0  -9.0  -11.9  -14.4

Total linear charge density.: 0.32
**************************************************************
Polar Area of Extended Chain...............: 3666.20 Angs**2
Non-Polar Area of Extended Chain...........: 6923.10 Angs**2
Total Area of Extended Chain ..............: 10359.60 Angs**2
Polar ASA of Folded Protein................: 1117.84 Angs**2
Non-Polar ASA of Folded Protein............: 2839.88 Angs**2
ASA of folded protein .....................: 3957.72 Angs**2
Ratio of Folded to Extended Area...........: 0.40
*************************************************************
Buried Polar Area of Folded Protein........: 2096.61 Angs**2
Buried Non-polar Area of Folded Protein....: 3654.08 Angs**2
Buried Charge Area of Folded Protein.......: 239.61 Angs**2
Total Buried Surface.......................: 5990.30 Angs**2

Expected Number and Fraction of Residues 95% Buried
A: 1 (0.166)   C: 1 (0.284)   D: 0 (0.038)   E: 0 (0.022)
F: 1 (0.291)   G: 0 (0.127)   H: 0 (0.127)   I: 2 (0.317)
K: 0 (0.004)   L: 2 (0.284)   M: 1 (0.304)   N: 0 (0.041)
P: 0 (0.056)   Q: 0 (0.038)   R: 0 (0.013)   S: 0 (0.069)
T: 0 (0.079)   V: 1 (0.271)   W: 0 (0.218)   Y: 0 (0.085)

Number of buried Amino Acids...............: 7
*************************************************************
Packing Volume (estimate)..................: 8300.24 Angs**3
Packing Volume (actual)....................: 8149.90 Angs**3
Interior Volume of Protein.................: 4056.60 Angs**3
Exterior Volume of Protein.................: 4093.40 Angs**3
Partial Specific Volume....................: 0.73 ml/g
Fisher Volume Ratio (actual)...............: 1.01
Fisher Volume Ratio (idealized)............: 1.50
>>> Molecule likely forms dimer or multimer (aggregates). <<<

Protein Solubility.........................: 1.47
   Notes: solubility = 1.6 --> Average Protein
          solubility < 1.1 --> Insoluble Protein
>>> Protein is likely water soluble. <<<
*************************************************************
Radius of Protein..........................: 15.17 Angs
RMS end to end distance of Ext. chain......: 81.24 Angs
Radius of Gyration of Extended chain.......: 33.17 Angs
*************************************************************
Solvation Free Energy of Folding...........: -43.38 kcal/mol
~
~
~
~
~

9) After checking through the stats file, you may exit it simply by typing ":q" as before. The computer then asks you whether it should save the file or not. As usual we will respond with "y" and give our results file the name "redoxase.stat". After entering the name, the main SEQSEE menu appears on the screen once again.

:q
Save this file? (Y/N)
>> y
Enter output filename.
>> redoxase.stat

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************

     *** Preliminaries ***               *** Alignments ***
 1) Help                             10) Fast Alignment Search
 2) Enter/Edit a Sequence            11) Exhaustive Alignment Search
 3) Retrieve Sequence from Database  12) Align 2 or more sequences

     *** Structural Analysis ***         *** Scanning ***
 4) Sequence Statistics              13) Pattern Search
 5) Structure Prediction             14) Homology Search
 6) SEQSITE Pattern Search           15) Dot Plot
 7) Flexibility                      16) Database Reference Search
 8) Hydrophobic Moment               17) File Viewer
 9) Hydrophobicity                    0) EXIT SEQSEE

Enter the number of the desired function >>

10) Let's see if we can uncover more information about the sequence by checking whether it contains any unusual sequence motifs or sequence patterns. This might give us an idea of what it does or what it looks like. To do so, let's choose option "6" and initiate SEQSITE.

>> 6

Seqsite Pattern Search (Version 1.2)
Please select a sequence motif database
1) SEQSITE.db (general sequence motifs)
2) PHOSITE.db (general phosphorylation sites)
3) EPISITE.db (antigenic sites)
Enter a number (then press return).
>> 1

11) We have three database options to choose from in the SEQSITE program. Since we are primarily interested in determining whether this protein contains any known sequence motifs we will choose "SEQSITE.db" (option 1), which contains information on more than 1000 general sequence motifs. After typing "1" and pressing "return", the following output appears.

Your amino acid sequence is now required:
1) Read sequence from an input file
2) Sequence to be entered via keyboard
3) I do not have my sequence ready
Enter a number (then press return).
>>

12) Type in "1" and press "return", as before, since you already have a UNIX file containing your sequence. As usual you are required to give the name of your sequence file (or SEQFILE) containing the sequence, so we will type in "redoxase.seq".

>> 1
Enter input sequence filename.
>> redoxase.seq

13) After pressing "return", the SEQSITE analysis will automatically be written to the screen in the format shown below. You may scroll through the file and make any changes you wish.

****************************************************************
Program......: seqsite (version 1.2)
Description..: Search for Interesting Motifs
Date.........: Thu Feb 16 13:02:21 1993
Sequence Name: bacillus_redoxase

  1 MSDKLIHITD DSFDTDVIKA DGAILVDFWA EWCGPCKMIA PILDELADEY
 51 QGKLTVAKLN

Database.....: /sirius/local/seqsee/databases/seqsite.db
****************************************************************

**********(1)*********
Motif Matched...: *[STA]*[WG]C[AVG][PH]C*
Sequence Matched: WAEWCGPCK
Amino Acids.....: 29-37

GLEASON, F.R. ET AL., FEMS MICRO REV. 54:271-297(1988)
ACTIVE SITE FOR PROKARYOTIC/EUKARYOTIC THIOREDOXIN-LIKE MOLECULES

Number of motifs found..: 1
Number of motifs scanned: 1110
~
~
~
~
~

14) Well, well... It looks like we've found something. It appears that bacillus redoxase contains the active site for a certain class of molecules called thioredoxins. Before leaping to any conclusions, though, we should check whether bacillus redoxase shares other similarities with thioredoxins or whether this shared sequence pattern is simply an accident of evolution. To answer this question we need to get back to SEQSEE's main menu. To do this we type ":q" as before and save the file by replying with a "y" and giving this file a name like "redoxase.site".

:q
Save this file? (Y/N)
>> y
Enter output filename.
>> redoxase.site

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************
  *** Preliminaries ***                 *** Alignments ***
  1) Help                               10) Fast Alignment Search
  2) Enter/Edit a Sequence              11) Exhaustive Alignment Search
  3) Retrieve Sequence from Database    12) Align 2 or more sequences
  *** Structural Analysis ***           *** Scanning ***
  4) Sequence Statistics                13) Pattern Search
  5) Structure Prediction               14) Homology Search
  6) SEQSITE Pattern Search             15) Dot Plot
  7) Flexibility                        16) Database Reference Search
  8) Hydrophobic Moment                 17) File Viewer
  9) Hydrophobicity                      0) EXIT SEQSEE
Enter the number of the desired function
>>

15) The best method to determine the evolutionary relatedness of any one protein with another is to perform a database alignment. Such a database alignment is offered by both the Fast Alignment Search and the Exhaustive Alignment Search options in SEQSEE (numbers 10 and 11 on the menu). Since we want a really quick (but not absolutely accurate) answer to our question, let's choose the Fast Alignment Search by typing "10".

>> 10

Fast Alignment (Version 1.2)

Your amino acid sequence is now required:
   1) Read sequence from an input file
   2) Sequence to be entered via keyboard
   3) I do not have my sequence ready
Enter a number (then press return).
>>

16) Type in "1" and press RETURN, as before, since you already have a UNIX file containing your sequence. As usual you are required to give the name of your sequence file (or SEQFILE) containing the sequence (redoxase.seq). For this type of program you are also required to indicate how many of the best alignments you want to keep -- we'll choose the top 200.
>> 1
Enter sequence input filename:
>> redoxase.seq
This program keeps track of the top 'x' alignments.
Enter a value for 'x' where 0 < x < 500.
>> 200

17) For a search of this magnitude, the computer will take about 60 to 90 seconds. While doing the search, the program will indicate what it's doing and how many sequences it has scanned. At the end of the search the top 200 alignments are printed out in descending order, with the best alignment at the top of the file. Following is a sample of the output you should expect to see.

Initializing lookup table...
Reading database file: /sirius/seqsee/databases/pir.IG/*
Proteins: 1000 BestScore: 6624 GroupScore: 6624
Proteins: 2000 BestScore: 6624 GroupScore: 495
Proteins: 3000 BestScore: 6624 GroupScore: 1147
~
~
~
~
~
***************************************************************
Program......: fast_align (version 1.2)
Description..: Fast Alignment on database
Date.........: Thu Feb 16 13:16:00 1993
Sequence Name: bacillus_redoxase
Amino Acids..: 60
Database.....: PIR (Intelligenetics Version)
Scoring Mat..: /sirius/local/seqsee/lib/wt.align
Gap Penalty..: 20
Gap Size Pen.: 5
Tuple Cut-off: 48
***************************************************************
Number of proteins tested.: 44890
Number of alignments found: 200

***********(1)**********
Title....: Thioredoxin precursor -- Eschericia coli
Id.......: TXEC
FastScore: 6624
NW Score.: 1224
Matches..: 56

Query Seq..:                  MSDKLIHITDDSFDTDVIKADGAILVDFWAEW
Matching...:                  ||||*||*|||||||||*||||||||||||||
Database...:MLHQQRNQHARLIPVELYMSDKIIHLTDDSFDTDVLKADGAILVDFWAEW

Query Seq..:CGPCKMIAPILDELADEYQGKLTVAKLN
Matching...:|||||||||||||*||||||||||||||
Database...:CGPCKMIAPILDEIADEYQGKLTVAKLNIDQNPGTAPKYGIRGIPTLLLF

Query Seq..:
Matching...:
Database...:KNGEVAATKVGALSKGQLKEFLDANLA
~
~
~
~
~

18) It looks like we've got a hit! Clearly bacillus redoxase is very closely related to E. coli thioredoxin.
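The "Matching" line in the sample output appears to mark identical residues with "|" and differing residues with "*". This reading is inferred from the sample itself and is not taken from the FAST_ALIGN source code. The short Python sketch below is ours, not part of SEQSEE; it rebuilds that marker line for the aligned portion of the two sequences and counts the identities. The function name match_line is hypothetical.

```python
def match_line(query: str, target: str) -> str:
    """Return a marker string: '|' where residues are identical, '*' otherwise."""
    if len(query) != len(target):
        raise ValueError("aligned segments must have equal length")
    return "".join("|" if q == t else "*" for q, t in zip(query, target))


# The tutorial query sequence (bacillus_redoxase, 60 residues).
QUERY = "MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEYQGKLTVAKLN"
# Residues 19-78 of the TXEC database entry, i.e. the part aligned with the query.
TARGET = "MSDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAKLN"

markers = match_line(QUERY, TARGET)
identities = markers.count("|")
print(markers)
print("Matches:", identities)
```

For the two sequences shown, the count agrees with the "Matches..: 56" value reported in the sample output.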
Indeed, given the level of similarity between the two we can be quite certain that bacillus redoxase is actually Bacillus subtilis thioredoxin. A quick check through the full alignment file will reveal that thioredoxins are actually very common proteins that seem to be ubiquitous in just about every creature presently known. Obviously we would like to know more about thioredoxins so that we may find out what they do and how they function. Of course we could run to the library and look up a few references, but we could also save ourselves some time by finding out immediately if some thioredoxins have already had their structures investigated by crystallography or NMR spectroscopy. To do so we need to get back to SEQSEE's main menu. As usual we type ":q" and save the file using the name "redoxase.align".

:q
Save this file? (Y/N):
>> y
Enter output filename:
>> redoxase.align

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************
  *** Preliminaries ***                 *** Alignments ***
  1) Help                               10) Fast Alignment Search
  2) Enter/Edit a Sequence              11) Exhaustive Alignment Search
  3) Retrieve Sequence from Database    12) Align 2 or more sequences
  *** Structural Analysis ***           *** Scanning ***
  4) Sequence Statistics                13) Pattern Search
  5) Structure Prediction               14) Homology Search
  6) SEQSITE Pattern Search             15) Dot Plot
  7) Flexibility                        16) Database Reference Search
  8) Hydrophobic Moment                 17) File Viewer
  9) Hydrophobicity                      0) EXIT SEQSEE
Enter the number of the desired function
>>

19) SEQSEE contains a file called SEQBANK which is a compilation of the sequences and secondary structures of all proteins which have had their structures reported in the literature.
We can access this databank through the File Viewer option (number 17 on the menu) and search for any occurrences of thioredoxin in this databank. So let's type "17" and see what happens.

>> 17

File Viewer (Version 1.2)

What would you like to browse?
   1) User specified file
   2) PIRSEE database
   3) SWISSEE database
   4) SEQBANK database
   5) SEQSEE Control File
   0) Exit
Enter a number (then press return).
>>

20) Enter "4" and press RETURN since we want to inspect the SEQBANK database. Once this is done we should see the following file:

>> 4

# SEQBANK
# (REVISED DEC. 1992)
#
# COPYRIGHT APRIL, 1992
# DAVID S. WISHART
#
# DEPARTMENT OF BIOCHEMISTRY
# UNIVERSITY OF ALBERTA
# EDMONTON, ALBERTA
# CANADA
# T6G 2H7
#
# SEQBANK is a compilation of sequences and "consensus" secondary
#structure assignments of soluble proteins and peptides which have had
#their.....
~
~
~
>ACTIN (RABBIT SKELETAL)
#REFERENCE : KABSCH, W. ET AL., NATURE 347:37-44 (1990)
#REFERENCE : FLAHERTY, K.M. ET AL., PNAS 88:5041-5045 (1991)
#SEQBANK ID: 1
#BRKHAVN ID:
#PIR-NBR ID: ATRB
#SWISPRO ID: ACTS$RABIT
#RESOLUTION: 2.8
#R FACTOR : 23.8
#FOLD CLASS: M
#NUM RESIDU: 375
DEDETTALVC DNGSGLVKAG FAGDDAPRAV FPSIVGRPRH QGVMVGMGQK
CCCCCCBBBB BBBCCBBBBB BBCCCCCCBB BBCCBBBBCC CCCCCCCCCC
DSYVGDEAQS KRGILTLKYP IEHGIITNWD DMEKIWHHTF YNELRVAPEE
CBBBCHHHHH HCCBBBBBCC BBBCBBBCCH HHHHHHHHHH HCCCCCCCCC
HPTLLTEAPL NPKANREKTM QIMFETFNVP AMYVAIQAVL SLYASGRTTG
CCBBBBBCHH HHHHHHHHHH HHHHHCCCCC BBBBBBCHHH HHHHCCCCBB
IVLDSGDGVT HNVPIYEGYA LPHAIMRLDL AGRDLTDYLM KILTERGYSF
BBBBCCCCBB BBBBBBCCBB BCCBBBBBCC CHHHHHHHHH HHHHHHCCCC
VTTAEREIVR DIKEKLCYVA LDFENAMATA ASSSSLEKSY ELPDGQVITI
CCHHHHHHHH HHHHHHCCCC CHHHHHHHHH HCCCCCCBBB BBCCCCBBBB
GNERFRCPET LFQPSFIGME SAGIHETTYN SIMKCDIDIR KDLYANNVMS
CCHHHHHHHH HHHCCCCCCC CCHHHHHHHH HHHHCCCHHH HHHHCCBBBB
GGTTMYPGIA DRMQKEITAL APSTMKIKII APPERKYSVW IGGSILASLS
CCCCCCCCHH HHHHHHHHHH HCCCCCBBBB CCHHHHHHHH HHHHHHHHCC
TFQQMWITKQ EYDEAGPSIV HRKCF
HHHHHCCCCH HHHHHCCHHH HHHCC
~
~

21) To check if thioredoxin is in this
databank we use one of the "vi" editor commands for character string searches. This is done by typing /THIOREDOXIN/ followed by RETURN (note that this search is NOT case-sensitive). Once the command is entered, the file is scrolled through to locate the word "THIOREDOXIN". Alternatively, we could just scroll through the database and look for the word "THIOREDOXIN" in the sequence file header. This is easily done since SEQBANK is arranged alphabetically. Regardless of how we choose to do it we find that we are indeed fortunate, for it appears that E. coli thioredoxin has already had its crystal structure solved. This will eventually permit us to accurately model the Bacillus subtilis molecule and might also help us explain its apparently unusual redox activities. To leave the File Viewer option, simply type ":q".

>THIOREDOXIN (E. COLI)
#REFERENCE : HOLMGREN, A. ET AL., PNAS (USA) 72:2305-2309 (1975)
#REFERENCE : DYSON, H.J. et al., BIOCHEMISTRY 28:7074-7087 (1989)
#REFERENCE : KATTI, S.K. et al., J. MOL. BIOL. 212:167-184 (1990)
#SEQBANK ID: 242
#BRKHAVN ID: 2TRX
#PIR-NBR ID: TXEC
#SWISPRO ID: THIO$ECOLI
#RESOLUTION: 1.7
#R FACTOR : 16.5
#FOLD CLASS: M
#NUM RESIDU: 108
SDKIIHLTDD DFDTDLVKAD GAILVDFWAE WCGPCKMIAP ILDEIADEYQ
CCBBBBBBCC HHHHHHHHCC CBBBBBBBBC CCCCHHHHHH HHHHHHHHHC
GKLTVAKLNI DQNPGTAPKY IGRGIPTLLL FKNGEVAATK VGALSKGQLK
CCBBBBBBBC CCCHHHHHHH HHHCCCBBBB BBCCCBBBBB BCCCHHHHHH
EFLDANLA
HHHHHHHH

In sum, this exercise demonstrates how it is possible to begin with a fragmentary peptide sequence of unknown structure and function and end up with a great deal of knowledge about that peptide's putative structure, probable function and potential origin -- all in the matter of a few minutes.
While this example clearly demonstrates the potential utility of SEQSEE it is important to understand that the results were obtained by adopting an efficient analytic strategy summarized below:

   1) Enter the new sequence into a SEQFILE
   2) Conduct a statistical analysis of the sequence using STATS
   3) Scan for sequence motifs using SEQSITE
   4) Carry out a fast database alignment with FAST_ALIGN
   5) Browse through the SEQBANK file to identify potentially related
      structures and preliminary references

IX. SUMMARY OF SEQSEE MENU OPTIONS

SAMPLE INPUT AND OUTPUT WITH EXPLANATIONS

2. Enter/Edit a Sequence

*******************************************
*                                         *
*  1. Choose option 1) to enter a         *
*     new sequence                        *
*  2. Enter the sequence name             *
*  3. Enter the sequence (remember        *
*     to type $ when finished)            *
*  4. Check the sequence for errors       *
*  5. Exit editor with ":q" or ":wq"      *
*  6. Save the file                       *
*                                         *
*******************************************

The program known as SEQED is used for the entry and editing of new (or old) sequence files. The program first queries the user as to whether he or she wishes to:

   1) Enter a new sequence
   2) Edit an old sequence

If one chooses to enter a new sequence the program queries the user for the name of the sequence file (sequence filename), the name of the sequence (sequence name) and finally, the actual sequence (using the standard single letter amino acid code). Sequences may be entered using either lower case letters, upper case letters or an arbitrary combination of both. In other words, sequence entry is case independent. The program also ignores blank characters so sequence entries may have as many blank spaces as desired. A "sequence ruler" is presented at the top of each sequence file entry line to permit quick identification of residue positions as they are typed. After each group of 50 characters has been entered, the user is expected to press RETURN so that a new sequence ruler can appear.
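The entry rules described so far (case independence, blanks ignored, at most 50 residues per ruler line) can be illustrated with a short Python sketch. This is our own illustration and not the SEQED source code; the function names normalize_sequence and split_into_lines are hypothetical.

```python
def normalize_sequence(typed: str) -> str:
    """Upper-case the typed residues and drop all blank characters."""
    return "".join(typed.split()).upper()


def split_into_lines(sequence: str, width: int = 50) -> list:
    """Split a normalized sequence into lines of at most `width` residues."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# Mixed case and arbitrary blanks, as the text says SEQED permits.
typed = "msdklihitd DSFDTDVIKA dgailvdfwa ewcgpckmia pildeladey qgkltvakln"
seq = normalize_sequence(typed)
lines = split_into_lines(seq)
for line in lines:
    print(line)
```

Applied to the tutorial sequence, the sketch yields one full line of 50 residues followed by a second line holding the remaining 10.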
Upon completion of the sequence entry, the user must enter the '$' character to indicate to the computer that the typing process has finished. Should any non-standard amino acid characters appear in the sequence file the program produces an error message and aborts the file saving procedure.

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":
1) There are no options available for this particular menu item.
******************************************************************************

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************
  *** Preliminaries ***                 *** Alignments ***
  1) Help                               10) Fast Alignment Search
  2) Enter/Edit a Sequence              11) Exhaustive Alignment Search
  3) Retrieve Sequence from Database    12) Align 2 or more sequences
  *** Structural Analysis ***           *** Scanning ***
  4) Sequence Statistics                13) Pattern Search
  5) Structure Prediction               14) Homology Search
  6) SEQSITE Pattern Search             15) Dot Plot
  7) Flexibility                        16) Database Reference Search
  8) Hydrophobic Moment                 17) File Viewer
  9) Hydrophobicity                      0) EXIT SEQSEE
Enter the number of the desired function
>> 2                                      <---

What would you like to do?
   1) Enter new sequence.
   2) Edit old sequence.
   0) Exit
Enter a number (then press return).
>> 1                                      <---

Seqed (Version 1.2)

Enter name for sequence.
Use underscores instead of blanks to separate words.
(eg. thioredoxin_human)
>> bacillus_redoxase                      <---

Enter each amino acid (one letter code).
You may enter up to 50 amino acids on one line.
Press RETURN to get a new prompt line.
When you are done enter $ and press RETURN.
         1         2         3         4         5
12345678901234567890123456789012345678901234567890
         |         |         |         |         |
msdklihitddsfdtdvikadgailvdfwaewcgpckmiapildeladey   <---
         1         2         3         4         5
12345678901234567890123456789012345678901234567890
         |         |         |         |         |
qgkltvakln$                                          <---

Title: bacillus_redoxase
MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEY
QGKLTVAKLN
~
~
~
~
~
:q                                        <---
Save this File? (Y/N):
>> y                                      <---
Enter output filename.
>> redoxase.seq                           <---

3. Retrieve Sequence from Database

*******************************************
*                                         *
*  1. Indicate how your query will        *
*     be entered                          *
*  2. Enter the search query (remember    *
*     to type "quit" when finished)       *
*  3. Check the output file for the       *
*     desired sequence or sequences       *
*  4. Edit the file (if desired)          *
*  5. Exit the editor with :q or :wq      *
*  6. Save the file                       *
*                                         *
*******************************************

The program SEQRET is designed specifically for the user to find and retrieve complete sequences from the PIR or SWISS-PROT databases using either the PIR/SWISS-PROT accession number or the protein name (or portion thereof). Note that multiple sequence identifiers using the conjunctive "&" symbol may be employed for increased specificity (eg. CYSTIC&FIBROSIS&HUMAN for HUMAN CYSTIC FIBROSIS). Thus one may seek and select only a single sequence for a specific purpose, or entire protein families to create special user-specified databases. The sequences may be saved and/or edited for further analysis (e.g. multiple alignments). All sequences are saved in a SEQFILE format. When using this function, the user is required to identify the method by which the query will be entered (either the keyboard or through a UNIX file) as well as the exact Id numbers or sequence names which must be searched for in the database. Note that the user MUST type "quit" on the final line of his or her search string. The word "quit" is used by the program as a termination flag and is essential for proper functioning of the program.
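The query conventions just described (one search string per line, "&" as a conjunction, "quit" as the termination flag) can be sketched in a few lines of Python. This is a hypothetical illustration and not the SEQRET implementation. In particular, we assume that matching is case independent, that an underscore in a query stands for a blank, and that separate query lines are treated as alternatives; none of these details is confirmed by the program source. The function names parse_queries and matches are our own.

```python
def parse_queries(lines):
    """Collect search strings until the 'quit' termination flag is seen."""
    queries = []
    for line in lines:
        text = line.strip()
        if text.lower() == "quit":
            break
        if text:
            # '&' joins terms that must all be present; '_' stands for a blank.
            terms = [t.strip().replace("_", " ").upper() for t in text.split("&")]
            queries.append([t for t in terms if t])
    return queries


def matches(title, terms):
    """True when every term occurs somewhere in the entry title."""
    upper = title.upper()
    return all(term in upper for term in terms)


# Entry titles taken from the sample SEQRET output in this manual.
titles = [
    "THIOREDOXIN I - YEAST (SACCHAROMYCES CEREVISIAE)",
    "THIOREDOXIN PRECURSOR - ESCHERICIA COLI",
    "THIOREDOXIN - ANABAENA SP.",
]
queries = parse_queries(["THIOREDOXIN & YEAST", "quit", "IGNORED"])
hits = [t for t in titles if any(matches(t, q) for q in queries)]
print(hits)
```

In this sketch only the yeast entry is retained, since it is the only title containing both terms, and the line following "quit" is never read.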
The user is also required to provide a name for the output file (we suggest using the suffix ".scan" for consistency). Note that only the protein (or peptide) name and its sequence are included in the output.

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":
1) Choice of retrieval from PIR, PIR_IG, SWISS-PROT or SWISS-PROT_IG database formats.
2) Choice of file-update frequency (i.e. every 100, 500 or 1000 sequences).
******************************************************************************

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************
  *** Preliminaries ***                 *** Alignments ***
  1) Help                               10) Fast Alignment Search
  2) Enter/Edit a Sequence              11) Exhaustive Alignment Search
  3) Retrieve Sequence from Database    12) Align 2 or more sequences
  *** Structural Analysis ***           *** Scanning ***
  4) Sequence Statistics                13) Pattern Search
  5) Structure Prediction               14) Homology Search
  6) SEQSITE Pattern Search             15) Dot Plot
  7) Flexibility                        16) Database Reference Search
  8) Hydrophobic Moment                 17) File Viewer
  9) Hydrophobicity                      0) EXIT SEQSEE
Enter the number of the desired function
>> 3                                      <---

Seqret (Version 1.2)

How will you enter your search queries?
   1) Protein Name(s) entered from the keyboard
   2) Protein Name(s) taken from a file
   3) Protein Id(s) entered from the keyboard
   4) Protein Id(s) taken from a file
   0) Exit program
Enter a number (then press return)
>> 1                                      <---

Enter one search string per line.
Use underscores instead of blanks (eg. CYSTIC_FIBROSIS).
Use '&' symbol for conjunction (eg. FIBROSIS & CYSTIC).
Type QUIT (then press return) when done.
>> THIOREDOXIN <--- >> quit <--- Reading database file: /sirius/seqsee/databases/pir/* Proteins scanned: 1000 Matches found...: 8 Proteins scanned: 2000 Matches found...: 8 Proteins scanned: 3000 Matches found...: 8 ~ ~ ******************************************************************* Program......: seqret (version 1.2) Description..: Sequence Retrieval Results Date.........: Thu Feb 16 14:01:34 1993 Database.....: PIR (Intelligenetics Version) Searchstrings: THIOREDOXIN ******************************************************************* >TXBY1 THIOREDOXIN I - YEAST (SACCHAROMYCES CEREVISIAE) MVTQLKSASEYDSALASGDKLVVVDFFATWCTPCKMIAPMIEKFAEQYSD AAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAI ASNV >TXBY2 THIOREDOXIN II - YEAST (SACCHAROMYCES CEREVISIAE) MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQA DFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIA ANA >TXEC THIOREDOXIN PRECURSOR - ESCHERICIA COLI MLHQQRNQHARLIPVELYMSDKIIHLTDDSFDTDVLKADGAILVDFSATW CGPCKMIAPILDEIADEYQGKLTVAKLNIDQNPGTAPKYGIRGIPTLLLF KNGEVAATKVGALSKGQLKEFLDANLA >TXFK THIOREDOXIN - CORYNEFORM BACTERIUM ATCC11425 ATVKVDNSNFQSDVLQSSEPVVVDFWAEWCGPCKMIAPALDEIATEMAGQ VKIKLTVAKLNIDQNPGTAPKYGIRGIPTLLLFKNGEVAATKVGALSADW IKASA >TXAI THIOREDOXIN - ANABAENA SP. SAAAQVTDSTFKQEVLDSDVPVLVDFWAPWCGPCRMVAPVVDEIAQQYEG KIKVVTVAKLNIDQNPGTAPKYGIRGIPTLLLFKNGEVAATKVGALSADW TLEKHL ~ ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> trx.scan <--- 4. Sequence Statistics ******************************************* * * * 1. Indicate how your sequence will * * be entered * * 2. Enter the sequence or sequence * * filename * * 3. Check the output file for * * interesting information * * 4. Exit the editor with ":q" * * 5. Save the file * * * ******************************************* The STATS program carries out a simple statistical analysis of any given protein sequence. 
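To give a feel for what such an analysis involves, the following Python sketch (ours, not the STATS source code) computes the amino acid composition, the percentage of hydrophobic and hydrophilic residues, and the number of acidic and basic residues for the tutorial sequence bacillus_redoxase. The hydrophobic and hydrophilic residue sets are the ones listed in the STATS sample output. Counting only D and E as acidic and only K and R as basic is an assumption inferred from that output. All function names are hypothetical.

```python
from collections import Counter

SEQ = "MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEYQGKLTVAKLN"
HYDROPHOBIC = set("ACFGHILMVWY")  # set listed in the STATS sample output
HYDROPHILIC = set("DEKNPQRST")    # set listed in the STATS sample output


def composition(seq):
    """Return {residue: (count, percent)} for the residues present in seq."""
    counts = Counter(seq)
    return {aa: (n, round(100.0 * n / len(seq), 2)) for aa, n in sorted(counts.items())}


def percent_in(seq, residues):
    """Percentage of positions in seq whose residue belongs to `residues`."""
    return round(100.0 * sum(aa in residues for aa in seq) / len(seq), 2)


comp = composition(SEQ)
phobic = percent_in(SEQ, HYDROPHOBIC)
philic = percent_in(SEQ, HYDROPHILIC)
acidic = sum(SEQ.count(aa) for aa in "DE")  # assumption: acidic = D + E
basic = sum(SEQ.count(aa) for aa in "KR")   # assumption: basic = K + R
print(comp["D"], phobic, philic, acidic, basic)
```

For this sequence the sketch gives 9 aspartates (15.00 percent), 56.67 percent hydrophobic residues, 43.33 percent hydrophilic residues, 12 acidic and 5 basic residues, in agreement with the sample STATS output shown in this section.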
When using this program, the user is required to indicate how the sequence will be entered (via a SEQFILE in this case), what the name of the SEQFILE will be and what the name of the output file should be (we chose the suffix ".stat" for consistency). The output file provides information on the number of residues, molecular weight, the amino acid composition, average hydropathy (based on the Kyte-Doolittle parameters), total charge, predicted isoelectric point, expected quantity of exposed and interior surface area (Miller et al., 1987), expected packing volume (Richards, 1977), predicted specific volume, aggregation potential (Fisher, 1964), estimated solvation free energy of folding (Chiche et al., 1990), expected folded-protein radius and many other values that may be of structural or statistical interest.

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":
1) Choice of hydrophobicity values (kyte.parms, hphob.* files or user-defined).
2) Choice of threshold or cutoff values.
3) Choice of definition for hydrophobic and hydrophilic amino acids.
4) Choice of molecular volume values (mol.volume or user-defined).
5) Choice of residue-specific surface area values (mol.surfarea or user-defined).
6) Choice of amino acid partial specific volumes (mol.parspecvol or user-defined).
7) Choice of residue-specific polar, nonpolar and charged surface areas (mol.asa or user-defined).
****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 4 <--- Sequence Statistics (Version 1.2) Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return). 
>> 1 <--- Enter input filename: >> redoxase.seq <--- **************************************************************** Program......: stats (version 1.2) Description..: Statistical Analysis of a Sequence Date.........: Thu Feb 16 13:02:21 1993 Sequence Name: bacillus_redoxase 1 MSDKLIHITD DSFDTDVIKA DGAILVDFWA EWCGPCKMIA PILDELADEY 51 QGKLTVAKLN **************************************************************** Molecular Weight......: 6684.86 Amino acids...........: 60 Mean residue weight...: 111.41 *** Amino Acid Composition *** Amino Freq Freq E(Freq) Weight E(weight) Acid (total) (percent) (percent) (percent) (percent) A 6 10.00 <8.84> 6.40 <5.73> C 2 3.33 <2.09> 3.09 <1.97> D 9 15.00 <5.89> 15.54 <6.17> E 3 5.00 <5.90> 5.81 <6.94> F 2 3.33 <3.70> 4.42 <4.96> G 3 5.00 <8.29> 2.57 <4.31> H 1 1.67 <2.12> 2.06 <2.64> I 6 10.00 <5.40> 10.18 <5.57> K 5 8.33 <6.22> 9.01 <7.27> L 6 10.00 <7.93> 10.18 <8.18> M 2 3.33 <1.97> 3.94 <2.35> N 1 1.67 <4.59> 1.71 <4.78> P 2 3.33 <4.51> 2.91 <3.99> Q 1 1.67 <3.75> 1.92 <4.37> R 0 0.00 <4.21> 0.00 <6.00> S 2 3.33 <6.59> 2.61 <5.23> T 3 5.00 <5.96> 4.55 <5.49> V 3 5.00 <7.12> 4.46 <6.43> W 2 3.33 <1.37> 5.59 <2.32> Y 1 1.67 <3.56> 2.45 <5.30> Note: E(x) are expected values based on average amino acid content of soluble proteins. 
************************************************************** Hydrophobicity Parameters: /canopus/rbo/seqsee/lib/kyte.parms Average Hydrophobicity (ah)...................: 0.78 Notes: ah = -2.67 --> Average Protein ah > 0.10 --> Hydrophobic Protein ah < -6.00 --> Hydrophilic Protein Ratio of Hydrophilicity to Hydrophobicity (rh): 0.95 Notes: rh = 1.22 --> Average Protein rh > 1.90 --> Non-folding Protein rh < 0.85 --> Insoluble Protein Percentage of Hydrophobic residues............: 56.67 Notes: Average percentage is 52.44 Hydrophobic Amino Acids are ACFGHILMVWY Percentage of Hydrophilic residues............: 43.33 Notes: Average percentage is 47.56 Hydrophilic Amino Acids are DEKNPQRST Ratio of %Hydrophilic to %Hydrophobic.........: 0.76 Notes: rhp = 0.91 --> Average Protein rhp > 1.43 --> Non-folding Protein rhp < 0.77 --> Insoluble Protein ************************************************************** Number of Basic amino acids: 5 Number of Acidic amino acids: 12 Estimated pI for protein....: 4.60 pH: 3 4 5 6 7 8 9 10 11 Charge: 7.1 3.7 -2.6 -5.0 -5.9 -7.0 -9.0 -11.9 -14.4 Total linear charge density.: 0.32 ************************************************************** Polar Area of Extended Chain...............: 3666.20 Angs**2 Non-Polar Area of Extended Chain...........: 6923.10 Angs**2 Total Area of Extended Chain ..............: 10359.60 Angs**2 Polar ASA of Folded Protein................: 1117.84 Angs**2 Non-Polar ASA of Folded Protein............: 2839.88 Angs**2 ASA of folded protein .....................: 3957.72 Angs**2 Ratio of Folded to Extended Area...........: 0.40 ************************************************************* Buried Polar Area of Folded Protein........: 2096.61 Angs**2 Buried Non-polar Area of Folded Protein....: 3654.08 Angs**2 Buried Charge Area of Folded Protein.......: 239.61 Angs**2 Total Buried Surface.......................: 5990.30 Angs**2 Expected Number and Fraction of Residues 95% Buried A: 1 (0.166) C: 1 (0.284) D: 0 
(0.038) E: 0 (0.022) F: 1 (0.291)
G: 0 (0.127) H: 0 (0.127) I: 2 (0.317) K: 0 (0.004) L: 2 (0.284)
M: 1 (0.304) N: 0 (0.041) P: 0 (0.056) Q: 0 (0.038) R: 0 (0.013)
S: 0 (0.069) T: 0 (0.079) V: 1 (0.271) W: 0 (0.218) Y: 0 (0.085)
Number of buried Amino Acids...............: 7
*************************************************************
Packing Volume (estimate)..................: 8300.24 Angs**3
Packing Volume (actual)....................: 8149.90 Angs**3
Interior Volume of Protein.................: 4056.60 Angs**3
Exterior Volume of Protein.................: 4093.40 Angs**3
Partial Specific Volume....................: 0.73 ml/g
Fisher Volume Ratio (actual)...............: 1.01
Fisher Volume Ratio (idealized)............: 1.50
>>> Molecule likely forms dimer or multimer (aggregates). <<<
Protein Solubility.........................: 1.47
Notes: solubility = 1.6 --> Average Protein
       solubility < 1.1 --> Insoluble Protein
>>> Protein is likely water soluble. <<<
*************************************************************
Radius of Protein..........................: 15.17 Angs
RMS end to end distance of Ext. chain......: 81.24 Angs
Radius of Gyration of Extened chain........: 33.17 Angs
*************************************************************
Solvation Free Energy of Folding...........: -43.38 kcal/mol
~
~
~
~
~
:q                                        <---
Save this File? (Y/N):
>> y                                      <---
Enter output filename.
>> redoxase.stat                          <---

5. Structure Prediction

*******************************************
*                                         *
*  1. Indicate how your sequence will     *
*     be entered                          *
*  2. Enter the sequence or sequence      *
*     filename                            *
*  3. Check the output file for           *
*     interesting information             *
*  4. Exit the editor with ":q"           *
*  5. Save the file                       *
*                                         *
*******************************************

The Structure Prediction function is a comprehensive structural analysis program which has been developed expressly for the SEQSEE software suite.
This program performs calculations on the extent and location of potential membrane spanning regions, the identification of short sequence folding motifs, the prediction of the protein folding class and the prediction of secondary structure using the cumulative results of six different and well-tested methods. When using this program, the user is required to indicate how the sequence will be entered (via a SEQFILE in this case), what the name of the SEQFILE will be and what the name of the output file should be (we chose the suffix ".struc" for consistency). The output file provides information on the location (if any) of membrane spanning regions, the identification of the probable protein folding class, the identification and location of all sequence/structure motifs and a complete prediction of the secondary structure of the protein or peptide. In the latter case a three state prediction is used where we have defined H = helix, B = beta strand and C = coil. Note that in the output file, the following designations apply:

   Homology   Structure prediction using the homology method of Levin et al. (1988).
   Moment     Structure prediction using the hydro-moment method of Eisenberg (1984).
   GOR        Structure prediction using the method of Garnier et al. (1978).
   Chou-Fas   Structure prediction using the method of Chou and Fasman (1978).
   MotifLit   Structure prediction using sequence/structure motifs taken from the literature.
   MotifCmp   Structure prediction using sequence/structure motifs from computer searches.
   Consens    Structure prediction using a weighted sum of the above methods.

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":
1) Choice of membrane spanning hydrophobicity values (kyte.parms, hphob.* files or user-defined).
2) Choice of scaling constants for membrane spanning test.
3) Choice of sequence motifs for identifying secondary structures (seqmotifX.db or user-defined).
4) Choice of statistical summary reporting frequency (i.e. every 100, 500 or 1000 sequences).
5) Option to print individual motifs which match to the query sequence.
6) Option to print prediction and scoring arrays.
7) Choice of structure database for homology structure prediction (SEQBANK.db or user-defined).
8) Choice of scoring matrix to perform homology-based secondary structure prediction.
9) Choice of minimum threshold (test-stat) to identify significant homology.
10) Choice of structure-weighted scoring multipliers.
11) Choice of off-set values and multipliers for score normalization.
12) Option to apply smoothing functions "x" times to predictions.
13) Choice of weighting constants to force N and C-terminal predictions to be COIL.
14) Choice of scaling constants to weight HELIX or BETA predictions differently.
15) Option to smooth predicted structure (reduces "noise").
16) Choice of hydrophobic moment parameters (moment.parms or user-defined).
17) Choice of number and type of hydrophobic periodicity tests for HELIX, BETA and COIL.
18) Choice of window size and weighting factors for HELIX, BETA and COIL.
19) Choice of GOR parameters for GOR secondary structure prediction (gor.new or gor.orig).
20) Choice of Chou-Fas parameters for secondary structure prediction (cfas.parms or user-defined).
****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 5 <--- Alexis - Structure Prediction (Version 1.2) Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return). >> 1 <--- Enter input filename: >> trx.seq <--- ************************************************************* Program......: alexis (version 1.2) Description..: Structure Prediction/Analysis Date.........: Thu Feb 16 13:04:16 1993 Sequence Name: THIOREDOXIN - ESCHERICHIA COLI Amino acids..: 108 ************************************************************* *** Membrane Spanning Region Check *** No membrane spanning region found. 
*** Structural Motifs from Literature *** *************(1)************ Amino Acid......: 29 Sequence Matched: AEWCGPC Database Motif..: [TA]*WC[AG][PH]C Motif Prediction: BCCCCHH Reference.......: THIOREDOXIN ACTIVE SITE I *************(2)************ Amino Acid......: 63 Sequence Matched: NPG Database Motif..: [PGDN][PG][PGDN] Motif Prediction: CCC Reference.......: 90% ACCURATE MOTIFS ************************************************************* Expected % alpha helix content: 39 Expected % beta sheet content: 32 Expected % coil content: 27 Correlation Coefficients for Protein Folding Class: A = 0.937100 B = 0.874118 M = 0.939555 Protein belongs to MIXED folding class ************************************************************* *** Secondary Structure Prediction *** Sequence:SDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQ Homology:CCBBBBBBCCHHHHHHHCCCCBBBBBBBBCCCCHHHHHHHHHHHHHHHHC Moment..:CCCBBBBCCCCCCBBBBCCCCHHHHHHHHHHCCCHHHHHHHHHHHHHHCC GOR.....:CCCBBBBBCCCCHHHHHHHHHHHBBHHHHHCCCCCBBBCCHHHHHHHHHH Chou-Fas:CCCHHHHHCCCCHHHHHHCCCCHHHHHHHHHCCCCCBHHHHHHHHHHHCC MotifLit:XXXXXXXXXXXXXXXXXXXXXXXXXXXXBCCCCHHXXXXXXXXXXXXXXX MotifCmp:XXXXXXXXXCXXXXXXXCXXXXXXXXXXXXXXXCXXXXHHHXXXXXXXXX Consens.:CCBBBBBBCCHHHHHHHCCCCBBBBBBBBCCCCHHHHHHHHHHHHHHHHC ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> trx.struc <--- 6. Seqsite Pattern Search ******************************************* * * * 1. Indicate how your sequence will * * be entered * * 2. Enter the sequence or sequence * * filename * * 3. Check the output file for * * interesting information * * 4. Exit the editor with ":q" * * 5. Save the file * * * ******************************************* This procedure allows the user to search any given sequence for active sites, binding sites, signature sequences, phosphorylation sites, antigenic sites, and related functional or structural sequence patterns. A library of more than 1000 signature sequence patterns is contained in the SEQSITE database. 
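As an illustration of this kind of pattern search, the Python sketch below scans the tutorial sequence bacillus_redoxase for the thioredoxin active-site motif reported in the tutorial's SEQSITE output. It is our own sketch and not the SEQSITE implementation. We assume that "*" in a motif stands for any single residue and that bracketed groups list the residues allowed at one position; this reading is inferred from the sample match and is not taken from the program documentation. The function name motif_to_regex is hypothetical.

```python
import re

SEQ = "MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEYQGKLTVAKLN"
MOTIF = "*[STA]*[WG]C[AVG][PH]C*"


def motif_to_regex(motif):
    """Translate a motif into a regular expression.

    Assumption: '*' means 'any one residue'; bracket groups are kept as is.
    """
    return re.compile(motif.replace("*", "."))


m = motif_to_regex(MOTIF).search(SEQ)
if m:
    # Report 1-based, inclusive residue positions.
    print("Sequence Matched:", m.group(0))
    print("Amino Acids.....: %d-%d" % (m.start() + 1, m.end()))
```

Under these assumptions the sketch reports the segment WAEWCGPCK at residues 29-37, the same match shown in the tutorial output.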
An additional 50 phosphorylation sites are found in the PHOSITE database and a further 20 generalized antigenic sites are found in the EPISITE database. This type of "function search" is extremely useful for determining the properties and features of newly sequenced or poorly characterized proteins. When using this program, the user is required to indicate how the sequence will be entered (via a SEQFILE in this case), what the name of the SEQFILE will be and what the name of the output file should be (we chose the suffix ".site" for consistency). The output file provides information on all sequence motifs identified, including: the sequence that was matched in the query protein, where this match occurred, the identity of the matching sequence motif, the most current reference describing the sequence motif and the name of the sequence motif as it is most commonly referred to in the literature.
******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":
1) Choice of general sequence motif database (SEQSITE.db or user-defined).
2) Choice of general phosphorylation site database (PHOSITE.db or user-defined).
3) Choice of T-cell and B-cell antigenic site database (EPISITE.db or user-defined).
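The motif notation used by the pattern search can be illustrated with a short Python sketch. This is not the SEQSITE program itself. It assumes, based only on the sample output in this section (where [TA]*WC[AG][PH]C matches AEWCGPC), that "*" stands for exactly one residue of any type and that a bracketed group lists the residues allowed at one position. The function names motif_to_regex and find_motifs are hypothetical.

```python
import re


def motif_to_regex(motif):
    """Translate a SEQSITE-style motif into a Python regular expression.

    Assumption (inferred from the sample output, not from the SEQSITE
    source): '*' means exactly one residue of any type, and '[..]' lists
    the residues allowed at a single position.
    """
    return motif.replace("*", ".")


def find_motifs(sequence, motif):
    """Return (matched_text, first_residue, last_residue) tuples.

    Residue numbers are 1-based. A lookahead is used so that
    overlapping matches are also reported.
    """
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    hits = []
    for m in pattern.finditer(sequence):
        start = m.start(1)
        text = m.group(1)
        hits.append((text, start + 1, start + len(text)))
    return hits


# The 60-residue query used in the SEQSITE example session.
query = "MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEYQGKLTVAKLN"

for text, first, last in find_motifs(query, "*[TA]*WC[AG][PH]C*"):
    print(text, first, last)  # prints: WAEWCGPCK 29 37
```

Under these assumptions the sketch reports the same matched sequence and residue range (29-37) as the SEQSITE sample session for this query.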
****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 6 <--- SEQSITE Pattern Search (Version 1.2) Please select a sequence motif database 1) SEQSITE.db (general sequence motifs) 2) PHOSITE.db (general phosphorylation sites) 3) EPISITE.db (antigenic sites) Enter a number (then press return). >> 1 <--- Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return). 
>> 1 <--- Enter input filename: >> redoxase.seq <--- **************************************************************** Program......: seqsite (version 1.2) Description..: Search for Interesting Motifs Date.........: Thu Feb 16 13:02:21 1993 Sequence Name: bacillus_redoxase 1 MSDKLIHITD DSFDTDVIKA DGAILVDFWA EWCGPCKMIA PILDELADEY 51 QGKLTVAKLN Database.....: /sirius/local/seqsee/databases/seqsite.db **************************************************************** **********(1)********* Motif Matched...: *[TA]*WC[AG][PH]C* Sequence Matched: WAEWCGPCK Amino Acids.....: 29-37 GLEASON, F.R. ET AL., FEMS MICRO REV. 54:271-297(1988) ACTIVE SITE FOR PROKARYOTIC/EUKARYOTIC THIOREDOXIN-LIKE MOLECULES Number of motifs found..: 1 Number of motifs scanned: 1110 ~ ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> redoxase.site <--- 7. Flexibility ******************************************* * * * 1. Indicate how your sequence will * * be entered * * 2. Enter the sequence or sequence * * filename * * 3. Check the output file for * * interesting information * * 4. Exit the editor with ":q" * * 5. Save the file * * * ******************************************* The program named FLEQSEE predicts the flexibility and mobility of various regions in a protein based on sequence information alone. Flexibility is calculated on the basis of the Karplus algorithm (Karplus and Schulz, 1985). In SEQSEE, flexibility may be used to determine the position and length of coil regions by locating all "significant" maxima (those maxima which exceed a minimum threshold) in the flexibility plot. Flexibility plots may also be used to identify surface-seeking elements or to locate strongly antigenic regions of any given sequence. When using this program, the user is required to indicate how the sequence will be entered (via a SEQFILE in this case), what the name of the SEQFILE will be and what the name of the output file should be (we chose the suffix ".flex" for consistency). 
The output file provides a numeric representation of the flexibility profile of the input sequence. A legend located at the top of the file provides a means of interpreting the numbers in quasi-physical terms.
******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":
1) Choice of output format ("raw" or "scaled" scores).
2) Choice of flexibility parameters (fleqsee.parms or user-defined).
3) Choice of window size (default = 7 residues).
4) Option to vary weighting constants and weighting procedures (triangular, parabolic, linear, etc.).
****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 7 <--- Fleqsee (Version 1.2) Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return).
>> 1 <--- Enter input filename: >> zipper.seq <--- (OUTPUT WHEN "RAW SCORE" OPTION IS SELECTED) ***************************************************************** Program......: fleqsee (version 1.2) Description..: Sequence Flexibility Scoring Date.........: Thu Feb 16 13:05:26 1993 Sequence Name: leu_zippper Amino Acids..: 35 Flex Parms...: /sirius/local/seqsee/lib/fleqsee.parms ***************************************************************** # | RESIDUE | B FACTOR # | RESIDUE | (NORMALIZED) --------------------------------- 1 | L | 9.61 2 | Q | 10.28 3 | R | 10.28 4 | M | 9.47 5 | K | 10.82 6 | Q | 10.28 7 | L | 9.67 8 | E | 10.36 9 | D | 10.53 10 | K | 10.82 11 | V | 9.82 12 | E | 10.36 13 | E | 10.36 14 | L | 9.61 15 | L | 9.61 16 | S | 10.36 17 | K | 10.93 18 | N | 10.06 19 | Y | 9.30 20 | H | 8.94 ~ ~ ~ ~ ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> zipper.flex <--- (OUTPUT WHEN "WEIGHTED SCORE" OPTION IS SELECTED) ***************************************************************** Program......: fleqsee (version 1.2) Description..: Sequence Flexibility Scoring Date.........: Thu Feb 16 13:05:26 1993 Sequence Name: leu_zippper Amino Acids..: 35 Flex Parms...: /sirius/local/seqsee/lib/fleqsee.parms *** Notes *** Flex Scores: 0 1 2 3 4 5 6 7 8 9 Low High Likely coil regions found when: i) Flexibility is very high (8 or 9) ii) Regions with strong maxima (eg. 12466531) ***************************************************************** Sequence...:LQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER Score......:98855566666555555433455543345555889 ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> zipper.flex <--- 8. Hydrophobic Moment ******************************************* * * * 1. Indicate how your sequence will * * be entered * * 2. Enter the sequence or sequence * * filename * * 3. Check the output file for * * interesting information * * 4. Exit the editor with ":q" * * 5. 
Save the file * * * *******************************************
MOMENT calculates the hydrophobic moment of a sequence using a modified Cornette et al. (1987) scale of hydrophobicity and the Fourier analysis technique of Eisenberg et al. (1984). Calculations are performed over a set "sequence window" of predefined length using a range of values specific to helical periodicities (90 to 120 degrees) and beta strand periodicities (0 and 160 to 180 degrees). When using this program, the user is required to indicate how the sequence will be entered (via a SEQFILE in this case), what the name of the SEQFILE will be and what the name of the output file should be (we chose the suffix ".mom" for consistency). The output file provides a numeric representation of the hydrophobic moment profile of the input sequence. A legend located at the top of the file provides a means of interpreting the numbers in quasi-physical terms. Hydrophobic moment periodicity values and weighting schemes may be altered using the control file (through File Viewer).
******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":
1) Choice of output format ("raw" or "scaled" scores).
2) Choice of hydrophobicity parameters (hmom.* files or user-defined).
3) Choice of number of hydrophobic periodicity tests.
4) Choice of type of hydrophobic periodicity tests (HELIX and/or BETA periodicity).
5) Choice of window size and periodicity angle for each periodicity test.
6) Control over application of smoothing functions.
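The Fourier-style calculation described above can be sketched in Python as follows. This is a minimal illustration of an Eisenberg-type hydrophobic moment at a single periodicity angle, not the MOMENT program itself. The hydrophobicity values are copied from the HYDROPHOBICITY PARAMETERS table printed in the MOMENT sample output in this section. The window size of 7, the division by window length, and the function names hydrophobic_moment and moment_profile are assumptions made for the example. The BETA and HELIX scaling factors and any smoothing applied by SEQSEE are not reproduced.

```python
import math

# Values copied from the "HYDROPHOBICITY PARAMETERS" table shown in the
# MOMENT sample output of this manual (hmom.cornet).
HYDROPHOBICITY = {
    "A": -0.10, "C": 0.45, "D": -0.57, "E": -0.39, "F": 0.51,
    "G": -0.13, "H": -0.06, "I": 0.55, "K": -0.56, "L": 0.68,
    "M": 0.48, "N": -0.19, "P": -0.45, "Q": -0.53, "R": 0.07,
    "S": -0.19, "T": -0.40, "V": 0.54, "W": 0.02, "Y": 0.33,
}


def hydrophobic_moment(window, angle_deg):
    """Magnitude of the hydrophobicity Fourier sum at one periodicity angle."""
    delta = math.radians(angle_deg)
    sin_sum = sum(HYDROPHOBICITY[r] * math.sin(n * delta)
                  for n, r in enumerate(window))
    cos_sum = sum(HYDROPHOBICITY[r] * math.cos(n * delta)
                  for n, r in enumerate(window))
    return math.hypot(sin_sum, cos_sum)


def moment_profile(sequence, angle_deg, window_size=7):
    """Moment per residue for every full window along the sequence.

    Assumption: window_size=7 and the division by window length are
    choices made for this sketch, not values taken from SEQSEE.
    """
    return [
        hydrophobic_moment(sequence[i:i + window_size], angle_deg) / window_size
        for i in range(len(sequence) - window_size + 1)
    ]


zipper = "LQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER"
helix_profile = moment_profile(zipper, 100)  # angle in the helical range
beta_profile = moment_profile(zipper, 170)   # angle in the beta strand range
print(max(helix_profile), max(beta_profile))
```

Scanning several angles within each range and keeping the largest value per window would be one way to cover the "range of values" mentioned in the text.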
****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 8 <--- Hydrophobic Moment (Version 1.2) Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return). 
>> 1 <--- Enter input filename: >> zipper.seq <--- (OUTPUT WHEN "RAW SCORE" OPTION IS SELECTED) ***************************************************************** Program......: moment (version 1.2) Description..: Hydrophobic Moment Scoring Date.........: Thu Feb 16 13:05:26 1993 Sequence Name: leu_zippper Amino Acids..: 35 Flex Parms...: /sirius/local/seqsee/lib/hmom.cornet *** Notes *** BETA SCALING FACTOR: 0.60 HELIX SCALING FACTOR: 0.42 HYDROPHOBICITY PARAMETERS ---------------------------- | A -0.10 M 0.48 | | C 0.45 N -0.19 | | D -0.57 P -0.45 | | E -0.39 Q -0.53 | | F 0.51 R 0.07 | | G -0.13 S -0.19 | | H -0.06 T -0.40 | | I 0.55 V 0.54 | | K -0.56 W 0.02 | | L 0.68 Y 0.33 | ---------------------------- ***************************************************************** # | RESIDUE | RAW BETA | RAW HELIX ---------------------------------------------- 1 | L | 0.25 | 0.21 2 | Q | 0.20 | 0.22 3 | R | 0.23 | 0.30 4 | M | 0.05 | 0.32 5 | K | 0.14 | 0.36 6 | Q | 0.22 | 0.29 7 | L | 0.24 | 0.30 8 | E | 0.27 | 0.30 9 | D | 0.25 | 0.29 ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> zipper.mom <--- (OUTPUT WHEN "WEIGHTED SCORE" OPTION IS SELECTED) ***************************************************************** Program......: moment (version 1.2) Description..: Hydrophobic Moment Scoring Date.........: Thu Feb 16 13:05:52 1993 Sequence Name: leu_zippper Amino Acids..: 35 Moment Parms.: /sirius/local/seqsee/lib/kyte.parms *** Notes *** Moment scores........: 0 1 2 3 4 5 6 7 8 9 Low High High helix scores indicate likely helical regions High beta scores indicate likely beta strands ***************************************************************** Sequence...:LQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER Helix Score:67899799999887766555667788888889764 Beta Score.:21484445665432234555554345553203142 ~ ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> zipper.mom <--- 9. 
Hydrophobicity ******************************************* * * * 1. Indicate how your sequence will * * be entered * * 2. Enter the sequence or sequence * * filename * * 3. Check the output file * * for interesting information * * 4. Exit the editor with ":q" * * 5. Save the file * * * *******************************************
This procedure calculates the smoothed hydrophobicity (over a window of user-defined length) of any given sequence using a choice of several hydrophobicity scales. The operator may choose from the Eisenberg consensus scale (Eisenberg et al., 1984), the Kyte-Doolittle scale (Kyte and Doolittle, 1982), the Cornette scale (Cornette et al., 1987) or the Parker-HPLC scale (Parker et al., 1986). Because coil regions, exposed loops and B-cell antigenic determinants tend to be hydrophilic, hydrophobicity charts may be used to approximate their positions in many proteins. When using this program, the user is required to indicate how the sequence will be entered (via a SEQFILE in this case), what the name of the SEQFILE will be and what the name of the output file should be (we chose the suffix ".hydro" for consistency). The output file provides a numeric representation of the hydrophobicity profile of the input sequence. A legend located at the top of the file provides a means of interpreting the numbers in quasi-physical terms. Hydrophobicity values and weighting schemes may be altered using the control file (through File Viewer).
******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":
1) Choice of output format ("raw" or "scaled" scores).
2) Choice of hydrophobicity parameters (hphob.* files or user-defined).
3) Choice of window size (default = 7 residues).
4) Option to vary weighting constants and weighting procedures (triangular, parabolic, linear, etc.).
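The window smoothing and weighting options listed above can be sketched in Python. This is only an illustration of a weighted moving average, not the HYDRO program. The scale values are copied from the HYDROPHOBICITY PARAMETERS table printed in the hydrophobicity sample output in this section (hphob.kyte). The function names, the shape of the triangular weights and the treatment of the sequence ends (the window is simply truncated) are assumptions, and the sketch is not expected to reproduce the scores shown in the sample session.

```python
# Values copied from the "HYDROPHOBICITY PARAMETERS" table shown in the
# hydrophobicity sample output of this manual (hphob.kyte).
KYTE = {
    "A": 0.18, "C": 0.25, "D": -0.35, "E": -0.35, "F": 0.28,
    "G": -0.04, "H": -0.32, "I": 0.45, "K": -0.39, "L": 0.38,
    "M": 0.19, "N": -0.35, "P": -0.16, "Q": -0.35, "R": -0.45,
    "S": -0.08, "T": -0.07, "V": 0.42, "W": -0.09, "Y": -0.13,
}


def window_weights(size, shape="flat"):
    """Weights for one window; use an odd size so the window is centred."""
    if shape == "flat":
        return [1.0] * size
    if shape == "triangular":
        half = size // 2
        # e.g. size 7 gives 1, 2, 3, 4, 3, 2, 1
        return [float(half + 1 - abs(i - half)) for i in range(size)]
    raise ValueError("unknown weighting shape: " + shape)


def smoothed_profile(sequence, size=7, shape="triangular"):
    """Weighted average hydrophobicity centred on each residue.

    Assumption: near the ends of the sequence the window is truncated
    and the remaining weights are renormalised.
    """
    weights = window_weights(size, shape)
    half = size // 2
    profile = []
    for center in range(len(sequence)):
        total = 0.0
        weight_sum = 0.0
        for offset, w in enumerate(weights):
            pos = center - half + offset
            if 0 <= pos < len(sequence):
                total += w * KYTE[sequence[pos]]
                weight_sum += w
        profile.append(total / weight_sum)
    return profile


zipper = "LQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER"
for residue, value in zip(zipper, smoothed_profile(zipper)):
    print(residue, round(value, 2))
```

A triangular window gives the central residue the most influence, whereas a flat window treats all residues in the window equally. Which of the two is preferable depends on how much local detail should survive the smoothing.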
****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 9 <--- Hydrophobicity (Version 1.2) Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return). 
>> 1 <--- Enter input filename: >> zipper.seq <--- (OUTPUT WHEN "RAW SCORE" OPTION IS SELECTED) ***************************************************************** Program......: hydro (version 1.2) Description..: Hydrophobicity Scoring of a Sequence Date.........: Thu Feb 16 13:05:26 1993 Sequence Name: leu_zippper Amino Acids..: 35 Flex Parms...: /sirius/local/seqsee/lib/hphob.kyte *** Notes *** SCALE FACTOR 1: -45 SCALE FACTOR 2: 45 HYDROPHOBICITY PARAMETERS ---------------------------- | A 0.18 M 0.19 | | C 0.25 N -0.35 | | D -0.35 P -0.16 | | E -0.35 Q -0.35 | | F 0.28 R -0.45 | | G -0.04 S -0.08 | | H -0.32 T -0.07 | | I 0.45 V 0.42 | | K -0.39 W -0.09 | | L 0.38 Y -0.13 | ---------------------------- ***************************************************************** # | RESIDUE | RAW SCORE --------------------------------- 1 | L | -0.10 2 | Q | -1.00 3 | R | -1.60 4 | M | -1.50 5 | K | -1.80 6 | Q | -1.50 7 | L | -1.40 8 | E | -1.70 9 | D | -1.70 ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> zipper.hydro <--- (OUTPUT WHEN "WEIGHTED SCORE" OPTION IS SELECTED) ***************************************************************** Program......: hydro (version 1.2) Description..: Hydrophobicity Scoring of a Sequence Date.........: Thu Feb 16 13:12:12 1993 Sequence Name: leu_zippper Amino Acids..: 35 Hydro Parms..: /sirius/local/seqsee/lib/kyte.parms *** Notes *** Hydro Scores: 0 1 2 3 4 5 6 7 8 9 Low High High scores indicate strong hydrophobic regions Low scores indicate strong hydrophilic regions ***************************************************************** Sequence...:LQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER Score......:95222222223455422222223344443455433 ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> zipper.hydro <--- 10. Fast Alignment Search ******************************************* * * * 1. Indicate how your sequence will * * be entered * * 2. Enter the sequence filename * * 3. 
Enter the number of alignments * * to be saved * * 4. Check the output file for * * interesting information * * 5. Exit the editor with ":q" * * 6. Save the file * * * *******************************************
FAST_ALIGN is a fast, k-tuple based alignment algorithm modeled loosely on the speed-up protocols incorporated in Lipman and Pearson's FASTA (1988) and Altschul et al.'s BLAST (1990). The program is capable of searching the complete PIR database and then ordering and aligning 50 homologous matches of a 100 residue query sequence in less than 90 seconds. FAST_ALIGN may be used to align sequences against the PIR, SWISS-PROT or a user-specified database in the SEQFILE format. Several scoring matrices are available: the Unity matrix, the Dayhoff PAM 250 matrix (Dayhoff et al., 1983), the McLachlan matrix (McLachlan, 1971) and the RBO matrix. The RBO matrix is the default scoring matrix. When using this program, the user is required to indicate how the sequence will be entered (via a SEQFILE in this case) and what the name of the SEQFILE will be. The user is also required to provide the number of high scoring alignments that will be saved (often no more than 100-200 are required) as well as the name of an output file (we chose the suffix ".align" for consistency). The output file contains information on the identity of the protein where a potential alignment was found, the PIR or SWISS-PROT Id or accession number, an initial Fast Alignment Score, the Optimal Alignment Score and the number of exact matches found. Vertical lines (|) are used to identify exact matches and asterisks (*) are used to identify homologous matches.
******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":
1) Choice of sequence databases to be scanned (PIR, PIR_IG, SWISS-PROT or SWISS-PROT_IG).
2) Choice of scoring matrix (wt.align or user-defined).
3) Choice of minimum score to designate homologous residue pairs.
4) Choice of gap insertion and gap extension penalties. ****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 10 <--- Fast Alignment (Version 1.2) Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return). >> 1 <--- Enter input filename: >> redoxase.seq <--- This program keeps track of the top 'x' alignments. Enter a value for 'x' where 0 < x < 500: >> 200 <--- Initializing lookup table... 
Reading database file: /sirius/seqsee/databases/pir/* Proteins: 1000 BestScore: 6624 GroupScore: 6624 Proteins: 2000 BestScore: 6624 GroupScore: 495 Proteins: 3000 BestScore: 6624 GroupScore: 1147 ~ ~ *************************************************************** Program......: fast_align (version 1.2) Description..: Fast Alignment on database Date.........: Thu Feb 16 13:16:00 1993 Sequence Name: bacillus_redoxase Amino Acids..: 60 Database.....: PIR (Intelligenetics Version) Scoring Mat..: /sirius/local/seqsee/lib/wt.align Gap Penalty..: 20 Gap Size Pen.: 5 Tuple Cut-off: 48 *************************************************************** Number of proteins tested.: 44890 Number of alignments found: 200 ***********(1)********** Title....: Thioredoxin precursor - Escherichia coli Id.......: TXEC NW Score.: 6624 FastScore: 1224 Matches..: 56 Query Seq..: MSDKLIHITDDSFDTDVIKADGAILVDFWAEW 32 Matching...: ||||*||*|||||||||*|||||||||||||| Database...:MLHQQRNQHARLIPVELYMSDKIIHLTDDSFDTDVLKADGAILV DFWAEW 50 Query Seq..:CGPCKMIAPILDELADEYQGKLTVAKLN 60 Matching...:|||||||||||||*|||||||||||||| Database...:CGPCKMIAPILDEIADEYQGKLTVAKLNIDQNPGTAPKYGIRGIP TLLLF 100 Query Seq..: Matching Database...:KNGEVAATKVGALSKGQLKEFLDANLA 127 ~ ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> redoxase.align <--- 11. Exhaustive Alignment Search PIR/SWISS-PROT Database Option ******************************************* * * * 1. Select the database to be * * searched * * 2. Indicate how your sequence will * * be entered * * 3. Enter the sequence or sequence * * filename * * 4. Indicate the number of * * alignments to be saved * * 5. Check the output file for * * interesting information * * 6. Exit the editor with ":q" * * 7. Save the file * * * ******************************************* NW_ALIGN is a program which carries out an exhaustive pair-wise alignment of any given query sequence to all other sequences in a given database. 
Only those sequences with scores above a certain user-defined threshold are retained. The algorithm used for this procedure is based on the Needleman-Wunsch (1970) approach for pair-wise alignment. This dynamic programming method is guaranteed to find the optimal alignment between any two sequences for any given scoring matrix. Alignments can be done against the PIR, SWISS-PROT, SEQBANK or a user-defined database in the SEQFILE format. If alignments are done against SEQBANK, knowledge of the secondary structure is included to determine the location and length of gaps (Lesk et al., 1986). A choice of scoring matrices and gap penalties is available. The scoring matrices include: the Unity matrix, the Dayhoff PAM 250 matrix (Dayhoff et al., 1983), the McLachlan matrix (McLachlan, 1971) and the RBO matrix. The RBO matrix is the default scoring matrix. Other scoring matrices may be chosen by altering the SEQSEE control file (through the File Viewer option). Scores are rigorously calculated on the basis of comparisons to randomized sequence alignments as recommended by Dayhoff et al. (1983). With the PIR/SWISS-PROT option the program is extremely time consuming: a query sequence of 100 residues typically takes up to 4 hours to complete on a SUN SPARCstation. However, the improvement in overall alignment accuracy and the possibility of identifying very remote and previously unidentified relationships may well be worth the wait. To get around the problem of tying up the computer for long periods of time, the user may wish to place an exhaustive alignment run into the background. This can be done as follows:
1) Press the "control" and "z" keys simultaneously to temporarily stop the job.
2) Type "bg" and press the "return" key to restart the program in the background.
The results can be viewed at any time by re-opening the SEQSEE window and inspecting the *.tmp files that are automatically created and updated during the alignment run.
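The dynamic programming approach that underlies the exhaustive search can be illustrated with a compact Python sketch of Needleman-Wunsch global alignment. This is not the NW_ALIGN code. It assumes a Unity-style scoring scheme (1 for an identical pair, 0 otherwise) and a single linear gap penalty, whereas SEQSEE offers several scoring matrices and separate gap-insertion and gap-extension penalties. The function name needleman_wunsch and its parameters are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    Simplifications for this sketch: identity scoring and one linear
    gap penalty per gapped position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = -gap * i
    for j in range(1, cols):
        score[0][j] = -gap * j

    # Fill the matrix: each cell holds the best score for the prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] - gap,       # gap in b
                score[i][j - 1] - gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACD", "AD")
print(best)    # prints: 1
print(top)     # prints: ACD
print(bottom)  # prints: A-D
```

The matrix has one cell per residue pair, so the work grows with the product of the two sequence lengths. Repeating this for every entry in a large database is what makes an exhaustive search slow compared with a k-tuple pre-screen.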
When using this program, the user is required to identify which database he or she wishes to search (the PIR database in this case), how the sequence will be entered (via a SEQFILE in this case) and what the name of the SEQFILE will be. The user is also required to provide the number of high scoring alignments that will be saved (often no more than 50-100 is required) as well as the name of an output file (we chose the suffix ".align" for consistency). The output file contains information on the name of the protein where a potential alignment was found, the PIR or SWISS-PROT Id or accession number, the Optimal Alignment Score, the Alignment Test Stat score (the number of standard deviations away from an expected "random" Optimal Alignment Score -- with 5.0 being the minimum for a significant match) and the number of exact matches found. Vertical lines (|) are used to identify exact matches and asterisks (*) are used to identify homologous matches. ****************************************************************************** AVAILABLE OPTIONS IN "SEQSEE.PARMS": 1) Choice of sequence database to be scanned (PIR, PIR_IG, SWISS-PROT, SWISS-PROT_IG). 2) Choice of scoring matrix (wt.rbo, wt.dayhoff, wt.levin, wt.mclach, wt.unit). 3) Choice of minimum score to designate homologous residue pairs. 4) Choice of method to sort and score aligned sequences (raw score, per residue score or jumbled) 5) Choice of jumble test values and thresholds. 6) Choice of file-update frequency (i.e. every 100, 500 or 1000 sequences). 7) Choice of gap-insertion and gap-extension penalties. 
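The Alignment Test Stat described above (the number of standard deviations separating a real score from the scores obtained with randomized sequences) can be sketched in Python as follows. This is a hedged illustration of a jumble test in general, not the procedure implemented in NW_ALIGN. The toy identity_score function stands in for a real alignment score, and the number of shuffles, the seed and all function names are assumptions made for the example. Any alignment scorer with the same call signature could be passed in instead.

```python
import random
import statistics


def identity_score(a, b):
    """Toy stand-in for an alignment score: identical residues at the same index."""
    return sum(1 for x, y in zip(a, b) if x == y)


def jumble_test_stat(query, target, score=identity_score, n_shuffles=100, seed=1):
    """Standard deviations between the real score and shuffled-target scores.

    Assumption: the target is shuffled n_shuffles times, which preserves
    its composition and length, and the population standard deviation of
    the shuffled scores is used.
    """
    rng = random.Random(seed)
    real = score(query, target)
    residues = list(target)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        random_scores.append(score(query, "".join(residues)))
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    if sd == 0:
        # Degenerate case: every shuffle scored the same.
        return float("inf") if real > mean else 0.0
    return (real - mean) / sd


query = "MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEYQGKLTVAKLN"
stat = jumble_test_stat(query, query)
print(stat > 5.0)  # prints: True (a sequence compared with itself is highly significant)
```

With a statistic of this kind, the 5.0 cut-off quoted in the text means that a match is only treated as significant when its score lies at least five standard deviations above what shuffled sequences of the same composition produce.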
****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 11 <--- Which database do you wish to search? 1) Search PIR/SWISS-PROT database 2) Search SEQBANK database 0) Exit Enter a number (then press return). >> 1 <--- Exhaustive Alignment (Version 1.2) Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return). >> 1 <--- Enter input filename: >> redoxase.seq <--- This program keeps track of the top 'x' alignments. 
Enter a value for 'x' where 0 < x < 500: >> 50 <--- Reading database file: /sirius/seqsee/databases/pir/* Proteins Scanned: 100 BestScore: 3.12 GroupScore: 3.12 Proteins Scanned: 200 BestScore: 3.44 GroupScore: 3.44 Proteins Scanned: 300 BestScore: 3.44 GroupScore: 3.22 ~ ~ *************************************************************** Program......: nw_align (version 1.2) Description..: Best Alignment on PIR Database Date.........: Thu Feb 16 13:26:05 1993 Sequence Name: bacillus_redoxase Amino Acids..: 60 Database.....: PIR (Intelligenetics Version) Scoring Mat..: /sirius/local/seqsee/lib/wt.rbo Gap Penalty..: 10 Gap Size Pen.: 2 Sort Method..: 1 Random Seed..: 13791 *************************************************************** Number of proteins tested.: 44890 Number of alignments found: 50 ***********(1)********** Title....: Thioredoxin precursor -- Eschericia coli Id.......: TXEC Test Stat: 20.83 NW Score.: 1224 Matches..: 56 Query Seq..: MSDKLIHITDDSFDTDVIKADGAILVDFWAEW 32 Matching...: ||||*||*|||||||||*|||||||||||||| Database...:MLHQQRNQHARLIPVELYMSDKIIHLTDDSFDTDVLKADGAILV DFWAEW 50 Query Seq..:CGPCKMIAPILDELADEYQGKLTVAKLN 60 Matching...:|||||||||||||*|||||||||||||| Database...:CGPCKMIAPILDEIADEYQGKLTVAKLNIDQNPGTAPKYGIRGIP TLLLF 100 ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> redoxase.align <--- 11. Exhaustive Alignment Search SEQBANK option ******************************************* * * * 1. Select the database to be * * searched * * 2. Indicate how your sequence will * * be entered * * 3. Enter the sequence or sequence * * filename * * 4. Indicate the number of * * alignments to be saved * * 5. Check the output file for * * interesting information * * 6. Exit the editor with ":q" * * 7. Save the file * * * ******************************************* The Exhaustive Alignment Search carries out an exhaustive pair-wise alignment of any given query sequence to all other sequences in a given database. 
Only those sequences with scores above a certain user-defined threshold are retained. The algorithm used for this procedure is based on the Needleman-Wunsch (1970) approach for pair-wise alignment. This dynamic programming method is guaranteed to find the optimal alignment between any two sequences for any given scoring matrix. Alignments can be done against the PIR, SWISS-PROT, SEQBANK (as in this case) or a user-defined database in the SEQFILE format. If alignments are done against SEQBANK, knowledge of the secondary structure is included to determine the location and length of gaps (Lesk et al., 1986). A choice of scoring matrices and gap penalties is available. The scoring matrices include: the Unity matrix, the Dayhoff PAM 250 matrix (Dayhoff et al., 1983), the McLachlan matrix (McLachlan, 1971) and the RBO matrix. The RBO matrix is the default scoring matrix. Other scoring matrices may be chosen by altering the SEQSEE control file (through the File Viewer option). Scores are rigorously calculated on the basis of comparisons to randomized sequence alignments as recommended by Dayhoff et al. (1983). To get around the problem of tying up the computer for long periods of time, the user may wish to place this type of alignment run into the background. This can be done as follows:
1) Press the "control" and "z" keys simultaneously to temporarily stop the job.
2) Type "bg" and press the "return" key to restart the program in the background.
The results can be viewed at any time by re-opening the SEQSEE window and inspecting the *.tmp files that are automatically created and updated during the alignment run. When using this option, the user is required to identify which database he or she wishes to search (the SEQBANK database in this case), how the sequence will be entered (via a SEQFILE in this case) and what the name of the SEQFILE will be.
The user is also required to provide the number of high scoring alignments that will be saved (often no more than 10 are required) as well as the name of an output file (we chose the suffix ".align" for consistency). The output file contains information on the name of the protein where a potential alignment was found, the SEQBANK Id or accession number, the Optimal Alignment Score, the Alignment Test Stat score (the number of standard deviations away from an expected "random" Optimal Alignment Score -- with 5.0 being the minimum for a significant match) and the number of exact matches found. Vertical lines (|) are used to identify exact matches and asterisks (*) are used to identify homologous matches. The secondary structure of the SEQBANK protein where the match was made is also included in the output file.

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":

1) Choice of sequence/structure database to be scanned (SEQBANK.db or user-defined).
2) Choice of scoring matrix (wt.rbo, wt.dayhoff, wt.levin, wt.mclach, wt.unit).
3) Choice of minimum score to designate homologous residue pairs.
4) Choice of method to sort and score aligned sequences (raw score, per residue score or jumbled).
5) Choice of jumble test values and thresholds.
6) Choice of file-update frequency (i.e. every 10, 50 or 100 sequences).
7) Choice of gap-insertion and gap-extension penalties.
8) Choice of gap-insertion and gap-extension penalties in regions of secondary structure.
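The scoring idea described above (an optimal Needleman-Wunsch score, then a "Test Stat" expressed in standard deviations away from the scores of randomized sequences) can be sketched in a few lines of Python. This is only an illustration and not the SEQSEE source code: the function names, the identity ("unity"-style) scoring, the single linear gap cost and the number of shuffles are all assumptions made for the example, and SEQSEE's own gap-insertion/gap-extension scheme and jumble settings may differ.

```python
import random
from statistics import mean, stdev


def nw_score(a, b, match=1, mismatch=0, gap=2):
    """Global (Needleman-Wunsch) alignment score.

    Assumption: identity scoring and one linear gap cost per gapped residue,
    a simplification of separate gap-insertion and gap-extension penalties.
    """
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap * i]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,     # pair a[i-1] with b[j-1]
                            prev[j] - gap,       # a[i-1] opposite a gap
                            curr[j - 1] - gap))  # b[j-1] opposite a gap
        prev = curr
    return prev[-1]


def jumble_test_stat(query, target, n_shuffles=100, seed=13791):
    """Standard deviations separating the real score from shuffled scores."""
    rng = random.Random(seed)
    real = nw_score(query, target)
    shuffled_scores = []
    for _ in range(n_shuffles):
        letters = list(target)
        rng.shuffle(letters)  # same composition, randomized order
        shuffled_scores.append(nw_score(query, "".join(letters)))
    sd = stdev(shuffled_scores)
    if sd == 0:
        return float("inf")
    return (real - mean(shuffled_scores)) / sd


if __name__ == "__main__":
    q = "MSDKLIHITDDSFDTDVIKADGAILVDFWAEW"
    t = "MSDKIIHLTDDSFDTDVLKADGAILVDFWAEW"
    print(nw_score(q, t), round(jumble_test_stat(q, t), 2))
```

Under these assumptions, a value above 5.0 would be read as a significant match, in line with the threshold quoted in the text.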
******************************************************************************

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************

*** Preliminaries ***                 *** Alignments ***
 1) Help                              10) Fast Alignment Search
 2) Enter/Edit a Sequence             11) Exhaustive Alignment Search
 3) Retrieve Sequence from Database   12) Align 2 or more sequences

*** Structural Analysis ***           *** Scanning ***
 4) Sequence Statistics               13) Pattern Search
 5) Structure Prediction              14) Homology Search
 6) SEQSITE Pattern Search            15) Dot Plot
 7) Flexibility                       16) Database Reference Search
 8) Hydrophobic Moment                17) File Viewer
 9) Hydrophobicity                     0) EXIT SEQSEE

Enter the number of the desired function
>> 11 <---

Which database do you wish to search?
 1) Search PIR/SWISS-PROT database
 2) Search SEQBANK database
 0) Exit
Enter a number (then press return).
>> 2 <---

SEQBANK Database Alignment (Version 1.2)

Your amino acid sequence is now required:
 1) Read sequence from an input file
 2) Sequence to be entered via keyboard
 3) I do not have my sequence ready
Enter a number (then press return).
>> 1 <---

Enter input filename:
>> redoxase.seq <---

This program keeps track of the top 'x' alignments.
Enter a value for 'x' where 0 < x < 500: >> 10 <--- Reading database file: /sirius/seqsee/databases/SEQBANK.db Proteins Scanned: 10 BestScore: 2.57 GroupScore: 2.57 Proteins Scanned: 20 BestScore: 2.57 GroupScore: 2.16 Proteins Scanned: 30 BestScore: 3.01 GroupScore: 3.01 ~ ~ *************************************************************** Program......: sb_align (version 1.2) Description..: Find Best Alignments in SEQBANK Date.........: Thu Feb 16 13:45:15 1993 Sequence Name: bacillus_redoxase Amino Acids..: 60 Scoring Mat..: /sirius/local/seqsee/lib/wt.rbo Gap Penalty..: 10 Gap Size Pen.: 2 Sort Method..: 1 Random Seed..: 13791 *************************************************************** Number of proteins tested.: 267 Number of alignments found: 10 ************(1)************ Title....: THIOREDOXIN (E. COLI) Id.......: 242 Score....: 598 Test Stat: 8.79 Matches..: 56 Query Seq..:MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELAD EYQ Matching...: ||| || ||||||||| ||||||||||||||||||||||||||| |||| Database...: SDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQ Structure..: CCBBBBBBCCHHHHHHHHCCCBBBBBBBBCCCCCHHHHHHHHHHHHHHHC Query Seq..:GKLTVAKLN Matching...:||||||||| Database...:GKLTVAKLNIDQNPGTAPKYGIRGIPTLLLFKNGEVAATKVGAL SKGQLKE Structure..:CCBBBBBBBCCCCHHHHHHHHHHCCCBBBBBBCCCBBBBBBBCCH HHHHHH ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> redoxase.align <--- 12. Align 2 or More Sequences ******************************************* * * * 1. Indicate you wish to enter one * * or more sequences * * 2. Enter a sequence filename * * containing one or more sequences * * 3. Indicate you have finished * * entering the sequence files * * 4. Check the output file for * * interesting information * * 5. Exit the editor with ":q" * * 6. Save the file * * * ******************************************* The program MULT_ALIGN uses a modification of the pair-wise Needleman- Wunsch protocol to align two or more protein sequences. 
The method is closely related to the progressive alignment procedure first described by Barton and Sternberg (1987), which permits rapid and accurate multiple alignments for up to several hundred proteins. A consensus sequence is also produced for each multiple alignment. A choice of scoring matrices and gap penalties is available. Sequences which are to be aligned must be stored in SEQFILE format, either in the form of databases (for multiple alignments) or singly (for pair-wise alignments). The procedure for aligning more than two sequences (like the fast alignment search described in 8) is fundamentally heuristic in nature and so it cannot be proven that the resulting alignments are mathematically optimal.

When using this program, the user is required to have the sequences he or she wishes to align in one or more files before beginning this operation. The program prompts the user for sequence files and their filenames until the user indicates "I have entered all sequences". The user is also asked for the name of an output file (we prefer the suffix ".mult" for consistency). The program output includes the names and identification codes for all aligned proteins.

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":

1) Choice of scoring matrix (wt.rbo, wt.dayhoff, wt.levin, wt.mclach, wt.unit).
2) Choice of minimum score to designate homologous residue pairs.
3) Choice of method to sort and score aligned sequences (raw score, per residue score or jumbled).
4) Option to print all pairwise alignments.
5) Choice of threshold value to print consensus sequence.
6) Choice of gap-insertion and gap-extension penalties.
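To make the consensus threshold (option 5 above) more concrete, the following Python sketch derives a consensus line from sequences that have already been aligned. It is an illustration only: the function name, the treatment of gaps as non-matching positions and the use of "-" for columns without a consensus are assumptions, not a description of how MULT_ALIGN is implemented.

```python
from collections import Counter


def consensus(aligned, threshold_pct=70, gap_chars="- "):
    """Return one consensus character per alignment column.

    A residue is reported only if it occurs in at least `threshold_pct`
    percent of the sequences at that column; otherwise "-" is written.
    Assumption: gap characters never count towards a consensus residue.
    """
    if not aligned:
        return ""
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    out = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c not in gap_chars)
        symbol = "-"
        if counts:
            residue, n = counts.most_common(1)[0]
            if 100.0 * n / len(column) >= threshold_pct:
                symbol = residue
        out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    # Last ten aligned columns of six thioredoxin sequences.
    block = [
        "DSDVPVLVDF",
        "ESSIPVLVDF",
        "QSSEPVVVDF",
        "KSDVPVVVDF",
        "KADGAILVDF",
        "QSSEPSMVDF",
    ]
    print(consensus(block, threshold_pct=70))  # prints -S--P--VDF
```

With a threshold of 70, a residue must be present in at least five of six sequences to be reported, which is why only the S, P, V, D and F columns survive in this example.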
******************************************************************************

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************

*** Preliminaries ***                 *** Alignments ***
 1) Help                              10) Fast Alignment Search
 2) Enter/Edit a Sequence             11) Exhaustive Alignment Search
 3) Retrieve Sequence from Database   12) Align 2 or more sequences

*** Structural Analysis ***           *** Scanning ***
 4) Sequence Statistics               13) Pattern Search
 5) Structure Prediction              14) Homology Search
 6) SEQSITE Pattern Search            15) Dot Plot
 7) Flexibility                       16) Database Reference Search
 8) Hydrophobic Moment                17) File Viewer
 9) Hydrophobicity                     0) EXIT SEQSEE

Enter the number of the desired function
>> 12 <---

Multiple Alignment (Version 1.2)

Input of amino acid sequences for alignment
 1) I wish to enter one or more sequences.
 2) I am done entering all my sequences.
 0) Exit this program.
Enter a number (then press return).
>> 1 <---

Enter filename containing sequences.
Remember wildcard characters are acceptable within the filename:
>> trx.seqs <---

Reading Sequence: THIOREDOXIN PRECURSOR - ESCHERICHIA COLI
Reading Sequence: THIOREDOXIN - CORYNEFORM BACTERIUM ATCC11425
Reading Sequence: THIOREDOXIN - ANABAENA SP.
Reading Sequence: THIOREDOXIN M - SPINACH CHLOROPLAST
Reading Sequence: THIOREDOXIN M - SYNECHOCOCCUS SP.
Reading Sequence: THIOREDOXIN - RHODOBACTER SPHAEROIDES

Current number of sequences for alignment: 6

Input of amino acid sequences for alignment
 1) I wish to enter one or more sequences
 2) I am done entering all my sequences
 3) Exit this program
Enter a number (then press return).
>> 2 <---

Doing Pairwise Alignments.......
***************************************************************** Program......: mult_align (version 1.2) Description..: Align 2 or More Sequences Date.........: Thu Feb 16 13:25:11 1993 Scoring Mat..: /sirius/local/seqsee/lib/wt.align Gap Penalty..: 10 Gap Size Pen.: 2 Sort Method..: 0 Random Seed..: 13791 Consensus %..: 70 ***************************************************************** Printing Multiple Alignment *************************** Protein 1: THIOREDOXIN - ANABAENA SP. Protein 2: THIOREDOXIN M - SYNECHOCOCCUS SP. Protein 3: THIOREDOXIN - CORYNEFORM BACTERIUM ATCC11425 Protein 4: THIOREDOXIN - RHODOBACTER SPHAEROIDES Protein 5: THIOREDOXIN PRECURSOR - ESCHERICIA COLI Protein 6: THIOREDOXIN M - SPINACH CHLOROPLAST Protein 1: SAAAQ VTDSTFKQEVLDSDVPVLVDF 26 Protein 2: MSVAAA VTDATFKQEVLESSIPVLVDF 27 Protein 3: ATVK VDNSNFQSDVLQSSEPVVVDF 25 Protein 4: STVP VTDATFDTEVRKSDVPVVVDF 25 Protein 5: MLHQQRNQHARLIPVELYMSDKIIH LTDDSFDTDVLKADGAILVDF 46 Protein 6: KASAEKFIVQDVNDSGWKEFVLQSSEPSMVDF 32 Consensus: -----------------------------V-D--F---VL-S--P--VDF ~ ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> trx.mult <--- 13. Pattern Search PIR/SWISS-PROT Database Option ******************************************* * * * 1. Indicate which database or * * sequence file you wish to search * * 2. Enter the sequence pattern or * * patterns to be searched for * * 3. Check the output file for * * interesting information * * 4. Exit the editor with ":q" * * 5. Save the file * * * ******************************************* This procedure searches the SEQBANK, SWISS-PROT or PIR database or a single sequence of your own choosing to find exact pattern matches according to the following rules (note the sequence patterns are case INDEPENDENT): a) X Match exact residue specified where X = any amino acid b) !X Match any residue EXCEPT X c) * Wild card character--matches any amino acid d) [XYZ] "OR" braces--match X "or" Y "or" Z. 
e) X&Y       "AND" character--match X "and" Y no matter what the separation
f) X{2,8}Y   Match X and Y if separation is between 2 and 8 residues.
             "Range" braces--allow a range of wild card characters,
             i.e. {2,8} = 2 to 8 "*"
g) $**X      Match X if located 2 residues from N terminus.
             "Termination" characters are used to mark either the
             beginning (N terminus) or end (C terminus) of a sequence

Pattern Search (PSEARCH) is constructed to allow the user to enter several patterns at once, either on a single line (using the "&" feature) or on separate lines. Patterns appearing on separate lines are treated as "independent" patterns (meaning they don't have to appear in the same protein sequence) while patterns with "&" characters are viewed as "dependent" patterns (meaning they do have to appear in the same protein sequence). Some examples of sequence pattern searches are given below:

AA***K          Find all occurrences of 2 alanines together followed by
                any 3 residues followed by a single lysine.

AA!P!P!PK       Find all occurrences of 2 alanines together followed by
                any 3 residues (as long as they are NOT prolines)
                followed by a single lysine (i.e. look for AA***K except
                AAP**K, AA*P*K, AA**PK, AA*PPK, AAPP*K, AAPPPK).

[AG][AG]*[KR]   Find all occurrences of 2 alanines or 2 glycines or any
                combination of the two followed by any residue followed
                by a lysine or an arginine (i.e. look for AA*K, AG*K,
                GA*K, GG*K, AA*R, AG*R, GA*R and GG*R).

When using this subroutine, the user is required to identify which database he or she wishes to search (the PIR database in this case) and what the query sequence is (using a single letter amino acid code). Note that the user MUST type "quit" on the final line of his or her search string. The word "quit" is used by the program as a termination flag and is essential for proper functioning of the program. The user must also provide a name for the output file (we suggest using a ".pat" suffix for consistency).
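For readers who already know regular expressions, the pattern rules listed above map fairly directly onto them. The Python sketch below shows one possible translation. It is an illustration only and not PSEARCH itself: the function names are hypothetical, "!" is assumed to negate a single residue, "$" is assumed to appear only at the very start or very end of a pattern, and matches are reported without overlap.

```python
import re


def to_regex(pattern):
    """Translate one '&'-free PSEARCH-style pattern into a regex string."""
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "*":                      # wild card: any one residue
            out.append(".")
        elif c == "!":                    # !X: any residue except X
            i += 1
            out.append("[^" + re.escape(pattern[i]) + "]")
        elif c == "[":                    # [XYZ]: X or Y or Z
            j = pattern.index("]", i)
            out.append("[" + re.escape(pattern[i + 1:j]) + "]")
            i = j
        elif c == "{":                    # {m,n}: m to n wild cards
            j = pattern.index("}", i)
            out.append("." + pattern[i:j + 1])
            i = j
        elif c == "$":                    # terminus marker
            out.append("^" if i == 0 else "$")
        else:                             # exact residue
            out.append(re.escape(c))
        i += 1
    return "".join(out)


def find_matches(pattern, sequence):
    """Map each '&'-joined sub-pattern to its 1-based match positions.

    Returns an empty dict unless every sub-pattern occurs in the sequence,
    mirroring the "dependent" behaviour of "&".
    """
    hits = {}
    for part in pattern.split("&"):
        rx = re.compile(to_regex(part), re.IGNORECASE)  # case independent
        starts = [m.start() + 1 for m in rx.finditer(sequence)]
        if not starts:
            return {}
        hits[part] = starts
    return hits


if __name__ == "__main__":
    print(to_regex("AA!P!P!PK"))                      # prints AA[^P][^P][^P]K
    print(find_matches("AA***[RK]", "GGAACAARGG"))    # prints {'AA***[RK]': [3]}
```

Patterns entered on separate lines would simply be passed to find_matches one at a time, since they are independent of one another.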
The output file contains information on the name of the protein where a match was found, the PIR Id number and the location where the match begins in the database protein (DbRes). When the SEQBANK database is searched, the secondary structure of the SEQBANK protein where the match was made is also included in the output file.

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":

1) Choice of sequence database to be scanned (PIR, SWISS-PROT, SEQBANK.db or user-defined).
2) Option to allow multiple matches of a search string in a sequence.
3) Choice of file-update frequency (i.e. every 100, 500 or 1000 sequences).
******************************************************************************

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************

*** Preliminaries ***                 *** Alignments ***
 1) Help                              10) Fast Alignment Search
 2) Enter/Edit a Sequence             11) Exhaustive Alignment Search
 3) Retrieve Sequence from Database   12) Align 2 or more sequences

*** Structural Analysis ***           *** Scanning ***
 4) Sequence Statistics               13) Pattern Search
 5) Structure Prediction              14) Homology Search
 6) SEQSITE Pattern Search            15) Dot Plot
 7) Flexibility                       16) Database Reference Search
 8) Hydrophobic Moment                17) File Viewer
 9) Hydrophobicity                     0) EXIT SEQSEE

Enter the number of the desired function
>> 13 <---

Pattern Search (Version 1.2)

How do you wish to use Pattern Search?
 1) Search PIR/SWISS-PROT database
 2) Search SEQBANK database
 3) Search user-defined sequence
 0) Exit
Enter a number (then press return).
>> 1 <--- enter one sequence pattern per line: enter QUIT (then press return) when done: >> IKYLEFISEAIIHVL <--- >> quit <--- Reading database file: /sirius/seqsee/databases/pir/* Proteins Scanned: 1000 Matches: 0 Proteins Scanned: 2000 Matches: 0 Proteins Scanned: 3000 Matches: 0 Proteins Scanned: 4000 Matches: 0 Reading database file: /sirius/seqsee/databases/pir/* Proteins Scanned: 5000 Matches: 9 ~ ~ ~ ~ ******************************************************************* Program......: psearch (version 1.2) Description..: Pattern Search Results Date.........: Thu Feb 16 13:28:14 1993 Database.....: PIR (Intelligenetics Version) SearchStrings: IKYLEFISEAIIHVL ******************************************************************* ***********(1)********** Title.......: Myoglobin - California sealion Id..........: MYZC Amino Acids: 101-115 Sequence..:IKYLEFISEAIIHVL Matching..:IKYLEFISEAIIHVL ***********(2)********** Title......: Myoglobin - Gray seal and harbor seal Id.........: MYSLG Amino Acids: 101-115 Sequence..:IKYLEFISEAIIHVL Matching..:IKYLEFISEAIIHVL ~ ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> myo.pat <--- 13. Pattern Search (SEQBANK Option) ******************************************* * * * 1. Indicate which database or * * sequence file you wish to search * * 2. Enter the sequence pattern or * * patterns to be searched for * * 3. Check the output file for * * interesting information * * 4. Exit the editor with ":q" * * 5. Save the file * * * ******************************************* This procedure searches the SEQBANK, SWISS-PROT or PIR database or a single sequence of your own choosing to find exact pattern matches according to the following rules (note the sequence patterns are case INDEPENDENT): a) X Match exact residue specified where X = any amino acid b) !X Match any residue EXCEPT X c) * Wild card character--matches any amino acid d) [XYZ] "OR" braces--match X "or" Y "or" Z. 
e) X&Y       "AND" character--match X "and" Y no matter what the separation
f) X{2,8}Y   Match X and Y if separation is between 2 and 8 residues.
             "Range" braces--allow a range of wild card characters,
             i.e. {2,8} = 2 to 8 "*"
g) $**X      Match X if located 2 residues from N terminus.
             "Termination" characters are used to mark either the
             beginning (N terminus) or end (C terminus) of a sequence

Pattern Search (PSEARCH) is constructed to allow the user to enter several patterns at once, either on a single line (using the "&" feature) or on separate lines. Patterns appearing on separate lines are treated as "independent" patterns (meaning they don't have to appear in the same protein sequence) while patterns with "&" characters are viewed as "dependent" patterns (meaning they do have to appear in the same protein sequence). Some examples of sequence pattern searches are given below:

AA***K          Find all occurrences of 2 alanines together followed by
                any 3 residues followed by a single lysine.

AA!P!P!PK       Find all occurrences of 2 alanines together followed by
                any 3 residues (as long as they are NOT prolines)
                followed by a single lysine (i.e. look for AA***K except
                AAP**K, AA*P*K, AA**PK, AA*PPK, AAPP*K, AAPPPK).

[AG][AG]*[KR]   Find all occurrences of 2 alanines or 2 glycines or any
                combination of the two followed by any residue followed
                by a lysine or an arginine (i.e. look for AA*K, AG*K,
                GA*K, GG*K, AA*R, AG*R, GA*R and GG*R).

When using this subroutine, the user is required to identify which database he or she wishes to search (the SEQBANK database in this case) and what the query sequence is (using a single letter amino acid code). Note that the user MUST type "quit" on the final line of his or her search string. The word "quit" is used by the program as a termination flag and is essential for proper functioning of the program. The user must also provide a name for the output file (we suggest using a ".pat" suffix for consistency).
The output file contains information on the name of the protein where a match was found, the SEQBANK Id number and the location where the match begins in the database protein (DbRes). The secondary structure of the SEQBANK protein where the match was made is also included in the output file.

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":

1) Choice of sequence database to be scanned (PIR, SWISS-PROT, SEQBANK.db or user-defined).
2) Option to allow multiple matches of a search string in a sequence.
3) Choice of file-update frequency (i.e. every 100, 500 or 1000 sequences).
******************************************************************************

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************

*** Preliminaries ***                 *** Alignments ***
 1) Help                              10) Fast Alignment Search
 2) Enter/Edit a Sequence             11) Exhaustive Alignment Search
 3) Retrieve Sequence from Database   12) Align 2 or more sequences

*** Structural Analysis ***           *** Scanning ***
 4) Sequence Statistics               13) Pattern Search
 5) Structure Prediction              14) Homology Search
 6) SEQSITE Pattern Search            15) Dot Plot
 7) Flexibility                       16) Database Reference Search
 8) Hydrophobic Moment                17) File Viewer
 9) Hydrophobicity                     0) EXIT SEQSEE

Enter the number of the desired function
>> 13 <---

Pattern Search (Version 1.2)

How do you wish to use Pattern Search?
 1) Search PIR/SWISS-PROT database
 2) Search SEQBANK database
 3) Search user-defined sequence
 4) Exit
Enter a number (then press return).
>> 2 <---

enter each pattern one per line:
enter QUIT (then press return) when done:
>> AED**[MIL] <---
>> KST***[KRG]*[LIVM] <---
>> AA***[RK] <---
>> quit <---

*******************************************************************
Program......: psearch (version 1.2)
Description..: Pattern Search Results
Date.........: Thu Feb 16 13:28:24 1993
Database.....: SEQBANK
SearchStrings: AED**[MIL]
               KST***[KRG]*[LIVM]
               AA***[KR]
*******************************************************************

***********(1)***********
Title......: ALCOHOL DEHYDROGENASE (HORSE LIVER)
Id.........: 7
Amino Acids: 213-218

Sequence..: AACAAR
Matching..: AA***R
Structure.: HHHCCB

***********(2)***********
Title......: ALKALINE PHOSPHATASE (E. COLI)
Id.........: 9
Amino Acids: 444-449

Sequence..: AALGLK
Matching..: AA***K
Structure.: HHHHCC
~
~
~
~
~
:q <---

Save this File? (Y/N):
>> y <---
Enter output filename.
>> misc.pat <---

14. Homology Search PIR/SWISS-PROT Database Option

*******************************************
*                                         *
*  1. Indicate which database or          *
*     sequence file you wish to search    *
*  2. Enter the sequence pattern          *
*     to be searched for                  *
*  3. Enter the number of alignments      *
*     to be saved                         *
*  4. Check the output file for           *
*     interesting information             *
*  5. Exit the editor with ":q"           *
*  6. Save the file                       *
*                                         *
*******************************************

The HSEARCH program searches the SWISS-PROT, PIR, SEQBANK or a compatible user-defined database (or sequence file) to find the "nearest" or most homologous matches to any given input sequence. Homologies are determined according to any one of four user-defined scoring matrices (described earlier). Gaps are not allowed in the homology search (if gaps are required, use the fast alignment option instead). The homology search is a useful complement to other pattern search routines, especially when attempting to locate distantly related or difficult-to-identify sequence motifs.
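The gap-free search described above amounts to sliding the query along each database sequence, scoring every placement residue by residue, and keeping the best-scoring hits. The Python sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the toy identity matrix stands in for the wt.* scoring matrices, and only the single best placement per database sequence is kept. It is not the HSEARCH implementation.

```python
import heapq

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: a toy identity matrix (1 for identical residues, 0 otherwise).
UNITY = {(a, a): 1 for a in AMINO_ACIDS}


def pair_score(a, b, matrix, default=0):
    """Look up a residue pair in either order, falling back to `default`."""
    return matrix.get((a, b), matrix.get((b, a), default))


def best_window(query, target, matrix):
    """Best ungapped placement of query in target as (score, 1-based start).

    Returns None when the target is shorter than the query.
    """
    best = None
    for start in range(len(target) - len(query) + 1):
        window = target[start:start + len(query)]
        score = sum(pair_score(q, t, matrix) for q, t in zip(query, window))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best


def homology_search(query, database, matrix=UNITY, top=10):
    """Return the `top` best hits as (score, name, start) tuples."""
    hits = []
    for name, seq in database.items():
        hit = best_window(query, seq, matrix)
        if hit is not None:
            hits.append((hit[0], name, hit[1]))
    return heapq.nlargest(top, hits)


if __name__ == "__main__":
    db = {
        "lipase_like": "AAACFYGRCLGFAAA",   # made-up test sequences
        "poly_met": "MMMMMMMMMMMM",
    }
    print(homology_search("CFYQRCRGD", db, top=1))  # prints [(6, 'lipase_like', 4)]
```

Swapping UNITY for a dictionary built from a substitution matrix would give graded scores for similar (not just identical) residues, which is the role the scoring-matrix option plays in the text.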
When using this program, the user is required to identify which database he or she wishes to search (the PIR database in this case) as well as the sequence that is to be searched for in the database. The user is also required to provide the name of an output file (we suggest using the suffix ".hom") and the number of high scoring searches that are to be kept (we chose 50). The output file contains information on the name of the protein where a match was found, the PIR or SWISS-PROT Id or accession number, the location where the match begins in the database protein (DbRes) and the homology score (Score). Vertical lines (|) are used to identify exact matches and asterisks (*) are used to identify homologous matches.

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":

1) Choice of sequence database to be scanned (PIR, SWISS-PROT or user-defined).
2) Choice of scoring matrix (wt.* files or user-defined).
3) Choice of minimum score to designate homologous residue pairs.
4) Choice of file-update frequency (i.e. every 100, 500 or 1000 sequences).
******************************************************************************

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************

*** Preliminaries ***                 *** Alignments ***
 1) Help                              10) Fast Alignment Search
 2) Enter/Edit a Sequence             11) Exhaustive Alignment Search
 3) Retrieve Sequence from Database   12) Align 2 or more sequences

*** Structural Analysis ***           *** Scanning ***
 4) Sequence Statistics               13) Pattern Search
 5) Structure Prediction              14) Homology Search
 6) SEQSITE Pattern Search            15) Dot Plot
 7) Flexibility                       16) Database Reference Search
 8) Hydrophobic Moment                17) File Viewer
 9) Hydrophobicity                     0) EXIT SEQSEE

Enter the number of the desired function
>> 14 <---

Homology Search (Version 1.2)

How do you wish to use Homology Search?
 1) Search PIR/SWISS-PROT database
 2) Search SEQBANK database
 3) Search user-defined sequence
 0) Exit
Enter a number (then press return).
>> 1 <---

Enter sequence (one-letter code). Press when done.
>> CFYQRCRGD <--- This program keeps track of the top 'x' searches Enter a value for 'x' where 0 < x < 500: >> 50 <--- Reading database file: /sirius/seqsee/databases/pir/* Proteins Scanned: 1000 BestScore 134 GroupScore 134 Proteins Scanned: 2000 BestScore 142 GroupScore 142 Proteins Scanned: 3000 BestScore 142 GroupScore 137 ~ ~ ******************************************************************* Program......: hsearch (version 1.2) Description..: Homology Search Results Date.........: Thu Feb 16 14:29:17 1993 Database.....: PIR (Intelligenetics Version) Scoring Mat..: /sirius/local/seqsee/lib/wt.align ******************************************************************* Number of proteins tested: 44890 Number of matches found..: 50 ***********(1)*********** Title......: *Lipase - Rat Id.........: S03672 Amino Acid: 163 Score......: 164 Query Seq..: CFYQRCRGD Matching...: ||| || | Database...: CFYGRCLGF ***********(2)*********** Title......: *Proline-rich protein precursor - Human Id.........: A33568 Amino Acid: 108 Score......: 158 Query Seq..: CFYQRCRGD Matching...: |*|||| Database...: CIYKRCQHP ~ ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> misc.hom <--- 14. Homology Search SEQBANK Option ******************************************* * * * 1. Indicate which database or * * sequence file you wish to search * * 2. Enter the sequence pattern * * to be searched for * * 3. Enter the number of alignments * * to be saved * * 4. Check the output file for * * interesting information * * 5. Exit the editor with ":q" * * 6. Save the file * * * ******************************************* The HSEARCH program searches the PIR, SWISS-PROT, SEQBANK or a compatible user-defined database (or sequence file) to find the "nearest" or most homologous matches to any given input sequence. Homologies are determined according to any one of four user-defined scoring matrices (described earlier). 
Gaps are not allowed in the homology search (if gaps are required, use the fast alignment option instead). The homology search is a useful complement to other pattern search routines, especially when attempting to locate distantly related or difficult-to-identify sequence motifs.

When using this program, the user is required to identify which database he or she wishes to search (the SEQBANK database in this case) as well as the sequence that is to be searched for in the database. The user is also required to provide the number of high scoring searches that are to be kept (we chose 10) as well as the name of an output file (we suggest using the suffix ".hom"). The output file contains information on the name of the protein where a match was found, the SEQBANK Id or accession number, the location where the match begins in the database protein (DbRes), the homology score (Score) and the secondary structure. Vertical lines (|) are used to identify exact matches and asterisks (*) are used to identify homologous matches.

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":

1) Choice of sequence database to be scanned (PIR, SWISS-PROT or user-defined).
2) Choice of scoring matrix (wt.* files or user-defined).
3) Choice of minimum score to designate homologous residue pairs.
4) Choice of file-update frequency (i.e. every 100, 500 or 1000 sequences).
******************************************************************************

**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************

*** Preliminaries ***                 *** Alignments ***
 1) Help                              10) Fast Alignment Search
 2) Enter/Edit a Sequence             11) Exhaustive Alignment Search
 3) Retrieve Sequence from Database   12) Align 2 or more sequences

*** Structural Analysis ***           *** Scanning ***
 4) Sequence Statistics               13) Pattern Search
 5) Structure Prediction              14) Homology Search
 6) SEQSITE Pattern Search            15) Dot Plot
 7) Flexibility                       16) Database Reference Search
 8) Hydrophobic Moment                17) File Viewer
 9) Hydrophobicity                     0) EXIT SEQSEE

Enter the number of the desired function
>> 14 <---

Homology Search (Version 1.2)

How do you wish to use Homology Search?
 1) Search PIR/SWISS-PROT database
 2) Search SEQBANK database
 3) Search user-defined sequence
 0) Exit
Enter a number (then press return).
>> 2 <---

enter sequence (one-letter code). Press when done.
>> CFYQRCRGD <--- >> quit <--- This program keeps track of the top 'x' searches Enter a value for 'x' where 0 < x < 500: >> 10 <--- ******************************************************************* Program......: hsearch (version 1.2) Description..: Homology Search Results Date.........: Thu Feb 16 14:29:17 1993 Database.....: SEQBANK Scoring Mat..: /sirius/local/seqsee/lib/wt.align ******************************************************************* Number of proteins tested: 267 Number of matches found..: 10 ***********(1)********** Title.....: GLUCOCORTICOID RECEPTOR DNA BINDING DOMAIN (RAT) Id........: 110 Amino Acid: 56 Score.....: 126 Query Seq.: CFYQRCRGD Matching..: | |**| Database..: CRYRKCLQA Structure.: HHHHHHHHH ***********(2)********** Title.....: SHORT SCORPION TOXIN Id........: 183 Amino Acid: 26 Score.....: 124 Query Seq.: CFYQRCRGD Matching..: || *| Database..: CFGPQCLCN Structure.: BBCCBBBBB ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> misc.hom <--- 15. Dot Plot Compare 2 Similar Sequences ******************************************* * * * 1. Indicate how you wish to use * * Dot Plot (option 2) * * 2. Indicate how your first sequence * * will be entered * * 3. Enter the sequence filename * * for the first sequence * * 4. Indicate how your second * * sequence will be entered * * 5. Enter the sequence filename * * for the second sequence * * 6. Enter the number of alignments * * to be saved * * 7. Check the output file for * * interesting information * * 8. Exit the editor with ":q" * * 9. Save the file * * * ******************************************* DOTPLOT is an extremely flexible program developed to produce character representations of standard dotplots. DOTPLOT may be used to compare a sequence with itself (to identify internal repeats), with another sequence (for pair-wise alignments), with a SEQFILE compatible database or the PIR/SWISS-PROT databases (for medium speed alignments). 
When using this function, the user is requested to identify which type of dotplot will be done (one sequence against another, one sequence against a large number of sequences or one sequence against itself), how the sequence(s) will be entered (via the keyboard or through a SEQFILE), the number of "diagonals" that should be identified and saved and, finally, the name of the output file (we suggest using the suffix ".dot" for consistency). The output file contains information on where the diagonal begins in the query sequence (QpRes), where the diagonal begins in the "database" sequence (DbRes), the level of homology (Homology Score) and the location of the diagonal in the DOTPLOT matrix (0 being the location of the main diagonal).

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":

1) Choice of sequence database to be scanned (PIR, SWISS-PROT or user-defined).
2) Choice of scoring matrix (wt.* files or user-defined).
3) Choice of minimum score to designate homologous residue pairs.
4) Choice of minimum threshold score and length extension penalties.
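The relationship between QpRes, DbRes and the diagonal number can be illustrated with a short Python sketch that scans every diagonal of the comparison matrix for its best-scoring ungapped run. This is a sketch under stated assumptions and not the DOTPLOT algorithm: the function name, the +1/-1 identity scoring, the max-run search and the simple minimum-score cut-off are all hypothetical stand-ins for the scoring matrix, threshold and length penalty that SEQSEE reads from its control file. The diagonal is taken as DbRes minus QpRes, which agrees with the sample output (Diagonal 20 for QpRes 1 and DbRes 21).

```python
def best_diagonals(query, target, match=1, mismatch=-1, min_score=3, top=20):
    """Report the best ungapped run on each diagonal of a dot matrix.

    Each hit is a dict with 1-based qp_res and db_res start positions,
    the run length, its score, and diagonal = db_res - qp_res.
    """
    hits = []
    for d in range(-(len(query) - 1), len(target)):
        i = max(0, -d)              # 0-based index into query
        j = i + d                   # 0-based index into target
        run = best = best_len = 0
        run_start = best_start = i
        while i < len(query) and j < len(target):
            if run <= 0:            # restart the run after a poor stretch
                run, run_start = 0, i
            run += match if query[i] == target[j] else mismatch
            if run > best:
                best, best_start = run, run_start
                best_len = i - run_start + 1
            i += 1
            j += 1
        if best >= min_score:
            hits.append({"score": best, "diagonal": d,
                         "qp_res": best_start + 1,
                         "db_res": best_start + d + 1,
                         "length": best_len})
    hits.sort(key=lambda h: -h["score"])   # highest score first
    return hits[:top]


if __name__ == "__main__":
    # Comparing a sequence with itself exposes an internal repeat.
    for hit in best_diagonals("ABCABC", "ABCABC"):
        print(hit)
```

In a self-comparison the main diagonal (0) always scores highest; internal repeats show up as additional hits on off-centre diagonals.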
****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 15 <--- Dotplot (Version 1.2) How do you wish to use the dot plot algorithm? 1) Do dotplot against the PIR/SWISS-PROT database 2) Do dotplot with my two input sequences 3) Look for internal repeats 0) Exit Enter a number (then press return) >> 2 <--- *** First amino acid sequence *** Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return). >> 1 <--- Enter input filename: >> zipper1.seq <--- *** Second amino acid sequence *** Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return). >> 1 <--- Enter input filename: >> zipper2.seq <--- This program keeps track of the top 'x' alignments. 
Enter a value for 'x' where 0 < x < 500: >> 20 <--- ******************************************************************* Program......: dotplot (version 1.2) Description..: Finds Regions of Homology (no gaps) Date.........: Thu Feb 16 13:31:19 1993 Scoring Mat..: /sirius/local/seqsee/lib/wt.align LengthPenalty: 5 Min Threshold: 80 mSearchFlag..: 0 (1=yes, 0=no) Sequence Name: leu_zipper1 1 LQKMKGLENK VAEKLSKNYH LERLRALENK LVGER ******************************************************************* Number of proteins tested: 1 Number of searches found: 3 **********(1)********* Title...: leu_zipper1 Id......: Title: Score...: 467 Diagonal: 0 QpRes...: 1 DbRes...: 1 Query Seq..:LQKMKGLENKVAEKLSKNYHLERLRALENKLVGER Matching...:||*|| ||*|| | ||||||||||*|||||||||| Database...:LQRMKQLEDKVEELLSKNYHLERLKALENKLVGER **********(2)********* Title...: leu_zipper Id......: Title: Score...: 123 Diagonal: 20 QpRes...: 1 DbRes...: 21 Query Seq..:LQKMKGLENKV Matching...:|***| ||||* Database...:LERLKALENKL ~ ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> zipper.dot <--- 15. Dot Plot Internal Repeat Option ******************************************* * * * 1. Indicate how you wish to use * * Dot Plot (option 3) * * 2. Indicate how your sequence * * will be entered * * 3. Enter the sequence or sequence * * filename * * 4. Enter the number of alignments * * to be saved * * 5. Check the output file for * * interesting information * * 6. Exit the editor with ":q" * * 7. Save the file * * * ******************************************* DOTPLOT is an extremely flexible program developed to produce character representations of standard dotplots. The low resolution of most character- defined screens prevents the incorporation of a useful graphic representation of dotplot results and hence a character representation with a user defined "threshold" has been incorporated to overcome this problem. 
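The Diagonal, QpRes and DbRes fields in the sample output above can be illustrated with a short Python sketch. It is not the DOTPLOT algorithm itself: the scoring is a plain identity score (+2 for a match, -1 for a mismatch) instead of the wt.align matrix, and the function name, parameters and threshold are hypothetical choices made for illustration. Following the sample output, the diagonal number is taken to be DbRes minus QpRes, so that 0 is the main diagonal.

```python
def diagonal_hits(query, database, match=2, mismatch=-1, min_score=6,
                  skip_main=False):
    """Return the best ungapped segment on each diagonal scoring >= min_score.

    Positions are 1-based, like QpRes and DbRes in the sample output.
    Identity scoring is an assumption; SEQSEE uses a scoring matrix.
    """
    hits = []
    for diag in range(-(len(query) - 1), len(database)):
        if skip_main and diag == 0:
            continue  # a self-comparison always matches on the main diagonal
        first = max(0, -diag)
        last = min(len(query), len(database) - diag)
        best = running = 0
        best_start = best_end = run_start = first
        for i in range(first, last):
            step = match if query[i] == database[i + diag] else mismatch
            if running <= 0:
                running, run_start = step, i   # start a new segment here
            else:
                running += step
            if running > best:
                best, best_start, best_end = running, run_start, i
        if best >= min_score:
            hits.append({
                "score": best,
                "diagonal": diag,
                "qp_res": best_start + 1,
                "db_res": best_start + diag + 1,
                "length": best_end - best_start + 1,
            })
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits


zipper1 = "LQKMKGLENKVAEKLSKNYHLERLRALENKLVGER"
zipper2 = "LQRMKQLEDKVEELLSKNYHLERLKALENKLVGER"

# Two-sequence comparison: the strongest hit lies on the main diagonal.
top = diagonal_hits(zipper1, zipper2)[0]
print(top["diagonal"], top["qp_res"], top["db_res"])

# Internal repeat search: compare a sequence with itself, ignoring diagonal 0.
for hit in diagonal_hits(zipper2, zipper2, skip_main=True)[:3]:
    print(hit)
```

Because the scores come from a different scoring scheme, they are not comparable to the Score values printed by DOTPLOT; only the position bookkeeping is meant to match.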
DOTPLOT may be used to compare a sequence with itself (to identify internal repeats), with another sequence (for pair-wise alignments), with a SEQFILE compatible database or the PIR/SWISS-PROT databases (for medium speed alignments). When using this function, the user is requested to identify which type of dotplot will be done (one sequence against another, one sequence against a large number of sequences or one sequence against itself), how the sequence(s) will be entered, the number of "diagonals" that should be identified and saved and, finally, the name of the output file. The output file contains information on where the diagonal begins in the query sequence (QpRes), where the diagonal begins in the "database" sequence (DbRes), the level of homology (Homology Score) and the location of the diagonal in the DOTPLOT matrix (0 being the location of the main diagonal).

******************************************************************************
AVAILABLE OPTIONS IN "SEQSEE.PARMS":
1) Choice of sequence database to be scanned (PIR, SWISS-PROT or user-defined).
2) Choice of scoring matrix (wt.* files or user-defined).
3) Choice of minimum score to designate homologous residue pairs.
4) Choice of minimum threshold score and length extension penalties.
****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 15 <--- Dotplot (Version 1.2) How do you wish to use the dot plot algorithm? 1) Do dotplot against the PIR/SWISS-PROT database 2) Do dotplot with my two input sequences 3) Look for internal repeats 0) Exit Enter a number (then press return) >> 3 <--- *** First amino acid sequence *** Your amino acid sequence is now required: 1) Read sequence from an input file 2) Sequence to be entered via keyboard 3) I do not have my sequence ready Enter a number (then press return). >> 1 <--- Enter input filename: >> zipper3.seq <--- This program keeps track of the top 'x' alignments. 
Enter a value for 'x' where 0 < x < 500:
>> 20 <---
*******************************************************************
Program......: dotplot (version 1.2)
Description..: Finds Regions of Homology (no gaps)
Date.........: Thu Feb 16 13:31:19 1993
Scoring Mat..: /sirius/local/seqsee/lib/wt.align
LengthPenalty: 5
Min Threshold: 80
mSearchFlag..: 0 (1=yes, 0=no)
Sequence Name: leu_zipper3
1 LQRMKQLEDK VEELLSKNYH LERLKALENK LVGER
*******************************************************************
Number of proteins tested: 1
Number of searches found: 1
***********(1)**********
Title....: leu_zipper3
Id.......: Title:
Score....: 60
Diagonal.: 14
QpRes...: 1
DbRes...: 15
Query Seq:LQRMKQLEDKV
Matching.:|*|*| ||*|*
Database.:LERLKALENKL
~
~
~
~
~
~
:q <---
Save this File? (Y/N):
>> y <---
Enter output filename.
>> zipper.dot <---

16. Database Reference Search

*******************************************
*                                         *
* 1. Indicate how your query will         *
*    be entered                           *
* 2. Enter the search query (remember     *
*    to type "quit" when finished)        *
* 3. Check the output file for            *
*    desired information                  *
* 4. Exit the editor with ":q"            *
* 5. Save the file                        *
*                                         *
*******************************************

The program REFSCAN is designed specifically to allow the user to find and retrieve sequence references from the PIR or SWISS-PROT databases using either the accession number, the name (or portion thereof) or a bibliographic/functional reference. Note that multiple sequence identifiers joined by the conjunctive "&" symbol may be employed for increased reference query specificity, for example: CYSTIC & FIBROSIS & HUMAN for HUMAN CYSTIC FIBROSIS.

When using this function, the user is required to identify the method by which the query will be entered (either the keyboard or through a UNIX file) as well as the exact Id numbers or reference words which must be searched for in the database. Note that the user MUST type "quit" on the final line of his or her search string.
The word "quit" is used by the program as a termination flag and is essential for proper functioning of the program. After completing the sequence input, the user is required to provide a name for the output file (we suggest using the suffix ".scan" for consistency). The output from the reference search is essentially self-explanatory. Note that the sequence of the peptide or protein is not included in the output. ****************************************************************************** AVAILABLE OPTIONS IN "SEQSEE.PARMS": 1) Choice of reference database to be scanned (PIR, SWISS-PROT or user- defined). 2) Choice of file-update frequency (i.e. every 100, 500 or 1000 sequences). ****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 16 <--- Refscan (Version 1.2) How will you enter your search queries? 
1) Protein Name(s) entered from the keyboard
2) Protein Name(s) taken from a file
3) Protein Id(s) entered from the keyboard
4) Protein Id(s) taken from a file
0) Exit program
Enter a number (then press return)
>> 3 <---
Enter one index code per line
Type QUIT (then press return) when done:
>> CCHU <---
>> CCHZ <---
>> quit <---
Reading database file: /sirius/seqsee/databases/pir/*
Proteins scanned: 1000
Proteins scanned: 2000
~
~
~
*****************************************************************
Program......: refscan (version 1.2)
Description..: Reference Retrieval Results
Date.........: Thu Feb 16 14:00:01 1993
Database.....: PIR (Intelligenetics Version)
SearchStrings: CCHU CCHZ
*****************************************************************
>CCHU Cytochrome c - Human
ENTRY CCHU #Type Protein
TITLE Cytochrome c - Human
DATE #Sequence 30-Sep-1991 #Text 30-Jun-1992
PLACEMENT 1.0 1.0 1.0 1.0 1.0
SOURCE Homo sapiens #Common-name man
ACCESSION A31764\ A05676\ A00001
REFERENCE
   #Authors Evans M.J., Scarpulla R.C.
   #Journal Proc. Natl. Acad. Sci. U.S.A. (1988) 85:9625-9629
   #Title The human somatic cytochrome c gene: two classes of processed pseudogenes demarcate a period of rapid molecular evolution.
   #Reference-number A31764
   #Accession A31764
   #Molecule-type DNA
   #Residues 1-105
   #Cross-reference GB:M22877
REFERENCE
   #Authors Matsubara H., Smith E.L.
   #Journal J. Biol. Chem. (1963) 238:2732-2753.
   #Reference-number A05676
   #Accession A05676
   #Molecule-type protein
   #Residues 2-28;29-46;47-100;101-105
REFERENCE
   #Authors Matsubara H., Smith E.L.
   #Journal J. Biol. Chem. (1962) 237:3575-3576
   #Reference-number A00001
   #Comment 66-Leu is found in 10% of the molecules in pooled protein.
GENETIC #Introns 57/1 SUPERFAMILY #Name cytochrome c KEYWORDS acetylation\ electron transport\ heme\ mitochondrion\ oxidative phosphorylation\ polymorphism\ respiratory chain FEATURE 2-105 #Protein cytochrome c (experimental) \ 2 #Modified-site acetylated amino end (Gly) (in mature form) (experimental)\ 15,18 #Binding-site heme (covalent)\ 19,81 #Binding-site heme iron (his, Met) (axial ligands) SUMMARY # Molecular-weight 11749 #Length 105 #Checksum 3247 SEQUENCE ********************************************************* >CCHZ Cytochrome c - Chimpanzee (tentative sequence) ENTRY CCHZ #Type Protein TITLE Cytochrome c - Chimpanzee (tentative sequence) DATE #Sequence 17-Mar-1987 #Text 30-Jun-1992 PLACEMENT 1.0 1.0 1.0 1.0 1.0 ~ ~ ~ ~ ~ :q <--- Save this File? (Y/N): >> y <--- Enter output filename. >> cyt.scan <--- 17. File Viewer ******************************************* * * * 1. Indicate which database or file * * you wish to browse or edit * * 2. Browse or edit the output file * * 3. Exit the editor with ":q" or :wq * * * ******************************************* The File Viewer option permits the user to edit or view a variety of database files. Through this program it is possible to locate or identify complete sequences within the PIR or SWISS-PROT databases, to locate sequences from the SEQBANK database, to view or edit sequences written as SEQFILEs and to view, edit or change the SEQSEE control file ("seqsee.parms"). In the case of viewing PIR database information, all sequence name and accession data is contained in a single 1 Mb file called PIRSEE. In the case of viewing the SWISSPROT database, all sequence name and accession data is contained in a single 1 Mb file called SWISSEE. Standard UNIX commands may be used for scrolling through or locating all character strings in any of the files. When using this function, the user is only required to choose which database or file he or she desires to view. No other input is required. 
The database formats have been discussed in earlier sections of this manual and will not be elaborated upon here. The SEQSEE control file ("seqsee.parms") is mostly self-explanatory although users wishing to know more about the file may consult the Appendix for the annotated version of the control file. ****************************************************************************** AVAILABLE OPTIONS IN "SEQSEE.PARMS": 1) Choice of sequence database to be viewed (PIRSEE, SWISSEE, SEQBANK or user-defined). 2) Access to "seqsee.parms" file to alter individual program options. ****************************************************************************** ********************************************************************** * Package...: SEQSEE Version 1.2 (c) * * Authors...: Robert Boyko / Leigh Willard / David Wishart * * Fred Richards / Brian Sykes * * Location..: University of Alberta * * Protein Engineering Network of Centres of Excellence * ********************************************************************** *** Preliminaries *** *** Alignments *** 1) Help 10) Fast Alignment Search 2) Enter/Edit a Sequence 11) Exhaustive Alignment Search 3) Retrieve Sequence from Database 12) Align 2 or more sequences *** Structural Analysis *** *** Scanning *** 4) Sequence Statistics 13) Pattern Search 5) Structure Prediction 14) Homology Search 6) SEQSITE Pattern Search 15) Dot Plot 7) Flexibility 16) Database Reference Search 8) Hydrophobic Moment 17) File Viewer 9) Hydrophobicity 0) EXIT SEQSEE Enter the number of the desired function >> 17 <--- File Viewer (Version 1.2) What would you like to browse? 1) User specified file 2) PIRSEE database 3) SWISSEE database 4) SEQBANK database 5) SEQSEE control file 0) Exit Enter a number (then press return). >> 4 <--- # SEQBANK # (REVISED DEC. 1992) # # COPYRIGHT APRIL, 1993 # DAVID S. 
WISHART # # DEPARTMENT OF BIOCHEMISTRY # UNIVERSITY OF ALBERTA # EDMONTON, ALBERTA # CANADA # T6G 2H7 # # SEQBANK is a compilation of sequences and "consensus" secondary #structure assignments of soluble proteins and peptides which have had #their..... >ACTIN (RABBIT SKELETAL) #REFERENCE : KABSCH, W. ET AL., NATURE 347:37-44 (1990) #REFERENCE : FLAHERTY, K.M. ET AL., PNAS 88:5041-5045 (1991) #SEQBANK ID: 1 #BRKHAVN ID: #PIR-NBR ID: ATRB #SWISPRO ID: ACTS$RABIT #RESOLUTION: 2.8 #R FACTOR : 23.8 #FOLD CLASS: M #NUM RESIDU: 375 DEDETTALVC DNGSGLVKAG FAGDDAPRAV FPSIVGRPRH QGVMVGMGQK CCCCCCBBBB BBBCCBBBBB BBCCCCCCBB BBCCBBBBCC CCCCCCCCCC DSYVGDEAQS KRGILTLKYP IEHGIITNWD DMEKIWHHTF YNELRVAPEE CBBBCHHHHH HCCBBBBBCC BBBCBBBCCH HHHHHHHHHH HCCCCCCCCC HPTLLTEAPL NPKANREKTM QIMFETFNVP AMYVAIQAVL SLYASGRTTG CCBBBBBCHH HHHHHHHHHH HHHHHCCCCC BBBBBBCHHH HHHHCCCCBB IVLDSGDGVT HNVPIYEGYA LPHAIMRLDL AGRDLTDYLM KILTERGYSF BBBBCCCCBB BBBBBBCCBB BCCBBBBBCC CHHHHHHHHH HHHHHHCCCC VTTAEREIVR DIKEKLCYVA LDFENAMATA ASSSSLEKSY ELPDGQVITI CCHHHHHHHH HHHHHHCCCC CHHHHHHHHH HCCCCCCBBB BBCCCCBBBB GNERFRCPET LFQPSFIGME SAGIHETTYN SIMKCDIDIR KDLYANNVMS CCHHHHHHHH HHHCCCCCCC CCHHHHHHHH HHHHCCCHHH HHHHCCBBBB GGTTMYPGIA DRMQKEITAL APSTMKIKII APPERKYSVW IGGSILASLS CCCCCCCCHH HHHHHHHHHH HCCCCCBBBB CCHHHHHHHH HHHHHHHHCC TFQQMWITKQ EYDEAGPSIV HRKCF HHHHHCCCCH HHHHHCCHHH HHHCC ~ ~ ~ ~ ~ :q <--- X. HELP, ON-LINE DOCUMENTATION AND USER MANUALS We have attempted to document as much of the SEQSEE program as possible. In addition to the manual you are presently reading, we have also provided a menu driven on-line help facility to act as a complement to the manual. The hardcopy version of the user manual may be freely copied and distributed to anyone interested in using SEQSEE. If the original version of this user manual is lost, a "low-end" copy of this manual may be printed from the file called "manual" which is included with the program. 
The On-line Help facility is essentially a shortened version of the SEQSEE manual. If the user presses the HELP key on the main SEQSEE menu, the following HELP menu will appear:

SEQSEE HELP (Version 1.2)
What would you like help with?
1) Authors, version, copyright notice
2) Introduction to SEQSEE
3) Recommendations for the beginner
4) Brief explanation of main menu
5) Detailed explanation of main menu
6) Tutorial
7) Common questions from users
8) Sequence input format
9) Amino Acid Info
0) Exit
Enter a number (then press return):

The ten functions in the HELP menu can be summarized as follows:

1) AUTHORS, VERSION, COPYRIGHT NOTICE - Provides a short notice regarding the location and addresses of the authors, the current version number of SEQSEE and the limitations that the user must agree to before using the program.
2) INTRODUCTION TO SEQSEE - Provides background information and a brief history about SEQSEE, its development and potential applications.
3) RECOMMENDATIONS FOR THE BEGINNER - Provides a list of procedures that the first-time user should undertake to make the operation of SEQSEE as easy and as convenient as possible.
4) BRIEF EXPLANATION OF MAIN MENU - Provides a short synopsis of SEQSEE menu functions.
5) DETAILED EXPLANATION OF MAIN MENU - Provides a more in-depth explanation of all 18 menu functions along with examples and references.
6) TUTORIAL - Provides a sample SEQSEE tutorial taken directly from this manual.
7) COMMON QUESTIONS FROM USERS - Lists some of the more common questions (along with the answers) which have been fielded by the authors from a variety of first-time users.
8) SEQUENCE INPUT FORMAT - Describes the SEQFILE format, which is the required format for all sequence data in SEQSEE.
9) AMINO ACID INFO - Provides a brief synopsis of amino acid names, abbreviations, structures and molecular weights.
0) EXIT - Returns the user to the main SEQSEE menu.
Upon pressing any HELP menu number (except 0) the user will be presented with a file displaying the information on the desired subject. The file may be scrolled through using the scrolling control keys (h,j,k,l) and it may be exited by simply typing ":q". On exiting from the file the user is automatically returned to the HELP menu. HELP may be exited by pressing "0" (and return) as indicated on the menu.

XI. DATABANKS, DATABASES AND LIBRARIES

The SEQSEE suite of programs is complemented by more than 40 different databases and library files. These databanks contain a vast array of structural, functional and physico-chemical information that has been collected and tabulated from many different sources. In addition, we have generated several new databases specifically for SEQSEE to allow it to perform a number of novel functions. Without these libraries many of the more important features of SEQSEE would be rendered inoperable and, indeed, the program would essentially cease to be useful. Consequently we believe it is important to identify where these libraries are located, how they have been named and precisely what they contain. Such an understanding will ultimately allow the user to update or modify these databanks whenever the need arises.

DATABASE LOCATION

The SEQSEE databanks are segregated into two categories or directories. One is named DATABASES and the other is named LIB (for libraries). Databases are typically the larger of the two types of records, with each file containing between 400 and 40,000 lines. Currently, all SEQSEE database files reside in the directory called "seqsee/databases" and these include:

1) EPISITE.db      4) SEQBANK.db      7) SEQSITE.db
2) PHOSITE.db      5) SEQMOTIF1.db    8) SEQUENCES.db
3) PIRSEE.db       6) SEQMOTIF2.db    9) SWISSEE.db

Typically, the PIR and SWISS-PROT sequence databases can be placed in the "seqsee/databases" directory.
Please note that these major sequence databases must be obtained separately by the user either through an anonymous FTP site (SWISS-PROT from ncbi.nlm.nih.gov and PIR from ftp.bchs.uh.edu) or through the Intelligenetics Corporation. Because of disk space limitations, the PIR and SWISS-PROT databases cannot be included with the SEQSEE program when it is obtained through our anonymous FTP site. The PIR and SWISS-PROT databases can be included in versions of SEQSEE sent by tape.

On the other hand, smaller data tables which typically contain physico-chemical parameters are contained in the directory called "seqsee/lib" and these include:

 1) alexis.cys        12) homol.weights     23) mol.parspecvol
 2) alexis.norm       13) hphil.hopp        24) mol.surfarea
 3) cfas.data         14) hphob.cornet      25) mol.volume
 4) fleqsee.parms     15) hphob.eisen       26) mol.weights
 5) fracbur.parms     16) hphob.hplc        27) moment.parms
 6) gor.data          17) hphob.kyte        28) wt.align
 7) gor.orig.parms    18) hphob.rbo         29) wt.dayhoff
 8) hmom.cornet       19) kyte.parms        30) wt.levin
 9) hmom.eisen        20) membrane.parms    31) wt.mclach
10) hmom.hplc         21) mol.asa           32) wt.rbo
11) hmom.kyte         22) mol.fracbur       33) wt.unit

In the next few pages, we present a brief synopsis of what each of these files contains, beginning with the databases.

DATABASE FILES

a) The PIR Database

One of the most important members of the SEQSEE database library is the National Biomedical Research Foundation's Protein Information Resource or the PIR (Dayhoff et al., 1983). Among publicly available protein sequence databases it is by far the largest and most up-to-date. The Sept. 1993 release (version 34.0), which was used for the compilation of this manual, contains 44,890 protein and peptide sequence entries. Most of these sequences are annotated with references and related information. The PIR databank has been broken down into several subfiles containing older "annotated" entries and more preliminary "unannotated" entries.
b) The SWISS-PROT Database

An equally important member of the SEQSEE database library is the European Molecular Biology Laboratory's (EMBL) SWISS-PROT. Maintained and compiled by Amos Bairoch, this database is much more fully annotated and self-consistent than the PIR. Accession codes are also much more rational in their construction. The only drawback to the SWISS-PROT database is the dearth of immunoglobulin and peptide-fragment sequences. Fortunately, these are still collected by those operating the PIR database. SEQSEE currently uses version 23.0 of SWISS-PROT, which contains more than 24,000 sequence records.

c) The SEQBANK Database

SEQBANK contains a complete listing of the names, the references, the sequences and the secondary structure of proteins which have had their structures determined through X-ray crystallography or NMR spectroscopy. A total of 267 proteins are included in SEQBANK. Each of the sequences is essentially unique, with none of the entries being more than 50% homologous to any other entry. As it presently stands, SEQBANK contains 50,582 residues of which 17,688 are in helices (35.0%), 14,248 in beta-strands (28.2%) and 18,646 in coil configurations (36.9%).

d) The SEQSITE Database

SEQSITE contains a list (including bibliographic entries) of approximately 1000 sequence motifs and signature sequences which have been identified through extensive literature and computer searches. Many of these sequence patterns have proven to be particularly effective in the identification of possible or probable enzymatic functions and in the location of active sites for a number of previously uncharacterized proteins. We have attempted to adopt the nomenclature of Amos Bairoch's more fully annotated database called PROSITE (1990).

e) The EPISITE Database

EPISITE contains a relatively short list (including bibliographic entries) of recently identified B-cell and T-cell epitopes.
This may (or may not) assist in the identification of potential antigenic sites in a variety of proteins of immunological interest. Antigenic sites are stored separately from the SEQSITE motifs because they are less well defined and, consequently, much more common than "signature" sequences.

f) The PHOSITE Database

The PHOSITE database contains a list (including bibliographic entries) of potential phosphorylation sites which have been described in the literature. Phosphorylation sites are stored separately from the SEQSITE motifs because they are generally less well defined and, because of their high abundance, tend to "overwhelm" important or useful signature sequence data.

g) The SEQMOTIF1 Database

This small database consists of only 150 entries. Unlike SEQMOTIF2 (described below), this database includes most of the longer and more complex sequence-structure patterns found in proteins of known structure. Many of these have been derived from extensive literature or crystallographic database searches. The SEQMOTIF1 entries include such well-known structural elements as the helix-loop-helix domain of calcium-binding proteins, the helix-turn-helix motif of DNA-binding proteins, and the nucleotide binding fold of kinases and phosphorylases. Many other, lesser known, structural motifs are also included.

h) The SEQMOTIF2 Database

SEQMOTIF2 is a databank containing short sequence strings which have been found to have a high propensity for certain secondary structures. Previous workers (Rooman and Wodak, 1988), using much smaller databases, had shown that a number of short sequence patterns were regularly found in association with certain secondary structures. By using a far larger database (SEQBANK) we have been able to extend Rooman and Wodak's relatively short list to include almost 1000 simplified sequence "motifs" with their associated secondary structures.
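The idea behind a table such as SEQMOTIF2 can be illustrated by tallying, for every short sequence string, the secondary structure assigned to it in a SEQBANK-style record. The Python sketch below is an assumption about the general approach and not the procedure actually used to build SEQMOTIF2; the function name and the window length k are hypothetical. The input is the first 50 residues of the actin record shown in the File Viewer example, with the H, B and C structure codes as they appear in that record.

```python
from collections import Counter, defaultdict


def motif_structure_counts(sequence, structure, k=3):
    """Count which structure string accompanies each k-residue sequence string."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    table = defaultdict(Counter)
    for i in range(len(sequence) - k + 1):
        table[sequence[i:i + k]][structure[i:i + k]] += 1
    return table


# First row of the actin entry, written in blocks of ten as in SEQBANK.
seq = "DEDETTALVC DNGSGLVKAG FAGDDAPRAV FPSIVGRPRH QGVMVGMGQK".replace(" ", "")
sst = "CCCCCCBBBB BBBCCBBBBB BBCCCCCCBB BBCCBBBBCC CCCCCCCCCC".replace(" ", "")

table = motif_structure_counts(seq, sst)
for motif in sorted(table)[:5]:
    print(motif, dict(table[motif]))
```

Run over a whole database rather than one row, counts of this kind would show which short strings occur predominantly in one structural state.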
i) The PIRSEE Database

This database is essentially a shortened version of the PIR reference database. PIRSEE contains only the protein sequence name (or the first 50 characters -- whichever comes first) and its corresponding accession number. Note that PIRSEE (and not the complete PIR) is the database which is presented when using the "File Viewer" command.

j) The SWISSEE Database

This database is essentially a shortened version of the SWISS-PROT database. SWISSEE contains only the protein sequence name (or the first 50 characters -- whichever comes first) and its corresponding accession number. Note that SWISSEE (and not the complete SWISS-PROT) is the database which is presented when using the "File Viewer" command.

k) The SEQUENCES Database

This represents a compilation of the sequences derived from the SEQBANK database. Because the sequences in SEQBANK are not in a suitable format to be used for direct queries with SEQSEE, we have re-assembled all 267 sequences into 267 separate sequence files. The names of these files have been chosen to permit easy identification of the protein sequences contained within them (i.e., myo.seq = myoglobin). All of the sequences have been extracted from either the PIR or SWISS-PROT databases directly. Some of these sequence files have been edited to remove the leader sequences, as required.

LIB FILES

a) ALEXIS.CYS - Contains amino acid and secondary structure content data for the prediction of folding classes among cysteine-rich proteins.
b) ALEXIS.NORM - Contains amino acid and secondary structure content data for the determination of folding classes among regular (low cysteine content) globular proteins. The data is used in a modified predictive technique based on the approach of Chou and Zhang (1993).
c) CFAS.DATA - Contains recently updated Chou-Fasman parameters (Chou and Fasman, 1974; 1978) for secondary structure prediction. The actual values were calculated from data in SEQBANK.
d) FLEQSEE.PARMS - Contains the B-factor values calculated by Karplus and Schulz (1985) used to calculate sequence flexibility in the program FLEQSEE.
e) FRACBUR.PARMS - Contains data on the expected fraction of buried residues in soluble globular proteins based on the data compiled by Janin (1979).
f) GOR.DATA - Contains recently re-derived parameters for secondary structure prediction based on the GOR (information theory) algorithm (Garnier et al., 1978).
g) GOR.ORIG.PARMS - Contains original parameters (Garnier et al., 1978) for secondary structure prediction using the GOR (information theory) algorithm.
h) HMOM.CORNET - Contains normalized hydrophobicity values determined by Cornette et al. (1987) which are reputed to be particularly good for calculating hydrophobic moments.
i) HMOM.EISEN - Contains hydrophobicity values calculated by Eisenberg and co-workers (1984) and normalized by Cornette et al. (1987) for the purpose of calculating hydrophobic moments.
j) HMOM.HPLC - Contains normalized hydrophobicity values calculated by Parker et al. (1986) and modified by Cornette et al. (1987) for the purpose of calculating hydrophobic moments.
k) HMOM.KYTE - Contains normalized hydrophobicity values calculated by Kyte and Doolittle (1982) and modified by Cornette et al. (1987) for the purpose of calculating hydrophobic moments.
l) HOMOL.WEIGHTS - Contains weighting parameters (multipliers) used by ALEXIS in calculating secondary structure via the homology method.
m) HPHIL.HOPP - Contains the original antigenicity/hydrophilicity values calculated by Hopp and Woods (see Cornette et al. (1987) for more information).
n) HPHOB.CORNET - Contains original, unscaled hydrophobicity values determined by Cornette et al. (1987).
o) HPHOB.EISEN - Contains original, unscaled hydrophobicity values calculated by Eisenberg and co-workers (1984).
p) HPHOB.HPLC - Contains unscaled hydrophobicity values calculated by Parker et al. (1986).
q) HPHOB.KYTE - Contains original, unscaled hydrophobicity values calculated by Kyte and Doolittle (1982).
r) HPHOB.RBO - Contains original, unscaled hydrophobicity values calculated by R. Boyko and D. Wishart (unpublished).
s) KYTE.PARMS - Contains original hydrophobicity values calculated by Kyte and Doolittle (1982) which are used in the Klein algorithm (1985) to calculate the location of membrane helices.
t) MEMBRANE.PARMS - Contains modified Kyte-Doolittle hydrophobicity values which can be used by the Klein et al. (1985) algorithm to calculate the location of membrane spanning regions.
u) MOL.ASA - Contains the accessible surface area of all 20 amino acids measured in square angstroms as given by Richards (1977).
v) MOL.FRACBUR - Contains data on the expected fraction of buried residues in soluble globular proteins based on the data compiled by Janin (1979).
w) MOL.PARSPECVOL - Contains the partial specific volumes of all 20 amino acids (Creighton, 1984).
x) MOL.SURFAREA - Contains the surface area of all 20 amino acids measured in square angstroms as cited by Richards (1977).
y) MOL.VOLUME - Contains the molecular volumes of all 20 amino acids measured in cubic angstroms as cited by Richards (1977).
z) MOL.WEIGHTS - Contains the molecular weights of all 20 amino acids measured in daltons (Creighton, 1984).
aa) MOMENT.PARMS (CFAS / BHYDRO / HHYDRO) - Contains modified hydrophobicity and secondary structural propensity values to calculate the hydrophobic moment and its contribution to secondary structure in the program called MOMENT (see Eisenberg et al. (1984) for more details).
bb) WT.ALIGN - An amino acid exchange matrix which was specifically developed for the FAST_ALIGN program.
cc) WT.DAYHOFF - An exchange matrix developed by Dayhoff and co-workers (1983) based on mutational replacement frequencies observed for a large number of proteins in the PIR database. Also called the PAM 250 matrix.
This is the most commonly used matrix in sequence alignments despite its many shortcomings.
dd) WT.LEVIN - An exchange matrix developed by Levin et al. (1986) for the purposes of secondary structure prediction based on sequence homology.
ee) WT.MCLACH - An amino acid exchange matrix developed by Andrew McLachlan (1971) based on the propensity of residues to substitute for one another as observed in crystal structures. One of the best amino acid exchange matrices available, but unfortunately this is not widely known.
ff) WT.RBO - An improved exchange matrix developed by Robert Boyko (unpublished) for the purposes of secondary structure prediction based on sequence homology.
gg) WT.UNIT - A matrix used in alignments and homology searches where only the main diagonal contains non-zero entries. The Unity matrix should be used for crude searches only.

XII. SEQSEE FILE STRUCTURES

There are several file structures that have been adopted for the storage and manipulation of files in SEQSEE. Most sequence files accessed or entered by the user are written and stored as SEQFILEs. On the other hand, library and database files are stored in file-specific formats. These database-specific formats have been designed to make the file contents both readable and accessible.

a) THE SEQFILE FORMAT

The SEQFILE structure is basically composed of a file marker (the symbol ">") and a title (sequence name) on the first line of the record. The sequence (in single letter code) appears on all subsequent lines in the SEQFILE record. Observe that all SEQFILE sequences are stored in upper case letters, but this is done only for enhanced readability. All programs in SEQSEE are capable of reading these files regardless of whether they contain upper or lower case letters.
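For readers who wish to process SEQFILEs with their own scripts, the record layout just described can be sketched in a few lines of Python. This is only an illustration and is not part of SEQSEE: the function name "parse_seqfile" is hypothetical, and the sketch assumes that a record title may carry a "Title:" prefix after the ">" marker, that blank lines can be ignored, and that converting residues to upper case is acceptable (SEQSEE itself reads either case).

```python
def parse_seqfile(text):
    """Split SEQFILE-style text into a list of (title, sequence) pairs.

    Hypothetical helper for illustration only; not part of SEQSEE.
    """
    records = []
    title = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A ">" marker closes the previous record and opens a new one.
            if title is not None:
                records.append((title, "".join(chunks)))
            header = line[1:].strip()
            if header.lower().startswith("title:"):
                header = header[len("title:"):].strip()
            title = header
            chunks = []
        elif title is not None:
            # Sequence lines: drop embedded blanks, normalise to upper case.
            chunks.append("".join(line.split()).upper())
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


example = """>Title: human_thioredoxin
MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMINPFFHSLSEKYS
NVIFLEVDVDDCQDVASECEVKCTPTFQFFKKGQKVGEFSGANKEKLEAT
INELV
"""

records = parse_seqfile(example)
for title, sequence in records:
    print(title, len(sequence))
```

Because a new ">" marker simply starts a new record, the same sketch also handles files that hold more than one sequence.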
An example of a SEQFILE with a single sequence of 105 residues is presented below:

>Title: human_thioredoxin
MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMINPFFHSLSEKYS
NVIFLEVDVDDCQDVASECEVKCTPTFQFFKKGQKVGEFSGANKEKLEAT
INELV

Note that it is possible for more than one sequence to appear in any given SEQFILE as seen here:

>Title: human_thioredoxin
MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMINPFFHSLSEKYS
NVIFLEVDVDDCQDVASECEVKCTPTFQFFKKGQKVGEFSGANKEKLEAT
INELV
>Title: chimp_thioredoxin
MVKHIESKTAFQEALDAAGDKLVLVDFSATWCGPCKMINPFFHSLSEKYS
NVIFLEVDVDDCQDVASECEVKCTPTFQFYKKGQKVGEFSGANKEKLEAT
INELV
>Title: gorilla_thioredoxin
MVKQIESKTAFQEALDAAGDKLLVVDFSATWCGPCKMINPFFHSISEKYS
NVIFLEVDVDDCQDVASECEVKCTPTFQFFKRGQKVGEFSGANKEKLEAT
INELV
>Title: gibbon_thioredoxin
MVKHIESKTAFQEALDAAGDKLVLVDFSATWCGPCKMINPFFHSLSEKYS
NVIFLEVDVDDCQDVASECEVKCTPTFQFYKKGQKVGEFSGANKEKLEAT
INELV

These multiple sequence SEQFILEs may be constructed from the output of the "Retrieve Sequence from Database" option (#3 on the menu) or by cutting and pasting other files into one another with the "vi" editor. This latter procedure is best done outside the SEQSEE programming environment.

As previously mentioned, there are a number of other file formats and file structures that are quite different from the SEQFILE format. These are typically associated with the larger database files. Examples are provided below:

b) THE PIR "Sequence" FORMAT

This is a much more compact file structure than what is normally presented as the 'typical' PIR format. In this particular format only the protein identification code, the protein name and the protein sequence (in lower case) are included in the record. This "new" file format greatly accelerates the searching and aligning processes in SEQSEE.

>CCHU Cytochrome c - Human
gdvekgkkifimkcsqchtvekggkhktgpnlhglfgrktgqapgysytaanknkgiiwge
dtimeylenpkkyipgtkmifvgikkkeradliaylkkatnel
>CCCZ Cytochrome c - Chimpanzee
gdvekgkkifimkcsqchtvekggkhktgpnlhglfgrktgqapgysytaanknkgiiwge
dtimeylenpkkyipgtkmifvgikkkeradliaylkkatnel
>CCMQR Cytochrome c - Rhesus macaque (tentative sequence)
gdvekgkkifimkcsqchtvekggkhktgpnlhglfgrktgqapgysytaanknkgiiwge
dtimeylenpkkyipgtkmifvgikkkeradliaylkkatnel
~
~
~

c) THE PIR "Reference" FORMAT

This format is relatively self-explanatory and is well documented in the materials that accompany the PIR database. The example presented here is for reference purposes only.

>CCHU Cytochrome c - Human
ENTRY CCHU #Type Protein
TITLE Cytochrome c - Human
DATE #Sequence 30-Sep-1991 #Text 30-Jun-1992
PLACEMENT 1.0 1.0 1.0 1.0 1.0
SOURCE Homo sapiens #Common-name man
ACCESSION A31764\ A05676\ A00001
REFERENCE
   #Authors Evans M.J., Scarpulla R.C.
   #Journal Proc. Natl. Acad. Sci. U.S.A. (1988) 85:9625-9629
   #Title The human somatic cytochrome c gene: two classes of processed pseudogenes demarcate a period of rapid molecular evolution.
   #Reference-number A31764
   #Accession A31764
   #Molecule-type DNA
   #Residues 1-105
   #Cross-reference GB:M22877
REFERENCE
   #Authors Matsubara H., Smith E.L.
   #Journal J. Biol. Chem. (1963) 238:2732-2753.
   #Reference-number A05676
   #Accession A05676
   #Molecule-type protein
   #Residues 2-28;29-46;47-100;101-105
REFERENCE
   #Authors Matsubara H., Smith E.L.
   #Journal J. Biol. Chem. (1962) 237:3575-3576
   #Reference-number A00001
   #Comment 66-Leu is found in 10% of the molecules in pooled protein.
GENETIC #Introns 57/1
SUPERFAMILY #Name cytochrome c
KEYWORDS acetylation\ electron transport\ heme\ mitochondrion\ oxidative phosphorylation\ polymorphism\ respiratory chain
FEATURE
   2-105 #Protein cytochrome c (experimental) \
   2 #Modified-site acetylated amino end (Gly) (in mature form) (experimental)\
   15,18 #Binding-site heme (covalent)\
   19,81 #Binding-site heme iron (his, Met) (axial ligands)
SUMMARY #Molecular-weight 11749 #Length 105 #Checksum 3247
~
~

d) THE SEQBANK FORMAT

This is the format used for storage of sequence and structural information of "solved" protein and peptide structures. The first line of each sequence record contains the name of the protein and its species of origin. The subsequent lines mean the following:

#REFERENCE : Current or reasonably definitive reference to the structure
#SEQBANK ID: SEQBANK ID number
#BRKHAVN ID: Brookhaven Protein Databank ID
#PIR-NBR ID: PIR accession number
#SWISPRO ID: SWISS-PROT ID code
#RESOLUTION: Resolution in Angstroms
#R FACTOR  : Refinement Factor in Percent
#FOLD CLASS: Protein folding class (B = all beta protein, A = all helical structure, M = mixed alpha helix/beta strand, AB = alpha-beta barrel; CB, CA and CM = cysteine rich beta, alpha and mixed structures, respectively)
#NUM RESIDU: Number of amino acids

The remaining part of the file contains the complete sequence of the protein (in single letter amino acid code) and its secondary structure assignment as determined by X-ray crystallography or NMR. Each line of sequence is followed by a line giving the secondary structure of the same residues. Note that in these secondary structure records the following convention is used: H=helix, B=beta strand and C=coil.

>ACTIN (RABBIT SKELETAL)
#REFERENCE : KABSCH, W. ET AL., NATURE 347:37-44 (1990)
#REFERENCE : FLAHERTY, K.M. ET AL., PNAS 88:5041-5045 (1991)
#SEQBANK ID: 1
#BRKHAVN ID:
#PIR-NBR ID: ATRB
#SWISPRO ID: ACTS$RABIT
#RESOLUTION: 2.8
#R FACTOR  : 23.8
#FOLD CLASS: M
#NUM RESIDU: 375
DEDETTALVC DNGSGLVKAG FAGDDAPRAV FPSIVGRPRH QGVMVGMGQK
CCCCCCBBBB BBBCCBBBBB BBCCCCCCBB BBCCBBBBCC CCCCCCCCCC
DSYVGDEAQS KRGILTLKYP IEHGIITNWD DMEKIWHHTF YNELRVAPEE
CBBBCHHHHH HCCBBBBBCC BBBCBBBCCH HHHHHHHHHH HCCCCCCCCC
HPTLLTEAPL NPKANREKTM QIMFETFNVP AMYVAIQAVL SLYASGRTTG
CCBBBBBCHH HHHHHHHHHH HHHHHCCCCC BBBBBBCHHH HHHHCCCCBB
IVLDSGDGVT HNVPIYEGYA LPHAIMRLDL AGRDLTDYLM KILTERGYSF
BBBBCCCCBB BBBBBBCCBB BCCBBBBBCC CHHHHHHHHH HHHHHHCCCC
VTTAEREIVR DIKEKLCYVA LDFENAMATA ASSSSLEKSY ELPDGQVITI
CCHHHHHHHH HHHHHHCCCC CHHHHHHHHH HCCCCCCBBB BBCCCCBBBB
GNERFRCPET LFQPSFIGME SAGIHETTYN SIMKCDIDIR KDLYANNVMS
CCHHHHHHHH HHHCCCCCCC CCHHHHHHHH HHHHCCCHHH HHHHCCBBBB
GGTTMYPGIA DRMQKEITAL APSTMKIKII APPERKYSVW IGGSILASLS
CCCCCCCCHH HHHHHHHHHH HCCCCCBBBB CCHHHHHHHH HHHHHHHHCC
TFQQMWITKQ EYDEAGPSIV HRKCF
HHHHHCCCCH HHHHHCCHHH HHHCC
~
~
~
~

e) THE SEQSITE FORMAT

This is the format used for the SEQSITE file. This file contains a fairly complete listing of short sequence motifs and their putative functions. A reference is provided with each motif so that its suspected function can be followed up. The same conventions regarding wildcard characters, end-of-sequence characters and so on are used in this file as in the Pattern Search function.

LIBRARY OF SEQUENCE MOTIFS
>*[KRH][DEN]EL$ SMITH M.J. ET AL., EMBO J. 8:3581-3586 (1989) ENDOPLASMIC RETICULUM DIRECTING SEQUENCE
>*RGD* RUOSLAHTII E. ET AL., CELL 44:517-518 (1986) FIBRONECTIN ADHESION SITE
>*CDPGYIGSR* GRAF, J. ET AL., CELL 48:989-996 (1987) MAMMAL LAMININ DOMAIN III B1 CHAIN CELL ATTACHMENT SITE
>*[DE][DE]*SG*G* BOURDON M.A. ET AL., PNAS 84:3194-3198 (1987) GLYCOSAMINOGLYCAN BINDING SITE
>*[DE][DE]**SG*G* BOURDON M.A. ET AL., PNAS 84:3194-3198 (1987) GLYCOSAMINOGLYCAN BINDING SITE
>*[DE]*[DE]*SG*G* BOURDON M.A. ET AL., PNAS 84:3194-3198 (1987) GLYCOSAMINOGLYCAN BINDING SITE
~
~
~

Other file structures exist in SEQSEE but those presented above represent the most important or the most commonly encountered record types. Please feel free to browse through the other databases and library files -- but try to avoid altering their contents in any substantial way. If you do find an error (either in content or in structure), please try to notify us as soon as possible. We will try to make the corrections in time for the next release of the program.

XIII. MANIPULATING AND EDITING FILES ON IRIS AND SUN WORKSTATIONS

IRIS and SUN Workstations operate under the UNIX operating system. This particular operating system is fast becoming an industry standard because of its extensive support and the fact that it can be customized to suit the needs of almost any user or programmer. Unfortunately, it is NOT the most user-friendly of operating systems. The UNIX operating system is based on a file or directory hierarchy which essentially resembles a tree structure. At the top of the tree is the main directory called "/home". Moving up or down the tree is accomplished by changing directories (using the "cd" command). The program SEQSEE and its associated subroutines reside in the directory "/home/local/seqsee". In this location SEQSEE is accessible to all users from their default directory when they initially login. The SEQSEE program may be started simply by typing "seqsee".

To help the uninitiated with some of the intricacies of the UNIX system we present the following brief review of some of the more useful commands for directory and file manipulation in this "unified" operating environment. Users familiar with the UNIX operating system should skip this section.

a) MOVING AND MAKING DIRECTORIES

cd                    Places user in home directory (typically "/home/usr").
cd ..                 Places user in parent directory (the next highest directory in the tree).
cd mydir              Changes current directory to "mydir".
cd bigdir/smalldir    Changes or moves user to the directory "smalldir" which is in "bigdir".
mkdir dir             Creates a new subdirectory called "dir".
pwd                   Print Working Directory -- indicates which directory the user is in.
ls                    Lists files in current directory.

b) MOVING AND MAKING FILES

vi file1              Creates the file "file1" and enters the user into the vi editor (see later).
cp file1 file2        Copies "file1" to "file2". A new "file2" is automatically created.
mv file1 file2        Moves (renames) "file1" to "file2".
rm file1              Removes or deletes "file1" from the current directory.

c) VIEWING A FILE

vi file1              View/Create/Visual Edit "file1".
cat file1             Concatenates (lists) the contents of "file1" to the screen.
grep *** file1        Searches "file1" for the pattern "***".

d) EDITING COMMANDS FOR THE "vi" EDITOR

The "vi" editor is the UNIX visual editor. It may be started by simply typing "vi filename". This editor is not particularly sophisticated compared to most editors available on even small microcomputers, but it is a universal UNIX editor and for that reason it is important to understand its command structure and mnemonic devices. Following is a list of the more useful "vi" commands:

h        moves cursor left
j        moves cursor down
k        moves cursor up
l        moves cursor right
20+      moves cursor forward 20 lines
20-      moves cursor backwards 20 lines
G        moves cursor to end of file
^g       displays line number where cursor is placed
:n       moves cursor to line number "n"
/word/   searches for the next occurrence of the character string "word"
x        deletes character where cursor is placed
25dd     deletes 25 lines starting with current line (where the cursor is located)
r p      replaces current character with the letter "p"
u        undoes previous editor command
.        repeats last edit command
i        enters into insert mode
esc      exits insert mode (where esc is the escape key)
:q!      quits editing, does not save changes
:wq      saves changes and quits editing
:q       quits editing if no changes made

*Note that SEQSEE is constructed so that when analyses are completed, the program automatically prints the results to the screen while simultaneously putting the user into the "vi" editor. In this way the user can manipulate the files in any way he or she wishes. In most cases the user will only want to inspect the files and this may be done simply by scrolling through the output with the cursor control keys (hjkl). The output or "results" file can be exited simply by typing ":q" or ":wq" which will then return the user to the next menu in SEQSEE.

XIV. PRINTING FILES FROM SEQSEE

Nearly all results produced from a SEQSEE sequence analysis are saved to a user-designated file. These files may be edited either within SEQSEE or outside the program using the "vi" editor. Printing files to a printer is a very system dependent operation and if the user is unsure of how to produce a hardcopy output from their terminal, they should consult with their system manager or local computer "expert" for more details.

XV. TROUBLE SHOOTING

A. QUESTIONS AND ANSWERS ABOUT SEQSEE

Q. I have logged into my account and wish to use SEQSEE. The system administrator has assured me that SEQSEE is installed and that I have full access to it. Tell me what steps I should take to most effectively use this program.

A. 1) Although there are no absolute windowing requirements for SEQSEE, it is recommended that your window be at least 80 characters wide and 40 or more lines in length. This will permit easy viewing of analytical output, help files and function menus. Having more than one window on the screen will also allow you to look at intermediate results while the program is running in another window. Therefore we strongly suggest that a "two-window" environment be used.
2) Create your own directory for running SEQSEE and make this your current directory (use the commands: mkdir seqsee; cd seqsee). This will help in the organization of your input files and results.
3) Copy the control file "seqsee.parms" into the "seqsee" directory you have just created. Your system administrator should be able to tell you where he/she has placed this file on the system. Typically the command to perform this operation is:
      cp /usr/local/seqsee/seqsee.parms .
4) If you already have sequence files, it is wise to copy them into your "seqsee" directory as well. Try to ensure that they are in the proper format (see the sections on SEQFILE formats). Note that you can always use SEQSEE to create new sequence files which conform to the SEQFILE format.
5) Once you have completed all of these operations you are ready to use SEQSEE.

Q. I have typed "seqsee" and I don't get the main menu. What's wrong?

A. 1) Have you typed "seqsee" correctly? (remember S-E-Q-S-E-E)
2) Have all of the installation programs been run successfully?
3) You might be in the wrong directory. Check for a program called "seqsee" in either your current directory or in some public place on your system. If you can't find it, ask your system administrator where "seqsee" is supposed to reside.
4) Check for the control file "seqsee.parms" either in your current directory or in some public place on your system. Check for any possible corruptions to "seqsee.parms".

Q. What exactly is the function of the file "seqsee.parms"?

A. The file "seqsee.parms" contains all of the default parameters that SEQSEE needs to run properly. When it is run, SEQSEE will first check your current directory to see if you have a "seqsee.parms" file. If not, it will then use the default "seqsee.parms" which the installation program had previously created. The control file should be relatively self-explanatory and is also well documented in the manual.

Q. What are the most common items that could be changed in the "seqsee.parms" file?

A. There are many different sets of parameters ranging from hydrophobicity values to similarity matrices which you may wish to experiment with by changing their default values in the "seqsee.parms" file. Before doing so, however, we recommend that you read up on the section regarding databases and library files. Be aware that if you change similarity matrices (such as changing the "wt.rbo" matrix to the "wt.dayhoff" matrix) you will also have to change other parameters such as "gap penalty" and "gap size penalty". There are also several "print" flags in the "seqsee.parms" file. These can be turned on or off depending on whether you want terse or verbose output. Many of the options in the "seqsee.parms" file are strictly for the programmer or for those users who already have an in-depth knowledge of how the algorithms work.

Q. While running SEQSEE, the screen was cleared and I was placed in some kind of editor. How do I get out of this mode?

A. SEQSEE uses the "vi" editor whenever it has results to show to the user. To exit this editing mode, type ":q" to exit without saving changes or type ":wq" to exit with all changes saved.

Q. Is there some way to turn this fullscreen editing feature off?

A. Yes. Some people may not like this feature, others may be on terminals which do not support "vi" and they would much prefer to use the commands "more" or "cat" to view their results. Either way, you can turn off "vi" by changing the "vi" flag from 1 to 0 in the "seqsee.parms" file.

Q. What should I do if I want to get out of something that I mistakenly got into? For example, I am doing an exhaustive alignment search and I realize I am using the wrong sequence as a query.

A. The easiest way to get out of a predicament is to press the "control" and "c" keys simultaneously. This will kill the operation and take you back to the main SEQSEE menu.
Another more drastic method of terminating an operation is via the UNIX "kill" and "ps" commands (see your UNIX manual for details). When taking this form of action there will likely be one or more temporary files created which should be removed as soon as conveniently possible (these files typically contain multi-digit numbers and a ".tmp" suffix).

Q. Is there some way for me to check on the progress of a particular search (or alignment) without having to wait for the search to end?

A. Yes. Most modules in SEQSEE keep intermediate results, especially those functions which can take a very long time to run. These results are stored in a continually updated file appended with the suffix ".tmp" or ".tmp.ids" (e.g. 653120924.tmp). While SEQSEE is running in one window, you may go to another window and type "more *.tmp" to see these intermediate results.

Q. When I save my search results in file "X", I also have a file in my directory called "X.ids". What is the purpose of this file?

A. This "X.ids" file contains only the ID codes and protein names from the results file. Some of these results files can get pretty big and so, to save you some time, SEQSEE provides a truncated listing of this file. This "X.ids" file may be particularly useful if you are only interested in viewing the names of proteins (as opposed to complete alignments) which appeared in a particular search.

Q. Now that I have my results, how do I print them out?

A. The standard UNIX command to print a text file is "lpr filename". However you should check with your system administrator to be sure you know how to print text files. Many facilities have their own printing macros or are connected to certain specialized printers or plotters which may require very specific commands.

Q. Can I run SEQSEE in the background?

A. Yes. Running SEQSEE in the background allows you to start a search and to continue that search after you have logged out or while other users are logged in.
Once you have decided your search is running properly and you wish to put the search into the background, press the "control" and "z" keys simultaneously to temporarily stop the job. Then type "bg". This command restarts the program and sets it running in the background. You are now safe to log out and go home.

Q. Can I change the priority at which SEQSEE is running? For example, I want my exhaustive alignment job to run only if no one else needs the computer.

A. Yes. However, you can only change the priority after the job starts running. If you start an exhaustive alignment and wish to lower its priority, issue the command:
      ps -ux | grep nw_align
If you are running on a Silicon Graphics machine, use the following:
      ps | grep nw_align
Then issue the UNIX command "renice 19 PID" where PID is the process ID of the job whose priority you wish to change. The PID number appears in the column headed "PID" (the second column of "ps -ux" output, the first column of plain "ps" output). Please ask your system administrator if you are unsure how this works.

Q. What does a "core dumped" message mean?

A. This means that the program has crashed either due to a programming bug or to a boundary limit being exceeded. This may also happen if the system has run out of "swap space" (See the UNIX system manual for details). Sometimes a swap space problem will be indicated by an "out of memory" error message as well.

Q. What can I do if I get a "core dumped" message?

A. You may do two things. First, try to check your SEQSEE control file ("seqsee.parms") to see if any values have been altered or if they differ substantially from the default parameters presented in the manual. Second, you may try varying your input to see if the problem only occurs with your particular set of data. If you are the system administrator and have some programming knowledge and you find that none of the above suggestions work, you may wish to re-compile the corrupted module and to attempt to debug the program using "dbx" to identify which line caused the program to crash.

B. SEQSEE CHECKLIST (VERSION 1.2)

In addition to the HELP features offered on-line, here's a list of items that should be checked if, for some reason, you have any difficulty in obtaining results from SEQSEE. This little list is not guaranteed to solve all of your problems but it should be quite helpful -- especially for first-time users.

1) Have you read the manual?
2) Are you in the right directory? (/home/usr/seqsee or some similar variation)
3) Have you spelled "seqsee" correctly?
4) Have you pressed the RETURN key after entering your response?
5) Have you answered the computer query correctly? (i.e. entered a number when a number was requested and a filename when a filename was requested)
6) Have you typed "$" to end your sequence entry?
7) Have you typed "quit" to end your filename or pattern entries?
8) Have you typed ":q" or ":wq" or ":q!" to exit the "vi" editor?
9) Are you using the proper "vi" editor commands?
10) Have you checked that your input filename is spelled correctly?
11) Does your input file exist or has it been deleted or placed in another directory?
12) Does your sequence contain any unusual or non-standard characters?
13) Is your input sequence file in the standard SEQFILE format?
14) Have you or someone else changed something in the "seqsee.parms" file that wasn't supposed to have been changed?
15) Are you in the right program?

C. NOTES FOR THE SYSTEM ADMINISTRATOR/PROGRAMMER REGARDING SEQSEE

1) Each function in SEQSEE is its own separate program with its own directory and Makefile. The program called "seqsee" is only a driver program which calls other programs and which shows or saves the results. The source code for the driver is contained in "init.c", "calc.c" and "main.c".
2) ALEXIS is the program which performs the comprehensive analysis of secondary structure. Like SEQSEE, it is a driver: it calls all the modules which begin with "a_" in order to compile its results.
3) The source code which was found to be common to most modules was placed in a directory called "libc". UNIX dependent routines are found in "libc/unix.lib.c". Most modules will compile independently of UNIX if the call to "get_date" is taken out.
4) The following naming conventions were adopted for the source code within each of the modules:
      1) *.h - global variables for the module.
      2) main.c - main program for the given function.
      3) init.c - routine to read in parameters from "seqsee.parms".
      4) menu.c - routine to produce menu I/O specific to the function.
      5) calc.c - routine to perform the calculations specific to the function.
      6) print.c - routines which handle the function output.
      7) dbase.c - routines to read local databases.
5) The following naming conventions were adopted for the non-source code files within the modules:
      1) test.run - sample input data (i.e. function < test.run)
      2) output - output from "test.run"
      3) output.ids - terse version of the output
      4) seqsee.parms - parameters required by the function
      5) *.seq - input sequence file
      6) wt.* - similarity scoring matrix
6) Most of the important boundary limitations for any particular algorithm can be found in the ".h" file of the corresponding directory. For example there is a limit to the size of an input sequence (2000 residues). It should be a fairly simple matter to change a boundary and then to type "make" to re-compile that particular module.
7) Each source code directory has its own "seqsee.parms" file for testing purposes.
8) Most of the source code should be fairly straightforward to read and/or understand. The one exception appears to be the "align.c" routine. In attempting to make this algorithm as efficient as possible, we ended up sacrificing some of its programming clarity.

Recommended Readings

Fasman, G.D. (ed.) "Prediction of Protein Structure and the Principles of Protein Conformation". New York (Plenum), 1989.
The most comprehensive treatise on protein structure prediction available.
Filled with dozens of contributions and reviews from many of the foremost experts in the field. An excellent introduction to the subject. Highly recommended for both novice and expert alike.

Doolittle, R.F. (ed.) "Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences". Methods in Enzymology, Vol. 183, 1990.
A fine complement to Fasman's work. This is an equally comprehensive review with in depth descriptions and useful assessments of numerous sequence analysis algorithms and programs. Provides good summaries of how the field has developed and where the field is likely to go. An excellent source-book for methods and ideas for sequence alignment, sequence assessment and cladistic deconvolution.

Doolittle, R.F., "Of URFs and ORFs: A Primer of How to Analyze Derived Amino Acid Sequences". California (University Science Books), 1987.
A gem of a book. One of the easiest to understand "how-to" references you can find. Among the most informative texts on the subject of protein sequence analysis. Get it before it goes out of print.

Gribskov, M.R. and Devereux, J. (ed.) "Sequence Analysis Primer", New York (W.H. Freeman and Co.) 1992.
An excellent, up-to-date account of both DNA and protein sequence analysis. It is filled with hundreds of illustrations and dozens of "real-life" examples. It also provides a very useful appendix with information on software, databases, terminology and extensive references. This book should be in everyone's personal library.

Schulz, G.E., A Critical Evaluation of Methods for Prediction of Protein Secondary Structures, Ann. Rev. Biophys. and Biophys. Chem. 17, 1-22 (1988).
A very fair-minded critique of secondary structure prediction methods. Offers a quick and easy-to-read introduction to the potential applications and probable short-comings of protein structure prediction.

Taylor, W.R., Pattern Matching Methods in Protein Sequence Comparison and Structure Prediction, Protein Eng. 2, 77-86 (1988).
An extremely informative review of the field as it stood in 1988. Well written and easy to read. Offers some excellent insights into a number of newer (and older) methods in protein sequence analysis. Highly recommended.

General References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., & Lipman, D.J., Basic Local Alignment Search Tool, J. Mol. Biol. 215, 403-410 (1990).
Barton, G.J. & Sternberg, M.J.E., A Strategy for the Rapid Multiple Alignment of Protein Sequences, J. Mol. Biol. 198, 327-337 (1987).
Chiche, L., Gregoret, L.M., Cohen, F.E. & Kollman, P.A., Protein Model Structure Evaluation Using the Solvation Free Energy of Folding, Proc. Natl. Acad. Sci. (USA) 87, 3240-3243 (1990).
Chothia, C., Structural Invariants in Protein Folding, Nature 254, 304-308 (1975).
Chothia, C., The Nature of the Accessible and Buried Surfaces in Proteins, J. Mol. Biol. 105, 1-14 (1976).
Chou, K.-C. & Zhang, C.-T., A Correlation-Coefficient Method to Predicting Protein-Structural Classes from Amino Acid Compositions, Eur. J. Biochem. 207, 429-433 (1992).
Chou, P.Y. & Fasman, G.D., Empirical Predictions of Protein Conformation, Ann. Rev. Biochem. 47, 251-276 (1978).
Chou, P.Y. & Fasman, G.D., Prediction of Protein Conformation, Biochemistry 13, 222-245 (1974).
Cornette, J.L., Cease, K.B., Margalit, H., Spouge, J.L., Berzofsky, J.A. & DeLisi, C., Hydrophobicity Scales and Computational Techniques for Detecting Amphipathic Structure in Proteins, J. Mol. Biol. 195, 659-685 (1987).
Creighton, T.E., "Proteins: Structures and Molecular Properties", W.H. Freeman, New York (1984).
Dayhoff, M.O., Barker, W.C. & Hunt, L.T., Establishing Homologies in Protein Sequences, Methods in Enzymology 91, 524-545 (1983).
Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C., A Model of Evolutionary Change in Proteins, Atlas of Protein Sequence and Structure 5 (Suppl. 3) 345-352 (1979).
Eisenberg, D., Weiss, R.M. & Terwilliger, T.C., The Hydrophobic Moment Detects Periodicity in Protein Hydrophobicity, Proc. Natl. Acad. Sci. (USA) 81, 140-144 (1984).
Fasman, G.D. & Gilbert, W.A., The Prediction of Transmembrane Protein Sequences and Their Conformation: an Evaluation, TIBS 15, 89-92 (1990).
Fisher, H.F., A Limiting Law Relating the Size and Shape of Protein Molecules to Their Composition, Proc. Natl. Acad. Sci. (USA) 51, 1285-1290 (1964).
Garnier, J., Osguthorpe, D.J. & Robson, B., Analysis of the Accuracy and Implementation of Simple Methods for Predicting the Secondary Structure of Globular Proteins, J. Mol. Biol. 120, 97-120 (1978).
Gibrat, J.F., Garnier, J. & Robson, B., Further Development of Protein Secondary Structure Prediction Using Information Theory, J. Mol. Biol. 198, 425-443 (1987).
Gribskov, M., McLachlan, A.D. & Eisenberg, D., Profile Analysis: Detection of Distantly Related Proteins, Proc. Natl. Acad. Sci. (USA) 84, 4355-4358 (1987).
Janin, J., Surface and Inside Volumes in Globular Proteins, Nature 277, 491-493 (1979).
Karplus, P.A. & Schulz, G.E., Prediction of Chain Flexibility in Proteins, Naturwissenschaften 72, 212-213 (1985).
Klein, P., Kanehisa, M. & DeLisi, C., The Detection and Classification of Membrane-Spanning Proteins, Biochim. Biophys. Acta 815, 468-476 (1985).
Kyte, J. & Doolittle, R.F., A Simple Method for Displaying the Hydropathic Character of a Protein, J. Mol. Biol. 157, 105-132 (1982).
Lesk, A.M., Levitt, M. & Chothia, C., Alignment of the Amino Acid Sequences of Distantly Related Proteins Using Variable Gap Penalties, Protein Eng. 1, 77-78 (1986).
Levin, J.M. & Garnier, J., Improvements in a Secondary Structure Method Based on a Search for Local Sequence Homologies and its use as a Model Building Tool, Biochim. Biophys. Acta 955, 283-295 (1988).
Levin, J.M., Robson, B. & Garnier, J., An Algorithm for Secondary Structure Determination in Proteins Based on Sequence Similarity, FEBS Lett. 205, 303-308 (1986).
Lipman, D.J. & Pearson, W.R., Rapid and Sensitive Protein Similarity Searches, Science 227, 1435-1441 (1985).
McLachlan, A.D., Tests for Comparing Related Amino-acid Sequences: Cytochrome C & Cytochrome C551, J. Mol. Biol. 61, 409-423 (1971).
Miller, S., Janin, J., Lesk, A.M. & Chothia, C., Interior and Surface of Monomeric Proteins, J. Mol. Biol. 196, 641-656 (1987).
Needleman, S.B. & Wunsch, C.D., A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins, J. Mol. Biol. 48, 443-453 (1970).
Nishikawa, K. & Ooi, T., Amino Acid Sequence Homology Applied to Protein Secondary Structures and Joint Prediction with Existing Methods, Biochim. Biophys. Acta 871, 45-54 (1986).
Parker, J.M.R., Guo, D., & Hodges, R.S., New Hydrophobicity Scale Derived from HPLC Peptide Retention Data, Biochemistry 25, 5425-5431 (1986).
Pearson, W.R. & Lipman, D.J., Improved Tools for Biological Sequence Comparison, Proc. Natl. Acad. Sci. (USA) 85, 2444-2448 (1988).
Richards, F.M., Areas, Volumes, Packing and Protein Structure, Ann. Rev. Biophys. Bioeng. 6, 151-175 (1977).
Rooman, M.J. & Wodak, S.J., Identification of Predictive Sequence Motifs Limited by Protein Structure Database Size, Nature 335, 45-49 (1988).
Rooman, M.J. & Wodak, S.J., Weak Correlation Between Predictive Power of Individual Sequence Patterns and Overall Prediction Accuracy in Proteins, Proteins: Struct. Func. Gen. 9, 68-78 (1991).
Rooman, M.J., Rodriguez, J. & Wodak, S.J., Relations Between Protein Sequence and Structure and Their Significance, J. Mol. Biol. 213, 337-350 (1990).
Schwartz, R.M. & Dayhoff, M.O., Matrices for Detecting Distant Relationships, Atlas of Protein Sequence and Structure 5 (Suppl. 3) 353-358 (1979).
Sonnichsen, F.D., Sykes, B.D., Chao, H. & Davies, P.L., The Nonhelical Structure of Antifreeze Protein Type III, Science 259, 1154-1157 (1993).
Sweet, R.M., Evolutionary Similarity Among Peptide Segments is a Basis for Predicting Protein Folding, Biopolymers 25, 1566-1577 (1986).
Upton, C., Mossman, K. & McFadden, G., Encoding of a Homolog of the IFN-g Receptor by Myxoma Virus.
Science 258, 1369-1372 (1992).
Upton, C., Stuart, D. & McFadden, G., Identification of a Pox Virus Gene Encoding a Uracil DNA Glycosylase, Proc. Natl. Acad. Sci. USA (in press).
Williams, R.W., Chang, A., Juretic, D. & Loughran, S., Secondary Structure Predictions and Medium Range Interactions, Biochim. Biophys. Acta 916, 200-204 (1987).
Zamyatnin, A.A., Protein Volume in Solution, Prog. Biophys. Mol. Biol. 24, 107-123 (1972).

APPENDIX 1

THE SEQSEE CONTROL FILE

To give users as much operational flexibility as possible, we have made the SEQSEE control file completely "user" accessible. The control file contains the default values of all the library filenames, parameters, penalties, matrices and other variables that are read whenever a SEQSEE function is run. We hope that free access to the control file will encourage users to experiment with different alignment matrices, hydrophobicity scales or sequence patterns to discover which values best suit their needs. The control file may be accessed and altered through the "File Viewer" command while in SEQSEE, or it may be altered outside SEQSEE by editing the file named "seqsee.parms" in the directory "/usr/local/seqsee". A complete listing of the SEQSEE control file and all of its parameter options is provided below.

**** Parameter List for SEQSEE ****
Users should feel free to copy this file to their own directory and make any changes they feel appropriate. Parameter entries are preceded by 2 consecutive angle brackets; the order of the parameters must be maintained! Comments and blank lines can be placed anywhere.
*****************************************************************
Id code for main seqsee driver.
>> SEQSEE_V1.2
Location of programs that the seqsee driver will be calling
>> /canopus/rbo/seqsee/seqhelp/seqhelp
>> /canopus/rbo/seqsee/seqed/seqed
>> /canopus/rbo/seqsee/seqret/seqret
>> /canopus/rbo/seqsee/stats/stats
>> /canopus/rbo/seqsee/alexis/alexis
>> /canopus/rbo/seqsee/seqsearch/seqsearch
>> /canopus/rbo/seqsee/fleqsee/fleqsee
>> /canopus/rbo/seqsee/moment/moment
>> /canopus/rbo/seqsee/hydro/hydro
>> /canopus/rbo/seqsee/fast_align/fast_align
>> /canopus/rbo/seqsee/sb_align/sb_align
>> /canopus/rbo/seqsee/nw_align/nw_align
>> /canopus/rbo/seqsee/mult_align/mult_align
>> /canopus/rbo/seqsee/psearch/psearch
>> /canopus/rbo/seqsee/hsearch/hsearch
>> /canopus/rbo/seqsee/dotplot/dotplot
>> /canopus/rbo/seqsee/refscan/refscan
>> /canopus/rbo/seqsee/browse/browse
Automatically enter vi editor when results found (1=yes, 0=no).
>> 1
*****************************************************************
Id code for help function. Do not change this line.
>> HELP
Number of help files
>> 9
Location of each of the help files
>> /canopus/rbo/seqsee/docs/help.authors
>> /canopus/rbo/seqsee/docs/help.intro
>> /canopus/rbo/seqsee/docs/help.recom
>> /canopus/rbo/seqsee/docs/help.menu.brief
>> /canopus/rbo/seqsee/docs/help.menu.details
>> /canopus/rbo/seqsee/docs/help.tutorial
>> /canopus/rbo/seqsee/docs/help.ques
>> /canopus/rbo/seqsee/docs/help.seqfile
>> /canopus/rbo/seqsee/docs/help.aa.info
*****************************************************************
Id code for seqret. Do not change this line.
>> SEQRET
What format is the sequence database?
>> 4   1 = SWISS-PROT, 2 = PIR, 3 = SWISS-PROT (intelligenetics version), 4 = PIR (intelligenetics version)
>> 6   Number of files that compose the database
Location of each sequence database file
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED2.PDB
Update output file every 'x' proteins which are processed.
>> 1000
*****************************************************************
Id code for stats function. Do not change this line.
>> STATS
Location of SEQBANK database
>> /canopus/rbo/seqsee/databases/SEQBANK.db
hydrophobicity table
>> /canopus/rbo/seqsee/lib/kyte.parms
These thresholds are dependent on the hydrophobicity table used.
>> 0.10   hydrophobic proteins threshold
>> -6.00   hydrophilic proteins threshold
>> 0.85   protein insoluble threshold
>> 1.90   protein generally does not fold threshold
>> 0.77   protein insoluble threshold
>> 1.43   protein generally does not fold threshold
Hydrophobic Amino Acids
>> ACFGHILMVWY   one letter codes
>> 52.44   average percent of these amino acids in a protein
Hydrophilic Amino Acids
>> DEKNPQRST   one letter codes
>> 47.56   average percent of these amino acids in a protein
molecular weight table
>> /canopus/rbo/seqsee/lib/mol.weights
molecular volume table
>> /canopus/rbo/seqsee/lib/mol.volume
molecular surface area table
>> /canopus/rbo/seqsee/lib/mol.surfarea
molecular partial specific volume table
>> /canopus/rbo/seqsee/lib/mol.parspecvol
molecular polar, nonpolar, surface area table
>> /canopus/rbo/seqsee/lib/mol.asa
molecular fraction buried
>> /canopus/rbo/seqsee/lib/mol.fracbur
fraction of amino acids buried
>> /canopus/rbo/seqsee/lib/fracbur.parms
*****************************************************************
Function code for
alexis
>> ALEXIS
Location of programs alexis will be running
>> /canopus/rbo/seqsee/a_membrane/a_membrane
>> /canopus/rbo/seqsee/a_motif/a_motif
>> /canopus/rbo/seqsee/a_homol/a_homol
>> /canopus/rbo/seqsee/a_moment/a_moment
>> /canopus/rbo/seqsee/a_gor/a_gor
>> /canopus/rbo/seqsee/a_cfas/a_cfas
Correlation tables for predicting protein structural classes
>> /canopus/rbo/seqsee/lib/alexis.norm   (for most sequences)
>> /canopus/rbo/seqsee/lib/alexis.cys   (for heavy cys sequences)
Remove intermediate results files?
>> 1   (1=yes, 0=no)
*****************************************************************
Identification code for the following set of parameters. Do not change this line.
>> A_MEMBRANE
Location of membrane spanning hydrophobicity parms
>> /canopus/rbo/seqsee/lib/membrane.parms
Nature of membrane spanning test (scaling constants)
>> -9.02 170.00 14.27
*****************************************************************
Identification code for the following set of parameters. Do not change this line.
>> A_HOMOL
Enter the location of the SEQBANK database.
>> /canopus/rbo/seqsee/databases/SEQBANK.db
Tell program the location of the similarity scoring matrix.
>> /canopus/rbo/seqsee/lib/wt.rbo
Homologous segments must have a certain minimum test stat before the secondary structure they represent is counted.
>> 3.20
Improve prediction by weighting of scores because of unequal representation of secondary structure in the database.
>> 1.000   /* betastrand represent 28% of seqbank */
>> 0.820   /* coil represent almost 37% of seqbank */
>> 0.780   /* helix represent almost 35% of seqbank */
Offset and multiplier needed to normalize prediction scores to mean=1000 and stddev=200.
>> 494.00 0.89
Improve prediction by applying smoothing function
>> 1   /* number of times to apply smoothing function */
Improve prediction by biasing random coils at sequence ends.
>> 1   /* 1=yes, 0=no */
Improve prediction by class weighting.
>> 1   /* 1=yes, 0=no */
>> 1.10   /* beta Scores */
>> 1.30   /* helix Scores */
Improve prediction by smoothing the predicted structure.
>> 1   /* 1=yes, 0=no */
*****************************************************************
Identification code for the following set of parameters. Do not change this line.
>> A_MOMENT
Tell program the location of the chou-fasman parameters.
>> /canopus/rbo/seqsee/lib/moment.cfas
Tell program the location of the hydrophobicity parms which are biased for BetaStrands.
>> /canopus/rbo/seqsee/lib/moment.bhydro
Tell program the location of the hydrophobicity parms which are biased for Helices.
>> /canopus/rbo/seqsee/lib/moment.hhydro
Beta Strand Prediction Parameters
>> 7   /* window size */
>> 1 2 3 4 3 2 1   /* cfas weighting factors */
>> 2   /* number of periodicity tests */
>> 160 180   /* periodicity angles */
Coil Prediction Parameters
>> 5   /* window size */
>> 2 3 4 3 2   /* cfas weighting factors */
Helix Prediction Parameters
>> 11   /* window size */
>> 2 3 3 3 3 3 3 3 3 3 2   /* cfas weighting factors */
>> 2   /* number of periodicity tests */
>> 100 110   /* periodicity angles */
Offset and multiplier needed to normalize prediction scores to mean=1000 and stddev=200.
>> 831.00 13.30
Improve prediction by applying smoothing function
>> 1   /* number of times to apply smoothing function */
Improve prediction by biasing random coils at sequence ends.
>> 1   /* 1=yes, 0=no */
Improve prediction by class weighting.
>> 1   /* 1=yes, 0=no */
>> 0.95   /* beta Scores */
>> 1.05   /* helix Scores */
Improve prediction by smoothing the predicted structure.
>> 1   /* 1=yes, 0=no */
*****************************************************************
Identification code for the following set of parameters. Do not change this line.
>> A_GOR
Location GOR parms
>> /canopus/rbo/seqsee/lib/gor.data
Offset and multiplier needed to normalize prediction scores to mean=1000 and stddev=200.
>> 966.0 13.10
Improve prediction by applying smoothing function
>> 0   /* number of times to apply smoothing function */
Improve prediction by biasing random coils at sequence ends.
>> 1   /* 1=yes, 0=no */
Improve prediction by class weighting.
>> 1   /* 1=yes, 0=no */
>> 0.95   /* beta Scores */
>> 1.05   /* helix Scores */
Improve prediction by smoothing the predicted structure.
>> 1   /* 1=yes, 0=no */
*****************************************************************
Identification code for the following set of parameters. Do not change this line.
>> A_CFAS
Tell program the location of the weighting parameters. See the default listed here to understand the input format.
>> /canopus/rbo/seqsee/lib/cfas.data
BetaStrand window size
>> 7
Weighting factors within this window for BetaStrand
>> 1 2 3 4 3 2 1
Coil Window Size
>> 5
Weighting factors within this window for Coil
>> 1 2 3 2 1
Helix Window Size
>> 9
Weighting factors within this window for Helix
>> 1 2 3 4 5 4 3 2 1
Offset and multiplier needed to normalize prediction scores to mean=1000 and stddev=200.
>> 953.0 13.50
Improve prediction by applying smoothing function
>> 1   /* number of times to apply smoothing function */
Improve prediction by biasing random coils at sequence ends.
>> 1   /* 1=yes, 0=no */
Improve prediction by class weighting.
>> 1   /* 1=yes, 0=no */
>> 1.02   /* beta Scores */
>> 1.00   /* helix Scores */
Improve prediction by smoothing the predicted structure.
>> 1   /* 1=yes, 0=no */
*****************************************************************
Function ID code for motif searching program (motifs from literature)
>> LIT_MOTIF
Location of motifs databases
>> /canopus/rbo/seqsee/databases/seqmotif1.db
Printing Parameters
>> 100   Print stats summary every 'x' motifs processed
>> 1   Print individual motifs which match (1=yes, 0=no)
*****************************************************************
Function ID code for motif searching program (computer generated dbase)
>> COMP_MOTIF
Location of motifs databases
>> /canopus/rbo/seqsee/databases/seqmotif2.db
Printing Parameters
>> 100   Print stats summary every 'x' motifs processed
>> 1   Print individual motifs which match (1=yes, 0=no)
*****************************************************************
ID code for seqsite function. Do not change this line.
>> SEQSEARCH
Number of seqsite databases
>> 3
Location of seqsite databases
>> /canopus/rbo/seqsee/databases/SEQSITE.db   (general sequence motifs)
>> /canopus/rbo/seqsee/databases/PHOSITE.db   (general phosphorylation sites)
>> /canopus/rbo/seqsee/databases/EPISITE.db   (antigenic sites)
*****************************************************************
Function ID code. Do not change this line.
>> FLEQSEE
Type of output, 0 = weighted scores, 1 = raw scores
>> 1
Location of flexibility parameters
>> /canopus/rbo/seqsee/lib/fleqsee.parms
Manipulating Flexibility Scores
>> 7   Window size
>> 1 2 3 4 3 2 1   Weighting constants based on window size
*****************************************************************
Function ID code. Do not change this line.
>> MOMENT
Type of output, 0 = weighted scores, 1 = raw scores
>> 1
Location of hydrophobicity parameters (hmom.* files)
>> /canopus/rbo/seqsee/lib/hmom.cornet
Nature of periodicity tests
>> 8   number of tests
>> 0 5 0   type(0=beta, 1=coil, 2=helix), window size, periodicity angle
>> 0 5 160
>> 0 5 170
>> 0 5 180
>> 2 9 90
>> 2 9 100
>> 2 9 110
>> 2 9 120
smoothing function to be applied 'x' times
>> 2
*****************************************************************
Function ID code. Do not change this line.
>> HYDRO
Type of output, 0 = weighted scores, 1 = raw scores
>> 1
Location of hydrophobicity parameters (hphob.* files)
>> /canopus/rbo/seqsee/lib/hphob.kyte
Manipulating hydrophobicity scores
>> 7   Window size
>> 1 2 3 4 3 2 1   Weighting constants based on window size
*****************************************************************
Function ID code. Do not change this line.
>> FAST_ALIGN
What format is the sequence database?
>> 4   1 = SWISS-PROT, 2 = PIR, 3 = SWISS-PROT (intelligenetics version), 4 = PIR (intelligenetics version)
>> 6   Number of files that compose the database
Location of each sequence database file
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED2.PDB
Tell program the location of the similarity scoring matrix.
>> /canopus/rbo/seqsee/lib/wt.align
What minimum value from the similarity scoring matrix would constitute a near match?
>> 5
Cut-off score for similar tuples. Note that this score depends on the matrix selected above. For example in the matrix 'wt.align', FYE is similar to YFN if the cutoff score is 50 or less.
>> 48
Update output file every 'x' proteins which are processed.
>> 1000
Penalize the alignment score 'x' points every time a gap needs to be introduced. The value of 'x' depends on the similarity scoring matrix, a typical value being the 3rd or 4th highest number in the matrix.
>> 20
Penalize the alignment score 'x' points for each entry in the gap. This will keep the gap from getting too large.
>> 5
*****************************************************************
ID code for exhaustive alignment on seqsee database.
>> SB_ALIGN
Enter the location of the SEQBANK database.
>> /canopus/rbo/seqsee/databases/SEQBANK.db
Tell program the location of the similarity scoring matrix. See the default listed here to understand the input format.
>> /canopus/rbo/seqsee/lib/wt.rbo
What minimum value from the similarity scoring matrix would constitute a near match?
>> 5
Random number seed used to jumble sequences.
>> 13791
sorting alignment scores
0 = sort by raw score (tends to overlook smaller sequences)
1 = sort by raw score / sequence len (fast, generally more accurate)
2 = sort by jumbling (very slow but most accurate)
>> 1
These parameters are only used if sort by jumbling option chosen. Number of jumbles based on current test stat. (6 entries only!) (eg, if after 18 jumbles the test stat exceeds 2 std dev, keep going).
jumbles   std dev
>> 3   0.00
>> 8   1.00
>> 18   2.00
>> 50   3.00
>> 150   4.00
>> 500   9999.00   (this tstat value is ignored here)
Update output file every 'x' proteins processed.
>> 10
Penalize the alignment score 'x' points every time a gap needs to be introduced. The value of 'x' depends on the similarity scoring matrix, a typical value being the 3rd or 4th highest number in the matrix.
>> 10
Penalize the alignment score 'x' points for each entry in the gap. This will keep the gap from getting too large.
>> 2
Penalty for a gap within a random coil region
>> 0
Penalty for a gap at the end of a helix or beta strand structure
>> 1
Penalty for a gap in the middle of a helix or beta strand structure
>> 4
*****************************************************************
Identification code for the following set of parameters.
>> NW_ALIGN
What format is the sequence database?
>> 4   1 = SWISS-PROT, 2 = PIR, 3 = SWISS-PROT (intelligenetics version), 4 = PIR (intelligenetics version)
>> 6   Number of files that compose the database
Location of each sequence database file
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED2.PDB
Tell program the location of the similarity scoring matrix. Matrices such as Dayhoff can be used. See the default listed here to understand the input format.
>> /canopus/rbo/seqsee/lib/wt.rbo
What minimum value from the similarity scoring matrix would constitute a near match?
>> 5
Random number seed used to jumble sequences
>> 13791
sorting alignment scores
0 = sort by raw score (tends to overlook smaller sequences)
1 = sort by raw score / sequence len (fast, generally more accurate)
2 = sort by jumbling (very slow but most accurate)
>> 1
These parameters are only used if sort by jumbling option chosen. Number of jumbles based on current test stat. (6 entries only!) (eg, if after 18 jumbles the test stat exceeds 2 std dev, keep going).
jumbles   std dev
>> 3   0.00
>> 8   1.00
>> 18   2.00
>> 50   3.00
>> 150   4.00
>> 500   9999.00   (this tstat value is ignored here)
Update output file every 'x' proteins processed.
>> 50
Penalize the alignment score 'x' points every time a gap needs to be introduced.
The value of 'x' depends on the similarity scoring matrix, a typical value being the 3rd or 4th highest number in the matrix.
>> 10
Penalize the alignment score 'x' points for each entry in the gap. This will keep the gap from getting too large.
>> 2
*****************************************************************
ID function code. Do not change this line.
>> MULT_ALIGN
Tell program the location of the similarity scoring matrix.
>> /canopus/rbo/seqsee/lib/wt.rbo
What minimum value from the similarity scoring matrix would constitute a near match?
>> 5
Random number seed used to jumble sequences
>> 13791
sorting alignment scores
0 = sort by raw score (tends to overlook smaller sequences)
1 = sort by raw score / sequence len (fast, generally more accurate)
2 = sort by jumbling (very slow but most accurate)
>> 0
These parameters are only used if sort by jumbling option chosen. Number of jumbles based on current test stat. (6 entries only!) (eg, if after 18 jumbles the test stat exceeds 2 std dev, keep going).
jumbles   std dev
>> 3   0.00
>> 8   1.00
>> 18   2.00
>> 18   3.00
>> 18   4.00
>> 18   9999.00   (this tstat value is ignored here)
Print pairwise alignments? (1=yes, 0=no)
>> 1
Consensus percent - Print the amino acid in the consensus sequence if it is found above the consensus percent threshold.
>> 70
Penalize the alignment score 'x' points every time a gap needs to be introduced. The value of 'x' depends on the similarity scoring matrix, a typical value being the 3rd or 4th highest number in the matrix.
>> 10
Penalize the alignment score 'x' points for each entry in the gap. This will keep the gap from getting too large.
>> 2
*****************************************************************
Identification code for this function. Do not change this line.
>> PSEARCH
What format is the sequence database?
>> 4   1 = SWISS-PROT, 2 = PIR, 3 = SWISS-PROT (intelligenetics version), 4 = PIR (intelligenetics version)
>> 6   Number of files that compose the database
Location of each sequence database file
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED2.PDB
Location of the structurally-determined database.
>> /canopus/rbo/seqsee/databases/SEQBANK.db
Allow multiple matches for a search string in a sequence
>> 0   1 = yes, 0 = no
Update output file every 'x' proteins which are processed.
>> 1000
*****************************************************************
Identification code for this function. Do not change this line.
>> HSEARCH
What format is the sequence database?
>> 4   1 = SWISS-PROT, 2 = PIR, 3 = SWISS-PROT (intelligenetics version), 4 = PIR (intelligenetics version)
>> 6   Number of files that compose the database
Location of each sequence database file
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED2.PDB
Location of structurally determined database.
>> /canopus/rbo/seqsee/databases/SEQBANK.db
Tell program the location of the similarity scoring matrix. Matrices such as Dayhoff can be used.
>> /canopus/rbo/seqsee/lib/wt.align
What minimum value from the similarity scoring matrix would constitute a near match?
>> 5
Update output file every 'x' proteins which are processed.
>> 1000
*****************************************************************
Function ID code.
Do not change this line.
>> DOTPLOT
What format is the sequence database?
>> 4   1 = SWISS-PROT, 2 = PIR, 3 = SWISS-PROT (intelligenetics version), 4 = PIR (intelligenetics version)
>> 6   Number of files that compose the database
Location of each sequence database file
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED2.PDB
Tell program the location of the similarity scoring matrix. See the default listed here to understand the input format.
>> /canopus/rbo/seqsee/lib/wt.align
What minimum value from the similarity scoring matrix would constitute a near match?
>> 5
Length Penalty Value: subtract n*lenPenalty from our score where 'n' is the number of amino acids.
>> 5
Threshold Score (homologous segments must score above)
>> 80
msearchFlag - Does multiple scans down diagonals. Only turn this flag on if database is small. (0 = off, 1 = on)
>> 0
Update output file every 'x' proteins which are processed.
>> 200
*****************************************************************
Identification code for this function. Do not change this line.
>> REFSCAN
What format is the sequence database?
>> 4   1 = SWISS-PROT, 2 = PIR, 3 = SWISS-PROT (intelligenetics version), 4 = PIR (intelligenetics version)
>> 6   Number of files that compose the database
Location of each sequence database file
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_ANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNANNOTATED2.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED1.PDB
>> /canopus/rbo/seqsee/databases/pir.IG/PIR_UNREVIEWED2.PDB
Update output file every 'x' proteins which are processed.
>> 1000
*****************************************************************
Identification code for browse function. Do not change this line.
>> BROWSE
Location of SEQBANK database
>> /canopus/rbo/seqsee/databases/SEQBANK.db
Location of pirsee databases (Pir Titles + ID codes)
>> /canopus/rbo/seqsee/databases/PIRSEE.db
Location of swissee databases (Swiss-prot Titles + ID codes)
>> /canopus/rbo/seqsee/databases/SWISSEE.db
Location of Default Parameters file for SEQSEE
>> /canopus/rbo/seqsee/seqsee.parms

APPENDIX 2

EXPLANATION OF STATS OUTPUT

The Stats output can be divided into 8 sections describing estimates and predictions of molecular weight, amino acid composition, hydrophobicity, pH, surface area, volume, aggregation potential, radius and estimated solvation free energy of folding. These are described in more detail below:

1) Molecular Weight Calculations: These are accurate to 0.1 amu and are based on amino acid weights derived from Creighton (1984). These values may be used in mass spectrometry calculations and in calibrating other weight-dependent physical methods.

2) Amino Acid Composition Calculations: The estimated frequencies are derived from amino acid frequencies obtained from the SEQBANK database. Both the weight percent and numeric percent values may be used to identify unusually high or unusually low frequencies of certain types of amino acids. These data can be important in understanding certain physical characteristics of proteins.

3) Hydrophobicity Calculations: Averages, percentages and ratios are calculated using the Kyte-Doolittle hydrophobicity values (Kyte and Doolittle, 1982). These values can be used to estimate the solubility, stability and "foldability" of peptides and proteins. These are estimates and should only be considered as potential indicators of the physical properties of a query sequence.

4) Charge Calculations: These are calculated using standard equations for charge and pH of linear sequences. The estimated pI, pH and charge values do not take into account the actual tertiary structure of the protein molecule, so they will tend to differ slightly from measured values. The linear charge density can be used to estimate the potential solubility of a peptide (or protein). A linear charge density of 0.20 is usually required for a peptide to be soluble.

5) Surface Area Calculations: These are estimates based on amino acid composition, molecular weight and the assumption that the protein folds into a globular shape (Miller et al., 1987). They are not based on actual three-dimensional structures. However, these estimates have been found to be quite accurate and can be useful when one wishes to compare "predicted" values with "actual" values of known or modelled X-ray structures. These comparisons can be used to assess the quality of a structure or structural model. They may also give some indication of the potential that a peptide sequence will fold using the theories developed by Ken Dill and others.

6) Volume Calculations: These can give some indication of the expected compactness of a peptide or protein. The values given are estimates based on amino acid composition and molecular weight. The estimate of partial specific volume may be useful in ultracentrifugation studies (Zamyatnin, 1972).

7) Solubility and Aggregation Calculations: These calculations are based on relatively simple statistical theories and correlations regarding the propensity of some peptides and proteins to fold, to aggregate or to fall out of solution (Fisher, 1964). It is important to note that the predictions are not based on actual three-dimensional structures and that there are no guarantees on their accuracy. These predictions can be used to identify "problem" peptides and proteins that are about to be synthesized on a peptide synthesizer or expressed in bacteria.

8) Radius and Free Energy Calculations: These are based on standard formulae found in most biochemistry texts. The "Folded" value is based on the assumption that the sequence represents that of a water-soluble, monomeric, globular protein. Free energy calculations are based on the paper by Chiche et al. (1990).

The following is a more detailed description of the "stats" output file. We have tried to provide algebraic expressions for as many of the statistical results as possible. Most of these equations represent approximations or estimates -- they should not be considered "infallible". The error associated with these approximations is typically +/-5% to +/-10%. References for many of these expressions and the theory behind them can be found in the reference list provided at the end of this appendix.
*************************************************************
DEFINITIONS
mw(i)   = molecular weight of amino acid type i
a(i)    = number of amino acids of type i
A(i)    = number of amino acids of type i in SEQBANK
num     = number of residues
NUM     = number of residues in SEQBANK
hp(i)   = hydropathy of amino acid type i
pk(i)   = pKa of side chain of amino acid type i
asa(i)  = accessible surface area of amino acid type i
pasa(i) = polar asa of amino acid type i
nasa(i) = non-polar asa of amino acid type i
v(i)    = volume of amino acid type i
f(i)    = fractional buried surface area of amino acid type i
fb(i)   = fraction of amino acids of type i found buried
sv(i)   = specific volume for amino acid type i
w(i)    = weight percent of amino acid type i
**************************************************************
Molecular Weight......: MW = Σ a(i)*mw(i)
Amino acids...........: num = Σ a(i)
Mean residue weight...: MRW = MW/num

*** Amino Acid Content ***
Amino   Freq      Freq        E(Freq)     Weight            E(weight)
Acid    (total)   (percent)   (percent)   (percent)         (percent)
        a(i)      a(i)/num    A(i)/NUM    a(i)*mw(i)/num    A(i)*mw(i)/NUM
Note: E(x) are expected values based on average amino acid content of soluble proteins.
**************************************************************
Hydrophobicity Parameters: /canopus/rbo/seqsee/lib/kyte.parms
Average Hydrophobicity (ah)...................: AH = Σ a(i)*hp(i)
   Notes: ah = -2.67 --> Average Protein
          ah > 0.10  --> Hydrophobic Protein
          ah < -6.00 --> Hydrophilic Protein
Ratio of Hydrophilicity to Hydrophobicity (rh): RH = |hydrophilic/hydrophobic|
   Notes: rh = 1.22 --> Average Protein
          rh > 1.90 --> Non-folding Protein
          rh < 0.85 --> Insoluble Protein
          hydrophilic = negative component of hydropathy
          hydrophobic = positive component of hydropathy
Percentage of Hydrophobic residues............: %HB = (#A + #C + #F +...)/num
   Notes: Average percentage is 52.44
          Hydrophobic Amino Acids are ACFGHILMVWY
Percentage of Hydrophilic residues............: %HL = (#D + #E + #K +...)/num
   Notes: Average percentage is 47.56
          Hydrophilic Amino Acids are DEKNPQRST
Ratio of %Hydrophilic to %Hydrophobic.........: RHP = %HL/%HB
   Notes: rhp = 0.91 --> Average Protein
          rhp > 1.43 --> Non-folding Protein
          rhp < 0.77 --> Insoluble Protein
**************************************************************
Number of Basic amino acids.: NB = #K + #R
Number of Acidic amino acids: NA = #D + #E
Estimated pI for protein....: PI = pH at which 0 = Σ{±1/(1 + 10**[pKi - pHi])}
pH:     3 4 5 6 7 8 9 10 11
Charge: charge = Σ{±1/(1 + 10**[pKi - pHi])}
Total linear charge density.: LIND = {#K + #R + #D + #E + 2}/num
**************************************************************
Polar Area of Extended Chain...............: PAEC = Σ pasa(i)*a(i)
Non-Polar Area of Extended Chain...........: NAEC = Σ nasa(i)*a(i)
Total Area of Extended Chain...............: AEC = PAEC + NAEC
Polar ASA of Folded Protein................: APFC = AFC - ANFC
Non-Polar ASA of Folded Protein............: ANFC = [NAEC*(-6.21 + 118*RFE)]/100
ASA of folded protein......................: AFC = 7.11*MW**0.718
Ratio of Folded to Extended Area...........: RFE = AFC/AEC
*************************************************************
Buried Polar Area of Folded Protein........: ABP = 0.35*AB
Buried Non-polar Area of Folded Protein....: ABN = 0.61*AB
Buried Charge Area of Folded Protein.......: ABC = 0.04*AB
Total Buried Surface.......................: AB = AEC - AFC
Expected Number and Fraction of Residues 95% Buried
   EFB(i) = (f(i)*NB)/[num - NB + (F(i)*NB)]
   NUMB(i) = a(i)*EFB(i)
Number of buried Amino Acids...............: NB = (num**0.333 - 2.0)**3.0
*************************************************************
Packing Volume (estimate)..................: VP = 1.245*MW
Packing Volume (actual)....................: VP = Σ a(i)*v(i)
Interior Volume of Protein.................: VIN = Σ a(i)*fb(i)*v(i)
Exterior Volume of Protein.................: VEXT = Σ a(i)*(1-fb(i))*v(i)
Partial Specific Volume....................: PSV = Σ sv(i)*w(i)
*************************************************************
Fisher Volume Ratio (actual)...............: VR = VEXT/VIN
Fisher Volume Ratio (idealized)............: VRT = [RAD**3/(RAD - 4.0)**3] - 1
   If VR > VRT then molecule likely forms soluble monomer
   If VR >> VRT then molecule likely doesn't fold into compact structure
   If VR < VRT then molecule likely aggregates
Protein Solubility.........................: SOL = RH*100 + LIND*100 + AH*5
   Notes: solubility = 1.6 --> Average Protein
          solubility < 1.1 --> Insoluble Protein
*************************************************************
Radius of Protein..........................: RAD = 3.875*(num**0.333)
RMS end to end distance of Ext. chain......: RMS = (110*num)**0.5
Radius of Gyration of Extended chain.......: RG = RMS/2.45
*************************************************************
Solvation Free Energy of Folding...........: SFE = 16.02 - 0.99*num

References

Chiche, L., Gregoret, L.M., Cohen, F.E. & Kollman, P.A., Protein Model Structure Evaluation Using the Solvation Free Energy of Folding, Proc. Natl. Acad. Sci. (USA) 87, 3240-3243 (1990).
Chothia, C., Structural Invariants in Protein Folding, Nature 254, 304-308 (1975).
Chothia, C., The Nature of the Accessible and Buried Surfaces in Proteins, J. Mol. Biol. 105, 1-14 (1976).
Creighton, T.E., "Proteins: Structures and Molecular Properties", W.H. Freeman, New York (1984).
Fisher, H.F., A Limiting Law Relating the Size and Shape of Protein Molecules to Their Composition, Proc. Natl. Acad. Sci. (USA) 51, 1285-1290 (1964).
Janin, J., Surface and Inside Volumes in Globular Proteins, Nature 277, 491-493 (1979).
Kyte, J. & Doolittle, R.F., A Simple Method for Displaying the Hydropathic Character of a Protein, J. Mol.
Biol. 157, 105-132 (1982). Miller, S. Janin, J., Lesk, A.M. & Chothia, C., Interior and Surface of Monomeric Proteins, J. Mol. Biol. 196, 641-656 (1987). Richards, F.M., Areas, Volumes, Packing and Protein Structure, Ann. Rev. Biophys, Bioeng. 6, 151-176 (1977). Zamayatnin, A.A., Protein Volume in Solution, Prog.. Biophys. Mol. Biol. 24, 107-123 (1972).
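As a closing illustration, the composition-based entries of the "stats" output (%HB, %HL, their ratio, NB, NA and LIND) can be reproduced from a one-letter sequence alone, using the hydrophobic and hydrophilic residue sets given in this appendix. The Python sketch below is not SEQSEE code. The function name and dictionary keys are assumptions, and the fractions are multiplied by 100 on the assumption that the output reports percentages, as the "Average percentage" notes suggest. The "+ 2" term in LIND is taken directly from the expression above.

```python
HYDROPHOBIC = set("ACFGHILMVWY")   # residue set listed under %HB
HYDROPHILIC = set("DEKNPQRST")     # residue set listed under %HL


def composition_stats(sequence):
    """Return composition-based STATS values for a one-letter protein sequence."""
    seq = sequence.upper()
    num = len(seq)
    if num == 0:
        raise ValueError("empty sequence")
    unknown = set(seq) - HYDROPHOBIC - HYDROPHILIC
    if unknown:
        raise ValueError(f"unrecognized residue codes: {sorted(unknown)}")

    n_hb = sum(1 for residue in seq if residue in HYDROPHOBIC)
    n_hl = num - n_hb                        # every valid residue is in one set
    nb = seq.count("K") + seq.count("R")     # basic amino acids
    na = seq.count("D") + seq.count("E")     # acidic amino acids

    return {
        "num": num,
        "%HB": 100.0 * n_hb / num,
        "%HL": 100.0 * n_hl / num,
        # Ratio is undefined when there are no hydrophobic residues.
        "RHP": n_hl / n_hb if n_hb else float("inf"),
        "NB": nb,
        "NA": na,
        "LIND": (nb + na + 2) / num,
    }


if __name__ == "__main__":
    # Arbitrary demo peptide, not taken from the manual.
    for name, value in composition_stats("ACDEKRGLMW").items():
        print(f"{name:5s} = {value}")
```

The resulting ratio can then be compared against the rhp thresholds quoted in the Notes (0.91 average, above 1.43 non-folding, below 0.77 insoluble).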