seqsee
Brian Sykes Lab

Version: 1.6 - Aug / 2012

Purpose: seqsee - protein sequence analyzer
Contains a number of functions for analyzing protein sequences.

Table of Contents

  1. Latest News
  2. Overview
  3. Copyright and Acknowledgements
  4. Download and Installation
  5. Input sequence file
  6. Recommendations for running seqsee
  7. FAQ

  8. Functions Menu
  9. D. Wishart's Tutorial

  10. Other Documents

Overview

The past few years have seen an explosion in the use of computers in molecular biology. This is in no small part due to the vast quantity of raw protein and nucleic acid sequence data that has been generated in the last decade. For example, since 1980 the number of sequences in the PIR databank alone has increased from fewer than 2,000 to almost 45,000 separate entries today. Without the aid of a computer it would be simply impossible to attempt to analyze or categorize this huge reservoir of biological information or to manage the rapid influx of new sequence data which is being generated every single day.

Fortunately, through the establishment of publicly funded databanks such as GENBANK, SWISS-PROT, NBRF-PIR and EMBL, most of us have been spared the headache of keeping up with this information explosion. Now, much of this sequence data is readily available, in computer readable format, to any scientist who wishes to subscribe to it. While the development of these centralized databases has certainly helped in the rapid dissemination of sequence information, it has not necessarily solved the problem of its rapid "assimilation". As a consequence, a great deal of privately funded effort has been directed towards the development of software (and hardware) which would permit molecular biologists to quickly compare, analyze and otherwise dissect sequence data in a useful or informative manner. Programs or programming suites such as Intelligenetics' IG suite, Wisconsin's GCG suite, IBI's MacVector, and others are now widely available for this purpose and are typically designed to run on computers of all sizes and shapes including IBM PCs, Macintoshes, VAXes and SUN workstations. Many of these larger and more costly programs permit flexible sequence manipulations such as database searching, aligning, comparing and matching -- all at the touch of a button.

It is as a result of the widespread implementation of these software packages (in conjunction with their accompanying databases) that a number of extremely important and very useful "discoveries" have been made. These include, just to mention a few, the identification of a number of new and important receptor families, the identification of numerous repetitive or recurrent folding "modules", the identification of various oncogenic products and the establishment of evolutionary relatedness between hundreds of previously unidentified or poorly understood protein products (see Doolittle, 1990 and references therein).

The success that molecular biologists have had at performing simple, yet important "computer experiments" has induced others, including X-ray crystallographers, NMR spectroscopists, protein engineers and evolutionary biologists to begin using or adapting these same software packages to help answer specific questions of their own. In particular, it is now becoming increasingly common to see some of the more advanced sequence analysis programs (involving multiple sequence alignments and advanced structure prediction algorithms) being used to predict the tertiary structure of previously uncharacterized proteins (Schulz, 1988). Likewise, the push to develop more efficient methods for identifying the potential function or active site location of newly isolated proteins is leading to the development of methods which are, in effect, redefining the meaning of "homology" or "sequence similarity" (Gribskov et al., 1987). In addition, new databases containing secondary structural information, phi and psi angles, torsion angle restraints, NMR chemical shifts, sequence motifs and the like, are continually being added to the current software arsenal to permit even more diverse inquiries and analyses.

Quite clearly the development of sequence analysis packages is entering into a phase of very rapid expansion with many new and unforeseen applications being proposed for an increasingly diverse and substantially larger investigative population. Of course with this rapid expansion come the usual problems of limited availability, restricted usability and increased costs of most sequence analysis software products. As a result, a rather problematic "software stratification" is developing in the field with the most powerful (and most useful) packages becoming more and more expensive while the freely available, unintegrated "shareware" products are becoming less and less useful. In response both to this program diversification and this software stratification we have endeavored to develop a publicly available software package (called SEQSEE - SEQuence SEEker) which offers the program diversity of the expensive packages at the "cost" of the freely available shareware products. Specifically SEQSEE is a multi-purpose menu-driven suite of programs designed to provide a fully integrated, state-of-the-art package for the analysis and display of protein sequences and protein databases. It has been designed with considerable flexibility in mind so as to permit the addition of new features and new algorithms when they are developed or as they are reported in the literature. It contains many of the features available in some of the most comprehensive commercially available programs such as rapid database searching, flexible pattern matching and multiple sequence alignment. It also contains a large number of structural analysis and prediction programs which have been enhanced through the incorporation of several unique databases. In this regard, SEQSEE has been developed expressly from the point of view of those protein chemists who are interested in questions pertaining to both structure and function.

As a result, we believe SEQSEE offers a number of important enhancements and many unique advantages over what is typically found in other commercially available software packages. SEQSEE has already been used in the analyses of fibroma/myxoma viral products (C. Upton, personal communication), cystic fibrosis gene products, fish anti-freeze proteins and a variety of growth factors. In many respects SEQSEE has performed beyond our expectations and it is as a consequence of its consistent (and sometimes unexpected) success that we believe it should be made freely available to all members of the scientific community. We hope you will find SEQSEE as useful in your work as we have found it in ours.

Copyright and Acknowledgements

Authors: D.S. Wishart, R.F. Boyko, L. Willard, F.M. Richards and B.D. Sykes
SEQSEE: A Comprehensive Program Suite for Protein Sequence Analysis. Comp. Appl. Biosci. 10:121-132 (1994)

The programs, databases and libraries contained within SEQSEE have been under continuing development since June of 1988 in the labs of both Frederic M. Richards of Yale University (New Haven) and Brian D. Sykes of the University of Alberta (Edmonton). The SEQSEE project was completed in partial fulfillment of the requirements for the degree of Doctor of Philosophy for David Wishart (Yale, 1991). The principal programmer for SEQSEE has been Robert Boyko with database development and general program design being handled by David Wishart, and some revisions done by Leigh Willard.

Copyright (C) 1991 - No portion of this program may be incorporated into other programs or sold for profit without express written consent of the authors.

Download and Installation

Download seqsee v1.6 (8.4 MB) for PC (Linux) and Mac OS X.

Once you have downloaded the software, you then proceed by untarring the file. For example:

	> tar xvf seqsee-1.6-build.tar 
	> cd seqsee-1.6-build

Look at the README.txt file for details on installation, or follow the procedure below.

1)  Install a sequence database such as PIR or Swiss-Prot on your
    system. These databases are not included with seqsee and must
    be downloaded separately (for example by anonymous ftp).

    For example, suppose you download the text version of the
    UniProtKB/Swiss-Prot database from www.uniprot.org/downloads and
    place it in /usr/local/databases/uniprot_sprot.dat. If you do
    this, then you can skip to step 4.

2)  edit seqdb.fnames
        - indicate the database type and number of files.
        - set sequence database paths to the database you intend to use

3)  edit refdb.fnames
        Some databases separate the reference and sequence data.
        If this is not the case, then set up this file to be
        exactly the same as 'seqdb.fnames'.

4)  ./Install

    That's it, the Install script is simple and you are only asked
    for the location to install seqsee.

5)  You probably want to create an alias for seqsee and place it in
    your .cshrc or .tcshrc file:

        alias seqsee /usr/local/seqsee/bin/seqsee

The program first checks the user's current directory for a parameter file called "seqsee.parms". If one is found there it is used; otherwise the default one in the installed lib directory is read.

Input sequence file format

FORMAT

	Line 1       :  >Title: name
	Line 2,3,...n:  One letter code sequence

EXAMPLE

	>Title: Sequence ABC 
	KWEYASEPIKNMNSWTYR
	AENRQDGGNAHKLLEPRF
	DAAL

  1. The angle bracket ">" denotes the first line of input. Any lines before this are treated as comments.
  2. Blanks and new line characters are ignored in the sequence. Lower or upper case is acceptable.
  3. The input sequence format is identical to the IntelliGenetics database format, except that entries in that database have a protein ID code of 6 letters or less in place of "Title:".
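
As a rough illustration of the three rules above, the following Python sketch reads text in this format. It is not part of SEQSEE. The function name parse_seqfile is hypothetical, and the handling of several ">" records in one file is an assumption (based on the SEQFILE-format databases mentioned elsewhere in this document).

```python
def parse_seqfile(text):
    """Parse SEQFILE-style text into a list of (title, sequence) pairs."""
    records = []
    title = None          # None until the first ">" line has been seen
    residues = []
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # A new header closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(residues)))
            title = stripped[1:].strip()
            # Drop the optional "Title:" label (assumed to be optional).
            if title.lower().startswith("title:"):
                title = title[len("title:"):].strip()
            residues = []
        elif title is not None:
            # Rule 2: blanks are ignored and case does not matter.
            residues.append("".join(line.split()).upper())
        # Rule 1: lines before the first ">" are comments and are skipped.
    if title is not None:
        records.append((title, "".join(residues)))
    return records


example = """>Title: Sequence ABC
KWEYASEPIKNMNSWTYR
AENRQDGGNAHKLLEPRF
DAAL
"""
print(parse_seqfile(example))
```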

Recommendations for first-time users

The following suggestions will help you get the most out of SEQSEE:
  1. Create your own directory for SEQSEE work and make it your current directory (i.e., mkdir seqsee; cd seqsee). This will help you to better organize input files and results.

  2. (Optional) Copy the parameter file seqsee.parms into this directory. You can find an example of this file in the place you chose to install seqsee in the lib/seqsee directory. This will allow you to change some of the parameters for the program if you wish.

  3. (Optional) If you already have sequence files, copy them into this directory. Ensure they are in the proper format (see "Input sequence file format"). This step is optional because you can always use SEQSEE to create your own sequence files.

  4. Although there are no windowing requirements, it is recommended that your window be at least 80 characters wide and 40 or more lines in length. This will allow easy viewing of output and menus.

  5. Having more than one window is convenient as well. It will allow you to look at intermediate results while SEQSEE is running. Some of the modules may take a fair amount of time to execute.

Frequently Asked Questions

I type "seqsee" and I don't get the main menu?

  1. Make sure you are typing 'seqsee' correctly.

  2. Have the installation programs been run successfully?

  3. Check for a program called "seqsee" either in your current directory or in some public place on your system. Ask your system administrator where "seqsee" resides.

  4. Check for the parameter file seqsee.parms either in your current directory or in some public place on your system. Check for any corruptions to "seqsee.parms".

What exactly is the function of the file seqsee.parms ?

The file "seqsee.parms" contains all the default parameters when SEQSEE is run. SEQSEE will first check your current directory to see if you have a "seqsee.parms" file and if not, it will use the default "seqsee.parms" which the installation program creates. This parameter file should be self-explanatory and easy to follow.

What are the most common items that could be changed in "seqsee.parms"?

There are many different sets of hydrophobicity parameters and similarity matrices which you may wish to experiment with. See the section regarding databases and library files. Be aware that if you change from the "wt.rbo" matrix to the "wt.dayhoff" that other parameters such as "gap penalty" and "gap size penalty" will also need changing. There are several "print" flags which can make your output more verbose or terse. Many of the options are strictly for the programmer or for the person who has knowledge about how the algorithms work.

While running SEQSEE, the screen was cleared and I seem to be placed in some kind of editor. How do I get out of this mode?

By default, SEQSEE uses the "vi" editor whenever it has results to show to the user. To leave the editor, type ":q" to exit without saving changes or ":wq" to save any changes you have made and then exit.

Is there a way to turn this fullscreen editing feature off?

Yes. Your system administrator will have chosen the editor for seqsee. You can turn off the editor completely by changing the editor flag from 1 to 0 in "seqsee.parms".

What should I do if I want to get out of something that I got myself into? For example, I am doing an exhaustive alignment search and realize I made a mistake.

The easiest way is to press the "control" and "c" keys down together. This will take you back to the main SEQSEE menu. Another method is via the UNIX "kill" and "ps" commands (see your UNIX manual). When taking this form of action there will likely be temporary files which you should remove (these files contain numbers and ".tmp").

Is there some way for me to check how the results from a search are going rather than waiting until SEQSEE ends?

Yes, most modules in SEQSEE keep intermediate results, especially those functions which can take a while. These results are stored in a numbered file with the extension ".tmp" or ".tmp.ids" (e.g., 6531209824.tmp). While SEQSEE is running in one window, go to another window and type "more *.tmp" to see intermediate results.

When I save my search results in file "x", I also have a file in my directory called "x.ids". What is the purpose of this file?

This file contains only the ID code and name of each protein from the results file. Some of the results files can get too big to scan through comfortably, and often people are only interested in the names of the proteins which turned up in their search. The same applies to the intermediate results files.

Now that I have my results, how do I print them out?

The UNIX command to print a text file is "lpr" but you should first check with your system administrator about how one prints text files.

Can I run SEQSEE in the background?

Yes, running SEQSEE in the background allows the user to start a search and then log out. Once you have decided your search is running properly, press the "control" and "z" keys together to suspend the job. Typing "bg" will then put the job in the background and now you can safely log out and go home.

Can I change the priority at which SEQSEE is running? For example, I wish my exhaustive alignment job would only run if no one else needs the computer (cpu).

Yes, but you can only change the priority after the job starts running. If you started an exhaustive alignment, first issue the UNIX command "ps -ux | grep nw_align". On Silicon Graphics the command would be "ps | grep nw_align". Then issue the UNIX command "renice 19 PID" where PID is the process ID of the job whose priority you wish to change (taken from the PID column of the "ps" output). Please ask your system administrator if you are not sure how this works.

What is a "core dumped" message?

The program has crashed either due to a programming bug or you have exceeded some boundary limit or the system has run out of "swap space" (see UNIX system manual). Sometimes a swap space problem will be indicated by an "out of memory" error message as well.

What can I do if I get a "core dumped" message?

Check the SEQSEE parameter file (if you are not using the default) and your input into the program. You may try varying your input to see if the problem only occurs with your set of data. If you are the system administrator and have some programming knowledge you may wish to re-compile the module that crashed with the debug flag set on and use a debugging program such as "dbx" to see exactly where the program crashes.

Functions

Seqsee is a protein sequence analyzer which has the following functions:
**********************************************************************
* Package...:                  SEQSEE  Version 1.5 (c)               *
* Authors...:       Robert Boyko / Leigh Willard / David Wishart     *
*                            Fred Richards / Brian Sykes             *
* Location..:                   University of Alberta                *
*               Protein Engineering Network of Centres of Excellence *
**********************************************************************

*** Preliminaries ***                      *** Alignments ***
1) Help                                   10) Fast Alignment Search
2) Enter/Edit a Sequence                  11) Exhaustive Alignment Search
3) Get Sequence from Database             12) Align 2 or more sequences

*** Structural Analysis ***                *** Scanning ***
4) Sequence Statistics                    13) Pattern Search
5) Structure Prediction                   14) Homology Search
6) SEQSITE Pattern Search                 15) Dot Plot
7) Flexibility                            16) Database Reference Search
8) Hydrophobic Moment                     17) File Viewer
9) Hydrophobicity                          0) EXIT SEQSEE

Enter the number of the desired function:

The following is a brief description of each function.

Help

You are directed to use our on-line help through a browser.

Enter / Edit a Sequence

The program known as SEQED is used for the entry and editing of new (or old) sequence files. The program first queries the user as to whether he or she wishes to:
	1) Enter a new sequence.
	2) Edit an old sequence.
If one chooses to enter a new sequence the program queries the user for the name of the sequence (sequence name), the actual sequence (using the standard single letter amino acid code), and the name of the sequence file (output filename). Sequences may be entered using either lower case letters, upper case letters or an arbitrary combination of both. In other words, sequence entry is case independent. The program also ignores blank characters so sequence entries may have as many blank spaces as desired. A "sequence ruler" is presented at the top of each sequence file entry line to permit quick identification of residue positions as they are typed. After each group of 50 characters has been entered, the user is expected to press the Return key so that a new sequence ruler can appear. Upon completion of the sequence entry, the user must enter the '$' character to indicate to the computer that the typing process has finished. Should any non-standard amino acid characters appear in the sequence file the program produces an error message and aborts the file saving procedure.

Retrieve Sequence from Database

The program SEQRET is designed to allow the user to retrieve complete sequences or groups of sequences from the database using either the id code or protein name (or portion thereof). Thus one may seek and select only a single sequence for a specific purpose, or entire protein families to create special user-specified databases. The sequences may be saved and/or edited for further analysis (as in the preparation of files for multiple sequence alignments). All sequences are saved in a SEQFILE format and, therefore, are ready to be analyzed by any of the other SEQSEE functions.

To search for specific protein names, enter the protein names (using underscores in place of blanks), one per line. To search for more than one string in the same protein entry, enter the first string followed by '&', followed by the next string.

For example,

	FIBROSIS & CYSTIC
	THROMBOMODULIN

will find all of those entries which contain FIBROSIS and CYSTIC in the same protein name (order is unimportant), and will also find all protein entries with the name THROMBOMODULIN.
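
The query rules above can be sketched in Python as follows. This is only an illustration of the described behaviour, not SEQRET's actual code: the function names are hypothetical, and case-insensitive substring matching is an assumption.

```python
def parse_name_queries(text):
    """Turn query text (one query per line) into lists of required strings."""
    queries = []
    for line in text.splitlines():
        # '&' separates strings that must all occur in the same entry;
        # underscores stand in for blanks.
        terms = [t.strip().replace("_", " ").upper() for t in line.split("&")]
        terms = [t for t in terms if t]
        if terms:
            queries.append(terms)
    return queries


def name_matches(protein_name, queries):
    """True if any query line has all of its strings in the protein name."""
    name = protein_name.upper()
    return any(all(term in name for term in terms) for terms in queries)


queries = parse_name_queries("FIBROSIS & CYSTIC\nTHROMBOMODULIN")
print(name_matches("Cystic fibrosis transmembrane regulator", queries))
print(name_matches("Thrombomodulin precursor", queries))
```

Because each "&" term is tested independently, the order of FIBROSIS and CYSTIC within the name does not matter, as stated above.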

Sequence Statistics

The STATS program carries out a simple statistical analysis of any given protein sequence. It calculates and displays the molecular weight, the amino acid composition, the predicted folding class (based on residue composition), average hydropathy (based on the Kyte-Doolittle parameters), total charge, predicted isoelectric point, specific volume, expected protein radius, expected quantity of exposed surface area and many other values that may be of structural or statistical interest. Note that STATS can only be used on sequence files in the SEQFILE format.
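
Two of the simpler quantities listed above, the amino acid composition and the average hydropathy, can be sketched in Python as shown below. The function names are hypothetical and this is not the STATS code itself; the table holds the published Kyte-Doolittle (1982) hydropathy values.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(sequence):
    """Count how often each residue occurs in the sequence."""
    return dict(Counter(sequence.upper()))


def average_hydropathy(sequence):
    """Mean Kyte-Doolittle hydropathy over all residues."""
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[res] for res in seq) / len(seq)


seq = "KWEYASEPIKNMNSWTYRAENRQDGGNAHKLLEPRFDAAL"
print(composition(seq))
print(round(average_hydropathy(seq), 3))
```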

Structure Prediction

ALEXIS is a comprehensive structural analysis program which has been developed expressly for the SEQSEE software suite. ALEXIS performs calculations on the extent and location of potential membrane spanning regions, the identification of short sequence folding motifs and the prediction of secondary structure using the cumulative results of five different and well-tested methods. Detailed descriptions of the techniques and their respective enhancements are given below:

Seqsite Pattern Search

The SEQSITE procedure allows the user to search any given sequence for active sites, binding sites, signature sequences, sequence motifs, and related functional or structural sequence patterns. The user may select between different sequence motif libraries to use, which contain patterns, functions, and references. This type of "function search" is extremely useful for determining the properties and features of newly sequenced or poorly characterized proteins.

Flexibility

The program named FLEQSEE predicts the flexibility and mobility of various regions in a protein based on sequence information alone. Flexibility is calculated on the basis of the Karplus algorithm (Karplus and Schulz, 1985). This procedure determines main-chain mobility by using smoothed averages of X-ray thermal B factors taken from approximately 30 highly resolved structures. In SEQSEE, flexibility may be used to determine the position and length of coil regions by locating all "significant" maxima (those maxima which exceed a minimum threshold) in the flexibility plot. Flexibility plots may also be used to identify surface-seeking elements or to locate strongly antigenic regions of any given sequence.
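
The idea of locating "significant" maxima can be sketched in Python as below. This covers only the peak-picking step on an already computed profile. The function name, the example numbers and the threshold are illustrative assumptions, not Karplus-Schulz values or FLEQSEE defaults.

```python
def significant_maxima(profile, threshold):
    """Return indices of local maxima whose value exceeds the threshold."""
    peaks = []
    for i in range(1, len(profile) - 1):
        is_local_max = profile[i] >= profile[i - 1] and profile[i] > profile[i + 1]
        if is_local_max and profile[i] > threshold:
            peaks.append(i)
    return peaks


# Made-up flexibility values, for illustration only.
flexibility = [0.9, 1.0, 1.2, 1.0, 0.8, 1.05, 0.9]
print(significant_maxima(flexibility, threshold=1.1))
```

In this made-up profile the bump at index 5 is a local maximum but stays below the threshold, so only the peak at index 2 would be reported.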

Hydrophobic Moment

MOMENT calculates the hydrophobic moment of a sequence using the Cornette et al. (1987) scale of hydrophobicity and the Fourier analysis technique of Eisenberg et al. (1984). Calculations are performed over a set "sequence window" of predefined length using a range of values specific to helical periodicities (90 to 120 degrees), exterior beta strand periodicities (160 to 180 degrees) and interior beta strand periodicities (0 degrees). The values for helix and beta strand may be compared with one another and to a minimum cutoff value (usually around 5) to identify amphipathic helices or beta strands. This method has some utility in identifying potential T-cell epitopes (amphipathic helices) and other biologically important structures.
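
A minimal Python sketch of the Fourier-style moment calculation for one window is given below. The function names are hypothetical, and the two-residue scale is a toy scale chosen only to make the example easy to follow; it is not the Cornette scale that MOMENT uses.

```python
import math


def hydrophobic_moment(window, scale, angle_deg):
    """Hydrophobic moment of a window for a given periodicity angle."""
    delta = math.radians(angle_deg)
    sin_sum = sum(scale[res] * math.sin(n * delta) for n, res in enumerate(window))
    cos_sum = sum(scale[res] * math.cos(n * delta) for n, res in enumerate(window))
    return math.hypot(sin_sum, cos_sum)


def max_moment(window, scale, angles):
    """Largest moment over a range of angles, e.g. range(90, 121) for helices."""
    return max(hydrophobic_moment(window, scale, a) for a in angles)


# Toy scale (assumption): hydrophobic = +1, hydrophilic = -1.
toy_scale = {"L": 1.0, "K": -1.0}
strand = "LKLKLK"  # alternating pattern, as on an exposed beta strand
print(max_moment(strand, toy_scale, range(160, 181)))
print(max_moment(strand, toy_scale, range(90, 121)))
```

For the alternating pattern the moment is largest near 180 degrees, which is why the beta strand range scores higher than the helical range in this example.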

Hydrophobicity

HYDRO calculates the smoothed hydrophobicity (over a window of pre-defined length) of any given sequence using a choice of several hydrophobicity scales. The operator may choose (using the parameter file) from the Eisenberg consensus scale (Eisenberg et al., 1984), the Kyte-Doolittle scale (Kyte and Doolittle, 1982), the Cornette scale (Cornette et al., 1987) or the Parker-HPLC scale (Parker et al., 1986). Hydrophobicity plots may be used to approximate the positions of coil regions, exposed loops or B-cell antigenic determinants in many proteins (hydrophilic regions). They may also be used to locate membrane spanning regions in some types of proteins (hydrophobic regions of 20 or more residues).
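
The window smoothing described above can be sketched in Python as follows. The function names, the window length and the cutoff are assumptions for illustration, and the small table holds just the four Kyte-Doolittle values needed for the demonstration rather than a complete scale.

```python
def smoothed_profile(sequence, scale, window=7):
    """Average the per-residue scale values over a sliding window."""
    values = [scale[res] for res in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def positions_above(profile, cutoff):
    """Window start positions whose smoothed value exceeds the cutoff."""
    return [i for i, value in enumerate(profile) if value > cutoff]


# A few Kyte-Doolittle values, just enough for this example.
partial_scale = {"A": 1.8, "L": 3.8, "K": -3.9, "D": -3.5}
profile = smoothed_profile("KDKDLLLLLKDKD", partial_scale, window=3)
print([round(v, 2) for v in profile])
print(positions_above(profile, cutoff=3.0))
```

The hydrophobic leucine stretch shows up as a plateau of high values, while the flanking charged residues pull the average down.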

Fast Alignment Search

FAST_ALIGN is a k-tuple based fast alignment algorithm modeled loosely on the speed-up protocols incorporated in Lipman and Pearson's FASTA (1988) and Altschul et al.'s BLAST (1990). First, a table of homologous 3-tuples is generated for the query sequence using a modified scoring matrix. Second, a look-up table of these 3-tuples and their respective locations is prepared from the query sequence. Third, a look-up table is prepared of 3-tuples for each sequence in the database. The two look-up tables (one from the query and the second from the database) are then compared and matches are identified. The result is a one-dimensional "spectrogram" of homologies characterized by low level noise (poor matches) and the occasional sharp peak (a string of matches). Database sequences with sufficiently high peaks are then pulled out and rigorously aligned using the Needleman-Wunsch program to determine the significance of the alignment. The program is capable of searching the complete database and then ordering and aligning 50 homologous matches of a 100 residue query sequence in less than 90 seconds. This is an extremely powerful technique for quick inquiries regarding protein relatedness and identification. FAST_ALIGN may be used to align sequences against the PIR, SWISS-PROT, SEQBANK or a user-specified database with a SEQFILE format. Several choices of scoring matrices are possible and these include: the Unity matrix, the Dayhoff PAM 250 matrix (Dayhoff et al., 1983), the McLachlan matrix (McLachlan, 1971) and the Boyko matrix (unpublished). The Boyko matrix is the default scoring matrix.
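
The look-up table comparison can be sketched in Python as below. This is a simplified illustration, not the FAST_ALIGN code: it counts only exact 3-tuple matches (the "homologous" 3-tuples and modified scoring matrix are left out), and the function names are hypothetical. The histogram of hits per diagonal plays the role of the one-dimensional "spectrogram".

```python
from collections import Counter, defaultdict


def ktuple_table(sequence, k=3):
    """Map each k-tuple in the sequence to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        table[sequence[i:i + k]].append(i)
    return table


def diagonal_histogram(query, target, k=3):
    """Count exact k-tuple matches on each diagonal (target pos - query pos)."""
    table = ktuple_table(query, k)
    histogram = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            histogram[j - i] += 1
    return histogram


hist = diagonal_histogram("KWEYASEPIK", "GGGKWEYASEPIKGG")
print(hist.most_common(1))
```

A sharp peak on one diagonal means a run of matching 3-tuples at a constant offset, which is the signal that would mark a database sequence for rigorous alignment.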

Exhaustive Alignment Search

NW_ALIGN is a program which carries out an exhaustive pair-wise alignment of any given query sequence to all other sequences in a given database. Only those sequences with scores above a certain user-defined threshold are retained. The algorithm used for this procedure is based on the Needleman-Wunsch (1970) approach for pair-wise alignment. This dynamic programming method is guaranteed to find the optimal alignment between any two sequences for any given scoring matrix and gap penalty. Alignments can be done against the PIR or SWISS-PROT database, SEQBANK or a user-defined database in the SEQFILE format. If alignments are done against SEQBANK, knowledge of the secondary structure is included to determine the location and length of gaps (Lesk et al., 1986). A choice of scoring matrices and gap penalties is available. The scoring matrices include: the Unity matrix, the Dayhoff PAM 250 matrix (Dayhoff et al., 1983), the McLachlan matrix (McLachlan, 1971) and the Boyko matrix (unpublished). The Boyko matrix is the default scoring matrix. Scores are rigorously calculated on the basis of comparisons to randomized sequence alignments as recommended by Dayhoff et al. (1983). The program is extremely time-consuming, with a query sequence of 100 residues typically taking 4 hours to complete on a SUN Sparcstation. However, the improvement in overall alignment accuracy and the possibility of identifying very remote and previously unidentified relationships may well be worth the wait.
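
For readers unfamiliar with the method, here is a minimal Python sketch of Needleman-Wunsch global alignment. It assumes a unity-style scoring (match 1, mismatch 0) and a single linear gap penalty; these values and the function name are illustrative choices, not NW_ALIGN's defaults, and the secondary structure weighting and significance scoring are not shown.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps along the first column and first row.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the dynamic programming matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("KWEY", "KEY"))
```

The matrix has one cell per residue pair, so the work grows with the product of the two sequence lengths. Repeating that for every database entry is what makes the exhaustive search slow.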

Align 2 or more sequences

The program MULT_ALIGN uses a modification of the pair-wise Needleman-Wunsch protocol to align two or more protein sequences. The method is closely related to the progressive alignment procedure first described by Barton and Sternberg (1987), which permits rapid and accurate multiple alignments for up to several hundred proteins. A consensus sequence is also generated for each pair-wise or multiple alignment. A choice of scoring matrices and gap penalties is available. Sequences which are to be aligned must be contained in SEQFILE formats, either in the form of databases (for multiple alignments) or singly (for pair-wise alignments). The procedure for aligning more than two sequences (like the fast alignment search described above) is fundamentally heuristic in nature and so it cannot be proven that the resulting alignments are mathematically optimal.

Pattern Search

This procedure searches either SEQBANK, the PIR/SWISS-PROT database or a sequence of your own choosing to find exact pattern matches according to the following rules (note the sequence patterns are case INDEPENDENT):
        a) A            match exact residue specified where A = any amino
                        acid
        b) !A           match any residue EXCEPT A 
        c) *            wild card character--matches any amino acid
        d) [ ]          "OR" braces--allows several residue choices.  
                        i.e. [ILK] = I "or" L "or" K
        e) &            "AND" character--allows 2 patterns to be placed in
                        any 1 query
        f) { }          "Range" braces--allows a range of wild card
                        characters. i.e. {2,8} = 2 to 8 "*"
        g) $            N or C termination character - used to mark either
                        the beginning (N terminus) or end (C terminus) of
                        a sequence

Pattern Search (PSEARCH) is constructed to allow the user to enter several patterns at once, both on a single line (using the "&" feature) or on separate lines. Patterns appearing on separate lines are treated as "independent" patterns (meaning they don't have to appear in the same protein sequence) while patterns with "&" characters are viewed as "dependent" patterns (meaning they do have to appear in the same protein sequence). Some examples of sequence pattern searches are given below:
        AA***K          Find all occurrences of 2 alanines together
                        followed by any 3 residues followed by a single 
                        lysine
        
        AA!P!P!PK       Find all occurrences of 2 alanines together
                        followed by any 3 residues (as long as they
                        are NOT prolines) followed by a single lysine.
                        (ie. look for AA***K except AAP**K, AA*P*K,
                        AA**PK, AAPP*K, AAP*PK, AA*PPK and AAPPPK)

        [AG][AG]*[KR]   Find all occurrences of 2 alanines or 2
                        glycines or any combination of the two
                        followed by any residue followed by a lysine
                        or an arginine. (ie. look for AA*K, AG*K, GA*K,
                        GG*K, AA*R, AG*R, GA*R and GG*R)

        AA*K&I**R       Find all occurrences of 2 alanines together
                        followed by any amino acid followed by a
                        single lysine, AND if that pattern is found,
                        then find all occurrences of a single isoleucine
                        followed by any two amino acids followed by 
                        a single arginine IN THE SAME PROTEIN   
                        SEQUENCE. (ie. look for AA*K then I**R within
                        a sequence)

        AA{2,5}[KR]     Find all occurrences of 2 alanines together
                        followed by at least two but no more than 5
                        amino acids (any type) followed by either a
                        lysine or an arginine. (ie. look for AA**[KR],
                        AA***[KR], AA****[KR] and AA*****[KR])

        ${3,5}M         Find all occurrences of methionine that are
                        between 3 and 5 residues from the N
                        terminus.  (ie. look for $***M, $****M and
                        $*****M)
Of course any combination of the above queries could be used in a PSEARCH pattern search. Other examples of PSEARCH queries may be found by browsing through the SEQSITE database.
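
One way to understand the pattern rules is to translate them into regular expressions, as in the Python sketch below. This is an illustration only and says nothing about how PSEARCH is implemented. The function names are hypothetical, and "&" is treated as "both patterns occur somewhere in the same sequence", which is an assumption about the intended meaning.

```python
import re


def pattern_to_regex(pattern):
    """Translate one PSEARCH-style pattern into a compiled regular expression."""
    pattern = pattern.upper()  # patterns are case independent
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "!":                       # !A : any residue except A
            parts.append("[^" + pattern[i + 1] + "]")
            i += 2
        elif ch == "*":                     # * : any single residue
            parts.append(".")
            i += 1
        elif ch == "[":                     # [ILK] : I or L or K
            end = pattern.index("]", i)
            parts.append(pattern[i:end + 1])
            i = end + 1
        elif ch == "{":                     # {2,8} : 2 to 8 wild cards
            end = pattern.index("}", i)
            parts.append("." + pattern[i:end + 1])
            i = end + 1
        elif ch == "$":                     # $ : N terminus at start, C terminus otherwise
            parts.append("^" if i == 0 else "$")
            i += 1
        else:                               # plain residue letter
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def sequence_matches(query, sequence):
    """True if every '&'-joined pattern in the query occurs in the sequence."""
    seq = sequence.upper()
    return all(pattern_to_regex(p.strip()).search(seq) is not None
               for p in query.split("&"))


print(sequence_matches("AA{2,5}[KR]", "AAGGGGR"))
print(sequence_matches("AA!P!P!PK", "GGAAPQRKGG"))
```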

Homology Search

The HSEARCH program searches either the PIR/SWISS-PROT database, SEQBANK or a compatible user-defined database to find the "nearest" or most homologous matches to any given sequence. Homologies are determined according to any one of four user-defined scoring matrices (described earlier). Gap penalties are not yet incorporated into the homology search routine. The homology search is a useful complement to other pattern search routines, especially when attempting to locate distantly related or difficult-to-identify sequence motifs.
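
A gap-free search of this kind can be pictured as sliding the query along each database sequence and scoring every position, as in the Python sketch below. This is a guess at the general idea rather than a description of HSEARCH itself: the function name is hypothetical and the default scoring function is a simple unity-style score.

```python
def best_ungapped_match(query, target, score=lambda x, y: 1 if x == y else 0):
    """Return (best score, offset) of the query slid along the target without gaps."""
    best_score, best_offset = float("-inf"), None
    for offset in range(len(target) - len(query) + 1):
        total = sum(score(q, target[offset + k]) for k, q in enumerate(query))
        if total > best_score:
            best_score, best_offset = total, offset
    return best_score, best_offset


print(best_ungapped_match("KWEY", "GGKWAYGG"))
```

Any of the scoring matrices could be supplied through the score argument. Ranking database sequences by their best score would then give the "nearest" matches.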

DotPlot

DOTPLOT is an extremely flexible program developed to produce character representations of standard dot plots (Lipman and Pearson, 1985). The low resolution of most character-based screens prevents a useful graphic representation of dot plot results, so a character representation with a user-defined "threshold" is used instead. DOTPLOT may be used to compare a sequence with itself (to identify internal repeats), with another sequence (for pair-wise alignments), with a SEQFILE compatible database or with the PIR/SWISS-PROT database (for medium speed alignments). By using DOTPLOT in conjunction with a database it is possible to look for homologies between any shared regions in a group of sequences. Such an option has proven to be quite useful in identifying previously unrecognized motifs or unexpected similarities in a number of proteins.
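
The idea of a character dot plot with a threshold can be sketched in a few lines of Python. This is an illustration only and not the DOTPLOT code. The window size, the use of simple identities for scoring and the "*" and "." characters are all assumptions.

```python
def dotplot(seq_a, seq_b, window=3, threshold=3):
    """Return one text row per window of seq_a.

    A '*' is written where the windows starting at that row and column share
    at least `threshold` identical residues; a '.' is written otherwise.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append("*" if matches >= threshold else ".")
        rows.append("".join(row))
    return rows

# Compare a short toy sequence with itself to expose an internal repeat.
for row in dotplot("GPCGPC", "GPCGPC"):
    print(row)
```

In a self-comparison the main diagonal is always filled, and any "*" away from the diagonal marks a repeated segment. Lowering the threshold below the window size admits imperfect repeats at the cost of more background.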

Protein database reference search

The program REFSCAN is designed to allow the user to locate and retrieve specific sequence references from the database using the accession number, the name (or a portion thereof) or a bibliographic/functional reference. This feature allows the user to quickly access important information about many newly sequenced proteins pertaining to their function, structure or relationship with other proteins in the database.

To search for specific protein names, enter the protein names (using underscores in place of blanks), one per line. To search for more than one string in the same protein entry, enter the first string followed by '&', followed by the next string.

For example,

	FIBROSIS & CYSTIC
	THROMBOMODULIN
will find all of those entries which contain FIBROSIS and CYSTIC in the same protein name (order is unimportant), and will also find all protein entries with the name THROMBOMODULIN.
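
The matching rule just described (strings joined by '&' must all occur in one entry, while separate lines are alternatives) can be written out as a short Python sketch. It is not the REFSCAN code. The function names and the sample entry names are invented, and two details are assumptions: that matching ignores case, and that an underscore in the query stands for a blank in the stored name.

```python
def parse_refscan_query(text):
    """Turn query text into a list of alternatives, each a list of required strings."""
    queries = []
    for line in text.splitlines():
        terms = [t.strip().replace("_", " ").upper()
                 for t in line.split("&") if t.strip()]
        if terms:
            queries.append(terms)
    return queries

def refscan(names, query_text):
    """Return the names that satisfy at least one line of the query."""
    queries = parse_refscan_query(query_text)
    return [name for name in names
            if any(all(term in name.upper() for term in terms)
                   for terms in queries)]

# Invented entry names, for illustration only.
entry_names = [
    "CYSTIC FIBROSIS TRANSMEMBRANE CONDUCTANCE REGULATOR",
    "THROMBOMODULIN PRECURSOR",
    "THIOREDOXIN",
]
print(refscan(entry_names, "FIBROSIS & CYSTIC\nTHROMBOMODULIN"))
```

With these sample names the first two entries are returned: the first satisfies the '&' line in either word order, and the second satisfies the single-name line.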

Browse

BROWSE permits the user to edit or view a variety of database files. Through this program it is possible to locate or identify sequence names or id numbers from the database, to locate or view sequences and references from the SEQBANK database, to view or edit sequences written as SEQFILEs and to view, edit or change the SEQSEE parameter file. In the case of viewing PIR database information, all sequence name and id number data is contained in a single 1 MB file called PIRSEE.db. Standard Unix commands may be used for scrolling through or locating all character strings in any of the files.

Exit Seqsee

Closes all current files and returns the user to the general operating system. The program may be restarted by typing "seqsee". If the program crashes or hangs up for any reason, simply type "^c" which will stop all processes and return the user to the main menu.

Seqsee Tutorial

Note: The following tutorial was prepared with seqsee version 1.2. Version 1.5 behaves almost identically, so this section of the manual has not been updated.
   Let us suppose that you and a collaborator have succeeded in isolating a 
small protein from Bacillus subtilis which appears to act as an oxidizing
co-factor for certain cellular processes.  After many weeks of amino acid
analysis and peptide sequencing, your collaborator provides you with the 
N terminal sequence of the first 60 amino acids of this new protein.  You
are requested to find out anything you can about this partial sequence, 
and to report to your colleague as soon as possible.  Sounds like a job 
for SEQSEE!

   Let's demonstrate how you might go about analyzing this sequence using 
just a few of the options available in SEQSEE.  Note that in this example
we will first show how a new sequence is entered.  Then we will demonstrate
how the sequence can be analyzed statistically.  We will also show how to
check this sequence for sequence motifs and how to compare (and align) 
the query sequence against the PIR database.  Finally we will demonstrate
how to search SEQBANK to locate those proteins which might be 
evolutionarily related to the query sequence.  So here it goes...



1) Sign on to a computer
2) Type "seqsee" (the following menu should appear)

**********************************************************************
* Package...:                  SEQSEE  Version 1.2 (c)               *
* Authors...:       Robert Boyko / Leigh Willard / David Wishart     *
*                            Fred Richards / Brian Sykes             *
* Location..:                   University of Alberta                *
*               Protein Engineering Network of Centres of Excellence *
**********************************************************************
 
     *** Preliminaries ***                      *** Alignments *** 
  1) Help                                   10) Fast Alignment Search
  2) Enter/Edit a Sequence                  11) Exhaustive Alignment Search
  3) Get Sequence from Database             12) Align 2 or more sequences  

     *** Structural Analysis ***                *** Scanning ***
  4) Sequence Statistics                    13) Pattern Search 
  5) Structure Prediction                   14) Homology Search
  6) SEQSITE Pattern Search                 15) Dot Plot
  7) Flexibility                            16) Database Reference Search
  8) Hydrophobic Moment                     17) File Viewer
  9) Hydrophobicity                          0) EXIT SEQSEE

  Enter the number of the desired function:

>> 



3) Type "2" (and press return) so that you can input your new sequence.
This puts you into the program SEQED.  When in this program you are, 
first, required to choose an option for entering or editing your sequence.
Since we wish to enter a new sequence we will select option "1". Second,
you are asked to provide a name for your sequence.  In this case we'll 
call it "bacillus_redoxase".  The sequence name is required for record
keeping purposes only.



>> 2 


 How do you wish to input your sequence?

   1) Enter new sequence.
   2) Edit old sequence.
   0) Exit

 Enter a number (then press return).

>> 1


 Seqed (Version 1.2)

 Enter name for sequence. Use underscores instead
 of blanks to separate words. (eg. thioredoxin_human)

>> bacillus_redoxase


 Enter each amino acid (one letter code).
 You may enter up to 50 amino acids on one line.
 Press return to get a new prompt line.
 When you are done enter $ and press return.

         1         2         3         4         5 
12345678901234567890123456789012345678901234567890
         |         |         |         |         |
msdklihitddsfdtdvikadgailvdfwaewcgpckmiapildeladey


         1         2         3         4         5 
12345678901234567890123456789012345678901234567890
         |         |         |         |         |
qgkltvakln$




4) After typing in the above 60 residues, type "$" and then press return.
The newly prepared sequence will then be "echoed" to the screen in
precisely the same format in which it will be stored.  This is done so
that you may inspect your sequence for errors and make any required
corrections.
Note that you have been placed in the "vi" editor and so it is essential
to have some rudimentary knowledge of how this editor actually works.  
Changes to the sequence can be either upper or lower case -- it is not
necessary to keep all entries or corrections in upper case.



>Title: bacillus_redoxase
MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEY
QGKLTVAKLN
~
~
~
~
~
~


5) If no corrections are necessary, type ":q" to exit the editor.  
If you do make corrections, type ":wq" and this will save the corrections
you have made.  After exiting the "vi" editor, you are asked if you should
save the file.  Upon replying (usually with a 'y') you are then asked to
provide a name for the sequence file (we'll call it redoxase.seq).  After
responding you are immediately returned to the main SEQSEE menu.  Note 
that you have now created a SEQFILE called "redoxase.seq" containing the
amino acid sequence of "bacillus redoxase".
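
For those who would rather prepare a SEQFILE outside of SEQED, the layout echoed in step 4 can be imitated with a short Python sketch. The format is inferred only from that screen (a ">Title:" line followed by upper-case residues, 50 to a line), so treat it as an assumption rather than a specification. The function names are invented.

```python
def format_seqfile(name, residues, width=50):
    """Build SEQFILE-style text from typed residues (any case, optional final $)."""
    sequence = "".join(residues.split()).rstrip("$").upper()
    lines = [">Title: " + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"

def read_seqfile(text):
    """Return (name, sequence) from text laid out as above."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">Title:"):
        raise ValueError("missing '>Title:' header line")
    name = lines[0][len(">Title:"):].strip()
    return name, "".join(lines[1:])

# The two lines typed in step 3 of the tutorial.
typed = ("msdklihitddsfdtdvikadgailvdfwaewcgpckmiapildeladey\n"
         "qgkltvakln$")
text = format_seqfile("bacillus_redoxase", typed)
print(text, end="")
name, sequence = read_seqfile(text)
print(name, len(sequence))
```

Writing the returned text to a file such as redoxase.seq should give a file like the one created by SEQED, provided the assumed layout is right.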


:q


Save this file? (Y/N)

>> y


Please enter a name for this file.

>> redoxase.seq


**********************************************************************
* Package...:                  SEQSEE  Version 1.2 (c)               *
* Authors...:       Robert Boyko / Leigh Willard / David Wishart     *
*                            Fred Richards / Brian Sykes             *
* Location..:                   University of Alberta                *
*               Protein Engineering Network of Centres of Excellence *
**********************************************************************
 
     *** Preliminaries ***                      *** Alignments *** 
  1) Help                                   10) Fast Alignment Search
  2) Enter/Edit a Sequence                  11) Exhaustive Alignment Search
  3) Get Sequence from Database             12) Align 2 or more sequences  

     *** Structural Analysis ***                *** Scanning ***
  4) Sequence Statistics                    13) Pattern Search 
  5) Structure Prediction                   14) Homology Search
  6) SEQSITE Pattern Search                 15) Dot Plot
  7) Flexibility                            16) Database Reference Search
  8) Hydrophobic Moment                     17) File Viewer
  9) Hydrophobicity                          0) EXIT SEQSEE

  Enter the number of the desired function:

>> 



6) Type in "4" to choose the Sequence Statistics option.  This function
performs a quick and useful statistical analysis of the sequence to help
you identify any peculiar trends in the sequence that might not be obvious
on first inspection.


>> 4


 Sequence Statistics (Version 1.2)


 Your amino acid sequence is now required:

    1) Read sequence from an input file.
    2) Sequence to be entered via keyboard.
    0) I do not have my sequence ready.

 Enter a number (then press return).

>>


7) Type in "1" and press return since you already have a unix file (a 
SEQFILE called "redoxase.seq") containing your sequence.  You will then
be queried for the name of this sequence file.  Enter it and press 
return as usual.


>> 1


 Enter input sequence filename.

>> redoxase.seq


8) The statistical analysis will automatically be written to the screen
in the format shown below.  You may scroll through the file and make any
changes you wish.


*************************************************************

        Program......: stats (version 1.2)
        Description..: Statistical Analysis of a Sequence
        Date.........: Fri May 1 16:24:02 1992

        Sequence Name: bacillus_redoxase
        1   MSDKLIHITD DSFDTDVIKA DGAILVDFWA EWCGPCKMIA PILDELADEY
        51  QGKLTVAKLN

************************************************************

Molecular Weight......:   6684.86
Amino acids...........:        60
Mean Amino Acid Weight:    111.41

      *** Amino Acid Composition ***

Amino   Freq      Freq      E(Freq)    Weight    E(weight)
Acid   (total)  (percent)  (percent)  (percent)  (percent)
  A       6      10.00      <8.84>       6.40     <5.73>
  C       2       3.33      <2.09>       3.09     <1.97>
  D       9      15.00      <5.89>      15.54     <6.17>
  E       3       5.00      <5.90>       5.81     <6.94>
  F       2       3.33      <3.70>       4.42     <4.96>
  G       3       5.00      <8.29>       2.57     <4.31>
  H       1       1.67      <2.12>       2.06     <2.64>
  I       6      10.00      <5.40>      10.18     <5.57>
  K       5       8.33      <6.22>       9.61     <7.27>
  L       6      10.00      <7.93>      10.18     <8.18>
  M       2       3.33      <1.97>       3.94     <2.35>
  N       1       1.67      <4.59>       1.71     <4.78>
  P       2       3.33      <4.51>       2.91     <3.99>
  Q       1       1.67      <3.75>       1.92     <4.37>
  R       0       0.00      <4.21>       0.00     <6.00>
  S       2       3.33      <6.59>       2.61     <5.23>
  T       3       5.00      <5.96>       4.55     <5.49>
  V       3       5.00      <7.12>       4.46     <6.43>
  W       2       3.33      <1.37>       5.59     <2.32>
  Y       1       1.67      <3.56>       2.45     <5.30>

Note: E(x) are expected values based on average
      amino acid content of soluble proteins.

******************************************************

Hydrophobicity Parameters: /usr/people/me/seqsee/lib/kyte.parms

Average Hydrophobicity (ah)...................:   0.78
Notes: ah = -2.67  --> Average Protein
       ah >  0.10  --> Hydrophobic Protein
       ah < -6.00  --> Hydrophilic Protein

Ratio of Hydrophobicity to Hydrophilicity (rh):   0.95
Notes: rh =  1.22  --> Average Protein
       rh >  1.90  --> Non-folding Protein
       rh <  0.85  --> Insoluble Protein

Percentage of Hydrophobic amino acids.........:  56.67
Notes: Average percentage is 52.44
       Hydrophobic Amino Acids are ACFGHILMVWY

Percentage of Hydrophilic amino acids.........:  43.33
Notes: Average percentage is 47.56
       Hydrophilic Amino Acids are DEKNPQRST

Ratio of %Hydrophilic to %Hydrophobic.........:   0.76
Notes: rhp =  0.91  --> Average Protein
       rhp >  1.43  --> Non-folding Protein
       rhp <  0.77  --> Insoluble Protein

***********************************************


Number of  basic amino acids:     5
Number of acidic amino acids:    12
Estimated pI for protein....:  4.60

~
~
~
~
~


9) We won't reproduce the complete file here. But it should be obvious that
nothing out of the ordinary is found for this fragment of bacillus 
redoxase.  After checking through the stats file, you may exit it simply
by typing ":q" as before.  The computer then asks you whether it should
save the file or not.  As usual we will respond with "y" and give our 
results file the name "redoxase.stat".  After entering the name, the main
SEQSEE menu appears on the screen once again.
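
Several of the simpler numbers in the report above are easy to check by hand or with a few lines of Python. The sketch below is not the STATS program. It recomputes only the composition, the hydrophobic and hydrophilic percentages (using the residue classes listed in the report) and the counts of charged residues, taking basic to mean K plus R and acidic to mean D plus E, which is the convention the report appears to use. The function names are invented.

```python
from collections import Counter

HYDROPHOBIC = set("ACFGHILMVWY")   # classes as listed in the report
HYDROPHILIC = set("DEKNPQRST")

def composition(sequence):
    """Return {residue: (count, percent)} for the residues that occur."""
    counts = Counter(sequence)
    n = len(sequence)
    return {aa: (counts[aa], round(100.0 * counts[aa] / n, 2))
            for aa in sorted(counts)}

def class_percentages(sequence):
    """Return (percent hydrophobic, percent hydrophilic)."""
    n = len(sequence)
    hydrophobic = sum(aa in HYDROPHOBIC for aa in sequence)
    hydrophilic = sum(aa in HYDROPHILIC for aa in sequence)
    return round(100.0 * hydrophobic / n, 2), round(100.0 * hydrophilic / n, 2)

def charged_counts(sequence):
    """Return (basic, acidic) counts, assuming basic = K + R and acidic = D + E."""
    counts = Counter(sequence)
    return counts["K"] + counts["R"], counts["D"] + counts["E"]

redoxase = ("MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEY"
            "QGKLTVAKLN")
print(composition(redoxase)["D"])       # (9, 15.0)
print(class_percentages(redoxase))      # (56.67, 43.33)
print(charged_counts(redoxase))         # (5, 12)
```

These values agree with the corresponding lines of the report. The molecular weight, hydrophobicity averages and pI depend on parameter tables that are not reproduced here, so they are left out of the sketch.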


:q


 Save this file? (Y/N)

>> y


 Please enter a name for this file. 

>> redoxase.stat


**********************************************************************
* Package...:                  SEQSEE  Version 1.2 (c)               *
* Authors...:       Robert Boyko / Leigh Willard / David Wishart     *
*                            Fred Richards / Brian Sykes             *
* Location..:                   University of Alberta                *
*               Protein Engineering Network of Centres of Excellence *
**********************************************************************
 
     *** Preliminaries ***                      *** Alignments *** 
  1) Help                                   10) Fast Alignment Search
  2) Enter/Edit a Sequence                  11) Exhaustive Alignment Search
  3) Get Sequence from Database             12) Align 2 or more sequences  

     *** Structural Analysis ***                *** Scanning ***
  4) Sequence Statistics                    13) Pattern Search 
  5) Structure Prediction                   14) Homology Search
  6) SEQSITE Pattern Search                 15) Dot Plot
  7) Flexibility                            16) Database Reference Search
  8) Hydrophobic Moment                     17) File Viewer
  9) Hydrophobicity                          0) EXIT SEQSEE

  Enter the number of the desired function:

>> 



10) Evidently our statistical analysis of bacillus redoxase didn't 
provide us with too much useful data.  Let's see if we can uncover more
information about the sequence by checking if it contains any unusual 
sequence motifs or sequence patterns.  This might give us an idea of 
what it does or what it looks like.  To do so, let's choose option "6"
and initiate SEQSITE. 


>> 6

 SEQSITE Pattern Search (Version 1.2)

 
 Please select a sequence motif database

   1) SEQSITE.db	(general sequence motifs)
   2) PHOSITE.db	(general phosphorylation sites)
   3) EPISITE.db	(antigenic sites)

 Enter a number (then press return).


>> 



11) Choose the database from which you will do the pattern search,
then press return.

>> 1

 Your amino acid sequence is now required:

  1) Read sequence from an input file.
  2) Sequence to be entered via keyboard.
  3) I do not have my sequence ready.

 Enter a number (then press return).

>>



12) Type in "1" and press return, as before, since you already have a unix
file containing your sequence.  As usual you are required to give the
name of your sequence file (or SEQFILE) containing the sequence so we
will type in "redoxase.seq".

>> 1

Enter input sequence filename.

>> redoxase.seq


13) After pressing return the SEQSITE analysis will automatically be
written to the screen in the format shown below.  You may scroll through
the file and make any changes you wish.




************************************************************

        Program......: seqsite (version 1.2)
        Description..: Search for Interesting Motifs
        Date.........: Fri May 1 16:24:02 1992

        Sequence Name: bacillus_redoxase
        1   MSDKLIHITD DSFDTDVIKA DGAILVDFWA EWCGPCKMIA PILDELADEY
        51  QGKLTVAKLN

        Database.....: /usr/people/me/seqsee/databases/SEQSITE.db

***********************************************************

**********(1)*********

Motif Matched...: *[TA]*WC[AG][PH]C*
Sequence Matched: WAEWCGPCK
Amino Acids.....: 29-37

 GLEASON, F.K. et al., FEMS MICRO REV. 54:271-297
 ACTIVE SITE FOR PROKARYOTIC/EUKARYOTIC THIOREDOXIN-LIKE MOLECULES


 Number of motifs found..:      1
 Number of motifs scanned:   1110
~
~
~
~

14) Well, well... It looks like we've found something.  It appears that
bacillus redoxase contains the active site for a certain class of 
molecules called thioredoxins.  Before leaping to any conclusions, 
though, we should check to see whether bacillus redoxase shares other 
similarities to thioredoxins or whether this shared sequence pattern is 
simply an accident of evolution.  To answer this question we need to get
back to SEQSEE's main menu.  To do this we type ":q" as before and save 
the file by replying with a "y" and giving this file a name like 
"redoxase.site".
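
The hit reported above can be checked independently with a regular expression. The Python sketch below is not the SEQSITE program. It assumes that "*" in a motif means any single residue and that bracketed sets mean any one of the listed residues, and the function names are invented. The only motif used is the one shown in the report.

```python
import re

def motif_to_regex(motif):
    """'*' means any one residue; bracketed sets are already valid regex."""
    return motif.replace("*", ".")

def scan_motifs(sequence, motifs):
    """Return (motif, matched text, first residue, last residue, note) per hit."""
    hits = []
    for motif, note in motifs:
        for m in re.finditer(motif_to_regex(motif), sequence):
            # m.start() counts from 0, so add 1; m.end() is already the
            # 1-based number of the last matched residue.
            hits.append((motif, m.group(), m.start() + 1, m.end(), note))
    return hits

redoxase = ("MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEY"
            "QGKLTVAKLN")
motifs = [
    ("*[TA]*WC[AG][PH]C*",
     "ACTIVE SITE FOR PROKARYOTIC/EUKARYOTIC THIOREDOXIN-LIKE MOLECULES"),
]
for motif, matched, first, last, note in scan_motifs(redoxase, motifs):
    print(motif, matched, "%d-%d" % (first, last))
    print(note)
```

The sketch finds WAEWCGPCK at residues 29-37, the same segment that SEQSITE reports.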


:q


Save this file? (Y/N)

>> y


 Please enter a name for this file.

>> redoxase.site

**********************************************************************
* Package...:                  SEQSEE  Version 1.2 (c)               *
* Authors...:       Robert Boyko / Leigh Willard / David Wishart     *
*                            Fred Richards / Brian Sykes             *
* Location..:                   University of Alberta                *
*               Protein Engineering Network of Centres of Excellence *
**********************************************************************
 
     *** Preliminaries ***                      *** Alignments *** 
  1) Help                                   10) Fast Alignment Search
  2) Enter/Edit a Sequence                  11) Exhaustive Alignment Search
  3) Get Sequence from Database             12) Align 2 or more sequences  

     *** Structural Analysis ***                *** Scanning ***
  4) Sequence Statistics                    13) Pattern Search 
  5) Structure Prediction                   14) Homology Search
  6) SEQSITE Pattern Search                 15) Dot Plot
  7) Flexibility                            16) Database Reference Search
  8) Hydrophobic Moment                     17) File Viewer
  9) Hydrophobicity                          0) EXIT SEQSEE

  Enter the number of the desired function:

>> 


15) The best method to determine the evolutionary relatedness of any one
protein with another is to perform a database alignment. Such a database
alignment is offered by both the Fast Alignment Search and the Exhaustive
Alignment Search options in SEQSEE (numbers 10 and 11 on the menu).  
Since we want a really quick (but not absolutely accurate) answer to our
question, let's choose the Fast Alignment Search by typing "10".


>> 10

Fast Alignment (Version 1.2)

 Your amino acid sequence is now required:

   1) Read sequence from an input file
   2) Sequence to be entered via keyboard
   3) I do not have my sequence ready

 Enter a number (then press return).

>>



16) Type in "1" and press return, as before, since you already have a unix 
file containing your sequence.  As usual you are required to give the 
name of your sequence file (or SEQFILE) containing the sequence 
(redoxase.seq).  For this type of program you are also required to 
indicate how many of the best alignments you want to keep -- we'll choose
the top 50.


>> 1

 Enter sequence input filename:

>> redoxase.seq

 This program keeps track of the top 'x' alignments.
 Enter a value for 'x' where 0 < x < 500.

>> 50



17) For a search of this magnitude, the computer will take about 60 to 
90 seconds.  While doing the search, the program will indicate what it's
doing and how many sequences it has scanned.  At the end of the search 
the top 50 alignments are printed out in descending order, with the best
alignment at the top of the file.  Following is a sample of the output 
you should expect to see.



Initializing lookup table...

Reading database file: /usr/local/seqsee/databases/PIR/*

Proteins: 1000  BestScore: 6624 GroupScore: 6624
Proteins: 2000  BestScore: 6624 GroupScore:  495
Proteins: 3000  BestScore: 6624 GroupScore: 1147
~
~
~
~
~


************************************************************

        Program......: fast_align (version 1.2)
        Description..: Fast Alignment on database
        Date.........: Fri May 1 16:24:02 1992

        Sequence Name: bacillus_redoxase
        Amino Acids..: 60

        Database.....: PIR (Intelligenetics Version)

        Scoring Mat..: /usr/local/seqsee/lib/wt.align
        Gap Penalty..: 20
        Gap Size Pen.: 5
        Tuple Cut-off: 48

************************************************************

        Number of proteins tested.: 44890
        Number of alignments found:    50



***********(1)**********
Title....: Thioredoxin precursor -- Escherichia coli
Id.......: TXEC
FastScore: 6624
NW Score.: 1224
Matches..: 56

Query Seq..:                  MSDKLIHITDDSFDTDVIKADGAILVDFWAEW
Matching...:                  ||||*||*|||||||||*||||||||||||||
Database...:MLHQQRNQHARLIPVELYMSDKIIHLTDDSFDTDVLKADGAILVDFWAEW

Query Seq..:CGPCKMIAPILDELADEYQGKLTVAKLN
Matching...:|||||||||||||*||||||||||||||
Database...:CGPCKMIAPILDEIADEYQGKLTVAKLNIDQNPGTAPKYGIRGIPTLLLF

Query Seq..:
Matching...:
Database...:KNGEVAATKVGALSKGQLKEFLDANLA
~
~
~
~
~

18) It looks like we've got a hit!  Clearly bacillus redoxase is very 
closely related to E. coli thioredoxin.  Indeed, given the level of 
similarity between the two we can be quite certain that bacillus redoxase
is actually Bacillus subtilis thioredoxin.  A quick check through the 
full alignment file will reveal that thioredoxins are actually very 
common proteins that seem to be ubiquitous in just about every creature 
presently known.  Obviously we would like to know more about thioredoxins
so that we may find out what they do and how they function.  Of course 
we could run to the library and look up a few references, but we could 
also save ourselves some time by finding out immediately if some 
thioredoxins have already had their structures investigated by 
crystallography or NMR spectroscopy.  To do so we need to get back to
SEQSEE's main menu.  As usual we type ":q" and save the file using the
name "redoxase.align".
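
To see where the "Matches..: 56" figure and the "Matching" lines in the report come from, the alignment can be rebuilt with a small Python sketch. It uses a lookup table of two-residue "tuples" to find the diagonal on which the two sequences line up, then marks identities and mismatches. This is only in the spirit of lookup-table methods. Whether FAST_ALIGN works this way internally is an assumption, the scores in the report are not reproduced, and the function names are invented.

```python
from collections import Counter, defaultdict

def best_diagonal(query, target, k=2):
    """Return the offset (target index minus query index) sharing most k-tuples."""
    table = defaultdict(list)            # lookup table: k-tuple -> query positions
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    if not diagonals:
        return None
    return diagonals.most_common(1)[0][0]

def match_line(query, target, offset):
    """Mark identities with '|' and mismatches with '*', as in the report."""
    marks = []
    for i, residue in enumerate(query):
        j = i + offset
        if 0 <= j < len(target):
            marks.append("|" if residue == target[j] else "*")
        else:
            marks.append(" ")            # query residue with no partner
    return "".join(marks)

redoxase = ("MSDKLIHITDDSFDTDVIKADGAILVDFWAEWCGPCKMIAPILDELADEY"
            "QGKLTVAKLN")
txec = ("MLHQQRNQHARLIPVELYMSDKIIHLTDDSFDTDVLKADGAILVDFWAEW"
        "CGPCKMIAPILDEIADEYQGKLTVAKLNIDQNPGTAPKYGIRGIPTLLLF"
        "KNGEVAATKVGALSKGQLKEFLDANLA")

offset = best_diagonal(redoxase, txec)
line = match_line(redoxase, txec, offset)
print(offset, line.count("|"))           # 18 56
print(line)
```

The offset of 18 corresponds to the 18 residues of the E. coli precursor that come before the start of the query in the report, and the four "*" marks fall at query residues 5, 8, 18 and 46.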


:q


 Save this file? (Y/N): 

>> y


 Please enter a name for this file.

>> redoxase.align


**********************************************************************
* Package...:                  SEQSEE  Version 1.2 (c)               *
* Authors...:       Robert Boyko / Leigh Willard / David Wishart     *
*                            Fred Richards / Brian Sykes             *
* Location..:                   University of Alberta                *
*               Protein Engineering Network of Centres of Excellence *
**********************************************************************
 
     *** Preliminaries ***                      *** Alignments *** 
  1) Help                                   10) Fast Alignment Search
  2) Enter/Edit a Sequence                  11) Exhaustive Alignment Search
  3) Get Sequence from Database             12) Align 2 or more sequences  

     *** Structural Analysis ***                *** Scanning ***
  4) Sequence Statistics                    13) Pattern Search 
  5) Structure Prediction                   14) Homology Search
  6) SEQSITE Pattern Search                 15) Dot Plot
  7) Flexibility                            16) Database Reference Search
  8) Hydrophobic Moment                     17) File Viewer
  9) Hydrophobicity                          0) EXIT SEQSEE

  Enter the number of the desired function:

>> 


19) SEQSEE contains a file called SEQBANK which is a compilation of the 
sequences and secondary structures of all proteins which have had their 
structures reported in the literature.  We can access this databank 
through the File Viewer option (number 17 on the menu) and search for
any occurrences of thioredoxin in this databank.  So let's type "17" 
and see what happens.



>> 17

 File Viewer (Version 1.2)

 What would you like to browse?

   1) User specified file
   2) PIRSEE database
   3) SWISSEE database
   4) SEQBANK database
   5) SEQSEE parameter file
   0) Exit

 Enter a number (then press return).

>>


20) Enter "4" and press return since we want to inspect the SEQBANK database.
Once this is done we should see a file with a header then the following data:


>> 4


>ACTIN (RABBIT SKELETAL)
#REFERENCE : KABSCH, W. ET AL., NATURE 347:37-44 (1990)
#REFERENCE : FLAHERTY, K.M. ET AL., PNAS 88:5041-5045 (1991)
#SEQBANK ID: 1
#BRKHAVN ID:
#PIR-NBR ID: ATRB
#SWISPRO ID: ACTS$RABIT
#RESOLUTION: 2.8
#R FACTOR  : 23.8
#FOLD CLASS: M
#NUM RESIDU: 375

DEDETTALVC DNGSGLVKAG FAGDDAPRAV FPSIVGRPRH QGVMVGMGQK
CCCCCCBBBB BBBCCBBBBB BBCCCCCCBB BBCCBBBBCC CCCCCCCCCC

DSYVGDEAQS KRGILTLKYP IEHGIITNWD DMEKIWHHTF YNELRVAPEE
CBBBCHHHHH HCCBBBBBCC BBBCBBBCCH HHHHHHHHHH HCCCCCCCCC

HPTLLTEAPL NPKANREKMT QIMFETFNVP AMYVAIQAVL SLYASGRTTG
CCBBBBBCHH HHHHHHHHHH HHHHHCCCCC BBBBBBCHHH HHHHCCCCBB
~
~
~
~


21) To check if thioredoxin is in this databank we use one of the "vi" 
editor commands for character string searches.  This is done by typing 
/THIOREDOXIN/ followed by return (note that this search is 
case-sensitive).  Once the command is entered, the file is scrolled
through to locate the word "THIOREDOXIN".  Alternatively, we could just
scroll through the database and look for the word "THIOREDOXIN" in the
sequence file header.  This is easily done since SEQBANK is arranged 
alphabetically.  Regardless of how we choose to do it we find that we 
are indeed fortunate for it appears that E. coli thioredoxin has already
had its crystal structure solved.  This will eventually permit us to 
accurately model the Bacillus subtilis molecule and might also help us
explain its apparently unusual redox activities.  To leave the File Viewer
option, simply type ":q".



>THIOREDOXIN (E.COLI)
#REFERENCE : HOLMGREN A. ET AL., PNAS (USA) 72:2305-2309 (1975)
#REFERENCE : DYSON, H.J. ET AL., BIOCHEMISTRY 28:7074-7087 (1989)
#REFERENCE : KATTI, S.K. ET AL., J. MOL. BIOL. 212:167-184 (1990)
#SEQBANK ID: 242
#BRKHAVN ID: 2TRX
#PIR-NBR ID: TXEC
#SWISPRO ID: THIO$ECOLI
#RESOLUTION: 1.7
#R FACTOR  : 16.5
#FOLD CLASS: M
#NUM RESIDU: 108

SDKIIHLTDD SFDTDVLKAD GAILVDFWAE WCGPCKMIAP ILDEIADEYQ
CCBBBBBBCC HHHHHHHHCC CBBBBBBBBC CCCCHHHHHH HHHHHHHHHC

GKLTVAKLNI DQNPGTAPKY GIRGIPTLLL FKNGEVAATK VGALSKGQLK
CCBBBBBBBB CCCCHHHHHH HCCCCCBBBB BBCCCBBBBB BBCCHHHHHH

EFLDANLA
HHHHHHHH
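
SEQBANK entries such as the one above are regular enough to be read by a short program. The Python sketch below is an illustration, not part of SEQSEE. The layout is inferred only from the two entries shown in this tutorial (a ">" name line, "#" field lines, then alternating lines of residues and secondary structure codes), and the function name is invented.

```python
def parse_seqbank_record(text):
    """Parse one SEQBANK-style record into a dictionary."""
    record = {"name": None, "fields": {}, "sequence": "", "structure": ""}
    body = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):                 # entry name
            record["name"] = line[1:].strip()
        elif line.startswith("#"):               # "#KEY : value" field line
            key, _, value = line[1:].partition(":")
            record["fields"].setdefault(key.strip(), []).append(value.strip())
        else:                                    # residues or structure codes
            body.append(line.replace(" ", ""))
    # Body lines alternate: residues first, then the matching structure codes.
    record["sequence"] = "".join(body[0::2])
    record["structure"] = "".join(body[1::2])
    return record

RECORD = """\
>THIOREDOXIN (E.COLI)
#REFERENCE : HOLMGREN A. ET AL., PNAS (USA) 72:2305-2309 (1975)
#REFERENCE : DYSON, H.J. ET AL., BIOCHEMISTRY 28:7074-7087 (1989)
#REFERENCE : KATTI, S.K. ET AL., J. MOL. BIOL. 212:167-184 (1990)
#SEQBANK ID: 242
#BRKHAVN ID: 2TRX
#PIR-NBR ID: TXEC
#SWISPRO ID: THIO$ECOLI
#RESOLUTION: 1.7
#R FACTOR  : 16.5
#FOLD CLASS: M
#NUM RESIDU: 108

SDKIIHLTDD SFDTDVLKAD GAILVDFWAE WCGPCKMIAP ILDEIADEYQ
CCBBBBBBCC HHHHHHHHCC CBBBBBBBBC CCCCHHHHHH HHHHHHHHHC

GKLTVAKLNI DQNPGTAPKY GIRGIPTLLL FKNGEVAATK VGALSKGQLK
CCBBBBBBBB CCCCHHHHHH HCCCCCBBBB BBCCCBBBBB BBCCHHHHHH

EFLDANLA
HHHHHHHH
"""

entry = parse_seqbank_record(RECORD)
print(entry["name"], entry["fields"]["BRKHAVN ID"][0])
print(len(entry["sequence"]), len(entry["structure"]))
```

Repeated fields such as REFERENCE are collected into lists. Comparing the length of the sequence with the NUM RESIDU field is a handy sanity check that an entry was read correctly.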


In sum, this little exercise demonstrates how it is possible to begin 
with a fragmentary peptide sequence of unknown structure and function 
and end up with a great deal of knowledge about that peptide's putative
structure, probable function and potential origin -- all in a matter 
of a few minutes.  While this example clearly demonstrates the potential
utility of SEQSEE it is important to understand that the results were 
obtained by adopting an efficient analytic strategy.  The approach adopted
here can be summarized as follows:

        1) Enter the new sequence into a SEQFILE
        2) Conduct a statistical analysis of the sequence using STATS
        3) Scan for sequence motifs using SEQSITE
        4) Carry out a fast database alignment with FAST_ALIGN
        5) Browse through the SEQBANK file to identify potentially related 
           structures and preliminary references

Different sequences may well require different strategies.  Likewise, 
different questions may be answered in different ways.  It is entirely 
up to the user to design a protocol that best suits his or her needs.


Questions to: bionmrwebmaster@biochem.ualberta.ca