seqsee Brian Sykes Lab
Version: 1.6 - Aug / 2012
Download and Installation
Purpose:
seqsee - protein sequence analyzer
Contains a number of functions for analyzing protein sequences.
The past few years have seen an explosion in the use of computers in molecular biology. This is in no small part due to the vast quantity of raw protein and nucleic acid sequence data that has been generated in the last decade. For example, since 1980 the number of sequences in the PIR databank alone has increased from fewer than 2000 to almost 45,000 separate entries today. Without the aid of a computer it would be simply impossible to attempt to analyze or categorize this huge reservoir of biological information or to manage the rapid influx of new sequence data which is being generated every single day.
Fortunately, through the establishment of publicly funded databanks such as GENBANK, SWISS-PROT, NBRF-PIR and EMBL, most of us have been spared the headache of keeping up with this information explosion. Now, much of this sequence data is readily available, in computer-readable format, to any scientist who wishes to subscribe to it. While the development of these centralized databases has certainly helped in the rapid dissemination of sequence information, it has not necessarily solved the problem of its rapid "assimilation". As a consequence, a great deal of privately funded effort has been directed towards the development of software (and hardware) which would permit molecular biologists to quickly compare, analyze and otherwise dissect sequence data in a useful or informative manner. Programs or programming suites such as Intelligenetics' IG suite, Wisconsin's GCG suite, IBI's MacVector, and others are now widely available for this purpose and are typically designed to run on computers of all sizes and shapes including IBM PCs, Macintoshes, VAXes and SUN workstations. Many of these larger and more costly programs permit flexible sequence manipulations such as database searching, aligning, comparing and matching -- all at the touch of a button.
It is a result of the widespread implementation of these software packages (in conjunction with their accompanying databases) that a number of extremely important and very useful "discoveries" have been made. These include, just to mention a few, the identification of a number of new and important receptor families, the identification of numerous repetitive or recurrent folding "modules", the identification of various oncogenic products and the establishment of evolutionary relatedness between hundreds of previously unidentified or poorly understood protein products (see Doolittle, 1990 and references therein).
The success that molecular biologists have had at performing simple, yet important "computer experiments" has induced others, including X-ray crystallographers, NMR spectroscopists, protein engineers and evolutionary biologists to begin using or adapting these same software packages to help answer specific questions of their own. In particular, it is now becoming increasingly common to see some of the more advanced sequence analysis programs (involving multiple sequence alignments and advanced structure prediction algorithms) being used to predict the tertiary structure of previously uncharacterized proteins (Schulz, 1988). Likewise, the push to develop more efficient methods for identifying the potential function or active site location of newly isolated proteins is leading to the development of methods which are, in effect, redefining the meaning of "homology" or "sequence similarity" (Gribskov et al., 1987). In addition, new databases containing secondary structural information, phi and psi angles, torsion angle restraints, NMR chemical shifts, sequence motifs and the like, are continually being added to the current software arsenal to permit even more diverse inquiries and analyses.
Quite clearly the development of sequence analysis packages is entering
into a phase of very rapid expansion with many new and unforeseen
applications being proposed for an increasingly diverse and substantially
larger investigative population. Of course with this rapid expansion
come the usual problems of limited availability, restricted usability
and increased costs of most sequence analysis software products. As a
result, a rather problematic "software stratification" is developing in
the field with the most powerful (and most useful) packages becoming more
and more expensive while the freely available, unintegrated "shareware"
products are becoming less and less useful. In response both to this
program diversification and this software stratification we have
endeavored to develop a publicly available software package (called SEQSEE
- SEQuence SEEker) which offers the program diversity of the expensive
packages at the "cost" of the freely available shareware products.
Specifically SEQSEE is a multi-purpose menu-driven suite of
programs designed to provide a fully integrated, state-of-the-art package
for the analysis and display of protein sequences and protein databases.
It has been designed with considerable flexibility in mind so as to permit
the addition of new features and new algorithms when they are developed or
as they are reported in the literature. It contains many of the features
available in some of the most comprehensive commercially available programs
such as rapid database searching, flexible pattern matching and multiple
sequence alignment. It also contains a large number of structural
analysis and prediction programs which have been enhanced through the
incorporation of several unique databases. In this regard, SEQSEE has
been developed expressly from the point of view of those protein chemists
who are interested in questions pertaining to both structure and function.
As a result, we believe SEQSEE offers a number of important enhancements
and many unique advantages over what is typically found in other
commercially available software packages. SEQSEE has already been used
in the analyses of fibroma/myxoma viral products (C. Upton, personal
communication), cystic fibrosis gene products, fish anti-freeze proteins
and a variety of growth factors. In many respects SEQSEE has performed
beyond our expectations and it is as a consequence of its consistent
(and sometimes unexpected) success that we believe it should be made
freely available to all members of the scientific community. We hope you
will find SEQSEE as useful in your work as we have found it in ours.
The programs, databases and libraries contained within SEQSEE have
been under continuing development since June of 1988 in the labs
of both Frederic M. Richards of Yale University (New Haven) and
Brian D. Sykes of the University of Alberta (Edmonton). The SEQSEE
project was completed in partial fulfillment of the requirements for
the degree of Doctor of Philosophy for David Wishart (Yale, 1991). The
principal programmer for SEQSEE has been Robert Boyko with database
development and general program design being handled by David Wishart,
and some revisions done by Leigh Willard.
Copyright (C) 1991 -
No portion of this program may be incorporated into other
programs or sold for profit without express written consent
of the authors.
Download seqsee here:
PC(Linux) and macosx: seqsee v1.6 (8.4 MB)
Once you have downloaded the software, you then proceed by
untarring the file. For example:
The program will first check the user's current directory for a parameter file
called "seqsee.parms" and use this first before reading the default one
from the installed lib directory.
I type "seqsee" and I don't get the main menu?
What exactly is the function of the file
seqsee.parms ?
The file "seqsee.parms" contains all the default parameters when
SEQSEE is run. SEQSEE will first check your current directory
to see if you have a "seqsee.parms" file and if not, it will use the
default "seqsee.parms" which the installation program creates. This
parameter file should be self-explanatory and easy to follow.
What are the most common items that could be changed in "seqsee.parms"?
There are many different sets of hydrophobicity parameters and
similarity matrices which you may wish to experiment with. See the
section regarding databases and library files. Be aware that if you
change from the "wt.rbo" matrix to the "wt.dayhoff" that other parameters
such as "gap penalty" and "gap size penalty" will also need changing.
There are several "print" flags which can make your output more verbose
or terse. Many of the options are strictly for the programmer or for
the person who has knowledge about how the algorithms work.
While running SEQSEE, the screen was cleared and I seem to be
placed in some kind of editor. How do I get out of this mode?
By default, SEQSEE uses the "vi" editor whenever it has results to
show to the user. To exit this mode type ":q" to exit without saving
changes or ":wq" to save any changes you have made and exit.
Is there a way to turn this fullscreen editing feature off?
Yes. Your system administrator will have chosen the editor
for seqsee. You can turn off the editor completely by changing
the editor flag from 1 to 0 in "seqsee.parms".
What should I do if I want to get out of something that I got
myself into? For example, I am doing an exhaustive alignment search and
realize I made a mistake.
The easiest way is to press the "control" and "c" keys down
together. This will take you back to the main SEQSEE menu. Another
method is via the UNIX "kill" and "ps" commands (see your UNIX manual).
When taking this form of action there will likely be temporary files
which you should remove (these files contain numbers and ".tmp").
Is there some way for me to check how the results from a search are
going rather than waiting until SEQSEE ends?
Yes, most modules in SEQSEE keep intermediate results, especially
those functions which can take a while. These results are stored in
a numbered file appended with ".tmp" or ".tmp.ids" (e.g., 6531209824.tmp).
While SEQSEE is running in one window, go to another window and type
"more *.tmp" to see intermediate results.
When I save my search results in file "x", I also have a file in my
directory called "x.ids". What is the purpose of this file?
This file contains only the ID code and name of each protein in
the results file. Some results files are too large to scan through
comfortably, and often people are only interested in the names
of the proteins which popped up in their search. The same applies
to the intermediate results files.
Now that I have my results, how do I print them out?
The UNIX command to print a text file is "lpr" but you should first
check with your system administrator about how one prints text files.
Can I run SEQSEE in the background?
Yes, running SEQSEE in the background allows the user to start a search
and then be able to log out. Once you have decided your search is
running properly, press the "control" and "z" keys together to stop the job.
Typing "bg" will then put the job in the background and now you can safely
log out and go home.
Can I change the priority at which SEQSEE is running? For example, I
wish my exhaustive alignment job would only run if no one else needs the
computer (cpu).
Yes, but you can only change the priority after the job starts running.
If you started an exhaustive alignment, first issue the UNIX command
"ps -ux | grep nw_align". On Silicon Graphics the command would be
"ps | grep nw_align". Then issue the UNIX command "renice 19 PID"
where PID is the process ID of the job you wish to change the priority of
(the PID is in the first column of plain "ps" output and in the second
column, after the user name, of "ps -ux" output). Please ask your system
administrator if you are not sure how this works.
What is a "core dumped" message?
The program has crashed either due to a programming bug or you have
exceeded some boundary limit or the system has run out of "swap space"
(see UNIX system manual). Sometimes a swap space problem will be
indicated by an "out of memory" error message as well.
What can I do if I get a "core dumped" message?
Check the SEQSEE parameter file (if you are not using the default)
and your input into the program. You may try varying your input to see
if the problem only occurs with your set of data. If you are the system
administrator and have some programming knowledge you may wish to
re-compile the module that crashed with the debug flag set on and use
a debugging program such as "dbx" to see exactly where the program crashes.
To search for specific protein names, enter the protein names (using
underscores in place of blanks) one per line. To search for more
than one string in the same protein entry, enter the first string
followed by '&', followed by the next string.
For example,
FIBROSIS & CYSTIC
THROMBOMODULIN
will find all of those entries which contain FIBROSIS and CYSTIC in
the same protein name (order is unimportant), and will also find all
protein entries with the name THROMBOMODULIN.
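The name-query rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the SEQRET source: the function names are hypothetical, and case-insensitive substring matching is an assumption (the text only says that a name "or portion thereof" may be given).

```python
def parse_name_queries(text):
    """Turn query text into a list of queries, each a list of required strings.

    One query per line; '&' joins strings that must occur in the same
    protein name; underscores stand in for blanks.
    """
    queries = []
    for line in text.splitlines():
        terms = [t.strip().replace("_", " ").upper() for t in line.split("&")]
        terms = [t for t in terms if t]
        if terms:
            queries.append(terms)
    return queries


def name_matches(protein_name, queries):
    """True if the name satisfies at least one query (all of its strings)."""
    # Assumption: matching is case-insensitive and by substring.
    name = protein_name.upper()
    return any(all(term in name for term in terms) for terms in queries)


queries = parse_name_queries("FIBROSIS & CYSTIC\nTHROMBOMODULIN")
print(name_matches("Cystic fibrosis transmembrane conductance regulator", queries))
```

Because each line is an independent query, a name is reported if it satisfies any one line, while every string joined by '&' on that line must be present.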
CHOU-FASMAN SECONDARY STRUCTURE PREDICTION
HYDROPHOBIC MOMENT SECONDARY STRUCTURE PREDICTION
GARNIER, OSGUTHORPE, ROBSON SECONDARY STRUCTURE PREDICTION
HOMOLOGY-BASED SECONDARY STRUCTURE PREDICTION
MOTIF-BASED SECONDARY STRUCTURE PREDICTION
Questions to:
bionmrwebmaster@biochem.ualberta.ca
Copyright and Acknowledgements
Authors:
D.S. Wishart,
R.F. Boyko,
L. Willard,
F.M. Richards and
B.D. Sykes
SEQSEE: A Comprehensive Program Suite for Protein Sequence Analysis
Comp. Appl. Biosci. 10:121-132 (1994)
Download and Installation
> tar xvf seqsee-1.6-build.tar
> cd seqsee-1.6-build
Look at the README.txt file for details on installation or
follow the procedure below.
1) Install a sequence database on your system, such as PIR or
Swiss-Prot. These databases are not included with seqsee and
can be acquired via anonymous ftp.
Let's assume we went to www.uniprot.org/downloads and got the text version
of the UniProtKB/Swiss-Prot database and placed it in
/usr/local/databases/uniprot_sprot.dat. If you do this, then you
can skip to step 4.
2) edit seqdb.fnames
- indicate the database type and number of files.
- set sequence database paths to the database you intend to use
3) edit refdb.fnames
Some databases separate the reference and sequence data.
If this is not the case, then set up this file to be
exactly the same as 'seqdb.fnames'.
4) ./Install
That's it, the Install script is simple and you are only asked
for the location to install seqsee.
5) You probably want to create an alias for seqsee and place it in your
.cshrc or .tcshrc file:
alias seqsee /usr/local/seqsee/bin/seqsee
Input sequence file format
FORMAT
Line 1 : >Title: name
Line 2,3,...n: One letter code sequence
EXAMPLE
>Title: Sequence ABC
KWEYASEPIKNMNSWTYR
AENRQDGGNAHKLLEPRF
DAAL
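As a rough illustration of the format above, the following Python sketch reads SEQFILE-style text into (title, sequence) pairs. It is not part of SEQSEE; the function name is hypothetical, and the handling of several records in one file (as in a user-defined database) is an assumption based on the ">Title:" line acting as a record separator.

```python
def read_seqfile(text):
    """Parse SEQFILE-style text into a list of (title, sequence) tuples."""
    records = []
    title, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">Title:" line closes the previous record (assumption).
            if title is not None:
                records.append((title, "".join(chunks)))
            title = line[1:].strip()
            if title.lower().startswith("title:"):
                title = title[len("title:"):].strip()
            chunks = []
        elif title is not None:
            # Sequence lines: drop blanks and normalise to upper case.
            chunks.append("".join(line.split()).upper())
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


example = """>Title: Sequence ABC
KWEYASEPIKNMNSWTYR
AENRQDGGNAHKLLEPRF
DAAL
"""
for name, seq in read_seqfile(example):
    print(name, len(seq))
```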
Recommendations for first-time user
The following suggestions are recommended to best use SEQSEE:
Frequently Asked Questions
Functions
Seqsee is a protein sequence analyzer which has the following functions:
**********************************************************************
* Package...: SEQSEE Version 1.5 (c) *
* Authors...: Robert Boyko / Leigh Willard / David Wishart *
* Fred Richards / Brian Sykes *
* Location..: University of Alberta *
* Protein Engineering Network of Centres of Excellence *
**********************************************************************
     *** Preliminaries ***                *** Alignments ***
 1) Help                              10) Fast Alignment Search
 2) Enter/Edit a Sequence             11) Exhaustive Alignment Search
 3) Get Sequence from Database        12) Align 2 or more sequences
     *** Structural Analysis ***          *** Scanning ***
 4) Sequence Statistics               13) Pattern Search
 5) Structure Prediction              14) Homology Search
 6) SEQSITE Pattern Search            15) Dot Plot
 7) Flexibility                       16) Database Reference Search
 8) Hydrophobic Moment                17) File Viewer
 9) Hydrophobicity                     0) EXIT SEQSEE
Enter the number of the desired function:
The following is a brief description of each function.
Help
You are directed to use our on-line help through a browser.
Enter / Edit a Sequence
The program known as SEQED is used for the entry and editing of new
(or old) sequence files. The program first queries the user as to
whether he or she wishes to:
1) Enter a new sequence.
2) Edit an old sequence.
If one chooses to enter a new sequence the program queries the user
for the name of the sequence (sequence name), the actual sequence
(using the standard single letter amino acid code), and the name of
the sequence file (output filename). Sequences may be entered using
either lower case letters, upper case letters or an arbitrary
combination of both. In other words, sequence entry is case
independent. The program also ignores blank characters so sequence
entries may have as many blank spaces as desired. A "sequence ruler"
is presented at the top of each sequence file entry line to permit
quick identification of residue positions as they are typed. After
each group of 50 characters has been entered, the user is expected to
press
Retrieve Sequence from Database
The program SEQRET is designed to allow the user to retrieve complete
sequences or groups of sequences from the database using either the id
code or protein name (or portion thereof). Thus one may seek and
select only a single sequence for a specific purpose, or entire
protein families to create special user-specified databases. The
sequences may be saved and/or edited for further analysis (as in the
preparation of files for multiple sequence alignments). All sequences
are saved in a SEQFILE format and, therefore, are ready to be analyzed
by any of the other SEQSEE functions.
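Example name queries, entered one per line ('&' joins strings that must
occur in the same protein name):
FIBROSIS & CYSTIC
THROMBOMODULIN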
Sequence Statistics
The STATS program carries out a simple statistical analysis of any
given protein sequence. It calculates and displays the molecular
weight, the amino acid composition, the predicted folding class (based
on residue composition), average hydropathy (based on the
Kyte-Doolittle parameters), total charge, predicted isoelectric point,
specific volume, expected protein radius, expected quantity of exposed
surface area and many other values that may be of structural or
statistical interest. Note that STATS can only be used on sequence
files in the SEQFILE format.
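Two of the simpler quantities listed above, amino acid composition and total charge, can be sketched as follows. This is an illustration only, not the STATS code: the function names are hypothetical, and the charge estimate assumes +1 for each K or R and -1 for each D or E, ignoring histidine, the termini and pH.

```python
from collections import Counter


def composition(sequence):
    """Return {residue: (count, percent)} for a one-letter-code sequence."""
    seq = "".join(sequence.split()).upper()
    if not seq:
        return {}
    counts = Counter(seq)
    return {aa: (n, 100.0 * n / len(seq)) for aa, n in sorted(counts.items())}


def approximate_charge(sequence):
    """Crude net charge: (K + R) minus (D + E). Simplifying assumption."""
    seq = "".join(sequence.split()).upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


seq = "KWEYASEPIKNMNSWTYRAENRQDGGNAHKLLEPRFDAAL"
for aa, (n, pct) in composition(seq).items():
    print(f"{aa} {n:3d} {pct:5.1f}%")
print("approximate charge:", approximate_charge(seq))
```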
Structure Prediction
ALEXIS is a comprehensive structural analysis program which has been
developed expressly for the SEQSEE software suite. ALEXIS performs
calculations on the extent and location of potential membrane spanning
regions, the identification of short sequence folding motifs and the
prediction of secondary structure using the cumulative results of five
different and well-tested methods. Detailed descriptions of the
techniques and their respective enhancements are given below:
This calculation uses the central point maxima technique first
described by Klein et al. (1985). This has been shown to be the
most accurate method for membrane spanning identification through
independent tests performed by Fasman & Gilbert (1990). The method
uses a linear discriminant model to test the probability that any
given sequence is membrane spanning. The hydrophobicity scale
(and hence the discriminant equation) has been adopted
specifically for the Kyte-Doolittle parameters. Some modifications
have been introduced to this scale to permit better discrimination
of the membrane spanning regions. The program is designed to
determine, first, if there are membrane spanning regions and,
second, where they are located.
CHOU-FASMAN SECONDARY STRUCTURE PREDICTION
This procedure predicts the secondary structure for any given
protein sequence through a modified Chou and Fasman (1974, 1978)
algorithm. The Chou-Fasman algorithm is based on statistically
observed propensities of all 20 amino acids to occur in various
protein secondary structures. Despite its widespread use and
general popularity, it is a technique not without its shortcomings.
In an attempt to improve both its accuracy and its general
utility, a number of modifications to the original algorithm
have been made. Some of these changes include the adoption of the
simplified rules of Williams et al. (1987) and the use of updated
Chou-Fasman parameters as derived from SEQBANK. With these new
modifications, this technique can predict secondary structures
with a 57.5% level of accuracy. A random three-state prediction,
on the other hand, is expected to be only 33.6% correct (based on
the disposition of secondary structures in SEQBANK).
HYDROPHOBIC MOMENT SECONDARY STRUCTURE PREDICTION
This procedure determines the secondary structure for any given
protein sequence on the basis of hydrophobic periodicities. It
has its origins with the Fourier analysis of hydrophobicity
profiles as first proposed by Eisenberg et al. (1984). In
contrast to the statistical techniques of Chou and Fasman, it is
an approach that is based on well established physico-chemical
principles. According to Eisenberg, stretches of residues with
hydrophobic periodicities in the range of 90 to 120 degrees
(corresponding to a hydrophobic residue every three to four
residues) are typically found in alpha-helices, while stretches
of amino acids with hydrophobic periodicities of 160 to 180
degrees (corresponding to alternating hydrophobic and hydrophilic
residues) are typically in beta strands. By introducing a number
of modifications to Eisenberg's original proposal, including the
use of optimized hydrophobicity parameters and the introduction of
Chou-Fasman conformational probabilities, the level of prediction
accuracy can reach 63.7%. (This value was calculated using the
structural assignments available in SEQBANK).
GARNIER, OSGUTHORPE, ROBSON SECONDARY STRUCTURE PREDICTION
Commonly called the GOR method (after the three authors' initials)
this procedure predicts the secondary structure on the basis of
parameters obtained through information theory. It is based on a
series of proposals originally put forward by these investigators
in the 1970's (Garnier, et al., 1978). It is very much a
statistical technique, not unlike the Chou-Fasman approach, except
that it takes into account the positional preferences of amino
acids within helices, beta-strands and coils. Despite its high
level of parameterization, the procedure is extremely fast (when
computerized) and is consistently rated among the most accurate of
known methods. With recent modifications in place, including some
degree of re-parameterization of the previously published values
found in Gibrat et al. (1987), the method attains a 63.4% level of
accuracy. (This value was calculated using the structural
assignments available in SEQBANK).
HOMOLOGY-BASED SECONDARY STRUCTURE PREDICTION
This procedure determines the secondary structure for any given
protein sequence by searching for short stretches of homologous
sequences and comparing them to known protein structures. It is
based on a number of related proposals simultaneously offered by
several authors in 1986 (Nishikawa and Ooi, 1986; Sweet, 1986 and
Levin et al., 1986). The most recent implementation of this
procedure, as described by Levin and Garnier (1988), has been
adopted for use in SEQSEE. In this version, SEQBANK is used as
the database of known structures from which sequence homologies
are sought. This method is the most accurate secondary structure
prediction scheme presently known. For proteins sharing greater
than 25% sequence similarity with any protein in SEQBANK, the
method approaches a level of accuracy of 87%. For proteins
possessing no significant homology, the prediction is 66.0%
correct. SEQSEE uses a specially optimized amino acid exchange
matrix in order to achieve these high scores.
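The general idea of the homology-based approach can be sketched as below. This is a toy illustration loosely modelled on the description above, not the SEQSEE implementation: the window length, the threshold, identity scoring in place of the optimized exchange matrix, the default to coil, and the two-entry "database" are all assumptions made for the example.

```python
from collections import defaultdict


def predict_by_homology(query, database, window=7, threshold=5):
    """Predict H/E/C per residue from similar short segments of known structure.

    database: list of (sequence, structure) pairs of equal length, where the
    structure string uses H (helix), E (strand) and C (coil).
    """
    votes = [defaultdict(int) for _ in query]
    for i in range(len(query) - window + 1):
        segment = query[i:i + window]
        for seq, ss in database:
            for j in range(len(seq) - window + 1):
                # Identity scoring: a stand-in for a real exchange matrix.
                score = sum(1 for a, b in zip(segment, seq[j:j + window]) if a == b)
                if score >= threshold:
                    # Each matching segment votes for its known states.
                    for k in range(window):
                        votes[i + k][ss[j + k]] += score
    # Residues with no votes default to coil (assumption).
    return "".join(max(v, key=v.get) if v else "C" for v in votes)


toy_db = [("AEELLKKAEELLKK", "HHHHHHHHHHHHHH"),
          ("TVTVTVKVTV", "EEEEEEEEEE")]
print(predict_by_homology("AEELLKKA", toy_db))
```

Weighting each vote by the segment score lets close matches dominate distant ones, which is the property that makes the method sensitive to the quality of the exchange matrix.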
MOTIF-BASED SECONDARY STRUCTURE PREDICTION
This procedure predicts secondary structure based on primary
sequence patterns contained in the files SEQMOTIF1 and SEQMOTIF2.
It is an extension of the methods first proposed by Rooman and
Wodak (1988, 1991) for identifying and incorporating well
established sequence/structure patterns in secondary structure
prediction schemes. The procedure, as it is currently implemented,
can only perform structural predictions (on average) on less than
20% of the residues in any given sequence. However, for those
regions that are predicted, the confidence level is often very
high (> 80%).
Seqsite Pattern Search
The SEQSITE procedure allows the user to search any given sequence for
active sites, binding sites, signature sequences, sequence motifs, and
related functional or structural sequence patterns. The user may
select between different sequence motif libraries to use, which
contain patterns, functions, and references. This type of "function
search" is extremely useful for determining the properties and
features of newly sequenced or poorly characterized proteins.
Flexibility
The program named FLEQSEE predicts the flexibility and mobility of
various regions in a protein based on sequence information alone.
Flexibility is calculated on the basis of the Karplus algorithm
(Karplus and Schulz, 1985). This procedure determines main-chain
mobility by using smoothed averages of X-ray thermal B factors taken
from approximately 30 highly resolved structures. In SEQSEE,
flexibility may be used to determine the position and length of coil
regions by locating all "significant" maxima (those maxima which
exceed a minimum threshold) in the flexibility plot. Flexibility
plots may also be used to identify surface-seeking elements or to
locate strongly antigenic regions of any given sequence.
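The step of locating "significant" maxima in a flexibility plot can be illustrated with a small peak finder. The sketch assumes the smoothed per-residue profile has already been computed; the function name, the threshold and the profile values below are made up for the example and are not Karplus-Schulz parameters.

```python
def significant_maxima(profile, threshold):
    """Return 0-based positions of local maxima that exceed the threshold."""
    peaks = []
    for i in range(1, len(profile) - 1):
        is_peak = profile[i] >= profile[i - 1] and profile[i] > profile[i + 1]
        if is_peak and profile[i] > threshold:
            peaks.append(i)
    return peaks


# Hypothetical smoothed flexibility values, one per residue.
profile = [0.90, 1.00, 1.10, 1.00, 0.95, 1.02, 0.97, 1.20, 1.25, 1.10]
print(significant_maxima(profile, threshold=1.05))
```

In this example the small bump at position 5 is a local maximum but falls below the threshold, so only the two larger peaks are kept.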
Hydrophobic Moment
MOMENT calculates the hydrophobic moment of a sequence using the
Cornette et al. (1987) scale of hydrophobicity and the Fourier
analysis technique of Eisenberg et al. (1984). Calculations are
performed over a set "sequence window" of predefined length using a
range of values specific to helical periodicities (90 to 120 degrees),
exterior beta strand periodicities (160 to 180 degrees) and interior
beta strand periodicities (0 degrees). The values for helix and beta
strand may be compared with one another and to a minimum cutoff value
(usually around 5) to identify amphipathic helices or beta strands.
This method has some utility in identifying potential T-cell epitopes
(amphipathic helices) and other biologically important structures.
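A minimal sketch of the hydrophobic moment calculation follows, using the Fourier form of Eisenberg et al. (1984): the moment at angle delta is the magnitude of the sum of H(n)*exp(i*n*delta) over the window. The code takes per-residue hydrophobicity values as plain numbers; the Cornette scale itself is not reproduced here, and the function names and toy values are assumptions for illustration.

```python
import math


def hydrophobic_moment(values, delta_degrees):
    """Magnitude of the Fourier component of `values` at the given angle."""
    delta = math.radians(delta_degrees)
    s = sum(h * math.sin(n * delta) for n, h in enumerate(values))
    c = sum(h * math.cos(n * delta) for n, h in enumerate(values))
    return math.hypot(s, c)


def moment_profile(values, window, delta_degrees):
    """Moment of each full window of the given length along the sequence."""
    return [hydrophobic_moment(values[i:i + window], delta_degrees)
            for i in range(len(values) - window + 1)]


# Toy hydrophobicities with a 100-degree (helix-like) periodicity.
helix_like = [math.cos(math.radians(100 * n)) for n in range(18)]
print(hydrophobic_moment(helix_like, 100), hydrophobic_moment(helix_like, 180))
```

A signal with helical periodicity gives a large moment near 100 degrees and a small one at 180 degrees, while strictly alternating values do the opposite; comparing the two is the basis for distinguishing amphipathic helices from strands.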
Hydrophobicity
HYDRO calculates the smoothed hydrophobicity (over a window of
pre-defined length) of any given sequence using a choice of several
hydrophobicity scales. The operator may choose (using the parameter
file) from the Eisenberg consensus scale (Eisenberg et al., 1984), the
Kyte-Doolittle scale (Kyte and Doolittle, 1982), the Cornette scale
(Cornette et al., 1987) or the Parker-HPLC scale (Parker et al.,
1986). Hydrophobicity plots may be used to approximate the positions
of coil regions, exposed loops or B-cell antigenic determinants in
many proteins (hydrophilic regions). They may also be used to locate
membrane spanning regions in some types of proteins (hydrophobic
regions of 20 or more residues).
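The smoothing step can be sketched as a simple moving average over the Kyte-Doolittle scale (Kyte and Doolittle, 1982). This is an illustration rather than the HYDRO code: the function name, the default window of 7 and the use of shortened windows at the sequence ends are assumptions.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def smoothed_hydrophobicity(sequence, window=7, scale=KYTE_DOOLITTLE):
    """Average hydrophobicity in a window centred on each residue.

    Windows are truncated at the ends of the sequence (assumption).
    Residues missing from the scale (e.g. X) raise KeyError.
    """
    seq = "".join(sequence.split()).upper()
    values = [scale[aa] for aa in seq]
    half = window // 2
    profile = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        profile.append(sum(values[lo:hi]) / (hi - lo))
    return profile


profile = smoothed_hydrophobicity("KWEYASEPIKNMNSWTYRAENRQDGGNAHKLLEPRFDAAL")
print(" ".join(f"{v:.2f}" for v in profile))
```

Passing a different dictionary as `scale` is enough to switch between hydrophobicity scales, which mirrors the choice offered through the parameter file.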
Fast Alignment Search
FAST_ALIGN is a k-tuple based fast alignment algorithm based loosely
on the speed-up protocols incorporated in Lipman and Pearson's FASTA
(1988) and Altschul et al.'s BLAST (1990). First, a table of
homologous 3-tuples is generated for the query sequence using a
modified scoring matrix. Second, a look-up table of these 3-tuples
and their respective location is prepared from the query sequence.
Third, a look-up table is prepared of 3-tuples for each sequence in
the database. The two look-up tables (one from the query and the
second from the database) are then compared and matches are
identified. The result is a one-dimensional "spectrogram" of
homologies characterized by low level noise (poor matches) and the
occasional sharp peak (a string of matches). Database sequences with
sufficiently high peaks are then pulled out and rigorously aligned
using the Needleman-Wunsch program to determine the significance of
the alignment. The program is capable of searching the complete
database and then ordering and aligning 50 homologous matches of a 100
residue query sequence in less than 90 seconds. This is an extremely
powerful technique to accomplish quick inquiries regarding protein
relatedness and identification. FAST_ALIGN may be used to align
sequences against the PIR, SWISS-PROT, SEQBANK or a user-specified
database with a SEQFILE format. Several choices of scoring matrices
are possible and these include: the Unity matrix, the Dayhoff PAM 250
matrix (Dayhoff et al., 1983), the McLachlan matrix (McLachlan, 1971)
and the Boyko matrix (unpublished). The Boyko matrix is the default
scoring matrix.
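The look-up table comparison can be sketched as follows. This is a simplified illustration, not FAST_ALIGN itself: it counts only exact 3-tuple matches (the program described above also admits homologous 3-tuples via a scoring matrix), and the function names are hypothetical. Matches are accumulated per diagonal (database position minus query position), which gives the one-dimensional "spectrogram" in which a string of matches shows up as a sharp peak.

```python
from collections import Counter, defaultdict


def ktuple_index(sequence, k=3):
    """Map each k-tuple in the sequence to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def diagonal_histogram(query, target, k=3):
    """Count exact k-tuple matches on each diagonal (target pos - query pos)."""
    index = ktuple_index(query, k)
    hist = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hist[j - i] += 1
    return hist


def best_diagonal(query, target, k=3):
    """Return (diagonal, match count) for the highest peak, or None."""
    hist = diagonal_histogram(query, target, k)
    if not hist:
        return None
    return max(hist.items(), key=lambda item: item[1])


print(best_diagonal("ACDEFGHIK", "MMACDEFGHIKLL"))
```

Only database sequences whose best peak is high enough would then be passed on to the much slower Needleman-Wunsch alignment.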
Exhaustive Alignment Search
NW_ALIGN is a program which carries out an exhaustive pair-wise alignment
of any given query sequence to all other sequences in a given database.
Only those sequences with scores above a certain user-defined threshold
are retained. The algorithm used for this procedure is based on the
Needleman-Wunsch (1970) approach for pair-wise alignment. This dynamic
programming method is guaranteed to find the optimal alignment between
any two sequences for any given scoring matrix and gap penalty.
Alignments can either be done against the PIR or SWISS-PROT database,
SEQBANK or a user defined database in the SEQFILE format. If alignments
are done against SEQBANK, knowledge of the secondary structure is included to
determine the location and length of gaps (Lesk et al., 1986). A choice
of scoring matrices and gap penalties is available. The scoring matrices
include: the Unity matrix, the Dayhoff PAM 250 matrix (Dayhoff et al.,
1983), the McLachlan matrix (McLachlan, 1971) and the Boyko matrix
(unpublished). The Boyko matrix is the default scoring matrix. Scores
are rigorously calculated on the basis of comparisons to randomized
sequence alignments as recommended by Dayhoff et al. (1983). The
program is extremely time consuming with a query sequence of 100 residues
typically taking 4 hours to complete on a SUN Sparcstation. However,
the improvement in overall alignment accuracy and the possibility of
identifying very remote and previously unidentified relationships may
well be worth the wait.
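A compact sketch of the two ideas in this section, the dynamic programming score and its comparison to randomized sequences, is given below. It is not the NW_ALIGN code: it uses a unity-style matrix (match 1, mismatch 0) and a single linear gap penalty rather than separate "gap penalty" and "gap size penalty" terms, and the function names, trial count and seed are assumptions.

```python
import math
import random
import statistics


def nw_score(a, b, match=1, mismatch=0, gap=-1):
    """Optimal global alignment score (Needleman-Wunsch, linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # a[i-1] against a gap
                          curr[j - 1] + gap)   # b[j-1] against a gap
        prev = curr
    return prev[-1]


def shuffled_z_score(a, b, trials=100, seed=0, **scoring):
    """Score of a vs b, in standard deviations above shuffled versions of b."""
    rng = random.Random(seed)
    real = nw_score(a, b, **scoring)
    letters = list(b)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(nw_score(a, "".join(letters), **scoring))
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return 0.0 if real == mean else math.copysign(math.inf, real - mean)
    return (real - mean) / sd


print(nw_score("ACGT", "AGT"))
```

Shuffling preserves the composition and length of the database sequence, so the z-score measures how much of the raw score is due to residue order rather than composition alone. The quadratic cost of each alignment, repeated for every database entry and every shuffle, is what makes the exhaustive search slow.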
Align 2 or more sequences
The program MULT_ALIGN uses a modification of the pair-wise
Needleman-Wunsch protocol to align two or more protein sequences.
The method is closely related to the progressive alignment procedure
first described by Barton and Sternberg (1987), which permits rapid and
accurate multiple alignments for up to several hundred proteins. A
consensus sequence is also generated for each pair-wise or multiple
alignment. A choice of scoring matrices and gap penalties is available.
Sequences which are to be aligned must be contained in SEQFILE formats,
either in the form of databases (for multiple alignments) or singly (for
pair-wise alignments). The procedure for aligning more than two sequences
(like the fast alignment search described above) is fundamentally heuristic
in nature and so it cannot be proven that the resulting alignments are
mathematically optimal.
Pattern Search
This procedure searches either SEQBANK, the PIR/SWISS-PROT database or a
sequence of your own choosing to find exact pattern matches according to
the following rules (note the sequence patterns are case INDEPENDENT):
a)  A      match exact residue specified where A = any amino acid
b)  !A     match any residue EXCEPT A
c)  *      wild card character--matches any amino acid
d)  [ ]    "OR" braces--allows several residue choices.
           i.e. [ILK] = I "or" L "or" K
e)  &      "AND" character--allows 2 patterns to be placed in any 1 query
f)  { }    "Range" braces--allows a range of wild card characters.
           i.e. {2,8} = 2 to 8 "*"
g)  $      N or C termination character - used to mark either the
           beginning (N terminus) or end (C terminus) of a sequence
Pattern Search (PSEARCH) is constructed to allow the user to enter several
patterns at once, both on a single line (using the "&" feature) or on
separate lines. Patterns appearing on separate lines are treated as
"independent" patterns (meaning they don't have to appear in the same
protein sequence) while patterns with "&" characters are viewed as
"dependent" patterns (meaning they do have to appear in the same protein
sequence). Some examples of sequence pattern searches are given below:
AA***K          Find all occurrences of 2 alanines together
                followed by any 3 residues followed by a single
                lysine.
AA!P!P!PK       Find all occurrences of 2 alanines together
                followed by any 3 residues (as long as they
                are NOT prolines) followed by a single lysine
                (i.e. look for AA***K except AAP**K, AA*P*K,
                AA**PK, AAPP*K, AAP*PK, AA*PPK and AAPPPK).
[AG][AG]*[KR]   Find all occurrences of 2 alanines or 2
                glycines or any combination of the two
                followed by any residue followed by a lysine
                or an arginine (i.e. look for AA*K, AG*K, GA*K,
                GG*K, AA*R, AG*R, GA*R and GG*R).
AA*K&I**R       Find all occurrences of 2 alanines together
                followed by any amino acid followed by a
                single lysine, AND if that pattern is found,
                then find all occurrences of a single isoleucine
                followed by any two amino acids followed by
                a single arginine IN THE SAME PROTEIN
                SEQUENCE (i.e. look for AA*K then I**R within
                a sequence).
AA{2,5}[KR]     Find all occurrences of 2 alanines together
                followed by at least 2 but no more than 5
                amino acids (any type) followed by either a
                lysine or an arginine (i.e. look for AA**[KR],
                AA***[KR], AA****[KR] and AA*****[KR]).
${3,5}M         Find all occurrences of methionine that are
                between 3 and 5 residues from the N
                terminus (i.e. look for $***M, $****M and
                $*****M).
Of course any combination of the above queries could be used in a PSEARCH
pattern search. Other examples of PSEARCH queries may be found by
browsing through the SEQSITE database.
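One way to get a feel for the pattern rules is to translate them into regular expressions. The Python sketch below does this for the seven rules listed above. It is an illustration only and not how PSEARCH is implemented: the function names are hypothetical, "!" is assumed to apply to a single residue letter, and "$" is assumed to mean the N terminus when it starts a pattern and the C terminus otherwise.

```python
import re

def psearch_to_regex(pattern):
    """Translate one PSEARCH-style pattern (without '&') into a regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "!":                       # !A -> any residue except A
            out.append("[^" + re.escape(pattern[i + 1]) + "]")
            i += 2
            continue
        if ch == "*":                       # wild card
            out.append(".")
        elif ch == "[":                     # [ILK] -> I or L or K
            end = pattern.index("]", i)
            out.append(pattern[i:end + 1])
            i = end
        elif ch == "{":                     # {2,8} -> 2 to 8 wild cards
            end = pattern.index("}", i)
            out.append("." + pattern[i:end + 1])
            i = end
        elif ch == "$":                     # N terminus at start, else C
            out.append("^" if i == 0 else "$")
        else:                               # literal residue
            out.append(re.escape(ch))
        i += 1
    # Patterns are case-insensitive, as stated in the rules.
    return re.compile("".join(out), re.IGNORECASE)

def psearch(query, sequence):
    """True if every '&'-joined sub-pattern occurs in the same sequence."""
    parts = [p.strip() for p in query.split("&")]
    return all(psearch_to_regex(p).search(sequence) for p in parts)
```

For instance, psearch("AA*K&I**R", "AALKTTIPPR") is true because both sub-patterns occur in the one sequence. Patterns entered on separate lines would simply be passed to psearch one at a time, since they are independent of each other.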
Homology Search
The HSEARCH program searches either the PIR/SWISS-PROT database, SEQBANK or a
compatible user-defined database to find the "nearest" or most homologous
matches to any given sequence. Homologies are determined according to
any one of four user-defined scoring matrices (described earlier).
Gap penalties are not yet incorporated into the homology search
routine. The homology search is a useful complement to other pattern
search routines, especially when attempting to locate distantly related
or difficult-to-identify sequence motifs.
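Because no gap penalties are involved, a search of this kind can be pictured as sliding the query along each database sequence and scoring every ungapped placement. The Python sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the identity score (1 for a match, 0 otherwise) merely stands in for one of the scoring matrices, and the actual HSEARCH algorithm may differ.

```python
def identity(q, t):
    """Toy scoring function standing in for a real scoring matrix."""
    return 1 if q == t else 0

def best_ungapped_score(query, target, score=identity):
    """Best (score, offset) over all ungapped placements of query in target.

    Returns None when the target is shorter than the query.
    """
    best = None
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        total = sum(score(q, t) for q, t in zip(query, window))
        if best is None or total > best[0]:
            best = (total, offset)
    return best

def rank_hits(query, database, score=identity):
    """Return (score, name, offset) tuples, highest score first."""
    hits = []
    for name, seq in database.items():
        result = best_ungapped_score(query, seq, score)
        if result is not None:
            hits.append((result[0], name, result[1]))
    return sorted(hits, key=lambda hit: -hit[0])
```

Swapping the identity function for a lookup into a substitution matrix would let conservative replacements contribute to the score, which is what makes this style of search useful for distantly related sequences.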
DotPlot
DOTPLOT is a flexible program that produces character
representations of standard dot plots (Lipman and Pearson, 1985). Because
the low resolution of most character-based screens rules out a useful
graphic display of dot plot results, DOTPLOT instead draws a character
representation with a user-defined "threshold". DOTPLOT may be used to compare a
sequence with itself (to identify internal repeats), with another sequence
(for pair-wise alignments), with a SEQFILE compatible database or with the
PIR/SWISS-PROT database (for medium speed alignments). By using DOTPLOT in
conjunction with a database it is possible to look for homologies between
any shared regions in a group of sequences. Such an option has proven to
be quite useful in identifying previously unrecognized motifs or
unexpected similarities in a number of proteins.
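The idea of a character dot plot with a threshold can be sketched in a few lines of Python. This is an illustration only: the function name, the sliding window and the use of simple identity counts are assumptions, and DOTPLOT itself may score matches differently.

```python
def dot_plot(seq_a, seq_b, window=3, threshold=2):
    """Return text rows; '*' marks windows with matches >= threshold.

    Row i compares the window starting at seq_a[i] with every window
    of the same length in seq_b.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append("*" if matches >= threshold else ".")
        rows.append("".join(row))
    return rows
```

Comparing a sequence with itself, for example print("\n".join(dot_plot(s, s))), gives a main diagonal of "*" characters, and any internal repeats show up as additional diagonals parallel to it. Raising the threshold removes background noise at the cost of missing weaker similarities.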
Protein database reference search
The program REFSCAN is designed to allow the user to locate and retrieve
specific sequence references from the database using either the
accession number, the name (or portion thereof) or a bibliographic/
functional reference. This feature allows the user to quickly access
important information about many newly sequenced proteins pertaining to
their function, structure or relationship with other proteins in the
database.
For example, the query
    FIBROSIS & CYSTIC
    THROMBOMODULIN
will find all of those entries which contain FIBROSIS and CYSTIC in
the same protein name (order is unimportant), and will also find all
protein entries with the name THROMBOMODULIN.
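The matching behaviour described for name queries (terms joined by "&" must all occur in the same name, while separate lines are independent) can be sketched in Python as follows. The function name and the case-insensitive substring matching are assumptions made for illustration, not a description of REFSCAN's internals.

```python
def refscan(query_lines, names):
    """Return the names that satisfy at least one query line.

    Within a line, every '&'-separated term must appear in the name,
    in any order. Matching is assumed to be case-insensitive.
    """
    hits = []
    for name in names:
        upper = name.upper()
        for line in query_lines:
            terms = [t.strip().upper() for t in line.split("&") if t.strip()]
            if terms and all(term in upper for term in terms):
                hits.append(name)
                break  # one matching line is enough for this name
    return hits
```

With a made-up list of names such as "Cystic fibrosis transmembrane regulator", "Thrombomodulin precursor" and "Lysozyme", the query lines "FIBROSIS & CYSTIC" and "THROMBOMODULIN" would return the first two entries.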
Browse
BROWSE permits the user to edit or view a variety of database files.
Through this program it is possible to locate or identify sequence names
or id numbers from the database, to locate or view sequences
and references from the SEQBANK database, to view or edit sequences
written as SEQFILEs and to view, edit or change the SEQSEE parameter file.
In the case of viewing PIR database information, all sequence name and
id number data is contained in a single 1 MB file called PIRSEE.db. Standard
Unix commands may be used for scrolling through or locating all character
strings in any of the files.
Exit Seqsee
Closes all current files and returns the user to the general operating
system. The program may be restarted by typing "seqsee". If the program
crashes or hangs up for any reason, simply press Ctrl-C ("^c"), which will
stop all processes and return the user to the main menu.
Seqsee Tutorial
Note: The following tutorial uses seqsee version 1.2, which is almost identical to
version 1.5 (so it is not worth updating this section of the manual).
Let us suppose that you and a collaborator have succeeded in isolating a
small protein from Bacillus subtilis which appears to act as an oxidizing
co-factor for certain cellular processes. After many weeks of amino acid
analysis and peptide sequencing, your collaborator provides you with the
N terminal sequence of the first 60 amino acids of this new protein. You
are requested to find out anything you can about this partial sequence,
and to report to your colleague as soon as possible. Sounds like a job
for SEQSEE!
Let's demonstrate how you might go about analyzing this sequence using
just a few of the options available in SEQSEE. Note that in this example
we will first show how a new sequence is entered. Then we will demonstrate
how the sequence can be analyzed statistically. We will also show how to
check this sequence for sequence motifs and how to compare (and align)
the query sequence against the PIR database. Finally we will demonstrate
how to search SEQBANK to locate those proteins which might be
evolutionarily related to the query sequence. So here goes...
1) Sign on to a computer
2) Type "seqsee" (the following menu should appear)
**********************************************************************
* Package...: SEQSEE Version 1.2 (c)                                 *
* Authors...: Robert Boyko / Leigh Willard / David Wishart           *
*             Fred Richards / Brian Sykes                            *
* Location..: University of Alberta                                  *
*             Protein Engineering Network of Centres of Excellence   *
**********************************************************************
    *** Preliminaries ***                *** Alignments ***
 1) Help                             10) Fast Alignment Search
 2) Enter/Edit a Sequence            11) Exhaustive Alignment Search
 3) Get Sequence from Database       12) Align 2 or more sequences
    *** Structural Analysis ***          *** Scanning ***
 4) Sequence Statistics              13) Pattern Search
 5) Structure Prediction             14) Homology Search
 6) SEQSITE Pattern Search           15) Dot Plot
 7) Flexibility                      16) Database Reference Search
 8) Hydrophobic Moment               17) File Viewer
 9) Hydrophobicity                    0) EXIT SEQSEE
Enter the number of the desired function:
>>
3) Type "2" (and press