The rest of this section
goes on to explain how peakcon works. Will people sit down with the peakcon program
and perl scripts and write interesting pattern matching rulesets?
Is peakcon powerful enough to account for all
the attributes and subtleties in the patterns we are seeking?
Decide for yourself.
But eventually
smartnotebook wants to be associated with the best pattern recognition software
available, or at least give the user a chance to decide what that software may be.
What peakcon does have is a straightforward interface for
building connectivity patterns which is currently in the form of
a text-base rules file. A library of rules files can be developed
which define how we search for patterns given the particular nmr data
we have available. The possibility exists to search for generic patterns
to ensure all possible patterns are found or specific patterns where
interest is relegated to a certain class of connectivity pattern.
Our current policy has been to develop rulesets for peakcon that are simple
and exhaustive, it is better to find extra potential connectivities than
to run the risk of missing any.
Peakcon is quite simplistic, it only scores a connectivity
pattern based on how well the peaks line up.
Also note that the hsqc has 3 interesting columns of data and the
other xpk files have 4, the last column as you may have guessed is the
peak intensity and you can see that positive/negative intensity plays
a vital role in identifying the pattern with these experiments. "inf"
stands for infinity and a comma or colon can deliminate the range.
This brings up two questions, a) how does the peakcon program know
where the interesting columns are and b) how does a ruleset
take into account the error which resides in defining peaks. Introducing
the FORMAT statement:
The second part of the FORMAT statement allows you to define the
allowable error for each of the corresponding columns specified
under the "data" title. As is common practice, the error between
corresponding shifts in different files is twice that of corresponding
shifts found within the same peakpick file.
It is probably worth stating again that peakcon doesn't have
a clue about what is proton, nitrogen, etc. All it knows is that
the ruleset placeholder "Ca1" matches "Ca1" and therefore the same
value (within a specified error) has to be found.
Ok, let's go on to the final part of the peakcon database program and
that is the OUTPUT statement. Here is where all the required information
for each pattern is printed out to one line of a specified output file.
The "00" line is used for discerning the number of weak peaks
and the next 2 lines are used for printing any interesting intensities.
Then come the actual shift values in the next 8 lines.
Anyhow, the only stipulation is that peakcon and smartnotebook
both agree on where to find the data in a pattern. Smartnotebook
uses a configuration file like the one in
lib/hsqc/hsqc-1
to define where the key data items are. Each configuration file is fully
documented, it is certainly instructive to read the comments in these
files. You will see that smartnotebook and peakcon both agree about
which data is in which field.
Now that you have seen the input that goes into a pattern match, feel
free to check out the various pattern matching files in lib/rules.*
(eg residue-residue patterns ).
Notice that I have defined separate rules for finding patterns with
glycine.
Because sequential connectivities
involving glycine require fewer peaks to match, we get lots of these
matches, especially dgx and dgg where only one peak is required to connect
two spin systems. Also, any pattern that matches residue-residue, will also
match the other 3 types unless we employ a filtering mechanism.
Our current experiment set produces patterns which may include weak peaks
visible in the hncacb spectra. Since weak peaks may or may not be there,
it is difficult to express this concept to peakcon. If we choose to ignore
weak peaks in the pattern search, then we get a whole bunch of patterns,
two weak peaks gives 4 valid patterns. Now which is the correct pattern,
if you say the two strongest peaks represent Ca1 and Cb1, then you are
right most of the time. Most of the time peakcon
can discern the glycine residue from others but overlap at the wrong
place can cause ambiguity here too. On the other hand, wading through
too many false positives makes smartnotebook too complex for the
average user.
So in the end, I decided to filter the peakcon
output with perl scripts such as this one
where I throw out redundant connectivities. In current versions
of smartnotebook, the shifts in the user's current chain decides which
connectivities to choose from.
This file last updated:
Questions to:
bionmrwebmaster@biochem.ualberta.ca
Pattern recognition using peaklists
Overview
Pattern recognition with peaklists is an interesting problem with several
possible solutions. Formulating queries using a public relational database
like mysql is a possibility, another good choice may be a spreadsheet
program like excel with its excellent macro facility.
However, unless specified otherwise within the configuration files,
the formation of sequential connectivities is given to a simple
relational database (written in C) called "peakcon".
The peakcon program is entirely independent of smartnotebook, smartnotebook
will call peakcon and then read back the answers, impartial as to how to the
answers are obtained.
Thus, the grouping of peaks from peakpicked files into sequential
connectivity patterns is yet another task beyond the scope of
smartnotebook.
How Peakcon Works
Let's explore a little bit about how peakcon works starting with this
simple example.
Imagine I have two
files with each row containing one datapoint made up of an x, y, and z field.
Let's say I want to find a datapoint in file 1 and a datapoint in file
2 which have the same 'x' coordinate. I could specify this relation
with the following two rules:
file1 x1 y1 z1
file2 x1 y2 z2
What the peakcon program does is first find an entry which satisfies the
first relation, then it proceeds to find an entry to satisfy the second
relation. Note that the second relation has the constraint that the "x1"
found in the first relation has to match the "x1" entry in the second
relation. So can you imagine what the rules would look like if I wanted
to find a second datapoint in file1 which matched on the "y" component
of the datapoint of file2?
file1 x1 y1 z1
file2 x1 y2 z2
file1 x2 y2 z3
Now imagine that I am only interesting in patterns where the "z" components
are between 50 and 100. In the peakcon database, the way to do this is
as follows:
file1 x1 y1 z1:50,100
file2 x1 y2 z2:50,100
file1 x2 y2 z3:50,100
Perhaps at this point you can see how I have used the above procedure
to implement the following pattern for residue-residue connectivities
with the datafiles specified below:
#
# Filename Assign Connectivity Relation
#
RULE1: hsqc 1 H0 N0 I0:0,inf
RULE2: hncacb 1 H0 Ca0 N0 I1:0,inf
RULE3: hncacb 1 H0 Cb0 N0 I2:-inf,0
RULE4: cbcaconh 1 H1 Ca0 N1 I3
RULE5: cbcaconh 1 H1 Cb0 N1 I4
RULE6: hncacb 1 H1 Ca1 N1 I5:0,inf
RULE7: hncacb 1 H1 Cb1 N1 I6:-inf,0
RULE8: hsqc 1 H1 N1 I7:0,inf
Some new features here, first, the "RULE" field identifies to the
peakcon program that this is rule. Second, the "Assign" column is
usually set to "1", however if it is set to "0", that is telling the
peakcon program that your pattern DOES NOT contain an
entry with the corresponding matching fields (eg, if you find a peak
that matches, then this is no good). Given the nature of peakpicking
and subtleties contained in the simple rules of this database, setting
the Assign feature to 0 is fraught with error potential. This is why
we are not considering patterns involving Prolene at this time.
#
# Filename label data intrafile errs interfile errs
#
FORMAT: hsqc 1 3:10:17 0.015:0.25:0 0.03:0.5:0
FORMAT: hncacb 1 3:10:17:24 0.015:0.25:0.50:0 0.03:0.5:0.5:0
FORMAT: cbcaconh 1 3:10:17:24 0.015:0.25:0.50:0 0.03:0.5:0.5:0
This input into the peakcon program defines the active columns in the
various xpk files which are involved in the ruleset. The hsqc file
has 3 important keys which are located at columns 3, 10, and 17. For
older format xpk files, you may remember that columns 3, 9, and 15
contained the shift data. Take note that there is nothing that indicates
that column 3 is proton and column 10 carbon. It is not necessary, it
is through the effective labels in the ruleset described above that
makes it clear what the columns stand for.
# OUTPUT: term \
The first 14 lines are mostly devoted to printing out peakIds which make up the
pattern. A PROB field is a score between 0 and 1 which scores the fit of the
peaks in the pattern. A perfect fit should score 1. Unused fields are marked
with a "-99", perhaps these fields will be used in future versions.
Filtering the output from peakcon
If you taken the time to understand pattern matching and think you
have done well to reach this point, unfortunately there are still more
complications you need to hear about. Some of this complication stems from
the complexity of patterns and weakness of the database tools.