Pattern recognition using peaklists

Overview

Pattern recognition with peaklists is an interesting problem with several possible solutions. Formulating queries using a public relational database like mysql is a possibility, another good choice may be a spreadsheet program like excel with its excellent macro facility. However, unless specified otherwise within the configuration files, the formation of sequential connectivities is given to a simple relational database (written in C) called "peakcon". The peakcon program is entirely independent of smartnotebook, smartnotebook will call peakcon and then read back the answers, impartial as to how to the answers are obtained. Thus, the grouping of peaks from peakpicked files into sequential connectivity patterns is yet another task beyond the scope of smartnotebook.

The rest of this section goes on to explain how peakcon works. Will people sit down with the peakcon program and perl scripts and write interesting pattern matching rulesets? Is peakcon powerful enough to account for all the attributes and subtleties in the patterns we are seeking? Decide for yourself. But eventually smartnotebook wants to be associated with the best pattern recognition software available, or at least give the user a chance to decide what that software may be.

What peakcon does have is a straightforward interface for building connectivity patterns which is currently in the form of a text-base rules file. A library of rules files can be developed which define how we search for patterns given the particular nmr data we have available. The possibility exists to search for generic patterns to ensure all possible patterns are found or specific patterns where interest is relegated to a certain class of connectivity pattern. Our current policy has been to develop rulesets for peakcon that are simple and exhaustive, it is better to find extra potential connectivities than to run the risk of missing any. Peakcon is quite simplistic, it only scores a connectivity pattern based on how well the peaks line up.

How Peakcon Works

Let's explore a little bit about how peakcon works starting with this simple example. Imagine I have two files with each row containing one datapoint made up of an x, y, and z field. Let's say I want to find a datapoint in file 1 and a datapoint in file 2 which have the same 'x' coordinate. I could specify this relation with the following two rules:
	file1	x1 y1 z1
	file2	x1 y2 z2
What the peakcon program does is first find an entry which satisfies the first relation, then it proceeds to find an entry to satisfy the second relation. Note that the second relation has the constraint that the "x1" found in the first relation has to match the "x1" entry in the second relation. So can you imagine what the rules would look like if I wanted to find a second datapoint in file1 which matched on the "y" component of the datapoint of file2?
	file1	x1 y1 z1
	file2	x1 y2 z2
	file1	x2 y2 z3
Now imagine that I am only interesting in patterns where the "z" components are between 50 and 100. In the peakcon database, the way to do this is as follows:
	file1	x1 y1 z1:50,100
	file2	x1 y2 z2:50,100
	file1	x2 y2 z3:50,100
Perhaps at this point you can see how I have used the above procedure to implement the following pattern for residue-residue connectivities with the datafiles specified below:
#
#     Filename    Assign  Connectivity Relation
#
RULE1: hsqc            1   H0       N0  I0:0,inf
RULE2: hncacb          1   H0  Ca0  N0  I1:0,inf
RULE3: hncacb          1   H0  Cb0  N0  I2:-inf,0
RULE4: cbcaconh        1   H1  Ca0  N1  I3
RULE5: cbcaconh        1   H1  Cb0  N1  I4
RULE6: hncacb          1   H1  Ca1  N1  I5:0,inf
RULE7: hncacb          1   H1  Cb1  N1  I6:-inf,0
RULE8: hsqc            1   H1       N1  I7:0,inf
Some new features here, first, the "RULE" field identifies to the peakcon program that this is rule. Second, the "Assign" column is usually set to "1", however if it is set to "0", that is telling the peakcon program that your pattern DOES NOT contain an entry with the corresponding matching fields (eg, if you find a peak that matches, then this is no good). Given the nature of peakpicking and subtleties contained in the simple rules of this database, setting the Assign feature to 0 is fraught with error potential. This is why we are not considering patterns involving Prolene at this time.

Also note that the hsqc has 3 interesting columns of data and the other xpk files have 4, the last column as you may have guessed is the peak intensity and you can see that positive/negative intensity plays a vital role in identifying the pattern with these experiments. "inf" stands for infinity and a comma or colon can deliminate the range.

This brings up two questions, a) how does the peakcon program know where the interesting columns are and b) how does a ruleset take into account the error which resides in defining peaks. Introducing the FORMAT statement:

#
#       Filename  label    data      intrafile errs   interfile errs
#
FORMAT: hsqc        1   3:10:17     0.015:0.25:0       0.03:0.5:0
FORMAT: hncacb      1   3:10:17:24  0.015:0.25:0.50:0  0.03:0.5:0.5:0
FORMAT: cbcaconh    1   3:10:17:24  0.015:0.25:0.50:0  0.03:0.5:0.5:0

This input into the peakcon program defines the active columns in the various xpk files which are involved in the ruleset. The hsqc file has 3 important keys which are located at columns 3, 10, and 17. For older format xpk files, you may remember that columns 3, 9, and 15 contained the shift data. Take note that there is nothing that indicates that column 3 is proton and column 10 carbon. It is not necessary, it is through the effective labels in the ruleset described above that makes it clear what the columns stand for.

The second part of the FORMAT statement allows you to define the allowable error for each of the corresponding columns specified under the "data" title. As is common practice, the error between corresponding shifts in different files is twice that of corresponding shifts found within the same peakpick file.

It is probably worth stating again that peakcon doesn't have a clue about what is proton, nitrogen, etc. All it knows is that the ruleset placeholder "Ca1" matches "Ca1" and therefore the same value (within a specified error) has to be found.

Ok, let's go on to the final part of the peakcon database program and that is the OUTPUT statement. Here is where all the required information for each pattern is printed out to one line of a specified output file.

# OUTPUT: term \
# A term can be:
#   1) a specific atom in a connectivity relation (eg Ca0), print shift value.
#   2) "RULEn" where n is the nth RULE, value of IDcol is printed (ie, peakId).
#   3) the word "PROB", print out probability fit value
#   4) anything else is considered a string to print on output.
#
# \ is a c formatting string for the term (type 'man sscanf').
# If no format given, the term is considered a string.
# I have made up a format 'a' which returns absolute value of the shift.

OUTPUT: RULE1 %3d
OUTPUT: RULE8 %3d
OUTPUT: dxx  (a string, this is a label identifying the connectivity type)
OUTPUT: PROB %4.2f
#
OUTPUT: RULE2 %3d
OUTPUT: RULE3 %3d
OUTPUT: RULE6 %3d
OUTPUT: RULE7 %3d
OUTPUT: RULE4 %3d
OUTPUT: RULE5 %3d
OUTPUT: -99
OUTPUT: -99
OUTPUT: -99
#
OUTPUT: 00 
OUTPUT: I5  %8.4f
OUTPUT: I6  %8.4a
#
OUTPUT: Ca0 %6.2f
OUTPUT: Cb0 %6.2f
OUTPUT: N0  %6.2f
OUTPUT: H0  %6.2f
OUTPUT: Ca1 %6.2f
OUTPUT: Cb1 %6.2f
OUTPUT: N1  %6.2f
OUTPUT: H1  %6.2f

The first 14 lines are mostly devoted to printing out peakIds which make up the pattern. A PROB field is a score between 0 and 1 which scores the fit of the peaks in the pattern. A perfect fit should score 1. Unused fields are marked with a "-99", perhaps these fields will be used in future versions.

The "00" line is used for discerning the number of weak peaks and the next 2 lines are used for printing any interesting intensities. Then come the actual shift values in the next 8 lines.

Anyhow, the only stipulation is that peakcon and smartnotebook both agree on where to find the data in a pattern. Smartnotebook uses a configuration file like the one in lib/hsqc/hsqc-1 to define where the key data items are. Each configuration file is fully documented, it is certainly instructive to read the comments in these files. You will see that smartnotebook and peakcon both agree about which data is in which field.

Now that you have seen the input that goes into a pattern match, feel free to check out the various pattern matching files in lib/rules.* (eg residue-residue patterns ). Notice that I have defined separate rules for finding patterns with glycine. Because sequential connectivities involving glycine require fewer peaks to match, we get lots of these matches, especially dgx and dgg where only one peak is required to connect two spin systems. Also, any pattern that matches residue-residue, will also match the other 3 types unless we employ a filtering mechanism.

Filtering the output from peakcon

If you taken the time to understand pattern matching and think you have done well to reach this point, unfortunately there are still more complications you need to hear about. Some of this complication stems from the complexity of patterns and weakness of the database tools.

Our current experiment set produces patterns which may include weak peaks visible in the hncacb spectra. Since weak peaks may or may not be there, it is difficult to express this concept to peakcon. If we choose to ignore weak peaks in the pattern search, then we get a whole bunch of patterns, two weak peaks gives 4 valid patterns. Now which is the correct pattern, if you say the two strongest peaks represent Ca1 and Cb1, then you are right most of the time. Most of the time peakcon can discern the glycine residue from others but overlap at the wrong place can cause ambiguity here too. On the other hand, wading through too many false positives makes smartnotebook too complex for the average user.

So in the end, I decided to filter the peakcon output with perl scripts such as this one where I throw out redundant connectivities. In current versions of smartnotebook, the shifts in the user's current chain decides which connectivities to choose from.


Smartnotebook Home

This file last updated:

Questions to: bionmrwebmaster@biochem.ualberta.ca