|
1. Introduction
The first task following the sequencing of a new gene is to identify the function of its related protein. Computer-assisted annotation techniques attempt to achieve this goal by exploiting the following biological fact: If two peptide stretches exhibit sufficient similarity at the sequence level, they probably are biologically related [14]. More specifically, when presented with a query sequence Q (representing a protein) and a set D of well-characterized proteins, such techniques look for all regions of Q which are similar to regions of sequences in D. Substantial homologies are then used for transferring information from the sequences of D to the query sequence Q.
The first approaches [5, 6] used for realizing this task were based on a technique known as dynamic programming. Unfortunately, the computational requirements of this method quickly render it impractical, especially for searching large databases (as is the current norm). The problem is, roughly speaking, that dynamic programming variants spend a good part of their time computing homologies which eventually turn out to be unimportant. In an effort to work around this issue, a number of algorithms have been proposed which focus on discovering only extensive local similarities; the best known are FASTA [7, 8] and BLAST [9]. In the majority of the cases, increased performance is achieved by first looking for ungapped homologies, i.e., similarities due exclusively to mutations and not insertions or deletions. The rationale behind this approach is that in any gapped substantial homology between two peptide strings, chances are that there exists at least a pair of substrings whose match contains no gaps. Locating these substrings (the ungapped homology) can then be used as the first step toward obtaining the entire (gapped) homology.
Identifying the similar regions between the query and the database sequences is, however, only the first (and computationally most demanding) part of the process. The second part (the one that is of interest to biologists) is the evaluation of these similarities, i.e., deciding whether they are substantial enough to sustain the inferred relation (functional, structural, or otherwise) between the query and the corresponding database sequence(s). Such evaluations are usually performed by combining biological information and statistical reasoning. Typically, similarity is quantified as a score computed for every pair of related regions. Computation of this score involves the use of gap costs (for gapped alignments) and appropriate mutation matrices representing the evolutionary probability of any given amino acid changing into another (e.g., the PAM [10] and BLOSUM [11] matrices). Then, the statistical importance of this cost is evaluated by computing the probability (under some statistical model) that such a score could arise purely by chance [12, 13]. Depending on the statistical model used, this probability can depend on a number of factors such as the length of the query sequence and the size of the underlying database.
Existing statistical frameworks are memoryless; whenever a homology is found between a region A of the query sequence and a region B of some database sequence, the similarity is evaluated without taking into account that A might also be similar to several other database regions. So, although seen in isolation the homology between A and B might seem statistically insignificant, this is certainly not the case when the overall objective is considered. In this work, we try to address this issue by introducing memory into our calculations.
Memory is introduced by identifying groups of related oligopeptides, each of which appears unexpectedly many times in the underlying database. Each such group is represented by an appropriate regular expression, designated as a pattern (to be described precisely in the following section). Whenever a query sequence is presented to the system, we locate all of the query regions matching one or more patterns. Every match acts as a hypothesis of similarity between the query region and all of the database regions also matching a specific pattern. In a final step, all of these hypotheses are further examined by aligning and scoring the areas around the corresponding similarity regions. The highest-scoring among them are then reported to the user.
The success of the method we propose here depends crucially on our ability to identify a set of patterns characteristic of the underlying database. Previously there were no computational tools powerful enough to handle the task of pattern discovery in datasets of the size of existing protein databases. As a result, analogous efforts [1416] were restricted to patterns characterizing groups of proteins already known to be related. Here, however, we use TEIRESIAS, a new pattern-discovery algorithm [17, 18] which is suitable for use on large datasets. We are thus able to detect not only family-specific patterns but also patterns that appear to cross family boundaries.
In the sections that follow, we describe the precise methodology used to obtain the patterns used, and give examples of the power of our new tool, designated as DELPHI, by applying the tool to both annotated and unannotated sequences. We also discuss how our approach can lead to extremely fast search times. The database used as a test bed in this work is SwissProt [19] Release 34, but the methodology we propose can be used on any protein database.
2. Motivation and definitions
The key concept used is that of a pattern. Patterns are regular expressions describing families of polypeptides. As we explain later, the polypeptide family represented by a single pattern is expected to contain stretches of related amino acids (in which the nature of the relationship may be structural, functional, or evolutionary).
More specifically, given the alphabet of amino acids, we define a pattern P as a regular expression of the form
( {.'})* ,
where .' (referred to as the don't-care character) denotes a position that can be occupied by an arbitrary residue. Since it is a regular expression, every pattern P defines a language G(P) of polypeptides consisting of all the strings that can be obtained from P by substituting an arbitrary residue from for each don't-care character. For the pattern A.CH..E, for example, the following polypeptides are all elements of G(A.CH..E):
ADCHFFE ALCHESE AGCHADE.
Given a pattern P and a sequence S *, we say that P matches S if S contains a substring (a continuous stretch of residues) which is a member of G(P). Furthermore, if D = {S1, S2, · · · , Sn} *, we say that a given pattern P has support K (1 K n) within D if P exactly matches K of the sequences in D. We also define the offset list of P with respect to D (or simply the offset list of P, when D is unambiguously implied) as the set of pairs
LD(P) = {(i, j)|P matches the substring of
Si starting at offset j}.
As an example, consider the pattern P = A..DFE and the set of sequences
D = {S1 = GHASEDFER, S2 = LKERAHPDFE,
S3 = LKMNAKLD}.
In this set, the pattern P has support 2 (the boldface substrings indicate the sequence regions matching P), and its offset list is LD(P) = {(1, 3), (2, 5)}.
A pattern P is regarded as being maximal with respect to D, if for every other pattern P' that is more specific than P (i.e., P' can be obtained from P either by substituting characters from for don't-cares or/and by extending P to the left or/and right) has an offset list strictly smaller than the offset list of P, i.e., |LD(P)| > |LD(P' )|. In the context of pattern-discovery algorithms, it is always desirable to look exclusively for maximal patterns in order to reduce the redundancy of the output.
Given a pattern P, the backbone of P is defined as a string over the alphabet {1, 0} obtained from P by turning every residue of P into the character 1 and every don't-care into the character 0. For example, the backbone of the pattern P = A..DFE introduced above is the string B = 100111. Backbones partition the set of patterns into equivalent classes, with each class containing all of the patterns sharing the same backbone. A pattern with backbone B is designated as a B-pattern.
Another useful concept is that of the density of a pattern. Roughly speaking, the density describes the minimum amount of homology between any two members of G(P) and is defined by two integers L and W (L W): A pattern P has an <L, W> density if every substring of P that starts and ends with an amino acid and has a length of at least W contains L or more residues (notice that, by definition, an <L, W> pattern has at least L residues). The integers L and W are parameters of our method, and their values control the amount of similarity allowed in the searches performed.1 At the level of our pattern-discovery algorithm, the integers L and W define a structural restriction on the patterns to search for. More specifically, TEIRESIAS takes as input a set D of n sequences, and three integer parameters L, W, and K (where K n). The algorithm produces as output the complete set of all of the maximal <L, W> patterns that appear in at least K of the sequences in the set D.
Given the above definitions, we now provide a first, informal description of our approach. It is composed of two distinct phases: information gathering and searching:
-
Information gathering: First, before any search is performed, the underlying database D is mined. During this step all of the significant <L, W> patterns are gathered, and each such pattern P is associated with its offset list LD(P) (the particular criterion used for deciding whether a pattern is significant is detailed in the next section).
-
Searching: The second step is the actual search. Given a query sequence Q, we identify all of the patterns P (among those collected in the first phase of the process) which match the sequence Q. For every such P, we pair the region(s) of Q which match P with the corresponding regions of all database sequences that also match P [these regions are easily accessible through the offset list LD(P)]. Finally, the paired regions are extended and aligned in both directions and scored by the use of a (user-defined) mutation matrix, and the highest-scoring matches are reported along with implied alignments.
It is worth pointing out here that the information-gathering phase is a one-time computation over D. The results obtained are stored in a file and used every time a search session is performed over the database D. So although it is a time-consuming process (it can take several hours, depending on the size of D), it has to be performed only once.
The usefulness of patterns in modeling biologically important protein regions is based on the fact that such regions are much less tolerant of mutations than regions of a subordinate role. It is then reasonable to expect that biologically related polypeptides can be identified by discovering a) conserved positions in their primary structure and b) an increased degree of reusability. In our terminology, these properties correspond to patterns with unexpectedly high support.
3. Methods
As already mentioned, the methodology we propose comprises two phases: information gathering and searching. The first of these is the more computationally intense but has to be performed only once for every database D of interest. It is also the more important, because the success of the subsequent homology searches over D depend crucially on the quality of the pattern set collected during information gathering. In this section we describe the implementation of both phases. To facilitate the description, we discuss the two phases in reverse order, starting with the search phase.
Searching
During this phase, query proteins Q are provided to the system, and database sequences S D similar to Q are identified and reported back to the user. The searching phase utilizes a set of patterns obtained (as explained later) by mining the input database D. For the purposes of this discussion, it is sufficient to assume that is a set of <L, W> patterns of the form described at the beginning of Section 2. Each pattern P is accompanied by its offset list LD(P) and has support at least Kmin in D. The numbers L, W, and Kmin are parameters of our method, and the way in which they are set is described next.
Pattern matching
Initially, when a query sequence Q is provided to the system, all P that match Q are located. This can be done very rapidly by using a hashing variation of a technique presented in [20]. More specifically, for every position within Q we generate W hash values, one for every substring of length 2, 3, · · · , (W + 1) starting at that position. For every such substring, the corresponding hash value depends only on the first and last characters of the substring as well as on the number of residues between those two characters. Figure 1 provides an example of the process for a given query sequence.
Figure 1
The hash entry corresponding to a particular value h contains all of the offsets p of the query sequence Q such that a substring (of length at most W + 1) starting at p hashes to the value h. For example, Figure 2 shows a snapshot of the hash table generated for a particular query sequence.
Figure 2
To check whether a pattern P matches Q, we use an array of counters C[1..|Q|] of size equal to the length of Q. Initially, every entry of the array is set to 0. Starting at offset 1 in P, we locate all offsets j within P corresponding to a residue, excluding the offset corresponding to the last residue. For every such j, let R be the shortest substring of P starting at j and containing exactly two residues. Let OL denote the list of offsets in Q pointed to by the hash table entry corresponding to R. If OL is not empty, then for every offset p OL the counter C[p j + 1] is incremented by one. If the pattern P contains exactly n residues, then at the end of this process the counter C[i] will have the value (n 1) if and only if Q matches P at offset i. (An advantage of this matching technique is that it typically requires time which is sublinear to the size of the query sequence Q and depends only on the number of residues in the pattern P.)
Chaining and scoring
Once a pattern P is found to match a substring of Q starting at offset i, we must correlate that substring of Q with all of the database regions also matching P. This is easily done by scanning the offset list LD(P), which contains exactly those regions. More specifically, each entry (j, k) LD(P) indicates that the substring starting at offset k of the jth database sequence Sj is an element of G(P). The local similarity between the query sequence Q and the database sequence Sj is then registered as a quadruplet (i, j, k, l), called a segment, which is associated with Sj. The number l = |P| is the length of the local similarity.
Sometimes two distinct patterns P and P' matching both Q and a database sequence Sj correspond to the same local similarity between Q and Sj. An example of such a situation is depicted in Figure 3. In such cases, the individual segments corresponding to the two patterns must be chained into one. In particular, two segments (i, j, k, l) and (i', j, k', l') associated with Sj are designated as compatible iff k k', k + l + w_len > k', and k' k = i' i, where w_len is an integer parameter (defined by the user) that allows for the chaining of segments which do not intersect as long as one begins no more than w_len positions after the end of the other. The segment resulting from chaining (i, j, k, l) and (i', j, k', l') together is [i, j, k, max(l, k' k + l' )].
Figure 3
Chaining of compatible segments takes place every time a new segment becomes associated with a database sequence Sj, as the result of locating a pattern P matched by both Q and Sj. If there are segments already associated with Sj which are compatible with the newly arriving segment, the relevant pair comprising the new segment and the existing segment is discarded and replaced by the segment resulting from their chaining.
Having identified all of the local similarities between Q and the database sequences, we are left with the task of evaluating these similarities. This is done by assigning a score (using a user-defined scoring matrix) to every database sequence Sj that is associated with at least one segment. Several options are available for the scoring function. One approach is to score each segment of Sj individually and assign to Sj the highest of these scores. Scoring a segment (i, j, k, l) can usually be done as follows:
-
No gaps allowed: In this case the score is computed from the ungapped alignment implied by the segment, namely the alignment of the regions Q[i, i + l 1] of the query and Sj[k, k + l 1] of the sequence. Furthermore, the user is given the option to extend the alignment around the segment by setting the variable extend; if this value is greater than 0, the score is computed from the ungapped alignment of the regions Q[i extend, i + l 1 + extend] and Sj[k extend, k + l 1 + extend].
-
Gaps allowed: This option is available only when extend > 0; it permits a finer scoring of the area around the segment by allowing for gaps in that area.
Other scoring options are also possible, taking into account the relative order of the segments associated with the database sequence Sj currently being scored. One possibility, after scoring each segment individually as described above, is to construct a directed, weighted graph. The vertices of this graph are the segments associated with Sj, and there is a directed line between the segments (i, j, k, l) and (i', j, k', l') if i i' and k k'. Every vertex is assigned a weight equal to the score of the corresponding segment, while every edge is weighted on the basis of a) the proximity of the two segments [i.e., the value of (i' i l)] and b) the regularity of the displacement between the two segments [i.e., how different (i' i) is from (k' k)]. The score of a path within this graph is the sum of the weights of all of the vertices and edges of the path. The path with the maximal score is then computed, and that score is assigned to Sj.
Information gathering
During this phase TEIRESIAS is employed in order to construct the set of all of the significant <L, W> patterns found in the database D under consideration. This is, in essence, a data-mining step in which D is exploited with the intention of discovering hidden relationships among the sequences of D. The approach involves focusing on those relationships which are unexpected and by virtue of that quality are also (it is hoped) of biological relevance. For our purposes, the significance of a pattern is described by its support within D. More specifically, we seek to define a number Kmin (the minimum support) such that every pattern with support at least Kmin can be shown to be statistically important. All such patterns (along with a few exceptions which do not abide by the minimum support requirement) are included in the set , the input to the search phase.
As is perhaps evident by now, the quality of the results obtained in the search phase of our method depends crucially on the outcome of the information-gathering phase, which, in turn, depends on the selection of the parameters L, W, and Kmin. Setting the values of these parameters involves the consideration of a number of sometimes conflicting and interconnected factors. The ratio L/W, for example, describes the amount of homology allowed, during the search phase, between a query sequence and the proteins in D. A small value of L/W will permit the detection of weak similarities. Since several value pairs (L, W) lead to the same ratio L/W, what should the exact settings for L and W be? Opting for a large value of L usually results in a long running time for the information-gathering phase (unless L/W is close to 1). Furthermore, selecting a large L ignores weak patterns with only a few amino acids, which are among the ones that are of interest (i.e., are usually missed by existing similarity-searching tools).
To address these points in the context of our test database, SwissProt (Release 34), we first cleaned up the database (this process is necessary for the removal of highly homologous sequences; details are given later in this section) and then applied the following procedure (from now on the cleaned-up version of SwissProt will be referred to as CleanSP). First we computed the support within CleanSP of every <L, W> pattern with exactly L residues (for the values of L, W shown in Figure 4). The results were tabulated by creating one row for each possible backbone: The ith column of the row corresponding to a given backbone B indicates the number of patterns (of that backbone structure) with support i within CleanSP. Then random distributions were obtained by following exactly the same approach for 2000 randomly shuffled versions of CleanSP (each such version was obtained by randomly permuting the amino acids within every sequence of CleanSP). In this case, the row for a given backbone B was obtained by averaging the rows corresponding to B in all 2000 tables. As a result, the ith column gives a relatively accurate estimate of the mean number of patterns with backbone B that appear in exactly i sequences within a random database having the sequence/residue composition of CleanSP. Figure 4 shows CleanSP results of selected backbones against the distribution of the means for the same backbones. Although the results presented involve particular backbones, there is no qualitative change if other backbones are used.
Figure 4
As Figure 4 implies, we begin to distinguish the compositional bias (in terms of patterns) in CleanSP versus a random database only when the value of L becomes 5 or larger. In general, the value of L depends on the size of the underlying database D: The larger the database, the higher this value should be. The results presented below for SwissProt were obtained using the value L = 6. For W we chose the value 15, so that the ratio L/W (corresponding roughly to the minimum allowed homology) is 40%.
Having set the values of L and W, we must next decide what the minimum support Kmin should be. We focus only on patterns with exactly L residues, since every larger pattern contains at least one subpattern with exactly that many amino acids. One approach is to select Kmin so that the probability of a pattern appearing in Kmin or more distinct sequences is small. Similar significance criteria have previously been proposed and used [2224]. A closer look at Figure 4(d), however, reveals that this approach might be too strict. In particular, consider a support level of K = 15. The random distribution indicates that one expects, by chance alone, one to two patterns with support 15. So, according to the aforementioned criterion, a pattern with support 15 within SwissProt would be deemed not important. However, the two distributions have a striking difference at that support level. In particular, while the mean of the random distribution at K = 15 has a value of about 1.5, within the SwissProt database there are about 180 patterns with support 15.
Thus, it seems that if one considers the probability of a pattern in isolation, the result will be to discard many patterns which, according to the above distribution, are above the noise level. This observation prompts us to use a different criterion for significance. In our approach, instead of looking at individual patterns, we consider all of the patterns of a particular backbone structure. More specifically, for any given backbone B and an underlying database D, let NB,K be the number of patterns with backbone B which have support K within D, and let XB,K be the random variable (defined over the space of all shuffled versions of D) corresponding to NB,K. The minimum support Kmin is then the smallest number K for which the following inequality is true:
|
Max
|
{Pr[XB,K NB,K]} threshold,
|
(1)
|
|
B
|
where threshold is a user-defined probability imposing a level of confidence on the minimum support Kmin coming out of the above inequality. A smaller threshold leads to a larger value for Kmin and to a greater statistical importance for the patterns that will eventually be selected.
Cleaning up the database
Several databases contain groups of highly homologous sequences (e.g., the hemoglobin alpha-chain proteins). Such groups not only slow down the pattern-discovery process by introducing a huge number of patterns [18], but can also spuriously elevate the significance of a pattern; this occurs for patterns that appear many times within a family of highly homologous sequences and only occasionally outside it.
To address this situation, a database D must be cleaned up before the pattern-discovery process begins. The cleaning-up process involves identifying and grouping together highly similar protein sequences. Two sequences are placed in the same group if, after being optimally aligned, the shorter one has X% of its positions (for the results in this work we used X = 50) identical to that of the longer sequence. The resulting groups are designated as redundant groups. The set D' on which the information-gathering process will operate comprises a) those sequences in D which were not found to be sufficiently homologous to other proteins and b) the longest sequence from each of the redundant groups. Finally, each of the redundant groups is separately processed by TEIRESIAS, collecting patterns until all of the sequences of the group match at least one of these patterns. This approach guarantees that even groups representing multidomain proteins are treated correctly, by generating at least one pattern per domain.
It is worth pointing out that patterns resulting from the processing of the redundant groups are usually quite dense (the number of residues is much larger than the number of don't-care characters) and long. This is a consequence of the high homology of the group sequences. For such patterns, we allow approximate matches during the search phase.
Low-complexity regions
It is well known [25] that many proteins contain regions of low complexity, characterized by tandem repeats and/or overrepresentation of particular amino acids. The existence of such proteins within a database creates a number of problems during the process of similarity searching, because they can give rise to local homologies which, although statistically important, can be attributed to the compositional bias of the query and/or the database sequences.
Several methodologies have been proposed for identifying low-complexity regions [26, 27]. Either of them can be used with our approach as well. One should be aware, however, that an extensive masking of such regions can sometimes lead to the loss of useful information. Such a situation is exemplified by the pattern ALNAA..AA..A. Although it is of low complexity, this pattern forms a highly specific signature for proteins involved in chemotaxis.
Because of examples such as this, we have adopted in this work a rather moderate approach in masking low-complexity regions. At the level of information gathering, this approach consists of a) not considering streaks of the same amino acid of length L or more and b) disregarding, when computing the offset list of a pattern, matches due to overlapping substrings of a sequence. During the search phase, two mechanisms are available for prohibiting the association of two sequences along a potentially low-complexity region. The first permits the use of only a linguistically rich subset of the patterns discovered during information gathering. In particular, for each pattern P, we define its variability v(P) as
|
v(P) =
|
maxR {number
of times that the residue R appears in P}
|
|
|
total number of positions in P covered by residues
|
and allow the user access to a global parameter V which dictates that a pattern P is employed in the search phase only if v(P) < V.
The second mechanism allows the disregarding of local similarities of low informational content. Assuming that a pattern P matches a sequence (either the query or a database sequence) at offset j, the match is considered one of low complexity if there are X offsets left and right of j approximately matching P. The number X as well as the exact size of an approximate match are parameters that can be set by the user.
4. Results
In this section we discuss the results of the proposed methodology when applied to our test database, SwissProt (Release 34). Next, we present a quantitative and qualitative description of the patterns discovered in the information-gathering phase. More specifically, we analyze the coverage that these patterns achieve for the SwissProt database, and we annotate the patterns that occur most frequently. Subsequently, we present the results of the search phase for a number of query sequences.
Information gathering
The treatment of the SwissProt database begins by cleaning it up as described in the previous section. The results of this process are detailed in Table 1.
|
|
Table 1
|
The cleanup process on SwissProt (Release 34) generates 9165 redundant groups of highly similar sequences. The cleaned-up database (the one on which the information- gathering phase operates) is formed by removing the highly similar sequences from the original input and then augmenting the resulting set by adding to it the longest sequence from each redundant group. From [21], with permission.
|
| |
| |
| |
| |
| |
|
No. of sequences/ No. of amino acids in original database
|
No. of highly similar sequences/groups
|
No. of sequences/ No. of amino acids in cleaned-up database
|
|
| |
59,021/21,210,388
|
40,407/9,165
|
27,779/10,596,414 |
|
When the cleaned-up database is available, all that is required for TEIRESIAS to operate on it is to set the values of the parameters L, W, and Kmin. As already explained, we use the settings L = 6 and W = 15. In order to decide the value of Kmin, a threshold must be chosen and then used to solve inequality (1), a process we describe in the next few paragraphs. Since we do not have an analytical description for the distribution of the random variables XB,K, we resort to standard sampling techniques.
Using the experiments described in Section 3 with the shuffled version of CleanSP, it is possible to compute quite accurate point estimates for both the mean and the deviation of XB,K. More specifically, for any B and K, let mB,K and sB,K denote the sample mean and the sample deviation of the random variable XB,K; the values of mB,K and sB,K are computed from the 2000 experiments performed on the shuffled versions of CleanSP. Assuming that µB,K and B,K are the actual mean and deviation of XB,K and using elementary statistics, we can deduce that, with probability at least 0.95 (in the relations below, n stands for the number of trials, i.e., n = 2000),
consequently, with probability no less than (0.95)2,
µB,K mB,K + 1.96
| sB,K
|
.
|
|
n
|
|
1 +
|
1.96
|
|
|
|
|
(2n)
|
Notice that there is no particular reason why we selected a confidence level of 95%. Any other level would be applicable (although more than n = 2000 samples would be required in order to achieve a higher confidence level).
The above inequalities for µB,K and B,K can be used in conjunction with Chebyshev's inequality in order to provide upper bounds for the probabilities Pr[XB,K NB,K] used in inequality (1). In particular, given any value NB,K, we define the number C as
|
NB,K =
|
|
mB,K + 1.96
|
sB,K
|
|
+ C
|
sB,K
|
.
|
|
|
n
|
|
1 +
|
1.96
|
|
1 +
|
1.96
|
| |
| |
|
(2n)
|
(2n)
|
The above equation can be solved directly for C. We can then use the value 1/C2 as an upper bound for the probability Pr[XB,K NB,K]:
Pr[XB,K NB,K] = Pr
|
|
XB,K
|
|
mB,K + 1.96
|
sB,K
|
|
+ C
|
sB,K
|
|
|
|
n
|
|
1 +
|
1.96
|
|
1 +
|
1.96
|
| | |
| |
|
(2n)
|
(2n)
|
|
Pr[XB,K µB,K + C B,K]
|
1/C2.
|
Subsequently, solving inequality (1) is a straightforward task: We just compute the appropriate constant C for every value NB,K of interest and then plug the values 1/C2 into inequality (1). Doing that for CleanSP and using a threshold value of 1011, we obtain Kmin = 15. Examining the results of TEIRESIAS, we find that the number of <6, 15> patterns with support 15 or higher is 534185.
Actually, in selecting the particular value of threshold to use, we did a little reverse engineering: The value of 1011 is chosen so that, for each backbone B, no more than 1.5 B-patterns were expected by chance. There is a tradeoff at play here: We are willing to allow a small number of pattern-induced local homologies which can be the result of chance (the 1.5 patterns above) in order to be able to capture the many more statistically important similarities implied by the other patterns at that same support level present within SwissProt.
Mining the cleaned-up database is only the first step of the information-gathering phase. It is also necessary to apply the pattern-discovery process on the 9165 redundant groups. Again we use TEIRESIAS to treat each such group, collecting enough <6, 15> patterns to make sure that each sequence in the group is matched by at least one pattern. These patterns are then added to the 534185 patterns obtained from CleanSP in order to form the final set of patterns which will be used by the search phase. Table 2 provides information regarding the coverage achieved by these patterns over the entire SwissProt Release 34, while Figure 5 shows distributions for characteristics of the patterns.
Figure 5
|
|
Table 2
|
Coverage of the entire SwissProt (Release 34) database by the patterns generated in the information-gathering phase. The database regions covered by a pattern are exactly those substrings matching the pattern. Notice that for dense and long patterns (arising mostly from the processing of the redundant groups) we have allowed for approximate matches, in which most of the pattern (specifically, 80% of the pattern's residues) matches a region. It is worth pointing out that most of the uncovered sequences are fragments. More specifically, only 231 have size greater than 50.
|
| |
| |
| |
| |
| |
|
Total no. of patterns
|
No. of proteins covered
|
No. of amino acids covered
|
Average percentage of protein length covered
|
|
| |
565,432
|
57,983
|
12,567,345
|
58.1% |
|
As exemplified by Table 2, one of the key goals for the success of the search phase to follow (good coverage of SwissProt) has been achieved. The question that remains is whether the patterns discovered are of biological relevance. In an effort to address this concern, we have analyzed the most frequently occurring among these patterns. The results obtained for the 100 patterns with the highest support are summarized in Table 3. It is evident from the table (at least for the patterns which were examined) that the pattern-discovery process identifies sequence features that are biologically important.
|
|
Table 3
|
The 100 patterns with the highest support. Wherever possible, the patterns within a category were aligned with respect to one another. The lowercase italics were used for convenience and are placeholders for the following bracketed expressions: a: [STGDAR], b: [STGDK], c: [STGDKY], d: [STGK], e: [GASMDL], f: [GISETV], g: [LIVMFY], h: [LIVMF], i: [LIVMA], j: [LIVMC], k: [LIVMF], l: [ILVMF], m: [QKCS], n: [KRQA], o: [IVTNF], p: [QKCASN], q: [QKIAGN], r: [RKAHQN], s: [KRQNE], t: [KRQMN], u: [LFYIMS], and v: [AGSPE]. A bracket indicates a position that can be occupied by any one of the residues in the bracket.
|
| |
| |
| |
| |
| |
|
ATP/GTP-binding P-loop variations
|
|
G..G.GKST L.G....GKST I.G....GKS.L
G..G.GKaTL G...SGKST G....GKT.LL
G..G.GKTT G....GKdTLL L.G..G.G.T.L
G..G.GKS.L G.GKTT....L G....GKTT.L
G..G.GKT.L G..G.GKT..A GP...GKT.L
L.G..G.GKbT G..GSGKT G.SG.G.ST
L.G..G.GKT V...G..G.GKT G..G.G.S.LL
L.G..G.GK..L G....GKTTL L.G....GKS.L
G..GSGKS I.G..G.GKT G.GKT..A...A
GP.G.GKT G....GKST.L I.G.SG.GK
G..GSGKcT I...G..G.GKT G.GKT.....LA
G....GKSTL G.GKTTL L.eP.G.GKT
L.G..G.GKS G.GKS.L...L GP...GKT..A
G..G.GKS..L G....GKS.LL V.I.G..G.GK
L..G..G.GKT L.G....GKT.L TGSGKT
G.GKST....L GSGKST I.G....GKTT
G.SG.GKS G..GSG.ST G...SGKS..L
I.G..G.GKS I.G....GKST G.SGSG.S
G.GKSTL V.L.G..G.GK G.fKSTL...L
G.GKST....L L.G....GKTT
|
|
|
Protein kinase active site
|
Collagen, P/G repeats
|
ABC transporter signature
|
|
HRDgK..N.L G.PG..G.PG
HRDhKP.N G..G.PG..G.P
H.DiKP.N.L G.PG..G..G.0
HRDL...N.L G..GP.GP.G
RDjKP.N.L G.PG.PG..G
I.HRDlK..N PG..G.PG..G
HRDkK..NI G..G.PG.PG
I.H.DlK..N.L GP.G..G.PG
HRDLK..N PG.PG..G.P
IHRD....N.L G.PG.PG....P
H.DL.P. PG.PG....PG
G..G..GAPG
|
LSGG...R...A
|
|
|
ATP-binding/downstream box
|
|
L.LDE....LD
L.LDE.T..L
|
|
|
Protein kinases
|
|
GT..Y.APE
|
|
Nuclear hormone receptors
|
| |
|
|
Zinc finger, transcription
|
|
|
Homeobox DNA-binding motif
|
CRu.KC...GM
C.vC..FFRR
|
H.R.H.GEsP
H...HTGEtP
H.R.HT.E.P
R.H.G.KP..C
|
|
WFQN.R.K
WFmNRR.K
WFQNRR
WFQN.R.n.K
|
|
|
ATP-binding protein kinase
|
V.oWFpNRR
QV.oWFqNR
|
|
|
LG.G.FG.V
LSGG.....A.A
|
 |
 |
 |
|
It should be noted that not all of the discovered patterns exhibit such clear-cut functional specificity. Several of them correspond to regions (e.g., loops, coiled coils, transmembrane) which are traditionally considered uninteresting, at least for the purposes of functionally annotating a protein. Sometimes, though, even such weak similarities can provide useful hints for the characterization of protein regions. We have implemented two mechanisms that allow the exploitation of this potential. First, the user is provided with the list of all of the patterns which match the query sequence. An expert user will, in most cases, be able to identify which patterns are of biological importance. Selection of a particular pattern leads to a refinement of the scoring, focusing only on the areas of the database covered by this pattern. Second, when the underlying database includes annotations of the various database sequence regions, this annotation is used in conjunction with the patterns for the extraction of useful information. Examples of the use of these two mechanisms are given in the next subsection.
Searching
To showcase the searching phase (and to explain how it should be used) we have selected two query sequences. The first is a well-studied and annotated core histone 3 protein (SwissProt ID: H31_HUMAN), while the second is a not-yet-characterized ORF (SwissProt ID: YZ28_METJA) from Methanococcus jannaschii.
H31_HUMAN
Core histones have been the object of extensive study because of their central role in the packaging of DNA within a cell. These small proteins are rich in positively charged amino acids that help them bind to the negatively charged DNA double helix [28]. The four core histones (H2A, H2B, H3, and H4) bind together into an octameric construct (reminiscent of a cylindrical wedge) that provides the substrate for 146-bps-long DNA segments to wrap around, thus creating the nucleosome complexes within the cell chromatin.
The SwissProt database (Release 34) contains 33 sequences which are annotated as histones 3. The protein H31_HUMAN (the core histone 3 protein found in humans) is one of them. The top-scoring results of searching this sequence with our homology-detection tool are listed in Table 4. The scores shown were obtained using the PAM 130 matrix [10], and every matching sequence from the database is assigned the score of its highest-scoring segment.
|
|
Table 4
|
High-scoring homologies between H31_HUMAN and the SwissProt sequences. Next to each sequence is given the similarity score (using the scoring table PAM 130) of the highest-scoring local alignment between that sequence and H31_HUMAN.
|
| |
| |
| |
| |
| |
| |
H31_HUMAN (302)
|
H3_DROME (302)
|
H32_BOVIN (302)
|
H3_STRPU (300)
|
H3_PSAMI (300)
|
|
H32_XENLA (298)
|
H3_ACRFO (297)
|
H3_CAEEL
(291)
|
H3_VOLCA
(289)
|
H32_MEDSA (288)
|
|
H3_ENCAL (288)
|
H3_CHLRE (287)
|
H31_SCHPO
(286)
|
H3_PEA
(284)
|
H3_MAIZE (284)
|
|
H33_CAEEL (284)
|
H33_HUMAN (284)
|
H33_SCHPO
(277)
|
H31_TETPY
(274)
|
H34_CAIMO (272)
|
|
H3_EMENI (271)
|
H3_NEUCR (271)
|
H3_YEAST
(269)
|
H32_ORYSA
(269)
|
YB21_CAEEL (232)
|
|
H3_HORVU (221)
|
H34_MOUSE (204)
|
H3_ENTHI
(179)
|
H32_TETAM
(177)
|
H32_TETPY (176)
|
|
H33_TETTH (168)
|
H32_TETBO (153)
|
H3_LEIIN
(110)
|
CENA_HUMAN
(100)
|
CSE4_YEAST (96)
|
|
YL82_CAEEL (86)
|
CENA_BOVIN (84)
|
YMH3_CAEEL
(79)
|
H3_NARPS
(64)
|
 |
 |
 |
 |
 |
 |
|
All of the 33 core histones 3 of SwissProt Release 34 were correctly identified as homologous to H31_HUMAN. Furthermore, several other proteins (YB21_CAEEL, CENA_HUMAN, CSE4_YEAST, YL82_CAEEL, CENA_BOVIN, YMH3_CAEEL) were found to have extensive local similarities to H31_HUMAN. Inspection of the annotation for these proteins indicated that they are known histone-3-like proteins. As a final note, H3_NARPS (a known histone 3) appears within Release 34 of the SwissProt database only as a fragment and for that reason has been scored lowest in the list of results. Figure 6 shows a selected view (both high- and low-scoring) of the alignments generated for the query sequence H31_HUMAN.
Figure 6
YZ28_METJA
H31_HUMAN is in a sense an easy test case because its database contains several sequences which are highly homologous to it. An interesting question to ask is how our methodology fares when presented with borderline sequences, i.e., sequences for which no known homology exists. In an effort to address this question, the system was presented with the not-yet-annotated sequence YZ28_METJA, an open reading frame with 1272 residues from the genome of M. jannaschii.
The top-scoring alignments produced by our system when presented with this query sequence are depicted in Figure 7. For the purposes of functional annotation of YZ28_METJA, the results are not very enlightening because the database hits include quite diverse protein sequences: the first two (NTNO_HUMAN, NTNO_BOVIN) correspond to sodium-dependent noradrenaline transporters, while the last one (KAPL_APLCA) corresponds to a kinase.
Figure 7
With these questions in mind, we proceeded to a closer examination of the similarities between YZ28_METJA and the database sequences. For this analysis, every pattern matching YZ28_METJA was scrutinized individually. As mentioned at the end of the preceding subsection, the search phase allows the user to select any of the patterns matching the query sequence at hand and focus on the local alignments induced by that particular pattern, disregarding all of the other patterns. This feature was employed for each of the patterns matched by YZ28_METJA. The intention was to discover whether any such pattern was specific to one particular protein family, thus giving clues about the functionality of YZ28_METJA.
As it turned out, there exist three patterns (Y..S..I...DLK, NIL......IKL, and I.H.DLK......D) which are very specific for the kinase family. Figure 8 describes several of the top-scoring alignments produced for the first pattern, while Table 5 contains a complete listing of all of the SwissProt database sequences containing that particular pattern. Tables 6 and 7 contain the corresponding listings for the remaining two patterns. Figure 9(a) depicts the distribution of all of the patterns matched by YZ28_METJA; Figure 9(b) shows the regions covered by the three kinase-specific patterns.
Figure 8
Figure 9
|
|
Table 5
|
SwissProt (Release 34) sequences containing the pattern Y..S..I...DLK. All of them are annotated as protein kinases or probable/putative protein kinases (almost exclusively of the serine/threonine variety). The only exception is the protein NABA_RAT, which is annotated as a sodium/bile acid cotransporter.
|
| |
| |
| |
| |
| |
| |
MP38_MOUSE
|
MKK2_DROME
|
MP38_XENLA
|
KRAF_CAEEL
|
DAPK_HUMAN
|
|
PKX1_HUMAN
|
KAPC_YEAST
|
KAPA_YEAST
|
ASK2_ARATH
|
KCC1_YEAST
|
|
CC28_YEAST
|
KD82_SCHPO
|
SPK1_YEAST
|
SGK_RAT
|
GCN2_YEAST
|
|
FUSE_DROME
|
NABA_RAT
|
KAPC_ASCSU
|
KKK1_YEAST
|
KGPA_BOVIN
|
|
KGPB_HUMAN
|
KGP3_DROME
|
KGP2_DROME
|
KDC2_DROME |
|
|
|
Table 6
|
SwissProt (Release 34) sequences containing the pattern NIL......IKL. All (except those from the nonannotated YFH8_YEAST and YL11_MYCHO) are protein kinases (known or probable). Again, serine/threonine kinases are the majority.
|
| |
| |
| |
| |
| |
| |
CC7_SCHPO
|
CDK2_ENTHI
|
CDK6_HUMAN
|
IPL1_YEAST
|
JKK1_HUMAN
|
|
JKK1_MOUSE
|
KG1Z_YEAST
|
KKIA_HUMAN
|
KNQ1_YEAST
|
KPBG_MOUSE
|
|
KPBG_RABIT
|
KPBG_RAT
|
KS61_MOUSE
|
KS62_HUMAN
|
KS62_MOUSE
|
|
KS6A_CHICK
|
KS6A_XENLA
|
KS6B_XENLA
|
MKK2_YEAST
|
MPK1_HUMAN
|
|
MPK1_MOUSE
|
MPK1_RABIT
|
MPK1_RAT
|
MPK1_XENLA
|
MPK2_HUMAN
|
|
MPK2_RAT
|
MPK2_XENLA
|
PAK1_SCHPO
|
PK3_DICDI
|
PKD1_DICDI
|
|
PKX1_HUMAN
|
ST20_YEAST
|
YFH8_YEAST
|
YL11_MYCHO |
|
|
|
Table 7
|
SwissProt (Release 34) sequences containing the pattern I.H.DLK......D. All of these sequences are known or probable protein kinases.
|
| |
| |
| |
| |
| |
| |
ASK1_ARATH
|
ASK2_ARATH
|
CC2C_DROME
|
CC2_DICDI
|
CC2_SCHPO
|
|
CDK7_CARAU
|
CDK7_HUMAN
|
CDK7_MOUSE
|
CDK7_RAT
|
CDK7_XENLA
|
|
CTR1_ARATH
|
FUSE_DROME
|
GCN2_YEAST
|
KCC4_YEAST
|
KD82_SCHPO
|
|
KEMK_MOUSE
|
KFD3_YEAST
|
KKK1_YEAST
|
KP78_HUMAN
|
KPBG_MOUSE
|
|
KPBG_RABIT
|
KPBG_RAT
|
KPBH_HUMAN
|
KPBH_RAT
|
KPK1_ARATH
|
|
KPSC_HUMAN
|
SNF1_CANAL
|
SNF1_YEAST
|
SRK6_BRAOL
|
YNA3_CAEEL |
|
The pattern Y..S..I...DLK generates 24 hits within the SwissProt database. All of these proteins (with the exception of NABA_RAT, a sodium/bile acid cotransporter) are annotated as protein kinases (two of them, KD82_SCHPO and KKK1_YEAST, are characterized as putative/probable kinases), with the majority belonging or showing similarity to the serine/threonine kinase family. Furthermore, Y..S..I...DLK not only belongs to the kinase domain of these proteins but actually contains the active site (amino acid D) of that domain.
Similar results (Table 6) are obtained for NIL......IKL, the second of the three patterns. In this case there are 34 database hits, and all of them (excluding two unannotated ORFs from Yeast and Mycoplasma hominis) are known (or probable) protein kinases.
Finally, the third pattern, I.H.DLK......D, generates 30 hits, all of them known or putative protein kinases. Furthermore, as in the case of the first of the three patterns, the pattern I.H.DLK......D includes the active site of the kinase domain.
It is interesting to note that all three of these patterns are specific instances of (parts of) the general pattern
[LIVMFYC].[HY].D[LIVMFY]K..N[LIVMFYCT][LIVMFYCT][LIVMFYCT],
where the notation [XYZ] indicates a position which can be occupied by any of the residues X, Y, Z. This more general pattern is the PROSITE database [14] entry with accession number PS00108, namely the signature of the serine/threonine protein kinase active site. Note that this PROSITE signature is too specific for detecting a kinase catalytic site in the areas of YZ28_METJA covered by the three patterns examined above. This situation (known in the language of artificial intelligence as overrepresentation of the training set) is typical of learning systems trained by a finite number of positive examples: There is always the danger that the set of positive examples (in this case, the specific set of known serine/threonine kinases used by PROSITE) is biased. Consequently, the features learned (here, the kinase signature), while explaining the observations, are not general enough to extrapolate efficiently to new instances of the family under consideration (i.e., there are false negatives). The cure for this difficulty is to use as large a training set as possible; this is the crux of the approach we propose here. Of course, it is also possible that our methodology creates the reverse problem, that of underrepresentation, in which case the patterns discovered explain too liberally the observed data and introduce false positives. However, until a definite answer is obtained from the appropriate laboratory tests, we offer the statistical significance and the functional specificity of the patterns examined above as indications that they do correctly model kinase activity.
Using existing annotation
Of the 410 patterns matched by YZ28_METJA, only the three patterns analyzed above exhibit such clear-cut functional specificity. This does not mean that the remaining 407 are useless. As discussed in the preceding subsection, the kind of biological inference that can be drawn from a local similarity between two sequences is not always of a functional nature. Sometimes the homology indicates preservation of structure, and at other times it might correspond to functional units of a supporting role (e.g., DNA-binding domains) in the overall function of the sequences compared. In an effort to explore such weaker similarities, we have used the method described next to exploit the annotation available in the underlying database. We use the SwissProt annotation format.
The SwissProt database associates with most of its sequences annotations of sequence regions (the FT lines [19]). In a typical region description,
FT DOMAIN 528 779 PROTEIN KINASE
the keyword FT denotes it as a region description line, and the remaining characters describe the region by giving its starting and ending positions (from residue 528 up to and including residue 779 of the relevant database sequence) and its annotation (a protein kinase domain).
When presented with a pattern P, we can use the offset list LD(P) to locate all of the sequences in the database that match P. Assuming that S is such a sequence and that P matches the substring of S starting at offset j, if P happens to fall in an annotated region of S (either entirely or in part), we can associate this region with P. Performing this process for every sequence S that P matches gives rise to a set RSD(P) of regions associated with P. Figure 10 provides an example of part of the output produced by our system for one of the three kinase patterns described above.
Figure 10
Now, given a pattern P matching a subsequence A of a query sequence Q, the question is how to use RSD(P) in an effort to characterize A. A number of approaches can be used. For example, if RSD(P) is large enough and the majority of its members agree in their functionality, it can be inferred that A is quite likely to have the same functionality. Another consideration is the relative lengths of the pattern P and the regions described by the FT lines. If, for example, a pattern P has an extent of 15 residues while an annotated sequence region containing P has a length of 300 amino acids, one might not wish to transfer the annotation of that region to P. In conclusion, the end user is expected to apply his/her expertise in deciding how to best exploit the information provided by the system.
Figure 11 shows two ways to use the sets RSD(P) for annotating regions of YZ28_METJA, thus extending part (b) of Figure 9. The first approach, depicted in part (c) of Figure 11, assigns an annotation X (e.g., X = transmembrane region) to a pattern P if 1) the size of RSD(P) is at least 15; 2) the majority (80%) of the regions in RSD(P) are annotated as X; and 3) at least 50% of every region of RSD(P) annotated as X is covered by P. The second approach, depicted in part (d) of Figure 11, shares the first two requirements and relaxes the third by allowing the percentage of the annotated region covered by the pattern to be 30% or more.
Figure 11
Performance
The running time of a homology search for a query sequence Q depends on the size of the set of patterns used and on the actual number of local similarities (induced by the patterns matching Q) between Q and the database sequences. For the case of the SwissProt (Release 34) database used here, typical searches for query proteins about a thousand residues in size take 46 seconds on a Pentium** 266-MHz computer with 256 MB of memory capacity. It should be mentioned that the running time reported above is achieved by keeping all of the program data (patterns and their offset lists) in memory. For the SwissProt database, this data occupies around 200 MB.
However, from preliminary work [29] with the nonredundant protein database from the National Center for Biotechnology Information (NCBI), it seems that although the search-time components mentioned above are related to the size of the underlying database, this dependence is sublinear. In a sense, real protein databases induce a saturation of sorts on the size of after a certain point; introducing new sequences in the database does not result in the generation of many new patterns. As a result, we expect that even as the size of the database used becomes larger, the performance of the search phase (both running time per search and memory used) will increase at a much slower rate.
5. Concluding remarks
We have described a method, DELPHI, for performing sequence similarity searches based on the discovery of patterns over an underlying database D of proteins and the use of these patterns for the identification of homologies between a query sequence and the proteins of the database. The crucial step for the success of the method is the collection of a set of patterns which is characteristic of the database D. We have described a way to define this set precisely using statistical arguments and have discussed how patterns provide more sensitivity in identifying significant homologies by introducing memory into the statistical computations. We have shown how TEIRESIAS, a powerful pattern-discovery algorithm, can be used for obtaining the desired set of patterns. We have described the utility of DELPHI by using the SwissProt (Release 34) database as a test bed and showing how it can be used for annotating query sequences. Finally, in this context we have discussed the potential of exploiting the discovered patterns in conjunction with the annotation of the underlying database toward characterizing even weak similarities between the query and the database sequences.
The DELPHI approach differs from other pattern-based tools for homology detection (e.g., BLOCKS [15]) in the completeness of the set of patterns used. The patterns are learned in an unsupervised way from a very large training set, that of all of the proteins within the underlying database D. There are no bias-creating prior assumptions on which sequences should be considered as members of the same family. As a result, the patterns discovered are expected to be more sensitive. Furthermore, by considering together sequences of distinct functionalities, the method facilitates the discovery of weak similarities that span family boundaries (e.g., patterns that describe transmembrane regions). Such similarities, although not sufficient for the inference of functional annotations, nevertheless provide useful information regarding the role of different parts of the query sequence under examination.
Another advantage of the DELPHI approach pertains to the running times for homology searches. The speedup afforded by using patterns rather than scanning the entire database for every search can become a factor as the size of genomic databases increases ever more rapidly (especially for users who wish to run in-house tests rather than use public servers).
On the downside, the use of patterns may not make it possible to find all existing local homologies, as can be done with methods such as BLAST or FASTA. If there exists within the database a sequence S which is a singleton, in the sense that it has homologous regions with none or only a few other database proteins, a query sequence Q homologous to S will go uncharacterized because there will be no patterns to associate it with S. It is expected, however, that such situations will become increasingly rare as the size of the database used becomes larger.
We intend to extend our work to a large-scale validation of the proposed methodology. To achieve that, we plan to use as queries members of a sizable and well-annotated sequence database (e.g., Pfam [30], SCOP [31]). For each query we plan to quantify the success of the annotation produced by DELPHI. Methodologies for doing so have already been proposed ([32]). Aggregating the results of this annotation experiment should provide us with a more precise idea about the overall effectiveness of DELPHI. We have already begun that work and expect to report our results in a forthcoming publication.
**Trademark or registered trademark of Intel Corporation.
Footnotes
1 A more detailed discussion of these parameters as well as the TEIRESIAS algorithm can be found in [17, 18].
Received July 31, 2000; accepted for publication October 27, 2000
|