Volume 40, Number 2, 2001
Deep computing for the life sciences
New techniques for extracting features from protein sequences - References

by J. T. L. Wang, Q. Ma, D. Shasha, and C. H. Wu
 |
 |
 |
Cited references and notes
-
C. F. Allex, J. W. Shavlik, and F. R. Blattner, Neural Network Input Representations that Produce Accurate Consensus Sequences from DNA Fragment Assemblies, Bioinformatics 15, No. 9, 723-728 (1999).
-
T. L. Bailey and W. N. Grundy, Classifying Proteins by Family Using the Product of Correlated p-values, Proceedings of the Third Annual International Conference on Computational Molecular Biology, Lyon, France (April 11-14, 1999), pp. 10-14.
-
M. W. Craven and J. W. Shavlik, Machine Learning Approaches to Gene Recognition, IEEE Expert 9, No. 2, 2-10 (1994).
-
S. Eddy, Profile Hidden Markov Models, Bioinformatics 14, No. 9, 755-763 (1999).
-
W. N. Grundy and T. L. Bailey, Family Pairwise Search with Embedded Motif Models, Bioinformatics 15, No. 6, 463-470 (1999).
-
J. T. L. Wang, S. Rozen, B. A. Shapiro, D. Shasha, Z. Wang, and M. Yin, New Techniques for DNA Sequence Classification, Journal of Computational Biology 6, No. 2, 209-218 (1999).
-
J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and C. Chothia, Sequence Comparisons Using Multiple Sequences Detect Three Times as Many Remote Homologues as Pairwise Methods, Journal of Molecular Biology 284, No. 4, 1201-1210 (1998).
-
Pattern Discovery in Biomolecular Data: Tools, Techniques and Applications, J. T. L. Wang, B. A. Shapiro, and D. Shasha, Editors, Oxford University Press, New York (1999).
-
C. H. Wu and J. McLarty, Neural Networks and Genome Informatics, Elsevier Science, New York (2000).
-
S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs, Nucleic Acids Research 25, No. 17, 3389-3402 (1997).
-
A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, Hidden Markov Models in Computational Biology: Applications to Protein Modeling, Journal of Molecular Biology 235, No. 5, 1501-1531 (1994).
-
R. Hughey and A. Krogh, Hidden Markov Models for Sequence Analysis: Extension and Analysis of the Basic Method, Computer Applications in the Biosciences 12, No. 2, 95-107 (1996).
-
S. Eddy, G. Mitchison, and R. Durbin, Maximum Discrimination Hidden Markov Models of Sequence Consensus, Journal of Computational Biology 2, 9-23 (1995).
-
L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE 77, No. 2, 257-286 (1989).
-
E. L. Sonnhammer, S. R. Eddy, and R. Durbin, PFAM: A Comprehensive Database of Protein Domain Families Based on Seed Alignments, Proteins 28, No. 3, 405-420 (1997).
-
K. Karplus, C. Barrett, and R. Hughey, Hidden Markov Models for Detecting Remote Protein Homologies, Bioinformatics 14, No. 10, 846-856 (1998).
-
R. Karchin and R. Hughey, Weighting Hidden Markov Models for Maximum Discrimination, Bioinformatics 14, No. 9, 772-782 (1998).
-
J. T. L. Wang, Q. Ma, D. Shasha, and C. H. Wu, Application of Neural Networks to Biological Data Mining: A Case Study in Protein Sequence Classification, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA (August 20-23, 2000), pp. 305-309.
-
H. Hirsh and M. Noordewier, Using Background Knowledge to Improve Inductive Learning of DNA Sequences, Proceedings of the Tenth Annual Conference on Artificial Intelligence for Applications, San Antonio, TX (March 1-4, 1994), pp. 351-357.
-
D. J. C. MacKay, The Evidence Framework Applied to Classification Networks, Neural Computation 4, No. 5, 698-714 (1992).
-
C. H. Wu, G. Whitson, J. McLarty, A. Ermongkonchai, and T. C. Chang, Protein Classification Artificial Neural System, Protein Science 1, No. 5, 667-677 (1992).
-
The total number of possible patterns from 2-gram encoding is n^2, where n is the number of different letters in the protein alphabet, namely 20 (so n^2 = 400).
-
C. H. Wu, M. Berry, Y. S. Fung, and J. McLarty, Neural Networks for Full-Scale Protein Sequence Classification: Sequence Encoding with Singular Value Decomposition, Machine Learning 21, 177-193 (1995).
-
M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, A Model of Evolutionary Change in Proteins, Atlas of Protein Sequence and Structure 15, Supplement 3, 345-358 (1978).
-
Both PAM (accepted point mutation) and BLOSUM (block substitution matrix) are amino acid substitution matrices. BLOSUM is derived from the BLOCKS database. For BLOSUM, see S. Henikoff and J. G. Henikoff, Amino Acid Substitution Matrices from Protein Blocks, Proceedings of the National Academy of Sciences 89, 10915-10919 (1992). For BLOCKS, see S. Henikoff and J. G. Henikoff, Automated Assembly of Protein Blocks for Database Searching, Nucleic Acids Research 19, 6565-6572 (1991).
-
N. A. Chuzhanova, A. J. Jones, and S. Margetts, Feature Selection for Genetic Sequence Classification, Bioinformatics 14, No. 2, 139-143 (1998).
-
The term distance is used by M. Dash and H. Liu, Feature Selection for Classification, Intelligent Data Analysis 1, No. 3 (1997). The electronic journal is available at http://www-east.elsevier.com/ida/. Both this and the next reference address feature selection for classification.
-
M. Ben-Bassat, Use of Distance Measures, Information Measures and Error Bounds in Feature Evaluation, Classification, Pattern Recognition and Reduction of Dimensionality: Handbook of Statistics, Volume 2, P. R. Krishnaiah and L. N. Kanal, Editors, North-Holland Publishing Company, Amsterdam (1982), pp. 773-791.
-
V. V. Solovyev and K. S. Makarova, A Novel Method of Protein Sequence Classification Based on Oligopeptide Frequency Analysis and Its Application to Search for Functional Sites and to Domain Localization, Computer Applications in the Biosciences 9, No. 1, 17-24 (1993).
-
Our experimental results show that choosing Ng
30 can yield reasonably good performance provided one has sufficient (e.g., > 200) training sequences. We have also experimented with different combinations of 2-grams, e.g., using the top Ng features together with the bottom Ng features with the smallest D(X) values. The results are worse than using the top Ng features alone.
-
J. T. L. Wang, G. W. Chirn, T. G. Marr, B. A. Shapiro, D. Shasha, and K. Zhang, Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results, Proceedings of the ACM SIGMOD International Conference on Management of Data, Minneapolis, MN (May 24-27, 1994), pp. 115-125.
-
J. T. L. Wang, T. G. Marr, D. Shasha, B. A. Shapiro, G. W. Chirn, and T. Y. Lee, Complementary Classification Approaches for Protein Sequences, Protein Engineering 9, No. 5, 381-386 (1996).
-
L. C. K. Hui, Color Set Size Problem with Applications to String Matching, Combinatorial Pattern Matching, Lecture Notes in Computer Science, Volume 644, A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, Editors, Springer-Verlag (1992), pp. 230-243.
-
O is order of magnitude; O(n) as used here means that time (and space) increase linearly as n increases.
-
S. Wu and U. Manber, Fast Text Searching Allowing Errors, Communications of the ACM 35, No. 10, 83-91 (1992).
-
A. Brazma, I. Jonassen, E. Ukkonen, and J. Vilo, Discovering Patterns and Subfamilies in Biosequences, Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, St. Louis, MO (June 12-15, 1996), pp. 34-43.
-
J. Rissanen, Modeling by Shortest Data Description, Automatica 14, 465-471 (1978).
-
C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal 27, 379-423, 623-656 (1948).
-
The actual number of sequences in
p that are encoded by Scheme 2 depends on the motif. For each motif used in the study presented here, more than 1/10 of the sequences are encoded based on the motif using Scheme 2.
-
A. Califano, SPLASH: Structural Pattern Localization Analysis by Sequential Histogramming, available at http://www.research.ibm.com/topics/popups/deep/math/html/splash_bioinformatics.pdf. See also http://www.research.ibm.com/splash/.
-
R. Hart, A. Royyuru, G. Stolovitzky, and A. Califano, Systematic and Automated Discovery of Patterns in PROSITE Families, Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, Tokyo, Japan (April 8-11, 2000).
-
This software is available at http://wol.ra.phy.cam.ac.uk/pub/mackay/README.html.
-
W. C. Barker, J. S. Garavelli, H. Huang, P. B. McGarvey, B. Orcutt, G. Y. Srinivasarao, C. Xiao, L. S. Yeh, R. S. Ledley, J. F. Janda, F. Pfeiffer, H. W. Mewes, A. Tsugita, and C. H. Wu, The Protein Information Resource (PIR), Nucleic Acids Research 28, No. 1, 41-44 (2000).
-
Note that the BNN classifier does not yield any unclassified sequences. By contrast, the three other classifiers we compare with (BLAST, SAM, and SAM-T99) do yield unclassified sequences, as our experimental results in the next section show.
-
The time spent matching a test sequence against the motifs is linearly proportional to the number of motifs used.
-
We used log-odds scores, as opposed to E-values, for this tool because the E-value for a training sequence was calculated with respect to the training data set while the E-value for a test sequence was calculated with respect to the test data set. These two kinds of E-values were not directly comparable.
-
The E-value of the HMM target model used in the study presented here was 20. We have experimented with other E-values and the results were worse.
-
In examining CPU time for SAM-T99, we note that the time spent classifying the kinase sequences shown in Table 5 is much higher than the time spent classifying the other sequences shown in Tables 4, 6, and 7. The reason is that each kinase sequence has many homologs in the nonredundant protein database maintained at NCBI. Thus, SAM-T99 retrieves more homologs for a kinase sequence than for the other sequences, and consequently it takes more time to build the HMM target model for the kinase sequence.
-
A. Bairoch, The PROSITE Dictionary of Sites and Patterns in Proteins, Its Current Status, Nucleic Acids Research 21, 3097-3103 (1993).